From 117ed92992a8b7ec45b399a2b5e2f9b66358a7d4 Mon Sep 17 00:00:00 2001
From: LysandreJik <lysandre.debut@reseau.eseo.fr>
Date: Tue, 16 Jul 2019 11:58:47 -0400
Subject: [PATCH 001/200] RestructuredText table for pretrained models.

---
 docs/source/pretrained_models.rst | 147 +++++++++++++++++++-----------
 1 file changed, 94 insertions(+), 53 deletions(-)

diff --git a/docs/source/pretrained_models.rst b/docs/source/pretrained_models.rst
index 2d72977951..e4ad7a6eaa 100644
--- a/docs/source/pretrained_models.rst
+++ b/docs/source/pretrained_models.rst
@@ -3,57 +3,98 @@ Pretrained models
 
 Here is the full list of the currently provided pretrained models together with a short presentation of each model.
 
-+===============+============================================================+===========================+ 
-| Architecture  | Shortcut name                                              | Details of the model      |
-+===============+============================================================+===========================+ 
-|               | ``bert-base-uncased``                                      | 12-layer, 768-hidden, 12-heads, 110M parameters
-|               |                                                            | Trained on lower-cased English text                 |
-|               +------------------------------------------------------------+---------------------------+ 
-|               | ``bert-large-uncased``                                     | 24-layer, 1024-hidden, 16-heads, 340M parameters
-|               |                                                            | Trained on lower-cased English text                  |
-|               +------------------------------------------------------------+---------------------------+ 
-|               | ``bert-base-cased``                                        | 12-layer, 768-hidden, 12-heads, 110M parameters
-|               |                                                            | Trained on cased English text                 |
-|               +------------------------------------------------------------+---------------------------+ 
-|               | ``bert-large-cased``                                       | 24-layer, 1024-hidden, 16-heads, 340M parameters                  |
-|               |                                                            | Trained on cased English text                  |
-|               +------------------------------------------------------------+---------------------------+ 
-|               | ``bert-base-multilingual-uncased``                         | (Original, not recommended) 12-layer, 768-hidden, 12-heads, 110M parameters
-|               |                                                            | Trained on lower-cased text in the top 102 languages with the largest Wikipedias
-|               |                                                            | (see `details <https://github.com/google-research/bert/blob/master/multilingual.md>`_)                 |
-|               +------------------------------------------------------------+---------------------------+ 
-|               | ``bert-base-multilingual-cased``                           | (New, **recommended**) 12-layer, 768-hidden, 12-heads, 110M parameters                  |
-|               |                                                            | Trained on cased text in the top 104 languages with the largest Wikipedias
-|               |                                                            | (see `details <https://github.com/google-research/bert/blob/master/multilingual.md>`_)                 |
-|               +------------------------------------------------------------+---------------------------+ 
-|    BERT       | ``bert-base-chinese``                                      | 12-layer, 768-hidden, 12-heads, 110M parameters                  |
-|               |                                                            | Trained on cased Chinese Simplified and Traditional text |
-|               +------------------------------------------------------------+---------------------------+ 
-|               | ``bert-base-german-cased``                                 | 12-layer, 768-hidden, 12-heads, 110M parameters                  |
-|               |                                                            | Trained on cased German text by Deepset.ai |
-|               |                                                            | (see `details on deepset.ai website <https://deepset.ai/german-bert>`_)                 |
-|               +------------------------------------------------------------+---------------------------+ 
-|               | ``bert-large-uncased-whole-word-masking``                  | 24-layer, 1024-hidden, 16-heads, 340M parameters                  |
-|               |                                                            | Trained on lower-cased English text using Whole-Word-Masking                  |
-|               |                                                            | (see `details <https://github.com/google-research/bert/#bert>`_)                 |
-|               +------------------------------------------------------------+---------------------------+ 
-|               | ``bert-large-cased-whole-word-masking``                    | 24-layer, 1024-hidden, 16-heads, 340M parameters                  |
-|               |                                                            | Trained on cased English text using Whole-Word-Masking                  |
-|               |                                                            | (see `details <https://github.com/google-research/bert/#bert>`_)                 |
-|               +------------------------------------------------------------+---------------------------+ 
-|               | ``bert-large-uncased-whole-word-masking-finetuned-squad``  | 24-layer, 1024-hidden, 16-heads, 340M parameters                  |
-|               |                                                            | The ``bert-large-uncased-whole-word-masking`` model fine-tuned on SQuAD                  |
-|               |                                                            | (see details of fine-tuning in the `example section`_)                 |
-|               +------------------------------------------------------------+---------------------------+ 
-|               | ``bert-large-cased-whole-word-masking-finetuned-squad``    | 24-layer, 1024-hidden, 16-heads, 340M parameters                  |
-|               |                                                            | The ``bert-large-cased-whole-word-masking`` model fine-tuned on SQuAD                  |
-|               |                                                            | (see `details of fine-tuning in the example section <https://huggingface.co/pytorch-transformers/examples.html>`_)                 |
-|               +------------------------------------------------------------+---------------------------+ 
-|               | ``bert-base-cased-finetuned-mrpc``                         | 12-layer, 768-hidden, 12-heads, 110M parameters                  |
-|               |                                                            | The ``bert-base-cased`` model fine-tuned on MRPC                  |
-|               |                                                            | (see `details of fine-tuning in the example section <https://huggingface.co/pytorch-transformers/examples.html>`_)                 |
-+---------------+------------------------------------------------------------+---------------------------+ 
-|    GPT        | Cells may span columns.                                                                |
-+---------------+----------------------------------------------------------------------------------------+ 
 
-.. <https://huggingface.co/pytorch-transformers/examples.html>`_
\ No newline at end of file
++-------------------+------------------------------------------------------------+---------------------------------------------------------------------------------------------------------------------------+
+| Architecture      | Shortcut name                                              | Details of the model                                                                                                      |
++===================+============================================================+===========================================================================================================================+
+| BERT              | ``bert-base-uncased``                                      | 12-layer, 768-hidden, 12-heads, 110M parameters                                                                           |
+|                   |                                                            | Trained on lower-cased English text                                                                                       |
+|                   +------------------------------------------------------------+---------------------------------------------------------------------------------------------------------------------------+
+|                   | ``bert-large-uncased``                                     | 24-layer, 1024-hidden, 16-heads, 340M parameters                                                                          |
+|                   |                                                            | Trained on lower-cased English text                                                                                       |
+|                   +------------------------------------------------------------+---------------------------------------------------------------------------------------------------------------------------+
+|                   | ``bert-base-cased``                                        | 12-layer, 768-hidden, 12-heads, 110M parameters                                                                           |
+|                   |                                                            | Trained on cased English text                                                                                             |
+|                   +------------------------------------------------------------+---------------------------------------------------------------------------------------------------------------------------+
+|                   | ``bert-large-cased``                                       | 24-layer, 1024-hidden, 16-heads, 340M parameters                                                                          |
+|                   |                                                            | Trained on cased English text                                                                                             |
+|                   +------------------------------------------------------------+---------------------------------------------------------------------------------------------------------------------------+
+|                   | ``bert-base-multilingual-uncased``                         | (Original, not recommended) 12-layer, 768-hidden, 12-heads, 110M parameters                                               |
+|                   |                                                            | Trained on lower-cased text in the top 102 languages with the largest Wikipedias                                          |
+|                   |                                                            | (see `details <https://github.com/google-research/bert/blob/master/multilingual.md>`__)                                   |
+|                   +------------------------------------------------------------+---------------------------------------------------------------------------------------------------------------------------+
+|                   | ``bert-base-multilingual-cased``                           | (New, **recommended**) 12-layer, 768-hidden, 12-heads, 110M parameters                                                    |
+|                   |                                                            | Trained on cased text in the top 104 languages with the largest Wikipedias                                                |
+|                   |                                                            | (see `details <https://github.com/google-research/bert/blob/master/multilingual.md>`__)                                   |
+|                   +------------------------------------------------------------+---------------------------------------------------------------------------------------------------------------------------+
+|                   | ``bert-base-chinese``                                      | 12-layer, 768-hidden, 12-heads, 110M parameters                                                                           |
+|                   |                                                            | Trained on cased Chinese Simplified and Traditional text                                                                  |
+|                   +------------------------------------------------------------+---------------------------------------------------------------------------------------------------------------------------+
+|                   | ``bert-base-german-cased``                                 | 12-layer, 768-hidden, 12-heads, 110M parameters                                                                           |
+|                   |                                                            | Trained on cased German text by Deepset.ai                                                                                |
+|                   |                                                            | (see `details on deepset.ai website <https://deepset.ai/german-bert>`__)                                                  |
+|                   +------------------------------------------------------------+---------------------------------------------------------------------------------------------------------------------------+
+|                   | ``bert-large-uncased-whole-word-masking``                  | 24-layer, 1024-hidden, 16-heads, 340M parameters                                                                          |
+|                   |                                                            | Trained on lower-cased English text using Whole-Word-Masking                                                              |
+|                   |                                                            | (see `details <https://github.com/google-research/bert/#bert>`__)                                                         |
+|                   +------------------------------------------------------------+---------------------------------------------------------------------------------------------------------------------------+
+|                   | ``bert-large-cased-whole-word-masking``                    | 24-layer, 1024-hidden, 16-heads, 340M parameters                                                                          |
+|                   |                                                            | Trained on cased English text using Whole-Word-Masking                                                                    |
+|                   |                                                            | (see `details <https://github.com/google-research/bert/#bert>`__)                                                         |
+|                   +------------------------------------------------------------+---------------------------------------------------------------------------------------------------------------------------+
+|                   | ``bert-large-uncased-whole-word-masking-finetuned-squad``  | 24-layer, 1024-hidden, 16-heads, 340M parameters                                                                          |
+|                   |                                                            | The ``bert-large-uncased-whole-word-masking`` model fine-tuned on SQuAD                                                   |
+|                   |                                                            | (see details of fine-tuning in the `example section`__)                                                                   |
+|                   +------------------------------------------------------------+---------------------------------------------------------------------------------------------------------------------------+
+|                   | ``bert-large-cased-whole-word-masking-finetuned-squad``    | 24-layer, 1024-hidden, 16-heads, 340M parameters                                                                          |
+|                   |                                                            | The ``bert-large-cased-whole-word-masking`` model fine-tuned on SQuAD                                                     |
+|                   |                                                            | (see `details of fine-tuning in the example section <https://huggingface.co/pytorch-transformers/examples.html>`__)       |
+|                   +------------------------------------------------------------+---------------------------------------------------------------------------------------------------------------------------+
+|                   | ``bert-base-cased-finetuned-mrpc``                         | 12-layer, 768-hidden, 12-heads, 110M parameters                                                                           |
+|                   |                                                            | The ``bert-base-cased`` model fine-tuned on MRPC                                                                          |
+|                   |                                                            | (see `details of fine-tuning in the example section <https://huggingface.co/pytorch-transformers/examples.html>`__)       |
++-------------------+------------------------------------------------------------+---------------------------------------------------------------------------------------------------------------------------+
+| GPT               | ``openai-gpt``                                             | 12-layer, 768-hidden, 12-heads, 110M parameters                                                                           |
+|                   |                                                            | OpenAI GPT English model                                                                                                  |
++-------------------+------------------------------------------------------------+---------------------------------------------------------------------------------------------------------------------------+
+| GPT-2             | ``gpt2``                                                   | 12-layer, 768-hidden, 12-heads, 117M parameters                                                                           |
+|                   |                                                            | OpenAI GPT-2 English model                                                                                                |
+|                   +------------------------------------------------------------+---------------------------------------------------------------------------------------------------------------------------+
+|                   | ``gpt2-medium``                                            | 24-layer, 1024-hidden, 16-heads, 345M parameters                                                                          |
+|                   |                                                            | OpenAI's Medium-sized GPT-2 English model                                                                                 |
++-------------------+------------------------------------------------------------+---------------------------------------------------------------------------------------------------------------------------+
+| Transformer-XL    | ``transfo-xl-wt103``                                       | 18-layer, 1024-hidden, 16-heads, 257M parameters                                                                          |
+|                   |                                                            | English model trained on wikitext-103                                                                                     |
++-------------------+------------------------------------------------------------+---------------------------------------------------------------------------------------------------------------------------+
+| XLNet             | ``xlnet-base-cased``                                       | 12-layer, 768-hidden, 12-heads, 110M parameters                                                                           |
+|                   |                                                            | XLNet English model                                                                                                       |
+|                   +------------------------------------------------------------+---------------------------------------------------------------------------------------------------------------------------+
+|                   | ``xlnet-large-cased``                                      | 24-layer, 1024-hidden, 16-heads, 340M parameters                                                                          |
+|                   |                                                            | XLNet Large English model                                                                                                 |
++-------------------+------------------------------------------------------------+---------------------------------------------------------------------------------------------------------------------------+
+| XLM               | ``xlm-mlm-en-2048``                                        | 12-layer, 1024-hidden, 8-heads                                                                                            |
+|                   |                                                            | XLM English model                                                                                                         |
+|                   +------------------------------------------------------------+---------------------------------------------------------------------------------------------------------------------------+
+|                   | ``xlm-mlm-ende-1024``                                      | 12-layer, 1024-hidden, 8-heads                                                                                            |
+|                   |                                                            | XLM English-German Multi-language model                                                                                   |
+|                   +------------------------------------------------------------+---------------------------------------------------------------------------------------------------------------------------+
+|                   | ``xlm-mlm-enfr-1024``                                      | 12-layer, 1024-hidden, 8-heads                                                                                            |
+|                   |                                                            | XLM English-French Multi-language model                                                                                   |
+|                   +------------------------------------------------------------+---------------------------------------------------------------------------------------------------------------------------+
+|                   | ``xlm-mlm-enro-1024``                                      | 12-layer, 1024-hidden, 8-heads                                                                                            |
+|                   |                                                            | XLM English-Romanian Multi-language model                                                                                 |
+|                   +------------------------------------------------------------+---------------------------------------------------------------------------------------------------------------------------+
+|                   | ``xlm-mlm-xnli15-1024``                                    | 12-layer, 1024-hidden, 8-heads                                                                                            |
+|                   |                                                            | XLM Model pre-trained with MLM on the `15 XNLI languages<https://github.com/facebookresearch/XNLI>`__.                    |
+|                   +------------------------------------------------------------+---------------------------------------------------------------------------------------------------------------------------+
+|                   | ``xlm-mlm-tlm-xnli15-1024``                                | 12-layer, 1024-hidden, 8-heads                                                                                            |
+|                   |                                                            | XLM Model pre-trained with MLM + TLM on the `15 XNLI languages<https://github.com/facebookresearch/XNLI>`__.              |
+|                   +------------------------------------------------------------+---------------------------------------------------------------------------------------------------------------------------+
+|                   | ``xlm-clm-enfr-1024``                                      | 12-layer, 1024-hidden, 8-heads                                                                                            |
+|                   |                                                            | XLM English model trained with CLM (Causal Language Modeling)                                                             |
+|                   +------------------------------------------------------------+---------------------------------------------------------------------------------------------------------------------------+
+|                   | ``xlm-clm-ende-1024``                                      | 12-layer, 1024-hidden, 8-heads                                                                                            |
+|                   |                                                            | XLM English-German Multi-language model trained with CLM (Causal Language Modeling)                                       |
++-------------------+------------------------------------------------------------+---------------------------------------------------------------------------------------------------------------------------+
+
+.. <https://huggingface.co/pytorch-transformers/examples.html>`__
\ No newline at end of file

From 9d381e7be9fb97e09777fa66aa3e336ca132af70 Mon Sep 17 00:00:00 2001
From: LysandreJik <lysandre.debut@reseau.eseo.fr>
Date: Wed, 17 Jul 2019 09:25:38 -0400
Subject: [PATCH 002/200] Fixed incorrect links in the PretrainedModel

---
 docs/source/pretrained_models.rst | 8 ++++----
 1 file changed, 4 insertions(+), 4 deletions(-)

diff --git a/docs/source/pretrained_models.rst b/docs/source/pretrained_models.rst
index e4ad7a6eaa..b23a96ff7c 100644
--- a/docs/source/pretrained_models.rst
+++ b/docs/source/pretrained_models.rst
@@ -43,8 +43,8 @@ Here is the full list of the currently provided pretrained models together with
 |                   |                                                            | (see `details <https://github.com/google-research/bert/#bert>`__)                                                         |
 |                   +------------------------------------------------------------+---------------------------------------------------------------------------------------------------------------------------+
 |                   | ``bert-large-uncased-whole-word-masking-finetuned-squad``  | 24-layer, 1024-hidden, 16-heads, 340M parameters                                                                          |
-|                   |                                                            | The ``bert-large-uncased-whole-word-masking`` model fine-tuned on SQuAD                                                   |
-|                   |                                                            | (see details of fine-tuning in the `example section`__)                                                                   |
+|                   |                                                            | The ``bert-large-uncased-whole-word-masking`` model fine-tuned on SQuAD (see details of fine-tuning in the                |
+|                   |                                                            | `example section <https://github.com/huggingface/pytorch-transformers/tree/master/examples>`__)                           |
 |                   +------------------------------------------------------------+---------------------------------------------------------------------------------------------------------------------------+
 |                   | ``bert-large-cased-whole-word-masking-finetuned-squad``    | 24-layer, 1024-hidden, 16-heads, 340M parameters                                                                          |
 |                   |                                                            | The ``bert-large-cased-whole-word-masking`` model fine-tuned on SQuAD                                                     |
@@ -85,10 +85,10 @@ Here is the full list of the currently provided pretrained models together with
 |                   |                                                            | XLM English-Romanian Multi-language model                                                                                 |
 |                   +------------------------------------------------------------+---------------------------------------------------------------------------------------------------------------------------+
 |                   | ``xlm-mlm-xnli15-1024``                                    | 12-layer, 1024-hidden, 8-heads                                                                                            |
-|                   |                                                            | XLM Model pre-trained with MLM on the `15 XNLI languages<https://github.com/facebookresearch/XNLI>`__.                    |
+|                   |                                                            | XLM Model pre-trained with MLM on the `15 XNLI languages <https://github.com/facebookresearch/XNLI>`__.                   |
 |                   +------------------------------------------------------------+---------------------------------------------------------------------------------------------------------------------------+
 |                   | ``xlm-mlm-tlm-xnli15-1024``                                | 12-layer, 1024-hidden, 8-heads                                                                                            |
-|                   |                                                            | XLM Model pre-trained with MLM + TLM on the `15 XNLI languages<https://github.com/facebookresearch/XNLI>`__.              |
+|                   |                                                            | XLM Model pre-trained with MLM + TLM on the `15 XNLI languages <https://github.com/facebookresearch/XNLI>`__.             |
 |                   +------------------------------------------------------------+---------------------------------------------------------------------------------------------------------------------------+
 |                   | ``xlm-clm-enfr-1024``                                      | 12-layer, 1024-hidden, 8-heads                                                                                            |
 |                   |                                                            | XLM English model trained with CLM (Causal Language Modeling)                                                             |

From c4e9615691a19128f446563718355aedf03cf01b Mon Sep 17 00:00:00 2001
From: Wei-Sheng Chin <wechi@microsoft.com>
Date: Wed, 17 Jul 2019 09:08:40 -0700
Subject: [PATCH 003/200] Fix a path so that test can run on Windows

---
 pytorch_transformers/tests/modeling_common_test.py | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/pytorch_transformers/tests/modeling_common_test.py b/pytorch_transformers/tests/modeling_common_test.py
index 5ea98d68e2..e974ae865d 100644
--- a/pytorch_transformers/tests/modeling_common_test.py
+++ b/pytorch_transformers/tests/modeling_common_test.py
@@ -21,6 +21,7 @@ import os
 import shutil
 import json
 import random
+import uuid
 
 import unittest
 import logging
@@ -527,7 +528,7 @@ class ConfigTester(object):
 
     def create_and_test_config_to_json_file(self):
         config_first = self.config_class(**self.inputs_dict)
-        json_file_path = "/tmp/config.json"
+        json_file_path = os.path.join(os.getcwd(), "config_" + str(uuid.uuid4()) + ".json")
         config_first.to_json_file(json_file_path)
         config_second = self.config_class.from_json_file(json_file_path)
         os.remove(json_file_path)

From 76be189b08941adc13f07bdeb57e511b68b67290 Mon Sep 17 00:00:00 2001
From: Peiqin Lin <lpq29743@gmail.com>
Date: Sun, 21 Jul 2019 20:39:42 +0800
Subject: [PATCH 004/200] typos

---
 examples/run_glue.py  | 4 ++--
 examples/run_squad.py | 4 ++--
 2 files changed, 4 insertions(+), 4 deletions(-)

diff --git a/examples/run_glue.py b/examples/run_glue.py
index f017db2f6f..25a487156e 100644
--- a/examples/run_glue.py
+++ b/examples/run_glue.py
@@ -116,8 +116,8 @@ def train(args, train_dataset, model, tokenizer):
                       'attention_mask': batch[1],
                       'token_type_ids': batch[2] if args.model_type in ['bert', 'xlnet'] else None,  # XLM don't use segment_ids
                       'labels':         batch[3]}
-            ouputs = model(**inputs)
-            loss = ouputs[0]  # model outputs are always tuple in pytorch-transformers (see doc)
+            outputs = model(**inputs)
+            loss = outputs[0]  # model outputs are always tuple in pytorch-transformers (see doc)
 
             if args.n_gpu > 1:
                 loss = loss.mean() # mean() to average on multi-gpu parallel training
diff --git a/examples/run_squad.py b/examples/run_squad.py
index d72d67b87d..53ea0bfd64 100644
--- a/examples/run_squad.py
+++ b/examples/run_squad.py
@@ -129,8 +129,8 @@ def train(args, train_dataset, model, tokenizer):
             if args.model_type in ['xlnet', 'xlm']:
                 inputs.update({'cls_index': batch[5],
                                'p_mask':    batch[6]})
-            ouputs = model(**inputs)
-            loss = ouputs[0]  # model outputs are always tuple in pytorch-transformers (see doc)
+            outputs = model(**inputs)
+            loss = outputs[0]  # model outputs are always tuple in pytorch-transformers (see doc)
 
             if args.n_gpu > 1:
                 loss = loss.mean() # mean() to average on multi-gpu parallel (not distributed) training

From 2f869dc6651f9cf9253f4c5a43279027a0eccfc5 Mon Sep 17 00:00:00 2001
From: rish-16 <mail.rishabh.anand@gmail.com>
Date: Sat, 20 Jul 2019 16:49:42 +0800
Subject: [PATCH 005/200] Fixed typo

---
 README.md | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/README.md b/README.md
index 484c3c47df..81bc1ab6bd 100644
--- a/README.md
+++ b/README.md
@@ -58,7 +58,7 @@ python -m pytest -sv ./examples/
 
 ## Quick tour
 
-Let's do a very quick overview of PyTorch-Transformers. Detailled examples for each model architecture (Bert, GPT, GPT-2, Transformer-XL, XLNet and XLM) can be found in the [full documentation](https://huggingface.co/pytorch-transformers/).
+Let's do a very quick overview of PyTorch-Transformers. Detailed examples for each model architecture (Bert, GPT, GPT-2, Transformer-XL, XLNet and XLM) can be found in the [full documentation](https://huggingface.co/pytorch-transformers/).
 
 ```python
 import torch

From 897d0841bed5e0637aca7dec7744bedc06b54fae Mon Sep 17 00:00:00 2001
From: Yiqing-Zhou <40547184+Yiqing-Zhou@users.noreply.github.com>
Date: Mon, 22 Jul 2019 20:49:09 +0800
Subject: [PATCH 006/200] read().splitlines() -> readlines()

splitlines() does not work as what we expect here for bert-base-chinese because there is a '\u2028' (unicode line seperator) token in vocab file. Value of '\u2028'.splitlines() is ['', ''].
Perhaps we should use readlines() instead.
---
 pytorch_transformers/tokenization_bert.py | 3 +--
 1 file changed, 1 insertion(+), 2 deletions(-)

diff --git a/pytorch_transformers/tokenization_bert.py b/pytorch_transformers/tokenization_bert.py
index f1e900caaf..1ca758eda5 100644
--- a/pytorch_transformers/tokenization_bert.py
+++ b/pytorch_transformers/tokenization_bert.py
@@ -67,10 +67,9 @@ def load_vocab(vocab_file):
     """Loads a vocabulary file into a dictionary."""
     vocab = collections.OrderedDict()
     with open(vocab_file, "r", encoding="utf-8") as reader:
-        tokens = reader.read().splitlines()
+        tokens = reader.readlines()
     for index, token in enumerate(tokens):
         vocab[token] = index
-        index += 1
     return vocab
 
 

From bef0c629cae56734a5acb38720aea2bdd9d738bd Mon Sep 17 00:00:00 2001
From: Yiqing-Zhou <40547184+Yiqing-Zhou@users.noreply.github.com>
Date: Mon, 22 Jul 2019 22:30:49 +0800
Subject: [PATCH 007/200] fix

Remove '\n' before adding token into vocab
---
 pytorch_transformers/tokenization_bert.py | 1 +
 1 file changed, 1 insertion(+)

diff --git a/pytorch_transformers/tokenization_bert.py b/pytorch_transformers/tokenization_bert.py
index 1ca758eda5..acf89b6984 100644
--- a/pytorch_transformers/tokenization_bert.py
+++ b/pytorch_transformers/tokenization_bert.py
@@ -69,6 +69,7 @@ def load_vocab(vocab_file):
     with open(vocab_file, "r", encoding="utf-8") as reader:
         tokens = reader.readlines()
     for index, token in enumerate(tokens):
+        token = token[:-1]
         vocab[token] = index
     return vocab
 

From b8009cb0dac9698c1999af7121ea196065510905 Mon Sep 17 00:00:00 2001
From: Anish Moorthy <anish.moorthy@worthix.com>
Date: Mon, 22 Jul 2019 17:56:27 -0400
Subject: [PATCH 008/200] Make PreTrainedModel.from_pretrained pass unused
 arguments to model

---
 pytorch_transformers/modeling_utils.py | 35 +++++++++++++++++++-------
 1 file changed, 26 insertions(+), 9 deletions(-)

diff --git a/pytorch_transformers/modeling_utils.py b/pytorch_transformers/modeling_utils.py
index 324cdc17c9..a4e1a44c9d 100644
--- a/pytorch_transformers/modeling_utils.py
+++ b/pytorch_transformers/modeling_utils.py
@@ -78,7 +78,7 @@ class PretrainedConfig(object):
         self.to_json_file(output_config_file)
 
     @classmethod
-    def from_pretrained(cls, pretrained_model_name_or_path, *input, **kwargs):
+    def from_pretrained(cls, pretrained_model_name_or_path, *args, **kwargs):
         r""" Instantiate a PretrainedConfig from a pre-trained model configuration.
 
         Params:
@@ -105,6 +105,7 @@ class PretrainedConfig(object):
 
         """
         cache_dir = kwargs.pop('cache_dir', None)
+        return_unused_args = kwargs.pop('return_unused_args', False)
 
         if pretrained_model_name_or_path in cls.pretrained_config_archive_map:
             config_file = cls.pretrained_config_archive_map[pretrained_model_name_or_path]
@@ -148,7 +149,10 @@ class PretrainedConfig(object):
             kwargs.pop(key, None)
 
         logger.info("Model config %s", config)
-        return config
+        if return_unused_args:
+            return config, kwargs
+        else:
+            return config
 
     @classmethod
     def from_dict(cls, json_object):
@@ -305,7 +309,7 @@ class PreTrainedModel(nn.Module):
         torch.save(model_to_save.state_dict(), output_model_file)
 
     @classmethod
-    def from_pretrained(cls, pretrained_model_name_or_path, *inputs, **kwargs):
+    def from_pretrained(cls, pretrained_model_name_or_path, *model_args, **kwargs):
         r"""Instantiate a pretrained pytorch model from a pre-trained model configuration.
 
             The model is set in evaluation mode by default using `model.eval()` (Dropout modules are desactivated)
@@ -336,9 +340,17 @@ class PreTrainedModel(nn.Module):
                 configuration should be cached if the standard cache should not be used.
             **output_loading_info**: (`optional`) boolean:
                 Set to ``True`` to also return a dictionnary containing missing keys, unexpected keys and error messages.
+            **model_args**: (`optional`) Sequence:
+                All positional arguments will be passed to the underlying model's __init__ function
             **kwargs**: (`optional`) dict:
-                Dictionnary of key, values to update the configuration object after loading.
-                Can be used to override selected configuration parameters. E.g. ``output_attention=True``
+                Dictionary of key, values to update the configuration object after loading.
+                Can be used to override selected configuration parameters. E.g. ``output_attention=True``.
+
+                If config is None, then **kwargs will be passed to the model.
+                If said key is *not* present, then kwargs will be used to
+                override any keys shared with the default configuration for the
+                given pretrained_model_name_or_path, and only the unshared
+                key/value pairs will be passed to the model.
 
         Examples::
 
@@ -359,7 +371,12 @@ class PreTrainedModel(nn.Module):
 
         # Load config
         if config is None:
-            config = cls.config_class.from_pretrained(pretrained_model_name_or_path, *inputs, **kwargs)
+            config, model_kwargs = cls.config_class.from_pretrained(
+                pretrained_model_name_or_path, *model_args,
+                return_unused_args=True, **kwargs
+            )
+        else:
+            model_kwargs = kwargs
 
         # Load model
         if pretrained_model_name_or_path in cls.pretrained_model_archive_map:
@@ -400,7 +417,7 @@ class PreTrainedModel(nn.Module):
                 archive_file, resolved_archive_file))
 
         # Instantiate model.
-        model = cls(config)
+        model = cls(config, *model_args, **model_kwargs)
 
         if state_dict is None and not from_tf:
             state_dict = torch.load(resolved_archive_file, map_location='cpu')
@@ -530,7 +547,7 @@ class PoolerEndLogits(nn.Module):
             **start_states**: ``torch.LongTensor`` of shape identical to hidden_states
                 hidden states of the first tokens for the labeled span.
             **start_positions**: ``torch.LongTensor`` of shape ``(batch_size,)``
-                position of the first token for the labeled span: 
+                position of the first token for the labeled span:
             **p_mask**: (`optional`) ``torch.FloatTensor`` of shape ``(batch_size, seq_len)``
                 Mask of invalid position such as query and special symbols (PAD, SEP, CLS)
                 1.0 means token should be masked.
@@ -717,7 +734,7 @@ class SequenceSummary(nn.Module):
                 - 'attn' => Not implemented now, use multi-head attention
             summary_use_proj: Add a projection after the vector extraction
             summary_proj_to_labels: If True, the projection outputs to config.num_labels classes (otherwise to hidden_size). Default: False.
-            summary_activation: 'tanh' => add a tanh activation to the output, Other => no activation. Default 
+            summary_activation: 'tanh' => add a tanh activation to the output, Other => no activation. Default
             summary_first_dropout: Add a dropout before the projection and activation
             summary_last_dropout: Add a dropout after the projection and activation
     """

From 490ebbdcf7f6e1171058fded5db6cb231d18636d Mon Sep 17 00:00:00 2001
From: Anish Moorthy <anish.moorthy@worthix.com>
Date: Mon, 22 Jul 2019 15:51:51 -0400
Subject: [PATCH 009/200] Fix PretrainedModel.from_pretrained not passing
 cache_dir forward

---
 pytorch_transformers/modeling_utils.py | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/pytorch_transformers/modeling_utils.py b/pytorch_transformers/modeling_utils.py
index a4e1a44c9d..0a4bfa7ba0 100644
--- a/pytorch_transformers/modeling_utils.py
+++ b/pytorch_transformers/modeling_utils.py
@@ -373,7 +373,8 @@ class PreTrainedModel(nn.Module):
         if config is None:
             config, model_kwargs = cls.config_class.from_pretrained(
                 pretrained_model_name_or_path, *model_args,
-                return_unused_args=True, **kwargs
+                cache_dir=cache_dir, return_unused_args=True,
+                **kwargs
             )
         else:
             model_kwargs = kwargs

From 0227b4a940ceacde43b703d74d03bed2603188e7 Mon Sep 17 00:00:00 2001
From: thomwolf <thomwolf@gmail.com>
Date: Tue, 23 Jul 2019 14:06:43 +0200
Subject: [PATCH 010/200] fix #827

---
 .../convert_xlm_checkpoint_to_pytorch.py      |  2 +-
 pytorch_transformers/modeling_bert.py         |  8 +++---
 pytorch_transformers/modeling_gpt2.py         |  8 +++---
 pytorch_transformers/modeling_openai.py       |  8 +++---
 pytorch_transformers/modeling_transfo_xl.py   | 24 ++++++++---------
 .../modeling_transfo_xl_utilities.py          |  4 +--
 pytorch_transformers/modeling_xlm.py          |  4 +--
 pytorch_transformers/modeling_xlnet.py        | 26 +++++++++----------
 8 files changed, 42 insertions(+), 42 deletions(-)

diff --git a/pytorch_transformers/convert_xlm_checkpoint_to_pytorch.py b/pytorch_transformers/convert_xlm_checkpoint_to_pytorch.py
index 8825f3c0dc..bf4b99b6ea 100755
--- a/pytorch_transformers/convert_xlm_checkpoint_to_pytorch.py
+++ b/pytorch_transformers/convert_xlm_checkpoint_to_pytorch.py
@@ -36,7 +36,7 @@ def convert_xlm_checkpoint_to_pytorch(xlm_checkpoint_path, pytorch_dump_folder_p
     model = chkpt['model']
 
     config = chkpt['params']
-    config = dict((n, v) for n, v in config.items() if not isinstance(v, (torch.Tensor, numpy.ndarray)))
+    config = dict((n, v) for n, v in config.items() if not isinstance(v, (torch.FloatTensor, numpy.ndarray)))
 
     vocab = chkpt['dico_word2id']
     vocab = dict((s + '</w>' if s.find('@@') == -1 and i > 13 else s.replace('@@', ''), i) for s, i in vocab.items())
diff --git a/pytorch_transformers/modeling_bert.py b/pytorch_transformers/modeling_bert.py
index 93bde1db79..b59445513a 100644
--- a/pytorch_transformers/modeling_bert.py
+++ b/pytorch_transformers/modeling_bert.py
@@ -609,11 +609,11 @@ BERT_INPUTS_DOCSTRING = r"""
             Indices are selected in ``[0, 1]``: ``0`` corresponds to a `sentence A` token, ``1``
             corresponds to a `sentence B` token
             (see `BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding`_ for more details).
-        **attention_mask**: (`optional`) ``torch.Tensor`` of shape ``(batch_size, sequence_length)``:
+        **attention_mask**: (`optional`) ``torch.FloatTensor`` of shape ``(batch_size, sequence_length)``:
             Mask to avoid performing attention on padding token indices.
             Mask values selected in ``[0, 1]``:
             ``1`` for tokens that are NOT MASKED, ``0`` for MASKED tokens.
-        **head_mask**: (`optional`) ``torch.Tensor`` of shape ``(num_heads,)`` or ``(num_layers, num_heads)``:
+        **head_mask**: (`optional`) ``torch.FloatTensor`` of shape ``(num_heads,)`` or ``(num_layers, num_heads)``:
             Mask to nullify selected heads of the self-attention modules.
             Mask values selected in ``[0, 1]``:
             ``1`` indicates the head is **not masked**, ``0`` indicates the head is **masked**.
@@ -1027,12 +1027,12 @@ class BertForMultipleChoice(BertPreTrainedModel):
             Indices are selected in ``[0, 1]``: ``0`` corresponds to a `sentence A` token, ``1``
             corresponds to a `sentence B` token
             (see `BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding`_ for more details).
-        **attention_mask**: (`optional`) ``torch.Tensor`` of shape ``(batch_size, num_choices, sequence_length)``:
+        **attention_mask**: (`optional`) ``torch.FloatTensor`` of shape ``(batch_size, num_choices, sequence_length)``:
             Mask to avoid performing attention on padding token indices.
             The second dimension of the input (`num_choices`) indicates the number of choices to score.
             Mask values selected in ``[0, 1]``:
             ``1`` for tokens that are NOT MASKED, ``0`` for MASKED tokens.
-        **head_mask**: (`optional`) ``torch.Tensor`` of shape ``(num_heads,)`` or ``(num_layers, num_heads)``:
+        **head_mask**: (`optional`) ``torch.FloatTensor`` of shape ``(num_heads,)`` or ``(num_layers, num_heads)``:
             Mask to nullify selected heads of the self-attention modules.
             Mask values selected in ``[0, 1]``:
             ``1`` indicates the head is **not masked**, ``0`` indicates the head is **masked**.
diff --git a/pytorch_transformers/modeling_gpt2.py b/pytorch_transformers/modeling_gpt2.py
index b558e7ff88..b8a459db7d 100644
--- a/pytorch_transformers/modeling_gpt2.py
+++ b/pytorch_transformers/modeling_gpt2.py
@@ -402,11 +402,11 @@ GPT2_INPUTS_DOCSTRING = r"""    Inputs:
             list of ``torch.FloatTensor`` (one for each layer):
             that contains pre-computed hidden-states (key and values in the attention blocks) as computed by the model
             (see `past` output below). Can be used to speed up sequential decoding.
-        **attention_mask**: (`optional`) ``torch.Tensor`` of shape ``(batch_size, sequence_length)``:
+        **attention_mask**: (`optional`) ``torch.FloatTensor`` of shape ``(batch_size, sequence_length)``:
             Mask to avoid performing attention on padding token indices.
             Mask values selected in ``[0, 1]``:
             ``1`` for tokens that are NOT MASKED, ``0`` for MASKED tokens.
-        **head_mask**: (`optional`) ``torch.Tensor`` of shape ``(num_heads,)`` or ``(num_layers, num_heads)``:
+        **head_mask**: (`optional`) ``torch.FloatTensor`` of shape ``(num_heads,)`` or ``(num_layers, num_heads)``:
             Mask to nullify selected heads of the self-attention modules.
             Mask values selected in ``[0, 1]``:
             ``1`` indicates the head is **not masked**, ``0`` indicates the head is **masked**.
@@ -638,11 +638,11 @@ class GPT2DoubleHeadsModel(GPT2PreTrainedModel):
             list of ``torch.FloatTensor`` (one for each layer):
             that contains pre-computed hidden-states (key and values in the attention blocks) as computed by the model
             (see `past` output below). Can be used to speed up sequential decoding.
-        **attention_mask**: (`optional`) ``torch.Tensor`` of shape ``(batch_size, num_choices, sequence_length)``:
+        **attention_mask**: (`optional`) ``torch.FloatTensor`` of shape ``(batch_size, num_choices, sequence_length)``:
             Mask to avoid performing attention on padding token indices.
             Mask values selected in ``[0, 1]``:
             ``1`` for tokens that are NOT MASKED, ``0`` for MASKED tokens.
-        **head_mask**: (`optional`) ``torch.Tensor`` of shape ``(num_heads,)`` or ``(num_layers, num_heads)``:
+        **head_mask**: (`optional`) ``torch.FloatTensor`` of shape ``(num_heads,)`` or ``(num_layers, num_heads)``:
             Mask to nullify selected heads of the self-attention modules.
             Mask values selected in ``[0, 1]``:
             ``1`` indicates the head is **not masked**, ``0`` indicates the head is **masked**.
diff --git a/pytorch_transformers/modeling_openai.py b/pytorch_transformers/modeling_openai.py
index e5ede9a5ce..4ea19a965d 100644
--- a/pytorch_transformers/modeling_openai.py
+++ b/pytorch_transformers/modeling_openai.py
@@ -412,11 +412,11 @@ OPENAI_GPT_INPUTS_DOCSTRING = r"""    Inputs:
             A parallel sequence of tokens (can be used to indicate various portions of the inputs).
             The embeddings from these tokens will be summed with the respective token embeddings.
             Indices are selected in the vocabulary (unlike BERT which has a specific vocabulary for segment indices).
-        **attention_mask**: (`optional`) ``torch.Tensor`` of shape ``(batch_size, sequence_length)``:
+        **attention_mask**: (`optional`) ``torch.FloatTensor`` of shape ``(batch_size, sequence_length)``:
             Mask to avoid performing attention on padding token indices.
             Mask values selected in ``[0, 1]``:
             ``1`` for tokens that are NOT MASKED, ``0`` for MASKED tokens.
-        **head_mask**: (`optional`) ``torch.Tensor`` of shape ``(num_heads,)`` or ``(num_layers, num_heads)``:
+        **head_mask**: (`optional`) ``torch.FloatTensor`` of shape ``(num_heads,)`` or ``(num_layers, num_heads)``:
             Mask to nullify selected heads of the self-attention modules.
             Mask values selected in ``[0, 1]``:
             ``1`` indicates the head is **not masked**, ``0`` indicates the head is **masked**.
@@ -624,11 +624,11 @@ class OpenAIGPTDoubleHeadsModel(OpenAIGPTPreTrainedModel):
             A parallel sequence of tokens (can be used to indicate various portions of the inputs).
             The embeddings from these tokens will be summed with the respective token embeddings.
             Indices are selected in the vocabulary (unlike BERT which has a specific vocabulary for segment indices).
-        **attention_mask**: (`optional`) ``torch.Tensor`` of shape ``(batch_size, num_choices, sequence_length)``:
+        **attention_mask**: (`optional`) ``torch.FloatTensor`` of shape ``(batch_size, num_choices, sequence_length)``:
             Mask to avoid performing attention on padding token indices.
             Mask values selected in ``[0, 1]``:
             ``1`` for tokens that are NOT MASKED, ``0`` for MASKED tokens.
-        **head_mask**: (`optional`) ``torch.Tensor`` of shape ``(num_heads,)`` or ``(num_layers, num_heads)``:
+        **head_mask**: (`optional`) ``torch.FloatTensor`` of shape ``(num_heads,)`` or ``(num_layers, num_heads)``:
             Mask to nullify selected heads of the self-attention modules.
             Mask values selected in ``[0, 1]``:
             ``1`` indicates the head is **not masked**, ``0`` indicates the head is **masked**.
diff --git a/pytorch_transformers/modeling_transfo_xl.py b/pytorch_transformers/modeling_transfo_xl.py
index 71fd447a75..3280c4558d 100644
--- a/pytorch_transformers/modeling_transfo_xl.py
+++ b/pytorch_transformers/modeling_transfo_xl.py
@@ -394,8 +394,8 @@ class MultiHeadAttn(nn.Module):
         self.pre_lnorm = pre_lnorm
 
         if r_r_bias is None or r_w_bias is None: # Biases are not shared
-            self.r_r_bias = nn.Parameter(torch.Tensor(self.n_head, self.d_head))
-            self.r_w_bias = nn.Parameter(torch.Tensor(self.n_head, self.d_head))
+            self.r_r_bias = nn.Parameter(torch.FloatTensor(self.n_head, self.d_head))
+            self.r_w_bias = nn.Parameter(torch.FloatTensor(self.n_head, self.d_head))
         else:
             self.r_r_bias = r_r_bias
             self.r_w_bias = r_w_bias
@@ -483,8 +483,8 @@ class RelMultiHeadAttn(nn.Module):
         self.pre_lnorm = pre_lnorm
 
         if r_r_bias is None or r_w_bias is None: # Biases are not shared
-            self.r_r_bias = nn.Parameter(torch.Tensor(self.n_head, self.d_head))
-            self.r_w_bias = nn.Parameter(torch.Tensor(self.n_head, self.d_head))
+            self.r_r_bias = nn.Parameter(torch.FloatTensor(self.n_head, self.d_head))
+            self.r_w_bias = nn.Parameter(torch.FloatTensor(self.n_head, self.d_head))
         else:
             self.r_r_bias = r_r_bias
             self.r_w_bias = r_w_bias
@@ -803,13 +803,13 @@ class AdaptiveEmbedding(nn.Module):
                 nn.Embedding(n_token, d_embed, sparse=sample_softmax>0)
             )
             if d_proj != d_embed:
-                self.emb_projs.append(nn.Parameter(torch.Tensor(d_proj, d_embed)))
+                self.emb_projs.append(nn.Parameter(torch.FloatTensor(d_proj, d_embed)))
         else:
             for i in range(len(self.cutoffs)):
                 l_idx, r_idx = self.cutoff_ends[i], self.cutoff_ends[i+1]
                 d_emb_i = d_embed // (div_val ** i)
                 self.emb_layers.append(nn.Embedding(r_idx-l_idx, d_emb_i))
-                self.emb_projs.append(nn.Parameter(torch.Tensor(d_proj, d_emb_i)))
+                self.emb_projs.append(nn.Parameter(torch.FloatTensor(d_proj, d_emb_i)))
 
     def forward(self, inp):
         if self.div_val == 1:
@@ -941,7 +941,7 @@ TRANSFO_XL_INPUTS_DOCSTRING = r"""
             list of ``torch.FloatTensor`` (one for each layer):
             that contains pre-computed hidden-states (key and values in the attention blocks) as computed by the model
             (see `mems` output below). Can be used to speed up sequential decoding and attend to longer context.
-        **head_mask**: (`optional`) ``torch.Tensor`` of shape ``(num_heads,)`` or ``(num_layers, num_heads)``:
+        **head_mask**: (`optional`) ``torch.FloatTensor`` of shape ``(num_heads,)`` or ``(num_layers, num_heads)``:
             Mask to nullify selected heads of the self-attention modules.
             Mask values selected in ``[0, 1]``:
             ``1`` indicates the head is **not masked**, ``0`` indicates the head is **masked**.
@@ -1003,8 +1003,8 @@ class TransfoXLModel(TransfoXLPreTrainedModel):
         self.attn_type = config.attn_type
 
         if not config.untie_r:
-            self.r_w_bias = nn.Parameter(torch.Tensor(self.n_head, self.d_head))
-            self.r_r_bias = nn.Parameter(torch.Tensor(self.n_head, self.d_head))
+            self.r_w_bias = nn.Parameter(torch.FloatTensor(self.n_head, self.d_head))
+            self.r_r_bias = nn.Parameter(torch.FloatTensor(self.n_head, self.d_head))
 
         self.layers = nn.ModuleList()
         if config.attn_type == 0: # the default attention
@@ -1046,14 +1046,14 @@ class TransfoXLModel(TransfoXLPreTrainedModel):
         if self.attn_type == 0: # default attention
             self.pos_emb = PositionalEmbedding(self.d_model)
         elif self.attn_type == 1: # learnable
-            self.r_emb = nn.Parameter(torch.Tensor(
+            self.r_emb = nn.Parameter(torch.FloatTensor(
                     self.n_layer, self.max_klen, self.n_head, self.d_head))
-            self.r_bias = nn.Parameter(torch.Tensor(
+            self.r_bias = nn.Parameter(torch.FloatTensor(
                     self.n_layer, self.max_klen, self.n_head))
         elif self.attn_type == 2: # absolute standard
             self.pos_emb = PositionalEmbedding(self.d_model)
         elif self.attn_type == 3: # absolute deeper SA
-            self.r_emb = nn.Parameter(torch.Tensor(
+            self.r_emb = nn.Parameter(torch.FloatTensor(
                     self.n_layer, self.max_klen, self.n_head, self.d_head))
 
         self.apply(self.init_weights)
diff --git a/pytorch_transformers/modeling_transfo_xl_utilities.py b/pytorch_transformers/modeling_transfo_xl_utilities.py
index 6af13d7602..0773d0d5fc 100644
--- a/pytorch_transformers/modeling_transfo_xl_utilities.py
+++ b/pytorch_transformers/modeling_transfo_xl_utilities.py
@@ -56,7 +56,7 @@ class ProjectedAdaptiveLogSoftmax(nn.Module):
             for i in range(len(self.cutoffs)):
                 if d_proj != d_embed:
                     self.out_projs.append(
-                        nn.Parameter(torch.Tensor(d_proj, d_embed))
+                        nn.Parameter(torch.FloatTensor(d_proj, d_embed))
                     )
                 else:
                     self.out_projs.append(None)
@@ -68,7 +68,7 @@ class ProjectedAdaptiveLogSoftmax(nn.Module):
                 d_emb_i = d_embed // (div_val ** i)
 
                 self.out_projs.append(
-                    nn.Parameter(torch.Tensor(d_proj, d_emb_i))
+                    nn.Parameter(torch.FloatTensor(d_proj, d_emb_i))
                 )
 
                 self.out_layers.append(nn.Linear(d_emb_i, r_idx-l_idx))
diff --git a/pytorch_transformers/modeling_xlm.py b/pytorch_transformers/modeling_xlm.py
index c2a7996471..3bb864501a 100644
--- a/pytorch_transformers/modeling_xlm.py
+++ b/pytorch_transformers/modeling_xlm.py
@@ -436,7 +436,7 @@ XLM_INPUTS_DOCSTRING = r"""
             A parallel sequence of tokens to be used to indicate the language of each token in the input.
             Indices are selected in the pre-trained language vocabulary,
             i.e. in the range ``[0, config.n_langs - 1[``.
-        **attention_mask**: (`optional`) ``torch.Tensor`` of shape ``(batch_size, sequence_length)``:
+        **attention_mask**: (`optional`) ``torch.FloatTensor`` of shape ``(batch_size, sequence_length)``:
             Mask to avoid performing attention on padding token indices.
             Mask values selected in ``[0, 1]``:
             ``1`` for tokens that are NOT MASKED, ``0`` for MASKED tokens.
@@ -449,7 +449,7 @@ XLM_INPUTS_DOCSTRING = r"""
             hidden-states (key and values in the attention blocks) as computed by the model
             (see `cache` output below). Can be used to speed up sequential decoding.
             The dictionary object will be modified in-place during the forward pass to add newly computed hidden-states.
-        **head_mask**: (`optional`) ``torch.Tensor`` of shape ``(num_heads,)`` or ``(num_layers, num_heads)``:
+        **head_mask**: (`optional`) ``torch.FloatTensor`` of shape ``(num_heads,)`` or ``(num_layers, num_heads)``:
             Mask to nullify selected heads of the self-attention modules.
             Mask values selected in ``[0, 1]``:
             ``1`` indicates the head is **not masked**, ``0`` indicates the head is **masked**.
diff --git a/pytorch_transformers/modeling_xlnet.py b/pytorch_transformers/modeling_xlnet.py
index 8686ebc5e6..515decdb3e 100644
--- a/pytorch_transformers/modeling_xlnet.py
+++ b/pytorch_transformers/modeling_xlnet.py
@@ -367,16 +367,16 @@ class XLNetRelativeAttention(nn.Module):
         self.d_model = config.d_model
         self.scale = 1 / (config.d_head ** 0.5)
 
-        self.q = nn.Parameter(torch.Tensor(config.d_model, self.n_head, self.d_head))
-        self.k = nn.Parameter(torch.Tensor(config.d_model, self.n_head, self.d_head))
-        self.v = nn.Parameter(torch.Tensor(config.d_model, self.n_head, self.d_head))
-        self.o = nn.Parameter(torch.Tensor(config.d_model, self.n_head, self.d_head))
-        self.r = nn.Parameter(torch.Tensor(config.d_model, self.n_head, self.d_head))
+        self.q = nn.Parameter(torch.FloatTensor(config.d_model, self.n_head, self.d_head))
+        self.k = nn.Parameter(torch.FloatTensor(config.d_model, self.n_head, self.d_head))
+        self.v = nn.Parameter(torch.FloatTensor(config.d_model, self.n_head, self.d_head))
+        self.o = nn.Parameter(torch.FloatTensor(config.d_model, self.n_head, self.d_head))
+        self.r = nn.Parameter(torch.FloatTensor(config.d_model, self.n_head, self.d_head))
 
-        self.r_r_bias = nn.Parameter(torch.Tensor(self.n_head, self.d_head))
-        self.r_s_bias = nn.Parameter(torch.Tensor(self.n_head, self.d_head))
-        self.r_w_bias = nn.Parameter(torch.Tensor(self.n_head, self.d_head))
-        self.seg_embed = nn.Parameter(torch.Tensor(2, self.n_head, self.d_head))
+        self.r_r_bias = nn.Parameter(torch.FloatTensor(self.n_head, self.d_head))
+        self.r_s_bias = nn.Parameter(torch.FloatTensor(self.n_head, self.d_head))
+        self.r_w_bias = nn.Parameter(torch.FloatTensor(self.n_head, self.d_head))
+        self.seg_embed = nn.Parameter(torch.FloatTensor(2, self.n_head, self.d_head))
 
         self.layer_norm = XLNetLayerNorm(config.d_model, eps=config.layer_norm_eps)
         self.dropout = nn.Dropout(config.dropout)
@@ -660,11 +660,11 @@ XLNET_INPUTS_DOCSTRING = r"""
             A parallel sequence of tokens (can be used to indicate various portions of the inputs).
             The embeddings from these tokens will be summed with the respective token embeddings.
             Indices are selected in the vocabulary (unlike BERT which has a specific vocabulary for segment indices).
-        **attention_mask**: (`optional`) ``torch.Tensor`` of shape ``(batch_size, sequence_length)``:
+        **attention_mask**: (`optional`) ``torch.FloatTensor`` of shape ``(batch_size, sequence_length)``:
             Mask to avoid performing attention on padding token indices.
             Mask values selected in ``[0, 1]``:
             ``1`` for tokens that are NOT MASKED, ``0`` for MASKED tokens.
-        **input_mask**: (`optional`) ``torch.Tensor`` of shape ``(batch_size, sequence_length)``:
+        **input_mask**: (`optional`) ``torch.FloatTensor`` of shape ``(batch_size, sequence_length)``:
             Mask to avoid performing attention on padding token indices.
             Negative of `attention_mask`, i.e. with 0 for real tokens and 1 for padding.
             Kept for compatibility with the original code base.
@@ -685,7 +685,7 @@ XLNET_INPUTS_DOCSTRING = r"""
             Mask to indicate the output tokens to use.
             If ``target_mapping[k, i, j] = 1``, the i-th predict in batch k is on the j-th token.
             Only used during pretraining for partial prediction or for sequential decoding (generation).
-        **head_mask**: (`optional`) ``torch.Tensor`` of shape ``(num_heads,)`` or ``(num_layers, num_heads)``:
+        **head_mask**: (`optional`) ``torch.FloatTensor`` of shape ``(num_heads,)`` or ``(num_layers, num_heads)``:
             Mask to nullify selected heads of the self-attention modules.
             Mask values selected in ``[0, 1]``:
             ``1`` indicates the head is **not masked**, ``0`` indicates the head is **masked**.
@@ -735,7 +735,7 @@ class XLNetModel(XLNetPreTrainedModel):
         self.n_layer = config.n_layer
 
         self.word_embedding = nn.Embedding(config.n_token, config.d_model)
-        self.mask_emb = nn.Parameter(torch.Tensor(1, 1, config.d_model))
+        self.mask_emb = nn.Parameter(torch.FloatTensor(1, 1, config.d_model))
         self.layer = nn.ModuleList([XLNetLayer(config) for _ in range(config.n_layer)])
         self.dropout = nn.Dropout(config.dropout)
 

From b1019d2a8e5725f4f72fc8abb4085fef8a60c7e4 Mon Sep 17 00:00:00 2001
From: Yiqing-Zhou <40547184+Yiqing-Zhou@users.noreply.github.com>
Date: Tue, 23 Jul 2019 20:41:26 +0800
Subject: [PATCH 011/200] token[-1] -> token.rstrip('\n')

---
 pytorch_transformers/tokenization_bert.py | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/pytorch_transformers/tokenization_bert.py b/pytorch_transformers/tokenization_bert.py
index acf89b6984..f9c97b7d12 100644
--- a/pytorch_transformers/tokenization_bert.py
+++ b/pytorch_transformers/tokenization_bert.py
@@ -69,7 +69,7 @@ def load_vocab(vocab_file):
     with open(vocab_file, "r", encoding="utf-8") as reader:
         tokens = reader.readlines()
     for index, token in enumerate(tokens):
-        token = token[:-1]
+        token = token.rstrip('\n')
         vocab[token] = index
     return vocab
 

From ba52fe69d5022dec5ab9a3df855918edb27cc213 Mon Sep 17 00:00:00 2001
From: thomwolf <thomwolf@gmail.com>
Date: Tue, 23 Jul 2019 15:10:02 +0200
Subject: [PATCH 012/200] update breaking change section regarding
 from_pretrained keyword arguments

---
 README.md | 7 +++++--
 1 file changed, 5 insertions(+), 2 deletions(-)

diff --git a/README.md b/README.md
index 81bc1ab6bd..aae27cc8ee 100644
--- a/README.md
+++ b/README.md
@@ -310,8 +310,11 @@ loss, logits, attentions = outputs
 
 ### Serialization
 
-Breaking change: Models are now set in evaluation mode by default when instantiated with the `from_pretrained()` method.
-To train them don't forget to set them back in training mode (`model.train()`) to activate the dropout modules.
+Breaking change in the `from_pretrained()`method:
+
+1. Models are now set in evaluation mode by default when instantiated with the `from_pretrained()` method. To train them don't forget to set them back in training mode (`model.train()`) to activate the dropout modules.
+
+2. The additional `*input` and `**kwargs` arguments supplied to the `from_pretrained()` method used to be directly passed to the underlying model's class `__init__()` method. They are now used to update the model configuration attribute instead which can break derived model classes build based on the previous `BertForSequenceClassification` examples. We are working on a way to mitigate this breaking change in [#866](https://github.com/huggingface/pytorch-transformers/pull/866) by forwarding the the model `__init__()` method (i) the provided positional arguments and (ii) the keyword arguments which do not match any configuratoin class attributes.
 
 Also, while not a breaking change, the serialization methods have been standardized and you probably should switch to the new method `save_pretrained(save_directory)` if you were using any other serialization method before.
 

From 0740e63e49db6da6c519aa0812d5d41420ea340b Mon Sep 17 00:00:00 2001
From: thomwolf <thomwolf@gmail.com>
Date: Tue, 23 Jul 2019 15:57:18 +0200
Subject: [PATCH 013/200] updating schedules for state_dict saving

---
 pytorch_transformers/optimization.py          | 64 ++++++++++---------
 .../tests/optimization_test.py                | 36 ++++++++++-
 2 files changed, 70 insertions(+), 30 deletions(-)

diff --git a/pytorch_transformers/optimization.py b/pytorch_transformers/optimization.py
index c08d3cb58b..39dc7a50ff 100644
--- a/pytorch_transformers/optimization.py
+++ b/pytorch_transformers/optimization.py
@@ -36,13 +36,13 @@ class WarmupConstantSchedule(LambdaLR):
         Keeps learning rate schedule equal to 1. after warmup_steps.
     """
     def __init__(self, optimizer, warmup_steps, last_epoch=-1):
+        self.warmup_steps = warmup_steps
+        super(WarmupConstantSchedule, self).__init__(optimizer, self.lr_lambda, last_epoch=last_epoch)
 
-        def lr_lambda(step):
-            if step < warmup_steps:
-                return float(step) / float(max(1.0, warmup_steps))
-            return 1.
-
-        super(WarmupConstantSchedule, self).__init__(optimizer, lr_lambda, last_epoch=last_epoch)
+    def lr_lambda(self, step):
+        if step < self.warmup_steps:
+            return float(step) / float(max(1.0, self.warmup_steps))
+        return 1.
 
 
 class WarmupLinearSchedule(LambdaLR):
@@ -51,13 +51,14 @@ class WarmupLinearSchedule(LambdaLR):
         Linearly decreases learning rate from 1. to 0. over remaining `t_total - warmup_steps` steps.
     """
     def __init__(self, optimizer, warmup_steps, t_total, last_epoch=-1):
+        self.warmup_steps = warmup_steps
+        self.t_total = t_total
+        super(WarmupLinearSchedule, self).__init__(optimizer, self.lr_lambda, last_epoch=last_epoch)
 
-        def lr_lambda(step):
-            if step < warmup_steps:
-                return float(step) / float(max(1, warmup_steps))
-            return max(0.0, float(t_total - step) / float(max(1.0, t_total - warmup_steps)))
-
-        super(WarmupLinearSchedule, self).__init__(optimizer, lr_lambda, last_epoch=last_epoch)
+    def lr_lambda(self, step):
+        if step < self.warmup_steps:
+            return float(step) / float(max(1, self.warmup_steps))
+        return max(0.0, float(self.t_total - step) / float(max(1.0, self.t_total - self.warmup_steps)))
 
 
 class WarmupCosineSchedule(LambdaLR):
@@ -66,17 +67,19 @@ class WarmupCosineSchedule(LambdaLR):
         Decreases learning rate from 1. to 0. over remaining `t_total - warmup_steps` steps following a cosine curve.
         If `cycles` (default=0.5) is different from default, learning rate follows cosine function after warmup.
     """
-    warn_t_total = True
     def __init__(self, optimizer, warmup_steps, t_total, cycles=.5, last_epoch=-1):
+        self.warmup_steps = warmup_steps
+        self.t_total = t_total
+        self.cycles = cycles
+        super(WarmupCosineSchedule, self).__init__(optimizer, self.lr_lambda, last_epoch=last_epoch)
 
-        def lr_lambda(step):
-            if step < warmup_steps:
-                return float(step) / float(max(1.0, warmup_steps))
-            else:
-                progress = float(step - warmup_steps) / float(max(1, t_total - warmup_steps))   # progress after warmup
-                return max(0.0, 0.5 * (1. + math.cos(math.pi * float(cycles) * 2.0 * progress)))
+    def lr_lambda(self, step):
+        if step < self.warmup_steps:
+            return float(step) / float(max(1.0, self.warmup_steps))
+        # progress after warmup
+        progress = float(step - self.warmup_steps) / float(max(1, self.t_total - self.warmup_steps))
+        return max(0.0, 0.5 * (1. + math.cos(math.pi * float(self.cycles) * 2.0 * progress)))
 
-        super(WarmupCosineSchedule, self).__init__(optimizer, lr_lambda, last_epoch=last_epoch)
 
 class WarmupCosineWithHardRestartsSchedule(LambdaLR):
     """ Linear warmup and then cosine cycles with hard restarts.
@@ -85,17 +88,20 @@ class WarmupCosineWithHardRestartsSchedule(LambdaLR):
         learning rate (with hard restarts).
     """
     def __init__(self, optimizer, warmup_steps, t_total, cycles=1., last_epoch=-1):
+        self.warmup_steps = warmup_steps
+        self.t_total = t_total
+        self.cycles = cycles
+        super(WarmupCosineWithHardRestartsSchedule, self).__init__(optimizer, self.lr_lambda, last_epoch=last_epoch)
 
-        def lr_lambda(step):
-            if step < warmup_steps:
-                return float(step) / float(max(1, warmup_steps))
-            else:
-                progress = float(step - warmup_steps) / float(max(1, t_total - warmup_steps))   # progress after warmup
-                if progress >= 1.0:
-                    return 0.0
-                return max(0.0, 0.5 * (1. + math.cos(math.pi * ((float(cycles) * progress) % 1.0))))
+    def lr_lambda(self, step):
+        if step < self.warmup_steps:
+            return float(step) / float(max(1, self.warmup_steps))
+        # progress after warmup
+        progress = float(step - self.warmup_steps) / float(max(1, self.t_total - self.warmup_steps))
+        if progress >= 1.0:
+            return 0.0
+        return max(0.0, 0.5 * (1. + math.cos(math.pi * ((float(self.cycles) * progress) % 1.0))))
 
-        super(WarmupCosineWithHardRestartsSchedule, self).__init__(optimizer, lr_lambda, last_epoch=last_epoch)
 
 
 class AdamW(Optimizer):
diff --git a/pytorch_transformers/tests/optimization_test.py b/pytorch_transformers/tests/optimization_test.py
index ef1a1b1d50..0146541582 100644
--- a/pytorch_transformers/tests/optimization_test.py
+++ b/pytorch_transformers/tests/optimization_test.py
@@ -17,13 +17,14 @@ from __future__ import division
 from __future__ import print_function
 
 import unittest
+import os
 
 import torch
 
 from pytorch_transformers import (AdamW, ConstantLRSchedule, WarmupConstantSchedule,
                                   WarmupCosineSchedule, WarmupCosineWithHardRestartsSchedule, WarmupLinearSchedule)
 
-import numpy as np
+from .tokenization_tests_commons import TemporaryDirectory
 
 
 def unwrap_schedule(scheduler, num_steps=10):
@@ -33,6 +34,20 @@ def unwrap_schedule(scheduler, num_steps=10):
         lrs.append(scheduler.get_lr())
     return lrs
 
+def unwrap_and_save_reload_schedule(scheduler, num_steps=10):
+    lrs = []
+    for step in range(num_steps):
+        scheduler.step()
+        lrs.append(scheduler.get_lr())
+        if step == num_steps // 2:
+            with TemporaryDirectory() as tmpdirname:
+                file_name = os.path.join(tmpdirname, 'schedule.bin')
+                torch.save(scheduler.state_dict(), file_name)
+
+                state_dict = torch.load(file_name)
+                scheduler.load_state_dict(state_dict)
+    return lrs
+
 class OptimizationTest(unittest.TestCase):
 
     def assertListAlmostEqual(self, list1, list2, tol):
@@ -72,6 +87,10 @@ class ScheduleInitTest(unittest.TestCase):
         self.assertEqual(len(lrs[0]), 1)
         self.assertListEqual([l[0] for l in lrs], expected_learning_rates)
 
+        scheduler = ConstantLRSchedule(self.optimizer)
+        lrs_2 = unwrap_and_save_reload_schedule(scheduler, self.num_steps)
+        self.assertListEqual([l[0] for l in lrs], [l[0] for l in lrs_2])
+
     def test_warmup_constant_scheduler(self):
         scheduler = WarmupConstantSchedule(self.optimizer, warmup_steps=4)
         lrs = unwrap_schedule(scheduler, self.num_steps)
@@ -79,6 +98,10 @@ class ScheduleInitTest(unittest.TestCase):
         self.assertEqual(len(lrs[0]), 1)
         self.assertListEqual([l[0] for l in lrs], expected_learning_rates)
 
+        scheduler = WarmupConstantSchedule(self.optimizer, warmup_steps=4)
+        lrs_2 = unwrap_and_save_reload_schedule(scheduler, self.num_steps)
+        self.assertListEqual([l[0] for l in lrs], [l[0] for l in lrs_2])
+
     def test_warmup_linear_scheduler(self):
         scheduler = WarmupLinearSchedule(self.optimizer, warmup_steps=2, t_total=10)
         lrs = unwrap_schedule(scheduler, self.num_steps)
@@ -86,6 +109,10 @@ class ScheduleInitTest(unittest.TestCase):
         self.assertEqual(len(lrs[0]), 1)
         self.assertListEqual([l[0] for l in lrs], expected_learning_rates)
 
+        scheduler = WarmupLinearSchedule(self.optimizer, warmup_steps=2, t_total=10)
+        lrs_2 = unwrap_and_save_reload_schedule(scheduler, self.num_steps)
+        self.assertListEqual([l[0] for l in lrs], [l[0] for l in lrs_2])
+
     def test_warmup_cosine_scheduler(self):
         scheduler = WarmupCosineSchedule(self.optimizer, warmup_steps=2, t_total=10)
         lrs = unwrap_schedule(scheduler, self.num_steps)
@@ -93,6 +120,10 @@ class ScheduleInitTest(unittest.TestCase):
         self.assertEqual(len(lrs[0]), 1)
         self.assertListAlmostEqual([l[0] for l in lrs], expected_learning_rates, tol=1e-2)
 
+        scheduler = WarmupCosineSchedule(self.optimizer, warmup_steps=2, t_total=10)
+        lrs_2 = unwrap_and_save_reload_schedule(scheduler, self.num_steps)
+        self.assertListEqual([l[0] for l in lrs], [l[0] for l in lrs_2])
+
     def test_warmup_cosine_hard_restart_scheduler(self):
         scheduler = WarmupCosineWithHardRestartsSchedule(self.optimizer, warmup_steps=2, cycles=2, t_total=10)
         lrs = unwrap_schedule(scheduler, self.num_steps)
@@ -100,6 +131,9 @@ class ScheduleInitTest(unittest.TestCase):
         self.assertEqual(len(lrs[0]), 1)
         self.assertListAlmostEqual([l[0] for l in lrs], expected_learning_rates, tol=1e-2)
 
+        scheduler = WarmupCosineWithHardRestartsSchedule(self.optimizer, warmup_steps=2, cycles=2, t_total=10)
+        lrs_2 = unwrap_and_save_reload_schedule(scheduler, self.num_steps)
+        self.assertListEqual([l[0] for l in lrs], [l[0] for l in lrs_2])
 
 if __name__ == "__main__":
     unittest.main()

From fec76a481d1ecfbf068d87735dd44ffc26158f6e Mon Sep 17 00:00:00 2001
From: Thomas Wolf <thomwolf@users.noreply.github.com>
Date: Tue, 23 Jul 2019 16:05:29 +0200
Subject: [PATCH 014/200] Update readme

---
 README.md | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/README.md b/README.md
index aae27cc8ee..8e2074f727 100644
--- a/README.md
+++ b/README.md
@@ -82,7 +82,8 @@ for model_class, tokenizer_class, pretrained_weights in MODELS:
 
     # Encode text
     input_ids = torch.tensor([tokenizer.encode("Here is some text to encode")])
-    last_hidden_states = model(input_ids)[0]  # Models outputs are now tuples
+    with torch.no_grad():
+        last_hidden_states = model(input_ids)[0]  # Models outputs are now tuples
 
 # Each architecture is provided with several class for fine-tuning on down-stream tasks, e.g.
 BERT_MODEL_CLASSES = [BertModel, BertForPreTraining, BertForMaskedLM, BertForNextSentencePrediction,

From e179c55490269432fd9c67fd867f555e81259a34 Mon Sep 17 00:00:00 2001
From: Anish Moorthy <anish.moorthy@worthix.com>
Date: Tue, 23 Jul 2019 10:39:51 -0400
Subject: [PATCH 015/200] Add docs for from_pretrained functions, rename
 return_unused_args

---
 pytorch_transformers/modeling_utils.py | 41 ++++++++++++++++++--------
 1 file changed, 28 insertions(+), 13 deletions(-)

diff --git a/pytorch_transformers/modeling_utils.py b/pytorch_transformers/modeling_utils.py
index 0a4bfa7ba0..3e8d2fbb1a 100644
--- a/pytorch_transformers/modeling_utils.py
+++ b/pytorch_transformers/modeling_utils.py
@@ -91,21 +91,33 @@ class PretrainedConfig(object):
             **cache_dir**: (`optional`) string:
                 Path to a directory in which a downloaded pre-trained model
                 configuration should be cached if the standard cache should not be used.
+            **return_unused_kwargs**: (`optional`) bool:
+                - If False, then this function returns just the final configuration object.
+                - If True, then this functions returns a tuple `(config, unused_kwargs)` where `unused_kwargs`
+                is a dictionary consisting of the key/value pairs whose keys are not configuration attributes:
+                ie the part of kwargs which has not been used to update `config` and is otherwise ignored.
             **kwargs**: (`optional`) dict:
-                Dictionnary of key, values to update the configuration object after loading.
-                Can be used to override selected configuration parameters.
+                Dictionary of key/value pairs with which to update the configuration object after loading.
+                - The values in kwargs of any keys which are configuration attributes will be used
+                to override the loaded values.
+                - Behavior concerning key/value pairs whose keys are *not* configuration attributes is controlled
+                by the `return_unused_kwargs` keyword parameter.
 
         Examples::
 
             >>> config = BertConfig.from_pretrained('bert-base-uncased')    # Download configuration from S3 and cache.
             >>> config = BertConfig.from_pretrained('./test/saved_model/')  # E.g. config (or model) was saved using `save_pretrained('./test/saved_model/')`
             >>> config = BertConfig.from_pretrained('./test/saved_model/my_configuration.json')
-            >>> config = BertConfig.from_pretrained('bert-base-uncased', output_attention=True)
+            >>> config = BertConfig.from_pretrained('bert-base-uncased', output_attention=True, foo=False)
             >>> assert config.output_attention == True
+            >>> config, unused_kwargs = BertConfig.from_pretrained('bert-base-uncased', output_attention=True,
+            >>>                                                    foo=False, return_unused_kwargs=True)
+            >>> assert config.output_attention == True
+            >>> assert unused_kwargs == {'foo': False}
 
         """
         cache_dir = kwargs.pop('cache_dir', None)
-        return_unused_args = kwargs.pop('return_unused_args', False)
+        return_unused_kwargs = kwargs.pop('return_unused_kwargs', False)
 
         if pretrained_model_name_or_path in cls.pretrained_config_archive_map:
             config_file = cls.pretrained_config_archive_map[pretrained_model_name_or_path]
@@ -149,7 +161,7 @@ class PretrainedConfig(object):
             kwargs.pop(key, None)
 
         logger.info("Model config %s", config)
-        if return_unused_args:
+        if return_unused_kwargs:
             return config, kwargs
         else:
             return config
@@ -326,6 +338,8 @@ class PreTrainedModel(nn.Module):
                     provided as `config` argument. This loading option is slower than converting the TensorFlow
                     checkpoint in a PyTorch model using the provided conversion scripts and loading
                     the PyTorch model afterwards.
+            **model_args**: (`optional`) Sequence:
+                All remaning positional arguments will be passed to the underlying model's __init__ function
             **config**: an optional configuration for the model to use instead of an automatically loaded configuation.
                 Configuration can be automatically loaded when:
                 - the model is a model provided by the library (loaded with a `shortcut name` of a pre-trained model), or
@@ -340,17 +354,18 @@ class PreTrainedModel(nn.Module):
                 configuration should be cached if the standard cache should not be used.
             **output_loading_info**: (`optional`) boolean:
                 Set to ``True`` to also return a dictionnary containing missing keys, unexpected keys and error messages.
-            **model_args**: (`optional`) Sequence:
-                All positional arguments will be passed to the underlying model's __init__ function
             **kwargs**: (`optional`) dict:
                 Dictionary of key, values to update the configuration object after loading.
                 Can be used to override selected configuration parameters. E.g. ``output_attention=True``.
 
-                If config is None, then **kwargs will be passed to the model.
-                If said key is *not* present, then kwargs will be used to
-                override any keys shared with the default configuration for the
-                given pretrained_model_name_or_path, and only the unshared
-                key/value pairs will be passed to the model.
+               - If a configuration is provided with `config`, **kwargs will be directly passed
+                 to the underlying model's __init__ method.
+               - If a configuration is not provided, **kwargs will be first passed to the pretrained
+                 model configuration class loading function (`PretrainedConfig.from_pretrained`).
+                 Each key of **kwargs that corresponds to a configuration attribute
+                 will be used to override said attribute with the supplied **kwargs value.
+                 Remaining keys that do not correspond to any configuration attribute will
+                 be passed to the underlying model's __init__ function.
 
         Examples::
 
@@ -373,7 +388,7 @@ class PreTrainedModel(nn.Module):
         if config is None:
             config, model_kwargs = cls.config_class.from_pretrained(
                 pretrained_model_name_or_path, *model_args,
-                cache_dir=cache_dir, return_unused_args=True,
+                cache_dir=cache_dir, return_unused_kwargs=True,
                 **kwargs
             )
         else:

From 4fb56c7729a2e08287476d9ae9fe74e9f8ef4f0a Mon Sep 17 00:00:00 2001
From: Anish Moorthy <anish.moorthy@worthix.com>
Date: Tue, 23 Jul 2019 10:41:02 -0400
Subject: [PATCH 016/200] Remove unused *args parameter from
 PreTrainedConfig.from_pretrained

---
 pytorch_transformers/modeling_utils.py | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/pytorch_transformers/modeling_utils.py b/pytorch_transformers/modeling_utils.py
index 3e8d2fbb1a..66bfe99d85 100644
--- a/pytorch_transformers/modeling_utils.py
+++ b/pytorch_transformers/modeling_utils.py
@@ -78,7 +78,7 @@ class PretrainedConfig(object):
         self.to_json_file(output_config_file)
 
     @classmethod
-    def from_pretrained(cls, pretrained_model_name_or_path, *args, **kwargs):
+    def from_pretrained(cls, pretrained_model_name_or_path, **kwargs):
         r""" Instantiate a PretrainedConfig from a pre-trained model configuration.
 
         Params:

From 2c9a3115b71aed07bf0745828e62b6f5ce1fca72 Mon Sep 17 00:00:00 2001
From: thomwolf <thomwolf@gmail.com>
Date: Tue, 23 Jul 2019 16:45:55 +0200
Subject: [PATCH 017/200] fix #858

---
 examples/run_glue.py | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/examples/run_glue.py b/examples/run_glue.py
index 25a487156e..b383bbcb80 100644
--- a/examples/run_glue.py
+++ b/examples/run_glue.py
@@ -256,7 +256,7 @@ def load_and_cache_examples(args, task, tokenizer, evaluate=False):
             cls_token_at_end=bool(args.model_type in ['xlnet']),            # xlnet has a cls token at the end
             cls_token=tokenizer.cls_token,
             sep_token=tokenizer.sep_token,
-            cls_token_segment_id=2 if args.model_type in ['xlnet'] else 1,
+            cls_token_segment_id=2 if args.model_type in ['xlnet'] else 0,
             pad_on_left=bool(args.model_type in ['xlnet']),                 # pad on the left for xlnet
             pad_token_segment_id=4 if args.model_type in ['xlnet'] else 0)
         if args.local_rank in [-1, 0]:

From 6070b55443d14ae480a0f359f3aff45308e7341d Mon Sep 17 00:00:00 2001
From: thomwolf <thomwolf@gmail.com>
Date: Tue, 23 Jul 2019 17:46:01 +0200
Subject: [PATCH 018/200] fix #868

---
 examples/run_glue.py  | 13 +++++++------
 examples/run_squad.py | 13 +++++++------
 2 files changed, 14 insertions(+), 12 deletions(-)

diff --git a/examples/run_glue.py b/examples/run_glue.py
index b383bbcb80..5d9abd06fc 100644
--- a/examples/run_glue.py
+++ b/examples/run_glue.py
@@ -92,6 +92,12 @@ def train(args, train_dataset, model, tokenizer):
             raise ImportError("Please install apex from https://www.github.com/nvidia/apex to use fp16 training.")
         model, optimizer = amp.initialize(model, optimizer, opt_level=args.fp16_opt_level)
 
+    # Distributed training (should be after apex fp16 initialization)
+    if args.local_rank != -1:
+        model = torch.nn.parallel.DistributedDataParallel(model, device_ids=[args.local_rank],
+                                                          output_device=args.local_rank,
+                                                          find_unused_parameters=True)
+
     # Train!
     logger.info("***** Running training *****")
     logger.info("  Num examples = %d", len(train_dataset))
@@ -411,13 +417,8 @@ def main():
     if args.local_rank == 0:
         torch.distributed.barrier()  # Make sure only the first process in distributed training will download model & vocab
 
-    # Distributed and parallel training
     model.to(args.device)
-    if args.local_rank != -1:
-        model = torch.nn.parallel.DistributedDataParallel(model, device_ids=[args.local_rank],
-                                                          output_device=args.local_rank,
-                                                          find_unused_parameters=True)
-    elif args.n_gpu > 1:
+    if args.n_gpu > 1:
         model = torch.nn.DataParallel(model)
 
     logger.info("Training/evaluation parameters %s", args)
diff --git a/examples/run_squad.py b/examples/run_squad.py
index 53ea0bfd64..36e03fb012 100644
--- a/examples/run_squad.py
+++ b/examples/run_squad.py
@@ -101,6 +101,12 @@ def train(args, train_dataset, model, tokenizer):
             raise ImportError("Please install apex from https://www.github.com/nvidia/apex to use fp16 training.")
         model, optimizer = amp.initialize(model, optimizer, opt_level=args.fp16_opt_level)
 
+    # Distributed training (should be after apex fp16 initialization)
+    if args.local_rank != -1:
+        model = torch.nn.parallel.DistributedDataParallel(model, device_ids=[args.local_rank],
+                                                          output_device=args.local_rank,
+                                                          find_unused_parameters=True)
+
     # Train!
     logger.info("***** Running training *****")
     logger.info("  Num examples = %d", len(train_dataset))
@@ -450,13 +456,8 @@ def main():
     if args.local_rank == 0:
         torch.distributed.barrier()  # Make sure only the first process in distributed training will download model & vocab
 
-    # Distributed and parrallel training
     model.to(args.device)
-    if args.local_rank != -1:
-        model = torch.nn.parallel.DistributedDataParallel(model, device_ids=[args.local_rank],
-                                                          output_device=args.local_rank,
-                                                          find_unused_parameters=True)
-    elif args.n_gpu > 1:
+    if args.n_gpu > 1:
         model = torch.nn.DataParallel(model)
 
     logger.info("Training/evaluation parameters %s", args)

From 1383c7b87af19bf21adf19d66cf6ee1a80555ea4 Mon Sep 17 00:00:00 2001
From: thomwolf <thomwolf@gmail.com>
Date: Tue, 23 Jul 2019 17:52:20 +0200
Subject: [PATCH 019/200] Fix #869

---
 pytorch_transformers/modeling_utils.py | 22 ++++++++++++++++++----
 1 file changed, 18 insertions(+), 4 deletions(-)

diff --git a/pytorch_transformers/modeling_utils.py b/pytorch_transformers/modeling_utils.py
index 324cdc17c9..3f1df0a49d 100644
--- a/pytorch_transformers/modeling_utils.py
+++ b/pytorch_transformers/modeling_utils.py
@@ -39,6 +39,20 @@ WEIGHTS_NAME = "pytorch_model.bin"
 TF_WEIGHTS_NAME = 'model.ckpt'
 
 
+try:
+    from torch.nn import Identity
+except ImportError:
+    # Older PyTorch compatibility
+    class Identity(nn.Module):
+        r"""A placeholder identity operator that is argument-insensitive.
+        """
+        def __init__(self, *args, **kwargs):
+            super(Identity, self).__init__()
+
+        def forward(self, input):
+            return input
+
+
 if not six.PY2:
     def add_start_docstrings(*docstr):
         def docstring_decorator(fn):
@@ -731,7 +745,7 @@ class SequenceSummary(nn.Module):
             # We can probably just use the multi-head attention module of PyTorch >=1.1.0
             raise NotImplementedError
 
-        self.summary = nn.Identity()
+        self.summary = Identity()
         if hasattr(config, 'summary_use_proj') and config.summary_use_proj:
             if hasattr(config, 'summary_proj_to_labels') and config.summary_proj_to_labels and config.num_labels > 0:
                 num_classes = config.num_labels
@@ -739,15 +753,15 @@ class SequenceSummary(nn.Module):
                 num_classes = config.hidden_size
             self.summary = nn.Linear(config.hidden_size, num_classes)
 
-        self.activation = nn.Identity()
+        self.activation = Identity()
         if hasattr(config, 'summary_activation') and config.summary_activation == 'tanh':
             self.activation = nn.Tanh()
 
-        self.first_dropout = nn.Identity()
+        self.first_dropout = Identity()
         if hasattr(config, 'summary_first_dropout') and config.summary_first_dropout > 0:
             self.first_dropout = nn.Dropout(config.summary_first_dropout)
 
-        self.last_dropout = nn.Identity()
+        self.last_dropout = Identity()
         if hasattr(config, 'summary_last_dropout') and config.summary_last_dropout > 0:
             self.last_dropout = nn.Dropout(config.summary_last_dropout)
 

From a7fce6d9176cf3662d153af54270f345eb0bec8d Mon Sep 17 00:00:00 2001
From: Chi-Liang Liu <liangtaiwan1230@gmail.com>
Date: Wed, 24 Jul 2019 16:11:36 +0800
Subject: [PATCH 020/200] fix squad v1 error (na_prob_file should be None)

---
 examples/run_squad.py | 5 ++++-
 1 file changed, 4 insertions(+), 1 deletion(-)

diff --git a/examples/run_squad.py b/examples/run_squad.py
index 36e03fb012..df8e3b4a82 100644
--- a/examples/run_squad.py
+++ b/examples/run_squad.py
@@ -241,7 +241,10 @@ def evaluate(args, model, tokenizer, prefix=""):
     # Compute predictions
     output_prediction_file = os.path.join(args.output_dir, "predictions_{}.json".format(prefix))
     output_nbest_file = os.path.join(args.output_dir, "nbest_predictions_{}.json".format(prefix))
-    output_null_log_odds_file = os.path.join(args.output_dir, "null_odds_{}.json".format(prefix))
+    if args.version_2_with_negative:
+        output_null_log_odds_file = os.path.join(args.output_dir, "null_odds_{}.json".format(prefix))
+    else:
+        output_null_log_odds_file = None
 
     if args.model_type in ['xlnet', 'xlm']:
         # XLNet uses a more complex post-processing procedure

From 66b15f73f0caeadadf1c65c6e047ebb4285f1f7a Mon Sep 17 00:00:00 2001
From: rococo // Ron <rococo@tangleroad.com>
Date: Wed, 24 Jul 2019 11:27:08 -0700
Subject: [PATCH 021/200] Update docs for parameter rename

OpenAIGPTLMHeadModel now accepts `labels` instead of `lm_labels`
---
 pytorch_transformers/modeling_openai.py | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/pytorch_transformers/modeling_openai.py b/pytorch_transformers/modeling_openai.py
index 4ea19a965d..17a46fa470 100644
--- a/pytorch_transformers/modeling_openai.py
+++ b/pytorch_transformers/modeling_openai.py
@@ -538,7 +538,7 @@ class OpenAIGPTLMHeadModel(OpenAIGPTPreTrainedModel):
     r"""
         **labels**: (`optional`) ``torch.LongTensor`` of shape ``(batch_size, sequence_length)``:
             Labels for language modeling.
-            Note that the labels **are shifted** inside the model, i.e. you can set ``lm_labels = input_ids``
+            Note that the labels **are shifted** inside the model, i.e. you can set ``labels = input_ids``
             Indices are selected in ``[-1, 0, ..., config.vocab_size]``
             All labels set to ``-1`` are ignored (masked), the loss is only
             computed for labels in ``[0, ..., config.vocab_size]``

From ae152cec09b496101841dcbc59613cc7a3d133a4 Mon Sep 17 00:00:00 2001
From: Joel Grus <joelgrus@gmail.com>
Date: Wed, 24 Jul 2019 16:54:48 -0700
Subject: [PATCH 022/200] make save_pretrained work with added tokens

right now it's dumping the *decoder* when it should be dumping the *encoder*. this fixes that.
---
 pytorch_transformers/tokenization_utils.py | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/pytorch_transformers/tokenization_utils.py b/pytorch_transformers/tokenization_utils.py
index f603a29d74..858edc7c50 100644
--- a/pytorch_transformers/tokenization_utils.py
+++ b/pytorch_transformers/tokenization_utils.py
@@ -266,7 +266,7 @@ class PreTrainedTokenizer(object):
 
         with open(added_tokens_file, 'w', encoding='utf-8') as f:
             if self.added_tokens_encoder:
-                out_str = json.dumps(self.added_tokens_decoder, ensure_ascii=False)
+                out_str = json.dumps(self.added_tokens_encoder, ensure_ascii=False)
             else:
                 out_str = u"{}"
             f.write(out_str)

From adb3ef636877586ab64ea9be97f3407433d053d8 Mon Sep 17 00:00:00 2001
From: zijunsun <zijun_sun@shannonai.com>
Date: Thu, 25 Jul 2019 13:09:10 +0800
Subject: [PATCH 023/200] multi-gpu training also should be after apex fp16

---
 examples/run_glue.py | 6 ++++--
 1 file changed, 4 insertions(+), 2 deletions(-)

diff --git a/examples/run_glue.py b/examples/run_glue.py
index 5d9abd06fc..0d4ffaa390 100644
--- a/examples/run_glue.py
+++ b/examples/run_glue.py
@@ -92,6 +92,10 @@ def train(args, train_dataset, model, tokenizer):
             raise ImportError("Please install apex from https://www.github.com/nvidia/apex to use fp16 training.")
         model, optimizer = amp.initialize(model, optimizer, opt_level=args.fp16_opt_level)
 
+    # multi-gpu training (should be after apex fp16 initialization)
+    if args.n_gpu > 1:
+        model = torch.nn.DataParallel(model)
+
     # Distributed training (should be after apex fp16 initialization)
     if args.local_rank != -1:
         model = torch.nn.parallel.DistributedDataParallel(model, device_ids=[args.local_rank],
@@ -418,8 +422,6 @@ def main():
         torch.distributed.barrier()  # Make sure only the first process in distributed training will download model & vocab
 
     model.to(args.device)
-    if args.n_gpu > 1:
-        model = torch.nn.DataParallel(model)
 
     logger.info("Training/evaluation parameters %s", args)
 

From 35c52f2f3cf85e26a85a7c52cff789983edaa62c Mon Sep 17 00:00:00 2001
From: Sukuya <sukuya@users.noreply.github.com>
Date: Thu, 25 Jul 2019 16:51:11 +0800
Subject: [PATCH 024/200] Update torchscript.rst

Import fixed to pytorch_transformers else torchscript flag can't be used.
---
 docs/source/torchscript.rst | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/docs/source/torchscript.rst b/docs/source/torchscript.rst
index 1b84559567..e207068fad 100644
--- a/docs/source/torchscript.rst
+++ b/docs/source/torchscript.rst
@@ -74,7 +74,7 @@ according to a ``BertConfig`` class and then saved to disk under the filename ``
 
 .. code-block:: python
 
-    from pytorch_pretrained_bert import BertModel, BertTokenizer, BertConfig
+    from pytorch_transformers import BertModel, BertTokenizer, BertConfig
     import torch
 
     enc = BertTokenizer.from_pretrained("bert-base-uncased")
@@ -129,4 +129,4 @@ Using the traced model for inference is as simple as using its ``__call__`` dund
 
 .. code-block:: python
 
-    traced_model(tokens_tensor, segments_tensors)
\ No newline at end of file
+    traced_model(tokens_tensor, segments_tensors)

From f0aeb7a814289a64a5b22577415a0cfcde3c7870 Mon Sep 17 00:00:00 2001
From: zijunsun <zijun_sun@shannonai.com>
Date: Fri, 26 Jul 2019 15:23:29 +0800
Subject: [PATCH 025/200] =?UTF-8?q?multi-gpu=20training=20also=20should=20?=
 =?UTF-8?q?be=20after=20apex=20fp16=EF=BC=88squad=EF=BC=89?=
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

---
 examples/run_squad.py | 6 ++++--
 1 file changed, 4 insertions(+), 2 deletions(-)

diff --git a/examples/run_squad.py b/examples/run_squad.py
index 36e03fb012..692cb4a20c 100644
--- a/examples/run_squad.py
+++ b/examples/run_squad.py
@@ -101,6 +101,10 @@ def train(args, train_dataset, model, tokenizer):
             raise ImportError("Please install apex from https://www.github.com/nvidia/apex to use fp16 training.")
         model, optimizer = amp.initialize(model, optimizer, opt_level=args.fp16_opt_level)
 
+    # multi-gpu training (should be after apex fp16 initialization)
+    if args.n_gpu > 1:
+        model = torch.nn.DataParallel(model)
+
     # Distributed training (should be after apex fp16 initialization)
     if args.local_rank != -1:
         model = torch.nn.parallel.DistributedDataParallel(model, device_ids=[args.local_rank],
@@ -457,8 +461,6 @@ def main():
         torch.distributed.barrier()  # Make sure only the first process in distributed training will download model & vocab
 
     model.to(args.device)
-    if args.n_gpu > 1:
-        model = torch.nn.DataParallel(model)
 
     logger.info("Training/evaluation parameters %s", args)
 

From edfd965ac8a5446adb2c94ad043263b3144b3f95 Mon Sep 17 00:00:00 2001
From: David Pollack <david@i2x.ai>
Date: Fri, 26 Jul 2019 14:13:46 +0200
Subject: [PATCH 026/200] fix convert_to_tf

---
 .../convert_pytorch_checkpoint_to_tf.py                | 10 +++++-----
 1 file changed, 5 insertions(+), 5 deletions(-)

diff --git a/pytorch_transformers/convert_pytorch_checkpoint_to_tf.py b/pytorch_transformers/convert_pytorch_checkpoint_to_tf.py
index b8858ee3dc..a2e7b5c41a 100644
--- a/pytorch_transformers/convert_pytorch_checkpoint_to_tf.py
+++ b/pytorch_transformers/convert_pytorch_checkpoint_to_tf.py
@@ -72,11 +72,11 @@ def convert_pytorch_checkpoint_to_tf(model:BertModel, ckpt_dir:str, model_name:s
         return 'bert/{}'.format(name)
 
     def assign_tf_var(tensor:np.ndarray, name:str):
-        tmp_var = tf.Variable(initial_value=tensor)
-        tf_var = tf.get_variable(dtype=tmp_var.dtype, shape=tmp_var.shape, name=name)
-        op = tf.assign(ref=tf_var, value=tmp_var)
-        session.run(tf.variables_initializer([tmp_var, tf_var]))
-        session.run(fetches=[op, tf_var])
+        tf_dtype = tf.dtypes.as_dtype(tensor.dtype)
+        tf_var = tf.get_variable(dtype=tf_dtype, shape=tensor.shape, name=name)
+        session.run(tf.variables_initializer([tf_var]))
+        tf.keras.backend.set_value(tf_var, tensor)
+        session.run(tf_var)
         return tf_var
 
     for var_name in state_dict:

From 09ecf225e9ac00f78ecf9246957128f5d7d79a52 Mon Sep 17 00:00:00 2001
From: David Pollack <david@i2x.ai>
Date: Fri, 26 Jul 2019 15:20:44 +0200
Subject: [PATCH 027/200] fixed the fix.  tf session madness.

---
 .../convert_pytorch_checkpoint_to_tf.py       | 30 +++++++++----------
 1 file changed, 15 insertions(+), 15 deletions(-)

diff --git a/pytorch_transformers/convert_pytorch_checkpoint_to_tf.py b/pytorch_transformers/convert_pytorch_checkpoint_to_tf.py
index a2e7b5c41a..c24dddc4d6 100644
--- a/pytorch_transformers/convert_pytorch_checkpoint_to_tf.py
+++ b/pytorch_transformers/convert_pytorch_checkpoint_to_tf.py
@@ -62,34 +62,34 @@ def convert_pytorch_checkpoint_to_tf(model:BertModel, ckpt_dir:str, model_name:s
     if not os.path.isdir(ckpt_dir):
         os.makedirs(ckpt_dir)
 
-    session = tf.Session()
     state_dict = model.state_dict()
-    tf_vars = []
 
     def to_tf_var_name(name:str):
         for patt, repl in iter(var_map):
             name = name.replace(patt, repl)
         return 'bert/{}'.format(name)
 
-    def assign_tf_var(tensor:np.ndarray, name:str):
+    def create_tf_var(tensor:np.ndarray, name:str, session:tf.Session):
         tf_dtype = tf.dtypes.as_dtype(tensor.dtype)
-        tf_var = tf.get_variable(dtype=tf_dtype, shape=tensor.shape, name=name)
+        tf_var = tf.get_variable(dtype=tf_dtype, shape=tensor.shape, name=name, initializer=tf.zeros_initializer())
         session.run(tf.variables_initializer([tf_var]))
-        tf.keras.backend.set_value(tf_var, tensor)
         session.run(tf_var)
         return tf_var
 
-    for var_name in state_dict:
-        tf_name = to_tf_var_name(var_name)
-        torch_tensor = state_dict[var_name].numpy()
-        if any([x in var_name for x in tensors_to_transopse]):
-            torch_tensor = torch_tensor.T
-        tf_tensor = assign_tf_var(tensor=torch_tensor, name=tf_name)
-        tf_vars.append(tf_tensor)
-        print("{0}{1}initialized".format(tf_name, " " * (60 - len(tf_name))))
+    tf.reset_default_graph()
+    with tf.Session() as session:
+        for var_name in state_dict:
+            tf_name = to_tf_var_name(var_name)
+            torch_tensor = state_dict[var_name].numpy()
+            if any([x in var_name for x in tensors_to_transopse]):
+                torch_tensor = torch_tensor.T
+            tf_var = create_tf_var(tensor=torch_tensor, name=tf_name, session=session)
+            tf.keras.backend.set_value(tf_var, torch_tensor)
+            tf_weight = session.run(tf_var)
+            print("Successfully created {}: {}".format(tf_name, np.allclose(tf_weight, torch_tensor)))
 
-    saver = tf.train.Saver(tf_vars)
-    saver.save(session, os.path.join(ckpt_dir, model_name.replace("-", "_") + ".ckpt"))
+        saver = tf.train.Saver(tf.trainable_variables())
+        saver.save(session, os.path.join(ckpt_dir, model_name.replace("-", "_") + ".ckpt"))
 
 
 def main(raw_args=None):

From ac42049c0877104632cff44d52ddb0a06d927ee0 Mon Sep 17 00:00:00 2001
From: thomwolf <thomwolf@gmail.com>
Date: Fri, 26 Jul 2019 17:08:59 +0200
Subject: [PATCH 028/200] add auto models and auto tokenizer

---
 pytorch_transformers/modeling_auto.py     | 230 ++++++++++++++++++++++
 pytorch_transformers/tokenization_auto.py | 100 ++++++++++
 2 files changed, 330 insertions(+)
 create mode 100644 pytorch_transformers/modeling_auto.py
 create mode 100644 pytorch_transformers/tokenization_auto.py

diff --git a/pytorch_transformers/modeling_auto.py b/pytorch_transformers/modeling_auto.py
new file mode 100644
index 0000000000..68eb85cbd8
--- /dev/null
+++ b/pytorch_transformers/modeling_auto.py
@@ -0,0 +1,230 @@
+# coding=utf-8
+# Copyright 2018 The HuggingFace Inc. team.
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+#     http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+""" Auto Model class. """
+
+from __future__ import absolute_import, division, print_function, unicode_literals
+
+import logging
+
+from .modeling_bert import BertConfig, BertModel
+from .modeling_openai import OpenAIGPTConfig, OpenAIGPTModel
+from .modeling_gpt2 import GPT2Config, GPT2Model
+from .modeling_transfo_xl import TransfoXLConfig, TransfoXLModel
+from .modeling_xlnet import XLNetConfig, XLNetModel
+from .modeling_xlm import XLMConfig, XLMModel
+
+logger = logging.getLogger(__name__)
+
+class AutoConfig(object):
+    r""":class:`~pytorch_transformers.AutoConfig` is a generic configuration class
+        that will be instantiated as one of the configuration classes of the library
+        when created with the `AutoConfig.from_pretrained(pretrained_model_name_or_path)`
+        class method.
+
+        The `from_pretrained()` method take care of returning the correct model class instance
+        using pattern matching on the `pretrained_model_name_or_path` string.
+
+        The base model class to instantiate is selected as the first pattern matching
+        in the `pretrained_model_name_or_path` string (in the following order):
+            - contains `bert`: BertConfig (Bert model)
+            - contains `openai-gpt`: OpenAIGPTConfig (OpenAI GPT model)
+            - contains `gpt2`: GPT2Config (OpenAI GPT-2 model)
+            - contains `transfo-xl`: TransfoXLConfig (Transformer-XL model)
+            - contains `xlnet`: XLNetConfig (XLNet model)
+            - contains `xlm`: XLMConfig (XLM model)
+
+        This class cannot be instantiated using `__init__()` (throw an error).
+    """
+    def __init__(self):
+        raise EnvironmentError("AutoConfig is designed to be instantiated "
+            "using the `AutoConfig.from_pretrained(pretrained_model_name_or_path)` method.")
+
+    @classmethod
+    def from_pretrained(cls, pretrained_model_name_or_path, **kwargs):
+        r""" Instantiate a one of the configuration classes of the library
+        from a pre-trained model configuration.
+
+        The configuration class to instantiate is selected as the first pattern matching
+        in the `pretrained_model_name_or_path` string (in the following order):
+            - contains `bert`: BertConfig (Bert model)
+            - contains `openai-gpt`: OpenAIGPTConfig (OpenAI GPT model)
+            - contains `gpt2`: GPT2Config (OpenAI GPT-2 model)
+            - contains `transfo-xl`: TransfoXLConfig (Transformer-XL model)
+            - contains `xlnet`: XLNetConfig (XLNet model)
+            - contains `xlm`: XLMConfig (XLM model)
+
+        Params:
+            **pretrained_model_name_or_path**: either:
+                - a string with the `shortcut name` of a pre-trained model configuration to load from cache
+                    or download and cache if not already stored in cache (e.g. 'bert-base-uncased').
+                - a path to a `directory` containing a configuration file saved
+                    using the `save_pretrained(save_directory)` method.
+                - a path or url to a saved configuration `file`.
+            **cache_dir**: (`optional`) string:
+                Path to a directory in which a downloaded pre-trained model
+                configuration should be cached if the standard cache should not be used.
+            **return_unused_kwargs**: (`optional`) bool:
+                - If False, then this function returns just the final configuration object.
+                - If True, then this functions returns a tuple `(config, unused_kwargs)` where `unused_kwargs`
+                is a dictionary consisting of the key/value pairs whose keys are not configuration attributes:
+                ie the part of kwargs which has not been used to update `config` and is otherwise ignored.
+            **kwargs**: (`optional`) dict:
+                Dictionary of key/value pairs with which to update the configuration object after loading.
+                - The values in kwargs of any keys which are configuration attributes will be used
+                to override the loaded values.
+                - Behavior concerning key/value pairs whose keys are *not* configuration attributes is controlled
+                by the `return_unused_kwargs` keyword parameter.
+
+        Examples::
+
+            >>> config = AutoConfig.from_pretrained('bert-base-uncased')    # Download configuration from S3 and cache.
+            >>> config = AutoConfig.from_pretrained('./test/bert_saved_model/')  # E.g. config (or model) was saved using `save_pretrained('./test/saved_model/')`
+            >>> config = AutoConfig.from_pretrained('./test/bert_saved_model/my_configuration.json')
+            >>> config = AutoConfig.from_pretrained('bert-base-uncased', output_attention=True, foo=False)
+            >>> assert config.output_attention == True
+            >>> config, unused_kwargs = AutoConfig.from_pretrained('bert-base-uncased', output_attention=True,
+            >>>                                                    foo=False, return_unused_kwargs=True)
+            >>> assert config.output_attention == True
+            >>> assert unused_kwargs == {'foo': False}
+
+        """
+        if 'bert' in pretrained_model_name_or_path:
+            return BertConfig.from_pretrained(pretrained_model_name_or_path, **kwargs)
+        elif 'openai-gpt' in pretrained_model_name_or_path:
+            return OpenAIGPTConfig.from_pretrained(pretrained_model_name_or_path, **kwargs)
+        elif 'gpt2' in pretrained_model_name_or_path:
+            return GPT2Config.from_pretrained(pretrained_model_name_or_path, **kwargs)
+        elif 'transfo-xl' in pretrained_model_name_or_path:
+            return TransfoXLConfig.from_pretrained(pretrained_model_name_or_path, **kwargs)
+        elif 'xlnet' in pretrained_model_name_or_path:
+            return XLNetConfig.from_pretrained(pretrained_model_name_or_path, **kwargs)
+        elif 'xlm' in pretrained_model_name_or_path:
+            return XLMConfig.from_pretrained(pretrained_model_name_or_path, **kwargs)
+
+        raise ValueError("Unrecognized model identifier in {}. Should contains one of "
+                         "'bert', 'openai-gpt', 'gpt2', 'transfo-xl', 'xlnet', "
+                         "'xlm'".format(pretrained_model_name_or_path))
+
+
+class AutoModel(object):
+    r"""
+        :class:`~pytorch_transformers.AutoModel` is a generic model class
+        that will be instantiated as one of the base model classes of the library
+        when created with the `AutoModel.from_pretrained(pretrained_model_name_or_path)`
+        class method.
+
+        The `from_pretrained()` method take care of returning the correct model class instance
+        using pattern matching on the `pretrained_model_name_or_path` string.
+
+        The base model class to instantiate is selected as the first pattern matching
+        in the `pretrained_model_name_or_path` string (in the following order):
+            - contains `bert`: BertConfig (Bert model)
+            - contains `openai-gpt`: OpenAIGPTConfig (OpenAI GPT model)
+            - contains `gpt2`: GPT2Config (OpenAI GPT-2 model)
+            - contains `transfo-xl`: TransfoXLConfig (Transformer-XL model)
+            - contains `xlnet`: XLNetConfig (XLNet model)
+            - contains `xlm`: XLMConfig (XLM model)
+
+        This class cannot be instantiated using `__init__()` (throw an error).
+    """
+    def __init__(self):
+        raise EnvironmentError("AutoModel is designed to be instantiated "
+            "using the `AutoModel.from_pretrained(pretrained_model_name_or_path)` method.")
+
+    @classmethod
+    def from_pretrained(cls, pretrained_model_name_or_path, *model_args, **kwargs):
+        r""" Instantiate a one of the base model classes of the library
+        from a pre-trained model configuration.
+
+        The base model class to instantiate is selected as the first pattern matching
+        in the `pretrained_model_name_or_path` string (in the following order):
+            - contains `bert`: BertConfig (Bert model)
+            - contains `openai-gpt`: OpenAIGPTConfig (OpenAI GPT model)
+            - contains `gpt2`: GPT2Config (OpenAI GPT-2 model)
+            - contains `transfo-xl`: TransfoXLConfig (Transformer-XL model)
+            - contains `xlnet`: XLNetConfig (XLNet model)
+            - contains `xlm`: XLMConfig (XLM model)
+
+            The model is set in evaluation mode by default using `model.eval()` (Dropout modules are desactivated)
+            To train the model, you should first set it back in training mode with `model.train()`
+
+        Params:
+            **pretrained_model_name_or_path**: either:
+                - a string with the `shortcut name` of a pre-trained model to load from cache
+                    or download and cache if not already stored in cache (e.g. 'bert-base-uncased').
+                - a path to a `directory` containing a configuration file saved
+                    using the `save_pretrained(save_directory)` method.
+                - a path or url to a tensorflow index checkpoint `file` (e.g. `./tf_model/model.ckpt.index`).
+                    In this case, ``from_tf`` should be set to True and a configuration object should be
+                    provided as `config` argument. This loading option is slower than converting the TensorFlow
+                    checkpoint in a PyTorch model using the provided conversion scripts and loading
+                    the PyTorch model afterwards.
+            **model_args**: (`optional`) Sequence:
+                All remaning positional arguments will be passed to the underlying model's __init__ function
+            **config**: an optional configuration for the model to use instead of an automatically loaded configuation.
+                Configuration can be automatically loaded when:
+                - the model is a model provided by the library (loaded with a `shortcut name` of a pre-trained model), or
+                - the model was saved using the `save_pretrained(save_directory)` (loaded by suppling the save directory).
+            **state_dict**: an optional state dictionnary for the model to use instead of a state dictionary loaded
+                from saved weights file.
+                This option can be used if you want to create a model from a pretrained configuraton but load your own weights.
+                In this case though, you should check if using `save_pretrained(dir)` and `from_pretrained(save_directory)` is not
+                a simpler option.
+            **cache_dir**: (`optional`) string:
+                Path to a directory in which a downloaded pre-trained model
+                configuration should be cached if the standard cache should not be used.
+            **output_loading_info**: (`optional`) boolean:
+                Set to ``True`` to also return a dictionnary containing missing keys, unexpected keys and error messages.
+            **kwargs**: (`optional`) dict:
+                Dictionary of key, values to update the configuration object after loading.
+                Can be used to override selected configuration parameters. E.g. ``output_attention=True``.
+
+               - If a configuration is provided with `config`, **kwargs will be directly passed
+                 to the underlying model's __init__ method.
+               - If a configuration is not provided, **kwargs will be first passed to the pretrained
+                 model configuration class loading function (`PretrainedConfig.from_pretrained`).
+                 Each key of **kwargs that corresponds to a configuration attribute
+                 will be used to override said attribute with the supplied **kwargs value.
+                 Remaining keys that do not correspond to any configuration attribute will
+                 be passed to the underlying model's __init__ function.
+
+        Examples::
+
+            >>> model = AutoModel.from_pretrained('bert-base-uncased')    # Download model and configuration from S3 and cache.
+            >>> model = AutoModel.from_pretrained('./test/bert_model/')  # E.g. model was saved using `save_pretrained('./test/saved_model/')`
+            >>> model = AutoModel.from_pretrained('bert-base-uncased', output_attention=True)  # Update configuration during loading
+            >>> assert model.config.output_attention == True
+            >>> # Loading from a TF checkpoint file instead of a PyTorch model (slower)
+            >>> config = AutoConfig.from_json_file('./tf_model/bert_tf_model_config.json')
+            >>> model = AutoModel.from_pretrained('./tf_model/bert_tf_checkpoint.ckpt.index', from_tf=True, config=config)
+
+        """
+        if 'bert' in pretrained_model_name_or_path:
+            return BertModel.from_pretrained(pretrained_model_name_or_path, *model_args, **kwargs)
+        elif 'openai-gpt' in pretrained_model_name_or_path:
+            return OpenAIGPTModel.from_pretrained(pretrained_model_name_or_path, *model_args, **kwargs)
+        elif 'gpt2' in pretrained_model_name_or_path:
+            return GPT2Model.from_pretrained(pretrained_model_name_or_path, *model_args, **kwargs)
+        elif 'transfo-xl' in pretrained_model_name_or_path:
+            return TransfoXLModel.from_pretrained(pretrained_model_name_or_path, *model_args, **kwargs)
+        elif 'xlnet' in pretrained_model_name_or_path:
+            return XLNetModel.from_pretrained(pretrained_model_name_or_path, *model_args, **kwargs)
+        elif 'xlm' in pretrained_model_name_or_path:
+            return XLMModel.from_pretrained(pretrained_model_name_or_path, *model_args, **kwargs)
+
+        raise ValueError("Unrecognized model identifier in {}. Should contains one of "
+                         "'bert', 'openai-gpt', 'gpt2', 'transfo-xl', 'xlnet', "
+                         "'xlm'".format(pretrained_model_name_or_path))
+
diff --git a/pytorch_transformers/tokenization_auto.py b/pytorch_transformers/tokenization_auto.py
new file mode 100644
index 0000000000..66d0ce51ba
--- /dev/null
+++ b/pytorch_transformers/tokenization_auto.py
@@ -0,0 +1,100 @@
+# coding=utf-8
+# Copyright 2018 The HuggingFace Inc. team.
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+#     http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+""" Auto Model class. """
+
+from __future__ import absolute_import, division, print_function, unicode_literals
+
+import logging
+
+from .tokenization_bert import BertTokenizer
+from .tokenization_openai import OpenAIGPTTokenizer
+from .tokenization_gpt2 import GPT2Tokenizer
+from .tokenization_transfo_xl import TransfoXLTokenizer
+from .tokenization_xlnet import XLNetTokenizer
+from .tokenization_xlm import XLMTokenizer
+
+logger = logging.getLogger(__name__)
+
+class AutoTokenizer(object):
+    r""":class:`~pytorch_transformers.AutoTokenizer` is a generic tokenizer class
+        that will be instantiated as one of the tokenizer classes of the library
+        when created with the `AutoTokenizer.from_pretrained(pretrained_model_name_or_path)`
+        class method.
+
+        The `from_pretrained()` method take care of returning the correct tokenizer class instance
+        using pattern matching on the `pretrained_model_name_or_path` string.
+
+        The tokenizer class to instantiate is selected as the first pattern matching
+        in the `pretrained_model_name_or_path` string (in the following order):
+            - contains `bert`: BertTokenizer (Bert model)
+            - contains `openai-gpt`: OpenAIGPTTokenizer (OpenAI GPT model)
+            - contains `gpt2`: GPT2Tokenizer (OpenAI GPT-2 model)
+            - contains `transfo-xl`: TransfoXLTokenizer (Transformer-XL model)
+            - contains `xlnet`: XLNetTokenizer (XLNet model)
+            - contains `xlm`: XLMTokenizer (XLM model)
+
+        This class cannot be instantiated using `__init__()` (throw an error).
+    """
+    def __init__(self):
+        raise EnvironmentError("AutoTokenizer is designed to be instantiated "
+            "using the `AutoTokenizer.from_pretrained(pretrained_model_name_or_path)` method.")
+
+    @classmethod
+    def from_pretrained(cls, pretrained_model_name_or_path, *inputs, **kwargs):
+        r""" Instantiate a one of the tokenizer classes of the library
+        from a pre-trained model vocabulary.
+
+        The tokenizer class to instantiate is selected as the first pattern matching
+        in the `pretrained_model_name_or_path` string (in the following order):
+            - contains `bert`: BertTokenizer (Bert model)
+            - contains `openai-gpt`: OpenAIGPTTokenizer (OpenAI GPT model)
+            - contains `gpt2`: GPT2Tokenizer (OpenAI GPT-2 model)
+            - contains `transfo-xl`: TransfoXLTokenizer (Transformer-XL model)
+            - contains `xlnet`: XLNetTokenizer (XLNet model)
+            - contains `xlm`: XLMTokenizer (XLM model)
+
+        Params:
+            **pretrained_model_name_or_path**: either:
+                - a string with the `shortcut name` of a pre-trained model configuration to load from cache
+                    or download and cache if not already stored in cache (e.g. 'bert-base-uncased').
+                - a path to a `directory` containing a configuration file saved
+                    using the `save_pretrained(save_directory)` method.
+                - a path or url to a saved configuration `file`.
+            **cache_dir**: (`optional`) string:
+                Path to a directory in which a downloaded pre-trained model
+                configuration should be cached if the standard cache should not be used.
+
+        Examples::
+
+            >>> config = AutoTokenizer.from_pretrained('bert-base-uncased')    # Download vocabulary from S3 and cache.
+            >>> config = AutoTokenizer.from_pretrained('./test/bert_saved_model/')  # E.g. tokenizer was saved using `save_pretrained('./test/saved_model/')`
+
+        """
+        if 'bert' in pretrained_model_name_or_path:
+            return BertTokenizer.from_pretrained(pretrained_model_name_or_path, *inputs, **kwargs)
+        elif 'openai-gpt' in pretrained_model_name_or_path:
+            return OpenAIGPTTokenizer.from_pretrained(pretrained_model_name_or_path, *inputs, **kwargs)
+        elif 'gpt2' in pretrained_model_name_or_path:
+            return GPT2Tokenizer.from_pretrained(pretrained_model_name_or_path, *inputs, **kwargs)
+        elif 'transfo-xl' in pretrained_model_name_or_path:
+            return TransfoXLTokenizer.from_pretrained(pretrained_model_name_or_path, *inputs, **kwargs)
+        elif 'xlnet' in pretrained_model_name_or_path:
+            return XLNetTokenizer.from_pretrained(pretrained_model_name_or_path, *inputs, **kwargs)
+        elif 'xlm' in pretrained_model_name_or_path:
+            return XLMTokenizer.from_pretrained(pretrained_model_name_or_path, *inputs, **kwargs)
+
+        raise ValueError("Unrecognized model identifier in {}. Should contains one of "
+                         "'bert', 'openai-gpt', 'gpt2', 'transfo-xl', 'xlnet', "
+                         "'xlm'".format(pretrained_model_name_or_path))

From 57e54ec070258189695ba8cacdf7d2bcaf1c72bc Mon Sep 17 00:00:00 2001
From: thomwolf <thomwolf@gmail.com>
Date: Fri, 26 Jul 2019 17:09:07 +0200
Subject: [PATCH 029/200] add unk_token to gpt2

---
 pytorch_transformers/tokenization_gpt2.py | 6 ++----
 1 file changed, 2 insertions(+), 4 deletions(-)

diff --git a/pytorch_transformers/tokenization_gpt2.py b/pytorch_transformers/tokenization_gpt2.py
index 43c57c9cd3..afcdf1e64e 100644
--- a/pytorch_transformers/tokenization_gpt2.py
+++ b/pytorch_transformers/tokenization_gpt2.py
@@ -102,7 +102,7 @@ class GPT2Tokenizer(PreTrainedTokenizer):
     pretrained_vocab_files_map = PRETRAINED_VOCAB_FILES_MAP
     max_model_input_sizes = PRETRAINED_POSITIONAL_EMBEDDINGS_SIZES
 
-    def __init__(self, vocab_file, merges_file, errors='replace',
+    def __init__(self, vocab_file, merges_file, errors='replace', unk_token="<|endoftext|>",
                  bos_token="<|endoftext|>", eos_token="<|endoftext|>", **kwargs):
         super(GPT2Tokenizer, self).__init__(bos_token=bos_token, eos_token=eos_token, **kwargs)
 
@@ -177,9 +177,7 @@ class GPT2Tokenizer(PreTrainedTokenizer):
 
     def _convert_token_to_id(self, token):
         """ Converts a token (str/unicode) in an id using the vocab. """
-        if token in self.encoder:
-            return self.encoder.get(token)
-        return self.encoder.get(self.unk_token)
+        return self.encoder.get(token, self.encoder.get(self.unk_token))
 
     def _convert_id_to_token(self, index):
         """Converts an index (integer) in a token (string/unicode) using the vocab."""

From 27b0f86d36a1ee25dcc70ba602aefa556dc5f0a9 Mon Sep 17 00:00:00 2001
From: thomwolf <thomwolf@gmail.com>
Date: Fri, 26 Jul 2019 17:09:21 +0200
Subject: [PATCH 030/200] clean up pretrained

---
 pytorch_transformers/tokenization_utils.py | 7 +++++--
 1 file changed, 5 insertions(+), 2 deletions(-)

diff --git a/pytorch_transformers/tokenization_utils.py b/pytorch_transformers/tokenization_utils.py
index f603a29d74..2b3219c4cc 100644
--- a/pytorch_transformers/tokenization_utils.py
+++ b/pytorch_transformers/tokenization_utils.py
@@ -152,11 +152,13 @@ class PreTrainedTokenizer(object):
 
 
     @classmethod
-    def _from_pretrained(cls, pretrained_model_name_or_path, cache_dir=None, *inputs, **kwargs):
+    def _from_pretrained(cls, pretrained_model_name_or_path, *inputs, **kwargs):
         """
         Instantiate a PreTrainedTokenizer from pre-trained vocabulary files.
         Download and cache the vocabulary files if needed.
         """
+        cache_dir = kwargs.pop('cache_dir', None)
+
         s3_models = list(cls.max_model_input_sizes.keys())
         vocab_files = {}
         if pretrained_model_name_or_path in s3_models:
@@ -308,7 +310,8 @@ class PreTrainedTokenizer(object):
 
         to_add_tokens = []
         for token in new_tokens:
-            if self.convert_tokens_to_ids(token) == self.convert_tokens_to_ids(self.unk_token):
+            if token != self.unk_token and \
+                    self.convert_tokens_to_ids(token) == self.convert_tokens_to_ids(self.unk_token):
                 to_add_tokens.append(token)
                 logger.info("Adding %s to the vocabulary", token)
 

From 632d711411d2126e90cd4657f411a09bc180f561 Mon Sep 17 00:00:00 2001
From: thomwolf <thomwolf@gmail.com>
Date: Fri, 26 Jul 2019 21:14:37 +0200
Subject: [PATCH 031/200] fix #908

---
 pytorch_transformers/__init__.py | 20 ++++++++++----------
 1 file changed, 10 insertions(+), 10 deletions(-)

diff --git a/pytorch_transformers/__init__.py b/pytorch_transformers/__init__.py
index f875e4ab18..b4b957192c 100644
--- a/pytorch_transformers/__init__.py
+++ b/pytorch_transformers/__init__.py
@@ -7,20 +7,20 @@ from .tokenization_xlnet import XLNetTokenizer, SPIECE_UNDERLINE
 from .tokenization_xlm import XLMTokenizer
 from .tokenization_utils import (PreTrainedTokenizer, clean_up_tokenization)
 
-from .modeling_bert import (BertConfig, BertModel, BertForPreTraining,
-                       BertForMaskedLM, BertForNextSentencePrediction,
-                       BertForSequenceClassification, BertForMultipleChoice,
-                       BertForTokenClassification, BertForQuestionAnswering,
-                       load_tf_weights_in_bert, BERT_PRETRAINED_MODEL_ARCHIVE_MAP,
-                       BERT_PRETRAINED_CONFIG_ARCHIVE_MAP)
-from .modeling_openai import (OpenAIGPTConfig, OpenAIGPTModel,
+from .modeling_bert import (BertConfig, BertPreTrainedModel, BertModel, BertForPreTraining,
+                            BertForMaskedLM, BertForNextSentencePrediction,
+                            BertForSequenceClassification, BertForMultipleChoice,
+                            BertForTokenClassification, BertForQuestionAnswering,
+                            load_tf_weights_in_bert, BERT_PRETRAINED_MODEL_ARCHIVE_MAP,
+                            BERT_PRETRAINED_CONFIG_ARCHIVE_MAP)
+from .modeling_openai import (OpenAIGPTConfig, OpenAIGPTPreTrainedModel, OpenAIGPTModel,
                               OpenAIGPTLMHeadModel, OpenAIGPTDoubleHeadsModel,
                               load_tf_weights_in_openai_gpt, OPENAI_GPT_PRETRAINED_CONFIG_ARCHIVE_MAP,
                               OPENAI_GPT_PRETRAINED_MODEL_ARCHIVE_MAP)
-from .modeling_transfo_xl import (TransfoXLConfig, TransfoXLModel, TransfoXLLMHeadModel,
+from .modeling_transfo_xl import (TransfoXLConfig, TransfoXLPreTrainedModel, TransfoXLModel, TransfoXLLMHeadModel,
                                   load_tf_weights_in_transfo_xl, TRANSFO_XL_PRETRAINED_CONFIG_ARCHIVE_MAP,
                                   TRANSFO_XL_PRETRAINED_MODEL_ARCHIVE_MAP)
-from .modeling_gpt2 import (GPT2Config, GPT2Model,
+from .modeling_gpt2 import (GPT2Config, GPT2PreTrainedModel, GPT2Model,
                             GPT2LMHeadModel, GPT2DoubleHeadsModel,
                             load_tf_weights_in_gpt2, GPT2_PRETRAINED_CONFIG_ARCHIVE_MAP,
                             GPT2_PRETRAINED_MODEL_ARCHIVE_MAP)
@@ -29,7 +29,7 @@ from .modeling_xlnet import (XLNetConfig,
                              XLNetForSequenceClassification, XLNetForQuestionAnswering,
                              load_tf_weights_in_xlnet, XLNET_PRETRAINED_CONFIG_ARCHIVE_MAP,
                              XLNET_PRETRAINED_MODEL_ARCHIVE_MAP)
-from .modeling_xlm import (XLMConfig, XLMModel,
+from .modeling_xlm import (XLMConfig, XLMPreTrainedModel , XLMModel,
                            XLMWithLMHeadModel, XLMForSequenceClassification,
                            XLMForQuestionAnswering, XLM_PRETRAINED_CONFIG_ARCHIVE_MAP,
                            XLM_PRETRAINED_MODEL_ARCHIVE_MAP)

From 7b6e474c9acc26962363e78ef95fdb6f006eb0b4 Mon Sep 17 00:00:00 2001
From: thomwolf <thomwolf@gmail.com>
Date: Fri, 26 Jul 2019 21:26:44 +0200
Subject: [PATCH 032/200] fix #901

---
 pytorch_transformers/tokenization_utils.py | 28 ++++++++++++++++++----
 1 file changed, 24 insertions(+), 4 deletions(-)

diff --git a/pytorch_transformers/tokenization_utils.py b/pytorch_transformers/tokenization_utils.py
index 858edc7c50..e2fe46320e 100644
--- a/pytorch_transformers/tokenization_utils.py
+++ b/pytorch_transformers/tokenization_utils.py
@@ -160,26 +160,46 @@ class PreTrainedTokenizer(object):
         s3_models = list(cls.max_model_input_sizes.keys())
         vocab_files = {}
         if pretrained_model_name_or_path in s3_models:
+            # Get the vocabulary from AWS S3 bucket
             for file_id, map_list in cls.pretrained_vocab_files_map.items():
                 vocab_files[file_id] = map_list[pretrained_model_name_or_path]
         else:
+            # Get the vocabulary from local files
             logger.info(
                 "Model name '{}' not found in model shortcut name list ({}). "
                 "Assuming '{}' is a path or url to a directory containing tokenizer files.".format(
                     pretrained_model_name_or_path, ', '.join(s3_models),
                     pretrained_model_name_or_path))
-            all_vocab_files_names = {'added_tokens_file': ADDED_TOKENS_FILE,
-                                     'special_tokens_map_file': SPECIAL_TOKENS_MAP_FILE}
-            all_vocab_files_names.update(cls.vocab_files_names)
-            for file_id, file_name in all_vocab_files_names.items():
+
+            # Look for the tokenizer main vocabulary files
+            for file_id, file_name in cls.vocab_files_names.items():
                 if os.path.isdir(pretrained_model_name_or_path):
+                    # If a directory is provided we look for the standard filenames
                     full_file_name = os.path.join(pretrained_model_name_or_path, file_name)
                 else:
+                    # If a path to a file is provided we use it (will only work for non-BPE tokenizer using a single vocabulary file)
                     full_file_name = pretrained_model_name_or_path
                 if not os.path.exists(full_file_name):
                     logger.info("Didn't find file {}. We won't load it.".format(full_file_name))
                     full_file_name = None
                 vocab_files[file_id] = full_file_name
+
+            # Look for the additional tokens files
+            all_vocab_files_names = {'added_tokens_file': ADDED_TOKENS_FILE,
+                                     'special_tokens_map_file': SPECIAL_TOKENS_MAP_FILE}
+
+            # If a path to a file was provided, get the parent directory
+            saved_directory = pretrained_model_name_or_path
+            if os.path.exists(saved_directory) and not os.path.isdir(saved_directory):
+                saved_directory = os.path.dirname(saved_directory)
+
+            for file_id, file_name in all_vocab_files_names.items():
+                full_file_name = os.path.join(saved_directory, file_name)
+                if not os.path.exists(full_file_name):
+                    logger.info("Didn't find file {}. We won't load it.".format(full_file_name))
+                    full_file_name = None
+                vocab_files[file_id] = full_file_name
+
             if all(full_file_name is None for full_file_name in vocab_files.values()):
                 logger.error(
                     "Model name '{}' was not found in model name list ({}). "

From c717d38573dbb814a7b93e8057cf0e63c2c8e9df Mon Sep 17 00:00:00 2001
From: thomwolf <thomwolf@gmail.com>
Date: Fri, 26 Jul 2019 23:30:48 +0200
Subject: [PATCH 033/200] dictionnary => dictionary

---
 docs/source/model_doc/overview.rst         | 2 +-
 docs/source/serialization.rst              | 2 +-
 hubconfs/bert_hubconf.py                   | 2 +-
 hubconfs/gpt_hubconf.py                    | 2 +-
 hubconfs/transformer_xl_hubconf.py         | 2 +-
 pytorch_transformers/modeling_utils.py     | 4 ++--
 pytorch_transformers/tokenization_utils.py | 4 ++--
 7 files changed, 9 insertions(+), 9 deletions(-)

diff --git a/docs/source/model_doc/overview.rst b/docs/source/model_doc/overview.rst
index 8c77efd3f9..4cca4eb846 100644
--- a/docs/source/model_doc/overview.rst
+++ b/docs/source/model_doc/overview.rst
@@ -96,7 +96,7 @@ where
   ``cache_dir`` can be an optional path to a specific directory to download and cache the pre-trained model weights. This option is useful in particular when you are using distributed training: to avoid concurrent access to the same weights you can set for example ``cache_dir='./pretrained_model_{}'.format(args.local_rank)`` (see the section on distributed training for more information).
 
 * ``from_tf``\ : should we load the weights from a locally saved TensorFlow checkpoint
-* ``state_dict``\ : an optional state dictionnary (collections.OrderedDict object) to use instead of Google pre-trained models
+* ``state_dict``\ : an optional state dictionary (collections.OrderedDict object) to use instead of Google pre-trained models
 * ``*inputs``\ , `**kwargs`: additional input for the specific Bert class (ex: num_labels for BertForSequenceClassification)
 
 ``Uncased`` means that the text has been lowercased before WordPiece tokenization, e.g., ``John Smith`` becomes ``john smith``. The Uncased model also strips out any accent markers. ``Cased`` means that the true case and accent markers are preserved. Typically, the Uncased model is better unless you know that case information is important for your task (e.g., Named Entity Recognition or Part-of-Speech tagging). For information about the Multilingual and Chinese model, see the `Multilingual README <https://github.com/google-research/bert/blob/master/multilingual.md>`__ or the original TensorFlow repository.
diff --git a/docs/source/serialization.rst b/docs/source/serialization.rst
index fb947ffb51..be5197135d 100644
--- a/docs/source/serialization.rst
+++ b/docs/source/serialization.rst
@@ -40,7 +40,7 @@ where
 
 - `cache_dir` can be an optional path to a specific directory to download and cache the pre-trained model weights. This option is useful in particular when you are using distributed training: to avoid concurrent access to the same weights you can set for example `cache_dir='./pretrained_model_{}'.format(args.local_rank)` (see the section on distributed training for more information).
 - `from_tf`: should we load the weights from a locally saved TensorFlow checkpoint
-- `state_dict`: an optional state dictionnary (collections.OrderedDict object) to use instead of Google pre-trained models
+- `state_dict`: an optional state dictionary (collections.OrderedDict object) to use instead of Google pre-trained models
 - `*inputs`, `**kwargs`: additional input for the specific Bert class (ex: num_labels for BertForSequenceClassification)
 
 `Uncased` means that the text has been lowercased before WordPiece tokenization, e.g., `John Smith` becomes `john smith`. The Uncased model also strips out any accent markers. `Cased` means that the true case and accent markers are preserved. Typically, the Uncased model is better unless you know that case information is important for your task (e.g., Named Entity Recognition or Part-of-Speech tagging). For information about the Multilingual and Chinese model, see the [Multilingual README](https://github.com/google-research/bert/blob/master/multilingual.md) or the original TensorFlow repository.
diff --git a/hubconfs/bert_hubconf.py b/hubconfs/bert_hubconf.py
index 0ee0df6697..a0221ff9e1 100644
--- a/hubconfs/bert_hubconf.py
+++ b/hubconfs/bert_hubconf.py
@@ -37,7 +37,7 @@ bert_docstring = """
                  checkpoint
         cache_dir: an optional path to a folder in which the pre-trained models
                    will be cached.
-        state_dict: an optional state dictionnary
+        state_dict: an optional state dictionary
                     (collections.OrderedDict object) to use instead of Google
                     pre-trained models
         *inputs, **kwargs: additional input for the specific Bert class
diff --git a/hubconfs/gpt_hubconf.py b/hubconfs/gpt_hubconf.py
index 1683c881fa..c58c1fa708 100644
--- a/hubconfs/gpt_hubconf.py
+++ b/hubconfs/gpt_hubconf.py
@@ -40,7 +40,7 @@ gpt_docstring = """
 				. a series of NumPy files containing OpenAI TensorFlow trained weights
 		from_tf: should we load the weights from a locally saved TensorFlow checkpoint
 		cache_dir: an optional path to a folder in which the pre-trained models will be cached.
-		state_dict: an optional state dictionnary (collections.OrderedDict object)
+		state_dict: an optional state dictionary (collections.OrderedDict object)
 		        	to use instead of pre-trained models
 		*inputs, **kwargs: additional input for the specific OpenAI-GPT class
 """
diff --git a/hubconfs/transformer_xl_hubconf.py b/hubconfs/transformer_xl_hubconf.py
index d89db894ad..cfcc6aef5a 100644
--- a/hubconfs/transformer_xl_hubconf.py
+++ b/hubconfs/transformer_xl_hubconf.py
@@ -23,7 +23,7 @@ transformer_xl_docstring = """
                 . `model.chkpt` a TensorFlow checkpoint
         from_tf: should we load the weights from a locally saved TensorFlow checkpoint
         cache_dir: an optional path to a folder in which the pre-trained models will be cached.
-        state_dict: an optional state dictionnary (collections.OrderedDict object) to use instead of pre-trained models
+        state_dict: an optional state dictionary (collections.OrderedDict object) to use instead of pre-trained models
         *inputs, **kwargs: additional input for the specific TransformerXL class
 """
 
diff --git a/pytorch_transformers/modeling_utils.py b/pytorch_transformers/modeling_utils.py
index 66bfe99d85..4fabd49baf 100644
--- a/pytorch_transformers/modeling_utils.py
+++ b/pytorch_transformers/modeling_utils.py
@@ -358,7 +358,7 @@ class PreTrainedModel(nn.Module):
                 Dictionary of key, values to update the configuration object after loading.
                 Can be used to override selected configuration parameters. E.g. ``output_attention=True``.
 
-               - If a configuration is provided with `config`, **kwargs will be directly passed
+               - If a configuration is providedictionaryfig`, **kwargs will be directly passed
                  to the underlying model's __init__ method.
                - If a configuration is not provided, **kwargs will be first passed to the pretrained
                  model configuration class loading function (`PretrainedConfig.from_pretrained`).
@@ -367,7 +367,7 @@ class PreTrainedModel(nn.Module):
                  Remaining keys that do not correspond to any configuration attribute will
                  be passed to the underlying model's __init__ function.
 
-        Examples::
+        Examples::dictionary
 
             >>> model = BertModel.from_pretrained('bert-base-uncased')    # Download model and configuration from S3 and cache.
             >>> model = BertModel.from_pretrained('./test/saved_model/')  # E.g. model was saved using `save_pretrained('./test/saved_model/')`
diff --git a/pytorch_transformers/tokenization_utils.py b/pytorch_transformers/tokenization_utils.py
index 2b3219c4cc..eaef2fed1e 100644
--- a/pytorch_transformers/tokenization_utils.py
+++ b/pytorch_transformers/tokenization_utils.py
@@ -37,7 +37,7 @@ class PreTrainedTokenizer(object):
             additional_special_tokens = []
 
         We defined an added_tokens_encoder to add new tokens to the vocabulary without having to handle the
-            specific vocabulary augmentation methods of the various underlying dictionnary structures (BPE, sentencepiece...).
+            specific vocabulary augmentation methods of the various underlying dictionary structures (BPE, sentencepiece...).
     """
     vocab_files_names = {}
     pretrained_vocab_files_map = {}
@@ -324,7 +324,7 @@ class PreTrainedTokenizer(object):
 
 
     def add_special_tokens(self, special_tokens_dict):
-        """ Add a dictionnary of special tokens (eos, pad, cls...) to the encoder and link them
+        """ Add a dictionary of special tokens (eos, pad, cls...) to the encoder and link them
             to class attributes. If the special tokens are not in the vocabulary, they are added
             to it and indexed starting from the last index of the current vocabulary.
 

From ac27548b25ffef5966bd11c90419230cbeafe06e Mon Sep 17 00:00:00 2001
From: thomwolf <thomwolf@gmail.com>
Date: Sat, 27 Jul 2019 11:50:47 +0200
Subject: [PATCH 034/200] fix unk_token test

---
 pytorch_transformers/tokenization_gpt2.py | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/pytorch_transformers/tokenization_gpt2.py b/pytorch_transformers/tokenization_gpt2.py
index afcdf1e64e..29a9ae7660 100644
--- a/pytorch_transformers/tokenization_gpt2.py
+++ b/pytorch_transformers/tokenization_gpt2.py
@@ -104,7 +104,7 @@ class GPT2Tokenizer(PreTrainedTokenizer):
 
     def __init__(self, vocab_file, merges_file, errors='replace', unk_token="<|endoftext|>",
                  bos_token="<|endoftext|>", eos_token="<|endoftext|>", **kwargs):
-        super(GPT2Tokenizer, self).__init__(bos_token=bos_token, eos_token=eos_token, **kwargs)
+        super(GPT2Tokenizer, self).__init__(bos_token=bos_token, eos_token=eos_token, unk_token=unk_token, **kwargs)
 
         self.encoder = json.load(open(vocab_file))
         self.decoder = {v:k for k,v in self.encoder.items()}

From 4cc1bf81ee1326f1779f99ed5cc85370a550ef4a Mon Sep 17 00:00:00 2001
From: thomwolf <thomwolf@gmail.com>
Date: Sat, 27 Jul 2019 12:08:21 +0200
Subject: [PATCH 035/200] typos

---
 pytorch_transformers/modeling_auto.py     | 4 ++--
 pytorch_transformers/modeling_utils.py    | 4 ++--
 pytorch_transformers/tokenization_bert.py | 4 ++--
 3 files changed, 6 insertions(+), 6 deletions(-)

diff --git a/pytorch_transformers/modeling_auto.py b/pytorch_transformers/modeling_auto.py
index 68eb85cbd8..aa50b1526d 100644
--- a/pytorch_transformers/modeling_auto.py
+++ b/pytorch_transformers/modeling_auto.py
@@ -157,7 +157,7 @@ class AutoModel(object):
             - contains `xlnet`: XLNetConfig (XLNet model)
             - contains `xlm`: XLMConfig (XLM model)
 
-            The model is set in evaluation mode by default using `model.eval()` (Dropout modules are desactivated)
+            The model is set in evaluation mode by default using `model.eval()` (Dropout modules are deactivated)
             To train the model, you should first set it back in training mode with `model.train()`
 
         Params:
@@ -179,7 +179,7 @@ class AutoModel(object):
                 - the model was saved using the `save_pretrained(save_directory)` (loaded by suppling the save directory).
             **state_dict**: an optional state dictionnary for the model to use instead of a state dictionary loaded
                 from saved weights file.
-                This option can be used if you want to create a model from a pretrained configuraton but load your own weights.
+                This option can be used if you want to create a model from a pretrained configuration but load your own weights.
                 In this case though, you should check if using `save_pretrained(dir)` and `from_pretrained(save_directory)` is not
                 a simpler option.
             **cache_dir**: (`optional`) string:
diff --git a/pytorch_transformers/modeling_utils.py b/pytorch_transformers/modeling_utils.py
index 4fabd49baf..7ae834f5e5 100644
--- a/pytorch_transformers/modeling_utils.py
+++ b/pytorch_transformers/modeling_utils.py
@@ -324,7 +324,7 @@ class PreTrainedModel(nn.Module):
     def from_pretrained(cls, pretrained_model_name_or_path, *model_args, **kwargs):
         r"""Instantiate a pretrained pytorch model from a pre-trained model configuration.
 
-            The model is set in evaluation mode by default using `model.eval()` (Dropout modules are desactivated)
+            The model is set in evaluation mode by default using `model.eval()` (Dropout modules are deactivated)
             To train the model, you should first set it back in training mode with `model.train()`
 
         Params:
@@ -346,7 +346,7 @@ class PreTrainedModel(nn.Module):
                 - the model was saved using the `save_pretrained(save_directory)` (loaded by suppling the save directory).
             **state_dict**: an optional state dictionnary for the model to use instead of a state dictionary loaded
                 from saved weights file.
-                This option can be used if you want to create a model from a pretrained configuraton but load your own weights.
+                This option can be used if you want to create a model from a pretrained configuration but load your own weights.
                 In this case though, you should check if using `save_pretrained(dir)` and `from_pretrained(save_directory)` is not
                 a simpler option.
             **cache_dir**: (`optional`) string:
diff --git a/pytorch_transformers/tokenization_bert.py b/pytorch_transformers/tokenization_bert.py
index f9c97b7d12..d9cd881dfd 100644
--- a/pytorch_transformers/tokenization_bert.py
+++ b/pytorch_transformers/tokenization_bert.py
@@ -119,7 +119,7 @@ class BertTokenizer(PreTrainedTokenizer):
                 Only has an effect when do_basic_tokenize=True
             **tokenize_chinese_chars**: (`optional`) boolean (default True)
                 Whether to tokenize Chinese characters.
-                This should likely be desactivated for Japanese:
+                This should likely be deactivated for Japanese:
                 see: https://github.com/huggingface/pytorch-pretrained-BERT/issues/328
         """
         super(BertTokenizer, self).__init__(unk_token=unk_token, sep_token=sep_token,
@@ -214,7 +214,7 @@ class BasicTokenizer(object):
                 List of token not to split.
             **tokenize_chinese_chars**: (`optional`) boolean (default True)
                 Whether to tokenize Chinese characters.
-                This should likely be desactivated for Japanese:
+                This should likely be deactivated for Japanese:
                 see: https://github.com/huggingface/pytorch-pretrained-BERT/issues/328
         """
         if never_split is None:

From bfbe52ec397f0e43641ee58d4e347deff5216777 Mon Sep 17 00:00:00 2001
From: thomwolf <thomwolf@gmail.com>
Date: Sat, 27 Jul 2019 20:25:39 +0200
Subject: [PATCH 036/200] cleaning up example docstrings

---
 hubconfs/bert_hubconf.py                    | 206 ++++++++++----------
 hubconfs/gpt2_hubconf.py                    |  84 ++++----
 hubconfs/gpt_hubconf.py                     |  76 ++++----
 hubconfs/transformer_xl_hubconf.py          |  68 +++----
 hubconfs/xlm_hubconf.py                     |  80 ++++----
 hubconfs/xlnet_hubconf.1.py                 |  84 ++++----
 pytorch_transformers/modeling_auto.py       |  32 +--
 pytorch_transformers/modeling_bert.py       | 122 ++++++------
 pytorch_transformers/modeling_gpt2.py       |  40 ++--
 pytorch_transformers/modeling_openai.py     |  40 ++--
 pytorch_transformers/modeling_transfo_xl.py |  24 +--
 pytorch_transformers/modeling_utils.py      |  32 +--
 pytorch_transformers/modeling_xlm.py        |  58 +++---
 pytorch_transformers/modeling_xlnet.py      |  68 +++----
 pytorch_transformers/tokenization_auto.py   |   4 +-
 15 files changed, 509 insertions(+), 509 deletions(-)

diff --git a/hubconfs/bert_hubconf.py b/hubconfs/bert_hubconf.py
index a0221ff9e1..6e2830617f 100644
--- a/hubconfs/bert_hubconf.py
+++ b/hubconfs/bert_hubconf.py
@@ -84,12 +84,12 @@ def bertTokenizer(*args, **kwargs):
                  Default: ["[UNK]", "[SEP]", "[PAD]", "[CLS]", "[MASK]"]
 
     Example:
-        >>> import torch
-        >>> sentence = 'Hello, World!'
-        >>> tokenizer = torch.hub.load('huggingface/pytorch-transformers', 'bertTokenizer', 'bert-base-cased', do_basic_tokenize=False)
-        >>> toks = tokenizer.tokenize(sentence)
+        import torch
+        sentence = 'Hello, World!'
+        tokenizer = torch.hub.load('huggingface/pytorch-transformers', 'bertTokenizer', 'bert-base-cased', do_basic_tokenize=False)
+        toks = tokenizer.tokenize(sentence)
         ['Hello', '##,', 'World', '##!']
-        >>> ids = tokenizer.convert_tokens_to_ids(toks)
+        ids = tokenizer.convert_tokens_to_ids(toks)
         [8667, 28136, 1291, 28125]
     """
     tokenizer = BertTokenizer.from_pretrained(*args, **kwargs)
@@ -105,20 +105,20 @@ def bertModel(*args, **kwargs):
 
     Example:
         # Load the tokenizer
-        >>> import torch
-        >>> tokenizer = torch.hub.load('huggingface/pytorch-transformers', 'bertTokenizer', 'bert-base-cased', do_basic_tokenize=False)
+        import torch
+        tokenizer = torch.hub.load('huggingface/pytorch-transformers', 'bertTokenizer', 'bert-base-cased', do_basic_tokenize=False)
         #  Prepare tokenized input
-        >>> text = "[CLS] Who was Jim Henson ? [SEP] Jim Henson was a puppeteer [SEP]"
-        >>> tokenized_text = tokenizer.tokenize(text)
-        >>> indexed_tokens = tokenizer.convert_tokens_to_ids(tokenized_text)
-        >>> segments_ids = [0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1]
-        >>> tokens_tensor = torch.tensor([indexed_tokens])
-        >>> segments_tensors = torch.tensor([segments_ids])
+        text = "[CLS] Who was Jim Henson ? [SEP] Jim Henson was a puppeteer [SEP]"
+        tokenized_text = tokenizer.tokenize(text)
+        indexed_tokens = tokenizer.convert_tokens_to_ids(tokenized_text)
+        segments_ids = [0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1]
+        tokens_tensor = torch.tensor([indexed_tokens])
+        segments_tensors = torch.tensor([segments_ids])
         # Load bertModel
-        >>> model = torch.hub.load('huggingface/pytorch-transformers', 'bertModel', 'bert-base-cased')
-        >>> model.eval()
+        model = torch.hub.load('huggingface/pytorch-transformers', 'bertModel', 'bert-base-cased')
+        model.eval()
         # Predict hidden states features for each layer
-        >>> with torch.no_grad():
+        with torch.no_grad():
                 encoded_layers, _ = model(tokens_tensor, segments_tensors)
     """
     model = BertModel.from_pretrained(*args, **kwargs)
@@ -134,20 +134,20 @@ def bertForNextSentencePrediction(*args, **kwargs):
 
     Example:
         # Load the tokenizer
-        >>> import torch
-        >>> tokenizer = torch.hub.load('huggingface/pytorch-transformers', 'bertTokenizer', 'bert-base-cased', do_basic_tokenize=False)
+        import torch
+        tokenizer = torch.hub.load('huggingface/pytorch-transformers', 'bertTokenizer', 'bert-base-cased', do_basic_tokenize=False)
         #  Prepare tokenized input
-        >>> text = "[CLS] Who was Jim Henson ? [SEP] Jim Henson was a puppeteer [SEP]"
-        >>> tokenized_text = tokenizer.tokenize(text)
-        >>> indexed_tokens = tokenizer.convert_tokens_to_ids(tokenized_text)
-        >>> segments_ids = [0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1]
-        >>> tokens_tensor = torch.tensor([indexed_tokens])
-        >>> segments_tensors = torch.tensor([segments_ids])
+        text = "[CLS] Who was Jim Henson ? [SEP] Jim Henson was a puppeteer [SEP]"
+        tokenized_text = tokenizer.tokenize(text)
+        indexed_tokens = tokenizer.convert_tokens_to_ids(tokenized_text)
+        segments_ids = [0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1]
+        tokens_tensor = torch.tensor([indexed_tokens])
+        segments_tensors = torch.tensor([segments_ids])
         # Load bertForNextSentencePrediction
-        >>> model = torch.hub.load('huggingface/pytorch-transformers', 'bertForNextSentencePrediction', 'bert-base-cased')
-        >>> model.eval()
+        model = torch.hub.load('huggingface/pytorch-transformers', 'bertForNextSentencePrediction', 'bert-base-cased')
+        model.eval()
         # Predict the next sentence classification logits
-        >>> with torch.no_grad():
+        with torch.no_grad():
                 next_sent_classif_logits = model(tokens_tensor, segments_tensors)
     """
     model = BertForNextSentencePrediction.from_pretrained(*args, **kwargs)
@@ -164,17 +164,17 @@ def bertForPreTraining(*args, **kwargs):
 
     Example:
         # Load the tokenizer
-        >>> import torch
-        >>> tokenizer = torch.hub.load('huggingface/pytorch-transformers', 'bertTokenizer', 'bert-base-cased', do_basic_tokenize=False)
+        import torch
+        tokenizer = torch.hub.load('huggingface/pytorch-transformers', 'bertTokenizer', 'bert-base-cased', do_basic_tokenize=False)
         #  Prepare tokenized input
-        >>> text = "[CLS] Who was Jim Henson ? [SEP] Jim Henson was a puppeteer [SEP]"
-        >>> tokenized_text = tokenizer.tokenize(text)
-        >>> segments_ids = [0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1]
-        >>> tokens_tensor = torch.tensor([indexed_tokens])
-        >>> segments_tensors = torch.tensor([segments_ids])
+        text = "[CLS] Who was Jim Henson ? [SEP] Jim Henson was a puppeteer [SEP]"
+        tokenized_text = tokenizer.tokenize(text)
+        segments_ids = [0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1]
+        tokens_tensor = torch.tensor([indexed_tokens])
+        segments_tensors = torch.tensor([segments_ids])
         # Load bertForPreTraining
-        >>> model = torch.hub.load('huggingface/pytorch-transformers', 'bertForPreTraining', 'bert-base-cased')
-        >>> masked_lm_logits_scores, seq_relationship_logits = model(tokens_tensor, segments_tensors)
+        model = torch.hub.load('huggingface/pytorch-transformers', 'bertForPreTraining', 'bert-base-cased')
+        masked_lm_logits_scores, seq_relationship_logits = model(tokens_tensor, segments_tensors)
     """
     model = BertForPreTraining.from_pretrained(*args, **kwargs)
     return model
@@ -188,25 +188,25 @@ def bertForMaskedLM(*args, **kwargs):
 
     Example:
         # Load the tokenizer
-        >>> import torch
-        >>> tokenizer = torch.hub.load('huggingface/pytorch-transformers', 'bertTokenizer', 'bert-base-cased', do_basic_tokenize=False)
+        import torch
+        tokenizer = torch.hub.load('huggingface/pytorch-transformers', 'bertTokenizer', 'bert-base-cased', do_basic_tokenize=False)
         #  Prepare tokenized input
-        >>> text = "[CLS] Who was Jim Henson ? [SEP] Jim Henson was a puppeteer [SEP]"
-        >>> tokenized_text = tokenizer.tokenize(text)
-        >>> masked_index = 8
-        >>> tokenized_text[masked_index] = '[MASK]'
-        >>> indexed_tokens = tokenizer.convert_tokens_to_ids(tokenized_text)
-        >>> segments_ids = [0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1]
-        >>> tokens_tensor = torch.tensor([indexed_tokens])
-        >>> segments_tensors = torch.tensor([segments_ids])
+        text = "[CLS] Who was Jim Henson ? [SEP] Jim Henson was a puppeteer [SEP]"
+        tokenized_text = tokenizer.tokenize(text)
+        masked_index = 8
+        tokenized_text[masked_index] = '[MASK]'
+        indexed_tokens = tokenizer.convert_tokens_to_ids(tokenized_text)
+        segments_ids = [0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1]
+        tokens_tensor = torch.tensor([indexed_tokens])
+        segments_tensors = torch.tensor([segments_ids])
         # Load bertForMaskedLM
-        >>> model = torch.hub.load('huggingface/pytorch-transformers', 'bertForMaskedLM', 'bert-base-cased')
-        >>> model.eval()
+        model = torch.hub.load('huggingface/pytorch-transformers', 'bertForMaskedLM', 'bert-base-cased')
+        model.eval()
         # Predict all tokens
-        >>> with torch.no_grad():
+        with torch.no_grad():
                 predictions = model(tokens_tensor, segments_tensors)
-        >>> predicted_index = torch.argmax(predictions[0, masked_index]).item()
-        >>> predicted_token = tokenizer.convert_ids_to_tokens([predicted_index])[0]
+        predicted_index = torch.argmax(predictions[0, masked_index]).item()
+        predicted_token = tokenizer.convert_ids_to_tokens([predicted_index])[0]
         'henson'
     """
     model = BertForMaskedLM.from_pretrained(*args, **kwargs)
@@ -230,24 +230,24 @@ def bertForSequenceClassification(*args, **kwargs):
 
     Example:
         # Load the tokenizer
-        >>> import torch
-        >>> tokenizer = torch.hub.load('huggingface/pytorch-transformers', 'bertTokenizer', 'bert-base-cased', do_basic_tokenize=False)
+        import torch
+        tokenizer = torch.hub.load('huggingface/pytorch-transformers', 'bertTokenizer', 'bert-base-cased', do_basic_tokenize=False)
         #  Prepare tokenized input
-        >>> text = "[CLS] Who was Jim Henson ? [SEP] Jim Henson was a puppeteer [SEP]"
-        >>> tokenized_text = tokenizer.tokenize(text)
-        >>> indexed_tokens = tokenizer.convert_tokens_to_ids(tokenized_text)
-        >>> segments_ids = [0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1]
-        >>> tokens_tensor = torch.tensor([indexed_tokens])
-        >>> segments_tensors = torch.tensor([segments_ids])
+        text = "[CLS] Who was Jim Henson ? [SEP] Jim Henson was a puppeteer [SEP]"
+        tokenized_text = tokenizer.tokenize(text)
+        indexed_tokens = tokenizer.convert_tokens_to_ids(tokenized_text)
+        segments_ids = [0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1]
+        tokens_tensor = torch.tensor([indexed_tokens])
+        segments_tensors = torch.tensor([segments_ids])
         # Load bertForSequenceClassification
-        >>> model = torch.hub.load('huggingface/pytorch-transformers', 'bertForSequenceClassification', 'bert-base-cased', num_labels=2)
-        >>> model.eval()
+        model = torch.hub.load('huggingface/pytorch-transformers', 'bertForSequenceClassification', 'bert-base-cased', num_labels=2)
+        model.eval()
         # Predict the sequence classification logits
-        >>> with torch.no_grad():
+        with torch.no_grad():
                 seq_classif_logits = model(tokens_tensor, segments_tensors)
         # Or get the sequence classification loss
-        >>> labels = torch.tensor([1])
-        >>> seq_classif_loss = model(tokens_tensor, segments_tensors, labels=labels) # set model.train() before if training this loss
+        labels = torch.tensor([1])
+        seq_classif_loss = model(tokens_tensor, segments_tensors, labels=labels) # set model.train() before if training this loss
     """
     model = BertForSequenceClassification.from_pretrained(*args, **kwargs)
     return model
@@ -265,24 +265,24 @@ def bertForMultipleChoice(*args, **kwargs):
 
     Example:
         # Load the tokenizer
-        >>> import torch
-        >>> tokenizer = torch.hub.load('huggingface/pytorch-transformers', 'bertTokenizer', 'bert-base-cased', do_basic_tokenize=False)
+        import torch
+        tokenizer = torch.hub.load('huggingface/pytorch-transformers', 'bertTokenizer', 'bert-base-cased', do_basic_tokenize=False)
         #  Prepare tokenized input
-        >>> text = "[CLS] Who was Jim Henson ? [SEP] Jim Henson was a puppeteer [SEP]"
-        >>> tokenized_text = tokenizer.tokenize(text)
-        >>> indexed_tokens = tokenizer.convert_tokens_to_ids(tokenized_text)
-        >>> segments_ids = [0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1]
-        >>> tokens_tensor = torch.tensor([indexed_tokens, indexed_tokens]).unsqueeze(0)
-        >>> segments_tensors = torch.tensor([segments_ids, segments_ids]).unsqueeze(0)
+        text = "[CLS] Who was Jim Henson ? [SEP] Jim Henson was a puppeteer [SEP]"
+        tokenized_text = tokenizer.tokenize(text)
+        indexed_tokens = tokenizer.convert_tokens_to_ids(tokenized_text)
+        segments_ids = [0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1]
+        tokens_tensor = torch.tensor([indexed_tokens, indexed_tokens]).unsqueeze(0)
+        segments_tensors = torch.tensor([segments_ids, segments_ids]).unsqueeze(0)
         # Load bertForMultipleChoice
-        >>> model = torch.hub.load('huggingface/pytorch-transformers', 'bertForMultipleChoice', 'bert-base-cased', num_choices=2)
-        >>> model.eval()
+        model = torch.hub.load('huggingface/pytorch-transformers', 'bertForMultipleChoice', 'bert-base-cased', num_choices=2)
+        model.eval()
         # Predict the multiple choice logits
-        >>> with torch.no_grad():
+        with torch.no_grad():
                 multiple_choice_logits = model(tokens_tensor, segments_tensors)
         # Or get the multiple choice loss
-        >>> labels = torch.tensor([1])
-        >>> multiple_choice_loss = model(tokens_tensor, segments_tensors, labels=labels) # set model.train() before if training this loss
+        labels = torch.tensor([1])
+        multiple_choice_loss = model(tokens_tensor, segments_tensors, labels=labels) # set model.train() before if training this loss
     """
     model = BertForMultipleChoice.from_pretrained(*args, **kwargs)
     return model
@@ -298,25 +298,25 @@ def bertForQuestionAnswering(*args, **kwargs):
 
     Example:
         # Load the tokenizer
-        >>> import torch
-        >>> tokenizer = torch.hub.load('huggingface/pytorch-transformers', 'bertTokenizer', 'bert-base-cased', do_basic_tokenize=False)
+        import torch
+        tokenizer = torch.hub.load('huggingface/pytorch-transformers', 'bertTokenizer', 'bert-base-cased', do_basic_tokenize=False)
         #  Prepare tokenized input
-        >>> text = "[CLS] Who was Jim Henson ? [SEP] Jim Henson was a puppeteer [SEP]"
-        >>> tokenized_text = tokenizer.tokenize(text)
-        >>> indexed_tokens = tokenizer.convert_tokens_to_ids(tokenized_text)
-        >>> segments_ids = [0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1]
-        >>> tokens_tensor = torch.tensor([indexed_tokens])
-        >>> segments_tensors = torch.tensor([segments_ids])
+        text = "[CLS] Who was Jim Henson ? [SEP] Jim Henson was a puppeteer [SEP]"
+        tokenized_text = tokenizer.tokenize(text)
+        indexed_tokens = tokenizer.convert_tokens_to_ids(tokenized_text)
+        segments_ids = [0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1]
+        tokens_tensor = torch.tensor([indexed_tokens])
+        segments_tensors = torch.tensor([segments_ids])
         # Load bertForQuestionAnswering
-        >>> model = torch.hub.load('huggingface/pytorch-transformers', 'bertForQuestionAnswering', 'bert-base-cased')
-        >>> model.eval()
+        model = torch.hub.load('huggingface/pytorch-transformers', 'bertForQuestionAnswering', 'bert-base-cased')
+        model.eval()
         # Predict the start and end positions logits
-        >>> with torch.no_grad():
+        with torch.no_grad():
                 start_logits, end_logits = model(tokens_tensor, segments_tensors)
         # Or get the total loss which is the sum of the CrossEntropy loss for the start and end token positions
-        >>> start_positions, end_positions = torch.tensor([12]), torch.tensor([14])
+        start_positions, end_positions = torch.tensor([12]), torch.tensor([14])
         # set model.train() before if training this loss
-        >>> multiple_choice_loss = model(tokens_tensor, segments_tensors, start_positions=start_positions, end_positions=end_positions)
+        multiple_choice_loss = model(tokens_tensor, segments_tensors, start_positions=start_positions, end_positions=end_positions)
     """
     model = BertForQuestionAnswering.from_pretrained(*args, **kwargs)
     return model
@@ -337,24 +337,24 @@ def bertForTokenClassification(*args, **kwargs):
 
     Example:
         # Load the tokenizer
-        >>> import torch
-        >>> tokenizer = torch.hub.load('huggingface/pytorch-transformers', 'bertTokenizer', 'bert-base-cased', do_basic_tokenize=False)
+        import torch
+        tokenizer = torch.hub.load('huggingface/pytorch-transformers', 'bertTokenizer', 'bert-base-cased', do_basic_tokenize=False)
         #  Prepare tokenized input
-        >>> text = "[CLS] Who was Jim Henson ? [SEP] Jim Henson was a puppeteer [SEP]"
-        >>> tokenized_text = tokenizer.tokenize(text)
-        >>> indexed_tokens = tokenizer.convert_tokens_to_ids(tokenized_text)
-        >>> segments_ids = [0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1]
-        >>> tokens_tensor = torch.tensor([indexed_tokens])
-        >>> segments_tensors = torch.tensor([segments_ids])
+        text = "[CLS] Who was Jim Henson ? [SEP] Jim Henson was a puppeteer [SEP]"
+        tokenized_text = tokenizer.tokenize(text)
+        indexed_tokens = tokenizer.convert_tokens_to_ids(tokenized_text)
+        segments_ids = [0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1]
+        tokens_tensor = torch.tensor([indexed_tokens])
+        segments_tensors = torch.tensor([segments_ids])
         # Load bertForTokenClassification
-        >>> model = torch.hub.load('huggingface/pytorch-transformers', 'bertForTokenClassification', 'bert-base-cased', num_labels=2)
-        >>> model.eval()
+        model = torch.hub.load('huggingface/pytorch-transformers', 'bertForTokenClassification', 'bert-base-cased', num_labels=2)
+        model.eval()
         # Predict the token classification logits
-        >>> with torch.no_grad():
+        with torch.no_grad():
                 classif_logits = model(tokens_tensor, segments_tensors)
         # Or get the token classification loss
-        >>> labels = torch.tensor([[0, 1, 0, 0, 0, 1, 0, 0, 1, 1, 1, 1, 0, 0, 1, 0]])
-        >>> classif_loss = model(tokens_tensor, segments_tensors, labels=labels) # set model.train() before if training this loss
+        labels = torch.tensor([[0, 1, 0, 0, 0, 1, 0, 0, 1, 1, 1, 1, 0, 0, 1, 0]])
+        classif_loss = model(tokens_tensor, segments_tensors, labels=labels) # set model.train() before if training this loss
     """
     model = BertForTokenClassification.from_pretrained(*args, **kwargs)
     return model
diff --git a/hubconfs/gpt2_hubconf.py b/hubconfs/gpt2_hubconf.py
index dbaa2cd612..18afad3913 100644
--- a/hubconfs/gpt2_hubconf.py
+++ b/hubconfs/gpt2_hubconf.py
@@ -52,11 +52,11 @@ def gpt2Tokenizer(*args, **kwargs):
              Default: None
 
     Example:
-        >>> import torch
-        >>> tokenizer = torch.hub.load('huggingface/pytorch-transformers', 'gpt2Tokenizer', 'gpt2')
+        import torch
+        tokenizer = torch.hub.load('huggingface/pytorch-transformers', 'gpt2Tokenizer', 'gpt2')
 
-        >>> text = "Who was Jim Henson ?"
-        >>> indexed_tokens = tokenizer.encode(tokenized_text)
+        text = "Who was Jim Henson ?"
+        indexed_tokens = tokenizer.encode(tokenized_text)
     """
     tokenizer = GPT2Tokenizer.from_pretrained(*args, **kwargs)
     return tokenizer
@@ -71,24 +71,24 @@ def gpt2Model(*args, **kwargs):
 
     Example:
         # Load the tokenizer
-        >>> import torch
-        >>> tokenizer = torch.hub.load('huggingface/pytorch-transformers', 'gpt2Tokenizer', 'gpt2')
+        import torch
+        tokenizer = torch.hub.load('huggingface/pytorch-transformers', 'gpt2Tokenizer', 'gpt2')
 
         #  Prepare tokenized input
-        >>> text_1 = "Who was Jim Henson ?"
-        >>> text_2 = "Jim Henson was a puppeteer"
-        >>> indexed_tokens_1 = tokenizer.encode(text_1)
-        >>> indexed_tokens_2 = tokenizer.encode(text_2)
-        >>> tokens_tensor_1 = torch.tensor([indexed_tokens_1])
-        >>> tokens_tensor_2 = torch.tensor([indexed_tokens_2])
+        text_1 = "Who was Jim Henson ?"
+        text_2 = "Jim Henson was a puppeteer"
+        indexed_tokens_1 = tokenizer.encode(text_1)
+        indexed_tokens_2 = tokenizer.encode(text_2)
+        tokens_tensor_1 = torch.tensor([indexed_tokens_1])
+        tokens_tensor_2 = torch.tensor([indexed_tokens_2])
 
         # Load gpt2Model
-        >>> model = torch.hub.load('huggingface/pytorch-transformers', 'gpt2Model', 'gpt2')
-        >>> model.eval()
+        model = torch.hub.load('huggingface/pytorch-transformers', 'gpt2Model', 'gpt2')
+        model.eval()
 
         # Predict hidden states features for each layer
         # past can be used to reuse precomputed hidden state in a subsequent predictions
-        >>> with torch.no_grad():
+        with torch.no_grad():
                 hidden_states_1, past = model(tokens_tensor_1)
                 hidden_states_2, past = model(tokens_tensor_2, past=past)
     """
@@ -104,31 +104,31 @@ def gpt2LMHeadModel(*args, **kwargs):
 
     Example:
         # Load the tokenizer
-        >>> import torch
-        >>> tokenizer = torch.hub.load('huggingface/pytorch-transformers', 'gpt2Tokenizer', 'gpt2')
+        import torch
+        tokenizer = torch.hub.load('huggingface/pytorch-transformers', 'gpt2Tokenizer', 'gpt2')
 
         #  Prepare tokenized input
-        >>> text_1 = "Who was Jim Henson ?"
-        >>> text_2 = "Jim Henson was a puppeteer"
-        >>> indexed_tokens_1 = tokenizer.encode(text_1)
-        >>> indexed_tokens_2 = tokenizer.encode(text_2)
-        >>> tokens_tensor_1 = torch.tensor([indexed_tokens_1])
-        >>> tokens_tensor_2 = torch.tensor([indexed_tokens_2])
+        text_1 = "Who was Jim Henson ?"
+        text_2 = "Jim Henson was a puppeteer"
+        indexed_tokens_1 = tokenizer.encode(text_1)
+        indexed_tokens_2 = tokenizer.encode(text_2)
+        tokens_tensor_1 = torch.tensor([indexed_tokens_1])
+        tokens_tensor_2 = torch.tensor([indexed_tokens_2])
 
         # Load gpt2LMHeadModel
-        >>> model = torch.hub.load('huggingface/pytorch-transformers', 'gpt2LMHeadModel', 'gpt2')
-        >>> model.eval()
+        model = torch.hub.load('huggingface/pytorch-transformers', 'gpt2LMHeadModel', 'gpt2')
+        model.eval()
 
         # Predict hidden states features for each layer
         # past can be used to reuse precomputed hidden state in a subsequent predictions
-        >>> with torch.no_grad():
+        with torch.no_grad():
                 predictions_1, past = model(tokens_tensor_1)
                 predictions_2, past = model(tokens_tensor_2, past=past)
 
         # Get the predicted last token
-        >>> predicted_index = torch.argmax(predictions_2[0, -1, :]).item()
-        >>> predicted_token = tokenizer.decode([predicted_index])
-        >>> assert predicted_token == ' who'
+        predicted_index = torch.argmax(predictions_2[0, -1, :]).item()
+        predicted_token = tokenizer.decode([predicted_index])
+        assert predicted_token == ' who'
     """
     model = GPT2LMHeadModel.from_pretrained(*args, **kwargs)
     return model
@@ -143,25 +143,25 @@ def gpt2DoubleHeadsModel(*args, **kwargs):
 
     Example:
         # Load the tokenizer
-        >>> import torch
-        >>> tokenizer = torch.hub.load('huggingface/pytorch-transformers', 'gpt2Tokenizer', 'gpt2')
+        import torch
+        tokenizer = torch.hub.load('huggingface/pytorch-transformers', 'gpt2Tokenizer', 'gpt2')
 
         #  Prepare tokenized input
-        >>> text1 = "Who was Jim Henson ? Jim Henson was a puppeteer"
-        >>> text2 = "Who was Jim Henson ? Jim Henson was a mysterious young man"
-        >>> tokenized_text1 = tokenizer.tokenize(text1)
-        >>> tokenized_text2 = tokenizer.tokenize(text2)
-        >>> indexed_tokens1 = tokenizer.convert_tokens_to_ids(tokenized_text1)
-        >>> indexed_tokens2 = tokenizer.convert_tokens_to_ids(tokenized_text2)
-        >>> tokens_tensor = torch.tensor([[indexed_tokens1, indexed_tokens2]])
-        >>> mc_token_ids = torch.LongTensor([[len(tokenized_text1)-1, len(tokenized_text2)-1]])
+        text1 = "Who was Jim Henson ? Jim Henson was a puppeteer"
+        text2 = "Who was Jim Henson ? Jim Henson was a mysterious young man"
+        tokenized_text1 = tokenizer.tokenize(text1)
+        tokenized_text2 = tokenizer.tokenize(text2)
+        indexed_tokens1 = tokenizer.convert_tokens_to_ids(tokenized_text1)
+        indexed_tokens2 = tokenizer.convert_tokens_to_ids(tokenized_text2)
+        tokens_tensor = torch.tensor([[indexed_tokens1, indexed_tokens2]])
+        mc_token_ids = torch.LongTensor([[len(tokenized_text1)-1, len(tokenized_text2)-1]])
 
         # Load gpt2DoubleHeadsModel
-        >>> model = torch.hub.load('huggingface/pytorch-transformers', 'gpt2DoubleHeadsModel', 'gpt2')
-        >>> model.eval()
+        model = torch.hub.load('huggingface/pytorch-transformers', 'gpt2DoubleHeadsModel', 'gpt2')
+        model.eval()
 
         # Predict hidden states features for each layer
-        >>> with torch.no_grad():
+        with torch.no_grad():
                 lm_logits, multiple_choice_logits, presents = model(tokens_tensor, mc_token_ids)
     """
     model = GPT2DoubleHeadsModel.from_pretrained(*args, **kwargs)
diff --git a/hubconfs/gpt_hubconf.py b/hubconfs/gpt_hubconf.py
index c58c1fa708..649075980c 100644
--- a/hubconfs/gpt_hubconf.py
+++ b/hubconfs/gpt_hubconf.py
@@ -76,12 +76,12 @@ def openAIGPTTokenizer(*args, **kwargs):
 			 Default: None
 
     Example:
-		>>> import torch
-        >>> tokenizer = torch.hub.load('huggingface/pytorch-transformers', 'openAIGPTTokenizer', 'openai-gpt')
+		import torch
+        tokenizer = torch.hub.load('huggingface/pytorch-transformers', 'openAIGPTTokenizer', 'openai-gpt')
 		
-		>>> text = "Who was Jim Henson ? Jim Henson was a puppeteer"
-        >>> tokenized_text = tokenizer.tokenize(text)
-        >>> indexed_tokens = tokenizer.convert_tokens_to_ids(tokenized_text)
+		text = "Who was Jim Henson ? Jim Henson was a puppeteer"
+        tokenized_text = tokenizer.tokenize(text)
+        indexed_tokens = tokenizer.convert_tokens_to_ids(tokenized_text)
         [763, 509, 4265, 2298, 945, 257, 4265, 2298, 945, 509, 246, 10148, 39041, 483]
     """
     tokenizer = OpenAIGPTTokenizer.from_pretrained(*args, **kwargs)
@@ -97,21 +97,21 @@ def openAIGPTModel(*args, **kwargs):
 
     Example:
         # Load the tokenizer
-		>>> import torch
-        >>> tokenizer = torch.hub.load('huggingface/pytorch-transformers', 'openAIGPTTokenizer', 'openai-gpt')
+		import torch
+        tokenizer = torch.hub.load('huggingface/pytorch-transformers', 'openAIGPTTokenizer', 'openai-gpt')
 
         #  Prepare tokenized input
-        >>> text = "Who was Jim Henson ? Jim Henson was a puppeteer"
-        >>> tokenized_text = tokenizer.tokenize(text)
-        >>> indexed_tokens = tokenizer.convert_tokens_to_ids(tokenized_text)
-        >>> tokens_tensor = torch.tensor([indexed_tokens])
+        text = "Who was Jim Henson ? Jim Henson was a puppeteer"
+        tokenized_text = tokenizer.tokenize(text)
+        indexed_tokens = tokenizer.convert_tokens_to_ids(tokenized_text)
+        tokens_tensor = torch.tensor([indexed_tokens])
 
         # Load openAIGPTModel
-        >>> model = torch.hub.load('huggingface/pytorch-transformers', 'openAIGPTModel', 'openai-gpt')
-        >>> model.eval()
+        model = torch.hub.load('huggingface/pytorch-transformers', 'openAIGPTModel', 'openai-gpt')
+        model.eval()
 
         # Predict hidden states features for each layer
-        >>> with torch.no_grad():
+        with torch.no_grad():
                 hidden_states = model(tokens_tensor)
     """
     model = OpenAIGPTModel.from_pretrained(*args, **kwargs)
@@ -126,26 +126,26 @@ def openAIGPTLMHeadModel(*args, **kwargs):
 
 	Example:
         # Load the tokenizer
-        >>> import torch
-        >>> tokenizer = torch.hub.load('huggingface/pytorch-transformers', 'openAIGPTTokenizer', 'openai-gpt')
+        import torch
+        tokenizer = torch.hub.load('huggingface/pytorch-transformers', 'openAIGPTTokenizer', 'openai-gpt')
 
         #  Prepare tokenized input
-        >>> text = "Who was Jim Henson ? Jim Henson was a puppeteer"
-        >>> tokenized_text = tokenizer.tokenize(text)
-        >>> indexed_tokens = tokenizer.convert_tokens_to_ids(tokenized_text)
-        >>> tokens_tensor = torch.tensor([indexed_tokens])
+        text = "Who was Jim Henson ? Jim Henson was a puppeteer"
+        tokenized_text = tokenizer.tokenize(text)
+        indexed_tokens = tokenizer.convert_tokens_to_ids(tokenized_text)
+        tokens_tensor = torch.tensor([indexed_tokens])
 
         # Load openAIGPTLMHeadModel
-        >>> model = torch.hub.load('huggingface/pytorch-transformers', 'openAIGPTLMHeadModel', 'openai-gpt')
-        >>> model.eval()
+        model = torch.hub.load('huggingface/pytorch-transformers', 'openAIGPTLMHeadModel', 'openai-gpt')
+        model.eval()
 
         # Predict hidden states features for each layer
-        >>> with torch.no_grad():
+        with torch.no_grad():
                 predictions = model(tokens_tensor)
 
 		# Get the predicted last token
-		>>> predicted_index = torch.argmax(predictions[0, -1, :]).item()
-		>>> predicted_token = tokenizer.convert_ids_to_tokens([predicted_index])[0]
+		predicted_index = torch.argmax(predictions[0, -1, :]).item()
+		predicted_token = tokenizer.convert_ids_to_tokens([predicted_index])[0]
         '.</w>'
     """
     model = OpenAIGPTLMHeadModel.from_pretrained(*args, **kwargs)
@@ -161,25 +161,25 @@ def openAIGPTDoubleHeadsModel(*args, **kwargs):
 
 	Example:
         # Load the tokenizer
-        >>> import torch
-        >>> tokenizer = torch.hub.load('huggingface/pytorch-transformers', 'openAIGPTTokenizer', 'openai-gpt')
+        import torch
+        tokenizer = torch.hub.load('huggingface/pytorch-transformers', 'openAIGPTTokenizer', 'openai-gpt')
 
         #  Prepare tokenized input
-        >>> text1 = "Who was Jim Henson ? Jim Henson was a puppeteer"
-        >>> text2 = "Who was Jim Henson ? Jim Henson was a mysterious young man"
-        >>> tokenized_text1 = tokenizer.tokenize(text1)
-        >>> tokenized_text2 = tokenizer.tokenize(text2)
-        >>> indexed_tokens1 = tokenizer.convert_tokens_to_ids(tokenized_text1)
-        >>> indexed_tokens2 = tokenizer.convert_tokens_to_ids(tokenized_text2)
-        >>> tokens_tensor = torch.tensor([[indexed_tokens1, indexed_tokens2]])
-        >>> mc_token_ids = torch.LongTensor([[len(tokenized_text1)-1, len(tokenized_text2)-1]])
+        text1 = "Who was Jim Henson ? Jim Henson was a puppeteer"
+        text2 = "Who was Jim Henson ? Jim Henson was a mysterious young man"
+        tokenized_text1 = tokenizer.tokenize(text1)
+        tokenized_text2 = tokenizer.tokenize(text2)
+        indexed_tokens1 = tokenizer.convert_tokens_to_ids(tokenized_text1)
+        indexed_tokens2 = tokenizer.convert_tokens_to_ids(tokenized_text2)
+        tokens_tensor = torch.tensor([[indexed_tokens1, indexed_tokens2]])
+        mc_token_ids = torch.LongTensor([[len(tokenized_text1)-1, len(tokenized_text2)-1]])
 
         # Load openAIGPTDoubleHeadsModel
-        >>> model = torch.hub.load('huggingface/pytorch-transformers', 'openAIGPTDoubleHeadsModel', 'openai-gpt')
-        >>> model.eval()
+        model = torch.hub.load('huggingface/pytorch-transformers', 'openAIGPTDoubleHeadsModel', 'openai-gpt')
+        model.eval()
 
         # Predict hidden states features for each layer
-        >>> with torch.no_grad():
+        with torch.no_grad():
                 lm_logits, multiple_choice_logits = model(tokens_tensor, mc_token_ids)
     """
     model = OpenAIGPTDoubleHeadsModel.from_pretrained(*args, **kwargs)
diff --git a/hubconfs/transformer_xl_hubconf.py b/hubconfs/transformer_xl_hubconf.py
index cfcc6aef5a..548d407581 100644
--- a/hubconfs/transformer_xl_hubconf.py
+++ b/hubconfs/transformer_xl_hubconf.py
@@ -45,12 +45,12 @@ def transformerXLTokenizer(*args, **kwargs):
                                        * transfo-xl-wt103
 
     Example:
-        >>> import torch
-        >>> tokenizer = torch.hub.load('huggingface/pytorch-transformers', 'transformerXLTokenizer', 'transfo-xl-wt103')
+        import torch
+        tokenizer = torch.hub.load('huggingface/pytorch-transformers', 'transformerXLTokenizer', 'transfo-xl-wt103')
         
-        >>> text = "Who was Jim Henson ?"
-        >>> tokenized_text = tokenizer.tokenize(tokenized_text)
-        >>> indexed_tokens = tokenizer.convert_tokens_to_ids(tokenized_text)
+        text = "Who was Jim Henson ?"
+        tokenized_text = tokenizer.tokenize(tokenized_text)
+        indexed_tokens = tokenizer.convert_tokens_to_ids(tokenized_text)
     """
     tokenizer = TransfoXLTokenizer.from_pretrained(*args, **kwargs)
     return tokenizer
@@ -63,26 +63,26 @@ def transformerXLModel(*args, **kwargs):
 
     Example:
         # Load the tokenizer
-        >>> import torch
-        >>> tokenizer = torch.hub.load('huggingface/pytorch-transformers', 'transformerXLTokenizer', 'transfo-xl-wt103')
+        import torch
+        tokenizer = torch.hub.load('huggingface/pytorch-transformers', 'transformerXLTokenizer', 'transfo-xl-wt103')
 
         #  Prepare tokenized input
-        >>> text_1 = "Who was Jim Henson ?"
-        >>> text_2 = "Jim Henson was a puppeteer"
-        >>> tokenized_text_1 = tokenizer.tokenize(text_1)
-        >>> tokenized_text_2 = tokenizer.tokenize(text_2)
-        >>> indexed_tokens_1 = tokenizer.convert_tokens_to_ids(tokenized_text_1)
-        >>> indexed_tokens_2 = tokenizer.convert_tokens_to_ids(tokenized_text_2)
-        >>> tokens_tensor_1 = torch.tensor([indexed_tokens_1])
-        >>> tokens_tensor_2 = torch.tensor([indexed_tokens_2])
+        text_1 = "Who was Jim Henson ?"
+        text_2 = "Jim Henson was a puppeteer"
+        tokenized_text_1 = tokenizer.tokenize(text_1)
+        tokenized_text_2 = tokenizer.tokenize(text_2)
+        indexed_tokens_1 = tokenizer.convert_tokens_to_ids(tokenized_text_1)
+        indexed_tokens_2 = tokenizer.convert_tokens_to_ids(tokenized_text_2)
+        tokens_tensor_1 = torch.tensor([indexed_tokens_1])
+        tokens_tensor_2 = torch.tensor([indexed_tokens_2])
 
         # Load transformerXLModel
-        >>> model = torch.hub.load('huggingface/pytorch-transformers', 'transformerXLModel', 'transfo-xl-wt103')
-        >>> model.eval()
+        model = torch.hub.load('huggingface/pytorch-transformers', 'transformerXLModel', 'transfo-xl-wt103')
+        model.eval()
 
         # Predict hidden states features for each layer
         # We can re-use the memory cells in a subsequent call to attend a longer context
-        >>> with torch.no_grad():
+        with torch.no_grad():
                 hidden_states_1, mems_1 = model(tokens_tensor_1)
                 hidden_states_2, mems_2 = model(tokens_tensor_2, mems=mems_1)
     """
@@ -98,33 +98,33 @@ def transformerXLLMHeadModel(*args, **kwargs):
 
     Example:
         # Load the tokenizer
-        >>> import torch
-        >>> tokenizer = torch.hub.load('huggingface/pytorch-transformers', 'transformerXLTokenizer', 'transfo-xl-wt103')
+        import torch
+        tokenizer = torch.hub.load('huggingface/pytorch-transformers', 'transformerXLTokenizer', 'transfo-xl-wt103')
 
         #  Prepare tokenized input
-        >>> text_1 = "Who was Jim Henson ?"
-        >>> text_2 = "Jim Henson was a puppeteer"
-        >>> tokenized_text_1 = tokenizer.tokenize(text_1)
-        >>> tokenized_text_2 = tokenizer.tokenize(text_2)
-        >>> indexed_tokens_1 = tokenizer.convert_tokens_to_ids(tokenized_text_1)
-        >>> indexed_tokens_2 = tokenizer.convert_tokens_to_ids(tokenized_text_2)
-        >>> tokens_tensor_1 = torch.tensor([indexed_tokens_1])
-        >>> tokens_tensor_2 = torch.tensor([indexed_tokens_2])
+        text_1 = "Who was Jim Henson ?"
+        text_2 = "Jim Henson was a puppeteer"
+        tokenized_text_1 = tokenizer.tokenize(text_1)
+        tokenized_text_2 = tokenizer.tokenize(text_2)
+        indexed_tokens_1 = tokenizer.convert_tokens_to_ids(tokenized_text_1)
+        indexed_tokens_2 = tokenizer.convert_tokens_to_ids(tokenized_text_2)
+        tokens_tensor_1 = torch.tensor([indexed_tokens_1])
+        tokens_tensor_2 = torch.tensor([indexed_tokens_2])
 
         # Load transformerXLLMHeadModel
-        >>> model = torch.hub.load('huggingface/pytorch-transformers', 'transformerXLLMHeadModel', 'transfo-xl-wt103')
-        >>> model.eval()
+        model = torch.hub.load('huggingface/pytorch-transformers', 'transformerXLLMHeadModel', 'transfo-xl-wt103')
+        model.eval()
 
         # Predict hidden states features for each layer
         # We can re-use the memory cells in a subsequent call to attend a longer context
-        >>> with torch.no_grad():
+        with torch.no_grad():
                 predictions_1, mems_1 = model(tokens_tensor_1)
                 predictions_2, mems_2 = model(tokens_tensor_2, mems=mems_1)
 
         # Get the predicted last token
-        >>> predicted_index = torch.argmax(predictions_2[0, -1, :]).item()
-        >>> predicted_token = tokenizer.convert_ids_to_tokens([predicted_index])[0]
-        >>> assert predicted_token == 'who'
+        predicted_index = torch.argmax(predictions_2[0, -1, :]).item()
+        predicted_token = tokenizer.convert_ids_to_tokens([predicted_index])[0]
+        assert predicted_token == 'who'
     """
     model = TransfoXLLMHeadModel.from_pretrained(*args, **kwargs)
     return model
diff --git a/hubconfs/xlm_hubconf.py b/hubconfs/xlm_hubconf.py
index 4f6fd93c24..e96d923944 100644
--- a/hubconfs/xlm_hubconf.py
+++ b/hubconfs/xlm_hubconf.py
@@ -17,16 +17,16 @@ xlm_start_docstring = """
 
     Example:
         # Load the tokenizer
-        >>> import torch
-        >>> tokenizer = torch.hub.load('huggingface/pytorch-transformers', 'xlmTokenizer', 'xlm-mlm-en-2048')
+        import torch
+        tokenizer = torch.hub.load('huggingface/pytorch-transformers', 'xlmTokenizer', 'xlm-mlm-en-2048')
 
         #  Prepare tokenized input
-        >>> text_1 = "Who was Jim Henson ?"
-        >>> text_2 = "Jim Henson was a puppeteer"
-        >>> indexed_tokens_1 = tokenizer.encode(text_1)
-        >>> indexed_tokens_2 = tokenizer.encode(text_2)
-        >>> tokens_tensor_1 = torch.tensor([indexed_tokens_1])
-        >>> tokens_tensor_2 = torch.tensor([indexed_tokens_2])
+        text_1 = "Who was Jim Henson ?"
+        text_2 = "Jim Henson was a puppeteer"
+        indexed_tokens_1 = tokenizer.encode(text_1)
+        indexed_tokens_2 = tokenizer.encode(text_2)
+        tokens_tensor_1 = torch.tensor([indexed_tokens_1])
+        tokens_tensor_2 = torch.tensor([indexed_tokens_2])
 """
 
 # A lot of models share the same param doc. Use a decorator
@@ -76,11 +76,11 @@ def xlmTokenizer(*args, **kwargs):
              Default: None
 
     Example:
-        >>> import torch
-        >>> tokenizer = torch.hub.load('huggingface/pytorch-transformers', 'xlmTokenizer', 'xlm-mlm-en-2048')
+        import torch
+        tokenizer = torch.hub.load('huggingface/pytorch-transformers', 'xlmTokenizer', 'xlm-mlm-en-2048')
 
-        >>> text = "Who was Jim Henson ?"
-        >>> indexed_tokens = tokenizer.encode(tokenized_text)
+        text = "Who was Jim Henson ?"
+        indexed_tokens = tokenizer.encode(tokenized_text)
     """
     tokenizer = XLMTokenizer.from_pretrained(*args, **kwargs)
     return tokenizer
@@ -91,11 +91,11 @@ def xlmTokenizer(*args, **kwargs):
 def xlmModel(*args, **kwargs):
     """
         # Load xlmModel
-        >>> model = torch.hub.load('huggingface/pytorch-transformers', 'xlmModel', 'xlm-mlm-en-2048')
-        >>> model.eval()
+        model = torch.hub.load('huggingface/pytorch-transformers', 'xlmModel', 'xlm-mlm-en-2048')
+        model.eval()
 
         # Predict hidden states features for each layer
-        >>> with torch.no_grad():
+        with torch.no_grad():
                 hidden_states_1, mems = model(tokens_tensor_1)
                 hidden_states_2, mems = model(tokens_tensor_2, past=mems)
     """
@@ -108,26 +108,26 @@ def xlmModel(*args, **kwargs):
 def xlmLMHeadModel(*args, **kwargs):
     """
         #  Prepare tokenized input
-        >>> text_1 = "Who was Jim Henson ?"
-        >>> text_2 = "Jim Henson was a puppeteer"
-        >>> indexed_tokens_1 = tokenizer.encode(text_1)
-        >>> indexed_tokens_2 = tokenizer.encode(text_2)
-        >>> tokens_tensor_1 = torch.tensor([indexed_tokens_1])
-        >>> tokens_tensor_2 = torch.tensor([indexed_tokens_2])
+        text_1 = "Who was Jim Henson ?"
+        text_2 = "Jim Henson was a puppeteer"
+        indexed_tokens_1 = tokenizer.encode(text_1)
+        indexed_tokens_2 = tokenizer.encode(text_2)
+        tokens_tensor_1 = torch.tensor([indexed_tokens_1])
+        tokens_tensor_2 = torch.tensor([indexed_tokens_2])
 
         # Load xlnetLMHeadModel
-        >>> model = torch.hub.load('huggingface/pytorch-transformers', 'xlnetLMHeadModel', 'xlm-mlm-en-2048')
-        >>> model.eval()
+        model = torch.hub.load('huggingface/pytorch-transformers', 'xlnetLMHeadModel', 'xlm-mlm-en-2048')
+        model.eval()
 
         # Predict hidden states features for each layer
-        >>> with torch.no_grad():
+        with torch.no_grad():
                 predictions_1, mems = model(tokens_tensor_1)
                 predictions_2, mems = model(tokens_tensor_2, mems=mems)
 
         # Get the predicted last token
-        >>> predicted_index = torch.argmax(predictions_2[0, -1, :]).item()
-        >>> predicted_token = tokenizer.decode([predicted_index])
-        >>> assert predicted_token == ' who'
+        predicted_index = torch.argmax(predictions_2[0, -1, :]).item()
+        predicted_token = tokenizer.decode([predicted_index])
+        assert predicted_token == ' who'
     """
     model = XLMWithLMHeadModel.from_pretrained(*args, **kwargs)
     return model
@@ -142,25 +142,25 @@ def xlmLMHeadModel(*args, **kwargs):
 
 #     Example:
 #         # Load the tokenizer
-#         >>> import torch
-#         >>> tokenizer = torch.hub.load('huggingface/pytorch-transformers', 'xlnetTokenizer', 'xlm-mlm-en-2048')
+#         import torch
+#         tokenizer = torch.hub.load('huggingface/pytorch-transformers', 'xlnetTokenizer', 'xlm-mlm-en-2048')
 
 #         #  Prepare tokenized input
-#         >>> text1 = "Who was Jim Henson ? Jim Henson was a puppeteer"
-#         >>> text2 = "Who was Jim Henson ? Jim Henson was a mysterious young man"
-#         >>> tokenized_text1 = tokenizer.tokenize(text1)
-#         >>> tokenized_text2 = tokenizer.tokenize(text2)
-#         >>> indexed_tokens1 = tokenizer.convert_tokens_to_ids(tokenized_text1)
-#         >>> indexed_tokens2 = tokenizer.convert_tokens_to_ids(tokenized_text2)
-#         >>> tokens_tensor = torch.tensor([[indexed_tokens1, indexed_tokens2]])
-#         >>> mc_token_ids = torch.LongTensor([[len(tokenized_text1)-1, len(tokenized_text2)-1]])
+#         text1 = "Who was Jim Henson ? Jim Henson was a puppeteer"
+#         text2 = "Who was Jim Henson ? Jim Henson was a mysterious young man"
+#         tokenized_text1 = tokenizer.tokenize(text1)
+#         tokenized_text2 = tokenizer.tokenize(text2)
+#         indexed_tokens1 = tokenizer.convert_tokens_to_ids(tokenized_text1)
+#         indexed_tokens2 = tokenizer.convert_tokens_to_ids(tokenized_text2)
+#         tokens_tensor = torch.tensor([[indexed_tokens1, indexed_tokens2]])
+#         mc_token_ids = torch.LongTensor([[len(tokenized_text1)-1, len(tokenized_text2)-1]])
 
 #         # Load xlnetForSequenceClassification
-#         >>> model = torch.hub.load('huggingface/pytorch-transformers', 'xlnetForSequenceClassification', 'xlm-mlm-en-2048')
-#         >>> model.eval()
+#         model = torch.hub.load('huggingface/pytorch-transformers', 'xlnetForSequenceClassification', 'xlm-mlm-en-2048')
+#         model.eval()
 
 #         # Predict sequence classes logits
-#         >>> with torch.no_grad():
+#         with torch.no_grad():
 #                 lm_logits, mems = model(tokens_tensor)
 #     """
 #     model = XLNetForSequenceClassification.from_pretrained(*args, **kwargs)
diff --git a/hubconfs/xlnet_hubconf.1.py b/hubconfs/xlnet_hubconf.1.py
index 4c5105a241..fa7b7ddb9f 100644
--- a/hubconfs/xlnet_hubconf.1.py
+++ b/hubconfs/xlnet_hubconf.1.py
@@ -53,11 +53,11 @@ def xlnetTokenizer(*args, **kwargs):
              Default: None
 
     Example:
-        >>> import torch
-        >>> tokenizer = torch.hub.load('huggingface/pytorch-transformers', 'xlnetTokenizer', 'xlnet-large-cased')
+        import torch
+        tokenizer = torch.hub.load('huggingface/pytorch-transformers', 'xlnetTokenizer', 'xlnet-large-cased')
 
-        >>> text = "Who was Jim Henson ?"
-        >>> indexed_tokens = tokenizer.encode(tokenized_text)
+        text = "Who was Jim Henson ?"
+        indexed_tokens = tokenizer.encode(tokenized_text)
     """
     tokenizer = XLNetTokenizer.from_pretrained(*args, **kwargs)
     return tokenizer
@@ -72,23 +72,23 @@ def xlnetModel(*args, **kwargs):
 
     Example:
         # Load the tokenizer
-        >>> import torch
-        >>> tokenizer = torch.hub.load('huggingface/pytorch-transformers', 'xlnetTokenizer', 'xlnet-large-cased')
+        import torch
+        tokenizer = torch.hub.load('huggingface/pytorch-transformers', 'xlnetTokenizer', 'xlnet-large-cased')
 
         #  Prepare tokenized input
-        >>> text_1 = "Who was Jim Henson ?"
-        >>> text_2 = "Jim Henson was a puppeteer"
-        >>> indexed_tokens_1 = tokenizer.encode(text_1)
-        >>> indexed_tokens_2 = tokenizer.encode(text_2)
-        >>> tokens_tensor_1 = torch.tensor([indexed_tokens_1])
-        >>> tokens_tensor_2 = torch.tensor([indexed_tokens_2])
+        text_1 = "Who was Jim Henson ?"
+        text_2 = "Jim Henson was a puppeteer"
+        indexed_tokens_1 = tokenizer.encode(text_1)
+        indexed_tokens_2 = tokenizer.encode(text_2)
+        tokens_tensor_1 = torch.tensor([indexed_tokens_1])
+        tokens_tensor_2 = torch.tensor([indexed_tokens_2])
 
         # Load xlnetModel
-        >>> model = torch.hub.load('huggingface/pytorch-transformers', 'xlnetModel', 'xlnet-large-cased')
-        >>> model.eval()
+        model = torch.hub.load('huggingface/pytorch-transformers', 'xlnetModel', 'xlnet-large-cased')
+        model.eval()
 
         # Predict hidden states features for each layer
-        >>> with torch.no_grad():
+        with torch.no_grad():
                 hidden_states_1, mems = model(tokens_tensor_1)
                 hidden_states_2, mems = model(tokens_tensor_2, past=mems)
     """
@@ -106,30 +106,30 @@ def xlnetLMHeadModel(*args, **kwargs):
 
     Example:
         # Load the tokenizer
-        >>> import torch
-        >>> tokenizer = torch.hub.load('huggingface/pytorch-transformers', 'xlnetTokenizer', 'xlnet-large-cased')
+        import torch
+        tokenizer = torch.hub.load('huggingface/pytorch-transformers', 'xlnetTokenizer', 'xlnet-large-cased')
 
         #  Prepare tokenized input
-        >>> text_1 = "Who was Jim Henson ?"
-        >>> text_2 = "Jim Henson was a puppeteer"
-        >>> indexed_tokens_1 = tokenizer.encode(text_1)
-        >>> indexed_tokens_2 = tokenizer.encode(text_2)
-        >>> tokens_tensor_1 = torch.tensor([indexed_tokens_1])
-        >>> tokens_tensor_2 = torch.tensor([indexed_tokens_2])
+        text_1 = "Who was Jim Henson ?"
+        text_2 = "Jim Henson was a puppeteer"
+        indexed_tokens_1 = tokenizer.encode(text_1)
+        indexed_tokens_2 = tokenizer.encode(text_2)
+        tokens_tensor_1 = torch.tensor([indexed_tokens_1])
+        tokens_tensor_2 = torch.tensor([indexed_tokens_2])
 
         # Load xlnetLMHeadModel
-        >>> model = torch.hub.load('huggingface/pytorch-transformers', 'xlnetLMHeadModel', 'xlnet-large-cased')
-        >>> model.eval()
+        model = torch.hub.load('huggingface/pytorch-transformers', 'xlnetLMHeadModel', 'xlnet-large-cased')
+        model.eval()
 
         # Predict hidden states features for each layer
-        >>> with torch.no_grad():
+        with torch.no_grad():
                 predictions_1, mems = model(tokens_tensor_1)
                 predictions_2, mems = model(tokens_tensor_2, mems=mems)
 
         # Get the predicted last token
-        >>> predicted_index = torch.argmax(predictions_2[0, -1, :]).item()
-        >>> predicted_token = tokenizer.decode([predicted_index])
-        >>> assert predicted_token == ' who'
+        predicted_index = torch.argmax(predictions_2[0, -1, :]).item()
+        predicted_token = tokenizer.decode([predicted_index])
+        assert predicted_token == ' who'
     """
     model = XLNetLMHeadModel.from_pretrained(*args, **kwargs)
     return model
@@ -144,25 +144,25 @@ def xlnetLMHeadModel(*args, **kwargs):
 
 #     Example:
 #         # Load the tokenizer
-#         >>> import torch
-#         >>> tokenizer = torch.hub.load('huggingface/pytorch-transformers', 'xlnetTokenizer', 'xlnet-large-cased')
+#         import torch
+#         tokenizer = torch.hub.load('huggingface/pytorch-transformers', 'xlnetTokenizer', 'xlnet-large-cased')
 
 #         #  Prepare tokenized input
-#         >>> text1 = "Who was Jim Henson ? Jim Henson was a puppeteer"
-#         >>> text2 = "Who was Jim Henson ? Jim Henson was a mysterious young man"
-#         >>> tokenized_text1 = tokenizer.tokenize(text1)
-#         >>> tokenized_text2 = tokenizer.tokenize(text2)
-#         >>> indexed_tokens1 = tokenizer.convert_tokens_to_ids(tokenized_text1)
-#         >>> indexed_tokens2 = tokenizer.convert_tokens_to_ids(tokenized_text2)
-#         >>> tokens_tensor = torch.tensor([[indexed_tokens1, indexed_tokens2]])
-#         >>> mc_token_ids = torch.LongTensor([[len(tokenized_text1)-1, len(tokenized_text2)-1]])
+#         text1 = "Who was Jim Henson ? Jim Henson was a puppeteer"
+#         text2 = "Who was Jim Henson ? Jim Henson was a mysterious young man"
+#         tokenized_text1 = tokenizer.tokenize(text1)
+#         tokenized_text2 = tokenizer.tokenize(text2)
+#         indexed_tokens1 = tokenizer.convert_tokens_to_ids(tokenized_text1)
+#         indexed_tokens2 = tokenizer.convert_tokens_to_ids(tokenized_text2)
+#         tokens_tensor = torch.tensor([[indexed_tokens1, indexed_tokens2]])
+#         mc_token_ids = torch.LongTensor([[len(tokenized_text1)-1, len(tokenized_text2)-1]])
 
 #         # Load xlnetForSequenceClassification
-#         >>> model = torch.hub.load('huggingface/pytorch-transformers', 'xlnetForSequenceClassification', 'xlnet-large-cased')
-#         >>> model.eval()
+#         model = torch.hub.load('huggingface/pytorch-transformers', 'xlnetForSequenceClassification', 'xlnet-large-cased')
+#         model.eval()
 
 #         # Predict sequence classes logits
-#         >>> with torch.no_grad():
+#         with torch.no_grad():
 #                 lm_logits, mems = model(tokens_tensor)
 #     """
 #     model = XLNetForSequenceClassification.from_pretrained(*args, **kwargs)
diff --git a/pytorch_transformers/modeling_auto.py b/pytorch_transformers/modeling_auto.py
index aa50b1526d..3e28fbd0a9 100644
--- a/pytorch_transformers/modeling_auto.py
+++ b/pytorch_transformers/modeling_auto.py
@@ -89,15 +89,15 @@ class AutoConfig(object):
 
         Examples::
 
-            >>> config = AutoConfig.from_pretrained('bert-base-uncased')    # Download configuration from S3 and cache.
-            >>> config = AutoConfig.from_pretrained('./test/bert_saved_model/')  # E.g. config (or model) was saved using `save_pretrained('./test/saved_model/')`
-            >>> config = AutoConfig.from_pretrained('./test/bert_saved_model/my_configuration.json')
-            >>> config = AutoConfig.from_pretrained('bert-base-uncased', output_attention=True, foo=False)
-            >>> assert config.output_attention == True
-            >>> config, unused_kwargs = AutoConfig.from_pretrained('bert-base-uncased', output_attention=True,
-            >>>                                                    foo=False, return_unused_kwargs=True)
-            >>> assert config.output_attention == True
-            >>> assert unused_kwargs == {'foo': False}
+            config = AutoConfig.from_pretrained('bert-base-uncased')    # Download configuration from S3 and cache.
+            config = AutoConfig.from_pretrained('./test/bert_saved_model/')  # E.g. config (or model) was saved using `save_pretrained('./test/saved_model/')`
+            config = AutoConfig.from_pretrained('./test/bert_saved_model/my_configuration.json')
+            config = AutoConfig.from_pretrained('bert-base-uncased', output_attention=True, foo=False)
+            assert config.output_attention == True
+            config, unused_kwargs = AutoConfig.from_pretrained('bert-base-uncased', output_attention=True,
+                                                               foo=False, return_unused_kwargs=True)
+            assert config.output_attention == True
+            assert unused_kwargs == {'foo': False}
 
         """
         if 'bert' in pretrained_model_name_or_path:
@@ -202,13 +202,13 @@ class AutoModel(object):
 
         Examples::
 
-            >>> model = AutoModel.from_pretrained('bert-base-uncased')    # Download model and configuration from S3 and cache.
-            >>> model = AutoModel.from_pretrained('./test/bert_model/')  # E.g. model was saved using `save_pretrained('./test/saved_model/')`
-            >>> model = AutoModel.from_pretrained('bert-base-uncased', output_attention=True)  # Update configuration during loading
-            >>> assert model.config.output_attention == True
-            >>> # Loading from a TF checkpoint file instead of a PyTorch model (slower)
-            >>> config = AutoConfig.from_json_file('./tf_model/bert_tf_model_config.json')
-            >>> model = AutoModel.from_pretrained('./tf_model/bert_tf_checkpoint.ckpt.index', from_tf=True, config=config)
+            model = AutoModel.from_pretrained('bert-base-uncased')    # Download model and configuration from S3 and cache.
+            model = AutoModel.from_pretrained('./test/bert_model/')  # E.g. model was saved using `save_pretrained('./test/saved_model/')`
+            model = AutoModel.from_pretrained('bert-base-uncased', output_attention=True)  # Update configuration during loading
+            assert model.config.output_attention == True
+            # Loading from a TF checkpoint file instead of a PyTorch model (slower)
+            config = AutoConfig.from_json_file('./tf_model/bert_tf_model_config.json')
+            model = AutoModel.from_pretrained('./tf_model/bert_tf_checkpoint.ckpt.index', from_tf=True, config=config)
 
         """
         if 'bert' in pretrained_model_name_or_path:
diff --git a/pytorch_transformers/modeling_bert.py b/pytorch_transformers/modeling_bert.py
index b59445513a..3f2e7cbda1 100644
--- a/pytorch_transformers/modeling_bert.py
+++ b/pytorch_transformers/modeling_bert.py
@@ -643,12 +643,12 @@ class BertModel(BertPreTrainedModel):
 
     Examples::
 
-        >>> config = BertConfig.from_pretrained('bert-base-uncased')
-        >>> tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
-        >>> model = BertModel(config)
-        >>> input_ids = torch.tensor(tokenizer.encode("Hello, my dog is cute")).unsqueeze(0)  # Batch size 1
-        >>> outputs = model(input_ids)
-        >>> last_hidden_states = outputs[0]  # The last hidden-state is the first element of the output tuple
+        config = BertConfig.from_pretrained('bert-base-uncased')
+        tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
+        model = BertModel(config)
+        input_ids = torch.tensor(tokenizer.encode("Hello, my dog is cute")).unsqueeze(0)  # Batch size 1
+        outputs = model(input_ids)
+        last_hidden_states = outputs[0]  # The last hidden-state is the first element of the output tuple
 
     """
     def __init__(self, config):
@@ -754,13 +754,13 @@ class BertForPreTraining(BertPreTrainedModel):
 
     Examples::
 
-        >>> config = BertConfig.from_pretrained('bert-base-uncased')
-        >>> tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
-        >>> 
-        >>> model = BertForPreTraining(config)
-        >>> input_ids = torch.tensor(tokenizer.encode("Hello, my dog is cute")).unsqueeze(0)  # Batch size 1
-        >>> outputs = model(input_ids)
-        >>> prediction_scores, seq_relationship_scores = outputs[:2]
+        config = BertConfig.from_pretrained('bert-base-uncased')
+        tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
+        
+        model = BertForPreTraining(config)
+        input_ids = torch.tensor(tokenizer.encode("Hello, my dog is cute")).unsqueeze(0)  # Batch size 1
+        outputs = model(input_ids)
+        prediction_scores, seq_relationship_scores = outputs[:2]
 
     """
     def __init__(self, config):
@@ -824,13 +824,13 @@ class BertForMaskedLM(BertPreTrainedModel):
 
     Examples::
 
-        >>> config = BertConfig.from_pretrained('bert-base-uncased')
-        >>> tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
-        >>> 
-        >>> model = BertForMaskedLM(config)
-        >>> input_ids = torch.tensor(tokenizer.encode("Hello, my dog is cute")).unsqueeze(0)  # Batch size 1
-        >>> outputs = model(input_ids, masked_lm_labels=input_ids)
-        >>> loss, prediction_scores = outputs[:2]
+        config = BertConfig.from_pretrained('bert-base-uncased')
+        tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
+        
+        model = BertForMaskedLM(config)
+        input_ids = torch.tensor(tokenizer.encode("Hello, my dog is cute")).unsqueeze(0)  # Batch size 1
+        outputs = model(input_ids, masked_lm_labels=input_ids)
+        loss, prediction_scores = outputs[:2]
 
     """
     def __init__(self, config):
@@ -891,13 +891,13 @@ class BertForNextSentencePrediction(BertPreTrainedModel):
 
     Examples::
 
-        >>> config = BertConfig.from_pretrained('bert-base-uncased')
-        >>> tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
-        >>> 
-        >>> model = BertForNextSentencePrediction(config)
-        >>> input_ids = torch.tensor(tokenizer.encode("Hello, my dog is cute")).unsqueeze(0)  # Batch size 1
-        >>> outputs = model(input_ids)
-        >>> seq_relationship_scores = outputs[0]
+        config = BertConfig.from_pretrained('bert-base-uncased')
+        tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
+        
+        model = BertForNextSentencePrediction(config)
+        input_ids = torch.tensor(tokenizer.encode("Hello, my dog is cute")).unsqueeze(0)  # Batch size 1
+        outputs = model(input_ids)
+        seq_relationship_scores = outputs[0]
 
     """
     def __init__(self, config):
@@ -951,14 +951,14 @@ class BertForSequenceClassification(BertPreTrainedModel):
 
     Examples::
 
-        >>> config = BertConfig.from_pretrained('bert-base-uncased')
-        >>> tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
-        >>> 
-        >>> model = BertForSequenceClassification(config)
-        >>> input_ids = torch.tensor(tokenizer.encode("Hello, my dog is cute")).unsqueeze(0)  # Batch size 1
-        >>> labels = torch.tensor([1]).unsqueeze(0)  # Batch size 1
-        >>> outputs = model(input_ids, labels=labels)
-        >>> loss, logits = outputs[:2]
+        config = BertConfig.from_pretrained('bert-base-uncased')
+        tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
+        
+        model = BertForSequenceClassification(config)
+        input_ids = torch.tensor(tokenizer.encode("Hello, my dog is cute")).unsqueeze(0)  # Batch size 1
+        labels = torch.tensor([1]).unsqueeze(0)  # Batch size 1
+        outputs = model(input_ids, labels=labels)
+        loss, logits = outputs[:2]
 
     """
     def __init__(self, config):
@@ -1057,15 +1057,15 @@ class BertForMultipleChoice(BertPreTrainedModel):
 
     Examples::
 
-        >>> config = BertConfig.from_pretrained('bert-base-uncased')
-        >>> tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
-        >>> 
-        >>> model = BertForMultipleChoice(config)
-        >>> choices = ["Hello, my dog is cute", "Hello, my cat is amazing"]
-        >>> input_ids = torch.tensor([tokenizer.encode(s) for s in choices]).unsqueeze(0)  # Batch size 1, 2 choices
-        >>> labels = torch.tensor(1).unsqueeze(0)  # Batch size 1
-        >>> outputs = model(input_ids, labels=labels)
-        >>> loss, classification_scores = outputs[:2]
+        config = BertConfig.from_pretrained('bert-base-uncased')
+        tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
+        
+        model = BertForMultipleChoice(config)
+        choices = ["Hello, my dog is cute", "Hello, my cat is amazing"]
+        input_ids = torch.tensor([tokenizer.encode(s) for s in choices]).unsqueeze(0)  # Batch size 1, 2 choices
+        labels = torch.tensor(1).unsqueeze(0)  # Batch size 1
+        outputs = model(input_ids, labels=labels)
+        loss, classification_scores = outputs[:2]
 
     """
     def __init__(self, config):
@@ -1127,14 +1127,14 @@ class BertForTokenClassification(BertPreTrainedModel):
 
     Examples::
 
-        >>> config = BertConfig.from_pretrained('bert-base-uncased')
-        >>> tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
-        >>> 
-        >>> model = BertForTokenClassification(config)
-        >>> input_ids = torch.tensor(tokenizer.encode("Hello, my dog is cute")).unsqueeze(0)  # Batch size 1
-        >>> labels = torch.tensor([1] * input_ids.size(1)).unsqueeze(0)  # Batch size 1
-        >>> outputs = model(input_ids, labels=labels)
-        >>> loss, scores = outputs[:2]
+        config = BertConfig.from_pretrained('bert-base-uncased')
+        tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
+        
+        model = BertForTokenClassification(config)
+        input_ids = torch.tensor(tokenizer.encode("Hello, my dog is cute")).unsqueeze(0)  # Batch size 1
+        labels = torch.tensor([1] * input_ids.size(1)).unsqueeze(0)  # Batch size 1
+        outputs = model(input_ids, labels=labels)
+        loss, scores = outputs[:2]
 
     """
     def __init__(self, config):
@@ -1203,15 +1203,15 @@ class BertForQuestionAnswering(BertPreTrainedModel):
 
     Examples::
 
-        >>> config = BertConfig.from_pretrained('bert-base-uncased')
-        >>> tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
-        >>> 
-        >>> model = BertForQuestionAnswering(config)
-        >>> input_ids = torch.tensor(tokenizer.encode("Hello, my dog is cute")).unsqueeze(0)  # Batch size 1
-        >>> start_positions = torch.tensor([1])
-        >>> end_positions = torch.tensor([3])
-        >>> outputs = model(input_ids, start_positions=start_positions, end_positions=end_positions)
-        >>> loss, start_scores, end_scores = outputs[:2]
+        config = BertConfig.from_pretrained('bert-base-uncased')
+        tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
+        
+        model = BertForQuestionAnswering(config)
+        input_ids = torch.tensor(tokenizer.encode("Hello, my dog is cute")).unsqueeze(0)  # Batch size 1
+        start_positions = torch.tensor([1])
+        end_positions = torch.tensor([3])
+        outputs = model(input_ids, start_positions=start_positions, end_positions=end_positions)
+        loss, start_scores, end_scores = outputs[:2]
 
     """
     def __init__(self, config):
diff --git a/pytorch_transformers/modeling_gpt2.py b/pytorch_transformers/modeling_gpt2.py
index b8a459db7d..4341f0d8a1 100644
--- a/pytorch_transformers/modeling_gpt2.py
+++ b/pytorch_transformers/modeling_gpt2.py
@@ -433,12 +433,12 @@ class GPT2Model(GPT2PreTrainedModel):
 
     Examples::
 
-        >>> config = GPT2Config.from_pretrained('gpt2')
-        >>> tokenizer = GPT2Tokenizer.from_pretrained('gpt2')
-        >>> model = GPT2Model(config)
-        >>> input_ids = torch.tensor(tokenizer.encode("Hello, my dog is cute")).unsqueeze(0)  # Batch size 1
-        >>> outputs = model(input_ids)
-        >>> last_hidden_states = outputs[0]  # The last hidden-state is the first element of the output tuple
+        config = GPT2Config.from_pretrained('gpt2')
+        tokenizer = GPT2Tokenizer.from_pretrained('gpt2')
+        model = GPT2Model(config)
+        input_ids = torch.tensor(tokenizer.encode("Hello, my dog is cute")).unsqueeze(0)  # Batch size 1
+        outputs = model(input_ids)
+        last_hidden_states = outputs[0]  # The last hidden-state is the first element of the output tuple
 
     """
     def __init__(self, config):
@@ -567,12 +567,12 @@ class GPT2LMHeadModel(GPT2PreTrainedModel):
 
     Examples::
 
-        >>> config = GPT2Config.from_pretrained('gpt2')
-        >>> tokenizer = GPT2Tokenizer.from_pretrained('gpt2')
-        >>> model = GPT2LMHeadModel(config)
-        >>> input_ids = torch.tensor(tokenizer.encode("Hello, my dog is cute")).unsqueeze(0)  # Batch size 1
-        >>> outputs = model(input_ids, labels=input_ids)
-        >>> loss, logits = outputs[:2]
+        config = GPT2Config.from_pretrained('gpt2')
+        tokenizer = GPT2Tokenizer.from_pretrained('gpt2')
+        model = GPT2LMHeadModel(config)
+        input_ids = torch.tensor(tokenizer.encode("Hello, my dog is cute")).unsqueeze(0)  # Batch size 1
+        outputs = model(input_ids, labels=input_ids)
+        loss, logits = outputs[:2]
 
     """
     def __init__(self, config):
@@ -683,14 +683,14 @@ class GPT2DoubleHeadsModel(GPT2PreTrainedModel):
 
     Examples::
 
-        >>> config = GPT2Config.from_pretrained('gpt2')
-        >>> tokenizer = GPT2Tokenizer.from_pretrained('gpt2')
-        >>> model = GPT2DoubleHeadsModel(config)
-        >>> choices = ["Hello, my dog is cute [CLS]", "Hello, my cat is cute [CLS]"]  # Assume you've added [CLS] to the vocabulary
-        >>> input_ids = torch.tensor([tokenizer.encode(s) for s in choices]).unsqueeze(0)  # Batch size 1, 2 choices
-        >>> mc_token_ids = torch.tensor([-1, -1]).unsqueeze(0)  # Batch size 1
-        >>> outputs = model(input_ids, mc_token_ids)
-        >>> lm_prediction_scores, mc_prediction_scores = outputs[:2]
+        config = GPT2Config.from_pretrained('gpt2')
+        tokenizer = GPT2Tokenizer.from_pretrained('gpt2')
+        model = GPT2DoubleHeadsModel(config)
+        choices = ["Hello, my dog is cute [CLS]", "Hello, my cat is cute [CLS]"]  # Assume you've added [CLS] to the vocabulary
+        input_ids = torch.tensor([tokenizer.encode(s) for s in choices]).unsqueeze(0)  # Batch size 1, 2 choices
+        mc_token_ids = torch.tensor([-1, -1]).unsqueeze(0)  # Batch size 1
+        outputs = model(input_ids, mc_token_ids)
+        lm_prediction_scores, mc_prediction_scores = outputs[:2]
 
     """
     def __init__(self, config):
diff --git a/pytorch_transformers/modeling_openai.py b/pytorch_transformers/modeling_openai.py
index 4ea19a965d..a6cb6212ef 100644
--- a/pytorch_transformers/modeling_openai.py
+++ b/pytorch_transformers/modeling_openai.py
@@ -439,12 +439,12 @@ class OpenAIGPTModel(OpenAIGPTPreTrainedModel):
 
     Examples::
 
-        >>> config = OpenAIGPTConfig.from_pretrained('openai-gpt')
-        >>> tokenizer = OpenAIGPTTokenizer.from_pretrained('openai-gpt')
-        >>> model = OpenAIGPTModel(config)
-        >>> input_ids = torch.tensor(tokenizer.encode("Hello, my dog is cute")).unsqueeze(0)  # Batch size 1
-        >>> outputs = model(input_ids)
-        >>> last_hidden_states = outputs[0]  # The last hidden-state is the first element of the output tuple
+        config = OpenAIGPTConfig.from_pretrained('openai-gpt')
+        tokenizer = OpenAIGPTTokenizer.from_pretrained('openai-gpt')
+        model = OpenAIGPTModel(config)
+        input_ids = torch.tensor(tokenizer.encode("Hello, my dog is cute")).unsqueeze(0)  # Batch size 1
+        outputs = model(input_ids)
+        last_hidden_states = outputs[0]  # The last hidden-state is the first element of the output tuple
 
     """
     def __init__(self, config):
@@ -558,12 +558,12 @@ class OpenAIGPTLMHeadModel(OpenAIGPTPreTrainedModel):
 
     Examples::
 
-        >>> config = OpenAIGPTConfig.from_pretrained('openai-gpt')
-        >>> tokenizer = OpenAIGPTTokenizer.from_pretrained('openai-gpt')
-        >>> model = OpenAIGPTLMHeadModel(config)
-        >>> input_ids = torch.tensor(tokenizer.encode("Hello, my dog is cute")).unsqueeze(0)  # Batch size 1
-        >>> outputs = model(input_ids, labels=input_ids)
-        >>> loss, logits = outputs[:2]
+        config = OpenAIGPTConfig.from_pretrained('openai-gpt')
+        tokenizer = OpenAIGPTTokenizer.from_pretrained('openai-gpt')
+        model = OpenAIGPTLMHeadModel(config)
+        input_ids = torch.tensor(tokenizer.encode("Hello, my dog is cute")).unsqueeze(0)  # Batch size 1
+        outputs = model(input_ids, labels=input_ids)
+        loss, logits = outputs[:2]
 
     """
     def __init__(self, config):
@@ -665,14 +665,14 @@ class OpenAIGPTDoubleHeadsModel(OpenAIGPTPreTrainedModel):
 
     Examples::
 
-        >>> config = OpenAIGPTConfig.from_pretrained('openai-gpt')
-        >>> tokenizer = OpenAIGPTTokenizer.from_pretrained('openai-gpt')
-        >>> model = OpenAIGPTDoubleHeadsModel(config)
-        >>> choices = ["Hello, my dog is cute [CLS]", "Hello, my cat is cute [CLS]"]  # Assume you've added [CLS] to the vocabulary
-        >>> input_ids = torch.tensor([tokenizer.encode(s) for s in choices]).unsqueeze(0)  # Batch size 1, 2 choices
-        >>> mc_token_ids = torch.tensor([-1, -1]).unsqueeze(0)  # Batch size 1
-        >>> outputs = model(input_ids, mc_token_ids)
-        >>> lm_prediction_scores, mc_prediction_scores = outputs[:2]
+        config = OpenAIGPTConfig.from_pretrained('openai-gpt')
+        tokenizer = OpenAIGPTTokenizer.from_pretrained('openai-gpt')
+        model = OpenAIGPTDoubleHeadsModel(config)
+        choices = ["Hello, my dog is cute [CLS]", "Hello, my cat is cute [CLS]"]  # Assume you've added [CLS] to the vocabulary
+        input_ids = torch.tensor([tokenizer.encode(s) for s in choices]).unsqueeze(0)  # Batch size 1, 2 choices
+        mc_token_ids = torch.tensor([-1, -1]).unsqueeze(0)  # Batch size 1
+        outputs = model(input_ids, mc_token_ids)
+        lm_prediction_scores, mc_prediction_scores = outputs[:2]
 
     """
     def __init__(self, config):
diff --git a/pytorch_transformers/modeling_transfo_xl.py b/pytorch_transformers/modeling_transfo_xl.py
index 3280c4558d..7c999edda7 100644
--- a/pytorch_transformers/modeling_transfo_xl.py
+++ b/pytorch_transformers/modeling_transfo_xl.py
@@ -968,12 +968,12 @@ class TransfoXLModel(TransfoXLPreTrainedModel):
 
     Examples::
 
-        >>> config = TransfoXLConfig.from_pretrained('transfo-xl-wt103')
-        >>> tokenizer = TransfoXLTokenizer.from_pretrained('transfo-xl-wt103')
-        >>> model = TransfoXLModel(config)
-        >>> input_ids = torch.tensor(tokenizer.encode("Hello, my dog is cute")).unsqueeze(0)  # Batch size 1
-        >>> outputs = model(input_ids)
-        >>> last_hidden_states, mems = outputs[:2]
+        config = TransfoXLConfig.from_pretrained('transfo-xl-wt103')
+        tokenizer = TransfoXLTokenizer.from_pretrained('transfo-xl-wt103')
+        model = TransfoXLModel(config)
+        input_ids = torch.tensor(tokenizer.encode("Hello, my dog is cute")).unsqueeze(0)  # Batch size 1
+        outputs = model(input_ids)
+        last_hidden_states, mems = outputs[:2]
 
     """
     def __init__(self, config):
@@ -1284,12 +1284,12 @@ class TransfoXLLMHeadModel(TransfoXLPreTrainedModel):
 
     Examples::
 
-        >>> config = TransfoXLConfig.from_pretrained('transfo-xl-wt103')
-        >>> tokenizer = TransfoXLTokenizer.from_pretrained('transfo-xl-wt103')
-        >>> model = TransfoXLLMHeadModel(config)
-        >>> input_ids = torch.tensor(tokenizer.encode("Hello, my dog is cute")).unsqueeze(0)  # Batch size 1
-        >>> outputs = model(input_ids)
-        >>> prediction_scores, mems = outputs[:2]
+        config = TransfoXLConfig.from_pretrained('transfo-xl-wt103')
+        tokenizer = TransfoXLTokenizer.from_pretrained('transfo-xl-wt103')
+        model = TransfoXLLMHeadModel(config)
+        input_ids = torch.tensor(tokenizer.encode("Hello, my dog is cute")).unsqueeze(0)  # Batch size 1
+        outputs = model(input_ids)
+        prediction_scores, mems = outputs[:2]
 
     """
     def __init__(self, config):
diff --git a/pytorch_transformers/modeling_utils.py b/pytorch_transformers/modeling_utils.py
index 7ae834f5e5..e458c5ef74 100644
--- a/pytorch_transformers/modeling_utils.py
+++ b/pytorch_transformers/modeling_utils.py
@@ -105,15 +105,15 @@ class PretrainedConfig(object):
 
         Examples::
 
-            >>> config = BertConfig.from_pretrained('bert-base-uncased')    # Download configuration from S3 and cache.
-            >>> config = BertConfig.from_pretrained('./test/saved_model/')  # E.g. config (or model) was saved using `save_pretrained('./test/saved_model/')`
-            >>> config = BertConfig.from_pretrained('./test/saved_model/my_configuration.json')
-            >>> config = BertConfig.from_pretrained('bert-base-uncased', output_attention=True, foo=False)
-            >>> assert config.output_attention == True
-            >>> config, unused_kwargs = BertConfig.from_pretrained('bert-base-uncased', output_attention=True,
-            >>>                                                    foo=False, return_unused_kwargs=True)
-            >>> assert config.output_attention == True
-            >>> assert unused_kwargs == {'foo': False}
+            config = BertConfig.from_pretrained('bert-base-uncased')    # Download configuration from S3 and cache.
+            config = BertConfig.from_pretrained('./test/saved_model/')  # E.g. config (or model) was saved using `save_pretrained('./test/saved_model/')`
+            config = BertConfig.from_pretrained('./test/saved_model/my_configuration.json')
+            config = BertConfig.from_pretrained('bert-base-uncased', output_attention=True, foo=False)
+            assert config.output_attention == True
+            config, unused_kwargs = BertConfig.from_pretrained('bert-base-uncased', output_attention=True,
+                                                               foo=False, return_unused_kwargs=True)
+            assert config.output_attention == True
+            assert unused_kwargs == {'foo': False}
 
         """
         cache_dir = kwargs.pop('cache_dir', None)
@@ -369,13 +369,13 @@ class PreTrainedModel(nn.Module):
 
         Examples::dictionary
 
-            >>> model = BertModel.from_pretrained('bert-base-uncased')    # Download model and configuration from S3 and cache.
-            >>> model = BertModel.from_pretrained('./test/saved_model/')  # E.g. model was saved using `save_pretrained('./test/saved_model/')`
-            >>> model = BertModel.from_pretrained('bert-base-uncased', output_attention=True)  # Update configuration during loading
-            >>> assert model.config.output_attention == True
-            >>> # Loading from a TF checkpoint file instead of a PyTorch model (slower)
-            >>> config = BertConfig.from_json_file('./tf_model/my_tf_model_config.json')
-            >>> model = BertModel.from_pretrained('./tf_model/my_tf_checkpoint.ckpt.index', from_tf=True, config=config)
+            model = BertModel.from_pretrained('bert-base-uncased')    # Download model and configuration from S3 and cache.
+            model = BertModel.from_pretrained('./test/saved_model/')  # E.g. model was saved using `save_pretrained('./test/saved_model/')`
+            model = BertModel.from_pretrained('bert-base-uncased', output_attention=True)  # Update configuration during loading
+            assert model.config.output_attention == True
+            # Loading from a TF checkpoint file instead of a PyTorch model (slower)
+            config = BertConfig.from_json_file('./tf_model/my_tf_model_config.json')
+            model = BertModel.from_pretrained('./tf_model/my_tf_checkpoint.ckpt.index', from_tf=True, config=config)
 
         """
         config = kwargs.pop('config', None)
diff --git a/pytorch_transformers/modeling_xlm.py b/pytorch_transformers/modeling_xlm.py
index 3bb864501a..7325ff7875 100644
--- a/pytorch_transformers/modeling_xlm.py
+++ b/pytorch_transformers/modeling_xlm.py
@@ -472,12 +472,12 @@ class XLMModel(XLMPreTrainedModel):
 
     Examples::
 
-        >>> config = XLMConfig.from_pretrained('xlm-mlm-en-2048')
-        >>> tokenizer = XLMTokenizer.from_pretrained('xlm-mlm-en-2048')
-        >>> model = XLMModel(config)
-        >>> input_ids = torch.tensor(tokenizer.encode("Hello, my dog is cute")).unsqueeze(0)  # Batch size 1
-        >>> outputs = model(input_ids)
-        >>> last_hidden_states = outputs[0]  # The last hidden-state is the first element of the output tuple
+        config = XLMConfig.from_pretrained('xlm-mlm-en-2048')
+        tokenizer = XLMTokenizer.from_pretrained('xlm-mlm-en-2048')
+        model = XLMModel(config)
+        input_ids = torch.tensor(tokenizer.encode("Hello, my dog is cute")).unsqueeze(0)  # Batch size 1
+        outputs = model(input_ids)
+        last_hidden_states = outputs[0]  # The last hidden-state is the first element of the output tuple
 
     """
     ATTRIBUTES = ['encoder', 'eos_index', 'pad_index',  # 'with_output', 
@@ -745,12 +745,12 @@ class XLMWithLMHeadModel(XLMPreTrainedModel):
 
     Examples::
 
-        >>> config = XLMConfig.from_pretrained('xlm-mlm-en-2048')
-        >>> tokenizer = XLMTokenizer.from_pretrained('xlm-mlm-en-2048')
-        >>> model = XLMWithLMHeadModel(config)
-        >>> input_ids = torch.tensor(tokenizer.encode("Hello, my dog is cute")).unsqueeze(0)  # Batch size 1
-        >>> outputs = model(input_ids)
-        >>> last_hidden_states = outputs[0]  # The last hidden-state is the first element of the output tuple
+        config = XLMConfig.from_pretrained('xlm-mlm-en-2048')
+        tokenizer = XLMTokenizer.from_pretrained('xlm-mlm-en-2048')
+        model = XLMWithLMHeadModel(config)
+        input_ids = torch.tensor(tokenizer.encode("Hello, my dog is cute")).unsqueeze(0)  # Batch size 1
+        outputs = model(input_ids)
+        last_hidden_states = outputs[0]  # The last hidden-state is the first element of the output tuple
 
     """
     def __init__(self, config):
@@ -805,14 +805,14 @@ class XLMForSequenceClassification(XLMPreTrainedModel):
 
     Examples::
 
-        >>> config = XLMConfig.from_pretrained('xlm-mlm-en-2048')
-        >>> tokenizer = XLMTokenizer.from_pretrained('xlm-mlm-en-2048')
-        >>> 
-        >>> model = XLMForSequenceClassification(config)
-        >>> input_ids = torch.tensor(tokenizer.encode("Hello, my dog is cute")).unsqueeze(0)  # Batch size 1
-        >>> labels = torch.tensor([1]).unsqueeze(0)  # Batch size 1
-        >>> outputs = model(input_ids, labels=labels)
-        >>> loss, logits = outputs[:2]
+        config = XLMConfig.from_pretrained('xlm-mlm-en-2048')
+        tokenizer = XLMTokenizer.from_pretrained('xlm-mlm-en-2048')
+        
+        model = XLMForSequenceClassification(config)
+        input_ids = torch.tensor(tokenizer.encode("Hello, my dog is cute")).unsqueeze(0)  # Batch size 1
+        labels = torch.tensor([1]).unsqueeze(0)  # Batch size 1
+        outputs = model(input_ids, labels=labels)
+        loss, logits = outputs[:2]
 
     """
     def __init__(self, config):
@@ -885,15 +885,15 @@ class XLMForQuestionAnswering(XLMPreTrainedModel):
 
     Examples::
 
-        >>> config = XLMConfig.from_pretrained('xlm-mlm-en-2048')
-        >>> tokenizer = XLMTokenizer.from_pretrained('xlm-mlm-en-2048')
-        >>> 
-        >>> model = XLMForQuestionAnswering(config)
-        >>> input_ids = torch.tensor(tokenizer.encode("Hello, my dog is cute")).unsqueeze(0)  # Batch size 1
-        >>> start_positions = torch.tensor([1])
-        >>> end_positions = torch.tensor([3])
-        >>> outputs = model(input_ids, start_positions=start_positions, end_positions=end_positions)
-        >>> loss, start_scores, end_scores = outputs[:2]
+        config = XLMConfig.from_pretrained('xlm-mlm-en-2048')
+        tokenizer = XLMTokenizer.from_pretrained('xlm-mlm-en-2048')
+        
+        model = XLMForQuestionAnswering(config)
+        input_ids = torch.tensor(tokenizer.encode("Hello, my dog is cute")).unsqueeze(0)  # Batch size 1
+        start_positions = torch.tensor([1])
+        end_positions = torch.tensor([3])
+        outputs = model(input_ids, start_positions=start_positions, end_positions=end_positions)
+        loss, start_scores, end_scores = outputs[:2]
 
     """
     def __init__(self, config):
diff --git a/pytorch_transformers/modeling_xlnet.py b/pytorch_transformers/modeling_xlnet.py
index 515decdb3e..9c1752eb74 100644
--- a/pytorch_transformers/modeling_xlnet.py
+++ b/pytorch_transformers/modeling_xlnet.py
@@ -712,12 +712,12 @@ class XLNetModel(XLNetPreTrainedModel):
 
     Examples::
 
-        >>> config = XLNetConfig.from_pretrained('xlnet-large-cased')
-        >>> tokenizer = XLNetTokenizer.from_pretrained('xlnet-large-cased')
-        >>> model = XLNetModel(config)
-        >>> input_ids = torch.tensor(tokenizer.encode("Hello, my dog is cute")).unsqueeze(0)  # Batch size 1
-        >>> outputs = model(input_ids)
-        >>> last_hidden_states = outputs[0]  # The last hidden-state is the first element of the output tuple
+        config = XLNetConfig.from_pretrained('xlnet-large-cased')
+        tokenizer = XLNetTokenizer.from_pretrained('xlnet-large-cased')
+        model = XLNetModel(config)
+        input_ids = torch.tensor(tokenizer.encode("Hello, my dog is cute")).unsqueeze(0)  # Batch size 1
+        outputs = model(input_ids)
+        last_hidden_states = outputs[0]  # The last hidden-state is the first element of the output tuple
 
     """
     def __init__(self, config):
@@ -1019,17 +1019,17 @@ class XLNetLMHeadModel(XLNetPreTrainedModel):
 
     Examples::
 
-        >>> config = XLNetConfig.from_pretrained('xlnet-large-cased')
-        >>> tokenizer = XLNetTokenizer.from_pretrained('xlnet-large-cased')
-        >>> model = XLNetLMHeadModel(config)
-        >>> # We show how to setup inputs to predict a next token using a bi-directional context.
-        >>> input_ids = torch.tensor(tokenizer.encode("Hello, my dog is very <mask>")).unsqueeze(0)  # We will predict the masked token
-        >>> perm_mask = torch.zeros((1, input_ids.shape[1], input_ids.shape[1]), dtype=torch.float)
-        >>> perm_mask[:, :, -1] = 1.0  # Previous tokens don't see last token
-        >>> target_mapping = torch.zeros((1, 1, input_ids.shape[1]), dtype=torch.float)  # Shape [1, 1, seq_length] => let's predict one token
-        >>> target_mapping[0, 0, -1] = 1.0  # Our first (and only) prediction will be the last token of the sequence (the masked token)
-        >>> outputs = model(input_ids, perm_mask=perm_mask, target_mapping=target_mapping)
-        >>> next_token_logits = outputs[0]  # Output has shape [target_mapping.size(0), target_mapping.size(1), config.vocab_size]
+        config = XLNetConfig.from_pretrained('xlnet-large-cased')
+        tokenizer = XLNetTokenizer.from_pretrained('xlnet-large-cased')
+        model = XLNetLMHeadModel(config)
+        # We show how to setup inputs to predict a next token using a bi-directional context.
+        input_ids = torch.tensor(tokenizer.encode("Hello, my dog is very <mask>")).unsqueeze(0)  # We will predict the masked token
+        perm_mask = torch.zeros((1, input_ids.shape[1], input_ids.shape[1]), dtype=torch.float)
+        perm_mask[:, :, -1] = 1.0  # Previous tokens don't see last token
+        target_mapping = torch.zeros((1, 1, input_ids.shape[1]), dtype=torch.float)  # Shape [1, 1, seq_length] => let's predict one token
+        target_mapping[0, 0, -1] = 1.0  # Our first (and only) prediction will be the last token of the sequence (the masked token)
+        outputs = model(input_ids, perm_mask=perm_mask, target_mapping=target_mapping)
+        next_token_logits = outputs[0]  # Output has shape [target_mapping.size(0), target_mapping.size(1), config.vocab_size]
 
     """
     def __init__(self, config):
@@ -1100,14 +1100,14 @@ class XLNetForSequenceClassification(XLNetPreTrainedModel):
 
     Examples::
 
-        >>> config = XLNetConfig.from_pretrained('xlnet-large-cased')
-        >>> tokenizer = XLNetTokenizer.from_pretrained('xlnet-large-cased')
-        >>> 
-        >>> model = XLNetForSequenceClassification(config)
-        >>> input_ids = torch.tensor(tokenizer.encode("Hello, my dog is cute")).unsqueeze(0)  # Batch size 1
-        >>> labels = torch.tensor([1]).unsqueeze(0)  # Batch size 1
-        >>> outputs = model(input_ids, labels=labels)
-        >>> loss, logits = outputs[:2]
+        config = XLNetConfig.from_pretrained('xlnet-large-cased')
+        tokenizer = XLNetTokenizer.from_pretrained('xlnet-large-cased')
+        
+        model = XLNetForSequenceClassification(config)
+        input_ids = torch.tensor(tokenizer.encode("Hello, my dog is cute")).unsqueeze(0)  # Batch size 1
+        labels = torch.tensor([1]).unsqueeze(0)  # Batch size 1
+        outputs = model(input_ids, labels=labels)
+        loss, logits = outputs[:2]
 
     """
     def __init__(self, config):
@@ -1200,15 +1200,15 @@ class XLNetForQuestionAnswering(XLNetPreTrainedModel):
 
     Examples::
 
-        >>> config = XLMConfig.from_pretrained('xlm-mlm-en-2048')
-        >>> tokenizer = XLMTokenizer.from_pretrained('xlm-mlm-en-2048')
-        >>> 
-        >>> model = XLMForQuestionAnswering(config)
-        >>> input_ids = torch.tensor(tokenizer.encode("Hello, my dog is cute")).unsqueeze(0)  # Batch size 1
-        >>> start_positions = torch.tensor([1])
-        >>> end_positions = torch.tensor([3])
-        >>> outputs = model(input_ids, start_positions=start_positions, end_positions=end_positions)
-        >>> loss, start_scores, end_scores = outputs[:2]
+        config = XLMConfig.from_pretrained('xlm-mlm-en-2048')
+        tokenizer = XLMTokenizer.from_pretrained('xlm-mlm-en-2048')
+        
+        model = XLMForQuestionAnswering(config)
+        input_ids = torch.tensor(tokenizer.encode("Hello, my dog is cute")).unsqueeze(0)  # Batch size 1
+        start_positions = torch.tensor([1])
+        end_positions = torch.tensor([3])
+        outputs = model(input_ids, start_positions=start_positions, end_positions=end_positions)
+        loss, start_scores, end_scores = outputs[:2]
 
     """
     def __init__(self, config):
diff --git a/pytorch_transformers/tokenization_auto.py b/pytorch_transformers/tokenization_auto.py
index 66d0ce51ba..acbe1cebc6 100644
--- a/pytorch_transformers/tokenization_auto.py
+++ b/pytorch_transformers/tokenization_auto.py
@@ -78,8 +78,8 @@ class AutoTokenizer(object):
 
         Examples::
 
-            >>> config = AutoTokenizer.from_pretrained('bert-base-uncased')    # Download vocabulary from S3 and cache.
-            >>> config = AutoTokenizer.from_pretrained('./test/bert_saved_model/')  # E.g. tokenizer was saved using `save_pretrained('./test/saved_model/')`
+            config = AutoTokenizer.from_pretrained('bert-base-uncased')    # Download vocabulary from S3 and cache.
+            config = AutoTokenizer.from_pretrained('./test/bert_saved_model/')  # E.g. tokenizer was saved using `save_pretrained('./test/saved_model/')`
 
         """
         if 'bert' in pretrained_model_name_or_path:

From c90119e5430954abc9e852dd334d90d3ca906eb1 Mon Sep 17 00:00:00 2001
From: David Pollack <david@i2x.ai>
Date: Mon, 29 Jul 2019 16:56:02 +0200
Subject: [PATCH 037/200] spelling mistake

---
 pytorch_transformers/convert_pytorch_checkpoint_to_tf.py | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/pytorch_transformers/convert_pytorch_checkpoint_to_tf.py b/pytorch_transformers/convert_pytorch_checkpoint_to_tf.py
index c24dddc4d6..025c2f396c 100644
--- a/pytorch_transformers/convert_pytorch_checkpoint_to_tf.py
+++ b/pytorch_transformers/convert_pytorch_checkpoint_to_tf.py
@@ -41,7 +41,7 @@ def convert_pytorch_checkpoint_to_tf(model:BertModel, ckpt_dir:str, model_name:s
         N BertForQuestionAnswering
     """
 
-    tensors_to_transopse = (
+    tensors_to_transpose = (
         "dense.weight",
         "attention.self.query",
         "attention.self.key",
@@ -81,7 +81,7 @@ def convert_pytorch_checkpoint_to_tf(model:BertModel, ckpt_dir:str, model_name:s
         for var_name in state_dict:
             tf_name = to_tf_var_name(var_name)
             torch_tensor = state_dict[var_name].numpy()
-            if any([x in var_name for x in tensors_to_transopse]):
+            if any([x in var_name for x in tensors_to_transpose]):
                 torch_tensor = torch_tensor.T
             tf_var = create_tf_var(tensor=torch_tensor, name=tf_name, session=session)
             tf.keras.backend.set_value(tf_var, torch_tensor)

From 769bb643ce4e6d5836b41b41430ce02473907ec8 Mon Sep 17 00:00:00 2001
From: =?UTF-8?q?Gr=C3=A9gory=20Ch=C3=A2tel?= <chatel.gregory@gmail.com>
Date: Wed, 31 Jul 2019 16:17:15 +0200
Subject: [PATCH 038/200] Fixing a broken link.

---
 README.md | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/README.md b/README.md
index 8e2074f727..a4905e5854 100644
--- a/README.md
+++ b/README.md
@@ -18,7 +18,7 @@ These implementations have been tested on several datasets (see the example scri
 | Section | Description |
 |-|-|
 | [Installation](#installation) | How to install the package |
-| [Quick tour: Usage](#quick-tour-usage) | Tokenizers & models usage: Bert and GPT-2 |
+| [Quick tour: Usage](#quick-tour) | Tokenizers & models usage: Bert and GPT-2 |
 | [Quick tour: Fine-tuning/usage scripts](#quick-tour-of-the-fine-tuningusage-scripts) | Using provided scripts: GLUE, SQuAD and Text generation |
 | [Migrating from pytorch-pretrained-bert to pytorch-transformers](#Migrating-from-pytorch-pretrained-bert-to-pytorch-transformers) | Migrating your code from pytorch-pretrained-bert to pytorch-transformers |
 | [Documentation](https://huggingface.co/pytorch-transformers/) | Full API documentation and more |

From 97091acb8c5bd192a354375e58352694007b2390 Mon Sep 17 00:00:00 2001
From: Pierric Cistac <Pierrci@users.noreply.github.com>
Date: Wed, 31 Jul 2019 10:37:56 -0400
Subject: [PATCH 039/200] Small spelling fix

---
 README.md | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/README.md b/README.md
index a4905e5854..54c0ac94a1 100644
--- a/README.md
+++ b/README.md
@@ -283,7 +283,7 @@ Here is a quick summary of what you should take care of when migrating from `pyt
 
 The main breaking change when migrating from `pytorch-pretrained-bert` to `pytorch-transformers` is that the models forward method always outputs a `tuple` with various elements depending on the model and the configuration parameters.
 
-The exact content of the tuples for each model are detailled in the models' docstrings and the [documentation](https://huggingface.co/pytorch-transformers/).
+The exact content of the tuples for each model are detailed in the models' docstrings and the [documentation](https://huggingface.co/pytorch-transformers/).
 
 In pretty much every case, you will be fine by taking the first element of the output as the output you previously used in `pytorch-pretrained-bert`.
 

From f2a3eb987e1fc2c85320fc3849c67811f5736b50 Mon Sep 17 00:00:00 2001
From: Anthony MOI <m.anthony.moi@gmail.com>
Date: Wed, 31 Jul 2019 11:05:06 -0400
Subject: [PATCH 040/200] Fix small typos

---
 README.md | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/README.md b/README.md
index 54c0ac94a1..7365e02a09 100644
--- a/README.md
+++ b/README.md
@@ -194,7 +194,7 @@ python ./examples/run_glue.py \
     --warmup_steps=120
 ```
 
-On this machine we thus have a batch size of 32, please increase `gradient_accumulation_steps` to reach the same batch size if you have a smaller machine. These hyper-parameters should results in a Pearson correlation coefficient of `+0.917` on the development set.
+On this machine we thus have a batch size of 32, please increase `gradient_accumulation_steps` to reach the same batch size if you have a smaller machine. These hyper-parameters should result in a Pearson correlation coefficient of `+0.917` on the development set.
 
 #### Fine-tuning Bert model on the MRPC classification task
 
@@ -264,7 +264,7 @@ This is the model provided as `bert-large-uncased-whole-word-masking-finetuned-s
 ### `run_generation.py`: Text generation with GPT, GPT-2, Transformer-XL and XLNet
 
 A conditional generation script is also included to generate text from a prompt.
-The generation script include the [tricks](https://github.com/rusiaaman/XLNet-gen#methodology) proposed by by Aman Rusia to get high quality generation with memory models like Transformer-XL and XLNet (include a predefined text to make short inputs longer).
+The generation script includes the [tricks](https://github.com/rusiaaman/XLNet-gen#methodology) proposed by by Aman Rusia to get high quality generation with memory models like Transformer-XL and XLNet (include a predefined text to make short inputs longer).
 
 Here is how to run the script with the small version of OpenAI GPT-2 model:
 

From 44dd941efb602433b7edc29612cbdd0a03bf14dc Mon Sep 17 00:00:00 2001
From: Julien Chaumond <chaumond@gmail.com>
Date: Wed, 31 Jul 2019 21:09:04 -0400
Subject: [PATCH 041/200] link to `swift-coreml-transformers`

---
 README.md                    | 10 ++++++++++
 docs/source/installation.rst | 13 +++++++++++++
 2 files changed, 23 insertions(+)

diff --git a/README.md b/README.md
index 7365e02a09..703eb47df9 100644
--- a/README.md
+++ b/README.md
@@ -56,6 +56,16 @@ python -m pytest -sv ./pytorch_transformers/tests/
 python -m pytest -sv ./examples/
 ```
 
+### Do you want to run a Transformer model on a mobile device?
+
+You should check out our [`swift-coreml-transformers`](https://github.com/huggingface/swift-coreml-transformers) repo.
+
+It contains an example of a conversion script from a Pytorch trained Transformer model (here, `GPT-2`) to a CoreML model that runs on iOS devices.
+
+At some point in the future, you'll be able to seamlessly move from pre-training or fine-tuning models in PyTorch to productizing them in CoreML,
+or prototype a model or an app in CoreML then research its hyperparameters or architecture from PyTorch. Super exciting!
+
+
 ## Quick tour
 
 Let's do a very quick overview of PyTorch-Transformers. Detailed examples for each model architecture (Bert, GPT, GPT-2, Transformer-XL, XLNet and XLM) can be found in the [full documentation](https://huggingface.co/pytorch-transformers/).
diff --git a/docs/source/installation.rst b/docs/source/installation.rst
index f8beb9f1c8..3a4663da0b 100644
--- a/docs/source/installation.rst
+++ b/docs/source/installation.rst
@@ -50,3 +50,16 @@ If you want to reproduce the original tokenization process of the ``OpenAI GPT``
    python -m spacy download en
 
 If you don't install ``ftfy`` and ``SpaCy``\ , the ``OpenAI GPT`` tokenizer will default to tokenize using BERT's ``BasicTokenizer`` followed by Byte-Pair Encoding (which should be fine for most usage, don't worry).
+
+
+Do you want to run a Transformer model on a mobile device?
+^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+
+You should check out our `swift-coreml-transformers <https://github.com/huggingface/swift-coreml-transformers>`_ repo.
+
+It contains an example of a conversion script from a Pytorch trained Transformer model (here, ``GPT-2``) to a CoreML model that runs on iOS devices.
+
+It also contains an implementation of BERT for Question answering.
+
+At some point in the future, you'll be able to seamlessly move from pre-training or fine-tuning models in PyTorch to productizing them in CoreML,
+or prototype a model or an app in CoreML then research its hyperparameters or architecture from PyTorch. Super exciting!

From a24f830604fc150526d9fd4596a4f3900916abe9 Mon Sep 17 00:00:00 2001
From: wangfei <1140554608@qq.com>
Date: Sat, 3 Aug 2019 12:17:06 +0800
Subject: [PATCH 042/200] Fix comment typo

---
 pytorch_transformers/modeling_bert.py | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/pytorch_transformers/modeling_bert.py b/pytorch_transformers/modeling_bert.py
index b59445513a..418939f7da 100644
--- a/pytorch_transformers/modeling_bert.py
+++ b/pytorch_transformers/modeling_bert.py
@@ -857,7 +857,7 @@ class BertForMaskedLM(BertPreTrainedModel):
         sequence_output = outputs[0]
         prediction_scores = self.cls(sequence_output)
 
-        outputs = (prediction_scores,) + outputs[2:]  # Add hidden states and attention is they are here
+        outputs = (prediction_scores,) + outputs[2:]  # Add hidden states and attention if they are here
         if masked_lm_labels is not None:
             loss_fct = CrossEntropyLoss(ignore_index=-1)
             masked_lm_loss = loss_fct(prediction_scores.view(-1, self.config.vocab_size), masked_lm_labels.view(-1))

From 836e51369820797759a71064066a78fb161fe804 Mon Sep 17 00:00:00 2001
From: Saket Khandelwal <saketdecade17@gmail.com>
Date: Sun, 4 Aug 2019 16:05:10 +1000
Subject: [PATCH 043/200] Fixed small typo

---
 pytorch_transformers/convert_gpt2_checkpoint_to_pytorch.py   | 2 +-
 pytorch_transformers/convert_openai_checkpoint_to_pytorch.py | 2 +-
 pytorch_transformers/convert_tf_checkpoint_to_pytorch.py     | 2 +-
 pytorch_transformers/convert_xlnet_checkpoint_to_pytorch.py  | 2 +-
 4 files changed, 4 insertions(+), 4 deletions(-)

diff --git a/pytorch_transformers/convert_gpt2_checkpoint_to_pytorch.py b/pytorch_transformers/convert_gpt2_checkpoint_to_pytorch.py
index 68cb798a7d..f9e83f5d5b 100755
--- a/pytorch_transformers/convert_gpt2_checkpoint_to_pytorch.py
+++ b/pytorch_transformers/convert_gpt2_checkpoint_to_pytorch.py
@@ -58,7 +58,7 @@ if __name__ == "__main__":
                         default = None,
                         type = str,
                         required = True,
-                        help = "Path the TensorFlow checkpoint path.")
+                        help = "Path to the TensorFlow checkpoint path.")
     parser.add_argument("--pytorch_dump_folder_path",
                         default = None,
                         type = str,
diff --git a/pytorch_transformers/convert_openai_checkpoint_to_pytorch.py b/pytorch_transformers/convert_openai_checkpoint_to_pytorch.py
index 8ec852a4bd..70895b4002 100755
--- a/pytorch_transformers/convert_openai_checkpoint_to_pytorch.py
+++ b/pytorch_transformers/convert_openai_checkpoint_to_pytorch.py
@@ -58,7 +58,7 @@ if __name__ == "__main__":
                         default = None,
                         type = str,
                         required = True,
-                        help = "Path the TensorFlow checkpoint path.")
+                        help = "Path to the TensorFlow checkpoint path.")
     parser.add_argument("--pytorch_dump_folder_path",
                         default = None,
                         type = str,
diff --git a/pytorch_transformers/convert_tf_checkpoint_to_pytorch.py b/pytorch_transformers/convert_tf_checkpoint_to_pytorch.py
index 9f121e8b79..220204f36e 100755
--- a/pytorch_transformers/convert_tf_checkpoint_to_pytorch.py
+++ b/pytorch_transformers/convert_tf_checkpoint_to_pytorch.py
@@ -47,7 +47,7 @@ if __name__ == "__main__":
                         default = None,
                         type = str,
                         required = True,
-                        help = "Path the TensorFlow checkpoint path.")
+                        help = "Path to the TensorFlow checkpoint path.")
     parser.add_argument("--bert_config_file",
                         default = None,
                         type = str,
diff --git a/pytorch_transformers/convert_xlnet_checkpoint_to_pytorch.py b/pytorch_transformers/convert_xlnet_checkpoint_to_pytorch.py
index 834b47484f..038c706960 100755
--- a/pytorch_transformers/convert_xlnet_checkpoint_to_pytorch.py
+++ b/pytorch_transformers/convert_xlnet_checkpoint_to_pytorch.py
@@ -79,7 +79,7 @@ if __name__ == "__main__":
                         default = None,
                         type = str,
                         required = True,
-                        help = "Path the TensorFlow checkpoint path.")
+                        help = "Path to the TensorFlow checkpoint path.")
     parser.add_argument("--xlnet_config_file",
                         default = None,
                         type = str,

From 009273dbddd0964c378d7131445bfc0ae63bc29c Mon Sep 17 00:00:00 2001
From: thomwolf <thomwolf@gmail.com>
Date: Sun, 4 Aug 2019 12:14:57 +0200
Subject: [PATCH 044/200] big doc update [WIP]

---
 README.md                                     |  1 +
 docs/source/converting_tensorflow_models.rst  | 43 +++++++++++------
 docs/source/index.rst                         |  9 ++++
 docs/source/installation.rst                  | 16 +++----
 docs/source/main_classes/configuration.rst    | 10 ++++
 docs/source/main_classes/model.rst            |  8 ++++
 .../main_classes/optimizer_schedules.rst      | 26 ++++++++++
 docs/source/main_classes/tokenizer.rst        |  8 ++++
 docs/source/migration.md                      |  9 ++--
 docs/source/model_doc/bert.rst                |  6 ---
 docs/source/quickstart.md                     | 47 +++++++++++++++++--
 docs/source/serialization.rst                 |  3 ++
 pytorch_transformers/__init__.py              |  2 +-
 pytorch_transformers/modeling_utils.py        | 38 ++++++++++-----
 pytorch_transformers/tokenization_bert.py     |  2 +-
 pytorch_transformers/tokenization_gpt2.py     |  2 +-
 .../tokenization_transfo_xl.py                |  2 +-
 pytorch_transformers/tokenization_utils.py    | 15 +++---
 pytorch_transformers/tokenization_xlnet.py    |  2 +-
 19 files changed, 189 insertions(+), 60 deletions(-)
 create mode 100644 docs/source/main_classes/configuration.rst
 create mode 100644 docs/source/main_classes/model.rst
 create mode 100644 docs/source/main_classes/optimizer_schedules.rst
 create mode 100644 docs/source/main_classes/tokenizer.rst

diff --git a/README.md b/README.md
index 8e2074f727..c31bbd24b7 100644
--- a/README.md
+++ b/README.md
@@ -119,6 +119,7 @@ tokenizer = tokenizer_class.from_pretrained(pretrained_weights)
 ```
 
 ## Quick tour of the fine-tuning/usage scripts
+
 The library comprises several example scripts with SOTA performances for NLU and NLG tasks:
 
 - `run_glue.py`: an example fine-tuning Bert, XLNet and XLM on nine different GLUE tasks (*sequence-level classification*)
diff --git a/docs/source/converting_tensorflow_models.rst b/docs/source/converting_tensorflow_models.rst
index 36c1e4050f..8441c9b1f7 100644
--- a/docs/source/converting_tensorflow_models.rst
+++ b/docs/source/converting_tensorflow_models.rst
@@ -1,7 +1,7 @@
 Converting Tensorflow Checkpoints
 ================================================
 
-A command-line interface is provided to convert a TensorFlow checkpoint in a PyTorch dump of the ``BertForPreTraining`` class  (for BERT) or NumPy checkpoint in a PyTorch dump of the ``OpenAIGPTModel`` class  (for OpenAI GPT).
+A command-line interface is provided to convert original Bert/GPT/GPT-2/Transformer-XL/XLNet/XLM checkpoints in models than be loaded using the ``from_pretrained`` methods of the library.
 
 BERT
 ^^^^
@@ -41,6 +41,20 @@ Here is an example of the conversion process for a pre-trained OpenAI GPT model,
      $PYTORCH_DUMP_OUTPUT \
      [OPENAI_GPT_CONFIG]
 
+OpenAI GPT-2
+^^^^^^^^^^^^
+
+Here is an example of the conversion process for a pre-trained OpenAI GPT-2 model (see `here <https://github.com/openai/gpt-2>`__\ )
+
+.. code-block:: shell
+
+   export OPENAI_GPT2_CHECKPOINT_PATH=/path/to/gpt2/pretrained/weights
+
+   pytorch_transformers gpt2 \
+     $OPENAI_GPT2_CHECKPOINT_PATH \
+     $PYTORCH_DUMP_OUTPUT \
+     [OPENAI_GPT2_CONFIG]
+
 Transformer-XL
 ^^^^^^^^^^^^^^
 
@@ -55,19 +69,6 @@ Here is an example of the conversion process for a pre-trained Transformer-XL mo
      $PYTORCH_DUMP_OUTPUT \
      [TRANSFO_XL_CONFIG]
 
-GPT-2
-^^^^^
-
-Here is an example of the conversion process for a pre-trained OpenAI's GPT-2 model.
-
-.. code-block:: shell
-
-   export GPT2_DIR=/path/to/gpt2/checkpoint
-
-   pytorch_transformers gpt2 \
-     $GPT2_DIR/model.ckpt \
-     $PYTORCH_DUMP_OUTPUT \
-     [GPT2_CONFIG]
 
 XLNet
 ^^^^^
@@ -84,3 +85,17 @@ Here is an example of the conversion process for a pre-trained XLNet model, fine
      $TRANSFO_XL_CONFIG_PATH \
      $PYTORCH_DUMP_OUTPUT \
      STS-B \
+
+
+XLM
+^^^
+
+Here is an example of the conversion process for a pre-trained XLM model:
+
+.. code-block:: shell
+
+   export XLM_CHECKPOINT_PATH=/path/to/xlm/checkpoint
+
+   pytorch_transformers xlm \
+     $XLM_CHECKPOINT_PATH \
+     $PYTORCH_DUMP_OUTPUT \
diff --git a/docs/source/index.rst b/docs/source/index.rst
index be8cfc2a39..c403a0ad4f 100644
--- a/docs/source/index.rst
+++ b/docs/source/index.rst
@@ -21,11 +21,20 @@ The library currently contains PyTorch implementations, pre-trained model weight
     pretrained_models
     examples
     notebooks
+    serialization
     converting_tensorflow_models
     migration
     bertology
     torchscript
 
+.. toctree::
+    :maxdepth: 2
+    :caption: Main classes
+
+    main_classes/configuration
+    main_classes/model
+    main_classes/tokenizer
+    main_classes/optimizer_schedules
 
 .. toctree::
     :maxdepth: 2
diff --git a/docs/source/installation.rst b/docs/source/installation.rst
index f8beb9f1c8..9e6269da94 100644
--- a/docs/source/installation.rst
+++ b/docs/source/installation.rst
@@ -1,12 +1,12 @@
 Installation
 ================================================
 
-This repo was tested on Python 2.7 and 3.5+ (examples are tested only on python 3.5+) and PyTorch 0.4.1/1.0.0
+PyTorch-Transformers is tested on Python 2.7 and 3.5+ (examples are tested only on python 3.5+) and PyTorch 1.1.0
 
 With pip
 ^^^^^^^^
 
-PyTorch pretrained bert can be installed with pip as follows:
+PyTorch Transformers can be installed using pip as follows:
 
 .. code-block:: bash
 
@@ -15,7 +15,7 @@ PyTorch pretrained bert can be installed with pip as follows:
 From source
 ^^^^^^^^^^^
 
-Clone the repository and instal locally:
+To install from source, clone the repository and install with:
 
 .. code-block:: bash
 
@@ -27,11 +27,11 @@ Clone the repository and instal locally:
 Tests
 ^^^^^
 
-An extensive test suite is included for the library and the example scripts. Library tests can be found in the `tests folder <https://github.com/huggingface/pytorch-transformers/tree/master/pytorch_transformers/tests>`_ and examples tests in the `examples folder <https://github.com/huggingface/pytorch-transformers/tree/master/examples>`_.
+An extensive test suite is included to test the library behavior and several examples. Library tests can be found in the `tests folder <https://github.com/huggingface/pytorch-transformers/tree/master/pytorch_transformers/tests>`_ and examples tests in the `examples folder <https://github.com/huggingface/pytorch-transformers/tree/master/examples>`_.
 
-These tests can be run using `pytest` (install pytest if needed with `pip install pytest`).
+Tests can be run using `pytest` (install pytest if needed with `pip install pytest`).
 
-You can run the tests from the root of the cloned repository with the commands:
+Run all the tests from the root of the cloned repository with the commands:
 
 .. code-block:: bash
 
@@ -42,11 +42,11 @@ You can run the tests from the root of the cloned repository with the commands:
 OpenAI GPT original tokenization workflow
 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
 
-If you want to reproduce the original tokenization process of the ``OpenAI GPT`` paper, you will need to install ``ftfy`` (limit to version 4.4.3 if you are using Python 2) and ``SpaCy`` :
+If you want to reproduce the original tokenization process of the ``OpenAI GPT`` paper, you will need to install ``ftfy`` (use version 4.4.3 if you are using Python 2) and ``SpaCy`` :
 
 .. code-block:: bash
 
    pip install spacy ftfy==4.4.3
    python -m spacy download en
 
-If you don't install ``ftfy`` and ``SpaCy``\ , the ``OpenAI GPT`` tokenizer will default to tokenize using BERT's ``BasicTokenizer`` followed by Byte-Pair Encoding (which should be fine for most usage, don't worry).
+If you don't install ``ftfy`` and ``SpaCy``\ , the ``OpenAI GPT`` tokenizer defaults to tokenize using BERT's ``BasicTokenizer`` followed by Byte-Pair Encoding (which should be fine for most usage, don't worry).
diff --git a/docs/source/main_classes/configuration.rst b/docs/source/main_classes/configuration.rst
new file mode 100644
index 0000000000..5e069629b8
--- /dev/null
+++ b/docs/source/main_classes/configuration.rst
@@ -0,0 +1,10 @@
+Configuration
+----------------------------------------------------
+
+We provide a base class, ``PretrainedConfig``, which can load a pretrained instance either from a local file or directory or from a pretrained model configuration provided by the library (downloaded from HuggingFace AWS S3 repository).
+
+``PretrainedConfig``
+~~~~~~~~~~~~~~~~~~~~~
+
+.. autoclass:: pytorch_transformers.PretrainedConfig
+    :members:
diff --git a/docs/source/main_classes/model.rst b/docs/source/main_classes/model.rst
new file mode 100644
index 0000000000..dd4c9d87dd
--- /dev/null
+++ b/docs/source/main_classes/model.rst
@@ -0,0 +1,8 @@
+Models
+----------------------------------------------------
+
+``PreTrainedModel``
+~~~~~~~~~~~~~~~~~~~~~
+
+.. autoclass:: pytorch_transformers.PreTrainedModel
+    :members:
diff --git a/docs/source/main_classes/optimizer_schedules.rst b/docs/source/main_classes/optimizer_schedules.rst
new file mode 100644
index 0000000000..2d91d495a4
--- /dev/null
+++ b/docs/source/main_classes/optimizer_schedules.rst
@@ -0,0 +1,26 @@
+Optimizer
+----------------------------------------------------
+
+``AdamW``
+~~~~~~~~~~~~~~~~
+
+.. autoclass:: pytorch_transformers.AdamW
+    :members:
+
+Schedules
+----------------------------------------------------
+
+.. autoclass:: pytorch_transformers.ConstantLRSchedule
+    :members:
+
+.. autoclass:: pytorch_transformers.WarmupConstantSchedule
+    :members:
+
+.. autoclass:: pytorch_transformers.WarmupCosineSchedule
+    :members:
+
+.. autoclass:: pytorch_transformers.WarmupCosineWithHardRestartsSchedule
+    :members:
+
+.. autoclass:: pytorch_transformers.WarmupLinearSchedule
+    :members:
diff --git a/docs/source/main_classes/tokenizer.rst b/docs/source/main_classes/tokenizer.rst
new file mode 100644
index 0000000000..cd6b4786bb
--- /dev/null
+++ b/docs/source/main_classes/tokenizer.rst
@@ -0,0 +1,8 @@
+Tokenizer
+----------------------------------------------------
+
+``PreTrainedTokenizer``
+~~~~~~~~~~~~~~~~~~~~~~~~
+
+.. autoclass:: pytorch_transformers.PreTrainedTokenizer
+    :members:
diff --git a/docs/source/migration.md b/docs/source/migration.md
index fff4807d5c..ba09253472 100644
--- a/docs/source/migration.md
+++ b/docs/source/migration.md
@@ -35,10 +35,13 @@ loss, logits, attentions = outputs
 
 ### Serialization
 
-Breaking change: Models are now set in evaluation mode by default when instantiated with the `from_pretrained()` method.
-To train them don't forget to set them back in training mode (`model.train()`) to activate the dropout modules.
+Breaking change in the `from_pretrained()`method:
 
-Also, while not a breaking change, the serialization methods have been standardized and you probably should switch to the new method `save_pretrained(save_directory)` if you were using any other seralization method before.
+1. Models are now set in evaluation mode by default when instantiated with the `from_pretrained()` method. To train them don't forget to set them back in training mode (`model.train()`) to activate the dropout modules.
+
+2. The additional `*inputs` and `**kwargs` arguments supplied to the `from_pretrained()` method used to be directly passed to the underlying model's class `__init__()` method. They are now used to update the model configuration attribute first which can break derived model classes build based on the previous `BertForSequenceClassification` examples. More precisely, the positional arguments `*inputs` provided to `from_pretrained()` are directly forwarded the model `__init__()` method while the keyword arguments `**kwargs` (i) which match configuration class attributes are used to update said attributes (ii) which don't match any configuration class attributes are forwarded to the model `__init__()` method.
+
+Also, while not a breaking change, the serialization methods have been standardized and you probably should switch to the new method `save_pretrained(save_directory)` if you were using any other serialization method before.
 
 Here is an example:
 
diff --git a/docs/source/model_doc/bert.rst b/docs/source/model_doc/bert.rst
index 8c786aa24f..cbce74e73b 100644
--- a/docs/source/model_doc/bert.rst
+++ b/docs/source/model_doc/bert.rst
@@ -15,12 +15,6 @@ BERT
     :members:
 
 
-``AdamW``
-~~~~~~~~~~~~~~~~
-
-.. autoclass:: pytorch_transformers.AdamW
-    :members:
-
 ``BertModel``
 ~~~~~~~~~~~~~~~~~~~~
 
diff --git a/docs/source/quickstart.md b/docs/source/quickstart.md
index 7414ef48c1..814021038a 100644
--- a/docs/source/quickstart.md
+++ b/docs/source/quickstart.md
@@ -1,17 +1,58 @@
 # Quickstart
 
+## Philosophy
+
+PyTorch-Transformers is an opinionated library built for NLP researchers seeking to use/study/extend large-scale transformers models.
+
+The library was designed with two strong goals in mind:
+
+- be as easy and fast to use as possible:
+
+  - we strongly limited the number of abstractions to learn, in fact there are almost no abstractions, just three standard classes for each model: configuration, models and tokenizer,
+  - each pretrained model configuration, weights and vocabulary can be downloaded, cached and loaded in the related class in a simple way by using a common `from_pretrained()` instantiation method.
+  - this library is NOT a modular toolbox of building blocks for neural nets, to extend/build-upon the library, just use your regular Python/PyTorch modules and inherit from the base classes of the library to reuse functionalities like model loading/saving.
+
+- provide state-of-the-art models with performances as close as possible to the original models:
+
+  - we provide at least one example for each model which reproduces a result provided by the official authors of said model,
+  - the code is usually as close to the original code base as possible which means some PyTorch code may be not as *pytorchic* as it could be as a result of being converted TensorFlow code.
+
+A few other goals:
+
+- expose the models internals as consistently as possible:
+
+  - we give access, using a single API to the full hidden-states and attention weights,
+  - tokenizer and base model's API are standardized to easily switch between models.
+
+- incorporate a subjective selection of promising tools for fine-tuning/investiguating these models:
+
+  - a simple/consistent way to add new tokens to the vocabulary and embeddings for fine-tuning,
+  - simple ways to mask and prune transformer heads.
+
 ## Main concepts
 
+The library is build around three type of classes for each models:
+
+- **model classes** which are PyTorch models (`torch.nn.Modules`) of the 6 models architectures currently provided in the library, e.g. `BertModel`
+- **configuration classes** which store all the parameters required to build a model, e.g. `BertConfig`
+- **tokenizer classes** which store the vocabulary for each model and provide methods for encoding strings in list of token embeddings indices to be fed to a model, e.g. `BertTokenizer`
+
+All these classes can be instantiated from pretrained instances and saved locally using two methods:
+
+- `from_pretrained()` let you instantiate a model/configuration/tokenizer from a pretrained version either provided by the library itself (currently 27 models are provided as listed [here](https://huggingface.co/pytorch-transformers/pretrained_models.html)) or stored locally (or on a server) by the user,
+- `save_pretrained()` let you save a model/configuration/tokenizer locally so that it can be reloaded using `from_pretrained()`.
+
+Let's go through a few simple quick-start examples to see how we can instantiate and use these classes.
 
 ## Quick tour: Usage
 
-Here are two quick-start examples showcasing a few `Bert` and `GPT2` classes and pre-trained models.
+Here are two examples showcasing a few `Bert` and `GPT2` classes and pre-trained models.
 
-See package reference for examples for each model classe.
+See full API reference for examples for each model classe.
 
 ### BERT example
 
-First let's prepare a tokenized input from a text string using `BertTokenizer`
+Let's start by preparing a tokenized input (a list of token embeddings indices to be fed to Bert) from a text string using `BertTokenizer`
 
 ```python
 import torch
diff --git a/docs/source/serialization.rst b/docs/source/serialization.rst
index be5197135d..c0de1324cf 100644
--- a/docs/source/serialization.rst
+++ b/docs/source/serialization.rst
@@ -1,3 +1,6 @@
+Serialization
+----------------------------------------------------
+
 ### Loading Google AI or OpenAI pre-trained weights or PyTorch dump
 
 ### `from_pretrained()` method
diff --git a/pytorch_transformers/__init__.py b/pytorch_transformers/__init__.py
index f875e4ab18..c9b0aeebb7 100644
--- a/pytorch_transformers/__init__.py
+++ b/pytorch_transformers/__init__.py
@@ -5,7 +5,7 @@ from .tokenization_transfo_xl import (TransfoXLTokenizer, TransfoXLCorpus)
 from .tokenization_gpt2 import GPT2Tokenizer
 from .tokenization_xlnet import XLNetTokenizer, SPIECE_UNDERLINE
 from .tokenization_xlm import XLMTokenizer
-from .tokenization_utils import (PreTrainedTokenizer, clean_up_tokenization)
+from .tokenization_utils import (PreTrainedTokenizer)
 
 from .modeling_bert import (BertConfig, BertModel, BertForPreTraining,
                        BertForMaskedLM, BertForNextSentencePrediction,
diff --git a/pytorch_transformers/modeling_utils.py b/pytorch_transformers/modeling_utils.py
index e458c5ef74..f21927e18c 100644
--- a/pytorch_transformers/modeling_utils.py
+++ b/pytorch_transformers/modeling_utils.py
@@ -55,11 +55,19 @@ else:
 
 class PretrainedConfig(object):
     """ Base class for all configuration classes.
-        Handle a few common parameters and methods for loading/downloading/saving configurations.
+        Handle a few common attributes and methods for loading/downloading/saving configurations.
     """
     pretrained_config_archive_map = {}
 
     def __init__(self, **kwargs):
+        r""" The initialization of :class:`~pytorch_transformers.PretrainedConfig` extracts
+            a few configuration attributes from `**kwargs` which are common to all models:
+                - `finetuning_task`: string, default `None`. Name of the task used to fine-tune the model (used to convert from original checkpoint)
+                - `num_labels`: integer, default `2`. Number of classes to use when the model is a classification model (sequences/tokens)
+                - `output_attentions`: boolean, default `False`. Should the model returns attentions weights.
+                - `output_hidden_states`: string, default `False`. Should the model returns all hidden-states.
+                - `torchscript`: string, default `False`. Is the model used with Torchscript.
+        """
         self.finetuning_task = kwargs.pop('finetuning_task', None)
         self.num_labels = kwargs.pop('num_labels', 2)
         self.output_attentions = kwargs.pop('output_attentions', False)
@@ -67,7 +75,7 @@ class PretrainedConfig(object):
         self.torchscript = kwargs.pop('torchscript', False)
 
     def save_pretrained(self, save_directory):
-        """ Save a configuration object to a directory, so that it
+        """ Save a configuration object to the directory `save_directory`, so that it
             can be re-loaded using the `from_pretrained(save_directory)` class method.
         """
         assert os.path.isdir(save_directory), "Saving path should be a directory where the model and configuration can be saved"
@@ -81,30 +89,34 @@ class PretrainedConfig(object):
     def from_pretrained(cls, pretrained_model_name_or_path, **kwargs):
         r""" Instantiate a PretrainedConfig from a pre-trained model configuration.
 
-        Params:
+        Parameters:
             **pretrained_model_name_or_path**: either:
-                - a string with the `shortcut name` of a pre-trained model configuration to load from cache
-                    or download and cache if not already stored in cache (e.g. 'bert-base-uncased').
-                - a path to a `directory` containing a configuration file saved
-                    using the `save_pretrained(save_directory)` method.
-                - a path or url to a saved configuration `file`.
+
+                - a string with the `shortcut name` of a pre-trained model configuration to load from cache or download, e.g.: ``bert-base-uncased``.
+                - a path to a `directory` containing a configuration file saved using the `save_pretrained(save_directory)` method, e.g.: ``./my_model_directory/``.
+                - a path or url to a saved configuration JSON `file`, e.g.: ``./my_model_directory/configuration.json``.
+
             **cache_dir**: (`optional`) string:
                 Path to a directory in which a downloaded pre-trained model
                 configuration should be cached if the standard cache should not be used.
+
             **return_unused_kwargs**: (`optional`) bool:
+
                 - If False, then this function returns just the final configuration object.
-                - If True, then this functions returns a tuple `(config, unused_kwargs)` where `unused_kwargs`
-                is a dictionary consisting of the key/value pairs whose keys are not configuration attributes:
-                ie the part of kwargs which has not been used to update `config` and is otherwise ignored.
+                - If True, then this functions returns a tuple `(config, unused_kwargs)` where `unused_kwargs` is a dictionary consisting of the key/value pairs whose keys are not configuration attributes: ie the part of kwargs which has not been used to update `config` and is otherwise ignored.
+
             **kwargs**: (`optional`) dict:
                 Dictionary of key/value pairs with which to update the configuration object after loading.
+
                 - The values in kwargs of any keys which are configuration attributes will be used
-                to override the loaded values.
+                    to override the loaded values.
                 - Behavior concerning key/value pairs whose keys are *not* configuration attributes is controlled
-                by the `return_unused_kwargs` keyword parameter.
+                    by the `return_unused_kwargs` keyword parameter.
 
         Examples::
 
+            # We can't instantiate directly the base class `PretrainedConfig` so let's show the examples on a
+            # derived class: BertConfig
             config = BertConfig.from_pretrained('bert-base-uncased')    # Download configuration from S3 and cache.
             config = BertConfig.from_pretrained('./test/saved_model/')  # E.g. config (or model) was saved using `save_pretrained('./test/saved_model/')`
             config = BertConfig.from_pretrained('./test/saved_model/my_configuration.json')
diff --git a/pytorch_transformers/tokenization_bert.py b/pytorch_transformers/tokenization_bert.py
index d9cd881dfd..d7aeff7c39 100644
--- a/pytorch_transformers/tokenization_bert.py
+++ b/pytorch_transformers/tokenization_bert.py
@@ -22,7 +22,7 @@ import os
 import unicodedata
 from io import open
 
-from .tokenization_utils import PreTrainedTokenizer, clean_up_tokenization
+from .tokenization_utils import PreTrainedTokenizer
 
 logger = logging.getLogger(__name__)
 
diff --git a/pytorch_transformers/tokenization_gpt2.py b/pytorch_transformers/tokenization_gpt2.py
index 29a9ae7660..0aee856180 100644
--- a/pytorch_transformers/tokenization_gpt2.py
+++ b/pytorch_transformers/tokenization_gpt2.py
@@ -31,7 +31,7 @@ except ImportError:
     def lru_cache():
         return lambda func: func
 
-from .tokenization_utils import PreTrainedTokenizer, clean_up_tokenization
+from .tokenization_utils import PreTrainedTokenizer
 
 logger = logging.getLogger(__name__)
 
diff --git a/pytorch_transformers/tokenization_transfo_xl.py b/pytorch_transformers/tokenization_transfo_xl.py
index 237f8ea387..992dff80d5 100644
--- a/pytorch_transformers/tokenization_transfo_xl.py
+++ b/pytorch_transformers/tokenization_transfo_xl.py
@@ -30,7 +30,7 @@ import torch
 import numpy as np
 
 from .file_utils import cached_path
-from .tokenization_utils import PreTrainedTokenizer, clean_up_tokenization
+from .tokenization_utils import PreTrainedTokenizer
 
 if sys.version_info[0] == 2:
     import cPickle as pickle
diff --git a/pytorch_transformers/tokenization_utils.py b/pytorch_transformers/tokenization_utils.py
index eaef2fed1e..556f094f6d 100644
--- a/pytorch_transformers/tokenization_utils.py
+++ b/pytorch_transformers/tokenization_utils.py
@@ -444,7 +444,7 @@ class PreTrainedTokenizer(object):
         filtered_tokens = self.convert_ids_to_tokens(token_ids, skip_special_tokens=skip_special_tokens)
         text = self.convert_tokens_to_string(filtered_tokens)
         if clean_up_tokenization_spaces:
-            text = clean_up_tokenization(text)
+            text = self.clean_up_tokenization(text)
         return text
 
     @property
@@ -480,10 +480,9 @@ class PreTrainedTokenizer(object):
         all_ids = list(self.convert_tokens_to_ids(t) for t in all_toks)
         return all_ids
 
-
-
-def clean_up_tokenization(out_string):
-    out_string = out_string.replace(' .', '.').replace(' ?', '?').replace(' !', '!').replace(' ,', ','
-                    ).replace(" ' ", "'").replace(" n't", "n't").replace(" 'm", "'m").replace(" do not", " don't"
-                    ).replace(" 's", "'s").replace(" 've", "'ve").replace(" 're", "'re")
-    return out_string
+    @staticmethod
+    def clean_up_tokenization(out_string):
+        out_string = out_string.replace(' .', '.').replace(' ?', '?').replace(' !', '!').replace(' ,', ','
+                        ).replace(" ' ", "'").replace(" n't", "n't").replace(" 'm", "'m").replace(" do not", " don't"
+                        ).replace(" 's", "'s").replace(" 've", "'ve").replace(" 're", "'re")
+        return out_string
diff --git a/pytorch_transformers/tokenization_xlnet.py b/pytorch_transformers/tokenization_xlnet.py
index a4f3fdfde2..919ac97bce 100644
--- a/pytorch_transformers/tokenization_xlnet.py
+++ b/pytorch_transformers/tokenization_xlnet.py
@@ -23,7 +23,7 @@ from shutil import copyfile
 import unicodedata
 import six
 
-from .tokenization_utils import PreTrainedTokenizer, clean_up_tokenization
+from .tokenization_utils import PreTrainedTokenizer
 
 logger = logging.getLogger(__name__)
 

From 28ba345eccb8a7af3e044f3dd82c1d661a065d80 Mon Sep 17 00:00:00 2001
From: Ethan Perez <perez@nyu.edu>
Date: Sun, 4 Aug 2019 12:31:46 -0400
Subject: [PATCH 045/200] Fixing unused weight_decay argument

Currently the L2 regularization is hard-coded to "0.01", even though there is a --weight_decay flag implemented (that is unused). I'm making this flag control the weight decay used for fine-tuning in this script.
---
 examples/single_model_scripts/run_openai_gpt.py | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/examples/single_model_scripts/run_openai_gpt.py b/examples/single_model_scripts/run_openai_gpt.py
index af737b953e..479c08782d 100644
--- a/examples/single_model_scripts/run_openai_gpt.py
+++ b/examples/single_model_scripts/run_openai_gpt.py
@@ -205,7 +205,7 @@ def main():
         param_optimizer = list(model.named_parameters())
         no_decay = ['bias', 'LayerNorm.bias', 'LayerNorm.weight']
         optimizer_grouped_parameters = [
-            {'params': [p for n, p in param_optimizer if not any(nd in n for nd in no_decay)], 'weight_decay': 0.01},
+            {'params': [p for n, p in param_optimizer if not any(nd in n for nd in no_decay)], 'weight_decay': args.weight_decay},
             {'params': [p for n, p in param_optimizer if any(nd in n for nd in no_decay)], 'weight_decay': 0.0}
             ]
         optimizer = AdamW(optimizer_grouped_parameters, lr=args.learning_rate, eps=args.adam_epsilon)

From 00132b7a7a79b7bed6574ad16550e50eb5af3a8f Mon Sep 17 00:00:00 2001
From: thomwolf <thomwolf@gmail.com>
Date: Sun, 4 Aug 2019 22:42:55 +0200
Subject: [PATCH 046/200] updating docs - adding few tests to tokenizers

---
 docs/source/index.rst                         |   1 -
 docs/source/main_classes/configuration.rst    |   2 +-
 docs/source/main_classes/model.rst            |   7 +
 .../main_classes/optimizer_schedules.rst      |  29 ++
 docs/source/main_classes/tokenizer.rst        |   8 +
 docs/source/model_doc/overview.rst            | 285 ------------------
 docs/source/quickstart.md                     |  17 +-
 docs/source/serialization.rst                 | 250 +++++++--------
 pytorch_transformers/modeling_utils.py        | 161 +++++-----
 pytorch_transformers/tokenization_utils.py    | 151 ++++++++--
 10 files changed, 390 insertions(+), 521 deletions(-)
 delete mode 100644 docs/source/model_doc/overview.rst

diff --git a/docs/source/index.rst b/docs/source/index.rst
index c403a0ad4f..b80fd8437b 100644
--- a/docs/source/index.rst
+++ b/docs/source/index.rst
@@ -40,7 +40,6 @@ The library currently contains PyTorch implementations, pre-trained model weight
     :maxdepth: 2
     :caption: Package Reference
 
-    model_doc/overview
     model_doc/bert
     model_doc/gpt
     model_doc/transformerxl
diff --git a/docs/source/main_classes/configuration.rst b/docs/source/main_classes/configuration.rst
index 5e069629b8..5181874c1a 100644
--- a/docs/source/main_classes/configuration.rst
+++ b/docs/source/main_classes/configuration.rst
@@ -1,7 +1,7 @@
 Configuration
 ----------------------------------------------------
 
-We provide a base class, ``PretrainedConfig``, which can load a pretrained instance either from a local file or directory or from a pretrained model configuration provided by the library (downloaded from HuggingFace AWS S3 repository).
+The base class ``PretrainedConfig`` implements the common methods for loading/saving a configuration either from a local file or directory, or from a pretrained model configuration provided by the library (downloaded from HuggingFace's AWS S3 repository).
 
 ``PretrainedConfig``
 ~~~~~~~~~~~~~~~~~~~~~
diff --git a/docs/source/main_classes/model.rst b/docs/source/main_classes/model.rst
index dd4c9d87dd..ba61afadf0 100644
--- a/docs/source/main_classes/model.rst
+++ b/docs/source/main_classes/model.rst
@@ -1,6 +1,13 @@
 Models
 ----------------------------------------------------
 
+The base class ``PreTrainedModel`` implements the common methods for loading/saving a model either from a local file or directory, or from a pretrained model configuration provided by the library (downloaded from HuggingFace's AWS S3 repository).
+
+``PreTrainedModel`` also implements a few methods which are common among all the models to:
+
+- resize the input token embeddings when new tokens are added to the vocabulary
+- prune the attention heads of the model.
+
 ``PreTrainedModel``
 ~~~~~~~~~~~~~~~~~~~~~
 
diff --git a/docs/source/main_classes/optimizer_schedules.rst b/docs/source/main_classes/optimizer_schedules.rst
index 2d91d495a4..70fefb7c6d 100644
--- a/docs/source/main_classes/optimizer_schedules.rst
+++ b/docs/source/main_classes/optimizer_schedules.rst
@@ -1,6 +1,11 @@
 Optimizer
 ----------------------------------------------------
 
+The ``.optimization`` module provides:
+
+- an optimizer with weight decay fixed that can be used to fine-tuned models, and
+- several schedules in the form of schedule objects that inherit from ``_LRSchedule``:
+
 ``AdamW``
 ~~~~~~~~~~~~~~~~
 
@@ -10,17 +15,41 @@ Optimizer
 Schedules
 ----------------------------------------------------
 
+Learning Rate Schedules
+^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+
 .. autoclass:: pytorch_transformers.ConstantLRSchedule
     :members:
 
+
 .. autoclass:: pytorch_transformers.WarmupConstantSchedule
     :members:
 
+.. image:: /imgs/warmup_constant_schedule.png
+    :target: /imgs/warmup_constant_schedule.png
+    :alt:
+
+
 .. autoclass:: pytorch_transformers.WarmupCosineSchedule
     :members:
 
+.. image:: /imgs/warmup_cosine_schedule.png
+    :target: /imgs/warmup_cosine_schedule.png
+    :alt:
+
+
 .. autoclass:: pytorch_transformers.WarmupCosineWithHardRestartsSchedule
     :members:
 
+.. image:: /imgs/warmup_cosine_hard_restarts_schedule.png
+    :target: /imgs/warmup_cosine_hard_restarts_schedule.png
+    :alt:
+
+
+
 .. autoclass:: pytorch_transformers.WarmupLinearSchedule
     :members:
+
+.. image:: /imgs/warmup_linear_schedule.png
+    :target: /imgs/warmup_linear_schedule.png
+    :alt:
diff --git a/docs/source/main_classes/tokenizer.rst b/docs/source/main_classes/tokenizer.rst
index cd6b4786bb..12ca5522de 100644
--- a/docs/source/main_classes/tokenizer.rst
+++ b/docs/source/main_classes/tokenizer.rst
@@ -1,6 +1,14 @@
 Tokenizer
 ----------------------------------------------------
 
+The base class ``PreTrainedTokenizer`` implements the common methods for loading/saving a tokenizer either from a local file or directory, or from a pretrained tokenizer provided by the library (downloaded from HuggingFace's AWS S3 repository).
+
+``PreTrainedTokenizer`` is the main entry point into tokenizers as it also implements the main methods for using all the tokenizers:
+
+- tokenizing, converting tokens to ids and back and encoding/decoding,
+- adding new tokens to the vocabulary in a way that is independant of the underlying structure (BPE, SentencePiece...),
+- managing special tokens (adding them, assigning them to roles, making sure they are not split during tokenization)
+
 ``PreTrainedTokenizer``
 ~~~~~~~~~~~~~~~~~~~~~~~~
 
diff --git a/docs/source/model_doc/overview.rst b/docs/source/model_doc/overview.rst
deleted file mode 100644
index 4cca4eb846..0000000000
--- a/docs/source/model_doc/overview.rst
+++ /dev/null
@@ -1,285 +0,0 @@
-Overview
-================================================
-
-
-Here is a detailed documentation of the classes in the package and how to use them:
-
-.. list-table::
-   :header-rows: 1
-
-   * - Sub-section
-     - Description
-   * - `Loading pre-trained weights <#loading-google-ai-or-openai-pre-trained-weights-or-pytorch-dump>`__
-     - How to load Google AI/OpenAI's pre-trained weight or a PyTorch saved instance
-   * - `Serialization best-practices <#serialization-best-practices>`__
-     - How to save and reload a fine-tuned model
-   * - `Configurations <#configurations>`__
-     - API of the configuration classes for BERT, GPT, GPT-2 and Transformer-XL
-
-
-TODO Lysandre filled: Removed Models/Tokenizers/Optimizers as no single link can be made.
-
-
-Configurations
-^^^^^^^^^^^^^^
-
-Models (BERT, GPT, GPT-2 and Transformer-XL) are defined and build from configuration classes which contains the
-parameters of the models (number of layers, dimensionalities...) and a few utilities to read and write from JSON
-configuration files. The respective configuration classes are:
-
-
-* ``BertConfig`` for ``BertModel`` and BERT classes instances.
-* ``OpenAIGPTConfig`` for ``OpenAIGPTModel`` and OpenAI GPT classes instances.
-* ``GPT2Config`` for ``GPT2Model`` and OpenAI GPT-2 classes instances.
-* ``TransfoXLConfig`` for ``TransfoXLModel`` and Transformer-XL classes instances.
-
-These configuration classes contains a few utilities to load and save configurations:
-
-
-* ``from_dict(cls, json_object)``\ : A class method to construct a configuration from a Python dictionary of parameters. Returns an instance of the configuration class.
-* ``from_json_file(cls, json_file)``\ : A class method to construct a configuration from a json file of parameters. Returns an instance of the configuration class.
-* ``to_dict()``\ : Serializes an instance to a Python dictionary. Returns a dictionary.
-* ``to_json_string()``\ : Serializes an instance to a JSON string. Returns a string.
-* ``to_json_file(json_file_path)``\ : Save an instance to a json file.
-
-
-Loading Google AI or OpenAI pre-trained weights or PyTorch dump
-^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
-
-``from_pretrained()`` method
-~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
-
-To load one of Google AI's, OpenAI's pre-trained models or a PyTorch saved model (an instance of ``BertForPreTraining`` saved with ``torch.save()``\ ), the PyTorch model classes and the tokenizer can be instantiated using the ``from_pretrained()`` method:
-
-.. code-block:: python
-
-   model = BERT_CLASS.from_pretrained(PRE_TRAINED_MODEL_NAME_OR_PATH, cache_dir=None, from_tf=False, state_dict=None, *input, **kwargs)
-
-where
-
-
-* ``BERT_CLASS`` is either a tokenizer to load the vocabulary (\ ``BertTokenizer`` or ``OpenAIGPTTokenizer`` classes) or one of the eight BERT or three OpenAI GPT PyTorch model classes (to load the pre-trained weights): ``BertModel``\ , ``BertForMaskedLM``\ , ``BertForNextSentencePrediction``\ , ``BertForPreTraining``\ , ``BertForSequenceClassification``\ , ``BertForTokenClassification``\ , ``BertForMultipleChoice``\ , ``BertForQuestionAnswering``\ , ``OpenAIGPTModel``\ , ``OpenAIGPTLMHeadModel`` or ``OpenAIGPTDoubleHeadsModel``\ , and
-*
-  ``PRE_TRAINED_MODEL_NAME_OR_PATH`` is either:
-
-
-  *
-    the shortcut name of a Google AI's or OpenAI's pre-trained model selected in the list:
-
-
-    * ``bert-base-uncased``: 12-layer, 768-hidden, 12-heads, 110M parameters
-    * ``bert-large-uncased``: 24-layer, 1024-hidden, 16-heads, 340M parameters
-    * ``bert-base-cased``: 12-layer, 768-hidden, 12-heads , 110M parameters
-    * ``bert-large-cased``: 24-layer, 1024-hidden, 16-heads, 340M parameters
-    * ``bert-base-multilingual-uncased``: (Orig, not recommended) 102 languages, 12-layer, 768-hidden, 12-heads, 110M parameters
-    * ``bert-base-multilingual-cased``: **(New, recommended)** 104 languages, 12-layer, 768-hidden, 12-heads, 110M parameters
-    * ``bert-base-chinese``: Chinese Simplified and Traditional, 12-layer, 768-hidden, 12-heads, 110M parameters
-    * ``bert-base-german-cased``: Trained on German data only, 12-layer, 768-hidden, 12-heads, 110M parameters `Performance Evaluation <https://deepset.ai/german-bert>`__
-    * ``bert-large-uncased-whole-word-masking``: 24-layer, 1024-hidden, 16-heads, 340M parameters - Trained with Whole Word Masking (mask all of the the tokens corresponding to a word at once)
-    * ``bert-large-cased-whole-word-masking``: 24-layer, 1024-hidden, 16-heads, 340M parameters - Trained with Whole Word Masking (mask all of the the tokens corresponding to a word at once)
-    * ``bert-large-uncased-whole-word-masking-finetuned-squad``: The ``bert-large-uncased-whole-word-masking`` model finetuned on SQuAD (using the ``run_bert_squad.py`` examples). Results: *exact_match: 86.91579943235573, f1: 93.1532499015869*
-    * ``openai-gpt``: OpenAI GPT English model, 12-layer, 768-hidden, 12-heads, 110M parameters
-    * ``gpt2``: OpenAI GPT-2 English model, 12-layer, 768-hidden, 12-heads, 117M parameters
-    * ``gpt2-medium``: OpenAI GPT-2 English model, 24-layer, 1024-hidden, 16-heads, 345M parameters
-    * ``transfo-xl-wt103``: Transformer-XL English model trained on wikitext-103, 18-layer, 1024-hidden, 16-heads, 257M parameters
-
-  *
-    a path or url to a pretrained model archive containing:
-
-
-    * ``bert_config.json`` or ``openai_gpt_config.json`` a configuration file for the model, and
-    * ``pytorch_model.bin`` a PyTorch dump of a pre-trained instance of ``BertForPreTraining``\ , ``OpenAIGPTModel``\ , ``TransfoXLModel``\ , ``GPT2LMHeadModel`` (saved with the usual ``torch.save()``\ )
-
-  If ``PRE_TRAINED_MODEL_NAME_OR_PATH`` is a shortcut name, the pre-trained weights will be downloaded from AWS S3 (see the links `here <https://github.com/huggingface/pytorch-transformers/blob/master/pytorch_transformers/modeling_bert.py>`__\ ) and stored in a cache folder to avoid future download (the cache folder can be found at ``~/.pytorch_pretrained_bert/``\ ).
-
-*
-  ``cache_dir`` can be an optional path to a specific directory to download and cache the pre-trained model weights. This option is useful in particular when you are using distributed training: to avoid concurrent access to the same weights you can set for example ``cache_dir='./pretrained_model_{}'.format(args.local_rank)`` (see the section on distributed training for more information).
-
-* ``from_tf``\ : should we load the weights from a locally saved TensorFlow checkpoint
-* ``state_dict``\ : an optional state dictionary (collections.OrderedDict object) to use instead of Google pre-trained models
-* ``*inputs``\ , `**kwargs`: additional input for the specific Bert class (ex: num_labels for BertForSequenceClassification)
-
-``Uncased`` means that the text has been lowercased before WordPiece tokenization, e.g., ``John Smith`` becomes ``john smith``. The Uncased model also strips out any accent markers. ``Cased`` means that the true case and accent markers are preserved. Typically, the Uncased model is better unless you know that case information is important for your task (e.g., Named Entity Recognition or Part-of-Speech tagging). For information about the Multilingual and Chinese model, see the `Multilingual README <https://github.com/google-research/bert/blob/master/multilingual.md>`__ or the original TensorFlow repository.
-
-When using an ``uncased model``\ , make sure to pass ``--do_lower_case`` to the example training scripts (or pass ``do_lower_case=True`` to FullTokenizer if you're using your own script and loading the tokenizer your-self.).
-
-Examples:
-
-.. code-block:: python
-
-   # BERT
-   tokenizer = BertTokenizer.from_pretrained('bert-base-uncased', do_lower_case=True, do_basic_tokenize=True)
-   model = BertForSequenceClassification.from_pretrained('bert-base-uncased')
-
-   # OpenAI GPT
-   tokenizer = OpenAIGPTTokenizer.from_pretrained('openai-gpt')
-   model = OpenAIGPTModel.from_pretrained('openai-gpt')
-
-   # Transformer-XL
-   tokenizer = TransfoXLTokenizer.from_pretrained('transfo-xl-wt103')
-   model = TransfoXLModel.from_pretrained('transfo-xl-wt103')
-
-   # OpenAI GPT-2
-   tokenizer = GPT2Tokenizer.from_pretrained('gpt2')
-   model = GPT2Model.from_pretrained('gpt2')
-
-Cache directory
-~~~~~~~~~~~~~~~
-
-``pytorch_pretrained_bert`` save the pretrained weights in a cache directory which is located at (in this order of priority):
-
-
-* ``cache_dir`` optional arguments to the ``from_pretrained()`` method (see above),
-* shell environment variable ``PYTORCH_PRETRAINED_BERT_CACHE``\ ,
-* PyTorch cache home + ``/pytorch_pretrained_bert/``
-  where PyTorch cache home is defined by (in this order):
-
-  * shell environment variable ``ENV_TORCH_HOME``
-  * shell environment variable ``ENV_XDG_CACHE_HOME`` + ``/torch/``\ )
-  * default: ``~/.cache/torch/``
-
-Usually, if you don't set any specific environment variable, ``pytorch_pretrained_bert`` cache will be at ``~/.cache/torch/pytorch_pretrained_bert/``.
-
-You can alsways safely delete ``pytorch_pretrained_bert`` cache but the pretrained model weights and vocabulary files wil have to be re-downloaded from our S3.
-
-Serialization best-practices
-^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
-
-This section explain how you can save and re-load a fine-tuned model (BERT, GPT, GPT-2 and Transformer-XL).
-There are three types of files you need to save to be able to reload a fine-tuned model:
-
-
-* the model it-self which should be saved following PyTorch serialization `best practices <https://pytorch.org/docs/stable/notes/serialization.html#best-practices>`__\ ,
-* the configuration file of the model which is saved as a JSON file, and
-* the vocabulary (and the merges for the BPE-based models GPT and GPT-2).
-
-The *default filenames* of these files are as follow:
-
-
-* the model weights file: ``pytorch_model.bin``\ ,
-* the configuration file: ``config.json``\ ,
-* the vocabulary file: ``vocab.txt`` for BERT and Transformer-XL, ``vocab.json`` for GPT/GPT-2 (BPE vocabulary),
-* for GPT/GPT-2 (BPE vocabulary) the additional merges file: ``merges.txt``.
-
-**If you save a model using these *default filenames*\ , you can then re-load the model and tokenizer using the ``from_pretrained()`` method.**
-
-Here is the recommended way of saving the model, configuration and vocabulary to an ``output_dir`` directory and reloading the model and tokenizer afterwards:
-
-.. code-block:: python
-
-   from pytorch_pretrained_bert import WEIGHTS_NAME, CONFIG_NAME
-
-   output_dir = "./models/"
-
-   # Step 1: Save a model, configuration and vocabulary that you have fine-tuned
-
-   # If we have a distributed model, save only the encapsulated model
-   # (it was wrapped in PyTorch DistributedDataParallel or DataParallel)
-   model_to_save = model.module if hasattr(model, 'module') else model
-
-   # If we save using the predefined names, we can load using `from_pretrained`
-   output_model_file = os.path.join(output_dir, WEIGHTS_NAME)
-   output_config_file = os.path.join(output_dir, CONFIG_NAME)
-
-   torch.save(model_to_save.state_dict(), output_model_file)
-   model_to_save.config.to_json_file(output_config_file)
-   tokenizer.save_vocabulary(output_dir)
-
-   # Step 2: Re-load the saved model and vocabulary
-
-   # Example for a Bert model
-   model = BertForQuestionAnswering.from_pretrained(output_dir)
-   tokenizer = BertTokenizer.from_pretrained(output_dir, do_lower_case=args.do_lower_case)  # Add specific options if needed
-   # Example for a GPT model
-   model = OpenAIGPTDoubleHeadsModel.from_pretrained(output_dir)
-   tokenizer = OpenAIGPTTokenizer.from_pretrained(output_dir)
-
-Here is another way you can save and reload the model if you want to use specific paths for each type of files:
-
-.. code-block:: python
-
-   output_model_file = "./models/my_own_model_file.bin"
-   output_config_file = "./models/my_own_config_file.bin"
-   output_vocab_file = "./models/my_own_vocab_file.bin"
-
-   # Step 1: Save a model, configuration and vocabulary that you have fine-tuned
-
-   # If we have a distributed model, save only the encapsulated model
-   # (it was wrapped in PyTorch DistributedDataParallel or DataParallel)
-   model_to_save = model.module if hasattr(model, 'module') else model
-
-   torch.save(model_to_save.state_dict(), output_model_file)
-   model_to_save.config.to_json_file(output_config_file)
-   tokenizer.save_vocabulary(output_vocab_file)
-
-   # Step 2: Re-load the saved model and vocabulary
-
-   # We didn't save using the predefined WEIGHTS_NAME, CONFIG_NAME names, we cannot load using `from_pretrained`.
-   # Here is how to do it in this situation:
-
-   # Example for a Bert model
-   config = BertConfig.from_json_file(output_config_file)
-   model = BertForQuestionAnswering(config)
-   state_dict = torch.load(output_model_file)
-   model.load_state_dict(state_dict)
-   tokenizer = BertTokenizer(output_vocab_file, do_lower_case=args.do_lower_case)
-
-   # Example for a GPT model
-   config = OpenAIGPTConfig.from_json_file(output_config_file)
-   model = OpenAIGPTDoubleHeadsModel(config)
-   state_dict = torch.load(output_model_file)
-   model.load_state_dict(state_dict)
-   tokenizer = OpenAIGPTTokenizer(output_vocab_file)
-
-Learning Rate Schedules
-^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
-
-The ``.optimization`` module also provides additional schedules in the form of schedule objects that inherit from ``_LRSchedule``.
-All ``_LRSchedule`` subclasses accept ``warmup`` and ``t_total`` arguments at construction.
-When an ``_LRSchedule`` object is passed into ``AdamW``\ ,
-the ``warmup`` and ``t_total`` arguments on the optimizer are ignored and the ones in the ``_LRSchedule`` object are used.
-An overview of the implemented schedules:
-
-
-* ``ConstantLR``\ : always returns learning rate 1.
-* ``WarmupConstantSchedule`` : Linearly increases learning rate from 0 to 1 over ``warmup`` fraction of training steps. \
-    Keeps learning rate equal to 1. after warmup.
-
-  .. image:: /imgs/warmup_constant_schedule.png
-     :target: /imgs/warmup_constant_schedule.png
-     :alt:
-
-
-* ``WarmupLinearSchedule`` : Linearly increases learning rate from 0 to 1 over ``warmup`` fraction of training steps. \
-    Linearly decreases learning rate from 1. to 0. over remaining ``1 - warmup`` steps.
-
-  .. image:: /imgs/warmup_linear_schedule.png
-     :target: /imgs/warmup_linear_schedule.png
-     :alt:
-
-
-* ``WarmupCosineSchedule`` : Linearly increases learning rate from 0 to 1 over ``warmup`` fraction of training steps. \
-  Decreases learning rate from 1. to 0. over remaining ``1 - warmup`` steps following a cosine curve. \
-  If ``cycles`` (default=0.5) is different from default, learning rate follows cosine function after warmup.
-
-  .. image:: /imgs/warmup_cosine_schedule.png
-     :target: /imgs/warmup_cosine_schedule.png
-     :alt:
-
-
-* ``WarmupCosineWithHardRestartsSchedule`` : Linearly increases learning rate from 0 to 1 over ``warmup`` fraction of training steps.
-  If ``cycles`` (default=1.) is different from default, learning rate follows ``cycles`` times a cosine decaying learning rate (with hard restarts).
-
-  .. image:: /imgs/warmup_cosine_hard_restarts_schedule.png
-     :target: /imgs/warmup_cosine_hard_restarts_schedule.png
-     :alt:
-
-
-* ``WarmupCosineWithWarmupRestartsSchedule`` : All training progress is divided in ``cycles`` (default=1.) parts of equal length.
-  Every part follows a schedule with the first ``warmup`` fraction of the training steps linearly increasing from 0. to 1.,
-  followed by a learning rate decreasing from 1. to 0. following a cosine curve.
-  Note that the total number of all warmup steps over all cycles together is equal to ``warmup`` * ``cycles``
-
-  .. image:: /imgs/warmup_cosine_warm_restarts_schedule.png
-     :target: /imgs/warmup_cosine_warm_restarts_schedule.png
-     :alt:
\ No newline at end of file
diff --git a/docs/source/quickstart.md b/docs/source/quickstart.md
index 814021038a..f037a95a3a 100644
--- a/docs/source/quickstart.md
+++ b/docs/source/quickstart.md
@@ -8,13 +8,13 @@ The library was designed with two strong goals in mind:
 
 - be as easy and fast to use as possible:
 
-  - we strongly limited the number of abstractions to learn, in fact there are almost no abstractions, just three standard classes for each model: configuration, models and tokenizer,
-  - each pretrained model configuration, weights and vocabulary can be downloaded, cached and loaded in the related class in a simple way by using a common `from_pretrained()` instantiation method.
-  - this library is NOT a modular toolbox of building blocks for neural nets, to extend/build-upon the library, just use your regular Python/PyTorch modules and inherit from the base classes of the library to reuse functionalities like model loading/saving.
+  - we strongly limited the number of user-facing abstractions to learn, in fact there are almost no abstractions, just three standard classes required to use each model: configuration, models and tokenizer,
+  - all of these classes can be initialized in a simple and unified way from pretrained instances by using a common `from_pretrained()` instantiation method which will take care of downloading (if needed), caching and loading the related class from a pretrained instance supplied in the library or your own saved instance.
+  - as a consequence, this library is NOT a modular toolbox of building blocks for neural nets. If you want to extend/build-upon the library, just use regular Python/PyTorch modules and inherit from the base classes of the library to reuse functionalities like model loading/saving.
 
 - provide state-of-the-art models with performances as close as possible to the original models:
 
-  - we provide at least one example for each model which reproduces a result provided by the official authors of said model,
+  - we provide at least one example for each architecture which reproduces a result provided by the official authors of said architecture,
   - the code is usually as close to the original code base as possible which means some PyTorch code may be not as *pytorchic* as it could be as a result of being converted TensorFlow code.
 
 A few other goals:
@@ -34,15 +34,18 @@ A few other goals:
 The library is build around three type of classes for each models:
 
 - **model classes** which are PyTorch models (`torch.nn.Modules`) of the 6 models architectures currently provided in the library, e.g. `BertModel`
-- **configuration classes** which store all the parameters required to build a model, e.g. `BertConfig`
-- **tokenizer classes** which store the vocabulary for each model and provide methods for encoding strings in list of token embeddings indices to be fed to a model, e.g. `BertTokenizer`
+- **configuration classes** which store all the parameters required to build a model, e.g. `BertConfig`. You don't always need to instantiate these your-self, in particular if you are using a pretrained model without any modification, creating the model will automatically take care of instantiating the configuration (which is part of the model)
+- **tokenizer classes** which store the vocabulary for each model and provide methods for encoding/decoding strings in list of token embeddings indices to be fed to a model, e.g. `BertTokenizer`
 
 All these classes can be instantiated from pretrained instances and saved locally using two methods:
 
 - `from_pretrained()` let you instantiate a model/configuration/tokenizer from a pretrained version either provided by the library itself (currently 27 models are provided as listed [here](https://huggingface.co/pytorch-transformers/pretrained_models.html)) or stored locally (or on a server) by the user,
 - `save_pretrained()` let you save a model/configuration/tokenizer locally so that it can be reloaded using `from_pretrained()`.
 
-Let's go through a few simple quick-start examples to see how we can instantiate and use these classes.
+We'll finish this quickstart tour by going through a few simple quick-start examples to see how we can instantiate and use these classes. The rest of the documentation is organized in two parts:
+
+- the **MAIN CLASSES** section details the common functionalities/method/attributes of the three main type of classes (configuration, model, tokenizer) plus some optimization related classes provided as utilities for training,
+- the **PACKAGE REFERENCE** section details all the variants of each class for each model architectures and in particular the input/output that you should expect when calling each of them.
 
 ## Quick tour: Usage
 
diff --git a/docs/source/serialization.rst b/docs/source/serialization.rst
index c0de1324cf..61854f61ea 100644
--- a/docs/source/serialization.rst
+++ b/docs/source/serialization.rst
@@ -1,174 +1,188 @@
-Serialization
-----------------------------------------------------
+Loading Google AI or OpenAI pre-trained weights or PyTorch dump
+^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
 
-### Loading Google AI or OpenAI pre-trained weights or PyTorch dump
+``from_pretrained()`` method
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
 
-### `from_pretrained()` method
+To load one of Google AI's, OpenAI's pre-trained models or a PyTorch saved model (an instance of ``BertForPreTraining`` saved with ``torch.save()``\ ), the PyTorch model classes and the tokenizer can be instantiated using the ``from_pretrained()`` method:
 
-To load one of Google AI's, OpenAI's pre-trained models or a PyTorch saved model (an instance of `BertForPreTraining` saved with `torch.save()`), the PyTorch model classes and the tokenizer can be instantiated using the `from_pretrained()` method:
+.. code-block:: python
 
-```python
-model = BERT_CLASS.from_pretrained(PRE_TRAINED_MODEL_NAME_OR_PATH, cache_dir=None, from_tf=False, state_dict=None, *input, **kwargs)
-```
+   model = BERT_CLASS.from_pretrained(PRE_TRAINED_MODEL_NAME_OR_PATH, cache_dir=None, from_tf=False, state_dict=None, *input, **kwargs)
 
 where
 
-- `BERT_CLASS` is either a tokenizer to load the vocabulary (`BertTokenizer` or `OpenAIGPTTokenizer` classes) or one of the eight BERT or three OpenAI GPT PyTorch model classes (to load the pre-trained weights): `BertModel`, `BertForMaskedLM`, `BertForNextSentencePrediction`, `BertForPreTraining`, `BertForSequenceClassification`, `BertForTokenClassification`, `BertForMultipleChoice`, `BertForQuestionAnswering`, `OpenAIGPTModel`, `OpenAIGPTLMHeadModel` or `OpenAIGPTDoubleHeadsModel`, and
-- `PRE_TRAINED_MODEL_NAME_OR_PATH` is either:
 
-  - the shortcut name of a Google AI's or OpenAI's pre-trained model selected in the list:
+* ``BERT_CLASS`` is either a tokenizer to load the vocabulary (\ ``BertTokenizer`` or ``OpenAIGPTTokenizer`` classes) or one of the eight BERT or three OpenAI GPT PyTorch model classes (to load the pre-trained weights): ``BertModel``\ , ``BertForMaskedLM``\ , ``BertForNextSentencePrediction``\ , ``BertForPreTraining``\ , ``BertForSequenceClassification``\ , ``BertForTokenClassification``\ , ``BertForMultipleChoice``\ , ``BertForQuestionAnswering``\ , ``OpenAIGPTModel``\ , ``OpenAIGPTLMHeadModel`` or ``OpenAIGPTDoubleHeadsModel``\ , and
+*
+  ``PRE_TRAINED_MODEL_NAME_OR_PATH`` is either:
 
-    - `bert-base-uncased`: 12-layer, 768-hidden, 12-heads, 110M parameters
-    - `bert-large-uncased`: 24-layer, 1024-hidden, 16-heads, 340M parameters
-    - `bert-base-cased`: 12-layer, 768-hidden, 12-heads , 110M parameters
-    - `bert-large-cased`: 24-layer, 1024-hidden, 16-heads, 340M parameters
-    - `bert-base-multilingual-uncased`: (Orig, not recommended) 102 languages, 12-layer, 768-hidden, 12-heads, 110M parameters
-    - `bert-base-multilingual-cased`: **(New, recommended)** 104 languages, 12-layer, 768-hidden, 12-heads, 110M parameters
-    - `bert-base-chinese`: Chinese Simplified and Traditional, 12-layer, 768-hidden, 12-heads, 110M parameters
-    - `bert-base-german-cased`: Trained on German data only, 12-layer, 768-hidden, 12-heads, 110M parameters [Performance Evaluation](https://deepset.ai/german-bert)
-    - `bert-large-uncased-whole-word-masking`: 24-layer, 1024-hidden, 16-heads, 340M parameters - Trained with Whole Word Masking (mask all of the the tokens corresponding to a word at once)
-    - `bert-large-cased-whole-word-masking`: 24-layer, 1024-hidden, 16-heads, 340M parameters - Trained with Whole Word Masking (mask all of the the tokens corresponding to a word at once)
-    - `bert-large-uncased-whole-word-masking-finetuned-squad`: The `bert-large-uncased-whole-word-masking` model finetuned on SQuAD (using the `run_bert_squad.py` examples). Results: *exact_match: 86.91579943235573, f1: 93.1532499015869*
-    - `openai-gpt`: OpenAI GPT English model, 12-layer, 768-hidden, 12-heads, 110M parameters
-    - `gpt2`: OpenAI GPT-2 English model, 12-layer, 768-hidden, 12-heads, 117M parameters
-    - `gpt2-medium`: OpenAI GPT-2 English model, 24-layer, 1024-hidden, 16-heads, 345M parameters
-    - `transfo-xl-wt103`: Transformer-XL English model trained on wikitext-103, 18-layer, 1024-hidden, 16-heads, 257M parameters
 
-  - a path or url to a pretrained model archive containing:
+  *
+    the shortcut name of a Google AI's or OpenAI's pre-trained model selected in the list:
 
-    - `bert_config.json` or `openai_gpt_config.json` a configuration file for the model, and
-    - `pytorch_model.bin` a PyTorch dump of a pre-trained instance of `BertForPreTraining`, `OpenAIGPTModel`, `TransfoXLModel`, `GPT2LMHeadModel` (saved with the usual `torch.save()`)
 
-  If `PRE_TRAINED_MODEL_NAME_OR_PATH` is a shortcut name, the pre-trained weights will be downloaded from AWS S3 (see the links [here](pytorch_transformers/modeling.py)) and stored in a cache folder to avoid future download (the cache folder can be found at `~/.pytorch_transformers/`).
+    * ``bert-base-uncased``: 12-layer, 768-hidden, 12-heads, 110M parameters
+    * ``bert-large-uncased``: 24-layer, 1024-hidden, 16-heads, 340M parameters
+    * ``bert-base-cased``: 12-layer, 768-hidden, 12-heads , 110M parameters
+    * ``bert-large-cased``: 24-layer, 1024-hidden, 16-heads, 340M parameters
+    * ``bert-base-multilingual-uncased``: (Orig, not recommended) 102 languages, 12-layer, 768-hidden, 12-heads, 110M parameters
+    * ``bert-base-multilingual-cased``: **(New, recommended)** 104 languages, 12-layer, 768-hidden, 12-heads, 110M parameters
+    * ``bert-base-chinese``: Chinese Simplified and Traditional, 12-layer, 768-hidden, 12-heads, 110M parameters
+    * ``bert-base-german-cased``: Trained on German data only, 12-layer, 768-hidden, 12-heads, 110M parameters `Performance Evaluation <https://deepset.ai/german-bert>`__
+    * ``bert-large-uncased-whole-word-masking``: 24-layer, 1024-hidden, 16-heads, 340M parameters - Trained with Whole Word Masking (mask all of the the tokens corresponding to a word at once)
+    * ``bert-large-cased-whole-word-masking``: 24-layer, 1024-hidden, 16-heads, 340M parameters - Trained with Whole Word Masking (mask all of the the tokens corresponding to a word at once)
+    * ``bert-large-uncased-whole-word-masking-finetuned-squad``: The ``bert-large-uncased-whole-word-masking`` model finetuned on SQuAD (using the ``run_bert_squad.py`` examples). Results: *exact_match: 86.91579943235573, f1: 93.1532499015869*
+    * ``openai-gpt``: OpenAI GPT English model, 12-layer, 768-hidden, 12-heads, 110M parameters
+    * ``gpt2``: OpenAI GPT-2 English model, 12-layer, 768-hidden, 12-heads, 117M parameters
+    * ``gpt2-medium``: OpenAI GPT-2 English model, 24-layer, 1024-hidden, 16-heads, 345M parameters
+    * ``transfo-xl-wt103``: Transformer-XL English model trained on wikitext-103, 18-layer, 1024-hidden, 16-heads, 257M parameters
 
-- `cache_dir` can be an optional path to a specific directory to download and cache the pre-trained model weights. This option is useful in particular when you are using distributed training: to avoid concurrent access to the same weights you can set for example `cache_dir='./pretrained_model_{}'.format(args.local_rank)` (see the section on distributed training for more information).
-- `from_tf`: should we load the weights from a locally saved TensorFlow checkpoint
-- `state_dict`: an optional state dictionary (collections.OrderedDict object) to use instead of Google pre-trained models
-- `*inputs`, `**kwargs`: additional input for the specific Bert class (ex: num_labels for BertForSequenceClassification)
+  *
+    a path or url to a pretrained model archive containing:
 
-`Uncased` means that the text has been lowercased before WordPiece tokenization, e.g., `John Smith` becomes `john smith`. The Uncased model also strips out any accent markers. `Cased` means that the true case and accent markers are preserved. Typically, the Uncased model is better unless you know that case information is important for your task (e.g., Named Entity Recognition or Part-of-Speech tagging). For information about the Multilingual and Chinese model, see the [Multilingual README](https://github.com/google-research/bert/blob/master/multilingual.md) or the original TensorFlow repository.
 
-**When using an `uncased model`, make sure to pass `--do_lower_case` to the example training scripts (or pass `do_lower_case=True` to FullTokenizer if you're using your own script and loading the tokenizer your-self.).**
+    * ``bert_config.json`` or ``openai_gpt_config.json`` a configuration file for the model, and
+    * ``pytorch_model.bin`` a PyTorch dump of a pre-trained instance of ``BertForPreTraining``\ , ``OpenAIGPTModel``\ , ``TransfoXLModel``\ , ``GPT2LMHeadModel`` (saved with the usual ``torch.save()``\ )
+
+  If ``PRE_TRAINED_MODEL_NAME_OR_PATH`` is a shortcut name, the pre-trained weights will be downloaded from AWS S3 (see the links `here <https://github.com/huggingface/pytorch-transformers/blob/master/pytorch_transformers/modeling_bert.py>`__\ ) and stored in a cache folder to avoid future download (the cache folder can be found at ``~/.pytorch_pretrained_bert/``\ ).
+
+*
+  ``cache_dir`` can be an optional path to a specific directory to download and cache the pre-trained model weights. This option is useful in particular when you are using distributed training: to avoid concurrent access to the same weights you can set for example ``cache_dir='./pretrained_model_{}'.format(args.local_rank)`` (see the section on distributed training for more information).
+
+* ``from_tf``\ : should we load the weights from a locally saved TensorFlow checkpoint
+* ``state_dict``\ : an optional state dictionary (collections.OrderedDict object) to use instead of Google pre-trained models
+* ``*inputs``\ , `**kwargs`: additional input for the specific Bert class (ex: num_labels for BertForSequenceClassification)
+
+``Uncased`` means that the text has been lowercased before WordPiece tokenization, e.g., ``John Smith`` becomes ``john smith``. The Uncased model also strips out any accent markers. ``Cased`` means that the true case and accent markers are preserved. Typically, the Uncased model is better unless you know that case information is important for your task (e.g., Named Entity Recognition or Part-of-Speech tagging). For information about the Multilingual and Chinese model, see the `Multilingual README <https://github.com/google-research/bert/blob/master/multilingual.md>`__ or the original TensorFlow repository.
+
+When using an ``uncased model``\ , make sure to pass ``--do_lower_case`` to the example training scripts (or pass ``do_lower_case=True`` to FullTokenizer if you're using your own script and loading the tokenizer your-self.).
 
 Examples:
 
-```python
-# BERT
-tokenizer = BertTokenizer.from_pretrained('bert-base-uncased', do_lower_case=True, do_basic_tokenize=True)
-model = BertForSequenceClassification.from_pretrained('bert-base-uncased')
+.. code-block:: python
 
-# OpenAI GPT
-tokenizer = OpenAIGPTTokenizer.from_pretrained('openai-gpt')
-model = OpenAIGPTModel.from_pretrained('openai-gpt')
+   # BERT
+   tokenizer = BertTokenizer.from_pretrained('bert-base-uncased', do_lower_case=True, do_basic_tokenize=True)
+   model = BertForSequenceClassification.from_pretrained('bert-base-uncased')
 
-# Transformer-XL
-tokenizer = TransfoXLTokenizer.from_pretrained('transfo-xl-wt103')
-model = TransfoXLModel.from_pretrained('transfo-xl-wt103')
+   # OpenAI GPT
+   tokenizer = OpenAIGPTTokenizer.from_pretrained('openai-gpt')
+   model = OpenAIGPTModel.from_pretrained('openai-gpt')
 
-# OpenAI GPT-2
-tokenizer = GPT2Tokenizer.from_pretrained('gpt2')
-model = GPT2Model.from_pretrained('gpt2')
+   # Transformer-XL
+   tokenizer = TransfoXLTokenizer.from_pretrained('transfo-xl-wt103')
+   model = TransfoXLModel.from_pretrained('transfo-xl-wt103')
 
-```
+   # OpenAI GPT-2
+   tokenizer = GPT2Tokenizer.from_pretrained('gpt2')
+   model = GPT2Model.from_pretrained('gpt2')
 
-#### Cache directory
+Cache directory
+~~~~~~~~~~~~~~~
 
-`pytorch_transformers` save the pretrained weights in a cache directory which is located at (in this order of priority):
+``pytorch_pretrained_bert`` save the pretrained weights in a cache directory which is located at (in this order of priority):
 
-- `cache_dir` optional arguments to the `from_pretrained()` method (see above),
-- shell environment variable `PYTORCH_PRETRAINED_BERT_CACHE`,
-- PyTorch cache home + `/pytorch_transformers/`
+
+* ``cache_dir`` optional arguments to the ``from_pretrained()`` method (see above),
+* shell environment variable ``PYTORCH_PRETRAINED_BERT_CACHE``\ ,
+* PyTorch cache home + ``/pytorch_pretrained_bert/``
   where PyTorch cache home is defined by (in this order):
-  - shell environment variable `ENV_TORCH_HOME`
-  - shell environment variable `ENV_XDG_CACHE_HOME` + `/torch/`)
-  - default: `~/.cache/torch/`
 
-Usually, if you don't set any specific environment variable, `pytorch_transformers` cache will be at `~/.cache/torch/pytorch_transformers/`.
+  * shell environment variable ``ENV_TORCH_HOME``
+  * shell environment variable ``ENV_XDG_CACHE_HOME`` + ``/torch/``\ )
+  * default: ``~/.cache/torch/``
 
-You can alsways safely delete `pytorch_transformers` cache but the pretrained model weights and vocabulary files wil have to be re-downloaded from our S3.
+Usually, if you don't set any specific environment variable, ``pytorch_pretrained_bert`` cache will be at ``~/.cache/torch/pytorch_pretrained_bert/``.
 
-### Serialization best-practices
+You can alsways safely delete ``pytorch_pretrained_bert`` cache but the pretrained model weights and vocabulary files wil have to be re-downloaded from our S3.
+
+Serialization best-practices
+^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
 
 This section explain how you can save and re-load a fine-tuned model (BERT, GPT, GPT-2 and Transformer-XL).
 There are three types of files you need to save to be able to reload a fine-tuned model:
 
-- the model it-self which should be saved following PyTorch serialization [best practices](https://pytorch.org/docs/stable/notes/serialization.html#best-practices),
-- the configuration file of the model which is saved as a JSON file, and
-- the vocabulary (and the merges for the BPE-based models GPT and GPT-2).
+
+* the model it-self which should be saved following PyTorch serialization `best practices <https://pytorch.org/docs/stable/notes/serialization.html#best-practices>`__\ ,
+* the configuration file of the model which is saved as a JSON file, and
+* the vocabulary (and the merges for the BPE-based models GPT and GPT-2).
 
 The *default filenames* of these files are as follow:
 
-- the model weights file: `pytorch_model.bin`,
-- the configuration file: `config.json`,
-- the vocabulary file: `vocab.txt` for BERT and Transformer-XL, `vocab.json` for GPT/GPT-2 (BPE vocabulary),
-- for GPT/GPT-2 (BPE vocabulary) the additional merges file: `merges.txt`.
 
-**If you save a model using these *default filenames*, you can then re-load the model and tokenizer using the `from_pretrained()` method.**
+* the model weights file: ``pytorch_model.bin``\ ,
+* the configuration file: ``config.json``\ ,
+* the vocabulary file: ``vocab.txt`` for BERT and Transformer-XL, ``vocab.json`` for GPT/GPT-2 (BPE vocabulary),
+* for GPT/GPT-2 (BPE vocabulary) the additional merges file: ``merges.txt``.
 
-Here is the recommended way of saving the model, configuration and vocabulary to an `output_dir` directory and reloading the model and tokenizer afterwards:
+**If you save a model using these *default filenames*\ , you can then re-load the model and tokenizer using the ``from_pretrained()`` method.**
 
-```python
-from pytorch_transformers import WEIGHTS_NAME, CONFIG_NAME
+Here is the recommended way of saving the model, configuration and vocabulary to an ``output_dir`` directory and reloading the model and tokenizer afterwards:
 
-output_dir = "./models/"
+.. code-block:: python
 
-# Step 1: Save a model, configuration and vocabulary that you have fine-tuned
+   from pytorch_pretrained_bert import WEIGHTS_NAME, CONFIG_NAME
 
-# If we have a distributed model, save only the encapsulated model
-# (it was wrapped in PyTorch DistributedDataParallel or DataParallel)
-model_to_save = model.module if hasattr(model, 'module') else model
+   output_dir = "./models/"
 
-# If we save using the predefined names, we can load using `from_pretrained`
-output_model_file = os.path.join(output_dir, WEIGHTS_NAME)
-output_config_file = os.path.join(output_dir, CONFIG_NAME)
+   # Step 1: Save a model, configuration and vocabulary that you have fine-tuned
 
-torch.save(model_to_save.state_dict(), output_model_file)
-model_to_save.config.to_json_file(output_config_file)
-tokenizer.save_vocabulary(output_dir)
+   # If we have a distributed model, save only the encapsulated model
+   # (it was wrapped in PyTorch DistributedDataParallel or DataParallel)
+   model_to_save = model.module if hasattr(model, 'module') else model
 
-# Step 2: Re-load the saved model and vocabulary
+   # If we save using the predefined names, we can load using `from_pretrained`
+   output_model_file = os.path.join(output_dir, WEIGHTS_NAME)
+   output_config_file = os.path.join(output_dir, CONFIG_NAME)
 
-# Example for a Bert model
-model = BertForQuestionAnswering.from_pretrained(output_dir)
-tokenizer = BertTokenizer.from_pretrained(output_dir, do_lower_case=args.do_lower_case)  # Add specific options if needed
-# Example for a GPT model
-model = OpenAIGPTDoubleHeadsModel.from_pretrained(output_dir)
-tokenizer = OpenAIGPTTokenizer.from_pretrained(output_dir)
-```
+   torch.save(model_to_save.state_dict(), output_model_file)
+   model_to_save.config.to_json_file(output_config_file)
+   tokenizer.save_vocabulary(output_dir)
+
+   # Step 2: Re-load the saved model and vocabulary
+
+   # Example for a Bert model
+   model = BertForQuestionAnswering.from_pretrained(output_dir)
+   tokenizer = BertTokenizer.from_pretrained(output_dir, do_lower_case=args.do_lower_case)  # Add specific options if needed
+   # Example for a GPT model
+   model = OpenAIGPTDoubleHeadsModel.from_pretrained(output_dir)
+   tokenizer = OpenAIGPTTokenizer.from_pretrained(output_dir)
 
 Here is another way you can save and reload the model if you want to use specific paths for each type of files:
 
-```python
-output_model_file = "./models/my_own_model_file.bin"
-output_config_file = "./models/my_own_config_file.bin"
-output_vocab_file = "./models/my_own_vocab_file.bin"
+.. code-block:: python
 
-# Step 1: Save a model, configuration and vocabulary that you have fine-tuned
+   output_model_file = "./models/my_own_model_file.bin"
+   output_config_file = "./models/my_own_config_file.bin"
+   output_vocab_file = "./models/my_own_vocab_file.bin"
 
-# If we have a distributed model, save only the encapsulated model
-# (it was wrapped in PyTorch DistributedDataParallel or DataParallel)
-model_to_save = model.module if hasattr(model, 'module') else model
+   # Step 1: Save a model, configuration and vocabulary that you have fine-tuned
 
-torch.save(model_to_save.state_dict(), output_model_file)
-model_to_save.config.to_json_file(output_config_file)
-tokenizer.save_vocabulary(output_vocab_file)
+   # If we have a distributed model, save only the encapsulated model
+   # (it was wrapped in PyTorch DistributedDataParallel or DataParallel)
+   model_to_save = model.module if hasattr(model, 'module') else model
 
-# Step 2: Re-load the saved model and vocabulary
+   torch.save(model_to_save.state_dict(), output_model_file)
+   model_to_save.config.to_json_file(output_config_file)
+   tokenizer.save_vocabulary(output_vocab_file)
 
-# We didn't save using the predefined WEIGHTS_NAME, CONFIG_NAME names, we cannot load using `from_pretrained`.
-# Here is how to do it in this situation:
+   # Step 2: Re-load the saved model and vocabulary
 
-# Example for a Bert model
-config = BertConfig.from_json_file(output_config_file)
-model = BertForQuestionAnswering(config)
-state_dict = torch.load(output_model_file)
-model.load_state_dict(state_dict)
-tokenizer = BertTokenizer(output_vocab_file, do_lower_case=args.do_lower_case)
+   # We didn't save using the predefined WEIGHTS_NAME, CONFIG_NAME names, we cannot load using `from_pretrained`.
+   # Here is how to do it in this situation:
+
+   # Example for a Bert model
+   config = BertConfig.from_json_file(output_config_file)
+   model = BertForQuestionAnswering(config)
+   state_dict = torch.load(output_model_file)
+   model.load_state_dict(state_dict)
+   tokenizer = BertTokenizer(output_vocab_file, do_lower_case=args.do_lower_case)
+
+   # Example for a GPT model
+   config = OpenAIGPTConfig.from_json_file(output_config_file)
+   model = OpenAIGPTDoubleHeadsModel(config)
+   state_dict = torch.load(output_model_file)
+   model.load_state_dict(state_dict)
+   tokenizer = OpenAIGPTTokenizer(output_vocab_file)
 
-# Example for a GPT model
-config = OpenAIGPTConfig.from_json_file(output_config_file)
-model = OpenAIGPTDoubleHeadsModel(config)
-state_dict = torch.load(output_model_file)
-model.load_state_dict(state_dict)
-tokenizer = OpenAIGPTTokenizer(output_vocab_file)
-```
diff --git a/pytorch_transformers/modeling_utils.py b/pytorch_transformers/modeling_utils.py
index f21927e18c..8970cd56f8 100644
--- a/pytorch_transformers/modeling_utils.py
+++ b/pytorch_transformers/modeling_utils.py
@@ -54,20 +54,22 @@ else:
 
 
 class PretrainedConfig(object):
-    """ Base class for all configuration classes.
-        Handle a few common attributes and methods for loading/downloading/saving configurations.
+    r""" Base class for all configuration classes.
+        Handles a few parameters common to all models' configurations as well as methods for loading/downloading/saving configurations.
+
+        Class attributes (overridden by derived classes):
+            - ``pretrained_config_archive_map``: a python ``dict`` of with `short-cut-names` (string) as keys and `url` (string) of associated pretrained model configurations as values.
+
+        Parameters:
+            ``finetuning_task``: string, default `None`. Name of the task used to fine-tune the model. This can be used when converting from an original (TensorFlow or PyTorch) checkpoint.
+            ``num_labels``: integer, default `2`. Number of classes to use when the model is a classification model (sequences/tokens)
+            ``output_attentions``: boolean, default `False`. Should the model returns attentions weights.
+            ``output_hidden_states``: string, default `False`. Should the model returns all hidden-states.
+            ``torchscript``: string, default `False`. Is the model used with Torchscript.
     """
     pretrained_config_archive_map = {}
 
     def __init__(self, **kwargs):
-        r""" The initialization of :class:`~pytorch_transformers.PretrainedConfig` extracts
-            a few configuration attributes from `**kwargs` which are common to all models:
-                - `finetuning_task`: string, default `None`. Name of the task used to fine-tune the model (used to convert from original checkpoint)
-                - `num_labels`: integer, default `2`. Number of classes to use when the model is a classification model (sequences/tokens)
-                - `output_attentions`: boolean, default `False`. Should the model returns attentions weights.
-                - `output_hidden_states`: string, default `False`. Should the model returns all hidden-states.
-                - `torchscript`: string, default `False`. Is the model used with Torchscript.
-        """
         self.finetuning_task = kwargs.pop('finetuning_task', None)
         self.num_labels = kwargs.pop('num_labels', 2)
         self.output_attentions = kwargs.pop('output_attentions', False)
@@ -76,7 +78,7 @@ class PretrainedConfig(object):
 
     def save_pretrained(self, save_directory):
         """ Save a configuration object to the directory `save_directory`, so that it
-            can be re-loaded using the `from_pretrained(save_directory)` class method.
+            can be re-loaded using the :func:`~pytorch_transformers.PretrainedConfig.from_pretrained` class method.
         """
         assert os.path.isdir(save_directory), "Saving path should be a directory where the model and configuration can be saved"
 
@@ -87,32 +89,29 @@ class PretrainedConfig(object):
 
     @classmethod
     def from_pretrained(cls, pretrained_model_name_or_path, **kwargs):
-        r""" Instantiate a PretrainedConfig from a pre-trained model configuration.
+        r""" Instantiate a :class:`~pytorch_transformers.PretrainedConfig` (or a derived class) from a pre-trained model configuration.
 
         Parameters:
-            **pretrained_model_name_or_path**: either:
+            pretrained_model_name_or_path: either:
 
                 - a string with the `shortcut name` of a pre-trained model configuration to load from cache or download, e.g.: ``bert-base-uncased``.
-                - a path to a `directory` containing a configuration file saved using the `save_pretrained(save_directory)` method, e.g.: ``./my_model_directory/``.
+                - a path to a `directory` containing a configuration file saved using the :func:`~pytorch_transformers.PretrainedConfig.save_pretrained` method, e.g.: ``./my_model_directory/``.
                 - a path or url to a saved configuration JSON `file`, e.g.: ``./my_model_directory/configuration.json``.
 
-            **cache_dir**: (`optional`) string:
+            cache_dir: (`optional`) string:
                 Path to a directory in which a downloaded pre-trained model
                 configuration should be cached if the standard cache should not be used.
 
-            **return_unused_kwargs**: (`optional`) bool:
+            kwargs: (`optional`) dict: key/value pairs with which to update the configuration object after loading.
+
+                - The values in kwargs of any keys which are configuration attributes will be used to override the loaded values.
+                - Behavior concerning key/value pairs whose keys are *not* configuration attributes is controlled by the `return_unused_kwargs` keyword parameter.
+
+            return_unused_kwargs: (`optional`) bool:
 
                 - If False, then this function returns just the final configuration object.
                 - If True, then this functions returns a tuple `(config, unused_kwargs)` where `unused_kwargs` is a dictionary consisting of the key/value pairs whose keys are not configuration attributes: ie the part of kwargs which has not been used to update `config` and is otherwise ignored.
 
-            **kwargs**: (`optional`) dict:
-                Dictionary of key/value pairs with which to update the configuration object after loading.
-
-                - The values in kwargs of any keys which are configuration attributes will be used
-                    to override the loaded values.
-                - Behavior concerning key/value pairs whose keys are *not* configuration attributes is controlled
-                    by the `return_unused_kwargs` keyword parameter.
-
         Examples::
 
             # We can't instantiate directly the base class `PretrainedConfig` so let's show the examples on a
@@ -215,14 +214,26 @@ class PretrainedConfig(object):
 
 
 class PreTrainedModel(nn.Module):
-    """ Base class for all models. Handle loading/storing model config and
-        a simple interface for dowloading and loading pretrained models.
+    r""" Base class for all models.
+
+        :class:`~pytorch_transformers.PreTrainedModel` takes care of storing the configuration of the models and handles methods for loading/downloading/saving models
+        as well as a few methods commons to all models to (i) resize the input embeddings and (ii) prune heads in the self-attention heads.
+
+        Class attributes (overridden by derived classes):
+            - ``config_class``: a class derived from :class:`~pytorch_transformers.PretrainedConfig` to use as configuration class for this model architecture.
+            - ``pretrained_model_archive_map``: a python ``dict`` of with `short-cut-names` (string) as keys and `url` (string) of associated pretrained weights as values.
+            - ``load_tf_weights``: a python ``method`` for loading a TensorFlow checkpoint in a PyTorch model, taking as arguments:
+
+                - ``model``: an instance of the relevant subclass of :class:`~pytorch_transformers.PreTrainedModel`,
+                - ``config``: an instance of the relevant subclass of :class:`~pytorch_transformers.PretrainedConfig`,
+                - ``path``: a path (string) to the TensorFlow checkpoint.
+
+            - ``base_model_prefix``: a string indicating the attribute associated to the base model in derived classes of the same architecture adding modules on top of the base model.
     """
-    config_class = PretrainedConfig
+    config_class = None
     pretrained_model_archive_map = {}
     load_tf_weights = lambda model, config, path: None
     base_model_prefix = ""
-    input_embeddings = None
 
     def __init__(self, config, *inputs, **kwargs):
         super(PreTrainedModel, self).__init__()
@@ -280,17 +291,16 @@ class PreTrainedModel(nn.Module):
 
     def resize_token_embeddings(self, new_num_tokens=None):
         """ Resize input token embeddings matrix of the model if new_num_tokens != config.vocab_size.
-            Take care of tying weights embeddings afterwards if the model class has a `tie_weights()` method.
+        Take care of tying weights embeddings afterwards if the model class has a `tie_weights()` method.
 
-        Args:
-            new_num_tokens: (`optional`) int
-                New number of tokens in the embedding matrix.
-                Increasing the size will add newly initialized vectors at the end
-                Reducing the size will remove vectors from the end
-                If not provided or None: does nothing and just returns a pointer to the input tokens Embedding Module of the model.
+        Arguments:
+
+            new_num_tokens: (`optional`) int:
+                New number of tokens in the embedding matrix. Increasing the size will add newly initialized vectors at the end. Reducing the size will remove vectors from the end. 
+                If not provided or None: does nothing and just returns a pointer to the input tokens ``torch.nn.Embeddings`` Module of the model.
 
         Return: ``torch.nn.Embeddings``
-            Pointer to the input tokens Embedding Module of the model
+            Pointer to the input tokens Embeddings Module of the model
         """
         base_model = getattr(self, self.base_model_prefix, self)  # get the base model if needed
         model_embeds = base_model._resize_token_embeddings(new_num_tokens)
@@ -309,15 +319,17 @@ class PreTrainedModel(nn.Module):
 
     def prune_heads(self, heads_to_prune):
         """ Prunes heads of the base model.
-            Args:
-                heads_to_prune: dict of {layer_num (int): list of heads to prune in this layer (list of int)}
+
+            Arguments:
+
+                heads_to_prune: dict with keys being selected layer indices (`int`) and associated values being the list of heads to prune in said layer (list of `int`).
         """
         base_model = getattr(self, self.base_model_prefix, self)  # get the base model if needed
         base_model._prune_heads(heads_to_prune)
 
     def save_pretrained(self, save_directory):
-        """ Save a model with its configuration file to a directory, so that it
-            can be re-loaded using the `from_pretrained(save_directory)` class method.
+        """ Save a model and its configuration file to a directory, so that it
+            can be re-loaded using the `:func:`~pytorch_transformers.PreTrainedModel.from_pretrained`` class method.
         """
         assert os.path.isdir(save_directory), "Saving path should be a directory where the model and configuration can be saved"
 
@@ -336,50 +348,45 @@ class PreTrainedModel(nn.Module):
     def from_pretrained(cls, pretrained_model_name_or_path, *model_args, **kwargs):
         r"""Instantiate a pretrained pytorch model from a pre-trained model configuration.
 
-            The model is set in evaluation mode by default using `model.eval()` (Dropout modules are deactivated)
-            To train the model, you should first set it back in training mode with `model.train()`
+        The model is set in evaluation mode by default using ``model.eval()`` (Dropout modules are deactivated)
+        To train the model, you should first set it back in training mode with ``model.train()``
 
-        Params:
-            **pretrained_model_name_or_path**: either:
-                - a string with the `shortcut name` of a pre-trained model to load from cache
-                    or download and cache if not already stored in cache (e.g. 'bert-base-uncased').
-                - a path to a `directory` containing a configuration file saved
-                    using the `save_pretrained(save_directory)` method.
-                - a path or url to a tensorflow index checkpoint `file` (e.g. `./tf_model/model.ckpt.index`).
-                    In this case, ``from_tf`` should be set to True and a configuration object should be
-                    provided as `config` argument. This loading option is slower than converting the TensorFlow
-                    checkpoint in a PyTorch model using the provided conversion scripts and loading
-                    the PyTorch model afterwards.
-            **model_args**: (`optional`) Sequence:
-                All remaning positional arguments will be passed to the underlying model's __init__ function
-            **config**: an optional configuration for the model to use instead of an automatically loaded configuation.
-                Configuration can be automatically loaded when:
-                - the model is a model provided by the library (loaded with a `shortcut name` of a pre-trained model), or
-                - the model was saved using the `save_pretrained(save_directory)` (loaded by suppling the save directory).
-            **state_dict**: an optional state dictionnary for the model to use instead of a state dictionary loaded
-                from saved weights file.
+        Parameters:
+            pretrained_model_name_or_path: either:
+
+                - a string with the `shortcut name` of a pre-trained model to load from cache or download, e.g.: ``bert-base-uncased``.
+                - a path to a `directory` containing model weights saved using :func:`~pytorch_transformers.PreTrainedModel.save_pretrained`, e.g.: ``./my_model_directory/``.
+                - a path or url to a `tensorflow index checkpoint file` (e.g. `./tf_model/model.ckpt.index`). In this case, ``from_tf`` should be set to True and a configuration object should be provided as ``config`` argument. This loading path is slower than converting the TensorFlow checkpoint in a PyTorch model using the provided conversion scripts and loading the PyTorch model afterwards.
+
+            model_args: (`optional`) Sequence of positional arguments:
+                All remaning positional arguments will be passed to the underlying model's ``__init__`` method
+
+            config: (`optional`) instance of a class derived from :class:`~pytorch_transformers.PretrainedConfig`:
+                Configuration for the model to use instead of an automatically loaded configuation. Configuration can be automatically loaded when:
+
+                - the model is a model provided by the library (loaded with the ``shortcut-name`` string of a pretrained model), or
+                - the model was saved using :func:`~pytorch_transformers.PreTrainedModel.save_pretrained` and is reloaded by suppling the save directory.
+                - the model is loaded by suppling a local directory as ``pretrained_model_name_or_path`` and a configuration JSON file named `config.json` is found in the directory.
+
+            state_dict: (`optional`) dict:
+                an optional state dictionnary for the model to use instead of a state dictionary loaded from saved weights file.
                 This option can be used if you want to create a model from a pretrained configuration but load your own weights.
-                In this case though, you should check if using `save_pretrained(dir)` and `from_pretrained(save_directory)` is not
-                a simpler option.
-            **cache_dir**: (`optional`) string:
+                In this case though, you should check if using :func:`~pytorch_transformers.PreTrainedModel.save_pretrained` and :func:`~pytorch_transformers.PreTrainedModel.from_pretrained` is not a simpler option.
+
+            cache_dir: (`optional`) string:
                 Path to a directory in which a downloaded pre-trained model
                 configuration should be cached if the standard cache should not be used.
-            **output_loading_info**: (`optional`) boolean:
+
+            output_loading_info: (`optional`) boolean:
                 Set to ``True`` to also return a dictionnary containing missing keys, unexpected keys and error messages.
-            **kwargs**: (`optional`) dict:
-                Dictionary of key, values to update the configuration object after loading.
-                Can be used to override selected configuration parameters. E.g. ``output_attention=True``.
 
-               - If a configuration is providedictionaryfig`, **kwargs will be directly passed
-                 to the underlying model's __init__ method.
-               - If a configuration is not provided, **kwargs will be first passed to the pretrained
-                 model configuration class loading function (`PretrainedConfig.from_pretrained`).
-                 Each key of **kwargs that corresponds to a configuration attribute
-                 will be used to override said attribute with the supplied **kwargs value.
-                 Remaining keys that do not correspond to any configuration attribute will
-                 be passed to the underlying model's __init__ function.
+            kwargs: (`optional`) Remaining dictionary of keyword arguments:
+                Can be used to update the configuration object (after it being loaded) and initiate the model. (e.g. ``output_attention=True``). Behave differently depending on whether a `config` is provided or automatically loaded:
 
-        Examples::dictionary
+                - If a configuration is provided with ``config``, ``**kwargs`` will be directly passed to the underlying model's ``__init__`` method (we assume all relevant updates to the configuration have already been done)
+                - If a configuration is not provided, ``kwargs`` will be first passed to the configuration class initialization function (:func:`~pytorch_transformers.PretrainedConfig.from_pretrained`). Each key of ``kwargs`` that corresponds to a configuration attribute will be used to override said attribute with the supplied ``kwargs`` value. Remaining keys that do not correspond to any configuration attribute will be passed to the underlying model's ``__init__`` function.
+
+        Examples::
 
             model = BertModel.from_pretrained('bert-base-uncased')    # Download model and configuration from S3 and cache.
             model = BertModel.from_pretrained('./test/saved_model/')  # E.g. model was saved using `save_pretrained('./test/saved_model/')`
diff --git a/pytorch_transformers/tokenization_utils.py b/pytorch_transformers/tokenization_utils.py
index 556f094f6d..1852d74021 100644
--- a/pytorch_transformers/tokenization_utils.py
+++ b/pytorch_transformers/tokenization_utils.py
@@ -30,14 +30,34 @@ SPECIAL_TOKENS_MAP_FILE = 'special_tokens_map.json'
 ADDED_TOKENS_FILE = 'added_tokens.json'
 
 class PreTrainedTokenizer(object):
-    """ An abstract class to handle dowloading and loading pretrained tokenizers and adding tokens to the vocabulary.
+    """ Base class for all tokenizers.
+    Handle all the shared methods for tokenization and special tokens as well as methods dowloading/caching/loading pretrained tokenizers as well as adding tokens to the vocabulary.
 
-        Derived class can set up a few special tokens to be used in common scripts and internals:
-            bos_token, eos_token, EOP_TOKEN, EOD_TOKEN, unk_token, sep_token, pad_token, cls_token, mask_token
-            additional_special_tokens = []
+    This class also contain the added tokens in a unified way on top of all tokenizers so we don't have to handle the specific vocabulary augmentation methods of the various underlying dictionary structures (BPE, sentencepiece...).
 
-        We defined an added_tokens_encoder to add new tokens to the vocabulary without having to handle the
-            specific vocabulary augmentation methods of the various underlying dictionary structures (BPE, sentencepiece...).
+    Class attributes (overridden by derived classes):
+
+        - ``vocab_files_names``: a python ``dict`` with, as keys, the ``__init__`` keyword name of each vocabulary file required by the model, and as associated values, the filename for saving the associated file (string).
+        - ``pretrained_vocab_files_map``: a python ``dict of dict`` the high-level keys being the ``__init__`` keyword name of each vocabulary file required by the model, the low-level being the `short-cut-names` (string) of the pretrained models with, as associated values, the `url` (string) to the associated pretrained vocabulary file.
+        - ``max_model_input_sizes``: a python ``dict`` with, as keys, the `short-cut-names` (string) of the pretrained models, and as associated values, the maximum length of the sequence inputs of this model, or None if the model has no maximum input size.
+
+    Parameters:
+
+        - ``bos_token``: (`Optional`) string: a beginning of sentence token. Will be associated to ``self.bos_token``
+
+        - ``eos_token``: (`Optional`) string: an end of sentence token. Will be associated to ``self.eos_token``
+
+        - ``unk_token``: (`Optional`) string: an unknown token. Will be associated to ``self.unk_token``
+
+        - ``sep_token``: (`Optional`) string: a separation token (e.g. to separate context and query in an input sequence). Will be associated to ``self.sep_token``
+
+        - ``pad_token``: (`Optional`) string: a padding token. Will be associated to ``self.pad_token``
+
+        - ``cls_token``: (`Optional`) string: a classification token (e.g. to extract a summary of an input sequence leveraging self-attention along the full depth of the model). Will be associated to ``self.cls_token``
+
+        - ``mask_token``: (`Optional`) string: a masking token (e.g. when training a model with masked-language modeling). Will be associated to ``self.mask_token``
+
+        - ``additional_special_tokens``: (`Optional`) list: a list of additional special tokens. Adding all special tokens here ensure they won't be split by the tokenization process. Will be associated to ``self.additional_special_tokens``
     """
     vocab_files_names = {}
     pretrained_vocab_files_map = {}
@@ -49,82 +69,98 @@ class PreTrainedTokenizer(object):
 
     @property
     def bos_token(self):
+        """ Beginning of sentence token (string). Log an error if used while not having been set. """
         if self._bos_token is None:
             logger.error("Using bos_token, but it is not set yet.")
         return self._bos_token
 
     @property
     def eos_token(self):
+        """ End of sentence token (string). Log an error if used while not having been set. """
         if self._eos_token is None:
             logger.error("Using eos_token, but it is not set yet.")
         return self._eos_token
 
     @property
     def unk_token(self):
+        """ Unknown token (string). Log an error if used while not having been set. """
         if self._unk_token is None:
             logger.error("Using unk_token, but it is not set yet.")
         return self._unk_token
 
     @property
     def sep_token(self):
+        """ Separation token (string). E.g. separate context and query in an input sequence. Log an error if used while not having been set. """
         if self._sep_token is None:
             logger.error("Using sep_token, but it is not set yet.")
         return self._sep_token
 
     @property
     def pad_token(self):
+        """ Padding token (string). Log an error if used while not having been set. """
         if self._pad_token is None:
             logger.error("Using pad_token, but it is not set yet.")
         return self._pad_token
 
     @property
     def cls_token(self):
+        """ Classification token (string). E.g. to extract a summary of an input sequence leveraging self-attention along the full depth of the model. Log an error if used while not having been set. """
         if self._cls_token is None:
             logger.error("Using cls_token, but it is not set yet.")
         return self._cls_token
 
     @property
     def mask_token(self):
+        """ Mask token (string). E.g. when training a model with masked-language modeling. Log an error if used while not having been set. """
         if self._mask_token is None:
             logger.error("Using mask_token, but it is not set yet.")
         return self._mask_token
 
     @property
     def additional_special_tokens(self):
+        """ All the additional special tokens you may want to use (list of strings). Log an error if used while not having been set. """
         if self._additional_special_tokens is None:
             logger.error("Using additional_special_tokens, but it is not set yet.")
         return self._additional_special_tokens
 
     @bos_token.setter
     def bos_token(self, value):
+        self.add_tokens([value]) 
         self._bos_token = value
 
     @eos_token.setter
     def eos_token(self, value):
+        self.add_tokens([value]) 
         self._eos_token = value
 
     @unk_token.setter
     def unk_token(self, value):
+        self.add_tokens([value]) 
         self._unk_token = value
 
     @sep_token.setter
     def sep_token(self, value):
+        self.add_tokens([value]) 
         self._sep_token = value
 
     @pad_token.setter
     def pad_token(self, value):
+        self.add_tokens([value]) 
         self._pad_token = value
 
     @cls_token.setter
     def cls_token(self, value):
+        self.add_tokens([value]) 
         self._cls_token = value
 
     @mask_token.setter
     def mask_token(self, value):
+        self.add_tokens([value]) 
         self._mask_token = value
 
     @additional_special_tokens.setter
     def additional_special_tokens(self, value):
+        self.add_tokens(value) 
         self._additional_special_tokens = value
 
     def __init__(self, max_len=None, **kwargs):
@@ -148,15 +184,47 @@ class PreTrainedTokenizer(object):
 
     @classmethod
     def from_pretrained(cls, *inputs, **kwargs):
+        r""" Instantiate a :class:`~pytorch_transformers.PreTrainedTokenizer` (or a derived class) from a predefined tokenizer.
+
+        Parameters:
+            pretrained_model_name_or_path: either:
+
+                - a string with the `shortcut name` of a predefined tokenizer to load from cache or download, e.g.: ``bert-base-uncased``.
+                - a path to a `directory` containing vocabulary files required by the tokenizer, for instance saved using the :func:`~pytorch_transformers.PreTrainedTokenizer.save_pretrained` method, e.g.: ``./my_model_directory/``.
+                - (not applicable to all derived classes) a path or url to a single saved vocabulary file if and only if the tokenizer only requires a single vocabulary file (e.g. Bert, XLNet), e.g.: ``./my_model_directory/vocab.txt``.
+
+            cache_dir: (`optional`) string:
+                Path to a directory in which a downloaded predefined tokenizer vocabulary files should be cached if the standard cache should not be used.
+
+            inputs: (`optional`) positional arguments: will be passed to the Tokenizer ``__init__`` method.
+
+            kwargs: (`optional`) keyword arguments: will be passed to the Tokenizer ``__init__`` method. Can be used to set special tokens like ``bos_token``, ``eos_token``, ``unk_token``, ``sep_token``, ``pad_token``, ``cls_token``, ``mask_token``, ``additional_special_tokens``. See parameters in the doc string of :class:`~pytorch_transformers.PreTrainedTokenizer` for details.
+
+        Examples::
+
+            # We can't instantiate directly the base class `PreTrainedTokenizer` so let's show our examples on a derived class: BertTokenizer
+
+            # Download vocabulary from S3 and cache.
+            tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
+
+            # If vocabulary files are in a directory (e.g. tokenizer was saved using `save_pretrained('./test/saved_model/')`)
+            tokenizer = BertTokenizer.from_pretrained('./test/saved_model/')
+
+            # If the tokenizer uses a single vocabulary file, you can point directly to this file
+            tokenizer = BertTokenizer.from_pretrained('./test/saved_model/my_vocab.txt')
+
+            # You can link tokens to special vocabulary when instantiating
+            tokenizer = BertTokenizer.from_pretrained('bert-base-uncased', unk_token='<unk>')
+            # You should be sure '<unk>' is in the vocabulary when doing that.
+            # Otherwise use tokenizer.add_special_tokens({'unk_token': '<unk>'}) instead)
+            assert tokenizer.unk_token == '<unk>'
+
+        """
         return cls._from_pretrained(*inputs, **kwargs)
 
 
     @classmethod
     def _from_pretrained(cls, pretrained_model_name_or_path, *inputs, **kwargs):
-        """
-        Instantiate a PreTrainedTokenizer from pre-trained vocabulary files.
-        Download and cache the vocabulary files if needed.
-        """
         cache_dir = kwargs.pop('cache_dir', None)
 
         s3_models = list(cls.max_model_input_sizes.keys())
@@ -253,8 +321,9 @@ class PreTrainedTokenizer(object):
 
     def save_pretrained(self, save_directory):
         """ Save the tokenizer vocabulary files (with added tokens) and the
-            special-tokens-to-class-attributes-mapping to a directory, so that it
-            can be re-loaded using the `from_pretrained(save_directory)` class method.
+            special-tokens-to-class-attributes-mapping to a directory.
+
+            This method make sure the full tokenizer can then be re-loaded using the :func:`~pytorch_transformers.PreTrainedTokenizer.from_pretrained` class method.
         """
         if not os.path.isdir(save_directory):
             logger.error("Saving directory ({}) should be a directory".format(save_directory))
@@ -279,37 +348,50 @@ class PreTrainedTokenizer(object):
 
 
     def save_vocabulary(self, save_directory):
-        """ Save the tokenizer vocabulary to a directory. This method doesn't save added tokens
+        """ Save the tokenizer vocabulary to a directory. This method does *NOT* save added tokens
             and special token mappings.
-            
-            Please use `save_pretrained()` to save the full Tokenizer state so that it can be
-            reloaded using the `from_pretrained(save_directory)` class method.
+
+            Please use :func:`~pytorch_transformers.PreTrainedTokenizer.save_pretrained` `()` to save the full Tokenizer state if you want to reload it using the :func:`~pytorch_transformers.PreTrainedTokenizer.from_pretrained` class method.
         """
         raise NotImplementedError
 
 
     def vocab_size(self):
+        """ Size of the base vocabulary (without the added tokens) """
         raise NotImplementedError
 
 
     def __len__(self):
+        """ Size of the full vocabulary with the added tokens """
         return self.vocab_size + len(self.added_tokens_encoder)
 
 
     def add_tokens(self, new_tokens):
         """ Add a list of new tokens to the tokenizer class. If the new tokens are not in the
-            vocabulary, they are added to the added_tokens_encoder with indices starting from
-            the last index of the current vocabulary.
+        vocabulary, they are added to it with indices starting from length of the current vocabulary.
+
+            Parameters:
+                new_tokens: list of string. Each string is a token to add. Tokens are only added if they are not already in the vocabulary (tested by checking if the tokenizer assign the index of the ``unk_token`` to them).
 
             Returns:
-                Number of tokens added to the vocabulary which can be used to correspondingly
-                    increase the size of the associated model embedding matrices.
+                Number of tokens added to the vocabulary.
+
+        Examples::
+
+            # Let's see how to increase the vocabulary of Bert model and tokenizer
+            tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
+            model = BertModel.from_pretrained('bert-base-uncased')
+
+            num_added_toks = tokenizer.add_tokens(['new_tok1', 'my_new-tok2'])
+            print('We have added', num_added_toks, 'tokens')
+            model.resize_token_embeddings(len(tokenizer))  # Notice: resize_token_embeddings expect to receive the full size of the new vocabulary, i.e. the length of the tokenizer.
         """
         if not new_tokens:
             return 0
 
         to_add_tokens = []
         for token in new_tokens:
+            assert isinstance(token, str) or (six.PY2 and isinstance(token, unicode))
             if token != self.unk_token and \
                     self.convert_tokens_to_ids(token) == self.convert_tokens_to_ids(self.unk_token):
                 to_add_tokens.append(token)
@@ -325,23 +407,23 @@ class PreTrainedTokenizer(object):
 
     def add_special_tokens(self, special_tokens_dict):
         """ Add a dictionary of special tokens (eos, pad, cls...) to the encoder and link them
-            to class attributes. If the special tokens are not in the vocabulary, they are added
-            to it and indexed starting from the last index of the current vocabulary.
+            to class attributes. If special tokens are NOT in the vocabulary, they are added
+            to it (indexed starting from the last index of the current vocabulary).
+
+            Parameters:
+                special_tokens_dict: dict of string. Keys should be in the list of predefined special attributes: [``bos_token``, ``eos_token``, ``unk_token``, ``sep_token``, ``pad_token``, ``cls_token``, ``mask_token``, ``additional_special_tokens``].
+                
+                    Tokens are only added if they are not already in the vocabulary (tested by checking if the tokenizer assign the index of the ``unk_token`` to them).
 
-            Returns:
-                Number of tokens added to the vocabulary which can be used to correspondingly
-                    increase the size of the associated model embedding matrices.
         """
         if not special_tokens_dict:
             return 0
 
-        added_special_tokens = self.add_tokens(special_tokens_dict.values())
         for key, value in special_tokens_dict.items():
+            assert key in self.SPECIAL_TOKENS_ATTRIBUTES
             logger.info("Assigning %s to the %s key of the tokenizer", value, key)
             setattr(self, key, value)
 
-        return added_special_tokens
-
 
     def tokenize(self, text, **kwargs):
         """ Converts a string in a sequence of tokens (string), using the tokenizer.
@@ -369,13 +451,13 @@ class PreTrainedTokenizer(object):
             Split in words for word-based vocabulary or sub-words for sub-word-based
             vocabularies (BPE/SentencePieces/WordPieces).
 
-            Don't take care of added tokens.
+            Do NOT take care of added tokens.
         """
         raise NotImplementedError
 
     def convert_tokens_to_ids(self, tokens):
-        """ Converts a single token or a sequence of tokens (str/unicode) in a integer id
-            (resp.) a sequence of ids, using the vocabulary.
+        """ Converts a single token, or a sequence of tokens, (str/unicode) in a single integer id
+            (resp. a sequence of ids), using the vocabulary.
         """
         if isinstance(tokens, str) or (six.PY2 and isinstance(tokens, unicode)):
             return self._convert_token_to_id_with_added_voc(tokens)
@@ -400,7 +482,8 @@ class PreTrainedTokenizer(object):
 
     def encode(self, text):
         """ Converts a string in a sequence of ids (integer), using the tokenizer and vocabulary.
-            same as self.convert_tokens_to_ids(self.tokenize(text)).
+        
+        Same doing ``self.convert_tokens_to_ids(self.tokenize(text))``.
         """
         return self.convert_tokens_to_ids(self.tokenize(text))
 
@@ -440,6 +523,8 @@ class PreTrainedTokenizer(object):
     def decode(self, token_ids, skip_special_tokens=False, clean_up_tokenization_spaces=True):
         """ Converts a sequence of ids (integer) in a string, using the tokenizer and vocabulary
             with options to remove special tokens and clean up tokenization spaces.
+
+        Similar to doing ``self.convert_tokens_to_string(self.convert_ids_to_tokens(token_ids))``.
         """
         filtered_tokens = self.convert_ids_to_tokens(token_ids, skip_special_tokens=skip_special_tokens)
         text = self.convert_tokens_to_string(filtered_tokens)
@@ -482,6 +567,8 @@ class PreTrainedTokenizer(object):
 
     @staticmethod
     def clean_up_tokenization(out_string):
+        """ Clean up a list of simple English tokenization artifacts like spaces before punctuations and abreviated forms.
+        """
         out_string = out_string.replace(' .', '.').replace(' ?', '?').replace(' !', '!').replace(' ,', ','
                         ).replace(" ' ", "'").replace(" n't", "n't").replace(" 'm", "'m").replace(" do not", " don't"
                         ).replace(" 's", "'s").replace(" 've", "'ve").replace(" 're", "'re")

From 84eb69908226ed78eceb4d6c69e83ca54c39cc21 Mon Sep 17 00:00:00 2001
From: =?UTF-8?q?=E9=9B=B7=E6=89=93=E4=B8=8D=E5=8A=A8=EF=BC=81?=
 <779222056@qq.com>
Date: Mon, 5 Aug 2019 08:57:09 +0800
Subject: [PATCH 047/200] Update modeling_xlnet.py

---
 pytorch_transformers/modeling_xlnet.py | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/pytorch_transformers/modeling_xlnet.py b/pytorch_transformers/modeling_xlnet.py
index 515decdb3e..764eba1768 100644
--- a/pytorch_transformers/modeling_xlnet.py
+++ b/pytorch_transformers/modeling_xlnet.py
@@ -335,7 +335,7 @@ class XLNetConfig(PretrainedConfig):
 
 try:
     from apex.normalization.fused_layer_norm import FusedLayerNorm as XLNetLayerNorm
-except ImportError:
+except (ImportError, AttributeError) as e:
     logger.info("Better speed can be achieved with apex installed from https://www.github.com/nvidia/apex .")
     class XLNetLayerNorm(nn.Module):
         def __init__(self, d_model, eps=1e-12):

From d7fd10568cd7046e3d9bc53343445c225f444954 Mon Sep 17 00:00:00 2001
From: =?UTF-8?q?=E9=9B=B7=E6=89=93=E4=B8=8D=E5=8A=A8=EF=BC=81?=
 <779222056@qq.com>
Date: Mon, 5 Aug 2019 08:58:19 +0800
Subject: [PATCH 048/200] Update modeling_bert.py

---
 pytorch_transformers/modeling_bert.py | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/pytorch_transformers/modeling_bert.py b/pytorch_transformers/modeling_bert.py
index b59445513a..3af9670cef 100644
--- a/pytorch_transformers/modeling_bert.py
+++ b/pytorch_transformers/modeling_bert.py
@@ -222,7 +222,7 @@ class BertConfig(PretrainedConfig):
 
 try:
     from apex.normalization.fused_layer_norm import FusedLayerNorm as BertLayerNorm
-except ImportError:
+except (ImportError, AttributeError) as e:
     logger.info("Better speed can be achieved with apex installed from https://www.github.com/nvidia/apex .")
     class BertLayerNorm(nn.Module):
         def __init__(self, hidden_size, eps=1e-12):

From 05c083520ab58ae8a73d853d0e366e56d5690da4 Mon Sep 17 00:00:00 2001
From: Julien Chaumond <chaumond@gmail.com>
Date: Sun, 4 Aug 2019 21:39:21 -0400
Subject: [PATCH 049/200] =?UTF-8?q?[RoBERTa]=20model=20conversion,=20infer?=
 =?UTF-8?q?ence,=20tests=20=F0=9F=94=A5?=
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

---
 README.md                                     |   1 +
 .../convert_roberta_checkpoint_to_pytorch.py  | 164 +++++++++++++
 pytorch_transformers/modeling_roberta.py      | 128 ++++++++++
 .../tests/modeling_roberta_test.py            |  69 ++++++
 .../tests/tokenization_roberta_test.py        |  42 ++++
 pytorch_transformers/tokenization_roberta.py  | 218 ++++++++++++++++++
 6 files changed, 622 insertions(+)
 create mode 100644 pytorch_transformers/convert_roberta_checkpoint_to_pytorch.py
 create mode 100644 pytorch_transformers/modeling_roberta.py
 create mode 100644 pytorch_transformers/tests/modeling_roberta_test.py
 create mode 100644 pytorch_transformers/tests/tokenization_roberta_test.py
 create mode 100644 pytorch_transformers/tokenization_roberta.py

diff --git a/README.md b/README.md
index 703eb47df9..1e2b025eed 100644
--- a/README.md
+++ b/README.md
@@ -12,6 +12,7 @@ The library currently contains PyTorch implementations, pre-trained model weight
 4. **[Transformer-XL](https://github.com/kimiyoung/transformer-xl)** (from Google/CMU) released with the paper [Transformer-XL: Attentive Language Models Beyond a Fixed-Length Context](https://arxiv.org/abs/1901.02860) by Zihang Dai*, Zhilin Yang*, Yiming Yang, Jaime Carbonell, Quoc V. Le, Ruslan Salakhutdinov.
 5. **[XLNet](https://github.com/zihangdai/xlnet/)** (from Google/CMU) released with the paper [​XLNet: Generalized Autoregressive Pretraining for Language Understanding](https://arxiv.org/abs/1906.08237) by Zhilin Yang*, Zihang Dai*, Yiming Yang, Jaime Carbonell, Ruslan Salakhutdinov, Quoc V. Le.
 6. **[XLM](https://github.com/facebookresearch/XLM/)** (from Facebook) released together with the paper [Cross-lingual Language Model Pretraining](https://arxiv.org/abs/1901.07291) by Guillaume Lample and Alexis Conneau.
+7. **[RoBERTa](https://github.com/pytorch/fairseq/tree/master/examples/roberta)** (from Facebook), a [Robustly Optimized BERT Pretraining Approach](https://arxiv.org/abs/1907.11692) by Yinhan Liu, Myle Ott et al.
 
 These implementations have been tested on several datasets (see the example scripts) and should match the performances of the original implementations (e.g. ~93 F1 on SQuAD for BERT Whole-Word-Masking, ~88 F1 on RocStories for OpenAI GPT, ~18.3 perplexity on WikiText 103 for Transformer-XL, ~0.916 Peason R coefficient on STS-B for XLNet). You can find more details on the performances in the Examples section of the [documentation](https://huggingface.co/pytorch-transformers/examples.html).
 
diff --git a/pytorch_transformers/convert_roberta_checkpoint_to_pytorch.py b/pytorch_transformers/convert_roberta_checkpoint_to_pytorch.py
new file mode 100644
index 0000000000..7a17ee3f1b
--- /dev/null
+++ b/pytorch_transformers/convert_roberta_checkpoint_to_pytorch.py
@@ -0,0 +1,164 @@
+# coding=utf-8
+# Copyright 2018 The HuggingFace Inc. team.
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+#     http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+"""Convert RoBERTa checkpoint."""
+
+from __future__ import absolute_import, division, print_function
+
+import argparse
+import logging
+import numpy as np
+import torch
+
+from fairseq.models.roberta import RobertaModel as FairseqRobertaModel
+from fairseq.modules import TransformerSentenceEncoderLayer
+from pytorch_transformers.modeling_bert import (BertConfig, BertEncoder,
+                                                BertIntermediate, BertLayer,
+                                                BertModel, BertOutput,
+                                                BertSelfAttention,
+                                                BertSelfOutput)
+from pytorch_transformers.modeling_roberta import (RobertaEmbeddings,
+                                                   RobertaForMaskedLM,
+                                                   RobertaModel)
+
+logging.basicConfig(level=logging.INFO)
+logger = logging.getLogger(__name__)
+
+SAMPLE_TEXT = 'Hello world! cécé herlolip'
+
+
+def convert_roberta_checkpoint_to_pytorch(roberta_checkpoint_path, pytorch_dump_folder_path):
+    """
+    Copy/paste/tweak roberta's weights to our BERT structure.
+    """
+    roberta = FairseqRobertaModel.from_pretrained(roberta_checkpoint_path)
+    roberta.eval()  # disable dropout
+    config = BertConfig(
+        vocab_size_or_config_json_file=50265,
+        hidden_size=roberta.args.encoder_embed_dim,
+        num_hidden_layers=roberta.args.encoder_layers,
+        num_attention_heads=roberta.args.encoder_attention_heads,
+        intermediate_size=roberta.args.encoder_ffn_embed_dim,
+        max_position_embeddings=514,
+        type_vocab_size=1,
+    )
+    print("Our BERT config:", config)
+
+    model = RobertaForMaskedLM(config)
+    model.eval()
+
+    # Now let's copy all the weights.
+    # Embeddings
+    roberta_sent_encoder = roberta.model.decoder.sentence_encoder
+    model.roberta.embeddings.word_embeddings.weight = roberta_sent_encoder.embed_tokens.weight
+    model.roberta.embeddings.position_embeddings.weight = roberta_sent_encoder.embed_positions.weight
+    model.roberta.embeddings.token_type_embeddings.weight.data = torch.zeros_like(model.roberta.embeddings.token_type_embeddings.weight)  # just zero them out b/c RoBERTa doesn't use them.
+    model.roberta.embeddings.LayerNorm.weight = roberta_sent_encoder.emb_layer_norm.weight
+    model.roberta.embeddings.LayerNorm.bias = roberta_sent_encoder.emb_layer_norm.bias
+    model.roberta.embeddings.LayerNorm.variance_epsilon = roberta_sent_encoder.emb_layer_norm.eps
+
+    for i in range(config.num_hidden_layers):
+        # Encoder: start of layer
+        layer: BertLayer = model.roberta.encoder.layer[i]
+        roberta_layer: TransformerSentenceEncoderLayer = roberta_sent_encoder.layers[i]
+
+        ### self attention
+        self_attn: BertSelfAttention = layer.attention.self
+        assert(
+            roberta_layer.self_attn.in_proj_weight.shape == torch.Size((3 * config.hidden_size, config.hidden_size))
+        )
+        # we use three distinct linear layers so we split the source layer here.
+        self_attn.query.weight.data = roberta_layer.self_attn.in_proj_weight[:config.hidden_size, :]
+        self_attn.query.bias.data = roberta_layer.self_attn.in_proj_bias[:config.hidden_size]
+        self_attn.key.weight.data = roberta_layer.self_attn.in_proj_weight[config.hidden_size:2*config.hidden_size, :]
+        self_attn.key.bias.data = roberta_layer.self_attn.in_proj_bias[config.hidden_size:2*config.hidden_size]
+        self_attn.value.weight.data = roberta_layer.self_attn.in_proj_weight[2*config.hidden_size:, :]
+        self_attn.value.bias.data = roberta_layer.self_attn.in_proj_bias[2*config.hidden_size:]
+
+        ### self-attention output
+        self_output: BertSelfOutput = layer.attention.output
+        assert(
+            self_output.dense.weight.shape == roberta_layer.self_attn.out_proj.weight.shape
+        )
+        self_output.dense.weight = roberta_layer.self_attn.out_proj.weight
+        self_output.dense.bias = roberta_layer.self_attn.out_proj.bias
+        self_output.LayerNorm.weight = roberta_layer.self_attn_layer_norm.weight
+        self_output.LayerNorm.bias = roberta_layer.self_attn_layer_norm.bias
+        self_output.LayerNorm.variance_epsilon = roberta_layer.self_attn_layer_norm.eps
+
+        ### intermediate
+        intermediate: BertIntermediate = layer.intermediate
+        assert(
+            intermediate.dense.weight.shape == roberta_layer.fc1.weight.shape
+        )
+        intermediate.dense.weight = roberta_layer.fc1.weight
+        intermediate.dense.bias = roberta_layer.fc1.bias
+
+        ### output
+        bert_output: BertOutput = layer.output
+        assert(
+            bert_output.dense.weight.shape == roberta_layer.fc2.weight.shape
+        )
+        bert_output.dense.weight = roberta_layer.fc2.weight
+        bert_output.dense.bias = roberta_layer.fc2.bias
+        bert_output.LayerNorm.weight = roberta_layer.final_layer_norm.weight
+        bert_output.LayerNorm.bias = roberta_layer.final_layer_norm.bias
+        bert_output.LayerNorm.variance_epsilon = roberta_layer.final_layer_norm.eps
+        #### end of layer
+    
+    # LM Head
+    model.lm_head.dense.weight = roberta.model.decoder.lm_head.dense.weight
+    model.lm_head.dense.bias = roberta.model.decoder.lm_head.dense.bias
+    model.lm_head.layer_norm.weight = roberta.model.decoder.lm_head.layer_norm.weight
+    model.lm_head.layer_norm.bias = roberta.model.decoder.lm_head.layer_norm.bias
+    model.lm_head.layer_norm.variance_epsilon = roberta.model.decoder.lm_head.layer_norm.eps
+    model.lm_head.weight = roberta.model.decoder.lm_head.weight
+    model.lm_head.bias = roberta.model.decoder.lm_head.bias
+
+    # Let's check that we get the same results.
+    input_ids: torch.Tensor = roberta.encode(SAMPLE_TEXT).unsqueeze(0) # batch of size 1
+
+    our_output = model(input_ids)[0]
+    their_output = roberta.model(input_ids)[0]
+    print(our_output.shape, their_output.shape)
+    success = torch.allclose(our_output, their_output, atol=1e-3)
+    print(
+        "Do both models output the same tensors?",
+        "🔥" if success else "💩"
+    )
+    if not success:
+        raise Exception("Something went wRoNg")
+
+    print(f"Saving model to {pytorch_dump_folder_path}")
+    model.save_pretrained(pytorch_dump_folder_path)
+
+
+if __name__ == "__main__":
+    parser = argparse.ArgumentParser()
+    ## Required parameters
+    parser.add_argument("--roberta_checkpoint_path",
+                        default = None,
+                        type = str,
+                        required = True,
+                        help = "Path the official PyTorch dump.")
+    parser.add_argument("--pytorch_dump_folder_path",
+                        default = None,
+                        type = str,
+                        required = True,
+                        help = "Path to the output PyTorch model.")
+    args = parser.parse_args()
+    convert_roberta_checkpoint_to_pytorch(
+        args.roberta_checkpoint_path,
+        args.pytorch_dump_folder_path
+    )
diff --git a/pytorch_transformers/modeling_roberta.py b/pytorch_transformers/modeling_roberta.py
new file mode 100644
index 0000000000..b92ffd0433
--- /dev/null
+++ b/pytorch_transformers/modeling_roberta.py
@@ -0,0 +1,128 @@
+# coding=utf-8
+# Copyright 2018 The Google AI Language Team Authors and The HuggingFace Inc. team.
+# Copyright (c) 2018, NVIDIA CORPORATION.  All rights reserved.
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+#     http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+"""PyTorch RoBERTa model. """
+
+from __future__ import (absolute_import, division, print_function,
+                        unicode_literals)
+
+import logging
+
+import torch
+import torch.nn as nn
+import torch.nn.functional as F
+
+from pytorch_transformers.modeling_bert import (BertConfig, BertEmbeddings,
+                                                BertLayerNorm, BertModel,
+                                                BertPreTrainedModel, gelu)
+
+logger = logging.getLogger(__name__)
+
+ROBERTA_PRETRAINED_MODEL_ARCHIVE_MAP = {
+    'roberta-base': "https://s3.amazonaws.com/models.huggingface.co/bert/roberta-base-pytorch_model.bin",
+    'roberta-large': "https://s3.amazonaws.com/models.huggingface.co/bert/roberta-large-pytorch_model.bin",
+    'roberta-large-mnli': "https://s3.amazonaws.com/models.huggingface.co/bert/roberta-large-mnli-pytorch_model.bin",
+}
+
+ROBERTA_PRETRAINED_CONFIG_ARCHIVE_MAP = {
+    'roberta-base': "https://s3.amazonaws.com/models.huggingface.co/bert/roberta-base-config.json",
+    'roberta-large': "https://s3.amazonaws.com/models.huggingface.co/bert/roberta-large-config.json",
+    'roberta-large-mnli': "https://s3.amazonaws.com/models.huggingface.co/bert/roberta-large-mnli-config.json",
+}
+
+
+class RobertaEmbeddings(BertEmbeddings):
+    """
+    Same as BertEmbeddings with a tiny tweak for positional embeddings indexing.
+    """
+    def __init__(self, config):
+        super(RobertaEmbeddings, self).__init__(config)
+        self.padding_idx = 1
+
+    def forward(self, input_ids, token_type_ids=None, position_ids=None):
+        seq_length = input_ids.size(1)
+        if position_ids is None:
+            # Position numbers begin at padding_idx+1. Padding symbols are ignored.
+            # cf. fairseq's `utils.make_positions`
+            position_ids = torch.arange(self.padding_idx+1, seq_length+self.padding_idx+1, dtype=torch.long, device=input_ids.device)
+            position_ids = position_ids.unsqueeze(0).expand_as(input_ids)
+        return super().forward(input_ids, token_type_ids=token_type_ids, position_ids=position_ids)
+
+
+class RobertaConfig(BertConfig):
+    pretrained_config_archive_map = ROBERTA_PRETRAINED_CONFIG_ARCHIVE_MAP
+
+class RobertaModel(BertModel):
+    """
+    Same as BertModel with:
+    - a tiny embeddings tweak.
+    - setup for Roberta pretrained models
+    """
+    config_class = RobertaConfig
+    pretrained_model_archive_map = ROBERTA_PRETRAINED_MODEL_ARCHIVE_MAP
+    base_model_prefix = "roberta"
+
+    def __init__(self, config):
+        super(RobertaModel, self).__init__(config)
+
+        self.embeddings = RobertaEmbeddings(config)
+
+
+
+class RobertaForMaskedLM(BertPreTrainedModel):
+    """
+    Roberta Model with a `language modeling` head on top.
+    """
+    config_class = RobertaConfig
+    pretrained_model_archive_map = ROBERTA_PRETRAINED_MODEL_ARCHIVE_MAP
+    base_model_prefix = "roberta"
+
+    def __init__(self, config):
+        super(RobertaForMaskedLM, self).__init__(config)
+
+        self.roberta = RobertaModel(config)
+        self.lm_head = RobertaLMHead(config)
+    
+    def forward(self, input_ids, token_type_ids=None, attention_mask=None, position_ids=None, head_mask=None):
+        outputs = self.roberta(input_ids, position_ids=position_ids, token_type_ids=token_type_ids,
+                            attention_mask=attention_mask, head_mask=head_mask)
+        sequence_output = outputs[0]
+        prediction_scores = self.lm_head(sequence_output)
+
+        outputs = (prediction_scores,) + outputs[2:]
+        return outputs
+
+
+
+class RobertaLMHead(nn.Module):
+    """Roberta Head for masked language modeling."""
+
+    def __init__(self, config: BertConfig):
+        super().__init__()
+        self.dense = nn.Linear(config.hidden_size, config.hidden_size)
+        self.layer_norm = BertLayerNorm(config.hidden_size, eps=config.layer_norm_eps)
+
+        self.weight = nn.Linear(config.hidden_size, config.vocab_size, bias=False).weight
+        self.bias = nn.Parameter(torch.zeros(config.vocab_size))
+
+    def forward(self, features, **kwargs):
+        x = self.dense(features)
+        x = gelu(x)
+        x = self.layer_norm(x)
+
+        # project back to size of vocabulary with bias
+        x = F.linear(x, self.weight) + self.bias
+
+        return x
diff --git a/pytorch_transformers/tests/modeling_roberta_test.py b/pytorch_transformers/tests/modeling_roberta_test.py
new file mode 100644
index 0000000000..62707326a6
--- /dev/null
+++ b/pytorch_transformers/tests/modeling_roberta_test.py
@@ -0,0 +1,69 @@
+# coding=utf-8
+# Copyright 2018 The Google AI Language Team Authors.
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+#     http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+from __future__ import (absolute_import, division, print_function,
+                        unicode_literals)
+
+import os
+import unittest
+import pytest
+import torch
+
+from pytorch_transformers.modeling_roberta import (RobertaForMaskedLM,
+                                                   RobertaModel)
+
+
+class RobertaModelTest(unittest.TestCase):
+
+    # @pytest.mark.slow
+    def test_inference_masked_lm(self):
+        model = RobertaForMaskedLM.from_pretrained('roberta-base')
+        
+        input_ids = torch.tensor([[    0, 31414,   232,   328,   740,  1140, 12695,    69, 46078,  1588,   2]])
+        output = model(input_ids)[0]
+        expected_shape = torch.Size((1, 11, 50265))
+        self.assertEqual(
+            output.shape,
+            expected_shape
+        )
+        # compare the actual values for a slice.
+        expected_slice = torch.Tensor(
+            [[[33.8843, -4.3107, 22.7779],
+              [ 4.6533, -2.8099, 13.6252],
+              [ 1.8222, -3.6898,  8.8600]]]
+        )
+        self.assertTrue(
+            torch.allclose(output[:, :3, :3], expected_slice, atol=1e-3)
+        )
+
+    # @pytest.mark.slow
+    def test_inference_no_head(self):
+        model = RobertaModel.from_pretrained('roberta-base')
+        
+        input_ids = torch.tensor([[    0, 31414,   232,   328,   740,  1140, 12695,    69, 46078,  1588,   2]])
+        output = model(input_ids)[0]
+        # compare the actual values for a slice.
+        expected_slice = torch.Tensor(
+            [[[-0.0231,  0.0782,  0.0074],
+              [-0.1854,  0.0539, -0.0174],
+              [ 0.0548,  0.0799,  0.1687]]]
+        )
+        self.assertTrue(
+            torch.allclose(output[:, :3, :3], expected_slice, atol=1e-3)
+        )
+
+
+
+if __name__ == '__main__':
+    unittest.main()
diff --git a/pytorch_transformers/tests/tokenization_roberta_test.py b/pytorch_transformers/tests/tokenization_roberta_test.py
new file mode 100644
index 0000000000..01268f7d25
--- /dev/null
+++ b/pytorch_transformers/tests/tokenization_roberta_test.py
@@ -0,0 +1,42 @@
+# coding=utf-8
+# Copyright 2018 The Google AI Language Team Authors.
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+#     http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+from __future__ import (absolute_import, division, print_function,
+                        unicode_literals)
+
+import os
+import unittest
+import pytest
+
+from pytorch_transformers.tokenization_roberta import RobertaTokenizer
+
+
+class RobertaTokenizationTest(unittest.TestCase):
+
+    # @pytest.mark.slow
+    def test_full_tokenizer(self):
+        tokenizer = RobertaTokenizer.from_pretrained('roberta-base')
+        self.assertListEqual(
+            tokenizer.encode('Hello world!'),
+            [0, 31414, 232, 328, 2]
+        )
+        self.assertListEqual(
+            tokenizer.encode('Hello world! cécé herlolip'),
+            [0, 31414, 232, 328, 740, 1140, 12695, 69, 46078, 1588, 2]
+        )
+
+
+
+if __name__ == '__main__':
+    unittest.main()
diff --git a/pytorch_transformers/tokenization_roberta.py b/pytorch_transformers/tokenization_roberta.py
new file mode 100644
index 0000000000..92717c6dd1
--- /dev/null
+++ b/pytorch_transformers/tokenization_roberta.py
@@ -0,0 +1,218 @@
+# coding=utf-8
+# Copyright 2018 The Open AI Team Authors and The HuggingFace Inc. team.
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+#     http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+"""Tokenization classes for RoBERTa."""
+from __future__ import (absolute_import, division, print_function,
+                        unicode_literals)
+
+import json
+import logging
+import re
+
+from .tokenization_utils import PreTrainedTokenizer
+from .tokenization_gpt2 import GPT2Tokenizer
+
+logger = logging.getLogger(__name__)
+
+VOCAB_FILES_NAMES = {
+    'dict_file': 'dict.txt',
+}
+
+PRETRAINED_VOCAB_FILES_MAP = {
+    'dict_file':
+    {
+        'roberta-base': "https://s3.amazonaws.com/models.huggingface.co/bert/roberta-base-dict.txt",
+        'roberta-large': "https://s3.amazonaws.com/models.huggingface.co/bert/roberta-base-dict.txt",
+        'roberta-large-mnli': "https://s3.amazonaws.com/models.huggingface.co/bert/roberta-base-dict.txt",
+    },
+}
+
+PRETRAINED_POSITIONAL_EMBEDDINGS_SIZES = {
+    'roberta-base': 512,
+    'roberta-large': 512,
+    'roberta-large-mnli': 512,
+}
+
+
+SPACE_NORMALIZER = re.compile(r"\s+")
+
+def tokenize_line(line):
+    line = SPACE_NORMALIZER.sub(" ", line)
+    line = line.strip()
+    return line.split()
+
+
+class Dictionary(object):
+    """
+    A mapping from symbols to consecutive integers
+
+    From Facebook's fairseq.
+    """
+
+    def __init__(
+        self,
+        pad='<pad>',
+        eos='</s>',
+        unk='<unk>',
+        bos='<s>',
+        extra_special_symbols=None,
+    ):
+        self.unk_word, self.pad_word, self.eos_word = unk, pad, eos
+        self.symbols = []
+        self.count = []
+        self.indices = {}
+        self.bos_index = self.add_symbol(bos)
+        self.pad_index = self.add_symbol(pad)
+        self.eos_index = self.add_symbol(eos)
+        self.unk_index = self.add_symbol(unk)
+        if extra_special_symbols:
+            for s in extra_special_symbols:
+                self.add_symbol(s)
+        self.nspecial = len(self.symbols)
+
+    def __getitem__(self, idx):
+        if idx < len(self.symbols):
+            return self.symbols[idx]
+        return self.unk_word
+
+    def index(self, sym):
+        """Returns the index of the specified symbol"""
+        assert isinstance(sym, str)
+        if sym in self.indices:
+            return self.indices[sym]
+        return self.unk_index
+
+    def add_symbol(self, word, n=1):
+        """Adds a word to the dictionary"""
+        if word in self.indices:
+            idx = self.indices[word]
+            self.count[idx] = self.count[idx] + n
+            return idx
+        else:
+            idx = len(self.symbols)
+            self.indices[word] = idx
+            self.symbols.append(word)
+            self.count.append(n)
+            return idx
+
+    @classmethod
+    def load(cls, f, ignore_utf_errors=False):
+        """Loads the dictionary from a text file with the format:
+
+        ```
+        <symbol0> <count0>
+        <symbol1> <count1>
+        ...
+        ```
+        """
+        d = cls()
+        d.add_from_file(f, ignore_utf_errors)
+        return d
+
+    def add_from_file(self, f, ignore_utf_errors=False):
+        """
+        Loads a pre-existing dictionary from a text file and adds its symbols
+        to this instance.
+        """
+        if isinstance(f, str):
+            try:
+                if not ignore_utf_errors:
+                    with open(f, 'r', encoding='utf-8') as fd:
+                        self.add_from_file(fd)
+                else:
+                    with open(f, 'r', encoding='utf-8', errors='ignore') as fd:
+                        self.add_from_file(fd)
+            except FileNotFoundError as fnfe:
+                raise fnfe
+            except UnicodeError:
+                raise Exception("Incorrect encoding detected in {}, please "
+                                "rebuild the dataset".format(f))
+            return
+
+        lines = f.readlines()
+        for line in lines:
+            idx = line.rfind(' ')
+            if idx == -1:
+                raise ValueError("Incorrect dictionary format, expected '<token> <cnt>'")
+            word = line[:idx]
+            count = int(line[idx + 1:])
+            self.indices[word] = len(self.symbols)
+            self.symbols.append(word)
+            self.count.append(count)
+    
+    def encode_line(self, line, line_tokenizer=tokenize_line, add_if_not_exist=True,
+                    consumer=None, append_eos=True, reverse_order=False):
+        words = line_tokenizer(line)
+        if reverse_order:
+            words = list(reversed(words))
+        nwords = len(words)
+        ids = [0] * (nwords + 1 if append_eos else nwords)
+
+        for i, word in enumerate(words):
+            if add_if_not_exist:
+                idx = self.add_symbol(word)
+            else:
+                idx = self.index(word)
+            if consumer is not None:
+                consumer(word, idx)
+            ids[i] = idx
+        if append_eos:
+            ids[nwords] = self.eos_index
+        return ids
+
+
+
+
+class RobertaTokenizer(PreTrainedTokenizer):
+    """
+    RoBERTa tokenizer. Peculiarities:
+        - GPT-2 tokenizer with a different integer mapping on top.
+    """
+    vocab_files_names = VOCAB_FILES_NAMES
+    pretrained_vocab_files_map = PRETRAINED_VOCAB_FILES_MAP
+    max_model_input_sizes = PRETRAINED_POSITIONAL_EMBEDDINGS_SIZES
+
+    def __init__(self, dict_file,
+                 bos_token="<s>", eos_token="</s>", **kwargs):
+        super(RobertaTokenizer, self).__init__(bos_token=bos_token, eos_token=eos_token, **kwargs)
+
+        self.gpt2_tokenizer = GPT2Tokenizer.from_pretrained('gpt2')
+        self.dictionary = Dictionary.load(dict_file)
+
+    def _tokenize(self, text):
+        """ Use GPT-2 Tokenizer """
+        return self.gpt2_tokenizer._tokenize(text)
+
+    def encode(self, text):
+        """ Converts a string in a sequence of ids (integer), using the tokenizer and vocabulary.
+        """
+        gpt2_tokens_joined = " ".join(
+            str(x) for x in self.gpt2_tokenizer.convert_tokens_to_ids(self.tokenize(text))
+        )
+        bpe_sentence = '<s> ' + gpt2_tokens_joined + ' </s>'
+        return self.dictionary.encode_line(bpe_sentence, append_eos=False)
+
+    def _convert_token_to_id(self, token):
+        return self.dictionary.index(token)
+
+    def _convert_id_to_token(self, index):
+        symbol = self.dictionary[index]
+        try:
+            idx = int(symbol)
+            return self.gpt2_tokenizer._convert_id_to_token(idx)
+        except:
+            return symbol
+
+    def convert_tokens_to_string(self, tokens):
+        return self.gpt2_tokenizer.convert_tokens_to_string(tokens)

From cb9db101c744276a5028f5b8c675c35536f2096f Mon Sep 17 00:00:00 2001
From: Julien Chaumond <chaumond@gmail.com>
Date: Sun, 4 Aug 2019 22:04:15 -0400
Subject: [PATCH 050/200] Python 2 must DIE

---
 pytorch_transformers/modeling_roberta.py               |  6 +++---
 .../tests/tokenization_roberta_test.py                 | 10 ++++++----
 pytorch_transformers/tokenization_roberta.py           |  4 +++-
 3 files changed, 12 insertions(+), 8 deletions(-)

diff --git a/pytorch_transformers/modeling_roberta.py b/pytorch_transformers/modeling_roberta.py
index b92ffd0433..109a719616 100644
--- a/pytorch_transformers/modeling_roberta.py
+++ b/pytorch_transformers/modeling_roberta.py
@@ -58,7 +58,7 @@ class RobertaEmbeddings(BertEmbeddings):
             # cf. fairseq's `utils.make_positions`
             position_ids = torch.arange(self.padding_idx+1, seq_length+self.padding_idx+1, dtype=torch.long, device=input_ids.device)
             position_ids = position_ids.unsqueeze(0).expand_as(input_ids)
-        return super().forward(input_ids, token_type_ids=token_type_ids, position_ids=position_ids)
+        return super(RobertaEmbeddings, self).forward(input_ids, token_type_ids=token_type_ids, position_ids=position_ids)
 
 
 class RobertaConfig(BertConfig):
@@ -109,8 +109,8 @@ class RobertaForMaskedLM(BertPreTrainedModel):
 class RobertaLMHead(nn.Module):
     """Roberta Head for masked language modeling."""
 
-    def __init__(self, config: BertConfig):
-        super().__init__()
+    def __init__(self, config):
+        super(RobertaLMHead, self).__init__()
         self.dense = nn.Linear(config.hidden_size, config.hidden_size)
         self.layer_norm = BertLayerNorm(config.hidden_size, eps=config.layer_norm_eps)
 
diff --git a/pytorch_transformers/tests/tokenization_roberta_test.py b/pytorch_transformers/tests/tokenization_roberta_test.py
index 01268f7d25..cd4e17ec34 100644
--- a/pytorch_transformers/tests/tokenization_roberta_test.py
+++ b/pytorch_transformers/tests/tokenization_roberta_test.py
@@ -18,6 +18,7 @@ from __future__ import (absolute_import, division, print_function,
 import os
 import unittest
 import pytest
+import six
 
 from pytorch_transformers.tokenization_roberta import RobertaTokenizer
 
@@ -31,10 +32,11 @@ class RobertaTokenizationTest(unittest.TestCase):
             tokenizer.encode('Hello world!'),
             [0, 31414, 232, 328, 2]
         )
-        self.assertListEqual(
-            tokenizer.encode('Hello world! cécé herlolip'),
-            [0, 31414, 232, 328, 740, 1140, 12695, 69, 46078, 1588, 2]
-        )
+        if six.PY3:
+            self.assertListEqual(
+                tokenizer.encode('Hello world! cécé herlolip'),
+                [0, 31414, 232, 328, 740, 1140, 12695, 69, 46078, 1588, 2]
+            )
 
 
 
diff --git a/pytorch_transformers/tokenization_roberta.py b/pytorch_transformers/tokenization_roberta.py
index 92717c6dd1..4f9a7bc0fa 100644
--- a/pytorch_transformers/tokenization_roberta.py
+++ b/pytorch_transformers/tokenization_roberta.py
@@ -19,6 +19,8 @@ from __future__ import (absolute_import, division, print_function,
 import json
 import logging
 import re
+from io import open
+import six
 
 from .tokenization_utils import PreTrainedTokenizer
 from .tokenization_gpt2 import GPT2Tokenizer
@@ -125,7 +127,7 @@ class Dictionary(object):
         Loads a pre-existing dictionary from a text file and adds its symbols
         to this instance.
         """
-        if isinstance(f, str):
+        if isinstance(f, six.string_types):
             try:
                 if not ignore_utf_errors:
                     with open(f, 'r', encoding='utf-8') as fd:

From 328afb70971c2b9144a06f7e7ed9c0c7704bfe92 Mon Sep 17 00:00:00 2001
From: thomwolf <thomwolf@gmail.com>
Date: Mon, 5 Aug 2019 14:08:56 +0200
Subject: [PATCH 051/200] cleaning up tokenizer tests structure (at last) -
 last remaining ppb refs

---
 README.md                                     |  11 +-
 docs/source/migration.md                      |  11 +-
 docs/source/serialization.rst                 |   2 +-
 docs/source/torchscript.rst                   |   5 +-
 pytorch_transformers/__init__.py              |   2 +-
 .../convert_pytorch_checkpoint_to_tf.py       |   2 +-
 pytorch_transformers/file_utils.py            |  15 +-
 .../tests/tokenization_bert_test.py           |  35 ++--
 .../tests/tokenization_gpt2_test.py           |  55 +++---
 .../tests/tokenization_openai_test.py         |  54 +++---
 .../tests/tokenization_tests_commons.py       | 172 ++++++++++--------
 .../tests/tokenization_transfo_xl_test.py     |  37 ++--
 .../tests/tokenization_xlm_test.py            |  54 +++---
 .../tests/tokenization_xlnet_test.py          |  72 ++++----
 pytorch_transformers/tokenization_bert.py     |   2 +-
 pytorch_transformers/tokenization_utils.py    |  36 +++-
 16 files changed, 332 insertions(+), 233 deletions(-)

diff --git a/README.md b/README.md
index c31bbd24b7..c130675dbd 100644
--- a/README.md
+++ b/README.md
@@ -345,8 +345,13 @@ tokenizer = BertTokenizer.from_pretrained('./my_saved_model_directory/')
 
 ### Optimizers: BertAdam & OpenAIAdam are now AdamW, schedules are standard PyTorch schedules
 
-The two optimizers previously included, `BertAdam` and `OpenAIAdam`, have been replaced by a single `AdamW` optimizer.
-The new optimizer `AdamW` matches PyTorch `Adam` optimizer API.
+The two optimizers previously included, `BertAdam` and `OpenAIAdam`, have been replaced by a single `AdamW` optimizer which has a few differences:
+
+- it only implements weights decay correction,
+- schedules are now externals (see below),
+- gradient clipping is now also external (see below).
+
+The new optimizer `AdamW` matches PyTorch `Adam` optimizer API and let you use standard PyTorch or apex methods for the schedule and clipping.
 
 The schedules are now standard [PyTorch learning rate schedulers](https://pytorch.org/docs/stable/optim.html#how-to-adjust-learning-rate) and not part of the optimizer anymore.
 
@@ -355,6 +360,7 @@ Here is a conversion examples from `BertAdam` with a linear warmup and decay sch
 ```python
 # Parameters:
 lr = 1e-3
+max_grad_norm = 1.0
 num_total_steps = 1000
 num_warmup_steps = 100
 warmup_proportion = float(num_warmup_steps) / float(num_total_steps)  # 0.1
@@ -374,6 +380,7 @@ scheduler = WarmupLinearSchedule(optimizer, warmup_steps=num_warmup_steps, t_tot
 for batch in train_data:
     loss = model(batch)
     loss.backward()
+    torch.nn.utils.clip_grad_norm_(model.parameters(), max_grad_norm)  # Gradient clipping is not in AdamW anymore (so you can use amp without issue)
     scheduler.step()
     optimizer.step()
 ```
diff --git a/docs/source/migration.md b/docs/source/migration.md
index ba09253472..9cfcaade13 100644
--- a/docs/source/migration.md
+++ b/docs/source/migration.md
@@ -68,8 +68,13 @@ tokenizer = BertTokenizer.from_pretrained('./my_saved_model_directory/')
 
 ### Optimizers: BertAdam & OpenAIAdam are now AdamW, schedules are standard PyTorch schedules
 
-The two optimizers previously included, `BertAdam` and `OpenAIAdam`, have been replaced by a single `AdamW` optimizer.
-The new optimizer `AdamW` matches PyTorch `Adam` optimizer API.
+The two optimizers previously included, `BertAdam` and `OpenAIAdam`, have been replaced by a single `AdamW` optimizer which has a few differences:
+
+- it only implements weights decay correction,
+- schedules are now externals (see below),
+- gradient clipping is now also external (see below).
+
+The new optimizer `AdamW` matches PyTorch `Adam` optimizer API and let you use standard PyTorch or apex methods for the schedule and clipping.
 
 The schedules are now standard [PyTorch learning rate schedulers](https://pytorch.org/docs/stable/optim.html#how-to-adjust-learning-rate) and not part of the optimizer anymore.
 
@@ -78,6 +83,7 @@ Here is a conversion examples from `BertAdam` with a linear warmup and decay sch
 ```python
 # Parameters:
 lr = 1e-3
+max_grad_norm = 1.0
 num_total_steps = 1000
 num_warmup_steps = 100
 warmup_proportion = float(num_warmup_steps) / float(num_total_steps)  # 0.1
@@ -97,6 +103,7 @@ scheduler = WarmupLinearSchedule(optimizer, warmup_steps=num_warmup_steps, t_tot
 for batch in train_data:
     loss = model(batch)
     loss.backward()
+    torch.nn.utils.clip_grad_norm_(model.parameters(), max_grad_norm)  # Gradient clipping is not in AdamW anymore (so you can use amp without issue)
     scheduler.step()
     optimizer.step()
 ```
diff --git a/docs/source/serialization.rst b/docs/source/serialization.rst
index 61854f61ea..7117d7ffa6 100644
--- a/docs/source/serialization.rst
+++ b/docs/source/serialization.rst
@@ -122,7 +122,7 @@ Here is the recommended way of saving the model, configuration and vocabulary to
 
 .. code-block:: python
 
-   from pytorch_pretrained_bert import WEIGHTS_NAME, CONFIG_NAME
+   from pytorch_transformers import WEIGHTS_NAME, CONFIG_NAME
 
    output_dir = "./models/"
 
diff --git a/docs/source/torchscript.rst b/docs/source/torchscript.rst
index 1b84559567..3c38177353 100644
--- a/docs/source/torchscript.rst
+++ b/docs/source/torchscript.rst
@@ -74,7 +74,7 @@ according to a ``BertConfig`` class and then saved to disk under the filename ``
 
 .. code-block:: python
 
-    from pytorch_pretrained_bert import BertModel, BertTokenizer, BertConfig
+    from pytorch_transformers import BertModel, BertTokenizer, BertConfig
     import torch
 
     enc = BertTokenizer.from_pretrained("bert-base-uncased")
@@ -105,6 +105,9 @@ according to a ``BertConfig`` class and then saved to disk under the filename ``
     # The model needs to be in evaluation mode
     model.eval()
 
+    # If you are instantiating the model with `from_pretrained` you can also easily set the TorchScript flag
+    model = BertModel.from_pretrained("bert-base-uncased", torchscript=True)
+
     # Creating the trace
     traced_model = torch.jit.trace(model, [tokens_tensor, segments_tensors])
     torch.jit.save(traced_model, "traced_bert.pt")
diff --git a/pytorch_transformers/__init__.py b/pytorch_transformers/__init__.py
index c9b0aeebb7..72d666448e 100644
--- a/pytorch_transformers/__init__.py
+++ b/pytorch_transformers/__init__.py
@@ -39,4 +39,4 @@ from .modeling_utils import (WEIGHTS_NAME, CONFIG_NAME, TF_WEIGHTS_NAME,
 from .optimization import (AdamW, ConstantLRSchedule, WarmupConstantSchedule, WarmupCosineSchedule,
                            WarmupCosineWithHardRestartsSchedule, WarmupLinearSchedule)
 
-from .file_utils import (PYTORCH_PRETRAINED_BERT_CACHE, cached_path)
+from .file_utils import (PYTORCH_TRANSFORMERS_CACHE, PYTORCH_PRETRAINED_BERT_CACHE, cached_path)
diff --git a/pytorch_transformers/convert_pytorch_checkpoint_to_tf.py b/pytorch_transformers/convert_pytorch_checkpoint_to_tf.py
index b8858ee3dc..d866365fd0 100644
--- a/pytorch_transformers/convert_pytorch_checkpoint_to_tf.py
+++ b/pytorch_transformers/convert_pytorch_checkpoint_to_tf.py
@@ -20,7 +20,7 @@ import argparse
 import torch
 import numpy as np
 import tensorflow as tf
-from pytorch_pretrained_bert.modeling import BertModel
+from pytorch_transformers.modeling import BertModel
 
 
 def convert_pytorch_checkpoint_to_tf(model:BertModel, ckpt_dir:str, model_name:str):
diff --git a/pytorch_transformers/file_utils.py b/pytorch_transformers/file_utils.py
index fd655cec0e..75c075720c 100644
--- a/pytorch_transformers/file_utils.py
+++ b/pytorch_transformers/file_utils.py
@@ -38,10 +38,13 @@ except ImportError:
 try:
     from pathlib import Path
     PYTORCH_PRETRAINED_BERT_CACHE = Path(
-        os.getenv('PYTORCH_PRETRAINED_BERT_CACHE', default_cache_path))
+        os.getenv('PYTORCH_TRANSFORMERS_CACHE', os.getenv('PYTORCH_PRETRAINED_BERT_CACHE', default_cache_path)))
 except (AttributeError, ImportError):
-    PYTORCH_PRETRAINED_BERT_CACHE = os.getenv('PYTORCH_PRETRAINED_BERT_CACHE',
-                                              default_cache_path)
+    PYTORCH_PRETRAINED_BERT_CACHE = os.getenv('PYTORCH_TRANSFORMERS_CACHE',
+                                              os.getenv('PYTORCH_PRETRAINED_BERT_CACHE',
+                                                        default_cache_path))
+
+PYTORCH_TRANSFORMERS_CACHE = PYTORCH_PRETRAINED_BERT_CACHE  # Kept for backward compatibility
 
 logger = logging.getLogger(__name__)  # pylint: disable=invalid-name
 
@@ -70,7 +73,7 @@ def filename_to_url(filename, cache_dir=None):
     Raise ``EnvironmentError`` if `filename` or its stored metadata do not exist.
     """
     if cache_dir is None:
-        cache_dir = PYTORCH_PRETRAINED_BERT_CACHE
+        cache_dir = PYTORCH_TRANSFORMERS_CACHE
     if sys.version_info[0] == 3 and isinstance(cache_dir, Path):
         cache_dir = str(cache_dir)
 
@@ -98,7 +101,7 @@ def cached_path(url_or_filename, cache_dir=None):
     make sure the file exists and then return the path.
     """
     if cache_dir is None:
-        cache_dir = PYTORCH_PRETRAINED_BERT_CACHE
+        cache_dir = PYTORCH_TRANSFORMERS_CACHE
     if sys.version_info[0] == 3 and isinstance(url_or_filename, Path):
         url_or_filename = str(url_or_filename)
     if sys.version_info[0] == 3 and isinstance(cache_dir, Path):
@@ -187,7 +190,7 @@ def get_from_cache(url, cache_dir=None):
     If it's not there, download it. Then return the path to the cached file.
     """
     if cache_dir is None:
-        cache_dir = PYTORCH_PRETRAINED_BERT_CACHE
+        cache_dir = PYTORCH_TRANSFORMERS_CACHE
     if sys.version_info[0] == 3 and isinstance(cache_dir, Path):
         cache_dir = str(cache_dir)
     if sys.version_info[0] == 2 and not isinstance(cache_dir, str):
diff --git a/pytorch_transformers/tests/tokenization_bert_test.py b/pytorch_transformers/tests/tokenization_bert_test.py
index 0b9cfb1b32..5eb39b729d 100644
--- a/pytorch_transformers/tests/tokenization_bert_test.py
+++ b/pytorch_transformers/tests/tokenization_bert_test.py
@@ -24,30 +24,37 @@ from pytorch_transformers.tokenization_bert import (BasicTokenizer,
                                                     _is_control, _is_punctuation,
                                                     _is_whitespace, VOCAB_FILES_NAMES)
 
-from .tokenization_tests_commons import create_and_check_tokenizer_commons, TemporaryDirectory
+from .tokenization_tests_commons import CommonTestCases
 
-class TokenizationTest(unittest.TestCase):
+class BertTokenizationTest(CommonTestCases.CommonTokenizerTester):
+
+    tokenizer_class = BertTokenizer
+
+    def setUp(self):
+        super(BertTokenizationTest, self).setUp()
 
-    def test_full_tokenizer(self):
         vocab_tokens = [
             "[UNK]", "[CLS]", "[SEP]", "want", "##want", "##ed", "wa", "un", "runn",
             "##ing", ",", "low", "lowest",
         ]
-        with TemporaryDirectory() as tmpdirname:
-            vocab_file = os.path.join(tmpdirname, VOCAB_FILES_NAMES['vocab_file'])
-            with open(vocab_file, "w", encoding='utf-8') as vocab_writer:
-                vocab_writer.write("".join([x + "\n" for x in vocab_tokens]))
+        self.vocab_file = os.path.join(self.tmpdirname, VOCAB_FILES_NAMES['vocab_file'])
+        with open(self.vocab_file, "w", encoding='utf-8') as vocab_writer:
+            vocab_writer.write("".join([x + "\n" for x in vocab_tokens]))
 
-            input_text = u"UNwant\u00E9d,running"
-            output_text = u"unwanted, running"
+    def get_tokenizer(self):
+        return BertTokenizer.from_pretrained(self.tmpdirname)
 
-            create_and_check_tokenizer_commons(self, input_text, output_text, BertTokenizer, tmpdirname)
+    def get_input_output_texts(self):
+        input_text = u"UNwant\u00E9d,running"
+        output_text = u"unwanted, running"
+        return input_text, output_text
 
-            tokenizer = BertTokenizer(vocab_file)
+    def test_full_tokenizer(self):
+        tokenizer = BertTokenizer(self.vocab_file)
 
-            tokens = tokenizer.tokenize(u"UNwant\u00E9d,running")
-            self.assertListEqual(tokens, ["un", "##want", "##ed", ",", "runn", "##ing"])
-            self.assertListEqual(tokenizer.convert_tokens_to_ids(tokens), [7, 4, 5, 10, 8, 9])
+        tokens = tokenizer.tokenize(u"UNwant\u00E9d,running")
+        self.assertListEqual(tokens, ["un", "##want", "##ed", ",", "runn", "##ing"])
+        self.assertListEqual(tokenizer.convert_tokens_to_ids(tokens), [7, 4, 5, 10, 8, 9])
 
     def test_chinese(self):
         tokenizer = BasicTokenizer()
diff --git a/pytorch_transformers/tests/tokenization_gpt2_test.py b/pytorch_transformers/tests/tokenization_gpt2_test.py
index 8dae72ec99..da7028c27d 100644
--- a/pytorch_transformers/tests/tokenization_gpt2_test.py
+++ b/pytorch_transformers/tests/tokenization_gpt2_test.py
@@ -20,42 +20,49 @@ import json
 
 from pytorch_transformers.tokenization_gpt2 import GPT2Tokenizer, VOCAB_FILES_NAMES
 
-from .tokenization_tests_commons import create_and_check_tokenizer_commons, TemporaryDirectory
+from .tokenization_tests_commons import CommonTestCases
 
-class GPT2TokenizationTest(unittest.TestCase):
+class GPT2TokenizationTest(CommonTestCases.CommonTokenizerTester):
 
-    def test_full_tokenizer(self):
-        """ Adapted from Sennrich et al. 2015 and https://github.com/rsennrich/subword-nmt """
+    tokenizer_class = GPT2Tokenizer
+
+    def setUp(self):
+        super(GPT2TokenizationTest, self).setUp()
+
+        # Adapted from Sennrich et al. 2015 and https://github.com/rsennrich/subword-nmt
         vocab = ["l", "o", "w", "e", "r", "s", "t", "i", "d", "n",
                  "lo", "low", "er",
                  "low", "lowest", "newer", "wider", "<unk>"]
         vocab_tokens = dict(zip(vocab, range(len(vocab))))
         merges = ["#version: 0.2", "l o", "lo w", "e r", ""]
-        special_tokens_map = {"unk_token": "<unk>"}
+        self.special_tokens_map = {"unk_token": "<unk>"}
 
-        with TemporaryDirectory() as tmpdirname:
-            vocab_file = os.path.join(tmpdirname, VOCAB_FILES_NAMES['vocab_file'])
-            merges_file = os.path.join(tmpdirname, VOCAB_FILES_NAMES['merges_file'])
-            with open(vocab_file, "w") as fp:
-                fp.write(json.dumps(vocab_tokens))
-            with open(merges_file, "w") as fp:
-                fp.write("\n".join(merges))
+        self.vocab_file = os.path.join(self.tmpdirname, VOCAB_FILES_NAMES['vocab_file'])
+        self.merges_file = os.path.join(self.tmpdirname, VOCAB_FILES_NAMES['merges_file'])
+        with open(self.vocab_file, "w") as fp:
+            fp.write(json.dumps(vocab_tokens))
+        with open(self.merges_file, "w") as fp:
+            fp.write("\n".join(merges))
 
-            input_text = u"lower newer"
-            output_text = u"lower<unk>newer"
+    def get_tokenizer(self):
+        return GPT2Tokenizer.from_pretrained(self.tmpdirname, **self.special_tokens_map)
 
-            create_and_check_tokenizer_commons(self, input_text, output_text, GPT2Tokenizer, tmpdirname, **special_tokens_map)
+    def get_input_output_texts(self):
+        input_text = u"lower newer"
+        output_text = u"lower<unk>newer"
+        return input_text, output_text
 
-            tokenizer = GPT2Tokenizer(vocab_file, merges_file, **special_tokens_map)
-            text = "lower"
-            bpe_tokens = ["low", "er"]
-            tokens = tokenizer.tokenize(text)
-            self.assertListEqual(tokens, bpe_tokens)
+    def test_full_tokenizer(self):
+        tokenizer = GPT2Tokenizer(self.vocab_file, self.merges_file, **self.special_tokens_map)
+        text = "lower"
+        bpe_tokens = ["low", "er"]
+        tokens = tokenizer.tokenize(text)
+        self.assertListEqual(tokens, bpe_tokens)
 
-            input_tokens = tokens + [tokenizer.unk_token]
-            input_bpe_tokens = [13, 12, 17]
-            self.assertListEqual(
-                tokenizer.convert_tokens_to_ids(input_tokens), input_bpe_tokens)
+        input_tokens = tokens + [tokenizer.unk_token]
+        input_bpe_tokens = [13, 12, 17]
+        self.assertListEqual(
+            tokenizer.convert_tokens_to_ids(input_tokens), input_bpe_tokens)
 
 
 if __name__ == '__main__':
diff --git a/pytorch_transformers/tests/tokenization_openai_test.py b/pytorch_transformers/tests/tokenization_openai_test.py
index 9b4841a605..bb354f3fb7 100644
--- a/pytorch_transformers/tests/tokenization_openai_test.py
+++ b/pytorch_transformers/tests/tokenization_openai_test.py
@@ -20,13 +20,17 @@ import json
 
 from pytorch_transformers.tokenization_openai import OpenAIGPTTokenizer, VOCAB_FILES_NAMES
 
-from .tokenization_tests_commons import create_and_check_tokenizer_commons, TemporaryDirectory
+from .tokenization_tests_commons import CommonTestCases
 
 
-class OpenAIGPTTokenizationTest(unittest.TestCase):
+class OpenAIGPTTokenizationTest(CommonTestCases.CommonTokenizerTester):
 
-    def test_full_tokenizer(self):
-        """ Adapted from Sennrich et al. 2015 and https://github.com/rsennrich/subword-nmt """
+    tokenizer_class = OpenAIGPTTokenizer
+
+    def setUp(self):
+        super(OpenAIGPTTokenizationTest, self).setUp()
+
+        # Adapted from Sennrich et al. 2015 and https://github.com/rsennrich/subword-nmt
         vocab = ["l", "o", "w", "e", "r", "s", "t", "i", "d", "n",
                  "w</w>", "r</w>", "t</w>",
                  "lo", "low", "er</w>",
@@ -34,30 +38,34 @@ class OpenAIGPTTokenizationTest(unittest.TestCase):
         vocab_tokens = dict(zip(vocab, range(len(vocab))))
         merges = ["#version: 0.2", "l o", "lo w", "e r</w>", ""]
 
-        with TemporaryDirectory() as tmpdirname:
-            vocab_file = os.path.join(tmpdirname, VOCAB_FILES_NAMES['vocab_file'])
-            merges_file = os.path.join(tmpdirname, VOCAB_FILES_NAMES['merges_file'])
-            with open(vocab_file, "w") as fp:
-                fp.write(json.dumps(vocab_tokens))
-            with open(merges_file, "w") as fp:
-                fp.write("\n".join(merges))
+        self.vocab_file = os.path.join(self.tmpdirname, VOCAB_FILES_NAMES['vocab_file'])
+        self.merges_file = os.path.join(self.tmpdirname, VOCAB_FILES_NAMES['merges_file'])
+        with open(self.vocab_file, "w") as fp:
+            fp.write(json.dumps(vocab_tokens))
+        with open(self.merges_file, "w") as fp:
+            fp.write("\n".join(merges))
 
-            input_text = u"lower newer"
-            output_text = u"lower newer"
+    def get_tokenizer(self):
+        return OpenAIGPTTokenizer.from_pretrained(self.tmpdirname)
 
-            create_and_check_tokenizer_commons(self, input_text, output_text, OpenAIGPTTokenizer, tmpdirname)
+    def get_input_output_texts(self):
+        input_text = u"lower newer"
+        output_text = u"lower newer"
+        return input_text, output_text
 
-            tokenizer = OpenAIGPTTokenizer(vocab_file, merges_file)
 
-            text = "lower"
-            bpe_tokens = ["low", "er</w>"]
-            tokens = tokenizer.tokenize(text)
-            self.assertListEqual(tokens, bpe_tokens)
+    def test_full_tokenizer(self):
+        tokenizer = OpenAIGPTTokenizer(self.vocab_file, self.merges_file)
 
-            input_tokens = tokens + ["<unk>"]
-            input_bpe_tokens = [14, 15, 20]
-            self.assertListEqual(
-                tokenizer.convert_tokens_to_ids(input_tokens), input_bpe_tokens)
+        text = "lower"
+        bpe_tokens = ["low", "er</w>"]
+        tokens = tokenizer.tokenize(text)
+        self.assertListEqual(tokens, bpe_tokens)
+
+        input_tokens = tokens + ["<unk>"]
+        input_bpe_tokens = [14, 15, 20]
+        self.assertListEqual(
+            tokenizer.convert_tokens_to_ids(input_tokens), input_bpe_tokens)
 
 
 if __name__ == '__main__':
diff --git a/pytorch_transformers/tests/tokenization_tests_commons.py b/pytorch_transformers/tests/tokenization_tests_commons.py
index c37770b229..ebcf6f48d8 100644
--- a/pytorch_transformers/tests/tokenization_tests_commons.py
+++ b/pytorch_transformers/tests/tokenization_tests_commons.py
@@ -19,6 +19,7 @@ import sys
 from io import open
 import tempfile
 import shutil
+import unittest
 
 if sys.version_info[0] == 2:
     import cPickle as pickle
@@ -36,113 +37,124 @@ else:
     unicode = str
 
 
-def create_and_check_save_and_load_tokenizer(tester, tokenizer_class, *inputs, **kwargs):
-    tokenizer = tokenizer_class.from_pretrained(*inputs, **kwargs)
+class CommonTestCases:
 
-    before_tokens = tokenizer.encode(u"He is very happy, UNwant\u00E9d,running")
+    class CommonTokenizerTester(unittest.TestCase):
 
-    with TemporaryDirectory() as tmpdirname:
-        tokenizer.save_pretrained(tmpdirname)
-        tokenizer = tokenizer.from_pretrained(tmpdirname)
+        tokenizer_class = None
 
-    after_tokens = tokenizer.encode(u"He is very happy, UNwant\u00E9d,running")
-    tester.assertListEqual(before_tokens, after_tokens)
+        def setUp(self):
+            self.tmpdirname = tempfile.mkdtemp()
 
-def create_and_check_pickle_tokenizer(tester, tokenizer_class, *inputs, **kwargs):
-    tokenizer = tokenizer_class.from_pretrained(*inputs, **kwargs)
-    tester.assertIsNotNone(tokenizer)
+        def tearDown(self):
+            shutil.rmtree(self.tmpdirname)
 
-    text = u"Munich and Berlin are nice cities"
-    subwords = tokenizer.tokenize(text)
+        def get_tokenizer(self):
+            raise NotImplementedError
 
-    with TemporaryDirectory() as tmpdirname:
+        def get_input_output_texts(self):
+            raise NotImplementedError
 
-        filename = os.path.join(tmpdirname, u"tokenizer.bin")
-        pickle.dump(tokenizer, open(filename, "wb"))
+        def test_save_and_load_tokenizer(self):
+            tokenizer = self.get_tokenizer()
 
-        tokenizer_new = pickle.load(open(filename, "rb"))
+            before_tokens = tokenizer.encode(u"He is very happy, UNwant\u00E9d,running")
 
-    subwords_loaded = tokenizer_new.tokenize(text)
+            with TemporaryDirectory() as tmpdirname:
+                tokenizer.save_pretrained(tmpdirname)
+                tokenizer = tokenizer.from_pretrained(tmpdirname)
 
-    tester.assertListEqual(subwords, subwords_loaded)
+            after_tokens = tokenizer.encode(u"He is very happy, UNwant\u00E9d,running")
+            self.assertListEqual(before_tokens, after_tokens)
+
+        def test_pickle_tokenizer(self):
+            tokenizer = self.get_tokenizer()
+            self.assertIsNotNone(tokenizer)
+
+            text = u"Munich and Berlin are nice cities"
+            subwords = tokenizer.tokenize(text)
+
+            with TemporaryDirectory() as tmpdirname:
+
+                filename = os.path.join(tmpdirname, u"tokenizer.bin")
+                pickle.dump(tokenizer, open(filename, "wb"))
+
+                tokenizer_new = pickle.load(open(filename, "rb"))
+
+            subwords_loaded = tokenizer_new.tokenize(text)
+
+            self.assertListEqual(subwords, subwords_loaded)
 
 
-def create_and_check_add_tokens_tokenizer(tester, tokenizer_class, *inputs, **kwargs):
-    tokenizer = tokenizer_class.from_pretrained(*inputs, **kwargs)
+        def test_add_tokens_tokenizer(self):
+            tokenizer = self.get_tokenizer()
 
-    vocab_size = tokenizer.vocab_size
-    all_size = len(tokenizer)
+            vocab_size = tokenizer.vocab_size
+            all_size = len(tokenizer)
 
-    tester.assertNotEqual(vocab_size, 0)
-    tester.assertEqual(vocab_size, all_size)
+            self.assertNotEqual(vocab_size, 0)
+            self.assertEqual(vocab_size, all_size)
 
-    new_toks = ["aaaaabbbbbb", "cccccccccdddddddd"]
-    added_toks = tokenizer.add_tokens(new_toks)
-    vocab_size_2 = tokenizer.vocab_size
-    all_size_2 = len(tokenizer)
+            new_toks = ["aaaaabbbbbb", "cccccccccdddddddd"]
+            added_toks = tokenizer.add_tokens(new_toks)
+            vocab_size_2 = tokenizer.vocab_size
+            all_size_2 = len(tokenizer)
 
-    tester.assertNotEqual(vocab_size_2, 0)
-    tester.assertEqual(vocab_size, vocab_size_2)
-    tester.assertEqual(added_toks, len(new_toks))
-    tester.assertEqual(all_size_2, all_size + len(new_toks))
+            self.assertNotEqual(vocab_size_2, 0)
+            self.assertEqual(vocab_size, vocab_size_2)
+            self.assertEqual(added_toks, len(new_toks))
+            self.assertEqual(all_size_2, all_size + len(new_toks))
 
-    tokens = tokenizer.encode("aaaaabbbbbb low cccccccccdddddddd l")
-    tester.assertGreaterEqual(len(tokens), 4)
-    tester.assertGreater(tokens[0], tokenizer.vocab_size - 1)
-    tester.assertGreater(tokens[-2], tokenizer.vocab_size - 1)
+            tokens = tokenizer.encode("aaaaabbbbbb low cccccccccdddddddd l")
+            self.assertGreaterEqual(len(tokens), 4)
+            self.assertGreater(tokens[0], tokenizer.vocab_size - 1)
+            self.assertGreater(tokens[-2], tokenizer.vocab_size - 1)
 
-    new_toks_2 = {'eos_token': ">>>>|||<||<<|<<",
-                  'pad_token': "<<<<<|||>|>>>>|>"}
-    added_toks_2 = tokenizer.add_special_tokens(new_toks_2)
-    vocab_size_3 = tokenizer.vocab_size
-    all_size_3 = len(tokenizer)
+            new_toks_2 = {'eos_token': ">>>>|||<||<<|<<",
+                        'pad_token': "<<<<<|||>|>>>>|>"}
+            added_toks_2 = tokenizer.add_special_tokens(new_toks_2)
+            vocab_size_3 = tokenizer.vocab_size
+            all_size_3 = len(tokenizer)
 
-    tester.assertNotEqual(vocab_size_3, 0)
-    tester.assertEqual(vocab_size, vocab_size_3)
-    tester.assertEqual(added_toks_2, len(new_toks_2))
-    tester.assertEqual(all_size_3, all_size_2 + len(new_toks_2))
+            self.assertNotEqual(vocab_size_3, 0)
+            self.assertEqual(vocab_size, vocab_size_3)
+            self.assertEqual(added_toks_2, len(new_toks_2))
+            self.assertEqual(all_size_3, all_size_2 + len(new_toks_2))
 
-    tokens = tokenizer.encode(">>>>|||<||<<|<< aaaaabbbbbb low cccccccccdddddddd <<<<<|||>|>>>>|> l")
+            tokens = tokenizer.encode(">>>>|||<||<<|<< aaaaabbbbbb low cccccccccdddddddd <<<<<|||>|>>>>|> l")
 
-    tester.assertGreaterEqual(len(tokens), 6)
-    tester.assertGreater(tokens[0], tokenizer.vocab_size - 1)
-    tester.assertGreater(tokens[0], tokens[1])
-    tester.assertGreater(tokens[-2], tokenizer.vocab_size - 1)
-    tester.assertGreater(tokens[-2], tokens[-3])
-    tester.assertEqual(tokens[0], tokenizer.convert_tokens_to_ids(tokenizer.eos_token))
-    tester.assertEqual(tokens[-2], tokenizer.convert_tokens_to_ids(tokenizer.pad_token))
+            self.assertGreaterEqual(len(tokens), 6)
+            self.assertGreater(tokens[0], tokenizer.vocab_size - 1)
+            self.assertGreater(tokens[0], tokens[1])
+            self.assertGreater(tokens[-2], tokenizer.vocab_size - 1)
+            self.assertGreater(tokens[-2], tokens[-3])
+            self.assertEqual(tokens[0], tokenizer.convert_tokens_to_ids(tokenizer.eos_token))
+            self.assertEqual(tokens[-2], tokenizer.convert_tokens_to_ids(tokenizer.pad_token))
 
 
-def create_and_check_required_methods_tokenizer(tester, input_text, output_text, tokenizer_class, *inputs, **kwargs):
-    tokenizer = tokenizer_class.from_pretrained(*inputs, **kwargs)
+        def test_required_methods_tokenizer(self):
+            tokenizer = self.get_tokenizer()
+            input_text, output_text = self.get_input_output_texts()
 
-    tokens = tokenizer.tokenize(input_text)
-    ids = tokenizer.convert_tokens_to_ids(tokens)
-    ids_2 = tokenizer.encode(input_text)
-    tester.assertListEqual(ids, ids_2)
+            tokens = tokenizer.tokenize(input_text)
+            ids = tokenizer.convert_tokens_to_ids(tokens)
+            ids_2 = tokenizer.encode(input_text)
+            self.assertListEqual(ids, ids_2)
 
-    tokens_2 = tokenizer.convert_ids_to_tokens(ids)
-    text_2 = tokenizer.decode(ids)
+            tokens_2 = tokenizer.convert_ids_to_tokens(ids)
+            text_2 = tokenizer.decode(ids)
 
-    tester.assertEqual(text_2, output_text)
+            self.assertEqual(text_2, output_text)
 
-    tester.assertNotEqual(len(tokens_2), 0)
-    tester.assertIsInstance(text_2, (str, unicode))
+            self.assertNotEqual(len(tokens_2), 0)
+            self.assertIsInstance(text_2, (str, unicode))
 
 
-def create_and_check_pretrained_model_lists(tester, input_text, output_text, tokenizer_class, *inputs, **kwargs):
-    weights_list = list(tokenizer_class.max_model_input_sizes.keys())
-    weights_lists_2 = []
-    for file_id, map_list in tokenizer_class.pretrained_vocab_files_map.items():
-        weights_lists_2.append(list(map_list.keys()))
+        def test_pretrained_model_lists(self):
+            weights_list = list(self.tokenizer_class.max_model_input_sizes.keys())
+            weights_lists_2 = []
+            for file_id, map_list in self.tokenizer_class.pretrained_vocab_files_map.items():
+                weights_lists_2.append(list(map_list.keys()))
 
-    for weights_list_2 in weights_lists_2:
-        tester.assertListEqual(weights_list, weights_list_2)
-
-
-def create_and_check_tokenizer_commons(tester, input_text, output_text, tokenizer_class, *inputs, **kwargs):
-    create_and_check_pretrained_model_lists(tester, input_text, output_text, tokenizer_class, *inputs, **kwargs)
-    create_and_check_required_methods_tokenizer(tester, input_text, output_text, tokenizer_class, *inputs, **kwargs)
-    create_and_check_add_tokens_tokenizer(tester, tokenizer_class, *inputs, **kwargs)
-    create_and_check_save_and_load_tokenizer(tester, tokenizer_class, *inputs, **kwargs)
-    create_and_check_pickle_tokenizer(tester, tokenizer_class, *inputs, **kwargs)
+            for weights_list_2 in weights_lists_2:
+                self.assertListEqual(weights_list, weights_list_2)
diff --git a/pytorch_transformers/tests/tokenization_transfo_xl_test.py b/pytorch_transformers/tests/tokenization_transfo_xl_test.py
index aecfeaef5f..fbd06cf47e 100644
--- a/pytorch_transformers/tests/tokenization_transfo_xl_test.py
+++ b/pytorch_transformers/tests/tokenization_transfo_xl_test.py
@@ -20,32 +20,39 @@ from io import open
 
 from pytorch_transformers.tokenization_transfo_xl import TransfoXLTokenizer, VOCAB_FILES_NAMES
 
-from.tokenization_tests_commons import create_and_check_tokenizer_commons, TemporaryDirectory
+from.tokenization_tests_commons import CommonTestCases
 
-class TransfoXLTokenizationTest(unittest.TestCase):
+class TransfoXLTokenizationTest(CommonTestCases.CommonTokenizerTester):
+
+    tokenizer_class = TransfoXLTokenizer
+
+    def setUp(self):
+        super(TransfoXLTokenizationTest, self).setUp()
 
-    def test_full_tokenizer(self):
         vocab_tokens = [
             "<unk>", "[CLS]", "[SEP]", "want", "unwanted", "wa", "un",
             "running", ",", "low", "l",
         ]
-        with TemporaryDirectory() as tmpdirname:
-            vocab_file = os.path.join(tmpdirname, VOCAB_FILES_NAMES['vocab_file'])
-            with open(vocab_file, "w", encoding='utf-8') as vocab_writer:
-                vocab_writer.write("".join([x + "\n" for x in vocab_tokens]))
+        self.vocab_file = os.path.join(self.tmpdirname, VOCAB_FILES_NAMES['vocab_file'])
+        with open(self.vocab_file, "w", encoding='utf-8') as vocab_writer:
+            vocab_writer.write("".join([x + "\n" for x in vocab_tokens]))
 
-            input_text = u"<unk> UNwanted , running"
-            output_text = u"<unk> unwanted, running"
+    def get_tokenizer(self):
+        return TransfoXLTokenizer.from_pretrained(self.tmpdirname, lower_case=True)
 
-            create_and_check_tokenizer_commons(self, input_text, output_text, TransfoXLTokenizer, tmpdirname, lower_case=True)
+    def get_input_output_texts(self):
+        input_text = u"<unk> UNwanted , running"
+        output_text = u"<unk> unwanted, running"
+        return input_text, output_text
 
-            tokenizer = TransfoXLTokenizer(vocab_file=vocab_file, lower_case=True)
+    def test_full_tokenizer(self):
+        tokenizer = TransfoXLTokenizer(vocab_file=self.vocab_file, lower_case=True)
 
-            tokens = tokenizer.tokenize(u"<unk> UNwanted , running")
-            self.assertListEqual(tokens, ["<unk>", "unwanted", ",", "running"])
+        tokens = tokenizer.tokenize(u"<unk> UNwanted , running")
+        self.assertListEqual(tokens, ["<unk>", "unwanted", ",", "running"])
 
-            self.assertListEqual(
-                tokenizer.convert_tokens_to_ids(tokens), [0, 4, 8, 7])
+        self.assertListEqual(
+            tokenizer.convert_tokens_to_ids(tokens), [0, 4, 8, 7])
 
     def test_full_tokenizer_lower(self):
         tokenizer = TransfoXLTokenizer(lower_case=True)
diff --git a/pytorch_transformers/tests/tokenization_xlm_test.py b/pytorch_transformers/tests/tokenization_xlm_test.py
index 97e8fa983f..a20e92044f 100644
--- a/pytorch_transformers/tests/tokenization_xlm_test.py
+++ b/pytorch_transformers/tests/tokenization_xlm_test.py
@@ -20,12 +20,16 @@ import json
 
 from pytorch_transformers.tokenization_xlm import XLMTokenizer, VOCAB_FILES_NAMES
 
-from .tokenization_tests_commons import create_and_check_tokenizer_commons, TemporaryDirectory
+from .tokenization_tests_commons import CommonTestCases
 
-class XLMTokenizationTest(unittest.TestCase):
+class XLMTokenizationTest(CommonTestCases.CommonTokenizerTester):
 
-    def test_full_tokenizer(self):
-        """ Adapted from Sennrich et al. 2015 and https://github.com/rsennrich/subword-nmt """
+    tokenizer_class = XLMTokenizer
+
+    def setUp(self):
+        super(XLMTokenizationTest, self).setUp()
+
+        # Adapted from Sennrich et al. 2015 and https://github.com/rsennrich/subword-nmt
         vocab = ["l", "o", "w", "e", "r", "s", "t", "i", "d", "n",
                  "w</w>", "r</w>", "t</w>",
                  "lo", "low", "er</w>",
@@ -33,30 +37,34 @@ class XLMTokenizationTest(unittest.TestCase):
         vocab_tokens = dict(zip(vocab, range(len(vocab))))
         merges = ["l o 123", "lo w 1456", "e r</w> 1789", ""]
 
-        with TemporaryDirectory() as tmpdirname:
-            vocab_file = os.path.join(tmpdirname, VOCAB_FILES_NAMES['vocab_file'])
-            merges_file = os.path.join(tmpdirname, VOCAB_FILES_NAMES['merges_file'])
-            with open(vocab_file, "w") as fp:
-                fp.write(json.dumps(vocab_tokens))
-            with open(merges_file, "w") as fp:
-                fp.write("\n".join(merges))
+        self.vocab_file = os.path.join(self.tmpdirname, VOCAB_FILES_NAMES['vocab_file'])
+        self.merges_file = os.path.join(self.tmpdirname, VOCAB_FILES_NAMES['merges_file'])
+        with open(self.vocab_file, "w") as fp:
+            fp.write(json.dumps(vocab_tokens))
+        with open(self.merges_file, "w") as fp:
+            fp.write("\n".join(merges))
 
-            input_text = u"lower newer"
-            output_text = u"lower newer"
+    def get_tokenizer(self):
+        return XLMTokenizer.from_pretrained(self.tmpdirname)
 
-            create_and_check_tokenizer_commons(self, input_text, output_text, XLMTokenizer, tmpdirname)
+    def get_input_output_texts(self):
+        input_text = u"lower newer"
+        output_text = u"lower newer"
+        return input_text, output_text
 
-            tokenizer = XLMTokenizer(vocab_file, merges_file)
+    def test_full_tokenizer(self):
+        """ Adapted from Sennrich et al. 2015 and https://github.com/rsennrich/subword-nmt """
+        tokenizer = XLMTokenizer(self.vocab_file, self.merges_file)
 
-            text = "lower"
-            bpe_tokens = ["low", "er</w>"]
-            tokens = tokenizer.tokenize(text)
-            self.assertListEqual(tokens, bpe_tokens)
+        text = "lower"
+        bpe_tokens = ["low", "er</w>"]
+        tokens = tokenizer.tokenize(text)
+        self.assertListEqual(tokens, bpe_tokens)
 
-            input_tokens = tokens + ["<unk>"]
-            input_bpe_tokens = [14, 15, 20]
-            self.assertListEqual(
-                tokenizer.convert_tokens_to_ids(input_tokens), input_bpe_tokens)
+        input_tokens = tokens + ["<unk>"]
+        input_bpe_tokens = [14, 15, 20]
+        self.assertListEqual(
+            tokenizer.convert_tokens_to_ids(input_tokens), input_bpe_tokens)
 
 
 if __name__ == '__main__':
diff --git a/pytorch_transformers/tests/tokenization_xlnet_test.py b/pytorch_transformers/tests/tokenization_xlnet_test.py
index 27c6b984ee..08e9e9cb2d 100644
--- a/pytorch_transformers/tests/tokenization_xlnet_test.py
+++ b/pytorch_transformers/tests/tokenization_xlnet_test.py
@@ -19,48 +19,58 @@ import unittest
 
 from pytorch_transformers.tokenization_xlnet import (XLNetTokenizer, SPIECE_UNDERLINE)
 
-from .tokenization_tests_commons import create_and_check_tokenizer_commons, TemporaryDirectory
+from .tokenization_tests_commons import CommonTestCases
 
 SAMPLE_VOCAB = os.path.join(os.path.dirname(os.path.abspath(__file__)),
                     'fixtures/test_sentencepiece.model')
 
-class XLNetTokenizationTest(unittest.TestCase):
+class XLNetTokenizationTest(CommonTestCases.CommonTokenizerTester):
+
+    tokenizer_class = XLNetTokenizer
+
+    def setUp(self):
+        super(XLNetTokenizationTest, self).setUp()
+
+        # We have a SentencePiece fixture for testing
+        tokenizer = XLNetTokenizer(SAMPLE_VOCAB, keep_accents=True)
+        tokenizer.save_pretrained(self.tmpdirname)
+
+    def get_tokenizer(self):
+        return XLNetTokenizer.from_pretrained(self.tmpdirname)
+
+    def get_input_output_texts(self):
+        input_text = u"This is a test"
+        output_text = u"This is a test"
+        return input_text, output_text
+
 
     def test_full_tokenizer(self):
         tokenizer = XLNetTokenizer(SAMPLE_VOCAB, keep_accents=True)
 
-        with TemporaryDirectory() as tmpdirname:
-            tokenizer.save_pretrained(tmpdirname)
+        tokens = tokenizer.tokenize(u'This is a test')
+        self.assertListEqual(tokens, [u'▁This', u'▁is', u'▁a', u'▁t', u'est'])
 
-            input_text = u"This is a test"
-            output_text = u"This is a test"
+        self.assertListEqual(
+            tokenizer.convert_tokens_to_ids(tokens), [285, 46, 10, 170, 382])
 
-            create_and_check_tokenizer_commons(self, input_text, output_text, XLNetTokenizer, tmpdirname)
+        tokens = tokenizer.tokenize(u"I was born in 92000, and this is falsé.")
+        self.assertListEqual(tokens, [SPIECE_UNDERLINE + u'I', SPIECE_UNDERLINE + u'was', SPIECE_UNDERLINE + u'b',
+                                    u'or', u'n', SPIECE_UNDERLINE + u'in', SPIECE_UNDERLINE + u'',
+                                    u'9', u'2', u'0', u'0', u'0', u',', SPIECE_UNDERLINE + u'and', SPIECE_UNDERLINE + u'this',
+                                    SPIECE_UNDERLINE + u'is', SPIECE_UNDERLINE + u'f', u'al', u's', u'é', u'.'])
+        ids = tokenizer.convert_tokens_to_ids(tokens)
+        self.assertListEqual(
+            ids, [8, 21, 84, 55, 24, 19, 7, 0,
+                602, 347, 347, 347, 3, 12, 66,
+                46, 72, 80, 6, 0, 4])
 
-            tokens = tokenizer.tokenize(u'This is a test')
-            self.assertListEqual(tokens, [u'▁This', u'▁is', u'▁a', u'▁t', u'est'])
-
-            self.assertListEqual(
-                tokenizer.convert_tokens_to_ids(tokens), [285, 46, 10, 170, 382])
-
-            tokens = tokenizer.tokenize(u"I was born in 92000, and this is falsé.")
-            self.assertListEqual(tokens, [SPIECE_UNDERLINE + u'I', SPIECE_UNDERLINE + u'was', SPIECE_UNDERLINE + u'b',
-                                        u'or', u'n', SPIECE_UNDERLINE + u'in', SPIECE_UNDERLINE + u'',
-                                        u'9', u'2', u'0', u'0', u'0', u',', SPIECE_UNDERLINE + u'and', SPIECE_UNDERLINE + u'this',
-                                        SPIECE_UNDERLINE + u'is', SPIECE_UNDERLINE + u'f', u'al', u's', u'é', u'.'])
-            ids = tokenizer.convert_tokens_to_ids(tokens)
-            self.assertListEqual(
-                ids, [8, 21, 84, 55, 24, 19, 7, 0,
-                    602, 347, 347, 347, 3, 12, 66,
-                    46, 72, 80, 6, 0, 4])
-
-            back_tokens = tokenizer.convert_ids_to_tokens(ids)
-            self.assertListEqual(back_tokens, [SPIECE_UNDERLINE + u'I', SPIECE_UNDERLINE + u'was', SPIECE_UNDERLINE + u'b',
-                                            u'or', u'n', SPIECE_UNDERLINE + u'in',
-                                            SPIECE_UNDERLINE + u'', u'<unk>', u'2', u'0', u'0', u'0', u',',
-                                            SPIECE_UNDERLINE + u'and', SPIECE_UNDERLINE + u'this',
-                                            SPIECE_UNDERLINE + u'is', SPIECE_UNDERLINE + u'f', u'al', u's',
-                                            u'<unk>', u'.'])
+        back_tokens = tokenizer.convert_ids_to_tokens(ids)
+        self.assertListEqual(back_tokens, [SPIECE_UNDERLINE + u'I', SPIECE_UNDERLINE + u'was', SPIECE_UNDERLINE + u'b',
+                                        u'or', u'n', SPIECE_UNDERLINE + u'in',
+                                        SPIECE_UNDERLINE + u'', u'<unk>', u'2', u'0', u'0', u'0', u',',
+                                        SPIECE_UNDERLINE + u'and', SPIECE_UNDERLINE + u'this',
+                                        SPIECE_UNDERLINE + u'is', SPIECE_UNDERLINE + u'f', u'al', u's',
+                                        u'<unk>', u'.'])
 
     def test_tokenizer_lower(self):
         tokenizer = XLNetTokenizer(SAMPLE_VOCAB, do_lower_case=True)
diff --git a/pytorch_transformers/tokenization_bert.py b/pytorch_transformers/tokenization_bert.py
index d7aeff7c39..9bf18a97d7 100644
--- a/pytorch_transformers/tokenization_bert.py
+++ b/pytorch_transformers/tokenization_bert.py
@@ -86,7 +86,7 @@ def whitespace_tokenize(text):
 class BertTokenizer(PreTrainedTokenizer):
     r"""
     Constructs a BertTokenizer.
-    :class:`~pytorch_pretrained_bert.BertTokenizer` runs end-to-end tokenization: punctuation splitting + wordpiece
+    :class:`~pytorch_transformers.BertTokenizer` runs end-to-end tokenization: punctuation splitting + wordpiece
 
     Args:
         vocab_file: Path to a one-wordpiece-per-line vocabulary file
diff --git a/pytorch_transformers/tokenization_utils.py b/pytorch_transformers/tokenization_utils.py
index 1852d74021..a81a5b9235 100644
--- a/pytorch_transformers/tokenization_utils.py
+++ b/pytorch_transformers/tokenization_utils.py
@@ -125,42 +125,34 @@ class PreTrainedTokenizer(object):
 
     @bos_token.setter
     def bos_token(self, value):
-        self.add_tokens([value]) 
         self._bos_token = value
 
     @eos_token.setter
     def eos_token(self, value):
-        self.add_tokens([value]) 
         self._eos_token = value
 
     @unk_token.setter
     def unk_token(self, value):
-        self.add_tokens([value]) 
         self._unk_token = value
 
     @sep_token.setter
     def sep_token(self, value):
-        self.add_tokens([value]) 
         self._sep_token = value
 
     @pad_token.setter
     def pad_token(self, value):
-        self.add_tokens([value]) 
         self._pad_token = value
 
     @cls_token.setter
     def cls_token(self, value):
-        self.add_tokens([value]) 
         self._cls_token = value
 
     @mask_token.setter
     def mask_token(self, value):
-        self.add_tokens([value]) 
         self._mask_token = value
 
     @additional_special_tokens.setter
     def additional_special_tokens(self, value):
-        self.add_tokens(value) 
         self._additional_special_tokens = value
 
     def __init__(self, max_len=None, **kwargs):
@@ -179,6 +171,10 @@ class PreTrainedTokenizer(object):
 
         for key, value in kwargs.items():
             if key in self.SPECIAL_TOKENS_ATTRIBUTES:
+                if key == 'additional_special_tokens':
+                    assert isinstance(value, (list, tuple)) and all(isinstance(t, str) or (six.PY2 and isinstance(t, unicode)) for t in value)
+                else:
+                    assert isinstance(value, str) or (six.PY2 and isinstance(value, unicode))
                 setattr(self, key, value)
 
 
@@ -415,15 +411,39 @@ class PreTrainedTokenizer(object):
                 
                     Tokens are only added if they are not already in the vocabulary (tested by checking if the tokenizer assign the index of the ``unk_token`` to them).
 
+            Returns:
+                Number of tokens added to the vocabulary.
+
+        Examples::
+
+            # Let's see how to add a new classification token to GPT-2
+            tokenizer = GPT2Tokenizer.from_pretrained('gpt2')
+            model = GPT2Model.from_pretrained('gpt2')
+
+            special_tokens_dict = {'cls_token': '<CLS>'}
+
+            num_added_toks = tokenizer.add_special_tokens(special_tokens_dict)
+            print('We have added', num_added_toks, 'tokens')
+            model.resize_token_embeddings(len(tokenizer))  # Notice: resize_token_embeddings expect to receive the full size of the new vocabulary, i.e. the length of the tokenizer.
+
+            assert tokenizer.cls_token == '<CLS>'
         """
         if not special_tokens_dict:
             return 0
 
+        added_tokens = 0
         for key, value in special_tokens_dict.items():
             assert key in self.SPECIAL_TOKENS_ATTRIBUTES
+            if key == 'additional_special_tokens':
+                assert isinstance(value, (list, tuple)) and all(isinstance(t, str) or (six.PY2 and isinstance(t, unicode)) for t in value)
+                added_tokens += self.add_tokens(value)
+            else:
+                assert isinstance(value, str) or (six.PY2 and isinstance(value, unicode))
+                added_tokens += self.add_tokens([value])
             logger.info("Assigning %s to the %s key of the tokenizer", value, key)
             setattr(self, key, value)
 
+        return added_tokens
 
     def tokenize(self, text, **kwargs):
         """ Converts a string in a sequence of tokens (string), using the tokenizer.

From 58830807d1ce9788da1049f2faab4149b3cab683 Mon Sep 17 00:00:00 2001
From: thomwolf <thomwolf@gmail.com>
Date: Mon, 5 Aug 2019 14:38:59 +0200
Subject: [PATCH 052/200] inidicate we only support pytorch 1.0.0+ now

---
 README.md        | 2 +-
 requirements.txt | 2 +-
 setup.py         | 2 +-
 3 files changed, 3 insertions(+), 3 deletions(-)

diff --git a/README.md b/README.md
index c130675dbd..7fdef9dcb8 100644
--- a/README.md
+++ b/README.md
@@ -25,7 +25,7 @@ These implementations have been tested on several datasets (see the example scri
 
 ## Installation
 
-This repo is tested on Python 2.7 and 3.5+ (examples are tested only on python 3.5+) and PyTorch 0.4.1 to 1.1.0
+This repo is tested on Python 2.7 and 3.5+ (examples are tested only on python 3.5+) and PyTorch 1.0.0+
 
 ### With pip
 
diff --git a/requirements.txt b/requirements.txt
index 165fa74af9..76532d18a5 100644
--- a/requirements.txt
+++ b/requirements.txt
@@ -1,5 +1,5 @@
 # PyTorch
-torch>=0.4.1
+torch>=1.0.0
 # progress bars in model download and training scripts
 tqdm
 # Accessing files from S3 directly.
diff --git a/setup.py b/setup.py
index 514499481a..4c23714980 100644
--- a/setup.py
+++ b/setup.py
@@ -49,7 +49,7 @@ setup(
     url="https://github.com/huggingface/pytorch-transformers",
     packages=find_packages(exclude=["*.tests", "*.tests.*",
                                     "tests.*", "tests"]),
-    install_requires=['torch>=0.4.1',
+    install_requires=['torch>=1.0.0',
                       'numpy',
                       'boto3',
                       'requests',

From b90e29d52cfe94b1995cc5254f700e776b866d2d Mon Sep 17 00:00:00 2001
From: thomwolf <thomwolf@gmail.com>
Date: Mon, 5 Aug 2019 16:06:34 +0200
Subject: [PATCH 053/200] working on automodels

---
 docs/source/model_doc/auto.rst          |  26 +++
 examples/run_squad.py                   |   2 +-
 pytorch_transformers/modeling_auto.py   | 249 ++++++++++++++++++++++++
 pytorch_transformers/modeling_gpt2.py   |   2 +-
 pytorch_transformers/modeling_openai.py |   2 +-
 pytorch_transformers/modeling_utils.py  |  22 +--
 6 files changed, 289 insertions(+), 14 deletions(-)
 create mode 100644 docs/source/model_doc/auto.rst

diff --git a/docs/source/model_doc/auto.rst b/docs/source/model_doc/auto.rst
new file mode 100644
index 0000000000..ad439fff03
--- /dev/null
+++ b/docs/source/model_doc/auto.rst
@@ -0,0 +1,26 @@
+AutoModel, AutoConfig and AutoTokenizer - Standard derived classes
+---------------------------------------------------------------
+
+In many case, the architecture you want to use can be guessed from the name or the path of the pretrained model you are supplying to the ``from_pretrained`` method.
+
+Auto classes are here to do this job for you so that you automatically retreive the relevant model given the name/path to the pretrained weights/config/vocabulary.
+
+``AutoConfig``
+~~~~~~~~~~~~~~~~~~~~~
+
+.. autoclass:: pytorch_transformers.AutoConfig
+    :members:
+
+
+``AutoModel``
+~~~~~~~~~~~~~~~~~~~~~
+
+.. autoclass:: pytorch_transformers.AutoModel
+    :members:
+
+
+``AutoTokenizer``
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+.. autoclass:: pytorch_transformers.AutoTokenizer
+    :members:
diff --git a/examples/run_squad.py b/examples/run_squad.py
index 36e03fb012..937aa154ea 100644
--- a/examples/run_squad.py
+++ b/examples/run_squad.py
@@ -134,7 +134,7 @@ def train(args, train_dataset, model, tokenizer):
                       'end_positions':   batch[4]}
             if args.model_type in ['xlnet', 'xlm']:
                 inputs.update({'cls_index': batch[5],
-                               'p_mask':    batch[6]})
+                               'p_mask':       batch[6]})
             outputs = model(**inputs)
             loss = outputs[0]  # model outputs are always tuple in pytorch-transformers (see doc)
 
diff --git a/pytorch_transformers/modeling_auto.py b/pytorch_transformers/modeling_auto.py
index 3e28fbd0a9..7d3ea7ec60 100644
--- a/pytorch_transformers/modeling_auto.py
+++ b/pytorch_transformers/modeling_auto.py
@@ -18,6 +18,11 @@ from __future__ import absolute_import, division, print_function, unicode_litera
 
 import logging
 
+import torch
+import torch.nn as nn
+from torch.nn import CrossEntropyLoss, MSELoss
+from torch.nn.parameter import Parameter
+
 from .modeling_bert import BertConfig, BertModel
 from .modeling_openai import OpenAIGPTConfig, OpenAIGPTModel
 from .modeling_gpt2 import GPT2Config, GPT2Model
@@ -25,6 +30,8 @@ from .modeling_transfo_xl import TransfoXLConfig, TransfoXLModel
 from .modeling_xlnet import XLNetConfig, XLNetModel
 from .modeling_xlm import XLMConfig, XLMModel
 
+from .modeling_utils import PreTrainedModel, SequenceSummary
+
 logger = logging.getLogger(__name__)
 
 class AutoConfig(object):
@@ -228,3 +235,245 @@ class AutoModel(object):
                          "'bert', 'openai-gpt', 'gpt2', 'transfo-xl', 'xlnet', "
                          "'xlm'".format(pretrained_model_name_or_path))
 
+
+class DerivedAutoModel(PreTrainedModel):
+    r"""
+        :class:`~pytorch_transformers.DerivedAutoModel` is a base class for building
+        standardized derived models on top of :class:`~pytorch_transformers.AutoModel` by adding heads
+
+        The `from_pretrained()` method take care of using the correct base model class instance
+        using pattern matching on the `pretrained_model_name_or_path` string.
+
+        The base model class to instantiate is selected as the first pattern matching
+        in the `pretrained_model_name_or_path` string (in the following order):
+            - contains `bert`: BertConfig (Bert model)
+            - contains `openai-gpt`: OpenAIGPTConfig (OpenAI GPT model)
+            - contains `gpt2`: GPT2Config (OpenAI GPT-2 model)
+            - contains `transfo-xl`: TransfoXLConfig (Transformer-XL model)
+            - contains `xlnet`: XLNetConfig (XLNet model)
+            - contains `xlm`: XLMConfig (XLM model)
+
+        This class should usually not be instantiated using `__init__()` but `from_pretrained()`.
+    """
+    config_class = None
+    pretrained_model_archive_map = {}
+    load_tf_weights = lambda model, config, path: None
+    base_model_prefix = "transformer"
+
+    def __init__(self, base_model):
+        super(DerivedAutoModel, self).__init__(base_model.config)
+        self.transformer = base_model
+
+    def init_weights(self, module):
+        """ Initialize the weights. Use the base model initialization function.
+        """
+        self.transformer.init_weights(module)
+
+    @classmethod
+    def from_pretrained(cls, pretrained_model_name_or_path, *model_args, **kwargs):
+        r""" Instantiate a :class:`~pytorch_transformers.DerivedAutoModel` with one of the base model classes of the library
+        from a pre-trained model configuration.
+
+        The base model class to instantiate is selected as the first pattern matching
+        in the `pretrained_model_name_or_path` string (in the following order):
+            - contains `bert`: BertConfig (Bert model)
+            - contains `openai-gpt`: OpenAIGPTConfig (OpenAI GPT model)
+            - contains `gpt2`: GPT2Config (OpenAI GPT-2 model)
+            - contains `transfo-xl`: TransfoXLConfig (Transformer-XL model)
+            - contains `xlnet`: XLNetConfig (XLNet model)
+            - contains `xlm`: XLMConfig (XLM model)
+
+            The model is set in evaluation mode by default using `model.eval()` (Dropout modules are deactivated)
+            To train the model, you should first set it back in training mode with `model.train()`
+
+        Params:
+            **pretrained_model_name_or_path**: either:
+                - a string with the `shortcut name` of a pre-trained model to load from cache
+                    or download and cache if not already stored in cache (e.g. 'bert-base-uncased').
+                - a path to a `directory` containing a configuration file saved
+                    using the `save_pretrained(save_directory)` method.
+                - a path or url to a tensorflow index checkpoint `file` (e.g. `./tf_model/model.ckpt.index`).
+                    In this case, ``from_tf`` should be set to True and a configuration object should be
+                    provided as `config` argument. This loading option is slower than converting the TensorFlow
+                    checkpoint in a PyTorch model using the provided conversion scripts and loading
+                    the PyTorch model afterwards.
+            **model_args**: (`optional`) Sequence:
+                All remaning positional arguments will be passed to the underlying model's __init__ function
+            **config**: an optional configuration for the model to use instead of an automatically loaded configuation.
+                Configuration can be automatically loaded when:
+                - the model is a model provided by the library (loaded with a `shortcut name` of a pre-trained model), or
+                - the model was saved using the `save_pretrained(save_directory)` (loaded by suppling the save directory).
+            **state_dict**: an optional state dictionnary for the model to use instead of a state dictionary loaded
+                from saved weights file.
+                This option can be used if you want to create a model from a pretrained configuration but load your own weights.
+                In this case though, you should check if using `save_pretrained(dir)` and `from_pretrained(save_directory)` is not
+                a simpler option.
+            **cache_dir**: (`optional`) string:
+                Path to a directory in which a downloaded pre-trained model
+                configuration should be cached if the standard cache should not be used.
+            **output_loading_info**: (`optional`) boolean:
+                Set to ``True`` to also return a dictionnary containing missing keys, unexpected keys and error messages.
+            **kwargs**: (`optional`) dict:
+                Dictionary of key, values to update the configuration object after loading.
+                Can be used to override selected configuration parameters. E.g. ``output_attention=True``.
+
+               - If a configuration is provided with `config`, **kwargs will be directly passed
+                 to the underlying model's __init__ method.
+               - If a configuration is not provided, **kwargs will be first passed to the pretrained
+                 model configuration class loading function (`PretrainedConfig.from_pretrained`).
+                 Each key of **kwargs that corresponds to a configuration attribute
+                 will be used to override said attribute with the supplied **kwargs value.
+                 Remaining keys that do not correspond to any configuration attribute will
+                 be passed to the underlying model's __init__ function.
+
+        Examples::
+
+            model = AutoModel.from_pretrained('bert-base-uncased')    # Download model and configuration from S3 and cache.
+            model = AutoModel.from_pretrained('./test/bert_model/')  # E.g. model was saved using `save_pretrained('./test/saved_model/')`
+            model = AutoModel.from_pretrained('bert-base-uncased', output_attention=True)  # Update configuration during loading
+            assert model.config.output_attention == True
+            # Loading from a TF checkpoint file instead of a PyTorch model (slower)
+            config = AutoConfig.from_json_file('./tf_model/bert_tf_model_config.json')
+            model = AutoModel.from_pretrained('./tf_model/bert_tf_checkpoint.ckpt.index', from_tf=True, config=config)
+
+        """
+        if 'bert' in pretrained_model_name_or_path:
+            base_model_class = BertModel
+        elif 'openai-gpt' in pretrained_model_name_or_path:
+            base_model_class = OpenAIGPTModel
+        elif 'gpt2' in pretrained_model_name_or_path:
+            base_model_class = GPT2Model
+        elif 'transfo-xl' in pretrained_model_name_or_path:
+            base_model_class = TransfoXLModel
+        elif 'xlnet' in pretrained_model_name_or_path:
+            base_model_class = XLNetModel
+        elif 'xlm' in pretrained_model_name_or_path:
+            base_model_class = XLMModel
+        else:
+            raise ValueError("Unrecognized model identifier in {}. Should contains one of "
+                            "'bert', 'openai-gpt', 'gpt2', 'transfo-xl', 'xlnet', "
+                            "'xlm'".format(pretrained_model_name_or_path))
+
+        # Get a pretrained base_model
+        base_model = base_model_class.from_pretrained(pretrained_model_name_or_path, *model_args, **kwargs)
+
+        # Create our derived model
+        model = cls(base_model)
+
+        # Setup class attribute from the base model class
+        model.config_class = base_model.config_class
+        model.pretrained_model_archive_map = base_model.pretrained_model_archive_map
+        model.load_tf_weights = base_model.load_tf_weights
+
+        return model
+
+
+class AutoModelWithLMHead(DerivedAutoModel):
+    r"""
+        :class:`~pytorch_transformers.AutoModelWithLMHead` is a base class for language modeling
+        that contains
+        
+            - a base model instantiated as one of the base model classes of the library when created with the `AutoModel.from_pretrained(pretrained_model_name_or_path)` class method, and .
+            - a language modeling head on top of the base model.
+
+        The `from_pretrained()` method take care of using the correct base model class instance
+        using pattern matching on the `pretrained_model_name_or_path` string.
+
+        The base model class to instantiate is selected as the first pattern matching
+        in the `pretrained_model_name_or_path` string (in the following order):
+            - contains `bert`: BertConfig (Bert model)
+            - contains `openai-gpt`: OpenAIGPTConfig (OpenAI GPT model)
+            - contains `gpt2`: GPT2Config (OpenAI GPT-2 model)
+            - contains `transfo-xl`: TransfoXLConfig (Transformer-XL model)
+            - contains `xlnet`: XLNetConfig (XLNet model)
+            - contains `xlm`: XLMConfig (XLM model)
+
+        This class should usually not be instantiated using `__init__()` but `from_pretrained()`.
+    """
+
+    def __init__(self, base_model):
+        super(AutoModelWithLMHead, self).__init__(base_model)
+        self.lm_head = nn.Linear(config.hidden_size, config.vocab_size, bias=False)
+
+        self.apply(self.init_weights)
+        self.tie_weights()
+
+    def tie_weights(self):
+        """ Make sure we are sharing the input and output embeddings.
+            Export to TorchScript can't handle parameter sharing so we are cloning them instead.
+        """
+        # get input embeddings - whatever the model is
+        input_embeddings = self.transformer.resize_token_embeddings(new_num_tokens=None)
+
+        # tie of clone (torchscript) embeddings
+        self._tie_or_clone_weights(self.lm_head, input_embeddings)
+
+    def forward(self, input_ids, **kwargs):
+        labels = kwargs.pop('labels', None)  # Python 2 compatibility...
+
+        transformer_outputs = self.transformer(input_ids, **kwargs)
+        hidden_states = transformer_outputs[0]
+
+        lm_logits = self.lm_head(hidden_states)
+
+        outputs = (lm_logits,) + transformer_outputs[1:]
+        if labels is not None:
+            loss_fct = CrossEntropyLoss(ignore_index=-1)
+            loss = loss_fct(lm_logits.view(-1, lm_logits.size(-1)),
+                            labels.view(-1))
+            outputs = (loss,) + outputs
+
+        return outputs  # (loss), lm_logits, presents, (all hidden_states), (attentions)
+
+
+class AutoModelForSequenceClassification(DerivedAutoModel):
+    r"""
+        :class:`~pytorch_transformers.AutoModelForSequenceClassification` is a class for sequence classification
+        that contains
+        
+            - a base model instantiated as one of the base model classes of the library when created with the `AutoModel.from_pretrained(pretrained_model_name_or_path)` class method, and .
+            - a classification head on top of the base model.
+
+        The `from_pretrained()` method take care of using the correct base model class instance
+        using pattern matching on the `pretrained_model_name_or_path` string.
+
+        The base model class to instantiate is selected as the first pattern matching
+        in the `pretrained_model_name_or_path` string (in the following order):
+            - contains `bert`: BertConfig (Bert model)
+            - contains `openai-gpt`: OpenAIGPTConfig (OpenAI GPT model)
+            - contains `gpt2`: GPT2Config (OpenAI GPT-2 model)
+            - contains `transfo-xl`: TransfoXLConfig (Transformer-XL model)
+            - contains `xlnet`: XLNetConfig (XLNet model)
+            - contains `xlm`: XLMConfig (XLM model)
+
+        This class should usually not be instantiated using `__init__()` but `from_pretrained()`.
+    """
+
+    def __init__(self, base_model):
+        super(AutoModelForSequenceClassification, self).__init__(base_model)
+        self.num_labels = base_model.config.num_labels
+        self.sequence_summary = SequenceSummary(base_model.config)
+
+        self.apply(self.init_weights)
+
+    def forward(self, input_ids, cls_index, **kwargs):
+        labels = kwargs.pop('labels', None)  # Python 2 compatibility...
+
+        transformer_outputs = self.transformer(input_ids, **kwargs)
+
+        output = transformer_outputs[0]
+        logits = self.sequence_summary(output, cls_index=cls_index)
+
+        outputs = (logits,) + transformer_outputs[1:]  # Keep new_mems and attention/hidden states if they are here
+
+        if labels is not None:
+            if self.num_labels == 1:
+                #  We are doing regression
+                loss_fct = MSELoss()
+                loss = loss_fct(logits.view(-1), labels.view(-1))
+            else:
+                loss_fct = CrossEntropyLoss()
+                loss = loss_fct(logits.view(-1, self.num_labels), labels.view(-1))
+            outputs = (loss,) + outputs
+
+        return outputs
diff --git a/pytorch_transformers/modeling_gpt2.py b/pytorch_transformers/modeling_gpt2.py
index 4341f0d8a1..5268c5de7d 100644
--- a/pytorch_transformers/modeling_gpt2.py
+++ b/pytorch_transformers/modeling_gpt2.py
@@ -137,7 +137,7 @@ class GPT2Config(PretrainedConfig):
         initializer_range=0.02,
 
         num_labels=1,
-        summary_type='token_ids',
+        summary_type='cls_index',
         summary_use_proj=True,
         summary_activation=None,
         summary_proj_to_labels=True,
diff --git a/pytorch_transformers/modeling_openai.py b/pytorch_transformers/modeling_openai.py
index a6cb6212ef..c51023444d 100644
--- a/pytorch_transformers/modeling_openai.py
+++ b/pytorch_transformers/modeling_openai.py
@@ -171,7 +171,7 @@ class OpenAIGPTConfig(PretrainedConfig):
         predict_special_tokens=True,
 
         num_labels=1,
-        summary_type='token_ids',
+        summary_type='cls_index',
         summary_use_proj=True,
         summary_activation=None,
         summary_proj_to_labels=True,
diff --git a/pytorch_transformers/modeling_utils.py b/pytorch_transformers/modeling_utils.py
index 8970cd56f8..2664c542e0 100644
--- a/pytorch_transformers/modeling_utils.py
+++ b/pytorch_transformers/modeling_utils.py
@@ -765,7 +765,7 @@ class SequenceSummary(nn.Module):
                 - 'last' => [default] take the last token hidden state (like XLNet)
                 - 'first' => take the first token hidden state (like Bert)
                 - 'mean' => take the mean of all tokens hidden states
-                - 'token_ids' => supply a Tensor of classification token indices (GPT/GPT-2)
+                - 'cls_index' => supply a Tensor of classification token position (GPT/GPT-2)
                 - 'attn' => Not implemented now, use multi-head attention
             summary_use_proj: Add a projection after the vector extraction
             summary_proj_to_labels: If True, the projection outputs to config.num_labels classes (otherwise to hidden_size). Default: False.
@@ -803,11 +803,11 @@ class SequenceSummary(nn.Module):
         if hasattr(config, 'summary_last_dropout') and config.summary_last_dropout > 0:
             self.last_dropout = nn.Dropout(config.summary_last_dropout)
 
-    def forward(self, hidden_states, token_ids=None):
+    def forward(self, hidden_states, cls_index=None):
         """ hidden_states: float Tensor in shape [bsz, seq_len, hidden_size], the hidden-states of the last layer.
-            token_ids: [optional] index of the classification token if summary_type == 'token_ids',
+            cls_index: [optional] position of the classification token if summary_type == 'cls_index',
                 shape (bsz,) or more generally (bsz, ...) where ... are optional leading dimensions of hidden_states.
-                if summary_type == 'token_ids' and token_ids is None:
+                if summary_type == 'cls_index' and cls_index is None:
                     we take the last token of the sequence as classification token
         """
         if self.summary_type == 'last':
@@ -816,14 +816,14 @@ class SequenceSummary(nn.Module):
             output = hidden_states[:, 0]
         elif self.summary_type == 'mean':
             output = hidden_states.mean(dim=1)
-        elif self.summary_type == 'token_ids':
-            if token_ids is None:
-                token_ids = torch.full_like(hidden_states[..., :1, :], hidden_states.shape[-2]-1, dtype=torch.long)
+        elif self.summary_type == 'cls_index':
+            if cls_index is None:
+                cls_index = torch.full_like(hidden_states[..., :1, :], hidden_states.shape[-2]-1, dtype=torch.long)
             else:
-                token_ids = token_ids.unsqueeze(-1).unsqueeze(-1)
-                token_ids = token_ids.expand((-1,) * (token_ids.dim()-1) + (hidden_states.size(-1),))
-            # shape of token_ids: (bsz, XX, 1, hidden_size) where XX are optional leading dim of hidden_states
-            output = hidden_states.gather(-2, token_ids).squeeze(-2) # shape (bsz, XX, hidden_size)
+                cls_index = cls_index.unsqueeze(-1).unsqueeze(-1)
+                cls_index = cls_index.expand((-1,) * (cls_index.dim()-1) + (hidden_states.size(-1),))
+            # shape of cls_index: (bsz, XX, 1, hidden_size) where XX are optional leading dim of hidden_states
+            output = hidden_states.gather(-2, cls_index).squeeze(-2) # shape (bsz, XX, hidden_size)
         elif self.summary_type == 'attn':
             raise NotImplementedError
 

From 7c524d631e4c0fd0531d02d6a155fc95a3e90810 Mon Sep 17 00:00:00 2001
From: thomwolf <thomwolf@gmail.com>
Date: Mon, 5 Aug 2019 16:25:54 +0200
Subject: [PATCH 054/200] add issue templates

---
 .github/ISSUE_TEMPLATE/bug-report.md      | 36 +++++++++++++++++++
 .github/ISSUE_TEMPLATE/feature-request.md | 16 +++++++++
 .github/ISSUE_TEMPLATE/migration.md       | 43 +++++++++++++++++++++++
 .github/ISSUE_TEMPLATE/question-help.md   |  8 +++++
 4 files changed, 103 insertions(+)
 create mode 100644 .github/ISSUE_TEMPLATE/bug-report.md
 create mode 100644 .github/ISSUE_TEMPLATE/feature-request.md
 create mode 100644 .github/ISSUE_TEMPLATE/migration.md
 create mode 100644 .github/ISSUE_TEMPLATE/question-help.md

diff --git a/.github/ISSUE_TEMPLATE/bug-report.md b/.github/ISSUE_TEMPLATE/bug-report.md
new file mode 100644
index 0000000000..0d9439887b
--- /dev/null
+++ b/.github/ISSUE_TEMPLATE/bug-report.md
@@ -0,0 +1,36 @@
+---
+name: "\U0001F41B Bug Report"
+about: Submit a bug report to help us improve PyTorch Transformers
+---
+
+## 🐛 Bug
+
+<!-- A clear and concise description of what the bug is. -->
+
+## To Reproduce
+
+Steps to reproduce the behavior:
+
+1.
+2.
+3.
+
+<!-- If you have a code sample, error messages, stack traces, please provide it here as well. -->
+
+## Expected behavior
+
+<!-- A clear and concise description of what you expected to happen. -->
+
+## Environment
+
+* OS:
+* Python version:
+* PyTorch version:
+* PyTorch Transformers version (or branch):
+* Using GPU ?
+* Distributed of parallel setup ?
+* Any other relevant information:
+
+## Additional context
+
+<!-- Add any other context about the problem here. -->
\ No newline at end of file
diff --git a/.github/ISSUE_TEMPLATE/feature-request.md b/.github/ISSUE_TEMPLATE/feature-request.md
new file mode 100644
index 0000000000..828e3737be
--- /dev/null
+++ b/.github/ISSUE_TEMPLATE/feature-request.md
@@ -0,0 +1,16 @@
+---
+name: "\U0001F680 Feature Request"
+about: Submit a proposal/request for a new PyTorch Transformers feature
+---
+
+## 🚀 Feature
+
+<!-- A clear and concise description of the feature proposal. Please provide a link to the paper and code in case they exist. -->
+
+## Motivation
+
+<!-- Please outline the motivation for the proposal. Is your feature request related to a problem? e.g., I'm always frustrated when [...]. If this is related to another GitHub issue, please link here too. -->
+
+## Additional context
+
+<!-- Add any other context or screenshots about the feature request here. -->
\ No newline at end of file
diff --git a/.github/ISSUE_TEMPLATE/migration.md b/.github/ISSUE_TEMPLATE/migration.md
new file mode 100644
index 0000000000..9a8b19dffa
--- /dev/null
+++ b/.github/ISSUE_TEMPLATE/migration.md
@@ -0,0 +1,43 @@
+---
+name: "\U0001F4DA Migration from PyTorch-pretrained-Bert"
+about: Report a problem when migrating from PyTorch-pretrained-Bert to PyTorch-Transformers
+---
+
+## 📚 Migration
+
+<!-- Give at least the following information -->
+
+Model I am using (Bert, XLNet....):
+
+The problem arise when using:
+* [ ] the official example scripts
+* [ ] my own modified scripts
+
+The tasks I am working on is:
+* [ ] an official GLUE/SQUaD task: (give the name)
+* [ ] my own task or dataset: (give details)
+
+Language I am using the model on (English, Chinese....):
+
+Details of the issue:
+
+<!-- A clear and concise description of the migration issue. If you have code snippets, please provide it here as well. -->
+
+## Environment
+
+* OS:
+* Python version:
+* PyTorch version:
+* PyTorch Transformers version (or branch):
+* Using GPU ?
+* Distributed of parallel setup ?
+* Any other relevant information:
+
+## Checklist
+
+- [ ] I have read the migration guide in the readme.
+- [ ] I checked if a related official extension example runs on my machine.
+
+## Additional context
+
+<!-- Add any other context about the problem here. -->
\ No newline at end of file
diff --git a/.github/ISSUE_TEMPLATE/question-help.md b/.github/ISSUE_TEMPLATE/question-help.md
new file mode 100644
index 0000000000..8c76994b02
--- /dev/null
+++ b/.github/ISSUE_TEMPLATE/question-help.md
@@ -0,0 +1,8 @@
+---
+name: "❓Questions & Help"
+about: Start a general discussion related to PyTorch Transformers
+---
+
+## ❓ Questions & Help
+
+<!-- A clear and concise description of the question. -->
\ No newline at end of file

From 077ad693e9c3b5702ba9874f7a0f0ed8099c9773 Mon Sep 17 00:00:00 2001
From: thomwolf <thomwolf@gmail.com>
Date: Mon, 5 Aug 2019 16:46:29 +0200
Subject: [PATCH 055/200] tweak issue templates wordings

---
 .github/ISSUE_TEMPLATE/bug-report.md | 14 +++++++++++++-
 .github/ISSUE_TEMPLATE/migration.md  | 10 +++++-----
 2 files changed, 18 insertions(+), 6 deletions(-)

diff --git a/.github/ISSUE_TEMPLATE/bug-report.md b/.github/ISSUE_TEMPLATE/bug-report.md
index 0d9439887b..66f7831aea 100644
--- a/.github/ISSUE_TEMPLATE/bug-report.md
+++ b/.github/ISSUE_TEMPLATE/bug-report.md
@@ -5,7 +5,19 @@ about: Submit a bug report to help us improve PyTorch Transformers
 
 ## 🐛 Bug
 
-<!-- A clear and concise description of what the bug is. -->
+<!-- Important information -->
+
+Model I am using (Bert, XLNet....):
+
+Language I am using the model on (English, Chinese....):
+
+The problem arise when using:
+* [ ] the official example scripts: (give details)
+* [ ] my own modified scripts: (give details)
+
+The tasks I am working on is:
+* [ ] an official GLUE/SQUaD task: (give the name)
+* [ ] my own task or dataset: (give details)
 
 ## To Reproduce
 
diff --git a/.github/ISSUE_TEMPLATE/migration.md b/.github/ISSUE_TEMPLATE/migration.md
index 9a8b19dffa..cf0c9a4757 100644
--- a/.github/ISSUE_TEMPLATE/migration.md
+++ b/.github/ISSUE_TEMPLATE/migration.md
@@ -5,20 +5,20 @@ about: Report a problem when migrating from PyTorch-pretrained-Bert to PyTorch-T
 
 ## 📚 Migration
 
-<!-- Give at least the following information -->
+<!-- Important information -->
 
 Model I am using (Bert, XLNet....):
 
+Language I am using the model on (English, Chinese....):
+
 The problem arise when using:
-* [ ] the official example scripts
-* [ ] my own modified scripts
+* [ ] the official example scripts: (give details)
+* [ ] my own modified scripts: (give details)
 
 The tasks I am working on is:
 * [ ] an official GLUE/SQUaD task: (give the name)
 * [ ] my own task or dataset: (give details)
 
-Language I am using the model on (English, Chinese....):
-
 Details of the issue:
 
 <!-- A clear and concise description of the migration issue. If you have code snippets, please provide it here as well. -->

From 70c10caa06d9feda3f446d0a82655f56cd2afdab Mon Sep 17 00:00:00 2001
From: thomwolf <thomwolf@gmail.com>
Date: Mon, 5 Aug 2019 17:09:37 +0200
Subject: [PATCH 056/200] add option mentioned in #940

---
 examples/run_glue.py  | 6 ++++++
 examples/run_squad.py | 6 ++++++
 2 files changed, 12 insertions(+)

diff --git a/examples/run_glue.py b/examples/run_glue.py
index 0d4ffaa390..a939ea373b 100644
--- a/examples/run_glue.py
+++ b/examples/run_glue.py
@@ -247,6 +247,9 @@ def evaluate(args, model, tokenizer, prefix=""):
 
 
 def load_and_cache_examples(args, task, tokenizer, evaluate=False):
+    if args.local_rank not in [-1, 0]:
+        torch.distributed.barrier()  # Make sure only the first process in distributed training process the dataset, and the others will use the cache
+
     processor = processors[task]()
     output_mode = output_modes[task]
     # Load data features from cache or dataset file
@@ -273,6 +276,9 @@ def load_and_cache_examples(args, task, tokenizer, evaluate=False):
             logger.info("Saving features into cached file %s", cached_features_file)
             torch.save(features, cached_features_file)
 
+    if args.local_rank == 0:
+        torch.distributed.barrier()  # Make sure only the first process in distributed training process the dataset, and the others will use the cache
+
     # Convert to Tensors and build dataset
     all_input_ids = torch.tensor([f.input_ids for f in features], dtype=torch.long)
     all_input_mask = torch.tensor([f.input_mask for f in features], dtype=torch.long)
diff --git a/examples/run_squad.py b/examples/run_squad.py
index 7d768d2c43..e62a1f1ff3 100644
--- a/examples/run_squad.py
+++ b/examples/run_squad.py
@@ -272,6 +272,9 @@ def evaluate(args, model, tokenizer, prefix=""):
 
 
 def load_and_cache_examples(args, tokenizer, evaluate=False, output_examples=False):
+    if args.local_rank not in [-1, 0]:
+        torch.distributed.barrier()  # Make sure only the first process in distributed training process the dataset, and the others will use the cache
+
     # Load data features from cache or dataset file
     input_file = args.predict_file if evaluate else args.train_file
     cached_features_file = os.path.join(os.path.dirname(input_file), 'cached_{}_{}_{}'.format(
@@ -296,6 +299,9 @@ def load_and_cache_examples(args, tokenizer, evaluate=False, output_examples=Fal
             logger.info("Saving features into cached file %s", cached_features_file)
             torch.save(features, cached_features_file)
 
+    if args.local_rank == 0:
+        torch.distributed.barrier()  # Make sure only the first process in distributed training process the dataset, and the others will use the cache
+
     # Convert to Tensors and build dataset
     all_input_ids = torch.tensor([f.input_ids for f in features], dtype=torch.long)
     all_input_mask = torch.tensor([f.input_mask for f in features], dtype=torch.long)

From 7223886dc944b5476ea6be1a9838738644a2e9a1 Mon Sep 17 00:00:00 2001
From: thomwolf <thomwolf@gmail.com>
Date: Mon, 5 Aug 2019 17:16:56 +0200
Subject: [PATCH 057/200] fix #944

---
 README.md | 1 +
 1 file changed, 1 insertion(+)

diff --git a/README.md b/README.md
index 703eb47df9..f3d2865ba8 100644
--- a/README.md
+++ b/README.md
@@ -385,6 +385,7 @@ for batch in train_data:
     loss.backward()
     scheduler.step()
     optimizer.step()
+    optimizer.zero_grad()
 ```
 
 ## Citation

From 3a126e73dd020be851d59cfcdc741fe3e8c6ad4f Mon Sep 17 00:00:00 2001
From: thomwolf <thomwolf@gmail.com>
Date: Mon, 5 Aug 2019 17:26:29 +0200
Subject: [PATCH 058/200] fix #950

---
 .../convert_transfo_xl_checkpoint_to_pytorch.py          | 9 ++++-----
 1 file changed, 4 insertions(+), 5 deletions(-)

diff --git a/pytorch_transformers/convert_transfo_xl_checkpoint_to_pytorch.py b/pytorch_transformers/convert_transfo_xl_checkpoint_to_pytorch.py
index b6672aedf7..5733146444 100755
--- a/pytorch_transformers/convert_transfo_xl_checkpoint_to_pytorch.py
+++ b/pytorch_transformers/convert_transfo_xl_checkpoint_to_pytorch.py
@@ -24,11 +24,10 @@ from io import open
 import torch
 
 import pytorch_transformers.tokenization_transfo_xl as data_utils
-from pytorch_transformers.modeling_transfo_xl import (CONFIG_NAME,
-                                                         WEIGHTS_NAME,
-                                                         TransfoXLConfig,
-                                                         TransfoXLLMHeadModel,
-                                                         load_tf_weights_in_transfo_xl)
+
+from pytorch_transformers import CONFIG_NAME, WEIGHTS_NAME
+from pytorch_transformers.modeling_transfo_xl import (TransfoXLConfig, TransfoXLLMHeadModel,
+                                                      load_tf_weights_in_transfo_xl)
 from pytorch_transformers.tokenization_transfo_xl import (CORPUS_NAME, VOCAB_FILES_NAMES)
 
 if sys.version_info[0] == 2:

From ed4e5422604b04df823eb2011e9ed4d766cf9980 Mon Sep 17 00:00:00 2001
From: thomwolf <thomwolf@gmail.com>
Date: Mon, 5 Aug 2019 18:14:07 +0200
Subject: [PATCH 059/200] adding tests

---
 pytorch_transformers/__init__.py              |  2 +
 pytorch_transformers/modeling_auto.py         | 27 ++++++++-
 pytorch_transformers/modeling_utils.py        |  2 +-
 .../tests/modeling_auto_test.py               | 55 +++++++++++++++++++
 4 files changed, 83 insertions(+), 3 deletions(-)
 create mode 100644 pytorch_transformers/tests/modeling_auto_test.py

diff --git a/pytorch_transformers/__init__.py b/pytorch_transformers/__init__.py
index 72d666448e..d4ddda94fa 100644
--- a/pytorch_transformers/__init__.py
+++ b/pytorch_transformers/__init__.py
@@ -7,6 +7,8 @@ from .tokenization_xlnet import XLNetTokenizer, SPIECE_UNDERLINE
 from .tokenization_xlm import XLMTokenizer
 from .tokenization_utils import (PreTrainedTokenizer)
 
+from .modeling_auto import (AutoConfig, AutoModel, AutoModelForSequenceClassification, AutoModelWithLMHead)
+
 from .modeling_bert import (BertConfig, BertModel, BertForPreTraining,
                        BertForMaskedLM, BertForNextSentencePrediction,
                        BertForSequenceClassification, BertForMultipleChoice,
diff --git a/pytorch_transformers/modeling_auto.py b/pytorch_transformers/modeling_auto.py
index 7d3ea7ec60..22a35090aa 100644
--- a/pytorch_transformers/modeling_auto.py
+++ b/pytorch_transformers/modeling_auto.py
@@ -393,6 +393,8 @@ class AutoModelWithLMHead(DerivedAutoModel):
 
     def __init__(self, base_model):
         super(AutoModelWithLMHead, self).__init__(base_model)
+        config = base_model.config
+
         self.lm_head = nn.Linear(config.hidden_size, config.vocab_size, bias=False)
 
         self.apply(self.init_weights)
@@ -426,6 +428,17 @@ class AutoModelWithLMHead(DerivedAutoModel):
         return outputs  # (loss), lm_logits, presents, (all hidden_states), (attentions)
 
 
+AUTO_MODEL_SEQUENCE_SUMMARY_DEFAULTS = {
+     'num_labels': 2,
+     'summary_type': 'first',
+     'summary_use_proj': True,
+     'summary_activation': None,
+     'summary_proj_to_labels': True,
+     'summary_first_dropout': 0.1
+}
+
+
+
 class AutoModelForSequenceClassification(DerivedAutoModel):
     r"""
         :class:`~pytorch_transformers.AutoModelForSequenceClassification` is a class for sequence classification
@@ -451,8 +464,18 @@ class AutoModelForSequenceClassification(DerivedAutoModel):
 
     def __init__(self, base_model):
         super(AutoModelForSequenceClassification, self).__init__(base_model)
-        self.num_labels = base_model.config.num_labels
-        self.sequence_summary = SequenceSummary(base_model.config)
+        # Complete configuration with defaults if necessary
+        config = base_model.config
+        for key, value in AUTO_MODEL_SEQUENCE_SUMMARY_DEFAULTS.items():
+            if not hasattr(config, key):
+                setattr(config, key, value)
+
+        # Update base model and derived model config
+        self.transformer.config = config
+        self.config = config
+
+        self.num_labels = config.num_labels
+        self.sequence_summary = SequenceSummary(config)
 
         self.apply(self.init_weights)
 
diff --git a/pytorch_transformers/modeling_utils.py b/pytorch_transformers/modeling_utils.py
index 2664c542e0..f832b482af 100644
--- a/pytorch_transformers/modeling_utils.py
+++ b/pytorch_transformers/modeling_utils.py
@@ -777,7 +777,7 @@ class SequenceSummary(nn.Module):
         super(SequenceSummary, self).__init__()
 
         self.summary_type = config.summary_type if hasattr(config, 'summary_use_proj') else 'last'
-        if config.summary_type == 'attn':
+        if self.summary_type == 'attn':
             # We should use a standard multi-head attention module with absolute positional embedding for that.
             # Cf. https://github.com/zihangdai/xlnet/blob/master/modeling.py#L253-L276
             # We can probably just use the multi-head attention module of PyTorch >=1.1.0
diff --git a/pytorch_transformers/tests/modeling_auto_test.py b/pytorch_transformers/tests/modeling_auto_test.py
new file mode 100644
index 0000000000..07042a255c
--- /dev/null
+++ b/pytorch_transformers/tests/modeling_auto_test.py
@@ -0,0 +1,55 @@
+# coding=utf-8
+# Copyright 2018 The Google AI Language Team Authors.
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+#     http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+from __future__ import absolute_import
+from __future__ import division
+from __future__ import print_function
+
+import unittest
+import shutil
+import pytest
+import logging
+
+from pytorch_transformers import AutoConfig, BertConfig, AutoModel, BertModel, AutoModelForSequenceClassification, AutoModelWithLMHead
+from pytorch_transformers.modeling_bert import BERT_PRETRAINED_MODEL_ARCHIVE_MAP
+
+from .modeling_common_test import (CommonTestCases, ConfigTester, ids_tensor)
+
+
+class AutoModelTest(unittest.TestCase):
+    def test_model_from_pretrained(self):
+        logging.basicConfig(level=logging.INFO)
+        for model_name in list(BERT_PRETRAINED_MODEL_ARCHIVE_MAP.keys())[:1]:
+            config = AutoConfig.from_pretrained(model_name)
+            self.assertIsNotNone(config)
+            self.assertIsInstance(config, BertConfig)
+
+            model = AutoModel.from_pretrained(model_name)
+            model, loading_info = AutoModel.from_pretrained(model_name, output_loading_info=True)
+            self.assertIsNotNone(model)
+            self.assertIsInstance(model, BertModel)
+            for value in loading_info.values():
+                self.assertEqual(len(value), 0)
+
+            model = AutoModelForSequenceClassification.from_pretrained(model_name)
+            self.assertIsNotNone(model)
+            self.assertIsInstance(getattr(model, model.base_model_prefix), BertModel)
+
+            model = AutoModelWithLMHead.from_pretrained(model_name)
+            self.assertIsNotNone(model)
+            self.assertIsInstance(getattr(model, model.base_model_prefix), BertModel)
+
+
+if __name__ == "__main__":
+    unittest.main()

From 13936a962102ed20424838fe5d445a28b5225d08 Mon Sep 17 00:00:00 2001
From: thomwolf <thomwolf@gmail.com>
Date: Mon, 5 Aug 2019 18:48:16 +0200
Subject: [PATCH 060/200] update doc and tests

---
 docs/source/index.rst                         |  1 +
 docs/source/model_doc/auto.rst                | 28 +++++++++--
 pytorch_transformers/__init__.py              |  1 +
 .../tests/tokenization_auto_test.py           | 46 +++++++++++++++++++
 4 files changed, 72 insertions(+), 4 deletions(-)
 create mode 100644 pytorch_transformers/tests/tokenization_auto_test.py

diff --git a/docs/source/index.rst b/docs/source/index.rst
index b80fd8437b..b613596331 100644
--- a/docs/source/index.rst
+++ b/docs/source/index.rst
@@ -40,6 +40,7 @@ The library currently contains PyTorch implementations, pre-trained model weight
     :maxdepth: 2
     :caption: Package Reference
 
+    model_doc/auto
     model_doc/bert
     model_doc/gpt
     model_doc/transformerxl
diff --git a/docs/source/model_doc/auto.rst b/docs/source/model_doc/auto.rst
index ad439fff03..43f6e103bd 100644
--- a/docs/source/model_doc/auto.rst
+++ b/docs/source/model_doc/auto.rst
@@ -1,9 +1,15 @@
-AutoModel, AutoConfig and AutoTokenizer - Standard derived classes
----------------------------------------------------------------
+AutoModels
+-----------
 
-In many case, the architecture you want to use can be guessed from the name or the path of the pretrained model you are supplying to the ``from_pretrained`` method.
+In many cases, the architecture you want to use can be guessed from the name or the path of the pretrained model you are supplying to the ``from_pretrained`` method.
+
+AutoClasses are here to do this job for you so that you automatically retreive the relevant model given the name/path to the pretrained weights/config/vocabulary.
+
+There are two types of AutoClasses:
+
+- ``AutoModel``, ``AutoConfig`` and ``AutoTokenizer``: instantiating these ones will directly create a class of the relevant architecture (ex: ``model = AutoModel.from_pretrained('bert-base-cased')`` will create a instance of ``BertModel``)
+- All the others (``AutoModelWithLMHead``, ``AutoModelForSequenceClassification``...)  are standardized Auto classes for finetuning. Instantiating these will create instance of the same class (``AutoModelWithLMHead``, ``AutoModelForSequenceClassification``...) comprising (i) the relevant base model class (as mentioned just above) and (ii) a standard fine-tuning head on top, convenient for the task.
 
-Auto classes are here to do this job for you so that you automatically retreive the relevant model given the name/path to the pretrained weights/config/vocabulary.
 
 ``AutoConfig``
 ~~~~~~~~~~~~~~~~~~~~~
@@ -19,6 +25,20 @@ Auto classes are here to do this job for you so that you automatically retreive
     :members:
 
 
+``AutoModelWithLMHead``
+~~~~~~~~~~~~~~~~~~~~~~~~~
+
+.. autoclass:: pytorch_transformers.AutoModelWithLMHead
+    :members:
+
+
+``AutoModelForSequenceClassification``
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+.. autoclass:: pytorch_transformers.AutoModelForSequenceClassification
+    :members:
+
+
 ``AutoTokenizer``
 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
 
diff --git a/pytorch_transformers/__init__.py b/pytorch_transformers/__init__.py
index d4ddda94fa..110c3dc3c7 100644
--- a/pytorch_transformers/__init__.py
+++ b/pytorch_transformers/__init__.py
@@ -1,4 +1,5 @@
 __version__ = "1.0.0"
+from .tokenization_auto import AutoTokenizer
 from .tokenization_bert import BertTokenizer, BasicTokenizer, WordpieceTokenizer
 from .tokenization_openai import OpenAIGPTTokenizer
 from .tokenization_transfo_xl import (TransfoXLTokenizer, TransfoXLCorpus)
diff --git a/pytorch_transformers/tests/tokenization_auto_test.py b/pytorch_transformers/tests/tokenization_auto_test.py
new file mode 100644
index 0000000000..f4f82083f2
--- /dev/null
+++ b/pytorch_transformers/tests/tokenization_auto_test.py
@@ -0,0 +1,46 @@
+# coding=utf-8
+# Copyright 2018 The Google AI Language Team Authors.
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+#     http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+from __future__ import absolute_import
+from __future__ import division
+from __future__ import print_function
+
+import unittest
+import shutil
+import pytest
+import logging
+
+from pytorch_transformers import AutoTokenizer, BertTokenizer, AutoTokenizer, GPT2Tokenizer
+from pytorch_transformers.modeling_bert import BERT_PRETRAINED_MODEL_ARCHIVE_MAP
+from pytorch_transformers.modeling_gpt2 import GPT2_PRETRAINED_MODEL_ARCHIVE_MAP
+
+
+class AutoTokenizerTest(unittest.TestCase):
+    def test_tokenizer_from_pretrained(self):
+        logging.basicConfig(level=logging.INFO)
+        for model_name in list(BERT_PRETRAINED_MODEL_ARCHIVE_MAP.keys())[:1]:
+            tokenizer = AutoTokenizer.from_pretrained(model_name)
+            self.assertIsNotNone(tokenizer)
+            self.assertIsInstance(tokenizer, BertTokenizer)
+            self.assertGreater(len(tokenizer), 0)
+
+        for model_name in list(GPT2_PRETRAINED_MODEL_ARCHIVE_MAP.keys())[:1]:
+            tokenizer = AutoTokenizer.from_pretrained(model_name)
+            self.assertIsNotNone(tokenizer)
+            self.assertIsInstance(tokenizer, GPT2Tokenizer)
+            self.assertGreater(len(tokenizer), 0)
+
+
+if __name__ == "__main__":
+    unittest.main()

From 0b524b084857d0bf54eb613304a61bcdbd71e6fb Mon Sep 17 00:00:00 2001
From: thomwolf <thomwolf@gmail.com>
Date: Mon, 5 Aug 2019 19:08:19 +0200
Subject: [PATCH 061/200] remove derived classes for now

---
 docs/source/model_doc/auto.rst                |  21 +-
 pytorch_transformers/__init__.py              |   2 +-
 pytorch_transformers/modeling_auto.py         | 266 ------------------
 .../tests/modeling_auto_test.py               |  10 +-
 4 files changed, 4 insertions(+), 295 deletions(-)

diff --git a/docs/source/model_doc/auto.rst b/docs/source/model_doc/auto.rst
index 43f6e103bd..7b56eabafe 100644
--- a/docs/source/model_doc/auto.rst
+++ b/docs/source/model_doc/auto.rst
@@ -3,12 +3,9 @@ AutoModels
 
 In many cases, the architecture you want to use can be guessed from the name or the path of the pretrained model you are supplying to the ``from_pretrained`` method.
 
-AutoClasses are here to do this job for you so that you automatically retreive the relevant model given the name/path to the pretrained weights/config/vocabulary.
+AutoClasses are here to do this job for you so that you automatically retreive the relevant model given the name/path to the pretrained weights/config/vocabulary:
 
-There are two types of AutoClasses:
-
-- ``AutoModel``, ``AutoConfig`` and ``AutoTokenizer``: instantiating these ones will directly create a class of the relevant architecture (ex: ``model = AutoModel.from_pretrained('bert-base-cased')`` will create a instance of ``BertModel``)
-- All the others (``AutoModelWithLMHead``, ``AutoModelForSequenceClassification``...)  are standardized Auto classes for finetuning. Instantiating these will create instance of the same class (``AutoModelWithLMHead``, ``AutoModelForSequenceClassification``...) comprising (i) the relevant base model class (as mentioned just above) and (ii) a standard fine-tuning head on top, convenient for the task.
+Instantiating one of ``AutoModel``, ``AutoConfig`` and ``AutoTokenizer`` will directly create a class of the relevant architecture (ex: ``model = AutoModel.from_pretrained('bert-base-cased')`` will create a instance of ``BertModel``).
 
 
 ``AutoConfig``
@@ -25,20 +22,6 @@ There are two types of AutoClasses:
     :members:
 
 
-``AutoModelWithLMHead``
-~~~~~~~~~~~~~~~~~~~~~~~~~
-
-.. autoclass:: pytorch_transformers.AutoModelWithLMHead
-    :members:
-
-
-``AutoModelForSequenceClassification``
-~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
-
-.. autoclass:: pytorch_transformers.AutoModelForSequenceClassification
-    :members:
-
-
 ``AutoTokenizer``
 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
 
diff --git a/pytorch_transformers/__init__.py b/pytorch_transformers/__init__.py
index 110c3dc3c7..04e5c3c9dd 100644
--- a/pytorch_transformers/__init__.py
+++ b/pytorch_transformers/__init__.py
@@ -8,7 +8,7 @@ from .tokenization_xlnet import XLNetTokenizer, SPIECE_UNDERLINE
 from .tokenization_xlm import XLMTokenizer
 from .tokenization_utils import (PreTrainedTokenizer)
 
-from .modeling_auto import (AutoConfig, AutoModel, AutoModelForSequenceClassification, AutoModelWithLMHead)
+from .modeling_auto import (AutoConfig, AutoModel)
 
 from .modeling_bert import (BertConfig, BertModel, BertForPreTraining,
                        BertForMaskedLM, BertForNextSentencePrediction,
diff --git a/pytorch_transformers/modeling_auto.py b/pytorch_transformers/modeling_auto.py
index 22a35090aa..64b151e3a3 100644
--- a/pytorch_transformers/modeling_auto.py
+++ b/pytorch_transformers/modeling_auto.py
@@ -234,269 +234,3 @@ class AutoModel(object):
         raise ValueError("Unrecognized model identifier in {}. Should contains one of "
                          "'bert', 'openai-gpt', 'gpt2', 'transfo-xl', 'xlnet', "
                          "'xlm'".format(pretrained_model_name_or_path))
-
-
-class DerivedAutoModel(PreTrainedModel):
-    r"""
-        :class:`~pytorch_transformers.DerivedAutoModel` is a base class for building
-        standardized derived models on top of :class:`~pytorch_transformers.AutoModel` by adding heads
-
-        The `from_pretrained()` method take care of using the correct base model class instance
-        using pattern matching on the `pretrained_model_name_or_path` string.
-
-        The base model class to instantiate is selected as the first pattern matching
-        in the `pretrained_model_name_or_path` string (in the following order):
-            - contains `bert`: BertConfig (Bert model)
-            - contains `openai-gpt`: OpenAIGPTConfig (OpenAI GPT model)
-            - contains `gpt2`: GPT2Config (OpenAI GPT-2 model)
-            - contains `transfo-xl`: TransfoXLConfig (Transformer-XL model)
-            - contains `xlnet`: XLNetConfig (XLNet model)
-            - contains `xlm`: XLMConfig (XLM model)
-
-        This class should usually not be instantiated using `__init__()` but `from_pretrained()`.
-    """
-    config_class = None
-    pretrained_model_archive_map = {}
-    load_tf_weights = lambda model, config, path: None
-    base_model_prefix = "transformer"
-
-    def __init__(self, base_model):
-        super(DerivedAutoModel, self).__init__(base_model.config)
-        self.transformer = base_model
-
-    def init_weights(self, module):
-        """ Initialize the weights. Use the base model initialization function.
-        """
-        self.transformer.init_weights(module)
-
-    @classmethod
-    def from_pretrained(cls, pretrained_model_name_or_path, *model_args, **kwargs):
-        r""" Instantiate a :class:`~pytorch_transformers.DerivedAutoModel` with one of the base model classes of the library
-        from a pre-trained model configuration.
-
-        The base model class to instantiate is selected as the first pattern matching
-        in the `pretrained_model_name_or_path` string (in the following order):
-            - contains `bert`: BertConfig (Bert model)
-            - contains `openai-gpt`: OpenAIGPTConfig (OpenAI GPT model)
-            - contains `gpt2`: GPT2Config (OpenAI GPT-2 model)
-            - contains `transfo-xl`: TransfoXLConfig (Transformer-XL model)
-            - contains `xlnet`: XLNetConfig (XLNet model)
-            - contains `xlm`: XLMConfig (XLM model)
-
-            The model is set in evaluation mode by default using `model.eval()` (Dropout modules are deactivated)
-            To train the model, you should first set it back in training mode with `model.train()`
-
-        Params:
-            **pretrained_model_name_or_path**: either:
-                - a string with the `shortcut name` of a pre-trained model to load from cache
-                    or download and cache if not already stored in cache (e.g. 'bert-base-uncased').
-                - a path to a `directory` containing a configuration file saved
-                    using the `save_pretrained(save_directory)` method.
-                - a path or url to a tensorflow index checkpoint `file` (e.g. `./tf_model/model.ckpt.index`).
-                    In this case, ``from_tf`` should be set to True and a configuration object should be
-                    provided as `config` argument. This loading option is slower than converting the TensorFlow
-                    checkpoint in a PyTorch model using the provided conversion scripts and loading
-                    the PyTorch model afterwards.
-            **model_args**: (`optional`) Sequence:
-                All remaning positional arguments will be passed to the underlying model's __init__ function
-            **config**: an optional configuration for the model to use instead of an automatically loaded configuation.
-                Configuration can be automatically loaded when:
-                - the model is a model provided by the library (loaded with a `shortcut name` of a pre-trained model), or
-                - the model was saved using the `save_pretrained(save_directory)` (loaded by suppling the save directory).
-            **state_dict**: an optional state dictionnary for the model to use instead of a state dictionary loaded
-                from saved weights file.
-                This option can be used if you want to create a model from a pretrained configuration but load your own weights.
-                In this case though, you should check if using `save_pretrained(dir)` and `from_pretrained(save_directory)` is not
-                a simpler option.
-            **cache_dir**: (`optional`) string:
-                Path to a directory in which a downloaded pre-trained model
-                configuration should be cached if the standard cache should not be used.
-            **output_loading_info**: (`optional`) boolean:
-                Set to ``True`` to also return a dictionnary containing missing keys, unexpected keys and error messages.
-            **kwargs**: (`optional`) dict:
-                Dictionary of key, values to update the configuration object after loading.
-                Can be used to override selected configuration parameters. E.g. ``output_attention=True``.
-
-               - If a configuration is provided with `config`, **kwargs will be directly passed
-                 to the underlying model's __init__ method.
-               - If a configuration is not provided, **kwargs will be first passed to the pretrained
-                 model configuration class loading function (`PretrainedConfig.from_pretrained`).
-                 Each key of **kwargs that corresponds to a configuration attribute
-                 will be used to override said attribute with the supplied **kwargs value.
-                 Remaining keys that do not correspond to any configuration attribute will
-                 be passed to the underlying model's __init__ function.
-
-        Examples::
-
-            model = AutoModel.from_pretrained('bert-base-uncased')    # Download model and configuration from S3 and cache.
-            model = AutoModel.from_pretrained('./test/bert_model/')  # E.g. model was saved using `save_pretrained('./test/saved_model/')`
-            model = AutoModel.from_pretrained('bert-base-uncased', output_attention=True)  # Update configuration during loading
-            assert model.config.output_attention == True
-            # Loading from a TF checkpoint file instead of a PyTorch model (slower)
-            config = AutoConfig.from_json_file('./tf_model/bert_tf_model_config.json')
-            model = AutoModel.from_pretrained('./tf_model/bert_tf_checkpoint.ckpt.index', from_tf=True, config=config)
-
-        """
-        if 'bert' in pretrained_model_name_or_path:
-            base_model_class = BertModel
-        elif 'openai-gpt' in pretrained_model_name_or_path:
-            base_model_class = OpenAIGPTModel
-        elif 'gpt2' in pretrained_model_name_or_path:
-            base_model_class = GPT2Model
-        elif 'transfo-xl' in pretrained_model_name_or_path:
-            base_model_class = TransfoXLModel
-        elif 'xlnet' in pretrained_model_name_or_path:
-            base_model_class = XLNetModel
-        elif 'xlm' in pretrained_model_name_or_path:
-            base_model_class = XLMModel
-        else:
-            raise ValueError("Unrecognized model identifier in {}. Should contains one of "
-                            "'bert', 'openai-gpt', 'gpt2', 'transfo-xl', 'xlnet', "
-                            "'xlm'".format(pretrained_model_name_or_path))
-
-        # Get a pretrained base_model
-        base_model = base_model_class.from_pretrained(pretrained_model_name_or_path, *model_args, **kwargs)
-
-        # Create our derived model
-        model = cls(base_model)
-
-        # Setup class attribute from the base model class
-        model.config_class = base_model.config_class
-        model.pretrained_model_archive_map = base_model.pretrained_model_archive_map
-        model.load_tf_weights = base_model.load_tf_weights
-
-        return model
-
-
-class AutoModelWithLMHead(DerivedAutoModel):
-    r"""
-        :class:`~pytorch_transformers.AutoModelWithLMHead` is a base class for language modeling
-        that contains
-        
-            - a base model instantiated as one of the base model classes of the library when created with the `AutoModel.from_pretrained(pretrained_model_name_or_path)` class method, and .
-            - a language modeling head on top of the base model.
-
-        The `from_pretrained()` method take care of using the correct base model class instance
-        using pattern matching on the `pretrained_model_name_or_path` string.
-
-        The base model class to instantiate is selected as the first pattern matching
-        in the `pretrained_model_name_or_path` string (in the following order):
-            - contains `bert`: BertConfig (Bert model)
-            - contains `openai-gpt`: OpenAIGPTConfig (OpenAI GPT model)
-            - contains `gpt2`: GPT2Config (OpenAI GPT-2 model)
-            - contains `transfo-xl`: TransfoXLConfig (Transformer-XL model)
-            - contains `xlnet`: XLNetConfig (XLNet model)
-            - contains `xlm`: XLMConfig (XLM model)
-
-        This class should usually not be instantiated using `__init__()` but `from_pretrained()`.
-    """
-
-    def __init__(self, base_model):
-        super(AutoModelWithLMHead, self).__init__(base_model)
-        config = base_model.config
-
-        self.lm_head = nn.Linear(config.hidden_size, config.vocab_size, bias=False)
-
-        self.apply(self.init_weights)
-        self.tie_weights()
-
-    def tie_weights(self):
-        """ Make sure we are sharing the input and output embeddings.
-            Export to TorchScript can't handle parameter sharing so we are cloning them instead.
-        """
-        # get input embeddings - whatever the model is
-        input_embeddings = self.transformer.resize_token_embeddings(new_num_tokens=None)
-
-        # tie of clone (torchscript) embeddings
-        self._tie_or_clone_weights(self.lm_head, input_embeddings)
-
-    def forward(self, input_ids, **kwargs):
-        labels = kwargs.pop('labels', None)  # Python 2 compatibility...
-
-        transformer_outputs = self.transformer(input_ids, **kwargs)
-        hidden_states = transformer_outputs[0]
-
-        lm_logits = self.lm_head(hidden_states)
-
-        outputs = (lm_logits,) + transformer_outputs[1:]
-        if labels is not None:
-            loss_fct = CrossEntropyLoss(ignore_index=-1)
-            loss = loss_fct(lm_logits.view(-1, lm_logits.size(-1)),
-                            labels.view(-1))
-            outputs = (loss,) + outputs
-
-        return outputs  # (loss), lm_logits, presents, (all hidden_states), (attentions)
-
-
-AUTO_MODEL_SEQUENCE_SUMMARY_DEFAULTS = {
-     'num_labels': 2,
-     'summary_type': 'first',
-     'summary_use_proj': True,
-     'summary_activation': None,
-     'summary_proj_to_labels': True,
-     'summary_first_dropout': 0.1
-}
-
-
-
-class AutoModelForSequenceClassification(DerivedAutoModel):
-    r"""
-        :class:`~pytorch_transformers.AutoModelForSequenceClassification` is a class for sequence classification
-        that contains
-        
-            - a base model instantiated as one of the base model classes of the library when created with the `AutoModel.from_pretrained(pretrained_model_name_or_path)` class method, and .
-            - a classification head on top of the base model.
-
-        The `from_pretrained()` method take care of using the correct base model class instance
-        using pattern matching on the `pretrained_model_name_or_path` string.
-
-        The base model class to instantiate is selected as the first pattern matching
-        in the `pretrained_model_name_or_path` string (in the following order):
-            - contains `bert`: BertConfig (Bert model)
-            - contains `openai-gpt`: OpenAIGPTConfig (OpenAI GPT model)
-            - contains `gpt2`: GPT2Config (OpenAI GPT-2 model)
-            - contains `transfo-xl`: TransfoXLConfig (Transformer-XL model)
-            - contains `xlnet`: XLNetConfig (XLNet model)
-            - contains `xlm`: XLMConfig (XLM model)
-
-        This class should usually not be instantiated using `__init__()` but `from_pretrained()`.
-    """
-
-    def __init__(self, base_model):
-        super(AutoModelForSequenceClassification, self).__init__(base_model)
-        # Complete configuration with defaults if necessary
-        config = base_model.config
-        for key, value in AUTO_MODEL_SEQUENCE_SUMMARY_DEFAULTS.items():
-            if not hasattr(config, key):
-                setattr(config, key, value)
-
-        # Update base model and derived model config
-        self.transformer.config = config
-        self.config = config
-
-        self.num_labels = config.num_labels
-        self.sequence_summary = SequenceSummary(config)
-
-        self.apply(self.init_weights)
-
-    def forward(self, input_ids, cls_index, **kwargs):
-        labels = kwargs.pop('labels', None)  # Python 2 compatibility...
-
-        transformer_outputs = self.transformer(input_ids, **kwargs)
-
-        output = transformer_outputs[0]
-        logits = self.sequence_summary(output, cls_index=cls_index)
-
-        outputs = (logits,) + transformer_outputs[1:]  # Keep new_mems and attention/hidden states if they are here
-
-        if labels is not None:
-            if self.num_labels == 1:
-                #  We are doing regression
-                loss_fct = MSELoss()
-                loss = loss_fct(logits.view(-1), labels.view(-1))
-            else:
-                loss_fct = CrossEntropyLoss()
-                loss = loss_fct(logits.view(-1, self.num_labels), labels.view(-1))
-            outputs = (loss,) + outputs
-
-        return outputs
diff --git a/pytorch_transformers/tests/modeling_auto_test.py b/pytorch_transformers/tests/modeling_auto_test.py
index 07042a255c..d0c830abc7 100644
--- a/pytorch_transformers/tests/modeling_auto_test.py
+++ b/pytorch_transformers/tests/modeling_auto_test.py
@@ -21,7 +21,7 @@ import shutil
 import pytest
 import logging
 
-from pytorch_transformers import AutoConfig, BertConfig, AutoModel, BertModel, AutoModelForSequenceClassification, AutoModelWithLMHead
+from pytorch_transformers import AutoConfig, BertConfig, AutoModel, BertModel
 from pytorch_transformers.modeling_bert import BERT_PRETRAINED_MODEL_ARCHIVE_MAP
 
 from .modeling_common_test import (CommonTestCases, ConfigTester, ids_tensor)
@@ -42,14 +42,6 @@ class AutoModelTest(unittest.TestCase):
             for value in loading_info.values():
                 self.assertEqual(len(value), 0)
 
-            model = AutoModelForSequenceClassification.from_pretrained(model_name)
-            self.assertIsNotNone(model)
-            self.assertIsInstance(getattr(model, model.base_model_prefix), BertModel)
-
-            model = AutoModelWithLMHead.from_pretrained(model_name)
-            self.assertIsNotNone(model)
-            self.assertIsInstance(getattr(model, model.base_model_prefix), BertModel)
-
 
 if __name__ == "__main__":
     unittest.main()

From beb03ec6c56e12b87fd94b97a36221b976b65651 Mon Sep 17 00:00:00 2001
From: wangfei <1140554608@qq.com>
Date: Tue, 6 Aug 2019 11:15:57 +0800
Subject: [PATCH 062/200] Fix examples of loading pretrained models in
 docstring

---
 pytorch_transformers/modeling_bert.py       | 107 +++++++++-----------
 pytorch_transformers/modeling_gpt2.py       |  37 ++++---
 pytorch_transformers/modeling_openai.py     |  37 ++++---
 pytorch_transformers/modeling_transfo_xl.py |  22 ++--
 pytorch_transformers/modeling_xlm.py        |  52 +++++-----
 pytorch_transformers/modeling_xlnet.py      |  62 +++++-------
 6 files changed, 141 insertions(+), 176 deletions(-)

diff --git a/pytorch_transformers/modeling_bert.py b/pytorch_transformers/modeling_bert.py
index 3f2e7cbda1..6e2df0d2fa 100644
--- a/pytorch_transformers/modeling_bert.py
+++ b/pytorch_transformers/modeling_bert.py
@@ -643,12 +643,11 @@ class BertModel(BertPreTrainedModel):
 
     Examples::
 
-        config = BertConfig.from_pretrained('bert-base-uncased')
-        tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
-        model = BertModel(config)
-        input_ids = torch.tensor(tokenizer.encode("Hello, my dog is cute")).unsqueeze(0)  # Batch size 1
-        outputs = model(input_ids)
-        last_hidden_states = outputs[0]  # The last hidden-state is the first element of the output tuple
+        >>> tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
+        >>> model = BertModel.from_pretrained('bert-base-uncased')
+        >>> input_ids = torch.tensor(tokenizer.encode("Hello, my dog is cute")).unsqueeze(0)  # Batch size 1
+        >>> outputs = model(input_ids)
+        >>> last_hidden_states = outputs[0]  # The last hidden-state is the first element of the output tuple
 
     """
     def __init__(self, config):
@@ -754,13 +753,11 @@ class BertForPreTraining(BertPreTrainedModel):
 
     Examples::
 
-        config = BertConfig.from_pretrained('bert-base-uncased')
-        tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
-        
-        model = BertForPreTraining(config)
-        input_ids = torch.tensor(tokenizer.encode("Hello, my dog is cute")).unsqueeze(0)  # Batch size 1
-        outputs = model(input_ids)
-        prediction_scores, seq_relationship_scores = outputs[:2]
+        >>> tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
+        >>> model = BertForPreTraining.from_pretrained('bert-base-uncased')
+        >>> input_ids = torch.tensor(tokenizer.encode("Hello, my dog is cute")).unsqueeze(0)  # Batch size 1
+        >>> outputs = model(input_ids)
+        >>> prediction_scores, seq_relationship_scores = outputs[:2]
 
     """
     def __init__(self, config):
@@ -824,13 +821,11 @@ class BertForMaskedLM(BertPreTrainedModel):
 
     Examples::
 
-        config = BertConfig.from_pretrained('bert-base-uncased')
-        tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
-        
-        model = BertForMaskedLM(config)
-        input_ids = torch.tensor(tokenizer.encode("Hello, my dog is cute")).unsqueeze(0)  # Batch size 1
-        outputs = model(input_ids, masked_lm_labels=input_ids)
-        loss, prediction_scores = outputs[:2]
+        >>> tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
+        >>> model = BertForMaskedLM.from_pretrained('bert-base-uncased')
+        >>> input_ids = torch.tensor(tokenizer.encode("Hello, my dog is cute")).unsqueeze(0)  # Batch size 1
+        >>> outputs = model(input_ids, masked_lm_labels=input_ids)
+        >>> loss, prediction_scores = outputs[:2]
 
     """
     def __init__(self, config):
@@ -891,13 +886,11 @@ class BertForNextSentencePrediction(BertPreTrainedModel):
 
     Examples::
 
-        config = BertConfig.from_pretrained('bert-base-uncased')
-        tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
-        
-        model = BertForNextSentencePrediction(config)
-        input_ids = torch.tensor(tokenizer.encode("Hello, my dog is cute")).unsqueeze(0)  # Batch size 1
-        outputs = model(input_ids)
-        seq_relationship_scores = outputs[0]
+        >>> tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
+        >>> model = BertForNextSentencePrediction.from_pretrained('bert-base-uncased')
+        >>> input_ids = torch.tensor(tokenizer.encode("Hello, my dog is cute")).unsqueeze(0)  # Batch size 1
+        >>> outputs = model(input_ids)
+        >>> seq_relationship_scores = outputs[0]
 
     """
     def __init__(self, config):
@@ -951,14 +944,12 @@ class BertForSequenceClassification(BertPreTrainedModel):
 
     Examples::
 
-        config = BertConfig.from_pretrained('bert-base-uncased')
-        tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
-        
-        model = BertForSequenceClassification(config)
-        input_ids = torch.tensor(tokenizer.encode("Hello, my dog is cute")).unsqueeze(0)  # Batch size 1
-        labels = torch.tensor([1]).unsqueeze(0)  # Batch size 1
-        outputs = model(input_ids, labels=labels)
-        loss, logits = outputs[:2]
+        >>> tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
+        >>> model = BertForSequenceClassification.from_pretrained('bert-base-uncased')
+        >>> input_ids = torch.tensor(tokenizer.encode("Hello, my dog is cute")).unsqueeze(0)  # Batch size 1
+        >>> labels = torch.tensor([1]).unsqueeze(0)  # Batch size 1
+        >>> outputs = model(input_ids, labels=labels)
+        >>> loss, logits = outputs[:2]
 
     """
     def __init__(self, config):
@@ -1057,15 +1048,13 @@ class BertForMultipleChoice(BertPreTrainedModel):
 
     Examples::
 
-        config = BertConfig.from_pretrained('bert-base-uncased')
-        tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
-        
-        model = BertForMultipleChoice(config)
-        choices = ["Hello, my dog is cute", "Hello, my cat is amazing"]
-        input_ids = torch.tensor([tokenizer.encode(s) for s in choices]).unsqueeze(0)  # Batch size 1, 2 choices
-        labels = torch.tensor(1).unsqueeze(0)  # Batch size 1
-        outputs = model(input_ids, labels=labels)
-        loss, classification_scores = outputs[:2]
+        >>> tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
+        >>> model = BertForMultipleChoice.from_pretrained('bert-base-uncased')
+        >>> choices = ["Hello, my dog is cute", "Hello, my cat is amazing"]
+        >>> input_ids = torch.tensor([tokenizer.encode(s) for s in choices]).unsqueeze(0)  # Batch size 1, 2 choices
+        >>> labels = torch.tensor(1).unsqueeze(0)  # Batch size 1
+        >>> outputs = model(input_ids, labels=labels)
+        >>> loss, classification_scores = outputs[:2]
 
     """
     def __init__(self, config):
@@ -1127,14 +1116,12 @@ class BertForTokenClassification(BertPreTrainedModel):
 
     Examples::
 
-        config = BertConfig.from_pretrained('bert-base-uncased')
-        tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
-        
-        model = BertForTokenClassification(config)
-        input_ids = torch.tensor(tokenizer.encode("Hello, my dog is cute")).unsqueeze(0)  # Batch size 1
-        labels = torch.tensor([1] * input_ids.size(1)).unsqueeze(0)  # Batch size 1
-        outputs = model(input_ids, labels=labels)
-        loss, scores = outputs[:2]
+        >>> tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
+        >>> model = BertForTokenClassification.from_pretrained('bert-base-uncased')
+        >>> input_ids = torch.tensor(tokenizer.encode("Hello, my dog is cute")).unsqueeze(0)  # Batch size 1
+        >>> labels = torch.tensor([1] * input_ids.size(1)).unsqueeze(0)  # Batch size 1
+        >>> outputs = model(input_ids, labels=labels)
+        >>> loss, scores = outputs[:2]
 
     """
     def __init__(self, config):
@@ -1203,15 +1190,13 @@ class BertForQuestionAnswering(BertPreTrainedModel):
 
     Examples::
 
-        config = BertConfig.from_pretrained('bert-base-uncased')
-        tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
-        
-        model = BertForQuestionAnswering(config)
-        input_ids = torch.tensor(tokenizer.encode("Hello, my dog is cute")).unsqueeze(0)  # Batch size 1
-        start_positions = torch.tensor([1])
-        end_positions = torch.tensor([3])
-        outputs = model(input_ids, start_positions=start_positions, end_positions=end_positions)
-        loss, start_scores, end_scores = outputs[:2]
+        >>> tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
+        >>> model = BertForQuestionAnswering.from_pretrained('bert-base-uncased')
+        >>> input_ids = torch.tensor(tokenizer.encode("Hello, my dog is cute")).unsqueeze(0)  # Batch size 1
+        >>> start_positions = torch.tensor([1])
+        >>> end_positions = torch.tensor([3])
+        >>> outputs = model(input_ids, start_positions=start_positions, end_positions=end_positions)
+        >>> loss, start_scores, end_scores = outputs[:2]
 
     """
     def __init__(self, config):
diff --git a/pytorch_transformers/modeling_gpt2.py b/pytorch_transformers/modeling_gpt2.py
index 5268c5de7d..9800b6658f 100644
--- a/pytorch_transformers/modeling_gpt2.py
+++ b/pytorch_transformers/modeling_gpt2.py
@@ -433,12 +433,11 @@ class GPT2Model(GPT2PreTrainedModel):
 
     Examples::
 
-        config = GPT2Config.from_pretrained('gpt2')
-        tokenizer = GPT2Tokenizer.from_pretrained('gpt2')
-        model = GPT2Model(config)
-        input_ids = torch.tensor(tokenizer.encode("Hello, my dog is cute")).unsqueeze(0)  # Batch size 1
-        outputs = model(input_ids)
-        last_hidden_states = outputs[0]  # The last hidden-state is the first element of the output tuple
+        >>> tokenizer = GPT2Tokenizer.from_pretrained('gpt2')
+        >>> model = GPT2Model.from_pretrained('gpt2')
+        >>> input_ids = torch.tensor(tokenizer.encode("Hello, my dog is cute")).unsqueeze(0)  # Batch size 1
+        >>> outputs = model(input_ids)
+        >>> last_hidden_states = outputs[0]  # The last hidden-state is the first element of the output tuple
 
     """
     def __init__(self, config):
@@ -567,12 +566,11 @@ class GPT2LMHeadModel(GPT2PreTrainedModel):
 
     Examples::
 
-        config = GPT2Config.from_pretrained('gpt2')
-        tokenizer = GPT2Tokenizer.from_pretrained('gpt2')
-        model = GPT2LMHeadModel(config)
-        input_ids = torch.tensor(tokenizer.encode("Hello, my dog is cute")).unsqueeze(0)  # Batch size 1
-        outputs = model(input_ids, labels=input_ids)
-        loss, logits = outputs[:2]
+        >>> tokenizer = GPT2Tokenizer.from_pretrained('gpt2')
+        >>> model = GPT2LMHeadModel.from_pretrained('gpt2')
+        >>> input_ids = torch.tensor(tokenizer.encode("Hello, my dog is cute")).unsqueeze(0)  # Batch size 1
+        >>> outputs = model(input_ids, labels=input_ids)
+        >>> loss, logits = outputs[:2]
 
     """
     def __init__(self, config):
@@ -683,14 +681,13 @@ class GPT2DoubleHeadsModel(GPT2PreTrainedModel):
 
     Examples::
 
-        config = GPT2Config.from_pretrained('gpt2')
-        tokenizer = GPT2Tokenizer.from_pretrained('gpt2')
-        model = GPT2DoubleHeadsModel(config)
-        choices = ["Hello, my dog is cute [CLS]", "Hello, my cat is cute [CLS]"]  # Assume you've added [CLS] to the vocabulary
-        input_ids = torch.tensor([tokenizer.encode(s) for s in choices]).unsqueeze(0)  # Batch size 1, 2 choices
-        mc_token_ids = torch.tensor([-1, -1]).unsqueeze(0)  # Batch size 1
-        outputs = model(input_ids, mc_token_ids)
-        lm_prediction_scores, mc_prediction_scores = outputs[:2]
+        >>> tokenizer = GPT2Tokenizer.from_pretrained('gpt2')
+        >>> model = GPT2DoubleHeadsModel.from_pretrained('gpt2')
+        >>> choices = ["Hello, my dog is cute [CLS]", "Hello, my cat is cute [CLS]"]  # Assume you've added [CLS] to the vocabulary
+        >>> input_ids = torch.tensor([tokenizer.encode(s) for s in choices]).unsqueeze(0)  # Batch size 1, 2 choices
+        >>> mc_token_ids = torch.tensor([-1, -1]).unsqueeze(0)  # Batch size 1
+        >>> outputs = model(input_ids, mc_token_ids)
+        >>> lm_prediction_scores, mc_prediction_scores = outputs[:2]
 
     """
     def __init__(self, config):
diff --git a/pytorch_transformers/modeling_openai.py b/pytorch_transformers/modeling_openai.py
index 187c51c86e..500f455816 100644
--- a/pytorch_transformers/modeling_openai.py
+++ b/pytorch_transformers/modeling_openai.py
@@ -439,12 +439,11 @@ class OpenAIGPTModel(OpenAIGPTPreTrainedModel):
 
     Examples::
 
-        config = OpenAIGPTConfig.from_pretrained('openai-gpt')
-        tokenizer = OpenAIGPTTokenizer.from_pretrained('openai-gpt')
-        model = OpenAIGPTModel(config)
-        input_ids = torch.tensor(tokenizer.encode("Hello, my dog is cute")).unsqueeze(0)  # Batch size 1
-        outputs = model(input_ids)
-        last_hidden_states = outputs[0]  # The last hidden-state is the first element of the output tuple
+        >>> tokenizer = OpenAIGPTTokenizer.from_pretrained('openai-gpt')
+        >>> model = OpenAIGPTModel.from_pretrained('openai-gpt')
+        >>> input_ids = torch.tensor(tokenizer.encode("Hello, my dog is cute")).unsqueeze(0)  # Batch size 1
+        >>> outputs = model(input_ids)
+        >>> last_hidden_states = outputs[0]  # The last hidden-state is the first element of the output tuple
 
     """
     def __init__(self, config):
@@ -558,12 +557,11 @@ class OpenAIGPTLMHeadModel(OpenAIGPTPreTrainedModel):
 
     Examples::
 
-        config = OpenAIGPTConfig.from_pretrained('openai-gpt')
-        tokenizer = OpenAIGPTTokenizer.from_pretrained('openai-gpt')
-        model = OpenAIGPTLMHeadModel(config)
-        input_ids = torch.tensor(tokenizer.encode("Hello, my dog is cute")).unsqueeze(0)  # Batch size 1
-        outputs = model(input_ids, labels=input_ids)
-        loss, logits = outputs[:2]
+        >>> tokenizer = OpenAIGPTTokenizer.from_pretrained('openai-gpt')
+        >>> model = OpenAIGPTLMHeadModel.from_pretrained('openai-gpt')
+        >>> input_ids = torch.tensor(tokenizer.encode("Hello, my dog is cute")).unsqueeze(0)  # Batch size 1
+        >>> outputs = model(input_ids, labels=input_ids)
+        >>> loss, logits = outputs[:2]
 
     """
     def __init__(self, config):
@@ -665,14 +663,13 @@ class OpenAIGPTDoubleHeadsModel(OpenAIGPTPreTrainedModel):
 
     Examples::
 
-        config = OpenAIGPTConfig.from_pretrained('openai-gpt')
-        tokenizer = OpenAIGPTTokenizer.from_pretrained('openai-gpt')
-        model = OpenAIGPTDoubleHeadsModel(config)
-        choices = ["Hello, my dog is cute [CLS]", "Hello, my cat is cute [CLS]"]  # Assume you've added [CLS] to the vocabulary
-        input_ids = torch.tensor([tokenizer.encode(s) for s in choices]).unsqueeze(0)  # Batch size 1, 2 choices
-        mc_token_ids = torch.tensor([-1, -1]).unsqueeze(0)  # Batch size 1
-        outputs = model(input_ids, mc_token_ids)
-        lm_prediction_scores, mc_prediction_scores = outputs[:2]
+        >>> tokenizer = OpenAIGPTTokenizer.from_pretrained('openai-gpt')
+        >>> model = OpenAIGPTDoubleHeadsModel.from_pretrained('openai-gpt')
+        >>> choices = ["Hello, my dog is cute [CLS]", "Hello, my cat is cute [CLS]"]  # Assume you've added [CLS] to the vocabulary
+        >>> input_ids = torch.tensor([tokenizer.encode(s) for s in choices]).unsqueeze(0)  # Batch size 1, 2 choices
+        >>> mc_token_ids = torch.tensor([-1, -1]).unsqueeze(0)  # Batch size 1
+        >>> outputs = model(input_ids, mc_token_ids)
+        >>> lm_prediction_scores, mc_prediction_scores = outputs[:2]
 
     """
     def __init__(self, config):
diff --git a/pytorch_transformers/modeling_transfo_xl.py b/pytorch_transformers/modeling_transfo_xl.py
index 7c999edda7..927cc79fe6 100644
--- a/pytorch_transformers/modeling_transfo_xl.py
+++ b/pytorch_transformers/modeling_transfo_xl.py
@@ -968,12 +968,11 @@ class TransfoXLModel(TransfoXLPreTrainedModel):
 
     Examples::
 
-        config = TransfoXLConfig.from_pretrained('transfo-xl-wt103')
-        tokenizer = TransfoXLTokenizer.from_pretrained('transfo-xl-wt103')
-        model = TransfoXLModel(config)
-        input_ids = torch.tensor(tokenizer.encode("Hello, my dog is cute")).unsqueeze(0)  # Batch size 1
-        outputs = model(input_ids)
-        last_hidden_states, mems = outputs[:2]
+        >>> tokenizer = TransfoXLTokenizer.from_pretrained('transfo-xl-wt103')
+        >>> model = TransfoXLModel.from_pretrained('transfo-xl-wt103')
+        >>> input_ids = torch.tensor(tokenizer.encode("Hello, my dog is cute")).unsqueeze(0)  # Batch size 1
+        >>> outputs = model(input_ids)
+        >>> last_hidden_states, mems = outputs[:2]
 
     """
     def __init__(self, config):
@@ -1284,12 +1283,11 @@ class TransfoXLLMHeadModel(TransfoXLPreTrainedModel):
 
     Examples::
 
-        config = TransfoXLConfig.from_pretrained('transfo-xl-wt103')
-        tokenizer = TransfoXLTokenizer.from_pretrained('transfo-xl-wt103')
-        model = TransfoXLLMHeadModel(config)
-        input_ids = torch.tensor(tokenizer.encode("Hello, my dog is cute")).unsqueeze(0)  # Batch size 1
-        outputs = model(input_ids)
-        prediction_scores, mems = outputs[:2]
+        >>> tokenizer = TransfoXLTokenizer.from_pretrained('transfo-xl-wt103')
+        >>> model = TransfoXLLMHeadModel.from_pretrained('transfo-xl-wt103')
+        >>> input_ids = torch.tensor(tokenizer.encode("Hello, my dog is cute")).unsqueeze(0)  # Batch size 1
+        >>> outputs = model(input_ids)
+        >>> prediction_scores, mems = outputs[:2]
 
     """
     def __init__(self, config):
diff --git a/pytorch_transformers/modeling_xlm.py b/pytorch_transformers/modeling_xlm.py
index 7325ff7875..ddf5fee328 100644
--- a/pytorch_transformers/modeling_xlm.py
+++ b/pytorch_transformers/modeling_xlm.py
@@ -472,12 +472,11 @@ class XLMModel(XLMPreTrainedModel):
 
     Examples::
 
-        config = XLMConfig.from_pretrained('xlm-mlm-en-2048')
-        tokenizer = XLMTokenizer.from_pretrained('xlm-mlm-en-2048')
-        model = XLMModel(config)
-        input_ids = torch.tensor(tokenizer.encode("Hello, my dog is cute")).unsqueeze(0)  # Batch size 1
-        outputs = model(input_ids)
-        last_hidden_states = outputs[0]  # The last hidden-state is the first element of the output tuple
+        >>> tokenizer = XLMTokenizer.from_pretrained('xlm-mlm-en-2048')
+        >>> model = XLMModel.from_pretrained('xlm-mlm-en-2048')
+        >>> input_ids = torch.tensor(tokenizer.encode("Hello, my dog is cute")).unsqueeze(0)  # Batch size 1
+        >>> outputs = model(input_ids)
+        >>> last_hidden_states = outputs[0]  # The last hidden-state is the first element of the output tuple
 
     """
     ATTRIBUTES = ['encoder', 'eos_index', 'pad_index',  # 'with_output', 
@@ -745,12 +744,11 @@ class XLMWithLMHeadModel(XLMPreTrainedModel):
 
     Examples::
 
-        config = XLMConfig.from_pretrained('xlm-mlm-en-2048')
-        tokenizer = XLMTokenizer.from_pretrained('xlm-mlm-en-2048')
-        model = XLMWithLMHeadModel(config)
-        input_ids = torch.tensor(tokenizer.encode("Hello, my dog is cute")).unsqueeze(0)  # Batch size 1
-        outputs = model(input_ids)
-        last_hidden_states = outputs[0]  # The last hidden-state is the first element of the output tuple
+        >>> tokenizer = XLMTokenizer.from_pretrained('xlm-mlm-en-2048')
+        >>> model = XLMWithLMHeadModel.from_pretrained('xlm-mlm-en-2048')
+        >>> input_ids = torch.tensor(tokenizer.encode("Hello, my dog is cute")).unsqueeze(0)  # Batch size 1
+        >>> outputs = model(input_ids)
+        >>> last_hidden_states = outputs[0]  # The last hidden-state is the first element of the output tuple
 
     """
     def __init__(self, config):
@@ -805,14 +803,12 @@ class XLMForSequenceClassification(XLMPreTrainedModel):
 
     Examples::
 
-        config = XLMConfig.from_pretrained('xlm-mlm-en-2048')
-        tokenizer = XLMTokenizer.from_pretrained('xlm-mlm-en-2048')
-        
-        model = XLMForSequenceClassification(config)
-        input_ids = torch.tensor(tokenizer.encode("Hello, my dog is cute")).unsqueeze(0)  # Batch size 1
-        labels = torch.tensor([1]).unsqueeze(0)  # Batch size 1
-        outputs = model(input_ids, labels=labels)
-        loss, logits = outputs[:2]
+        >>> tokenizer = XLMTokenizer.from_pretrained('xlm-mlm-en-2048')
+        >>> model = XLMForSequenceClassification.from_pretrained('xlm-mlm-en-2048')
+        >>> input_ids = torch.tensor(tokenizer.encode("Hello, my dog is cute")).unsqueeze(0)  # Batch size 1
+        >>> labels = torch.tensor([1]).unsqueeze(0)  # Batch size 1
+        >>> outputs = model(input_ids, labels=labels)
+        >>> loss, logits = outputs[:2]
 
     """
     def __init__(self, config):
@@ -885,15 +881,13 @@ class XLMForQuestionAnswering(XLMPreTrainedModel):
 
     Examples::
 
-        config = XLMConfig.from_pretrained('xlm-mlm-en-2048')
-        tokenizer = XLMTokenizer.from_pretrained('xlm-mlm-en-2048')
-        
-        model = XLMForQuestionAnswering(config)
-        input_ids = torch.tensor(tokenizer.encode("Hello, my dog is cute")).unsqueeze(0)  # Batch size 1
-        start_positions = torch.tensor([1])
-        end_positions = torch.tensor([3])
-        outputs = model(input_ids, start_positions=start_positions, end_positions=end_positions)
-        loss, start_scores, end_scores = outputs[:2]
+        >>> tokenizer = XLMTokenizer.from_pretrained('xlm-mlm-en-2048')
+        >>> model = XLMForQuestionAnswering.from_pretrained('xlm-mlm-en-2048')
+        >>> input_ids = torch.tensor(tokenizer.encode("Hello, my dog is cute")).unsqueeze(0)  # Batch size 1
+        >>> start_positions = torch.tensor([1])
+        >>> end_positions = torch.tensor([3])
+        >>> outputs = model(input_ids, start_positions=start_positions, end_positions=end_positions)
+        >>> loss, start_scores, end_scores = outputs[:2]
 
     """
     def __init__(self, config):
diff --git a/pytorch_transformers/modeling_xlnet.py b/pytorch_transformers/modeling_xlnet.py
index 9c1752eb74..5b3e049ddf 100644
--- a/pytorch_transformers/modeling_xlnet.py
+++ b/pytorch_transformers/modeling_xlnet.py
@@ -712,12 +712,11 @@ class XLNetModel(XLNetPreTrainedModel):
 
     Examples::
 
-        config = XLNetConfig.from_pretrained('xlnet-large-cased')
-        tokenizer = XLNetTokenizer.from_pretrained('xlnet-large-cased')
-        model = XLNetModel(config)
-        input_ids = torch.tensor(tokenizer.encode("Hello, my dog is cute")).unsqueeze(0)  # Batch size 1
-        outputs = model(input_ids)
-        last_hidden_states = outputs[0]  # The last hidden-state is the first element of the output tuple
+        >>> tokenizer = XLNetTokenizer.from_pretrained('xlnet-large-cased')
+        >>> model = XLNetModel.from_pretrained('xlnet-large-cased')
+        >>> input_ids = torch.tensor(tokenizer.encode("Hello, my dog is cute")).unsqueeze(0)  # Batch size 1
+        >>> outputs = model(input_ids)
+        >>> last_hidden_states = outputs[0]  # The last hidden-state is the first element of the output tuple
 
     """
     def __init__(self, config):
@@ -1019,17 +1018,16 @@ class XLNetLMHeadModel(XLNetPreTrainedModel):
 
     Examples::
 
-        config = XLNetConfig.from_pretrained('xlnet-large-cased')
-        tokenizer = XLNetTokenizer.from_pretrained('xlnet-large-cased')
-        model = XLNetLMHeadModel(config)
-        # We show how to setup inputs to predict a next token using a bi-directional context.
-        input_ids = torch.tensor(tokenizer.encode("Hello, my dog is very <mask>")).unsqueeze(0)  # We will predict the masked token
-        perm_mask = torch.zeros((1, input_ids.shape[1], input_ids.shape[1]), dtype=torch.float)
-        perm_mask[:, :, -1] = 1.0  # Previous tokens don't see last token
-        target_mapping = torch.zeros((1, 1, input_ids.shape[1]), dtype=torch.float)  # Shape [1, 1, seq_length] => let's predict one token
-        target_mapping[0, 0, -1] = 1.0  # Our first (and only) prediction will be the last token of the sequence (the masked token)
-        outputs = model(input_ids, perm_mask=perm_mask, target_mapping=target_mapping)
-        next_token_logits = outputs[0]  # Output has shape [target_mapping.size(0), target_mapping.size(1), config.vocab_size]
+        >>> tokenizer = XLNetTokenizer.from_pretrained('xlnet-large-cased')
+        >>> model = XLNetLMHeadModel.from_pretrained('xlnet-large-cased')
+        >>> # We show how to setup inputs to predict a next token using a bi-directional context.
+        >>> input_ids = torch.tensor(tokenizer.encode("Hello, my dog is very <mask>")).unsqueeze(0)  # We will predict the masked token
+        >>> perm_mask = torch.zeros((1, input_ids.shape[1], input_ids.shape[1]), dtype=torch.float)
+        >>> perm_mask[:, :, -1] = 1.0  # Previous tokens don't see last token
+        >>> target_mapping = torch.zeros((1, 1, input_ids.shape[1]), dtype=torch.float)  # Shape [1, 1, seq_length] => let's predict one token
+        >>> target_mapping[0, 0, -1] = 1.0  # Our first (and only) prediction will be the last token of the sequence (the masked token)
+        >>> outputs = model(input_ids, perm_mask=perm_mask, target_mapping=target_mapping)
+        >>> next_token_logits = outputs[0]  # Output has shape [target_mapping.size(0), target_mapping.size(1), config.vocab_size]
 
     """
     def __init__(self, config):
@@ -1100,14 +1098,12 @@ class XLNetForSequenceClassification(XLNetPreTrainedModel):
 
     Examples::
 
-        config = XLNetConfig.from_pretrained('xlnet-large-cased')
-        tokenizer = XLNetTokenizer.from_pretrained('xlnet-large-cased')
-        
-        model = XLNetForSequenceClassification(config)
-        input_ids = torch.tensor(tokenizer.encode("Hello, my dog is cute")).unsqueeze(0)  # Batch size 1
-        labels = torch.tensor([1]).unsqueeze(0)  # Batch size 1
-        outputs = model(input_ids, labels=labels)
-        loss, logits = outputs[:2]
+        >>> tokenizer = XLNetTokenizer.from_pretrained('xlnet-large-cased')
+        >>> model = XLNetForSequenceClassification.from_pretrained('xlnet-large-cased')
+        >>> input_ids = torch.tensor(tokenizer.encode("Hello, my dog is cute")).unsqueeze(0)  # Batch size 1
+        >>> labels = torch.tensor([1]).unsqueeze(0)  # Batch size 1
+        >>> outputs = model(input_ids, labels=labels)
+        >>> loss, logits = outputs[:2]
 
     """
     def __init__(self, config):
@@ -1200,15 +1196,13 @@ class XLNetForQuestionAnswering(XLNetPreTrainedModel):
 
     Examples::
 
-        config = XLMConfig.from_pretrained('xlm-mlm-en-2048')
-        tokenizer = XLMTokenizer.from_pretrained('xlm-mlm-en-2048')
-        
-        model = XLMForQuestionAnswering(config)
-        input_ids = torch.tensor(tokenizer.encode("Hello, my dog is cute")).unsqueeze(0)  # Batch size 1
-        start_positions = torch.tensor([1])
-        end_positions = torch.tensor([3])
-        outputs = model(input_ids, start_positions=start_positions, end_positions=end_positions)
-        loss, start_scores, end_scores = outputs[:2]
+        >>> tokenizer = XLMTokenizer.from_pretrained('xlm-mlm-en-2048')
+        >>> model = XLMForQuestionAnswering.from_pretrained('xlnet-large-cased')
+        >>> input_ids = torch.tensor(tokenizer.encode("Hello, my dog is cute")).unsqueeze(0)  # Batch size 1
+        >>> start_positions = torch.tensor([1])
+        >>> end_positions = torch.tensor([3])
+        >>> outputs = model(input_ids, start_positions=start_positions, end_positions=end_positions)
+        >>> loss, start_scores, end_scores = outputs[:2]
 
     """
     def __init__(self, config):

From f889e77b9c3e8043a30f909f3e4e3c0a016ff6df Mon Sep 17 00:00:00 2001
From: wangfei <1140554608@qq.com>
Date: Tue, 6 Aug 2019 11:30:35 +0800
Subject: [PATCH 063/200] Fix examples of loading pretrained models in
 docstring

---
 pytorch_transformers/modeling_gpt2.py | 34 +++++++++++++--------------
 1 file changed, 17 insertions(+), 17 deletions(-)

diff --git a/pytorch_transformers/modeling_gpt2.py b/pytorch_transformers/modeling_gpt2.py
index 9800b6658f..50cb834400 100644
--- a/pytorch_transformers/modeling_gpt2.py
+++ b/pytorch_transformers/modeling_gpt2.py
@@ -433,11 +433,11 @@ class GPT2Model(GPT2PreTrainedModel):
 
     Examples::
 
-        >>> tokenizer = GPT2Tokenizer.from_pretrained('gpt2')
-        >>> model = GPT2Model.from_pretrained('gpt2')
-        >>> input_ids = torch.tensor(tokenizer.encode("Hello, my dog is cute")).unsqueeze(0)  # Batch size 1
-        >>> outputs = model(input_ids)
-        >>> last_hidden_states = outputs[0]  # The last hidden-state is the first element of the output tuple
+        tokenizer = GPT2Tokenizer.from_pretrained('gpt2')
+        model = GPT2Model.from_pretrained('gpt2')
+        input_ids = torch.tensor(tokenizer.encode("Hello, my dog is cute")).unsqueeze(0)  # Batch size 1
+        outputs = model(input_ids)
+        last_hidden_states = outputs[0]  # The last hidden-state is the first element of the output tuple
 
     """
     def __init__(self, config):
@@ -566,11 +566,11 @@ class GPT2LMHeadModel(GPT2PreTrainedModel):
 
     Examples::
 
-        >>> tokenizer = GPT2Tokenizer.from_pretrained('gpt2')
-        >>> model = GPT2LMHeadModel.from_pretrained('gpt2')
-        >>> input_ids = torch.tensor(tokenizer.encode("Hello, my dog is cute")).unsqueeze(0)  # Batch size 1
-        >>> outputs = model(input_ids, labels=input_ids)
-        >>> loss, logits = outputs[:2]
+        tokenizer = GPT2Tokenizer.from_pretrained('gpt2')
+        model = GPT2LMHeadModel.from_pretrained('gpt2')
+        input_ids = torch.tensor(tokenizer.encode("Hello, my dog is cute")).unsqueeze(0)  # Batch size 1
+        outputs = model(input_ids, labels=input_ids)
+        loss, logits = outputs[:2]
 
     """
     def __init__(self, config):
@@ -681,13 +681,13 @@ class GPT2DoubleHeadsModel(GPT2PreTrainedModel):
 
     Examples::
 
-        >>> tokenizer = GPT2Tokenizer.from_pretrained('gpt2')
-        >>> model = GPT2DoubleHeadsModel.from_pretrained('gpt2')
-        >>> choices = ["Hello, my dog is cute [CLS]", "Hello, my cat is cute [CLS]"]  # Assume you've added [CLS] to the vocabulary
-        >>> input_ids = torch.tensor([tokenizer.encode(s) for s in choices]).unsqueeze(0)  # Batch size 1, 2 choices
-        >>> mc_token_ids = torch.tensor([-1, -1]).unsqueeze(0)  # Batch size 1
-        >>> outputs = model(input_ids, mc_token_ids)
-        >>> lm_prediction_scores, mc_prediction_scores = outputs[:2]
+        tokenizer = GPT2Tokenizer.from_pretrained('gpt2')
+        model = GPT2DoubleHeadsModel.from_pretrained('gpt2')
+        choices = ["Hello, my dog is cute [CLS]", "Hello, my cat is cute [CLS]"]  # Assume you've added [CLS] to the vocabulary
+        input_ids = torch.tensor([tokenizer.encode(s) for s in choices]).unsqueeze(0)  # Batch size 1, 2 choices
+        mc_token_ids = torch.tensor([-1, -1]).unsqueeze(0)  # Batch size 1
+        outputs = model(input_ids, mc_token_ids)
+        lm_prediction_scores, mc_prediction_scores = outputs[:2]
 
     """
     def __init__(self, config):

From 72622926e59056b72cec8e95d4e45ee0927f20aa Mon Sep 17 00:00:00 2001
From: wangfei <1140554608@qq.com>
Date: Tue, 6 Aug 2019 11:32:41 +0800
Subject: [PATCH 064/200] Fix examples in docstring

---
 pytorch_transformers/modeling_transfo_xl.py | 20 ++++++++++----------
 1 file changed, 10 insertions(+), 10 deletions(-)

diff --git a/pytorch_transformers/modeling_transfo_xl.py b/pytorch_transformers/modeling_transfo_xl.py
index 927cc79fe6..cb5416964c 100644
--- a/pytorch_transformers/modeling_transfo_xl.py
+++ b/pytorch_transformers/modeling_transfo_xl.py
@@ -968,11 +968,11 @@ class TransfoXLModel(TransfoXLPreTrainedModel):
 
     Examples::
 
-        >>> tokenizer = TransfoXLTokenizer.from_pretrained('transfo-xl-wt103')
-        >>> model = TransfoXLModel.from_pretrained('transfo-xl-wt103')
-        >>> input_ids = torch.tensor(tokenizer.encode("Hello, my dog is cute")).unsqueeze(0)  # Batch size 1
-        >>> outputs = model(input_ids)
-        >>> last_hidden_states, mems = outputs[:2]
+        tokenizer = TransfoXLTokenizer.from_pretrained('transfo-xl-wt103')
+        model = TransfoXLModel.from_pretrained('transfo-xl-wt103')
+        input_ids = torch.tensor(tokenizer.encode("Hello, my dog is cute")).unsqueeze(0)  # Batch size 1
+        outputs = model(input_ids)
+        last_hidden_states, mems = outputs[:2]
 
     """
     def __init__(self, config):
@@ -1283,11 +1283,11 @@ class TransfoXLLMHeadModel(TransfoXLPreTrainedModel):
 
     Examples::
 
-        >>> tokenizer = TransfoXLTokenizer.from_pretrained('transfo-xl-wt103')
-        >>> model = TransfoXLLMHeadModel.from_pretrained('transfo-xl-wt103')
-        >>> input_ids = torch.tensor(tokenizer.encode("Hello, my dog is cute")).unsqueeze(0)  # Batch size 1
-        >>> outputs = model(input_ids)
-        >>> prediction_scores, mems = outputs[:2]
+        tokenizer = TransfoXLTokenizer.from_pretrained('transfo-xl-wt103')
+        model = TransfoXLLMHeadModel.from_pretrained('transfo-xl-wt103')
+        input_ids = torch.tensor(tokenizer.encode("Hello, my dog is cute")).unsqueeze(0)  # Batch size 1
+        outputs = model(input_ids)
+        prediction_scores, mems = outputs[:2]
 
     """
     def __init__(self, config):

From 6ec1ee9ec28ead1a7c065153df32271ead95b417 Mon Sep 17 00:00:00 2001
From: wangfei <1140554608@qq.com>
Date: Tue, 6 Aug 2019 11:32:54 +0800
Subject: [PATCH 065/200] Fix examples in docstring

---
 pytorch_transformers/modeling_bert.py   | 92 ++++++++++++-------------
 pytorch_transformers/modeling_openai.py | 34 ++++-----
 pytorch_transformers/modeling_xlm.py    | 46 ++++++-------
 pytorch_transformers/modeling_xlnet.py  | 56 +++++++--------
 4 files changed, 114 insertions(+), 114 deletions(-)

diff --git a/pytorch_transformers/modeling_bert.py b/pytorch_transformers/modeling_bert.py
index 6e2df0d2fa..3a6a50d0ed 100644
--- a/pytorch_transformers/modeling_bert.py
+++ b/pytorch_transformers/modeling_bert.py
@@ -643,11 +643,11 @@ class BertModel(BertPreTrainedModel):
 
     Examples::
 
-        >>> tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
-        >>> model = BertModel.from_pretrained('bert-base-uncased')
-        >>> input_ids = torch.tensor(tokenizer.encode("Hello, my dog is cute")).unsqueeze(0)  # Batch size 1
-        >>> outputs = model(input_ids)
-        >>> last_hidden_states = outputs[0]  # The last hidden-state is the first element of the output tuple
+        tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
+        model = BertModel.from_pretrained('bert-base-uncased')
+        input_ids = torch.tensor(tokenizer.encode("Hello, my dog is cute")).unsqueeze(0)  # Batch size 1
+        outputs = model(input_ids)
+        last_hidden_states = outputs[0]  # The last hidden-state is the first element of the output tuple
 
     """
     def __init__(self, config):
@@ -753,11 +753,11 @@ class BertForPreTraining(BertPreTrainedModel):
 
     Examples::
 
-        >>> tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
-        >>> model = BertForPreTraining.from_pretrained('bert-base-uncased')
-        >>> input_ids = torch.tensor(tokenizer.encode("Hello, my dog is cute")).unsqueeze(0)  # Batch size 1
-        >>> outputs = model(input_ids)
-        >>> prediction_scores, seq_relationship_scores = outputs[:2]
+        tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
+        model = BertForPreTraining.from_pretrained('bert-base-uncased')
+        input_ids = torch.tensor(tokenizer.encode("Hello, my dog is cute")).unsqueeze(0)  # Batch size 1
+        outputs = model(input_ids)
+        prediction_scores, seq_relationship_scores = outputs[:2]
 
     """
     def __init__(self, config):
@@ -821,11 +821,11 @@ class BertForMaskedLM(BertPreTrainedModel):
 
     Examples::
 
-        >>> tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
-        >>> model = BertForMaskedLM.from_pretrained('bert-base-uncased')
-        >>> input_ids = torch.tensor(tokenizer.encode("Hello, my dog is cute")).unsqueeze(0)  # Batch size 1
-        >>> outputs = model(input_ids, masked_lm_labels=input_ids)
-        >>> loss, prediction_scores = outputs[:2]
+        tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
+        model = BertForMaskedLM.from_pretrained('bert-base-uncased')
+        input_ids = torch.tensor(tokenizer.encode("Hello, my dog is cute")).unsqueeze(0)  # Batch size 1
+        outputs = model(input_ids, masked_lm_labels=input_ids)
+        loss, prediction_scores = outputs[:2]
 
     """
     def __init__(self, config):
@@ -886,11 +886,11 @@ class BertForNextSentencePrediction(BertPreTrainedModel):
 
     Examples::
 
-        >>> tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
-        >>> model = BertForNextSentencePrediction.from_pretrained('bert-base-uncased')
-        >>> input_ids = torch.tensor(tokenizer.encode("Hello, my dog is cute")).unsqueeze(0)  # Batch size 1
-        >>> outputs = model(input_ids)
-        >>> seq_relationship_scores = outputs[0]
+        tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
+        model = BertForNextSentencePrediction.from_pretrained('bert-base-uncased')
+        input_ids = torch.tensor(tokenizer.encode("Hello, my dog is cute")).unsqueeze(0)  # Batch size 1
+        outputs = model(input_ids)
+        seq_relationship_scores = outputs[0]
 
     """
     def __init__(self, config):
@@ -944,12 +944,12 @@ class BertForSequenceClassification(BertPreTrainedModel):
 
     Examples::
 
-        >>> tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
-        >>> model = BertForSequenceClassification.from_pretrained('bert-base-uncased')
-        >>> input_ids = torch.tensor(tokenizer.encode("Hello, my dog is cute")).unsqueeze(0)  # Batch size 1
-        >>> labels = torch.tensor([1]).unsqueeze(0)  # Batch size 1
-        >>> outputs = model(input_ids, labels=labels)
-        >>> loss, logits = outputs[:2]
+        tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
+        model = BertForSequenceClassification.from_pretrained('bert-base-uncased')
+        input_ids = torch.tensor(tokenizer.encode("Hello, my dog is cute")).unsqueeze(0)  # Batch size 1
+        labels = torch.tensor([1]).unsqueeze(0)  # Batch size 1
+        outputs = model(input_ids, labels=labels)
+        loss, logits = outputs[:2]
 
     """
     def __init__(self, config):
@@ -1048,13 +1048,13 @@ class BertForMultipleChoice(BertPreTrainedModel):
 
     Examples::
 
-        >>> tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
-        >>> model = BertForMultipleChoice.from_pretrained('bert-base-uncased')
-        >>> choices = ["Hello, my dog is cute", "Hello, my cat is amazing"]
-        >>> input_ids = torch.tensor([tokenizer.encode(s) for s in choices]).unsqueeze(0)  # Batch size 1, 2 choices
-        >>> labels = torch.tensor(1).unsqueeze(0)  # Batch size 1
-        >>> outputs = model(input_ids, labels=labels)
-        >>> loss, classification_scores = outputs[:2]
+        tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
+        model = BertForMultipleChoice.from_pretrained('bert-base-uncased')
+        choices = ["Hello, my dog is cute", "Hello, my cat is amazing"]
+        input_ids = torch.tensor([tokenizer.encode(s) for s in choices]).unsqueeze(0)  # Batch size 1, 2 choices
+        labels = torch.tensor(1).unsqueeze(0)  # Batch size 1
+        outputs = model(input_ids, labels=labels)
+        loss, classification_scores = outputs[:2]
 
     """
     def __init__(self, config):
@@ -1116,12 +1116,12 @@ class BertForTokenClassification(BertPreTrainedModel):
 
     Examples::
 
-        >>> tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
-        >>> model = BertForTokenClassification.from_pretrained('bert-base-uncased')
-        >>> input_ids = torch.tensor(tokenizer.encode("Hello, my dog is cute")).unsqueeze(0)  # Batch size 1
-        >>> labels = torch.tensor([1] * input_ids.size(1)).unsqueeze(0)  # Batch size 1
-        >>> outputs = model(input_ids, labels=labels)
-        >>> loss, scores = outputs[:2]
+        tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
+        model = BertForTokenClassification.from_pretrained('bert-base-uncased')
+        input_ids = torch.tensor(tokenizer.encode("Hello, my dog is cute")).unsqueeze(0)  # Batch size 1
+        labels = torch.tensor([1] * input_ids.size(1)).unsqueeze(0)  # Batch size 1
+        outputs = model(input_ids, labels=labels)
+        loss, scores = outputs[:2]
 
     """
     def __init__(self, config):
@@ -1190,13 +1190,13 @@ class BertForQuestionAnswering(BertPreTrainedModel):
 
     Examples::
 
-        >>> tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
-        >>> model = BertForQuestionAnswering.from_pretrained('bert-base-uncased')
-        >>> input_ids = torch.tensor(tokenizer.encode("Hello, my dog is cute")).unsqueeze(0)  # Batch size 1
-        >>> start_positions = torch.tensor([1])
-        >>> end_positions = torch.tensor([3])
-        >>> outputs = model(input_ids, start_positions=start_positions, end_positions=end_positions)
-        >>> loss, start_scores, end_scores = outputs[:2]
+        tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
+        model = BertForQuestionAnswering.from_pretrained('bert-base-uncased')
+        input_ids = torch.tensor(tokenizer.encode("Hello, my dog is cute")).unsqueeze(0)  # Batch size 1
+        start_positions = torch.tensor([1])
+        end_positions = torch.tensor([3])
+        outputs = model(input_ids, start_positions=start_positions, end_positions=end_positions)
+        loss, start_scores, end_scores = outputs[:2]
 
     """
     def __init__(self, config):
diff --git a/pytorch_transformers/modeling_openai.py b/pytorch_transformers/modeling_openai.py
index 500f455816..20faf39972 100644
--- a/pytorch_transformers/modeling_openai.py
+++ b/pytorch_transformers/modeling_openai.py
@@ -439,11 +439,11 @@ class OpenAIGPTModel(OpenAIGPTPreTrainedModel):
 
     Examples::
 
-        >>> tokenizer = OpenAIGPTTokenizer.from_pretrained('openai-gpt')
-        >>> model = OpenAIGPTModel.from_pretrained('openai-gpt')
-        >>> input_ids = torch.tensor(tokenizer.encode("Hello, my dog is cute")).unsqueeze(0)  # Batch size 1
-        >>> outputs = model(input_ids)
-        >>> last_hidden_states = outputs[0]  # The last hidden-state is the first element of the output tuple
+        tokenizer = OpenAIGPTTokenizer.from_pretrained('openai-gpt')
+        model = OpenAIGPTModel.from_pretrained('openai-gpt')
+        input_ids = torch.tensor(tokenizer.encode("Hello, my dog is cute")).unsqueeze(0)  # Batch size 1
+        outputs = model(input_ids)
+        last_hidden_states = outputs[0]  # The last hidden-state is the first element of the output tuple
 
     """
     def __init__(self, config):
@@ -557,11 +557,11 @@ class OpenAIGPTLMHeadModel(OpenAIGPTPreTrainedModel):
 
     Examples::
 
-        >>> tokenizer = OpenAIGPTTokenizer.from_pretrained('openai-gpt')
-        >>> model = OpenAIGPTLMHeadModel.from_pretrained('openai-gpt')
-        >>> input_ids = torch.tensor(tokenizer.encode("Hello, my dog is cute")).unsqueeze(0)  # Batch size 1
-        >>> outputs = model(input_ids, labels=input_ids)
-        >>> loss, logits = outputs[:2]
+        tokenizer = OpenAIGPTTokenizer.from_pretrained('openai-gpt')
+        model = OpenAIGPTLMHeadModel.from_pretrained('openai-gpt')
+        input_ids = torch.tensor(tokenizer.encode("Hello, my dog is cute")).unsqueeze(0)  # Batch size 1
+        outputs = model(input_ids, labels=input_ids)
+        loss, logits = outputs[:2]
 
     """
     def __init__(self, config):
@@ -663,13 +663,13 @@ class OpenAIGPTDoubleHeadsModel(OpenAIGPTPreTrainedModel):
 
     Examples::
 
-        >>> tokenizer = OpenAIGPTTokenizer.from_pretrained('openai-gpt')
-        >>> model = OpenAIGPTDoubleHeadsModel.from_pretrained('openai-gpt')
-        >>> choices = ["Hello, my dog is cute [CLS]", "Hello, my cat is cute [CLS]"]  # Assume you've added [CLS] to the vocabulary
-        >>> input_ids = torch.tensor([tokenizer.encode(s) for s in choices]).unsqueeze(0)  # Batch size 1, 2 choices
-        >>> mc_token_ids = torch.tensor([-1, -1]).unsqueeze(0)  # Batch size 1
-        >>> outputs = model(input_ids, mc_token_ids)
-        >>> lm_prediction_scores, mc_prediction_scores = outputs[:2]
+        tokenizer = OpenAIGPTTokenizer.from_pretrained('openai-gpt')
+        model = OpenAIGPTDoubleHeadsModel.from_pretrained('openai-gpt')
+        choices = ["Hello, my dog is cute [CLS]", "Hello, my cat is cute [CLS]"]  # Assume you've added [CLS] to the vocabulary
+        input_ids = torch.tensor([tokenizer.encode(s) for s in choices]).unsqueeze(0)  # Batch size 1, 2 choices
+        mc_token_ids = torch.tensor([-1, -1]).unsqueeze(0)  # Batch size 1
+        outputs = model(input_ids, mc_token_ids)
+        lm_prediction_scores, mc_prediction_scores = outputs[:2]
 
     """
     def __init__(self, config):
diff --git a/pytorch_transformers/modeling_xlm.py b/pytorch_transformers/modeling_xlm.py
index ddf5fee328..03af828b9d 100644
--- a/pytorch_transformers/modeling_xlm.py
+++ b/pytorch_transformers/modeling_xlm.py
@@ -472,11 +472,11 @@ class XLMModel(XLMPreTrainedModel):
 
     Examples::
 
-        >>> tokenizer = XLMTokenizer.from_pretrained('xlm-mlm-en-2048')
-        >>> model = XLMModel.from_pretrained('xlm-mlm-en-2048')
-        >>> input_ids = torch.tensor(tokenizer.encode("Hello, my dog is cute")).unsqueeze(0)  # Batch size 1
-        >>> outputs = model(input_ids)
-        >>> last_hidden_states = outputs[0]  # The last hidden-state is the first element of the output tuple
+        tokenizer = XLMTokenizer.from_pretrained('xlm-mlm-en-2048')
+        model = XLMModel.from_pretrained('xlm-mlm-en-2048')
+        input_ids = torch.tensor(tokenizer.encode("Hello, my dog is cute")).unsqueeze(0)  # Batch size 1
+        outputs = model(input_ids)
+        last_hidden_states = outputs[0]  # The last hidden-state is the first element of the output tuple
 
     """
     ATTRIBUTES = ['encoder', 'eos_index', 'pad_index',  # 'with_output', 
@@ -744,11 +744,11 @@ class XLMWithLMHeadModel(XLMPreTrainedModel):
 
     Examples::
 
-        >>> tokenizer = XLMTokenizer.from_pretrained('xlm-mlm-en-2048')
-        >>> model = XLMWithLMHeadModel.from_pretrained('xlm-mlm-en-2048')
-        >>> input_ids = torch.tensor(tokenizer.encode("Hello, my dog is cute")).unsqueeze(0)  # Batch size 1
-        >>> outputs = model(input_ids)
-        >>> last_hidden_states = outputs[0]  # The last hidden-state is the first element of the output tuple
+        tokenizer = XLMTokenizer.from_pretrained('xlm-mlm-en-2048')
+        model = XLMWithLMHeadModel.from_pretrained('xlm-mlm-en-2048')
+        input_ids = torch.tensor(tokenizer.encode("Hello, my dog is cute")).unsqueeze(0)  # Batch size 1
+        outputs = model(input_ids)
+        last_hidden_states = outputs[0]  # The last hidden-state is the first element of the output tuple
 
     """
     def __init__(self, config):
@@ -803,12 +803,12 @@ class XLMForSequenceClassification(XLMPreTrainedModel):
 
     Examples::
 
-        >>> tokenizer = XLMTokenizer.from_pretrained('xlm-mlm-en-2048')
-        >>> model = XLMForSequenceClassification.from_pretrained('xlm-mlm-en-2048')
-        >>> input_ids = torch.tensor(tokenizer.encode("Hello, my dog is cute")).unsqueeze(0)  # Batch size 1
-        >>> labels = torch.tensor([1]).unsqueeze(0)  # Batch size 1
-        >>> outputs = model(input_ids, labels=labels)
-        >>> loss, logits = outputs[:2]
+        tokenizer = XLMTokenizer.from_pretrained('xlm-mlm-en-2048')
+        model = XLMForSequenceClassification.from_pretrained('xlm-mlm-en-2048')
+        input_ids = torch.tensor(tokenizer.encode("Hello, my dog is cute")).unsqueeze(0)  # Batch size 1
+        labels = torch.tensor([1]).unsqueeze(0)  # Batch size 1
+        outputs = model(input_ids, labels=labels)
+        loss, logits = outputs[:2]
 
     """
     def __init__(self, config):
@@ -881,13 +881,13 @@ class XLMForQuestionAnswering(XLMPreTrainedModel):
 
     Examples::
 
-        >>> tokenizer = XLMTokenizer.from_pretrained('xlm-mlm-en-2048')
-        >>> model = XLMForQuestionAnswering.from_pretrained('xlm-mlm-en-2048')
-        >>> input_ids = torch.tensor(tokenizer.encode("Hello, my dog is cute")).unsqueeze(0)  # Batch size 1
-        >>> start_positions = torch.tensor([1])
-        >>> end_positions = torch.tensor([3])
-        >>> outputs = model(input_ids, start_positions=start_positions, end_positions=end_positions)
-        >>> loss, start_scores, end_scores = outputs[:2]
+        tokenizer = XLMTokenizer.from_pretrained('xlm-mlm-en-2048')
+        model = XLMForQuestionAnswering.from_pretrained('xlm-mlm-en-2048')
+        input_ids = torch.tensor(tokenizer.encode("Hello, my dog is cute")).unsqueeze(0)  # Batch size 1
+        start_positions = torch.tensor([1])
+        end_positions = torch.tensor([3])
+        outputs = model(input_ids, start_positions=start_positions, end_positions=end_positions)
+        loss, start_scores, end_scores = outputs[:2]
 
     """
     def __init__(self, config):
diff --git a/pytorch_transformers/modeling_xlnet.py b/pytorch_transformers/modeling_xlnet.py
index 5b3e049ddf..f5dafe16fc 100644
--- a/pytorch_transformers/modeling_xlnet.py
+++ b/pytorch_transformers/modeling_xlnet.py
@@ -712,11 +712,11 @@ class XLNetModel(XLNetPreTrainedModel):
 
     Examples::
 
-        >>> tokenizer = XLNetTokenizer.from_pretrained('xlnet-large-cased')
-        >>> model = XLNetModel.from_pretrained('xlnet-large-cased')
-        >>> input_ids = torch.tensor(tokenizer.encode("Hello, my dog is cute")).unsqueeze(0)  # Batch size 1
-        >>> outputs = model(input_ids)
-        >>> last_hidden_states = outputs[0]  # The last hidden-state is the first element of the output tuple
+        tokenizer = XLNetTokenizer.from_pretrained('xlnet-large-cased')
+        model = XLNetModel.from_pretrained('xlnet-large-cased')
+        input_ids = torch.tensor(tokenizer.encode("Hello, my dog is cute")).unsqueeze(0)  # Batch size 1
+        outputs = model(input_ids)
+        last_hidden_states = outputs[0]  # The last hidden-state is the first element of the output tuple
 
     """
     def __init__(self, config):
@@ -1018,16 +1018,16 @@ class XLNetLMHeadModel(XLNetPreTrainedModel):
 
     Examples::
 
-        >>> tokenizer = XLNetTokenizer.from_pretrained('xlnet-large-cased')
-        >>> model = XLNetLMHeadModel.from_pretrained('xlnet-large-cased')
-        >>> # We show how to setup inputs to predict a next token using a bi-directional context.
-        >>> input_ids = torch.tensor(tokenizer.encode("Hello, my dog is very <mask>")).unsqueeze(0)  # We will predict the masked token
-        >>> perm_mask = torch.zeros((1, input_ids.shape[1], input_ids.shape[1]), dtype=torch.float)
-        >>> perm_mask[:, :, -1] = 1.0  # Previous tokens don't see last token
-        >>> target_mapping = torch.zeros((1, 1, input_ids.shape[1]), dtype=torch.float)  # Shape [1, 1, seq_length] => let's predict one token
-        >>> target_mapping[0, 0, -1] = 1.0  # Our first (and only) prediction will be the last token of the sequence (the masked token)
-        >>> outputs = model(input_ids, perm_mask=perm_mask, target_mapping=target_mapping)
-        >>> next_token_logits = outputs[0]  # Output has shape [target_mapping.size(0), target_mapping.size(1), config.vocab_size]
+        tokenizer = XLNetTokenizer.from_pretrained('xlnet-large-cased')
+        model = XLNetLMHeadModel.from_pretrained('xlnet-large-cased')
+        # We show how to setup inputs to predict a next token using a bi-directional context.
+        input_ids = torch.tensor(tokenizer.encode("Hello, my dog is very <mask>")).unsqueeze(0)  # We will predict the masked token
+        perm_mask = torch.zeros((1, input_ids.shape[1], input_ids.shape[1]), dtype=torch.float)
+        perm_mask[:, :, -1] = 1.0  # Previous tokens don't see last token
+        target_mapping = torch.zeros((1, 1, input_ids.shape[1]), dtype=torch.float)  # Shape [1, 1, seq_length] => let's predict one token
+        target_mapping[0, 0, -1] = 1.0  # Our first (and only) prediction will be the last token of the sequence (the masked token)
+        outputs = model(input_ids, perm_mask=perm_mask, target_mapping=target_mapping)
+        next_token_logits = outputs[0]  # Output has shape [target_mapping.size(0), target_mapping.size(1), config.vocab_size]
 
     """
     def __init__(self, config):
@@ -1098,12 +1098,12 @@ class XLNetForSequenceClassification(XLNetPreTrainedModel):
 
     Examples::
 
-        >>> tokenizer = XLNetTokenizer.from_pretrained('xlnet-large-cased')
-        >>> model = XLNetForSequenceClassification.from_pretrained('xlnet-large-cased')
-        >>> input_ids = torch.tensor(tokenizer.encode("Hello, my dog is cute")).unsqueeze(0)  # Batch size 1
-        >>> labels = torch.tensor([1]).unsqueeze(0)  # Batch size 1
-        >>> outputs = model(input_ids, labels=labels)
-        >>> loss, logits = outputs[:2]
+        tokenizer = XLNetTokenizer.from_pretrained('xlnet-large-cased')
+        model = XLNetForSequenceClassification.from_pretrained('xlnet-large-cased')
+        input_ids = torch.tensor(tokenizer.encode("Hello, my dog is cute")).unsqueeze(0)  # Batch size 1
+        labels = torch.tensor([1]).unsqueeze(0)  # Batch size 1
+        outputs = model(input_ids, labels=labels)
+        loss, logits = outputs[:2]
 
     """
     def __init__(self, config):
@@ -1196,13 +1196,13 @@ class XLNetForQuestionAnswering(XLNetPreTrainedModel):
 
     Examples::
 
-        >>> tokenizer = XLMTokenizer.from_pretrained('xlm-mlm-en-2048')
-        >>> model = XLMForQuestionAnswering.from_pretrained('xlnet-large-cased')
-        >>> input_ids = torch.tensor(tokenizer.encode("Hello, my dog is cute")).unsqueeze(0)  # Batch size 1
-        >>> start_positions = torch.tensor([1])
-        >>> end_positions = torch.tensor([3])
-        >>> outputs = model(input_ids, start_positions=start_positions, end_positions=end_positions)
-        >>> loss, start_scores, end_scores = outputs[:2]
+        tokenizer = XLMTokenizer.from_pretrained('xlm-mlm-en-2048')
+        model = XLMForQuestionAnswering.from_pretrained('xlnet-large-cased')
+        input_ids = torch.tensor(tokenizer.encode("Hello, my dog is cute")).unsqueeze(0)  # Batch size 1
+        start_positions = torch.tensor([1])
+        end_positions = torch.tensor([3])
+        outputs = model(input_ids, start_positions=start_positions, end_positions=end_positions)
+        loss, start_scores, end_scores = outputs[:2]
 
     """
     def __init__(self, config):

From a6f412da01d15cdd60e242e9765d4dfc175adb24 Mon Sep 17 00:00:00 2001
From: Christopher Goh <chrisgzf@gmail.com>
Date: Wed, 7 Aug 2019 02:19:14 +0800
Subject: [PATCH 066/200] Fixed typo in migration guide

---
 README.md | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/README.md b/README.md
index 38143d0b1f..b86a5238c2 100644
--- a/README.md
+++ b/README.md
@@ -314,7 +314,7 @@ loss = outputs[0]
 # In pytorch-transformers you can also have access to the logits:
 loss, logits = outputs[:2]
 
-# And even the attention weigths if you configure the model to output them (and other outputs too, see the docstrings and documentation)
+# And even the attention weights if you configure the model to output them (and other outputs too, see the docstrings and documentation)
 model = BertForSequenceClassification.from_pretrained('bert-base-uncased', output_attentions=True)
 outputs = model(input_ids, labels=labels)
 loss, logits, attentions = outputs

From 770043eea2927eea1664fdd56b3996a8fb41731c Mon Sep 17 00:00:00 2001
From: LysandreJik <lysandre.debut@reseau.eseo.fr>
Date: Wed, 7 Aug 2019 12:53:19 -0400
Subject: [PATCH 067/200] Sentence-pair tasks handling. Using common tests on
 RoBERTa. Forced push to fix indentation.

---
 pytorch_transformers/__init__.py              |   3 +
 pytorch_transformers/modeling_roberta.py      |  28 ++-
 .../tests/modeling_roberta_test.py            | 200 ++++++++++++++----
 .../tests/tokenization_roberta_test.py        |  45 ++--
 pytorch_transformers/tokenization_roberta.py  |  90 ++++++--
 5 files changed, 279 insertions(+), 87 deletions(-)

diff --git a/pytorch_transformers/__init__.py b/pytorch_transformers/__init__.py
index b4b957192c..d1e42b130a 100644
--- a/pytorch_transformers/__init__.py
+++ b/pytorch_transformers/__init__.py
@@ -5,6 +5,7 @@ from .tokenization_transfo_xl import (TransfoXLTokenizer, TransfoXLCorpus)
 from .tokenization_gpt2 import GPT2Tokenizer
 from .tokenization_xlnet import XLNetTokenizer, SPIECE_UNDERLINE
 from .tokenization_xlm import XLMTokenizer
+from .tokenization_roberta import RobertaTokenizer
 from .tokenization_utils import (PreTrainedTokenizer, clean_up_tokenization)
 
 from .modeling_bert import (BertConfig, BertPreTrainedModel, BertModel, BertForPreTraining,
@@ -33,6 +34,8 @@ from .modeling_xlm import (XLMConfig, XLMPreTrainedModel , XLMModel,
                            XLMWithLMHeadModel, XLMForSequenceClassification,
                            XLMForQuestionAnswering, XLM_PRETRAINED_CONFIG_ARCHIVE_MAP,
                            XLM_PRETRAINED_MODEL_ARCHIVE_MAP)
+from .modeling_roberta import (RobertaConfig, RobertaForMaskedLM, RobertaModel,
+                               ROBERTA_PRETRAINED_CONFIG_ARCHIVE_MAP, ROBERTA_PRETRAINED_MODEL_ARCHIVE_MAP)
 from .modeling_utils import (WEIGHTS_NAME, CONFIG_NAME, TF_WEIGHTS_NAME,
                           PretrainedConfig, PreTrainedModel, prune_layer, Conv1D)
 
diff --git a/pytorch_transformers/modeling_roberta.py b/pytorch_transformers/modeling_roberta.py
index 109a719616..43f76989f4 100644
--- a/pytorch_transformers/modeling_roberta.py
+++ b/pytorch_transformers/modeling_roberta.py
@@ -23,6 +23,7 @@ import logging
 import torch
 import torch.nn as nn
 import torch.nn.functional as F
+from torch.nn import CrossEntropyLoss
 
 from pytorch_transformers.modeling_bert import (BertConfig, BertEmbeddings,
                                                 BertLayerNorm, BertModel,
@@ -78,7 +79,7 @@ class RobertaModel(BertModel):
         super(RobertaModel, self).__init__(config)
 
         self.embeddings = RobertaEmbeddings(config)
-
+        self.apply(self.init_weights)
 
 
 class RobertaForMaskedLM(BertPreTrainedModel):
@@ -94,16 +95,31 @@ class RobertaForMaskedLM(BertPreTrainedModel):
 
         self.roberta = RobertaModel(config)
         self.lm_head = RobertaLMHead(config)
-    
-    def forward(self, input_ids, token_type_ids=None, attention_mask=None, position_ids=None, head_mask=None):
+
+        self.apply(self.init_weights)
+        self.tie_weights()
+
+    def tie_weights(self):
+        """ Make sure we are sharing the input and output embeddings.
+            Export to TorchScript can't handle parameter sharing so we are cloning them instead.
+        """
+        self._tie_or_clone_weights(self.lm_head.decoder, self.roberta.embeddings.word_embeddings)
+
+    def forward(self, input_ids, token_type_ids=None, attention_mask=None, masked_lm_labels=None, position_ids=None,
+                head_mask=None):
         outputs = self.roberta(input_ids, position_ids=position_ids, token_type_ids=token_type_ids,
                             attention_mask=attention_mask, head_mask=head_mask)
         sequence_output = outputs[0]
         prediction_scores = self.lm_head(sequence_output)
 
         outputs = (prediction_scores,) + outputs[2:]
-        return outputs
 
+        if masked_lm_labels is not None:
+            loss_fct = CrossEntropyLoss(ignore_index=-1)
+            masked_lm_loss = loss_fct(prediction_scores.view(-1, self.config.vocab_size), masked_lm_labels.view(-1))
+            outputs = (masked_lm_loss,) + outputs
+
+        return outputs
 
 
 class RobertaLMHead(nn.Module):
@@ -114,7 +130,7 @@ class RobertaLMHead(nn.Module):
         self.dense = nn.Linear(config.hidden_size, config.hidden_size)
         self.layer_norm = BertLayerNorm(config.hidden_size, eps=config.layer_norm_eps)
 
-        self.weight = nn.Linear(config.hidden_size, config.vocab_size, bias=False).weight
+        self.decoder = nn.Linear(config.hidden_size, config.vocab_size, bias=False)
         self.bias = nn.Parameter(torch.zeros(config.vocab_size))
 
     def forward(self, features, **kwargs):
@@ -123,6 +139,6 @@ class RobertaLMHead(nn.Module):
         x = self.layer_norm(x)
 
         # project back to size of vocabulary with bias
-        x = F.linear(x, self.weight) + self.bias
+        x = self.decoder(x) + self.bias
 
         return x
diff --git a/pytorch_transformers/tests/modeling_roberta_test.py b/pytorch_transformers/tests/modeling_roberta_test.py
index 62707326a6..273176b27a 100644
--- a/pytorch_transformers/tests/modeling_roberta_test.py
+++ b/pytorch_transformers/tests/modeling_roberta_test.py
@@ -12,58 +12,172 @@
 # WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
 # See the License for the specific language governing permissions and
 # limitations under the License.
-from __future__ import (absolute_import, division, print_function,
-                        unicode_literals)
+from __future__ import absolute_import
+from __future__ import division
+from __future__ import print_function
 
-import os
 import unittest
+import shutil
 import pytest
-import torch
 
-from pytorch_transformers.modeling_roberta import (RobertaForMaskedLM,
-                                                   RobertaModel)
+from pytorch_transformers import (RobertaConfig, RobertaModel, RobertaForMaskedLM)
+from pytorch_transformers.modeling_roberta import ROBERTA_PRETRAINED_MODEL_ARCHIVE_MAP
+
+from .modeling_common_test import (CommonTestCases, ConfigTester, ids_tensor)
 
 
-class RobertaModelTest(unittest.TestCase):
+class RobertaModelTest(CommonTestCases.CommonModelTester):
 
-    # @pytest.mark.slow
-    def test_inference_masked_lm(self):
-        model = RobertaForMaskedLM.from_pretrained('roberta-base')
-        
-        input_ids = torch.tensor([[    0, 31414,   232,   328,   740,  1140, 12695,    69, 46078,  1588,   2]])
-        output = model(input_ids)[0]
-        expected_shape = torch.Size((1, 11, 50265))
-        self.assertEqual(
-            output.shape,
-            expected_shape
-        )
-        # compare the actual values for a slice.
-        expected_slice = torch.Tensor(
-            [[[33.8843, -4.3107, 22.7779],
-              [ 4.6533, -2.8099, 13.6252],
-              [ 1.8222, -3.6898,  8.8600]]]
-        )
-        self.assertTrue(
-            torch.allclose(output[:, :3, :3], expected_slice, atol=1e-3)
-        )
+    all_model_classes = (RobertaForMaskedLM, RobertaModel)
 
-    # @pytest.mark.slow
-    def test_inference_no_head(self):
-        model = RobertaModel.from_pretrained('roberta-base')
-        
-        input_ids = torch.tensor([[    0, 31414,   232,   328,   740,  1140, 12695,    69, 46078,  1588,   2]])
-        output = model(input_ids)[0]
-        # compare the actual values for a slice.
-        expected_slice = torch.Tensor(
-            [[[-0.0231,  0.0782,  0.0074],
-              [-0.1854,  0.0539, -0.0174],
-              [ 0.0548,  0.0799,  0.1687]]]
-        )
-        self.assertTrue(
-            torch.allclose(output[:, :3, :3], expected_slice, atol=1e-3)
-        )
+    class RobertaModelTester(object):
 
+        def __init__(self,
+                     parent,
+                     batch_size=13,
+                     seq_length=7,
+                     is_training=True,
+                     use_input_mask=True,
+                     use_token_type_ids=True,
+                     use_labels=True,
+                     vocab_size=99,
+                     hidden_size=32,
+                     num_hidden_layers=5,
+                     num_attention_heads=4,
+                     intermediate_size=37,
+                     hidden_act="gelu",
+                     hidden_dropout_prob=0.1,
+                     attention_probs_dropout_prob=0.1,
+                     max_position_embeddings=512,
+                     type_vocab_size=16,
+                     type_sequence_label_size=2,
+                     initializer_range=0.02,
+                     num_labels=3,
+                     num_choices=4,
+                     scope=None,
+                    ):
+            self.parent = parent
+            self.batch_size = batch_size
+            self.seq_length = seq_length
+            self.is_training = is_training
+            self.use_input_mask = use_input_mask
+            self.use_token_type_ids = use_token_type_ids
+            self.use_labels = use_labels
+            self.vocab_size = vocab_size
+            self.hidden_size = hidden_size
+            self.num_hidden_layers = num_hidden_layers
+            self.num_attention_heads = num_attention_heads
+            self.intermediate_size = intermediate_size
+            self.hidden_act = hidden_act
+            self.hidden_dropout_prob = hidden_dropout_prob
+            self.attention_probs_dropout_prob = attention_probs_dropout_prob
+            self.max_position_embeddings = max_position_embeddings
+            self.type_vocab_size = type_vocab_size
+            self.type_sequence_label_size = type_sequence_label_size
+            self.initializer_range = initializer_range
+            self.num_labels = num_labels
+            self.num_choices = num_choices
+            self.scope = scope
 
+        def prepare_config_and_inputs(self):
+            input_ids = ids_tensor([self.batch_size, self.seq_length], self.vocab_size)
 
-if __name__ == '__main__':
+            input_mask = None
+            if self.use_input_mask:
+                input_mask = ids_tensor([self.batch_size, self.seq_length], vocab_size=2)
+
+            token_type_ids = None
+            if self.use_token_type_ids:
+                token_type_ids = ids_tensor([self.batch_size, self.seq_length], self.type_vocab_size)
+
+            sequence_labels = None
+            token_labels = None
+            choice_labels = None
+            if self.use_labels:
+                sequence_labels = ids_tensor([self.batch_size], self.type_sequence_label_size)
+                token_labels = ids_tensor([self.batch_size, self.seq_length], self.num_labels)
+                choice_labels = ids_tensor([self.batch_size], self.num_choices)
+
+            config = RobertaConfig(
+                vocab_size_or_config_json_file=self.vocab_size,
+                hidden_size=self.hidden_size,
+                num_hidden_layers=self.num_hidden_layers,
+                num_attention_heads=self.num_attention_heads,
+                intermediate_size=self.intermediate_size,
+                hidden_act=self.hidden_act,
+                hidden_dropout_prob=self.hidden_dropout_prob,
+                attention_probs_dropout_prob=self.attention_probs_dropout_prob,
+                max_position_embeddings=self.max_position_embeddings,
+                type_vocab_size=self.type_vocab_size,
+                initializer_range=self.initializer_range)
+
+            return config, input_ids, token_type_ids, input_mask, sequence_labels, token_labels, choice_labels
+
+        def check_loss_output(self, result):
+            self.parent.assertListEqual(
+                list(result["loss"].size()),
+                [])
+
+        def create_and_check_roberta_model(self, config, input_ids, token_type_ids, input_mask, sequence_labels,
+                                           token_labels, choice_labels):
+            model = RobertaModel(config=config)
+            model.eval()
+            sequence_output, pooled_output = model(input_ids, token_type_ids, input_mask)
+            sequence_output, pooled_output = model(input_ids, token_type_ids)
+            sequence_output, pooled_output = model(input_ids)
+
+            result = {
+                "sequence_output": sequence_output,
+                "pooled_output": pooled_output,
+            }
+            self.parent.assertListEqual(
+                list(result["sequence_output"].size()),
+                [self.batch_size, self.seq_length, self.hidden_size])
+            self.parent.assertListEqual(list(result["pooled_output"].size()), [self.batch_size, self.hidden_size])
+
+        def create_and_check_roberta_for_masked_lm(self, config, input_ids, token_type_ids, input_mask, sequence_labels,
+                                                   token_labels, choice_labels):
+            model = RobertaForMaskedLM(config=config)
+            model.eval()
+            loss, prediction_scores = model(input_ids, token_type_ids, input_mask, token_labels)
+            result = {
+                "loss": loss,
+                "prediction_scores": prediction_scores,
+            }
+            self.parent.assertListEqual(
+                list(result["prediction_scores"].size()),
+                [self.batch_size, self.seq_length, self.vocab_size])
+            self.check_loss_output(result)
+
+        def prepare_config_and_inputs_for_common(self):
+            config_and_inputs = self.prepare_config_and_inputs()
+            (config, input_ids, token_type_ids, input_mask,
+             sequence_labels, token_labels, choice_labels) = config_and_inputs
+            inputs_dict = {'input_ids': input_ids, 'token_type_ids': token_type_ids, 'attention_mask': input_mask}
+            return config, inputs_dict
+
+    def setUp(self):
+        self.model_tester = RobertaModelTest.RobertaModelTester(self)
+        self.config_tester = ConfigTester(self, config_class=RobertaConfig, hidden_size=37)
+
+    def test_config(self):
+        self.config_tester.run_common_tests()
+
+    def test_roberta_model(self):
+        config_and_inputs = self.model_tester.prepare_config_and_inputs()
+        self.model_tester.create_and_check_roberta_model(*config_and_inputs)
+
+    def test_for_masked_lm(self):
+        config_and_inputs = self.model_tester.prepare_config_and_inputs()
+        self.model_tester.create_and_check_roberta_for_masked_lm(*config_and_inputs)
+
+    @pytest.mark.slow
+    def test_model_from_pretrained(self):
+        cache_dir = "/tmp/pytorch_transformers_test/"
+        for model_name in list(ROBERTA_PRETRAINED_MODEL_ARCHIVE_MAP.keys())[:1]:
+            model = RobertaModel.from_pretrained(model_name, cache_dir=cache_dir)
+            shutil.rmtree(cache_dir)
+            self.assertIsNotNone(model)
+
+if __name__ == "__main__":
     unittest.main()
diff --git a/pytorch_transformers/tests/tokenization_roberta_test.py b/pytorch_transformers/tests/tokenization_roberta_test.py
index cd4e17ec34..60df18ae2b 100644
--- a/pytorch_transformers/tests/tokenization_roberta_test.py
+++ b/pytorch_transformers/tests/tokenization_roberta_test.py
@@ -12,32 +12,45 @@
 # WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
 # See the License for the specific language governing permissions and
 # limitations under the License.
-from __future__ import (absolute_import, division, print_function,
-                        unicode_literals)
+from __future__ import absolute_import, division, print_function, unicode_literals
 
 import os
 import unittest
-import pytest
-import six
 
-from pytorch_transformers.tokenization_roberta import RobertaTokenizer
+from pytorch_transformers.tokenization_roberta import RobertaTokenizer, VOCAB_FILES_NAMES
+from .tokenization_tests_commons import create_and_check_tokenizer_commons, TemporaryDirectory
 
 
 class RobertaTokenizationTest(unittest.TestCase):
 
-    # @pytest.mark.slow
     def test_full_tokenizer(self):
-        tokenizer = RobertaTokenizer.from_pretrained('roberta-base')
-        self.assertListEqual(
-            tokenizer.encode('Hello world!'),
-            [0, 31414, 232, 328, 2]
-        )
-        if six.PY3:
-            self.assertListEqual(
-                tokenizer.encode('Hello world! cécé herlolip'),
-                [0, 31414, 232, 328, 740, 1140, 12695, 69, 46078, 1588, 2]
-            )
+        """ Adapted from Sennrich et al. 2015 and https://github.com/rsennrich/subword-nmt """
+        vocab = ["l", "o", "w", "e", "r", "s", "t", "i", "d", "n",
+                 "lo", "low", "er",
+                 "low", "lowest", "newer", "wider", "<unk>"]
+        vocab_tokens = dict(zip(vocab, range(len(vocab))))
+        special_tokens_map = {"unk_token": "<unk>"}
 
+        with TemporaryDirectory() as tmpdirname:
+            vocab_file = os.path.join(tmpdirname, VOCAB_FILES_NAMES['vocab_file'])
+            with open(vocab_file, "w") as fp:
+                [fp.write(f"{vocab} {index}\n") for index, vocab in enumerate(vocab_tokens)]
+
+            input_text = u"lower newer"
+            output_text = u"lower<unk>newer"
+
+            create_and_check_tokenizer_commons(self, input_text, output_text, RobertaTokenizer, tmpdirname, **special_tokens_map)
+
+            tokenizer = RobertaTokenizer(vocab_file, **special_tokens_map)
+            text = "lower"
+            bpe_tokens = ["low", "er"]
+            tokens = tokenizer.tokenize(text)
+            self.assertListEqual(tokens, bpe_tokens)
+
+            input_tokens = tokens + [tokenizer.unk_token]
+            input_bpe_tokens = [13, 12, 17]
+            self.assertListEqual(
+                tokenizer.convert_tokens_to_ids(input_tokens), input_bpe_tokens)
 
 
 if __name__ == '__main__':
diff --git a/pytorch_transformers/tokenization_roberta.py b/pytorch_transformers/tokenization_roberta.py
index 4f9a7bc0fa..7fa42bfb1c 100644
--- a/pytorch_transformers/tokenization_roberta.py
+++ b/pytorch_transformers/tokenization_roberta.py
@@ -22,22 +22,22 @@ import re
 from io import open
 import six
 
-from .tokenization_utils import PreTrainedTokenizer
+from .tokenization_utils import PreTrainedTokenizer, clean_up_tokenization
 from .tokenization_gpt2 import GPT2Tokenizer
 
 logger = logging.getLogger(__name__)
 
 VOCAB_FILES_NAMES = {
-    'dict_file': 'dict.txt',
+    'vocab_file': 'dict.txt',
 }
 
 PRETRAINED_VOCAB_FILES_MAP = {
-    'dict_file':
-    {
-        'roberta-base': "https://s3.amazonaws.com/models.huggingface.co/bert/roberta-base-dict.txt",
-        'roberta-large': "https://s3.amazonaws.com/models.huggingface.co/bert/roberta-base-dict.txt",
-        'roberta-large-mnli': "https://s3.amazonaws.com/models.huggingface.co/bert/roberta-base-dict.txt",
-    },
+    'vocab_file':
+        {
+            'roberta-base': "https://s3.amazonaws.com/models.huggingface.co/bert/roberta-base-dict.txt",
+            'roberta-large': "https://s3.amazonaws.com/models.huggingface.co/bert/roberta-base-dict.txt",
+            'roberta-large-mnli': "https://s3.amazonaws.com/models.huggingface.co/bert/roberta-base-dict.txt",
+        },
 }
 
 PRETRAINED_POSITIONAL_EMBEDDINGS_SIZES = {
@@ -46,7 +46,6 @@ PRETRAINED_POSITIONAL_EMBEDDINGS_SIZES = {
     'roberta-large-mnli': 512,
 }
 
-
 SPACE_NORMALIZER = re.compile(r"\s+")
 
 def tokenize_line(line):
@@ -142,7 +141,7 @@ class Dictionary(object):
                                 "rebuild the dataset".format(f))
             return
 
-        lines = f.readlines()
+        lines = f.read().splitlines()
         for line in lines:
             idx = line.rfind(' ')
             if idx == -1:
@@ -152,7 +151,7 @@ class Dictionary(object):
             self.indices[word] = len(self.symbols)
             self.symbols.append(word)
             self.count.append(count)
-    
+
     def encode_line(self, line, line_tokenizer=tokenize_line, add_if_not_exist=True,
                     consumer=None, append_eos=True, reverse_order=False):
         words = line_tokenizer(line)
@@ -174,8 +173,6 @@ class Dictionary(object):
         return ids
 
 
-
-
 class RobertaTokenizer(PreTrainedTokenizer):
     """
     RoBERTa tokenizer. Peculiarities:
@@ -185,25 +182,53 @@ class RobertaTokenizer(PreTrainedTokenizer):
     pretrained_vocab_files_map = PRETRAINED_VOCAB_FILES_MAP
     max_model_input_sizes = PRETRAINED_POSITIONAL_EMBEDDINGS_SIZES
 
-    def __init__(self, dict_file,
+    def __init__(self, vocab_file,
                  bos_token="<s>", eos_token="</s>", **kwargs):
-        super(RobertaTokenizer, self).__init__(bos_token=bos_token, eos_token=eos_token, **kwargs)
+        super(RobertaTokenizer, self).__init__(cls_token=bos_token, sep_token=eos_token, eos_token=eos_token, **kwargs)
 
         self.gpt2_tokenizer = GPT2Tokenizer.from_pretrained('gpt2')
-        self.dictionary = Dictionary.load(dict_file)
+        self.dictionary = Dictionary.load(vocab_file)
 
     def _tokenize(self, text):
         """ Use GPT-2 Tokenizer """
         return self.gpt2_tokenizer._tokenize(text)
 
-    def encode(self, text):
+    def encode(self, text, *args):
         """ Converts a string in a sequence of ids (integer), using the tokenizer and vocabulary.
         """
-        gpt2_tokens_joined = " ".join(
-            str(x) for x in self.gpt2_tokenizer.convert_tokens_to_ids(self.tokenize(text))
-        )
-        bpe_sentence = '<s> ' + gpt2_tokens_joined + ' </s>'
-        return self.dictionary.encode_line(bpe_sentence, append_eos=False)
+        bpe_sentence = [self.cls_token] + \
+                       self.gpt2_tokenizer.convert_tokens_to_ids(self.tokenize(text)) + \
+                       [self.sep_token]
+
+        if len(args):
+            for additional_sentence in args:
+                bpe_sentence += [self.sep_token
+                                 ] + \
+                                self.gpt2_tokenizer.convert_tokens_to_ids(self.tokenize(additional_sentence)) + \
+                                [self.sep_token]
+
+        return self.dictionary.encode_line(' '.join([str(token) for token in bpe_sentence]), append_eos=False)
+
+    def decode(self, token_ids, skip_special_tokens=False, clean_up_tokenization_spaces=True):
+        """ Converts a sequence of ids (integer) in a string, using the tokenizer and vocabulary
+            with options to remove special tokens and clean up tokenization spaces.
+            Handles sentence pairs.
+        """
+        filtered_tokens = self.convert_ids_to_tokens(token_ids, skip_special_tokens=skip_special_tokens)
+
+        if any(isinstance(element, list) for element in filtered_tokens):
+            texts = []
+            for element in filtered_tokens:
+                text = self.convert_tokens_to_string(element)
+                if clean_up_tokenization_spaces:
+                    text = clean_up_tokenization(text)
+                    texts.append(text)
+            return texts
+        else:
+            text = self.convert_tokens_to_string(filtered_tokens)
+            if clean_up_tokenization_spaces:
+                text = clean_up_tokenization(text)
+            return text
 
     def _convert_token_to_id(self, token):
         return self.dictionary.index(token)
@@ -218,3 +243,24 @@ class RobertaTokenizer(PreTrainedTokenizer):
 
     def convert_tokens_to_string(self, tokens):
         return self.gpt2_tokenizer.convert_tokens_to_string(tokens)
+
+    def convert_ids_to_tokens(self, ids, skip_special_tokens=False):
+        # Remove the first and last tokens which are cls and sep tokens
+        ids = ids[1:-1]
+        # If multi sentence, then split (multi sentence found by looking for two sequential sep tokens)
+        ids = [list(map(int, example.split(' '))) for example in ' '.join([str(id) for id in ids]).split(' 2 2 ')]
+
+        if len(ids) == 1:
+            tokens = self.gpt2_tokenizer.convert_ids_to_tokens(list(map(lambda id: int(self.dictionary[id]), ids[0])))
+        else:
+            tokens = []
+            for example in ids:
+                tokens += [
+                    self.gpt2_tokenizer.convert_ids_to_tokens(list(map(lambda id: int(self.dictionary[id]), example)))]
+        return tokens
+
+    def convert_tokens_to_ids(self, tokens):
+        tokens = " ".join(str(x) for x in self.gpt2_tokenizer.convert_tokens_to_ids(tokens))
+        bpe_sentence = '<s> ' + tokens + ' </s>'
+        return self.dictionary.encode_line(bpe_sentence, append_eos=False)
+

From 39d72bcc7b2c99c04b6f483f0d8e7bdff547d37c Mon Sep 17 00:00:00 2001
From: LysandreJik <lysandre.debut@reseau.eseo.fr>
Date: Wed, 7 Aug 2019 14:21:57 -0400
Subject: [PATCH 068/200] Fixed the RoBERTa checkpoint conversion script
 according to the LM head refactoring.

---
 pytorch_transformers/convert_roberta_checkpoint_to_pytorch.py | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/pytorch_transformers/convert_roberta_checkpoint_to_pytorch.py b/pytorch_transformers/convert_roberta_checkpoint_to_pytorch.py
index 7a17ee3f1b..f21afa29ed 100644
--- a/pytorch_transformers/convert_roberta_checkpoint_to_pytorch.py
+++ b/pytorch_transformers/convert_roberta_checkpoint_to_pytorch.py
@@ -123,7 +123,7 @@ def convert_roberta_checkpoint_to_pytorch(roberta_checkpoint_path, pytorch_dump_
     model.lm_head.layer_norm.weight = roberta.model.decoder.lm_head.layer_norm.weight
     model.lm_head.layer_norm.bias = roberta.model.decoder.lm_head.layer_norm.bias
     model.lm_head.layer_norm.variance_epsilon = roberta.model.decoder.lm_head.layer_norm.eps
-    model.lm_head.weight = roberta.model.decoder.lm_head.weight
+    model.lm_head.decoder.weight = roberta.model.decoder.lm_head.weight
     model.lm_head.bias = roberta.model.decoder.lm_head.bias
 
     # Let's check that we get the same results.

From 7df303f5add56a5b032fe133ccdcb853d8b830e3 Mon Sep 17 00:00:00 2001
From: LysandreJik <lysandre.debut@reseau.eseo.fr>
Date: Thu, 8 Aug 2019 10:36:26 -0400
Subject: [PATCH 069/200] fix #971

---
 pytorch_transformers/modeling_bert.py   | 2 +-
 pytorch_transformers/modeling_gpt2.py   | 4 ++--
 pytorch_transformers/modeling_openai.py | 4 ++--
 pytorch_transformers/modeling_xlm.py    | 2 +-
 4 files changed, 6 insertions(+), 6 deletions(-)

diff --git a/pytorch_transformers/modeling_bert.py b/pytorch_transformers/modeling_bert.py
index 5d3c160668..e13b3d01f9 100644
--- a/pytorch_transformers/modeling_bert.py
+++ b/pytorch_transformers/modeling_bert.py
@@ -603,7 +603,7 @@ BERT_INPUTS_DOCSTRING = r"""
             :func:`pytorch_transformers.PreTrainedTokenizer.convert_tokens_to_ids` for details.
         **position_ids**: (`optional`) ``torch.LongTensor`` of shape ``(batch_size, sequence_length)``:
             Indices of positions of each input sequence tokens in the position embeddings.
-            Selected in the range ``[0, config.max_position_embeddings - 1[``.
+            Selected in the range ``[0, config.max_position_embeddings - 1]``.
         **token_type_ids**: (`optional`) ``torch.LongTensor`` of shape ``(batch_size, sequence_length)``:
             Segment token indices to indicate first and second portions of the inputs.
             Indices are selected in ``[0, 1]``: ``0`` corresponds to a `sentence A` token, ``1``
diff --git a/pytorch_transformers/modeling_gpt2.py b/pytorch_transformers/modeling_gpt2.py
index 50cb834400..0ecc20516c 100644
--- a/pytorch_transformers/modeling_gpt2.py
+++ b/pytorch_transformers/modeling_gpt2.py
@@ -393,7 +393,7 @@ GPT2_INPUTS_DOCSTRING = r"""    Inputs:
             :func:`pytorch_transformers.PreTrainedTokenizer.convert_tokens_to_ids` for details.
         **position_ids**: (`optional`) ``torch.LongTensor`` of shape ``(batch_size, sequence_length)``:
             Indices of positions of each input sequence tokens in the position embeddings.
-            Selected in the range ``[0, config.max_position_embeddings - 1[``.
+            Selected in the range ``[0, config.max_position_embeddings - 1]``.
         **token_type_ids**: (`optional`) ``torch.LongTensor`` of shape ``(batch_size, sequence_length)``:
             A parallel sequence of tokens (can be used to indicate various portions of the inputs).
             The embeddings from these tokens will be summed with the respective token embeddings.
@@ -627,7 +627,7 @@ class GPT2DoubleHeadsModel(GPT2PreTrainedModel):
             Selected in the range ``[0, input_ids.size(-1) - 1[``.
         **position_ids**: (`optional`) ``torch.LongTensor`` of shape ``(batch_size, num_choices, sequence_length)``:
             Indices of positions of each input sequence tokens in the position embeddings.
-            Selected in the range ``[0, config.max_position_embeddings - 1[``.
+            Selected in the range ``[0, config.max_position_embeddings - 1]``.
         **token_type_ids**: (`optional`) ``torch.LongTensor`` of shape ``(batch_size, num_choices, sequence_length)``:
             A parallel sequence of tokens (can be used to indicate various portions of the inputs).
             The embeddings from these tokens will be summed with the respective token embeddings.
diff --git a/pytorch_transformers/modeling_openai.py b/pytorch_transformers/modeling_openai.py
index 20faf39972..536b0e2432 100644
--- a/pytorch_transformers/modeling_openai.py
+++ b/pytorch_transformers/modeling_openai.py
@@ -407,7 +407,7 @@ OPENAI_GPT_INPUTS_DOCSTRING = r"""    Inputs:
             :func:`pytorch_transformers.PreTrainedTokenizer.convert_tokens_to_ids` for details.
         **position_ids**: (`optional`) ``torch.LongTensor`` of shape ``(batch_size, sequence_length)``:
             Indices of positions of each input sequence tokens in the position embeddings.
-            Selected in the range ``[0, config.max_position_embeddings - 1[``.
+            Selected in the range ``[0, config.max_position_embeddings - 1]``.
         **token_type_ids**: (`optional`) ``torch.LongTensor`` of shape ``(batch_size, sequence_length)``:
             A parallel sequence of tokens (can be used to indicate various portions of the inputs).
             The embeddings from these tokens will be summed with the respective token embeddings.
@@ -617,7 +617,7 @@ class OpenAIGPTDoubleHeadsModel(OpenAIGPTPreTrainedModel):
             Selected in the range ``[0, input_ids.size(-1) - 1[``.
         **position_ids**: (`optional`) ``torch.LongTensor`` of shape ``(batch_size, num_choices, sequence_length)``:
             Indices of positions of each input sequence tokens in the position embeddings.
-            Selected in the range ``[0, config.max_position_embeddings - 1[``.
+            Selected in the range ``[0, config.max_position_embeddings - 1]``.
         **token_type_ids**: (`optional`) ``torch.LongTensor`` of shape ``(batch_size, num_choices, sequence_length)``:
             A parallel sequence of tokens (can be used to indicate various portions of the inputs).
             The embeddings from these tokens will be summed with the respective token embeddings.
diff --git a/pytorch_transformers/modeling_xlm.py b/pytorch_transformers/modeling_xlm.py
index 03af828b9d..5acb20e04c 100644
--- a/pytorch_transformers/modeling_xlm.py
+++ b/pytorch_transformers/modeling_xlm.py
@@ -427,7 +427,7 @@ XLM_INPUTS_DOCSTRING = r"""
             :func:`pytorch_transformers.PreTrainedTokenizer.convert_tokens_to_ids` for details.
         **position_ids**: (`optional`) ``torch.LongTensor`` of shape ``(batch_size, sequence_length)``:
             Indices of positions of each input sequence tokens in the position embeddings.
-            Selected in the range ``[0, config.max_position_embeddings - 1[``.
+            Selected in the range ``[0, config.max_position_embeddings - 1]``.
         **token_type_ids**: (`optional`) ``torch.LongTensor`` of shape ``(batch_size, sequence_length)``:
             A parallel sequence of tokens (can be used to indicate various portions of the inputs).
             The embeddings from these tokens will be summed with the respective token embeddings.

From f2b300df6bd46ad16580f0313bc4b30ddde8515d Mon Sep 17 00:00:00 2001
From: LysandreJik <lysandre.debut@reseau.eseo.fr>
Date: Thu, 8 Aug 2019 10:38:57 -0400
Subject: [PATCH 070/200] fix #976

---
 pytorch_transformers/modeling_bert.py  | 4 ++--
 pytorch_transformers/modeling_xlm.py   | 2 +-
 pytorch_transformers/modeling_xlnet.py | 2 +-
 3 files changed, 4 insertions(+), 4 deletions(-)

diff --git a/pytorch_transformers/modeling_bert.py b/pytorch_transformers/modeling_bert.py
index e13b3d01f9..34eac7f26f 100644
--- a/pytorch_transformers/modeling_bert.py
+++ b/pytorch_transformers/modeling_bert.py
@@ -925,7 +925,7 @@ class BertForSequenceClassification(BertPreTrainedModel):
     r"""
         **labels**: (`optional`) ``torch.LongTensor`` of shape ``(batch_size,)``:
             Labels for computing the sequence classification/regression loss.
-            Indices should be in ``[0, ..., config.num_labels]``.
+            Indices should be in ``[0, ..., config.num_labels - 1]``.
             If ``config.num_labels == 1`` a regression loss is computed (Mean-Square loss),
             If ``config.num_labels > 1`` a classification loss is computed (Cross-Entropy).
 
@@ -1099,7 +1099,7 @@ class BertForTokenClassification(BertPreTrainedModel):
     r"""
         **labels**: (`optional`) ``torch.LongTensor`` of shape ``(batch_size, sequence_length)``:
             Labels for computing the token classification loss.
-            Indices should be in ``[0, ..., config.num_labels]``.
+            Indices should be in ``[0, ..., config.num_labels - 1]``.
 
     Outputs: `Tuple` comprising various elements depending on the configuration (config) and inputs:
         **loss**: (`optional`, returned when ``labels`` is provided) ``torch.FloatTensor`` of shape ``(1,)``:
diff --git a/pytorch_transformers/modeling_xlm.py b/pytorch_transformers/modeling_xlm.py
index 5acb20e04c..941c8dda2f 100644
--- a/pytorch_transformers/modeling_xlm.py
+++ b/pytorch_transformers/modeling_xlm.py
@@ -784,7 +784,7 @@ class XLMForSequenceClassification(XLMPreTrainedModel):
     r"""
         **labels**: (`optional`) ``torch.LongTensor`` of shape ``(batch_size,)``:
             Labels for computing the sequence classification/regression loss.
-            Indices should be in ``[0, ..., config.num_labels]``.
+            Indices should be in ``[0, ..., config.num_labels - 1]``.
             If ``config.num_labels == 1`` a regression loss is computed (Mean-Square loss),
             If ``config.num_labels > 1`` a classification loss is computed (Cross-Entropy).
 
diff --git a/pytorch_transformers/modeling_xlnet.py b/pytorch_transformers/modeling_xlnet.py
index b6adc4de43..e9e75e3ab7 100644
--- a/pytorch_transformers/modeling_xlnet.py
+++ b/pytorch_transformers/modeling_xlnet.py
@@ -1075,7 +1075,7 @@ class XLNetForSequenceClassification(XLNetPreTrainedModel):
     r"""
         **labels**: (`optional`) ``torch.LongTensor`` of shape ``(batch_size,)``:
             Labels for computing the sequence classification/regression loss.
-            Indices should be in ``[0, ..., config.num_labels]``.
+            Indices should be in ``[0, ..., config.num_labels - 1]``.
             If ``config.num_labels == 1`` a regression loss is computed (Mean-Square loss),
             If ``config.num_labels > 1`` a classification loss is computed (Cross-Entropy).
 

From 9d0603148bc34255fad0cad73ce438ecd7306322 Mon Sep 17 00:00:00 2001
From: Julien Chaumond <chaumond@gmail.com>
Date: Thu, 8 Aug 2019 11:24:54 -0400
Subject: [PATCH 071/200] [RoBERTa] RobertaForSequenceClassification +
 conversion

---
 .../convert_roberta_checkpoint_to_pytorch.py  | 36 ++++++++----
 pytorch_transformers/modeling_roberta.py      | 57 ++++++++++++++++++
 .../tests/modeling_roberta_test.py            | 58 +++++++++++++++++++
 3 files changed, 140 insertions(+), 11 deletions(-)

diff --git a/pytorch_transformers/convert_roberta_checkpoint_to_pytorch.py b/pytorch_transformers/convert_roberta_checkpoint_to_pytorch.py
index f21afa29ed..85ad5ad15b 100644
--- a/pytorch_transformers/convert_roberta_checkpoint_to_pytorch.py
+++ b/pytorch_transformers/convert_roberta_checkpoint_to_pytorch.py
@@ -30,6 +30,7 @@ from pytorch_transformers.modeling_bert import (BertConfig, BertEncoder,
                                                 BertSelfOutput)
 from pytorch_transformers.modeling_roberta import (RobertaEmbeddings,
                                                    RobertaForMaskedLM,
+                                                   RobertaForSequenceClassification,
                                                    RobertaModel)
 
 logging.basicConfig(level=logging.INFO)
@@ -38,7 +39,7 @@ logger = logging.getLogger(__name__)
 SAMPLE_TEXT = 'Hello world! cécé herlolip'
 
 
-def convert_roberta_checkpoint_to_pytorch(roberta_checkpoint_path, pytorch_dump_folder_path):
+def convert_roberta_checkpoint_to_pytorch(roberta_checkpoint_path, pytorch_dump_folder_path, classification_head):
     """
     Copy/paste/tweak roberta's weights to our BERT structure.
     """
@@ -53,9 +54,11 @@ def convert_roberta_checkpoint_to_pytorch(roberta_checkpoint_path, pytorch_dump_
         max_position_embeddings=514,
         type_vocab_size=1,
     )
+    if classification_head:
+        config.num_labels = roberta.args.num_classes
     print("Our BERT config:", config)
 
-    model = RobertaForMaskedLM(config)
+    model = RobertaForSequenceClassification(config) if classification_head else RobertaForMaskedLM(config)
     model.eval()
 
     # Now let's copy all the weights.
@@ -117,14 +120,20 @@ def convert_roberta_checkpoint_to_pytorch(roberta_checkpoint_path, pytorch_dump_
         bert_output.LayerNorm.variance_epsilon = roberta_layer.final_layer_norm.eps
         #### end of layer
     
-    # LM Head
-    model.lm_head.dense.weight = roberta.model.decoder.lm_head.dense.weight
-    model.lm_head.dense.bias = roberta.model.decoder.lm_head.dense.bias
-    model.lm_head.layer_norm.weight = roberta.model.decoder.lm_head.layer_norm.weight
-    model.lm_head.layer_norm.bias = roberta.model.decoder.lm_head.layer_norm.bias
-    model.lm_head.layer_norm.variance_epsilon = roberta.model.decoder.lm_head.layer_norm.eps
-    model.lm_head.decoder.weight = roberta.model.decoder.lm_head.weight
-    model.lm_head.bias = roberta.model.decoder.lm_head.bias
+    if classification_head:
+        model.classifier.dense.weight = roberta.model.classification_heads['mnli'].dense.weight
+        model.classifier.dense.bias = roberta.model.classification_heads['mnli'].dense.bias
+        model.classifier.out_proj.weight = roberta.model.classification_heads['mnli'].out_proj.weight
+        model.classifier.out_proj.bias = roberta.model.classification_heads['mnli'].out_proj.bias
+    else:
+        # LM Head
+        model.lm_head.dense.weight = roberta.model.decoder.lm_head.dense.weight
+        model.lm_head.dense.bias = roberta.model.decoder.lm_head.dense.bias
+        model.lm_head.layer_norm.weight = roberta.model.decoder.lm_head.layer_norm.weight
+        model.lm_head.layer_norm.bias = roberta.model.decoder.lm_head.layer_norm.bias
+        model.lm_head.layer_norm.variance_epsilon = roberta.model.decoder.lm_head.layer_norm.eps
+        model.lm_head.weight = roberta.model.decoder.lm_head.weight
+        model.lm_head.bias = roberta.model.decoder.lm_head.bias
 
     # Let's check that we get the same results.
     input_ids: torch.Tensor = roberta.encode(SAMPLE_TEXT).unsqueeze(0) # batch of size 1
@@ -157,8 +166,13 @@ if __name__ == "__main__":
                         type = str,
                         required = True,
                         help = "Path to the output PyTorch model.")
+    parser.add_argument("--classification_head",
+                        action = "store_true",
+                        help = "Whether to convert a final classification head.")
     args = parser.parse_args()
     convert_roberta_checkpoint_to_pytorch(
         args.roberta_checkpoint_path,
-        args.pytorch_dump_folder_path
+        args.pytorch_dump_folder_path,
+        args.classification_head
     )
+
diff --git a/pytorch_transformers/modeling_roberta.py b/pytorch_transformers/modeling_roberta.py
index 43f76989f4..43c9362b30 100644
--- a/pytorch_transformers/modeling_roberta.py
+++ b/pytorch_transformers/modeling_roberta.py
@@ -142,3 +142,60 @@ class RobertaLMHead(nn.Module):
         x = self.decoder(x) + self.bias
 
         return x
+
+
+
+class RobertaForSequenceClassification(BertPreTrainedModel):
+    """
+    Roberta Model with a classifier head on top.
+    """
+    config_class = RobertaConfig
+    pretrained_model_archive_map = ROBERTA_PRETRAINED_MODEL_ARCHIVE_MAP
+    base_model_prefix = "roberta"
+
+    def __init__(self, config):
+        super(RobertaForSequenceClassification, self).__init__(config)
+        self.num_labels = config.num_labels
+
+        self.roberta = RobertaModel(config)
+        self.classifier = RobertaClassificationHead(config)
+    
+    def forward(self, input_ids, token_type_ids=None, attention_mask=None, labels=None,
+                position_ids=None, head_mask=None):
+        outputs = self.roberta(input_ids, position_ids=position_ids, token_type_ids=token_type_ids,
+                            attention_mask=attention_mask, head_mask=head_mask)
+        sequence_output = outputs[0]
+        logits = self.classifier(sequence_output)
+
+        outputs = (logits,) + outputs[2:]
+        if labels is not None:
+            if self.num_labels == 1:
+                #  We are doing regression
+                loss_fct = MSELoss()
+                loss = loss_fct(logits.view(-1), labels.view(-1))
+            else:
+                loss_fct = CrossEntropyLoss()
+                loss = loss_fct(logits.view(-1, self.num_labels), labels.view(-1))
+            outputs = (loss,) + outputs
+
+        return outputs  # (loss), logits, (hidden_states), (attentions)
+
+
+
+class RobertaClassificationHead(nn.Module):
+    """Head for sentence-level classification tasks."""
+
+    def __init__(self, config):
+        super(RobertaClassificationHead, self).__init__()
+        self.dense = nn.Linear(config.hidden_size, config.hidden_size)
+        self.dropout = nn.Dropout(config.hidden_dropout_prob)
+        self.out_proj = nn.Linear(config.hidden_size, config.num_labels)
+
+    def forward(self, features, **kwargs):
+        x = features[:, 0, :]  # take <s> token (equiv. to [CLS])
+        x = self.dropout(x)
+        x = self.dense(x)
+        x = torch.tanh(x)
+        x = self.dropout(x)
+        x = self.out_proj(x)
+        return x
diff --git a/pytorch_transformers/tests/modeling_roberta_test.py b/pytorch_transformers/tests/modeling_roberta_test.py
index 273176b27a..36145466b9 100644
--- a/pytorch_transformers/tests/modeling_roberta_test.py
+++ b/pytorch_transformers/tests/modeling_roberta_test.py
@@ -179,5 +179,63 @@ class RobertaModelTest(CommonTestCases.CommonModelTester):
             shutil.rmtree(cache_dir)
             self.assertIsNotNone(model)
 
+
+
+class RobertaModelIntegrationTest(unittest.TestCase):
+
+    @pytest.mark.slow
+    def test_inference_masked_lm(self):
+        model = RobertaForMaskedLM.from_pretrained('roberta-base')
+        
+        input_ids = torch.tensor([[    0, 31414,   232,   328,   740,  1140, 12695,    69, 46078,  1588,   2]])
+        output = model(input_ids)[0]
+        expected_shape = torch.Size((1, 11, 50265))
+        self.assertEqual(
+            output.shape,
+            expected_shape
+        )
+        # compare the actual values for a slice.
+        expected_slice = torch.Tensor(
+            [[[33.8843, -4.3107, 22.7779],
+              [ 4.6533, -2.8099, 13.6252],
+              [ 1.8222, -3.6898,  8.8600]]]
+        )
+        self.assertTrue(
+            torch.allclose(output[:, :3, :3], expected_slice, atol=1e-3)
+        )
+
+    @pytest.mark.slow
+    def test_inference_no_head(self):
+        model = RobertaModel.from_pretrained('roberta-base')
+        
+        input_ids = torch.tensor([[    0, 31414,   232,   328,   740,  1140, 12695,    69, 46078,  1588,   2]])
+        output = model(input_ids)[0]
+        # compare the actual values for a slice.
+        expected_slice = torch.Tensor(
+            [[[-0.0231,  0.0782,  0.0074],
+              [-0.1854,  0.0539, -0.0174],
+              [ 0.0548,  0.0799,  0.1687]]]
+        )
+        self.assertTrue(
+            torch.allclose(output[:, :3, :3], expected_slice, atol=1e-3)
+        )
+
+    @pytest.mark.slow
+    def test_inference_classification_head(self):
+        model = RobertaForSequenceClassification.from_pretrained('roberta-large-mnli')
+        
+        input_ids = torch.tensor([[    0, 31414,   232,   328,   740,  1140, 12695,    69, 46078,  1588,   2]])
+        output = model(input_ids)[0]
+        expected_shape = torch.Size((1, 3))
+        self.assertEqual(
+            output.shape,
+            expected_shape
+        )
+        expected_tensor = torch.Tensor([[-0.9469,  0.3913,  0.5118]])
+        self.assertTrue(
+            torch.allclose(output, expected_tensor, atol=1e-3)
+        )
+
+
 if __name__ == "__main__":
     unittest.main()

From e367ac469c27949854a08c5c5ba5b392c3fbcb0a Mon Sep 17 00:00:00 2001
From: Julien Chaumond <chaumond@gmail.com>
Date: Thu, 8 Aug 2019 11:26:11 -0400
Subject: [PATCH 072/200] [RoBERTa] Re-apply
 39d72bcc7b2c99c04b6f483f0d8e7bdff547d37c

cc @lysandrejik
---
 pytorch_transformers/convert_roberta_checkpoint_to_pytorch.py | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/pytorch_transformers/convert_roberta_checkpoint_to_pytorch.py b/pytorch_transformers/convert_roberta_checkpoint_to_pytorch.py
index 85ad5ad15b..e4e8fbb25d 100644
--- a/pytorch_transformers/convert_roberta_checkpoint_to_pytorch.py
+++ b/pytorch_transformers/convert_roberta_checkpoint_to_pytorch.py
@@ -132,7 +132,7 @@ def convert_roberta_checkpoint_to_pytorch(roberta_checkpoint_path, pytorch_dump_
         model.lm_head.layer_norm.weight = roberta.model.decoder.lm_head.layer_norm.weight
         model.lm_head.layer_norm.bias = roberta.model.decoder.lm_head.layer_norm.bias
         model.lm_head.layer_norm.variance_epsilon = roberta.model.decoder.lm_head.layer_norm.eps
-        model.lm_head.weight = roberta.model.decoder.lm_head.weight
+        model.lm_head.decoder.weight = roberta.model.decoder.lm_head.weight
         model.lm_head.bias = roberta.model.decoder.lm_head.bias
 
     # Let's check that we get the same results.

From 6c41a8f5dc5c630f31fda7b1617b701b40ea27d6 Mon Sep 17 00:00:00 2001
From: LysandreJik <lysandre.debut@reseau.eseo.fr>
Date: Thu, 8 Aug 2019 18:20:32 -0400
Subject: [PATCH 073/200] Encode and Decode are back in the superclass. They
 now handle sentence pairs special tokens.

---
 pytorch_transformers/__init__.py             |   3 +-
 pytorch_transformers/modeling_roberta.py     |   3 +-
 pytorch_transformers/tokenization_roberta.py | 108 +++++++------------
 pytorch_transformers/tokenization_utils.py   |  46 ++++++--
 4 files changed, 81 insertions(+), 79 deletions(-)

diff --git a/pytorch_transformers/__init__.py b/pytorch_transformers/__init__.py
index c4148e283c..38423de14b 100644
--- a/pytorch_transformers/__init__.py
+++ b/pytorch_transformers/__init__.py
@@ -7,7 +7,6 @@ from .tokenization_gpt2 import GPT2Tokenizer
 from .tokenization_xlnet import XLNetTokenizer, SPIECE_UNDERLINE
 from .tokenization_xlm import XLMTokenizer
 from .tokenization_roberta import RobertaTokenizer
-from .tokenization_utils import (PreTrainedTokenizer, clean_up_tokenization)
 
 from .tokenization_utils import (PreTrainedTokenizer)
 
@@ -39,7 +38,7 @@ from .modeling_xlm import (XLMConfig, XLMPreTrainedModel , XLMModel,
                            XLMWithLMHeadModel, XLMForSequenceClassification,
                            XLMForQuestionAnswering, XLM_PRETRAINED_CONFIG_ARCHIVE_MAP,
                            XLM_PRETRAINED_MODEL_ARCHIVE_MAP)
-from .modeling_roberta import (RobertaConfig, RobertaForMaskedLM, RobertaModel,
+from .modeling_roberta import (RobertaConfig, RobertaForMaskedLM, RobertaModel, RobertaForSequenceClassification,
                                ROBERTA_PRETRAINED_CONFIG_ARCHIVE_MAP, ROBERTA_PRETRAINED_MODEL_ARCHIVE_MAP)
 from .modeling_utils import (WEIGHTS_NAME, CONFIG_NAME, TF_WEIGHTS_NAME,
                           PretrainedConfig, PreTrainedModel, prune_layer, Conv1D)
diff --git a/pytorch_transformers/modeling_roberta.py b/pytorch_transformers/modeling_roberta.py
index 43c9362b30..6cd4bc2d35 100644
--- a/pytorch_transformers/modeling_roberta.py
+++ b/pytorch_transformers/modeling_roberta.py
@@ -23,7 +23,7 @@ import logging
 import torch
 import torch.nn as nn
 import torch.nn.functional as F
-from torch.nn import CrossEntropyLoss
+from torch.nn import CrossEntropyLoss, MSELoss
 
 from pytorch_transformers.modeling_bert import (BertConfig, BertEmbeddings,
                                                 BertLayerNorm, BertModel,
@@ -144,7 +144,6 @@ class RobertaLMHead(nn.Module):
         return x
 
 
-
 class RobertaForSequenceClassification(BertPreTrainedModel):
     """
     Roberta Model with a classifier head on top.
diff --git a/pytorch_transformers/tokenization_roberta.py b/pytorch_transformers/tokenization_roberta.py
index 7fa42bfb1c..4ec53a65b0 100644
--- a/pytorch_transformers/tokenization_roberta.py
+++ b/pytorch_transformers/tokenization_roberta.py
@@ -21,18 +21,19 @@ import logging
 import re
 from io import open
 import six
+import os
 
-from .tokenization_utils import PreTrainedTokenizer, clean_up_tokenization
+from .tokenization_utils import PreTrainedTokenizer
 from .tokenization_gpt2 import GPT2Tokenizer
 
 logger = logging.getLogger(__name__)
 
-VOCAB_FILES_NAMES = {
-    'vocab_file': 'dict.txt',
+DICT_FILES_NAMES = {
+    'dict_file': 'dict.txt',
 }
 
-PRETRAINED_VOCAB_FILES_MAP = {
-    'vocab_file':
+PRETRAINED_DICT_FILES_MAP = {
+    'dict_file':
         {
             'roberta-base': "https://s3.amazonaws.com/models.huggingface.co/bert/roberta-base-dict.txt",
             'roberta-large': "https://s3.amazonaws.com/models.huggingface.co/bert/roberta-base-dict.txt",
@@ -178,89 +179,62 @@ class RobertaTokenizer(PreTrainedTokenizer):
     RoBERTa tokenizer. Peculiarities:
         - GPT-2 tokenizer with a different integer mapping on top.
     """
-    vocab_files_names = VOCAB_FILES_NAMES
-    pretrained_vocab_files_map = PRETRAINED_VOCAB_FILES_MAP
+    vocab_files_names = DICT_FILES_NAMES
+    pretrained_vocab_files_map = PRETRAINED_DICT_FILES_MAP
     max_model_input_sizes = PRETRAINED_POSITIONAL_EMBEDDINGS_SIZES
 
-    def __init__(self, vocab_file,
-                 bos_token="<s>", eos_token="</s>", **kwargs):
-        super(RobertaTokenizer, self).__init__(cls_token=bos_token, sep_token=eos_token, eos_token=eos_token, **kwargs)
+    def __init__(self, dict_file, bpe_tokenizer=None, bos_token="<s>", eos_token="</s>", sep_token="</s>", cls_token="<s>",
+                 unk_token="<unk>", **kwargs):
+        super(RobertaTokenizer, self).__init__(cls_token=bos_token, sep_token=eos_token, eos_token=eos_token,
+                                               unk_token=unk_token, **kwargs)
 
-        self.gpt2_tokenizer = GPT2Tokenizer.from_pretrained('gpt2')
-        self.dictionary = Dictionary.load(vocab_file)
+        self.gpt2_tokenizer = GPT2Tokenizer.from_pretrained("gpt2") if bpe_tokenizer is None else bpe_tokenizer
+        self.dictionary = Dictionary.load(dict_file)
+
+    @property
+    def vocab_size(self):
+        return len(self.dictionary.indices)
 
     def _tokenize(self, text):
         """ Use GPT-2 Tokenizer """
         return self.gpt2_tokenizer._tokenize(text)
 
-    def encode(self, text, *args):
-        """ Converts a string in a sequence of ids (integer), using the tokenizer and vocabulary.
-        """
-        bpe_sentence = [self.cls_token] + \
-                       self.gpt2_tokenizer.convert_tokens_to_ids(self.tokenize(text)) + \
-                       [self.sep_token]
-
-        if len(args):
-            for additional_sentence in args:
-                bpe_sentence += [self.sep_token
-                                 ] + \
-                                self.gpt2_tokenizer.convert_tokens_to_ids(self.tokenize(additional_sentence)) + \
-                                [self.sep_token]
-
-        return self.dictionary.encode_line(' '.join([str(token) for token in bpe_sentence]), append_eos=False)
-
-    def decode(self, token_ids, skip_special_tokens=False, clean_up_tokenization_spaces=True):
-        """ Converts a sequence of ids (integer) in a string, using the tokenizer and vocabulary
-            with options to remove special tokens and clean up tokenization spaces.
-            Handles sentence pairs.
-        """
-        filtered_tokens = self.convert_ids_to_tokens(token_ids, skip_special_tokens=skip_special_tokens)
-
-        if any(isinstance(element, list) for element in filtered_tokens):
-            texts = []
-            for element in filtered_tokens:
-                text = self.convert_tokens_to_string(element)
-                if clean_up_tokenization_spaces:
-                    text = clean_up_tokenization(text)
-                    texts.append(text)
-            return texts
-        else:
-            text = self.convert_tokens_to_string(filtered_tokens)
-            if clean_up_tokenization_spaces:
-                text = clean_up_tokenization(text)
-            return text
-
     def _convert_token_to_id(self, token):
-        return self.dictionary.index(token)
+        if self.dictionary.index(token) != 3:
+            return self.dictionary.index(token)
+        return self.dictionary.index(str(self.gpt2_tokenizer.convert_tokens_to_ids(token)))
 
     def _convert_id_to_token(self, index):
         symbol = self.dictionary[index]
         try:
             idx = int(symbol)
             return self.gpt2_tokenizer._convert_id_to_token(idx)
-        except:
+        except ValueError:
             return symbol
 
     def convert_tokens_to_string(self, tokens):
         return self.gpt2_tokenizer.convert_tokens_to_string(tokens)
 
+    def convert_tokens_to_ids(self, tokens, no_sep_cls_tokens=False):
+        cls = [self._convert_token_to_id(self.cls_token)]
+        tokens = super().convert_tokens_to_ids(tokens)
+        sep = [self._convert_token_to_id(self.sep_token)]
+        return (cls + tokens + sep) if (isinstance(tokens, list) and not no_sep_cls_tokens) else tokens
+
     def convert_ids_to_tokens(self, ids, skip_special_tokens=False):
-        # Remove the first and last tokens which are cls and sep tokens
-        ids = ids[1:-1]
-        # If multi sentence, then split (multi sentence found by looking for two sequential sep tokens)
-        ids = [list(map(int, example.split(' '))) for example in ' '.join([str(id) for id in ids]).split(' 2 2 ')]
+        return super().convert_ids_to_tokens(ids, skip_special_tokens=skip_special_tokens)[1:-1]
 
-        if len(ids) == 1:
-            tokens = self.gpt2_tokenizer.convert_ids_to_tokens(list(map(lambda id: int(self.dictionary[id]), ids[0])))
-        else:
-            tokens = []
-            for example in ids:
-                tokens += [
-                    self.gpt2_tokenizer.convert_ids_to_tokens(list(map(lambda id: int(self.dictionary[id]), example)))]
-        return tokens
+    def save_vocabulary(self, save_directory):
+        """Save the tokenizer vocabulary and merge files to a directory."""
+        if not os.path.isdir(save_directory):
+            logger.error("Vocabulary path ({}) should be a directory".format(save_directory))
+            return
+        dict_file = os.path.join(save_directory, DICT_FILES_NAMES['dict_file'])
 
-    def convert_tokens_to_ids(self, tokens):
-        tokens = " ".join(str(x) for x in self.gpt2_tokenizer.convert_tokens_to_ids(tokens))
-        bpe_sentence = '<s> ' + tokens + ' </s>'
-        return self.dictionary.encode_line(bpe_sentence, append_eos=False)
+        with open(dict_file, 'w', encoding='utf-8') as f:
+            for i in range(self.dictionary.nspecial, len(self.dictionary.count)):
+                f.write(f"{list(self.dictionary.indices.keys())[i]} {self.dictionary.count[i]}\n")
 
+        vocab_files = self.gpt2_tokenizer.save_pretrained(save_directory)
+
+        return vocab_files + (dict_file,)
diff --git a/pytorch_transformers/tokenization_utils.py b/pytorch_transformers/tokenization_utils.py
index 2e75c83bfb..232ef1c35b 100644
--- a/pytorch_transformers/tokenization_utils.py
+++ b/pytorch_transformers/tokenization_utils.py
@@ -495,7 +495,7 @@ class PreTrainedTokenizer(object):
         """
         raise NotImplementedError
 
-    def convert_tokens_to_ids(self, tokens):
+    def convert_tokens_to_ids(self, tokens, **kwargs):
         """ Converts a single token, or a sequence of tokens, (str/unicode) in a single integer id
             (resp. a sequence of ids), using the vocabulary.
         """
@@ -520,12 +520,29 @@ class PreTrainedTokenizer(object):
         raise NotImplementedError
 
 
-    def encode(self, text):
+    def encode(self, *text, cls_token_at_end=False, double_sep_token=False, no_sep_cls_tokens=False):
         """ Converts a string in a sequence of ids (integer), using the tokenizer and vocabulary.
         
         Same doing ``self.convert_tokens_to_ids(self.tokenize(text))``.
         """
-        return self.convert_tokens_to_ids(self.tokenize(text))
+
+        if len(text) == 1:
+            return self.convert_tokens_to_ids(self.tokenize(text[0]), no_sep_cls_tokens=no_sep_cls_tokens)
+
+        if len(text) > 2:
+            logger.warning("Tokenization currently only supports sentence pairs. Ignoring every string following the "
+                           "initial two.")
+
+        first_sentence_tokens = [self._convert_token_to_id(token) for token in self.tokenize(text[0])]
+        second_sentence_tokens = [self._convert_token_to_id(token) for token in self.tokenize(text[1])]
+        sep = [self._convert_token_to_id(self.sep_token)]
+        cls = [self._convert_token_to_id(self.cls_token)]
+        n_sep_token = 2 if double_sep_token else 1
+
+        tokens = first_sentence_tokens + sep * n_sep_token + second_sentence_tokens + sep
+        tokens = (tokens + cls) if cls_token_at_end else (cls + tokens)
+
+        return tokens
 
 
     def convert_ids_to_tokens(self, ids, skip_special_tokens=False):
@@ -560,7 +577,8 @@ class PreTrainedTokenizer(object):
         """
         return ' '.join(self.convert_ids_to_tokens(tokens))
 
-    def decode(self, token_ids, skip_special_tokens=False, clean_up_tokenization_spaces=True):
+    def decode(self, token_ids, skip_special_tokens=False, clean_up_tokenization_spaces=True, cls_token_at_end=False,
+               double_sep_token=False):
         """ Converts a sequence of ids (integer) in a string, using the tokenizer and vocabulary
             with options to remove special tokens and clean up tokenization spaces.
 
@@ -568,9 +586,21 @@ class PreTrainedTokenizer(object):
         """
         filtered_tokens = self.convert_ids_to_tokens(token_ids, skip_special_tokens=skip_special_tokens)
         text = self.convert_tokens_to_string(filtered_tokens)
-        if clean_up_tokenization_spaces:
-            text = self.clean_up_tokenization(text)
-        return text
+
+        if self.sep_token is not None and self.sep_token in text:
+            text = text.replace(self.cls_token, self.sep_token)
+            split_text = list(filter(lambda sentence: len(sentence) > 0, text.split(self.sep_token)))
+            if clean_up_tokenization_spaces:
+                clean_text = [self.clean_up_tokenization(text) for text in split_text]
+                return clean_text
+            else:
+                return split_text
+        else:
+            if clean_up_tokenization_spaces:
+                clean_text = self.clean_up_tokenization(text)
+                return clean_text
+            else:
+                return text
 
     @property
     def special_tokens_map(self):
@@ -602,7 +632,7 @@ class PreTrainedTokenizer(object):
             class attributes (cls_token, unk_token...).
         """
         all_toks = self.all_special_tokens
-        all_ids = list(self.convert_tokens_to_ids(t) for t in all_toks)
+        all_ids = list(self._convert_token_to_id(t) for t in all_toks)
         return all_ids
 
     @staticmethod

From fbd746bd065a9aaacd1ef25840cdc9ec957e8cac Mon Sep 17 00:00:00 2001
From: LysandreJik <lysandre.debut@reseau.eseo.fr>
Date: Thu, 8 Aug 2019 18:21:34 -0400
Subject: [PATCH 074/200] Updated test architecture

---
 .../tests/modeling_roberta_test.py            | 43 +++++++++++-
 .../tests/tokenization_roberta_test.py        | 70 +++++++++++++------
 .../tests/tokenization_tests_commons.py       |  5 +-
 3 files changed, 91 insertions(+), 27 deletions(-)

diff --git a/pytorch_transformers/tests/modeling_roberta_test.py b/pytorch_transformers/tests/modeling_roberta_test.py
index 36145466b9..e0455d8508 100644
--- a/pytorch_transformers/tests/modeling_roberta_test.py
+++ b/pytorch_transformers/tests/modeling_roberta_test.py
@@ -19,8 +19,9 @@ from __future__ import print_function
 import unittest
 import shutil
 import pytest
+import torch
 
-from pytorch_transformers import (RobertaConfig, RobertaModel, RobertaForMaskedLM)
+from pytorch_transformers import (RobertaConfig, RobertaModel, RobertaForMaskedLM, RobertaForSequenceClassification)
 from pytorch_transformers.modeling_roberta import ROBERTA_PRETRAINED_MODEL_ARCHIVE_MAP
 
 from .modeling_common_test import (CommonTestCases, ConfigTester, ids_tensor)
@@ -156,6 +157,42 @@ class RobertaModelTest(CommonTestCases.CommonModelTester):
             inputs_dict = {'input_ids': input_ids, 'token_type_ids': token_type_ids, 'attention_mask': input_mask}
             return config, inputs_dict
 
+        def test_inference_masked_lm(self):
+            model = RobertaForMaskedLM.from_pretrained('roberta-base')
+
+            input_ids = torch.tensor([[0, 31414, 232, 328, 740, 1140, 12695, 69, 46078, 1588, 2]])
+            output = model(input_ids)[0]
+            expected_shape = torch.Size((1, 11, 50265))
+            self.assertEqual(
+                output.shape,
+                expected_shape
+            )
+            # compare the actual values for a slice.
+            expected_slice = torch.Tensor(
+                [[[33.8843, -4.3107, 22.7779],
+                  [4.6533, -2.8099, 13.6252],
+                  [1.8222, -3.6898, 8.8600]]]
+            )
+            self.assertTrue(
+                torch.allclose(output[:, :3, :3], expected_slice, atol=1e-3)
+            )
+
+        # @pytest.mark.slow
+        def test_inference_no_head(self):
+            model = RobertaModel.from_pretrained('roberta-base')
+
+            input_ids = torch.tensor([[0, 31414, 232, 328, 740, 1140, 12695, 69, 46078, 1588, 2]])
+            output = model(input_ids)[0]
+            # compare the actual values for a slice.
+            expected_slice = torch.Tensor(
+                [[[-0.0231, 0.0782, 0.0074],
+                  [-0.1854, 0.0539, -0.0174],
+                  [0.0548, 0.0799, 0.1687]]]
+            )
+            self.assertTrue(
+                torch.allclose(output[:, :3, :3], expected_slice, atol=1e-3)
+            )
+
     def setUp(self):
         self.model_tester = RobertaModelTest.RobertaModelTester(self)
         self.config_tester = ConfigTester(self, config_class=RobertaConfig, hidden_size=37)
@@ -183,7 +220,7 @@ class RobertaModelTest(CommonTestCases.CommonModelTester):
 
 class RobertaModelIntegrationTest(unittest.TestCase):
 
-    @pytest.mark.slow
+    # @pytest.mark.slow
     def test_inference_masked_lm(self):
         model = RobertaForMaskedLM.from_pretrained('roberta-base')
         
@@ -204,7 +241,7 @@ class RobertaModelIntegrationTest(unittest.TestCase):
             torch.allclose(output[:, :3, :3], expected_slice, atol=1e-3)
         )
 
-    @pytest.mark.slow
+    # @pytest.mark.slow
     def test_inference_no_head(self):
         model = RobertaModel.from_pretrained('roberta-base')
         
diff --git a/pytorch_transformers/tests/tokenization_roberta_test.py b/pytorch_transformers/tests/tokenization_roberta_test.py
index 60df18ae2b..fbb3f8381d 100644
--- a/pytorch_transformers/tests/tokenization_roberta_test.py
+++ b/pytorch_transformers/tests/tokenization_roberta_test.py
@@ -15,42 +15,68 @@
 from __future__ import absolute_import, division, print_function, unicode_literals
 
 import os
+import json
 import unittest
 
-from pytorch_transformers.tokenization_roberta import RobertaTokenizer, VOCAB_FILES_NAMES
-from .tokenization_tests_commons import create_and_check_tokenizer_commons, TemporaryDirectory
+from pytorch_transformers.tokenization_roberta import RobertaTokenizer, DICT_FILES_NAMES
+from pytorch_transformers.tokenization_gpt2 import GPT2Tokenizer, VOCAB_FILES_NAMES
+from .tokenization_tests_commons import CommonTestCases
 
 
-class RobertaTokenizationTest(unittest.TestCase):
+class RobertaTokenizationTest(CommonTestCases.CommonTokenizerTester):
+    tokenizer_class = RobertaTokenizer
 
-    def test_full_tokenizer(self):
-        """ Adapted from Sennrich et al. 2015 and https://github.com/rsennrich/subword-nmt """
+    def setUp(self):
+        super(RobertaTokenizationTest, self).setUp()
+
+        # Adapted from Sennrich et al. 2015 and https://github.com/rsennrich/subword-nmt
         vocab = ["l", "o", "w", "e", "r", "s", "t", "i", "d", "n",
                  "lo", "low", "er",
                  "low", "lowest", "newer", "wider", "<unk>"]
         vocab_tokens = dict(zip(vocab, range(len(vocab))))
-        special_tokens_map = {"unk_token": "<unk>"}
+        merges = ["#version: 0.2", "l o", "lo w", "e r", ""]
+        self.special_tokens_map = {"unk_token": "<unk>"}
 
-        with TemporaryDirectory() as tmpdirname:
-            vocab_file = os.path.join(tmpdirname, VOCAB_FILES_NAMES['vocab_file'])
-            with open(vocab_file, "w") as fp:
-                [fp.write(f"{vocab} {index}\n") for index, vocab in enumerate(vocab_tokens)]
+        self.vocab_file = os.path.join(self.tmpdirname, VOCAB_FILES_NAMES['vocab_file'])
+        self.merges_file = os.path.join(self.tmpdirname, VOCAB_FILES_NAMES['merges_file'])
+        with open(self.vocab_file, "w") as fp:
+            fp.write(json.dumps(vocab_tokens))
+        with open(self.merges_file, "w") as fp:
+            fp.write("\n".join(merges))
 
-            input_text = u"lower newer"
-            output_text = u"lower<unk>newer"
+    def get_tokenizer(self):
+        bpe_tokenizer = GPT2Tokenizer.from_pretrained(self.tmpdirname, **self.special_tokens_map)
+        return RobertaTokenizer.from_pretrained("roberta-base", bpe_tokenizer=bpe_tokenizer)
 
-            create_and_check_tokenizer_commons(self, input_text, output_text, RobertaTokenizer, tmpdirname, **special_tokens_map)
+    def get_input_output_texts(self):
+        input_text = u"lower newer"
+        output_text = u"lower<unk>newer"
+        return input_text, output_text
 
-            tokenizer = RobertaTokenizer(vocab_file, **special_tokens_map)
-            text = "lower"
-            bpe_tokens = ["low", "er"]
-            tokens = tokenizer.tokenize(text)
-            self.assertListEqual(tokens, bpe_tokens)
+    def test_full_tokenizer(self):
+        tokenizer = self.get_tokenizer()
+        text = "lower"
+        bpe_tokens = ["low", "er"]
+        tokens = tokenizer.tokenize(text)
+        self.assertListEqual(tokens, bpe_tokens)
 
-            input_tokens = tokens + [tokenizer.unk_token]
-            input_bpe_tokens = [13, 12, 17]
-            self.assertListEqual(
-                tokenizer.convert_tokens_to_ids(input_tokens), input_bpe_tokens)
+        input_tokens = tokens + [tokenizer.unk_token]
+        input_bpe_tokens = [0, 4, 12, 176, 2]
+        tokenizer.convert_tokens_to_ids(input_tokens)
+        self.assertListEqual(
+            tokenizer.convert_tokens_to_ids(input_tokens), input_bpe_tokens)
+
+    def roberta_dict_integration_testing(self):
+        tokenizer = self.get_tokenizer()
+
+        self.assertListEqual(
+            tokenizer.encode('Hello world!'),
+            [0, 31414, 232, 328, 2]
+        )
+        self.assertListEqual(
+            tokenizer.encode('Hello world! cécé herlolip'),
+            [0, 31414, 232, 328, 740, 1140, 12695, 69, 46078, 1588, 2]
+        )
 
 
 if __name__ == '__main__':
diff --git a/pytorch_transformers/tests/tokenization_tests_commons.py b/pytorch_transformers/tests/tokenization_tests_commons.py
index ebcf6f48d8..e766a825a0 100644
--- a/pytorch_transformers/tests/tokenization_tests_commons.py
+++ b/pytorch_transformers/tests/tokenization_tests_commons.py
@@ -105,7 +105,7 @@ class CommonTestCases:
             self.assertEqual(added_toks, len(new_toks))
             self.assertEqual(all_size_2, all_size + len(new_toks))
 
-            tokens = tokenizer.encode("aaaaabbbbbb low cccccccccdddddddd l")
+            tokens = tokenizer.encode("aaaaabbbbbb low cccccccccdddddddd l", no_sep_cls_tokens=True)
             self.assertGreaterEqual(len(tokens), 4)
             self.assertGreater(tokens[0], tokenizer.vocab_size - 1)
             self.assertGreater(tokens[-2], tokenizer.vocab_size - 1)
@@ -121,7 +121,8 @@ class CommonTestCases:
             self.assertEqual(added_toks_2, len(new_toks_2))
             self.assertEqual(all_size_3, all_size_2 + len(new_toks_2))
 
-            tokens = tokenizer.encode(">>>>|||<||<<|<< aaaaabbbbbb low cccccccccdddddddd <<<<<|||>|>>>>|> l")
+            tokens = tokenizer.encode(">>>>|||<||<<|<< aaaaabbbbbb low cccccccccdddddddd <<<<<|||>|>>>>|> l",
+                                      no_sep_cls_tokens=True)
 
             self.assertGreaterEqual(len(tokens), 6)
             self.assertGreater(tokens[0], tokenizer.vocab_size - 1)

From 3566d2791905269b75014e8ea9db322c86f980b2 Mon Sep 17 00:00:00 2001
From: LysandreJik <lysandre.debut@reseau.eseo.fr>
Date: Thu, 8 Aug 2019 19:04:34 -0400
Subject: [PATCH 075/200] Clarified PreTrainedModel.from_pretrained warning
 messages in documentation.

---
 pytorch_transformers/modeling_utils.py | 5 +++++
 1 file changed, 5 insertions(+)

diff --git a/pytorch_transformers/modeling_utils.py b/pytorch_transformers/modeling_utils.py
index 5a753392fa..35f82e324f 100644
--- a/pytorch_transformers/modeling_utils.py
+++ b/pytorch_transformers/modeling_utils.py
@@ -365,6 +365,11 @@ class PreTrainedModel(nn.Module):
         The model is set in evaluation mode by default using ``model.eval()`` (Dropout modules are deactivated)
         To train the model, you should first set it back in training mode with ``model.train()``
 
+        The warning ``Weights from XXX not initialized from pretrained model`` means that the weights of XXX do not come pre-trained with the rest of the model.
+        It is up to you to train those weights with a downstream fine-tuning task.
+
+        The warning ``Weights from XXX not used in YYY`` means that the layer XXX is not used by YYY, therefore those weights are discarded.
+
         Parameters:
             pretrained_model_name_or_path: either:
 

From 14e970c271f8c1f21d46aaadf7e89852d329d3a8 Mon Sep 17 00:00:00 2001
From: LysandreJik <lysandre.debut@reseau.eseo.fr>
Date: Fri, 9 Aug 2019 15:01:38 -0400
Subject: [PATCH 076/200] Tokenization encode/decode class-based sequence
 handling

---
 .../tests/tokenization_tests_commons.py       |  5 ++-
 pytorch_transformers/tokenization_bert.py     |  8 +++++
 pytorch_transformers/tokenization_utils.py    | 35 ++++++++++---------
 pytorch_transformers/tokenization_xlm.py      |  8 +++++
 pytorch_transformers/tokenization_xlnet.py    | 10 ++++++
 5 files changed, 47 insertions(+), 19 deletions(-)

diff --git a/pytorch_transformers/tests/tokenization_tests_commons.py b/pytorch_transformers/tests/tokenization_tests_commons.py
index e766a825a0..ebcf6f48d8 100644
--- a/pytorch_transformers/tests/tokenization_tests_commons.py
+++ b/pytorch_transformers/tests/tokenization_tests_commons.py
@@ -105,7 +105,7 @@ class CommonTestCases:
             self.assertEqual(added_toks, len(new_toks))
             self.assertEqual(all_size_2, all_size + len(new_toks))
 
-            tokens = tokenizer.encode("aaaaabbbbbb low cccccccccdddddddd l", no_sep_cls_tokens=True)
+            tokens = tokenizer.encode("aaaaabbbbbb low cccccccccdddddddd l")
             self.assertGreaterEqual(len(tokens), 4)
             self.assertGreater(tokens[0], tokenizer.vocab_size - 1)
             self.assertGreater(tokens[-2], tokenizer.vocab_size - 1)
@@ -121,8 +121,7 @@ class CommonTestCases:
             self.assertEqual(added_toks_2, len(new_toks_2))
             self.assertEqual(all_size_3, all_size_2 + len(new_toks_2))
 
-            tokens = tokenizer.encode(">>>>|||<||<<|<< aaaaabbbbbb low cccccccccdddddddd <<<<<|||>|>>>>|> l",
-                                      no_sep_cls_tokens=True)
+            tokens = tokenizer.encode(">>>>|||<||<<|<< aaaaabbbbbb low cccccccccdddddddd <<<<<|||>|>>>>|> l")
 
             self.assertGreaterEqual(len(tokens), 6)
             self.assertGreater(tokens[0], tokenizer.vocab_size - 1)
diff --git a/pytorch_transformers/tokenization_bert.py b/pytorch_transformers/tokenization_bert.py
index 9bf18a97d7..9f4f00a300 100644
--- a/pytorch_transformers/tokenization_bert.py
+++ b/pytorch_transformers/tokenization_bert.py
@@ -166,6 +166,14 @@ class BertTokenizer(PreTrainedTokenizer):
         out_string = ' '.join(tokens).replace(' ##', '').strip()
         return out_string
 
+    def add_special_tokens_single_sentence(self, token_ids):
+        return [self._convert_token_to_id(self.cls_token)] + token_ids + [self._convert_token_to_id(self.sep_token)]
+
+    def add_special_tokens_sentences_pair(self, *token_ids):
+        sep = [self._convert_token_to_id(self.sep_token)]
+        cls = [self._convert_token_to_id(self.cls_token)]
+        return cls + token_ids[0] + sep + token_ids[1] + sep
+
     def save_vocabulary(self, vocab_path):
         """Save the tokenizer vocabulary to a directory or file."""
         index = 0
diff --git a/pytorch_transformers/tokenization_utils.py b/pytorch_transformers/tokenization_utils.py
index 232ef1c35b..a3581fe582 100644
--- a/pytorch_transformers/tokenization_utils.py
+++ b/pytorch_transformers/tokenization_utils.py
@@ -495,7 +495,7 @@ class PreTrainedTokenizer(object):
         """
         raise NotImplementedError
 
-    def convert_tokens_to_ids(self, tokens, **kwargs):
+    def convert_tokens_to_ids(self, tokens):
         """ Converts a single token, or a sequence of tokens, (str/unicode) in a single integer id
             (resp. a sequence of ids), using the vocabulary.
         """
@@ -519,31 +519,35 @@ class PreTrainedTokenizer(object):
     def _convert_token_to_id(self, token):
         raise NotImplementedError
 
-
-    def encode(self, *text, cls_token_at_end=False, double_sep_token=False, no_sep_cls_tokens=False):
+    def encode(self, text, add_special_tokens=False, *sequences):
         """ Converts a string in a sequence of ids (integer), using the tokenizer and vocabulary.
         
         Same doing ``self.convert_tokens_to_ids(self.tokenize(text))``.
         """
 
-        if len(text) == 1:
-            return self.convert_tokens_to_ids(self.tokenize(text[0]), no_sep_cls_tokens=no_sep_cls_tokens)
+        if len(sequences) == 0:
+            if add_special_tokens:
+                return self.add_special_tokens_single_sentence(self.convert_tokens_to_ids(self.tokenize(text)))
+            else:
+                return self.convert_tokens_to_ids(self.tokenize(text))
 
-        if len(text) > 2:
+        if len(sequences) > 1:
             logger.warning("Tokenization currently only supports sentence pairs. Ignoring every string following the "
                            "initial two.")
 
-        first_sentence_tokens = [self._convert_token_to_id(token) for token in self.tokenize(text[0])]
-        second_sentence_tokens = [self._convert_token_to_id(token) for token in self.tokenize(text[1])]
-        sep = [self._convert_token_to_id(self.sep_token)]
-        cls = [self._convert_token_to_id(self.cls_token)]
-        n_sep_token = 2 if double_sep_token else 1
+        first_sentence_tokens = [self._convert_token_to_id(token) for token in self.tokenize(text)]
+        second_sentence_tokens = [self._convert_token_to_id(token) for token in self.tokenize(sequences[0])]
 
-        tokens = first_sentence_tokens + sep * n_sep_token + second_sentence_tokens + sep
-        tokens = (tokens + cls) if cls_token_at_end else (cls + tokens)
+        if add_special_tokens:
+            return self.add_special_tokens_sentences_pair(first_sentence_tokens, second_sentence_tokens)
+        else:
+            return first_sentence_tokens, second_sentence_tokens
 
-        return tokens
+    def add_special_tokens_single_sentence(self, token_ids):
+        raise NotImplementedError
 
+    def add_special_tokens_sentences_pair(self, *token_ids):
+        raise NotImplementedError
 
     def convert_ids_to_tokens(self, ids, skip_special_tokens=False):
         """ Converts a single index or a sequence of indices (integers) in a token "
@@ -577,8 +581,7 @@ class PreTrainedTokenizer(object):
         """
         return ' '.join(self.convert_ids_to_tokens(tokens))
 
-    def decode(self, token_ids, skip_special_tokens=False, clean_up_tokenization_spaces=True, cls_token_at_end=False,
-               double_sep_token=False):
+    def decode(self, token_ids, skip_special_tokens=False, clean_up_tokenization_spaces=True):
         """ Converts a sequence of ids (integer) in a string, using the tokenizer and vocabulary
             with options to remove special tokens and clean up tokenization spaces.
 
diff --git a/pytorch_transformers/tokenization_xlm.py b/pytorch_transformers/tokenization_xlm.py
index 899f6b884f..b0b8f1d78d 100644
--- a/pytorch_transformers/tokenization_xlm.py
+++ b/pytorch_transformers/tokenization_xlm.py
@@ -214,6 +214,14 @@ class XLMTokenizer(PreTrainedTokenizer):
         out_string = ''.join(tokens).replace('</w>', ' ').strip()
         return out_string
 
+    def add_special_tokens_single_sentence(self, token_ids):
+        return [self._convert_token_to_id(self.cls_token)] + token_ids + [self._convert_token_to_id(self.sep_token)]
+
+    def add_special_tokens_sentences_pair(self, *token_ids):
+        sep = [self._convert_token_to_id(self.sep_token)]
+        cls = [self._convert_token_to_id(self.cls_token)]
+        return cls + token_ids[0] + sep + token_ids[1] + sep
+
     def save_vocabulary(self, save_directory):
         """Save the tokenizer vocabulary and merge files to a directory."""
         if not os.path.isdir(save_directory):
diff --git a/pytorch_transformers/tokenization_xlnet.py b/pytorch_transformers/tokenization_xlnet.py
index 919ac97bce..42473da860 100644
--- a/pytorch_transformers/tokenization_xlnet.py
+++ b/pytorch_transformers/tokenization_xlnet.py
@@ -177,6 +177,16 @@ class XLNetTokenizer(PreTrainedTokenizer):
         out_string = ''.join(tokens).replace(SPIECE_UNDERLINE, ' ').strip()
         return out_string
 
+    def add_special_tokens_single_sentence(self, token_ids):
+        logger.warning("No method was defined for special tokens and single sentence streams in XLNet. "
+                       "Returning token_ids")
+        return token_ids
+
+    def add_special_tokens_sentences_pair(self, *token_ids):
+        sep = [self._convert_token_to_id(self.sep_token)]
+        cls = [self._convert_token_to_id(self.cls_token)]
+        return token_ids[0] + sep + token_ids[1] + sep + cls
+
     def save_vocabulary(self, save_directory):
         """ Save the sentencepiece vocabulary (copy original file) and special tokens file
             to a directory.

From 75d5f98fd2a154bb5bfc0879c4a6e389c6789be5 Mon Sep 17 00:00:00 2001
From: LysandreJik <lysandre.debut@reseau.eseo.fr>
Date: Fri, 9 Aug 2019 15:02:13 -0400
Subject: [PATCH 077/200] Roberta tokenization + fixed tests (py3 + py2).

---
 .../tests/modeling_roberta_test.py            |  40 +--
 .../tests/tokenization_roberta_test.py        |  11 +-
 pytorch_transformers/tokenization_roberta.py  | 311 ++++++++----------
 3 files changed, 138 insertions(+), 224 deletions(-)

diff --git a/pytorch_transformers/tests/modeling_roberta_test.py b/pytorch_transformers/tests/modeling_roberta_test.py
index e0455d8508..94035e9667 100644
--- a/pytorch_transformers/tests/modeling_roberta_test.py
+++ b/pytorch_transformers/tests/modeling_roberta_test.py
@@ -157,42 +157,6 @@ class RobertaModelTest(CommonTestCases.CommonModelTester):
             inputs_dict = {'input_ids': input_ids, 'token_type_ids': token_type_ids, 'attention_mask': input_mask}
             return config, inputs_dict
 
-        def test_inference_masked_lm(self):
-            model = RobertaForMaskedLM.from_pretrained('roberta-base')
-
-            input_ids = torch.tensor([[0, 31414, 232, 328, 740, 1140, 12695, 69, 46078, 1588, 2]])
-            output = model(input_ids)[0]
-            expected_shape = torch.Size((1, 11, 50265))
-            self.assertEqual(
-                output.shape,
-                expected_shape
-            )
-            # compare the actual values for a slice.
-            expected_slice = torch.Tensor(
-                [[[33.8843, -4.3107, 22.7779],
-                  [4.6533, -2.8099, 13.6252],
-                  [1.8222, -3.6898, 8.8600]]]
-            )
-            self.assertTrue(
-                torch.allclose(output[:, :3, :3], expected_slice, atol=1e-3)
-            )
-
-        # @pytest.mark.slow
-        def test_inference_no_head(self):
-            model = RobertaModel.from_pretrained('roberta-base')
-
-            input_ids = torch.tensor([[0, 31414, 232, 328, 740, 1140, 12695, 69, 46078, 1588, 2]])
-            output = model(input_ids)[0]
-            # compare the actual values for a slice.
-            expected_slice = torch.Tensor(
-                [[[-0.0231, 0.0782, 0.0074],
-                  [-0.1854, 0.0539, -0.0174],
-                  [0.0548, 0.0799, 0.1687]]]
-            )
-            self.assertTrue(
-                torch.allclose(output[:, :3, :3], expected_slice, atol=1e-3)
-            )
-
     def setUp(self):
         self.model_tester = RobertaModelTest.RobertaModelTester(self)
         self.config_tester = ConfigTester(self, config_class=RobertaConfig, hidden_size=37)
@@ -220,7 +184,7 @@ class RobertaModelTest(CommonTestCases.CommonModelTester):
 
 class RobertaModelIntegrationTest(unittest.TestCase):
 
-    # @pytest.mark.slow
+    @pytest.mark.slow
     def test_inference_masked_lm(self):
         model = RobertaForMaskedLM.from_pretrained('roberta-base')
         
@@ -241,7 +205,7 @@ class RobertaModelIntegrationTest(unittest.TestCase):
             torch.allclose(output[:, :3, :3], expected_slice, atol=1e-3)
         )
 
-    # @pytest.mark.slow
+    @pytest.mark.slow
     def test_inference_no_head(self):
         model = RobertaModel.from_pretrained('roberta-base')
         
diff --git a/pytorch_transformers/tests/tokenization_roberta_test.py b/pytorch_transformers/tests/tokenization_roberta_test.py
index fbb3f8381d..daefea0fa7 100644
--- a/pytorch_transformers/tests/tokenization_roberta_test.py
+++ b/pytorch_transformers/tests/tokenization_roberta_test.py
@@ -18,8 +18,7 @@ import os
 import json
 import unittest
 
-from pytorch_transformers.tokenization_roberta import RobertaTokenizer, DICT_FILES_NAMES
-from pytorch_transformers.tokenization_gpt2 import GPT2Tokenizer, VOCAB_FILES_NAMES
+from pytorch_transformers.tokenization_roberta import RobertaTokenizer, VOCAB_FILES_NAMES
 from .tokenization_tests_commons import CommonTestCases
 
 
@@ -45,8 +44,7 @@ class RobertaTokenizationTest(CommonTestCases.CommonTokenizerTester):
             fp.write("\n".join(merges))
 
     def get_tokenizer(self):
-        bpe_tokenizer = GPT2Tokenizer.from_pretrained(self.tmpdirname, **self.special_tokens_map)
-        return RobertaTokenizer.from_pretrained("roberta-base", bpe_tokenizer=bpe_tokenizer)
+        return RobertaTokenizer.from_pretrained(self.tmpdirname, **self.special_tokens_map)
 
     def get_input_output_texts(self):
         input_text = u"lower newer"
@@ -54,15 +52,14 @@ class RobertaTokenizationTest(CommonTestCases.CommonTokenizerTester):
         return input_text, output_text
 
     def test_full_tokenizer(self):
-        tokenizer = self.get_tokenizer()
+        tokenizer = RobertaTokenizer(self.vocab_file, self.merges_file, **self.special_tokens_map)
         text = "lower"
         bpe_tokens = ["low", "er"]
         tokens = tokenizer.tokenize(text)
         self.assertListEqual(tokens, bpe_tokens)
 
         input_tokens = tokens + [tokenizer.unk_token]
-        input_bpe_tokens = [0, 4, 12, 176, 2]
-        tokenizer.convert_tokens_to_ids(input_tokens)
+        input_bpe_tokens = [13, 12, 17]
         self.assertListEqual(
             tokenizer.convert_tokens_to_ids(input_tokens), input_bpe_tokens)
 
diff --git a/pytorch_transformers/tokenization_roberta.py b/pytorch_transformers/tokenization_roberta.py
index 4ec53a65b0..b01b92653d 100644
--- a/pytorch_transformers/tokenization_roberta.py
+++ b/pytorch_transformers/tokenization_roberta.py
@@ -12,229 +12,182 @@
 # WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
 # See the License for the specific language governing permissions and
 # limitations under the License.
-"""Tokenization classes for RoBERTa."""
+"""Tokenization classes for OpenAI GPT."""
 from __future__ import (absolute_import, division, print_function,
                         unicode_literals)
 
+import sys
 import json
 import logging
-import re
-from io import open
-import six
 import os
+import regex as re
+from io import open
 
+from .tokenization_gpt2 import bytes_to_unicode, get_pairs
 from .tokenization_utils import PreTrainedTokenizer
-from .tokenization_gpt2 import GPT2Tokenizer
+
+try:
+    from functools import lru_cache
+except ImportError:
+    # Just a dummy decorator to get the checks to run on python2
+    # because honestly I don't want to support a byte-level unicode BPE tokenizer on python 2 right now.
+    def lru_cache():
+        return lambda func: func
 
 logger = logging.getLogger(__name__)
 
-DICT_FILES_NAMES = {
-    'dict_file': 'dict.txt',
+VOCAB_FILES_NAMES = {
+    'vocab_file': 'vocab.json',
+    'merges_file': 'merges.txt',
 }
 
-PRETRAINED_DICT_FILES_MAP = {
-    'dict_file':
-        {
-            'roberta-base': "https://s3.amazonaws.com/models.huggingface.co/bert/roberta-base-dict.txt",
-            'roberta-large': "https://s3.amazonaws.com/models.huggingface.co/bert/roberta-base-dict.txt",
-            'roberta-large-mnli': "https://s3.amazonaws.com/models.huggingface.co/bert/roberta-base-dict.txt",
-        },
+PRETRAINED_VOCAB_FILES_MAP = {
+    'vocab_file':
+    {
+        'roberta-base': "https://s3.amazonaws.com/models.huggingface.co/bert/roberta-base-vocab.json",
+        'roberta-large': "https://s3.amazonaws.com/models.huggingface.co/bert/roberta-large-vocab.json",
+        'roberta-large-mnli': "https://s3.amazonaws.com/models.huggingface.co/bert/roberta-large-mnli-vocab.json",
+    },
+    'merges_file':
+    {
+        'roberta-base': "https://s3.amazonaws.com/models.huggingface.co/bert/roberta-base-merges.txt",
+        'roberta-large': "https://s3.amazonaws.com/models.huggingface.co/bert/roberta-large-merges.txt",
+        'roberta-large-mnli': "https://s3.amazonaws.com/models.huggingface.co/bert/roberta-large-mnli-merges.txt",
+    },
 }
 
 PRETRAINED_POSITIONAL_EMBEDDINGS_SIZES = {
-    'roberta-base': 512,
-    'roberta-large': 512,
-    'roberta-large-mnli': 512,
+    'roberta-base': 1024,
+    'roberta-large': 1024,
+    'roberta-large-mnli': 1024,
 }
 
-SPACE_NORMALIZER = re.compile(r"\s+")
-
-def tokenize_line(line):
-    line = SPACE_NORMALIZER.sub(" ", line)
-    line = line.strip()
-    return line.split()
-
-
-class Dictionary(object):
-    """
-    A mapping from symbols to consecutive integers
-
-    From Facebook's fairseq.
-    """
-
-    def __init__(
-        self,
-        pad='<pad>',
-        eos='</s>',
-        unk='<unk>',
-        bos='<s>',
-        extra_special_symbols=None,
-    ):
-        self.unk_word, self.pad_word, self.eos_word = unk, pad, eos
-        self.symbols = []
-        self.count = []
-        self.indices = {}
-        self.bos_index = self.add_symbol(bos)
-        self.pad_index = self.add_symbol(pad)
-        self.eos_index = self.add_symbol(eos)
-        self.unk_index = self.add_symbol(unk)
-        if extra_special_symbols:
-            for s in extra_special_symbols:
-                self.add_symbol(s)
-        self.nspecial = len(self.symbols)
-
-    def __getitem__(self, idx):
-        if idx < len(self.symbols):
-            return self.symbols[idx]
-        return self.unk_word
-
-    def index(self, sym):
-        """Returns the index of the specified symbol"""
-        assert isinstance(sym, str)
-        if sym in self.indices:
-            return self.indices[sym]
-        return self.unk_index
-
-    def add_symbol(self, word, n=1):
-        """Adds a word to the dictionary"""
-        if word in self.indices:
-            idx = self.indices[word]
-            self.count[idx] = self.count[idx] + n
-            return idx
-        else:
-            idx = len(self.symbols)
-            self.indices[word] = idx
-            self.symbols.append(word)
-            self.count.append(n)
-            return idx
-
-    @classmethod
-    def load(cls, f, ignore_utf_errors=False):
-        """Loads the dictionary from a text file with the format:
-
-        ```
-        <symbol0> <count0>
-        <symbol1> <count1>
-        ...
-        ```
-        """
-        d = cls()
-        d.add_from_file(f, ignore_utf_errors)
-        return d
-
-    def add_from_file(self, f, ignore_utf_errors=False):
-        """
-        Loads a pre-existing dictionary from a text file and adds its symbols
-        to this instance.
-        """
-        if isinstance(f, six.string_types):
-            try:
-                if not ignore_utf_errors:
-                    with open(f, 'r', encoding='utf-8') as fd:
-                        self.add_from_file(fd)
-                else:
-                    with open(f, 'r', encoding='utf-8', errors='ignore') as fd:
-                        self.add_from_file(fd)
-            except FileNotFoundError as fnfe:
-                raise fnfe
-            except UnicodeError:
-                raise Exception("Incorrect encoding detected in {}, please "
-                                "rebuild the dataset".format(f))
-            return
-
-        lines = f.read().splitlines()
-        for line in lines:
-            idx = line.rfind(' ')
-            if idx == -1:
-                raise ValueError("Incorrect dictionary format, expected '<token> <cnt>'")
-            word = line[:idx]
-            count = int(line[idx + 1:])
-            self.indices[word] = len(self.symbols)
-            self.symbols.append(word)
-            self.count.append(count)
-
-    def encode_line(self, line, line_tokenizer=tokenize_line, add_if_not_exist=True,
-                    consumer=None, append_eos=True, reverse_order=False):
-        words = line_tokenizer(line)
-        if reverse_order:
-            words = list(reversed(words))
-        nwords = len(words)
-        ids = [0] * (nwords + 1 if append_eos else nwords)
-
-        for i, word in enumerate(words):
-            if add_if_not_exist:
-                idx = self.add_symbol(word)
-            else:
-                idx = self.index(word)
-            if consumer is not None:
-                consumer(word, idx)
-            ids[i] = idx
-        if append_eos:
-            ids[nwords] = self.eos_index
-        return ids
-
 
 class RobertaTokenizer(PreTrainedTokenizer):
     """
-    RoBERTa tokenizer. Peculiarities:
-        - GPT-2 tokenizer with a different integer mapping on top.
+    GPT-2 BPE tokenizer. Peculiarities:
+        - Byte-level BPE
     """
-    vocab_files_names = DICT_FILES_NAMES
-    pretrained_vocab_files_map = PRETRAINED_DICT_FILES_MAP
+    vocab_files_names = VOCAB_FILES_NAMES
+    pretrained_vocab_files_map = PRETRAINED_VOCAB_FILES_MAP
     max_model_input_sizes = PRETRAINED_POSITIONAL_EMBEDDINGS_SIZES
 
-    def __init__(self, dict_file, bpe_tokenizer=None, bos_token="<s>", eos_token="</s>", sep_token="</s>", cls_token="<s>",
-                 unk_token="<unk>", **kwargs):
-        super(RobertaTokenizer, self).__init__(cls_token=bos_token, sep_token=eos_token, eos_token=eos_token,
-                                               unk_token=unk_token, **kwargs)
+    def __init__(self, vocab_file, merges_file, errors='replace', bos_token="<s>", eos_token="</s>", sep_token="</s>",
+                 cls_token="<s>", unk_token="<unk>", **kwargs):
+        super(RobertaTokenizer, self).__init__(bos_token=bos_token, eos_token=eos_token, unk_token=unk_token,
+                                               sep_token=sep_token, cls_token=cls_token, **kwargs)
 
-        self.gpt2_tokenizer = GPT2Tokenizer.from_pretrained("gpt2") if bpe_tokenizer is None else bpe_tokenizer
-        self.dictionary = Dictionary.load(dict_file)
+        self.encoder = json.load(open(vocab_file, encoding="utf-8"))
+        self.decoder = {v: k for k, v in self.encoder.items()}
+        self.errors = errors  # how to handle errors in decoding
+        self.byte_encoder = bytes_to_unicode()
+        self.byte_decoder = {v: k for k, v in self.byte_encoder.items()}
+        bpe_data = open(merges_file, encoding='utf-8').read().split('\n')[1:-1]
+        bpe_merges = [tuple(merge.split()) for merge in bpe_data]
+        self.bpe_ranks = dict(zip(bpe_merges, range(len(bpe_merges))))
+        self.cache = {}
+
+        # Should haved added re.IGNORECASE so BPE merges can happen for capitalized versions of contractions
+        self.pat = re.compile(r"""'s|'t|'re|'ve|'m|'ll|'d| ?\p{L}+| ?\p{N}+| ?[^\s\p{L}\p{N}]+|\s+(?!\S)|\s+""")
 
     @property
     def vocab_size(self):
-        return len(self.dictionary.indices)
+        return len(self.encoder)
+
+    def bpe(self, token):
+        if token in self.cache:
+            return self.cache[token]
+        word = tuple(token)
+        pairs = get_pairs(word)
+
+        if not pairs:
+            return token
+
+        while True:
+            bigram = min(pairs, key = lambda pair: self.bpe_ranks.get(pair, float('inf')))
+            if bigram not in self.bpe_ranks:
+                break
+            first, second = bigram
+            new_word = []
+            i = 0
+            while i < len(word):
+                try:
+                    j = word.index(first, i)
+                    new_word.extend(word[i:j])
+                    i = j
+                except:
+                    new_word.extend(word[i:])
+                    break
+
+                if word[i] == first and i < len(word)-1 and word[i+1] == second:
+                    new_word.append(first+second)
+                    i += 2
+                else:
+                    new_word.append(word[i])
+                    i += 1
+            new_word = tuple(new_word)
+            word = new_word
+            if len(word) == 1:
+                break
+            else:
+                pairs = get_pairs(word)
+        word = ' '.join(word)
+        self.cache[token] = word
+        return word
 
     def _tokenize(self, text):
-        """ Use GPT-2 Tokenizer """
-        return self.gpt2_tokenizer._tokenize(text)
+        """ Tokenize a string. """
+        bpe_tokens = []
+        for token in re.findall(self.pat, text):
+            if sys.version_info[0] == 2:
+                token = ''.join(self.byte_encoder[ord(b)] for b in token)
+            else:
+                token = ''.join(self.byte_encoder[b] for b in token.encode('utf-8'))
+            bpe_tokens.extend(bpe_token for bpe_token in self.bpe(token).split(' '))
+        return bpe_tokens
 
     def _convert_token_to_id(self, token):
-        if self.dictionary.index(token) != 3:
-            return self.dictionary.index(token)
-        return self.dictionary.index(str(self.gpt2_tokenizer.convert_tokens_to_ids(token)))
+        """ Converts a token (str/unicode) in an id using the vocab. """
+        return self.encoder.get(token, self.encoder.get(self.unk_token))
 
     def _convert_id_to_token(self, index):
-        symbol = self.dictionary[index]
-        try:
-            idx = int(symbol)
-            return self.gpt2_tokenizer._convert_id_to_token(idx)
-        except ValueError:
-            return symbol
+        """Converts an index (integer) in a token (string/unicode) using the vocab."""
+        return self.decoder.get(index)
 
     def convert_tokens_to_string(self, tokens):
-        return self.gpt2_tokenizer.convert_tokens_to_string(tokens)
+        """ Converts a sequence of tokens (string) in a single string. """
+        text = ''.join(tokens)
+        text = bytearray([self.byte_decoder[c] for c in text]).decode('utf-8', errors=self.errors)
+        return text
 
-    def convert_tokens_to_ids(self, tokens, no_sep_cls_tokens=False):
-        cls = [self._convert_token_to_id(self.cls_token)]
-        tokens = super().convert_tokens_to_ids(tokens)
+    def add_special_tokens_single_sentence(self, token_ids):
+        return [self._convert_token_to_id(self.cls_token)] + token_ids + [self._convert_token_to_id(self.sep_token)]
+
+    def add_special_tokens_sentences_pair(self, *token_ids):
         sep = [self._convert_token_to_id(self.sep_token)]
-        return (cls + tokens + sep) if (isinstance(tokens, list) and not no_sep_cls_tokens) else tokens
-
-    def convert_ids_to_tokens(self, ids, skip_special_tokens=False):
-        return super().convert_ids_to_tokens(ids, skip_special_tokens=skip_special_tokens)[1:-1]
+        cls = [self._convert_token_to_id(self.cls_token)]
+        return cls + token_ids[0] + sep + sep + token_ids[1] + sep
 
     def save_vocabulary(self, save_directory):
         """Save the tokenizer vocabulary and merge files to a directory."""
         if not os.path.isdir(save_directory):
             logger.error("Vocabulary path ({}) should be a directory".format(save_directory))
             return
-        dict_file = os.path.join(save_directory, DICT_FILES_NAMES['dict_file'])
+        vocab_file = os.path.join(save_directory, VOCAB_FILES_NAMES['vocab_file'])
+        merge_file = os.path.join(save_directory, VOCAB_FILES_NAMES['merges_file'])
 
-        with open(dict_file, 'w', encoding='utf-8') as f:
-            for i in range(self.dictionary.nspecial, len(self.dictionary.count)):
-                f.write(f"{list(self.dictionary.indices.keys())[i]} {self.dictionary.count[i]}\n")
+        with open(vocab_file, 'w', encoding='utf-8') as f:
+            f.write(json.dumps(self.encoder, ensure_ascii=False))
 
-        vocab_files = self.gpt2_tokenizer.save_pretrained(save_directory)
+        index = 0
+        with open(merge_file, "w", encoding="utf-8") as writer:
+            writer.write(u'#version: 0.2\n')
+            for bpe_tokens, token_index in sorted(self.bpe_ranks.items(), key=lambda kv: kv[1]):
+                if index != token_index:
+                    logger.warning("Saving vocabulary to {}: BPE merge indices are not consecutive."
+                                   " Please check that the tokenizer is not corrupted!".format(merge_file))
+                    index = token_index
+                writer.write(' '.join(bpe_tokens) + u'\n')
+                index += 1
 
-        return vocab_files + (dict_file,)
+        return vocab_file, merge_file

From 7060766490240f7f1a63dce4c1ca6d0abfd8555d Mon Sep 17 00:00:00 2001
From: Kevin Trebing <Kevin.Trebing@gmx.net>
Date: Fri, 9 Aug 2019 11:28:39 +0100
Subject: [PATCH 078/200] Corrected logger.error info

Signed-off-by: Kevin Trebing <Kevin.Trebing@gmx.net>
---
 pytorch_transformers/modeling_bert.py | 2 +-
 pytorch_transformers/modeling_gpt2.py | 2 +-
 2 files changed, 2 insertions(+), 2 deletions(-)

diff --git a/pytorch_transformers/modeling_bert.py b/pytorch_transformers/modeling_bert.py
index 34eac7f26f..51d8788545 100644
--- a/pytorch_transformers/modeling_bert.py
+++ b/pytorch_transformers/modeling_bert.py
@@ -74,7 +74,7 @@ def load_tf_weights_in_bert(model, config, tf_checkpoint_path):
         import numpy as np
         import tensorflow as tf
     except ImportError:
-        logger.error("Loading a TensorFlow models in PyTorch, requires TensorFlow to be installed. Please see "
+        logger.error("Loading a TensorFlow model in PyTorch, requires TensorFlow to be installed. Please see "
             "https://www.tensorflow.org/install/ for installation instructions.")
         raise
     tf_path = os.path.abspath(tf_checkpoint_path)
diff --git a/pytorch_transformers/modeling_gpt2.py b/pytorch_transformers/modeling_gpt2.py
index 0ecc20516c..ce00c50075 100644
--- a/pytorch_transformers/modeling_gpt2.py
+++ b/pytorch_transformers/modeling_gpt2.py
@@ -50,7 +50,7 @@ def load_tf_weights_in_gpt2(model, config, gpt2_checkpoint_path):
         import numpy as np
         import tensorflow as tf
     except ImportError:
-        logger.error("Loading a TensorFlow models in PyTorch, requires TensorFlow to be installed. Please see "
+        logger.error("Loading a TensorFlow model in PyTorch, requires TensorFlow to be installed. Please see "
             "https://www.tensorflow.org/install/ for installation instructions.")
         raise
     tf_path = os.path.abspath(gpt2_checkpoint_path)

From c683c3d5a528c3cb66c6f0e497ccde18875048e0 Mon Sep 17 00:00:00 2001
From: thomwolf <thomwolf@gmail.com>
Date: Sat, 10 Aug 2019 20:04:35 +0200
Subject: [PATCH 079/200] fix #993

---
 pytorch_transformers/modeling_gpt2.py   | 5 +++--
 pytorch_transformers/modeling_openai.py | 5 +++--
 2 files changed, 6 insertions(+), 4 deletions(-)

diff --git a/pytorch_transformers/modeling_gpt2.py b/pytorch_transformers/modeling_gpt2.py
index 0ecc20516c..148b4a82ae 100644
--- a/pytorch_transformers/modeling_gpt2.py
+++ b/pytorch_transformers/modeling_gpt2.py
@@ -683,9 +683,10 @@ class GPT2DoubleHeadsModel(GPT2PreTrainedModel):
 
         tokenizer = GPT2Tokenizer.from_pretrained('gpt2')
         model = GPT2DoubleHeadsModel.from_pretrained('gpt2')
-        choices = ["Hello, my dog is cute [CLS]", "Hello, my cat is cute [CLS]"]  # Assume you've added [CLS] to the vocabulary
+        tokenizer.add_special_tokens({'cls_token': '[CLS]'})  # Add a [CLS] to the vocabulary (we should train it also!)
+        choices = ["Hello, my dog is cute [CLS]", "Hello, my cat is cute [CLS]"]
         input_ids = torch.tensor([tokenizer.encode(s) for s in choices]).unsqueeze(0)  # Batch size 1, 2 choices
-        mc_token_ids = torch.tensor([-1, -1]).unsqueeze(0)  # Batch size 1
+        mc_token_ids = torch.tensor([input_ids.size(-1), input_ids.size(-1)]).unsqueeze(0)  # Batch size 1
         outputs = model(input_ids, mc_token_ids)
         lm_prediction_scores, mc_prediction_scores = outputs[:2]
 
diff --git a/pytorch_transformers/modeling_openai.py b/pytorch_transformers/modeling_openai.py
index 536b0e2432..364923b0af 100644
--- a/pytorch_transformers/modeling_openai.py
+++ b/pytorch_transformers/modeling_openai.py
@@ -665,9 +665,10 @@ class OpenAIGPTDoubleHeadsModel(OpenAIGPTPreTrainedModel):
 
         tokenizer = OpenAIGPTTokenizer.from_pretrained('openai-gpt')
         model = OpenAIGPTDoubleHeadsModel.from_pretrained('openai-gpt')
-        choices = ["Hello, my dog is cute [CLS]", "Hello, my cat is cute [CLS]"]  # Assume you've added [CLS] to the vocabulary
+        tokenizer.add_special_tokens({'cls_token': '[CLS]'})  # Add a [CLS] to the vocabulary (we should train it also!)
+        choices = ["Hello, my dog is cute [CLS]", "Hello, my cat is cute [CLS]"]
         input_ids = torch.tensor([tokenizer.encode(s) for s in choices]).unsqueeze(0)  # Batch size 1, 2 choices
-        mc_token_ids = torch.tensor([-1, -1]).unsqueeze(0)  # Batch size 1
+        mc_token_ids = torch.tensor([input_ids.size(-1), input_ids.size(-1)]).unsqueeze(0)  # Batch size 1
         outputs = model(input_ids, mc_token_ids)
         lm_prediction_scores, mc_prediction_scores = outputs[:2]
 

From a7b4cfe9194bf93c7044a42c9f1281260ce6279e Mon Sep 17 00:00:00 2001
From: carefree0910 <syameimaru_kurumi@pku.edu.cn>
Date: Sun, 11 Aug 2019 21:36:51 +0800
Subject: [PATCH 080/200] Update README.md

I assume that it should test the `re-load` functionality after testing the `save` functionality, however I'm also surprised that nobody points this out after such a long time, so maybe I've misunderstood the purpose. This PR is just in case :)
---
 README.md | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/README.md b/README.md
index b86a5238c2..48c54a055a 100644
--- a/README.md
+++ b/README.md
@@ -123,7 +123,7 @@ traced_model = torch.jit.trace(model, (input_ids,))
 model.save_pretrained('./directory/to/save/')  # save
 model = model_class.from_pretrained('./directory/to/save/')  # re-load
 tokenizer.save_pretrained('./directory/to/save/')  # save
-tokenizer = tokenizer_class.from_pretrained(pretrained_weights)
+tokenizer = tokenizer_class.from_pretrained('./directory/to/save/')  # re-load
 
 # SOTA examples for GLUE, SQUAD, text generation...
 ```

From b3d83d68db2db037a439516c24c593d4a85035a7 Mon Sep 17 00:00:00 2001
From: Julien Chaumond <chaumond@gmail.com>
Date: Mon, 12 Aug 2019 12:28:55 -0400
Subject: [PATCH 081/200] Fixup 9d0603148bc34255fad0cad73ce438ecd7306322

---
 .../convert_roberta_checkpoint_to_pytorch.py                 | 5 ++++-
 1 file changed, 4 insertions(+), 1 deletion(-)

diff --git a/pytorch_transformers/convert_roberta_checkpoint_to_pytorch.py b/pytorch_transformers/convert_roberta_checkpoint_to_pytorch.py
index e4e8fbb25d..0a8967426e 100644
--- a/pytorch_transformers/convert_roberta_checkpoint_to_pytorch.py
+++ b/pytorch_transformers/convert_roberta_checkpoint_to_pytorch.py
@@ -139,7 +139,10 @@ def convert_roberta_checkpoint_to_pytorch(roberta_checkpoint_path, pytorch_dump_
     input_ids: torch.Tensor = roberta.encode(SAMPLE_TEXT).unsqueeze(0) # batch of size 1
 
     our_output = model(input_ids)[0]
-    their_output = roberta.model(input_ids)[0]
+    if classification_head:
+        their_output = roberta.model.classification_heads['mnli'](roberta.extract_features(input_ids))
+    else:
+        their_output = roberta.model(input_ids)[0]
     print(our_output.shape, their_output.shape)
     success = torch.allclose(our_output, their_output, atol=1e-3)
     print(

From 912fdff899cf0fd674ed357e46a0209311aefad2 Mon Sep 17 00:00:00 2001
From: Julien Chaumond <chaumond@gmail.com>
Date: Mon, 12 Aug 2019 13:49:50 -0400
Subject: [PATCH 082/200] [RoBERTa] Update `run_glue` for RoBERTa

---
 examples/run_glue.py   | 13 +++++++++----
 examples/utils_glue.py | 17 +++++++++++++----
 2 files changed, 22 insertions(+), 8 deletions(-)

diff --git a/examples/run_glue.py b/examples/run_glue.py
index a939ea373b..f6cd73ed0b 100644
--- a/examples/run_glue.py
+++ b/examples/run_glue.py
@@ -13,7 +13,7 @@
 # WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
 # See the License for the specific language governing permissions and
 # limitations under the License.
-""" Finetuning the library models for sequence classification on GLUE (Bert, XLM, XLNet)."""
+""" Finetuning the library models for sequence classification on GLUE (Bert, XLM, XLNet, RoBERTa)."""
 
 from __future__ import absolute_import, division, print_function
 
@@ -33,6 +33,9 @@ from tqdm import tqdm, trange
 
 from pytorch_transformers import (WEIGHTS_NAME, BertConfig,
                                   BertForSequenceClassification, BertTokenizer,
+                                  RobertaConfig,
+                                  RobertaForSequenceClassification,
+                                  RobertaTokenizer,
                                   XLMConfig, XLMForSequenceClassification,
                                   XLMTokenizer, XLNetConfig,
                                   XLNetForSequenceClassification,
@@ -45,12 +48,13 @@ from utils_glue import (compute_metrics, convert_examples_to_features,
 
 logger = logging.getLogger(__name__)
 
-ALL_MODELS = sum((tuple(conf.pretrained_config_archive_map.keys()) for conf in (BertConfig, XLNetConfig, XLMConfig)), ())
+ALL_MODELS = sum((tuple(conf.pretrained_config_archive_map.keys()) for conf in (BertConfig, XLNetConfig, XLMConfig, RobertaConfig)), ())
 
 MODEL_CLASSES = {
     'bert': (BertConfig, BertForSequenceClassification, BertTokenizer),
     'xlnet': (XLNetConfig, XLNetForSequenceClassification, XLNetTokenizer),
     'xlm': (XLMConfig, XLMForSequenceClassification, XLMTokenizer),
+    'roberta': (RobertaConfig, RobertaForSequenceClassification, RobertaTokenizer),
 }
 
 
@@ -214,7 +218,7 @@ def evaluate(args, model, tokenizer, prefix=""):
             with torch.no_grad():
                 inputs = {'input_ids':      batch[0],
                           'attention_mask': batch[1],
-                          'token_type_ids': batch[2] if args.model_type in ['bert', 'xlnet'] else None,  # XLM don't use segment_ids
+                          'token_type_ids': batch[2] if args.model_type in ['bert', 'xlnet'] else None,  # XLM and RoBERTa don't use segment_ids
                           'labels':         batch[3]}
                 outputs = model(**inputs)
                 tmp_eval_loss, logits = outputs[:2]
@@ -268,8 +272,9 @@ def load_and_cache_examples(args, task, tokenizer, evaluate=False):
         features = convert_examples_to_features(examples, label_list, args.max_seq_length, tokenizer, output_mode,
             cls_token_at_end=bool(args.model_type in ['xlnet']),            # xlnet has a cls token at the end
             cls_token=tokenizer.cls_token,
-            sep_token=tokenizer.sep_token,
             cls_token_segment_id=2 if args.model_type in ['xlnet'] else 0,
+            sep_token=tokenizer.sep_token,
+            sep_token_extra=bool(args.model_type in ['roberta']),           # roberta uses an extra separator b/w pairs of sentences, cf. github.com/pytorch/fairseq/commit/1684e166e3da03f5b600dbb7855cb98ddfcd0805
             pad_on_left=bool(args.model_type in ['xlnet']),                 # pad on the left for xlnet
             pad_token_segment_id=4 if args.model_type in ['xlnet'] else 0)
         if args.local_rank in [-1, 0]:
diff --git a/examples/utils_glue.py b/examples/utils_glue.py
index bba9a901a8..c955e4d0ce 100644
--- a/examples/utils_glue.py
+++ b/examples/utils_glue.py
@@ -390,10 +390,16 @@ class WnliProcessor(DataProcessor):
 
 def convert_examples_to_features(examples, label_list, max_seq_length,
                                  tokenizer, output_mode,
-                                 cls_token_at_end=False, pad_on_left=False,
-                                 cls_token='[CLS]', sep_token='[SEP]', pad_token=0,
-                                 sequence_a_segment_id=0, sequence_b_segment_id=1,
-                                 cls_token_segment_id=1, pad_token_segment_id=0,
+                                 cls_token_at_end=False,
+                                 cls_token='[CLS]',
+                                 cls_token_segment_id=1,
+                                 sep_token='[SEP]',
+                                 sep_token_extra=False,
+                                 pad_on_left=False,
+                                 pad_token=0,
+                                 pad_token_segment_id=0,
+                                 sequence_a_segment_id=0, 
+                                 sequence_b_segment_id=1,
                                  mask_padding_with_zero=True):
     """ Loads a data file into a list of `InputBatch`s
         `cls_token_at_end` define the location of the CLS token:
@@ -442,6 +448,9 @@ def convert_examples_to_features(examples, label_list, max_seq_length,
         # used as as the "sentence vector". Note that this only makes sense because
         # the entire model is fine-tuned.
         tokens = tokens_a + [sep_token]
+        if sep_token_extra:
+            # roberta uses an extra separator b/w pairs of sentences
+            tokens += [sep_token]
         segment_ids = [sequence_a_segment_id] * len(tokens)
 
         if tokens_b:

From 22ac004a7c9cc76d930ecc95b6b0469cd6693b16 Mon Sep 17 00:00:00 2001
From: LysandreJik <lysandre.debut@reseau.eseo.fr>
Date: Mon, 12 Aug 2019 15:13:53 -0400
Subject: [PATCH 083/200] Added documentation and changed parameters for
 special_tokens_sentences_pair.

---
 pytorch_transformers/tokenization_bert.py    | 12 +++++++++--
 pytorch_transformers/tokenization_roberta.py | 22 +++++++++++++-------
 pytorch_transformers/tokenization_utils.py   |  2 +-
 pytorch_transformers/tokenization_xlm.py     | 12 +++++++++--
 pytorch_transformers/tokenization_xlnet.py   | 20 ++++++++++++------
 5 files changed, 50 insertions(+), 18 deletions(-)

diff --git a/pytorch_transformers/tokenization_bert.py b/pytorch_transformers/tokenization_bert.py
index 9f4f00a300..177d26dec1 100644
--- a/pytorch_transformers/tokenization_bert.py
+++ b/pytorch_transformers/tokenization_bert.py
@@ -167,12 +167,20 @@ class BertTokenizer(PreTrainedTokenizer):
         return out_string
 
     def add_special_tokens_single_sentence(self, token_ids):
+        """
+        Adds special tokens to the a sequence for sequence classification tasks.
+        A BERT sequence has the following format: [CLS] X [SEP]
+        """
         return [self._convert_token_to_id(self.cls_token)] + token_ids + [self._convert_token_to_id(self.sep_token)]
 
-    def add_special_tokens_sentences_pair(self, *token_ids):
+    def add_special_tokens_sentences_pair(self, token_ids_0, token_ids_1):
+        """
+        Adds special tokens to a sequence pair for sequence classification tasks.
+        A BERT sequence pair has the following format: [CLS] A [SEP] B [SEP]
+        """
         sep = [self._convert_token_to_id(self.sep_token)]
         cls = [self._convert_token_to_id(self.cls_token)]
-        return cls + token_ids[0] + sep + token_ids[1] + sep
+        return cls + token_ids_0 + sep + token_ids_1 + sep
 
     def save_vocabulary(self, vocab_path):
         """Save the tokenizer vocabulary to a directory or file."""
diff --git a/pytorch_transformers/tokenization_roberta.py b/pytorch_transformers/tokenization_roberta.py
index b01b92653d..8f5cecee8a 100644
--- a/pytorch_transformers/tokenization_roberta.py
+++ b/pytorch_transformers/tokenization_roberta.py
@@ -12,7 +12,7 @@
 # WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
 # See the License for the specific language governing permissions and
 # limitations under the License.
-"""Tokenization classes for OpenAI GPT."""
+"""Tokenization classes for RoBERTa."""
 from __future__ import (absolute_import, division, print_function,
                         unicode_literals)
 
@@ -57,15 +57,15 @@ PRETRAINED_VOCAB_FILES_MAP = {
 }
 
 PRETRAINED_POSITIONAL_EMBEDDINGS_SIZES = {
-    'roberta-base': 1024,
-    'roberta-large': 1024,
-    'roberta-large-mnli': 1024,
+    'roberta-base': 512,
+    'roberta-large': 512,
+    'roberta-large-mnli': 512,
 }
 
 
 class RobertaTokenizer(PreTrainedTokenizer):
     """
-    GPT-2 BPE tokenizer. Peculiarities:
+    RoBERTa BPE tokenizer, derived from the GPT-2 tokenizer. Peculiarities:
         - Byte-level BPE
     """
     vocab_files_names = VOCAB_FILES_NAMES
@@ -161,12 +161,20 @@ class RobertaTokenizer(PreTrainedTokenizer):
         return text
 
     def add_special_tokens_single_sentence(self, token_ids):
+        """
+        Adds special tokens to a sequence for sequence classification tasks.
+        A RoBERTa sequence has the following format: [CLS] X [SEP]
+        """
         return [self._convert_token_to_id(self.cls_token)] + token_ids + [self._convert_token_to_id(self.sep_token)]
 
-    def add_special_tokens_sentences_pair(self, *token_ids):
+    def add_special_tokens_sentences_pair(self, token_ids_0, token_ids_1):
+        """
+        Adds special tokens to a sequence pair for sequence classification tasks.
+        A RoBERTa sequence pair has the following format: [CLS] A [SEP][SEP] B [SEP]
+        """
         sep = [self._convert_token_to_id(self.sep_token)]
         cls = [self._convert_token_to_id(self.cls_token)]
-        return cls + token_ids[0] + sep + sep + token_ids[1] + sep
+        return cls + token_ids_0 + sep + sep + token_ids_1 + sep
 
     def save_vocabulary(self, save_directory):
         """Save the tokenizer vocabulary and merge files to a directory."""
diff --git a/pytorch_transformers/tokenization_utils.py b/pytorch_transformers/tokenization_utils.py
index a3581fe582..3253596058 100644
--- a/pytorch_transformers/tokenization_utils.py
+++ b/pytorch_transformers/tokenization_utils.py
@@ -546,7 +546,7 @@ class PreTrainedTokenizer(object):
     def add_special_tokens_single_sentence(self, token_ids):
         raise NotImplementedError
 
-    def add_special_tokens_sentences_pair(self, *token_ids):
+    def add_special_tokens_sentences_pair(self, token_ids_0, token_ids_1):
         raise NotImplementedError
 
     def convert_ids_to_tokens(self, ids, skip_special_tokens=False):
diff --git a/pytorch_transformers/tokenization_xlm.py b/pytorch_transformers/tokenization_xlm.py
index b0b8f1d78d..b690a3a945 100644
--- a/pytorch_transformers/tokenization_xlm.py
+++ b/pytorch_transformers/tokenization_xlm.py
@@ -215,12 +215,20 @@ class XLMTokenizer(PreTrainedTokenizer):
         return out_string
 
     def add_special_tokens_single_sentence(self, token_ids):
+        """
+        Adds special tokens to a sequence for sequence classification tasks.
+        An XLM sequence has the following format: [CLS] X [SEP]
+        """
         return [self._convert_token_to_id(self.cls_token)] + token_ids + [self._convert_token_to_id(self.sep_token)]
 
-    def add_special_tokens_sentences_pair(self, *token_ids):
+    def add_special_tokens_sentences_pair(self, token_ids_0, token_ids_1):
+        """
+        Adds special tokens to a sequence pair for sequence classification tasks.
+        An XLM sequence pair has the following format: [CLS] A [SEP] B [SEP]
+        """
         sep = [self._convert_token_to_id(self.sep_token)]
         cls = [self._convert_token_to_id(self.cls_token)]
-        return cls + token_ids[0] + sep + token_ids[1] + sep
+        return cls + token_ids_0 + sep + token_ids_1 + sep
 
     def save_vocabulary(self, save_directory):
         """Save the tokenizer vocabulary and merge files to a directory."""
diff --git a/pytorch_transformers/tokenization_xlnet.py b/pytorch_transformers/tokenization_xlnet.py
index 42473da860..371b3c9407 100644
--- a/pytorch_transformers/tokenization_xlnet.py
+++ b/pytorch_transformers/tokenization_xlnet.py
@@ -178,14 +178,22 @@ class XLNetTokenizer(PreTrainedTokenizer):
         return out_string
 
     def add_special_tokens_single_sentence(self, token_ids):
-        logger.warning("No method was defined for special tokens and single sentence streams in XLNet. "
-                       "Returning token_ids")
-        return token_ids
-
-    def add_special_tokens_sentences_pair(self, *token_ids):
+        """
+        Adds special tokens to a sequence pair for sequence classification tasks.
+        An XLNet sequence pair has the following format: A [SEP] B [SEP][CLS]
+        """
         sep = [self._convert_token_to_id(self.sep_token)]
         cls = [self._convert_token_to_id(self.cls_token)]
-        return token_ids[0] + sep + token_ids[1] + sep + cls
+        return token_ids + sep + cls
+
+    def add_special_tokens_sentences_pair(self, token_ids_0, token_ids_1):
+        """
+        Adds special tokens to a sequence for sequence classification tasks.
+        An XLNet sequence has the following format: X [SEP][CLS]
+        """
+        sep = [self._convert_token_to_id(self.sep_token)]
+        cls = [self._convert_token_to_id(self.cls_token)]
+        return token_ids_0 + sep + token_ids_1 + sep + cls
 
     def save_vocabulary(self, save_directory):
         """ Save the sentencepiece vocabulary (copy original file) and special tokens file

From 634a3172d869e2ff772b2e0813169641ca9e6cc5 Mon Sep 17 00:00:00 2001
From: LysandreJik <lysandre.debut@reseau.eseo.fr>
Date: Mon, 12 Aug 2019 15:14:15 -0400
Subject: [PATCH 084/200] Added integration tests for sequence builders.

---
 .../tests/tokenization_bert_test.py                | 11 +++++++++++
 .../tests/tokenization_roberta_test.py             | 14 +++++++++++++-
 .../tests/tokenization_xlm_test.py                 | 11 +++++++++++
 .../tests/tokenization_xlnet_test.py               | 12 ++++++++++++
 4 files changed, 47 insertions(+), 1 deletion(-)

diff --git a/pytorch_transformers/tests/tokenization_bert_test.py b/pytorch_transformers/tests/tokenization_bert_test.py
index 5eb39b729d..db507317a8 100644
--- a/pytorch_transformers/tests/tokenization_bert_test.py
+++ b/pytorch_transformers/tests/tokenization_bert_test.py
@@ -125,6 +125,17 @@ class BertTokenizationTest(CommonTestCases.CommonTokenizerTester):
         self.assertFalse(_is_punctuation(u"A"))
         self.assertFalse(_is_punctuation(u" "))
 
+    def test_sequence_builders(self):
+        tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
+
+        text = tokenizer.encode("sequence builders")
+        text_2 = tokenizer.encode("multi-sequence build")
+
+        encoded_sentence = tokenizer.add_special_tokens_single_sentence(text)
+        encoded_pair = tokenizer.add_special_tokens_sentences_pair(text, text_2)
+
+        assert encoded_sentence == [101] + text + [102]
+        assert encoded_pair == [101] + text + [102] + text_2 + [102]
 
 if __name__ == '__main__':
     unittest.main()
diff --git a/pytorch_transformers/tests/tokenization_roberta_test.py b/pytorch_transformers/tests/tokenization_roberta_test.py
index daefea0fa7..b76b3e311d 100644
--- a/pytorch_transformers/tests/tokenization_roberta_test.py
+++ b/pytorch_transformers/tests/tokenization_roberta_test.py
@@ -71,10 +71,22 @@ class RobertaTokenizationTest(CommonTestCases.CommonTokenizerTester):
             [0, 31414, 232, 328, 2]
         )
         self.assertListEqual(
-            tokenizer.encode('Hello world! cécé herlolip'),
+            tokenizer.encode('Hello world! cécé herlolip 418'),
             [0, 31414, 232, 328, 740, 1140, 12695, 69, 46078, 1588, 2]
         )
 
+    def test_sequence_builders(self):
+        tokenizer = RobertaTokenizer.from_pretrained("roberta-base")
+
+        text = tokenizer.encode("sequence builders")
+        text_2 = tokenizer.encode("multi-sequence build")
+
+        encoded_sentence = tokenizer.add_special_tokens_single_sentence(text)
+        encoded_pair = tokenizer.add_special_tokens_sentences_pair(text, text_2)
+
+        assert encoded_sentence == [0] + text + [2]
+        assert encoded_pair == [0] + text + [2, 2] + text_2 + [2]
+
 
 if __name__ == '__main__':
     unittest.main()
diff --git a/pytorch_transformers/tests/tokenization_xlm_test.py b/pytorch_transformers/tests/tokenization_xlm_test.py
index a20e92044f..ede77a1f98 100644
--- a/pytorch_transformers/tests/tokenization_xlm_test.py
+++ b/pytorch_transformers/tests/tokenization_xlm_test.py
@@ -66,6 +66,17 @@ class XLMTokenizationTest(CommonTestCases.CommonTokenizerTester):
         self.assertListEqual(
             tokenizer.convert_tokens_to_ids(input_tokens), input_bpe_tokens)
 
+    def test_sequence_builders(self):
+        tokenizer = XLMTokenizer.from_pretrained("xlm-mlm-en-2048")
+
+        text = tokenizer.encode("sequence builders")
+        text_2 = tokenizer.encode("multi-sequence build")
+
+        encoded_sentence = tokenizer.add_special_tokens_single_sentence(text)
+        encoded_pair = tokenizer.add_special_tokens_sentences_pair(text, text_2)
+
+        assert encoded_sentence == [1] + text + [1]
+        assert encoded_pair == [1] + text + [1] + text_2 + [1]
 
 if __name__ == '__main__':
     unittest.main()
diff --git a/pytorch_transformers/tests/tokenization_xlnet_test.py b/pytorch_transformers/tests/tokenization_xlnet_test.py
index 08e9e9cb2d..9feab7c0bd 100644
--- a/pytorch_transformers/tests/tokenization_xlnet_test.py
+++ b/pytorch_transformers/tests/tokenization_xlnet_test.py
@@ -89,6 +89,18 @@ class XLNetTokenizationTest(CommonTestCases.CommonTokenizerTester):
                                       u'9', u'2', u'0', u'0', u'0', u',', SPIECE_UNDERLINE + u'and', SPIECE_UNDERLINE + u'this',
                                       SPIECE_UNDERLINE + u'is', SPIECE_UNDERLINE + u'f', u'al', u'se', u'.'])
 
+    def test_sequence_builders(self):
+        tokenizer = XLNetTokenizer.from_pretrained("xlnet-base-cased")
+
+        text = tokenizer.encode("sequence builders")
+        text_2 = tokenizer.encode("multi-sequence build")
+
+        encoded_sentence = tokenizer.add_special_tokens_single_sentence(text)
+        encoded_pair = tokenizer.add_special_tokens_sentences_pair(text, text_2)
+
+        assert encoded_sentence == text + [4, 3]
+        assert encoded_pair == text + [4] + text_2 + [4, 3]
+
 
 if __name__ == '__main__':
     unittest.main()

From ba4bce2581f9a67caa44c3cc959a2dacb0090670 Mon Sep 17 00:00:00 2001
From: tuvuumass <tuvu@cs.umass.edu>
Date: Tue, 13 Aug 2019 11:26:27 -0400
Subject: [PATCH 085/200] fix issue #824

---
 examples/run_bertology.py | 18 ++++++++++--------
 1 file changed, 10 insertions(+), 8 deletions(-)

diff --git a/examples/run_bertology.py b/examples/run_bertology.py
index 61c7440ecb..f11b73b54f 100644
--- a/examples/run_bertology.py
+++ b/examples/run_bertology.py
@@ -211,10 +211,12 @@ def prune_heads(args, model, eval_dataloader, head_mask):
 
 def main():
     parser = argparse.ArgumentParser()
+    ## Required parameters
     parser.add_argument("--data_dir", default=None, type=str, required=True,
                         help="The input data dir. Should contain the .tsv files (or other data files) for the task.")
-    parser.add_argument("--model_name", default=None, type=str, required=True,
-                        help="Bert/XLNet/XLM pre-trained model selected in the list: " + ", ".join(ALL_MODELS))
+    parser.add_argument("--model_name_or_path", default=None, type=str, required=True,
+                        help="Path to pre-trained model or shortcut name selected in the list: " + ", ".join(
+                            ALL_MODELS))
     parser.add_argument("--task_name", default=None, type=str, required=True,
                         help="The name of the task to train selected in the list: " + ", ".join(processors.keys()))
     parser.add_argument("--output_dir", default=None, type=str, required=True,
@@ -222,9 +224,9 @@ def main():
 
     ## Other parameters
     parser.add_argument("--config_name", default="", type=str,
-                        help="Pretrained config name or path if not the same as model_name")
+                        help="Pretrained config name or path if not the same as model_name_or_path")
     parser.add_argument("--tokenizer_name", default="", type=str,
-                        help="Pretrained tokenizer name or path if not the same as model_name")
+                        help="Pretrained tokenizer name or path if not the same as model_name_or_path")
     parser.add_argument("--cache_dir", default="", type=str,
                         help="Where do you want to store the pre-trained models downloaded from s3")
     parser.add_argument("--data_subset", type=int, default=-1,
@@ -297,15 +299,15 @@ def main():
 
     args.model_type = ""
     for key in MODEL_CLASSES:
-        if key in args.model_name.lower():
+        if key in args.model_name_or_path.lower():
             args.model_type = key  # take the first match in model types
             break
     config_class, model_class, tokenizer_class = MODEL_CLASSES[args.model_type]
-    config = config_class.from_pretrained(args.config_name if args.config_name else args.model_name,
+    config = config_class.from_pretrained(args.config_name if args.config_name else args.model_name_or_path,
                                           num_labels=num_labels, finetuning_task=args.task_name,
                                           output_attentions=True)
-    tokenizer = tokenizer_class.from_pretrained(args.tokenizer_name if args.tokenizer_name else args.model_name)
-    model = model_class.from_pretrained(args.model_name, from_tf=bool('.ckpt' in args.model_name), config=config)
+    tokenizer = tokenizer_class.from_pretrained(args.tokenizer_name if args.tokenizer_name else args.model_name_or_path)
+    model = model_class.from_pretrained(args.model_name_or_path, from_tf=bool('.ckpt' in args.model_name_or_path), config=config)
 
     if args.local_rank == 0:
         torch.distributed.barrier()  # Make sure only the first process in distributed training will download model & vocab

From 3d87991f606b36dc54318ac3dee9803001ef161d Mon Sep 17 00:00:00 2001
From: LysandreJik <lysandre.debut@reseau.eseo.fr>
Date: Tue, 13 Aug 2019 12:00:24 -0400
Subject: [PATCH 086/200] Fixed error with encoding

---
 .../tests/tokenization_roberta_test.py                |  7 +++++--
 pytorch_transformers/tokenization_utils.py            | 11 +++--------
 2 files changed, 8 insertions(+), 10 deletions(-)

diff --git a/pytorch_transformers/tests/tokenization_roberta_test.py b/pytorch_transformers/tests/tokenization_roberta_test.py
index b76b3e311d..a8f940ae43 100644
--- a/pytorch_transformers/tests/tokenization_roberta_test.py
+++ b/pytorch_transformers/tests/tokenization_roberta_test.py
@@ -81,11 +81,14 @@ class RobertaTokenizationTest(CommonTestCases.CommonTokenizerTester):
         text = tokenizer.encode("sequence builders")
         text_2 = tokenizer.encode("multi-sequence build")
 
+        encoded_text_from_decode = tokenizer.encode("sequence builders", add_special_tokens=True)
+        encoded_pair_from_decode = tokenizer.encode("sequence builders", "multi-sequence build", add_special_tokens=True)
+
         encoded_sentence = tokenizer.add_special_tokens_single_sentence(text)
         encoded_pair = tokenizer.add_special_tokens_sentences_pair(text, text_2)
 
-        assert encoded_sentence == [0] + text + [2]
-        assert encoded_pair == [0] + text + [2, 2] + text_2 + [2]
+        assert encoded_sentence == encoded_text_from_decode
+        assert encoded_pair == encoded_pair_from_decode
 
 
 if __name__ == '__main__':
diff --git a/pytorch_transformers/tokenization_utils.py b/pytorch_transformers/tokenization_utils.py
index 3253596058..7bb9fd9d29 100644
--- a/pytorch_transformers/tokenization_utils.py
+++ b/pytorch_transformers/tokenization_utils.py
@@ -519,24 +519,19 @@ class PreTrainedTokenizer(object):
     def _convert_token_to_id(self, token):
         raise NotImplementedError
 
-    def encode(self, text, add_special_tokens=False, *sequences):
+    def encode(self, text, text_pair=None, add_special_tokens=False):
         """ Converts a string in a sequence of ids (integer), using the tokenizer and vocabulary.
         
         Same doing ``self.convert_tokens_to_ids(self.tokenize(text))``.
         """
-
-        if len(sequences) == 0:
+        if text_pair is None:
             if add_special_tokens:
                 return self.add_special_tokens_single_sentence(self.convert_tokens_to_ids(self.tokenize(text)))
             else:
                 return self.convert_tokens_to_ids(self.tokenize(text))
 
-        if len(sequences) > 1:
-            logger.warning("Tokenization currently only supports sentence pairs. Ignoring every string following the "
-                           "initial two.")
-
         first_sentence_tokens = [self._convert_token_to_id(token) for token in self.tokenize(text)]
-        second_sentence_tokens = [self._convert_token_to_id(token) for token in self.tokenize(sequences[0])]
+        second_sentence_tokens = [self._convert_token_to_id(token) for token in self.tokenize(text_pair)]
 
         if add_special_tokens:
             return self.add_special_tokens_sentences_pair(first_sentence_tokens, second_sentence_tokens)

From baf08ca1d4ab5aee1d530fc1801370e8a81cc091 Mon Sep 17 00:00:00 2001
From: Julien Chaumond <chaumond@gmail.com>
Date: Tue, 13 Aug 2019 12:51:15 -0400
Subject: [PATCH 087/200] [RoBERTa] run_glue: correct pad_token + reorder
 labels

---
 examples/run_glue.py | 7 ++++++-
 1 file changed, 6 insertions(+), 1 deletion(-)

diff --git a/examples/run_glue.py b/examples/run_glue.py
index f6cd73ed0b..445a9a5912 100644
--- a/examples/run_glue.py
+++ b/examples/run_glue.py
@@ -268,6 +268,9 @@ def load_and_cache_examples(args, task, tokenizer, evaluate=False):
     else:
         logger.info("Creating features from dataset file at %s", args.data_dir)
         label_list = processor.get_labels()
+        if task in ['mnli', 'mnli-mm'] and args.model_type in ['roberta']:
+            # HACK(label indices are swapped in RoBERTa pretrained model)
+            label_list[1], label_list[2] = label_list[2], label_list[1] 
         examples = processor.get_dev_examples(args.data_dir) if evaluate else processor.get_train_examples(args.data_dir)
         features = convert_examples_to_features(examples, label_list, args.max_seq_length, tokenizer, output_mode,
             cls_token_at_end=bool(args.model_type in ['xlnet']),            # xlnet has a cls token at the end
@@ -276,7 +279,9 @@ def load_and_cache_examples(args, task, tokenizer, evaluate=False):
             sep_token=tokenizer.sep_token,
             sep_token_extra=bool(args.model_type in ['roberta']),           # roberta uses an extra separator b/w pairs of sentences, cf. github.com/pytorch/fairseq/commit/1684e166e3da03f5b600dbb7855cb98ddfcd0805
             pad_on_left=bool(args.model_type in ['xlnet']),                 # pad on the left for xlnet
-            pad_token_segment_id=4 if args.model_type in ['xlnet'] else 0)
+            pad_token=1 if args.model_type in ['roberta'] else 0, # TODO(Lysandre: replace with tokenizer.pad_token when implemented)
+            pad_token_segment_id=4 if args.model_type in ['xlnet'] else 0,
+        )
         if args.local_rank in [-1, 0]:
             logger.info("Saving features into cached file %s", cached_features_file)
             torch.save(features, cached_features_file)

From 39f426be6577d4534a058c9c42d52053a0ef9257 Mon Sep 17 00:00:00 2001
From: LysandreJik <lysandre.debut@reseau.eseo.fr>
Date: Tue, 13 Aug 2019 15:19:50 -0400
Subject: [PATCH 088/200] Added special tokens <pad> and <mask> to RoBERTa.

---
 examples/run_glue.py                         | 2 +-
 pytorch_transformers/tokenization_roberta.py | 5 +++--
 2 files changed, 4 insertions(+), 3 deletions(-)

diff --git a/examples/run_glue.py b/examples/run_glue.py
index 445a9a5912..c0f70e0863 100644
--- a/examples/run_glue.py
+++ b/examples/run_glue.py
@@ -279,7 +279,7 @@ def load_and_cache_examples(args, task, tokenizer, evaluate=False):
             sep_token=tokenizer.sep_token,
             sep_token_extra=bool(args.model_type in ['roberta']),           # roberta uses an extra separator b/w pairs of sentences, cf. github.com/pytorch/fairseq/commit/1684e166e3da03f5b600dbb7855cb98ddfcd0805
             pad_on_left=bool(args.model_type in ['xlnet']),                 # pad on the left for xlnet
-            pad_token=1 if args.model_type in ['roberta'] else 0, # TODO(Lysandre: replace with tokenizer.pad_token when implemented)
+            pad_token=tokenizer.encoder[tokenizer.pad_token] if args.model_type in ['roberta'] else tokenizer.vocab[tokenizer.pad_token],
             pad_token_segment_id=4 if args.model_type in ['xlnet'] else 0,
         )
         if args.local_rank in [-1, 0]:
diff --git a/pytorch_transformers/tokenization_roberta.py b/pytorch_transformers/tokenization_roberta.py
index 8f5cecee8a..1db8013183 100644
--- a/pytorch_transformers/tokenization_roberta.py
+++ b/pytorch_transformers/tokenization_roberta.py
@@ -73,9 +73,10 @@ class RobertaTokenizer(PreTrainedTokenizer):
     max_model_input_sizes = PRETRAINED_POSITIONAL_EMBEDDINGS_SIZES
 
     def __init__(self, vocab_file, merges_file, errors='replace', bos_token="<s>", eos_token="</s>", sep_token="</s>",
-                 cls_token="<s>", unk_token="<unk>", **kwargs):
+                 cls_token="<s>", unk_token="<unk>", pad_token='<pad>', mask_token='<mask>', **kwargs):
         super(RobertaTokenizer, self).__init__(bos_token=bos_token, eos_token=eos_token, unk_token=unk_token,
-                                               sep_token=sep_token, cls_token=cls_token, **kwargs)
+                                               sep_token=sep_token, cls_token=cls_token, pad_token=pad_token,
+                                               mask_token=mask_token, **kwargs)
 
         self.encoder = json.load(open(vocab_file, encoding="utf-8"))
         self.decoder = {v: k for k, v in self.encoder.items()}

From 9ce36e3e4b0b17dd6df05e13e563570677cda39e Mon Sep 17 00:00:00 2001
From: samvelyan <mika.samvelyan@gmail.com>
Date: Wed, 14 Aug 2019 08:57:09 +0000
Subject: [PATCH 089/200] Re-implemented tokenize() iteratively in
 PreTrainedTokenizer.

---
 pytorch_transformers/tokenization_utils.py | 42 ++++++++++++++++++----
 1 file changed, 36 insertions(+), 6 deletions(-)

diff --git a/pytorch_transformers/tokenization_utils.py b/pytorch_transformers/tokenization_utils.py
index 2e75c83bfb..bdeeeb4877 100644
--- a/pytorch_transformers/tokenization_utils.py
+++ b/pytorch_transformers/tokenization_utils.py
@@ -428,7 +428,7 @@ class PreTrainedTokenizer(object):
 
             Parameters:
                 special_tokens_dict: dict of string. Keys should be in the list of predefined special attributes: [``bos_token``, ``eos_token``, ``unk_token``, ``sep_token``, ``pad_token``, ``cls_token``, ``mask_token``, ``additional_special_tokens``].
-                
+
                     Tokens are only added if they are not already in the vocabulary (tested by checking if the tokenizer assign the index of the ``unk_token`` to them).
 
             Returns:
@@ -472,15 +472,45 @@ class PreTrainedTokenizer(object):
 
             Take care of added tokens.
         """
+        def split_on_token(tok, text):
+            result = []
+            split_text = text.split(tok)
+            for i, sub_text in enumerate(split_text):
+                sub_text = sub_text.strip()
+                if i == 0 and not sub_text:
+                    result += [tok]
+                elif i == len(split_text) - 1:
+                    if sub_text:
+                        result += [sub_text]
+                    else:
+                        pass
+                else:
+                    if sub_text:
+                        result += [sub_text]
+                    result += [tok]
+            return result
+
         def split_on_tokens(tok_list, text):
             if not text:
                 return []
             if not tok_list:
                 return self._tokenize(text, **kwargs)
-            tok = tok_list[0]
-            split_text = text.split(tok)
-            return sum((split_on_tokens(tok_list[1:], sub_text.strip()) + [tok] \
-                        for sub_text in split_text), [])[:-1]
+
+            tokenized_text = []
+            text_list = [text]
+            for tok in tok_list:
+                tokenized_text = []
+                for sub_text in text_list:
+                    if sub_text not in self.added_tokens_encoder \
+                            and sub_text not in self.all_special_tokens:
+                        tokenized_text += split_on_token(tok, sub_text)
+                    else:
+                        tokenized_text += [sub_text]
+                text_list = tokenized_text
+
+            return sum((self._tokenize(token, **kwargs) if token not \
+                    in self.added_tokens_encoder and token not in self.all_special_tokens \
+                    else [token] for token in tokenized_text), [])
 
         added_tokens = list(self.added_tokens_encoder.keys()) + self.all_special_tokens
         tokenized_text = split_on_tokens(added_tokens, text)
@@ -522,7 +552,7 @@ class PreTrainedTokenizer(object):
 
     def encode(self, text):
         """ Converts a string in a sequence of ids (integer), using the tokenizer and vocabulary.
-        
+
         Same doing ``self.convert_tokens_to_ids(self.tokenize(text))``.
         """
         return self.convert_tokens_to_ids(self.tokenize(text))

From c4ef1034474a1cad80674f1ce4c9fdaaa4d1f937 Mon Sep 17 00:00:00 2001
From: Julien Chaumond <chaumond@gmail.com>
Date: Wed, 14 Aug 2019 12:31:09 -0400
Subject: [PATCH 090/200] [RoBERTa] First 4 authors

cf. https://github.com/huggingface/pytorch-transformers/pull/964#discussion_r313574354

Co-Authored-By: Myle Ott <myleott@fb.com>
---
 README.md | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/README.md b/README.md
index f77934bbcc..f223394868 100644
--- a/README.md
+++ b/README.md
@@ -12,7 +12,7 @@ The library currently contains PyTorch implementations, pre-trained model weight
 4. **[Transformer-XL](https://github.com/kimiyoung/transformer-xl)** (from Google/CMU) released with the paper [Transformer-XL: Attentive Language Models Beyond a Fixed-Length Context](https://arxiv.org/abs/1901.02860) by Zihang Dai*, Zhilin Yang*, Yiming Yang, Jaime Carbonell, Quoc V. Le, Ruslan Salakhutdinov.
 5. **[XLNet](https://github.com/zihangdai/xlnet/)** (from Google/CMU) released with the paper [​XLNet: Generalized Autoregressive Pretraining for Language Understanding](https://arxiv.org/abs/1906.08237) by Zhilin Yang*, Zihang Dai*, Yiming Yang, Jaime Carbonell, Ruslan Salakhutdinov, Quoc V. Le.
 6. **[XLM](https://github.com/facebookresearch/XLM/)** (from Facebook) released together with the paper [Cross-lingual Language Model Pretraining](https://arxiv.org/abs/1901.07291) by Guillaume Lample and Alexis Conneau.
-7. **[RoBERTa](https://github.com/pytorch/fairseq/tree/master/examples/roberta)** (from Facebook), a [Robustly Optimized BERT Pretraining Approach](https://arxiv.org/abs/1907.11692) by Yinhan Liu, Myle Ott et al.
+7. **[RoBERTa](https://github.com/pytorch/fairseq/tree/master/examples/roberta)** (from Facebook), a [Robustly Optimized BERT Pretraining Approach](https://arxiv.org/abs/1907.11692) by Yinhan Liu, Myle Ott, Naman Goyal, Jingfei Du et al.
 
 These implementations have been tested on several datasets (see the example scripts) and should match the performances of the original implementations (e.g. ~93 F1 on SQuAD for BERT Whole-Word-Masking, ~88 F1 on RocStories for OpenAI GPT, ~18.3 perplexity on WikiText 103 for Transformer-XL, ~0.916 Peason R coefficient on STS-B for XLNet). You can find more details on the performances in the Examples section of the [documentation](https://huggingface.co/pytorch-transformers/examples.html).
 

From 572dcfd1db0bc18fbce8c14cef82de41fdae2465 Mon Sep 17 00:00:00 2001
From: LysandreJik <lysandre.debut@reseau.eseo.fr>
Date: Wed, 14 Aug 2019 14:56:14 -0400
Subject: [PATCH 091/200] Doc

---
 docs/source/index.rst                        |   1 +
 docs/source/model_doc/roberta.rst            |  36 ++++
 docs/source/pretrained_models.rst            | 196 ++++++++++---------
 pytorch_transformers/modeling_roberta.py     | 162 ++++++++++++++-
 pytorch_transformers/tokenization_roberta.py |   3 +-
 pytorch_transformers/tokenization_utils.py   |  54 +++--
 6 files changed, 327 insertions(+), 125 deletions(-)
 create mode 100644 docs/source/model_doc/roberta.rst

diff --git a/docs/source/index.rst b/docs/source/index.rst
index b613596331..37b3509fe4 100644
--- a/docs/source/index.rst
+++ b/docs/source/index.rst
@@ -47,3 +47,4 @@ The library currently contains PyTorch implementations, pre-trained model weight
     model_doc/gpt2
     model_doc/xlm
     model_doc/xlnet
+    model_doc/roberta
diff --git a/docs/source/model_doc/roberta.rst b/docs/source/model_doc/roberta.rst
new file mode 100644
index 0000000000..e2de917e35
--- /dev/null
+++ b/docs/source/model_doc/roberta.rst
@@ -0,0 +1,36 @@
+RoBERTa
+----------------------------------------------------
+
+``RobertaConfig``
+~~~~~~~~~~~~~~~~~~~~~
+
+.. autoclass:: pytorch_transformers.RobertaConfig
+    :members:
+
+
+``RobertaTokenizer``
+~~~~~~~~~~~~~~~~~~~~~
+
+.. autoclass:: pytorch_transformers.RobertaTokenizer
+    :members:
+
+
+``RobertaModel``
+~~~~~~~~~~~~~~~~~~~~
+
+.. autoclass:: pytorch_transformers.RobertaModel
+    :members:
+
+
+``RobertaForMaskedLM``
+~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+.. autoclass:: pytorch_transformers.RobertaForMaskedLM
+    :members:
+
+
+``RobertaForSequenceClassification``
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+.. autoclass:: pytorch_transformers.RobertaForSequenceClassification
+    :members:
diff --git a/docs/source/pretrained_models.rst b/docs/source/pretrained_models.rst
index b23a96ff7c..987882d12e 100644
--- a/docs/source/pretrained_models.rst
+++ b/docs/source/pretrained_models.rst
@@ -4,97 +4,109 @@ Pretrained models
 Here is the full list of the currently provided pretrained models together with a short presentation of each model.
 
 
-+-------------------+------------------------------------------------------------+---------------------------------------------------------------------------------------------------------------------------+
-| Architecture      | Shortcut name                                              | Details of the model                                                                                                      |
-+===================+============================================================+===========================================================================================================================+
-| BERT              | ``bert-base-uncased``                                      | 12-layer, 768-hidden, 12-heads, 110M parameters                                                                           |
-|                   |                                                            | Trained on lower-cased English text                                                                                       |
-|                   +------------------------------------------------------------+---------------------------------------------------------------------------------------------------------------------------+
-|                   | ``bert-large-uncased``                                     | 24-layer, 1024-hidden, 16-heads, 340M parameters                                                                          |
-|                   |                                                            | Trained on lower-cased English text                                                                                       |
-|                   +------------------------------------------------------------+---------------------------------------------------------------------------------------------------------------------------+
-|                   | ``bert-base-cased``                                        | 12-layer, 768-hidden, 12-heads, 110M parameters                                                                           |
-|                   |                                                            | Trained on cased English text                                                                                             |
-|                   +------------------------------------------------------------+---------------------------------------------------------------------------------------------------------------------------+
-|                   | ``bert-large-cased``                                       | 24-layer, 1024-hidden, 16-heads, 340M parameters                                                                          |
-|                   |                                                            | Trained on cased English text                                                                                             |
-|                   +------------------------------------------------------------+---------------------------------------------------------------------------------------------------------------------------+
-|                   | ``bert-base-multilingual-uncased``                         | (Original, not recommended) 12-layer, 768-hidden, 12-heads, 110M parameters                                               |
-|                   |                                                            | Trained on lower-cased text in the top 102 languages with the largest Wikipedias                                          |
-|                   |                                                            | (see `details <https://github.com/google-research/bert/blob/master/multilingual.md>`__)                                   |
-|                   +------------------------------------------------------------+---------------------------------------------------------------------------------------------------------------------------+
-|                   | ``bert-base-multilingual-cased``                           | (New, **recommended**) 12-layer, 768-hidden, 12-heads, 110M parameters                                                    |
-|                   |                                                            | Trained on cased text in the top 104 languages with the largest Wikipedias                                                |
-|                   |                                                            | (see `details <https://github.com/google-research/bert/blob/master/multilingual.md>`__)                                   |
-|                   +------------------------------------------------------------+---------------------------------------------------------------------------------------------------------------------------+
-|                   | ``bert-base-chinese``                                      | 12-layer, 768-hidden, 12-heads, 110M parameters                                                                           |
-|                   |                                                            | Trained on cased Chinese Simplified and Traditional text                                                                  |
-|                   +------------------------------------------------------------+---------------------------------------------------------------------------------------------------------------------------+
-|                   | ``bert-base-german-cased``                                 | 12-layer, 768-hidden, 12-heads, 110M parameters                                                                           |
-|                   |                                                            | Trained on cased German text by Deepset.ai                                                                                |
-|                   |                                                            | (see `details on deepset.ai website <https://deepset.ai/german-bert>`__)                                                  |
-|                   +------------------------------------------------------------+---------------------------------------------------------------------------------------------------------------------------+
-|                   | ``bert-large-uncased-whole-word-masking``                  | 24-layer, 1024-hidden, 16-heads, 340M parameters                                                                          |
-|                   |                                                            | Trained on lower-cased English text using Whole-Word-Masking                                                              |
-|                   |                                                            | (see `details <https://github.com/google-research/bert/#bert>`__)                                                         |
-|                   +------------------------------------------------------------+---------------------------------------------------------------------------------------------------------------------------+
-|                   | ``bert-large-cased-whole-word-masking``                    | 24-layer, 1024-hidden, 16-heads, 340M parameters                                                                          |
-|                   |                                                            | Trained on cased English text using Whole-Word-Masking                                                                    |
-|                   |                                                            | (see `details <https://github.com/google-research/bert/#bert>`__)                                                         |
-|                   +------------------------------------------------------------+---------------------------------------------------------------------------------------------------------------------------+
-|                   | ``bert-large-uncased-whole-word-masking-finetuned-squad``  | 24-layer, 1024-hidden, 16-heads, 340M parameters                                                                          |
-|                   |                                                            | The ``bert-large-uncased-whole-word-masking`` model fine-tuned on SQuAD (see details of fine-tuning in the                |
-|                   |                                                            | `example section <https://github.com/huggingface/pytorch-transformers/tree/master/examples>`__)                           |
-|                   +------------------------------------------------------------+---------------------------------------------------------------------------------------------------------------------------+
-|                   | ``bert-large-cased-whole-word-masking-finetuned-squad``    | 24-layer, 1024-hidden, 16-heads, 340M parameters                                                                          |
-|                   |                                                            | The ``bert-large-cased-whole-word-masking`` model fine-tuned on SQuAD                                                     |
-|                   |                                                            | (see `details of fine-tuning in the example section <https://huggingface.co/pytorch-transformers/examples.html>`__)       |
-|                   +------------------------------------------------------------+---------------------------------------------------------------------------------------------------------------------------+
-|                   | ``bert-base-cased-finetuned-mrpc``                         | 12-layer, 768-hidden, 12-heads, 110M parameters                                                                           |
-|                   |                                                            | The ``bert-base-cased`` model fine-tuned on MRPC                                                                          |
-|                   |                                                            | (see `details of fine-tuning in the example section <https://huggingface.co/pytorch-transformers/examples.html>`__)       |
-+-------------------+------------------------------------------------------------+---------------------------------------------------------------------------------------------------------------------------+
-| GPT               | ``openai-gpt``                                             | 12-layer, 768-hidden, 12-heads, 110M parameters                                                                           |
-|                   |                                                            | OpenAI GPT English model                                                                                                  |
-+-------------------+------------------------------------------------------------+---------------------------------------------------------------------------------------------------------------------------+
-| GPT-2             | ``gpt2``                                                   | 12-layer, 768-hidden, 12-heads, 117M parameters                                                                           |
-|                   |                                                            | OpenAI GPT-2 English model                                                                                                |
-|                   +------------------------------------------------------------+---------------------------------------------------------------------------------------------------------------------------+
-|                   | ``gpt2-medium``                                            | 24-layer, 1024-hidden, 16-heads, 345M parameters                                                                          |
-|                   |                                                            | OpenAI's Medium-sized GPT-2 English model                                                                                 |
-+-------------------+------------------------------------------------------------+---------------------------------------------------------------------------------------------------------------------------+
-| Transformer-XL    | ``transfo-xl-wt103``                                       | 18-layer, 1024-hidden, 16-heads, 257M parameters                                                                          |
-|                   |                                                            | English model trained on wikitext-103                                                                                     |
-+-------------------+------------------------------------------------------------+---------------------------------------------------------------------------------------------------------------------------+
-| XLNet             | ``xlnet-base-cased``                                       | 12-layer, 768-hidden, 12-heads, 110M parameters                                                                           |
-|                   |                                                            | XLNet English model                                                                                                       |
-|                   +------------------------------------------------------------+---------------------------------------------------------------------------------------------------------------------------+
-|                   | ``xlnet-large-cased``                                      | 24-layer, 1024-hidden, 16-heads, 340M parameters                                                                          |
-|                   |                                                            | XLNet Large English model                                                                                                 |
-+-------------------+------------------------------------------------------------+---------------------------------------------------------------------------------------------------------------------------+
-| XLM               | ``xlm-mlm-en-2048``                                        | 12-layer, 1024-hidden, 8-heads                                                                                            |
-|                   |                                                            | XLM English model                                                                                                         |
-|                   +------------------------------------------------------------+---------------------------------------------------------------------------------------------------------------------------+
-|                   | ``xlm-mlm-ende-1024``                                      | 12-layer, 1024-hidden, 8-heads                                                                                            |
-|                   |                                                            | XLM English-German Multi-language model                                                                                   |
-|                   +------------------------------------------------------------+---------------------------------------------------------------------------------------------------------------------------+
-|                   | ``xlm-mlm-enfr-1024``                                      | 12-layer, 1024-hidden, 8-heads                                                                                            |
-|                   |                                                            | XLM English-French Multi-language model                                                                                   |
-|                   +------------------------------------------------------------+---------------------------------------------------------------------------------------------------------------------------+
-|                   | ``xlm-mlm-enro-1024``                                      | 12-layer, 1024-hidden, 8-heads                                                                                            |
-|                   |                                                            | XLM English-Romanian Multi-language model                                                                                 |
-|                   +------------------------------------------------------------+---------------------------------------------------------------------------------------------------------------------------+
-|                   | ``xlm-mlm-xnli15-1024``                                    | 12-layer, 1024-hidden, 8-heads                                                                                            |
-|                   |                                                            | XLM Model pre-trained with MLM on the `15 XNLI languages <https://github.com/facebookresearch/XNLI>`__.                   |
-|                   +------------------------------------------------------------+---------------------------------------------------------------------------------------------------------------------------+
-|                   | ``xlm-mlm-tlm-xnli15-1024``                                | 12-layer, 1024-hidden, 8-heads                                                                                            |
-|                   |                                                            | XLM Model pre-trained with MLM + TLM on the `15 XNLI languages <https://github.com/facebookresearch/XNLI>`__.             |
-|                   +------------------------------------------------------------+---------------------------------------------------------------------------------------------------------------------------+
-|                   | ``xlm-clm-enfr-1024``                                      | 12-layer, 1024-hidden, 8-heads                                                                                            |
-|                   |                                                            | XLM English model trained with CLM (Causal Language Modeling)                                                             |
-|                   +------------------------------------------------------------+---------------------------------------------------------------------------------------------------------------------------+
-|                   | ``xlm-clm-ende-1024``                                      | 12-layer, 1024-hidden, 8-heads                                                                                            |
-|                   |                                                            | XLM English-German Multi-language model trained with CLM (Causal Language Modeling)                                       |
-+-------------------+------------------------------------------------------------+---------------------------------------------------------------------------------------------------------------------------+
++-------------------+------------------------------------------------------------+---------------------------------------------------------------------------------------------------------------------------------------+
+| Architecture      | Shortcut name                                              | Details of the model                                                                                                                  |
++===================+============================================================+=======================================================================================================================================+
+| BERT              | ``bert-base-uncased``                                      | | 12-layer, 768-hidden, 12-heads, 110M parameters.                                                                                    |
+|                   |                                                            | | Trained on lower-cased English text.                                                                                                |
+|                   +------------------------------------------------------------+---------------------------------------------------------------------------------------------------------------------------------------+
+|                   | ``bert-large-uncased``                                     | | 24-layer, 1024-hidden, 16-heads, 340M parameters.                                                                                   |
+|                   |                                                            | | Trained on lower-cased English text.                                                                                                |
+|                   +------------------------------------------------------------+---------------------------------------------------------------------------------------------------------------------------------------+
+|                   | ``bert-base-cased``                                        | | 12-layer, 768-hidden, 12-heads, 110M parameters.                                                                                    |
+|                   |                                                            | | Trained on cased English text.                                                                                                      |
+|                   +------------------------------------------------------------+---------------------------------------------------------------------------------------------------------------------------------------+
+|                   | ``bert-large-cased``                                       | | 24-layer, 1024-hidden, 16-heads, 340M parameters.                                                                                   |
+|                   |                                                            | | Trained on cased English text.                                                                                                      |
+|                   +------------------------------------------------------------+---------------------------------------------------------------------------------------------------------------------------------------+
+|                   | ``bert-base-multilingual-uncased``                         | | (Original, not recommended) 12-layer, 768-hidden, 12-heads, 110M parameters.                                                        |
+|                   |                                                            | | Trained on lower-cased text in the top 102 languages with the largest Wikipedias                                                    |
+|                   |                                                            | (see `details <https://github.com/google-research/bert/blob/master/multilingual.md>`__).                                              |
+|                   +------------------------------------------------------------+---------------------------------------------------------------------------------------------------------------------------------------+
+|                   | ``bert-base-multilingual-cased``                           | | (New, **recommended**) 12-layer, 768-hidden, 12-heads, 110M parameters.                                                             |
+|                   |                                                            | | Trained on cased text in the top 104 languages with the largest Wikipedias                                                          |
+|                   |                                                            | (see `details <https://github.com/google-research/bert/blob/master/multilingual.md>`__).                                              |
+|                   +------------------------------------------------------------+---------------------------------------------------------------------------------------------------------------------------------------+
+|                   | ``bert-base-chinese``                                      | | 12-layer, 768-hidden, 12-heads, 110M parameters.                                                                                    |
+|                   |                                                            | | Trained on cased Chinese Simplified and Traditional text.                                                                           |
+|                   +------------------------------------------------------------+---------------------------------------------------------------------------------------------------------------------------------------+
+|                   | ``bert-base-german-cased``                                 | | 12-layer, 768-hidden, 12-heads, 110M parameters.                                                                                    |
+|                   |                                                            | | Trained on cased German text by Deepset.ai                                                                                          |
+|                   |                                                            | (see `details on deepset.ai website <https://deepset.ai/german-bert>`__).                                                             |
+|                   +------------------------------------------------------------+---------------------------------------------------------------------------------------------------------------------------------------+
+|                   | ``bert-large-uncased-whole-word-masking``                  | | 24-layer, 1024-hidden, 16-heads, 340M parameters.                                                                                   |
+|                   |                                                            | | Trained on lower-cased English text using Whole-Word-Masking                                                                        |
+|                   |                                                            | (see `details <https://github.com/google-research/bert/#bert>`__).                                                                    |
+|                   +------------------------------------------------------------+---------------------------------------------------------------------------------------------------------------------------------------+
+|                   | ``bert-large-cased-whole-word-masking``                    | | 24-layer, 1024-hidden, 16-heads, 340M parameters.                                                                                   |
+|                   |                                                            | | Trained on cased English text using Whole-Word-Masking                                                                              |
+|                   |                                                            | (see `details <https://github.com/google-research/bert/#bert>`__).                                                                    |
+|                   +------------------------------------------------------------+---------------------------------------------------------------------------------------------------------------------------------------+
+|                   | ``bert-large-uncased-whole-word-masking-finetuned-squad``  | | 24-layer, 1024-hidden, 16-heads, 340M parameters.                                                                                   |
+|                   |                                                            | | The ``bert-large-uncased-whole-word-masking`` model fine-tuned on SQuAD                                                             |
+|                   |                                                            | (see details of fine-tuning in the `example section <https://github.com/huggingface/pytorch-transformers/tree/master/examples>`__).   |
+|                   +------------------------------------------------------------+---------------------------------------------------------------------------------------------------------------------------------------+
+|                   | ``bert-large-cased-whole-word-masking-finetuned-squad``    | | 24-layer, 1024-hidden, 16-heads, 340M parameters                                                                                    |
+|                   |                                                            | | The ``bert-large-cased-whole-word-masking`` model fine-tuned on SQuAD                                                               |
+|                   |                                                            | (see `details of fine-tuning in the example section <https://huggingface.co/pytorch-transformers/examples.html>`__)                   |
+|                   +------------------------------------------------------------+---------------------------------------------------------------------------------------------------------------------------------------+
+|                   | ``bert-base-cased-finetuned-mrpc``                         | | 12-layer, 768-hidden, 12-heads, 110M parameters.                                                                                    |
+|                   |                                                            | | The ``bert-base-cased`` model fine-tuned on MRPC                                                                                    |
+|                   |                                                            | (see `details of fine-tuning in the example section <https://huggingface.co/pytorch-transformers/examples.html>`__)                   |
++-------------------+------------------------------------------------------------+---------------------------------------------------------------------------------------------------------------------------------------+
+| GPT               | ``openai-gpt``                                             | | 12-layer, 768-hidden, 12-heads, 110M parameters.                                                                                    |
+|                   |                                                            | | OpenAI GPT English model                                                                                                            |
++-------------------+------------------------------------------------------------+---------------------------------------------------------------------------------------------------------------------------------------+
+| GPT-2             | ``gpt2``                                                   | | 12-layer, 768-hidden, 12-heads, 117M parameters.                                                                                    |
+|                   |                                                            | | OpenAI GPT-2 English model                                                                                                          |
+|                   +------------------------------------------------------------+---------------------------------------------------------------------------------------------------------------------------------------+
+|                   | ``gpt2-medium``                                            | | 24-layer, 1024-hidden, 16-heads, 345M parameters.                                                                                   |
+|                   |                                                            | | OpenAI's Medium-sized GPT-2 English model                                                                                           |
++-------------------+------------------------------------------------------------+---------------------------------------------------------------------------------------------------------------------------------------+
+| Transformer-XL    | ``transfo-xl-wt103``                                       | | 18-layer, 1024-hidden, 16-heads, 257M parameters.                                                                                   |
+|                   |                                                            | | English model trained on wikitext-103                                                                                               |
++-------------------+------------------------------------------------------------+---------------------------------------------------------------------------------------------------------------------------------------+
+| XLNet             | ``xlnet-base-cased``                                       | | 12-layer, 768-hidden, 12-heads, 110M parameters.                                                                                    |
+|                   |                                                            | | XLNet English model                                                                                                                 |
+|                   +------------------------------------------------------------+---------------------------------------------------------------------------------------------------------------------------------------+
+|                   | ``xlnet-large-cased``                                      | | 24-layer, 1024-hidden, 16-heads, 340M parameters.                                                                                   |
+|                   |                                                            | | XLNet Large English model                                                                                                           |
++-------------------+------------------------------------------------------------+---------------------------------------------------------------------------------------------------------------------------------------+
+| XLM               | ``xlm-mlm-en-2048``                                        | | 12-layer, 1024-hidden, 8-heads                                                                                                      |
+|                   |                                                            | | XLM English model                                                                                                                   |
+|                   +------------------------------------------------------------+---------------------------------------------------------------------------------------------------------------------------------------+
+|                   | ``xlm-mlm-ende-1024``                                      | | 12-layer, 1024-hidden, 8-heads                                                                                                      |
+|                   |                                                            | | XLM English-German Multi-language model                                                                                             |
+|                   +------------------------------------------------------------+---------------------------------------------------------------------------------------------------------------------------------------+
+|                   | ``xlm-mlm-enfr-1024``                                      | | 12-layer, 1024-hidden, 8-heads                                                                                                      |
+|                   |                                                            | | XLM English-French Multi-language model                                                                                             |
+|                   +------------------------------------------------------------+---------------------------------------------------------------------------------------------------------------------------------------+
+|                   | ``xlm-mlm-enro-1024``                                      | | 12-layer, 1024-hidden, 8-heads                                                                                                      |
+|                   |                                                            | | XLM English-Romanian Multi-language model                                                                                           |
+|                   +------------------------------------------------------------+---------------------------------------------------------------------------------------------------------------------------------------+
+|                   | ``xlm-mlm-xnli15-1024``                                    | | 12-layer, 1024-hidden, 8-heads                                                                                                      |
+|                   |                                                            | | XLM Model pre-trained with MLM on the `15 XNLI languages <https://github.com/facebookresearch/XNLI>`__.                             |
+|                   +------------------------------------------------------------+---------------------------------------------------------------------------------------------------------------------------------------+
+|                   | ``xlm-mlm-tlm-xnli15-1024``                                | | 12-layer, 1024-hidden, 8-heads                                                                                                      |
+|                   |                                                            | | XLM Model pre-trained with MLM + TLM on the `15 XNLI languages <https://github.com/facebookresearch/XNLI>`__.                       |
+|                   +------------------------------------------------------------+---------------------------------------------------------------------------------------------------------------------------------------+
+|                   | ``xlm-clm-enfr-1024``                                      | | 12-layer, 1024-hidden, 8-heads                                                                                                      |
+|                   |                                                            | | XLM English model trained with CLM (Causal Language Modeling)                                                                       |
+|                   +------------------------------------------------------------+---------------------------------------------------------------------------------------------------------------------------------------+
+|                   | ``xlm-clm-ende-1024``                                      | | 12-layer, 1024-hidden, 8-heads                                                                                                      |
+|                   |                                                            | | XLM English-German Multi-language model trained with CLM (Causal Language Modeling)                                                 |
++-------------------+------------------------------------------------------------+---------------------------------------------------------------------------------------------------------------------------------------+
+| RoBERTa           | ``roberta-base``                                           | | 12-layer, 768-hidden, 12-heads, 125M parameters                                                                                     |
+|                   |                                                            | | RoBERTa using the BERT-base architecture                                                                                            |
+|                   |                                                            | (see `details <https://github.com/pytorch/fairseq/tree/master/examples/roberta>`__)                                                   |
+|                   +------------------------------------------------------------+---------------------------------------------------------------------------------------------------------------------------------------+
+|                   | ``roberta-large``                                          | | 24-layer, 1024-hidden, 16-heads, 355M parameters                                                                                    |
+|                   |                                                            | | RoBERTa using the BERT-large architecture                                                                                           |
+|                   |                                                            | (see `details <https://github.com/pytorch/fairseq/tree/master/examples/roberta>`__)                                                   |
+|                   +------------------------------------------------------------+---------------------------------------------------------------------------------------------------------------------------------------+
+|                   | ``roberta-large-mnli``                                     | | 24-layer, 1024-hidden, 16-heads, 355M parameters                                                                                    |
+|                   |                                                            | | ``roberta-large`` fine-tuned on `MNLI <http://www.nyu.edu/projects/bowman/multinli/>`__.                                            |
+|                   |                                                            | (see `details <https://github.com/pytorch/fairseq/tree/master/examples/roberta>`__)                                                   |
++-------------------+------------------------------------------------------------+---------------------------------------------------------------------------------------------------------------------------------------+
 
 .. <https://huggingface.co/pytorch-transformers/examples.html>`__
\ No newline at end of file
diff --git a/pytorch_transformers/modeling_roberta.py b/pytorch_transformers/modeling_roberta.py
index 6cd4bc2d35..ebf701ead6 100644
--- a/pytorch_transformers/modeling_roberta.py
+++ b/pytorch_transformers/modeling_roberta.py
@@ -29,6 +29,8 @@ from pytorch_transformers.modeling_bert import (BertConfig, BertEmbeddings,
                                                 BertLayerNorm, BertModel,
                                                 BertPreTrainedModel, gelu)
 
+from pytorch_transformers.modeling_utils import add_start_docstrings
+
 logger = logging.getLogger(__name__)
 
 ROBERTA_PRETRAINED_MODEL_ARCHIVE_MAP = {
@@ -65,11 +67,93 @@ class RobertaEmbeddings(BertEmbeddings):
 class RobertaConfig(BertConfig):
     pretrained_config_archive_map = ROBERTA_PRETRAINED_CONFIG_ARCHIVE_MAP
 
+
+ROBERTA_START_DOCSTRING = r"""    The RoBERTa model was proposed in
+    `RoBERTa: A Robustly Optimized BERT Pretraining Approach`_
+    by Yinhan Liu, Myle Ott, Naman Goyal, Jingfei Du, Mandar Joshi, Danqi Chen, Omer Levy, Mike Lewis, Luke Zettlemoyer,
+    Veselin Stoyanov. It is based on Google's BERT model released in 2018.
+    
+    It builds on BERT and modifies key hyperparameters, removing the next-sentence pretraining
+    objective and training with much larger mini-batches and learning rates.
+    
+    This implementation is the same as BertModel with a tiny embeddings tweak as well as a setup for Roberta pretrained 
+    models.
+
+    This model is a PyTorch `torch.nn.Module`_ sub-class. Use it as a regular PyTorch Module and
+    refer to the PyTorch documentation for all matter related to general usage and behavior.
+
+    .. _`RoBERTa: A Robustly Optimized BERT Pretraining Approach`:
+        https://arxiv.org/abs/1907.11692
+
+    .. _`torch.nn.Module`:
+        https://pytorch.org/docs/stable/nn.html#module
+
+    Parameters:
+        config (:class:`~pytorch_transformers.RobertaConfig`): Model configuration class with all the parameters of the 
+            model.
+"""
+
+ROBERTA_INPUTS_DOCSTRING = r"""
+    Inputs:
+        **input_ids**: ``torch.LongTensor`` of shape ``(batch_size, sequence_length)``:
+            Indices of input sequence tokens in the vocabulary.
+            To match pre-training, RoBERTa input sequence should be formatted with [CLS] and [SEP] tokens as follows:
+
+            (a) For sequence pairs:
+
+                ``tokens:         [CLS] is this jack ##son ##ville ? [SEP][SEP] no it is not . [SEP]``
+
+            (b) For single sequences:
+
+                ``tokens:         [CLS] the dog is hairy . [SEP]``
+
+            Fully encoded sequences or sequence pairs can be obtained using the RobertaTokenizer.encode function with 
+            the ``add_special_tokens`` parameter set to ``True``.
+            See :func:`pytorch_transformers.PreTrainedTokenizer.encode` and
+            :func:`pytorch_transformers.PreTrainedTokenizer.convert_tokens_to_ids` for details.
+        **position_ids**: (`optional`) ``torch.LongTensor`` of shape ``(batch_size, sequence_length)``:
+            Indices of positions of each input sequence tokens in the position embeddings.
+            Selected in the range ``[0, config.max_position_embeddings - 1[``.
+        **attention_mask**: (`optional`) ``torch.FloatTensor`` of shape ``(batch_size, sequence_length)``:
+            Mask to avoid performing attention on padding token indices.
+            Mask values selected in ``[0, 1]``:
+            ``1`` for tokens that are NOT MASKED, ``0`` for MASKED tokens.
+        **head_mask**: (`optional`) ``torch.FloatTensor`` of shape ``(num_heads,)`` or ``(num_layers, num_heads)``:
+            Mask to nullify selected heads of the self-attention modules.
+            Mask values selected in ``[0, 1]``:
+            ``1`` indicates the head is **not masked**, ``0`` indicates the head is **masked**.
+"""
+
+@add_start_docstrings("The bare RoBERTa Model transformer outputing raw hidden-states without any specific head on top.",
+                      ROBERTA_START_DOCSTRING, ROBERTA_INPUTS_DOCSTRING)
 class RobertaModel(BertModel):
-    """
-    Same as BertModel with:
-    - a tiny embeddings tweak.
-    - setup for Roberta pretrained models
+    r"""
+    Outputs: `Tuple` comprising various elements depending on the configuration (config) and inputs:
+        **last_hidden_state**: ``torch.FloatTensor`` of shape ``(batch_size, sequence_length, hidden_size)``
+            Sequence of hidden-states at the output of the last layer of the model.
+        **pooler_output**: ``torch.FloatTensor`` of shape ``(batch_size, hidden_size)``
+            Last layer hidden-state of the first token of the sequence (classification token)
+            further processed by a Linear layer and a Tanh activation function. The Linear
+            layer weights are trained from the next sentence prediction (classification)
+            objective during Bert pretraining. This output is usually *not* a good summary
+            of the semantic content of the input, you're often better with averaging or pooling
+            the sequence of hidden-states for the whole input sequence.
+        **hidden_states**: (`optional`, returned when ``config.output_hidden_states=True``)
+            list of ``torch.FloatTensor`` (one for the output of each layer + the output of the embeddings)
+            of shape ``(batch_size, sequence_length, hidden_size)``:
+            Hidden-states of the model at the output of each layer plus the initial embedding outputs.
+        **attentions**: (`optional`, returned when ``config.output_attentions=True``)
+            list of ``torch.FloatTensor`` (one for each layer) of shape ``(batch_size, num_heads, sequence_length, sequence_length)``:
+            Attentions weights after the attention softmax, used to compute the weighted average in the self-attention heads.
+
+    Examples::
+
+        tokenizer = RobertaTokenizer.from_pretrained('roberta-base')
+        model = RobertaModel.from_pretrained('roberta-base')
+        input_ids = torch.tensor(tokenizer.encode("Hello, my dog is cute")).unsqueeze(0)  # Batch size 1
+        outputs = model(input_ids)
+        last_hidden_states = outputs[0]  # The last hidden-state is the first element of the output tuple
+
     """
     config_class = RobertaConfig
     pretrained_model_archive_map = ROBERTA_PRETRAINED_MODEL_ARCHIVE_MAP
@@ -82,9 +166,37 @@ class RobertaModel(BertModel):
         self.apply(self.init_weights)
 
 
+@add_start_docstrings("""RoBERTa Model with a `language modeling` head on top. """,
+    ROBERTA_START_DOCSTRING, ROBERTA_INPUTS_DOCSTRING)
 class RobertaForMaskedLM(BertPreTrainedModel):
-    """
-    Roberta Model with a `language modeling` head on top.
+    r"""
+        **masked_lm_labels**: (`optional`) ``torch.LongTensor`` of shape ``(batch_size, sequence_length)``:
+            Labels for computing the masked language modeling loss.
+            Indices should be in ``[-1, 0, ..., config.vocab_size]`` (see ``input_ids`` docstring)
+            Tokens with indices set to ``-1`` are ignored (masked), the loss is only computed for the tokens with labels
+            in ``[0, ..., config.vocab_size]``
+
+    Outputs: `Tuple` comprising various elements depending on the configuration (config) and inputs:
+        **loss**: (`optional`, returned when ``masked_lm_labels`` is provided) ``torch.FloatTensor`` of shape ``(1,)``:
+            Masked language modeling loss.
+        **prediction_scores**: ``torch.FloatTensor`` of shape ``(batch_size, sequence_length, config.vocab_size)``
+            Prediction scores of the language modeling head (scores for each vocabulary token before SoftMax).
+        **hidden_states**: (`optional`, returned when ``config.output_hidden_states=True``)
+            list of ``torch.FloatTensor`` (one for the output of each layer + the output of the embeddings)
+            of shape ``(batch_size, sequence_length, hidden_size)``:
+            Hidden-states of the model at the output of each layer plus the initial embedding outputs.
+        **attentions**: (`optional`, returned when ``config.output_attentions=True``)
+            list of ``torch.FloatTensor`` (one for each layer) of shape ``(batch_size, num_heads, sequence_length, sequence_length)``:
+            Attentions weights after the attention softmax, used to compute the weighted average in the self-attention heads.
+
+    Examples::
+
+        tokenizer = RobertaTokenizer.from_pretrained('roberta-base')
+        model = RobertaForMaskedLM.from_pretrained('roberta-base')
+        input_ids = torch.tensor(tokenizer.encode("Hello, my dog is cute")).unsqueeze(0)  # Batch size 1
+        outputs = model(input_ids, masked_lm_labels=input_ids)
+        loss, prediction_scores = outputs[:2]
+
     """
     config_class = RobertaConfig
     pretrained_model_archive_map = ROBERTA_PRETRAINED_MODEL_ARCHIVE_MAP
@@ -112,14 +224,14 @@ class RobertaForMaskedLM(BertPreTrainedModel):
         sequence_output = outputs[0]
         prediction_scores = self.lm_head(sequence_output)
 
-        outputs = (prediction_scores,) + outputs[2:]
+        outputs = (prediction_scores,) + outputs[2:]  # Add hidden states and attention if they are here
 
         if masked_lm_labels is not None:
             loss_fct = CrossEntropyLoss(ignore_index=-1)
             masked_lm_loss = loss_fct(prediction_scores.view(-1, self.config.vocab_size), masked_lm_labels.view(-1))
             outputs = (masked_lm_loss,) + outputs
 
-        return outputs
+        return outputs  # (masked_lm_loss), prediction_scores, (hidden_states), (attentions)
 
 
 class RobertaLMHead(nn.Module):
@@ -144,9 +256,39 @@ class RobertaLMHead(nn.Module):
         return x
 
 
+@add_start_docstrings("""RoBERTa Model transformer with a sequence classification/regression head on top (a linear layer 
+    on top of the pooled output) e.g. for GLUE tasks. """,
+    ROBERTA_START_DOCSTRING, ROBERTA_INPUTS_DOCSTRING)
 class RobertaForSequenceClassification(BertPreTrainedModel):
-    """
-    Roberta Model with a classifier head on top.
+    r"""
+        **labels**: (`optional`) ``torch.LongTensor`` of shape ``(batch_size,)``:
+            Labels for computing the sequence classification/regression loss.
+            Indices should be in ``[0, ..., config.num_labels]``.
+            If ``config.num_labels == 1`` a regression loss is computed (Mean-Square loss),
+            If ``config.num_labels > 1`` a classification loss is computed (Cross-Entropy).
+
+    Outputs: `Tuple` comprising various elements depending on the configuration (config) and inputs:
+        **loss**: (`optional`, returned when ``labels`` is provided) ``torch.FloatTensor`` of shape ``(1,)``:
+            Classification (or regression if config.num_labels==1) loss.
+        **logits**: ``torch.FloatTensor`` of shape ``(batch_size, config.num_labels)``
+            Classification (or regression if config.num_labels==1) scores (before SoftMax).
+        **hidden_states**: (`optional`, returned when ``config.output_hidden_states=True``)
+            list of ``torch.FloatTensor`` (one for the output of each layer + the output of the embeddings)
+            of shape ``(batch_size, sequence_length, hidden_size)``:
+            Hidden-states of the model at the output of each layer plus the initial embedding outputs.
+        **attentions**: (`optional`, returned when ``config.output_attentions=True``)
+            list of ``torch.FloatTensor`` (one for each layer) of shape ``(batch_size, num_heads, sequence_length, sequence_length)``:
+            Attentions weights after the attention softmax, used to compute the weighted average in the self-attention heads.
+
+    Examples::
+
+        tokenizer = RoertaTokenizer.from_pretrained('roberta-base')
+        model = RobertaForSequenceClassification.from_pretrained('roberta-base')
+        input_ids = torch.tensor(tokenizer.encode("Hello, my dog is cute")).unsqueeze(0)  # Batch size 1
+        labels = torch.tensor([1]).unsqueeze(0)  # Batch size 1
+        outputs = model(input_ids, labels=labels)
+        loss, logits = outputs[:2]
+
     """
     config_class = RobertaConfig
     pretrained_model_archive_map = ROBERTA_PRETRAINED_MODEL_ARCHIVE_MAP
diff --git a/pytorch_transformers/tokenization_roberta.py b/pytorch_transformers/tokenization_roberta.py
index 1db8013183..edf4717c89 100644
--- a/pytorch_transformers/tokenization_roberta.py
+++ b/pytorch_transformers/tokenization_roberta.py
@@ -65,8 +65,7 @@ PRETRAINED_POSITIONAL_EMBEDDINGS_SIZES = {
 
 class RobertaTokenizer(PreTrainedTokenizer):
     """
-    RoBERTa BPE tokenizer, derived from the GPT-2 tokenizer. Peculiarities:
-        - Byte-level BPE
+    RoBERTa BPE tokenizer, derived from the GPT-2 tokenizer. Peculiarities: Byte-level BPE
     """
     vocab_files_names = VOCAB_FILES_NAMES
     pretrained_vocab_files_map = PRETRAINED_VOCAB_FILES_MAP
diff --git a/pytorch_transformers/tokenization_utils.py b/pytorch_transformers/tokenization_utils.py
index 7bb9fd9d29..74d50b385d 100644
--- a/pytorch_transformers/tokenization_utils.py
+++ b/pytorch_transformers/tokenization_utils.py
@@ -180,9 +180,10 @@ class PreTrainedTokenizer(object):
 
     @classmethod
     def from_pretrained(cls, *inputs, **kwargs):
-        r""" Instantiate a :class:`~pytorch_transformers.PreTrainedTokenizer` (or a derived class) from a predefined tokenizer.
+        r"""
+        Instantiate a :class:`~pytorch_transformers.PreTrainedTokenizer` (or a derived class) from a predefined tokenizer.
 
-        Parameters:
+        Args:
             pretrained_model_name_or_path: either:
 
                 - a string with the `shortcut name` of a predefined tokenizer to load from cache or download, e.g.: ``bert-base-uncased``.
@@ -383,14 +384,15 @@ class PreTrainedTokenizer(object):
 
 
     def add_tokens(self, new_tokens):
-        """ Add a list of new tokens to the tokenizer class. If the new tokens are not in the
+        """
+        Add a list of new tokens to the tokenizer class. If the new tokens are not in the
         vocabulary, they are added to it with indices starting from length of the current vocabulary.
 
-            Parameters:
-                new_tokens: list of string. Each string is a token to add. Tokens are only added if they are not already in the vocabulary (tested by checking if the tokenizer assign the index of the ``unk_token`` to them).
+        Args:
+            new_tokens: list of string. Each string is a token to add. Tokens are only added if they are not already in the vocabulary (tested by checking if the tokenizer assign the index of the ``unk_token`` to them).
 
-            Returns:
-                Number of tokens added to the vocabulary.
+        Returns:
+            Number of tokens added to the vocabulary.
 
         Examples::
 
@@ -422,17 +424,20 @@ class PreTrainedTokenizer(object):
 
 
     def add_special_tokens(self, special_tokens_dict):
-        """ Add a dictionary of special tokens (eos, pad, cls...) to the encoder and link them
-            to class attributes. If special tokens are NOT in the vocabulary, they are added
-            to it (indexed starting from the last index of the current vocabulary).
+        """
+        Add a dictionary of special tokens (eos, pad, cls...) to the encoder and link them
+        to class attributes. If special tokens are NOT in the vocabulary, they are added
+        to it (indexed starting from the last index of the current vocabulary).
 
-            Parameters:
-                special_tokens_dict: dict of string. Keys should be in the list of predefined special attributes: [``bos_token``, ``eos_token``, ``unk_token``, ``sep_token``, ``pad_token``, ``cls_token``, ``mask_token``, ``additional_special_tokens``].
-                
-                    Tokens are only added if they are not already in the vocabulary (tested by checking if the tokenizer assign the index of the ``unk_token`` to them).
+        Args:
+            special_tokens_dict: dict of string. Keys should be in the list of predefined special attributes:
+                [``bos_token``, ``eos_token``, ``unk_token``, ``sep_token``, ``pad_token``, ``cls_token``, ``mask_token``,
+                ``additional_special_tokens``].
 
-            Returns:
-                Number of tokens added to the vocabulary.
+                Tokens are only added if they are not already in the vocabulary (tested by checking if the tokenizer assign the index of the ``unk_token`` to them).
+
+        Returns:
+            Number of tokens added to the vocabulary.
 
         Examples::
 
@@ -520,9 +525,16 @@ class PreTrainedTokenizer(object):
         raise NotImplementedError
 
     def encode(self, text, text_pair=None, add_special_tokens=False):
-        """ Converts a string in a sequence of ids (integer), using the tokenizer and vocabulary.
+        """
+        Converts a string in a sequence of ids (integer), using the tokenizer and vocabulary.
         
-        Same doing ``self.convert_tokens_to_ids(self.tokenize(text))``.
+        Same as doing ``self.convert_tokens_to_ids(self.tokenize(text))``.
+
+        Args:
+            text: The first sequence to be encoded.
+            text_pair: Optional second sequence to be encoded.
+            add_special_tokens: if set to ``True``, the sequences will be encoded with the special tokens relative
+                to their model.
         """
         if text_pair is None:
             if add_special_tokens:
@@ -577,9 +589,9 @@ class PreTrainedTokenizer(object):
         return ' '.join(self.convert_ids_to_tokens(tokens))
 
     def decode(self, token_ids, skip_special_tokens=False, clean_up_tokenization_spaces=True):
-        """ Converts a sequence of ids (integer) in a string, using the tokenizer and vocabulary
-            with options to remove special tokens and clean up tokenization spaces.
-
+        """
+        Converts a sequence of ids (integer) in a string, using the tokenizer and vocabulary
+        with options to remove special tokens and clean up tokenization spaces.
         Similar to doing ``self.convert_tokens_to_string(self.convert_ids_to_tokens(token_ids))``.
         """
         filtered_tokens = self.convert_ids_to_tokens(token_ids, skip_special_tokens=skip_special_tokens)

From 8308170156bdf41134fd0a8027f63f57f9e6a8d6 Mon Sep 17 00:00:00 2001
From: LysandreJik <lysandre.debut@reseau.eseo.fr>
Date: Thu, 15 Aug 2019 10:29:04 -0400
Subject: [PATCH 092/200] Warning for RoBERTa sequences encoded without special
 tokens.

---
 pytorch_transformers/modeling_roberta.py | 7 +++++++
 1 file changed, 7 insertions(+)

diff --git a/pytorch_transformers/modeling_roberta.py b/pytorch_transformers/modeling_roberta.py
index ebf701ead6..adb04b4b3a 100644
--- a/pytorch_transformers/modeling_roberta.py
+++ b/pytorch_transformers/modeling_roberta.py
@@ -165,6 +165,13 @@ class RobertaModel(BertModel):
         self.embeddings = RobertaEmbeddings(config)
         self.apply(self.init_weights)
 
+    def forward(self, input_ids, token_type_ids=None, attention_mask=None, position_ids=None, head_mask=None):
+        if input_ids[:, 0].sum().item() != 0:
+            logger.warning("A sequence with no special tokens has been passed to the RoBERTa model. "
+                           "This model requires special tokens in order to work. "
+                           "Please specify add_special_tokens=True in your encoding.")
+        return super(RobertaModel, self).forward(input_ids, token_type_ids, attention_mask, position_ids, head_mask)
+
 
 @add_start_docstrings("""RoBERTa Model with a `language modeling` head on top. """,
     ROBERTA_START_DOCSTRING, ROBERTA_INPUTS_DOCSTRING)

From fe02e45e488a4f067605cf9768171358de9726d3 Mon Sep 17 00:00:00 2001
From: LysandreJik <lysandre.debut@reseau.eseo.fr>
Date: Thu, 15 Aug 2019 11:15:08 -0400
Subject: [PATCH 093/200] Release: 1.1.0

---
 pytorch_transformers/__init__.py | 2 +-
 setup.py                         | 6 +++---
 2 files changed, 4 insertions(+), 4 deletions(-)

diff --git a/pytorch_transformers/__init__.py b/pytorch_transformers/__init__.py
index 38423de14b..62e3b8c47b 100644
--- a/pytorch_transformers/__init__.py
+++ b/pytorch_transformers/__init__.py
@@ -1,4 +1,4 @@
-__version__ = "1.0.0"
+__version__ = "1.1.0"
 from .tokenization_auto import AutoTokenizer
 from .tokenization_bert import BertTokenizer, BasicTokenizer, WordpieceTokenizer
 from .tokenization_openai import OpenAIGPTTokenizer
diff --git a/setup.py b/setup.py
index 4c23714980..c9f80fc224 100644
--- a/setup.py
+++ b/setup.py
@@ -38,10 +38,10 @@ from setuptools import find_packages, setup
 
 setup(
     name="pytorch_transformers",
-    version="1.0.0",
-    author="Thomas Wolf, Lysandre Debut, Victor Sanh, Tim Rault, Google AI Language Team Authors, Open AI team Authors",
+    version="1.1.0",
+    author="Thomas Wolf, Lysandre Debut, Victor Sanh, Julien Chaumond, Google AI Language Team Authors, Open AI team Authors",
     author_email="thomas@huggingface.co",
-    description="Repository of pre-trained NLP Transformer models: BERT, GPT & GPT-2, Transformer-XL, XLNet and XLM",
+    description="Repository of pre-trained NLP Transformer models: BERT & RoBERTa, GPT & GPT-2, Transformer-XL, XLNet and XLM",
     long_description=open("README.md", "r", encoding='utf-8').read(),
     long_description_content_type="text/markdown",
     keywords='NLP deep learning transformer pytorch BERT GPT GPT-2 google openai CMU',

From e24e19ce3bbbc3fe317e4d277b919cd1cb31fc47 Mon Sep 17 00:00:00 2001
From: LysandreJik <lysandre.debut@reseau.eseo.fr>
Date: Thu, 15 Aug 2019 14:02:11 -0400
Subject: [PATCH 094/200] Added RoBERTa to AutoModel/AutoConfig

---
 pytorch_transformers/modeling_auto.py | 33 +++++++++++++++++----------
 1 file changed, 21 insertions(+), 12 deletions(-)

diff --git a/pytorch_transformers/modeling_auto.py b/pytorch_transformers/modeling_auto.py
index 64b151e3a3..47c37a57d6 100644
--- a/pytorch_transformers/modeling_auto.py
+++ b/pytorch_transformers/modeling_auto.py
@@ -29,6 +29,7 @@ from .modeling_gpt2 import GPT2Config, GPT2Model
 from .modeling_transfo_xl import TransfoXLConfig, TransfoXLModel
 from .modeling_xlnet import XLNetConfig, XLNetModel
 from .modeling_xlm import XLMConfig, XLMModel
+from .modeling_roberta import RobertaConfig, RobertaModel
 
 from .modeling_utils import PreTrainedModel, SequenceSummary
 
@@ -51,6 +52,7 @@ class AutoConfig(object):
             - contains `transfo-xl`: TransfoXLConfig (Transformer-XL model)
             - contains `xlnet`: XLNetConfig (XLNet model)
             - contains `xlm`: XLMConfig (XLM model)
+            - contains `roberta`: RobertaConfig (RoBERTa model)
 
         This class cannot be instantiated using `__init__()` (throw an error).
     """
@@ -71,6 +73,7 @@ class AutoConfig(object):
             - contains `transfo-xl`: TransfoXLConfig (Transformer-XL model)
             - contains `xlnet`: XLNetConfig (XLNet model)
             - contains `xlm`: XLMConfig (XLM model)
+            - contains `roberta`: RobertaConfig (RoBERTa model)
 
         Params:
             **pretrained_model_name_or_path**: either:
@@ -119,6 +122,8 @@ class AutoConfig(object):
             return XLNetConfig.from_pretrained(pretrained_model_name_or_path, **kwargs)
         elif 'xlm' in pretrained_model_name_or_path:
             return XLMConfig.from_pretrained(pretrained_model_name_or_path, **kwargs)
+        elif 'roberta' in pretrained_model_name_or_path:
+            return RobertaConfig.from_pretrained(pretrained_model_name_or_path, **kwargs)
 
         raise ValueError("Unrecognized model identifier in {}. Should contains one of "
                          "'bert', 'openai-gpt', 'gpt2', 'transfo-xl', 'xlnet', "
@@ -137,12 +142,13 @@ class AutoModel(object):
 
         The base model class to instantiate is selected as the first pattern matching
         in the `pretrained_model_name_or_path` string (in the following order):
-            - contains `bert`: BertConfig (Bert model)
-            - contains `openai-gpt`: OpenAIGPTConfig (OpenAI GPT model)
-            - contains `gpt2`: GPT2Config (OpenAI GPT-2 model)
-            - contains `transfo-xl`: TransfoXLConfig (Transformer-XL model)
-            - contains `xlnet`: XLNetConfig (XLNet model)
-            - contains `xlm`: XLMConfig (XLM model)
+            - contains `bert`: BertModel (Bert model)
+            - contains `openai-gpt`: OpenAIGPTModel (OpenAI GPT model)
+            - contains `gpt2`: GPT2Model (OpenAI GPT-2 model)
+            - contains `transfo-xl`: TransfoXLModel (Transformer-XL model)
+            - contains `xlnet`: XLNetModel (XLNet model)
+            - contains `xlm`: XLMModel (XLM model)
+            - contains `roberta`: RobertaModel (RoBERTa model)
 
         This class cannot be instantiated using `__init__()` (throw an error).
     """
@@ -157,12 +163,13 @@ class AutoModel(object):
 
         The base model class to instantiate is selected as the first pattern matching
         in the `pretrained_model_name_or_path` string (in the following order):
-            - contains `bert`: BertConfig (Bert model)
-            - contains `openai-gpt`: OpenAIGPTConfig (OpenAI GPT model)
-            - contains `gpt2`: GPT2Config (OpenAI GPT-2 model)
-            - contains `transfo-xl`: TransfoXLConfig (Transformer-XL model)
-            - contains `xlnet`: XLNetConfig (XLNet model)
-            - contains `xlm`: XLMConfig (XLM model)
+            - contains `bert`: BertModel (Bert model)
+            - contains `openai-gpt`: OpenAIGPTModel (OpenAI GPT model)
+            - contains `gpt2`: GPT2Model (OpenAI GPT-2 model)
+            - contains `transfo-xl`: TransfoXLModel (Transformer-XL model)
+            - contains `xlnet`: XLNetModel (XLNet model)
+            - contains `xlm`: XLMModel (XLM model)
+            - contains `roberta`: RobertaModel (RoBERTa model)
 
             The model is set in evaluation mode by default using `model.eval()` (Dropout modules are deactivated)
             To train the model, you should first set it back in training mode with `model.train()`
@@ -230,6 +237,8 @@ class AutoModel(object):
             return XLNetModel.from_pretrained(pretrained_model_name_or_path, *model_args, **kwargs)
         elif 'xlm' in pretrained_model_name_or_path:
             return XLMModel.from_pretrained(pretrained_model_name_or_path, *model_args, **kwargs)
+        elif 'roberta' in pretrained_model_name_or_path:
+            return RobertaModel.from_pretrained(pretrained_model_name_or_path, *model_args, **kwargs)
 
         raise ValueError("Unrecognized model identifier in {}. Should contains one of "
                          "'bert', 'openai-gpt', 'gpt2', 'transfo-xl', 'xlnet', "

From 83dba0b67bd8d142e830eab7aa6538b4dc50e1ef Mon Sep 17 00:00:00 2001
From: LysandreJik <lysandre.debut@reseau.eseo.fr>
Date: Thu, 15 Aug 2019 17:07:07 -0400
Subject: [PATCH 095/200] Added RoBERTa tokenizer to AutoTokenizer

---
 pytorch_transformers/modeling_auto.py     | 4 ++--
 pytorch_transformers/tokenization_auto.py | 7 ++++++-
 2 files changed, 8 insertions(+), 3 deletions(-)

diff --git a/pytorch_transformers/modeling_auto.py b/pytorch_transformers/modeling_auto.py
index 47c37a57d6..7c96b7a287 100644
--- a/pytorch_transformers/modeling_auto.py
+++ b/pytorch_transformers/modeling_auto.py
@@ -127,7 +127,7 @@ class AutoConfig(object):
 
         raise ValueError("Unrecognized model identifier in {}. Should contains one of "
                          "'bert', 'openai-gpt', 'gpt2', 'transfo-xl', 'xlnet', "
-                         "'xlm'".format(pretrained_model_name_or_path))
+                         "'xlm', 'roberta'".format(pretrained_model_name_or_path))
 
 
 class AutoModel(object):
@@ -242,4 +242,4 @@ class AutoModel(object):
 
         raise ValueError("Unrecognized model identifier in {}. Should contains one of "
                          "'bert', 'openai-gpt', 'gpt2', 'transfo-xl', 'xlnet', "
-                         "'xlm'".format(pretrained_model_name_or_path))
+                         "'xlm', 'roberta'".format(pretrained_model_name_or_path))
diff --git a/pytorch_transformers/tokenization_auto.py b/pytorch_transformers/tokenization_auto.py
index acbe1cebc6..adb8f87cd7 100644
--- a/pytorch_transformers/tokenization_auto.py
+++ b/pytorch_transformers/tokenization_auto.py
@@ -24,6 +24,7 @@ from .tokenization_gpt2 import GPT2Tokenizer
 from .tokenization_transfo_xl import TransfoXLTokenizer
 from .tokenization_xlnet import XLNetTokenizer
 from .tokenization_xlm import XLMTokenizer
+from .tokenization_roberta import RobertaTokenizer
 
 logger = logging.getLogger(__name__)
 
@@ -44,6 +45,7 @@ class AutoTokenizer(object):
             - contains `transfo-xl`: TransfoXLTokenizer (Transformer-XL model)
             - contains `xlnet`: XLNetTokenizer (XLNet model)
             - contains `xlm`: XLMTokenizer (XLM model)
+            - contains `roberta`: RobertaTokenizer (RoBERTa model)
 
         This class cannot be instantiated using `__init__()` (throw an error).
     """
@@ -64,6 +66,7 @@ class AutoTokenizer(object):
             - contains `transfo-xl`: TransfoXLTokenizer (Transformer-XL model)
             - contains `xlnet`: XLNetTokenizer (XLNet model)
             - contains `xlm`: XLMTokenizer (XLM model)
+            - contains `roberta`: RobertaTokenizer (XLM model)
 
         Params:
             **pretrained_model_name_or_path**: either:
@@ -94,7 +97,9 @@ class AutoTokenizer(object):
             return XLNetTokenizer.from_pretrained(pretrained_model_name_or_path, *inputs, **kwargs)
         elif 'xlm' in pretrained_model_name_or_path:
             return XLMTokenizer.from_pretrained(pretrained_model_name_or_path, *inputs, **kwargs)
+        elif 'roberta' in pretrained_model_name_or_path:
+            return RobertaTokenizer.from_pretrained(pretrained_model_name_or_path, *inputs, **kwargs)
 
         raise ValueError("Unrecognized model identifier in {}. Should contains one of "
                          "'bert', 'openai-gpt', 'gpt2', 'transfo-xl', 'xlnet', "
-                         "'xlm'".format(pretrained_model_name_or_path))
+                         "'xlm', 'roberta'".format(pretrained_model_name_or_path))

From 9d0029e215f5ad0836d6be87458aab5142783af4 Mon Sep 17 00:00:00 2001
From: LysandreJik <lysandre.debut@reseau.eseo.fr>
Date: Thu, 15 Aug 2019 17:17:35 -0400
Subject: [PATCH 096/200] Added RoBERTa example to README

---
 README.md | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/README.md b/README.md
index 389c2f25ad..3389e10593 100644
--- a/README.md
+++ b/README.md
@@ -83,7 +83,8 @@ MODELS = [(BertModel,       BertTokenizer,      'bert-base-uncased'),
           (GPT2Model,       GPT2Tokenizer,      'gpt2'),
           (TransfoXLModel,  TransfoXLTokenizer, 'transfo-xl-wt103'),
           (XLNetModel,      XLNetTokenizer,     'xlnet-base-cased'),
-          (XLMModel,        XLMTokenizer,       'xlm-mlm-enfr-1024')]
+          (XLMModel,        XLMTokenizer,       'xlm-mlm-enfr-1024'),
+          (RobertaModel,    RobertaTokenizer,   'roberta-base')]
 
 # Let's encode some text in a sequence of hidden-states using each model:
 for model_class, tokenizer_class, pretrained_weights in MODELS:

From b8ff56896ccbd27a54035a90a3bc278a44541a74 Mon Sep 17 00:00:00 2001
From: wangfei <1140554608@qq.com>
Date: Fri, 16 Aug 2019 12:11:05 +0800
Subject: [PATCH 097/200] Fix bug of multi-gpu training in lm finetuning

---
 examples/lm_finetuning/finetune_on_pregenerated.py | 2 +-
 examples/lm_finetuning/simple_lm_finetuning.py     | 4 ++--
 2 files changed, 3 insertions(+), 3 deletions(-)

diff --git a/examples/lm_finetuning/finetune_on_pregenerated.py b/examples/lm_finetuning/finetune_on_pregenerated.py
index 9fcc5f2cb1..7c40342f18 100644
--- a/examples/lm_finetuning/finetune_on_pregenerated.py
+++ b/examples/lm_finetuning/finetune_on_pregenerated.py
@@ -320,7 +320,7 @@ def main():
                     global_step += 1
 
     # Save a trained model
-    if  n_gpu > 1 and torch.distributed.get_rank() == 0  or n_gpu <=1 :
+    if args.local_rank == -1 or torch.distributed.get_rank() == 0:
         logging.info("** ** * Saving fine-tuned model ** ** * ")
         model.save_pretrained(args.output_dir)
         tokenizer.save_pretrained(args.output_dir)
diff --git a/examples/lm_finetuning/simple_lm_finetuning.py b/examples/lm_finetuning/simple_lm_finetuning.py
index ba5f832827..25333de0ed 100644
--- a/examples/lm_finetuning/simple_lm_finetuning.py
+++ b/examples/lm_finetuning/simple_lm_finetuning.py
@@ -507,7 +507,7 @@ def main():
 
     if os.path.exists(args.output_dir) and os.listdir(args.output_dir):
         raise ValueError("Output directory ({}) already exists and is not empty.".format(args.output_dir))
-    if not os.path.exists(args.output_dir) and ( n_gpu > 1 and torch.distributed.get_rank() == 0  or n_gpu <=1 ):
+    if not os.path.exists(args.output_dir) and (args.local_rank == -1 or torch.distributed.get_rank() == 0):
         os.makedirs(args.output_dir)
 
     tokenizer = BertTokenizer.from_pretrained(args.bert_model, do_lower_case=args.do_lower_case)
@@ -608,7 +608,7 @@ def main():
                     global_step += 1
 
         # Save a trained model
-        if args.do_train and ( n_gpu > 1 and torch.distributed.get_rank() == 0  or n_gpu <=1):
+        if args.do_train and (args.local_rank == -1 or torch.distributed.get_rank() == 0):
             logger.info("** ** * Saving fine - tuned model ** ** * ")
             model.save_pretrained(args.output_dir)
             tokenizer.save_pretrained(args.output_dir)

From ab05280666c9e1cfbbb23122825f3a41b7ff82c3 Mon Sep 17 00:00:00 2001
From: LysandreJik <lysandre.debut@reseau.eseo.fr>
Date: Fri, 16 Aug 2019 09:53:26 -0400
Subject: [PATCH 098/200] Order of strings in AutoModel/AutoTokenizer updated.

---
 pytorch_transformers/modeling_auto.py     | 12 ++++++------
 pytorch_transformers/tokenization_auto.py |  6 +++---
 2 files changed, 9 insertions(+), 9 deletions(-)

diff --git a/pytorch_transformers/modeling_auto.py b/pytorch_transformers/modeling_auto.py
index 7c96b7a287..516107c40b 100644
--- a/pytorch_transformers/modeling_auto.py
+++ b/pytorch_transformers/modeling_auto.py
@@ -110,7 +110,9 @@ class AutoConfig(object):
             assert unused_kwargs == {'foo': False}
 
         """
-        if 'bert' in pretrained_model_name_or_path:
+        if 'roberta' in pretrained_model_name_or_path:
+            return RobertaConfig.from_pretrained(pretrained_model_name_or_path, **kwargs)
+        elif 'bert' in pretrained_model_name_or_path:
             return BertConfig.from_pretrained(pretrained_model_name_or_path, **kwargs)
         elif 'openai-gpt' in pretrained_model_name_or_path:
             return OpenAIGPTConfig.from_pretrained(pretrained_model_name_or_path, **kwargs)
@@ -122,8 +124,6 @@ class AutoConfig(object):
             return XLNetConfig.from_pretrained(pretrained_model_name_or_path, **kwargs)
         elif 'xlm' in pretrained_model_name_or_path:
             return XLMConfig.from_pretrained(pretrained_model_name_or_path, **kwargs)
-        elif 'roberta' in pretrained_model_name_or_path:
-            return RobertaConfig.from_pretrained(pretrained_model_name_or_path, **kwargs)
 
         raise ValueError("Unrecognized model identifier in {}. Should contains one of "
                          "'bert', 'openai-gpt', 'gpt2', 'transfo-xl', 'xlnet', "
@@ -225,7 +225,9 @@ class AutoModel(object):
             model = AutoModel.from_pretrained('./tf_model/bert_tf_checkpoint.ckpt.index', from_tf=True, config=config)
 
         """
-        if 'bert' in pretrained_model_name_or_path:
+        if 'roberta' in pretrained_model_name_or_path:
+            return RobertaModel.from_pretrained(pretrained_model_name_or_path, *model_args, **kwargs)
+        elif 'bert' in pretrained_model_name_or_path:
             return BertModel.from_pretrained(pretrained_model_name_or_path, *model_args, **kwargs)
         elif 'openai-gpt' in pretrained_model_name_or_path:
             return OpenAIGPTModel.from_pretrained(pretrained_model_name_or_path, *model_args, **kwargs)
@@ -237,8 +239,6 @@ class AutoModel(object):
             return XLNetModel.from_pretrained(pretrained_model_name_or_path, *model_args, **kwargs)
         elif 'xlm' in pretrained_model_name_or_path:
             return XLMModel.from_pretrained(pretrained_model_name_or_path, *model_args, **kwargs)
-        elif 'roberta' in pretrained_model_name_or_path:
-            return RobertaModel.from_pretrained(pretrained_model_name_or_path, *model_args, **kwargs)
 
         raise ValueError("Unrecognized model identifier in {}. Should contains one of "
                          "'bert', 'openai-gpt', 'gpt2', 'transfo-xl', 'xlnet', "
diff --git a/pytorch_transformers/tokenization_auto.py b/pytorch_transformers/tokenization_auto.py
index adb8f87cd7..b4b6336952 100644
--- a/pytorch_transformers/tokenization_auto.py
+++ b/pytorch_transformers/tokenization_auto.py
@@ -85,7 +85,9 @@ class AutoTokenizer(object):
             config = AutoTokenizer.from_pretrained('./test/bert_saved_model/')  # E.g. tokenizer was saved using `save_pretrained('./test/saved_model/')`
 
         """
-        if 'bert' in pretrained_model_name_or_path:
+        if 'roberta' in pretrained_model_name_or_path:
+            return RobertaTokenizer.from_pretrained(pretrained_model_name_or_path, *inputs, **kwargs)
+        elif 'bert' in pretrained_model_name_or_path:
             return BertTokenizer.from_pretrained(pretrained_model_name_or_path, *inputs, **kwargs)
         elif 'openai-gpt' in pretrained_model_name_or_path:
             return OpenAIGPTTokenizer.from_pretrained(pretrained_model_name_or_path, *inputs, **kwargs)
@@ -97,8 +99,6 @@ class AutoTokenizer(object):
             return XLNetTokenizer.from_pretrained(pretrained_model_name_or_path, *inputs, **kwargs)
         elif 'xlm' in pretrained_model_name_or_path:
             return XLMTokenizer.from_pretrained(pretrained_model_name_or_path, *inputs, **kwargs)
-        elif 'roberta' in pretrained_model_name_or_path:
-            return RobertaTokenizer.from_pretrained(pretrained_model_name_or_path, *inputs, **kwargs)
 
         raise ValueError("Unrecognized model identifier in {}. Should contains one of "
                          "'bert', 'openai-gpt', 'gpt2', 'transfo-xl', 'xlnet', "

From 47975ed53ec96edfcd83c101c5aac7943f2dd30e Mon Sep 17 00:00:00 2001
From: LysandreJik <lysandre.debut@reseau.eseo.fr>
Date: Tue, 6 Aug 2019 11:21:48 -0400
Subject: [PATCH 099/200] Language Modeling fine-tuning using GPT-2.

---
 examples/run_generative_finetuning.py | 402 ++++++++++++++++++++++++++
 examples/utils_lm.py                  |  42 +++
 2 files changed, 444 insertions(+)
 create mode 100644 examples/run_generative_finetuning.py
 create mode 100644 examples/utils_lm.py

diff --git a/examples/run_generative_finetuning.py b/examples/run_generative_finetuning.py
new file mode 100644
index 0000000000..e9e4545dfe
--- /dev/null
+++ b/examples/run_generative_finetuning.py
@@ -0,0 +1,402 @@
+# coding=utf-8
+# Copyright 2018 The Google AI Language Team Authors and The HuggingFace Inc. team.
+# Copyright (c) 2018, NVIDIA CORPORATION.  All rights reserved.
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+#     http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+""" Finetuning the library models for language modeling on WikiText-2 (GPT, GPT-2, XLM)."""
+
+from __future__ import absolute_import, division, print_function
+
+import argparse
+import glob
+import logging
+import os
+import random
+
+import numpy as np
+import torch
+from torch.utils.data import (DataLoader, SequentialSampler,)
+from torch.utils.data.distributed import DistributedSampler
+from tensorboardX import SummaryWriter
+from tqdm import tqdm, trange
+
+from pytorch_transformers import (WEIGHTS_NAME, GPT2Config, GPT2LMHeadModel, GPT2Tokenizer,)
+from pytorch_transformers import AdamW, WarmupLinearSchedule
+
+from utils_lm import WikiTextDataset
+
+logger = logging.getLogger(__name__)
+
+ALL_MODELS = sum((tuple(conf.pretrained_config_archive_map.keys()) for conf in (GPT2Config,)), ())
+
+MODEL_CLASSES = {
+    'gpt2': (GPT2Config, GPT2LMHeadModel, GPT2Tokenizer)
+}
+
+
+def set_seed(args):
+    random.seed(args.seed)
+    np.random.seed(args.seed)
+    torch.manual_seed(args.seed)
+    if args.n_gpu > 0:
+        torch.cuda.manual_seed_all(args.seed)
+
+
+def train(args, train_dataset, model, tokenizer):
+    """ Train the model """
+    if args.local_rank in [-1, 0]:
+        tb_writer = SummaryWriter()
+
+    args.train_batch_size = args.per_gpu_train_batch_size * max(1, args.n_gpu)
+    train_sampler = SequentialSampler(train_dataset) if args.local_rank == -1 else DistributedSampler(train_dataset)
+    train_dataloader = DataLoader(train_dataset, sampler=train_sampler, batch_size=args.train_batch_size, collate_fn=WikiTextDataset.collate)
+
+    if args.max_steps > 0:
+        t_total = args.max_steps
+        args.num_train_epochs = args.max_steps // (len(train_dataloader) // args.gradient_accumulation_steps) + 1
+    else:
+        t_total = len(train_dataloader) // args.gradient_accumulation_steps * args.num_train_epochs
+
+    # Prepare optimizer and schedule (linear warmup and decay)
+    no_decay = ['bias', 'LayerNorm.weight']
+    optimizer_grouped_parameters = [
+        {'params': [p for n, p in model.named_parameters() if not any(nd in n for nd in no_decay)], 'weight_decay': args.weight_decay},
+        {'params': [p for n, p in model.named_parameters() if any(nd in n for nd in no_decay)], 'weight_decay': 0.0}
+        ]
+    optimizer = AdamW(optimizer_grouped_parameters, lr=args.learning_rate, eps=args.adam_epsilon)
+    scheduler = WarmupLinearSchedule(optimizer, warmup_steps=args.warmup_steps, t_total=t_total)
+    if args.fp16:
+        try:
+            from apex import amp
+        except ImportError:
+            raise ImportError("Please install apex from https://www.github.com/nvidia/apex to use fp16 training.")
+        model, optimizer = amp.initialize(model, optimizer, opt_level=args.fp16_opt_level)
+
+    # multi-gpu training (should be after apex fp16 initialization)
+    if args.n_gpu > 1:
+        model = torch.nn.DataParallel(model)
+
+    # Distributed training (should be after apex fp16 initialization)
+    if args.local_rank != -1:
+        model = torch.nn.parallel.DistributedDataParallel(model, device_ids=[args.local_rank],
+                                                          output_device=args.local_rank,
+                                                          find_unused_parameters=True)
+
+    # Train!
+    logger.info("***** Running training *****")
+    logger.info("  Num examples = %d", len(train_dataset))
+    logger.info("  Num Epochs = %d", args.num_train_epochs)
+    logger.info("  Instantaneous batch size per GPU = %d", args.per_gpu_train_batch_size)
+    logger.info("  Total train batch size (w. parallel, distributed & accumulation) = %d",
+                   args.train_batch_size * args.gradient_accumulation_steps * (torch.distributed.get_world_size() if args.local_rank != -1 else 1))
+    logger.info("  Gradient Accumulation steps = %d", args.gradient_accumulation_steps)
+    logger.info("  Total optimization steps = %d", t_total)
+
+    global_step = 0
+    tr_loss, logging_loss = 0.0, 0.0
+    model.zero_grad()
+    train_iterator = trange(int(args.num_train_epochs), desc="Epoch", disable=args.local_rank not in [-1, 0])
+    set_seed(args)  # Added here for reproductibility (even between python 2 and 3)
+    for _ in train_iterator:
+        epoch_iterator = tqdm(train_dataloader, desc="Iteration", disable=args.local_rank not in [-1, 0])
+        for step, batch in enumerate(epoch_iterator):
+            batch.to(args.device)
+            model.train()
+            outputs = model(batch, labels=batch)
+            loss = outputs[0]  # model outputs are always tuple in pytorch-transformers (see doc)
+
+            if args.n_gpu > 1:
+                loss = loss.mean() # mean() to average on multi-gpu parallel training
+            if args.gradient_accumulation_steps > 1:
+                loss = loss / args.gradient_accumulation_steps
+
+            if args.fp16:
+                with amp.scale_loss(loss, optimizer) as scaled_loss:
+                    scaled_loss.backward()
+                torch.nn.utils.clip_grad_norm_(amp.master_params(optimizer), args.max_grad_norm)
+            else:
+                loss.backward()
+                torch.nn.utils.clip_grad_norm_(model.parameters(), args.max_grad_norm)
+
+            tr_loss += loss.item()
+            if (step + 1) % args.gradient_accumulation_steps == 0:
+                scheduler.step()  # Update learning rate schedule
+                optimizer.step()
+                model.zero_grad()
+                global_step += 1
+
+                if args.local_rank in [-1, 0] and args.logging_steps > 0 and global_step % args.logging_steps == 0:
+                    # Log metrics
+                    if args.local_rank == -1 and args.evaluate_during_training:  # Only evaluate when single GPU otherwise metrics may not average well
+                        results = evaluate(args, model, tokenizer)
+                        for key, value in results.items():
+                            tb_writer.add_scalar('eval_{}'.format(key), value, global_step)
+                    tb_writer.add_scalar('lr', scheduler.get_lr()[0], global_step)
+                    tb_writer.add_scalar('loss', (tr_loss - logging_loss)/args.logging_steps, global_step)
+                    logging_loss = tr_loss
+
+                if args.local_rank in [-1, 0] and args.save_steps > 0 and global_step % args.save_steps == 0:
+                    # Save model checkpoint
+                    output_dir = os.path.join(args.output_dir, 'checkpoint-{}'.format(global_step))
+                    if not os.path.exists(output_dir):
+                        os.makedirs(output_dir)
+                    model_to_save = model.module if hasattr(model, 'module') else model  # Take care of distributed/parallel training
+                    model_to_save.save_pretrained(output_dir)
+                    torch.save(args, os.path.join(output_dir, 'training_args.bin'))
+                    logger.info("Saving model checkpoint to %s", output_dir)
+
+            if args.max_steps > 0 and global_step > args.max_steps:
+                epoch_iterator.close()
+                break
+        if args.max_steps > 0 and global_step > args.max_steps:
+            train_iterator.close()
+            break
+
+    if args.local_rank in [-1, 0]:
+        tb_writer.close()
+
+    return global_step, tr_loss / global_step
+
+
+def evaluate(args, model, tokenizer, prefix=""):
+    # Loop to handle MNLI double evaluation (matched, mis-matched)
+    eval_output_dir = args.output_dir
+
+    results = {}
+    eval_dataset = load_and_cache_examples(args, tokenizer, evaluate=True)
+
+    if not os.path.exists(eval_output_dir) and args.local_rank in [-1, 0]:
+        os.makedirs(eval_output_dir)
+
+    args.eval_batch_size = args.per_gpu_eval_batch_size * max(1, args.n_gpu)
+    # Note that DistributedSampler samples randomly
+    eval_sampler = SequentialSampler(eval_dataset) if args.local_rank == -1 else DistributedSampler(eval_dataset)
+    eval_dataloader = DataLoader(eval_dataset, sampler=eval_sampler, batch_size=args.eval_batch_size, collate_fn=WikiTextDataset.collate)
+
+    # Eval!
+    logger.info("***** Running evaluation {} *****".format(prefix))
+    logger.info("  Num examples = %d", len(eval_dataset))
+    logger.info("  Batch size = %d", args.eval_batch_size)
+    eval_loss = 0.0
+    nb_eval_steps = 0
+    for batch in tqdm(eval_dataloader, desc="Evaluating"):
+        model.eval()
+        batch.to(args.device)
+
+        with torch.no_grad():
+            outputs = model(batch, labels=batch)
+            lm_loss = outputs[0]
+            eval_loss += lm_loss.mean().item()
+        nb_eval_steps += 1
+
+    eval_loss = eval_loss / nb_eval_steps
+    perplexity = torch.exp(torch.tensor(eval_loss))
+
+    result = {
+        "perplexity": perplexity
+    }
+
+    output_eval_file = os.path.join(eval_output_dir, "eval_results.txt")
+    with open(output_eval_file, "w") as writer:
+        logger.info("***** Eval results {} *****".format(prefix))
+        for key in sorted(result.keys()):
+            logger.info("  %s = %s", key, str(result[key]))
+            writer.write("%s = %s\n" % (key, str(result[key])))
+
+    return results
+
+
+def load_and_cache_examples(args, tokenizer, evaluate=False):
+    if args.local_rank not in [-1, 0]:
+        torch.distributed.barrier()  # Make sure only the first process in distributed training process the dataset, and the others will use the cache
+
+    dataset = WikiTextDataset(tokenizer, file="test" if evaluate else "train", directory=args.data_dir)
+    return dataset
+
+
+def main():
+    parser = argparse.ArgumentParser()
+
+    ## Required parameters
+    parser.add_argument("--data_dir", default=None, type=str, required=True,
+                        help="The input data dir. Should contain the .tsv files (or other data files) for the task.")
+    parser.add_argument("--output_dir", default=None, type=str, required=True,
+                        help="The output directory where the model predictions and checkpoints will be written.")
+
+    ## Other parameters
+    parser.add_argument("--model_name_or_path", default="gpt2", type=str,
+                        help="The model to be fine-tuned.")
+    parser.add_argument("--config_name", default="", type=str,
+                        help="Pretrained config name or path if not the same as model_name")
+    parser.add_argument("--tokenizer_name", default="", type=str,
+                        help="Pretrained tokenizer name or path if not the same as model_name")
+    parser.add_argument("--cache_dir", default="", type=str,
+                        help="Where do you want to store the pre-trained models downloaded from s3")
+    parser.add_argument("--max_seq_length", default=128, type=int,
+                        help="The maximum total input sequence length after tokenization. Sequences longer "
+                             "than this will be truncated, sequences shorter will be padded.")
+    parser.add_argument("--do_train", action='store_true',
+                        help="Whether to run training.")
+    parser.add_argument("--do_eval", action='store_true',
+                        help="Whether to run eval on the dev set.")
+    parser.add_argument("--evaluate_during_training", action='store_true',
+                        help="Rul evaluation during training at each logging step.")
+    parser.add_argument("--do_lower_case", action='store_true',
+                        help="Set this flag if you are using an uncased model.")
+
+    parser.add_argument("--per_gpu_train_batch_size", default=8, type=int,
+                        help="Batch size per GPU/CPU for training.")
+    parser.add_argument("--per_gpu_eval_batch_size", default=8, type=int,
+                        help="Batch size per GPU/CPU for evaluation.")
+    parser.add_argument('--gradient_accumulation_steps', type=int, default=1,
+                        help="Number of updates steps to accumulate before performing a backward/update pass.")
+    parser.add_argument("--learning_rate", default=5e-5, type=float,
+                        help="The initial learning rate for Adam.")
+    parser.add_argument("--weight_decay", default=0.0, type=float,
+                        help="Weight deay if we apply some.")
+    parser.add_argument("--adam_epsilon", default=1e-8, type=float,
+                        help="Epsilon for Adam optimizer.")
+    parser.add_argument("--max_grad_norm", default=1.0, type=float,
+                        help="Max gradient norm.")
+    parser.add_argument("--num_train_epochs", default=3.0, type=float,
+                        help="Total number of training epochs to perform.")
+    parser.add_argument("--max_steps", default=-1, type=int,
+                        help="If > 0: set total number of training steps to perform. Override num_train_epochs.")
+    parser.add_argument("--warmup_steps", default=0, type=int,
+                        help="Linear warmup over warmup_steps.")
+
+    parser.add_argument('--logging_steps', type=int, default=50,
+                        help="Log every X updates steps.")
+    parser.add_argument('--save_steps', type=int, default=50,
+                        help="Save checkpoint every X updates steps.")
+    parser.add_argument("--eval_all_checkpoints", action='store_true',
+                        help="Evaluate all checkpoints starting with the same prefix as model_name ending and ending with step number")
+    parser.add_argument("--no_cuda", action='store_true',
+                        help="Avoid using CUDA when available")
+    parser.add_argument('--overwrite_output_dir', action='store_true',
+                        help="Overwrite the content of the output directory")
+    parser.add_argument('--overwrite_cache', action='store_true',
+                        help="Overwrite the cached training and evaluation sets")
+    parser.add_argument('--seed', type=int, default=42,
+                        help="random seed for initialization")
+
+    parser.add_argument('--fp16', action='store_true',
+                        help="Whether to use 16-bit (mixed) precision (through NVIDIA apex) instead of 32-bit")
+    parser.add_argument('--fp16_opt_level', type=str, default='O1',
+                        help="For fp16: Apex AMP optimization level selected in ['O0', 'O1', 'O2', and 'O3']."
+                             "See details at https://nvidia.github.io/apex/amp.html")
+    parser.add_argument("--local_rank", type=int, default=-1,
+                        help="For distributed training: local_rank")
+    parser.add_argument('--server_ip', type=str, default='', help="For distant debugging.")
+    parser.add_argument('--server_port', type=str, default='', help="For distant debugging.")
+    args = parser.parse_args()
+
+    if os.path.exists(args.output_dir) and os.listdir(args.output_dir) and args.do_train and not args.overwrite_output_dir:
+        raise ValueError("Output directory ({}) already exists and is not empty. Use --overwrite_output_dir to overcome.".format(args.output_dir))
+
+    # Setup distant debugging if needed
+    if args.server_ip and args.server_port:
+        # Distant debugging - see https://code.visualstudio.com/docs/python/debugging#_attach-to-a-local-script
+        import ptvsd
+        print("Waiting for debugger attach")
+        ptvsd.enable_attach(address=(args.server_ip, args.server_port), redirect_output=True)
+        ptvsd.wait_for_attach()
+
+    # Setup CUDA, GPU & distributed training
+    if args.local_rank == -1 or args.no_cuda:
+        device = torch.device("cuda" if torch.cuda.is_available() and not args.no_cuda else "cpu")
+        args.n_gpu = torch.cuda.device_count()
+    else:  # Initializes the distributed backend which will take care of sychronizing nodes/GPUs
+        torch.cuda.set_device(args.local_rank)
+        device = torch.device("cuda", args.local_rank)
+        torch.distributed.init_process_group(backend='nccl')
+        args.n_gpu = 1
+    args.device = device
+
+    # Setup logging
+    logging.basicConfig(format = '%(asctime)s - %(levelname)s - %(name)s -   %(message)s',
+                        datefmt = '%m/%d/%Y %H:%M:%S',
+                        level = logging.INFO if args.local_rank in [-1, 0] else logging.WARN)
+    logger.warning("Process rank: %s, device: %s, n_gpu: %s, distributed training: %s, 16-bits training: %s",
+                    args.local_rank, device, args.n_gpu, bool(args.local_rank != -1), args.fp16)
+
+    # Set seed
+    set_seed(args)
+
+    # Load pretrained model and tokenizer
+    if args.local_rank not in [-1, 0]:
+        torch.distributed.barrier()  # Make sure only the first process in distributed training will download model & vocab
+
+    config_class, model_class, tokenizer_class = MODEL_CLASSES[args.model_name_or_path]
+    config = config_class.from_pretrained(args.config_name if args.config_name else args.model_name_or_path)
+    tokenizer = tokenizer_class.from_pretrained(args.tokenizer_name if args.tokenizer_name else args.model_name_or_path, do_lower_case=args.do_lower_case)
+    model = model_class.from_pretrained(args.model_name_or_path, from_tf=bool('.ckpt' in args.model_name_or_path), config=config)
+
+    if args.local_rank == 0:
+        torch.distributed.barrier()  # Make sure only the first process in distributed training will download model & vocab
+
+    model.to(args.device)
+
+    logger.info("Training/evaluation parameters %s", args)
+
+
+    # Training
+    if args.do_train:
+        train_dataset = load_and_cache_examples(args, tokenizer, evaluate=False)
+        global_step, tr_loss = train(args, train_dataset, model, tokenizer)
+        logger.info(" global_step = %s, average loss = %s", global_step, tr_loss)
+
+
+    # Saving best-practices: if you use defaults names for the model, you can reload it using from_pretrained()
+    if args.do_train and (args.local_rank == -1 or torch.distributed.get_rank() == 0):
+        # Create output directory if needed
+        if not os.path.exists(args.output_dir) and args.local_rank in [-1, 0]:
+            os.makedirs(args.output_dir)
+
+        logger.info("Saving model checkpoint to %s", args.output_dir)
+        # Save a trained model, configuration and tokenizer using `save_pretrained()`.
+        # They can then be reloaded using `from_pretrained()`
+        model_to_save = model.module if hasattr(model, 'module') else model  # Take care of distributed/parallel training
+        model_to_save.save_pretrained(args.output_dir)
+        tokenizer.save_pretrained(args.output_dir)
+
+        # Good practice: save your training arguments together with the trained model
+        torch.save(args, os.path.join(args.output_dir, 'training_args.bin'))
+
+        # Load a trained model and vocabulary that you have fine-tuned
+        model = model_class.from_pretrained(args.output_dir)
+        tokenizer = tokenizer_class.from_pretrained(args.output_dir)
+        model.to(args.device)
+
+
+    # Evaluation
+    results = {}
+    if args.do_eval and args.local_rank in [-1, 0]:
+        checkpoints = [args.output_dir]
+        if args.eval_all_checkpoints:
+            checkpoints = list(os.path.dirname(c) for c in sorted(glob.glob(args.output_dir + '/**/' + WEIGHTS_NAME, recursive=True)))
+            logging.getLogger("pytorch_transformers.modeling_utils").setLevel(logging.WARN)  # Reduce logging
+        logger.info("Evaluate the following checkpoints: %s", checkpoints)
+        for checkpoint in checkpoints:
+            global_step = checkpoint.split('-')[-1] if len(checkpoints) > 1 else ""
+            model = model_class.from_pretrained(checkpoint)
+            model.to(args.device)
+            result = evaluate(args, model, tokenizer, prefix=global_step)
+            result = dict((k + '_{}'.format(global_step), v) for k, v in result.items())
+            results.update(result)
+
+    return results
+
+
+if __name__ == "__main__":
+    main()
diff --git a/examples/utils_lm.py b/examples/utils_lm.py
new file mode 100644
index 0000000000..2b6c393a91
--- /dev/null
+++ b/examples/utils_lm.py
@@ -0,0 +1,42 @@
+from torch.utils.data import Dataset, DataLoader
+import os
+import random
+import torch
+import torch.nn.functional as F
+
+
+class WikiTextDataset(Dataset):
+	def __init__(self, tokenizer, file='train', directory='wikitext', max_context_length=512, device='cpu'):
+		self.device = device
+		self.max_context_length = max_context_length
+
+		self.examples = []
+
+		with open(os.path.join(directory, f"wiki.{file}.raw"), encoding="utf-8") as f:
+			text = f.read()
+			spans = list(filter(lambda item: len(item) > 120, text.split("\n")[:20]))
+
+			for span in spans:
+				span = tokenizer.encode(span)
+				while len(span) > 0:
+					self.examples.append(span[:max_context_length])
+					span = span[max_context_length:]
+
+		# Randomly shuffle the examples array
+		random.shuffle(self.examples)
+
+		# Sort the array by example length.
+		self.examples.sort(key=len)
+
+		print("nice")
+
+	def __len__(self):
+		return len(self.examples)
+
+	def __getitem__(self, item):
+		return torch.tensor(self.examples[item], device=self.device)
+
+	@staticmethod
+	def collate(values):
+		stack = torch.stack([F.pad(value, (len(values[-1]) - value.size(0), 0), "constant", 0) for value in values])
+		return stack

From 3e3e1454974de0e1b72c0688a0341014922cd149 Mon Sep 17 00:00:00 2001
From: LysandreJik <lysandre.debut@reseau.eseo.fr>
Date: Tue, 6 Aug 2019 12:14:18 -0400
Subject: [PATCH 100/200] Added GPT to the generative fine-tuning.

---
 examples/run_generative_finetuning.py | 6 ++++--
 examples/utils_lm.py                  | 2 --
 2 files changed, 4 insertions(+), 4 deletions(-)

diff --git a/examples/run_generative_finetuning.py b/examples/run_generative_finetuning.py
index e9e4545dfe..458c123553 100644
--- a/examples/run_generative_finetuning.py
+++ b/examples/run_generative_finetuning.py
@@ -30,7 +30,8 @@ from torch.utils.data.distributed import DistributedSampler
 from tensorboardX import SummaryWriter
 from tqdm import tqdm, trange
 
-from pytorch_transformers import (WEIGHTS_NAME, GPT2Config, GPT2LMHeadModel, GPT2Tokenizer,)
+from pytorch_transformers import (WEIGHTS_NAME, GPT2Config, GPT2LMHeadModel, GPT2Tokenizer,
+                                  OpenAIGPTConfig, OpenAIGPTLMHeadModel, OpenAIGPTTokenizer)
 from pytorch_transformers import AdamW, WarmupLinearSchedule
 
 from utils_lm import WikiTextDataset
@@ -40,7 +41,8 @@ logger = logging.getLogger(__name__)
 ALL_MODELS = sum((tuple(conf.pretrained_config_archive_map.keys()) for conf in (GPT2Config,)), ())
 
 MODEL_CLASSES = {
-    'gpt2': (GPT2Config, GPT2LMHeadModel, GPT2Tokenizer)
+    'gpt2': (GPT2Config, GPT2LMHeadModel, GPT2Tokenizer),
+    'openai-gpt': (OpenAIGPTConfig, OpenAIGPTLMHeadModel, OpenAIGPTTokenizer)
 }
 
 
diff --git a/examples/utils_lm.py b/examples/utils_lm.py
index 2b6c393a91..4a3bafb789 100644
--- a/examples/utils_lm.py
+++ b/examples/utils_lm.py
@@ -28,8 +28,6 @@ class WikiTextDataset(Dataset):
 		# Sort the array by example length.
 		self.examples.sort(key=len)
 
-		print("nice")
-
 	def __len__(self):
 		return len(self.examples)
 

From 5c18825a1850ad59021ea9a914e638256dd372f6 Mon Sep 17 00:00:00 2001
From: LysandreJik <lysandre.debut@reseau.eseo.fr>
Date: Tue, 6 Aug 2019 14:57:07 -0400
Subject: [PATCH 101/200] Removed dataset limit

---
 examples/utils_lm.py | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/examples/utils_lm.py b/examples/utils_lm.py
index 4a3bafb789..2944cdc9ea 100644
--- a/examples/utils_lm.py
+++ b/examples/utils_lm.py
@@ -14,7 +14,7 @@ class WikiTextDataset(Dataset):
 
 		with open(os.path.join(directory, f"wiki.{file}.raw"), encoding="utf-8") as f:
 			text = f.read()
-			spans = list(filter(lambda item: len(item) > 120, text.split("\n")[:20]))
+			spans = list(filter(lambda item: len(item) > 120, text.split("\n")))
 
 			for span in spans:
 				span = tokenizer.encode(span)

From 339e556feb1e6b65cee05d8a1e70d487c416e195 Mon Sep 17 00:00:00 2001
From: LysandreJik <lysandre.debut@reseau.eseo.fr>
Date: Fri, 9 Aug 2019 18:08:15 -0400
Subject: [PATCH 102/200] CLM for BERT, beginning of CLM fot RoBERTa; still
 needs a better masking token mechanism.

---
 examples/run_generative_finetuning.py | 62 +++++++++++++++++++++------
 1 file changed, 48 insertions(+), 14 deletions(-)

diff --git a/examples/run_generative_finetuning.py b/examples/run_generative_finetuning.py
index 458c123553..44daa3d266 100644
--- a/examples/run_generative_finetuning.py
+++ b/examples/run_generative_finetuning.py
@@ -13,7 +13,11 @@
 # WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
 # See the License for the specific language governing permissions and
 # limitations under the License.
-""" Finetuning the library models for language modeling on WikiText-2 (GPT, GPT-2, XLM)."""
+"""
+Fine-tuning the library models for language modeling on WikiText-2 (GPT, GPT-2, BERT, RoBERTa).
+GPT and GPT-2 are fine-tuned using a causal language modeling (CLM) loss while BERT and RoBERTa are fine-tuned
+using a masked language modeling (MLM) loss.
+"""
 
 from __future__ import absolute_import, division, print_function
 
@@ -30,8 +34,10 @@ from torch.utils.data.distributed import DistributedSampler
 from tensorboardX import SummaryWriter
 from tqdm import tqdm, trange
 
-from pytorch_transformers import (WEIGHTS_NAME, GPT2Config, GPT2LMHeadModel, GPT2Tokenizer,
-                                  OpenAIGPTConfig, OpenAIGPTLMHeadModel, OpenAIGPTTokenizer)
+from pytorch_transformers import (WEIGHTS_NAME, GPT2Config, GPT2LMHeadModel, GPT2Tokenizer, GPT2_PRETRAINED_MODEL_ARCHIVE_MAP,
+                                  OpenAIGPTConfig, OpenAIGPTLMHeadModel, OpenAIGPTTokenizer, OPENAI_GPT_PRETRAINED_MODEL_ARCHIVE_MAP,
+                                  BertConfig, BertForMaskedLM, BertTokenizer, BERT_PRETRAINED_MODEL_ARCHIVE_MAP,
+                                  RobertaConfig, RobertaForMaskedLM, RobertaTokenizer, ROBERTA_PRETRAINED_MODEL_ARCHIVE_MAP)
 from pytorch_transformers import AdamW, WarmupLinearSchedule
 
 from utils_lm import WikiTextDataset
@@ -42,7 +48,9 @@ ALL_MODELS = sum((tuple(conf.pretrained_config_archive_map.keys()) for conf in (
 
 MODEL_CLASSES = {
     'gpt2': (GPT2Config, GPT2LMHeadModel, GPT2Tokenizer),
-    'openai-gpt': (OpenAIGPTConfig, OpenAIGPTLMHeadModel, OpenAIGPTTokenizer)
+    'openai-gpt': (OpenAIGPTConfig, OpenAIGPTLMHeadModel, OpenAIGPTTokenizer),
+    "bert": (BertConfig, BertForMaskedLM, BertTokenizer),
+    "roberta": (RobertaConfig, RobertaForMaskedLM, RobertaTokenizer)
 }
 
 
@@ -53,6 +61,18 @@ def set_seed(args):
     if args.n_gpu > 0:
         torch.cuda.manual_seed_all(args.seed)
 
+# Prepare masked tokens inputs/labels for masked language modeling: 80% MASK, 10% random, 10% original
+def mask_tokens(inputs, tokenizer, args):
+    labels = inputs.clone()
+    masked_indices = torch.bernoulli(torch.full(labels.shape, args.mlm_probability)).byte()
+    labels[~masked_indices] = -1  # We only compute loss on masked tokens
+    indices_replaced = torch.bernoulli(torch.full(labels.shape, 0.8)).byte() & masked_indices
+    inputs[indices_replaced] = tokenizer.vocab["[MASK]"]  # 80% of the time, replace masked input tokens with [MASK]
+    indices_random = torch.bernoulli(torch.full(labels.shape, 0.5)).byte() & masked_indices & ~indices_replaced
+    random_words = torch.randint(args.num_embeddings, labels.shape, dtype=torch.long, device=args.device)
+    inputs[indices_random] = random_words[
+        indices_random]  # 10% of the time, replace masked input tokens with random word
+    return inputs, labels
 
 def train(args, train_dataset, model, tokenizer):
     """ Train the model """
@@ -108,13 +128,14 @@ def train(args, train_dataset, model, tokenizer):
     tr_loss, logging_loss = 0.0, 0.0
     model.zero_grad()
     train_iterator = trange(int(args.num_train_epochs), desc="Epoch", disable=args.local_rank not in [-1, 0])
-    set_seed(args)  # Added here for reproductibility (even between python 2 and 3)
+    set_seed(args)  # Added here for reproducibility (even between python 2 and 3)
     for _ in train_iterator:
         epoch_iterator = tqdm(train_dataloader, desc="Iteration", disable=args.local_rank not in [-1, 0])
         for step, batch in enumerate(epoch_iterator):
             batch.to(args.device)
             model.train()
-            outputs = model(batch, labels=batch)
+            inputs, labels = mask_tokens(batch, tokenizer, args) if args.mlm else (batch, batch)
+            outputs = model(inputs, masked_lm_labels=labels) if args.mlm else model(inputs, labels=labels)
             loss = outputs[0]  # model outputs are always tuple in pytorch-transformers (see doc)
 
             if args.n_gpu > 1:
@@ -132,8 +153,8 @@ def train(args, train_dataset, model, tokenizer):
 
             tr_loss += loss.item()
             if (step + 1) % args.gradient_accumulation_steps == 0:
-                scheduler.step()  # Update learning rate schedule
                 optimizer.step()
+                scheduler.step()  # Update learning rate schedule
                 model.zero_grad()
                 global_step += 1
 
@@ -196,7 +217,7 @@ def evaluate(args, model, tokenizer, prefix=""):
         batch.to(args.device)
 
         with torch.no_grad():
-            outputs = model(batch, labels=batch)
+            outputs = model(batch)
             lm_loss = outputs[0]
             eval_loss += lm_loss.mean().item()
         nb_eval_steps += 1
@@ -236,8 +257,16 @@ def main():
                         help="The output directory where the model predictions and checkpoints will be written.")
 
     ## Other parameters
-    parser.add_argument("--model_name_or_path", default="gpt2", type=str,
-                        help="The model to be fine-tuned.")
+    parser.add_argument("--model_name", default="bert", type=str,
+                        help="The model architecture to be fine-tuned.")
+    parser.add_argument("--model_checkpoint", default="bert-base-cased", type=str,
+                        help="The model checkpoint for weights initialization.")
+
+    parser.add_argument("--mlm", action='store_true',
+                        help="Train with masked-language modeling loss instead of language modeling.")
+    parser.add_argument("--mlm_probability", type=float, default=0.15,
+                        help="Ratio of tokens to mask for masked language modeling loss")
+
     parser.add_argument("--config_name", default="", type=str,
                         help="Pretrained config name or path if not the same as model_name")
     parser.add_argument("--tokenizer_name", default="", type=str,
@@ -303,6 +332,10 @@ def main():
     parser.add_argument('--server_port', type=str, default='', help="For distant debugging.")
     args = parser.parse_args()
 
+    if args.model_name in ["bert", "roberta"] and not args.mlm:
+        raise ValueError("BERT and RoBERTa do not have LM heads but masked LM heads. They must be run using the --mlm "
+                         "flag (masked language modeling).")
+
     if os.path.exists(args.output_dir) and os.listdir(args.output_dir) and args.do_train and not args.overwrite_output_dir:
         raise ValueError("Output directory ({}) already exists and is not empty. Use --overwrite_output_dir to overcome.".format(args.output_dir))
 
@@ -339,10 +372,11 @@ def main():
     if args.local_rank not in [-1, 0]:
         torch.distributed.barrier()  # Make sure only the first process in distributed training will download model & vocab
 
-    config_class, model_class, tokenizer_class = MODEL_CLASSES[args.model_name_or_path]
-    config = config_class.from_pretrained(args.config_name if args.config_name else args.model_name_or_path)
-    tokenizer = tokenizer_class.from_pretrained(args.tokenizer_name if args.tokenizer_name else args.model_name_or_path, do_lower_case=args.do_lower_case)
-    model = model_class.from_pretrained(args.model_name_or_path, from_tf=bool('.ckpt' in args.model_name_or_path), config=config)
+    config_class, model_class, tokenizer_class = MODEL_CLASSES[args.model_name]
+    config = config_class.from_pretrained(args.config_name if args.config_name else args.model_checkpoint)
+    tokenizer = tokenizer_class.from_pretrained(args.tokenizer_name if args.tokenizer_name else args.model_checkpoint, do_lower_case=args.do_lower_case)
+    model = model_class.from_pretrained(args.model_checkpoint, from_tf=bool('.ckpt' in args.model_checkpoint), config=config)
+    args.num_embeddings = config.vocab_size  # We need this to create the model at next line (number of embeddings to use)
 
     if args.local_rank == 0:
         torch.distributed.barrier()  # Make sure only the first process in distributed training will download model & vocab

From 715534800a2a809dbfc66bd17acb36ed30999b0d Mon Sep 17 00:00:00 2001
From: LysandreJik <lysandre.debut@reseau.eseo.fr>
Date: Wed, 14 Aug 2019 09:52:57 -0400
Subject: [PATCH 103/200] BERT + RoBERTa masking tokens handling + GPU device
 update.

---
 examples/run_generative_finetuning.py | 27 ++++++++++++++++-----------
 examples/utils_lm.py                  |  5 ++---
 2 files changed, 18 insertions(+), 14 deletions(-)

diff --git a/examples/run_generative_finetuning.py b/examples/run_generative_finetuning.py
index 44daa3d266..ecbf44d8de 100644
--- a/examples/run_generative_finetuning.py
+++ b/examples/run_generative_finetuning.py
@@ -65,11 +65,15 @@ def set_seed(args):
 def mask_tokens(inputs, tokenizer, args):
     labels = inputs.clone()
     masked_indices = torch.bernoulli(torch.full(labels.shape, args.mlm_probability)).byte()
-    labels[~masked_indices] = -1  # We only compute loss on masked tokens
+    labels[~masked_indices.bool()] = -1  # We only compute loss on masked tokens
     indices_replaced = torch.bernoulli(torch.full(labels.shape, 0.8)).byte() & masked_indices
-    inputs[indices_replaced] = tokenizer.vocab["[MASK]"]  # 80% of the time, replace masked input tokens with [MASK]
-    indices_random = torch.bernoulli(torch.full(labels.shape, 0.5)).byte() & masked_indices & ~indices_replaced
-    random_words = torch.randint(args.num_embeddings, labels.shape, dtype=torch.long, device=args.device)
+
+    if args.model_name == "bert":
+        inputs[indices_replaced.bool()] = tokenizer.vocab["[MASK]"]  # 80% of the time, replace masked input tokens with [MASK]
+    elif args.model_name == "roberta":
+        inputs[indices_replaced.bool()] = tokenizer.encoder["<mask>"]  # 80% of the time, replace masked input tokens with <mask>
+    indices_random = (torch.bernoulli(torch.full(labels.shape, 0.5)).byte() & masked_indices & ~indices_replaced).bool()
+    random_words = torch.randint(args.num_embeddings, labels.shape, dtype=torch.long)
     inputs[indices_random] = random_words[
         indices_random]  # 10% of the time, replace masked input tokens with random word
     return inputs, labels
@@ -132,14 +136,15 @@ def train(args, train_dataset, model, tokenizer):
     for _ in train_iterator:
         epoch_iterator = tqdm(train_dataloader, desc="Iteration", disable=args.local_rank not in [-1, 0])
         for step, batch in enumerate(epoch_iterator):
-            batch.to(args.device)
-            model.train()
             inputs, labels = mask_tokens(batch, tokenizer, args) if args.mlm else (batch, batch)
+            inputs = inputs.to(args.device)
+            labels = labels.to(args.device)
+            model.train()
             outputs = model(inputs, masked_lm_labels=labels) if args.mlm else model(inputs, labels=labels)
             loss = outputs[0]  # model outputs are always tuple in pytorch-transformers (see doc)
 
             if args.n_gpu > 1:
-                loss = loss.mean() # mean() to average on multi-gpu parallel training
+                loss = loss.mean()  # mean() to average on multi-gpu parallel training
             if args.gradient_accumulation_steps > 1:
                 loss = loss / args.gradient_accumulation_steps
 
@@ -214,7 +219,7 @@ def evaluate(args, model, tokenizer, prefix=""):
     nb_eval_steps = 0
     for batch in tqdm(eval_dataloader, desc="Evaluating"):
         model.eval()
-        batch.to(args.device)
+        batch = batch.to(args.device)
 
         with torch.no_grad():
             outputs = model(batch)
@@ -285,9 +290,9 @@ def main():
     parser.add_argument("--do_lower_case", action='store_true',
                         help="Set this flag if you are using an uncased model.")
 
-    parser.add_argument("--per_gpu_train_batch_size", default=8, type=int,
+    parser.add_argument("--per_gpu_train_batch_size", default=4, type=int,
                         help="Batch size per GPU/CPU for training.")
-    parser.add_argument("--per_gpu_eval_batch_size", default=8, type=int,
+    parser.add_argument("--per_gpu_eval_batch_size", default=4, type=int,
                         help="Batch size per GPU/CPU for evaluation.")
     parser.add_argument('--gradient_accumulation_steps', type=int, default=1,
                         help="Number of updates steps to accumulate before performing a backward/update pass.")
@@ -299,7 +304,7 @@ def main():
                         help="Epsilon for Adam optimizer.")
     parser.add_argument("--max_grad_norm", default=1.0, type=float,
                         help="Max gradient norm.")
-    parser.add_argument("--num_train_epochs", default=3.0, type=float,
+    parser.add_argument("--num_train_epochs", default=1.0, type=float,
                         help="Total number of training epochs to perform.")
     parser.add_argument("--max_steps", default=-1, type=int,
                         help="If > 0: set total number of training steps to perform. Override num_train_epochs.")
diff --git a/examples/utils_lm.py b/examples/utils_lm.py
index 2944cdc9ea..68a1ca2cce 100644
--- a/examples/utils_lm.py
+++ b/examples/utils_lm.py
@@ -6,8 +6,7 @@ import torch.nn.functional as F
 
 
 class WikiTextDataset(Dataset):
-	def __init__(self, tokenizer, file='train', directory='wikitext', max_context_length=512, device='cpu'):
-		self.device = device
+	def __init__(self, tokenizer, file='train', directory='wikitext', max_context_length=512):
 		self.max_context_length = max_context_length
 
 		self.examples = []
@@ -32,7 +31,7 @@ class WikiTextDataset(Dataset):
 		return len(self.examples)
 
 	def __getitem__(self, item):
-		return torch.tensor(self.examples[item], device=self.device)
+		return torch.tensor(self.examples[item])
 
 	@staticmethod
 	def collate(values):

From 7e7fc53da5f230db379ece739457c81b2f50f13e Mon Sep 17 00:00:00 2001
From: LysandreJik <lysandre.debut@reseau.eseo.fr>
Date: Fri, 16 Aug 2019 11:02:10 -0400
Subject: [PATCH 104/200] Fixing run_glue example with RoBERTa

---
 examples/run_glue.py   | 2 +-
 examples/utils_glue.py | 7 ++++---
 2 files changed, 5 insertions(+), 4 deletions(-)

diff --git a/examples/run_glue.py b/examples/run_glue.py
index c0f70e0863..7fb0732e61 100644
--- a/examples/run_glue.py
+++ b/examples/run_glue.py
@@ -279,7 +279,7 @@ def load_and_cache_examples(args, task, tokenizer, evaluate=False):
             sep_token=tokenizer.sep_token,
             sep_token_extra=bool(args.model_type in ['roberta']),           # roberta uses an extra separator b/w pairs of sentences, cf. github.com/pytorch/fairseq/commit/1684e166e3da03f5b600dbb7855cb98ddfcd0805
             pad_on_left=bool(args.model_type in ['xlnet']),                 # pad on the left for xlnet
-            pad_token=tokenizer.encoder[tokenizer.pad_token] if args.model_type in ['roberta'] else tokenizer.vocab[tokenizer.pad_token],
+            pad_token=tokenizer.convert_tokens_to_ids([tokenizer.pad_token])[0],
             pad_token_segment_id=4 if args.model_type in ['xlnet'] else 0,
         )
         if args.local_rank in [-1, 0]:
diff --git a/examples/utils_glue.py b/examples/utils_glue.py
index c955e4d0ce..e1649fa5af 100644
--- a/examples/utils_glue.py
+++ b/examples/utils_glue.py
@@ -425,9 +425,10 @@ def convert_examples_to_features(examples, label_list, max_seq_length,
             # Account for [CLS], [SEP], [SEP] with "- 3"
             _truncate_seq_pair(tokens_a, tokens_b, max_seq_length - 3)
         else:
-            # Account for [CLS] and [SEP] with "- 2"
-            if len(tokens_a) > max_seq_length - 2:
-                tokens_a = tokens_a[:(max_seq_length - 2)]
+            # Account for [CLS] and [SEP] with "- 2" and with "- 3" for RoBERTa.
+            special_tokens_count = 3 if sep_token_extra else 2
+            if len(tokens_a) > max_seq_length - special_tokens_count:
+                tokens_a = tokens_a[:(max_seq_length - special_tokens_count)]
 
         # The convention in BERT is:
         # (a) For sequence pairs:

From 5652f54ac26f3233f4dcbfd9a2f6879e94a0bc59 Mon Sep 17 00:00:00 2001
From: Lysandre <lysandre.debut@reseau.eseo.fr>
Date: Fri, 16 Aug 2019 13:49:56 -0400
Subject: [PATCH 105/200] Simplified data generator + better perplexity
 calculator

GPT-2 now obtains ~20 perplexity on WikiText-2
---
 examples/run_generative_finetuning.py |  9 +++++----
 examples/utils_lm.py                  | 23 +++++------------------
 2 files changed, 10 insertions(+), 22 deletions(-)

diff --git a/examples/run_generative_finetuning.py b/examples/run_generative_finetuning.py
index ecbf44d8de..bb6aee6f07 100644
--- a/examples/run_generative_finetuning.py
+++ b/examples/run_generative_finetuning.py
@@ -85,7 +85,7 @@ def train(args, train_dataset, model, tokenizer):
 
     args.train_batch_size = args.per_gpu_train_batch_size * max(1, args.n_gpu)
     train_sampler = SequentialSampler(train_dataset) if args.local_rank == -1 else DistributedSampler(train_dataset)
-    train_dataloader = DataLoader(train_dataset, sampler=train_sampler, batch_size=args.train_batch_size, collate_fn=WikiTextDataset.collate)
+    train_dataloader = DataLoader(train_dataset, sampler=train_sampler, batch_size=args.train_batch_size)
 
     if args.max_steps > 0:
         t_total = args.max_steps
@@ -209,7 +209,7 @@ def evaluate(args, model, tokenizer, prefix=""):
     args.eval_batch_size = args.per_gpu_eval_batch_size * max(1, args.n_gpu)
     # Note that DistributedSampler samples randomly
     eval_sampler = SequentialSampler(eval_dataset) if args.local_rank == -1 else DistributedSampler(eval_dataset)
-    eval_dataloader = DataLoader(eval_dataset, sampler=eval_sampler, batch_size=args.eval_batch_size, collate_fn=WikiTextDataset.collate)
+    eval_dataloader = DataLoader(eval_dataset, sampler=eval_sampler, batch_size=args.eval_batch_size)
 
     # Eval!
     logger.info("***** Running evaluation {} *****".format(prefix))
@@ -217,12 +217,13 @@ def evaluate(args, model, tokenizer, prefix=""):
     logger.info("  Batch size = %d", args.eval_batch_size)
     eval_loss = 0.0
     nb_eval_steps = 0
+    model.eval()
+
     for batch in tqdm(eval_dataloader, desc="Evaluating"):
-        model.eval()
         batch = batch.to(args.device)
 
         with torch.no_grad():
-            outputs = model(batch)
+            outputs = model(batch, masked_lm_labels=batch) if args.mlm else model(batch, labels=batch)
             lm_loss = outputs[0]
             eval_loss += lm_loss.mean().item()
         nb_eval_steps += 1
diff --git a/examples/utils_lm.py b/examples/utils_lm.py
index 68a1ca2cce..5f22e10a76 100644
--- a/examples/utils_lm.py
+++ b/examples/utils_lm.py
@@ -6,34 +6,21 @@ import torch.nn.functional as F
 
 
 class WikiTextDataset(Dataset):
-	def __init__(self, tokenizer, file='train', directory='wikitext', max_context_length=512):
+	def __init__(self, tokenizer, file='train', directory='wikitext', max_context_length=1024):
 		self.max_context_length = max_context_length
 
 		self.examples = []
 
 		with open(os.path.join(directory, f"wiki.{file}.raw"), encoding="utf-8") as f:
 			text = f.read()
-			spans = list(filter(lambda item: len(item) > 120, text.split("\n")))
+			tokenized_text = tokenizer.convert_tokens_to_ids(tokenizer.tokenize(text))
 
-			for span in spans:
-				span = tokenizer.encode(span)
-				while len(span) > 0:
-					self.examples.append(span[:max_context_length])
-					span = span[max_context_length:]
-
-		# Randomly shuffle the examples array
-		random.shuffle(self.examples)
-
-		# Sort the array by example length.
-		self.examples.sort(key=len)
+			while len(tokenized_text) > max_context_length:
+				self.examples.append(tokenized_text[:max_context_length])
+				tokenized_text = tokenized_text[max_context_length:]
 
 	def __len__(self):
 		return len(self.examples)
 
 	def __getitem__(self, item):
 		return torch.tensor(self.examples[item])
-
-	@staticmethod
-	def collate(values):
-		stack = torch.stack([F.pad(value, (len(values[-1]) - value.size(0), 0), "constant", 0) for value in values])
-		return stack

From d8923270e6c497862f990a3c72e40cc1ddd01d4e Mon Sep 17 00:00:00 2001
From: Jason Phang <email@jasonphang.com>
Date: Fri, 16 Aug 2019 15:58:19 -0400
Subject: [PATCH 106/200] Correct truncation for RoBERTa in 2-input GLUE

---
 examples/utils_glue.py | 5 +++--
 1 file changed, 3 insertions(+), 2 deletions(-)

diff --git a/examples/utils_glue.py b/examples/utils_glue.py
index e1649fa5af..3e3f104672 100644
--- a/examples/utils_glue.py
+++ b/examples/utils_glue.py
@@ -422,8 +422,9 @@ def convert_examples_to_features(examples, label_list, max_seq_length,
             tokens_b = tokenizer.tokenize(example.text_b)
             # Modifies `tokens_a` and `tokens_b` in place so that the total
             # length is less than the specified length.
-            # Account for [CLS], [SEP], [SEP] with "- 3"
-            _truncate_seq_pair(tokens_a, tokens_b, max_seq_length - 3)
+            # Account for [CLS], [SEP], [SEP] with "- 3". " -4" for RoBERTa.
+            special_tokens_count = 4 if sep_token_extra else 3
+            _truncate_seq_pair(tokens_a, tokens_b, max_seq_length - special_tokens_count)
         else:
             # Account for [CLS] and [SEP] with "- 2" and with "- 3" for RoBERTa.
             special_tokens_count = 3 if sep_token_extra else 2

From 189ff9b66408a1758f3732725db3871322f3e0e6 Mon Sep 17 00:00:00 2001
From: Christophe Bourguignat <christophe.bourguignat@zelros.com>
Date: Sat, 17 Aug 2019 18:46:50 +0200
Subject: [PATCH 107/200] Update README after RoBERTa addition

---
 README.md | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/README.md b/README.md
index 3389e10593..7d2445fc11 100644
--- a/README.md
+++ b/README.md
@@ -76,7 +76,7 @@ import torch
 from pytorch_transformers import *
 
 # PyTorch-Transformers has a unified API
-# for 6 transformer architectures and 27 pretrained weights.
+# for 7 transformer architectures and 30 pretrained weights.
 #          Model          | Tokenizer          | Pretrained weights shortcut
 MODELS = [(BertModel,       BertTokenizer,      'bert-base-uncased'),
           (OpenAIGPTModel,  OpenAIGPTTokenizer, 'openai-gpt'),

From 00e9c4cc9616cab1666cab0a331b5d7e68946928 Mon Sep 17 00:00:00 2001
From: wangfei <1140554608@qq.com>
Date: Sun, 18 Aug 2019 11:02:02 +0800
Subject: [PATCH 108/200] Fix: save model/model.module

---
 examples/lm_finetuning/finetune_on_pregenerated.py | 11 ++++++-----
 examples/lm_finetuning/simple_lm_finetuning.py     |  3 ++-
 2 files changed, 8 insertions(+), 6 deletions(-)

diff --git a/examples/lm_finetuning/finetune_on_pregenerated.py b/examples/lm_finetuning/finetune_on_pregenerated.py
index 7c40342f18..1177d84cd4 100644
--- a/examples/lm_finetuning/finetune_on_pregenerated.py
+++ b/examples/lm_finetuning/finetune_on_pregenerated.py
@@ -155,12 +155,12 @@ def main():
                         help="Loss scaling to improve fp16 numeric stability. Only used when fp16 set to True.\n"
                         "0 (default value): dynamic loss scaling.\n"
                         "Positive power of 2: static loss scaling value.\n")
-    parser.add_argument("--warmup_steps", 
-                        default=0, 
+    parser.add_argument("--warmup_steps",
+                        default=0,
                         type=int,
                         help="Linear warmup over warmup_steps.")
-    parser.add_argument("--adam_epsilon", 
-                        default=1e-8, 
+    parser.add_argument("--adam_epsilon",
+                        default=1e-8,
                         type=float,
                         help="Epsilon for Adam optimizer.")
     parser.add_argument("--learning_rate",
@@ -322,7 +322,8 @@ def main():
     # Save a trained model
     if args.local_rank == -1 or torch.distributed.get_rank() == 0:
         logging.info("** ** * Saving fine-tuned model ** ** * ")
-        model.save_pretrained(args.output_dir)
+        model_to_save = model.module if hasattr(model, 'module') else model  # Take care of distributed/parallel training
+        model_to_save.save_pretrained(args.output_dir)
         tokenizer.save_pretrained(args.output_dir)
 
 
diff --git a/examples/lm_finetuning/simple_lm_finetuning.py b/examples/lm_finetuning/simple_lm_finetuning.py
index 25333de0ed..9633640faf 100644
--- a/examples/lm_finetuning/simple_lm_finetuning.py
+++ b/examples/lm_finetuning/simple_lm_finetuning.py
@@ -610,7 +610,8 @@ def main():
         # Save a trained model
         if args.do_train and (args.local_rank == -1 or torch.distributed.get_rank() == 0):
             logger.info("** ** * Saving fine - tuned model ** ** * ")
-            model.save_pretrained(args.output_dir)
+            model_to_save = model.module if hasattr(model, 'module') else model  # Take care of distributed/parallel training
+            model_to_save.save_pretrained(args.output_dir)
             tokenizer.save_pretrained(args.output_dir)
 
 

From 1ef41b83374ce5756e24746201d21432d7ecada0 Mon Sep 17 00:00:00 2001
From: wangfei <1140554608@qq.com>
Date: Sun, 18 Aug 2019 11:03:12 +0800
Subject: [PATCH 109/200] Revert "Fix: save model/model.module"

This reverts commit 00e9c4cc9616cab1666cab0a331b5d7e68946928.
---
 examples/lm_finetuning/finetune_on_pregenerated.py | 11 +++++------
 examples/lm_finetuning/simple_lm_finetuning.py     |  3 +--
 2 files changed, 6 insertions(+), 8 deletions(-)

diff --git a/examples/lm_finetuning/finetune_on_pregenerated.py b/examples/lm_finetuning/finetune_on_pregenerated.py
index 1177d84cd4..7c40342f18 100644
--- a/examples/lm_finetuning/finetune_on_pregenerated.py
+++ b/examples/lm_finetuning/finetune_on_pregenerated.py
@@ -155,12 +155,12 @@ def main():
                         help="Loss scaling to improve fp16 numeric stability. Only used when fp16 set to True.\n"
                         "0 (default value): dynamic loss scaling.\n"
                         "Positive power of 2: static loss scaling value.\n")
-    parser.add_argument("--warmup_steps",
-                        default=0,
+    parser.add_argument("--warmup_steps", 
+                        default=0, 
                         type=int,
                         help="Linear warmup over warmup_steps.")
-    parser.add_argument("--adam_epsilon",
-                        default=1e-8,
+    parser.add_argument("--adam_epsilon", 
+                        default=1e-8, 
                         type=float,
                         help="Epsilon for Adam optimizer.")
     parser.add_argument("--learning_rate",
@@ -322,8 +322,7 @@ def main():
     # Save a trained model
     if args.local_rank == -1 or torch.distributed.get_rank() == 0:
         logging.info("** ** * Saving fine-tuned model ** ** * ")
-        model_to_save = model.module if hasattr(model, 'module') else model  # Take care of distributed/parallel training
-        model_to_save.save_pretrained(args.output_dir)
+        model.save_pretrained(args.output_dir)
         tokenizer.save_pretrained(args.output_dir)
 
 
diff --git a/examples/lm_finetuning/simple_lm_finetuning.py b/examples/lm_finetuning/simple_lm_finetuning.py
index 9633640faf..25333de0ed 100644
--- a/examples/lm_finetuning/simple_lm_finetuning.py
+++ b/examples/lm_finetuning/simple_lm_finetuning.py
@@ -610,8 +610,7 @@ def main():
         # Save a trained model
         if args.do_train and (args.local_rank == -1 or torch.distributed.get_rank() == 0):
             logger.info("** ** * Saving fine - tuned model ** ** * ")
-            model_to_save = model.module if hasattr(model, 'module') else model  # Take care of distributed/parallel training
-            model_to_save.save_pretrained(args.output_dir)
+            model.save_pretrained(args.output_dir)
             tokenizer.save_pretrained(args.output_dir)
 
 

From 856a63da4d1f0f302633dc73e2d4a1f698bbafda Mon Sep 17 00:00:00 2001
From: wangfei <1140554608@qq.com>
Date: Sun, 18 Aug 2019 11:03:47 +0800
Subject: [PATCH 110/200] Fix: save model/model.module

---
 examples/lm_finetuning/finetune_on_pregenerated.py | 3 ++-
 examples/lm_finetuning/simple_lm_finetuning.py     | 3 ++-
 2 files changed, 4 insertions(+), 2 deletions(-)

diff --git a/examples/lm_finetuning/finetune_on_pregenerated.py b/examples/lm_finetuning/finetune_on_pregenerated.py
index 7c40342f18..eefa56c824 100644
--- a/examples/lm_finetuning/finetune_on_pregenerated.py
+++ b/examples/lm_finetuning/finetune_on_pregenerated.py
@@ -322,7 +322,8 @@ def main():
     # Save a trained model
     if args.local_rank == -1 or torch.distributed.get_rank() == 0:
         logging.info("** ** * Saving fine-tuned model ** ** * ")
-        model.save_pretrained(args.output_dir)
+        model_to_save = model.module if hasattr(model, 'module') else model  # Take care of distributed/parallel training
+        model_to_save.save_pretrained(args.output_dir)
         tokenizer.save_pretrained(args.output_dir)
 
 
diff --git a/examples/lm_finetuning/simple_lm_finetuning.py b/examples/lm_finetuning/simple_lm_finetuning.py
index 25333de0ed..9633640faf 100644
--- a/examples/lm_finetuning/simple_lm_finetuning.py
+++ b/examples/lm_finetuning/simple_lm_finetuning.py
@@ -610,7 +610,8 @@ def main():
         # Save a trained model
         if args.do_train and (args.local_rank == -1 or torch.distributed.get_rank() == 0):
             logger.info("** ** * Saving fine - tuned model ** ** * ")
-            model.save_pretrained(args.output_dir)
+            model_to_save = model.module if hasattr(model, 'module') else model  # Take care of distributed/parallel training
+            model_to_save.save_pretrained(args.output_dir)
             tokenizer.save_pretrained(args.output_dir)
 
 

From 40acf6b52a5250608c2b90edd955835131971d5a Mon Sep 17 00:00:00 2001
From: Chi-Liang Liu <liangtaiwan1230@gmail.com>
Date: Tue, 30 Jul 2019 18:37:37 +0800
Subject: [PATCH 111/200] don't save model without training

---
 examples/run_squad.py | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/examples/run_squad.py b/examples/run_squad.py
index f0ae9169ad..f2d29fd6b1 100644
--- a/examples/run_squad.py
+++ b/examples/run_squad.py
@@ -481,7 +481,7 @@ def main():
 
 
     # Save the trained model and the tokenizer
-    if args.local_rank == -1 or torch.distributed.get_rank() == 0:
+    if args.do_train and args.local_rank == -1 or torch.distributed.get_rank() == 0:
         # Create output directory if needed
         if not os.path.exists(args.output_dir) and args.local_rank in [-1, 0]:
             os.makedirs(args.output_dir)

From c589862b783b94a8408b40c6dc9bf4a14b2ee391 Mon Sep 17 00:00:00 2001
From: Lysandre <lysandre.debut@reseau.eseo.fr>
Date: Mon, 19 Aug 2019 10:17:47 -0400
Subject: [PATCH 112/200] Doc: loading from config alone does not load the
 model weights

---
 pytorch_transformers/modeling_bert.py       | 4 +++-
 pytorch_transformers/modeling_gpt2.py       | 2 ++
 pytorch_transformers/modeling_openai.py     | 2 ++
 pytorch_transformers/modeling_roberta.py    | 3 ++-
 pytorch_transformers/modeling_transfo_xl.py | 2 ++
 pytorch_transformers/modeling_utils.py      | 4 ++++
 pytorch_transformers/modeling_xlm.py        | 2 ++
 pytorch_transformers/modeling_xlnet.py      | 2 ++
 8 files changed, 19 insertions(+), 2 deletions(-)

diff --git a/pytorch_transformers/modeling_bert.py b/pytorch_transformers/modeling_bert.py
index 51d8788545..9c20eac9bf 100644
--- a/pytorch_transformers/modeling_bert.py
+++ b/pytorch_transformers/modeling_bert.py
@@ -577,7 +577,9 @@ BERT_START_DOCSTRING = r"""    The BERT model was proposed in
         https://pytorch.org/docs/stable/nn.html#module
 
     Parameters:
-        config (:class:`~pytorch_transformers.BertConfig`): Model configuration class with all the parameters of the model.
+        config (:class:`~pytorch_transformers.BertConfig`): Model configuration class with all the parameters of the model. 
+            Initializing with a config file does not load the weights associated with the model, only the configuration.
+            Check out the :meth:`~pytorch_transformers.PreTrainedModel.from_pretrained` method to load the model weights.
 """
 
 BERT_INPUTS_DOCSTRING = r"""
diff --git a/pytorch_transformers/modeling_gpt2.py b/pytorch_transformers/modeling_gpt2.py
index 5211def3e3..f67d0e88d5 100644
--- a/pytorch_transformers/modeling_gpt2.py
+++ b/pytorch_transformers/modeling_gpt2.py
@@ -383,6 +383,8 @@ GPT2_START_DOCSTRING = r"""    OpenAI GPT-2 model was proposed in
 
     Parameters:
         config (:class:`~pytorch_transformers.GPT2Config`): Model configuration class with all the parameters of the model.
+            Initializing with a config file does not load the weights associated with the model, only the configuration.
+            Check out the :meth:`~pytorch_transformers.PreTrainedModel.from_pretrained` method to load the model weights.
 """
 
 GPT2_INPUTS_DOCSTRING = r"""    Inputs:
diff --git a/pytorch_transformers/modeling_openai.py b/pytorch_transformers/modeling_openai.py
index 364923b0af..e8648487be 100644
--- a/pytorch_transformers/modeling_openai.py
+++ b/pytorch_transformers/modeling_openai.py
@@ -397,6 +397,8 @@ OPENAI_GPT_START_DOCSTRING = r"""    OpenAI GPT model was proposed in
 
     Parameters:
         config (:class:`~pytorch_transformers.OpenAIGPTConfig`): Model configuration class with all the parameters of the model.
+            Initializing with a config file does not load the weights associated with the model, only the configuration.
+            Check out the :meth:`~pytorch_transformers.PreTrainedModel.from_pretrained` method to load the model weights.
 """
 
 OPENAI_GPT_INPUTS_DOCSTRING = r"""    Inputs:
diff --git a/pytorch_transformers/modeling_roberta.py b/pytorch_transformers/modeling_roberta.py
index adb04b4b3a..e3065cf60b 100644
--- a/pytorch_transformers/modeling_roberta.py
+++ b/pytorch_transformers/modeling_roberta.py
@@ -90,7 +90,8 @@ ROBERTA_START_DOCSTRING = r"""    The RoBERTa model was proposed in
 
     Parameters:
         config (:class:`~pytorch_transformers.RobertaConfig`): Model configuration class with all the parameters of the 
-            model.
+            model. Initializing with a config file does not load the weights associated with the model, only the configuration.
+            Check out the :meth:`~pytorch_transformers.PreTrainedModel.from_pretrained` method to load the model weights.
 """
 
 ROBERTA_INPUTS_DOCSTRING = r"""
diff --git a/pytorch_transformers/modeling_transfo_xl.py b/pytorch_transformers/modeling_transfo_xl.py
index cb5416964c..553a71fffe 100644
--- a/pytorch_transformers/modeling_transfo_xl.py
+++ b/pytorch_transformers/modeling_transfo_xl.py
@@ -928,6 +928,8 @@ TRANSFO_XL_START_DOCSTRING = r"""    The Transformer-XL model was proposed in
 
     Parameters:
         config (:class:`~pytorch_transformers.TransfoXLConfig`): Model configuration class with all the parameters of the model.
+            Initializing with a config file does not load the weights associated with the model, only the configuration.
+            Check out the :meth:`~pytorch_transformers.PreTrainedModel.from_pretrained` method to load the model weights.
 """
 
 TRANSFO_XL_INPUTS_DOCSTRING = r"""
diff --git a/pytorch_transformers/modeling_utils.py b/pytorch_transformers/modeling_utils.py
index 35f82e324f..edc6b3903e 100644
--- a/pytorch_transformers/modeling_utils.py
+++ b/pytorch_transformers/modeling_utils.py
@@ -71,6 +71,10 @@ class PretrainedConfig(object):
     r""" Base class for all configuration classes.
         Handles a few parameters common to all models' configurations as well as methods for loading/downloading/saving configurations.
 
+        Note:
+            A configuration file can be loaded and saved to disk. Loading the configuration file and using this file to initialize a model does **not** load the model weights.
+            It only affects the model's configuration.
+
         Class attributes (overridden by derived classes):
             - ``pretrained_config_archive_map``: a python ``dict`` of with `short-cut-names` (string) as keys and `url` (string) of associated pretrained model configurations as values.
 
diff --git a/pytorch_transformers/modeling_xlm.py b/pytorch_transformers/modeling_xlm.py
index 941c8dda2f..d01d245bbb 100644
--- a/pytorch_transformers/modeling_xlm.py
+++ b/pytorch_transformers/modeling_xlm.py
@@ -416,6 +416,8 @@ XLM_START_DOCSTRING = r"""    The XLM model was proposed in
 
     Parameters:
         config (:class:`~pytorch_transformers.XLMConfig`): Model configuration class with all the parameters of the model.
+            Initializing with a config file does not load the weights associated with the model, only the configuration.
+            Check out the :meth:`~pytorch_transformers.PreTrainedModel.from_pretrained` method to load the model weights.
 """
 
 XLM_INPUTS_DOCSTRING = r"""
diff --git a/pytorch_transformers/modeling_xlnet.py b/pytorch_transformers/modeling_xlnet.py
index e9e75e3ab7..af33c5a6c2 100644
--- a/pytorch_transformers/modeling_xlnet.py
+++ b/pytorch_transformers/modeling_xlnet.py
@@ -647,6 +647,8 @@ XLNET_START_DOCSTRING = r"""    The XLNet model was proposed in
 
     Parameters:
         config (:class:`~pytorch_transformers.XLNetConfig`): Model configuration class with all the parameters of the model.
+            Initializing with a config file does not load the weights associated with the model, only the configuration.
+            Check out the :meth:`~pytorch_transformers.PreTrainedModel.from_pretrained` method to load the model weights.
 """
 
 XLNET_INPUTS_DOCSTRING = r"""

From f94f1c6016414e059fa4d8ef61ee194fdc891046 Mon Sep 17 00:00:00 2001
From: Lysandre <lysandre.debut@reseau.eseo.fr>
Date: Mon, 19 Aug 2019 14:58:50 -0400
Subject: [PATCH 113/200] Distributed training + tokenizer agnostic mask token

---
 examples/run_generative_finetuning.py | 14 +++-----------
 examples/utils_lm.py                  | 27 ++++++++++++++++++++++++++-
 2 files changed, 29 insertions(+), 12 deletions(-)

diff --git a/examples/run_generative_finetuning.py b/examples/run_generative_finetuning.py
index bb6aee6f07..8501364ae4 100644
--- a/examples/run_generative_finetuning.py
+++ b/examples/run_generative_finetuning.py
@@ -39,12 +39,10 @@ from pytorch_transformers import (WEIGHTS_NAME, GPT2Config, GPT2LMHeadModel, GPT
                                   BertConfig, BertForMaskedLM, BertTokenizer, BERT_PRETRAINED_MODEL_ARCHIVE_MAP,
                                   RobertaConfig, RobertaForMaskedLM, RobertaTokenizer, ROBERTA_PRETRAINED_MODEL_ARCHIVE_MAP)
 from pytorch_transformers import AdamW, WarmupLinearSchedule
+logger = logging.getLogger(__name__)
 
 from utils_lm import WikiTextDataset
 
-logger = logging.getLogger(__name__)
-
-ALL_MODELS = sum((tuple(conf.pretrained_config_archive_map.keys()) for conf in (GPT2Config,)), ())
 
 MODEL_CLASSES = {
     'gpt2': (GPT2Config, GPT2LMHeadModel, GPT2Tokenizer),
@@ -68,10 +66,7 @@ def mask_tokens(inputs, tokenizer, args):
     labels[~masked_indices.bool()] = -1  # We only compute loss on masked tokens
     indices_replaced = torch.bernoulli(torch.full(labels.shape, 0.8)).byte() & masked_indices
 
-    if args.model_name == "bert":
-        inputs[indices_replaced.bool()] = tokenizer.vocab["[MASK]"]  # 80% of the time, replace masked input tokens with [MASK]
-    elif args.model_name == "roberta":
-        inputs[indices_replaced.bool()] = tokenizer.encoder["<mask>"]  # 80% of the time, replace masked input tokens with <mask>
+    inputs[indices_replaced.bool()] = tokenizer.convert_tokens_to_ids(tokenizer.mask_token) # 80% of the time, replace masked input tokens with [MASK]
     indices_random = (torch.bernoulli(torch.full(labels.shape, 0.5)).byte() & masked_indices & ~indices_replaced).bool()
     random_words = torch.randint(args.num_embeddings, labels.shape, dtype=torch.long)
     inputs[indices_random] = random_words[
@@ -246,10 +241,7 @@ def evaluate(args, model, tokenizer, prefix=""):
 
 
 def load_and_cache_examples(args, tokenizer, evaluate=False):
-    if args.local_rank not in [-1, 0]:
-        torch.distributed.barrier()  # Make sure only the first process in distributed training process the dataset, and the others will use the cache
-
-    dataset = WikiTextDataset(tokenizer, file="test" if evaluate else "train", directory=args.data_dir)
+    dataset = WikiTextDataset(args, tokenizer, file="test" if evaluate else "train", directory=args.data_dir)
     return dataset
 
 
diff --git a/examples/utils_lm.py b/examples/utils_lm.py
index 5f22e10a76..251aea90e1 100644
--- a/examples/utils_lm.py
+++ b/examples/utils_lm.py
@@ -3,10 +3,27 @@ import os
 import random
 import torch
 import torch.nn.functional as F
+import logging
+import pickle
+
+logger = logging.getLogger(__name__)
 
 
 class WikiTextDataset(Dataset):
-	def __init__(self, tokenizer, file='train', directory='wikitext', max_context_length=1024):
+	def __init__(self, args, tokenizer, file='train', directory='wikitext', max_context_length=512, cache=None):
+		if args.local_rank not in [-1, 0]:
+			torch.distributed.barrier()  # Make sure only the first process in distributed training process the dataset, and the others will use the cache
+			
+			
+		cached_features_file = os.path.join(args.data_dir, f'cached_lm_{file}_{args.max_seq_length}')
+		
+		if os.path.exists(cached_features_file):
+			logger.info("Loading features from cached file %s", cached_features_file)
+			with open(cached_features_file, 'rb') as handle:
+				self.examples = pickle.load(handle)
+		else:
+			logger.info("Creating features from dataset file at %s", args.data_dir)	
+		
 		self.max_context_length = max_context_length
 
 		self.examples = []
@@ -18,6 +35,14 @@ class WikiTextDataset(Dataset):
 			while len(tokenized_text) > max_context_length:
 				self.examples.append(tokenized_text[:max_context_length])
 				tokenized_text = tokenized_text[max_context_length:]
+			
+		if args.local_rank in [-1, 0]:
+			logger.info("Saving features into cached file %s", cached_features_file)
+			with open(cached_features_file, 'wb') as handle:
+				pickle.dump(self.examples, handle, protocol=pickle.HIGHEST_PROTOCOL)
+		
+		if args.local_rank == 0:
+			torch.distributed.barrier()
 
 	def __len__(self):
 		return len(self.examples)

From a368b877911862da014ed7b219679effbb8dd8ca Mon Sep 17 00:00:00 2001
From: Peng Qi <qipeng@users.noreply.github.com>
Date: Mon, 19 Aug 2019 13:07:00 -0700
Subject: [PATCH 114/200] Fix #1015

---
 examples/run_squad.py | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/examples/run_squad.py b/examples/run_squad.py
index f2d29fd6b1..efa835107c 100644
--- a/examples/run_squad.py
+++ b/examples/run_squad.py
@@ -498,7 +498,7 @@ def main():
 
         # Load a trained model and vocabulary that you have fine-tuned
         model = model_class.from_pretrained(args.output_dir)
-        tokenizer = tokenizer_class.from_pretrained(args.output_dir)
+        tokenizer = tokenizer_class.from_pretrained(args.output_dir, do_lower_case=args.do_lower_case)
         model.to(args.device)
 
 

From 28f7ca1f807f0857c24f18c0b28b6b8ebee18c0a Mon Sep 17 00:00:00 2001
From: Zeyao Du <ned1991@gmail.com>
Date: Tue, 20 Aug 2019 15:58:42 +0800
Subject: [PATCH 115/200] swap optimizer.step and scheduler.step

---
 examples/lm_finetuning/simple_lm_finetuning.py | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/examples/lm_finetuning/simple_lm_finetuning.py b/examples/lm_finetuning/simple_lm_finetuning.py
index ba5f832827..dca883d2f6 100644
--- a/examples/lm_finetuning/simple_lm_finetuning.py
+++ b/examples/lm_finetuning/simple_lm_finetuning.py
@@ -602,8 +602,8 @@ def main():
                 nb_tr_examples += input_ids.size(0)
                 nb_tr_steps += 1
                 if (step + 1) % args.gradient_accumulation_steps == 0:
-                    scheduler.step()  # Update learning rate schedule
                     optimizer.step()
+                    scheduler.step()  # Update learning rate schedule
                     optimizer.zero_grad()
                     global_step += 1
 

From a1359b970cb4bfa41008a45b44dd2a25e579bff3 Mon Sep 17 00:00:00 2001
From: Zeyao Du <ned1991@gmail.com>
Date: Tue, 20 Aug 2019 16:00:07 +0800
Subject: [PATCH 116/200] Update finetune_on_pregenerated.py

---
 examples/lm_finetuning/finetune_on_pregenerated.py | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/examples/lm_finetuning/finetune_on_pregenerated.py b/examples/lm_finetuning/finetune_on_pregenerated.py
index 9fcc5f2cb1..ccf1c15313 100644
--- a/examples/lm_finetuning/finetune_on_pregenerated.py
+++ b/examples/lm_finetuning/finetune_on_pregenerated.py
@@ -314,8 +314,8 @@ def main():
                 mean_loss = tr_loss * args.gradient_accumulation_steps / nb_tr_steps
                 pbar.set_postfix_str(f"Loss: {mean_loss:.5f}")
                 if (step + 1) % args.gradient_accumulation_steps == 0:
-                    scheduler.step()  # Update learning rate schedule
                     optimizer.step()
+                    scheduler.step()  # Update learning rate schedule
                     optimizer.zero_grad()
                     global_step += 1
 

From 45ab8bf60e5c2af912006035f5568be92c0c99c9 Mon Sep 17 00:00:00 2001
From: Duzeyao <330501241@qq.com>
Date: Tue, 20 Aug 2019 16:40:39 +0800
Subject: [PATCH 117/200] Revert "Update finetune_on_pregenerated.py"

This reverts commit a1359b970cb4bfa41008a45b44dd2a25e579bff3.
---
 examples/lm_finetuning/finetune_on_pregenerated.py | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/examples/lm_finetuning/finetune_on_pregenerated.py b/examples/lm_finetuning/finetune_on_pregenerated.py
index ccf1c15313..9fcc5f2cb1 100644
--- a/examples/lm_finetuning/finetune_on_pregenerated.py
+++ b/examples/lm_finetuning/finetune_on_pregenerated.py
@@ -314,8 +314,8 @@ def main():
                 mean_loss = tr_loss * args.gradient_accumulation_steps / nb_tr_steps
                 pbar.set_postfix_str(f"Loss: {mean_loss:.5f}")
                 if (step + 1) % args.gradient_accumulation_steps == 0:
-                    optimizer.step()
                     scheduler.step()  # Update learning rate schedule
+                    optimizer.step()
                     optimizer.zero_grad()
                     global_step += 1
 

From d86b49ac86141810af4a7c82ed34e789b3b1937e Mon Sep 17 00:00:00 2001
From: Duzeyao <330501241@qq.com>
Date: Tue, 20 Aug 2019 16:46:34 +0800
Subject: [PATCH 118/200] swap optimizer.step and scheduler.step

---
 examples/lm_finetuning/finetune_on_pregenerated.py | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/examples/lm_finetuning/finetune_on_pregenerated.py b/examples/lm_finetuning/finetune_on_pregenerated.py
index 9fcc5f2cb1..ccf1c15313 100644
--- a/examples/lm_finetuning/finetune_on_pregenerated.py
+++ b/examples/lm_finetuning/finetune_on_pregenerated.py
@@ -314,8 +314,8 @@ def main():
                 mean_loss = tr_loss * args.gradient_accumulation_steps / nb_tr_steps
                 pbar.set_postfix_str(f"Loss: {mean_loss:.5f}")
                 if (step + 1) % args.gradient_accumulation_steps == 0:
-                    scheduler.step()  # Update learning rate schedule
                     optimizer.step()
+                    scheduler.step()  # Update learning rate schedule
                     optimizer.zero_grad()
                     global_step += 1
 

From fecaed0ed4bf338bca5b9895107b309841f8ac57 Mon Sep 17 00:00:00 2001
From: thomwolf <thomwolf@gmail.com>
Date: Tue, 20 Aug 2019 10:56:12 +0200
Subject: [PATCH 119/200] add force_download option to from_pretrained methods

---
 pytorch_transformers/file_utils.py         | 13 ++++++++-----
 pytorch_transformers/modeling_utils.py     | 13 +++++++++++--
 pytorch_transformers/tokenization_utils.py |  6 +++++-
 3 files changed, 24 insertions(+), 8 deletions(-)

diff --git a/pytorch_transformers/file_utils.py b/pytorch_transformers/file_utils.py
index 75c075720c..074e6743ef 100644
--- a/pytorch_transformers/file_utils.py
+++ b/pytorch_transformers/file_utils.py
@@ -93,12 +93,15 @@ def filename_to_url(filename, cache_dir=None):
     return url, etag
 
 
-def cached_path(url_or_filename, cache_dir=None):
+def cached_path(url_or_filename, cache_dir=None, force_download=False):
     """
     Given something that might be a URL (or might be a local path),
     determine which. If it's a URL, download the file and cache it, and
     return the path to the cached file. If it's already a local path,
     make sure the file exists and then return the path.
+    Args:
+        cache_dir: specify a cache directory to save the file to (overwrite the default cache dir).
+        force_download: if True, re-dowload the file even if it's already cached in the cache dir.
     """
     if cache_dir is None:
         cache_dir = PYTORCH_TRANSFORMERS_CACHE
@@ -111,7 +114,7 @@ def cached_path(url_or_filename, cache_dir=None):
 
     if parsed.scheme in ('http', 'https', 's3'):
         # URL, so get it from the cache (downloading if necessary)
-        return get_from_cache(url_or_filename, cache_dir)
+        return get_from_cache(url_or_filename, cache_dir=cache_dir, force_download=force_download)
     elif os.path.exists(url_or_filename):
         # File, and it exists.
         return url_or_filename
@@ -184,7 +187,7 @@ def http_get(url, temp_file):
     progress.close()
 
 
-def get_from_cache(url, cache_dir=None):
+def get_from_cache(url, cache_dir=None, force_download=False):
     """
     Given a URL, look for the corresponding dataset in the local cache.
     If it's not there, download it. Then return the path to the cached file.
@@ -227,11 +230,11 @@ def get_from_cache(url, cache_dir=None):
         if matching_files:
             cache_path = os.path.join(cache_dir, matching_files[-1])
 
-    if not os.path.exists(cache_path):
+    if not os.path.exists(cache_path) or force_download:
         # Download to temporary file, then copy to cache dir once finished.
         # Otherwise you get corrupt cache entries if the download gets interrupted.
         with tempfile.NamedTemporaryFile() as temp_file:
-            logger.info("%s not found in cache, downloading to %s", url, temp_file.name)
+            logger.info("%s not found in cache or force_download set to True, downloading to %s", url, temp_file.name)
 
             # GET file object
             if url.startswith("s3://"):
diff --git a/pytorch_transformers/modeling_utils.py b/pytorch_transformers/modeling_utils.py
index edc6b3903e..3e4fbca132 100644
--- a/pytorch_transformers/modeling_utils.py
+++ b/pytorch_transformers/modeling_utils.py
@@ -125,6 +125,9 @@ class PretrainedConfig(object):
                 - The values in kwargs of any keys which are configuration attributes will be used to override the loaded values.
                 - Behavior concerning key/value pairs whose keys are *not* configuration attributes is controlled by the `return_unused_kwargs` keyword parameter.
 
+            force_download: (`optional`) boolean, default False:
+                Force to (re-)download the model weights and configuration files and override the cached versions if they exists.
+
             return_unused_kwargs: (`optional`) bool:
 
                 - If False, then this function returns just the final configuration object.
@@ -146,6 +149,7 @@ class PretrainedConfig(object):
 
         """
         cache_dir = kwargs.pop('cache_dir', None)
+        force_download = kwargs.pop('force_download', False)
         return_unused_kwargs = kwargs.pop('return_unused_kwargs', False)
 
         if pretrained_model_name_or_path in cls.pretrained_config_archive_map:
@@ -156,7 +160,7 @@ class PretrainedConfig(object):
             config_file = pretrained_model_name_or_path
         # redirect to the cache, if necessary
         try:
-            resolved_config_file = cached_path(config_file, cache_dir=cache_dir)
+            resolved_config_file = cached_path(config_file, cache_dir=cache_dir, force_download=force_download)
         except EnvironmentError:
             if pretrained_model_name_or_path in cls.pretrained_config_archive_map:
                 logger.error(
@@ -400,6 +404,9 @@ class PreTrainedModel(nn.Module):
                 Path to a directory in which a downloaded pre-trained model
                 configuration should be cached if the standard cache should not be used.
 
+            force_download: (`optional`) boolean, default False:
+                Force to (re-)download the model weights and configuration files and override the cached versions if they exists.
+
             output_loading_info: (`optional`) boolean:
                 Set to ``True`` to also return a dictionnary containing missing keys, unexpected keys and error messages.
 
@@ -424,6 +431,7 @@ class PreTrainedModel(nn.Module):
         state_dict = kwargs.pop('state_dict', None)
         cache_dir = kwargs.pop('cache_dir', None)
         from_tf = kwargs.pop('from_tf', False)
+        force_download = kwargs.pop('force_download', False)
         output_loading_info = kwargs.pop('output_loading_info', False)
 
         # Load config
@@ -431,6 +439,7 @@ class PreTrainedModel(nn.Module):
             config, model_kwargs = cls.config_class.from_pretrained(
                 pretrained_model_name_or_path, *model_args,
                 cache_dir=cache_dir, return_unused_kwargs=True,
+                force_download=force_download,
                 **kwargs
             )
         else:
@@ -453,7 +462,7 @@ class PreTrainedModel(nn.Module):
                 archive_file = pretrained_model_name_or_path
         # redirect to the cache, if necessary
         try:
-            resolved_archive_file = cached_path(archive_file, cache_dir=cache_dir)
+            resolved_archive_file = cached_path(archive_file, cache_dir=cache_dir, force_download=force_download)
         except EnvironmentError:
             if pretrained_model_name_or_path in cls.pretrained_model_archive_map:
                 logger.error(
diff --git a/pytorch_transformers/tokenization_utils.py b/pytorch_transformers/tokenization_utils.py
index 74d50b385d..763c0cee04 100644
--- a/pytorch_transformers/tokenization_utils.py
+++ b/pytorch_transformers/tokenization_utils.py
@@ -193,6 +193,9 @@ class PreTrainedTokenizer(object):
             cache_dir: (`optional`) string:
                 Path to a directory in which a downloaded predefined tokenizer vocabulary files should be cached if the standard cache should not be used.
 
+            force_download: (`optional`) boolean, default False:
+                Force to (re-)download the vocabulary files and override the cached versions if they exists.
+
             inputs: (`optional`) positional arguments: will be passed to the Tokenizer ``__init__`` method.
 
             kwargs: (`optional`) keyword arguments: will be passed to the Tokenizer ``__init__`` method. Can be used to set special tokens like ``bos_token``, ``eos_token``, ``unk_token``, ``sep_token``, ``pad_token``, ``cls_token``, ``mask_token``, ``additional_special_tokens``. See parameters in the doc string of :class:`~pytorch_transformers.PreTrainedTokenizer` for details.
@@ -223,6 +226,7 @@ class PreTrainedTokenizer(object):
     @classmethod
     def _from_pretrained(cls, pretrained_model_name_or_path, *inputs, **kwargs):
         cache_dir = kwargs.pop('cache_dir', None)
+        force_download = kwargs.pop('force_download', False)
 
         s3_models = list(cls.max_model_input_sizes.keys())
         vocab_files = {}
@@ -283,7 +287,7 @@ class PreTrainedTokenizer(object):
                 if file_path is None:
                     resolved_vocab_files[file_id] = None
                 else:
-                    resolved_vocab_files[file_id] = cached_path(file_path, cache_dir=cache_dir)
+                    resolved_vocab_files[file_id] = cached_path(file_path, cache_dir=cache_dir, force_download=force_download)
         except EnvironmentError:
             if pretrained_model_name_or_path in s3_models:
                 logger.error("Couldn't reach server to download vocabulary.")

From e239a4a20fbb901e60ffcafc06bfefcbb67eaa65 Mon Sep 17 00:00:00 2001
From: thomwolf <thomwolf@gmail.com>
Date: Tue, 20 Aug 2019 11:02:00 +0200
Subject: [PATCH 120/200] close #984

---
 docs/source/pretrained_models.rst | 10 +++++-----
 1 file changed, 5 insertions(+), 5 deletions(-)

diff --git a/docs/source/pretrained_models.rst b/docs/source/pretrained_models.rst
index 987882d12e..6a14e3dcd1 100644
--- a/docs/source/pretrained_models.rst
+++ b/docs/source/pretrained_models.rst
@@ -72,16 +72,16 @@ Here is the full list of the currently provided pretrained models together with
 |                   | ``xlnet-large-cased``                                      | | 24-layer, 1024-hidden, 16-heads, 340M parameters.                                                                                   |
 |                   |                                                            | | XLNet Large English model                                                                                                           |
 +-------------------+------------------------------------------------------------+---------------------------------------------------------------------------------------------------------------------------------------+
-| XLM               | ``xlm-mlm-en-2048``                                        | | 12-layer, 1024-hidden, 8-heads                                                                                                      |
+| XLM               | ``xlm-mlm-en-2048``                                        | | 12-layer, 2048-hidden, 16-heads                                                                                                      |
 |                   |                                                            | | XLM English model                                                                                                                   |
 |                   +------------------------------------------------------------+---------------------------------------------------------------------------------------------------------------------------------------+
-|                   | ``xlm-mlm-ende-1024``                                      | | 12-layer, 1024-hidden, 8-heads                                                                                                      |
+|                   | ``xlm-mlm-ende-1024``                                      | | 6-layer, 1024-hidden, 8-heads                                                                                                      |
 |                   |                                                            | | XLM English-German Multi-language model                                                                                             |
 |                   +------------------------------------------------------------+---------------------------------------------------------------------------------------------------------------------------------------+
-|                   | ``xlm-mlm-enfr-1024``                                      | | 12-layer, 1024-hidden, 8-heads                                                                                                      |
+|                   | ``xlm-mlm-enfr-1024``                                      | | 6-layer, 1024-hidden, 8-heads                                                                                                      |
 |                   |                                                            | | XLM English-French Multi-language model                                                                                             |
 |                   +------------------------------------------------------------+---------------------------------------------------------------------------------------------------------------------------------------+
-|                   | ``xlm-mlm-enro-1024``                                      | | 12-layer, 1024-hidden, 8-heads                                                                                                      |
+|                   | ``xlm-mlm-enro-1024``                                      | | 6-layer, 1024-hidden, 8-heads                                                                                                      |
 |                   |                                                            | | XLM English-Romanian Multi-language model                                                                                           |
 |                   +------------------------------------------------------------+---------------------------------------------------------------------------------------------------------------------------------------+
 |                   | ``xlm-mlm-xnli15-1024``                                    | | 12-layer, 1024-hidden, 8-heads                                                                                                      |
@@ -93,7 +93,7 @@ Here is the full list of the currently provided pretrained models together with
 |                   | ``xlm-clm-enfr-1024``                                      | | 12-layer, 1024-hidden, 8-heads                                                                                                      |
 |                   |                                                            | | XLM English model trained with CLM (Causal Language Modeling)                                                                       |
 |                   +------------------------------------------------------------+---------------------------------------------------------------------------------------------------------------------------------------+
-|                   | ``xlm-clm-ende-1024``                                      | | 12-layer, 1024-hidden, 8-heads                                                                                                      |
+|                   | ``xlm-clm-ende-1024``                                      | | 6-layer, 1024-hidden, 8-heads                                                                                                      |
 |                   |                                                            | | XLM English-German Multi-language model trained with CLM (Causal Language Modeling)                                                 |
 +-------------------+------------------------------------------------------------+---------------------------------------------------------------------------------------------------------------------------------------+
 | RoBERTa           | ``roberta-base``                                           | | 12-layer, 768-hidden, 12-heads, 125M parameters                                                                                     |

From 901dde0e4583a00dc7e486aca6cda7acb647dea9 Mon Sep 17 00:00:00 2001
From: thomwolf <thomwolf@gmail.com>
Date: Tue, 20 Aug 2019 11:05:51 +0200
Subject: [PATCH 121/200] fix #1014

---
 pytorch_transformers/tokenization_bert.py | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/pytorch_transformers/tokenization_bert.py b/pytorch_transformers/tokenization_bert.py
index 177d26dec1..04f35aa466 100644
--- a/pytorch_transformers/tokenization_bert.py
+++ b/pytorch_transformers/tokenization_bert.py
@@ -187,6 +187,8 @@ class BertTokenizer(PreTrainedTokenizer):
         index = 0
         if os.path.isdir(vocab_path):
             vocab_file = os.path.join(vocab_path, VOCAB_FILES_NAMES['vocab_file'])
+        else:
+            vocab_file = vocab_path
         with open(vocab_file, "w", encoding="utf-8") as writer:
             for token, token_index in sorted(self.vocab.items(), key=lambda kv: kv[1]):
                 if index != token_index:

From 53c8f700f4704a58f4684674ced1c57d6ca9240c Mon Sep 17 00:00:00 2001
From: thomwolf <thomwolf@gmail.com>
Date: Tue, 20 Aug 2019 11:29:26 +0200
Subject: [PATCH 122/200] fix #808

---
 pytorch_transformers/modeling_bert.py       | 5 ++++-
 pytorch_transformers/modeling_gpt2.py       | 2 ++
 pytorch_transformers/modeling_openai.py     | 2 ++
 pytorch_transformers/modeling_roberta.py    | 4 ++++
 pytorch_transformers/modeling_transfo_xl.py | 2 ++
 pytorch_transformers/modeling_xlm.py        | 4 ++++
 pytorch_transformers/modeling_xlnet.py      | 2 ++
 7 files changed, 20 insertions(+), 1 deletion(-)

diff --git a/pytorch_transformers/modeling_bert.py b/pytorch_transformers/modeling_bert.py
index 9c20eac9bf..7b34b3fd90 100644
--- a/pytorch_transformers/modeling_bert.py
+++ b/pytorch_transformers/modeling_bert.py
@@ -599,7 +599,10 @@ BERT_INPUTS_DOCSTRING = r"""
                 ``tokens:         [CLS] the dog is hairy . [SEP]``
                 
                 ``token_type_ids:   0   0   0   0  0     0   0``
-    
+
+            Bert is a model with absolute position embeddings so it's usually advised to pad the inputs on
+            the right rather than the left.
+
             Indices can be obtained using :class:`pytorch_transformers.BertTokenizer`.
             See :func:`pytorch_transformers.PreTrainedTokenizer.encode` and
             :func:`pytorch_transformers.PreTrainedTokenizer.convert_tokens_to_ids` for details.
diff --git a/pytorch_transformers/modeling_gpt2.py b/pytorch_transformers/modeling_gpt2.py
index f67d0e88d5..91d01d0584 100644
--- a/pytorch_transformers/modeling_gpt2.py
+++ b/pytorch_transformers/modeling_gpt2.py
@@ -390,6 +390,8 @@ GPT2_START_DOCSTRING = r"""    OpenAI GPT-2 model was proposed in
 GPT2_INPUTS_DOCSTRING = r"""    Inputs:
         **input_ids**: ``torch.LongTensor`` of shape ``(batch_size, sequence_length)``:
             Indices of input sequence tokens in the vocabulary.
+            GPT-2 is a model with absolute position embeddings so it's usually advised to pad the inputs on
+            the right rather than the left.
             Indices can be obtained using :class:`pytorch_transformers.BPT2Tokenizer`.
             See :func:`pytorch_transformers.PreTrainedTokenizer.encode` and
             :func:`pytorch_transformers.PreTrainedTokenizer.convert_tokens_to_ids` for details.
diff --git a/pytorch_transformers/modeling_openai.py b/pytorch_transformers/modeling_openai.py
index e8648487be..71ffb78e0f 100644
--- a/pytorch_transformers/modeling_openai.py
+++ b/pytorch_transformers/modeling_openai.py
@@ -404,6 +404,8 @@ OPENAI_GPT_START_DOCSTRING = r"""    OpenAI GPT model was proposed in
 OPENAI_GPT_INPUTS_DOCSTRING = r"""    Inputs:
         **input_ids**: ``torch.LongTensor`` of shape ``(batch_size, sequence_length)``:
             Indices of input sequence tokens in the vocabulary.
+            GPT is a model with absolute position embeddings so it's usually advised to pad the inputs on
+            the right rather than the left.
             Indices can be obtained using :class:`pytorch_transformers.BPT2Tokenizer`.
             See :func:`pytorch_transformers.PreTrainedTokenizer.encode` and
             :func:`pytorch_transformers.PreTrainedTokenizer.convert_tokens_to_ids` for details.
diff --git a/pytorch_transformers/modeling_roberta.py b/pytorch_transformers/modeling_roberta.py
index e3065cf60b..e49b2a06b1 100644
--- a/pytorch_transformers/modeling_roberta.py
+++ b/pytorch_transformers/modeling_roberta.py
@@ -110,6 +110,10 @@ ROBERTA_INPUTS_DOCSTRING = r"""
 
             Fully encoded sequences or sequence pairs can be obtained using the RobertaTokenizer.encode function with 
             the ``add_special_tokens`` parameter set to ``True``.
+
+            RoBERTa is a model with absolute position embeddings so it's usually advised to pad the inputs on
+            the right rather than the left.
+
             See :func:`pytorch_transformers.PreTrainedTokenizer.encode` and
             :func:`pytorch_transformers.PreTrainedTokenizer.convert_tokens_to_ids` for details.
         **position_ids**: (`optional`) ``torch.LongTensor`` of shape ``(batch_size, sequence_length)``:
diff --git a/pytorch_transformers/modeling_transfo_xl.py b/pytorch_transformers/modeling_transfo_xl.py
index 553a71fffe..3cfdee38cb 100644
--- a/pytorch_transformers/modeling_transfo_xl.py
+++ b/pytorch_transformers/modeling_transfo_xl.py
@@ -936,6 +936,8 @@ TRANSFO_XL_INPUTS_DOCSTRING = r"""
     Inputs:
         **input_ids**: ``torch.LongTensor`` of shape ``(batch_size, sequence_length)``:
             Indices of input sequence tokens in the vocabulary.
+            Transformer-XL is a model with relative position embeddings so you can either pad the inputs on
+            the right or on the left.
             Indices can be obtained using :class:`pytorch_transformers.TransfoXLTokenizer`.
             See :func:`pytorch_transformers.PreTrainedTokenizer.encode` and
             :func:`pytorch_transformers.PreTrainedTokenizer.convert_tokens_to_ids` for details.
diff --git a/pytorch_transformers/modeling_xlm.py b/pytorch_transformers/modeling_xlm.py
index d01d245bbb..be2767ed0c 100644
--- a/pytorch_transformers/modeling_xlm.py
+++ b/pytorch_transformers/modeling_xlm.py
@@ -424,6 +424,10 @@ XLM_INPUTS_DOCSTRING = r"""
     Inputs:
         **input_ids**: ``torch.LongTensor`` of shape ``(batch_size, sequence_length)``:
             Indices of input sequence tokens in the vocabulary.
+
+            XLM is a model with absolute position embeddings so it's usually advised to pad the inputs on
+            the right rather than the left.
+
             Indices can be obtained using :class:`pytorch_transformers.XLMTokenizer`.
             See :func:`pytorch_transformers.PreTrainedTokenizer.encode` and
             :func:`pytorch_transformers.PreTrainedTokenizer.convert_tokens_to_ids` for details.
diff --git a/pytorch_transformers/modeling_xlnet.py b/pytorch_transformers/modeling_xlnet.py
index af33c5a6c2..d44821788e 100644
--- a/pytorch_transformers/modeling_xlnet.py
+++ b/pytorch_transformers/modeling_xlnet.py
@@ -655,6 +655,8 @@ XLNET_INPUTS_DOCSTRING = r"""
     Inputs:
         **input_ids**: ``torch.LongTensor`` of shape ``(batch_size, sequence_length)``:
             Indices of input sequence tokens in the vocabulary.
+            XLNet is a model with relative position embeddings so you can either pad the inputs on
+            the right or on the left.
             Indices can be obtained using :class:`pytorch_transformers.XLNetTokenizer`.
             See :func:`pytorch_transformers.PreTrainedTokenizer.encode` and
             :func:`pytorch_transformers.PreTrainedTokenizer.convert_tokens_to_ids` for details.

From b0b9b8091b73f929306704bd8cd62b712621cebc Mon Sep 17 00:00:00 2001
From: Julien Chaumond <chaumond@gmail.com>
Date: Tue, 20 Aug 2019 11:33:46 +0200
Subject: [PATCH 123/200] minor typo

---
 pytorch_transformers/modeling_gpt2.py   | 2 +-
 pytorch_transformers/modeling_openai.py | 2 +-
 2 files changed, 2 insertions(+), 2 deletions(-)

diff --git a/pytorch_transformers/modeling_gpt2.py b/pytorch_transformers/modeling_gpt2.py
index f67d0e88d5..dd3e465bf3 100644
--- a/pytorch_transformers/modeling_gpt2.py
+++ b/pytorch_transformers/modeling_gpt2.py
@@ -614,7 +614,7 @@ class GPT2LMHeadModel(GPT2PreTrainedModel):
 @add_start_docstrings("""The GPT2 Model transformer with a language modeling and a multiple-choice classification
 head on top e.g. for RocStories/SWAG tasks. The two heads are two linear layers.
 The language modeling head has its weights tied to the input embeddings,
-the classification head takes as input the input of a specified classification token index in the intput sequence).
+the classification head takes as input the input of a specified classification token index in the input sequence).
 """, GPT2_START_DOCSTRING)
 class GPT2DoubleHeadsModel(GPT2PreTrainedModel):
     r"""    Inputs:
diff --git a/pytorch_transformers/modeling_openai.py b/pytorch_transformers/modeling_openai.py
index e8648487be..a4f02111e7 100644
--- a/pytorch_transformers/modeling_openai.py
+++ b/pytorch_transformers/modeling_openai.py
@@ -604,7 +604,7 @@ class OpenAIGPTLMHeadModel(OpenAIGPTPreTrainedModel):
 @add_start_docstrings("""OpenAI GPT Model transformer with a language modeling and a multiple-choice classification
 head on top e.g. for RocStories/SWAG tasks. The two heads are two linear layers.
 The language modeling head has its weights tied to the input embeddings,
-the classification head takes as input the input of a specified classification token index in the intput sequence).
+the classification head takes as input the input of a specified classification token index in the input sequence).
 """, OPENAI_GPT_START_DOCSTRING)
 class OpenAIGPTDoubleHeadsModel(OpenAIGPTPreTrainedModel):
     r"""    Inputs:

From 6d0aa73981f15618cf8d01255b07194e946c3286 Mon Sep 17 00:00:00 2001
From: thomwolf <thomwolf@gmail.com>
Date: Tue, 20 Aug 2019 12:20:21 +0200
Subject: [PATCH 124/200] fix #1034

---
 pytorch_transformers/modeling_xlm.py | 6 ++++--
 1 file changed, 4 insertions(+), 2 deletions(-)

diff --git a/pytorch_transformers/modeling_xlm.py b/pytorch_transformers/modeling_xlm.py
index be2767ed0c..19800da2ed 100644
--- a/pytorch_transformers/modeling_xlm.py
+++ b/pytorch_transformers/modeling_xlm.py
@@ -440,8 +440,10 @@ XLM_INPUTS_DOCSTRING = r"""
             Indices are selected in the vocabulary (unlike BERT which has a specific vocabulary for segment indices).
         **langs**: (`optional`) ``torch.LongTensor`` of shape ``(batch_size, sequence_length)``:
             A parallel sequence of tokens to be used to indicate the language of each token in the input.
-            Indices are selected in the pre-trained language vocabulary,
-            i.e. in the range ``[0, config.n_langs - 1[``.
+            Indices are languages ids which can be obtained from the language names by using two conversion mappings
+            provided in the configuration of the model (only provided for multilingual models).
+            More precisely, the `language name -> language id` mapping is in `model.config.lang2id` (dict str -> int) and
+            the `language id -> language name` mapping is `model.config.id2lang` (dict int -> str).
         **attention_mask**: (`optional`) ``torch.FloatTensor`` of shape ``(batch_size, sequence_length)``:
             Mask to avoid performing attention on padding token indices.
             Mask values selected in ``[0, 1]``:

From bfd75056b0a080addafb7f3d7c9336d27b883a0e Mon Sep 17 00:00:00 2001
From: =?UTF-8?q?Guillem=20Garc=C3=ADa=20Subies?=
 <37592763+GuillemGSubies@users.noreply.github.com>
Date: Tue, 20 Aug 2019 14:06:17 +0200
Subject: [PATCH 125/200] Update tokenization_xlm.py

---
 pytorch_transformers/tokenization_xlm.py | 5 +++--
 1 file changed, 3 insertions(+), 2 deletions(-)

diff --git a/pytorch_transformers/tokenization_xlm.py b/pytorch_transformers/tokenization_xlm.py
index b690a3a945..8e7c2954f2 100644
--- a/pytorch_transformers/tokenization_xlm.py
+++ b/pytorch_transformers/tokenization_xlm.py
@@ -124,8 +124,9 @@ class XLMTokenizer(PreTrainedTokenizer):
                                            **kwargs)
         try:
             import ftfy
-            import spacy
-            self.nlp = spacy.load('en', disable=['parser', 'tagger', 'ner', 'textcat'])
+            from spacy.lang.en import English
+            _nlp = English()
+            self.nlp = nlp.Defaults.create_tokenizer(_nlp)
             self.fix_text = ftfy.fix_text
         except ImportError:
             logger.warning("ftfy or spacy is not installed using BERT BasicTokenizer instead of SpaCy & ftfy.")

From bb04446285be43059050406b3bc4938807c63c25 Mon Sep 17 00:00:00 2001
From: =?UTF-8?q?Guillem=20Garc=C3=ADa=20Subies?=
 <37592763+GuillemGSubies@users.noreply.github.com>
Date: Tue, 20 Aug 2019 14:07:40 +0200
Subject: [PATCH 126/200] Update tokenization_openai.py

---
 pytorch_transformers/tokenization_openai.py | 6 +++---
 1 file changed, 3 insertions(+), 3 deletions(-)

diff --git a/pytorch_transformers/tokenization_openai.py b/pytorch_transformers/tokenization_openai.py
index 0eb5281d39..0f6a8f1dae 100644
--- a/pytorch_transformers/tokenization_openai.py
+++ b/pytorch_transformers/tokenization_openai.py
@@ -89,9 +89,9 @@ class OpenAIGPTTokenizer(PreTrainedTokenizer):
 
         try:
             import ftfy
-            import spacy
-            self.nlp = spacy.load('en', disable=['parser', 'tagger', 'ner', 'textcat'])
-            self.fix_text = ftfy.fix_text
+            from spacy.lang.en import English
+            _nlp = English()
+            self.nlp = nlp.Defaults.create_tokenizer(_nlp)
         except ImportError:
             logger.warning("ftfy or spacy is not installed using BERT BasicTokenizer instead of SpaCy & ftfy.")
             self.nlp = BasicTokenizer(do_lower_case=True)

From 562b998366c7a4a2bd0addf1a860fbee0aa04d74 Mon Sep 17 00:00:00 2001
From: =?UTF-8?q?Guillem=20Garc=C3=ADa=20Subies?=
 <37592763+GuillemGSubies@users.noreply.github.com>
Date: Tue, 20 Aug 2019 14:10:19 +0200
Subject: [PATCH 127/200] Update tokenization_openai.py

---
 pytorch_transformers/tokenization_openai.py | 1 +
 1 file changed, 1 insertion(+)

diff --git a/pytorch_transformers/tokenization_openai.py b/pytorch_transformers/tokenization_openai.py
index 0f6a8f1dae..79eb023a8d 100644
--- a/pytorch_transformers/tokenization_openai.py
+++ b/pytorch_transformers/tokenization_openai.py
@@ -92,6 +92,7 @@ class OpenAIGPTTokenizer(PreTrainedTokenizer):
             from spacy.lang.en import English
             _nlp = English()
             self.nlp = nlp.Defaults.create_tokenizer(_nlp)
+            self.fix_text = ftfy.fix_text
         except ImportError:
             logger.warning("ftfy or spacy is not installed using BERT BasicTokenizer instead of SpaCy & ftfy.")
             self.nlp = BasicTokenizer(do_lower_case=True)

From f5e2ed0fd89d5730126d71c03324fa07ae674ca7 Mon Sep 17 00:00:00 2001
From: =?UTF-8?q?Guillem=20Garc=C3=ADa=20Subies?=
 <37592763+GuillemGSubies@users.noreply.github.com>
Date: Tue, 20 Aug 2019 14:19:25 +0200
Subject: [PATCH 128/200] Update tokenization_openai.py

---
 pytorch_transformers/tokenization_openai.py | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/pytorch_transformers/tokenization_openai.py b/pytorch_transformers/tokenization_openai.py
index 79eb023a8d..51b418ebd3 100644
--- a/pytorch_transformers/tokenization_openai.py
+++ b/pytorch_transformers/tokenization_openai.py
@@ -91,7 +91,7 @@ class OpenAIGPTTokenizer(PreTrainedTokenizer):
             import ftfy
             from spacy.lang.en import English
             _nlp = English()
-            self.nlp = nlp.Defaults.create_tokenizer(_nlp)
+            self.nlp = _nlp.Defaults.create_tokenizer(_nlp)
             self.fix_text = ftfy.fix_text
         except ImportError:
             logger.warning("ftfy or spacy is not installed using BERT BasicTokenizer instead of SpaCy & ftfy.")

From 388e3251fa95b892949968dc89065e464a93b69f Mon Sep 17 00:00:00 2001
From: =?UTF-8?q?Guillem=20Garc=C3=ADa=20Subies?=
 <37592763+GuillemGSubies@users.noreply.github.com>
Date: Tue, 20 Aug 2019 14:19:39 +0200
Subject: [PATCH 129/200] Update tokenization_xlm.py

---
 pytorch_transformers/tokenization_xlm.py | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/pytorch_transformers/tokenization_xlm.py b/pytorch_transformers/tokenization_xlm.py
index 8e7c2954f2..2d2f3a8cd4 100644
--- a/pytorch_transformers/tokenization_xlm.py
+++ b/pytorch_transformers/tokenization_xlm.py
@@ -126,7 +126,7 @@ class XLMTokenizer(PreTrainedTokenizer):
             import ftfy
             from spacy.lang.en import English
             _nlp = English()
-            self.nlp = nlp.Defaults.create_tokenizer(_nlp)
+            self.nlp = _nlp.Defaults.create_tokenizer(_nlp)
             self.fix_text = ftfy.fix_text
         except ImportError:
             logger.warning("ftfy or spacy is not installed using BERT BasicTokenizer instead of SpaCy & ftfy.")

From ad6e62cd827d546691845aca5fb9b437c5812d6a Mon Sep 17 00:00:00 2001
From: Nikolay Korolev <korolevns98@gmail.com>
Date: Tue, 20 Aug 2019 15:43:06 +0300
Subject: [PATCH 130/200] Fix typo. configuratoin -> configuration

---
 README.md | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/README.md b/README.md
index 7d2445fc11..4e57de5842 100644
--- a/README.md
+++ b/README.md
@@ -328,7 +328,7 @@ Breaking change in the `from_pretrained()`method:
 
 1. Models are now set in evaluation mode by default when instantiated with the `from_pretrained()` method. To train them don't forget to set them back in training mode (`model.train()`) to activate the dropout modules.
 
-2. The additional `*input` and `**kwargs` arguments supplied to the `from_pretrained()` method used to be directly passed to the underlying model's class `__init__()` method. They are now used to update the model configuration attribute instead which can break derived model classes build based on the previous `BertForSequenceClassification` examples. We are working on a way to mitigate this breaking change in [#866](https://github.com/huggingface/pytorch-transformers/pull/866) by forwarding the the model `__init__()` method (i) the provided positional arguments and (ii) the keyword arguments which do not match any configuratoin class attributes.
+2. The additional `*input` and `**kwargs` arguments supplied to the `from_pretrained()` method used to be directly passed to the underlying model's class `__init__()` method. They are now used to update the model configuration attribute instead which can break derived model classes build based on the previous `BertForSequenceClassification` examples. We are working on a way to mitigate this breaking change in [#866](https://github.com/huggingface/pytorch-transformers/pull/866) by forwarding the the model `__init__()` method (i) the provided positional arguments and (ii) the keyword arguments which do not match any configuration class attributes.
 
 Also, while not a breaking change, the serialization methods have been standardized and you probably should switch to the new method `save_pretrained(save_directory)` if you were using any other serialization method before.
 

From a690edab174cd1b7a5b684db34158b16c68441f8 Mon Sep 17 00:00:00 2001
From: thomwolf <thomwolf@gmail.com>
Date: Tue, 20 Aug 2019 15:52:12 +0200
Subject: [PATCH 131/200] various fix and clean up on run_lm_finetuning

---
 .gitignore                                    |   5 +-
 ...ive_finetuning.py => run_lm_finetuning.py} | 165 ++++++++++++------
 examples/utils_lm.py                          |  51 ------
 3 files changed, 116 insertions(+), 105 deletions(-)
 rename examples/{run_generative_finetuning.py => run_lm_finetuning.py} (75%)
 delete mode 100644 examples/utils_lm.py

diff --git a/.gitignore b/.gitignore
index 6bbe32df6c..bbc738b931 100644
--- a/.gitignore
+++ b/.gitignore
@@ -127,4 +127,7 @@ proc_data
 
 # examples
 runs
-examples/runs
\ No newline at end of file
+examples/runs
+
+# data
+data
\ No newline at end of file
diff --git a/examples/run_generative_finetuning.py b/examples/run_lm_finetuning.py
similarity index 75%
rename from examples/run_generative_finetuning.py
rename to examples/run_lm_finetuning.py
index 8501364ae4..bd7047a587 100644
--- a/examples/run_generative_finetuning.py
+++ b/examples/run_lm_finetuning.py
@@ -25,33 +25,75 @@ import argparse
 import glob
 import logging
 import os
+import pickle
 import random
 
 import numpy as np
 import torch
-from torch.utils.data import (DataLoader, SequentialSampler,)
+from torch.utils.data import DataLoader, Dataset, SequentialSampler
 from torch.utils.data.distributed import DistributedSampler
 from tensorboardX import SummaryWriter
 from tqdm import tqdm, trange
 
-from pytorch_transformers import (WEIGHTS_NAME, GPT2Config, GPT2LMHeadModel, GPT2Tokenizer, GPT2_PRETRAINED_MODEL_ARCHIVE_MAP,
-                                  OpenAIGPTConfig, OpenAIGPTLMHeadModel, OpenAIGPTTokenizer, OPENAI_GPT_PRETRAINED_MODEL_ARCHIVE_MAP,
-                                  BertConfig, BertForMaskedLM, BertTokenizer, BERT_PRETRAINED_MODEL_ARCHIVE_MAP,
-                                  RobertaConfig, RobertaForMaskedLM, RobertaTokenizer, ROBERTA_PRETRAINED_MODEL_ARCHIVE_MAP)
-from pytorch_transformers import AdamW, WarmupLinearSchedule
-logger = logging.getLogger(__name__)
+from pytorch_transformers import (WEIGHTS_NAME, AdamW, WarmupLinearSchedule,
+                                  BertConfig, BertForMaskedLM, BertTokenizer,
+                                  GPT2Config, GPT2LMHeadModel, GPT2Tokenizer,
+                                  OpenAIGPTConfig, OpenAIGPTLMHeadModel, OpenAIGPTTokenizer,
+                                  RobertaConfig, RobertaForMaskedLM, RobertaTokenizer)
 
-from utils_lm import WikiTextDataset
+
+logger = logging.getLogger(__name__)
 
 
 MODEL_CLASSES = {
     'gpt2': (GPT2Config, GPT2LMHeadModel, GPT2Tokenizer),
     'openai-gpt': (OpenAIGPTConfig, OpenAIGPTLMHeadModel, OpenAIGPTTokenizer),
-    "bert": (BertConfig, BertForMaskedLM, BertTokenizer),
-    "roberta": (RobertaConfig, RobertaForMaskedLM, RobertaTokenizer)
+    'bert': (BertConfig, BertForMaskedLM, BertTokenizer),
+    'roberta': (RobertaConfig, RobertaForMaskedLM, RobertaTokenizer)
 }
 
 
+class TextDataset(Dataset):
+    def __init__(self, tokenizer, file_path='train', block_size=512):
+        assert os.path.isfile(file_path)
+        directory, filename = os.path.split(file_path)
+        cached_features_file = os.path.join(directory, f'cached_lm_{block_size}_{filename}')
+
+        if os.path.exists(cached_features_file):
+            logger.info("Loading features from cached file %s", cached_features_file)
+            with open(cached_features_file, 'rb') as handle:
+                self.examples = pickle.load(handle)
+        else:
+            logger.info("Creating features from dataset file at %s", directory)
+
+            self.examples = []
+            with open(file_path, encoding="utf-8") as f:
+                text = f.read()
+
+            tokenized_text = tokenizer.convert_tokens_to_ids(tokenizer.tokenize(text))
+            while len(tokenized_text) >= block_size:  # Truncate in block of block_size
+                self.examples.append(tokenized_text[:block_size])
+                tokenized_text = tokenized_text[block_size:]
+            # Note that we are loosing the last truncated example here for the sake of simplicity (no padding)
+            # If your dataset is small, first you should loook for a bigger one :-) and second you
+            # can change this behavior by adding (model specific) padding.
+
+            logger.info("Saving features into cached file %s", cached_features_file)
+            with open(cached_features_file, 'wb') as handle:
+                pickle.dump(self.examples, handle, protocol=pickle.HIGHEST_PROTOCOL)
+
+    def __len__(self):
+        return len(self.examples)
+
+    def __getitem__(self, item):
+        return torch.tensor(self.examples[item])
+
+
+def load_and_cache_examples(args, tokenizer, evaluate=False):
+    dataset = TextDataset(tokenizer, file_path=args.eval_data_file if evaluate else args.train_data_file, block_size=args.block_size)
+    return dataset
+
+
 def set_seed(args):
     random.seed(args.seed)
     np.random.seed(args.seed)
@@ -59,20 +101,27 @@ def set_seed(args):
     if args.n_gpu > 0:
         torch.cuda.manual_seed_all(args.seed)
 
-# Prepare masked tokens inputs/labels for masked language modeling: 80% MASK, 10% random, 10% original
-def mask_tokens(inputs, tokenizer, args):
-    labels = inputs.clone()
-    masked_indices = torch.bernoulli(torch.full(labels.shape, args.mlm_probability)).byte()
-    labels[~masked_indices.bool()] = -1  # We only compute loss on masked tokens
-    indices_replaced = torch.bernoulli(torch.full(labels.shape, 0.8)).byte() & masked_indices
 
-    inputs[indices_replaced.bool()] = tokenizer.convert_tokens_to_ids(tokenizer.mask_token) # 80% of the time, replace masked input tokens with [MASK]
-    indices_random = (torch.bernoulli(torch.full(labels.shape, 0.5)).byte() & masked_indices & ~indices_replaced).bool()
-    random_words = torch.randint(args.num_embeddings, labels.shape, dtype=torch.long)
-    inputs[indices_random] = random_words[
-        indices_random]  # 10% of the time, replace masked input tokens with random word
+def mask_tokens(inputs, tokenizer, args):
+    """ Prepare masked tokens inputs/labels for masked language modeling: 80% MASK, 10% random, 10% original. """
+    labels = inputs.clone()
+    # We sample a few tokens in each sequence for masked-LM training (with probability args.mlm_probability defaults to 0.15 in Bert/RoBERTa)
+    masked_indices = torch.bernoulli(torch.full(labels.shape, args.mlm_probability)).byte()
+    labels[~masked_indices] = -1  # We only compute loss on masked tokens
+
+    # 80% of the time, we replace masked input tokens with tokenizer.mask_token ([MASK])
+    indices_replaced = torch.bernoulli(torch.full(labels.shape, 0.8)).byte() & masked_indices
+    inputs[indices_replaced] = tokenizer.convert_tokens_to_ids(tokenizer.mask_token)
+
+    # 10% of the time, we replace masked input tokens with random word
+    indices_random = torch.bernoulli(torch.full(labels.shape, 0.5)).byte() & masked_indices & ~indices_replaced
+    random_words = torch.randint(len(tokenizer), labels.shape, dtype=torch.long)
+    inputs[indices_random] = random_words[indices_random]
+
+    # The rest of the time (10% of the time) we keep the masked input tokens unchanged
     return inputs, labels
 
+
 def train(args, train_dataset, model, tokenizer):
     """ Train the model """
     if args.local_rank in [-1, 0]:
@@ -146,13 +195,15 @@ def train(args, train_dataset, model, tokenizer):
             if args.fp16:
                 with amp.scale_loss(loss, optimizer) as scaled_loss:
                     scaled_loss.backward()
-                torch.nn.utils.clip_grad_norm_(amp.master_params(optimizer), args.max_grad_norm)
             else:
                 loss.backward()
-                torch.nn.utils.clip_grad_norm_(model.parameters(), args.max_grad_norm)
 
             tr_loss += loss.item()
             if (step + 1) % args.gradient_accumulation_steps == 0:
+                if args.fp16:
+                    torch.nn.utils.clip_grad_norm_(amp.master_params(optimizer), args.max_grad_norm)
+                else:
+                    torch.nn.utils.clip_grad_norm_(model.parameters(), args.max_grad_norm)
                 optimizer.step()
                 scheduler.step()  # Update learning rate schedule
                 model.zero_grad()
@@ -240,24 +291,22 @@ def evaluate(args, model, tokenizer, prefix=""):
     return results
 
 
-def load_and_cache_examples(args, tokenizer, evaluate=False):
-    dataset = WikiTextDataset(args, tokenizer, file="test" if evaluate else "train", directory=args.data_dir)
-    return dataset
-
-
 def main():
     parser = argparse.ArgumentParser()
 
     ## Required parameters
-    parser.add_argument("--data_dir", default=None, type=str, required=True,
-                        help="The input data dir. Should contain the .tsv files (or other data files) for the task.")
+    parser.add_argument("--train_data_file", default=None, type=str, required=True,
+                        help="The input training data file (a text file).")
     parser.add_argument("--output_dir", default=None, type=str, required=True,
                         help="The output directory where the model predictions and checkpoints will be written.")
 
     ## Other parameters
-    parser.add_argument("--model_name", default="bert", type=str,
+    parser.add_argument("--eval_data_file", default=None, type=str,
+                        help="An optional input evaluation data file to evaluate the perplexity on (a text file).")
+
+    parser.add_argument("--model_type", default="bert", type=str,
                         help="The model architecture to be fine-tuned.")
-    parser.add_argument("--model_checkpoint", default="bert-base-cased", type=str,
+    parser.add_argument("--model_name_or_path", default="bert-base-cased", type=str,
                         help="The model checkpoint for weights initialization.")
 
     parser.add_argument("--mlm", action='store_true',
@@ -266,20 +315,21 @@ def main():
                         help="Ratio of tokens to mask for masked language modeling loss")
 
     parser.add_argument("--config_name", default="", type=str,
-                        help="Pretrained config name or path if not the same as model_name")
+                        help="Optional pretrained config name or path if not the same as model_name_or_path")
     parser.add_argument("--tokenizer_name", default="", type=str,
-                        help="Pretrained tokenizer name or path if not the same as model_name")
+                        help="Optional pretrained tokenizer name or path if not the same as model_name_or_path")
     parser.add_argument("--cache_dir", default="", type=str,
-                        help="Where do you want to store the pre-trained models downloaded from s3")
-    parser.add_argument("--max_seq_length", default=128, type=int,
-                        help="The maximum total input sequence length after tokenization. Sequences longer "
-                             "than this will be truncated, sequences shorter will be padded.")
+                        help="Optional directory to store the pre-trained models downloaded from s3 (instread of the default one)")
+    parser.add_argument("--block_size", default=-1, type=int,
+                        help="Optional input sequence length after tokenization."
+                             "The training dataset will be truncated in block of this size for training."
+                             "Default to the model max input length.")
     parser.add_argument("--do_train", action='store_true',
                         help="Whether to run training.")
     parser.add_argument("--do_eval", action='store_true',
                         help="Whether to run eval on the dev set.")
     parser.add_argument("--evaluate_during_training", action='store_true',
-                        help="Rul evaluation during training at each logging step.")
+                        help="Run evaluation during training at each logging step.")
     parser.add_argument("--do_lower_case", action='store_true',
                         help="Set this flag if you are using an uncased model.")
 
@@ -309,7 +359,7 @@ def main():
     parser.add_argument('--save_steps', type=int, default=50,
                         help="Save checkpoint every X updates steps.")
     parser.add_argument("--eval_all_checkpoints", action='store_true',
-                        help="Evaluate all checkpoints starting with the same prefix as model_name ending and ending with step number")
+                        help="Evaluate all checkpoints starting with the same prefix as model_name_or_path ending and ending with step number")
     parser.add_argument("--no_cuda", action='store_true',
                         help="Avoid using CUDA when available")
     parser.add_argument('--overwrite_output_dir', action='store_true',
@@ -330,9 +380,12 @@ def main():
     parser.add_argument('--server_port', type=str, default='', help="For distant debugging.")
     args = parser.parse_args()
 
-    if args.model_name in ["bert", "roberta"] and not args.mlm:
+    if args.model_type in ["bert", "roberta"] and not args.mlm:
         raise ValueError("BERT and RoBERTa do not have LM heads but masked LM heads. They must be run using the --mlm "
                          "flag (masked language modeling).")
+    if args.eval_data_file is None and args.do_eval:
+        raise ValueError("Cannot do evaluation without an evaluation data file. Either supply a file to --eval_data_file "
+                         "or remove the --do_eval argument.")
 
     if os.path.exists(args.output_dir) and os.listdir(args.output_dir) and args.do_train and not args.overwrite_output_dir:
         raise ValueError("Output directory ({}) already exists and is not empty. Use --overwrite_output_dir to overcome.".format(args.output_dir))
@@ -368,30 +421,36 @@ def main():
 
     # Load pretrained model and tokenizer
     if args.local_rank not in [-1, 0]:
-        torch.distributed.barrier()  # Make sure only the first process in distributed training will download model & vocab
+        torch.distributed.barrier()  # Barrier to make sure only the first process in distributed training download model & vocab
 
-    config_class, model_class, tokenizer_class = MODEL_CLASSES[args.model_name]
-    config = config_class.from_pretrained(args.config_name if args.config_name else args.model_checkpoint)
-    tokenizer = tokenizer_class.from_pretrained(args.tokenizer_name if args.tokenizer_name else args.model_checkpoint, do_lower_case=args.do_lower_case)
-    model = model_class.from_pretrained(args.model_checkpoint, from_tf=bool('.ckpt' in args.model_checkpoint), config=config)
-    args.num_embeddings = config.vocab_size  # We need this to create the model at next line (number of embeddings to use)
+    config_class, model_class, tokenizer_class = MODEL_CLASSES[args.model_type]
+    config = config_class.from_pretrained(args.config_name if args.config_name else args.model_name_or_path)
+    tokenizer = tokenizer_class.from_pretrained(args.tokenizer_name if args.tokenizer_name else args.model_name_or_path, do_lower_case=args.do_lower_case)
+    if args.block_size <= 0:
+        args.block_size = tokenizer.max_len  # Our input block size will be the max possible for the model
+    model = model_class.from_pretrained(args.model_name_or_path, from_tf=bool('.ckpt' in args.model_name_or_path), config=config)
+    model.to(args.device)
 
     if args.local_rank == 0:
-        torch.distributed.barrier()  # Make sure only the first process in distributed training will download model & vocab
-
-    model.to(args.device)
+        torch.distributed.barrier()  # End of barrier to make sure only the first process in distributed training download model & vocab
 
     logger.info("Training/evaluation parameters %s", args)
 
-
     # Training
     if args.do_train:
+        if args.local_rank not in [-1, 0]:
+            torch.distributed.barrier()  # Barrier to make sure only the first process in distributed training process the dataset, and the others will use the cache
+
         train_dataset = load_and_cache_examples(args, tokenizer, evaluate=False)
+
+        if args.local_rank == 0:
+            torch.distributed.barrier()
+
         global_step, tr_loss = train(args, train_dataset, model, tokenizer)
         logger.info(" global_step = %s, average loss = %s", global_step, tr_loss)
 
 
-    # Saving best-practices: if you use defaults names for the model, you can reload it using from_pretrained()
+    # Saving best-practices: if you use save_pretrained for the model and tokenizer, you can reload them using from_pretrained()
     if args.do_train and (args.local_rank == -1 or torch.distributed.get_rank() == 0):
         # Create output directory if needed
         if not os.path.exists(args.output_dir) and args.local_rank in [-1, 0]:
@@ -409,7 +468,7 @@ def main():
 
         # Load a trained model and vocabulary that you have fine-tuned
         model = model_class.from_pretrained(args.output_dir)
-        tokenizer = tokenizer_class.from_pretrained(args.output_dir)
+        tokenizer = tokenizer_class.from_pretrained(args.output_dir, do_lower_case=args.do_lower_case)
         model.to(args.device)
 
 
diff --git a/examples/utils_lm.py b/examples/utils_lm.py
deleted file mode 100644
index 251aea90e1..0000000000
--- a/examples/utils_lm.py
+++ /dev/null
@@ -1,51 +0,0 @@
-from torch.utils.data import Dataset, DataLoader
-import os
-import random
-import torch
-import torch.nn.functional as F
-import logging
-import pickle
-
-logger = logging.getLogger(__name__)
-
-
-class WikiTextDataset(Dataset):
-	def __init__(self, args, tokenizer, file='train', directory='wikitext', max_context_length=512, cache=None):
-		if args.local_rank not in [-1, 0]:
-			torch.distributed.barrier()  # Make sure only the first process in distributed training process the dataset, and the others will use the cache
-			
-			
-		cached_features_file = os.path.join(args.data_dir, f'cached_lm_{file}_{args.max_seq_length}')
-		
-		if os.path.exists(cached_features_file):
-			logger.info("Loading features from cached file %s", cached_features_file)
-			with open(cached_features_file, 'rb') as handle:
-				self.examples = pickle.load(handle)
-		else:
-			logger.info("Creating features from dataset file at %s", args.data_dir)	
-		
-		self.max_context_length = max_context_length
-
-		self.examples = []
-
-		with open(os.path.join(directory, f"wiki.{file}.raw"), encoding="utf-8") as f:
-			text = f.read()
-			tokenized_text = tokenizer.convert_tokens_to_ids(tokenizer.tokenize(text))
-
-			while len(tokenized_text) > max_context_length:
-				self.examples.append(tokenized_text[:max_context_length])
-				tokenized_text = tokenized_text[max_context_length:]
-			
-		if args.local_rank in [-1, 0]:
-			logger.info("Saving features into cached file %s", cached_features_file)
-			with open(cached_features_file, 'wb') as handle:
-				pickle.dump(self.examples, handle, protocol=pickle.HIGHEST_PROTOCOL)
-		
-		if args.local_rank == 0:
-			torch.distributed.barrier()
-
-	def __len__(self):
-		return len(self.examples)
-
-	def __getitem__(self, item):
-		return torch.tensor(self.examples[item])

From 43489756ad421a99d0f3eb9d83116b9b4904c922 Mon Sep 17 00:00:00 2001
From: thomwolf <thomwolf@gmail.com>
Date: Tue, 20 Aug 2019 16:59:11 +0200
Subject: [PATCH 132/200] adding proxies options for the from_pretrained
 methods

---
 .gitignore                                 |  4 ++-
 pytorch_transformers/file_utils.py         | 29 +++++++++++-----------
 pytorch_transformers/modeling_utils.py     | 14 +++++++++--
 pytorch_transformers/tokenization_utils.py |  7 +++++-
 4 files changed, 36 insertions(+), 18 deletions(-)

diff --git a/.gitignore b/.gitignore
index 6bbe32df6c..466a167552 100644
--- a/.gitignore
+++ b/.gitignore
@@ -127,4 +127,6 @@ proc_data
 
 # examples
 runs
-examples/runs
\ No newline at end of file
+examples/runs
+
+data
\ No newline at end of file
diff --git a/pytorch_transformers/file_utils.py b/pytorch_transformers/file_utils.py
index 074e6743ef..f6f2151b12 100644
--- a/pytorch_transformers/file_utils.py
+++ b/pytorch_transformers/file_utils.py
@@ -17,8 +17,9 @@ from hashlib import sha256
 from io import open
 
 import boto3
-import requests
+from botocore.config import Config
 from botocore.exceptions import ClientError
+import requests
 from tqdm import tqdm
 
 try:
@@ -93,7 +94,7 @@ def filename_to_url(filename, cache_dir=None):
     return url, etag
 
 
-def cached_path(url_or_filename, cache_dir=None, force_download=False):
+def cached_path(url_or_filename, cache_dir=None, force_download=False, proxies=None):
     """
     Given something that might be a URL (or might be a local path),
     determine which. If it's a URL, download the file and cache it, and
@@ -114,7 +115,7 @@ def cached_path(url_or_filename, cache_dir=None, force_download=False):
 
     if parsed.scheme in ('http', 'https', 's3'):
         # URL, so get it from the cache (downloading if necessary)
-        return get_from_cache(url_or_filename, cache_dir=cache_dir, force_download=force_download)
+        return get_from_cache(url_or_filename, cache_dir=cache_dir, force_download=force_download, proxies=proxies)
     elif os.path.exists(url_or_filename):
         # File, and it exists.
         return url_or_filename
@@ -159,24 +160,24 @@ def s3_request(func):
 
 
 @s3_request
-def s3_etag(url):
+def s3_etag(url, proxies=None):
     """Check ETag on S3 object."""
-    s3_resource = boto3.resource("s3")
+    s3_resource = boto3.resource("s3", config=Config(proxies=proxies))
     bucket_name, s3_path = split_s3_path(url)
     s3_object = s3_resource.Object(bucket_name, s3_path)
     return s3_object.e_tag
 
 
 @s3_request
-def s3_get(url, temp_file):
+def s3_get(url, temp_file, proxies=None):
     """Pull a file directly from S3."""
-    s3_resource = boto3.resource("s3")
+    s3_resource = boto3.resource("s3", config=Config(proxies=proxies))
     bucket_name, s3_path = split_s3_path(url)
     s3_resource.Bucket(bucket_name).download_fileobj(s3_path, temp_file)
 
 
-def http_get(url, temp_file):
-    req = requests.get(url, stream=True)
+def http_get(url, temp_file, proxies=None):
+    req = requests.get(url, stream=True, proxies=proxies)
     content_length = req.headers.get('Content-Length')
     total = int(content_length) if content_length is not None else None
     progress = tqdm(unit="B", total=total)
@@ -187,7 +188,7 @@ def http_get(url, temp_file):
     progress.close()
 
 
-def get_from_cache(url, cache_dir=None, force_download=False):
+def get_from_cache(url, cache_dir=None, force_download=False, proxies=None):
     """
     Given a URL, look for the corresponding dataset in the local cache.
     If it's not there, download it. Then return the path to the cached file.
@@ -204,10 +205,10 @@ def get_from_cache(url, cache_dir=None, force_download=False):
 
     # Get eTag to add to filename, if it exists.
     if url.startswith("s3://"):
-        etag = s3_etag(url)
+        etag = s3_etag(url, proxies=proxies)
     else:
         try:
-            response = requests.head(url, allow_redirects=True)
+            response = requests.head(url, allow_redirects=True, proxies=proxies)
             if response.status_code != 200:
                 etag = None
             else:
@@ -238,9 +239,9 @@ def get_from_cache(url, cache_dir=None, force_download=False):
 
             # GET file object
             if url.startswith("s3://"):
-                s3_get(url, temp_file)
+                s3_get(url, temp_file, proxies=proxies)
             else:
-                http_get(url, temp_file)
+                http_get(url, temp_file, proxies=proxies)
 
             # we are copying the file before closing it, so flush to avoid truncation
             temp_file.flush()
diff --git a/pytorch_transformers/modeling_utils.py b/pytorch_transformers/modeling_utils.py
index 3e4fbca132..f1501aa8d5 100644
--- a/pytorch_transformers/modeling_utils.py
+++ b/pytorch_transformers/modeling_utils.py
@@ -128,6 +128,10 @@ class PretrainedConfig(object):
             force_download: (`optional`) boolean, default False:
                 Force to (re-)download the model weights and configuration files and override the cached versions if they exists.
 
+            proxies: (`optional`) dict, default None:
+                A dictionary of proxy servers to use by protocol or endpoint, e.g.: {'http': 'foo.bar:3128', 'http://hostname': 'foo.bar:4012'}.
+                The proxies are used on each request.
+
             return_unused_kwargs: (`optional`) bool:
 
                 - If False, then this function returns just the final configuration object.
@@ -150,6 +154,7 @@ class PretrainedConfig(object):
         """
         cache_dir = kwargs.pop('cache_dir', None)
         force_download = kwargs.pop('force_download', False)
+        proxies = kwargs.pop('proxies', None)
         return_unused_kwargs = kwargs.pop('return_unused_kwargs', False)
 
         if pretrained_model_name_or_path in cls.pretrained_config_archive_map:
@@ -160,7 +165,7 @@ class PretrainedConfig(object):
             config_file = pretrained_model_name_or_path
         # redirect to the cache, if necessary
         try:
-            resolved_config_file = cached_path(config_file, cache_dir=cache_dir, force_download=force_download)
+            resolved_config_file = cached_path(config_file, cache_dir=cache_dir, force_download=force_download, proxies=proxies)
         except EnvironmentError:
             if pretrained_model_name_or_path in cls.pretrained_config_archive_map:
                 logger.error(
@@ -407,6 +412,10 @@ class PreTrainedModel(nn.Module):
             force_download: (`optional`) boolean, default False:
                 Force to (re-)download the model weights and configuration files and override the cached versions if they exists.
 
+            proxies: (`optional`) dict, default None:
+                A dictionary of proxy servers to use by protocol or endpoint, e.g.: {'http': 'foo.bar:3128', 'http://hostname': 'foo.bar:4012'}.
+                The proxies are used on each request.
+
             output_loading_info: (`optional`) boolean:
                 Set to ``True`` to also return a dictionnary containing missing keys, unexpected keys and error messages.
 
@@ -432,6 +441,7 @@ class PreTrainedModel(nn.Module):
         cache_dir = kwargs.pop('cache_dir', None)
         from_tf = kwargs.pop('from_tf', False)
         force_download = kwargs.pop('force_download', False)
+        proxies = kwargs.pop('proxies', None)
         output_loading_info = kwargs.pop('output_loading_info', False)
 
         # Load config
@@ -462,7 +472,7 @@ class PreTrainedModel(nn.Module):
                 archive_file = pretrained_model_name_or_path
         # redirect to the cache, if necessary
         try:
-            resolved_archive_file = cached_path(archive_file, cache_dir=cache_dir, force_download=force_download)
+            resolved_archive_file = cached_path(archive_file, cache_dir=cache_dir, force_download=force_download, proxies=proxies)
         except EnvironmentError:
             if pretrained_model_name_or_path in cls.pretrained_model_archive_map:
                 logger.error(
diff --git a/pytorch_transformers/tokenization_utils.py b/pytorch_transformers/tokenization_utils.py
index 763c0cee04..68af97a518 100644
--- a/pytorch_transformers/tokenization_utils.py
+++ b/pytorch_transformers/tokenization_utils.py
@@ -196,6 +196,10 @@ class PreTrainedTokenizer(object):
             force_download: (`optional`) boolean, default False:
                 Force to (re-)download the vocabulary files and override the cached versions if they exists.
 
+            proxies: (`optional`) dict, default None:
+                A dictionary of proxy servers to use by protocol or endpoint, e.g.: {'http': 'foo.bar:3128', 'http://hostname': 'foo.bar:4012'}.
+                The proxies are used on each request.
+
             inputs: (`optional`) positional arguments: will be passed to the Tokenizer ``__init__`` method.
 
             kwargs: (`optional`) keyword arguments: will be passed to the Tokenizer ``__init__`` method. Can be used to set special tokens like ``bos_token``, ``eos_token``, ``unk_token``, ``sep_token``, ``pad_token``, ``cls_token``, ``mask_token``, ``additional_special_tokens``. See parameters in the doc string of :class:`~pytorch_transformers.PreTrainedTokenizer` for details.
@@ -227,6 +231,7 @@ class PreTrainedTokenizer(object):
     def _from_pretrained(cls, pretrained_model_name_or_path, *inputs, **kwargs):
         cache_dir = kwargs.pop('cache_dir', None)
         force_download = kwargs.pop('force_download', False)
+        proxies = kwargs.pop('proxies', None)
 
         s3_models = list(cls.max_model_input_sizes.keys())
         vocab_files = {}
@@ -287,7 +292,7 @@ class PreTrainedTokenizer(object):
                 if file_path is None:
                     resolved_vocab_files[file_id] = None
                 else:
-                    resolved_vocab_files[file_id] = cached_path(file_path, cache_dir=cache_dir, force_download=force_download)
+                    resolved_vocab_files[file_id] = cached_path(file_path, cache_dir=cache_dir, force_download=force_download, proxies=proxies)
         except EnvironmentError:
             if pretrained_model_name_or_path in s3_models:
                 logger.error("Couldn't reach server to download vocabulary.")

From 3bffd2e8e5d726d581e0a66746b25c64d49e231d Mon Sep 17 00:00:00 2001
From: Peng Qi <qipeng@users.noreply.github.com>
Date: Tue, 20 Aug 2019 10:59:28 -0700
Subject: [PATCH 133/200] more fixes

---
 examples/run_glue.py  | 2 +-
 examples/run_squad.py | 2 +-
 2 files changed, 2 insertions(+), 2 deletions(-)

diff --git a/examples/run_glue.py b/examples/run_glue.py
index 7fb0732e61..1729f4f7e3 100644
--- a/examples/run_glue.py
+++ b/examples/run_glue.py
@@ -467,7 +467,7 @@ def main():
 
         # Load a trained model and vocabulary that you have fine-tuned
         model = model_class.from_pretrained(args.output_dir)
-        tokenizer = tokenizer_class.from_pretrained(args.output_dir)
+        tokenizer = tokenizer_class.from_pretrained(args.output_dir, do_lower_case=args.do_lower_case)
         model.to(args.device)
 
 
diff --git a/examples/run_squad.py b/examples/run_squad.py
index efa835107c..c0586b03bd 100644
--- a/examples/run_squad.py
+++ b/examples/run_squad.py
@@ -481,7 +481,7 @@ def main():
 
 
     # Save the trained model and the tokenizer
-    if args.do_train and args.local_rank == -1 or torch.distributed.get_rank() == 0:
+    if args.do_train and (args.local_rank == -1 or torch.distributed.get_rank() == 0):
         # Create output directory if needed
         if not os.path.exists(args.output_dir) and args.local_rank in [-1, 0]:
             os.makedirs(args.output_dir)

From 2d042274ac9ee6cd03aabcb861126937a29feb1a Mon Sep 17 00:00:00 2001
From: Lysandre <lysandre.debut@reseau.eseo.fr>
Date: Tue, 20 Aug 2019 14:15:28 -0400
Subject: [PATCH 134/200] Sequence special token handling for BERT and RoBERTa

---
 examples/run_lm_finetuning.py | 10 ++++++++--
 1 file changed, 8 insertions(+), 2 deletions(-)

diff --git a/examples/run_lm_finetuning.py b/examples/run_lm_finetuning.py
index bd7047a587..c69d4db53b 100644
--- a/examples/run_lm_finetuning.py
+++ b/examples/run_lm_finetuning.py
@@ -71,9 +71,15 @@ class TextDataset(Dataset):
                 text = f.read()
 
             tokenized_text = tokenizer.convert_tokens_to_ids(tokenizer.tokenize(text))
+
+            tokenized_text = tokenizer.add_special_tokens_single_sentence(tokenized_text)
             while len(tokenized_text) >= block_size:  # Truncate in block of block_size
-                self.examples.append(tokenized_text[:block_size])
-                tokenized_text = tokenized_text[block_size:]
+                if isinstance(tokenizer, (BertTokenizer, RobertaTokenizer)):
+                    self.examples.append(tokenizer.add_special_tokens_single_sentence(tokenized_text[:block_size - 2]))
+                    tokenized_text = tokenized_text[block_size - 2:]
+                else:
+                    self.examples.append(tokenized_text[:block_size])
+                    tokenized_text = tokenized_text[block_size:]
             # Note that we are loosing the last truncated example here for the sake of simplicity (no padding)
             # If your dataset is small, first you should loook for a bigger one :-) and second you
             # can change this behavior by adding (model specific) padding.

From aa05dc8935a3e5b349abecbdc5399796578fe965 Mon Sep 17 00:00:00 2001
From: thomwolf <thomwolf@gmail.com>
Date: Wed, 21 Aug 2019 02:29:34 +0200
Subject: [PATCH 135/200] adding gpt-2 large

---
 pytorch_transformers/convert_gpt2_checkpoint_to_pytorch.py  | 2 +-
 .../convert_openai_checkpoint_to_pytorch.py                 | 2 +-
 .../convert_transfo_xl_checkpoint_to_pytorch.py             | 2 +-
 pytorch_transformers/modeling_gpt2.py                       | 6 ++++--
 pytorch_transformers/tokenization_gpt2.py                   | 2 ++
 5 files changed, 9 insertions(+), 5 deletions(-)

diff --git a/pytorch_transformers/convert_gpt2_checkpoint_to_pytorch.py b/pytorch_transformers/convert_gpt2_checkpoint_to_pytorch.py
index f9e83f5d5b..e9bfa0302a 100755
--- a/pytorch_transformers/convert_gpt2_checkpoint_to_pytorch.py
+++ b/pytorch_transformers/convert_gpt2_checkpoint_to_pytorch.py
@@ -35,7 +35,7 @@ def convert_gpt2_checkpoint_to_pytorch(gpt2_checkpoint_path, gpt2_config_file, p
     if gpt2_config_file == "":
         config = GPT2Config()
     else:
-        config = GPT2Config(gpt2_config_file)
+        config = GPT2Config.from_json_file(gpt2_config_file)
     model = GPT2Model(config)
 
     # Load weights from numpy
diff --git a/pytorch_transformers/convert_openai_checkpoint_to_pytorch.py b/pytorch_transformers/convert_openai_checkpoint_to_pytorch.py
index 70895b4002..3009f8a99e 100755
--- a/pytorch_transformers/convert_openai_checkpoint_to_pytorch.py
+++ b/pytorch_transformers/convert_openai_checkpoint_to_pytorch.py
@@ -35,7 +35,7 @@ def convert_openai_checkpoint_to_pytorch(openai_checkpoint_folder_path, openai_c
     if openai_config_file == "":
         config = OpenAIGPTConfig()
     else:
-        config = OpenAIGPTConfig(openai_config_file)
+        config = OpenAIGPTConfig.from_json_file(openai_config_file)
     model = OpenAIGPTModel(config)
 
     # Load weights from numpy
diff --git a/pytorch_transformers/convert_transfo_xl_checkpoint_to_pytorch.py b/pytorch_transformers/convert_transfo_xl_checkpoint_to_pytorch.py
index 5733146444..7e79d58d7d 100755
--- a/pytorch_transformers/convert_transfo_xl_checkpoint_to_pytorch.py
+++ b/pytorch_transformers/convert_transfo_xl_checkpoint_to_pytorch.py
@@ -75,7 +75,7 @@ def convert_transfo_xl_checkpoint_to_pytorch(tf_checkpoint_path,
         if transfo_xl_config_file == "":
             config = TransfoXLConfig()
         else:
-            config = TransfoXLConfig(transfo_xl_config_file)
+            config = TransfoXLConfig.from_json_file(transfo_xl_config_file)
         print("Building PyTorch model from configuration: {}".format(str(config)))
         model = TransfoXLLMHeadModel(config)
 
diff --git a/pytorch_transformers/modeling_gpt2.py b/pytorch_transformers/modeling_gpt2.py
index cb4b8cc4ab..9022048d6d 100644
--- a/pytorch_transformers/modeling_gpt2.py
+++ b/pytorch_transformers/modeling_gpt2.py
@@ -38,9 +38,11 @@ from .modeling_bert import BertLayerNorm as LayerNorm
 logger = logging.getLogger(__name__)
 
 GPT2_PRETRAINED_MODEL_ARCHIVE_MAP = {"gpt2": "https://s3.amazonaws.com/models.huggingface.co/bert/gpt2-pytorch_model.bin",
-                                     "gpt2-medium": "https://s3.amazonaws.com/models.huggingface.co/bert/gpt2-medium-pytorch_model.bin"}
+                                     "gpt2-medium": "https://s3.amazonaws.com/models.huggingface.co/bert/gpt2-medium-pytorch_model.bin",
+                                     "gpt2-large": "https://s3.amazonaws.com/models.huggingface.co/bert/gpt2-large-pytorch_model.bin"}
 GPT2_PRETRAINED_CONFIG_ARCHIVE_MAP = {"gpt2": "https://s3.amazonaws.com/models.huggingface.co/bert/gpt2-config.json",
-                                      "gpt2-medium": "https://s3.amazonaws.com/models.huggingface.co/bert/gpt2-medium-config.json"}
+                                      "gpt2-medium": "https://s3.amazonaws.com/models.huggingface.co/bert/gpt2-medium-config.json",
+                                      "gpt2-large": "https://s3.amazonaws.com/models.huggingface.co/bert/gpt2-large-config.json"}
 
 def load_tf_weights_in_gpt2(model, config, gpt2_checkpoint_path):
     """ Load tf checkpoints in a pytorch model
diff --git a/pytorch_transformers/tokenization_gpt2.py b/pytorch_transformers/tokenization_gpt2.py
index 0aee856180..4016a85a7f 100644
--- a/pytorch_transformers/tokenization_gpt2.py
+++ b/pytorch_transformers/tokenization_gpt2.py
@@ -45,11 +45,13 @@ PRETRAINED_VOCAB_FILES_MAP = {
     {
         'gpt2': "https://s3.amazonaws.com/models.huggingface.co/bert/gpt2-vocab.json",
         'gpt2-medium': "https://s3.amazonaws.com/models.huggingface.co/bert/gpt2-medium-vocab.json",
+        'gpt2-large': "https://s3.amazonaws.com/models.huggingface.co/bert/gpt2-large-vocab.json",
     },
     'merges_file':
     {
         'gpt2': "https://s3.amazonaws.com/models.huggingface.co/bert/gpt2-merges.txt",
         'gpt2-medium': "https://s3.amazonaws.com/models.huggingface.co/bert/gpt2-medium-merges.txt",
+        'gpt2-large': "https://s3.amazonaws.com/models.huggingface.co/bert/gpt2-large-merges.txt",
     },
 }
 

From fdc487d8b33dcb8b2ddebd7a1fe4bd0eee4e2a40 Mon Sep 17 00:00:00 2001
From: thomwolf <thomwolf@gmail.com>
Date: Wed, 21 Aug 2019 02:35:01 +0200
Subject: [PATCH 136/200] Add max length

---
 pytorch_transformers/tokenization_gpt2.py | 1 +
 1 file changed, 1 insertion(+)

diff --git a/pytorch_transformers/tokenization_gpt2.py b/pytorch_transformers/tokenization_gpt2.py
index 4016a85a7f..e67f25ff59 100644
--- a/pytorch_transformers/tokenization_gpt2.py
+++ b/pytorch_transformers/tokenization_gpt2.py
@@ -58,6 +58,7 @@ PRETRAINED_VOCAB_FILES_MAP = {
 PRETRAINED_POSITIONAL_EMBEDDINGS_SIZES = {
     'gpt2': 1024,
     'gpt2-medium': 1024,
+    'gpt2-large': 1024,
 }
 
 @lru_cache()

From 6f877d9daf36788bad4fd228930939fed6ab12bd Mon Sep 17 00:00:00 2001
From: VictorSanh <victorsanh@gmail.com>
Date: Wed, 21 Aug 2019 03:43:29 +0000
Subject: [PATCH 137/200] Update dev results on GLUE (bert-base-uncased) w/
 median on 5 runs

---
 docs/source/examples.rst | 22 ++++++++++++----------
 1 file changed, 12 insertions(+), 10 deletions(-)

diff --git a/docs/source/examples.rst b/docs/source/examples.rst
index 51c8d850b9..7777117b47 100644
--- a/docs/source/examples.rst
+++ b/docs/source/examples.rst
@@ -68,7 +68,9 @@ GLUE results on dev set
 ~~~~~~~~~~~~~~~~~~~~~~~
 
 We get the following results on the dev set of GLUE benchmark with an uncased BERT base
-model. All experiments were run on a P100 GPU with a batch size of 32.
+model (`bert-base-uncased`). All experiments ran on 8 V100 GPUs with a total train batch size of 24. Some of 
+these tasks have a small dataset and training can lead to high variance in the results between different runs.
+We report the median on 5 runs (with different seeds) for each of the metrics.
 
 .. list-table::
    :header-rows: 1
@@ -78,31 +80,31 @@ model. All experiments were run on a P100 GPU with a batch size of 32.
      - Result
    * - CoLA
      - Matthew's corr.
-     - 57.29
+     - 55.75
    * - SST-2
      - accuracy
-     - 93.00
+     - 92.09
    * - MRPC
      - F1/accuracy
-     - 88.85/83.82
+     - 90.48/86.27
    * - STS-B
      - Pearson/Spearman corr.
-     - 89.70/89.37
+     - 89.03/88.64
    * - QQP
      - accuracy/F1
-     - 90.72/87.41
+     - 90.92/87.72
    * - MNLI
      - matched acc./mismatched acc.
-     - 83.95/84.39
+     - 83.74/84.06
    * - QNLI
      - accuracy
-     - 89.04
+     - 91.07
    * - RTE
      - accuracy
-     - 61.01
+     - 68.59
    * - WNLI
      - accuracy
-     - 53.52
+     - 43.66
 
 
 Some of these results are significantly different from the ones reported on the test set

From d6bbcbc4cf79f0d6da6d4753f4d128ff7e3e42a5 Mon Sep 17 00:00:00 2001
From: Lysandre <lysandre.debut@reseau.eseo.fr>
Date: Wed, 21 Aug 2019 11:22:05 -0400
Subject: [PATCH 138/200] Added finetuning example to documentation

---
 docs/source/examples.rst | 49 ++++++++++++++++++++++++++++++++++++----
 1 file changed, 45 insertions(+), 4 deletions(-)

diff --git a/docs/source/examples.rst b/docs/source/examples.rst
index 51c8d850b9..40e22725ce 100644
--- a/docs/source/examples.rst
+++ b/docs/source/examples.rst
@@ -12,8 +12,8 @@ Examples
      - How to use gradient-accumulation, multi-gpu training, distributed training, optimize on CPU and 16-bits training to train Bert models
    * - `Fine-tuning with BERT: running the examples <#fine-tuning-bert-examples>`_
      - Running the examples in `examples <https://github.com/huggingface/pytorch-pretrained-BERT/tree/master/examples>`_\ : ``extract_classif.py``\ , ``run_bert_classifier.py``\ , ``run_bert_squad.py`` and ``run_lm_finetuning.py``
-   * - `Fine-tuning with OpenAI GPT, Transformer-XL and GPT-2 <#fine-tuning>`_
-     - Running the examples in `examples <https://github.com/huggingface/pytorch-pretrained-BERT/tree/master/examples>`_\ : ``run_openai_gpt.py``\ , ``run_transfo_xl.py`` and ``run_gpt2.py``
+   * - `Fine-tuning with OpenAI GPT, Transformer-XL, GPT-2 as well as BERT and RoBERTa <#fine-tuning>`_
+     - Running the examples in `examples <https://github.com/huggingface/pytorch-pretrained-BERT/tree/master/examples>`_\ : ``run_openai_gpt.py``\ , ``run_transfo_xl.py``, ``run_gpt2.py`` and ``run_lm_finetuning.py``
    * - `Fine-tuning BERT-large on GPUs <#fine-tuning-bert-large>`_
      - How to fine tune ``BERT large``
 
@@ -393,12 +393,13 @@ Thank to the work of @Rocketknight1 and @tholor there are now **several scripts*
 OpenAI GPT, Transformer-XL and GPT-2: running the examples
 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
 
-We provide three examples of scripts for OpenAI GPT, Transformer-XL and OpenAI GPT-2 based on (and extended from) the respective original implementations:
+We provide three examples of scripts for OpenAI GPT, Transformer-XL, OpenAI GPT-2, BERT and RoBERTa based on (and extended from) the respective original implementations:
 
 
 * fine-tuning OpenAI GPT on the ROCStories dataset
 * evaluating Transformer-XL on Wikitext 103
 * unconditional and conditional generation from a pre-trained OpenAI GPT-2 model
+* fine-tuning GPT/GPT-2 on a causal language modeling task and BERT/RoBERTa on a masked language modeling task
 
 Fine-tuning OpenAI GPT on the RocStories dataset
 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
@@ -452,7 +453,47 @@ Unconditional generation:
 
    python run_gpt2.py --unconditional
 
-The same option as in the original scripts are provided, please refere to the code of the example and the original repository of OpenAI.
+The same option as in the original scripts are provided, please refer to the code of the example and the original repository of OpenAI.
+
+
+Causal LM fine-tuning on GPT/GPT-2, Masked LM fine-tuning on BERT/RoBERTa
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+Before running the following examples you should download the `WikiText-2 dataset <https://blog.einstein.ai/the-wikitext-long-term-dependency-language-modeling-dataset/>`__ and unpack it to some directory `$WIKITEXT_2_DATASET`
+The following results were obtained using the `raw` WikiText-2 (no tokens were replaced before the tokenization).
+
+This example fine-tunes GPT-2 on the WikiText-2 dataset. The loss function is a causal language modeling loss (perplexity).
+
+.. code-block:: bash
+    export WIKITEXT_2_DATASET=/path/to/wikitext_dataset
+
+    python run_lm_finetuning.py
+        --output_dir=output
+        --model_type=gpt2
+        --model_name_or_path=gpt2
+        --do_train
+        --train_data_file=$WIKITEXT_2_DATASET/wiki.train.raw
+        --do_eval
+        --eval_data_file=$WIKITEXT_2_DATASET/wiki.test.raw
+
+This takes about half an hour to train on a single K80 GPU and about one minute for the evaluation to run.
+It reaches a score of about 20 perplexity once fine-tuned on the dataset.
+
+This example fine-tunes RoBERTa on the WikiText-2 dataset. The loss function is a masked language modeling loss (masked perplexity).
+The `--mlm` flag is necessary to fine-tune BERT/RoBERTa on masked language modeling.
+
+.. code-block:: bash
+    export WIKITEXT_2_DATASET=/path/to/wikitext_dataset
+
+    python run_lm_finetuning.py
+        --output_dir=output
+        --model_type=roberta
+        --model_name_or_path=roberta-base
+        --do_train
+        --train_data_file=$WIKITEXT_2_DATASET/wiki.train.raw
+        --do_eval
+        --eval_data_file=$WIKITEXT_2_DATASET/wiki.test.raw
+        --mlm
 
 .. _fine-tuning-BERT-large:
 

From 2f9397139d1be373efa76b8133d71e1bdc43bbb3 Mon Sep 17 00:00:00 2001
From: Lysandre <lysandre.debut@reseau.eseo.fr>
Date: Wed, 21 Aug 2019 11:29:37 -0400
Subject: [PATCH 139/200] Added GPT-2 LARGE to Pre-trained Models documentation

---
 docs/source/pretrained_models.rst | 13 ++++++++-----
 1 file changed, 8 insertions(+), 5 deletions(-)

diff --git a/docs/source/pretrained_models.rst b/docs/source/pretrained_models.rst
index 6a14e3dcd1..7df70ea225 100644
--- a/docs/source/pretrained_models.rst
+++ b/docs/source/pretrained_models.rst
@@ -62,6 +62,9 @@ Here is the full list of the currently provided pretrained models together with
 |                   +------------------------------------------------------------+---------------------------------------------------------------------------------------------------------------------------------------+
 |                   | ``gpt2-medium``                                            | | 24-layer, 1024-hidden, 16-heads, 345M parameters.                                                                                   |
 |                   |                                                            | | OpenAI's Medium-sized GPT-2 English model                                                                                           |
+|                   +------------------------------------------------------------+---------------------------------------------------------------------------------------------------------------------------------------+
+|                   | ``gpt2-large``                                             | | 36-layer, 1280-hidden, 20-heads, 774M parameters.                                                                                   |
+|                   |                                                            | | OpenAI's Large-sized GPT-2 English model                                                                                            |
 +-------------------+------------------------------------------------------------+---------------------------------------------------------------------------------------------------------------------------------------+
 | Transformer-XL    | ``transfo-xl-wt103``                                       | | 18-layer, 1024-hidden, 16-heads, 257M parameters.                                                                                   |
 |                   |                                                            | | English model trained on wikitext-103                                                                                               |
@@ -72,16 +75,16 @@ Here is the full list of the currently provided pretrained models together with
 |                   | ``xlnet-large-cased``                                      | | 24-layer, 1024-hidden, 16-heads, 340M parameters.                                                                                   |
 |                   |                                                            | | XLNet Large English model                                                                                                           |
 +-------------------+------------------------------------------------------------+---------------------------------------------------------------------------------------------------------------------------------------+
-| XLM               | ``xlm-mlm-en-2048``                                        | | 12-layer, 2048-hidden, 16-heads                                                                                                      |
+| XLM               | ``xlm-mlm-en-2048``                                        | | 12-layer, 2048-hidden, 16-heads                                                                                                     |
 |                   |                                                            | | XLM English model                                                                                                                   |
 |                   +------------------------------------------------------------+---------------------------------------------------------------------------------------------------------------------------------------+
-|                   | ``xlm-mlm-ende-1024``                                      | | 6-layer, 1024-hidden, 8-heads                                                                                                      |
+|                   | ``xlm-mlm-ende-1024``                                      | | 6-layer, 1024-hidden, 8-heads                                                                                                       |
 |                   |                                                            | | XLM English-German Multi-language model                                                                                             |
 |                   +------------------------------------------------------------+---------------------------------------------------------------------------------------------------------------------------------------+
-|                   | ``xlm-mlm-enfr-1024``                                      | | 6-layer, 1024-hidden, 8-heads                                                                                                      |
+|                   | ``xlm-mlm-enfr-1024``                                      | | 6-layer, 1024-hidden, 8-heads                                                                                                       |
 |                   |                                                            | | XLM English-French Multi-language model                                                                                             |
 |                   +------------------------------------------------------------+---------------------------------------------------------------------------------------------------------------------------------------+
-|                   | ``xlm-mlm-enro-1024``                                      | | 6-layer, 1024-hidden, 8-heads                                                                                                      |
+|                   | ``xlm-mlm-enro-1024``                                      | | 6-layer, 1024-hidden, 8-heads                                                                                                       |
 |                   |                                                            | | XLM English-Romanian Multi-language model                                                                                           |
 |                   +------------------------------------------------------------+---------------------------------------------------------------------------------------------------------------------------------------+
 |                   | ``xlm-mlm-xnli15-1024``                                    | | 12-layer, 1024-hidden, 8-heads                                                                                                      |
@@ -93,7 +96,7 @@ Here is the full list of the currently provided pretrained models together with
 |                   | ``xlm-clm-enfr-1024``                                      | | 12-layer, 1024-hidden, 8-heads                                                                                                      |
 |                   |                                                            | | XLM English model trained with CLM (Causal Language Modeling)                                                                       |
 |                   +------------------------------------------------------------+---------------------------------------------------------------------------------------------------------------------------------------+
-|                   | ``xlm-clm-ende-1024``                                      | | 6-layer, 1024-hidden, 8-heads                                                                                                      |
+|                   | ``xlm-clm-ende-1024``                                      | | 6-layer, 1024-hidden, 8-heads                                                                                                       |
 |                   |                                                            | | XLM English-German Multi-language model trained with CLM (Causal Language Modeling)                                                 |
 +-------------------+------------------------------------------------------------+---------------------------------------------------------------------------------------------------------------------------------------+
 | RoBERTa           | ``roberta-base``                                           | | 12-layer, 768-hidden, 12-heads, 125M parameters                                                                                     |

From e00b4ff1de0591d5093407b16e665e5c86028f04 Mon Sep 17 00:00:00 2001
From: thomwolf <thomwolf@gmail.com>
Date: Wed, 21 Aug 2019 22:22:17 +0200
Subject: [PATCH 140/200] fix #1017

---
 README.md | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/README.md b/README.md
index 4e57de5842..9751c720b8 100644
--- a/README.md
+++ b/README.md
@@ -393,8 +393,8 @@ for batch in train_data:
     loss = model(batch)
     loss.backward()
     torch.nn.utils.clip_grad_norm_(model.parameters(), max_grad_norm)  # Gradient clipping is not in AdamW anymore (so you can use amp without issue)
-    scheduler.step()
     optimizer.step()
+    scheduler.step()
     optimizer.zero_grad()
 ```
 

From 296df2b18c86464c640267df877c97a85324ce92 Mon Sep 17 00:00:00 2001
From: Abhishek Rao <arao@microsoft.com>
Date: Wed, 21 Aug 2019 15:29:30 -0700
Subject: [PATCH 141/200] reraise exception

---
 pytorch_transformers/modeling_utils.py | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/pytorch_transformers/modeling_utils.py b/pytorch_transformers/modeling_utils.py
index f1501aa8d5..8ad0f672df 100644
--- a/pytorch_transformers/modeling_utils.py
+++ b/pytorch_transformers/modeling_utils.py
@@ -473,7 +473,7 @@ class PreTrainedModel(nn.Module):
         # redirect to the cache, if necessary
         try:
             resolved_archive_file = cached_path(archive_file, cache_dir=cache_dir, force_download=force_download, proxies=proxies)
-        except EnvironmentError:
+        except EnvironmentError as e:
             if pretrained_model_name_or_path in cls.pretrained_model_archive_map:
                 logger.error(
                     "Couldn't reach server at '{}' to download pretrained weights.".format(
@@ -486,7 +486,7 @@ class PreTrainedModel(nn.Module):
                         pretrained_model_name_or_path,
                         ', '.join(cls.pretrained_model_archive_map.keys()),
                         archive_file))
-            return None
+            raise e
         if resolved_archive_file == archive_file:
             logger.info("loading weights file {}".format(archive_file))
         else:

From 14eef67eb227bc45f1a7d183dda35f7dad5e34ef Mon Sep 17 00:00:00 2001
From: Abhishek Rao <arao@microsoft.com>
Date: Wed, 21 Aug 2019 15:48:43 -0700
Subject: [PATCH 142/200] Fix at config rather than model

---
 pytorch_transformers/modeling_utils.py | 8 ++++----
 1 file changed, 4 insertions(+), 4 deletions(-)

diff --git a/pytorch_transformers/modeling_utils.py b/pytorch_transformers/modeling_utils.py
index 8ad0f672df..5066c42595 100644
--- a/pytorch_transformers/modeling_utils.py
+++ b/pytorch_transformers/modeling_utils.py
@@ -166,7 +166,7 @@ class PretrainedConfig(object):
         # redirect to the cache, if necessary
         try:
             resolved_config_file = cached_path(config_file, cache_dir=cache_dir, force_download=force_download, proxies=proxies)
-        except EnvironmentError:
+        except EnvironmentError as e:
             if pretrained_model_name_or_path in cls.pretrained_config_archive_map:
                 logger.error(
                     "Couldn't reach server at '{}' to download pretrained model configuration file.".format(
@@ -179,7 +179,7 @@ class PretrainedConfig(object):
                         pretrained_model_name_or_path,
                         ', '.join(cls.pretrained_config_archive_map.keys()),
                         config_file))
-            return None
+            raise e
         if resolved_config_file == config_file:
             logger.info("loading configuration file {}".format(config_file))
         else:
@@ -473,7 +473,7 @@ class PreTrainedModel(nn.Module):
         # redirect to the cache, if necessary
         try:
             resolved_archive_file = cached_path(archive_file, cache_dir=cache_dir, force_download=force_download, proxies=proxies)
-        except EnvironmentError as e:
+        except EnvironmentError:
             if pretrained_model_name_or_path in cls.pretrained_model_archive_map:
                 logger.error(
                     "Couldn't reach server at '{}' to download pretrained weights.".format(
@@ -486,7 +486,7 @@ class PreTrainedModel(nn.Module):
                         pretrained_model_name_or_path,
                         ', '.join(cls.pretrained_model_archive_map.keys()),
                         archive_file))
-            raise e
+            return None
         if resolved_archive_file == archive_file:
             logger.info("loading weights file {}".format(archive_file))
         else:

From b006a7a12f553b374bdf781a8206e62e96d6d144 Mon Sep 17 00:00:00 2001
From: VictorSanh <victorsanh@gmail.com>
Date: Thu, 22 Aug 2019 00:25:42 -0400
Subject: [PATCH 143/200] fix for squad

---
 examples/run_squad.py | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/examples/run_squad.py b/examples/run_squad.py
index c0586b03bd..25e2c4093f 100644
--- a/examples/run_squad.py
+++ b/examples/run_squad.py
@@ -272,7 +272,7 @@ def evaluate(args, model, tokenizer, prefix=""):
 
 
 def load_and_cache_examples(args, tokenizer, evaluate=False, output_examples=False):
-    if args.local_rank not in [-1, 0]:
+    if args.local_rank not in [-1, 0] and not evaluate:
         torch.distributed.barrier()  # Make sure only the first process in distributed training process the dataset, and the others will use the cache
 
     # Load data features from cache or dataset file
@@ -299,7 +299,7 @@ def load_and_cache_examples(args, tokenizer, evaluate=False, output_examples=Fal
             logger.info("Saving features into cached file %s", cached_features_file)
             torch.save(features, cached_features_file)
 
-    if args.local_rank == 0:
+    if args.local_rank == 0 and not evaluate:
         torch.distributed.barrier()  # Make sure only the first process in distributed training process the dataset, and the others will use the cache
 
     # Convert to Tensors and build dataset

From 57272d5ddf222bd1a20b7b16e693e69c74e56ea6 Mon Sep 17 00:00:00 2001
From: VictorSanh <victorsanh@gmail.com>
Date: Thu, 22 Aug 2019 00:25:49 -0400
Subject: [PATCH 144/200] fix for glue

---
 examples/run_glue.py | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/examples/run_glue.py b/examples/run_glue.py
index 1729f4f7e3..53b46fc102 100644
--- a/examples/run_glue.py
+++ b/examples/run_glue.py
@@ -251,7 +251,7 @@ def evaluate(args, model, tokenizer, prefix=""):
 
 
 def load_and_cache_examples(args, task, tokenizer, evaluate=False):
-    if args.local_rank not in [-1, 0]:
+    if args.local_rank not in [-1, 0] and not evaluate:
         torch.distributed.barrier()  # Make sure only the first process in distributed training process the dataset, and the others will use the cache
 
     processor = processors[task]()
@@ -286,7 +286,7 @@ def load_and_cache_examples(args, task, tokenizer, evaluate=False):
             logger.info("Saving features into cached file %s", cached_features_file)
             torch.save(features, cached_features_file)
 
-    if args.local_rank == 0:
+    if args.local_rank == 0 and not evaluate:
         torch.distributed.barrier()  # Make sure only the first process in distributed training process the dataset, and the others will use the cache
 
     # Convert to Tensors and build dataset

From 2ba1a14fb0586b9ce61769a8341ecfbfbc8a1507 Mon Sep 17 00:00:00 2001
From: LysandreJik <lysandre.debut@reseau.eseo.fr>
Date: Thu, 22 Aug 2019 17:25:55 -0400
Subject: [PATCH 145/200] Decode now calls private property instead of public
 method

---
 pytorch_transformers/tokenization_utils.py | 6 +++---
 1 file changed, 3 insertions(+), 3 deletions(-)

diff --git a/pytorch_transformers/tokenization_utils.py b/pytorch_transformers/tokenization_utils.py
index d2855e0922..d4cbd85d67 100644
--- a/pytorch_transformers/tokenization_utils.py
+++ b/pytorch_transformers/tokenization_utils.py
@@ -636,9 +636,9 @@ class PreTrainedTokenizer(object):
         filtered_tokens = self.convert_ids_to_tokens(token_ids, skip_special_tokens=skip_special_tokens)
         text = self.convert_tokens_to_string(filtered_tokens)
 
-        if self.sep_token is not None and self.sep_token in text:
-            text = text.replace(self.cls_token, self.sep_token)
-            split_text = list(filter(lambda sentence: len(sentence) > 0, text.split(self.sep_token)))
+        if self._sep_token is not None and self._sep_token in text:
+            text = text.replace(self._cls_token, self._sep_token)
+            split_text = list(filter(lambda sentence: len(sentence) > 0, text.split(self._sep_token)))
             if clean_up_tokenization_spaces:
                 clean_text = [self.clean_up_tokenization(text) for text in split_text]
                 return clean_text

From c603d099aa24410ec5a60c23794cc4a293d92850 Mon Sep 17 00:00:00 2001
From: Abhishek Rao <arao@microsoft.com>
Date: Thu, 22 Aug 2019 15:25:40 -0700
Subject: [PATCH 146/200] reraise EnvironmentError in from_pretrained functions
 of Model and Tokenizer

---
 pytorch_transformers/modeling_utils.py     | 4 ++--
 pytorch_transformers/tokenization_utils.py | 4 ++--
 2 files changed, 4 insertions(+), 4 deletions(-)

diff --git a/pytorch_transformers/modeling_utils.py b/pytorch_transformers/modeling_utils.py
index 5066c42595..468d240fbc 100644
--- a/pytorch_transformers/modeling_utils.py
+++ b/pytorch_transformers/modeling_utils.py
@@ -473,7 +473,7 @@ class PreTrainedModel(nn.Module):
         # redirect to the cache, if necessary
         try:
             resolved_archive_file = cached_path(archive_file, cache_dir=cache_dir, force_download=force_download, proxies=proxies)
-        except EnvironmentError:
+        except EnvironmentError as e:
             if pretrained_model_name_or_path in cls.pretrained_model_archive_map:
                 logger.error(
                     "Couldn't reach server at '{}' to download pretrained weights.".format(
@@ -486,7 +486,7 @@ class PreTrainedModel(nn.Module):
                         pretrained_model_name_or_path,
                         ', '.join(cls.pretrained_model_archive_map.keys()),
                         archive_file))
-            return None
+            raise e
         if resolved_archive_file == archive_file:
             logger.info("loading weights file {}".format(archive_file))
         else:
diff --git a/pytorch_transformers/tokenization_utils.py b/pytorch_transformers/tokenization_utils.py
index d2855e0922..4fef0e34fb 100644
--- a/pytorch_transformers/tokenization_utils.py
+++ b/pytorch_transformers/tokenization_utils.py
@@ -293,7 +293,7 @@ class PreTrainedTokenizer(object):
                     resolved_vocab_files[file_id] = None
                 else:
                     resolved_vocab_files[file_id] = cached_path(file_path, cache_dir=cache_dir, force_download=force_download, proxies=proxies)
-        except EnvironmentError:
+        except EnvironmentError as e:
             if pretrained_model_name_or_path in s3_models:
                 logger.error("Couldn't reach server to download vocabulary.")
             else:
@@ -303,7 +303,7 @@ class PreTrainedTokenizer(object):
                     "at this path or url.".format(
                         pretrained_model_name_or_path, ', '.join(s3_models),
                         pretrained_model_name_or_path, str(vocab_files.keys())))
-            return None
+            raise e
 
         for file_id, file_path in vocab_files.items():
             if file_path == resolved_vocab_files[file_id]:

From e13465fb8bbabe3bbd528761818403aa5d2e128e Mon Sep 17 00:00:00 2001
From: David Pollack <david@i2x.ai>
Date: Fri, 23 Aug 2019 12:12:12 +0200
Subject: [PATCH 147/200] change layernorm code to pytorch's native layer norm

---
 pytorch_transformers/modeling_bert.py | 15 +--------------
 1 file changed, 1 insertion(+), 14 deletions(-)

diff --git a/pytorch_transformers/modeling_bert.py b/pytorch_transformers/modeling_bert.py
index 7b34b3fd90..8bf281feb9 100644
--- a/pytorch_transformers/modeling_bert.py
+++ b/pytorch_transformers/modeling_bert.py
@@ -224,20 +224,7 @@ try:
     from apex.normalization.fused_layer_norm import FusedLayerNorm as BertLayerNorm
 except (ImportError, AttributeError) as e:
     logger.info("Better speed can be achieved with apex installed from https://www.github.com/nvidia/apex .")
-    class BertLayerNorm(nn.Module):
-        def __init__(self, hidden_size, eps=1e-12):
-            """Construct a layernorm module in the TF style (epsilon inside the square root).
-            """
-            super(BertLayerNorm, self).__init__()
-            self.weight = nn.Parameter(torch.ones(hidden_size))
-            self.bias = nn.Parameter(torch.zeros(hidden_size))
-            self.variance_epsilon = eps
-
-        def forward(self, x):
-            u = x.mean(-1, keepdim=True)
-            s = (x - u).pow(2).mean(-1, keepdim=True)
-            x = (x - u) / torch.sqrt(s + self.variance_epsilon)
-            return self.weight * x + self.bias
+    BertLayerNorm = torch.nn.LayerNorm
 
 class BertEmbeddings(nn.Module):
     """Construct the embeddings from word, position and token_type embeddings.

From 47d6853439318f1be596219e270bee4e3819dfbb Mon Sep 17 00:00:00 2001
From: thomwolf <thomwolf@gmail.com>
Date: Fri, 23 Aug 2019 17:31:11 +0200
Subject: [PATCH 148/200] adding max_lengths for single sentences and sentences
 pairs

---
 pytorch_transformers/tokenization_bert.py    | 8 ++++++++
 pytorch_transformers/tokenization_roberta.py | 8 ++++++++
 pytorch_transformers/tokenization_utils.py   | 8 ++++++++
 pytorch_transformers/tokenization_xlm.py     | 8 ++++++++
 pytorch_transformers/tokenization_xlnet.py   | 8 ++++++++
 5 files changed, 40 insertions(+)

diff --git a/pytorch_transformers/tokenization_bert.py b/pytorch_transformers/tokenization_bert.py
index 04f35aa466..8ea71ba92b 100644
--- a/pytorch_transformers/tokenization_bert.py
+++ b/pytorch_transformers/tokenization_bert.py
@@ -139,6 +139,14 @@ class BertTokenizer(PreTrainedTokenizer):
                                                   tokenize_chinese_chars=tokenize_chinese_chars)
         self.wordpiece_tokenizer = WordpieceTokenizer(vocab=self.vocab, unk_token=self.unk_token)
 
+    @property
+    def max_len_single_sentence(self):
+        return self.max_len - 2  # take into account special tokens
+
+    @property
+    def max_len_sentences_pair(self):
+        return self.max_len - 3  # take into account special tokens
+
     @property
     def vocab_size(self):
         return len(self.vocab)
diff --git a/pytorch_transformers/tokenization_roberta.py b/pytorch_transformers/tokenization_roberta.py
index edf4717c89..44047e636f 100644
--- a/pytorch_transformers/tokenization_roberta.py
+++ b/pytorch_transformers/tokenization_roberta.py
@@ -160,6 +160,14 @@ class RobertaTokenizer(PreTrainedTokenizer):
         text = bytearray([self.byte_decoder[c] for c in text]).decode('utf-8', errors=self.errors)
         return text
 
+    @property
+    def max_len_single_sentence(self):
+        return self.max_len - 2  # take into account special tokens
+
+    @property
+    def max_len_sentences_pair(self):
+        return self.max_len - 4  # take into account special tokens
+
     def add_special_tokens_single_sentence(self, token_ids):
         """
         Adds special tokens to a sequence for sequence classification tasks.
diff --git a/pytorch_transformers/tokenization_utils.py b/pytorch_transformers/tokenization_utils.py
index d2855e0922..a128c3fd72 100644
--- a/pytorch_transformers/tokenization_utils.py
+++ b/pytorch_transformers/tokenization_utils.py
@@ -67,6 +67,14 @@ class PreTrainedTokenizer(object):
                                  "pad_token", "cls_token", "mask_token",
                                  "additional_special_tokens"]
 
+    @property
+    def max_len_single_sentence(self):
+        return self.max_len  # Default to max_len but can be smaller in specific tokenizers to take into account special tokens
+
+    @property
+    def max_len_sentences_pair(self):
+        return self.max_len  # Default to max_len but can be smaller in specific tokenizers to take into account special tokens
+
     @property
     def bos_token(self):
         """ Beginning of sentence token (string). Log an error if used while not having been set. """
diff --git a/pytorch_transformers/tokenization_xlm.py b/pytorch_transformers/tokenization_xlm.py
index 2d2f3a8cd4..b544923e35 100644
--- a/pytorch_transformers/tokenization_xlm.py
+++ b/pytorch_transformers/tokenization_xlm.py
@@ -215,6 +215,14 @@ class XLMTokenizer(PreTrainedTokenizer):
         out_string = ''.join(tokens).replace('</w>', ' ').strip()
         return out_string
 
+    @property
+    def max_len_single_sentence(self):
+        return self.max_len - 2  # take into account special tokens
+
+    @property
+    def max_len_sentences_pair(self):
+        return self.max_len - 3  # take into account special tokens
+
     def add_special_tokens_single_sentence(self, token_ids):
         """
         Adds special tokens to a sequence for sequence classification tasks.
diff --git a/pytorch_transformers/tokenization_xlnet.py b/pytorch_transformers/tokenization_xlnet.py
index 371b3c9407..a282d67904 100644
--- a/pytorch_transformers/tokenization_xlnet.py
+++ b/pytorch_transformers/tokenization_xlnet.py
@@ -177,6 +177,14 @@ class XLNetTokenizer(PreTrainedTokenizer):
         out_string = ''.join(tokens).replace(SPIECE_UNDERLINE, ' ').strip()
         return out_string
 
+    @property
+    def max_len_single_sentence(self):
+        return self.max_len - 2  # take into account special tokens
+
+    @property
+    def max_len_sentences_pair(self):
+        return self.max_len - 3  # take into account special tokens
+
     def add_special_tokens_single_sentence(self, token_ids):
         """
         Adds special tokens to a sequence pair for sequence classification tasks.

From ab7bd5ef98c797132ab5c3378599b3eeec9041d9 Mon Sep 17 00:00:00 2001
From: thomwolf <thomwolf@gmail.com>
Date: Fri, 23 Aug 2019 17:31:21 +0200
Subject: [PATCH 149/200] fixing tokenization and training

---
 examples/run_lm_finetuning.py | 24 ++++++++++--------------
 1 file changed, 10 insertions(+), 14 deletions(-)

diff --git a/examples/run_lm_finetuning.py b/examples/run_lm_finetuning.py
index c69d4db53b..015f742299 100644
--- a/examples/run_lm_finetuning.py
+++ b/examples/run_lm_finetuning.py
@@ -30,7 +30,7 @@ import random
 
 import numpy as np
 import torch
-from torch.utils.data import DataLoader, Dataset, SequentialSampler
+from torch.utils.data import DataLoader, Dataset, SequentialSampler, RandomSampler
 from torch.utils.data.distributed import DistributedSampler
 from tensorboardX import SummaryWriter
 from tqdm import tqdm, trange
@@ -72,14 +72,9 @@ class TextDataset(Dataset):
 
             tokenized_text = tokenizer.convert_tokens_to_ids(tokenizer.tokenize(text))
 
-            tokenized_text = tokenizer.add_special_tokens_single_sentence(tokenized_text)
             while len(tokenized_text) >= block_size:  # Truncate in block of block_size
-                if isinstance(tokenizer, (BertTokenizer, RobertaTokenizer)):
-                    self.examples.append(tokenizer.add_special_tokens_single_sentence(tokenized_text[:block_size - 2]))
-                    tokenized_text = tokenized_text[block_size - 2:]
-                else:
-                    self.examples.append(tokenized_text[:block_size])
-                    tokenized_text = tokenized_text[block_size:]
+                self.examples.append(tokenizer.add_special_tokens_single_sentence(tokenized_text[:block_size]))
+                tokenized_text = tokenized_text[block_size:]
             # Note that we are loosing the last truncated example here for the sake of simplicity (no padding)
             # If your dataset is small, first you should loook for a bigger one :-) and second you
             # can change this behavior by adding (model specific) padding.
@@ -112,15 +107,15 @@ def mask_tokens(inputs, tokenizer, args):
     """ Prepare masked tokens inputs/labels for masked language modeling: 80% MASK, 10% random, 10% original. """
     labels = inputs.clone()
     # We sample a few tokens in each sequence for masked-LM training (with probability args.mlm_probability defaults to 0.15 in Bert/RoBERTa)
-    masked_indices = torch.bernoulli(torch.full(labels.shape, args.mlm_probability)).byte()
+    masked_indices = torch.bernoulli(torch.full(labels.shape, args.mlm_probability)).bool()
     labels[~masked_indices] = -1  # We only compute loss on masked tokens
 
     # 80% of the time, we replace masked input tokens with tokenizer.mask_token ([MASK])
-    indices_replaced = torch.bernoulli(torch.full(labels.shape, 0.8)).byte() & masked_indices
+    indices_replaced = torch.bernoulli(torch.full(labels.shape, 0.8)).bool() & masked_indices
     inputs[indices_replaced] = tokenizer.convert_tokens_to_ids(tokenizer.mask_token)
 
     # 10% of the time, we replace masked input tokens with random word
-    indices_random = torch.bernoulli(torch.full(labels.shape, 0.5)).byte() & masked_indices & ~indices_replaced
+    indices_random = torch.bernoulli(torch.full(labels.shape, 0.5)).bool() & masked_indices & ~indices_replaced
     random_words = torch.randint(len(tokenizer), labels.shape, dtype=torch.long)
     inputs[indices_random] = random_words[indices_random]
 
@@ -134,7 +129,7 @@ def train(args, train_dataset, model, tokenizer):
         tb_writer = SummaryWriter()
 
     args.train_batch_size = args.per_gpu_train_batch_size * max(1, args.n_gpu)
-    train_sampler = SequentialSampler(train_dataset) if args.local_rank == -1 else DistributedSampler(train_dataset)
+    train_sampler = RandomSampler(train_dataset) if args.local_rank == -1 else DistributedSampler(train_dataset)
     train_dataloader = DataLoader(train_dataset, sampler=train_sampler, batch_size=args.train_batch_size)
 
     if args.max_steps > 0:
@@ -329,7 +324,7 @@ def main():
     parser.add_argument("--block_size", default=-1, type=int,
                         help="Optional input sequence length after tokenization."
                              "The training dataset will be truncated in block of this size for training."
-                             "Default to the model max input length.")
+                             "Default to the model max input length fo single sentences inputs (take into account special tokens).")
     parser.add_argument("--do_train", action='store_true',
                         help="Whether to run training.")
     parser.add_argument("--do_eval", action='store_true',
@@ -433,7 +428,8 @@ def main():
     config = config_class.from_pretrained(args.config_name if args.config_name else args.model_name_or_path)
     tokenizer = tokenizer_class.from_pretrained(args.tokenizer_name if args.tokenizer_name else args.model_name_or_path, do_lower_case=args.do_lower_case)
     if args.block_size <= 0:
-        args.block_size = tokenizer.max_len  # Our input block size will be the max possible for the model
+        args.block_size = tokenizer.max_len_single_sentence  # Our input block size will be the max possible for the model
+    args.block_size = min(args.block_size, tokenizer.max_len_single_sentence)
     model = model_class.from_pretrained(args.model_name_or_path, from_tf=bool('.ckpt' in args.model_name_or_path), config=config)
     model.to(args.device)
 

From 3bcbebd440c220adbaab657f2d13dac7c89f6453 Mon Sep 17 00:00:00 2001
From: thomwolf <thomwolf@gmail.com>
Date: Fri, 23 Aug 2019 22:07:26 +0200
Subject: [PATCH 150/200] max_len_single_sentence & max_len_sentences_pair as
 attributes so they can be modified

---
 pytorch_transformers/tokenization_bert.py       | 11 +++--------
 pytorch_transformers/tokenization_gpt2.py       |  2 ++
 pytorch_transformers/tokenization_openai.py     |  3 +++
 pytorch_transformers/tokenization_roberta.py    | 11 +++--------
 pytorch_transformers/tokenization_transfo_xl.py |  4 ++++
 pytorch_transformers/tokenization_utils.py      | 11 +++--------
 pytorch_transformers/tokenization_xlm.py        | 12 ++++--------
 pytorch_transformers/tokenization_xlnet.py      | 12 ++++--------
 8 files changed, 26 insertions(+), 40 deletions(-)

diff --git a/pytorch_transformers/tokenization_bert.py b/pytorch_transformers/tokenization_bert.py
index 8ea71ba92b..92f027038d 100644
--- a/pytorch_transformers/tokenization_bert.py
+++ b/pytorch_transformers/tokenization_bert.py
@@ -125,6 +125,9 @@ class BertTokenizer(PreTrainedTokenizer):
         super(BertTokenizer, self).__init__(unk_token=unk_token, sep_token=sep_token,
                                             pad_token=pad_token, cls_token=cls_token,
                                             mask_token=mask_token, **kwargs)
+        self.max_len_single_sentence = self.max_len - 2  # take into account special tokens
+        self.max_len_sentences_pair = self.max_len - 3  # take into account special tokens
+
         if not os.path.isfile(vocab_file):
             raise ValueError(
                 "Can't find a vocabulary file at path '{}'. To load the vocabulary from a Google pretrained "
@@ -139,14 +142,6 @@ class BertTokenizer(PreTrainedTokenizer):
                                                   tokenize_chinese_chars=tokenize_chinese_chars)
         self.wordpiece_tokenizer = WordpieceTokenizer(vocab=self.vocab, unk_token=self.unk_token)
 
-    @property
-    def max_len_single_sentence(self):
-        return self.max_len - 2  # take into account special tokens
-
-    @property
-    def max_len_sentences_pair(self):
-        return self.max_len - 3  # take into account special tokens
-
     @property
     def vocab_size(self):
         return len(self.vocab)
diff --git a/pytorch_transformers/tokenization_gpt2.py b/pytorch_transformers/tokenization_gpt2.py
index e67f25ff59..13806a3708 100644
--- a/pytorch_transformers/tokenization_gpt2.py
+++ b/pytorch_transformers/tokenization_gpt2.py
@@ -108,6 +108,8 @@ class GPT2Tokenizer(PreTrainedTokenizer):
     def __init__(self, vocab_file, merges_file, errors='replace', unk_token="<|endoftext|>",
                  bos_token="<|endoftext|>", eos_token="<|endoftext|>", **kwargs):
         super(GPT2Tokenizer, self).__init__(bos_token=bos_token, eos_token=eos_token, unk_token=unk_token, **kwargs)
+        self.max_len_single_sentence = self.max_len # no default special tokens - you can update this value if you add special tokens
+        self.max_len_sentences_pair = self.max_len # no default special tokens - you can update this value if you add special tokens
 
         self.encoder = json.load(open(vocab_file))
         self.decoder = {v:k for k,v in self.encoder.items()}
diff --git a/pytorch_transformers/tokenization_openai.py b/pytorch_transformers/tokenization_openai.py
index 51b418ebd3..0efbdb37c0 100644
--- a/pytorch_transformers/tokenization_openai.py
+++ b/pytorch_transformers/tokenization_openai.py
@@ -87,6 +87,9 @@ class OpenAIGPTTokenizer(PreTrainedTokenizer):
     def __init__(self, vocab_file, merges_file, unk_token="<unk>", **kwargs):
         super(OpenAIGPTTokenizer, self).__init__(unk_token=unk_token, **kwargs)
 
+        self.max_len_single_sentence = self.max_len # no default special tokens - you can update this value if you add special tokens
+        self.max_len_sentences_pair = self.max_len # no default special tokens - you can update this value if you add special tokens
+
         try:
             import ftfy
             from spacy.lang.en import English
diff --git a/pytorch_transformers/tokenization_roberta.py b/pytorch_transformers/tokenization_roberta.py
index 44047e636f..e8ab29238e 100644
--- a/pytorch_transformers/tokenization_roberta.py
+++ b/pytorch_transformers/tokenization_roberta.py
@@ -77,6 +77,9 @@ class RobertaTokenizer(PreTrainedTokenizer):
                                                sep_token=sep_token, cls_token=cls_token, pad_token=pad_token,
                                                mask_token=mask_token, **kwargs)
 
+        self.max_len_single_sentence = self.max_len - 2  # take into account special tokens
+        self.max_len_sentences_pair = self.max_len - 4  # take into account special tokens
+
         self.encoder = json.load(open(vocab_file, encoding="utf-8"))
         self.decoder = {v: k for k, v in self.encoder.items()}
         self.errors = errors  # how to handle errors in decoding
@@ -160,14 +163,6 @@ class RobertaTokenizer(PreTrainedTokenizer):
         text = bytearray([self.byte_decoder[c] for c in text]).decode('utf-8', errors=self.errors)
         return text
 
-    @property
-    def max_len_single_sentence(self):
-        return self.max_len - 2  # take into account special tokens
-
-    @property
-    def max_len_sentences_pair(self):
-        return self.max_len - 4  # take into account special tokens
-
     def add_special_tokens_single_sentence(self, token_ids):
         """
         Adds special tokens to a sequence for sequence classification tasks.
diff --git a/pytorch_transformers/tokenization_transfo_xl.py b/pytorch_transformers/tokenization_transfo_xl.py
index 992dff80d5..c603ba695c 100644
--- a/pytorch_transformers/tokenization_transfo_xl.py
+++ b/pytorch_transformers/tokenization_transfo_xl.py
@@ -73,6 +73,10 @@ class TransfoXLTokenizer(PreTrainedTokenizer):
         super(TransfoXLTokenizer, self).__init__(unk_token=unk_token, eos_token=eos_token,
                                                  additional_special_tokens=additional_special_tokens,
                                                  **kwargs)
+
+        self.max_len_single_sentence = self.max_len # no default special tokens - you can update this value if you add special tokens
+        self.max_len_sentences_pair = self.max_len # no default special tokens - you can update this value if you add special tokens
+
         if never_split is None:
             never_split = self.all_special_tokens
         if special is None:
diff --git a/pytorch_transformers/tokenization_utils.py b/pytorch_transformers/tokenization_utils.py
index a128c3fd72..2fb7f87e9c 100644
--- a/pytorch_transformers/tokenization_utils.py
+++ b/pytorch_transformers/tokenization_utils.py
@@ -67,14 +67,6 @@ class PreTrainedTokenizer(object):
                                  "pad_token", "cls_token", "mask_token",
                                  "additional_special_tokens"]
 
-    @property
-    def max_len_single_sentence(self):
-        return self.max_len  # Default to max_len but can be smaller in specific tokenizers to take into account special tokens
-
-    @property
-    def max_len_sentences_pair(self):
-        return self.max_len  # Default to max_len but can be smaller in specific tokenizers to take into account special tokens
-
     @property
     def bos_token(self):
         """ Beginning of sentence token (string). Log an error if used while not having been set. """
@@ -174,6 +166,9 @@ class PreTrainedTokenizer(object):
         self._additional_special_tokens = []
 
         self.max_len = max_len if max_len is not None else int(1e12)
+        self.max_len_single_sentence = self.max_len
+        self.max_len_sentences_pair = self.max_len
+
         self.added_tokens_encoder = {}
         self.added_tokens_decoder = {}
 
diff --git a/pytorch_transformers/tokenization_xlm.py b/pytorch_transformers/tokenization_xlm.py
index b544923e35..2b930458bb 100644
--- a/pytorch_transformers/tokenization_xlm.py
+++ b/pytorch_transformers/tokenization_xlm.py
@@ -122,6 +122,10 @@ class XLMTokenizer(PreTrainedTokenizer):
                                            cls_token=cls_token, mask_token=mask_token,
                                            additional_special_tokens=additional_special_tokens,
                                            **kwargs)
+
+        self.max_len_single_sentence = self.max_len - 2  # take into account special tokens
+        self.max_len_sentences_pair = self.max_len - 3  # take into account special tokens
+
         try:
             import ftfy
             from spacy.lang.en import English
@@ -215,14 +219,6 @@ class XLMTokenizer(PreTrainedTokenizer):
         out_string = ''.join(tokens).replace('</w>', ' ').strip()
         return out_string
 
-    @property
-    def max_len_single_sentence(self):
-        return self.max_len - 2  # take into account special tokens
-
-    @property
-    def max_len_sentences_pair(self):
-        return self.max_len - 3  # take into account special tokens
-
     def add_special_tokens_single_sentence(self, token_ids):
         """
         Adds special tokens to a sequence for sequence classification tasks.
diff --git a/pytorch_transformers/tokenization_xlnet.py b/pytorch_transformers/tokenization_xlnet.py
index a282d67904..ac7231bb68 100644
--- a/pytorch_transformers/tokenization_xlnet.py
+++ b/pytorch_transformers/tokenization_xlnet.py
@@ -71,6 +71,10 @@ class XLNetTokenizer(PreTrainedTokenizer):
                                              pad_token=pad_token, cls_token=cls_token,
                                              mask_token=mask_token, additional_special_tokens=
                                              additional_special_tokens, **kwargs)
+
+        self.max_len_single_sentence = self.max_len - 2  # take into account special tokens
+        self.max_len_sentences_pair = self.max_len - 3  # take into account special tokens
+
         try:
             import sentencepiece as spm
         except ImportError:
@@ -177,14 +181,6 @@ class XLNetTokenizer(PreTrainedTokenizer):
         out_string = ''.join(tokens).replace(SPIECE_UNDERLINE, ' ').strip()
         return out_string
 
-    @property
-    def max_len_single_sentence(self):
-        return self.max_len - 2  # take into account special tokens
-
-    @property
-    def max_len_sentences_pair(self):
-        return self.max_len - 3  # take into account special tokens
-
     def add_special_tokens_single_sentence(self, token_ids):
         """
         Adds special tokens to a sequence pair for sequence classification tasks.

From 06510ccb5314f629816888a5b6eed953b30d1046 Mon Sep 17 00:00:00 2001
From: thomwolf <thomwolf@gmail.com>
Date: Fri, 23 Aug 2019 22:08:10 +0200
Subject: [PATCH 151/200] typo

---
 examples/run_lm_finetuning.py | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/examples/run_lm_finetuning.py b/examples/run_lm_finetuning.py
index 015f742299..d37f7a443a 100644
--- a/examples/run_lm_finetuning.py
+++ b/examples/run_lm_finetuning.py
@@ -324,7 +324,7 @@ def main():
     parser.add_argument("--block_size", default=-1, type=int,
                         help="Optional input sequence length after tokenization."
                              "The training dataset will be truncated in block of this size for training."
-                             "Default to the model max input length fo single sentences inputs (take into account special tokens).")
+                             "Default to the model max input length for single sentence inputs (take into account special tokens).")
     parser.add_argument("--do_train", action='store_true',
                         help="Whether to run training.")
     parser.add_argument("--do_eval", action='store_true',

From 529a16dec6cc9bfcf8954a1b16546960f2fab6fa Mon Sep 17 00:00:00 2001
From: LysandreJik <lysandre.debut@reseau.eseo.fr>
Date: Mon, 26 Aug 2019 15:00:43 -0400
Subject: [PATCH 152/200] Generic encoding implementation.

---
 pytorch_transformers/tokenization_utils.py | 6 ++++--
 1 file changed, 4 insertions(+), 2 deletions(-)

diff --git a/pytorch_transformers/tokenization_utils.py b/pytorch_transformers/tokenization_utils.py
index 2fb7f87e9c..3596711bdb 100644
--- a/pytorch_transformers/tokenization_utils.py
+++ b/pytorch_transformers/tokenization_utils.py
@@ -593,10 +593,12 @@ class PreTrainedTokenizer(object):
             return first_sentence_tokens, second_sentence_tokens
 
     def add_special_tokens_single_sentence(self, token_ids):
-        raise NotImplementedError
+        logger.warning("This tokenizer does not make use of special tokens. The sequence has been returned with no modification.")
+        return token_ids
 
     def add_special_tokens_sentences_pair(self, token_ids_0, token_ids_1):
-        raise NotImplementedError
+        logger.warning("This tokenizer does not make use of special tokens. The two sequences have been concatenated.")
+        return token_ids_0 + token_ids_1
 
     def convert_ids_to_tokens(self, ids, skip_special_tokens=False):
         """ Converts a single index or a sequence of indices (integers) in a token "

From e08c01aa1ad63efff83548ea69d5ba3ce4a75acc Mon Sep 17 00:00:00 2001
From: LysandreJik <lysandre.debut@reseau.eseo.fr>
Date: Mon, 26 Aug 2019 18:13:06 -0400
Subject: [PATCH 153/200] fix #1102

---
 pytorch_transformers/modeling_roberta.py     | 6 +++---
 pytorch_transformers/tokenization_roberta.py | 4 ++--
 2 files changed, 5 insertions(+), 5 deletions(-)

diff --git a/pytorch_transformers/modeling_roberta.py b/pytorch_transformers/modeling_roberta.py
index e49b2a06b1..cbd88ab86e 100644
--- a/pytorch_transformers/modeling_roberta.py
+++ b/pytorch_transformers/modeling_roberta.py
@@ -98,15 +98,15 @@ ROBERTA_INPUTS_DOCSTRING = r"""
     Inputs:
         **input_ids**: ``torch.LongTensor`` of shape ``(batch_size, sequence_length)``:
             Indices of input sequence tokens in the vocabulary.
-            To match pre-training, RoBERTa input sequence should be formatted with [CLS] and [SEP] tokens as follows:
+            To match pre-training, RoBERTa input sequence should be formatted with <s> and </s> tokens as follows:
 
             (a) For sequence pairs:
 
-                ``tokens:         [CLS] is this jack ##son ##ville ? [SEP][SEP] no it is not . [SEP]``
+                ``tokens:         <s> Is this Jacksonville ? </s> </s> No it is not . </s>``
 
             (b) For single sequences:
 
-                ``tokens:         [CLS] the dog is hairy . [SEP]``
+                ``tokens:         <s> the dog is hairy . </s>``
 
             Fully encoded sequences or sequence pairs can be obtained using the RobertaTokenizer.encode function with 
             the ``add_special_tokens`` parameter set to ``True``.
diff --git a/pytorch_transformers/tokenization_roberta.py b/pytorch_transformers/tokenization_roberta.py
index edf4717c89..13d963d432 100644
--- a/pytorch_transformers/tokenization_roberta.py
+++ b/pytorch_transformers/tokenization_roberta.py
@@ -163,14 +163,14 @@ class RobertaTokenizer(PreTrainedTokenizer):
     def add_special_tokens_single_sentence(self, token_ids):
         """
         Adds special tokens to a sequence for sequence classification tasks.
-        A RoBERTa sequence has the following format: [CLS] X [SEP]
+        A RoBERTa sequence has the following format: <s> X </s>
         """
         return [self._convert_token_to_id(self.cls_token)] + token_ids + [self._convert_token_to_id(self.sep_token)]
 
     def add_special_tokens_sentences_pair(self, token_ids_0, token_ids_1):
         """
         Adds special tokens to a sequence pair for sequence classification tasks.
-        A RoBERTa sequence pair has the following format: [CLS] A [SEP][SEP] B [SEP]
+        A RoBERTa sequence pair has the following format: <s> A </s></s> B </s>
         """
         sep = [self._convert_token_to_id(self.sep_token)]
         cls = [self._convert_token_to_id(self.cls_token)]

From c8933bb2d9f60885bb66c1a76de878bd5f7a8e9d Mon Sep 17 00:00:00 2001
From: Nikolay Korolev <korolevns98@gmail.com>
Date: Tue, 27 Aug 2019 12:10:36 +0300
Subject: [PATCH 154/200] Delete nonexistent parameter from documentation

Changed documentation of GPT2Model, GPT2LMHeadModel and GPT2DoubleHeadsModel
---
 pytorch_transformers/modeling_gpt2.py | 8 --------
 1 file changed, 8 deletions(-)

diff --git a/pytorch_transformers/modeling_gpt2.py b/pytorch_transformers/modeling_gpt2.py
index 9022048d6d..35bb9112a6 100644
--- a/pytorch_transformers/modeling_gpt2.py
+++ b/pytorch_transformers/modeling_gpt2.py
@@ -408,10 +408,6 @@ GPT2_INPUTS_DOCSTRING = r"""    Inputs:
             list of ``torch.FloatTensor`` (one for each layer):
             that contains pre-computed hidden-states (key and values in the attention blocks) as computed by the model
             (see `past` output below). Can be used to speed up sequential decoding.
-        **attention_mask**: (`optional`) ``torch.FloatTensor`` of shape ``(batch_size, sequence_length)``:
-            Mask to avoid performing attention on padding token indices.
-            Mask values selected in ``[0, 1]``:
-            ``1`` for tokens that are NOT MASKED, ``0`` for MASKED tokens.
         **head_mask**: (`optional`) ``torch.FloatTensor`` of shape ``(num_heads,)`` or ``(num_layers, num_heads)``:
             Mask to nullify selected heads of the self-attention modules.
             Mask values selected in ``[0, 1]``:
@@ -642,10 +638,6 @@ class GPT2DoubleHeadsModel(GPT2PreTrainedModel):
             list of ``torch.FloatTensor`` (one for each layer):
             that contains pre-computed hidden-states (key and values in the attention blocks) as computed by the model
             (see `past` output below). Can be used to speed up sequential decoding.
-        **attention_mask**: (`optional`) ``torch.FloatTensor`` of shape ``(batch_size, num_choices, sequence_length)``:
-            Mask to avoid performing attention on padding token indices.
-            Mask values selected in ``[0, 1]``:
-            ``1`` for tokens that are NOT MASKED, ``0`` for MASKED tokens.
         **head_mask**: (`optional`) ``torch.FloatTensor`` of shape ``(num_heads,)`` or ``(num_layers, num_heads)``:
             Mask to nullify selected heads of the self-attention modules.
             Mask values selected in ``[0, 1]``:

From 26bda77225d3b3929691971206fba1d8f7c3c46d Mon Sep 17 00:00:00 2001
From: Nikolay Korolev <korolevns98@gmail.com>
Date: Tue, 27 Aug 2019 12:22:42 +0300
Subject: [PATCH 155/200] Fix documentation #1117

Rename parameter in documentation + Delete its second occurrence.
---
 pytorch_transformers/modeling_gpt2.py | 5 +----
 1 file changed, 1 insertion(+), 4 deletions(-)

diff --git a/pytorch_transformers/modeling_gpt2.py b/pytorch_transformers/modeling_gpt2.py
index 9022048d6d..2bf7c1a708 100644
--- a/pytorch_transformers/modeling_gpt2.py
+++ b/pytorch_transformers/modeling_gpt2.py
@@ -656,14 +656,11 @@ class GPT2DoubleHeadsModel(GPT2PreTrainedModel):
             Indices are selected in ``[-1, 0, ..., config.vocab_size]``
             All labels set to ``-1`` are ignored (masked), the loss is only
             computed for labels in ``[0, ..., config.vocab_size]``
-        **multiple_choice_labels**: (`optional`) ``torch.LongTensor`` of shape ``(batch_size)``:
+        **mc_labels**: (`optional`) ``torch.LongTensor`` of shape ``(batch_size)``:
             Labels for computing the multiple choice classification loss.
             Indices should be in ``[0, ..., num_choices]`` where `num_choices` is the size of the second dimension
             of the input tensors. (see `input_ids` above)
 
-            `multiple_choice_labels`: optional multiple choice labels: ``torch.LongTensor`` of shape [batch_size]
-                with indices selected in [0, ..., num_choices].
-
     Outputs: `Tuple` comprising various elements depending on the configuration (config) and inputs:
         **lm_loss**: (`optional`, returned when ``lm_labels`` is provided) ``torch.FloatTensor`` of shape ``(1,)``:
             Language modeling loss.

From 53282b5bd0cf78fae913d1d7e43f94c94620df0c Mon Sep 17 00:00:00 2001
From: Nikolay Korolev <korolevns98@gmail.com>
Date: Tue, 27 Aug 2019 14:19:03 +0300
Subject: [PATCH 156/200] Change attention mask dtype to be bool. Fix #1119

---
 pytorch_transformers/modeling_transfo_xl.py | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/pytorch_transformers/modeling_transfo_xl.py b/pytorch_transformers/modeling_transfo_xl.py
index 3cfdee38cb..c4ca0be878 100644
--- a/pytorch_transformers/modeling_transfo_xl.py
+++ b/pytorch_transformers/modeling_transfo_xl.py
@@ -1142,10 +1142,10 @@ class TransfoXLModel(TransfoXLPreTrainedModel):
             else:
                 mask_shift_len = qlen
             dec_attn_mask = (torch.triu(all_ones, 1+mlen)
-                    + torch.tril(all_ones, -mask_shift_len)).byte()[:, :, None] # -1
+                    + torch.tril(all_ones, -mask_shift_len)).bool()[:, :, None] # -1
         else:
             dec_attn_mask = torch.triu(
-                word_emb.new_ones(qlen, klen), diagonal=1+mlen).byte()[:,:,None]
+                word_emb.new_ones(qlen, klen), diagonal=1+mlen).bool()[:,:,None]
 
         hids = []
         attentions = []

From 0d288727b8a7d7ba5419480caa284103396c0fe7 Mon Sep 17 00:00:00 2001
From: thomwolf <thomwolf@gmail.com>
Date: Tue, 27 Aug 2019 14:50:22 +0200
Subject: [PATCH 157/200] fix #1106

---
 docs/source/examples.rst | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/docs/source/examples.rst b/docs/source/examples.rst
index 7777117b47..dbe6a3d4fc 100644
--- a/docs/source/examples.rst
+++ b/docs/source/examples.rst
@@ -384,7 +384,7 @@ Training with the previous hyper-parameters on a single GPU gave us the followin
 LM Fine-tuning
 ~~~~~~~~~~~~~~
 
-The data should be a text file in the same format as `sample_text.txt <./samples/sample_text.txt>`_  (one sentence per line, docs separated by empty line).
+The data should be a text file in the same format as `sample_text.txt <./pytorch_transformers/tests/fixtures/sample_text.txt/sample_text.txt>`_  (one sentence per line, docs separated by empty line).
 You can download an `exemplary training corpus <https://ext-bert-sample.obs.eu-de.otc.t-systems.com/small_wiki_sentence_corpus.txt>`_ generated from wikipedia articles and split into ~500k sentences with spaCy.
 Training one epoch on this corpus takes about 1:20h on 4 x NVIDIA Tesla P100 with ``train_batch_size=200`` and ``max_seq_length=128``\ :
 

From 1d232400681186e39b477facac8159879119a85a Mon Sep 17 00:00:00 2001
From: VictorSanh <victorsanh@gmail.com>
Date: Tue, 27 Aug 2019 14:27:47 +0000
Subject: [PATCH 158/200] wip

---
 pytorch_transformers/modeling_dilbert.py | 375 +++++++++++++++++++++++
 1 file changed, 375 insertions(+)
 create mode 100644 pytorch_transformers/modeling_dilbert.py

diff --git a/pytorch_transformers/modeling_dilbert.py b/pytorch_transformers/modeling_dilbert.py
new file mode 100644
index 0000000000..44d6672d47
--- /dev/null
+++ b/pytorch_transformers/modeling_dilbert.py
@@ -0,0 +1,375 @@
+# coding=utf-8
+# Copyright 2019-present, the HuggingFace Inc. team.
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+#     http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+"""
+PyTorch DilBERT model.
+"""
+from __future__ import absolute_import, division, print_function, unicode_literals
+
+import json
+import logging
+import math
+import sys
+from io import open
+
+import itertools
+import numpy as np
+
+import torch
+import torch.nn as nn
+
+from pytorch_transformers.modeling_utils import PretrainedConfig, PreTrainedModel, add_start_docstrings
+
+import logging
+logger = logging.getLogger(__name__)
+
+
+DILBERT_PRETRAINED_MODEL_ARCHIVE_MAP = {
+    'dilbert-base-uncased': None, # TODO(Victor)
+}
+
+DILBERT_PRETRAINED_CONFIG_ARCHIVE_MAP = {
+    'dilbert-base-uncased': None, #TODO(Victor)
+}
+
+
+class DilBertconfig(PretrainedConfig):
+    pretrained_config_archive_map = DILBERT_PRETRAINED_CONFIG_ARCHIVE_MAP
+
+    def __init__(self,
+                 vocab_size_or_config_json_file=30522,
+                 max_position_embeddings=512,
+                 sinusoidal_pos_embds=True,
+                 n_layers=6,
+                 n_heads=12,
+                 dim=768,
+                 dropout=0.1,
+                 attention_dropout=0.1,
+                 activation='gelu',
+                 initializer_range=0.02,
+                 tie_weights=True,
+                 **kwargs):
+        super(DilBertconfig, self).__init__(**kwargs)
+
+        if isintance(vocab_size_or_config_json_file, str) or (sys.version_info[0] == 2
+                        and isinstance(vocab_size_or_config_json_file, unicode)):
+            with open(vocab_size_or_config_json_file, "r", encoding='utf-8') as reader:
+                json_config = json.loads(reader.read())
+            for key, value in json_config.items():
+                self.__dict__[key] = value
+        elif isinstance(vocab_size_or_config_json_file, int):
+            self.vocab_size = vocab_size_or_config_json_file
+            self.max_position_embeddings = max_position_embeddings
+            self.sinusoidal_pos_embds = sinusoidal_pos_embds
+            self.n_layers = n_layers
+            self.n_heads = n_heads
+            self.dim = dim
+            self.dropout = dropout
+            self.attention_dropout = attention_dropout
+            self.activation = activation
+            self.initializer_range = initializer_range
+            self.tie_weights = tie_weights
+        else:
+            raise ValueError("First argument must be either a vocabulary size (int)"
+                             "or the path to a pretrained model config file (str)")
+
+
+def gelu(x):
+    return 0.5 * x * (1.0 + torch.erf(x / math.sqrt(2.0)))
+
+def create_sinusoidal_embeddings(n_pos, dim, out):
+    position_enc = np.array([
+        [pos / np.power(10000, 2 * (j // 2) / dim) for j in range(dim)]
+        for pos in range(n_pos)
+    ])
+    out[:, 0::2] = torch.FloatTensor(np.sin(position_enc[:, 0::2]))
+    out[:, 1::2] = torch.FloatTensor(np.cos(position_enc[:, 1::2]))
+    out.detach_()
+    out.requires_grad = False
+
+class Embeddings(nn.Module):
+    def __init__(self,
+                 config):
+        super(Embeddings, self).__init__()
+        self.word_embeddings = nn.Embedding(config.vocab_size, dim, padding_idx=0)
+        self.position_embeddings = nn.Embedding(config.max_position_embeddings, config.dim)
+        if sinusoidal_pos_embds:
+            create_sinusoidal_embeddings(n_pos=config.max_position_embeddings,
+                                         dim=config.dim,
+                                         out=self.position_embeddings.weight)
+
+        self.LayerNorm = nn.LayerNorm(config.dim, eps=1e-12)
+        self.dropout = nn.Dropout(config.dropout)
+
+    def forward(self, input_ids):
+        """
+        Parameters
+        ----------
+        input_ids: torch.tensor(bs, max_seq_length) - The token ids to embed.
+        """
+        seq_length = input_ids.size(1)
+        position_ids = torch.arange(seq_length, dtype=torch.long, device=input_ids.device) # (max_seq_length)
+        position_ids = position_ids.unsqueeze(0).expand_as(input_ids)                      # (bs, max_seq_length)
+
+        word_embeddings = self.word_embeddings(input_ids)                   # (bs, max_seq_length, dim)
+        position_embeddings = self.position_embeddings(position_ids)        # (bs, max_seq_length, dim)
+
+        embeddings = word_embeddings + position_embeddings
+        embeddings = self.LayerNorm(embeddings)
+        embeddings = self.dropout(embeddings)
+        return embeddings
+
+class MultiHeadSelfAttention(nn.Module):
+    def __init__(self,
+                 config):
+        super(MultiHeadSelfAttention, self).__init__()
+
+        self.n_heads = config.n_heads
+        self.dim = config.dim
+        self.dropout = nn.Dropout(p=config.attention_dropout)
+        self.output_attentions = config.output_attentions
+
+        assert self.dim % self.n_heads == 0
+
+        self.q_lin = nn.Linear(in_features=dim, out_features=dim)
+        self.k_lin = nn.Linear(in_features=dim, out_features=dim)
+        self.v_lin = nn.Linear(in_features=dim, out_features=dim)
+        self.out_lin = nn.Linear(in_features=dim, out_features=dim)
+
+    def forward(self,
+                query: torch.tensor,
+                key: torch.tensor,
+                value: torch.tensor,
+                mask: torch.tensor):
+        """
+        Classic Self Attention. I don't understand the one of PyTorch...
+
+        Parameters
+        ----------
+        query: torch.tensor(bs, seq_length, dim)
+        key: torch.tensor(bs, seq_length, dim)
+        value: torch.tensor(bs, seq_length, dim)
+        mask: torch.tensor(bs, seq_length)
+
+        Return
+        ------
+        weights: torch.tensor(bs, n_heads, seq_length, seq_length)
+            Attention weights
+        context: torch.tensor(bs, seq_length, dim)
+            Contextualized layer
+        """
+        bs, q_length, dim = query.size()
+        k_length = key.size(1)
+        assert dim == self.dim, 'Dimensions do not match: %s input vs %s configured' % (dim, self.dim)
+        assert key.size() == value.size()
+
+        dim_per_head = dim // self.n_heads
+
+        assert 2 <= mask.dim() <= 3
+        causal = (mask.dim() == 3)
+        mask_reshp = (bs, 1, 1, k_length)
+
+        def shape(x):
+            """ separate heads """
+            return x.view(bs, -1, self.n_heads, dim_per_head).transpose(1, 2)
+
+        def unshape(x):
+            """ group heads """
+            return x.transpose(1, 2).contiguous().view(bs, -1, dim)
+
+        q = shape(self.q_lin(query))           # (bs, n_heads, q_length, dim_per_head)
+        k = shape(self.k_lin(key))             # (bs, n_heads, k_length, dim_per_head)
+        v = shape(self.v_lin(value))           # (bs, n_heads, k_length, dim_per_head)
+
+        q = q / math.sqrt(dim_per_head)                     # (bs, n_heads, q_length, dim_per_head)
+        scores = torch.matmul(q, k.transpose(2,3))          # (bs, n_heads, q_length, k_length)
+        mask = (mask==0).view(mask_reshp).expand_as(scores) # (bs, n_heads, q_length, k_length)
+        scores.masked_fill_(mask, -float('inf'))            # (bs, n_heads, q_length, k_length)
+
+        weights = nn.Softmax(dim=-1)(scores)   # (bs, n_heads, q_length, k_length)
+        weights = self.dropout(weights)        # (bs, n_heads, q_length, k_length)
+        context = torch.matmul(weights, v)     # (bs, n_heads, q_length, dim_per_head)
+        context = unshape(context)             # (bs, q_length, dim)
+        context = self.out_lin(context)        # (bs, q_length, dim)
+
+        if self.output_attentions:
+            return context, weights
+        else:
+            return context
+
+class FFN(nn.Module):
+    def __init__(self,
+                 config):
+        super(FFN, self).__init__()
+        self.dropout = nn.Dropout(p=config.dropout)
+        self.lin1 = nn.Linear(in_features=config.dim, out_features=config.hidden_dim)
+        self.lin2 = nn.Linear(in_features=config.hidden_dim, out_features=config.dim)
+        assert activation in ['relu', 'gelu'], ValueError(f"activation ({config.activation}) must be in ['relu', 'gelu']")
+        self.activation = gelu if activation == 'gelu' else nn.ReLU()
+
+    def forward(self,
+                input: torch.tensor):
+        x = self.lin1(input)
+        x = self.activation(x)
+        x = self.lin2(x)
+        x = self.dropout(x)
+        return x
+
+class TransformerBlock(nn.Module):
+    def __init__(self,
+                 config):
+        super(TransformerBlock, self).__init__()
+
+        self.n_heads = config.n_heads
+        self.dim = config.dim
+        self.hidden_dim = config.hidden_dim
+        self.dropout = nn.Dropout(p=config.dropout)
+        self.activation = config.activation
+        self.output_attentions = config.output_attentions
+
+        assert dim % n_heads == 0
+
+        self.attention = MultiHeadSelfAttention(dim=config.dim,
+                                                n_heads=config.n_heads,
+                                                dropout=config.attention_dropout,
+                                                output_attentions=config.output_attentions)
+        self.sa_layer_norm = nn.LayerNorm(normalized_shape=config.dim, eps=1e-12)
+
+        self.ffn = FFN(in_dim=config.dim,
+                       hidden_dim=config.hidden_dim,
+                       out_dim=config.dim,
+                       dropout=config.dropout,
+                       activation=config.activation)
+        self.output_layer_norm = nn.LayerNorm(normalized_shape=config.dim, eps=1e-12)
+
+    def forward(self,
+                x: torch.tensor,
+                attn_mask: torch.tensor = None):
+        """
+        Parameters
+        ----------
+        x: torch.tensor(bs, seq_length, dim)
+        attn_mask: torch.tensor(bs, seq_length)
+        """
+        # Self-Attention
+        sa_output = self.attention(query=x, key=x, value=x, mask=attn_mask)
+        if self.output_attentions:
+            sa_output, sa_weights = sa_output                  # (bs, seq_length, dim)
+        sa_output = self.sa_layer_norm(sa_output + x)          # (bs, seq_length, dim)
+
+        # Feed Forward Network
+        ffn_output = self.ffn(sa_output)                             # (bs, seq_length, dim)
+        ffn_output = self.output_layer_norm(ffn_output + sa_output)  # (bs, seq_length, dim)
+
+        if self.output_attentions:
+            return sa_weights, ffn_output
+        else:
+            return ffn_output
+
+class Transformer(nn.Module):
+    def __init__(self,
+                 config):
+        super(Transformer, self).__init__()
+        self.n_layers = config.n_layers
+        self.output_attentions = config.output_attentions
+
+        layer = TransformerBlock(n_heads=config.n_heads,
+                                 dim=config.dim,
+                                 hidden_dim=config.hidden_dim,
+                                 dropout=config.dropout,
+                                 attention_dropout=config.attention_dropout,
+                                 activation=config.activation,
+                                 output_attentions=config.output_attentions)
+        self.layer = nn.ModuleList([copy.deepcopy(layer) for _ in range(n_layers)])
+
+    def forward(self,
+                x: torch.tensor,
+                attn_mask: torch.tensor = None,
+                output_all_encoded_layers: bool = True):
+        """
+        Parameters
+        ----------
+        x: torch.tensor(bs, seq_length, dim)
+        attn_mask: torch.tensor(bs, seq_length)
+        output_all_encoded_layers: bool
+        """
+        all_encoder_layers = []
+        all_attentions = []
+
+        for _, layer_module in enumerate(self.layer):
+            x = layer_module(x=x, attn_mask=attn_mask)
+            if self.output_attentions:
+                attentions, x = x
+                all_attentions.append(attentions)
+            all_encoder_layers.append(x)
+
+        if not output_all_encoded_layers:
+            all_encoder_layers = all_encoder_layers[-1]
+
+        if self.output_attentions:
+            return all_attentions, all_encoder_layers
+        else:
+            return all_encoder_layers
+
+
+
+# TODO(Victor)
+# class DilBertWithLMHeadModel(DilBertPreTrainedModel):
+# class DilBertForSequenceClassification(DilBertPretrainedModel):
+
+
+class DilBertForQuestionAnswering(DilBertPreTrainedModel):
+    def __init__(self, config):
+        super(DilBertForQuestionAnswering, self).__init__(config)
+
+        self.dilbert = DilBertModel(config)
+        self.qa_outputs = nn.Linear(config.dim, config.num_labels)
+        assert config.num_labels == 2
+        self.dropout = nn.Dropout(config.qa_dropout)
+
+        self.apply(self.init_weights)
+        
+    def forward(self,
+                input_ids: torch.tensor,
+                attention_mask: torch.tensor = None,
+                start_positions: torch.tensor = None,
+                end_positions: torch.tensor = None):
+        _, _, hidden_states = self.dilbert(input_ids=input_ids,
+                                           attention_mask=attention_mask) # _, _, (bs, max_query_len, dim)
+        
+        hidden_states = self.dropout(hidden_states)                       # (bs, max_query_len, dim)
+        logits = self.qa_outputs(hidden_states)                           # (bs, max_query_len, 2)
+        start_logits, end_logits = logits.split(1, dim=-1)
+        start_logits = start_logits.squeeze(-1)                           # (bs, max_query_len)
+        end_logits = end_logits.squeeze(-1)                               # (bs, max_query_len)
+
+        outputs = (start_logits, end_logits,) + (hidden_states,)
+        if start_positions is not None and end_positions is not None:
+            # If we are on multi-GPU, split add a dimension
+            if len(start_positions.size()) > 1:
+                start_positions = start_positions.squeeze(-1)
+            if len(end_positions.size()) > 1:
+                end_positions = end_positions.squeeze(-1)
+            # sometimes the start/end positions are outside our model inputs, we ignore these terms
+            ignored_index = start_logits.size(1)
+            start_positions.clamp_(0, ignored_index)
+            end_positions.clamp_(0, ignored_index)
+
+            loss_fct = nn.CrossEntropyLoss(ignore_index=ignored_index)
+            start_loss = loss_fct(start_logits, start_positions)
+            end_loss = loss_fct(end_logits, end_positions)
+            total_loss = (start_loss + end_loss) / 2
+            outputs = (total_loss,) + outputs
+
+        return outputs  # (loss), start_logits, end_logits, hidden_states
\ No newline at end of file

From 42968138c8f73c1f7b6f93d65d92cd44597e5ee7 Mon Sep 17 00:00:00 2001
From: VictorSanh <victorsanh@gmail.com>
Date: Tue, 27 Aug 2019 22:00:38 +0000
Subject: [PATCH 159/200] wip wouf

---
 pytorch_transformers/__init__.py         |   2 +
 pytorch_transformers/modeling_dilbert.py | 406 +++++++++++++++++++----
 2 files changed, 343 insertions(+), 65 deletions(-)

diff --git a/pytorch_transformers/__init__.py b/pytorch_transformers/__init__.py
index 62e3b8c47b..78916d1ebb 100644
--- a/pytorch_transformers/__init__.py
+++ b/pytorch_transformers/__init__.py
@@ -40,6 +40,8 @@ from .modeling_xlm import (XLMConfig, XLMPreTrainedModel , XLMModel,
                            XLM_PRETRAINED_MODEL_ARCHIVE_MAP)
 from .modeling_roberta import (RobertaConfig, RobertaForMaskedLM, RobertaModel, RobertaForSequenceClassification,
                                ROBERTA_PRETRAINED_CONFIG_ARCHIVE_MAP, ROBERTA_PRETRAINED_MODEL_ARCHIVE_MAP)
+from .modeling_dilbert import (DilBertconfig, DilBertForMaskedLM, DilBertModel, DilBertForSequenceClassification,
+                              DILBERT_PRETRAINED_CONFIG_ARCHIVE_MAP, DILBERT_PRETRAINED_MODEL_ARCHIVE_MAP)
 from .modeling_utils import (WEIGHTS_NAME, CONFIG_NAME, TF_WEIGHTS_NAME,
                           PretrainedConfig, PreTrainedModel, prune_layer, Conv1D)
 
diff --git a/pytorch_transformers/modeling_dilbert.py b/pytorch_transformers/modeling_dilbert.py
index 44d6672d47..b5d7e51b79 100644
--- a/pytorch_transformers/modeling_dilbert.py
+++ b/pytorch_transformers/modeling_dilbert.py
@@ -20,6 +20,7 @@ from __future__ import absolute_import, division, print_function, unicode_litera
 import json
 import logging
 import math
+import copy
 import sys
 from io import open
 
@@ -54,6 +55,7 @@ class DilBertconfig(PretrainedConfig):
                  n_layers=6,
                  n_heads=12,
                  dim=768,
+                 hidden_dim=4*768,
                  dropout=0.1,
                  attention_dropout=0.1,
                  activation='gelu',
@@ -62,7 +64,7 @@ class DilBertconfig(PretrainedConfig):
                  **kwargs):
         super(DilBertconfig, self).__init__(**kwargs)
 
-        if isintance(vocab_size_or_config_json_file, str) or (sys.version_info[0] == 2
+        if isinstance(vocab_size_or_config_json_file, str) or (sys.version_info[0] == 2
                         and isinstance(vocab_size_or_config_json_file, unicode)):
             with open(vocab_size_or_config_json_file, "r", encoding='utf-8') as reader:
                 json_config = json.loads(reader.read())
@@ -85,6 +87,7 @@ class DilBertconfig(PretrainedConfig):
                              "or the path to a pretrained model config file (str)")
 
 
+### UTILS AND BUILDING BLOCKS OF THE ARCHITECTURE ###
 def gelu(x):
     return 0.5 * x * (1.0 + torch.erf(x / math.sqrt(2.0)))
 
@@ -102,9 +105,9 @@ class Embeddings(nn.Module):
     def __init__(self,
                  config):
         super(Embeddings, self).__init__()
-        self.word_embeddings = nn.Embedding(config.vocab_size, dim, padding_idx=0)
+        self.word_embeddings = nn.Embedding(config.vocab_size, config.dim, padding_idx=0)
         self.position_embeddings = nn.Embedding(config.max_position_embeddings, config.dim)
-        if sinusoidal_pos_embds:
+        if config.sinusoidal_pos_embds:
             create_sinusoidal_embeddings(n_pos=config.max_position_embeddings,
                                          dim=config.dim,
                                          out=self.position_embeddings.weight)
@@ -116,7 +119,13 @@ class Embeddings(nn.Module):
         """
         Parameters
         ----------
-        input_ids: torch.tensor(bs, max_seq_length) - The token ids to embed.
+        input_ids: torch.tensor(bs, max_seq_length)
+            The token ids to embed.
+
+        Outputs
+        -------
+        embeddings: torch.tensor(bs, max_seq_length, dim)
+            The embedded tokens (plus position embeddings, no token_type embeddings)
         """
         seq_length = input_ids.size(1)
         position_ids = torch.arange(seq_length, dtype=torch.long, device=input_ids.device) # (max_seq_length)
@@ -125,9 +134,9 @@ class Embeddings(nn.Module):
         word_embeddings = self.word_embeddings(input_ids)                   # (bs, max_seq_length, dim)
         position_embeddings = self.position_embeddings(position_ids)        # (bs, max_seq_length, dim)
 
-        embeddings = word_embeddings + position_embeddings
-        embeddings = self.LayerNorm(embeddings)
-        embeddings = self.dropout(embeddings)
+        embeddings = word_embeddings + position_embeddings  # (bs, max_seq_length, dim)
+        embeddings = self.LayerNorm(embeddings)             # (bs, max_seq_length, dim)
+        embeddings = self.dropout(embeddings)               # (bs, max_seq_length, dim)
         return embeddings
 
 class MultiHeadSelfAttention(nn.Module):
@@ -142,10 +151,10 @@ class MultiHeadSelfAttention(nn.Module):
 
         assert self.dim % self.n_heads == 0
 
-        self.q_lin = nn.Linear(in_features=dim, out_features=dim)
-        self.k_lin = nn.Linear(in_features=dim, out_features=dim)
-        self.v_lin = nn.Linear(in_features=dim, out_features=dim)
-        self.out_lin = nn.Linear(in_features=dim, out_features=dim)
+        self.q_lin = nn.Linear(in_features=config.dim, out_features=config.dim)
+        self.k_lin = nn.Linear(in_features=config.dim, out_features=config.dim)
+        self.v_lin = nn.Linear(in_features=config.dim, out_features=config.dim)
+        self.out_lin = nn.Linear(in_features=config.dim, out_features=config.dim)
 
     def forward(self,
                 query: torch.tensor,
@@ -153,8 +162,6 @@ class MultiHeadSelfAttention(nn.Module):
                 value: torch.tensor,
                 mask: torch.tensor):
         """
-        Classic Self Attention. I don't understand the one of PyTorch...
-
         Parameters
         ----------
         query: torch.tensor(bs, seq_length, dim)
@@ -162,12 +169,12 @@ class MultiHeadSelfAttention(nn.Module):
         value: torch.tensor(bs, seq_length, dim)
         mask: torch.tensor(bs, seq_length)
 
-        Return
-        ------
+        Outputs
+        -------
         weights: torch.tensor(bs, n_heads, seq_length, seq_length)
             Attention weights
         context: torch.tensor(bs, seq_length, dim)
-            Contextualized layer
+            Contextualized layer. Optional: only if `output_attentions=True`
         """
         bs, q_length, dim = query.size()
         k_length = key.size(1)
@@ -204,9 +211,9 @@ class MultiHeadSelfAttention(nn.Module):
         context = self.out_lin(context)        # (bs, q_length, dim)
 
         if self.output_attentions:
-            return context, weights
+            return (context, weights)
         else:
-            return context
+            return (context,)
 
 class FFN(nn.Module):
     def __init__(self,
@@ -215,8 +222,8 @@ class FFN(nn.Module):
         self.dropout = nn.Dropout(p=config.dropout)
         self.lin1 = nn.Linear(in_features=config.dim, out_features=config.hidden_dim)
         self.lin2 = nn.Linear(in_features=config.hidden_dim, out_features=config.dim)
-        assert activation in ['relu', 'gelu'], ValueError(f"activation ({config.activation}) must be in ['relu', 'gelu']")
-        self.activation = gelu if activation == 'gelu' else nn.ReLU()
+        assert config.activation in ['relu', 'gelu'], ValueError(f"activation ({config.activation}) must be in ['relu', 'gelu']")
+        self.activation = gelu if config.activation == 'gelu' else nn.ReLU()
 
     def forward(self,
                 input: torch.tensor):
@@ -238,19 +245,12 @@ class TransformerBlock(nn.Module):
         self.activation = config.activation
         self.output_attentions = config.output_attentions
 
-        assert dim % n_heads == 0
+        assert config.dim % config.n_heads == 0
 
-        self.attention = MultiHeadSelfAttention(dim=config.dim,
-                                                n_heads=config.n_heads,
-                                                dropout=config.attention_dropout,
-                                                output_attentions=config.output_attentions)
+        self.attention = MultiHeadSelfAttention(config)
         self.sa_layer_norm = nn.LayerNorm(normalized_shape=config.dim, eps=1e-12)
 
-        self.ffn = FFN(in_dim=config.dim,
-                       hidden_dim=config.hidden_dim,
-                       out_dim=config.dim,
-                       dropout=config.dropout,
-                       activation=config.activation)
+        self.ffn = FFN(config)
         self.output_layer_norm = nn.LayerNorm(normalized_shape=config.dim, eps=1e-12)
 
     def forward(self,
@@ -261,21 +261,28 @@ class TransformerBlock(nn.Module):
         ----------
         x: torch.tensor(bs, seq_length, dim)
         attn_mask: torch.tensor(bs, seq_length)
+
+        Outputs
+        -------
+        sa_weights: torch.tensor(bs, n_heads, seq_length, seq_length)
+            The attention weights
+        ffn_output: torch.tensor(bs, seq_length, dim)
+            The output of the transformer block contextualization.
         """
         # Self-Attention
         sa_output = self.attention(query=x, key=x, value=x, mask=attn_mask)
         if self.output_attentions:
-            sa_output, sa_weights = sa_output                  # (bs, seq_length, dim)
+            sa_output, sa_weights = sa_output                  # (bs, seq_length, dim), (bs, n_heads, seq_length, seq_length)
         sa_output = self.sa_layer_norm(sa_output + x)          # (bs, seq_length, dim)
 
         # Feed Forward Network
         ffn_output = self.ffn(sa_output)                             # (bs, seq_length, dim)
         ffn_output = self.output_layer_norm(ffn_output + sa_output)  # (bs, seq_length, dim)
 
+        output = (ffn_output)
         if self.output_attentions:
-            return sa_weights, ffn_output
-        else:
-            return ffn_output
+            output = (sa_weights,) + output
+        return output
 
 class Transformer(nn.Module):
     def __init__(self,
@@ -283,52 +290,286 @@ class Transformer(nn.Module):
         super(Transformer, self).__init__()
         self.n_layers = config.n_layers
         self.output_attentions = config.output_attentions
+        self.output_hidden_states = config.output_hidden_states
 
-        layer = TransformerBlock(n_heads=config.n_heads,
-                                 dim=config.dim,
-                                 hidden_dim=config.hidden_dim,
-                                 dropout=config.dropout,
-                                 attention_dropout=config.attention_dropout,
-                                 activation=config.activation,
-                                 output_attentions=config.output_attentions)
-        self.layer = nn.ModuleList([copy.deepcopy(layer) for _ in range(n_layers)])
+        layer = TransformerBlock(config)
+        self.layer = nn.ModuleList([copy.deepcopy(layer) for _ in range(config.n_layers)])
 
     def forward(self,
                 x: torch.tensor,
-                attn_mask: torch.tensor = None,
-                output_all_encoded_layers: bool = True):
+                attn_mask: torch.tensor = None):
         """
         Parameters
         ----------
         x: torch.tensor(bs, seq_length, dim)
+            Input sequence embedded.
         attn_mask: torch.tensor(bs, seq_length)
-        output_all_encoded_layers: bool
+            Attention mask on the sequence.
+
+        Outputs
+        -------
+        hidden_state: torch.tensor(bs, seq_length, dim)
+            Sequence of hiddens states in the last (top) layer
+        all_hidden_states: Tuple[torch.tensor(bs, seq_length, dim)]
+            Tuple of length n_layers with the hidden states from each layer.
+            Optional: only if output_hidden_states=True
+        all_attentions: Tuple[torch.tensor(bs, n_heads, seq_length, seq_length)]
+            Tuple of length n_layers with the attention weights from each layer
+            Optional: only if output_attentions=True
         """
-        all_encoder_layers = []
-        all_attentions = []
+        all_hidden_states = ()
+        all_attentions = ()
 
+        hidden_state = x
         for _, layer_module in enumerate(self.layer):
-            x = layer_module(x=x, attn_mask=attn_mask)
+            hidden_state = layer_module(x=hidden_state, attn_mask=attn_mask)
             if self.output_attentions:
-                attentions, x = x
-                all_attentions.append(attentions)
-            all_encoder_layers.append(x)
-
-        if not output_all_encoded_layers:
-            all_encoder_layers = all_encoder_layers[-1]
+                attentions, hidden_state = hidden_state
+                all_attentions = all_attentions + (attentions,)
+            all_hidden_states = all_hidden_states + (hidden_state,)
 
+        outputs = (hidden_state,)
+        if self.output_hidden_states:
+            outputs = outputs + (all_hidden_states,)
         if self.output_attentions:
-            return all_attentions, all_encoder_layers
-        else:
-            return all_encoder_layers
+            outputs = outputs + (all_attentions,)
+        return outputs
 
 
+### INTERFACE FOR ENCODER AND TASK SPECIFIC MODEL ###
+class DilBertPreTrainedModel(PreTrainedModel):
+    """ An abstract class to handle weights initialization and
+        a simple interface for downloading and loading pretrained models.
+    """
+    config_class = DilBertconfig
+    pretrained_model_archive_map = DILBERT_PRETRAINED_MODEL_ARCHIVE_MAP
+    load_tf_weights = None
+    base_model_prefix = "dilbert"
 
-# TODO(Victor)
-# class DilBertWithLMHeadModel(DilBertPreTrainedModel):
-# class DilBertForSequenceClassification(DilBertPretrainedModel):
+    def __init__(self, *inputs, **kwargs):
+        super(DilBertPreTrainedModel, self).__init__(*inputs, **kwargs)
+    
+    def init_weights(self, module):
+        """ Initialize the weights.
+        """
+        if isinstance(module, nn.Embedding):
+            if module.weight.requires_grad:
+                module.weight.data.normal_(mean=0.0, std=self.config.initializer_range)
+        if isinstance(module, nn.Linear):
+            module.weight.data.normal_(mean=0.0, std=self.config.initializer_range)
+        elif isinstance(module, nn.LayerNorm):
+            module.bias.data.zero_()
+            module.weight.data.fill_(1.0)
+        if isinstance(module, nn.Linear) and module.bias is not None:
+            module.bias.data.zero_()
 
 
+DILBERT_START_DOCSTRING = r"""
+    Smaller, faster, cheaper, lighter: DilBERT
+
+    For more information on DilBERT, you should check TODO(Victor): Link to Medium
+
+    Parameters:
+        config (:class:`~pytorch_transformers.DilBertconfig`): Model configuration class with all the parameters of the model. 
+            Initializing with a config file does not load the weights associated with the model, only the configuration.
+            Check out the :meth:`~pytorch_transformers.PreTrainedModel.from_pretrained` method to load the model weights.
+"""
+
+DILBERT_INPUTS_DOCSTRING = r"""
+    Inputs:
+        **input_ids**L ``torch.LongTensor`` of shape ``(batch_size, sequence_length)``:
+            Indices oof input sequence tokens in the vocabulary.
+            The input sequences should start with `[CLS]` and `[SEP]` tokens.
+            
+            For now, ONLY BertTokenizer(`bert-base-uncased`) is supported and you should use this tokenizer when using DilBERT.
+        **attention_mask**: (`optional`) ``torch.LongTensor`` of shape ``(batch_size, sequence_length)``:
+            Mask to avoid performing attention on padding token indices.
+            Mask values selected in ``[0, 1]``:
+            ``1`` for tokens that are NOT MASKED, ``0`` for MASKED tokens.
+"""
+
+@add_start_docstrings("The bare DilBERT encoder/transformer outputing raw hidden-states without any specific head on top.",
+                      DILBERT_START_DOCSTRING, DILBERT_INPUTS_DOCSTRING)
+class DilBertModel(DilBertPreTrainedModel):
+    def __init__(self, config):
+        super(DilBertModel, self).__init__(config)
+
+        self.embeddings = Embeddings(config)   # Embeddings
+        self.transformer = Transformer(config) # Encoder
+
+        self.apply(self.init_weights)
+
+    def forward(self,
+                input_ids: torch.tensor,
+                attention_mask: torch.tensor = None):
+        """
+        Parameters
+        ----------
+        input_ids: torch.tensor(bs, seq_length)
+            Sequences of token ids.
+        attention_mask: torch.tensor(bs, seq_length)
+            Attention mask on the sequences. Optional: If None, it's like there was no padding.
+        
+        Outputs
+        -------
+        hidden_state: torch.tensor(bs, seq_length, dim)
+            Sequence of hiddens states in the last (top) layer
+        pooled_output: torch.tensor(bs, dim)
+            Pooled output: for DilBert, the pooled output is simply the hidden state of the [CLS] token.
+        all_hidden_states: Tuple[torch.tensor(bs, seq_length, dim)]
+            Tuple of length n_layers with the hidden states from each layer.
+            Optional: only if output_hidden_states=True
+        all_attentions: Tuple[torch.tensor(bs, n_heads, seq_length, seq_length)]
+            Tuple of length n_layers with the attention weights from each layer
+            Optional: only if output_attentions=True
+        """
+        if attention_mask is None:
+            attention_mask = torch.ones_like(input_ids) # (bs, seq_length)
+
+        embedding_output = self.embeddings(input_ids)   # (bs, seq_length, dim)
+        tfmr_output = self.transformer(x=embedding_output,
+                                       attn_mask=attention_mask)
+        hidden_state = tfmr_output[0]
+        pooled_output = hidden_state[:, 0]
+        output = (hidden_state, pooled_output) + tfmr_output[1:]
+
+        return output # hidden_state, pooled_output, (hidden_states), (attentions)
+
+@add_start_docstrings("""DilBert Model with a `masked language modeling` head on top. """,
+                      DILBERT_START_DOCSTRING, DILBERT_INPUTS_DOCSTRING)
+class DilBertForMaskedLM(DilBertPreTrainedModel):
+    def __init__(self, config):
+        super(DilBertForMaskedLM, self).__init__(config)
+        self.output_attentions = config.output_attentions
+        self.output_hidden_states = config.output_hidden_states
+
+        self.encoder = DilBertModel(config)
+        self.vocab_transform = nn.Linear(config.dim, config.dim)
+        self.vocab_layer_norm = nn.LayerNorm(config.dim, eps=1e-12)
+        self.vocab_projector = nn.Linear(config.dim, config.vocab_size)
+
+        self.apply(self.init_weights)
+        self.tie_weights()
+
+        self.mlm_loss_fct = nn.CrossEntropyLoss(ignore_index=-1)
+
+    def tie_weights_(self):
+        """
+        Tying the weights of the vocabulary projection to the base token embeddings.
+        """
+        if self.config.tie_weights:
+            self.vocab_projector.weight = self.encoder.embeddings.word_embeddings.weight
+
+    def forward(self,
+                input_ids: torch.tensor,
+                attention_mask: torch.tensor = None,
+                masked_lm_labels: torch.tensor = None):
+        """
+        Parameters
+        ----------
+        input_ids: torch.tensor(bs, seq_length)
+            Token ids.
+        attention_mask: torch.tensor(bs, seq_length)
+            Attention mask. Optional: If None, it's like there was no padding.
+        masked_lm_labels: torch.tensor(bs, seq_length)
+            The masked language modeling labels. Optional: If None, no loss is computed.
+
+        Outputs
+        -------
+        mlm_loss: torch.tensor(1,)
+            Masked Language Modeling loss to optimize. 
+            Optional: only if `masked_lm_labels` is not None
+        prediction_logits: torch.tensor(bs, seq_length, voc_size)
+            Token prediction logits
+        all_hidden_states: Tuple[torch.tensor(bs, seq_length, dim)]
+            Tuple of length n_layers with the hidden states from each layer.
+            Optional: only if `output_hidden_states`=True
+        all_attentions: Tuple[torch.tensor(bs, n_heads, seq_length, seq_length)]
+            Tuple of length n_layers with the attention weights from each layer
+            Optional: only if `output_attentions`=True
+        """
+        tfmr_output = self.encoder(input_ids=input_ids,
+                                   attention_mask=attention_mask)
+        hidden_states = tfmr_output[0]                               # (bs, seq_length, dim)
+        prediction_logits = self.vocab_transform(hidden_states)      # (bs, seq_length, dim)
+        prediction_logits = gelu(prediction_logits)                  # (bs, seq_length, dim)
+        prediction_logits = self.vocab_layer_norm(prediction_logits) # (bs, seq_length, dim)
+        prediction_logits = self.vocab_projector(prediction_logits)  # (bs, seq_length, vocab_size)
+
+        outputs = (prediction_logits, ) + tfmr_output[2:]
+        if masked_lm_labels is not None:
+            mlm_loss = self.mlm_loss_fct(prediction_logits.view(-1, prediction_logits.size(-1)),
+                                         masked_lm_labels.view(-1))
+            outputs = (mlm_loss,) + outputs     
+
+        return outputs # (mlm_loss), prediction_logits, (hidden_states), (attentions)
+
+@add_start_docstrings("""DilBert Model transformer with a sequence classification/regression head on top (a linear layer on top of
+                         the pooled output) e.g. for GLUE tasks. """,
+                      DILBERT_START_DOCSTRING, DILBERT_INPUTS_DOCSTRING)
+class DilBertForSequenceClassification(DilBertPreTrainedModel):
+    def __init__(self, config):
+        super(DilBertForSequenceClassification, self).__init__(config)
+        self.num_labels = config.num_labels
+
+        self.dilbert = DilBertModel(config)
+        self.pre_classifier = nn.Linear(config.dim, config.dim)
+        self.classifier = nn.Linear(config.dim, config.num_labels)
+        self.dropout = nn.Dropout(config.seq_classif_dropout)
+
+        self.apply(self.init_weights)
+
+    def forward(self,
+                input_ids: torch.tensor,
+                attention_mask: torch.tensor = None,
+                labels: torch.tensor = None):
+        """
+        Parameters
+        ----------
+        input_ids: torch.tensor(bs, seq_length)
+            Token ids.
+        attention_mask: torch.tensor(bs, seq_length)
+            Attention mask. Optional: If None, it's like there was no padding.
+        labels: torch.tensor(bs,)
+            Classification Labels: Optional: If None, no loss will be computed.
+        
+        Outputs
+        -------
+        loss: torch.tensor(1)
+            Sequence classification loss.
+            Optional: Is computed only if `labels` is not None.
+        logits: torch.tensor(bs, seq_length)
+            Classification (or regression if config.num_labels==1) scores
+        all_hidden_states: Tuple[torch.tensor(bs, seq_length, dim)]
+            Tuple of length n_layers with the hidden states from each layer.
+            Optional: only if `output_hidden_states`=True
+        all_attentions: Tuple[torch.tensor(bs, n_heads, seq_length, seq_length)]
+            Tuple of length n_layers with the attention weights from each layer
+            Optional: only if `output_attentions`=True        
+        """
+        dilbert_output = self.dilbert(input_ids=input_ids,
+                                      attention_mask=attention_mask)
+        pooled_output = dilbert_output[1]                    # (bs, dim)
+        pooled_output = self.pre_classifier(pooled_output)   # (bs, dim)
+        pooled_output = nn.ReLU()(pooled_output)             # (bs, dim)
+        pooled_output = self.dropout(pooled_output)         # (bs, dim)
+        logits = self.classifier(pooled_output)              # (bs, dim)
+
+        outputs = (logits,) + dilbert_output[2:]
+        if labels is not None:
+            if self.num_labels == 1:
+                loss_fct = nn.MSELoss()
+                loss = loss_fct(logits.view(-1), labels.view(-1))
+            else:
+                loss_fct = nn.CrossEntropyLoss()
+                loss = loss_fct(logits.view(-1, self.num_labels), labels.view(-1))
+            outputs = (loss,) + outputs
+
+        return outputs  # (loss), logits, (hidden_states), (attentions)
+
+@add_start_docstrings("""DilBert Model with a span classification head on top for extractive question-answering tasks like SQuAD (a linear layers on top of
+                         the hidden-states output to compute `span start logits` and `span end logits`). """,
+                      DILBERT_START_DOCSTRING, DILBERT_INPUTS_DOCSTRING)
 class DilBertForQuestionAnswering(DilBertPreTrainedModel):
     def __init__(self, config):
         super(DilBertForQuestionAnswering, self).__init__(config)
@@ -345,16 +586,51 @@ class DilBertForQuestionAnswering(DilBertPreTrainedModel):
                 attention_mask: torch.tensor = None,
                 start_positions: torch.tensor = None,
                 end_positions: torch.tensor = None):
-        _, _, hidden_states = self.dilbert(input_ids=input_ids,
-                                           attention_mask=attention_mask) # _, _, (bs, max_query_len, dim)
-        
+        """
+        Parameters
+        ----------
+        input_ids: torch.tensor(bs, seq_length)
+            Token ids.
+        attention_mask: torch.tensor(bs, seq_length)
+            Attention mask. Optional: If None, it's like there was no padding.
+        start_positions: torch,tensor(bs)
+            Labels for position (index) of the start of the labelled span for computing the token classification loss.
+            Positions are clamped to the length of the sequence (`sequence_length`).
+            Position outside of the sequence are not taken into account for computing the loss.
+            Optional: if None, no loss is computed.
+        end_positions: torch,tensor(bs)
+            Labels for position (index) of the end of the labelled span for computing the token classification loss.
+            Positions are clamped to the length of the sequence (`sequence_length`).
+            Position outside of the sequence are not taken into account for computing the loss.
+            Optional: if None, no loss is computed.
+
+        Outputs
+        -------
+        loss: torch.tensor(1)
+            Question answering loss.
+            Optional: Is computed only if `start_positions` and `end_positions` are not None.
+        start_logits: torch.tensor(bs, seq_length)
+            Span-start scores.
+        end_logits: torch.tensor(bs, seq_length)
+            Spand-end scores.
+        all_hidden_states: Tuple[torch.tensor(bs, seq_length, dim)]
+            Tuple of length n_layers with the hidden states from each layer.
+            Optional: only if `output_hidden_states`=True
+        all_attentions: Tuple[torch.tensor(bs, n_heads, seq_length, seq_length)]
+            Tuple of length n_layers with the attention weights from each layer
+            Optional: only if `output_attentions`=True
+        """
+        dilbert_output = self.dilbert(input_ids=input_ids,
+                                      attention_mask=attention_mask)
+        hidden_states = dilbert_output[0]                                 # (bs, max_query_len, dim)
+
         hidden_states = self.dropout(hidden_states)                       # (bs, max_query_len, dim)
         logits = self.qa_outputs(hidden_states)                           # (bs, max_query_len, 2)
         start_logits, end_logits = logits.split(1, dim=-1)
         start_logits = start_logits.squeeze(-1)                           # (bs, max_query_len)
         end_logits = end_logits.squeeze(-1)                               # (bs, max_query_len)
 
-        outputs = (start_logits, end_logits,) + (hidden_states,)
+        outputs = (start_logits, end_logits,) + dilbert_output[2:]
         if start_positions is not None and end_positions is not None:
             # If we are on multi-GPU, split add a dimension
             if len(start_positions.size()) > 1:
@@ -372,4 +648,4 @@ class DilBertForQuestionAnswering(DilBertPreTrainedModel):
             total_loss = (start_loss + end_loss) / 2
             outputs = (total_loss,) + outputs
 
-        return outputs  # (loss), start_logits, end_logits, hidden_states
\ No newline at end of file
+        return outputs  # (loss), start_logits, end_logits, (hidden_states), (attentions)

From 60c984da6cd99939993750c47db7fc44454c91fa Mon Sep 17 00:00:00 2001
From: VictorSanh <victorsanh@gmail.com>
Date: Tue, 27 Aug 2019 22:25:55 +0000
Subject: [PATCH 160/200] fix bugs

---
 pytorch_transformers/__init__.py         |   3 +-
 pytorch_transformers/modeling_dilbert.py | 157 ++++++++++++-----------
 2 files changed, 81 insertions(+), 79 deletions(-)

diff --git a/pytorch_transformers/__init__.py b/pytorch_transformers/__init__.py
index 78916d1ebb..e6774c96d8 100644
--- a/pytorch_transformers/__init__.py
+++ b/pytorch_transformers/__init__.py
@@ -40,7 +40,8 @@ from .modeling_xlm import (XLMConfig, XLMPreTrainedModel , XLMModel,
                            XLM_PRETRAINED_MODEL_ARCHIVE_MAP)
 from .modeling_roberta import (RobertaConfig, RobertaForMaskedLM, RobertaModel, RobertaForSequenceClassification,
                                ROBERTA_PRETRAINED_CONFIG_ARCHIVE_MAP, ROBERTA_PRETRAINED_MODEL_ARCHIVE_MAP)
-from .modeling_dilbert import (DilBertconfig, DilBertForMaskedLM, DilBertModel, DilBertForSequenceClassification,
+from .modeling_dilbert import (DilBertConfig, DilBertForMaskedLM, DilBertModel,
+                              DilBertForSequenceClassification, DilBertForQuestionAnswering,
                               DILBERT_PRETRAINED_CONFIG_ARCHIVE_MAP, DILBERT_PRETRAINED_MODEL_ARCHIVE_MAP)
 from .modeling_utils import (WEIGHTS_NAME, CONFIG_NAME, TF_WEIGHTS_NAME,
                           PretrainedConfig, PreTrainedModel, prune_layer, Conv1D)
diff --git a/pytorch_transformers/modeling_dilbert.py b/pytorch_transformers/modeling_dilbert.py
index b5d7e51b79..1fcb33e9ad 100644
--- a/pytorch_transformers/modeling_dilbert.py
+++ b/pytorch_transformers/modeling_dilbert.py
@@ -45,7 +45,7 @@ DILBERT_PRETRAINED_CONFIG_ARCHIVE_MAP = {
 }
 
 
-class DilBertconfig(PretrainedConfig):
+class DilBertConfig(PretrainedConfig):
     pretrained_config_archive_map = DILBERT_PRETRAINED_CONFIG_ARCHIVE_MAP
 
     def __init__(self,
@@ -62,7 +62,7 @@ class DilBertconfig(PretrainedConfig):
                  initializer_range=0.02,
                  tie_weights=True,
                  **kwargs):
-        super(DilBertconfig, self).__init__(**kwargs)
+        super(DilBertConfig, self).__init__(**kwargs)
 
         if isinstance(vocab_size_or_config_json_file, str) or (sys.version_info[0] == 2
                         and isinstance(vocab_size_or_config_json_file, unicode)):
@@ -77,6 +77,7 @@ class DilBertconfig(PretrainedConfig):
             self.n_layers = n_layers
             self.n_heads = n_heads
             self.dim = dim
+            self.hidden_dim = hidden_dim
             self.dropout = dropout
             self.attention_dropout = attention_dropout
             self.activation = activation
@@ -341,7 +342,7 @@ class DilBertPreTrainedModel(PreTrainedModel):
     """ An abstract class to handle weights initialization and
         a simple interface for downloading and loading pretrained models.
     """
-    config_class = DilBertconfig
+    config_class = DilBertConfig
     pretrained_model_archive_map = DILBERT_PRETRAINED_MODEL_ARCHIVE_MAP
     load_tf_weights = None
     base_model_prefix = "dilbert"
@@ -370,7 +371,7 @@ DILBERT_START_DOCSTRING = r"""
     For more information on DilBERT, you should check TODO(Victor): Link to Medium
 
     Parameters:
-        config (:class:`~pytorch_transformers.DilBertconfig`): Model configuration class with all the parameters of the model. 
+        config (:class:`~pytorch_transformers.DilBertConfig`): Model configuration class with all the parameters of the model. 
             Initializing with a config file does not load the weights associated with the model, only the configuration.
             Check out the :meth:`~pytorch_transformers.PreTrainedModel.from_pretrained` method to load the model weights.
 """
@@ -391,18 +392,7 @@ DILBERT_INPUTS_DOCSTRING = r"""
 @add_start_docstrings("The bare DilBERT encoder/transformer outputing raw hidden-states without any specific head on top.",
                       DILBERT_START_DOCSTRING, DILBERT_INPUTS_DOCSTRING)
 class DilBertModel(DilBertPreTrainedModel):
-    def __init__(self, config):
-        super(DilBertModel, self).__init__(config)
-
-        self.embeddings = Embeddings(config)   # Embeddings
-        self.transformer = Transformer(config) # Encoder
-
-        self.apply(self.init_weights)
-
-    def forward(self,
-                input_ids: torch.tensor,
-                attention_mask: torch.tensor = None):
-        """
+    r"""
         Parameters
         ----------
         input_ids: torch.tensor(bs, seq_length)
@@ -422,7 +412,18 @@ class DilBertModel(DilBertPreTrainedModel):
         all_attentions: Tuple[torch.tensor(bs, n_heads, seq_length, seq_length)]
             Tuple of length n_layers with the attention weights from each layer
             Optional: only if output_attentions=True
-        """
+    """
+    def __init__(self, config):
+        super(DilBertModel, self).__init__(config)
+
+        self.embeddings = Embeddings(config)   # Embeddings
+        self.transformer = Transformer(config) # Encoder
+
+        self.apply(self.init_weights)
+
+    def forward(self,
+                input_ids: torch.tensor,
+                attention_mask: torch.tensor = None):
         if attention_mask is None:
             attention_mask = torch.ones_like(input_ids) # (bs, seq_length)
 
@@ -438,33 +439,7 @@ class DilBertModel(DilBertPreTrainedModel):
 @add_start_docstrings("""DilBert Model with a `masked language modeling` head on top. """,
                       DILBERT_START_DOCSTRING, DILBERT_INPUTS_DOCSTRING)
 class DilBertForMaskedLM(DilBertPreTrainedModel):
-    def __init__(self, config):
-        super(DilBertForMaskedLM, self).__init__(config)
-        self.output_attentions = config.output_attentions
-        self.output_hidden_states = config.output_hidden_states
-
-        self.encoder = DilBertModel(config)
-        self.vocab_transform = nn.Linear(config.dim, config.dim)
-        self.vocab_layer_norm = nn.LayerNorm(config.dim, eps=1e-12)
-        self.vocab_projector = nn.Linear(config.dim, config.vocab_size)
-
-        self.apply(self.init_weights)
-        self.tie_weights()
-
-        self.mlm_loss_fct = nn.CrossEntropyLoss(ignore_index=-1)
-
-    def tie_weights_(self):
-        """
-        Tying the weights of the vocabulary projection to the base token embeddings.
-        """
-        if self.config.tie_weights:
-            self.vocab_projector.weight = self.encoder.embeddings.word_embeddings.weight
-
-    def forward(self,
-                input_ids: torch.tensor,
-                attention_mask: torch.tensor = None,
-                masked_lm_labels: torch.tensor = None):
-        """
+    r"""
         Parameters
         ----------
         input_ids: torch.tensor(bs, seq_length)
@@ -487,7 +462,33 @@ class DilBertForMaskedLM(DilBertPreTrainedModel):
         all_attentions: Tuple[torch.tensor(bs, n_heads, seq_length, seq_length)]
             Tuple of length n_layers with the attention weights from each layer
             Optional: only if `output_attentions`=True
+    """
+    def __init__(self, config):
+        super(DilBertForMaskedLM, self).__init__(config)
+        self.output_attentions = config.output_attentions
+        self.output_hidden_states = config.output_hidden_states
+
+        self.encoder = DilBertModel(config)
+        self.vocab_transform = nn.Linear(config.dim, config.dim)
+        self.vocab_layer_norm = nn.LayerNorm(config.dim, eps=1e-12)
+        self.vocab_projector = nn.Linear(config.dim, config.vocab_size)
+
+        self.apply(self.init_weights)
+        self.tie_weights_()
+
+        self.mlm_loss_fct = nn.CrossEntropyLoss(ignore_index=-1)
+
+    def tie_weights_(self):
         """
+        Tying the weights of the vocabulary projection to the base token embeddings.
+        """
+        if self.config.tie_weights:
+            self.vocab_projector.weight = self.encoder.embeddings.word_embeddings.weight
+
+    def forward(self,
+                input_ids: torch.tensor,
+                attention_mask: torch.tensor = None,
+                masked_lm_labels: torch.tensor = None):
         tfmr_output = self.encoder(input_ids=input_ids,
                                    attention_mask=attention_mask)
         hidden_states = tfmr_output[0]                               # (bs, seq_length, dim)
@@ -508,22 +509,7 @@ class DilBertForMaskedLM(DilBertPreTrainedModel):
                          the pooled output) e.g. for GLUE tasks. """,
                       DILBERT_START_DOCSTRING, DILBERT_INPUTS_DOCSTRING)
 class DilBertForSequenceClassification(DilBertPreTrainedModel):
-    def __init__(self, config):
-        super(DilBertForSequenceClassification, self).__init__(config)
-        self.num_labels = config.num_labels
-
-        self.dilbert = DilBertModel(config)
-        self.pre_classifier = nn.Linear(config.dim, config.dim)
-        self.classifier = nn.Linear(config.dim, config.num_labels)
-        self.dropout = nn.Dropout(config.seq_classif_dropout)
-
-        self.apply(self.init_weights)
-
-    def forward(self,
-                input_ids: torch.tensor,
-                attention_mask: torch.tensor = None,
-                labels: torch.tensor = None):
-        """
+    r"""
         Parameters
         ----------
         input_ids: torch.tensor(bs, seq_length)
@@ -546,7 +532,22 @@ class DilBertForSequenceClassification(DilBertPreTrainedModel):
         all_attentions: Tuple[torch.tensor(bs, n_heads, seq_length, seq_length)]
             Tuple of length n_layers with the attention weights from each layer
             Optional: only if `output_attentions`=True        
-        """
+    """
+    def __init__(self, config):
+        super(DilBertForSequenceClassification, self).__init__(config)
+        self.num_labels = config.num_labels
+
+        self.dilbert = DilBertModel(config)
+        self.pre_classifier = nn.Linear(config.dim, config.dim)
+        self.classifier = nn.Linear(config.dim, config.num_labels)
+        self.dropout = nn.Dropout(config.seq_classif_dropout)
+
+        self.apply(self.init_weights)
+
+    def forward(self,
+                input_ids: torch.tensor,
+                attention_mask: torch.tensor = None,
+                labels: torch.tensor = None):
         dilbert_output = self.dilbert(input_ids=input_ids,
                                       attention_mask=attention_mask)
         pooled_output = dilbert_output[1]                    # (bs, dim)
@@ -571,22 +572,7 @@ class DilBertForSequenceClassification(DilBertPreTrainedModel):
                          the hidden-states output to compute `span start logits` and `span end logits`). """,
                       DILBERT_START_DOCSTRING, DILBERT_INPUTS_DOCSTRING)
 class DilBertForQuestionAnswering(DilBertPreTrainedModel):
-    def __init__(self, config):
-        super(DilBertForQuestionAnswering, self).__init__(config)
-
-        self.dilbert = DilBertModel(config)
-        self.qa_outputs = nn.Linear(config.dim, config.num_labels)
-        assert config.num_labels == 2
-        self.dropout = nn.Dropout(config.qa_dropout)
-
-        self.apply(self.init_weights)
-        
-    def forward(self,
-                input_ids: torch.tensor,
-                attention_mask: torch.tensor = None,
-                start_positions: torch.tensor = None,
-                end_positions: torch.tensor = None):
-        """
+    r"""
         Parameters
         ----------
         input_ids: torch.tensor(bs, seq_length)
@@ -619,7 +605,22 @@ class DilBertForQuestionAnswering(DilBertPreTrainedModel):
         all_attentions: Tuple[torch.tensor(bs, n_heads, seq_length, seq_length)]
             Tuple of length n_layers with the attention weights from each layer
             Optional: only if `output_attentions`=True
-        """
+    """
+    def __init__(self, config):
+        super(DilBertForQuestionAnswering, self).__init__(config)
+
+        self.dilbert = DilBertModel(config)
+        self.qa_outputs = nn.Linear(config.dim, config.num_labels)
+        assert config.num_labels == 2
+        self.dropout = nn.Dropout(config.qa_dropout)
+
+        self.apply(self.init_weights)
+        
+    def forward(self,
+                input_ids: torch.tensor,
+                attention_mask: torch.tensor = None,
+                start_positions: torch.tensor = None,
+                end_positions: torch.tensor = None):
         dilbert_output = self.dilbert(input_ids=input_ids,
                                       attention_mask=attention_mask)
         hidden_states = dilbert_output[0]                                 # (bs, max_query_len, dim)

From a8ad83040da46e9ab77db1e68554573ffdc6fd98 Mon Sep 17 00:00:00 2001
From: VictorSanh <victorsanh@gmail.com>
Date: Wed, 28 Aug 2019 00:45:33 +0000
Subject: [PATCH 161/200] fix bugs

---
 pytorch_transformers/modeling_dilbert.py | 26 +++++++++++++-----------
 1 file changed, 14 insertions(+), 12 deletions(-)

diff --git a/pytorch_transformers/modeling_dilbert.py b/pytorch_transformers/modeling_dilbert.py
index 1fcb33e9ad..cda8da8583 100644
--- a/pytorch_transformers/modeling_dilbert.py
+++ b/pytorch_transformers/modeling_dilbert.py
@@ -60,7 +60,7 @@ class DilBertConfig(PretrainedConfig):
                  attention_dropout=0.1,
                  activation='gelu',
                  initializer_range=0.02,
-                 tie_weights=True,
+                 tie_weights_=True,
                  **kwargs):
         super(DilBertConfig, self).__init__(**kwargs)
 
@@ -82,7 +82,7 @@ class DilBertConfig(PretrainedConfig):
             self.attention_dropout = attention_dropout
             self.activation = activation
             self.initializer_range = initializer_range
-            self.tie_weights = tie_weights
+            self.tie_weights_ = tie_weights_
         else:
             raise ValueError("First argument must be either a vocabulary size (int)"
                              "or the path to a pretrained model config file (str)")
@@ -274,13 +274,15 @@ class TransformerBlock(nn.Module):
         sa_output = self.attention(query=x, key=x, value=x, mask=attn_mask)
         if self.output_attentions:
             sa_output, sa_weights = sa_output                  # (bs, seq_length, dim), (bs, n_heads, seq_length, seq_length)
+        else:
+            sa_output = sa_output[0]
         sa_output = self.sa_layer_norm(sa_output + x)          # (bs, seq_length, dim)
 
         # Feed Forward Network
         ffn_output = self.ffn(sa_output)                             # (bs, seq_length, dim)
         ffn_output = self.output_layer_norm(ffn_output + sa_output)  # (bs, seq_length, dim)
 
-        output = (ffn_output)
+        output = (ffn_output,)
         if self.output_attentions:
             output = (sa_weights,) + output
         return output
@@ -468,36 +470,36 @@ class DilBertForMaskedLM(DilBertPreTrainedModel):
         self.output_attentions = config.output_attentions
         self.output_hidden_states = config.output_hidden_states
 
-        self.encoder = DilBertModel(config)
+        self.dilbert = DilBertModel(config)
         self.vocab_transform = nn.Linear(config.dim, config.dim)
         self.vocab_layer_norm = nn.LayerNorm(config.dim, eps=1e-12)
         self.vocab_projector = nn.Linear(config.dim, config.vocab_size)
 
         self.apply(self.init_weights)
-        self.tie_weights_()
+        self.tie_weights()
 
         self.mlm_loss_fct = nn.CrossEntropyLoss(ignore_index=-1)
 
-    def tie_weights_(self):
+    def tie_weights(self):
         """
         Tying the weights of the vocabulary projection to the base token embeddings.
         """
-        if self.config.tie_weights:
-            self.vocab_projector.weight = self.encoder.embeddings.word_embeddings.weight
+        if self.config.tie_weights_:
+            self.vocab_projector.weight = self.dilbert.embeddings.word_embeddings.weight
 
     def forward(self,
                 input_ids: torch.tensor,
                 attention_mask: torch.tensor = None,
                 masked_lm_labels: torch.tensor = None):
-        tfmr_output = self.encoder(input_ids=input_ids,
-                                   attention_mask=attention_mask)
-        hidden_states = tfmr_output[0]                               # (bs, seq_length, dim)
+        dlbrt_output = self.dilbert(input_ids=input_ids,
+                                    attention_mask=attention_mask)
+        hidden_states = dlbrt_output[0]                              # (bs, seq_length, dim)
         prediction_logits = self.vocab_transform(hidden_states)      # (bs, seq_length, dim)
         prediction_logits = gelu(prediction_logits)                  # (bs, seq_length, dim)
         prediction_logits = self.vocab_layer_norm(prediction_logits) # (bs, seq_length, dim)
         prediction_logits = self.vocab_projector(prediction_logits)  # (bs, seq_length, vocab_size)
 
-        outputs = (prediction_logits, ) + tfmr_output[2:]
+        outputs = (prediction_logits, ) + dlbrt_output[2:]
         if masked_lm_labels is not None:
             mlm_loss = self.mlm_loss_fct(prediction_logits.view(-1, prediction_logits.size(-1)),
                                          masked_lm_labels.view(-1))

From 5d29f8e99bc9d2a5c84265a7ed26cedb0d500804 Mon Sep 17 00:00:00 2001
From: VictorSanh <victorsanh@gmail.com>
Date: Wed, 28 Aug 2019 00:57:16 +0000
Subject: [PATCH 162/200] fix bugs

---
 pytorch_transformers/modeling_dilbert.py | 6 +++++-
 1 file changed, 5 insertions(+), 1 deletion(-)

diff --git a/pytorch_transformers/modeling_dilbert.py b/pytorch_transformers/modeling_dilbert.py
index cda8da8583..e842b31d8f 100644
--- a/pytorch_transformers/modeling_dilbert.py
+++ b/pytorch_transformers/modeling_dilbert.py
@@ -274,7 +274,8 @@ class TransformerBlock(nn.Module):
         sa_output = self.attention(query=x, key=x, value=x, mask=attn_mask)
         if self.output_attentions:
             sa_output, sa_weights = sa_output                  # (bs, seq_length, dim), (bs, n_heads, seq_length, seq_length)
-        else:
+        else: # To handle these `output_attention` or `output_hidden_states` cases returning tuples
+            assert type(sa_output) == tuple
             sa_output = sa_output[0]
         sa_output = self.sa_layer_norm(sa_output + x)          # (bs, seq_length, dim)
 
@@ -329,6 +330,9 @@ class Transformer(nn.Module):
             if self.output_attentions:
                 attentions, hidden_state = hidden_state
                 all_attentions = all_attentions + (attentions,)
+            else: # To handle these `output_attention` or `output_hidden_states` cases returning tuples
+                assert type(hidden_state) == tuple
+                hidden_state = hidden_state[0]
             all_hidden_states = all_hidden_states + (hidden_state,)
 
         outputs = (hidden_state,)

From 1ae81e4aa1868eb24d975ebff4a7241ed10975fc Mon Sep 17 00:00:00 2001
From: VictorSanh <victorsanh@gmail.com>
Date: Wed, 28 Aug 2019 01:10:05 +0000
Subject: [PATCH 163/200] add dataset. distiller, utils

---
 examples/distillation/dataset.py   | 184 ++++++++++++
 examples/distillation/distiller.py | 431 +++++++++++++++++++++++++++++
 examples/distillation/utils.py     | 112 ++++++++
 3 files changed, 727 insertions(+)
 create mode 100644 examples/distillation/dataset.py
 create mode 100644 examples/distillation/distiller.py
 create mode 100644 examples/distillation/utils.py

diff --git a/examples/distillation/dataset.py b/examples/distillation/dataset.py
new file mode 100644
index 0000000000..6256ce1144
--- /dev/null
+++ b/examples/distillation/dataset.py
@@ -0,0 +1,184 @@
+from typing import List
+import math
+from itertools import chain
+from collections import Counter
+import numpy as np
+import torch
+
+from utils import logger
+
+class Dataset:
+    def __init__(self,
+                 params,
+                 data):
+        self.params = params
+        self.tokens_per_batch = params.tokens_per_batch
+        self.batch_size = params.batch_size
+        self.shuffle = params.shuffle
+        self.group_by_size = params.group_by_size
+
+        self.token_ids = np.array(data)
+        self.lengths = np.uint16([len(t) for t in data])
+
+        self.check()
+        self.remove_long_sequences()
+        self.remove_empty_sequences()
+        self.check()
+        self.print_statistics()
+
+    def __len__(self):
+        return len(self.lengths)
+
+    def check(self):
+        """
+        Some sanity checks
+        """
+        assert len(self.token_ids) == len(self.lengths)
+
+    def remove_long_sequences(self):
+        """
+        Sequences that are too long are splitted by chunk of max_position_embeddings.
+        """
+        indices = self.lengths >= self.params.max_position_embeddings
+        logger.info(f'Splitting {sum(indices)} too long sequences.')
+
+        def divide_chunks(l, n):
+            return [l[i:i + n] for i in range(0, len(l), n)]
+
+        new_tok_ids = []
+        new_lengths = []
+        cls_id, sep_id = self.params.special_tok_ids['cls_token'], self.params.special_tok_ids['sep_token']
+        max_len = self.params.max_position_embeddings
+
+        for seq_, len_ in zip(self.token_ids, self.lengths):
+            if len_ <= max_len:
+                new_tok_ids.append(seq_)
+                new_lengths.append(len_)
+            else:
+                sub_seqs = []
+                for sub_s in divide_chunks(seq_, max_len-2):
+                    if sub_s[0] != cls_id:
+                        sub_s = np.insert(sub_s, 0, cls_id)
+                    if sub_s[-1] != sep_id:
+                        sub_s = np.insert(sub_s, len(sub_s), cls_id)
+                    assert len(sub_s) <= max_len
+                    sub_seqs.append(sub_s)
+
+                new_tok_ids.extend(sub_seqs)
+                new_lengths.extend([len(l) for l in sub_seqs])
+
+        self.token_ids = np.array(new_tok_ids)
+        self.lengths = np.array(new_lengths)
+
+    def remove_empty_sequences(self):
+        """
+        Too short sequences are simply removed. This could be tunedd.
+        """
+        init_size = len(self)
+        indices = self.lengths > 5
+        self.token_ids = self.token_ids[indices]
+        self.lengths = self.lengths[indices]
+        new_size = len(self)
+        logger.info(f'Remove {init_size - new_size} too short (<=5 tokens) sequences.')
+
+    def print_statistics(self):
+        """
+        Print some statistics on the corpus. Only the master process.
+        """
+        if not self.params.is_master:
+            return
+        logger.info(f'{len(self)} sequences')
+        # data_len = sum(self.lengths)
+        # nb_unique_tokens = len(Counter(list(chain(*self.token_ids))))
+        # logger.info(f'{data_len} tokens ({nb_unique_tokens} unique)')
+
+        # unk_idx = self.params.special_tok_ids['unk_token']
+        # nb_unkown = sum([(t==unk_idx).sum() for t in self.token_ids])
+        # logger.info(f'{nb_unkown} unknown tokens (covering {100*nb_unkown/data_len:.2f}% of the data)')
+
+    def select_data(self, a: int, b: int):
+        """
+        Select a subportion of the data.
+        """
+        n_sequences = len(self)
+        assert 0 <= a < b <= n_sequences, ValueError(f'`0 <= a < b <= n_sequences` is not met with a={a} and b={b}')
+
+        logger.info(f'Selecting sequences from {a} to {b} (excluded).')
+        self.token_ids = self.token_ids[a:b]
+        self.lengths = self.lengths[a:b]
+
+        self.check()
+
+    def split(self):
+        """
+        Distributed training: split the data accross the processes.
+        """
+        assert self.params.n_gpu > 1
+        logger.info('Splitting the data accross the processuses.')
+        n_seq = len(self)
+        n_seq_per_procesus = n_seq // self.params.world_size
+        a = n_seq_per_procesus * self.params.global_rank
+        b = a + n_seq_per_procesus
+        self.select_data(a=a, b=b)
+
+    def batch_sequences(self,
+                        token_ids: List[List[int]],
+                        lengths: List[int]):
+        """
+        Do the padding and transform into torch.tensor.
+        """
+        assert len(token_ids) == len(lengths)
+
+        # Max for paddings
+        max_seq_len_ = max(lengths)
+
+        # Pad token ids
+        pad_idx = self.params.special_tok_ids['pad_token']
+        tk_ = [list(t.astype(int)) + [pad_idx]*(max_seq_len_-len(t)) for t in token_ids]
+        assert len(tk_) == len(token_ids)
+        assert all(len(t) == max_seq_len_ for t in tk_)
+
+        tk_t = torch.tensor(tk_)                  # (bs, max_seq_len_)
+        lg_t = torch.tensor(lengths.astype(int))  # (bs)
+        return tk_t, lg_t
+
+    def get_batches_iterator(self,
+                             batches):
+        """
+        Return an iterator over batches.
+        """
+        for sequences_ids in batches:
+            token_ids, lengths = self.batch_sequences(self.token_ids[sequences_ids],
+                                                    self.lengths[sequences_ids])
+            yield (token_ids, lengths)
+
+    def get_iterator(self,
+                     seed: int = None):
+        """
+        Return a data iterator.
+        """
+        rng = np.random.RandomState(seed)
+
+        n_sequences = len(self)
+        indices = np.arange(n_sequences)
+
+        if self.group_by_size:
+            indices = indices[np.argsort(self.lengths[indices], kind='mergesort')]
+
+        if self.tokens_per_batch == -1:
+            batches = np.array_split(indices, math.ceil(len(indices) * 1. / self.batch_size))
+        else:
+            assert self.tokens_per_batch > 0
+            batch_ids = np.cumsum(self.lengths[indices]) // self.tokens_per_batch
+            _, bounds = np.unique(batch_ids, return_index=True)
+            batches = [indices[bounds[i]:bounds[i + 1]] for i in range(len(bounds) - 1)]
+            if bounds[-1] < len(indices):
+                batches.append(indices[bounds[-1]:])
+
+        if self.shuffle:
+            rng.shuffle(batches)
+
+        assert n_sequences == sum([len(x) for x in batches])
+        assert self.lengths[indices].sum() == sum([self.lengths[x].sum() for x in batches])
+
+        return self.get_batches_iterator(batches=batches)
diff --git a/examples/distillation/distiller.py b/examples/distillation/distiller.py
new file mode 100644
index 0000000000..c9c4458abc
--- /dev/null
+++ b/examples/distillation/distiller.py
@@ -0,0 +1,431 @@
+import os
+import math
+from tensorboardX import SummaryWriter
+from tqdm import trange, tqdm
+import numpy as np
+
+import torch
+import torch.nn as nn
+import torch.nn.functional as F
+
+from pytorch_transformers import AdamW, WarmupLinearSchedule
+
+from utils import logger
+from dataset import Dataset
+
+class Distiller:
+    def __init__(self,
+                 params: dict,
+                 dataloader: Dataset,
+                 token_probs: torch.tensor,
+                 student: nn.Module,
+                 teacher: nn.Module):
+        logger.info('Initializing Distiller')
+        self.params = params
+        self.dump_path = params.dump_path
+        self.multi_gpu = params.multi_gpu
+        self.fp16 = params.fp16
+
+        self.student = student
+        self.teacher = teacher
+
+        self.dataloader = dataloader
+        if self.params.n_gpu > 1:
+            self.dataloader.split()
+        self.get_iterator(seed=params.seed)
+
+        self.temperature = params.temperature
+        assert self.temperature > 0.
+
+        self.alpha_ce = params.alpha_ce
+        self.alpha_mlm = params.alpha_mlm
+        self.alpha_mse = params.alpha_mse
+        assert self.alpha_ce >= 0.
+        assert self.alpha_mlm >= 0.
+        assert self.alpha_mse >= 0.
+        assert self.alpha_ce + self.alpha_mlm + self.alpha_mse > 0.
+
+        self.mlm_mask_prop = params.mlm_mask_prop
+        assert 0.0 <= self.mlm_mask_prop <= 1.0
+        assert params.word_mask + params.word_keep + params.word_rand == 1.0
+        self.pred_probs = torch.FloatTensor([params.word_mask, params.word_keep, params.word_rand])
+        self.pred_probs = self.pred_probs.to(f'cuda:{params.local_rank}') if params.n_gpu > 0 else self.pred_probs
+        self.token_probs = token_probs.to(f'cuda:{params.local_rank}') if params.n_gpu > 0 else token_probs
+        if self.fp16:
+            self.pred_probs = self.pred_probs.half()
+            self.token_probs = self.token_probs.half()
+
+        self.epoch = 0
+        self.n_iter = 0
+        self.n_total_iter = 0
+        self.n_sequences_epoch = 0
+        self.total_loss_epoch = 0
+        self.last_loss = 0
+        self.last_loss_ce = 0
+        self.last_loss_mlm = 0
+        self.last_loss_mse = 0
+
+        self.ce_loss_fct = nn.KLDivLoss(reduction='batchmean')
+        self.mlm_loss_fct = nn.CrossEntropyLoss(ignore_index=-1)
+        self.mse_loss_fct = nn.MSELoss(reduction='sum')
+
+        logger.info('--- Initializing model optimizer')
+        assert params.gradient_accumulation_steps >= 1
+        self.num_steps_epoch = int(len(self.dataloader) / params.batch_size) + 1
+        num_train_optimization_steps = int(self.num_steps_epoch / params.gradient_accumulation_steps * params.n_epoch) + 1
+        warmup_steps = math.ceil(num_train_optimization_steps * params.warmup_prop)
+
+        no_decay = ['bias', 'LayerNorm.weight']
+        optimizer_grouped_parameters = [
+            {'params': [p for n, p in student.named_parameters() if not any(nd in n for nd in no_decay) and p.requires_grad], 'weight_decay': params.weight_decay},
+            {'params': [p for n, p in student.named_parameters() if any(nd in n for nd in no_decay) and p.requires_grad], 'weight_decay': 0.0}
+        ]
+        logger.info("------ Number of trainable parameters (student): %i" % sum([p.numel() for p in self.student.parameters() if p.requires_grad]))
+        logger.info("------ Number of parameters (student): %i" % sum([p.numel() for p in self.student.parameters()]))
+        self.optimizer = AdamW(optimizer_grouped_parameters,
+                               lr=params.learning_rate,
+                               eps=params.adam_epsilon,
+                               betas=(0.9, 0.98))
+        self.scheduler = WarmupLinearSchedule(self.optimizer,
+                                              warmup_steps=warmup_steps,
+                                              t_total=num_train_optimization_steps)
+
+        if self.fp16:
+            try:
+                from apex import amp
+            except ImportError:
+                raise ImportError("Please install apex from https://www.github.com/nvidia/apex to use fp16 training.")
+            logger.info(f"Using fp16 training: {self.params.fp16_opt_level} level")
+            self.student, self.optimizer = amp.initialize(self.student,
+                                                          self.optimizer,
+                                                          opt_level=self.params.fp16_opt_level)
+            self.teacher = self.teacher.half()
+
+        if self.multi_gpu:
+            if self.fp16:
+                from apex.parallel import DistributedDataParallel
+                logger.info("Using apex.parallel.DistributedDataParallel for distributed training.")
+                self.student = DistributedDataParallel(self.student)
+            else:
+                from torch.nn.parallel import DistributedDataParallel
+                logger.info("Using nn.parallel.DistributedDataParallel for distributed training.")
+                self.student = DistributedDataParallel(self.student,
+                                                       device_ids=[params.local_rank],
+                                                       output_device=params.local_rank)
+
+        self.is_master = params.is_master
+        if self.is_master:
+            logger.info('--- Initializing Tensorboard')
+            self.tensorboard = SummaryWriter(log_dir=os.path.join(self.dump_path, 'log', 'train'))
+            self.tensorboard.add_text(tag='config', text_string=str(self.params), global_step=0)
+
+    def get_iterator(self,
+                     seed: int = None):
+        """
+        Initialize the data iterator.
+        Each process has its own data iterator (iterating on his own random portion of the dataset).
+
+        Input:
+        ------
+            seed: `int` - The random seed.
+        """
+        logger.info('--- Initializing Data Iterator')
+        self.data_iterator = self.dataloader.get_iterator(seed=seed)
+
+    def get_batch(self):
+        """
+        Call the data iterator to output a new batch.
+        If the data iterator went through the whole dataset, create a new iterator.
+        """
+        assert hasattr(self, 'data_iterator')
+        try:
+            x = next(self.data_iterator)
+        except StopIteration:
+            logger.warning('--- Went through the whole dataset. Creating new data iterator.')
+            self.data_iterator = self.dataloader.get_iterator()
+            x = next(self.data_iterator)
+        return x
+
+    def prepare_batch(self,
+                      batch):
+        """
+        Prepare the batch: from the token_ids and the lenghts, compute the attention mask and the masked label for MLM.
+
+        Input:
+        ------
+            batch: `Tuple`
+                token_ids: `torch.tensor(bs, seq_length)` - The token ids for each of the sequence. It is padded.
+                lengths: `torch.tensor(bs)` - The lengths of each of the sequences in the batch.
+
+        Output:
+        -------
+            token_ids: `torch.tensor(bs, seq_length)` - The token ids after the modifications for MLM.
+            attn_mask: `torch.tensor(bs, seq_length)` - The attention mask for the self-attention.
+            mlm_labels: `torch.tensor(bs, seq_length)` - The masked languge modeling labels. There is a -1 where there is nothing to predict.
+        """
+        token_ids, lengths = batch
+        token_ids, lengths = self.round_batch(x=token_ids, lengths=lengths)
+        assert token_ids.size(0) == lengths.size(0)
+
+        attn_mask = (torch.arange(token_ids.size(1), dtype=torch.long, device=lengths.device) < lengths[:, None])
+
+        bs, max_seq_len = token_ids.size()
+        mlm_labels = token_ids.new(token_ids.size()).copy_(token_ids)
+
+        x_prob = self.token_probs[token_ids.flatten()]
+        n_tgt = math.ceil(self.mlm_mask_prop * lengths.sum().item())
+        tgt_ids = torch.multinomial(x_prob / x_prob.sum(), n_tgt, replacement=False)
+        pred_mask = torch.zeros(bs * max_seq_len, dtype=torch.uint8, device=token_ids.device)
+        pred_mask[tgt_ids] = 1
+        pred_mask = pred_mask.view(bs, max_seq_len)
+
+        pred_mask[token_ids == self.params.special_tok_ids['pad_token']] = 0
+
+        # mask a number of words == 0 [8] (faster with fp16)
+        if self.fp16:
+            n1 = pred_mask.sum().item()
+            if n1 > 8:
+                pred_mask = pred_mask.view(-1)
+                n2 = max(n1 % 8, 8 * (n1 // 8))
+                if n2 != n1:
+                    pred_mask[torch.nonzero(pred_mask).view(-1)[:n1-n2]] = 0
+                pred_mask = pred_mask.view(bs, max_seq_len)
+                assert pred_mask.sum().item() % 8 == 0, pred_mask.sum().item()
+
+        _token_ids_real = token_ids[pred_mask]
+        _token_ids_rand = _token_ids_real.clone().random_(self.params.vocab_size)
+        _token_ids_mask = _token_ids_real.clone().fill_(self.params.special_tok_ids['mask_token'])
+        probs = torch.multinomial(self.pred_probs, len(_token_ids_real), replacement=True)
+        _token_ids = _token_ids_mask * (probs == 0).long() + _token_ids_real * (probs == 1).long() + _token_ids_rand * (probs == 2).long()
+        token_ids = token_ids.masked_scatter(pred_mask, _token_ids)
+
+        mlm_labels[1-pred_mask] = -1
+
+        return token_ids, attn_mask, mlm_labels
+
+    def round_batch(self,
+                    x: torch.tensor,
+                    lengths: torch.tensor):
+        """
+        For float16 only.
+        Sub-sample sentences in a batch, and add padding, so that each dimension is a multiple of 8.
+
+        Input:
+        ------
+            x: `torch.tensor(bs, seq_length)` - The token ids.
+            lengths: `torch.tensor(bs, seq_length)` - The lengths of each of the sequence in the batch.
+
+        Output:
+        -------
+            x:  `torch.tensor(new_bs, new_seq_length)` - The updated token ids.
+            lengths: `torch.tensor(new_bs, new_seq_length)` - The updated lengths.
+        """
+        if not self.fp16 or len(lengths) < 8:
+            return x, lengths
+
+        # number of sentences == 0 [8]
+        bs1 = len(lengths)
+        bs2 = 8 * (bs1 // 8)
+        assert bs2 > 0 and bs2 % 8 == 0
+        if bs1 != bs2:
+            idx = torch.randperm(bs1)[:bs2]
+            lengths = lengths[idx]
+            slen = lengths.max().item()
+            x = x[idx, :slen]
+        else:
+            idx = None
+
+        # sequence length == 0 [8]
+        ml1 = x.size(1)
+        if ml1 % 8 != 0:
+            pad = 8 - (ml1 % 8)
+            ml2 = ml1 + pad
+            pad_id = self.params.special_tok_ids['pad_token']
+            padding_tensor = torch.zeros(bs2, pad, dtype=torch.long, device=x.device).fill_(pad_id)
+            x = torch.cat([x, padding_tensor], 1)
+            assert x.size() == (bs2, ml2)
+
+        assert x.size(0) % 8 == 0
+        assert x.size(1) % 8 == 0
+        return x, lengths
+
+    def train(self):
+        """
+        The real training loop.
+        """
+        if self.is_master: logger.info('Starting training')
+        self.student.train()
+        self.teacher.eval()
+
+        for _ in range(self.params.n_epoch):
+            if self.is_master: logger.info(f'--- Starting epoch {self.epoch}/{self.params.n_epoch-1}')
+
+            iter_bar = trange(self.num_steps_epoch, desc="-Iter", disable=self.params.local_rank not in [-1, 0])
+            for __ in range(self.num_steps_epoch):
+                batch = self.get_batch()
+                if self.params.n_gpu > 0:
+                    batch = tuple(t.to(f'cuda:{self.params.local_rank}') for t in batch)
+                token_ids, attn_mask, mlm_labels = self.prepare_batch(batch=batch)
+
+                self.step(input_ids=token_ids, attention_mask=attn_mask, mlm_labels=mlm_labels)
+
+                iter_bar.update()
+                iter_bar.set_postfix({'Last_loss': f'{self.last_loss:.2f}',
+                                      'Avg_cum_loss': f'{self.total_loss_epoch/self.n_iter:.2f}'})
+            iter_bar.close()
+
+            if self.is_master: logger.info(f'--- Ending epoch {self.epoch}/{self.params.n_epoch-1}')
+            self.end_epoch()
+
+        if self.is_master: logger.info('Training is finished')
+
+    def step(self,
+             input_ids: torch.tensor,
+             attention_mask: torch.tensor,
+             mlm_labels: torch.tensor):
+        """
+        One optimization step: forward of student AND teacher, backward on the loss (for gradient accumulation),
+        and possibly a parameter update (depending on the gradient accumulation).
+
+        Input:
+        ------
+        input_ids: `torch.tensor(bs, seq_length)` - The token ids.
+        attention_mask: `torch.tensor(bs, seq_length)` - The attention mask for self attention.
+        mlm_labels: `torch.tensor(bs, seq_length)` - The masked language modeling labels.
+        """
+        s_logits = self.student(input_ids=input_ids, attention_mask=attention_mask)[0]     # (bs, seq_length, voc_size)
+        with torch.no_grad():
+            t_logits = self.teacher(input_ids=input_ids, attention_mask=attention_mask)[0] # (bs, seq_length, voc_size)
+        assert s_logits.size() == t_logits.size()
+
+        #https://github.com/peterliht/knowledge-distillation-pytorch/blob/master/model/net.py#L100
+        #https://github.com/peterliht/knowledge-distillation-pytorch/issues/2
+        if self.params.restrict_ce_to_mask:
+            mask = (mlm_labels>-1).unsqueeze(-1).expand_as(s_logits)   # (bs, seq_lenth, voc_size)
+        else:
+            mask = attention_mask.unsqueeze(-1).expand_as(s_logits)    # (bs, seq_lenth, voc_size)
+        s_logits_slct = torch.masked_select(s_logits, mask)            # (bs * seq_length * voc_size) modulo the 1s in mask
+        s_logits_slct = s_logits_slct.view(-1, s_logits.size(-1))      # (bs * seq_length, voc_size) modulo the 1s in mask
+        t_logits_slct = torch.masked_select(t_logits, mask)            # (bs * seq_length * voc_size) modulo the 1s in mask
+        t_logits_slct = t_logits_slct.view(-1, s_logits.size(-1))      # (bs * seq_length, voc_size) modulo the 1s in mask
+        assert t_logits_slct.size() == s_logits_slct.size()
+
+        loss_ce = self.ce_loss_fct(F.log_softmax(s_logits_slct/self.temperature, dim=-1),
+                                   F.softmax(t_logits_slct/self.temperature, dim=-1)) * (self.temperature)**2
+        loss = self.alpha_ce*loss_ce
+        if self.alpha_mlm > 0.:
+            loss_mlm = self.mlm_loss_fct(s_logits.view(-1, s_logits.size(-1)), mlm_labels.view(-1))
+            loss += self.alpha_mlm * loss_mlm
+        if self.alpha_mse > 0.:
+            loss_mse = self.mse_loss_fct(s_logits_slct, t_logits_slct)/s_logits_slct.size(0) # Reproducing batchmean reduction
+            loss += self.alpha_mse * loss_mse
+
+        self.total_loss_epoch += loss.item()
+        self.last_loss = loss.item()
+        self.last_loss_ce = loss_ce.item()
+        if self.alpha_mlm > 0.:
+            self.last_loss_mlm = loss_mlm.item()
+        if self.alpha_mse > 0.:
+            self.last_loss_mse = loss_mse.item()
+
+        self.optimize(loss)
+
+        self.n_sequences_epoch += input_ids.size(0)
+
+    def optimize(self,
+                 loss):
+        """
+        Normalization on the loss (gradient accumulation or distributed training), followed by
+        backward pass on the loss, possibly followed by a parameter update (depending on the gradient accumulation).
+        Also update the metrics for tensorboard.
+        """
+        # Check for NaN
+        if (loss != loss).data.any():
+            logger.error('NaN detected')
+            exit()
+
+        if self.multi_gpu:
+            loss = loss.mean()
+        if self.params.gradient_accumulation_steps > 1:
+            loss = loss / self.params.gradient_accumulation_steps
+
+        if self.fp16:
+            from apex import amp
+            with amp.scale_loss(loss, self.optimizer) as scaled_loss:
+                scaled_loss.backward()
+        else:
+            loss.backward()
+
+        self.iter()
+        if self.n_iter % self.params.gradient_accumulation_steps == 0:
+            if self.fp16:
+                torch.nn.utils.clip_grad_norm_(amp.master_params(self.optimizer), self.params.max_grad_norm)
+            else:
+                torch.nn.utils.clip_grad_norm_(self.student.parameters(), self.params.max_grad_norm)
+            self.scheduler.step()
+            self.optimizer.step()
+            self.optimizer.zero_grad()
+
+    def iter(self):
+        """
+        Update global counts, write to tensorboard and save checkpoint.
+        """
+        self.n_iter += 1
+        self.n_total_iter += 1
+
+        if self.n_total_iter % self.params.log_interval == 0:
+            self.log_tensorboard()
+        if self.n_total_iter % self.params.checkpoint_interval == 0:
+            self.save_checkpoint()
+
+    def log_tensorboard(self):
+        """
+        Log into tensorboard. Only by the master process.
+        """
+        if not self.is_master:
+            return
+
+        for param_name, param in self.student.named_parameters():
+            self.tensorboard.add_scalar(tag='parameter_mean/' + param_name, scalar_value=param.data.mean(), global_step=self.n_total_iter)
+            self.tensorboard.add_scalar(tag='parameter_std/' + param_name, scalar_value=param.data.std(), global_step=self.n_total_iter)
+            if param.grad is None:
+                continue
+            self.tensorboard.add_scalar(tag="grad_mean/" + param_name, scalar_value=param.grad.data.mean(),global_step=self.n_total_iter)
+            self.tensorboard.add_scalar(tag="grad_std/" + param_name, scalar_value=param.grad.data.std(), global_step=self.n_total_iter)
+
+        self.tensorboard.add_scalar(tag="losses/cum_avg_loss_epoch", scalar_value=self.total_loss_epoch/self.n_iter, global_step=self.n_total_iter)
+        self.tensorboard.add_scalar(tag="losses/loss", scalar_value=self.last_loss, global_step=self.n_total_iter)
+        self.tensorboard.add_scalar(tag="losses/loss_ce", scalar_value=self.last_loss_ce, global_step=self.n_total_iter)
+        if self.alpha_mlm > 0.:
+            self.tensorboard.add_scalar(tag="losses/loss_mlm", scalar_value=self.last_loss_mlm, global_step=self.n_total_iter)
+        if self.alpha_mse > 0.:
+            self.tensorboard.add_scalar(tag="losses/loss_mse", scalar_value=self.last_loss_mse, global_step=self.n_total_iter)
+        self.tensorboard.add_scalar(tag="learning_rate/lr", scalar_value=self.scheduler.get_lr()[0], global_step=self.n_total_iter)
+
+    def end_epoch(self):
+        """
+        Finally arrived at the end of epoch (full pass on dataset).
+        Do some tensorboard logging and checkpoint saving.
+        """
+        logger.info(f'{self.n_sequences_epoch} sequences have been trained during this epoch.')
+
+        if self.is_master:
+            self.save_checkpoint(checkpoint_name=f'model_epoch_{self.epoch}.pth')
+            self.tensorboard.add_scalar(tag='epoch/loss', scalar_value=self.total_loss_epoch/self.n_iter, global_step=self.epoch)
+
+        self.epoch += 1
+        self.n_sequences_epoch = 0
+        self.n_iter = 0
+        self.total_loss_epoch = 0
+
+    def save_checkpoint(self,
+                        checkpoint_name: str = 'checkpoint.pth'):
+        """
+        Save the current state. Only by the master process.
+        """
+        if not self.is_master:
+            return
+        mdl_to_save = self.student.module if hasattr(self.student, 'module') else self.student
+        mdl_to_save.config.save_pretrained(self.dump_path)
+        state_dict = mdl_to_save.state_dict()
+        torch.save(state_dict, os.path.join(self.dump_path, checkpoint_name))
diff --git a/examples/distillation/utils.py b/examples/distillation/utils.py
new file mode 100644
index 0000000000..b3a9f15891
--- /dev/null
+++ b/examples/distillation/utils.py
@@ -0,0 +1,112 @@
+import git
+import json
+import os
+import socket
+import torch
+import numpy as np
+
+import logging
+logging.basicConfig(format = '%(asctime)s - %(levelname)s - %(name)s - PID: %(process)d -  %(message)s',
+                    datefmt = '%m/%d/%Y %H:%M:%S',
+                    level = logging.INFO)
+logger = logging.getLogger(__name__)
+
+
+def git_log(folder_path: str):
+    """
+    Log commit info.
+    """
+    repo = git.Repo(search_parent_directories=True)
+    repo_infos = {
+        'repo_id': str(repo),
+        'repo_sha': str(repo.head.object.hexsha),
+        'repo_branch': str(repo.active_branch)
+    }
+
+    with open(os.path.join(folder_path, 'git_log.json'), 'w') as f:
+        json.dump(repo_infos, f, indent=4)
+
+
+def init_gpu_params(params):
+    """
+    Handle single and multi-GPU / multi-node.
+    """
+    if params.n_gpu <= 0:
+        params.local_rank = 0
+        params.master_port = -1
+        params.is_master = True
+        params.multi_gpu = False
+        return
+
+    assert torch.cuda.is_available()
+
+    logger.info('Initializing GPUs')
+    if params.n_gpu > 1:
+        assert params.local_rank != -1
+
+        params.world_size = int(os.environ['WORLD_SIZE'])
+        params.n_gpu_per_node = int(os.environ['N_GPU_NODE'])
+        params.global_rank = int(os.environ['RANK'])
+
+        # number of nodes / node ID
+        params.n_nodes = params.world_size // params.n_gpu_per_node
+        params.node_id = params.global_rank // params.n_gpu_per_node
+        params.multi_gpu = True
+
+        assert params.n_nodes == int(os.environ['N_NODES'])
+        assert params.node_id == int(os.environ['NODE_RANK'])
+
+    # local job (single GPU)
+    else:
+        assert params.local_rank == -1
+
+        params.n_nodes = 1
+        params.node_id = 0
+        params.local_rank = 0
+        params.global_rank = 0
+        params.world_size = 1
+        params.n_gpu_per_node = 1
+        params.multi_gpu = False
+
+    # sanity checks
+    assert params.n_nodes >= 1
+    assert 0 <= params.node_id < params.n_nodes
+    assert 0 <= params.local_rank <= params.global_rank < params.world_size
+    assert params.world_size == params.n_nodes * params.n_gpu_per_node
+
+    # define whether this is the master process / if we are in multi-node distributed mode
+    params.is_master = params.node_id == 0 and params.local_rank == 0
+    params.multi_node = params.n_nodes > 1
+
+    # summary
+    PREFIX = f"--- Global rank: {params.global_rank} - "
+    logger.info(PREFIX + "Number of nodes: %i" % params.n_nodes)
+    logger.info(PREFIX + "Node ID        : %i" % params.node_id)
+    logger.info(PREFIX + "Local rank     : %i" % params.local_rank)
+    logger.info(PREFIX + "World size     : %i" % params.world_size)
+    logger.info(PREFIX + "GPUs per node  : %i" % params.n_gpu_per_node)
+    logger.info(PREFIX + "Master         : %s" % str(params.is_master))
+    logger.info(PREFIX + "Multi-node     : %s" % str(params.multi_node))
+    logger.info(PREFIX + "Multi-GPU      : %s" % str(params.multi_gpu))
+    logger.info(PREFIX + "Hostname       : %s" % socket.gethostname())
+
+    # set GPU device
+    torch.cuda.set_device(params.local_rank)
+
+    # initialize multi-GPU
+    if params.multi_gpu:
+        logger.info("Initializing PyTorch distributed")
+        torch.distributed.init_process_group(
+            init_method='env://',
+            backend='nccl',
+        )
+
+
+def set_seed(args):
+    """
+    Set the random seed.
+    """
+    np.random.seed(args.seed)
+    torch.manual_seed(args.seed)
+    if args.n_gpu > 0:
+        torch.cuda.manual_seed_all(args.seed)

From e424d2e45d740a7d5cc4c9502bfa1c70f51d1535 Mon Sep 17 00:00:00 2001
From: VictorSanh <victorsanh@gmail.com>
Date: Wed, 28 Aug 2019 01:10:10 +0000
Subject: [PATCH 164/200] add README

---
 examples/distillation/README.md | 3 +++
 1 file changed, 3 insertions(+)
 create mode 100644 examples/distillation/README.md

diff --git a/examples/distillation/README.md b/examples/distillation/README.md
new file mode 100644
index 0000000000..5faeda7291
--- /dev/null
+++ b/examples/distillation/README.md
@@ -0,0 +1,3 @@
+# DilBERT
+
+You'll have the details soon enough!
\ No newline at end of file

From 780f183e55077950b6b703d2777df6d33fe124a4 Mon Sep 17 00:00:00 2001
From: VictorSanh <victorsanh@gmail.com>
Date: Wed, 28 Aug 2019 01:39:52 +0000
Subject: [PATCH 165/200] add requirements

---
 examples/distillation/requirements.txt | 1 +
 1 file changed, 1 insertion(+)
 create mode 100644 examples/distillation/requirements.txt

diff --git a/examples/distillation/requirements.txt b/examples/distillation/requirements.txt
new file mode 100644
index 0000000000..efb369dc43
--- /dev/null
+++ b/examples/distillation/requirements.txt
@@ -0,0 +1 @@
+gitpython==3.0.2

From b247b0d880fe10e8e1a873d0b710f95f246af8ea Mon Sep 17 00:00:00 2001
From: VictorSanh <victorsanh@gmail.com>
Date: Wed, 28 Aug 2019 02:12:47 +0000
Subject: [PATCH 166/200] add `train.py` for distillation

---
 examples/distillation/train.py | 230 +++++++++++++++++++++++++++++++++
 1 file changed, 230 insertions(+)
 create mode 100644 examples/distillation/train.py

diff --git a/examples/distillation/train.py b/examples/distillation/train.py
new file mode 100644
index 0000000000..824eeac046
--- /dev/null
+++ b/examples/distillation/train.py
@@ -0,0 +1,230 @@
+import os
+import argparse
+import pickle
+import json
+import shutil
+import numpy as np
+import torch
+
+from pytorch_transformers import BertTokenizer, BertForMaskedLM
+from pytorch_transformers import DilBertForMaskedLM, DilBertConfig
+
+from distiller import Distiller
+from utils import git_log, logger, init_gpu_params, set_seed
+from dataset import Dataset
+
+
+def main():
+    parser = argparse.ArgumentParser(description="Training")
+
+    parser.add_argument("--dump_path", type=str, required=True,
+                        help="The output directory (log, checkpoints, parameters, etc.)")
+    parser.add_argument("--data_file", type=str, required=True,
+                        help="The binarized file (tokenized + tokens_to_ids) and grouped by sequence.")
+    parser.add_argument("--token_counts", type=str, required=True,
+                        help="The token counts in the data_file for MLM.")
+    parser.add_argument("--force", action='store_true',
+                        help="Overwrite dump_path if it already exists.")
+
+    parser.add_argument("--vocab_size", default=30522, type=int,
+                        help="The vocabulary size.")
+    parser.add_argument("--max_position_embeddings", default=512, type=int,
+                        help="Maximum sequence length we can model (including [CLS] and [SEP]).")
+    parser.add_argument("--sinusoidal_pos_embds", action='store_false',
+                        help="If true, the position embeddings are simply fixed with sinusoidal embeddings.")
+    parser.add_argument("--n_layers", default=6, type=int,
+                        help="Number of Transformer blocks.")
+    parser.add_argument("--n_heads", default=12, type=int,
+                        help="Number of heads in the self-attention module.")
+    parser.add_argument("--dim", default=768, type=int,
+                        help="Dimension through the network. Must be divisible by n_heads")
+    parser.add_argument("--hidden_dim", default=3072, type=int,
+                        help="Intermediate dimension in the FFN.")
+    parser.add_argument("--dropout", default=0.1, type=float,
+                        help="Dropout.")
+    parser.add_argument("--attention_dropout", default=0.1, type=float,
+                        help="Dropout in self-attention.")
+    parser.add_argument("--activation", default='gelu', type=str,
+                        help="Activation to use in self-attention")
+    parser.add_argument("--tie_weights_", action='store_false',
+                        help="If true, we tie the embeddings matrix with the projection over the vocabulary matrix. Default is true.")
+
+    parser.add_argument("--from_pretrained_weights", default=None, type=str,
+                        help="Load student initialization checkpoint.")
+    parser.add_argument("--from_pretrained_config", default=None, type=str,
+                        help="Load student initialization architecture config.")
+    parser.add_argument("--bert_model", default='bert-base-uncased', type=str,
+                        help="The teacher BERT model.")
+
+    parser.add_argument("--temperature", default=2., type=float,
+                        help="Temperature for the softmax temperature.")
+    parser.add_argument("--alpha_ce", default=0.5, type=float,
+                        help="Linear weight for the distillation loss. Must be >=0.")
+    parser.add_argument("--alpha_mlm", default=0.5, type=float,
+                        help="Linear weight for the MLM loss. Must be >=0.")
+    parser.add_argument("--alpha_mse", default=0.0, type=float,
+                        help="Linear weight of the MSE loss. Must be >=0.")
+    parser.add_argument("--mlm_mask_prop", default=0.15, type=float,
+                        help="Proportion of tokens for which we need to make a prediction.")
+    parser.add_argument("--word_mask", default=0.8, type=float,
+                        help="Proportion of tokens to mask out.")
+    parser.add_argument("--word_keep", default=0.1, type=float,
+                        help="Proportion of tokens to keep.")
+    parser.add_argument("--word_rand", default=0.1, type=float,
+                        help="Proportion of tokens to randomly replace.")
+    parser.add_argument("--mlm_smoothing", default=0.7, type=float,
+                        help="Smoothing parameter to emphasize more rare tokens (see XLM, similar to word2vec).")
+    parser.add_argument("--restrict_ce_to_mask", action='store_true',
+                        help="If true, compute the distilation loss only the [MLM] prediction distribution.")
+
+    parser.add_argument("--n_epoch", type=int, default=3,
+                        help="Number of pass on the whole dataset.")
+    parser.add_argument("--batch_size", type=int, default=5,
+                        help="Batch size (for each process).")
+    parser.add_argument("--tokens_per_batch", type=int, default=-1,
+                        help="If specified, modify the batches so that they have approximately this number of tokens.")
+    parser.add_argument("--shuffle", action='store_false',
+                        help="If true, shuffle the sequence order. Default is true.")
+    parser.add_argument("--group_by_size", action='store_false',
+                        help="If true, group sequences that have similar length into the same batch. Default is true.")
+
+    parser.add_argument("--gradient_accumulation_steps", type=int, default=50,
+                        help="Gradient accumulation for larger training batches.")
+    parser.add_argument("--warmup_prop", default=0.05, type=float,
+                        help="Linear warmup proportion.")
+    parser.add_argument("--weight_decay", default=0.0, type=float,
+                        help="Weight deay if we apply some.")
+    parser.add_argument("--learning_rate", default=5e-4, type=float,
+                        help="The initial learning rate for Adam.")
+    parser.add_argument("--adam_epsilon", default=1e-6, type=float,
+                        help="Epsilon for Adam optimizer.")
+    parser.add_argument("--max_grad_norm", default=5.0, type=float,
+                        help="Max gradient norm.")
+    parser.add_argument("--initializer_range", default=0.02, type=float,
+                        help="Random initialization range.")
+
+    parser.add_argument('--fp16', action='store_true',
+                        help="Whether to use 16-bit (mixed) precision (through NVIDIA apex) instead of 32-bit")
+    parser.add_argument('--fp16_opt_level', type=str, default='O1',
+                        help="For fp16: Apex AMP optimization level selected in ['O0', 'O1', 'O2', and 'O3']."
+                             "See details at https://nvidia.github.io/apex/amp.html")
+    parser.add_argument("--n_gpu", type=int, default=1,
+                        help="Number of GPUs in the node.")
+    parser.add_argument("--local_rank", type=int, default=-1,
+                        help="Distributed training - Local rank")
+    parser.add_argument("--seed", type=int, default=56,
+                        help="Random seed")
+
+    parser.add_argument("--log_interval", type=int, default=500,
+                        help="Tensorboard logging interval.")
+    parser.add_argument("--checkpoint_interval", type=int, default=4000,
+                        help="Checkpoint interval.")
+    args = parser.parse_args()
+
+
+    ## ARGS ##
+    init_gpu_params(args)
+    set_seed(args)
+    if args.is_master:
+        if os.path.exists(args.dump_path):
+            if not args.force:
+                raise ValueError(f'Serialization dir {args.dump_path} already exists, but you have not precised wheter to overwrite it'
+                                   'Use `--force` if you want to overwrite it')
+            else:
+                shutil.rmtree(args.dump_path)
+
+        if not os.path.exists(args.dump_path):
+            os.makedirs(args.dump_path)
+        logger.info(f'Experiment will be dumped and logged in {args.dump_path}')
+
+
+        ### SAVE PARAMS ###
+        logger.info(f'Param: {args}')
+        with open(os.path.join(args.dump_path, 'parameters.json'), 'w') as f:
+            json.dump(vars(args), f, indent=4)
+        git_log(args.dump_path)
+
+
+    ### TOKENIZER ###
+    bert_tokenizer = BertTokenizer.from_pretrained(args.bert_model)
+    special_tok_ids = {}
+    for tok_name, tok_symbol in bert_tokenizer.special_tokens_map.items():
+        idx = bert_tokenizer.all_special_tokens.index(tok_symbol)
+        special_tok_ids[tok_name] = bert_tokenizer.all_special_ids[idx]
+    logger.info(f'Special tokens {special_tok_ids}')
+    args.special_tok_ids = special_tok_ids
+
+
+    ## DATA LOADER ##
+    logger.info(f'Loading data from {args.data_file}')
+    with open(args.data_file, 'rb') as fp:
+        data = pickle.load(fp)
+
+
+    assert os.path.isfile(args.token_counts)
+    logger.info(f'Loading token counts from {args.token_counts} (already pre-computed)')
+    with open(args.token_counts, 'rb') as fp:
+        counts = pickle.load(fp)
+        assert len(counts) == args.vocab_size
+    token_probs = np.maximum(counts, 1) ** -args.mlm_smoothing
+    for idx in special_tok_ids.values():
+        token_probs[idx] = 0.  # do not predict special tokens
+    token_probs = torch.from_numpy(token_probs)
+
+
+    train_dataloader = Dataset(params=args, data=data)
+    logger.info(f'Data loader created.')
+
+
+    ## STUDENT ##
+    assert (args.from_pretrained_weights is None and args.from_pretrained_config is None) or \
+           (args.from_pretrained_weights is not None and args.from_pretrained_config is not None)
+    if args.from_pretrained_weights is not None:
+        assert os.path.isfile(os.path.join(args.from_pretrained, 'config.json'))
+        assert os.path.isfile(os.path.join(args.from_pretrained, 'config.json'))
+        logger.info(f'Loading pretrained weights from {args.from_pretrained_weights}')
+        logger.info(f'Loading pretrained config from {args.from_pretrained_config}')
+        stu_architecture_config = DilBertConfig.from_json_file(args.from_pretrained_config)
+        student = DilBertForMaskedLM.from_pretrained(args.from_pretrained_weights,
+                                                     config=stu_architecture_config)
+    else:
+        
+        stu_architecture_config = DilBertConfig(args)
+        student = DilBertForMaskedLM(stu_architecture_config)
+        # student = Model(vocab_size=args.vocab_size,
+        #                 max_position_embeddings=args.max_position_embeddings,
+        #                 sinusoidal_pos_embds=args.sinusoidal_pos_embds,
+        #                 n_layers=args.n_layers,
+        #                 n_heads=args.n_heads,
+        #                 dim=args.dim,
+        #                 dropout=args.dropout,
+        #                 attention_dropout=args.attention_dropout,
+        #                 activation=args.activation,
+        #                 initializer_range=args.initializer_range,
+        #                 tie_weights=args.tie_weights)
+
+
+    if args.n_gpu > 0:
+        student.to(f'cuda:{args.local_rank}')
+    logger.info(f'Student loaded.')
+
+
+    ## TEACHER ##
+    teacher = BertForMaskedLM.from_pretrained(args.bert_model)
+    if args.n_gpu > 0:
+        teacher.to(f'cuda:{args.local_rank}')
+    logger.info(f'Teacher loaded from {args.bert_model}.')
+
+    ## DISTILLER ##
+    torch.cuda.empty_cache()
+    distiller = Distiller(params=args,
+                          dataloader=train_dataloader,
+                          token_probs=token_probs,
+                          student=student,
+                          teacher=teacher)
+    distiller.train()
+    logger.info("Let's go get some drinks.")
+
+
+if __name__ == "__main__":
+    main()

From 906581ae3c29939d62c23be43b280a24f0381898 Mon Sep 17 00:00:00 2001
From: VictorSanh <victorsanh@gmail.com>
Date: Wed, 28 Aug 2019 02:43:33 +0000
Subject: [PATCH 167/200] add s3 links for dilbert (+fix small typo)

---
 pytorch_transformers/modeling_dilbert.py | 6 +++---
 1 file changed, 3 insertions(+), 3 deletions(-)

diff --git a/pytorch_transformers/modeling_dilbert.py b/pytorch_transformers/modeling_dilbert.py
index e842b31d8f..aeaac5b1aa 100644
--- a/pytorch_transformers/modeling_dilbert.py
+++ b/pytorch_transformers/modeling_dilbert.py
@@ -37,11 +37,11 @@ logger = logging.getLogger(__name__)
 
 
 DILBERT_PRETRAINED_MODEL_ARCHIVE_MAP = {
-    'dilbert-base-uncased': None, # TODO(Victor)
+    'dilbert-base-uncased': "https://s3.amazonaws.com/models.huggingface.co/bert/dilbert-base-uncased-pytorch_model.bin"
 }
 
 DILBERT_PRETRAINED_CONFIG_ARCHIVE_MAP = {
-    'dilbert-base-uncased': None, #TODO(Victor)
+    'dilbert-base-uncased': "https://s3.amazonaws.com/models.huggingface.co/bert/dilbert-base-uncased-config.json"
 }
 
 
@@ -85,7 +85,7 @@ class DilBertConfig(PretrainedConfig):
             self.tie_weights_ = tie_weights_
         else:
             raise ValueError("First argument must be either a vocabulary size (int)"
-                             "or the path to a pretrained model config file (str)")
+                             " or the path to a pretrained model config file (str)")
 
 
 ### UTILS AND BUILDING BLOCKS OF THE ARCHITECTURE ###

From 7f5d85347e2dd30d976e8ac08bc9e4fc743fe122 Mon Sep 17 00:00:00 2001
From: VictorSanh <victorsanh@gmail.com>
Date: Wed, 28 Aug 2019 02:44:51 +0000
Subject: [PATCH 168/200] fix small typo

---
 pytorch_transformers/modeling_bert.py       | 2 +-
 pytorch_transformers/modeling_transfo_xl.py | 2 +-
 pytorch_transformers/modeling_xlm.py        | 2 +-
 pytorch_transformers/modeling_xlnet.py      | 2 +-
 4 files changed, 4 insertions(+), 4 deletions(-)

diff --git a/pytorch_transformers/modeling_bert.py b/pytorch_transformers/modeling_bert.py
index 7b34b3fd90..badec992c3 100644
--- a/pytorch_transformers/modeling_bert.py
+++ b/pytorch_transformers/modeling_bert.py
@@ -216,7 +216,7 @@ class BertConfig(PretrainedConfig):
             self.layer_norm_eps = layer_norm_eps
         else:
             raise ValueError("First argument must be either a vocabulary size (int)"
-                             "or the path to a pretrained model config file (str)")
+                             " or the path to a pretrained model config file (str)")
 
 
 
diff --git a/pytorch_transformers/modeling_transfo_xl.py b/pytorch_transformers/modeling_transfo_xl.py
index 3cfdee38cb..c57e664c8f 100644
--- a/pytorch_transformers/modeling_transfo_xl.py
+++ b/pytorch_transformers/modeling_transfo_xl.py
@@ -285,7 +285,7 @@ class TransfoXLConfig(PretrainedConfig):
             self.init_std = init_std
         else:
             raise ValueError("First argument must be either a vocabulary size (int)"
-                             "or the path to a pretrained model config file (str)")
+                             " or the path to a pretrained model config file (str)")
 
     @property
     def max_position_embeddings(self):
diff --git a/pytorch_transformers/modeling_xlm.py b/pytorch_transformers/modeling_xlm.py
index 19800da2ed..5a659e02f9 100644
--- a/pytorch_transformers/modeling_xlm.py
+++ b/pytorch_transformers/modeling_xlm.py
@@ -178,7 +178,7 @@ class XLMConfig(PretrainedConfig):
             self.end_n_top = end_n_top
         else:
             raise ValueError("First argument must be either a vocabulary size (int)"
-                             "or the path to a pretrained model config file (str)")
+                             " or the path to a pretrained model config file (str)")
 
     @property
     def vocab_size(self):
diff --git a/pytorch_transformers/modeling_xlnet.py b/pytorch_transformers/modeling_xlnet.py
index d44821788e..136f07c436 100644
--- a/pytorch_transformers/modeling_xlnet.py
+++ b/pytorch_transformers/modeling_xlnet.py
@@ -306,7 +306,7 @@ class XLNetConfig(PretrainedConfig):
             self.end_n_top = end_n_top
         else:
             raise ValueError("First argument must be either a vocabulary size (int)"
-                             "or the path to a pretrained model config file (str)")
+                             " or the path to a pretrained model config file (str)")
 
     @property
     def max_position_embeddings(self):

From 74d78beeb418f29cade9d6a0aeb63eeee697a4e2 Mon Sep 17 00:00:00 2001
From: VictorSanh <victorsanh@gmail.com>
Date: Wed, 28 Aug 2019 03:13:11 +0000
Subject: [PATCH 169/200] fix: add qa_dropout and seq_classif_dropout

---
 pytorch_transformers/modeling_dilbert.py | 4 ++++
 1 file changed, 4 insertions(+)

diff --git a/pytorch_transformers/modeling_dilbert.py b/pytorch_transformers/modeling_dilbert.py
index aeaac5b1aa..36a94b506c 100644
--- a/pytorch_transformers/modeling_dilbert.py
+++ b/pytorch_transformers/modeling_dilbert.py
@@ -61,6 +61,8 @@ class DilBertConfig(PretrainedConfig):
                  activation='gelu',
                  initializer_range=0.02,
                  tie_weights_=True,
+                 qa_dropout=0.1,
+                 seq_classif_dropout=0.2,
                  **kwargs):
         super(DilBertConfig, self).__init__(**kwargs)
 
@@ -83,6 +85,8 @@ class DilBertConfig(PretrainedConfig):
             self.activation = activation
             self.initializer_range = initializer_range
             self.tie_weights_ = tie_weights_
+            self.qa_dropout = qa_dropout
+            self.seq_classif_dropout = seq_classif_dropout
         else:
             raise ValueError("First argument must be either a vocabulary size (int)"
                              " or the path to a pretrained model config file (str)")

From 778a263f09537e0d3667516c1fa674c9d331bc76 Mon Sep 17 00:00:00 2001
From: LysandreJik <lysandre.debut@reseau.eseo.fr>
Date: Tue, 27 Aug 2019 22:28:42 -0400
Subject: [PATCH 170/200] GilBert added to AutoModels

---
 pytorch_transformers/modeling_auto.py | 9 +++++++--
 1 file changed, 7 insertions(+), 2 deletions(-)

diff --git a/pytorch_transformers/modeling_auto.py b/pytorch_transformers/modeling_auto.py
index 516107c40b..2d28a6017f 100644
--- a/pytorch_transformers/modeling_auto.py
+++ b/pytorch_transformers/modeling_auto.py
@@ -30,6 +30,7 @@ from .modeling_transfo_xl import TransfoXLConfig, TransfoXLModel
 from .modeling_xlnet import XLNetConfig, XLNetModel
 from .modeling_xlm import XLMConfig, XLMModel
 from .modeling_roberta import RobertaConfig, RobertaModel
+from .modeling_dilbert import DilBertConfig, DilBertModel
 
 from .modeling_utils import PreTrainedModel, SequenceSummary
 
@@ -110,7 +111,9 @@ class AutoConfig(object):
             assert unused_kwargs == {'foo': False}
 
         """
-        if 'roberta' in pretrained_model_name_or_path:
+        if 'dilbert' in pretrained_model_name_or_path:
+            return DilBertconfig.from_pretrained(pretrained_model_name_or_path, **kwargs)
+        elif 'roberta' in pretrained_model_name_or_path:
             return RobertaConfig.from_pretrained(pretrained_model_name_or_path, **kwargs)
         elif 'bert' in pretrained_model_name_or_path:
             return BertConfig.from_pretrained(pretrained_model_name_or_path, **kwargs)
@@ -225,7 +228,9 @@ class AutoModel(object):
             model = AutoModel.from_pretrained('./tf_model/bert_tf_checkpoint.ckpt.index', from_tf=True, config=config)
 
         """
-        if 'roberta' in pretrained_model_name_or_path:
+        if 'dilbert' in pretrained_model_name_or_path:
+            return DilBertModel.from_pretrained(pretrained_model_name_or_path, *model_args, **kwargs)
+        elif 'roberta' in pretrained_model_name_or_path:
             return RobertaModel.from_pretrained(pretrained_model_name_or_path, *model_args, **kwargs)
         elif 'bert' in pretrained_model_name_or_path:
             return BertModel.from_pretrained(pretrained_model_name_or_path, *model_args, **kwargs)

From c513415b19ca43f9fe2cb0ab125a48e16d2cbbb9 Mon Sep 17 00:00:00 2001
From: LysandreJik <lysandre.debut@reseau.eseo.fr>
Date: Tue, 27 Aug 2019 23:59:00 -0400
Subject: [PATCH 171/200] Dilbert tests from CommonTests

---
 .../tests/modeling_common_test.py             |   7 +
 .../tests/modeling_dilbert_test.py            | 219 ++++++++++++++++++
 2 files changed, 226 insertions(+)
 create mode 100644 pytorch_transformers/tests/modeling_dilbert_test.py

diff --git a/pytorch_transformers/tests/modeling_common_test.py b/pytorch_transformers/tests/modeling_common_test.py
index e974ae865d..8a183c30da 100644
--- a/pytorch_transformers/tests/modeling_common_test.py
+++ b/pytorch_transformers/tests/modeling_common_test.py
@@ -49,6 +49,7 @@ class CommonTestCases:
         test_torchscript = True
         test_pruning = True
         test_resize_embeddings = True
+        test_head_masking = True
 
         def test_initialization(self):
             config, inputs_dict = self.model_tester.prepare_config_and_inputs_for_common()
@@ -159,6 +160,9 @@ class CommonTestCases:
 
 
         def test_headmasking(self):
+            if not self.test_head_masking:
+                return
+
             config, inputs_dict = self.model_tester.prepare_config_and_inputs_for_common()
 
             config.output_attentions = True
@@ -282,6 +286,9 @@ class CommonTestCases:
                 self.assertTrue(models_equal)
 
         def test_tie_model_weights(self):
+            if not self.test_torchscript:
+                return
+
             config, inputs_dict = self.model_tester.prepare_config_and_inputs_for_common()
 
             def check_same_values(layer_1, layer_2):
diff --git a/pytorch_transformers/tests/modeling_dilbert_test.py b/pytorch_transformers/tests/modeling_dilbert_test.py
new file mode 100644
index 0000000000..0cbef7e083
--- /dev/null
+++ b/pytorch_transformers/tests/modeling_dilbert_test.py
@@ -0,0 +1,219 @@
+# coding=utf-8
+# Copyright 2018 The Google AI Language Team Authors.
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+#     http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+from __future__ import absolute_import
+from __future__ import division
+from __future__ import print_function
+
+import unittest
+import shutil
+import pytest
+
+from pytorch_transformers import (DilBertConfig, DilBertModel, DilBertForMaskedLM,
+                                     DilBertForQuestionAnswering, DilBertForSequenceClassification)
+from pytorch_transformers.modeling_dilbert import DILBERT_PRETRAINED_MODEL_ARCHIVE_MAP
+
+from .modeling_common_test import (CommonTestCases, ConfigTester, ids_tensor)
+
+
+class DilBertModelTest(CommonTestCases.CommonModelTester):
+
+    all_model_classes = (DilBertModel, DilBertForMaskedLM, DilBertForQuestionAnswering,
+                         DilBertForSequenceClassification)
+    test_pruning = False
+    test_torchscript = False
+    test_resize_embeddings = False
+    test_head_masking = False
+
+    class DilBertModelTester(object):
+
+        def __init__(self,
+                     parent,
+                     batch_size=13,
+                     seq_length=7,
+                     is_training=True,
+                     use_input_mask=True,
+                     use_token_type_ids=False,
+                     use_labels=True,
+                     vocab_size=99,
+                     hidden_size=32,
+                     num_hidden_layers=5,
+                     num_attention_heads=4,
+                     intermediate_size=37,
+                     hidden_act="gelu",
+                     hidden_dropout_prob=0.1,
+                     attention_probs_dropout_prob=0.1,
+                     max_position_embeddings=512,
+                     type_vocab_size=16,
+                     type_sequence_label_size=2,
+                     initializer_range=0.02,
+                     num_labels=3,
+                     num_choices=4,
+                     scope=None,
+                    ):
+            self.parent = parent
+            self.batch_size = batch_size
+            self.seq_length = seq_length
+            self.is_training = is_training
+            self.use_input_mask = use_input_mask
+            self.use_token_type_ids = use_token_type_ids
+            self.use_labels = use_labels
+            self.vocab_size = vocab_size
+            self.hidden_size = hidden_size
+            self.num_hidden_layers = num_hidden_layers
+            self.num_attention_heads = num_attention_heads
+            self.intermediate_size = intermediate_size
+            self.hidden_act = hidden_act
+            self.hidden_dropout_prob = hidden_dropout_prob
+            self.attention_probs_dropout_prob = attention_probs_dropout_prob
+            self.max_position_embeddings = max_position_embeddings
+            self.type_vocab_size = type_vocab_size
+            self.type_sequence_label_size = type_sequence_label_size
+            self.initializer_range = initializer_range
+            self.num_labels = num_labels
+            self.num_choices = num_choices
+            self.scope = scope
+
+        def prepare_config_and_inputs(self):
+            input_ids = ids_tensor([self.batch_size, self.seq_length], self.vocab_size)
+
+            input_mask = None
+            if self.use_input_mask:
+                input_mask = ids_tensor([self.batch_size, self.seq_length], vocab_size=2)
+
+            sequence_labels = None
+            token_labels = None
+            choice_labels = None
+            if self.use_labels:
+                sequence_labels = ids_tensor([self.batch_size], self.type_sequence_label_size)
+                token_labels = ids_tensor([self.batch_size, self.seq_length], self.num_labels)
+                choice_labels = ids_tensor([self.batch_size], self.num_choices)
+
+            config = DilBertConfig(
+                vocab_size_or_config_json_file=self.vocab_size,
+                dim=self.hidden_size,
+                n_layers=self.num_hidden_layers,
+                n_heads=self.num_attention_heads,
+                hidden_dim=self.intermediate_size,
+                hidden_act=self.hidden_act,
+                dropout=self.hidden_dropout_prob,
+                attention_dropout=self.attention_probs_dropout_prob,
+                max_position_embeddings=self.max_position_embeddings,
+                initializer_range=self.initializer_range)
+
+            return config, input_ids, input_mask, sequence_labels, token_labels, choice_labels
+
+        def check_loss_output(self, result):
+            self.parent.assertListEqual(
+                list(result["loss"].size()),
+                [])
+
+        def create_and_check_dilbert_model(self, config, input_ids, input_mask, sequence_labels, token_labels, choice_labels):
+            model = DilBertModel(config=config)
+            model.eval()
+            sequence_output, pooled_output = model(input_ids, input_mask)
+            sequence_output, pooled_output = model(input_ids)
+
+            result = {
+                "sequence_output": sequence_output,
+                "pooled_output": pooled_output,
+            }
+            self.parent.assertListEqual(
+                list(result["sequence_output"].size()),
+                [self.batch_size, self.seq_length, self.hidden_size])
+            self.parent.assertListEqual(list(result["pooled_output"].size()), [self.batch_size, self.hidden_size])
+
+        def create_and_check_dilbert_for_masked_lm(self, config, input_ids, input_mask, sequence_labels, token_labels, choice_labels):
+            model = DilBertForMaskedLM(config=config)
+            model.eval()
+            loss, prediction_scores = model(input_ids, input_mask, token_labels)
+            result = {
+                "loss": loss,
+                "prediction_scores": prediction_scores,
+            }
+            self.parent.assertListEqual(
+                list(result["prediction_scores"].size()),
+                [self.batch_size, self.seq_length, self.vocab_size])
+            self.check_loss_output(result)
+
+        def create_and_check_dilbert_for_question_answering(self, config, input_ids, input_mask, sequence_labels, token_labels, choice_labels):
+            model = DilBertForQuestionAnswering(config=config)
+            model.eval()
+            loss, start_logits, end_logits = model(input_ids, input_mask, sequence_labels, sequence_labels)
+            result = {
+                "loss": loss,
+                "start_logits": start_logits,
+                "end_logits": end_logits,
+            }
+            self.parent.assertListEqual(
+                list(result["start_logits"].size()),
+                [self.batch_size, self.seq_length])
+            self.parent.assertListEqual(
+                list(result["end_logits"].size()),
+                [self.batch_size, self.seq_length])
+            self.check_loss_output(result)
+
+        def create_and_check_dilbert_for_sequence_classification(self, config, input_ids, input_mask, sequence_labels, token_labels, choice_labels):
+            config.num_labels = self.num_labels
+            model = DilBertForSequenceClassification(config)
+            model.eval()
+            loss, logits = model(input_ids, input_mask, sequence_labels)
+            result = {
+                "loss": loss,
+                "logits": logits,
+            }
+            self.parent.assertListEqual(
+                list(result["logits"].size()),
+                [self.batch_size, self.num_labels])
+            self.check_loss_output(result)
+
+        def prepare_config_and_inputs_for_common(self):
+            config_and_inputs = self.prepare_config_and_inputs()
+            (config, input_ids, input_mask, sequence_labels, token_labels, choice_labels) = config_and_inputs
+            inputs_dict = {'input_ids': input_ids, 'attention_mask': input_mask}
+            return config, inputs_dict
+
+    def setUp(self):
+        self.model_tester = DilBertModelTest.DilBertModelTester(self)
+        self.config_tester = ConfigTester(self, config_class=DilBertConfig, dim=37)
+
+    def test_config(self):
+        self.config_tester.run_common_tests()
+
+    def test_dilbert_model(self):
+        config_and_inputs = self.model_tester.prepare_config_and_inputs()
+        self.model_tester.create_and_check_dilbert_model(*config_and_inputs)
+
+    def test_for_masked_lm(self):
+        config_and_inputs = self.model_tester.prepare_config_and_inputs()
+        self.model_tester.create_and_check_dilbert_for_masked_lm(*config_and_inputs)
+
+    def test_for_question_answering(self):
+        config_and_inputs = self.model_tester.prepare_config_and_inputs()
+        self.model_tester.create_and_check_dilbert_for_question_answering(*config_and_inputs)
+
+    def test_for_sequence_classification(self):
+        config_and_inputs = self.model_tester.prepare_config_and_inputs()
+        self.model_tester.create_and_check_dilbert_for_sequence_classification(*config_and_inputs)
+
+    # @pytest.mark.slow
+    # def test_model_from_pretrained(self):
+    #     cache_dir = "/tmp/pytorch_transformers_test/"
+    #     for model_name in list(DILBERT_PRETRAINED_MODEL_ARCHIVE_MAP.keys())[:1]:
+    #         model = DilBertModel.from_pretrained(model_name, cache_dir=cache_dir)
+    #         shutil.rmtree(cache_dir)
+    #         self.assertIsNotNone(model)
+
+if __name__ == "__main__":
+    unittest.main()

From 4d16b279e55189b023f9903b28e527cbb2186055 Mon Sep 17 00:00:00 2001
From: VictorSanh <victorsanh@gmail.com>
Date: Wed, 28 Aug 2019 03:59:48 +0000
Subject: [PATCH 172/200] add `scripts/binarized_data.py`

---
 .../distillation/scripts/binarized_data.py    | 60 +++++++++++++++++++
 1 file changed, 60 insertions(+)
 create mode 100644 examples/distillation/scripts/binarized_data.py

diff --git a/examples/distillation/scripts/binarized_data.py b/examples/distillation/scripts/binarized_data.py
new file mode 100644
index 0000000000..a5fab286b4
--- /dev/null
+++ b/examples/distillation/scripts/binarized_data.py
@@ -0,0 +1,60 @@
+import argparse
+import pickle
+import random
+import time
+import numpy as np
+from pytorch_transformers import BertTokenizer
+
+from ..utils import logger
+
+def main():
+    parser = argparse.ArgumentParser(description="Preprocess the data to avoid re-doing it several times by (tokenization + token_to_ids).")
+    parser.add_argument('--file_path', type=str, default='data/dump.txt',
+                        help='The path to the data.')
+    parser.add_argument('--bert_tokenizer', type=str, default='bert-base-uncased',
+                        help="The tokenizer to use.")
+    parser.add_argument('--dump_file', type=str, default='data/dump',
+                        help='The dump file prefix.')
+    args = parser.parse_args()
+
+
+    logger.info(f'Loading Tokenizer ({args.bert_tokenizer})')
+    bert_tokenizer = BertTokenizer.from_pretrained(args.bert_tokenizer)
+
+
+    logger.info(f'Loading text from {args.file_path}')
+    with open(args.file_path, 'r', encoding='utf8') as fp:
+        data = fp.readlines()
+
+
+    logger.info(f'Start encoding')
+    logger.info(f'{len(data)} examples to process.')
+
+    rslt = []
+    iter = 0
+    interval = 10000
+    start = time.time()
+    for text in data:
+        text = f'[CLS] {text.strip()} [SEP]'
+        token_ids = bert_tokenizer.encode(text)
+        rslt.append(token_ids)
+
+        iter += 1
+        if iter % interval == 0:
+            end = time.time()
+            logger.info(f'{iter} examples processed. - {(end-start)/interval:.2f}s/expl')
+            start = time.time()
+    logger.info('Finished binarization')
+    logger.info(f'{len(data)} examples processed.')
+
+
+    dp_file = f'{args.dump_file}.{args.bert_tokenizer}.pickle'
+    rslt_ = [np.uint16(d) for d in rslt]
+    random.shuffle(rslt_)
+    logger.info(f'Dump to {dp_file}')
+    with open(dp_file, 'wb') as handle:
+        pickle.dump(rslt_, handle, protocol=pickle.HIGHEST_PROTOCOL)
+
+
+if __name__ == "__main__":
+    main()
\ No newline at end of file

From 7f2c384c802c2d4c454152f38eeee21abc8df297 Mon Sep 17 00:00:00 2001
From: VictorSanh <victorsanh@gmail.com>
Date: Wed, 28 Aug 2019 04:00:03 +0000
Subject: [PATCH 173/200] add `scripts/token_counts.py`

---
 examples/distillation/scripts/token_counts.py | 30 +++++++++++++++++++
 1 file changed, 30 insertions(+)
 create mode 100644 examples/distillation/scripts/token_counts.py

diff --git a/examples/distillation/scripts/token_counts.py b/examples/distillation/scripts/token_counts.py
new file mode 100644
index 0000000000..564dc64c8a
--- /dev/null
+++ b/examples/distillation/scripts/token_counts.py
@@ -0,0 +1,30 @@
+from collections import Counter
+import argparse
+import pickle
+
+from utils import logger
+
+if __name__ == '__main__':
+    parser = argparse.ArgumentParser(description="Token Counts for smoothing the masking probabilities in MLM (cf XLM/word2vec)")
+    parser.add_argument("--data_file", type=str, default="data/dump.bert-base-uncased.pickle",
+                        help="The binarized dataset."
+    parser.add_argument("--token_counts_dump", type=str, default="data/token_counts.bert-base-uncased.pickle",
+                        help="The dump file.")
+    parser.add_argument("--vocab_size", default=30522, type=int)
+    args = parser.parse_args()
+
+    logger.info(f'Loading data from {args.data_file}')
+    with open(args.data_file, 'rb') as fp:
+        data = pickle.load(fp)
+
+    logger.info('Counting occurences for MLM.')
+    counter = Counter()
+    for tk_ids in data:
+        counter.update(tk_ids)
+    counts = [0]*args.vocab_size
+    for k, v in counter.items():
+        counts[k] = v
+
+    logger.info(f'Dump to {args.token_counts_dump}')
+    with open(args.token_counts_dump, 'wb') as handle:
+        pickle.dump(counts, handle, protocol=pickle.HIGHEST_PROTOCOL)

From 0d8f8848d5de1e6f4a785484f5dbe331d6a28e2a Mon Sep 17 00:00:00 2001
From: VictorSanh <victorsanh@gmail.com>
Date: Wed, 28 Aug 2019 04:00:19 +0000
Subject: [PATCH 174/200] add `scripts/extract_for_distil.py`

---
 .../scripts/extract_for_distil.py             | 59 +++++++++++++++++++
 1 file changed, 59 insertions(+)
 create mode 100644 examples/distillation/scripts/extract_for_distil.py

diff --git a/examples/distillation/scripts/extract_for_distil.py b/examples/distillation/scripts/extract_for_distil.py
new file mode 100644
index 0000000000..27266c82ea
--- /dev/null
+++ b/examples/distillation/scripts/extract_for_distil.py
@@ -0,0 +1,59 @@
+from pytorch_transformers import BertForPreTraining
+import torch
+import argparse
+
+if __name__ == '__main__':
+    parser = argparse.ArgumentParser(description="Extraction some layers of the full BertForPreTraining for Transfer Learned Distillation")
+    parser.add_argument("--bert_model", default='bert-base-uncased', type=str)
+    parser.add_argument("--dump_checkpoint", default='serialization_dir/transfer_learning_checkpoint_0247911.pth', type=str)
+    parser.add_argument("--vocab_transform", action='store_true')
+    args = parser.parse_args()
+
+
+    model = BertForPreTraining.from_pretrained(args.bert_model)
+
+    state_dict = model.state_dict()
+    compressed_sd = {}
+
+    for w in ['word_embeddings', 'position_embeddings']:
+        compressed_sd[f'dilbert.embeddings.{w}.weight'] = \
+            state_dict[f'bert.embeddings.{w}.weight']
+    for w in ['weight', 'bias']:
+        compressed_sd[f'dilbert.embeddings.LayerNorm.{w}'] = \
+            state_dict[f'bert.embeddings.LayerNorm.{w}']
+
+    std_idx = 0
+    for teacher_idx in [0, 2, 4, 7, 9, 11]:
+        for w in ['weight', 'bias']:
+            compressed_sd[f'dilbert.transformer.layer.{std_idx}.attention.q_lin.{w}'] = \
+                state_dict[f'bert.encoder.layer.{teacher_idx}.attention.self.query.{w}']
+            compressed_sd[f'dilbert.transformer.layer.{std_idx}.attention.k_lin.{w}'] = \
+                state_dict[f'bert.encoder.layer.{teacher_idx}.attention.self.key.{w}']
+            compressed_sd[f'dilbert.transformer.layer.{std_idx}.attention.v_lin.{w}'] = \
+                state_dict[f'bert.encoder.layer.{teacher_idx}.attention.self.value.{w}']
+
+            compressed_sd[f'dilbert.transformer.layer.{std_idx}.attention.out_lin.{w}'] = \
+                state_dict[f'bert.encoder.layer.{teacher_idx}.attention.output.dense.{w}']
+            compressed_sd[f'dilbert.transformer.layer.{std_idx}.sa_layer_norm.{w}'] = \
+                state_dict[f'bert.encoder.layer.{teacher_idx}.attention.output.LayerNorm.{w}']
+
+            compressed_sd[f'dilbert.transformer.layer.{std_idx}.ffn.lin1.{w}'] = \
+                state_dict[f'bert.encoder.layer.{teacher_idx}.intermediate.dense.{w}']
+            compressed_sd[f'dilbert.transformer.layer.{std_idx}.ffn.lin2.{w}'] = \
+                state_dict[f'bert.encoder.layer.{teacher_idx}.output.dense.{w}']
+            compressed_sd[f'dilbert.transformer.layer.{std_idx}.output_layer_norm.{w}'] = \
+                state_dict[f'bert.encoder.layer.{teacher_idx}.output.LayerNorm.{w}']
+        std_idx += 1
+
+    compressed_sd[f'vocab_projector.weight'] = state_dict[f'cls.predictions.decoder.weight']
+    compressed_sd[f'vocab_projector.bias'] = state_dict[f'cls.predictions.bias']
+    if args.vocab_transform:
+        for w in ['weight', 'bias']:
+            compressed_sd[f'vocab_transform.{w}'] = state_dict[f'cls.predictions.transform.dense.{w}']
+            compressed_sd[f'vocab_layer_norm.{w}'] = state_dict[f'cls.predictions.transform.LayerNorm.{w}']
+
+    print(f'N layers selected for distillation: {std_idx}')
+    print(f'Number of params transfered for distillation: {len(compressed_sd.keys())}')
+
+    print(f'Save transfered checkpoint to {args.dump_checkpoint}.')
+    torch.save(compressed_sd, args.dump_checkpoint)

From da1e4e53fcd52bc281bfecef2ca0c0f420caf38f Mon Sep 17 00:00:00 2001
From: VictorSanh <victorsanh@gmail.com>
Date: Wed, 28 Aug 2019 04:01:03 +0000
Subject: [PATCH 175/200] some fixes in `train.py` for loading previous
 checkpoint

---
 examples/distillation/train.py | 23 ++++++-----------------
 1 file changed, 6 insertions(+), 17 deletions(-)

diff --git a/examples/distillation/train.py b/examples/distillation/train.py
index 824eeac046..a058182966 100644
--- a/examples/distillation/train.py
+++ b/examples/distillation/train.py
@@ -143,6 +143,8 @@ def main():
         with open(os.path.join(args.dump_path, 'parameters.json'), 'w') as f:
             json.dump(vars(args), f, indent=4)
         git_log(args.dump_path)
+    assert (args.from_pretrained_weights is None and args.from_pretrained_config is None) or \
+           (args.from_pretrained_weights is not None and args.from_pretrained_config is not None)
 
 
     ### TOKENIZER ###
@@ -177,31 +179,18 @@ def main():
 
 
     ## STUDENT ##
-    assert (args.from_pretrained_weights is None and args.from_pretrained_config is None) or \
-           (args.from_pretrained_weights is not None and args.from_pretrained_config is not None)
     if args.from_pretrained_weights is not None:
-        assert os.path.isfile(os.path.join(args.from_pretrained, 'config.json'))
-        assert os.path.isfile(os.path.join(args.from_pretrained, 'config.json'))
+        assert os.path.isfile(os.path.join(args.from_pretrained_weights))
+        assert os.path.isfile(os.path.join(args.from_pretrained_config))
         logger.info(f'Loading pretrained weights from {args.from_pretrained_weights}')
         logger.info(f'Loading pretrained config from {args.from_pretrained_config}')
         stu_architecture_config = DilBertConfig.from_json_file(args.from_pretrained_config)
         student = DilBertForMaskedLM.from_pretrained(args.from_pretrained_weights,
                                                      config=stu_architecture_config)
     else:
-        
-        stu_architecture_config = DilBertConfig(args)
+        args.vocab_size_or_config_json_file = args.vocab_size
+        stu_architecture_config = DilBertConfig(**vars(args))
         student = DilBertForMaskedLM(stu_architecture_config)
-        # student = Model(vocab_size=args.vocab_size,
-        #                 max_position_embeddings=args.max_position_embeddings,
-        #                 sinusoidal_pos_embds=args.sinusoidal_pos_embds,
-        #                 n_layers=args.n_layers,
-        #                 n_heads=args.n_heads,
-        #                 dim=args.dim,
-        #                 dropout=args.dropout,
-        #                 attention_dropout=args.attention_dropout,
-        #                 activation=args.activation,
-        #                 initializer_range=args.initializer_range,
-        #                 tie_weights=args.tie_weights)
 
 
     if args.n_gpu > 0:

From fea921d38265fad7d92b952f152a2aac314c3207 Mon Sep 17 00:00:00 2001
From: VictorSanh <victorsanh@gmail.com>
Date: Wed, 28 Aug 2019 04:45:39 +0000
Subject: [PATCH 176/200] add licensing

---
 examples/distillation/dataset.py                | 17 +++++++++++++++++
 examples/distillation/distiller.py              | 17 +++++++++++++++++
 examples/distillation/scripts/binarized_data.py | 17 +++++++++++++++++
 .../distillation/scripts/extract_for_distil.py  | 17 +++++++++++++++++
 examples/distillation/scripts/token_counts.py   | 17 +++++++++++++++++
 examples/distillation/train.py                  | 17 +++++++++++++++++
 examples/distillation/utils.py                  | 17 +++++++++++++++++
 7 files changed, 119 insertions(+)

diff --git a/examples/distillation/dataset.py b/examples/distillation/dataset.py
index 6256ce1144..b9f58f775e 100644
--- a/examples/distillation/dataset.py
+++ b/examples/distillation/dataset.py
@@ -1,3 +1,20 @@
+# coding=utf-8
+# Copyright 2019-present, the HuggingFace Inc. team.
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+#     http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+"""
+Dataloaders to train DilBERT.
+"""
 from typing import List
 import math
 from itertools import chain
diff --git a/examples/distillation/distiller.py b/examples/distillation/distiller.py
index c9c4458abc..c2d4a9785a 100644
--- a/examples/distillation/distiller.py
+++ b/examples/distillation/distiller.py
@@ -1,3 +1,20 @@
+# coding=utf-8
+# Copyright 2019-present, the HuggingFace Inc. team.
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+#     http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+"""
+The distiller to distil DilBERT.
+"""
 import os
 import math
 from tensorboardX import SummaryWriter
diff --git a/examples/distillation/scripts/binarized_data.py b/examples/distillation/scripts/binarized_data.py
index a5fab286b4..c79001bb5e 100644
--- a/examples/distillation/scripts/binarized_data.py
+++ b/examples/distillation/scripts/binarized_data.py
@@ -1,3 +1,20 @@
+# coding=utf-8
+# Copyright 2019-present, the HuggingFace Inc. team.
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+#     http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+"""
+Preprocessing script before training DilBERT.
+"""
 import argparse
 import pickle
 import random
diff --git a/examples/distillation/scripts/extract_for_distil.py b/examples/distillation/scripts/extract_for_distil.py
index 27266c82ea..1cbf19d2cf 100644
--- a/examples/distillation/scripts/extract_for_distil.py
+++ b/examples/distillation/scripts/extract_for_distil.py
@@ -1,3 +1,20 @@
+# coding=utf-8
+# Copyright 2019-present, the HuggingFace Inc. team.
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+#     http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+"""
+Preprocessing script before training DilBERT.
+"""
 from pytorch_transformers import BertForPreTraining
 import torch
 import argparse
diff --git a/examples/distillation/scripts/token_counts.py b/examples/distillation/scripts/token_counts.py
index 564dc64c8a..2f5ed83922 100644
--- a/examples/distillation/scripts/token_counts.py
+++ b/examples/distillation/scripts/token_counts.py
@@ -1,3 +1,20 @@
+# coding=utf-8
+# Copyright 2019-present, the HuggingFace Inc. team.
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+#     http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+"""
+Preprocessing script before training DilBERT.
+"""
 from collections import Counter
 import argparse
 import pickle
diff --git a/examples/distillation/train.py b/examples/distillation/train.py
index a058182966..5af42dd8f4 100644
--- a/examples/distillation/train.py
+++ b/examples/distillation/train.py
@@ -1,3 +1,20 @@
+# coding=utf-8
+# Copyright 2019-present, the HuggingFace Inc. team.
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+#     http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+"""
+Training DilBERT.
+"""
 import os
 import argparse
 import pickle
diff --git a/examples/distillation/utils.py b/examples/distillation/utils.py
index b3a9f15891..14bb0e0016 100644
--- a/examples/distillation/utils.py
+++ b/examples/distillation/utils.py
@@ -1,3 +1,20 @@
+# coding=utf-8
+# Copyright 2019-present, the HuggingFace Inc. team.
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+#     http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+"""
+Utils to train DilBERT.
+"""
 import git
 import json
 import os

From 19b7c9b0b7d69a12c291200198155c7681125428 Mon Sep 17 00:00:00 2001
From: VictorSanh <victorsanh@gmail.com>
Date: Wed, 28 Aug 2019 06:25:44 +0000
Subject: [PATCH 177/200] add DilBert model for squad

---
 pytorch_transformers/modeling_dilbert.py | 8 +++++---
 1 file changed, 5 insertions(+), 3 deletions(-)

diff --git a/pytorch_transformers/modeling_dilbert.py b/pytorch_transformers/modeling_dilbert.py
index 36a94b506c..2f3ea1c535 100644
--- a/pytorch_transformers/modeling_dilbert.py
+++ b/pytorch_transformers/modeling_dilbert.py
@@ -37,11 +37,13 @@ logger = logging.getLogger(__name__)
 
 
 DILBERT_PRETRAINED_MODEL_ARCHIVE_MAP = {
-    'dilbert-base-uncased': "https://s3.amazonaws.com/models.huggingface.co/bert/dilbert-base-uncased-pytorch_model.bin"
+    'dilbert-base-uncased': "https://s3.amazonaws.com/models.huggingface.co/bert/dilbert-base-uncased-pytorch_model.bin",
+    'dilbert-base-uncased-distilled-squad': "https://s3.amazonaws.com/models.huggingface.co/bert/dilbert-base-uncased-distilled-squad-pytorch_model.bin"
 }
 
 DILBERT_PRETRAINED_CONFIG_ARCHIVE_MAP = {
-    'dilbert-base-uncased': "https://s3.amazonaws.com/models.huggingface.co/bert/dilbert-base-uncased-config.json"
+    'dilbert-base-uncased': "https://s3.amazonaws.com/models.huggingface.co/bert/dilbert-base-uncased-config.json",
+    'dilbert-base-uncased-distilled-squad': "https://s3.amazonaws.com/models.huggingface.co/bert/dilbert-base-uncased-distilled-squad-config.json"
 }
 
 
@@ -378,7 +380,7 @@ class DilBertPreTrainedModel(PreTrainedModel):
 DILBERT_START_DOCSTRING = r"""
     Smaller, faster, cheaper, lighter: DilBERT
 
-    For more information on DilBERT, you should check TODO(Victor): Link to Medium
+    For more information on DilBERT, you should check TODO(Link): Link to Medium
 
     Parameters:
         config (:class:`~pytorch_transformers.DilBertConfig`): Model configuration class with all the parameters of the model. 

From 93e82ab4240a6f5b13a02303c1af385e24165938 Mon Sep 17 00:00:00 2001
From: VictorSanh <victorsanh@gmail.com>
Date: Wed, 28 Aug 2019 06:26:09 +0000
Subject: [PATCH 178/200] Write README for DilBERT

---
 examples/distillation/README.md | 96 ++++++++++++++++++++++++++++++++-
 1 file changed, 95 insertions(+), 1 deletion(-)

diff --git a/examples/distillation/README.md b/examples/distillation/README.md
index 5faeda7291..2eb4b59f8a 100644
--- a/examples/distillation/README.md
+++ b/examples/distillation/README.md
@@ -1,3 +1,97 @@
 # DilBERT
 
-You'll have the details soon enough!
\ No newline at end of file
+This section contains examples showcasing how to use DilBERT and the original code to train DilBERT.
+
+## What is DilBERT?
+
+DilBERT stands for DistiLlation-BERT. DilBERT is a small, fast, cheap and light Transformer model: it has 40% less parameters than `bert-base-uncased`, runs 40% faster while preserving 96% on the language understanding capabilties (as shown on the GLUE benchmark). DilBERT is trained by distillation: a technique to compress a large model called the teacher into a smaller model called the student. By applying this compression technique, we obtain a smaller Transformer model that bears a lot of similarities with the original BERT model, while being lighter, smaller and faster. Thus, DilBERT can be an interesting solution to put large Transformer model into production.
+
+For more information on DilBERT, we refer to [our blog post](TODO(Link)).
+
+## How to use DilBERT?
+
+PyTorch-Transformers includes two pre-trained models:
+- `dilbert-base-uncased`: The language model pretrained by distillation under the supervision of `bert-base-uncased`. The model has 6 layers, 768 dimension and 12 heads, totalizing 66M parameters.
+- `dilbert-base-uncased-distilled-squad`: The `dilbert-base-uncased` finetune by distillation on SQuAD. It reaches a F1 score of 86.2 on the dev set, while `bert-base-uncased` reaches a 88.5 F1 score.
+
+Using DilBERT is really similar to using BERT. DilBERT uses the same tokenizer as BERT and more specifically `bert-base-uncased`. You should only use this tookenizer as the only pre-trained weights available for now are supervised by `bert-base-uncased`.
+
+```python
+tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
+model = DilBertModel.from_pretrained('dilbert-base-uncased')
+
+input_ids = torch.tensor(tokenizer.encode("Hello, my dog is cute")).unsqueeze(0)
+outputs = model(input_ids)
+last_hidden_states = outputs[0]  # The last hidden-state is the first element of the output tuple
+```
+
+## How to train DilBERT?
+
+In the following, we will explain how you can train your own compressed model.
+
+### A. Preparing the data
+
+The weights we release are trained using a concatenation of Toronto Book Corpus and English Wikipedia (same training data as BERT).
+
+To avoid processing the data several time, we do it once and for all before the training. From now on, will suppose that you have a text file `dump.txt` which contains one sequence per line (a sequence being composed of one of several coherent sentences).
+
+First, we will binarize the data: we tokenize the data and associate each token to an id.
+
+```bash
+python scripts/binarized_data.py \
+    --file_path data/dump.txt \
+    --bert_tokenizer bert-base-uncased \
+    --dump_file data/binarized_text
+```
+
+In the masked language modeling loss, we follow [XLM](https://github.com/facebookresearch/XLM) and smooth the probability of masking with a factor that put more emphasis on rare words. Thus we count the occurences of each tokens in the data:
+
+```bash
+python scripts/token_counts.py \
+    --data_file data/binarized_text.bert-base-uncased.pickle \
+    --token_counts_dump data/token_counts.bert-base-uncased.pickle
+```
+
+### B. Training
+
+Launching a distillation is really simple once you have setup the data:
+
+```bash
+python train.py \
+    --dump_path serialization_dir/my_first_training \
+    --data_file data/binarized_text.bert-base-uncased.pickle \
+    --token_counts data/token_counts.bert-base-uncased.pickle \
+    --force # It overwrites the `dump_path` if it already exists.
+``` 
+
+By default, this will launch a training on a single GPU (even if more are available on the cluster). Other parameters are available in the command line, please refer to `train.py`.
+
+We also highly encourage using distributed training. Here's an example that launchs a distributed traininng on a single node with 4 GPUs:
+```bash
+export NODE_RANK=0
+export N_NODES=1
+
+export N_GPU_NODE=4
+export WORLD_SIZE=4
+export MASTER_PORT=<AN_OPEN_PORT>
+export MASTER_ADDR=<I.P.>
+
+pkill -f 'python -u train.py'
+
+python -m torch.distributed.launch \
+    --nproc_per_node=$N_GPU_NODE \
+    --nnodes=$N_NODES \
+    --node_rank $NODE_RANK \
+    --master_addr $MASTER_ADDR \
+    --master_port $MASTER_PORT \
+    train.py \
+        --force \
+        --n_gpu $WORLD_SIZE \
+        --data_file data/dump_concat_wiki_toronto_bk.bert-base-uncased.pickle \
+        --token_counts data/token_counts_concat_wiki_toronto_bk.bert-base-uncased.pickle \
+        --dump_path serialization_dir/with_transform/last_word
+```
+
+**Tips** Start the distillation from some sort of structure initialization is crucial to reach a good final performance. In our experiments, we use initialization from some of the layers of the teacher itself! Please refer to `scripts/extract_for_distil.py` to create a valid initialization checkpoint and add `from_pretrained_weights` and `from_pretrained_config` when launching your distillation!
+
+Happy distillation!

From 497f73c9644b50a70402d3b546eeca71acbbb23a Mon Sep 17 00:00:00 2001
From: VictorSanh <victorsanh@gmail.com>
Date: Wed, 28 Aug 2019 07:16:30 +0000
Subject: [PATCH 179/200] add DilBERT to master REAME

---
 README.md | 1 +
 1 file changed, 1 insertion(+)

diff --git a/README.md b/README.md
index 9751c720b8..fdb160d898 100644
--- a/README.md
+++ b/README.md
@@ -13,6 +13,7 @@ The library currently contains PyTorch implementations, pre-trained model weight
 5. **[XLNet](https://github.com/zihangdai/xlnet/)** (from Google/CMU) released with the paper [​XLNet: Generalized Autoregressive Pretraining for Language Understanding](https://arxiv.org/abs/1906.08237) by Zhilin Yang*, Zihang Dai*, Yiming Yang, Jaime Carbonell, Ruslan Salakhutdinov, Quoc V. Le.
 6. **[XLM](https://github.com/facebookresearch/XLM/)** (from Facebook) released together with the paper [Cross-lingual Language Model Pretraining](https://arxiv.org/abs/1901.07291) by Guillaume Lample and Alexis Conneau.
 7. **[RoBERTa](https://github.com/pytorch/fairseq/tree/master/examples/roberta)** (from Facebook), a [Robustly Optimized BERT Pretraining Approach](https://arxiv.org/abs/1907.11692) by Yinhan Liu, Myle Ott, Naman Goyal, Jingfei Du et al.
+8. **[DilBERT](https://github.com/huggingface/pytorch-transformers/tree/master/examples/distillation)** (from HuggingFace), a smaller, faster, and lighter version of BERT leveraging knowledge distillation by Victor Sanh, Thomas Wolf and Lysandre Debut
 
 These implementations have been tested on several datasets (see the example scripts) and should match the performances of the original implementations (e.g. ~93 F1 on SQuAD for BERT Whole-Word-Masking, ~88 F1 on RocStories for OpenAI GPT, ~18.3 perplexity on WikiText 103 for Transformer-XL, ~0.916 Peason R coefficient on STS-B for XLNet). You can find more details on the performances in the Examples section of the [documentation](https://huggingface.co/pytorch-transformers/examples.html).
 

From a5fe16687b896d1f7cf6edd7a6d4f32c2eefdd94 Mon Sep 17 00:00:00 2001
From: VictorSanh <victorsanh@gmail.com>
Date: Wed, 28 Aug 2019 07:22:54 +0000
Subject: [PATCH 180/200] fix typo

---
 pytorch_transformers/modeling_auto.py | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/pytorch_transformers/modeling_auto.py b/pytorch_transformers/modeling_auto.py
index 2d28a6017f..7e65269926 100644
--- a/pytorch_transformers/modeling_auto.py
+++ b/pytorch_transformers/modeling_auto.py
@@ -112,7 +112,7 @@ class AutoConfig(object):
 
         """
         if 'dilbert' in pretrained_model_name_or_path:
-            return DilBertconfig.from_pretrained(pretrained_model_name_or_path, **kwargs)
+            return DilBertConfig.from_pretrained(pretrained_model_name_or_path, **kwargs)
         elif 'roberta' in pretrained_model_name_or_path:
             return RobertaConfig.from_pretrained(pretrained_model_name_or_path, **kwargs)
         elif 'bert' in pretrained_model_name_or_path:

From 4ce5f36f78d5c5de6509616110fd4d3c97e2297c Mon Sep 17 00:00:00 2001
From: thomwolf <thomwolf@gmail.com>
Date: Wed, 28 Aug 2019 12:14:31 +0200
Subject: [PATCH 181/200] update readmes

---
 README.md                       |  5 ++--
 examples/distillation/README.md | 43 ++++++++++++++++++---------------
 2 files changed, 26 insertions(+), 22 deletions(-)

diff --git a/README.md b/README.md
index fdb160d898..de69e69788 100644
--- a/README.md
+++ b/README.md
@@ -12,8 +12,9 @@ The library currently contains PyTorch implementations, pre-trained model weight
 4. **[Transformer-XL](https://github.com/kimiyoung/transformer-xl)** (from Google/CMU) released with the paper [Transformer-XL: Attentive Language Models Beyond a Fixed-Length Context](https://arxiv.org/abs/1901.02860) by Zihang Dai*, Zhilin Yang*, Yiming Yang, Jaime Carbonell, Quoc V. Le, Ruslan Salakhutdinov.
 5. **[XLNet](https://github.com/zihangdai/xlnet/)** (from Google/CMU) released with the paper [​XLNet: Generalized Autoregressive Pretraining for Language Understanding](https://arxiv.org/abs/1906.08237) by Zhilin Yang*, Zihang Dai*, Yiming Yang, Jaime Carbonell, Ruslan Salakhutdinov, Quoc V. Le.
 6. **[XLM](https://github.com/facebookresearch/XLM/)** (from Facebook) released together with the paper [Cross-lingual Language Model Pretraining](https://arxiv.org/abs/1901.07291) by Guillaume Lample and Alexis Conneau.
-7. **[RoBERTa](https://github.com/pytorch/fairseq/tree/master/examples/roberta)** (from Facebook), a [Robustly Optimized BERT Pretraining Approach](https://arxiv.org/abs/1907.11692) by Yinhan Liu, Myle Ott, Naman Goyal, Jingfei Du et al.
-8. **[DilBERT](https://github.com/huggingface/pytorch-transformers/tree/master/examples/distillation)** (from HuggingFace), a smaller, faster, and lighter version of BERT leveraging knowledge distillation by Victor Sanh, Thomas Wolf and Lysandre Debut
+7. **[RoBERTa](https://github.com/pytorch/fairseq/tree/master/examples/roberta)** (from Facebook), released together with the paper a [Robustly Optimized BERT Pretraining Approach](https://arxiv.org/abs/1907.11692) by Yinhan Liu, Myle Ott, Naman Goyal, Jingfei Du, Mandar Joshi, Danqi Chen, Omer Levy, Mike Lewis, Luke Zettlemoyer, Veselin Stoyanov.
+8. **[DilBERT](https://github.com/huggingface/pytorch-transformers/tree/master/examples/distillation)** (from HuggingFace), released together with the blogpost [Smaller, faster, cheaper, lighter: Introducing DilBERT, a distilled version of BERT](https://medium.com/huggingface/smaller-faster-cheaper-lighter-introducing-dilbert-a-distilled-version-of-bert-8cf3380435b5
+) by Victor Sanh, Lysandre Debut and Thomas Wolf.
 
 These implementations have been tested on several datasets (see the example scripts) and should match the performances of the original implementations (e.g. ~93 F1 on SQuAD for BERT Whole-Word-Masking, ~88 F1 on RocStories for OpenAI GPT, ~18.3 perplexity on WikiText 103 for Transformer-XL, ~0.916 Peason R coefficient on STS-B for XLNet). You can find more details on the performances in the Examples section of the [documentation](https://huggingface.co/pytorch-transformers/examples.html).
 
diff --git a/examples/distillation/README.md b/examples/distillation/README.md
index 2eb4b59f8a..c037bd0c24 100644
--- a/examples/distillation/README.md
+++ b/examples/distillation/README.md
@@ -1,23 +1,25 @@
 # DilBERT
 
-This section contains examples showcasing how to use DilBERT and the original code to train DilBERT.
+This folder contains the original code used to train DilBERT as well as examples showcasing how to use DilBERT.
 
-## What is DilBERT?
+## What is DilBERT
 
-DilBERT stands for DistiLlation-BERT. DilBERT is a small, fast, cheap and light Transformer model: it has 40% less parameters than `bert-base-uncased`, runs 40% faster while preserving 96% on the language understanding capabilties (as shown on the GLUE benchmark). DilBERT is trained by distillation: a technique to compress a large model called the teacher into a smaller model called the student. By applying this compression technique, we obtain a smaller Transformer model that bears a lot of similarities with the original BERT model, while being lighter, smaller and faster. Thus, DilBERT can be an interesting solution to put large Transformer model into production.
+DilBERT stands for Distillated-BERT. DilBERT is a small, fast, cheap and light Transformer model based on Bert architecture. It has 40% less parameters than `bert-base-uncased`, runs 60% faster while preserving over 95% of Bert's performances as measured on the GLUE language understanding benchmark. DilBERT is trained using knowledge distillation, a technique to compress a large model called the teacher into a smaller model called the student. By distillating Bert, we obtain a smaller Transformer model that bears a lot of similarities with the original BERT model while being lighter, smaller and faster to run. DilBERT is thus an interesting option to put large-scaled trained Transformer model into production.
 
-For more information on DilBERT, we refer to [our blog post](TODO(Link)).
+For more information on DilBERT, please refer to our [detailed blog post](https://medium.com/huggingface/smaller-faster-cheaper-lighter-introducing-dilbert-a-distilled-version-of-bert-8cf3380435b5
+).
 
-## How to use DilBERT?
+## How to use DilBERT
 
-PyTorch-Transformers includes two pre-trained models:
-- `dilbert-base-uncased`: The language model pretrained by distillation under the supervision of `bert-base-uncased`. The model has 6 layers, 768 dimension and 12 heads, totalizing 66M parameters.
-- `dilbert-base-uncased-distilled-squad`: The `dilbert-base-uncased` finetune by distillation on SQuAD. It reaches a F1 score of 86.2 on the dev set, while `bert-base-uncased` reaches a 88.5 F1 score.
+PyTorch-Transformers includes two pre-trained DilBERT models, currently only provided for English (we are investigating the possibility to train and release a multilingual version of DilBERT):
 
-Using DilBERT is really similar to using BERT. DilBERT uses the same tokenizer as BERT and more specifically `bert-base-uncased`. You should only use this tookenizer as the only pre-trained weights available for now are supervised by `bert-base-uncased`.
+- `dilbert-base-uncased`: DilBERT English language model pretrained on the same data used to pretrain Bert (concatenation of the Toronto Book Corpus and full English Wikipedia) using distillation with the supervision of the `bert-base-uncased` version of Bert. The model has 6 layers, 768 dimension and 12 heads, totalizing 66M parameters.
+- `dilbert-base-uncased-distilled-squad`: A finetuned version of `dilbert-base-uncased` finetuned using (a second step of) knwoledge distillation on SQuAD 1.0. This model reaches a F1 score of 86.2 on the dev set (for comparison, Bert `bert-base-uncased` version reaches a 88.5 F1 score).
+
+Using DilBERT is very similar to using BERT. DilBERT share the same tokenizer as BERT's `bert-base-uncased` even though we provide a link to this tokenizer under the `DilBertTokenizer` name to have a consistent naming between the library models.
 
 ```python
-tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
+tokenizer = DilBertTokenizer.from_pretrained('dilbert-base-uncased')
 model = DilBertModel.from_pretrained('dilbert-base-uncased')
 
 input_ids = torch.tensor(tokenizer.encode("Hello, my dog is cute")).unsqueeze(0)
@@ -25,17 +27,17 @@ outputs = model(input_ids)
 last_hidden_states = outputs[0]  # The last hidden-state is the first element of the output tuple
 ```
 
-## How to train DilBERT?
+## How to train DilBERT
 
 In the following, we will explain how you can train your own compressed model.
 
 ### A. Preparing the data
 
-The weights we release are trained using a concatenation of Toronto Book Corpus and English Wikipedia (same training data as BERT).
+The weights we release are trained using a concatenation of Toronto Book Corpus and English Wikipedia (same training data as the English version of BERT).
 
 To avoid processing the data several time, we do it once and for all before the training. From now on, will suppose that you have a text file `dump.txt` which contains one sequence per line (a sequence being composed of one of several coherent sentences).
 
-First, we will binarize the data: we tokenize the data and associate each token to an id.
+First, we will binarize the data, i.e. tokenize the data and convert each token in an index in our model's vocabulary.
 
 ```bash
 python scripts/binarized_data.py \
@@ -44,7 +46,7 @@ python scripts/binarized_data.py \
     --dump_file data/binarized_text
 ```
 
-In the masked language modeling loss, we follow [XLM](https://github.com/facebookresearch/XLM) and smooth the probability of masking with a factor that put more emphasis on rare words. Thus we count the occurences of each tokens in the data:
+Our implementation of masked language modeling loss follows [XLM](https://github.com/facebookresearch/XLM)'s one and smoothes the probability of masking with a factor that put more emphasis on rare words. Thus we count the occurences of each tokens in the data:
 
 ```bash
 python scripts/token_counts.py \
@@ -54,19 +56,20 @@ python scripts/token_counts.py \
 
 ### B. Training
 
-Launching a distillation is really simple once you have setup the data:
+Training with distillation is really simple once you have pre-processed the data:
 
 ```bash
 python train.py \
     --dump_path serialization_dir/my_first_training \
     --data_file data/binarized_text.bert-base-uncased.pickle \
     --token_counts data/token_counts.bert-base-uncased.pickle \
-    --force # It overwrites the `dump_path` if it already exists.
-``` 
+    --force # overwrites the `dump_path` if it already exists.
+```
 
-By default, this will launch a training on a single GPU (even if more are available on the cluster). Other parameters are available in the command line, please refer to `train.py`.
+By default, this will launch a training on a single GPU (even if more are available on the cluster). Other parameters are available in the command line, please look in `train.py` or run `python train.py --help` to list them.
+
+We highly encourage you to distributed training for training DilBert as the training corpus is quite large. Here's an example that runs a distributed training on a single node having 4 GPUs:
 
-We also highly encourage using distributed training. Here's an example that launchs a distributed traininng on a single node with 4 GPUs:
 ```bash
 export NODE_RANK=0
 export N_NODES=1
@@ -92,6 +95,6 @@ python -m torch.distributed.launch \
         --dump_path serialization_dir/with_transform/last_word
 ```
 
-**Tips** Start the distillation from some sort of structure initialization is crucial to reach a good final performance. In our experiments, we use initialization from some of the layers of the teacher itself! Please refer to `scripts/extract_for_distil.py` to create a valid initialization checkpoint and add `from_pretrained_weights` and `from_pretrained_config` when launching your distillation!
+**Tips** Starting distillated training with good initialization of the model weights is crucial to reach decent performance. In our experiments, we initialized our model from a few layers of the teacher (Bert) itself! Please refer to `scripts/extract_for_distil.py` to create a valid initialization checkpoint and use `--from_pretrained_weights` and `--from_pretrained_config` arguments to use this initialization for the distilled training!
 
 Happy distillation!

From 62df4ba59aac3a62a03f40b602f9c285ea282108 Mon Sep 17 00:00:00 2001
From: thomwolf <thomwolf@gmail.com>
Date: Wed, 28 Aug 2019 12:22:56 +0200
Subject: [PATCH 182/200] add dilbert tokenizer and tests

---
 pytorch_transformers/__init__.py              |  5 +-
 .../tests/tokenization_bert_test.py           |  6 +-
 .../tests/tokenization_dilbert_test.py        | 46 ++++++++++++++
 pytorch_transformers/tokenization_dilbert.py  | 62 +++++++++++++++++++
 4 files changed, 114 insertions(+), 5 deletions(-)
 create mode 100644 pytorch_transformers/tests/tokenization_dilbert_test.py
 create mode 100644 pytorch_transformers/tokenization_dilbert.py

diff --git a/pytorch_transformers/__init__.py b/pytorch_transformers/__init__.py
index e6774c96d8..22bc4d3c21 100644
--- a/pytorch_transformers/__init__.py
+++ b/pytorch_transformers/__init__.py
@@ -7,6 +7,7 @@ from .tokenization_gpt2 import GPT2Tokenizer
 from .tokenization_xlnet import XLNetTokenizer, SPIECE_UNDERLINE
 from .tokenization_xlm import XLMTokenizer
 from .tokenization_roberta import RobertaTokenizer
+from .tokenization_dilbert import DilBertTokenizer
 
 from .tokenization_utils import (PreTrainedTokenizer)
 
@@ -41,8 +42,8 @@ from .modeling_xlm import (XLMConfig, XLMPreTrainedModel , XLMModel,
 from .modeling_roberta import (RobertaConfig, RobertaForMaskedLM, RobertaModel, RobertaForSequenceClassification,
                                ROBERTA_PRETRAINED_CONFIG_ARCHIVE_MAP, ROBERTA_PRETRAINED_MODEL_ARCHIVE_MAP)
 from .modeling_dilbert import (DilBertConfig, DilBertForMaskedLM, DilBertModel,
-                              DilBertForSequenceClassification, DilBertForQuestionAnswering,
-                              DILBERT_PRETRAINED_CONFIG_ARCHIVE_MAP, DILBERT_PRETRAINED_MODEL_ARCHIVE_MAP)
+                               DilBertForSequenceClassification, DilBertForQuestionAnswering,
+                               DILBERT_PRETRAINED_CONFIG_ARCHIVE_MAP, DILBERT_PRETRAINED_MODEL_ARCHIVE_MAP)
 from .modeling_utils import (WEIGHTS_NAME, CONFIG_NAME, TF_WEIGHTS_NAME,
                           PretrainedConfig, PreTrainedModel, prune_layer, Conv1D)
 
diff --git a/pytorch_transformers/tests/tokenization_bert_test.py b/pytorch_transformers/tests/tokenization_bert_test.py
index db507317a8..aaca746d46 100644
--- a/pytorch_transformers/tests/tokenization_bert_test.py
+++ b/pytorch_transformers/tests/tokenization_bert_test.py
@@ -42,7 +42,7 @@ class BertTokenizationTest(CommonTestCases.CommonTokenizerTester):
             vocab_writer.write("".join([x + "\n" for x in vocab_tokens]))
 
     def get_tokenizer(self):
-        return BertTokenizer.from_pretrained(self.tmpdirname)
+        return self.tokenizer_class.from_pretrained(self.tmpdirname)
 
     def get_input_output_texts(self):
         input_text = u"UNwant\u00E9d,running"
@@ -50,7 +50,7 @@ class BertTokenizationTest(CommonTestCases.CommonTokenizerTester):
         return input_text, output_text
 
     def test_full_tokenizer(self):
-        tokenizer = BertTokenizer(self.vocab_file)
+        tokenizer = self.tokenizer_class(self.vocab_file)
 
         tokens = tokenizer.tokenize(u"UNwant\u00E9d,running")
         self.assertListEqual(tokens, ["un", "##want", "##ed", ",", "runn", "##ing"])
@@ -126,7 +126,7 @@ class BertTokenizationTest(CommonTestCases.CommonTokenizerTester):
         self.assertFalse(_is_punctuation(u" "))
 
     def test_sequence_builders(self):
-        tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
+        tokenizer = self.tokenizer_class.from_pretrained("bert-base-uncased")
 
         text = tokenizer.encode("sequence builders")
         text_2 = tokenizer.encode("multi-sequence build")
diff --git a/pytorch_transformers/tests/tokenization_dilbert_test.py b/pytorch_transformers/tests/tokenization_dilbert_test.py
new file mode 100644
index 0000000000..4cc7aa6c88
--- /dev/null
+++ b/pytorch_transformers/tests/tokenization_dilbert_test.py
@@ -0,0 +1,46 @@
+# coding=utf-8
+# Copyright 2018 The Google AI Language Team Authors.
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+#     http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+from __future__ import absolute_import, division, print_function, unicode_literals
+
+import os
+import unittest
+from io import open
+
+from pytorch_transformers.tokenization_dilbert import (DilBertTokenizer)
+
+from .tokenization_tests_commons import CommonTestCases
+from .tokenization_bert_test import BertTokenizationTest
+
+class DilBertTokenizationTest(BertTokenizationTest):
+
+    tokenizer_class = DilBertTokenizer
+
+    def get_tokenizer(self):
+        return DilBertTokenizer.from_pretrained(self.tmpdirname)
+
+    def test_sequence_builders(self):
+        tokenizer = DilBertTokenizer.from_pretrained("dilbert-base-uncased")
+
+        text = tokenizer.encode("sequence builders")
+        text_2 = tokenizer.encode("multi-sequence build")
+
+        encoded_sentence = tokenizer.add_special_tokens_single_sentence(text)
+        encoded_pair = tokenizer.add_special_tokens_sentences_pair(text, text_2)
+
+        assert encoded_sentence == [101] + text + [102]
+        assert encoded_pair == [101] + text + [102] + text_2 + [102]
+
+if __name__ == '__main__':
+    unittest.main()
diff --git a/pytorch_transformers/tokenization_dilbert.py b/pytorch_transformers/tokenization_dilbert.py
new file mode 100644
index 0000000000..8d71e1b486
--- /dev/null
+++ b/pytorch_transformers/tokenization_dilbert.py
@@ -0,0 +1,62 @@
+# coding=utf-8
+# Copyright 2018 The Google AI Language Team Authors and The HuggingFace Inc. team.
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+#     http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+"""Tokenization classes for DilBERT."""
+
+from __future__ import absolute_import, division, print_function, unicode_literals
+
+import collections
+import logging
+import os
+import unicodedata
+from io import open
+
+from .tokenization_bert import BertTokenizer
+
+logger = logging.getLogger(__name__)
+
+VOCAB_FILES_NAMES = {'vocab_file': 'vocab.txt'}
+
+PRETRAINED_VOCAB_FILES_MAP = {
+    'vocab_file':
+    {
+        'dilbert-base-uncased': "https://s3.amazonaws.com/models.huggingface.co/bert/bert-base-uncased-vocab.txt",
+        'dilbert-base-uncased-distilled-squad': "https://s3.amazonaws.com/models.huggingface.co/bert/bert-large-uncased-vocab.txt",
+    }
+}
+
+PRETRAINED_POSITIONAL_EMBEDDINGS_SIZES = {
+    'dilbert-base-uncased': 512,
+    'dilbert-base-uncased-distilled-squad': 512,
+}
+
+
+class DilBertTokenizer(BertTokenizer):
+    r"""
+    Constructs a DilBertTokenizer.
+    :class:`~pytorch_transformers.DilBertTokenizer` is identical to BertTokenizer and runs end-to-end tokenization: punctuation splitting + wordpiece
+
+    Args:
+        vocab_file: Path to a one-wordpiece-per-line vocabulary file
+        do_lower_case: Whether to lower case the input. Only has an effect when do_wordpiece_only=False
+        do_basic_tokenize: Whether to do basic tokenization before wordpiece.
+        max_len: An artificial maximum length to truncate tokenized sequences to; Effective maximum length is always the
+            minimum of this value (if specified) and the underlying BERT model's sequence length.
+        never_split: List of tokens which will never be split during tokenization. Only has an effect when
+            do_wordpiece_only=False
+    """
+
+    vocab_files_names = VOCAB_FILES_NAMES
+    pretrained_vocab_files_map = PRETRAINED_VOCAB_FILES_MAP
+    max_model_input_sizes = PRETRAINED_POSITIONAL_EMBEDDINGS_SIZES

From c9bce1811ce8d63f2cd2f28b47ec9cc2196384e7 Mon Sep 17 00:00:00 2001
From: thomwolf <thomwolf@gmail.com>
Date: Wed, 28 Aug 2019 13:22:45 +0200
Subject: [PATCH 183/200] fixing model to add torchscript, embedding resizing,
 head pruning and masking + tests

---
 pytorch_transformers/modeling_bert.py         |   2 +-
 pytorch_transformers/modeling_dilbert.py      | 371 ++++++++++++------
 .../tests/modeling_dilbert_test.py            |  18 +-
 3 files changed, 253 insertions(+), 138 deletions(-)

diff --git a/pytorch_transformers/modeling_bert.py b/pytorch_transformers/modeling_bert.py
index badec992c3..560c4f1086 100644
--- a/pytorch_transformers/modeling_bert.py
+++ b/pytorch_transformers/modeling_bert.py
@@ -449,7 +449,7 @@ class BertEncoder(nn.Module):
             outputs = outputs + (all_hidden_states,)
         if self.output_attentions:
             outputs = outputs + (all_attentions,)
-        return outputs  # outputs, (hidden states), (attentions)
+        return outputs  # last-layer hidden state, (all hidden states), (all attentions)
 
 
 class BertPooler(nn.Module):
diff --git a/pytorch_transformers/modeling_dilbert.py b/pytorch_transformers/modeling_dilbert.py
index 2f3ea1c535..867ba0e6a8 100644
--- a/pytorch_transformers/modeling_dilbert.py
+++ b/pytorch_transformers/modeling_dilbert.py
@@ -1,5 +1,5 @@
 # coding=utf-8
-# Copyright 2019-present, the HuggingFace Inc. team.
+# Copyright 2019-present, the HuggingFace Inc. team and Facebook, Inc.
 #
 # Licensed under the Apache License, Version 2.0 (the "License");
 # you may not use this file except in compliance with the License.
@@ -30,7 +30,7 @@ import numpy as np
 import torch
 import torch.nn as nn
 
-from pytorch_transformers.modeling_utils import PretrainedConfig, PreTrainedModel, add_start_docstrings
+from pytorch_transformers.modeling_utils import PretrainedConfig, PreTrainedModel, add_start_docstrings, prune_linear_layer
 
 import logging
 logger = logging.getLogger(__name__)
@@ -92,6 +92,17 @@ class DilBertConfig(PretrainedConfig):
         else:
             raise ValueError("First argument must be either a vocabulary size (int)"
                              " or the path to a pretrained model config file (str)")
+    @property
+    def hidden_size(self):
+        return self.hidden_dim
+
+    @property
+    def num_attention_heads(self):
+        return self.n_heads
+
+    @property
+    def num_hidden_layers(self):
+        return self.n_layers
 
 
 ### UTILS AND BUILDING BLOCKS OF THE ARCHITECTURE ###
@@ -163,11 +174,30 @@ class MultiHeadSelfAttention(nn.Module):
         self.v_lin = nn.Linear(in_features=config.dim, out_features=config.dim)
         self.out_lin = nn.Linear(in_features=config.dim, out_features=config.dim)
 
+    def prune_heads(self, heads):
+        attention_head_size = self.dim // self.n_heads
+        if len(heads) == 0:
+            return
+        mask = torch.ones(self.n_heads, attention_head_size)
+        for head in heads:
+            mask[head] = 0
+        mask = mask.view(-1).contiguous().eq(1)
+        index = torch.arange(len(mask))[mask].long()
+        # Prune linear layers
+        self.q_lin = prune_linear_layer(self.q_lin, index)
+        self.k_lin = prune_linear_layer(self.k_lin, index)
+        self.v_lin = prune_linear_layer(self.v_lin, index)
+        self.out_lin = prune_linear_layer(self.out_lin, index, dim=1)
+        # Update hyper params
+        self.n_heads = self.n_heads - len(heads)
+        self.dim = attention_head_size * self.n_heads
+
     def forward(self,
                 query: torch.tensor,
                 key: torch.tensor,
                 value: torch.tensor,
-                mask: torch.tensor):
+                mask: torch.tensor,
+                head_mask: torch.tensor = None):
         """
         Parameters
         ----------
@@ -185,10 +215,10 @@ class MultiHeadSelfAttention(nn.Module):
         """
         bs, q_length, dim = query.size()
         k_length = key.size(1)
-        assert dim == self.dim, 'Dimensions do not match: %s input vs %s configured' % (dim, self.dim)
-        assert key.size() == value.size()
+        # assert dim == self.dim, 'Dimensions do not match: %s input vs %s configured' % (dim, self.dim)
+        # assert key.size() == value.size()
 
-        dim_per_head = dim // self.n_heads
+        dim_per_head = self.dim // self.n_heads
 
         assert 2 <= mask.dim() <= 3
         causal = (mask.dim() == 3)
@@ -200,7 +230,7 @@ class MultiHeadSelfAttention(nn.Module):
 
         def unshape(x):
             """ group heads """
-            return x.transpose(1, 2).contiguous().view(bs, -1, dim)
+            return x.transpose(1, 2).contiguous().view(bs, -1, self.n_heads * dim_per_head)
 
         q = shape(self.q_lin(query))           # (bs, n_heads, q_length, dim_per_head)
         k = shape(self.k_lin(key))             # (bs, n_heads, k_length, dim_per_head)
@@ -213,6 +243,11 @@ class MultiHeadSelfAttention(nn.Module):
 
         weights = nn.Softmax(dim=-1)(scores)   # (bs, n_heads, q_length, k_length)
         weights = self.dropout(weights)        # (bs, n_heads, q_length, k_length)
+
+        # Mask heads if we want to
+        if head_mask is not None:
+            weights = weights * head_mask
+
         context = torch.matmul(weights, v)     # (bs, n_heads, q_length, dim_per_head)
         context = unshape(context)             # (bs, q_length, dim)
         context = self.out_lin(context)        # (bs, q_length, dim)
@@ -229,7 +264,7 @@ class FFN(nn.Module):
         self.dropout = nn.Dropout(p=config.dropout)
         self.lin1 = nn.Linear(in_features=config.dim, out_features=config.hidden_dim)
         self.lin2 = nn.Linear(in_features=config.hidden_dim, out_features=config.dim)
-        assert config.activation in ['relu', 'gelu'], ValueError(f"activation ({config.activation}) must be in ['relu', 'gelu']")
+        assert config.activation in ['relu', 'gelu'], "activation ({}) must be in ['relu', 'gelu']".format(config.activation)
         self.activation = gelu if config.activation == 'gelu' else nn.ReLU()
 
     def forward(self,
@@ -262,7 +297,8 @@ class TransformerBlock(nn.Module):
 
     def forward(self,
                 x: torch.tensor,
-                attn_mask: torch.tensor = None):
+                attn_mask: torch.tensor = None,
+                head_mask: torch.tensor = None):
         """
         Parameters
         ----------
@@ -277,7 +313,7 @@ class TransformerBlock(nn.Module):
             The output of the transformer block contextualization.
         """
         # Self-Attention
-        sa_output = self.attention(query=x, key=x, value=x, mask=attn_mask)
+        sa_output = self.attention(query=x, key=x, value=x, mask=attn_mask, head_mask=head_mask)
         if self.output_attentions:
             sa_output, sa_weights = sa_output                  # (bs, seq_length, dim), (bs, n_heads, seq_length, seq_length)
         else: # To handle these `output_attention` or `output_hidden_states` cases returning tuples
@@ -294,6 +330,7 @@ class TransformerBlock(nn.Module):
             output = (sa_weights,) + output
         return output
 
+
 class Transformer(nn.Module):
     def __init__(self,
                  config):
@@ -307,7 +344,8 @@ class Transformer(nn.Module):
 
     def forward(self,
                 x: torch.tensor,
-                attn_mask: torch.tensor = None):
+                attn_mask: torch.tensor = None,
+                head_mask: torch.tensor = None):
         """
         Parameters
         ----------
@@ -331,14 +369,24 @@ class Transformer(nn.Module):
         all_attentions = ()
 
         hidden_state = x
-        for _, layer_module in enumerate(self.layer):
-            hidden_state = layer_module(x=hidden_state, attn_mask=attn_mask)
+        for i, layer_module in enumerate(self.layer):
+            if self.output_hidden_states:
+                all_hidden_states = all_hidden_states + (hidden_state,)
+
+            layer_outputs = layer_module(x=hidden_state,
+                                         attn_mask=attn_mask,
+                                         head_mask=head_mask[i])
+            hidden_state = layer_outputs[-1]
+
             if self.output_attentions:
-                attentions, hidden_state = hidden_state
+                assert len(layer_outputs) == 2
+                attentions = layer_outputs[0]
                 all_attentions = all_attentions + (attentions,)
-            else: # To handle these `output_attention` or `output_hidden_states` cases returning tuples
-                assert type(hidden_state) == tuple
-                hidden_state = hidden_state[0]
+            else:
+                assert len(layer_outputs) == 1
+
+        # Add last layer
+        if self.output_hidden_states:
             all_hidden_states = all_hidden_states + (hidden_state,)
 
         outputs = (hidden_state,)
@@ -346,7 +394,7 @@ class Transformer(nn.Module):
             outputs = outputs + (all_hidden_states,)
         if self.output_attentions:
             outputs = outputs + (all_attentions,)
-        return outputs
+        return outputs  # last-layer hidden state, (all hidden states), (all attentions)
 
 
 ### INTERFACE FOR ENCODER AND TASK SPECIFIC MODEL ###
@@ -378,9 +426,21 @@ class DilBertPreTrainedModel(PreTrainedModel):
 
 
 DILBERT_START_DOCSTRING = r"""
-    Smaller, faster, cheaper, lighter: DilBERT
+    DilBERT is a small, fast, cheap and light Transformer model
+    trained by distilling Bert base. It has 40% less parameters than
+    `bert-base-uncased`, runs 60% faster while preserving over 95% of
+    Bert's performances as measured on the GLUE language understanding benchmark.
 
-    For more information on DilBERT, you should check TODO(Link): Link to Medium
+    Here are the differences between the interface of Bert and DilBert:
+
+    - DilBert doesn't have `token_type_ids`, you don't need to indicate which token belong to which segment. Just separate your segments with the separation token `tokenizer.sep_token` (or `[SEP]`)
+    - DilBert doesn't have options to select the input positions (`position_ids` input). This could be added if necessary though, just let's us know if you need this option.
+
+    For more information on DilBERT, please refer to our
+    `detailed blog post`_
+    
+    .. _`detailed blog post`:
+        https://medium.com/huggingface/smaller-faster-cheaper-lighter-introducing-dilbert-a-distilled-version-of-bert-8cf3380435b5
 
     Parameters:
         config (:class:`~pytorch_transformers.DilBertConfig`): Model configuration class with all the parameters of the model. 
@@ -399,31 +459,35 @@ DILBERT_INPUTS_DOCSTRING = r"""
             Mask to avoid performing attention on padding token indices.
             Mask values selected in ``[0, 1]``:
             ``1`` for tokens that are NOT MASKED, ``0`` for MASKED tokens.
+        **head_mask**: (`optional`) ``torch.FloatTensor`` of shape ``(num_heads,)`` or ``(num_layers, num_heads)``:
+            Mask to nullify selected heads of the self-attention modules.
+            Mask values selected in ``[0, 1]``:
+            ``1`` indicates the head is **not masked**, ``0`` indicates the head is **masked**.
 """
 
 @add_start_docstrings("The bare DilBERT encoder/transformer outputing raw hidden-states without any specific head on top.",
                       DILBERT_START_DOCSTRING, DILBERT_INPUTS_DOCSTRING)
 class DilBertModel(DilBertPreTrainedModel):
     r"""
-        Parameters
-        ----------
-        input_ids: torch.tensor(bs, seq_length)
-            Sequences of token ids.
-        attention_mask: torch.tensor(bs, seq_length)
-            Attention mask on the sequences. Optional: If None, it's like there was no padding.
-        
-        Outputs
-        -------
-        hidden_state: torch.tensor(bs, seq_length, dim)
-            Sequence of hiddens states in the last (top) layer
-        pooled_output: torch.tensor(bs, dim)
-            Pooled output: for DilBert, the pooled output is simply the hidden state of the [CLS] token.
-        all_hidden_states: Tuple[torch.tensor(bs, seq_length, dim)]
-            Tuple of length n_layers with the hidden states from each layer.
-            Optional: only if output_hidden_states=True
-        all_attentions: Tuple[torch.tensor(bs, n_heads, seq_length, seq_length)]
-            Tuple of length n_layers with the attention weights from each layer
-            Optional: only if output_attentions=True
+    Outputs: `Tuple` comprising various elements depending on the configuration (config) and inputs:
+        **last_hidden_state**: ``torch.FloatTensor`` of shape ``(batch_size, sequence_length, hidden_size)``
+            Sequence of hidden-states at the output of the last layer of the model.
+        **hidden_states**: (`optional`, returned when ``config.output_hidden_states=True``)
+            list of ``torch.FloatTensor`` (one for the output of each layer + the output of the embeddings)
+            of shape ``(batch_size, sequence_length, hidden_size)``:
+            Hidden-states of the model at the output of each layer plus the initial embedding outputs.
+        **attentions**: (`optional`, returned when ``config.output_attentions=True``)
+            list of ``torch.FloatTensor`` (one for each layer) of shape ``(batch_size, num_heads, sequence_length, sequence_length)``:
+            Attentions weights after the attention softmax, used to compute the weighted average in the self-attention heads.
+
+    Examples::
+
+        tokenizer = DilBertTokenizer.from_pretrained('dilbert-base-uncased')
+        model = DilBertModel.from_pretrained('dilbert-base-uncased')
+        input_ids = torch.tensor(tokenizer.encode("Hello, my dog is cute")).unsqueeze(0)  # Batch size 1
+        outputs = model(input_ids)
+        last_hidden_states = outputs[0]  # The last hidden-state is the first element of the output tuple
+
     """
     def __init__(self, config):
         super(DilBertModel, self).__init__(config)
@@ -433,47 +497,83 @@ class DilBertModel(DilBertPreTrainedModel):
 
         self.apply(self.init_weights)
 
+    def _resize_token_embeddings(self, new_num_tokens):
+        old_embeddings = self.embeddings.word_embeddings
+        new_embeddings = self._get_resized_embeddings(old_embeddings, new_num_tokens)
+        self.embeddings.word_embeddings = new_embeddings
+        return self.embeddings.word_embeddings
+
+    def _prune_heads(self, heads_to_prune):
+        """ Prunes heads of the model.
+            heads_to_prune: dict of {layer_num: list of heads to prune in this layer}
+            See base class PreTrainedModel
+        """
+        for layer, heads in heads_to_prune.items():
+            self.transformer.layer[layer].attention.prune_heads(heads)
+
     def forward(self,
                 input_ids: torch.tensor,
-                attention_mask: torch.tensor = None):
+                attention_mask: torch.tensor = None,
+                head_mask: torch.tensor = None):
         if attention_mask is None:
             attention_mask = torch.ones_like(input_ids) # (bs, seq_length)
 
+        # Prepare head mask if needed
+        # 1.0 in head_mask indicate we keep the head
+        # attention_probs has shape bsz x n_heads x N x N
+        # input head_mask has shape [num_heads] or [num_hidden_layers x num_heads]
+        # and head_mask is converted to shape [num_hidden_layers x batch x num_heads x seq_length x seq_length]
+        if head_mask is not None:
+            if head_mask.dim() == 1:
+                head_mask = head_mask.unsqueeze(0).unsqueeze(0).unsqueeze(-1).unsqueeze(-1)
+                head_mask = head_mask.expand(self.config.num_hidden_layers, -1, -1, -1, -1)
+            elif head_mask.dim() == 2:
+                head_mask = head_mask.unsqueeze(1).unsqueeze(-1).unsqueeze(-1)  # We can specify head_mask for each layer
+            head_mask = head_mask.to(dtype=next(self.parameters()).dtype) # switch to fload if need + fp16 compatibility
+        else:
+            head_mask = [None] * self.config.num_hidden_layers
+
         embedding_output = self.embeddings(input_ids)   # (bs, seq_length, dim)
         tfmr_output = self.transformer(x=embedding_output,
-                                       attn_mask=attention_mask)
+                                       attn_mask=attention_mask,
+                                       head_mask=head_mask)
         hidden_state = tfmr_output[0]
-        pooled_output = hidden_state[:, 0]
-        output = (hidden_state, pooled_output) + tfmr_output[1:]
+        output = (hidden_state, ) + tfmr_output[1:]
+
+        return output # last-layer hidden-state, (all hidden_states), (all attentions)
 
-        return output # hidden_state, pooled_output, (hidden_states), (attentions)
 
 @add_start_docstrings("""DilBert Model with a `masked language modeling` head on top. """,
                       DILBERT_START_DOCSTRING, DILBERT_INPUTS_DOCSTRING)
 class DilBertForMaskedLM(DilBertPreTrainedModel):
     r"""
-        Parameters
-        ----------
-        input_ids: torch.tensor(bs, seq_length)
-            Token ids.
-        attention_mask: torch.tensor(bs, seq_length)
-            Attention mask. Optional: If None, it's like there was no padding.
-        masked_lm_labels: torch.tensor(bs, seq_length)
-            The masked language modeling labels. Optional: If None, no loss is computed.
+        **masked_lm_labels**: (`optional`) ``torch.LongTensor`` of shape ``(batch_size, sequence_length)``:
+            Labels for computing the masked language modeling loss.
+            Indices should be in ``[-1, 0, ..., config.vocab_size]`` (see ``input_ids`` docstring)
+            Tokens with indices set to ``-1`` are ignored (masked), the loss is only computed for the tokens with labels
+            in ``[0, ..., config.vocab_size]``
+
+    Outputs: `Tuple` comprising various elements depending on the configuration (config) and inputs:
+        **loss**: (`optional`, returned when ``masked_lm_labels`` is provided) ``torch.FloatTensor`` of shape ``(1,)``:
+            Masked language modeling loss.
+        **prediction_scores**: ``torch.FloatTensor`` of shape ``(batch_size, sequence_length, config.vocab_size)``
+            Prediction scores of the language modeling head (scores for each vocabulary token before SoftMax).
+        **hidden_states**: (`optional`, returned when ``config.output_hidden_states=True``)
+            list of ``torch.FloatTensor`` (one for the output of each layer + the output of the embeddings)
+            of shape ``(batch_size, sequence_length, hidden_size)``:
+            Hidden-states of the model at the output of each layer plus the initial embedding outputs.
+        **attentions**: (`optional`, returned when ``config.output_attentions=True``)
+            list of ``torch.FloatTensor`` (one for each layer) of shape ``(batch_size, num_heads, sequence_length, sequence_length)``:
+            Attentions weights after the attention softmax, used to compute the weighted average in the self-attention heads.
+
+    Examples::
+
+        tokenizer = DilBertTokenizer.from_pretrained('dilbert-base-uncased')
+        model = DilBertForMaskedLM.from_pretrained('dilbert-base-uncased')
+        input_ids = torch.tensor(tokenizer.encode("Hello, my dog is cute")).unsqueeze(0)  # Batch size 1
+        outputs = model(input_ids, masked_lm_labels=input_ids)
+        loss, prediction_scores = outputs[:2]
 
-        Outputs
-        -------
-        mlm_loss: torch.tensor(1,)
-            Masked Language Modeling loss to optimize. 
-            Optional: only if `masked_lm_labels` is not None
-        prediction_logits: torch.tensor(bs, seq_length, voc_size)
-            Token prediction logits
-        all_hidden_states: Tuple[torch.tensor(bs, seq_length, dim)]
-            Tuple of length n_layers with the hidden states from each layer.
-            Optional: only if `output_hidden_states`=True
-        all_attentions: Tuple[torch.tensor(bs, n_heads, seq_length, seq_length)]
-            Tuple of length n_layers with the attention weights from each layer
-            Optional: only if `output_attentions`=True
     """
     def __init__(self, config):
         super(DilBertForMaskedLM, self).__init__(config)
@@ -491,59 +591,68 @@ class DilBertForMaskedLM(DilBertPreTrainedModel):
         self.mlm_loss_fct = nn.CrossEntropyLoss(ignore_index=-1)
 
     def tie_weights(self):
+        """ Make sure we are sharing the input and output embeddings.
+            Export to TorchScript can't handle parameter sharing so we are cloning them instead.
         """
-        Tying the weights of the vocabulary projection to the base token embeddings.
-        """
-        if self.config.tie_weights_:
-            self.vocab_projector.weight = self.dilbert.embeddings.word_embeddings.weight
+        self._tie_or_clone_weights(self.vocab_projector,
+                                   self.dilbert.embeddings.word_embeddings)
 
     def forward(self,
                 input_ids: torch.tensor,
                 attention_mask: torch.tensor = None,
-                masked_lm_labels: torch.tensor = None):
+                masked_lm_labels: torch.tensor = None,
+                head_mask: torch.tensor = None):
         dlbrt_output = self.dilbert(input_ids=input_ids,
-                                    attention_mask=attention_mask)
+                                    attention_mask=attention_mask,
+                                    head_mask=head_mask)
         hidden_states = dlbrt_output[0]                              # (bs, seq_length, dim)
         prediction_logits = self.vocab_transform(hidden_states)      # (bs, seq_length, dim)
         prediction_logits = gelu(prediction_logits)                  # (bs, seq_length, dim)
         prediction_logits = self.vocab_layer_norm(prediction_logits) # (bs, seq_length, dim)
         prediction_logits = self.vocab_projector(prediction_logits)  # (bs, seq_length, vocab_size)
 
-        outputs = (prediction_logits, ) + dlbrt_output[2:]
+        outputs = (prediction_logits, ) + dlbrt_output[1:]
         if masked_lm_labels is not None:
             mlm_loss = self.mlm_loss_fct(prediction_logits.view(-1, prediction_logits.size(-1)),
                                          masked_lm_labels.view(-1))
             outputs = (mlm_loss,) + outputs     
 
-        return outputs # (mlm_loss), prediction_logits, (hidden_states), (attentions)
+        return outputs # (mlm_loss), prediction_logits, (all hidden_states), (all attentions)
+
 
 @add_start_docstrings("""DilBert Model transformer with a sequence classification/regression head on top (a linear layer on top of
                          the pooled output) e.g. for GLUE tasks. """,
                       DILBERT_START_DOCSTRING, DILBERT_INPUTS_DOCSTRING)
 class DilBertForSequenceClassification(DilBertPreTrainedModel):
     r"""
-        Parameters
-        ----------
-        input_ids: torch.tensor(bs, seq_length)
-            Token ids.
-        attention_mask: torch.tensor(bs, seq_length)
-            Attention mask. Optional: If None, it's like there was no padding.
-        labels: torch.tensor(bs,)
-            Classification Labels: Optional: If None, no loss will be computed.
-        
-        Outputs
-        -------
-        loss: torch.tensor(1)
-            Sequence classification loss.
-            Optional: Is computed only if `labels` is not None.
-        logits: torch.tensor(bs, seq_length)
-            Classification (or regression if config.num_labels==1) scores
-        all_hidden_states: Tuple[torch.tensor(bs, seq_length, dim)]
-            Tuple of length n_layers with the hidden states from each layer.
-            Optional: only if `output_hidden_states`=True
-        all_attentions: Tuple[torch.tensor(bs, n_heads, seq_length, seq_length)]
-            Tuple of length n_layers with the attention weights from each layer
-            Optional: only if `output_attentions`=True        
+        **labels**: (`optional`) ``torch.LongTensor`` of shape ``(batch_size,)``:
+            Labels for computing the sequence classification/regression loss.
+            Indices should be in ``[0, ..., config.num_labels - 1]``.
+            If ``config.num_labels == 1`` a regression loss is computed (Mean-Square loss),
+            If ``config.num_labels > 1`` a classification loss is computed (Cross-Entropy).
+
+    Outputs: `Tuple` comprising various elements depending on the configuration (config) and inputs:
+        **loss**: (`optional`, returned when ``labels`` is provided) ``torch.FloatTensor`` of shape ``(1,)``:
+            Classification (or regression if config.num_labels==1) loss.
+        **logits**: ``torch.FloatTensor`` of shape ``(batch_size, config.num_labels)``
+            Classification (or regression if config.num_labels==1) scores (before SoftMax).
+        **hidden_states**: (`optional`, returned when ``config.output_hidden_states=True``)
+            list of ``torch.FloatTensor`` (one for the output of each layer + the output of the embeddings)
+            of shape ``(batch_size, sequence_length, hidden_size)``:
+            Hidden-states of the model at the output of each layer plus the initial embedding outputs.
+        **attentions**: (`optional`, returned when ``config.output_attentions=True``)
+            list of ``torch.FloatTensor`` (one for each layer) of shape ``(batch_size, num_heads, sequence_length, sequence_length)``:
+            Attentions weights after the attention softmax, used to compute the weighted average in the self-attention heads.
+
+    Examples::
+
+        tokenizer = DilBertTokenizer.from_pretrained('dilbert-base-uncased')
+        model = DilBertForSequenceClassification.from_pretrained('dilbert-base-uncased')
+        input_ids = torch.tensor(tokenizer.encode("Hello, my dog is cute")).unsqueeze(0)  # Batch size 1
+        labels = torch.tensor([1]).unsqueeze(0)  # Batch size 1
+        outputs = model(input_ids, labels=labels)
+        loss, logits = outputs[:2]
+
     """
     def __init__(self, config):
         super(DilBertForSequenceClassification, self).__init__(config)
@@ -559,16 +668,19 @@ class DilBertForSequenceClassification(DilBertPreTrainedModel):
     def forward(self,
                 input_ids: torch.tensor,
                 attention_mask: torch.tensor = None,
-                labels: torch.tensor = None):
+                labels: torch.tensor = None,
+                head_mask: torch.tensor = None):
         dilbert_output = self.dilbert(input_ids=input_ids,
-                                      attention_mask=attention_mask)
-        pooled_output = dilbert_output[1]                    # (bs, dim)
+                                      attention_mask=attention_mask,
+                                      head_mask=head_mask)
+        hidden_state = dilbert_output[0]                    # (bs, seq_len, dim)
+        pooled_output = hidden_state[:, 0]                    # (bs, dim)
         pooled_output = self.pre_classifier(pooled_output)   # (bs, dim)
         pooled_output = nn.ReLU()(pooled_output)             # (bs, dim)
         pooled_output = self.dropout(pooled_output)         # (bs, dim)
         logits = self.classifier(pooled_output)              # (bs, dim)
 
-        outputs = (logits,) + dilbert_output[2:]
+        outputs = (logits,) + dilbert_output[1:]
         if labels is not None:
             if self.num_labels == 1:
                 loss_fct = nn.MSELoss()
@@ -580,43 +692,46 @@ class DilBertForSequenceClassification(DilBertPreTrainedModel):
 
         return outputs  # (loss), logits, (hidden_states), (attentions)
 
+
 @add_start_docstrings("""DilBert Model with a span classification head on top for extractive question-answering tasks like SQuAD (a linear layers on top of
                          the hidden-states output to compute `span start logits` and `span end logits`). """,
                       DILBERT_START_DOCSTRING, DILBERT_INPUTS_DOCSTRING)
 class DilBertForQuestionAnswering(DilBertPreTrainedModel):
     r"""
-        Parameters
-        ----------
-        input_ids: torch.tensor(bs, seq_length)
-            Token ids.
-        attention_mask: torch.tensor(bs, seq_length)
-            Attention mask. Optional: If None, it's like there was no padding.
-        start_positions: torch,tensor(bs)
+        **start_positions**: (`optional`) ``torch.LongTensor`` of shape ``(batch_size,)``:
             Labels for position (index) of the start of the labelled span for computing the token classification loss.
             Positions are clamped to the length of the sequence (`sequence_length`).
             Position outside of the sequence are not taken into account for computing the loss.
-            Optional: if None, no loss is computed.
-        end_positions: torch,tensor(bs)
+        **end_positions**: (`optional`) ``torch.LongTensor`` of shape ``(batch_size,)``:
             Labels for position (index) of the end of the labelled span for computing the token classification loss.
             Positions are clamped to the length of the sequence (`sequence_length`).
             Position outside of the sequence are not taken into account for computing the loss.
-            Optional: if None, no loss is computed.
 
-        Outputs
-        -------
-        loss: torch.tensor(1)
-            Question answering loss.
-            Optional: Is computed only if `start_positions` and `end_positions` are not None.
-        start_logits: torch.tensor(bs, seq_length)
-            Span-start scores.
-        end_logits: torch.tensor(bs, seq_length)
-            Spand-end scores.
-        all_hidden_states: Tuple[torch.tensor(bs, seq_length, dim)]
-            Tuple of length n_layers with the hidden states from each layer.
-            Optional: only if `output_hidden_states`=True
-        all_attentions: Tuple[torch.tensor(bs, n_heads, seq_length, seq_length)]
-            Tuple of length n_layers with the attention weights from each layer
-            Optional: only if `output_attentions`=True
+    Outputs: `Tuple` comprising various elements depending on the configuration (config) and inputs:
+        **loss**: (`optional`, returned when ``labels`` is provided) ``torch.FloatTensor`` of shape ``(1,)``:
+            Total span extraction loss is the sum of a Cross-Entropy for the start and end positions.
+        **start_scores**: ``torch.FloatTensor`` of shape ``(batch_size, sequence_length,)``
+            Span-start scores (before SoftMax).
+        **end_scores**: ``torch.FloatTensor`` of shape ``(batch_size, sequence_length,)``
+            Span-end scores (before SoftMax).
+        **hidden_states**: (`optional`, returned when ``config.output_hidden_states=True``)
+            list of ``torch.FloatTensor`` (one for the output of each layer + the output of the embeddings)
+            of shape ``(batch_size, sequence_length, hidden_size)``:
+            Hidden-states of the model at the output of each layer plus the initial embedding outputs.
+        **attentions**: (`optional`, returned when ``config.output_attentions=True``)
+            list of ``torch.FloatTensor`` (one for each layer) of shape ``(batch_size, num_heads, sequence_length, sequence_length)``:
+            Attentions weights after the attention softmax, used to compute the weighted average in the self-attention heads.
+
+    Examples::
+
+        tokenizer = DilBertTokenizer.from_pretrained('dilbert-base-uncased')
+        model = DilBertForQuestionAnswering.from_pretrained('dilbert-base-uncased')
+        input_ids = torch.tensor(tokenizer.encode("Hello, my dog is cute")).unsqueeze(0)  # Batch size 1
+        start_positions = torch.tensor([1])
+        end_positions = torch.tensor([3])
+        outputs = model(input_ids, start_positions=start_positions, end_positions=end_positions)
+        loss, start_scores, end_scores = outputs[:2]
+
     """
     def __init__(self, config):
         super(DilBertForQuestionAnswering, self).__init__(config)
@@ -632,9 +747,11 @@ class DilBertForQuestionAnswering(DilBertPreTrainedModel):
                 input_ids: torch.tensor,
                 attention_mask: torch.tensor = None,
                 start_positions: torch.tensor = None,
-                end_positions: torch.tensor = None):
+                end_positions: torch.tensor = None,
+                head_mask: torch.tensor = None):
         dilbert_output = self.dilbert(input_ids=input_ids,
-                                      attention_mask=attention_mask)
+                                      attention_mask=attention_mask,
+                                      head_mask=head_mask)
         hidden_states = dilbert_output[0]                                 # (bs, max_query_len, dim)
 
         hidden_states = self.dropout(hidden_states)                       # (bs, max_query_len, dim)
@@ -643,7 +760,7 @@ class DilBertForQuestionAnswering(DilBertPreTrainedModel):
         start_logits = start_logits.squeeze(-1)                           # (bs, max_query_len)
         end_logits = end_logits.squeeze(-1)                               # (bs, max_query_len)
 
-        outputs = (start_logits, end_logits,) + dilbert_output[2:]
+        outputs = (start_logits, end_logits,) + dilbert_output[1:]
         if start_positions is not None and end_positions is not None:
             # If we are on multi-GPU, split add a dimension
             if len(start_positions.size()) > 1:
diff --git a/pytorch_transformers/tests/modeling_dilbert_test.py b/pytorch_transformers/tests/modeling_dilbert_test.py
index 0cbef7e083..2fd707dfd8 100644
--- a/pytorch_transformers/tests/modeling_dilbert_test.py
+++ b/pytorch_transformers/tests/modeling_dilbert_test.py
@@ -21,7 +21,7 @@ import shutil
 import pytest
 
 from pytorch_transformers import (DilBertConfig, DilBertModel, DilBertForMaskedLM,
-                                     DilBertForQuestionAnswering, DilBertForSequenceClassification)
+                                  DilBertForQuestionAnswering, DilBertForSequenceClassification)
 from pytorch_transformers.modeling_dilbert import DILBERT_PRETRAINED_MODEL_ARCHIVE_MAP
 
 from .modeling_common_test import (CommonTestCases, ConfigTester, ids_tensor)
@@ -31,10 +31,10 @@ class DilBertModelTest(CommonTestCases.CommonModelTester):
 
     all_model_classes = (DilBertModel, DilBertForMaskedLM, DilBertForQuestionAnswering,
                          DilBertForSequenceClassification)
-    test_pruning = False
-    test_torchscript = False
-    test_resize_embeddings = False
-    test_head_masking = False
+    test_pruning = True
+    test_torchscript = True
+    test_resize_embeddings = True
+    test_head_masking = True
 
     class DilBertModelTester(object):
 
@@ -122,22 +122,20 @@ class DilBertModelTest(CommonTestCases.CommonModelTester):
         def create_and_check_dilbert_model(self, config, input_ids, input_mask, sequence_labels, token_labels, choice_labels):
             model = DilBertModel(config=config)
             model.eval()
-            sequence_output, pooled_output = model(input_ids, input_mask)
-            sequence_output, pooled_output = model(input_ids)
+            (sequence_output,) = model(input_ids, input_mask)
+            (sequence_output,) = model(input_ids)
 
             result = {
                 "sequence_output": sequence_output,
-                "pooled_output": pooled_output,
             }
             self.parent.assertListEqual(
                 list(result["sequence_output"].size()),
                 [self.batch_size, self.seq_length, self.hidden_size])
-            self.parent.assertListEqual(list(result["pooled_output"].size()), [self.batch_size, self.hidden_size])
 
         def create_and_check_dilbert_for_masked_lm(self, config, input_ids, input_mask, sequence_labels, token_labels, choice_labels):
             model = DilBertForMaskedLM(config=config)
             model.eval()
-            loss, prediction_scores = model(input_ids, input_mask, token_labels)
+            loss, prediction_scores = model(input_ids, attention_mask=input_mask, masked_lm_labels=token_labels)
             result = {
                 "loss": loss,
                 "prediction_scores": prediction_scores,

From 912a377e904d1ec10ce2555c80035c074ff51e12 Mon Sep 17 00:00:00 2001
From: thomwolf <thomwolf@gmail.com>
Date: Wed, 28 Aug 2019 13:59:42 +0200
Subject: [PATCH 184/200] dilbert -> distilbert

---
 README.md                                     |   2 +-
 examples/distillation/README.md               |  28 ++--
 examples/distillation/dataset.py              |   2 +-
 examples/distillation/distiller.py            |   2 +-
 .../distillation/scripts/binarized_data.py    |   2 +-
 .../scripts/extract_for_distil.py             |  22 ++--
 examples/distillation/scripts/token_counts.py |   2 +-
 examples/distillation/train.py                |  12 +-
 examples/distillation/utils.py                |   2 +-
 pytorch_transformers/__init__.py              |   8 +-
 pytorch_transformers/modeling_auto.py         |  10 +-
 ...ling_dilbert.py => modeling_distilbert.py} | 120 +++++++++---------
 .../tests/modeling_dilbert_test.py            |  50 ++++----
 .../tests/tokenization_dilbert_test.py        |  10 +-
 ..._dilbert.py => tokenization_distilbert.py} |  16 +--
 15 files changed, 144 insertions(+), 144 deletions(-)
 rename pytorch_transformers/{modeling_dilbert.py => modeling_distilbert.py} (87%)
 rename pytorch_transformers/{tokenization_dilbert.py => tokenization_distilbert.py} (75%)

diff --git a/README.md b/README.md
index de69e69788..5f69ad778f 100644
--- a/README.md
+++ b/README.md
@@ -13,7 +13,7 @@ The library currently contains PyTorch implementations, pre-trained model weight
 5. **[XLNet](https://github.com/zihangdai/xlnet/)** (from Google/CMU) released with the paper [​XLNet: Generalized Autoregressive Pretraining for Language Understanding](https://arxiv.org/abs/1906.08237) by Zhilin Yang*, Zihang Dai*, Yiming Yang, Jaime Carbonell, Ruslan Salakhutdinov, Quoc V. Le.
 6. **[XLM](https://github.com/facebookresearch/XLM/)** (from Facebook) released together with the paper [Cross-lingual Language Model Pretraining](https://arxiv.org/abs/1901.07291) by Guillaume Lample and Alexis Conneau.
 7. **[RoBERTa](https://github.com/pytorch/fairseq/tree/master/examples/roberta)** (from Facebook), released together with the paper a [Robustly Optimized BERT Pretraining Approach](https://arxiv.org/abs/1907.11692) by Yinhan Liu, Myle Ott, Naman Goyal, Jingfei Du, Mandar Joshi, Danqi Chen, Omer Levy, Mike Lewis, Luke Zettlemoyer, Veselin Stoyanov.
-8. **[DilBERT](https://github.com/huggingface/pytorch-transformers/tree/master/examples/distillation)** (from HuggingFace), released together with the blogpost [Smaller, faster, cheaper, lighter: Introducing DilBERT, a distilled version of BERT](https://medium.com/huggingface/smaller-faster-cheaper-lighter-introducing-dilbert-a-distilled-version-of-bert-8cf3380435b5
+8. **[DistilBERT](https://github.com/huggingface/pytorch-transformers/tree/master/examples/distillation)** (from HuggingFace), released together with the blogpost [Smaller, faster, cheaper, lighter: Introducing DistilBERT, a distilled version of BERT](https://medium.com/huggingface/smaller-faster-cheaper-lighter-introducing-distilbert-a-distilled-version-of-bert-8cf3380435b5
 ) by Victor Sanh, Lysandre Debut and Thomas Wolf.
 
 These implementations have been tested on several datasets (see the example scripts) and should match the performances of the original implementations (e.g. ~93 F1 on SQuAD for BERT Whole-Word-Masking, ~88 F1 on RocStories for OpenAI GPT, ~18.3 perplexity on WikiText 103 for Transformer-XL, ~0.916 Peason R coefficient on STS-B for XLNet). You can find more details on the performances in the Examples section of the [documentation](https://huggingface.co/pytorch-transformers/examples.html).
diff --git a/examples/distillation/README.md b/examples/distillation/README.md
index c037bd0c24..1b8a4f7178 100644
--- a/examples/distillation/README.md
+++ b/examples/distillation/README.md
@@ -1,33 +1,33 @@
-# DilBERT
+# DistilBERT
 
-This folder contains the original code used to train DilBERT as well as examples showcasing how to use DilBERT.
+This folder contains the original code used to train DistilBERT as well as examples showcasing how to use DistilBERT.
 
-## What is DilBERT
+## What is DistilBERT
 
-DilBERT stands for Distillated-BERT. DilBERT is a small, fast, cheap and light Transformer model based on Bert architecture. It has 40% less parameters than `bert-base-uncased`, runs 60% faster while preserving over 95% of Bert's performances as measured on the GLUE language understanding benchmark. DilBERT is trained using knowledge distillation, a technique to compress a large model called the teacher into a smaller model called the student. By distillating Bert, we obtain a smaller Transformer model that bears a lot of similarities with the original BERT model while being lighter, smaller and faster to run. DilBERT is thus an interesting option to put large-scaled trained Transformer model into production.
+DistilBERT stands for Distillated-BERT. DistilBERT is a small, fast, cheap and light Transformer model based on Bert architecture. It has 40% less parameters than `bert-base-uncased`, runs 60% faster while preserving over 95% of Bert's performances as measured on the GLUE language understanding benchmark. DistilBERT is trained using knowledge distillation, a technique to compress a large model called the teacher into a smaller model called the student. By distillating Bert, we obtain a smaller Transformer model that bears a lot of similarities with the original BERT model while being lighter, smaller and faster to run. DistilBERT is thus an interesting option to put large-scaled trained Transformer model into production.
 
-For more information on DilBERT, please refer to our [detailed blog post](https://medium.com/huggingface/smaller-faster-cheaper-lighter-introducing-dilbert-a-distilled-version-of-bert-8cf3380435b5
+For more information on DistilBERT, please refer to our [detailed blog post](https://medium.com/huggingface/smaller-faster-cheaper-lighter-introducing-distilbert-a-distilled-version-of-bert-8cf3380435b5
 ).
 
-## How to use DilBERT
+## How to use DistilBERT
 
-PyTorch-Transformers includes two pre-trained DilBERT models, currently only provided for English (we are investigating the possibility to train and release a multilingual version of DilBERT):
+PyTorch-Transformers includes two pre-trained DistilBERT models, currently only provided for English (we are investigating the possibility to train and release a multilingual version of DistilBERT):
 
-- `dilbert-base-uncased`: DilBERT English language model pretrained on the same data used to pretrain Bert (concatenation of the Toronto Book Corpus and full English Wikipedia) using distillation with the supervision of the `bert-base-uncased` version of Bert. The model has 6 layers, 768 dimension and 12 heads, totalizing 66M parameters.
-- `dilbert-base-uncased-distilled-squad`: A finetuned version of `dilbert-base-uncased` finetuned using (a second step of) knwoledge distillation on SQuAD 1.0. This model reaches a F1 score of 86.2 on the dev set (for comparison, Bert `bert-base-uncased` version reaches a 88.5 F1 score).
+- `distilbert-base-uncased`: DistilBERT English language model pretrained on the same data used to pretrain Bert (concatenation of the Toronto Book Corpus and full English Wikipedia) using distillation with the supervision of the `bert-base-uncased` version of Bert. The model has 6 layers, 768 dimension and 12 heads, totalizing 66M parameters.
+- `distilbert-base-uncased-distilled-squad`: A finetuned version of `distilbert-base-uncased` finetuned using (a second step of) knwoledge distillation on SQuAD 1.0. This model reaches a F1 score of 86.2 on the dev set (for comparison, Bert `bert-base-uncased` version reaches a 88.5 F1 score).
 
-Using DilBERT is very similar to using BERT. DilBERT share the same tokenizer as BERT's `bert-base-uncased` even though we provide a link to this tokenizer under the `DilBertTokenizer` name to have a consistent naming between the library models.
+Using DistilBERT is very similar to using BERT. DistilBERT share the same tokenizer as BERT's `bert-base-uncased` even though we provide a link to this tokenizer under the `DistilBertTokenizer` name to have a consistent naming between the library models.
 
 ```python
-tokenizer = DilBertTokenizer.from_pretrained('dilbert-base-uncased')
-model = DilBertModel.from_pretrained('dilbert-base-uncased')
+tokenizer = DistilBertTokenizer.from_pretrained('distilbert-base-uncased')
+model = DistilBertModel.from_pretrained('distilbert-base-uncased')
 
 input_ids = torch.tensor(tokenizer.encode("Hello, my dog is cute")).unsqueeze(0)
 outputs = model(input_ids)
 last_hidden_states = outputs[0]  # The last hidden-state is the first element of the output tuple
 ```
 
-## How to train DilBERT
+## How to train DistilBERT
 
 In the following, we will explain how you can train your own compressed model.
 
@@ -68,7 +68,7 @@ python train.py \
 
 By default, this will launch a training on a single GPU (even if more are available on the cluster). Other parameters are available in the command line, please look in `train.py` or run `python train.py --help` to list them.
 
-We highly encourage you to distributed training for training DilBert as the training corpus is quite large. Here's an example that runs a distributed training on a single node having 4 GPUs:
+We highly encourage you to distributed training for training DistilBert as the training corpus is quite large. Here's an example that runs a distributed training on a single node having 4 GPUs:
 
 ```bash
 export NODE_RANK=0
diff --git a/examples/distillation/dataset.py b/examples/distillation/dataset.py
index b9f58f775e..b3b76fd83c 100644
--- a/examples/distillation/dataset.py
+++ b/examples/distillation/dataset.py
@@ -13,7 +13,7 @@
 # See the License for the specific language governing permissions and
 # limitations under the License.
 """
-Dataloaders to train DilBERT.
+Dataloaders to train DistilBERT.
 """
 from typing import List
 import math
diff --git a/examples/distillation/distiller.py b/examples/distillation/distiller.py
index c2d4a9785a..e6c27fe365 100644
--- a/examples/distillation/distiller.py
+++ b/examples/distillation/distiller.py
@@ -13,7 +13,7 @@
 # See the License for the specific language governing permissions and
 # limitations under the License.
 """
-The distiller to distil DilBERT.
+The distiller to distil DistilBERT.
 """
 import os
 import math
diff --git a/examples/distillation/scripts/binarized_data.py b/examples/distillation/scripts/binarized_data.py
index c79001bb5e..d1c97bd296 100644
--- a/examples/distillation/scripts/binarized_data.py
+++ b/examples/distillation/scripts/binarized_data.py
@@ -13,7 +13,7 @@
 # See the License for the specific language governing permissions and
 # limitations under the License.
 """
-Preprocessing script before training DilBERT.
+Preprocessing script before training DistilBERT.
 """
 import argparse
 import pickle
diff --git a/examples/distillation/scripts/extract_for_distil.py b/examples/distillation/scripts/extract_for_distil.py
index 1cbf19d2cf..f3eee024ec 100644
--- a/examples/distillation/scripts/extract_for_distil.py
+++ b/examples/distillation/scripts/extract_for_distil.py
@@ -13,7 +13,7 @@
 # See the License for the specific language governing permissions and
 # limitations under the License.
 """
-Preprocessing script before training DilBERT.
+Preprocessing script before training DistilBERT.
 """
 from pytorch_transformers import BertForPreTraining
 import torch
@@ -33,32 +33,32 @@ if __name__ == '__main__':
     compressed_sd = {}
 
     for w in ['word_embeddings', 'position_embeddings']:
-        compressed_sd[f'dilbert.embeddings.{w}.weight'] = \
+        compressed_sd[f'distilbert.embeddings.{w}.weight'] = \
             state_dict[f'bert.embeddings.{w}.weight']
     for w in ['weight', 'bias']:
-        compressed_sd[f'dilbert.embeddings.LayerNorm.{w}'] = \
+        compressed_sd[f'distilbert.embeddings.LayerNorm.{w}'] = \
             state_dict[f'bert.embeddings.LayerNorm.{w}']
 
     std_idx = 0
     for teacher_idx in [0, 2, 4, 7, 9, 11]:
         for w in ['weight', 'bias']:
-            compressed_sd[f'dilbert.transformer.layer.{std_idx}.attention.q_lin.{w}'] = \
+            compressed_sd[f'distilbert.transformer.layer.{std_idx}.attention.q_lin.{w}'] = \
                 state_dict[f'bert.encoder.layer.{teacher_idx}.attention.self.query.{w}']
-            compressed_sd[f'dilbert.transformer.layer.{std_idx}.attention.k_lin.{w}'] = \
+            compressed_sd[f'distilbert.transformer.layer.{std_idx}.attention.k_lin.{w}'] = \
                 state_dict[f'bert.encoder.layer.{teacher_idx}.attention.self.key.{w}']
-            compressed_sd[f'dilbert.transformer.layer.{std_idx}.attention.v_lin.{w}'] = \
+            compressed_sd[f'distilbert.transformer.layer.{std_idx}.attention.v_lin.{w}'] = \
                 state_dict[f'bert.encoder.layer.{teacher_idx}.attention.self.value.{w}']
 
-            compressed_sd[f'dilbert.transformer.layer.{std_idx}.attention.out_lin.{w}'] = \
+            compressed_sd[f'distilbert.transformer.layer.{std_idx}.attention.out_lin.{w}'] = \
                 state_dict[f'bert.encoder.layer.{teacher_idx}.attention.output.dense.{w}']
-            compressed_sd[f'dilbert.transformer.layer.{std_idx}.sa_layer_norm.{w}'] = \
+            compressed_sd[f'distilbert.transformer.layer.{std_idx}.sa_layer_norm.{w}'] = \
                 state_dict[f'bert.encoder.layer.{teacher_idx}.attention.output.LayerNorm.{w}']
 
-            compressed_sd[f'dilbert.transformer.layer.{std_idx}.ffn.lin1.{w}'] = \
+            compressed_sd[f'distilbert.transformer.layer.{std_idx}.ffn.lin1.{w}'] = \
                 state_dict[f'bert.encoder.layer.{teacher_idx}.intermediate.dense.{w}']
-            compressed_sd[f'dilbert.transformer.layer.{std_idx}.ffn.lin2.{w}'] = \
+            compressed_sd[f'distilbert.transformer.layer.{std_idx}.ffn.lin2.{w}'] = \
                 state_dict[f'bert.encoder.layer.{teacher_idx}.output.dense.{w}']
-            compressed_sd[f'dilbert.transformer.layer.{std_idx}.output_layer_norm.{w}'] = \
+            compressed_sd[f'distilbert.transformer.layer.{std_idx}.output_layer_norm.{w}'] = \
                 state_dict[f'bert.encoder.layer.{teacher_idx}.output.LayerNorm.{w}']
         std_idx += 1
 
diff --git a/examples/distillation/scripts/token_counts.py b/examples/distillation/scripts/token_counts.py
index 2f5ed83922..eb3fb738e0 100644
--- a/examples/distillation/scripts/token_counts.py
+++ b/examples/distillation/scripts/token_counts.py
@@ -13,7 +13,7 @@
 # See the License for the specific language governing permissions and
 # limitations under the License.
 """
-Preprocessing script before training DilBERT.
+Preprocessing script before training DistilBERT.
 """
 from collections import Counter
 import argparse
diff --git a/examples/distillation/train.py b/examples/distillation/train.py
index 5af42dd8f4..712f10b47d 100644
--- a/examples/distillation/train.py
+++ b/examples/distillation/train.py
@@ -13,7 +13,7 @@
 # See the License for the specific language governing permissions and
 # limitations under the License.
 """
-Training DilBERT.
+Training DistilBERT.
 """
 import os
 import argparse
@@ -24,7 +24,7 @@ import numpy as np
 import torch
 
 from pytorch_transformers import BertTokenizer, BertForMaskedLM
-from pytorch_transformers import DilBertForMaskedLM, DilBertConfig
+from pytorch_transformers import DistilBertForMaskedLM, DistilBertConfig
 
 from distiller import Distiller
 from utils import git_log, logger, init_gpu_params, set_seed
@@ -201,13 +201,13 @@ def main():
         assert os.path.isfile(os.path.join(args.from_pretrained_config))
         logger.info(f'Loading pretrained weights from {args.from_pretrained_weights}')
         logger.info(f'Loading pretrained config from {args.from_pretrained_config}')
-        stu_architecture_config = DilBertConfig.from_json_file(args.from_pretrained_config)
-        student = DilBertForMaskedLM.from_pretrained(args.from_pretrained_weights,
+        stu_architecture_config = DistilBertConfig.from_json_file(args.from_pretrained_config)
+        student = DistilBertForMaskedLM.from_pretrained(args.from_pretrained_weights,
                                                      config=stu_architecture_config)
     else:
         args.vocab_size_or_config_json_file = args.vocab_size
-        stu_architecture_config = DilBertConfig(**vars(args))
-        student = DilBertForMaskedLM(stu_architecture_config)
+        stu_architecture_config = DistilBertConfig(**vars(args))
+        student = DistilBertForMaskedLM(stu_architecture_config)
 
 
     if args.n_gpu > 0:
diff --git a/examples/distillation/utils.py b/examples/distillation/utils.py
index 14bb0e0016..461c371898 100644
--- a/examples/distillation/utils.py
+++ b/examples/distillation/utils.py
@@ -13,7 +13,7 @@
 # See the License for the specific language governing permissions and
 # limitations under the License.
 """
-Utils to train DilBERT.
+Utils to train DistilBERT.
 """
 import git
 import json
diff --git a/pytorch_transformers/__init__.py b/pytorch_transformers/__init__.py
index 22bc4d3c21..47783057d1 100644
--- a/pytorch_transformers/__init__.py
+++ b/pytorch_transformers/__init__.py
@@ -7,7 +7,7 @@ from .tokenization_gpt2 import GPT2Tokenizer
 from .tokenization_xlnet import XLNetTokenizer, SPIECE_UNDERLINE
 from .tokenization_xlm import XLMTokenizer
 from .tokenization_roberta import RobertaTokenizer
-from .tokenization_dilbert import DilBertTokenizer
+from .tokenization_distilbert import DistilBertTokenizer
 
 from .tokenization_utils import (PreTrainedTokenizer)
 
@@ -41,9 +41,9 @@ from .modeling_xlm import (XLMConfig, XLMPreTrainedModel , XLMModel,
                            XLM_PRETRAINED_MODEL_ARCHIVE_MAP)
 from .modeling_roberta import (RobertaConfig, RobertaForMaskedLM, RobertaModel, RobertaForSequenceClassification,
                                ROBERTA_PRETRAINED_CONFIG_ARCHIVE_MAP, ROBERTA_PRETRAINED_MODEL_ARCHIVE_MAP)
-from .modeling_dilbert import (DilBertConfig, DilBertForMaskedLM, DilBertModel,
-                               DilBertForSequenceClassification, DilBertForQuestionAnswering,
-                               DILBERT_PRETRAINED_CONFIG_ARCHIVE_MAP, DILBERT_PRETRAINED_MODEL_ARCHIVE_MAP)
+from .modeling_distilbert import (DistilBertConfig, DistilBertForMaskedLM, DistilBertModel,
+                               DistilBertForSequenceClassification, DistilBertForQuestionAnswering,
+                               DISTILBERT_PRETRAINED_CONFIG_ARCHIVE_MAP, DISTILBERT_PRETRAINED_MODEL_ARCHIVE_MAP)
 from .modeling_utils import (WEIGHTS_NAME, CONFIG_NAME, TF_WEIGHTS_NAME,
                           PretrainedConfig, PreTrainedModel, prune_layer, Conv1D)
 
diff --git a/pytorch_transformers/modeling_auto.py b/pytorch_transformers/modeling_auto.py
index 7e65269926..cdacb7b552 100644
--- a/pytorch_transformers/modeling_auto.py
+++ b/pytorch_transformers/modeling_auto.py
@@ -30,7 +30,7 @@ from .modeling_transfo_xl import TransfoXLConfig, TransfoXLModel
 from .modeling_xlnet import XLNetConfig, XLNetModel
 from .modeling_xlm import XLMConfig, XLMModel
 from .modeling_roberta import RobertaConfig, RobertaModel
-from .modeling_dilbert import DilBertConfig, DilBertModel
+from .modeling_distilbert import DistilBertConfig, DistilBertModel
 
 from .modeling_utils import PreTrainedModel, SequenceSummary
 
@@ -111,8 +111,8 @@ class AutoConfig(object):
             assert unused_kwargs == {'foo': False}
 
         """
-        if 'dilbert' in pretrained_model_name_or_path:
-            return DilBertConfig.from_pretrained(pretrained_model_name_or_path, **kwargs)
+        if 'distilbert' in pretrained_model_name_or_path:
+            return DistilBertConfig.from_pretrained(pretrained_model_name_or_path, **kwargs)
         elif 'roberta' in pretrained_model_name_or_path:
             return RobertaConfig.from_pretrained(pretrained_model_name_or_path, **kwargs)
         elif 'bert' in pretrained_model_name_or_path:
@@ -228,8 +228,8 @@ class AutoModel(object):
             model = AutoModel.from_pretrained('./tf_model/bert_tf_checkpoint.ckpt.index', from_tf=True, config=config)
 
         """
-        if 'dilbert' in pretrained_model_name_or_path:
-            return DilBertModel.from_pretrained(pretrained_model_name_or_path, *model_args, **kwargs)
+        if 'distilbert' in pretrained_model_name_or_path:
+            return DistilBertModel.from_pretrained(pretrained_model_name_or_path, *model_args, **kwargs)
         elif 'roberta' in pretrained_model_name_or_path:
             return RobertaModel.from_pretrained(pretrained_model_name_or_path, *model_args, **kwargs)
         elif 'bert' in pretrained_model_name_or_path:
diff --git a/pytorch_transformers/modeling_dilbert.py b/pytorch_transformers/modeling_distilbert.py
similarity index 87%
rename from pytorch_transformers/modeling_dilbert.py
rename to pytorch_transformers/modeling_distilbert.py
index 867ba0e6a8..af77757293 100644
--- a/pytorch_transformers/modeling_dilbert.py
+++ b/pytorch_transformers/modeling_distilbert.py
@@ -13,7 +13,7 @@
 # See the License for the specific language governing permissions and
 # limitations under the License.
 """
-PyTorch DilBERT model.
+PyTorch DistilBERT model.
 """
 from __future__ import absolute_import, division, print_function, unicode_literals
 
@@ -36,19 +36,19 @@ import logging
 logger = logging.getLogger(__name__)
 
 
-DILBERT_PRETRAINED_MODEL_ARCHIVE_MAP = {
-    'dilbert-base-uncased': "https://s3.amazonaws.com/models.huggingface.co/bert/dilbert-base-uncased-pytorch_model.bin",
-    'dilbert-base-uncased-distilled-squad': "https://s3.amazonaws.com/models.huggingface.co/bert/dilbert-base-uncased-distilled-squad-pytorch_model.bin"
+DISTILBERT_PRETRAINED_MODEL_ARCHIVE_MAP = {
+    'distilbert-base-uncased': "https://s3.amazonaws.com/models.huggingface.co/bert/distilbert-base-uncased-pytorch_model.bin",
+    'distilbert-base-uncased-distilled-squad': "https://s3.amazonaws.com/models.huggingface.co/bert/distilbert-base-uncased-distilled-squad-pytorch_model.bin"
 }
 
-DILBERT_PRETRAINED_CONFIG_ARCHIVE_MAP = {
-    'dilbert-base-uncased': "https://s3.amazonaws.com/models.huggingface.co/bert/dilbert-base-uncased-config.json",
-    'dilbert-base-uncased-distilled-squad': "https://s3.amazonaws.com/models.huggingface.co/bert/dilbert-base-uncased-distilled-squad-config.json"
+DISTILBERT_PRETRAINED_CONFIG_ARCHIVE_MAP = {
+    'distilbert-base-uncased': "https://s3.amazonaws.com/models.huggingface.co/bert/distilbert-base-uncased-config.json",
+    'distilbert-base-uncased-distilled-squad': "https://s3.amazonaws.com/models.huggingface.co/bert/distilbert-base-uncased-distilled-squad-config.json"
 }
 
 
-class DilBertConfig(PretrainedConfig):
-    pretrained_config_archive_map = DILBERT_PRETRAINED_CONFIG_ARCHIVE_MAP
+class DistilBertConfig(PretrainedConfig):
+    pretrained_config_archive_map = DISTILBERT_PRETRAINED_CONFIG_ARCHIVE_MAP
 
     def __init__(self,
                  vocab_size_or_config_json_file=30522,
@@ -66,7 +66,7 @@ class DilBertConfig(PretrainedConfig):
                  qa_dropout=0.1,
                  seq_classif_dropout=0.2,
                  **kwargs):
-        super(DilBertConfig, self).__init__(**kwargs)
+        super(DistilBertConfig, self).__init__(**kwargs)
 
         if isinstance(vocab_size_or_config_json_file, str) or (sys.version_info[0] == 2
                         and isinstance(vocab_size_or_config_json_file, unicode)):
@@ -398,17 +398,17 @@ class Transformer(nn.Module):
 
 
 ### INTERFACE FOR ENCODER AND TASK SPECIFIC MODEL ###
-class DilBertPreTrainedModel(PreTrainedModel):
+class DistilBertPreTrainedModel(PreTrainedModel):
     """ An abstract class to handle weights initialization and
         a simple interface for downloading and loading pretrained models.
     """
-    config_class = DilBertConfig
-    pretrained_model_archive_map = DILBERT_PRETRAINED_MODEL_ARCHIVE_MAP
+    config_class = DistilBertConfig
+    pretrained_model_archive_map = DISTILBERT_PRETRAINED_MODEL_ARCHIVE_MAP
     load_tf_weights = None
-    base_model_prefix = "dilbert"
+    base_model_prefix = "distilbert"
 
     def __init__(self, *inputs, **kwargs):
-        super(DilBertPreTrainedModel, self).__init__(*inputs, **kwargs)
+        super(DistilBertPreTrainedModel, self).__init__(*inputs, **kwargs)
     
     def init_weights(self, module):
         """ Initialize the weights.
@@ -425,36 +425,36 @@ class DilBertPreTrainedModel(PreTrainedModel):
             module.bias.data.zero_()
 
 
-DILBERT_START_DOCSTRING = r"""
-    DilBERT is a small, fast, cheap and light Transformer model
+DISTILBERT_START_DOCSTRING = r"""
+    DistilBERT is a small, fast, cheap and light Transformer model
     trained by distilling Bert base. It has 40% less parameters than
     `bert-base-uncased`, runs 60% faster while preserving over 95% of
     Bert's performances as measured on the GLUE language understanding benchmark.
 
-    Here are the differences between the interface of Bert and DilBert:
+    Here are the differences between the interface of Bert and DistilBert:
 
-    - DilBert doesn't have `token_type_ids`, you don't need to indicate which token belong to which segment. Just separate your segments with the separation token `tokenizer.sep_token` (or `[SEP]`)
-    - DilBert doesn't have options to select the input positions (`position_ids` input). This could be added if necessary though, just let's us know if you need this option.
+    - DistilBert doesn't have `token_type_ids`, you don't need to indicate which token belong to which segment. Just separate your segments with the separation token `tokenizer.sep_token` (or `[SEP]`)
+    - DistilBert doesn't have options to select the input positions (`position_ids` input). This could be added if necessary though, just let's us know if you need this option.
 
-    For more information on DilBERT, please refer to our
+    For more information on DistilBERT, please refer to our
     `detailed blog post`_
     
     .. _`detailed blog post`:
-        https://medium.com/huggingface/smaller-faster-cheaper-lighter-introducing-dilbert-a-distilled-version-of-bert-8cf3380435b5
+        https://medium.com/huggingface/smaller-faster-cheaper-lighter-introducing-distilbert-a-distilled-version-of-bert-8cf3380435b5
 
     Parameters:
-        config (:class:`~pytorch_transformers.DilBertConfig`): Model configuration class with all the parameters of the model. 
+        config (:class:`~pytorch_transformers.DistilBertConfig`): Model configuration class with all the parameters of the model. 
             Initializing with a config file does not load the weights associated with the model, only the configuration.
             Check out the :meth:`~pytorch_transformers.PreTrainedModel.from_pretrained` method to load the model weights.
 """
 
-DILBERT_INPUTS_DOCSTRING = r"""
+DISTILBERT_INPUTS_DOCSTRING = r"""
     Inputs:
         **input_ids**L ``torch.LongTensor`` of shape ``(batch_size, sequence_length)``:
             Indices oof input sequence tokens in the vocabulary.
             The input sequences should start with `[CLS]` and `[SEP]` tokens.
             
-            For now, ONLY BertTokenizer(`bert-base-uncased`) is supported and you should use this tokenizer when using DilBERT.
+            For now, ONLY BertTokenizer(`bert-base-uncased`) is supported and you should use this tokenizer when using DistilBERT.
         **attention_mask**: (`optional`) ``torch.LongTensor`` of shape ``(batch_size, sequence_length)``:
             Mask to avoid performing attention on padding token indices.
             Mask values selected in ``[0, 1]``:
@@ -465,9 +465,9 @@ DILBERT_INPUTS_DOCSTRING = r"""
             ``1`` indicates the head is **not masked**, ``0`` indicates the head is **masked**.
 """
 
-@add_start_docstrings("The bare DilBERT encoder/transformer outputing raw hidden-states without any specific head on top.",
-                      DILBERT_START_DOCSTRING, DILBERT_INPUTS_DOCSTRING)
-class DilBertModel(DilBertPreTrainedModel):
+@add_start_docstrings("The bare DistilBERT encoder/transformer outputing raw hidden-states without any specific head on top.",
+                      DISTILBERT_START_DOCSTRING, DISTILBERT_INPUTS_DOCSTRING)
+class DistilBertModel(DistilBertPreTrainedModel):
     r"""
     Outputs: `Tuple` comprising various elements depending on the configuration (config) and inputs:
         **last_hidden_state**: ``torch.FloatTensor`` of shape ``(batch_size, sequence_length, hidden_size)``
@@ -482,15 +482,15 @@ class DilBertModel(DilBertPreTrainedModel):
 
     Examples::
 
-        tokenizer = DilBertTokenizer.from_pretrained('dilbert-base-uncased')
-        model = DilBertModel.from_pretrained('dilbert-base-uncased')
+        tokenizer = DistilBertTokenizer.from_pretrained('distilbert-base-uncased')
+        model = DistilBertModel.from_pretrained('distilbert-base-uncased')
         input_ids = torch.tensor(tokenizer.encode("Hello, my dog is cute")).unsqueeze(0)  # Batch size 1
         outputs = model(input_ids)
         last_hidden_states = outputs[0]  # The last hidden-state is the first element of the output tuple
 
     """
     def __init__(self, config):
-        super(DilBertModel, self).__init__(config)
+        super(DistilBertModel, self).__init__(config)
 
         self.embeddings = Embeddings(config)   # Embeddings
         self.transformer = Transformer(config) # Encoder
@@ -543,9 +543,9 @@ class DilBertModel(DilBertPreTrainedModel):
         return output # last-layer hidden-state, (all hidden_states), (all attentions)
 
 
-@add_start_docstrings("""DilBert Model with a `masked language modeling` head on top. """,
-                      DILBERT_START_DOCSTRING, DILBERT_INPUTS_DOCSTRING)
-class DilBertForMaskedLM(DilBertPreTrainedModel):
+@add_start_docstrings("""DistilBert Model with a `masked language modeling` head on top. """,
+                      DISTILBERT_START_DOCSTRING, DISTILBERT_INPUTS_DOCSTRING)
+class DistilBertForMaskedLM(DistilBertPreTrainedModel):
     r"""
         **masked_lm_labels**: (`optional`) ``torch.LongTensor`` of shape ``(batch_size, sequence_length)``:
             Labels for computing the masked language modeling loss.
@@ -568,19 +568,19 @@ class DilBertForMaskedLM(DilBertPreTrainedModel):
 
     Examples::
 
-        tokenizer = DilBertTokenizer.from_pretrained('dilbert-base-uncased')
-        model = DilBertForMaskedLM.from_pretrained('dilbert-base-uncased')
+        tokenizer = DistilBertTokenizer.from_pretrained('distilbert-base-uncased')
+        model = DistilBertForMaskedLM.from_pretrained('distilbert-base-uncased')
         input_ids = torch.tensor(tokenizer.encode("Hello, my dog is cute")).unsqueeze(0)  # Batch size 1
         outputs = model(input_ids, masked_lm_labels=input_ids)
         loss, prediction_scores = outputs[:2]
 
     """
     def __init__(self, config):
-        super(DilBertForMaskedLM, self).__init__(config)
+        super(DistilBertForMaskedLM, self).__init__(config)
         self.output_attentions = config.output_attentions
         self.output_hidden_states = config.output_hidden_states
 
-        self.dilbert = DilBertModel(config)
+        self.distilbert = DistilBertModel(config)
         self.vocab_transform = nn.Linear(config.dim, config.dim)
         self.vocab_layer_norm = nn.LayerNorm(config.dim, eps=1e-12)
         self.vocab_projector = nn.Linear(config.dim, config.vocab_size)
@@ -595,14 +595,14 @@ class DilBertForMaskedLM(DilBertPreTrainedModel):
             Export to TorchScript can't handle parameter sharing so we are cloning them instead.
         """
         self._tie_or_clone_weights(self.vocab_projector,
-                                   self.dilbert.embeddings.word_embeddings)
+                                   self.distilbert.embeddings.word_embeddings)
 
     def forward(self,
                 input_ids: torch.tensor,
                 attention_mask: torch.tensor = None,
                 masked_lm_labels: torch.tensor = None,
                 head_mask: torch.tensor = None):
-        dlbrt_output = self.dilbert(input_ids=input_ids,
+        dlbrt_output = self.distilbert(input_ids=input_ids,
                                     attention_mask=attention_mask,
                                     head_mask=head_mask)
         hidden_states = dlbrt_output[0]                              # (bs, seq_length, dim)
@@ -620,10 +620,10 @@ class DilBertForMaskedLM(DilBertPreTrainedModel):
         return outputs # (mlm_loss), prediction_logits, (all hidden_states), (all attentions)
 
 
-@add_start_docstrings("""DilBert Model transformer with a sequence classification/regression head on top (a linear layer on top of
+@add_start_docstrings("""DistilBert Model transformer with a sequence classification/regression head on top (a linear layer on top of
                          the pooled output) e.g. for GLUE tasks. """,
-                      DILBERT_START_DOCSTRING, DILBERT_INPUTS_DOCSTRING)
-class DilBertForSequenceClassification(DilBertPreTrainedModel):
+                      DISTILBERT_START_DOCSTRING, DISTILBERT_INPUTS_DOCSTRING)
+class DistilBertForSequenceClassification(DistilBertPreTrainedModel):
     r"""
         **labels**: (`optional`) ``torch.LongTensor`` of shape ``(batch_size,)``:
             Labels for computing the sequence classification/regression loss.
@@ -646,8 +646,8 @@ class DilBertForSequenceClassification(DilBertPreTrainedModel):
 
     Examples::
 
-        tokenizer = DilBertTokenizer.from_pretrained('dilbert-base-uncased')
-        model = DilBertForSequenceClassification.from_pretrained('dilbert-base-uncased')
+        tokenizer = DistilBertTokenizer.from_pretrained('distilbert-base-uncased')
+        model = DistilBertForSequenceClassification.from_pretrained('distilbert-base-uncased')
         input_ids = torch.tensor(tokenizer.encode("Hello, my dog is cute")).unsqueeze(0)  # Batch size 1
         labels = torch.tensor([1]).unsqueeze(0)  # Batch size 1
         outputs = model(input_ids, labels=labels)
@@ -655,10 +655,10 @@ class DilBertForSequenceClassification(DilBertPreTrainedModel):
 
     """
     def __init__(self, config):
-        super(DilBertForSequenceClassification, self).__init__(config)
+        super(DistilBertForSequenceClassification, self).__init__(config)
         self.num_labels = config.num_labels
 
-        self.dilbert = DilBertModel(config)
+        self.distilbert = DistilBertModel(config)
         self.pre_classifier = nn.Linear(config.dim, config.dim)
         self.classifier = nn.Linear(config.dim, config.num_labels)
         self.dropout = nn.Dropout(config.seq_classif_dropout)
@@ -670,17 +670,17 @@ class DilBertForSequenceClassification(DilBertPreTrainedModel):
                 attention_mask: torch.tensor = None,
                 labels: torch.tensor = None,
                 head_mask: torch.tensor = None):
-        dilbert_output = self.dilbert(input_ids=input_ids,
+        distilbert_output = self.distilbert(input_ids=input_ids,
                                       attention_mask=attention_mask,
                                       head_mask=head_mask)
-        hidden_state = dilbert_output[0]                    # (bs, seq_len, dim)
+        hidden_state = distilbert_output[0]                    # (bs, seq_len, dim)
         pooled_output = hidden_state[:, 0]                    # (bs, dim)
         pooled_output = self.pre_classifier(pooled_output)   # (bs, dim)
         pooled_output = nn.ReLU()(pooled_output)             # (bs, dim)
         pooled_output = self.dropout(pooled_output)         # (bs, dim)
         logits = self.classifier(pooled_output)              # (bs, dim)
 
-        outputs = (logits,) + dilbert_output[1:]
+        outputs = (logits,) + distilbert_output[1:]
         if labels is not None:
             if self.num_labels == 1:
                 loss_fct = nn.MSELoss()
@@ -693,10 +693,10 @@ class DilBertForSequenceClassification(DilBertPreTrainedModel):
         return outputs  # (loss), logits, (hidden_states), (attentions)
 
 
-@add_start_docstrings("""DilBert Model with a span classification head on top for extractive question-answering tasks like SQuAD (a linear layers on top of
+@add_start_docstrings("""DistilBert Model with a span classification head on top for extractive question-answering tasks like SQuAD (a linear layers on top of
                          the hidden-states output to compute `span start logits` and `span end logits`). """,
-                      DILBERT_START_DOCSTRING, DILBERT_INPUTS_DOCSTRING)
-class DilBertForQuestionAnswering(DilBertPreTrainedModel):
+                      DISTILBERT_START_DOCSTRING, DISTILBERT_INPUTS_DOCSTRING)
+class DistilBertForQuestionAnswering(DistilBertPreTrainedModel):
     r"""
         **start_positions**: (`optional`) ``torch.LongTensor`` of shape ``(batch_size,)``:
             Labels for position (index) of the start of the labelled span for computing the token classification loss.
@@ -724,8 +724,8 @@ class DilBertForQuestionAnswering(DilBertPreTrainedModel):
 
     Examples::
 
-        tokenizer = DilBertTokenizer.from_pretrained('dilbert-base-uncased')
-        model = DilBertForQuestionAnswering.from_pretrained('dilbert-base-uncased')
+        tokenizer = DistilBertTokenizer.from_pretrained('distilbert-base-uncased')
+        model = DistilBertForQuestionAnswering.from_pretrained('distilbert-base-uncased')
         input_ids = torch.tensor(tokenizer.encode("Hello, my dog is cute")).unsqueeze(0)  # Batch size 1
         start_positions = torch.tensor([1])
         end_positions = torch.tensor([3])
@@ -734,9 +734,9 @@ class DilBertForQuestionAnswering(DilBertPreTrainedModel):
 
     """
     def __init__(self, config):
-        super(DilBertForQuestionAnswering, self).__init__(config)
+        super(DistilBertForQuestionAnswering, self).__init__(config)
 
-        self.dilbert = DilBertModel(config)
+        self.distilbert = DistilBertModel(config)
         self.qa_outputs = nn.Linear(config.dim, config.num_labels)
         assert config.num_labels == 2
         self.dropout = nn.Dropout(config.qa_dropout)
@@ -749,10 +749,10 @@ class DilBertForQuestionAnswering(DilBertPreTrainedModel):
                 start_positions: torch.tensor = None,
                 end_positions: torch.tensor = None,
                 head_mask: torch.tensor = None):
-        dilbert_output = self.dilbert(input_ids=input_ids,
+        distilbert_output = self.distilbert(input_ids=input_ids,
                                       attention_mask=attention_mask,
                                       head_mask=head_mask)
-        hidden_states = dilbert_output[0]                                 # (bs, max_query_len, dim)
+        hidden_states = distilbert_output[0]                                 # (bs, max_query_len, dim)
 
         hidden_states = self.dropout(hidden_states)                       # (bs, max_query_len, dim)
         logits = self.qa_outputs(hidden_states)                           # (bs, max_query_len, 2)
@@ -760,7 +760,7 @@ class DilBertForQuestionAnswering(DilBertPreTrainedModel):
         start_logits = start_logits.squeeze(-1)                           # (bs, max_query_len)
         end_logits = end_logits.squeeze(-1)                               # (bs, max_query_len)
 
-        outputs = (start_logits, end_logits,) + dilbert_output[1:]
+        outputs = (start_logits, end_logits,) + distilbert_output[1:]
         if start_positions is not None and end_positions is not None:
             # If we are on multi-GPU, split add a dimension
             if len(start_positions.size()) > 1:
diff --git a/pytorch_transformers/tests/modeling_dilbert_test.py b/pytorch_transformers/tests/modeling_dilbert_test.py
index 2fd707dfd8..1c9d9c792d 100644
--- a/pytorch_transformers/tests/modeling_dilbert_test.py
+++ b/pytorch_transformers/tests/modeling_dilbert_test.py
@@ -20,23 +20,23 @@ import unittest
 import shutil
 import pytest
 
-from pytorch_transformers import (DilBertConfig, DilBertModel, DilBertForMaskedLM,
-                                  DilBertForQuestionAnswering, DilBertForSequenceClassification)
-from pytorch_transformers.modeling_dilbert import DILBERT_PRETRAINED_MODEL_ARCHIVE_MAP
+from pytorch_transformers import (DistilBertConfig, DistilBertModel, DistilBertForMaskedLM,
+                                  DistilBertForQuestionAnswering, DistilBertForSequenceClassification)
+from pytorch_transformers.modeling_distilbert import DISTILBERT_PRETRAINED_MODEL_ARCHIVE_MAP
 
 from .modeling_common_test import (CommonTestCases, ConfigTester, ids_tensor)
 
 
-class DilBertModelTest(CommonTestCases.CommonModelTester):
+class DistilBertModelTest(CommonTestCases.CommonModelTester):
 
-    all_model_classes = (DilBertModel, DilBertForMaskedLM, DilBertForQuestionAnswering,
-                         DilBertForSequenceClassification)
+    all_model_classes = (DistilBertModel, DistilBertForMaskedLM, DistilBertForQuestionAnswering,
+                         DistilBertForSequenceClassification)
     test_pruning = True
     test_torchscript = True
     test_resize_embeddings = True
     test_head_masking = True
 
-    class DilBertModelTester(object):
+    class DistilBertModelTester(object):
 
         def __init__(self,
                      parent,
@@ -100,7 +100,7 @@ class DilBertModelTest(CommonTestCases.CommonModelTester):
                 token_labels = ids_tensor([self.batch_size, self.seq_length], self.num_labels)
                 choice_labels = ids_tensor([self.batch_size], self.num_choices)
 
-            config = DilBertConfig(
+            config = DistilBertConfig(
                 vocab_size_or_config_json_file=self.vocab_size,
                 dim=self.hidden_size,
                 n_layers=self.num_hidden_layers,
@@ -119,8 +119,8 @@ class DilBertModelTest(CommonTestCases.CommonModelTester):
                 list(result["loss"].size()),
                 [])
 
-        def create_and_check_dilbert_model(self, config, input_ids, input_mask, sequence_labels, token_labels, choice_labels):
-            model = DilBertModel(config=config)
+        def create_and_check_distilbert_model(self, config, input_ids, input_mask, sequence_labels, token_labels, choice_labels):
+            model = DistilBertModel(config=config)
             model.eval()
             (sequence_output,) = model(input_ids, input_mask)
             (sequence_output,) = model(input_ids)
@@ -132,8 +132,8 @@ class DilBertModelTest(CommonTestCases.CommonModelTester):
                 list(result["sequence_output"].size()),
                 [self.batch_size, self.seq_length, self.hidden_size])
 
-        def create_and_check_dilbert_for_masked_lm(self, config, input_ids, input_mask, sequence_labels, token_labels, choice_labels):
-            model = DilBertForMaskedLM(config=config)
+        def create_and_check_distilbert_for_masked_lm(self, config, input_ids, input_mask, sequence_labels, token_labels, choice_labels):
+            model = DistilBertForMaskedLM(config=config)
             model.eval()
             loss, prediction_scores = model(input_ids, attention_mask=input_mask, masked_lm_labels=token_labels)
             result = {
@@ -145,8 +145,8 @@ class DilBertModelTest(CommonTestCases.CommonModelTester):
                 [self.batch_size, self.seq_length, self.vocab_size])
             self.check_loss_output(result)
 
-        def create_and_check_dilbert_for_question_answering(self, config, input_ids, input_mask, sequence_labels, token_labels, choice_labels):
-            model = DilBertForQuestionAnswering(config=config)
+        def create_and_check_distilbert_for_question_answering(self, config, input_ids, input_mask, sequence_labels, token_labels, choice_labels):
+            model = DistilBertForQuestionAnswering(config=config)
             model.eval()
             loss, start_logits, end_logits = model(input_ids, input_mask, sequence_labels, sequence_labels)
             result = {
@@ -162,9 +162,9 @@ class DilBertModelTest(CommonTestCases.CommonModelTester):
                 [self.batch_size, self.seq_length])
             self.check_loss_output(result)
 
-        def create_and_check_dilbert_for_sequence_classification(self, config, input_ids, input_mask, sequence_labels, token_labels, choice_labels):
+        def create_and_check_distilbert_for_sequence_classification(self, config, input_ids, input_mask, sequence_labels, token_labels, choice_labels):
             config.num_labels = self.num_labels
-            model = DilBertForSequenceClassification(config)
+            model = DistilBertForSequenceClassification(config)
             model.eval()
             loss, logits = model(input_ids, input_mask, sequence_labels)
             result = {
@@ -183,33 +183,33 @@ class DilBertModelTest(CommonTestCases.CommonModelTester):
             return config, inputs_dict
 
     def setUp(self):
-        self.model_tester = DilBertModelTest.DilBertModelTester(self)
-        self.config_tester = ConfigTester(self, config_class=DilBertConfig, dim=37)
+        self.model_tester = DistilBertModelTest.DistilBertModelTester(self)
+        self.config_tester = ConfigTester(self, config_class=DistilBertConfig, dim=37)
 
     def test_config(self):
         self.config_tester.run_common_tests()
 
-    def test_dilbert_model(self):
+    def test_distilbert_model(self):
         config_and_inputs = self.model_tester.prepare_config_and_inputs()
-        self.model_tester.create_and_check_dilbert_model(*config_and_inputs)
+        self.model_tester.create_and_check_distilbert_model(*config_and_inputs)
 
     def test_for_masked_lm(self):
         config_and_inputs = self.model_tester.prepare_config_and_inputs()
-        self.model_tester.create_and_check_dilbert_for_masked_lm(*config_and_inputs)
+        self.model_tester.create_and_check_distilbert_for_masked_lm(*config_and_inputs)
 
     def test_for_question_answering(self):
         config_and_inputs = self.model_tester.prepare_config_and_inputs()
-        self.model_tester.create_and_check_dilbert_for_question_answering(*config_and_inputs)
+        self.model_tester.create_and_check_distilbert_for_question_answering(*config_and_inputs)
 
     def test_for_sequence_classification(self):
         config_and_inputs = self.model_tester.prepare_config_and_inputs()
-        self.model_tester.create_and_check_dilbert_for_sequence_classification(*config_and_inputs)
+        self.model_tester.create_and_check_distilbert_for_sequence_classification(*config_and_inputs)
 
     # @pytest.mark.slow
     # def test_model_from_pretrained(self):
     #     cache_dir = "/tmp/pytorch_transformers_test/"
-    #     for model_name in list(DILBERT_PRETRAINED_MODEL_ARCHIVE_MAP.keys())[:1]:
-    #         model = DilBertModel.from_pretrained(model_name, cache_dir=cache_dir)
+    #     for model_name in list(DISTILBERT_PRETRAINED_MODEL_ARCHIVE_MAP.keys())[:1]:
+    #         model = DistilBertModel.from_pretrained(model_name, cache_dir=cache_dir)
     #         shutil.rmtree(cache_dir)
     #         self.assertIsNotNone(model)
 
diff --git a/pytorch_transformers/tests/tokenization_dilbert_test.py b/pytorch_transformers/tests/tokenization_dilbert_test.py
index 4cc7aa6c88..30268db216 100644
--- a/pytorch_transformers/tests/tokenization_dilbert_test.py
+++ b/pytorch_transformers/tests/tokenization_dilbert_test.py
@@ -18,20 +18,20 @@ import os
 import unittest
 from io import open
 
-from pytorch_transformers.tokenization_dilbert import (DilBertTokenizer)
+from pytorch_transformers.tokenization_distilbert import (DistilBertTokenizer)
 
 from .tokenization_tests_commons import CommonTestCases
 from .tokenization_bert_test import BertTokenizationTest
 
-class DilBertTokenizationTest(BertTokenizationTest):
+class DistilBertTokenizationTest(BertTokenizationTest):
 
-    tokenizer_class = DilBertTokenizer
+    tokenizer_class = DistilBertTokenizer
 
     def get_tokenizer(self):
-        return DilBertTokenizer.from_pretrained(self.tmpdirname)
+        return DistilBertTokenizer.from_pretrained(self.tmpdirname)
 
     def test_sequence_builders(self):
-        tokenizer = DilBertTokenizer.from_pretrained("dilbert-base-uncased")
+        tokenizer = DistilBertTokenizer.from_pretrained("distilbert-base-uncased")
 
         text = tokenizer.encode("sequence builders")
         text_2 = tokenizer.encode("multi-sequence build")
diff --git a/pytorch_transformers/tokenization_dilbert.py b/pytorch_transformers/tokenization_distilbert.py
similarity index 75%
rename from pytorch_transformers/tokenization_dilbert.py
rename to pytorch_transformers/tokenization_distilbert.py
index 8d71e1b486..116da41b37 100644
--- a/pytorch_transformers/tokenization_dilbert.py
+++ b/pytorch_transformers/tokenization_distilbert.py
@@ -12,7 +12,7 @@
 # WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
 # See the License for the specific language governing permissions and
 # limitations under the License.
-"""Tokenization classes for DilBERT."""
+"""Tokenization classes for DistilBERT."""
 
 from __future__ import absolute_import, division, print_function, unicode_literals
 
@@ -31,21 +31,21 @@ VOCAB_FILES_NAMES = {'vocab_file': 'vocab.txt'}
 PRETRAINED_VOCAB_FILES_MAP = {
     'vocab_file':
     {
-        'dilbert-base-uncased': "https://s3.amazonaws.com/models.huggingface.co/bert/bert-base-uncased-vocab.txt",
-        'dilbert-base-uncased-distilled-squad': "https://s3.amazonaws.com/models.huggingface.co/bert/bert-large-uncased-vocab.txt",
+        'distilbert-base-uncased': "https://s3.amazonaws.com/models.huggingface.co/bert/bert-base-uncased-vocab.txt",
+        'distilbert-base-uncased-distilled-squad': "https://s3.amazonaws.com/models.huggingface.co/bert/bert-large-uncased-vocab.txt",
     }
 }
 
 PRETRAINED_POSITIONAL_EMBEDDINGS_SIZES = {
-    'dilbert-base-uncased': 512,
-    'dilbert-base-uncased-distilled-squad': 512,
+    'distilbert-base-uncased': 512,
+    'distilbert-base-uncased-distilled-squad': 512,
 }
 
 
-class DilBertTokenizer(BertTokenizer):
+class DistilBertTokenizer(BertTokenizer):
     r"""
-    Constructs a DilBertTokenizer.
-    :class:`~pytorch_transformers.DilBertTokenizer` is identical to BertTokenizer and runs end-to-end tokenization: punctuation splitting + wordpiece
+    Constructs a DistilBertTokenizer.
+    :class:`~pytorch_transformers.DistilBertTokenizer` is identical to BertTokenizer and runs end-to-end tokenization: punctuation splitting + wordpiece
 
     Args:
         vocab_file: Path to a one-wordpiece-per-line vocabulary file

From 1dc43e56c9116dc1a200ea13512bd7788c448837 Mon Sep 17 00:00:00 2001
From: LysandreJik <lysandre.debut@reseau.eseo.fr>
Date: Wed, 28 Aug 2019 09:37:27 -0400
Subject: [PATCH 185/200] Documentation additions

---
 docs/source/index.rst                       |  1 +
 docs/source/model_doc/distilbert.rst        | 43 +++++++++++++++++++++
 docs/source/pretrained_models.rst           |  8 ++++
 pytorch_transformers/modeling_distilbert.py |  8 ++--
 4 files changed, 56 insertions(+), 4 deletions(-)
 create mode 100644 docs/source/model_doc/distilbert.rst

diff --git a/docs/source/index.rst b/docs/source/index.rst
index 37b3509fe4..1ae722e1e8 100644
--- a/docs/source/index.rst
+++ b/docs/source/index.rst
@@ -48,3 +48,4 @@ The library currently contains PyTorch implementations, pre-trained model weight
     model_doc/xlm
     model_doc/xlnet
     model_doc/roberta
+    model_doc/distilbert
diff --git a/docs/source/model_doc/distilbert.rst b/docs/source/model_doc/distilbert.rst
new file mode 100644
index 0000000000..cc156c90c2
--- /dev/null
+++ b/docs/source/model_doc/distilbert.rst
@@ -0,0 +1,43 @@
+DistilBERT
+----------------------------------------------------
+
+``DistilBertConfig``
+~~~~~~~~~~~~~~~~~~~~~
+
+.. autoclass:: pytorch_transformers.DistilBertConfig
+    :members:
+
+
+``DistilBertTokenizer``
+~~~~~~~~~~~~~~~~~~~~~
+
+.. autoclass:: pytorch_transformers.DistilBertTokenizer
+    :members:
+
+
+``DistilBertModel``
+~~~~~~~~~~~~~~~~~~~~
+
+.. autoclass:: pytorch_transformers.DistilBertModel
+    :members:
+
+
+``DistilBertForMaskedLM``
+~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+.. autoclass:: pytorch_transformers.DistilBertForMaskedLM
+    :members:
+
+
+``DistilBertForSequenceClassification``
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+.. autoclass:: pytorch_transformers.DistilBertForSequenceClassification
+    :members:
+
+
+``DistilBertForQuestionAnswering``
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+.. autoclass:: pytorch_transformers.DistilBertForQuestionAnswering
+    :members:
diff --git a/docs/source/pretrained_models.rst b/docs/source/pretrained_models.rst
index 7df70ea225..2bbb7ae7a1 100644
--- a/docs/source/pretrained_models.rst
+++ b/docs/source/pretrained_models.rst
@@ -111,5 +111,13 @@ Here is the full list of the currently provided pretrained models together with
 |                   |                                                            | | ``roberta-large`` fine-tuned on `MNLI <http://www.nyu.edu/projects/bowman/multinli/>`__.                                            |
 |                   |                                                            | (see `details <https://github.com/pytorch/fairseq/tree/master/examples/roberta>`__)                                                   |
 +-------------------+------------------------------------------------------------+---------------------------------------------------------------------------------------------------------------------------------------+
+| DistilBERT        | ``distilbert-base-uncased``                                | | 6-layer, 768-hidden, 12-heads, 66M parameters                                                                                       |
+|                   |                                                            | | The DistilBERT model distilled from the BERT model `bert-base-uncased` checkpoint                                                   |
+|                   |                                                            | (see `details <https://medium.com/@victorsanh/8cf3380435b5>`__)                                                                       |
+|                   +------------------------------------------------------------+---------------------------------------------------------------------------------------------------------------------------------------+
+|                   | ``distilbert-base-uncased-distilled-squad``                | | 6-layer, 768-hidden, 12-heads, 66M parameters                                                                                       |
+|                   |                                                            | | The DistilBERT model distilled from the BERT model `bert-base-uncased` checkpoint, with an additional linear layer.                 |
+|                   |                                                            | (see `details <https://medium.com/@victorsanh/8cf3380435b5>`__)                                                                       |
++-------------------+------------------------------------------------------------+---------------------------------------------------------------------------------------------------------------------------------------+
 
 .. <https://huggingface.co/pytorch-transformers/examples.html>`__
\ No newline at end of file
diff --git a/pytorch_transformers/modeling_distilbert.py b/pytorch_transformers/modeling_distilbert.py
index af77757293..6ae18bdb01 100644
--- a/pytorch_transformers/modeling_distilbert.py
+++ b/pytorch_transformers/modeling_distilbert.py
@@ -433,7 +433,7 @@ DISTILBERT_START_DOCSTRING = r"""
 
     Here are the differences between the interface of Bert and DistilBert:
 
-    - DistilBert doesn't have `token_type_ids`, you don't need to indicate which token belong to which segment. Just separate your segments with the separation token `tokenizer.sep_token` (or `[SEP]`)
+    - DistilBert doesn't have `token_type_ids`, you don't need to indicate which token belongs to which segment. Just separate your segments with the separation token `tokenizer.sep_token` (or `[SEP]`)
     - DistilBert doesn't have options to select the input positions (`position_ids` input). This could be added if necessary though, just let's us know if you need this option.
 
     For more information on DistilBERT, please refer to our
@@ -450,9 +450,9 @@ DISTILBERT_START_DOCSTRING = r"""
 
 DISTILBERT_INPUTS_DOCSTRING = r"""
     Inputs:
-        **input_ids**L ``torch.LongTensor`` of shape ``(batch_size, sequence_length)``:
-            Indices oof input sequence tokens in the vocabulary.
-            The input sequences should start with `[CLS]` and `[SEP]` tokens.
+        **input_ids** ``torch.LongTensor`` of shape ``(batch_size, sequence_length)``:
+            Indices of input sequence tokens in the vocabulary.
+            The input sequences should start with `[CLS]` and end with `[SEP]` tokens.
             
             For now, ONLY BertTokenizer(`bert-base-uncased`) is supported and you should use this tokenizer when using DistilBERT.
         **attention_mask**: (`optional`) ``torch.LongTensor`` of shape ``(batch_size, sequence_length)``:

From 75bc2a03cc1a533c86dbf856d5a01a35f6359ea4 Mon Sep 17 00:00:00 2001
From: LysandreJik <lysandre.debut@reseau.eseo.fr>
Date: Wed, 28 Aug 2019 10:05:15 -0400
Subject: [PATCH 186/200] Updated article link

---
 README.md                                   | 2 +-
 docs/source/pretrained_models.rst           | 4 ++--
 pytorch_transformers/modeling_distilbert.py | 2 +-
 3 files changed, 4 insertions(+), 4 deletions(-)

diff --git a/README.md b/README.md
index 5f69ad778f..dd093ebaec 100644
--- a/README.md
+++ b/README.md
@@ -13,7 +13,7 @@ The library currently contains PyTorch implementations, pre-trained model weight
 5. **[XLNet](https://github.com/zihangdai/xlnet/)** (from Google/CMU) released with the paper [​XLNet: Generalized Autoregressive Pretraining for Language Understanding](https://arxiv.org/abs/1906.08237) by Zhilin Yang*, Zihang Dai*, Yiming Yang, Jaime Carbonell, Ruslan Salakhutdinov, Quoc V. Le.
 6. **[XLM](https://github.com/facebookresearch/XLM/)** (from Facebook) released together with the paper [Cross-lingual Language Model Pretraining](https://arxiv.org/abs/1901.07291) by Guillaume Lample and Alexis Conneau.
 7. **[RoBERTa](https://github.com/pytorch/fairseq/tree/master/examples/roberta)** (from Facebook), released together with the paper a [Robustly Optimized BERT Pretraining Approach](https://arxiv.org/abs/1907.11692) by Yinhan Liu, Myle Ott, Naman Goyal, Jingfei Du, Mandar Joshi, Danqi Chen, Omer Levy, Mike Lewis, Luke Zettlemoyer, Veselin Stoyanov.
-8. **[DistilBERT](https://github.com/huggingface/pytorch-transformers/tree/master/examples/distillation)** (from HuggingFace), released together with the blogpost [Smaller, faster, cheaper, lighter: Introducing DistilBERT, a distilled version of BERT](https://medium.com/huggingface/smaller-faster-cheaper-lighter-introducing-distilbert-a-distilled-version-of-bert-8cf3380435b5
+8. **[DistilBERT](https://github.com/huggingface/pytorch-transformers/tree/master/examples/distillation)** (from HuggingFace), released together with the blogpost [Smaller, faster, cheaper, lighter: Introducing DistilBERT, a distilled version of BERT](https://medium.com/huggingface/distilbert-8cf3380435b5
 ) by Victor Sanh, Lysandre Debut and Thomas Wolf.
 
 These implementations have been tested on several datasets (see the example scripts) and should match the performances of the original implementations (e.g. ~93 F1 on SQuAD for BERT Whole-Word-Masking, ~88 F1 on RocStories for OpenAI GPT, ~18.3 perplexity on WikiText 103 for Transformer-XL, ~0.916 Peason R coefficient on STS-B for XLNet). You can find more details on the performances in the Examples section of the [documentation](https://huggingface.co/pytorch-transformers/examples.html).
diff --git a/docs/source/pretrained_models.rst b/docs/source/pretrained_models.rst
index 2bbb7ae7a1..af7702ad5d 100644
--- a/docs/source/pretrained_models.rst
+++ b/docs/source/pretrained_models.rst
@@ -113,11 +113,11 @@ Here is the full list of the currently provided pretrained models together with
 +-------------------+------------------------------------------------------------+---------------------------------------------------------------------------------------------------------------------------------------+
 | DistilBERT        | ``distilbert-base-uncased``                                | | 6-layer, 768-hidden, 12-heads, 66M parameters                                                                                       |
 |                   |                                                            | | The DistilBERT model distilled from the BERT model `bert-base-uncased` checkpoint                                                   |
-|                   |                                                            | (see `details <https://medium.com/@victorsanh/8cf3380435b5>`__)                                                                       |
+|                   |                                                            | (see `details <https://medium.com/huggingface/distilbert-8cf3380435b5>`__)                                                                       |
 |                   +------------------------------------------------------------+---------------------------------------------------------------------------------------------------------------------------------------+
 |                   | ``distilbert-base-uncased-distilled-squad``                | | 6-layer, 768-hidden, 12-heads, 66M parameters                                                                                       |
 |                   |                                                            | | The DistilBERT model distilled from the BERT model `bert-base-uncased` checkpoint, with an additional linear layer.                 |
-|                   |                                                            | (see `details <https://medium.com/@victorsanh/8cf3380435b5>`__)                                                                       |
+|                   |                                                            | (see `details <https://medium.com/huggingface/distilbert-8cf3380435b5>`__)                                                                       |
 +-------------------+------------------------------------------------------------+---------------------------------------------------------------------------------------------------------------------------------------+
 
 .. <https://huggingface.co/pytorch-transformers/examples.html>`__
\ No newline at end of file
diff --git a/pytorch_transformers/modeling_distilbert.py b/pytorch_transformers/modeling_distilbert.py
index 6ae18bdb01..4a0f3a101b 100644
--- a/pytorch_transformers/modeling_distilbert.py
+++ b/pytorch_transformers/modeling_distilbert.py
@@ -440,7 +440,7 @@ DISTILBERT_START_DOCSTRING = r"""
     `detailed blog post`_
     
     .. _`detailed blog post`:
-        https://medium.com/huggingface/smaller-faster-cheaper-lighter-introducing-distilbert-a-distilled-version-of-bert-8cf3380435b5
+        https://medium.com/huggingface/distilbert-8cf3380435b5
 
     Parameters:
         config (:class:`~pytorch_transformers.DistilBertConfig`): Model configuration class with all the parameters of the model. 

From f753d4e32bcefddd32868b9551fcf3c7908f00eb Mon Sep 17 00:00:00 2001
From: LysandreJik <lysandre.debut@reseau.eseo.fr>
Date: Wed, 28 Aug 2019 10:15:02 -0400
Subject: [PATCH 187/200] Removed typings for Python 2

---
 pytorch_transformers/modeling_distilbert.py | 55 +++++----------------
 1 file changed, 12 insertions(+), 43 deletions(-)

diff --git a/pytorch_transformers/modeling_distilbert.py b/pytorch_transformers/modeling_distilbert.py
index 4a0f3a101b..63a7485683 100644
--- a/pytorch_transformers/modeling_distilbert.py
+++ b/pytorch_transformers/modeling_distilbert.py
@@ -158,8 +158,7 @@ class Embeddings(nn.Module):
         return embeddings
 
 class MultiHeadSelfAttention(nn.Module):
-    def __init__(self,
-                 config):
+    def __init__(self, config):
         super(MultiHeadSelfAttention, self).__init__()
 
         self.n_heads = config.n_heads
@@ -192,12 +191,7 @@ class MultiHeadSelfAttention(nn.Module):
         self.n_heads = self.n_heads - len(heads)
         self.dim = attention_head_size * self.n_heads
 
-    def forward(self,
-                query: torch.tensor,
-                key: torch.tensor,
-                value: torch.tensor,
-                mask: torch.tensor,
-                head_mask: torch.tensor = None):
+    def forward(self, query, key, value, mask, head_mask = None):
         """
         Parameters
         ----------
@@ -258,8 +252,7 @@ class MultiHeadSelfAttention(nn.Module):
             return (context,)
 
 class FFN(nn.Module):
-    def __init__(self,
-                 config):
+    def __init__(self, config):
         super(FFN, self).__init__()
         self.dropout = nn.Dropout(p=config.dropout)
         self.lin1 = nn.Linear(in_features=config.dim, out_features=config.hidden_dim)
@@ -267,8 +260,7 @@ class FFN(nn.Module):
         assert config.activation in ['relu', 'gelu'], "activation ({}) must be in ['relu', 'gelu']".format(config.activation)
         self.activation = gelu if config.activation == 'gelu' else nn.ReLU()
 
-    def forward(self,
-                input: torch.tensor):
+    def forward(self, input):
         x = self.lin1(input)
         x = self.activation(x)
         x = self.lin2(x)
@@ -276,8 +268,7 @@ class FFN(nn.Module):
         return x
 
 class TransformerBlock(nn.Module):
-    def __init__(self,
-                 config):
+    def __init__(self, config):
         super(TransformerBlock, self).__init__()
 
         self.n_heads = config.n_heads
@@ -295,10 +286,7 @@ class TransformerBlock(nn.Module):
         self.ffn = FFN(config)
         self.output_layer_norm = nn.LayerNorm(normalized_shape=config.dim, eps=1e-12)
 
-    def forward(self,
-                x: torch.tensor,
-                attn_mask: torch.tensor = None,
-                head_mask: torch.tensor = None):
+    def forward(self, x, attn_mask=None, head_mask=None):
         """
         Parameters
         ----------
@@ -332,8 +320,7 @@ class TransformerBlock(nn.Module):
 
 
 class Transformer(nn.Module):
-    def __init__(self,
-                 config):
+    def __init__(self, config):
         super(Transformer, self).__init__()
         self.n_layers = config.n_layers
         self.output_attentions = config.output_attentions
@@ -342,10 +329,7 @@ class Transformer(nn.Module):
         layer = TransformerBlock(config)
         self.layer = nn.ModuleList([copy.deepcopy(layer) for _ in range(config.n_layers)])
 
-    def forward(self,
-                x: torch.tensor,
-                attn_mask: torch.tensor = None,
-                head_mask: torch.tensor = None):
+    def forward(self, x, attn_mask=None, head_mask=None):
         """
         Parameters
         ----------
@@ -512,9 +496,7 @@ class DistilBertModel(DistilBertPreTrainedModel):
             self.transformer.layer[layer].attention.prune_heads(heads)
 
     def forward(self,
-                input_ids: torch.tensor,
-                attention_mask: torch.tensor = None,
-                head_mask: torch.tensor = None):
+                input_ids, attention_mask=None, head_mask=None):
         if attention_mask is None:
             attention_mask = torch.ones_like(input_ids) # (bs, seq_length)
 
@@ -597,11 +579,7 @@ class DistilBertForMaskedLM(DistilBertPreTrainedModel):
         self._tie_or_clone_weights(self.vocab_projector,
                                    self.distilbert.embeddings.word_embeddings)
 
-    def forward(self,
-                input_ids: torch.tensor,
-                attention_mask: torch.tensor = None,
-                masked_lm_labels: torch.tensor = None,
-                head_mask: torch.tensor = None):
+    def forward(self, input_ids, attention_mask=None, masked_lm_labels=None, head_mask=None):
         dlbrt_output = self.distilbert(input_ids=input_ids,
                                     attention_mask=attention_mask,
                                     head_mask=head_mask)
@@ -665,11 +643,7 @@ class DistilBertForSequenceClassification(DistilBertPreTrainedModel):
 
         self.apply(self.init_weights)
 
-    def forward(self,
-                input_ids: torch.tensor,
-                attention_mask: torch.tensor = None,
-                labels: torch.tensor = None,
-                head_mask: torch.tensor = None):
+    def forward(self, input_ids,  attention_mask=None, labels=None, head_mask=None):
         distilbert_output = self.distilbert(input_ids=input_ids,
                                       attention_mask=attention_mask,
                                       head_mask=head_mask)
@@ -743,12 +717,7 @@ class DistilBertForQuestionAnswering(DistilBertPreTrainedModel):
 
         self.apply(self.init_weights)
         
-    def forward(self,
-                input_ids: torch.tensor,
-                attention_mask: torch.tensor = None,
-                start_positions: torch.tensor = None,
-                end_positions: torch.tensor = None,
-                head_mask: torch.tensor = None):
+    def forward(self, input_ids, attention_mask=None, start_positions=None, end_positions=None, head_mask=None):
         distilbert_output = self.distilbert(input_ids=input_ids,
                                       attention_mask=attention_mask,
                                       head_mask=head_mask)

From b5eb283aaa124c6b62927481094f2f8747813e97 Mon Sep 17 00:00:00 2001
From: thomwolf <thomwolf@gmail.com>
Date: Wed, 28 Aug 2019 16:36:55 +0200
Subject: [PATCH 188/200] update credits

---
 examples/distillation/dataset.py            | 6 +++---
 examples/distillation/distiller.py          | 6 +++---
 examples/distillation/utils.py              | 6 +++---
 pytorch_transformers/modeling_distilbert.py | 7 ++++---
 4 files changed, 13 insertions(+), 12 deletions(-)

diff --git a/examples/distillation/dataset.py b/examples/distillation/dataset.py
index b3b76fd83c..cdc16b94f3 100644
--- a/examples/distillation/dataset.py
+++ b/examples/distillation/dataset.py
@@ -1,5 +1,5 @@
 # coding=utf-8
-# Copyright 2019-present, the HuggingFace Inc. team.
+# Copyright 2019-present, the HuggingFace Inc. team and Facebook, Inc.
 #
 # Licensed under the Apache License, Version 2.0 (the "License");
 # you may not use this file except in compliance with the License.
@@ -12,8 +12,8 @@
 # WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
 # See the License for the specific language governing permissions and
 # limitations under the License.
-"""
-Dataloaders to train DistilBERT.
+""" Dataloaders to train DistilBERT
+    adapted in part from Facebook, Inc XLM model (https://github.com/facebookresearch/XLM)
 """
 from typing import List
 import math
diff --git a/examples/distillation/distiller.py b/examples/distillation/distiller.py
index e6c27fe365..38769c4b0e 100644
--- a/examples/distillation/distiller.py
+++ b/examples/distillation/distiller.py
@@ -1,5 +1,5 @@
 # coding=utf-8
-# Copyright 2019-present, the HuggingFace Inc. team.
+# Copyright 2019-present, the HuggingFace Inc. team and Facebook, Inc.
 #
 # Licensed under the Apache License, Version 2.0 (the "License");
 # you may not use this file except in compliance with the License.
@@ -12,8 +12,8 @@
 # WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
 # See the License for the specific language governing permissions and
 # limitations under the License.
-"""
-The distiller to distil DistilBERT.
+""" The distiller to distil DistilBERT
+    adapted in part from Facebook, Inc XLM model (https://github.com/facebookresearch/XLM)
 """
 import os
 import math
diff --git a/examples/distillation/utils.py b/examples/distillation/utils.py
index 461c371898..3d62504710 100644
--- a/examples/distillation/utils.py
+++ b/examples/distillation/utils.py
@@ -1,5 +1,5 @@
 # coding=utf-8
-# Copyright 2019-present, the HuggingFace Inc. team.
+# Copyright 2019-present, the HuggingFace Inc. team and Facebook, Inc.
 #
 # Licensed under the Apache License, Version 2.0 (the "License");
 # you may not use this file except in compliance with the License.
@@ -12,8 +12,8 @@
 # WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
 # See the License for the specific language governing permissions and
 # limitations under the License.
-"""
-Utils to train DistilBERT.
+""" Utils to train DistilBERT
+    adapted in part from Facebook, Inc XLM model (https://github.com/facebookresearch/XLM)
 """
 import git
 import json
diff --git a/pytorch_transformers/modeling_distilbert.py b/pytorch_transformers/modeling_distilbert.py
index 63a7485683..8ec984199a 100644
--- a/pytorch_transformers/modeling_distilbert.py
+++ b/pytorch_transformers/modeling_distilbert.py
@@ -1,5 +1,5 @@
 # coding=utf-8
-# Copyright 2019-present, the HuggingFace Inc. team and Facebook, Inc.
+# Copyright 2019-present, the HuggingFace Inc. team, The Google AI Language Team and Facebook, Inc.
 #
 # Licensed under the Apache License, Version 2.0 (the "License");
 # you may not use this file except in compliance with the License.
@@ -12,8 +12,9 @@
 # WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
 # See the License for the specific language governing permissions and
 # limitations under the License.
-"""
-PyTorch DistilBERT model.
+""" PyTorch DistilBERT model
+    adapted in part from Facebook, Inc XLM model (https://github.com/facebookresearch/XLM)
+    and in part from HuggingFace PyTorch version of Google AI Bert model (https://github.com/google-research/bert)
 """
 from __future__ import absolute_import, division, print_function, unicode_literals
 

From e7706f514bf220188e2ecaef3aa4c3a17368e89a Mon Sep 17 00:00:00 2001
From: thomwolf <thomwolf@gmail.com>
Date: Wed, 28 Aug 2019 16:37:22 +0200
Subject: [PATCH 189/200] update again

---
 pytorch_transformers/tokenization_distilbert.py | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/pytorch_transformers/tokenization_distilbert.py b/pytorch_transformers/tokenization_distilbert.py
index 116da41b37..5a6d02f98d 100644
--- a/pytorch_transformers/tokenization_distilbert.py
+++ b/pytorch_transformers/tokenization_distilbert.py
@@ -1,5 +1,5 @@
 # coding=utf-8
-# Copyright 2018 The Google AI Language Team Authors and The HuggingFace Inc. team.
+# Copyright 2018 The HuggingFace Inc. team.
 #
 # Licensed under the Apache License, Version 2.0 (the "License");
 # you may not use this file except in compliance with the License.

From ed2ab1c2205c401047f21cb6fd648cdbefe4a012 Mon Sep 17 00:00:00 2001
From: Stefan Schweter <stefan.schweter@bsb-muenchen.de>
Date: Wed, 28 Aug 2019 18:08:16 +0200
Subject: [PATCH 190/200] distilbert: fix number of hidden_size

---
 pytorch_transformers/modeling_distilbert.py | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/pytorch_transformers/modeling_distilbert.py b/pytorch_transformers/modeling_distilbert.py
index 8ec984199a..1a0bd2496c 100644
--- a/pytorch_transformers/modeling_distilbert.py
+++ b/pytorch_transformers/modeling_distilbert.py
@@ -95,7 +95,7 @@ class DistilBertConfig(PretrainedConfig):
                              " or the path to a pretrained model config file (str)")
     @property
     def hidden_size(self):
-        return self.hidden_dim
+        return self.dim
 
     @property
     def num_attention_heads(self):

From 1d15a7f2780b15d18dc0694bb76a3a6906437352 Mon Sep 17 00:00:00 2001
From: Andreas Daiminger <andreas.daiminger@gmail.com>
Date: Wed, 28 Aug 2019 19:18:27 +0200
Subject: [PATCH 191/200] swap order of optimizer.step() and scheduler.step()

---
 examples/run_squad.py | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/examples/run_squad.py b/examples/run_squad.py
index 25e2c4093f..cc4eda306c 100644
--- a/examples/run_squad.py
+++ b/examples/run_squad.py
@@ -157,8 +157,8 @@ def train(args, train_dataset, model, tokenizer):
 
             tr_loss += loss.item()
             if (step + 1) % args.gradient_accumulation_steps == 0:
-                scheduler.step()  # Update learning rate schedule
                 optimizer.step()
+                scheduler.step()  # Update learning rate schedule
                 model.zero_grad()
                 global_step += 1
 

From 9ce42dc5402502169d8bae8f69609625d2d6ef0c Mon Sep 17 00:00:00 2001
From: LysandreJik <lysandre.debut@reseau.eseo.fr>
Date: Wed, 28 Aug 2019 13:56:28 -0400
Subject: [PATCH 192/200] Pretrained models table fix

---
 docs/source/pretrained_models.rst | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/docs/source/pretrained_models.rst b/docs/source/pretrained_models.rst
index af7702ad5d..4222ee32cf 100644
--- a/docs/source/pretrained_models.rst
+++ b/docs/source/pretrained_models.rst
@@ -113,11 +113,11 @@ Here is the full list of the currently provided pretrained models together with
 +-------------------+------------------------------------------------------------+---------------------------------------------------------------------------------------------------------------------------------------+
 | DistilBERT        | ``distilbert-base-uncased``                                | | 6-layer, 768-hidden, 12-heads, 66M parameters                                                                                       |
 |                   |                                                            | | The DistilBERT model distilled from the BERT model `bert-base-uncased` checkpoint                                                   |
-|                   |                                                            | (see `details <https://medium.com/huggingface/distilbert-8cf3380435b5>`__)                                                                       |
+|                   |                                                            | (see `details <https://medium.com/huggingface/distilbert-8cf3380435b5>`__)                                                            |
 |                   +------------------------------------------------------------+---------------------------------------------------------------------------------------------------------------------------------------+
 |                   | ``distilbert-base-uncased-distilled-squad``                | | 6-layer, 768-hidden, 12-heads, 66M parameters                                                                                       |
 |                   |                                                            | | The DistilBERT model distilled from the BERT model `bert-base-uncased` checkpoint, with an additional linear layer.                 |
-|                   |                                                            | (see `details <https://medium.com/huggingface/distilbert-8cf3380435b5>`__)                                                                       |
+|                   |                                                            | (see `details <https://medium.com/huggingface/distilbert-8cf3380435b5>`__)                                                            |
 +-------------------+------------------------------------------------------------+---------------------------------------------------------------------------------------------------------------------------------------+
 
 .. <https://huggingface.co/pytorch-transformers/examples.html>`__
\ No newline at end of file

From 0a74c88ac609c03293c69b61cfa7c9b084e38cdb Mon Sep 17 00:00:00 2001
From: thomwolf <thomwolf@gmail.com>
Date: Wed, 28 Aug 2019 22:41:42 +0200
Subject: [PATCH 193/200] fix #1131

---
 pytorch_transformers/modeling_xlnet.py | 19 +++++++++++++------
 1 file changed, 13 insertions(+), 6 deletions(-)

diff --git a/pytorch_transformers/modeling_xlnet.py b/pytorch_transformers/modeling_xlnet.py
index 136f07c436..ca2d63f6b5 100644
--- a/pytorch_transformers/modeling_xlnet.py
+++ b/pytorch_transformers/modeling_xlnet.py
@@ -677,8 +677,11 @@ XLNET_INPUTS_DOCSTRING = r"""
             ``1`` for tokens that are MASKED, ``0`` for tokens that are NOT MASKED.
         **mems**: (`optional`)
             list of ``torch.FloatTensor`` (one for each layer):
-            that contains pre-computed hidden-states (key and values in the attention blocks) as computed by the model
+            that contains pre-computed hidden-states (key and values in the attention blocks) as output by the model
             (see `mems` output below). Can be used to speed up sequential decoding and attend to longer context.
+            To activate mems you need to set up config.mem_len to a positive value which will be the max number of tokens in
+            the memory output by the model. E.g. `model = XLNetModel.from_pretrained('xlnet-base-case, mem_len=1024)` will
+            instantiate a model which can use up to 1024 tokens of memory (in addition to the input it self).
         **perm_mask**: (`optional`) ``torch.FloatTensor`` of shape ``(batch_size, sequence_length, sequence_length)``:
             Mask to indicate the attention pattern for each input token with values selected in ``[0, 1]``:
             If ``perm_mask[k, i, j] = 0``, i attend to j in batch k;
@@ -705,7 +708,8 @@ class XLNetModel(XLNetPreTrainedModel):
         **mems**:
             list of ``torch.FloatTensor`` (one for each layer):
             that contains pre-computed hidden-states (key and values in the attention blocks) as computed by the model
-            (see `mems` input above). Can be used to speed up sequential decoding and attend to longer context.
+            if config.mem_len > 0 else tuple of None. Can be used to speed up sequential decoding and attend to longer context.
+            See details in the docstring of the `mems` input above.
         **hidden_states**: (`optional`, returned when ``config.output_hidden_states=True``)
             list of ``torch.FloatTensor`` (one for the output of each layer + the output of the embeddings)
             of shape ``(batch_size, sequence_length, hidden_size)``:
@@ -859,7 +863,7 @@ class XLNetModel(XLNetPreTrainedModel):
         target_mapping = target_mapping.permute(1, 2, 0).contiguous() if target_mapping is not None else None
 
         qlen, bsz = input_ids.shape[0], input_ids.shape[1]
-        mlen = mems[0].shape[0] if mems is not None else 0
+        mlen = mems[0].shape[0] if mems is not None and mems[0] is not None else 0
         klen = mlen + qlen
 
         dtype_float = next(self.parameters()).dtype
@@ -1011,7 +1015,8 @@ class XLNetLMHeadModel(XLNetPreTrainedModel):
         **mems**:
             list of ``torch.FloatTensor`` (one for each layer):
             that contains pre-computed hidden-states (key and values in the attention blocks) as computed by the model
-            (see `mems` input above). Can be used to speed up sequential decoding and attend to longer context.
+            if config.mem_len > 0 else tuple of None. Can be used to speed up sequential decoding and attend to longer context.
+            See details in the docstring of the `mems` input above.
         **hidden_states**: (`optional`, returned when ``config.output_hidden_states=True``)
             list of ``torch.FloatTensor`` (one for the output of each layer + the output of the embeddings)
             of shape ``(batch_size, sequence_length, hidden_size)``:
@@ -1091,7 +1096,8 @@ class XLNetForSequenceClassification(XLNetPreTrainedModel):
         **mems**:
             list of ``torch.FloatTensor`` (one for each layer):
             that contains pre-computed hidden-states (key and values in the attention blocks) as computed by the model
-            (see `mems` input above). Can be used to speed up sequential decoding and attend to longer context.
+            if config.mem_len > 0 else tuple of None. Can be used to speed up sequential decoding and attend to longer context.
+            See details in the docstring of the `mems` input above.
         **hidden_states**: (`optional`, returned when ``config.output_hidden_states=True``)
             list of ``torch.FloatTensor`` (one for the output of each layer + the output of the embeddings)
             of shape ``(batch_size, sequence_length, hidden_size)``:
@@ -1189,7 +1195,8 @@ class XLNetForQuestionAnswering(XLNetPreTrainedModel):
         **mems**:
             list of ``torch.FloatTensor`` (one for each layer):
             that contains pre-computed hidden-states (key and values in the attention blocks) as computed by the model
-            (see `mems` input above). Can be used to speed up sequential decoding and attend to longer context.
+            if config.mem_len > 0 else tuple of None. Can be used to speed up sequential decoding and attend to longer context.
+            See details in the docstring of the `mems` input above.
         **hidden_states**: (`optional`, returned when ``config.output_hidden_states=True``)
             list of ``torch.FloatTensor`` (one for the output of each layer + the output of the embeddings)
             of shape ``(batch_size, sequence_length, hidden_size)``:

From bf3dc778b82d62cd407cbd9658f0f97a9c2d519f Mon Sep 17 00:00:00 2001
From: LysandreJik <lysandre.debut@reseau.eseo.fr>
Date: Wed, 28 Aug 2019 18:24:43 -0400
Subject: [PATCH 194/200] Changed learning rate for run_squad test

---
 examples/test_examples.py | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/examples/test_examples.py b/examples/test_examples.py
index 688401ebc9..b04d722b7b 100644
--- a/examples/test_examples.py
+++ b/examples/test_examples.py
@@ -81,7 +81,7 @@ class ExamplesTests(unittest.TestCase):
                     "--do_train",
                     "--do_eval",
                     "--version_2_with_negative",
-                    "--learning_rate=1e-4",
+                    "--learning_rate=2e-4",
                     "--per_gpu_train_batch_size=2",
                     "--per_gpu_eval_batch_size=1",
                     "--overwrite_output_dir",

From fe8fb10b445b14adf872b205681fa41a7a932b28 Mon Sep 17 00:00:00 2001
From: Luis <30115537+Lawiss@users.noreply.github.com>
Date: Thu, 29 Aug 2019 09:54:45 +0200
Subject: [PATCH 195/200] Small modification of comment in the run_glue.py
 example

Add RoBERTa to the comment as it was not explicit that RoBERTa don't use token_type_ids.
---
 examples/run_glue.py | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/examples/run_glue.py b/examples/run_glue.py
index 53b46fc102..89fb957b47 100644
--- a/examples/run_glue.py
+++ b/examples/run_glue.py
@@ -128,7 +128,7 @@ def train(args, train_dataset, model, tokenizer):
             batch = tuple(t.to(args.device) for t in batch)
             inputs = {'input_ids':      batch[0],
                       'attention_mask': batch[1],
-                      'token_type_ids': batch[2] if args.model_type in ['bert', 'xlnet'] else None,  # XLM don't use segment_ids
+                      'token_type_ids': batch[2] if args.model_type in ['bert', 'xlnet'] else None,  # XLM and RoBERTa don't use segment_ids
                       'labels':         batch[3]}
             outputs = model(**inputs)
             loss = outputs[0]  # model outputs are always tuple in pytorch-transformers (see doc)

From e7fba4bef55c8c667c318549860c0826ccd164dd Mon Sep 17 00:00:00 2001
From: LysandreJik <lysandre.debut@reseau.eseo.fr>
Date: Thu, 29 Aug 2019 12:14:29 -0400
Subject: [PATCH 196/200] Documentation auto-deploy

---
 .circleci/config.yml | 28 +++++++++++++++++++++++-----
 1 file changed, 23 insertions(+), 5 deletions(-)

diff --git a/.circleci/config.yml b/.circleci/config.yml
index 7f316b0b3a..30555731ef 100644
--- a/.circleci/config.yml
+++ b/.circleci/config.yml
@@ -26,9 +26,27 @@ jobs:
             - run: sudo pip install pytest codecov pytest-cov
             - run: python -m pytest -sv ./pytorch_transformers/tests/ --cov
             - run: codecov
+    deploy_doc:
+        working_directory: ~/pytorch-transformers
+        docker:
+            - image: circleci/python:3.5
+        steps:
+            - add_ssh_keys:
+                  fingerprints:
+                      - "5b:7a:95:18:07:8c:aa:76:4c:60:35:88:ad:60:56:71"
+            - checkout
+            - run: sudo pip install -r docs/requirements.txt
+            - run: sudo pip install -r requirements.txt
+            - run: cd docs && make clean && make html && scp -r -oStrictHostKeyChecking=no _build/html/* $doc:$dir
+workflow_filters: &workflow_filters
+    filters:
+        branches:
+            only:
+                - master
 workflows:
-  version: 2
-  build_and_test:
-    jobs:
-      - build_py3
-      - build_py2
\ No newline at end of file
+    version: 2
+    build_and_test:
+        jobs:
+            - build_py3
+            - build_py2
+            - deploy_doc: *workflow_filters
\ No newline at end of file

From caf1d116a62a324a2b0ccfd92ca6c095d5368dde Mon Sep 17 00:00:00 2001
From: LysandreJik <lysandre.debut@reseau.eseo.fr>
Date: Thu, 29 Aug 2019 15:30:10 -0400
Subject: [PATCH 197/200] Closing bracket in DistilBERT's token count.

---
 examples/distillation/scripts/token_counts.py | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/examples/distillation/scripts/token_counts.py b/examples/distillation/scripts/token_counts.py
index eb3fb738e0..d6b6126fb6 100644
--- a/examples/distillation/scripts/token_counts.py
+++ b/examples/distillation/scripts/token_counts.py
@@ -24,7 +24,7 @@ from utils import logger
 if __name__ == '__main__':
     parser = argparse.ArgumentParser(description="Token Counts for smoothing the masking probabilities in MLM (cf XLM/word2vec)")
     parser.add_argument("--data_file", type=str, default="data/dump.bert-base-uncased.pickle",
-                        help="The binarized dataset."
+                        help="The binarized dataset.")
     parser.add_argument("--token_counts_dump", type=str, default="data/token_counts.bert-base-uncased.pickle",
                         help="The dump file.")
     parser.add_argument("--vocab_size", default=30522, type=int)

From 20c06fa37d343a9ce38c32a23afdb007e1150238 Mon Sep 17 00:00:00 2001
From: LysandreJik <lysandre.debut@reseau.eseo.fr>
Date: Fri, 30 Aug 2019 10:06:51 -0400
Subject: [PATCH 198/200] Added DistilBERT to documentation index

---
 docs/source/index.rst | 1 +
 1 file changed, 1 insertion(+)

diff --git a/docs/source/index.rst b/docs/source/index.rst
index 1ae722e1e8..fd73cbe9ef 100644
--- a/docs/source/index.rst
+++ b/docs/source/index.rst
@@ -11,6 +11,7 @@ The library currently contains PyTorch implementations, pre-trained model weight
 4. `Transformer-XL <https://github.com/kimiyoung/transformer-xl>`_ (from Google/CMU) released with the paper `Transformer-XL: Attentive Language Models Beyond a Fixed-Length Context <https://arxiv.org/abs/1901.02860>`_ by Zihang Dai*, Zhilin Yang*, Yiming Yang, Jaime Carbonell, Quoc V. Le, Ruslan Salakhutdinov.
 5. `XLNet <https://github.com/zihangdai/xlnet>`_ (from Google/CMU) released with the paper `​XLNet: Generalized Autoregressive Pretraining for Language Understanding <https://arxiv.org/abs/1906.08237>`_ by Zhilin Yang*, Zihang Dai*, Yiming Yang, Jaime Carbonell, Ruslan Salakhutdinov, Quoc V. Le.
 6. `XLM <https://github.com/facebookresearch/XLM>`_ (from Facebook) released together with the paper `Cross-lingual Language Model Pretraining <https://arxiv.org/abs/1901.07291>`_ by Guillaume Lample and Alexis Conneau.
+6. `DistilBERT <https://huggingface.co/pytorch-transformers/model_doc/distilbert.html`_ (from HuggingFace) released together with the blog post `Smaller, faster, cheaper, lighter: Introducing DistilBERT, a distilled version of BERT <https://medium.com/huggingface/distilbert-8cf3380435b5>`_ by Victor Sanh, Lysandre Debut and Thomas Wolf.
 
 .. toctree::
     :maxdepth: 2

From a600b30cc35465326ac11e2b4d26865ea555d08b Mon Sep 17 00:00:00 2001
From: LysandreJik <lysandre.debut@reseau.eseo.fr>
Date: Fri, 30 Aug 2019 10:08:14 -0400
Subject: [PATCH 199/200] Fix index number in documentation

---
 docs/source/index.rst | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/docs/source/index.rst b/docs/source/index.rst
index fd73cbe9ef..89169b0945 100644
--- a/docs/source/index.rst
+++ b/docs/source/index.rst
@@ -11,7 +11,7 @@ The library currently contains PyTorch implementations, pre-trained model weight
 4. `Transformer-XL <https://github.com/kimiyoung/transformer-xl>`_ (from Google/CMU) released with the paper `Transformer-XL: Attentive Language Models Beyond a Fixed-Length Context <https://arxiv.org/abs/1901.02860>`_ by Zihang Dai*, Zhilin Yang*, Yiming Yang, Jaime Carbonell, Quoc V. Le, Ruslan Salakhutdinov.
 5. `XLNet <https://github.com/zihangdai/xlnet>`_ (from Google/CMU) released with the paper `​XLNet: Generalized Autoregressive Pretraining for Language Understanding <https://arxiv.org/abs/1906.08237>`_ by Zhilin Yang*, Zihang Dai*, Yiming Yang, Jaime Carbonell, Ruslan Salakhutdinov, Quoc V. Le.
 6. `XLM <https://github.com/facebookresearch/XLM>`_ (from Facebook) released together with the paper `Cross-lingual Language Model Pretraining <https://arxiv.org/abs/1901.07291>`_ by Guillaume Lample and Alexis Conneau.
-6. `DistilBERT <https://huggingface.co/pytorch-transformers/model_doc/distilbert.html`_ (from HuggingFace) released together with the blog post `Smaller, faster, cheaper, lighter: Introducing DistilBERT, a distilled version of BERT <https://medium.com/huggingface/distilbert-8cf3380435b5>`_ by Victor Sanh, Lysandre Debut and Thomas Wolf.
+7. `DistilBERT <https://huggingface.co/pytorch-transformers/model_doc/distilbert.html`_ (from HuggingFace) released together with the blog post `Smaller, faster, cheaper, lighter: Introducing DistilBERT, a distilled version of BERT <https://medium.com/huggingface/distilbert-8cf3380435b5>`_ by Victor Sanh, Lysandre Debut and Thomas Wolf.
 
 .. toctree::
     :maxdepth: 2

From e0caab0cf052c86e456bc4b4fdac5788433ed935 Mon Sep 17 00:00:00 2001
From: LysandreJik <lysandre.debut@reseau.eseo.fr>
Date: Fri, 30 Aug 2019 10:09:17 -0400
Subject: [PATCH 200/200] fix link

---
 docs/source/index.rst | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/docs/source/index.rst b/docs/source/index.rst
index 89169b0945..d349e146c9 100644
--- a/docs/source/index.rst
+++ b/docs/source/index.rst
@@ -11,7 +11,7 @@ The library currently contains PyTorch implementations, pre-trained model weight
 4. `Transformer-XL <https://github.com/kimiyoung/transformer-xl>`_ (from Google/CMU) released with the paper `Transformer-XL: Attentive Language Models Beyond a Fixed-Length Context <https://arxiv.org/abs/1901.02860>`_ by Zihang Dai*, Zhilin Yang*, Yiming Yang, Jaime Carbonell, Quoc V. Le, Ruslan Salakhutdinov.
 5. `XLNet <https://github.com/zihangdai/xlnet>`_ (from Google/CMU) released with the paper `​XLNet: Generalized Autoregressive Pretraining for Language Understanding <https://arxiv.org/abs/1906.08237>`_ by Zhilin Yang*, Zihang Dai*, Yiming Yang, Jaime Carbonell, Ruslan Salakhutdinov, Quoc V. Le.
 6. `XLM <https://github.com/facebookresearch/XLM>`_ (from Facebook) released together with the paper `Cross-lingual Language Model Pretraining <https://arxiv.org/abs/1901.07291>`_ by Guillaume Lample and Alexis Conneau.
-7. `DistilBERT <https://huggingface.co/pytorch-transformers/model_doc/distilbert.html`_ (from HuggingFace) released together with the blog post `Smaller, faster, cheaper, lighter: Introducing DistilBERT, a distilled version of BERT <https://medium.com/huggingface/distilbert-8cf3380435b5>`_ by Victor Sanh, Lysandre Debut and Thomas Wolf.
+7. `DistilBERT <https://huggingface.co/pytorch-transformers/model_doc/distilbert.html>`_ (from HuggingFace) released together with the blog post `Smaller, faster, cheaper, lighter: Introducing DistilBERT, a distilled version of BERT <https://medium.com/huggingface/distilbert-8cf3380435b5>`_ by Victor Sanh, Lysandre Debut and Thomas Wolf.
 
 .. toctree::
     :maxdepth: 2