Merge branch 'master' into auto_models

2019-08-05 19:17:35 +02:00
parent 0b524b0848 3a126e73dd
commit d43dc48b34
16 changed files with 340 additions and 108 deletions
--- a/.github/ISSUE_TEMPLATE/bug-report.md
+++ b/.github/ISSUE_TEMPLATE/bug-report.md
@@ -0,0 +1,48 @@
 ---
 name: "\U0001F41B Bug Report"
 about: Submit a bug report to help us improve PyTorch Transformers
 ---
 ## 🐛 Bug
 <!-- Important information -->
 Model I am using (Bert, XLNet....):
 Language I am using the model on (English, Chinese....):
 The problem arises when using:
 * [ ] the official example scripts: (give details)
 * [ ] my own modified scripts: (give details)
 The task I am working on is:
 * [ ] an official GLUE/SQuAD task: (give the name)
 * [ ] my own task or dataset: (give details)
 ## To Reproduce
 Steps to reproduce the behavior:
 1.
 2.
 3.
 <!-- If you have a code sample, error messages, stack traces, please provide it here as well. -->
 ## Expected behavior
 <!-- A clear and concise description of what you expected to happen. -->
 ## Environment
 * OS:
 * Python version:
 * PyTorch version:
 * PyTorch Transformers version (or branch):
 * Using GPU ?
 * Distributed or parallel setup ?
 * Any other relevant information:
 ## Additional context
 <!-- Add any other context about the problem here. -->
--- a/.github/ISSUE_TEMPLATE/feature-request.md
+++ b/.github/ISSUE_TEMPLATE/feature-request.md
@@ -0,0 +1,16 @@
 ---
 name: "\U0001F680 Feature Request"
 about: Submit a proposal/request for a new PyTorch Transformers feature
 ---
 ## 🚀 Feature
 <!-- A clear and concise description of the feature proposal. Please provide a link to the paper and code in case they exist. -->
 ## Motivation
 <!-- Please outline the motivation for the proposal. Is your feature request related to a problem? e.g., I'm always frustrated when [...]. If this is related to another GitHub issue, please link here too. -->
 ## Additional context
 <!-- Add any other context or screenshots about the feature request here. -->
--- a/.github/ISSUE_TEMPLATE/migration.md
+++ b/.github/ISSUE_TEMPLATE/migration.md
@@ -0,0 +1,43 @@
 ---
 name: "\U0001F4DA Migration from PyTorch-pretrained-Bert"
 about: Report a problem when migrating from PyTorch-pretrained-Bert to PyTorch-Transformers
 ---
 ## 📚 Migration
 <!-- Important information -->
 Model I am using (Bert, XLNet....):
 Language I am using the model on (English, Chinese....):
 The problem arises when using:
 * [ ] the official example scripts: (give details)
 * [ ] my own modified scripts: (give details)
 The task I am working on is:
 * [ ] an official GLUE/SQuAD task: (give the name)
 * [ ] my own task or dataset: (give details)
 Details of the issue:
 <!-- A clear and concise description of the migration issue. If you have code snippets, please provide it here as well. -->
 ## Environment
 * OS:
 * Python version:
 * PyTorch version:
 * PyTorch Transformers version (or branch):
 * Using GPU ?
 * Distributed or parallel setup ?
 * Any other relevant information:
 ## Checklist
 - [ ] I have read the migration guide in the readme.
 - [ ] I checked if a related official extension example runs on my machine.
 ## Additional context
 <!-- Add any other context about the problem here. -->
--- a/.github/ISSUE_TEMPLATE/question-help.md
+++ b/.github/ISSUE_TEMPLATE/question-help.md
@@ -0,0 +1,8 @@
 ---
 name: "❓Questions & Help"
 about: Start a general discussion related to PyTorch Transformers
 ---
 ## ❓ Questions & Help
 <!-- A clear and concise description of the question. -->
--- a/README.md
+++ b/README.md
@@ -18,7 +18,7 @@ These implementations have been tested on several datasets (see the example scri
 | Section | Description |
 |-|-|
 | [Installation](#installation) | How to install the package |
-| [Quick tour: Usage](#quick-tour-usage) | Tokenizers & models usage: Bert and GPT-2 |
+| [Quick tour: Usage](#quick-tour) | Tokenizers & models usage: Bert and GPT-2 |
 | [Quick tour: Fine-tuning/usage scripts](#quick-tour-of-the-fine-tuningusage-scripts) | Using provided scripts: GLUE, SQuAD and Text generation |
 | [Migrating from pytorch-pretrained-bert to pytorch-transformers](#Migrating-from-pytorch-pretrained-bert-to-pytorch-transformers) | Migrating your code from pytorch-pretrained-bert to pytorch-transformers |
 | [Documentation](https://huggingface.co/pytorch-transformers/) | Full API documentation and more |
@@ -56,6 +56,16 @@ python -m pytest -sv ./pytorch_transformers/tests/
 python -m pytest -sv ./examples/
 ```
 ### Do you want to run a Transformer model on a mobile device?
 You should check out our [`swift-coreml-transformers`](https://github.com/huggingface/swift-coreml-transformers) repo.
 It contains an example of a conversion script from a PyTorch-trained Transformer model (here, `GPT-2`) to a CoreML model that runs on iOS devices.
 At some point in the future, you'll be able to seamlessly move from pre-training or fine-tuning models in PyTorch to productizing them in CoreML,
 or prototype a model or an app in CoreML then research its hyperparameters or architecture from PyTorch. Super exciting!
 ## Quick tour
 Let's do a very quick overview of PyTorch-Transformers. Detailed examples for each model architecture (Bert, GPT, GPT-2, Transformer-XL, XLNet and XLM) can be found in the [full documentation](https://huggingface.co/pytorch-transformers/).
@@ -195,7 +205,7 @@ python ./examples/run_glue.py \
    --warmup_steps=120
 ```
-On this machine we thus have a batch size of 32, please increase `gradient_accumulation_steps` to reach the same batch size if you have a smaller machine. These hyper-parameters should results in a Pearson correlation coefficient of `+0.917` on the development set.
+On this machine we thus have a batch size of 32; please increase `gradient_accumulation_steps` to reach the same batch size if you have a smaller machine. These hyper-parameters should result in a Pearson correlation coefficient of `+0.917` on the development set.
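The batch-size arithmetic mentioned above can be sketched as follows. This is a minimal illustration, not code from the repository: the helper names `effective_batch_size` and `accumulation_steps_needed` and the GPU counts are hypothetical, and the sketch assumes the effective batch size is the product of the per-GPU batch size, the number of GPUs and `gradient_accumulation_steps`.

```python
import math


def effective_batch_size(per_gpu_batch_size, num_gpus, gradient_accumulation_steps=1):
    """Number of examples contributing to each optimizer update (assumed product rule)."""
    return per_gpu_batch_size * num_gpus * gradient_accumulation_steps


def accumulation_steps_needed(target_batch_size, per_gpu_batch_size, num_gpus):
    """Smallest number of accumulation steps that reaches the target batch size."""
    examples_per_forward_pass = per_gpu_batch_size * num_gpus
    return math.ceil(target_batch_size / examples_per_forward_pass)


# Hypothetical large machine: 8 GPUs with 4 examples each, no accumulation.
large_machine = effective_batch_size(per_gpu_batch_size=4, num_gpus=8)

# Hypothetical smaller machine: 1 GPU with 4 examples per forward pass.
steps = accumulation_steps_needed(target_batch_size=32, per_gpu_batch_size=4, num_gpus=1)
small_machine = effective_batch_size(4, 1, gradient_accumulation_steps=steps)

print(large_machine, steps, small_machine)
```

Under these assumptions, the smaller machine matches the larger one by accumulating gradients over several forward passes before each optimizer step.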
 #### Fine-tuning Bert model on the MRPC classification task
@@ -265,7 +275,7 @@ This is the model provided as `bert-large-uncased-whole-word-masking-finetuned-s
 ### `run_generation.py`: Text generation with GPT, GPT-2, Transformer-XL and XLNet
 A conditional generation script is also included to generate text from a prompt.
-The generation script include the [tricks](https://github.com/rusiaaman/XLNet-gen#methodology) proposed by by Aman Rusia to get high quality generation with memory models like Transformer-XL and XLNet (include a predefined text to make short inputs longer).
+The generation script includes the [tricks](https://github.com/rusiaaman/XLNet-gen#methodology) proposed by Aman Rusia to get high-quality generation with memory models like Transformer-XL and XLNet (include a predefined text to make short inputs longer).
 Here is how to run the script with the small version of OpenAI GPT-2 model:
@@ -284,7 +294,7 @@ Here is a quick summary of what you should take care of when migrating from `pyt
 The main breaking change when migrating from `pytorch-pretrained-bert` to `pytorch-transformers` is that the model's forward method always outputs a `tuple` with various elements depending on the model and the configuration parameters.
-The exact content of the tuples for each model are detailled in the models' docstrings and the [documentation](https://huggingface.co/pytorch-transformers/).
+The exact content of the tuples for each model is detailed in the models' docstrings and the [documentation](https://huggingface.co/pytorch-transformers/).
 In pretty much every case, you will be fine by taking the first element of the output as the output you previously used in `pytorch-pretrained-bert`.
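The migration rule above can be illustrated with a small stand-alone sketch. The two functions below are hypothetical stand-ins written for this example, not the library's models: `legacy_forward` mimics an old-style call that returns a single value, and `tuple_forward` mimics the new convention of always returning a tuple whose length depends on configuration.

```python
def legacy_forward(token_ids):
    """Stand-in for an old-style forward pass that returns one value directly."""
    return [float(t) for t in token_ids]


def tuple_forward(token_ids, output_attentions=False):
    """Stand-in for a new-style forward pass that always returns a tuple."""
    hidden_states = [float(t) for t in token_ids]
    outputs = (hidden_states,)
    if output_attentions:
        # Extra elements are appended depending on configuration (placeholder value here).
        outputs = outputs + ("attentions-placeholder",)
    return outputs


token_ids = [101, 2023, 102]

old_result = legacy_forward(token_ids)

# Migration pattern: index the tuple instead of using the return value directly.
new_result = tuple_forward(token_ids)[0]
new_result_with_extras = tuple_forward(token_ids, output_attentions=True)[0]

print(old_result == new_result == new_result_with_extras)
```

Indexing with `[0]` keeps calling code unchanged whether or not optional elements are present in the tuple; which elements a real model returns should be checked in that model's docstring.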
@@ -383,6 +393,7 @@ for batch in train_data:
    torch.nn.utils.clip_grad_norm_(model.parameters(), max_grad_norm)  # Gradient clipping is not in AdamW anymore (so you can use amp without issue)
    optimizer.step()
    scheduler.step()  # Step the scheduler after the optimizer (ordering expected by PyTorch >= 1.1.0)
    optimizer.zero_grad()
 ```
 ## Citation
--- a/docs/source/installation.rst
+++ b/docs/source/installation.rst
@@ -49,4 +49,17 @@ If you want to reproduce the original tokenization process of the ``OpenAI GPT``
   pip install spacy ftfy==4.4.3
   python -m spacy download en
-If you don't install ``ftfy`` and ``SpaCy``\ , the ``OpenAI GPT`` tokenizer defaults to tokenize using BERT's ``BasicTokenizer`` followed by Byte-Pair Encoding (which should be fine for most usage, don't worry).
+If you don't install ``ftfy`` and ``SpaCy``\ , the ``OpenAI GPT`` tokenizer will default to tokenize using BERT's ``BasicTokenizer`` followed by Byte-Pair Encoding (which should be fine for most usage, don't worry).
 Do you want to run a Transformer model on a mobile device?
 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
 You should check out our `swift-coreml-transformers <https://github.com/huggingface/swift-coreml-transformers>`_ repo.
 It contains an example of a conversion script from a PyTorch-trained Transformer model (here, ``GPT-2``) to a CoreML model that runs on iOS devices.
 It also contains an implementation of BERT for Question answering.
 At some point in the future, you'll be able to seamlessly move from pre-training or fine-tuning models in PyTorch to productizing them in CoreML,
 or prototype a model or an app in CoreML then research its hyperparameters or architecture from PyTorch. Super exciting!
--- a/docs/source/pretrained_models.rst
+++ b/docs/source/pretrained_models.rst
@@ -3,57 +3,98 @@ Pretrained models
 Here is the full list of the currently provided pretrained models together with a short presentation of each model.
 +===============+============================================================+===========================+ 
 | Architecture  | Shortcut name                                              | Details of the model      |
 +===============+============================================================+===========================+ 
 |               | ``bert-base-uncased``                                      | 12-layer, 768-hidden, 12-heads, 110M parameters
 |               |                                                            | Trained on lower-cased English text                 |
 |               +------------------------------------------------------------+---------------------------+ 
 |               | ``bert-large-uncased``                                     | 24-layer, 1024-hidden, 16-heads, 340M parameters
 |               |                                                            | Trained on lower-cased English text                  |
 |               +------------------------------------------------------------+---------------------------+ 
 |               | ``bert-base-cased``                                        | 12-layer, 768-hidden, 12-heads, 110M parameters
 |               |                                                            | Trained on cased English text                 |
 |               +------------------------------------------------------------+---------------------------+ 
 |               | ``bert-large-cased``                                       | 24-layer, 1024-hidden, 16-heads, 340M parameters                  |
 |               |                                                            | Trained on cased English text                  |
 |               +------------------------------------------------------------+---------------------------+ 
 |               | ``bert-base-multilingual-uncased``                         | (Original, not recommended) 12-layer, 768-hidden, 12-heads, 110M parameters
 |               |                                                            | Trained on lower-cased text in the top 102 languages with the largest Wikipedias
 |               |                                                            | (see `details <https://github.com/google-research/bert/blob/master/multilingual.md>`_)                 |
 |               +------------------------------------------------------------+---------------------------+ 
 |               | ``bert-base-multilingual-cased``                           | (New, **recommended**) 12-layer, 768-hidden, 12-heads, 110M parameters                  |
 |               |                                                            | Trained on cased text in the top 104 languages with the largest Wikipedias
 |               |                                                            | (see `details <https://github.com/google-research/bert/blob/master/multilingual.md>`_)                 |
 |               +------------------------------------------------------------+---------------------------+ 
 |    BERT       | ``bert-base-chinese``                                      | 12-layer, 768-hidden, 12-heads, 110M parameters                  |
 |               |                                                            | Trained on cased Chinese Simplified and Traditional text |
 |               +------------------------------------------------------------+---------------------------+ 
 |               | ``bert-base-german-cased``                                 | 12-layer, 768-hidden, 12-heads, 110M parameters                  |
 |               |                                                            | Trained on cased German text by Deepset.ai |
 |               |                                                            | (see `details on deepset.ai website <https://deepset.ai/german-bert>`_)                 |
 |               +------------------------------------------------------------+---------------------------+ 
 |               | ``bert-large-uncased-whole-word-masking``                  | 24-layer, 1024-hidden, 16-heads, 340M parameters                  |
 |               |                                                            | Trained on lower-cased English text using Whole-Word-Masking                  |
 |               |                                                            | (see `details <https://github.com/google-research/bert/#bert>`_)                 |
 |               +------------------------------------------------------------+---------------------------+ 
 |               | ``bert-large-cased-whole-word-masking``                    | 24-layer, 1024-hidden, 16-heads, 340M parameters                  |
 |               |                                                            | Trained on cased English text using Whole-Word-Masking                  |
 |               |                                                            | (see `details <https://github.com/google-research/bert/#bert>`_)                 |
 |               +------------------------------------------------------------+---------------------------+ 
 |               | ``bert-large-uncased-whole-word-masking-finetuned-squad``  | 24-layer, 1024-hidden, 16-heads, 340M parameters                  |
 |               |                                                            | The ``bert-large-uncased-whole-word-masking`` model fine-tuned on SQuAD                  |
 |               |                                                            | (see details of fine-tuning in the `example section`_)                 |
 |               +------------------------------------------------------------+---------------------------+ 
 |               | ``bert-large-cased-whole-word-masking-finetuned-squad``    | 24-layer, 1024-hidden, 16-heads, 340M parameters                  |
 |               |                                                            | The ``bert-large-cased-whole-word-masking`` model fine-tuned on SQuAD                  |
 |               |                                                            | (see `details of fine-tuning in the example section <https://huggingface.co/pytorch-transformers/examples.html>`_)                 |
 |               +------------------------------------------------------------+---------------------------+ 
 |               | ``bert-base-cased-finetuned-mrpc``                         | 12-layer, 768-hidden, 12-heads, 110M parameters                  |
 |               |                                                            | The ``bert-base-cased`` model fine-tuned on MRPC                  |
 |               |                                                            | (see `details of fine-tuning in the example section <https://huggingface.co/pytorch-transformers/examples.html>`_)                 |
 +---------------+------------------------------------------------------------+---------------------------+ 
 |    GPT        | Cells may span columns.                                                                |
 +---------------+----------------------------------------------------------------------------------------+ 
-.. <https://huggingface.co/pytorch-transformers/examples.html>`_
+-------------------+------------------------------------------------------------+---------------------------------------------------------------------------------------------------------------------------+
 | Architecture      | Shortcut name                                              | Details of the model                                                                                                      |
 +===================+============================================================+===========================================================================================================================+
 | BERT              | ``bert-base-uncased``                                      | 12-layer, 768-hidden, 12-heads, 110M parameters                                                                           |
 |                   |                                                            | Trained on lower-cased English text                                                                                       |
 |                   +------------------------------------------------------------+---------------------------------------------------------------------------------------------------------------------------+
 |                   | ``bert-large-uncased``                                     | 24-layer, 1024-hidden, 16-heads, 340M parameters                                                                          |
 |                   |                                                            | Trained on lower-cased English text                                                                                       |
 |                   +------------------------------------------------------------+---------------------------------------------------------------------------------------------------------------------------+
 |                   | ``bert-base-cased``                                        | 12-layer, 768-hidden, 12-heads, 110M parameters                                                                           |
 |                   |                                                            | Trained on cased English text                                                                                             |
 |                   +------------------------------------------------------------+---------------------------------------------------------------------------------------------------------------------------+
 |                   | ``bert-large-cased``                                       | 24-layer, 1024-hidden, 16-heads, 340M parameters                                                                          |
 |                   |                                                            | Trained on cased English text                                                                                             |
 |                   +------------------------------------------------------------+---------------------------------------------------------------------------------------------------------------------------+
 |                   | ``bert-base-multilingual-uncased``                         | (Original, not recommended) 12-layer, 768-hidden, 12-heads, 110M parameters                                               |
 |                   |                                                            | Trained on lower-cased text in the top 102 languages with the largest Wikipedias                                          |
 |                   |                                                            | (see `details <https://github.com/google-research/bert/blob/master/multilingual.md>`__)                                   |
 |                   +------------------------------------------------------------+---------------------------------------------------------------------------------------------------------------------------+
 |                   | ``bert-base-multilingual-cased``                           | (New, **recommended**) 12-layer, 768-hidden, 12-heads, 110M parameters                                                    |
 |                   |                                                            | Trained on cased text in the top 104 languages with the largest Wikipedias                                                |
 |                   |                                                            | (see `details <https://github.com/google-research/bert/blob/master/multilingual.md>`__)                                   |
 |                   +------------------------------------------------------------+---------------------------------------------------------------------------------------------------------------------------+
 |                   | ``bert-base-chinese``                                      | 12-layer, 768-hidden, 12-heads, 110M parameters                                                                           |
 |                   |                                                            | Trained on cased Chinese Simplified and Traditional text                                                                  |
 |                   +------------------------------------------------------------+---------------------------------------------------------------------------------------------------------------------------+
 |                   | ``bert-base-german-cased``                                 | 12-layer, 768-hidden, 12-heads, 110M parameters                                                                           |
 |                   |                                                            | Trained on cased German text by Deepset.ai                                                                                |
 |                   |                                                            | (see `details on deepset.ai website <https://deepset.ai/german-bert>`__)                                                  |
 |                   +------------------------------------------------------------+---------------------------------------------------------------------------------------------------------------------------+
 |                   | ``bert-large-uncased-whole-word-masking``                  | 24-layer, 1024-hidden, 16-heads, 340M parameters                                                                          |
 |                   |                                                            | Trained on lower-cased English text using Whole-Word-Masking                                                              |
 |                   |                                                            | (see `details <https://github.com/google-research/bert/#bert>`__)                                                         |
 |                   +------------------------------------------------------------+---------------------------------------------------------------------------------------------------------------------------+
 |                   | ``bert-large-cased-whole-word-masking``                    | 24-layer, 1024-hidden, 16-heads, 340M parameters                                                                          |
 |                   |                                                            | Trained on cased English text using Whole-Word-Masking                                                                    |
 |                   |                                                            | (see `details <https://github.com/google-research/bert/#bert>`__)                                                         |
 |                   +------------------------------------------------------------+---------------------------------------------------------------------------------------------------------------------------+
 |                   | ``bert-large-uncased-whole-word-masking-finetuned-squad``  | 24-layer, 1024-hidden, 16-heads, 340M parameters                                                                          |
 |                   |                                                            | The ``bert-large-uncased-whole-word-masking`` model fine-tuned on SQuAD (see details of fine-tuning in the                |
 |                   |                                                            | `example section <https://github.com/huggingface/pytorch-transformers/tree/master/examples>`__)                           |
 |                   +------------------------------------------------------------+---------------------------------------------------------------------------------------------------------------------------+
 |                   | ``bert-large-cased-whole-word-masking-finetuned-squad``    | 24-layer, 1024-hidden, 16-heads, 340M parameters                                                                          |
 |                   |                                                            | The ``bert-large-cased-whole-word-masking`` model fine-tuned on SQuAD                                                     |
 |                   |                                                            | (see `details of fine-tuning in the example section <https://huggingface.co/pytorch-transformers/examples.html>`__)       |
 |                   +------------------------------------------------------------+---------------------------------------------------------------------------------------------------------------------------+
 |                   | ``bert-base-cased-finetuned-mrpc``                         | 12-layer, 768-hidden, 12-heads, 110M parameters                                                                           |
 |                   |                                                            | The ``bert-base-cased`` model fine-tuned on MRPC                                                                          |
 |                   |                                                            | (see `details of fine-tuning in the example section <https://huggingface.co/pytorch-transformers/examples.html>`__)       |
 +-------------------+------------------------------------------------------------+---------------------------------------------------------------------------------------------------------------------------+
 | GPT               | ``openai-gpt``                                             | 12-layer, 768-hidden, 12-heads, 110M parameters                                                                           |
 |                   |                                                            | OpenAI GPT English model                                                                                                  |
 +-------------------+------------------------------------------------------------+---------------------------------------------------------------------------------------------------------------------------+
 | GPT-2             | ``gpt2``                                                   | 12-layer, 768-hidden, 12-heads, 117M parameters                                                                           |
 |                   |                                                            | OpenAI GPT-2 English model                                                                                                |
 |                   +------------------------------------------------------------+---------------------------------------------------------------------------------------------------------------------------+
 |                   | ``gpt2-medium``                                            | 24-layer, 1024-hidden, 16-heads, 345M parameters                                                                          |
 |                   |                                                            | OpenAI's Medium-sized GPT-2 English model                                                                                 |
 +-------------------+------------------------------------------------------------+---------------------------------------------------------------------------------------------------------------------------+
 | Transformer-XL    | ``transfo-xl-wt103``                                       | 18-layer, 1024-hidden, 16-heads, 257M parameters                                                                          |
 |                   |                                                            | English model trained on wikitext-103                                                                                     |
 +-------------------+------------------------------------------------------------+---------------------------------------------------------------------------------------------------------------------------+
 | XLNet             | ``xlnet-base-cased``                                       | 12-layer, 768-hidden, 12-heads, 110M parameters                                                                           |
 |                   |                                                            | XLNet English model                                                                                                       |
 |                   +------------------------------------------------------------+---------------------------------------------------------------------------------------------------------------------------+
 |                   | ``xlnet-large-cased``                                      | 24-layer, 1024-hidden, 16-heads, 340M parameters                                                                          |
 |                   |                                                            | XLNet Large English model                                                                                                 |
 +-------------------+------------------------------------------------------------+---------------------------------------------------------------------------------------------------------------------------+
 | XLM               | ``xlm-mlm-en-2048``                                        | 12-layer, 1024-hidden, 8-heads                                                                                            |
 |                   |                                                            | XLM English model                                                                                                         |
 |                   +------------------------------------------------------------+---------------------------------------------------------------------------------------------------------------------------+
 |                   | ``xlm-mlm-ende-1024``                                      | 12-layer, 1024-hidden, 8-heads                                                                                            |
 |                   |                                                            | XLM English-German Multi-language model                                                                                   |
 |                   +------------------------------------------------------------+---------------------------------------------------------------------------------------------------------------------------+
 |                   | ``xlm-mlm-enfr-1024``                                      | 12-layer, 1024-hidden, 8-heads                                                                                            |
 |                   |                                                            | XLM English-French Multi-language model                                                                                   |
 |                   +------------------------------------------------------------+---------------------------------------------------------------------------------------------------------------------------+
 |                   | ``xlm-mlm-enro-1024``                                      | 12-layer, 1024-hidden, 8-heads                                                                                            |
 |                   |                                                            | XLM English-Romanian Multi-language model                                                                                 |
 |                   +------------------------------------------------------------+---------------------------------------------------------------------------------------------------------------------------+
 |                   | ``xlm-mlm-xnli15-1024``                                    | 12-layer, 1024-hidden, 8-heads                                                                                            |
 |                   |                                                            | XLM Model pre-trained with MLM on the `15 XNLI languages <https://github.com/facebookresearch/XNLI>`__.                   |
 |                   +------------------------------------------------------------+---------------------------------------------------------------------------------------------------------------------------+
 |                   | ``xlm-mlm-tlm-xnli15-1024``                                | 12-layer, 1024-hidden, 8-heads                                                                                            |
 |                   |                                                            | XLM Model pre-trained with MLM + TLM on the `15 XNLI languages <https://github.com/facebookresearch/XNLI>`__.             |
 |                   +------------------------------------------------------------+---------------------------------------------------------------------------------------------------------------------------+
 |                   | ``xlm-clm-enfr-1024``                                      | 12-layer, 1024-hidden, 8-heads                                                                                            |
 |                   |                                                            | XLM English-French Multi-language model trained with CLM (Causal Language Modeling)                                       |
 |                   +------------------------------------------------------------+---------------------------------------------------------------------------------------------------------------------------+
 |                   | ``xlm-clm-ende-1024``                                      | 12-layer, 1024-hidden, 8-heads                                                                                            |
 |                   |                                                            | XLM English-German Multi-language model trained with CLM (Causal Language Modeling)                                       |
 +-------------------+------------------------------------------------------------+---------------------------------------------------------------------------------------------------------------------------+
 .. <https://huggingface.co/pytorch-transformers/examples.html>`__
--- a/docs/source/torchscript.rst
+++ b/docs/source/torchscript.rst
@@ -132,4 +132,4 @@ Using the traced model for inference is as simple as using its ``__call__`` dund
 .. code-block:: python
-    traced_model(tokens_tensor, segments_tensors)
+    traced_model(tokens_tensor, segments_tensors)
--- a/examples/run_glue.py
+++ b/examples/run_glue.py
@@ -92,6 +92,10 @@ def train(args, train_dataset, model, tokenizer):
            raise ImportError("Please install apex from https://www.github.com/nvidia/apex to use fp16 training.")
        model, optimizer = amp.initialize(model, optimizer, opt_level=args.fp16_opt_level)
    # multi-gpu training (should be after apex fp16 initialization)
    if args.n_gpu > 1:
        model = torch.nn.DataParallel(model)
    # Distributed training (should be after apex fp16 initialization)
    if args.local_rank != -1:
        model = torch.nn.parallel.DistributedDataParallel(model, device_ids=[args.local_rank],
@@ -243,6 +247,9 @@ def evaluate(args, model, tokenizer, prefix=""):
 def load_and_cache_examples(args, task, tokenizer, evaluate=False):
    if args.local_rank not in [-1, 0]:
        torch.distributed.barrier()  # Make sure only the first process in distributed training process the dataset, and the others will use the cache
    processor = processors[task]()
    output_mode = output_modes[task]
    # Load data features from cache or dataset file
@@ -269,6 +276,9 @@ def load_and_cache_examples(args, task, tokenizer, evaluate=False):
            logger.info("Saving features into cached file %s", cached_features_file)
            torch.save(features, cached_features_file)
    if args.local_rank == 0:
        torch.distributed.barrier()  # Make sure only the first process in distributed training process the dataset, and the others will use the cache
    # Convert to Tensors and build dataset
    all_input_ids = torch.tensor([f.input_ids for f in features], dtype=torch.long)
    all_input_mask = torch.tensor([f.input_mask for f in features], dtype=torch.long)
@@ -418,8 +428,6 @@ def main():
        torch.distributed.barrier()  # Make sure only the first process in distributed training will download model & vocab
    model.to(args.device)
    if args.n_gpu > 1:
        model = torch.nn.DataParallel(model)
    logger.info("Training/evaluation parameters %s", args)
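Review note: the two `torch.distributed.barrier()` calls in `load_and_cache_examples` form a "first process builds, the rest wait" pattern. The sketch below is only a thread-based analogy of that ordering, assuming `threading.Barrier` as a stand-in for `torch.distributed.barrier()` and an in-memory dict as a stand-in for the cached features file. The names `WORLD_SIZE`, `cache`, `build_log` and `load_and_cache` are hypothetical and do not appear in the scripts.

```python
import threading

WORLD_SIZE = 4
barrier = threading.Barrier(WORLD_SIZE)  # stand-in for torch.distributed.barrier()
cache = {}       # stand-in for the cached features file on disk
build_log = []   # records which ranks actually built the features
results = {}     # what each rank ends up loading


def load_and_cache(rank):
    # Every rank except the first waits here until rank 0 has filled the cache.
    if rank != 0:
        barrier.wait()

    if "features" not in cache:
        build_log.append(rank)
        cache["features"] = [1, 2, 3]  # placeholder for the expensive preprocessing

    # Rank 0 reaches the barrier only after the cache exists, which releases the others.
    if rank == 0:
        barrier.wait()

    results[rank] = cache["features"]


threads = [threading.Thread(target=load_and_cache, args=(r,)) for r in range(WORLD_SIZE)]
for t in threads:
    t.start()
for t in threads:
    t.join()
```

Each rank passes through exactly one barrier call, so the barrier completes once: the waiting ranks are released only after rank 0 has finished writing.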
--- a/examples/run_squad.py
+++ b/examples/run_squad.py
@@ -101,6 +101,10 @@ def train(args, train_dataset, model, tokenizer):
            raise ImportError("Please install apex from https://www.github.com/nvidia/apex to use fp16 training.")
        model, optimizer = amp.initialize(model, optimizer, opt_level=args.fp16_opt_level)
    # multi-gpu training (should be after apex fp16 initialization)
    if args.n_gpu > 1:
        model = torch.nn.DataParallel(model)
    # Distributed training (should be after apex fp16 initialization)
    if args.local_rank != -1:
        model = torch.nn.parallel.DistributedDataParallel(model, device_ids=[args.local_rank],
@@ -241,7 +245,10 @@ def evaluate(args, model, tokenizer, prefix=""):
    # Compute predictions
    output_prediction_file = os.path.join(args.output_dir, "predictions_{}.json".format(prefix))
    output_nbest_file = os.path.join(args.output_dir, "nbest_predictions_{}.json".format(prefix))
-    output_null_log_odds_file = os.path.join(args.output_dir, "null_odds_{}.json".format(prefix))
+    if args.version_2_with_negative:
+        output_null_log_odds_file = os.path.join(args.output_dir, "null_odds_{}.json".format(prefix))
+    else:
+        output_null_log_odds_file = None
    if args.model_type in ['xlnet', 'xlm']:
        # XLNet uses a more complex post-processing procedure
@@ -265,6 +272,9 @@ def evaluate(args, model, tokenizer, prefix=""):
 def load_and_cache_examples(args, tokenizer, evaluate=False, output_examples=False):
    if args.local_rank not in [-1, 0]:
        torch.distributed.barrier()  # Make sure only the first process in distributed training process the dataset, and the others will use the cache
    # Load data features from cache or dataset file
    input_file = args.predict_file if evaluate else args.train_file
    cached_features_file = os.path.join(os.path.dirname(input_file), 'cached_{}_{}_{}'.format(
@@ -289,6 +299,9 @@ def load_and_cache_examples(args, tokenizer, evaluate=False, output_examples=Fal
            logger.info("Saving features into cached file %s", cached_features_file)
            torch.save(features, cached_features_file)
    if args.local_rank == 0:
        torch.distributed.barrier()  # Make sure only the first process in distributed training process the dataset, and the others will use the cache
    # Convert to Tensors and build dataset
    all_input_ids = torch.tensor([f.input_ids for f in features], dtype=torch.long)
    all_input_mask = torch.tensor([f.input_mask for f in features], dtype=torch.long)
@@ -457,8 +470,6 @@ def main():
        torch.distributed.barrier()  # Make sure only the first process in distributed training will download model & vocab
    model.to(args.device)
    if args.n_gpu > 1:
        model = torch.nn.DataParallel(model)
    logger.info("Training/evaluation parameters %s", args)
--- a/pytorch_transformers/__init__.py
+++ b/pytorch_transformers/__init__.py
@@ -10,20 +10,20 @@ from .tokenization_utils import (PreTrainedTokenizer)
 from .modeling_auto import (AutoConfig, AutoModel)
-from .modeling_bert import (BertConfig, BertModel, BertForPreTraining,
+from .modeling_bert import (BertConfig, BertPreTrainedModel, BertModel, BertForPreTraining,
-                       BertForMaskedLM, BertForNextSentencePrediction,
+                            BertForMaskedLM, BertForNextSentencePrediction,
-                       BertForSequenceClassification, BertForMultipleChoice,
+                            BertForSequenceClassification, BertForMultipleChoice,
-                       BertForTokenClassification, BertForQuestionAnswering,
+                            BertForTokenClassification, BertForQuestionAnswering,
-                       load_tf_weights_in_bert, BERT_PRETRAINED_MODEL_ARCHIVE_MAP,
+                            load_tf_weights_in_bert, BERT_PRETRAINED_MODEL_ARCHIVE_MAP,
-                       BERT_PRETRAINED_CONFIG_ARCHIVE_MAP)
+                            BERT_PRETRAINED_CONFIG_ARCHIVE_MAP)
-from .modeling_openai import (OpenAIGPTConfig, OpenAIGPTModel,
+from .modeling_openai import (OpenAIGPTConfig, OpenAIGPTPreTrainedModel, OpenAIGPTModel,
                              OpenAIGPTLMHeadModel, OpenAIGPTDoubleHeadsModel,
                              load_tf_weights_in_openai_gpt, OPENAI_GPT_PRETRAINED_CONFIG_ARCHIVE_MAP,
                              OPENAI_GPT_PRETRAINED_MODEL_ARCHIVE_MAP)
-from .modeling_transfo_xl import (TransfoXLConfig, TransfoXLModel, TransfoXLLMHeadModel,
+from .modeling_transfo_xl import (TransfoXLConfig, TransfoXLPreTrainedModel, TransfoXLModel, TransfoXLLMHeadModel,
                                  load_tf_weights_in_transfo_xl, TRANSFO_XL_PRETRAINED_CONFIG_ARCHIVE_MAP,
                                  TRANSFO_XL_PRETRAINED_MODEL_ARCHIVE_MAP)
-from .modeling_gpt2 import (GPT2Config, GPT2Model,
+from .modeling_gpt2 import (GPT2Config, GPT2PreTrainedModel, GPT2Model,
                            GPT2LMHeadModel, GPT2DoubleHeadsModel,
                            load_tf_weights_in_gpt2, GPT2_PRETRAINED_CONFIG_ARCHIVE_MAP,
                            GPT2_PRETRAINED_MODEL_ARCHIVE_MAP)
@@ -32,7 +32,7 @@ from .modeling_xlnet import (XLNetConfig,
                             XLNetForSequenceClassification, XLNetForQuestionAnswering,
                             load_tf_weights_in_xlnet, XLNET_PRETRAINED_CONFIG_ARCHIVE_MAP,
                             XLNET_PRETRAINED_MODEL_ARCHIVE_MAP)
-from .modeling_xlm import (XLMConfig, XLMModel,
+from .modeling_xlm import (XLMConfig, XLMPreTrainedModel , XLMModel,
                           XLMWithLMHeadModel, XLMForSequenceClassification,
                           XLMForQuestionAnswering, XLM_PRETRAINED_CONFIG_ARCHIVE_MAP,
                           XLM_PRETRAINED_MODEL_ARCHIVE_MAP)
--- a/pytorch_transformers/convert_pytorch_checkpoint_to_tf.py
+++ b/pytorch_transformers/convert_pytorch_checkpoint_to_tf.py
@@ -41,7 +41,7 @@ def convert_pytorch_checkpoint_to_tf(model:BertModel, ckpt_dir:str, model_name:s
        N BertForQuestionAnswering
    """
-    tensors_to_transopse = (
+    tensors_to_transpose = (
        "dense.weight",
        "attention.self.query",
        "attention.self.key",
@@ -62,34 +62,34 @@ def convert_pytorch_checkpoint_to_tf(model:BertModel, ckpt_dir:str, model_name:s
    if not os.path.isdir(ckpt_dir):
        os.makedirs(ckpt_dir)
-    session = tf.Session()
    state_dict = model.state_dict()
-    tf_vars = []
    def to_tf_var_name(name:str):
        for patt, repl in iter(var_map):
            name = name.replace(patt, repl)
        return 'bert/{}'.format(name)
-    def assign_tf_var(tensor:np.ndarray, name:str):
-        tmp_var = tf.Variable(initial_value=tensor)
-        tf_var = tf.get_variable(dtype=tmp_var.dtype, shape=tmp_var.shape, name=name)
-        op = tf.assign(ref=tf_var, value=tmp_var)
-        session.run(tf.variables_initializer([tmp_var, tf_var]))
-        session.run(fetches=[op, tf_var])
+    def create_tf_var(tensor:np.ndarray, name:str, session:tf.Session):
+        tf_dtype = tf.dtypes.as_dtype(tensor.dtype)
+        tf_var = tf.get_variable(dtype=tf_dtype, shape=tensor.shape, name=name, initializer=tf.zeros_initializer())
+        session.run(tf.variables_initializer([tf_var]))
+        session.run(tf_var)
        return tf_var
-    for var_name in state_dict:
-        tf_name = to_tf_var_name(var_name)
-        torch_tensor = state_dict[var_name].numpy()
-        if any([x in var_name for x in tensors_to_transopse]):
-            torch_tensor = torch_tensor.T
-        tf_tensor = assign_tf_var(tensor=torch_tensor, name=tf_name)
-        tf_vars.append(tf_tensor)
-        print("{0}{1}initialized".format(tf_name, " " * (60 - len(tf_name))))
-    saver = tf.train.Saver(tf_vars)
-    saver.save(session, os.path.join(ckpt_dir, model_name.replace("-", "_") + ".ckpt"))
+    tf.reset_default_graph()
+    with tf.Session() as session:
+        for var_name in state_dict:
+            tf_name = to_tf_var_name(var_name)
+            torch_tensor = state_dict[var_name].numpy()
+            if any([x in var_name for x in tensors_to_transpose]):
+                torch_tensor = torch_tensor.T
+            tf_var = create_tf_var(tensor=torch_tensor, name=tf_name, session=session)
+            tf.keras.backend.set_value(tf_var, torch_tensor)
+            tf_weight = session.run(tf_var)
+            print("Successfully created {}: {}".format(tf_name, np.allclose(tf_weight, torch_tensor)))
+        saver = tf.train.Saver(tf.trainable_variables())
+        saver.save(session, os.path.join(ckpt_dir, model_name.replace("-", "_") + ".ckpt"))
 def main(raw_args=None):
--- a/pytorch_transformers/convert_transfo_xl_checkpoint_to_pytorch.py
+++ b/pytorch_transformers/convert_transfo_xl_checkpoint_to_pytorch.py
@@ -24,11 +24,10 @@ from io import open
 import torch
 import pytorch_transformers.tokenization_transfo_xl as data_utils
-from pytorch_transformers.modeling_transfo_xl import (CONFIG_NAME,
-                                                         WEIGHTS_NAME,
-                                                         TransfoXLConfig,
-                                                         TransfoXLLMHeadModel,
-                                                         load_tf_weights_in_transfo_xl)
+
+from pytorch_transformers import CONFIG_NAME, WEIGHTS_NAME
+from pytorch_transformers.modeling_transfo_xl import (TransfoXLConfig, TransfoXLLMHeadModel,
+                                                      load_tf_weights_in_transfo_xl)
 from pytorch_transformers.tokenization_transfo_xl import (CORPUS_NAME, VOCAB_FILES_NAMES)
 if sys.version_info[0] == 2:
--- a/pytorch_transformers/modeling_openai.py
+++ b/pytorch_transformers/modeling_openai.py
@@ -538,7 +538,7 @@ class OpenAIGPTLMHeadModel(OpenAIGPTPreTrainedModel):
    r"""
        **labels**: (`optional`) ``torch.LongTensor`` of shape ``(batch_size, sequence_length)``:
            Labels for language modeling.
-            Note that the labels **are shifted** inside the model, i.e. you can set ``lm_labels = input_ids``
+            Note that the labels **are shifted** inside the model, i.e. you can set ``labels = input_ids``
            Indices are selected in ``[-1, 0, ..., config.vocab_size]``
            All labels set to ``-1`` are ignored (masked), the loss is only
            computed for labels in ``[0, ..., config.vocab_size]``
--- a/pytorch_transformers/modeling_utils.py
+++ b/pytorch_transformers/modeling_utils.py
@@ -39,6 +39,20 @@ WEIGHTS_NAME = "pytorch_model.bin"
 TF_WEIGHTS_NAME = 'model.ckpt'
 try:
    from torch.nn import Identity
 except ImportError:
    # Older PyTorch compatibility
    class Identity(nn.Module):
        r"""A placeholder identity operator that is argument-insensitive.
        """
        def __init__(self, *args, **kwargs):
            super(Identity, self).__init__()
        def forward(self, input):
            return input
 if not six.PY2:
    def add_start_docstrings(*docstr):
        def docstring_decorator(fn):
@@ -783,7 +797,7 @@ class SequenceSummary(nn.Module):
            # We can probably just use the multi-head attention module of PyTorch >=1.1.0
            raise NotImplementedError
-        self.summary = nn.Identity()
+        self.summary = Identity()
        if hasattr(config, 'summary_use_proj') and config.summary_use_proj:
            if hasattr(config, 'summary_proj_to_labels') and config.summary_proj_to_labels and config.num_labels > 0:
                num_classes = config.num_labels
@@ -791,15 +805,15 @@ class SequenceSummary(nn.Module):
                num_classes = config.hidden_size
            self.summary = nn.Linear(config.hidden_size, num_classes)
-        self.activation = nn.Identity()
+        self.activation = Identity()
        if hasattr(config, 'summary_activation') and config.summary_activation == 'tanh':
            self.activation = nn.Tanh()
-        self.first_dropout = nn.Identity()
+        self.first_dropout = Identity()
        if hasattr(config, 'summary_first_dropout') and config.summary_first_dropout > 0:
            self.first_dropout = nn.Dropout(config.summary_first_dropout)
-        self.last_dropout = nn.Identity()
+        self.last_dropout = Identity()
        if hasattr(config, 'summary_last_dropout') and config.summary_last_dropout > 0:
            self.last_dropout = nn.Dropout(config.summary_last_dropout)
--- a/pytorch_transformers/tokenization_utils.py
+++ b/pytorch_transformers/tokenization_utils.py
@@ -226,26 +226,46 @@ class PreTrainedTokenizer(object):
        s3_models = list(cls.max_model_input_sizes.keys())
        vocab_files = {}
        if pretrained_model_name_or_path in s3_models:
            # Get the vocabulary from AWS S3 bucket
            for file_id, map_list in cls.pretrained_vocab_files_map.items():
                vocab_files[file_id] = map_list[pretrained_model_name_or_path]
        else:
            # Get the vocabulary from local files
            logger.info(
                "Model name '{}' not found in model shortcut name list ({}). "
                "Assuming '{}' is a path or url to a directory containing tokenizer files.".format(
                    pretrained_model_name_or_path, ', '.join(s3_models),
                    pretrained_model_name_or_path))
-            all_vocab_files_names = {'added_tokens_file': ADDED_TOKENS_FILE,
-                                     'special_tokens_map_file': SPECIAL_TOKENS_MAP_FILE}
-            all_vocab_files_names.update(cls.vocab_files_names)
-            for file_id, file_name in all_vocab_files_names.items():
+
+            # Look for the tokenizer main vocabulary files
+            for file_id, file_name in cls.vocab_files_names.items():
                if os.path.isdir(pretrained_model_name_or_path):
                    # If a directory is provided we look for the standard filenames
                    full_file_name = os.path.join(pretrained_model_name_or_path, file_name)
                else:
                    # If a path to a file is provided we use it (will only work for non-BPE tokenizer using a single vocabulary file)
                    full_file_name = pretrained_model_name_or_path
                if not os.path.exists(full_file_name):
                    logger.info("Didn't find file {}. We won't load it.".format(full_file_name))
                    full_file_name = None
                vocab_files[file_id] = full_file_name
            # Look for the additional tokens files
            all_vocab_files_names = {'added_tokens_file': ADDED_TOKENS_FILE,
                                     'special_tokens_map_file': SPECIAL_TOKENS_MAP_FILE}
            # If a path to a file was provided, get the parent directory
            saved_directory = pretrained_model_name_or_path
            if os.path.exists(saved_directory) and not os.path.isdir(saved_directory):
                saved_directory = os.path.dirname(saved_directory)
            for file_id, file_name in all_vocab_files_names.items():
                full_file_name = os.path.join(saved_directory, file_name)
                if not os.path.exists(full_file_name):
                    logger.info("Didn't find file {}. We won't load it.".format(full_file_name))
                    full_file_name = None
                vocab_files[file_id] = full_file_name
            if all(full_file_name is None for full_file_name in vocab_files.values()):
                logger.error(
                    "Model name '{}' was not found in model name list ({}). "
@@ -333,7 +353,7 @@ class PreTrainedTokenizer(object):
        with open(added_tokens_file, 'w', encoding='utf-8') as f:
            if self.added_tokens_encoder:
-                out_str = json.dumps(self.added_tokens_decoder, ensure_ascii=False)
+                out_str = json.dumps(self.added_tokens_encoder, ensure_ascii=False)
            else:
                out_str = u"{}"
            f.write(out_str)
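Review note: the last hunk swaps `added_tokens_decoder` for `added_tokens_encoder` in the `json.dumps` call. A minimal sketch of why the direction matters, assuming the loader reads the file back as a token-to-id mapping. The token strings, ids and the helper `dump_added_tokens` are hypothetical and used only for illustration.

```python
import json

# Hypothetical added-token tables, shaped like the tokenizer's two mappings.
added_tokens_encoder = {"<special1>": 30522, "<special2>": 30523}       # token -> id
added_tokens_decoder = {v: k for k, v in added_tokens_encoder.items()}  # id -> token


def dump_added_tokens(mapping):
    """Serialize a mapping with the same json.dumps options as the diff."""
    return json.dumps(mapping, ensure_ascii=False)


# Dumping the encoder round-trips to the same token -> id mapping.
restored = json.loads(dump_added_tokens(added_tokens_encoder))

# Dumping the decoder stores id -> token instead, and JSON turns the
# integer ids into string keys, so the direction and key type are both lost.
inverted = json.loads(dump_added_tokens(added_tokens_decoder))
```

Here `restored` equals the original encoder, while `inverted` is keyed by the stringified ids, which is not what a token-to-id reader would expect.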