Merge pull request #73 from huggingface/third-release

Third release
update readme
2018-11-30 23:10:30 +01:00 · 2018-11-30 23:05:18 +01:00 · 2018-11-30 23:01:10 +01:00 · 2018-11-30 22:56:02 +01:00 · 2018-11-30 22:55:33 +01:00 · 2018-11-30 22:55:26 +01:00
10 changed files with 430 additions and 196 deletions
--- a/README.md
+++ b/README.md
@@ -2,7 +2,7 @@

 This repository contains an op-for-op PyTorch reimplementation of [Google's TensorFlow repository for the BERT model](https://github.com/google-research/bert) that was released together with the paper [BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding](https://arxiv.org/abs/1810.04805) by Jacob Devlin, Ming-Wei Chang, Kenton Lee and Kristina Toutanova.

-This implementation is provided with [Google's pre-trained models](https://github.com/google-research/bert)) and a conversion script to load any pre-trained TensorFlow checkpoint for BERT is also provided.
+This implementation comes with [Google's pre-trained models](https://github.com/google-research/bert), examples, notebooks and a command-line interface to load any pre-trained TensorFlow checkpoint for BERT.

 ## Content

@@ -14,7 +14,7 @@ This implementation is provided with [Google's pre-trained models](https://githu
 | [Doc](#doc) |  Detailed documentation |
 | [Examples](#examples) | Detailed examples on how to fine-tune Bert |
 | [Notebooks](#notebooks) | Introduction on the provided Jupyter Notebooks |
-| [TPU](#tup) | Notes on TPU support and pretraining scripts |
+| [TPU](#tpu) | Notes on TPU support and pretraining scripts |
 | [Command-line interface](#Command-line-interface) | Convert a TensorFlow checkpoint in a PyTorch dump |

 ## Installation
@@ -25,7 +25,7 @@ This repo was tested on Python 3.5+ and PyTorch 0.4.1

 PyTorch pretrained bert can be installed by pip as follows:
 ```bash
-pip install pytorch_pretrained_bert
+pip install pytorch-pretrained-bert
 ```

 ### From source
@@ -46,23 +46,24 @@ python -m pytest -sv tests/

 This package comprises the following classes that can be imported in Python and are detailed in the [Doc](#doc) section of this readme:

-- Six PyTorch models (`torch.nn.Module`) for Bert with pre-trained weights:
-  - `BertModel` - raw BERT Transformer model (**fully pre-trained**),
-  - `BertForMaskedLM` - BERT Transformer with the pre-trained masked language modeling head on top (**fully pre-trained**),
-  - `BertForNextSentencePrediction` - BERT Transformer with the pre-trained next sentence prediction classifier on top  (**fully pre-trained**),
-  - `BertForPreTraining` - BERT Transformer with masked language modeling head and next sentence prediction classifier on top (**fully pre-trained**),
-  - `BertForSequenceClassification` - BERT Transformer with a sequence classification head on top (BERT Transformer is **pre-trained**, the sequence classification head **is only initialized and has to be trained**),
-  - `BertForQuestionAnswering` - BERT Transformer with a token classification head on top (BERT Transformer is **pre-trained**, the token classification head **is only initialized and has to be trained**).
+- Seven PyTorch models (`torch.nn.Module`) for Bert with pre-trained weights (in the [`modeling.py`](./pytorch_pretrained_bert/modeling.py) file):
+  - [`BertModel`](./pytorch_pretrained_bert/modeling.py#L537) - raw BERT Transformer model (**fully pre-trained**),
+  - [`BertForMaskedLM`](./pytorch_pretrained_bert/modeling.py#L691) - BERT Transformer with the pre-trained masked language modeling head on top (**fully pre-trained**),
+  - [`BertForNextSentencePrediction`](./pytorch_pretrained_bert/modeling.py#L752) - BERT Transformer with the pre-trained next sentence prediction classifier on top  (**fully pre-trained**),
+  - [`BertForPreTraining`](./pytorch_pretrained_bert/modeling.py#L620) - BERT Transformer with masked language modeling head and next sentence prediction classifier on top (**fully pre-trained**),
+  - [`BertForSequenceClassification`](./pytorch_pretrained_bert/modeling.py#L814) - BERT Transformer with a sequence classification head on top (BERT Transformer is **pre-trained**, the sequence classification head **is only initialized and has to be trained**),
+  - [`BertForTokenClassification`](./pytorch_pretrained_bert/modeling.py#L880) - BERT Transformer with a token classification head on top (BERT Transformer is **pre-trained**, the token classification head **is only initialized and has to be trained**),
+  - [`BertForQuestionAnswering`](./pytorch_pretrained_bert/modeling.py#L946) - BERT Transformer with a token classification head on top (BERT Transformer is **pre-trained**, the token classification head **is only initialized and has to be trained**).

-- Three tokenizers:
+- Three tokenizers (in the [`tokenization.py`](./pytorch_pretrained_bert/tokenization.py) file):
  - `BasicTokenizer` - basic tokenization (punctuation splitting, lower casing, etc.),
  - `WordpieceTokenizer` - WordPiece tokenization,
  - `BertTokenizer` - perform end-to-end tokenization, i.e. basic tokenization followed by WordPiece tokenization.

-- One optimizer:
+- One optimizer (in the [`optimization.py`](./pytorch_pretrained_bert/optimization.py) file):
  - `BertAdam` - Bert version of Adam algorithm with weight decay fix, warmup and linear decay of the learning rate.

-- A configuration class:
+- A configuration class (in the [`modeling.py`](./pytorch_pretrained_bert/modeling.py) file):
  - `BertConfig` - Configuration class to store the configuration of a `BertModel` with utilities to read and write from JSON configuration files.

 The repository further comprises:
@@ -99,7 +100,7 @@ from pytorch_pretrained_bert import BertTokenizer, BertModel, BertForMaskedLM
 tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')

 # Tokenized input
-tokenized_text = "Who was Jim Henson ? Jim Henson was a puppeteer"
+text = "Who was Jim Henson ? Jim Henson was a puppeteer"
 tokenized_text = tokenizer.tokenize(text)

 # Mask a token that we will try to predict back with `BertForMaskedLM`
@@ -142,7 +143,7 @@ predictions = model(tokens_tensor, segments_tensors)

 # confirm we were able to predict 'henson'
 predicted_index = torch.argmax(predictions[0, masked_index]).item()
-predicted_token = tokenizer.convert_ids_to_tokens([predicted_index])
+predicted_token = tokenizer.convert_ids_to_tokens([predicted_index])[0]
 assert predicted_token == 'henson'
 ```
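
Note on the `[0]` fix in the hunk above: `convert_ids_to_tokens` is called with a list of ids, so it returns a list of tokens, and a list never compares equal to a string. A minimal sketch with a hypothetical stand-in vocabulary (the `id_to_token` dict and the ids are made up for illustration; this is not the library's implementation):

```python
# Hypothetical stand-in for a tokenizer vocabulary; the ids are made up.
id_to_token = {0: "[PAD]", 1: "who", 2: "henson"}


def convert_ids_to_tokens(ids):
    """Map a list of ids to a list of tokens (list in, list out)."""
    return [id_to_token[i] for i in ids]


predicted_index = 2
as_list = convert_ids_to_tokens([predicted_index])
predicted_token = as_list[0]  # take the single token out of the list

print(as_list == "henson")          # False: a list is never equal to a str
print(predicted_token == "henson")  # True
```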

@@ -153,37 +154,44 @@ Here is a detailed documentation of the classes in the package and how to use th
 | Sub-section | Description |
 |-|-|
 | [Loading Google AI's pre-trained weights](#Loading-Google-AIs-pre-trained-weigths-and-PyTorch-dump) | How to load Google AI's pre-trained weights or a PyTorch saved instance |
-| [PyTorch models](#PyTorch-models) | API of the six PyTorch model classes: `BertModel`, `BertForMaskedLM`, `BertForNextSentencePrediction`, `BertForPreTraining`, `BertForSequenceClassification` or `BertForQuestionAnswering` |
+| [PyTorch models](#PyTorch-models) | API of the seven PyTorch model classes: `BertModel`, `BertForMaskedLM`, `BertForNextSentencePrediction`, `BertForPreTraining`, `BertForSequenceClassification`, `BertForTokenClassification` or `BertForQuestionAnswering` |
 | [Tokenizer: `BertTokenizer`](#Tokenizer-BertTokenizer) | API of the `BertTokenizer` class|
 | [Optimizer: `BertAdam`](#Optimizer-BertAdam) |  API of the `BertAdam` class |

 ### Loading Google AI's pre-trained weigths and PyTorch dump

-To load Google AI's pre-trained weight or a PyTorch saved instance of `BertForPreTraining`, the PyTorch model classes and the tokenizer can be instantiated as
+To load one of Google AI's pre-trained models or a PyTorch saved model (an instance of `BertForPreTraining` saved with `torch.save()`), the PyTorch model classes and the tokenizer can be instantiated as

 ```python
-model = BERT_CLASS.from_pretrain(PRE_TRAINED_MODEL_NAME_OR_PATH)
+model = BERT_CLASS.from_pretrained(PRE_TRAINED_MODEL_NAME_OR_PATH, cache_dir=None)
 ```

 where

-- `BERT_CLASS` is either the `BertTokenizer` class (to load the vocabulary) or one of the six PyTorch model classes (to load the pre-trained weights): `BertModel`, `BertForMaskedLM`, `BertForNextSentencePrediction`, `BertForPreTraining`, `BertForSequenceClassification` or `BertForQuestionAnswering`, and
-
-- `PRE_TRAINED_MODEL_NAME` is either:
+- `BERT_CLASS` is either the `BertTokenizer` class (to load the vocabulary) or one of the seven PyTorch model classes (to load the pre-trained weights): `BertModel`, `BertForMaskedLM`, `BertForNextSentencePrediction`, `BertForPreTraining`, `BertForSequenceClassification`, `BertForTokenClassification` or `BertForQuestionAnswering`, and
+- `PRE_TRAINED_MODEL_NAME_OR_PATH` is either:

  - the shortcut name of a Google AI's pre-trained model selected in the list:

    - `bert-base-uncased`: 12-layer, 768-hidden, 12-heads, 110M parameters
    - `bert-large-uncased`: 24-layer, 1024-hidden, 16-heads, 340M parameters
    - `bert-base-cased`: 12-layer, 768-hidden, 12-heads , 110M parameters
-    - `bert-base-multilingual`: 102 languages, 12-layer, 768-hidden, 12-heads, 110M parameters
+    - `bert-large-cased`: 24-layer, 1024-hidden, 16-heads, 340M parameters
+    - `bert-base-multilingual-uncased`: (Orig, not recommended) 102 languages, 12-layer, 768-hidden, 12-heads, 110M parameters
+    - `bert-base-multilingual-cased`: **(New, recommended)** 104 languages, 12-layer, 768-hidden, 12-heads, 110M parameters
    - `bert-base-chinese`: Chinese Simplified and Traditional, 12-layer, 768-hidden, 12-heads, 110M parameters

  - a path or url to a pretrained model archive containing:
-      . `bert_config.json` a configuration file for the model
-      . `pytorch_model.bin` a PyTorch dump of a pre-trained instance `BertForPreTraining` (saved with the usual `torch.save()`)

-If `PRE_TRAINED_MODEL_NAME` is a shortcut name, the pre-trained weights will be downloaded from AWS S3 (see the links [here](pytorch_pretrained_bert/modeling.py)) and stored in a cache folder to avoid future download (the cache folder can be found at `~/.pytorch_pretrained_bert/`).
+    - `bert_config.json` a configuration file for the model, and
+    - `pytorch_model.bin` a PyTorch dump of a pre-trained instance `BertForPreTraining` (saved with the usual `torch.save()`)
+
+  If `PRE_TRAINED_MODEL_NAME_OR_PATH` is a shortcut name, the pre-trained weights will be downloaded from AWS S3 (see the links [here](pytorch_pretrained_bert/modeling.py)) and stored in a cache folder to avoid future download (the cache folder can be found at `~/.pytorch_pretrained_bert/`).
+- `cache_dir` can be an optional path to a specific directory to download and cache the pre-trained model weights. This option is useful in particular when you are using distributed training: to avoid concurrent access to the same weights you can set for example `cache_dir='./pretrained_model_{}'.format(args.local_rank)` (see the section on distributed training for more information)
+
+`Uncased` means that the text has been lowercased before WordPiece tokenization, e.g., `John Smith` becomes `john smith`. The Uncased model also strips out any accent markers. `Cased` means that the true case and accent markers are preserved. Typically, the Uncased model is better unless you know that case information is important for your task (e.g., Named Entity Recognition or Part-of-Speech tagging). For information about the Multilingual and Chinese model, see the [Multilingual README](https://github.com/google-research/bert/blob/master/multilingual.md) or the original TensorFlow repository.
+
+**When using an `uncased` model, make sure to pass `--do_lower_case` to the training scripts (or pass `do_lower_case=True` directly to `BertTokenizer` if you're using your own script).**
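
To make the cased/uncased distinction concrete, here is a minimal sketch of the normalization described above (lowercasing plus accent stripping), using only the standard library. The helper name `uncased_normalize` is hypothetical; this is an illustration, not the tokenizer's actual code:

```python
import unicodedata


def uncased_normalize(text):
    """Lowercase text and strip accent markers (combining marks)."""
    text = text.lower()
    # NFD decomposition splits an accented character into base + combining mark.
    decomposed = unicodedata.normalize("NFD", text)
    # Drop the combining marks (Unicode category "Mn").
    return "".join(ch for ch in decomposed if unicodedata.category(ch) != "Mn")


print(uncased_normalize("John Smith"))  # john smith
print(uncased_normalize("Héllo"))       # hello
```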

 Example:
 ```python
@@ -202,15 +210,15 @@ We detail them here. This model takes as *inputs*:

 - `input_ids`: a torch.LongTensor of shape [batch_size, sequence_length] with the word token indices in the vocabulary (see the tokens preprocessing logic in the scripts `extract_features.py`, `run_classifier.py` and `run_squad.py`), and
 - `token_type_ids`: an optional torch.LongTensor of shape [batch_size, sequence_length] with the token types indices selected in [0, 1]. Type 0 corresponds to a `sentence A` and type 1 corresponds to a `sentence B` token (see BERT paper for more details).
-- `attention_mask`: an optional torch.LongTensor of shape [batch_size, sequence_length] with indices selected in [0, 1]. It's a mask to be used if the input sequence length is smaller than the max input sequence length in the current batch. It's the mask that we typically use for attention when a batch has varying length sentences.
+- `attention_mask`: an optional torch.LongTensor of shape [batch_size, sequence_length] with indices selected in [0, 1]. It's a mask to be used if some input sequence lengths are smaller than the max input sequence length of the current batch. It's the mask that we typically use for attention when a batch has varying length sentences.
 - `output_all_encoded_layers`: boolean which controls the content of the `encoded_layers` output as described below. Default: `True`.

 This model *outputs* a tuple composed of:

 - `encoded_layers`: controlled by the value of the `output_all_encoded_layers` argument:

-  . `output_all_encoded_layers=True`: outputs a list of the encoded-hidden-states at the end of each attention block (i.e. 12 full sequences for BERT-base, 24 for BERT-large), each encoded-hidden-state is a torch.FloatTensor of size [batch_size, sequence_length, hidden_size],
-  . `output_all_encoded_layers=False`: outputs only the encoded-hidden-states corresponding to the last attention block,
+  - `output_all_encoded_layers=True`: outputs a list of the encoded-hidden-states at the end of each attention block (i.e. 12 full sequences for BERT-base, 24 for BERT-large), each encoded-hidden-state is a torch.FloatTensor of size [batch_size, sequence_length, hidden_size],
+  - `output_all_encoded_layers=False`: outputs only the encoded-hidden-states corresponding to the last attention block, i.e. a single torch.FloatTensor of size [batch_size, sequence_length, hidden_size],

 - `pooled_output`: a torch.FloatTensor of size [batch_size, hidden_size] which is the output of a classifier pretrained on top of the hidden state associated with the first token of the input (`[CLS]`) to train on the Next-Sentence task (see BERT's paper).

@@ -232,6 +240,7 @@ An example on how to use this class is given in the `extract_features.py` script

 - if `masked_lm_labels` and `next_sentence_label` are not `None`: Outputs the total_loss which is the sum of the masked language modeling loss and the next sentence classification loss.
 - if `masked_lm_labels` or `next_sentence_label` is `None`: Outputs a tuple comprising
+
  - the masked language modeling logits, and
  - the next sentence classification logits.

@@ -269,7 +278,13 @@ The sequence-level classifier is a linear layer that takes as input the last hid

 An example on how to use this class is given in the `run_classifier.py` script which can be used to fine-tune a single sequence (or pair of sequence) classifier using BERT, for example for the MRPC task.

-#### 6. `BertForQuestionAnswering`
+#### 6. `BertForTokenClassification`
+
+`BertForTokenClassification` is a fine-tuning model that includes `BertModel` and a token-level classifier on top of the `BertModel`.
+
+The token-level classifier is a linear layer that takes as input the last hidden state of the sequence.
+
+#### 7. `BertForQuestionAnswering`

 `BertForQuestionAnswering` is a fine-tuning model that includes `BertModel` with token-level classifiers on top of the full sequence of last hidden states.

@@ -304,15 +319,15 @@ Please refer to the doc strings and code in [`tokenization.py`](./pytorch_pretra
 The optimizer accepts the following arguments:

 - `lr` : learning rate
-- `warmup` : portion of t_total for the warmup, -1  means no warmup. Default : -1
+- `warmup` : portion of `t_total` for the warmup, `-1`  means no warmup. Default : `-1`
 - `t_total` : total number of training steps for the learning
-    rate schedule, -1  means constant learning rate. Default : -1
-- `schedule` : schedule to use for the warmup (see above). Default : 'warmup_linear'
-- `b1` : Adams b1. Default : 0.9
-- `b2` : Adams b2. Default : 0.999
-- `e` : Adams epsilon. Default : 1e-6
-- `weight_decay_rate:` Weight decay. Default : 0.01
-- `max_grad_norm` : Maximum norm for the gradients (-1 means no clipping). Default : 1.0
+    rate schedule, `-1`  means constant learning rate. Default : `-1`
+- `schedule` : schedule to use for the warmup (see above). Default : `'warmup_linear'`
+- `b1` : Adam's b1. Default : `0.9`
+- `b2` : Adam's b2. Default : `0.999`
+- `e` : Adam's epsilon. Default : `1e-6`
+- `weight_decay_rate` : Weight decay. Default : `0.01`
+- `max_grad_norm` : Maximum norm for the gradients (`-1` means no clipping). Default : `1.0`
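
As a rough illustration of what a linear warmup followed by a linear decay looks like, here is a minimal sketch. It assumes the learning-rate multiplier is computed from the training progress `x = step / t_total`; the function name and the exact formula are assumptions for illustration and may differ from the code in `optimization.py`:

```python
def warmup_linear_sketch(x, warmup=0.1):
    """Learning-rate multiplier for training progress x in [0, 1].

    Ramps linearly from 0 to 1 during the first `warmup` fraction of
    training, then follows the linear decay 1 - x.
    """
    if x < warmup:
        return x / warmup
    return 1.0 - x


base_lr = 5e-5   # illustrative value
t_total = 1000   # illustrative value
for step in (0, 50, 100, 500, 1000):
    lr = base_lr * warmup_linear_sketch(step / t_total, warmup=0.1)
    print(step, lr)
```

With `warmup=0.1`, the multiplier is still rising at step 50 and decreases from step 100 onward.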

 ## Examples

@@ -419,10 +434,7 @@ To get these results we used a combination of:
 Here is the full list of hyper-parameters for this run:
 ```bash
 python ./run_squad.py \
-  --vocab_file $BERT_LARGE_DIR/vocab.txt \
-  --bert_config_file $BERT_LARGE_DIR/bert_config.json \
-  --init_checkpoint $BERT_LARGE_DIR/pytorch_model.bin \
-  --do_lower_case \
+  --bert_model bert-large-uncased \
  --do_train \
  --do_predict \
  --train_file $SQUAD_TRAIN \
@@ -442,10 +454,7 @@ If you have a recent GPU (starting from NVIDIA Volta series), you should try **1
 Here is an example of hyper-parameters for a FP16 run we tried:
 ```bash
 python ./run_squad.py \
-  --vocab_file $BERT_LARGE_DIR/vocab.txt \
-  --bert_config_file $BERT_LARGE_DIR/bert_config.json \
-  --init_checkpoint $BERT_LARGE_DIR/pytorch_model.bin \
-  --do_lower_case \
+  --bert_model bert-large-uncased \
  --do_train \
  --do_predict \
  --train_file $SQUAD_TRAIN \
@@ -467,23 +476,21 @@ The results were similar to the above FP32 results (actually slightly higher):

 ## Notebooks

-Comparing the PyTorch model and the TensorFlow model predictions
-
-We also include [three Jupyter Notebooks](https://github.com/huggingface/pytorch-pretrained-BERT/tree/master/notebooks) that can be used to check that the predictions of the PyTorch model are identical to the predictions of the original TensorFlow model.
+We include [three Jupyter Notebooks](https://github.com/huggingface/pytorch-pretrained-BERT/tree/master/notebooks) that can be used to check that the predictions of the PyTorch model are identical to the predictions of the original TensorFlow model.

 - The first NoteBook ([Comparing-TF-and-PT-models.ipynb](./notebooks/Comparing-TF-and-PT-models.ipynb)) extracts the hidden states of a full sequence on each layer of the TensorFlow and the PyTorch models and computes the standard deviation between them. In the given example, we get a standard deviation of 1.5e-7 to 9e-7 on the various hidden states of the models.

 - The second NoteBook ([Comparing-TF-and-PT-models-SQuAD.ipynb](./notebooks/Comparing-TF-and-PT-models-SQuAD.ipynb)) compares the loss computed by the TensorFlow and the PyTorch models for identical initialization of the fine-tuning layer of the `BertForQuestionAnswering` and computes the standard deviation between them. In the given example, we get a standard deviation of 2.5e-7 between the models.

-- The third NoteBook ([Comparing-TF-and-PT-models-MLM-NSP.ipynb](./notebooks/Comparing-TF-and-PT-models-MLM-NSP.ipynb)) compares the predictions computed by the TensorFlow and the PyTorch models for masked token using the pre-trained masked language modeling model.
+- The third NoteBook ([Comparing-TF-and-PT-models-MLM-NSP.ipynb](./notebooks/Comparing-TF-and-PT-models-MLM-NSP.ipynb)) compares the predictions computed by the TensorFlow and the PyTorch models for masked token language modeling using the pre-trained masked language modeling model.

 Please follow the instructions given in the notebooks to run and modify them.

 ## Command-line interface

-A command-line interface is provided to convert a TensorFlow checkpoint in a PyTorch checkpoint
+A command-line interface is provided to convert a TensorFlow checkpoint into a PyTorch dump of the `BertForPreTraining` class (see above).

-You can convert any TensorFlow checkpoint for BERT (in particular [the pre-trained models released by Google](https://github.com/google-research/bert#pre-trained-models)) in a PyTorch save file by using the [`convert_tf_checkpoint_to_pytorch.py`](convert_tf_checkpoint_to_pytorch.py) script.
+You can convert any TensorFlow checkpoint for BERT (in particular [the pre-trained models released by Google](https://github.com/google-research/bert#pre-trained-models)) into a PyTorch save file by using the [`./pytorch_pretrained_bert/convert_tf_checkpoint_to_pytorch.py`](./pytorch_pretrained_bert/convert_tf_checkpoint_to_pytorch.py) script.

 This CLI takes as input a TensorFlow checkpoint (three files starting with `bert_model.ckpt`) and the associated configuration file (`bert_config.json`), and creates a PyTorch model for this configuration, loads the weights from the TensorFlow checkpoint in the PyTorch model and saves the resulting model in a standard PyTorch save file that can be imported using `torch.load()` (see examples in `extract_features.py`, `run_classifier.py` and `run_squad.py`).

@@ -497,9 +504,9 @@ Here is an example of the conversion process for a pre-trained `BERT-Base Uncase
 export BERT_BASE_DIR=/path/to/bert/uncased_L-12_H-768_A-12

 pytorch_pretrained_bert convert_tf_checkpoint_to_pytorch \
-  --tf_checkpoint_path $BERT_BASE_DIR/bert_model.ckpt \
-  --bert_config_file $BERT_BASE_DIR/bert_config.json \
-  --pytorch_dump_path $BERT_BASE_DIR/pytorch_model.bin
+  $BERT_BASE_DIR/bert_model.ckpt \
+  $BERT_BASE_DIR/bert_config.json \
+  $BERT_BASE_DIR/pytorch_model.bin
 ```

 You can download Google's pre-trained models for the conversion [here](https://github.com/google-research/bert#pre-trained-models).
--- a/examples/extract_features.py
+++ b/examples/extract_features.py
@@ -28,7 +28,7 @@ import torch
 from torch.utils.data import TensorDataset, DataLoader, SequentialSampler
 from torch.utils.data.distributed import DistributedSampler

-from pytorch_pretrained_bert.tokenization import convert_to_unicode, BertTokenizer
+from pytorch_pretrained_bert.tokenization import BertTokenizer
 from pytorch_pretrained_bert.modeling import BertModel

 logging.basicConfig(format = '%(asctime)s - %(levelname)s - %(name)s -   %(message)s', 
@@ -170,7 +170,7 @@ def read_examples(input_file):
    unique_id = 0
    with open(input_file, "r") as reader:
        while True:
-            line = convert_to_unicode(reader.readline())
+            line = reader.readline()
            if not line:
                break
            line = line.strip()
@@ -199,6 +199,7 @@ def main():
                             "bert-large-uncased, bert-base-cased, bert-base-multilingual, bert-base-chinese.")

    ## Other parameters
+    parser.add_argument("--do_lower_case", default=False, action='store_true', help="Set this flag if you are using an uncased model.")
    parser.add_argument("--layers", default="-1,-2,-3,-4", type=str)
    parser.add_argument("--max_seq_length", default=128, type=int,
                        help="The maximum total input sequence length after WordPiece tokenization. Sequences longer "
@@ -208,6 +209,10 @@ def main():
                        type=int,
                        default=-1,
                        help = "local_rank for distributed training on gpus")
+    parser.add_argument("--no_cuda",
+                        default=False,
+                        action='store_true',
+                        help="Whether not to use CUDA when available")

    args = parser.parse_args()

@@ -223,7 +228,7 @@ def main():

    layer_indexes = [int(x) for x in args.layers.split(",")]

-    tokenizer = BertTokenizer.from_pretrained(args.bert_model)
+    tokenizer = BertTokenizer.from_pretrained(args.bert_model, do_lower_case=args.do_lower_case)

    examples = read_examples(args.input_file)
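
Note on the `--do_lower_case` flag added above: with `action='store_true'` the option defaults to `False` and becomes `True` only when the flag is present, so uncased models need the flag passed explicitly. A minimal, self-contained sketch of that behaviour (only this one argument is reproduced here):

```python
import argparse

parser = argparse.ArgumentParser()
parser.add_argument("--do_lower_case", default=False, action="store_true",
                    help="Set this flag if you are using an uncased model.")

# Flag omitted: the tokenizer would receive do_lower_case=False.
print(parser.parse_args([]).do_lower_case)                   # False
# Flag present: the tokenizer would receive do_lower_case=True.
print(parser.parse_args(["--do_lower_case"]).do_lower_case)  # True
```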

--- a/examples/run_classifier.py
+++ b/examples/run_classifier.py
@@ -30,9 +30,10 @@ import torch
 from torch.utils.data import TensorDataset, DataLoader, RandomSampler, SequentialSampler
 from torch.utils.data.distributed import DistributedSampler

-from pytorch_pretrained_bert.tokenization import printable_text, convert_to_unicode, BertTokenizer
+from pytorch_pretrained_bert.tokenization import BertTokenizer
 from pytorch_pretrained_bert.modeling import BertForSequenceClassification
 from pytorch_pretrained_bert.optimization import BertAdam
+from pytorch_pretrained_bert.file_utils import PYTORCH_PRETRAINED_BERT_CACHE

 logging.basicConfig(format = '%(asctime)s - %(levelname)s - %(name)s -   %(message)s', 
                    datefmt = '%m/%d/%Y %H:%M:%S',
@@ -122,9 +123,9 @@ class MrpcProcessor(DataProcessor):
            if i == 0:
                continue
            guid = "%s-%s" % (set_type, i)
-            text_a = convert_to_unicode(line[3])
-            text_b = convert_to_unicode(line[4])
-            label = convert_to_unicode(line[0])
+            text_a = line[3]
+            text_b = line[4]
+            label = line[0]
            examples.append(
                InputExample(guid=guid, text_a=text_a, text_b=text_b, label=label))
        return examples
@@ -154,10 +155,10 @@ class MnliProcessor(DataProcessor):
        for (i, line) in enumerate(lines):
            if i == 0:
                continue
-            guid = "%s-%s" % (set_type, convert_to_unicode(line[0]))
-            text_a = convert_to_unicode(line[8])
-            text_b = convert_to_unicode(line[9])
-            label = convert_to_unicode(line[-1])
+            guid = "%s-%s" % (set_type, line[0])
+            text_a = line[8]
+            text_b = line[9]
+            label = line[-1]
            examples.append(
                InputExample(guid=guid, text_a=text_a, text_b=text_b, label=label))
        return examples
@@ -185,8 +186,8 @@ class ColaProcessor(DataProcessor):
        examples = []
        for (i, line) in enumerate(lines):
            guid = "%s-%s" % (set_type, i)
-            text_a = convert_to_unicode(line[3])
-            label = convert_to_unicode(line[1])
+            text_a = line[3]
+            label = line[1]
            examples.append(
                InputExample(guid=guid, text_a=text_a, text_b=None, label=label))
        return examples
@@ -273,7 +274,7 @@ def convert_examples_to_features(examples, label_list, max_seq_length, tokenizer
            logger.info("*** Example ***")
            logger.info("guid: %s" % (example.guid))
            logger.info("tokens: %s" % " ".join(
-                    [printable_text(x) for x in tokens]))
+                    [str(x) for x in tokens]))
            logger.info("input_ids: %s" % " ".join([str(x) for x in input_ids]))
            logger.info("input_mask: %s" % " ".join([str(x) for x in input_mask]))
            logger.info(
@@ -375,6 +376,10 @@ def main():
                        default=False,
                        action='store_true',
                        help="Whether to run eval on the dev set.")
+    parser.add_argument("--do_lower_case",
+                        default=False,
+                        action='store_true',
+                        help="Set this flag if you are using an uncased model.")
    parser.add_argument("--train_batch_size",
                        default=32,
                        type=int,
@@ -396,10 +401,6 @@ def main():
                        type=float,
                        help="Proportion of training to perform linear learning rate warmup for. "
                             "E.g., 0.1 = 10%% of training.")
-    parser.add_argument("--save_checkpoints_steps",
-                        default=1000,
-                        type=int,
-                        help="How often to save the model checkpoint.")
    parser.add_argument("--no_cuda",
                        default=False,
                        action='store_true',
@@ -476,7 +477,7 @@ def main():
    processor = processors[task_name]()
    label_list = processor.get_labels()

-    tokenizer = BertTokenizer.from_pretrained(args.bert_model)
+    tokenizer = BertTokenizer.from_pretrained(args.bert_model, do_lower_case=args.do_lower_case)

    train_examples = None
    num_train_steps = None
@@ -486,7 +487,8 @@ def main():
            len(train_examples) / args.train_batch_size / args.gradient_accumulation_steps * args.num_train_epochs)

    # Prepare model
-    model = BertForSequenceClassification.from_pretrained(args.bert_model, len(label_list))
+    model = BertForSequenceClassification.from_pretrained(args.bert_model, 
+                cache_dir=PYTORCH_PRETRAINED_BERT_CACHE / 'distributed_{}'.format(args.local_rank))
    if args.fp16:
        model.half()
    model.to(device)
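
Note on the `cache_dir` argument above: each process gets its own cache folder derived from its `local_rank`, which avoids concurrent access to the same downloaded weights. A minimal sketch with `pathlib`, assuming the cache root is a `Path` object (the `~/.pytorch_pretrained_bert` location follows the README; the names `cache_root` and `cache_dir_for` are hypothetical):

```python
from pathlib import Path

# Assumed cache root, following the location mentioned in the README.
cache_root = Path.home() / ".pytorch_pretrained_bert"


def cache_dir_for(local_rank):
    """Return a per-process cache directory; local_rank is -1 when not distributed."""
    return cache_root / "distributed_{}".format(local_rank)


for rank in (-1, 0, 1):
    print(cache_dir_for(rank))
```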
@@ -507,13 +509,16 @@ def main():
        param_optimizer = list(model.named_parameters())
    no_decay = ['bias', 'gamma', 'beta']
    optimizer_grouped_parameters = [
-        {'params': [p for n, p in param_optimizer if n not in no_decay], 'weight_decay_rate': 0.01},
-        {'params': [p for n, p in param_optimizer if n in no_decay], 'weight_decay_rate': 0.0}
+        {'params': [p for n, p in param_optimizer if not any(nd in n for nd in no_decay)], 'weight_decay_rate': 0.01},
+        {'params': [p for n, p in param_optimizer if any(nd in n for nd in no_decay)], 'weight_decay_rate': 0.0}
        ]
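
Note on the `no_decay` change above: names returned by `named_parameters()` are dotted paths, so an exact membership test such as `n in no_decay` never matches and every parameter ends up in the weight-decay group. A minimal sketch with hypothetical parameter names (plain strings stand in for the tensors):

```python
no_decay = ['bias', 'gamma', 'beta']

# Hypothetical parameter names, in the dotted style of named_parameters().
names = [
    'bert.encoder.layer.0.attention.self.query.weight',
    'bert.encoder.layer.0.attention.self.query.bias',
    'bert.encoder.layer.0.attention.output.LayerNorm.gamma',
    'classifier.weight',
]

# Old test: exact membership of the full name in the list.
old_no_decay = [n for n in names if n in no_decay]
# New test: substring match against each entry of no_decay.
new_no_decay = [n for n in names if any(nd in n for nd in no_decay)]

print(old_no_decay)  # []
print(new_no_decay)  # the '...query.bias' and '...LayerNorm.gamma' names
```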
+    t_total = num_train_steps
+    if args.local_rank != -1:
+        t_total = t_total // torch.distributed.get_world_size()
    optimizer = BertAdam(optimizer_grouped_parameters,
                         lr=args.learning_rate,
                         warmup=args.warmup_proportion,
-                         t_total=num_train_steps)
+                         t_total=t_total)

    global_step = 0
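
Note on the `t_total` change above: in distributed training each process only sees its share of the batches, so the number of optimizer steps per process is the global step count divided by the world size. A worked sketch with made-up numbers (dataset size, batch size and world size are illustrative assumptions, and wrapping the step count in `int(...)` is also an assumption since that part of the script is outside the hunk):

```python
# Illustrative values only.
num_train_examples = 3668
train_batch_size = 32
gradient_accumulation_steps = 1
num_train_epochs = 3.0
world_size = 4

# Total optimizer steps over all epochs, following the expression in the script context.
num_train_steps = int(
    num_train_examples / train_batch_size / gradient_accumulation_steps * num_train_epochs)


def steps_per_process(num_train_steps, local_rank, world_size):
    """Steps seen by one process; local_rank == -1 means non-distributed."""
    if local_rank != -1:
        return num_train_steps // world_size
    return num_train_steps


print(num_train_steps)                                     # 343
print(steps_per_process(num_train_steps, -1, world_size))  # 343
print(steps_per_process(num_train_steps, 0, world_size))   # 85
```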
    if args.do_train:
@@ -541,7 +546,7 @@ def main():
            for step, batch in enumerate(tqdm(train_dataloader, desc="Iteration")):
                batch = tuple(t.to(device) for t in batch)
                input_ids, input_mask, segment_ids, label_ids = batch
-                loss, _ = model(input_ids, segment_ids, input_mask, label_ids)
+                loss = model(input_ids, segment_ids, input_mask, label_ids)
                if n_gpu > 1:
                    loss = loss.mean() # mean() to average on multi-gpu.
                if args.fp16 and args.loss_scale != 1.0:
@@ -559,7 +564,8 @@ def main():
                        if args.fp16 and args.loss_scale != 1.0:
                            # scale down gradients for fp16 training
                            for param in model.parameters():
-                                param.grad.data = param.grad.data / args.loss_scale
+                                if param.grad is not None:
+                                    param.grad.data = param.grad.data / args.loss_scale
                        is_nan = set_optimizer_params_grad(param_optimizer, model.named_parameters(), test_nan=True)
                        if is_nan:
                            logger.info("FP16 TRAINING: Nan in gradients, reducing loss scaling")
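
Note on the `param.grad is not None` guard above: a parameter that did not take part in the loss has no gradient, and accessing `.data` on `None` would fail. A minimal sketch of the guard using a hypothetical stand-in class instead of real tensors (`FakeParam` and the numbers are made up for illustration):

```python
class FakeParam:
    """Hypothetical stand-in for a parameter; grad is None when it got no gradient."""

    def __init__(self, grad):
        self.grad = grad


loss_scale = 128.0
params = [FakeParam(256.0), FakeParam(None), FakeParam(64.0)]

for param in params:
    # Without this guard, dividing None by loss_scale would raise a TypeError.
    if param.grad is not None:
        param.grad = param.grad / loss_scale

print([p.grad for p in params])  # [2.0, None, 0.5]
```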
@@ -573,7 +579,7 @@ def main():
                    model.zero_grad()
                    global_step += 1

-    if args.do_eval:
+    if args.do_eval and (args.local_rank == -1 or torch.distributed.get_rank() == 0):
        eval_examples = processor.get_dev_examples(args.data_dir)
        eval_features = convert_examples_to_features(
            eval_examples, label_list, args.max_seq_length, tokenizer)
@@ -585,10 +591,8 @@ def main():
        all_segment_ids = torch.tensor([f.segment_ids for f in eval_features], dtype=torch.long)
        all_label_ids = torch.tensor([f.label_id for f in eval_features], dtype=torch.long)
        eval_data = TensorDataset(all_input_ids, all_input_mask, all_segment_ids, all_label_ids)
-        if args.local_rank == -1:
-            eval_sampler = SequentialSampler(eval_data)
-        else:
-            eval_sampler = DistributedSampler(eval_data)
+        # Run prediction for full data
+        eval_sampler = SequentialSampler(eval_data)
        eval_dataloader = DataLoader(eval_data, sampler=eval_sampler, batch_size=args.eval_batch_size)

        model.eval()
--- a/examples/run_squad.py
+++ b/examples/run_squad.py
@@ -25,6 +25,7 @@ import json
 import math
 import os
 import random
+import pickle
 from tqdm import tqdm, trange

 import numpy as np
@@ -32,9 +33,10 @@ import torch
 from torch.utils.data import TensorDataset, DataLoader, RandomSampler, SequentialSampler
 from torch.utils.data.distributed import DistributedSampler

-from pytorch_pretrained_bert.tokenization import printable_text, whitespace_tokenize, BasicTokenizer, BertTokenizer
+from pytorch_pretrained_bert.tokenization import whitespace_tokenize, BasicTokenizer, BertTokenizer
 from pytorch_pretrained_bert.modeling import BertForQuestionAnswering
 from pytorch_pretrained_bert.optimization import BertAdam
+from pytorch_pretrained_bert.file_utils import PYTORCH_PRETRAINED_BERT_CACHE

 logging.basicConfig(format = '%(asctime)s - %(levelname)s - %(name)s -   %(message)s', 
                    datefmt = '%m/%d/%Y %H:%M:%S',
@@ -64,9 +66,9 @@ class SquadExample(object):

    def __repr__(self):
        s = ""
-        s += "qas_id: %s" % (printable_text(self.qas_id))
+        s += "qas_id: %s" % (self.qas_id)
        s += ", question_text: %s" % (
-            printable_text(self.question_text))
+            self.question_text)
        s += ", doc_tokens: [%s]" % (" ".join(self.doc_tokens))
        if self.start_position:
            s += ", start_position: %d" % (self.start_position)
@@ -288,8 +290,7 @@ def convert_examples_to_features(examples, tokenizer, max_seq_length,
                logger.info("unique_id: %s" % (unique_id))
                logger.info("example_index: %s" % (example_index))
                logger.info("doc_span_index: %s" % (doc_span_index))
-                logger.info("tokens: %s" % " ".join(
-                    [printable_text(x) for x in tokens]))
+                logger.info("tokens: %s" % " ".join(tokens))
                logger.info("token_to_orig_map: %s" % " ".join([
                    "%d:%d" % (x, y) for (x, y) in token_to_orig_map.items()]))
                logger.info("token_is_max_context: %s" % " ".join([
@@ -305,7 +306,7 @@ def convert_examples_to_features(examples, tokenizer, max_seq_length,
                    logger.info("start_position: %d" % (start_position))
                    logger.info("end_position: %d" % (end_position))
                    logger.info(
-                        "answer: %s" % (printable_text(answer_text)))
+                        "answer: %s" % (answer_text))

            features.append(
                InputFeatures(
@@ -729,10 +730,6 @@ def main():
    parser.add_argument("--warmup_proportion", default=0.1, type=float,
                        help="Proportion of training to perform linear learning rate warmup for. E.g., 0.1 = 10% "
                             "of training.")
-    parser.add_argument("--save_checkpoints_steps", default=1000, type=int,
-                        help="How often to save the model checkpoint.")
-    parser.add_argument("--iterations_per_loop", default=1000, type=int,
-                        help="How many steps to make in each estimator call.")
    parser.add_argument("--n_best_size", default=20, type=int,
                        help="The total number of n-best predictions to generate in the nbest_predictions.json "
                             "output file.")
@@ -754,6 +751,10 @@ def main():
                        type=int,
                        default=1,
                        help="Number of updates steps to accumulate before performing a backward/update pass.")
+    parser.add_argument("--do_lower_case",
+                        default=True,
+                        action='store_true',
+                        help="Whether to lower case the input text. True for uncased models, False for cased models.")
    parser.add_argument("--local_rank",
                        type=int,
                        default=-1,
@@ -825,7 +826,8 @@ def main():
            len(train_examples) / args.train_batch_size / args.gradient_accumulation_steps * args.num_train_epochs)

    # Prepare model
-    model = BertForQuestionAnswering.from_pretrained(args.bert_model)
+    model = BertForQuestionAnswering.from_pretrained(args.bert_model,
+                cache_dir=PYTORCH_PRETRAINED_BERT_CACHE / 'distributed_{}'.format(args.local_rank))
    if args.fp16:
        model.half()
    model.to(device)
@@ -846,23 +848,37 @@ def main():
        param_optimizer = list(model.named_parameters())
    no_decay = ['bias', 'gamma', 'beta']
    optimizer_grouped_parameters = [
-        {'params': [p for n, p in param_optimizer if n not in no_decay], 'weight_decay_rate': 0.01},
-        {'params': [p for n, p in param_optimizer if n in no_decay], 'weight_decay_rate': 0.0}
+        {'params': [p for n, p in param_optimizer if not any(nd in n for nd in no_decay)], 'weight_decay_rate': 0.01},
+        {'params': [p for n, p in param_optimizer if any(nd in n for nd in no_decay)], 'weight_decay_rate': 0.0}
        ]
+    t_total = num_train_steps
+    if args.local_rank != -1:
+        t_total = t_total // torch.distributed.get_world_size()
    optimizer = BertAdam(optimizer_grouped_parameters,
                         lr=args.learning_rate,
                         warmup=args.warmup_proportion,
-                         t_total=num_train_steps)
+                         t_total=t_total)

    global_step = 0
    if args.do_train:
-        train_features = convert_examples_to_features(
-            examples=train_examples,
-            tokenizer=tokenizer,
-            max_seq_length=args.max_seq_length,
-            doc_stride=args.doc_stride,
-            max_query_length=args.max_query_length,
-            is_training=True)
+        cached_train_features_file = args.train_file+'_{0}_{1}_{2}_{3}'.format(
+            args.bert_model, str(args.max_seq_length), str(args.doc_stride), str(args.max_query_length))
+        train_features = None
+        try:
+            with open(cached_train_features_file, "rb") as reader:
+                train_features = pickle.load(reader)
+        except:
+            train_features = convert_examples_to_features(
+                examples=train_examples,
+                tokenizer=tokenizer,
+                max_seq_length=args.max_seq_length,
+                doc_stride=args.doc_stride,
+                max_query_length=args.max_query_length,
+                is_training=True)
+            if args.local_rank == -1 or torch.distributed.get_rank() == 0:
+                logger.info("  Saving train features into cached file %s", cached_train_features_file)
+                with open(cached_train_features_file, "wb") as writer:
+                    pickle.dump(train_features, writer)
        logger.info("***** Running training *****")
        logger.info("  Num orig examples = %d", len(train_examples))
        logger.info("  Num split examples = %d", len(train_features))
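The parameter-grouping change in this hunk replaces an exact-membership test (`n in no_decay`) with a substring test (`any(nd in n for nd in no_decay)`). The sketch below illustrates the difference under stated assumptions: the parameter names are hypothetical stand-ins for what `model.named_parameters()` would yield, and plain strings are used in place of tensors.

```python
# Hypothetical parameter names, standing in for model.named_parameters().
param_names = [
    "bert.encoder.layer.0.attention.self.query.weight",
    "bert.encoder.layer.0.attention.self.query.bias",
    "bert.encoder.layer.0.attention.output.LayerNorm.gamma",
    "bert.encoder.layer.0.attention.output.LayerNorm.beta",
]
no_decay = ["bias", "gamma", "beta"]

# Old test: a full dotted name never equals a bare marker, so nothing is excluded
# from weight decay.
old_no_decay = [n for n in param_names if n in no_decay]

# New test: a name is excluded when any marker occurs inside it.
new_no_decay = [n for n in param_names if any(nd in n for nd in no_decay)]
new_decay = [n for n in param_names if not any(nd in n for nd in no_decay)]

print(old_no_decay)
print(new_no_decay)
print(new_decay)
```

Substring matching is deliberately loose: it would also catch any parameter whose name merely contains one of the markers, so the marker list should be checked against the actual parameter names of the model in use.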
@@ -902,7 +918,8 @@ def main():
                        if args.fp16 and args.loss_scale != 1.0:
                            # scale down gradients for fp16 training
                            for param in model.parameters():
-                                param.grad.data = param.grad.data / args.loss_scale
+                                if param.grad is not None:
+                                    param.grad.data = param.grad.data / args.loss_scale
                        is_nan = set_optimizer_params_grad(param_optimizer, model.named_parameters(), test_nan=True)
                        if is_nan:
                            logger.info("FP16 TRAINING: Nan in gradients, reducing loss scaling")
@@ -916,7 +933,7 @@ def main():
                    model.zero_grad()
                    global_step += 1

-    if args.do_predict:
+    if args.do_predict and (args.local_rank == -1 or torch.distributed.get_rank() == 0):
        eval_examples = read_squad_examples(
            input_file=args.predict_file, is_training=False)
        eval_features = convert_examples_to_features(
@@ -937,10 +954,8 @@ def main():
        all_segment_ids = torch.tensor([f.segment_ids for f in eval_features], dtype=torch.long)
        all_example_index = torch.arange(all_input_ids.size(0), dtype=torch.long)
        eval_data = TensorDataset(all_input_ids, all_input_mask, all_segment_ids, all_example_index)
-        if args.local_rank == -1:
-            eval_sampler = SequentialSampler(eval_data)
-        else:
-            eval_sampler = DistributedSampler(eval_data)
+        # Run prediction for full data
+        eval_sampler = SequentialSampler(eval_data)
        eval_dataloader = DataLoader(eval_data, sampler=eval_sampler, batch_size=args.predict_batch_size)

        model.eval()
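The training-feature cache added to `run_squad.py` in this diff follows a load-or-build pattern: try to unpickle a cached file, and fall back to computing and saving the features. A minimal standalone sketch of that pattern is given below. The helper name `load_or_build_features` and the `build_features` stand-in are hypothetical, the sketch catches specific exceptions rather than using a bare `except:`, and it omits the rank check that the script uses to let only one process write the file.

```python
import os
import pickle
import tempfile


def load_or_build_features(cache_file, build_fn):
    """Return cached features if the cache file is readable, else build and save them."""
    try:
        with open(cache_file, "rb") as reader:
            return pickle.load(reader)
    except (OSError, pickle.UnpicklingError, EOFError):
        features = build_fn()
        with open(cache_file, "wb") as writer:
            pickle.dump(features, writer)
        return features


calls = []


def build_features():
    # Stand-in for convert_examples_to_features(...); records each invocation.
    calls.append(1)
    return [{"input_ids": [1, 2, 3]}]


cache_path = os.path.join(tempfile.mkdtemp(), "train_features.pkl")
first = load_or_build_features(cache_path, build_features)   # builds and writes the cache
second = load_or_build_features(cache_path, build_features)  # reads the cache
```

Because the cache file name in the script is derived from the model name and the tokenization settings, changing any of those yields a different file name and therefore a fresh build.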
--- a/notebooks/Comparing-TF-and-PT-models-MLM-NSP.ipynb
+++ b/notebooks/Comparing-TF-and-PT-models-MLM-NSP.ipynb
@@ -133,7 +133,7 @@
    "    unique_id = 0\n",
    "    with tf.gfile.GFile(input_file, \"r\") as reader:\n",
    "        while True:\n",
-    "            line = reader.readline()#tokenization.convert_to_unicode(reader.readline())\n",
+    "            line = reader.readline()\n",
    "            if not line:\n",
    "                break\n",
    "            line = line.strip()\n",
--- a/pytorch_pretrained_bert/__init__.py
+++ b/pytorch_pretrained_bert/__init__.py
@@ -1,5 +1,7 @@
 from .tokenization import BertTokenizer, BasicTokenizer, WordpieceTokenizer
 from .modeling import (BertConfig, BertModel, BertForPreTraining,
                       BertForMaskedLM, BertForNextSentencePrediction,
-                       BertForSequenceClassification, BertForQuestionAnswering)
+                       BertForSequenceClassification, BertForTokenClassification,
+                       BertForQuestionAnswering)
 from .optimization import BertAdam
+from .file_utils import PYTORCH_PRETRAINED_BERT_CACHE
--- a/pytorch_pretrained_bert/modeling.py
+++ b/pytorch_pretrained_bert/modeling.py
@@ -42,7 +42,9 @@ PRETRAINED_MODEL_ARCHIVE_MAP = {
    'bert-base-uncased': "https://s3.amazonaws.com/models.huggingface.co/bert/bert-base-uncased.tar.gz",
    'bert-large-uncased': "https://s3.amazonaws.com/models.huggingface.co/bert/bert-large-uncased.tar.gz",
    'bert-base-cased': "https://s3.amazonaws.com/models.huggingface.co/bert/bert-base-cased.tar.gz",
-    'bert-base-multilingual': "https://s3.amazonaws.com/models.huggingface.co/bert/bert-base-multilingual.tar.gz",
+    'bert-large-cased': "https://s3.amazonaws.com/models.huggingface.co/bert/bert-large-cased.tar.gz",
+    'bert-base-multilingual-uncased': "https://s3.amazonaws.com/models.huggingface.co/bert/bert-base-multilingual-uncased.tar.gz",
+    'bert-base-multilingual-cased': "https://s3.amazonaws.com/models.huggingface.co/bert/bert-base-multilingual-cased.tar.gz",
    'bert-base-chinese': "https://s3.amazonaws.com/models.huggingface.co/bert/bert-base-chinese.tar.gz",
 }
 CONFIG_NAME = 'bert_config.json'
@@ -443,7 +445,7 @@ class PreTrainedBertModel(nn.Module):
            module.bias.data.zero_()

    @classmethod
-    def from_pretrained(cls, pretrained_model_name, *inputs, **kwargs):
+    def from_pretrained(cls, pretrained_model_name, cache_dir=None, *inputs, **kwargs):
        """
        Instantiate a PreTrainedBertModel from a pre-trained model file.
        Download and cache the pre-trained model file if needed.
@@ -468,7 +470,7 @@ class PreTrainedBertModel(nn.Module):
            archive_file = pretrained_model_name
        # redirect to the cache, if necessary
        try:
-            resolved_archive_file = cached_path(archive_file)
+            resolved_archive_file = cached_path(archive_file, cache_dir=cache_dir)
        except FileNotFoundError:
            logger.error(
                "Model name '{}' was not found in model name list ({}). "
@@ -476,7 +478,7 @@ class PreTrainedBertModel(nn.Module):
                "associated to this path or url.".format(
                    pretrained_model_name,
                    ', '.join(PRETRAINED_MODEL_ARCHIVE_MAP.keys()),
-                    pretrained_model_name))
+                    archive_file))
            return None
        if resolved_archive_file == archive_file:
            logger.info("loading archive file {}".format(archive_file))
@@ -557,7 +559,7 @@ class BertModel(PreTrainedBertModel):
                of each attention block (i.e. 12 full sequences for BERT-base, 24 for BERT-large), each
                encoded-hidden-state is a torch.FloatTensor of size [batch_size, sequence_length, hidden_size],
            - `output_all_encoded_layers=False`: outputs only the full sequence of hidden-states corresponding
-                to the last attention block,
+                to the last attention block of shape [batch_size, sequence_length, hidden_size],
        `pooled_output`: a torch.FloatTensor of size [batch_size, hidden_size] which is the output of a
            classifier pretrained on top of the hidden state associated to the first character of the
            input (`CLS`) to train on the Next-Sentence task (see BERT's paper).
@@ -567,10 +569,10 @@ class BertModel(PreTrainedBertModel):
    # Already been converted into WordPiece token ids
    input_ids = torch.LongTensor([[31, 51, 99], [15, 5, 0]])
    input_mask = torch.LongTensor([[1, 1, 1], [1, 1, 0]])
-    token_type_ids = torch.LongTensor([[0, 0, 1], [0, 2, 0]])
+    token_type_ids = torch.LongTensor([[0, 0, 1], [0, 1, 0]])

-    config = modeling.BertConfig(vocab_size=32000, hidden_size=512,
-        num_hidden_layers=8, num_attention_heads=6, intermediate_size=1024)
+    config = modeling.BertConfig(vocab_size_or_config_json_file=32000, hidden_size=768,
+        num_hidden_layers=12, num_attention_heads=12, intermediate_size=3072)

    model = modeling.BertModel(config=config)
    all_encoder_layers, pooled_output = model(input_ids, token_type_ids, input_mask)
@@ -648,18 +650,18 @@ class BertForPreTraining(PreTrainedBertModel):
            sentence classification loss.
        if `masked_lm_labels` or `next_sentence_label` is `None`:
            Outputs a tuple comprising
-            - the masked language modeling logits, and
-            - the next sentence classification logits.
+            - the masked language modeling logits of shape [batch_size, sequence_length, vocab_size], and
+            - the next sentence classification logits of shape [batch_size, 2].

    Example usage:
    ```python
    # Already been converted into WordPiece token ids
    input_ids = torch.LongTensor([[31, 51, 99], [15, 5, 0]])
    input_mask = torch.LongTensor([[1, 1, 1], [1, 1, 0]])
-    token_type_ids = torch.LongTensor([[0, 0, 1], [0, 2, 0]])
+    token_type_ids = torch.LongTensor([[0, 0, 1], [0, 1, 0]])

-    config = BertConfig(vocab_size=32000, hidden_size=512,
-        num_hidden_layers=8, num_attention_heads=6, intermediate_size=1024)
+    config = BertConfig(vocab_size_or_config_json_file=32000, hidden_size=768,
+        num_hidden_layers=12, num_attention_heads=12, intermediate_size=3072)

    model = BertForPreTraining(config)
    masked_lm_logits_scores, seq_relationship_logits = model(input_ids, token_type_ids, input_mask)
@@ -678,8 +680,8 @@ class BertForPreTraining(PreTrainedBertModel):

        if masked_lm_labels is not None and next_sentence_label is not None:
            loss_fct = CrossEntropyLoss(ignore_index=-1)
-            masked_lm_loss = loss_fct(prediction_scores, masked_lm_labels)
-            next_sentence_loss = loss_fct(seq_relationship_score, next_sentence_label)
+            masked_lm_loss = loss_fct(prediction_scores.view(-1, self.config.vocab_size), masked_lm_labels.view(-1))
+            next_sentence_loss = loss_fct(seq_relationship_score.view(-1, 2), next_sentence_label.view(-1))
            total_loss = masked_lm_loss + next_sentence_loss
            return total_loss
        else:
@@ -712,17 +714,17 @@ class BertForMaskedLM(PreTrainedBertModel):
        if `masked_lm_labels` is not `None`:
            Outputs the masked language modeling loss.
        if `masked_lm_labels` is `None`:
-            Outputs the masked language modeling logits.
+            Outputs the masked language modeling logits of shape [batch_size, sequence_length, vocab_size].

    Example usage:
    ```python
    # Already been converted into WordPiece token ids
    input_ids = torch.LongTensor([[31, 51, 99], [15, 5, 0]])
    input_mask = torch.LongTensor([[1, 1, 1], [1, 1, 0]])
-    token_type_ids = torch.LongTensor([[0, 0, 1], [0, 2, 0]])
+    token_type_ids = torch.LongTensor([[0, 0, 1], [0, 1, 0]])

-    config = BertConfig(vocab_size=32000, hidden_size=512,
-        num_hidden_layers=8, num_attention_heads=6, intermediate_size=1024)
+    config = BertConfig(vocab_size_or_config_json_file=32000, hidden_size=768,
+        num_hidden_layers=12, num_attention_heads=12, intermediate_size=3072)

    model = BertForMaskedLM(config)
    masked_lm_logits_scores = model(input_ids, token_type_ids, input_mask)
@@ -741,7 +743,7 @@ class BertForMaskedLM(PreTrainedBertModel):

        if masked_lm_labels is not None:
            loss_fct = CrossEntropyLoss(ignore_index=-1)
-            masked_lm_loss = loss_fct(prediction_scores, masked_lm_labels)
+            masked_lm_loss = loss_fct(prediction_scores.view(-1, self.config.vocab_size), masked_lm_labels.view(-1))
            return masked_lm_loss
        else:
            return prediction_scores
@@ -774,17 +776,17 @@ class BertForNextSentencePrediction(PreTrainedBertModel):
            Outputs the next sentence classification loss.
        if `next_sentence_label` is `None`:
-            Outputs the next sentence classification logits.
+            Outputs the next sentence classification logits of shape [batch_size, 2].

    Example usage:
    ```python
    # Already been converted into WordPiece token ids
    input_ids = torch.LongTensor([[31, 51, 99], [15, 5, 0]])
    input_mask = torch.LongTensor([[1, 1, 1], [1, 1, 0]])
-    token_type_ids = torch.LongTensor([[0, 0, 1], [0, 2, 0]])
+    token_type_ids = torch.LongTensor([[0, 0, 1], [0, 1, 0]])

-    config = BertConfig(vocab_size=32000, hidden_size=512,
-        num_hidden_layers=8, num_attention_heads=6, intermediate_size=1024)
+    config = BertConfig(vocab_size_or_config_json_file=32000, hidden_size=768,
+        num_hidden_layers=12, num_attention_heads=12, intermediate_size=3072)

    model = BertForNextSentencePrediction(config)
    seq_relationship_logits = model(input_ids, token_type_ids, input_mask)
@@ -803,7 +805,7 @@ class BertForNextSentencePrediction(PreTrainedBertModel):

        if next_sentence_label is not None:
            loss_fct = CrossEntropyLoss(ignore_index=-1)
-            next_sentence_loss = loss_fct(seq_relationship_score, next_sentence_label)
+            next_sentence_loss = loss_fct(seq_relationship_score.view(-1, 2), next_sentence_label.view(-1))
            return next_sentence_loss
        else:
            return seq_relationship_score
@@ -836,17 +838,17 @@ class BertForSequenceClassification(PreTrainedBertModel):
        if `labels` is not `None`:
            Outputs the CrossEntropy classification loss of the output with the labels.
        if `labels` is `None`:
-            Outputs the classification logits.
+            Outputs the classification logits of shape [batch_size, num_labels].

    Example usage:
    ```python
    # Already been converted into WordPiece token ids
    input_ids = torch.LongTensor([[31, 51, 99], [15, 5, 0]])
    input_mask = torch.LongTensor([[1, 1, 1], [1, 1, 0]])
-    token_type_ids = torch.LongTensor([[0, 0, 1], [0, 2, 0]])
+    token_type_ids = torch.LongTensor([[0, 0, 1], [0, 1, 0]])

-    config = BertConfig(vocab_size=32000, hidden_size=512,
-        num_hidden_layers=8, num_attention_heads=6, intermediate_size=1024)
+    config = BertConfig(vocab_size_or_config_json_file=32000, hidden_size=768,
+        num_hidden_layers=12, num_attention_heads=12, intermediate_size=3072)

    num_labels = 2

@@ -856,6 +858,7 @@ class BertForSequenceClassification(PreTrainedBertModel):
    """
    def __init__(self, config, num_labels=2):
        super(BertForSequenceClassification, self).__init__(config)
+        self.num_labels = num_labels
        self.bert = BertModel(config)
        self.dropout = nn.Dropout(config.hidden_dropout_prob)
        self.classifier = nn.Linear(config.hidden_size, num_labels)
@@ -868,8 +871,74 @@ class BertForSequenceClassification(PreTrainedBertModel):

        if labels is not None:
            loss_fct = CrossEntropyLoss()
-            loss = loss_fct(logits, labels)
-            return loss, logits
+            loss = loss_fct(logits.view(-1, self.num_labels), labels.view(-1))
+            return loss
+        else:
+            return logits
+
+
+class BertForTokenClassification(PreTrainedBertModel):
+    """BERT model for token-level classification.
+    This module is composed of the BERT model with a linear layer on top of
+    the full hidden state of the last layer.
+
+    Params:
+        `config`: a BertConfig class instance with the configuration to build a new model.
+        `num_labels`: the number of classes for the classifier. Default = 2.
+
+    Inputs:
+        `input_ids`: a torch.LongTensor of shape [batch_size, sequence_length]
+            with the word token indices in the vocabulary (see the tokens preprocessing logic in the scripts
+            `extract_features.py`, `run_classifier.py` and `run_squad.py`)
+        `token_type_ids`: an optional torch.LongTensor of shape [batch_size, sequence_length] with the token
+            types indices selected in [0, 1]. Type 0 corresponds to a `sentence A` and type 1 corresponds to
+            a `sentence B` token (see BERT paper for more details).
+        `attention_mask`: an optional torch.LongTensor of shape [batch_size, sequence_length] with indices
+            selected in [0, 1]. It's a mask to be used if the input sequence length is smaller than the max
+            input sequence length in the current batch. It's the mask that we typically use for attention when
+            a batch has varying length sentences.
+        `labels`: labels for the classification output: torch.LongTensor of shape [batch_size, sequence_length]
+            with indices selected in [0, ..., num_labels - 1].
+
+    Outputs:
+        if `labels` is not `None`:
+            Outputs the CrossEntropy classification loss of the output with the labels.
+        if `labels` is `None`:
+            Outputs the classification logits of shape [batch_size, sequence_length, num_labels].
+
+    Example usage:
+    ```python
+    # Already been converted into WordPiece token ids
+    input_ids = torch.LongTensor([[31, 51, 99], [15, 5, 0]])
+    input_mask = torch.LongTensor([[1, 1, 1], [1, 1, 0]])
+    token_type_ids = torch.LongTensor([[0, 0, 1], [0, 1, 0]])
+
+    config = BertConfig(vocab_size_or_config_json_file=32000, hidden_size=768,
+        num_hidden_layers=12, num_attention_heads=12, intermediate_size=3072)
+
+    num_labels = 2
+
+    model = BertForTokenClassification(config, num_labels)
+    logits = model(input_ids, token_type_ids, input_mask)
+    ```
+    """
+    def __init__(self, config, num_labels=2):
+        super(BertForTokenClassification, self).__init__(config)
+        self.num_labels = num_labels
+        self.bert = BertModel(config)
+        self.dropout = nn.Dropout(config.hidden_dropout_prob)
+        self.classifier = nn.Linear(config.hidden_size, num_labels)
+        self.apply(self.init_bert_weights)
+
+    def forward(self, input_ids, token_type_ids=None, attention_mask=None, labels=None):
+        sequence_output, _ = self.bert(input_ids, token_type_ids, attention_mask, output_all_encoded_layers=False)
+        sequence_output = self.dropout(sequence_output)
+        logits = self.classifier(sequence_output)
+
+        if labels is not None:
+            loss_fct = CrossEntropyLoss()
+            loss = loss_fct(logits.view(-1, self.num_labels), labels.view(-1))
+            return loss
        else:
            return logits

@@ -913,17 +982,17 @@ class BertForQuestionAnswering(PreTrainedBertModel):
            Outputs the total_loss which is the sum of the CrossEntropy loss for the start and end token positions.
        if `start_positions` or `end_positions` is `None`:
            Outputs a tuple of start_logits, end_logits which are the logits respectively for the start and end
-            position tokens.
+            position tokens of shape [batch_size, sequence_length].

    Example usage:
    ```python
    # Already been converted into WordPiece token ids
    input_ids = torch.LongTensor([[31, 51, 99], [15, 5, 0]])
    input_mask = torch.LongTensor([[1, 1, 1], [1, 1, 0]])
-    token_type_ids = torch.LongTensor([[0, 0, 1], [0, 2, 0]])
+    token_type_ids = torch.LongTensor([[0, 0, 1], [0, 1, 0]])

-    config = BertConfig(vocab_size=32000, hidden_size=512,
-        num_hidden_layers=8, num_attention_heads=6, intermediate_size=1024)
+    config = BertConfig(vocab_size_or_config_json_file=32000, hidden_size=768,
+        num_hidden_layers=12, num_attention_heads=12, intermediate_size=3072)

    model = BertForQuestionAnswering(config)
    start_logits, end_logits = model(input_ids, token_type_ids, input_mask)
--- a/pytorch_pretrained_bert/tokenization.py
+++ b/pytorch_pretrained_bert/tokenization.py
@@ -34,40 +34,21 @@ PRETRAINED_VOCAB_ARCHIVE_MAP = {
    'bert-base-uncased': "https://s3.amazonaws.com/models.huggingface.co/bert/bert-base-uncased-vocab.txt",
    'bert-large-uncased': "https://s3.amazonaws.com/models.huggingface.co/bert/bert-large-uncased-vocab.txt",
    'bert-base-cased': "https://s3.amazonaws.com/models.huggingface.co/bert/bert-base-cased-vocab.txt",
-    'bert-base-multilingual': "https://s3.amazonaws.com/models.huggingface.co/bert/bert-base-multilingual-vocab.txt",
+    'bert-large-cased': "https://s3.amazonaws.com/models.huggingface.co/bert/bert-large-cased-vocab.txt",
+    'bert-base-multilingual-uncased': "https://s3.amazonaws.com/models.huggingface.co/bert/bert-base-multilingual-uncased-vocab.txt",
+    'bert-base-multilingual-cased': "https://s3.amazonaws.com/models.huggingface.co/bert/bert-base-multilingual-cased-vocab.txt",
    'bert-base-chinese': "https://s3.amazonaws.com/models.huggingface.co/bert/bert-base-chinese-vocab.txt",
 }
-
-def convert_to_unicode(text):
-    """Converts `text` to Unicode (if it's not already), assuming utf-8 input."""
-    if isinstance(text, str):
-        return text
-    elif isinstance(text, bytes):
-        return text.decode("utf-8", "ignore")
-    else:
-        raise ValueError("Unsupported string type: %s" % (type(text)))
-
-
-def printable_text(text):
-    """Returns text encoded in a way suitable for print or `tf.logging`."""
-
-    # These functions want `str` for both Python2 and Python3, but in one case
-    # it's a Unicode string and in the other it's a byte string.
-    if isinstance(text, str):
-        return text
-    elif isinstance(text, bytes):
-        return text.decode("utf-8", "ignore")
-    else:
-        raise ValueError("Unsupported string type: %s" % (type(text)))
+VOCAB_NAME = 'vocab.txt'


 def load_vocab(vocab_file):
    """Loads a vocabulary file into a dictionary."""
    vocab = collections.OrderedDict()
    index = 0
-    with open(vocab_file, "r") as reader:
+    with open(vocab_file, "r", encoding="utf-8") as reader:
        while True:
-            token = convert_to_unicode(reader.readline())
+            token = reader.readline()
            if not token:
                break
            token = token.strip()
@@ -120,7 +101,7 @@ class BertTokenizer(object):
        return tokens

    @classmethod
-    def from_pretrained(cls, pretrained_model_name, do_lower_case=True):
+    def from_pretrained(cls, pretrained_model_name, cache_dir=None, *inputs, **kwargs):
        """
        Instantiate a BertTokenizer from a pre-trained vocabulary file.
        Download and cache the pre-trained vocabulary file if needed.
@@ -129,16 +110,11 @@ class BertTokenizer(object):
            vocab_file = PRETRAINED_VOCAB_ARCHIVE_MAP[pretrained_model_name]
        else:
            vocab_file = pretrained_model_name
+        if os.path.isdir(vocab_file):
+            vocab_file = os.path.join(vocab_file, VOCAB_NAME)
        # redirect to the cache, if necessary
        try:
-            resolved_vocab_file = cached_path(vocab_file)
-            if resolved_vocab_file == vocab_file:
-                logger.info("loading vocabulary file {}".format(vocab_file))
-            else:
-                logger.info("loading vocabulary file {} from cache at {}".format(
-                    vocab_file, resolved_vocab_file))
-            # Instantiate tokenizer.
-            tokenizer = cls(resolved_vocab_file, do_lower_case)
+            resolved_vocab_file = cached_path(vocab_file, cache_dir=cache_dir)
        except FileNotFoundError:
            logger.error(
                "Model name '{}' was not found in model name list ({}). "
@@ -146,8 +122,15 @@ class BertTokenizer(object):
                "associated to this path or url.".format(
                    pretrained_model_name,
                    ', '.join(PRETRAINED_VOCAB_ARCHIVE_MAP.keys()),
-                    pretrained_model_name))
-            tokenizer = None
+                    vocab_file))
+            return None
+        if resolved_vocab_file == vocab_file:
+            logger.info("loading vocabulary file {}".format(vocab_file))
+        else:
+            logger.info("loading vocabulary file {} from cache at {}".format(
+                vocab_file, resolved_vocab_file))
+        # Instantiate tokenizer.
+        tokenizer = cls(resolved_vocab_file, *inputs, **kwargs)
        return tokenizer


@@ -164,7 +147,6 @@ class BasicTokenizer(object):

    def tokenize(self, text):
        """Tokenizes a piece of text."""
-        text = convert_to_unicode(text)
        text = self._clean_text(text)
        # This was added on November 1st, 2018 for the multilingual and Chinese
        # models. This is also applied to the English models now, but it doesn't
@@ -290,8 +272,6 @@ class WordpieceTokenizer(object):
          A list of wordpiece tokens.
        """

-        text = convert_to_unicode(text)
-
        output_tokens = []
        for token in whitespace_tokenize(text):
            chars = list(token)
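The `load_vocab` change in this file drops `convert_to_unicode` and instead opens the vocabulary with an explicit `encoding="utf-8"`, so decoding no longer depends on the platform's default encoding. A minimal sketch of that approach follows; it is a simplified rewrite rather than the library's exact function, and the vocabulary contents and temporary path are made up for illustration.

```python
import collections
import os
import tempfile


def load_vocab(vocab_file):
    """Load a one-token-per-line vocabulary file into an ordered token -> index dict."""
    vocab = collections.OrderedDict()
    # An explicit encoding makes the result independent of the locale settings.
    with open(vocab_file, "r", encoding="utf-8") as reader:
        for index, line in enumerate(reader):
            vocab[line.strip()] = index
    return vocab


# Write a tiny hypothetical vocabulary containing non-ASCII tokens.
vocab_path = os.path.join(tempfile.mkdtemp(), "vocab.txt")
with open(vocab_path, "w", encoding="utf-8") as writer:
    writer.write("[PAD]\n[CLS]\ncafé\n你好\n")

vocab = load_vocab(vocab_path)
print(list(vocab.items()))
```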
--- a/setup.py
+++ b/setup.py
@@ -2,7 +2,7 @@ from setuptools import find_packages, setup

 setup(
    name="pytorch_pretrained_bert",
-    version="0.1.2",
+    version="0.3.0",
    author="Thomas Wolf, Victor Sanh, Tim Rault, Google AI Language Team Authors",
    author_email="thomas@huggingface.co",
    description="PyTorch version of Google AI BERT model with script to load Google pre-trained models",
--- a/tests/modeling_test.py
+++ b/tests/modeling_test.py
@@ -22,7 +22,10 @@ import random

 import torch

-from pytorch_pretrained_bert import BertConfig, BertModel
+from pytorch_pretrained_bert import (BertConfig, BertModel, BertForMaskedLM,
+                                     BertForNextSentencePrediction, BertForPreTraining,
+                                     BertForQuestionAnswering, BertForSequenceClassification,
+                                     BertForTokenClassification)


 class BertModelTest(unittest.TestCase):
@@ -35,6 +38,7 @@ class BertModelTest(unittest.TestCase):
                     is_training=True,
                     use_input_mask=True,
                     use_token_type_ids=True,
+                     use_labels=True,
                     vocab_size=99,
                     hidden_size=32,
                     num_hidden_layers=5,
@@ -45,7 +49,9 @@ class BertModelTest(unittest.TestCase):
                     attention_probs_dropout_prob=0.1,
                     max_position_embeddings=512,
                     type_vocab_size=16,
+                     type_sequence_label_size=2,
                     initializer_range=0.02,
+                     num_labels=3,
                     scope=None):
            self.parent = parent
            self.batch_size = batch_size
@@ -53,6 +59,7 @@ class BertModelTest(unittest.TestCase):
            self.is_training = is_training
            self.use_input_mask = use_input_mask
            self.use_token_type_ids = use_token_type_ids
+            self.use_labels = use_labels
            self.vocab_size = vocab_size
            self.hidden_size = hidden_size
            self.num_hidden_layers = num_hidden_layers
@@ -63,10 +70,12 @@ class BertModelTest(unittest.TestCase):
            self.attention_probs_dropout_prob = attention_probs_dropout_prob
            self.max_position_embeddings = max_position_embeddings
            self.type_vocab_size = type_vocab_size
+            self.type_sequence_label_size = type_sequence_label_size
            self.initializer_range = initializer_range
+            self.num_labels = num_labels
            self.scope = scope

-        def create_model(self):
+        def prepare_config_and_inputs(self):
            input_ids = BertModelTest.ids_tensor([self.batch_size, self.seq_length], self.vocab_size)

            input_mask = None
@@ -77,6 +86,12 @@ class BertModelTest(unittest.TestCase):
            if self.use_token_type_ids:
                token_type_ids = BertModelTest.ids_tensor([self.batch_size, self.seq_length], self.type_vocab_size)

+            sequence_labels = None
+            token_labels = None
+            if self.use_labels:
+                sequence_labels = BertModelTest.ids_tensor([self.batch_size], self.type_sequence_label_size)
+                token_labels = BertModelTest.ids_tensor([self.batch_size, self.seq_length], self.num_labels)
+
            config = BertConfig(
                vocab_size_or_config_json_file=self.vocab_size,
                hidden_size=self.hidden_size,
@@ -90,10 +105,16 @@ class BertModelTest(unittest.TestCase):
                type_vocab_size=self.type_vocab_size,
                initializer_range=self.initializer_range)

+            return config, input_ids, token_type_ids, input_mask, sequence_labels, token_labels
+
+        def check_loss_output(self, result):
+            self.parent.assertListEqual(
+                list(result["loss"].size()),
+                [])
+
+        def create_bert_model(self, config, input_ids, token_type_ids, input_mask, sequence_labels, token_labels):
            model = BertModel(config=config)
-
            all_encoder_layers, pooled_output = model(input_ids, token_type_ids, input_mask)
-
            outputs = {
                "sequence_output": all_encoder_layers[-1],
                "pooled_output": pooled_output,
@@ -101,13 +122,119 @@ class BertModelTest(unittest.TestCase):
            }
            return outputs

-        def check_output(self, result):
+        def check_bert_model_output(self, result):
+            self.parent.assertListEqual(
+                [size for layer in result["all_encoder_layers"] for size in layer.size()],
+                [self.batch_size, self.seq_length, self.hidden_size] * self.num_hidden_layers)
            self.parent.assertListEqual(
                list(result["sequence_output"].size()),
                [self.batch_size, self.seq_length, self.hidden_size])
-
            self.parent.assertListEqual(list(result["pooled_output"].size()), [self.batch_size, self.hidden_size])

+
+        def create_bert_for_masked_lm(self, config, input_ids, token_type_ids, input_mask, sequence_labels, token_labels):
+            model = BertForMaskedLM(config=config)
+            loss = model(input_ids, token_type_ids, input_mask, token_labels)
+            prediction_scores = model(input_ids, token_type_ids, input_mask)
+            outputs = {
+                "loss": loss,
+                "prediction_scores": prediction_scores,
+            }
+            return outputs
+
+        def check_bert_for_masked_lm_output(self, result):
+            self.parent.assertListEqual(
+                list(result["prediction_scores"].size()),
+                [self.batch_size, self.seq_length, self.vocab_size])
+
+        def create_bert_for_next_sequence_prediction(self, config, input_ids, token_type_ids, input_mask, sequence_labels, token_labels):
+            model = BertForNextSentencePrediction(config=config)
+            loss = model(input_ids, token_type_ids, input_mask, sequence_labels)
+            seq_relationship_score = model(input_ids, token_type_ids, input_mask)
+            outputs = {
+                "loss": loss,
+                "seq_relationship_score": seq_relationship_score,
+            }
+            return outputs
+
+        def check_bert_for_next_sequence_prediction_output(self, result):
+            self.parent.assertListEqual(
+                list(result["seq_relationship_score"].size()),
+                [self.batch_size, 2])
+
+
+        def create_bert_for_pretraining(self, config, input_ids, token_type_ids, input_mask, sequence_labels, token_labels):
+            model = BertForPreTraining(config=config)
+            loss = model(input_ids, token_type_ids, input_mask, token_labels, sequence_labels)
+            prediction_scores, seq_relationship_score = model(input_ids, token_type_ids, input_mask)
+            outputs = {
+                "loss": loss,
+                "prediction_scores": prediction_scores,
+                "seq_relationship_score": seq_relationship_score,
+            }
+            return outputs
+
+        def check_bert_for_pretraining_output(self, result):
+            self.parent.assertListEqual(
+                list(result["prediction_scores"].size()),
+                [self.batch_size, self.seq_length, self.vocab_size])
+            self.parent.assertListEqual(
+                list(result["seq_relationship_score"].size()),
+                [self.batch_size, 2])
+
+
+        def create_bert_for_question_answering(self, config, input_ids, token_type_ids, input_mask, sequence_labels, token_labels):
+            model = BertForQuestionAnswering(config=config)
+            loss = model(input_ids, token_type_ids, input_mask, sequence_labels, sequence_labels)
+            start_logits, end_logits = model(input_ids, token_type_ids, input_mask)
+            outputs = {
+                "loss": loss,
+                "start_logits": start_logits,
+                "end_logits": end_logits,
+            }
+            return outputs
+
+        def check_bert_for_question_answering_output(self, result):
+            self.parent.assertListEqual(
+                list(result["start_logits"].size()),
+                [self.batch_size, self.seq_length])
+            self.parent.assertListEqual(
+                list(result["end_logits"].size()),
+                [self.batch_size, self.seq_length])
+
+
+        def create_bert_for_sequence_classification(self, config, input_ids, token_type_ids, input_mask, sequence_labels, token_labels):
+            model = BertForSequenceClassification(config=config, num_labels=self.num_labels)
+            loss = model(input_ids, token_type_ids, input_mask, sequence_labels)
+            logits = model(input_ids, token_type_ids, input_mask)
+            outputs = {
+                "loss": loss,
+                "logits": logits,
+            }
+            return outputs
+
+        def check_bert_for_sequence_classification_output(self, result):
+            self.parent.assertListEqual(
+                list(result["logits"].size()),
+                [self.batch_size, self.num_labels])
+
+
+        def create_bert_for_token_classification(self, config, input_ids, token_type_ids, input_mask, sequence_labels, token_labels):
+            model = BertForTokenClassification(config=config, num_labels=self.num_labels)
+            loss = model(input_ids, token_type_ids, input_mask, token_labels)
+            logits = model(input_ids, token_type_ids, input_mask)
+            outputs = {
+                "loss": loss,
+                "logits": logits,
+            }
+            return outputs
+
+        def check_bert_for_token_classification_output(self, result):
+            self.parent.assertListEqual(
+                list(result["logits"].size()),
+                [self.batch_size, self.seq_length, self.num_labels])
+
+
    def test_default(self):
        self.run_tester(BertModelTest.BertModelTester(self))

@@ -118,8 +245,33 @@ class BertModelTest(unittest.TestCase):
        self.assertEqual(obj["hidden_size"], 37)

    def run_tester(self, tester):
-        output_result = tester.create_model()
-        tester.check_output(output_result)
+        config_and_inputs = tester.prepare_config_and_inputs()
+        output_result = tester.create_bert_model(*config_and_inputs)
+        tester.check_bert_model_output(output_result)
+
+        output_result = tester.create_bert_for_masked_lm(*config_and_inputs)
+        tester.check_bert_for_masked_lm_output(output_result)
+        tester.check_loss_output(output_result)
+
+        output_result = tester.create_bert_for_next_sequence_prediction(*config_and_inputs)
+        tester.check_bert_for_next_sequence_prediction_output(output_result)
+        tester.check_loss_output(output_result)
+
+        output_result = tester.create_bert_for_pretraining(*config_and_inputs)
+        tester.check_bert_for_pretraining_output(output_result)
+        tester.check_loss_output(output_result)
+
+        output_result = tester.create_bert_for_question_answering(*config_and_inputs)
+        tester.check_bert_for_question_answering_output(output_result)
+        tester.check_loss_output(output_result)
+
+        output_result = tester.create_bert_for_sequence_classification(*config_and_inputs)
+        tester.check_bert_for_sequence_classification_output(output_result)
+        tester.check_loss_output(output_result)
+
+        output_result = tester.create_bert_for_token_classification(*config_and_inputs)
+        tester.check_bert_for_token_classification_output(output_result)
+        tester.check_loss_output(output_result)

    @classmethod
    def ids_tensor(cls, shape, vocab_size, rng=None, name=None):
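
Note on the shape checks above: `check_bert_model_output` flattens the sizes of all encoder layers into one list and compares it with `[batch_size, seq_length, hidden_size] * num_hidden_layers`, while `check_loss_output` relies on a scalar loss having an empty size (`list(loss.size()) == []`). The sketch below is not part of this pull request. It replays the flattened comparison on plain tuples instead of tensors (`torch.Size` is a tuple subclass, so it iterates the same way) and sets it beside a stricter per-layer check. The helper names `flatten_sizes` and `check_layer_shapes` and the concrete sizes are assumptions made for illustration only.

```python
def flatten_sizes(shapes):
    """Concatenate the dimensions of every shape into one flat list."""
    return [dim for shape in shapes for dim in shape]


def check_layer_shapes(shapes, expected_shape, num_layers):
    """Stricter variant: check the layer count and each layer's shape separately."""
    return len(shapes) == num_layers and all(
        tuple(shape) == tuple(expected_shape) for shape in shapes
    )


# Hypothetical sizes, chosen only for illustration.
batch_size, seq_length, hidden_size, num_hidden_layers = 13, 7, 32, 5
expected_shape = (batch_size, seq_length, hidden_size)

# Stand-in for [layer.size() for layer in all_encoder_layers].
layer_shapes = [expected_shape] * num_hidden_layers

# Same comparison as the flattened assertListEqual in the test.
flat_ok = flatten_sizes(layer_shapes) == list(expected_shape) * num_hidden_layers
strict_ok = check_layer_shapes(layer_shapes, expected_shape, num_hidden_layers)

# A malformed list whose flattened dimensions still match the expected pattern.
odd_shapes = [(13, 7), (32, 13, 7, 32)]
flat_odd = flatten_sizes(odd_shapes) == list(expected_shape) * 2
strict_odd = check_layer_shapes(odd_shapes, expected_shape, 2)
```

The flattened comparison is compact and catches a wrong layer count or a wrong dimension, but it cannot tell where one layer ends and the next begins. For `odd_shapes` it passes while the per-layer check fails. Whether that extra strictness is worth the added lines in a smoke test is a judgment call.
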
Author	SHA1	Message	Date
Thomas Wolf	66d50ca6ae	Merge pull request #73 from huggingface/third-release Third release	2018-11-30 23:10:30 +01:00
thomwolf	f9f3bdd60b	update readme	2018-11-30 23:05:18 +01:00
thomwolf	52ff0590ff	tup => tpu	2018-11-30 23:01:10 +01:00
thomwolf	511bce58bd	update new token classification model	2018-11-30 22:56:02 +01:00
thomwolf	258eb50086	bump up version	2018-11-30 22:55:33 +01:00
thomwolf	d787c6be8c	improve docstrings and fix new token classification model	2018-11-30 22:55:26 +01:00
thomwolf	ed302a73f4	add new token classification model	2018-11-30 22:55:03 +01:00
thomwolf	89d47230d7	clean up classification model output	2018-11-30 22:54:53 +01:00
thomwolf	7f7c41b0c1	tests for all model classes with and without labels	2018-11-30 22:54:33 +01:00
Thomas Wolf	8c7267f1cf	Merge pull request #70 from deepset-ai/fix_lm_loss fix typo in input for masked lm loss function	2018-11-30 18:23:46 +01:00
Malte Pietsch	7b3bb8c00f	fix typo in input for masked lm loss function	2018-11-30 16:52:50 +01:00
thomwolf	257a35134a	fix pickle dump in run_squad example	2018-11-30 14:23:09 +01:00
thomwolf	c588453a0f	fix run_squad	2018-11-30 14:22:40 +01:00
thomwolf	d6f06c03f4	fixed loading pre-trained tokenizer from directory	2018-11-30 14:09:06 +01:00
thomwolf	532a81d3d6	fixed doc_strings	2018-11-30 13:57:01 +01:00
thomwolf	296f006132	added BertForTokenClassification model	2018-11-30 13:56:53 +01:00
thomwolf	298107fed7	Added new bert models	2018-11-30 13:56:02 +01:00
thomwolf	0541442558	add do_lower_case in examples	2018-11-30 13:47:33 +01:00
Thomas Wolf	3951c2c189	Merge pull request #60 from davidefiocco/patch-1 Updated quick-start example with `BertForMaskedLM`	2018-11-28 14:59:08 +01:00
Davide Fiocco	ec2c339b53	Updated quick-start example with `BertForMaskedLM` As `convert_ids_to_tokens` returns a list, the code in the README currently throws an `AssertionError`, so I propose I quick fix.	2018-11-28 14:53:46 +01:00
Thomas Wolf	21f0196412	Merge pull request #58 from lliimsft/master Bug fix in examples;correct t_total for distributed training;run pred…	2018-11-28 12:39:45 +01:00
Li Li	0aaedcc02f	Bug fix in examples;correct t_total for distributed training;run prediction for full dataset	2018-11-27 01:08:37 -08:00
thomwolf	32167cdf4b	remove convert_to_unicode and printable_text from examples	2018-11-26 23:33:22 +01:00
thomwolf	ce37b8e481	bump version in setup.py	2018-11-26 10:45:48 +01:00
thomwolf	05053d163c	update cache_dir in readme and examples	2018-11-26 10:45:13 +01:00
thomwolf	63ae5d2134	added cache_dir option in from_pretrained	2018-11-26 10:21:56 +01:00
thomwolf	029bdc0d50	fixing readme examples	2018-11-26 09:56:41 +01:00
thomwolf	ebaacba38b	fixing typo in docstring	2018-11-26 09:55:15 +01:00
thomwolf	870d71636e	fixing target size in crossentropy losses	2018-11-26 09:51:34 +01:00
thomwolf	982339d829	fixing unicode error	2018-11-23 12:22:12 +01:00
Thomas Wolf	60e01ac427	fix link in readme	2018-11-21 12:08:30 +01:00
thomwolf	6b2136a8a9	fixing weights decay in run_squad example	2018-11-20 10:12:44 +01:00
Thomas Wolf	061eeca84a	Merge pull request #32 from xiaoda99/master Fix ineffective no_decay bug when using BERTAdam	2018-11-20 10:11:46 +01:00
Thomas Wolf	fd32ebed81	Merge pull request #42 from weiyumou/master Fixed UnicodeDecodeError: 'ascii' codec can't decode byte 0xc2	2018-11-20 10:09:50 +01:00
thomwolf	eed255a58d	fixing CLI typo in readme	2018-11-20 10:02:57 +01:00
thomwolf	2f21497d3e	fixing param.grad is None in fp16 examples	2018-11-20 10:01:21 +01:00
weiyumou	9ff2b7d86d	Fixed README typo	2018-11-19 23:13:10 -05:00
weiyumou	37b6c9b21b	Fixed UnicodeDecodeError: 'ascii' codec can't decode byte 0xc2 in position 3793: ordinal not in range(128)	2018-11-19 23:01:28 -05:00
Thomas Wolf	da73925f6a	fix typos	2018-11-19 20:58:48 +01:00
Thomas Wolf	6f4be31d0d	Merge pull request #40 from joelgrus/patch-1 update pip package name	2018-11-19 20:54:46 +01:00
Joel Grus	dd56cfd89a	update pip package name	2018-11-19 09:50:34 -08:00
xiaoda99	6c4789e4e8	Fix ineffective no_decay bug	2018-11-18 16:16:21 +08:00
Thomas Wolf	956c917344	fix typos in readme	2018-11-17 23:25:23 +01:00
thomwolf	27ee0fff3c	add no_cuda args in extract_features	2018-11-17 23:04:44 +01:00
thomwolf	aa50fd196f	remove unused arguments in example scripts	2018-11-17 23:01:05 +01:00
Thomas Wolf	7c91e51c26	update links in readme	2018-11-17 22:54:15 +01:00
Thomas Wolf	e113101702	fix typos in readme	2018-11-17 12:36:35 +01:00