Merge pull request #254 from huggingface/python_2
Adding OpenAI GPT and Transformer-XL models, compatibility with Python 2
@@ -1,11 +1,29 @@
|
|||||||
version: 2
|
version: 2
|
||||||
jobs:
|
jobs:
|
||||||
build:
|
build_py3:
|
||||||
working_directory: ~/pytorch-pretrained-BERT
|
working_directory: ~/pytorch-pretrained-BERT
|
||||||
docker:
|
docker:
|
||||||
- image: circleci/python:3.7
|
- image: circleci/python:3.5
|
||||||
steps:
|
steps:
|
||||||
- checkout
|
- checkout
|
||||||
- run: sudo pip install --progress-bar off .
|
- run: sudo pip install --progress-bar off .
|
||||||
- run: sudo pip install pytest
|
- run: sudo pip install pytest ftfy spacy
|
||||||
|
- run: sudo python -m spacy download en
|
||||||
- run: python -m pytest -sv tests/
|
- run: python -m pytest -sv tests/
|
||||||
|
build_py2:
|
||||||
|
working_directory: ~/pytorch-pretrained-BERT
|
||||||
|
docker:
|
||||||
|
- image: circleci/python:2.7
|
||||||
|
steps:
|
||||||
|
- checkout
|
||||||
|
- run: sudo pip install --progress-bar off .
|
||||||
|
- run: sudo pip install pytest spacy
|
||||||
|
- run: sudo pip install ftfy==4.4.3
|
||||||
|
- run: sudo python -m spacy download en
|
||||||
|
- run: python -m pytest -sv tests/
|
||||||
|
workflows:
|
||||||
|
version: 2
|
||||||
|
build_and_test:
|
||||||
|
jobs:
|
||||||
|
- build_py3
|
||||||
|
- build_py2
|
||||||
481
README.md
@@ -1,10 +1,25 @@
|
|||||||
# PyTorch Pretrained Bert
|
# PyTorch Pretrained BERT: The Big and Extending Repository of (pre-trained) Transformers
|
||||||
|
|
||||||
[](https://circleci.com/gh/huggingface/pytorch-pretrained-BERT)
|
[](https://circleci.com/gh/huggingface/pytorch-pretrained-BERT)
|
||||||
|
|
||||||
This repository contains an op-for-op PyTorch reimplementation of [Google's TensorFlow repository for the BERT model](https://github.com/google-research/bert) that was released together with the paper [BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding](https://arxiv.org/abs/1810.04805) by Jacob Devlin, Ming-Wei Chang, Kenton Lee and Kristina Toutanova.
|
This repository contains op-for-op PyTorch reimplementations, pre-trained models and fine-tuning examples for:
|
||||||
|
|
||||||
This implementation is provided with [Google's pre-trained models](https://github.com/google-research/bert), examples, notebooks and a command-line interface to load any pre-trained TensorFlow checkpoint for BERT is also provided.
|
- [Google's BERT model](https://github.com/google-research/bert),
|
||||||
|
- [OpenAI's GPT model](https://github.com/openai/finetune-transformer-lm), and
|
||||||
|
- [Google/CMU's Transformer-XL model](https://github.com/kimiyoung/transformer-xl).
|
||||||
|
|
||||||
|
These implementations have been tested on several datasets (see the examples) and should match the performance of the associated TensorFlow implementations (e.g. ~91 F1 on SQuAD for BERT, ~88 F1 on RocStories for OpenAI GPT and ~18.3 perplexity on WikiText 103 for the Transformer-XL). You can find more details in the [Examples](#examples) section below.
|
||||||
|
|
||||||
|
Here is some information on these models:
|
||||||
|
|
||||||
|
**BERT** was released together with the paper [BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding](https://arxiv.org/abs/1810.04805) by Jacob Devlin, Ming-Wei Chang, Kenton Lee and Kristina Toutanova.
|
||||||
|
This PyTorch implementation of BERT is provided with [Google's pre-trained models](https://github.com/google-research/bert), examples, notebooks and a command-line interface to load any pre-trained TensorFlow checkpoint for BERT.
|
||||||
|
|
||||||
|
**OpenAI GPT** was released together with the paper [Improving Language Understanding by Generative Pre-Training](https://blog.openai.com/language-unsupervised/) by Alec Radford, Karthik Narasimhan, Tim Salimans and Ilya Sutskever.
|
||||||
|
This PyTorch implementation of OpenAI GPT is an adaptation of the [PyTorch implementation by HuggingFace](https://github.com/huggingface/pytorch-openai-transformer-lm) and is provided with [OpenAI's pre-trained model](https://github.com/openai/finetune-transformer-lm) and a command-line interface that was used to convert the pre-trained NumPy checkpoint to PyTorch.
|
||||||
|
|
||||||
|
**Google/CMU's Transformer-XL** was released together with the paper [Transformer-XL: Attentive Language Models Beyond a Fixed-Length Context](http://arxiv.org/abs/1901.02860) by Zihang Dai*, Zhilin Yang*, Yiming Yang, Jaime Carbonell, Quoc V. Le, Ruslan Salakhutdinov.
|
||||||
|
This PyTorch implementation of Transformer-XL is an adaptation of the original [PyTorch implementation](https://github.com/kimiyoung/transformer-xl), slightly modified to match the performance of the TensorFlow implementation and to allow re-use of the pretrained weights. A command-line interface is provided to convert TensorFlow checkpoints to PyTorch models.
|
||||||
|
|
||||||
## Content
|
## Content
|
||||||
|
|
||||||
@@ -21,7 +36,7 @@ This implementation is provided with [Google's pre-trained models](https://githu
|
|||||||
|
|
||||||
## Installation
|
## Installation
|
||||||
|
|
||||||
This repo was tested on Python 3.5+ and PyTorch 0.4.1/1.0.0
|
This repo was tested on Python 2.7 and 3.5+ (examples are tested only on Python 3.5+) and PyTorch 0.4.1/1.0.0
|
||||||
|
|
||||||
### With pip
|
### With pip
|
||||||
|
|
||||||
@@ -30,6 +45,12 @@ PyTorch pretrained bert can be installed by pip as follows:
|
|||||||
pip install pytorch-pretrained-bert
|
pip install pytorch-pretrained-bert
|
||||||
```
|
```
|
||||||
|
|
||||||
|
If you want to use the tokenizer associated with the `OpenAI GPT` model, you will need to install `ftfy` (if you are using Python 2, version 4.4.3 is the last version that works) and `SpaCy`:
|
||||||
|
```bash
|
||||||
|
pip install spacy ftfy==4.4.3
|
||||||
|
python -m spacy download en
|
||||||
|
```
|
||||||
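The version constraint described above can also be expressed programmatically, for instance in an install script. The following is a minimal sketch under that assumption; the helper name `ftfy_requirement` is hypothetical and is not part of this package.

```python
import sys


def ftfy_requirement(version_info=sys.version_info):
    """Return a pip requirement string for ftfy suited to the interpreter.

    Hypothetical helper: pins ftfy to 4.4.3 on Python 2, as advised above,
    and leaves it unpinned on Python 3.
    """
    if version_info[0] == 2:
        return "ftfy==4.4.3"
    return "ftfy"


# Requirement for the interpreter running this script
print(ftfy_requirement())
```

The string returned can then be passed to `pip install` alongside `spacy`.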
|
|
||||||
### From source
|
### From source
|
||||||
|
|
||||||
Clone the repository and run:
|
Clone the repository and run:
|
||||||
@@ -37,6 +58,13 @@ Clone the repository and run:
|
|||||||
pip install [--editable] .
|
pip install [--editable] .
|
||||||
```
|
```
|
||||||
|
|
||||||
|
Here also, if you want to use the `OpenAIGPT` tokenizer, you will need to install `ftfy` (limit to version 4.4.3 if you are using Python 2) and `SpaCy`:
|
||||||
|
```bash
|
||||||
|
pip install spacy ftfy==4.4.3
|
||||||
|
python -m spacy download en
|
||||||
|
```
|
||||||
|
|
||||||
|
|
||||||
A series of tests is included in the [tests folder](https://github.com/huggingface/pytorch-pretrained-BERT/tree/master/tests) and can be run using `pytest` (install pytest if needed: `pip install pytest`).
|
A series of tests is included in the [tests folder](https://github.com/huggingface/pytorch-pretrained-BERT/tree/master/tests) and can be run using `pytest` (install pytest if needed: `pip install pytest`).
|
||||||
|
|
||||||
You can run the tests with the command:
|
You can run the tests with the command:
|
||||||
@@ -48,7 +76,7 @@ python -m pytest -sv tests/
|
|||||||
|
|
||||||
This package comprises the following classes that can be imported in Python and are detailed in the [Doc](#doc) section of this readme:
|
This package comprises the following classes that can be imported in Python and are detailed in the [Doc](#doc) section of this readme:
|
||||||
|
|
||||||
- Eight PyTorch models (`torch.nn.Module`) for Bert with pre-trained weights (in the [`modeling.py`](./pytorch_pretrained_bert/modeling.py) file):
|
- Eight **Bert** PyTorch models (`torch.nn.Module`) with pre-trained weights (in the [`modeling.py`](./pytorch_pretrained_bert/modeling.py) file):
|
||||||
- [`BertModel`](./pytorch_pretrained_bert/modeling.py#L556) - raw BERT Transformer model (**fully pre-trained**),
|
- [`BertModel`](./pytorch_pretrained_bert/modeling.py#L556) - raw BERT Transformer model (**fully pre-trained**),
|
||||||
- [`BertForMaskedLM`](./pytorch_pretrained_bert/modeling.py#L710) - BERT Transformer with the pre-trained masked language modeling head on top (**fully pre-trained**),
|
- [`BertForMaskedLM`](./pytorch_pretrained_bert/modeling.py#L710) - BERT Transformer with the pre-trained masked language modeling head on top (**fully pre-trained**),
|
||||||
- [`BertForNextSentencePrediction`](./pytorch_pretrained_bert/modeling.py#L771) - BERT Transformer with the pre-trained next sentence prediction classifier on top (**fully pre-trained**),
|
- [`BertForNextSentencePrediction`](./pytorch_pretrained_bert/modeling.py#L771) - BERT Transformer with the pre-trained next sentence prediction classifier on top (**fully pre-trained**),
|
||||||
@@ -58,26 +86,53 @@ This package comprises the following classes that can be imported in Python and
|
|||||||
- [`BertForTokenClassification`](./pytorch_pretrained_bert/modeling.py#L969) - BERT Transformer with a token classification head on top (BERT Transformer is **pre-trained**, the token classification head **is only initialized and has to be trained**),
|
- [`BertForTokenClassification`](./pytorch_pretrained_bert/modeling.py#L969) - BERT Transformer with a token classification head on top (BERT Transformer is **pre-trained**, the token classification head **is only initialized and has to be trained**),
|
||||||
- [`BertForQuestionAnswering`](./pytorch_pretrained_bert/modeling.py#L1034) - BERT Transformer with a token classification head on top (BERT Transformer is **pre-trained**, the token classification head **is only initialized and has to be trained**).
|
- [`BertForQuestionAnswering`](./pytorch_pretrained_bert/modeling.py#L1034) - BERT Transformer with a token classification head on top (BERT Transformer is **pre-trained**, the token classification head **is only initialized and has to be trained**).
|
||||||
|
|
||||||
- Three tokenizers (in the [`tokenization.py`](./pytorch_pretrained_bert/tokenization.py) file):
|
- Three **OpenAI GPT** PyTorch models (`torch.nn.Module`) with pre-trained weights (in the [`modeling_openai.py`](./pytorch_pretrained_bert/modeling_openai.py) file):
|
||||||
|
- [`OpenAIGPTModel`](./pytorch_pretrained_bert/modeling_openai.py#L537) - raw OpenAI GPT Transformer model (**fully pre-trained**),
|
||||||
|
- [`OpenAIGPTLMHeadModel`](./pytorch_pretrained_bert/modeling_openai.py#L691) - OpenAI GPT Transformer with the tied language modeling head on top (**fully pre-trained**),
|
||||||
|
- [`OpenAIGPTDoubleHeadsModel`](./pytorch_pretrained_bert/modeling_openai.py#L752) - OpenAI GPT Transformer with the tied language modeling head and a multiple choice classification head on top (OpenAI GPT Transformer is **pre-trained**, the multiple choice classification head **is only initialized and has to be trained**),
|
||||||
|
|
||||||
|
- Two **Transformer-XL** PyTorch models (`torch.nn.Module`) with pre-trained weights (in the [`modeling_transfo_xl.py`](./pytorch_pretrained_bert/modeling_transfo_xl.py) file):
|
||||||
|
- [`TransfoXLModel`](./pytorch_pretrained_bert/modeling_transfo_xl.py#L974) - Transformer-XL model which outputs the last hidden state and memory cells (**fully pre-trained**),
|
||||||
|
- [`TransfoXLLMHeadModel`](./pytorch_pretrained_bert/modeling_transfo_xl.py#L1236) - Transformer-XL with the tied adaptive softmax head on top for language modeling which outputs the logits/loss and memory cells (**fully pre-trained**),
|
||||||
|
|
||||||
|
- Tokenizers for **BERT** (using word-piece) (in the [`tokenization.py`](./pytorch_pretrained_bert/tokenization.py) file):
|
||||||
- `BasicTokenizer` - basic tokenization (punctuation splitting, lower casing, etc.),
|
- `BasicTokenizer` - basic tokenization (punctuation splitting, lower casing, etc.),
|
||||||
- `WordpieceTokenizer` - WordPiece tokenization,
|
- `WordpieceTokenizer` - WordPiece tokenization,
|
||||||
- `BertTokenizer` - perform end-to-end tokenization, i.e. basic tokenization followed by WordPiece tokenization.
|
- `BertTokenizer` - perform end-to-end tokenization, i.e. basic tokenization followed by WordPiece tokenization.
|
||||||
|
|
||||||
- One optimizer (in the [`optimization.py`](./pytorch_pretrained_bert/optimization.py) file):
|
- Tokenizer for **OpenAI GPT** (using Byte-Pair-Encoding) (in the [`tokenization_openai.py`](./pytorch_pretrained_bert/tokenization_openai.py) file):
|
||||||
|
- `OpenAIGPTTokenizer` - perform Byte-Pair-Encoding (BPE) tokenization.
|
||||||
|
|
||||||
|
- Tokenizer for **Transformer-XL** (word tokens ordered by frequency for adaptive softmax) (in the [`tokenization_transfo_xl.py`](./pytorch_pretrained_bert/tokenization_transfo_xl.py) file):
|
||||||
|
- `TransfoXLTokenizer` - perform word tokenization and can order words by frequency in a corpus for use in an adaptive softmax.
|
||||||
|
|
||||||
|
- Optimizer for **BERT** (in the [`optimization.py`](./pytorch_pretrained_bert/optimization.py) file):
|
||||||
- `BertAdam` - Bert version of Adam algorithm with weight decay fix, warmup and linear decay of the learning rate.
|
- `BertAdam` - Bert version of Adam algorithm with weight decay fix, warmup and linear decay of the learning rate.
|
||||||
|
|
||||||
- A configuration class (in the [`modeling.py`](./pytorch_pretrained_bert/modeling.py) file):
|
- Optimizer for **OpenAI GPT** (in the [`optimization_openai.py`](./pytorch_pretrained_bert/optimization_openai.py) file):
|
||||||
|
- `OpenAIGPTAdam` - OpenAI GPT version of Adam algorithm with weight decay fix, warmup and linear decay of the learning rate.
|
||||||
|
|
||||||
|
- Configuration classes for BERT, OpenAI GPT and Transformer-XL (in the respective [`modeling.py`](./pytorch_pretrained_bert/modeling.py), [`modeling_openai.py`](./pytorch_pretrained_bert/modeling_openai.py), [`modeling_transfo_xl.py`](./pytorch_pretrained_bert/modeling_transfo_xl.py) files):
|
||||||
- `BertConfig` - Configuration class to store the configuration of a `BertModel` with utilities to read and write from JSON configuration files.
|
- `BertConfig` - Configuration class to store the configuration of a `BertModel` with utilities to read and write from JSON configuration files.
|
||||||
|
- `OpenAIGPTConfig` - Configuration class to store the configuration of a `OpenAIGPTModel` with utilities to read and write from JSON configuration files.
|
||||||
|
- `TransfoXLConfig` - Configuration class to store the configuration of a `TransfoXLModel` with utilities to read and write from JSON configuration files.
|
||||||
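The configuration classes listed above are described as storing model hyper-parameters with utilities to read and write JSON. The general pattern can be sketched as follows. `ToyConfig` and its method names are hypothetical stand-ins chosen for illustration; the actual classes may expose different fields and methods.

```python
import json


class ToyConfig(object):
    """Hypothetical stand-in for a configuration class (not the package's own)."""

    def __init__(self, hidden_size=768, num_layers=12, num_heads=12):
        self.hidden_size = hidden_size
        self.num_layers = num_layers
        self.num_heads = num_heads

    @classmethod
    def from_json_string(cls, text):
        # Rebuild a configuration from its JSON representation
        return cls(**json.loads(text))

    def to_json_string(self):
        # Serialize every attribute; sorted keys keep the output stable
        return json.dumps(self.__dict__, indent=2, sort_keys=True)


config = ToyConfig(hidden_size=1024, num_layers=24, num_heads=16)
restored = ToyConfig.from_json_string(config.to_json_string())
print(restored.to_json_string())
```

Keeping the configuration in a plain JSON file makes it possible to ship it next to the weights in a model archive.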
|
|
||||||
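The `BertAdam` and `OpenAIGPTAdam` entries above mention warmup and linear decay of the learning rate. One possible shape for such a schedule is sketched below. This is an illustration only: the function name, the default warmup fraction and the exact formula are assumptions, and the schedules implemented in the optimization files may differ.

```python
def warmup_linear(progress, warmup=0.1):
    """Learning-rate multiplier for a training progress value in [0, 1].

    Illustrative shape only: rises linearly from 0 to 1 during the warmup
    fraction, then decays linearly back to 0 at the end of training.
    """
    if not 0.0 <= warmup < 1.0:
        raise ValueError("warmup must be in [0, 1)")
    if progress < warmup:
        return progress / warmup
    return max(0.0, (1.0 - progress) / (1.0 - warmup))


base_lr = 5e-5  # example value only, not a recommendation
total_steps = 1000
for step in (0, 50, 100, 550, 1000):
    # Scale the base learning rate by the schedule multiplier
    print(step, base_lr * warmup_linear(step / float(total_steps)))
```

The warmup phase avoids large updates while the optimizer statistics are still unreliable, and the decay phase shrinks the step size as training converges.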
The repository further comprises:
|
The repository further comprises:
|
||||||
|
|
||||||
- Five examples on how to use Bert (in the [`examples` folder](./examples)):
|
- Five examples on how to use **BERT** (in the [`examples` folder](./examples)):
|
||||||
- [`extract_features.py`](./examples/extract_features.py) - Show how to extract hidden states from an instance of `BertModel`,
|
- [`extract_features.py`](./examples/extract_features.py) - Show how to extract hidden states from an instance of `BertModel`,
|
||||||
- [`run_classifier.py`](./examples/run_classifier.py) - Show how to fine-tune an instance of `BertForSequenceClassification` on GLUE's MRPC task,
|
- [`run_classifier.py`](./examples/run_classifier.py) - Show how to fine-tune an instance of `BertForSequenceClassification` on GLUE's MRPC task,
|
||||||
- [`run_squad.py`](./examples/run_squad.py) - Show how to fine-tune an instance of `BertForQuestionAnswering` on SQuAD v1.0 task.
|
- [`run_squad.py`](./examples/run_squad.py) - Show how to fine-tune an instance of `BertForQuestionAnswering` on SQuAD v1.0 and SQuAD v2.0 tasks.
|
||||||
- [`run_swag.py`](./examples/run_swag.py) - Show how to fine-tune an instance of `BertForMultipleChoice` on Swag task.
|
- [`run_swag.py`](./examples/run_swag.py) - Show how to fine-tune an instance of `BertForMultipleChoice` on Swag task.
|
||||||
- [`run_lm_finetuning.py`](./examples/run_lm_finetuning.py) - Show how to fine-tune an instance of `BertForPreTraining` on a target text corpus.
|
- [`run_lm_finetuning.py`](./examples/run_lm_finetuning.py) - Show how to fine-tune an instance of `BertForPreTraining` on a target text corpus.
|
||||||
|
|
||||||
|
- One example on how to use **OpenAI GPT** (in the [`examples` folder](./examples)):
|
||||||
|
- [`openai_gpt_train.py`](./examples/openai_gpt_train.py) - Show how to fine-tune an instance of `OpenAIGPTDoubleHeadsModel` on the RocStories task.
|
||||||
|
|
||||||
|
- Two examples on how to use **Transformer-XL** (in the [`examples` folder](./examples)):
|
||||||
|
- [`transfo_xl_train.py`](./examples/transfo_xl_train.py) - Show how to train and evaluate an instance of `TransfoXLModel` on WikiText 103,
|
||||||
|
- [`transfo_xl_eval.py`](./examples/transfo_xl_eval.py) - Simply evaluate a pre-trained model of `TransfoXLModel` on WikiText 103.
|
||||||
|
|
||||||
These examples are detailed in the [Examples](#examples) section of this readme.
|
These examples are detailed in the [Examples](#examples) section of this readme.
|
||||||
|
|
||||||
- Three notebooks that were used to check that the TensorFlow and PyTorch models behave identically (in the [`notebooks` folder](./notebooks)):
|
- Three notebooks that were used to check that the TensorFlow and PyTorch models behave identically (in the [`notebooks` folder](./notebooks)):
|
||||||
@@ -87,12 +142,14 @@ The repository further comprises:
|
|||||||
|
|
||||||
These notebooks are detailed in the [Notebooks](#notebooks) section of this readme.
|
These notebooks are detailed in the [Notebooks](#notebooks) section of this readme.
|
||||||
|
|
||||||
- A command-line interface to convert any TensorFlow checkpoint in a PyTorch dump:
|
- A command-line interface to convert TensorFlow checkpoints (BERT, Transformer-XL) or a NumPy checkpoint (OpenAI) into a PyTorch save of the associated PyTorch model:
|
||||||
|
|
||||||
This CLI is detailed in the [Command-line interface](#Command-line-interface) section of this readme.
|
This CLI is detailed in the [Command-line interface](#Command-line-interface) section of this readme.
|
||||||
|
|
||||||
## Usage
|
## Usage
|
||||||
|
|
||||||
|
### BERT
|
||||||
|
|
||||||
Here is a quick-start example using `BertTokenizer`, `BertModel` and `BertForMaskedLM` class with Google AI's pre-trained `Bert base uncased` model. See the [doc section](#doc) below for all the details on these classes.
|
Here is a quick-start example using `BertTokenizer`, `BertModel` and `BertForMaskedLM` class with Google AI's pre-trained `Bert base uncased` model. See the [doc section](#doc) below for all the details on these classes.
|
||||||
|
|
||||||
First let's prepare a tokenized input with `BertTokenizer`
|
First let's prepare a tokenized input with `BertTokenizer`
|
||||||
@@ -105,18 +162,18 @@ from pytorch_pretrained_bert import BertTokenizer, BertModel, BertForMaskedLM
|
|||||||
tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
|
tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
|
||||||
|
|
||||||
# Tokenized input
|
# Tokenized input
|
||||||
text = "Who was Jim Henson ? Jim Henson was a puppeteer"
|
text = "[CLS] Who was Jim Henson ? [SEP] Jim Henson was a puppeteer [SEP]"
|
||||||
tokenized_text = tokenizer.tokenize(text)
|
tokenized_text = tokenizer.tokenize(text)
|
||||||
|
|
||||||
# Mask a token that we will try to predict back with `BertForMaskedLM`
|
# Mask a token that we will try to predict back with `BertForMaskedLM`
|
||||||
masked_index = 6
|
masked_index = 6
|
||||||
tokenized_text[masked_index] = '[MASK]'
|
tokenized_text[masked_index] = '[MASK]'
|
||||||
assert tokenized_text == ['who', 'was', 'jim', 'henson', '?', 'jim', '[MASK]', 'was', 'a', 'puppet', '##eer']
|
assert tokenized_text == ['[CLS]', 'who', 'was', 'jim', 'henson', '?', '[SEP]', 'jim', '[MASK]', 'was', 'a', 'puppet', '##eer', '[SEP]']
|
||||||
|
|
||||||
# Convert token to vocabulary indices
|
# Convert token to vocabulary indices
|
||||||
indexed_tokens = tokenizer.convert_tokens_to_ids(tokenized_text)
|
indexed_tokens = tokenizer.convert_tokens_to_ids(tokenized_text)
|
||||||
# Define sentence A and B indices associated to 1st and 2nd sentences (see paper)
|
# Define sentence A and B indices associated to 1st and 2nd sentences (see paper)
|
||||||
segments_ids = [0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1]
|
segments_ids = [0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1]
|
||||||
|
|
||||||
# Convert inputs to PyTorch tensors
|
# Convert inputs to PyTorch tensors
|
||||||
tokens_tensor = torch.tensor([indexed_tokens])
|
tokens_tensor = torch.tensor([indexed_tokens])
|
||||||
@@ -130,8 +187,14 @@ Let's see how to use `BertModel` to get hidden states
|
|||||||
model = BertModel.from_pretrained('bert-base-uncased')
|
model = BertModel.from_pretrained('bert-base-uncased')
|
||||||
model.eval()
|
model.eval()
|
||||||
|
|
||||||
|
# If you have a GPU, put everything on cuda
|
||||||
|
tokens_tensor = tokens_tensor.to('cuda')
|
||||||
|
segments_tensors = segments_tensors.to('cuda')
|
||||||
|
model.to('cuda')
|
||||||
|
|
||||||
# Predict hidden states features for each layer
|
# Predict hidden states features for each layer
|
||||||
encoded_layers, _ = model(tokens_tensor, segments_tensors)
|
with torch.no_grad():
|
||||||
|
    encoded_layers, _ = model(tokens_tensor, segments_tensors)
|
||||||
# We have hidden states for each of the 12 layers in model bert-base-uncased
|
# We have hidden states for each of the 12 layers in model bert-base-uncased
|
||||||
assert len(encoded_layers) == 12
|
assert len(encoded_layers) == 12
|
||||||
```
|
```
|
||||||
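The `segments_ids` list in the tokenization step above is written out by hand. It can also be derived from the positions of the `[SEP]` tokens, as in the following sketch. The helper name `segment_ids_from_tokens` is hypothetical and not part of the package.

```python
def segment_ids_from_tokens(tokens, sep_token="[SEP]"):
    """Assign 0 to every token up to and including the first separator,
    1 to the tokens of the next sentence, and so on."""
    segment_ids = []
    current = 0
    for token in tokens:
        segment_ids.append(current)
        if token == sep_token:
            # The separator closes the current sentence
            current += 1
    return segment_ids


tokens = ['[CLS]', 'who', 'was', 'jim', 'henson', '?', '[SEP]',
          'jim', '[MASK]', 'was', 'a', 'puppet', '##eer', '[SEP]']
print(segment_ids_from_tokens(tokens))
```

For a pair of sentences this yields only the values 0 and 1. Inputs with more than two separators would produce larger ids and need separate handling.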
@@ -143,8 +206,14 @@ And how to use `BertForMaskedLM`
|
|||||||
model = BertForMaskedLM.from_pretrained('bert-base-uncased')
|
model = BertForMaskedLM.from_pretrained('bert-base-uncased')
|
||||||
model.eval()
|
model.eval()
|
||||||
|
|
||||||
|
# If you have a GPU, put everything on cuda
|
||||||
|
tokens_tensor = tokens_tensor.to('cuda')
|
||||||
|
segments_tensors = segments_tensors.to('cuda')
|
||||||
|
model.to('cuda')
|
||||||
|
|
||||||
# Predict all tokens
|
# Predict all tokens
|
||||||
predictions = model(tokens_tensor, segments_tensors)
|
with torch.no_grad():
|
||||||
|
    predictions = model(tokens_tensor, segments_tensors)
|
||||||
|
|
||||||
# confirm we were able to predict 'henson'
|
# confirm we were able to predict 'henson'
|
||||||
predicted_index = torch.argmax(predictions[0, masked_index]).item()
|
predicted_index = torch.argmax(predictions[0, masked_index]).item()
|
||||||
@@ -152,20 +221,152 @@ predicted_token = tokenizer.convert_ids_to_tokens([predicted_index])[0]
|
|||||||
assert predicted_token == 'henson'
|
assert predicted_token == 'henson'
|
||||||
```
|
```
|
||||||
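The last step of the example above picks the vocabulary entry with the highest score at the masked position. The same selection is sketched here in plain Python on made-up numbers, so it runs without PyTorch or a downloaded model. The three-word vocabulary and the scores are invented for illustration only.

```python
def predict_token(scores_per_position, position, id_to_token):
    """Return the token whose score is highest at the given position."""
    scores = scores_per_position[position]
    # Plain-Python equivalent of an argmax over the vocabulary dimension
    best_id = max(range(len(scores)), key=scores.__getitem__)
    return id_to_token[best_id]


toy_vocab = ['jim', 'henson', 'puppet']  # invented vocabulary
toy_scores = [
    [2.0, 0.1, 0.3],  # position 0
    [0.2, 3.5, 0.1],  # position 1 (the masked position in this toy input)
]
print(predict_token(toy_scores, 1, toy_vocab))
```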
|
|
||||||
|
### OpenAI GPT
|
||||||
|
|
||||||
|
Here is a quick-start example using `OpenAIGPTTokenizer`, `OpenAIGPTModel` and `OpenAIGPTLMHeadModel` class with OpenAI's pre-trained model. See the [doc section](#doc) below for all the details on these classes.
|
||||||
|
|
||||||
|
First let's prepare a tokenized input with `OpenAIGPTTokenizer`
|
||||||
|
|
||||||
|
```python
|
||||||
|
import torch
|
||||||
|
from pytorch_pretrained_bert import OpenAIGPTTokenizer, OpenAIGPTModel, OpenAIGPTLMHeadModel
|
||||||
|
|
||||||
|
# Load pre-trained model tokenizer (vocabulary)
|
||||||
|
tokenizer = OpenAIGPTTokenizer.from_pretrained('openai-gpt')
|
||||||
|
|
||||||
|
# Tokenized input
|
||||||
|
text = "Who was Jim Henson ? Jim Henson was a puppeteer"
|
||||||
|
tokenized_text = tokenizer.tokenize(text)
|
||||||
|
|
||||||
|
# Convert token to vocabulary indices
|
||||||
|
indexed_tokens = tokenizer.convert_tokens_to_ids(tokenized_text)
|
||||||
|
|
||||||
|
# Convert inputs to PyTorch tensors
|
||||||
|
tokens_tensor = torch.tensor([indexed_tokens])
|
||||||
|
```
|
||||||
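`OpenAIGPTTokenizer` is described above as performing Byte-Pair-Encoding. The core idea, repeatedly merging the highest-priority adjacent pair of symbols within a word, can be sketched as follows. The merge table is a toy invented for this example and the function is a simplified illustration, not the tokenizer's actual code. The `</w>` suffix marks the end of a word.

```python
def bpe(word, merges):
    """Apply ranked BPE merges to a single word.

    `merges` maps a pair of symbols to its rank (lower rank merges first).
    """
    if not word:
        return []
    # Start from characters, tagging the last one with an end-of-word marker
    symbols = list(word[:-1]) + [word[-1] + "</w>"]
    while len(symbols) > 1:
        pairs = [(symbols[i], symbols[i + 1]) for i in range(len(symbols) - 1)]
        best = min(pairs, key=lambda pair: merges.get(pair, float("inf")))
        if best not in merges:
            break  # no known merge applies any more
        merged, i = [], 0
        while i < len(symbols):
            if i < len(symbols) - 1 and (symbols[i], symbols[i + 1]) == best:
                merged.append(symbols[i] + symbols[i + 1])
                i += 2
            else:
                merged.append(symbols[i])
                i += 1
        symbols = merged
    return symbols


toy_merges = {("l", "o"): 0, ("lo", "w"): 1, ("e", "r</w>"): 2}  # invented
print(bpe("lower", toy_merges))
```

With this toy table, frequent character sequences collapse into single symbols while anything not covered by a merge stays split into smaller pieces.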
|
|
||||||
|
Let's see how to use `OpenAIGPTModel` to get hidden states
|
||||||
|
|
||||||
|
```python
|
||||||
|
# Load pre-trained model (weights)
|
||||||
|
model = OpenAIGPTModel.from_pretrained('openai-gpt')
|
||||||
|
model.eval()
|
||||||
|
|
||||||
|
# If you have a GPU, put everything on cuda
|
||||||
|
tokens_tensor = tokens_tensor.to('cuda')
|
||||||
|
model.to('cuda')
|
||||||
|
|
||||||
|
# Predict hidden states features for each layer
|
||||||
|
with torch.no_grad():
|
||||||
|
    hidden_states = model(tokens_tensor)
|
||||||
|
```
|
||||||
|
|
||||||
|
And how to use `OpenAIGPTLMHeadModel`
|
||||||
|
|
||||||
|
```python
|
||||||
|
# Load pre-trained model (weights)
|
||||||
|
model = OpenAIGPTLMHeadModel.from_pretrained('openai-gpt')
|
||||||
|
model.eval()
|
||||||
|
|
||||||
|
# If you have a GPU, put everything on cuda
|
||||||
|
tokens_tensor = tokens_tensor.to('cuda')
|
||||||
|
model.to('cuda')
|
||||||
|
|
||||||
|
# Predict all tokens
|
||||||
|
with torch.no_grad():
|
||||||
|
    predictions = model(tokens_tensor)
|
||||||
|
|
||||||
|
# get the predicted last token
|
||||||
|
predicted_index = torch.argmax(predictions[0, -1, :]).item()
|
||||||
|
predicted_token = tokenizer.convert_ids_to_tokens([predicted_index])[0]
|
||||||
|
assert predicted_token == '.</w>'
|
||||||
|
```
|
||||||
|
|
||||||
|
### Transformer-XL
|
||||||
|
|
||||||
|
Here is a quick-start example using `TransfoXLTokenizer`, `TransfoXLModel` and `TransfoXLLMHeadModel` class with the Transformer-XL model pre-trained on WikiText-103. See the [doc section](#doc) below for all the details on these classes.
|
||||||
|
|
||||||
|
First let's prepare a tokenized input with `TransfoXLTokenizer`
|
||||||
|
|
||||||
|
```python
|
||||||
|
import torch
|
||||||
|
from pytorch_pretrained_bert import TransfoXLTokenizer, TransfoXLModel, TransfoXLLMHeadModel
|
||||||
|
|
||||||
|
# Load pre-trained model tokenizer (vocabulary from wikitext 103)
|
||||||
|
tokenizer = TransfoXLTokenizer.from_pretrained('transfo-xl-wt103')
|
||||||
|
|
||||||
|
# Tokenized input
|
||||||
|
text_1 = "Who was Jim Henson ?"
|
||||||
|
text_2 = "Jim Henson was a puppeteer"
|
||||||
|
tokenized_text_1 = tokenizer.tokenize(text_1)
|
||||||
|
tokenized_text_2 = tokenizer.tokenize(text_2)
|
||||||
|
|
||||||
|
# Convert token to vocabulary indices
|
||||||
|
indexed_tokens_1 = tokenizer.convert_tokens_to_ids(tokenized_text_1)
|
||||||
|
indexed_tokens_2 = tokenizer.convert_tokens_to_ids(tokenized_text_2)
|
||||||
|
|
||||||
|
# Convert inputs to PyTorch tensors
|
||||||
|
tokens_tensor_1 = torch.tensor([indexed_tokens_1])
|
||||||
|
tokens_tensor_2 = torch.tensor([indexed_tokens_2])
|
||||||
|
```
|
||||||
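The Transformer-XL tokenizer is described above as using word tokens ordered by frequency for the adaptive softmax. A minimal sketch of such an ordering is shown below. The function name `build_vocab_by_frequency` and the whitespace splitting are assumptions made for this example; the real tokenizer loads a pre-built vocabulary.

```python
from collections import Counter


def build_vocab_by_frequency(sentences):
    """Map each word to an index, most frequent words first.

    Ties are broken alphabetically so the result is deterministic.
    """
    counts = Counter(word for sentence in sentences for word in sentence.split())
    ordered = sorted(counts, key=lambda word: (-counts[word], word))
    return {word: index for index, word in enumerate(ordered)}


corpus = ["Jim Henson was a puppeteer", "Who was Jim Henson ?"]
vocab = build_vocab_by_frequency(corpus)
print(sorted(vocab, key=vocab.get))
```

Low indices then correspond to frequent words, which lets an adaptive softmax place them in a small, cheap-to-evaluate cluster and push rare words into larger ones.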
|
|
||||||
|
Let's see how to use `TransfoXLModel` to get hidden states
|
||||||
|
|
||||||
|
```python
|
||||||
|
# Load pre-trained model (weights)
|
||||||
|
model = TransfoXLModel.from_pretrained('transfo-xl-wt103')
|
||||||
|
model.eval()
|
||||||
|
|
||||||
|
# If you have a GPU, put everything on cuda
|
||||||
|
tokens_tensor_1 = tokens_tensor_1.to('cuda')
|
||||||
|
tokens_tensor_2 = tokens_tensor_2.to('cuda')
|
||||||
|
model.to('cuda')
|
||||||
|
|
||||||
|
with torch.no_grad():
|
||||||
|
    # Predict hidden states features for each layer
|
||||||
|
    hidden_states_1, mems_1 = model(tokens_tensor_1)
|
||||||
|
    # We can re-use the memory cells in a subsequent call to attend a longer context
|
||||||
|
    hidden_states_2, mems_2 = model(tokens_tensor_2, mems=mems_1)
|
||||||
|
```
|
||||||
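The calling pattern above, where the memory returned by one call is passed to the next, can be illustrated with a toy function. This sketch only mimics the bookkeeping: the real model caches hidden states rather than tokens, and `toy_segment_step` and `mem_len` are hypothetical names used for illustration.

```python
def toy_segment_step(segment, mems=None, mem_len=4):
    """Process one segment and return (visible context, updated memory).

    The context is the cached memory followed by the current segment; the
    new memory keeps only the most recent `mem_len` items.
    """
    mems = mems or []
    context = mems + segment
    new_mems = context[-mem_len:]
    return context, new_mems


segment_1 = ["who", "was", "jim", "henson", "?"]
segment_2 = ["jim", "henson", "was", "a", "puppeteer"]

context_1, mems_1 = toy_segment_step(segment_1)
# The second call sees its own segment plus what was cached from the first
context_2, mems_2 = toy_segment_step(segment_2, mems=mems_1)
print(len(context_1), len(context_2))
```

Because the memory has a fixed length, the cost of each call stays bounded while the effective context still extends beyond the current segment.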
|
|
||||||
|
And how to use `TransfoXLLMHeadModel`
|
||||||
|
|
||||||
|
```python
|
||||||
|
# Load pre-trained model (weights)
|
||||||
|
model = TransfoXLLMHeadModel.from_pretrained('transfo-xl-wt103')
|
||||||
|
model.eval()
|
||||||
|
|
||||||
|
# If you have a GPU, put everything on cuda
|
||||||
|
tokens_tensor_1 = tokens_tensor_1.to('cuda')
|
||||||
|
tokens_tensor_2 = tokens_tensor_2.to('cuda')
|
||||||
|
model.to('cuda')
|
||||||
|
|
||||||
|
with torch.no_grad():
|
||||||
|
    # Predict all tokens
|
||||||
|
    predictions_1, mems_1 = model(tokens_tensor_1)
|
||||||
|
    # We can re-use the memory cells in a subsequent call to attend a longer context
|
||||||
|
    predictions_2, mems_2 = model(tokens_tensor_2, mems=mems_1)
|
||||||
|
|
||||||
|
# get the predicted last token
|
||||||
|
predicted_index = torch.argmax(predictions_2[0, -1, :]).item()
|
||||||
|
predicted_token = tokenizer.convert_ids_to_tokens([predicted_index])[0]
|
||||||
|
assert predicted_token == 'who'
|
||||||
|
```
|
||||||
|
|
||||||
## Doc
|
## Doc
|
||||||
|
|
||||||
Here is a detailed documentation of the classes in the package and how to use them:
|
Here is a detailed documentation of the classes in the package and how to use them:
|
||||||
|
|
||||||
| Sub-section | Description |
|
| Sub-section | Description |
|
||||||
|-|-|
|
|-|-|
|
||||||
| [Loading Google AI's pre-trained weigths](#Loading-Google-AIs-pre-trained-weigths-and-PyTorch-dump) | How to load Google AI's pre-trained weight or a PyTorch saved instance |
|
| [Loading Google AI's/OpenAI's pre-trained weights](#Loading-Google-AI-or-OpenAI-pre-trained-weigths-or-PyTorch-dump) | How to load Google AI's/OpenAI's pre-trained weights or a PyTorch saved instance |
|
||||||
| [PyTorch models](#PyTorch-models) | API of the eight PyTorch model classes: `BertModel`, `BertForMaskedLM`, `BertForNextSentencePrediction`, `BertForPreTraining`, `BertForSequenceClassification`, `BertForTokenClassification`, `BertForMultipleChoice` and `BertForQuestionAnswering` |
|
| [PyTorch models](#PyTorch-models) | API of the eight PyTorch model classes: `BertModel`, `BertForMaskedLM`, `BertForNextSentencePrediction`, `BertForPreTraining`, `BertForSequenceClassification`, `BertForTokenClassification`, `BertForMultipleChoice` and `BertForQuestionAnswering` |
|
||||||
| [Tokenizer: `BertTokenizer`](#Tokenizer-BertTokenizer) | API of the `BertTokenizer` class|
|
| [Tokenizer: `BertTokenizer`](#Tokenizer-BertTokenizer) | API of the `BertTokenizer` class|
|
||||||
| [Optimizer: `BertAdam`](#Optimizer-BertAdam) | API of the `BertAdam` class |
|
| [Optimizer: `BertAdam`](#Optimizer-BertAdam) | API of the `BertAdam` class |
|
||||||
|
|
||||||
### Loading Google AI's pre-trained weigths and PyTorch dump
|
### Loading Google AI or OpenAI pre-trained weigths or PyTorch dump
|
||||||
|
|
||||||
To load one of Google AI's pre-trained models or a PyTorch saved model (an instance of `BertForPreTraining` saved with `torch.save()`), the PyTorch model classes and the tokenizer can be instantiated as
|
To load one of Google AI's or OpenAI's pre-trained models, or a PyTorch saved model (an instance of `BertForPreTraining` saved with `torch.save()`), the PyTorch model classes and the tokenizer can be instantiated as
|
||||||
|
|
||||||
```python
|
```python
|
||||||
model = BERT_CLASS.from_pretrained(PRE_TRAINED_MODEL_NAME_OR_PATH, cache_dir=None)
|
model = BERT_CLASS.from_pretrained(PRE_TRAINED_MODEL_NAME_OR_PATH, cache_dir=None)
|
||||||
@@ -173,10 +374,10 @@ model = BERT_CLASS.from_pretrained(PRE_TRAINED_MODEL_NAME_OR_PATH, cache_dir=Non
|
|||||||
|
|
||||||
where
|
where
|
||||||
|
|
||||||
- `BERT_CLASS` is either the `BertTokenizer` class (to load the vocabulary) or one of the eight PyTorch model classes (to load the pre-trained weights): `BertModel`, `BertForMaskedLM`, `BertForNextSentencePrediction`, `BertForPreTraining`, `BertForSequenceClassification`, `BertForTokenClassification`, `BertForMultipleChoice` or `BertForQuestionAnswering`, and
|
- `BERT_CLASS` is either a tokenizer to load the vocabulary (`BertTokenizer` or `OpenAIGPTTokenizer` classes) or one of the eight BERT or three OpenAI GPT PyTorch model classes (to load the pre-trained weights): `BertModel`, `BertForMaskedLM`, `BertForNextSentencePrediction`, `BertForPreTraining`, `BertForSequenceClassification`, `BertForTokenClassification`, `BertForMultipleChoice`, `BertForQuestionAnswering`, `OpenAIGPTModel`, `OpenAIGPTLMHeadModel` or `OpenAIGPTDoubleHeadsModel`, and
|
||||||
- `PRE_TRAINED_MODEL_NAME_OR_PATH` is either:
|
- `PRE_TRAINED_MODEL_NAME_OR_PATH` is either:
|
||||||
|
|
||||||
- the shortcut name of a Google AI's pre-trained model selected in the list:
|
- the shortcut name of a Google AI's or OpenAI's pre-trained model selected in the list:
|
||||||
|
|
||||||
- `bert-base-uncased`: 12-layer, 768-hidden, 12-heads, 110M parameters
|
- `bert-base-uncased`: 12-layer, 768-hidden, 12-heads, 110M parameters
|
||||||
- `bert-large-uncased`: 24-layer, 1024-hidden, 16-heads, 340M parameters
|
- `bert-large-uncased`: 24-layer, 1024-hidden, 16-heads, 340M parameters
|
||||||
@@ -185,11 +386,13 @@ where
|
|||||||
- `bert-base-multilingual-uncased`: (Orig, not recommended) 102 languages, 12-layer, 768-hidden, 12-heads, 110M parameters
|
- `bert-base-multilingual-uncased`: (Orig, not recommended) 102 languages, 12-layer, 768-hidden, 12-heads, 110M parameters
|
||||||
- `bert-base-multilingual-cased`: **(New, recommended)** 104 languages, 12-layer, 768-hidden, 12-heads, 110M parameters
|
- `bert-base-multilingual-cased`: **(New, recommended)** 104 languages, 12-layer, 768-hidden, 12-heads, 110M parameters
|
||||||
- `bert-base-chinese`: Chinese Simplified and Traditional, 12-layer, 768-hidden, 12-heads, 110M parameters
|
- `bert-base-chinese`: Chinese Simplified and Traditional, 12-layer, 768-hidden, 12-heads, 110M parameters
|
||||||
|
- `openai-gpt`: OpenAI English model, 12-layer, 768-hidden, 12-heads, 110M parameters
|
||||||
|
- `transfo-xl-wt103`: Transformer-XL English model trained on wikitext-103, 18-layer, 1024-hidden, 16-heads, 257M parameters
|
||||||
|
|
||||||
- a path or url to a pretrained model archive containing:
|
- a path or url to a pretrained model archive containing:
|
||||||
|
|
||||||
- `bert_config.json` a configuration file for the model, and
|
- `bert_config.json` or `openai_gpt_config.json` a configuration file for the model, and
|
||||||
- `pytorch_model.bin` a PyTorch dump of a pre-trained instance `BertForPreTraining` (saved with the usual `torch.save()`)
|
- `pytorch_model.bin` a PyTorch dump of a pre-trained instance of `BertForPreTraining`, `OpenAIGPTModel` or `TransfoXLModel` (saved with the usual `torch.save()`)
|
||||||
|
|
||||||
If `PRE_TRAINED_MODEL_NAME_OR_PATH` is a shortcut name, the pre-trained weights will be downloaded from AWS S3 (see the links [here](pytorch_pretrained_bert/modeling.py)) and stored in a cache folder to avoid future download (the cache folder can be found at `~/.pytorch_pretrained_bert/`).
|
If `PRE_TRAINED_MODEL_NAME_OR_PATH` is a shortcut name, the pre-trained weights will be downloaded from AWS S3 (see the links [here](pytorch_pretrained_bert/modeling.py)) and stored in a cache folder to avoid future download (the cache folder can be found at `~/.pytorch_pretrained_bert/`).
|
||||||
- `cache_dir` can be an optional path to a specific directory to download and cache the pre-trained model weights. This option is useful in particular when you are using distributed training: to avoid concurrent access to the same weights you can set for example `cache_dir='./pretrained_model_{}'.format(args.local_rank)` (see the section on distributed training for more information).
|
- `cache_dir` can be an optional path to a specific directory to download and cache the pre-trained model weights. This option is useful in particular when you are using distributed training: to avoid concurrent access to the same weights you can set for example `cache_dir='./pretrained_model_{}'.format(args.local_rank)` (see the section on distributed training for more information).
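
The per-process cache directory suggested above can be sketched as follows. This is only an illustration: the helper name `cache_dir_for_rank` and the `base_dir` default are assumptions, not part of the library.

```python
def cache_dir_for_rank(local_rank, base_dir="./pretrained_model"):
    """Return a cache directory that is unique to one distributed process.

    `cache_dir_for_rank` and `base_dir` are hypothetical names used only
    for this illustration.
    """
    return "{}_{}".format(base_dir, local_rank)


# Each process would then pass its own directory as `cache_dir`, e.g.
# model = BERT_CLASS.from_pretrained(PRE_TRAINED_MODEL_NAME_OR_PATH,
#                                    cache_dir=cache_dir_for_rank(args.local_rank))
print(cache_dir_for_rank(0))  # ./pretrained_model_0
print(cache_dir_for_rank(1))  # ./pretrained_model_1
```

Because every rank writes to a different folder, no two processes download or read the same weight file concurrently.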
|
||||||
@@ -198,10 +401,19 @@ where
|
|||||||
|
|
||||||
**When using an `uncased model`, make sure to pass `--do_lower_case` to the example training scripts (or pass `do_lower_case=True` to FullTokenizer if you're using your own script and loading the tokenizer yourself).**
|
**When using an `uncased model`, make sure to pass `--do_lower_case` to the example training scripts (or pass `do_lower_case=True` to FullTokenizer if you're using your own script and loading the tokenizer yourself).**
|
||||||
|
|
||||||
Example:
|
Examples:
|
||||||
```python
|
```python
|
||||||
|
# BERT
|
||||||
tokenizer = BertTokenizer.from_pretrained('bert-base-uncased', do_lower_case=True)
|
tokenizer = BertTokenizer.from_pretrained('bert-base-uncased', do_lower_case=True)
|
||||||
model = BertForSequenceClassification.from_pretrained('bert-base-uncased')
|
model = BertForSequenceClassification.from_pretrained('bert-base-uncased')
|
||||||
|
|
||||||
|
# OpenAI GPT
|
||||||
|
tokenizer = OpenAIGPTTokenizer.from_pretrained('openai-gpt')
|
||||||
|
model = OpenAIGPTModel.from_pretrained('openai-gpt')
|
||||||
|
|
||||||
|
# Transformer-XL
|
||||||
|
tokenizer = TransfoXLTokenizer.from_pretrained('transfo-xl-wt103')
|
||||||
|
model = TransfoXLModel.from_pretrained('transfo-xl-wt103')
|
||||||
```
|
```
|
||||||
|
|
||||||
### PyTorch models
|
### PyTorch models
|
||||||
@@ -311,7 +523,110 @@ The token-level classifier takes as input the full sequence of the last hidden s
|
|||||||
|
|
||||||
An example on how to use this class is given in the [`run_squad.py`](./examples/run_squad.py) script which can be used to fine-tune a token classifier using BERT, for example for the SQuAD task.
|
An example on how to use this class is given in the [`run_squad.py`](./examples/run_squad.py) script which can be used to fine-tune a token classifier using BERT, for example for the SQuAD task.
|
||||||
|
|
||||||
### Tokenizer: `BertTokenizer`
|
#### 9. `OpenAIGPTModel`
|
||||||
|
|
||||||
|
`OpenAIGPTModel` is the basic OpenAI GPT Transformer model with a layer of summed token and position embeddings followed by a series of 12 identical self-attention blocks.
|
||||||
|
|
||||||
|
OpenAI GPT uses a single embedding matrix to store the word and special embeddings.
|
||||||
|
Special token embeddings are embeddings for additional tokens that are not pre-trained: `[SEP]`, `[CLS]`...
|
||||||
|
Special tokens need to be trained during the fine-tuning if you use them.
|
||||||
|
The number of special embeddings can be controlled using the `set_num_special_tokens(num_special_tokens)` function.
|
||||||
|
|
||||||
|
The embeddings are ordered as follows in the token embedding matrix:
|
||||||
|
|
||||||
|
```python
|
||||||
|
[0, ----------------------
|
||||||
|
... -> word embeddings
|
||||||
|
config.vocab_size - 1, ______________________
|
||||||
|
config.vocab_size,
|
||||||
|
... -> special embeddings
|
||||||
|
config.vocab_size + config.n_special - 1] ______________________
|
||||||
|
```
|
||||||
|
|
||||||
|
where `total_tokens_embeddings` can be obtained as `config.total_tokens_embeddings` and is:
|
||||||
|
`total_tokens_embeddings = config.vocab_size + config.n_special`
|
||||||
|
You should use the associated indices to index the embeddings.
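
The layout of the embedding matrix can be sketched in a few lines. This is a minimal illustration, assuming only the relation `total_tokens_embeddings = vocab_size + n_special` stated here; the helper name `special_token_indices` and the sizes used are hypothetical.

```python
def special_token_indices(vocab_size, n_special):
    """Return the row indices of the special embeddings.

    Word embeddings occupy rows [0, vocab_size - 1]; special embeddings
    occupy rows [vocab_size, vocab_size + n_special - 1].
    """
    total_tokens_embeddings = vocab_size + n_special
    return list(range(vocab_size, total_tokens_embeddings))


# Hypothetical sizes, chosen only to keep the example small
indices = special_token_indices(vocab_size=10, n_special=3)
print(indices)  # [10, 11, 12]
```

The first special token therefore always gets index `vocab_size`, whatever the number of special tokens.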
|
||||||
|
|
||||||
|
The inputs and output are **identical to the TensorFlow model inputs and outputs**.
|
||||||
|
|
||||||
|
We detail them here. This model takes as *inputs*:
|
||||||
|
[`modeling_openai.py`](./pytorch_pretrained_bert/modeling_openai.py)
|
||||||
|
- `input_ids`: a torch.LongTensor of shape [batch_size, sequence_length] (or more generally [d_1, ..., d_n, sequence_length] where d_1 ... d_n are arbitrary dimensions) with the word BPE token indices selected in the range [0, total_tokens_embeddings[
|
||||||
|
- `position_ids`: an optional torch.LongTensor with the same shape as input_ids
|
||||||
|
with the position indices (selected in the range [0, config.n_positions - 1]).
|
||||||
|
- `token_type_ids`: an optional torch.LongTensor with the same shape as input_ids
|
||||||
|
You can use it to add a third type of embedding to each input token in the sequence
|
||||||
|
(the previous two being the word and position embeddings). The input, position and token_type embeddings are summed inside the Transformer before the first self-attention block.
|
||||||
|
|
||||||
|
This model *outputs*:
|
||||||
|
- `hidden_states`: the encoded hidden states at the top of the model as a torch.FloatTensor of size [batch_size, sequence_length, hidden_size] (or more generally [d_1, ..., d_n, hidden_size] where d_1 ... d_n are the dimensions of input_ids)
|
||||||
|
|
||||||
|
#### 10. `OpenAIGPTLMHeadModel`
|
||||||
|
|
||||||
|
`OpenAIGPTLMHeadModel` includes the `OpenAIGPTModel` Transformer followed by a language modeling head with weights tied to the input embeddings (no additional parameters).
|
||||||
|
|
||||||
|
*Inputs* are the same as the inputs of the [`OpenAIGPTModel`](#-9.-`OpenAIGPTModel`) class plus optional labels:
|
||||||
|
- `lm_labels`: optional language modeling labels: torch.LongTensor of shape [batch_size, sequence_length] with indices selected in [-1, 0, ..., vocab_size]. All labels set to -1 are ignored (masked), the loss is only computed for the labels set in [0, ..., vocab_size].
|
||||||
|
|
||||||
|
*Outputs*:
|
||||||
|
- if `lm_labels` is not `None`:
|
||||||
|
Outputs the language modeling loss.
|
||||||
|
- else:
|
||||||
|
Outputs `lm_logits`: the language modeling logits as a torch.FloatTensor of size [batch_size, sequence_length, total_tokens_embeddings] (or more generally [d_1, ..., d_n, total_tokens_embeddings] where d_1 ... d_n are the dimensions of input_ids)
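
The effect of setting a label to -1 can be illustrated with a small pure-Python sketch. It only shows the masking idea: the function name `masked_lm_loss` is hypothetical, and averaging over the unmasked positions is an assumption of this sketch rather than a statement about the library's exact loss reduction.

```python
import math


def masked_lm_loss(logits, labels, ignore_index=-1):
    """Average negative log-likelihood over positions whose label is not ignored.

    logits: list of per-position score lists (one score per vocabulary entry)
    labels: list of target indices, with `ignore_index` marking masked positions
    """
    total, count = 0.0, 0
    for scores, label in zip(logits, labels):
        if label == ignore_index:
            continue  # masked position: contributes nothing to the loss
        # Numerically stable log of the softmax normalizer
        max_score = max(scores)
        log_norm = max_score + math.log(sum(math.exp(s - max_score) for s in scores))
        total += log_norm - scores[label]
        count += 1
    return total / count if count else 0.0


# Two positions, two vocabulary entries; the second position is masked
loss = masked_lm_loss([[0.0, 0.0], [5.0, 0.0]], [0, -1])
print(loss)  # log(2), since only the first position is counted
```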
|
||||||
|
|
||||||
|
#### 11. `OpenAIGPTDoubleHeadsModel`
|
||||||
|
|
||||||
|
`OpenAIGPTDoubleHeadsModel` includes the `OpenAIGPTModel` Transformer followed by two heads:
|
||||||
|
- a language modeling head with weights tied to the input embeddings (no additional parameters) and:
|
||||||
|
- a multiple choice classifier (linear layer that takes as input a hidden state in a sequence to compute a score, see details in paper).
|
||||||
|
|
||||||
|
*Inputs* are the same as the inputs of the [`OpenAIGPTModel`](#-9.-`OpenAIGPTModel`) class plus a classification mask and two optional labels:
|
||||||
|
- `multiple_choice_token_ids`: a torch.LongTensor of shape [batch_size, num_choices] with the index of the token whose hidden state should be used as input for the multiple choice classifier (usually the [CLS] token for each choice).
|
||||||
|
- `lm_labels`: optional language modeling labels: torch.LongTensor of shape [batch_size, sequence_length] with indices selected in [-1, 0, ..., vocab_size]. All labels set to -1 are ignored (masked), the loss is only computed for the labels set in [0, ..., vocab_size].
|
||||||
|
- `multiple_choice_labels`: optional multiple choice labels: torch.LongTensor of shape [batch_size] with indices selected in [0, ..., num_choices].
|
||||||
|
|
||||||
|
*Outputs*:
|
||||||
|
- if `lm_labels` and `multiple_choice_labels` are not `None`:
|
||||||
|
Outputs a tuple of losses with the language modeling loss and the multiple choice loss.
|
||||||
|
- else: outputs a tuple with:
|
||||||
|
- `lm_logits`: the language modeling logits as a torch.FloatTensor of size [batch_size, num_choices, sequence_length, total_tokens_embeddings]
|
||||||
|
- `multiple_choice_logits`: the multiple choice logits as a torch.FloatTensor of size [batch_size, num_choices]
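
The role of `multiple_choice_token_ids` can be sketched with nested lists instead of tensors. This is an illustration only: the function name `select_choice_states` is hypothetical, and the [batch_size, num_choices, sequence_length, hidden_size] layout of the hidden states is an assumption inferred from the `lm_logits` shape given above.

```python
def select_choice_states(hidden_states, multiple_choice_token_ids):
    """Pick, for every choice, the hidden state at the requested token position.

    hidden_states: nested list of shape
        [batch_size, num_choices, sequence_length, hidden_size]
    multiple_choice_token_ids: nested list of shape [batch_size, num_choices]
    Returns a nested list of shape [batch_size, num_choices, hidden_size].
    """
    return [
        [choice_states[token_id] for choice_states, token_id in zip(example, ids)]
        for example, ids in zip(hidden_states, multiple_choice_token_ids)
    ]


# One example, two choices, three tokens per choice, hidden size 2
hidden = [[[[0, 0], [1, 1], [2, 2]],
           [[3, 3], [4, 4], [5, 5]]]]
selected = select_choice_states(hidden, [[2, 0]])
print(selected)  # [[[2, 2], [3, 3]]]
```

Each selected vector is what the multiple choice classifier would score, giving one logit per choice.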
|
||||||
|
|
||||||
|
#### 12. `TransfoXLModel`
|
||||||
|
|
||||||
|
The Transformer-XL model is described in "Transformer-XL: Attentive Language Models Beyond a Fixed-Length Context".
|
||||||
|
|
||||||
|
Transformer-XL uses relative positioning with sinusoidal patterns and adaptive softmax inputs, which means that:
|
||||||
|
|
||||||
|
- you don't need to specify positioning embeddings indices
|
||||||
|
- the tokens in the vocabulary have to be sorted by decreasing frequency.
|
||||||
|
|
||||||
|
This model takes as *inputs*:
|
||||||
|
[`modeling_transfo_xl.py`](./pytorch_pretrained_bert/modeling_transfo_xl.py)
|
||||||
|
- `input_ids`: a torch.LongTensor of shape [batch_size, sequence_length] with the token indices selected in the range [0, self.config.n_token[
|
||||||
|
- `mems`: an optional memory of hidden states from previous forward passes as a list (num layers) of hidden states at the entry of each layer. Each hidden state has shape [self.config.mem_len, bsz, self.config.d_model]. Note that the first two dimensions are transposed in `mems` with regards to `input_ids`.
|
||||||
|
|
||||||
|
This model *outputs* a tuple of (last_hidden_state, new_mems)
|
||||||
|
- `last_hidden_state`: the encoded-hidden-states at the top of the model as a torch.FloatTensor of size [batch_size, sequence_length, self.config.d_model]
|
||||||
|
- `new_mems`: list (num layers) of updated mem states at the entry of each layer; each mem state is a torch.FloatTensor of size [self.config.mem_len, batch_size, self.config.d_model]. Note that the first two dimensions are transposed in `mems` with regards to `input_ids`.
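
How a memory of fixed length is carried from one segment to the next can be sketched without any model. This is a toy illustration under stated assumptions: `update_memory` and `run_segments` are hypothetical helpers, the "hidden states" are just the input tokens, and a single flat list stands in for the per-layer tensors.

```python
def update_memory(mems, new_states, mem_len):
    """Append the new hidden states to the memory and keep the last `mem_len`."""
    if mem_len <= 0:
        return []
    return (mems + new_states)[-mem_len:]


def run_segments(segments, mem_len):
    """Process consecutive segments, feeding the memory of one into the next."""
    mems = []
    for segment in segments:
        # A real model would compute hidden states that attend to `mems`;
        # here the tokens themselves stand in for the hidden states
        hidden = list(segment)
        mems = update_memory(mems, hidden, mem_len)
    return mems


final_mems = run_segments([[1, 2, 3], [4, 5]], mem_len=4)
print(final_mems)  # [2, 3, 4, 5]
```

With the real model the same pattern applies: the `new_mems` returned by one forward pass is passed as `mems` to the next one.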
|
||||||
|
|
||||||
|
#### 13. `TransfoXLLMHeadModel`
|
||||||
|
|
||||||
|
`TransfoXLLMHeadModel` includes the `TransfoXLModel` Transformer followed by an (adaptive) softmax head with weights tied to the input embeddings.
|
||||||
|
|
||||||
|
*Inputs* are the same as the inputs of the [`TransfoXLModel`](#-12.-`TransfoXLModel`) class plus optional labels:
|
||||||
|
- `target`: an optional torch.LongTensor of shape [batch_size, sequence_length] with the target token indices selected in the range [0, self.config.n_token[
|
||||||
|
|
||||||
|
*Outputs* a tuple of (softmax_output, new_mems)
|
||||||
|
- `softmax_output`: output of the (adaptive) softmax:
|
||||||
|
- if target is None: log probabilities of tokens, shape [batch_size, sequence_length, n_tokens]
|
||||||
|
- else: negative log likelihood of the target tokens, shape [batch_size, sequence_length]
|
||||||
|
- `new_mems`: list (num layers) of updated mem states at the entry of each layer; each mem state is a torch.FloatTensor of size [self.config.mem_len, batch_size, self.config.d_model]. Note that the first two dimensions are transposed in `mems` with regards to `input_ids`.
|
||||||
|
|
||||||
|
|
||||||
|
### Tokenizers:
|
||||||
|
|
||||||
|
#### `BertTokenizer`
|
||||||
|
|
||||||
`BertTokenizer` performs end-to-end tokenization, i.e. basic tokenization followed by WordPiece tokenization.
|
`BertTokenizer` performs end-to-end tokenization, i.e. basic tokenization followed by WordPiece tokenization.
|
||||||
|
|
||||||
@@ -328,7 +643,32 @@ and three methods:
|
|||||||
|
|
||||||
Please refer to the doc strings and code in [`tokenization.py`](./pytorch_pretrained_bert/tokenization.py) for the details of the `BasicTokenizer` and `WordpieceTokenizer` classes. In general it is recommended to use `BertTokenizer` unless you know what you are doing.
|
Please refer to the doc strings and code in [`tokenization.py`](./pytorch_pretrained_bert/tokenization.py) for the details of the `BasicTokenizer` and `WordpieceTokenizer` classes. In general it is recommended to use `BertTokenizer` unless you know what you are doing.
|
||||||
|
|
||||||
### Optimizer: `BertAdam`
|
#### `OpenAIGPTTokenizer`
|
||||||
|
|
||||||
|
`OpenAIGPTTokenizer` performs Byte-Pair-Encoding (BPE) tokenization.
|
||||||
|
|
||||||
|
This class has two arguments:
|
||||||
|
|
||||||
|
- `vocab_file`: path to a vocabulary file.
|
||||||
|
- `merges_file`: path to a file containing the BPE merges.
|
||||||
|
|
||||||
|
and three methods:
|
||||||
|
|
||||||
|
- `tokenize(text)`: convert a `str` into a list of `str` tokens by performing Byte-Pair-Encoding (BPE) tokenization.
|
||||||
|
- `convert_tokens_to_ids(tokens)`: convert a list of `str` tokens in a list of `int` indices in the vocabulary.
|
||||||
|
- `convert_ids_to_tokens(ids)`: convert a list of `int` indices into a list of `str` tokens in the vocabulary.
|
||||||
|
|
||||||
|
Please refer to the doc strings and code in [`tokenization_openai.py`](./pytorch_pretrained_bert/tokenization_openai.py) for the details of the `OpenAIGPTTokenizer`.
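
To make the role of the merges file concrete, here is a toy sketch of how ranked BPE merges turn a word into sub-word tokens, followed by a vocabulary lookup. It is not the library's implementation: the function `bpe`, the merge list and the vocabulary are made up for illustration, and details such as end-of-word markers and text normalization are omitted.

```python
def bpe(word, merges):
    """Apply ranked BPE merges to a single word and return its sub-word tokens.

    merges: list of (left, right) pairs, highest priority first.
    """
    ranks = {pair: rank for rank, pair in enumerate(merges)}
    symbols = list(word)
    while len(symbols) > 1:
        pairs = [(symbols[i], symbols[i + 1]) for i in range(len(symbols) - 1)]
        best = min(pairs, key=lambda pair: ranks.get(pair, float("inf")))
        if best not in ranks:
            break  # no applicable merge is left
        merged, i = [], 0
        while i < len(symbols):
            if i < len(symbols) - 1 and (symbols[i], symbols[i + 1]) == best:
                merged.append(symbols[i] + symbols[i + 1])
                i += 2
            else:
                merged.append(symbols[i])
                i += 1
        symbols = merged
    return symbols


# Hypothetical merges and vocabulary
merges = [("l", "o"), ("lo", "w"), ("e", "r")]
vocab = {"low": 0, "er": 1}

tokens = bpe("lower", merges)
ids = [vocab[token] for token in tokens]
print(tokens)  # ['low', 'er']
print(ids)     # [0, 1]
```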
|
||||||
|
|
||||||
|
#### `TransfoXLTokenizer`
|
||||||
|
|
||||||
|
`TransfoXLTokenizer` performs word tokenization. This tokenizer can be used for adaptive softmax and has utilities for counting tokens in a corpus to create a vocabulary ordered by token frequency (for adaptive softmax). See the adaptive softmax paper ([Efficient softmax approximation for GPUs](http://arxiv.org/abs/1609.04309)) for more details.
|
||||||
|
|
||||||
|
Please refer to the doc strings and code in [`tokenization_transfo_xl.py`](./pytorch_pretrained_bert/tokenization_transfo_xl.py) for the details of these additional methods in `TransfoXLTokenizer`.
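
As a rough sketch of what "a vocabulary ordered by token frequency" means, the snippet below counts tokens and assigns the smallest indices to the most frequent ones. The function name `build_frequency_sorted_vocab` is hypothetical and the alphabetical tie-breaking is an assumption made only to keep the result deterministic; this is not the tokenizer's own code.

```python
from collections import Counter


def build_frequency_sorted_vocab(corpus_tokens):
    """Map each token to an index, most frequent token first.

    Ties are broken alphabetically so that the result is deterministic
    (an assumption of this sketch).
    """
    counts = Counter(corpus_tokens)
    ordered = sorted(counts, key=lambda tok: (-counts[tok], tok))
    return {tok: idx for idx, tok in enumerate(ordered)}


vocab = build_frequency_sorted_vocab("the cat saw the dog and the cat".split())
print(vocab)  # {'the': 0, 'cat': 1, 'and': 2, 'dog': 3, 'saw': 4}
```

This ordering matters for adaptive softmax because the vocabulary is split into clusters by index cutoffs, with the most frequent tokens in the first cluster.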
|
||||||
|
|
||||||
|
### Optimizers:
|
||||||
|
|
||||||
|
#### `BertAdam`
|
||||||
|
|
||||||
`BertAdam` is a `torch.optimizer` adapted to be closer to the optimizer used in the TensorFlow implementation of Bert. The differences with PyTorch Adam optimizer are the following:
|
`BertAdam` is a `torch.optimizer` adapted to be closer to the optimizer used in the TensorFlow implementation of Bert. The differences with PyTorch Adam optimizer are the following:
|
||||||
|
|
||||||
@@ -348,6 +688,13 @@ The optimizer accepts the following arguments:
|
|||||||
- `weight_decay:` Weight decay. Default : `0.01`
|
- `weight_decay:` Weight decay. Default : `0.01`
|
||||||
- `max_grad_norm` : Maximum norm for the gradients (`-1` means no clipping). Default : `1.0`
|
- `max_grad_norm` : Maximum norm for the gradients (`-1` means no clipping). Default : `1.0`
|
||||||
|
|
||||||
|
#### `OpenAIGPTAdam`
|
||||||
|
|
||||||
|
`OpenAIGPTAdam` is similar to `BertAdam`.
|
||||||
|
The difference from `BertAdam` is that `OpenAIGPTAdam` compensates for bias as in the regular Adam optimizer.
|
||||||
|
|
||||||
|
`OpenAIGPTAdam` accepts the same arguments as `BertAdam`.
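
The effect of bias compensation can be seen on a single scalar update. The sketch below is a generic Adam step with a switch for the correction. The function `adam_step` and its hyper-parameter defaults are assumptions for illustration, and weight decay, learning rate schedules and gradient clipping are left out, so it is not the code of either optimizer.

```python
import math


def adam_step(param, grad, m, v, step, lr=1e-3, b1=0.9, b2=0.999, eps=1e-8,
              bias_correction=True):
    """One Adam update for a scalar parameter; returns (param, m, v)."""
    m = b1 * m + (1 - b1) * grad
    v = b2 * v + (1 - b2) * grad * grad
    if bias_correction:
        # Regular Adam: rescale the moment estimates, which start at zero
        m_hat = m / (1 - b1 ** step)
        v_hat = v / (1 - b2 ** step)
    else:
        m_hat, v_hat = m, v
    param = param - lr * m_hat / (math.sqrt(v_hat) + eps)
    return param, m, v


# First step from param=1.0 with a gradient of 1.0
with_correction, _, _ = adam_step(1.0, 1.0, 0.0, 0.0, step=1)
without_correction, _, _ = adam_step(1.0, 1.0, 0.0, 0.0, step=1,
                                     bias_correction=False)
print(with_correction, without_correction)
```

With these settings the corrected first step moves the parameter by about `lr`, while the uncorrected step is roughly three times larger, which is why the two variants behave differently early in training.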
|
||||||
|
|
||||||
## Examples
|
## Examples
|
||||||
|
|
||||||
| Sub-section | Description |
|
| Sub-section | Description |
|
||||||
@@ -432,7 +779,8 @@ python run_classifier.py \
|
|||||||
--train_batch_size 32 \
|
--train_batch_size 32 \
|
||||||
--learning_rate 2e-5 \
|
--learning_rate 2e-5 \
|
||||||
--num_train_epochs 3.0 \
|
--num_train_epochs 3.0 \
|
||||||
--output_dir /tmp/mrpc_output/
|
--output_dir /tmp/mrpc_output/ \
|
||||||
|
--fp16
|
||||||
```
|
```
|
||||||
|
|
||||||
#### SQuAD
|
#### SQuAD
|
||||||
@@ -506,16 +854,57 @@ Training one epoch on this corpus takes about 1:20h on 4 x NVIDIA Tesla P100 wit
|
|||||||
|
|
||||||
```shell
|
```shell
|
||||||
python run_lm_finetuning.py \
|
python run_lm_finetuning.py \
|
||||||
--bert_model bert-base-cased \
|
--bert_model bert-base-uncased \
|
||||||
|
--do_lower_case \
|
||||||
--do_train \
|
--do_train \
|
||||||
--train_file samples/sample_text.txt \
|
--train_file ../samples/sample_text.txt \
|
||||||
--output_dir models \
|
--output_dir models \
|
||||||
--num_train_epochs 5.0 \
|
--num_train_epochs 5.0 \
|
||||||
--learning_rate 3e-5 \
|
--learning_rate 3e-5 \
|
||||||
--train_batch_size 32 \
|
--train_batch_size 32 \
|
||||||
--max_seq_length 128
|
--max_seq_length 128
|
||||||
```
|
```
|
||||||
|
|
||||||
|
### OpenAI GPT and Transformer-XL: running the examples
|
||||||
|
|
||||||
|
We provide two examples of scripts for OpenAI GPT and Transformer-XL based on (and extended from) the respective original implementations:
|
||||||
|
|
||||||
|
- fine-tuning OpenAI GPT on the ROCStories dataset
|
||||||
|
- evaluating Transformer-XL on Wikitext 103
|
||||||
|
|
||||||
|
#### Fine-tuning OpenAI GPT on the RocStories dataset
|
||||||
|
|
||||||
|
This example code fine-tunes OpenAI GPT on the RocStories dataset.
|
||||||
|
|
||||||
|
Before running this example you should download the
|
||||||
|
[RocStories dataset](https://github.com/snigdhac/StoryComprehension_EMNLP/tree/master/Dataset/RoCStories) and unpack it to some directory `$ROC_STORIES_DIR`.
|
||||||
|
|
||||||
|
```shell
|
||||||
|
export ROC_STORIES_DIR=/path/to/RocStories
|
||||||
|
|
||||||
|
python run_openai_gpt.py \
|
||||||
|
--model_name openai-gpt \
|
||||||
|
--do_train \
|
||||||
|
--do_eval \
|
||||||
|
--train_dataset $ROC_STORIES_DIR/cloze_test_val__spring2016\ -\ cloze_test_ALL_val.csv \
|
||||||
|
--eval_dataset $ROC_STORIES_DIR/cloze_test_test__spring2016\ -\ cloze_test_ALL_test.csv \
|
||||||
|
--output_dir ../log \
|
||||||
|
--train_batch_size 16
|
||||||
|
```
|
||||||
|
|
||||||
|
This command runs in about 10 min on a single K-80 and gives an evaluation accuracy of about 86.4% (the authors report a median accuracy with the TensorFlow code of 85.8% and the OpenAI GPT paper reports a best single run accuracy of 86.5%).
|
||||||
|
|
||||||
|
#### Evaluating the pre-trained Transformer-XL on the WikiText 103 dataset
|
||||||
|
|
||||||
|
This example code evaluates the pre-trained Transformer-XL on the WikiText 103 dataset.
|
||||||
|
This command will download a pre-processed version of the WikiText 103 dataset in which the vocabulary has been computed.
|
||||||
|
|
||||||
|
```shell
|
||||||
|
python run_transfo_xl.py --work_dir ../log
|
||||||
|
```
|
||||||
|
|
||||||
|
This command runs in about 1 min on a V100 and gives an evaluation perplexity of 18.22 on WikiText-103 (the authors report a perplexity of about 18.3 on this dataset with the TensorFlow code).
|
||||||
|
|
||||||
## Fine-tuning BERT-large on GPUs
|
## Fine-tuning BERT-large on GPUs
|
||||||
|
|
||||||
The options we list above allow to fine-tune BERT-large rather easily on GPU(s) instead of the TPU used by the original implementation.
|
The options we list above allow to fine-tune BERT-large rather easily on GPU(s) instead of the TPU used by the original implementation.
|
||||||
@@ -587,7 +976,9 @@ Please follow the instructions given in the notebooks to run and modify them.
|
|||||||
|
|
||||||
## Command-line interface
|
## Command-line interface
|
||||||
|
|
||||||
A command-line interface is provided to convert a TensorFlow checkpoint in a PyTorch dump of the `BertForPreTraining` class (see above).
|
A command-line interface is provided to convert a TensorFlow checkpoint into a PyTorch dump of the `BertForPreTraining` class (for BERT) or a NumPy checkpoint into a PyTorch dump of the `OpenAIGPTModel` class (for OpenAI GPT).
|
||||||
|
|
||||||
|
### BERT
|
||||||
|
|
||||||
You can convert any TensorFlow checkpoint for BERT (in particular [the pre-trained models released by Google](https://github.com/google-research/bert#pre-trained-models)) in a PyTorch save file by using the [`./pytorch_pretrained_bert/convert_tf_checkpoint_to_pytorch.py`](convert_tf_checkpoint_to_pytorch.py) script.
|
You can convert any TensorFlow checkpoint for BERT (in particular [the pre-trained models released by Google](https://github.com/google-research/bert#pre-trained-models)) in a PyTorch save file by using the [`./pytorch_pretrained_bert/convert_tf_checkpoint_to_pytorch.py`](convert_tf_checkpoint_to_pytorch.py) script.
|
||||||
|
|
||||||
@@ -610,6 +1001,32 @@ pytorch_pretrained_bert convert_tf_checkpoint_to_pytorch \
|
|||||||
|
|
||||||
You can download Google's pre-trained models for the conversion [here](https://github.com/google-research/bert#pre-trained-models).
|
You can download Google's pre-trained models for the conversion [here](https://github.com/google-research/bert#pre-trained-models).
|
||||||
|
|
||||||
|
### OpenAI GPT
|
||||||
|
|
||||||
|
Here is an example of the conversion process for a pre-trained OpenAI GPT model, assuming that your NumPy checkpoint is saved in the same format as the OpenAI pre-trained model (see [here](https://github.com/openai/finetune-transformer-lm)):
|
||||||
|
|
||||||
|
```shell
|
||||||
|
export OPENAI_GPT_CHECKPOINT_FOLDER_PATH=/path/to/openai/pretrained/numpy/weights
|
||||||
|
|
||||||
|
pytorch_pretrained_bert convert_openai_checkpoint \
|
||||||
|
$OPENAI_GPT_CHECKPOINT_FOLDER_PATH \
|
||||||
|
$PYTORCH_DUMP_OUTPUT \
|
||||||
|
[OPENAI_GPT_CONFIG]
|
||||||
|
```
|
||||||
|
|
||||||
|
### Transformer-XL
|
||||||
|
|
||||||
|
Here is an example of the conversion process for a pre-trained Transformer-XL model (see [here](https://github.com/kimiyoung/transformer-xl/tree/master/tf#obtain-and-evaluate-pretrained-sota-models))
|
||||||
|
|
||||||
|
```shell
|
||||||
|
export TRANSFO_XL_CHECKPOINT_FOLDER_PATH=/path/to/transfo/xl/checkpoint
|
||||||
|
|
||||||
|
pytorch_pretrained_bert convert_transfo_xl_checkpoint \
|
||||||
|
$TRANSFO_XL_CHECKPOINT_FOLDER_PATH \
|
||||||
|
$PYTORCH_DUMP_OUTPUT \
|
||||||
|
[TRANSFO_XL_CONFIG]
|
||||||
|
```
|
||||||
|
|
||||||
## TPU
|
## TPU
|
||||||
|
|
||||||
TPU support and pretraining scripts
|
TPU support and pretraining scripts
|
||||||
|
|||||||
@@ -80,10 +80,10 @@ def convert_examples_to_features(examples, seq_length, tokenizer):
|
|||||||
# The convention in BERT is:
|
# The convention in BERT is:
|
||||||
# (a) For sequence pairs:
|
# (a) For sequence pairs:
|
||||||
# tokens: [CLS] is this jack ##son ##ville ? [SEP] no it is not . [SEP]
|
# tokens: [CLS] is this jack ##son ##ville ? [SEP] no it is not . [SEP]
|
||||||
# type_ids: 0 0 0 0 0 0 0 0 1 1 1 1 1 1
|
# type_ids: 0 0 0 0 0 0 0 0 1 1 1 1 1 1
|
||||||
# (b) For single sequences:
|
# (b) For single sequences:
|
||||||
# tokens: [CLS] the dog is hairy . [SEP]
|
# tokens: [CLS] the dog is hairy . [SEP]
|
||||||
# type_ids: 0 0 0 0 0 0 0
|
# type_ids: 0 0 0 0 0 0 0
|
||||||
#
|
#
|
||||||
# Where "type_ids" are used to indicate whether this is the first
|
# Where "type_ids" are used to indicate whether this is the first
|
||||||
# sequence or the second sequence. The embedding vectors for `type=0` and
|
# sequence or the second sequence. The embedding vectors for `type=0` and
|
||||||
|
|||||||
@@ -15,26 +15,26 @@
|
|||||||
# limitations under the License.
|
# limitations under the License.
|
||||||
"""BERT finetuning runner."""
|
"""BERT finetuning runner."""
|
||||||
|
|
||||||
from __future__ import absolute_import
|
from __future__ import absolute_import, division, print_function
|
||||||
from __future__ import division
|
|
||||||
from __future__ import print_function
|
|
||||||
|
|
||||||
import csv
|
|
||||||
import os
|
|
||||||
import logging
|
|
||||||
import argparse
|
import argparse
|
||||||
|
import csv
|
||||||
|
import logging
|
||||||
|
import os
|
||||||
import random
|
import random
|
||||||
from tqdm import tqdm, trange
|
import sys
|
||||||
|
|
||||||
import numpy as np
|
import numpy as np
|
||||||
import torch
|
import torch
|
||||||
from torch.utils.data import TensorDataset, DataLoader, RandomSampler, SequentialSampler
|
from torch.utils.data import (DataLoader, RandomSampler, SequentialSampler,
|
||||||
|
TensorDataset)
|
||||||
from torch.utils.data.distributed import DistributedSampler
|
from torch.utils.data.distributed import DistributedSampler
|
||||||
|
from tqdm import tqdm, trange
|
||||||
|
|
||||||
from pytorch_pretrained_bert.tokenization import BertTokenizer
|
|
||||||
from pytorch_pretrained_bert.modeling import BertForSequenceClassification
|
|
||||||
from pytorch_pretrained_bert.optimization import BertAdam, warmup_linear
|
|
||||||
from pytorch_pretrained_bert.file_utils import PYTORCH_PRETRAINED_BERT_CACHE
|
from pytorch_pretrained_bert.file_utils import PYTORCH_PRETRAINED_BERT_CACHE
|
||||||
|
from pytorch_pretrained_bert.modeling import BertForSequenceClassification, BertConfig, WEIGHTS_NAME, CONFIG_NAME
|
||||||
|
from pytorch_pretrained_bert.tokenization import BertTokenizer
|
||||||
|
from pytorch_pretrained_bert.optimization import BertAdam, warmup_linear
|
||||||
|
|
||||||
logging.basicConfig(format = '%(asctime)s - %(levelname)s - %(name)s - %(message)s',
|
logging.basicConfig(format = '%(asctime)s - %(levelname)s - %(name)s - %(message)s',
|
||||||
datefmt = '%m/%d/%Y %H:%M:%S',
|
datefmt = '%m/%d/%Y %H:%M:%S',
|
||||||
@@ -91,10 +91,12 @@ class DataProcessor(object):
|
|||||||
@classmethod
|
@classmethod
|
||||||
def _read_tsv(cls, input_file, quotechar=None):
|
def _read_tsv(cls, input_file, quotechar=None):
|
||||||
"""Reads a tab separated value file."""
|
"""Reads a tab separated value file."""
|
||||||
with open(input_file, "r", encoding='utf-8') as f:
|
with open(input_file, "r") as f:
|
||||||
reader = csv.reader(f, delimiter="\t", quotechar=quotechar)
|
reader = csv.reader(f, delimiter="\t", quotechar=quotechar)
|
||||||
lines = []
|
lines = []
|
||||||
for line in reader:
|
for line in reader:
|
||||||
|
if sys.version_info[0] == 2:
|
||||||
|
line = list(unicode(cell, 'utf-8') for cell in line)
|
||||||
lines.append(line)
|
lines.append(line)
|
||||||
return lines
|
return lines
|
||||||
|
|
||||||
@@ -321,6 +323,10 @@ def main():
|
|||||||
help="The output directory where the model predictions and checkpoints will be written.")
|
help="The output directory where the model predictions and checkpoints will be written.")
|
||||||
|
|
||||||
## Other parameters
|
## Other parameters
|
||||||
|
parser.add_argument("--cache_dir",
|
||||||
|
default="",
|
||||||
|
type=str,
|
||||||
|
help="Where do you want to store the pre-trained models downloaded from s3")
|
||||||
parser.add_argument("--max_seq_length",
|
parser.add_argument("--max_seq_length",
|
||||||
default=128,
|
default=128,
|
||||||
type=int,
|
type=int,
|
||||||
@@ -380,9 +386,17 @@ def main():
|
|||||||
help="Loss scaling to improve fp16 numeric stability. Only used when fp16 set to True.\n"
|
help="Loss scaling to improve fp16 numeric stability. Only used when fp16 set to True.\n"
|
||||||
"0 (default value): dynamic loss scaling.\n"
|
"0 (default value): dynamic loss scaling.\n"
|
||||||
"Positive power of 2: static loss scaling value.\n")
|
"Positive power of 2: static loss scaling value.\n")
|
||||||
|
parser.add_argument('--server_ip', type=str, default='', help="Can be used for distant debugging.")
|
||||||
|
parser.add_argument('--server_port', type=str, default='', help="Can be used for distant debugging.")
|
||||||
args = parser.parse_args()
|
args = parser.parse_args()
|
||||||
|
|
||||||
|
if args.server_ip and args.server_port:
|
||||||
|
# Distant debugging - see https://code.visualstudio.com/docs/python/debugging#_attach-to-a-local-script
|
||||||
|
import ptvsd
|
||||||
|
print("Waiting for debugger attach")
|
||||||
|
ptvsd.enable_attach(address=(args.server_ip, args.server_port), redirect_output=True)
|
||||||
|
ptvsd.wait_for_attach()
|
||||||
|
|
||||||
processors = {
|
processors = {
|
||||||
"cola": ColaProcessor,
|
"cola": ColaProcessor,
|
||||||
"mnli": MnliProcessor,
|
"mnli": MnliProcessor,
|
||||||
@@ -424,7 +438,8 @@ def main():
|
|||||||
|
|
||||||
if os.path.exists(args.output_dir) and os.listdir(args.output_dir) and args.do_train:
|
if os.path.exists(args.output_dir) and os.listdir(args.output_dir) and args.do_train:
|
||||||
raise ValueError("Output directory ({}) already exists and is not empty.".format(args.output_dir))
|
raise ValueError("Output directory ({}) already exists and is not empty.".format(args.output_dir))
|
||||||
os.makedirs(args.output_dir, exist_ok=True)
|
if not os.path.exists(args.output_dir):
|
||||||
|
os.makedirs(args.output_dir)
|
||||||
|
|
||||||
task_name = args.task_name.lower()
|
task_name = args.task_name.lower()
|
||||||
|
|
||||||
@@ -447,8 +462,9 @@ def main():
|
|||||||
num_train_optimization_steps = num_train_optimization_steps // torch.distributed.get_world_size()
|
num_train_optimization_steps = num_train_optimization_steps // torch.distributed.get_world_size()
|
||||||
|
|
||||||
# Prepare model
|
# Prepare model
|
||||||
|
cache_dir = args.cache_dir if args.cache_dir else os.path.join(PYTORCH_PRETRAINED_BERT_CACHE, 'distributed_{}'.format(args.local_rank))
|
||||||
model = BertForSequenceClassification.from_pretrained(args.bert_model,
|
model = BertForSequenceClassification.from_pretrained(args.bert_model,
|
||||||
cache_dir=PYTORCH_PRETRAINED_BERT_CACHE / 'distributed_{}'.format(args.local_rank),
|
cache_dir=cache_dir,
|
||||||
num_labels = num_labels)
|
num_labels = num_labels)
|
||||||
if args.fp16:
|
if args.fp16:
|
||||||
model.half()
|
model.half()
|
||||||
@@ -545,15 +561,21 @@ def main():
|
|||||||
optimizer.zero_grad()
|
optimizer.zero_grad()
|
||||||
global_step += 1
|
global_step += 1
|
||||||
|
|
||||||
# Save a trained model
|
|
||||||
model_to_save = model.module if hasattr(model, 'module') else model # Only save the model it-self
|
|
||||||
output_model_file = os.path.join(args.output_dir, "pytorch_model.bin")
|
|
||||||
if args.do_train:
|
if args.do_train:
|
||||||
|
# Save a trained model and the associated configuration
|
||||||
|
model_to_save = model.module if hasattr(model, 'module') else model # Only save the model itself
|
||||||
|
output_model_file = os.path.join(args.output_dir, WEIGHTS_NAME)
|
||||||
torch.save(model_to_save.state_dict(), output_model_file)
|
torch.save(model_to_save.state_dict(), output_model_file)
|
||||||
|
output_config_file = os.path.join(args.output_dir, CONFIG_NAME)
|
||||||
|
with open(output_config_file, 'w') as f:
|
||||||
|
f.write(model_to_save.config.to_json_string())
|
||||||
|
|
||||||
# Load a trained model that you have fine-tuned
|
# Load a trained model and config that you have fine-tuned
|
||||||
model_state_dict = torch.load(output_model_file)
|
config = BertConfig(output_config_file)
|
||||||
model = BertForSequenceClassification.from_pretrained(args.bert_model, state_dict=model_state_dict, num_labels=num_labels)
|
model = BertForSequenceClassification(config, num_labels=num_labels)
|
||||||
|
model.load_state_dict(torch.load(output_model_file))
|
||||||
|
else:
|
||||||
|
model = BertForSequenceClassification.from_pretrained(args.bert_model, num_labels=num_labels)
|
||||||
model.to(device)
|
model.to(device)
|
||||||
|
|
||||||
if args.do_eval and (args.local_rank == -1 or torch.distributed.get_rank() == 0):
|
if args.do_eval and (args.local_rank == -1 or torch.distributed.get_rank() == 0):
|
||||||
|
|||||||
@@ -15,22 +15,22 @@
|
|||||||
# limitations under the License.
|
# limitations under the License.
|
||||||
"""BERT finetuning runner."""
|
"""BERT finetuning runner."""
|
||||||
|
|
||||||
from __future__ import absolute_import
|
from __future__ import absolute_import, division, print_function, unicode_literals
|
||||||
from __future__ import division
|
|
||||||
from __future__ import print_function
|
|
||||||
|
|
||||||
import os
|
|
||||||
import logging
|
|
||||||
import argparse
|
import argparse
|
||||||
from tqdm import tqdm, trange
|
import logging
|
||||||
|
import os
|
||||||
|
import random
|
||||||
|
from io import open
|
||||||
|
|
||||||
import numpy as np
|
import numpy as np
|
||||||
import torch
|
import torch
|
||||||
from torch.utils.data import DataLoader, RandomSampler
|
from torch.utils.data import DataLoader, Dataset, RandomSampler
|
||||||
from torch.utils.data.distributed import DistributedSampler
|
from torch.utils.data.distributed import DistributedSampler
|
||||||
|
from tqdm import tqdm, trange
|
||||||
|
|
||||||
from pytorch_pretrained_bert.tokenization import BertTokenizer
|
|
||||||
from pytorch_pretrained_bert.modeling import BertForPreTraining
|
from pytorch_pretrained_bert.modeling import BertForPreTraining
|
||||||
|
from pytorch_pretrained_bert.tokenization import BertTokenizer
|
||||||
from pytorch_pretrained_bert.optimization import BertAdam, warmup_linear
|
from pytorch_pretrained_bert.optimization import BertAdam, warmup_linear
|
||||||
|
|
||||||
from torch.utils.data import Dataset
|
from torch.utils.data import Dataset
|
||||||
@@ -179,16 +179,16 @@ class BERTDataset(Dataset):
|
|||||||
if self.line_buffer is None:
|
if self.line_buffer is None:
|
||||||
# read first non-empty line of file
|
# read first non-empty line of file
|
||||||
while t1 == "" :
|
while t1 == "" :
|
||||||
t1 = self.file.__next__().strip()
|
t1 = next(self.file).strip()
|
||||||
t2 = self.file.__next__().strip()
|
t2 = next(self.file).strip()
|
||||||
else:
|
else:
|
||||||
# use t2 from previous iteration as new t1
|
# use t2 from previous iteration as new t1
|
||||||
t1 = self.line_buffer
|
t1 = self.line_buffer
|
||||||
t2 = self.file.__next__().strip()
|
t2 = next(self.file).strip()
|
||||||
# skip empty rows that are used for separating documents and keep track of current doc id
|
# skip empty rows that are used for separating documents and keep track of current doc id
|
||||||
while t2 == "" or t1 == "":
|
while t2 == "" or t1 == "":
|
||||||
t1 = self.file.__next__().strip()
|
t1 = next(self.file).strip()
|
||||||
t2 = self.file.__next__().strip()
|
t2 = next(self.file).strip()
|
||||||
self.current_doc = self.current_doc+1
|
self.current_doc = self.current_doc+1
|
||||||
self.line_buffer = t2
|
self.line_buffer = t2
|
||||||
|
|
||||||
@@ -222,15 +222,15 @@ class BERTDataset(Dataset):
|
|||||||
def get_next_line(self):
|
def get_next_line(self):
|
||||||
""" Gets next line of random_file and starts over when reaching end of file"""
|
""" Gets next line of random_file and starts over when reaching end of file"""
|
||||||
try:
|
try:
|
||||||
line = self.random_file.__next__().strip()
|
line = next(self.random_file).strip()
|
||||||
#keep track of which document we are currently looking at to later avoid having the same doc as t1
|
#keep track of which document we are currently looking at to later avoid having the same doc as t1
|
||||||
if line == "":
|
if line == "":
|
||||||
self.current_random_doc = self.current_random_doc + 1
|
self.current_random_doc = self.current_random_doc + 1
|
||||||
line = self.random_file.__next__().strip()
|
line = next(self.random_file).strip()
|
||||||
except StopIteration:
|
except StopIteration:
|
||||||
self.random_file.close()
|
self.random_file.close()
|
||||||
self.random_file = open(self.corpus_path, "r", encoding=self.encoding)
|
self.random_file = open(self.corpus_path, "r", encoding=self.encoding)
|
||||||
line = self.random_file.__next__().strip()
|
line = next(self.random_file).strip()
|
||||||
return line
|
return line
|
||||||
|
|
||||||
|
|
||||||
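The two hunks above replace `file.__next__()` with the builtin `next(file)`. On Python 2 iterators expose `.next()` rather than `.__next__()`, so the builtin is the spelling that works on both versions. A small sketch of the same pattern on an in-memory file follows; `first_non_empty_line` is a hypothetical name used only for illustration.

```python
import io


def first_non_empty_line(handle):
    """Return the first non-blank line of `handle`, stripped of whitespace."""
    line = ""
    while line == "":
        # next() dispatches to .next() on Python 2 and .__next__() on Python 3;
        # it raises StopIteration when the file is exhausted.
        line = next(handle).strip()
    return line


if __name__ == "__main__":
    corpus = io.StringIO(u"\n\nfirst sentence\nsecond sentence\n")
    print(first_non_empty_line(corpus))
```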
@@ -419,6 +419,7 @@ def main():
|
|||||||
help="The output directory where the model checkpoints will be written.")
|
help="The output directory where the model checkpoints will be written.")
|
||||||
|
|
||||||
## Other parameters
|
## Other parameters
|
||||||
|
parser.add_argument("--do_lower_case", action='store_true', help="Set this flag if you are using an uncased model.")
|
||||||
parser.add_argument("--max_seq_length",
|
parser.add_argument("--max_seq_length",
|
||||||
default=128,
|
default=128,
|
||||||
type=int,
|
type=int,
|
||||||
@@ -506,7 +507,8 @@ def main():
|
|||||||
|
|
||||||
if os.path.exists(args.output_dir) and os.listdir(args.output_dir):
|
if os.path.exists(args.output_dir) and os.listdir(args.output_dir):
|
||||||
raise ValueError("Output directory ({}) already exists and is not empty.".format(args.output_dir))
|
raise ValueError("Output directory ({}) already exists and is not empty.".format(args.output_dir))
|
||||||
os.makedirs(args.output_dir, exist_ok=True)
|
if not os.path.exists(args.output_dir):
|
||||||
|
os.makedirs(args.output_dir)
|
||||||
|
|
||||||
tokenizer = BertTokenizer.from_pretrained(args.bert_model, do_lower_case=args.do_lower_case)
|
tokenizer = BertTokenizer.from_pretrained(args.bert_model, do_lower_case=args.do_lower_case)
|
||||||
|
|
||||||
@@ -575,7 +577,7 @@ def main():
|
|||||||
if args.local_rank == -1:
|
if args.local_rank == -1:
|
||||||
train_sampler = RandomSampler(train_dataset)
|
train_sampler = RandomSampler(train_dataset)
|
||||||
else:
|
else:
|
||||||
#TODO: check if this works with current data generator from disk that relies on file.__next__
|
#TODO: check if this works with current data generator from disk that relies on next(file)
|
||||||
# (it doesn't return item back by index)
|
# (it doesn't return item back by index)
|
||||||
train_sampler = DistributedSampler(train_dataset)
|
train_sampler = DistributedSampler(train_dataset)
|
||||||
train_dataloader = DataLoader(train_dataset, sampler=train_sampler, batch_size=args.train_batch_size)
|
train_dataloader = DataLoader(train_dataset, sampler=train_sampler, batch_size=args.train_batch_size)
|
||||||
@@ -641,4 +643,4 @@ def accuracy(out, labels):
|
|||||||
|
|
||||||
|
|
||||||
if __name__ == "__main__":
|
if __name__ == "__main__":
|
||||||
main()
|
main()
|
||||||
|
|||||||
259
examples/run_openai_gpt.py
Normal file
@@ -0,0 +1,259 @@
|
|||||||
|
# coding=utf-8
|
||||||
|
# Copyright 2018 Google AI, Google Brain and Carnegie Mellon University Authors and the HuggingFace Inc. team.
|
||||||
|
# Copyright (c) 2018, NVIDIA CORPORATION. All rights reserved.
|
||||||
|
#
|
||||||
|
# Licensed under the Apache License, Version 2.0 (the "License");
|
||||||
|
# you may not use this file except in compliance with the License.
|
||||||
|
# You may obtain a copy of the License at
|
||||||
|
#
|
||||||
|
# http://www.apache.org/licenses/LICENSE-2.0
|
||||||
|
#
|
||||||
|
# Unless required by applicable law or agreed to in writing, software
|
||||||
|
# distributed under the License is distributed on an "AS IS" BASIS,
|
||||||
|
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
|
||||||
|
# See the License for the specific language governing permissions and
|
||||||
|
# limitations under the License.
|
||||||
|
""" OpenAI GPT model fine-tuning script.
|
||||||
|
Adapted from https://github.com/huggingface/pytorch-openai-transformer-lm/blob/master/train.py
|
||||||
|
Itself adapted from https://github.com/openai/finetune-transformer-lm/blob/master/train.py
|
||||||
|
|
||||||
|
With default values, this script fine-tunes and evaluates a pretrained OpenAI GPT on the RocStories dataset
|
||||||
|
"""
|
||||||
|
import argparse
|
||||||
|
import os
|
||||||
|
import csv
|
||||||
|
import random
|
||||||
|
import logging
|
||||||
|
from tqdm import tqdm, trange
|
||||||
|
|
||||||
|
import numpy as np
|
||||||
|
import torch
|
||||||
|
from torch.utils.data import (DataLoader, RandomSampler, SequentialSampler,
|
||||||
|
TensorDataset)
|
||||||
|
|
||||||
|
from pytorch_pretrained_bert import OpenAIGPTDoubleHeadsModel, OpenAIGPTTokenizer, OpenAIAdam, cached_path
|
||||||
|
|
||||||
|
ROCSTORIES_URL = "https://s3.amazonaws.com/datasets.huggingface.co/ROCStories.tar.gz"
|
||||||
|
|
||||||
|
logging.basicConfig(format = '%(asctime)s - %(levelname)s - %(name)s - %(message)s',
|
||||||
|
datefmt = '%m/%d/%Y %H:%M:%S',
|
||||||
|
level = logging.INFO)
|
||||||
|
logger = logging.getLogger(__name__)
|
||||||
|
|
||||||
|
def accuracy(out, labels):
|
||||||
|
outputs = np.argmax(out, axis=1)
|
||||||
|
return np.sum(outputs == labels)
|
||||||
|
|
||||||
|
def load_rocstories_dataset(dataset_path):
|
||||||
|
""" Output a list of tuples(story, 1st continuation, 2nd continuation, label) """
|
||||||
|
with open(dataset_path, encoding='utf_8') as f:
|
||||||
|
f = csv.reader(f)
|
||||||
|
output = []
|
||||||
|
next(f) # skip the first line
|
||||||
|
for line in tqdm(f):
|
||||||
|
output.append((' '.join(line[1:5]), line[5], line[6], int(line[-1])-1))
|
||||||
|
return output
|
||||||
|
|
||||||
|
def pre_process_datasets(encoded_datasets, input_len, cap_length, start_token, delimiter_token, clf_token):
|
||||||
|
""" Pre-process datasets containing lists of tuples(story, 1st continuation, 2nd continuation, label)
|
||||||
|
|
||||||
|
To Transformer inputs of shape (n_batch, n_alternative, length) comprising for each batch, continuation:
|
||||||
|
input_ids[batch, alternative, :] = [start_token] + story[:cap_length] + [delimiter_token] + cont1[:cap_length] + [clf_token]
|
||||||
|
"""
|
||||||
|
tensor_datasets = []
|
||||||
|
for dataset in encoded_datasets:
|
||||||
|
n_batch = len(dataset)
|
||||||
|
input_ids = np.zeros((n_batch, 2, input_len), dtype=np.int64)
|
||||||
|
mc_token_ids = np.zeros((n_batch, 2), dtype=np.int64)
|
||||||
|
lm_labels = np.full((n_batch, 2, input_len), fill_value=-1, dtype=np.int64)
|
||||||
|
mc_labels = np.zeros((n_batch,), dtype=np.int64)
|
||||||
|
for i, (story, cont1, cont2, mc_label), in enumerate(dataset):
|
||||||
|
with_cont1 = [start_token] + story[:cap_length] + [delimiter_token] + cont1[:cap_length] + [clf_token]
|
||||||
|
with_cont2 = [start_token] + story[:cap_length] + [delimiter_token] + cont2[:cap_length] + [clf_token]
|
||||||
|
input_ids[i, 0, :len(with_cont1)] = with_cont1
|
||||||
|
input_ids[i, 1, :len(with_cont2)] = with_cont2
|
||||||
|
mc_token_ids[i, 0] = len(with_cont1) - 1
|
||||||
|
mc_token_ids[i, 1] = len(with_cont2) - 1
|
||||||
|
lm_labels[i, 0, :len(with_cont1)-1] = with_cont1[1:]
|
||||||
|
lm_labels[i, 1, :len(with_cont2)-1] = with_cont2[1:]
|
||||||
|
mc_labels[i] = mc_label
|
||||||
|
all_inputs = (input_ids, mc_token_ids, lm_labels, mc_labels)
|
||||||
|
tensor_datasets.append(tuple(torch.tensor(t) for t in all_inputs))
|
||||||
|
return tensor_datasets
|
||||||
|
|
||||||
|
def main():
|
||||||
|
parser = argparse.ArgumentParser()
|
||||||
|
parser.add_argument('--model_name', type=str, default='openai-gpt',
|
||||||
|
help='pretrained model name')
|
||||||
|
parser.add_argument("--do_train", action='store_true', help="Whether to run training.")
|
||||||
|
parser.add_argument("--do_eval", action='store_true', help="Whether to run eval on the dev set.")
|
||||||
|
parser.add_argument("--output_dir", default=None, type=str, required=True,
|
||||||
|
help="The output directory where the model predictions and checkpoints will be written.")
|
||||||
|
parser.add_argument('--train_dataset', type=str, default='')
|
||||||
|
parser.add_argument('--eval_dataset', type=str, default='')
|
||||||
|
parser.add_argument('--seed', type=int, default=42)
|
||||||
|
parser.add_argument('--num_train_epochs', type=int, default=3)
|
||||||
|
parser.add_argument('--train_batch_size', type=int, default=8)
|
||||||
|
parser.add_argument('--eval_batch_size', type=int, default=16)
|
||||||
|
parser.add_argument('--max_grad_norm', type=int, default=1)
|
||||||
|
parser.add_argument('--learning_rate', type=float, default=6.25e-5)
|
||||||
|
parser.add_argument('--warmup_proportion', type=float, default=0.002)
|
||||||
|
parser.add_argument('--lr_schedule', type=str, default='warmup_linear')
|
||||||
|
parser.add_argument('--weight_decay', type=float, default=0.01)
|
||||||
|
parser.add_argument('--lm_coef', type=float, default=0.9)
|
||||||
|
parser.add_argument('--n_valid', type=int, default=374)
|
||||||
|
|
||||||
|
parser.add_argument('--server_ip', type=str, default='', help="Can be used for distant debugging.")
|
||||||
|
parser.add_argument('--server_port', type=str, default='', help="Can be used for distant debugging.")
|
||||||
|
args = parser.parse_args()
|
||||||
|
print(args)
|
||||||
|
|
||||||
|
if args.server_ip and args.server_port:
|
||||||
|
# Distant debugging - see https://code.visualstudio.com/docs/python/debugging#_attach-to-a-local-script
|
||||||
|
import ptvsd
|
||||||
|
print("Waiting for debugger attach")
|
||||||
|
ptvsd.enable_attach(address=(args.server_ip, args.server_port), redirect_output=True)
|
||||||
|
ptvsd.wait_for_attach()
|
||||||
|
|
||||||
|
random.seed(args.seed)
|
||||||
|
np.random.seed(args.seed)
|
||||||
|
torch.manual_seed(args.seed)
|
||||||
|
torch.cuda.manual_seed_all(args.seed)
|
||||||
|
|
||||||
|
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
|
||||||
|
n_gpu = torch.cuda.device_count()
|
||||||
|
logger.info("device: {}, n_gpu {}".format(device, n_gpu))
|
||||||
|
|
||||||
|
if not args.do_train and not args.do_eval:
|
||||||
|
raise ValueError("At least one of `do_train` or `do_eval` must be True.")
|
||||||
|
|
||||||
|
if not os.path.exists(args.output_dir):
|
||||||
|
os.makedirs(args.output_dir)
|
||||||
|
|
||||||
|
# Load tokenizer and model
|
||||||
|
# These loading functions also add new tokens and embeddings called `special tokens`
|
||||||
|
# These new embeddings will be fine-tuned on the RocStories dataset
|
||||||
|
special_tokens = ['_start_', '_delimiter_', '_classify_']
|
||||||
|
tokenizer = OpenAIGPTTokenizer.from_pretrained(args.model_name, special_tokens=special_tokens)
|
||||||
|
special_tokens_ids = list(tokenizer.convert_tokens_to_ids(token) for token in special_tokens)
|
||||||
|
model = OpenAIGPTDoubleHeadsModel.from_pretrained(args.model_name, num_special_tokens=len(special_tokens))
|
||||||
|
model.to(device)
|
||||||
|
|
||||||
|
# Load and encode the datasets
|
||||||
|
if not args.train_dataset and not args.eval_dataset:
|
||||||
|
roc_stories = cached_path(ROCSTORIES_URL)
|
||||||
|
def tokenize_and_encode(obj):
|
||||||
|
""" Tokenize and encode a nested object """
|
||||||
|
if isinstance(obj, str):
|
||||||
|
return tokenizer.convert_tokens_to_ids(tokenizer.tokenize(obj))
|
||||||
|
elif isinstance(obj, int):
|
||||||
|
return obj
|
||||||
|
return list(tokenize_and_encode(o) for o in obj)
|
||||||
|
logger.info("Encoding dataset...")
|
||||||
|
train_dataset = load_rocstories_dataset(args.train_dataset)
|
||||||
|
eval_dataset = load_rocstories_dataset(args.eval_dataset)
|
||||||
|
datasets = (train_dataset, eval_dataset)
|
||||||
|
encoded_datasets = tokenize_and_encode(datasets)
|
||||||
|
|
||||||
|
# Compute the max input length for the Transformer
|
||||||
|
max_length = model.config.n_positions // 2 - 2
|
||||||
|
input_length = max(len(story[:max_length]) + max(len(cont1[:max_length]), len(cont2[:max_length])) + 3 \
|
||||||
|
for dataset in encoded_datasets for story, cont1, cont2, _ in dataset)
|
||||||
|
input_length = min(input_length, model.config.n_positions) # Max size of input for the pre-trained model
|
||||||
|
|
||||||
|
# Prepare input tensors and dataloaders
|
||||||
|
tensor_datasets = pre_process_datasets(encoded_datasets, input_length, max_length, *special_tokens_ids)
|
||||||
|
train_tensor_dataset, eval_tensor_dataset = tensor_datasets[0], tensor_datasets[1]
|
||||||
|
|
||||||
|
train_data = TensorDataset(*train_tensor_dataset)
|
||||||
|
train_sampler = RandomSampler(train_data)
|
||||||
|
train_dataloader = DataLoader(train_data, sampler=train_sampler, batch_size=args.train_batch_size)
|
||||||
|
|
||||||
|
eval_data = TensorDataset(*eval_tensor_dataset)
|
||||||
|
eval_sampler = SequentialSampler(eval_data)
|
||||||
|
eval_dataloader = DataLoader(eval_data, sampler=eval_sampler, batch_size=args.eval_batch_size)
|
||||||
|
|
||||||
|
# Prepare optimizer
|
||||||
|
param_optimizer = list(model.named_parameters())
|
||||||
|
no_decay = ['bias', 'LayerNorm.bias', 'LayerNorm.weight']
|
||||||
|
optimizer_grouped_parameters = [
|
||||||
|
{'params': [p for n, p in param_optimizer if not any(nd in n for nd in no_decay)], 'weight_decay': 0.01},
|
||||||
|
{'params': [p for n, p in param_optimizer if any(nd in n for nd in no_decay)], 'weight_decay': 0.0}
|
||||||
|
]
|
||||||
|
num_train_optimization_steps = len(train_data) * args.num_train_epochs // args.train_batch_size
|
||||||
|
optimizer = OpenAIAdam(optimizer_grouped_parameters,
|
||||||
|
lr=args.learning_rate,
|
||||||
|
warmup=args.warmup_proportion,
|
||||||
|
max_grad_norm=args.max_grad_norm,
|
||||||
|
weight_decay=args.weight_decay,
|
||||||
|
t_total=num_train_optimization_steps)
|
||||||
|
|
||||||
|
if args.do_train:
|
||||||
|
nb_tr_steps, tr_loss, exp_average_loss = 0, 0, None
|
||||||
|
model.train()
|
||||||
|
for _ in trange(int(args.num_train_epochs), desc="Epoch"):
|
||||||
|
tr_loss = 0
|
||||||
|
nb_tr_steps = 0
|
||||||
|
tqdm_bar = tqdm(train_dataloader, desc="Training")
|
||||||
|
for step, batch in enumerate(tqdm_bar):
|
||||||
|
batch = tuple(t.to(device) for t in batch)
|
||||||
|
input_ids, mc_token_ids, lm_labels, mc_labels = batch
|
||||||
|
losses = model(input_ids, mc_token_ids, lm_labels, mc_labels)
|
||||||
|
loss = args.lm_coef * losses[0] + losses[1]
|
||||||
|
loss.backward()
|
||||||
|
optimizer.step()
|
||||||
|
tr_loss += loss.item()
|
||||||
|
exp_average_loss = loss.item() if exp_average_loss is None else 0.7*exp_average_loss+0.3*loss.item()
|
||||||
|
nb_tr_steps += 1
|
||||||
|
tqdm_bar.desc = "Training loss: {:.2e} lr: {:.2e}".format(exp_average_loss, optimizer.get_lr()[0])
|
||||||
|
|
||||||
|
# Save a trained model
|
||||||
|
if args.do_train:
|
||||||
|
model_to_save = model.module if hasattr(model, 'module') else model # Only save the model itself
|
||||||
|
output_model_file = os.path.join(args.output_dir, "pytorch_model.bin")
|
||||||
|
config = model.config
|
||||||
|
torch.save(model_to_save.state_dict(), output_model_file)
|
||||||
|
|
||||||
|
# Load a trained model that you have fine-tuned
|
||||||
|
model_state_dict = torch.load(output_model_file)
|
||||||
|
model = OpenAIGPTDoubleHeadsModel(config)
|
||||||
|
model.load_state_dict(model_state_dict)
|
||||||
|
model.to(device)
|
||||||
|
|
||||||
|
if args.do_eval:
|
||||||
|
model.eval()
|
||||||
|
eval_loss, eval_accuracy = 0, 0
|
||||||
|
nb_eval_steps, nb_eval_examples = 0, 0
|
||||||
|
for batch in tqdm(eval_dataloader, desc="Evaluating"):
|
||||||
|
batch = tuple(t.to(device) for t in batch)
|
||||||
|
input_ids, mc_token_ids, lm_labels, mc_labels = batch
|
||||||
|
with torch.no_grad():
|
||||||
|
_, mc_loss = model(input_ids, mc_token_ids, lm_labels, mc_labels)
|
||||||
|
_, mc_logits = model(input_ids, mc_token_ids)
|
||||||
|
|
||||||
|
mc_logits = mc_logits.detach().cpu().numpy()
|
||||||
|
mc_labels = mc_labels.to('cpu').numpy()
|
||||||
|
tmp_eval_accuracy = accuracy(mc_logits, mc_labels)
|
||||||
|
|
||||||
|
eval_loss += mc_loss.mean().item()
|
||||||
|
eval_accuracy += tmp_eval_accuracy
|
||||||
|
|
||||||
|
nb_eval_examples += input_ids.size(0)
|
||||||
|
nb_eval_steps += 1
|
||||||
|
|
||||||
|
eval_loss = eval_loss / nb_eval_steps
|
||||||
|
eval_accuracy = eval_accuracy / nb_eval_examples
|
||||||
|
train_loss = tr_loss/nb_tr_steps if args.do_train else None
|
||||||
|
result = {'eval_loss': eval_loss,
|
||||||
|
'eval_accuracy': eval_accuracy,
|
||||||
|
'train_loss': train_loss}
|
||||||
|
|
||||||
|
output_eval_file = os.path.join(args.output_dir, "eval_results.txt")
|
||||||
|
with open(output_eval_file, "w") as writer:
|
||||||
|
logger.info("***** Eval results *****")
|
||||||
|
for key in sorted(result.keys()):
|
||||||
|
logger.info(" %s = %s", key, str(result[key]))
|
||||||
|
writer.write("%s = %s\n" % (key, str(result[key])))
|
||||||
|
|
||||||
|
if __name__ == '__main__':
|
||||||
|
main()
|
||||||
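The `pre_process_datasets` docstring in the file above describes how each (story, continuation) pair is laid out as `[start_token] + story[:cap_length] + [delimiter_token] + cont[:cap_length] + [clf_token]`. The sketch below reproduces that layout for a single alternative using plain Python lists instead of NumPy arrays, so the padding and label shifting are easy to inspect. `build_choice_inputs` and the toy token ids are assumptions made for illustration, not part of the script.

```python
def build_choice_inputs(story, cont, start_token, delimiter_token, clf_token,
                        cap_length, input_len):
    """Lay out one (story, continuation) alternative as the script's docstring describes."""
    seq = [start_token] + story[:cap_length] + [delimiter_token] + cont[:cap_length] + [clf_token]
    # input_ids are padded with 0, matching the np.zeros initialisation in the script.
    input_ids = seq + [0] * (input_len - len(seq))
    # lm_labels hold the next token at each position; unused positions stay at -1.
    lm_labels = seq[1:] + [-1] * (input_len - len(seq) + 1)
    # The classification head reads the hidden state at the clf_token position.
    mc_token_id = len(seq) - 1
    return input_ids, lm_labels, mc_token_id


if __name__ == "__main__":
    ids, labels, clf_pos = build_choice_inputs(
        story=[10, 11, 12], cont=[20, 21],
        start_token=1, delimiter_token=2, clf_token=3,
        cap_length=2, input_len=8)
    print(ids)
    print(labels)
    print(clf_pos)
```

In the script itself this is done for both continuations of every example, giving tensors of shape `(n_batch, 2, input_len)`.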
@@ -15,29 +15,36 @@
|
|||||||
# limitations under the License.
|
# limitations under the License.
|
||||||
"""Run BERT on SQuAD."""
|
"""Run BERT on SQuAD."""
|
||||||
|
|
||||||
from __future__ import absolute_import
|
from __future__ import absolute_import, division, print_function
|
||||||
from __future__ import division
|
|
||||||
from __future__ import print_function
|
|
||||||
|
|
||||||
import argparse
|
import argparse
|
||||||
import collections
|
import collections
|
||||||
import logging
|
|
||||||
import json
|
import json
|
||||||
|
import logging
|
||||||
import math
|
import math
|
||||||
import os
|
import os
|
||||||
import random
|
import random
|
||||||
import pickle
|
import sys
|
||||||
from tqdm import tqdm, trange
|
from io import open
|
||||||
|
|
||||||
import numpy as np
|
import numpy as np
|
||||||
import torch
|
import torch
|
||||||
from torch.utils.data import TensorDataset, DataLoader, RandomSampler, SequentialSampler
|
from torch.utils.data import (DataLoader, RandomSampler, SequentialSampler,
|
||||||
|
TensorDataset)
|
||||||
from torch.utils.data.distributed import DistributedSampler
|
from torch.utils.data.distributed import DistributedSampler
|
||||||
|
from tqdm import tqdm, trange
|
||||||
|
|
||||||
from pytorch_pretrained_bert.tokenization import whitespace_tokenize, BasicTokenizer, BertTokenizer
|
|
||||||
from pytorch_pretrained_bert.modeling import BertForQuestionAnswering
|
|
||||||
from pytorch_pretrained_bert.optimization import BertAdam, warmup_linear
|
|
||||||
from pytorch_pretrained_bert.file_utils import PYTORCH_PRETRAINED_BERT_CACHE
|
from pytorch_pretrained_bert.file_utils import PYTORCH_PRETRAINED_BERT_CACHE
|
||||||
|
from pytorch_pretrained_bert.modeling import BertForQuestionAnswering, BertConfig, WEIGHTS_NAME, CONFIG_NAME
|
||||||
|
from pytorch_pretrained_bert.optimization import BertAdam, warmup_linear
|
||||||
|
from pytorch_pretrained_bert.tokenization import (BasicTokenizer,
|
||||||
|
BertTokenizer,
|
||||||
|
whitespace_tokenize)
|
||||||
|
|
||||||
|
if sys.version_info[0] == 2:
|
||||||
|
import cPickle as pickle
|
||||||
|
else:
|
||||||
|
import pickle
|
||||||
|
|
||||||
logging.basicConfig(format = '%(asctime)s - %(levelname)s - %(name)s - %(message)s',
|
logging.basicConfig(format = '%(asctime)s - %(levelname)s - %(name)s - %(message)s',
|
||||||
datefmt = '%m/%d/%Y %H:%M:%S',
|
datefmt = '%m/%d/%Y %H:%M:%S',
|
||||||
@@ -863,7 +870,8 @@ def main():
|
|||||||
|
|
||||||
if os.path.exists(args.output_dir) and os.listdir(args.output_dir) and args.do_train:
|
if os.path.exists(args.output_dir) and os.listdir(args.output_dir) and args.do_train:
|
||||||
raise ValueError("Output directory ({}) already exists and is not empty.".format(args.output_dir))
|
raise ValueError("Output directory ({}) already exists and is not empty.".format(args.output_dir))
|
||||||
os.makedirs(args.output_dir, exist_ok=True)
|
if not os.path.exists(args.output_dir):
|
||||||
|
os.makedirs(args.output_dir)
|
||||||
|
|
||||||
tokenizer = BertTokenizer.from_pretrained(args.bert_model, do_lower_case=args.do_lower_case)
|
tokenizer = BertTokenizer.from_pretrained(args.bert_model, do_lower_case=args.do_lower_case)
|
||||||
|
|
||||||
@@ -879,7 +887,7 @@ def main():
|
|||||||
|
|
||||||
# Prepare model
|
# Prepare model
|
||||||
model = BertForQuestionAnswering.from_pretrained(args.bert_model,
|
model = BertForQuestionAnswering.from_pretrained(args.bert_model,
|
||||||
cache_dir=PYTORCH_PRETRAINED_BERT_CACHE / 'distributed_{}'.format(args.local_rank))
|
cache_dir=os.path.join(PYTORCH_PRETRAINED_BERT_CACHE, 'distributed_{}'.format(args.local_rank)))
|
||||||
|
|
||||||
if args.fp16:
|
if args.fp16:
|
||||||
model.half()
|
model.half()
|
||||||
@@ -909,7 +917,7 @@ def main():
|
|||||||
|
|
||||||
if args.fp16:
|
if args.fp16:
|
||||||
try:
|
try:
|
||||||
from apex.optimizer import FP16_Optimizer
|
from apex.optimizers import FP16_Optimizer
|
||||||
from apex.optimizers import FusedAdam
|
from apex.optimizers import FusedAdam
|
||||||
except ImportError:
|
except ImportError:
|
||||||
raise ImportError("Please install apex from https://www.github.com/nvidia/apex to use distributed and fp16 training.")
|
raise ImportError("Please install apex from https://www.github.com/nvidia/apex to use distributed and fp16 training.")
|
||||||
@@ -993,14 +1001,19 @@ def main():
|
|||||||
optimizer.zero_grad()
|
optimizer.zero_grad()
|
||||||
global_step += 1
|
global_step += 1
|
||||||
|
|
||||||
# Save a trained model
|
|
||||||
model_to_save = model.module if hasattr(model, 'module') else model # Only save the model itself
|
|
||||||
output_model_file = os.path.join(args.output_dir, "pytorch_model.bin")
|
|
||||||
if args.do_train:
|
if args.do_train:
|
||||||
|
# Save a trained model and the associated configuration
|
||||||
|
model_to_save = model.module if hasattr(model, 'module') else model # Only save the model itself
|
||||||
|
output_model_file = os.path.join(args.output_dir, WEIGHTS_NAME)
|
||||||
torch.save(model_to_save.state_dict(), output_model_file)
|
torch.save(model_to_save.state_dict(), output_model_file)
|
||||||
# Load a trained model that you have fine-tuned
|
output_config_file = os.path.join(args.output_dir, CONFIG_NAME)
|
||||||
model_state_dict = torch.load(output_model_file)
|
with open(output_config_file, 'w') as f:
|
||||||
model = BertForQuestionAnswering.from_pretrained(args.bert_model, state_dict=model_state_dict)
|
f.write(model_to_save.config.to_json_string())
|
||||||
|
|
||||||
|
# Load a trained model and config that you have fine-tuned
|
||||||
|
config = BertConfig(output_config_file)
|
||||||
|
model = BertForQuestionAnswering(config)
|
||||||
|
model.load_state_dict(torch.load(output_model_file))
|
||||||
else:
|
else:
|
||||||
model = BertForQuestionAnswering.from_pretrained(args.bert_model)
|
model = BertForQuestionAnswering.from_pretrained(args.bert_model)
|
||||||
|
|
||||||
|
|||||||
@@ -15,22 +15,25 @@
|
|||||||
# limitations under the License.
|
# limitations under the License.
|
||||||
"""BERT finetuning runner."""
|
"""BERT finetuning runner."""
|
||||||
|
|
||||||
|
import argparse
|
||||||
|
import csv
|
||||||
import logging
|
import logging
|
||||||
import os
|
import os
|
||||||
import argparse
|
|
||||||
import random
|
import random
|
||||||
from tqdm import tqdm, trange
|
import sys
|
||||||
import csv
|
from io import open
|
||||||
|
|
||||||
import numpy as np
|
import numpy as np
|
||||||
import torch
|
import torch
|
||||||
from torch.utils.data import TensorDataset, DataLoader, RandomSampler, SequentialSampler
|
from torch.utils.data import (DataLoader, RandomSampler, SequentialSampler,
|
||||||
|
TensorDataset)
|
||||||
from torch.utils.data.distributed import DistributedSampler
|
from torch.utils.data.distributed import DistributedSampler
|
||||||
|
from tqdm import tqdm, trange
|
||||||
|
|
||||||
from pytorch_pretrained_bert.tokenization import BertTokenizer
|
from pytorch_pretrained_bert.file_utils import PYTORCH_PRETRAINED_BERT_CACHE
|
||||||
from pytorch_pretrained_bert.modeling import BertForMultipleChoice
|
from pytorch_pretrained_bert.modeling import BertForMultipleChoice
|
||||||
from pytorch_pretrained_bert.optimization import BertAdam, warmup_linear
|
from pytorch_pretrained_bert.optimization import BertAdam, warmup_linear
|
||||||
from pytorch_pretrained_bert.file_utils import PYTORCH_PRETRAINED_BERT_CACHE
|
from pytorch_pretrained_bert.tokenization import BertTokenizer
|
||||||
|
|
||||||
logging.basicConfig(format = '%(asctime)s - %(levelname)s - %(name)s - %(message)s',
|
logging.basicConfig(format = '%(asctime)s - %(levelname)s - %(name)s - %(message)s',
|
||||||
datefmt = '%m/%d/%Y %H:%M:%S',
|
datefmt = '%m/%d/%Y %H:%M:%S',
|
||||||
@@ -65,17 +68,17 @@ class SwagExample(object):
|
|||||||
|
|
||||||
def __repr__(self):
|
def __repr__(self):
|
||||||
l = [
|
l = [
|
||||||
f"swag_id: {self.swag_id}",
|
"swag_id: {}".format(self.swag_id),
|
||||||
f"context_sentence: {self.context_sentence}",
|
"context_sentence: {}".format(self.context_sentence),
|
||||||
f"start_ending: {self.start_ending}",
|
"start_ending: {}".format(self.start_ending),
|
||||||
f"ending_0: {self.endings[0]}",
|
"ending_0: {}".format(self.endings[0]),
|
||||||
f"ending_1: {self.endings[1]}",
|
"ending_1: {}".format(self.endings[1]),
|
||||||
f"ending_2: {self.endings[2]}",
|
"ending_2: {}".format(self.endings[2]),
|
||||||
f"ending_3: {self.endings[3]}",
|
"ending_3: {}".format(self.endings[3]),
|
||||||
]
|
]
|
||||||
|
|
||||||
if self.label is not None:
|
if self.label is not None:
|
||||||
l.append(f"label: {self.label}")
|
l.append("label: {}".format(self.label))
|
||||||
|
|
||||||
return ", ".join(l)
|
return ", ".join(l)
|
||||||
|
|
||||||
@@ -102,7 +105,11 @@ class InputFeatures(object):
|
|||||||
def read_swag_examples(input_file, is_training):
|
def read_swag_examples(input_file, is_training):
|
||||||
with open(input_file, 'r', encoding='utf-8') as f:
|
with open(input_file, 'r', encoding='utf-8') as f:
|
||||||
reader = csv.reader(f)
|
reader = csv.reader(f)
|
||||||
lines = list(reader)
|
lines = []
|
||||||
|
for line in reader:
|
||||||
|
if sys.version_info[0] == 2:
|
||||||
|
line = list(unicode(cell, 'utf-8') for cell in line)
|
||||||
|
lines.append(line)
|
||||||
|
|
||||||
if is_training and lines[0][-1] != 'label':
|
if is_training and lines[0][-1] != 'label':
|
||||||
raise ValueError(
|
raise ValueError(
|
||||||
@@ -184,15 +191,15 @@ def convert_examples_to_features(examples, tokenizer, max_seq_length,
|
|||||||
label = example.label
|
label = example.label
|
||||||
if example_index < 5:
|
if example_index < 5:
|
||||||
logger.info("*** Example ***")
|
logger.info("*** Example ***")
|
||||||
logger.info(f"swag_id: {example.swag_id}")
|
logger.info("swag_id: {}".format(example.swag_id))
|
||||||
for choice_idx, (tokens, input_ids, input_mask, segment_ids) in enumerate(choices_features):
|
for choice_idx, (tokens, input_ids, input_mask, segment_ids) in enumerate(choices_features):
|
||||||
logger.info(f"choice: {choice_idx}")
|
logger.info("choice: {}".format(choice_idx))
|
||||||
logger.info(f"tokens: {' '.join(tokens)}")
|
logger.info("tokens: {}".format(' '.join(tokens)))
|
||||||
logger.info(f"input_ids: {' '.join(map(str, input_ids))}")
|
logger.info("input_ids: {}".format(' '.join(map(str, input_ids))))
|
||||||
logger.info(f"input_mask: {' '.join(map(str, input_mask))}")
|
logger.info("input_mask: {}".format(' '.join(map(str, input_mask))))
|
||||||
logger.info(f"segment_ids: {' '.join(map(str, segment_ids))}")
|
logger.info("segment_ids: {}".format(' '.join(map(str, segment_ids))))
|
||||||
if is_training:
|
if is_training:
|
||||||
logger.info(f"label: {label}")
|
logger.info("label: {}".format(label))
|
||||||
|
|
||||||
features.append(
|
features.append(
|
||||||
InputFeatures(
|
InputFeatures(
|
||||||
@@ -344,7 +351,8 @@ def main():
|
|||||||
|
|
||||||
if os.path.exists(args.output_dir) and os.listdir(args.output_dir):
|
if os.path.exists(args.output_dir) and os.listdir(args.output_dir):
|
||||||
raise ValueError("Output directory ({}) already exists and is not empty.".format(args.output_dir))
|
raise ValueError("Output directory ({}) already exists and is not empty.".format(args.output_dir))
|
||||||
os.makedirs(args.output_dir, exist_ok=True)
|
if not os.path.exists(args.output_dir):
|
||||||
|
os.makedirs(args.output_dir)
|
||||||
|
|
||||||
tokenizer = BertTokenizer.from_pretrained(args.bert_model, do_lower_case=args.do_lower_case)
|
tokenizer = BertTokenizer.from_pretrained(args.bert_model, do_lower_case=args.do_lower_case)
|
||||||
|
|
||||||
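The hunk above replaces `os.makedirs(..., exist_ok=True)` with an explicit existence check, since `exist_ok` is not available on Python 2. An alternative pattern is sketched below, under the assumption that one also wants to tolerate another process creating the directory between the check and the call; the name `makedirs_compat` is hypothetical and does not appear in the commit.

```python
import errno
import os


def makedirs_compat(path):
    """Create `path` and its parents; do nothing if it is already a directory.

    Hypothetical helper: behaves like os.makedirs(path, exist_ok=True)
    but also runs on Python 2.
    """
    try:
        os.makedirs(path)
    except OSError as e:
        # Re-raise unless the failure is "already exists" for a directory.
        if e.errno != errno.EEXIST or not os.path.isdir(path):
            raise
```

The explicit `os.path.exists` check used in the diff is simpler and is probably sufficient for a single-process training script.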
@@ -359,7 +367,7 @@ def main():
|
|||||||
|
|
||||||
# Prepare model
|
# Prepare model
|
||||||
model = BertForMultipleChoice.from_pretrained(args.bert_model,
|
model = BertForMultipleChoice.from_pretrained(args.bert_model,
|
||||||
cache_dir=PYTORCH_PRETRAINED_BERT_CACHE / 'distributed_{}'.format(args.local_rank),
|
cache_dir=os.path.join(str(PYTORCH_PRETRAINED_BERT_CACHE), 'distributed_{}'.format(args.local_rank)),
|
||||||
num_choices=4)
|
num_choices=4)
|
||||||
if args.fp16:
|
if args.fp16:
|
||||||
model.half()
|
model.half()
|
||||||
@@ -461,18 +469,25 @@ def main():
|
|||||||
optimizer.zero_grad()
|
optimizer.zero_grad()
|
||||||
global_step += 1
|
global_step += 1
|
||||||
|
|
||||||
# Save a trained model
|
|
||||||
model_to_save = model.module if hasattr(model, 'module') else model # Only save the model it-self
|
|
||||||
output_model_file = os.path.join(args.output_dir, "pytorch_model.bin")
|
|
||||||
torch.save(model_to_save.state_dict(), output_model_file)
|
|
||||||
|
|
||||||
# Load a trained model that you have fine-tuned
|
if args.do_train:
|
||||||
model_state_dict = torch.load(output_model_file)
|
# Save a trained model and the associated configuration
|
||||||
model = BertForMultipleChoice.from_pretrained(args.bert_model,
|
model_to_save = model.module if hasattr(model, 'module') else model # Only save the model itself
|
||||||
state_dict=model_state_dict,
|
output_model_file = os.path.join(args.output_dir, WEIGHTS_NAME)
|
||||||
num_choices=4)
|
torch.save(model_to_save.state_dict(), output_model_file)
|
||||||
|
output_config_file = os.path.join(args.output_dir, CONFIG_NAME)
|
||||||
|
with open(output_config_file, 'w') as f:
|
||||||
|
f.write(model_to_save.config.to_json_string())
|
||||||
|
|
||||||
|
# Load a trained model and config that you have fine-tuned
|
||||||
|
config = BertConfig(output_config_file)
|
||||||
|
model = BertForMultipleChoice(config, num_choices=4)
|
||||||
|
model.load_state_dict(torch.load(output_model_file))
|
||||||
|
else:
|
||||||
|
model = BertForMultipleChoice.from_pretrained(args.bert_model, num_choices=4)
|
||||||
model.to(device)
|
model.to(device)
|
||||||
|
|
||||||
|
|
||||||
if args.do_eval and (args.local_rank == -1 or torch.distributed.get_rank() == 0):
|
if args.do_eval and (args.local_rank == -1 or torch.distributed.get_rank() == 0):
|
||||||
eval_examples = read_swag_examples(os.path.join(args.data_dir, 'val.csv'), is_training = True)
|
eval_examples = read_swag_examples(os.path.join(args.data_dir, 'val.csv'), is_training = True)
|
||||||
eval_features = convert_examples_to_features(
|
eval_features = convert_examples_to_features(
|
||||||
|
|||||||
152
examples/run_transfo_xl.py
Normal file
@@ -0,0 +1,152 @@
|
|||||||
|
# coding=utf-8
|
||||||
|
# Copyright 2018 Google AI, Google Brain and Carnegie Mellon University Authors and the HuggingFace Inc. team.
|
||||||
|
# Copyright (c) 2018, NVIDIA CORPORATION. All rights reserved.
|
||||||
|
#
|
||||||
|
# Licensed under the Apache License, Version 2.0 (the "License");
|
||||||
|
# you may not use this file except in compliance with the License.
|
||||||
|
# You may obtain a copy of the License at
|
||||||
|
#
|
||||||
|
# http://www.apache.org/licenses/LICENSE-2.0
|
||||||
|
#
|
||||||
|
# Unless required by applicable law or agreed to in writing, software
|
||||||
|
# distributed under the License is distributed on an "AS IS" BASIS,
|
||||||
|
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
|
||||||
|
# See the License for the specific language governing permissions and
|
||||||
|
# limitations under the License.
|
||||||
|
""" PyTorch Transformer XL model evaluation script.
|
||||||
|
Adapted from https://github.com/kimiyoung/transformer-xl.
|
||||||
|
In particular https://github.com/kimiyoung/transformer-xl/blob/master/pytorch/eval.py
|
||||||
|
|
||||||
|
This script with default values evaluates a pretrained Transformer-XL on WikiText 103
|
||||||
|
"""
|
||||||
|
from __future__ import absolute_import, division, print_function, unicode_literals
|
||||||
|
|
||||||
|
import argparse
|
||||||
|
import logging
|
||||||
|
import time
|
||||||
|
import math
|
||||||
|
|
||||||
|
import torch
|
||||||
|
|
||||||
|
from pytorch_pretrained_bert import TransfoXLLMHeadModel, TransfoXLCorpus
|
||||||
|
|
||||||
|
logging.basicConfig(format = '%(asctime)s - %(levelname)s - %(name)s - %(message)s',
|
||||||
|
datefmt = '%m/%d/%Y %H:%M:%S',
|
||||||
|
level = logging.INFO)
|
||||||
|
logger = logging.getLogger(__name__)
|
||||||
|
|
||||||
|
def main():
|
||||||
|
parser = argparse.ArgumentParser(description='PyTorch Transformer Language Model')
|
||||||
|
parser.add_argument('--model_name', type=str, default='transfo-xl-wt103',
|
||||||
|
help='pretrained model name')
|
||||||
|
parser.add_argument('--split', type=str, default='test',
|
||||||
|
choices=['all', 'valid', 'test'],
|
||||||
|
help='which split to evaluate')
|
||||||
|
parser.add_argument('--batch_size', type=int, default=10,
|
||||||
|
help='batch size')
|
||||||
|
parser.add_argument('--tgt_len', type=int, default=128,
|
||||||
|
help='number of tokens to predict')
|
||||||
|
parser.add_argument('--ext_len', type=int, default=0,
|
||||||
|
help='length of the extended context')
|
||||||
|
parser.add_argument('--mem_len', type=int, default=1600,
|
||||||
|
help='length of the retained previous heads')
|
||||||
|
parser.add_argument('--clamp_len', type=int, default=1000,
|
||||||
|
help='max positional embedding index')
|
||||||
|
parser.add_argument('--no_cuda', action='store_true',
|
||||||
|
help='Do not use CUDA even though CUDA is available')
|
||||||
|
parser.add_argument('--work_dir', type=str, required=True,
|
||||||
|
help='path to the work_dir')
|
||||||
|
parser.add_argument('--no_log', action='store_true',
|
||||||
|
help='do not log the eval result')
|
||||||
|
parser.add_argument('--same_length', action='store_true',
|
||||||
|
help='set same length attention with masking')
|
||||||
|
parser.add_argument('--server_ip', type=str, default='', help="Can be used for distant debugging.")
|
||||||
|
parser.add_argument('--server_port', type=str, default='', help="Can be used for distant debugging.")
|
||||||
|
args = parser.parse_args()
|
||||||
|
assert args.ext_len >= 0, 'extended context length must be non-negative'
|
||||||
|
|
||||||
|
if args.server_ip and args.server_port:
|
||||||
|
# Distant debugging - see https://code.visualstudio.com/docs/python/debugging#_attach-to-a-local-script
|
||||||
|
import ptvsd
|
||||||
|
print("Waiting for debugger attach")
|
||||||
|
ptvsd.enable_attach(address=(args.server_ip, args.server_port), redirect_output=True)
|
||||||
|
ptvsd.wait_for_attach()
|
||||||
|
|
||||||
|
device = torch.device("cuda" if torch.cuda.is_available() and not args.no_cuda else "cpu")
|
||||||
|
logger.info("device: {}".format(device))
|
||||||
|
|
||||||
|
# Load a pre-processed dataset
|
||||||
|
# You can also build the corpus yourself using TransfoXLCorpus methods
|
||||||
|
# The pre-processing involves computing word frequencies to prepare the Adaptive input and SoftMax
|
||||||
|
# and tokenizing the dataset
|
||||||
|
# The pre-processed corpus is a conversion (using the conversion script)
|
||||||
|
corpus = TransfoXLCorpus.from_pretrained(args.model_name)
|
||||||
|
ntokens = len(corpus.vocab)
|
||||||
|
|
||||||
|
va_iter = corpus.get_iterator('valid', args.batch_size, args.tgt_len,
|
||||||
|
device=device, ext_len=args.ext_len)
|
||||||
|
te_iter = corpus.get_iterator('test', args.batch_size, args.tgt_len,
|
||||||
|
device=device, ext_len=args.ext_len)
|
||||||
|
|
||||||
|
# Load a pre-trained model
|
||||||
|
model = TransfoXLLMHeadModel.from_pretrained(args.model_name)
|
||||||
|
model = model.to(device)
|
||||||
|
|
||||||
|
logger.info('Evaluating with bsz {} tgt_len {} ext_len {} mem_len {} clamp_len {}'.format(
|
||||||
|
args.batch_size, args.tgt_len, args.ext_len, args.mem_len, args.clamp_len))
|
||||||
|
|
||||||
|
model.reset_length(args.tgt_len, args.ext_len, args.mem_len)
|
||||||
|
if args.clamp_len > 0:
|
||||||
|
model.clamp_len = args.clamp_len
|
||||||
|
if args.same_length:
|
||||||
|
model.same_length = True
|
||||||
|
|
||||||
|
###############################################################################
|
||||||
|
# Evaluation code
|
||||||
|
###############################################################################
|
||||||
|
def evaluate(eval_iter):
|
||||||
|
# Turn on evaluation mode which disables dropout.
|
||||||
|
model.eval()
|
||||||
|
total_len, total_loss = 0, 0.
|
||||||
|
start_time = time.time()
|
||||||
|
with torch.no_grad():
|
||||||
|
mems = None
|
||||||
|
for idx, (data, target, seq_len) in enumerate(eval_iter):
|
||||||
|
ret = model(data, target, mems)
|
||||||
|
loss, mems = ret
|
||||||
|
loss = loss.mean()
|
||||||
|
total_loss += seq_len * loss.item()
|
||||||
|
total_len += seq_len
|
||||||
|
total_time = time.time() - start_time
|
||||||
|
logger.info('Time : {:.2f}s, {:.2f}ms/segment'.format(
|
||||||
|
total_time, 1000 * total_time / (idx+1)))
|
||||||
|
return total_loss / total_len
|
||||||
|
|
||||||
|
# Run on test data.
|
||||||
|
if args.split == 'all':
|
||||||
|
test_loss = evaluate(te_iter)
|
||||||
|
valid_loss = evaluate(va_iter)
|
||||||
|
elif args.split == 'valid':
|
||||||
|
valid_loss = evaluate(va_iter)
|
||||||
|
test_loss = None
|
||||||
|
elif args.split == 'test':
|
||||||
|
test_loss = evaluate(te_iter)
|
||||||
|
valid_loss = None
|
||||||
|
|
||||||
|
def format_log(loss, split):
|
||||||
|
log_str = '| {0} loss {1:5.2f} | {0} ppl {2:9.3f} '.format(
|
||||||
|
split, loss, math.exp(loss))
|
||||||
|
return log_str
|
||||||
|
|
||||||
|
log_str = ''
|
||||||
|
if valid_loss is not None:
|
||||||
|
log_str += format_log(valid_loss, 'valid')
|
||||||
|
if test_loss is not None:
|
||||||
|
log_str += format_log(test_loss, 'test')
|
||||||
|
|
||||||
|
logger.info('=' * 100)
|
||||||
|
logger.info(log_str)
|
||||||
|
logger.info('=' * 100)
|
||||||
|
|
||||||
|
if __name__ == '__main__':
|
||||||
|
main()
|
||||||
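The `evaluate` function in the script above weights each segment's mean loss by its length, and `format_log` then reports `math.exp(loss)` as perplexity. The arithmetic can be sketched without any model as follows; the function name `corpus_perplexity` and the `(seq_len, mean_loss)` input format are assumptions made for illustration only.

```python
import math


def corpus_perplexity(segments):
    """Return (average loss per token, perplexity).

    `segments` is an iterable of (seq_len, mean_loss) pairs, a hypothetical
    stand-in for what the evaluation loop accumulates batch by batch.
    """
    total_len = 0
    total_loss = 0.0
    for seq_len, mean_loss in segments:
        # Weight each segment's mean loss by the number of positions it covers.
        total_loss += seq_len * mean_loss
        total_len += seq_len
    avg_loss = total_loss / total_len
    return avg_loss, math.exp(avg_loss)
```

Weighting by `seq_len` matters because the last segment of a corpus is usually shorter than `tgt_len`; a plain mean over segments would give it too much influence.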
@@ -1,8 +1,20 @@
|
|||||||
__version__ = "0.4.0"
|
__version__ = "0.5.0"
|
||||||
from .tokenization import BertTokenizer, BasicTokenizer, WordpieceTokenizer
|
from .tokenization import BertTokenizer, BasicTokenizer, WordpieceTokenizer
|
||||||
|
from .tokenization_openai import OpenAIGPTTokenizer
|
||||||
|
from .tokenization_transfo_xl import (TransfoXLTokenizer, TransfoXLCorpus)
|
||||||
|
|
||||||
from .modeling import (BertConfig, BertModel, BertForPreTraining,
|
from .modeling import (BertConfig, BertModel, BertForPreTraining,
|
||||||
BertForMaskedLM, BertForNextSentencePrediction,
|
BertForMaskedLM, BertForNextSentencePrediction,
|
||||||
BertForSequenceClassification, BertForMultipleChoice,
|
BertForSequenceClassification, BertForMultipleChoice,
|
||||||
BertForTokenClassification, BertForQuestionAnswering)
|
BertForTokenClassification, BertForQuestionAnswering,
|
||||||
|
load_tf_weights_in_bert)
|
||||||
|
from .modeling_openai import (OpenAIGPTConfig, OpenAIGPTModel,
|
||||||
|
OpenAIGPTLMHeadModel, OpenAIGPTDoubleHeadsModel,
|
||||||
|
load_tf_weights_in_openai_gpt)
|
||||||
|
from .modeling_transfo_xl import (TransfoXLConfig, TransfoXLModel, TransfoXLLMHeadModel,
|
||||||
|
load_tf_weights_in_transfo_xl)
|
||||||
|
|
||||||
from .optimization import BertAdam
|
from .optimization import BertAdam
|
||||||
from .file_utils import PYTORCH_PRETRAINED_BERT_CACHE
|
from .optimization_openai import OpenAIAdam
|
||||||
|
|
||||||
|
from .file_utils import PYTORCH_PRETRAINED_BERT_CACHE, cached_path
|
||||||
|
|||||||
@@ -1,22 +1,65 @@
|
|||||||
# coding: utf8
|
# coding: utf8
|
||||||
def main():
|
def main():
|
||||||
import sys
|
import sys
|
||||||
try:
|
if (len(sys.argv) != 4 and len(sys.argv) != 5) or sys.argv[1] not in [
|
||||||
from .convert_tf_checkpoint_to_pytorch import convert_tf_checkpoint_to_pytorch
|
"convert_tf_checkpoint_to_pytorch",
|
||||||
except ModuleNotFoundError:
|
"convert_openai_checkpoint",
|
||||||
print("pytorch_pretrained_bert can only be used from the commandline to convert TensorFlow models in PyTorch, "
|
"convert_transfo_xl_checkpoint"
|
||||||
"In that case, it requires TensorFlow to be installed. Please see "
|
]:
|
||||||
"https://www.tensorflow.org/install/ for installation instructions.")
|
print(
|
||||||
raise
|
"Should be used as one of: \n"
|
||||||
|
">> `pytorch_pretrained_bert convert_tf_checkpoint_to_pytorch TF_CHECKPOINT TF_CONFIG PYTORCH_DUMP_OUTPUT`, \n"
|
||||||
if len(sys.argv) != 5:
|
">> `pytorch_pretrained_bert convert_openai_checkpoint OPENAI_GPT_CHECKPOINT_FOLDER_PATH PYTORCH_DUMP_OUTPUT [OPENAI_GPT_CONFIG]` or \n"
|
||||||
# pylint: disable=line-too-long
|
">> `pytorch_pretrained_bert convert_transfo_xl_checkpoint TF_CHECKPOINT_OR_DATASET PYTORCH_DUMP_OUTPUT [TF_CONFIG]`")
|
||||||
print("Should be used as `pytorch_pretrained_bert convert_tf_checkpoint_to_pytorch TF_CHECKPOINT TF_CONFIG PYTORCH_DUMP_OUTPUT`")
|
|
||||||
else:
|
else:
|
||||||
PYTORCH_DUMP_OUTPUT = sys.argv.pop()
|
if sys.argv[1] == "convert_tf_checkpoint_to_pytorch":
|
||||||
TF_CONFIG = sys.argv.pop()
|
try:
|
||||||
TF_CHECKPOINT = sys.argv.pop()
|
from .convert_tf_checkpoint_to_pytorch import convert_tf_checkpoint_to_pytorch
|
||||||
convert_tf_checkpoint_to_pytorch(TF_CHECKPOINT, TF_CONFIG, PYTORCH_DUMP_OUTPUT)
|
except ImportError:
|
||||||
|
print("pytorch_pretrained_bert can only be used from the commandline to convert TensorFlow models to PyTorch. "
|
||||||
|
"In that case, it requires TensorFlow to be installed. Please see "
|
||||||
|
"https://www.tensorflow.org/install/ for installation instructions.")
|
||||||
|
raise
|
||||||
|
|
||||||
|
if len(sys.argv) != 5:
|
||||||
|
# pylint: disable=line-too-long
|
||||||
|
print("Should be used as `pytorch_pretrained_bert convert_tf_checkpoint_to_pytorch TF_CHECKPOINT TF_CONFIG PYTORCH_DUMP_OUTPUT`")
|
||||||
|
else:
|
||||||
|
PYTORCH_DUMP_OUTPUT = sys.argv.pop()
|
||||||
|
TF_CONFIG = sys.argv.pop()
|
||||||
|
TF_CHECKPOINT = sys.argv.pop()
|
||||||
|
convert_tf_checkpoint_to_pytorch(TF_CHECKPOINT, TF_CONFIG, PYTORCH_DUMP_OUTPUT)
|
||||||
|
elif sys.argv[1] == "convert_openai_checkpoint":
|
||||||
|
from .convert_openai_checkpoint_to_pytorch import convert_openai_checkpoint_to_pytorch
|
||||||
|
OPENAI_GPT_CHECKPOINT_FOLDER_PATH = sys.argv[2]
|
||||||
|
PYTORCH_DUMP_OUTPUT = sys.argv[3]
|
||||||
|
if len(sys.argv) == 5:
|
||||||
|
OPENAI_GPT_CONFIG = sys.argv[4]
|
||||||
|
else:
|
||||||
|
OPENAI_GPT_CONFIG = ""
|
||||||
|
convert_openai_checkpoint_to_pytorch(OPENAI_GPT_CHECKPOINT_FOLDER_PATH,
|
||||||
|
OPENAI_GPT_CONFIG,
|
||||||
|
PYTORCH_DUMP_OUTPUT)
|
||||||
|
else:
|
||||||
|
try:
|
||||||
|
from .convert_transfo_xl_checkpoint_to_pytorch import convert_transfo_xl_checkpoint_to_pytorch
|
||||||
|
except ImportError:
|
||||||
|
print("pytorch_pretrained_bert can only be used from the commandline to convert TensorFlow models to PyTorch. "
|
||||||
|
"In that case, it requires TensorFlow to be installed. Please see "
|
||||||
|
"https://www.tensorflow.org/install/ for installation instructions.")
|
||||||
|
raise
|
||||||
|
|
||||||
|
if 'ckpt' in sys.argv[2].lower():
|
||||||
|
TF_CHECKPOINT = sys.argv[2]
|
||||||
|
TF_DATASET_FILE = ""
|
||||||
|
else:
|
||||||
|
TF_DATASET_FILE = sys.argv[2]
|
||||||
|
TF_CHECKPOINT = ""
|
||||||
|
PYTORCH_DUMP_OUTPUT = sys.argv[3]
|
||||||
|
if len(sys.argv) == 5:
|
||||||
|
TF_CONFIG = sys.argv[4]
|
||||||
|
else:
|
||||||
|
TF_CONFIG = ""
|
||||||
|
convert_transfo_xl_checkpoint_to_pytorch(TF_CHECKPOINT, TF_CONFIG, PYTORCH_DUMP_OUTPUT, TF_DATASET_FILE)
|
||||||
if __name__ == '__main__':
|
if __name__ == '__main__':
|
||||||
main()
|
main()
|
||||||
|
|||||||
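The command-line entry point above dispatches on `sys.argv[1]` and counts arguments by hand. For comparison, here is a sketch of the same three sub-commands expressed with `argparse` sub-parsers. This is an alternative technique, not what the commit does; the function name `build_parser` and the lowercase argument names are hypothetical.

```python
import argparse


def build_parser():
    """Build a parser for the three conversion sub-commands (illustrative only)."""
    parser = argparse.ArgumentParser(prog="pytorch_pretrained_bert")
    sub = parser.add_subparsers(dest="command")

    bert = sub.add_parser("convert_tf_checkpoint_to_pytorch")
    bert.add_argument("tf_checkpoint")
    bert.add_argument("tf_config")
    bert.add_argument("pytorch_dump_output")

    gpt = sub.add_parser("convert_openai_checkpoint")
    gpt.add_argument("openai_gpt_checkpoint_folder_path")
    gpt.add_argument("pytorch_dump_output")
    # Optional trailing config, defaulting to "" as in the hand-written version.
    gpt.add_argument("openai_gpt_config", nargs="?", default="")

    xl = sub.add_parser("convert_transfo_xl_checkpoint")
    xl.add_argument("tf_checkpoint_or_dataset")
    xl.add_argument("pytorch_dump_output")
    xl.add_argument("tf_config", nargs="?", default="")
    return parser
```

With this layout, argument-count errors and usage messages come from `argparse` itself, at the cost of a slightly longer setup than the manual `sys.argv` checks.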
72
pytorch_pretrained_bert/convert_openai_checkpoint_to_pytorch.py
Executable file
@@ -0,0 +1,72 @@
|
|||||||
|
# coding=utf-8
|
||||||
|
# Copyright 2018 The HuggingFace Inc. team.
|
||||||
|
#
|
||||||
|
# Licensed under the Apache License, Version 2.0 (the "License");
|
||||||
|
# you may not use this file except in compliance with the License.
|
||||||
|
# You may obtain a copy of the License at
|
||||||
|
#
|
||||||
|
# http://www.apache.org/licenses/LICENSE-2.0
|
||||||
|
#
|
||||||
|
# Unless required by applicable law or agreed to in writing, software
|
||||||
|
# distributed under the License is distributed on an "AS IS" BASIS,
|
||||||
|
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
|
||||||
|
# See the License for the specific language governing permissions and
|
||||||
|
# limitations under the License.
|
||||||
|
"""Convert OpenAI GPT checkpoint."""
|
||||||
|
|
||||||
|
from __future__ import absolute_import, division, print_function
|
||||||
|
|
||||||
|
import argparse
|
||||||
|
from io import open
|
||||||
|
|
||||||
|
import torch
|
||||||
|
|
||||||
|
from pytorch_pretrained_bert.modeling_openai import (CONFIG_NAME, WEIGHTS_NAME,
|
||||||
|
OpenAIGPTConfig,
|
||||||
|
OpenAIGPTModel,
|
||||||
|
load_tf_weights_in_openai_gpt)
|
||||||
|
|
||||||
|
|
||||||
|
def convert_openai_checkpoint_to_pytorch(openai_checkpoint_folder_path, openai_config_file, pytorch_dump_folder_path):
|
||||||
|
# Construct model
|
||||||
|
if openai_config_file == "":
|
||||||
|
config = OpenAIGPTConfig()
|
||||||
|
else:
|
||||||
|
config = OpenAIGPTConfig(openai_config_file)
|
||||||
|
model = OpenAIGPTModel(config)
|
||||||
|
|
||||||
|
# Load weights from numpy
|
||||||
|
load_tf_weights_in_openai_gpt(model, openai_checkpoint_folder_path)
|
||||||
|
|
||||||
|
# Save pytorch-model
|
||||||
|
pytorch_weights_dump_path = pytorch_dump_folder_path + '/' + WEIGHTS_NAME
|
||||||
|
pytorch_config_dump_path = pytorch_dump_folder_path + '/' + CONFIG_NAME
|
||||||
|
print("Save PyTorch model to {}".format(pytorch_weights_dump_path))
|
||||||
|
torch.save(model.state_dict(), pytorch_weights_dump_path)
|
||||||
|
print("Save configuration file to {}".format(pytorch_config_dump_path))
|
||||||
|
with open(pytorch_config_dump_path, "w", encoding="utf-8") as f:
|
||||||
|
f.write(config.to_json_string())
|
||||||
|
|
||||||
|
|
||||||
|
if __name__ == "__main__":
|
||||||
|
parser = argparse.ArgumentParser()
|
||||||
|
## Required parameters
|
||||||
|
parser.add_argument("--openai_checkpoint_folder_path",
|
||||||
|
default = None,
|
||||||
|
type = str,
|
||||||
|
required = True,
|
||||||
|
help = "Path to the TensorFlow checkpoint.")
|
||||||
|
parser.add_argument("--pytorch_dump_folder_path",
|
||||||
|
default = None,
|
||||||
|
type = str,
|
||||||
|
required = True,
|
||||||
|
help = "Path to the output PyTorch model.")
|
||||||
|
parser.add_argument("--openai_config_file",
|
||||||
|
default = "",
|
||||||
|
type = str,
|
||||||
|
help = "An optional config json file corresponding to the pre-trained OpenAI model. \n"
|
||||||
|
"This specifies the model architecture.")
|
||||||
|
args = parser.parse_args()
|
||||||
|
convert_openai_checkpoint_to_pytorch(args.openai_checkpoint_folder_path,
|
||||||
|
args.openai_config_file,
|
||||||
|
args.pytorch_dump_folder_path)
|
||||||
@@ -25,62 +25,16 @@ import tensorflow as tf
|
|||||||
import torch
|
import torch
|
||||||
import numpy as np
|
import numpy as np
|
||||||
|
|
||||||
from .modeling import BertConfig, BertForPreTraining
|
from pytorch_pretrained_bert.modeling import BertConfig, BertForPreTraining, load_tf_weights_in_bert
|
||||||
|
|
||||||
def convert_tf_checkpoint_to_pytorch(tf_checkpoint_path, bert_config_file, pytorch_dump_path):
|
def convert_tf_checkpoint_to_pytorch(tf_checkpoint_path, bert_config_file, pytorch_dump_path):
|
||||||
config_path = os.path.abspath(bert_config_file)
|
|
||||||
tf_path = os.path.abspath(tf_checkpoint_path)
|
|
||||||
print("Converting TensorFlow checkpoint from {} with config at {}".format(tf_path, config_path))
|
|
||||||
# Load weights from TF model
|
|
||||||
init_vars = tf.train.list_variables(tf_path)
|
|
||||||
names = []
|
|
||||||
arrays = []
|
|
||||||
for name, shape in init_vars:
|
|
||||||
print("Loading TF weight {} with shape {}".format(name, shape))
|
|
||||||
array = tf.train.load_variable(tf_path, name)
|
|
||||||
names.append(name)
|
|
||||||
arrays.append(array)
|
|
||||||
|
|
||||||
# Initialise PyTorch model
|
# Initialise PyTorch model
|
||||||
config = BertConfig.from_json_file(bert_config_file)
|
config = BertConfig.from_json_file(bert_config_file)
|
||||||
print("Building PyTorch model from configuration: {}".format(str(config)))
|
print("Building PyTorch model from configuration: {}".format(str(config)))
|
||||||
model = BertForPreTraining(config)
|
model = BertForPreTraining(config)
|
||||||
|
|
||||||
for name, array in zip(names, arrays):
|
# Load weights from tf checkpoint
|
||||||
name = name.split('/')
|
load_tf_weights_in_bert(model, tf_checkpoint_path)
|
||||||
# adam_v and adam_m are variables used in AdamWeightDecayOptimizer to calculated m and v
|
|
||||||
# which are not required for using pretrained model
|
|
||||||
if any(n in ["adam_v", "adam_m", "global_step"] for n in name):
|
|
||||||
print("Skipping {}".format("/".join(name)))
|
|
||||||
continue
|
|
||||||
pointer = model
|
|
||||||
for m_name in name:
|
|
||||||
if re.fullmatch(r'[A-Za-z]+_\d+', m_name):
|
|
||||||
l = re.split(r'_(\d+)', m_name)
|
|
||||||
else:
|
|
||||||
l = [m_name]
|
|
||||||
if l[0] == 'kernel' or l[0] == 'gamma':
|
|
||||||
pointer = getattr(pointer, 'weight')
|
|
||||||
elif l[0] == 'output_bias' or l[0] == 'beta':
|
|
||||||
pointer = getattr(pointer, 'bias')
|
|
||||||
elif l[0] == 'output_weights':
|
|
||||||
pointer = getattr(pointer, 'weight')
|
|
||||||
else:
|
|
||||||
pointer = getattr(pointer, l[0])
|
|
||||||
if len(l) >= 2:
|
|
||||||
num = int(l[1])
|
|
||||||
pointer = pointer[num]
|
|
||||||
if m_name[-11:] == '_embeddings':
|
|
||||||
pointer = getattr(pointer, 'weight')
|
|
||||||
elif m_name == 'kernel':
|
|
||||||
array = np.transpose(array)
|
|
||||||
try:
|
|
||||||
assert pointer.shape == array.shape
|
|
||||||
except AssertionError as e:
|
|
||||||
e.args += (pointer.shape, array.shape)
|
|
||||||
raise
|
|
||||||
print("Initialize PyTorch weight {}".format(name))
|
|
||||||
pointer.data = torch.from_numpy(array)
|
|
||||||
|
|
||||||
# Save pytorch-model
|
# Save pytorch-model
|
||||||
print("Save PyTorch model to {}".format(pytorch_dump_path))
|
print("Save PyTorch model to {}".format(pytorch_dump_path))
|
||||||
|
|||||||
116
pytorch_pretrained_bert/convert_transfo_xl_checkpoint_to_pytorch.py
Executable file
@@ -0,0 +1,116 @@
|
|||||||
|
# coding=utf-8
|
||||||
|
# Copyright 2018 The HuggingFace Inc. team.
|
||||||
|
#
|
||||||
|
# Licensed under the Apache License, Version 2.0 (the "License");
|
||||||
|
# you may not use this file except in compliance with the License.
|
||||||
|
# You may obtain a copy of the License at
|
||||||
|
#
|
||||||
|
# http://www.apache.org/licenses/LICENSE-2.0
|
||||||
|
#
|
||||||
|
# Unless required by applicable law or agreed to in writing, software
|
||||||
|
# distributed under the License is distributed on an "AS IS" BASIS,
|
||||||
|
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
|
||||||
|
# See the License for the specific language governing permissions and
|
||||||
|
# limitations under the License.
|
||||||
|
"""Convert Transformer XL checkpoint and datasets."""
|
||||||
|
|
||||||
|
from __future__ import absolute_import, division, print_function
|
||||||
|
|
||||||
|
import argparse
|
||||||
|
import os
|
||||||
|
import sys
|
||||||
|
from io import open
|
||||||
|
|
||||||
|
import torch
|
||||||
|
|
||||||
|
import pytorch_pretrained_bert.tokenization_transfo_xl as data_utils
|
||||||
|
from pytorch_pretrained_bert.modeling_transfo_xl import (CONFIG_NAME,
|
||||||
|
WEIGHTS_NAME,
|
||||||
|
TransfoXLConfig,
|
||||||
|
TransfoXLLMHeadModel,
|
||||||
|
load_tf_weights_in_transfo_xl)
|
||||||
|
from pytorch_pretrained_bert.tokenization_transfo_xl import (CORPUS_NAME,
|
||||||
|
VOCAB_NAME)
|
||||||
|
|
||||||
|
if sys.version_info[0] == 2:
|
||||||
|
import cPickle as pickle
|
||||||
|
else:
|
||||||
|
import pickle
|
||||||
|
|
||||||
|
# We do this to be able to load Python 2 dataset pickles
|
||||||
|
# See e.g. https://stackoverflow.com/questions/2121874/python-pickling-after-changing-a-modules-directory/2121918#2121918
|
||||||
|
data_utils.Vocab = data_utils.TransfoXLTokenizer
|
||||||
|
data_utils.Corpus = data_utils.TransfoXLCorpus
|
||||||
|
sys.modules['data_utils'] = data_utils
|
||||||
|
sys.modules['vocabulary'] = data_utils
|
||||||
|
|
||||||
|
def convert_transfo_xl_checkpoint_to_pytorch(tf_checkpoint_path,
|
||||||
|
transfo_xl_config_file,
|
||||||
|
pytorch_dump_folder_path,
|
||||||
|
transfo_xl_dataset_file):
|
||||||
|
if transfo_xl_dataset_file:
|
||||||
|
# Convert a pre-processed corpus (see original TensorFlow repo)
|
||||||
|
with open(transfo_xl_dataset_file, "rb") as fp:
|
||||||
|
corpus = pickle.load(fp, encoding="latin1") if sys.version_info[0] >= 3 else pickle.load(fp)
|
||||||
|
# Save vocabulary and dataset cache as Dictionaries (should be better than pickles for the long-term)
|
||||||
|
pytorch_vocab_dump_path = pytorch_dump_folder_path + '/' + VOCAB_NAME
|
||||||
|
print("Save vocabulary to {}".format(pytorch_vocab_dump_path))
|
||||||
|
corpus_vocab_dict = corpus.vocab.__dict__
|
||||||
|
torch.save(corpus_vocab_dict, pytorch_vocab_dump_path)
|
||||||
|
|
||||||
|
corpus_dict_no_vocab = corpus.__dict__
|
||||||
|
corpus_dict_no_vocab.pop('vocab', None)
|
||||||
|
pytorch_dataset_dump_path = pytorch_dump_folder_path + '/' + CORPUS_NAME
|
||||||
|
print("Save dataset to {}".format(pytorch_dataset_dump_path))
|
||||||
|
torch.save(corpus_dict_no_vocab, pytorch_dataset_dump_path)
|
||||||
|
|
||||||
|
if tf_checkpoint_path:
|
||||||
|
# Convert a pre-trained TensorFlow model
|
||||||
|
config_path = os.path.abspath(transfo_xl_config_file)
|
||||||
|
tf_path = os.path.abspath(tf_checkpoint_path)
|
||||||
|
|
||||||
|
print("Converting Transformer XL checkpoint from {} with config at {}".format(tf_path, config_path))
|
||||||
|
# Initialise PyTorch model
|
||||||
|
if transfo_xl_config_file == "":
|
||||||
|
config = TransfoXLConfig()
|
||||||
|
else:
|
||||||
|
config = TransfoXLConfig(transfo_xl_config_file)
|
||||||
|
print("Building PyTorch model from configuration: {}".format(str(config)))
|
||||||
|
model = TransfoXLLMHeadModel(config)
|
||||||
|
|
||||||
|
model = load_tf_weights_in_transfo_xl(model, config, tf_path)
|
||||||
|
# Save pytorch-model
|
||||||
|
pytorch_weights_dump_path = os.path.join(pytorch_dump_folder_path, WEIGHTS_NAME)
|
||||||
|
pytorch_config_dump_path = os.path.join(pytorch_dump_folder_path, CONFIG_NAME)
|
||||||
|
print("Save PyTorch model to {}".format(os.path.abspath(pytorch_weights_dump_path)))
|
||||||
|
torch.save(model.state_dict(), pytorch_weights_dump_path)
|
||||||
|
print("Save configuration file to {}".format(os.path.abspath(pytorch_config_dump_path)))
|
||||||
|
with open(pytorch_config_dump_path, "w", encoding="utf-8") as f:
|
||||||
|
f.write(config.to_json_string())
|
||||||
|
|
||||||
|
|
||||||
|
if __name__ == "__main__":
|
||||||
|
parser = argparse.ArgumentParser()
|
||||||
|
parser.add_argument("--pytorch_dump_folder_path",
|
||||||
|
default = None,
|
||||||
|
type = str,
|
||||||
|
required = True,
|
||||||
|
help = "Path to the folder to store the PyTorch model or dataset/vocab.")
|
||||||
|
parser.add_argument("--tf_checkpoint_path",
|
||||||
|
default = "",
|
||||||
|
type = str,
|
||||||
|
help = "An optional path to a TensorFlow checkpoint to be converted.")
|
||||||
|
parser.add_argument("--transfo_xl_config_file",
|
||||||
|
default = "",
|
||||||
|
type = str,
|
||||||
|
help = "An optional config json file corresponding to the pre-trained Transformer-XL model. \n"
|
||||||
|
"This specifies the model architecture.")
|
||||||
|
parser.add_argument("--transfo_xl_dataset_file",
|
||||||
|
default = "",
|
||||||
|
type = str,
|
||||||
|
help = "An optional dataset file to be converted into a vocabulary.")
|
||||||
|
args = parser.parse_args()
|
||||||
|
convert_transfo_xl_checkpoint_to_pytorch(args.tf_checkpoint_path,
|
||||||
|
args.transfo_xl_config_file,
|
||||||
|
args.pytorch_dump_folder_path,
|
||||||
|
args.transfo_xl_dataset_file)
|
||||||
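The conversion script above registers `data_utils` and `vocabulary` in `sys.modules` so that pickles written by the original code base can still resolve their classes. The mechanism can be sketched in isolation as follows; the helper name `load_legacy_pickle` and its parameters are hypothetical.

```python
import pickle
import sys


def load_legacy_pickle(payload, legacy_module_name, replacement_module):
    """Unpickle `payload` whose classes were defined in a module that has moved.

    Hypothetical helper: pickle stores classes as (module name, class name),
    so pointing the old module name at a module that still exposes the same
    class names lets the lookup succeed.
    """
    sys.modules[legacy_module_name] = replacement_module
    return pickle.loads(payload)
```

The alias stays in `sys.modules` after the call, which mirrors the module-level assignments in the script; a stricter version might remove it again once loading is done.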
@@ -3,31 +3,40 @@ Utilities for working with the local dataset cache.
|
|||||||
This file is adapted from the AllenNLP library at https://github.com/allenai/allennlp
|
This file is adapted from the AllenNLP library at https://github.com/allenai/allennlp
|
||||||
Copyright by the AllenNLP authors.
|
Copyright by the AllenNLP authors.
|
||||||
"""
|
"""
|
||||||
|
from __future__ import (absolute_import, division, print_function, unicode_literals)
|
||||||
|
|
||||||
import os
|
import json
|
||||||
import logging
|
import logging
|
||||||
|
import os
|
||||||
import shutil
|
import shutil
|
||||||
import tempfile
|
import tempfile
|
||||||
import json
|
|
||||||
from urllib.parse import urlparse
|
|
||||||
from pathlib import Path
|
|
||||||
from typing import Optional, Tuple, Union, IO, Callable, Set
|
|
||||||
from hashlib import sha256
|
|
||||||
from functools import wraps
|
from functools import wraps
|
||||||
|
from hashlib import sha256
|
||||||
from tqdm import tqdm
|
import sys
|
||||||
|
from io import open
|
||||||
|
|
||||||
import boto3
|
import boto3
|
||||||
from botocore.exceptions import ClientError
|
|
||||||
import requests
|
import requests
|
||||||
|
from botocore.exceptions import ClientError
|
||||||
|
from tqdm import tqdm
|
||||||
|
|
||||||
|
try:
|
||||||
|
from urllib.parse import urlparse
|
||||||
|
except ImportError:
|
||||||
|
from urlparse import urlparse
|
||||||
|
|
||||||
|
try:
|
||||||
|
from pathlib import Path
|
||||||
|
PYTORCH_PRETRAINED_BERT_CACHE = Path(os.getenv('PYTORCH_PRETRAINED_BERT_CACHE',
|
||||||
|
Path.home() / '.pytorch_pretrained_bert'))
|
||||||
|
except AttributeError:
|
||||||
|
PYTORCH_PRETRAINED_BERT_CACHE = os.getenv('PYTORCH_PRETRAINED_BERT_CACHE',
|
||||||
|
os.path.join(os.path.expanduser("~"), '.pytorch_pretrained_bert'))
|
||||||
|
|
||||||
logger = logging.getLogger(__name__) # pylint: disable=invalid-name
|
logger = logging.getLogger(__name__) # pylint: disable=invalid-name
|
||||||
|
|
||||||
PYTORCH_PRETRAINED_BERT_CACHE = Path(os.getenv('PYTORCH_PRETRAINED_BERT_CACHE',
|
|
||||||
Path.home() / '.pytorch_pretrained_bert'))
|
|
||||||
|
|
||||||
|
def url_to_filename(url, etag=None):
|
||||||
def url_to_filename(url: str, etag: str = None) -> str:
|
|
||||||
"""
|
"""
|
||||||
Convert `url` into a hashed filename in a repeatable way.
|
Convert `url` into a hashed filename in a repeatable way.
|
||||||
If `etag` is specified, append its hash to the url's, delimited
|
If `etag` is specified, append its hash to the url's, delimited
|
||||||
@@ -45,25 +54,25 @@ def url_to_filename(url: str, etag: str = None) -> str:
|
|||||||
return filename
|
return filename
|
||||||
|
|
||||||
|
|
||||||
def filename_to_url(filename: str, cache_dir: Union[str, Path] = None) -> Tuple[str, str]:
|
def filename_to_url(filename, cache_dir=None):
|
||||||
"""
|
"""
|
||||||
Return the url and etag (which may be ``None``) stored for `filename`.
|
Return the url and etag (which may be ``None``) stored for `filename`.
|
||||||
Raise ``FileNotFoundError`` if `filename` or its stored metadata do not exist.
|
Raise ``EnvironmentError`` if `filename` or its stored metadata do not exist.
|
||||||
"""
|
"""
|
||||||
if cache_dir is None:
|
if cache_dir is None:
|
||||||
cache_dir = PYTORCH_PRETRAINED_BERT_CACHE
|
cache_dir = PYTORCH_PRETRAINED_BERT_CACHE
|
||||||
if isinstance(cache_dir, Path):
|
if sys.version_info[0] == 3 and isinstance(cache_dir, Path):
|
||||||
cache_dir = str(cache_dir)
|
cache_dir = str(cache_dir)
|
||||||
|
|
||||||
cache_path = os.path.join(cache_dir, filename)
|
cache_path = os.path.join(cache_dir, filename)
|
||||||
if not os.path.exists(cache_path):
|
if not os.path.exists(cache_path):
|
||||||
raise FileNotFoundError("file {} not found".format(cache_path))
|
raise EnvironmentError("file {} not found".format(cache_path))
|
||||||
|
|
||||||
meta_path = cache_path + '.json'
|
meta_path = cache_path + '.json'
|
||||||
if not os.path.exists(meta_path):
|
if not os.path.exists(meta_path):
|
||||||
raise FileNotFoundError("file {} not found".format(meta_path))
|
raise EnvironmentError("file {} not found".format(meta_path))
|
||||||
|
|
||||||
with open(meta_path) as meta_file:
|
with open(meta_path, encoding="utf-8") as meta_file:
|
||||||
metadata = json.load(meta_file)
|
metadata = json.load(meta_file)
|
||||||
url = metadata['url']
|
url = metadata['url']
|
||||||
etag = metadata['etag']
|
etag = metadata['etag']
|
||||||
@@ -71,7 +80,7 @@ def filename_to_url(filename: str, cache_dir: Union[str, Path] = None) -> Tuple[
|
|||||||
return url, etag
|
return url, etag
|
||||||
|
|
||||||
|
|
||||||
def cached_path(url_or_filename: Union[str, Path], cache_dir: Union[str, Path] = None) -> str:
|
def cached_path(url_or_filename, cache_dir=None):
|
||||||
"""
|
"""
|
||||||
Given something that might be a URL (or might be a local path),
|
Given something that might be a URL (or might be a local path),
|
||||||
determine which. If it's a URL, download the file and cache it, and
|
determine which. If it's a URL, download the file and cache it, and
|
||||||
@@ -80,9 +89,9 @@ def cached_path(url_or_filename: Union[str, Path], cache_dir: Union[str, Path] =
|
|||||||
"""
|
"""
|
||||||
if cache_dir is None:
|
if cache_dir is None:
|
||||||
cache_dir = PYTORCH_PRETRAINED_BERT_CACHE
|
cache_dir = PYTORCH_PRETRAINED_BERT_CACHE
|
||||||
if isinstance(url_or_filename, Path):
|
if sys.version_info[0] == 3 and isinstance(url_or_filename, Path):
|
||||||
url_or_filename = str(url_or_filename)
|
url_or_filename = str(url_or_filename)
|
||||||
if isinstance(cache_dir, Path):
|
if sys.version_info[0] == 3 and isinstance(cache_dir, Path):
|
||||||
cache_dir = str(cache_dir)
|
cache_dir = str(cache_dir)
|
||||||
|
|
||||||
parsed = urlparse(url_or_filename)
|
parsed = urlparse(url_or_filename)
|
||||||
@@ -95,13 +104,13 @@ def cached_path(url_or_filename: Union[str, Path], cache_dir: Union[str, Path] =
|
|||||||
return url_or_filename
|
return url_or_filename
|
||||||
elif parsed.scheme == '':
|
elif parsed.scheme == '':
|
||||||
# File, but it doesn't exist.
|
# File, but it doesn't exist.
|
||||||
raise FileNotFoundError("file {} not found".format(url_or_filename))
|
raise EnvironmentError("file {} not found".format(url_or_filename))
|
||||||
else:
|
else:
|
||||||
# Something unknown
|
# Something unknown
|
||||||
raise ValueError("unable to parse {} as a URL or as a local path".format(url_or_filename))
|
raise ValueError("unable to parse {} as a URL or as a local path".format(url_or_filename))
|
||||||
|
|
||||||
|
|
||||||
def split_s3_path(url: str) -> Tuple[str, str]:
|
def split_s3_path(url):
|
||||||
"""Split a full s3 path into the bucket name and path."""
|
"""Split a full s3 path into the bucket name and path."""
|
||||||
parsed = urlparse(url)
|
parsed = urlparse(url)
|
||||||
if not parsed.netloc or not parsed.path:
|
if not parsed.netloc or not parsed.path:
|
||||||
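Reviewer note: the body of `split_s3_path` is elided between the hunks above. The sketch below shows how such a split can work with the Python 2/3 `urlparse` import fallback added in this commit. The function name `split_s3_path_sketch` is hypothetical, and treating the netloc as the bucket and the path (minus its leading slash) as the key is an assumption, not the elided code.

```python
try:
    from urllib.parse import urlparse  # Python 3
except ImportError:
    from urlparse import urlparse  # Python 2


def split_s3_path_sketch(url):
    """Hypothetical helper: split 's3://bucket/key' into (bucket, key)."""
    parsed = urlparse(url)
    if not parsed.netloc or not parsed.path:
        raise ValueError("bad s3 path {}".format(url))
    bucket_name = parsed.netloc
    s3_path = parsed.path
    # Assumption: the object key should not keep the leading '/' of the URL path.
    if s3_path.startswith("/"):
        s3_path = s3_path[1:]
    return bucket_name, s3_path
```

Under these assumptions, a URL with no path component (for example `s3://bucket`) is rejected with `ValueError` instead of yielding an empty key.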
@@ -114,19 +123,19 @@ def split_s3_path(url: str) -> Tuple[str, str]:
|
|||||||
return bucket_name, s3_path
|
return bucket_name, s3_path
|
||||||
|
|
||||||
|
|
||||||
def s3_request(func: Callable):
|
def s3_request(func):
|
||||||
"""
|
"""
|
||||||
Wrapper function for s3 requests in order to create more helpful error
|
Wrapper function for s3 requests in order to create more helpful error
|
||||||
messages.
|
messages.
|
||||||
"""
|
"""
|
||||||
|
|
||||||
@wraps(func)
|
@wraps(func)
|
||||||
def wrapper(url: str, *args, **kwargs):
|
def wrapper(url, *args, **kwargs):
|
||||||
try:
|
try:
|
||||||
return func(url, *args, **kwargs)
|
return func(url, *args, **kwargs)
|
||||||
except ClientError as exc:
|
except ClientError as exc:
|
||||||
if int(exc.response["Error"]["Code"]) == 404:
|
if int(exc.response["Error"]["Code"]) == 404:
|
||||||
raise FileNotFoundError("file {} not found".format(url))
|
raise EnvironmentError("file {} not found".format(url))
|
||||||
else:
|
else:
|
||||||
raise
|
raise
|
||||||
|
|
||||||
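Reviewer note: the error-translation pattern in `s3_request` can be tried without boto3. The sketch below is a minimal stand-alone version. `FakeClientError`, `translate_404` and `fetch` are hypothetical names for illustration only; the real decorator catches botocore's `ClientError`.

```python
from functools import wraps


class FakeClientError(Exception):
    """Hypothetical stand-in for botocore's ClientError (illustration only)."""

    def __init__(self, code):
        super(FakeClientError, self).__init__("client error {}".format(code))
        self.response = {"Error": {"Code": str(code)}}


def translate_404(func):
    """Re-raise a 404 client error as EnvironmentError; let other errors through."""

    @wraps(func)
    def wrapper(url, *args, **kwargs):
        try:
            return func(url, *args, **kwargs)
        except FakeClientError as exc:
            if int(exc.response["Error"]["Code"]) == 404:
                raise EnvironmentError("file {} not found".format(url))
            raise

    return wrapper


@translate_404
def fetch(url, code=None):
    """Hypothetical fetcher: fails with `code` if given, otherwise echoes the URL."""
    if code is not None:
        raise FakeClientError(code)
    return url
```

`EnvironmentError` exists on both Python 2 and Python 3 (where it is an alias of `OSError`), which is presumably why the commit prefers it over the Python 3-only `FileNotFoundError`.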
@@ -134,7 +143,7 @@ def s3_request(func: Callable):
|
|||||||
|
|
||||||
|
|
||||||
@s3_request
|
@s3_request
|
||||||
def s3_etag(url: str) -> Optional[str]:
|
def s3_etag(url):
|
||||||
"""Check ETag on S3 object."""
|
"""Check ETag on S3 object."""
|
||||||
s3_resource = boto3.resource("s3")
|
s3_resource = boto3.resource("s3")
|
||||||
bucket_name, s3_path = split_s3_path(url)
|
bucket_name, s3_path = split_s3_path(url)
|
||||||
@@ -143,14 +152,14 @@ def s3_etag(url: str) -> Optional[str]:
|
|||||||
|
|
||||||
|
|
||||||
@s3_request
|
@s3_request
|
||||||
def s3_get(url: str, temp_file: IO) -> None:
|
def s3_get(url, temp_file):
|
||||||
"""Pull a file directly from S3."""
|
"""Pull a file directly from S3."""
|
||||||
s3_resource = boto3.resource("s3")
|
s3_resource = boto3.resource("s3")
|
||||||
bucket_name, s3_path = split_s3_path(url)
|
bucket_name, s3_path = split_s3_path(url)
|
||||||
s3_resource.Bucket(bucket_name).download_fileobj(s3_path, temp_file)
|
s3_resource.Bucket(bucket_name).download_fileobj(s3_path, temp_file)
|
||||||
|
|
||||||
|
|
||||||
def http_get(url: str, temp_file: IO) -> None:
|
def http_get(url, temp_file):
|
||||||
req = requests.get(url, stream=True)
|
req = requests.get(url, stream=True)
|
||||||
content_length = req.headers.get('Content-Length')
|
content_length = req.headers.get('Content-Length')
|
||||||
total = int(content_length) if content_length is not None else None
|
total = int(content_length) if content_length is not None else None
|
||||||
@@ -162,17 +171,18 @@ def http_get(url: str, temp_file: IO) -> None:
|
|||||||
progress.close()
|
progress.close()
|
||||||
|
|
||||||
|
|
||||||
def get_from_cache(url: str, cache_dir: Union[str, Path] = None) -> str:
|
def get_from_cache(url, cache_dir=None):
|
||||||
"""
|
"""
|
||||||
Given a URL, look for the corresponding dataset in the local cache.
|
Given a URL, look for the corresponding dataset in the local cache.
|
||||||
If it's not there, download it. Then return the path to the cached file.
|
If it's not there, download it. Then return the path to the cached file.
|
||||||
"""
|
"""
|
||||||
if cache_dir is None:
|
if cache_dir is None:
|
||||||
cache_dir = PYTORCH_PRETRAINED_BERT_CACHE
|
cache_dir = PYTORCH_PRETRAINED_BERT_CACHE
|
||||||
if isinstance(cache_dir, Path):
|
if sys.version_info[0] == 3 and isinstance(cache_dir, Path):
|
||||||
cache_dir = str(cache_dir)
|
cache_dir = str(cache_dir)
|
||||||
|
|
||||||
os.makedirs(cache_dir, exist_ok=True)
|
if not os.path.exists(cache_dir):
|
||||||
|
os.makedirs(cache_dir)
|
||||||
|
|
||||||
# Get eTag to add to filename, if it exists.
|
# Get eTag to add to filename, if it exists.
|
||||||
if url.startswith("s3://"):
|
if url.startswith("s3://"):
|
||||||
@@ -213,7 +223,7 @@ def get_from_cache(url: str, cache_dir: Union[str, Path] = None) -> str:
|
|||||||
logger.info("creating metadata file for %s", cache_path)
|
logger.info("creating metadata file for %s", cache_path)
|
||||||
meta = {'url': url, 'etag': etag}
|
meta = {'url': url, 'etag': etag}
|
||||||
meta_path = cache_path + '.json'
|
meta_path = cache_path + '.json'
|
||||||
with open(meta_path, 'w') as meta_file:
|
with open(meta_path, 'w', encoding="utf-8") as meta_file:
|
||||||
json.dump(meta, meta_file)
|
json.dump(meta, meta_file)
|
||||||
|
|
||||||
logger.info("removing temp file %s", temp_file.name)
|
logger.info("removing temp file %s", temp_file.name)
|
||||||
@@ -221,7 +231,7 @@ def get_from_cache(url: str, cache_dir: Union[str, Path] = None) -> str:
|
|||||||
return cache_path
|
return cache_path
|
||||||
|
|
||||||
|
|
||||||
def read_set_from_file(filename: str) -> Set[str]:
|
def read_set_from_file(filename):
|
||||||
'''
|
'''
|
||||||
Extract a de-duped collection (set) of text from a file.
|
Extract a de-duped collection (set) of text from a file.
|
||||||
Expected file format is one item per line.
|
Expected file format is one item per line.
|
||||||
@@ -233,7 +243,7 @@ def read_set_from_file(filename: str) -> Set[str]:
|
|||||||
return collection
|
return collection
|
||||||
|
|
||||||
|
|
||||||
def get_file_extension(path: str, dot=True, lower: bool = True):
|
def get_file_extension(path, dot=True, lower=True):
|
||||||
ext = os.path.splitext(path)[1]
|
ext = os.path.splitext(path)[1]
|
||||||
ext = ext if dot else ext[1:]
|
ext = ext if dot else ext[1:]
|
||||||
return ext.lower() if lower else ext
|
return ext.lower() if lower else ext
|
||||||
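Reviewer note: two of the Python 2 accommodations in this file can be exercised in isolation. The sketch below is a variation on them rather than the committed code. `ensure_dir` replaces `os.makedirs(..., exist_ok=True)` by catching `EEXIST` (the commit checks `os.path.exists` first instead), and `read_text_or_none` shows that `except EnvironmentError` also catches a missing file on Python 3. Both helper names are hypothetical.

```python
import errno
import io
import os
import tempfile


def ensure_dir(path):
    """Create `path` if it is missing, without the Python 3-only `exist_ok` flag."""
    try:
        os.makedirs(path)
    except OSError as exc:
        # Tolerate a directory that already exists; re-raise anything else.
        if exc.errno != errno.EEXIST or not os.path.isdir(path):
            raise
    return path


def read_text_or_none(path):
    """Return the UTF-8 text of `path`, or None if the file cannot be opened."""
    try:
        with io.open(path, encoding="utf-8") as handle:
            return handle.read()
    except EnvironmentError:
        # On Python 3 this is OSError, the parent class of FileNotFoundError.
        return None


# Small demonstration in a throwaway directory.
demo_root = tempfile.mkdtemp()
cache_dir = ensure_dir(os.path.join(demo_root, "cache"))
ensure_dir(cache_dir)  # second call finds the directory and does nothing
missing = read_text_or_none(os.path.join(cache_dir, "missing.json"))
```

Catching `EEXIST` avoids the small window between the existence check and the `makedirs` call in which another process could create the directory.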
|
|||||||
@@ -15,18 +15,18 @@
|
|||||||
# limitations under the License.
|
# limitations under the License.
|
||||||
"""PyTorch BERT model."""
|
"""PyTorch BERT model."""
|
||||||
|
|
||||||
from __future__ import absolute_import
|
from __future__ import absolute_import, division, print_function, unicode_literals
|
||||||
from __future__ import division
|
|
||||||
from __future__ import print_function
|
|
||||||
|
|
||||||
import os
|
|
||||||
import copy
|
import copy
|
||||||
import json
|
import json
|
||||||
import math
|
|
||||||
import logging
|
import logging
|
||||||
|
import math
|
||||||
|
import os
|
||||||
|
import shutil
|
||||||
import tarfile
|
import tarfile
|
||||||
import tempfile
|
import tempfile
|
||||||
import shutil
|
import sys
|
||||||
|
from io import open
|
||||||
|
|
||||||
import torch
|
import torch
|
||||||
from torch import nn
|
from torch import nn
|
||||||
@@ -47,6 +47,68 @@ PRETRAINED_MODEL_ARCHIVE_MAP = {
|
|||||||
}
|
}
|
||||||
CONFIG_NAME = 'bert_config.json'
|
CONFIG_NAME = 'bert_config.json'
|
||||||
WEIGHTS_NAME = 'pytorch_model.bin'
|
WEIGHTS_NAME = 'pytorch_model.bin'
|
||||||
|
TF_WEIGHTS_NAME = 'model.ckpt'
|
||||||
|
|
||||||
|
def load_tf_weights_in_bert(model, tf_checkpoint_path):
|
||||||
|
""" Load tf checkpoints in a pytorch model
|
||||||
|
"""
|
||||||
|
try:
|
||||||
|
import re
|
||||||
|
import numpy as np
|
||||||
|
import tensorflow as tf
|
||||||
|
except ImportError:
|
||||||
|
print("Loading a TensorFlow model in PyTorch requires TensorFlow to be installed. Please see "
|
||||||
|
"https://www.tensorflow.org/install/ for installation instructions.")
|
||||||
|
raise
|
||||||
|
tf_path = os.path.abspath(tf_checkpoint_path)
|
||||||
|
print("Converting TensorFlow checkpoint from {}".format(tf_path))
|
||||||
|
# Load weights from TF model
|
||||||
|
init_vars = tf.train.list_variables(tf_path)
|
||||||
|
names = []
|
||||||
|
arrays = []
|
||||||
|
for name, shape in init_vars:
|
||||||
|
print("Loading TF weight {} with shape {}".format(name, shape))
|
||||||
|
array = tf.train.load_variable(tf_path, name)
|
||||||
|
names.append(name)
|
||||||
|
arrays.append(array)
|
||||||
|
|
||||||
|
for name, array in zip(names, arrays):
|
||||||
|
name = name.split('/')
|
||||||
|
# adam_v and adam_m are variables used in AdamWeightDecayOptimizer to calculate m and v
|
||||||
|
# which are not required for using pretrained model
|
||||||
|
if any(n in ["adam_v", "adam_m"] for n in name):
|
||||||
|
print("Skipping {}".format("/".join(name)))
|
||||||
|
continue
|
||||||
|
pointer = model
|
||||||
|
for m_name in name:
|
||||||
|
if re.match(r'[A-Za-z]+_\d+\Z', m_name):  # re.fullmatch is not available on Python 2
|
||||||
|
l = re.split(r'_(\d+)', m_name)
|
||||||
|
else:
|
||||||
|
l = [m_name]
|
||||||
|
if l[0] == 'kernel' or l[0] == 'gamma':
|
||||||
|
pointer = getattr(pointer, 'weight')
|
||||||
|
elif l[0] == 'output_bias' or l[0] == 'beta':
|
||||||
|
pointer = getattr(pointer, 'bias')
|
||||||
|
elif l[0] == 'output_weights':
|
||||||
|
pointer = getattr(pointer, 'weight')
|
||||||
|
else:
|
||||||
|
pointer = getattr(pointer, l[0])
|
||||||
|
if len(l) >= 2:
|
||||||
|
num = int(l[1])
|
||||||
|
pointer = pointer[num]
|
||||||
|
if m_name[-11:] == '_embeddings':
|
||||||
|
pointer = getattr(pointer, 'weight')
|
||||||
|
elif m_name == 'kernel':
|
||||||
|
array = np.transpose(array)
|
||||||
|
try:
|
||||||
|
assert pointer.shape == array.shape
|
||||||
|
except AssertionError as e:
|
||||||
|
e.args += (pointer.shape, array.shape)
|
||||||
|
raise
|
||||||
|
print("Initialize PyTorch weight {}".format(name))
|
||||||
|
pointer.data = torch.from_numpy(array)
|
||||||
|
return model
|
||||||
|
|
||||||
|
|
||||||
def gelu(x):
|
def gelu(x):
|
||||||
"""Implementation of the gelu activation function.
|
"""Implementation of the gelu activation function.
|
||||||
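Reviewer note: the name-mapping loop in `load_tf_weights_in_bert` can be checked without TensorFlow or PyTorch. The sketch below turns a TensorFlow variable name into the list of attribute names and indices that the loop walks. `split_scope`, `RENAMES` and `torch_attr_path` are hypothetical names. The sketch drops the empty string that `re.split` leaves at the end, and it omits the extra `weight` lookup for `*_embeddings` names and the kernel transpose.

```python
import re

# TensorFlow leaf names and the PyTorch attribute each one maps to in the loop above.
RENAMES = {
    'kernel': 'weight',
    'gamma': 'weight',
    'output_weights': 'weight',
    'output_bias': 'bias',
    'beta': 'bias',
}


def split_scope(m_name):
    """Split a scope such as 'layer_11' into ['layer', '11']; keep other names whole."""
    # re.match with \Z behaves like re.fullmatch, which needs Python 3.4 or later.
    if re.match(r'[A-Za-z]+_\d+\Z', m_name):
        return [part for part in re.split(r'_(\d+)', m_name) if part != '']
    return [m_name]


def torch_attr_path(tf_name):
    """Return the attribute names and integer indices to follow for `tf_name`."""
    path = []
    for m_name in tf_name.split('/'):
        parts = split_scope(m_name)
        path.append(RENAMES.get(parts[0], parts[0]))
        if len(parts) >= 2:
            path.append(int(parts[1]))
    return path
```

For example, `torch_attr_path('bert/encoder/layer_11/attention/self/query/kernel')` yields the steps `bert`, `encoder`, `layer`, index `11`, `attention`, `self`, `query`, `weight`.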
@@ -102,7 +164,8 @@ class BertConfig(object):
|
|||||||
initializer_range: The stddev of the truncated_normal_initializer for
|
initializer_range: The stddev of the truncated_normal_initializer for
|
||||||
initializing all weight matrices.
|
initializing all weight matrices.
|
||||||
"""
|
"""
|
||||||
if isinstance(vocab_size_or_config_json_file, str):
|
if isinstance(vocab_size_or_config_json_file, str) or (sys.version_info[0] == 2
|
||||||
|
and isinstance(vocab_size_or_config_json_file, unicode)):
|
||||||
with open(vocab_size_or_config_json_file, "r", encoding='utf-8') as reader:
|
with open(vocab_size_or_config_json_file, "r", encoding='utf-8') as reader:
|
||||||
json_config = json.loads(reader.read())
|
json_config = json.loads(reader.read())
|
||||||
for key, value in json_config.items():
|
for key, value in json_config.items():
|
||||||
@@ -281,8 +344,10 @@ class BertIntermediate(nn.Module):
|
|||||||
def __init__(self, config):
|
def __init__(self, config):
|
||||||
super(BertIntermediate, self).__init__()
|
super(BertIntermediate, self).__init__()
|
||||||
self.dense = nn.Linear(config.hidden_size, config.intermediate_size)
|
self.dense = nn.Linear(config.hidden_size, config.intermediate_size)
|
||||||
self.intermediate_act_fn = ACT2FN[config.hidden_act] \
|
if isinstance(config.hidden_act, str) or (sys.version_info[0] == 2 and isinstance(config.hidden_act, unicode)):
|
||||||
if isinstance(config.hidden_act, str) else config.hidden_act
|
self.intermediate_act_fn = ACT2FN[config.hidden_act]
|
||||||
|
else:
|
||||||
|
self.intermediate_act_fn = config.hidden_act
|
||||||
|
|
||||||
def forward(self, hidden_states):
|
def forward(self, hidden_states):
|
||||||
hidden_states = self.dense(hidden_states)
|
hidden_states = self.dense(hidden_states)
|
||||||
@@ -354,8 +419,10 @@ class BertPredictionHeadTransform(nn.Module):
|
|||||||
def __init__(self, config):
|
def __init__(self, config):
|
||||||
super(BertPredictionHeadTransform, self).__init__()
|
super(BertPredictionHeadTransform, self).__init__()
|
||||||
self.dense = nn.Linear(config.hidden_size, config.hidden_size)
|
self.dense = nn.Linear(config.hidden_size, config.hidden_size)
|
||||||
self.transform_act_fn = ACT2FN[config.hidden_act] \
|
if isinstance(config.hidden_act, str) or (sys.version_info[0] == 2 and isinstance(config.hidden_act, unicode)):
|
||||||
if isinstance(config.hidden_act, str) else config.hidden_act
|
self.transform_act_fn = ACT2FN[config.hidden_act]
|
||||||
|
else:
|
||||||
|
self.transform_act_fn = config.hidden_act
|
||||||
self.LayerNorm = BertLayerNorm(config.hidden_size, eps=1e-12)
|
self.LayerNorm = BertLayerNorm(config.hidden_size, eps=1e-12)
|
||||||
|
|
||||||
def forward(self, hidden_states):
|
def forward(self, hidden_states):
|
||||||
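Reviewer note: the `str`/`unicode` check is now repeated in `BertConfig`, `BertIntermediate` and `BertPredictionHeadTransform`. One possible way to centralise it is sketched below. `string_types` and `resolve_activation` are hypothetical names and are not part of this commit.

```python
import sys

if sys.version_info[0] == 2:
    # `unicode` only exists on Python 2; this branch is never evaluated on Python 3.
    string_types = (str, unicode)  # noqa: F821
else:
    string_types = (str,)


def resolve_activation(hidden_act, table):
    """Look up `hidden_act` in `table` if it is a name, else return it unchanged."""
    if isinstance(hidden_act, string_types):
        return table[hidden_act]
    return hidden_act
```

With `unicode_literals` imported at the top of the module, string literals are `unicode` on Python 2, so a plain `isinstance(x, str)` check would treat a configured name such as `"gelu"` as a callable.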
@@ -416,12 +483,12 @@ class BertPreTrainingHeads(nn.Module):
|
|||||||
return prediction_scores, seq_relationship_score
|
return prediction_scores, seq_relationship_score
|
||||||
|
|
||||||
|
|
||||||
class PreTrainedBertModel(nn.Module):
|
class BertPreTrainedModel(nn.Module):
|
||||||
""" An abstract class to handle weights initialization and
|
""" An abstract class to handle weights initialization and
|
||||||
a simple interface for downloading and loading pretrained models.
|
a simple interface for downloading and loading pretrained models.
|
||||||
"""
|
"""
|
||||||
def __init__(self, config, *inputs, **kwargs):
|
def __init__(self, config, *inputs, **kwargs):
|
||||||
super(PreTrainedBertModel, self).__init__()
|
super(BertPreTrainedModel, self).__init__()
|
||||||
if not isinstance(config, BertConfig):
|
if not isinstance(config, BertConfig):
|
||||||
raise ValueError(
|
raise ValueError(
|
||||||
"Parameter config in `{}(config)` should be an instance of class `BertConfig`. "
|
"Parameter config in `{}(config)` should be an instance of class `BertConfig`. "
|
||||||
@@ -445,13 +512,14 @@ class PreTrainedBertModel(nn.Module):
|
|||||||
module.bias.data.zero_()
|
module.bias.data.zero_()
|
||||||
|
|
||||||
@classmethod
|
@classmethod
|
||||||
def from_pretrained(cls, pretrained_model_name, state_dict=None, cache_dir=None, *inputs, **kwargs):
|
def from_pretrained(cls, pretrained_model_name_or_path, state_dict=None, cache_dir=None,
|
||||||
|
from_tf=False, *inputs, **kwargs):
|
||||||
"""
|
"""
|
||||||
Instantiate a PreTrainedBertModel from a pre-trained model file or a pytorch state dict.
|
Instantiate a BertPreTrainedModel from a pre-trained model file or a pytorch state dict.
|
||||||
Download and cache the pre-trained model file if needed.
|
Download and cache the pre-trained model file if needed.
|
||||||
|
|
||||||
Params:
|
Params:
|
||||||
pretrained_model_name: either:
|
pretrained_model_name_or_path: either:
|
||||||
- a str with the name of a pre-trained model to load selected in the list of:
|
- a str with the name of a pre-trained model to load selected in the list of:
|
||||||
. `bert-base-uncased`
|
. `bert-base-uncased`
|
||||||
. `bert-large-uncased`
|
. `bert-large-uncased`
|
||||||
@@ -463,24 +531,28 @@ class PreTrainedBertModel(nn.Module):
|
|||||||
- a path or url to a pretrained model archive containing:
|
- a path or url to a pretrained model archive containing:
|
||||||
. `bert_config.json` a configuration file for the model
|
. `bert_config.json` a configuration file for the model
|
||||||
. `pytorch_model.bin` a PyTorch dump of a BertForPreTraining instance
|
. `pytorch_model.bin` a PyTorch dump of a BertForPreTraining instance
|
||||||
|
- a path or url to a pretrained model archive containing:
|
||||||
|
. `bert_config.json` a configuration file for the model
|
||||||
|
. `model.ckpt` a TensorFlow checkpoint
|
||||||
|
from_tf: should we load the weights from a locally saved TensorFlow checkpoint
|
||||||
cache_dir: an optional path to a folder in which the pre-trained models will be cached.
|
cache_dir: an optional path to a folder in which the pre-trained models will be cached.
|
||||||
state_dict: an optional state dictionary (collections.OrderedDict object) to use instead of Google pre-trained models
|
state_dict: an optional state dictionary (collections.OrderedDict object) to use instead of Google pre-trained models
|
||||||
*inputs, **kwargs: additional input for the specific Bert class
|
*inputs, **kwargs: additional input for the specific Bert class
|
||||||
(ex: num_labels for BertForSequenceClassification)
|
(ex: num_labels for BertForSequenceClassification)
|
||||||
"""
|
"""
|
||||||
if pretrained_model_name in PRETRAINED_MODEL_ARCHIVE_MAP:
|
if pretrained_model_name_or_path in PRETRAINED_MODEL_ARCHIVE_MAP:
|
||||||
archive_file = PRETRAINED_MODEL_ARCHIVE_MAP[pretrained_model_name]
|
archive_file = PRETRAINED_MODEL_ARCHIVE_MAP[pretrained_model_name_or_path]
|
||||||
else:
|
else:
|
||||||
archive_file = pretrained_model_name
|
archive_file = pretrained_model_name_or_path
|
||||||
# redirect to the cache, if necessary
|
# redirect to the cache, if necessary
|
||||||
try:
|
try:
|
||||||
resolved_archive_file = cached_path(archive_file, cache_dir=cache_dir)
|
resolved_archive_file = cached_path(archive_file, cache_dir=cache_dir)
|
||||||
except FileNotFoundError:
|
except EnvironmentError:
|
||||||
logger.error(
|
logger.error(
|
||||||
"Model name '{}' was not found in model name list ({}). "
|
"Model name '{}' was not found in model name list ({}). "
|
||||||
"We assumed '{}' was a path or url but couldn't find any file "
|
"We assumed '{}' was a path or url but couldn't find any file "
|
||||||
"associated to this path or url.".format(
|
"associated to this path or url.".format(
|
||||||
pretrained_model_name,
|
pretrained_model_name_or_path,
|
||||||
', '.join(PRETRAINED_MODEL_ARCHIVE_MAP.keys()),
|
', '.join(PRETRAINED_MODEL_ARCHIVE_MAP.keys()),
|
||||||
archive_file))
|
archive_file))
|
||||||
return None
|
return None
|
||||||
@@ -490,7 +562,7 @@ class PreTrainedBertModel(nn.Module):
|
|||||||
logger.info("loading archive file {} from cache at {}".format(
|
logger.info("loading archive file {} from cache at {}".format(
|
||||||
archive_file, resolved_archive_file))
|
archive_file, resolved_archive_file))
|
||||||
tempdir = None
|
tempdir = None
|
||||||
if os.path.isdir(resolved_archive_file):
|
if os.path.isdir(resolved_archive_file) or from_tf:
|
||||||
serialization_dir = resolved_archive_file
|
serialization_dir = resolved_archive_file
|
||||||
else:
|
else:
|
||||||
# Extract archive to temp dir
|
# Extract archive to temp dir
|
||||||
@@ -506,10 +578,17 @@ class PreTrainedBertModel(nn.Module):
|
|||||||
logger.info("Model config {}".format(config))
|
logger.info("Model config {}".format(config))
|
||||||
# Instantiate model.
|
# Instantiate model.
|
||||||
model = cls(config, *inputs, **kwargs)
|
model = cls(config, *inputs, **kwargs)
|
||||||
if state_dict is None:
|
if state_dict is None and not from_tf:
|
||||||
weights_path = os.path.join(serialization_dir, WEIGHTS_NAME)
|
weights_path = os.path.join(serialization_dir, WEIGHTS_NAME)
|
||||||
state_dict = torch.load(weights_path)
|
state_dict = torch.load(weights_path, map_location='cpu' if not torch.cuda.is_available() else None)
|
||||||
|
if tempdir:
|
||||||
|
# Clean up temp dir
|
||||||
|
shutil.rmtree(tempdir)
|
||||||
|
if from_tf:
|
||||||
|
# Directly load from a TensorFlow checkpoint
|
||||||
|
weights_path = os.path.join(serialization_dir, TF_WEIGHTS_NAME)
|
||||||
|
return load_tf_weights_in_bert(model, weights_path)
|
||||||
|
# Load from a PyTorch state_dict
|
||||||
old_keys = []
|
old_keys = []
|
||||||
new_keys = []
|
new_keys = []
|
||||||
for key in state_dict.keys():
|
for key in state_dict.keys():
|
||||||
@@ -540,20 +619,23 @@ class PreTrainedBertModel(nn.Module):
|
|||||||
for name, child in module._modules.items():
|
for name, child in module._modules.items():
|
||||||
if child is not None:
|
if child is not None:
|
||||||
load(child, prefix + name + '.')
|
load(child, prefix + name + '.')
|
||||||
load(model, prefix='' if hasattr(model, 'bert') else 'bert.')
|
start_prefix = ''
|
||||||
|
if not hasattr(model, 'bert') and any(s.startswith('bert.') for s in state_dict.keys()):
|
||||||
|
start_prefix = 'bert.'
|
||||||
|
load(model, prefix=start_prefix)
|
||||||
if len(missing_keys) > 0:
|
if len(missing_keys) > 0:
|
||||||
logger.info("Weights of {} not initialized from pretrained model: {}".format(
|
logger.info("Weights of {} not initialized from pretrained model: {}".format(
|
||||||
model.__class__.__name__, missing_keys))
|
model.__class__.__name__, missing_keys))
|
||||||
if len(unexpected_keys) > 0:
|
if len(unexpected_keys) > 0:
|
||||||
logger.info("Weights from pretrained model not used in {}: {}".format(
|
logger.info("Weights from pretrained model not used in {}: {}".format(
|
||||||
model.__class__.__name__, unexpected_keys))
|
model.__class__.__name__, unexpected_keys))
|
||||||
if tempdir:
|
if len(error_msgs) > 0:
|
||||||
# Clean up temp dir
|
raise RuntimeError('Error(s) in loading state_dict for {}:\n\t{}'.format(
|
||||||
shutil.rmtree(tempdir)
|
model.__class__.__name__, "\n\t".join(error_msgs)))
|
||||||
return model
|
return model
|
||||||
|
|
||||||
|
|
||||||
class BertModel(PreTrainedBertModel):
|
class BertModel(BertPreTrainedModel):
|
||||||
"""BERT model ("Bidirectional Encoder Representations from Transformers").
|
"""BERT model ("Bidirectional Encoder Representations from Transformers").
|
||||||
|
|
||||||
Params:
|
Params:
|
||||||
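Reviewer note: the new `start_prefix` logic reduces to a decision over two inputs, sketched below with plain Python values. `pick_start_prefix` is a hypothetical name. The sketch reflects one reading of the change: the `bert.` prefix is applied only when the model has no `bert` attribute and the checkpoint keys actually carry that prefix.

```python
def pick_start_prefix(model_has_bert_attr, state_dict_keys):
    """Return the prefix under which the model's own weights sit in the checkpoint."""
    keys_have_prefix = any(key.startswith('bert.') for key in state_dict_keys)
    if not model_has_bert_attr and keys_have_prefix:
        # Bare BertModel loading a checkpoint saved from a model that wraps it.
        return 'bert.'
    return ''
```

The previous expression returned `'bert.'` for every model without a `bert` attribute, so it would presumably look under the wrong prefix when given a checkpoint saved from a bare `BertModel`, whose keys have no `bert.` prefix.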
@@ -581,7 +663,7 @@ class BertModel(PreTrainedBertModel):
|
|||||||
to the last attention block of shape [batch_size, sequence_length, hidden_size],
|
to the last attention block of shape [batch_size, sequence_length, hidden_size],
|
||||||
`pooled_output`: a torch.FloatTensor of size [batch_size, hidden_size] which is the output of a
|
`pooled_output`: a torch.FloatTensor of size [batch_size, hidden_size] which is the output of a
|
||||||
classifier pretrained on top of the hidden state associated to the first character of the
|
classifier pretrained on top of the hidden state associated to the first character of the
|
||||||
input (`CLF`) to train on the Next-Sentence task (see BERT's paper).
|
input (`CLS`) to train on the Next-Sentence task (see BERT's paper).
|
||||||
|
|
||||||
Example usage:
|
Example usage:
|
||||||
```python
|
```python
|
||||||
@@ -636,7 +718,7 @@ class BertModel(PreTrainedBertModel):
|
|||||||
return encoded_layers, pooled_output
|
return encoded_layers, pooled_output
|
||||||
|
|
||||||
|
|
||||||
class BertForPreTraining(PreTrainedBertModel):
|
class BertForPreTraining(BertPreTrainedModel):
|
||||||
"""BERT model with pre-training heads.
|
"""BERT model with pre-training heads.
|
||||||
This module comprises the BERT model followed by the two pre-training heads:
|
This module comprises the BERT model followed by the two pre-training heads:
|
||||||
- the masked language modeling head, and
|
- the masked language modeling head, and
|
||||||
@@ -656,10 +738,10 @@ class BertForPreTraining(PreTrainedBertModel):
|
|||||||
selected in [0, 1]. It's a mask to be used if the input sequence length is smaller than the max
|
selected in [0, 1]. It's a mask to be used if the input sequence length is smaller than the max
|
||||||
input sequence length in the current batch. It's the mask that we typically use for attention when
|
input sequence length in the current batch. It's the mask that we typically use for attention when
|
||||||
a batch has varying length sentences.
|
a batch has varying length sentences.
|
||||||
`masked_lm_labels`: masked language modeling labels: torch.LongTensor of shape [batch_size, sequence_length]
|
`masked_lm_labels`: optional masked language modeling labels: torch.LongTensor of shape [batch_size, sequence_length]
|
||||||
with indices selected in [-1, 0, ..., vocab_size]. All labels set to -1 are ignored (masked), the loss
|
with indices selected in [-1, 0, ..., vocab_size]. All labels set to -1 are ignored (masked), the loss
|
||||||
is only computed for the labels set in [0, ..., vocab_size]
|
is only computed for the labels set in [0, ..., vocab_size]
|
||||||
`next_sentence_label`: next sentence classification loss: torch.LongTensor of shape [batch_size]
|
`next_sentence_label`: optional next sentence classification loss: torch.LongTensor of shape [batch_size]
|
||||||
with indices selected in [0, 1].
|
with indices selected in [0, 1].
|
||||||
0 => next sentence is the continuation, 1 => next sentence is a random sentence.
|
0 => next sentence is the continuation, 1 => next sentence is a random sentence.
|
||||||
|
|
||||||
@@ -707,7 +789,7 @@ class BertForPreTraining(PreTrainedBertModel):
|
|||||||
return prediction_scores, seq_relationship_score
|
return prediction_scores, seq_relationship_score
|
||||||
|
|
||||||
|
|
||||||
class BertForMaskedLM(PreTrainedBertModel):
|
class BertForMaskedLM(BertPreTrainedModel):
|
||||||
"""BERT model with the masked language modeling head.
|
"""BERT model with the masked language modeling head.
|
||||||
This module comprises the BERT model followed by the masked language modeling head.
|
This module comprises the BERT model followed by the masked language modeling head.
|
||||||
|
|
||||||
@@ -768,7 +850,7 @@ class BertForMaskedLM(PreTrainedBertModel):
|
|||||||
return prediction_scores
|
return prediction_scores
|
||||||
|
|
||||||
|
|
||||||
class BertForNextSentencePrediction(PreTrainedBertModel):
|
class BertForNextSentencePrediction(BertPreTrainedModel):
|
||||||
"""BERT model with next sentence prediction head.
|
"""BERT model with next sentence prediction head.
|
||||||
This module comprises the BERT model followed by the next sentence classification head.
|
This module comprises the BERT model followed by the next sentence classification head.
|
||||||
|
|
||||||
@@ -830,7 +912,7 @@ class BertForNextSentencePrediction(PreTrainedBertModel):
|
|||||||
return seq_relationship_score
|
return seq_relationship_score
|
||||||
|
|
||||||
|
|
||||||
class BertForSequenceClassification(PreTrainedBertModel):
|
class BertForSequenceClassification(BertPreTrainedModel):
|
||||||
"""BERT model for classification.
|
"""BERT model for classification.
|
||||||
This module is composed of the BERT model with a linear layer on top of
|
This module is composed of the BERT model with a linear layer on top of
|
||||||
the pooled output.
|
the pooled output.
|
||||||
@@ -875,7 +957,7 @@ class BertForSequenceClassification(PreTrainedBertModel):
|
|||||||
logits = model(input_ids, token_type_ids, input_mask)
|
logits = model(input_ids, token_type_ids, input_mask)
|
||||||
```
|
```
|
||||||
"""
|
"""
|
||||||
def __init__(self, config, num_labels=2):
|
def __init__(self, config, num_labels):
|
||||||
super(BertForSequenceClassification, self).__init__(config)
|
super(BertForSequenceClassification, self).__init__(config)
|
||||||
self.num_labels = num_labels
|
self.num_labels = num_labels
|
||||||
self.bert = BertModel(config)
|
self.bert = BertModel(config)
|
||||||
@@ -896,7 +978,7 @@ class BertForSequenceClassification(PreTrainedBertModel):
|
|||||||
return logits
|
return logits
|
||||||
|
|
||||||
|
|
||||||
class BertForMultipleChoice(PreTrainedBertModel):
|
class BertForMultipleChoice(BertPreTrainedModel):
|
||||||
"""BERT model for multiple choice tasks.
|
"""BERT model for multiple choice tasks.
|
||||||
This module is composed of the BERT model with a linear layer on top of
|
This module is composed of the BERT model with a linear layer on top of
|
||||||
the pooled output.
|
the pooled output.
|
||||||
@@ -940,7 +1022,7 @@ class BertForMultipleChoice(PreTrainedBertModel):
|
|||||||
logits = model(input_ids, token_type_ids, input_mask)
|
logits = model(input_ids, token_type_ids, input_mask)
|
||||||
```
|
```
|
||||||
"""
|
"""
|
||||||
def __init__(self, config, num_choices=2):
|
def __init__(self, config, num_choices):
|
||||||
super(BertForMultipleChoice, self).__init__(config)
|
super(BertForMultipleChoice, self).__init__(config)
|
||||||
self.num_choices = num_choices
|
self.num_choices = num_choices
|
||||||
self.bert = BertModel(config)
|
self.bert = BertModel(config)
|
||||||
@@ -965,7 +1047,7 @@ class BertForMultipleChoice(PreTrainedBertModel):
|
|||||||
return reshaped_logits
|
return reshaped_logits
|
||||||
|
|
||||||
|
|
||||||
class BertForTokenClassification(PreTrainedBertModel):
|
class BertForTokenClassification(BertPreTrainedModel):
|
||||||
"""BERT model for token-level classification.
|
"""BERT model for token-level classification.
|
||||||
This module is composed of the BERT model with a linear layer on top of
|
This module is composed of the BERT model with a linear layer on top of
|
||||||
the full hidden state of the last layer.
|
the full hidden state of the last layer.
|
||||||
@@ -1010,7 +1092,7 @@ class BertForTokenClassification(PreTrainedBertModel):
|
|||||||
logits = model(input_ids, token_type_ids, input_mask)
|
logits = model(input_ids, token_type_ids, input_mask)
|
||||||
```
|
```
|
||||||
"""
|
"""
|
||||||
def __init__(self, config, num_labels=2):
|
def __init__(self, config, num_labels):
|
||||||
super(BertForTokenClassification, self).__init__(config)
|
super(BertForTokenClassification, self).__init__(config)
|
||||||
self.num_labels = num_labels
|
self.num_labels = num_labels
|
||||||
self.bert = BertModel(config)
|
self.bert = BertModel(config)
|
||||||
@@ -1038,7 +1120,7 @@ class BertForTokenClassification(PreTrainedBertModel):
|
|||||||
return logits
|
return logits
|
||||||
|
|
||||||
|
|
||||||
class BertForQuestionAnswering(PreTrainedBertModel):
|
class BertForQuestionAnswering(BertPreTrainedModel):
|
||||||
"""BERT model for Question Answering (span extraction).
|
"""BERT model for Question Answering (span extraction).
|
||||||
This module is composed of the BERT model with a linear layer on top of
|
This module is composed of the BERT model with a linear layer on top of
|
||||||
the sequence output that computes start_logits and end_logits
|
the sequence output that computes start_logits and end_logits
|
||||||
|
|||||||
810
pytorch_pretrained_bert/modeling_openai.py
Normal file
@@ -0,0 +1,810 @@
|
|||||||
|
# coding=utf-8
|
||||||
|
# Copyright 2018 The OpenAI Team Authors and HuggingFace Inc. team.
|
||||||
|
# Copyright (c) 2018, NVIDIA CORPORATION. All rights reserved.
|
||||||
|
#
|
||||||
|
# Licensed under the Apache License, Version 2.0 (the "License");
|
||||||
|
# you may not use this file except in compliance with the License.
|
||||||
|
# You may obtain a copy of the License at
|
||||||
|
#
|
||||||
|
# http://www.apache.org/licenses/LICENSE-2.0
|
||||||
|
#
|
||||||
|
# Unless required by applicable law or agreed to in writing, software
|
||||||
|
# distributed under the License is distributed on an "AS IS" BASIS,
|
||||||
|
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
|
||||||
|
# See the License for the specific language governing permissions and
|
||||||
|
# limitations under the License.
|
||||||
|
"""PyTorch OpenAI GPT model."""
|
||||||
|
|
||||||
|
import collections
|
||||||
|
import copy
|
||||||
|
import json
|
||||||
|
import logging
|
||||||
|
import math
|
||||||
|
import os
|
||||||
|
import shutil
|
||||||
|
import tarfile
|
||||||
|
import tempfile
|
||||||
|
import sys
|
||||||
|
from io import open
|
||||||
|
|
||||||
|
import torch
|
||||||
|
import torch.nn as nn
|
||||||
|
from torch.nn import CrossEntropyLoss
|
||||||
|
from torch.nn.parameter import Parameter
|
||||||
|
|
||||||
|
from .file_utils import cached_path
|
||||||
|
from .modeling import BertLayerNorm as LayerNorm
|
||||||
|
|
||||||
|
logger = logging.getLogger(__name__)
|
||||||
|
|
||||||
|
PRETRAINED_MODEL_ARCHIVE_MAP = {"openai-gpt": "https://s3.amazonaws.com/models.huggingface.co/bert/openai-gpt-pytorch_model.bin"}
|
||||||
|
PRETRAINED_CONFIG_ARCHIVE_MAP = {"openai-gpt": "https://s3.amazonaws.com/models.huggingface.co/bert/openai-gpt-config.json"}
|
||||||
|
|
||||||
|
CONFIG_NAME = "config.json"
|
||||||
|
WEIGHTS_NAME = "pytorch_model.bin"
|
||||||
|
|
||||||
|
def load_tf_weights_in_openai_gpt(model, openai_checkpoint_folder_path):
|
||||||
|
""" Load tf pre-trained weights in a pytorch model (from NumPy arrays here)
|
||||||
|
"""
|
||||||
|
import re
|
||||||
|
import numpy as np
|
||||||
|
print("Loading weights...")
|
||||||
|
names = json.load(open(openai_checkpoint_folder_path + '/parameters_names.json', "r", encoding='utf-8'))
|
||||||
|
shapes = json.load(open(openai_checkpoint_folder_path + '/params_shapes.json', "r", encoding='utf-8'))
|
||||||
|
offsets = np.cumsum([np.prod(shape) for shape in shapes])
|
||||||
|
init_params = [np.load(openai_checkpoint_folder_path + '/params_{}.npy'.format(n)) for n in range(10)]
|
||||||
|
init_params = np.split(np.concatenate(init_params, 0), offsets)[:-1]
|
||||||
|
init_params = [param.reshape(shape) for param, shape in zip(init_params, shapes)]
|
||||||
|
|
||||||
|
# This was used when we had a single embedding matrix for positions and tokens
|
||||||
|
# init_params[0] = np.concatenate([init_params[1], init_params[0]], 0)
|
||||||
|
# del init_params[1]
|
||||||
|
init_params = [arr.squeeze() for arr in init_params]
|
||||||
|
|
||||||
|
try:
|
||||||
|
assert model.tokens_embed.weight.shape == init_params[1].shape
|
||||||
|
assert model.positions_embed.weight.shape == init_params[0].shape
|
||||||
|
except AssertionError as e:
|
||||||
|
e.args += (model.tokens_embed.weight.shape, init_params[1].shape)
|
||||||
|
e.args += (model.positions_embed.weight.shape, init_params[0].shape)
|
||||||
|
raise
|
||||||
|
|
||||||
|
model.tokens_embed.weight.data = torch.from_numpy(init_params[1])
|
||||||
|
model.positions_embed.weight.data = torch.from_numpy(init_params[0])
|
||||||
|
names.pop(0)
|
||||||
|
# Pop position and token embedding arrays
|
||||||
|
init_params.pop(0)
|
||||||
|
init_params.pop(0)
|
||||||
|
|
||||||
|
for name, array in zip(names, init_params): # names[1:n_transfer], init_params[1:n_transfer]):
|
||||||
|
name = name[6:] # skip "model/"
|
||||||
|
assert name[-2:] == ":0"
|
||||||
|
name = name[:-2]
|
||||||
|
name = name.split('/')
|
||||||
|
pointer = model
|
||||||
|
for m_name in name:
|
||||||
|
if re.fullmatch(r'[A-Za-z]+\d+', m_name):
|
||||||
|
l = re.split(r'(\d+)', m_name)
|
||||||
|
else:
|
||||||
|
l = [m_name]
|
||||||
|
if l[0] == 'g':
|
||||||
|
pointer = getattr(pointer, 'weight')
|
||||||
|
elif l[0] == 'b':
|
||||||
|
pointer = getattr(pointer, 'bias')
|
||||||
|
elif l[0] == 'w':
|
||||||
|
pointer = getattr(pointer, 'weight')
|
||||||
|
else:
|
||||||
|
pointer = getattr(pointer, l[0])
|
||||||
|
if len(l) >= 2:
|
||||||
|
num = int(l[1])
|
||||||
|
pointer = pointer[num]
|
||||||
|
try:
|
||||||
|
assert pointer.shape == array.shape
|
||||||
|
except AssertionError as e:
|
||||||
|
e.args += (pointer.shape, array.shape)
|
||||||
|
raise
|
||||||
|
print("Initialize PyTorch weight {}".format(name))
|
||||||
|
pointer.data = torch.from_numpy(array)
|
||||||
|
return model
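The name-walking loop in `load_tf_weights_in_openai_gpt` above can be sketched in isolation. This is a minimal illustration, not the loader itself: the helper name `parse_tf_variable_name` and the example variable names are assumptions made for the sketch. Note that `re.fullmatch` is only available from Python 3.4 onward, which matters for the Python 2 job this PR adds.

```python
import re

# Hypothetical helper: turns a TF variable name into the (attribute, index)
# steps that the loader follows with getattr() and indexing.
def parse_tf_variable_name(name):
    assert name.startswith("model/") and name.endswith(":0")
    steps = []
    for part in name[len("model/"):-len(":0")].split("/"):
        if re.fullmatch(r"[A-Za-z]+\d+", part):
            pieces = re.split(r"(\d+)", part)  # "h0" -> ["h", "0", ""]
        else:
            pieces = [part]  # e.g. "c_attn" or "ln_1" stay whole
        # "g" and "w" map to .weight, "b" maps to .bias
        attr = {"g": "weight", "w": "weight", "b": "bias"}.get(pieces[0], pieces[0])
        index = int(pieces[1]) if len(pieces) >= 2 else None
        steps.append((attr, index))
    return steps


print(parse_tf_variable_name("model/h0/attn/c_attn/w:0"))
```

Names such as `ln_1` contain an underscore, so they do not match the letters-then-digits pattern and are looked up as a single attribute.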
|
||||||
|
|
||||||
|
|
||||||
|
def gelu(x):
|
||||||
|
return 0.5 * x * (1 + torch.tanh(math.sqrt(2 / math.pi) * (x + 0.044715 * torch.pow(x, 3))))
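The `gelu` above is the tanh approximation of the Gaussian Error Linear Unit. As a rough sanity check, the sketch below compares it with the erf-based definition on scalars, using only the standard library. The function names `gelu_tanh` and `gelu_exact` are illustrative and not part of this file.

```python
import math

def gelu_tanh(x):
    # Same formula as gelu() above, written for a Python float
    return 0.5 * x * (1 + math.tanh(math.sqrt(2 / math.pi) * (x + 0.044715 * x ** 3)))

def gelu_exact(x):
    # x * Phi(x), with Phi the standard normal CDF expressed via erf
    return 0.5 * x * (1 + math.erf(x / math.sqrt(2)))

# Largest absolute gap on a grid over [-5, 5]
grid = [i / 100.0 for i in range(-500, 501)]
max_gap = max(abs(gelu_tanh(x) - gelu_exact(x)) for x in grid)
print(max_gap)
```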
|
||||||
|
|
||||||
|
|
||||||
|
def swish(x):
|
||||||
|
return x * torch.sigmoid(x)
|
||||||
|
|
||||||
|
|
||||||
|
ACT_FNS = {"relu": nn.functional.relu, "swish": swish, "gelu": gelu}
|
||||||
|
|
||||||
|
|
||||||
|
class OpenAIGPTConfig(object):
|
||||||
|
"""Configuration class to store the configuration of a `OpenAIGPTModel`.
|
||||||
|
"""
|
||||||
|
|
||||||
|
def __init__(
|
||||||
|
self,
|
||||||
|
vocab_size_or_config_json_file=40478,
|
||||||
|
n_special=0,
|
||||||
|
n_positions=512,
|
||||||
|
n_ctx=512,
|
||||||
|
n_embd=768,
|
||||||
|
n_layer=12,
|
||||||
|
n_head=12,
|
||||||
|
afn="gelu",
|
||||||
|
resid_pdrop=0.1,
|
||||||
|
embd_pdrop=0.1,
|
||||||
|
attn_pdrop=0.1,
|
||||||
|
layer_norm_epsilon=1e-5,
|
||||||
|
initializer_range=0.02,
|
||||||
|
):
|
||||||
|
"""Constructs OpenAIGPTConfig.
|
||||||
|
|
||||||
|
Args:
|
||||||
|
vocab_size_or_config_json_file: Vocabulary size of `inputs_ids` in `OpenAIGPTModel` or a configuration json file.
|
||||||
|
n_special: The number of special tokens to learn during fine-tuning ('[SEP]', '[CLF]', ...)
|
||||||
|
n_positions: Number of positional embeddings.
|
||||||
|
n_ctx: Size of the causal mask (usually same as n_positions).
|
||||||
|
n_embd: Dimensionality of the embeddings and hidden states.
|
||||||
|
n_layer: Number of hidden layers in the Transformer encoder.
|
||||||
|
n_head: Number of attention heads for each attention layer in
|
||||||
|
the Transformer encoder.
|
||||||
|
afn: The non-linear activation function (function or string) in the
|
||||||
|
encoder and pooler. If string, "gelu", "relu" and "swish" are supported.
|
||||||
|
resid_pdrop: The dropout probability for all fully connected
|
||||||
|
layers in the embeddings, encoder, and pooler.
|
||||||
|
attn_pdrop: The dropout ratio for the attention
|
||||||
|
probabilities.
|
||||||
|
embd_pdrop: The dropout ratio for the embeddings.
|
||||||
|
layer_norm_epsilon: epsilon to use in the layer norm layers
|
||||||
|
initializer_range: The stddev of the truncated_normal_initializer for
|
||||||
|
initializing all weight matrices.
|
||||||
|
"""
|
||||||
|
if isinstance(vocab_size_or_config_json_file, str) or (sys.version_info[0] == 2
|
||||||
|
and isinstance(vocab_size_or_config_json_file, unicode)):
|
||||||
|
with open(vocab_size_or_config_json_file, "r", encoding="utf-8") as reader:
|
||||||
|
json_config = json.loads(reader.read())
|
||||||
|
for key, value in json_config.items():
|
||||||
|
self.__dict__[key] = value
|
||||||
|
elif isinstance(vocab_size_or_config_json_file, int):
|
||||||
|
self.vocab_size = vocab_size_or_config_json_file
|
||||||
|
self.n_special = n_special
|
||||||
|
self.n_ctx = n_ctx
|
||||||
|
self.n_positions = n_positions
|
||||||
|
self.n_embd = n_embd
|
||||||
|
self.n_layer = n_layer
|
||||||
|
self.n_head = n_head
|
||||||
|
self.afn = afn
|
||||||
|
self.resid_pdrop = resid_pdrop
|
||||||
|
self.embd_pdrop = embd_pdrop
|
||||||
|
self.attn_pdrop = attn_pdrop
|
||||||
|
self.layer_norm_epsilon = layer_norm_epsilon
|
||||||
|
self.initializer_range = initializer_range
|
||||||
|
else:
|
||||||
|
raise ValueError(
|
||||||
|
"First argument must be either a vocabulary size (int) "
|
||||||
|
"or the path to a pretrained model config file (str)"
|
||||||
|
)
|
||||||
|
|
||||||
|
@property
|
||||||
|
def total_tokens_embeddings(self):
|
||||||
|
return self.vocab_size + self.n_special
|
||||||
|
|
||||||
|
@classmethod
|
||||||
|
def from_dict(cls, json_object):
|
||||||
|
"""Constructs a `OpenAIGPTConfig` from a Python dictionary of parameters."""
|
||||||
|
config = OpenAIGPTConfig(vocab_size_or_config_json_file=-1)
|
||||||
|
for key, value in json_object.items():
|
||||||
|
config.__dict__[key] = value
|
||||||
|
return config
|
||||||
|
|
||||||
|
@classmethod
|
||||||
|
def from_json_file(cls, json_file):
|
||||||
|
"""Constructs a `OpenAIGPTConfig` from a json file of parameters."""
|
||||||
|
with open(json_file, "r", encoding="utf-8") as reader:
|
||||||
|
text = reader.read()
|
||||||
|
return cls.from_dict(json.loads(text))
|
||||||
|
|
||||||
|
def __repr__(self):
|
||||||
|
return str(self.to_json_string())
|
||||||
|
|
||||||
|
def to_dict(self):
|
||||||
|
"""Serializes this instance to a Python dictionary."""
|
||||||
|
output = copy.deepcopy(self.__dict__)
|
||||||
|
return output
|
||||||
|
|
||||||
|
def to_json_string(self):
|
||||||
|
"""Serializes this instance to a JSON string."""
|
||||||
|
return json.dumps(self.to_dict(), indent=2, sort_keys=True) + "\n"
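The `total_tokens_embeddings` property above implies that special-token ids are appended after the pre-trained vocabulary. A minimal sketch of the resulting id ranges follows. The helper `embedding_index_ranges` is hypothetical, and `n_special=3` is an arbitrary value chosen for illustration.

```python
def embedding_index_ranges(vocab_size, n_special):
    """Return (word_ids, special_ids) as half-open range objects."""
    total = vocab_size + n_special  # mirrors total_tokens_embeddings
    return range(0, vocab_size), range(vocab_size, total)


word_ids, special_ids = embedding_index_ranges(vocab_size=40478, n_special=3)
print(list(special_ids))  # [40478, 40479, 40480]
```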
|
||||||
|
|
||||||
|
|
||||||
|
class Conv1D(nn.Module):
|
||||||
|
def __init__(self, nf, rf, nx):
|
||||||
|
super(Conv1D, self).__init__()
|
||||||
|
self.rf = rf
|
||||||
|
self.nf = nf
|
||||||
|
if rf == 1: # faster 1x1 conv
|
||||||
|
w = torch.empty(nx, nf)
|
||||||
|
nn.init.normal_(w, std=0.02)
|
||||||
|
self.weight = Parameter(w)
|
||||||
|
self.bias = Parameter(torch.zeros(nf))
|
||||||
|
else: # was used to train LM
|
||||||
|
raise NotImplementedError
|
||||||
|
|
||||||
|
def forward(self, x):
|
||||||
|
if self.rf == 1:
|
||||||
|
size_out = x.size()[:-1] + (self.nf,)
|
||||||
|
x = torch.addmm(self.bias, x.view(-1, x.size(-1)), self.weight)
|
||||||
|
x = x.view(*size_out)
|
||||||
|
else:
|
||||||
|
raise NotImplementedError
|
||||||
|
return x
|
||||||
|
|
||||||
|
|
||||||
|
class Attention(nn.Module):
|
||||||
|
def __init__(self, nx, n_ctx, config, scale=False):
|
||||||
|
super(Attention, self).__init__()
|
||||||
|
n_state = nx # in Attention: n_state=768 (nx=n_embd)
|
||||||
|
# [switch nx => n_state from Block to Attention to keep identical to TF implem]
|
||||||
|
assert n_state % config.n_head == 0
|
||||||
|
self.register_buffer("bias", torch.tril(torch.ones(n_ctx, n_ctx)).view(1, 1, n_ctx, n_ctx))
|
||||||
|
self.n_head = config.n_head
|
||||||
|
self.split_size = n_state
|
||||||
|
self.scale = scale
|
||||||
|
self.c_attn = Conv1D(n_state * 3, 1, nx)
|
||||||
|
self.c_proj = Conv1D(n_state, 1, nx)
|
||||||
|
self.attn_dropout = nn.Dropout(config.attn_pdrop)
|
||||||
|
self.resid_dropout = nn.Dropout(config.resid_pdrop)
|
||||||
|
|
||||||
|
def _attn(self, q, k, v):
|
||||||
|
w = torch.matmul(q, k)
|
||||||
|
if self.scale:
|
||||||
|
w = w / math.sqrt(v.size(-1))
|
||||||
|
# w = w * self.bias + -1e9 * (1 - self.bias) # TF implem method: mask_attn_weights
|
||||||
|
# XD: self.bias may be larger than w, so we need to crop it
|
||||||
|
b = self.bias[:, :, : w.size(-2), : w.size(-1)]
|
||||||
|
w = w * b + -1e9 * (1 - b)
|
||||||
|
|
||||||
|
w = nn.Softmax(dim=-1)(w)
|
||||||
|
w = self.attn_dropout(w)
|
||||||
|
return torch.matmul(w, v)
|
||||||
|
|
||||||
|
def merge_heads(self, x):
|
||||||
|
x = x.permute(0, 2, 1, 3).contiguous()
|
||||||
|
new_x_shape = x.size()[:-2] + (x.size(-2) * x.size(-1),)
|
||||||
|
return x.view(*new_x_shape) # in Tensorflow implem: fct merge_states
|
||||||
|
|
||||||
|
def split_heads(self, x, k=False):
|
||||||
|
new_x_shape = x.size()[:-1] + (self.n_head, x.size(-1) // self.n_head)
|
||||||
|
x = x.view(*new_x_shape) # in Tensorflow implem: fct split_states
|
||||||
|
if k:
|
||||||
|
return x.permute(0, 2, 3, 1)
|
||||||
|
else:
|
||||||
|
return x.permute(0, 2, 1, 3)
|
||||||
|
|
||||||
|
def forward(self, x):
|
||||||
|
x = self.c_attn(x)
|
||||||
|
query, key, value = x.split(self.split_size, dim=2)
|
||||||
|
query = self.split_heads(query)
|
||||||
|
key = self.split_heads(key, k=True)
|
||||||
|
value = self.split_heads(value)
|
||||||
|
a = self._attn(query, key, value)
|
||||||
|
a = self.merge_heads(a)
|
||||||
|
a = self.c_proj(a)
|
||||||
|
a = self.resid_dropout(a)
|
||||||
|
return a
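The masking step in `Attention._attn` above (`w * b + -1e9 * (1 - b)` followed by a softmax) can be illustrated without PyTorch. The sketch below uses plain Python lists for a single head. `causal_mask` and `masked_softmax` are illustrative names, and the scores are made up.

```python
import math

def causal_mask(n_ctx):
    # Lower-triangular matrix of ones, like torch.tril(torch.ones(n_ctx, n_ctx))
    return [[1.0 if j <= i else 0.0 for j in range(n_ctx)] for i in range(n_ctx)]

def masked_softmax(scores, mask):
    out = []
    for row, mask_row in zip(scores, mask):
        # Keep visible scores, push future positions to a large negative value
        masked = [s * m + -1e9 * (1 - m) for s, m in zip(row, mask_row)]
        top = max(masked)
        exps = [math.exp(v - top) for v in masked]
        total = sum(exps)
        out.append([e / total for e in exps])
    return out


scores = [[0.5, 2.0, -1.0], [1.0, 1.0, 3.0], [0.0, 0.0, 0.0]]
probs = masked_softmax(scores, causal_mask(3))
print(probs[0])  # position 0 can only attend to itself
```

Each row still sums to one, and every position to the right of the diagonal receives zero weight.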
|
||||||
|
|
||||||
|
|
||||||
|
class MLP(nn.Module):
|
||||||
|
def __init__(self, n_state, config): # in MLP: n_state=3072 (4 * n_embd)
|
||||||
|
super(MLP, self).__init__()
|
||||||
|
nx = config.n_embd
|
||||||
|
self.c_fc = Conv1D(n_state, 1, nx)
|
||||||
|
self.c_proj = Conv1D(nx, 1, n_state)
|
||||||
|
self.act = ACT_FNS[config.afn]
|
||||||
|
self.dropout = nn.Dropout(config.resid_pdrop)
|
||||||
|
|
||||||
|
def forward(self, x):
|
||||||
|
h = self.act(self.c_fc(x))
|
||||||
|
h2 = self.c_proj(h)
|
||||||
|
return self.dropout(h2)
|
||||||
|
|
||||||
|
|
||||||
|
class Block(nn.Module):
|
||||||
|
def __init__(self, n_ctx, config, scale=False):
|
||||||
|
super(Block, self).__init__()
|
||||||
|
nx = config.n_embd
|
||||||
|
self.attn = Attention(nx, n_ctx, config, scale)
|
||||||
|
self.ln_1 = LayerNorm(nx, eps=config.layer_norm_epsilon)
|
||||||
|
self.mlp = MLP(4 * nx, config)
|
||||||
|
self.ln_2 = LayerNorm(nx, eps=config.layer_norm_epsilon)
|
||||||
|
|
||||||
|
def forward(self, x):
|
||||||
|
a = self.attn(x)
|
||||||
|
n = self.ln_1(x + a)
|
||||||
|
m = self.mlp(n)
|
||||||
|
h = self.ln_2(n + m)
|
||||||
|
return h
|
||||||
|
|
||||||
|
|
||||||
|
class OpenAIGPTLMHead(nn.Module):
|
||||||
|
""" Language Model Head for the transformer """
|
||||||
|
|
||||||
|
def __init__(self, model_embeddings_weights, config):
|
||||||
|
super(OpenAIGPTLMHead, self).__init__()
|
||||||
|
self.n_embd = config.n_embd
|
||||||
|
self.set_embeddings_weights(model_embeddings_weights)
|
||||||
|
|
||||||
|
def set_embeddings_weights(self, model_embeddings_weights):
|
||||||
|
embed_shape = model_embeddings_weights.shape
|
||||||
|
self.decoder = nn.Linear(embed_shape[1], embed_shape[0], bias=False)
|
||||||
|
self.decoder.weight = model_embeddings_weights # Tied weights
|
||||||
|
|
||||||
|
def forward(self, hidden_state):
|
||||||
|
# Truncated Language modeling logits (we remove the last token)
|
||||||
|
# h_trunc = h[:, :-1].contiguous().view(-1, self.n_embd)
|
||||||
|
lm_logits = self.decoder(hidden_state)
|
||||||
|
return lm_logits
|
||||||
|
|
||||||
|
|
||||||
|
class OpenAIGPTMultipleChoiceHead(nn.Module):
|
||||||
|
""" Classifier Head for the transformer """
|
||||||
|
|
||||||
|
def __init__(self, config):
|
||||||
|
super(OpenAIGPTMultipleChoiceHead, self).__init__()
|
||||||
|
self.n_embd = config.n_embd
|
||||||
|
# self.multiple_choice_token = multiple_choice_token
|
||||||
|
self.dropout = nn.Dropout2d(config.resid_pdrop) # To reproduce the noise_shape parameter of TF implementation
|
||||||
|
self.linear = nn.Linear(config.n_embd, 1)
|
||||||
|
|
||||||
|
nn.init.normal_(self.linear.weight, std=0.02)
|
||||||
|
nn.init.normal_(self.linear.bias, 0)
|
||||||
|
|
||||||
|
def forward(self, hidden_states, mc_token_ids):
|
||||||
|
# Classification logits
|
||||||
|
# hidden_state (bsz, num_choices, seq_length, hidden_size)
|
||||||
|
# mc_token_ids (bsz, num_choices)
|
||||||
|
mc_token_ids = mc_token_ids.unsqueeze(-1).unsqueeze(-1).expand(-1, -1, -1, hidden_states.size(-1))
|
||||||
|
# (bsz, num_choices, 1, hidden_size)
|
||||||
|
multiple_choice_h = hidden_states.gather(2, mc_token_ids).squeeze(2)
|
||||||
|
# (bsz, num_choices, hidden_size)
|
||||||
|
multiple_choice_logits = self.linear(multiple_choice_h).squeeze(-1)
|
||||||
|
# (bsz, num_choices)
|
||||||
|
return multiple_choice_logits
|
||||||
|
|
||||||
|
|
||||||
|
class OpenAIGPTPreTrainedModel(nn.Module):
|
||||||
|
""" An abstract class to handle weights initialization and
|
||||||
|
a simple interface for downloading and loading pretrained models.
|
||||||
|
"""
|
||||||
|
|
||||||
|
def __init__(self, config, *inputs, **kwargs):
|
||||||
|
super(OpenAIGPTPreTrainedModel, self).__init__()
|
||||||
|
if not isinstance(config, OpenAIGPTConfig):
|
||||||
|
raise ValueError(
|
||||||
|
"Parameter config in `{}(config)` should be an instance of class `OpenAIGPTConfig`. "
|
||||||
|
"To create a model from a pretrained model use "
|
||||||
|
"`model = {}.from_pretrained(PRETRAINED_MODEL_NAME)`".format(
|
||||||
|
self.__class__.__name__, self.__class__.__name__
|
||||||
|
)
|
||||||
|
)
|
||||||
|
self.config = config
|
||||||
|
|
||||||
|
def init_weights(self, module):
|
||||||
|
""" Initialize the weights.
|
||||||
|
"""
|
||||||
|
if isinstance(module, (nn.Linear, nn.Embedding)):
|
||||||
|
# Slightly different from the TF version which uses truncated_normal for initialization
|
||||||
|
# cf https://github.com/pytorch/pytorch/pull/5617
|
||||||
|
module.weight.data.normal_(mean=0.0, std=self.config.initializer_range)
|
||||||
|
elif isinstance(module, LayerNorm):
|
||||||
|
module.bias.data.zero_()
|
||||||
|
module.weight.data.fill_(1.0)
|
||||||
|
if isinstance(module, nn.Linear) and module.bias is not None:
|
||||||
|
module.bias.data.zero_()
|
||||||
|
|
||||||
|
def set_num_special_tokens(self, num_special_tokens):
|
||||||
|
pass
|
||||||
|
|
||||||
|
@classmethod
|
||||||
|
def from_pretrained(
|
||||||
|
cls, pretrained_model_name_or_path, num_special_tokens=None, state_dict=None, cache_dir=None, from_tf=False, *inputs, **kwargs
|
||||||
|
):
|
||||||
|
"""
|
||||||
|
Instantiate an OpenAIGPTPreTrainedModel from a pre-trained model file or a PyTorch state dict.
|
||||||
|
Download and cache the pre-trained model file if needed.
|
||||||
|
|
||||||
|
Params:
|
||||||
|
pretrained_model_name_or_path: either:
|
||||||
|
- a str with the name of a pre-trained model to load selected in the list of:
|
||||||
|
. `openai-gpt`
|
||||||
|
- a path or url to a pretrained model archive containing:
|
||||||
|
. `config.json` a configuration file for the model
|
||||||
|
. `pytorch_model.bin` a PyTorch dump of a OpenAIGPTModel instance
|
||||||
|
- a path or url to a pretrained model archive containing:
|
||||||
|
. `config.json` a configuration file for the model
|
||||||
|
. a series of NumPy files containing OpenAI TensorFlow trained weights
|
||||||
|
from_tf: should we load the weights from a locally saved TensorFlow checkpoint
|
||||||
|
cache_dir: an optional path to a folder in which the pre-trained models will be cached.
|
||||||
|
state_dict: an optional state dictionary (collections.OrderedDict object) to use instead of pre-trained models
|
||||||
|
*inputs, **kwargs: additional input for the specific Bert class
|
||||||
|
(ex: num_labels for BertForSequenceClassification)
|
||||||
|
"""
|
||||||
|
if pretrained_model_name_or_path in PRETRAINED_MODEL_ARCHIVE_MAP:
|
||||||
|
archive_file = PRETRAINED_MODEL_ARCHIVE_MAP[pretrained_model_name_or_path]
|
||||||
|
config_file = PRETRAINED_CONFIG_ARCHIVE_MAP[pretrained_model_name_or_path]
|
||||||
|
else:
|
||||||
|
archive_file = os.path.join(pretrained_model_name_or_path, WEIGHTS_NAME)
|
||||||
|
config_file = os.path.join(pretrained_model_name_or_path, CONFIG_NAME)
|
||||||
|
# redirect to the cache, if necessary
|
||||||
|
try:
|
||||||
|
resolved_archive_file = cached_path(archive_file, cache_dir=cache_dir)
|
||||||
|
resolved_config_file = cached_path(config_file, cache_dir=cache_dir)
|
||||||
|
except EnvironmentError:
|
||||||
|
logger.error(
|
||||||
|
"Model name '{}' was not found in model name list ({}). "
|
||||||
|
"We assumed '{}' was a path or url but couldn't find files {} and {} "
|
||||||
|
"at this path or url.".format(
|
||||||
|
pretrained_model_name_or_path, ", ".join(PRETRAINED_MODEL_ARCHIVE_MAP.keys()), pretrained_model_name_or_path,
|
||||||
|
archive_file, config_file
|
||||||
|
)
|
||||||
|
)
|
||||||
|
return None
|
||||||
|
if resolved_archive_file == archive_file and resolved_config_file == config_file:
|
||||||
|
logger.info("loading weights file {}".format(archive_file))
|
||||||
|
logger.info("loading configuration file {}".format(config_file))
|
||||||
|
else:
|
||||||
|
logger.info("loading weights file {} from cache at {}".format(
|
||||||
|
archive_file, resolved_archive_file))
|
||||||
|
logger.info("loading configuration file {} from cache at {}".format(
|
||||||
|
config_file, resolved_config_file))
|
||||||
|
# Load config
|
||||||
|
config = OpenAIGPTConfig.from_json_file(resolved_config_file)
|
||||||
|
logger.info("Model config {}".format(config))
|
||||||
|
# Instantiate model.
|
||||||
|
model = cls(config, *inputs, **kwargs)
|
||||||
|
if state_dict is None and not from_tf:
|
||||||
|
state_dict = torch.load(resolved_archive_file, map_location='cpu' if not torch.cuda.is_available() else None)
|
||||||
|
if from_tf:
|
||||||
|
# Directly load from a TensorFlow checkpoint (stored as NumPy array)
|
||||||
|
return load_tf_weights_in_openai_gpt(model, resolved_archive_file)
|
||||||
|
|
||||||
|
old_keys = []
|
||||||
|
new_keys = []
|
||||||
|
for key in state_dict.keys():
|
||||||
|
new_key = None
|
||||||
|
if key.endswith(".g"):
|
||||||
|
new_key = key[:-2] + ".weight"
|
||||||
|
elif key.endswith(".b"):
|
||||||
|
new_key = key[:-2] + ".bias"
|
||||||
|
elif key.endswith(".w"):
|
||||||
|
new_key = key[:-2] + ".weight"
|
||||||
|
if new_key:
|
||||||
|
old_keys.append(key)
|
||||||
|
new_keys.append(new_key)
|
||||||
|
for old_key, new_key in zip(old_keys, new_keys):
|
||||||
|
state_dict[new_key] = state_dict.pop(old_key)
|
||||||
|
|
||||||
|
missing_keys = []
|
||||||
|
unexpected_keys = []
|
||||||
|
error_msgs = []
|
||||||
|
# copy state_dict so _load_from_state_dict can modify it
|
||||||
|
metadata = getattr(state_dict, "_metadata", None)
|
||||||
|
state_dict = state_dict.copy()
|
||||||
|
if metadata is not None:
|
||||||
|
state_dict._metadata = metadata
|
||||||
|
|
||||||
|
def load(module, prefix=""):
|
||||||
|
local_metadata = {} if metadata is None else metadata.get(prefix[:-1], {})
|
||||||
|
module._load_from_state_dict(
|
||||||
|
state_dict, prefix, local_metadata, True, missing_keys, unexpected_keys, error_msgs
|
||||||
|
)
|
||||||
|
for name, child in module._modules.items():
|
||||||
|
if child is not None:
|
||||||
|
load(child, prefix + name + ".")
|
||||||
|
|
||||||
|
start_model = model
|
||||||
|
if hasattr(model, "transformer") and all(not s.startswith('transformer.') for s in state_dict.keys()):
|
||||||
|
start_model = model.transformer
|
||||||
|
load(start_model, prefix="")
|
||||||
|
|
||||||
|
if len(missing_keys) > 0:
|
||||||
|
logger.info(
|
||||||
|
"Weights of {} not initialized from pretrained model: {}".format(model.__class__.__name__, missing_keys)
|
||||||
|
)
|
||||||
|
if len(unexpected_keys) > 0:
|
||||||
|
logger.info(
|
||||||
|
"Weights from pretrained model not used in {}: {}".format(model.__class__.__name__, unexpected_keys)
|
||||||
|
)
|
||||||
|
if len(error_msgs) > 0:
|
||||||
|
raise RuntimeError(
|
||||||
|
"Error(s) in loading state_dict for {}:\n\t{}".format(model.__class__.__name__, "\n\t".join(error_msgs))
|
||||||
|
)
|
||||||
|
|
||||||
|
# Add additional embeddings for special tokens if needed
|
||||||
|
# This step also makes sure we are still sharing the output and input embeddings after loading weights
|
||||||
|
model.set_num_special_tokens(num_special_tokens if num_special_tokens is not None else config.n_special)
|
||||||
|
return model
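The key-renaming pass in `from_pretrained` above maps the short parameter suffixes used in older checkpoints (`.g`, `.b`, `.w`) to PyTorch's `.weight` and `.bias`. A minimal standalone sketch follows. Unlike the original, which renames keys in place, it builds a new dictionary. The function name and the example keys are assumptions made for illustration.

```python
from collections import OrderedDict

def rename_legacy_keys(state_dict):
    """Return a copy of state_dict with .g/.b/.w suffixes renamed."""
    suffix_map = {".g": ".weight", ".b": ".bias", ".w": ".weight"}
    renamed = OrderedDict()
    for key, value in state_dict.items():
        for old, new in suffix_map.items():
            if key.endswith(old):
                key = key[: -len(old)] + new
                break  # at most one suffix applies per key
        renamed[key] = value
    return renamed


# Hypothetical keys; the values stand in for tensors
legacy = OrderedDict([("h.0.ln_1.g", 1), ("h.0.ln_1.b", 2), ("h.0.attn.c_attn.w", 3), ("tokens_embed.weight", 4)])
print(list(rename_legacy_keys(legacy).keys()))
```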
|
||||||
|
|
||||||
|
|
||||||
|
class OpenAIGPTModel(OpenAIGPTPreTrainedModel):
|
||||||
|
"""OpenAI GPT model ("Improving Language Understanding by Generative Pre-Training").
|
||||||
|
|
||||||
|
OpenAI GPT uses a single embedding matrix to store the word and special embeddings.
|
||||||
|
Special tokens embeddings are additional tokens that are not pre-trained: [SEP], [CLS]...
|
||||||
|
Special tokens need to be trained during the fine-tuning if you use them.
|
||||||
|
The number of special embeddings can be controlled using the `set_num_special_tokens(num_special_tokens)` function.
|
||||||
|
|
||||||
|
The embeddings are ordered as follows in the token embedding matrix:
|
||||||
|
[0, ----------------------
|
||||||
|
... -> word embeddings
|
||||||
|
config.vocab_size - 1, ______________________
|
||||||
|
config.vocab_size,
|
||||||
|
... -> special embeddings
|
||||||
|
config.vocab_size + config.n_special - 1] ______________________
|
||||||
|
|
||||||
|
where total_tokens_embeddings can be obtained as config.total_tokens_embeddings and is:
|
||||||
|
total_tokens_embeddings = config.vocab_size + config.n_special
|
||||||
|
You should use the associated indices to index the embeddings.
|
||||||
|
|
||||||
|
Params:
|
||||||
|
config: a OpenAIGPTConfig class instance with the configuration to build a new model
|
||||||
|
|
||||||
|
Inputs:
|
||||||
|
`input_ids`: a torch.LongTensor of shape [batch_size, sequence_length] (or more generally [d_1, ..., d_n, sequence_length]
|
||||||
|
where d_1 ... d_n are arbitrary dimensions) with the word BPE token indices selected in the range [0, total_tokens_embeddings[
|
||||||
|
`position_ids`: an optional torch.LongTensor with the same shape as input_ids
|
||||||
|
with the position indices (selected in the range [0, config.n_positions - 1]).
|
||||||
|
`token_type_ids`: an optional torch.LongTensor with the same shape as input_ids
|
||||||
|
You can use it to add a third type of embedding to each input token in the sequence
|
||||||
|
(the previous two being the word and position embeddings).
|
||||||
|
The input, position and token_type embeddings are summed inside the Transformer before the first
|
||||||
|
self-attention block.
|
||||||
|
|
||||||
|
Outputs:
|
||||||
|
`hidden_states`: the encoded-hidden-states at the top of the model
|
||||||
|
as a torch.FloatTensor of size [batch_size, sequence_length, hidden_size]
|
||||||
|
(or more generally [d_1, ..., d_n, hidden_size] where d_1 ... d_n are the dimensions of input_ids)
|
||||||
|
|
||||||
|
Example usage:
|
||||||
|
```python
|
||||||
|
# Already been converted into BPE token ids
|
||||||
|
input_ids = torch.LongTensor([[31, 51, 99], [15, 5, 0]])
|
||||||
|
|
||||||
|
config = modeling_openai.OpenAIGPTConfig()
|
||||||
|
|
||||||
|
model = modeling_openai.OpenAIGPTModel(config)
|
||||||
|
hidden_states = model(input_ids)
|
||||||
|
```
|
||||||
|
"""
|
||||||
|
|
||||||
|
def __init__(self, config):
|
||||||
|
super(OpenAIGPTModel, self).__init__(config)
|
||||||
|
num_tokens = config.vocab_size + config.n_special
|
||||||
|
self.tokens_embed = nn.Embedding(num_tokens, config.n_embd)
|
||||||
|
self.positions_embed = nn.Embedding(config.n_positions, config.n_embd)
|
||||||
|
self.drop = nn.Dropout(config.embd_pdrop)
|
||||||
|
block = Block(config.n_ctx, config, scale=True)
|
||||||
|
self.h = nn.ModuleList([copy.deepcopy(block) for _ in range(config.n_layer)])
|
||||||
|
|
||||||
|
self.apply(self.init_weights)
|
||||||
|
# nn.init.normal_(self.embed.weight, std=0.02)
|
||||||
|
|
||||||
|
def set_num_special_tokens(self, num_special_tokens):
|
||||||
|
" Update input embeddings with new embedding matrix if needed "
|
||||||
|
if self.config.n_special == num_special_tokens:
|
||||||
|
return
|
||||||
|
# Update config
|
||||||
|
self.config.n_special = num_special_tokens
|
||||||
|
# # Build new embeddings and initialize
|
||||||
|
old_embed = self.tokens_embed
|
||||||
|
self.tokens_embed = nn.Embedding(self.config.total_tokens_embeddings, self.config.n_embd)
|
||||||
|
# Initialize all new embeddings (in particular the special tokens)
|
||||||
|
self.init_weights(self.tokens_embed)
|
||||||
|
# Copy word and positional embeddings from the previous weights
|
||||||
|
self.tokens_embed.weight.data[: self.config.vocab_size, :] = old_embed.weight.data[: self.config.vocab_size, :]
|
||||||
|
self.tokens_embed.weight.data[-self.config.n_positions :, :] = old_embed.weight.data[-self.config.n_positions :, :]
|
||||||
|
|
||||||
|
def forward(self, input_ids, position_ids=None, token_type_ids=None):
|
||||||
|
if position_ids is None:
|
||||||
|
# This was used when we had a single embedding matrix for position and token embeddings
|
||||||
|
# start = self.config.vocab_size + self.config.n_special
|
||||||
|
# end = start + input_ids.size(-1)
|
||||||
|
# position_ids = torch.arange(start, end, dtype=torch.long, device=input_ids.device)
|
||||||
|
position_ids = torch.arange(input_ids.size(-1), dtype=torch.long, device=input_ids.device)
|
||||||
|
position_ids = position_ids.unsqueeze(0).expand_as(input_ids)
|
||||||
|
|
||||||
|
input_shape = input_ids.size()
|
||||||
|
input_ids = input_ids.view(-1, input_ids.size(-1))
|
||||||
|
position_ids = position_ids.view(-1, position_ids.size(-1))
|
||||||
|
|
||||||
|
inputs_embeds = self.tokens_embed(input_ids)
|
||||||
|
position_embeds = self.positions_embed(position_ids)
|
||||||
|
if token_type_ids is not None:
|
||||||
|
token_type_ids = token_type_ids.view(-1, token_type_ids.size(-1))
|
||||||
|
token_type_embeds = self.tokens_embed(token_type_ids)
|
||||||
|
else:
|
||||||
|
token_type_embeds = 0
|
||||||
|
# Add the position information to the input embeddings
|
||||||
|
# h = e.sum(dim=2)
|
||||||
|
hidden_states = inputs_embeds + position_embeds + token_type_embeds
|
||||||
|
for block in self.h:
|
||||||
|
hidden_states = block(hidden_states)
|
||||||
|
output_shape = input_shape + (hidden_states.size(-1),)
|
||||||
|
return hidden_states.view(*output_shape)
|
||||||
|
|
||||||
|
|
||||||
|
class OpenAIGPTLMHeadModel(OpenAIGPTPreTrainedModel):
|
||||||
|
"""OpenAI GPT model with a Language Modeling head ("Improving Language Understanding by Generative Pre-Training").
|
||||||
|
|
||||||
|
OpenAI GPT uses a single embedding matrix to store the word and special embeddings.
Special token embeddings are additional embeddings that are not pre-trained: [SEP], [CLS]...
Special tokens need to be trained during fine-tuning if you use them.
The number of special embeddings can be controlled using the `set_num_special_tokens(num_special_tokens)` function.

The embeddings are ordered as follows in the token embedding matrix:
|
||||||
|
[0, ----------------------
|
||||||
|
... -> word embeddings
|
||||||
|
config.vocab_size - 1, ______________________
|
||||||
|
config.vocab_size,
|
||||||
|
... -> special embeddings
|
||||||
|
config.vocab_size + config.n_special - 1] ______________________
|
||||||
|
|
||||||
|
where total_tokens_embeddings can be obtained as config.total_tokens_embeddings and is:
|
||||||
|
total_tokens_embeddings = config.vocab_size + config.n_special
|
||||||
|
You should use the associated indices to index the embeddings.
|
||||||
|
|
||||||
|
Params:
|
||||||
|
config: an OpenAIGPTConfig class instance with the configuration to build a new model
|
||||||
|
|
||||||
|
Inputs:
|
||||||
|
`input_ids`: a torch.LongTensor of shape [batch_size, sequence_length] (or more generally [d_1, ..., d_n, sequence_length]
|
||||||
|
where d_1 ... d_n are arbitrary dimensions) with the word BPE token indices selected in the range [0, total_tokens_embeddings[
|
||||||
|
`position_ids`: an optional torch.LongTensor with the same shape as input_ids
|
||||||
|
with the position indices (selected in the range [0, config.n_positions - 1]).
|
||||||
|
`token_type_ids`: an optional torch.LongTensor with the same shape as input_ids
|
||||||
|
You can use it to add a third type of embedding to each input token in the sequence
|
||||||
|
(the previous two being the word and position embeddings).
|
||||||
|
The input, position and token_type embeddings are summed inside the Transformer before the first
|
||||||
|
self-attention block.
|
||||||
|
`lm_labels`: optional language modeling labels: torch.LongTensor of shape [batch_size, sequence_length]
|
||||||
|
with indices selected in [-1, 0, ..., vocab_size]. All labels set to -1 are ignored (masked), the loss
|
||||||
|
is only computed for the labels set in [0, ..., vocab_size]
|
||||||
|
|
||||||
|
Outputs:
|
||||||
|
if `lm_labels` is not `None`:
|
||||||
|
Outputs the language modeling loss.
|
||||||
|
else:
|
||||||
|
`lm_logits`: the language modeling logits as a torch.FloatTensor of size [batch_size, sequence_length, total_tokens_embeddings]
|
||||||
|
(or more generally [d_1, ..., d_n, total_tokens_embeddings] where d_1 ... d_n are the dimensions of input_ids)
|
||||||
|
|
||||||
|
Example usage:
|
||||||
|
```python
|
||||||
|
# Already been converted into BPE token ids
|
||||||
|
input_ids = torch.LongTensor([[31, 51, 99], [15, 5, 0]])
|
||||||
|
|
||||||
|
config = modeling_openai.OpenAIGPTConfig()
|
||||||
|
|
||||||
|
model = modeling_openai.OpenAIGPTLMHeadModel(config)
|
||||||
|
lm_logits = model(input_ids)
|
||||||
|
```
|
||||||
|
"""
|
||||||
|
|
||||||
|
def __init__(self, config):
|
||||||
|
super(OpenAIGPTLMHeadModel, self).__init__(config)
|
||||||
|
self.transformer = OpenAIGPTModel(config)
|
||||||
|
self.lm_head = OpenAIGPTLMHead(self.transformer.tokens_embed.weight, config)
|
||||||
|
self.apply(self.init_weights)
|
||||||
|
|
||||||
|
def set_num_special_tokens(self, num_special_tokens):
|
||||||
|
""" Update input and output embeddings with a new embedding matrix
|
||||||
|
Make sure we are sharing the embeddings
|
||||||
|
"""
|
||||||
|
self.transformer.set_num_special_tokens(num_special_tokens)
|
||||||
|
self.lm_head.set_embeddings_weights(self.transformer.tokens_embed.weight)
|
||||||
|
|
||||||
|
def forward(self, input_ids, position_ids=None, token_type_ids=None, lm_labels=None):
|
||||||
|
hidden_states = self.transformer(input_ids, position_ids, token_type_ids)
|
||||||
|
lm_logits = self.lm_head(hidden_states)
|
||||||
|
if lm_labels is not None:
|
||||||
|
loss_fct = CrossEntropyLoss(ignore_index=-1)
|
||||||
|
loss = loss_fct(lm_logits.view(-1, lm_logits.size(-1)), lm_labels.view(-1))
|
||||||
|
return loss
|
||||||
|
return lm_logits
|
||||||
|
|
||||||
|
|
||||||
|
class OpenAIGPTDoubleHeadsModel(OpenAIGPTPreTrainedModel):
|
||||||
|
"""OpenAI GPT model with a Language Modeling and a Multiple Choice head ("Improving Language Understanding by Generative Pre-Training").
|
||||||
|
|
||||||
|
OpenAI GPT uses a single embedding matrix to store the word and special embeddings.
Special token embeddings are additional embeddings that are not pre-trained: [SEP], [CLS]...
Special tokens need to be trained during fine-tuning if you use them.
The number of special embeddings can be controlled using the `set_num_special_tokens(num_special_tokens)` function.

The embeddings are ordered as follows in the token embedding matrix:
|
||||||
|
[0, ----------------------
|
||||||
|
... -> word embeddings
|
||||||
|
config.vocab_size - 1, ______________________
|
||||||
|
config.vocab_size,
|
||||||
|
... -> special embeddings
|
||||||
|
config.vocab_size + config.n_special - 1] ______________________
|
||||||
|
|
||||||
|
where total_tokens_embeddings can be obtained as config.total_tokens_embeddings and is:
|
||||||
|
total_tokens_embeddings = config.vocab_size + config.n_special
|
||||||
|
You should use the associated indices to index the embeddings.
|
||||||
|
|
||||||
|
Params:
|
||||||
|
config: an OpenAIGPTConfig class instance with the configuration to build a new model
|
||||||
|
|
||||||
|
Inputs:
|
||||||
|
`input_ids`: a torch.LongTensor of shape [batch_size, num_choices, sequence_length] with the BPE token
|
||||||
|
indices selected in the range [0, total_tokens_embeddings[
|
||||||
|
`mc_token_ids`: a torch.LongTensor of shape [batch_size, num_choices] with the index of the token from
|
||||||
|
which we should take the hidden state to feed the multiple choice classifier (usually last token of the sequence)
|
||||||
|
`position_ids`: an optional torch.LongTensor with the same shape as input_ids
|
||||||
|
with the position indices (selected in the range [0, config.n_positions - 1]).
|
||||||
|
`token_type_ids`: an optional torch.LongTensor with the same shape as input_ids
|
||||||
|
You can use it to add a third type of embedding to each input token in the sequence
|
||||||
|
(the previous two being the word and position embeddings).
|
||||||
|
The input, position and token_type embeddings are summed inside the Transformer before the first
|
||||||
|
self-attention block.
|
||||||
|
`lm_labels`: optional language modeling labels: torch.LongTensor of shape [batch_size, num_choices, sequence_length]
|
||||||
|
with indices selected in [-1, 0, ..., total_tokens_embeddings]. All labels set to -1 are ignored (masked), the loss
|
||||||
|
is only computed for the labels set in [0, ..., total_tokens_embeddings]
|
||||||
|
`mc_labels`: optional multiple choice labels: torch.LongTensor of shape [batch_size]
with indices selected in [0, ..., num_choices - 1].
|
||||||
|
|
||||||
|
Outputs:
|
||||||
|
if `lm_labels` and `mc_labels` are not `None`:
|
||||||
|
Outputs a tuple of losses with the language modeling loss and the multiple choice loss.
|
||||||
|
else: a tuple with
|
||||||
|
`lm_logits`: the language modeling logits as a torch.FloatTensor of size [batch_size, num_choices, sequence_length, total_tokens_embeddings]
|
||||||
|
`multiple_choice_logits`: the multiple choice logits as a torch.FloatTensor of size [batch_size, num_choices]
|
||||||
|
|
||||||
|
Example usage:
|
||||||
|
```python
|
||||||
|
# Already been converted into BPE token ids
|
||||||
|
input_ids = torch.LongTensor([[[31, 51, 99], [15, 5, 0]]]) # (bsz, number of choice, seq length)
|
||||||
|
mc_token_ids = torch.LongTensor([[2, 1]]) # (bsz, number of choice)
|
||||||
|
|
||||||
|
config = modeling_openai.OpenAIGPTConfig()
|
||||||
|
|
||||||
|
model = modeling_openai.OpenAIGPTDoubleHeadsModel(config)
|
||||||
|
lm_logits, multiple_choice_logits = model(input_ids, mc_token_ids)
|
||||||
|
```
|
||||||
|
"""
|
||||||
|
|
||||||
|
def __init__(self, config):
|
||||||
|
super(OpenAIGPTDoubleHeadsModel, self).__init__(config)
|
||||||
|
self.transformer = OpenAIGPTModel(config)
|
||||||
|
self.lm_head = OpenAIGPTLMHead(self.transformer.tokens_embed.weight, config)
|
||||||
|
self.multiple_choice_head = OpenAIGPTMultipleChoiceHead(config)
|
||||||
|
self.apply(self.init_weights)
|
||||||
|
|
||||||
|
def set_num_special_tokens(self, num_special_tokens):
|
||||||
|
""" Update input and output embeddings with a new embedding matrix
|
||||||
|
Make sure we are sharing the embeddings
|
||||||
|
"""
|
||||||
|
self.transformer.set_num_special_tokens(num_special_tokens)
|
||||||
|
self.lm_head.set_embeddings_weights(self.transformer.tokens_embed.weight)
|
||||||
|
|
||||||
|
def forward(self, input_ids, mc_token_ids, lm_labels=None, mc_labels=None, token_type_ids=None, position_ids=None):
|
||||||
|
hidden_states = self.transformer(input_ids, position_ids, token_type_ids)
|
||||||
|
lm_logits = self.lm_head(hidden_states)
|
||||||
|
mc_logits = self.multiple_choice_head(hidden_states, mc_token_ids)
|
||||||
|
losses = []
|
||||||
|
if lm_labels is not None:
|
||||||
|
loss_fct = CrossEntropyLoss(ignore_index=-1)
|
||||||
|
losses.append(loss_fct(lm_logits.view(-1, lm_logits.size(-1)), lm_labels.view(-1)))
|
||||||
|
if mc_labels is not None:
|
||||||
|
loss_fct = CrossEntropyLoss()
|
||||||
|
losses.append(loss_fct(mc_logits.view(-1, mc_logits.size(-1)), mc_labels.view(-1)))
|
||||||
|
if losses:
|
||||||
|
return losses
|
||||||
|
return lm_logits, mc_logits
|
||||||
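The docstrings above describe a single token embedding matrix whose first `config.vocab_size` rows hold word embeddings and whose next `config.n_special` rows hold special-token embeddings. The index arithmetic, and the "keep the pretrained rows, re-initialize the rest" step of `set_num_special_tokens`, can be sketched in plain Python. The helper names `special_token_ids` and `resize_rows` and the toy sizes are hypothetical and are not part of the library:

```python
def special_token_ids(vocab_size, n_special):
    """Return the row indices reserved for special tokens.

    Rows [0, vocab_size) are word embeddings; rows
    [vocab_size, vocab_size + n_special) are special embeddings.
    """
    return list(range(vocab_size, vocab_size + n_special))


def resize_rows(old_rows, vocab_size, n_special, init_row):
    """Build a new row list that keeps the word rows and re-initializes the rest.

    `old_rows` stands in for the old embedding weights (one list per row);
    `init_row` is a hypothetical callable producing a freshly initialized row.
    """
    new_rows = [init_row() for _ in range(vocab_size + n_special)]
    new_rows[:vocab_size] = [list(row) for row in old_rows[:vocab_size]]
    return new_rows


# Toy sizes, chosen only for illustration.
vocab_size, n_special = 5, 3
ids = special_token_ids(vocab_size, n_special)
old = [[float(i)] * 2 for i in range(vocab_size)]
new = resize_rows(old, vocab_size, n_special, init_row=lambda: [0.0, 0.0])
total_tokens_embeddings = len(new)
```

In this sketch the special-token rows start out freshly initialized, which is why the docstrings say they have to be trained during fine-tuning.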
1376
pytorch_pretrained_bert/modeling_transfo_xl.py
Normal file
File diff suppressed because it is too large
402
pytorch_pretrained_bert/modeling_transfo_xl_utilities.py
Normal file
@@ -0,0 +1,402 @@
|
|||||||
|
# coding=utf-8
|
||||||
|
# Copyright 2018 Google AI, Google Brain and Carnegie Mellon University Authors and the HuggingFace Inc. team.
|
||||||
|
# Copyright (c) 2018, NVIDIA CORPORATION. All rights reserved.
|
||||||
|
#
|
||||||
|
# Licensed under the Apache License, Version 2.0 (the "License");
|
||||||
|
# you may not use this file except in compliance with the License.
|
||||||
|
# You may obtain a copy of the License at
|
||||||
|
#
|
||||||
|
# http://www.apache.org/licenses/LICENSE-2.0
|
||||||
|
#
|
||||||
|
# Unless required by applicable law or agreed to in writing, software
|
||||||
|
# distributed under the License is distributed on an "AS IS" BASIS,
|
||||||
|
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
|
||||||
|
# See the License for the specific language governing permissions and
|
||||||
|
# limitations under the License.
|
||||||
|
""" Utilities for PyTorch Transformer XL model.
|
||||||
|
Directly adapted from https://github.com/kimiyoung/transformer-xl.
|
||||||
|
"""
|
||||||
|
|
||||||
|
from collections import defaultdict
|
||||||
|
|
||||||
|
import numpy as np
|
||||||
|
|
||||||
|
import torch
|
||||||
|
import torch.nn as nn
|
||||||
|
import torch.nn.functional as F
|
||||||
|
|
||||||
|
# CUDA_MAJOR = int(torch.version.cuda.split('.')[0])
|
||||||
|
# CUDA_MINOR = int(torch.version.cuda.split('.')[1])
|
||||||
|
|
||||||
|
class ProjectedAdaptiveLogSoftmax(nn.Module):
|
||||||
|
def __init__(self, n_token, d_embed, d_proj, cutoffs, div_val=1,
|
||||||
|
keep_order=False):
|
||||||
|
super(ProjectedAdaptiveLogSoftmax, self).__init__()
|
||||||
|
|
||||||
|
self.n_token = n_token
|
||||||
|
self.d_embed = d_embed
|
||||||
|
self.d_proj = d_proj
|
||||||
|
|
||||||
|
self.cutoffs = cutoffs + [n_token]
|
||||||
|
self.cutoff_ends = [0] + self.cutoffs
|
||||||
|
self.div_val = div_val
|
||||||
|
|
||||||
|
self.shortlist_size = self.cutoffs[0]
|
||||||
|
self.n_clusters = len(self.cutoffs) - 1
|
||||||
|
self.head_size = self.shortlist_size + self.n_clusters
|
||||||
|
|
||||||
|
if self.n_clusters > 0:
|
||||||
|
self.cluster_weight = nn.Parameter(torch.zeros(self.n_clusters, self.d_embed))
|
||||||
|
self.cluster_bias = nn.Parameter(torch.zeros(self.n_clusters))
|
||||||
|
|
||||||
|
self.out_layers = nn.ModuleList()
|
||||||
|
self.out_projs = nn.ParameterList()
|
||||||
|
|
||||||
|
if div_val == 1:
|
||||||
|
for i in range(len(self.cutoffs)):
|
||||||
|
if d_proj != d_embed:
|
||||||
|
self.out_projs.append(
|
||||||
|
nn.Parameter(torch.Tensor(d_proj, d_embed))
|
||||||
|
)
|
||||||
|
else:
|
||||||
|
self.out_projs.append(None)
|
||||||
|
|
||||||
|
self.out_layers.append(nn.Linear(d_embed, n_token))
|
||||||
|
else:
|
||||||
|
for i in range(len(self.cutoffs)):
|
||||||
|
l_idx, r_idx = self.cutoff_ends[i], self.cutoff_ends[i+1]
|
||||||
|
d_emb_i = d_embed // (div_val ** i)
|
||||||
|
|
||||||
|
self.out_projs.append(
|
||||||
|
nn.Parameter(torch.Tensor(d_proj, d_emb_i))
|
||||||
|
)
|
||||||
|
|
||||||
|
self.out_layers.append(nn.Linear(d_emb_i, r_idx-l_idx))
|
||||||
|
|
||||||
|
self.keep_order = keep_order
|
||||||
|
|
||||||
|
def _compute_logit(self, hidden, weight, bias, proj):
|
||||||
|
if proj is None:
|
||||||
|
logit = F.linear(hidden, weight, bias=bias)
|
||||||
|
else:
|
||||||
|
# if CUDA_MAJOR <= 9 and CUDA_MINOR <= 1:
|
||||||
|
proj_hid = F.linear(hidden, proj.t().contiguous())
|
||||||
|
logit = F.linear(proj_hid, weight, bias=bias)
|
||||||
|
# else:
|
||||||
|
# logit = torch.einsum('bd,de,ev->bv', (hidden, proj, weight.t()))
|
||||||
|
# if bias is not None:
|
||||||
|
# logit = logit + bias
|
||||||
|
|
||||||
|
return logit
|
||||||
|
|
||||||
|
def forward(self, hidden, target=None, keep_order=False):
|
||||||
|
'''
|
||||||
|
Params:
|
||||||
|
hidden :: [len*bsz x d_proj]
|
||||||
|
target :: [len*bsz]
|
||||||
|
Return:
|
||||||
|
if target is not None:
|
||||||
|
out :: [len*bsz] Negative log likelihood
|
||||||
|
else:
|
||||||
|
out :: [len*bsz x n_tokens] log probabilities of tokens over the vocabulary
|
||||||
|
We could replace this implementation with the native PyTorch one
if the native one had an option to set a bias on all clusters.
See: https://github.com/pytorch/pytorch/blob/dbe6a7a9ff1a364a8706bf5df58a1ca96d2fd9da/torch/nn/modules/adaptive.py#L138
|
||||||
|
'''
|
||||||
|
|
||||||
|
if target is not None:
|
||||||
|
target = target.view(-1)
|
||||||
|
if hidden.size(0) != target.size(0):
|
||||||
|
raise RuntimeError('Input and target should have the same size '
|
||||||
|
'in the batch dimension.')
|
||||||
|
|
||||||
|
if self.n_clusters == 0:
|
||||||
|
logit = self._compute_logit(hidden, self.out_layers[0].weight,
|
||||||
|
self.out_layers[0].bias, self.out_projs[0])
|
||||||
|
if target is not None:
|
||||||
|
output = -F.log_softmax(logit, dim=-1) \
|
||||||
|
.gather(1, target.unsqueeze(1)).squeeze(1)
|
||||||
|
else:
|
||||||
|
output = F.log_softmax(logit, dim=-1)
|
||||||
|
else:
|
||||||
|
# construct weights and biases
|
||||||
|
weights, biases = [], []
|
||||||
|
for i in range(len(self.cutoffs)):
|
||||||
|
if self.div_val == 1:
|
||||||
|
l_idx, r_idx = self.cutoff_ends[i], self.cutoff_ends[i + 1]
|
||||||
|
weight_i = self.out_layers[0].weight[l_idx:r_idx]
|
||||||
|
bias_i = self.out_layers[0].bias[l_idx:r_idx]
|
||||||
|
else:
|
||||||
|
weight_i = self.out_layers[i].weight
|
||||||
|
bias_i = self.out_layers[i].bias
|
||||||
|
|
||||||
|
if i == 0:
|
||||||
|
weight_i = torch.cat(
|
||||||
|
[weight_i, self.cluster_weight], dim=0)
|
||||||
|
bias_i = torch.cat(
|
||||||
|
[bias_i, self.cluster_bias], dim=0)
|
||||||
|
|
||||||
|
weights.append(weight_i)
|
||||||
|
biases.append(bias_i)
|
||||||
|
|
||||||
|
head_weight, head_bias, head_proj = weights[0], biases[0], self.out_projs[0]
|
||||||
|
|
||||||
|
head_logit = self._compute_logit(hidden, head_weight, head_bias, head_proj)
|
||||||
|
head_logprob = F.log_softmax(head_logit, dim=1)
|
||||||
|
|
||||||
|
if target is None:
|
||||||
|
out = hidden.new_empty((head_logit.size(0), self.n_token))
|
||||||
|
else:
|
||||||
|
out = torch.zeros_like(target, dtype=hidden.dtype, device=hidden.device)
|
||||||
|
|
||||||
|
offset = 0
|
||||||
|
cutoff_values = [0] + self.cutoffs
|
||||||
|
for i in range(len(cutoff_values) - 1):
|
||||||
|
l_idx, r_idx = cutoff_values[i], cutoff_values[i + 1]
|
||||||
|
|
||||||
|
if target is not None:
|
||||||
|
mask_i = (target >= l_idx) & (target < r_idx)
|
||||||
|
indices_i = mask_i.nonzero().squeeze()
|
||||||
|
|
||||||
|
if indices_i.numel() == 0:
|
||||||
|
continue
|
||||||
|
|
||||||
|
target_i = target.index_select(0, indices_i) - l_idx
|
||||||
|
head_logprob_i = head_logprob.index_select(0, indices_i)
|
||||||
|
hidden_i = hidden.index_select(0, indices_i)
|
||||||
|
else:
|
||||||
|
hidden_i = hidden
|
||||||
|
|
||||||
|
if i == 0:
|
||||||
|
if target is not None:
|
||||||
|
logprob_i = head_logprob_i.gather(1, target_i[:, None]).squeeze(1)
|
||||||
|
else:
|
||||||
|
out[:, :self.cutoffs[0]] = head_logprob[:, :self.cutoffs[0]]
|
||||||
|
else:
|
||||||
|
weight_i, bias_i, proj_i = weights[i], biases[i], self.out_projs[i]
|
||||||
|
|
||||||
|
tail_logit_i = self._compute_logit(hidden_i, weight_i, bias_i, proj_i)
|
||||||
|
tail_logprob_i = F.log_softmax(tail_logit_i, dim=1)
|
||||||
|
cluster_prob_idx = self.cutoffs[0] + i - 1 # No probability for the head cluster
|
||||||
|
if target is not None:
|
||||||
|
logprob_i = head_logprob_i[:, cluster_prob_idx] \
|
||||||
|
+ tail_logprob_i.gather(1, target_i[:, None]).squeeze(1)
|
||||||
|
else:
|
||||||
|
logprob_i = head_logprob[:, cluster_prob_idx, None] + tail_logprob_i
|
||||||
|
out[:, l_idx:r_idx] = logprob_i
|
||||||
|
|
||||||
|
if target is not None:
|
||||||
|
if (hasattr(self, 'keep_order') and self.keep_order) or keep_order:
|
||||||
|
out.index_copy_(0, indices_i, -logprob_i)
|
||||||
|
else:
|
||||||
|
out[offset:offset+logprob_i.size(0)].copy_(-logprob_i)
|
||||||
|
offset += logprob_i.size(0)
|
||||||
|
|
||||||
|
return out
|
||||||
|
|
||||||
|
|
||||||
|
def log_prob(self, hidden):
|
||||||
|
r""" Computes log probabilities for all :math:`n\_classes`
|
||||||
|
From: https://github.com/pytorch/pytorch/blob/master/torch/nn/modules/adaptive.py
|
||||||
|
Args:
|
||||||
|
hidden (Tensor): a minibatch of examples
|
||||||
|
Returns:
|
||||||
|
log-probabilities for each class :math:`c`
in range :math:`0 <= c < n\_classes`, where :math:`n\_classes` is a
|
||||||
|
parameter passed to ``AdaptiveLogSoftmaxWithLoss`` constructor.
|
||||||
|
Shape:
|
||||||
|
- Input: :math:`(N, in\_features)`
|
||||||
|
- Output: :math:`(N, n\_classes)`
|
||||||
|
"""
|
||||||
|
if self.n_clusters == 0:
|
||||||
|
logit = self._compute_logit(hidden, self.out_layers[0].weight,
|
||||||
|
self.out_layers[0].bias, self.out_projs[0])
|
||||||
|
return F.log_softmax(logit, dim=-1)
|
||||||
|
else:
|
||||||
|
# construct weights and biases
|
||||||
|
weights, biases = [], []
|
||||||
|
for i in range(len(self.cutoffs)):
|
||||||
|
if self.div_val == 1:
|
||||||
|
l_idx, r_idx = self.cutoff_ends[i], self.cutoff_ends[i + 1]
|
||||||
|
weight_i = self.out_layers[0].weight[l_idx:r_idx]
|
||||||
|
bias_i = self.out_layers[0].bias[l_idx:r_idx]
|
||||||
|
else:
|
||||||
|
weight_i = self.out_layers[i].weight
|
||||||
|
bias_i = self.out_layers[i].bias
|
||||||
|
|
||||||
|
if i == 0:
|
||||||
|
weight_i = torch.cat(
|
||||||
|
[weight_i, self.cluster_weight], dim=0)
|
||||||
|
bias_i = torch.cat(
|
||||||
|
[bias_i, self.cluster_bias], dim=0)
|
||||||
|
|
||||||
|
weights.append(weight_i)
|
||||||
|
biases.append(bias_i)
|
||||||
|
|
||||||
|
head_weight, head_bias, head_proj = weights[0], biases[0], self.out_projs[0]
|
||||||
|
head_logit = self._compute_logit(hidden, head_weight, head_bias, head_proj)
|
||||||
|
|
||||||
|
out = hidden.new_empty((head_logit.size(0), self.n_token))
|
||||||
|
head_logprob = F.log_softmax(head_logit, dim=1)
|
||||||
|
|
||||||
|
cutoff_values = [0] + self.cutoffs
|
||||||
|
for i in range(len(cutoff_values) - 1):
|
||||||
|
start_idx, stop_idx = cutoff_values[i], cutoff_values[i + 1]
|
||||||
|
|
||||||
|
if i == 0:
|
||||||
|
out[:, :self.cutoffs[0]] = head_logprob[:, :self.cutoffs[0]]
|
||||||
|
else:
|
||||||
|
weight_i, bias_i, proj_i = weights[i], biases[i], self.out_projs[i]
|
||||||
|
|
||||||
|
tail_logit_i = self._compute_logit(hidden, weight_i, bias_i, proj_i)
|
||||||
|
tail_logprob_i = F.log_softmax(tail_logit_i, dim=1)
|
||||||
|
|
||||||
|
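# Use the same cluster column as forward() and broadcast it over the tail vocabulary
logprob_i = head_logprob[:, self.cutoffs[0] + i - 1, None] + tail_logprob_i
out[:, start_idx:stop_idx] = logprob_i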
|
||||||
|
|
||||||
|
return out
|
||||||
|
|
||||||
|
|
||||||
|
class LogUniformSampler(object):
|
||||||
|
def __init__(self, range_max, n_sample):
|
||||||
|
"""
|
||||||
|
Reference : https://github.com/tensorflow/tensorflow/blob/r1.10/tensorflow/python/ops/candidate_sampling_ops.py
|
||||||
|
`P(class) = (log(class + 2) - log(class + 1)) / log(range_max + 1)`
|
||||||
|
|
||||||
|
expected count can be approximated by 1 - (1 - p)^n
|
||||||
|
and we use a numerically stable version -expm1(num_tries * log1p(-p))
|
||||||
|
|
||||||
|
Our implementation fixes num_tries at 2 * n_sample, and the actual #samples will vary from run to run
|
||||||
|
"""
|
||||||
|
with torch.no_grad():
|
||||||
|
self.range_max = range_max
|
||||||
|
log_indices = torch.arange(1., range_max+2., 1.).log_()
|
||||||
|
self.dist = (log_indices[1:] - log_indices[:-1]) / log_indices[-1]
|
||||||
|
# print('P', self.dist.numpy().tolist()[-30:])
|
||||||
|
|
||||||
|
self.log_q = (- (-self.dist.double().log1p_() * 2 * n_sample).expm1_()).log_().float()
|
||||||
|
|
||||||
|
self.n_sample = n_sample
|
||||||
|
|
||||||
|
def sample(self, labels):
|
||||||
|
"""
|
||||||
|
labels: [b1, b2]
|
||||||
|
Return
|
||||||
|
true_log_probs: [b1, b2]
|
||||||
|
samp_log_probs: [n_sample]
|
||||||
|
neg_samples: [n_sample]
|
||||||
|
"""
|
||||||
|
|
||||||
|
# neg_samples = torch.empty(0).long()
|
||||||
|
n_sample = self.n_sample
|
||||||
|
n_tries = 2 * n_sample
|
||||||
|
|
||||||
|
with torch.no_grad():
|
||||||
|
neg_samples = torch.multinomial(self.dist, n_tries, replacement=True).unique()
|
||||||
|
device = labels.device
|
||||||
|
neg_samples = neg_samples.to(device)
|
||||||
|
true_log_probs = self.log_q[labels].to(device)
|
||||||
|
samp_log_probs = self.log_q[neg_samples].to(device)
|
||||||
|
return true_log_probs, samp_log_probs, neg_samples
|
||||||
|
|
||||||
|
def sample_logits(embedding, bias, labels, inputs, sampler):
|
||||||
|
"""
|
||||||
|
embedding: an nn.Embedding layer
|
||||||
|
bias: [n_vocab]
|
||||||
|
labels: [b1, b2]
|
||||||
|
inputs: [b1, b2, n_emb]
|
||||||
|
sampler: you may use a LogUniformSampler
|
||||||
|
Return
|
||||||
|
logits: [b1, b2, 1 + n_sample]
|
||||||
|
"""
|
||||||
|
true_log_probs, samp_log_probs, neg_samples = sampler.sample(labels)
|
||||||
|
n_sample = neg_samples.size(0)
|
||||||
|
b1, b2 = labels.size(0), labels.size(1)
|
||||||
|
all_ids = torch.cat([labels.view(-1), neg_samples])
|
||||||
|
all_w = embedding(all_ids)
|
||||||
|
true_w = all_w[: -n_sample].view(b1, b2, -1)
|
||||||
|
sample_w = all_w[- n_sample:].view(n_sample, -1)
|
||||||
|
|
||||||
|
all_b = bias[all_ids]
|
||||||
|
true_b = all_b[: -n_sample].view(b1, b2)
|
||||||
|
sample_b = all_b[- n_sample:]
|
||||||
|
|
||||||
|
hit = (labels[:, :, None] == neg_samples).detach()
|
||||||
|
|
||||||
|
true_logits = torch.einsum('ijk,ijk->ij',
|
||||||
|
[true_w, inputs]) + true_b - true_log_probs
|
||||||
|
sample_logits = torch.einsum('lk,ijk->ijl',
|
||||||
|
[sample_w, inputs]) + sample_b - samp_log_probs
|
||||||
|
sample_logits.masked_fill_(hit, -1e30)
|
||||||
|
logits = torch.cat([true_logits[:, :, None], sample_logits], -1)
|
||||||
|
|
||||||
|
return logits
|
||||||
|
|
||||||
|
|
||||||
|
# class LogUniformSampler(object):
|
||||||
|
# def __init__(self, range_max, unique=False):
|
||||||
|
# """
|
||||||
|
# Reference : https://github.com/tensorflow/tensorflow/blob/r1.10/tensorflow/python/ops/candidate_sampling_ops.py
|
||||||
|
# `P(class) = (log(class + 2) - log(class + 1)) / log(range_max + 1)`
|
||||||
|
# """
|
||||||
|
# self.range_max = range_max
|
||||||
|
# log_indices = torch.arange(1., range_max+2., 1.).log_()
|
||||||
|
# self.dist = (log_indices[1:] - log_indices[:-1]) / log_indices[-1]
|
||||||
|
|
||||||
|
# self.unique = unique
|
||||||
|
|
||||||
|
# if self.unique:
|
||||||
|
# self.exclude_mask = torch.ByteTensor(range_max).fill_(0)
|
||||||
|
|
||||||
|
# def sample(self, n_sample, labels):
|
||||||
|
# pos_sample, new_labels = labels.unique(return_inverse=True)
|
||||||
|
# n_pos_sample = pos_sample.size(0)
|
||||||
|
# n_neg_sample = n_sample - n_pos_sample
|
||||||
|
|
||||||
|
# if self.unique:
|
||||||
|
# self.exclude_mask.index_fill_(0, pos_sample, 1)
|
||||||
|
# sample_dist = self.dist.clone().masked_fill_(self.exclude_mask, 0)
|
||||||
|
# self.exclude_mask.index_fill_(0, pos_sample, 0)
|
||||||
|
# else:
|
||||||
|
# sample_dist = self.dist
|
||||||
|
|
||||||
|
# neg_sample = torch.multinomial(sample_dist, n_neg_sample)
|
||||||
|
|
||||||
|
# sample = torch.cat([pos_sample, neg_sample])
|
||||||
|
# sample_prob = self.dist[sample]
|
||||||
|
|
||||||
|
# return new_labels, sample, sample_prob
|
||||||
|
|
||||||
|
|
||||||
|
if __name__ == '__main__':
|
||||||
|
S, B = 3, 4
|
||||||
|
n_vocab = 10000
|
||||||
|
n_sample = 5
|
||||||
|
H = 32
|
||||||
|
|
||||||
|
labels = torch.LongTensor(S, B).random_(0, n_vocab)
|
||||||
|
|
||||||
|
# sampler = LogUniformSampler(n_vocab, unique=False)
|
||||||
|
# new_labels, sample, sample_prob = sampler.sample(n_sample, labels)
|
||||||
|
|
||||||
|
sampler = LogUniformSampler(n_vocab, n_sample)#, unique=True)
|
||||||
|
# true_probs, samp_probs, neg_samples = sampler.sample(n_sample, labels)
|
||||||
|
|
||||||
|
# print('true_probs', true_probs.numpy().tolist())
|
||||||
|
# print('samp_probs', samp_probs.numpy().tolist())
|
||||||
|
# print('neg_samples', neg_samples.numpy().tolist())
|
||||||
|
|
||||||
|
# print('sum', torch.sum(sampler.dist).item())
|
||||||
|
|
||||||
|
# assert torch.all(torch.sort(sample.unique())[0].eq(torch.sort(sample)[0])).item()
|
||||||
|
|
||||||
|
embedding = nn.Embedding(n_vocab, H)
|
||||||
|
bias = torch.zeros(n_vocab)
|
||||||
|
inputs = torch.Tensor(S, B, H).normal_()
|
||||||
|
|
||||||
|
# sample_logits takes five arguments and returns only the logits
logits = sample_logits(embedding, bias, labels, inputs, sampler)
print('logits', logits.detach().numpy().tolist())
print('logits shape', logits.size())
|
||||||
|
|
||||||
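The `LogUniformSampler` docstring gives the class probability `P(class) = (log(class + 2) - log(class + 1)) / log(range_max + 1)` and the expected-count approximation `-expm1(num_tries * log1p(-p))`. A minimal sketch of the same arithmetic using only the standard library follows. The function names `log_uniform_probs` and `expected_count` are illustrative and do not exist in the module:

```python
import math


def log_uniform_probs(range_max):
    """P(class) for class in [0, range_max), following the docstring formula."""
    norm = math.log(range_max + 1)
    return [(math.log(c + 2) - math.log(c + 1)) / norm for c in range(range_max)]


def expected_count(p, num_tries):
    """Numerically stable form of 1 - (1 - p) ** num_tries."""
    return -math.expm1(num_tries * math.log1p(-p))


probs = log_uniform_probs(10000)
n_sample = 5
# The sampler fixes num_tries at 2 * n_sample.
counts = [expected_count(p, 2 * n_sample) for p in probs]
```

The probabilities telescope to a sum of 1 and decrease with the class index, so frequent (low-index) tokens are drawn as negatives far more often than rare ones.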
140
pytorch_pretrained_bert/optimization_openai.py
Normal file
@@ -0,0 +1,140 @@
|
|||||||
|
# coding=utf-8
|
||||||
|
# Copyright 2018 The Open AI Team Authors and The HuggingFace Inc. team.
|
||||||
|
#
|
||||||
|
# Licensed under the Apache License, Version 2.0 (the "License");
|
||||||
|
# you may not use this file except in compliance with the License.
|
||||||
|
# You may obtain a copy of the License at
|
||||||
|
#
|
||||||
|
# http://www.apache.org/licenses/LICENSE-2.0
|
||||||
|
#
|
||||||
|
# Unless required by applicable law or agreed to in writing, software
|
||||||
|
# distributed under the License is distributed on an "AS IS" BASIS,
|
||||||
|
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
|
||||||
|
# See the License for the specific language governing permissions and
|
||||||
|
# limitations under the License.
|
||||||
|
"""PyTorch optimization for OpenAI GPT model."""
|
||||||
|
|
||||||
|
import math
|
||||||
|
import torch
|
||||||
|
from torch.optim import Optimizer
|
||||||
|
from torch.optim.optimizer import required
|
||||||
|
from torch.nn.utils import clip_grad_norm_
|
||||||
|
|
||||||
|
def warmup_cosine(x, warmup=0.002):
|
||||||
|
s = 1 if x <= warmup else 0
|
||||||
|
return s*(x/warmup) + (1-s)*(0.5 * (1 + math.cos(math.pi * x)))
|
||||||
|
|
||||||
|
def warmup_constant(x, warmup=0.002):
|
||||||
|
s = 1 if x <= warmup else 0
|
||||||
|
return s*(x/warmup) + (1-s)*1
|
||||||
|
|
||||||
|
def warmup_linear(x, warmup=0.002):
|
||||||
|
s = 1 if x <= warmup else 0
|
||||||
|
return (s*(x/warmup) + (1-s))*(1-x)
|
||||||
|
|
||||||
|
SCHEDULES = {
|
||||||
|
'warmup_cosine':warmup_cosine,
|
||||||
|
'warmup_constant':warmup_constant,
|
||||||
|
'warmup_linear':warmup_linear,
|
||||||
|
}
|
||||||
|
|
||||||
|
|
||||||
|
class OpenAIAdam(Optimizer):
|
||||||
|
"""Implements Open AI version of Adam algorithm with weight decay fix.
|
||||||
|
"""
|
||||||
|
def __init__(self, params, lr=required, schedule='warmup_linear', warmup=-1, t_total=-1,
|
||||||
|
b1=0.9, b2=0.999, e=1e-8, weight_decay=0,
|
||||||
|
vector_l2=False, max_grad_norm=-1, **kwargs):
|
||||||
|
if lr is not required and lr < 0.0:
|
||||||
|
raise ValueError("Invalid learning rate: {} - should be >= 0.0".format(lr))
|
||||||
|
if schedule not in SCHEDULES:
|
||||||
|
raise ValueError("Invalid schedule parameter: {}".format(schedule))
|
||||||
|
if not 0.0 <= warmup < 1.0 and not warmup == -1:
|
||||||
|
raise ValueError("Invalid warmup: {} - should be in [0.0, 1.0[ or -1".format(warmup))
|
||||||
|
if not 0.0 <= b1 < 1.0:
|
||||||
|
raise ValueError("Invalid b1 parameter: {}".format(b1))
|
||||||
|
if not 0.0 <= b2 < 1.0:
|
||||||
|
raise ValueError("Invalid b2 parameter: {}".format(b2))
|
||||||
|
if not e >= 0.0:
|
||||||
|
raise ValueError("Invalid epsilon value: {}".format(e))
|
||||||
|
defaults = dict(lr=lr, schedule=schedule, warmup=warmup, t_total=t_total,
|
||||||
|
b1=b1, b2=b2, e=e, weight_decay=weight_decay, vector_l2=vector_l2,
|
||||||
|
max_grad_norm=max_grad_norm)
|
||||||
|
super(OpenAIAdam, self).__init__(params, defaults)
|
||||||
|
|
||||||
|
def get_lr(self):
|
||||||
|
lr = []
|
||||||
|
for group in self.param_groups:
|
||||||
|
for p in group['params']:
|
||||||
|
state = self.state[p]
|
||||||
|
if len(state) == 0:
|
||||||
|
return [0]
|
||||||
|
if group['t_total'] != -1:
|
||||||
|
schedule_fct = SCHEDULES[group['schedule']]
|
||||||
|
lr_scheduled = group['lr'] * schedule_fct(state['step']/group['t_total'], group['warmup'])
|
||||||
|
else:
|
||||||
|
lr_scheduled = group['lr']
|
||||||
|
lr.append(lr_scheduled)
|
||||||
|
return lr
|
||||||
|
|
||||||
|
def step(self, closure=None):
|
||||||
|
"""Performs a single optimization step.
|
||||||
|
|
||||||
|
Arguments:
|
||||||
|
closure (callable, optional): A closure that reevaluates the model
|
||||||
|
and returns the loss.
|
||||||
|
"""
|
||||||
|
loss = None
|
||||||
|
if closure is not None:
|
||||||
|
loss = closure()
|
||||||
|
|
||||||
|
for group in self.param_groups:
|
||||||
|
for p in group['params']:
|
||||||
|
if p.grad is None:
|
||||||
|
continue
|
||||||
|
grad = p.grad.data
|
||||||
|
if grad.is_sparse:
|
||||||
|
raise RuntimeError('Adam does not support sparse gradients, please consider SparseAdam instead')
|
||||||
|
|
||||||
|
state = self.state[p]
|
||||||
|
|
||||||
|
# State initialization
|
||||||
|
if len(state) == 0:
|
||||||
|
state['step'] = 0
|
||||||
|
# Exponential moving average of gradient values
|
||||||
|
state['exp_avg'] = torch.zeros_like(p.data)
|
||||||
|
# Exponential moving average of squared gradient values
|
||||||
|
state['exp_avg_sq'] = torch.zeros_like(p.data)
|
||||||
|
|
||||||
|
exp_avg, exp_avg_sq = state['exp_avg'], state['exp_avg_sq']
|
||||||
|
beta1, beta2 = group['b1'], group['b2']
|
||||||
|
|
||||||
|
state['step'] += 1
|
||||||
|
|
||||||
|
# Add grad clipping
|
||||||
|
if group['max_grad_norm'] > 0:
|
||||||
|
clip_grad_norm_(p, group['max_grad_norm'])
|
||||||
|
|
||||||
|
# Decay the first and second moment running average coefficient
|
||||||
|
exp_avg.mul_(beta1).add_(1 - beta1, grad)
|
||||||
|
exp_avg_sq.mul_(beta2).addcmul_(1 - beta2, grad, grad)
|
||||||
|
denom = exp_avg_sq.sqrt().add_(group['e'])
|
||||||
|
|
||||||
|
bias_correction1 = 1 - beta1 ** state['step']
|
||||||
|
bias_correction2 = 1 - beta2 ** state['step']
|
||||||
|
|
||||||
|
if group['t_total'] != -1:
|
||||||
|
schedule_fct = SCHEDULES[group['schedule']]
|
||||||
|
lr_scheduled = group['lr'] * schedule_fct(state['step']/group['t_total'], group['warmup'])
|
||||||
|
else:
|
||||||
|
lr_scheduled = group['lr']
|
||||||
|
|
||||||
|
step_size = lr_scheduled * math.sqrt(bias_correction2) / bias_correction1
|
||||||
|
|
||||||
|
p.data.addcdiv_(-step_size, exp_avg, denom)
|
||||||
|
|
||||||
|
# Add weight decay at the end (fixed version)
|
||||||
|
if (len(p.size()) > 1 or group['vector_l2']) and group['weight_decay'] > 0:
|
||||||
|
p.data.add_(-lr_scheduled * group['weight_decay'], p.data)
|
||||||
|
|
||||||
|
return loss
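The update performed by `step` can be traced on a single scalar parameter. The sketch below is a plain-Python illustration and not the optimizer itself: `adam_step` and its argument names are hypothetical, the learning-rate schedule is left out (as if `t_total == -1`), gradient clipping is skipped, and the `vector_l2` / tensor-rank condition on weight decay is omitted.

```python
import math


def adam_step(param, grad, state, lr=0.1, b1=0.9, b2=0.999, e=1e-8, weight_decay=0.0):
    """One scalar Adam update, with weight decay applied after the Adam step."""
    state["step"] += 1
    # Exponential moving averages of the gradient and of the squared gradient.
    state["exp_avg"] = b1 * state["exp_avg"] + (1 - b1) * grad
    state["exp_avg_sq"] = b2 * state["exp_avg_sq"] + (1 - b2) * grad * grad
    denom = math.sqrt(state["exp_avg_sq"]) + e
    # Bias corrections compensate for the zero-initialised averages.
    bias_correction1 = 1 - b1 ** state["step"]
    bias_correction2 = 1 - b2 ** state["step"]
    step_size = lr * math.sqrt(bias_correction2) / bias_correction1
    new_param = param - step_size * state["exp_avg"] / denom
    # Weight decay is computed from the already-updated parameter.
    new_param -= lr * weight_decay * new_param
    return new_param


state = {"step": 0, "exp_avg": 0.0, "exp_avg_sq": 0.0}
updated = adam_step(1.0, grad=1.0, state=state)
print(updated)
```

On the first step the bias corrections cancel the small initial averages, so the parameter moves by roughly `lr` in the direction opposite to the gradient.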
|
||||||
@@ -14,14 +14,13 @@
|
|||||||
# limitations under the License.
|
# limitations under the License.
|
||||||
"""Tokenization classes."""
|
"""Tokenization classes."""
|
||||||
|
|
||||||
from __future__ import absolute_import
|
from __future__ import absolute_import, division, print_function, unicode_literals
|
||||||
from __future__ import division
|
|
||||||
from __future__ import print_function
|
|
||||||
|
|
||||||
import collections
|
import collections
|
||||||
import unicodedata
|
|
||||||
import os
|
|
||||||
import logging
|
import logging
|
||||||
|
import os
|
||||||
|
import unicodedata
|
||||||
|
from io import open
|
||||||
|
|
||||||
from .file_utils import cached_path
|
from .file_utils import cached_path
|
||||||
|
|
||||||
@@ -117,26 +116,26 @@ class BertTokenizer(object):
|
|||||||
return tokens
|
return tokens
|
||||||
|
|
||||||
@classmethod
|
@classmethod
|
||||||
def from_pretrained(cls, pretrained_model_name, cache_dir=None, *inputs, **kwargs):
|
def from_pretrained(cls, pretrained_model_name_or_path, cache_dir=None, *inputs, **kwargs):
|
||||||
"""
|
"""
|
||||||
Instantiate a PreTrainedBertModel from a pre-trained model file.
|
Instantiate a PreTrainedBertModel from a pre-trained model file.
|
||||||
Download and cache the pre-trained model file if needed.
|
Download and cache the pre-trained model file if needed.
|
||||||
"""
|
"""
|
||||||
if pretrained_model_name in PRETRAINED_VOCAB_ARCHIVE_MAP:
|
if pretrained_model_name_or_path in PRETRAINED_VOCAB_ARCHIVE_MAP:
|
||||||
vocab_file = PRETRAINED_VOCAB_ARCHIVE_MAP[pretrained_model_name]
|
vocab_file = PRETRAINED_VOCAB_ARCHIVE_MAP[pretrained_model_name_or_path]
|
||||||
else:
|
else:
|
||||||
vocab_file = pretrained_model_name
|
vocab_file = pretrained_model_name_or_path
|
||||||
if os.path.isdir(vocab_file):
|
if os.path.isdir(vocab_file):
|
||||||
vocab_file = os.path.join(vocab_file, VOCAB_NAME)
|
vocab_file = os.path.join(vocab_file, VOCAB_NAME)
|
||||||
# redirect to the cache, if necessary
|
# redirect to the cache, if necessary
|
||||||
try:
|
try:
|
||||||
resolved_vocab_file = cached_path(vocab_file, cache_dir=cache_dir)
|
resolved_vocab_file = cached_path(vocab_file, cache_dir=cache_dir)
|
||||||
except FileNotFoundError:
|
except EnvironmentError:
|
||||||
logger.error(
|
logger.error(
|
||||||
"Model name '{}' was not found in model name list ({}). "
|
"Model name '{}' was not found in model name list ({}). "
|
||||||
"We assumed '{}' was a path or url but couldn't find any file "
|
"We assumed '{}' was a path or url but couldn't find any file "
|
||||||
"associated to this path or url.".format(
|
"associated to this path or url.".format(
|
||||||
pretrained_model_name,
|
pretrained_model_name_or_path,
|
||||||
', '.join(PRETRAINED_VOCAB_ARCHIVE_MAP.keys()),
|
', '.join(PRETRAINED_VOCAB_ARCHIVE_MAP.keys()),
|
||||||
vocab_file))
|
vocab_file))
|
||||||
return None
|
return None
|
||||||
@@ -145,10 +144,10 @@ class BertTokenizer(object):
|
|||||||
else:
|
else:
|
||||||
logger.info("loading vocabulary file {} from cache at {}".format(
|
logger.info("loading vocabulary file {} from cache at {}".format(
|
||||||
vocab_file, resolved_vocab_file))
|
vocab_file, resolved_vocab_file))
|
||||||
if pretrained_model_name in PRETRAINED_VOCAB_POSITIONAL_EMBEDDINGS_SIZE_MAP:
|
if pretrained_model_name_or_path in PRETRAINED_VOCAB_POSITIONAL_EMBEDDINGS_SIZE_MAP:
|
||||||
# if we're using a pretrained model, ensure the tokenizer wont index sequences longer
|
# if we're using a pretrained model, ensure the tokenizer wont index sequences longer
|
||||||
# than the number of positional embeddings
|
# than the number of positional embeddings
|
||||||
max_len = PRETRAINED_VOCAB_POSITIONAL_EMBEDDINGS_SIZE_MAP[pretrained_model_name]
|
max_len = PRETRAINED_VOCAB_POSITIONAL_EMBEDDINGS_SIZE_MAP[pretrained_model_name_or_path]
|
||||||
kwargs['max_len'] = min(kwargs.get('max_len', int(1e12)), max_len)
|
kwargs['max_len'] = min(kwargs.get('max_len', int(1e12)), max_len)
|
||||||
# Instantiate tokenizer.
|
# Instantiate tokenizer.
|
||||||
tokenizer = cls(resolved_vocab_file, *inputs, **kwargs)
|
tokenizer = cls(resolved_vocab_file, *inputs, **kwargs)
|
||||||
|
|||||||
248
pytorch_pretrained_bert/tokenization_openai.py
Normal file
@@ -0,0 +1,248 @@
|
|||||||
|
# coding=utf-8
|
||||||
|
# Copyright 2018 The Open AI Team Authors and The HuggingFace Inc. team.
|
||||||
|
#
|
||||||
|
# Licensed under the Apache License, Version 2.0 (the "License");
|
||||||
|
# you may not use this file except in compliance with the License.
|
||||||
|
# You may obtain a copy of the License at
|
||||||
|
#
|
||||||
|
# http://www.apache.org/licenses/LICENSE-2.0
|
||||||
|
#
|
||||||
|
# Unless required by applicable law or agreed to in writing, software
|
||||||
|
# distributed under the License is distributed on an "AS IS" BASIS,
|
||||||
|
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
|
||||||
|
# See the License for the specific language governing permissions and
|
||||||
|
# limitations under the License.
|
||||||
|
"""Tokenization classes for OpenAI GPT."""
|
||||||
|
from __future__ import (absolute_import, division, print_function,
|
||||||
|
unicode_literals)
|
||||||
|
|
||||||
|
import json
|
||||||
|
import logging
|
||||||
|
import os
|
||||||
|
import re
|
||||||
|
import sys
|
||||||
|
from io import open
|
||||||
|
|
||||||
|
from tqdm import tqdm
|
||||||
|
|
||||||
|
from .file_utils import cached_path
|
||||||
|
|
||||||
|
logger = logging.getLogger(__name__)
|
||||||
|
|
||||||
|
PRETRAINED_VOCAB_ARCHIVE_MAP = {
|
||||||
|
'openai-gpt': "https://s3.amazonaws.com/models.huggingface.co/bert/openai-gpt-vocab.json",
|
||||||
|
}
|
||||||
|
PRETRAINED_MERGES_ARCHIVE_MAP = {
|
||||||
|
'openai-gpt': "https://s3.amazonaws.com/models.huggingface.co/bert/openai-gpt-merges.txt",
|
||||||
|
}
|
||||||
|
PRETRAINED_VOCAB_POSITIONAL_EMBEDDINGS_SIZE_MAP = {
|
||||||
|
'openai-gpt': 512,
|
||||||
|
}
|
||||||
|
VOCAB_NAME = 'vocab.json'
|
||||||
|
MERGES_NAME = 'merges.txt'
|
||||||
|
|
||||||
|
def get_pairs(word):
|
||||||
|
"""
|
||||||
|
Return set of symbol pairs in a word.
|
||||||
|
word is represented as tuple of symbols (symbols being variable-length strings)
|
||||||
|
"""
|
||||||
|
pairs = set()
|
||||||
|
prev_char = word[0]
|
||||||
|
for char in word[1:]:
|
||||||
|
pairs.add((prev_char, char))
|
||||||
|
prev_char = char
|
||||||
|
return pairs
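As a quick illustration of what `get_pairs` returns, the function is restated below so the snippet runs on its own. The example word is hypothetical; its last symbol carries the `</w>` end-of-word marker that `bpe` appends.

```python
def get_pairs(word):
    """Return the set of adjacent symbol pairs in a word given as a tuple of symbols."""
    pairs = set()
    prev_char = word[0]
    for char in word[1:]:
        pairs.add((prev_char, char))
        prev_char = char
    return pairs


word = ("l", "o", "w", "e", "r</w>")
pairs = get_pairs(word)
print(sorted(pairs))
```

A word made of a single symbol has no adjacent pairs, so the result is an empty set; `bpe` relies on this to return such tokens unchanged.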
|
||||||
|
|
||||||
|
def text_standardize(text):
|
||||||
|
"""
|
||||||
|
fixes some issues the spacy tokenizer had on books corpus
|
||||||
|
also does some whitespace standardization
|
||||||
|
"""
|
||||||
|
text = text.replace('—', '-')
|
||||||
|
text = text.replace('–', '-')
|
||||||
|
text = text.replace('―', '-')
|
||||||
|
text = text.replace('…', '...')
|
||||||
|
text = text.replace('´', "'")
|
||||||
|
text = re.sub(r'''(-+|~+|!+|"+|;+|\?+|\++|,+|\)+|\(+|\\+|\/+|\*+|\[+|\]+|}+|{+|\|+|_+)''', r' \1 ', text)
|
||||||
|
text = re.sub(r'\s*\n\s*', ' \n ', text)
|
||||||
|
text = re.sub(r'[^\S\n]+', ' ', text)
|
||||||
|
return text.strip()
|
||||||
|
|
||||||
|
class OpenAIGPTTokenizer(object):
|
||||||
|
"""
|
||||||
|
BPE tokenizer. Peculiarities:
|
||||||
|
- lower case all inputs
|
||||||
|
- uses SpaCy tokenizer
|
||||||
|
- special tokens: additional symbols (ex: "__classify__") to add to a vocabulary.
|
||||||
|
"""
|
||||||
|
@classmethod
|
||||||
|
def from_pretrained(cls, pretrained_model_name_or_path, cache_dir=None, *inputs, **kwargs):
|
||||||
|
"""
|
||||||
|
Instantiate an OpenAIGPTTokenizer from pre-trained vocabulary and merges files.
|
||||||
|
Download and cache the pre-trained model file if needed.
|
||||||
|
"""
|
||||||
|
if pretrained_model_name_or_path in PRETRAINED_VOCAB_ARCHIVE_MAP:
|
||||||
|
vocab_file = PRETRAINED_VOCAB_ARCHIVE_MAP[pretrained_model_name_or_path]
|
||||||
|
merges_file = PRETRAINED_MERGES_ARCHIVE_MAP[pretrained_model_name_or_path]
|
||||||
|
else:
|
||||||
|
vocab_file = os.path.join(pretrained_model_name_or_path, VOCAB_NAME)
|
||||||
|
merges_file = os.path.join(pretrained_model_name_or_path, MERGES_NAME)
|
||||||
|
# redirect to the cache, if necessary
|
||||||
|
try:
|
||||||
|
resolved_vocab_file = cached_path(vocab_file, cache_dir=cache_dir)
|
||||||
|
resolved_merges_file = cached_path(merges_file, cache_dir=cache_dir)
|
||||||
|
except EnvironmentError:
|
||||||
|
logger.error(
|
||||||
|
"Model name '{}' was not found in model name list ({}). "
|
||||||
|
"We assumed '{}' was a path or url but couldn't find files {} and {} "
|
||||||
|
"at this path or url.".format(
|
||||||
|
pretrained_model_name_or_path,
|
||||||
|
', '.join(PRETRAINED_VOCAB_ARCHIVE_MAP.keys()),
|
||||||
|
pretrained_model_name_or_path,
|
||||||
|
vocab_file, merges_file))
|
||||||
|
return None
|
||||||
|
if resolved_vocab_file == vocab_file and resolved_merges_file == merges_file:
|
||||||
|
logger.info("loading vocabulary file {}".format(vocab_file))
|
||||||
|
logger.info("loading merges file {}".format(merges_file))
|
||||||
|
else:
|
||||||
|
logger.info("loading vocabulary file {} from cache at {}".format(
|
||||||
|
vocab_file, resolved_vocab_file))
|
||||||
|
logger.info("loading merges file {} from cache at {}".format(
|
||||||
|
merges_file, resolved_merges_file))
|
||||||
|
if pretrained_model_name_or_path in PRETRAINED_VOCAB_POSITIONAL_EMBEDDINGS_SIZE_MAP:
|
||||||
|
# if we're using a pretrained model, ensure the tokenizer won't index sequences longer
|
||||||
|
# than the number of positional embeddings
|
||||||
|
max_len = PRETRAINED_VOCAB_POSITIONAL_EMBEDDINGS_SIZE_MAP[pretrained_model_name_or_path]
|
||||||
|
kwargs['max_len'] = min(kwargs.get('max_len', int(1e12)), max_len)
|
||||||
|
# Instantiate tokenizer.
|
||||||
|
tokenizer = cls(resolved_vocab_file, resolved_merges_file, *inputs, **kwargs)
|
||||||
|
return tokenizer
|
||||||
|
|
||||||
|
def __init__(self, vocab_file, merges_file, special_tokens=None, max_len=None):
|
||||||
|
try:
|
||||||
|
import ftfy
|
||||||
|
import spacy
|
||||||
|
except ImportError:
|
||||||
|
raise ImportError("Please install ftfy and spacy to use OpenAI GPT tokenizer.")
|
||||||
|
|
||||||
|
self.max_len = max_len if max_len is not None else int(1e12)
|
||||||
|
self.nlp = spacy.load('en', disable=['parser', 'tagger', 'ner', 'textcat'])
|
||||||
|
self.fix_text = ftfy.fix_text
|
||||||
|
self.encoder = json.load(open(vocab_file, encoding="utf-8"))
|
||||||
|
self.decoder = {v:k for k,v in self.encoder.items()}
|
||||||
|
merges = open(merges_file, encoding='utf-8').read().split('\n')[1:-1]
|
||||||
|
merges = [tuple(merge.split()) for merge in merges]
|
||||||
|
self.bpe_ranks = dict(zip(merges, range(len(merges))))
|
||||||
|
self.cache = {}
|
||||||
|
self.set_special_tokens(special_tokens)
|
||||||
|
|
||||||
|
def __len__(self):
|
||||||
|
return len(self.encoder) + len(self.special_tokens)
|
||||||
|
|
||||||
|
def set_special_tokens(self, special_tokens):
|
||||||
|
""" Add a list of additional tokens to the encoder.
|
||||||
|
The additional tokens are indexed starting from the last index of the
|
||||||
|
current vocabulary in the order of the `special_tokens` list.
|
||||||
|
"""
|
||||||
|
if not special_tokens:
|
||||||
|
self.special_tokens = {}
|
||||||
|
self.special_tokens_decoder = {}
|
||||||
|
return
|
||||||
|
self.special_tokens = dict((tok, len(self.encoder) + i) for i, tok in enumerate(special_tokens))
|
||||||
|
self.special_tokens_decoder = {v:k for k, v in self.special_tokens.items()}
|
||||||
|
logger.info("Special tokens {}".format(self.special_tokens))
|
||||||
|
|
||||||
|
def bpe(self, token):
|
||||||
|
word = tuple(token[:-1]) + (token[-1] + '</w>',)
|
||||||
|
if token in self.cache:
|
||||||
|
return self.cache[token]
|
||||||
|
pairs = get_pairs(word)
|
||||||
|
|
||||||
|
if not pairs:
|
||||||
|
return token+'</w>'
|
||||||
|
|
||||||
|
while True:
|
||||||
|
bigram = min(pairs, key=lambda pair: self.bpe_ranks.get(pair, float('inf')))
|
||||||
|
if bigram not in self.bpe_ranks:
|
||||||
|
break
|
||||||
|
first, second = bigram
|
||||||
|
new_word = []
|
||||||
|
i = 0
|
||||||
|
while i < len(word):
|
||||||
|
try:
|
||||||
|
j = word.index(first, i)
|
||||||
|
new_word.extend(word[i:j])
|
||||||
|
i = j
|
||||||
|
except ValueError:
|
||||||
|
new_word.extend(word[i:])
|
||||||
|
break
|
||||||
|
|
||||||
|
if word[i] == first and i < len(word)-1 and word[i+1] == second:
|
||||||
|
new_word.append(first+second)
|
||||||
|
i += 2
|
||||||
|
else:
|
||||||
|
new_word.append(word[i])
|
||||||
|
i += 1
|
||||||
|
new_word = tuple(new_word)
|
||||||
|
word = new_word
|
||||||
|
if len(word) == 1:
|
||||||
|
break
|
||||||
|
else:
|
||||||
|
pairs = get_pairs(word)
|
||||||
|
word = ' '.join(word)
|
||||||
|
if word == '\n </w>':
|
||||||
|
word = '\n</w>'
|
||||||
|
self.cache[token] = word
|
||||||
|
return word
|
||||||
|
|
||||||
|
def tokenize(self, text):
|
||||||
|
""" Tokenize a string. """
|
||||||
|
split_tokens = []
|
||||||
|
text = self.nlp(text_standardize(self.fix_text(text)))
|
||||||
|
for token in text:
|
||||||
|
split_tokens.extend([t for t in self.bpe(token.text.lower()).split(' ')])
|
||||||
|
return split_tokens
|
||||||
|
|
||||||
|
def convert_tokens_to_ids(self, tokens):
|
||||||
|
""" Converts a sequence of tokens into ids using the vocab. """
|
||||||
|
ids = []
|
||||||
|
if isinstance(tokens, str) or (sys.version_info[0] == 2 and isinstance(tokens, unicode)):
|
||||||
|
if tokens in self.special_tokens:
|
||||||
|
return self.special_tokens[tokens]
|
||||||
|
else:
|
||||||
|
return self.encoder.get(tokens, 0)
|
||||||
|
for token in tokens:
|
||||||
|
if token in self.special_tokens:
|
||||||
|
ids.append(self.special_tokens[token])
|
||||||
|
else:
|
||||||
|
ids.append(self.encoder.get(token, 0))
|
||||||
|
if len(ids) > self.max_len:
|
||||||
|
raise ValueError(
|
||||||
|
"Token indices sequence length is longer than the specified maximum "
|
||||||
|
" sequence length for this OpenAI GPT model ({} > {}). Running this"
|
||||||
|
" sequence through the model will result in indexing errors".format(len(ids), self.max_len)
|
||||||
|
)
|
||||||
|
return ids
|
||||||
|
|
||||||
|
def convert_ids_to_tokens(self, ids, skip_special_tokens=False):
|
||||||
|
"""Converts a sequence of ids into BPE tokens using the vocab."""
|
||||||
|
tokens = []
|
||||||
|
for i in ids:
|
||||||
|
if i in self.special_tokens_decoder:
|
||||||
|
if not skip_special_tokens:
|
||||||
|
tokens.append(self.special_tokens_decoder[i])
|
||||||
|
else:
|
||||||
|
tokens.append(self.decoder[i])
|
||||||
|
return tokens
|
||||||
|
|
||||||
|
def decode(self, ids, skip_special_tokens=False, clean_up_tokenization_spaces=False):
|
||||||
|
"""Converts a sequence of ids into a string."""
|
||||||
|
tokens = self.convert_ids_to_tokens(ids, skip_special_tokens=skip_special_tokens)
|
||||||
|
out_string = ''.join(tokens).replace('</w>', ' ').strip()
|
||||||
|
if clean_up_tokenization_spaces:
|
||||||
|
out_string = out_string.replace('<unk>', '')
|
||||||
|
out_string = out_string.replace(' .', '.').replace(' ?', '?').replace(' !', '!').replace(' ,', ','
|
||||||
|
).replace(" n't", "n't").replace(" 'm", "'m").replace(" 're", "'re").replace(" do not", " don't"
|
||||||
|
).replace(" 's", "'s").replace(" t ", "'t ").replace(" s ", "'s ").replace(" m ", "'m "
|
||||||
|
).replace(" 've", "'ve")
|
||||||
|
return out_string
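The merge loop in `bpe` and the string reassembly in `decode` can be followed on a toy example. This is a simplified sketch under stated assumptions: `toy_ranks` is a hypothetical merge table (the real one is read from the merges file), the cache and the special-case handling of newlines are left out, and `get_pairs` is written with `zip` for brevity.

```python
def get_pairs(word):
    """Adjacent symbol pairs of a word given as a tuple of symbols."""
    return set(zip(word[:-1], word[1:]))


def bpe(token, bpe_ranks):
    """Apply ranked merges to one token, lowest rank first."""
    word = tuple(token[:-1]) + (token[-1] + "</w>",)
    pairs = get_pairs(word)
    if not pairs:
        return token + "</w>"
    while True:
        # Pick the known pair with the lowest rank; unknown pairs rank as infinity.
        bigram = min(pairs, key=lambda pair: bpe_ranks.get(pair, float("inf")))
        if bigram not in bpe_ranks:
            break
        first, second = bigram
        new_word = []
        i = 0
        while i < len(word):
            if i < len(word) - 1 and word[i] == first and word[i + 1] == second:
                new_word.append(first + second)
                i += 2
            else:
                new_word.append(word[i])
                i += 1
        word = tuple(new_word)
        if len(word) == 1:
            break
        pairs = get_pairs(word)
    return " ".join(word)


# Hypothetical merge table: a lower rank means the pair is merged earlier.
toy_ranks = {("l", "o"): 0, ("lo", "w</w>"): 1, ("e", "r</w>"): 2}

pieces = bpe("lower", toy_ranks).split(" ") + bpe("low", toy_ranks).split(" ")
# Same reassembly as `decode`: join the pieces, then turn end-of-word markers into spaces.
text = "".join(pieces).replace("</w>", " ").strip()
print(pieces, text)
```

Because every word ends in a piece carrying `</w>`, joining the pieces and replacing the marker with a space is enough to recover word boundaries.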
|
||||||
672
pytorch_pretrained_bert/tokenization_transfo_xl.py
Normal file
@@ -0,0 +1,672 @@
|
|||||||
|
# coding=utf-8
|
||||||
|
# Copyright 2018 Google AI, Google Brain and Carnegie Mellon University Authors and the HuggingFace Inc. team.
|
||||||
|
# Copyright (c) 2018, NVIDIA CORPORATION. All rights reserved.
|
||||||
|
#
|
||||||
|
# Licensed under the Apache License, Version 2.0 (the "License");
|
||||||
|
# you may not use this file except in compliance with the License.
|
||||||
|
# You may obtain a copy of the License at
|
||||||
|
#
|
||||||
|
# http://www.apache.org/licenses/LICENSE-2.0
|
||||||
|
#
|
||||||
|
# Unless required by applicable law or agreed to in writing, software
|
||||||
|
# distributed under the License is distributed on an "AS IS" BASIS,
|
||||||
|
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
|
||||||
|
# See the License for the specific language governing permissions and
|
||||||
|
# limitations under the License.
|
||||||
|
""" Tokenization classes for Transformer XL model.
|
||||||
|
Adapted from https://github.com/kimiyoung/transformer-xl.
|
||||||
|
"""
|
||||||
|
from __future__ import (absolute_import, division, print_function,
|
||||||
|
unicode_literals)
|
||||||
|
|
||||||
|
import glob
|
||||||
|
import logging
|
||||||
|
import os
|
||||||
|
import sys
|
||||||
|
from collections import Counter, OrderedDict
|
||||||
|
from io import open
|
||||||
|
import unicodedata
|
||||||
|
|
||||||
|
import torch
|
||||||
|
import numpy as np
|
||||||
|
|
||||||
|
from .file_utils import cached_path
|
||||||
|
|
||||||
|
if sys.version_info[0] == 2:
|
||||||
|
import cPickle as pickle
|
||||||
|
else:
|
||||||
|
import pickle
|
||||||
|
|
||||||
|
|
||||||
|
logger = logging.getLogger(__name__)
|
||||||
|
|
||||||
|
PRETRAINED_VOCAB_ARCHIVE_MAP = {
|
||||||
|
'transfo-xl-wt103': "https://s3.amazonaws.com/models.huggingface.co/bert/transfo-xl-wt103-vocab.bin",
|
||||||
|
}
|
||||||
|
VOCAB_NAME = 'vocab.bin'
|
||||||
|
|
||||||
|
PRETRAINED_CORPUS_ARCHIVE_MAP = {
|
||||||
|
'transfo-xl-wt103': "https://s3.amazonaws.com/models.huggingface.co/bert/transfo-xl-wt103-corpus.bin",
|
||||||
|
}
|
||||||
|
CORPUS_NAME = 'corpus.bin'
|
||||||
|
|
||||||
|
class TransfoXLTokenizer(object):
|
||||||
|
"""
|
||||||
|
Transformer-XL tokenizer adapted from Vocab class in https://github.com/kimiyoung/transformer-xl
|
||||||
|
"""
|
||||||
|
@classmethod
|
||||||
|
def from_pretrained(cls, pretrained_model_name_or_path, cache_dir=None, *inputs, **kwargs):
|
||||||
|
"""
|
||||||
|
Instantiate a TransfoXLTokenizer.
|
||||||
|
Download and cache the pre-trained vocabulary file if needed.
|
||||||
|
"""
|
||||||
|
if pretrained_model_name_or_path in PRETRAINED_VOCAB_ARCHIVE_MAP:
|
||||||
|
vocab_file = PRETRAINED_VOCAB_ARCHIVE_MAP[pretrained_model_name_or_path]
|
||||||
|
else:
|
||||||
|
vocab_file = os.path.join(pretrained_model_name_or_path, VOCAB_NAME)
|
||||||
|
# redirect to the cache, if necessary
|
||||||
|
try:
|
||||||
|
resolved_vocab_file = cached_path(vocab_file, cache_dir=cache_dir)
|
||||||
|
except EnvironmentError:
|
||||||
|
logger.error(
|
||||||
|
"Model name '{}' was not found in model name list ({}). "
|
||||||
|
"We assumed '{}' was a path or url but couldn't find files {} "
|
||||||
|
"at this path or url.".format(
|
||||||
|
pretrained_model_name_or_path,
|
||||||
|
', '.join(PRETRAINED_VOCAB_ARCHIVE_MAP.keys()),
|
||||||
|
pretrained_model_name_or_path,
|
||||||
|
vocab_file))
|
||||||
|
return None
|
||||||
|
if resolved_vocab_file == vocab_file:
|
||||||
|
logger.info("loading vocabulary file {}".format(vocab_file))
|
||||||
|
else:
|
||||||
|
logger.info("loading vocabulary file {} from cache at {}".format(
|
||||||
|
vocab_file, resolved_vocab_file))
|
||||||
|
|
||||||
|
# Instantiate tokenizer.
|
||||||
|
tokenizer = cls(*inputs, **kwargs)
|
||||||
|
vocab_dict = torch.load(resolved_vocab_file)
|
||||||
|
for key, value in vocab_dict.items():
|
||||||
|
tokenizer.__dict__[key] = value
|
||||||
|
return tokenizer
|
||||||
|
|
||||||
|
def __init__(self, special=[], min_freq=0, max_size=None, lower_case=False,
|
||||||
|
delimiter=None, vocab_file=None, never_split=("<unk>", "<eos>", "<formula>")):
|
||||||
|
self.counter = Counter()
|
||||||
|
self.special = special
|
||||||
|
self.min_freq = min_freq
|
||||||
|
self.max_size = max_size
|
||||||
|
self.lower_case = lower_case
|
||||||
|
self.delimiter = delimiter
|
||||||
|
self.vocab_file = vocab_file
|
||||||
|
self.never_split = never_split
|
||||||
|
|
||||||
|
def count_file(self, path, verbose=False, add_eos=False):
|
||||||
|
if verbose: print('counting file {} ...'.format(path))
|
||||||
|
assert os.path.exists(path)
|
||||||
|
|
||||||
|
sents = []
|
||||||
|
with open(path, 'r', encoding='utf-8') as f:
|
||||||
|
for idx, line in enumerate(f):
|
||||||
|
if verbose and idx > 0 and idx % 500000 == 0:
|
||||||
|
print(' line {}'.format(idx))
|
||||||
|
symbols = self.tokenize(line, add_eos=add_eos)
|
||||||
|
self.counter.update(symbols)
|
||||||
|
sents.append(symbols)
|
||||||
|
|
||||||
|
return sents
|
||||||
|
|
||||||
|
def count_sents(self, sents, verbose=False):
|
||||||
|
"""
|
||||||
|
sents : a list of sentences, each a list of tokenized symbols
|
||||||
|
"""
|
||||||
|
if verbose: print('counting {} sents ...'.format(len(sents)))
|
||||||
|
for idx, symbols in enumerate(sents):
|
||||||
|
if verbose and idx > 0 and idx % 500000 == 0:
|
||||||
|
print(' line {}'.format(idx))
|
||||||
|
self.counter.update(symbols)
|
||||||
|
|
||||||
|
def _build_from_file(self, vocab_file):
|
||||||
|
self.idx2sym = []
|
||||||
|
self.sym2idx = OrderedDict()
|
||||||
|
|
||||||
|
with open(vocab_file, 'r', encoding='utf-8') as f:
|
||||||
|
for line in f:
|
||||||
|
symb = line.strip().split()[0]
|
||||||
|
self.add_symbol(symb)
|
||||||
|
if '<UNK>' in self.sym2idx:
|
||||||
|
self.unk_idx = self.sym2idx['<UNK>']
|
||||||
|
elif '<unk>' in self.sym2idx:
|
||||||
|
self.unk_idx = self.sym2idx['<unk>']
|
||||||
|
else:
|
||||||
|
raise ValueError('No <unknown> token in vocabulary')
|
||||||
|
|
||||||
|
def build_vocab(self):
|
||||||
|
if self.vocab_file:
|
||||||
|
print('building vocab from {}'.format(self.vocab_file))
|
||||||
|
self._build_from_file(self.vocab_file)
|
||||||
|
print('final vocab size {}'.format(len(self)))
|
||||||
|
else:
|
||||||
|
print('building vocab with min_freq={}, max_size={}'.format(
|
||||||
|
self.min_freq, self.max_size))
|
||||||
|
self.idx2sym = []
|
||||||
|
self.sym2idx = OrderedDict()
|
||||||
|
|
||||||
|
for sym in self.special:
|
||||||
|
self.add_special(sym)
|
||||||
|
|
||||||
|
for sym, cnt in self.counter.most_common(self.max_size):
|
||||||
|
if cnt < self.min_freq: break
|
||||||
|
self.add_symbol(sym)
|
||||||
|
|
||||||
|
print('final vocab size {} from {} unique tokens'.format(
|
||||||
|
len(self), len(self.counter)))
|
||||||
|
|
||||||
|
def encode_file(self, path, ordered=False, verbose=False, add_eos=True,
|
||||||
|
add_double_eos=False):
|
||||||
|
if verbose: print('encoding file {} ...'.format(path))
|
||||||
|
assert os.path.exists(path)
|
||||||
|
encoded = []
|
||||||
|
with open(path, 'r', encoding='utf-8') as f:
|
||||||
|
for idx, line in enumerate(f):
|
||||||
|
if verbose and idx > 0 and idx % 500000 == 0:
|
||||||
|
print(' line {}'.format(idx))
|
||||||
|
symbols = self.tokenize(line, add_eos=add_eos,
|
||||||
|
add_double_eos=add_double_eos)
|
||||||
|
encoded.append(self.convert_to_tensor(symbols))
|
||||||
|
|
||||||
|
if ordered:
|
||||||
|
encoded = torch.cat(encoded)
|
||||||
|
|
||||||
|
return encoded
|
||||||
|
|
||||||
|
def encode_sents(self, sents, ordered=False, verbose=False):
|
||||||
|
if verbose: print('encoding {} sents ...'.format(len(sents)))
|
||||||
|
encoded = []
|
||||||
|
for idx, symbols in enumerate(sents):
|
||||||
|
if verbose and idx > 0 and idx % 500000 == 0:
|
||||||
|
print(' line {}'.format(idx))
|
||||||
|
encoded.append(self.convert_to_tensor(symbols))
|
||||||
|
|
||||||
|
if ordered:
|
||||||
|
encoded = torch.cat(encoded)
|
||||||
|
|
||||||
|
return encoded
|
||||||
|
|
||||||
|
def add_special(self, sym):
|
||||||
|
if sym not in self.sym2idx:
|
||||||
|
self.idx2sym.append(sym)
|
||||||
|
self.sym2idx[sym] = len(self.idx2sym) - 1
|
||||||
|
setattr(self, '{}_idx'.format(sym.strip('<>')), self.sym2idx[sym])
|
||||||
|
|
||||||
|
def add_symbol(self, sym):
|
||||||
|
if sym not in self.sym2idx:
|
||||||
|
self.idx2sym.append(sym)
|
||||||
|
self.sym2idx[sym] = len(self.idx2sym) - 1
|
||||||
|
|
||||||
|
def get_sym(self, idx):
|
||||||
|
assert 0 <= idx < len(self), 'Index {} out of vocabulary range'.format(idx)
|
||||||
|
return self.idx2sym[idx]
|
||||||
|
|
||||||
|
def get_idx(self, sym):
|
||||||
|
if sym in self.sym2idx:
|
||||||
|
return self.sym2idx[sym]
|
||||||
|
else:
|
||||||
|
# print('encounter unk {}'.format(sym))
|
||||||
|
# assert '<eos>' not in sym
|
||||||
|
if hasattr(self, 'unk_idx'):
|
||||||
|
return self.sym2idx.get(sym, self.unk_idx)
|
||||||
|
# Backward compatibility with pre-trained models
|
||||||
|
elif '<unk>' in self.sym2idx:
|
||||||
|
return self.sym2idx['<unk>']
|
||||||
|
elif '<UNK>' in self.sym2idx:
|
||||||
|
return self.sym2idx['<UNK>']
|
||||||
|
else:
|
||||||
|
raise ValueError('Token not in vocabulary and no <unk> token in vocabulary for replacement')
|
||||||
|
|
||||||
|
def convert_ids_to_tokens(self, indices):
|
||||||
|
"""Converts a sequence of indices into symbols using the vocab."""
|
||||||
|
return [self.get_sym(idx) for idx in indices]
|
||||||
|
|
||||||
|
def convert_tokens_to_ids(self, symbols):
|
||||||
|
"""Converts a sequence of symbols into ids using the vocab."""
|
||||||
|
return [self.get_idx(sym) for sym in symbols]
|
||||||
|
|
||||||
|
def convert_to_tensor(self, symbols):
|
||||||
|
return torch.LongTensor(self.convert_tokens_to_ids(symbols))
|
||||||
|
|
||||||
|
def decode(self, indices, exclude=None):
|
||||||
|
"""Converts a sequence of indices into a string."""
|
||||||
|
if exclude is None:
|
||||||
|
return ' '.join([self.get_sym(idx) for idx in indices])
|
||||||
|
else:
|
||||||
|
return ' '.join([self.get_sym(idx) for idx in indices if idx not in exclude])
|
||||||
|
|
||||||
|
def __len__(self):
|
||||||
|
return len(self.idx2sym)
|
||||||
|
|
||||||
|
def _run_split_on_punc(self, text):
|
||||||
|
"""Splits punctuation on a piece of text."""
|
||||||
|
if text in self.never_split:
|
||||||
|
return [text]
|
||||||
|
chars = list(text)
|
||||||
|
i = 0
|
||||||
|
start_new_word = True
|
||||||
|
output = []
|
||||||
|
while i < len(chars):
|
||||||
|
char = chars[i]
|
||||||
|
if _is_punctuation(char):
|
||||||
|
output.append([char])
|
||||||
|
start_new_word = True
|
||||||
|
else:
|
||||||
|
if start_new_word:
|
||||||
|
output.append([])
|
||||||
|
start_new_word = False
|
||||||
|
output[-1].append(char)
|
||||||
|
i += 1
|
||||||
|
|
||||||
|
return ["".join(x) for x in output]
|
||||||
|
|
||||||
|
def _run_strip_accents(self, text):
|
||||||
|
"""Strips accents from a piece of text."""
|
||||||
|
text = unicodedata.normalize("NFD", text)
|
||||||
|
output = []
|
||||||
|
for char in text:
|
||||||
|
cat = unicodedata.category(char)
|
||||||
|
if cat == "Mn":
|
||||||
|
continue
|
||||||
|
output.append(char)
|
||||||
|
return "".join(output)
|
||||||
|
|
||||||
|
def _clean_text(self, text):
|
||||||
|
"""Performs invalid character removal and whitespace cleanup on text."""
|
||||||
|
output = []
|
||||||
|
for char in text:
|
||||||
|
cp = ord(char)
|
||||||
|
if cp == 0 or cp == 0xfffd or _is_control(char):
|
||||||
|
continue
|
||||||
|
if _is_whitespace(char):
|
||||||
|
output.append(" ")
|
||||||
|
else:
|
||||||
|
output.append(char)
|
||||||
|
return "".join(output)
|
||||||
|
|
||||||
|
def whitespace_tokenize(self, text):
|
||||||
|
"""Runs basic whitespace cleaning and splitting on a piece of text."""
|
||||||
|
text = text.strip()
|
||||||
|
if not text:
|
||||||
|
return []
|
||||||
|
if self.delimiter == '':
|
||||||
|
tokens = text
|
||||||
|
else:
|
||||||
|
tokens = text.split(self.delimiter)
|
||||||
|
return tokens
|
||||||
|
|
||||||
|
def tokenize(self, line, add_eos=False, add_double_eos=False):
|
||||||
|
line = self._clean_text(line)
|
||||||
|
line = line.strip()
|
||||||
|
|
||||||
|
symbols = self.whitespace_tokenize(line)
|
||||||
|
|
||||||
|
split_symbols = []
|
||||||
|
for symbol in symbols:
|
||||||
|
if self.lower_case and symbol not in self.never_split:
|
||||||
|
symbol = symbol.lower()
|
||||||
|
symbol = self._run_strip_accents(symbol)
|
||||||
|
split_symbols.extend(self._run_split_on_punc(symbol))
|
||||||
|
|
||||||
|
if add_double_eos: # lm1b
|
||||||
|
return ['<S>'] + split_symbols + ['<S>']
|
||||||
|
elif add_eos:
|
||||||
|
return split_symbols + ['<eos>']
|
||||||
|
else:
|
||||||
|
return split_symbols
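The punctuation splitting that `tokenize` applies to each whitespace-separated symbol can be sketched in isolation. The `_is_punctuation` helper is not shown in this section, so the hypothetical `split_on_punc` below substitutes ASCII `string.punctuation` for it; the real helper may classify more characters as punctuation.

```python
import string


def split_on_punc(text, never_split=("<unk>", "<eos>", "<formula>")):
    """Split a symbol so that every punctuation character becomes its own piece."""
    if text in never_split:
        return [text]
    output = []
    start_new_word = True
    for char in text:
        if char in string.punctuation:  # assumption: stand-in for `_is_punctuation`
            output.append([char])
            start_new_word = True
        else:
            if start_new_word:
                output.append([])
                start_new_word = False
            output[-1].append(char)
    return ["".join(piece) for piece in output]


print(split_on_punc("hello,world!"))
print(split_on_punc("<eos>"))
```

Symbols listed in `never_split` are returned whole, which is what keeps markers such as `<eos>` from being broken at their angle brackets.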
|
||||||
|
|
||||||
|
|
||||||
|
class LMOrderedIterator(object):
|
||||||
|
def __init__(self, data, bsz, bptt, device='cpu', ext_len=None):
|
||||||
|
"""
|
||||||
|
data -- LongTensor -- the LongTensor is strictly ordered
|
||||||
|
"""
|
||||||
|
self.bsz = bsz
|
||||||
|
self.bptt = bptt
|
||||||
|
self.ext_len = ext_len if ext_len is not None else 0
|
||||||
|
|
||||||
|
self.device = device
|
||||||
|
|
||||||
|
# Work out how cleanly we can divide the dataset into bsz parts.
|
||||||
|
self.n_step = data.size(0) // bsz
|
||||||
|
|
||||||
|
# Trim off any extra elements that wouldn't cleanly fit (remainders).
|
||||||
|
data = data.narrow(0, 0, self.n_step * bsz)
|
||||||
|
|
||||||
|
# Evenly divide the data across the bsz batches.
|
||||||
|
self.data = data.view(bsz, -1).t().contiguous().to(device)
|
||||||
|
|
||||||
|
# Number of mini-batches
|
||||||
|
self.n_batch = (self.n_step + self.bptt - 1) // self.bptt
|
||||||
|
|
||||||
|
def get_batch(self, i, bptt=None):
|
||||||
|
if bptt is None: bptt = self.bptt
|
||||||
|
seq_len = min(bptt, self.data.size(0) - 1 - i)
|
||||||
|
|
||||||
|
end_idx = i + seq_len
|
||||||
|
beg_idx = max(0, i - self.ext_len)
|
||||||
|
|
||||||
|
data = self.data[beg_idx:end_idx]
|
||||||
|
target = self.data[i+1:i+1+seq_len]
|
||||||
|
|
||||||
|
data_out = data.transpose(0, 1).contiguous().to(self.device)
|
||||||
|
target_out = target.transpose(0, 1).contiguous().to(self.device)
|
||||||
|
|
||||||
|
return data_out, target_out, seq_len
|
||||||
|
|
||||||
|
def get_fixlen_iter(self, start=0):
|
||||||
|
for i in range(start, self.data.size(0) - 1, self.bptt):
|
||||||
|
yield self.get_batch(i)
|
||||||
|
|
||||||
|
def get_varlen_iter(self, start=0, std=5, min_len=5, max_deviation=3):
|
||||||
|
max_len = self.bptt + max_deviation * std
|
||||||
|
i = start
|
||||||
|
while True:
|
||||||
|
bptt = self.bptt if np.random.random() < 0.95 else self.bptt / 2.
|
||||||
|
bptt = min(max_len, max(min_len, int(np.random.normal(bptt, std))))
|
||||||
|
data, target, seq_len = self.get_batch(i, bptt)
|
||||||
|
i += seq_len
|
||||||
|
yield data, target, seq_len
|
||||||
|
if i >= self.data.size(0) - 2:
|
||||||
|
break
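The length sampling in `get_varlen_iter` can be looked at in isolation. The following is a minimal sketch, not the library's code: `sample_bptt` is a hypothetical helper name, and it uses the standard library's `random.gauss` in place of `np.random.normal` so that it runs without NumPy.

```python
import random


def sample_bptt(bptt, std=5, min_len=5, max_deviation=3, rng=random):
    """Draw one window length following the rule used in get_varlen_iter."""
    max_len = bptt + max_deviation * std
    # 95% of the time the draw is centred on bptt, otherwise on half of it.
    mean = bptt if rng.random() < 0.95 else bptt / 2.
    # Clamp the draw so a window is never shorter than min_len
    # or longer than max_len.
    return min(max_len, max(min_len, int(rng.gauss(mean, std))))


rng = random.Random(0)  # seeded only to make the sketch repeatable
lengths = [sample_bptt(70, rng=rng) for _ in range(1000)]
```

Every sampled length lies between `min_len` and `bptt + max_deviation * std`, which bounds how long a window the model has to handle.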
|
||||||
|
|
||||||
|
def __iter__(self):
|
||||||
|
return self.get_fixlen_iter()
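The layout that `LMOrderedIterator` builds may be easier to follow with plain lists. This is a hedged sketch under simplifying assumptions: `batchify` and `get_batch` are hypothetical stand-alone functions that mirror the index arithmetic above, they use nested Python lists instead of tensors, and they skip the final batch-first transpose and device transfer.

```python
def batchify(tokens, bsz):
    """Lay out an ordered token list as n_step rows of bsz columns."""
    n_step = len(tokens) // bsz
    tokens = tokens[:n_step * bsz]  # trim the remainder that does not fit
    # Column j holds the j-th contiguous chunk, like data.view(bsz, -1).t().
    return [[tokens[j * n_step + t] for j in range(bsz)] for t in range(n_step)]


def get_batch(data, i, bptt, ext_len=0):
    """Return (inputs, targets, seq_len) for the window starting at row i."""
    seq_len = min(bptt, len(data) - 1 - i)
    beg_idx = max(0, i - ext_len)           # optionally include extra context
    inputs = data[beg_idx:i + seq_len]
    targets = data[i + 1:i + 1 + seq_len]   # the inputs shifted by one step
    return inputs, targets, seq_len


data = batchify(list(range(10)), bsz=2)
inputs, targets, seq_len = get_batch(data, i=0, bptt=3)
last_inputs, last_targets, last_len = get_batch(data, i=3, bptt=3)
```

With ten tokens and `bsz=2`, each column is one contiguous half of the corpus, and the last window is shorter because one row must remain to serve as the target.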
|
||||||
|
|
||||||
|
|
||||||
|
class LMShuffledIterator(object):
|
||||||
|
def __init__(self, data, bsz, bptt, device='cpu', ext_len=None, shuffle=False):
|
||||||
|
"""
|
||||||
|
data -- list[LongTensor] -- there is no order among the LongTensors
|
||||||
|
"""
|
||||||
|
self.data = data
|
||||||
|
|
||||||
|
self.bsz = bsz
|
||||||
|
self.bptt = bptt
|
||||||
|
self.ext_len = ext_len if ext_len is not None else 0
|
||||||
|
|
||||||
|
self.device = device
|
||||||
|
self.shuffle = shuffle
|
||||||
|
|
||||||
|
def get_sent_stream(self):
|
||||||
|
# index iterator
|
||||||
|
epoch_indices = np.random.permutation(len(self.data)) if self.shuffle \
|
||||||
|
else np.array(range(len(self.data)))
|
||||||
|
|
||||||
|
# sentence iterator
|
||||||
|
for idx in epoch_indices:
|
||||||
|
yield self.data[idx]
|
||||||
|
|
||||||
|
def stream_iterator(self, sent_stream):
|
||||||
|
# streams for each data in the batch
|
||||||
|
streams = [None] * self.bsz
|
||||||
|
|
||||||
|
data = torch.LongTensor(self.bptt, self.bsz)
|
||||||
|
target = torch.LongTensor(self.bptt, self.bsz)
|
||||||
|
|
||||||
|
n_retain = 0
|
||||||
|
|
||||||
|
while True:
|
||||||
|
# data : [n_retain+bptt x bsz]
|
||||||
|
# target : [bptt x bsz]
|
||||||
|
data[n_retain:].fill_(-1)
|
||||||
|
target.fill_(-1)
|
||||||
|
|
||||||
|
valid_batch = True
|
||||||
|
|
||||||
|
for i in range(self.bsz):
|
||||||
|
n_filled = 0
|
||||||
|
try:
|
||||||
|
while n_filled < self.bptt:
|
||||||
|
if streams[i] is None or len(streams[i]) <= 1:
|
||||||
|
streams[i] = next(sent_stream)
|
||||||
|
# number of new tokens to fill in
|
||||||
|
n_new = min(len(streams[i]) - 1, self.bptt - n_filled)
|
||||||
|
# first n_retain tokens are retained from last batch
|
||||||
|
data[n_retain+n_filled:n_retain+n_filled+n_new, i] = \
|
||||||
|
streams[i][:n_new]
|
||||||
|
target[n_filled:n_filled+n_new, i] = \
|
||||||
|
streams[i][1:n_new+1]
|
||||||
|
streams[i] = streams[i][n_new:]
|
||||||
|
n_filled += n_new
|
||||||
|
except StopIteration:
|
||||||
|
valid_batch = False
|
||||||
|
break
|
||||||
|
|
||||||
|
if not valid_batch:
|
||||||
|
return
|
||||||
|
|
||||||
|
data_out = data.transpose(0, 1).contiguous().to(self.device)
|
||||||
|
target_out = target.transpose(0, 1).contiguous().to(self.device)
|
||||||
|
|
||||||
|
yield data_out, target_out, self.bptt
|
||||||
|
|
||||||
|
n_retain = min(data.size(0), self.ext_len)
|
||||||
|
if n_retain > 0:
|
||||||
|
data[:n_retain] = data[-n_retain:]
|
||||||
|
data.resize_(n_retain + self.bptt, data.size(1))
|
||||||
|
|
||||||
|
def __iter__(self):
|
||||||
|
# sent_stream is an iterator
|
||||||
|
sent_stream = self.get_sent_stream()
|
||||||
|
|
||||||
|
for batch in self.stream_iterator(sent_stream):
|
||||||
|
yield batch
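The packing loop in `stream_iterator` can be sketched without tensors. This is an illustrative simplification, not the class above: `stream_batches` is a hypothetical name, it works on nested lists, it ignores `ext_len` (so nothing is retained between batches), and it yields time-major lists rather than the transposed tensors the real method returns. It also assumes no sentence is empty.

```python
def stream_batches(sentences, bsz, bptt, pad=-1):
    """Yield (data, target) lists of shape [bptt][bsz] packed from sentences."""
    sent_stream = iter(sentences)
    streams = [None] * bsz  # the unfinished sentence for each batch column
    while True:
        data = [[pad] * bsz for _ in range(bptt)]
        target = [[pad] * bsz for _ in range(bptt)]
        for i in range(bsz):
            n_filled = 0
            while n_filled < bptt:
                if streams[i] is None or len(streams[i]) <= 1:
                    try:
                        streams[i] = next(sent_stream)
                    except StopIteration:
                        return  # not enough text left for a full batch
                # Number of new tokens that fit in this column.
                n_new = min(len(streams[i]) - 1, bptt - n_filled)
                for k in range(n_new):
                    data[n_filled + k][i] = streams[i][k]
                    target[n_filled + k][i] = streams[i][k + 1]
                # Keep the last consumed target as the next input token.
                streams[i] = streams[i][n_new:]
                n_filled += n_new
        yield data, target


sentences = [[1, 2, 3, 4], [5, 6, 7], [8, 9, 10, 11, 12]]
batches = list(stream_batches(sentences, bsz=2, bptt=2))
```

As in the original loop, a partially filled batch is dropped once the sentence stream runs out, so this toy input yields a single batch.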
|
||||||
|
|
||||||
|
|
||||||
|
class LMMultiFileIterator(LMShuffledIterator):
|
||||||
|
def __init__(self, paths, vocab, bsz, bptt, device='cpu', ext_len=None,
|
||||||
|
shuffle=False):
|
||||||
|
|
||||||
|
self.paths = paths
|
||||||
|
self.vocab = vocab
|
||||||
|
|
||||||
|
self.bsz = bsz
|
||||||
|
self.bptt = bptt
|
||||||
|
self.ext_len = ext_len if ext_len is not None else 0
|
||||||
|
|
||||||
|
self.device = device
|
||||||
|
self.shuffle = shuffle
|
||||||
|
|
||||||
|
def get_sent_stream(self, path):
|
||||||
|
sents = self.vocab.encode_file(path, add_double_eos=True)
|
||||||
|
if self.shuffle:
|
||||||
|
np.random.shuffle(sents)
|
||||||
|
sent_stream = iter(sents)
|
||||||
|
|
||||||
|
return sent_stream
|
||||||
|
|
||||||
|
def __iter__(self):
|
||||||
|
if self.shuffle:
|
||||||
|
np.random.shuffle(self.paths)
|
||||||
|
|
||||||
|
for path in self.paths:
|
||||||
|
# sent_stream is an iterator
|
||||||
|
sent_stream = self.get_sent_stream(path)
|
||||||
|
for batch in self.stream_iterator(sent_stream):
|
||||||
|
yield batch
|
||||||
|
|
||||||
|
|
||||||
|
class TransfoXLCorpus(object):
|
||||||
|
@classmethod
|
||||||
|
def from_pretrained(cls, pretrained_model_name_or_path, cache_dir=None, *inputs, **kwargs):
|
||||||
|
"""
|
||||||
|
Instantiate a pre-processed corpus.
|
||||||
|
"""
|
||||||
|
vocab = TransfoXLTokenizer.from_pretrained(pretrained_model_name_or_path, *inputs, **kwargs)
|
||||||
|
if pretrained_model_name_or_path in PRETRAINED_CORPUS_ARCHIVE_MAP:
|
||||||
|
corpus_file = PRETRAINED_CORPUS_ARCHIVE_MAP[pretrained_model_name_or_path]
|
||||||
|
else:
|
||||||
|
corpus_file = os.path.join(pretrained_model_name_or_path, CORPUS_NAME)
|
||||||
|
# redirect to the cache, if necessary
|
||||||
|
try:
|
||||||
|
resolved_corpus_file = cached_path(corpus_file, cache_dir=cache_dir)
|
||||||
|
except EnvironmentError:
|
||||||
|
logger.error(
|
||||||
|
"Corpus '{}' was not found in corpus list ({}). "
|
||||||
|
"We assumed '{}' was a path or url but couldn't find files {} "
|
||||||
|
"at this path or url.".format(
|
||||||
|
pretrained_model_name_or_path,
|
||||||
|
', '.join(PRETRAINED_CORPUS_ARCHIVE_MAP.keys()),
|
||||||
|
pretrained_model_name_or_path,
|
||||||
|
corpus_file))
|
||||||
|
return None
|
||||||
|
if resolved_corpus_file == corpus_file:
|
||||||
|
logger.info("loading corpus file {}".format(corpus_file))
|
||||||
|
else:
|
||||||
|
logger.info("loading corpus file {} from cache at {}".format(
|
||||||
|
corpus_file, resolved_corpus_file))
|
||||||
|
|
||||||
|
# Instantiate corpus.
|
||||||
|
corpus = cls(*inputs, **kwargs)
|
||||||
|
corpus_dict = torch.load(resolved_corpus_file)
|
||||||
|
for key, value in corpus_dict.items():
|
||||||
|
corpus.__dict__[key] = value
|
||||||
|
corpus.vocab = vocab
|
||||||
|
if corpus.train is not None:
|
||||||
|
corpus.train = torch.tensor(corpus.train, dtype=torch.long)
|
||||||
|
if corpus.valid is not None:
|
||||||
|
corpus.valid = torch.tensor(corpus.valid, dtype=torch.long)
|
||||||
|
if corpus.test is not None:
|
||||||
|
corpus.test = torch.tensor(corpus.test, dtype=torch.long)
|
||||||
|
return corpus
|
||||||
|
|
||||||
|
def __init__(self, *args, **kwargs):
|
||||||
|
self.vocab = TransfoXLTokenizer(*args, **kwargs)
|
||||||
|
self.dataset = None
|
||||||
|
self.train = None
|
||||||
|
self.valid = None
|
||||||
|
self.test = None
|
||||||
|
|
||||||
|
def build_corpus(self, path, dataset):
|
||||||
|
self.dataset = dataset
|
||||||
|
|
||||||
|
if self.dataset in ['ptb', 'wt2', 'enwik8', 'text8']:
|
||||||
|
self.vocab.count_file(os.path.join(path, 'train.txt'))
|
||||||
|
self.vocab.count_file(os.path.join(path, 'valid.txt'))
|
||||||
|
self.vocab.count_file(os.path.join(path, 'test.txt'))
|
||||||
|
elif self.dataset == 'wt103':
|
||||||
|
self.vocab.count_file(os.path.join(path, 'train.txt'))
|
||||||
|
elif self.dataset == 'lm1b':
|
||||||
|
train_path_pattern = os.path.join(
|
||||||
|
path, '1-billion-word-language-modeling-benchmark-r13output',
|
||||||
|
'training-monolingual.tokenized.shuffled', 'news.en-*')
|
||||||
|
train_paths = glob.glob(train_path_pattern)
|
||||||
|
# the vocab will load from file when build_vocab() is called
|
||||||
|
|
||||||
|
self.vocab.build_vocab()
|
||||||
|
|
||||||
|
if self.dataset in ['ptb', 'wt2', 'wt103']:
|
||||||
|
self.train = self.vocab.encode_file(
|
||||||
|
os.path.join(path, 'train.txt'), ordered=True)
|
||||||
|
self.valid = self.vocab.encode_file(
|
||||||
|
os.path.join(path, 'valid.txt'), ordered=True)
|
||||||
|
self.test = self.vocab.encode_file(
|
||||||
|
os.path.join(path, 'test.txt'), ordered=True)
|
||||||
|
elif self.dataset in ['enwik8', 'text8']:
|
||||||
|
self.train = self.vocab.encode_file(
|
||||||
|
os.path.join(path, 'train.txt'), ordered=True, add_eos=False)
|
||||||
|
self.valid = self.vocab.encode_file(
|
||||||
|
os.path.join(path, 'valid.txt'), ordered=True, add_eos=False)
|
||||||
|
self.test = self.vocab.encode_file(
|
||||||
|
os.path.join(path, 'test.txt'), ordered=True, add_eos=False)
|
||||||
|
elif self.dataset == 'lm1b':
|
||||||
|
self.train = train_paths
|
||||||
|
self.valid = self.vocab.encode_file(
|
||||||
|
os.path.join(path, 'valid.txt'), ordered=False, add_double_eos=True)
|
||||||
|
self.test = self.vocab.encode_file(
|
||||||
|
os.path.join(path, 'test.txt'), ordered=False, add_double_eos=True)
|
||||||
|
|
||||||
|
def get_iterator(self, split, *args, **kwargs):
|
||||||
|
if split == 'train':
|
||||||
|
if self.dataset in ['ptb', 'wt2', 'wt103', 'enwik8', 'text8']:
|
||||||
|
data_iter = LMOrderedIterator(self.train, *args, **kwargs)
|
||||||
|
elif self.dataset == 'lm1b':
|
||||||
|
kwargs['shuffle'] = True
|
||||||
|
data_iter = LMMultiFileIterator(self.train, self.vocab, *args, **kwargs)
|
||||||
|
elif split in ['valid', 'test']:
|
||||||
|
data = self.valid if split == 'valid' else self.test
|
||||||
|
if self.dataset in ['ptb', 'wt2', 'wt103', 'enwik8', 'text8']:
|
||||||
|
data_iter = LMOrderedIterator(data, *args, **kwargs)
|
||||||
|
elif self.dataset == 'lm1b':
|
||||||
|
data_iter = LMShuffledIterator(data, *args, **kwargs)
|
||||||
|
|
||||||
|
return data_iter
|
||||||
|
|
||||||
|
|
||||||
|
def get_lm_corpus(datadir, dataset):
|
||||||
|
fn = os.path.join(datadir, 'cache.pt')
|
||||||
|
fn_pickle = os.path.join(datadir, 'cache.pkl')
|
||||||
|
if os.path.exists(fn):
|
||||||
|
print('Loading cached dataset...')
|
||||||
|
corpus = torch.load(fn)
|
||||||
|
elif os.path.exists(fn_pickle):
|
||||||
|
print('Loading cached dataset from pickle...')
|
||||||
|
with open(fn_pickle, "rb") as fp:
|
||||||
|
corpus = pickle.load(fp)
|
||||||
|
else:
|
||||||
|
print('Producing dataset {}...'.format(dataset))
|
||||||
|
kwargs = {}
|
||||||
|
if dataset in ['wt103', 'wt2']:
|
||||||
|
kwargs['special'] = ['<eos>']
|
||||||
|
kwargs['lower_case'] = False
|
||||||
|
elif dataset == 'ptb':
|
||||||
|
kwargs['special'] = ['<eos>']
|
||||||
|
kwargs['lower_case'] = True
|
||||||
|
elif dataset == 'lm1b':
|
||||||
|
kwargs['special'] = []
|
||||||
|
kwargs['lower_case'] = False
|
||||||
|
kwargs['vocab_file'] = os.path.join(datadir, '1b_word_vocab.txt')
|
||||||
|
elif dataset in ['enwik8', 'text8']:
|
||||||
|
pass
|
||||||
|
|
||||||
|
corpus = TransfoXLCorpus(datadir, dataset, **kwargs)
|
||||||
|
torch.save(corpus, fn)
|
||||||
|
|
||||||
|
return corpus
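The load-or-build caching that `get_lm_corpus` relies on can be shown on its own. This is a minimal sketch assuming a picklable object and a single cache file. `load_or_build` and `build` are hypothetical names, and it uses `pickle` rather than `torch.save` / `torch.load` so that it runs with the standard library alone.

```python
import os
import pickle
import tempfile


def load_or_build(cache_path, build_fn):
    """Return the cached object if present, otherwise build and cache it."""
    if os.path.exists(cache_path):
        with open(cache_path, "rb") as fp:
            return pickle.load(fp)
    obj = build_fn()
    with open(cache_path, "wb") as fp:
        pickle.dump(obj, fp)
    return obj


calls = []


def build():
    """Stand-in for the expensive corpus construction step."""
    calls.append(1)
    return {"train": [1, 2, 3]}


cache_file = os.path.join(tempfile.mkdtemp(), "cache.pkl")
first = load_or_build(cache_file, build)   # builds and writes the cache
second = load_or_build(cache_file, build)  # served from the cache file
```

The second call never reaches `build`, which is the behaviour the cache is meant to provide.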
|
||||||
|
|
||||||
|
def _is_whitespace(char):
|
||||||
|
"""Checks whether `char` is a whitespace character."""
|
||||||
|
# \t, \n, and \r are technically control characters but we treat them
|
||||||
|
# as whitespace since they are generally considered as such.
|
||||||
|
if char == " " or char == "\t" or char == "\n" or char == "\r":
|
||||||
|
return True
|
||||||
|
cat = unicodedata.category(char)
|
||||||
|
if cat == "Zs":
|
||||||
|
return True
|
||||||
|
return False
|
||||||
|
|
||||||
|
|
||||||
|
def _is_control(char):
|
||||||
|
"""Checks whether `char` is a control character."""
|
||||||
|
# These are technically control characters but we count them as whitespace
|
||||||
|
# characters.
|
||||||
|
if char == "\t" or char == "\n" or char == "\r":
|
||||||
|
return False
|
||||||
|
cat = unicodedata.category(char)
|
||||||
|
if cat.startswith("C"):
|
||||||
|
return True
|
||||||
|
return False
|
||||||
|
|
||||||
|
|
||||||
|
def _is_punctuation(char):
|
||||||
|
"""Checks whether `char` is a punctuation character."""
|
||||||
|
cp = ord(char)
|
||||||
|
# We treat all non-letter/number ASCII as punctuation.
|
||||||
|
# Characters such as "^", "$", and "`" are not in the Unicode
|
||||||
|
# Punctuation class but we treat them as punctuation anyways, for
|
||||||
|
# consistency.
|
||||||
|
if ((cp >= 33 and cp <= 47) or (cp >= 58 and cp <= 64) or
|
||||||
|
(cp >= 91 and cp <= 96) or (cp >= 123 and cp <= 126)):
|
||||||
|
return True
|
||||||
|
cat = unicodedata.category(char)
|
||||||
|
if cat.startswith("P"):
|
||||||
|
return True
|
||||||
|
return False
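The three helpers above partition characters by Unicode category, with a few ASCII overrides. The sketch below folds the same checks into one function to show how they interact. `classify` is a hypothetical name introduced for illustration, and the order of the checks (whitespace first, then control, then punctuation) is an assumption of this sketch rather than something the helpers themselves enforce.

```python
import unicodedata


def classify(char):
    """Label a character as 'whitespace', 'control', 'punctuation' or 'other'."""
    # \t, \n and \r are control characters by category but count as whitespace.
    if char in (" ", "\t", "\n", "\r") or unicodedata.category(char) == "Zs":
        return "whitespace"
    if unicodedata.category(char).startswith("C"):
        return "control"
    cp = ord(char)
    # All non-alphanumeric printable ASCII is treated as punctuation,
    # including symbols such as "$" and "^" that Unicode files under "S".
    if (33 <= cp <= 47) or (58 <= cp <= 64) or (91 <= cp <= 96) or (123 <= cp <= 126):
        return "punctuation"
    if unicodedata.category(char).startswith("P"):
        return "punctuation"
    return "other"


labels = [classify(c) for c in [" ", "\u00a0", "\x00", "$", "\u00bf", "a"]]
```

Here the no-break space (`\u00a0`, category `Zs`) counts as whitespace, `$` is punctuation only because of the ASCII range check, and the inverted question mark (`\u00bf`, category `Po`) is punctuation by category.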
|
||||||
9
setup.py
@@ -33,12 +33,13 @@ To create the package for pypi.
|
|||||||
7. Copy the release notes from RELEASE.md to the tag in github once everything is looking hunky-dory.
|
7. Copy the release notes from RELEASE.md to the tag in github once everything is looking hunky-dory.
|
||||||
|
|
||||||
"""
|
"""
|
||||||
|
from io import open
|
||||||
from setuptools import find_packages, setup
|
from setuptools import find_packages, setup
|
||||||
|
|
||||||
setup(
|
setup(
|
||||||
name="pytorch_pretrained_bert",
|
name="pytorch_pretrained_bert",
|
||||||
version="0.4.0",
|
version="0.5.0",
|
||||||
author="Thomas Wolf, Victor Sanh, Tim Rault, Google AI Language Team Authors",
|
author="Thomas Wolf, Victor Sanh, Tim Rault, Google AI Language Team Authors, Open AI team Authors",
|
||||||
author_email="thomas@huggingface.co",
|
author_email="thomas@huggingface.co",
|
||||||
description="PyTorch version of Google AI BERT model with script to load Google pre-trained models",
|
description="PyTorch version of Google AI BERT model with script to load Google pre-trained models",
|
||||||
long_description=open("README.md", "r", encoding='utf-8').read(),
|
long_description=open("README.md", "r", encoding='utf-8').read(),
|
||||||
@@ -55,10 +56,10 @@ setup(
|
|||||||
'tqdm'],
|
'tqdm'],
|
||||||
entry_points={
|
entry_points={
|
||||||
'console_scripts': [
|
'console_scripts': [
|
||||||
"pytorch_pretrained_bert=pytorch_pretrained_bert.__main__:main"
|
"pytorch_pretrained_bert=pytorch_pretrained_bert.__main__:main",
|
||||||
]
|
]
|
||||||
},
|
},
|
||||||
python_requires='>=3.5.0',
|
# python_requires='>=3.5.0',
|
||||||
tests_require=['pytest'],
|
tests_require=['pytest'],
|
||||||
classifiers=[
|
classifiers=[
|
||||||
'Intended Audience :: Science/Research',
|
'Intended Audience :: Science/Research',
|
||||||
|
|||||||
222
tests/modeling_openai_test.py
Normal file
@@ -0,0 +1,222 @@
|
|||||||
|
# coding=utf-8
|
||||||
|
# Copyright 2018 The Google AI Language Team Authors.
|
||||||
|
#
|
||||||
|
# Licensed under the Apache License, Version 2.0 (the "License");
|
||||||
|
# you may not use this file except in compliance with the License.
|
||||||
|
# You may obtain a copy of the License at
|
||||||
|
#
|
||||||
|
# http://www.apache.org/licenses/LICENSE-2.0
|
||||||
|
#
|
||||||
|
# Unless required by applicable law or agreed to in writing, software
|
||||||
|
# distributed under the License is distributed on an "AS IS" BASIS,
|
||||||
|
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
|
||||||
|
# See the License for the specific language governing permissions and
|
||||||
|
# limitations under the License.
|
||||||
|
from __future__ import absolute_import
|
||||||
|
from __future__ import division
|
||||||
|
from __future__ import print_function
|
||||||
|
|
||||||
|
import unittest
|
||||||
|
import json
|
||||||
|
import random
|
||||||
|
|
||||||
|
import torch
|
||||||
|
|
||||||
|
from pytorch_pretrained_bert import (OpenAIGPTConfig, OpenAIGPTModel,
|
||||||
|
OpenAIGPTLMHeadModel, OpenAIGPTDoubleHeadsModel)
|
||||||
|
|
||||||
|
|
||||||
|
class OpenAIGPTModelTest(unittest.TestCase):
|
||||||
|
class OpenAIGPTModelTester(object):
|
||||||
|
|
||||||
|
def __init__(self,
|
||||||
|
parent,
|
||||||
|
batch_size=13,
|
||||||
|
seq_length=7,
|
||||||
|
is_training=True,
|
||||||
|
use_position_ids=True,
|
||||||
|
use_token_type_ids=True,
|
||||||
|
use_labels=True,
|
||||||
|
vocab_size=99,
|
||||||
|
n_special=1,
|
||||||
|
n_positions=33,
|
||||||
|
n_embd=32,
|
||||||
|
n_layer=5,
|
||||||
|
n_head=4,
|
||||||
|
n_choices=3,
|
||||||
|
afn="gelu",
|
||||||
|
resid_pdrop=0.1,
|
||||||
|
attn_pdrop=0.1,
|
||||||
|
embd_pdrop=0.1,
|
||||||
|
type_sequence_label_size=2,
|
||||||
|
initializer_range=0.02,
|
||||||
|
num_labels=3,
|
||||||
|
scope=None):
|
||||||
|
self.parent = parent
|
||||||
|
self.batch_size = batch_size
|
||||||
|
self.seq_length = seq_length
|
||||||
|
self.is_training = is_training
|
||||||
|
self.use_position_ids = use_position_ids
|
||||||
|
self.use_token_type_ids = use_token_type_ids
|
||||||
|
self.use_labels = use_labels
|
||||||
|
self.vocab_size = vocab_size
|
||||||
|
self.n_special = n_special
|
||||||
|
self.n_positions = n_positions
|
||||||
|
self.n_embd = n_embd
|
||||||
|
self.n_layer = n_layer
|
||||||
|
self.n_head = n_head
|
||||||
|
self.afn = afn
|
||||||
|
self.n_choices = n_choices
|
||||||
|
self.resid_pdrop = resid_pdrop
|
||||||
|
self.attn_pdrop = attn_pdrop
|
||||||
|
self.embd_pdrop = embd_pdrop
|
||||||
|
self.type_sequence_label_size = type_sequence_label_size
|
||||||
|
self.initializer_range = initializer_range
|
||||||
|
self.num_labels = num_labels
|
||||||
|
self.scope = scope
|
||||||
|
|
||||||
|
def prepare_config_and_inputs(self):
|
||||||
|
input_ids = OpenAIGPTModelTest.ids_tensor([self.batch_size, self.n_choices, self.seq_length], self.vocab_size)
|
||||||
|
|
||||||
|
position_ids = None
|
||||||
|
if self.use_position_ids:
|
||||||
|
position_ids = OpenAIGPTModelTest.ids_tensor([self.batch_size, self.n_choices, self.seq_length], self.n_positions)
|
||||||
|
|
||||||
|
token_type_ids = None
|
||||||
|
if self.use_token_type_ids:
|
||||||
|
total_voc = self.vocab_size + self.n_special
|
||||||
|
token_type_ids = OpenAIGPTModelTest.ids_tensor([self.batch_size, self.n_choices, self.seq_length], total_voc)
|
||||||
|
|
||||||
|
mc_labels = None
|
||||||
|
lm_labels = None
|
||||||
|
mc_token_ids = None
|
||||||
|
if self.use_labels:
|
||||||
|
mc_labels = OpenAIGPTModelTest.ids_tensor([self.batch_size], self.type_sequence_label_size)
|
||||||
|
lm_labels = OpenAIGPTModelTest.ids_tensor([self.batch_size, self.n_choices, self.seq_length], self.num_labels)
|
||||||
|
mc_token_ids = OpenAIGPTModelTest.ids_tensor([self.batch_size, self.n_choices], self.seq_length)
|
||||||
|
|
||||||
|
config = OpenAIGPTConfig(
|
||||||
|
vocab_size_or_config_json_file=self.vocab_size,
|
||||||
|
n_positions=self.n_positions,
|
||||||
|
n_special=self.n_special,
|
||||||
|
n_embd=self.n_embd,
|
||||||
|
n_layer=self.n_layer,
|
||||||
|
n_head=self.n_head,
|
||||||
|
afn=self.afn,
|
||||||
|
resid_pdrop=self.resid_pdrop,
|
||||||
|
attn_pdrop=self.attn_pdrop,
|
||||||
|
embd_pdrop=self.embd_pdrop,
|
||||||
|
initializer_range=self.initializer_range)
|
||||||
|
|
||||||
|
return (config, input_ids, token_type_ids, position_ids,
|
||||||
|
mc_labels, lm_labels, mc_token_ids)
|
||||||
|
|
||||||
|
def create_openai_model(self, config, input_ids, token_type_ids, position_ids,
|
||||||
|
mc_labels, lm_labels, mc_token_ids):
|
||||||
|
model = OpenAIGPTModel(config)
|
||||||
|
model.eval()
|
||||||
|
hidden_states = model(input_ids, position_ids, token_type_ids)
|
||||||
|
outputs = {
|
||||||
|
"hidden_states": hidden_states,
|
||||||
|
}
|
||||||
|
return outputs
|
||||||
|
|
||||||
|
def check_openai_model_output(self, result):
|
||||||
|
self.parent.assertListEqual(
|
||||||
|
list(result["hidden_states"].size()),
|
||||||
|
[self.batch_size, self.n_choices, self.seq_length, self.n_embd])
|
||||||
|
|
||||||
|
|
||||||
|
def create_openai_lm_head(self, config, input_ids, token_type_ids, position_ids,
|
||||||
|
mc_labels, lm_labels, mc_token_ids):
|
||||||
|
model = OpenAIGPTLMHeadModel(config)
|
||||||
|
model.eval()
|
||||||
|
loss = model(input_ids, position_ids, token_type_ids, lm_labels)
|
||||||
|
lm_logits = model(input_ids, position_ids, token_type_ids)
|
||||||
|
outputs = {
|
||||||
|
"loss": loss,
|
||||||
|
"lm_logits": lm_logits,
|
||||||
|
}
|
||||||
|
return outputs
|
||||||
|
|
||||||
|
def check_openai_lm_head_output(self, result):
|
||||||
|
total_voc = self.n_special + self.vocab_size
|
||||||
|
self.parent.assertListEqual(
|
||||||
|
list(result["lm_logits"].size()),
|
||||||
|
[self.batch_size, self.n_choices, self.seq_length, total_voc])
|
||||||
|
|
||||||
|
def check_openai_lm_head_loss_output(self, result):
|
||||||
|
self.parent.assertListEqual(
|
||||||
|
list(result["loss"].size()),
|
||||||
|
[])
|
||||||
|
|
||||||
|
def create_openai_double_heads(self, config, input_ids, token_type_ids, position_ids,
|
||||||
|
mc_labels, lm_labels, mc_token_ids):
|
||||||
|
model = OpenAIGPTDoubleHeadsModel(config)
|
||||||
|
model.eval()
|
||||||
|
loss = model(input_ids, mc_token_ids,
|
||||||
|
lm_labels=lm_labels, mc_labels=mc_labels,
|
||||||
|
token_type_ids=token_type_ids, position_ids=position_ids)
|
||||||
|
lm_logits, mc_logits = model(input_ids, mc_token_ids, position_ids=position_ids, token_type_ids=token_type_ids)
|
||||||
|
outputs = {
|
||||||
|
"loss": loss,
|
||||||
|
"lm_logits": lm_logits,
|
||||||
|
"mc_logits": mc_logits,
|
||||||
|
}
|
||||||
|
return outputs
|
||||||
|
|
||||||
|
def check_openai_double_heads_output(self, result):
|
||||||
|
total_voc = self.n_special + self.vocab_size
|
||||||
|
self.parent.assertListEqual(
|
||||||
|
list(result["lm_logits"].size()),
|
||||||
|
[self.batch_size, self.n_choices, self.seq_length, total_voc])
|
||||||
|
self.parent.assertListEqual(
|
||||||
|
list(result["mc_logits"].size()),
|
||||||
|
[self.batch_size, self.n_choices])
|
||||||
|
|
||||||
|
def check_openai_double_heads_loss_output(self, result):
|
||||||
|
self.parent.assertListEqual(
|
||||||
|
[list(l.size()) for l in result["loss"]],
|
||||||
|
[[], []])
|
||||||
|
|
||||||
|
def test_default(self):
|
||||||
|
self.run_tester(OpenAIGPTModelTest.OpenAIGPTModelTester(self))
|
||||||
|
|
||||||
|
def test_config_to_json_string(self):
|
||||||
|
config = OpenAIGPTConfig(vocab_size_or_config_json_file=99, n_embd=37)
|
||||||
|
obj = json.loads(config.to_json_string())
|
||||||
|
self.assertEqual(obj["vocab_size"], 99)
|
||||||
|
self.assertEqual(obj["n_embd"], 37)
|
||||||
|
|
||||||
|
def run_tester(self, tester):
|
||||||
|
config_and_inputs = tester.prepare_config_and_inputs()
|
||||||
|
output_result = tester.create_openai_model(*config_and_inputs)
|
||||||
|
tester.check_openai_model_output(output_result)
|
||||||
|
|
||||||
|
output_result = tester.create_openai_lm_head(*config_and_inputs)
|
||||||
|
tester.check_openai_lm_head_output(output_result)
|
||||||
|
tester.check_openai_lm_head_loss_output(output_result)
|
||||||
|
|
||||||
|
output_result = tester.create_openai_double_heads(*config_and_inputs)
|
||||||
|
tester.check_openai_double_heads_output(output_result)
|
||||||
|
tester.check_openai_double_heads_loss_output(output_result)
|
||||||
|
|
||||||
|
@classmethod
|
||||||
|
def ids_tensor(cls, shape, vocab_size, rng=None, name=None):
|
||||||
|
"""Creates a random int64 (torch.long) tensor of the given shape with values in [0, vocab_size)."""
|
||||||
|
if rng is None:
|
||||||
|
rng = random.Random()
|
||||||
|
|
||||||
|
total_dims = 1
|
||||||
|
for dim in shape:
|
||||||
|
total_dims *= dim
|
||||||
|
|
||||||
|
values = []
|
||||||
|
for _ in range(total_dims):
|
||||||
|
values.append(rng.randint(0, vocab_size - 1))
|
||||||
|
|
||||||
|
return torch.tensor(data=values, dtype=torch.long).view(shape).contiguous()
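Because `ids_tensor` accepts an `rng` argument, a seeded `random.Random` makes the generated ids repeatable. The following is a small sketch of that idea without torch: `random_ids` is a hypothetical helper that returns the flat list of values, leaving out the final conversion to a tensor and reshape.

```python
import random


def random_ids(shape, vocab_size, rng=None):
    """Return prod(shape) random ids in [0, vocab_size) as a flat list."""
    if rng is None:
        rng = random.Random()
    total_dims = 1
    for dim in shape:
        total_dims *= dim
    # randint is inclusive on both ends, hence vocab_size - 1.
    return [rng.randint(0, vocab_size - 1) for _ in range(total_dims)]


ids_a = random_ids([13, 3, 7], 99, rng=random.Random(1))
ids_b = random_ids([13, 3, 7], 99, rng=random.Random(1))
```

Two generators created with the same seed produce identical id sequences, which helps when a failing shape check needs to be reproduced.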
|
||||||
|
|
||||||
|
|
||||||
|
if __name__ == "__main__":
|
||||||
|
unittest.main()
|
||||||
@@ -114,6 +114,7 @@ class BertModelTest(unittest.TestCase):
|
|||||||
|
|
||||||
def create_bert_model(self, config, input_ids, token_type_ids, input_mask, sequence_labels, token_labels):
|
def create_bert_model(self, config, input_ids, token_type_ids, input_mask, sequence_labels, token_labels):
|
||||||
model = BertModel(config=config)
|
model = BertModel(config=config)
|
||||||
|
model.eval()
|
||||||
all_encoder_layers, pooled_output = model(input_ids, token_type_ids, input_mask)
|
all_encoder_layers, pooled_output = model(input_ids, token_type_ids, input_mask)
|
||||||
outputs = {
|
outputs = {
|
||||||
"sequence_output": all_encoder_layers[-1],
|
"sequence_output": all_encoder_layers[-1],
|
||||||
@@ -134,6 +135,7 @@ class BertModelTest(unittest.TestCase):
|
|||||||
|
|
||||||
def create_bert_for_masked_lm(self, config, input_ids, token_type_ids, input_mask, sequence_labels, token_labels):
|
def create_bert_for_masked_lm(self, config, input_ids, token_type_ids, input_mask, sequence_labels, token_labels):
|
||||||
model = BertForMaskedLM(config=config)
|
model = BertForMaskedLM(config=config)
|
||||||
|
model.eval()
|
||||||
loss = model(input_ids, token_type_ids, input_mask, token_labels)
|
loss = model(input_ids, token_type_ids, input_mask, token_labels)
|
||||||
prediction_scores = model(input_ids, token_type_ids, input_mask)
|
prediction_scores = model(input_ids, token_type_ids, input_mask)
|
||||||
outputs = {
|
outputs = {
|
||||||
@@ -149,6 +151,7 @@ class BertModelTest(unittest.TestCase):
|
|||||||
|
|
||||||
def create_bert_for_next_sequence_prediction(self, config, input_ids, token_type_ids, input_mask, sequence_labels, token_labels):
|
def create_bert_for_next_sequence_prediction(self, config, input_ids, token_type_ids, input_mask, sequence_labels, token_labels):
|
||||||
model = BertForNextSentencePrediction(config=config)
|
model = BertForNextSentencePrediction(config=config)
|
||||||
|
model.eval()
|
||||||
loss = model(input_ids, token_type_ids, input_mask, sequence_labels)
|
loss = model(input_ids, token_type_ids, input_mask, sequence_labels)
|
||||||
seq_relationship_score = model(input_ids, token_type_ids, input_mask)
|
seq_relationship_score = model(input_ids, token_type_ids, input_mask)
|
||||||
outputs = {
|
outputs = {
|
||||||
@@ -165,6 +168,7 @@ class BertModelTest(unittest.TestCase):
|
|||||||
|
|
||||||
def create_bert_for_pretraining(self, config, input_ids, token_type_ids, input_mask, sequence_labels, token_labels):
|
def create_bert_for_pretraining(self, config, input_ids, token_type_ids, input_mask, sequence_labels, token_labels):
|
||||||
model = BertForPreTraining(config=config)
|
model = BertForPreTraining(config=config)
|
||||||
|
model.eval()
|
||||||
loss = model(input_ids, token_type_ids, input_mask, token_labels, sequence_labels)
|
loss = model(input_ids, token_type_ids, input_mask, token_labels, sequence_labels)
|
||||||
prediction_scores, seq_relationship_score = model(input_ids, token_type_ids, input_mask)
|
prediction_scores, seq_relationship_score = model(input_ids, token_type_ids, input_mask)
|
||||||
outputs = {
|
outputs = {
|
||||||
@@ -185,6 +189,7 @@ class BertModelTest(unittest.TestCase):
|
|||||||
|
|
||||||
def create_bert_for_question_answering(self, config, input_ids, token_type_ids, input_mask, sequence_labels, token_labels):
|
def create_bert_for_question_answering(self, config, input_ids, token_type_ids, input_mask, sequence_labels, token_labels):
|
||||||
model = BertForQuestionAnswering(config=config)
|
model = BertForQuestionAnswering(config=config)
|
||||||
|
model.eval()
|
||||||
loss = model(input_ids, token_type_ids, input_mask, sequence_labels, sequence_labels)
|
loss = model(input_ids, token_type_ids, input_mask, sequence_labels, sequence_labels)
|
||||||
start_logits, end_logits = model(input_ids, token_type_ids, input_mask)
|
start_logits, end_logits = model(input_ids, token_type_ids, input_mask)
|
||||||
outputs = {
|
outputs = {
|
||||||
@@ -205,6 +210,7 @@ class BertModelTest(unittest.TestCase):
|
|||||||
|
|
||||||
def create_bert_for_sequence_classification(self, config, input_ids, token_type_ids, input_mask, sequence_labels, token_labels):
|
def create_bert_for_sequence_classification(self, config, input_ids, token_type_ids, input_mask, sequence_labels, token_labels):
|
||||||
model = BertForSequenceClassification(config=config, num_labels=self.num_labels)
|
model = BertForSequenceClassification(config=config, num_labels=self.num_labels)
|
||||||
|
model.eval()
|
||||||
loss = model(input_ids, token_type_ids, input_mask, sequence_labels)
|
loss = model(input_ids, token_type_ids, input_mask, sequence_labels)
|
||||||
logits = model(input_ids, token_type_ids, input_mask)
|
logits = model(input_ids, token_type_ids, input_mask)
|
||||||
outputs = {
|
outputs = {
|
||||||
@@ -221,6 +227,7 @@ class BertModelTest(unittest.TestCase):
|
|||||||
|
|
||||||
def create_bert_for_token_classification(self, config, input_ids, token_type_ids, input_mask, sequence_labels, token_labels):
|
def create_bert_for_token_classification(self, config, input_ids, token_type_ids, input_mask, sequence_labels, token_labels):
|
||||||
model = BertForTokenClassification(config=config, num_labels=self.num_labels)
|
model = BertForTokenClassification(config=config, num_labels=self.num_labels)
|
||||||
|
model.eval()
|
||||||
loss = model(input_ids, token_type_ids, input_mask, token_labels)
|
loss = model(input_ids, token_type_ids, input_mask, token_labels)
|
||||||
logits = model(input_ids, token_type_ids, input_mask)
|
logits = model(input_ids, token_type_ids, input_mask)
|
||||||
outputs = {
|
outputs = {
|
||||||
|
|||||||
218
tests/modeling_transfo_xl_test.py
Normal file
@@ -0,0 +1,218 @@
|
|||||||
|
# coding=utf-8
|
||||||
|
# Copyright 2018 The Google AI Language Team Authors.
|
||||||
|
#
|
||||||
|
# Licensed under the Apache License, Version 2.0 (the "License");
|
||||||
|
# you may not use this file except in compliance with the License.
|
||||||
|
# You may obtain a copy of the License at
|
||||||
|
#
|
||||||
|
# http://www.apache.org/licenses/LICENSE-2.0
|
||||||
|
#
|
||||||
|
# Unless required by applicable law or agreed to in writing, software
|
||||||
|
# distributed under the License is distributed on an "AS IS" BASIS,
|
||||||
|
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
|
||||||
|
# See the License for the specific language governing permissions and
|
||||||
|
# limitations under the License.
|
||||||
|
from __future__ import absolute_import
|
||||||
|
from __future__ import division
|
||||||
|
from __future__ import print_function
|
||||||
|
|
||||||
|
import unittest
|
||||||
|
import json
|
||||||
|
import random
|
||||||
|
|
||||||
|
import torch
|
||||||
|
|
||||||
|
from pytorch_pretrained_bert import (TransfoXLConfig, TransfoXLModel, TransfoXLLMHeadModel)
|
||||||
|
|
||||||
|
|
||||||
|
class TransfoXLModelTest(unittest.TestCase):
|
||||||
|
class TransfoXLModelTester(object):
|
||||||
|
|
||||||
|
def __init__(self,
|
||||||
|
parent,
|
||||||
|
batch_size=13,
|
||||||
|
seq_length=7,
|
||||||
|
mem_len=30,
|
||||||
|
clamp_len=15,
|
||||||
|
is_training=True,
|
||||||
|
use_labels=True,
|
||||||
|
vocab_size=99,
|
||||||
|
cutoffs=[10, 50, 80],
|
||||||
|
d_model=32,
|
||||||
|
d_embed=32,
|
||||||
|
n_head=4,
|
||||||
|
d_head=8,
|
||||||
|
d_inner=128,
|
||||||
|
div_val=2,
|
||||||
|
n_layer=5,
|
||||||
|
scope=None,
|
||||||
|
seed=1):
|
||||||
|
self.parent = parent
|
||||||
|
self.batch_size = batch_size
|
||||||
|
self.seq_length = seq_length
|
||||||
|
self.mem_len = mem_len
|
||||||
|
self.clamp_len = clamp_len
|
||||||
|
self.is_training = is_training
|
||||||
|
self.use_labels = use_labels
|
||||||
|
self.vocab_size = vocab_size
|
||||||
|
self.cutoffs = cutoffs
|
||||||
|
self.d_model = d_model
|
||||||
|
self.d_embed = d_embed
|
||||||
|
self.n_head = n_head
|
||||||
|
self.d_head = d_head
|
||||||
|
self.d_inner = d_inner
|
||||||
|
self.div_val = div_val
|
||||||
|
self.n_layer = n_layer
|
||||||
|
self.scope = scope
|
||||||
|
self.seed = seed
|
||||||
|
|
||||||
|
def prepare_config_and_inputs(self):
|
||||||
|
input_ids_1 = TransfoXLModelTest.ids_tensor([self.batch_size, self.seq_length], self.vocab_size)
|
||||||
|
input_ids_2 = TransfoXLModelTest.ids_tensor([self.batch_size, self.seq_length], self.vocab_size)
|
||||||
|
|
||||||
|
lm_labels = None
|
||||||
|
if self.use_labels:
|
||||||
|
lm_labels = TransfoXLModelTest.ids_tensor([self.batch_size, self.seq_length], self.vocab_size)
|
||||||
|
|
||||||
|
config = TransfoXLConfig(
|
||||||
|
vocab_size_or_config_json_file=self.vocab_size,
|
||||||
|
mem_len=self.mem_len,
|
||||||
|
clamp_len=self.clamp_len,
|
||||||
|
cutoffs=self.cutoffs,
|
||||||
|
d_model=self.d_model,
|
||||||
|
d_embed=self.d_embed,
|
||||||
|
n_head=self.n_head,
|
||||||
|
d_head=self.d_head,
|
||||||
|
d_inner=self.d_inner,
|
||||||
|
div_val=self.div_val,
|
||||||
|
n_layer=self.n_layer)
|
||||||
|
|
||||||
|
return (config, input_ids_1, input_ids_2, lm_labels)
|
||||||
|
|
||||||
|
def set_seed(self):
|
||||||
|
random.seed(self.seed)
|
||||||
|
torch.manual_seed(self.seed)
|
||||||
|
|
||||||
|
def create_transfo_xl_model(self, config, input_ids_1, input_ids_2, lm_labels):
|
||||||
|
model = TransfoXLModel(config)
|
||||||
|
model.eval()
|
||||||
|
|
||||||
|
hidden_states_1, mems_1 = model(input_ids_1)
|
||||||
|
hidden_states_2, mems_2 = model(input_ids_2, mems_1)
|
||||||
|
outputs = {
|
||||||
|
"hidden_states_1": hidden_states_1,
|
||||||
|
"mems_1": mems_1,
|
||||||
|
"hidden_states_2": hidden_states_2,
|
||||||
|
"mems_2": mems_2,
|
||||||
|
}
|
||||||
|
return outputs
|
||||||
|
|
||||||
|
def check_transfo_xl_model_output(self, result):
|
||||||
|
self.parent.assertListEqual(
|
||||||
|
list(result["hidden_states_1"].size()),
|
||||||
|
[self.batch_size, self.seq_length, self.d_model])
|
||||||
|
self.parent.assertListEqual(
|
||||||
|
list(result["hidden_states_2"].size()),
|
||||||
|
[self.batch_size, self.seq_length, self.d_model])
|
||||||
|
self.parent.assertListEqual(
|
||||||
|
list(list(mem.size()) for mem in result["mems_1"]),
|
||||||
|
[[self.mem_len, self.batch_size, self.d_model]] * self.n_layer)
|
||||||
|
self.parent.assertListEqual(
|
||||||
|
list(list(mem.size()) for mem in result["mems_2"]),
|
||||||
|
[[self.mem_len, self.batch_size, self.d_model]] * self.n_layer)
|
||||||
|
|
||||||
|
|
||||||
|
def create_transfo_xl_lm_head(self, config, input_ids_1, input_ids_2, lm_labels):
|
||||||
|
model = TransfoXLLMHeadModel(config)
|
||||||
|
model.eval()
|
||||||
|
|
||||||
|
loss_1, mems_1a = model(input_ids_1, target=lm_labels)
|
||||||
|
lm_logits_1, mems_1b = model(input_ids_1)
|
||||||
|
|
||||||
|
loss_2, mems_2a = model(input_ids_2, target=lm_labels, mems=mems_1a)
|
||||||
|
lm_logits_2, mems_2b = model(input_ids_2, mems=mems_1b)
|
||||||
|
|
||||||
|
outputs = {
|
||||||
|
"loss_1": loss_1,
|
||||||
|
"mems_1a": mems_1a,
|
||||||
|
"lm_logits_1": lm_logits_1,
|
||||||
|
"mems_1b": mems_1b,
|
||||||
|
"loss_2": loss_2,
|
||||||
|
"mems_2a": mems_2a,
|
||||||
|
"lm_logits_2": lm_logits_2,
|
||||||
|
"mems_2b": mems_2b,
|
||||||
|
}
|
||||||
|
return outputs
|
||||||
|
|
||||||
|
def check_transfo_xl_lm_head_output(self, result):
|
||||||
|
self.parent.assertListEqual(
|
||||||
|
list(result["loss_1"].size()),
|
||||||
|
[self.batch_size, self.seq_length])
|
||||||
|
self.parent.assertListEqual(
|
||||||
|
list(result["lm_logits_1"].size()),
|
||||||
|
[self.batch_size, self.seq_length, self.vocab_size])
|
||||||
|
self.parent.assertListEqual(
|
||||||
|
list(list(mem.size()) for mem in result["mems_1a"]),
|
||||||
|
[[self.mem_len, self.batch_size, self.d_model]] * self.n_layer)
|
||||||
|
self.parent.assertListEqual(
|
||||||
|
list(list(mem.size()) for mem in result["mems_1b"]),
|
||||||
|
[[self.mem_len, self.batch_size, self.d_model]] * self.n_layer)
|
||||||
|
self.parent.assertListEqual(
|
||||||
|
list(mem[~torch.isnan(mem)].sum() for mem in result["mems_1a"]),
|
||||||
|
list(mem[~torch.isnan(mem)].sum() for mem in result["mems_1b"]))
|
||||||
|
|
||||||
|
self.parent.assertListEqual(
|
||||||
|
list(result["loss_2"].size()),
|
||||||
|
[self.batch_size, self.seq_length])
|
||||||
|
self.parent.assertListEqual(
|
||||||
|
list(result["lm_logits_2"].size()),
|
||||||
|
[self.batch_size, self.seq_length, self.vocab_size])
|
||||||
|
self.parent.assertListEqual(
|
||||||
|
list(list(mem.size()) for mem in result["mems_2a"]),
|
||||||
|
[[self.mem_len, self.batch_size, self.d_model]] * self.n_layer)
|
||||||
|
self.parent.assertListEqual(
|
||||||
|
list(list(mem.size()) for mem in result["mems_2b"]),
|
||||||
|
[[self.mem_len, self.batch_size, self.d_model]] * self.n_layer)
|
||||||
|
self.parent.assertListEqual(
|
||||||
|
list(mem[~torch.isnan(mem)].sum() for mem in result["mems_2a"]),
|
||||||
|
list(mem[~torch.isnan(mem)].sum() for mem in result["mems_2b"]))
|
||||||
|
|
||||||
|
def test_default(self):
|
||||||
|
self.run_tester(TransfoXLModelTest.TransfoXLModelTester(self))
|
||||||
|
|
||||||
|
def test_config_to_json_string(self):
|
||||||
|
config = TransfoXLConfig(vocab_size_or_config_json_file=96, d_embed=37)
|
||||||
|
obj = json.loads(config.to_json_string())
|
||||||
|
self.assertEqual(obj["n_token"], 96)
|
||||||
|
self.assertEqual(obj["d_embed"], 37)
|
||||||
|
|
||||||
|
def run_tester(self, tester):
|
||||||
|
config_and_inputs = tester.prepare_config_and_inputs()
|
||||||
|
|
||||||
|
tester.set_seed()
|
||||||
|
output_result = tester.create_transfo_xl_model(*config_and_inputs)
|
||||||
|
tester.check_transfo_xl_model_output(output_result)
|
||||||
|
|
||||||
|
tester.set_seed()
|
||||||
|
output_result = tester.create_transfo_xl_lm_head(*config_and_inputs)
|
||||||
|
tester.check_transfo_xl_lm_head_output(output_result)
|
||||||
|
|
||||||
|
@classmethod
|
||||||
|
def ids_tensor(cls, shape, vocab_size, rng=None, name=None):
|
||||||
|
"""Creates a random int64 (torch.long) tensor of the given shape with values in [0, vocab_size)."""
|
||||||
|
if rng is None:
|
||||||
|
rng = random.Random()
|
||||||
|
|
||||||
|
total_dims = 1
|
||||||
|
for dim in shape:
|
||||||
|
total_dims *= dim
|
||||||
|
|
||||||
|
values = []
|
||||||
|
for _ in range(total_dims):
|
||||||
|
values.append(rng.randint(0, vocab_size - 1))
|
||||||
|
|
||||||
|
return torch.tensor(data=values, dtype=torch.long).view(shape).contiguous()
|
||||||
|
|
||||||
|
|
||||||
|
if __name__ == "__main__":
|
||||||
|
unittest.main()
|
||||||
56
tests/tokenization_openai_test.py
Normal file
@@ -0,0 +1,56 @@
|
|||||||
|
# coding=utf-8
|
||||||
|
# Copyright 2018 The Google AI Language Team Authors.
|
||||||
|
#
|
||||||
|
# Licensed under the Apache License, Version 2.0 (the "License");
|
||||||
|
# you may not use this file except in compliance with the License.
|
||||||
|
# You may obtain a copy of the License at
|
||||||
|
#
|
||||||
|
# http://www.apache.org/licenses/LICENSE-2.0
|
||||||
|
#
|
||||||
|
# Unless required by applicable law or agreed to in writing, software
|
||||||
|
# distributed under the License is distributed on an "AS IS" BASIS,
|
||||||
|
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
|
||||||
|
# See the License for the specific language governing permissions and
|
||||||
|
# limitations under the License.
|
||||||
|
from __future__ import absolute_import, division, print_function, unicode_literals
|
||||||
|
|
||||||
|
import os
|
||||||
|
import unittest
|
||||||
|
import json
|
||||||
|
|
||||||
|
from pytorch_pretrained_bert.tokenization_openai import OpenAIGPTTokenizer
|
||||||
|
|
||||||
|
|
||||||
|
class OpenAIGPTTokenizationTest(unittest.TestCase):
|
||||||
|
|
||||||
|
def test_full_tokenizer(self):
|
||||||
|
""" Adapted from Sennrich et al. 2015 and https://github.com/rsennrich/subword-nmt """
|
||||||
|
vocab = ["l", "o", "w", "e", "r", "s", "t", "i", "d", "n",
|
||||||
|
"w</w>", "r</w>", "t</w>",
|
||||||
|
"lo", "low", "er</w>",
|
||||||
|
"low</w>", "lowest</w>", "newer</w>", "wider</w>"]
|
||||||
|
vocab_tokens = dict(zip(vocab, range(len(vocab))))
|
||||||
|
merges = ["#version: 0.2", "l o", "lo w", "e r</w>", ""]
|
||||||
|
with open("/tmp/openai_tokenizer_vocab_test.json", "w") as fp:
|
||||||
|
json.dump(vocab_tokens, fp)
|
||||||
|
vocab_file = fp.name
|
||||||
|
with open("/tmp/openai_tokenizer_merges_test.txt", "w") as fp:
|
||||||
|
fp.write("\n".join(merges))
|
||||||
|
merges_file = fp.name
|
||||||
|
|
||||||
|
tokenizer = OpenAIGPTTokenizer(vocab_file, merges_file, special_tokens=["<unk>"])
|
||||||
|
os.remove(vocab_file)
|
||||||
|
os.remove(merges_file)
|
||||||
|
|
||||||
|
text = "lower"
|
||||||
|
bpe_tokens = ["low", "er</w>"]
|
||||||
|
tokens = tokenizer.tokenize(text)
|
||||||
|
self.assertListEqual(tokens, bpe_tokens)
|
||||||
|
|
||||||
|
input_tokens = tokens + ["<unk>"]
|
||||||
|
input_bpe_tokens = [14, 15, 20]
|
||||||
|
self.assertListEqual(
|
||||||
|
tokenizer.convert_tokens_to_ids(input_tokens), input_bpe_tokens)
|
||||||
|
|
||||||
|
if __name__ == '__main__':
|
||||||
|
unittest.main()
|
||||||
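The expected values in `test_full_tokenizer` above (`["low", "er</w>"]` and `[14, 15, 20]`) can be reproduced by hand with a small byte-pair-encoding sketch. The `bpe` helper below is hypothetical and written only for illustration; it is not the `OpenAIGPTTokenizer` implementation. It assumes that merge rules are ranked by their order in the merges file and that special tokens receive IDs immediately after the base vocabulary, which is consistent with the ID 20 the test expects for `<unk>`.

```python
def bpe(word, merges):
    """Greedy BPE sketch: repeatedly apply the best-ranked merge that matches."""
    # Earlier rules in the merges list have higher priority (lower rank).
    ranks = {tuple(rule.split()): i for i, rule in enumerate(merges)}
    # Split into characters and mark the end of the word on the last symbol.
    symbols = list(word[:-1]) + [word[-1] + "</w>"]
    while len(symbols) > 1:
        pairs = [(symbols[i], symbols[i + 1]) for i in range(len(symbols) - 1)]
        best = min(pairs, key=lambda p: ranks.get(p, float("inf")))
        if best not in ranks:
            break  # no remaining adjacent pair has a merge rule
        merged, i = [], 0
        while i < len(symbols):
            if i < len(symbols) - 1 and (symbols[i], symbols[i + 1]) == best:
                merged.append(symbols[i] + symbols[i + 1])
                i += 2
            else:
                merged.append(symbols[i])
                i += 1
        symbols = merged
    return symbols


# Same toy vocabulary and merge rules as the test (version line and
# trailing empty line of the merges file omitted).
vocab = ["l", "o", "w", "e", "r", "s", "t", "i", "d", "n",
         "w</w>", "r</w>", "t</w>",
         "lo", "low", "er</w>",
         "low</w>", "lowest</w>", "newer</w>", "wider</w>"]
merges = ["l o", "lo w", "e r</w>"]

token_to_id = dict(zip(vocab, range(len(vocab))))
# Assumption: special tokens are numbered right after the base vocabulary.
token_to_id["<unk>"] = len(vocab)

tokens = bpe("lower", merges)
ids = [token_to_id[t] for t in tokens + ["<unk>"]]
print(tokens, ids)
```

Tracing the merges for `"lower"`: `l o w e r</w>` becomes `lo w e r</w>`, then `low e r</w>`, then `low er</w>`, after which no ranked pair remains.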
@@ -12,15 +12,17 @@
|
|||||||
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
|
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
|
||||||
# See the License for the specific language governing permissions and
|
# See the License for the specific language governing permissions and
|
||||||
# limitations under the License.
|
# limitations under the License.
|
||||||
from __future__ import absolute_import
|
from __future__ import absolute_import, division, print_function, unicode_literals
|
||||||
from __future__ import division
|
|
||||||
from __future__ import print_function
|
|
||||||
|
|
||||||
import os
|
import os
|
||||||
import unittest
|
import unittest
|
||||||
|
from io import open
|
||||||
|
|
||||||
from pytorch_pretrained_bert.tokenization import (BertTokenizer, BasicTokenizer, WordpieceTokenizer,
|
from pytorch_pretrained_bert.tokenization import (BasicTokenizer,
|
||||||
_is_whitespace, _is_control, _is_punctuation)
|
BertTokenizer,
|
||||||
|
WordpieceTokenizer,
|
||||||
|
_is_control, _is_punctuation,
|
||||||
|
_is_whitespace)
|
||||||
|
|
||||||
|
|
||||||
class TokenizationTest(unittest.TestCase):
|
class TokenizationTest(unittest.TestCase):
|
||||||
@@ -30,7 +32,7 @@ class TokenizationTest(unittest.TestCase):
|
|||||||
"[UNK]", "[CLS]", "[SEP]", "want", "##want", "##ed", "wa", "un", "runn",
|
"[UNK]", "[CLS]", "[SEP]", "want", "##want", "##ed", "wa", "un", "runn",
|
||||||
"##ing", ","
|
"##ing", ","
|
||||||
]
|
]
|
||||||
with open("/tmp/bert_tokenizer_test.txt", "w") as vocab_writer:
|
with open("/tmp/bert_tokenizer_test.txt", "w", encoding='utf-8') as vocab_writer:
|
||||||
vocab_writer.write("".join([x + "\n" for x in vocab_tokens]))
|
vocab_writer.write("".join([x + "\n" for x in vocab_tokens]))
|
||||||
|
|
||||||
vocab_file = vocab_writer.name
|
vocab_file = vocab_writer.name
|
||||||
@@ -49,7 +51,7 @@ class TokenizationTest(unittest.TestCase):
|
|||||||
"[UNK]", "[CLS]", "[SEP]", "want", "##want", "##ed", "wa", "un", "runn",
|
"[UNK]", "[CLS]", "[SEP]", "want", "##want", "##ed", "wa", "un", "runn",
|
||||||
"##ing", ","
|
"##ing", ","
|
||||||
]
|
]
|
||||||
with open("/tmp/bert_tokenizer_test.txt", "w") as vocab_writer:
|
with open("/tmp/bert_tokenizer_test.txt", "w", encoding='utf-8') as vocab_writer:
|
||||||
vocab_writer.write("".join([x + "\n" for x in vocab_tokens]))
|
vocab_writer.write("".join([x + "\n" for x in vocab_tokens]))
|
||||||
vocab_file = vocab_writer.name
|
vocab_file = vocab_writer.name
|
||||||
|
|
||||||
|
|||||||
90
tests/tokenization_transfo_xl_test.py
Normal file
@@ -0,0 +1,90 @@
|
|||||||
|
# coding=utf-8
|
||||||
|
# Copyright 2018 The Google AI Language Team Authors.
|
||||||
|
#
|
||||||
|
# Licensed under the Apache License, Version 2.0 (the "License");
|
||||||
|
# you may not use this file except in compliance with the License.
|
||||||
|
# You may obtain a copy of the License at
|
||||||
|
#
|
||||||
|
# http://www.apache.org/licenses/LICENSE-2.0
|
||||||
|
#
|
||||||
|
# Unless required by applicable law or agreed to in writing, software
|
||||||
|
# distributed under the License is distributed on an "AS IS" BASIS,
|
||||||
|
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
|
||||||
|
# See the License for the specific language governing permissions and
|
||||||
|
# limitations under the License.
|
||||||
|
from __future__ import absolute_import, division, print_function, unicode_literals
|
||||||
|
|
||||||
|
import os
|
||||||
|
import unittest
|
||||||
|
from io import open
|
||||||
|
|
||||||
|
from pytorch_pretrained_bert.tokenization_transfo_xl import (TransfoXLTokenizer,
|
||||||
|
_is_control, _is_punctuation,
|
||||||
|
_is_whitespace)
|
||||||
|
|
||||||
|
|
||||||
|
class TransfoXLTokenizationTest(unittest.TestCase):
|
||||||
|
|
||||||
|
def test_full_tokenizer(self):
|
||||||
|
vocab_tokens = [
|
||||||
|
"<unk>", "[CLS]", "[SEP]", "want", "unwanted", "wa", "un", "running", ","
|
||||||
|
]
|
||||||
|
with open("/tmp/transfo_xl_tokenizer_test.txt", "w", encoding='utf-8') as vocab_writer:
|
||||||
|
vocab_writer.write("".join([x + "\n" for x in vocab_tokens]))
|
||||||
|
vocab_file = vocab_writer.name
|
||||||
|
|
||||||
|
tokenizer = TransfoXLTokenizer(vocab_file=vocab_file, lower_case=True)
|
||||||
|
tokenizer.build_vocab()
|
||||||
|
os.remove(vocab_file)
|
||||||
|
|
||||||
|
tokens = tokenizer.tokenize(u"<unk> UNwant\u00E9d,running")
|
||||||
|
self.assertListEqual(tokens, ["<unk>", "unwanted", ",", "running"])
|
||||||
|
|
||||||
|
self.assertListEqual(
|
||||||
|
tokenizer.convert_tokens_to_ids(tokens), [0, 4, 8, 7])
|
||||||
|
|
||||||
|
def test_full_tokenizer_lower(self):
|
||||||
|
tokenizer = TransfoXLTokenizer(lower_case=True)
|
||||||
|
|
||||||
|
self.assertListEqual(
|
||||||
|
tokenizer.tokenize(u" \tHeLLo!how \n Are yoU? "),
|
||||||
|
["hello", "!", "how", "are", "you", "?"])
|
||||||
|
self.assertListEqual(tokenizer.tokenize(u"H\u00E9llo"), ["hello"])
|
||||||
|
|
||||||
|
def test_full_tokenizer_no_lower(self):
|
||||||
|
tokenizer = TransfoXLTokenizer(lower_case=False)
|
||||||
|
|
||||||
|
self.assertListEqual(
|
||||||
|
tokenizer.tokenize(u" \tHeLLo!how \n Are yoU? "),
|
||||||
|
["HeLLo", "!", "how", "Are", "yoU", "?"])
|
||||||
|
|
||||||
|
def test_is_whitespace(self):
|
||||||
|
self.assertTrue(_is_whitespace(u" "))
|
||||||
|
self.assertTrue(_is_whitespace(u"\t"))
|
||||||
|
self.assertTrue(_is_whitespace(u"\r"))
|
||||||
|
self.assertTrue(_is_whitespace(u"\n"))
|
||||||
|
self.assertTrue(_is_whitespace(u"\u00A0"))
|
||||||
|
|
||||||
|
self.assertFalse(_is_whitespace(u"A"))
|
||||||
|
self.assertFalse(_is_whitespace(u"-"))
|
||||||
|
|
||||||
|
def test_is_control(self):
|
||||||
|
self.assertTrue(_is_control(u"\u0005"))
|
||||||
|
|
||||||
|
self.assertFalse(_is_control(u"A"))
|
||||||
|
self.assertFalse(_is_control(u" "))
|
||||||
|
self.assertFalse(_is_control(u"\t"))
|
||||||
|
self.assertFalse(_is_control(u"\r"))
|
||||||
|
|
||||||
|
def test_is_punctuation(self):
|
||||||
|
self.assertTrue(_is_punctuation(u"-"))
|
||||||
|
self.assertTrue(_is_punctuation(u"$"))
|
||||||
|
self.assertTrue(_is_punctuation(u"`"))
|
||||||
|
self.assertTrue(_is_punctuation(u"."))
|
||||||
|
|
||||||
|
self.assertFalse(_is_punctuation(u"A"))
|
||||||
|
self.assertFalse(_is_punctuation(u" "))
|
||||||
|
|
||||||
|
|
||||||
|
if __name__ == '__main__':
|
||||||
|
unittest.main()
|
||||||
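The lower-casing tests above expect `u"H\u00E9llo"` to tokenize to `["hello"]` and punctuation to be split into separate tokens. The sketch below shows one way such behavior could be obtained with the standard library alone. `strip_accents` and `simple_tokenize` are hypothetical names used for illustration, and this is not claimed to be how `TransfoXLTokenizer` is implemented. It assumes Python 3 string semantics.

```python
import unicodedata


def strip_accents(text):
    """Drop combining marks after canonical decomposition (NFD)."""
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(ch for ch in decomposed if unicodedata.category(ch) != "Mn")


def simple_tokenize(text, lower_case=True):
    """Split on whitespace and emit each punctuation character as its own token."""
    if lower_case:
        text = strip_accents(text.lower())
    tokens, current = [], []
    for ch in text:
        if ch.isspace():
            if current:
                tokens.append("".join(current))
                current = []
        elif unicodedata.category(ch).startswith("P"):
            # Flush the word in progress, then emit the punctuation mark.
            if current:
                tokens.append("".join(current))
                current = []
            tokens.append(ch)
        else:
            current.append(ch)
    if current:
        tokens.append("".join(current))
    return tokens


lowered = simple_tokenize(" \tHeLLo!how  \n Are yoU?  ")
cased = simple_tokenize(" \tHeLLo!how  \n Are yoU?  ", lower_case=False)
accented = simple_tokenize("H\u00E9llo")
print(lowered, cased, accented)
```

Accent stripping is applied only on the lower-casing path here, mirroring the fact that the tests check accent removal only with `lower_case=True`.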