Merge branch 'master' into conditional-generation
@@ -9,7 +9,7 @@ jobs:
|
|||||||
steps:
|
steps:
|
||||||
- checkout
|
- checkout
|
||||||
- run: sudo pip install torch
|
- run: sudo pip install torch
|
||||||
- run: sudo pip install tensorflow==2.0.0-rc0
|
- run: sudo pip install tensorflow
|
||||||
- run: sudo pip install --progress-bar off .
|
- run: sudo pip install --progress-bar off .
|
||||||
- run: sudo pip install pytest codecov pytest-cov
|
- run: sudo pip install pytest codecov pytest-cov
|
||||||
- run: sudo pip install tensorboardX scikit-learn
|
- run: sudo pip install tensorboardX scikit-learn
|
||||||
@@ -38,7 +38,7 @@ jobs:
|
|||||||
parallelism: 1
|
parallelism: 1
|
||||||
steps:
|
steps:
|
||||||
- checkout
|
- checkout
|
||||||
- run: sudo pip install tensorflow==2.0.0-rc0
|
- run: sudo pip install tensorflow
|
||||||
- run: sudo pip install --progress-bar off .
|
- run: sudo pip install --progress-bar off .
|
||||||
- run: sudo pip install pytest codecov pytest-cov
|
- run: sudo pip install pytest codecov pytest-cov
|
||||||
- run: sudo pip install tensorboardX scikit-learn
|
- run: sudo pip install tensorboardX scikit-learn
|
||||||
@@ -65,7 +65,7 @@ jobs:
|
|||||||
- image: circleci/python:2.7
|
- image: circleci/python:2.7
|
||||||
steps:
|
steps:
|
||||||
- checkout
|
- checkout
|
||||||
- run: sudo pip install tensorflow==2.0.0-rc0
|
- run: sudo pip install tensorflow
|
||||||
- run: sudo pip install --progress-bar off .
|
- run: sudo pip install --progress-bar off .
|
||||||
- run: sudo pip install pytest codecov pytest-cov
|
- run: sudo pip install pytest codecov pytest-cov
|
||||||
- run: python -m pytest -sv ./transformers/tests/ --cov
|
- run: python -m pytest -sv ./transformers/tests/ --cov
|
||||||
|
|||||||
22
.github/ISSUE_TEMPLATE/---new-benchmark.md
vendored
Normal file
@@ -0,0 +1,22 @@
|
|||||||
|
---
|
||||||
|
name: "\U0001F5A5 New Benchmark"
|
||||||
|
about: You benchmark a part of this library and would like to share your results
|
||||||
|
title: "[Benchmark]"
|
||||||
|
labels: ''
|
||||||
|
assignees: ''
|
||||||
|
|
||||||
|
---
|
||||||
|
|
||||||
|
# Benchmarking Transformers
|
||||||
|
|
||||||
|
## Benchmark
|
||||||
|
|
||||||
|
Which part of Transformers did you benchmark?
|
||||||
|
|
||||||
|
## Set-up
|
||||||
|
|
||||||
|
What did you run your benchmarks on? Please include details, such as: CPU, GPU? If using multiple GPUs, which parallelization did you use?
|
||||||
|
|
||||||
|
## Results
|
||||||
|
|
||||||
|
Put your results here!
|
||||||
8
.gitignore
vendored
@@ -118,6 +118,9 @@ dmypy.json
|
|||||||
# vscode
|
# vscode
|
||||||
.vscode
|
.vscode
|
||||||
|
|
||||||
|
# Pycharm
|
||||||
|
.idea
|
||||||
|
|
||||||
# TF code
|
# TF code
|
||||||
tensorflow_code
|
tensorflow_code
|
||||||
|
|
||||||
@@ -131,4 +134,7 @@ examples/runs
|
|||||||
|
|
||||||
# data
|
# data
|
||||||
/data
|
/data
|
||||||
serialization_dir
|
serialization_dir
|
||||||
|
|
||||||
|
# emacs
|
||||||
|
*.*~
|
||||||
175
CONTRIBUTING.md
Normal file
@@ -0,0 +1,175 @@
|
|||||||
|
# How to contribute to transformers?
|
||||||
|
|
||||||
|
Everyone is welcome to contribute, and we value everybody's contribution. Code
|
||||||
|
is thus not the only way to help the community. Answering questions, helping
|
||||||
|
others, reaching out and improving the documentation are immensely valuable to
|
||||||
|
the community.
|
||||||
|
|
||||||
|
It also helps us if you spread the word: reference the library from blog posts
|
||||||
|
on the awesome projects it made possible, shout out on Twitter every time it has
|
||||||
|
helped you, or simply star the repo to say "thank you".
|
||||||
|
|
||||||
|
## You can contribute in so many ways!
|
||||||
|
|
||||||
|
There are 4 ways you can contribute to transformers:
|
||||||
|
* Fixing outstanding issues with the existing code;
|
||||||
|
* Implementing new models;
|
||||||
|
* Contributing to the examples or to the documentation;
|
||||||
|
* Submitting issues related to bugs or desired new features.
|
||||||
|
|
||||||
|
*All are equally valuable to the community.*
|
||||||
|
|
||||||
|
## Submitting a new issue or feature request
|
||||||
|
|
||||||
|
Do your best to follow these guidelines when submitting an issue or a feature
|
||||||
|
request. It will make it easier for us to come back to you quickly and with good
|
||||||
|
feedback.
|
||||||
|
|
||||||
|
### Did you find a bug?
|
||||||
|
|
||||||
|
The transformers are robust and reliable thanks to the users who notify us of
|
||||||
|
the problems they encounter. So thank you for reporting an issue.
|
||||||
|
|
||||||
|
First, we would really appreciate it if you could **make sure the bug was not
|
||||||
|
already reported** (use the search bar on Github under Issues).
|
||||||
|
|
||||||
|
Did not find it? :( So we can act quickly on it, please follow these steps:
|
||||||
|
|
||||||
|
* Include your **OS type and version**, the versions of **Python**, **PyTorch** and
|
||||||
|
**Tensorflow** when applicable;
|
||||||
|
* A short, self-contained, code snippet that allows us to reproduce the bug in
|
||||||
|
less than 30s;
|
||||||
|
* Provide the *full* traceback if an exception is raised.
|
||||||
|
|
||||||
|
To get the OS and software versions, execute the following code and copy-paste
|
||||||
|
the output:
|
||||||
|
|
||||||
|
```
|
||||||
|
import platform; print("Platform", platform.platform())
|
||||||
|
import sys; print("Python", sys.version)
|
||||||
|
import torch; print("PyTorch", torch.__version__)
|
||||||
|
import tensorflow; print("Tensorflow", tensorflow.__version__)
|
||||||
|
```
|
||||||
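A minimal sketch, not part of the committed CONTRIBUTING.md: the one-liners above raise `ImportError` when PyTorch or TensorFlow is not installed, although the guidelines only ask for those versions "when applicable". The helper name `collect_versions` is hypothetical; it reports a missing package instead of failing.

```python
import importlib
import platform
import sys


def collect_versions(packages=("torch", "tensorflow")):
    """Return the platform, the Python version and the version of each package.

    Packages that cannot be imported are reported as "not installed".
    """
    info = {"Platform": platform.platform(), "Python": sys.version}
    for name in packages:
        try:
            module = importlib.import_module(name)
        except ImportError:
            info[name] = "not installed"
        else:
            # Fall back to "unknown" if the module exposes no __version__.
            info[name] = getattr(module, "__version__", "unknown")
    return info
```

Printing each key/value pair of `collect_versions()` gives a report that can be pasted into an issue in the same way as the output of the original snippet.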
|
|
||||||
|
### Do you want to implement a new model?
|
||||||
|
|
||||||
|
Awesome! Please provide the following information:
|
||||||
|
|
||||||
|
* Short description of the model and link to the paper;
|
||||||
|
* Link to the implementation if it is open-source;
|
||||||
|
* Link to the model weights if they are available.
|
||||||
|
|
||||||
|
If you are willing to contribute the model yourself, let us know so we can best
|
||||||
|
guide you.
|
||||||
|
|
||||||
|
### Do you want a new feature (that is not a model)?
|
||||||
|
|
||||||
|
A world-class feature request addresses the following points:
|
||||||
|
|
||||||
|
1. Motivation first:
|
||||||
|
* Is it related to a problem/frustration with the library? If so, please explain
|
||||||
|
why. Providing a code snippet that demonstrates the problem is best.
|
||||||
|
* Is it related to something you would need for a project? We'd love to hear
|
||||||
|
about it!
|
||||||
|
* Is it something you worked on and think could benefit the community?
|
||||||
|
Awesome! Tell us what problem it solved for you.
|
||||||
|
2. Write a *full paragraph* describing the feature;
|
||||||
|
3. Provide a **code snippet** that demonstrates its future use;
|
||||||
|
4. In case this is related to a paper, please attach a link;
|
||||||
|
5. Attach any additional information (drawings, screenshots, etc.) you think may help.
|
||||||
|
|
||||||
|
If your issue is well written we're already 80% of the way there by the time you
|
||||||
|
post it.
|
||||||
|
|
||||||
|
## Start contributing! (Pull Requests)
|
||||||
|
|
||||||
|
Before writing code, we strongly advise you to search through the existing PRs or
|
||||||
|
issues to make sure that nobody is already working on the same thing. If you are
|
||||||
|
unsure, it is always a good idea to open an issue to get some feedback.
|
||||||
|
|
||||||
|
You will need basic `git` proficiency to be able to contribute to
|
||||||
|
`transformers`. `git` is not the easiest tool to use but it has the greatest
|
||||||
|
manual. Type `git --help` in a shell and enjoy. If you prefer books, [Pro
|
||||||
|
Git](https://git-scm.com/book/en/v2) is a very good reference.
|
||||||
|
|
||||||
|
Follow these steps to start contributing:
|
||||||
|
|
||||||
|
1. Fork the [repository](https://github.com/huggingface/transformers) by
|
||||||
|
clicking on the 'Fork' button on the repository's page. This creates a copy of the code
|
||||||
|
under your github user account.
|
||||||
|
2. Clone your fork to your local disk, and add the base repository as a remote:
|
||||||
|
|
||||||
|
```bash
|
||||||
|
$ git clone git@github.com:<your Github handle>/transformers.git
|
||||||
|
$ cd transformers
|
||||||
|
$ git remote add upstream git@github.com:huggingface/transformers.git
|
||||||
|
```
|
||||||
|
|
||||||
|
3. Create a new branch to hold your development changes:
|
||||||
|
|
||||||
|
```bash
|
||||||
|
$ git checkout -b a-descriptive-name-for-my-changes
|
||||||
|
```
|
||||||
|
|
||||||
|
**do not** work on the `master` branch.
|
||||||
|
|
||||||
|
4. Set up a development environment by running the following command in a virtual environment:
|
||||||
|
|
||||||
|
```bash
|
||||||
|
$ pip install -r requirements-dev.txt
|
||||||
|
```
|
||||||
|
|
||||||
|
5. Develop the features on your branch. Add changed files using `git add` and
|
||||||
|
then `git commit` to record your changes locally:
|
||||||
|
|
||||||
|
```bash
|
||||||
|
$ git add modified_file.py
|
||||||
|
$ git commit
|
||||||
|
```
|
||||||
|
|
||||||
|
Please write [good commit
|
||||||
|
messages](https://chris.beams.io/posts/git-commit/). It
|
||||||
|
is a good idea to sync your copy of the code with the original repository
|
||||||
|
regularly. This way you can quickly account for changes:
|
||||||
|
|
||||||
|
```bash
|
||||||
|
$ git fetch upstream
|
||||||
|
$ git rebase upstream/master
|
||||||
|
```
|
||||||
|
|
||||||
|
Push the changes to your account using:
|
||||||
|
|
||||||
|
```bash
|
||||||
|
$ git push -u origin a-descriptive-name-for-my-changes
|
||||||
|
```
|
||||||
|
|
||||||
|
6. Once you are satisfied (**and the checklist below is happy too**), go to the
|
||||||
|
webpage of your fork on Github. Click on 'Pull request' to send your changes
|
||||||
|
to the project maintainers for review.
|
||||||
|
|
||||||
|
7. It's ok if maintainers ask you for changes. It happens to core contributors
|
||||||
|
too! So everyone can see the changes in the Pull request, work in your local
|
||||||
|
branch and push the changes to your fork. They will automatically appear in
|
||||||
|
the pull request.
|
||||||
|
|
||||||
|
|
||||||
|
### Checklist
|
||||||
|
|
||||||
|
1. The title of your pull request should be a summary of its contribution;
|
||||||
|
2. If your pull request addresses an issue, please mention the issue number in
|
||||||
|
the pull request description to make sure they are linked (and people
|
||||||
|
consulting the issue know you are working on it);
|
||||||
|
3. To indicate a work in progress please prefix the title with `[WIP]`. These
|
||||||
|
are useful to avoid duplicated work, and to differentiate it from PRs ready
|
||||||
|
to be merged;
|
||||||
|
4. Make sure pre-existing tests still pass;
|
||||||
|
5. Add high-coverage tests. No quality test, no merge;
|
||||||
|
6. All public methods must have informative docstrings.
|
||||||
|
|
||||||
|
|
||||||
|
### Style guide
|
||||||
|
|
||||||
|
For documentation strings, `transformers` follows the [google
|
||||||
|
style](https://google.github.io/styleguide/pyguide.html).
|
||||||
|
|
||||||
|
#### This guide was heavily inspired by the awesome [scikit-learn guide to contributing](https://github.com/scikit-learn/scikit-learn/blob/master/CONTRIBUTING.md)
|
||||||
47
README.md
@@ -22,7 +22,7 @@
|
|||||||
<p>State-of-the-art Natural Language Processing for TensorFlow 2.0 and PyTorch
|
<p>State-of-the-art Natural Language Processing for TensorFlow 2.0 and PyTorch
|
||||||
</h3>
|
</h3>
|
||||||
|
|
||||||
🤗 Transformers (formerly known as `pytorch-transformers` and `pytorch-pretrained-bert`) provides state-of-the-art general-purpose architectures (BERT, GPT-2, RoBERTa, XLM, DistilBert, XLNet...) for Natural Language Understanding (NLU) and Natural Language Generation (NLG) with over 32+ pretrained models in 100+ languages and deep interoperability between TensorFlow 2.0 and PyTorch.
|
🤗 Transformers (formerly known as `pytorch-transformers` and `pytorch-pretrained-bert`) provides state-of-the-art general-purpose architectures (BERT, GPT-2, RoBERTa, XLM, DistilBert, XLNet, CTRL...) for Natural Language Understanding (NLU) and Natural Language Generation (NLG) with over 32+ pretrained models in 100+ languages and deep interoperability between TensorFlow 2.0 and PyTorch.
|
||||||
|
|
||||||
### Features
|
### Features
|
||||||
|
|
||||||
@@ -39,7 +39,7 @@ State-of-the-art NLP for everyone
|
|||||||
Lower compute costs, smaller carbon footprint
|
Lower compute costs, smaller carbon footprint
|
||||||
- Researchers can share trained models instead of always retraining
|
- Researchers can share trained models instead of always retraining
|
||||||
- Practitioners can reduce compute time and production costs
|
- Practitioners can reduce compute time and production costs
|
||||||
- 8 architectures with over 30 pretrained models, some in more than 100 languages
|
- 10 architectures with over 30 pretrained models, some in more than 100 languages
|
||||||
|
|
||||||
Choose the right framework for every part of a model's lifetime
|
Choose the right framework for every part of a model's lifetime
|
||||||
- Train state-of-the-art models in 3 lines of code
|
- Train state-of-the-art models in 3 lines of code
|
||||||
@@ -56,7 +56,7 @@ Choose the right framework for every part of a model's lifetime
|
|||||||
| [Quick tour: Usage](#quick-tour) | Tokenizers & models usage: Bert and GPT-2 |
|
| [Quick tour: Usage](#quick-tour) | Tokenizers & models usage: Bert and GPT-2 |
|
||||||
| [Quick tour: TF 2.0 and PyTorch ](#Quick-tour-TF-20-training-and-PyTorch-interoperability) | Train a TF 2.0 model in 10 lines of code, load it in PyTorch |
|
| [Quick tour: TF 2.0 and PyTorch ](#Quick-tour-TF-20-training-and-PyTorch-interoperability) | Train a TF 2.0 model in 10 lines of code, load it in PyTorch |
|
||||||
| [Quick tour: Fine-tuning/usage scripts](#quick-tour-of-the-fine-tuningusage-scripts) | Using provided scripts: GLUE, SQuAD and Text generation |
|
| [Quick tour: Fine-tuning/usage scripts](#quick-tour-of-the-fine-tuningusage-scripts) | Using provided scripts: GLUE, SQuAD and Text generation |
|
||||||
| [Migrating from pytorch-transformers to transformers](#Migrating-from-pytorch-transformers-to-transformers) | Migrating your code from pytorch-pretrained-bert to transformers |
|
| [Migrating from pytorch-transformers to transformers](#Migrating-from-pytorch-transformers-to-transformers) | Migrating your code from pytorch-transformers to transformers |
|
||||||
| [Migrating from pytorch-pretrained-bert to pytorch-transformers](#Migrating-from-pytorch-pretrained-bert-to-transformers) | Migrating your code from pytorch-pretrained-bert to transformers |
|
| [Migrating from pytorch-pretrained-bert to pytorch-transformers](#Migrating-from-pytorch-pretrained-bert-to-transformers) | Migrating your code from pytorch-pretrained-bert to transformers |
|
||||||
| [Documentation](https://huggingface.co/transformers/) | Full API documentation and more |
|
| [Documentation](https://huggingface.co/transformers/) | Full API documentation and more |
|
||||||
|
|
||||||
@@ -105,13 +105,13 @@ python -m pytest -sv ./examples/
|
|||||||
|
|
||||||
You should check out our [`swift-coreml-transformers`](https://github.com/huggingface/swift-coreml-transformers) repo.
|
You should check out our [`swift-coreml-transformers`](https://github.com/huggingface/swift-coreml-transformers) repo.
|
||||||
|
|
||||||
It contains an example of a conversion script from a Pytorch trained Transformer model (here, `GPT-2`) to a CoreML model that runs on iOS devices.
|
It contains a set of tools to convert PyTorch or TensorFlow 2.0 trained Transformer models (currently contains `GPT-2`, `DistilGPT-2`, `BERT`, and `DistilBERT`) to CoreML models that run on iOS devices.
|
||||||
|
|
||||||
At some point in the future, you'll be able to seamlessly move from pre-training or fine-tuning models to productizing them in CoreML, or prototype a model or an app in CoreML then research its hyperparameters or architecture from TensorFlow 2.0 and/or PyTorch. Super exciting!
|
At some point in the future, you'll be able to seamlessly move from pre-training or fine-tuning models to productizing them in CoreML, or prototype a model or an app in CoreML then research its hyperparameters or architecture from TensorFlow 2.0 and/or PyTorch. Super exciting!
|
||||||
|
|
||||||
## Model architectures
|
## Model architectures
|
||||||
|
|
||||||
🤗 Transformers currently provides 8 NLU/NLG architectures:
|
🤗 Transformers currently provides 10 NLU/NLG architectures:
|
||||||
|
|
||||||
1. **[BERT](https://github.com/google-research/bert)** (from Google) released with the paper [BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding](https://arxiv.org/abs/1810.04805) by Jacob Devlin, Ming-Wei Chang, Kenton Lee and Kristina Toutanova.
|
1. **[BERT](https://github.com/google-research/bert)** (from Google) released with the paper [BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding](https://arxiv.org/abs/1810.04805) by Jacob Devlin, Ming-Wei Chang, Kenton Lee and Kristina Toutanova.
|
||||||
2. **[GPT](https://github.com/openai/finetune-transformer-lm)** (from OpenAI) released with the paper [Improving Language Understanding by Generative Pre-Training](https://blog.openai.com/language-unsupervised/) by Alec Radford, Karthik Narasimhan, Tim Salimans and Ilya Sutskever.
|
2. **[GPT](https://github.com/openai/finetune-transformer-lm)** (from OpenAI) released with the paper [Improving Language Understanding by Generative Pre-Training](https://blog.openai.com/language-unsupervised/) by Alec Radford, Karthik Narasimhan, Tim Salimans and Ilya Sutskever.
|
||||||
@@ -121,6 +121,7 @@ At some point in the future, you'll be able to seamlessly move from pre-training
|
|||||||
6. **[XLM](https://github.com/facebookresearch/XLM/)** (from Facebook) released together with the paper [Cross-lingual Language Model Pretraining](https://arxiv.org/abs/1901.07291) by Guillaume Lample and Alexis Conneau.
|
6. **[XLM](https://github.com/facebookresearch/XLM/)** (from Facebook) released together with the paper [Cross-lingual Language Model Pretraining](https://arxiv.org/abs/1901.07291) by Guillaume Lample and Alexis Conneau.
|
||||||
7. **[RoBERTa](https://github.com/pytorch/fairseq/tree/master/examples/roberta)** (from Facebook), released together with the paper a [Robustly Optimized BERT Pretraining Approach](https://arxiv.org/abs/1907.11692) by Yinhan Liu, Myle Ott, Naman Goyal, Jingfei Du, Mandar Joshi, Danqi Chen, Omer Levy, Mike Lewis, Luke Zettlemoyer, Veselin Stoyanov.
|
7. **[RoBERTa](https://github.com/pytorch/fairseq/tree/master/examples/roberta)** (from Facebook), released together with the paper a [Robustly Optimized BERT Pretraining Approach](https://arxiv.org/abs/1907.11692) by Yinhan Liu, Myle Ott, Naman Goyal, Jingfei Du, Mandar Joshi, Danqi Chen, Omer Levy, Mike Lewis, Luke Zettlemoyer, Veselin Stoyanov.
|
||||||
8. **[DistilBERT](https://github.com/huggingface/transformers/tree/master/examples/distillation)** (from HuggingFace), released together with the paper [DistilBERT, a distilled version of BERT: smaller, faster, cheaper and lighter](https://arxiv.org/abs/1910.01108) by Victor Sanh, Lysandre Debut and Thomas Wolf. The same method has been applied to compress GPT2 into [DistilGPT2](https://github.com/huggingface/transformers/tree/master/examples/distillation).
|
8. **[DistilBERT](https://github.com/huggingface/transformers/tree/master/examples/distillation)** (from HuggingFace), released together with the paper [DistilBERT, a distilled version of BERT: smaller, faster, cheaper and lighter](https://arxiv.org/abs/1910.01108) by Victor Sanh, Lysandre Debut and Thomas Wolf. The same method has been applied to compress GPT2 into [DistilGPT2](https://github.com/huggingface/transformers/tree/master/examples/distillation).
|
||||||
|
9. **[CTRL](https://github.com/salesforce/ctrl/)** (from Salesforce) released with the paper [CTRL: A Conditional Transformer Language Model for Controllable Generation](https://arxiv.org/abs/1909.05858) by Nitish Shirish Keskar*, Bryan McCann*, Lav R. Varshney, Caiming Xiong and Richard Socher.
|
||||||
|
|
||||||
These implementations have been tested on several datasets (see the example scripts) and should match the performances of the original implementations (e.g. ~93 F1 on SQuAD for BERT Whole-Word-Masking, ~88 F1 on RocStories for OpenAI GPT, ~18.3 perplexity on WikiText 103 for Transformer-XL, ~0.916 Peason R coefficient on STS-B for XLNet). You can find more details on the performances in the Examples section of the [documentation](https://huggingface.co/transformers/examples.html).
|
These implementations have been tested on several datasets (see the example scripts) and should match the performances of the original implementations (e.g. ~93 F1 on SQuAD for BERT Whole-Word-Masking, ~88 F1 on RocStories for OpenAI GPT, ~18.3 perplexity on WikiText 103 for Transformer-XL, ~0.916 Peason R coefficient on STS-B for XLNet). You can find more details on the performances in the Examples section of the [documentation](https://huggingface.co/transformers/examples.html).
|
||||||
|
|
||||||
@@ -147,6 +148,7 @@ from transformers import *
|
|||||||
MODELS = [(BertModel, BertTokenizer, 'bert-base-uncased'),
|
MODELS = [(BertModel, BertTokenizer, 'bert-base-uncased'),
|
||||||
(OpenAIGPTModel, OpenAIGPTTokenizer, 'openai-gpt'),
|
(OpenAIGPTModel, OpenAIGPTTokenizer, 'openai-gpt'),
|
||||||
(GPT2Model, GPT2Tokenizer, 'gpt2'),
|
(GPT2Model, GPT2Tokenizer, 'gpt2'),
|
||||||
|
(CTRLModel, CTRLTokenizer, 'ctrl'),
|
||||||
(TransfoXLModel, TransfoXLTokenizer, 'transfo-xl-wt103'),
|
(TransfoXLModel, TransfoXLTokenizer, 'transfo-xl-wt103'),
|
||||||
(XLNetModel, XLNetTokenizer, 'xlnet-base-cased'),
|
(XLNetModel, XLNetTokenizer, 'xlnet-base-cased'),
|
||||||
(XLMModel, XLMTokenizer, 'xlm-mlm-enfr-1024'),
|
(XLMModel, XLMTokenizer, 'xlm-mlm-enfr-1024'),
|
||||||
@@ -174,10 +176,11 @@ BERT_MODEL_CLASSES = [BertModel, BertForPreTraining, BertForMaskedLM, BertForNex
|
|||||||
# All the classes for an architecture can be initiated from pretrained weights for this architecture
|
# All the classes for an architecture can be initiated from pretrained weights for this architecture
|
||||||
# Note that additional weights added for fine-tuning are only initialized
|
# Note that additional weights added for fine-tuning are only initialized
|
||||||
# and need to be trained on the down-stream task
|
# and need to be trained on the down-stream task
|
||||||
tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
|
pretrained_weights = 'bert-base-uncased'
|
||||||
|
tokenizer = BertTokenizer.from_pretrained(pretrained_weights)
|
||||||
for model_class in BERT_MODEL_CLASSES:
|
for model_class in BERT_MODEL_CLASSES:
|
||||||
# Load pretrained model/tokenizer
|
# Load pretrained model/tokenizer
|
||||||
model = model_class.from_pretrained('bert-base-uncased')
|
model = model_class.from_pretrained(pretrained_weights)
|
||||||
|
|
||||||
# Models can return full list of hidden-states & attentions weights at each layer
|
# Models can return full list of hidden-states & attentions weights at each layer
|
||||||
model = model_class.from_pretrained(pretrained_weights,
|
model = model_class.from_pretrained(pretrained_weights,
|
||||||
@@ -240,8 +243,9 @@ sentence_2 = "His findings were not compatible with this research."
|
|||||||
inputs_1 = tokenizer.encode_plus(sentence_0, sentence_1, add_special_tokens=True, return_tensors='pt')
|
inputs_1 = tokenizer.encode_plus(sentence_0, sentence_1, add_special_tokens=True, return_tensors='pt')
|
||||||
inputs_2 = tokenizer.encode_plus(sentence_0, sentence_2, add_special_tokens=True, return_tensors='pt')
|
inputs_2 = tokenizer.encode_plus(sentence_0, sentence_2, add_special_tokens=True, return_tensors='pt')
|
||||||
|
|
||||||
pred_1 = pytorch_model(**inputs_1)[0].argmax().item()
|
pred_1 = pytorch_model(inputs_1['input_ids'], token_type_ids=inputs_1['token_type_ids'])[0].argmax().item()
|
||||||
pred_2 = pytorch_model(**inputs_2)[0].argmax().item()
|
pred_2 = pytorch_model(inputs_2['input_ids'], token_type_ids=inputs_2['token_type_ids'])[0].argmax().item()
|
||||||
|
|
||||||
print("sentence_1 is", "a paraphrase" if pred_1 else "not a paraphrase", "of sentence_0")
|
print("sentence_1 is", "a paraphrase" if pred_1 else "not a paraphrase", "of sentence_0")
|
||||||
print("sentence_2 is", "a paraphrase" if pred_2 else "not a paraphrase", "of sentence_0")
|
print("sentence_2 is", "a paraphrase" if pred_2 else "not a paraphrase", "of sentence_0")
|
||||||
```
|
```
|
||||||
@@ -252,7 +256,7 @@ The library comprises several example scripts with SOTA performances for NLU and
|
|||||||
|
|
||||||
- `run_glue.py`: an example fine-tuning Bert, XLNet and XLM on nine different GLUE tasks (*sequence-level classification*)
|
- `run_glue.py`: an example fine-tuning Bert, XLNet and XLM on nine different GLUE tasks (*sequence-level classification*)
|
||||||
- `run_squad.py`: an example fine-tuning Bert, XLNet and XLM on the question answering dataset SQuAD 2.0 (*token-level classification*)
|
- `run_squad.py`: an example fine-tuning Bert, XLNet and XLM on the question answering dataset SQuAD 2.0 (*token-level classification*)
|
||||||
- `run_generation.py`: an example using GPT, GPT-2, Transformer-XL and XLNet for conditional language generation
|
- `run_generation.py`: an example using GPT, GPT-2, CTRL, Transformer-XL and XLNet for conditional language generation
|
||||||
- other model-specific examples (see the documentation).
|
- other model-specific examples (see the documentation).
|
||||||
|
|
||||||
Here are three quick usage examples for these scripts:
|
Here are three quick usage examples for these scripts:
|
||||||
@@ -390,7 +394,7 @@ python $SQUAD_DIR/evaluate-v1.1.py $SQUAD_DIR/dev-v1.1.json ../models/wwm_uncase
|
|||||||
|
|
||||||
This is the model provided as `bert-large-uncased-whole-word-masking-finetuned-squad`.
|
This is the model provided as `bert-large-uncased-whole-word-masking-finetuned-squad`.
|
||||||
|
|
||||||
### `run_generation.py`: Text generation with GPT, GPT-2, Transformer-XL and XLNet
|
### `run_generation.py`: Text generation with GPT, GPT-2, CTRL, Transformer-XL and XLNet
|
||||||
|
|
||||||
A conditional generation script is also included to generate text from a prompt.
|
A conditional generation script is also included to generate text from a prompt.
|
||||||
The generation script includes the [tricks](https://github.com/rusiaaman/XLNet-gen#methodology) proposed by Aman Rusia to get high-quality generation with memory models like Transformer-XL and XLNet (include a predefined text to make short inputs longer).
|
The generation script includes the [tricks](https://github.com/rusiaaman/XLNet-gen#methodology) proposed by Aman Rusia to get high-quality generation with memory models like Transformer-XL and XLNet (include a predefined text to make short inputs longer).
|
||||||
@@ -404,6 +408,16 @@ python ./examples/run_generation.py \
|
|||||||
--model_name_or_path=gpt2 \
|
--model_name_or_path=gpt2 \
|
||||||
```
|
```
|
||||||
|
|
||||||
|
and from the Salesforce CTRL model:
|
||||||
|
```shell
|
||||||
|
python ./examples/run_generation.py \
|
||||||
|
--model_type=ctrl \
|
||||||
|
--length=20 \
|
||||||
|
--model_name_or_path=ctrl \
|
||||||
|
--temperature=0 \
|
||||||
|
--repetition_penalty=1.2
|
||||||
|
```
|
||||||
|
|
||||||
## Migrating from pytorch-transformers to transformers
|
## Migrating from pytorch-transformers to transformers
|
||||||
|
|
||||||
Here is a quick summary of what you should take care of when migrating from `pytorch-transformers` to `transformers`.
|
Here is a quick summary of what you should take care of when migrating from `pytorch-transformers` to `transformers`.
|
||||||
@@ -533,4 +547,13 @@ for batch in train_data:
|
|||||||
|
|
||||||
## Citation
|
## Citation
|
||||||
|
|
||||||
At the moment, there is no paper associated with Transformers but we are working on preparing one. In the meantime, please include a mention of the library and a link to the present repository if you use this work in a published or open-source project.
|
We now have a paper you can cite for the 🤗 Transformers library:
|
||||||
|
```
|
||||||
|
@article{Wolf2019HuggingFacesTS,
|
||||||
|
title={HuggingFace's Transformers: State-of-the-art Natural Language Processing},
|
||||||
|
author={Thomas Wolf and Lysandre Debut and Victor Sanh and Julien Chaumond and Clement Delangue and Anthony Moi and Pierric Cistac and Tim Rault and R{\'e}mi Louf and Morgan Funtowicz and Jamie Brew},
|
||||||
|
journal={ArXiv},
|
||||||
|
year={2019},
|
||||||
|
volume={abs/1910.03771}
|
||||||
|
}
|
||||||
|
```
|
||||||
|
|||||||
@@ -50,7 +50,7 @@ make html
|
|||||||
---
|
---
|
||||||
**NOTE**
|
**NOTE**
|
||||||
|
|
||||||
If you are adding/removing elements from the toc-tree or from any strutural item, it is recommended to clean the build
|
If you are adding/removing elements from the toc-tree or from any structural item, it is recommended to clean the build
|
||||||
directory before rebuilding. Run the following command to clean and build:
|
directory before rebuilding. Run the following command to clean and build:
|
||||||
|
|
||||||
```bash
|
```bash
|
||||||
|
|||||||
54
docs/source/benchmarks.md
Normal file
@@ -0,0 +1,54 @@
|
|||||||
|
# Benchmarks
|
||||||
|
|
||||||
|
This section is dedicated to the benchmarks run on the library by maintainers, contributors and users. These
|
||||||
|
benchmarks will help keep track of the performance improvements that are brought to our models across versions.
|
||||||
|
|
||||||
|
## Benchmarking all models for inference
|
||||||
|
|
||||||
|
As of version 2.1 we have benchmarked all models for inference, across many different settings: using PyTorch, with
|
||||||
|
and without TorchScript, using TensorFlow, with and without XLA. All of those tests were done across CPUs (except for
|
||||||
|
TensorFlow XLA) and GPUs.
|
||||||
|
|
||||||
|
The approach is detailed in the [following blogpost](https://medium.com/huggingface/benchmarking-transformers-pytorch-and-tensorflow-e2917fb891c2).
|
||||||
|
|
||||||
|
The results are available [here](https://docs.google.com/spreadsheets/d/1sryqufw2D0XlUH4sq3e9Wnxu5EAQkaohzrJbd5HdQ_w/edit?usp=sharing).
|
||||||
|
|
||||||
|
## TF2 with mixed precision, XLA, Distribution (@tlkh)
|
||||||
|
|
||||||
|
This work was done by [Timothy Liu](https://github.com/tlkh).
|
||||||
|
|
||||||
|
There are very positive results to be gained from the various TensorFlow 2.0 features:
|
||||||
|
|
||||||
|
- Automatic Mixed Precision (AMP)
|
||||||
|
- XLA compiler
|
||||||
|
- Distribution strategies (multi-GPU)
|
||||||
|
|
||||||
|
The benefits are listed here (tested on CoLA, MRPC, SST-2):
|
||||||
|
|
||||||
|
- AMP: Between 1.4x and 1.6x decrease in overall time without change in batch size
|
||||||
|
- AMP+XLA: Up to 2.5x decrease in overall time on SST-2 (larger dataset)
|
||||||
|
- Distribution: Between 1.4x and 3.4x decrease in overall time on 4xV100
|
||||||
|
- Combined: Up to 5.7x decrease in overall training time, or 9.1x training throughput
|
||||||
|
|
||||||
|
The model quality (measured by the validation accuracy) fluctuates slightly. Taking an average of 4 training runs
|
||||||
|
on a single GPU gives the following results:
|
||||||
|
|
||||||
|
- CoLA: AMP results in slightly lower acc (0.820 vs 0.824)
|
||||||
|
- MRPC: AMP results in lower acc (0.823 vs 0.835)
|
||||||
|
- SST-2: AMP results in slightly lower acc (0.918 vs 0.922)
|
||||||
|
|
||||||
|
However, in a distributed setting with 4xV100 (4x batch size), AMP can yield better results:
|
||||||
|
|
||||||
|
- CoLA: AMP results in higher acc (0.828 vs 0.812)
|
||||||
|
- MRPC: AMP results in lower acc (0.817 vs 0.827)
|
||||||
|
- SST-2: AMP results in slightly lower acc (0.926 vs 0.929)
|
||||||
|
|
||||||
|
The benchmark script is available [here](https://github.com/NVAITC/benchmarking/blob/master/tf2/bert_dist.py).
|
||||||
|
|
||||||
|
Note: on some tasks (e.g. MRPC), the dataset is too small. The overhead due to the model compilation with XLA as well
|
||||||
|
as the distribution strategy setup cancels out the gains. The XLA compile time is also the reason why, although throughput
|
||||||
|
can increase a lot (e.g. 2.7x for single GPU), the overall (end-to-end) training speed-up is smaller (as low as 1.4x).
|
||||||
|
|
||||||
|
The benefits seen on SST-2 (a larger dataset) are much clearer.
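
The gap between throughput gain and end-to-end speed-up can be illustrated with a simplified model. It assumes the baseline run has no compile cost and the accelerated run pays a fixed one-off overhead (XLA compilation, distribution setup) expressed as a fraction of the baseline run time. Both the model and the function names are illustrative assumptions; they are not part of the benchmark script:

```python
def end_to_end_speedup(throughput_gain, overhead_fraction):
    """Overall speed-up when steady-state work runs `throughput_gain` times
    faster but a fixed overhead (fraction of baseline time) is added."""
    accelerated_time = overhead_fraction + 1.0 / throughput_gain
    return 1.0 / accelerated_time


def implied_overhead_fraction(throughput_gain, observed_speedup):
    """Fixed overhead (fraction of baseline time) that would explain the
    observed end-to-end speed-up under the simplified model."""
    return 1.0 / observed_speedup - 1.0 / throughput_gain


# Single-GPU figures quoted in the note: 2.7x throughput, 1.4x end-to-end.
overhead = implied_overhead_fraction(2.7, 1.4)
print(f"implied fixed overhead: {overhead:.1%} of the baseline run time")
print(f"check: {end_to_end_speedup(2.7, overhead):.2f}x end-to-end")
```

Under this model, a fixed cost of roughly a third of the baseline run time is enough to turn a 2.7x throughput gain into a 1.4x end-to-end gain, which is why the effect shrinks as the dataset (and therefore the run time) grows.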
|
||||||
|
|
||||||
|
All results can be seen on this [Google Sheet](https://docs.google.com/spreadsheets/d/1538MN224EzjbRL239sqSiUy6YY-rAjHyXhTzz_Zptls/edit#gid=960868445).
|
||||||
@@ -26,7 +26,7 @@ author = u'huggingface'
|
|||||||
# The short X.Y version
|
# The short X.Y version
|
||||||
version = u''
|
version = u''
|
||||||
# The full version, including alpha/beta/rc tags
|
# The full version, including alpha/beta/rc tags
|
||||||
release = u'2.0.0'
|
release = u'2.1.1'
|
||||||
|
|
||||||
|
|
||||||
# -- General configuration ---------------------------------------------------
|
# -- General configuration ---------------------------------------------------
|
||||||
|
|||||||
@@ -62,6 +62,8 @@ The library currently contains PyTorch and Tensorflow implementations, pre-train
|
|||||||
migration
|
migration
|
||||||
bertology
|
bertology
|
||||||
torchscript
|
torchscript
|
||||||
|
multilingual
|
||||||
|
benchmarks
|
||||||
|
|
||||||
.. toctree::
|
.. toctree::
|
||||||
:maxdepth: 2
|
:maxdepth: 2
|
||||||
@@ -86,3 +88,4 @@ The library currently contains PyTorch and Tensorflow implementations, pre-train
|
|||||||
model_doc/xlnet
|
model_doc/xlnet
|
||||||
model_doc/roberta
|
model_doc/roberta
|
||||||
model_doc/distilbert
|
model_doc/distilbert
|
||||||
|
model_doc/ctrl
|
||||||
|
|||||||
58
docs/source/installation.md
Normal file
@@ -0,0 +1,58 @@
|
|||||||
|
# Installation
|
||||||
|
|
||||||
|
Transformers is tested on Python 2.7 and 3.5+ (examples are tested only on Python 3.5+) and PyTorch 1.1.0.
|
||||||
|
|
||||||
|
## With pip
|
||||||
|
|
||||||
|
Transformers can be installed using pip as follows:
|
||||||
|
|
||||||
|
``` bash
|
||||||
|
pip install transformers
|
||||||
|
```
|
||||||
|
|
||||||
|
## From source
|
||||||
|
|
||||||
|
To install from source, clone the repository and install with:
|
||||||
|
|
||||||
|
``` bash
|
||||||
|
git clone https://github.com/huggingface/transformers.git
|
||||||
|
cd transformers
|
||||||
|
pip install [--editable] .
|
||||||
|
```
|
||||||
|
|
||||||
|
## Tests
|
||||||
|
|
||||||
|
An extensive test suite is included to test the library behavior and several examples. Library tests can be found in the [tests folder](https://github.com/huggingface/transformers/tree/master/transformers/tests) and examples tests in the [examples folder](https://github.com/huggingface/transformers/tree/master/examples).
|
||||||
|
|
||||||
|
Tests can be run using `pytest` (install pytest if needed with `pip install pytest`).
|
||||||
|
|
||||||
|
Run all the tests from the root of the cloned repository with the commands:
|
||||||
|
|
||||||
|
``` bash
|
||||||
|
python -m pytest -sv ./transformers/tests/
|
||||||
|
python -m pytest -sv ./examples/
|
||||||
|
```
|
||||||
|
|
||||||
|
## OpenAI GPT original tokenization workflow
|
||||||
|
|
||||||
|
If you want to reproduce the original tokenization process of the `OpenAI GPT` paper, you will need to install `ftfy` (use version 4.4.3 if you are using Python 2) and `SpaCy`:
|
||||||
|
|
||||||
|
``` bash
|
||||||
|
pip install spacy ftfy==4.4.3
|
||||||
|
python -m spacy download en
|
||||||
|
```
|
||||||
|
|
||||||
|
If you don't install `ftfy` and `SpaCy`, the `OpenAI GPT` tokenizer will default to tokenizing with BERT's `BasicTokenizer` followed by Byte-Pair Encoding (which should be fine for most usage, don't worry).
|
||||||
|
|
||||||
|
## Note on model downloads (Continuous Integration or large-scale deployments)
|
||||||
|
|
||||||
|
If you expect to be downloading large volumes of models (more than 1,000) from our hosted bucket (for instance through your CI setup, or a large-scale production deployment), please cache the model files on your end. It will be way faster, and cheaper. Feel free to contact us privately if you need any help.
|
||||||
|
|
||||||
|
## Do you want to run a Transformer model on a mobile device?
|
||||||
|
|
||||||
|
You should check out our [swift-coreml-transformers](https://github.com/huggingface/swift-coreml-transformers) repo.
|
||||||
|
|
||||||
|
It contains a set of tools to convert PyTorch or TensorFlow 2.0 trained Transformer models (currently contains `GPT-2`, `DistilGPT-2`, `BERT`, and `DistilBERT`) to CoreML models that run on iOS devices.
|
||||||
|
|
||||||
|
At some point in the future, you'll be able to seamlessly move from pre-training or fine-tuning models in PyTorch to productizing them in CoreML,
|
||||||
|
or prototype a model or an app in CoreML then research its hyperparameters or architecture from PyTorch. Super exciting!
|
||||||
@@ -1,71 +0,0 @@
|
|||||||
Installation
|
|
||||||
================================================
|
|
||||||
|
|
||||||
Transformers is tested on Python 2.7 and 3.5+ (examples are tested only on python 3.5+) and PyTorch 1.1.0
|
|
||||||
|
|
||||||
With pip
|
|
||||||
^^^^^^^^
|
|
||||||
|
|
||||||
PyTorch Transformers can be installed using pip as follows:
|
|
||||||
|
|
||||||
.. code-block:: bash
|
|
||||||
|
|
||||||
pip install transformers
|
|
||||||
|
|
||||||
From source
|
|
||||||
^^^^^^^^^^^
|
|
||||||
|
|
||||||
To install from source, clone the repository and install with:
|
|
||||||
|
|
||||||
.. code-block:: bash
|
|
||||||
|
|
||||||
git clone https://github.com/huggingface/transformers.git
|
|
||||||
cd transformers
|
|
||||||
pip install [--editable] .
|
|
||||||
|
|
||||||
|
|
||||||
Tests
|
|
||||||
^^^^^
|
|
||||||
|
|
||||||
An extensive test suite is included to test the library behavior and several examples. Library tests can be found in the `tests folder <https://github.com/huggingface/transformers/tree/master/transformers/tests>`_ and examples tests in the `examples folder <https://github.com/huggingface/transformers/tree/master/examples>`_.
|
|
||||||
|
|
||||||
Tests can be run using `pytest` (install pytest if needed with `pip install pytest`).
|
|
||||||
|
|
||||||
Run all the tests from the root of the cloned repository with the commands:
|
|
||||||
|
|
||||||
.. code-block:: bash
|
|
||||||
|
|
||||||
python -m pytest -sv ./transformers/tests/
|
|
||||||
python -m pytest -sv ./examples/
|
|
||||||
|
|
||||||
|
|
||||||
OpenAI GPT original tokenization workflow
|
|
||||||
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
|
|
||||||
|
|
||||||
If you want to reproduce the original tokenization process of the ``OpenAI GPT`` paper, you will need to install ``ftfy`` (use version 4.4.3 if you are using Python 2) and ``SpaCy`` :
|
|
||||||
|
|
||||||
.. code-block:: bash
|
|
||||||
|
|
||||||
pip install spacy ftfy==4.4.3
|
|
||||||
python -m spacy download en
|
|
||||||
|
|
||||||
If you don't install ``ftfy`` and ``SpaCy``\ , the ``OpenAI GPT`` tokenizer will default to tokenize using BERT's ``BasicTokenizer`` followed by Byte-Pair Encoding (which should be fine for most usage, don't worry).
|
|
||||||
|
|
||||||
|
|
||||||
Note on model downloads (Continuous Integration or large-scale deployments)
|
|
||||||
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
|
|
||||||
|
|
||||||
If you expect to be downloading large volumes of models (more than 1,000) from our hosted bucket (for instance through your CI setup, or a large-scale production deployment), please cache the model files on your end. It will be way faster, and cheaper. Feel free to contact us privately if you need any help.
|
|
||||||
|
|
||||||
|
|
||||||
Do you want to run a Transformer model on a mobile device?
|
|
||||||
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
|
|
||||||
|
|
||||||
You should check out our `swift-coreml-transformers <https://github.com/huggingface/swift-coreml-transformers>`_ repo.
|
|
||||||
|
|
||||||
It contains an example of a conversion script from a Pytorch trained Transformer model (here, ``GPT-2``) to a CoreML model that runs on iOS devices.
|
|
||||||
|
|
||||||
It also contains an implementation of BERT for Question answering.
|
|
||||||
|
|
||||||
At some point in the future, you'll be able to seamlessly move from pre-training or fine-tuning models in PyTorch to productizing them in CoreML,
|
|
||||||
or prototype a model or an app in CoreML then research its hyperparameters or architecture from PyTorch. Super exciting!
|
|
||||||
44
docs/source/model_doc/ctrl.rst
Normal file
@@ -0,0 +1,44 @@
|
|||||||
|
CTRL
|
||||||
|
----------------------------------------------------
|
||||||
|
|
||||||
|
``CTRLConfig``
|
||||||
|
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
|
||||||
|
|
||||||
|
.. autoclass:: transformers.CTRLConfig
|
||||||
|
:members:
|
||||||
|
|
||||||
|
|
||||||
|
``CTRLTokenizer``
|
||||||
|
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
|
||||||
|
|
||||||
|
.. autoclass:: transformers.CTRLTokenizer
|
||||||
|
:members:
|
||||||
|
|
||||||
|
|
||||||
|
``CTRLModel``
|
||||||
|
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
|
||||||
|
|
||||||
|
.. autoclass:: transformers.CTRLModel
|
||||||
|
:members:
|
||||||
|
|
||||||
|
|
||||||
|
``CTRLLMHeadModel``
|
||||||
|
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
|
||||||
|
|
||||||
|
.. autoclass:: transformers.CTRLLMHeadModel
|
||||||
|
:members:
|
||||||
|
|
||||||
|
|
||||||
|
``TFCTRLModel``
|
||||||
|
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
|
||||||
|
|
||||||
|
.. autoclass:: transformers.TFCTRLModel
|
||||||
|
:members:
|
||||||
|
|
||||||
|
|
||||||
|
``TFCTRLLMHeadModel``
|
||||||
|
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
|
||||||
|
|
||||||
|
.. autoclass:: transformers.TFCTRLLMHeadModel
|
||||||
|
:members:
|
||||||
|
|
||||||
103
docs/source/multilingual.rst
Normal file
@@ -0,0 +1,103 @@
|
|||||||
|
Multi-lingual models
|
||||||
|
================================================
|
||||||
|
|
||||||
|
Most of the models available in this library are mono-lingual models (English, Chinese and German). A few
|
||||||
|
multi-lingual models are available and use different mechanisms than mono-lingual models.
|
||||||
|
This page details the usage of these models.
|
||||||
|
|
||||||
|
The two models that currently support multiple languages are BERT and XLM.
|
||||||
|
|
||||||
|
XLM
|
||||||
|
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
|
||||||
|
|
||||||
|
XLM has a total of 10 different checkpoints, only one of which is mono-lingual. The 9 remaining model checkpoints can
|
||||||
|
be split into two categories: the checkpoints that make use of language embeddings, and those that don't.
|
||||||
|
|
||||||
|
XLM & Language Embeddings
|
||||||
|
------------------------------------------------
|
||||||
|
|
||||||
|
This section concerns the following checkpoints:
|
||||||
|
|
||||||
|
- ``xlm-mlm-ende-1024`` (Masked language modeling, English-German)
|
||||||
|
- ``xlm-mlm-enfr-1024`` (Masked language modeling, English-French)
|
||||||
|
- ``xlm-mlm-enro-1024`` (Masked language modeling, English-Romanian)
|
||||||
|
- ``xlm-mlm-xnli15-1024`` (Masked language modeling, XNLI languages)
|
||||||
|
- ``xlm-mlm-tlm-xnli15-1024`` (Masked language modeling + Translation, XNLI languages)
|
||||||
|
- ``xlm-clm-enfr-1024`` (Causal language modeling, English-French)
|
||||||
|
- ``xlm-clm-ende-1024`` (Causal language modeling, English-German)
|
||||||
|
|
||||||
|
These checkpoints require language embeddings that will specify the language used at inference time. These language
|
||||||
|
embeddings are represented as a tensor that is of the same shape as the input ids passed to the model. The values in
|
||||||
|
these tensors depend on the language used and are identifiable using the ``lang2id`` and ``id2lang`` attributes
|
||||||
|
from the tokenizer.
|
||||||
|
|
||||||
|
Here is an example using the ``xlm-clm-enfr-1024`` checkpoint (Causal language modeling, English-French):
|
||||||
|
|
||||||
|
|
||||||
|
.. code-block::
|
||||||
|
|
||||||
|
import torch
|
||||||
|
from transformers import XLMTokenizer, XLMWithLMHeadModel
|
||||||
|
|
||||||
|
tokenizer = XLMTokenizer.from_pretrained("xlm-clm-enfr-1024")
model = XLMWithLMHeadModel.from_pretrained("xlm-clm-enfr-1024")
|
||||||
|
|
||||||
|
|
||||||
|
The different languages this model/tokenizer handles, as well as the ids of these languages, are visible using the
|
||||||
|
``lang2id`` attribute:
|
||||||
|
|
||||||
|
.. code-block::
|
||||||
|
|
||||||
|
print(tokenizer.lang2id) # {'en': 0, 'fr': 1}
|
||||||
|
|
||||||
|
|
||||||
|
These ids should be used when passing a language parameter during a model pass. Let's define our inputs:
|
||||||
|
|
||||||
|
.. code-block::
|
||||||
|
|
||||||
|
input_ids = torch.tensor([tokenizer.encode("Wikipedia was used to")]) # batch size of 1
|
||||||
|
|
||||||
|
|
||||||
|
We should now define the language embedding by using the previously defined language id. We want to create a tensor
|
||||||
|
filled with the appropriate language ids, of the same size as ``input_ids``. For English, the id is 0:
|
||||||
|
|
||||||
|
.. code-block::
|
||||||
|
|
||||||
|
language_id = tokenizer.lang2id['en'] # 0
|
||||||
|
langs = torch.tensor([language_id] * input_ids.shape[1]) # torch.tensor([0, 0, 0, ..., 0])
|
||||||
|
|
||||||
|
# We reshape it to be of size (batch_size, sequence_length)
|
||||||
|
langs = langs.view(1, -1) # is now of shape [1, sequence_length] (we have a batch size of 1)
|
||||||
|
|
||||||
|
|
||||||
|
You can then feed it all as input to your model:
|
||||||
|
|
||||||
|
.. code-block::
|
||||||
|
|
||||||
|
outputs = model(input_ids, langs=langs)
|
||||||
|
|
||||||
|
|
||||||
|
The example `run_generation.py <https://github.com/huggingface/transformers/blob/master/examples/run_generation.py>`__
|
||||||
|
can generate text using the CLM checkpoints from XLM, using the language embeddings.
|
||||||
|
|
||||||
|
XLM without Language Embeddings
|
||||||
|
------------------------------------------------
|
||||||
|
|
||||||
|
This section concerns the following checkpoints:
|
||||||
|
|
||||||
|
- ``xlm-mlm-17-1280`` (Masked language modeling, 17 languages)
|
||||||
|
- ``xlm-mlm-100-1280`` (Masked language modeling, 100 languages)
|
||||||
|
|
||||||
|
These checkpoints do not require language embeddings at inference time. These models are used to obtain generic
|
||||||
|
sentence representations, unlike the previously mentioned XLM checkpoints.
|
||||||
|
|
||||||
|
|
||||||
|
BERT
|
||||||
|
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
|
||||||
|
|
||||||
|
BERT has two checkpoints that can be used for multi-lingual tasks:
|
||||||
|
|
||||||
|
- ``bert-base-multilingual-uncased`` (Masked language modeling + Next sentence prediction, 102 languages)
|
||||||
|
- ``bert-base-multilingual-cased`` (Masked language modeling + Next sentence prediction, 104 languages)
|
||||||
|
|
||||||
|
These checkpoints do not require language embeddings at inference time. They should identify the language
|
||||||
|
used in the context and infer accordingly.
|
||||||
@@ -53,6 +53,14 @@ Here is the full list of the currently provided pretrained models together with
|
|||||||
| | ``bert-base-cased-finetuned-mrpc`` | | 12-layer, 768-hidden, 12-heads, 110M parameters. |
|
| | ``bert-base-cased-finetuned-mrpc`` | | 12-layer, 768-hidden, 12-heads, 110M parameters. |
|
||||||
| | | | The ``bert-base-cased`` model fine-tuned on MRPC |
|
| | | | The ``bert-base-cased`` model fine-tuned on MRPC |
|
||||||
| | | (see `details of fine-tuning in the example section <https://huggingface.co/transformers/examples.html>`__) |
|
| | | (see `details of fine-tuning in the example section <https://huggingface.co/transformers/examples.html>`__) |
|
||||||
|
| +------------------------------------------------------------+---------------------------------------------------------------------------------------------------------------------------------------+
|
||||||
|
| | ``bert-base-german-dbmdz-cased`` | | 12-layer, 768-hidden, 12-heads, 110M parameters. |
|
||||||
|
| | | | Trained on cased German text by DBMDZ |
|
||||||
|
| | | (see `details on dbmdz repository <https://github.com/dbmdz/german-bert>`__). |
|
||||||
|
| +------------------------------------------------------------+---------------------------------------------------------------------------------------------------------------------------------------+
|
||||||
|
| | ``bert-base-german-dbmdz-uncased`` | | 12-layer, 768-hidden, 12-heads, 110M parameters. |
|
||||||
|
| | | | Trained on uncased German text by DBMDZ |
|
||||||
|
| | | (see `details on dbmdz repository <https://github.com/dbmdz/german-bert>`__). |
|
||||||
+-------------------+------------------------------------------------------------+---------------------------------------------------------------------------------------------------------------------------------------+
|
+-------------------+------------------------------------------------------------+---------------------------------------------------------------------------------------------------------------------------------------+
|
||||||
| GPT | ``openai-gpt`` | | 12-layer, 768-hidden, 12-heads, 110M parameters. |
|
| GPT | ``openai-gpt`` | | 12-layer, 768-hidden, 12-heads, 110M parameters. |
|
||||||
| | | | OpenAI GPT English model |
|
| | | | OpenAI GPT English model |
|
||||||
@@ -128,5 +136,13 @@ Here is the full list of the currently provided pretrained models together with
|
|||||||
| | ``distilgpt2`` | | 6-layer, 768-hidden, 12-heads, 82M parameters |
|
| | ``distilgpt2`` | | 6-layer, 768-hidden, 12-heads, 82M parameters |
|
||||||
| | | | The DistilGPT2 model distilled from the GPT2 model `gpt2` checkpoint. |
|
| | | | The DistilGPT2 model distilled from the GPT2 model `gpt2` checkpoint. |
|
||||||
| | | (see `details <https://github.com/huggingface/transformers/tree/master/examples/distillation>`__) |
|
| | | (see `details <https://github.com/huggingface/transformers/tree/master/examples/distillation>`__) |
|
||||||
|
| +------------------------------------------------------------+---------------------------------------------------------------------------------------------------------------------------------------+
|
||||||
|
| | ``distilroberta-base`` | | 6-layer, 768-hidden, 12-heads, 82M parameters |
|
||||||
|
| | | | The DistilRoBERTa model distilled from the RoBERTa model `roberta-base` checkpoint. |
|
||||||
|
| | | (see `details <https://github.com/huggingface/transformers/tree/master/examples/distillation>`__) |
|
||||||
+-------------------+------------------------------------------------------------+---------------------------------------------------------------------------------------------------------------------------------------+
|
+-------------------+------------------------------------------------------------+---------------------------------------------------------------------------------------------------------------------------------------+
|
||||||
|
| CTRL | ``ctrl`` | | 48-layer, 1280-hidden, 16-heads, 1.6B parameters |
|
||||||
|
| | | | Salesforce's Large-sized CTRL English model |
|
||||||
|
+-------------------+------------------------------------------------------------+---------------------------------------------------------------------------------------------------------------------------------------+
|
||||||
|
|
||||||
.. <https://huggingface.co/transformers/examples.html>`__
|
.. <https://huggingface.co/transformers/examples.html>`__
|
||||||
@@ -33,6 +33,8 @@ where
|
|||||||
* ``bert-large-uncased-whole-word-masking``: 24-layer, 1024-hidden, 16-heads, 340M parameters - Trained with Whole Word Masking (mask all of the tokens corresponding to a word at once)
|
* ``bert-large-uncased-whole-word-masking``: 24-layer, 1024-hidden, 16-heads, 340M parameters - Trained with Whole Word Masking (mask all of the tokens corresponding to a word at once)
|
||||||
* ``bert-large-cased-whole-word-masking``: 24-layer, 1024-hidden, 16-heads, 340M parameters - Trained with Whole Word Masking (mask all of the tokens corresponding to a word at once)
|
* ``bert-large-cased-whole-word-masking``: 24-layer, 1024-hidden, 16-heads, 340M parameters - Trained with Whole Word Masking (mask all of the tokens corresponding to a word at once)
|
||||||
* ``bert-large-uncased-whole-word-masking-finetuned-squad``: The ``bert-large-uncased-whole-word-masking`` model finetuned on SQuAD (using the ``run_bert_squad.py`` examples). Results: *exact_match: 86.91579943235573, f1: 93.1532499015869*
|
* ``bert-large-uncased-whole-word-masking-finetuned-squad``: The ``bert-large-uncased-whole-word-masking`` model finetuned on SQuAD (using the ``run_bert_squad.py`` examples). Results: *exact_match: 86.91579943235573, f1: 93.1532499015869*
|
||||||
|
* ``bert-base-german-dbmdz-cased``: Trained on German data only, 12-layer, 768-hidden, 12-heads, 110M parameters `Performance Evaluation <https://github.com/dbmdz/german-bert>`__
|
||||||
|
* ``bert-base-german-dbmdz-uncased``: Trained on (uncased) German data only, 12-layer, 768-hidden, 12-heads, 110M parameters `Performance Evaluation <https://github.com/dbmdz/german-bert>`__
|
||||||
* ``openai-gpt``: OpenAI GPT English model, 12-layer, 768-hidden, 12-heads, 110M parameters
|
* ``openai-gpt``: OpenAI GPT English model, 12-layer, 768-hidden, 12-heads, 110M parameters
|
||||||
* ``gpt2``: OpenAI GPT-2 English model, 12-layer, 768-hidden, 12-heads, 117M parameters
|
* ``gpt2``: OpenAI GPT-2 English model, 12-layer, 768-hidden, 12-heads, 117M parameters
|
||||||
* ``gpt2-medium``: OpenAI GPT-2 English model, 24-layer, 1024-hidden, 16-heads, 345M parameters
|
* ``gpt2-medium``: OpenAI GPT-2 English model, 24-layer, 1024-hidden, 16-heads, 345M parameters
|
||||||
|
|||||||
@@ -5,13 +5,37 @@ similar API between the different models.
|
|||||||
|
|
||||||
| Section | Description |
|
| Section | Description |
|
||||||
|----------------------------|------------------------------------------------------------------------------------------------------------------------------------------------------------|
|
|----------------------------|------------------------------------------------------------------------------------------------------------------------------------------------------------|
|
||||||
|
| [TensorFlow 2.0 models on GLUE](#tensorflow-20-bert-models-on-glue) | Examples running the BERT TensorFlow 2.0 model on the GLUE tasks. |
|
||||||
| [Language Model fine-tuning](#language-model-fine-tuning) | Fine-tuning the library models for language modeling on a text dataset. Causal language modeling for GPT/GPT-2, masked language modeling for BERT/RoBERTa. |
|
| [Language Model fine-tuning](#language-model-fine-tuning) | Fine-tuning the library models for language modeling on a text dataset. Causal language modeling for GPT/GPT-2, masked language modeling for BERT/RoBERTa. |
|
||||||
| [Language Generation](#language-generation) | Conditional text generation using the auto-regressive models of the library: GPT, GPT-2, Transformer-XL and XLNet. |
|
| [Language Generation](#language-generation) | Conditional text generation using the auto-regressive models of the library: GPT, GPT-2, Transformer-XL and XLNet. |
|
||||||
| [GLUE](#glue) | Examples running BERT/XLM/XLNet/RoBERTa on the 9 GLUE tasks. Examples feature distributed training as well as half-precision. |
|
| [GLUE](#glue) | Examples running BERT/XLM/XLNet/RoBERTa on the 9 GLUE tasks. Examples feature distributed training as well as half-precision. |
|
||||||
| [SQuAD](#squad) | Using BERT for question answering, examples with distributed training. |
|
| [SQuAD](#squad) | Using BERT/XLM/XLNet/RoBERTa for question answering, examples with distributed training. |
|
||||||
| [Multiple Choice](#multiple choice) | Examples running BERT/XLNet/RoBERTa on the SWAG/RACE/ARC tasks.
|
| [Multiple Choice](#multiple-choice) | Examples running BERT/XLNet/RoBERTa on the SWAG/RACE/ARC tasks.
|
||||||
|
| [Named Entity Recognition](#named-entity-recognition) | Using BERT for Named Entity Recognition (NER) on the CoNLL 2003 dataset, examples with distributed training. |
|
||||||
| [Abstractive summarization](#abstractive-summarization) | Fine-tuning the library models for abstractive summarization tasks on the CNN/Daily Mail dataset. |
|
| [Abstractive summarization](#abstractive-summarization) | Fine-tuning the library models for abstractive summarization tasks on the CNN/Daily Mail dataset. |
|
||||||
|
|
||||||
|
## TensorFlow 2.0 Bert models on GLUE
|
||||||
|
|
||||||
|
Based on the script [`run_tf_glue.py`](https://github.com/huggingface/transformers/blob/master/examples/run_tf_glue.py).
|
||||||
|
|
||||||
|
Fine-tuning the library TensorFlow 2.0 Bert model for sequence classification on the MRPC task of the GLUE benchmark: [General Language Understanding Evaluation](https://gluebenchmark.com/).
|
||||||
|
|
||||||
|
This script has an option for mixed precision (Automatic Mixed Precision / AMP) to run models on Tensor Cores (NVIDIA Volta/Turing GPUs) and future hardware and an option for XLA, which uses the XLA compiler to reduce model runtime.
|
||||||
|
Options are toggled using `USE_XLA` or `USE_AMP` variables in the script.
|
||||||
|
These options and the benchmark below are provided by @tlkh.
|
||||||
|
|
||||||
|
Quick benchmarks from the script (no other modifications):
|
||||||
|
|
||||||
|
| GPU | Mode | Time (2nd epoch) | Val Acc (3 runs) |
|
||||||
|
| --------- | -------- | ----------------------- | ----------------------|
|
||||||
|
| Titan V | FP32 | 41s | 0.8438/0.8281/0.8333 |
|
||||||
|
| Titan V | AMP | 26s | 0.8281/0.8568/0.8411 |
|
||||||
|
| V100 | FP32 | 35s | 0.8646/0.8359/0.8464 |
|
||||||
|
| V100 | AMP | 22s | 0.8646/0.8385/0.8411 |
|
||||||
|
| 1080 Ti | FP32 | 55s | - |
|
||||||
|
|
||||||
|
Mixed precision (AMP) reduces the training time considerably for the same hardware and hyper-parameters (same batch size was used).
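
To make the comparison concrete, the short sketch below recomputes the AMP speed-up and the mean validation accuracy per GPU from the table. The numbers are copied from the table; the variable names are only illustrative:

```python
from statistics import mean

# (GPU, mode) -> (second-epoch time in seconds, validation accuracy of 3 runs),
# copied from the benchmark table.
results = {
    ("Titan V", "FP32"): (41, [0.8438, 0.8281, 0.8333]),
    ("Titan V", "AMP"): (26, [0.8281, 0.8568, 0.8411]),
    ("V100", "FP32"): (35, [0.8646, 0.8359, 0.8464]),
    ("V100", "AMP"): (22, [0.8646, 0.8385, 0.8411]),
}

speedups = {}
mean_acc = {}
for gpu in ("Titan V", "V100"):
    fp32_time, fp32_acc = results[(gpu, "FP32")]
    amp_time, amp_acc = results[(gpu, "AMP")]
    speedups[gpu] = fp32_time / amp_time  # ratio of second-epoch times
    mean_acc[gpu] = (mean(fp32_acc), mean(amp_acc))
    print(
        f"{gpu}: AMP is {speedups[gpu]:.2f}x faster, "
        f"mean val acc {mean_acc[gpu][0]:.4f} (FP32) vs {mean_acc[gpu][1]:.4f} (AMP)"
    )
```

With these figures the second epoch is roughly 1.6x faster with AMP on both GPUs, while the mean validation accuracy differs by less than one point.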
|
||||||
|
|
||||||
## Language model fine-tuning
|
## Language model fine-tuning
|
||||||
|
|
||||||
Based on the script [`run_lm_finetuning.py`](https://github.com/huggingface/transformers/blob/master/examples/run_lm_finetuning.py).
|
Based on the script [`run_lm_finetuning.py`](https://github.com/huggingface/transformers/blob/master/examples/run_lm_finetuning.py).
|
||||||
@@ -78,7 +102,7 @@ python run_lm_finetuning.py \
|
|||||||
|
|
||||||
Based on the script [`run_generation.py`](https://github.com/huggingface/transformers/blob/master/examples/run_generation.py).
|
Based on the script [`run_generation.py`](https://github.com/huggingface/transformers/blob/master/examples/run_generation.py).
|
||||||
|
|
||||||
Conditional text generation using the auto-regressive models of the library: GPT, GPT-2, Transformer-XL and XLNet.
|
Conditional text generation using the auto-regressive models of the library: GPT, GPT-2, Transformer-XL, XLNet, CTRL.
|
||||||
A similar script is used for our official demo [Write With Transformer](https://transformer.huggingface.co), where you
|
A similar script is used for our official demo [Write With Transformer](https://transformer.huggingface.co), where you
|
||||||
can try out the different models available in the library.
|
can try out the different models available in the library.
|
||||||
|
|
||||||
@@ -284,17 +308,17 @@ The results are the following:
|
|||||||
loss = 0.04755385363816904
|
loss = 0.04755385363816904
|
||||||
```
|
```
|
||||||
|
|
||||||
##Multiple Choice
|
## Multiple Choice
|
||||||
|
|
||||||
Based on the script [`run_multiple_choice.py`]().
|
Based on the script [`run_multiple_choice.py`]().
|
||||||
|
|
||||||
#### Fine-tuning on SWAG
|
#### Fine-tuning on SWAG
|
||||||
Download [swag](https://github.com/rowanz/swagaf/tree/master/data) data
|
Download [swag](https://github.com/rowanz/swagaf/tree/master/data) data
|
||||||
|
|
||||||
```
|
```bash
|
||||||
#training on 4 tesla V100(16GB) GPUS
|
#training on 4 tesla V100(16GB) GPUS
|
||||||
export SWAG_DIR=/path/to/swag_data_dir
|
export SWAG_DIR=/path/to/swag_data_dir
|
||||||
python ./examples/single_model_scripts/run_multiple_choice.py \
|
python ./examples/run_multiple_choice.py \
|
||||||
--model_type roberta \
|
--model_type roberta \
|
||||||
--task_name swag \
|
--task_name swag \
|
||||||
--model_name_or_path roberta-base \
|
--model_name_or_path roberta-base \
|
||||||
@@ -391,6 +415,107 @@ exact_match = 86.91
|
|||||||
This fine-tuned model is available as a checkpoint under the reference
|
This fine-tuned model is available as a checkpoint under the reference
|
||||||
`bert-large-uncased-whole-word-masking-finetuned-squad`.
|
`bert-large-uncased-whole-word-masking-finetuned-squad`.
|
||||||
|
|
||||||
|
## Named Entity Recognition
|
||||||
|
|
||||||
|
Based on the script [`run_ner.py`](https://github.com/huggingface/transformers/blob/master/examples/run_ner.py).
|
||||||
|
This example fine-tunes BERT Multilingual on GermEval 2014 (German NER).
|
||||||
|
Details and results for the fine-tuning provided by @stefan-it.
|
||||||
|
|
||||||
|
### Data (Download and pre-processing steps)
|
||||||
|
|
||||||
|
Data can be obtained from the [GermEval 2014](https://sites.google.com/site/germeval2014ner/data) shared task page.
|
||||||
|
|
||||||
|
Here are the commands for downloading and pre-processing the train, dev and test datasets. The original data format has four (tab-separated) columns; in a pre-processing step, only the two relevant columns (token and outer span NER annotation) are extracted:
|
||||||
|
|
||||||
|
```bash
|
||||||
|
curl -L 'https://sites.google.com/site/germeval2014ner/data/NER-de-train.tsv?attredirects=0&d=1' \
|
||||||
|
| grep -v "^#" | cut -f 2,3 | tr '\t' ' ' > train.txt.tmp
|
||||||
|
curl -L 'https://sites.google.com/site/germeval2014ner/data/NER-de-dev.tsv?attredirects=0&d=1' \
|
||||||
|
| grep -v "^#" | cut -f 2,3 | tr '\t' ' ' > dev.txt.tmp
|
||||||
|
curl -L 'https://sites.google.com/site/germeval2014ner/data/NER-de-test.tsv?attredirects=0&d=1' \
|
||||||
|
| grep -v "^#" | cut -f 2,3 | tr '\t' ' ' > test.txt.tmp
|
||||||
|
```
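
For readers who prefer to do the column extraction in Python, the sketch below mimics the `grep -v "^#" | cut -f 2,3 | tr '\t' ' '` pipeline on an iterable of lines. The function name and the sample rows are hypothetical, and the meaning of the first and fourth columns (index and nested label) is an assumption made for the illustration:

```python
def extract_token_and_label(lines):
    r"""Mimic `grep -v "^#" | cut -f 2,3 | tr '\t' ' '` on an iterable of lines."""
    for line in lines:
        line = line.rstrip("\n")
        if line.startswith("#"):
            continue  # grep -v "^#": drop comment lines
        fields = line.split("\t")
        if len(fields) == 1:
            yield line  # cut passes lines without a tab through unchanged
        else:
            yield " ".join(fields[1:3])  # columns 2 and 3, joined by a space


# Hypothetical rows: (index, token, outer label, nested label), tab-separated.
sample = [
    "#\tsource-of-the-sentence\n",
    "1\tBerlin\tB-LOC\tO\n",
    "2\tist\tO\tO\n",
    "\n",  # blank line separating sentences
]
processed = list(extract_token_and_label(sample))
print(processed)
```

Blank lines are kept on purpose, since they separate sentences in the resulting files.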
|
||||||
|
|
||||||
|
The GermEval 2014 dataset contains some strange "control character" tokens like `'\x96', '\u200e', '\x95', '\xad' or '\x80'`. One problem with these tokens is that `BertTokenizer` returns an empty token for them, resulting in misaligned `InputExample`s. I wrote a script that a) filters these tokens and b) splits longer sentences into smaller ones (once the max. subtoken length is reached).
|
||||||
|
|
||||||
|
```bash
|
||||||
|
wget "https://raw.githubusercontent.com/stefan-it/fine-tuned-berts-seq/master/scripts/preprocess.py"
|
||||||
|
```
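
The downloaded `preprocess.py` remains the reference for this step. As a rough, standard-library-only approximation of the filtering part (a), the sketch below drops tokens made up entirely of characters in a Unicode "Other" category (control, format, and similar). This swaps the tokenizer-based check for a Unicode category check, which is an assumption of this example, and the helper names are hypothetical:

```python
import unicodedata


def is_control_only(token):
    """True if every character belongs to a Unicode 'Other' category (Cc, Cf, ...)."""
    return all(unicodedata.category(ch).startswith("C") for ch in token)


def drop_control_tokens(rows):
    """Keep (token, label) pairs whose token has at least one visible character."""
    return [(tok, lab) for tok, lab in rows if tok and not is_control_only(tok)]


# Hypothetical (token, label) rows containing some of the problematic tokens.
rows = [
    ("Berlin", "B-LOC"),
    ("\x96", "O"),
    ("\u200e", "O"),
    ("\xad", "O"),
    ("ist", "O"),
]
cleaned = drop_control_tokens(rows)
print(cleaned)
```

Dropping the token together with its label keeps the remaining tokens and labels aligned, which is the property the misaligned `InputExample`s were missing.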
|
||||||
|
Let's define some variables that we need for further pre-processing steps and training the model:
|
||||||
|
|
||||||
|
```bash
|
||||||
|
export MAX_LENGTH=128
|
||||||
|
export BERT_MODEL=bert-base-multilingual-cased
|
||||||
|
```
|
||||||
|
|
||||||
|
Run the pre-processing script on training, dev and test datasets:

```bash
python3 preprocess.py train.txt.tmp $BERT_MODEL $MAX_LENGTH > train.txt
python3 preprocess.py dev.txt.tmp $BERT_MODEL $MAX_LENGTH > dev.txt
python3 preprocess.py test.txt.tmp $BERT_MODEL $MAX_LENGTH > test.txt
```

The GermEval 2014 dataset has many more labels than the CoNLL-2002/2003 datasets, so a dedicated set of labels must be used:

```bash
cat train.txt dev.txt test.txt | cut -d " " -f 2 | grep -v "^$"| sort | uniq > labels.txt
```

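The same label extraction can be sketched in Python, which may be handy when the shell tools are not available. The function name `collect_labels` is an assumption for this example, and the ordering follows Python's code point sort, which can differ from a locale-aware `sort`.

```python
def collect_labels(lines):
    """Return the sorted set of labels found in the second space-separated column."""
    labels = set()
    for line in lines:
        parts = line.rstrip("\n").split(" ")
        # Blank lines separate sentences and carry no label.
        if len(parts) >= 2 and parts[1]:
            labels.add(parts[1])
    return sorted(labels)


example = ["Schartau B-PER\n", "sagte O\n", "\n", "Tagesspiegel B-ORG\n"]
print(collect_labels(example))  # ['B-ORG', 'B-PER', 'O']
```

To build `labels.txt`, the lines of `train.txt`, `dev.txt` and `test.txt` would be passed in together and the result written out one label per line.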
### Training

Additional environment variables must be set:

```bash
export OUTPUT_DIR=germeval-model
export BATCH_SIZE=32
export NUM_EPOCHS=3
export SAVE_STEPS=750
export SEED=1
```

To start training, just run:

```bash
python3 run_ner.py --data_dir ./ \
--model_type bert \
--labels ./labels.txt \
--model_name_or_path $BERT_MODEL \
--output_dir $OUTPUT_DIR \
--max_seq_length $MAX_LENGTH \
--num_train_epochs $NUM_EPOCHS \
--per_gpu_train_batch_size $BATCH_SIZE \
--save_steps $SAVE_STEPS \
--seed $SEED \
--do_train \
--do_eval \
--do_predict
```

If your GPU supports half-precision training, just add the `--fp16` flag. After training, the model will be evaluated on both the development and test datasets.

### Evaluation

Evaluation on the development dataset outputs the following for our example:

```bash
10/04/2019 00:42:06 - INFO - __main__ - ***** Eval results *****
10/04/2019 00:42:06 - INFO - __main__ - f1 = 0.8623348017621146
10/04/2019 00:42:06 - INFO - __main__ - loss = 0.07183869666975543
10/04/2019 00:42:06 - INFO - __main__ - precision = 0.8467916366258111
10/04/2019 00:42:06 - INFO - __main__ - recall = 0.8784592370979806
```

On the test dataset, the following results were achieved:

```bash
10/04/2019 00:42:42 - INFO - __main__ - ***** Eval results *****
10/04/2019 00:42:42 - INFO - __main__ - f1 = 0.8614389652384803
10/04/2019 00:42:42 - INFO - __main__ - loss = 0.07064602487454782
10/04/2019 00:42:42 - INFO - __main__ - precision = 0.8604651162790697
10/04/2019 00:42:42 - INFO - __main__ - recall = 0.8624150210424085
```

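As a quick sanity check on the logs above, the reported `f1` values appear consistent with the harmonic mean of the logged precision and recall. The short sketch below recomputes them; the helper name `f1_score` is chosen for this example and is not taken from `run_ner.py`.

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Precision and recall copied from the dev and test logs above.
dev_f1 = f1_score(0.8467916366258111, 0.8784592370979806)
test_f1 = f1_score(0.8604651162790697, 0.8624150210424085)
print(dev_f1, test_f1)
```

The two printed values should agree with the logged `f1` lines up to floating point rounding.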
## Abstractive summarization
|
## Abstractive summarization
|
||||||
|
|
||||||
Based on the script
|
Based on the script
|
||||||
@@ -417,4 +542,4 @@ python run_summarization_finetuning.py \
|
|||||||
--model_name_or_path=bert2bert \
|
--model_name_or_path=bert2bert \
|
||||||
--do_train \
|
--do_train \
|
||||||
--data_path=$DATA_PATH \
|
--data_path=$DATA_PATH \
|
||||||
```
|
```
|
||||||
467
examples/benchmarks.py
Normal file
@@ -0,0 +1,467 @@
# coding=utf-8
# Copyright 2018 The HuggingFace Inc. team.
# Copyright (c) 2018, NVIDIA CORPORATION. All rights reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
""" Benchmarking the library on inference and training """

# If checking the tensors placement
# tf.debugging.set_log_device_placement(True)

from typing import List
import timeit
from transformers import is_tf_available, is_torch_available
from time import time
import argparse
import csv

if is_tf_available():
    import tensorflow as tf
    from transformers import TFAutoModel

if is_torch_available():
    import torch
    from transformers import AutoModel

from transformers import AutoConfig, AutoTokenizer

input_text = """Bent over their instruments, three hundred Fertilizers were plunged, as
|
||||||
|
the Director of Hatcheries and Conditioning entered the room, in the
|
||||||
|
|
||||||
|
|
||||||
|
|
||||||
|
scarcely breathing silence, the absent-minded, soliloquizing hum or
|
||||||
|
whistle, of absorbed concentration. A troop of newly arrived students,
|
||||||
|
very young, pink and callow, followed nervously, rather abjectly, at the
|
||||||
|
Director's heels. Each of them carried a notebook, in which, whenever
|
||||||
|
the great man spoke, he desperately scribbled. Straight from the
|
||||||
|
horse's mouth. It was a rare privilege. The D. H. C. for Central London
|
||||||
|
always made a point of personally conducting his new students round
|
||||||
|
the various departments.
|
||||||
|
|
||||||
|
"Just to give you a general idea," he would explain to them. For of
|
||||||
|
course some sort of general idea they must have, if they were to do
|
||||||
|
their work intelligently-though as little of one, if they were to be good
|
||||||
|
and happy members of society, as possible. For particulars, as every
|
||||||
|
one knows, make for virtue and happiness; generalities are intellectu-
|
||||||
|
ally necessary evils. Not philosophers but fret-sawyers and stamp col-
|
||||||
|
lectors compose the backbone of society.
|
||||||
|
|
||||||
|
"To-morrow," he would add, smiling at them with a slightly menacing
|
||||||
|
geniality, "you'll be settling down to serious work. You won't have time
|
||||||
|
for generalities. Meanwhile ..."
|
||||||
|
|
||||||
|
Meanwhile, it was a privilege. Straight from the horse's mouth into the
|
||||||
|
notebook. The boys scribbled like mad.
|
||||||
|
|
||||||
|
Tall and rather thin but upright, the Director advanced into the room.
|
||||||
|
He had a long chin and big rather prominent teeth, just covered, when
|
||||||
|
he was not talking, by his full, floridly curved lips. Old, young? Thirty?
|
||||||
|
Fifty? Fifty-five? It was hard to say. And anyhow the question didn't
|
||||||
|
arise; in this year of stability, A. F. 632, it didn't occur to you to ask it.
|
||||||
|
|
||||||
|
"I shall begin at the beginning," said the D.H.C. and the more zealous
|
||||||
|
students recorded his intention in their notebooks: Begin at the begin-
|
||||||
|
ning. "These," he waved his hand, "are the incubators." And opening
|
||||||
|
an insulated door he showed them racks upon racks of numbered test-
|
||||||
|
tubes. "The week's supply of ova. Kept," he explained, "at blood heat;
|
||||||
|
whereas the male gametes," and here he opened another door, "they
|
||||||
|
have to be kept at thirty-five instead of thirty-seven. Full blood heat
|
||||||
|
sterilizes." Rams wrapped in theremogene beget no lambs.
|
||||||
|
|
||||||
|
Still leaning against the incubators he gave them, while the pencils
|
||||||
|
scurried illegibly across the pages, a brief description of the modern
|
||||||
|
|
||||||
|
|
||||||
|
|
||||||
|
fertilizing process; spoke first, of course, of its surgical introduc-
|
||||||
|
tion-"the operation undergone voluntarily for the good of Society, not
|
||||||
|
to mention the fact that it carries a bonus amounting to six months'
|
||||||
|
salary"; continued with some account of the technique for preserving
|
||||||
|
the excised ovary alive and actively developing; passed on to a consid-
|
||||||
|
eration of optimum temperature, salinity, viscosity; referred to the liq-
|
||||||
|
uor in which the detached and ripened eggs were kept; and, leading
|
||||||
|
his charges to the work tables, actually showed them how this liquor
|
||||||
|
was drawn off from the test-tubes; how it was let out drop by drop
|
||||||
|
onto the specially warmed slides of the microscopes; how the eggs
|
||||||
|
which it contained were inspected for abnormalities, counted and
|
||||||
|
transferred to a porous receptacle; how (and he now took them to
|
||||||
|
watch the operation) this receptacle was immersed in a warm bouillon
|
||||||
|
containing free-swimming spermatozoa-at a minimum concentration
|
||||||
|
of one hundred thousand per cubic centimetre, he insisted; and how,
|
||||||
|
after ten minutes, the container was lifted out of the liquor and its
|
||||||
|
contents re-examined; how, if any of the eggs remained unfertilized, it
|
||||||
|
was again immersed, and, if necessary, yet again; how the fertilized
|
||||||
|
ova went back to the incubators; where the Alphas and Betas re-
|
||||||
|
mained until definitely bottled; while the Gammas, Deltas and Epsilons
|
||||||
|
were brought out again, after only thirty-six hours, to undergo Bo-
|
||||||
|
kanovsky's Process.
|
||||||
|
|
||||||
|
"Bokanovsky's Process," repeated the Director, and the students un-
|
||||||
|
derlined the words in their little notebooks.
|
||||||
|
|
||||||
|
One egg, one embryo, one adult-normality. But a bokanovskified egg
|
||||||
|
will bud, will proliferate, will divide. From eight to ninety-six buds, and
|
||||||
|
every bud will grow into a perfectly formed embryo, and every embryo
|
||||||
|
into a full-sized adult. Making ninety-six human beings grow where
|
||||||
|
only one grew before. Progress.
|
||||||
|
|
||||||
|
"Essentially," the D.H.C. concluded, "bokanovskification consists of a
|
||||||
|
series of arrests of development. We check the normal growth and,
|
||||||
|
paradoxically enough, the egg responds by budding."
|
||||||
|
|
||||||
|
Responds by budding. The pencils were busy.
|
||||||
|
|
||||||
|
He pointed. On a very slowly moving band a rack-full of test-tubes was
|
||||||
|
entering a large metal box, another, rack-full was emerging. Machinery
|
||||||
|
faintly purred. It took eight minutes for the tubes to go through, he
|
||||||
|
|
||||||
|
|
||||||
|
|
||||||
|
told them. Eight minutes of hard X-rays being about as much as an
|
||||||
|
egg can stand. A few died; of the rest, the least susceptible divided
|
||||||
|
into two; most put out four buds; some eight; all were returned to the
|
||||||
|
incubators, where the buds began to develop; then, after two days,
|
||||||
|
were suddenly chilled, chilled and checked. Two, four, eight, the buds
|
||||||
|
in their turn budded; and having budded were dosed almost to death
|
||||||
|
with alcohol; consequently burgeoned again and having budded-bud
|
||||||
|
out of bud out of bud-were thereafter-further arrest being generally
|
||||||
|
fatal-left to develop in peace. By which time the original egg was in a
|
||||||
|
fair way to becoming anything from eight to ninety-six embryos- a
|
||||||
|
prodigious improvement, you will agree, on nature. Identical twins-but
|
||||||
|
not in piddling twos and threes as in the old viviparous days, when an
|
||||||
|
egg would sometimes accidentally divide; actually by dozens, by
|
||||||
|
scores at a time.
|
||||||
|
|
||||||
|
"Scores," the Director repeated and flung out his arms, as though he
|
||||||
|
were distributing largesse. "Scores."
|
||||||
|
|
||||||
|
But one of the students was fool enough to ask where the advantage
|
||||||
|
lay.
|
||||||
|
|
||||||
|
"My good boy!" The Director wheeled sharply round on him. "Can't you
|
||||||
|
see? Can't you see?" He raised a hand; his expression was solemn.
|
||||||
|
"Bokanovsky's Process is one of the major instruments of social stabil-
|
||||||
|
ity!"
|
||||||
|
|
||||||
|
Major instruments of social stability.
|
||||||
|
|
||||||
|
Standard men and women; in uniform batches. The whole of a small
|
||||||
|
factory staffed with the products of a single bokanovskified egg.
|
||||||
|
|
||||||
|
"Ninety-six identical twins working ninety-six identical machines!" The
|
||||||
|
voice was almost tremulous with enthusiasm. "You really know where
|
||||||
|
you are. For the first time in history." He quoted the planetary motto.
|
||||||
|
"Community, Identity, Stability." Grand words. "If we could bo-
|
||||||
|
kanovskify indefinitely the whole problem would be solved."
|
||||||
|
|
||||||
|
Solved by standard Gammas, unvarying Deltas, uniform Epsilons. Mil-
|
||||||
|
lions of identical twins. The principle of mass production at last applied
|
||||||
|
to biology.
|
||||||
|
|
||||||
|
|
||||||
|
|
||||||
|
"But, alas," the Director shook his head, "we can't bokanovskify indefi-
|
||||||
|
nitely."
|
||||||
|
|
||||||
|
Ninety-six seemed to be the limit; seventy-two a good average. From
|
||||||
|
the same ovary and with gametes of the same male to manufacture as
|
||||||
|
many batches of identical twins as possible-that was the best (sadly a
|
||||||
|
second best) that they could do. And even that was difficult.
|
||||||
|
|
||||||
|
"For in nature it takes thirty years for two hundred eggs to reach ma-
|
||||||
|
turity. But our business is to stabilize the population at this moment,
|
||||||
|
here and now. Dribbling out twins over a quarter of a century-what
|
||||||
|
would be the use of that?"
|
||||||
|
|
||||||
|
Obviously, no use at all. But Podsnap's Technique had immensely ac-
|
||||||
|
celerated the process of ripening. They could make sure of at least a
|
||||||
|
hundred and fifty mature eggs within two years. Fertilize and bo-
|
||||||
|
kanovskify-in other words, multiply by seventy-two-and you get an
|
||||||
|
average of nearly eleven thousand brothers and sisters in a hundred
|
||||||
|
and fifty batches of identical twins, all within two years of the same
|
||||||
|
age.
|
||||||
|
|
||||||
|
"And in exceptional cases we can make one ovary yield us over fifteen
|
||||||
|
thousand adult individuals."
|
||||||
|
|
||||||
|
Beckoning to a fair-haired, ruddy young man who happened to be
|
||||||
|
passing at the moment. "Mr. Foster," he called. The ruddy young man
|
||||||
|
approached. "Can you tell us the record for a single ovary, Mr. Foster?"
|
||||||
|
|
||||||
|
"Sixteen thousand and twelve in this Centre," Mr. Foster replied with-
|
||||||
|
out hesitation. He spoke very quickly, had a vivacious blue eye, and
|
||||||
|
took an evident pleasure in quoting figures. "Sixteen thousand and
|
||||||
|
twelve; in one hundred and eighty-nine batches of identicals. But of
|
||||||
|
course they've done much better," he rattled on, "in some of the tropi-
|
||||||
|
cal Centres. Singapore has often produced over sixteen thousand five
|
||||||
|
hundred; and Mombasa has actually touched the seventeen thousand
|
||||||
|
mark. But then they have unfair advantages. You should see the way a
|
||||||
|
negro ovary responds to pituitary! It's quite astonishing, when you're
|
||||||
|
used to working with European material. Still," he added, with a laugh
|
||||||
|
(but the light of combat was in his eyes and the lift of his chin was
|
||||||
|
challenging), "still, we mean to beat them if we can. I'm working on a
|
||||||
|
wonderful Delta-Minus ovary at this moment. Only just eighteen
|
||||||
|
|
||||||
|
|
||||||
|
|
||||||
|
months old. Over twelve thousand seven hundred children already, ei-
|
||||||
|
ther decanted or in embryo. And still going strong. We'll beat them
|
||||||
|
yet."
|
||||||
|
|
||||||
|
"That's the spirit I like!" cried the Director, and clapped Mr. Foster on
|
||||||
|
the shoulder. "Come along with us, and give these boys the benefit of
|
||||||
|
your expert knowledge."
|
||||||
|
|
||||||
|
Mr. Foster smiled modestly. "With pleasure." They went.
|
||||||
|
In the Bottling Room all was harmonious bustle and ordered activity.
|
||||||
|
Flaps of fresh sow's peritoneum ready cut to the proper size came
|
||||||
|
shooting up in little lifts from the Organ Store in the sub-basement.
|
||||||
|
Whizz and then, click! the lift-hatches hew open; the bottle-liner had
|
||||||
|
only to reach out a hand, take the flap, insert, smooth-down, and be-
|
||||||
|
fore the lined bottle had had time to travel out of reach along the end-
|
||||||
|
less band, whizz, click! another flap of peritoneum had shot up from
|
||||||
|
the depths, ready to be slipped into yet another bottle, the next of that
|
||||||
|
slow interminable procession on the band.
|
||||||
|
|
||||||
|
Next to the Liners stood the Matriculators. The procession advanced;
|
||||||
|
one by one the eggs were transferred from their test-tubes to the
|
||||||
|
larger containers; deftly the peritoneal lining was slit, the morula
|
||||||
|
dropped into place, the saline solution poured in ... and already the
|
||||||
|
bottle had passed, and it was the turn of the labellers. Heredity, date
|
||||||
|
of fertilization, membership of Bokanovsky Group-details were trans-
|
||||||
|
ferred from test-tube to bottle. No longer anonymous, but named,
|
||||||
|
identified, the procession marched slowly on; on through an opening in
|
||||||
|
the wall, slowly on into the Social Predestination Room.
|
||||||
|
"Eighty-eight cubic metres of card-index," said Mr. Foster with relish,
|
||||||
|
as they entered."""
|
||||||
|
|
||||||
|
|
||||||
|
def create_setup_and_compute(model_names: List[str],
                             gpu: bool = True,
                             tensorflow: bool = False,
                             average_over: int = 3,
                             torchscript: bool = False,
                             xla: bool = False,
                             save_to_csv: bool = False,
                             csv_filename: str = None):
    # Resolve the default filename at call time: main() passes None when
    # --csv_filename is not given, which would otherwise reach open().
    if csv_filename is None:
        csv_filename = f"results_{round(time())}.csv"

    if xla:
        tf.config.optimizer.set_jit(True)

    if tensorflow:
        dictionary = {model_name: {} for model_name in model_names}
        results = _compute_tensorflow(model_names, dictionary, average_over)
    else:
        device = 'cuda' if (gpu and torch.cuda.is_available()) else 'cpu'
        dictionary = {model_name: {} for model_name in model_names}
        results = _compute_pytorch(model_names, dictionary, average_over, device, torchscript)

    print("=========== RESULTS ===========")
    for model_name in model_names:
        print("\t" + f"======= MODEL CHECKPOINT: {model_name} =======")
        for batch_size in results[model_name]["bs"]:
            print("\t\t" + f"===== BATCH SIZE: {batch_size} =====")
            for slice_size in results[model_name]["ss"]:
                result = results[model_name]['results'][batch_size][slice_size]
                if isinstance(result, str):
                    print(f"\t\t{model_name}/{batch_size}/{slice_size}: "
                          f"{result}")
                else:
                    print(f"\t\t{model_name}/{batch_size}/{slice_size}: "
                          f"{(round(1000 * result) / 1000)}"
                          f"s")

    if save_to_csv:
        with open(csv_filename, mode='w') as csv_file:
            fieldnames = ['model',
                          '1x8', '1x64', '1x128', '1x256', '1x512', '1x1024',
                          '2x8', '2x64', '2x128', '2x256', '2x512', '2x1024',
                          '4x8', '4x64', '4x128', '4x256', '4x512', '4x1024',
                          '8x8', '8x64', '8x128', '8x256', '8x512', '8x1024',
                          ]

            writer = csv.DictWriter(csv_file, fieldnames=fieldnames)
            writer.writeheader()

            for model_name in model_names:
                model_results = {
                    f'{bs}x{ss}': results[model_name]['results'][bs][ss]
                    for bs in results[model_name]["results"]
                    for ss in results[model_name]['results'][bs]
                }
                writer.writerow({'model': model_name, **model_results})


def _compute_pytorch(model_names, dictionary, average_over, device, torchscript):
    for c, model_name in enumerate(model_names):
        print(f"{c + 1} / {len(model_names)}")
        config = AutoConfig.from_pretrained(model_name, torchscript=torchscript)
        model = AutoModel.from_pretrained(model_name, config=config)
        tokenizer = AutoTokenizer.from_pretrained(model_name)

        tokenized_sequence = tokenizer.encode(input_text)

        max_input_size = tokenizer.max_model_input_sizes[model_name]
        batch_sizes = [1, 2, 4, 8]
        slice_sizes = [8, 64, 128, 256, 512, 1024]

        dictionary[model_name] = {"bs": batch_sizes, "ss": slice_sizes, "results": {}}
        dictionary[model_name]["results"] = {i: {} for i in batch_sizes}

        for batch_size in batch_sizes:
            model.to(device)
            model.eval()
            for slice_size in slice_sizes:
                if max_input_size is not None and slice_size > max_input_size:
                    dictionary[model_name]["results"][batch_size][slice_size] = "N/A"
                else:
                    sequence = torch.tensor(tokenized_sequence[:slice_size], device=device).repeat(batch_size, 1)
                    try:
                        if torchscript:
                            print("Tracing model with sequence size", sequence.shape)
                            inference = torch.jit.trace(model, sequence)
                            inference(sequence)
                        else:
                            inference = model
                            inference(sequence)

                        print("Going through model with sequence of shape", sequence.shape)
                        runtimes = timeit.repeat(lambda: inference(sequence), repeat=average_over, number=3)
                        average_time = sum(runtimes)/float(len(runtimes)) / 3.0
                        dictionary[model_name]["results"][batch_size][slice_size] = average_time
                    except RuntimeError as e:
                        print("Doesn't fit on GPU.", e)
                        torch.cuda.empty_cache()
                        dictionary[model_name]["results"][batch_size][slice_size] = "N/A"
    return dictionary


def _compute_tensorflow(model_names, dictionary, average_over):
    for c, model_name in enumerate(model_names):
        print(f"{c + 1} / {len(model_names)}")
        config = AutoConfig.from_pretrained(model_name)
        model = TFAutoModel.from_pretrained(model_name, config=config)
        tokenizer = AutoTokenizer.from_pretrained(model_name)

        tokenized_sequence = tokenizer.encode(input_text)

        max_input_size = tokenizer.max_model_input_sizes[model_name]
        batch_sizes = [1, 2, 4, 8]
        slice_sizes = [8, 64, 128, 256, 512, 1024]

        dictionary[model_name] = {"bs": batch_sizes, "ss": slice_sizes, "results": {}}
        dictionary[model_name]["results"] = {i: {} for i in batch_sizes}

        print("Using model", model)

        @tf.function
        def inference(inputs):
            return model(inputs)

        for batch_size in batch_sizes:
            for slice_size in slice_sizes:
                if max_input_size is not None and slice_size > max_input_size:
                    dictionary[model_name]["results"][batch_size][slice_size] = "N/A"
                else:
                    sequence = tf.stack([tf.squeeze(tf.constant(tokenized_sequence[:slice_size])[None, :])] * batch_size)

                    try:
                        print("Going through model with sequence of shape", sequence.shape)
                        # To make sure that the model is traced + that the tensors are on the appropriate device
                        inference(sequence)

                        runtimes = timeit.repeat(lambda: inference(sequence), repeat=average_over, number=3)
                        average_time = sum(runtimes)/float(len(runtimes)) / 3.0
                        dictionary[model_name]["results"][batch_size][slice_size] = average_time
                    except tf.errors.ResourceExhaustedError as e:
                        print("Doesn't fit on GPU.", e)
                        # No torch.cuda.empty_cache() here: torch is only imported
                        # when PyTorch is installed, so calling it would raise a
                        # NameError in a TensorFlow-only environment.
                        dictionary[model_name]["results"][batch_size][slice_size] = "N/A"
    return dictionary


def main():
    parser = argparse.ArgumentParser()

    parser.add_argument("--models", required=False, type=str, default='all', help="Model checkpoints to be provided "
                                                                                  "to the AutoModel classes. Leave "
                                                                                  "blank to benchmark the base version "
                                                                                  "of all available model "
                                                                                  "architectures.")
    parser.add_argument("--torch", required=False, action="store_true", help="Benchmark the Pytorch version of the "
                                                                             "models")
    parser.add_argument("--torch_cuda", required=False, action="store_true", help="Pytorch only: run on available "
                                                                                  "cuda devices")
    parser.add_argument("--torchscript", required=False, action="store_true", help="Pytorch only: trace the models "
                                                                                   "using torchscript")
    parser.add_argument("--tensorflow", required=False, action="store_true", help="Benchmark the TensorFlow version "
                                                                                  "of the models. Will run on GPU if "
                                                                                  "the correct dependencies are "
                                                                                  "installed")
    parser.add_argument("--xla", required=False, action="store_true", help="TensorFlow only: use XLA acceleration.")
    parser.add_argument("--keras_predict", required=False, action="store_true", help="Whether to use model.predict "
                                                                                     "instead of model() to do a "
                                                                                     "forward pass.")
    parser.add_argument("--save_to_csv", required=False, action="store_true", help="Save to a CSV file.")
    parser.add_argument("--csv_filename", required=False, default=None, help="CSV filename used if saving results to csv.")
    parser.add_argument("--average_over", required=False, default=30, type=int, help="Times an experiment will be run.")

    args = parser.parse_args()
    if args.models == 'all':
        args.models = [
            "gpt2",
            "bert-base-cased",
            "xlnet-base-cased",
            "xlm-mlm-en-2048",
            "transfo-xl-wt103",
            "openai-gpt",
            "distilbert-base-uncased",
            "distilgpt2",
            "roberta-base",
            "ctrl"
        ]
    else:
        args.models = args.models.split()

    print("Running with arguments", args)

    if args.torch:
        if is_torch_available():
            create_setup_and_compute(
                model_names=args.models,
                tensorflow=False,
                gpu=args.torch_cuda,
                torchscript=args.torchscript,
                save_to_csv=args.save_to_csv,
                csv_filename=args.csv_filename,
                average_over=args.average_over
            )
        else:
            raise ImportError("Trying to run a PyTorch benchmark but PyTorch was not found in the environment.")

    if args.tensorflow:
        if is_tf_available():
            create_setup_and_compute(
                model_names=args.models,
                tensorflow=True,
                xla=args.xla,
                save_to_csv=args.save_to_csv,
                csv_filename=args.csv_filename,
                average_over=args.average_over
            )
        else:
            raise ImportError("Trying to run a TensorFlow benchmark but TensorFlow was not found in the environment.")


if __name__ == '__main__':
    main()

@@ -31,9 +31,13 @@ import torch
|
|||||||
from torch.utils.data import (DataLoader, RandomSampler, SequentialSampler,
|
from torch.utils.data import (DataLoader, RandomSampler, SequentialSampler,
|
||||||
TensorDataset)
|
TensorDataset)
|
||||||
from torch.utils.data.distributed import DistributedSampler
|
from torch.utils.data.distributed import DistributedSampler
|
||||||
from tqdm import tqdm, trange
|
|
||||||
|
|
||||||
from tensorboardX import SummaryWriter
|
try:
    from torch.utils.tensorboard import SummaryWriter
except ImportError:
    from tensorboardX import SummaryWriter
|
||||||
|
|
||||||
|
from tqdm import tqdm, trange
|
||||||
|
|
||||||
from transformers import (WEIGHTS_NAME, BertConfig,
|
from transformers import (WEIGHTS_NAME, BertConfig,
|
||||||
BertForMultipleChoice, BertTokenizer)
|
BertForMultipleChoice, BertTokenizer)
|
||||||
|
|||||||
@@ -1,31 +1,47 @@
|
|||||||
# Distil*
|
# Distil*
|
||||||
|
|
||||||
This folder contains the original code used to train Distil* as well as examples showcasing how to use DistilBERT and DistilGPT2.
|
This folder contains the original code used to train Distil* as well as examples showcasing how to use DistilBERT, DistilRoBERTa and DistilGPT2.
|
||||||
|
|
||||||
**2019, October 3rd - Update** We release our [NeurIPS workshop paper](https://arxiv.org/abs/1910.01108) explaining our approach on **DistilBERT**. It includes updated results and further experiments. We applied the same method to GPT2 and release the weights of **DistilGPT2**. DistilGPT2 is two times faster and 33% smaller than GPT2.
|
**October 23rd, 2019 - Update** We release **DistilRoBERTa**: 95% of `RoBERTa-base`'s performance on GLUE, twice as fast as RoBERTa while being 35% smaller.
|
||||||
|
|
||||||
|
**October 3rd, 2019 - Update** We release our [NeurIPS workshop paper](https://arxiv.org/abs/1910.01108) explaining our approach on **DistilBERT**. It includes updated results and further experiments. We applied the same method to GPT2 and release the weights of **DistilGPT2**. DistilGPT2 is two times faster and 33% smaller than GPT2. **The paper supersedes our [previous blogpost](https://medium.com/huggingface/distilbert-8cf3380435b5) with a different distillation loss and better performance. Please use the paper as a reference when comparing/reporting results on DistilBERT.**
|
||||||
|
|
||||||
|
**September 19th, 2019 - Update:** We fixed bugs in the code and released an updated version of the weights trained with a modification of the distillation loss. DistilBERT now reaches 97% of `BERT-base`'s performance on GLUE, and 86.9 F1 score on SQuAD v1.1 dev set (compared to 88.5 for `BERT-base`). We will publish a formal write-up of our approach in the near future!
|
||||||
|
|
||||||
**2019, September 19th - Update:** We fixed bugs in the code and released an updated version of the weights trained with a modification of the distillation loss. DistilBERT now reaches 97% of `BERT-base`'s performance on GLUE, and 86.9 F1 score on SQuAD v1.1 dev set (compared to 88.5 for `BERT-base`). We will publish a formal write-up of our approach in the near future!
|
|
||||||
|
|
||||||
## What is Distil*
|
## What is Distil*
|
||||||
|
|
||||||
Distil* is a class of compressed models that started with DistilBERT. DistilBERT stands for Distillated-BERT. DistilBERT is a small, fast, cheap and light Transformer model based on the BERT architecture. It has 40% fewer parameters than `bert-base-uncased` and runs 60% faster while preserving 97% of BERT's performance as measured on the GLUE language understanding benchmark. DistilBERT is trained using knowledge distillation, a technique to compress a large model called the teacher into a smaller model called the student. By distilling BERT, we obtain a smaller Transformer model that bears a lot of similarities with the original BERT model while being lighter, smaller and faster to run. DistilBERT is thus an interesting option for putting a large-scale trained Transformer model into production.
|
Distil* is a class of compressed models that started with DistilBERT. DistilBERT stands for Distillated-BERT. DistilBERT is a small, fast, cheap and light Transformer model based on the BERT architecture. It has 40% fewer parameters than `bert-base-uncased` and runs 60% faster while preserving 97% of BERT's performance as measured on the GLUE language understanding benchmark. DistilBERT is trained using knowledge distillation, a technique to compress a large model called the teacher into a smaller model called the student. By distilling BERT, we obtain a smaller Transformer model that bears a lot of similarities with the original BERT model while being lighter, smaller and faster to run. DistilBERT is thus an interesting option for putting a large-scale trained Transformer model into production.
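
To make the teacher/student idea more concrete, here is a minimal sketch of a generic soft-target distillation loss in plain Python. It is an illustration under assumptions, not the training code in this folder: the function names, the temperature value and the `T^2` scaling are common conventions chosen for the example, and the actual DistilBERT objective described in the paper may combine this kind of term with other losses.

```python
import math


def softmax(logits, temperature=1.0):
    """Softmax over a list of logits, softened by a temperature."""
    scaled = [x / temperature for x in logits]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def distillation_loss(student_logits, teacher_logits, temperature=2.0):
    """KL(teacher || student) between temperature-softened distributions."""
    p = softmax(teacher_logits, temperature)
    q = softmax(student_logits, temperature)
    kl = sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)
    # Scaling by T^2 keeps gradient magnitudes comparable across temperatures
    # (a common convention, assumed here).
    return kl * temperature ** 2
```

The loss is zero when the student reproduces the teacher's logits and grows as the two softened distributions diverge, which is what pushes the student towards the teacher's behaviour during training.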
|
||||||
|
|
||||||
We have applied the same method to GPT2 and released the weights of the compressed model. On the [WikiText-103](https://blog.einstein.ai/the-wikitext-long-term-dependency-language-modeling-dataset/) benchmark, GPT2 reaches a perplexity on the test set of 15.8 compared to 19.3 for DistilGPT2 (after fine-tuning on the train set).
|
We have applied the same method to other Transformer architectures and released the weights:
|
||||||
|
- GPT2: on the [WikiText-103](https://blog.einstein.ai/the-wikitext-long-term-dependency-language-modeling-dataset/) benchmark, GPT2 reaches a perplexity on the test set of 15.0 compared to 18.5 for **DistilGPT2** (after fine-tuning on the train set).
|
||||||
|
- RoBERTa: **DistilRoBERTa** reaches 95% of `RoBERTa-base` performance on GLUE while being twice as fast and 35% smaller.
|
||||||
|
- and more to come! 🤗🤗🤗
|
||||||
|
|
||||||
For more information on DistilBERT, please refer to our [NeurIPS workshop paper](https://arxiv.org/abs/1910.01108). The paper supersedes our [previous blogpost](https://medium.com/huggingface/distilbert-8cf3380435b5) with a different distillation loss and better performance.
|
For more information on DistilBERT, please refer to our [NeurIPS workshop paper](https://arxiv.org/abs/1910.01108).
|
||||||
|
|
||||||
Here are the results on the dev sets of GLUE:
|
Here are the results on the dev sets of GLUE:
|
||||||
|
|
||||||
| Model | Macro-score | CoLA | MNLI | MRPC | QNLI | QQP | RTE | SST-2| STS-B| WNLI |
|
| Model | Macro-score | CoLA | MNLI | MRPC | QNLI | QQP | RTE | SST-2| STS-B| WNLI |
|
||||||
| :---: | :---: | :---:| :---:| :---:| :---:| :---:| :---:| :---:| :---:| :---:|
|
| :---: | :---: | :---:| :---:| :---:| :---:| :---:| :---:| :---:| :---:| :---: |
|
||||||
| BERT-base | **77.6** | 48.9 | 84.3 | 88.6 | 89.3 | 89.5 | 71.3 | 91.7 | 91.2 | 43.7 |
|
| BERT-base | **77.6** | 48.9 | 84.3 | 88.6 | 89.3 | 89.5 | 71.3 | 91.7 | 91.2 | 43.7 |
|
||||||
| DistilBERT | **76.8** | 49.1 | 81.8 | 90.2 | 90.2 | 89.2 | 62.9 | 92.7 | 90.7 | 44.4 |
|
| DistilBERT | **76.8** | 49.1 | 81.8 | 90.2 | 90.2 | 89.2 | 62.9 | 92.7 | 90.7 | 44.4 |
|
||||||
|
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
|
||||||
|
| RoBERTa-base (reported) | **83.2**/**86.4**<sup>2</sup> | 63.6 | 87.6 | 90.2 | 92.8 | 91.9 | 78.7 | 94.8 | 91.2 | 57.7<sup>3</sup> |
|
||||||
|
| DistilRoBERTa<sup>1</sup> | **79.0**/**82.3**<sup>2</sup> | 59.4 | 83.9 | 86.6 | 90.8 | 89.4 | 67.9 | 92.5 | 88.3 | 52.1 |
|
||||||
|
|
||||||
|
<sup>1</sup> We did not use the MNLI checkpoint for fine-tuning but directly performed transfer learning on the pre-trained DistilRoBERTa.
|
||||||
|
|
||||||
|
<sup>2</sup> Macro-score computed without WNLI.
|
||||||
|
|
||||||
|
<sup>3</sup> We compute this score ourselves for completeness.
|
||||||
|
|
||||||
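As a rough illustration of the teacher/student idea described above, here is a minimal pure-Python sketch of a generic soft-target distillation loss: the KL divergence between temperature-softened teacher and student distributions. This is an assumption-laden sketch, not the loss used to train the Distil* checkpoints (the README only mentions "a modification of the distillation loss"); the function names, the default temperature and the `T^2` scaling are illustrative choices.

```python
import math


def softmax(logits, temperature=1.0):
    """Temperature-scaled softmax over a list of logits (numerically stabilised)."""
    scaled = [x / temperature for x in logits]
    m = max(scaled)
    exps = [math.exp(x - m) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def distillation_loss(student_logits, teacher_logits, temperature=2.0):
    """KL(teacher || student) on softened distributions, scaled by temperature**2.

    Hypothetical helper for illustration only; not the repository's training loss.
    """
    p_teacher = softmax(teacher_logits, temperature)
    p_student = softmax(student_logits, temperature)
    kl = sum(
        p * (math.log(p) - math.log(q))
        for p, q in zip(p_teacher, p_student)
        if p > 0
    )
    return kl * temperature ** 2


# A student that matches the teacher has (near) zero loss; a mismatched one does not.
print(distillation_loss([1.0, 2.0, 3.0], [1.0, 2.0, 3.0]))
print(distillation_loss([3.0, 2.0, 1.0], [1.0, 2.0, 3.0]))
```

A higher temperature flattens both distributions, so the student also learns from the relative probabilities the teacher assigns to the non-argmax classes.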
## Setup
|
## Setup
|
||||||
|
|
||||||
This part of the library has only been tested with Python 3.6+. There are a few specific dependencies to install before launching a distillation; you can install them with the command `pip install -r requirements.txt`.
|
This part of the library has only been tested with Python 3.6+. There are a few specific dependencies to install before launching a distillation; you can install them with the command `pip install -r requirements.txt`.
|
||||||
|
|
||||||
**Important note:** The training scripts have been updated to support PyTorch v1.2.0 (there are breaking changes compared to v1.1.0). It is important to note that there is a small internal bug in the current version of PyTorch available on pip that causes a memory leak in our training/distillation. It has been recently fixed and will likely be integrated into the next release. For the moment, we recommend [compiling PyTorch from source](https://github.com/pytorch/pytorch#from-source). Please refer to [issue 1179](https://github.com/huggingface/transformers/issues/1179) for more details.
|
**Important note:** The training scripts have been updated to support PyTorch v1.2.0 (there are breaking changes compared to v1.1.0).
|
||||||
|
|
||||||
|
|
||||||
## How to use DistilBERT
|
## How to use DistilBERT
|
||||||
|
|
||||||
@@ -33,7 +49,8 @@ Transformers includes two pre-trained Distil* models, currently only provided fo
|
|||||||
|
|
||||||
- `distilbert-base-uncased`: DistilBERT English language model pretrained on the same data used to pretrain BERT (concatenation of the Toronto Book Corpus and full English Wikipedia) using distillation with the supervision of the `bert-base-uncased` version of BERT. The model has 6 layers, 768 dimensions and 12 heads, totaling 66M parameters.
|
- `distilbert-base-uncased`: DistilBERT English language model pretrained on the same data used to pretrain BERT (concatenation of the Toronto Book Corpus and full English Wikipedia) using distillation with the supervision of the `bert-base-uncased` version of BERT. The model has 6 layers, 768 dimensions and 12 heads, totaling 66M parameters.
|
||||||
- `distilbert-base-uncased-distilled-squad`: A finetuned version of `distilbert-base-uncased`, obtained using (a second step of) knowledge distillation on SQuAD 1.0. This model reaches an F1 score of 86.9 on the dev set (for comparison, the `bert-base-uncased` version of BERT reaches an F1 score of 88.5).
|
- `distilbert-base-uncased-distilled-squad`: A finetuned version of `distilbert-base-uncased`, obtained using (a second step of) knowledge distillation on SQuAD 1.0. This model reaches an F1 score of 86.9 on the dev set (for comparison, the `bert-base-uncased` version of BERT reaches an F1 score of 88.5).
|
||||||
- `distilgpt2`: DistilGPT2 English language model pretrained with the supervision of `gpt2` (the smallest version of GPT2) on [OpenWebTextCorpus](https://skylion007.github.io/OpenWebTextCorpus/), a reproduction of OpenAI's WebText dataset and . The model has 6 layers, 768 dimension and 12 heads, totalizing 82M (compared to 124M parameters for GPT2). On average, DistilGPT2 is two times faster than GPT2.
|
- `distilgpt2`: DistilGPT2 English language model pretrained with the supervision of `gpt2` (the smallest version of GPT2) on [OpenWebTextCorpus](https://skylion007.github.io/OpenWebTextCorpus/), a reproduction of OpenAI's WebText dataset. The model has 6 layers, 768 dimensions and 12 heads, totaling 82M parameters (compared to 124M parameters for GPT2). On average, DistilGPT2 is two times faster than GPT2.
|
||||||
|
- `distilroberta-base`: DistilRoBERTa English language model pretrained with the supervision of `roberta-base` solely on [OpenWebTextCorpus](https://skylion007.github.io/OpenWebTextCorpus/), a reproduction of OpenAI's WebText dataset (about 4 times less training data than was used for the teacher RoBERTa). The model has 6 layers, 768 dimensions and 12 heads, totaling 82M parameters (compared to 125M parameters for RoBERTa-base). On average, DistilRoBERTa is twice as fast as RoBERTa-base.
|
||||||
- and more to come! 🤗🤗🤗
|
- and more to come! 🤗🤗🤗
|
||||||
|
|
||||||
Using DistilBERT is very similar to using BERT. DistilBERT shares the same tokenizer as BERT's `bert-base-uncased`, even though we provide a link to this tokenizer under the `DistilBertTokenizer` name to keep naming consistent across the library's models.
|
Using DistilBERT is very similar to using BERT. DistilBERT shares the same tokenizer as BERT's `bert-base-uncased`, even though we provide a link to this tokenizer under the `DistilBertTokenizer` name to keep naming consistent across the library's models.
|
||||||
@@ -47,7 +64,10 @@ outputs = model(input_ids)
|
|||||||
last_hidden_states = outputs[0] # The last hidden-state is the first element of the output tuple
|
last_hidden_states = outputs[0] # The last hidden-state is the first element of the output tuple
|
||||||
```
|
```
|
||||||
|
|
||||||
Similarly, using DistilGPT2 simply consists in calling the GPT2 classes from a different pretrained checkpoint: `model = GPT2Model.from_pretrained('distilgpt2')`.
|
Similarly, using the other Distil* models simply consists in calling the base classes with a different pretrained checkpoint:
|
||||||
|
- DistilGPT2: `model = GPT2Model.from_pretrained('distilgpt2')`
|
||||||
|
- DistilRoBERTa: `model = RobertaModel.from_pretrained('distilroberta-base')`
|
||||||
|
|
||||||
|
|
||||||
## How to train Distil*
|
## How to train Distil*
|
||||||
|
|
||||||
@@ -134,3 +154,16 @@ python -m torch.distributed.launch \
|
|||||||
**Tips:** Starting distillation training with a good initialization of the model weights is crucial to reach decent performance. In our experiments, we initialized our model from a few layers of the teacher (BERT) itself! Please refer to `scripts/extract.py` and `scripts/extract_distilbert.py` to create a valid initialization checkpoint and use the `--student_pretrained_weights` argument to use this initialization for the distilled training!
|
**Tips:** Starting distillation training with a good initialization of the model weights is crucial to reach decent performance. In our experiments, we initialized our model from a few layers of the teacher (BERT) itself! Please refer to `scripts/extract.py` and `scripts/extract_distilbert.py` to create a valid initialization checkpoint and use the `--student_pretrained_weights` argument to use this initialization for the distilled training!
|
||||||
|
|
||||||
Happy distillation!
|
Happy distillation!
|
||||||
|
|
||||||
|
## Citation
|
||||||
|
|
||||||
|
If you find the resource useful, you should cite the following paper:
|
||||||
|
|
||||||
|
```
|
||||||
|
@inproceedings{sanh2019distilbert,
|
||||||
|
title={DistilBERT, a distilled version of BERT: smaller, faster, cheaper and lighter},
|
||||||
|
author={Sanh, Victor and Debut, Lysandre and Chaumond, Julien and Wolf, Thomas},
|
||||||
|
booktitle={NeurIPS EMC^2 Workshop},
|
||||||
|
year={2019}
|
||||||
|
}
|
||||||
|
```
|
||||||
@@ -19,7 +19,6 @@ import os
|
|||||||
import math
|
import math
|
||||||
import psutil
|
import psutil
|
||||||
import time
|
import time
|
||||||
from tensorboardX import SummaryWriter
|
|
||||||
from tqdm import trange, tqdm
|
from tqdm import trange, tqdm
|
||||||
import numpy as np
|
import numpy as np
|
||||||
import psutil
|
import psutil
|
||||||
@@ -31,6 +30,11 @@ from torch.optim import AdamW
|
|||||||
from torch.utils.data.distributed import DistributedSampler
|
from torch.utils.data.distributed import DistributedSampler
|
||||||
from torch.utils.data import RandomSampler, BatchSampler, DataLoader
|
from torch.utils.data import RandomSampler, BatchSampler, DataLoader
|
||||||
|
|
||||||
|
try:
|
||||||
|
from torch.utils.tensorboard import SummaryWriter
|
||||||
|
except ImportError:
|
||||||
|
from tensorboardX import SummaryWriter
|
||||||
|
|
||||||
from transformers import WarmupLinearSchedule
|
from transformers import WarmupLinearSchedule
|
||||||
|
|
||||||
from utils import logger
|
from utils import logger
|
||||||
|
|||||||
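The `try`/`except` import above prefers `torch.utils.tensorboard` and falls back to `tensorboardX`. The same pattern can be sketched generically with `importlib`; the helper name `import_first_available` is hypothetical and not part of this commit. Catching `ImportError` specifically (rather than using a bare `except`) avoids hiding unrelated failures such as a `KeyboardInterrupt`.

```python
import importlib


def import_first_available(module_names):
    """Return the first module in `module_names` that can be imported.

    Hypothetical helper for illustration; raises ImportError if none is available.
    """
    for name in module_names:
        try:
            return importlib.import_module(name)
        except ImportError:
            # Module (or one of its own dependencies) is missing: try the next one.
            continue
    raise ImportError(
        "none of these modules could be imported: " + ", ".join(module_names)
    )


# Stand-in demo with a module that surely exists; for the scripts in this diff the
# list would be ["torch.utils.tensorboard", "tensorboardX"], followed by
# `SummaryWriter = backend.SummaryWriter`.
backend = import_first_available(["no_such_module_xyz_123", "json"])
print(backend.__name__)
```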
@@ -30,9 +30,13 @@ from torch.utils.data import (DataLoader, RandomSampler, SequentialSampler,
|
|||||||
from torch.utils.data.distributed import DistributedSampler
|
from torch.utils.data.distributed import DistributedSampler
|
||||||
import torch.nn.functional as F
|
import torch.nn.functional as F
|
||||||
import torch.nn as nn
|
import torch.nn as nn
|
||||||
from tqdm import tqdm, trange
|
|
||||||
|
|
||||||
from tensorboardX import SummaryWriter
|
try:
|
||||||
|
from torch.utils.tensorboard import SummaryWriter
|
||||||
|
except ImportError:
|
||||||
|
from tensorboardX import SummaryWriter
|
||||||
|
|
||||||
|
from tqdm import tqdm, trange
|
||||||
|
|
||||||
from transformers import (WEIGHTS_NAME, BertConfig,
|
from transformers import (WEIGHTS_NAME, BertConfig,
|
||||||
BertForQuestionAnswering, BertTokenizer,
|
BertForQuestionAnswering, BertTokenizer,
|
||||||
|
|||||||
@@ -1,2 +1,4 @@
|
|||||||
tensorboardX
|
tensorboardX
|
||||||
scikit-learn
|
tensorboard
|
||||||
|
scikit-learn
|
||||||
|
seqeval
|
||||||
|
|||||||
@@ -14,7 +14,7 @@
|
|||||||
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
|
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
|
||||||
# See the License for the specific language governing permissions and
|
# See the License for the specific language governing permissions and
|
||||||
# limitations under the License.
|
# limitations under the License.
|
||||||
""" Conditional text generation with the auto-regressive models of the library (GPT/GPT-2/Transformer-XL/XLNet)
|
""" Conditional text generation with the auto-regressive models of the library (GPT/GPT-2/CTRL/Transformer-XL/XLNet)
|
||||||
"""
|
"""
|
||||||
from __future__ import absolute_import, division, print_function, unicode_literals
|
from __future__ import absolute_import, division, print_function, unicode_literals
|
||||||
|
|
||||||
@@ -26,12 +26,13 @@ import torch
|
|||||||
import torch.nn.functional as F
|
import torch.nn.functional as F
|
||||||
import numpy as np
|
import numpy as np
|
||||||
|
|
||||||
from transformers import GPT2Config, OpenAIGPTConfig, XLNetConfig, TransfoXLConfig, XLMConfig
|
from transformers import GPT2Config, OpenAIGPTConfig, XLNetConfig, TransfoXLConfig, XLMConfig, CTRLConfig
|
||||||
|
|
||||||
from transformers import GPT2LMHeadModel, GPT2Tokenizer
|
from transformers import GPT2LMHeadModel, GPT2Tokenizer
|
||||||
from transformers import OpenAIGPTLMHeadModel, OpenAIGPTTokenizer
|
from transformers import OpenAIGPTLMHeadModel, OpenAIGPTTokenizer
|
||||||
from transformers import XLNetLMHeadModel, XLNetTokenizer
|
from transformers import XLNetLMHeadModel, XLNetTokenizer
|
||||||
from transformers import TransfoXLLMHeadModel, TransfoXLTokenizer
|
from transformers import TransfoXLLMHeadModel, TransfoXLTokenizer
|
||||||
|
from transformers import CTRLLMHeadModel, CTRLTokenizer
|
||||||
from transformers import XLMWithLMHeadModel, XLMTokenizer
|
from transformers import XLMWithLMHeadModel, XLMTokenizer
|
||||||
|
|
||||||
|
|
||||||
@@ -42,10 +43,11 @@ logger = logging.getLogger(__name__)
|
|||||||
|
|
||||||
MAX_LENGTH = int(10000) # Hardcoded max length to avoid infinite loop
|
MAX_LENGTH = int(10000) # Hardcoded max length to avoid infinite loop
|
||||||
|
|
||||||
ALL_MODELS = sum((tuple(conf.pretrained_config_archive_map.keys()) for conf in (GPT2Config, OpenAIGPTConfig, XLNetConfig, TransfoXLConfig, XLMConfig)), ())
|
ALL_MODELS = sum((tuple(conf.pretrained_config_archive_map.keys()) for conf in (GPT2Config, OpenAIGPTConfig, XLNetConfig, TransfoXLConfig, XLMConfig, CTRLConfig)), ())
|
||||||
|
|
||||||
MODEL_CLASSES = {
|
MODEL_CLASSES = {
|
||||||
'gpt2': (GPT2LMHeadModel, GPT2Tokenizer),
|
'gpt2': (GPT2LMHeadModel, GPT2Tokenizer),
|
||||||
|
'ctrl': (CTRLLMHeadModel, CTRLTokenizer),
|
||||||
'openai-gpt': (OpenAIGPTLMHeadModel, OpenAIGPTTokenizer),
|
'openai-gpt': (OpenAIGPTLMHeadModel, OpenAIGPTTokenizer),
|
||||||
'xlnet': (XLNetLMHeadModel, XLNetTokenizer),
|
'xlnet': (XLNetLMHeadModel, XLNetTokenizer),
|
||||||
'transfo-xl': (TransfoXLLMHeadModel, TransfoXLTokenizer),
|
'transfo-xl': (TransfoXLLMHeadModel, TransfoXLTokenizer),
|
||||||
@@ -105,8 +107,8 @@ def top_k_top_p_filtering(logits, top_k=0, top_p=0.0, filter_value=-float('Inf')
|
|||||||
return logits
|
return logits
|
||||||
|
|
||||||
|
|
||||||
def sample_sequence(model, length, context, num_samples=1, temperature=1, top_k=0, top_p=0.0, is_xlnet=False,
|
def sample_sequence(model, length, context, num_samples=1, temperature=1, top_k=0, top_p=0.0, repetition_penalty=1.0,
|
||||||
xlm_lang=None, device='cpu'):
|
is_xlnet=False, is_xlm_mlm=False, xlm_mask_token=None, xlm_lang=None, device='cpu'):
|
||||||
context = torch.tensor(context, dtype=torch.long, device=device)
|
context = torch.tensor(context, dtype=torch.long, device=device)
|
||||||
context = context.unsqueeze(0).repeat(num_samples, 1)
|
context = context.unsqueeze(0).repeat(num_samples, 1)
|
||||||
generated = context
|
generated = context
|
||||||
@@ -124,13 +126,27 @@ def sample_sequence(model, length, context, num_samples=1, temperature=1, top_k=
|
|||||||
target_mapping[0, 0, -1] = 1.0 # predict last token
|
target_mapping[0, 0, -1] = 1.0 # predict last token
|
||||||
inputs = {'input_ids': input_ids, 'perm_mask': perm_mask, 'target_mapping': target_mapping}
|
inputs = {'input_ids': input_ids, 'perm_mask': perm_mask, 'target_mapping': target_mapping}
|
||||||
|
|
||||||
|
if is_xlm_mlm and xlm_mask_token:
|
||||||
|
# XLM MLM models are direct models (predict same token, not next token)
|
||||||
|
# => need one additional dummy token in the input (will be masked and guessed)
|
||||||
|
input_ids = torch.cat((generated, torch.full((1, 1), xlm_mask_token, dtype=torch.long, device=device)), dim=1)
|
||||||
|
inputs = {'input_ids': input_ids}
|
||||||
|
|
||||||
if xlm_lang is not None:
|
if xlm_lang is not None:
|
||||||
inputs["langs"] = torch.tensor([xlm_lang] * inputs["input_ids"].shape[1], device=device).view(1, -1)
|
inputs["langs"] = torch.tensor([xlm_lang] * inputs["input_ids"].shape[1], device=device).view(1, -1)
|
||||||
|
|
||||||
outputs = model(**inputs) # Note: we could also use 'past' with GPT-2/Transfo-XL/XLNet (cached hidden-states)
|
outputs = model(**inputs) # Note: we could also use 'past' with GPT-2/Transfo-XL/XLNet/CTRL (cached hidden-states)
|
||||||
next_token_logits = outputs[0][0, -1, :] / temperature
|
next_token_logits = outputs[0][0, -1, :] / (temperature if temperature > 0 else 1.)
|
||||||
|
|
||||||
|
# repetition penalty from CTRL (https://arxiv.org/abs/1909.05858)
|
||||||
|
for _ in set(generated.view(-1).tolist()):
|
||||||
|
next_token_logits[_] /= repetition_penalty
|
||||||
|
|
||||||
filtered_logits = top_k_top_p_filtering(next_token_logits, top_k=top_k, top_p=top_p)
|
filtered_logits = top_k_top_p_filtering(next_token_logits, top_k=top_k, top_p=top_p)
|
||||||
next_token = torch.multinomial(F.softmax(filtered_logits, dim=-1), num_samples=1)
|
if temperature == 0:  # greedy decoding
|
||||||
|
next_token = torch.argmax(filtered_logits).unsqueeze(0)
|
||||||
|
else:
|
||||||
|
next_token = torch.multinomial(F.softmax(filtered_logits, dim=-1), num_samples=1)
|
||||||
generated = torch.cat((generated, next_token.unsqueeze(0)), dim=1)
|
generated = torch.cat((generated, next_token.unsqueeze(0)), dim=1)
|
||||||
return generated
|
return generated
|
||||||
|
|
||||||
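The sampling step changed in this hunk (temperature scaling, the CTRL-style repetition penalty, and greedy decoding when the temperature is 0) can be sketched without PyTorch as follows. This is a simplified sketch on a plain list of logits: `next_token` is a hypothetical name, and the top-k/top-p filtering applied by the script is left out.

```python
import math
import random


def next_token(logits, generated, temperature=1.0, repetition_penalty=1.0, rng=random):
    """Pick the next token id from `logits` (a list of floats).

    Hypothetical illustration of the step in `sample_sequence`; no top-k/top-p filtering.
    """
    # A temperature of 0 means greedy decoding, so the logits are left unscaled.
    scale = temperature if temperature > 0 else 1.0
    logits = [x / scale for x in logits]

    # Repetition penalty: divide the logit of every token already generated.
    for token_id in set(generated):
        logits[token_id] /= repetition_penalty

    if temperature == 0:
        # Greedy decoding: return the index of the largest logit.
        return max(range(len(logits)), key=logits.__getitem__)

    # Otherwise sample from the softmax distribution (weights need not be normalised).
    m = max(logits)
    weights = [math.exp(x - m) for x in logits]
    return rng.choices(range(len(logits)), weights=weights, k=1)[0]


print(next_token([1.0, 3.0, 2.0], generated=[], temperature=0))
print(next_token([1.0, 3.0, 2.0], generated=[1], temperature=0, repetition_penalty=2.0))
```

One caveat with plain division: a negative logit divided by a penalty greater than 1 moves toward zero, which makes that token more likely rather than less. A possible refinement, not part of this diff, is to multiply negative logits by the penalty instead.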
@@ -145,7 +161,10 @@ def main():
|
|||||||
parser.add_argument("--padding_text", type=str, default="")
|
parser.add_argument("--padding_text", type=str, default="")
|
||||||
parser.add_argument("--xlm_lang", type=str, default="", help="Optional language when used with the XLM model.")
|
parser.add_argument("--xlm_lang", type=str, default="", help="Optional language when used with the XLM model.")
|
||||||
parser.add_argument("--length", type=int, default=20)
|
parser.add_argument("--length", type=int, default=20)
|
||||||
parser.add_argument("--temperature", type=float, default=1.0)
|
parser.add_argument("--temperature", type=float, default=1.0,
|
||||||
|
help="temperature of 0 implies greedy sampling")
|
||||||
|
parser.add_argument("--repetition_penalty", type=float, default=1.0,
|
||||||
|
help="primarily useful for CTRL model; in that case, use 1.2")
|
||||||
parser.add_argument("--top_k", type=int, default=0)
|
parser.add_argument("--top_k", type=int, default=0)
|
||||||
parser.add_argument("--top_p", type=float, default=0.9)
|
parser.add_argument("--top_p", type=float, default=0.9)
|
||||||
parser.add_argument("--no_cuda", action='store_true',
|
parser.add_argument("--no_cuda", action='store_true',
|
||||||
@@ -175,7 +194,11 @@ def main():
|
|||||||
elif args.length < 0:
|
elif args.length < 0:
|
||||||
args.length = MAX_LENGTH # avoid infinite loop
|
args.length = MAX_LENGTH # avoid infinite loop
|
||||||
|
|
||||||
print(args)
|
logger.info(args)
|
||||||
|
if args.model_type in ["ctrl"]:
|
||||||
|
if args.temperature > 0.7:
|
||||||
|
logger.info('CTRL typically works better with lower temperatures (and lower top_k).')
|
||||||
|
|
||||||
while True:
|
while True:
|
||||||
xlm_lang = None
|
xlm_lang = None
|
||||||
# XLM Language usage detailed in the issues #1414
|
# XLM Language usage detailed in the issues #1414
|
||||||
@@ -189,11 +212,21 @@ def main():
|
|||||||
language = input("Using XLM. Select language in " + str(list(tokenizer.lang2id.keys())) + " >>> ")
|
language = input("Using XLM. Select language in " + str(list(tokenizer.lang2id.keys())) + " >>> ")
|
||||||
xlm_lang = tokenizer.lang2id[language]
|
xlm_lang = tokenizer.lang2id[language]
|
||||||
|
|
||||||
|
# XLM masked-language modeling (MLM) models need masked token (see details in sample_sequence)
|
||||||
|
is_xlm_mlm = args.model_type in ["xlm"] and 'mlm' in args.model_name_or_path
|
||||||
|
if is_xlm_mlm:
|
||||||
|
xlm_mask_token = tokenizer.mask_token_id
|
||||||
|
else:
|
||||||
|
xlm_mask_token = None
|
||||||
|
|
||||||
raw_text = args.prompt if args.prompt else input("Model prompt >>> ")
|
raw_text = args.prompt if args.prompt else input("Model prompt >>> ")
|
||||||
if args.model_type in ["transfo-xl", "xlnet"]:
|
if args.model_type in ["transfo-xl", "xlnet"]:
|
||||||
# Models with memory like to have a long prompt for short inputs.
|
# Models with memory like to have a long prompt for short inputs.
|
||||||
raw_text = (args.padding_text if args.padding_text else PADDING_TEXT) + raw_text
|
raw_text = (args.padding_text if args.padding_text else PADDING_TEXT) + raw_text
|
||||||
context_tokens = tokenizer.encode(raw_text)
|
context_tokens = tokenizer.encode(raw_text)
|
||||||
|
if args.model_type == "ctrl":
|
||||||
|
if not any(context_tokens[0] == x for x in tokenizer.control_codes.values()):
|
||||||
|
logger.warning("You are not starting your generation from a control code so you won't get good results")
|
||||||
out = sample_sequence(
|
out = sample_sequence(
|
||||||
model=model,
|
model=model,
|
||||||
context=context_tokens,
|
context=context_tokens,
|
||||||
@@ -201,7 +234,10 @@ def main():
|
|||||||
temperature=args.temperature,
|
temperature=args.temperature,
|
||||||
top_k=args.top_k,
|
top_k=args.top_k,
|
||||||
top_p=args.top_p,
|
top_p=args.top_p,
|
||||||
|
repetition_penalty=args.repetition_penalty,
|
||||||
is_xlnet=bool(args.model_type == "xlnet"),
|
is_xlnet=bool(args.model_type == "xlnet"),
|
||||||
|
is_xlm_mlm=is_xlm_mlm,
|
||||||
|
xlm_mask_token=xlm_mask_token,
|
||||||
xlm_lang=xlm_lang,
|
xlm_lang=xlm_lang,
|
||||||
device=args.device,
|
device=args.device,
|
||||||
)
|
)
|
||||||
|
|||||||
@@ -28,7 +28,12 @@ import torch
|
|||||||
from torch.utils.data import (DataLoader, RandomSampler, SequentialSampler,
|
from torch.utils.data import (DataLoader, RandomSampler, SequentialSampler,
|
||||||
TensorDataset)
|
TensorDataset)
|
||||||
from torch.utils.data.distributed import DistributedSampler
|
from torch.utils.data.distributed import DistributedSampler
|
||||||
from tensorboardX import SummaryWriter
|
|
||||||
|
try:
|
||||||
|
from torch.utils.tensorboard import SummaryWriter
|
||||||
|
except ImportError:
|
||||||
|
from tensorboardX import SummaryWriter
|
||||||
|
|
||||||
from tqdm import tqdm, trange
|
from tqdm import tqdm, trange
|
||||||
|
|
||||||
from transformers import (WEIGHTS_NAME, BertConfig,
|
from transformers import (WEIGHTS_NAME, BertConfig,
|
||||||
@@ -149,13 +154,16 @@ def train(args, train_dataset, model, tokenizer):
|
|||||||
if args.fp16:
|
if args.fp16:
|
||||||
with amp.scale_loss(loss, optimizer) as scaled_loss:
|
with amp.scale_loss(loss, optimizer) as scaled_loss:
|
||||||
scaled_loss.backward()
|
scaled_loss.backward()
|
||||||
torch.nn.utils.clip_grad_norm_(amp.master_params(optimizer), args.max_grad_norm)
|
|
||||||
else:
|
else:
|
||||||
loss.backward()
|
loss.backward()
|
||||||
torch.nn.utils.clip_grad_norm_(model.parameters(), args.max_grad_norm)
|
|
||||||
|
|
||||||
tr_loss += loss.item()
|
tr_loss += loss.item()
|
||||||
if (step + 1) % args.gradient_accumulation_steps == 0:
|
if (step + 1) % args.gradient_accumulation_steps == 0 and not args.tpu:
|
||||||
|
if args.fp16:
|
||||||
|
torch.nn.utils.clip_grad_norm_(amp.master_params(optimizer), args.max_grad_norm)
|
||||||
|
else:
|
||||||
|
torch.nn.utils.clip_grad_norm_(model.parameters(), args.max_grad_norm)
|
||||||
|
|
||||||
optimizer.step()
|
optimizer.step()
|
||||||
scheduler.step() # Update learning rate schedule
|
scheduler.step() # Update learning rate schedule
|
||||||
model.zero_grad()
|
model.zero_grad()
|
||||||
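This hunk moves gradient clipping out of the per-batch backward pass and into the accumulation boundary, so the norm is clipped once on the accumulated gradient right before `optimizer.step()`. A minimal sketch of that ordering on plain Python lists is shown below. Both function names are hypothetical, the update rule is plain SGD for simplicity, and the clipping omits the small epsilon that `torch.nn.utils.clip_grad_norm_` adds to the norm.

```python
import math


def clip_grad_norm(grads, max_norm):
    """Rescale `grads` so that their L2 norm is at most `max_norm`."""
    total_norm = math.sqrt(sum(g * g for g in grads))
    if total_norm > max_norm:
        scale = max_norm / total_norm
        grads = [g * scale for g in grads]
    return grads, total_norm


def accumulate_and_step(micro_batch_grads, accumulation_steps, max_norm, lr, weights):
    """Accumulate gradients, then clip and update once every `accumulation_steps`."""
    accumulated = [0.0] * len(weights)
    updates = 0
    for step, grads in enumerate(micro_batch_grads):
        accumulated = [a + g for a, g in zip(accumulated, grads)]
        if (step + 1) % accumulation_steps == 0:
            # Clip the accumulated gradient once, right before the update.
            clipped, _ = clip_grad_norm(accumulated, max_norm)
            weights = [w - lr * g for w, g in zip(weights, clipped)]
            accumulated = [0.0] * len(weights)  # equivalent of model.zero_grad()
            updates += 1
    return weights, updates


weights, updates = accumulate_and_step(
    [[3.0, 4.0], [3.0, 4.0]], accumulation_steps=2, max_norm=1.0, lr=1.0, weights=[0.0, 0.0]
)
print(weights, updates)
```

Clipping each micro-batch separately, as the old code did, bounds partial gradients that are still being summed; clipping at the boundary bounds the gradient that is actually applied.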
@@ -181,6 +189,11 @@ def train(args, train_dataset, model, tokenizer):
|
|||||||
torch.save(args, os.path.join(output_dir, 'training_args.bin'))
|
torch.save(args, os.path.join(output_dir, 'training_args.bin'))
|
||||||
logger.info("Saving model checkpoint to %s", output_dir)
|
logger.info("Saving model checkpoint to %s", output_dir)
|
||||||
|
|
||||||
|
if args.tpu:
|
||||||
|
args.xla_model.optimizer_step(optimizer, barrier=True)
|
||||||
|
model.zero_grad()
|
||||||
|
global_step += 1
|
||||||
|
|
||||||
if args.max_steps > 0 and global_step > args.max_steps:
|
if args.max_steps > 0 and global_step > args.max_steps:
|
||||||
epoch_iterator.close()
|
epoch_iterator.close()
|
||||||
break
|
break
|
||||||
@@ -249,7 +262,7 @@ def evaluate(args, model, tokenizer, prefix=""):
|
|||||||
result = compute_metrics(eval_task, preds, out_label_ids)
|
result = compute_metrics(eval_task, preds, out_label_ids)
|
||||||
results.update(result)
|
results.update(result)
|
||||||
|
|
||||||
output_eval_file = os.path.join(eval_output_dir, "eval_results.txt")
|
output_eval_file = os.path.join(eval_output_dir, prefix, "eval_results.txt")
|
||||||
with open(output_eval_file, "w") as writer:
|
with open(output_eval_file, "w") as writer:
|
||||||
logger.info("***** Eval results {} *****".format(prefix))
|
logger.info("***** Eval results {} *****".format(prefix))
|
||||||
for key in sorted(result.keys()):
|
for key in sorted(result.keys()):
|
||||||
@@ -271,7 +284,7 @@ def load_and_cache_examples(args, task, tokenizer, evaluate=False):
|
|||||||
list(filter(None, args.model_name_or_path.split('/'))).pop(),
|
list(filter(None, args.model_name_or_path.split('/'))).pop(),
|
||||||
str(args.max_seq_length),
|
str(args.max_seq_length),
|
||||||
str(task)))
|
str(task)))
|
||||||
if os.path.exists(cached_features_file):
|
if os.path.exists(cached_features_file) and not args.overwrite_cache:
|
||||||
logger.info("Loading features from cached file %s", cached_features_file)
|
logger.info("Loading features from cached file %s", cached_features_file)
|
||||||
features = torch.load(cached_features_file)
|
features = torch.load(cached_features_file)
|
||||||
else:
|
else:
|
||||||
@@ -380,6 +393,15 @@ def main():
|
|||||||
parser.add_argument('--seed', type=int, default=42,
|
parser.add_argument('--seed', type=int, default=42,
|
||||||
help="random seed for initialization")
|
help="random seed for initialization")
|
||||||
|
|
||||||
|
parser.add_argument('--tpu', action='store_true',
|
||||||
|
help="Whether to run on the TPU defined in the environment variables")
|
||||||
|
parser.add_argument('--tpu_ip_address', type=str, default='',
|
||||||
|
help="TPU IP address if none are set in the environment variables")
|
||||||
|
parser.add_argument('--tpu_name', type=str, default='',
|
||||||
|
help="TPU name if none are set in the environment variables")
|
||||||
|
parser.add_argument('--xrt_tpu_config', type=str, default='',
|
||||||
|
help="XRT TPU config if none are set in the environment variables")
|
||||||
|
|
||||||
parser.add_argument('--fp16', action='store_true',
|
parser.add_argument('--fp16', action='store_true',
|
||||||
help="Whether to use 16-bit (mixed) precision (through NVIDIA apex) instead of 32-bit")
|
help="Whether to use 16-bit (mixed) precision (through NVIDIA apex) instead of 32-bit")
|
||||||
parser.add_argument('--fp16_opt_level', type=str, default='O1',
|
parser.add_argument('--fp16_opt_level', type=str, default='O1',
|
||||||
@@ -413,6 +435,23 @@ def main():
|
|||||||
args.n_gpu = 1
|
args.n_gpu = 1
|
||||||
args.device = device
|
args.device = device
|
||||||
|
|
||||||
|
if args.tpu:
|
||||||
|
if args.tpu_ip_address:
|
||||||
|
os.environ["TPU_IP_ADDRESS"] = args.tpu_ip_address
|
||||||
|
if args.tpu_name:
|
||||||
|
os.environ["TPU_NAME"] = args.tpu_name
|
||||||
|
if args.xrt_tpu_config:
|
||||||
|
os.environ["XRT_TPU_CONFIG"] = args.xrt_tpu_config
|
||||||
|
|
||||||
|
assert "TPU_IP_ADDRESS" in os.environ
|
||||||
|
assert "TPU_NAME" in os.environ
|
||||||
|
assert "XRT_TPU_CONFIG" in os.environ
|
||||||
|
|
||||||
|
import torch_xla
|
||||||
|
import torch_xla.core.xla_model as xm
|
||||||
|
args.device = xm.xla_device()
|
||||||
|
args.xla_model = xm
|
||||||
|
|
||||||
# Setup logging
|
# Setup logging
|
||||||
logging.basicConfig(format = '%(asctime)s - %(levelname)s - %(name)s - %(message)s',
|
logging.basicConfig(format = '%(asctime)s - %(levelname)s - %(name)s - %(message)s',
|
||||||
datefmt = '%m/%d/%Y %H:%M:%S',
|
datefmt = '%m/%d/%Y %H:%M:%S',
|
||||||
@@ -458,7 +497,7 @@ def main():
|
|||||||
|
|
||||||
|
|
||||||
# Saving best-practices: if you use defaults names for the model, you can reload it using from_pretrained()
|
# Saving best-practices: if you use defaults names for the model, you can reload it using from_pretrained()
|
||||||
if args.do_train and (args.local_rank == -1 or torch.distributed.get_rank() == 0):
|
if args.do_train and (args.local_rank == -1 or torch.distributed.get_rank() == 0) and not args.tpu:
|
||||||
# Create output directory if needed
|
# Create output directory if needed
|
||||||
if not os.path.exists(args.output_dir) and args.local_rank in [-1, 0]:
|
if not os.path.exists(args.output_dir) and args.local_rank in [-1, 0]:
|
||||||
os.makedirs(args.output_dir)
|
os.makedirs(args.output_dir)
|
||||||
@@ -490,9 +529,11 @@ def main():
|
|||||||
logger.info("Evaluate the following checkpoints: %s", checkpoints)
|
logger.info("Evaluate the following checkpoints: %s", checkpoints)
|
||||||
for checkpoint in checkpoints:
|
for checkpoint in checkpoints:
|
||||||
global_step = checkpoint.split('-')[-1] if len(checkpoints) > 1 else ""
|
global_step = checkpoint.split('-')[-1] if len(checkpoints) > 1 else ""
|
||||||
|
prefix = checkpoint.split('/')[-1] if checkpoint.find('checkpoint') != -1 else ""
|
||||||
|
|
||||||
model = model_class.from_pretrained(checkpoint)
|
model = model_class.from_pretrained(checkpoint)
|
||||||
model.to(args.device)
|
model.to(args.device)
|
||||||
result = evaluate(args, model, tokenizer, prefix=global_step)
|
result = evaluate(args, model, tokenizer, prefix=prefix)
|
||||||
result = dict((k + '_{}'.format(global_step), v) for k, v in result.items())
|
result = dict((k + '_{}'.format(global_step), v) for k, v in result.items())
|
||||||
results.update(result)
|
results.update(result)
|
||||||
|
|
||||||
|
|||||||
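With this change each evaluated checkpoint writes its `eval_results.txt` into a sub-directory named after the checkpoint instead of overwriting a single file. The path logic can be sketched in isolation as below; `eval_output_path` is a hypothetical helper that mirrors the two expressions in the diff.

```python
import os


def eval_output_path(eval_output_dir, checkpoint):
    """Return the results file path for `checkpoint` (hypothetical helper)."""
    # Use the last path component as prefix only for 'checkpoint-*' directories.
    prefix = checkpoint.split('/')[-1] if checkpoint.find('checkpoint') != -1 else ""
    # os.path.join skips the empty prefix, so the final model keeps the old location.
    return os.path.join(eval_output_dir, prefix, "eval_results.txt")


print(eval_output_path("out", "out/checkpoint-500"))
print(eval_output_path("out", "out"))
```

Note that `open(..., "w")` does not create missing directories, so this relies on the checkpoint sub-directory already existing under the evaluation output directory.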
@@ -27,12 +27,19 @@ import logging
|
|||||||
import os
|
import os
|
||||||
import pickle
|
import pickle
|
||||||
import random
|
import random
|
||||||
|
import re
|
||||||
|
import shutil
|
||||||
|
|
||||||
import numpy as np
|
import numpy as np
|
||||||
import torch
|
import torch
|
||||||
from torch.utils.data import DataLoader, Dataset, SequentialSampler, RandomSampler
|
from torch.utils.data import DataLoader, Dataset, SequentialSampler, RandomSampler
|
||||||
from torch.utils.data.distributed import DistributedSampler
|
from torch.utils.data.distributed import DistributedSampler
|
||||||
from tensorboardX import SummaryWriter
|
|
||||||
|
try:
|
||||||
|
from torch.utils.tensorboard import SummaryWriter
|
||||||
|
except ImportError:
|
||||||
|
from tensorboardX import SummaryWriter
|
||||||
|
|
||||||
from tqdm import tqdm, trange
|
from tqdm import tqdm, trange
|
||||||
|
|
||||||
from transformers import (WEIGHTS_NAME, AdamW, WarmupLinearSchedule,
|
from transformers import (WEIGHTS_NAME, AdamW, WarmupLinearSchedule,
|
||||||
@@ -59,7 +66,7 @@ class TextDataset(Dataset):
|
|||||||
def __init__(self, tokenizer, file_path='train', block_size=512):
|
def __init__(self, tokenizer, file_path='train', block_size=512):
|
||||||
assert os.path.isfile(file_path)
|
assert os.path.isfile(file_path)
|
||||||
directory, filename = os.path.split(file_path)
|
directory, filename = os.path.split(file_path)
|
||||||
cached_features_file = os.path.join(directory, 'cached_lm_{}_{}'.format(block_size, filename))
|
cached_features_file = os.path.join(directory, 'cached_lm_' + str(block_size) + '_' + filename)
|
||||||
|
|
||||||
if os.path.exists(cached_features_file):
|
if os.path.exists(cached_features_file):
|
||||||
logger.info("Loading features from cached file %s", cached_features_file)
|
logger.info("Loading features from cached file %s", cached_features_file)
|
||||||
@@ -75,7 +82,7 @@ class TextDataset(Dataset):
|
|||||||
tokenized_text = tokenizer.convert_tokens_to_ids(tokenizer.tokenize(text))
|
tokenized_text = tokenizer.convert_tokens_to_ids(tokenizer.tokenize(text))
|
||||||
|
|
||||||
for i in range(0, len(tokenized_text)-block_size+1, block_size): # Truncate in block of block_size
|
for i in range(0, len(tokenized_text)-block_size+1, block_size): # Truncate in block of block_size
|
||||||
self.examples.append(tokenizer.add_special_tokens_single_sequence(tokenized_text[i:i+block_size]))
|
self.examples.append(tokenizer.build_inputs_with_special_tokens(tokenized_text[i:i+block_size]))
|
||||||
# Note that we are losing the last truncated example here for the sake of simplicity (no padding)
|
# Note that we are losing the last truncated example here for the sake of simplicity (no padding)
|
||||||
# If your dataset is small, first you should look for a bigger one :-) and second you
|
# If your dataset is small, first you should look for a bigger one :-) and second you
|
||||||
# can change this behavior by adding (model specific) padding.
|
# can change this behavior by adding (model specific) padding.
|
||||||
@@ -104,11 +111,43 @@ def set_seed(args):
|
|||||||
torch.cuda.manual_seed_all(args.seed)
|
torch.cuda.manual_seed_all(args.seed)
|
||||||
|
|
||||||
|
|
||||||
|
def _rotate_checkpoints(args, checkpoint_prefix, use_mtime=False):
|
||||||
|
if not args.save_total_limit:
|
||||||
|
return
|
||||||
|
if args.save_total_limit <= 0:
|
||||||
|
return
|
||||||
|
|
||||||
|
# Check if we should delete older checkpoint(s)
|
||||||
|
glob_checkpoints = glob.glob(os.path.join(args.output_dir, '{}-*'.format(checkpoint_prefix)))
|
||||||
|
if len(glob_checkpoints) <= args.save_total_limit:
|
||||||
|
return
|
||||||
|
|
||||||
|
ordering_and_checkpoint_path = []
|
||||||
|
for path in glob_checkpoints:
|
||||||
|
if use_mtime:
|
||||||
|
ordering_and_checkpoint_path.append((os.path.getmtime(path), path))
|
||||||
|
else:
|
||||||
|
regex_match = re.match('.*{}-([0-9]+)'.format(checkpoint_prefix), path)
|
||||||
|
if regex_match and regex_match.groups():
|
||||||
|
ordering_and_checkpoint_path.append((int(regex_match.groups()[0]), path))
|
||||||
|
|
||||||
|
checkpoints_sorted = sorted(ordering_and_checkpoint_path)
|
||||||
|
checkpoints_sorted = [checkpoint[1] for checkpoint in checkpoints_sorted]
|
||||||
|
number_of_checkpoints_to_delete = max(0, len(checkpoints_sorted) - args.save_total_limit)
|
||||||
|
checkpoints_to_be_deleted = checkpoints_sorted[:number_of_checkpoints_to_delete]
|
||||||
|
for checkpoint in checkpoints_to_be_deleted:
|
||||||
|
logger.info("Deleting older checkpoint [{}] due to args.save_total_limit".format(checkpoint))
|
||||||
|
shutil.rmtree(checkpoint)
|
||||||
|
|
||||||
|
|
||||||
def mask_tokens(inputs, tokenizer, args):
|
def mask_tokens(inputs, tokenizer, args):
|
||||||
""" Prepare masked tokens inputs/labels for masked language modeling: 80% MASK, 10% random, 10% original. """
|
""" Prepare masked tokens inputs/labels for masked language modeling: 80% MASK, 10% random, 10% original. """
|
||||||
labels = inputs.clone()
|
labels = inputs.clone()
|
||||||
# We sample a few tokens in each sequence for masked-LM training (with probability args.mlm_probability defaults to 0.15 in Bert/RoBERTa)
|
# We sample a few tokens in each sequence for masked-LM training (with probability args.mlm_probability defaults to 0.15 in Bert/RoBERTa)
|
||||||
masked_indices = torch.bernoulli(torch.full(labels.shape, args.mlm_probability)).bool()
|
probability_matrix = torch.full(labels.shape, args.mlm_probability)
|
||||||
|
special_tokens_mask = [tokenizer.get_special_tokens_mask(val, already_has_special_tokens=True) for val in labels.tolist()]
|
||||||
|
probability_matrix.masked_fill_(torch.tensor(special_tokens_mask, dtype=torch.bool), value=0.0)
|
||||||
|
masked_indices = torch.bernoulli(probability_matrix).bool()
|
||||||
labels[~masked_indices] = -1 # We only compute loss on masked tokens
|
labels[~masked_indices] = -1 # We only compute loss on masked tokens
|
||||||
|
|
||||||
# 80% of the time, we replace masked input tokens with tokenizer.mask_token ([MASK])
|
# 80% of the time, we replace masked input tokens with tokenizer.mask_token ([MASK])
|
||||||
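Review note on the `mask_tokens` change in the hunk above: the new code zeroes the masking probability at special-token positions before sampling, so tokens such as [CLS] and [SEP] are never selected for masked-LM training. The sketch below is a plain-Python analogue of that idea for a single sequence, not the tensor code from the commit. The function name `choose_masked_positions`, the `seed` parameter, and the example token ids are assumptions made for illustration.

```python
import random


def choose_masked_positions(token_ids, special_ids, mlm_probability=0.15, seed=0):
    """Return (masked_flags, labels) for one sequence.

    Special tokens get masking probability 0.0, so they are never selected.
    Unselected positions get label -1, the ignore value used in the diff.
    """
    rng = random.Random(seed)
    masked_flags = []
    labels = []
    for tok in token_ids:
        # Per-position probability: 0.0 for special tokens, else mlm_probability.
        prob = 0.0 if tok in special_ids else mlm_probability
        masked = rng.random() < prob
        masked_flags.append(masked)
        labels.append(tok if masked else -1)
    return masked_flags, labels


# Hypothetical ids: 101 and 102 stand in for [CLS] and [SEP].
# With mlm_probability=1.0 every ordinary token is selected, which makes
# the exclusion of the special tokens easy to see.
flags, labels = choose_masked_positions([101, 7, 8, 9, 102], {101, 102}, mlm_probability=1.0)
```

With the probability forced to 1.0, only the three ordinary tokens are flagged and the two special positions keep the label -1.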
@@ -222,8 +261,9 @@ def train(args, train_dataset, model, tokenizer):
|
|||||||
logging_loss = tr_loss
|
logging_loss = tr_loss
|
||||||
|
|
||||||
if args.local_rank in [-1, 0] and args.save_steps > 0 and global_step % args.save_steps == 0:
|
if args.local_rank in [-1, 0] and args.save_steps > 0 and global_step % args.save_steps == 0:
|
||||||
|
checkpoint_prefix = 'checkpoint'
|
||||||
# Save model checkpoint
|
# Save model checkpoint
|
||||||
output_dir = os.path.join(args.output_dir, 'checkpoint-{}'.format(global_step))
|
output_dir = os.path.join(args.output_dir, '{}-{}'.format(checkpoint_prefix, global_step))
|
||||||
if not os.path.exists(output_dir):
|
if not os.path.exists(output_dir):
|
||||||
os.makedirs(output_dir)
|
os.makedirs(output_dir)
|
||||||
model_to_save = model.module if hasattr(model, 'module') else model # Take care of distributed/parallel training
|
model_to_save = model.module if hasattr(model, 'module') else model # Take care of distributed/parallel training
|
||||||
@@ -231,6 +271,8 @@ def train(args, train_dataset, model, tokenizer):
|
|||||||
torch.save(args, os.path.join(output_dir, 'training_args.bin'))
|
torch.save(args, os.path.join(output_dir, 'training_args.bin'))
|
||||||
logger.info("Saving model checkpoint to %s", output_dir)
|
logger.info("Saving model checkpoint to %s", output_dir)
|
||||||
|
|
||||||
|
_rotate_checkpoints(args, checkpoint_prefix)
|
||||||
|
|
||||||
if args.max_steps > 0 and global_step > args.max_steps:
|
if args.max_steps > 0 and global_step > args.max_steps:
|
||||||
epoch_iterator.close()
|
epoch_iterator.close()
|
||||||
break
|
break
|
||||||
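Review note on the `_rotate_checkpoints` call added above: the helper keeps at most `args.save_total_limit` checkpoint directories and deletes the oldest ones, ordering them by the integer step in the directory name. The following is a minimal standalone sketch of that behaviour using only the standard library. It takes the directory and limit as plain arguments instead of `args`, returns the deleted paths, and anchors the regular expression with `$` and `re.escape`; those three choices are assumptions of this sketch and differ from the code in the diff. The step numbers in the demo are made up.

```python
import glob
import os
import re
import shutil
import tempfile


def rotate_checkpoints(output_dir, save_total_limit, prefix="checkpoint"):
    """Delete the oldest '<prefix>-<step>' directories beyond save_total_limit.

    Returns the list of deleted paths, oldest first.
    """
    if not save_total_limit or save_total_limit <= 0:
        return []
    candidates = []
    for path in glob.glob(os.path.join(output_dir, "{}-*".format(prefix))):
        match = re.match(r".*{}-([0-9]+)$".format(re.escape(prefix)), path)
        if match:
            # Sort on the integer step so that step 1000 ranks after step 500.
            candidates.append((int(match.group(1)), path))
    candidates.sort()
    excess = max(0, len(candidates) - save_total_limit)
    deleted = [path for _, path in candidates[:excess]]
    for path in deleted:
        shutil.rmtree(path)
    return deleted


# Demo in a throwaway directory with hypothetical step numbers.
demo_dir = tempfile.mkdtemp()
for step in (50, 100, 1000, 500):
    os.makedirs(os.path.join(demo_dir, "checkpoint-{}".format(step)))
deleted = rotate_checkpoints(demo_dir, save_total_limit=2)
remaining = sorted(os.listdir(demo_dir))
shutil.rmtree(demo_dir)
```

Sorting on the parsed integer matters: a plain string sort would place `checkpoint-1000` before `checkpoint-500` and could delete the newest checkpoint.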
@@ -267,10 +309,12 @@ def evaluate(args, model, tokenizer, prefix=""):
|
|||||||
model.eval()
|
model.eval()
|
||||||
|
|
||||||
for batch in tqdm(eval_dataloader, desc="Evaluating"):
|
for batch in tqdm(eval_dataloader, desc="Evaluating"):
|
||||||
batch = batch.to(args.device)
|
inputs, labels = mask_tokens(batch, tokenizer, args) if args.mlm else (batch, batch)
|
||||||
|
inputs = inputs.to(args.device)
|
||||||
|
labels = labels.to(args.device)
|
||||||
|
|
||||||
with torch.no_grad():
|
with torch.no_grad():
|
||||||
outputs = model(batch, masked_lm_labels=batch) if args.mlm else model(batch, labels=batch)
|
outputs = model(inputs, masked_lm_labels=labels) if args.mlm else model(inputs, labels=labels)
|
||||||
lm_loss = outputs[0]
|
lm_loss = outputs[0]
|
||||||
eval_loss += lm_loss.mean().item()
|
eval_loss += lm_loss.mean().item()
|
||||||
nb_eval_steps += 1
|
nb_eval_steps += 1
|
||||||
@@ -282,7 +326,7 @@ def evaluate(args, model, tokenizer, prefix=""):
|
|||||||
"perplexity": perplexity
|
"perplexity": perplexity
|
||||||
}
|
}
|
||||||
|
|
||||||
output_eval_file = os.path.join(eval_output_dir, "eval_results.txt")
|
output_eval_file = os.path.join(eval_output_dir, prefix, "eval_results.txt")
|
||||||
with open(output_eval_file, "w") as writer:
|
with open(output_eval_file, "w") as writer:
|
||||||
logger.info("***** Eval results {} *****".format(prefix))
|
logger.info("***** Eval results {} *****".format(prefix))
|
||||||
for key in sorted(result.keys()):
|
for key in sorted(result.keys()):
|
||||||
@@ -359,6 +403,8 @@ def main():
|
|||||||
help="Log every X updates steps.")
|
help="Log every X updates steps.")
|
||||||
parser.add_argument('--save_steps', type=int, default=50,
|
parser.add_argument('--save_steps', type=int, default=50,
|
||||||
help="Save checkpoint every X updates steps.")
|
help="Save checkpoint every X updates steps.")
|
||||||
|
parser.add_argument('--save_total_limit', type=int, default=None,
|
||||||
|
help='Limit the total amount of checkpoints, delete the older checkpoints in the output_dir, does not delete by default')
|
||||||
parser.add_argument("--eval_all_checkpoints", action='store_true',
|
parser.add_argument("--eval_all_checkpoints", action='store_true',
|
||||||
help="Evaluate all checkpoints starting with the same prefix as model_name_or_path ending and ending with step number")
|
help="Evaluate all checkpoints starting with the same prefix as model_name_or_path ending and ending with step number")
|
||||||
parser.add_argument("--no_cuda", action='store_true',
|
parser.add_argument("--no_cuda", action='store_true',
|
||||||
@@ -484,9 +530,11 @@ def main():
|
|||||||
logger.info("Evaluate the following checkpoints: %s", checkpoints)
|
logger.info("Evaluate the following checkpoints: %s", checkpoints)
|
||||||
for checkpoint in checkpoints:
|
for checkpoint in checkpoints:
|
||||||
global_step = checkpoint.split('-')[-1] if len(checkpoints) > 1 else ""
|
global_step = checkpoint.split('-')[-1] if len(checkpoints) > 1 else ""
|
||||||
|
prefix = checkpoint.split('/')[-1] if checkpoint.find('checkpoint') != -1 else ""
|
||||||
|
|
||||||
model = model_class.from_pretrained(checkpoint)
|
model = model_class.from_pretrained(checkpoint)
|
||||||
model.to(args.device)
|
model.to(args.device)
|
||||||
result = evaluate(args, model, tokenizer, prefix=global_step)
|
result = evaluate(args, model, tokenizer, prefix=prefix)
|
||||||
result = dict((k + '_{}'.format(global_step), v) for k, v in result.items())
|
result = dict((k + '_{}'.format(global_step), v) for k, v in result.items())
|
||||||
results.update(result)
|
results.update(result)
|
||||||
|
|
||||||
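Review note on the evaluation-loop change above: `global_step` is still used to suffix the result keys, while the new `prefix` (the last path component of a checkpoint directory) is passed to `evaluate`, which now writes `eval_results.txt` under `eval_output_dir/prefix`. The sketch below reproduces the two string expressions from the diff and shows the resulting path. The helper names `eval_names` and `eval_results_path` are hypothetical, and the `os.makedirs` call is an assumption: this excerpt does not show whether the script creates the per-checkpoint subdirectory before opening the file.

```python
import os
import tempfile


def eval_names(checkpoint, n_checkpoints):
    """Mirror the two expressions from the diff for one checkpoint path."""
    global_step = checkpoint.split("-")[-1] if n_checkpoints > 1 else ""
    prefix = checkpoint.split("/")[-1] if checkpoint.find("checkpoint") != -1 else ""
    return global_step, prefix


def eval_results_path(eval_output_dir, prefix):
    """Build the results path, creating the subdirectory if needed (assumed step)."""
    target_dir = os.path.join(eval_output_dir, prefix)
    if not os.path.exists(target_dir):
        os.makedirs(target_dir)
    return os.path.join(target_dir, "eval_results.txt")


# Hypothetical paths for illustration.
step, prefix = eval_names("out/checkpoint-500", n_checkpoints=3)
single_step, single_prefix = eval_names("out", n_checkpoints=1)

base_dir = tempfile.mkdtemp()
results_file = eval_results_path(base_dir, prefix)
```

For a run with a single final model, both values are empty strings and the results file lands directly in `eval_output_dir`, as before the change.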
|
|||||||
@@ -29,7 +29,12 @@ import torch
|
|||||||
from torch.utils.data import (DataLoader, RandomSampler, SequentialSampler,
|
from torch.utils.data import (DataLoader, RandomSampler, SequentialSampler,
|
||||||
TensorDataset)
|
TensorDataset)
|
||||||
from torch.utils.data.distributed import DistributedSampler
|
from torch.utils.data.distributed import DistributedSampler
|
||||||
from tensorboardX import SummaryWriter
|
|
||||||
|
try:
|
||||||
|
from torch.utils.tensorboard import SummaryWriter
|
||||||
|
except:
|
||||||
|
from tensorboardX import SummaryWriter
|
||||||
|
|
||||||
from tqdm import tqdm, trange
|
from tqdm import tqdm, trange
|
||||||
|
|
||||||
from transformers import (WEIGHTS_NAME, BertConfig,
|
from transformers import (WEIGHTS_NAME, BertConfig,
|
||||||
@@ -293,7 +298,7 @@ def load_and_cache_examples(args, task, tokenizer, evaluate=False, test=False):
|
|||||||
list(filter(None, args.model_name_or_path.split('/'))).pop(),
|
list(filter(None, args.model_name_or_path.split('/'))).pop(),
|
||||||
str(args.max_seq_length),
|
str(args.max_seq_length),
|
||||||
str(task)))
|
str(task)))
|
||||||
if os.path.exists(cached_features_file):
|
if os.path.exists(cached_features_file) and not args.overwrite_cache:
|
||||||
logger.info("Loading features from cached file %s", cached_features_file)
|
logger.info("Loading features from cached file %s", cached_features_file)
|
||||||
features = torch.load(cached_features_file)
|
features = torch.load(cached_features_file)
|
||||||
else:
|
else:
|
||||||
@@ -306,14 +311,14 @@ def load_and_cache_examples(args, task, tokenizer, evaluate=False, test=False):
|
|||||||
else:
|
else:
|
||||||
examples = processor.get_train_examples(args.data_dir)
|
examples = processor.get_train_examples(args.data_dir)
|
||||||
logger.info("Training number: %s", str(len(examples)))
|
logger.info("Training number: %s", str(len(examples)))
|
||||||
features = convert_examples_to_features(examples, label_list, args.max_seq_length, tokenizer,
|
features = convert_examples_to_features(
|
||||||
cls_token_at_end=bool(args.model_type in ['xlnet']), # xlnet has a cls token at the end
|
examples,
|
||||||
cls_token=tokenizer.cls_token,
|
label_list,
|
||||||
sep_token=tokenizer.sep_token,
|
args.max_seq_length,
|
||||||
sep_token_extra=bool(args.model_type in ['roberta']),
|
tokenizer,
|
||||||
cls_token_segment_id=2 if args.model_type in ['xlnet'] else 0,
|
|
||||||
pad_on_left=bool(args.model_type in ['xlnet']), # pad on the left for xlnet
|
pad_on_left=bool(args.model_type in ['xlnet']), # pad on the left for xlnet
|
||||||
pad_token_segment_id=4 if args.model_type in ['xlnet'] else 0)
|
pad_token_segment_id=4 if args.model_type in ['xlnet'] else 0
|
||||||
|
)
|
||||||
if args.local_rank in [-1, 0]:
|
if args.local_rank in [-1, 0]:
|
||||||
logger.info("Saving features into cached file %s", cached_features_file)
|
logger.info("Saving features into cached file %s", cached_features_file)
|
||||||
torch.save(features, cached_features_file)
|
torch.save(features, cached_features_file)
|
||||||
@@ -362,7 +367,7 @@ def main():
|
|||||||
help="Whether to run eval on the dev set.")
|
help="Whether to run eval on the dev set.")
|
||||||
parser.add_argument("--do_test", action='store_true', help='Whether to run test on the test set')
|
parser.add_argument("--do_test", action='store_true', help='Whether to run test on the test set')
|
||||||
parser.add_argument("--evaluate_during_training", action='store_true',
|
parser.add_argument("--evaluate_during_training", action='store_true',
|
||||||
help="Rul evaluation during training at each logging step.")
|
help="Run evaluation during training at each logging step.")
|
||||||
parser.add_argument("--do_lower_case", action='store_true',
|
parser.add_argument("--do_lower_case", action='store_true',
|
||||||
help="Set this flag if you are using an uncased model.")
|
help="Set this flag if you are using an uncased model.")
|
||||||
|
|
||||||
@@ -512,9 +517,11 @@ def main():
|
|||||||
logger.info("Evaluate the following checkpoints: %s", checkpoints)
|
logger.info("Evaluate the following checkpoints: %s", checkpoints)
|
||||||
for checkpoint in checkpoints:
|
for checkpoint in checkpoints:
|
||||||
global_step = checkpoint.split('-')[-1] if len(checkpoints) > 1 else ""
|
global_step = checkpoint.split('-')[-1] if len(checkpoints) > 1 else ""
|
||||||
|
prefix = checkpoint.split('/')[-1] if checkpoint.find('checkpoint') != -1 else ""
|
||||||
|
|
||||||
model = model_class.from_pretrained(checkpoint)
|
model = model_class.from_pretrained(checkpoint)
|
||||||
model.to(args.device)
|
model.to(args.device)
|
||||||
result = evaluate(args, model, tokenizer, prefix=global_step)
|
result = evaluate(args, model, tokenizer, prefix=prefix)
|
||||||
result = dict((k + '_{}'.format(global_step), v) for k, v in result.items())
|
result = dict((k + '_{}'.format(global_step), v) for k, v in result.items())
|
||||||
results.update(result)
|
results.update(result)
|
||||||
|
|
||||||
@@ -528,9 +535,11 @@ def main():
|
|||||||
logger.info("Evaluate the following checkpoints: %s", checkpoints)
|
logger.info("Evaluate the following checkpoints: %s", checkpoints)
|
||||||
for checkpoint in checkpoints:
|
for checkpoint in checkpoints:
|
||||||
global_step = checkpoint.split('-')[-1] if len(checkpoints) > 1 else ""
|
global_step = checkpoint.split('-')[-1] if len(checkpoints) > 1 else ""
|
||||||
|
prefix = checkpoint.split('/')[-1] if checkpoint.find('checkpoint') != -1 else ""
|
||||||
|
|
||||||
model = model_class.from_pretrained(checkpoint)
|
model = model_class.from_pretrained(checkpoint)
|
||||||
model.to(args.device)
|
model.to(args.device)
|
||||||
result = evaluate(args, model, tokenizer, prefix=global_step, test=True)
|
result = evaluate(args, model, tokenizer, prefix=prefix, test=True)
|
||||||
result = dict((k + '_{}'.format(global_step), v) for k, v in result.items())
|
result = dict((k + '_{}'.format(global_step), v) for k, v in result.items())
|
||||||
results.update(result)
|
results.update(result)
|
||||||
if best_steps:
|
if best_steps:
|
||||||
|
|||||||
515
examples/run_ner.py
Normal file
@@ -0,0 +1,515 @@
|
|||||||
|
# coding=utf-8
|
||||||
|
# Copyright 2018 The Google AI Language Team Authors and The HuggingFace Inc. team.
|
||||||
|
# Copyright (c) 2018, NVIDIA CORPORATION. All rights reserved.
|
||||||
|
#
|
||||||
|
# Licensed under the Apache License, Version 2.0 (the "License");
|
||||||
|
# you may not use this file except in compliance with the License.
|
||||||
|
# You may obtain a copy of the License at
|
||||||
|
#
|
||||||
|
# http://www.apache.org/licenses/LICENSE-2.0
|
||||||
|
#
|
||||||
|
# Unless required by applicable law or agreed to in writing, software
|
||||||
|
# distributed under the License is distributed on an "AS IS" BASIS,
|
||||||
|
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
|
||||||
|
# See the License for the specific language governing permissions and
|
||||||
|
# limitations under the License.
|
||||||
|
""" Fine-tuning the library models for named entity recognition on CoNLL-2003 (Bert or Roberta). """
|
||||||
|
|
||||||
|
from __future__ import absolute_import, division, print_function
|
||||||
|
|
||||||
|
import argparse
|
||||||
|
import glob
|
||||||
|
import logging
|
||||||
|
import os
|
||||||
|
import random
|
||||||
|
|
||||||
|
import numpy as np
|
||||||
|
import torch
|
||||||
|
from seqeval.metrics import precision_score, recall_score, f1_score
|
||||||
|
from tensorboardX import SummaryWriter
|
||||||
|
from torch.nn import CrossEntropyLoss
|
||||||
|
from torch.utils.data import DataLoader, RandomSampler, SequentialSampler, TensorDataset
|
||||||
|
from torch.utils.data.distributed import DistributedSampler
|
||||||
|
from tqdm import tqdm, trange
|
||||||
|
from utils_ner import convert_examples_to_features, get_labels, read_examples_from_file
|
||||||
|
|
||||||
|
from transformers import AdamW, WarmupLinearSchedule
|
||||||
|
from transformers import WEIGHTS_NAME, BertConfig, BertForTokenClassification, BertTokenizer
|
||||||
|
from transformers import RobertaConfig, RobertaForTokenClassification, RobertaTokenizer
|
||||||
|
|
||||||
|
logger = logging.getLogger(__name__)
|
||||||
|
|
||||||
|
ALL_MODELS = sum(
|
||||||
|
(tuple(conf.pretrained_config_archive_map.keys()) for conf in (BertConfig, RobertaConfig)),
|
||||||
|
())
|
||||||
|
|
||||||
|
MODEL_CLASSES = {
|
||||||
|
"bert": (BertConfig, BertForTokenClassification, BertTokenizer),
|
||||||
|
"roberta": (RobertaConfig, RobertaForTokenClassification, RobertaTokenizer)
|
||||||
|
}
|
||||||
|
|
||||||
|
|
||||||
|
def set_seed(args):
|
||||||
|
random.seed(args.seed)
|
||||||
|
np.random.seed(args.seed)
|
||||||
|
torch.manual_seed(args.seed)
|
||||||
|
if args.n_gpu > 0:
|
||||||
|
torch.cuda.manual_seed_all(args.seed)
|
||||||
|
|
||||||
|
|
||||||
|
def train(args, train_dataset, model, tokenizer, labels, pad_token_label_id):
|
||||||
|
""" Train the model """
|
||||||
|
if args.local_rank in [-1, 0]:
|
||||||
|
tb_writer = SummaryWriter()
|
||||||
|
|
||||||
|
args.train_batch_size = args.per_gpu_train_batch_size * max(1, args.n_gpu)
|
||||||
|
train_sampler = RandomSampler(train_dataset) if args.local_rank == -1 else DistributedSampler(train_dataset)
|
||||||
|
train_dataloader = DataLoader(train_dataset, sampler=train_sampler, batch_size=args.train_batch_size)
|
||||||
|
|
||||||
|
if args.max_steps > 0:
|
||||||
|
t_total = args.max_steps
|
||||||
|
args.num_train_epochs = args.max_steps // (len(train_dataloader) // args.gradient_accumulation_steps) + 1
|
||||||
|
else:
|
||||||
|
t_total = len(train_dataloader) // args.gradient_accumulation_steps * args.num_train_epochs
|
||||||
|
|
||||||
|
# Prepare optimizer and schedule (linear warmup and decay)
|
||||||
|
no_decay = ["bias", "LayerNorm.weight"]
|
||||||
|
optimizer_grouped_parameters = [
|
||||||
|
{"params": [p for n, p in model.named_parameters() if not any(nd in n for nd in no_decay)],
|
||||||
|
"weight_decay": args.weight_decay},
|
||||||
|
{"params": [p for n, p in model.named_parameters() if any(nd in n for nd in no_decay)], "weight_decay": 0.0}
|
||||||
|
]
|
||||||
|
optimizer = AdamW(optimizer_grouped_parameters, lr=args.learning_rate, eps=args.adam_epsilon)
|
||||||
|
scheduler = WarmupLinearSchedule(optimizer, warmup_steps=args.warmup_steps, t_total=t_total)
|
||||||
|
if args.fp16:
|
||||||
|
try:
|
||||||
|
from apex import amp
|
||||||
|
except ImportError:
|
||||||
|
raise ImportError("Please install apex from https://www.github.com/nvidia/apex to use fp16 training.")
|
||||||
|
model, optimizer = amp.initialize(model, optimizer, opt_level=args.fp16_opt_level)
|
||||||
|
|
||||||
|
# multi-gpu training (should be after apex fp16 initialization)
|
||||||
|
if args.n_gpu > 1:
|
||||||
|
model = torch.nn.DataParallel(model)
|
||||||
|
|
||||||
|
# Distributed training (should be after apex fp16 initialization)
|
||||||
|
if args.local_rank != -1:
|
||||||
|
model = torch.nn.parallel.DistributedDataParallel(model, device_ids=[args.local_rank],
|
||||||
|
output_device=args.local_rank,
|
||||||
|
find_unused_parameters=True)
|
||||||
|
|
||||||
|
# Train!
|
||||||
|
logger.info("***** Running training *****")
|
||||||
|
logger.info(" Num examples = %d", len(train_dataset))
|
||||||
|
logger.info(" Num Epochs = %d", args.num_train_epochs)
|
||||||
|
logger.info(" Instantaneous batch size per GPU = %d", args.per_gpu_train_batch_size)
|
||||||
|
logger.info(" Total train batch size (w. parallel, distributed & accumulation) = %d",
|
||||||
|
args.train_batch_size * args.gradient_accumulation_steps * (
|
||||||
|
torch.distributed.get_world_size() if args.local_rank != -1 else 1))
|
||||||
|
logger.info(" Gradient Accumulation steps = %d", args.gradient_accumulation_steps)
|
||||||
|
logger.info(" Total optimization steps = %d", t_total)
|
||||||
|
|
||||||
|
global_step = 0
|
||||||
|
tr_loss, logging_loss = 0.0, 0.0
|
||||||
|
model.zero_grad()
|
||||||
|
train_iterator = trange(int(args.num_train_epochs), desc="Epoch", disable=args.local_rank not in [-1, 0])
|
||||||
|
set_seed(args) # Added here for reproducibility (even between python 2 and 3)
|
||||||
|
for _ in train_iterator:
|
||||||
|
epoch_iterator = tqdm(train_dataloader, desc="Iteration", disable=args.local_rank not in [-1, 0])
|
||||||
|
for step, batch in enumerate(epoch_iterator):
|
||||||
|
model.train()
|
||||||
|
batch = tuple(t.to(args.device) for t in batch)
|
||||||
|
inputs = {"input_ids": batch[0],
|
||||||
|
"attention_mask": batch[1],
|
||||||
|
"token_type_ids": batch[2] if args.model_type in ["bert", "xlnet"] else None,
|
||||||
|
# XLM and RoBERTa don't use segment_ids
|
||||||
|
"labels": batch[3]}
|
||||||
|
outputs = model(**inputs)
|
||||||
|
loss = outputs[0] # model outputs are always tuple in pytorch-transformers (see doc)
|
||||||
|
|
||||||
|
if args.n_gpu > 1:
|
||||||
|
loss = loss.mean() # mean() to average on multi-gpu parallel training
|
||||||
|
if args.gradient_accumulation_steps > 1:
|
||||||
|
loss = loss / args.gradient_accumulation_steps
|
||||||
|
|
||||||
|
if args.fp16:
|
||||||
|
with amp.scale_loss(loss, optimizer) as scaled_loss:
|
||||||
|
scaled_loss.backward()
|
||||||
|
else:
|
||||||
|
loss.backward()
|
||||||
|
|
||||||
|
tr_loss += loss.item()
|
||||||
|
if (step + 1) % args.gradient_accumulation_steps == 0:
|
||||||
|
if args.fp16:
|
||||||
|
torch.nn.utils.clip_grad_norm_(amp.master_params(optimizer), args.max_grad_norm)
|
||||||
|
else:
|
||||||
|
torch.nn.utils.clip_grad_norm_(model.parameters(), args.max_grad_norm)
|
||||||
|
|
||||||
|
scheduler.step() # Update learning rate schedule
|
||||||
|
optimizer.step()
|
||||||
|
model.zero_grad()
|
||||||
|
global_step += 1
|
||||||
|
|
||||||
|
if args.local_rank in [-1, 0] and args.logging_steps > 0 and global_step % args.logging_steps == 0:
|
||||||
|
# Log metrics
|
||||||
|
if args.local_rank == -1 and args.evaluate_during_training: # Only evaluate when single GPU otherwise metrics may not average well
|
||||||
|
results, _ = evaluate(args, model, tokenizer, labels, pad_token_label_id, mode="dev")  # evaluate() requires a mode argument; "dev" is assumed here
|
||||||
|
for key, value in results.items():
|
||||||
|
tb_writer.add_scalar("eval_{}".format(key), value, global_step)
|
||||||
|
tb_writer.add_scalar("lr", scheduler.get_lr()[0], global_step)
|
||||||
|
tb_writer.add_scalar("loss", (tr_loss - logging_loss) / args.logging_steps, global_step)
|
||||||
|
logging_loss = tr_loss
|
||||||
|
|
||||||
|
if args.local_rank in [-1, 0] and args.save_steps > 0 and global_step % args.save_steps == 0:
|
||||||
|
# Save model checkpoint
|
||||||
|
output_dir = os.path.join(args.output_dir, "checkpoint-{}".format(global_step))
|
||||||
|
if not os.path.exists(output_dir):
|
||||||
|
os.makedirs(output_dir)
|
||||||
|
model_to_save = model.module if hasattr(model, "module") else model # Take care of distributed/parallel training
|
||||||
|
model_to_save.save_pretrained(output_dir)
|
||||||
|
torch.save(args, os.path.join(output_dir, "training_args.bin"))
|
||||||
|
logger.info("Saving model checkpoint to %s", output_dir)
|
||||||
|
|
||||||
|
if args.max_steps > 0 and global_step > args.max_steps:
|
||||||
|
epoch_iterator.close()
|
||||||
|
break
|
||||||
|
if args.max_steps > 0 and global_step > args.max_steps:
|
||||||
|
train_iterator.close()
|
||||||
|
break
|
||||||
|
|
||||||
|
if args.local_rank in [-1, 0]:
|
||||||
|
tb_writer.close()
|
||||||
|
|
||||||
|
return global_step, tr_loss / global_step
|
||||||
|
|
||||||
|
|
||||||
|
def evaluate(args, model, tokenizer, labels, pad_token_label_id, mode, prefix=""):
|
||||||
|
eval_dataset = load_and_cache_examples(args, tokenizer, labels, pad_token_label_id, mode=mode)
|
||||||
|
|
||||||
|
args.eval_batch_size = args.per_gpu_eval_batch_size * max(1, args.n_gpu)
|
||||||
|
# Note that DistributedSampler samples randomly
|
||||||
|
eval_sampler = SequentialSampler(eval_dataset) if args.local_rank == -1 else DistributedSampler(eval_dataset)
|
||||||
|
eval_dataloader = DataLoader(eval_dataset, sampler=eval_sampler, batch_size=args.eval_batch_size)
|
||||||
|
|
||||||
|
# Eval!
|
||||||
|
logger.info("***** Running evaluation %s *****", prefix)
|
||||||
|
logger.info(" Num examples = %d", len(eval_dataset))
|
||||||
|
logger.info(" Batch size = %d", args.eval_batch_size)
|
||||||
|
eval_loss = 0.0
|
||||||
|
nb_eval_steps = 0
|
||||||
|
preds = None
|
||||||
|
out_label_ids = None
|
||||||
|
model.eval()
|
||||||
|
for batch in tqdm(eval_dataloader, desc="Evaluating"):
|
||||||
|
batch = tuple(t.to(args.device) for t in batch)
|
||||||
|
|
||||||
|
with torch.no_grad():
|
||||||
|
inputs = {"input_ids": batch[0],
|
||||||
|
"attention_mask": batch[1],
|
||||||
|
"token_type_ids": batch[2] if args.model_type in ["bert", "xlnet"] else None,
|
||||||
|
# XLM and RoBERTa don't use segment_ids
|
||||||
|
"labels": batch[3]}
|
||||||
|
outputs = model(**inputs)
|
||||||
|
tmp_eval_loss, logits = outputs[:2]
|
||||||
|
|
||||||
|
eval_loss += tmp_eval_loss.item()
|
||||||
|
nb_eval_steps += 1
|
||||||
|
if preds is None:
|
||||||
|
preds = logits.detach().cpu().numpy()
|
||||||
|
out_label_ids = inputs["labels"].detach().cpu().numpy()
|
||||||
|
else:
|
||||||
|
preds = np.append(preds, logits.detach().cpu().numpy(), axis=0)
|
||||||
|
out_label_ids = np.append(out_label_ids, inputs["labels"].detach().cpu().numpy(), axis=0)
|
||||||
|
|
||||||
|
eval_loss = eval_loss / nb_eval_steps
|
||||||
|
preds = np.argmax(preds, axis=2)
|
||||||
|
|
||||||
|
label_map = {i: label for i, label in enumerate(labels)}
|
||||||
|
|
||||||
|
out_label_list = [[] for _ in range(out_label_ids.shape[0])]
|
||||||
|
preds_list = [[] for _ in range(out_label_ids.shape[0])]
|
||||||
|
|
||||||
|
for i in range(out_label_ids.shape[0]):
|
||||||
|
for j in range(out_label_ids.shape[1]):
|
||||||
|
if out_label_ids[i, j] != pad_token_label_id:
|
||||||
|
out_label_list[i].append(label_map[out_label_ids[i][j]])
|
||||||
|
preds_list[i].append(label_map[preds[i][j]])
|
||||||
|
|
||||||
|
results = {
|
||||||
|
"loss": eval_loss,
|
||||||
|
"precision": precision_score(out_label_list, preds_list),
|
||||||
|
"recall": recall_score(out_label_list, preds_list),
|
||||||
|
"f1": f1_score(out_label_list, preds_list)
|
||||||
|
}
|
||||||
|
|
||||||
|
logger.info("***** Eval results %s *****", prefix)
|
||||||
|
for key in sorted(results.keys()):
|
||||||
|
logger.info(" %s = %s", key, str(results[key]))
|
||||||
|
|
||||||
|
return results, preds_list
|
||||||
|
|
||||||
|
|
||||||
|
def load_and_cache_examples(args, tokenizer, labels, pad_token_label_id, mode):
|
||||||
|
if args.local_rank not in [-1, 0] and not evaluate:
|
||||||
|
torch.distributed.barrier() # Make sure only the first process in distributed training process the dataset, and the others will use the cache
|
||||||
|
|
||||||
|
# Load data features from cache or dataset file
|
||||||
|
cached_features_file = os.path.join(args.data_dir, "cached_{}_{}_{}".format(mode,
|
||||||
|
list(filter(None, args.model_name_or_path.split("/"))).pop(),
|
||||||
|
str(args.max_seq_length)))
|
||||||
|
if os.path.exists(cached_features_file) and not args.overwrite_cache:
|
||||||
|
logger.info("Loading features from cached file %s", cached_features_file)
|
||||||
|
features = torch.load(cached_features_file)
|
||||||
|
else:
|
||||||
|
logger.info("Creating features from dataset file at %s", args.data_dir)
|
||||||
|
examples = read_examples_from_file(args.data_dir, mode)
|
||||||
|
features = convert_examples_to_features(examples, labels, args.max_seq_length, tokenizer,
|
||||||
|
cls_token_at_end=bool(args.model_type in ["xlnet"]),
|
||||||
|
# xlnet has a cls token at the end
|
||||||
|
cls_token=tokenizer.cls_token,
|
||||||
|
cls_token_segment_id=2 if args.model_type in ["xlnet"] else 0,
|
||||||
|
sep_token=tokenizer.sep_token,
|
||||||
|
sep_token_extra=bool(args.model_type in ["roberta"]),
|
||||||
|
# roberta uses an extra separator b/w pairs of sentences, cf. github.com/pytorch/fairseq/commit/1684e166e3da03f5b600dbb7855cb98ddfcd0805
|
||||||
|
pad_on_left=bool(args.model_type in ["xlnet"]),
|
||||||
|
# pad on the left for xlnet
|
||||||
|
pad_token=tokenizer.convert_tokens_to_ids([tokenizer.pad_token])[0],
|
||||||
|
pad_token_segment_id=4 if args.model_type in ["xlnet"] else 0,
|
||||||
|
pad_token_label_id=pad_token_label_id
|
||||||
|
)
|
||||||
|
if args.local_rank in [-1, 0]:
|
||||||
|
logger.info("Saving features into cached file %s", cached_features_file)
|
||||||
|
torch.save(features, cached_features_file)
|
||||||
|
|
||||||
|
if args.local_rank == 0 and not evaluate:
|
||||||
|
torch.distributed.barrier() # Make sure only the first process in distributed training process the dataset, and the others will use the cache
|
||||||
|
|
||||||
|
# Convert to Tensors and build dataset
|
||||||
|
all_input_ids = torch.tensor([f.input_ids for f in features], dtype=torch.long)
|
||||||
|
all_input_mask = torch.tensor([f.input_mask for f in features], dtype=torch.long)
|
||||||
|
all_segment_ids = torch.tensor([f.segment_ids for f in features], dtype=torch.long)
|
||||||
|
all_label_ids = torch.tensor([f.label_ids for f in features], dtype=torch.long)
|
||||||
|
|
||||||
|
dataset = TensorDataset(all_input_ids, all_input_mask, all_segment_ids, all_label_ids)
|
||||||
|
return dataset
|
||||||
|
|
||||||
|
|
||||||
|
def main():
|
||||||
|
parser = argparse.ArgumentParser()
|
||||||
|
|
||||||
|
## Required parameters
|
||||||
|
parser.add_argument("--data_dir", default=None, type=str, required=True,
|
||||||
|
help="The input data dir. Should contain the training files for the CoNLL-2003 NER task.")
|
||||||
|
parser.add_argument("--model_type", default=None, type=str, required=True,
|
||||||
|
help="Model type selected in the list: " + ", ".join(MODEL_CLASSES.keys()))
|
||||||
|
parser.add_argument("--model_name_or_path", default=None, type=str, required=True,
|
||||||
|
help="Path to pre-trained model or shortcut name selected in the list: " + ", ".join(ALL_MODELS))
|
||||||
|
parser.add_argument("--output_dir", default=None, type=str, required=True,
|
||||||
|
help="The output directory where the model predictions and checkpoints will be written.")
|
||||||
|
|
||||||
|
## Other parameters
|
||||||
|
parser.add_argument("--labels", default="", type=str,
|
||||||
|
help="Path to a file containing all labels. If not specified, CoNLL-2003 labels are used.")
|
||||||
|
parser.add_argument("--config_name", default="", type=str,
|
||||||
|
help="Pretrained config name or path if not the same as model_name")
|
||||||
|
parser.add_argument("--tokenizer_name", default="", type=str,
|
||||||
|
help="Pretrained tokenizer name or path if not the same as model_name")
|
||||||
|
parser.add_argument("--cache_dir", default="", type=str,
|
||||||
|
help="Where do you want to store the pre-trained models downloaded from s3")
|
||||||
|
parser.add_argument("--max_seq_length", default=128, type=int,
|
||||||
|
help="The maximum total input sequence length after tokenization. Sequences longer "
|
||||||
|
"than this will be truncated, sequences shorter will be padded.")
|
||||||
|
parser.add_argument("--do_train", action="store_true",
|
||||||
|
help="Whether to run training.")
|
||||||
|
parser.add_argument("--do_eval", action="store_true",
|
||||||
|
help="Whether to run eval on the dev set.")
|
||||||
|
parser.add_argument("--do_predict", action="store_true",
|
||||||
|
help="Whether to run predictions on the test set.")
|
||||||
|
parser.add_argument("--evaluate_during_training", action="store_true",
|
||||||
|
help="Whether to run evaluation during training at each logging step.")
|
||||||
|
parser.add_argument("--do_lower_case", action="store_true",
|
||||||
|
help="Set this flag if you are using an uncased model.")
|
||||||
|
|
||||||
|
parser.add_argument("--per_gpu_train_batch_size", default=8, type=int,
|
||||||
|
help="Batch size per GPU/CPU for training.")
|
||||||
|
parser.add_argument("--per_gpu_eval_batch_size", default=8, type=int,
|
||||||
|
help="Batch size per GPU/CPU for evaluation.")
|
||||||
|
parser.add_argument("--gradient_accumulation_steps", type=int, default=1,
|
||||||
|
help="Number of updates steps to accumulate before performing a backward/update pass.")
|
||||||
|
parser.add_argument("--learning_rate", default=5e-5, type=float,
|
||||||
|
help="The initial learning rate for Adam.")
|
||||||
|
parser.add_argument("--weight_decay", default=0.0, type=float,
|
||||||
|
help="Weight decay if we apply some.")
|
||||||
|
parser.add_argument("--adam_epsilon", default=1e-8, type=float,
|
||||||
|
help="Epsilon for Adam optimizer.")
|
||||||
|
parser.add_argument("--max_grad_norm", default=1.0, type=float,
|
||||||
|
help="Max gradient norm.")
|
||||||
|
parser.add_argument("--num_train_epochs", default=3.0, type=float,
|
||||||
|
help="Total number of training epochs to perform.")
|
||||||
|
parser.add_argument("--max_steps", default=-1, type=int,
|
||||||
|
help="If > 0: set total number of training steps to perform. Override num_train_epochs.")
|
||||||
|
parser.add_argument("--warmup_steps", default=0, type=int,
|
||||||
|
help="Linear warmup over warmup_steps.")
|
||||||
|
|
||||||
|
parser.add_argument("--logging_steps", type=int, default=50,
|
||||||
|
help="Log every X updates steps.")
|
||||||
|
parser.add_argument("--save_steps", type=int, default=50,
|
||||||
|
help="Save checkpoint every X updates steps.")
|
||||||
|
parser.add_argument("--eval_all_checkpoints", action="store_true",
|
||||||
|
help="Evaluate all checkpoints starting with the same prefix as model_name and ending with step number")
|
||||||
|
parser.add_argument("--no_cuda", action="store_true",
|
||||||
|
help="Avoid using CUDA when available")
|
||||||
|
parser.add_argument("--overwrite_output_dir", action="store_true",
|
||||||
|
help="Overwrite the content of the output directory")
|
||||||
|
parser.add_argument("--overwrite_cache", action="store_true",
|
||||||
|
help="Overwrite the cached training and evaluation sets")
|
||||||
|
parser.add_argument("--seed", type=int, default=42,
|
||||||
|
help="random seed for initialization")
|
||||||
|
|
||||||
|
parser.add_argument("--fp16", action="store_true",
|
||||||
|
help="Whether to use 16-bit (mixed) precision (through NVIDIA apex) instead of 32-bit")
|
||||||
|
parser.add_argument("--fp16_opt_level", type=str, default="O1",
|
||||||
|
help="For fp16: Apex AMP optimization level selected in ['O0', 'O1', 'O2', and 'O3']."
|
||||||
|
"See details at https://nvidia.github.io/apex/amp.html")
|
||||||
|
parser.add_argument("--local_rank", type=int, default=-1,
|
||||||
|
help="For distributed training: local_rank")
|
||||||
|
parser.add_argument("--server_ip", type=str, default="", help="For distant debugging.")
|
||||||
|
parser.add_argument("--server_port", type=str, default="", help="For distant debugging.")
|
||||||
|
args = parser.parse_args()
|
||||||
|
|
||||||
|
if os.path.exists(args.output_dir) and os.listdir(
|
||||||
|
args.output_dir) and args.do_train and not args.overwrite_output_dir:
|
||||||
|
raise ValueError(
|
||||||
|
"Output directory ({}) already exists and is not empty. Use --overwrite_output_dir to overcome.".format(
|
||||||
|
args.output_dir))
|
||||||
|
|
||||||
|
# Setup distant debugging if needed
|
||||||
|
if args.server_ip and args.server_port:
|
||||||
|
# Distant debugging - see https://code.visualstudio.com/docs/python/debugging#_attach-to-a-local-script
|
||||||
|
import ptvsd
|
||||||
|
print("Waiting for debugger attach")
|
||||||
|
ptvsd.enable_attach(address=(args.server_ip, args.server_port), redirect_output=True)
|
||||||
|
ptvsd.wait_for_attach()
|
||||||
|
|
||||||
|
# Setup CUDA, GPU & distributed training
|
||||||
|
if args.local_rank == -1 or args.no_cuda:
|
||||||
|
device = torch.device("cuda" if torch.cuda.is_available() and not args.no_cuda else "cpu")
|
||||||
|
args.n_gpu = torch.cuda.device_count()
|
||||||
|
else: # Initializes the distributed backend which will take care of synchronizing nodes/GPUs
|
||||||
|
torch.cuda.set_device(args.local_rank)
|
||||||
|
device = torch.device("cuda", args.local_rank)
|
||||||
|
torch.distributed.init_process_group(backend="nccl")
|
||||||
|
args.n_gpu = 1
|
||||||
|
args.device = device
|
||||||
|
|
||||||
|
# Setup logging
|
||||||
|
logging.basicConfig(format="%(asctime)s - %(levelname)s - %(name)s - %(message)s",
|
||||||
|
datefmt="%m/%d/%Y %H:%M:%S",
|
||||||
|
level=logging.INFO if args.local_rank in [-1, 0] else logging.WARN)
|
||||||
|
logger.warning("Process rank: %s, device: %s, n_gpu: %s, distributed training: %s, 16-bits training: %s",
|
||||||
|
args.local_rank, device, args.n_gpu, bool(args.local_rank != -1), args.fp16)
|
||||||
|
|
||||||
|
# Set seed
|
||||||
|
set_seed(args)
|
||||||
|
|
||||||
|
# Prepare CONLL-2003 task
|
||||||
|
labels = get_labels(args.labels)
|
||||||
|
num_labels = len(labels)
|
||||||
|
# Use cross entropy ignore index as padding label id so that only real label ids contribute to the loss later
|
||||||
|
pad_token_label_id = CrossEntropyLoss().ignore_index
|
||||||
|
|
||||||
|
# Load pretrained model and tokenizer
|
||||||
|
if args.local_rank not in [-1, 0]:
|
||||||
|
torch.distributed.barrier() # Make sure only the first process in distributed training will download model & vocab
|
||||||
|
|
||||||
|
args.model_type = args.model_type.lower()
|
||||||
|
config_class, model_class, tokenizer_class = MODEL_CLASSES[args.model_type]
|
||||||
|
config = config_class.from_pretrained(args.config_name if args.config_name else args.model_name_or_path,
|
||||||
|
num_labels=num_labels)
|
||||||
|
tokenizer = tokenizer_class.from_pretrained(args.tokenizer_name if args.tokenizer_name else args.model_name_or_path,
|
||||||
|
do_lower_case=args.do_lower_case)
|
||||||
|
model = model_class.from_pretrained(args.model_name_or_path, from_tf=bool(".ckpt" in args.model_name_or_path),
|
||||||
|
config=config)
|
||||||
|
|
||||||
|
if args.local_rank == 0:
|
||||||
|
torch.distributed.barrier() # Make sure only the first process in distributed training will download model & vocab
|
||||||
|
|
||||||
|
model.to(args.device)
|
||||||
|
|
||||||
|
logger.info("Training/evaluation parameters %s", args)
|
||||||
|
|
||||||
|
# Training
|
||||||
|
if args.do_train:
|
||||||
|
train_dataset = load_and_cache_examples(args, tokenizer, labels, pad_token_label_id, mode="train")
|
||||||
|
global_step, tr_loss = train(args, train_dataset, model, tokenizer, labels, pad_token_label_id)
|
||||||
|
logger.info(" global_step = %s, average loss = %s", global_step, tr_loss)
|
||||||
|
|
||||||
|
# Saving best-practices: if you use default names for the model, you can reload it using from_pretrained()
|
||||||
|
if args.do_train and (args.local_rank == -1 or torch.distributed.get_rank() == 0):
|
||||||
|
# Create output directory if needed
|
||||||
|
if not os.path.exists(args.output_dir) and args.local_rank in [-1, 0]:
|
||||||
|
os.makedirs(args.output_dir)
|
||||||
|
|
||||||
|
logger.info("Saving model checkpoint to %s", args.output_dir)
|
||||||
|
# Save a trained model, configuration and tokenizer using `save_pretrained()`.
|
||||||
|
# They can then be reloaded using `from_pretrained()`
|
||||||
|
model_to_save = model.module if hasattr(model, "module") else model # Take care of distributed/parallel training
|
||||||
|
model_to_save.save_pretrained(args.output_dir)
|
||||||
|
tokenizer.save_pretrained(args.output_dir)
|
||||||
|
|
||||||
|
# Good practice: save your training arguments together with the trained model
|
||||||
|
torch.save(args, os.path.join(args.output_dir, "training_args.bin"))
|
||||||
|
|
||||||
|
# Evaluation
|
||||||
|
results = {}
|
||||||
|
if args.do_eval and args.local_rank in [-1, 0]:
|
||||||
|
tokenizer = tokenizer_class.from_pretrained(args.output_dir, do_lower_case=args.do_lower_case)
|
||||||
|
checkpoints = [args.output_dir]
|
||||||
|
if args.eval_all_checkpoints:
|
||||||
|
checkpoints = list(os.path.dirname(c) for c in sorted(glob.glob(args.output_dir + "/**/" + WEIGHTS_NAME, recursive=True)))
|
||||||
|
logging.getLogger("transformers.modeling_utils").setLevel(logging.WARN) # Reduce logging
|
||||||
|
logger.info("Evaluate the following checkpoints: %s", checkpoints)
|
||||||
|
for checkpoint in checkpoints:
|
||||||
|
global_step = checkpoint.split("-")[-1] if len(checkpoints) > 1 else ""
|
||||||
|
model = model_class.from_pretrained(checkpoint)
|
||||||
|
model.to(args.device)
|
||||||
|
result, _ = evaluate(args, model, tokenizer, labels, pad_token_label_id, mode="dev", prefix=global_step)
|
||||||
|
if global_step:
|
||||||
|
result = {"{}_{}".format(global_step, k): v for k, v in result.items()}
|
||||||
|
results.update(result)
|
||||||
|
output_eval_file = os.path.join(args.output_dir, "eval_results.txt")
|
||||||
|
with open(output_eval_file, "w") as writer:
|
||||||
|
for key in sorted(results.keys()):
|
||||||
|
writer.write("{} = {}\n".format(key, str(results[key])))
|
||||||
|
|
||||||
|
if args.do_predict and args.local_rank in [-1, 0]:
|
||||||
|
tokenizer = tokenizer_class.from_pretrained(args.output_dir, do_lower_case=args.do_lower_case)
|
||||||
|
model = model_class.from_pretrained(args.output_dir)
|
||||||
|
model.to(args.device)
|
||||||
|
result, predictions = evaluate(args, model, tokenizer, labels, pad_token_label_id, mode="test")
|
||||||
|
# Save results
|
||||||
|
output_test_results_file = os.path.join(args.output_dir, "test_results.txt")
|
||||||
|
with open(output_test_results_file, "w") as writer:
|
||||||
|
for key in sorted(result.keys()):
|
||||||
|
writer.write("{} = {}\n".format(key, str(result[key])))
|
||||||
|
# Save predictions
|
||||||
|
output_test_predictions_file = os.path.join(args.output_dir, "test_predictions.txt")
|
||||||
|
with open(output_test_predictions_file, "w") as writer:
|
||||||
|
with open(os.path.join(args.data_dir, "test.txt"), "r") as f:
|
||||||
|
example_id = 0
|
||||||
|
for line in f:
|
||||||
|
if line.startswith("-DOCSTART-") or line == "" or line == "\n":
|
||||||
|
writer.write(line)
|
||||||
|
if not predictions[example_id]:
|
||||||
|
example_id += 1
|
||||||
|
elif predictions[example_id]:
|
||||||
|
output_line = line.split()[0] + " " + predictions[example_id].pop(0) + "\n"
|
||||||
|
writer.write(output_line)
|
||||||
|
else:
|
||||||
|
logger.warning("Maximum sequence length exceeded: No prediction for '%s'.", line.split()[0])
|
||||||
|
|
||||||
|
return results
|
||||||
|
|
||||||
|
|
||||||
|
if __name__ == "__main__":
|
||||||
|
main()
|
||||||
@@ -28,9 +28,13 @@ import torch
|
|||||||
from torch.utils.data import (DataLoader, RandomSampler, SequentialSampler,
|
from torch.utils.data import (DataLoader, RandomSampler, SequentialSampler,
|
||||||
TensorDataset)
|
TensorDataset)
|
||||||
from torch.utils.data.distributed import DistributedSampler
|
from torch.utils.data.distributed import DistributedSampler
|
||||||
from tqdm import tqdm, trange
|
|
||||||
|
|
||||||
from tensorboardX import SummaryWriter
|
try:
|
||||||
|
from torch.utils.tensorboard import SummaryWriter
|
||||||
|
except ImportError:
|
||||||
|
from tensorboardX import SummaryWriter
|
||||||
|
|
||||||
|
from tqdm import tqdm, trange
|
||||||
|
|
||||||
from transformers import (WEIGHTS_NAME, BertConfig,
|
from transformers import (WEIGHTS_NAME, BertConfig,
|
||||||
BertForQuestionAnswering, BertTokenizer,
|
BertForQuestionAnswering, BertTokenizer,
|
||||||
@@ -134,8 +138,8 @@ def train(args, train_dataset, model, tokenizer):
|
|||||||
model.train()
|
model.train()
|
||||||
batch = tuple(t.to(args.device) for t in batch)
|
batch = tuple(t.to(args.device) for t in batch)
|
||||||
inputs = {'input_ids': batch[0],
|
inputs = {'input_ids': batch[0],
|
||||||
'attention_mask': batch[1],
|
'attention_mask': batch[1],
|
||||||
'start_positions': batch[3],
|
'start_positions': batch[3],
|
||||||
'end_positions': batch[4]}
|
'end_positions': batch[4]}
|
||||||
if args.model_type != 'distilbert':
|
if args.model_type != 'distilbert':
|
||||||
inputs['token_type_ids'] = None if args.model_type == 'xlm' else batch[2]
|
inputs['token_type_ids'] = None if args.model_type == 'xlm' else batch[2]
|
||||||
@@ -153,13 +157,16 @@ def train(args, train_dataset, model, tokenizer):
|
|||||||
if args.fp16:
|
if args.fp16:
|
||||||
with amp.scale_loss(loss, optimizer) as scaled_loss:
|
with amp.scale_loss(loss, optimizer) as scaled_loss:
|
||||||
scaled_loss.backward()
|
scaled_loss.backward()
|
||||||
torch.nn.utils.clip_grad_norm_(amp.master_params(optimizer), args.max_grad_norm)
|
|
||||||
else:
|
else:
|
||||||
loss.backward()
|
loss.backward()
|
||||||
torch.nn.utils.clip_grad_norm_(model.parameters(), args.max_grad_norm)
|
|
||||||
|
|
||||||
tr_loss += loss.item()
|
tr_loss += loss.item()
|
||||||
if (step + 1) % args.gradient_accumulation_steps == 0:
|
if (step + 1) % args.gradient_accumulation_steps == 0:
|
||||||
|
if args.fp16:
|
||||||
|
torch.nn.utils.clip_grad_norm_(amp.master_params(optimizer), args.max_grad_norm)
|
||||||
|
else:
|
||||||
|
torch.nn.utils.clip_grad_norm_(model.parameters(), args.max_grad_norm)
|
||||||
|
|
||||||
optimizer.step()
|
optimizer.step()
|
||||||
scheduler.step() # Update learning rate schedule
|
scheduler.step() # Update learning rate schedule
|
||||||
model.zero_grad()
|
model.zero_grad()
|
||||||
@@ -477,6 +484,16 @@ def main():
|
|||||||
|
|
||||||
logger.info("Training/evaluation parameters %s", args)
|
logger.info("Training/evaluation parameters %s", args)
|
||||||
|
|
||||||
|
# Before we do anything with models, we want to ensure that we get fp16 execution of torch.einsum if args.fp16 is set.
|
||||||
|
# Otherwise it'll default to "promote" mode, and we'll get fp32 operations. Note that running `--fp16_opt_level="O2"` will
|
||||||
|
# remove the need for this code, but it is still valid.
|
||||||
|
if args.fp16:
|
||||||
|
try:
|
||||||
|
import apex
|
||||||
|
apex.amp.register_half_function(torch, 'einsum')
|
||||||
|
except ImportError:
|
||||||
|
raise ImportError("Please install apex from https://www.github.com/nvidia/apex to use fp16 training.")
|
||||||
|
|
||||||
# Training
|
# Training
|
||||||
if args.do_train:
|
if args.do_train:
|
||||||
train_dataset = load_and_cache_examples(args, tokenizer, evaluate=False, output_examples=False)
|
train_dataset = load_and_cache_examples(args, tokenizer, evaluate=False, output_examples=False)
|
||||||
|
|||||||
@@ -1,40 +1,63 @@
|
|||||||
|
import os
|
||||||
import tensorflow as tf
|
import tensorflow as tf
|
||||||
import tensorflow_datasets
|
import tensorflow_datasets
|
||||||
from transformers import BertTokenizer, TFBertForSequenceClassification, glue_convert_examples_to_features, BertForSequenceClassification
|
from transformers import BertTokenizer, TFBertForSequenceClassification, glue_convert_examples_to_features, BertForSequenceClassification
|
||||||
|
|
||||||
# Load dataset, tokenizer, model from pretrained model/vocabulary
|
# script parameters
|
||||||
|
BATCH_SIZE = 32
|
||||||
|
EVAL_BATCH_SIZE = BATCH_SIZE * 2
|
||||||
|
USE_XLA = False
|
||||||
|
USE_AMP = False
|
||||||
|
|
||||||
|
tf.config.optimizer.set_jit(USE_XLA)
|
||||||
|
tf.config.optimizer.set_experimental_options({"auto_mixed_precision": USE_AMP})
|
||||||
|
|
||||||
|
# Load tokenizer and model from pretrained model/vocabulary
|
||||||
tokenizer = BertTokenizer.from_pretrained('bert-base-cased')
|
tokenizer = BertTokenizer.from_pretrained('bert-base-cased')
|
||||||
model = TFBertForSequenceClassification.from_pretrained('bert-base-cased')
|
model = TFBertForSequenceClassification.from_pretrained('bert-base-cased')
|
||||||
data = tensorflow_datasets.load('glue/mrpc')
|
|
||||||
|
# Load dataset via TensorFlow Datasets
|
||||||
|
data, info = tensorflow_datasets.load('glue/mrpc', with_info=True)
|
||||||
|
train_examples = info.splits['train'].num_examples
|
||||||
|
valid_examples = info.splits['validation'].num_examples
|
||||||
|
|
||||||
# Prepare dataset for GLUE as a tf.data.Dataset instance
|
# Prepare dataset for GLUE as a tf.data.Dataset instance
|
||||||
train_dataset = glue_convert_examples_to_features(data['train'], tokenizer, 128, 'mrpc')
|
train_dataset = glue_convert_examples_to_features(data['train'], tokenizer, 128, 'mrpc')
|
||||||
valid_dataset = glue_convert_examples_to_features(data['validation'], tokenizer, 128, 'mrpc')
|
valid_dataset = glue_convert_examples_to_features(data['validation'], tokenizer, 128, 'mrpc')
|
||||||
train_dataset = train_dataset.shuffle(100).batch(32).repeat(2)
|
train_dataset = train_dataset.shuffle(128).batch(BATCH_SIZE).repeat(-1)
|
||||||
valid_dataset = valid_dataset.batch(64)
|
valid_dataset = valid_dataset.batch(EVAL_BATCH_SIZE)
|
||||||
|
|
||||||
# Prepare training: Compile tf.keras model with optimizer, loss and learning rate schedule
|
# Prepare training: Compile tf.keras model with optimizer, loss and learning rate schedule
|
||||||
optimizer = tf.keras.optimizers.Adam(learning_rate=3e-5, epsilon=1e-08, clipnorm=1.0)
|
opt = tf.keras.optimizers.Adam(learning_rate=3e-5, epsilon=1e-08)
|
||||||
|
if USE_AMP:
|
||||||
|
# loss scaling is currently required when using mixed precision
|
||||||
|
opt = tf.keras.mixed_precision.experimental.LossScaleOptimizer(opt, 'dynamic')
|
||||||
loss = tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True)
|
loss = tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True)
|
||||||
metric = tf.keras.metrics.SparseCategoricalAccuracy('accuracy')
|
metric = tf.keras.metrics.SparseCategoricalAccuracy('accuracy')
|
||||||
model.compile(optimizer=optimizer, loss=loss, metrics=[metric])
|
model.compile(optimizer=opt, loss=loss, metrics=[metric])
|
||||||
|
|
||||||
# Train and evaluate using tf.keras.Model.fit()
|
# Train and evaluate using tf.keras.Model.fit()
|
||||||
history = model.fit(train_dataset, epochs=2, steps_per_epoch=115,
|
train_steps = train_examples//BATCH_SIZE
|
||||||
validation_data=valid_dataset, validation_steps=7)
|
valid_steps = valid_examples//EVAL_BATCH_SIZE
|
||||||
|
|
||||||
|
history = model.fit(train_dataset, epochs=2, steps_per_epoch=train_steps,
|
||||||
|
validation_data=valid_dataset, validation_steps=valid_steps)
|
||||||
|
|
||||||
|
# Save TF2 model
|
||||||
|
os.makedirs('./save/', exist_ok=True)
|
||||||
|
model.save_pretrained('./save/')
|
||||||
|
|
||||||
# Load the TensorFlow model in PyTorch for inspection
|
# Load the TensorFlow model in PyTorch for inspection
|
||||||
model.save_pretrained('./save/')
|
|
||||||
pytorch_model = BertForSequenceClassification.from_pretrained('./save/', from_tf=True)
|
pytorch_model = BertForSequenceClassification.from_pretrained('./save/', from_tf=True)
|
||||||
|
|
||||||
# Quickly test a few predictions - MRPC is a paraphrasing task, let's see if our model learned the task
|
# Quickly test a few predictions - MRPC is a paraphrasing task, let's see if our model learned the task
|
||||||
sentence_0 = "This research was consistent with his findings."
|
sentence_0 = 'This research was consistent with his findings.'
|
||||||
sentence_1 = "His findings were compatible with this research."
|
sentence_1 = 'His findings were compatible with this research.'
|
||||||
sentence_2 = "His findings were not compatible with this research."
|
sentence_2 = 'His findings were not compatible with this research.'
|
||||||
inputs_1 = tokenizer.encode_plus(sentence_0, sentence_1, add_special_tokens=True, return_tensors='pt')
|
inputs_1 = tokenizer.encode_plus(sentence_0, sentence_1, add_special_tokens=True, return_tensors='pt')
|
||||||
inputs_2 = tokenizer.encode_plus(sentence_0, sentence_2, add_special_tokens=True, return_tensors='pt')
|
inputs_2 = tokenizer.encode_plus(sentence_0, sentence_2, add_special_tokens=True, return_tensors='pt')
|
||||||
|
|
||||||
pred_1 = pytorch_model(**inputs_1)[0].argmax().item()
|
pred_1 = pytorch_model(**inputs_1)[0].argmax().item()
|
||||||
pred_2 = pytorch_model(**inputs_2)[0].argmax().item()
|
pred_2 = pytorch_model(**inputs_2)[0].argmax().item()
|
||||||
print("sentence_1 is", "a paraphrase" if pred_1 else "not a paraphrase", "of sentence_0")
|
print('sentence_1 is', 'a paraphrase' if pred_1 else 'not a paraphrase', 'of sentence_0')
|
||||||
print("sentence_2 is", "a paraphrase" if pred_2 else "not a paraphrase", "of sentence_0")
|
print('sentence_2 is', 'a paraphrase' if pred_2 else 'not a paraphrase', 'of sentence_0')
|
||||||
|
|||||||
@@ -13,7 +13,7 @@
|
|||||||
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
|
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
|
||||||
# See the License for the specific language governing permissions and
|
# See the License for the specific language governing permissions and
|
||||||
# limitations under the License.
|
# limitations under the License.
|
||||||
""" BERT multiple choice fine-tuning: utilities to work with multiple choice tasks of reading comprehension """
|
""" Multiple choice fine-tuning: utilities to work with multiple choice tasks of reading comprehension """
|
||||||
|
|
||||||
from __future__ import absolute_import, division, print_function
|
from __future__ import absolute_import, division, print_function
|
||||||
|
|
||||||
@@ -26,6 +26,8 @@ import json
|
|||||||
import csv
|
import csv
|
||||||
import glob
|
import glob
|
||||||
import tqdm
|
import tqdm
|
||||||
|
from typing import List
|
||||||
|
from transformers import PreTrainedTokenizer
|
||||||
|
|
||||||
|
|
||||||
logger = logging.getLogger(__name__)
|
logger = logging.getLogger(__name__)
|
||||||
@@ -34,13 +36,13 @@ logger = logging.getLogger(__name__)
|
|||||||
class InputExample(object):
|
class InputExample(object):
|
||||||
"""A single training/test example for multiple choice"""
|
"""A single training/test example for multiple choice"""
|
||||||
|
|
||||||
def __init__(self, example_id, question, contexts, endings, label=None):
|
def __init__(self, example_id, question, contexts, endings, label=None):
|
||||||
"""Constructs a InputExample.
|
"""Constructs a InputExample.
|
||||||
|
|
||||||
Args:
|
Args:
|
||||||
example_id: Unique id for the example.
|
example_id: Unique id for the example.
|
||||||
contexts: list of str. The untokenized text of the first sequence (context of corresponding question).
|
contexts: list of str. The untokenized text of the first sequence (context of corresponding question).
|
||||||
question: string. The untokenized text of the second sequence (qustion).
|
question: string. The untokenized text of the second sequence (question).
|
||||||
endings: list of str. multiple choice's options. Its length must be equal to contexts' length.
|
endings: list of str. multiple choice's options. Its length must be equal to contexts' length.
|
||||||
label: (Optional) string. The label of the example. This should be
|
label: (Optional) string. The label of the example. This should be
|
||||||
specified for train and dev examples, but not for test examples.
|
specified for train and dev examples, but not for test examples.
|
||||||
@@ -66,7 +68,7 @@ class InputFeatures(object):
|
|||||||
'input_mask': input_mask,
|
'input_mask': input_mask,
|
||||||
'segment_ids': segment_ids
|
'segment_ids': segment_ids
|
||||||
}
|
}
|
||||||
for _, input_ids, input_mask, segment_ids in choices_features
|
for input_ids, input_mask, segment_ids in choices_features
|
||||||
]
|
]
|
||||||
self.label = label
|
self.label = label
|
||||||
|
|
||||||
@@ -192,7 +194,7 @@ class SwagProcessor(DataProcessor):
|
|||||||
return lines
|
return lines
|
||||||
|
|
||||||
|
|
||||||
def _create_examples(self, lines, type):
|
def _create_examples(self, lines: List[List[str]], type: str):
|
||||||
"""Creates examples for the training and dev sets."""
|
"""Creates examples for the training and dev sets."""
|
||||||
if type == "train" and lines[0][-1] != 'label':
|
if type == "train" and lines[0][-1] != 'label':
|
||||||
raise ValueError(
|
raise ValueError(
|
||||||
@@ -300,24 +302,18 @@ class ArcProcessor(DataProcessor):
|
|||||||
return examples
|
return examples
|
||||||
|
|
||||||
|
|
||||||
def convert_examples_to_features(examples, label_list, max_seq_length,
|
def convert_examples_to_features(
|
||||||
tokenizer,
|
examples: List[InputExample],
|
||||||
cls_token_at_end=False,
|
label_list: List[str],
|
||||||
cls_token='[CLS]',
|
max_length: int,
|
||||||
cls_token_segment_id=1,
|
tokenizer: PreTrainedTokenizer,
|
||||||
sep_token='[SEP]',
|
pad_token_segment_id=0,
|
||||||
sequence_a_segment_id=0,
|
pad_on_left=False,
|
||||||
sequence_b_segment_id=1,
|
pad_token=0,
|
||||||
sep_token_extra=False,
|
mask_padding_with_zero=True,
|
||||||
pad_token_segment_id=0,
|
) -> List[InputFeatures]:
|
||||||
pad_on_left=False,
|
"""
|
||||||
pad_token=0,
|
Loads a data file into a list of `InputFeatures`
|
||||||
mask_padding_with_zero=True):
|
|
||||||
""" Loads a data file into a list of `InputBatch`s
|
|
||||||
`cls_token_at_end` define the location of the CLS token:
|
|
||||||
- False (Default, BERT/XLM pattern): [CLS] + A + [SEP] + B + [SEP]
|
|
||||||
- True (XLNet/GPT pattern): A + [SEP] + B + [SEP] + [CLS]
|
|
||||||
`cls_token_segment_id` define the segment id associated to the CLS token (0 for BERT, 2 for XLNet)
|
|
||||||
"""
|
"""
|
||||||
|
|
||||||
label_map = {label : i for i, label in enumerate(label_list)}
|
label_map = {label : i for i, label in enumerate(label_list)}
|
||||||
@@ -328,125 +324,70 @@ def convert_examples_to_features(examples, label_list, max_seq_length,
|
|||||||
logger.info("Writing example %d of %d" % (ex_index, len(examples)))
|
logger.info("Writing example %d of %d" % (ex_index, len(examples)))
|
||||||
choices_features = []
|
choices_features = []
|
||||||
for ending_idx, (context, ending) in enumerate(zip(example.contexts, example.endings)):
|
for ending_idx, (context, ending) in enumerate(zip(example.contexts, example.endings)):
|
||||||
tokens_a = tokenizer.tokenize(context)
|
text_a = context
|
||||||
tokens_b = None
|
|
||||||
if example.question.find("_") != -1:
|
if example.question.find("_") != -1:
|
||||||
#this is for cloze question
|
# this is for cloze question
|
||||||
tokens_b = tokenizer.tokenize(example.question.replace("_", ending))
|
text_b = example.question.replace("_", ending)
|
||||||
else:
|
else:
|
||||||
tokens_b = tokenizer.tokenize(example.question + " " + ending)
|
text_b = example.question + " " + ending
|
||||||
# you can add seq token between quesiotn and ending. This does not make too much difference.
|
|
||||||
# tokens_b = tokenizer.tokenize(example.question)
|
|
||||||
# tokens_b += [sep_token]
|
|
||||||
# if sep_token_extra:
|
|
||||||
# tokens_b += [sep_token]
|
|
||||||
# tokens_b += tokenizer.tokenize(ending)
|
|
||||||
|
|
||||||
special_tokens_count = 4 if sep_token_extra else 3
|
inputs = tokenizer.encode_plus(
|
||||||
_truncate_seq_pair(tokens_a, tokens_b, max_seq_length - special_tokens_count)
|
text_a,
|
||||||
|
text_b,
|
||||||
|
add_special_tokens=True,
|
||||||
|
max_length=max_length,
|
||||||
|
)
|
||||||
|
if 'num_truncated_tokens' in inputs and inputs['num_truncated_tokens'] > 0:
|
||||||
|
logger.info('Attention! you are cropping tokens (swag task is ok). '
|
||||||
|
'If you are training ARC and RACE and you are popping question + options, '
|
||||||
|
'you need to try to use a bigger max seq length!')
|
||||||
|
|
||||||
# The convention in BERT is:
|
input_ids, token_type_ids = inputs["input_ids"], inputs["token_type_ids"]
|
||||||
# (a) For sequence pairs:
|
|
||||||
# tokens: [CLS] is this jack ##son ##ville ? [SEP] no it is not . [SEP]
|
|
||||||
# type_ids: 0 0 0 0 0 0 0 0 1 1 1 1 1 1
|
|
||||||
# (b) For single sequences:
|
|
||||||
# tokens: [CLS] the dog is hairy . [SEP]
|
|
||||||
# type_ids: 0 0 0 0 0 0 0
|
|
||||||
#
|
|
||||||
# Where "type_ids" are used to indicate whether this is the first
|
|
||||||
# sequence or the second sequence. The embedding vectors for `type=0` and
|
|
||||||
# `type=1` were learned during pre-training and are added to the wordpiece
|
|
||||||
# embedding vector (and position vector). This is not *strictly* necessary
|
|
||||||
# since the [SEP] token unambiguously separates the sequences, but it makes
|
|
||||||
# it easier for the model to learn the concept of sequences.
|
|
||||||
#
|
|
||||||
# For classification tasks, the first vector (corresponding to [CLS]) is
|
|
||||||
# used as as the "sentence vector". Note that this only makes sense because
|
|
||||||
# the entire model is fine-tuned.
|
|
||||||
tokens = tokens_a + [sep_token]
|
|
||||||
if sep_token_extra:
|
|
||||||
# roberta uses an extra separator b/w pairs of sentences
|
|
||||||
tokens += [sep_token]
|
|
||||||
|
|
||||||
segment_ids = [sequence_a_segment_id] * len(tokens)
|
|
||||||
|
|
||||||
if tokens_b:
|
|
||||||
tokens += tokens_b + [sep_token]
|
|
||||||
segment_ids += [sequence_b_segment_id] * (len(tokens_b) + 1)
|
|
||||||
|
|
||||||
if cls_token_at_end:
|
|
||||||
tokens = tokens + [cls_token]
|
|
||||||
segment_ids = segment_ids + [cls_token_segment_id]
|
|
||||||
else:
|
|
||||||
tokens = [cls_token] + tokens
|
|
||||||
segment_ids = [cls_token_segment_id] + segment_ids
|
|
||||||
|
|
||||||
input_ids = tokenizer.convert_tokens_to_ids(tokens)
|
|
||||||
|
|
||||||
# The mask has 1 for real tokens and 0 for padding tokens. Only real
|
# The mask has 1 for real tokens and 0 for padding tokens. Only real
|
||||||
# tokens are attended to.
|
# tokens are attended to.
|
||||||
input_mask = [1 if mask_padding_with_zero else 0] * len(input_ids)
|
attention_mask = [1 if mask_padding_with_zero else 0] * len(input_ids)
|
||||||
|
|
||||||
# Zero-pad up to the sequence length.
|
# Zero-pad up to the sequence length.
|
||||||
padding_length = max_seq_length - len(input_ids)
|
padding_length = max_length - len(input_ids)
|
||||||
if pad_on_left:
|
if pad_on_left:
|
||||||
input_ids = ([pad_token] * padding_length) + input_ids
|
input_ids = ([pad_token] * padding_length) + input_ids
|
||||||
input_mask = ([0 if mask_padding_with_zero else 1] * padding_length) + input_mask
|
attention_mask = ([0 if mask_padding_with_zero else 1] * padding_length) + attention_mask
|
||||||
segment_ids = ([pad_token_segment_id] * padding_length) + segment_ids
|
token_type_ids = ([pad_token_segment_id] * padding_length) + token_type_ids
|
||||||
else:
|
else:
|
||||||
input_ids = input_ids + ([pad_token] * padding_length)
|
input_ids = input_ids + ([pad_token] * padding_length)
|
||||||
input_mask = input_mask + ([0 if mask_padding_with_zero else 1] * padding_length)
|
attention_mask = attention_mask + ([0 if mask_padding_with_zero else 1] * padding_length)
|
||||||
segment_ids = segment_ids + ([pad_token_segment_id] * padding_length)
|
token_type_ids = token_type_ids + ([pad_token_segment_id] * padding_length)
|
||||||
|
|
||||||
|
assert len(input_ids) == max_length
|
||||||
|
assert len(attention_mask) == max_length
|
||||||
|
assert len(token_type_ids) == max_length
|
||||||
|
choices_features.append((input_ids, attention_mask, token_type_ids))
|
||||||
|
|
||||||
|
|
||||||
assert len(input_ids) == max_seq_length
|
|
||||||
assert len(input_mask) == max_seq_length
|
|
||||||
assert len(segment_ids) == max_seq_length
|
|
||||||
choices_features.append((tokens, input_ids, input_mask, segment_ids))
|
|
||||||
label = label_map[example.label]
|
label = label_map[example.label]
|
||||||
|
|
||||||
if ex_index < 2:
|
if ex_index < 2:
|
||||||
logger.info("*** Example ***")
|
logger.info("*** Example ***")
|
||||||
logger.info("race_id: {}".format(example.example_id))
|
logger.info("race_id: {}".format(example.example_id))
|
||||||
for choice_idx, (tokens, input_ids, input_mask, segment_ids) in enumerate(choices_features):
|
for choice_idx, (input_ids, attention_mask, token_type_ids) in enumerate(choices_features):
|
||||||
logger.info("choice: {}".format(choice_idx))
|
logger.info("choice: {}".format(choice_idx))
|
||||||
logger.info("tokens: {}".format(' '.join(tokens)))
|
|
||||||
logger.info("input_ids: {}".format(' '.join(map(str, input_ids))))
|
logger.info("input_ids: {}".format(' '.join(map(str, input_ids))))
|
||||||
logger.info("input_mask: {}".format(' '.join(map(str, input_mask))))
|
logger.info("attention_mask: {}".format(' '.join(map(str, attention_mask))))
|
||||||
logger.info("segment_ids: {}".format(' '.join(map(str, segment_ids))))
|
logger.info("token_type_ids: {}".format(' '.join(map(str, token_type_ids))))
|
||||||
logger.info("label: {}".format(label))
|
logger.info("label: {}".format(label))
|
||||||
|
|
||||||
features.append(
|
features.append(
|
||||||
InputFeatures(
|
InputFeatures(
|
||||||
example_id = example.example_id,
|
example_id=example.example_id,
|
||||||
choices_features = choices_features,
|
choices_features=choices_features,
|
||||||
label = label
|
label=label,
|
||||||
)
|
)
|
||||||
)
|
)
|
||||||
|
|
||||||
return features
|
return features
|
||||||
|
|
||||||
|
|
||||||
def _truncate_seq_pair(tokens_a, tokens_b, max_length):
|
|
||||||
"""Truncates a sequence pair in place to the maximum length."""
|
|
||||||
|
|
||||||
# This is a simple heuristic which will always truncate the longer sequence
|
|
||||||
# one token at a time. This makes more sense than truncating an equal percent
|
|
||||||
# of tokens from each, since if one sequence is very short then each token
|
|
||||||
# that's truncated likely contains more information than a longer sequence.
|
|
||||||
|
|
||||||
# However, since it is better not to remove tokens from the options and questions, you can choose to use a bigger
|
|
||||||
# length or only pop from context
|
|
||||||
while True:
|
|
||||||
total_length = len(tokens_a) + len(tokens_b)
|
|
||||||
if total_length <= max_length:
|
|
||||||
break
|
|
||||||
if len(tokens_a) > len(tokens_b):
|
|
||||||
tokens_a.pop()
|
|
||||||
else:
|
|
||||||
logger.info('Attention! you are removing from token_b (swag task is ok). '
|
|
||||||
'If you are training ARC and RACE (you are popping question + options), '
|
|
||||||
'you need to try to use a bigger max seq length!')
|
|
||||||
tokens_b.pop()
|
|
||||||
|
|
||||||
|
|
||||||
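The truncation heuristic in `_truncate_seq_pair` can be tried in isolation. The following is a minimal sketch that mirrors the loop above without the logging call; the function name `truncate_seq_pair` and the sample token lists are made up for illustration and are not part of the repository.

```python
def truncate_seq_pair(tokens_a, tokens_b, max_length):
    """Trim the pair in place, popping from the longer list one token at a time."""
    while len(tokens_a) + len(tokens_b) > max_length:
        if len(tokens_a) > len(tokens_b):
            tokens_a.pop()
        else:
            # On a tie the second sequence loses a token, as in the original loop.
            tokens_b.pop()


# Illustrative inputs (assumed, not taken from any dataset).
context = ["the", "dog", "is", "very", "hairy", "today"]
ending = ["no", "it", "is", "not"]
truncate_seq_pair(context, ending, max_length=7)
print(context, ending)
```

Because ties are resolved by popping from the second sequence, a long context paired with a short question and option can still cost tokens from the question once the lengths meet, which is the situation the log message above warns about.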
processors = {
|
processors = {
|
||||||
@@ -456,7 +397,7 @@ processors = {
|
|||||||
}
|
}
|
||||||
|
|
||||||
|
|
||||||
GLUE_TASKS_NUM_LABELS = {
|
MULTIPLE_CHOICE_TASKS_NUM_LABELS = {
|
||||||
"race": 4,
|
"race": 4,
|
||||||
"swag": 4,
|
"swag": 4,
|
||||||
"arc": 4
|
"arc": 4
|
||||||
|
|||||||
212
examples/utils_ner.py
Normal file
@@ -0,0 +1,212 @@
|
|||||||
|
# coding=utf-8
|
||||||
|
# Copyright 2018 The Google AI Language Team Authors and The HuggingFace Inc. team.
|
||||||
|
# Copyright (c) 2018, NVIDIA CORPORATION. All rights reserved.
|
||||||
|
#
|
||||||
|
# Licensed under the Apache License, Version 2.0 (the "License");
|
||||||
|
# you may not use this file except in compliance with the License.
|
||||||
|
# You may obtain a copy of the License at
|
||||||
|
#
|
||||||
|
# http://www.apache.org/licenses/LICENSE-2.0
|
||||||
|
#
|
||||||
|
# Unless required by applicable law or agreed to in writing, software
|
||||||
|
# distributed under the License is distributed on an "AS IS" BASIS,
|
||||||
|
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
|
||||||
|
# See the License for the specific language governing permissions and
|
||||||
|
# limitations under the License.
|
||||||
|
""" Named entity recognition fine-tuning: utilities to work with CoNLL-2003 task. """
|
||||||
|
|
||||||
|
from __future__ import absolute_import, division, print_function
|
||||||
|
|
||||||
|
import logging
|
||||||
|
import os
|
||||||
|
from io import open
|
||||||
|
|
||||||
|
logger = logging.getLogger(__name__)
|
||||||
|
|
||||||
|
|
||||||
|
class InputExample(object):
|
||||||
|
"""A single training/test example for token classification."""
|
||||||
|
|
||||||
|
def __init__(self, guid, words, labels):
|
||||||
|
"""Constructs an InputExample.
|
||||||
|
|
||||||
|
Args:
|
||||||
|
guid: Unique id for the example.
|
||||||
|
words: list. The words of the sequence.
|
||||||
|
labels: (Optional) list. The labels for each word of the sequence. This should be
|
||||||
|
specified for train and dev examples, but not for test examples.
|
||||||
|
"""
|
||||||
|
self.guid = guid
|
||||||
|
self.words = words
|
||||||
|
self.labels = labels
|
||||||
|
|
||||||
|
|
||||||
|
class InputFeatures(object):
|
||||||
|
"""A single set of features of data."""
|
||||||
|
|
||||||
|
def __init__(self, input_ids, input_mask, segment_ids, label_ids):
|
||||||
|
self.input_ids = input_ids
|
||||||
|
self.input_mask = input_mask
|
||||||
|
self.segment_ids = segment_ids
|
||||||
|
self.label_ids = label_ids
|
||||||
|
|
||||||
|
|
||||||
|
def read_examples_from_file(data_dir, mode):
|
||||||
|
file_path = os.path.join(data_dir, "{}.txt".format(mode))
|
||||||
|
guid_index = 1
|
||||||
|
examples = []
|
||||||
|
with open(file_path, encoding="utf-8") as f:
|
||||||
|
words = []
|
||||||
|
labels = []
|
||||||
|
for line in f:
|
||||||
|
if line.startswith("-DOCSTART-") or line == "" or line == "\n":
|
||||||
|
if words:
|
||||||
|
examples.append(InputExample(guid="{}-{}".format(mode, guid_index),
|
||||||
|
words=words,
|
||||||
|
labels=labels))
|
||||||
|
guid_index += 1
|
||||||
|
words = []
|
||||||
|
labels = []
|
||||||
|
else:
|
||||||
|
splits = line.split(" ")
|
||||||
|
words.append(splits[0])
|
||||||
|
if len(splits) > 1:
|
||||||
|
labels.append(splits[-1].replace("\n", ""))
|
||||||
|
else:
|
||||||
|
# Examples could have no label for mode = "test"
|
||||||
|
labels.append("O")
|
||||||
|
if words:
|
||||||
|
examples.append(InputExample(guid="{}-{}".format(mode, guid_index),
|
||||||
|
words=words,
|
||||||
|
labels=labels))
|
||||||
|
return examples
|
||||||
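As a quick illustration of the grouping logic in `read_examples_from_file`, here is a minimal sketch that applies the same sentence-boundary rule to an in-memory list of lines. The helper name `read_conll` and the sample lines are assumptions for illustration; unlike the function above, the sketch strips each line before splitting, so it is slightly more defensive about trailing whitespace.

```python
def read_conll(lines):
    """Group CoNLL-style lines into (words, labels) pairs, one pair per sentence."""
    sentences, words, labels = [], [], []
    for line in lines:
        # A document marker or a blank line closes the current sentence.
        if line.startswith("-DOCSTART-") or line.strip() == "":
            if words:
                sentences.append((words, labels))
                words, labels = [], []
        else:
            splits = line.strip().split(" ")
            words.append(splits[0])
            # Fall back to "O" when a line carries no label column.
            labels.append(splits[-1] if len(splits) > 1 else "O")
    if words:
        sentences.append((words, labels))
    return sentences


# Illustrative input in the four-column CoNLL-2003 layout (assumed sample).
SAMPLE_LINES = [
    "-DOCSTART- -X- -X- O\n",
    "\n",
    "EU NNP B-NP B-ORG\n",
    "rejects VBZ B-VP O\n",
    "\n",
    "Peter NNP B-NP B-PER\n",
]
sentences = read_conll(SAMPLE_LINES)
print(sentences)
```

Only the first and last columns are used, so the same logic works for files that keep just the word and its tag.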
|
|
||||||
|
|
||||||
|
def convert_examples_to_features(examples,
|
||||||
|
label_list,
|
||||||
|
max_seq_length,
|
||||||
|
tokenizer,
|
||||||
|
cls_token_at_end=False,
|
||||||
|
cls_token="[CLS]",
|
||||||
|
cls_token_segment_id=1,
|
||||||
|
sep_token="[SEP]",
|
||||||
|
sep_token_extra=False,
|
||||||
|
pad_on_left=False,
|
||||||
|
pad_token=0,
|
||||||
|
pad_token_segment_id=0,
|
||||||
|
pad_token_label_id=-1,
|
||||||
|
sequence_a_segment_id=0,
|
||||||
|
mask_padding_with_zero=True):
|
||||||
|
""" Converts a list of examples into a list of `InputFeatures`
|
||||||
|
`cls_token_at_end` defines the location of the CLS token:
|
||||||
|
- False (Default, BERT/XLM pattern): [CLS] + A + [SEP] + B + [SEP]
|
||||||
|
- True (XLNet/GPT pattern): A + [SEP] + B + [SEP] + [CLS]
|
||||||
|
`cls_token_segment_id` defines the segment id associated with the CLS token (0 for BERT, 2 for XLNet)
|
||||||
|
"""
|
||||||
|
|
||||||
|
label_map = {label: i for i, label in enumerate(label_list)}
|
||||||
|
|
||||||
|
features = []
|
||||||
|
for (ex_index, example) in enumerate(examples):
|
||||||
|
if ex_index % 10000 == 0:
|
||||||
|
logger.info("Writing example %d of %d", ex_index, len(examples))
|
||||||
|
|
||||||
|
tokens = []
|
||||||
|
label_ids = []
|
||||||
|
for word, label in zip(example.words, example.labels):
|
||||||
|
word_tokens = tokenizer.tokenize(word)
|
||||||
|
tokens.extend(word_tokens)
|
||||||
|
# Use the real label id for the first token of the word, and padding ids for the remaining tokens
|
||||||
|
label_ids.extend([label_map[label]] + [pad_token_label_id] * (len(word_tokens) - 1))
|
||||||
|
|
||||||
|
# Account for [CLS] and [SEP] with "- 2" and with "- 3" for RoBERTa.
|
||||||
|
special_tokens_count = 3 if sep_token_extra else 2
|
||||||
|
if len(tokens) > max_seq_length - special_tokens_count:
|
||||||
|
tokens = tokens[:(max_seq_length - special_tokens_count)]
|
||||||
|
label_ids = label_ids[:(max_seq_length - special_tokens_count)]
|
||||||
|
|
||||||
|
# The convention in BERT is:
|
||||||
|
# (a) For sequence pairs:
|
||||||
|
# tokens: [CLS] is this jack ##son ##ville ? [SEP] no it is not . [SEP]
|
||||||
|
# type_ids: 0 0 0 0 0 0 0 0 1 1 1 1 1 1
|
||||||
|
# (b) For single sequences:
|
||||||
|
# tokens: [CLS] the dog is hairy . [SEP]
|
||||||
|
# type_ids: 0 0 0 0 0 0 0
|
||||||
|
#
|
||||||
|
# Where "type_ids" are used to indicate whether this is the first
|
||||||
|
# sequence or the second sequence. The embedding vectors for `type=0` and
|
||||||
|
# `type=1` were learned during pre-training and are added to the wordpiece
|
||||||
|
# embedding vector (and position vector). This is not *strictly* necessary
|
||||||
|
# since the [SEP] token unambiguously separates the sequences, but it makes
|
||||||
|
# it easier for the model to learn the concept of sequences.
|
||||||
|
#
|
||||||
|
# For classification tasks, the first vector (corresponding to [CLS]) is
|
||||||
|
# used as the "sentence vector". Note that this only makes sense because
|
||||||
|
# the entire model is fine-tuned.
|
||||||
|
tokens += [sep_token]
|
||||||
|
label_ids += [pad_token_label_id]
|
||||||
|
if sep_token_extra:
|
||||||
|
# roberta uses an extra separator b/w pairs of sentences
|
||||||
|
tokens += [sep_token]
|
||||||
|
label_ids += [pad_token_label_id]
|
||||||
|
segment_ids = [sequence_a_segment_id] * len(tokens)
|
||||||
|
|
||||||
|
if cls_token_at_end:
|
||||||
|
tokens += [cls_token]
|
||||||
|
label_ids += [pad_token_label_id]
|
||||||
|
segment_ids += [cls_token_segment_id]
|
||||||
|
else:
|
||||||
|
tokens = [cls_token] + tokens
|
||||||
|
label_ids = [pad_token_label_id] + label_ids
|
||||||
|
segment_ids = [cls_token_segment_id] + segment_ids
|
||||||
|
|
||||||
|
input_ids = tokenizer.convert_tokens_to_ids(tokens)
|
||||||
|
|
||||||
|
# The mask has 1 for real tokens and 0 for padding tokens. Only real
|
||||||
|
# tokens are attended to.
|
||||||
|
input_mask = [1 if mask_padding_with_zero else 0] * len(input_ids)
|
||||||
|
|
||||||
|
# Zero-pad up to the sequence length.
|
||||||
|
padding_length = max_seq_length - len(input_ids)
|
||||||
|
if pad_on_left:
|
||||||
|
input_ids = ([pad_token] * padding_length) + input_ids
|
||||||
|
input_mask = ([0 if mask_padding_with_zero else 1] * padding_length) + input_mask
|
||||||
|
segment_ids = ([pad_token_segment_id] * padding_length) + segment_ids
|
||||||
|
label_ids = ([pad_token_label_id] * padding_length) + label_ids
|
||||||
|
else:
|
||||||
|
input_ids += ([pad_token] * padding_length)
|
||||||
|
input_mask += ([0 if mask_padding_with_zero else 1] * padding_length)
|
||||||
|
segment_ids += ([pad_token_segment_id] * padding_length)
|
||||||
|
label_ids += ([pad_token_label_id] * padding_length)
|
||||||
|
|
||||||
|
assert len(input_ids) == max_seq_length
|
||||||
|
assert len(input_mask) == max_seq_length
|
||||||
|
assert len(segment_ids) == max_seq_length
|
||||||
|
assert len(label_ids) == max_seq_length
|
||||||
|
|
||||||
|
if ex_index < 5:
|
||||||
|
logger.info("*** Example ***")
|
||||||
|
logger.info("guid: %s", example.guid)
|
||||||
|
logger.info("tokens: %s", " ".join([str(x) for x in tokens]))
|
||||||
|
logger.info("input_ids: %s", " ".join([str(x) for x in input_ids]))
|
||||||
|
logger.info("input_mask: %s", " ".join([str(x) for x in input_mask]))
|
||||||
|
logger.info("segment_ids: %s", " ".join([str(x) for x in segment_ids]))
|
||||||
|
logger.info("label_ids: %s", " ".join([str(x) for x in label_ids]))
|
||||||
|
|
||||||
|
features.append(
|
||||||
|
InputFeatures(input_ids=input_ids,
|
||||||
|
input_mask=input_mask,
|
||||||
|
segment_ids=segment_ids,
|
||||||
|
label_ids=label_ids))
|
||||||
|
return features
|
||||||
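The label alignment step in `convert_examples_to_features` (real label id on the first sub-token of each word, padding label id on the rest) can be sketched on its own. In the example below, `toy_tokenize` is a hypothetical stand-in for `tokenizer.tokenize` that cuts words into three-character pieces; it is not a real tokenizer, and `align_labels` is a name chosen for this sketch only.

```python
PAD_LABEL_ID = -1  # mirrors the default value of pad_token_label_id above


def toy_tokenize(word):
    """Hypothetical tokenizer: split a non-empty word into 3-character pieces."""
    pieces = [word[i:i + 3] for i in range(0, len(word), 3)]
    return [pieces[0]] + ["##" + piece for piece in pieces[1:]]


def align_labels(words, labels, label_map):
    """Expand word-level labels to sub-token level."""
    tokens, label_ids = [], []
    for word, label in zip(words, labels):
        word_tokens = toy_tokenize(word)
        tokens.extend(word_tokens)
        # Real id for the first sub-token, padding id for the continuation pieces.
        label_ids.extend([label_map[label]] + [PAD_LABEL_ID] * (len(word_tokens) - 1))
    return tokens, label_ids


label_map = {"O": 0, "B-LOC": 1}
tokens, label_ids = align_labels(["in", "Jacksonville"], ["O", "B-LOC"], label_map)
print(tokens)
print(label_ids)
```

Keeping one real label per word means predictions can later be read back at word level, provided the training loss and the evaluation code are set up to skip positions that carry the padding label id.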
|
|
||||||
|
|
||||||
|
def get_labels(path):
|
||||||
|
if path:
|
||||||
|
with open(path, "r") as f:
|
||||||
|
labels = f.read().splitlines()
|
||||||
|
if "O" not in labels:
|
||||||
|
labels = ["O"] + labels
|
||||||
|
return labels
|
||||||
|
else:
|
||||||
|
return ["O", "B-MISC", "I-MISC", "B-PER", "I-PER", "B-ORG", "I-ORG", "B-LOC", "I-LOC"]
|
||||||
48
requirements-dev.txt
Normal file
@@ -0,0 +1,48 @@
|
|||||||
|
absl-py==0.8.0
|
||||||
|
astor==0.8.0
|
||||||
|
atomicwrites==1.3.0
|
||||||
|
attrs==19.2.0
|
||||||
|
boto3==1.9.243
|
||||||
|
botocore==1.12.243
|
||||||
|
certifi==2019.9.11
|
||||||
|
chardet==3.0.4
|
||||||
|
Click==7.0
|
||||||
|
docutils==0.15.2
|
||||||
|
gast==0.2.2
|
||||||
|
google-pasta==0.1.7
|
||||||
|
grpcio==1.24.1
|
||||||
|
h5py==2.10.0
|
||||||
|
idna==2.8
|
||||||
|
importlib-metadata==0.23
|
||||||
|
jmespath==0.9.4
|
||||||
|
joblib==0.14.0
|
||||||
|
Keras-Applications==1.0.8
|
||||||
|
Keras-Preprocessing==1.1.0
|
||||||
|
Markdown==3.1.1
|
||||||
|
more-itertools==7.2.0
|
||||||
|
numpy==1.17.2
|
||||||
|
opt-einsum==3.1.0
|
||||||
|
packaging==19.2
|
||||||
|
pluggy==0.13.0
|
||||||
|
protobuf==3.10.0
|
||||||
|
py==1.8.0
|
||||||
|
pyparsing==2.4.2
|
||||||
|
pytest==5.2.1
|
||||||
|
python-dateutil==2.8.0
|
||||||
|
regex==2019.8.19
|
||||||
|
requests==2.22.0
|
||||||
|
s3transfer==0.2.1
|
||||||
|
sacremoses==0.0.35
|
||||||
|
sentencepiece==0.1.83
|
||||||
|
six==1.12.0
|
||||||
|
tensorboard==2.0.0
|
||||||
|
tensorflow==2.0.0
|
||||||
|
tensorflow-estimator==2.0.0
|
||||||
|
termcolor==1.1.0
|
||||||
|
torch==1.2.0
|
||||||
|
tqdm==4.36.1
|
||||||
|
urllib3==1.25.6
|
||||||
|
wcwidth==0.1.7
|
||||||
|
Werkzeug==0.16.0
|
||||||
|
wrapt==1.11.2
|
||||||
|
zipp==0.6.0
|
||||||
4
setup.py
@@ -3,7 +3,7 @@ Simple check list from AllenNLP repo: https://github.com/allenai/allennlp/blob/m
|
|||||||
|
|
||||||
To create the package for pypi.
|
To create the package for pypi.
|
||||||
|
|
||||||
1. Change the version in __init__.py and setup.py.
|
1. Change the version in __init__.py, setup.py as well as docs/source/conf.py.
|
||||||
|
|
||||||
2. Commit these changes with the message: "Release: VERSION"
|
2. Commit these changes with the message: "Release: VERSION"
|
||||||
|
|
||||||
@@ -38,7 +38,7 @@ from setuptools import find_packages, setup
|
|||||||
|
|
||||||
setup(
|
setup(
|
||||||
name="transformers",
|
name="transformers",
|
||||||
version="2.0.0",
|
version="2.1.1",
|
||||||
author="Thomas Wolf, Lysandre Debut, Victor Sanh, Julien Chaumond, Google AI Language Team Authors, Open AI team Authors, Facebook AI Authors, Carnegie Mellon University Authors",
|
author="Thomas Wolf, Lysandre Debut, Victor Sanh, Julien Chaumond, Google AI Language Team Authors, Open AI team Authors, Facebook AI Authors, Carnegie Mellon University Authors",
|
||||||
author_email="thomas@huggingface.co",
|
author_email="thomas@huggingface.co",
|
||||||
description="State-of-the-art Natural Language Processing for TensorFlow 2.0 and PyTorch",
|
description="State-of-the-art Natural Language Processing for TensorFlow 2.0 and PyTorch",
|
||||||
|
|||||||
@@ -1,4 +1,4 @@
|
|||||||
__version__ = "2.0.0"
|
__version__ = "2.1.1"
|
||||||
|
|
||||||
# Work around to update TensorFlow's absl.logging threshold which alters the
|
# Work around to update TensorFlow's absl.logging threshold which alters the
|
||||||
# default Python logging output behavior when present.
|
# default Python logging output behavior when present.
|
||||||
@@ -37,6 +37,7 @@ from .tokenization_bert import BertTokenizer, BasicTokenizer, WordpieceTokenizer
|
|||||||
from .tokenization_openai import OpenAIGPTTokenizer
|
from .tokenization_openai import OpenAIGPTTokenizer
|
||||||
from .tokenization_transfo_xl import (TransfoXLTokenizer, TransfoXLCorpus)
|
from .tokenization_transfo_xl import (TransfoXLTokenizer, TransfoXLCorpus)
|
||||||
from .tokenization_gpt2 import GPT2Tokenizer
|
from .tokenization_gpt2 import GPT2Tokenizer
|
||||||
|
from .tokenization_ctrl import CTRLTokenizer
|
||||||
from .tokenization_xlnet import XLNetTokenizer, SPIECE_UNDERLINE
|
from .tokenization_xlnet import XLNetTokenizer, SPIECE_UNDERLINE
|
||||||
from .tokenization_xlm import XLMTokenizer
|
from .tokenization_xlm import XLMTokenizer
|
||||||
from .tokenization_roberta import RobertaTokenizer
|
from .tokenization_roberta import RobertaTokenizer
|
||||||
@@ -49,7 +50,9 @@ from .configuration_bert import BertConfig, BERT_PRETRAINED_CONFIG_ARCHIVE_MAP
|
|||||||
from .configuration_openai import OpenAIGPTConfig, OPENAI_GPT_PRETRAINED_CONFIG_ARCHIVE_MAP
|
from .configuration_openai import OpenAIGPTConfig, OPENAI_GPT_PRETRAINED_CONFIG_ARCHIVE_MAP
|
||||||
from .configuration_transfo_xl import TransfoXLConfig, TRANSFO_XL_PRETRAINED_CONFIG_ARCHIVE_MAP
|
from .configuration_transfo_xl import TransfoXLConfig, TRANSFO_XL_PRETRAINED_CONFIG_ARCHIVE_MAP
|
||||||
from .configuration_gpt2 import GPT2Config, GPT2_PRETRAINED_CONFIG_ARCHIVE_MAP
|
from .configuration_gpt2 import GPT2Config, GPT2_PRETRAINED_CONFIG_ARCHIVE_MAP
|
||||||
|
from .configuration_ctrl import CTRLConfig, CTRL_PRETRAINED_CONFIG_ARCHIVE_MAP
|
||||||
from .configuration_xlnet import XLNetConfig, XLNET_PRETRAINED_CONFIG_ARCHIVE_MAP
|
from .configuration_xlnet import XLNetConfig, XLNET_PRETRAINED_CONFIG_ARCHIVE_MAP
|
||||||
|
from .configuration_ctrl import CTRLConfig, CTRL_PRETRAINED_CONFIG_ARCHIVE_MAP
|
||||||
from .configuration_xlm import XLMConfig, XLM_PRETRAINED_CONFIG_ARCHIVE_MAP
|
from .configuration_xlm import XLMConfig, XLM_PRETRAINED_CONFIG_ARCHIVE_MAP
|
||||||
from .configuration_roberta import RobertaConfig, ROBERTA_PRETRAINED_CONFIG_ARCHIVE_MAP
|
from .configuration_roberta import RobertaConfig, ROBERTA_PRETRAINED_CONFIG_ARCHIVE_MAP
|
||||||
from .configuration_distilbert import DistilBertConfig, DISTILBERT_PRETRAINED_CONFIG_ARCHIVE_MAP
|
from .configuration_distilbert import DistilBertConfig, DISTILBERT_PRETRAINED_CONFIG_ARCHIVE_MAP
|
||||||
@@ -73,6 +76,9 @@ if is_torch_available():
|
|||||||
from .modeling_gpt2 import (GPT2PreTrainedModel, GPT2Model,
|
from .modeling_gpt2 import (GPT2PreTrainedModel, GPT2Model,
|
||||||
GPT2LMHeadModel, GPT2DoubleHeadsModel,
|
GPT2LMHeadModel, GPT2DoubleHeadsModel,
|
||||||
load_tf_weights_in_gpt2, GPT2_PRETRAINED_MODEL_ARCHIVE_MAP)
|
load_tf_weights_in_gpt2, GPT2_PRETRAINED_MODEL_ARCHIVE_MAP)
|
||||||
|
from .modeling_ctrl import (CTRLPreTrainedModel, CTRLModel,
|
||||||
|
CTRLLMHeadModel,
|
||||||
|
CTRL_PRETRAINED_MODEL_ARCHIVE_MAP)
|
||||||
from .modeling_xlnet import (XLNetPreTrainedModel, XLNetModel, XLNetLMHeadModel,
|
from .modeling_xlnet import (XLNetPreTrainedModel, XLNetModel, XLNetLMHeadModel,
|
||||||
XLNetForSequenceClassification, XLNetForMultipleChoice,
|
XLNetForSequenceClassification, XLNetForMultipleChoice,
|
||||||
XLNetForQuestionAnsweringSimple, XLNetForQuestionAnswering,
|
XLNetForQuestionAnsweringSimple, XLNetForQuestionAnswering,
|
||||||
@@ -83,6 +89,7 @@ if is_torch_available():
|
|||||||
XLM_PRETRAINED_MODEL_ARCHIVE_MAP)
|
XLM_PRETRAINED_MODEL_ARCHIVE_MAP)
|
||||||
from .modeling_roberta import (RobertaForMaskedLM, RobertaModel,
|
from .modeling_roberta import (RobertaForMaskedLM, RobertaModel,
|
||||||
RobertaForSequenceClassification, RobertaForMultipleChoice,
|
RobertaForSequenceClassification, RobertaForMultipleChoice,
|
||||||
|
RobertaForTokenClassification,
|
||||||
ROBERTA_PRETRAINED_MODEL_ARCHIVE_MAP)
|
ROBERTA_PRETRAINED_MODEL_ARCHIVE_MAP)
|
||||||
from .modeling_distilbert import (DistilBertForMaskedLM, DistilBertModel,
|
from .modeling_distilbert import (DistilBertForMaskedLM, DistilBertModel,
|
||||||
DistilBertForSequenceClassification, DistilBertForQuestionAnswering,
|
DistilBertForSequenceClassification, DistilBertForQuestionAnswering,
|
||||||
@@ -105,60 +112,56 @@ if is_tf_available():
|
|||||||
TFBertForMaskedLM, TFBertForNextSentencePrediction,
|
TFBertForMaskedLM, TFBertForNextSentencePrediction,
|
||||||
TFBertForSequenceClassification, TFBertForMultipleChoice,
|
TFBertForSequenceClassification, TFBertForMultipleChoice,
|
||||||
TFBertForTokenClassification, TFBertForQuestionAnswering,
|
TFBertForTokenClassification, TFBertForQuestionAnswering,
|
||||||
load_bert_pt_weights_in_tf2,
|
|
||||||
TF_BERT_PRETRAINED_MODEL_ARCHIVE_MAP)
|
TF_BERT_PRETRAINED_MODEL_ARCHIVE_MAP)
|
||||||
|
|
||||||
from .modeling_tf_gpt2 import (TFGPT2PreTrainedModel, TFGPT2MainLayer,
|
from .modeling_tf_gpt2 import (TFGPT2PreTrainedModel, TFGPT2MainLayer,
|
||||||
TFGPT2Model, TFGPT2LMHeadModel, TFGPT2DoubleHeadsModel,
|
TFGPT2Model, TFGPT2LMHeadModel, TFGPT2DoubleHeadsModel,
|
||||||
load_gpt2_pt_weights_in_tf2,
|
|
||||||
TF_GPT2_PRETRAINED_MODEL_ARCHIVE_MAP)
|
TF_GPT2_PRETRAINED_MODEL_ARCHIVE_MAP)
|
||||||
|
|
||||||
from .modeling_tf_openai import (TFOpenAIGPTPreTrainedModel, TFOpenAIGPTMainLayer,
|
from .modeling_tf_openai import (TFOpenAIGPTPreTrainedModel, TFOpenAIGPTMainLayer,
|
||||||
TFOpenAIGPTModel, TFOpenAIGPTLMHeadModel, TFOpenAIGPTDoubleHeadsModel,
|
TFOpenAIGPTModel, TFOpenAIGPTLMHeadModel, TFOpenAIGPTDoubleHeadsModel,
|
||||||
load_openai_gpt_pt_weights_in_tf2,
|
|
||||||
TF_OPENAI_GPT_PRETRAINED_MODEL_ARCHIVE_MAP)
|
TF_OPENAI_GPT_PRETRAINED_MODEL_ARCHIVE_MAP)
|
||||||
|
|
||||||
from .modeling_tf_transfo_xl import (TFTransfoXLPreTrainedModel, TFTransfoXLMainLayer,
|
from .modeling_tf_transfo_xl import (TFTransfoXLPreTrainedModel, TFTransfoXLMainLayer,
|
||||||
TFTransfoXLModel, TFTransfoXLLMHeadModel,
|
TFTransfoXLModel, TFTransfoXLLMHeadModel,
|
||||||
load_transfo_xl_pt_weights_in_tf2,
|
|
||||||
TF_TRANSFO_XL_PRETRAINED_MODEL_ARCHIVE_MAP)
|
TF_TRANSFO_XL_PRETRAINED_MODEL_ARCHIVE_MAP)
|
||||||
|
|
||||||
from .modeling_tf_xlnet import (TFXLNetPreTrainedModel, TFXLNetMainLayer,
|
from .modeling_tf_xlnet import (TFXLNetPreTrainedModel, TFXLNetMainLayer,
|
||||||
TFXLNetModel, TFXLNetLMHeadModel,
|
TFXLNetModel, TFXLNetLMHeadModel,
|
||||||
TFXLNetForSequenceClassification,
|
TFXLNetForSequenceClassification,
|
||||||
TFXLNetForQuestionAnsweringSimple,
|
TFXLNetForQuestionAnsweringSimple,
|
||||||
load_xlnet_pt_weights_in_tf2,
|
|
||||||
TF_XLNET_PRETRAINED_MODEL_ARCHIVE_MAP)
|
TF_XLNET_PRETRAINED_MODEL_ARCHIVE_MAP)
|
||||||
|
|
||||||
from .modeling_tf_xlm import (TFXLMPreTrainedModel, TFXLMMainLayer,
|
from .modeling_tf_xlm import (TFXLMPreTrainedModel, TFXLMMainLayer,
|
||||||
TFXLMModel, TFXLMWithLMHeadModel,
|
TFXLMModel, TFXLMWithLMHeadModel,
|
||||||
TFXLMForSequenceClassification,
|
TFXLMForSequenceClassification,
|
||||||
TFXLMForQuestionAnsweringSimple,
|
TFXLMForQuestionAnsweringSimple,
|
||||||
load_xlm_pt_weights_in_tf2,
|
|
||||||
TF_XLM_PRETRAINED_MODEL_ARCHIVE_MAP)
|
TF_XLM_PRETRAINED_MODEL_ARCHIVE_MAP)
|
||||||
|
|
||||||
from .modeling_tf_roberta import (TFRobertaPreTrainedModel, TFRobertaMainLayer,
|
from .modeling_tf_roberta import (TFRobertaPreTrainedModel, TFRobertaMainLayer,
|
||||||
TFRobertaModel, TFRobertaForMaskedLM,
|
TFRobertaModel, TFRobertaForMaskedLM,
|
||||||
TFRobertaForSequenceClassification,
|
TFRobertaForSequenceClassification,
|
||||||
load_roberta_pt_weights_in_tf2,
|
TFRobertaForTokenClassification,
|
||||||
TF_ROBERTA_PRETRAINED_MODEL_ARCHIVE_MAP)
|
TF_ROBERTA_PRETRAINED_MODEL_ARCHIVE_MAP)
|
||||||
|
|
||||||
from .modeling_tf_distilbert import (TFDistilBertPreTrainedModel, TFDistilBertMainLayer,
|
from .modeling_tf_distilbert import (TFDistilBertPreTrainedModel, TFDistilBertMainLayer,
|
||||||
TFDistilBertModel, TFDistilBertForMaskedLM,
|
TFDistilBertModel, TFDistilBertForMaskedLM,
|
||||||
TFDistilBertForSequenceClassification,
|
TFDistilBertForSequenceClassification,
|
||||||
TFDistilBertForQuestionAnswering,
|
TFDistilBertForQuestionAnswering,
|
||||||
load_distilbert_pt_weights_in_tf2,
|
|
||||||
TF_DISTILBERT_PRETRAINED_MODEL_ARCHIVE_MAP)
|
TF_DISTILBERT_PRETRAINED_MODEL_ARCHIVE_MAP)
|
||||||
|
|
||||||
|
from .modeling_tf_ctrl import (TFCTRLPreTrainedModel, TFCTRLModel,
|
||||||
|
TFCTRLLMHeadModel,
|
||||||
|
TF_CTRL_PRETRAINED_MODEL_ARCHIVE_MAP)
|
||||||
|
|
||||||
# TF 2.0 <=> PyTorch conversion utilities
|
# TF 2.0 <=> PyTorch conversion utilities
|
||||||
if is_tf_available() and is_torch_available():
|
from .modeling_tf_pytorch_utils import (convert_tf_weight_name_to_pt_weight_name,
|
||||||
from .modeling_tf_pytorch_utils import (convert_tf_weight_name_to_pt_weight_name,
|
load_pytorch_checkpoint_in_tf2_model,
|
||||||
load_pytorch_checkpoint_in_tf2_model,
|
load_pytorch_weights_in_tf2_model,
|
||||||
load_pytorch_weights_in_tf2_model,
|
load_pytorch_model_in_tf2_model,
|
||||||
load_pytorch_model_in_tf2_model,
|
load_tf2_checkpoint_in_pytorch_model,
|
||||||
load_tf2_checkpoint_in_pytorch_model,
|
load_tf2_weights_in_pytorch_model,
|
||||||
load_tf2_weights_in_pytorch_model,
|
load_tf2_model_in_pytorch_model)
|
||||||
load_tf2_model_in_pytorch_model)
|
|
||||||
|
|
||||||
if not is_tf_available() and not is_torch_available():
|
if not is_tf_available() and not is_torch_available():
|
||||||
logger.warning("Neither PyTorch nor TensorFlow >= 2.0 have been found."
|
logger.warning("Neither PyTorch nor TensorFlow >= 2.0 have been found."
|
||||||
|
|||||||
@@ -26,6 +26,7 @@ from .configuration_xlnet import XLNetConfig
|
|||||||
from .configuration_xlm import XLMConfig
|
from .configuration_xlm import XLMConfig
|
||||||
from .configuration_roberta import RobertaConfig
|
from .configuration_roberta import RobertaConfig
|
||||||
from .configuration_distilbert import DistilBertConfig
|
from .configuration_distilbert import DistilBertConfig
|
||||||
|
from .configuration_ctrl import CTRLConfig
|
||||||
|
|
||||||
logger = logging.getLogger(__name__)
|
logger = logging.getLogger(__name__)
|
||||||
|
|
||||||
@@ -49,7 +50,7 @@ class AutoConfig(object):
|
|||||||
- contains `xlnet`: XLNetConfig (XLNet model)
|
- contains `xlnet`: XLNetConfig (XLNet model)
|
||||||
- contains `xlm`: XLMConfig (XLM model)
|
- contains `xlm`: XLMConfig (XLM model)
|
||||||
- contains `roberta`: RobertaConfig (RoBERTa model)
|
- contains `roberta`: RobertaConfig (RoBERTa model)
|
||||||
|
- contains `ctrl` : CTRLConfig (CTRL model)
|
||||||
This class cannot be instantiated using `__init__()` (throws an error).
|
This class cannot be instantiated using `__init__()` (throws an error).
|
||||||
"""
|
"""
|
||||||
def __init__(self):
|
def __init__(self):
|
||||||
@@ -71,7 +72,7 @@ class AutoConfig(object):
|
|||||||
- contains `xlnet`: XLNetConfig (XLNet model)
|
- contains `xlnet`: XLNetConfig (XLNet model)
|
||||||
- contains `xlm`: XLMConfig (XLM model)
|
- contains `xlm`: XLMConfig (XLM model)
|
||||||
- contains `roberta`: RobertaConfig (RoBERTa model)
|
- contains `roberta`: RobertaConfig (RoBERTa model)
|
||||||
|
- contains `ctrl` : CTRLConfig (CTRL model)
|
||||||
Params:
|
Params:
|
||||||
pretrained_model_name_or_path: either:
|
pretrained_model_name_or_path: either:
|
||||||
|
|
||||||
@@ -129,7 +130,8 @@ class AutoConfig(object):
|
|||||||
return XLNetConfig.from_pretrained(pretrained_model_name_or_path, **kwargs)
|
return XLNetConfig.from_pretrained(pretrained_model_name_or_path, **kwargs)
|
||||||
elif 'xlm' in pretrained_model_name_or_path:
|
elif 'xlm' in pretrained_model_name_or_path:
|
||||||
return XLMConfig.from_pretrained(pretrained_model_name_or_path, **kwargs)
|
return XLMConfig.from_pretrained(pretrained_model_name_or_path, **kwargs)
|
||||||
|
elif 'ctrl' in pretrained_model_name_or_path:
|
||||||
|
return CTRLConfig.from_pretrained(pretrained_model_name_or_path, **kwargs)
|
||||||
raise ValueError("Unrecognized model identifier in {}. Should contain one of "
|
raise ValueError("Unrecognized model identifier in {}. Should contain one of "
|
||||||
"'bert', 'openai-gpt', 'gpt2', 'transfo-xl', 'xlnet', "
|
"'bert', 'openai-gpt', 'gpt2', 'transfo-xl', 'xlnet', "
|
||||||
"'xlm', 'roberta'".format(pretrained_model_name_or_path))
|
"'xlm', 'roberta', 'ctrl'".format(pretrained_model_name_or_path))
|
||||||
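The substring-based dispatch that this hunk extends with a `ctrl` branch can be sketched as a small lookup. The sketch below is an illustration only: `pick_config_name` and the `PATTERNS` list are hypothetical, it returns class names as strings instead of calling `from_pretrained`, and the ordering shown is an assumption made for the example rather than a copy of the library's `elif` chain.

```python
# (pattern, config class name) pairs, checked in order. More specific patterns
# must come first: the string "roberta" also contains "bert".
PATTERNS = [
    ("roberta", "RobertaConfig"),
    ("bert", "BertConfig"),
    ("xlnet", "XLNetConfig"),
    ("xlm", "XLMConfig"),
    ("ctrl", "CTRLConfig"),
]


def pick_config_name(model_name, patterns=PATTERNS):
    """Return the config class name for the first pattern found in model_name."""
    for pattern, config_name in patterns:
        if pattern in model_name:
            return config_name
    raise ValueError(
        "Unrecognized model identifier in {}. Should contain one of {}".format(
            model_name, ", ".join(repr(p) for p, _ in patterns)
        )
    )


print(pick_config_name("ctrl"))
print(pick_config_name("roberta-base"))
print(pick_config_name("bert-base-uncased"))
```

Since matching is by substring, a local checkpoint directory whose path happens to contain one of the patterns would be dispatched by that pattern as well, so path naming matters with this style of lookup.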
|
|||||||
@@ -40,6 +40,8 @@ BERT_PRETRAINED_CONFIG_ARCHIVE_MAP = {
|
|||||||
'bert-large-uncased-whole-word-masking-finetuned-squad': "https://s3.amazonaws.com/models.huggingface.co/bert/bert-large-uncased-whole-word-masking-finetuned-squad-config.json",
|
'bert-large-uncased-whole-word-masking-finetuned-squad': "https://s3.amazonaws.com/models.huggingface.co/bert/bert-large-uncased-whole-word-masking-finetuned-squad-config.json",
|
||||||
'bert-large-cased-whole-word-masking-finetuned-squad': "https://s3.amazonaws.com/models.huggingface.co/bert/bert-large-cased-whole-word-masking-finetuned-squad-config.json",
|
'bert-large-cased-whole-word-masking-finetuned-squad': "https://s3.amazonaws.com/models.huggingface.co/bert/bert-large-cased-whole-word-masking-finetuned-squad-config.json",
|
||||||
'bert-base-cased-finetuned-mrpc': "https://s3.amazonaws.com/models.huggingface.co/bert/bert-base-cased-finetuned-mrpc-config.json",
|
'bert-base-cased-finetuned-mrpc': "https://s3.amazonaws.com/models.huggingface.co/bert/bert-base-cased-finetuned-mrpc-config.json",
|
||||||
|
'bert-base-german-dbmdz-cased': "https://s3.amazonaws.com/models.huggingface.co/bert/bert-base-german-dbmdz-cased-config.json",
|
||||||
|
'bert-base-german-dbmdz-uncased': "https://s3.amazonaws.com/models.huggingface.co/bert/bert-base-german-dbmdz-uncased-config.json",
|
||||||
}
|
}
|
||||||
|
|
||||||
|
|
||||||
|
|||||||
143
transformers/configuration_ctrl.py
Normal file
@@ -0,0 +1,143 @@
|
|||||||
|
# coding=utf-8
|
||||||
|
# Copyright 2018 Salesforce and HuggingFace Inc. team.
|
||||||
|
# Copyright (c) 2018, NVIDIA CORPORATION. All rights reserved.
|
||||||
|
# Licensed under the Apache License, Version 2.0 (the "License");
|
||||||
|
# you may not use this file except in compliance with the License.
|
||||||
|
# You may obtain a copy of the License at
|
||||||
|
#
|
||||||
|
# http://www.apache.org/licenses/LICENSE-2.0
|
||||||
|
#
|
||||||
|
# Unless required by applicable law or agreed to in writing, software
|
||||||
|
# distributed under the License is distributed on an "AS IS" BASIS,
|
||||||
|
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
|
||||||
|
# See the License for the specific language governing permissions and
|
||||||
|
# limitations under the License.
|
||||||
|
""" Salesforce CTRL configuration """
|
||||||
|
|
||||||
|
from __future__ import absolute_import, division, print_function, unicode_literals
|
||||||
|
|
||||||
|
import json
|
||||||
|
import logging
|
||||||
|
import sys
|
||||||
|
from io import open
|
||||||
|
|
||||||
|
from .configuration_utils import PretrainedConfig
|
||||||
|
|
||||||
|
logger = logging.getLogger(__name__)
|
||||||
|
|
||||||
|
CTRL_PRETRAINED_CONFIG_ARCHIVE_MAP = {"ctrl": "https://storage.googleapis.com/sf-ctrl/pytorch/ctrl-config.json"}
|
||||||
|
|
||||||
|
class CTRLConfig(PretrainedConfig):
|
||||||
|
"""Configuration class to store the configuration of a `CTRLModel`.
|
||||||
|
|
||||||
|
Args:
|
||||||
|
vocab_size_or_config_json_file: Vocabulary size of `input_ids` in `CTRLModel` or a configuration json file.
|
||||||
|
n_positions: Number of positional embeddings.
|
||||||
|
n_ctx: Size of the causal mask (usually same as n_positions).
|
||||||
|
dff: Size of the inner dimension of the FFN.
|
||||||
|
n_embd: Dimensionality of the embeddings and hidden states.
|
||||||
|
n_layer: Number of hidden layers in the Transformer encoder.
|
||||||
|
n_head: Number of attention heads for each attention layer in
|
||||||
|
the Transformer encoder.
|
||||||
|
layer_norm_epsilon: epsilon to use in the layer norm layers
|
||||||
|
resid_pdrop: The dropout probability for all fully connected
|
||||||
|
layers in the embeddings, encoder, and pooler.
|
||||||
|
attn_pdrop: The dropout ratio for the attention
|
||||||
|
probabilities.
|
||||||
|
embd_pdrop: The dropout ratio for the embeddings.
|
||||||
|
initializer_range: The stddev of the truncated_normal_initializer for
|
||||||
|
initializing all weight matrices.
|
||||||
|
"""
|
||||||
|
pretrained_config_archive_map = CTRL_PRETRAINED_CONFIG_ARCHIVE_MAP
|
||||||
|
|
||||||
|
def __init__(
|
||||||
|
self,
|
||||||
|
vocab_size_or_config_json_file=246534,
|
||||||
|
n_positions=256,
|
||||||
|
n_ctx=256,
|
||||||
|
n_embd=1280,
|
||||||
|
dff=8192,
|
||||||
|
n_layer=48,
|
||||||
|
n_head=16,
|
||||||
|
resid_pdrop=0.1,
|
||||||
|
embd_pdrop=0.1,
|
||||||
|
attn_pdrop=0.1,
|
||||||
|
layer_norm_epsilon=1e-6,
|
||||||
|
initializer_range=0.02,
|
||||||
|
|
||||||
|
num_labels=1,
|
||||||
|
summary_type='cls_index',
|
||||||
|
summary_use_proj=True,
|
||||||
|
summary_activation=None,
|
||||||
|
summary_proj_to_labels=True,
|
||||||
|
summary_first_dropout=0.1,
|
||||||
|
**kwargs
|
||||||
|
):
|
||||||
|
"""Constructs CTRLConfig.
|
||||||
|
|
||||||
|
Args:
|
||||||
|
vocab_size_or_config_json_file: Vocabulary size of `input_ids` in `CTRLModel` or a configuration json file.
|
||||||
|
n_positions: Number of positional embeddings.
|
||||||
|
n_ctx: Size of the causal mask (usually same as n_positions).
|
||||||
|
dff: Size of the inner dimension of the FFN.
|
||||||
|
n_embd: Dimensionality of the embeddings and hidden states.
|
||||||
|
n_layer: Number of hidden layers in the Transformer encoder.
|
||||||
|
n_head: Number of attention heads for each attention layer in
|
||||||
|
the Transformer encoder.
|
||||||
|
layer_norm_epsilon: epsilon to use in the layer norm layers
|
||||||
|
resid_pdrop: The dropout probability for all fully connected
|
||||||
|
layers in the embeddings, encoder, and pooler.
|
||||||
|
attn_pdrop: The dropout ratio for the attention
|
||||||
|
probabilities.
|
||||||
|
embd_pdrop: The dropout ratio for the embeddings.
|
||||||
|
initializer_range: The stddev of the truncated_normal_initializer for
|
||||||
|
initializing all weight matrices.
|
||||||
|
"""
|
||||||
|
super(CTRLConfig, self).__init__(**kwargs)
|
||||||
|
|
||||||
|
self.vocab_size = vocab_size_or_config_json_file if isinstance(vocab_size_or_config_json_file, int) else -1
|
||||||
|
self.n_ctx = n_ctx
|
||||||
|
self.n_positions = n_positions
|
||||||
|
self.n_embd = n_embd
|
||||||
|
self.n_layer = n_layer
|
||||||
|
self.n_head = n_head
|
||||||
|
self.dff = dff
|
||||||
|
self.resid_pdrop = resid_pdrop
|
||||||
|
self.embd_pdrop = embd_pdrop
|
||||||
|
self.attn_pdrop = attn_pdrop
|
||||||
|
self.layer_norm_epsilon = layer_norm_epsilon
|
||||||
|
self.initializer_range = initializer_range
|
||||||
|
|
||||||
|
self.num_labels = num_labels
|
||||||
|
self.summary_type = summary_type
|
||||||
|
self.summary_use_proj = summary_use_proj
|
||||||
|
self.summary_activation = summary_activation
|
||||||
|
self.summary_first_dropout = summary_first_dropout
|
||||||
|
self.summary_proj_to_labels = summary_proj_to_labels
|
||||||
|
if isinstance(vocab_size_or_config_json_file, str) or (sys.version_info[0] == 2
|
||||||
|
and isinstance(vocab_size_or_config_json_file, unicode)):
|
||||||
|
with open(vocab_size_or_config_json_file, "r", encoding="utf-8") as reader:
|
||||||
|
json_config = json.loads(reader.read())
|
||||||
|
for key, value in json_config.items():
|
||||||
|
self.__dict__[key] = value
|
||||||
|
elif not isinstance(vocab_size_or_config_json_file, int):
|
||||||
|
raise ValueError(
|
||||||
|
"First argument must be either a vocabulary size (int) "
|
||||||
|
"or the path to a pretrained model config file (str)"
|
||||||
|
)
|
||||||
|
|
||||||
|
@property
|
||||||
|
def max_position_embeddings(self):
|
||||||
|
return self.n_positions
|
||||||
|
|
||||||
|
@property
|
||||||
|
def hidden_size(self):
|
||||||
|
return self.n_embd
|
||||||
|
|
||||||
|
@property
|
||||||
|
def num_attention_heads(self):
|
||||||
|
return self.n_head
|
||||||
|
|
||||||
|
@property
|
||||||
|
def num_hidden_layers(self):
|
||||||
|
return self.n_layer
|
||||||
@@ -28,6 +28,7 @@ ROBERTA_PRETRAINED_CONFIG_ARCHIVE_MAP = {
|
|||||||
'roberta-base': "https://s3.amazonaws.com/models.huggingface.co/bert/roberta-base-config.json",
|
'roberta-base': "https://s3.amazonaws.com/models.huggingface.co/bert/roberta-base-config.json",
|
||||||
'roberta-large': "https://s3.amazonaws.com/models.huggingface.co/bert/roberta-large-config.json",
|
'roberta-large': "https://s3.amazonaws.com/models.huggingface.co/bert/roberta-large-config.json",
|
||||||
'roberta-large-mnli': "https://s3.amazonaws.com/models.huggingface.co/bert/roberta-large-mnli-config.json",
|
'roberta-large-mnli': "https://s3.amazonaws.com/models.huggingface.co/bert/roberta-large-mnli-config.json",
|
||||||
|
'distilroberta-base': "https://s3.amazonaws.com/models.huggingface.co/bert/distilroberta-base-config.json",
|
||||||
}
|
}
|
||||||
|
|
||||||
|
|
||||||
|
|||||||
@@ -53,7 +53,8 @@ class PretrainedConfig(object):
|
|||||||
self.num_labels = kwargs.pop('num_labels', 2)
|
self.num_labels = kwargs.pop('num_labels', 2)
|
||||||
self.output_attentions = kwargs.pop('output_attentions', False)
|
self.output_attentions = kwargs.pop('output_attentions', False)
|
||||||
self.output_hidden_states = kwargs.pop('output_hidden_states', False)
|
self.output_hidden_states = kwargs.pop('output_hidden_states', False)
|
||||||
self.torchscript = kwargs.pop('torchscript', False)
|
self.output_past = kwargs.pop('output_past', True) # Not used by all models
|
||||||
|
self.torchscript = kwargs.pop('torchscript', False) # Only used by PyTorch models
|
||||||
self.use_bfloat16 = kwargs.pop('use_bfloat16', False)
|
self.use_bfloat16 = kwargs.pop('use_bfloat16', False)
|
||||||
self.pruned_heads = kwargs.pop('pruned_heads', {})
|
self.pruned_heads = kwargs.pop('pruned_heads', {})
|
||||||
self.is_decoder = kwargs.pop('is_decoder', False)
|
self.is_decoder = kwargs.pop('is_decoder', False)
|
||||||
@@ -131,20 +132,19 @@ class PretrainedConfig(object):
|
|||||||
# redirect to the cache, if necessary
|
# redirect to the cache, if necessary
|
||||||
try:
|
try:
|
||||||
resolved_config_file = cached_path(config_file, cache_dir=cache_dir, force_download=force_download, proxies=proxies)
|
resolved_config_file = cached_path(config_file, cache_dir=cache_dir, force_download=force_download, proxies=proxies)
|
||||||
except EnvironmentError as e:
|
except EnvironmentError:
|
||||||
if pretrained_model_name_or_path in cls.pretrained_config_archive_map:
|
if pretrained_model_name_or_path in cls.pretrained_config_archive_map:
|
||||||
logger.error(
|
msg = "Couldn't reach server at '{}' to download pretrained model configuration file.".format(
|
||||||
"Couldn't reach server at '{}' to download pretrained model configuration file.".format(
|
config_file)
|
||||||
config_file))
|
|
||||||
else:
|
else:
|
||||||
logger.error(
|
msg = "Model name '{}' was not found in model name list ({}). " \
|
||||||
"Model name '{}' was not found in model name list ({}). "
|
"We assumed '{}' was a path or url to a configuration file named {} or " \
|
||||||
"We assumed '{}' was a path or url but couldn't find any file "
|
"a directory containing such a file but couldn't find any such file at this path or url.".format(
|
||||||
"associated to this path or url.".format(
|
|
||||||
pretrained_model_name_or_path,
|
pretrained_model_name_or_path,
|
||||||
', '.join(cls.pretrained_config_archive_map.keys()),
|
', '.join(cls.pretrained_config_archive_map.keys()),
|
||||||
config_file))
|
config_file, CONFIG_NAME)
|
||||||
raise e
|
raise EnvironmentError(msg)
|
||||||
|
|
||||||
if resolved_config_file == config_file:
|
if resolved_config_file == config_file:
|
||||||
logger.info("loading configuration file {}".format(config_file))
|
logger.info("loading configuration file {}".format(config_file))
|
||||||
else:
|
else:
|
||||||
@@ -155,7 +155,7 @@ class PretrainedConfig(object):
|
|||||||
config = cls.from_json_file(resolved_config_file)
|
config = cls.from_json_file(resolved_config_file)
|
||||||
|
|
||||||
if hasattr(config, 'pruned_heads'):
|
if hasattr(config, 'pruned_heads'):
|
||||||
config.pruned_heads = dict((int(key), set(value)) for key, value in config.pruned_heads.items())
|
config.pruned_heads = dict((int(key), value) for key, value in config.pruned_heads.items())
|
||||||
|
|
||||||
# Update config with kwargs if needed
|
# Update config with kwargs if needed
|
||||||
to_remove = []
|
to_remove = []
|
||||||
@@ -166,7 +166,7 @@ class PretrainedConfig(object):
|
|||||||
for key in to_remove:
|
for key in to_remove:
|
||||||
kwargs.pop(key, None)
|
kwargs.pop(key, None)
|
||||||
|
|
||||||
logger.info("Model config %s", config)
|
logger.info("Model config %s", str(config))
|
||||||
if return_unused_kwargs:
|
if return_unused_kwargs:
|
||||||
return config, kwargs
|
return config, kwargs
|
||||||
else:
|
else:
|
||||||
|
|||||||
@@ -24,14 +24,16 @@ import tensorflow as tf
|
|||||||
|
|
||||||
from transformers import is_torch_available, cached_path
|
from transformers import is_torch_available, cached_path
|
||||||
|
|
||||||
from transformers import (BertConfig, TFBertForPreTraining, TFBertForQuestionAnswering, TFBertForSequenceClassification, load_bert_pt_weights_in_tf2, BERT_PRETRAINED_CONFIG_ARCHIVE_MAP,
|
from transformers import (load_pytorch_checkpoint_in_tf2_model,
|
||||||
GPT2Config, TFGPT2LMHeadModel, load_gpt2_pt_weights_in_tf2, GPT2_PRETRAINED_CONFIG_ARCHIVE_MAP,
|
BertConfig, TFBertForPreTraining, TFBertForQuestionAnswering, TFBertForSequenceClassification, BERT_PRETRAINED_CONFIG_ARCHIVE_MAP,
|
||||||
XLNetConfig, TFXLNetLMHeadModel, load_xlnet_pt_weights_in_tf2, XLNET_PRETRAINED_CONFIG_ARCHIVE_MAP,
|
GPT2Config, TFGPT2LMHeadModel, GPT2_PRETRAINED_CONFIG_ARCHIVE_MAP,
|
||||||
XLMConfig, TFXLMWithLMHeadModel, load_xlm_pt_weights_in_tf2, XLM_PRETRAINED_CONFIG_ARCHIVE_MAP,
|
XLNetConfig, TFXLNetLMHeadModel, XLNET_PRETRAINED_CONFIG_ARCHIVE_MAP,
|
||||||
TransfoXLConfig, TFTransfoXLLMHeadModel, load_transfo_xl_pt_weights_in_tf2, TRANSFO_XL_PRETRAINED_CONFIG_ARCHIVE_MAP,
|
XLMConfig, TFXLMWithLMHeadModel, XLM_PRETRAINED_CONFIG_ARCHIVE_MAP,
|
||||||
OpenAIGPTConfig, TFOpenAIGPTLMHeadModel, load_openai_gpt_pt_weights_in_tf2, OPENAI_GPT_PRETRAINED_CONFIG_ARCHIVE_MAP,
|
TransfoXLConfig, TFTransfoXLLMHeadModel, TRANSFO_XL_PRETRAINED_CONFIG_ARCHIVE_MAP,
|
||||||
RobertaConfig, TFRobertaForMaskedLM, TFRobertaForSequenceClassification, load_roberta_pt_weights_in_tf2, ROBERTA_PRETRAINED_CONFIG_ARCHIVE_MAP,
|
OpenAIGPTConfig, TFOpenAIGPTLMHeadModel, OPENAI_GPT_PRETRAINED_CONFIG_ARCHIVE_MAP,
|
||||||
DistilBertConfig, TFDistilBertForMaskedLM, TFDistilBertForQuestionAnswering, load_distilbert_pt_weights_in_tf2, DISTILBERT_PRETRAINED_CONFIG_ARCHIVE_MAP)
|
RobertaConfig, TFRobertaForMaskedLM, TFRobertaForSequenceClassification, ROBERTA_PRETRAINED_CONFIG_ARCHIVE_MAP,
|
||||||
|
DistilBertConfig, TFDistilBertForMaskedLM, TFDistilBertForQuestionAnswering, DISTILBERT_PRETRAINED_CONFIG_ARCHIVE_MAP,
|
||||||
|
CTRLConfig, TFCTRLLMHeadModel, CTRL_PRETRAINED_CONFIG_ARCHIVE_MAP)
|
||||||
|
|
||||||
if is_torch_available():
|
if is_torch_available():
|
||||||
import torch
|
import torch
|
||||||
@@ -43,7 +45,8 @@ if is_torch_available():
|
|||||||
TransfoXLLMHeadModel, TRANSFO_XL_PRETRAINED_MODEL_ARCHIVE_MAP,
|
TransfoXLLMHeadModel, TRANSFO_XL_PRETRAINED_MODEL_ARCHIVE_MAP,
|
||||||
OpenAIGPTLMHeadModel, OPENAI_GPT_PRETRAINED_MODEL_ARCHIVE_MAP,
|
OpenAIGPTLMHeadModel, OPENAI_GPT_PRETRAINED_MODEL_ARCHIVE_MAP,
|
||||||
RobertaForMaskedLM, RobertaForSequenceClassification, ROBERTA_PRETRAINED_MODEL_ARCHIVE_MAP,
|
RobertaForMaskedLM, RobertaForSequenceClassification, ROBERTA_PRETRAINED_MODEL_ARCHIVE_MAP,
|
||||||
DistilBertForMaskedLM, DistilBertForQuestionAnswering, DISTILBERT_PRETRAINED_MODEL_ARCHIVE_MAP)
|
DistilBertForMaskedLM, DistilBertForQuestionAnswering, DISTILBERT_PRETRAINED_MODEL_ARCHIVE_MAP,
|
||||||
|
CTRLLMHeadModel, CTRL_PRETRAINED_MODEL_ARCHIVE_MAP)
|
||||||
else:
|
else:
|
||||||
(BertForPreTraining, BertForQuestionAnswering, BertForSequenceClassification, BERT_PRETRAINED_MODEL_ARCHIVE_MAP,
|
(BertForPreTraining, BertForQuestionAnswering, BertForSequenceClassification, BERT_PRETRAINED_MODEL_ARCHIVE_MAP,
|
||||||
GPT2LMHeadModel, GPT2_PRETRAINED_MODEL_ARCHIVE_MAP,
|
GPT2LMHeadModel, GPT2_PRETRAINED_MODEL_ARCHIVE_MAP,
|
||||||
@@ -52,7 +55,8 @@ else:
|
|||||||
TransfoXLLMHeadModel, TRANSFO_XL_PRETRAINED_MODEL_ARCHIVE_MAP,
|
TransfoXLLMHeadModel, TRANSFO_XL_PRETRAINED_MODEL_ARCHIVE_MAP,
|
||||||
OpenAIGPTLMHeadModel, OPENAI_GPT_PRETRAINED_MODEL_ARCHIVE_MAP,
|
OpenAIGPTLMHeadModel, OPENAI_GPT_PRETRAINED_MODEL_ARCHIVE_MAP,
|
||||||
RobertaForMaskedLM, RobertaForSequenceClassification, ROBERTA_PRETRAINED_MODEL_ARCHIVE_MAP,
|
RobertaForMaskedLM, RobertaForSequenceClassification, ROBERTA_PRETRAINED_MODEL_ARCHIVE_MAP,
|
||||||
DistilBertForMaskedLM, DistilBertForQuestionAnswering, DISTILBERT_PRETRAINED_MODEL_ARCHIVE_MAP,) = (
|
DistilBertForMaskedLM, DistilBertForQuestionAnswering, DISTILBERT_PRETRAINED_MODEL_ARCHIVE_MAP,
|
||||||
|
CTRLLMHeadModel, CTRL_PRETRAINED_MODEL_ARCHIVE_MAP) = (
|
||||||
None, None, None, None,
|
None, None, None, None,
|
||||||
None, None,
|
None, None,
|
||||||
None, None,
|
None, None,
|
||||||
@@ -60,33 +64,35 @@ else:
|
|||||||
None, None,
|
None, None,
|
||||||
None, None,
|
None, None,
|
||||||
None, None, None,
|
None, None, None,
|
||||||
None, None, None,)
|
None, None, None,
|
||||||
|
None, None)
|
||||||
|
|
||||||
|
|
||||||
import logging
|
import logging
|
||||||
logging.basicConfig(level=logging.INFO)
|
logging.basicConfig(level=logging.INFO)
|
||||||
|
|
||||||
MODEL_CLASSES = {
|
MODEL_CLASSES = {
|
||||||
'bert': (BertConfig, TFBertForPreTraining, load_bert_pt_weights_in_tf2, BertForPreTraining, BERT_PRETRAINED_MODEL_ARCHIVE_MAP, BERT_PRETRAINED_CONFIG_ARCHIVE_MAP),
|
'bert': (BertConfig, TFBertForPreTraining, BertForPreTraining, BERT_PRETRAINED_MODEL_ARCHIVE_MAP, BERT_PRETRAINED_CONFIG_ARCHIVE_MAP),
|
||||||
'bert-large-uncased-whole-word-masking-finetuned-squad': (BertConfig, TFBertForQuestionAnswering, load_bert_pt_weights_in_tf2, BertForQuestionAnswering, BERT_PRETRAINED_MODEL_ARCHIVE_MAP, BERT_PRETRAINED_CONFIG_ARCHIVE_MAP),
|
'bert-large-uncased-whole-word-masking-finetuned-squad': (BertConfig, TFBertForQuestionAnswering, BertForQuestionAnswering, BERT_PRETRAINED_MODEL_ARCHIVE_MAP, BERT_PRETRAINED_CONFIG_ARCHIVE_MAP),
|
||||||
'bert-large-cased-whole-word-masking-finetuned-squad': (BertConfig, TFBertForQuestionAnswering, load_bert_pt_weights_in_tf2, BertForQuestionAnswering, BERT_PRETRAINED_MODEL_ARCHIVE_MAP, BERT_PRETRAINED_CONFIG_ARCHIVE_MAP),
|
'bert-large-cased-whole-word-masking-finetuned-squad': (BertConfig, TFBertForQuestionAnswering, BertForQuestionAnswering, BERT_PRETRAINED_MODEL_ARCHIVE_MAP, BERT_PRETRAINED_CONFIG_ARCHIVE_MAP),
|
||||||
'bert-base-cased-finetuned-mrpc': (BertConfig, TFBertForSequenceClassification, load_bert_pt_weights_in_tf2, BertForSequenceClassification, BERT_PRETRAINED_MODEL_ARCHIVE_MAP, BERT_PRETRAINED_CONFIG_ARCHIVE_MAP),
|
'bert-base-cased-finetuned-mrpc': (BertConfig, TFBertForSequenceClassification, BertForSequenceClassification, BERT_PRETRAINED_MODEL_ARCHIVE_MAP, BERT_PRETRAINED_CONFIG_ARCHIVE_MAP),
|
||||||
'gpt2': (GPT2Config, TFGPT2LMHeadModel, load_gpt2_pt_weights_in_tf2, GPT2LMHeadModel, GPT2_PRETRAINED_MODEL_ARCHIVE_MAP, GPT2_PRETRAINED_CONFIG_ARCHIVE_MAP),
|
'gpt2': (GPT2Config, TFGPT2LMHeadModel, GPT2LMHeadModel, GPT2_PRETRAINED_MODEL_ARCHIVE_MAP, GPT2_PRETRAINED_CONFIG_ARCHIVE_MAP),
|
||||||
'xlnet': (XLNetConfig, TFXLNetLMHeadModel, load_xlnet_pt_weights_in_tf2, XLNetLMHeadModel, XLNET_PRETRAINED_MODEL_ARCHIVE_MAP, XLNET_PRETRAINED_CONFIG_ARCHIVE_MAP),
|
'xlnet': (XLNetConfig, TFXLNetLMHeadModel, XLNetLMHeadModel, XLNET_PRETRAINED_MODEL_ARCHIVE_MAP, XLNET_PRETRAINED_CONFIG_ARCHIVE_MAP),
|
||||||
'xlm': (XLMConfig, TFXLMWithLMHeadModel, load_xlm_pt_weights_in_tf2, XLMWithLMHeadModel, XLM_PRETRAINED_MODEL_ARCHIVE_MAP, XLM_PRETRAINED_CONFIG_ARCHIVE_MAP),
|
'xlm': (XLMConfig, TFXLMWithLMHeadModel, XLMWithLMHeadModel, XLM_PRETRAINED_MODEL_ARCHIVE_MAP, XLM_PRETRAINED_CONFIG_ARCHIVE_MAP),
|
||||||
'transfo-xl': (TransfoXLConfig, TFTransfoXLLMHeadModel, load_transfo_xl_pt_weights_in_tf2, TransfoXLLMHeadModel, TRANSFO_XL_PRETRAINED_MODEL_ARCHIVE_MAP, TRANSFO_XL_PRETRAINED_CONFIG_ARCHIVE_MAP),
|
'transfo-xl': (TransfoXLConfig, TFTransfoXLLMHeadModel, TransfoXLLMHeadModel, TRANSFO_XL_PRETRAINED_MODEL_ARCHIVE_MAP, TRANSFO_XL_PRETRAINED_CONFIG_ARCHIVE_MAP),
|
||||||
'openai-gpt': (OpenAIGPTConfig, TFOpenAIGPTLMHeadModel, load_openai_gpt_pt_weights_in_tf2, OpenAIGPTLMHeadModel, OPENAI_GPT_PRETRAINED_MODEL_ARCHIVE_MAP, OPENAI_GPT_PRETRAINED_CONFIG_ARCHIVE_MAP),
|
'openai-gpt': (OpenAIGPTConfig, TFOpenAIGPTLMHeadModel, OpenAIGPTLMHeadModel, OPENAI_GPT_PRETRAINED_MODEL_ARCHIVE_MAP, OPENAI_GPT_PRETRAINED_CONFIG_ARCHIVE_MAP),
|
||||||
'roberta': (RobertaConfig, TFRobertaForMaskedLM, load_roberta_pt_weights_in_tf2, RobertaForMaskedLM, ROBERTA_PRETRAINED_MODEL_ARCHIVE_MAP, ROBERTA_PRETRAINED_CONFIG_ARCHIVE_MAP),
|
'roberta': (RobertaConfig, TFRobertaForMaskedLM, RobertaForMaskedLM, ROBERTA_PRETRAINED_MODEL_ARCHIVE_MAP, ROBERTA_PRETRAINED_CONFIG_ARCHIVE_MAP),
|
||||||
'roberta-large-mnli': (RobertaConfig, TFRobertaForSequenceClassification, load_roberta_pt_weights_in_tf2, RobertaForSequenceClassification, ROBERTA_PRETRAINED_MODEL_ARCHIVE_MAP, ROBERTA_PRETRAINED_CONFIG_ARCHIVE_MAP),
|
'roberta-large-mnli': (RobertaConfig, TFRobertaForSequenceClassification, RobertaForSequenceClassification, ROBERTA_PRETRAINED_MODEL_ARCHIVE_MAP, ROBERTA_PRETRAINED_CONFIG_ARCHIVE_MAP),
|
||||||
'distilbert': (DistilBertConfig, TFDistilBertForMaskedLM, load_distilbert_pt_weights_in_tf2, DistilBertForMaskedLM, DISTILBERT_PRETRAINED_MODEL_ARCHIVE_MAP, DISTILBERT_PRETRAINED_CONFIG_ARCHIVE_MAP),
|
'distilbert': (DistilBertConfig, TFDistilBertForMaskedLM, DistilBertForMaskedLM, DISTILBERT_PRETRAINED_MODEL_ARCHIVE_MAP, DISTILBERT_PRETRAINED_CONFIG_ARCHIVE_MAP),
|
||||||
'distilbert-base-uncased-distilled-squad': (DistilBertConfig, TFDistilBertForQuestionAnswering, load_distilbert_pt_weights_in_tf2, DistilBertForQuestionAnswering, DISTILBERT_PRETRAINED_MODEL_ARCHIVE_MAP, DISTILBERT_PRETRAINED_CONFIG_ARCHIVE_MAP),
|
'distilbert-base-uncased-distilled-squad': (DistilBertConfig, TFDistilBertForQuestionAnswering, DistilBertForQuestionAnswering, DISTILBERT_PRETRAINED_MODEL_ARCHIVE_MAP, DISTILBERT_PRETRAINED_CONFIG_ARCHIVE_MAP),
|
||||||
|
'ctrl': (CTRLConfig, TFCTRLLMHeadModel, CTRLLMHeadModel, CTRL_PRETRAINED_MODEL_ARCHIVE_MAP, CTRL_PRETRAINED_CONFIG_ARCHIVE_MAP)
|
||||||
}
|
}
|
||||||
|
|
||||||
def convert_pt_checkpoint_to_tf(model_type, pytorch_checkpoint_path, config_file, tf_dump_path, compare_with_pt_model=False, use_cached_models=True):
|
def convert_pt_checkpoint_to_tf(model_type, pytorch_checkpoint_path, config_file, tf_dump_path, compare_with_pt_model=False, use_cached_models=True):
|
||||||
if model_type not in MODEL_CLASSES:
|
if model_type not in MODEL_CLASSES:
|
||||||
raise ValueError("Unrecognized model type, should be one of {}.".format(list(MODEL_CLASSES.keys())))
|
raise ValueError("Unrecognized model type, should be one of {}.".format(list(MODEL_CLASSES.keys())))
|
||||||
|
|
||||||
config_class, model_class, loading_fct, pt_model_class, aws_model_maps, aws_config_map = MODEL_CLASSES[model_type]
|
config_class, model_class, pt_model_class, aws_model_maps, aws_config_map = MODEL_CLASSES[model_type]
|
||||||
|
|
||||||
# Initialise TF model
|
# Initialise TF model
|
||||||
if config_file in aws_config_map:
|
if config_file in aws_config_map:
|
||||||
@@ -100,7 +106,8 @@ def convert_pt_checkpoint_to_tf(model_type, pytorch_checkpoint_path, config_file
|
|||||||
# Load weights from tf checkpoint
|
# Load weights from tf checkpoint
|
||||||
if pytorch_checkpoint_path in aws_model_maps:
|
if pytorch_checkpoint_path in aws_model_maps:
|
||||||
pytorch_checkpoint_path = cached_path(aws_model_maps[pytorch_checkpoint_path], force_download=not use_cached_models)
|
pytorch_checkpoint_path = cached_path(aws_model_maps[pytorch_checkpoint_path], force_download=not use_cached_models)
|
||||||
tf_model = loading_fct(tf_model, pytorch_checkpoint_path)
|
# Load PyTorch checkpoint in tf2 model:
|
||||||
|
tf_model = load_pytorch_checkpoint_in_tf2_model(tf_model, pytorch_checkpoint_path)
|
||||||
|
|
||||||
if compare_with_pt_model:
|
if compare_with_pt_model:
|
||||||
inputs_list = [[7, 6, 0, 0, 1], [1, 2, 3, 0, 0], [0, 0, 0, 4, 5]]
|
inputs_list = [[7, 6, 0, 0, 1], [1, 2, 3, 0, 0], [0, 0, 0, 4, 5]]
|
||||||
@@ -142,7 +149,7 @@ def convert_all_pt_checkpoints_to_tf(args_model_type, tf_dump_path, model_shortc
|
|||||||
if model_type not in MODEL_CLASSES:
|
if model_type not in MODEL_CLASSES:
|
||||||
raise ValueError("Unrecognized model type {}, should be one of {}.".format(model_type, list(MODEL_CLASSES.keys())))
|
raise ValueError("Unrecognized model type {}, should be one of {}.".format(model_type, list(MODEL_CLASSES.keys())))
|
||||||
|
|
||||||
config_class, model_class, loading_fct, pt_model_class, aws_model_maps, aws_config_map = MODEL_CLASSES[model_type]
|
config_class, model_class, pt_model_class, aws_model_maps, aws_config_map = MODEL_CLASSES[model_type]
|
||||||
|
|
||||||
if model_shortcut_names_or_path is None:
|
if model_shortcut_names_or_path is None:
|
||||||
model_shortcut_names_or_path = list(aws_model_maps.keys())
|
model_shortcut_names_or_path = list(aws_model_maps.keys())
|
||||||
@@ -173,10 +180,12 @@ def convert_all_pt_checkpoints_to_tf(args_model_type, tf_dump_path, model_shortc
|
|||||||
else:
|
else:
|
||||||
model_file = cached_path(model_shortcut_name, force_download=not use_cached_models)
|
model_file = cached_path(model_shortcut_name, force_download=not use_cached_models)
|
||||||
|
|
||||||
convert_pt_checkpoint_to_tf(model_type,
|
if os.path.isfile(model_shortcut_name):
|
||||||
model_file,
|
model_shortcut_name = 'converted_model'
|
||||||
config_file,
|
convert_pt_checkpoint_to_tf(model_type=model_type,
|
||||||
os.path.join(tf_dump_path, model_shortcut_name + '-tf_model.h5'),
|
pytorch_checkpoint_path=model_file,
|
||||||
|
config_file=config_file,
|
||||||
|
tf_dump_path=os.path.join(tf_dump_path, model_shortcut_name + '-tf_model.h5'),
|
||||||
compare_with_pt_model=compare_with_pt_model)
|
compare_with_pt_model=compare_with_pt_model)
|
||||||
os.remove(config_file)
|
os.remove(config_file)
|
||||||
os.remove(model_file)
|
os.remove(model_file)
|
||||||
|
|||||||
@@ -23,15 +23,15 @@ import torch
|
|||||||
|
|
||||||
from fairseq.models.roberta import RobertaModel as FairseqRobertaModel
|
from fairseq.models.roberta import RobertaModel as FairseqRobertaModel
|
||||||
from fairseq.modules import TransformerSentenceEncoderLayer
|
from fairseq.modules import TransformerSentenceEncoderLayer
|
||||||
from transformers import (BertConfig, BertEncoder,
|
from transformers.modeling_bert import (BertConfig, BertEncoder,
|
||||||
BertIntermediate, BertLayer,
|
BertIntermediate, BertLayer,
|
||||||
BertModel, BertOutput,
|
BertModel, BertOutput,
|
||||||
BertSelfAttention,
|
BertSelfAttention,
|
||||||
BertSelfOutput)
|
BertSelfOutput)
|
||||||
from transformers import (RobertaEmbeddings,
|
from transformers.modeling_roberta import (RobertaEmbeddings,
|
||||||
RobertaForMaskedLM,
|
RobertaForMaskedLM,
|
||||||
RobertaForSequenceClassification,
|
RobertaForSequenceClassification,
|
||||||
RobertaModel)
|
RobertaModel)
|
||||||
|
|
||||||
logging.basicConfig(level=logging.INFO)
|
logging.basicConfig(level=logging.INFO)
|
||||||
logger = logging.getLogger(__name__)
|
logger = logging.getLogger(__name__)
|
||||||
|
|||||||
@@ -86,7 +86,6 @@ def glue_convert_examples_to_features(examples, tokenizer,
|
|||||||
example.text_b,
|
example.text_b,
|
||||||
add_special_tokens=True,
|
add_special_tokens=True,
|
||||||
max_length=max_length,
|
max_length=max_length,
|
||||||
truncate_first_sequence=True # We're truncating the first sequence in priority
|
|
||||||
)
|
)
|
||||||
input_ids, token_type_ids = inputs["input_ids"], inputs["token_type_ids"]
|
input_ids, token_type_ids = inputs["input_ids"], inputs["token_type_ids"]
|
||||||
|
|
||||||
|
|||||||
@@ -27,7 +27,7 @@ logger = logging.getLogger(__name__) # pylint: disable=invalid-name
|
|||||||
|
|
||||||
try:
|
try:
|
||||||
import tensorflow as tf
|
import tensorflow as tf
|
||||||
assert int(tf.__version__[0]) >= 2
|
assert hasattr(tf, '__version__') and int(tf.__version__[0]) >= 2
|
||||||
_tf_available = True # pylint: disable=invalid-name
|
_tf_available = True # pylint: disable=invalid-name
|
||||||
logger.info("TensorFlow version {} available.".format(tf.__version__))
|
logger.info("TensorFlow version {} available.".format(tf.__version__))
|
||||||
except (ImportError, AssertionError):
|
except (ImportError, AssertionError):
|
||||||
@@ -246,7 +246,7 @@ def http_get(url, temp_file, proxies=None):
|
|||||||
progress.close()
|
progress.close()
|
||||||
|
|
||||||
|
|
||||||
def get_from_cache(url, cache_dir=None, force_download=False, proxies=None):
|
def get_from_cache(url, cache_dir=None, force_download=False, proxies=None, etag_timeout=10):
|
||||||
"""
|
"""
|
||||||
Given a URL, look for the corresponding dataset in the local cache.
|
Given a URL, look for the corresponding dataset in the local cache.
|
||||||
If it's not there, download it. Then return the path to the cached file.
|
If it's not there, download it. Then return the path to the cached file.
|
||||||
@@ -266,12 +266,12 @@ def get_from_cache(url, cache_dir=None, force_download=False, proxies=None):
|
|||||||
etag = s3_etag(url, proxies=proxies)
|
etag = s3_etag(url, proxies=proxies)
|
||||||
else:
|
else:
|
||||||
try:
|
try:
|
||||||
response = requests.head(url, allow_redirects=True, proxies=proxies)
|
response = requests.head(url, allow_redirects=True, proxies=proxies, timeout=etag_timeout)
|
||||||
if response.status_code != 200:
|
if response.status_code != 200:
|
||||||
etag = None
|
etag = None
|
||||||
else:
|
else:
|
||||||
etag = response.headers.get("ETag")
|
etag = response.headers.get("ETag")
|
||||||
except EnvironmentError:
|
except (EnvironmentError, requests.exceptions.Timeout):
|
||||||
etag = None
|
etag = None
|
||||||
|
|
||||||
if sys.version_info[0] == 2 and etag is not None:
|
if sys.version_info[0] == 2 and etag is not None:
|
||||||
|
|||||||
@@ -21,6 +21,7 @@ import logging
|
|||||||
from .modeling_bert import BertModel, BertForMaskedLM, BertForSequenceClassification, BertForQuestionAnswering
|
from .modeling_bert import BertModel, BertForMaskedLM, BertForSequenceClassification, BertForQuestionAnswering
|
||||||
from .modeling_openai import OpenAIGPTModel, OpenAIGPTLMHeadModel
|
from .modeling_openai import OpenAIGPTModel, OpenAIGPTLMHeadModel
|
||||||
from .modeling_gpt2 import GPT2Model, GPT2LMHeadModel
|
from .modeling_gpt2 import GPT2Model, GPT2LMHeadModel
|
||||||
|
from .modeling_ctrl import CTRLModel, CTRLLMHeadModel
|
||||||
from .modeling_transfo_xl import TransfoXLModel, TransfoXLLMHeadModel
|
from .modeling_transfo_xl import TransfoXLModel, TransfoXLLMHeadModel
|
||||||
from .modeling_xlnet import XLNetModel, XLNetLMHeadModel, XLNetForSequenceClassification, XLNetForQuestionAnswering
|
from .modeling_xlnet import XLNetModel, XLNetLMHeadModel, XLNetForSequenceClassification, XLNetForQuestionAnswering
|
||||||
from .modeling_xlm import XLMModel, XLMWithLMHeadModel, XLMForSequenceClassification, XLMForQuestionAnswering
|
from .modeling_xlm import XLMModel, XLMWithLMHeadModel, XLMForSequenceClassification, XLMForQuestionAnswering
|
||||||
@@ -51,6 +52,7 @@ class AutoModel(object):
|
|||||||
- contains `bert`: BertModel (Bert model)
|
- contains `bert`: BertModel (Bert model)
|
||||||
- contains `openai-gpt`: OpenAIGPTModel (OpenAI GPT model)
|
- contains `openai-gpt`: OpenAIGPTModel (OpenAI GPT model)
|
||||||
- contains `gpt2`: GPT2Model (OpenAI GPT-2 model)
|
- contains `gpt2`: GPT2Model (OpenAI GPT-2 model)
|
||||||
|
- contains `ctrl`: CTRLModel (Salesforce CTRL model)
|
||||||
- contains `transfo-xl`: TransfoXLModel (Transformer-XL model)
|
- contains `transfo-xl`: TransfoXLModel (Transformer-XL model)
|
||||||
- contains `xlnet`: XLNetModel (XLNet model)
|
- contains `xlnet`: XLNetModel (XLNet model)
|
||||||
- contains `xlm`: XLMModel (XLM model)
|
- contains `xlm`: XLMModel (XLM model)
|
||||||
@@ -73,6 +75,7 @@ class AutoModel(object):
|
|||||||
- contains `bert`: BertModel (Bert model)
|
- contains `bert`: BertModel (Bert model)
|
||||||
- contains `openai-gpt`: OpenAIGPTModel (OpenAI GPT model)
|
- contains `openai-gpt`: OpenAIGPTModel (OpenAI GPT model)
|
||||||
- contains `gpt2`: GPT2Model (OpenAI GPT-2 model)
|
- contains `gpt2`: GPT2Model (OpenAI GPT-2 model)
|
||||||
|
- contains `ctrl`: CTRLModel (Salesforce CTRL model)
|
||||||
- contains `transfo-xl`: TransfoXLModel (Transformer-XL model)
|
- contains `transfo-xl`: TransfoXLModel (Transformer-XL model)
|
||||||
- contains `xlnet`: XLNetModel (XLNet model)
|
- contains `xlnet`: XLNetModel (XLNet model)
|
||||||
- contains `xlm`: XLMModel (XLM model)
|
- contains `xlm`: XLMModel (XLM model)
|
||||||
@@ -149,10 +152,11 @@ class AutoModel(object):
|
|||||||
return XLNetModel.from_pretrained(pretrained_model_name_or_path, *model_args, **kwargs)
|
return XLNetModel.from_pretrained(pretrained_model_name_or_path, *model_args, **kwargs)
|
||||||
elif 'xlm' in pretrained_model_name_or_path:
|
elif 'xlm' in pretrained_model_name_or_path:
|
||||||
return XLMModel.from_pretrained(pretrained_model_name_or_path, *model_args, **kwargs)
|
return XLMModel.from_pretrained(pretrained_model_name_or_path, *model_args, **kwargs)
|
||||||
|
elif 'ctrl' in pretrained_model_name_or_path:
|
||||||
|
return CTRLModel.from_pretrained(pretrained_model_name_or_path, *model_args, **kwargs)
|
||||||
raise ValueError("Unrecognized model identifier in {}. Should contains one of "
|
raise ValueError("Unrecognized model identifier in {}. Should contains one of "
|
||||||
"'bert', 'openai-gpt', 'gpt2', 'transfo-xl', 'xlnet', "
|
"'bert', 'openai-gpt', 'gpt2', 'transfo-xl', 'xlnet', "
|
||||||
"'xlm', 'roberta'".format(pretrained_model_name_or_path))
|
"'xlm', 'roberta', 'ctrl'".format(pretrained_model_name_or_path))
|
||||||
|
|
||||||
|
|
||||||
class AutoModelWithLMHead(object):
|
class AutoModelWithLMHead(object):
|
||||||
@@ -172,6 +176,7 @@ class AutoModelWithLMHead(object):
|
|||||||
- contains `bert`: BertForMaskedLM (Bert model)
|
- contains `bert`: BertForMaskedLM (Bert model)
|
||||||
- contains `openai-gpt`: OpenAIGPTLMHeadModel (OpenAI GPT model)
|
- contains `openai-gpt`: OpenAIGPTLMHeadModel (OpenAI GPT model)
|
||||||
- contains `gpt2`: GPT2LMHeadModel (OpenAI GPT-2 model)
|
- contains `gpt2`: GPT2LMHeadModel (OpenAI GPT-2 model)
|
||||||
|
- contains `ctrl`: CTRLLMHeadModel (Salesforce CTRL model)
|
||||||
- contains `transfo-xl`: TransfoXLLMHeadModel (Transformer-XL model)
|
- contains `transfo-xl`: TransfoXLLMHeadModel (Transformer-XL model)
|
||||||
- contains `xlnet`: XLNetLMHeadModel (XLNet model)
|
- contains `xlnet`: XLNetLMHeadModel (XLNet model)
|
||||||
- contains `xlm`: XLMWithLMHeadModel (XLM model)
|
- contains `xlm`: XLMWithLMHeadModel (XLM model)
|
||||||
@@ -273,10 +278,11 @@ class AutoModelWithLMHead(object):
|
|||||||
return XLNetLMHeadModel.from_pretrained(pretrained_model_name_or_path, *model_args, **kwargs)
|
return XLNetLMHeadModel.from_pretrained(pretrained_model_name_or_path, *model_args, **kwargs)
|
||||||
elif 'xlm' in pretrained_model_name_or_path:
|
elif 'xlm' in pretrained_model_name_or_path:
|
||||||
return XLMWithLMHeadModel.from_pretrained(pretrained_model_name_or_path, *model_args, **kwargs)
|
return XLMWithLMHeadModel.from_pretrained(pretrained_model_name_or_path, *model_args, **kwargs)
|
||||||
|
elif 'ctrl' in pretrained_model_name_or_path:
|
||||||
|
return CTRLLMHeadModel.from_pretrained(pretrained_model_name_or_path, *model_args, **kwargs)
|
||||||
raise ValueError("Unrecognized model identifier in {}. Should contains one of "
|
raise ValueError("Unrecognized model identifier in {}. Should contains one of "
|
||||||
"'bert', 'openai-gpt', 'gpt2', 'transfo-xl', 'xlnet', "
|
"'bert', 'openai-gpt', 'gpt2', 'transfo-xl', 'xlnet', "
|
||||||
"'xlm', 'roberta'".format(pretrained_model_name_or_path))
|
"'xlm', 'roberta', 'ctrl'".format(pretrained_model_name_or_path))
|
||||||
|
|
||||||
|
|
||||||
class AutoModelForSequenceClassification(object):
|
class AutoModelForSequenceClassification(object):
|
||||||
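The `from_pretrained` hunks above pick a class by testing substrings of the model name in a fixed order, so the first matching pattern wins. A minimal sketch of that dispatch, assuming a hypothetical `MODEL_PATTERNS` table and `resolve_model_class` helper (neither exists in the library; the class names are returned as plain strings so the example runs without `transformers` installed):

```python
# Hypothetical lookup table: (substring, class name) pairs, checked in order.
# The pairs mirror the AutoModel docstring; the table itself is illustrative.
MODEL_PATTERNS = [
    ("bert", "BertModel"),
    ("openai-gpt", "OpenAIGPTModel"),
    ("gpt2", "GPT2Model"),
    ("transfo-xl", "TransfoXLModel"),
    ("xlnet", "XLNetModel"),
    ("xlm", "XLMModel"),
    ("ctrl", "CTRLModel"),
]


def resolve_model_class(pretrained_model_name_or_path):
    """Return the class name for the first pattern found in the identifier."""
    for pattern, class_name in MODEL_PATTERNS:
        if pattern in pretrained_model_name_or_path:
            return class_name
    known = ", ".join(repr(pattern) for pattern, _ in MODEL_PATTERNS)
    raise ValueError(
        "Unrecognized model identifier in {}. Should contain one of {}".format(
            pretrained_model_name_or_path, known
        )
    )


print(resolve_model_class("ctrl"))
print(resolve_model_class("/models/xlnet-base-cased"))
```

Because matching is by substring, a path that happens to contain two patterns resolves to whichever one is tested first. Building the error message from the same table keeps the list of accepted identifiers from drifting out of sync with the branches.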
|
|||||||
@@ -46,6 +46,8 @@ BERT_PRETRAINED_MODEL_ARCHIVE_MAP = {
|
|||||||
'bert-large-uncased-whole-word-masking-finetuned-squad': "https://s3.amazonaws.com/models.huggingface.co/bert/bert-large-uncased-whole-word-masking-finetuned-squad-pytorch_model.bin",
|
'bert-large-uncased-whole-word-masking-finetuned-squad': "https://s3.amazonaws.com/models.huggingface.co/bert/bert-large-uncased-whole-word-masking-finetuned-squad-pytorch_model.bin",
|
||||||
'bert-large-cased-whole-word-masking-finetuned-squad': "https://s3.amazonaws.com/models.huggingface.co/bert/bert-large-cased-whole-word-masking-finetuned-squad-pytorch_model.bin",
|
'bert-large-cased-whole-word-masking-finetuned-squad': "https://s3.amazonaws.com/models.huggingface.co/bert/bert-large-cased-whole-word-masking-finetuned-squad-pytorch_model.bin",
|
||||||
'bert-base-cased-finetuned-mrpc': "https://s3.amazonaws.com/models.huggingface.co/bert/bert-base-cased-finetuned-mrpc-pytorch_model.bin",
|
'bert-base-cased-finetuned-mrpc': "https://s3.amazonaws.com/models.huggingface.co/bert/bert-base-cased-finetuned-mrpc-pytorch_model.bin",
|
||||||
|
'bert-base-german-dbmdz-cased': "https://s3.amazonaws.com/models.huggingface.co/bert/bert-base-german-dbmdz-cased-pytorch_model.bin",
|
||||||
|
'bert-base-german-dbmdz-uncased': "https://s3.amazonaws.com/models.huggingface.co/bert/bert-base-german-dbmdz-uncased-pytorch_model.bin",
|
||||||
}
|
}
|
||||||
|
|
||||||
|
|
||||||
@@ -1194,12 +1196,16 @@ class BertForQuestionAnswering(BertPreTrainedModel):
|
|||||||
Examples::
|
Examples::
|
||||||
|
|
||||||
tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
|
tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
|
||||||
model = BertForQuestionAnswering.from_pretrained('bert-base-uncased')
|
model = BertForQuestionAnswering.from_pretrained('bert-large-uncased-whole-word-masking-finetuned-squad')
|
||||||
input_ids = torch.tensor(tokenizer.encode("Hello, my dog is cute")).unsqueeze(0) # Batch size 1
|
question, text = "Who was Jim Henson?", "Jim Henson was a nice puppet"
|
||||||
start_positions = torch.tensor([1])
|
input_text = "[CLS] " + question + " [SEP] " + text + " [SEP]"
|
||||||
end_positions = torch.tensor([3])
|
input_ids = tokenizer.encode(input_text)
|
||||||
outputs = model(input_ids, start_positions=start_positions, end_positions=end_positions)
|
token_type_ids = [0 if i <= input_ids.index(102) else 1 for i in range(len(input_ids))]
|
||||||
loss, start_scores, end_scores = outputs[:2]
|
start_scores, end_scores = model(torch.tensor([input_ids]), token_type_ids=torch.tensor([token_type_ids]))
|
||||||
|
all_tokens = tokenizer.convert_ids_to_tokens(input_ids)
|
||||||
|
print(' '.join(all_tokens[torch.argmax(start_scores) : torch.argmax(end_scores)+1]))
|
||||||
|
# a nice puppet
|
||||||
|
|
||||||
|
|
||||||
"""
|
"""
|
||||||
def __init__(self, config):
|
def __init__(self, config):
|
||||||
|
|||||||
485
transformers/modeling_ctrl.py
Normal file
@@ -0,0 +1,485 @@
|
|||||||
|
# coding=utf-8
|
||||||
|
# Copyright 2018 Salesforce and HuggingFace Inc. team.
|
||||||
|
# Copyright (c) 2018, NVIDIA CORPORATION. All rights reserved.
|
||||||
|
#
|
||||||
|
# Licensed under the Apache License, Version 2.0 (the "License");
|
||||||
|
# you may not use this file except in compliance with the License.
|
||||||
|
# You may obtain a copy of the License at
|
||||||
|
#
|
||||||
|
# http://www.apache.org/licenses/LICENSE-2.0
|
||||||
|
#
|
||||||
|
# Unless required by applicable law or agreed to in writing, software
|
||||||
|
# distributed under the License is distributed on an "AS IS" BASIS,
|
||||||
|
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
|
||||||
|
# See the License for the specific language governing permissions and
|
||||||
|
# limitations under the License.
|
||||||
|
""" PyTorch CTRL model."""
|
||||||
|
|
||||||
|
from __future__ import absolute_import, division, print_function, unicode_literals
|
||||||
|
|
||||||
|
import collections
|
||||||
|
import json
|
||||||
|
import logging
|
||||||
|
import math
|
||||||
|
import os
|
||||||
|
import sys
|
||||||
|
from io import open
|
||||||
|
import numpy as np
|
||||||
|
import torch
|
||||||
|
import torch.nn as nn
|
||||||
|
from torch.nn import CrossEntropyLoss
|
||||||
|
from torch.nn.parameter import Parameter
|
||||||
|
|
||||||
|
from .modeling_utils import PreTrainedModel, Conv1D, prune_conv1d_layer, SequenceSummary
|
||||||
|
from .configuration_ctrl import CTRLConfig
|
||||||
|
from .file_utils import add_start_docstrings
|
||||||
|
|
||||||
|
logger = logging.getLogger(__name__)
|
||||||
|
|
||||||
|
CTRL_PRETRAINED_MODEL_ARCHIVE_MAP = {"ctrl": "https://storage.googleapis.com/sf-ctrl/pytorch/seqlen256_v1.bin"}
|
||||||
|
|
||||||
|
|
||||||
|
def angle_defn(pos, i, d_model_size):
|
||||||
|
angle_rates = 1 / torch.pow(10000, (2 * (i//2)) / d_model_size)
|
||||||
|
return pos * angle_rates
|
||||||
|
|
||||||
|
def positional_encoding(position, d_model_size, dtype):
|
||||||
|
# create the sinusoidal pattern for the positional encoding
|
||||||
|
angle_rads = (angle_defn(torch.arange(position, dtype=dtype).unsqueeze(1),
|
||||||
|
torch.arange(d_model_size, dtype=dtype).unsqueeze(0),
|
||||||
|
d_model_size))
|
||||||
|
|
||||||
|
sines = torch.sin(angle_rads[:, 0::2])
|
||||||
|
cosines = torch.cos(angle_rads[:, 1::2])
|
||||||
|
|
||||||
|
pos_encoding = torch.cat([sines, cosines], dim=-1)
|
||||||
|
return pos_encoding
|
||||||
|
|
||||||
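`positional_encoding` above builds a sinusoidal table in which the sine columns come first and the cosine columns follow, concatenated rather than interleaved. A plain-Python restatement of the same arithmetic, assuming only the standard library (it drops the `dtype` argument and returns nested lists instead of a tensor, so it is a sketch rather than a drop-in replacement):

```python
import math


def angle_rate(i, d_model_size):
    # Dimensions 2k and 2k+1 share one frequency: 1 / 10000^(2k / d_model).
    return 1.0 / (10000 ** ((2 * (i // 2)) / d_model_size))


def positional_encoding(num_positions, d_model_size):
    """Return a num_positions x d_model_size table as nested lists."""
    table = []
    for pos in range(num_positions):
        angles = [pos * angle_rate(i, d_model_size) for i in range(d_model_size)]
        sines = [math.sin(a) for a in angles[0::2]]    # even dimensions
        cosines = [math.cos(a) for a in angles[1::2]]  # odd dimensions
        table.append(sines + cosines)                  # concatenated, not interleaved
    return table


table = positional_encoding(4, 6)
print(table[0])  # position 0: every sine is 0.0 and every cosine is 1.0
```

Position 0 is a useful sanity check: all angles are zero, so the first half of the row is zeros and the second half is ones.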
|
def scaled_dot_product_attention(q, k, v, mask, attention_mask=None, head_mask=None):
|
||||||
|
# calculate attention
|
||||||
|
matmul_qk = torch.matmul(q, k.permute(0,1,3,2))
|
||||||
|
|
||||||
|
dk = k.shape[-1]
|
||||||
|
scaled_attention_logits = matmul_qk / np.sqrt(dk)
|
||||||
|
|
||||||
|
if mask is not None:
|
||||||
|
scaled_attention_logits += (mask * -1e4)
|
||||||
|
|
||||||
|
if attention_mask is not None:
|
||||||
|
# Apply the attention mask
|
||||||
|
scaled_attention_logits = scaled_attention_logits + attention_mask
|
||||||
|
|
||||||
|
attention_weights = torch.softmax(scaled_attention_logits, dim=-1)
|
||||||
|
|
||||||
|
# Mask heads if we want to
|
||||||
|
if head_mask is not None:
|
||||||
|
attention_weights = attention_weights * head_mask
|
||||||
|
|
||||||
|
output = torch.matmul(attention_weights, v)
|
||||||
|
|
||||||
|
return output, attention_weights
|
||||||
|
|
||||||
|
|
||||||
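`scaled_dot_product_attention` above adds `mask * -1e4` to the scaled logits before the softmax, which pushes the weight of future positions to (numerically) zero. A single-head, unbatched sketch in plain Python, assuming nested lists in place of tensors; the `causal` flag and the helper names are illustrative and not part of the diff:

```python
import math


def softmax(row):
    m = max(row)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in row]
    total = sum(exps)
    return [e / total for e in exps]


def scaled_dot_product_attention(q, k, v, causal=True):
    """q, k, v: lists of vectors, one per position. Returns (output, weights)."""
    dk = len(k[0])
    output, weights = [], []
    for i, q_i in enumerate(q):
        # Dot product of the query with every key, scaled by sqrt(dk).
        logits = [sum(a * b for a, b in zip(q_i, k_j)) / math.sqrt(dk) for k_j in k]
        if causal:
            # Same trick as the diff: add a large negative value to future positions.
            logits = [x + (-1e4 if j > i else 0.0) for j, x in enumerate(logits)]
        w = softmax(logits)
        weights.append(w)
        # Weighted sum of the value vectors.
        output.append([sum(w_j * v_j[d] for w_j, v_j in zip(w, v)) for d in range(len(v[0]))])
    return output, weights


q = k = [[1.0, 0.0], [0.0, 1.0]]
v = [[1.0, 2.0], [3.0, 4.0]]
out, attn = scaled_dot_product_attention(q, k, v)
print(attn[0])  # the first position can only attend to itself
```

The additive mask is applied before the softmax, so each row of weights still sums to one after masking.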
|
class MultiHeadAttention(torch.nn.Module):
|
||||||
|
def __init__(self, d_model_size, num_heads, output_attentions=False):
|
||||||
|
super(MultiHeadAttention, self).__init__()
|
||||||
|
self.output_attentions = output_attentions
|
||||||
|
self.num_heads = num_heads
|
||||||
|
self.d_model_size = d_model_size
|
||||||
|
|
||||||
|
self.depth = int(d_model_size / self.num_heads)
|
||||||
|
|
||||||
|
self.Wq = torch.nn.Linear(d_model_size, d_model_size)
|
||||||
|
self.Wk = torch.nn.Linear(d_model_size, d_model_size)
|
||||||
|
self.Wv = torch.nn.Linear(d_model_size, d_model_size)
|
||||||
|
|
||||||
|
self.dense = torch.nn.Linear(d_model_size, d_model_size)
|
||||||
|
|
||||||
|
def split_into_heads(self, x, batch_size):
|
||||||
|
x = x.reshape(batch_size, -1, self.num_heads, self.depth)
|
||||||
|
return x.permute([0, 2, 1, 3])
|
||||||
|
|
||||||
|
def forward(self, v, k, q, mask, layer_past=None, attention_mask=None, head_mask=None):
|
||||||
|
batch_size = q.shape[0]
|
||||||
|
|
||||||
|
q = self.Wq(q)
|
||||||
|
k = self.Wk(k)
|
||||||
|
v = self.Wv(v)
|
||||||
|
|
||||||
|
q = self.split_into_heads(q, batch_size)
|
||||||
|
k = self.split_into_heads(k, batch_size)
|
||||||
|
v = self.split_into_heads(v, batch_size)
|
||||||
|
if layer_past is not None:
|
||||||
|
past_key, past_value = layer_past[0], layer_past[1]
|
||||||
|
k = torch.cat((past_key, k), dim=-2)
|
||||||
|
v = torch.cat((past_value, v), dim=-2)
|
||||||
|
present = torch.stack((k, v))
|
||||||
|
|
||||||
|
output = scaled_dot_product_attention(q, k, v, mask, attention_mask, head_mask)
|
||||||
|
scaled_attention = output[0].permute([0, 2, 1, 3])
|
||||||
|
attn = output[1]
|
||||||
|
original_size_attention = scaled_attention.reshape(batch_size, -1, self.d_model_size)
|
||||||
|
output = self.dense(original_size_attention)
|
||||||
|
|
||||||
|
outputs = (output, present)
|
||||||
|
if self.output_attentions:
|
||||||
|
outputs = outputs + (attn,)
|
||||||
|
return outputs
|
||||||
|
|
||||||
|
|
||||||
|
|
||||||
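In `MultiHeadAttention.forward` above, `layer_past` holds the keys and values of earlier tokens, and the new keys and values are concatenated onto them along the sequence axis. `CTRLModel.forward` then starts the position ids at the cached length. A minimal sketch of that bookkeeping, assuming plain lists stand in for the key and value tensors (`append_to_cache` and `next_position_ids` are hypothetical helper names):

```python
def append_to_cache(layer_past, new_keys, new_values):
    """Return the updated (keys, values) pair for one layer."""
    if layer_past is not None:
        past_keys, past_values = layer_past
        new_keys = past_keys + new_keys        # concatenate along the sequence axis
        new_values = past_values + new_values
    return (new_keys, new_values)


def next_position_ids(layer_past, num_new_tokens):
    """Positions of the new tokens start where the cache ends."""
    past_length = 0 if layer_past is None else len(layer_past[0])
    return list(range(past_length, past_length + num_new_tokens))


# First call: two prompt tokens, no cache yet.
present = append_to_cache(None, ["k0", "k1"], ["v0", "v1"])
# Second call: one new token, decoded against the cached prompt.
print(next_position_ids(present, 1))
present = append_to_cache(present, ["k2"], ["v2"])
print(present[0])
```

Only the new token is passed through the layers on the second call, which is where the decoding speed-up comes from.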
|
def point_wise_feed_forward_network(d_model_size, dff):
|
||||||
|
return torch.nn.Sequential(torch.nn.Linear(d_model_size, dff),
|
||||||
|
torch.nn.ReLU(),
|
||||||
|
torch.nn.Linear(dff, d_model_size))
|
||||||
|
|
||||||
|
|
||||||
|
class EncoderLayer(torch.nn.Module):
|
||||||
|
def __init__(self, d_model_size, num_heads, dff, rate=0.1, output_attentions=False):
|
||||||
|
super(EncoderLayer, self).__init__()
|
||||||
|
|
||||||
|
self.multi_head_attention = MultiHeadAttention(d_model_size, num_heads, output_attentions)
|
||||||
|
self.ffn = point_wise_feed_forward_network(d_model_size, dff)
|
||||||
|
|
||||||
|
self.layernorm1 = torch.nn.LayerNorm(d_model_size, eps=1e-6)
|
||||||
|
self.layernorm2 = torch.nn.LayerNorm(d_model_size, eps=1e-6)
|
||||||
|
|
||||||
|
self.dropout1 = torch.nn.Dropout(rate)
|
||||||
|
self.dropout2 = torch.nn.Dropout(rate)
|
||||||
|
|
||||||
|
def forward(self, x, mask, layer_past=None, attention_mask=None, head_mask=None):
|
||||||
|
normed = self.layernorm1(x)
|
||||||
|
attn_outputs = self.multi_head_attention(normed, normed, normed, mask,
|
||||||
|
layer_past=layer_past,
|
||||||
|
attention_mask=attention_mask,
|
||||||
|
head_mask=head_mask)
|
||||||
|
attn_output = attn_outputs[0]
|
||||||
|
attn_output = self.dropout1(attn_output)
|
||||||
|
out1 = x + attn_output
|
||||||
|
|
||||||
|
out2 = self.layernorm2(out1)
|
||||||
|
ffn_output = self.ffn(out2)
|
||||||
|
ffn_output = self.dropout2(ffn_output)
|
||||||
|
out2 = out1 + ffn_output
|
||||||
|
|
||||||
|
outputs = (out2,) + attn_outputs[1:]
|
||||||
|
return outputs
|
||||||
|
|
||||||
|
|
||||||
|
class CTRLPreTrainedModel(PreTrainedModel):
|
||||||
|
""" An abstract class to handle weights initialization and
|
||||||
|
a simple interface for downloading and loading pretrained models.
|
||||||
|
"""
|
||||||
|
config_class = CTRLConfig
|
||||||
|
pretrained_model_archive_map = CTRL_PRETRAINED_MODEL_ARCHIVE_MAP
|
||||||
|
base_model_prefix = "transformer"
|
||||||
|
|
||||||
|
def _init_weights(self, module):
|
||||||
|
""" Initialize the weights.
|
||||||
|
"""
|
||||||
|
if isinstance(module, (nn.Linear, nn.Embedding, Conv1D)):
|
||||||
|
# Slightly different from the TF version which uses truncated_normal for initialization
|
||||||
|
# cf https://github.com/pytorch/pytorch/pull/5617
|
||||||
|
module.weight.data.normal_(mean=0.0, std=self.config.initializer_range)
|
||||||
|
if isinstance(module, (nn.Linear, Conv1D)) and module.bias is not None:
|
||||||
|
module.bias.data.zero_()
|
||||||
|
elif isinstance(module, nn.LayerNorm):
|
||||||
|
module.bias.data.zero_()
|
||||||
|
module.weight.data.fill_(1.0)
|
||||||
|
|
||||||
|
|
||||||
|
CTRL_START_DOCSTRING = r""" CTRL model was proposed in
|
||||||
|
`CTRL: A Conditional Transformer Language Model for Controllable Generation`_
|
||||||
|
by Nitish Shirish Keskar*, Bryan McCann*, Lav R. Varshney, Caiming Xiong and Richard Socher.
|
||||||
|
It's a causal (unidirectional) transformer pre-trained using language modeling on a very large
|
||||||
|
corpus of ~140 GB of text data with the first token reserved as a control code (such as Links, Books, Wikipedia etc.).
|
||||||
|
|
||||||
|
This model is a PyTorch `torch.nn.Module`_ sub-class. Use it as a regular PyTorch Module and
|
||||||
|
refer to the PyTorch documentation for all matters related to general usage and behavior.
|
||||||
|
|
||||||
|
.. _`CTRL: A Conditional Transformer Language Model for Controllable Generation`:
|
||||||
|
https://www.github.com/salesforce/ctrl
|
||||||
|
|
||||||
|
.. _`torch.nn.Module`:
|
||||||
|
https://pytorch.org/docs/stable/nn.html#module
|
||||||
|
|
||||||
|
Parameters:
|
||||||
|
config (:class:`~transformers.CTRLConfig`): Model configuration class with all the parameters of the model.
|
||||||
|
Initializing with a config file does not load the weights associated with the model, only the configuration.
|
||||||
|
Check out the :meth:`~transformers.PreTrainedModel.from_pretrained` method to load the model weights.
|
||||||
|
"""
|
||||||
|
|
||||||
|
CTRL_INPUTS_DOCSTRING = r""" Inputs:
|
||||||
|
**input_ids**: ``torch.LongTensor`` of shape ``(batch_size, sequence_length)``:
|
||||||
|
Indices of input sequence tokens in the vocabulary.
|
||||||
|
CTRL is a model with absolute position embeddings so it's usually advised to pad the inputs on
|
||||||
|
the right rather than the left.
|
||||||
|
Indices can be obtained using :class:`transformers.CTRLTokenizer`.
|
||||||
|
See :func:`transformers.PreTrainedTokenizer.encode` and
|
||||||
|
:func:`transformers.PreTrainedTokenizer.convert_tokens_to_ids` for details.
|
||||||
|
**past**:
|
||||||
|
list of ``torch.FloatTensor`` (one for each layer):
|
||||||
|
that contains pre-computed hidden-states (key and values in the attention blocks) as computed by the model
|
||||||
|
(see `past` output below). Can be used to speed up sequential decoding.
|
||||||
|
**attention_mask**: (`optional`) ``torch.FloatTensor`` of shape ``(batch_size, sequence_length)``:
|
||||||
|
Mask to avoid performing attention on padding token indices.
|
||||||
|
Mask values selected in ``[0, 1]``:
|
||||||
|
``1`` for tokens that are NOT MASKED, ``0`` for MASKED tokens.
|
||||||
|
**token_type_ids**: (`optional`) ``torch.LongTensor`` of shape ``(batch_size, sequence_length)``:
|
||||||
|
A parallel sequence of tokens (can be used to indicate various portions of the inputs).
|
||||||
|
The embeddings from these tokens will be summed with the respective token embeddings.
|
||||||
|
Indices are selected in the vocabulary (unlike BERT which has a specific vocabulary for segment indices).
|
||||||
|
**position_ids**: (`optional`) ``torch.LongTensor`` of shape ``(batch_size, sequence_length)``:
|
||||||
|
Indices of positions of each input sequence token in the position embeddings.
|
||||||
|
Selected in the range ``[0, config.max_position_embeddings - 1]``.
|
||||||
|
**head_mask**: (`optional`) ``torch.FloatTensor`` of shape ``(num_heads,)`` or ``(num_layers, num_heads)``:
|
||||||
|
Mask to nullify selected heads of the self-attention modules.
|
||||||
|
Mask values selected in ``[0, 1]``:
|
||||||
|
``1`` indicates the head is **not masked**, ``0`` indicates the head is **masked**.
|
||||||
|
"""
|
||||||
|
|
||||||
|
@add_start_docstrings("The bare CTRL Model transformer outputting raw hidden-states without any specific head on top.",
|
||||||
|
CTRL_START_DOCSTRING, CTRL_INPUTS_DOCSTRING)
|
||||||
|
class CTRLModel(CTRLPreTrainedModel):
|
||||||
|
r"""
|
||||||
|
Outputs: `Tuple` comprising various elements depending on the configuration (config) and inputs:
|
||||||
|
**last_hidden_state**: ``torch.FloatTensor`` of shape ``(batch_size, sequence_length, hidden_size)``
|
||||||
|
Sequence of hidden-states at the last layer of the model.
|
||||||
|
**past**:
|
||||||
|
list of ``torch.FloatTensor`` (one for each layer) of shape ``(2, batch_size, num_heads, sequence_length, hidden_size // num_heads)``:
|
||||||
|
that contains pre-computed hidden-states (key and values in the attention blocks).
|
||||||
|
Can be used (see `past` input) to speed up sequential decoding.
|
||||||
|
**hidden_states**: (`optional`, returned when ``config.output_hidden_states=True``)
|
||||||
|
list of ``torch.FloatTensor`` (one for the output of each layer + the output of the embeddings)
|
||||||
|
of shape ``(batch_size, sequence_length, hidden_size)``:
|
||||||
|
Hidden-states of the model at the output of each layer plus the initial embedding outputs.
|
||||||
|
**attentions**: (`optional`, returned when ``config.output_attentions=True``)
|
||||||
|
list of ``torch.FloatTensor`` (one for each layer) of shape ``(batch_size, num_heads, sequence_length, sequence_length)``:
|
||||||
|
Attention weights after the attention softmax, used to compute the weighted average in the self-attention heads.
|
||||||
|
|
||||||
|
Examples::
|
||||||
|
|
||||||
|
tokenizer = CTRLTokenizer.from_pretrained('ctrl')
|
||||||
|
model = CTRLModel.from_pretrained('ctrl')
|
||||||
|
input_ids = torch.tensor(tokenizer.encode("Links Hello, my dog is cute")).unsqueeze(0) # Batch size 1
|
||||||
|
outputs = model(input_ids)
|
||||||
|
last_hidden_states = outputs[0] # The last hidden-state is the first element of the output tuple
|
||||||
|
|
||||||
|
"""
|
||||||
|
def __init__(self, config):
|
||||||
|
super(CTRLModel, self).__init__(config)
|
||||||
|
self.output_hidden_states = config.output_hidden_states
|
||||||
|
self.output_attentions = config.output_attentions
|
||||||
|
self.output_past = config.output_past
|
||||||
|
|
||||||
|
self.d_model_size = config.n_embd
|
||||||
|
self.num_layers = config.n_layer
|
||||||
|
|
||||||
|
self.pos_encoding = positional_encoding(config.n_positions, self.d_model_size, torch.float)
|
||||||
|
|
||||||
|
self.w = nn.Embedding(config.vocab_size, config.n_embd)
|
||||||
|
|
||||||
|
self.dropout = nn.Dropout(config.embd_pdrop)
|
||||||
|
self.h = nn.ModuleList([EncoderLayer(config.n_embd,
|
||||||
|
config.n_head,
|
||||||
|
config.dff,
|
||||||
|
config.resid_pdrop,
|
||||||
|
config.output_attentions) for _ in range(config.n_layer)])
|
||||||
|
self.layernorm = nn.LayerNorm(config.n_embd, eps=config.layer_norm_epsilon)
|
||||||
|
|
||||||
|
self.init_weights()
|
||||||
|
|
||||||
|
def _resize_token_embeddings(self, new_num_tokens):
|
||||||
|
self.w = self._get_resized_embeddings(self.w, new_num_tokens)
|
||||||
|
return self.w
|
||||||
|
|
||||||
|
def _prune_heads(self, heads_to_prune):
|
||||||
|
""" Prunes heads of the model.
|
||||||
|
heads_to_prune: dict of {layer_num: list of heads to prune in this layer}
|
||||||
|
"""
|
||||||
|
for layer, heads in heads_to_prune.items():
|
||||||
|
self.h[layer].attn.prune_heads(heads)
|
||||||
|
|
||||||
|
def forward(self, input_ids, past=None, attention_mask=None, token_type_ids=None, position_ids=None, head_mask=None):
|
||||||
|
input_shape = input_ids.size()
|
||||||
|
input_ids = input_ids.view(-1, input_shape[-1])
|
||||||
|
if past is None:
|
||||||
|
past_length = 0
|
||||||
|
past = [None] * len(self.h)
|
||||||
|
else:
|
||||||
|
past_length = past[0][0].size(-2)
|
||||||
|
if position_ids is None:
|
||||||
|
position_ids = torch.arange(past_length, input_ids.size(-1) + past_length, dtype=torch.long, device=input_ids.device)
|
||||||
|
position_ids = position_ids.unsqueeze(0).expand_as(input_ids)
|
||||||
|
|
||||||
|
# Attention mask.
|
||||||
|
if attention_mask is not None:
|
||||||
|
attention_mask = attention_mask.view(-1, input_shape[-1])
|
||||||
|
# We create a 4D attention mask from a 2D tensor mask.
|
||||||
|
# Sizes are [batch_size, 1, 1, to_seq_length]
|
||||||
|
# So we can broadcast to [batch_size, num_heads, from_seq_length, to_seq_length]
|
||||||
|
# this attention mask is simpler than the triangular masking of causal attention
|
||||||
|
# used in OpenAI GPT, we just need to prepare the broadcast dimension here.
|
||||||
|
attention_mask = attention_mask.unsqueeze(1).unsqueeze(2)
|
||||||
|
|
||||||
|
# Since attention_mask is 1.0 for positions we want to attend and 0.0 for
|
||||||
|
# masked positions, this operation will create a tensor which is 0.0 for
|
||||||
|
# positions we want to attend and -10000.0 for masked positions.
|
||||||
|
# Since we are adding it to the raw scores before the softmax, this is
|
||||||
|
# effectively the same as removing these entirely.
|
||||||
|
attention_mask = attention_mask.to(dtype=next(self.parameters()).dtype) # fp16 compatibility
|
||||||
|
attention_mask = (1.0 - attention_mask) * -10000.0
|
||||||
|
|
||||||
|
# Prepare head mask if needed
|
||||||
|
# 1.0 in head_mask indicates we keep the head
|
||||||
|
# attention_probs has shape bsz x n_heads x N x N
|
||||||
|
# head_mask has shape n_layer x batch x n_heads x N x N
|
||||||
|
if head_mask is not None:
|
||||||
|
if head_mask.dim() == 1:
|
||||||
|
head_mask = head_mask.unsqueeze(0).unsqueeze(0).unsqueeze(-1).unsqueeze(-1)
|
||||||
|
head_mask = head_mask.expand(self.config.n_layer, -1, -1, -1, -1)
|
||||||
|
elif head_mask.dim() == 2:
|
||||||
|
head_mask = head_mask.unsqueeze(1).unsqueeze(-1).unsqueeze(-1) # We can specify head_mask for each layer
|
||||||
|
head_mask = head_mask.to(dtype=next(self.parameters()).dtype) # switch to float if needed + fp16 compatibility
|
||||||
|
else:
|
||||||
|
head_mask = [None] * self.config.n_layer
|
||||||
|
|
||||||
|
if token_type_ids is not None:
|
||||||
|
token_type_ids = token_type_ids.view(-1, input_shape[-1])
|
||||||
|
token_type_embeds = self.w(token_type_ids)
|
||||||
|
token_type_embeds *= np.sqrt(self.d_model_size)
|
||||||
|
else:
|
||||||
|
token_type_embeds = 0
|
||||||
|
position_ids = position_ids.view(-1, input_shape[-1])
|
||||||
|
|
||||||
|
inputs_embeds = self.w(input_ids)
|
||||||
|
# inputs_embeds = embedded.unsqueeze(0) if len(input_ids.shape)<2 else embedded
|
||||||
|
seq_len = input_ids.shape[-1]
|
||||||
|
mask = torch.triu(torch.ones(seq_len, seq_len), 1).to(inputs_embeds.device)
|
||||||
|
|
||||||
|
inputs_embeds *= np.sqrt(self.d_model_size)
|
||||||
|
|
||||||
|
pos_embeds = self.pos_encoding[position_ids, :].to(inputs_embeds.device)
|
||||||
|
|
||||||
|
hidden_states = inputs_embeds + pos_embeds + token_type_embeds
|
||||||
|
|
||||||
|
hidden_states = self.dropout(hidden_states)
|
||||||
|
|
||||||
|
output_shape = input_shape + (inputs_embeds.size(-1),)
|
||||||
|
presents = ()
|
||||||
|
all_hidden_states = ()
|
||||||
|
all_attentions = []
|
||||||
|
for i, (h, layer_past) in enumerate(zip(self.h, past)):
|
||||||
|
if self.output_hidden_states:
|
||||||
|
all_hidden_states = all_hidden_states + (hidden_states.view(*output_shape),)
|
||||||
|
outputs = h(hidden_states,
|
||||||
|
mask,
|
||||||
|
layer_past=layer_past,
|
||||||
|
attention_mask=attention_mask,
|
||||||
|
head_mask=head_mask[i])
|
||||||
|
hidden_states, present = outputs[:2]
|
||||||
|
if self.output_past:
|
||||||
|
presents = presents + (present,)
|
||||||
|
|
||||||
|
if self.output_attentions:
|
||||||
|
all_attentions.append(outputs[2])
|
||||||
|
|
||||||
|
hidden_states = self.layernorm(hidden_states)
|
||||||
|
hidden_states = hidden_states.view(*output_shape)
|
||||||
|
if self.output_hidden_states:
|
||||||
|
all_hidden_states = all_hidden_states + (hidden_states,)
|
||||||
|
|
||||||
|
outputs = (hidden_states,)
|
||||||
|
if self.output_past:
|
||||||
|
outputs = outputs + (presents,)
|
||||||
|
if self.output_hidden_states:
|
||||||
|
outputs = outputs + (all_hidden_states,)
|
||||||
|
if self.output_attentions:
|
||||||
|
# let the number of heads free (-1) so we can extract attention even after head pruning
|
||||||
|
attention_output_shape = input_shape[:-1] + (-1,) + all_attentions[0].shape[-2:]
|
||||||
|
all_attentions = tuple(t.view(*attention_output_shape) for t in all_attentions)
|
||||||
|
outputs = outputs + (all_attentions,)
|
||||||
|
return outputs
|
||||||
|
|
||||||
|
|
||||||
|
@add_start_docstrings("""The CTRL Model transformer with a language modeling head on top
|
||||||
|
(linear layer with weights tied to the input embeddings). """, CTRL_START_DOCSTRING, CTRL_INPUTS_DOCSTRING)
|
||||||
|
class CTRLLMHeadModel(CTRLPreTrainedModel):
|
||||||
|
r"""
|
||||||
|
**labels**: (`optional`) ``torch.LongTensor`` of shape ``(batch_size, sequence_length)``:
|
||||||
|
Labels for language modeling.
|
||||||
|
Note that the labels **are shifted** inside the model, i.e. you can set ``labels = input_ids``.
|
||||||
|
Indices are selected in ``[-1, 0, ..., config.vocab_size - 1]``.
|
||||||
|
All labels set to ``-1`` are ignored (masked), the loss is only
|
||||||
|
computed for labels in ``[0, ..., config.vocab_size - 1]``.
|
||||||
|
|
||||||
|
Outputs: `Tuple` comprising various elements depending on the configuration (config) and inputs:
|
||||||
|
**loss**: (`optional`, returned when ``labels`` is provided) ``torch.FloatTensor`` of shape ``(1,)``:
|
||||||
|
Language modeling loss.
|
||||||
|
**prediction_scores**: ``torch.FloatTensor`` of shape ``(batch_size, sequence_length, config.vocab_size)``
|
||||||
|
Prediction scores of the language modeling head (scores for each vocabulary token before SoftMax).
|
||||||
|
**past**:
|
||||||
|
list of ``torch.FloatTensor`` (one for each layer) of shape ``(2, batch_size, num_heads, sequence_length, hidden_size // num_heads)``:
|
||||||
|
that contains pre-computed hidden-states (key and values in the attention blocks).
|
||||||
|
Can be used (see `past` input) to speed up sequential decoding.
|
||||||
|
**hidden_states**: (`optional`, returned when ``config.output_hidden_states=True``)
|
||||||
|
list of ``torch.FloatTensor`` (one for the output of each layer + the output of the embeddings)
|
||||||
|
of shape ``(batch_size, sequence_length, hidden_size)``:
|
||||||
|
Hidden-states of the model at the output of each layer plus the initial embedding outputs.
|
||||||
|
**attentions**: (`optional`, returned when ``config.output_attentions=True``)
|
||||||
|
list of ``torch.FloatTensor`` (one for each layer) of shape ``(batch_size, num_heads, sequence_length, sequence_length)``:
|
||||||
|
Attention weights after the attention softmax, used to compute the weighted average in the self-attention heads.
|
||||||
|
|
||||||
|
Examples::
|
||||||
|
|
||||||
|
import torch
|
||||||
|
from transformers import CTRLTokenizer, CTRLLMHeadModel
|
||||||
|
|
||||||
|
tokenizer = CTRLTokenizer.from_pretrained('ctrl')
|
||||||
|
model = CTRLLMHeadModel.from_pretrained('ctrl')
|
||||||
|
|
||||||
|
input_ids = torch.tensor(tokenizer.encode("Links Hello, my dog is cute")).unsqueeze(0) # Batch size 1
|
||||||
|
outputs = model(input_ids, labels=input_ids)
|
||||||
|
loss, logits = outputs[:2]
|
||||||
|
|
||||||
|
"""
|
||||||
|
def __init__(self, config):
|
||||||
|
super(CTRLLMHeadModel, self).__init__(config)
|
||||||
|
self.transformer = CTRLModel(config)
|
||||||
|
self.lm_head = nn.Linear(config.n_embd, config.vocab_size, bias=True)
|
||||||
|
|
||||||
|
self.init_weights()
|
||||||
|
self.tie_weights()
|
||||||
|
|
||||||
|
def tie_weights(self):
|
||||||
|
""" Make sure we are sharing the input and output embeddings.
|
||||||
|
Export to TorchScript can't handle parameter sharing so we are cloning them instead.
|
||||||
|
"""
|
||||||
|
self._tie_or_clone_weights(self.lm_head, self.transformer.w)
|
||||||
|
|
||||||
|
def forward(self, input_ids, past=None, attention_mask=None, token_type_ids=None, position_ids=None, head_mask=None,
|
||||||
|
labels=None):
|
||||||
|
transformer_outputs = self.transformer(input_ids,
|
||||||
|
past=past,
|
||||||
|
attention_mask=attention_mask,
|
||||||
|
token_type_ids=token_type_ids,
|
||||||
|
position_ids=position_ids,
|
||||||
|
head_mask=head_mask)
|
||||||
|
|
||||||
|
hidden_states = transformer_outputs[0]
|
||||||
|
|
||||||
|
lm_logits = self.lm_head(hidden_states)
|
||||||
|
|
||||||
|
outputs = (lm_logits,) + transformer_outputs[1:]
|
||||||
|
|
||||||
|
if labels is not None:
|
||||||
|
# Shift so that tokens < n predict n
|
||||||
|
shift_logits = lm_logits[..., :-1, :].contiguous()
|
||||||
|
shift_labels = labels[..., 1:].contiguous()
|
||||||
|
# Flatten the tokens
|
||||||
|
loss_fct = CrossEntropyLoss(ignore_index=-1)
|
||||||
|
loss = loss_fct(shift_logits.view(-1, shift_logits.size(-1)),
|
||||||
|
shift_labels.view(-1))
|
||||||
|
outputs = (loss,) + outputs
|
||||||
|
|
||||||
|
return outputs # (loss), lm_logits, presents, (all hidden_states), (attentions)
|
||||||
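`CTRLLMHeadModel.forward` above shifts logits and labels by one position so that the prediction at position `t` is scored against the token at `t + 1`, and it skips labels equal to `-1`. A plain-Python sketch of that loss for a single sequence, assuming nested lists instead of tensors (`shifted_lm_loss` is a hypothetical name; the diff itself uses `CrossEntropyLoss(ignore_index=-1)`):

```python
import math


def shifted_lm_loss(logits, labels, ignore_index=-1):
    """Mean cross-entropy of logits[t] against labels[t + 1], skipping ignored labels."""
    shift_logits = logits[:-1]  # the last position has no next token to predict
    shift_labels = labels[1:]   # the first token is never a prediction target
    total, count = 0.0, 0
    for row, target in zip(shift_logits, shift_labels):
        if target == ignore_index:
            continue
        m = max(row)
        log_z = m + math.log(sum(math.exp(x - m) for x in row))  # log-sum-exp
        total += log_z - row[target]  # negative log-likelihood of the target
        count += 1
    return total / count if count else float("nan")


# Uniform logits over a 4-token vocabulary give a loss of log(4) per position.
logits = [[0.0, 0.0, 0.0, 0.0] for _ in range(3)]
print(shifted_lm_loss(logits, [0, 1, 2]))
print(shifted_lm_loss(logits, [0, -1, 2]))  # the ignored label drops out of the mean
```

Because the shift happens inside the model, callers can pass the input ids themselves as labels, as the docstring example does.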
@@ -159,8 +159,6 @@ class MultiHeadSelfAttention(nn.Module):
|
|||||||
|
|
||||||
dim_per_head = self.dim // self.n_heads
|
dim_per_head = self.dim // self.n_heads
|
||||||
|
|
||||||
assert 2 <= mask.dim() <= 3
|
|
||||||
causal = (mask.dim() == 3)
|
|
||||||
mask_reshp = (bs, 1, 1, k_length)
|
mask_reshp = (bs, 1, 1, k_length)
|
||||||
|
|
||||||
def shape(x):
|
def shape(x):
|
||||||
|
|||||||
@@ -347,6 +347,7 @@ class GPT2Model(GPT2PreTrainedModel):
|
|||||||
super(GPT2Model, self).__init__(config)
|
super(GPT2Model, self).__init__(config)
|
||||||
self.output_hidden_states = config.output_hidden_states
|
self.output_hidden_states = config.output_hidden_states
|
||||||
self.output_attentions = config.output_attentions
|
self.output_attentions = config.output_attentions
|
||||||
|
self.output_past = config.output_past
|
||||||
|
|
||||||
self.wte = nn.Embedding(config.vocab_size, config.n_embd)
|
self.wte = nn.Embedding(config.vocab_size, config.n_embd)
|
||||||
self.wpe = nn.Embedding(config.n_positions, config.n_embd)
|
self.wpe = nn.Embedding(config.n_positions, config.n_embd)
|
||||||
@@ -440,7 +441,8 @@ class GPT2Model(GPT2PreTrainedModel):
|
|||||||
head_mask=head_mask[i])
|
head_mask=head_mask[i])
|
||||||
|
|
||||||
hidden_states, present = outputs[:2]
|
hidden_states, present = outputs[:2]
|
||||||
presents = presents + (present,)
|
if self.output_past:
|
||||||
|
presents = presents + (present,)
|
||||||
|
|
||||||
if self.output_attentions:
|
if self.output_attentions:
|
||||||
all_attentions.append(outputs[2])
|
all_attentions.append(outputs[2])
|
||||||
@@ -452,7 +454,9 @@ class GPT2Model(GPT2PreTrainedModel):
|
|||||||
if self.output_hidden_states:
|
if self.output_hidden_states:
|
||||||
all_hidden_states = all_hidden_states + (hidden_states,)
|
all_hidden_states = all_hidden_states + (hidden_states,)
|
||||||
|
|
||||||
outputs = (hidden_states, presents)
|
outputs = (hidden_states,)
|
||||||
|
if self.output_past:
|
||||||
|
outputs = outputs + (presents,)
|
||||||
if self.output_hidden_states:
|
if self.output_hidden_states:
|
||||||
outputs = outputs + (all_hidden_states,)
|
outputs = outputs + (all_hidden_states,)
|
||||||
if self.output_attentions:
|
if self.output_attentions:
|
||||||
@@ -460,7 +464,7 @@ class GPT2Model(GPT2PreTrainedModel):
|
|||||||
attention_output_shape = input_shape[:-1] + (-1,) + all_attentions[0].shape[-2:]
|
attention_output_shape = input_shape[:-1] + (-1,) + all_attentions[0].shape[-2:]
|
||||||
all_attentions = tuple(t.view(*attention_output_shape) for t in all_attentions)
|
all_attentions = tuple(t.view(*attention_output_shape) for t in all_attentions)
|
||||||
outputs = outputs + (all_attentions,)
|
outputs = outputs + (all_attentions,)
|
||||||
return outputs # last hidden state, presents, (all hidden_states), (attentions)
|
return outputs # last hidden state, (presents), (all hidden_states), (attentions)
|
||||||
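Because `presents` is now optional, the positions of the later tuple entries depend on the config flags. A minimal sketch of that assembly logic follows; `build_outputs` and its string placeholders are hypothetical and only mirror the order of the `if` blocks in the hunk above.

```python
def build_outputs(hidden_states, presents, all_hidden_states, all_attentions,
                  output_past=True, output_hidden_states=False,
                  output_attentions=False):
    """Assemble the output tuple in the same order as the model's forward pass."""
    outputs = (hidden_states,)
    if output_past:
        outputs = outputs + (presents,)
    if output_hidden_states:
        outputs = outputs + (all_hidden_states,)
    if output_attentions:
        outputs = outputs + (all_attentions,)
    return outputs


# With output_past disabled, attentions move from index 2 to index 1.
with_past = build_outputs("h", "p", "hs", "a", output_attentions=True)
without_past = build_outputs("h", "p", "hs", "a", output_past=False,
                             output_attentions=True)
```

Callers that index the tuple by fixed position need to account for this shift; slicing from index 1, as the LM head does with `transformer_outputs[1:]`, keeps working either way.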
|
|
||||||
|
|
||||||
@add_start_docstrings("""The GPT2 Model transformer with a language modeling head on top
|
@add_start_docstrings("""The GPT2 Model transformer with a language modeling head on top
|
||||||
|
|||||||
@@ -170,7 +170,7 @@ class Attention(nn.Module):
|
|||||||
# w = w * self.bias + -1e9 * (1 - self.bias) # TF implem method: mask_attn_weights
|
# w = w * self.bias + -1e9 * (1 - self.bias) # TF implem method: mask_attn_weights
|
||||||
# XD: self.b may be larger than w, so we need to crop it
|
# XD: self.b may be larger than w, so we need to crop it
|
||||||
b = self.bias[:, :, : w.size(-2), : w.size(-1)]
|
b = self.bias[:, :, : w.size(-2), : w.size(-1)]
|
||||||
w = w * b + -1e9 * (1 - b)
|
w = w * b - 1e4 * (1 - b)
|
||||||
|
|
||||||
if attention_mask is not None:
|
if attention_mask is not None:
|
||||||
# Apply the attention mask
|
# Apply the attention mask
|
||||||
|
|||||||
@@ -34,6 +34,7 @@ ROBERTA_PRETRAINED_MODEL_ARCHIVE_MAP = {
|
|||||||
'roberta-base': "https://s3.amazonaws.com/models.huggingface.co/bert/roberta-base-pytorch_model.bin",
|
'roberta-base': "https://s3.amazonaws.com/models.huggingface.co/bert/roberta-base-pytorch_model.bin",
|
||||||
'roberta-large': "https://s3.amazonaws.com/models.huggingface.co/bert/roberta-large-pytorch_model.bin",
|
'roberta-large': "https://s3.amazonaws.com/models.huggingface.co/bert/roberta-large-pytorch_model.bin",
|
||||||
'roberta-large-mnli': "https://s3.amazonaws.com/models.huggingface.co/bert/roberta-large-mnli-pytorch_model.bin",
|
'roberta-large-mnli': "https://s3.amazonaws.com/models.huggingface.co/bert/roberta-large-mnli-pytorch_model.bin",
|
||||||
|
'distilroberta-base': "https://s3.amazonaws.com/models.huggingface.co/bert/distilroberta-base-pytorch_model.bin",
|
||||||
}
|
}
|
||||||
|
|
||||||
class RobertaEmbeddings(BertEmbeddings):
|
class RobertaEmbeddings(BertEmbeddings):
|
||||||
@@ -172,7 +173,8 @@ class RobertaModel(BertModel):
|
|||||||
if input_ids[:, 0].sum().item() != 0:
|
if input_ids[:, 0].sum().item() != 0:
|
||||||
logger.warning("A sequence with no special tokens has been passed to the RoBERTa model. "
|
logger.warning("A sequence with no special tokens has been passed to the RoBERTa model. "
|
||||||
"This model requires special tokens in order to work. "
|
"This model requires special tokens in order to work. "
|
||||||
"Please specify add_special_tokens=True in your encoding.")
|
"Please specify add_special_tokens=True in your tokenizer.encode() "
|
||||||
|
"or tokenizer.convert_tokens_to_ids().")
|
||||||
return super(RobertaModel, self).forward(input_ids,
|
return super(RobertaModel, self).forward(input_ids,
|
||||||
attention_mask=attention_mask,
|
attention_mask=attention_mask,
|
||||||
token_type_ids=token_type_ids,
|
token_type_ids=token_type_ids,
|
||||||
@@ -341,6 +343,7 @@ class RobertaForSequenceClassification(BertPreTrainedModel):
|
|||||||
|
|
||||||
return outputs # (loss), logits, (hidden_states), (attentions)
|
return outputs # (loss), logits, (hidden_states), (attentions)
|
||||||
|
|
||||||
|
|
||||||
@add_start_docstrings("""Roberta Model with a multiple choice classification head on top (a linear layer on top of
|
@add_start_docstrings("""Roberta Model with a multiple choice classification head on top (a linear layer on top of
|
||||||
the pooled output and a softmax) e.g. for RocStories/SWAG tasks. """,
|
the pooled output and a softmax) e.g. for RocStories/SWAG tasks. """,
|
||||||
ROBERTA_START_DOCSTRING, ROBERTA_INPUTS_DOCSTRING)
|
ROBERTA_START_DOCSTRING, ROBERTA_INPUTS_DOCSTRING)
|
||||||
@@ -449,6 +452,81 @@ class RobertaForMultipleChoice(BertPreTrainedModel):
|
|||||||
return outputs # (loss), reshaped_logits, (hidden_states), (attentions)
|
return outputs # (loss), reshaped_logits, (hidden_states), (attentions)
|
||||||
|
|
||||||
|
|
||||||
|
@add_start_docstrings("""Roberta Model with a token classification head on top (a linear layer on top of
|
||||||
|
the hidden-states output) e.g. for Named-Entity-Recognition (NER) tasks. """,
|
||||||
|
ROBERTA_START_DOCSTRING, ROBERTA_INPUTS_DOCSTRING)
|
||||||
|
class RobertaForTokenClassification(BertPreTrainedModel):
|
||||||
|
r"""
|
||||||
|
**labels**: (`optional`) ``torch.LongTensor`` of shape ``(batch_size, sequence_length)``:
|
||||||
|
Labels for computing the token classification loss.
|
||||||
|
Indices should be in ``[0, ..., config.num_labels - 1]``.
|
||||||
|
|
||||||
|
Outputs: `Tuple` comprising various elements depending on the configuration (config) and inputs:
|
||||||
|
**loss**: (`optional`, returned when ``labels`` is provided) ``torch.FloatTensor`` of shape ``(1,)``:
|
||||||
|
Classification loss.
|
||||||
|
**scores**: ``torch.FloatTensor`` of shape ``(batch_size, sequence_length, config.num_labels)``
|
||||||
|
Classification scores (before SoftMax).
|
||||||
|
**hidden_states**: (`optional`, returned when ``config.output_hidden_states=True``)
|
||||||
|
list of ``torch.FloatTensor`` (one for the output of each layer + the output of the embeddings)
|
||||||
|
of shape ``(batch_size, sequence_length, hidden_size)``:
|
||||||
|
Hidden-states of the model at the output of each layer plus the initial embedding outputs.
|
||||||
|
**attentions**: (`optional`, returned when ``config.output_attentions=True``)
|
||||||
|
list of ``torch.FloatTensor`` (one for each layer) of shape ``(batch_size, num_heads, sequence_length, sequence_length)``:
|
||||||
|
Attention weights after the attention softmax, used to compute the weighted average in the self-attention heads.
|
||||||
|
|
||||||
|
Examples::
|
||||||
|
|
||||||
|
tokenizer = RobertaTokenizer.from_pretrained('roberta-base')
|
||||||
|
model = RobertaForTokenClassification.from_pretrained('roberta-base')
|
||||||
|
input_ids = torch.tensor(tokenizer.encode("Hello, my dog is cute", add_special_tokens=True)).unsqueeze(0) # Batch size 1
|
||||||
|
labels = torch.tensor([1] * input_ids.size(1)).unsqueeze(0) # Batch size 1
|
||||||
|
outputs = model(input_ids, labels=labels)
|
||||||
|
loss, scores = outputs[:2]
|
||||||
|
|
||||||
|
"""
|
||||||
|
config_class = RobertaConfig
|
||||||
|
pretrained_model_archive_map = ROBERTA_PRETRAINED_MODEL_ARCHIVE_MAP
|
||||||
|
base_model_prefix = "roberta"
|
||||||
|
|
||||||
|
def __init__(self, config):
|
||||||
|
super(RobertaForTokenClassification, self).__init__(config)
|
||||||
|
self.num_labels = config.num_labels
|
||||||
|
|
||||||
|
self.roberta = RobertaModel(config)
|
||||||
|
self.dropout = nn.Dropout(config.hidden_dropout_prob)
|
||||||
|
self.classifier = nn.Linear(config.hidden_size, config.num_labels)
|
||||||
|
|
||||||
|
self.init_weights()
|
||||||
|
|
||||||
|
def forward(self, input_ids, attention_mask=None, token_type_ids=None,
|
||||||
|
position_ids=None, head_mask=None, labels=None):
|
||||||
|
|
||||||
|
outputs = self.roberta(input_ids,
|
||||||
|
attention_mask=attention_mask,
|
||||||
|
token_type_ids=token_type_ids,
|
||||||
|
position_ids=position_ids,
|
||||||
|
head_mask=head_mask)
|
||||||
|
|
||||||
|
sequence_output = outputs[0]
|
||||||
|
|
||||||
|
sequence_output = self.dropout(sequence_output)
|
||||||
|
logits = self.classifier(sequence_output)
|
||||||
|
|
||||||
|
outputs = (logits,) + outputs[2:] # add hidden states and attention if they are here
|
||||||
|
if labels is not None:
|
||||||
|
loss_fct = CrossEntropyLoss()
|
||||||
|
# Only keep active parts of the loss
|
||||||
|
if attention_mask is not None:
|
||||||
|
active_loss = attention_mask.view(-1) == 1
|
||||||
|
active_logits = logits.view(-1, self.num_labels)[active_loss]
|
||||||
|
active_labels = labels.view(-1)[active_loss]
|
||||||
|
loss = loss_fct(active_logits, active_labels)
|
||||||
|
else:
|
||||||
|
loss = loss_fct(logits.view(-1, self.num_labels), labels.view(-1))
|
||||||
|
outputs = (loss,) + outputs
|
||||||
|
|
||||||
|
return outputs # (loss), scores, (hidden_states), (attentions)
|
||||||
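The "only keep active parts of the loss" branch above can be illustrated without PyTorch. The sketch below handles one sequence with plain lists; `masked_token_loss` is a hypothetical helper name, and the real code operates on flattened batch tensors instead.

```python
import math


def masked_token_loss(logits, labels, attention_mask=None):
    """Mean cross-entropy over tokens, skipping positions whose mask is not 1."""
    losses = []
    for t, (row, target) in enumerate(zip(logits, labels)):
        if attention_mask is not None and attention_mask[t] != 1:
            continue  # padding position: excluded from the loss
        # Numerically stable log-sum-exp for the softmax normaliser.
        m = max(row)
        log_z = m + math.log(sum(math.exp(x - m) for x in row))
        losses.append(log_z - row[target])
    if not losses:
        raise ValueError("no active positions to compute a loss over")
    return sum(losses) / len(losses)
```

Passing `attention_mask=None` averages over every position, which matches the `else` branch of the hunk.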
|
|
||||||
|
|
||||||
class RobertaClassificationHead(nn.Module):
|
class RobertaClassificationHead(nn.Module):
|
||||||
"""Head for sentence-level classification tasks."""
|
"""Head for sentence-level classification tasks."""
|
||||||
|
|||||||
@@ -26,6 +26,7 @@ from .modeling_tf_xlnet import TFXLNetModel, TFXLNetLMHeadModel, TFXLNetForSeque
|
|||||||
from .modeling_tf_xlm import TFXLMModel, TFXLMWithLMHeadModel, TFXLMForSequenceClassification, TFXLMForQuestionAnsweringSimple
|
from .modeling_tf_xlm import TFXLMModel, TFXLMWithLMHeadModel, TFXLMForSequenceClassification, TFXLMForQuestionAnsweringSimple
|
||||||
from .modeling_tf_roberta import TFRobertaModel, TFRobertaForMaskedLM, TFRobertaForSequenceClassification
|
from .modeling_tf_roberta import TFRobertaModel, TFRobertaForMaskedLM, TFRobertaForSequenceClassification
|
||||||
from .modeling_tf_distilbert import TFDistilBertModel, TFDistilBertForQuestionAnswering, TFDistilBertForMaskedLM, TFDistilBertForSequenceClassification
|
from .modeling_tf_distilbert import TFDistilBertModel, TFDistilBertForQuestionAnswering, TFDistilBertForMaskedLM, TFDistilBertForSequenceClassification
|
||||||
|
from .modeling_tf_ctrl import TFCTRLModel, TFCTRLLMHeadModel
|
||||||
|
|
||||||
from .file_utils import add_start_docstrings
|
from .file_utils import add_start_docstrings
|
||||||
|
|
||||||
@@ -52,6 +53,7 @@ class TFAutoModel(object):
|
|||||||
- contains `transfo-xl`: TFTransfoXLModel (Transformer-XL model)
|
- contains `transfo-xl`: TFTransfoXLModel (Transformer-XL model)
|
||||||
- contains `xlnet`: TFXLNetModel (XLNet model)
|
- contains `xlnet`: TFXLNetModel (XLNet model)
|
||||||
- contains `xlm`: TFXLMModel (XLM model)
|
- contains `xlm`: TFXLMModel (XLM model)
|
||||||
|
- contains `ctrl`: TFCTRLModel (CTRL model)
|
||||||
|
|
||||||
This class cannot be instantiated using `__init__()` (throws an error).
|
This class cannot be instantiated using `__init__()` (throws an error).
|
||||||
"""
|
"""
|
||||||
@@ -73,7 +75,7 @@ class TFAutoModel(object):
|
|||||||
- contains `gpt2`: TFGPT2Model (OpenAI GPT-2 model)
|
- contains `gpt2`: TFGPT2Model (OpenAI GPT-2 model)
|
||||||
- contains `transfo-xl`: TFTransfoXLModel (Transformer-XL model)
|
- contains `transfo-xl`: TFTransfoXLModel (Transformer-XL model)
|
||||||
- contains `xlnet`: TFXLNetModel (XLNet model)
|
- contains `xlnet`: TFXLNetModel (XLNet model)
|
||||||
- contains `xlm`: TFXLMModel (XLM model)
|
- contains `ctrl`: TFCTRLModel (CTRL model)
|
||||||
|
|
||||||
Params:
|
Params:
|
||||||
pretrained_model_name_or_path: either:
|
pretrained_model_name_or_path: either:
|
||||||
@@ -147,10 +149,12 @@ class TFAutoModel(object):
|
|||||||
return TFXLNetModel.from_pretrained(pretrained_model_name_or_path, *model_args, **kwargs)
|
return TFXLNetModel.from_pretrained(pretrained_model_name_or_path, *model_args, **kwargs)
|
||||||
elif 'xlm' in pretrained_model_name_or_path:
|
elif 'xlm' in pretrained_model_name_or_path:
|
||||||
return TFXLMModel.from_pretrained(pretrained_model_name_or_path, *model_args, **kwargs)
|
return TFXLMModel.from_pretrained(pretrained_model_name_or_path, *model_args, **kwargs)
|
||||||
|
elif 'ctrl' in pretrained_model_name_or_path:
|
||||||
|
return TFCTRLModel.from_pretrained(pretrained_model_name_or_path, *model_args, **kwargs)
|
||||||
|
|
||||||
raise ValueError("Unrecognized model identifier in {}. Should contain one of "
|
raise ValueError("Unrecognized model identifier in {}. Should contain one of "
|
||||||
"'bert', 'openai-gpt', 'gpt2', 'transfo-xl', 'xlnet', "
|
"'bert', 'openai-gpt', 'gpt2', 'transfo-xl', 'xlnet', "
|
||||||
"'xlm', 'roberta'".format(pretrained_model_name_or_path))
|
"'xlm', 'roberta', 'ctrl'".format(pretrained_model_name_or_path))
|
||||||
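The name-based dispatch extended above is order-sensitive: the first substring that matches wins. A minimal sketch, assuming a lookup table in place of the real `if`/`elif` chain; `MODEL_PATTERNS` and `resolve_model_class` are hypothetical names, and only the three patterns visible in this hunk are listed.

```python
# Hypothetical registry: (substring, class name) pairs checked in order.
# The real from_pretrained methods use an if/elif chain rather than a table.
MODEL_PATTERNS = [
    ("xlnet", "TFXLNetModel"),
    ("xlm", "TFXLMModel"),
    ("ctrl", "TFCTRLModel"),
]


def resolve_model_class(name_or_path):
    """Return the class name for the first pattern found in name_or_path."""
    for pattern, class_name in MODEL_PATTERNS:
        if pattern in name_or_path:
            return class_name
    raise ValueError("Unrecognized model identifier in {}.".format(name_or_path))
```

One consequence of substring matching is that a local path such as `./ctrl-runs/xlnet-base` resolves to the XLNet class in this sketch, because `xlnet` is checked before `ctrl`.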
|
|
||||||
|
|
||||||
class TFAutoModelWithLMHead(object):
|
class TFAutoModelWithLMHead(object):
|
||||||
@@ -173,6 +177,7 @@ class TFAutoModelWithLMHead(object):
|
|||||||
- contains `transfo-xl`: TFTransfoXLLMHeadModel (Transformer-XL model)
|
- contains `transfo-xl`: TFTransfoXLLMHeadModel (Transformer-XL model)
|
||||||
- contains `xlnet`: TFXLNetLMHeadModel (XLNet model)
|
- contains `xlnet`: TFXLNetLMHeadModel (XLNet model)
|
||||||
- contains `xlm`: TFXLMWithLMHeadModel (XLM model)
|
- contains `xlm`: TFXLMWithLMHeadModel (XLM model)
|
||||||
|
- contains `ctrl`: TFCTRLLMHeadModel (CTRL model)
|
||||||
|
|
||||||
This class cannot be instantiated using `__init__()` (throws an error).
|
This class cannot be instantiated using `__init__()` (throws an error).
|
||||||
"""
|
"""
|
||||||
@@ -198,6 +203,7 @@ class TFAutoModelWithLMHead(object):
|
|||||||
- contains `transfo-xl`: TFTransfoXLLMHeadModel (Transformer-XL model)
|
- contains `transfo-xl`: TFTransfoXLLMHeadModel (Transformer-XL model)
|
||||||
- contains `xlnet`: TFXLNetLMHeadModel (XLNet model)
|
- contains `xlnet`: TFXLNetLMHeadModel (XLNet model)
|
||||||
- contains `xlm`: TFXLMWithLMHeadModel (XLM model)
|
- contains `xlm`: TFXLMWithLMHeadModel (XLM model)
|
||||||
|
- contains `ctrl`: TFCTRLLMHeadModel (CTRL model)
|
||||||
|
|
||||||
Params:
|
Params:
|
||||||
pretrained_model_name_or_path: either:
|
pretrained_model_name_or_path: either:
|
||||||
@@ -271,10 +277,12 @@ class TFAutoModelWithLMHead(object):
|
|||||||
return TFXLNetLMHeadModel.from_pretrained(pretrained_model_name_or_path, *model_args, **kwargs)
|
return TFXLNetLMHeadModel.from_pretrained(pretrained_model_name_or_path, *model_args, **kwargs)
|
||||||
elif 'xlm' in pretrained_model_name_or_path:
|
elif 'xlm' in pretrained_model_name_or_path:
|
||||||
return TFXLMWithLMHeadModel.from_pretrained(pretrained_model_name_or_path, *model_args, **kwargs)
|
return TFXLMWithLMHeadModel.from_pretrained(pretrained_model_name_or_path, *model_args, **kwargs)
|
||||||
|
elif 'ctrl' in pretrained_model_name_or_path:
|
||||||
|
return TFCTRLLMHeadModel.from_pretrained(pretrained_model_name_or_path, *model_args, **kwargs)
|
||||||
|
|
||||||
raise ValueError("Unrecognized model identifier in {}. Should contain one of "
|
raise ValueError("Unrecognized model identifier in {}. Should contain one of "
|
||||||
"'bert', 'openai-gpt', 'gpt2', 'transfo-xl', 'xlnet', "
|
"'bert', 'openai-gpt', 'gpt2', 'transfo-xl', 'xlnet', "
|
||||||
"'xlm', 'roberta'".format(pretrained_model_name_or_path))
|
"'xlm', 'roberta', 'ctrl'".format(pretrained_model_name_or_path))
|
||||||
|
|
||||||
|
|
||||||
class TFAutoModelForSequenceClassification(object):
|
class TFAutoModelForSequenceClassification(object):
|
||||||
|
|||||||
@@ -30,7 +30,6 @@ import tensorflow as tf
|
|||||||
from .configuration_bert import BertConfig
|
from .configuration_bert import BertConfig
|
||||||
from .modeling_tf_utils import TFPreTrainedModel, get_initializer
|
from .modeling_tf_utils import TFPreTrainedModel, get_initializer
|
||||||
from .file_utils import add_start_docstrings
|
from .file_utils import add_start_docstrings
|
||||||
from .modeling_tf_pytorch_utils import load_pytorch_checkpoint_in_tf2_model
|
|
||||||
|
|
||||||
logger = logging.getLogger(__name__)
|
logger = logging.getLogger(__name__)
|
||||||
|
|
||||||
@@ -52,14 +51,6 @@ TF_BERT_PRETRAINED_MODEL_ARCHIVE_MAP = {
|
|||||||
}
|
}
|
||||||
|
|
||||||
|
|
||||||
def load_bert_pt_weights_in_tf2(tf_model, pytorch_checkpoint_path):
|
|
||||||
# build the network
|
|
||||||
inputs_list = [[7, 6, 0, 0, 1], [1, 2, 3, 0, 0], [0, 0, 0, 4, 5]]
|
|
||||||
tf_inputs = tf.constant(inputs_list)
|
|
||||||
tfo = tf_model(tf_inputs, training=False)
|
|
||||||
return load_pytorch_checkpoint_in_tf2_model(tf_model, pytorch_checkpoint_path, tf_inputs=tf_inputs)
|
|
||||||
|
|
||||||
|
|
||||||
def gelu(x):
|
def gelu(x):
|
||||||
""" Gaussian Error Linear Unit.
|
""" Gaussian Error Linear Unit.
|
||||||
Original Implementation of the gelu activation function in Google Bert repo when initially created.
|
Original Implementation of the gelu activation function in Google Bert repo when initially created.
|
||||||
@@ -545,7 +536,6 @@ class TFBertPreTrainedModel(TFPreTrainedModel):
|
|||||||
"""
|
"""
|
||||||
config_class = BertConfig
|
config_class = BertConfig
|
||||||
pretrained_model_archive_map = TF_BERT_PRETRAINED_MODEL_ARCHIVE_MAP
|
pretrained_model_archive_map = TF_BERT_PRETRAINED_MODEL_ARCHIVE_MAP
|
||||||
load_pt_weights = load_bert_pt_weights_in_tf2
|
|
||||||
base_model_prefix = "bert"
|
base_model_prefix = "bert"
|
||||||
|
|
||||||
|
|
||||||
|
|||||||
487
transformers/modeling_tf_ctrl.py
Normal file
@@ -0,0 +1,487 @@
|
|||||||
|
# coding=utf-8
|
||||||
|
# Copyright 2018 Salesforce and HuggingFace Inc. team.
|
||||||
|
# Copyright (c) 2018, NVIDIA CORPORATION. All rights reserved.
|
||||||
|
#
|
||||||
|
# Licensed under the Apache License, Version 2.0 (the "License");
|
||||||
|
# you may not use this file except in compliance with the License.
|
||||||
|
# You may obtain a copy of the License at
|
||||||
|
#
|
||||||
|
# http://www.apache.org/licenses/LICENSE-2.0
|
||||||
|
#
|
||||||
|
# Unless required by applicable law or agreed to in writing, software
|
||||||
|
# distributed under the License is distributed on an "AS IS" BASIS,
|
||||||
|
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
|
||||||
|
# See the License for the specific language governing permissions and
|
||||||
|
# limitations under the License.
|
||||||
|
""" TF 2.0 CTRL model."""
|
||||||
|
|
||||||
|
from __future__ import absolute_import, division, print_function, unicode_literals
|
||||||
|
|
||||||
|
import logging
|
||||||
|
import os
|
||||||
|
import sys
|
||||||
|
from io import open
|
||||||
|
import numpy as np
|
||||||
|
import tensorflow as tf
|
||||||
|
|
||||||
|
from .configuration_ctrl import CTRLConfig
|
||||||
|
from .modeling_tf_utils import TFPreTrainedModel, get_initializer, shape_list, TFSharedEmbeddings
|
||||||
|
from .file_utils import add_start_docstrings
|
||||||
|
|
||||||
|
logger = logging.getLogger(__name__)
|
||||||
|
|
||||||
|
TF_CTRL_PRETRAINED_MODEL_ARCHIVE_MAP = {"ctrl": "https://s3.amazonaws.com/models.huggingface.co/bert/ctrl-tf_model.h5"}
|
||||||
|
|
||||||
|
def angle_defn(pos, i, d_model_size):
|
||||||
|
angle_rates = 1 / np.power(10000, (2 * (i//2)) / np.float32(d_model_size))
|
||||||
|
return pos * angle_rates
|
||||||
|
|
||||||
|
def positional_encoding(position, d_model_size):
|
||||||
|
# create the sinusoidal pattern for the positional encoding
|
||||||
|
angle_rads = angle_defn(np.arange(position)[:, np.newaxis],
|
||||||
|
np.arange(d_model_size)[np.newaxis, :],
|
||||||
|
d_model_size)
|
||||||
|
|
||||||
|
sines = np.sin(angle_rads[:, 0::2])
|
||||||
|
cosines = np.cos(angle_rads[:, 1::2])
|
||||||
|
|
||||||
|
# pos_encoding = tf.cast(np.concatenate([sines, cosines], axis=-1)[np.newaxis, ...], dtype=tf.float32)
|
||||||
|
pos_encoding = tf.cast(np.concatenate([sines, cosines], axis=-1), dtype=tf.float32)
|
||||||
|
return pos_encoding
|
||||||
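A dependency-free sketch of the same positional encoding may help when checking shapes and values by hand. It uses `math` in place of NumPy and returns nested lists rather than a tensor; those substitutions are assumptions for illustration only.

```python
import math


def positional_encoding(n_positions, d_model):
    """Return an n_positions x d_model table of sinusoidal encodings."""
    table = []
    for pos in range(n_positions):
        # Same rate formula as angle_defn: channels 2j and 2j + 1 share one frequency.
        angles = [pos / math.pow(10000, (2 * (i // 2)) / d_model)
                  for i in range(d_model)]
        sines = [math.sin(a) for a in angles[0::2]]
        cosines = [math.cos(a) for a in angles[1::2]]
        # Sines fill the first half of the row and cosines the second half,
        # rather than being interleaved channel by channel.
        table.append(sines + cosines)
    return table
```

At position 0 every sine is 0 and every cosine is 1, which gives a quick sanity check on the layout.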
|
|
||||||
|
def scaled_dot_product_attention(q, k, v, mask, attention_mask=None, head_mask=None):
|
||||||
|
# calculate attention
|
||||||
|
matmul_qk = tf.matmul(q, k, transpose_b=True)
|
||||||
|
|
||||||
|
dk = tf.cast(shape_list(k)[-1], tf.float32)
|
||||||
|
scaled_attention_logits = matmul_qk / tf.math.sqrt(dk)
|
||||||
|
|
||||||
|
if mask is not None:
|
||||||
|
scaled_attention_logits += (mask * -1e4)
|
||||||
|
|
||||||
|
if attention_mask is not None:
|
||||||
|
# Apply the attention mask
|
||||||
|
scaled_attention_logits = scaled_attention_logits + attention_mask
|
||||||
|
|
||||||
|
attention_weights = tf.nn.softmax(scaled_attention_logits, axis=-1)
|
||||||
|
|
||||||
|
# Mask heads if we want to
|
||||||
|
if head_mask is not None:
|
||||||
|
attention_weights = attention_weights * head_mask
|
||||||
|
|
||||||
|
output = tf.matmul(attention_weights, v)
|
||||||
|
|
||||||
|
return output, attention_weights
|
||||||
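The additive causal mask used above can be sketched in plain Python. The helper below is illustrative only: `causal_attention_weights` is a hypothetical name, it handles a single head, and it omits the 1/sqrt(dk) scaling, the padding mask and the head mask.

```python
import math


def causal_attention_weights(scores, penalty=-1e4):
    """Row-wise softmax of a square score matrix with future positions masked."""
    n = len(scores)
    weights = []
    for i in range(n):
        # The mask is 1 above the diagonal (j > i), like
        # 1 - band_part(ones, -1, 0); masked logits receive the additive penalty.
        row = [scores[i][j] + (penalty if j > i else 0.0) for j in range(n)]
        m = max(row)
        exps = [math.exp(x - m) for x in row]
        z = sum(exps)
        weights.append([e / z for e in exps])
    return weights
```

The penalty of -1e4 is finite, so masked positions are suppressed through underflow in the softmax rather than by an exact -inf; the smaller magnitude compared with -1e9 is presumably chosen to stay representable in half precision, though the commit does not say so.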
|
|
||||||
|
|
||||||
|
class TFMultiHeadAttention(tf.keras.layers.Layer):
|
||||||
|
def __init__(self, d_model_size, num_heads, output_attentions=False, **kwargs):
|
||||||
|
super(TFMultiHeadAttention, self).__init__(**kwargs)
|
||||||
|
self.output_attentions = output_attentions
|
||||||
|
self.num_heads = num_heads
|
||||||
|
self.d_model_size = d_model_size
|
||||||
|
|
||||||
|
self.depth = int(d_model_size / self.num_heads)
|
||||||
|
|
||||||
|
self.Wq = tf.keras.layers.Dense(d_model_size, name='Wq')
|
||||||
|
self.Wk = tf.keras.layers.Dense(d_model_size, name='Wk')
|
||||||
|
self.Wv = tf.keras.layers.Dense(d_model_size, name='Wv')
|
||||||
|
|
||||||
|
self.dense = tf.keras.layers.Dense(d_model_size, name='dense')
|
||||||
|
|
||||||
|
def split_into_heads(self, x, batch_size):
|
||||||
|
x = tf.reshape(x, (batch_size, -1, self.num_heads, self.depth))
|
||||||
|
return tf.transpose(x, perm=[0, 2, 1, 3])
|
||||||
|
|
||||||
|
def call(self, inputs, training=False):
|
||||||
|
v, k, q, mask, layer_past, attention_mask, head_mask = inputs
|
||||||
|
batch_size = q.shape[0]
|
||||||
|
|
||||||
|
q = self.Wq(q)
|
||||||
|
k = self.Wk(k)
|
||||||
|
v = self.Wv(v)
|
||||||
|
|
||||||
|
q = self.split_into_heads(q, batch_size)
|
||||||
|
k = self.split_into_heads(k, batch_size)
|
||||||
|
v = self.split_into_heads(v, batch_size)
|
||||||
|
if layer_past is not None:
|
||||||
|
past_key, past_value = tf.unstack(layer_past, axis=1)
|
||||||
|
k = tf.concat((past_key, k), axis=-2)
|
||||||
|
v = tf.concat((past_value, v), axis=-2)
|
||||||
|
present = tf.stack((k, v), axis=1)
|
||||||
|
|
||||||
|
output = scaled_dot_product_attention(q, k, v, mask, attention_mask, head_mask)
|
||||||
|
scaled_attention = tf.transpose(output[0], perm=[0, 2, 1, 3])
|
||||||
|
attn = output[1]
|
||||||
|
original_size_attention = tf.reshape(scaled_attention, (batch_size, -1, self.d_model_size))
|
||||||
|
output = self.dense(original_size_attention)
|
||||||
|
|
||||||
|
outputs = (output, present)
|
||||||
|
if self.output_attentions:
|
||||||
|
outputs = outputs + (attn,)
|
||||||
|
return outputs
|
||||||
|
|
||||||
|
|
||||||
|
|
||||||
|
def point_wise_feed_forward_network(d_model_size, dff, name=""):
|
||||||
|
return tf.keras.Sequential([
|
||||||
|
tf.keras.layers.Dense(dff, activation='relu', name="0"),
|
||||||
|
tf.keras.layers.Dense(d_model_size, name="2")
|
||||||
|
], name="ffn")
|
||||||
|
|
||||||
|
|
||||||
|
class TFEncoderLayer(tf.keras.layers.Layer):
|
||||||
|
def __init__(self, d_model_size, num_heads, dff, rate=0.1, layer_norm_epsilon=1e-6, output_attentions=False, **kwargs):
|
||||||
|
super(TFEncoderLayer, self).__init__(**kwargs)
|
||||||
|
|
||||||
|
self.multi_head_attention = TFMultiHeadAttention(d_model_size,
|
||||||
|
num_heads,
|
||||||
|
output_attentions,
|
||||||
|
name="multi_head_attention")
|
||||||
|
self.ffn = point_wise_feed_forward_network(d_model_size, dff, name="ffn")
|
||||||
|
|
||||||
|
self.layernorm1 = tf.keras.layers.LayerNormalization(epsilon=layer_norm_epsilon, name="layernorm1")
|
||||||
|
self.layernorm2 = tf.keras.layers.LayerNormalization(epsilon=layer_norm_epsilon, name="layernorm2")
|
||||||
|
|
||||||
|
self.dropout1 = tf.keras.layers.Dropout(rate)
|
||||||
|
self.dropout2 = tf.keras.layers.Dropout(rate)
|
||||||
|
|
||||||
|
def call(self, inputs, training=False):
|
||||||
|
x, mask, layer_past, attention_mask, head_mask = inputs
|
||||||
|
normed = self.layernorm1(x)
|
||||||
|
attn_outputs = self.multi_head_attention([normed, normed, normed, mask, layer_past,
|
||||||
|
attention_mask, head_mask], training=training)
|
||||||
|
attn_output = attn_outputs[0]
|
||||||
|
attn_output = self.dropout1(attn_output, training=training)
|
||||||
|
out1 = x + attn_output
|
||||||
|
|
||||||
|
out2 = self.layernorm2(out1)
|
||||||
|
ffn_output = self.ffn(out2)
|
||||||
|
ffn_output = self.dropout2(ffn_output, training=training)
|
||||||
|
out2 = out1 + ffn_output
|
||||||
|
|
||||||
|
outputs = (out2,) + attn_outputs[1:]
|
||||||
|
return outputs
|
||||||
|
|
||||||
|
|
||||||
|
class TFCTRLMainLayer(tf.keras.layers.Layer):
|
||||||
|
def __init__(self, config, **kwargs):
|
||||||
|
super(TFCTRLMainLayer, self).__init__(**kwargs)
|
||||||
|
self.output_hidden_states = config.output_hidden_states
|
||||||
|
self.output_attentions = config.output_attentions
|
||||||
|
self.output_past = config.output_past
|
||||||
|
|
||||||
|
self.d_model_size = config.n_embd
|
||||||
|
self.num_layers = config.n_layer
|
||||||
|
|
||||||
|
self.pos_encoding = positional_encoding(config.n_positions, self.d_model_size)
|
||||||
|
|
||||||
|
|
||||||
|
self.w = TFSharedEmbeddings(config.vocab_size,
|
||||||
|
config.n_embd,
|
||||||
|
initializer_range=config.initializer_range,
|
||||||
|
name="w")
|
||||||
|
|
||||||
|
self.dropout = tf.keras.layers.Dropout(config.embd_pdrop)
|
||||||
|
self.h = [TFEncoderLayer(config.n_embd,
|
||||||
|
config.n_head,
|
||||||
|
config.dff,
|
||||||
|
config.resid_pdrop,
|
||||||
|
config.layer_norm_epsilon,
|
||||||
|
config.output_attentions,
|
||||||
|
name='h_._{}'.format(i)) for i in range(config.n_layer)]
|
||||||
|
self.layernorm = tf.keras.layers.LayerNormalization(epsilon=config.layer_norm_epsilon, name="layernorm")
|
||||||
|
|
||||||
|
def _resize_token_embeddings(self, new_num_tokens):
|
||||||
|
raise NotImplementedError
|
||||||
|
|
||||||
|
def _prune_heads(self, heads_to_prune):
|
||||||
|
""" Prunes heads of the model.
|
||||||
|
heads_to_prune: dict of {layer_num: list of heads to prune in this layer}
|
||||||
|
"""
|
||||||
|
raise NotImplementedError
|
||||||
|
|
||||||
|
def call(self, inputs, past=None, attention_mask=None, token_type_ids=None, position_ids=None, head_mask=None, training=False):
|
||||||
|
if isinstance(inputs, (tuple, list)):
|
||||||
|
input_ids = inputs[0]
|
||||||
|
past = inputs[1] if len(inputs) > 1 else past
|
||||||
|
attention_mask = inputs[2] if len(inputs) > 2 else attention_mask
|
||||||
|
token_type_ids = inputs[3] if len(inputs) > 3 else token_type_ids
|
||||||
|
position_ids = inputs[4] if len(inputs) > 4 else position_ids
|
||||||
|
head_mask = inputs[5] if len(inputs) > 5 else head_mask
|
||||||
|
assert len(inputs) <= 6, "Too many inputs."
|
||||||
|
elif isinstance(inputs, dict):
|
||||||
|
input_ids = inputs.get('input_ids')
|
||||||
|
past = inputs.get('past', past)
|
||||||
|
attention_mask = inputs.get('attention_mask', attention_mask)
|
||||||
|
token_type_ids = inputs.get('token_type_ids', token_type_ids)
|
||||||
|
position_ids = inputs.get('position_ids', position_ids)
|
||||||
|
head_mask = inputs.get('head_mask', head_mask)
|
||||||
|
assert len(inputs) <= 6, "Too many inputs."
|
||||||
|
else:
|
||||||
|
input_ids = inputs
|
||||||
|
|
||||||
|
input_shape = shape_list(input_ids)
|
||||||
|
input_ids = tf.reshape(input_ids, [-1, input_shape[-1]])
|
||||||
|
|
||||||
|
if past is None:
|
||||||
|
past_length = 0
|
||||||
|
past = [None] * len(self.h)
|
||||||
|
else:
|
||||||
|
past_length = shape_list(past[0][0])[-2]
|
||||||
|
if position_ids is None:
|
||||||
|
position_ids = tf.range(past_length, shape_list(input_ids)[-1] + past_length, dtype=tf.int32)[tf.newaxis, :]
|
||||||
|
position_ids = tf.tile(position_ids, [shape_list(input_ids)[0], 1])
|
||||||
|
|
||||||
|
# Attention mask.
|
||||||
|
if attention_mask is not None:
|
||||||
|
# We create a 4D attention mask from a 2D tensor mask.
|
||||||
|
# Sizes are [batch_size, 1, 1, to_seq_length]
|
||||||
|
# So we can broadcast to [batch_size, num_heads, from_seq_length, to_seq_length]
|
||||||
|
# this attention mask is simpler than the triangular masking of causal attention
|
||||||
|
# used in OpenAI GPT, we just need to prepare the broadcast dimension here.
|
||||||
|
attention_mask = attention_mask[:, tf.newaxis, tf.newaxis, :]
|
||||||
|
|
||||||
|
# Since attention_mask is 1.0 for positions we want to attend and 0.0 for
|
||||||
|
# masked positions, this operation will create a tensor which is 0.0 for
|
||||||
|
# positions we want to attend and -10000.0 for masked positions.
|
||||||
|
# Since we are adding it to the raw scores before the softmax, this is
|
||||||
|
# effectively the same as removing these entirely.
|
||||||
|
|
||||||
|
attention_mask = tf.cast(attention_mask, tf.float32)
|
||||||
|
attention_mask = (1.0 - attention_mask) * -10000.0
|
||||||
|
else:
|
||||||
|
attention_mask = None
|
||||||
|
|
||||||
|
# Prepare head mask if needed
|
||||||
|
# 1.0 in head_mask indicates we keep the head
|
||||||
|
# attention_probs has shape bsz x n_heads x N x N
|
||||||
|
# head_mask has shape n_layer x batch x n_heads x N x N
|
||||||
|
if head_mask is not None:
|
||||||
|
raise NotImplementedError
|
||||||
|
else:
|
||||||
|
head_mask = [None] * self.num_layers
|
||||||
|
|
||||||
|
if token_type_ids is not None:
|
||||||
|
token_type_ids = tf.reshape(token_type_ids, [-1, shape_list(token_type_ids)[-1]])
|
||||||
|
token_type_embeds = self.w(token_type_ids, mode='embedding')
|
||||||
|
token_type_embeds *= tf.math.sqrt(tf.cast(self.d_model_size, tf.float32))
|
||||||
|
else:
|
||||||
|
token_type_embeds = 0
|
||||||
|
position_ids = tf.reshape(position_ids, [-1, shape_list(position_ids)[-1]])
|
||||||
|
|
||||||
|
inputs_embeds = self.w(input_ids, mode='embedding')
|
||||||
|
# x = embedded.unsqueeze(0) if len(input_ids.shape)<2 else embedded
|
||||||
|
seq_len = input_shape[-1]
|
||||||
|
mask = 1 - tf.linalg.band_part(tf.ones((seq_len, seq_len)), -1, 0)
|
||||||
|
|
||||||
|
inputs_embeds *= tf.math.sqrt(tf.cast(self.d_model_size, tf.float32))
|
||||||
|
|
||||||
|
pos_embeds = tf.gather(self.pos_encoding, position_ids)
|
||||||
|
|
||||||
|
hidden_states = inputs_embeds + pos_embeds + token_type_embeds
|
||||||
|
|
||||||
|
hidden_states = self.dropout(hidden_states, training=training)
|
||||||
|
|
||||||
|
output_shape = input_shape + [shape_list(hidden_states)[-1]]
|
||||||
|
presents = ()
|
||||||
|
all_hidden_states = ()
|
||||||
|
all_attentions = []
|
||||||
|
for i, (h, layer_past) in enumerate(zip(self.h, past)):
|
||||||
|
if self.output_hidden_states:
|
||||||
|
all_hidden_states = all_hidden_states + (tf.reshape(hidden_states, output_shape),)
|
||||||
|
outputs = h([hidden_states, mask, layer_past, attention_mask, head_mask[i]], training=training)
|
||||||
|
hidden_states, present = outputs[:2]
|
||||||
|
|
||||||
|
if self.output_past:
|
||||||
|
presents = presents + (present,)
|
||||||
|
|
||||||
|
if self.output_attentions:
|
||||||
|
all_attentions.append(outputs[2])
|
||||||
|
|
||||||
|
hidden_states = self.layernorm(hidden_states)
|
||||||
|
hidden_states = tf.reshape(hidden_states, output_shape)
|
||||||
|
if self.output_hidden_states:
|
||||||
|
all_hidden_states = all_hidden_states + (hidden_states,)
|
||||||
|
|
||||||
|
outputs = (hidden_states,)
|
||||||
|
if self.output_past:
|
||||||
|
outputs = outputs + (presents,)
|
||||||
|
if self.output_hidden_states:
|
||||||
|
outputs = outputs + (all_hidden_states,)
|
||||||
|
if self.output_attentions:
|
||||||
|
# let the number of heads free (-1) so we can extract attention even after head pruning
|
||||||
|
attention_output_shape = input_shape[:-1] + [-1] + shape_list(all_attentions[0])[-2:]
|
||||||
|
all_attentions = tuple(tf.reshape(t, attention_output_shape) for t in all_attentions)
|
||||||
|
outputs = outputs + (all_attentions,)
|
||||||
|
return outputs
|
||||||
|
|
||||||
|
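The look-ahead mask built in the layer above with `1 - tf.linalg.band_part(tf.ones((seq_len, seq_len)), -1, 0)` can be reproduced without TensorFlow. The sketch below is only an illustration in plain Python; `causal_mask` is a hypothetical helper name, not part of the library. `band_part(..., -1, 0)` keeps the lower triangle including the diagonal, so subtracting it from 1 leaves ones at the strictly upper (future) positions. How the attention layer consumes this mask is not shown in this hunk.

```python
def causal_mask(seq_len):
    """Return a seq_len x seq_len matrix with 1.0 strictly above the diagonal."""
    # Entry [row][col] is 1.0 when col > row, i.e. when position `col`
    # lies in the future relative to position `row`; otherwise 0.0.
    return [[1.0 if col > row else 0.0 for col in range(seq_len)]
            for row in range(seq_len)]


if __name__ == "__main__":
    for row in causal_mask(4):
        print(row)
```

Row `i` of the result marks every position after `i`, which is the set a causal model should not attend to.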
|
||||||
|
class TFCTRLPreTrainedModel(TFPreTrainedModel):
|
||||||
|
""" An abstract class to handle weights initialization and
|
||||||
|
a simple interface for downloading and loading pretrained models.
|
||||||
|
"""
|
||||||
|
config_class = CTRLConfig
|
||||||
|
pretrained_model_archive_map = TF_CTRL_PRETRAINED_MODEL_ARCHIVE_MAP
|
||||||
|
base_model_prefix = "transformer"
|
||||||
|
|
||||||
|
|
||||||
|
CTRL_START_DOCSTRING = r""" CTRL model was proposed in
|
||||||
|
`CTRL: A Conditional Transformer Language Model for Controllable Generation`_
|
||||||
|
by Nitish Shirish Keskar*, Bryan McCann*, Lav R. Varshney, Caiming Xiong and Richard Socher.
|
||||||
|
It's a causal (unidirectional) transformer pre-trained using language modeling on a very large
|
||||||
|
corpus of ~140 GB of text data with the first token reserved as a control code (such as Links, Books, Wikipedia etc.).
|
||||||
|
|
||||||
|
This model is a `tf.keras.Model`_ sub-class. Use it as a regular TF 2.0 Keras Model and
|
||||||
|
refer to the TF 2.0 documentation for all matters related to general usage and behavior.
|
||||||
|
|
||||||
|
.. _`CTRL: A Conditional Transformer Language Model for Controllable Generation`:
|
||||||
|
https://www.github.com/salesforce/ctrl
|
||||||
|
|
||||||
|
.. _`tf.keras.Model`:
|
||||||
|
https://www.tensorflow.org/versions/r2.0/api_docs/python/tf/keras/Model
|
||||||
|
|
||||||
|
Parameters:
|
||||||
|
config (:class:`~transformers.CTRLConfig`): Model configuration class with all the parameters of the model.
|
||||||
|
Initializing with a config file does not load the weights associated with the model, only the configuration.
|
||||||
|
Check out the :meth:`~transformers.PreTrainedModel.from_pretrained` method to load the model weights.
|
||||||
|
"""
|
||||||
|
|
||||||
|
CTRL_INPUTS_DOCSTRING = r""" Inputs:
|
||||||
|
**input_ids**: ``Numpy array`` or ``tf.Tensor`` of shape ``(batch_size, sequence_length)``:
|
||||||
|
Indices of input sequence tokens in the vocabulary.
|
||||||
|
CTRL is a model with absolute position embeddings so it's usually advised to pad the inputs on
|
||||||
|
the right rather than the left.
|
||||||
|
Indices can be obtained using :class:`transformers.CTRLTokenizer`.
|
||||||
|
See :func:`transformers.PreTrainedTokenizer.encode` and
|
||||||
|
:func:`transformers.PreTrainedTokenizer.convert_tokens_to_ids` for details.
|
||||||
|
**past**:
|
||||||
|
list of ``Numpy array`` or ``tf.Tensor`` (one for each layer):
|
||||||
|
that contains pre-computed hidden-states (key and values in the attention blocks) as computed by the model
|
||||||
|
(see `past` output below). Can be used to speed up sequential decoding.
|
||||||
|
**attention_mask**: (`optional`) ``Numpy array`` or ``tf.Tensor`` of shape ``(batch_size, sequence_length)``:
|
||||||
|
Mask to avoid performing attention on padding token indices.
|
||||||
|
Mask values selected in ``[0, 1]``:
|
||||||
|
``1`` for tokens that are NOT MASKED, ``0`` for MASKED tokens.
|
||||||
|
**token_type_ids**: (`optional`) ``Numpy array`` or ``tf.Tensor`` of shape ``(batch_size, sequence_length)``:
|
||||||
|
A parallel sequence of tokens (can be used to indicate various portions of the inputs).
|
||||||
|
The embeddings from these tokens will be summed with the respective token embeddings.
|
||||||
|
Indices are selected in the vocabulary (unlike BERT which has a specific vocabulary for segment indices).
|
||||||
|
**position_ids**: (`optional`) ``Numpy array`` or ``tf.Tensor`` of shape ``(batch_size, sequence_length)``:
|
||||||
|
Indices of the position of each input sequence token in the position embeddings.
|
||||||
|
Selected in the range ``[0, config.max_position_embeddings - 1]``.
|
||||||
|
**head_mask**: (`optional`) ``Numpy array`` or ``tf.Tensor`` of shape ``(num_heads,)`` or ``(num_layers, num_heads)``:
|
||||||
|
Mask to nullify selected heads of the self-attention modules.
|
||||||
|
Mask values selected in ``[0, 1]``:
|
||||||
|
``1`` indicates the head is **not masked**, ``0`` indicates the head is **masked**.
|
||||||
|
"""
|
||||||
|
|
||||||
|
@add_start_docstrings("The bare CTRL Model transformer outputting raw hidden-states without any specific head on top.",
|
||||||
|
CTRL_START_DOCSTRING, CTRL_INPUTS_DOCSTRING)
|
||||||
|
class TFCTRLModel(TFCTRLPreTrainedModel):
|
||||||
|
r"""
|
||||||
|
Outputs: `Tuple` comprising various elements depending on the configuration (config) and inputs:
|
||||||
|
**last_hidden_state**: ``tf.Tensor`` of shape ``(batch_size, sequence_length, hidden_size)``
|
||||||
|
Sequence of hidden-states at the last layer of the model.
|
||||||
|
**past**:
|
||||||
|
list of ``tf.Tensor`` (one for each layer) of shape ``(batch_size, num_heads, sequence_length, sequence_length)``:
|
||||||
|
that contains pre-computed hidden-states (key and values in the attention blocks).
|
||||||
|
Can be used (see `past` input) to speed up sequential decoding.
|
||||||
|
**hidden_states**: (`optional`, returned when ``config.output_hidden_states=True``)
|
||||||
|
list of ``tf.Tensor`` (one for the output of each layer + the output of the embeddings)
|
||||||
|
of shape ``(batch_size, sequence_length, hidden_size)``:
|
||||||
|
Hidden-states of the model at the output of each layer plus the initial embedding outputs.
|
||||||
|
**attentions**: (`optional`, returned when ``config.output_attentions=True``)
|
||||||
|
list of ``tf.Tensor`` (one for each layer) of shape ``(batch_size, num_heads, sequence_length, sequence_length)``:
|
||||||
|
Attention weights after the attention softmax, used to compute the weighted average in the self-attention heads.
|
||||||
|
|
||||||
|
Examples::
|
||||||
|
|
||||||
|
import tensorflow as tf
|
||||||
|
from transformers import CTRLTokenizer, TFCTRLModel
|
||||||
|
|
||||||
|
tokenizer = CTRLTokenizer.from_pretrained('ctrl')
|
||||||
|
model = TFCTRLModel.from_pretrained('ctrl')
|
||||||
|
input_ids = tf.constant(tokenizer.encode("Hello, my dog is cute"))[None, :] # Batch size 1
|
||||||
|
outputs = model(input_ids)
|
||||||
|
last_hidden_states = outputs[0] # The last hidden-state is the first element of the output tuple
|
||||||
|
|
||||||
|
"""
|
||||||
|
def __init__(self, config, *inputs, **kwargs):
|
||||||
|
super(TFCTRLModel, self).__init__(config, *inputs, **kwargs)
|
||||||
|
self.transformer = TFCTRLMainLayer(config, name='transformer')
|
||||||
|
|
||||||
|
def call(self, inputs, **kwargs):
|
||||||
|
outputs = self.transformer(inputs, **kwargs)
|
||||||
|
return outputs
|
||||||
|
|
||||||
|
|
||||||
|
class TFCTRLLMHead(tf.keras.layers.Layer):
|
||||||
|
def __init__(self, config, input_embeddings, **kwargs):
|
||||||
|
super(TFCTRLLMHead, self).__init__(**kwargs)
|
||||||
|
self.vocab_size = config.vocab_size
|
||||||
|
|
||||||
|
# The output weights are the same as the input embeddings, but there is
|
||||||
|
# an output-only bias for each token.
|
||||||
|
self.input_embeddings = input_embeddings
|
||||||
|
|
||||||
|
def build(self, input_shape):
|
||||||
|
self.bias = self.add_weight(shape=(self.vocab_size,),
|
||||||
|
initializer='zeros',
|
||||||
|
trainable=True,
|
||||||
|
name='bias')
|
||||||
|
super(TFCTRLLMHead, self).build(input_shape)
|
||||||
|
|
||||||
|
def call(self, hidden_states):
|
||||||
|
hidden_states = self.input_embeddings(hidden_states, mode="linear")
|
||||||
|
hidden_states = hidden_states + self.bias
|
||||||
|
return hidden_states
|
||||||
|
|
||||||
|
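The `TFCTRLLMHead` layer above reuses the input embedding matrix as its output projection and adds a per-token bias. The following plain-Python sketch shows the arithmetic this amounts to, assuming that `mode="linear"` multiplies the hidden state by the transpose of the embedding matrix (that behaviour lives in `TFSharedEmbeddings`, which is not part of this hunk). `tied_lm_logits` and the toy values are hypothetical.

```python
def tied_lm_logits(hidden, embedding, bias):
    """Compute logits[v] = dot(hidden, embedding[v]) + bias[v] for every token v."""
    # `embedding` has one row per vocabulary entry; the same rows are used to
    # embed input ids, so no separate output weight matrix is stored.
    return [sum(h * e for h, e in zip(hidden, row)) + b
            for row, b in zip(embedding, bias)]


# Toy setup: vocabulary of 3 tokens, hidden size 2.
embedding = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
bias = [0.0, 0.5, -1.0]

if __name__ == "__main__":
    print(tied_lm_logits([2.0, 3.0], embedding, bias))
```

Tying the weights this way means the only parameters owned by the head itself are the `vocab_size` bias values created in `build`.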
|
||||||
|
@add_start_docstrings("""The CTRL Model transformer with a language modeling head on top
|
||||||
|
(linear layer with weights tied to the input embeddings). """, CTRL_START_DOCSTRING, CTRL_INPUTS_DOCSTRING)
|
||||||
|
class TFCTRLLMHeadModel(TFCTRLPreTrainedModel):
|
||||||
|
r"""
|
||||||
|
Outputs: `Tuple` comprising various elements depending on the configuration (config) and inputs:
|
||||||
|
**prediction_scores**: ``tf.Tensor`` of shape ``(batch_size, sequence_length, config.vocab_size)``
|
||||||
|
Prediction scores of the language modeling head (scores for each vocabulary token before SoftMax).
|
||||||
|
**past**:
|
||||||
|
list of ``tf.Tensor`` (one for each layer) of shape ``(batch_size, num_heads, sequence_length, sequence_length)``:
|
||||||
|
that contains pre-computed hidden-states (key and values in the attention blocks).
|
||||||
|
Can be used (see `past` input) to speed up sequential decoding.
|
||||||
|
**hidden_states**: (`optional`, returned when ``config.output_hidden_states=True``)
|
||||||
|
list of ``tf.Tensor`` (one for the output of each layer + the output of the embeddings)
|
||||||
|
of shape ``(batch_size, sequence_length, hidden_size)``:
|
||||||
|
Hidden-states of the model at the output of each layer plus the initial embedding outputs.
|
||||||
|
**attentions**: (`optional`, returned when ``config.output_attentions=True``)
|
||||||
|
list of ``tf.Tensor`` (one for each layer) of shape ``(batch_size, num_heads, sequence_length, sequence_length)``:
|
||||||
|
Attention weights after the attention softmax, used to compute the weighted average in the self-attention heads.
|
||||||
|
|
||||||
|
Examples::
|
||||||
|
|
||||||
|
import tensorflow as tf
|
||||||
|
from transformers import CTRLTokenizer, TFCTRLLMHeadModel
|
||||||
|
|
||||||
|
tokenizer = CTRLTokenizer.from_pretrained('ctrl')
|
||||||
|
model = TFCTRLLMHeadModel.from_pretrained('ctrl')
|
||||||
|
|
||||||
|
input_ids = tf.constant(tokenizer.encode("Links Hello, my dog is cute"))[None, :] # Batch size 1
|
||||||
|
outputs = model(input_ids)
|
||||||
|
logits = outputs[0] # The prediction scores are the first element of the output tuple
|
||||||
|
|
||||||
|
"""
|
||||||
|
def __init__(self, config, *inputs, **kwargs):
|
||||||
|
super(TFCTRLLMHeadModel, self).__init__(config, *inputs, **kwargs)
|
||||||
|
self.transformer = TFCTRLMainLayer(config, name='transformer')
|
||||||
|
|
||||||
|
self.lm_head = TFCTRLLMHead(config, self.transformer.w, name="lm_head")
|
||||||
|
|
||||||
|
def call(self, inputs, **kwargs):
|
||||||
|
transformer_outputs = self.transformer(inputs, **kwargs)
|
||||||
|
hidden_states = transformer_outputs[0]
|
||||||
|
|
||||||
|
lm_logits = self.lm_head(hidden_states)
|
||||||
|
|
||||||
|
outputs = (lm_logits,) + transformer_outputs[1:]
|
||||||
|
|
||||||
|
return outputs # lm_logits, presents, (all hidden_states), (attentions)
|
||||||
@@ -31,7 +31,6 @@ import tensorflow as tf
|
|||||||
from .configuration_distilbert import DistilBertConfig
|
from .configuration_distilbert import DistilBertConfig
|
||||||
from .modeling_tf_utils import TFPreTrainedModel, TFSharedEmbeddings, shape_list, get_initializer
|
from .modeling_tf_utils import TFPreTrainedModel, TFSharedEmbeddings, shape_list, get_initializer
|
||||||
from .file_utils import add_start_docstrings
|
from .file_utils import add_start_docstrings
|
||||||
from .modeling_tf_pytorch_utils import load_pytorch_checkpoint_in_tf2_model
|
|
||||||
|
|
||||||
logger = logging.getLogger(__name__)
|
logger = logging.getLogger(__name__)
|
||||||
|
|
||||||
@@ -66,14 +65,6 @@ def gelu_new(x):
|
|||||||
(np.sqrt(2 / np.pi) * (x + 0.044715 * tf.pow(x, 3)))))
|
(np.sqrt(2 / np.pi) * (x + 0.044715 * tf.pow(x, 3)))))
|
||||||
return x * cdf
|
return x * cdf
|
||||||
|
|
||||||
def load_distilbert_pt_weights_in_tf2(tf_model, pytorch_checkpoint_path):
|
|
||||||
# build the network
|
|
||||||
inputs_list = tf.constant([[7, 6, 0, 0, 1], [1, 2, 3, 0, 0], [0, 0, 0, 4, 5]])
|
|
||||||
attns_list = tf.constant([[1, 1, 0, 0, 1], [1, 1, 1, 0, 0], [1, 0, 0, 1, 1]])
|
|
||||||
tf_inputs = [inputs_list, attns_list]
|
|
||||||
tfo = tf_model(tf_inputs, training=False)
|
|
||||||
return load_pytorch_checkpoint_in_tf2_model(tf_model, pytorch_checkpoint_path, tf_inputs=tf_inputs)
|
|
||||||
|
|
||||||
class TFEmbeddings(tf.keras.layers.Layer):
|
class TFEmbeddings(tf.keras.layers.Layer):
|
||||||
def __init__(self, config, **kwargs):
|
def __init__(self, config, **kwargs):
|
||||||
super(TFEmbeddings, self).__init__(**kwargs)
|
super(TFEmbeddings, self).__init__(**kwargs)
|
||||||
@@ -226,8 +217,6 @@ class TFMultiHeadSelfAttention(tf.keras.layers.Layer):
|
|||||||
|
|
||||||
dim_per_head = self.dim // self.n_heads
|
dim_per_head = self.dim // self.n_heads
|
||||||
|
|
||||||
assert 2 <= len(tf.shape(mask)) <= 3
|
|
||||||
causal = (len(tf.shape(mask)) == 3)
|
|
||||||
mask_reshape = [bs, 1, 1, k_length]
|
mask_reshape = [bs, 1, 1, k_length]
|
||||||
|
|
||||||
def shape(x):
|
def shape(x):
|
||||||
@@ -456,7 +445,6 @@ class TFDistilBertPreTrainedModel(TFPreTrainedModel):
|
|||||||
"""
|
"""
|
||||||
config_class = DistilBertConfig
|
config_class = DistilBertConfig
|
||||||
pretrained_model_archive_map = TF_DISTILBERT_PRETRAINED_MODEL_ARCHIVE_MAP
|
pretrained_model_archive_map = TF_DISTILBERT_PRETRAINED_MODEL_ARCHIVE_MAP
|
||||||
load_pt_weights = load_distilbert_pt_weights_in_tf2
|
|
||||||
base_model_prefix = "distilbert"
|
base_model_prefix = "distilbert"
|
||||||
|
|
||||||
|
|
||||||
|
|||||||
@@ -32,7 +32,6 @@ from .modeling_tf_utils import (TFPreTrainedModel, TFConv1D, TFSharedEmbeddings,
|
|||||||
TFSequenceSummary, shape_list, get_initializer)
|
TFSequenceSummary, shape_list, get_initializer)
|
||||||
from .configuration_gpt2 import GPT2Config
|
from .configuration_gpt2 import GPT2Config
|
||||||
from .file_utils import add_start_docstrings
|
from .file_utils import add_start_docstrings
|
||||||
from .modeling_tf_pytorch_utils import load_pytorch_checkpoint_in_tf2_model
|
|
||||||
|
|
||||||
logger = logging.getLogger(__name__)
|
logger = logging.getLogger(__name__)
|
||||||
|
|
||||||
@@ -42,14 +41,6 @@ TF_GPT2_PRETRAINED_MODEL_ARCHIVE_MAP = {"gpt2": "https://s3.amazonaws.com/models
|
|||||||
"distilgpt2": "https://s3.amazonaws.com/models.huggingface.co/bert/distilgpt2-tf_model.h5",}
|
"distilgpt2": "https://s3.amazonaws.com/models.huggingface.co/bert/distilgpt2-tf_model.h5",}
|
||||||
|
|
||||||
|
|
||||||
def load_gpt2_pt_weights_in_tf2(tf_model, pytorch_checkpoint_path):
|
|
||||||
# build the network
|
|
||||||
inputs_list = [[7, 6, 0, 0, 1], [1, 2, 3, 0, 0], [0, 0, 0, 4, 5]]
|
|
||||||
tf_inputs = tf.constant(inputs_list)
|
|
||||||
tfo = tf_model(tf_inputs, training=False)
|
|
||||||
return load_pytorch_checkpoint_in_tf2_model(tf_model, pytorch_checkpoint_path, tf_inputs=tf_inputs)
|
|
||||||
|
|
||||||
|
|
||||||
def gelu(x):
|
def gelu(x):
|
||||||
"""Gaussian Error Linear Unit.
|
"""Gaussian Error Linear Unit.
|
||||||
This is a smoother version of the RELU.
|
This is a smoother version of the RELU.
|
||||||
@@ -350,7 +341,6 @@ class TFGPT2PreTrainedModel(TFPreTrainedModel):
|
|||||||
"""
|
"""
|
||||||
config_class = GPT2Config
|
config_class = GPT2Config
|
||||||
pretrained_model_archive_map = TF_GPT2_PRETRAINED_MODEL_ARCHIVE_MAP
|
pretrained_model_archive_map = TF_GPT2_PRETRAINED_MODEL_ARCHIVE_MAP
|
||||||
load_pt_weights = load_gpt2_pt_weights_in_tf2
|
|
||||||
base_model_prefix = "transformer"
|
base_model_prefix = "transformer"
|
||||||
|
|
||||||
|
|
||||||
|
|||||||
@@ -32,21 +32,12 @@ from .modeling_tf_utils import (TFPreTrainedModel, TFConv1D, TFSharedEmbeddings,
|
|||||||
TFSequenceSummary, shape_list, get_initializer)
|
TFSequenceSummary, shape_list, get_initializer)
|
||||||
from .configuration_openai import OpenAIGPTConfig
|
from .configuration_openai import OpenAIGPTConfig
|
||||||
from .file_utils import add_start_docstrings
|
from .file_utils import add_start_docstrings
|
||||||
from .modeling_tf_pytorch_utils import load_pytorch_checkpoint_in_tf2_model
|
|
||||||
|
|
||||||
logger = logging.getLogger(__name__)
|
logger = logging.getLogger(__name__)
|
||||||
|
|
||||||
TF_OPENAI_GPT_PRETRAINED_MODEL_ARCHIVE_MAP = {"openai-gpt": "https://s3.amazonaws.com/models.huggingface.co/bert/openai-gpt-tf_model.h5"}
|
TF_OPENAI_GPT_PRETRAINED_MODEL_ARCHIVE_MAP = {"openai-gpt": "https://s3.amazonaws.com/models.huggingface.co/bert/openai-gpt-tf_model.h5"}
|
||||||
|
|
||||||
|
|
||||||
def load_openai_gpt_pt_weights_in_tf2(tf_model, pytorch_checkpoint_path):
|
|
||||||
# build the network
|
|
||||||
inputs_list = [[7, 6, 0, 0, 1], [1, 2, 3, 0, 0], [0, 0, 0, 4, 5]]
|
|
||||||
tf_inputs = tf.constant(inputs_list)
|
|
||||||
tfo = tf_model(tf_inputs, training=False)
|
|
||||||
return load_pytorch_checkpoint_in_tf2_model(tf_model, pytorch_checkpoint_path, tf_inputs=tf_inputs)
|
|
||||||
|
|
||||||
|
|
||||||
def gelu(x):
|
def gelu(x):
|
||||||
"""Gaussian Error Linear Unit.
|
"""Gaussian Error Linear Unit.
|
||||||
This is a smoother version of the RELU.
|
This is a smoother version of the RELU.
|
||||||
@@ -335,7 +326,6 @@ class TFOpenAIGPTPreTrainedModel(TFPreTrainedModel):
|
|||||||
"""
|
"""
|
||||||
config_class = OpenAIGPTConfig
|
config_class = OpenAIGPTConfig
|
||||||
pretrained_model_archive_map = TF_OPENAI_GPT_PRETRAINED_MODEL_ARCHIVE_MAP
|
pretrained_model_archive_map = TF_OPENAI_GPT_PRETRAINED_MODEL_ARCHIVE_MAP
|
||||||
load_pt_weights = load_openai_gpt_pt_weights_in_tf2
|
|
||||||
base_model_prefix = "transformer"
|
base_model_prefix = "transformer"
|
||||||
|
|
||||||
|
|
||||||
|
|||||||
@@ -25,8 +25,6 @@ import numpy
|
|||||||
|
|
||||||
logger = logging.getLogger(__name__)
|
logger = logging.getLogger(__name__)
|
||||||
|
|
||||||
DUMMY_INPUTS = [[7, 6, 0, 0, 1], [1, 2, 3, 0, 0], [0, 0, 0, 4, 5]]
|
|
||||||
|
|
||||||
def convert_tf_weight_name_to_pt_weight_name(tf_name, start_prefix_to_remove=''):
|
def convert_tf_weight_name_to_pt_weight_name(tf_name, start_prefix_to_remove=''):
|
||||||
""" Convert a TF 2.0 model variable name in a pytorch model weight name.
|
""" Convert a TF 2.0 model variable name in a pytorch model weight name.
|
||||||
|
|
||||||
@@ -105,7 +103,7 @@ def load_pytorch_weights_in_tf2_model(tf_model, pt_state_dict, tf_inputs=None, a
|
|||||||
raise e
|
raise e
|
||||||
|
|
||||||
if tf_inputs is None:
|
if tf_inputs is None:
|
||||||
tf_inputs = tf.constant(DUMMY_INPUTS)
|
tf_inputs = tf_model.dummy_inputs
|
||||||
|
|
||||||
if tf_inputs is not None:
|
if tf_inputs is not None:
|
||||||
tfo = tf_model(tf_inputs, training=False) # Make sure model is built
|
tfo = tf_model(tf_inputs, training=False) # Make sure model is built
|
||||||
@@ -200,7 +198,7 @@ def load_tf2_checkpoint_in_pytorch_model(pt_model, tf_checkpoint_path, tf_inputs
|
|||||||
tf_model = tf_model_class(pt_model.config)
|
tf_model = tf_model_class(pt_model.config)
|
||||||
|
|
||||||
if tf_inputs is None:
|
if tf_inputs is None:
|
||||||
tf_inputs = tf.constant(DUMMY_INPUTS)
|
tf_inputs = tf_model.dummy_inputs
|
||||||
|
|
||||||
if tf_inputs is not None:
|
if tf_inputs is not None:
|
||||||
tfo = tf_model(tf_inputs, training=False) # Make sure model is built
|
tfo = tf_model(tf_inputs, training=False) # Make sure model is built
|
||||||
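The hunks above replace the per-model `load_*_pt_weights_in_tf2` helpers with a single `dummy_inputs` attribute that the loader falls back to when no inputs are given. The sketch below illustrates that pattern in plain Python, without TensorFlow. `BaseModel`, `PairInputModel`, and `build` are hypothetical names used only for illustration; the second list in `PairInputModel.dummy_inputs` reuses the attention-mask values from the removed DistilBERT helper.

```python
DUMMY_INPUTS = [[7, 6, 0, 0, 1], [1, 2, 3, 0, 0], [0, 0, 0, 4, 5]]


class BaseModel:
    # Default inputs, used only to trigger a first forward pass that builds the model.
    dummy_inputs = DUMMY_INPUTS

    def __init__(self):
        self.built = False

    def __call__(self, inputs):
        # Stand-in for a real forward pass: just record that the model was called.
        self.built = True
        return inputs


class PairInputModel(BaseModel):
    # A subclass that expects [ids, mask] overrides the attribute instead of
    # shipping its own loading function.
    dummy_inputs = [DUMMY_INPUTS,
                    [[1, 1, 0, 0, 1], [1, 1, 1, 0, 0], [1, 0, 0, 1, 1]]]


def build(model, inputs=None):
    """Run one forward pass, falling back to the model's own dummy inputs."""
    if inputs is None:
        inputs = model.dummy_inputs
    model(inputs)
    return model


if __name__ == "__main__":
    print(build(PairInputModel()).built)
```

With the default stored on the class, a generic loader no longer needs to know which input layout a given architecture requires.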
|
|||||||
@@ -26,7 +26,6 @@ import tensorflow as tf
|
|||||||
from .configuration_roberta import RobertaConfig
|
from .configuration_roberta import RobertaConfig
|
||||||
from .modeling_tf_utils import TFPreTrainedModel, get_initializer
|
from .modeling_tf_utils import TFPreTrainedModel, get_initializer
|
||||||
from .file_utils import add_start_docstrings
|
from .file_utils import add_start_docstrings
|
||||||
from .modeling_tf_pytorch_utils import load_pytorch_checkpoint_in_tf2_model
|
|
||||||
|
|
||||||
from .modeling_tf_bert import TFBertEmbeddings, TFBertMainLayer, gelu, gelu_new
|
from .modeling_tf_bert import TFBertEmbeddings, TFBertMainLayer, gelu, gelu_new
|
||||||
|
|
||||||
@@ -36,16 +35,9 @@ TF_ROBERTA_PRETRAINED_MODEL_ARCHIVE_MAP = {
|
|||||||
'roberta-base': "https://s3.amazonaws.com/models.huggingface.co/bert/roberta-base-tf_model.h5",
|
'roberta-base': "https://s3.amazonaws.com/models.huggingface.co/bert/roberta-base-tf_model.h5",
|
||||||
'roberta-large': "https://s3.amazonaws.com/models.huggingface.co/bert/roberta-large-tf_model.h5",
|
'roberta-large': "https://s3.amazonaws.com/models.huggingface.co/bert/roberta-large-tf_model.h5",
|
||||||
'roberta-large-mnli': "https://s3.amazonaws.com/models.huggingface.co/bert/roberta-large-mnli-tf_model.h5",
|
'roberta-large-mnli': "https://s3.amazonaws.com/models.huggingface.co/bert/roberta-large-mnli-tf_model.h5",
|
||||||
|
'distilroberta-base': "https://s3.amazonaws.com/models.huggingface.co/bert/distilroberta-base-tf_model.h5",
|
||||||
}
|
}
|
||||||
|
|
||||||
def load_roberta_pt_weights_in_tf2(tf_model, pytorch_checkpoint_path):
|
|
||||||
# build the network
|
|
||||||
inputs_list = [[7, 6, 0, 0, 1], [1, 2, 3, 0, 0], [0, 0, 0, 4, 5]]
|
|
||||||
tf_inputs = tf.constant(inputs_list)
|
|
||||||
tfo = tf_model(tf_inputs, training=False)
|
|
||||||
return load_pytorch_checkpoint_in_tf2_model(tf_model, pytorch_checkpoint_path, tf_inputs=tf_inputs)
|
|
||||||
|
|
||||||
|
|
||||||
class TFRobertaEmbeddings(TFBertEmbeddings):
|
class TFRobertaEmbeddings(TFBertEmbeddings):
|
||||||
"""
|
"""
|
||||||
Same as BertEmbeddings with a tiny tweak for positional embeddings indexing.
|
Same as BertEmbeddings with a tiny tweak for positional embeddings indexing.
|
||||||
@@ -83,7 +75,7 @@ class TFRobertaMainLayer(TFBertMainLayer):
|
|||||||
input_ids = inputs
|
input_ids = inputs
|
||||||
|
|
||||||
if tf.not_equal(tf.reduce_sum(input_ids[:, 0]), 0):
|
if tf.not_equal(tf.reduce_sum(input_ids[:, 0]), 0):
|
||||||
logger.warning("A sequence with no special tokens has been passed to the RoBERTa model. "
|
tf.print("A sequence with no special tokens has been passed to the RoBERTa model. "
|
||||||
"This model requires special tokens in order to work. "
|
"This model requires special tokens in order to work. "
|
||||||
"Please specify add_special_tokens=True in your encoding.")
|
"Please specify add_special_tokens=True in your encoding.")
|
||||||
|
|
||||||
@@ -96,7 +88,6 @@ class TFRobertaPreTrainedModel(TFPreTrainedModel):
|
|||||||
"""
|
"""
|
||||||
config_class = RobertaConfig
|
config_class = RobertaConfig
|
||||||
pretrained_model_archive_map = TF_ROBERTA_PRETRAINED_MODEL_ARCHIVE_MAP
|
pretrained_model_archive_map = TF_ROBERTA_PRETRAINED_MODEL_ARCHIVE_MAP
|
||||||
load_pt_weights = load_roberta_pt_weights_in_tf2
|
|
||||||
base_model_prefix = "roberta"
|
base_model_prefix = "roberta"
|
||||||
|
|
||||||
|
|
||||||
@@ -380,3 +371,54 @@ class TFRobertaForSequenceClassification(TFRobertaPreTrainedModel):
|
|||||||
outputs = (logits,) + outputs[2:]
|
outputs = (logits,) + outputs[2:]
|
||||||
|
|
||||||
return outputs # logits, (hidden_states), (attentions)
|
return outputs # logits, (hidden_states), (attentions)
|
||||||
|
|
||||||
|
|
||||||
|
@add_start_docstrings("""RoBERTa Model with a token classification head on top (a linear layer on top of
|
||||||
|
the hidden-states output) e.g. for Named-Entity-Recognition (NER) tasks. """,
|
||||||
|
ROBERTA_START_DOCSTRING, ROBERTA_INPUTS_DOCSTRING)
|
||||||
|
class TFRobertaForTokenClassification(TFRobertaPreTrainedModel):
|
||||||
|
r"""
|
||||||
|
Outputs: `Tuple` comprising various elements depending on the configuration (config) and inputs:
|
||||||
|
**scores**: ``Numpy array`` or ``tf.Tensor`` of shape ``(batch_size, sequence_length, config.num_labels)``
|
||||||
|
Classification scores (before SoftMax).
|
||||||
|
**hidden_states**: (`optional`, returned when ``config.output_hidden_states=True``)
|
||||||
|
list of ``Numpy array`` or ``tf.Tensor`` (one for the output of each layer + the output of the embeddings)
|
||||||
|
of shape ``(batch_size, sequence_length, hidden_size)``:
|
||||||
|
Hidden-states of the model at the output of each layer plus the initial embedding outputs.
|
||||||
|
**attentions**: (`optional`, returned when ``config.output_attentions=True``)
|
||||||
|
list of ``Numpy array`` or ``tf.Tensor`` (one for each layer) of shape ``(batch_size, num_heads, sequence_length, sequence_length)``:
|
||||||
|
Attention weights after the attention softmax, used to compute the weighted average in the self-attention heads.
|
||||||
|
|
||||||
|
Examples::
|
||||||
|
|
||||||
|
import tensorflow as tf
|
||||||
|
from transformers import RobertaTokenizer, TFRobertaForTokenClassification
|
||||||
|
|
||||||
|
tokenizer = RobertaTokenizer.from_pretrained('roberta-base')
|
||||||
|
model = TFRobertaForTokenClassification.from_pretrained('roberta-base')
|
||||||
|
input_ids = tf.constant(tokenizer.encode("Hello, my dog is cute", add_special_tokens=True))[None, :] # Batch size 1
|
||||||
|
outputs = model(input_ids)
|
||||||
|
scores = outputs[0]
|
||||||
|
|
||||||
|
"""
|
||||||
|
def __init__(self, config, *inputs, **kwargs):
|
||||||
|
super(TFRobertaForTokenClassification, self).__init__(config, *inputs, **kwargs)
|
||||||
|
self.num_labels = config.num_labels
|
||||||
|
|
||||||
|
self.roberta = TFRobertaMainLayer(config, name='roberta')
|
||||||
|
self.dropout = tf.keras.layers.Dropout(config.hidden_dropout_prob)
|
||||||
|
self.classifier = tf.keras.layers.Dense(config.num_labels,
|
||||||
|
kernel_initializer=get_initializer(config.initializer_range),
|
||||||
|
name='classifier')
|
||||||
|
|
||||||
|
def call(self, inputs, **kwargs):
|
||||||
|
outputs = self.roberta(inputs, **kwargs)
|
||||||
|
|
||||||
|
sequence_output = outputs[0]
|
||||||
|
|
||||||
|
sequence_output = self.dropout(sequence_output, training=kwargs.get('training', False))
|
||||||
|
logits = self.classifier(sequence_output)
|
||||||
|
|
||||||
|
outputs = (logits,) + outputs[2:] # add hidden states and attention if they are here
|
||||||
|
|
||||||
|
return outputs # scores, (hidden_states), (attentions)
|
||||||
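`TFRobertaForTokenClassification` returns unnormalised scores of shape `(batch_size, sequence_length, config.num_labels)`. A common next step, sketched below in plain Python on nested lists, is to take the highest-scoring label for each token. `predict_labels` and the sample scores are hypothetical and only illustrate the shape handling; they are not part of the library.

```python
def predict_labels(scores):
    """Map scores[batch][token][label] to the best label id per token."""
    # For each token, pick the index of the largest score (first one wins on ties).
    return [[max(range(len(token)), key=token.__getitem__) for token in sequence]
            for sequence in scores]


# One sequence of two tokens, three candidate labels each.
sample_scores = [[[0.1, 2.0, -1.0], [3.0, 0.0, 0.5]]]

if __name__ == "__main__":
    print(predict_labels(sample_scores))
```

Because the scores are taken before the softmax, the argmax is the same as it would be on the normalised probabilities.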
|
|||||||
@@ -33,7 +33,6 @@ from .configuration_transfo_xl import TransfoXLConfig
|
|||||||
from .modeling_tf_utils import TFPreTrainedModel, TFConv1D, TFSequenceSummary, shape_list, get_initializer
|
from .modeling_tf_utils import TFPreTrainedModel, TFConv1D, TFSequenceSummary, shape_list, get_initializer
|
||||||
from .modeling_tf_transfo_xl_utilities import TFAdaptiveSoftmaxMask
|
from .modeling_tf_transfo_xl_utilities import TFAdaptiveSoftmaxMask
|
||||||
from .file_utils import add_start_docstrings
|
from .file_utils import add_start_docstrings
|
||||||
from .modeling_tf_pytorch_utils import load_pytorch_checkpoint_in_tf2_model
|
|
||||||
|
|
||||||
logger = logging.getLogger(__name__)
|
logger = logging.getLogger(__name__)
|
||||||
|
|
||||||
@@ -41,14 +40,6 @@ TF_TRANSFO_XL_PRETRAINED_MODEL_ARCHIVE_MAP = {
|
|||||||
'transfo-xl-wt103': "https://s3.amazonaws.com/models.huggingface.co/bert/transfo-xl-wt103-tf_model.h5",
|
'transfo-xl-wt103': "https://s3.amazonaws.com/models.huggingface.co/bert/transfo-xl-wt103-tf_model.h5",
|
||||||
}
|
}
|
||||||
|
|
||||||
def load_transfo_xl_pt_weights_in_tf2(tf_model, pytorch_checkpoint_path):
|
|
||||||
# build the network
|
|
||||||
inputs_list = [[7, 6, 0, 0, 1], [1, 2, 3, 0, 0], [0, 0, 0, 4, 5]]
|
|
||||||
tf_inputs = tf.constant(inputs_list)
|
|
||||||
tfo = tf_model(tf_inputs, training=False)
|
|
||||||
return load_pytorch_checkpoint_in_tf2_model(tf_model, pytorch_checkpoint_path, tf_inputs=tf_inputs)
|
|
||||||
|
|
||||||
|
|
||||||
class TFPositionalEmbedding(tf.keras.layers.Layer):
|
class TFPositionalEmbedding(tf.keras.layers.Layer):
|
||||||
def __init__(self, demb, **kwargs):
|
def __init__(self, demb, **kwargs):
|
||||||
super(TFPositionalEmbedding, self).__init__(**kwargs)
|
super(TFPositionalEmbedding, self).__init__(**kwargs)
|
||||||
@@ -577,7 +568,6 @@ class TFTransfoXLPreTrainedModel(TFPreTrainedModel):
|
|||||||
"""
|
"""
|
||||||
config_class = TransfoXLConfig
|
config_class = TransfoXLConfig
|
||||||
pretrained_model_archive_map = TF_TRANSFO_XL_PRETRAINED_MODEL_ARCHIVE_MAP
|
pretrained_model_archive_map = TF_TRANSFO_XL_PRETRAINED_MODEL_ARCHIVE_MAP
|
||||||
load_pt_weights = load_transfo_xl_pt_weights_in_tf2
|
|
||||||
base_model_prefix = "transformer"
|
base_model_prefix = "transformer"
|
||||||
|
|
||||||
|
|
||||||
|
|||||||
@@ -25,9 +25,11 @@ import tensorflow as tf
|
|||||||
|
|
||||||
from .configuration_utils import PretrainedConfig
|
from .configuration_utils import PretrainedConfig
|
||||||
from .file_utils import cached_path, WEIGHTS_NAME, TF_WEIGHTS_NAME, TF2_WEIGHTS_NAME
|
from .file_utils import cached_path, WEIGHTS_NAME, TF_WEIGHTS_NAME, TF2_WEIGHTS_NAME
|
||||||
|
from .modeling_tf_pytorch_utils import load_pytorch_checkpoint_in_tf2_model
|
||||||
|
|
||||||
logger = logging.getLogger(__name__)
|
logger = logging.getLogger(__name__)
|
||||||
|
|
||||||
|
DUMMY_INPUTS = [[7, 6, 0, 0, 1], [1, 2, 3, 0, 0], [0, 0, 0, 4, 5]]
|
||||||
|
|
||||||
class TFPreTrainedModel(tf.keras.Model):
|
class TFPreTrainedModel(tf.keras.Model):
|
||||||
r""" Base class for all TF models.
|
r""" Base class for all TF models.
|
||||||
@@ -48,8 +50,8 @@ class TFPreTrainedModel(tf.keras.Model):
|
|||||||
"""
|
"""
|
||||||
config_class = None
|
config_class = None
|
||||||
pretrained_model_archive_map = {}
|
pretrained_model_archive_map = {}
|
||||||
load_pt_weights = lambda model, config, path: None
|
|
||||||
base_model_prefix = ""
|
base_model_prefix = ""
|
||||||
|
dummy_inputs = tf.constant(DUMMY_INPUTS) # dummy inputs to build the network
|
||||||
|
|
||||||
def __init__(self, config, *inputs, **kwargs):
|
def __init__(self, config, *inputs, **kwargs):
|
||||||
super(TFPreTrainedModel, self).__init__(*inputs, **kwargs)
|
super(TFPreTrainedModel, self).__init__(*inputs, **kwargs)
|
||||||
@@ -262,17 +264,16 @@ class TFPreTrainedModel(tf.keras.Model):
|
|||||||
|
|
||||||
if from_pt:
|
if from_pt:
|
||||||
# Load from a PyTorch checkpoint
|
# Load from a PyTorch checkpoint
|
||||||
return cls.load_pt_weights(model, resolved_archive_file)
|
return load_pytorch_checkpoint_in_tf2_model(model, resolved_archive_file)
|
||||||
|
|
||||||
inputs = tf.constant([[7, 6, 0, 0, 1], [1, 2, 3, 0, 0], [0, 0, 0, 4, 5]])
|
ret = model(model.dummy_inputs, training=False) # build the network with dummy inputs
|
||||||
ret = model(inputs, training=False) # build the network with dummy inputs
|
|
||||||
|
|
||||||
assert os.path.isfile(resolved_archive_file), "Error retrieving file {}".format(resolved_archive_file)
|
assert os.path.isfile(resolved_archive_file), "Error retrieving file {}".format(resolved_archive_file)
|
||||||
# 'by_name' allow us to do transfer learning by skipping/adding layers
|
# 'by_name' allow us to do transfer learning by skipping/adding layers
|
||||||
# see https://github.com/tensorflow/tensorflow/blob/00fad90125b18b80fe054de1055770cfb8fe4ba3/tensorflow/python/keras/engine/network.py#L1339-L1357
|
# see https://github.com/tensorflow/tensorflow/blob/00fad90125b18b80fe054de1055770cfb8fe4ba3/tensorflow/python/keras/engine/network.py#L1339-L1357
|
||||||
model.load_weights(resolved_archive_file, by_name=True)
|
model.load_weights(resolved_archive_file, by_name=True)
|
||||||
|
|
||||||
ret = model(inputs, training=False) # Make sure restore ops are run
|
ret = model(model.dummy_inputs, training=False) # Make sure restore ops are run
|
||||||
|
|
||||||
return model
|
return model
|
||||||
|
|
||||||
@@ -393,26 +394,26 @@ class TFSequenceSummary(tf.keras.layers.Layer):
|
|||||||
# We can probably just use the multi-head attention module of PyTorch >=1.1.0
|
# We can probably just use the multi-head attention module of PyTorch >=1.1.0
|
||||||
raise NotImplementedError
|
raise NotImplementedError
|
||||||
|
|
||||||
self.summary = None
|
self.has_summary = hasattr(config, 'summary_use_proj') and config.summary_use_proj
|
||||||
if hasattr(config, 'summary_use_proj') and config.summary_use_proj:
|
if self.has_summary:
|
||||||
if hasattr(config, 'summary_proj_to_labels') and config.summary_proj_to_labels and config.num_labels > 0:
|
if hasattr(config, 'summary_proj_to_labels') and config.summary_proj_to_labels and config.num_labels > 0:
|
||||||
num_classes = config.num_labels
|
num_classes = config.num_labels
|
||||||
else:
|
else:
|
||||||
num_classes = config.hidden_size
|
num_classes = config.hidden_size
|
||||||
self.summary = tf.keras.layers.Dense(num_classes,
|
self.summary = tf.keras.layers.Dense(num_classes,
|
||||||
kernel_initializer=get_initializer(initializer_range),
|
kernel_initializer=get_initializer(initializer_range),
|
||||||
name='summary')
|
name='summary')
|
||||||
|
|
||||||
self.activation = None
|
self.has_activation = hasattr(config, 'summary_activation') and config.summary_activation == 'tanh'
|
||||||
if hasattr(config, 'summary_activation') and config.summary_activation == 'tanh':
|
if self.has_activation:
|
||||||
self.activation = tf.keras.activations.tanh
|
self.activation = tf.keras.activations.tanh
|
||||||
|
|
||||||
self.first_dropout = None
|
self.has_first_dropout = hasattr(config, 'summary_first_dropout') and config.summary_first_dropout > 0
|
||||||
if hasattr(config, 'summary_first_dropout') and config.summary_first_dropout > 0:
|
if self.has_first_dropout:
|
||||||
self.first_dropout = tf.keras.layers.Dropout(config.summary_first_dropout)
|
self.first_dropout = tf.keras.layers.Dropout(config.summary_first_dropout)
|
||||||
|
|
||||||
self.last_dropout = None
|
self.has_last_dropout = hasattr(config, 'summary_last_dropout') and config.summary_last_dropout > 0
|
||||||
if hasattr(config, 'summary_last_dropout') and config.summary_last_dropout > 0:
|
if self.has_last_dropout:
|
||||||
self.last_dropout = tf.keras.layers.Dropout(config.summary_last_dropout)
|
self.last_dropout = tf.keras.layers.Dropout(config.summary_last_dropout)
|
||||||
|
|
||||||
def call(self, inputs, training=False):
|
def call(self, inputs, training=False):
|
||||||
@@ -455,17 +456,17 @@ class TFSequenceSummary(tf.keras.layers.Layer):
|
|||||||
elif self.summary_type == 'attn':
|
elif self.summary_type == 'attn':
|
||||||
raise NotImplementedError
|
raise NotImplementedError
|
||||||
|
|
||||||
if training and self.first_dropout is not None:
|
if self.has_first_dropout:
|
||||||
output = self.first_dropout(output)
|
output = self.first_dropout(output, training=training)
|
||||||
|
|
||||||
if self.summary is not None:
|
if self.has_summary:
|
||||||
output = self.summary(output)
|
output = self.summary(output)
|
||||||
|
|
||||||
if self.activation is not None:
|
if self.has_activation:
|
||||||
output = self.activation(output)
|
output = self.activation(output)
|
||||||
|
|
||||||
if training and self.last_dropout is not None:
|
if self.has_last_dropout:
|
||||||
output = self.last_dropout(output)
|
output = self.last_dropout(output, training=training)
|
||||||
|
|
||||||
return output
|
return output
|
||||||
|
|
||||||
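The TFSequenceSummary hunk above replaces `None` checks with boolean `has_*` flags and forwards `training` to the dropout layers instead of guarding the call with `if training`. A minimal sketch of that pattern in plain Python follows. `SketchDropout` and `SketchSummary` are hypothetical stand-ins written for illustration; they are not the tf.keras classes and only mimic the behaviour described here.

```python
import random


class SketchDropout:
    """Hypothetical stand-in for a dropout layer (not tf.keras.layers.Dropout)."""

    def __init__(self, rate, seed=0):
        self.rate = rate
        self._rng = random.Random(seed)

    def __call__(self, values, training=False):
        # Outside of training the layer is the identity.
        if not training:
            return list(values)
        keep = 1.0 - self.rate
        # Inverted dropout: zero some entries and rescale the survivors.
        return [v / keep if self._rng.random() < keep else 0.0 for v in values]


class SketchSummary:
    """Mirrors the has_* flag pattern: a boolean records whether a sub-layer exists."""

    def __init__(self, first_dropout_rate=0.0):
        self.has_first_dropout = first_dropout_rate > 0
        if self.has_first_dropout:
            self.first_dropout = SketchDropout(first_dropout_rate)

    def call(self, inputs, training=False):
        output = list(inputs)
        if self.has_first_dropout:
            # The flag is forwarded; the layer decides what to do with it.
            output = self.first_dropout(output, training=training)
        return output
```

With this shape the caller no longer branches on `training` itself, so the decision stays in one place (the layer), which is presumably the motivation for the change in the diff.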
|
|||||||
@@ -25,9 +25,8 @@ import numpy as np
|
|||||||
import tensorflow as tf
|
import tensorflow as tf
|
||||||
|
|
||||||
from .configuration_xlm import XLMConfig
|
from .configuration_xlm import XLMConfig
|
||||||
from .modeling_tf_utils import TFPreTrainedModel, TFSharedEmbeddings, TFSequenceSummary, shape_list, get_initializer
|
from .modeling_tf_utils import TFPreTrainedModel, TFSharedEmbeddings, TFSequenceSummary, shape_list, get_initializer, DUMMY_INPUTS
|
||||||
from .file_utils import add_start_docstrings
|
from .file_utils import add_start_docstrings
|
||||||
from .modeling_tf_pytorch_utils import load_pytorch_checkpoint_in_tf2_model
|
|
||||||
|
|
||||||
logger = logging.getLogger(__name__)
|
logger = logging.getLogger(__name__)
|
||||||
|
|
||||||
@@ -45,19 +44,6 @@ TF_XLM_PRETRAINED_MODEL_ARCHIVE_MAP = {
|
|||||||
}
|
}
|
||||||
|
|
||||||
|
|
||||||
def load_xlm_pt_weights_in_tf2(tf_model, pytorch_checkpoint_path):
|
|
||||||
# build the network
|
|
||||||
inputs_list = tf.constant([[7, 6, 0, 0, 1], [1, 2, 3, 0, 0], [0, 0, 0, 4, 5]])
|
|
||||||
attns_list = tf.constant([[1, 1, 0, 0, 1], [1, 1, 1, 0, 0], [1, 0, 0, 1, 1]])
|
|
||||||
if tf_model.config.use_lang_emb and tf_model.config.n_langs > 1:
|
|
||||||
langs_list = tf.constant([[1, 1, 0, 0, 1], [1, 1, 1, 0, 0], [1, 0, 0, 1, 1]])
|
|
||||||
else:
|
|
||||||
langs_list = None
|
|
||||||
tf_inputs = [inputs_list, attns_list, langs_list]
|
|
||||||
tfo = tf_model(tf_inputs, training=False)
|
|
||||||
return load_pytorch_checkpoint_in_tf2_model(tf_model, pytorch_checkpoint_path, tf_inputs=tf_inputs)
|
|
||||||
|
|
||||||
|
|
||||||
def create_sinusoidal_embeddings(n_pos, dim, out):
|
def create_sinusoidal_embeddings(n_pos, dim, out):
|
||||||
position_enc = np.array([
|
position_enc = np.array([
|
||||||
[pos / np.power(10000, 2 * (j // 2) / dim) for j in range(dim)]
|
[pos / np.power(10000, 2 * (j // 2) / dim) for j in range(dim)]
|
||||||
@@ -441,9 +427,19 @@ class TFXLMPreTrainedModel(TFPreTrainedModel):
|
|||||||
"""
|
"""
|
||||||
config_class = XLMConfig
|
config_class = XLMConfig
|
||||||
pretrained_model_archive_map = TF_XLM_PRETRAINED_MODEL_ARCHIVE_MAP
|
pretrained_model_archive_map = TF_XLM_PRETRAINED_MODEL_ARCHIVE_MAP
|
||||||
load_pt_weights = load_xlm_pt_weights_in_tf2
|
|
||||||
base_model_prefix = "transformer"
|
base_model_prefix = "transformer"
|
||||||
|
|
||||||
|
@property
|
||||||
|
def dummy_inputs(self):
|
||||||
|
# Sometimes XLM has language embeddings so don't forget to build them as well if needed
|
||||||
|
inputs_list = tf.constant([[7, 6, 0, 0, 1], [1, 2, 3, 0, 0], [0, 0, 0, 4, 5]])
|
||||||
|
attns_list = tf.constant([[1, 1, 0, 0, 1], [1, 1, 1, 0, 0], [1, 0, 0, 1, 1]])
|
||||||
|
if self.config.use_lang_emb and self.config.n_langs > 1:
|
||||||
|
langs_list = tf.constant([[1, 1, 0, 0, 1], [1, 1, 1, 0, 0], [1, 0, 0, 1, 1]])
|
||||||
|
else:
|
||||||
|
langs_list = None
|
||||||
|
return [inputs_list, attns_list, langs_list]
|
||||||
|
|
||||||
|
|
||||||
XLM_START_DOCSTRING = r""" The XLM model was proposed in
|
XLM_START_DOCSTRING = r""" The XLM model was proposed in
|
||||||
`Cross-lingual Language Model Pretraining`_
|
`Cross-lingual Language Model Pretraining`_
|
||||||
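The XLM hunks above drop the per-model `load_pt_weights` function and instead expose a `dummy_inputs` property that the shared loading code can call. A rough sketch of that override pattern, using nested lists in place of `tf.constant` tensors: the class names `SketchConfig`, `SketchBaseModel`, `SketchXLMModel` and the method `build_network` are hypothetical, and the value given to `DUMMY_INPUTS` is an assumption based on the literal used elsewhere in this diff.

```python
# Assumed value; the diff only shows that DUMMY_INPUTS is imported from modeling_tf_utils.
DUMMY_INPUTS = [[7, 6, 0, 0, 1], [1, 2, 3, 0, 0], [0, 0, 0, 4, 5]]


class SketchConfig:
    def __init__(self, use_lang_emb=False, n_langs=1):
        self.use_lang_emb = use_lang_emb
        self.n_langs = n_langs


class SketchBaseModel:
    def __init__(self, config):
        self.config = config

    @property
    def dummy_inputs(self):
        # Default inputs used only to build the network.
        return DUMMY_INPUTS

    def __call__(self, inputs, training=False):
        # A real model would run a forward pass; the sketch echoes its inputs.
        return inputs

    def build_network(self):
        # Shared code asks the model for its own dummy inputs.
        return self(self.dummy_inputs, training=False)


class SketchXLMModel(SketchBaseModel):
    @property
    def dummy_inputs(self):
        # Language embeddings are built only when the config enables them.
        inputs_list = DUMMY_INPUTS
        attns_list = [[1, 1, 0, 0, 1], [1, 1, 1, 0, 0], [1, 0, 0, 1, 1]]
        if self.config.use_lang_emb and self.config.n_langs > 1:
            langs_list = [[1, 1, 0, 0, 1], [1, 1, 1, 0, 0], [1, 0, 0, 1, 1]]
        else:
            langs_list = None
        return [inputs_list, attns_list, langs_list]
```

The base class never needs to know which models require extra inputs; each subclass overrides the property when its forward pass takes more than token ids.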
|
|||||||
@@ -30,7 +30,6 @@ import tensorflow as tf
|
|||||||
from .configuration_xlnet import XLNetConfig
|
from .configuration_xlnet import XLNetConfig
|
||||||
from .modeling_tf_utils import TFPreTrainedModel, TFSharedEmbeddings, TFSequenceSummary, shape_list, get_initializer
|
from .modeling_tf_utils import TFPreTrainedModel, TFSharedEmbeddings, TFSequenceSummary, shape_list, get_initializer
|
||||||
from .file_utils import add_start_docstrings
|
from .file_utils import add_start_docstrings
|
||||||
from .modeling_tf_pytorch_utils import load_pytorch_checkpoint_in_tf2_model
|
|
||||||
|
|
||||||
|
|
||||||
logger = logging.getLogger(__name__)
|
logger = logging.getLogger(__name__)
|
||||||
@@ -41,13 +40,6 @@ TF_XLNET_PRETRAINED_MODEL_ARCHIVE_MAP = {
|
|||||||
}
|
}
|
||||||
|
|
||||||
|
|
||||||
def load_xlnet_pt_weights_in_tf2(tf_model, pytorch_checkpoint_path):
|
|
||||||
inputs_list = [[7, 6, 0, 0, 1], [1, 2, 3, 0, 0], [0, 0, 0, 4, 5]]
|
|
||||||
tf_inputs = tf.constant(inputs_list)
|
|
||||||
tfo = tf_model(tf_inputs, training=False) # build the network
|
|
||||||
return load_pytorch_checkpoint_in_tf2_model(tf_model, pytorch_checkpoint_path, tf_inputs=tf_inputs)
|
|
||||||
|
|
||||||
|
|
||||||
def gelu(x):
|
def gelu(x):
|
||||||
""" Implementation of the gelu activation function.
|
""" Implementation of the gelu activation function.
|
||||||
XLNet is using OpenAI GPT's gelu
|
XLNet is using OpenAI GPT's gelu
|
||||||
@@ -362,6 +354,7 @@ class TFXLNetMainLayer(tf.keras.layers.Layer):
|
|||||||
super(TFXLNetMainLayer, self).__init__(**kwargs)
|
super(TFXLNetMainLayer, self).__init__(**kwargs)
|
||||||
self.output_attentions = config.output_attentions
|
self.output_attentions = config.output_attentions
|
||||||
self.output_hidden_states = config.output_hidden_states
|
self.output_hidden_states = config.output_hidden_states
|
||||||
|
self.output_past = config.output_past
|
||||||
|
|
||||||
self.mem_len = config.mem_len
|
self.mem_len = config.mem_len
|
||||||
self.reuse_len = config.reuse_len
|
self.reuse_len = config.reuse_len
|
||||||
@@ -421,16 +414,13 @@ class TFXLNetMainLayer(tf.keras.layers.Layer):
|
|||||||
|
|
||||||
def cache_mem(self, curr_out, prev_mem):
|
def cache_mem(self, curr_out, prev_mem):
|
||||||
"""cache hidden states into memory."""
|
"""cache hidden states into memory."""
|
||||||
if self.mem_len is None or self.mem_len == 0:
|
if self.reuse_len is not None and self.reuse_len > 0:
|
||||||
return None
|
curr_out = curr_out[:self.reuse_len]
|
||||||
else:
|
|
||||||
if self.reuse_len is not None and self.reuse_len > 0:
|
|
||||||
curr_out = curr_out[:self.reuse_len]
|
|
||||||
|
|
||||||
if prev_mem is None:
|
if prev_mem is None:
|
||||||
new_mem = curr_out[-self.mem_len:]
|
new_mem = curr_out[-self.mem_len:]
|
||||||
else:
|
else:
|
||||||
new_mem = tf.concat([prev_mem, curr_out], 0)[-self.mem_len:]
|
new_mem = tf.concat([prev_mem, curr_out], 0)[-self.mem_len:]
|
||||||
|
|
||||||
return tf.stop_gradient(new_mem)
|
return tf.stop_gradient(new_mem)
|
||||||
|
|
||||||
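The `cache_mem` hunk above removes the `mem_len` check from the method body; later hunks move that check to the call site. The slicing logic can be sketched with plain lists standing in for tensors whose first axis is the sequence axis. The free function below, with `mem_len` and `reuse_len` passed as arguments, is an illustrative rewrite and not the method from the diff.

```python
def cache_mem(curr_out, prev_mem, mem_len, reuse_len=None):
    """Sketch of the memory update, with lists in place of tensors."""
    # Optionally keep only the first reuse_len positions of the current output.
    if reuse_len is not None and reuse_len > 0:
        curr_out = curr_out[:reuse_len]

    if prev_mem is None:
        new_mem = curr_out[-mem_len:]
    else:
        # Append the current output to the old memory, keep the last mem_len items.
        new_mem = (prev_mem + curr_out)[-mem_len:]

    # A copy stands in for tf.stop_gradient / .detach() in the real code.
    return list(new_mem)
```

Note that `seq[-0:]` returns the whole sequence in Python, so calling this sketch with `mem_len=0` would cache everything rather than nothing. That is one reason a `mem_len > 0` guard is still needed somewhere once it leaves the function body.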
@@ -546,8 +536,8 @@ class TFXLNetMainLayer(tf.keras.layers.Layer):
|
|||||||
raise ValueError('Unsupported attention type: {}'.format(self.attn_type))
|
raise ValueError('Unsupported attention type: {}'.format(self.attn_type))
|
||||||
|
|
||||||
# data mask: input mask & perm mask
|
# data mask: input mask & perm mask
|
||||||
assert input_mask is None or attention_mask is None, "You can only use one of input_mask (uses 1 for padding) "
|
assert input_mask is None or attention_mask is None, "You can only use one of input_mask (uses 1 for padding) " \
|
||||||
"or attention_mask (uses 0 for padding, added for compatibility with BERT). Please choose one."
|
"or attention_mask (uses 0 for padding, added for compatibility with BERT). Please choose one."
|
||||||
if input_mask is None and attention_mask is not None:
|
if input_mask is None and attention_mask is not None:
|
||||||
input_mask = 1.0 - attention_mask
|
input_mask = 1.0 - attention_mask
|
||||||
if input_mask is not None and perm_mask is not None:
|
if input_mask is not None and perm_mask is not None:
|
||||||
@@ -632,7 +622,8 @@ class TFXLNetMainLayer(tf.keras.layers.Layer):
|
|||||||
hidden_states = []
|
hidden_states = []
|
||||||
for i, layer_module in enumerate(self.layer):
|
for i, layer_module in enumerate(self.layer):
|
||||||
# cache new mems
|
# cache new mems
|
||||||
new_mems = new_mems + (self.cache_mem(output_h, mems[i]),)
|
if self.mem_len is not None and self.mem_len > 0 and self.output_past:
|
||||||
|
new_mems = new_mems + (self.cache_mem(output_h, mems[i]),)
|
||||||
if self.output_hidden_states:
|
if self.output_hidden_states:
|
||||||
hidden_states.append((output_h, output_g) if output_g is not None else output_h)
|
hidden_states.append((output_h, output_g) if output_g is not None else output_h)
|
||||||
|
|
||||||
@@ -650,7 +641,11 @@ class TFXLNetMainLayer(tf.keras.layers.Layer):
|
|||||||
output = self.dropout(output_g if output_g is not None else output_h, training=training)
|
output = self.dropout(output_g if output_g is not None else output_h, training=training)
|
||||||
|
|
||||||
# Prepare outputs, we transpose back here to shape [bsz, len, hidden_dim] (cf. beginning of forward() method)
|
# Prepare outputs, we transpose back here to shape [bsz, len, hidden_dim] (cf. beginning of forward() method)
|
||||||
outputs = (tf.transpose(output, perm=(1, 0, 2)), new_mems)
|
outputs = (tf.transpose(output, perm=(1, 0, 2)),)
|
||||||
|
|
||||||
|
if self.mem_len is not None and self.mem_len > 0 and self.output_past:
|
||||||
|
outputs = outputs + (new_mems,)
|
||||||
|
|
||||||
if self.output_hidden_states:
|
if self.output_hidden_states:
|
||||||
if output_g is not None:
|
if output_g is not None:
|
||||||
hidden_states = tuple(tf.transpose(h, perm=(1, 0, 2)) for hs in hidden_states for h in hs)
|
hidden_states = tuple(tf.transpose(h, perm=(1, 0, 2)) for hs in hidden_states for h in hs)
|
||||||
@@ -661,7 +656,7 @@ class TFXLNetMainLayer(tf.keras.layers.Layer):
|
|||||||
attentions = tuple(tf.transpose(t, perm=(2, 3, 0, 1)) for t in attentions)
|
attentions = tuple(tf.transpose(t, perm=(2, 3, 0, 1)) for t in attentions)
|
||||||
outputs = outputs + (attentions,)
|
outputs = outputs + (attentions,)
|
||||||
|
|
||||||
return outputs # outputs, new_mems, (hidden_states), (attentions)
|
return outputs # outputs, (new_mems), (hidden_states), (attentions)
|
||||||
|
|
||||||
|
|
||||||
class TFXLNetPreTrainedModel(TFPreTrainedModel):
|
class TFXLNetPreTrainedModel(TFPreTrainedModel):
|
||||||
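After the hunks above, `new_mems` is appended to the output tuple only when `mem_len > 0` and `output_past` is set, so the position of the later elements depends on the configuration. A small sketch of the assembly, assuming a hypothetical helper `assemble_outputs` with strings standing in for tensors:

```python
def assemble_outputs(last_hidden, new_mems, hidden_states, attentions,
                     mem_len, output_past, output_hidden_states, output_attentions):
    """Sketch: build the output tuple with optional trailing elements."""
    outputs = (last_hidden,)

    # mems are included only when memory is enabled and requested.
    if mem_len is not None and mem_len > 0 and output_past:
        outputs = outputs + (new_mems,)

    if output_hidden_states:
        outputs = outputs + (hidden_states,)

    if output_attentions:
        outputs = outputs + (attentions,)

    return outputs  # outputs, (new_mems), (hidden_states), (attentions)
```

Code that previously read `outputs[1]` as mems unconditionally would, under this layout, receive hidden states instead when memory is disabled, so callers likely need to check the same config flags before indexing.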
@@ -670,7 +665,6 @@ class TFXLNetPreTrainedModel(TFPreTrainedModel):
|
|||||||
"""
|
"""
|
||||||
config_class = XLNetConfig
|
config_class = XLNetConfig
|
||||||
pretrained_model_archive_map = TF_XLNET_PRETRAINED_MODEL_ARCHIVE_MAP
|
pretrained_model_archive_map = TF_XLNET_PRETRAINED_MODEL_ARCHIVE_MAP
|
||||||
load_pt_weights = load_xlnet_pt_weights_in_tf2
|
|
||||||
base_model_prefix = "transformer"
|
base_model_prefix = "transformer"
|
||||||
|
|
||||||
|
|
||||||
@@ -777,7 +771,7 @@ class TFXLNetModel(TFXLNetPreTrainedModel):
|
|||||||
Outputs: `Tuple` comprising various elements depending on the configuration (config) and inputs:
|
Outputs: `Tuple` comprising various elements depending on the configuration (config) and inputs:
|
||||||
**last_hidden_state**: ``tf.Tensor`` of shape ``(batch_size, sequence_length, hidden_size)``
|
**last_hidden_state**: ``tf.Tensor`` of shape ``(batch_size, sequence_length, hidden_size)``
|
||||||
Sequence of hidden-states at the last layer of the model.
|
Sequence of hidden-states at the last layer of the model.
|
||||||
**mems**:
|
**mems**: (`optional`, returned when ``config.mem_len > 0``)
|
||||||
list of ``tf.Tensor`` (one for each layer):
|
list of ``tf.Tensor`` (one for each layer):
|
||||||
that contains pre-computed hidden-states (key and values in the attention blocks) as computed by the model
|
that contains pre-computed hidden-states (key and values in the attention blocks) as computed by the model
|
||||||
if config.mem_len > 0 else tuple of None. Can be used to speed up sequential decoding and attend to longer context.
|
if config.mem_len > 0 else tuple of None. Can be used to speed up sequential decoding and attend to longer context.
|
||||||
@@ -819,7 +813,7 @@ class TFXLNetLMHeadModel(TFXLNetPreTrainedModel):
|
|||||||
Outputs: `Tuple` comprising various elements depending on the configuration (config) and inputs:
|
Outputs: `Tuple` comprising various elements depending on the configuration (config) and inputs:
|
||||||
**prediction_scores**: ``tf.Tensor`` of shape ``(batch_size, sequence_length, config.vocab_size)``
|
**prediction_scores**: ``tf.Tensor`` of shape ``(batch_size, sequence_length, config.vocab_size)``
|
||||||
Prediction scores of the language modeling head (scores for each vocabulary token before SoftMax).
|
Prediction scores of the language modeling head (scores for each vocabulary token before SoftMax).
|
||||||
**mems**:
|
**mems**: (`optional`, returned when ``config.mem_len > 0``)
|
||||||
list of ``tf.Tensor`` (one for each layer):
|
list of ``tf.Tensor`` (one for each layer):
|
||||||
that contains pre-computed hidden-states (key and values in the attention blocks) as computed by the model
|
that contains pre-computed hidden-states (key and values in the attention blocks) as computed by the model
|
||||||
if config.mem_len > 0 else tuple of None. Can be used to speed up sequential decoding and attend to longer context.
|
if config.mem_len > 0 else tuple of None. Can be used to speed up sequential decoding and attend to longer context.
|
||||||
@@ -863,7 +857,7 @@ class TFXLNetLMHeadModel(TFXLNetPreTrainedModel):
|
|||||||
|
|
||||||
outputs = (logits,) + transformer_outputs[1:] # Keep mems, hidden states, attentions if they are in it
|
outputs = (logits,) + transformer_outputs[1:] # Keep mems, hidden states, attentions if they are in it
|
||||||
|
|
||||||
return outputs # return logits, mems, (hidden states), (attentions)
|
return outputs # return logits, (mems), (hidden states), (attentions)
|
||||||
|
|
||||||
|
|
||||||
@add_start_docstrings("""XLNet Model with a sequence classification/regression head on top (a linear layer on top of
|
@add_start_docstrings("""XLNet Model with a sequence classification/regression head on top (a linear layer on top of
|
||||||
@@ -874,7 +868,7 @@ class TFXLNetForSequenceClassification(TFXLNetPreTrainedModel):
|
|||||||
Outputs: `Tuple` comprising various elements depending on the configuration (config) and inputs:
|
Outputs: `Tuple` comprising various elements depending on the configuration (config) and inputs:
|
||||||
**logits**: ``tf.Tensor`` of shape ``(batch_size, config.num_labels)``
|
**logits**: ``tf.Tensor`` of shape ``(batch_size, config.num_labels)``
|
||||||
Classification (or regression if config.num_labels==1) scores (before SoftMax).
|
Classification (or regression if config.num_labels==1) scores (before SoftMax).
|
||||||
**mems**:
|
**mems**: (`optional`, returned when ``config.mem_len > 0``)
|
||||||
list of ``tf.Tensor`` (one for each layer):
|
list of ``tf.Tensor`` (one for each layer):
|
||||||
that contains pre-computed hidden-states (key and values in the attention blocks) as computed by the model
|
that contains pre-computed hidden-states (key and values in the attention blocks) as computed by the model
|
||||||
if config.mem_len > 0 else tuple of None. Can be used to speed up sequential decoding and attend to longer context.
|
if config.mem_len > 0 else tuple of None. Can be used to speed up sequential decoding and attend to longer context.
|
||||||
@@ -918,7 +912,7 @@ class TFXLNetForSequenceClassification(TFXLNetPreTrainedModel):
|
|||||||
|
|
||||||
outputs = (logits,) + transformer_outputs[1:] # Keep mems, hidden states, attentions if they are in it
|
outputs = (logits,) + transformer_outputs[1:] # Keep mems, hidden states, attentions if they are in it
|
||||||
|
|
||||||
return outputs # return logits, mems, (hidden states), (attentions)
|
return outputs # return logits, (mems), (hidden states), (attentions)
|
||||||
|
|
||||||
|
|
||||||
# @add_start_docstrings("""XLNet Model with a span classification head on top for extractive question-answering tasks like SQuAD (linear layers on top of
|
# @add_start_docstrings("""XLNet Model with a span classification head on top for extractive question-answering tasks like SQuAD (linear layers on top of
|
||||||
@@ -932,6 +926,11 @@ class TFXLNetForQuestionAnsweringSimple(TFXLNetPreTrainedModel):
|
|||||||
Span-start scores (before SoftMax).
|
Span-start scores (before SoftMax).
|
||||||
**end_scores**: ``tf.Tensor`` of shape ``(batch_size, sequence_length,)``
|
**end_scores**: ``tf.Tensor`` of shape ``(batch_size, sequence_length,)``
|
||||||
Span-end scores (before SoftMax).
|
Span-end scores (before SoftMax).
|
||||||
|
**mems**: (`optional`, returned when ``config.mem_len > 0``)
|
||||||
|
list of ``tf.Tensor`` (one for each layer):
|
||||||
|
that contains pre-computed hidden-states (key and values in the attention blocks) as computed by the model
|
||||||
|
if config.mem_len > 0 else tuple of None. Can be used to speed up sequential decoding and attend to longer context.
|
||||||
|
See details in the docstring of the `mems` input above.
|
||||||
**hidden_states**: (`optional`, returned when ``config.output_hidden_states=True``)
|
**hidden_states**: (`optional`, returned when ``config.output_hidden_states=True``)
|
||||||
list of ``tf.Tensor`` (one for the output of each layer + the output of the embeddings)
|
list of ``tf.Tensor`` (one for the output of each layer + the output of the embeddings)
|
||||||
of shape ``(batch_size, sequence_length, hidden_size)``:
|
of shape ``(batch_size, sequence_length, hidden_size)``:
|
||||||
@@ -971,7 +970,7 @@ class TFXLNetForQuestionAnsweringSimple(TFXLNetPreTrainedModel):
|
|||||||
|
|
||||||
outputs = (start_logits, end_logits,) + transformer_outputs[1:] # Keep mems, hidden states, attentions if they are in it
|
outputs = (start_logits, end_logits,) + transformer_outputs[1:] # Keep mems, hidden states, attentions if they are in it
|
||||||
|
|
||||||
return outputs # start_logits, end_logits, (hidden_states), (attentions)
|
return outputs # start_logits, end_logits, (mems), (hidden_states), (attentions)
|
||||||
|
|
||||||
# @add_start_docstrings("""XLNet Model with a span classification head on top for extractive question-answering tasks like SQuAD (linear layers on top of
|
# @add_start_docstrings("""XLNet Model with a span classification head on top for extractive question-answering tasks like SQuAD (linear layers on top of
|
||||||
# the hidden-states output to compute `span start logits` and `span end logits`). """,
|
# the hidden-states output to compute `span start logits` and `span end logits`). """,
|
||||||
|
|||||||
@@ -316,20 +316,20 @@ class PreTrainedModel(nn.Module):
|
|||||||
# redirect to the cache, if necessary
|
# redirect to the cache, if necessary
|
||||||
try:
|
try:
|
||||||
resolved_archive_file = cached_path(archive_file, cache_dir=cache_dir, force_download=force_download, proxies=proxies)
|
resolved_archive_file = cached_path(archive_file, cache_dir=cache_dir, force_download=force_download, proxies=proxies)
|
||||||
except EnvironmentError as e:
|
except EnvironmentError:
|
||||||
if pretrained_model_name_or_path in cls.pretrained_model_archive_map:
|
if pretrained_model_name_or_path in cls.pretrained_model_archive_map:
|
||||||
logger.error(
|
msg = "Couldn't reach server at '{}' to download pretrained weights.".format(
|
||||||
"Couldn't reach server at '{}' to download pretrained weights.".format(
|
archive_file)
|
||||||
archive_file))
|
|
||||||
else:
|
else:
|
||||||
logger.error(
|
msg = "Model name '{}' was not found in model name list ({}). " \
|
||||||
"Model name '{}' was not found in model name list ({}). "
|
"We assumed '{}' was a path or url to model weight files named one of {} but " \
|
||||||
"We assumed '{}' was a path or url but couldn't find any file "
|
"couldn't find any such file at this path or url.".format(
|
||||||
"associated to this path or url.".format(
|
|
||||||
pretrained_model_name_or_path,
|
pretrained_model_name_or_path,
|
||||||
', '.join(cls.pretrained_model_archive_map.keys()),
|
', '.join(cls.pretrained_model_archive_map.keys()),
|
||||||
archive_file))
|
archive_file,
|
||||||
raise e
|
[WEIGHTS_NAME, TF2_WEIGHTS_NAME, TF_WEIGHTS_NAME])
|
||||||
|
raise EnvironmentError(msg)
|
||||||
|
|
||||||
if resolved_archive_file == archive_file:
|
if resolved_archive_file == archive_file:
|
||||||
logger.info("loading weights file {}".format(archive_file))
|
logger.info("loading weights file {}".format(archive_file))
|
||||||
else:
|
else:
|
||||||
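The hunk above stops logging and re-raising the original exception and instead raises a new `EnvironmentError` whose message names the expected weight files. A self-contained sketch of that flow: `resolve_or_raise`, its `fetch` parameter, and the file names in `WEIGHT_FILE_NAMES` are assumptions made for illustration and are not taken from the diff.

```python
# Placeholder values; the diff refers to WEIGHTS_NAME, TF2_WEIGHTS_NAME, TF_WEIGHTS_NAME.
WEIGHT_FILE_NAMES = ["pytorch_model.bin", "tf_model.h5", "model.ckpt"]


def resolve_or_raise(name_or_path, archive_map, fetch):
    """Sketch: resolve a weights file or raise one error with a complete message."""
    archive_file = archive_map.get(name_or_path, name_or_path)
    try:
        return fetch(archive_file)
    except EnvironmentError:
        if name_or_path in archive_map:
            # Known model name, so the download itself must have failed.
            msg = "Couldn't reach server at '{}' to download pretrained weights.".format(
                archive_file)
        else:
            # Unknown name: it was treated as a path or url, and nothing was found there.
            msg = ("Model name '{}' was not found in model name list ({}). "
                   "We assumed '{}' was a path or url to model weight files named one of {} but "
                   "couldn't find any such file at this path or url.").format(
                name_or_path,
                ', '.join(archive_map.keys()),
                archive_file,
                WEIGHT_FILE_NAMES)
        raise EnvironmentError(msg)
```

Putting the explanation in the exception itself means it reaches the user even when logging is not configured, which appears to be the point of the change.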
|
|||||||
@@ -188,11 +188,8 @@ def swish(x):
|
|||||||
ACT2FN = {"gelu": gelu, "relu": torch.nn.functional.relu, "swish": swish}
|
ACT2FN = {"gelu": gelu, "relu": torch.nn.functional.relu, "swish": swish}
|
||||||
|
|
||||||
|
|
||||||
try:
|
XLNetLayerNorm = nn.LayerNorm
|
||||||
from apex.normalization.fused_layer_norm import FusedLayerNorm as XLNetLayerNorm
|
|
||||||
except (ImportError, AttributeError) as e:
|
|
||||||
logger.info("Better speed can be achieved with apex installed from https://www.github.com/nvidia/apex .")
|
|
||||||
from torch.nn import LayerNorm as XLNetLayerNorm
|
|
||||||
|
|
||||||
class XLNetRelativeAttention(nn.Module):
|
class XLNetRelativeAttention(nn.Module):
|
||||||
def __init__(self, config):
|
def __init__(self, config):
|
||||||
@@ -239,45 +236,60 @@ class XLNetRelativeAttention(nn.Module):
|
|||||||
|
|
||||||
return x
|
return x
|
||||||
|
|
||||||
|
@staticmethod
|
||||||
|
def rel_shift_bnij(x, klen=-1):
|
||||||
|
x_size = x.shape
|
||||||
|
|
||||||
|
x = x.reshape(x_size[0], x_size[1], x_size[3], x_size[2])
|
||||||
|
x = x[:, :, 1:, :]
|
||||||
|
x = x.reshape(x_size[0], x_size[1], x_size[2], x_size[3]-1)
|
||||||
|
# Note: the tensor-slice form was faster in my testing than torch.index_select
|
||||||
|
# However, tracing doesn't like the nature of the slice, and if klen changes
|
||||||
|
# during the run then it'll fail, whereas index_select will be fine.
|
||||||
|
x = torch.index_select(x, 3, torch.arange(klen, device=x.device, dtype=torch.long))
|
||||||
|
# x = x[:, :, :, :klen]
|
||||||
|
|
||||||
|
return x
|
||||||
|
|
||||||
def rel_attn_core(self, q_head, k_head_h, v_head_h, k_head_r, seg_mat=None, attn_mask=None, head_mask=None):
|
def rel_attn_core(self, q_head, k_head_h, v_head_h, k_head_r, seg_mat=None, attn_mask=None, head_mask=None):
|
||||||
"""Core relative positional attention operations."""
|
"""Core relative positional attention operations."""
|
||||||
|
|
||||||
# content based attention score
|
# content based attention score
|
||||||
ac = torch.einsum('ibnd,jbnd->ijbn', q_head + self.r_w_bias, k_head_h)
|
ac = torch.einsum('ibnd,jbnd->bnij', q_head + self.r_w_bias, k_head_h)
|
||||||
|
|
||||||
# position based attention score
|
# position based attention score
|
||||||
bd = torch.einsum('ibnd,jbnd->ijbn', q_head + self.r_r_bias, k_head_r)
|
bd = torch.einsum('ibnd,jbnd->bnij', q_head + self.r_r_bias, k_head_r)
|
||||||
bd = self.rel_shift(bd, klen=ac.shape[1])
|
bd = self.rel_shift_bnij(bd, klen=ac.shape[3])
|
||||||
|
|
||||||
# segment based attention score
|
# segment based attention score
|
||||||
if seg_mat is None:
|
if seg_mat is None:
|
||||||
ef = 0
|
ef = 0
|
||||||
else:
|
else:
|
||||||
ef = torch.einsum('ibnd,snd->ibns', q_head + self.r_s_bias, self.seg_embed)
|
ef = torch.einsum('ibnd,snd->ibns', q_head + self.r_s_bias, self.seg_embed)
|
||||||
ef = torch.einsum('ijbs,ibns->ijbn', seg_mat, ef)
|
ef = torch.einsum('ijbs,ibns->bnij', seg_mat, ef)
|
||||||
|
|
||||||
# merge attention scores and perform masking
|
# merge attention scores and perform masking
|
||||||
attn_score = (ac + bd + ef) * self.scale
|
attn_score = (ac + bd + ef) * self.scale
|
||||||
if attn_mask is not None:
|
if attn_mask is not None:
|
||||||
# attn_score = attn_score * (1 - attn_mask) - 1e30 * attn_mask
|
# attn_score = attn_score * (1 - attn_mask) - 1e30 * attn_mask
|
||||||
if attn_mask.dtype == torch.float16:
|
if attn_mask.dtype == torch.float16:
|
||||||
attn_score = attn_score - 65500 * attn_mask
|
attn_score = attn_score - 65500 * torch.einsum('ijbn->bnij', attn_mask)
|
||||||
else:
|
else:
|
||||||
attn_score = attn_score - 1e30 * attn_mask
|
attn_score = attn_score - 1e30 * torch.einsum('ijbn->bnij', attn_mask)
|
||||||
|
|
||||||
# attention probability
|
# attention probability
|
||||||
attn_prob = F.softmax(attn_score, dim=1)
|
attn_prob = F.softmax(attn_score, dim=3)
|
||||||
attn_prob = self.dropout(attn_prob)
|
attn_prob = self.dropout(attn_prob)
|
||||||
|
|
||||||
# Mask heads if we want to
|
# Mask heads if we want to
|
||||||
if head_mask is not None:
|
if head_mask is not None:
|
||||||
attn_prob = attn_prob * head_mask
|
attn_prob = attn_prob * torch.einsum('ijbn->bnij', head_mask)
|
||||||
|
|
||||||
# attention output
|
# attention output
|
||||||
attn_vec = torch.einsum('ijbn,jbnd->ibnd', attn_prob, v_head_h)
|
attn_vec = torch.einsum('bnij,jbnd->ibnd', attn_prob, v_head_h)
|
||||||
|
|
||||||
if self.output_attentions:
|
if self.output_attentions:
|
||||||
return attn_vec, attn_prob
|
return attn_vec, torch.einsum('bnij->ijbn', attn_prob)
|
||||||
|
|
||||||
return attn_vec
|
return attn_vec
|
||||||
|
|
||||||
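The hunk above switches the attention scores from an `ijbn` layout to `bnij` and adds `rel_shift_bnij` to match. The reshape-and-slice trick can be tried on a small array; the sketch below uses NumPy instead of PyTorch (an assumption made to keep it dependency-light), with `np.take` standing in for `torch.index_select`.

```python
import numpy as np


def rel_shift_bnij(x, klen):
    """NumPy sketch of the relative shift for scores laid out as (batch, head, i, j)."""
    b, n, i, j = x.shape
    # Reinterpret the last two axes, drop the first row, and reshape back:
    # row r of the result then starts (i - r) positions into the original row.
    x = x.reshape(b, n, j, i)
    x = x[:, :, 1:, :]
    x = x.reshape(b, n, i, j - 1)
    # Keep the first klen columns along the last axis.
    return np.take(x, np.arange(klen), axis=3)


# The two einsum layouts hold the same numbers in a different axis order.
rng = np.random.default_rng(0)
q = rng.standard_normal((2, 3, 4, 5))  # (i, batch, head, dim)
k = rng.standard_normal((6, 3, 4, 5))  # (j, batch, head, dim)
scores_ijbn = np.einsum('ibnd,jbnd->ijbn', q, k)
scores_bnij = np.einsum('ibnd,jbnd->bnij', q, k)
```

With `bnij` the key axis `j` is the last one, which is why the softmax in the diff moves from `dim=1` to `dim=3` and why the masks are permuted with `'ijbn->bnij'` before being applied.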
@@ -555,7 +567,7 @@ class XLNetModel(XLNetPreTrainedModel):
|
|||||||
Outputs: `Tuple` comprising various elements depending on the configuration (config) and inputs:
|
Outputs: `Tuple` comprising various elements depending on the configuration (config) and inputs:
|
||||||
**last_hidden_state**: ``torch.FloatTensor`` of shape ``(batch_size, sequence_length, hidden_size)``
|
**last_hidden_state**: ``torch.FloatTensor`` of shape ``(batch_size, sequence_length, hidden_size)``
|
||||||
Sequence of hidden-states at the last layer of the model.
|
Sequence of hidden-states at the last layer of the model.
|
||||||
**mems**:
|
**mems**: (`optional`, returned when ``config.mem_len > 0``)
|
||||||
list of ``torch.FloatTensor`` (one for each layer):
|
list of ``torch.FloatTensor`` (one for each layer):
|
||||||
that contains pre-computed hidden-states (key and values in the attention blocks) as computed by the model
|
that contains pre-computed hidden-states (key and values in the attention blocks) as computed by the model
|
||||||
if config.mem_len > 0 else tuple of None. Can be used to speed up sequential decoding and attend to longer context.
|
if config.mem_len > 0 else tuple of None. Can be used to speed up sequential decoding and attend to longer context.
|
||||||
@@ -581,6 +593,7 @@ class XLNetModel(XLNetPreTrainedModel):
|
|||||||
super(XLNetModel, self).__init__(config)
|
super(XLNetModel, self).__init__(config)
|
||||||
self.output_attentions = config.output_attentions
|
self.output_attentions = config.output_attentions
|
||||||
self.output_hidden_states = config.output_hidden_states
|
self.output_hidden_states = config.output_hidden_states
|
||||||
|
self.output_past = config.output_past
|
||||||
|
|
||||||
self.mem_len = config.mem_len
|
self.mem_len = config.mem_len
|
||||||
self.reuse_len = config.reuse_len
|
self.reuse_len = config.reuse_len
|
||||||
@@ -637,16 +650,13 @@ class XLNetModel(XLNetPreTrainedModel):
|
|||||||
|
|
||||||
def cache_mem(self, curr_out, prev_mem):
|
def cache_mem(self, curr_out, prev_mem):
|
||||||
"""cache hidden states into memory."""
|
"""cache hidden states into memory."""
|
||||||
if self.mem_len is None or self.mem_len == 0:
|
if self.reuse_len is not None and self.reuse_len > 0:
|
||||||
return None
|
curr_out = curr_out[:self.reuse_len]
|
||||||
else:
|
|
||||||
if self.reuse_len is not None and self.reuse_len > 0:
|
|
||||||
curr_out = curr_out[:self.reuse_len]
|
|
||||||
|
|
||||||
if prev_mem is None:
|
if prev_mem is None:
|
||||||
new_mem = curr_out[-self.mem_len:]
|
new_mem = curr_out[-self.mem_len:]
|
||||||
else:
|
else:
|
||||||
new_mem = torch.cat([prev_mem, curr_out], dim=0)[-self.mem_len:]
|
new_mem = torch.cat([prev_mem, curr_out], dim=0)[-self.mem_len:]
|
||||||
|
|
||||||
return new_mem.detach()
|
return new_mem.detach()
|
||||||
|
|
||||||
@@ -817,8 +827,9 @@ class XLNetModel(XLNetPreTrainedModel):
|
|||||||
attentions = []
|
attentions = []
|
||||||
hidden_states = []
|
hidden_states = []
|
||||||
for i, layer_module in enumerate(self.layer):
|
for i, layer_module in enumerate(self.layer):
|
||||||
# cache new mems
|
if self.mem_len is not None and self.mem_len > 0 and self.output_past:
|
||||||
new_mems = new_mems + (self.cache_mem(output_h, mems[i]),)
|
# cache new mems
|
||||||
|
new_mems = new_mems + (self.cache_mem(output_h, mems[i]),)
|
||||||
if self.output_hidden_states:
|
if self.output_hidden_states:
|
||||||
hidden_states.append((output_h, output_g) if output_g is not None else output_h)
|
hidden_states.append((output_h, output_g) if output_g is not None else output_h)
|
||||||
|
|
||||||
@@ -836,7 +847,11 @@ class XLNetModel(XLNetPreTrainedModel):
|
|||||||
output = self.dropout(output_g if output_g is not None else output_h)
|
output = self.dropout(output_g if output_g is not None else output_h)
|
||||||
|
|
||||||
# Prepare outputs, we transpose back here to shape [bsz, len, hidden_dim] (cf. beginning of forward() method)
|
# Prepare outputs, we transpose back here to shape [bsz, len, hidden_dim] (cf. beginning of forward() method)
|
||||||
outputs = (output.permute(1, 0, 2).contiguous(), new_mems)
|
outputs = (output.permute(1, 0, 2).contiguous(),)
|
||||||
|
|
||||||
|
if self.mem_len is not None and self.mem_len > 0 and self.output_past:
|
||||||
|
outputs = outputs + (new_mems,)
|
||||||
|
|
||||||
if self.output_hidden_states:
|
if self.output_hidden_states:
|
||||||
if output_g is not None:
|
if output_g is not None:
|
||||||
hidden_states = tuple(h.permute(1, 0, 2).contiguous() for hs in hidden_states for h in hs)
|
hidden_states = tuple(h.permute(1, 0, 2).contiguous() for hs in hidden_states for h in hs)
|
||||||
@@ -847,7 +862,7 @@ class XLNetModel(XLNetPreTrainedModel):
|
|||||||
attentions = tuple(t.permute(2, 3, 0, 1).contiguous() for t in attentions)
|
attentions = tuple(t.permute(2, 3, 0, 1).contiguous() for t in attentions)
|
||||||
outputs = outputs + (attentions,)
|
outputs = outputs + (attentions,)
|
||||||
|
|
||||||
return outputs # outputs, new_mems, (hidden_states), (attentions)
|
return outputs # outputs, (new_mems), (hidden_states), (attentions)
|
||||||
|
|
||||||
|
|
||||||
@add_start_docstrings("""XLNet Model with a language modeling head on top
|
@add_start_docstrings("""XLNet Model with a language modeling head on top
|
||||||
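Review note: since `new_mems` is now appended only when memory is enabled, callers that index the outputs tuple by position need to know which optional entries are present. A minimal sketch of position-aware unpacking follows; `split_outputs` and its boolean flags are hypothetical helpers for illustration, not part of the library.

```python
def split_outputs(outputs, has_mems, has_hidden_states, has_attentions):
    """Map a positional outputs tuple to names, given which optional parts exist."""
    it = iter(outputs)
    result = {"output": next(it)}  # the first entry is always present
    result["mems"] = next(it) if has_mems else None
    result["hidden_states"] = next(it) if has_hidden_states else None
    result["attentions"] = next(it) if has_attentions else None
    return result


# Placeholder strings stand in for tensors.
with_mems = split_outputs(("out", "mems", "hs"), True, True, False)
without_mems = split_outputs(("out", "hs"), False, True, False)
```

In both calls the hidden states land under the same key even though their tuple index differs.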
@@ -867,7 +882,7 @@ class XLNetLMHeadModel(XLNetPreTrainedModel):
|
|||||||
Language modeling loss.
|
Language modeling loss.
|
||||||
**prediction_scores**: ``torch.FloatTensor`` of shape ``(batch_size, sequence_length, config.vocab_size)``
|
**prediction_scores**: ``torch.FloatTensor`` of shape ``(batch_size, sequence_length, config.vocab_size)``
|
||||||
Prediction scores of the language modeling head (scores for each vocabulary token before SoftMax).
|
Prediction scores of the language modeling head (scores for each vocabulary token before SoftMax).
|
||||||
**mems**:
|
**mems**: (`optional`, returned when ``config.mem_len > 0``)
|
||||||
list of ``torch.FloatTensor`` (one for each layer):
|
list of ``torch.FloatTensor`` (one for each layer):
|
||||||
that contains pre-computed hidden-states (key and values in the attention blocks) as computed by the model
|
that contains pre-computed hidden-states (key and values in the attention blocks) as computed by the model
|
||||||
if config.mem_len > 0 else tuple of None. Can be used to speed up sequential decoding and attend to longer context.
|
if config.mem_len > 0 else tuple of None. Can be used to speed up sequential decoding and attend to longer context.
|
||||||
@@ -918,7 +933,7 @@ class XLNetLMHeadModel(XLNetPreTrainedModel):
|
|||||||
perm_mask=perm_mask,
|
perm_mask=perm_mask,
|
||||||
target_mapping=target_mapping,
|
target_mapping=target_mapping,
|
||||||
token_type_ids=token_type_ids,
|
token_type_ids=token_type_ids,
|
||||||
input_mask=input_mask,
|
input_mask=input_mask,
|
||||||
head_mask=head_mask)
|
head_mask=head_mask)
|
||||||
|
|
||||||
logits = self.lm_loss(transformer_outputs[0])
|
logits = self.lm_loss(transformer_outputs[0])
|
||||||
@@ -932,7 +947,7 @@ class XLNetLMHeadModel(XLNetPreTrainedModel):
|
|||||||
labels.view(-1))
|
labels.view(-1))
|
||||||
outputs = (loss,) + outputs
|
outputs = (loss,) + outputs
|
||||||
|
|
||||||
return outputs # return (loss), logits, mems, (hidden states), (attentions)
|
return outputs # return (loss), logits, (mems), (hidden states), (attentions)
|
||||||
|
|
||||||
|
|
||||||
@add_start_docstrings("""XLNet Model with a sequence classification/regression head on top (a linear layer on top of
|
@add_start_docstrings("""XLNet Model with a sequence classification/regression head on top (a linear layer on top of
|
||||||
@@ -951,7 +966,7 @@ class XLNetForSequenceClassification(XLNetPreTrainedModel):
|
|||||||
Classification (or regression if config.num_labels==1) loss.
|
Classification (or regression if config.num_labels==1) loss.
|
||||||
**logits**: ``torch.FloatTensor`` of shape ``(batch_size, config.num_labels)``
|
**logits**: ``torch.FloatTensor`` of shape ``(batch_size, config.num_labels)``
|
||||||
Classification (or regression if config.num_labels==1) scores (before SoftMax).
|
Classification (or regression if config.num_labels==1) scores (before SoftMax).
|
||||||
**mems**:
|
**mems**: (`optional`, returned when ``config.mem_len > 0``)
|
||||||
list of ``torch.FloatTensor`` (one for each layer):
|
list of ``torch.FloatTensor`` (one for each layer):
|
||||||
that contains pre-computed hidden-states (key and values in the attention blocks) as computed by the model
|
that contains pre-computed hidden-states (key and values in the attention blocks) as computed by the model
|
||||||
if config.mem_len > 0 else tuple of None. Can be used to speed up sequential decoding and attend to longer context.
|
if config.mem_len > 0 else tuple of None. Can be used to speed up sequential decoding and attend to longer context.
|
||||||
@@ -992,7 +1007,7 @@ class XLNetForSequenceClassification(XLNetPreTrainedModel):
|
|||||||
perm_mask=perm_mask,
|
perm_mask=perm_mask,
|
||||||
target_mapping=target_mapping,
|
target_mapping=target_mapping,
|
||||||
token_type_ids=token_type_ids,
|
token_type_ids=token_type_ids,
|
||||||
input_mask=input_mask,
|
input_mask=input_mask,
|
||||||
head_mask=head_mask)
|
head_mask=head_mask)
|
||||||
output = transformer_outputs[0]
|
output = transformer_outputs[0]
|
||||||
|
|
||||||
@@ -1011,7 +1026,7 @@ class XLNetForSequenceClassification(XLNetPreTrainedModel):
|
|||||||
loss = loss_fct(logits.view(-1, self.num_labels), labels.view(-1))
|
loss = loss_fct(logits.view(-1, self.num_labels), labels.view(-1))
|
||||||
outputs = (loss,) + outputs
|
outputs = (loss,) + outputs
|
||||||
|
|
||||||
return outputs # return (loss), logits, mems, (hidden states), (attentions)
|
return outputs # return (loss), logits, (mems), (hidden states), (attentions)
|
||||||
|
|
||||||
@add_start_docstrings("""XLNet Model with a multiple choice classification head on top (a linear layer on top of
|
@add_start_docstrings("""XLNet Model with a multiple choice classification head on top (a linear layer on top of
|
||||||
the pooled output and a softmax) e.g. for RACE/SWAG tasks. """,
|
the pooled output and a softmax) e.g. for RACE/SWAG tasks. """,
|
||||||
@@ -1046,6 +1061,11 @@ class XLNetForMultipleChoice(XLNetPreTrainedModel):
|
|||||||
**classification_scores**: ``torch.FloatTensor`` of shape ``(batch_size, num_choices)`` where `num_choices` is the size of the second dimension
|
**classification_scores**: ``torch.FloatTensor`` of shape ``(batch_size, num_choices)`` where `num_choices` is the size of the second dimension
|
||||||
of the input tensors. (see `input_ids` above).
|
of the input tensors. (see `input_ids` above).
|
||||||
Classification scores (before SoftMax).
|
Classification scores (before SoftMax).
|
||||||
|
**mems**: (`optional`, returned when ``config.mem_len > 0``)
|
||||||
|
list of ``torch.FloatTensor`` (one for each layer):
|
||||||
|
that contains pre-computed hidden-states (key and values in the attention blocks) as computed by the model
|
||||||
|
if config.mem_len > 0 else tuple of None. Can be used to speed up sequential decoding and attend to longer context.
|
||||||
|
See details in the docstring of the `mems` input above.
|
||||||
**hidden_states**: (`optional`, returned when ``config.output_hidden_states=True``)
|
**hidden_states**: (`optional`, returned when ``config.output_hidden_states=True``)
|
||||||
list of ``torch.FloatTensor`` (one for the output of each layer + the output of the embeddings)
|
list of ``torch.FloatTensor`` (one for the output of each layer + the output of the embeddings)
|
||||||
of shape ``(batch_size, sequence_length, hidden_size)``:
|
of shape ``(batch_size, sequence_length, hidden_size)``:
|
||||||
@@ -1102,7 +1122,7 @@ class XLNetForMultipleChoice(XLNetPreTrainedModel):
|
|||||||
loss = loss_fct(reshaped_logits, labels.view(-1))
|
loss = loss_fct(reshaped_logits, labels.view(-1))
|
||||||
outputs = (loss,) + outputs
|
outputs = (loss,) + outputs
|
||||||
|
|
||||||
return outputs # return (loss), logits, mems, (hidden states), (attentions)
|
return outputs # return (loss), logits, (mems), (hidden states), (attentions)
|
||||||
|
|
||||||
|
|
||||||
@add_start_docstrings("""XLNet Model with a span classification head on top for extractive question-answering tasks like SQuAD (linear layers on top of
|
@add_start_docstrings("""XLNet Model with a span classification head on top for extractive question-answering tasks like SQuAD (linear layers on top of
|
||||||
@@ -1126,7 +1146,7 @@ class XLNetForQuestionAnsweringSimple(XLNetPreTrainedModel):
|
|||||||
Span-start scores (before SoftMax).
|
Span-start scores (before SoftMax).
|
||||||
**end_scores**: ``torch.FloatTensor`` of shape ``(batch_size, sequence_length,)``
|
**end_scores**: ``torch.FloatTensor`` of shape ``(batch_size, sequence_length,)``
|
||||||
Span-end scores (before SoftMax).
|
Span-end scores (before SoftMax).
|
||||||
**mems**:
|
**mems**: (`optional`, returned when ``config.mem_len > 0``)
|
||||||
list of ``torch.FloatTensor`` (one for each layer):
|
list of ``torch.FloatTensor`` (one for each layer):
|
||||||
that contains pre-computed hidden-states (key and values in the attention blocks) as computed by the model
|
that contains pre-computed hidden-states (key and values in the attention blocks) as computed by the model
|
||||||
if config.mem_len > 0 else tuple of None. Can be used to speed up sequential decoding and attend to longer context.
|
if config.mem_len > 0 else tuple of None. Can be used to speed up sequential decoding and attend to longer context.
|
||||||
@@ -1169,7 +1189,7 @@ class XLNetForQuestionAnsweringSimple(XLNetPreTrainedModel):
|
|||||||
perm_mask=perm_mask,
|
perm_mask=perm_mask,
|
||||||
target_mapping=target_mapping,
|
target_mapping=target_mapping,
|
||||||
token_type_ids=token_type_ids,
|
token_type_ids=token_type_ids,
|
||||||
input_mask=input_mask,
|
input_mask=input_mask,
|
||||||
head_mask=head_mask)
|
head_mask=head_mask)
|
||||||
|
|
||||||
sequence_output = outputs[0]
|
sequence_output = outputs[0]
|
||||||
@@ -1197,7 +1217,7 @@ class XLNetForQuestionAnsweringSimple(XLNetPreTrainedModel):
|
|||||||
total_loss = (start_loss + end_loss) / 2
|
total_loss = (start_loss + end_loss) / 2
|
||||||
outputs = (total_loss,) + outputs
|
outputs = (total_loss,) + outputs
|
||||||
|
|
||||||
return outputs # (loss), start_logits, end_logits, (hidden_states), (attentions)
|
return outputs # (loss), start_logits, end_logits, (mems), (hidden_states), (attentions)
|
||||||
|
|
||||||
|
|
||||||
@add_start_docstrings("""XLNet Model with a span classification head on top for extractive question-answering tasks like SQuAD (linear layers on top of
|
@add_start_docstrings("""XLNet Model with a span classification head on top for extractive question-answering tasks like SQuAD (linear layers on top of
|
||||||
@@ -1239,7 +1259,7 @@ class XLNetForQuestionAnswering(XLNetPreTrainedModel):
|
|||||||
**cls_logits**: (`optional`, returned if ``start_positions`` or ``end_positions`` is not provided)
|
**cls_logits**: (`optional`, returned if ``start_positions`` or ``end_positions`` is not provided)
|
||||||
``torch.FloatTensor`` of shape ``(batch_size,)``
|
``torch.FloatTensor`` of shape ``(batch_size,)``
|
||||||
Log probabilities for the ``is_impossible`` label of the answers.
|
Log probabilities for the ``is_impossible`` label of the answers.
|
||||||
**mems**:
|
**mems**: (`optional`, returned when ``config.mem_len > 0``)
|
||||||
list of ``torch.FloatTensor`` (one for each layer):
|
list of ``torch.FloatTensor`` (one for each layer):
|
||||||
that contains pre-computed hidden-states (key and values in the attention blocks) as computed by the model
|
that contains pre-computed hidden-states (key and values in the attention blocks) as computed by the model
|
||||||
if config.mem_len > 0 else tuple of None. Can be used to speed up sequential decoding and attend to longer context.
|
if config.mem_len > 0 else tuple of None. Can be used to speed up sequential decoding and attend to longer context.
|
||||||
@@ -1284,7 +1304,7 @@ class XLNetForQuestionAnswering(XLNetPreTrainedModel):
|
|||||||
perm_mask=perm_mask,
|
perm_mask=perm_mask,
|
||||||
target_mapping=target_mapping,
|
target_mapping=target_mapping,
|
||||||
token_type_ids=token_type_ids,
|
token_type_ids=token_type_ids,
|
||||||
input_mask=input_mask,
|
input_mask=input_mask,
|
||||||
head_mask=head_mask)
|
head_mask=head_mask)
|
||||||
hidden_states = transformer_outputs[0]
|
hidden_states = transformer_outputs[0]
|
||||||
start_logits = self.start_logits(hidden_states, p_mask=p_mask)
|
start_logits = self.start_logits(hidden_states, p_mask=p_mask)
|
||||||
|
|||||||
@@ -17,8 +17,10 @@ from __future__ import division
|
|||||||
from __future__ import print_function
|
from __future__ import print_function
|
||||||
|
|
||||||
import copy
|
import copy
|
||||||
|
import sys
|
||||||
import os
|
import os
|
||||||
import shutil
|
import shutil
|
||||||
|
import tempfile
|
||||||
import json
|
import json
|
||||||
import random
|
import random
|
||||||
import uuid
|
import uuid
|
||||||
@@ -31,6 +33,7 @@ from transformers import is_torch_available
|
|||||||
|
|
||||||
if is_torch_available():
|
if is_torch_available():
|
||||||
import torch
|
import torch
|
||||||
|
import numpy as np
|
||||||
|
|
||||||
from transformers import (PretrainedConfig, PreTrainedModel,
|
from transformers import (PretrainedConfig, PreTrainedModel,
|
||||||
BertModel, BertConfig, BERT_PRETRAINED_MODEL_ARCHIVE_MAP,
|
BertModel, BertConfig, BERT_PRETRAINED_MODEL_ARCHIVE_MAP,
|
||||||
@@ -38,6 +41,20 @@ if is_torch_available():
|
|||||||
else:
|
else:
|
||||||
pytestmark = pytest.mark.skip("Require Torch")
|
pytestmark = pytest.mark.skip("Require Torch")
|
||||||
|
|
||||||
|
if sys.version_info[0] == 2:
|
||||||
|
import cPickle as pickle
|
||||||
|
|
||||||
|
class TemporaryDirectory(object):
|
||||||
|
"""Context manager for tempfile.mkdtemp() so it's usable with "with" statement."""
|
||||||
|
def __enter__(self):
|
||||||
|
self.name = tempfile.mkdtemp()
|
||||||
|
return self.name
|
||||||
|
def __exit__(self, exc_type, exc_value, traceback):
|
||||||
|
shutil.rmtree(self.name)
|
||||||
|
else:
|
||||||
|
import pickle
|
||||||
|
TemporaryDirectory = tempfile.TemporaryDirectory
|
||||||
|
unicode = str
|
||||||
|
|
||||||
def _config_zero_init(config):
|
def _config_zero_init(config):
|
||||||
configs_no_init = copy.deepcopy(config)
|
configs_no_init = copy.deepcopy(config)
|
||||||
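Review note: the Python 2 `TemporaryDirectory` shim above could also be written as a generator-based context manager. This is a sketch of an alternative, not what the commit does; `temporary_directory` is a hypothetical name.

```python
import os
import shutil
import tempfile
from contextlib import contextmanager


@contextmanager
def temporary_directory():
    """Create a temporary directory and remove it when the block exits."""
    path = tempfile.mkdtemp()
    try:
        yield path
    finally:
        # Runs on normal exit and on exceptions raised inside the block.
        shutil.rmtree(path)


with temporary_directory() as tmpdirname:
    existed_inside = os.path.isdir(tmpdirname)
exists_after = os.path.exists(tmpdirname)
```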
@@ -57,6 +74,29 @@ class CommonTestCases:
|
|||||||
test_resize_embeddings = True
|
test_resize_embeddings = True
|
||||||
test_head_masking = True
|
test_head_masking = True
|
||||||
|
|
||||||
|
def test_save_load(self):
|
||||||
|
config, inputs_dict = self.model_tester.prepare_config_and_inputs_for_common()
|
||||||
|
|
||||||
|
for model_class in self.all_model_classes:
|
||||||
|
model = model_class(config)
|
||||||
|
model.eval()
|
||||||
|
with torch.no_grad():
|
||||||
|
outputs = model(**inputs_dict)
|
||||||
|
|
||||||
|
with TemporaryDirectory() as tmpdirname:
|
||||||
|
model.save_pretrained(tmpdirname)
|
||||||
|
model = model_class.from_pretrained(tmpdirname)
|
||||||
|
with torch.no_grad():
|
||||||
|
after_outputs = model(**inputs_dict)
|
||||||
|
|
||||||
|
# Make sure we don't have nans
|
||||||
|
out_1 = after_outputs[0].numpy()
|
||||||
|
out_2 = outputs[0].numpy()
|
||||||
|
out_1 = out_1[~np.isnan(out_1)]
|
||||||
|
out_2 = out_2[~np.isnan(out_2)]
|
||||||
|
max_diff = np.amax(np.abs(out_1 - out_2))
|
||||||
|
self.assertLessEqual(max_diff, 1e-5)
|
||||||
|
|
||||||
def test_initialization(self):
|
def test_initialization(self):
|
||||||
config, inputs_dict = self.model_tester.prepare_config_and_inputs_for_common()
|
config, inputs_dict = self.model_tester.prepare_config_and_inputs_for_common()
|
||||||
|
|
||||||
|
|||||||
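Review note: `test_save_load` above filters NaNs out of each output separately, so the two filtered arrays only line up when NaNs occur at the same positions. A variant that uses one shared mask is sketched below in plain Python, with flat lists standing in for the arrays; `max_abs_diff_ignoring_nans` is a hypothetical helper, not the test's actual code.

```python
import math


def max_abs_diff_ignoring_nans(values_a, values_b):
    """Largest |a - b| over positions where neither value is NaN."""
    diffs = [abs(a - b) for a, b in zip(values_a, values_b)
             if not (math.isnan(a) or math.isnan(b))]
    # With no comparable positions there is nothing to disagree on.
    return max(diffs) if diffs else 0.0


diff = max_abs_diff_ignoring_nans([1.0, float("nan"), 3.0], [1.0, 2.0, 3.5])
```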
215
transformers/tests/modeling_ctrl_test.py
Normal file
@@ -0,0 +1,215 @@
|
|||||||
|
# coding=utf-8
|
||||||
|
# Copyright 2018 Salesforce and HuggingFace Inc. team.
|
||||||
|
# Licensed under the Apache License, Version 2.0 (the "License");
|
||||||
|
# you may not use this file except in compliance with the License.
|
||||||
|
# You may obtain a copy of the License at
|
||||||
|
#
|
||||||
|
# http://www.apache.org/licenses/LICENSE-2.0
|
||||||
|
#
|
||||||
|
# Unless required by applicable law or agreed to in writing, software
|
||||||
|
# distributed under the License is distributed on an "AS IS" BASIS,
|
||||||
|
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
|
||||||
|
# See the License for the specific language governing permissions and
|
||||||
|
# limitations under the License.
|
||||||
|
from __future__ import absolute_import
|
||||||
|
from __future__ import division
|
||||||
|
from __future__ import print_function
|
||||||
|
|
||||||
|
import unittest
|
||||||
|
import pytest
|
||||||
|
import shutil
|
||||||
|
import pdb
|
||||||
|
|
||||||
|
from transformers import is_torch_available
|
||||||
|
|
||||||
|
if is_torch_available():
|
||||||
|
from transformers import (CTRLConfig, CTRLModel, CTRL_PRETRAINED_MODEL_ARCHIVE_MAP,
|
||||||
|
CTRLLMHeadModel)
|
||||||
|
else:
|
||||||
|
pytestmark = pytest.mark.skip("Require Torch")
|
||||||
|
|
||||||
|
from .modeling_common_test import (CommonTestCases, ids_tensor)
|
||||||
|
from .configuration_common_test import ConfigTester
|
||||||
|
|
||||||
|
|
||||||
|
class CTRLModelTest(CommonTestCases.CommonModelTester):
|
||||||
|
|
||||||
|
all_model_classes = (CTRLModel, CTRLLMHeadModel) if is_torch_available() else ()
|
||||||
|
test_pruning = False
|
||||||
|
test_torchscript = False
|
||||||
|
test_resize_embeddings = False
|
||||||
|
test_head_masking = False
|
||||||
|
|
||||||
|
class CTRLModelTester(object):
|
||||||
|
|
||||||
|
def __init__(self,
|
||||||
|
parent,
|
||||||
|
batch_size=13,
|
||||||
|
seq_length=7,
|
||||||
|
is_training=True,
|
||||||
|
use_token_type_ids=True,
|
||||||
|
use_input_mask=True,
|
||||||
|
use_labels=True,
|
||||||
|
use_mc_token_ids=True,
|
||||||
|
vocab_size=99,
|
||||||
|
hidden_size=32,
|
||||||
|
num_hidden_layers=5,
|
||||||
|
num_attention_heads=4,
|
||||||
|
intermediate_size=37,
|
||||||
|
hidden_act="gelu",
|
||||||
|
hidden_dropout_prob=0.1,
|
||||||
|
attention_probs_dropout_prob=0.1,
|
||||||
|
max_position_embeddings=512,
|
||||||
|
type_vocab_size=16,
|
||||||
|
type_sequence_label_size=2,
|
||||||
|
initializer_range=0.02,
|
||||||
|
num_labels=3,
|
||||||
|
num_choices=4,
|
||||||
|
scope=None,
|
||||||
|
):
|
||||||
|
self.parent = parent
|
||||||
|
self.batch_size = batch_size
|
||||||
|
self.seq_length = seq_length
|
||||||
|
self.is_training = is_training
|
||||||
|
self.use_token_type_ids = use_token_type_ids
|
||||||
|
self.use_input_mask = use_input_mask
|
||||||
|
self.use_labels = use_labels
|
||||||
|
self.use_mc_token_ids = use_mc_token_ids
|
||||||
|
self.vocab_size = vocab_size
|
||||||
|
self.hidden_size = hidden_size
|
||||||
|
self.num_hidden_layers = num_hidden_layers
|
||||||
|
self.num_attention_heads = num_attention_heads
|
||||||
|
self.intermediate_size = intermediate_size
|
||||||
|
self.hidden_act = hidden_act
|
||||||
|
self.hidden_dropout_prob = hidden_dropout_prob
|
||||||
|
self.attention_probs_dropout_prob = attention_probs_dropout_prob
|
||||||
|
self.max_position_embeddings = max_position_embeddings
|
||||||
|
self.type_vocab_size = type_vocab_size
|
||||||
|
self.type_sequence_label_size = type_sequence_label_size
|
||||||
|
self.initializer_range = initializer_range
|
||||||
|
self.num_labels = num_labels
|
||||||
|
self.num_choices = num_choices
|
||||||
|
self.scope = scope
|
||||||
|
|
||||||
|
def prepare_config_and_inputs(self):
|
||||||
|
input_ids = ids_tensor([self.batch_size, self.seq_length], self.vocab_size)
|
||||||
|
|
||||||
|
input_mask = None
|
||||||
|
if self.use_input_mask:
|
||||||
|
input_mask = ids_tensor([self.batch_size, self.seq_length], vocab_size=2)
|
||||||
|
|
||||||
|
token_type_ids = None
|
||||||
|
if self.use_token_type_ids:
|
||||||
|
token_type_ids = ids_tensor([self.batch_size, self.seq_length], self.type_vocab_size)
|
||||||
|
|
||||||
|
mc_token_ids = None
|
||||||
|
if self.use_mc_token_ids:
|
||||||
|
mc_token_ids = ids_tensor([self.batch_size, self.num_choices], self.seq_length)
|
||||||
|
|
||||||
|
sequence_labels = None
|
||||||
|
token_labels = None
|
||||||
|
choice_labels = None
|
||||||
|
if self.use_labels:
|
||||||
|
sequence_labels = ids_tensor([self.batch_size], self.type_sequence_label_size)
|
||||||
|
token_labels = ids_tensor([self.batch_size, self.seq_length], self.num_labels)
|
||||||
|
choice_labels = ids_tensor([self.batch_size], self.num_choices)
|
||||||
|
|
||||||
|
config = CTRLConfig(
|
||||||
|
vocab_size_or_config_json_file=self.vocab_size,
|
||||||
|
n_embd=self.hidden_size,
|
||||||
|
n_layer=self.num_hidden_layers,
|
||||||
|
n_head=self.num_attention_heads,
|
||||||
|
# intermediate_size=self.intermediate_size,
|
||||||
|
# hidden_act=self.hidden_act,
|
||||||
|
# hidden_dropout_prob=self.hidden_dropout_prob,
|
||||||
|
# attention_probs_dropout_prob=self.attention_probs_dropout_prob,
|
||||||
|
n_positions=self.max_position_embeddings,
|
||||||
|
n_ctx=self.max_position_embeddings
|
||||||
|
# type_vocab_size=self.type_vocab_size,
|
||||||
|
# initializer_range=self.initializer_range
|
||||||
|
)
|
||||||
|
|
||||||
|
head_mask = ids_tensor([self.num_hidden_layers, self.num_attention_heads], 2)
|
||||||
|
|
||||||
|
return config, input_ids, input_mask, head_mask, token_type_ids, mc_token_ids, sequence_labels, token_labels, choice_labels
|
||||||
|
|
||||||
|
def check_loss_output(self, result):
|
||||||
|
self.parent.assertListEqual(
|
||||||
|
list(result["loss"].size()),
|
||||||
|
[])
|
||||||
|
|
||||||
|
def create_and_check_ctrl_model(self, config, input_ids, input_mask, head_mask, token_type_ids, *args):
|
||||||
|
model = CTRLModel(config=config)
|
||||||
|
model.eval()
|
||||||
|
|
||||||
|
model(input_ids, token_type_ids=token_type_ids, head_mask=head_mask)
|
||||||
|
model(input_ids, token_type_ids=token_type_ids)
|
||||||
|
sequence_output, presents = model(input_ids)
|
||||||
|
|
||||||
|
result = {
|
||||||
|
"sequence_output": sequence_output,
|
||||||
|
"presents": presents,
|
||||||
|
}
|
||||||
|
self.parent.assertListEqual(
|
||||||
|
list(result["sequence_output"].size()),
|
||||||
|
[self.batch_size, self.seq_length, self.hidden_size])
|
||||||
|
self.parent.assertEqual(len(result["presents"]), config.n_layer)
|
||||||
|
|
||||||
|
def create_and_check_lm_head_model(self, config, input_ids, input_mask, head_mask, token_type_ids, *args):
|
||||||
|
model = CTRLLMHeadModel(config)
|
||||||
|
model.eval()
|
||||||
|
|
||||||
|
loss, lm_logits, _ = model(input_ids, token_type_ids=token_type_ids, labels=input_ids)
|
||||||
|
|
||||||
|
result = {
|
||||||
|
"loss": loss,
|
||||||
|
"lm_logits": lm_logits
|
||||||
|
}
|
||||||
|
self.parent.assertListEqual(
|
||||||
|
list(result["loss"].size()),
|
||||||
|
[])
|
||||||
|
self.parent.assertListEqual(
|
||||||
|
list(result["lm_logits"].size()),
|
||||||
|
[self.batch_size, self.seq_length, self.vocab_size])
|
||||||
|
|
||||||
|
|
||||||
|
def prepare_config_and_inputs_for_common(self):
|
||||||
|
config_and_inputs = self.prepare_config_and_inputs()
|
||||||
|
|
||||||
|
(config, input_ids, input_mask, head_mask, token_type_ids,
|
||||||
|
mc_token_ids, sequence_labels, token_labels, choice_labels) = config_and_inputs
|
||||||
|
|
||||||
|
inputs_dict = {
|
||||||
|
'input_ids': input_ids,
|
||||||
|
'token_type_ids': token_type_ids,
|
||||||
|
'head_mask': head_mask
|
||||||
|
}
|
||||||
|
|
||||||
|
return config, inputs_dict
|
||||||
|
|
||||||
|
def setUp(self):
|
||||||
|
self.model_tester = CTRLModelTest.CTRLModelTester(self)
|
||||||
|
self.config_tester = ConfigTester(self, config_class=CTRLConfig, n_embd=37)
|
||||||
|
|
||||||
|
def test_config(self):
|
||||||
|
self.config_tester.run_common_tests()
|
||||||
|
|
||||||
|
def test_ctrl_model(self):
|
||||||
|
config_and_inputs = self.model_tester.prepare_config_and_inputs()
|
||||||
|
self.model_tester.create_and_check_ctrl_model(*config_and_inputs)
|
||||||
|
|
||||||
|
def test_ctrl_lm_head_model(self):
|
||||||
|
config_and_inputs = self.model_tester.prepare_config_and_inputs()
|
||||||
|
self.model_tester.create_and_check_lm_head_model(*config_and_inputs)
|
||||||
|
|
||||||
|
@pytest.mark.slow
|
||||||
|
def test_model_from_pretrained(self):
|
||||||
|
cache_dir = "/tmp/transformers_test/"
|
||||||
|
for model_name in list(CTRL_PRETRAINED_MODEL_ARCHIVE_MAP.keys())[:1]:
|
||||||
|
model = CTRLModel.from_pretrained(model_name, cache_dir=cache_dir)
|
||||||
|
shutil.rmtree(cache_dir)
|
||||||
|
self.assertIsNotNone(model)
|
||||||
|
|
||||||
|
|
||||||
|
if __name__ == "__main__":
|
||||||
|
unittest.main()
|
||||||
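Review note: the CTRL tester returns a nine-element tuple from `prepare_config_and_inputs`, while each `create_and_check_*` method names only the first five and absorbs the rest through `*args`. A small sketch of that calling pattern, with placeholder strings in place of the config and tensors (all values here are illustrative):

```python
def prepare_config_and_inputs():
    # Placeholder values standing in for the config object and input tensors.
    return ("config", "input_ids", "input_mask", "head_mask", "token_type_ids",
            "mc_token_ids", "sequence_labels", "token_labels", "choice_labels")


def create_and_check(config, input_ids, input_mask, head_mask, token_type_ids, *args):
    # Only the named parameters are used; the remaining items are collected in args.
    return {"used": (config, input_ids, token_type_ids, head_mask), "ignored": args}


result = create_and_check(*prepare_config_and_inputs())
```

Because the extra items are collected rather than named, the checks depend on the order of the returned tuple, so reordering it would silently change what each check receives.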
@@ -24,7 +24,8 @@ from transformers import is_torch_available
|
|||||||
|
|
||||||
if is_torch_available():
|
if is_torch_available():
|
||||||
import torch
|
import torch
|
||||||
from transformers import (RobertaConfig, RobertaModel, RobertaForMaskedLM, RobertaForSequenceClassification)
|
from transformers import (RobertaConfig, RobertaModel, RobertaForMaskedLM,
|
||||||
|
RobertaForSequenceClassification, RobertaForTokenClassification)
|
||||||
from transformers.modeling_roberta import ROBERTA_PRETRAINED_MODEL_ARCHIVE_MAP
|
from transformers.modeling_roberta import ROBERTA_PRETRAINED_MODEL_ARCHIVE_MAP
|
||||||
else:
|
else:
|
||||||
pytestmark = pytest.mark.skip("Require Torch")
|
pytestmark = pytest.mark.skip("Require Torch")
|
||||||
@@ -156,6 +157,22 @@ class RobertaModelTest(CommonTestCases.CommonModelTester):
|
|||||||
[self.batch_size, self.seq_length, self.vocab_size])
|
[self.batch_size, self.seq_length, self.vocab_size])
|
||||||
self.check_loss_output(result)
|
self.check_loss_output(result)
|
||||||
|
|
||||||
|
def create_and_check_roberta_for_token_classification(self, config, input_ids, token_type_ids, input_mask,
|
||||||
|
sequence_labels, token_labels, choice_labels):
|
||||||
|
config.num_labels = self.num_labels
|
||||||
|
model = RobertaForTokenClassification(config=config)
|
||||||
|
model.eval()
|
||||||
|
loss, logits = model(input_ids, attention_mask=input_mask, token_type_ids=token_type_ids,
|
||||||
|
labels=token_labels)
|
||||||
|
result = {
|
||||||
|
"loss": loss,
|
||||||
|
"logits": logits,
|
||||||
|
}
|
||||||
|
self.parent.assertListEqual(
|
||||||
|
list(result["logits"].size()),
|
||||||
|
[self.batch_size, self.seq_length, self.num_labels])
|
||||||
|
self.check_loss_output(result)
|
||||||
|
|
||||||
def prepare_config_and_inputs_for_common(self):
|
def prepare_config_and_inputs_for_common(self):
|
||||||
config_and_inputs = self.prepare_config_and_inputs()
|
config_and_inputs = self.prepare_config_and_inputs()
|
||||||
(config, input_ids, token_type_ids, input_mask,
|
(config, input_ids, token_type_ids, input_mask,
|
||||||
|
|||||||
@@ -14,6 +14,7 @@
|
|||||||
# limitations under the License.
|
# limitations under the License.
|
||||||
from __future__ import absolute_import, division, print_function
|
from __future__ import absolute_import, division, print_function
|
||||||
|
|
||||||
|
import os
|
||||||
import copy
|
import copy
|
||||||
import json
|
import json
|
||||||
import logging
|
import logging
|
||||||
@@ -22,6 +23,7 @@ import random
|
|||||||
import shutil
|
import shutil
|
||||||
import unittest
|
import unittest
|
||||||
import uuid
|
import uuid
|
||||||
|
import tempfile
|
||||||
|
|
||||||
import pytest
|
import pytest
|
||||||
import sys
|
import sys
|
||||||
@@ -36,6 +38,20 @@ if is_tf_available():
|
|||||||
else:
|
else:
|
||||||
pytestmark = pytest.mark.skip("Require TensorFlow")
|
pytestmark = pytest.mark.skip("Require TensorFlow")
|
||||||
|
|
||||||
|
if sys.version_info[0] == 2:
|
||||||
|
import cPickle as pickle
|
||||||
|
|
||||||
|
class TemporaryDirectory(object):
|
||||||
|
"""Context manager for tempfile.mkdtemp() so it's usable with "with" statement."""
|
||||||
|
def __enter__(self):
|
||||||
|
self.name = tempfile.mkdtemp()
|
||||||
|
return self.name
|
||||||
|
def __exit__(self, exc_type, exc_value, traceback):
|
||||||
|
shutil.rmtree(self.name)
|
||||||
|
else:
|
||||||
|
import pickle
|
||||||
|
TemporaryDirectory = tempfile.TemporaryDirectory
|
||||||
|
unicode = str
|
||||||
|
|
||||||
def _config_zero_init(config):
|
def _config_zero_init(config):
|
||||||
configs_no_init = copy.deepcopy(config)
|
configs_no_init = copy.deepcopy(config)
|
||||||
@@ -66,11 +82,31 @@ class TFCommonTestCases:
|
|||||||
# self.assertIn(param.data.mean().item(), [0.0, 1.0],
|
# self.assertIn(param.data.mean().item(), [0.0, 1.0],
|
||||||
# msg="Parameter {} of model {} seems not properly initialized".format(name, model_class))
|
# msg="Parameter {} of model {} seems not properly initialized".format(name, model_class))
|
||||||
|
|
||||||
|
def test_save_load(self):
|
||||||
|
config, inputs_dict = self.model_tester.prepare_config_and_inputs_for_common()
|
||||||
|
|
||||||
|
for model_class in self.all_model_classes:
|
||||||
|
model = model_class(config)
|
||||||
|
outputs = model(inputs_dict)
|
||||||
|
|
||||||
|
with TemporaryDirectory() as tmpdirname:
|
||||||
|
model.save_pretrained(tmpdirname)
|
||||||
|
model = model_class.from_pretrained(tmpdirname)
|
||||||
|
after_outputs = model(inputs_dict)
|
||||||
|
|
||||||
|
# Make sure we don't have nans
|
||||||
|
out_1 = after_outputs[0].numpy()
|
||||||
|
out_2 = outputs[0].numpy()
|
||||||
|
out_1 = out_1[~np.isnan(out_1)]
|
||||||
|
out_2 = out_2[~np.isnan(out_2)]
|
||||||
|
max_diff = np.amax(np.abs(out_1 - out_2))
|
||||||
|
self.assertLessEqual(max_diff, 1e-5)
|
||||||
|
|
||||||
def test_pt_tf_model_equivalence(self):
|
def test_pt_tf_model_equivalence(self):
|
||||||
if not is_torch_available():
|
if not is_torch_available():
|
||||||
return
|
return
|
||||||
|
|
||||||
|
import torch
|
||||||
import transformers
|
import transformers
|
||||||
|
|
||||||
config, inputs_dict = self.model_tester.prepare_config_and_inputs_for_common()
|
config, inputs_dict = self.model_tester.prepare_config_and_inputs_for_common()
|
||||||
@@ -79,12 +115,71 @@ class TFCommonTestCases:
|
|||||||
pt_model_class_name = model_class.__name__[2:] # Skip the "TF" at the beginning
|
pt_model_class_name = model_class.__name__[2:] # Skip the "TF" at the beginning
|
||||||
pt_model_class = getattr(transformers, pt_model_class_name)
|
pt_model_class = getattr(transformers, pt_model_class_name)
|
||||||
|
|
||||||
|
config.output_hidden_states = True
|
||||||
tf_model = model_class(config)
|
tf_model = model_class(config)
|
||||||
pt_model = pt_model_class(config)
|
pt_model = pt_model_class(config)
|
||||||
|
|
||||||
|
# Check we can load pt model in tf and vice-versa with model => model functions
|
||||||
tf_model = transformers.load_pytorch_model_in_tf2_model(tf_model, pt_model, tf_inputs=inputs_dict)
|
tf_model = transformers.load_pytorch_model_in_tf2_model(tf_model, pt_model, tf_inputs=inputs_dict)
|
||||||
pt_model = transformers.load_tf2_model_in_pytorch_model(pt_model, tf_model)
|
pt_model = transformers.load_tf2_model_in_pytorch_model(pt_model, tf_model)
|
||||||
|
|
||||||
|
# Check predictions on first output (logits/hidden-states) are close enough given low-level computational differences
|
||||||
|
pt_model.eval()
|
||||||
|
pt_inputs_dict = dict((name, torch.from_numpy(key.numpy()).to(torch.long))
|
||||||
|
for name, key in inputs_dict.items())
|
||||||
|
with torch.no_grad():
|
||||||
|
pto = pt_model(**pt_inputs_dict)
|
||||||
|
tfo = tf_model(inputs_dict)
|
||||||
|
max_diff = np.amax(np.abs(tfo[0].numpy() - pto[0].numpy()))
|
||||||
|
self.assertLessEqual(max_diff, 2e-2)
|
||||||
|
|
||||||
|
# Check we can load pt model in tf and vice-versa with checkpoint => model functions
|
||||||
|
with TemporaryDirectory() as tmpdirname:
|
||||||
|
pt_checkpoint_path = os.path.join(tmpdirname, 'pt_model.bin')
|
||||||
|
torch.save(pt_model.state_dict(), pt_checkpoint_path)
|
||||||
|
tf_model = transformers.load_pytorch_checkpoint_in_tf2_model(tf_model, pt_checkpoint_path)
|
||||||
|
|
||||||
|
tf_checkpoint_path = os.path.join(tmpdirname, 'tf_model.h5')
|
||||||
|
tf_model.save_weights(tf_checkpoint_path)
|
||||||
|
pt_model = transformers.load_tf2_checkpoint_in_pytorch_model(pt_model, tf_checkpoint_path)
|
||||||
|
|
||||||
|
# Check predictions on first output (logits/hidden-states) are close enough given low-level computational differences
|
||||||
|
pt_model.eval()
|
||||||
|
pt_inputs_dict = dict((name, torch.from_numpy(key.numpy()).to(torch.long))
|
||||||
|
for name, key in inputs_dict.items())
|
||||||
|
with torch.no_grad():
|
||||||
|
pto = pt_model(**pt_inputs_dict)
|
||||||
|
tfo = tf_model(inputs_dict)
|
||||||
|
max_diff = np.amax(np.abs(tfo[0].numpy() - pto[0].numpy()))
|
||||||
|
self.assertLessEqual(max_diff, 2e-2)
|
||||||
|
|
||||||
|
def test_compile_tf_model(self):
|
||||||
|
config, inputs_dict = self.model_tester.prepare_config_and_inputs_for_common()
|
||||||
|
|
||||||
|
input_ids = tf.keras.Input(batch_shape=(2, 2000), name='input_ids', dtype='int32')
|
||||||
|
optimizer = tf.keras.optimizers.Adam(learning_rate=3e-5, epsilon=1e-08, clipnorm=1.0)
|
||||||
|
loss = tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True)
|
||||||
|
metric = tf.keras.metrics.SparseCategoricalAccuracy('accuracy')
|
||||||
|
|
||||||
|
for model_class in self.all_model_classes:
|
||||||
|
# Prepare our model
|
||||||
|
model = model_class(config)
|
||||||
|
|
||||||
|
# Let's load it from the disk to be sure we can use pretrained weights
|
||||||
|
with TemporaryDirectory() as tmpdirname:
|
||||||
|
outputs = model(inputs_dict) # build the model
|
||||||
|
model.save_pretrained(tmpdirname)
|
||||||
|
model = model_class.from_pretrained(tmpdirname)
|
||||||
|
|
||||||
|
outputs_dict = model(input_ids)
|
||||||
|
hidden_states = outputs_dict[0]
|
||||||
|
|
||||||
|
# Add a dense layer on top to test integration with other keras modules
|
||||||
|
outputs = tf.keras.layers.Dense(2, activation='softmax', name='outputs')(hidden_states)
|
||||||
|
|
||||||
|
# Compile extended model
|
||||||
|
extended_model = tf.keras.Model(inputs=[input_ids], outputs=[outputs])
|
||||||
|
extended_model.compile(optimizer=optimizer, loss=loss, metrics=[metric])
|
||||||
|
|
||||||
def test_keyword_and_dict_args(self):
|
def test_keyword_and_dict_args(self):
|
||||||
config, inputs_dict = self.model_tester.prepare_config_and_inputs_for_common()
|
config, inputs_dict = self.model_tester.prepare_config_and_inputs_for_common()
|
||||||
|
|||||||
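The PyTorch/TensorFlow equivalence check added in the hunk above reduces to a maximum absolute difference compared against a tolerance of 2e-2. The following is a minimal stdlib-only sketch of that comparison, with nested lists standing in for the model outputs. The helper name `max_abs_diff` and the sample values are hypothetical and not part of transformers.

```python
def max_abs_diff(a, b):
    """Largest element-wise absolute difference between two equally shaped nested lists."""
    if isinstance(a, (list, tuple)):
        if len(a) != len(b):
            raise ValueError("shape mismatch")
        # Recurse into each pair of sub-lists and keep the largest difference.
        return max((max_abs_diff(x, y) for x, y in zip(a, b)), default=0.0)
    return abs(a - b)


# Illustrative values only: stand-ins for tfo[0].numpy() and pto[0].numpy().
tf_out = [[0.101, -0.250], [0.333, 0.900]]
pt_out = [[0.100, -0.251], [0.330, 0.905]]

# Same tolerance as the test: small numeric drift between frameworks is accepted.
assert max_abs_diff(tf_out, pt_out) <= 2e-2
```

The test in the diff does the same thing with `np.amax(np.abs(...))`; the tolerance is loose because the two frameworks use different low-level kernels.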
201
transformers/tests/modeling_tf_ctrl_test.py
Normal file
@@ -0,0 +1,201 @@
|
|||||||
|
# coding=utf-8
|
||||||
|
# Copyright 2018 The Google AI Language Team Authors.
|
||||||
|
#
|
||||||
|
# Licensed under the Apache License, Version 2.0 (the "License");
|
||||||
|
# you may not use this file except in compliance with the License.
|
||||||
|
# You may obtain a copy of the License at
|
||||||
|
#
|
||||||
|
# http://www.apache.org/licenses/LICENSE-2.0
|
||||||
|
#
|
||||||
|
# Unless required by applicable law or agreed to in writing, software
|
||||||
|
# distributed under the License is distributed on an "AS IS" BASIS,
|
||||||
|
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
|
||||||
|
# See the License for the specific language governing permissions and
|
||||||
|
# limitations under the License.
|
||||||
|
from __future__ import absolute_import
|
||||||
|
from __future__ import division
|
||||||
|
from __future__ import print_function
|
||||||
|
|
||||||
|
import unittest
|
||||||
|
import shutil
|
||||||
|
import pytest
|
||||||
|
import sys
|
||||||
|
|
||||||
|
from .modeling_tf_common_test import (TFCommonTestCases, ids_tensor)
|
||||||
|
from .configuration_common_test import ConfigTester
|
||||||
|
|
||||||
|
from transformers import CTRLConfig, is_tf_available
|
||||||
|
|
||||||
|
if is_tf_available():
|
||||||
|
import tensorflow as tf
|
||||||
|
from transformers.modeling_tf_ctrl import (TFCTRLModel, TFCTRLLMHeadModel,
|
||||||
|
TF_CTRL_PRETRAINED_MODEL_ARCHIVE_MAP)
|
||||||
|
else:
|
||||||
|
pytestmark = pytest.mark.skip("Require TensorFlow")
|
||||||
|
|
||||||
|
|
||||||
|
class TFCTRLModelTest(TFCommonTestCases.TFCommonModelTester):
|
||||||
|
|
||||||
|
all_model_classes = (TFCTRLModel, TFCTRLLMHeadModel) if is_tf_available() else ()
|
||||||
|
|
||||||
|
class TFCTRLModelTester(object):
|
||||||
|
|
||||||
|
def __init__(self,
|
||||||
|
parent,
|
||||||
|
batch_size=13,
|
||||||
|
seq_length=7,
|
||||||
|
is_training=True,
|
||||||
|
use_token_type_ids=True,
|
||||||
|
use_input_mask=True,
|
||||||
|
use_labels=True,
|
||||||
|
use_mc_token_ids=True,
|
||||||
|
vocab_size=99,
|
||||||
|
hidden_size=32,
|
||||||
|
num_hidden_layers=5,
|
||||||
|
num_attention_heads=4,
|
||||||
|
intermediate_size=37,
|
||||||
|
hidden_act="gelu",
|
||||||
|
hidden_dropout_prob=0.1,
|
||||||
|
attention_probs_dropout_prob=0.1,
|
||||||
|
max_position_embeddings=512,
|
||||||
|
type_vocab_size=16,
|
||||||
|
type_sequence_label_size=2,
|
||||||
|
initializer_range=0.02,
|
||||||
|
num_labels=3,
|
||||||
|
num_choices=4,
|
||||||
|
scope=None,
|
||||||
|
):
|
||||||
|
self.parent = parent
|
||||||
|
self.batch_size = batch_size
|
||||||
|
self.seq_length = seq_length
|
||||||
|
self.is_training = is_training
|
||||||
|
self.use_token_type_ids = use_token_type_ids
|
||||||
|
self.use_input_mask = use_input_mask
|
||||||
|
self.use_labels = use_labels
|
||||||
|
self.use_mc_token_ids = use_mc_token_ids
|
||||||
|
self.vocab_size = vocab_size
|
||||||
|
self.hidden_size = hidden_size
|
||||||
|
self.num_hidden_layers = num_hidden_layers
|
||||||
|
self.num_attention_heads = num_attention_heads
|
||||||
|
self.intermediate_size = intermediate_size
|
||||||
|
self.hidden_act = hidden_act
|
||||||
|
self.hidden_dropout_prob = hidden_dropout_prob
|
||||||
|
self.attention_probs_dropout_prob = attention_probs_dropout_prob
|
||||||
|
self.max_position_embeddings = max_position_embeddings
|
||||||
|
self.type_vocab_size = type_vocab_size
|
||||||
|
self.type_sequence_label_size = type_sequence_label_size
|
||||||
|
self.initializer_range = initializer_range
|
||||||
|
self.num_labels = num_labels
|
||||||
|
self.num_choices = num_choices
|
||||||
|
self.scope = scope
|
||||||
|
|
||||||
|
def prepare_config_and_inputs(self):
|
||||||
|
input_ids = ids_tensor([self.batch_size, self.seq_length], self.vocab_size)
|
||||||
|
|
||||||
|
input_mask = None
|
||||||
|
if self.use_input_mask:
|
||||||
|
input_mask = ids_tensor([self.batch_size, self.seq_length], vocab_size=2)
|
||||||
|
|
||||||
|
token_type_ids = None
|
||||||
|
if self.use_token_type_ids:
|
||||||
|
token_type_ids = ids_tensor([self.batch_size, self.seq_length], self.type_vocab_size)
|
||||||
|
|
||||||
|
mc_token_ids = None
|
||||||
|
if self.use_mc_token_ids:
|
||||||
|
mc_token_ids = ids_tensor([self.batch_size, self.num_choices], self.seq_length)
|
||||||
|
|
||||||
|
sequence_labels = None
|
||||||
|
token_labels = None
|
||||||
|
choice_labels = None
|
||||||
|
if self.use_labels:
|
||||||
|
sequence_labels = ids_tensor([self.batch_size], self.type_sequence_label_size)
|
||||||
|
token_labels = ids_tensor([self.batch_size, self.seq_length], self.num_labels)
|
||||||
|
choice_labels = ids_tensor([self.batch_size], self.num_choices)
|
||||||
|
|
||||||
|
config = CTRLConfig(
|
||||||
|
vocab_size_or_config_json_file=self.vocab_size,
|
||||||
|
n_embd=self.hidden_size,
|
||||||
|
n_layer=self.num_hidden_layers,
|
||||||
|
n_head=self.num_attention_heads,
|
||||||
|
# intermediate_size=self.intermediate_size,
|
||||||
|
# hidden_act=self.hidden_act,
|
||||||
|
# hidden_dropout_prob=self.hidden_dropout_prob,
|
||||||
|
# attention_probs_dropout_prob=self.attention_probs_dropout_prob,
|
||||||
|
n_positions=self.max_position_embeddings,
|
||||||
|
n_ctx=self.max_position_embeddings
|
||||||
|
# type_vocab_size=self.type_vocab_size,
|
||||||
|
# initializer_range=self.initializer_range
|
||||||
|
)
|
||||||
|
|
||||||
|
head_mask = ids_tensor([self.num_hidden_layers, self.num_attention_heads], 2)
|
||||||
|
|
||||||
|
return config, input_ids, input_mask, head_mask, token_type_ids, mc_token_ids, sequence_labels, token_labels, choice_labels
|
||||||
|
|
||||||
|
def create_and_check_ctrl_model(self, config, input_ids, input_mask, head_mask, token_type_ids, *args):
|
||||||
|
model = TFCTRLModel(config=config)
|
||||||
|
inputs = {'input_ids': input_ids,
|
||||||
|
'attention_mask': input_mask,
|
||||||
|
'token_type_ids': token_type_ids}
|
||||||
|
sequence_output = model(inputs)[0]
|
||||||
|
|
||||||
|
inputs = [input_ids, None, input_mask] # None is the input for 'past'
|
||||||
|
sequence_output = model(inputs)[0]
|
||||||
|
|
||||||
|
sequence_output = model(input_ids)[0]
|
||||||
|
|
||||||
|
result = {
|
||||||
|
"sequence_output": sequence_output.numpy(),
|
||||||
|
}
|
||||||
|
self.parent.assertListEqual(
|
||||||
|
list(result["sequence_output"].shape),
|
||||||
|
[self.batch_size, self.seq_length, self.hidden_size])
|
||||||
|
|
||||||
|
|
||||||
|
def create_and_check_ctrl_lm_head(self, config, input_ids, input_mask, head_mask, token_type_ids, *args):
|
||||||
|
model = TFCTRLLMHeadModel(config=config)
|
||||||
|
inputs = {'input_ids': input_ids,
|
||||||
|
'attention_mask': input_mask,
|
||||||
|
'token_type_ids': token_type_ids}
|
||||||
|
prediction_scores = model(inputs)[0]
|
||||||
|
result = {
|
||||||
|
"prediction_scores": prediction_scores.numpy(),
|
||||||
|
}
|
||||||
|
self.parent.assertListEqual(
|
||||||
|
list(result["prediction_scores"].shape),
|
||||||
|
[self.batch_size, self.seq_length, self.vocab_size])
|
||||||
|
|
||||||
|
def prepare_config_and_inputs_for_common(self):
|
||||||
|
config_and_inputs = self.prepare_config_and_inputs()
|
||||||
|
|
||||||
|
(config, input_ids, input_mask, head_mask, token_type_ids,
|
||||||
|
mc_token_ids, sequence_labels, token_labels, choice_labels) = config_and_inputs
|
||||||
|
|
||||||
|
inputs_dict = {'input_ids': input_ids, 'token_type_ids': token_type_ids, 'attention_mask': input_mask}
|
||||||
|
return config, inputs_dict
|
||||||
|
|
||||||
|
def setUp(self):
|
||||||
|
self.model_tester = TFCTRLModelTest.TFCTRLModelTester(self)
|
||||||
|
self.config_tester = ConfigTester(self, config_class=CTRLConfig, n_embd=37)
|
||||||
|
|
||||||
|
def test_config(self):
|
||||||
|
self.config_tester.run_common_tests()
|
||||||
|
|
||||||
|
def test_ctrl_model(self):
|
||||||
|
config_and_inputs = self.model_tester.prepare_config_and_inputs()
|
||||||
|
self.model_tester.create_and_check_ctrl_model(*config_and_inputs)
|
||||||
|
|
||||||
|
def test_ctrl_lm_head(self):
|
||||||
|
config_and_inputs = self.model_tester.prepare_config_and_inputs()
|
||||||
|
self.model_tester.create_and_check_ctrl_lm_head(*config_and_inputs)
|
||||||
|
|
||||||
|
@pytest.mark.slow
|
||||||
|
def test_model_from_pretrained(self):
|
||||||
|
cache_dir = "/tmp/transformers_test/"
|
||||||
|
for model_name in list(TF_CTRL_PRETRAINED_MODEL_ARCHIVE_MAP.keys())[:1]:
|
||||||
|
model = TFCTRLModel.from_pretrained(model_name, cache_dir=cache_dir)
|
||||||
|
shutil.rmtree(cache_dir)
|
||||||
|
self.assertIsNotNone(model)
|
||||||
|
|
||||||
|
if __name__ == "__main__":
|
||||||
|
unittest.main()
|
||||||
|
|
||||||
@@ -222,7 +222,7 @@ class TFGPT2ModelTest(TFCommonTestCases.TFCommonModelTester):
|
|||||||
@pytest.mark.slow
|
@pytest.mark.slow
|
||||||
def test_model_from_pretrained(self):
|
def test_model_from_pretrained(self):
|
||||||
cache_dir = "/tmp/transformers_test/"
|
cache_dir = "/tmp/transformers_test/"
|
||||||
for model_name in list(TF_gpt2_PRETRAINED_MODEL_ARCHIVE_MAP.keys())[:1]:
|
for model_name in list(TF_GPT2_PRETRAINED_MODEL_ARCHIVE_MAP.keys())[:1]:
|
||||||
model = TFGPT2Model.from_pretrained(model_name, cache_dir=cache_dir)
|
model = TFGPT2Model.from_pretrained(model_name, cache_dir=cache_dir)
|
||||||
shutil.rmtree(cache_dir)
|
shutil.rmtree(cache_dir)
|
||||||
self.assertIsNotNone(model)
|
self.assertIsNotNone(model)
|
||||||
|
|||||||
@@ -30,6 +30,7 @@ if is_tf_available():
|
|||||||
import numpy
|
import numpy
|
||||||
from transformers.modeling_tf_roberta import (TFRobertaModel, TFRobertaForMaskedLM,
|
from transformers.modeling_tf_roberta import (TFRobertaModel, TFRobertaForMaskedLM,
|
||||||
TFRobertaForSequenceClassification,
|
TFRobertaForSequenceClassification,
|
||||||
|
TFRobertaForTokenClassification,
|
||||||
TF_ROBERTA_PRETRAINED_MODEL_ARCHIVE_MAP)
|
TF_ROBERTA_PRETRAINED_MODEL_ARCHIVE_MAP)
|
||||||
else:
|
else:
|
||||||
pytestmark = pytest.mark.skip("Require TensorFlow")
|
pytestmark = pytest.mark.skip("Require TensorFlow")
|
||||||
@@ -154,6 +155,20 @@ class TFRobertaModelTest(TFCommonTestCases.TFCommonModelTester):
|
|||||||
list(result["prediction_scores"].shape),
|
list(result["prediction_scores"].shape),
|
||||||
[self.batch_size, self.seq_length, self.vocab_size])
|
[self.batch_size, self.seq_length, self.vocab_size])
|
||||||
|
|
||||||
|
def create_and_check_roberta_for_token_classification(self, config, input_ids, token_type_ids, input_mask, sequence_labels, token_labels, choice_labels):
|
||||||
|
config.num_labels = self.num_labels
|
||||||
|
model = TFRobertaForTokenClassification(config=config)
|
||||||
|
inputs = {'input_ids': input_ids,
|
||||||
|
'attention_mask': input_mask,
|
||||||
|
'token_type_ids': token_type_ids}
|
||||||
|
logits, = model(inputs)
|
||||||
|
result = {
|
||||||
|
"logits": logits.numpy(),
|
||||||
|
}
|
||||||
|
self.parent.assertListEqual(
|
||||||
|
list(result["logits"].shape),
|
||||||
|
[self.batch_size, self.seq_length, self.num_labels])
|
||||||
|
|
||||||
def prepare_config_and_inputs_for_common(self):
|
def prepare_config_and_inputs_for_common(self):
|
||||||
config_and_inputs = self.prepare_config_and_inputs()
|
config_and_inputs = self.prepare_config_and_inputs()
|
||||||
(config, input_ids, token_type_ids, input_mask,
|
(config, input_ids, token_type_ids, input_mask,
|
||||||
|
|||||||
@@ -161,6 +161,11 @@ class TFXLNetModelTest(TFCommonTestCases.TFCommonModelTester):
|
|||||||
"outputs": outputs.numpy(),
|
"outputs": outputs.numpy(),
|
||||||
}
|
}
|
||||||
|
|
||||||
|
config.mem_len = 0
|
||||||
|
model = TFXLNetModel(config)
|
||||||
|
no_mems_outputs = model(inputs)
|
||||||
|
self.parent.assertEqual(len(no_mems_outputs), 1)
|
||||||
|
|
||||||
self.parent.assertListEqual(
|
self.parent.assertListEqual(
|
||||||
list(result["outputs"].shape),
|
list(result["outputs"].shape),
|
||||||
[self.batch_size, self.seq_length, self.hidden_size])
|
[self.batch_size, self.seq_length, self.hidden_size])
|
||||||
|
|||||||
@@ -150,6 +150,12 @@ class XLNetModelTest(CommonTestCases.CommonModelTester):
|
|||||||
"outputs": outputs,
|
"outputs": outputs,
|
||||||
}
|
}
|
||||||
|
|
||||||
|
config.mem_len = 0
|
||||||
|
model = XLNetModel(config)
|
||||||
|
model.eval()
|
||||||
|
no_mems_outputs = model(input_ids_1)
|
||||||
|
self.parent.assertEqual(len(no_mems_outputs), 1)
|
||||||
|
|
||||||
self.parent.assertListEqual(
|
self.parent.assertListEqual(
|
||||||
list(result["outputs"].size()),
|
list(result["outputs"].size()),
|
||||||
[self.batch_size, self.seq_length, self.hidden_size])
|
[self.batch_size, self.seq_length, self.hidden_size])
|
||||||
|
|||||||
@@ -131,8 +131,8 @@ class BertTokenizationTest(CommonTestCases.CommonTokenizerTester):
|
|||||||
text = tokenizer.encode("sequence builders")
|
text = tokenizer.encode("sequence builders")
|
||||||
text_2 = tokenizer.encode("multi-sequence build")
|
text_2 = tokenizer.encode("multi-sequence build")
|
||||||
|
|
||||||
encoded_sentence = tokenizer.add_special_tokens_single_sequence(text)
|
encoded_sentence = tokenizer.build_inputs_with_special_tokens(text)
|
||||||
encoded_pair = tokenizer.add_special_tokens_sequence_pair(text, text_2)
|
encoded_pair = tokenizer.build_inputs_with_special_tokens(text, text_2)
|
||||||
|
|
||||||
assert encoded_sentence == [101] + text + [102]
|
assert encoded_sentence == [101] + text + [102]
|
||||||
assert encoded_pair == [101] + text + [102] + text_2 + [102]
|
assert encoded_pair == [101] + text + [102] + text_2 + [102]
|
||||||
|
|||||||
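The renamed `build_inputs_with_special_tokens` call covers both the single-sequence and the pair case that the two removed methods handled separately. Below is a small sketch of the BERT-style layout the assertions above describe. The function `bert_style_inputs` is a hypothetical standalone stand-in, not the library method; the ids 101 and 102 are taken from the test.

```python
CLS_ID, SEP_ID = 101, 102  # ids asserted in the BERT tokenization test


def bert_style_inputs(token_ids_0, token_ids_1=None):
    """Wrap one or two id sequences as [CLS] A [SEP] or [CLS] A [SEP] B [SEP]."""
    single = [CLS_ID] + list(token_ids_0) + [SEP_ID]
    if token_ids_1 is None:
        return single
    return single + list(token_ids_1) + [SEP_ID]


text = [7, 8]      # placeholder ids for "sequence builders"
text_2 = [9]       # placeholder ids for "multi-sequence build"
encoded_sentence = bert_style_inputs(text)
encoded_pair = bert_style_inputs(text, text_2)
```

Using one function with an optional second argument mirrors the API change in this diff: callers no longer choose between a single-sequence and a pair method.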
69
transformers/tests/tokenization_ctrl_test.py
Normal file
@@ -0,0 +1,69 @@
|
|||||||
|
# coding=utf-8
|
||||||
|
# Copyright 2018 Salesforce and HuggingFace Inc. team.
|
||||||
|
# Licensed under the Apache License, Version 2.0 (the "License");
|
||||||
|
# you may not use this file except in compliance with the License.
|
||||||
|
# You may obtain a copy of the License at
|
||||||
|
#
|
||||||
|
# http://www.apache.org/licenses/LICENSE-2.0
|
||||||
|
#
|
||||||
|
# Unless required by applicable law or agreed to in writing, software
|
||||||
|
# distributed under the License is distributed on an "AS IS" BASIS,
|
||||||
|
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
|
||||||
|
# See the License for the specific language governing permissions and
|
||||||
|
# limitations under the License.
|
||||||
|
from __future__ import absolute_import, division, print_function, unicode_literals
|
||||||
|
|
||||||
|
import os
|
||||||
|
import unittest
|
||||||
|
import json
|
||||||
|
from io import open
|
||||||
|
|
||||||
|
from transformers.tokenization_ctrl import CTRLTokenizer, VOCAB_FILES_NAMES
|
||||||
|
|
||||||
|
from .tokenization_tests_commons import CommonTestCases
|
||||||
|
|
||||||
|
class CTRLTokenizationTest(CommonTestCases.CommonTokenizerTester):
|
||||||
|
|
||||||
|
tokenizer_class = CTRLTokenizer
|
||||||
|
|
||||||
|
def setUp(self):
|
||||||
|
super(CTRLTokenizationTest, self).setUp()
|
||||||
|
|
||||||
|
# Adapted from Sennrich et al. 2015 and https://github.com/rsennrich/subword-nmt
|
||||||
|
vocab = ['adapt', 're@@', 'a@@', 'apt', 'c@@', 't', '<unk>']
|
||||||
|
vocab_tokens = dict(zip(vocab, range(len(vocab))))
|
||||||
|
merges = ["#version: 0.2", 'a p', 'ap t</w>', 'r e', 'a d', 'ad apt</w>', '']
|
||||||
|
self.special_tokens_map = {"unk_token": "<unk>"}
|
||||||
|
|
||||||
|
self.vocab_file = os.path.join(self.tmpdirname, VOCAB_FILES_NAMES['vocab_file'])
|
||||||
|
self.merges_file = os.path.join(self.tmpdirname, VOCAB_FILES_NAMES['merges_file'])
|
||||||
|
with open(self.vocab_file, "w", encoding="utf-8") as fp:
|
||||||
|
fp.write(json.dumps(vocab_tokens) + "\n")
|
||||||
|
with open(self.merges_file, "w", encoding="utf-8") as fp:
|
||||||
|
fp.write("\n".join(merges))
|
||||||
|
|
||||||
|
def get_tokenizer(self, **kwargs):
|
||||||
|
kwargs.update(self.special_tokens_map)
|
||||||
|
return CTRLTokenizer.from_pretrained(self.tmpdirname, **kwargs)
|
||||||
|
|
||||||
|
def get_input_output_texts(self):
|
||||||
|
input_text = u"adapt react readapt apt"
|
||||||
|
output_text = u"adapt react readapt apt"
|
||||||
|
return input_text, output_text
|
||||||
|
|
||||||
|
def test_full_tokenizer(self):
|
||||||
|
tokenizer = CTRLTokenizer(self.vocab_file, self.merges_file, **self.special_tokens_map)
|
||||||
|
text = "adapt react readapt apt"
|
||||||
|
bpe_tokens = 'adapt re@@ a@@ c@@ t re@@ adapt apt'.split()
|
||||||
|
tokens = tokenizer.tokenize(text)
|
||||||
|
self.assertListEqual(tokens, bpe_tokens)
|
||||||
|
|
||||||
|
input_tokens = tokens + [tokenizer.unk_token]
|
||||||
|
|
||||||
|
input_bpe_tokens = [0, 1, 2, 4, 5, 1, 0, 3, 6]
|
||||||
|
self.assertListEqual(
|
||||||
|
tokenizer.convert_tokens_to_ids(input_tokens), input_bpe_tokens)
|
||||||
|
|
||||||
|
|
||||||
|
if __name__ == '__main__':
|
||||||
|
unittest.main()
|
||||||
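The CTRL tokenizer test above relies on the subword-nmt convention in which a trailing `@@` marks a piece that continues into the next token. As a rough illustration using only the toy vocabulary from the test, the sketch below maps tokens to ids and joins pieces back into words. The helper names `tokens_to_ids` and `detokenize` are hypothetical and do not reproduce the actual `CTRLTokenizer` implementation.

```python
# Toy vocabulary copied from the test's setUp; ids are list positions.
vocab = ['adapt', 're@@', 'a@@', 'apt', 'c@@', 't', '<unk>']
token_to_id = {tok: i for i, tok in enumerate(vocab)}


def tokens_to_ids(tokens, unk="<unk>"):
    """Look up each token, falling back to the unknown-token id."""
    return [token_to_id.get(tok, token_to_id[unk]) for tok in tokens]


def detokenize(tokens):
    """Merge '@@'-marked pieces with the token that follows them."""
    return " ".join(tokens).replace("@@ ", "")


bpe_tokens = "adapt re@@ a@@ c@@ t re@@ adapt apt".split()
ids = tokens_to_ids(bpe_tokens + ["<unk>"])
text = detokenize(bpe_tokens)
```

With this vocabulary, `ids` matches the `input_bpe_tokens` list asserted in `test_full_tokenizer`, and `text` is the original input string.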
@@ -36,8 +36,8 @@ class DistilBertTokenizationTest(BertTokenizationTest):
|
|||||||
text = tokenizer.encode("sequence builders")
|
text = tokenizer.encode("sequence builders")
|
||||||
text_2 = tokenizer.encode("multi-sequence build")
|
text_2 = tokenizer.encode("multi-sequence build")
|
||||||
|
|
||||||
encoded_sentence = tokenizer.add_special_tokens_single_sequence(text)
|
encoded_sentence = tokenizer.build_inputs_with_special_tokens(text)
|
||||||
encoded_pair = tokenizer.add_special_tokens_sequence_pair(text, text_2)
|
encoded_pair = tokenizer.build_inputs_with_special_tokens(text, text_2)
|
||||||
|
|
||||||
assert encoded_sentence == [tokenizer.cls_token_id] + text + [tokenizer.sep_token_id]
|
assert encoded_sentence == [tokenizer.cls_token_id] + text + [tokenizer.sep_token_id]
|
||||||
assert encoded_pair == [tokenizer.cls_token_id] + text + [tokenizer.sep_token_id] + \
|
assert encoded_pair == [tokenizer.cls_token_id] + text + [tokenizer.sep_token_id] + \
|
||||||
|
|||||||
@@ -87,8 +87,8 @@ class RobertaTokenizationTest(CommonTestCases.CommonTokenizerTester):
|
|||||||
encoded_text_from_decode = tokenizer.encode("sequence builders", add_special_tokens=True)
|
encoded_text_from_decode = tokenizer.encode("sequence builders", add_special_tokens=True)
|
||||||
encoded_pair_from_decode = tokenizer.encode("sequence builders", "multi-sequence build", add_special_tokens=True)
|
encoded_pair_from_decode = tokenizer.encode("sequence builders", "multi-sequence build", add_special_tokens=True)
|
||||||
|
|
||||||
encoded_sentence = tokenizer.add_special_tokens_single_sequence(text)
|
encoded_sentence = tokenizer.build_inputs_with_special_tokens(text)
|
||||||
encoded_pair = tokenizer.add_special_tokens_sequence_pair(text, text_2)
|
encoded_pair = tokenizer.build_inputs_with_special_tokens(text, text_2)
|
||||||
|
|
||||||
assert encoded_sentence == encoded_text_from_decode
|
assert encoded_sentence == encoded_text_from_decode
|
||||||
assert encoded_pair == encoded_pair_from_decode
|
assert encoded_pair == encoded_pair_from_decode
|
||||||
|
|||||||
@@ -193,12 +193,12 @@ class CommonTestCases:
|
|||||||
|
|
||||||
tokenizer = self.get_tokenizer()
|
tokenizer = self.get_tokenizer()
|
||||||
|
|
||||||
if tokenizer.add_special_tokens_sequence_pair.__qualname__.split('.')[0] != "PreTrainedTokenizer":
|
if tokenizer.build_inputs_with_special_tokens.__qualname__.split('.')[0] != "PreTrainedTokenizer":
|
||||||
seq_0 = "Test this method."
|
seq_0 = "Test this method."
|
||||||
seq_1 = "With these inputs."
|
seq_1 = "With these inputs."
|
||||||
information = tokenizer.encode_plus(seq_0, seq_1, add_special_tokens=True)
|
information = tokenizer.encode_plus(seq_0, seq_1, add_special_tokens=True)
|
||||||
sequences, mask = information["input_ids"], information["token_type_ids"]
|
sequences, mask = information["input_ids"], information["token_type_ids"]
|
||||||
assert len(sequences) == len(mask)
|
self.assertEqual(len(sequences), len(mask))
|
||||||
|
|
||||||
def test_number_of_added_tokens(self):
|
def test_number_of_added_tokens(self):
|
||||||
tokenizer = self.get_tokenizer()
|
tokenizer = self.get_tokenizer()
|
||||||
@@ -211,7 +211,7 @@ class CommonTestCases:
|
|||||||
|
|
||||||
# Method is implemented (e.g. not GPT-2)
|
# Method is implemented (e.g. not GPT-2)
|
||||||
if len(attached_sequences) != 2:
|
if len(attached_sequences) != 2:
|
||||||
assert tokenizer.num_added_tokens(pair=True) == len(attached_sequences) - len(sequences)
|
self.assertEqual(tokenizer.num_added_tokens(pair=True), len(attached_sequences) - len(sequences))
|
||||||
|
|
||||||
def test_maximum_encoding_length_single_input(self):
|
def test_maximum_encoding_length_single_input(self):
|
||||||
tokenizer = self.get_tokenizer()
|
tokenizer = self.get_tokenizer()
|
||||||
@@ -227,10 +227,10 @@ class CommonTestCases:
|
|||||||
truncated_sequence = information["input_ids"]
|
truncated_sequence = information["input_ids"]
|
||||||
overflowing_tokens = information["overflowing_tokens"]
|
overflowing_tokens = information["overflowing_tokens"]
|
||||||
|
|
||||||
assert len(overflowing_tokens) == 2 + stride
|
self.assertEqual(len(overflowing_tokens), 2 + stride)
|
||||||
assert overflowing_tokens == sequence[-(2 + stride):]
|
self.assertEqual(overflowing_tokens, sequence[-(2 + stride):])
|
||||||
assert len(truncated_sequence) == total_length - 2
|
self.assertEqual(len(truncated_sequence), total_length - 2)
|
||||||
assert truncated_sequence == tokenizer.add_special_tokens_single_sequence(sequence[:-2])
|
self.assertEqual(truncated_sequence, tokenizer.build_inputs_with_special_tokens(sequence[:-2]))
|
||||||
|
|
||||||
def test_maximum_encoding_length_pair_input(self):
|
def test_maximum_encoding_length_pair_input(self):
|
||||||
tokenizer = self.get_tokenizer()
|
tokenizer = self.get_tokenizer()
|
||||||
@@ -243,26 +243,26 @@ class CommonTestCases:
|
|||||||
sequence_1_no_special_tokens = tokenizer.encode(seq_1)
|
sequence_1_no_special_tokens = tokenizer.encode(seq_1)
|
||||||
|
|
||||||
sequence = tokenizer.encode(seq_0, seq_1, add_special_tokens=True)
|
sequence = tokenizer.encode(seq_0, seq_1, add_special_tokens=True)
|
||||||
truncated_second_sequence = tokenizer.add_special_tokens_sequence_pair(
|
truncated_second_sequence = tokenizer.build_inputs_with_special_tokens(
|
||||||
tokenizer.encode(seq_0),
|
tokenizer.encode(seq_0),
|
||||||
tokenizer.encode(seq_1)[:-2]
|
tokenizer.encode(seq_1)[:-2]
|
||||||
)
|
)
|
||||||
|
|
||||||
information = tokenizer.encode_plus(seq_0, seq_1, max_length=len(sequence) - 2, add_special_tokens=True,
|
information = tokenizer.encode_plus(seq_0, seq_1, max_length=len(sequence) - 2, add_special_tokens=True,
|
||||||
stride=stride, truncate_first_sequence=False)
|
stride=stride, truncation_strategy='only_second')
|
||||||
information_first_truncated = tokenizer.encode_plus(seq_0, seq_1, max_length=len(sequence) - 2,
|
information_first_truncated = tokenizer.encode_plus(seq_0, seq_1, max_length=len(sequence) - 2,
|
||||||
add_special_tokens=True, stride=stride,
|
add_special_tokens=True, stride=stride,
|
||||||
truncate_first_sequence=True)
|
truncation_strategy='only_first')
|
||||||
|
|
||||||
truncated_sequence = information["input_ids"]
|
truncated_sequence = information["input_ids"]
|
||||||
overflowing_tokens = information["overflowing_tokens"]
|
overflowing_tokens = information["overflowing_tokens"]
|
||||||
overflowing_tokens_first_truncated = information_first_truncated["overflowing_tokens"]
|
overflowing_tokens_first_truncated = information_first_truncated["overflowing_tokens"]
|
||||||
|
|
||||||
assert len(overflowing_tokens) == 2 + stride
|
self.assertEqual(len(overflowing_tokens), 2 + stride)
|
||||||
assert overflowing_tokens == sequence_1_no_special_tokens[-(2 + stride):]
|
self.assertEqual(overflowing_tokens, sequence_1_no_special_tokens[-(2 + stride):])
|
||||||
assert overflowing_tokens_first_truncated == sequence_0_no_special_tokens[-(2 + stride):]
|
self.assertEqual(overflowing_tokens_first_truncated, sequence_0_no_special_tokens[-(2 + stride):])
|
||||||
assert len(truncated_sequence) == len(sequence) - 2
|
self.assertEqual(len(truncated_sequence), len(sequence) - 2)
|
||||||
assert truncated_sequence == truncated_second_sequence
|
self.assertEqual(truncated_sequence, truncated_second_sequence)
|
||||||
|
|
||||||
def test_encode_input_type(self):
|
def test_encode_input_type(self):
|
||||||
tokenizer = self.get_tokenizer()
|
tokenizer = self.get_tokenizer()
|
||||||
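The pair-input test above replaces the boolean `truncate_first_sequence` with `truncation_strategy='only_first'` or `'only_second'`. The sketch below shows the behaviour its assertions imply: the chosen sequence loses its last tokens, and the overflow also carries `stride` tokens of context. The function `truncate_pair` is a hypothetical helper written for illustration, not the library's truncation code.

```python
def truncate_pair(ids_0, ids_1, num_tokens_to_remove, strategy="only_second", stride=0):
    """Drop tokens from the end of one sequence; return (ids_0, ids_1, overflowing)."""
    if strategy not in ("only_first", "only_second"):
        raise ValueError("unsupported strategy: %r" % (strategy,))
    if num_tokens_to_remove <= 0:
        return list(ids_0), list(ids_1), []

    target = list(ids_0) if strategy == "only_first" else list(ids_1)
    # Overflow holds the removed tokens plus `stride` tokens of preceding context.
    window = min(len(target), num_tokens_to_remove + stride)
    overflowing = target[-window:]
    kept = target[:-num_tokens_to_remove]

    if strategy == "only_first":
        return kept, list(ids_1), overflowing
    return list(ids_0), kept, overflowing


seq_0, seq_1 = [1, 2, 3, 4, 5], [6, 7, 8, 9]
second = truncate_pair(seq_0, seq_1, 2, strategy="only_second", stride=1)
first = truncate_pair(seq_0, seq_1, 2, strategy="only_first", stride=1)
```

A string-valued strategy leaves room for further modes later, which a boolean flag could not express.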
@@ -273,5 +273,43 @@ class CommonTestCases:
|
|||||||
input_ids = tokenizer.convert_tokens_to_ids(tokens)
|
input_ids = tokenizer.convert_tokens_to_ids(tokens)
|
||||||
formatted_input = tokenizer.encode(sequence, add_special_tokens=True)
|
formatted_input = tokenizer.encode(sequence, add_special_tokens=True)
|
||||||
|
|
||||||
assert tokenizer.encode(tokens, add_special_tokens=True) == formatted_input
|
self.assertEqual(tokenizer.encode(tokens, add_special_tokens=True), formatted_input)
|
||||||
assert tokenizer.encode(input_ids, add_special_tokens=True) == formatted_input
|
self.assertEqual(tokenizer.encode(input_ids, add_special_tokens=True), formatted_input)
|
||||||
|
|
||||||
|
def test_special_tokens_mask(self):
|
||||||
|
tokenizer = self.get_tokenizer()
|
||||||
|
|
||||||
|
sequence_0 = "Encode this."
|
||||||
|
sequence_1 = "This one too please."
|
||||||
|
|
||||||
|
# Testing single inputs
|
||||||
|
encoded_sequence = tokenizer.encode(sequence_0)
|
||||||
|
encoded_sequence_dict = tokenizer.encode_plus(sequence_0, add_special_tokens=True)
|
||||||
|
encoded_sequence_w_special = encoded_sequence_dict["input_ids"]
|
||||||
|
special_tokens_mask = encoded_sequence_dict["special_tokens_mask"]
|
||||||
|
self.assertEqual(len(special_tokens_mask), len(encoded_sequence_w_special))
|
||||||
|
|
||||||
|
filtered_sequence = [(x if not special_tokens_mask[i] else None) for i, x in enumerate(encoded_sequence_w_special)]
|
||||||
|
filtered_sequence = [x for x in filtered_sequence if x is not None]
|
||||||
|
self.assertEqual(encoded_sequence, filtered_sequence)
|
||||||
|
|
||||||
|
# Testing inputs pairs
|
||||||
|
encoded_sequence = tokenizer.encode(sequence_0) + tokenizer.encode(sequence_1)
|
||||||
|
encoded_sequence_dict = tokenizer.encode_plus(sequence_0, sequence_1, add_special_tokens=True)
|
||||||
|
encoded_sequence_w_special = encoded_sequence_dict["input_ids"]
|
||||||
|
special_tokens_mask = encoded_sequence_dict["special_tokens_mask"]
|
||||||
|
self.assertEqual(len(special_tokens_mask), len(encoded_sequence_w_special))
|
||||||
|
|
||||||
|
filtered_sequence = [(x if not special_tokens_mask[i] else None) for i, x in enumerate(encoded_sequence_w_special)]
|
||||||
|
filtered_sequence = [x for x in filtered_sequence if x is not None]
|
||||||
|
self.assertEqual(encoded_sequence, filtered_sequence)
|
||||||
|
|
||||||
|
# Testing with already existing special tokens
|
||||||
|
if tokenizer.cls_token_id == tokenizer.unk_token_id and tokenizer.sep_token_id == tokenizer.unk_token_id:
|
||||||
|
tokenizer.add_special_tokens({'cls_token': '</s>', 'sep_token': '<s>'})
|
||||||
|
encoded_sequence_dict = tokenizer.encode_plus(sequence_0, add_special_tokens=True)
|
||||||
|
encoded_sequence_w_special = encoded_sequence_dict["input_ids"]
|
||||||
|
special_tokens_mask_orig = encoded_sequence_dict["special_tokens_mask"]
|
||||||
|
special_tokens_mask = tokenizer.get_special_tokens_mask(encoded_sequence_w_special, already_has_special_tokens=True)
|
||||||
|
self.assertEqual(len(special_tokens_mask), len(encoded_sequence_w_special))
|
||||||
|
self.assertEqual(special_tokens_mask_orig, special_tokens_mask)
|
||||||
|
|||||||
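The new `test_special_tokens_mask` checks that dropping every position flagged in `special_tokens_mask` recovers the plain encoding. A compact sketch of that filtering step follows. The helper `strip_special_tokens` and the sample ids are assumptions for illustration; only the mask semantics (1 for a special token, 0 otherwise) come from the test.

```python
def strip_special_tokens(input_ids, special_tokens_mask):
    """Keep only the ids whose mask entry is 0 (i.e. not a special token)."""
    if len(input_ids) != len(special_tokens_mask):
        raise ValueError("mask and ids must have the same length")
    return [tok for tok, is_special in zip(input_ids, special_tokens_mask)
            if not is_special]


# Illustrative BERT-like pair: [CLS] 7 8 [SEP] 9 [SEP]
ids_with_special = [101, 7, 8, 102, 9, 102]
mask = [1, 0, 0, 1, 0, 1]
plain_ids = strip_special_tokens(ids_with_special, mask)
```

This does in one comprehension what the test does in two steps (replace flagged ids with `None`, then drop the `None` entries).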
@@ -72,8 +72,8 @@ class XLMTokenizationTest(CommonTestCases.CommonTokenizerTester):
|
|||||||
text = tokenizer.encode("sequence builders")
|
text = tokenizer.encode("sequence builders")
|
||||||
text_2 = tokenizer.encode("multi-sequence build")
|
text_2 = tokenizer.encode("multi-sequence build")
|
||||||
|
|
||||||
encoded_sentence = tokenizer.add_special_tokens_single_sequence(text)
|
encoded_sentence = tokenizer.build_inputs_with_special_tokens(text)
|
||||||
encoded_pair = tokenizer.add_special_tokens_sequence_pair(text, text_2)
|
encoded_pair = tokenizer.build_inputs_with_special_tokens(text, text_2)
|
||||||
|
|
||||||
assert encoded_sentence == [1] + text + [1]
|
assert encoded_sentence == [1] + text + [1]
|
||||||
assert encoded_pair == [1] + text + [1] + text_2 + [1]
|
assert encoded_pair == [1] + text + [1] + text_2 + [1]
|
||||||
|
|||||||
@@ -95,8 +95,8 @@ class XLNetTokenizationTest(CommonTestCases.CommonTokenizerTester):
|
|||||||
text = tokenizer.encode("sequence builders")
|
text = tokenizer.encode("sequence builders")
|
||||||
text_2 = tokenizer.encode("multi-sequence build")
|
text_2 = tokenizer.encode("multi-sequence build")
|
||||||
|
|
||||||
encoded_sentence = tokenizer.add_special_tokens_single_sequence(text)
|
encoded_sentence = tokenizer.build_inputs_with_special_tokens(text)
|
||||||
encoded_pair = tokenizer.add_special_tokens_sequence_pair(text, text_2)
|
encoded_pair = tokenizer.build_inputs_with_special_tokens(text, text_2)
|
||||||
|
|
||||||
assert encoded_sentence == text + [4, 3]
|
assert encoded_sentence == text + [4, 3]
|
||||||
assert encoded_pair == text + [4] + text_2 + [4, 3]
|
assert encoded_pair == text + [4] + text_2 + [4, 3]
|
||||||
|
|||||||
@@ -21,6 +21,7 @@ import logging
|
|||||||
from .tokenization_bert import BertTokenizer
|
from .tokenization_bert import BertTokenizer
|
||||||
from .tokenization_openai import OpenAIGPTTokenizer
|
from .tokenization_openai import OpenAIGPTTokenizer
|
||||||
from .tokenization_gpt2 import GPT2Tokenizer
|
from .tokenization_gpt2 import GPT2Tokenizer
|
||||||
|
from .tokenization_ctrl import CTRLTokenizer
|
||||||
from .tokenization_transfo_xl import TransfoXLTokenizer
|
from .tokenization_transfo_xl import TransfoXLTokenizer
|
||||||
from .tokenization_xlnet import XLNetTokenizer
|
from .tokenization_xlnet import XLNetTokenizer
|
||||||
from .tokenization_xlm import XLMTokenizer
|
from .tokenization_xlm import XLMTokenizer
|
||||||
@@ -45,6 +46,7 @@ class AutoTokenizer(object):
|
|||||||
- contains `bert`: BertTokenizer (Bert model)
|
- contains `bert`: BertTokenizer (Bert model)
|
||||||
- contains `openai-gpt`: OpenAIGPTTokenizer (OpenAI GPT model)
|
- contains `openai-gpt`: OpenAIGPTTokenizer (OpenAI GPT model)
|
||||||
- contains `gpt2`: GPT2Tokenizer (OpenAI GPT-2 model)
|
- contains `gpt2`: GPT2Tokenizer (OpenAI GPT-2 model)
|
||||||
|
- contains `ctrl`: CTRLTokenizer (Salesforce CTRL model)
|
||||||
- contains `transfo-xl`: TransfoXLTokenizer (Transformer-XL model)
|
- contains `transfo-xl`: TransfoXLTokenizer (Transformer-XL model)
|
||||||
- contains `xlnet`: XLNetTokenizer (XLNet model)
|
- contains `xlnet`: XLNetTokenizer (XLNet model)
|
||||||
- contains `xlm`: XLMTokenizer (XLM model)
|
- contains `xlm`: XLMTokenizer (XLM model)
|
||||||
@@ -67,6 +69,7 @@ class AutoTokenizer(object):
|
|||||||
- contains `bert`: BertTokenizer (Bert model)
|
- contains `bert`: BertTokenizer (Bert model)
|
||||||
- contains `openai-gpt`: OpenAIGPTTokenizer (OpenAI GPT model)
|
- contains `openai-gpt`: OpenAIGPTTokenizer (OpenAI GPT model)
|
||||||
- contains `gpt2`: GPT2Tokenizer (OpenAI GPT-2 model)
|
- contains `gpt2`: GPT2Tokenizer (OpenAI GPT-2 model)
|
||||||
|
- contains `ctrl`: CTRLTokenizer (Salesforce CTRL model)
|
||||||
- contains `transfo-xl`: TransfoXLTokenizer (Transformer-XL model)
|
- contains `transfo-xl`: TransfoXLTokenizer (Transformer-XL model)
|
||||||
- contains `xlnet`: XLNetTokenizer (XLNet model)
|
- contains `xlnet`: XLNetTokenizer (XLNet model)
|
||||||
- contains `xlm`: XLMTokenizer (XLM model)
|
- contains `xlm`: XLMTokenizer (XLM model)
|
||||||
@@ -114,7 +117,8 @@ class AutoTokenizer(object):
|
|||||||
return XLNetTokenizer.from_pretrained(pretrained_model_name_or_path, *inputs, **kwargs)
|
return XLNetTokenizer.from_pretrained(pretrained_model_name_or_path, *inputs, **kwargs)
|
||||||
elif 'xlm' in pretrained_model_name_or_path:
|
elif 'xlm' in pretrained_model_name_or_path:
|
||||||
return XLMTokenizer.from_pretrained(pretrained_model_name_or_path, *inputs, **kwargs)
|
return XLMTokenizer.from_pretrained(pretrained_model_name_or_path, *inputs, **kwargs)
|
||||||
|
elif 'ctrl' in pretrained_model_name_or_path:
|
||||||
|
return CTRLTokenizer.from_pretrained(pretrained_model_name_or_path, *inputs, **kwargs)
|
||||||
raise ValueError("Unrecognized model identifier in {}. Should contain one of "
|
raise ValueError("Unrecognized model identifier in {}. Should contain one of "
|
||||||
"'bert', 'openai-gpt', 'gpt2', 'transfo-xl', 'xlnet', "
|
"'bert', 'openai-gpt', 'gpt2', 'transfo-xl', 'xlnet', "
|
||||||
"'xlm', 'roberta'".format(pretrained_model_name_or_path))
|
"'xlm', 'roberta', 'ctrl'".format(pretrained_model_name_or_path))
|
||||||
|
|||||||
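The AutoTokenizer hunk above selects a tokenizer class by testing whether a substring occurs in the model identifier. The following is a minimal sketch of that idea, not the library's code: `DISPATCH_ORDER` and `resolve_tokenizer_class_name` are hypothetical names, and the ordering of the entries is an assumption. It shows why order matters for substring matching: "roberta" contains "bert", so the more specific key has to be tested first.

```python
# Hypothetical dispatch table: (substring, tokenizer class name).
# The ordering is an assumption made for this sketch; more specific
# substrings come first because 'roberta' also contains 'bert'.
DISPATCH_ORDER = [
    ("roberta", "RobertaTokenizer"),
    ("bert", "BertTokenizer"),
    ("openai-gpt", "OpenAIGPTTokenizer"),
    ("gpt2", "GPT2Tokenizer"),
    ("transfo-xl", "TransfoXLTokenizer"),
    ("xlnet", "XLNetTokenizer"),
    ("xlm", "XLMTokenizer"),
    ("ctrl", "CTRLTokenizer"),
]


def resolve_tokenizer_class_name(pretrained_model_name_or_path):
    """Return the name of the first class whose key occurs in the identifier."""
    for key, class_name in DISPATCH_ORDER:
        if key in pretrained_model_name_or_path:
            return class_name
    known = ", ".join("'{}'".format(key) for key, _ in DISPATCH_ORDER)
    raise ValueError(
        "Unrecognized model identifier in {}. Should contain one of {}".format(
            pretrained_model_name_or_path, known))


ctrl_choice = resolve_tokenizer_class_name("ctrl")
roberta_choice = resolve_tokenizer_class_name("distilroberta-base")
bert_choice = resolve_tokenizer_class_name("bert-base-german-dbmdz-cased")
```

With the entries reversed so that "bert" came before "roberta", the identifier "distilroberta-base" would resolve to the BERT class, which is the failure mode the ordering guards against.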
@@ -44,6 +44,8 @@ PRETRAINED_VOCAB_FILES_MAP = {
|
|||||||
'bert-large-uncased-whole-word-masking-finetuned-squad': "https://s3.amazonaws.com/models.huggingface.co/bert/bert-large-uncased-whole-word-masking-finetuned-squad-vocab.txt",
|
'bert-large-uncased-whole-word-masking-finetuned-squad': "https://s3.amazonaws.com/models.huggingface.co/bert/bert-large-uncased-whole-word-masking-finetuned-squad-vocab.txt",
|
||||||
'bert-large-cased-whole-word-masking-finetuned-squad': "https://s3.amazonaws.com/models.huggingface.co/bert/bert-large-cased-whole-word-masking-finetuned-squad-vocab.txt",
|
'bert-large-cased-whole-word-masking-finetuned-squad': "https://s3.amazonaws.com/models.huggingface.co/bert/bert-large-cased-whole-word-masking-finetuned-squad-vocab.txt",
|
||||||
'bert-base-cased-finetuned-mrpc': "https://s3.amazonaws.com/models.huggingface.co/bert/bert-base-cased-finetuned-mrpc-vocab.txt",
|
'bert-base-cased-finetuned-mrpc': "https://s3.amazonaws.com/models.huggingface.co/bert/bert-base-cased-finetuned-mrpc-vocab.txt",
|
||||||
|
'bert-base-german-dbmdz-cased': "https://s3.amazonaws.com/models.huggingface.co/bert/bert-base-german-dbmdz-cased-vocab.txt",
|
||||||
|
'bert-base-german-dbmdz-uncased': "https://s3.amazonaws.com/models.huggingface.co/bert/bert-base-german-dbmdz-uncased-vocab.txt",
|
||||||
}
|
}
|
||||||
}
|
}
|
||||||
|
|
||||||
@@ -61,6 +63,8 @@ PRETRAINED_POSITIONAL_EMBEDDINGS_SIZES = {
|
|||||||
'bert-large-uncased-whole-word-masking-finetuned-squad': 512,
|
'bert-large-uncased-whole-word-masking-finetuned-squad': 512,
|
||||||
'bert-large-cased-whole-word-masking-finetuned-squad': 512,
|
'bert-large-cased-whole-word-masking-finetuned-squad': 512,
|
||||||
'bert-base-cased-finetuned-mrpc': 512,
|
'bert-base-cased-finetuned-mrpc': 512,
|
||||||
|
'bert-base-german-dbmdz-cased': 512,
|
||||||
|
'bert-base-german-dbmdz-uncased': 512,
|
||||||
}
|
}
|
||||||
|
|
||||||
PRETRAINED_INIT_CONFIGURATION = {
|
PRETRAINED_INIT_CONFIGURATION = {
|
||||||
@@ -77,6 +81,8 @@ PRETRAINED_INIT_CONFIGURATION = {
|
|||||||
'bert-large-uncased-whole-word-masking-finetuned-squad': {'do_lower_case': True},
|
'bert-large-uncased-whole-word-masking-finetuned-squad': {'do_lower_case': True},
|
||||||
'bert-large-cased-whole-word-masking-finetuned-squad': {'do_lower_case': False},
|
'bert-large-cased-whole-word-masking-finetuned-squad': {'do_lower_case': False},
|
||||||
'bert-base-cased-finetuned-mrpc': {'do_lower_case': False},
|
'bert-base-cased-finetuned-mrpc': {'do_lower_case': False},
|
||||||
|
'bert-base-german-dbmdz-cased': {'do_lower_case': False},
|
||||||
|
'bert-base-german-dbmdz-uncased': {'do_lower_case': True},
|
||||||
}
|
}
|
||||||
|
|
||||||
|
|
||||||
@@ -187,33 +193,59 @@ class BertTokenizer(PreTrainedTokenizer):
|
|||||||
out_string = ' '.join(tokens).replace(' ##', '').strip()
|
out_string = ' '.join(tokens).replace(' ##', '').strip()
|
||||||
return out_string
|
return out_string
|
||||||
|
|
||||||
def add_special_tokens_single_sequence(self, token_ids):
|
def build_inputs_with_special_tokens(self, token_ids_0, token_ids_1=None):
|
||||||
"""
|
"""
|
||||||
Adds special tokens to the a sequence for sequence classification tasks.
|
Build model inputs from a sequence or a pair of sequences for sequence classification tasks
|
||||||
A BERT sequence has the following format: [CLS] X [SEP]
|
by concatenating and adding special tokens.
|
||||||
|
A BERT sequence has the following format:
|
||||||
|
single sequence: [CLS] X [SEP]
|
||||||
|
pair of sequences: [CLS] A [SEP] B [SEP]
|
||||||
"""
|
"""
|
||||||
return [self.cls_token_id] + token_ids + [self.sep_token_id]
|
if token_ids_1 is None:
|
||||||
|
return [self.cls_token_id] + token_ids_0 + [self.sep_token_id]
|
||||||
def add_special_tokens_sequence_pair(self, token_ids_0, token_ids_1):
|
|
||||||
"""
|
|
||||||
Adds special tokens to a sequence pair for sequence classification tasks.
|
|
||||||
A BERT sequence pair has the following format: [CLS] A [SEP] B [SEP]
|
|
||||||
"""
|
|
||||||
sep = [self.sep_token_id]
|
|
||||||
cls = [self.cls_token_id]
|
cls = [self.cls_token_id]
|
||||||
|
sep = [self.sep_token_id]
|
||||||
return cls + token_ids_0 + sep + token_ids_1 + sep
|
return cls + token_ids_0 + sep + token_ids_1 + sep
|
||||||
|
|
||||||
def create_token_type_ids_from_sequences(self, token_ids_0, token_ids_1):
|
def get_special_tokens_mask(self, token_ids_0, token_ids_1=None, already_has_special_tokens=False):
|
||||||
|
"""
|
||||||
|
Retrieves sequence ids from a token list that has no special tokens added. This method is called when adding
|
||||||
|
special tokens using the tokenizer ``prepare_for_model`` or ``encode_plus`` methods.
|
||||||
|
|
||||||
|
Args:
|
||||||
|
token_ids_0: list of ids (must not contain special tokens)
|
||||||
|
token_ids_1: Optional list of ids (must not contain special tokens), necessary when fetching sequence ids
|
||||||
|
for sequence pairs
|
||||||
|
already_has_special_tokens: (default False) Set to True if the token list is already formatted with
|
||||||
|
special tokens for the model
|
||||||
|
|
||||||
|
Returns:
|
||||||
|
A list of integers in the range [0, 1]: 1 for a special token, 0 for a sequence token.
|
||||||
|
"""
|
||||||
|
|
||||||
|
if already_has_special_tokens:
|
||||||
|
if token_ids_1 is not None:
|
||||||
|
raise ValueError("You should not supply a second sequence if the provided sequence of "
|
||||||
|
"ids is already formatted with special tokens for the model.")
|
||||||
|
return list(map(lambda x: 1 if x in [self.sep_token_id, self.cls_token_id] else 0, token_ids_0))
|
||||||
|
|
||||||
|
if token_ids_1 is not None:
|
||||||
|
return [1] + ([0] * len(token_ids_0)) + [1] + ([0] * len(token_ids_1)) + [1]
|
||||||
|
return [1] + ([0] * len(token_ids_0)) + [1]
|
||||||
|
|
||||||
|
def create_token_type_ids_from_sequences(self, token_ids_0, token_ids_1=None):
|
||||||
"""
|
"""
|
||||||
Creates a mask from the two sequences passed to be used in a sequence-pair classification task.
|
Creates a mask from the two sequences passed to be used in a sequence-pair classification task.
|
||||||
A BERT sequence pair mask has the following format:
|
A BERT sequence pair mask has the following format:
|
||||||
0 0 0 0 0 0 0 0 0 0 1 1 1 1 1 1 1 1 1 1 1
|
0 0 0 0 0 0 0 0 0 0 1 1 1 1 1 1 1 1 1 1 1
|
||||||
| first sequence | second sequence
|
| first sequence | second sequence
|
||||||
|
|
||||||
|
if token_ids_1 is None, only returns the first portion of the mask (0's).
|
||||||
"""
|
"""
|
||||||
sep = [self.sep_token_id]
|
sep = [self.sep_token_id]
|
||||||
cls = [self.cls_token_id]
|
cls = [self.cls_token_id]
|
||||||
|
if token_ids_1 is None:
|
||||||
|
return len(cls + token_ids_0 + sep) * [0]
|
||||||
return len(cls + token_ids_0 + sep) * [0] + len(token_ids_1 + sep) * [1]
|
return len(cls + token_ids_0 + sep) * [0] + len(token_ids_1 + sep) * [1]
|
||||||
|
|
||||||
def save_vocabulary(self, vocab_path):
|
def save_vocabulary(self, vocab_path):
|
||||||
|
|||||||
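The BERT hunk above merges the single-sequence and sequence-pair helpers into one method and adds a special-tokens mask. The sketch below re-implements the three list layouts as free functions so they can be run without the library. `CLS_ID` and `SEP_ID` are hypothetical ids chosen for illustration; a real vocabulary assigns its own. The mask follows what the code in the hunk returns: 1 marks a special token and 0 marks a sequence token.

```python
# Hypothetical ids, for illustration only.
CLS_ID, SEP_ID = 101, 102


def build_inputs_with_special_tokens(token_ids_0, token_ids_1=None):
    """[CLS] X [SEP] for one sequence, [CLS] A [SEP] B [SEP] for a pair."""
    if token_ids_1 is None:
        return [CLS_ID] + token_ids_0 + [SEP_ID]
    return [CLS_ID] + token_ids_0 + [SEP_ID] + token_ids_1 + [SEP_ID]


def get_special_tokens_mask(token_ids_0, token_ids_1=None):
    """1 at every special-token position, 0 at every sequence-token position."""
    if token_ids_1 is None:
        return [1] + [0] * len(token_ids_0) + [1]
    return [1] + [0] * len(token_ids_0) + [1] + [0] * len(token_ids_1) + [1]


def create_token_type_ids_from_sequences(token_ids_0, token_ids_1=None):
    """0 over [CLS] A [SEP], then 1 over B [SEP] when a second sequence is given."""
    first_part = [0] * (len(token_ids_0) + 2)
    if token_ids_1 is None:
        return first_part
    return first_part + [1] * (len(token_ids_1) + 1)


first, second = [7, 8, 9], [20, 21]
pair_inputs = build_inputs_with_special_tokens(first, second)
pair_mask = get_special_tokens_mask(first, second)
pair_types = create_token_type_ids_from_sequences(first, second)

# Passing empty lists counts only the special tokens, which is the trick the
# rewritten num_added_tokens body in this commit relies on.
added_single = len(build_inputs_with_special_tokens([]))
added_pair = len(build_inputs_with_special_tokens([], []))
```

All three lists have the same length for a given input, so they can be zipped position by position when debugging an encoded pair.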
242
transformers/tokenization_ctrl.py
Normal file
@@ -0,0 +1,242 @@
|
|||||||
|
# coding=utf-8
|
||||||
|
# Copyright 2018 Salesforce and The HuggingFace Inc. team.
|
||||||
|
#
|
||||||
|
# Licensed under the Apache License, Version 2.0 (the "License");
|
||||||
|
# you may not use this file except in compliance with the License.
|
||||||
|
# You may obtain a copy of the License at
|
||||||
|
#
|
||||||
|
# http://www.apache.org/licenses/LICENSE-2.0
|
||||||
|
#
|
||||||
|
# Unless required by applicable law or agreed to in writing, software
|
||||||
|
# distributed under the License is distributed on an "AS IS" BASIS,
|
||||||
|
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
|
||||||
|
# See the License for the specific language governing permissions and
|
||||||
|
# limitations under the License.
|
||||||
|
"""Tokenization classes for Salesforce CTRL."""
|
||||||
|
from __future__ import (absolute_import, division, print_function,
|
||||||
|
unicode_literals)
|
||||||
|
|
||||||
|
import json
|
||||||
|
import logging
|
||||||
|
import os
|
||||||
|
import regex as re
|
||||||
|
from io import open
|
||||||
|
|
||||||
|
from .tokenization_utils import PreTrainedTokenizer
|
||||||
|
|
||||||
|
logger = logging.getLogger(__name__)
|
||||||
|
|
||||||
|
VOCAB_FILES_NAMES = {
|
||||||
|
'vocab_file': 'vocab.json',
|
||||||
|
'merges_file': 'merges.txt',
|
||||||
|
}
|
||||||
|
|
||||||
|
PRETRAINED_VOCAB_FILES_MAP = {
|
||||||
|
'vocab_file':
|
||||||
|
{
|
||||||
|
'ctrl': "https://raw.githubusercontent.com/salesforce/ctrl/master/ctrl-vocab.json",
|
||||||
|
},
|
||||||
|
'merges_file':
|
||||||
|
{
|
||||||
|
'ctrl': "https://raw.githubusercontent.com/salesforce/ctrl/master/ctrl-merges.txt",
|
||||||
|
},
|
||||||
|
}
|
||||||
|
|
||||||
|
PRETRAINED_POSITIONAL_EMBEDDINGS_SIZES = {
|
||||||
|
'ctrl': 256,
|
||||||
|
}
|
||||||
|
|
||||||
|
CONTROL_CODES = {
|
||||||
|
"Pregnancy": 168629,
|
||||||
|
"Christianity": 7675,
|
||||||
|
"Explain": 106423,
|
||||||
|
"Fitness": 63440,
|
||||||
|
"Saving": 63163,
|
||||||
|
"Ask": 27171,
|
||||||
|
"Ass": 95985,
|
||||||
|
"Joke": 163509,
|
||||||
|
"Questions": 45622,
|
||||||
|
"Thoughts": 49605,
|
||||||
|
"Retail": 52342,
|
||||||
|
"Feminism": 164338,
|
||||||
|
"Writing": 11992,
|
||||||
|
"Atheism": 192263,
|
||||||
|
"Netflix": 48616,
|
||||||
|
"Computing": 39639,
|
||||||
|
"Opinion": 43213,
|
||||||
|
"Alone": 44967,
|
||||||
|
"Funny": 58917,
|
||||||
|
"Gaming": 40358,
|
||||||
|
"Human": 4088,
|
||||||
|
"India": 1331,
|
||||||
|
"Joker": 77138,
|
||||||
|
"Diet": 36206,
|
||||||
|
"Legal": 11859,
|
||||||
|
"Norman": 4939,
|
||||||
|
"Tip": 72689,
|
||||||
|
"Weight": 52343,
|
||||||
|
"Movies": 46273,
|
||||||
|
"Running": 23425,
|
||||||
|
"Science": 2090,
|
||||||
|
"Horror": 37793,
|
||||||
|
"Confession": 60572,
|
||||||
|
"Finance": 12250,
|
||||||
|
"Politics": 16360,
|
||||||
|
"Scary": 191985,
|
||||||
|
"Support": 12654,
|
||||||
|
"Technologies": 32516,
|
||||||
|
"Teenage": 66160,
|
||||||
|
"Event": 32769,
|
||||||
|
"Learned": 67460,
|
||||||
|
"Notion": 182770,
|
||||||
|
"Wikipedia": 37583,
|
||||||
|
"Books": 6665,
|
||||||
|
"Extract": 76050,
|
||||||
|
"Confessions": 102701,
|
||||||
|
"Conspiracy": 75932,
|
||||||
|
"Links": 63674,
|
||||||
|
"Narcissus": 150425,
|
||||||
|
"Relationship": 54766,
|
||||||
|
"Relationships": 134796,
|
||||||
|
"Reviews": 41671,
|
||||||
|
"News": 4256,
|
||||||
|
"Translation": 26820,
|
||||||
|
"multilingual": 128406,
|
||||||
|
}
|
||||||
|
|
||||||
|
def get_pairs(word):
|
||||||
|
"""Return set of symbol pairs in a word.
|
||||||
|
|
||||||
|
Word is represented as tuple of symbols (symbols being variable-length strings).
|
||||||
|
"""
|
||||||
|
pairs = set()
|
||||||
|
prev_char = word[0]
|
||||||
|
for char in word[1:]:
|
||||||
|
pairs.add((prev_char, char))
|
||||||
|
prev_char = char
|
||||||
|
|
||||||
|
pairs = set(pairs)
|
||||||
|
return pairs
|
||||||
|
|
||||||
|
class CTRLTokenizer(PreTrainedTokenizer):
|
||||||
|
"""
|
||||||
|
CTRL BPE tokenizer. Peculiarities:
|
||||||
|
- Byte-Pair-Encoding
|
||||||
|
"""
|
||||||
|
vocab_files_names = VOCAB_FILES_NAMES
|
||||||
|
pretrained_vocab_files_map = PRETRAINED_VOCAB_FILES_MAP
|
||||||
|
max_model_input_sizes = PRETRAINED_POSITIONAL_EMBEDDINGS_SIZES
|
||||||
|
control_codes = CONTROL_CODES
|
||||||
|
|
||||||
|
def __init__(self, vocab_file, merges_file, unk_token="<unk>", **kwargs):
|
||||||
|
super(CTRLTokenizer, self).__init__(unk_token=unk_token, **kwargs)
|
||||||
|
self.max_len_single_sentence = self.max_len # no default special tokens - you can update this value if you add special tokens
|
||||||
|
self.max_len_sentences_pair = self.max_len # no default special tokens - you can update this value if you add special tokens
|
||||||
|
|
||||||
|
self.encoder = json.load(open(vocab_file, encoding="utf-8"))
|
||||||
|
self.decoder = {v:k for k,v in self.encoder.items()}
|
||||||
|
merges = open(merges_file, encoding='utf-8').read().split('\n')[1:-1]
|
||||||
|
merges = [tuple(merge.split()) for merge in merges]
|
||||||
|
self.bpe_ranks = dict(zip(merges, range(len(merges))))
|
||||||
|
self.cache = {}
|
||||||
|
|
||||||
|
@property
|
||||||
|
def vocab_size(self):
|
||||||
|
return len(self.encoder)
|
||||||
|
|
||||||
|
def bpe(self, token):
|
||||||
|
if token in self.cache:
|
||||||
|
return self.cache[token]
|
||||||
|
word = tuple(token)
|
||||||
|
word = tuple(list(word[:-1]) + [word[-1]+'</w>'])
|
||||||
|
pairs = get_pairs(word)
|
||||||
|
|
||||||
|
if not pairs:
|
||||||
|
return token
|
||||||
|
|
||||||
|
while True:
|
||||||
|
bigram = min(pairs, key = lambda pair: self.bpe_ranks.get(pair, float('inf')))
|
||||||
|
if bigram not in self.bpe_ranks:
|
||||||
|
break
|
||||||
|
first, second = bigram
|
||||||
|
new_word = []
|
||||||
|
i = 0
|
||||||
|
while i < len(word):
|
||||||
|
try:
|
||||||
|
j = word.index(first, i)
|
||||||
|
new_word.extend(word[i:j])
|
||||||
|
i = j
|
||||||
|
except ValueError:
|
||||||
|
new_word.extend(word[i:])
|
||||||
|
break
|
||||||
|
|
||||||
|
if word[i] == first and i < len(word)-1 and word[i+1] == second:
|
||||||
|
new_word.append(first+second)
|
||||||
|
i += 2
|
||||||
|
else:
|
||||||
|
new_word.append(word[i])
|
||||||
|
i += 1
|
||||||
|
new_word = tuple(new_word)
|
||||||
|
word = new_word
|
||||||
|
if len(word) == 1:
|
||||||
|
break
|
||||||
|
else:
|
||||||
|
pairs = get_pairs(word)
|
||||||
|
word = '@@ '.join(word)
|
||||||
|
word = word[:-4]
|
||||||
|
self.cache[token] = word
|
||||||
|
return word
|
||||||
|
|
||||||
|
def _tokenize(self, text):
|
||||||
|
""" Tokenize a string.
|
||||||
|
"""
|
||||||
|
split_tokens = []
|
||||||
|
|
||||||
|
text = text.split(' ')
|
||||||
|
|
||||||
|
for token in text:
|
||||||
|
split_tokens.extend([t for t in self.bpe(token).split(' ')])
|
||||||
|
return split_tokens
|
||||||
|
|
||||||
|
def _convert_token_to_id(self, token):
|
||||||
|
""" Converts a token (str/unicode) into an id using the vocab. """
|
||||||
|
return self.encoder.get(token, self.encoder.get(self.unk_token))
|
||||||
|
|
||||||
|
def _convert_id_to_token(self, index):
|
||||||
|
"""Converts an index (integer) into a token (string/unicode) using the vocab."""
|
||||||
|
return self.decoder.get(index, self.unk_token)
|
||||||
|
|
||||||
|
def convert_tokens_to_string(self, tokens):
|
||||||
|
""" Converts a sequence of tokens (string) into a single string. """
|
||||||
|
out_string = ' '.join(tokens).replace('@@ ', '').strip()
|
||||||
|
return out_string
|
||||||
|
|
||||||
|
def save_vocabulary(self, save_directory):
|
||||||
|
"""Save the tokenizer vocabulary and merge files to a directory."""
|
||||||
|
if not os.path.isdir(save_directory):
|
||||||
|
logger.error("Vocabulary path ({}) should be a directory".format(save_directory))
|
||||||
|
return
|
||||||
|
vocab_file = os.path.join(save_directory, VOCAB_FILES_NAMES['vocab_file'])
|
||||||
|
merge_file = os.path.join(save_directory, VOCAB_FILES_NAMES['merges_file'])
|
||||||
|
|
||||||
|
with open(vocab_file, 'w', encoding='utf-8') as f:
|
||||||
|
f.write(json.dumps(self.encoder, ensure_ascii=False))
|
||||||
|
|
||||||
|
index = 0
|
||||||
|
with open(merge_file, "w", encoding="utf-8") as writer:
|
||||||
|
writer.write(u'#version: 0.2\n')
|
||||||
|
for bpe_tokens, token_index in sorted(self.bpe_ranks.items(), key=lambda kv: kv[1]):
|
||||||
|
if index != token_index:
|
||||||
|
logger.warning("Saving vocabulary to {}: BPE merge indices are not consecutive."
|
||||||
|
" Please check that the tokenizer is not corrupted!".format(merge_file))
|
||||||
|
index = token_index
|
||||||
|
writer.write(' '.join(bpe_tokens) + u'\n')
|
||||||
|
index += 1
|
||||||
|
|
||||||
|
return vocab_file, merge_file
|
||||||
|
|
||||||
|
# def decode(self, token_ids, skip_special_tokens=False, clean_up_tokenization_spaces=True):
|
||||||
|
# filtered_tokens = ' '.join(self.convert_ids_to_tokens(token_ids, skip_special_tokens=skip_special_tokens))
|
||||||
|
# tokens_generated_so_far = re.sub('(@@ )', '', string=filtered_tokens)
|
||||||
|
# tokens_generated_so_far = re.sub('(@@ ?$)', '', string=tokens_generated_so_far)
|
||||||
|
# return ''.join(tokens_generated_so_far)
|
||||||
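The new CTRL tokenizer above applies byte-pair-encoding merges to each whitespace-separated word and marks every non-final piece with a trailing `@@`. The sketch below is a standalone re-implementation of the same loop for experimentation, not the file's code. `toy_ranks` is a made-up merge table; the real ranks come from the merges file.

```python
def get_pairs(word):
    """Return the set of adjacent symbol pairs in a tuple of symbols."""
    return set(zip(word, word[1:]))


def bpe(token, bpe_ranks):
    """Greedily apply the lowest-ranked known merge until none applies."""
    # Mark the end of the word on its last symbol, as the tokenizer does.
    word = tuple(token[:-1]) + (token[-1] + "</w>",)
    pairs = get_pairs(word)
    if not pairs:
        return token
    while True:
        bigram = min(pairs, key=lambda pair: bpe_ranks.get(pair, float("inf")))
        if bigram not in bpe_ranks:
            break
        first, second = bigram
        new_word = []
        i = 0
        while i < len(word):
            if i < len(word) - 1 and word[i] == first and word[i + 1] == second:
                new_word.append(first + second)  # merge the matched pair
                i += 2
            else:
                new_word.append(word[i])
                i += 1
        word = tuple(new_word)
        if len(word) == 1:
            break
        pairs = get_pairs(word)
    # Join pieces with the continuation marker and drop the trailing '</w>'.
    return "@@ ".join(word)[:-4]


# Made-up merge table for this sketch: lower rank means earlier merge.
toy_ranks = {("l", "o"): 0, ("lo", "w"): 1}

pieces = bpe("lower", toy_ranks).split(" ")
restored = " ".join(pieces).replace("@@ ", "").strip()
```

Removing every `@@ ` from the joined pieces undoes the split, which is what `convert_tokens_to_string` does in the file above.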
@@ -46,12 +46,14 @@ PRETRAINED_VOCAB_FILES_MAP = {
|
|||||||
'roberta-base': "https://s3.amazonaws.com/models.huggingface.co/bert/roberta-base-vocab.json",
|
'roberta-base': "https://s3.amazonaws.com/models.huggingface.co/bert/roberta-base-vocab.json",
|
||||||
'roberta-large': "https://s3.amazonaws.com/models.huggingface.co/bert/roberta-large-vocab.json",
|
'roberta-large': "https://s3.amazonaws.com/models.huggingface.co/bert/roberta-large-vocab.json",
|
||||||
'roberta-large-mnli': "https://s3.amazonaws.com/models.huggingface.co/bert/roberta-large-mnli-vocab.json",
|
'roberta-large-mnli': "https://s3.amazonaws.com/models.huggingface.co/bert/roberta-large-mnli-vocab.json",
|
||||||
|
'distilroberta-base': "https://s3.amazonaws.com/models.huggingface.co/bert/distilroberta-base-vocab.json",
|
||||||
},
|
},
|
||||||
'merges_file':
|
'merges_file':
|
||||||
{
|
{
|
||||||
'roberta-base': "https://s3.amazonaws.com/models.huggingface.co/bert/roberta-base-merges.txt",
|
'roberta-base': "https://s3.amazonaws.com/models.huggingface.co/bert/roberta-base-merges.txt",
|
||||||
'roberta-large': "https://s3.amazonaws.com/models.huggingface.co/bert/roberta-large-merges.txt",
|
'roberta-large': "https://s3.amazonaws.com/models.huggingface.co/bert/roberta-large-merges.txt",
|
||||||
'roberta-large-mnli': "https://s3.amazonaws.com/models.huggingface.co/bert/roberta-large-mnli-merges.txt",
|
'roberta-large-mnli': "https://s3.amazonaws.com/models.huggingface.co/bert/roberta-large-mnli-merges.txt",
|
||||||
|
'distilroberta-base': "https://s3.amazonaws.com/models.huggingface.co/bert/distilroberta-base-merges.txt",
|
||||||
},
|
},
|
||||||
}
|
}
|
||||||
|
|
||||||
@@ -59,6 +61,7 @@ PRETRAINED_POSITIONAL_EMBEDDINGS_SIZES = {
|
|||||||
'roberta-base': 512,
|
'roberta-base': 512,
|
||||||
'roberta-large': 512,
|
'roberta-large': 512,
|
||||||
'roberta-large-mnli': 512,
|
'roberta-large-mnli': 512,
|
||||||
|
'distilroberta-base': 512,
|
||||||
}
|
}
|
||||||
|
|
||||||
|
|
||||||
@@ -84,30 +87,57 @@ class RobertaTokenizer(GPT2Tokenizer):
|
|||||||
self.max_len_single_sentence = self.max_len - 2 # take into account special tokens
|
self.max_len_single_sentence = self.max_len - 2 # take into account special tokens
|
||||||
self.max_len_sentences_pair = self.max_len - 4 # take into account special tokens
|
self.max_len_sentences_pair = self.max_len - 4 # take into account special tokens
|
||||||
|
|
||||||
def add_special_tokens_single_sequence(self, token_ids):
|
def build_inputs_with_special_tokens(self, token_ids_0, token_ids_1=None):
|
||||||
"""
|
"""
|
||||||
Adds special tokens to a sequence for sequence classification tasks.
|
Build model inputs from a sequence or a pair of sequences for sequence classification tasks
|
||||||
A RoBERTa sequence has the following format: <s> X </s>
|
by concatenating and adding special tokens.
|
||||||
|
A RoBERTa sequence has the following format:
|
||||||
|
single sequence: <s> X </s>
|
||||||
|
pair of sequences: <s> A </s></s> B </s>
|
||||||
"""
|
"""
|
||||||
return [self.cls_token_id] + token_ids + [self.sep_token_id]
|
if token_ids_1 is None:
|
||||||
|
return [self.cls_token_id] + token_ids_0 + [self.sep_token_id]
|
||||||
def add_special_tokens_sequence_pair(self, token_ids_0, token_ids_1):
|
|
||||||
"""
|
|
||||||
Adds special tokens to a sequence pair for sequence classification tasks.
|
|
||||||
A RoBERTa sequence pair has the following format: <s> A </s></s> B </s>
|
|
||||||
"""
|
|
||||||
sep = [self.sep_token_id]
|
|
||||||
cls = [self.cls_token_id]
|
cls = [self.cls_token_id]
|
||||||
|
sep = [self.sep_token_id]
|
||||||
return cls + token_ids_0 + sep + sep + token_ids_1 + sep
|
return cls + token_ids_0 + sep + sep + token_ids_1 + sep
|
||||||
|
|
||||||
def create_token_type_ids_from_sequences(self, token_ids_0, token_ids_1):
|
def get_special_tokens_mask(self, token_ids_0, token_ids_1=None, already_has_special_tokens=False):
|
||||||
|
"""
|
||||||
|
Retrieves sequence ids from a token list that has no special tokens added. This method is called when adding
|
||||||
|
special tokens using the tokenizer ``prepare_for_model`` or ``encode_plus`` methods.
|
||||||
|
|
||||||
|
Args:
|
||||||
|
token_ids_0: list of ids (must not contain special tokens)
|
||||||
|
token_ids_1: Optional list of ids (must not contain special tokens), necessary when fetching sequence ids
|
||||||
|
for sequence pairs
|
||||||
|
already_has_special_tokens: (default False) Set to True if the token list is already formatted with
|
||||||
|
special tokens for the model
|
||||||
|
|
||||||
|
Returns:
|
||||||
|
A list of integers in the range [0, 1]: 1 for a special token, 0 for a sequence token.
|
||||||
|
"""
|
||||||
|
if already_has_special_tokens:
|
||||||
|
if token_ids_1 is not None:
|
||||||
|
raise ValueError("You should not supply a second sequence if the provided sequence of "
|
||||||
|
"ids is already formatted with special tokens for the model.")
|
||||||
|
return list(map(lambda x: 1 if x in [self.sep_token_id, self.cls_token_id] else 0, token_ids_0))
|
||||||
|
|
||||||
|
if token_ids_1 is None:
|
||||||
|
return [1] + ([0] * len(token_ids_0)) + [1]
|
||||||
|
return [1] + ([0] * len(token_ids_0)) + [1, 1] + ([0] * len(token_ids_1)) + [1]
|
||||||
|
|
||||||
|
def create_token_type_ids_from_sequences(self, token_ids_0, token_ids_1=None):
|
||||||
"""
|
"""
|
||||||
Creates a mask from the two sequences passed to be used in a sequence-pair classification task.
|
Creates a mask from the two sequences passed to be used in a sequence-pair classification task.
|
||||||
A RoBERTa sequence pair mask has the following format:
|
A RoBERTa sequence pair mask has the following format:
|
||||||
0 0 0 0 0 0 0 0 0 0 1 1 1 1 1 1 1 1 1 1 1
|
0 0 0 0 0 0 0 0 0 0 1 1 1 1 1 1 1 1 1 1 1
|
||||||
| first sequence | second sequence
|
| first sequence | second sequence
|
||||||
|
|
||||||
|
if token_ids_1 is None, only returns the first portion of the mask (0's).
|
||||||
"""
|
"""
|
||||||
sep = [self.sep_token_id]
|
sep = [self.sep_token_id]
|
||||||
cls = [self.cls_token_id]
|
cls = [self.cls_token_id]
|
||||||
|
|
||||||
return len(cls + token_ids_0 + sep + sep) * [0] + len(token_ids_1 + sep) * [1]
|
if token_ids_1 is None:
|
||||||
|
return len(cls + token_ids_0 + sep) * [0]
|
||||||
|
return len(cls + token_ids_0 + sep + sep) * [0] + len(token_ids_1 + sep) * [1]
|
||||||
|
|||||||
@@ -337,13 +337,13 @@ class PreTrainedTokenizer(object):
|
|||||||
vocab_files[file_id] = full_file_name
|
vocab_files[file_id] = full_file_name
|
||||||
|
|
||||||
if all(full_file_name is None for full_file_name in vocab_files.values()):
|
if all(full_file_name is None for full_file_name in vocab_files.values()):
|
||||||
logger.error(
|
raise EnvironmentError(
|
||||||
"Model name '{}' was not found in model name list ({}). "
|
"Model name '{}' was not found in tokenizers model name list ({}). "
|
||||||
"We assumed '{}' was a path or url but couldn't find tokenizer files"
|
"We assumed '{}' was a path or url to a directory containing vocabulary files "
|
||||||
"at this path or url.".format(
|
"named {} but couldn't find such vocabulary files at this path or url.".format(
|
||||||
pretrained_model_name_or_path, ', '.join(s3_models),
|
pretrained_model_name_or_path, ', '.join(s3_models),
|
||||||
pretrained_model_name_or_path, ))
|
pretrained_model_name_or_path,
|
||||||
return None
|
list(cls.vocab_files_names.values())))
|
||||||
|
|
||||||
# Get files from url, cache, or disk depending on the case
|
# Get files from url, cache, or disk depending on the case
|
||||||
try:
|
try:
|
||||||
@@ -353,17 +353,18 @@ class PreTrainedTokenizer(object):
|
|||||||
resolved_vocab_files[file_id] = None
|
resolved_vocab_files[file_id] = None
|
||||||
else:
|
else:
|
||||||
resolved_vocab_files[file_id] = cached_path(file_path, cache_dir=cache_dir, force_download=force_download, proxies=proxies)
|
resolved_vocab_files[file_id] = cached_path(file_path, cache_dir=cache_dir, force_download=force_download, proxies=proxies)
|
||||||
except EnvironmentError as e:
|
except EnvironmentError:
|
||||||
if pretrained_model_name_or_path in s3_models:
|
if pretrained_model_name_or_path in s3_models:
|
||||||
logger.error("Couldn't reach server to download vocabulary.")
|
msg = "Couldn't reach server at '{}' to download vocabulary files."
|
||||||
else:
|
else:
|
||||||
logger.error(
|
msg = "Model name '{}' was not found in tokenizers model name list ({}). " \
|
||||||
"Model name '{}' was not found in model name list ({}). "
|
"We assumed '{}' was a path or url to a directory containing vocabulary files " \
|
||||||
"We assumed '{}' was a path or url but couldn't find files {} "
|
"named {}, but couldn't find such vocabulary files at this path or url.".format(
|
||||||
"at this path or url.".format(
|
|
||||||
pretrained_model_name_or_path, ', '.join(s3_models),
|
pretrained_model_name_or_path, ', '.join(s3_models),
|
||||||
pretrained_model_name_or_path, str(vocab_files.keys())))
|
pretrained_model_name_or_path,
|
||||||
raise e
|
list(cls.vocab_files_names.values()))
|
||||||
|
|
||||||
|
raise EnvironmentError(msg)
|
||||||
|
|
||||||
for file_id, file_path in vocab_files.items():
|
for file_id, file_path in vocab_files.items():
|
||||||
if file_path == resolved_vocab_files[file_id]:
|
if file_path == resolved_vocab_files[file_id]:
|
||||||
@@ -539,15 +540,9 @@ class PreTrainedTokenizer(object):
|
|||||||
Returns:
|
Returns:
|
||||||
Number of tokens added to sequences
|
Number of tokens added to sequences
|
||||||
"""
|
"""
|
||||||
|
token_ids_0 = []
|
||||||
if pair:
|
token_ids_1 = []
|
||||||
initial_tokens_len = len(self.encode("This is a sequence") + self.encode("This is another"))
|
return len(self.build_inputs_with_special_tokens(token_ids_0, token_ids_1 if pair else None))
|
||||||
final_tokens_len = len(self.encode("This is a sequence", "This is another", add_special_tokens=True))
|
|
||||||
else:
|
|
||||||
initial_tokens_len = len(self.encode("This is a sequence"))
|
|
||||||
final_tokens_len = len(self.encode("This is a sequence", add_special_tokens=True))
|
|
||||||
|
|
||||||
return final_tokens_len - initial_tokens_len
|
|
||||||
|
|
||||||
def add_special_tokens(self, special_tokens_dict):
|
def add_special_tokens(self, special_tokens_dict):
|
||||||
"""
|
"""
|
||||||
@@ -699,7 +694,7 @@ class PreTrainedTokenizer(object):
|
|||||||
add_special_tokens=False,
|
add_special_tokens=False,
|
||||||
max_length=None,
|
max_length=None,
|
||||||
stride=0,
|
stride=0,
|
||||||
truncate_first_sequence=True,
|
truncation_strategy='longest_first',
|
||||||
return_tensors=None,
|
return_tensors=None,
|
||||||
**kwargs):
|
**kwargs):
|
||||||
"""
|
"""
|
||||||
@@ -719,9 +714,13 @@ class PreTrainedTokenizer(object):
|
|||||||
max_length: if set to a number, will limit the total sequence returned so that it has a maximum length.
|
max_length: if set to a number, will limit the total sequence returned so that it has a maximum length.
|
||||||
If there are overflowing tokens, those will be added to the returned dictionary
|
If there are overflowing tokens, those will be added to the returned dictionary
|
||||||
stride: if set to a number along with max_length, the overflowing tokens returned will contain some tokens
|
stride: if set to a number along with max_length, the overflowing tokens returned will contain some tokens
|
||||||
from the main sequence returned. The value of this argument defined the number of additional tokens.
|
from the main sequence returned. The value of this argument defines the number of additional tokens.
|
||||||
truncate_first_sequence: if there is a specified max_length, this flag will choose which sequence
|
truncation_strategy: string selected in the following options:
|
||||||
will be truncated.
|
- 'longest_first' (default) Iteratively reduce the inputs sequence until the input is under max_length
|
||||||
|
starting from the longest one at each token (when there is a pair of input sequences)
|
||||||
|
- 'only_first': Only truncate the first sequence
|
||||||
|
- 'only_second': Only truncate the second sequence
|
||||||
|
- 'do_not_truncate': Does not truncate (raise an error if the input sequence is longer than max_length)
|
||||||
return_tensors: (optional) can be set to 'tf' or 'pt' to return respectively TensorFlow tf.constant
|
return_tensors: (optional) can be set to 'tf' or 'pt' to return respectively TensorFlow tf.constant
|
||||||
or PyTorch torch.Tensor instead of a list of python integers.
|
or PyTorch torch.Tensor instead of a list of python integers.
|
||||||
**kwargs: passed to the `self.tokenize()` method
|
**kwargs: passed to the `self.tokenize()` method
|
||||||
@@ -731,7 +730,7 @@ class PreTrainedTokenizer(object):
|
|||||||
max_length=max_length,
|
max_length=max_length,
|
||||||
add_special_tokens=add_special_tokens,
|
add_special_tokens=add_special_tokens,
|
||||||
stride=stride,
|
stride=stride,
|
||||||
truncate_first_sequence=truncate_first_sequence,
|
truncation_strategy=truncation_strategy,
|
||||||
return_tensors=return_tensors,
|
return_tensors=return_tensors,
|
||||||
**kwargs)
|
**kwargs)
|
||||||
|
|
||||||
@@ -743,7 +742,7 @@ class PreTrainedTokenizer(object):
|
|||||||
add_special_tokens=False,
|
add_special_tokens=False,
|
||||||
max_length=None,
|
max_length=None,
|
||||||
stride=0,
|
stride=0,
|
||||||
truncate_first_sequence=True,
|
truncation_strategy='longest_first',
|
||||||
return_tensors=None,
|
return_tensors=None,
|
||||||
**kwargs):
|
**kwargs):
|
||||||
"""
|
"""
|
||||||
@@ -762,9 +761,13 @@ class PreTrainedTokenizer(object):
|
|||||||
max_length: if set to a number, will limit the total sequence returned so that it has a maximum length.
|
max_length: if set to a number, will limit the total sequence returned so that it has a maximum length.
|
||||||
If there are overflowing tokens, those will be added to the returned dictionary
|
If there are overflowing tokens, those will be added to the returned dictionary
|
||||||
stride: if set to a number along with max_length, the overflowing tokens returned will contain some tokens
|
stride: if set to a number along with max_length, the overflowing tokens returned will contain some tokens
|
||||||
from the main sequence returned. The value of this argument defined the number of additional tokens.
|
from the main sequence returned. The value of this argument defines the number of additional tokens.
|
||||||
truncate_first_sequence: if there is a specified max_length, this flag will choose which sequence
|
truncation_strategy: string selected in the following options:
|
||||||
will be truncated.
|
- 'longest_first' (default): Iteratively reduce the input sequences until the input is under max_length
|
||||||
|
starting from the longest one at each token (when there is a pair of input sequences)
|
||||||
|
- 'only_first': Only truncate the first sequence
|
||||||
|
- 'only_second': Only truncate the second sequence
|
||||||
|
- 'do_not_truncate': Does not truncate (raise an error if the input sequence is longer than max_length)
|
||||||
return_tensors: (optional) can be set to 'tf' or 'pt' to return respectively TensorFlow tf.constant
|
return_tensors: (optional) can be set to 'tf' or 'pt' to return respectively TensorFlow tf.constant
|
||||||
or PyTorch torch.Tensor instead of a list of python integers.
|
or PyTorch torch.Tensor instead of a list of python integers.
|
||||||
**kwargs: passed to the `self.tokenize()` method
|
**kwargs: passed to the `self.tokenize()` method
|
||||||
@@ -788,12 +791,11 @@ class PreTrainedTokenizer(object):
|
|||||||
max_length=max_length,
|
max_length=max_length,
|
||||||
add_special_tokens=add_special_tokens,
|
add_special_tokens=add_special_tokens,
|
||||||
stride=stride,
|
stride=stride,
|
||||||
truncate_first_sequence=truncate_first_sequence,
|
truncation_strategy=truncation_strategy,
|
||||||
return_tensors=return_tensors)
|
return_tensors=return_tensors)
|
||||||
|
|
||||||
|
|
||||||
def prepare_for_model(self, ids, pair_ids=None, max_length=None, add_special_tokens=False, stride=0,
|
def prepare_for_model(self, ids, pair_ids=None, max_length=None, add_special_tokens=False, stride=0,
|
||||||
truncate_first_sequence=True, return_tensors=None):
|
truncation_strategy='longest_first', return_tensors=None):
|
||||||
"""
|
"""
|
||||||
Prepares a sequence of input id, or a pair of sequences of inputs ids so that it can be used by the model.
|
Prepares a sequence of input id, or a pair of sequences of inputs ids so that it can be used by the model.
|
||||||
It adds special tokens, truncates
|
It adds special tokens, truncates
|
||||||
@@ -810,41 +812,50 @@ class PreTrainedTokenizer(object):
|
|||||||
to their model.
|
to their model.
|
||||||
stride: window stride for overflowing tokens. Can be useful for edge effect removal when using sequential
|
stride: window stride for overflowing tokens. Can be useful for edge effect removal when using sequential
|
||||||
list of inputs.
|
list of inputs.
|
||||||
truncate_first_sequence: if set to `True` and an optional second list of input ids is provided,
|
truncation_strategy: string selected in the following options:
|
||||||
alongside a specified `max_length`, will truncate the first sequence if the total size is superior
|
- 'longest_first' (default): Iteratively reduce the input sequences until the input is under max_length
|
||||||
than the specified `max_length`. If set to `False`, will truncate the second sequence instead.
|
starting from the longest one at each token (when there is a pair of input sequences)
|
||||||
|
- 'only_first': Only truncate the first sequence
|
||||||
|
- 'only_second': Only truncate the second sequence
|
||||||
|
- 'do_not_truncate': Does not truncate (raise an error if the input sequence is longer than max_length)
|
||||||
return_tensors: (optional) can be set to 'tf' or 'pt' to return respectively TensorFlow tf.constant
|
return_tensors: (optional) can be set to 'tf' or 'pt' to return respectively TensorFlow tf.constant
|
||||||
or PyTorch torch.Tensor instead of a list of python integers.
|
or PyTorch torch.Tensor instead of a list of python integers.
|
||||||
|
|
||||||
Return:
|
Return:
|
||||||
a dictionary containing the `input_ids` as well as the `overflowing_tokens` if a `max_length` was given.
|
A Dictionary of shape::
|
||||||
|
|
||||||
|
{
|
||||||
|
input_ids: list[int],
|
||||||
|
overflowing_tokens: list[int] if a ``max_length`` is specified, else None
|
||||||
|
special_tokens_mask: list[int] if ``add_special_tokens`` is set to ``True``
|
||||||
|
}
|
||||||
|
|
||||||
|
With the fields:
|
||||||
|
``input_ids``: list of tokens to be fed to a model
|
||||||
|
|
||||||
|
``overflowing_tokens``: list of overflowing tokens if a max length is specified.
|
||||||
|
|
||||||
|
``special_tokens_mask``: if adding special tokens, this is a list of [0, 1], with 1 specifying special added
|
||||||
|
tokens and 0 specifying sequence tokens.
|
||||||
"""
|
"""
|
||||||
pair = bool(pair_ids is not None)
|
pair = bool(pair_ids is not None)
|
||||||
len_ids = len(ids)
|
len_ids = len(ids)
|
||||||
len_pair_ids = len(pair_ids) if pair else 0
|
len_pair_ids = len(pair_ids) if pair else 0
|
||||||
|
|
||||||
encoded_inputs = {}
|
encoded_inputs = {}
|
||||||
if max_length:
|
total_len = len_ids + len_pair_ids + (self.num_added_tokens(pair=pair) if add_special_tokens else 0)
|
||||||
n_added_tokens = self.num_added_tokens(pair=pair) if add_special_tokens else 0
|
if max_length and total_len > max_length:
|
||||||
if pair and n_added_tokens + (len_pair_ids if truncate_first_sequence else len_ids) >= max_length:
|
ids, pair_ids, overflowing_tokens = self.truncate_sequences(ids, pair_ids=pair_ids,
|
||||||
logger.warning(
|
num_tokens_to_remove=total_len-max_length,
|
||||||
"You supplied a pair of sequence in which the sequence that will not be truncated is longer than the maximum specified length."
|
truncation_strategy=truncation_strategy,
|
||||||
"This pair of sequences will not be truncated.")
|
stride=stride)
|
||||||
else:
|
encoded_inputs["overflowing_tokens"] = overflowing_tokens
|
||||||
if n_added_tokens + len_ids + len_pair_ids > max_length:
|
encoded_inputs["num_truncated_tokens"] = total_len - max_length
|
||||||
if truncate_first_sequence or not pair:
|
|
||||||
encoded_inputs["overflowing_tokens"] = ids[max_length - len_pair_ids - n_added_tokens - stride:]
|
|
||||||
ids = ids[:max_length - len_pair_ids - n_added_tokens]
|
|
||||||
elif not truncate_first_sequence and pair:
|
|
||||||
encoded_inputs["overflowing_tokens"] = pair_ids[max_length - len_ids - n_added_tokens - stride:]
|
|
||||||
pair_ids = pair_ids[:max_length - len_ids - n_added_tokens]
|
|
||||||
else:
|
|
||||||
logger.warning(
|
|
||||||
"Cannot truncate second sequence as it is not provided. No truncation.")
|
|
||||||
|
|
||||||
if add_special_tokens:
|
if add_special_tokens:
|
||||||
sequence = self.add_special_tokens_sequence_pair(ids, pair_ids) if pair else self.add_special_tokens_single_sequence(ids)
|
sequence = self.build_inputs_with_special_tokens(ids, pair_ids)
|
||||||
token_type_ids = self.create_token_type_ids_from_sequences(ids, pair_ids) if pair else [0] * len(sequence)
|
token_type_ids = self.create_token_type_ids_from_sequences(ids, pair_ids)
|
||||||
|
encoded_inputs["special_tokens_mask"] = self.get_special_tokens_mask(ids, pair_ids)
|
||||||
else:
|
else:
|
||||||
sequence = ids + pair_ids if pair else ids
|
sequence = ids + pair_ids if pair else ids
|
||||||
token_type_ids = [0] * len(ids) + ([1] * len(pair_ids) if pair else [])
|
token_type_ids = [0] * len(ids) + ([1] * len(pair_ids) if pair else [])
|
||||||
@@ -861,20 +872,89 @@ class PreTrainedTokenizer(object):
|
|||||||
encoded_inputs["input_ids"] = sequence
|
encoded_inputs["input_ids"] = sequence
|
||||||
encoded_inputs["token_type_ids"] = token_type_ids
|
encoded_inputs["token_type_ids"] = token_type_ids
|
||||||
|
|
||||||
|
if max_length and len(encoded_inputs["input_ids"]) > max_length:
|
||||||
|
encoded_inputs["input_ids"] = encoded_inputs["input_ids"][:max_length]
|
||||||
|
encoded_inputs["token_type_ids"] = encoded_inputs["token_type_ids"][:max_length]
|
||||||
|
encoded_inputs["special_tokens_mask"] = encoded_inputs["special_tokens_mask"][:max_length]
|
||||||
|
|
||||||
return encoded_inputs
|
return encoded_inputs
|
||||||
|
|
||||||
def create_token_type_ids_from_sequences(self, token_ids_0, token_ids_1):
|
def truncate_sequences(self, ids, pair_ids=None, num_tokens_to_remove=0, truncation_strategy='longest_first', stride=0):
|
||||||
|
"""Truncates a sequence pair in place to the maximum length.
|
||||||
|
truncation_strategy: string selected in the following options:
|
||||||
|
- 'longest_first' (default): Iteratively reduce the input sequences until the input is under max_length
|
||||||
|
starting from the longest one at each token (when there is a pair of input sequences).
|
||||||
|
Overflowing tokens only contains overflow from the first sequence.
|
||||||
|
- 'only_first': Only truncate the first sequence. Raises an error if the first sequence is shorter than or equal to num_tokens_to_remove.
|
||||||
|
- 'only_second': Only truncate the second sequence
|
||||||
|
- 'do_not_truncate': Does not truncate (raise an error if the input sequence is longer than max_length)
|
||||||
|
"""
|
||||||
|
if num_tokens_to_remove <= 0:
|
||||||
|
return ids, pair_ids, []
|
||||||
|
|
||||||
|
if truncation_strategy == 'longest_first':
|
||||||
|
overflowing_tokens = []
|
||||||
|
for _ in range(num_tokens_to_remove):
|
||||||
|
if pair_ids is None or len(ids) > len(pair_ids):
|
||||||
|
overflowing_tokens = [ids[-1]] + overflowing_tokens
|
||||||
|
ids = ids[:-1]
|
||||||
|
else:
|
||||||
|
pair_ids = pair_ids[:-1]
|
||||||
|
window_len = min(len(ids), stride)
|
||||||
|
if window_len > 0:
|
||||||
|
overflowing_tokens = ids[-window_len:] + overflowing_tokens
|
||||||
|
elif truncation_strategy == 'only_first':
|
||||||
|
assert len(ids) > num_tokens_to_remove
|
||||||
|
window_len = min(len(ids), stride + num_tokens_to_remove)
|
||||||
|
overflowing_tokens = ids[-window_len:]
|
||||||
|
ids = ids[:-num_tokens_to_remove]
|
||||||
|
elif truncation_strategy == 'only_second':
|
||||||
|
assert pair_ids is not None and len(pair_ids) > num_tokens_to_remove
|
||||||
|
window_len = min(len(pair_ids), stride + num_tokens_to_remove)
|
||||||
|
overflowing_tokens = pair_ids[-window_len:]
|
||||||
|
pair_ids = pair_ids[:-num_tokens_to_remove]
|
||||||
|
elif truncation_strategy == 'do_not_truncate':
|
||||||
|
raise ValueError("Input sequences are too long for max_length. Please select a truncation strategy.")
|
||||||
|
else:
|
||||||
|
raise ValueError("Truncation_strategy should be selected in ['longest_first', 'only_first', 'only_second', 'do_not_truncate']")
|
||||||
|
return (ids, pair_ids, overflowing_tokens)
|
||||||
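Review note: the behaviour of the new `truncate_sequences` helper is easier to follow outside the diff. The block below is a minimal standalone sketch of the same logic, written as a free function rather than a tokenizer method. It assumes plain Python lists of token ids, and it swaps the `assert` statements from the diff for `ValueError`, which is an illustrative choice and not what the commit does.

```python
def truncate_sequences(ids, pair_ids=None, num_tokens_to_remove=0,
                       truncation_strategy="longest_first", stride=0):
    """Return (ids, pair_ids, overflowing_tokens) after truncation.

    Standalone sketch of the method added in the diff; slicing creates new
    lists, so the caller's inputs are left untouched.
    """
    if num_tokens_to_remove <= 0:
        return ids, pair_ids, []

    if truncation_strategy == "longest_first":
        overflowing_tokens = []
        for _ in range(num_tokens_to_remove):
            # Drop one token at a time from whichever sequence is longer.
            if pair_ids is None or len(ids) > len(pair_ids):
                overflowing_tokens = [ids[-1]] + overflowing_tokens
                ids = ids[:-1]
            else:
                # Tokens removed from the second sequence are not recorded.
                pair_ids = pair_ids[:-1]
        # Prepend `stride` tokens of context from the kept first sequence.
        window_len = min(len(ids), stride)
        if window_len > 0:
            overflowing_tokens = ids[-window_len:] + overflowing_tokens
    elif truncation_strategy == "only_first":
        if len(ids) <= num_tokens_to_remove:
            raise ValueError("First sequence is too short to be truncated.")
        window_len = min(len(ids), stride + num_tokens_to_remove)
        overflowing_tokens = ids[-window_len:]
        ids = ids[:-num_tokens_to_remove]
    elif truncation_strategy == "only_second":
        if pair_ids is None or len(pair_ids) <= num_tokens_to_remove:
            raise ValueError("Second sequence is missing or too short.")
        window_len = min(len(pair_ids), stride + num_tokens_to_remove)
        overflowing_tokens = pair_ids[-window_len:]
        pair_ids = pair_ids[:-num_tokens_to_remove]
    elif truncation_strategy == "do_not_truncate":
        raise ValueError("Input is too long and truncation is disabled.")
    else:
        raise ValueError("Unknown truncation_strategy: %r" % truncation_strategy)
    return ids, pair_ids, overflowing_tokens


first = [1, 2, 3, 4, 5]
second = [6, 7, 8]
longest = truncate_sequences(first, second, num_tokens_to_remove=3)
only_first = truncate_sequences(first, second, num_tokens_to_remove=2,
                                truncation_strategy="only_first", stride=1)
```

With `longest_first`, two tokens come off the longer first sequence before the two are tied, and the third removal then hits the second sequence. Only the tokens taken from the first sequence appear in the overflow, as the docstring in the diff states.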
|
|
||||||
|
def create_token_type_ids_from_sequences(self, token_ids_0, token_ids_1=None):
|
||||||
logger.warning("This tokenizer does not make use of special tokens.")
|
logger.warning("This tokenizer does not make use of special tokens.")
|
||||||
|
if token_ids_1 is None:
|
||||||
|
return len(token_ids_0) * [0]
|
||||||
return [0] * len(token_ids_0) + [1] * len(token_ids_1)
|
return [0] * len(token_ids_0) + [1] * len(token_ids_1)
|
||||||
|
|
||||||
def add_special_tokens_single_sequence(self, token_ids):
|
def build_inputs_with_special_tokens(self, token_ids_0, token_ids_1=None):
|
||||||
logger.warning("This tokenizer does not make use of special tokens. The sequence has been returned with no modification.")
|
"""
|
||||||
return token_ids
|
Build model inputs from a sequence or a pair of sequence for sequence classification tasks
|
||||||
|
by concatenating and adding special tokens.
|
||||||
def add_special_tokens_sequence_pair(self, token_ids_0, token_ids_1):
|
This base implementation adds no special tokens; subclasses define the format. For example, a RoBERTa sequence has the following format:
|
||||||
logger.warning("This tokenizer does not make use of special tokens. The two sequences have been concatenated.")
|
single sequence: <s> X </s>
|
||||||
|
pair of sequences: <s> A </s></s> B </s>
|
||||||
|
"""
|
||||||
|
logger.warning("This tokenizer does not make use of special tokens. Input is returned with no modification.")
|
||||||
|
if token_ids_1 is None:
|
||||||
|
return token_ids_0
|
||||||
return token_ids_0 + token_ids_1
|
return token_ids_0 + token_ids_1
|
||||||
|
|
||||||
|
def get_special_tokens_mask(self, token_ids_0, token_ids_1=None, already_has_special_tokens=False):
|
||||||
|
"""
|
||||||
|
Retrieves sequence ids from a token list that has no special tokens added. This method is called when adding
|
||||||
|
special tokens using the tokenizer ``prepare_for_model`` or ``encode_plus`` methods.
|
||||||
|
|
||||||
|
Args:
|
||||||
|
token_ids_0: list of ids (must not contain special tokens)
|
||||||
|
token_ids_1: Optional list of ids (must not contain special tokens), necessary when fetching sequence ids
|
||||||
|
for sequence pairs
|
||||||
|
already_has_special_tokens: (default False) Set to True if the token list is already formatted with
|
||||||
|
special tokens for the model
|
||||||
|
|
||||||
|
Returns:
|
||||||
|
A list of integers in the range [0, 1]: 1 for a special token, 0 for a sequence token.
|
||||||
|
"""
|
||||||
|
return [0] * ((len(token_ids_1) if token_ids_1 else 0) + len(token_ids_0))
|
||||||
|
|
||||||
def convert_ids_to_tokens(self, ids, skip_special_tokens=False):
|
def convert_ids_to_tokens(self, ids, skip_special_tokens=False):
|
||||||
""" Converts a single index or a sequence of indices (integers) in a token "
|
""" Converts a single index or a sequence of indices (integers) in a token "
|
||||||
(resp.) a sequence of tokens (str/unicode), using the vocabulary and added tokens.
|
(resp.) a sequence of tokens (str/unicode), using the vocabulary and added tokens.
|
||||||
|
|||||||
@@ -754,32 +754,59 @@ class XLMTokenizer(PreTrainedTokenizer):
|
|||||||
out_string = ''.join(tokens).replace('</w>', ' ').strip()
|
out_string = ''.join(tokens).replace('</w>', ' ').strip()
|
||||||
return out_string
|
return out_string
|
||||||
|
|
||||||
def add_special_tokens_single_sequence(self, token_ids):
|
def build_inputs_with_special_tokens(self, token_ids_0, token_ids_1=None):
|
||||||
"""
|
"""
|
||||||
Adds special tokens to a sequence for sequence classification tasks.
|
Build model inputs from a sequence or a pair of sequence for sequence classification tasks
|
||||||
An XLM sequence has the following format: [CLS] X [SEP]
|
by concatenating and adding special tokens.
|
||||||
"""
|
An XLM sequence has the following format:
|
||||||
return [self.cls_token_id] + token_ids + [self.sep_token_id]
|
single sequence: <s> X </s>
|
||||||
|
pair of sequences: <s> A </s> B </s>
|
||||||
def add_special_tokens_sequence_pair(self, token_ids_0, token_ids_1):
|
|
||||||
"""
|
|
||||||
Adds special tokens to a sequence pair for sequence classification tasks.
|
|
||||||
An XLM sequence pair has the following format: [CLS] A [SEP] B [SEP]
|
|
||||||
"""
|
"""
|
||||||
|
if token_ids_1 is None:
|
||||||
|
return [self.cls_token_id] + token_ids_0 + [self.sep_token_id]
|
||||||
sep = [self.sep_token_id]
|
sep = [self.sep_token_id]
|
||||||
cls = [self.cls_token_id]
|
cls = [self.cls_token_id]
|
||||||
return cls + token_ids_0 + sep + token_ids_1 + sep
|
return cls + token_ids_0 + sep + token_ids_1 + sep
|
||||||
|
|
||||||
def create_token_type_ids_from_sequences(self, token_ids_0, token_ids_1):
|
def get_special_tokens_mask(self, token_ids_0, token_ids_1=None, already_has_special_tokens=False):
|
||||||
|
"""
|
||||||
|
Retrieves sequence ids from a token list that has no special tokens added. This method is called when adding
|
||||||
|
special tokens using the tokenizer ``prepare_for_model`` or ``encode_plus`` methods.
|
||||||
|
|
||||||
|
Args:
|
||||||
|
token_ids_0: list of ids (must not contain special tokens)
|
||||||
|
token_ids_1: Optional list of ids (must not contain special tokens), necessary when fetching sequence ids
|
||||||
|
for sequence pairs
|
||||||
|
already_has_special_tokens: (default False) Set to True if the token list is already formatted with
|
||||||
|
special tokens for the model
|
||||||
|
|
||||||
|
Returns:
|
||||||
|
A list of integers in the range [0, 1]: 1 for a special token, 0 for a sequence token.
|
||||||
|
"""
|
||||||
|
|
||||||
|
if already_has_special_tokens:
|
||||||
|
if token_ids_1 is not None:
|
||||||
|
raise ValueError("You should not supply a second sequence if the provided sequence of "
|
||||||
|
"ids is already formated with special tokens for the model.")
|
||||||
|
return list(map(lambda x: 1 if x in [self.sep_token_id, self.cls_token_id] else 0, token_ids_0))
|
||||||
|
|
||||||
|
if token_ids_1 is not None:
|
||||||
|
return [1] + ([0] * len(token_ids_0)) + [1] + ([0] * len(token_ids_1)) + [1]
|
||||||
|
return [1] + ([0] * len(token_ids_0)) + [1]
|
||||||
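Review note: the mask returned by `get_special_tokens_mask` has to line up position by position with the output of `build_inputs_with_special_tokens`. The sketch below restates the XLM-style layout from this hunk as free functions so that the alignment can be checked directly. `CLS_ID` and `SEP_ID` are hypothetical ids chosen for illustration, not values from any real vocabulary.

```python
CLS_ID = 0  # hypothetical id for the leading special token
SEP_ID = 1  # hypothetical id for the separator token


def build_inputs_with_special_tokens(token_ids_0, token_ids_1=None):
    """Lay out one or two sequences as CLS A SEP [B SEP]."""
    if token_ids_1 is None:
        return [CLS_ID] + token_ids_0 + [SEP_ID]
    return [CLS_ID] + token_ids_0 + [SEP_ID] + token_ids_1 + [SEP_ID]


def get_special_tokens_mask(token_ids_0, token_ids_1=None):
    """Return 1 at special-token positions and 0 at sequence-token positions."""
    if token_ids_1 is None:
        return [1] + [0] * len(token_ids_0) + [1]
    return [1] + [0] * len(token_ids_0) + [1] + [0] * len(token_ids_1) + [1]


inputs = build_inputs_with_special_tokens([10, 11], [20])
mask = get_special_tokens_mask([10, 11], [20])
# Every position flagged 1 should hold one of the special ids.
flagged = [tok for tok, m in zip(inputs, mask) if m == 1]
```

The returned values show that `1` marks a special token and `0` marks a sequence token.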
|
|
||||||
|
def create_token_type_ids_from_sequences(self, token_ids_0, token_ids_1=None):
|
||||||
"""
|
"""
|
||||||
Creates a mask from the two sequences passed to be used in a sequence-pair classification task.
|
Creates a mask from the two sequences passed to be used in a sequence-pair classification task.
|
||||||
An XLM sequence pair mask has the following format:
|
An XLM sequence pair mask has the following format:
|
||||||
0 0 0 0 0 0 0 0 0 0 1 1 1 1 1 1 1 1 1 1 1
|
0 0 0 0 0 0 0 0 0 0 1 1 1 1 1 1 1 1 1 1 1
|
||||||
| first sequence | second sequence
|
| first sequence | second sequence
|
||||||
|
|
||||||
|
if token_ids_1 is None, only returns the first portion of the mask (0's).
|
||||||
"""
|
"""
|
||||||
sep = [self.sep_token_id]
|
sep = [self.sep_token_id]
|
||||||
cls = [self.cls_token_id]
|
cls = [self.cls_token_id]
|
||||||
|
if token_ids_1 is None:
|
||||||
|
return len(cls + token_ids_0 + sep) * [0]
|
||||||
return len(cls + token_ids_0 + sep) * [0] + len(token_ids_1 + sep) * [1]
|
return len(cls + token_ids_0 + sep) * [0] + len(token_ids_1 + sep) * [1]
|
||||||
|
|
||||||
def save_vocabulary(self, save_directory):
|
def save_vocabulary(self, save_directory):
|
||||||
|
|||||||
@@ -181,36 +181,61 @@ class XLNetTokenizer(PreTrainedTokenizer):
|
|||||||
out_string = ''.join(tokens).replace(SPIECE_UNDERLINE, ' ').strip()
|
out_string = ''.join(tokens).replace(SPIECE_UNDERLINE, ' ').strip()
|
||||||
return out_string
|
return out_string
|
||||||
|
|
||||||
def add_special_tokens_single_sequence(self, token_ids):
|
def build_inputs_with_special_tokens(self, token_ids_0, token_ids_1=None):
|
||||||
"""
|
"""
|
||||||
Adds special tokens to a sequence for sequence classification tasks.
|
Build model inputs from a sequence or a pair of sequence for sequence classification tasks
|
||||||
An XLNet sequence has the following format: X [SEP][CLS]
|
by concatenating and adding special tokens.
|
||||||
|
An XLNet sequence has the following format:
|
||||||
|
single sequence: X <sep> <cls>
|
||||||
|
pair of sequences: A <sep> B <sep> <cls>
|
||||||
"""
|
"""
|
||||||
sep = [self.sep_token_id]
|
sep = [self.sep_token_id]
|
||||||
cls = [self.cls_token_id]
|
cls = [self.cls_token_id]
|
||||||
return token_ids + sep + cls
|
if token_ids_1 is None:
|
||||||
|
return token_ids_0 + sep + cls
|
||||||
def add_special_tokens_sequence_pair(self, token_ids_0, token_ids_1):
|
|
||||||
"""
|
|
||||||
Adds special tokens to a sequence pair for sequence classification tasks.
|
|
||||||
An XLNet sequence pair has the following format: A [SEP] B [SEP][CLS]
|
|
||||||
"""
|
|
||||||
|
|
||||||
sep = [self.sep_token_id]
|
|
||||||
cls = [self.cls_token_id]
|
|
||||||
return token_ids_0 + sep + token_ids_1 + sep + cls
|
return token_ids_0 + sep + token_ids_1 + sep + cls
|
||||||
|
|
||||||
def create_token_type_ids_from_sequences(self, token_ids_0, token_ids_1):
|
def get_special_tokens_mask(self, token_ids_0, token_ids_1=None, already_has_special_tokens=False):
|
||||||
|
"""
|
||||||
|
Retrieves sequence ids from a token list that has no special tokens added. This method is called when adding
|
||||||
|
special tokens using the tokenizer ``prepare_for_model`` or ``encode_plus`` methods.
|
||||||
|
|
||||||
|
Args:
|
||||||
|
token_ids_0: list of ids (must not contain special tokens)
|
||||||
|
token_ids_1: Optional list of ids (must not contain special tokens), necessary when fetching sequence ids
|
||||||
|
for sequence pairs
|
||||||
|
already_has_special_tokens: (default False) Set to True if the token list is already formatted with
|
||||||
|
special tokens for the model
|
||||||
|
|
||||||
|
Returns:
|
||||||
|
A list of integers in the range [0, 1]: 1 for a special token, 0 for a sequence token.
|
||||||
|
"""
|
||||||
|
|
||||||
|
if already_has_special_tokens:
|
||||||
|
if token_ids_1 is not None:
|
||||||
|
raise ValueError("You should not supply a second sequence if the provided sequence of "
|
||||||
|
"ids is already formated with special tokens for the model.")
|
||||||
|
return list(map(lambda x: 1 if x in [self.sep_token_id, self.cls_token_id] else 0, token_ids_0))
|
||||||
|
|
||||||
|
if token_ids_1 is not None:
|
||||||
|
return ([0] * len(token_ids_0)) + [1] + ([0] * len(token_ids_1)) + [1, 1]
|
||||||
|
return ([0] * len(token_ids_0)) + [1, 1]
|
||||||
|
|
||||||
|
def create_token_type_ids_from_sequences(self, token_ids_0, token_ids_1=None):
|
||||||
"""
|
"""
|
||||||
Creates a mask from the two sequences passed to be used in a sequence-pair classification task.
|
Creates a mask from the two sequences passed to be used in a sequence-pair classification task.
|
||||||
A BERT sequence pair mask has the following format:
|
A BERT sequence pair mask has the following format:
|
||||||
0 0 0 0 0 0 0 0 0 0 1 1 1 1 1 1 1 1 1 1 1 2
|
0 0 0 0 0 0 0 0 0 0 1 1 1 1 1 1 1 1 1 1 1 2
|
||||||
| first sequence | second sequence | CLS segment ID
|
| first sequence | second sequence | CLS segment ID
|
||||||
|
|
||||||
|
if token_ids_1 is None, only returns the first portion of the mask (0's).
|
||||||
"""
|
"""
|
||||||
sep = [self.sep_token_id]
|
sep = [self.sep_token_id]
|
||||||
cls = [self.cls_token_id]
|
cls = [self.cls_token_id]
|
||||||
cls_segment_id = [2]
|
cls_segment_id = [2]
|
||||||
|
|
||||||
|
if token_ids_1 is None:
|
||||||
|
return len(token_ids_0 + sep + cls) * [0]
|
||||||
return len(token_ids_0 + sep) * [0] + len(token_ids_1 + sep) * [1] + cls_segment_id
|
return len(token_ids_0 + sep) * [0] + len(token_ids_1 + sep) * [1] + cls_segment_id
|
||||||
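Review note: XLNet places its special tokens at the end of the input, so the segment ids differ from the XLM case above. The sketch below restates the new `create_token_type_ids_from_sequences` logic from this hunk as a free function. `SEP_ID` and `CLS_ID` are hypothetical placeholder ids; only the list lengths matter here.

```python
SEP_ID = 4  # hypothetical separator id
CLS_ID = 3  # hypothetical classification-token id


def create_token_type_ids_from_sequences(token_ids_0, token_ids_1=None):
    """Segment ids for the layout A SEP [B SEP] CLS, as in the hunk above."""
    sep = [SEP_ID]
    cls = [CLS_ID]
    cls_segment_id = [2]
    if token_ids_1 is None:
        # Single sequence: every position, including the trailing CLS, gets 0.
        return len(token_ids_0 + sep + cls) * [0]
    return len(token_ids_0 + sep) * [0] + len(token_ids_1 + sep) * [1] + cls_segment_id


pair_types = create_token_type_ids_from_sequences([10, 11], [20])
single_types = create_token_type_ids_from_sequences([10, 11])
```

As written in the diff, the trailing classification token gets segment id 2 only in the pair case. In the single-sequence branch it is assigned 0 along with the rest of the input.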
|
|
||||||
def save_vocabulary(self, save_directory):
|
def save_vocabulary(self, save_directory):
|
||||||
|
|||||||