Compare commits
145 Commits
| Author | SHA1 | Date | |
|---|---|---|---|
| | 9c2e0a4acf | | |
| | 7fe98d8c18 | | |
| | 89f86f9661 | | |
| | e17ea08e24 | | |
| | 2431fea98a | | |
| | d9e60f4f0d | | |
| | e84470ef81 | | |
| | 07d055f849 | | |
| | 48b438ff2a | | |
| | 69629c4f0f | | |
| | bf34a252b8 | | |
| | 528d3f327b | | |
| | 56301bd9e8 | | |
| | d6c5469712 | | |
| | 54a31f50fb | | |
| | c19b8e4ae0 | | |
| | 6dce6dda1b | | |
| | c56d921dda | | |
| | 1c5079952f | | |
| | 58b302caf3 | | |
| | 439fac723a | | |
| | 23b7138ab4 | | |
| | d688af19e5 | | |
| | 45dc04f33d | | |
| | 248314772f | | |
| | 03c2c762a6 | | |
| | 3edfa1d6aa | | |
| | f4d41fe33e | | |
| | 45de313a9e | | |
| | ade05b6cef | | |
| | e9c09052a4 | | |
| | 8fcc6507ce | | |
| | 6e3e1c959e | | |
| | 7ce83b4931 | | |
| | 9f81f1cba8 | | |
| | 7afd00a661 | | |
| | bd5363cc83 | | |
| | dc89441167 | | |
| | 320b7a7e01 | | |
| | 1615360c71 | | |
| | 6dc6c716c5 | | |
| | 904158ac4d | | |
| | 0f65d8cbbe | | |
| | f3e0218fbb | | |
| | 78ef1a9930 | | |
| | 6c1d0bc066 | | |
| | 0820bb0555 | | |
| | f5891c3821 | | |
| | 764a7923ec | | |
| | bb464289ce | | |
| | 92c0f2fb90 | | |
| | 9e136ff57c | | |
| | 7bddb45a6f | | |
| | dbed1c5d94 | | |
| | b3cfd97946 | | |
| | 81a1e12469 | | |
| | d3f24dfad7 | | |
| | ecc4f1bdfa | | |
| | c2c2ca0fdb | | |
| | 1569610f2d | | |
| | e1b2949ae6 | | |
| | e2ae9c0b73 | | |
| | aebd83230f | | |
| | 651bfb7ad5 | | |
| | 5ed50a93fb | | |
| | cc412edd42 | | |
| | 2f259b228e | | |
| | 7c789c337d | | |
| | 7af0777910 | | |
| | c1689ac301 | | |
| | 4a790c40b1 | | |
| | 6be46a6e64 | | |
| | 5f07d8f11a | | |
| | 35071007cb | | |
| | f1f23ad171 | | |
| | 2a91f6071f | | |
| | c51e533a5f | | |
| | a76c3f9cb0 | | |
| | bb9c5ead54 | | |
| | a12ab0a8db | | |
| | 4d6dfbd376 | | |
| | 23edebc079 | | |
| | cbfcfce205 | | |
| | 19e4ebbe3f | | |
| | 594202a934 | | |
| | 38084507c4 | | |
| | 2195c0d5f9 | | |
| | ebb32261b1 | | |
| | 63ed224b7c | | |
| | a95158518d | | |
| | d73957899a | | |
| | cd69bc9c87 | | |
| | 391db836ab | | |
| | 963529e29b | | |
| | f7978f70ec | | |
| | 1e4a191366 | | |
| | c50783e388 | | |
| | 6971556ab8 | | |
| | b350662955 | | |
| | f5bcde0b2f | | |
| | 5c3b32d44d | | |
| | 2dc8cb8734 | | |
| | 0a4ed7192e | | |
| | ae50ad91ea | | |
| | 60f791631b | | |
| | a6a6d9e638 | | |
| | d8b641c839 | | |
| | c6acbdd50a | | |
| | df7cd9e4e4 | | |
| | 6a17b3c51b | | |
| | 04e9a6f512 | | |
| | 9478590630 | | |
| | 795b3e76ff | | |
| | e31a472801 | | |
| | 4f2b6579bf | | |
| | ca559826c4 | | |
| | d2de5b9d8c | | |
| | d83d295763 | | |
| | f6de000305 | | |
| | 15749bfc10 | | |
| | da2e47ad15 | | |
| | 528c288fa9 | | |
| | 702f589848 | | |
| | 22d2fded2c | | |
| | fc9faa8a47 | | |
| | ecfddc6034 | | |
| | 93f0c5fc72 | | |
| | 6c3b131516 | | |
| | f83b35b77d | | |
| | 4e63c90720 | | |
| | 7e957237e4 | | |
| | 302a4813a5 | | |
| | f71a4577b8 | | |
| | a3e0dbba95 | | |
| | 0f92f76ca3 | | |
| | 4094958df2 | | |
| | 7d8b395afa | | |
| | 927904bc91 | | |
| | 294edfd83d | | |
| | de5e4864cb | | |
| | e4e35296fb | | |
| | 4b543c3007 | | |
| | 2e6797cc7d | | |
| | f0340eccf9 | | |
| | ec94f4e0f8 | | |
@@ -81,7 +81,6 @@ jobs:
|
||||
- checkout
|
||||
- run: sudo pip install --progress-bar off -r docs/requirements.txt
|
||||
- run: sudo pip install --progress-bar off -r requirements.txt
|
||||
- run: cd docs/source && ln -s ../../examples/README.md examples.md && cd -
|
||||
- run: cd docs && make clean && make html && scp -r -oStrictHostKeyChecking=no _build/html/* $doc:$dir
|
||||
workflow_filters: &workflow_filters
|
||||
filters:
|
||||
|
||||
23
.github/ISSUE_TEMPLATE/--new-model-addition.md
vendored
Normal file
@@ -0,0 +1,23 @@
|
||||
---
|
||||
name: "\U0001F31FNew model addition"
|
||||
about: Submit a proposal/request to implement a new Transformer-based model
|
||||
title: ''
|
||||
labels: ''
|
||||
assignees: ''
|
||||
|
||||
---
|
||||
|
||||
# 🌟New model addition
|
||||
|
||||
## Model description
|
||||
|
||||
<!-- Important information -->
|
||||
|
||||
## Open Source status
|
||||
|
||||
* [ ] the model implementation is available: (give details)
|
||||
* [ ] the model weights are available: (give details)
|
||||
|
||||
## Additional context
|
||||
|
||||
<!-- Add any other context about the problem here. -->
|
||||
6
.github/ISSUE_TEMPLATE/bug-report.md
vendored
@@ -1,6 +1,10 @@
|
||||
---
|
||||
name: "\U0001F41B Bug Report"
|
||||
about: Submit a bug report to help us improve PyTorch Transformers
|
||||
title: ''
|
||||
labels: ''
|
||||
assignees: ''
|
||||
|
||||
---
|
||||
|
||||
## 🐛 Bug
|
||||
@@ -45,4 +49,4 @@ Steps to reproduce the behavior:
|
||||
|
||||
## Additional context
|
||||
|
||||
<!-- Add any other context about the problem here. -->
|
||||
<!-- Add any other context about the problem here. -->
|
||||
|
||||
6
.github/ISSUE_TEMPLATE/feature-request.md
vendored
@@ -1,6 +1,10 @@
|
||||
---
|
||||
name: "\U0001F680 Feature Request"
|
||||
about: Submit a proposal/request for a new PyTorch Transformers feature
|
||||
title: ''
|
||||
labels: ''
|
||||
assignees: ''
|
||||
|
||||
---
|
||||
|
||||
## 🚀 Feature
|
||||
@@ -13,4 +17,4 @@ about: Submit a proposal/request for a new PyTorch Transformers feature
|
||||
|
||||
## Additional context
|
||||
|
||||
<!-- Add any other context or screenshots about the feature request here. -->
|
||||
<!-- Add any other context or screenshots about the feature request here. -->
|
||||
|
||||
6
.github/ISSUE_TEMPLATE/migration.md
vendored
@@ -1,6 +1,10 @@
|
||||
---
|
||||
name: "\U0001F4DA Migration from PyTorch-pretrained-Bert"
|
||||
about: Report a problem when migrating from PyTorch-pretrained-Bert to Transformers
|
||||
title: ''
|
||||
labels: ''
|
||||
assignees: ''
|
||||
|
||||
---
|
||||
|
||||
## 📚 Migration
|
||||
@@ -40,4 +44,4 @@ Details of the issue:
|
||||
|
||||
## Additional context
|
||||
|
||||
<!-- Add any other context about the problem here. -->
|
||||
<!-- Add any other context about the problem here. -->
|
||||
|
||||
6
.github/ISSUE_TEMPLATE/question-help.md
vendored
@@ -1,8 +1,12 @@
|
||||
---
|
||||
name: "❓Questions & Help"
|
||||
about: Start a general discussion related to PyTorch Transformers
|
||||
title: ''
|
||||
labels: ''
|
||||
assignees: ''
|
||||
|
||||
---
|
||||
|
||||
## ❓ Questions & Help
|
||||
|
||||
<!-- A clear and concise description of the question. -->
|
||||
<!-- A clear and concise description of the question. -->
|
||||
|
||||
8
.gitignore
vendored
@@ -118,6 +118,9 @@ dmypy.json
|
||||
# vscode
|
||||
.vscode
|
||||
|
||||
# Pycharm
|
||||
.idea
|
||||
|
||||
# TF code
|
||||
tensorflow_code
|
||||
|
||||
@@ -131,4 +134,7 @@ examples/runs
|
||||
|
||||
# data
|
||||
/data
|
||||
serialization_dir
|
||||
serialization_dir
|
||||
|
||||
# emacs
|
||||
*.*~
|
||||
175
CONTRIBUTING.md
Normal file
@@ -0,0 +1,175 @@
# How to contribute to transformers?

Everyone is welcome to contribute, and we value everybody's contribution. Code
is thus not the only way to help the community. Answering questions, helping
others, reaching out and improving the documentation are immensely valuable to
the community.

It also helps us if you spread the word: reference the library from blog posts
on the awesome projects it made possible, shout out on Twitter every time it has
helped you, or simply star the repo to say "thank you".

## You can contribute in so many ways!

There are 4 ways you can contribute to transformers:
* Fixing outstanding issues with the existing code;
* Implementing new models;
* Contributing to the examples or to the documentation;
* Submitting issues related to bugs or desired new features.

*All are equally valuable to the community.*

## Submitting a new issue or feature request

Do your best to follow these guidelines when submitting an issue or a feature
request. It will make it easier for us to come back to you quickly and with good
feedback.

### Did you find a bug?

The transformers library is robust and reliable thanks to the users who notify us of
the problems they encounter. So thank you for reporting an issue.

First, we would really appreciate it if you could **make sure the bug was not
already reported** (use the search bar on GitHub under Issues).

Did not find it? :( So we can act quickly on it, please follow these steps:

* Include your **OS type and version**, the versions of **Python**, **PyTorch** and
  **TensorFlow** when applicable;
* A short, self-contained code snippet that allows us to reproduce the bug in
  less than 30s;
* Provide the *full* traceback if an exception is raised.

To get the OS and software versions, execute the following code and copy-paste
the output:

```python
import platform; print("Platform", platform.platform())
import sys; print("Python", sys.version)
import torch; print("PyTorch", torch.__version__)
import tensorflow; print("Tensorflow", tensorflow.__version__)
```
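Since PyTorch and TensorFlow are each optional, the snippet above stops with an `ImportError` when one of them is missing. The following is a minimal sketch of a more tolerant variant; the helper name `framework_version` is hypothetical and is not part of the library.

```python
import importlib
import platform
import sys


def framework_version(module_name):
    """Return the module's version string, or a placeholder if it cannot be imported."""
    try:
        module = importlib.import_module(module_name)
    except ImportError:
        return "not installed"
    # Fall back to a placeholder for modules that expose no __version__ attribute.
    return getattr(module, "__version__", "unknown")


print("Platform", platform.platform())
print("Python", sys.version)
print("PyTorch", framework_version("torch"))
print("Tensorflow", framework_version("tensorflow"))
```

The output can be pasted into a bug report as-is; a framework that is absent is reported as `not installed` instead of aborting the script.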
### Do you want to implement a new model?

Awesome! Please provide the following information:

* Short description of the model and link to the paper;
* Link to the implementation if it is open-source;
* Link to the model weights if they are available.

If you are willing to contribute the model yourself, let us know so we can best
guide you.

### Do you want a new feature (that is not a model)?

A world-class feature request addresses the following points:

1. Motivation first:
   * Is it related to a problem/frustration with the library? If so, please explain
     why. Providing a code snippet that demonstrates the problem is best.
   * Is it related to something you would need for a project? We'd love to hear
     about it!
   * Is it something you worked on and think could benefit the community?
     Awesome! Tell us what problem it solved for you.
2. Write a *full paragraph* describing the feature;
3. Provide a **code snippet** that demonstrates its future use;
4. In case this is related to a paper, please attach a link;
5. Attach any additional information (drawings, screenshots, etc.) you think may help.

If your issue is well written we're already 80% of the way there by the time you
post it.

## Start contributing! (Pull Requests)

Before writing code, we strongly advise you to search through the existing PRs or
issues to make sure that nobody is already working on the same thing. If you are
unsure, it is always a good idea to open an issue to get some feedback.

You will need basic `git` proficiency to be able to contribute to
`transformers`. `git` is not the easiest tool to use but it has the greatest
manual. Type `git --help` in a shell and enjoy. If you prefer books, [Pro
Git](https://git-scm.com/book/en/v2) is a very good reference.

Follow these steps to start contributing:

1. Fork the [repository](https://github.com/huggingface/transformers) by
   clicking on the 'Fork' button on the repository's page. This creates a copy of the code
   under your GitHub user account.

2. Clone your fork to your local disk, and add the base repository as a remote:

   ```bash
   $ git clone git@github.com:<your Github handle>/transformers.git
   $ cd transformers
   $ git remote add upstream git@github.com:huggingface/transformers.git
   ```

3. Create a new branch to hold your development changes:

   ```bash
   $ git checkout -b a-descriptive-name-for-my-changes
   ```

   **Do not** work on the `master` branch.

4. Set up a development environment by running the following command in a virtual environment:

   ```bash
   $ pip install -r requirements-dev.txt
   ```

5. Develop the features on your branch. Add changed files using `git add` and
   then `git commit` to record your changes locally:

   ```bash
   $ git add modified_file.py
   $ git commit
   ```

   Please write [good commit
   messages](https://chris.beams.io/posts/git-commit/). It
   is a good idea to sync your copy of the code with the original repository
   regularly. This way you can quickly account for changes:

   ```bash
   $ git fetch upstream
   $ git rebase upstream/master
   ```

   Push the changes to your account using:

   ```bash
   $ git push -u origin a-descriptive-name-for-my-changes
   ```

6. Once you are satisfied (**and the checklist below is happy too**), go to the
   webpage of your fork on GitHub. Click on 'Pull request' to send your changes
   to the project maintainers for review.

7. It's ok if maintainers ask you for changes. It happens to core contributors
   too! So everyone can see the changes in the Pull request, work in your local
   branch and push the changes to your fork. They will automatically appear in
   the pull request.

### Checklist

1. The title of your pull request should be a summary of its contribution;
2. If your pull request addresses an issue, please mention the issue number in
   the pull request description to make sure they are linked (and people
   consulting the issue know you are working on it);
3. To indicate a work in progress please prefix the title with `[WIP]`. These
   are useful to avoid duplicated work, and to differentiate it from PRs ready
   to be merged;
4. Make sure pre-existing tests still pass;
5. Add high-coverage tests. No quality test, no merge;
6. All public methods must have informative docstrings.

### Style guide

For documentation strings, `transformers` follows the [Google
style](https://google.github.io/styleguide/pyguide.html).
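As an illustration of that docstring convention, here is a minimal sketch of a Google-style docstring. The function `truncate_sequence` is a hypothetical example written for this guide, not a function from the library.

```python
def truncate_sequence(token_ids, max_length):
    """Truncate a sequence of token ids to a maximum length.

    Args:
        token_ids (list of int): Token ids to truncate.
        max_length (int): Maximum number of ids to keep. Must be non-negative.

    Returns:
        list of int: The first ``max_length`` ids of ``token_ids``.

    Raises:
        ValueError: If ``max_length`` is negative.
    """
    if max_length < 0:
        raise ValueError("max_length must be non-negative")
    return list(token_ids[:max_length])
```

The `Args`, `Returns` and `Raises` sections are the parts of the Google style that reviewers are most likely to look for on public methods.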
#### This guide was heavily inspired by the awesome [scikit-learn guide to contributing](https://github.com/scikit-learn/scikit-learn/blob/master/CONTRIBUTING.md)
102
README.md
@@ -4,7 +4,7 @@
|
||||
<br>
|
||||
<p>
|
||||
<p align="center">
|
||||
<a href="https://github.com/huggingface/transformers/blob/master/LICENSE">
|
||||
<a href="https://circleci.com/gh/huggingface/transformers">
|
||||
<img alt="Build" src="https://img.shields.io/circleci/build/github/huggingface/transformers/master">
|
||||
</a>
|
||||
<a href="https://github.com/huggingface/transformers/blob/master/LICENSE">
|
||||
@@ -22,7 +22,7 @@
|
||||
<p>State-of-the-art Natural Language Processing for TensorFlow 2.0 and PyTorch
|
||||
</h3>
|
||||
|
||||
🤗 Transformers (formerly known as `pytorch-transformers` and `pytorch-pretrained-bert`) provides state-of-the-art general-purpose architectures (BERT, GPT-2, RoBERTa, XLM, DistilBert, XLNet...) for Natural Language Understanding (NLU) and Natural Language Generation (NLG) with over 32+ pretrained models in 100+ languages and deep interoperability between TensorFlow 2.0 and PyTorch.
|
||||
🤗 Transformers (formerly known as `pytorch-transformers` and `pytorch-pretrained-bert`) provides state-of-the-art general-purpose architectures (BERT, GPT-2, RoBERTa, XLM, DistilBert, XLNet, CTRL...) for Natural Language Understanding (NLU) and Natural Language Generation (NLG) with over 32+ pretrained models in 100+ languages and deep interoperability between TensorFlow 2.0 and PyTorch.
|
||||
|
||||
### Features
|
||||
|
||||
@@ -54,19 +54,22 @@ Choose the right framework for every part of a model's lifetime
|
||||
| [Model architectures](#model-architectures) | Architectures (with pretrained weights) |
|
||||
| [Online demo](#online-demo) | Experimenting with this repo’s text generation capabilities |
|
||||
| [Quick tour: Usage](#quick-tour) | Tokenizers & models usage: Bert and GPT-2 |
|
||||
| [Quick tour: TF 2.0 and PyTorch ](#Quick-tour-TF-2.0-training-and-PyTorch-interoperability) | Train a TF 2.0 model in 10 lines of code, load it in PyTorch |
|
||||
| [Quick tour: TF 2.0 and PyTorch ](#Quick-tour-TF-20-training-and-PyTorch-interoperability) | Train a TF 2.0 model in 10 lines of code, load it in PyTorch |
|
||||
| [Quick tour: Fine-tuning/usage scripts](#quick-tour-of-the-fine-tuningusage-scripts) | Using provided scripts: GLUE, SQuAD and Text generation |
|
||||
| [Migrating from pytorch-transformers to transformers](#Migrating-from-pytorch-transformers-to-transformers) | Migrating your code from pytorch-pretrained-bert to transformers |
|
||||
| [Migrating from pytorch-transformers to transformers](#Migrating-from-pytorch-transformers-to-transformers) | Migrating your code from pytorch-transformers to transformers |
|
||||
| [Migrating from pytorch-pretrained-bert to pytorch-transformers](#Migrating-from-pytorch-pretrained-bert-to-transformers) | Migrating your code from pytorch-pretrained-bert to transformers |
|
||||
| [Documentation](https://huggingface.co/transformers/) | Full API documentation and more |
|
||||
|
||||
## Installation
|
||||
|
||||
This repo is tested on Python 2.7 and 3.5+ (examples are tested only on python 3.5+) and PyTorch 1.0.0+
|
||||
This repo is tested on Python 2.7 and 3.5+ (examples are tested only on python 3.5+), PyTorch 1.0.0+ and TensorFlow 2.0.0-rc1
|
||||
|
||||
### With pip
|
||||
|
||||
Transformers can be installed by pip as follows:
|
||||
First you need to install one of, or both, TensorFlow 2.0 and PyTorch.
|
||||
Please refer to [TensorFlow installation page](https://www.tensorflow.org/install/pip#tensorflow-2.0-rc-is-available) and/or [PyTorch installation page](https://pytorch.org/get-started/locally/#start-locally) regarding the specific install command for your platform.
|
||||
|
||||
When TensorFlow 2.0 and/or PyTorch has been installed, 🤗 Transformers can be installed using pip as follows:
|
||||
|
||||
```bash
|
||||
pip install transformers
|
||||
@@ -74,7 +77,10 @@ pip install transformers
|
||||
|
||||
### From source
|
||||
|
||||
Clone the repository and run:
|
||||
Here also, you first need to install one of, or both, TensorFlow 2.0 and PyTorch.
|
||||
Please refer to [TensorFlow installation page](https://www.tensorflow.org/install/pip#tensorflow-2.0-rc-is-available) and/or [PyTorch installation page](https://pytorch.org/get-started/locally/#start-locally) regarding the specific install command for your platform.
|
||||
|
||||
When TensorFlow 2.0 and/or PyTorch has been installed, you can install from source by cloning the repository and running:
|
||||
|
||||
```bash
|
||||
pip install [--editable] .
|
||||
@@ -82,10 +88,12 @@ pip install [--editable] .
|
||||
|
||||
### Tests
|
||||
|
||||
A series of tests is included for the library and the example scripts. Library tests can be found in the [tests folder](https://github.com/huggingface/transformers/tree/master/transformers/tests) and examples tests in the [examples folder](https://github.com/huggingface/transformers/tree/master/examples).
|
||||
A series of tests are included for the library and the example scripts. Library tests can be found in the [tests folder](https://github.com/huggingface/transformers/tree/master/transformers/tests) and examples tests in the [examples folder](https://github.com/huggingface/transformers/tree/master/examples).
|
||||
|
||||
These tests can be run using `pytest` (install pytest if needed with `pip install pytest`).
|
||||
|
||||
Depending on which framework is installed (TensorFlow 2.0 and/or PyTorch), the irrelevant tests will be skipped. Ensure that both frameworks are installed if you want to execute all tests.
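One way such framework-dependent skipping can be expressed is sketched below. This is an illustration under stated assumptions, not necessarily how the library's own test suite does it: the helper `is_available` and the test class are hypothetical, and `importlib.util.find_spec` requires Python 3.4+.

```python
import importlib.util
import unittest


def is_available(module_name):
    """Return True if `module_name` can be imported in this environment."""
    return importlib.util.find_spec(module_name) is not None


@unittest.skipUnless(is_available("torch"), "PyTorch is not installed")
class PyTorchOnlyTest(unittest.TestCase):
    def test_tensor_shape(self):
        import torch  # imported lazily so this file still loads without PyTorch

        self.assertEqual(torch.zeros(2, 3).shape, (2, 3))
```

`pytest` collects `unittest.TestCase` classes, so a class decorated this way is reported as skipped rather than failed when the framework is absent.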
|
||||
|
||||
You can run the tests from the root of the cloned repository with the commands:
|
||||
|
||||
```bash
|
||||
@@ -97,10 +105,9 @@ python -m pytest -sv ./examples/
|
||||
|
||||
You should check out our [`swift-coreml-transformers`](https://github.com/huggingface/swift-coreml-transformers) repo.
|
||||
|
||||
It contains an example of a conversion script from a Pytorch trained Transformer model (here, `GPT-2`) to a CoreML model that runs on iOS devices.
|
||||
It contains a set of tools to convert PyTorch or TensorFlow 2.0 trained Transformer models (currently contains `GPT-2`, `DistilGPT-2`, `BERT`, and `DistilBERT`) to CoreML models that run on iOS devices.
|
||||
|
||||
At some point in the future, you'll be able to seamlessly move from pre-training or fine-tuning models in PyTorch to productizing them in CoreML,
|
||||
or prototype a model or an app in CoreML then research its hyperparameters or architecture from PyTorch. Super exciting!
|
||||
At some point in the future, you'll be able to seamlessly move from pre-training or fine-tuning models to productizing them in CoreML, or prototype a model or an app in CoreML then research its hyperparameters or architecture from TensorFlow 2.0 and/or PyTorch. Super exciting!
|
||||
|
||||
## Model architectures
|
||||
|
||||
@@ -113,8 +120,8 @@ or prototype a model or an app in CoreML then research its hyperparameters or ar
|
||||
5. **[XLNet](https://github.com/zihangdai/xlnet/)** (from Google/CMU) released with the paper [XLNet: Generalized Autoregressive Pretraining for Language Understanding](https://arxiv.org/abs/1906.08237) by Zhilin Yang*, Zihang Dai*, Yiming Yang, Jaime Carbonell, Ruslan Salakhutdinov, Quoc V. Le.
|
||||
6. **[XLM](https://github.com/facebookresearch/XLM/)** (from Facebook) released together with the paper [Cross-lingual Language Model Pretraining](https://arxiv.org/abs/1901.07291) by Guillaume Lample and Alexis Conneau.
|
||||
7. **[RoBERTa](https://github.com/pytorch/fairseq/tree/master/examples/roberta)** (from Facebook), released together with the paper a [Robustly Optimized BERT Pretraining Approach](https://arxiv.org/abs/1907.11692) by Yinhan Liu, Myle Ott, Naman Goyal, Jingfei Du, Mandar Joshi, Danqi Chen, Omer Levy, Mike Lewis, Luke Zettlemoyer, Veselin Stoyanov.
|
||||
8. **[DistilBERT](https://github.com/huggingface/transformers/tree/master/examples/distillation)** (from HuggingFace), released together with the blogpost [Smaller, faster, cheaper, lighter: Introducing DistilBERT, a distilled version of BERT](https://medium.com/huggingface/distilbert-8cf3380435b5
|
||||
) by Victor Sanh, Lysandre Debut and Thomas Wolf.
|
||||
8. **[DistilBERT](https://github.com/huggingface/transformers/tree/master/examples/distillation)** (from HuggingFace), released together with the paper [DistilBERT, a distilled version of BERT: smaller, faster, cheaper and lighter](https://arxiv.org/abs/1910.01108) by Victor Sanh, Lysandre Debut and Thomas Wolf. The same method has been applied to compress GPT2 into [DistilGPT2](https://github.com/huggingface/transformers/tree/master/examples/distillation).
|
||||
9. **[CTRL](https://github.com/salesforce/ctrl/)** (from Salesforce) released with the paper [CTRL: A Conditional Transformer Language Model for Controllable Generation](https://arxiv.org/abs/1909.05858) by Nitish Shirish Keskar*, Bryan McCann*, Lav R. Varshney, Caiming Xiong and Richard Socher.
|
||||
|
||||
These implementations have been tested on several datasets (see the example scripts) and should match the performance of the original implementations (e.g. ~93 F1 on SQuAD for BERT Whole-Word-Masking, ~88 F1 on RocStories for OpenAI GPT, ~18.3 perplexity on WikiText 103 for Transformer-XL, ~0.916 Pearson R coefficient on STS-B for XLNet). You can find more details on performance in the Examples section of the [documentation](https://huggingface.co/transformers/examples.html).
|
||||
|
||||
@@ -141,6 +148,7 @@ from transformers import *
|
||||
MODELS = [(BertModel, BertTokenizer, 'bert-base-uncased'),
|
||||
(OpenAIGPTModel, OpenAIGPTTokenizer, 'openai-gpt'),
|
||||
(GPT2Model, GPT2Tokenizer, 'gpt2'),
|
||||
(CTRLModel, CTRLTokenizer, 'ctrl'),
|
||||
(TransfoXLModel, TransfoXLTokenizer, 'transfo-xl-wt103'),
|
||||
(XLNetModel, XLNetTokenizer, 'xlnet-base-cased'),
|
||||
(XLMModel, XLMTokenizer, 'xlm-mlm-enfr-1024'),
|
||||
@@ -173,24 +181,24 @@ for model_class in BERT_MODEL_CLASSES:
|
||||
# Load pretrained model/tokenizer
|
||||
model = model_class.from_pretrained('bert-base-uncased')
|
||||
|
||||
# Models can return full list of hidden-states & attentions weights at each layer
|
||||
model = model_class.from_pretrained(pretrained_weights,
|
||||
output_hidden_states=True,
|
||||
output_attentions=True)
|
||||
input_ids = torch.tensor([tokenizer.encode("Let's see all hidden-states and attentions on this text")])
|
||||
all_hidden_states, all_attentions = model(input_ids)[-2:]
|
||||
# Models can return full list of hidden-states & attentions weights at each layer
|
||||
model = model_class.from_pretrained(pretrained_weights,
|
||||
output_hidden_states=True,
|
||||
output_attentions=True)
|
||||
input_ids = torch.tensor([tokenizer.encode("Let's see all hidden-states and attentions on this text")])
|
||||
all_hidden_states, all_attentions = model(input_ids)[-2:]
|
||||
|
||||
# Models are compatible with Torchscript
|
||||
model = model_class.from_pretrained(pretrained_weights, torchscript=True)
|
||||
traced_model = torch.jit.trace(model, (input_ids,))
|
||||
# Models are compatible with Torchscript
|
||||
model = model_class.from_pretrained(pretrained_weights, torchscript=True)
|
||||
traced_model = torch.jit.trace(model, (input_ids,))
|
||||
|
||||
# Simple serialization for models and tokenizers
|
||||
model.save_pretrained('./directory/to/save/') # save
|
||||
model = model_class.from_pretrained('./directory/to/save/') # re-load
|
||||
tokenizer.save_pretrained('./directory/to/save/') # save
|
||||
tokenizer = tokenizer_class.from_pretrained('./directory/to/save/') # re-load
|
||||
# Simple serialization for models and tokenizers
|
||||
model.save_pretrained('./directory/to/save/') # save
|
||||
model = model_class.from_pretrained('./directory/to/save/') # re-load
|
||||
tokenizer.save_pretrained('./directory/to/save/') # save
|
||||
tokenizer = BertTokenizer.from_pretrained('./directory/to/save/') # re-load
|
||||
|
||||
# SOTA examples for GLUE, SQUAD, text generation...
|
||||
# SOTA examples for GLUE, SQUAD, text generation...
|
||||
```
|
||||
|
||||
## Quick tour TF 2.0 training and PyTorch interoperability
|
||||
@@ -200,7 +208,7 @@ Let's do a quick example of how a TensorFlow 2.0 model can be trained in 12 line
|
||||
```python
|
||||
import tensorflow as tf
|
||||
import tensorflow_datasets
|
||||
from pytorch_transformers import *
|
||||
from transformers import *
|
||||
|
||||
# Load dataset, tokenizer, model from pretrained model/vocabulary
|
||||
tokenizer = BertTokenizer.from_pretrained('bert-base-cased')
|
||||
@@ -208,8 +216,8 @@ model = TFBertForSequenceClassification.from_pretrained('bert-base-cased')
|
||||
data = tensorflow_datasets.load('glue/mrpc')
|
||||
|
||||
# Prepare dataset for GLUE as a tf.data.Dataset instance
|
||||
train_dataset = glue_convert_examples_to_features(data['train'], tokenizer, 128, 'mrpc')
|
||||
valid_dataset = glue_convert_examples_to_features(data['validation'], tokenizer, 128, 'mrpc')
|
||||
train_dataset = glue_convert_examples_to_features(data['train'], tokenizer, max_length=128, task='mrpc')
|
||||
valid_dataset = glue_convert_examples_to_features(data['validation'], tokenizer, max_length=128, task='mrpc')
|
||||
train_dataset = train_dataset.shuffle(100).batch(32).repeat(2)
|
||||
valid_dataset = valid_dataset.batch(64)
|
||||
|
||||
@@ -246,7 +254,7 @@ The library comprises several example scripts with SOTA performances for NLU and
|
||||
|
||||
- `run_glue.py`: an example fine-tuning Bert, XLNet and XLM on nine different GLUE tasks (*sequence-level classification*)
|
||||
- `run_squad.py`: an example fine-tuning Bert, XLNet and XLM on the question answering dataset SQuAD 2.0 (*token-level classification*)
|
||||
- `run_generation.py`: an example using GPT, GPT-2, Transformer-XL and XLNet for conditional language generation
|
||||
- `run_generation.py`: an example using GPT, GPT-2, CTRL, Transformer-XL and XLNet for conditional language generation
|
||||
- other model-specific examples (see the documentation).
|
||||
|
||||
Here are three quick usage examples for these scripts:
|
||||
@@ -384,10 +392,10 @@ python $SQUAD_DIR/evaluate-v1.1.py $SQUAD_DIR/dev-v1.1.json ../models/wwm_uncase
|
||||
|
||||
This is the model provided as `bert-large-uncased-whole-word-masking-finetuned-squad`.
|
||||
|
||||
### `run_generation.py`: Text generation with GPT, GPT-2, Transformer-XL and XLNet
|
||||
### `run_generation.py`: Text generation with GPT, GPT-2, CTRL, Transformer-XL and XLNet
|
||||
|
||||
A conditional generation script is also included to generate text from a prompt.
|
||||
The generation script includes the [tricks](https://github.com/rusiaaman/XLNet-gen#methodology) proposed by Aman Rusia to get high quality generation with memory models like Transformer-XL and XLNet (include a predefined text to make short inputs longer).
|
||||
The generation script includes the [tricks](https://github.com/rusiaaman/XLNet-gen#methodology) proposed by Aman Rusia to get high-quality generation with memory models like Transformer-XL and XLNet (include a predefined text to make short inputs longer).
|
||||
|
||||
Here is how to run the script with the small version of OpenAI GPT-2 model:
|
||||
|
||||
@@ -398,6 +406,16 @@ python ./examples/run_generation.py \
|
||||
--model_name_or_path=gpt2 \
|
||||
```
|
||||
|
||||
and from the Salesforce CTRL model:
|
||||
```shell
|
||||
python ./examples/run_generation.py \
|
||||
--model_type=ctrl \
|
||||
--length=20 \
|
||||
--model_name_or_path=ctrl \
|
||||
--temperature=0 \
|
||||
--repetition_penalty=1.2 \
|
||||
```
|
||||
|
||||
## Migrating from pytorch-transformers to transformers
|
||||
|
||||
Here is a quick summary of what you should take care of when migrating from `pytorch-transformers` to `transformers`.
|
||||
@@ -417,9 +435,9 @@ Here is a quick summary of what you should take care of when migrating from `pyt
|
||||
|
||||
### Models always output `tuples`
|
||||
|
||||
The main breaking change when migrating from `pytorch-pretrained-bert` to `transformers` is that the models forward method always outputs a `tuple` with various elements depending on the model and the configuration parameters.
|
||||
The main breaking change when migrating from `pytorch-pretrained-bert` to `transformers` is that every model's forward method always outputs a `tuple` with various elements depending on the model and the configuration parameters.
|
||||
|
||||
The exact content of the tuples for each model are detailed in the models' docstrings and the [documentation](https://huggingface.co/transformers/).
|
||||
The exact content of the tuples for each model is detailed in the models' docstrings and the [documentation](https://huggingface.co/transformers/).
|
||||
|
||||
In pretty much every case, you will be fine by taking the first element of the output as the output you previously used in `pytorch-pretrained-bert`.
|
||||
|
||||
@@ -445,13 +463,17 @@ outputs = model(input_ids, labels=labels)
|
||||
loss, logits, attentions = outputs
|
||||
```
|
||||
|
||||
### Using hidden states
|
||||
|
||||
By enabling the configuration option `output_hidden_states`, it was possible to retrieve the last hidden states of the encoder. In `pytorch-transformers` as well as `transformers` the return value has changed slightly: `all_hidden_states` now also includes the hidden state of the embeddings in addition to those of the encoding layers. This allows users to easily access the embeddings' final state.
|
||||
|
||||
### Serialization
|
||||
|
||||
Breaking change in the `from_pretrained()`method:
|
||||
Breaking change in the `from_pretrained()` method:
|
||||
|
||||
1. Models are now set in evaluation mode by default when instantiated with the `from_pretrained()` method. To train them don't forget to set them back in training mode (`model.train()`) to activate the dropout modules.
|
||||
1. Models are now set in evaluation mode by default when instantiated with the `from_pretrained()` method. To train them, don't forget to set them back in training mode (`model.train()`) to activate the dropout modules.
|
||||
|
||||
2. The additional `*input` and `**kwargs` arguments supplied to the `from_pretrained()` method used to be directly passed to the underlying model's class `__init__()` method. They are now used to update the model configuration attribute instead which can break derived model classes build based on the previous `BertForSequenceClassification` examples. We are working on a way to mitigate this breaking change in [#866](https://github.com/huggingface/transformers/pull/866) by forwarding the the model `__init__()` method (i) the provided positional arguments and (ii) the keyword arguments which do not match any configuration class attributes.
|
||||
2. The additional `*input` and `**kwargs` arguments supplied to the `from_pretrained()` method used to be directly passed to the underlying model's class `__init__()` method. They are now used to update the model configuration attribute instead, which can break derived model classes built based on the previous `BertForSequenceClassification` examples. We are working on a way to mitigate this breaking change in [#866](https://github.com/huggingface/transformers/pull/866) by forwarding to the model's `__init__()` method (i) the provided positional arguments and (ii) the keyword arguments which do not match any configuration class attributes.
|
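The keyword-argument routing described in item 2 can be sketched in plain Python. This is an illustration only: `ToyConfig` and `split_kwargs` are hypothetical names that are not part of the library, and the actual logic proposed in #866 may differ.

```python
class ToyConfig:
    """Hypothetical stand-in for a model configuration class."""

    def __init__(self, num_labels=2, output_attentions=False):
        self.num_labels = num_labels
        self.output_attentions = output_attentions


def split_kwargs(config, kwargs):
    """Update `config` with the keys it knows and return the leftover kwargs."""
    unused = {}
    for key, value in kwargs.items():
        if hasattr(config, key):
            # The key matches a configuration attribute: update the configuration.
            setattr(config, key, value)
        else:
            # Unknown key: keep it so it can be forwarded to the model's __init__().
            unused[key] = value
    return unused


config = ToyConfig()
model_kwargs = split_kwargs(config, {"num_labels": 5, "my_custom_flag": True})
print(config.num_labels)  # 5
print(model_kwargs)       # {'my_custom_flag': True}
```

In this sketch, a derived model class that expects `my_custom_flag` in its `__init__()` would still receive it, while `num_labels` only updates the configuration.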
||||
|
||||
Also, while not a breaking change, the serialization methods have been standardized and you probably should switch to the new method `save_pretrained(save_directory)` if you were using any other serialization method before.
|
||||
|
||||
@@ -523,4 +545,4 @@ for batch in train_data:
|
||||
|
||||
## Citation
|
||||
|
||||
At the moment, there is no paper associated to Transformers but we are working on preparing one. In the meantime, please include a mention of the library and a link to the present repository if you use this work in a published or open-source project.
|
||||
At the moment, there is no paper associated with Transformers but we are working on preparing one. In the meantime, please include a mention of the library and a link to the present repository if you use this work in a published or open-source project.
|
||||
|
||||
@@ -34,11 +34,11 @@ pip install recommonmark
|
||||
|
||||
## Building the documentation
|
||||
|
||||
Make sure that there is a symlink from the `example` file (in /examples) inside the source folder. Run the followig
|
||||
Make sure that there is a symlink from the `example` file (in /examples) inside the source folder. Run the following
|
||||
command to generate it:
|
||||
|
||||
```bash
|
||||
ln -s ../../examples/README.md source/examples.md
|
||||
ln -s ../../examples/README.md examples.md
|
||||
```
|
||||
|
||||
Once you have set up `sphinx`, you can build the documentation by running the following command in the `/docs` folder:
|
||||
@@ -50,7 +50,7 @@ make html
|
||||
---
|
||||
**NOTE**
|
||||
|
||||
If you are adding/removing elements from the toc-tree or from any strutural item, it is recommended to clean the build
|
||||
If you are adding/removing elements from the toc-tree or from any structural item, it is recommended to clean the build
|
||||
directory before rebuilding. Run the following command to clean and build:
|
||||
|
||||
```bash
|
||||
|
||||
@@ -26,4 +26,7 @@ sphinxcontrib-jsmath==1.0.1
|
||||
sphinxcontrib-qthelp==1.0.2
|
||||
sphinxcontrib-serializinghtml==1.1.3
|
||||
urllib3==1.25.3
|
||||
sphinx-markdown-tables==0.0.9
|
||||
sphinx-markdown-tables==0.0.9
|
||||
numpy==1.17.2
|
||||
tensorflow==2.0.0rc2
|
||||
torch==1.2.0
|
||||
@@ -1,5 +1,3 @@
|
||||
huggingface.css
|
||||
|
||||
/* The literal code blocks */
|
||||
.rst-content tt.literal, .rst-content tt.literal, .rst-content code.literal {
|
||||
color: #6670FF;
|
||||
@@ -44,11 +42,11 @@ huggingface.css
|
||||
/* The text items on the toc tree */
|
||||
.wy-menu-vertical a {
|
||||
color: #FFFFDD;
|
||||
font-family: Calibre-Light;
|
||||
font-family: Calibre-Light, sans-serif;
|
||||
}
|
||||
.wy-menu-vertical header, .wy-menu-vertical p.caption{
|
||||
color: white;
|
||||
font-family: Calibre-Light;
|
||||
font-family: Calibre-Light, sans-serif;
|
||||
}
|
||||
|
||||
/* The color inside the selected toc tree block */
|
||||
@@ -85,7 +83,7 @@ a {
|
||||
border-right: solid 2px #FB8D68;
|
||||
border-left: solid 2px #FB8D68;
|
||||
color: #FB8D68;
|
||||
font-family: Calibre-Light;
|
||||
font-family: Calibre-Light, sans-serif;
|
||||
border-top: none;
|
||||
font-style: normal !important;
|
||||
}
|
||||
@@ -136,14 +134,14 @@ a {
|
||||
|
||||
/* class and method names in doc */
|
||||
.rst-content dl:not(.docutils) tt.descname, .rst-content dl:not(.docutils) tt.descclassname, .rst-content dl:not(.docutils) tt.descname, .rst-content dl:not(.docutils) code.descname, .rst-content dl:not(.docutils) tt.descclassname, .rst-content dl:not(.docutils) code.descclassname{
|
||||
font-family: Calibre;
|
||||
font-family: Calibre, sans-serif;
|
||||
font-size: 20px !important;
|
||||
}
|
||||
|
||||
/* class name in doc*/
|
||||
.rst-content dl:not(.docutils) tt.descname, .rst-content dl:not(.docutils) tt.descname, .rst-content dl:not(.docutils) code.descname{
|
||||
margin-right: 10px;
|
||||
font-family: Calibre-Medium;
|
||||
font-family: Calibre-Medium, sans-serif;
|
||||
}
|
||||
|
||||
/* Method and class parameters */
|
||||
@@ -160,17 +158,17 @@ a {
|
||||
|
||||
/* FONTS */
|
||||
body{
|
||||
font-family: Calibre;
|
||||
font-family: Calibre, sans-serif;
|
||||
font-size: 16px;
|
||||
}
|
||||
|
||||
h1 {
|
||||
font-family: Calibre-Thin;
|
||||
font-family: Calibre-Thin, sans-serif;
|
||||
font-size: 70px;
|
||||
}
|
||||
|
||||
h2, .rst-content .toctree-wrapper p.caption, h3, h4, h5, h6, legend{
|
||||
font-family: Calibre-Medium;
|
||||
font-family: Calibre-Medium, sans-serif;
|
||||
}
|
||||
|
||||
@font-face {
|
||||
@@ -196,4 +194,3 @@ h2, .rst-content .toctree-wrapper p.caption, h3, h4, h5, h6, legend{
|
||||
src: url(./Calibre-Thin.otf);
|
||||
font-weight:400;
|
||||
}
|
||||
|
||||
|
||||
File diff suppressed because one or more lines are too long
@@ -26,7 +26,7 @@ author = u'huggingface'
|
||||
# The short X.Y version
|
||||
version = u''
|
||||
# The full version, including alpha/beta/rc tags
|
||||
release = u'1.2.0'
|
||||
release = u'2.1.0'
|
||||
|
||||
|
||||
# -- General configuration ---------------------------------------------------
|
||||
|
||||
1
docs/source/examples.md
Symbolic link
@@ -0,0 +1 @@
|
||||
../../examples/README.md
|
||||
@@ -5,6 +5,8 @@ Transformers
|
||||
(BERT, GPT-2, RoBERTa, XLM, DistilBert, XLNet...) for Natural Language Understanding (NLU) and Natural Language Generation
|
||||
(NLG) with 32+ pretrained models in 100+ languages and deep interoperability between TensorFlow 2.0 and PyTorch.
|
||||
|
||||
This is the documentation of our repository `transformers <https://github.com/huggingface/transformers>`__.
|
||||
|
||||
Features
|
||||
---------------------------------------------------
|
||||
|
||||
@@ -13,17 +15,20 @@ Features
|
||||
- High performance on NLU and NLG tasks
|
||||
- Low barrier to entry for educators and practitioners
|
||||
|
||||
State-of-the-art NLP for everyone
|
||||
State-of-the-art NLP for everyone:
|
||||
|
||||
- Deep learning researchers
|
||||
- Hands-on practitioners
|
||||
- AI/ML/NLP teachers and educators
|
||||
|
||||
Lower compute costs, smaller carbon footprint
|
||||
Lower compute costs, smaller carbon footprint:
|
||||
|
||||
- Researchers can share trained models instead of always retraining
|
||||
- Practitioners can reduce compute time and production costs
|
||||
- 8 architectures with over 30 pretrained models, some in more than 100 languages
|
||||
|
||||
Choose the right framework for every part of a model's lifetime
|
||||
Choose the right framework for every part of a model's lifetime:
|
||||
|
||||
- Train state-of-the-art models in 3 lines of code
|
||||
- Deep interoperability between TensorFlow 2.0 and PyTorch models
|
||||
- Move a single model between TF2.0/PyTorch frameworks at will
|
||||
@@ -41,8 +46,7 @@ The library currently contains PyTorch and Tensorflow implementations, pre-train
|
||||
5. `XLNet <https://github.com/zihangdai/xlnet>`_ (from Google/CMU) released with the paper `XLNet: Generalized Autoregressive Pretraining for Language Understanding <https://arxiv.org/abs/1906.08237>`_ by Zhilin Yang*, Zihang Dai*, Yiming Yang, Jaime Carbonell, Ruslan Salakhutdinov, Quoc V. Le.
|
||||
6. `XLM <https://github.com/facebookresearch/XLM>`_ (from Facebook) released together with the paper `Cross-lingual Language Model Pretraining <https://arxiv.org/abs/1901.07291>`_ by Guillaume Lample and Alexis Conneau.
|
||||
7. `RoBERTa <https://github.com/pytorch/fairseq/tree/master/examples/roberta>`_ (from Facebook), released together with the paper `RoBERTa: A Robustly Optimized BERT Pretraining Approach <https://arxiv.org/abs/1907.11692>`_ by Yinhan Liu, Myle Ott, Naman Goyal, Jingfei Du, Mandar Joshi, Danqi Chen, Omer Levy, Mike Lewis, Luke Zettlemoyer, Veselin Stoyanov.
|
||||
8. `DistilBERT <https://huggingface.co/transformers/model_doc/distilbert.html>`_ (from HuggingFace) released together with the blog post `Smaller, faster, cheaper, lighter: Introducing DistilBERT, a distilled version of BERT <https://medium.com/huggingface/distilbert-8cf3380435b5>`_ by Victor Sanh, Lysandre Debut and Thomas Wolf.
|
||||
|
||||
8. `DistilBERT <https://huggingface.co/transformers/model_doc/distilbert.html>`_ (from HuggingFace) released together with the paper `DistilBERT, a distilled version of BERT: smaller, faster, cheaper and lighter <https://arxiv.org/abs/1910.01108>`_ by Victor Sanh, Lysandre Debut and Thomas Wolf. The same method has been applied to compress GPT2 into `DistilGPT2 <https://github.com/huggingface/transformers/tree/master/examples/distillation>`_.
|
||||
|
||||
.. toctree::
|
||||
:maxdepth: 2
|
||||
@@ -58,6 +62,7 @@ The library currently contains PyTorch and Tensorflow implementations, pre-train
|
||||
migration
|
||||
bertology
|
||||
torchscript
|
||||
multilingual
|
||||
|
||||
.. toctree::
|
||||
:maxdepth: 2
|
||||
@@ -82,3 +87,4 @@ The library currently contains PyTorch and Tensorflow implementations, pre-train
|
||||
model_doc/xlnet
|
||||
model_doc/roberta
|
||||
model_doc/distilbert
|
||||
model_doc/ctrl
|
||||
|
||||
58
docs/source/installation.md
Normal file
@@ -0,0 +1,58 @@
|
||||
# Installation
|
||||
|
||||
Transformers is tested on Python 2.7 and 3.5+ (examples are tested only on Python 3.5+) and PyTorch 1.1.0.
|
||||
|
||||
## With pip
|
||||
|
||||
Transformers can be installed using pip as follows:
|
||||
|
||||
``` bash
|
||||
pip install transformers
|
||||
```
|
||||
|
||||
## From source
|
||||
|
||||
To install from source, clone the repository and install with:
|
||||
|
||||
``` bash
|
||||
git clone https://github.com/huggingface/transformers.git
|
||||
cd transformers
|
||||
pip install [--editable] .
|
||||
```
|
||||
|
||||
## Tests
|
||||
|
||||
An extensive test suite is included to test the library behavior and several examples. Library tests can be found in the [tests folder](https://github.com/huggingface/transformers/tree/master/transformers/tests) and examples tests in the [examples folder](https://github.com/huggingface/transformers/tree/master/examples).
|
||||
|
||||
Tests can be run using `pytest` (install pytest if needed with `pip install pytest`).
|
||||
|
||||
Run all the tests from the root of the cloned repository with the commands:
|
||||
|
||||
``` bash
|
||||
python -m pytest -sv ./transformers/tests/
|
||||
python -m pytest -sv ./examples/
|
||||
```
|
||||
|
||||
## OpenAI GPT original tokenization workflow
|
||||
|
||||
If you want to reproduce the original tokenization process of the `OpenAI GPT` paper, you will need to install `ftfy` (use version 4.4.3 if you are using Python 2) and `SpaCy`:
|
||||
|
||||
``` bash
|
||||
pip install spacy ftfy==4.4.3
|
||||
python -m spacy download en
|
||||
```
|
||||
|
||||
If you don't install `ftfy` and `SpaCy`, the `OpenAI GPT` tokenizer will default to tokenizing with BERT's `BasicTokenizer` followed by Byte-Pair Encoding (which should be fine for most usage, don't worry).
|
||||
|
||||
## Note on model downloads (Continuous Integration or large-scale deployments)
|
||||
|
||||
If you expect to be downloading large volumes of models (more than 1,000) from our hosted bucket (for instance through your CI setup, or a large-scale production deployment), please cache the model files on your end. It will be way faster, and cheaper. Feel free to contact us privately if you need any help.
|
||||
|
||||
## Do you want to run a Transformer model on a mobile device?
|
||||
|
||||
You should check out our [swift-coreml-transformers](https://github.com/huggingface/swift-coreml-transformers) repo.
|
||||
|
||||
It contains a set of tools to convert PyTorch or TensorFlow 2.0 trained Transformer models (currently contains `GPT-2`, `DistilGPT-2`, `BERT`, and `DistilBERT`) to CoreML models that run on iOS devices.
|
||||
|
||||
At some point in the future, you'll be able to seamlessly move from pre-training or fine-tuning models in PyTorch to productizing them in CoreML,
|
||||
or prototype a model or an app in CoreML then research its hyperparameters or architecture from PyTorch. Super exciting!
|
||||
@@ -1,71 +0,0 @@
|
||||
Installation
|
||||
================================================
|
||||
|
||||
Transformers is tested on Python 2.7 and 3.5+ (examples are tested only on python 3.5+) and PyTorch 1.1.0
|
||||
|
||||
With pip
|
||||
^^^^^^^^
|
||||
|
||||
PyTorch Transformers can be installed using pip as follows:
|
||||
|
||||
.. code-block:: bash
|
||||
|
||||
pip install transformers
|
||||
|
||||
From source
|
||||
^^^^^^^^^^^
|
||||
|
||||
To install from source, clone the repository and install with:
|
||||
|
||||
.. code-block:: bash
|
||||
|
||||
git clone https://github.com/huggingface/transformers.git
|
||||
cd transformers
|
||||
pip install [--editable] .
|
||||
|
||||
|
||||
Tests
|
||||
^^^^^
|
||||
|
||||
An extensive test suite is included to test the library behavior and several examples. Library tests can be found in the `tests folder <https://github.com/huggingface/transformers/tree/master/transformers/tests>`_ and examples tests in the `examples folder <https://github.com/huggingface/transformers/tree/master/examples>`_.
|
||||
|
||||
Tests can be run using `pytest` (install pytest if needed with `pip install pytest`).
|
||||
|
||||
Run all the tests from the root of the cloned repository with the commands:
|
||||
|
||||
.. code-block:: bash
|
||||
|
||||
python -m pytest -sv ./transformers/tests/
|
||||
python -m pytest -sv ./examples/
|
||||
|
||||
|
||||
OpenAI GPT original tokenization workflow
|
||||
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
|
||||
|
||||
If you want to reproduce the original tokenization process of the ``OpenAI GPT`` paper, you will need to install ``ftfy`` (use version 4.4.3 if you are using Python 2) and ``SpaCy`` :
|
||||
|
||||
.. code-block:: bash
|
||||
|
||||
pip install spacy ftfy==4.4.3
|
||||
python -m spacy download en
|
||||
|
||||
If you don't install ``ftfy`` and ``SpaCy``\ , the ``OpenAI GPT`` tokenizer will default to tokenize using BERT's ``BasicTokenizer`` followed by Byte-Pair Encoding (which should be fine for most usage, don't worry).
|
||||
|
||||
|
||||
Note on model downloads (Continuous Integration or large-scale deployments)
|
||||
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
|
||||
|
||||
If you expect to be downloading large volumes of models (more than 1,000) from our hosted bucket (for instance through your CI setup, or a large-scale production deployment), please cache the model files on your end. It will be way faster, and cheaper. Feel free to contact us privately if you need any help.
|
||||
|
||||
|
||||
Do you want to run a Transformer model on a mobile device?
|
||||
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
|
||||
|
||||
You should check out our `swift-coreml-transformers <https://github.com/huggingface/swift-coreml-transformers>`_ repo.
|
||||
|
||||
It contains an example of a conversion script from a Pytorch trained Transformer model (here, ``GPT-2``) to a CoreML model that runs on iOS devices.
|
||||
|
||||
It also contains an implementation of BERT for Question answering.
|
||||
|
||||
At some point in the future, you'll be able to seamlessly move from pre-training or fine-tuning models in PyTorch to productizing them in CoreML,
|
||||
or prototype a model or an app in CoreML then research its hyperparameters or architecture from PyTorch. Super exciting!
|
||||
@@ -17,5 +17,5 @@ The base class ``PreTrainedModel`` implements the common methods for loading/sav
|
||||
``TFPreTrainedModel``
|
||||
~~~~~~~~~~~~~~~~~~~~~
|
||||
|
||||
.. autoclass:: pytorch_transformers.TFPreTrainedModel
|
||||
.. autoclass:: transformers.TFPreTrainedModel
|
||||
:members:
|
||||
|
||||
@@ -8,20 +8,20 @@ Processors
|
||||
~~~~~~~~~~~~~~~~~~~~~
|
||||
|
||||
All processors follow the same architecture which is that of the
|
||||
:class:`~pytorch_transformers.data.processors.utils.DataProcessor`. The processor returns a list
|
||||
of :class:`~pytorch_transformers.data.processors.utils.InputExample`. These
|
||||
:class:`~pytorch_transformers.data.processors.utils.InputExample` can be converted to
|
||||
:class:`~pytorch_transformers.data.processors.utils.InputFeatures` in order to be fed to the model.
|
||||
:class:`~transformers.data.processors.utils.DataProcessor`. The processor returns a list
|
||||
of :class:`~transformers.data.processors.utils.InputExample`. These
|
||||
:class:`~transformers.data.processors.utils.InputExample` can be converted to
|
||||
:class:`~transformers.data.processors.utils.InputFeatures` in order to be fed to the model.
|
||||
|
||||
.. autoclass:: pytorch_transformers.data.processors.utils.DataProcessor
|
||||
.. autoclass:: transformers.data.processors.utils.DataProcessor
|
||||
:members:
|
||||
|
||||
|
||||
.. autoclass:: pytorch_transformers.data.processors.utils.InputExample
|
||||
.. autoclass:: transformers.data.processors.utils.InputExample
|
||||
:members:
|
||||
|
||||
|
||||
.. autoclass:: pytorch_transformers.data.processors.utils.InputFeatures
|
||||
.. autoclass:: transformers.data.processors.utils.InputFeatures
|
||||
:members:
|
||||
|
||||
|
||||
@@ -36,20 +36,20 @@ This library hosts a total of 10 processors for the following tasks: MRPC, MNLI,
|
||||
CoLA, SST2, STSB, QQP, QNLI, RTE and WNLI.
|
||||
|
||||
Those processors are:
|
||||
- :class:`~pytorch_transformers.data.processors.utils.MrpcProcessor`
|
||||
- :class:`~pytorch_transformers.data.processors.utils.MnliProcessor`
|
||||
- :class:`~pytorch_transformers.data.processors.utils.MnliMismatchedProcessor`
|
||||
- :class:`~pytorch_transformers.data.processors.utils.Sst2Processor`
|
||||
- :class:`~pytorch_transformers.data.processors.utils.StsbProcessor`
|
||||
- :class:`~pytorch_transformers.data.processors.utils.QqpProcessor`
|
||||
- :class:`~pytorch_transformers.data.processors.utils.QnliProcessor`
|
||||
- :class:`~pytorch_transformers.data.processors.utils.RteProcessor`
|
||||
- :class:`~pytorch_transformers.data.processors.utils.WnliProcessor`
|
||||
- :class:`~transformers.data.processors.utils.MrpcProcessor`
|
||||
- :class:`~transformers.data.processors.utils.MnliProcessor`
|
||||
- :class:`~transformers.data.processors.utils.MnliMismatchedProcessor`
|
||||
- :class:`~transformers.data.processors.utils.Sst2Processor`
|
||||
- :class:`~transformers.data.processors.utils.StsbProcessor`
|
||||
- :class:`~transformers.data.processors.utils.QqpProcessor`
|
||||
- :class:`~transformers.data.processors.utils.QnliProcessor`
|
||||
- :class:`~transformers.data.processors.utils.RteProcessor`
|
||||
- :class:`~transformers.data.processors.utils.WnliProcessor`
|
||||
|
||||
Additionally, the following method can be used to load values from a data file and convert them to a list of
|
||||
:class:`~pytorch_transformers.data.processors.utils.InputExample`.
|
||||
:class:`~transformers.data.processors.utils.InputExample`.
|
||||
|
||||
.. automethod:: pytorch_transformers.data.processors.glue.glue_convert_examples_to_features
|
||||
.. automethod:: transformers.data.processors.glue.glue_convert_examples_to_features
|
||||
|
||||
Example usage
|
||||
^^^^^^^^^^^^^^^^^^^^^^^^^
|
||||
|
||||
@@ -74,55 +74,55 @@ BERT
|
||||
``TFBertModel``
|
||||
~~~~~~~~~~~~~~~~~~~~
|
||||
|
||||
.. autoclass:: pytorch_transformers.TFBertModel
|
||||
.. autoclass:: transformers.TFBertModel
|
||||
:members:
|
||||
|
||||
|
||||
``TFBertForPreTraining``
|
||||
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
|
||||
|
||||
.. autoclass:: pytorch_transformers.TFBertForPreTraining
|
||||
.. autoclass:: transformers.TFBertForPreTraining
|
||||
:members:
|
||||
|
||||
|
||||
``TFBertForMaskedLM``
|
||||
~~~~~~~~~~~~~~~~~~~~~~~~~~
|
||||
|
||||
.. autoclass:: pytorch_transformers.TFBertForMaskedLM
|
||||
.. autoclass:: transformers.TFBertForMaskedLM
|
||||
:members:
|
||||
|
||||
|
||||
``TFBertForNextSentencePrediction``
|
||||
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
|
||||
|
||||
.. autoclass:: pytorch_transformers.TFBertForNextSentencePrediction
|
||||
.. autoclass:: transformers.TFBertForNextSentencePrediction
|
||||
:members:
|
||||
|
||||
|
||||
``TFBertForSequenceClassification``
|
||||
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
|
||||
|
||||
.. autoclass:: pytorch_transformers.TFBertForSequenceClassification
|
||||
.. autoclass:: transformers.TFBertForSequenceClassification
|
||||
:members:
|
||||
|
||||
|
||||
``TFBertForMultipleChoice``
|
||||
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
|
||||
|
||||
.. autoclass:: pytorch_transformers.TFBertForMultipleChoice
|
||||
.. autoclass:: transformers.TFBertForMultipleChoice
|
||||
:members:
|
||||
|
||||
|
||||
``TFBertForTokenClassification``
|
||||
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
|
||||
|
||||
.. autoclass:: pytorch_transformers.TFBertForTokenClassification
|
||||
.. autoclass:: transformers.TFBertForTokenClassification
|
||||
:members:
|
||||
|
||||
|
||||
``TFBertForQuestionAnswering``
|
||||
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
|
||||
|
||||
.. autoclass:: pytorch_transformers.TFBertForQuestionAnswering
|
||||
.. autoclass:: transformers.TFBertForQuestionAnswering
|
||||
:members:
|
||||
|
||||
|
||||
44
docs/source/model_doc/ctrl.rst
Normal file
@@ -0,0 +1,44 @@
|
||||
CTRL
|
||||
----------------------------------------------------
|
||||
|
||||
``CTRLConfig``
|
||||
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
|
||||
|
||||
.. autoclass:: transformers.CTRLConfig
|
||||
:members:
|
||||
|
||||
|
||||
``CTRLTokenizer``
|
||||
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
|
||||
|
||||
.. autoclass:: transformers.CTRLTokenizer
|
||||
:members:
|
||||
|
||||
|
||||
``CTRLModel``
|
||||
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
|
||||
|
||||
.. autoclass:: transformers.CTRLModel
|
||||
:members:
|
||||
|
||||
|
||||
``CTRLLMHeadModel``
|
||||
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
|
||||
|
||||
.. autoclass:: transformers.CTRLLMHeadModel
|
||||
:members:
|
||||
|
||||
|
||||
``TFCTRLModel``
|
||||
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
|
||||
|
||||
.. autoclass:: transformers.TFCTRLModel
|
||||
:members:
|
||||
|
||||
|
||||
``TFCTRLLMHeadModel``
|
||||
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
|
||||
|
||||
.. autoclass:: transformers.TFCTRLLMHeadModel
|
||||
:members:
|
||||
|
||||
@@ -45,26 +45,26 @@ DistilBERT
|
||||
``TFDistilBertModel``
|
||||
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
|
||||
|
||||
.. autoclass:: pytorch_transformers.TFDistilBertModel
|
||||
.. autoclass:: transformers.TFDistilBertModel
|
||||
:members:
|
||||
|
||||
|
||||
``TFDistilBertForMaskedLM``
|
||||
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
|
||||
|
||||
.. autoclass:: pytorch_transformers.TFDistilBertForMaskedLM
|
||||
.. autoclass:: transformers.TFDistilBertForMaskedLM
|
||||
:members:
|
||||
|
||||
|
||||
``TFDistilBertForSequenceClassification``
|
||||
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
|
||||
|
||||
.. autoclass:: pytorch_transformers.TFDistilBertForSequenceClassification
|
||||
.. autoclass:: transformers.TFDistilBertForSequenceClassification
|
||||
:members:
|
||||
|
||||
|
||||
``TFDistilBertForQuestionAnswering``
|
||||
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
|
||||
|
||||
.. autoclass:: pytorch_transformers.TFDistilBertForQuestionAnswering
|
||||
.. autoclass:: transformers.TFDistilBertForQuestionAnswering
|
||||
:members:
|
||||
|
||||
@@ -39,19 +39,19 @@ OpenAI GPT
|
||||
``TFOpenAIGPTModel``
|
||||
~~~~~~~~~~~~~~~~~~~~~~~~~
|
||||
|
||||
.. autoclass:: pytorch_transformers.TFOpenAIGPTModel
|
||||
.. autoclass:: transformers.TFOpenAIGPTModel
|
||||
:members:
|
||||
|
||||
|
||||
``TFOpenAIGPTLMHeadModel``
|
||||
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
|
||||
|
||||
.. autoclass:: pytorch_transformers.TFOpenAIGPTLMHeadModel
|
||||
.. autoclass:: transformers.TFOpenAIGPTLMHeadModel
|
||||
:members:
|
||||
|
||||
|
||||
``TFOpenAIGPTDoubleHeadsModel``
|
||||
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
|
||||
|
||||
.. autoclass:: pytorch_transformers.TFOpenAIGPTDoubleHeadsModel
|
||||
.. autoclass:: transformers.TFOpenAIGPTDoubleHeadsModel
|
||||
:members:
|
||||
|
||||
@@ -39,19 +39,19 @@ OpenAI GPT2
|
||||
``TFGPT2Model``
|
||||
~~~~~~~~~~~~~~~~~~~~~
|
||||
|
||||
.. autoclass:: pytorch_transformers.TFGPT2Model
|
||||
.. autoclass:: transformers.TFGPT2Model
|
||||
:members:
|
||||
|
||||
|
||||
``TFGPT2LMHeadModel``
|
||||
~~~~~~~~~~~~~~~~~~~~~~~~~~~
|
||||
|
||||
.. autoclass:: pytorch_transformers.TFGPT2LMHeadModel
|
||||
.. autoclass:: transformers.TFGPT2LMHeadModel
|
||||
:members:
|
||||
|
||||
|
||||
``TFGPT2DoubleHeadsModel``
|
||||
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
|
||||
|
||||
.. autoclass:: pytorch_transformers.TFGPT2DoubleHeadsModel
|
||||
.. autoclass:: transformers.TFGPT2DoubleHeadsModel
|
||||
:members:
|
||||
|
||||
@@ -39,19 +39,19 @@ RoBERTa
|
||||
``TFRobertaModel``
|
||||
~~~~~~~~~~~~~~~~~~~~
|
||||
|
||||
.. autoclass:: pytorch_transformers.TFRobertaModel
|
||||
.. autoclass:: transformers.TFRobertaModel
|
||||
:members:
|
||||
|
||||
|
||||
``TFRobertaForMaskedLM``
|
||||
~~~~~~~~~~~~~~~~~~~~~~~~~~
|
||||
|
||||
.. autoclass:: pytorch_transformers.TFRobertaForMaskedLM
|
||||
.. autoclass:: transformers.TFRobertaForMaskedLM
|
||||
:members:
|
||||
|
||||
|
||||
``TFRobertaForSequenceClassification``
|
||||
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
|
||||
|
||||
.. autoclass:: pytorch_transformers.TFRobertaForSequenceClassification
|
||||
.. autoclass:: transformers.TFRobertaForSequenceClassification
|
||||
:members:
|
||||
|
||||
@@ -33,12 +33,12 @@ Transformer XL
|
||||
``TFTransfoXLModel``
|
||||
~~~~~~~~~~~~~~~~~~~~~~~~~~
|
||||
|
||||
.. autoclass:: pytorch_transformers.TFTransfoXLModel
|
||||
.. autoclass:: transformers.TFTransfoXLModel
|
||||
:members:
|
||||
|
||||
|
||||
``TFTransfoXLLMHeadModel``
|
||||
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
|
||||
|
||||
.. autoclass:: pytorch_transformers.TFTransfoXLLMHeadModel
|
||||
.. autoclass:: transformers.TFTransfoXLLMHeadModel
|
||||
:members:
|
||||
|
||||
@@ -44,26 +44,26 @@ XLM
|
||||
``TFXLMModel``
|
||||
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
|
||||
|
||||
.. autoclass:: pytorch_transformers.TFXLMModel
|
||||
.. autoclass:: transformers.TFXLMModel
|
||||
:members:
|
||||
|
||||
|
||||
``TFXLMWithLMHeadModel``
|
||||
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
|
||||
|
||||
.. autoclass:: pytorch_transformers.TFXLMWithLMHeadModel
|
||||
.. autoclass:: transformers.TFXLMWithLMHeadModel
|
||||
:members:
|
||||
|
||||
|
||||
``TFXLMForSequenceClassification``
|
||||
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
|
||||
|
||||
.. autoclass:: pytorch_transformers.TFXLMForSequenceClassification
|
||||
.. autoclass:: transformers.TFXLMForSequenceClassification
|
||||
:members:
|
||||
|
||||
|
||||
``TFXLMForQuestionAnsweringSimple``
|
||||
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
|
||||
|
||||
.. autoclass:: pytorch_transformers.TFXLMForQuestionAnsweringSimple
|
||||
.. autoclass:: transformers.TFXLMForQuestionAnsweringSimple
|
||||
:members:
|
||||
|
||||
@@ -46,26 +46,26 @@ XLNet
|
||||
``TFXLNetModel``
|
||||
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
|
||||
|
||||
.. autoclass:: pytorch_transformers.TFXLNetModel
|
||||
.. autoclass:: transformers.TFXLNetModel
|
||||
:members:
|
||||
|
||||
|
||||
``TFXLNetLMHeadModel``
|
||||
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
|
||||
|
||||
.. autoclass:: pytorch_transformers.TFXLNetLMHeadModel
|
||||
.. autoclass:: transformers.TFXLNetLMHeadModel
|
||||
:members:
|
||||
|
||||
|
||||
``TFXLNetForSequenceClassification``
|
||||
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
|
||||
|
||||
.. autoclass:: pytorch_transformers.TFXLNetForSequenceClassification
|
||||
.. autoclass:: transformers.TFXLNetForSequenceClassification
|
||||
:members:
|
||||
|
||||
|
||||
``TFXLNetForQuestionAnsweringSimple``
|
||||
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
|
||||
|
||||
.. autoclass:: pytorch_transformers.TFXLNetForQuestionAnsweringSimple
|
||||
.. autoclass:: transformers.TFXLNetForQuestionAnsweringSimple
|
||||
:members:
|
||||
|
||||
103
docs/source/multilingual.rst
Normal file
@@ -0,0 +1,103 @@
|
||||
Multi-lingual models
|
||||
================================================
|
||||
|
||||
Most of the models available in this library are mono-lingual models (English, Chinese and German). A few
|
||||
multi-lingual models are available and have different mechanisms than mono-lingual models.
|
||||
This page details the usage of these models.
|
||||
|
||||
The two models that currently support multiple languages are BERT and XLM.
|
||||
|
||||
XLM
|
||||
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
|
||||
|
||||
XLM has a total of 10 different checkpoints, only one of which is mono-lingual. The 9 remaining model checkpoints can
|
||||
be split into two categories: the checkpoints that make use of language embeddings, and those that don't.
|
||||
|
||||
XLM & Language Embeddings
|
||||
------------------------------------------------
|
||||
|
||||
This section concerns the following checkpoints:
|
||||
|
||||
- ``xlm-mlm-ende-1024`` (Masked language modeling, English-German)
|
||||
- ``xlm-mlm-enfr-1024`` (Masked language modeling, English-French)
|
||||
- ``xlm-mlm-enro-1024`` (Masked language modeling, English-Romanian)
|
||||
- ``xlm-mlm-xnli15-1024`` (Masked language modeling, XNLI languages)
|
||||
- ``xlm-mlm-tlm-xnli15-1024`` (Masked language modeling + Translation, XNLI languages)
|
||||
- ``xlm-clm-enfr-1024`` (Causal language modeling, English-French)
|
||||
- ``xlm-clm-ende-1024`` (Causal language modeling, English-German)
|
||||
|
||||
These checkpoints require language embeddings that specify the language used at inference time. These language
embeddings are represented as a tensor of the same shape as the input ids passed to the model. The values in
this tensor depend on the language used and can be looked up with the ``lang2id`` and ``id2lang`` attributes
of the tokenizer.
|
||||
|
||||
Here is an example using the ``xlm-clm-enfr-1024`` checkpoint (Causal language modeling, English-French):
|
||||
|
||||
|
||||
.. code-block::

    import torch
    from transformers import XLMTokenizer, XLMWithLMHeadModel

    tokenizer = XLMTokenizer.from_pretrained("xlm-clm-enfr-1024")
    model = XLMWithLMHeadModel.from_pretrained("xlm-clm-enfr-1024")
|
||||
|
||||
|
||||
The different languages this model/tokenizer handles, as well as the ids of these languages, are visible using the
``lang2id`` attribute:
|
||||
|
||||
.. code-block::

    print(tokenizer.lang2id)  # {'en': 0, 'fr': 1}
|
||||
|
||||
|
||||
These ids should be used when passing a language parameter during a model pass. Let's define our inputs:
|
||||
|
||||
.. code-block::

    input_ids = torch.tensor([tokenizer.encode("Wikipedia was used to")])  # batch size of 1
|
||||
|
||||
|
||||
We should now define the language embedding by using the previously defined language id. We want to create a tensor
filled with the appropriate language ids, of the same size as ``input_ids``. For English, the id is 0:
|
||||
|
||||
.. code-block::

    language_id = tokenizer.lang2id['en']  # 0
    langs = torch.tensor([language_id] * input_ids.shape[1])  # torch.tensor([0, 0, 0, ..., 0])

    # We reshape it to be of size (batch_size, sequence_length)
    langs = langs.view(1, -1)  # is now of shape [1, sequence_length] (we have a batch size of 1)
|
||||
|
||||
|
||||
You can then feed it all as input to your model:
|
||||
|
||||
.. code-block::

    outputs = model(input_ids, langs=langs)
|
||||
|
||||
|
||||
The example `run_generation.py <https://github.com/huggingface/transformers/blob/master/examples/run_generation.py>`__
can generate text using the CLM checkpoints from XLM, using the language embeddings.
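The single-sentence example above extends naturally to a batch in which each sequence has its own language. The
following is a minimal, framework-free sketch of that idea, not part of the library: ``build_langs`` is a
hypothetical helper, the token ids are made up, and the ``lang2id`` mapping simply mirrors the one printed earlier.

```python
from typing import Dict, List


def build_langs(batch_input_ids: List[List[int]],
                languages: List[str],
                lang2id: Dict[str, int]) -> List[List[int]]:
    """Return one language id per token, so the result has the same shape as the input ids."""
    if len(batch_input_ids) != len(languages):
        raise ValueError("expected exactly one language code per sequence")
    return [[lang2id[lang]] * len(ids) for ids, lang in zip(batch_input_ids, languages)]


# Stand-ins for tokenizer.lang2id and for the output of tokenizer.encode(...).
lang2id = {"en": 0, "fr": 1}
batch_input_ids = [[0, 11, 12, 13, 1], [0, 21, 22, 23, 1]]  # made-up ids, equal lengths

langs = build_langs(batch_input_ids, ["en", "fr"], lang2id)
print(langs)  # [[0, 0, 0, 0, 0], [1, 1, 1, 1, 1]]
```

In a real run the nested list would be wrapped with ``torch.tensor(langs)`` and passed as ``langs=`` alongside
``input_ids``, exactly as in the batch-size-1 example.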
|
||||
|
||||
XLM without Language Embeddings
------------------------------------------------
|
||||
|
||||
This section concerns the following checkpoints:
|
||||
|
||||
- ``xlm-mlm-17-1280`` (Masked language modeling, 17 languages)
- ``xlm-mlm-100-1280`` (Masked language modeling, 100 languages)
|
||||
|
||||
These checkpoints do not require language embeddings at inference time. These models are used to obtain generic
sentence representations, unlike the previously mentioned XLM checkpoints.
|
||||
|
||||
|
||||
BERT
|
||||
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
|
||||
|
||||
BERT has two checkpoints that can be used for multi-lingual tasks:
|
||||
|
||||
- ``bert-base-multilingual-uncased`` (Masked language modeling + Next sentence prediction, 102 languages)
- ``bert-base-multilingual-cased`` (Masked language modeling + Next sentence prediction, 104 languages)
|
||||
|
||||
These checkpoints do not require language embeddings at inference time. They should identify the language
used in the context and infer accordingly.
|
||||
@@ -98,6 +98,12 @@ Here is the full list of the currently provided pretrained models together with
|
||||
| +------------------------------------------------------------+---------------------------------------------------------------------------------------------------------------------------------------+
|
||||
| | ``xlm-clm-ende-1024`` | | 6-layer, 1024-hidden, 8-heads |
|
||||
| | | | XLM English-German model trained with CLM (Causal Language Modeling) on the concatenation of English and German wikipedia |
|
||||
| +------------------------------------------------------------+---------------------------------------------------------------------------------------------------------------------------------------+
|
||||
| | ``xlm-mlm-17-1280`` | | 16-layer, 1280-hidden, 16-heads |
|
||||
| | | | XLM model trained with MLM (Masked Language Modeling) on 17 languages. |
|
||||
| +------------------------------------------------------------+---------------------------------------------------------------------------------------------------------------------------------------+
|
||||
| | ``xlm-mlm-100-1280`` | | 16-layer, 1280-hidden, 16-heads |
|
||||
| | | | XLM model trained with MLM (Masked Language Modeling) on 100 languages. |
|
||||
+-------------------+------------------------------------------------------------+---------------------------------------------------------------------------------------------------------------------------------------+
|
||||
| RoBERTa | ``roberta-base`` | | 12-layer, 768-hidden, 12-heads, 125M parameters |
|
||||
| | | | RoBERTa using the BERT-base architecture |
|
||||
@@ -113,11 +119,18 @@ Here is the full list of the currently provided pretrained models together with
|
||||
+-------------------+------------------------------------------------------------+---------------------------------------------------------------------------------------------------------------------------------------+
|
||||
| DistilBERT | ``distilbert-base-uncased`` | | 6-layer, 768-hidden, 12-heads, 66M parameters |
|
||||
| | | | The DistilBERT model distilled from the BERT model `bert-base-uncased` checkpoint |
|
||||
| | | (see `details <https://medium.com/huggingface/distilbert-8cf3380435b5>`__) |
|
||||
| | | (see `details <https://github.com/huggingface/transformers/tree/master/examples/distillation>`__) |
|
||||
| +------------------------------------------------------------+---------------------------------------------------------------------------------------------------------------------------------------+
|
||||
| | ``distilbert-base-uncased-distilled-squad`` | | 6-layer, 768-hidden, 12-heads, 66M parameters |
|
||||
| | | | The DistilBERT model distilled from the BERT model `bert-base-uncased` checkpoint, with an additional linear layer. |
|
||||
| | | (see `details <https://medium.com/huggingface/distilbert-8cf3380435b5>`__) |
|
||||
| | | (see `details <https://github.com/huggingface/transformers/tree/master/examples/distillation>`__) |
|
||||
| +------------------------------------------------------------+---------------------------------------------------------------------------------------------------------------------------------------+
|
||||
| | ``distilgpt2`` | | 6-layer, 768-hidden, 12-heads, 82M parameters |
|
||||
| | | | The DistilGPT2 model distilled from the GPT2 model `gpt2` checkpoint. |
|
||||
| | | (see `details <https://github.com/huggingface/transformers/tree/master/examples/distillation>`__) |
|
||||
+-------------------+------------------------------------------------------------+---------------------------------------------------------------------------------------------------------------------------------------+
|
||||
| CTRL | ``ctrl`` | | 48-layer, 1280-hidden, 16-heads, 1.6B parameters |
|
||||
| | | | Salesforce's Large-sized CTRL English model |
|
||||
+-------------------+------------------------------------------------------------+---------------------------------------------------------------------------------------------------------------------------------------+
|
||||
|
||||
.. <https://huggingface.co/transformers/examples.html>`__
|
||||
@@ -19,12 +19,12 @@ The library was designed with two strong goals in mind:
|
||||
|
||||
A few other goals:
|
||||
|
||||
- expose the models internals as consistently as possible:
|
||||
- expose the models' internals as consistently as possible:
|
||||
|
||||
- we give access, using a single API to the full hidden-states and attention weights,
|
||||
- tokenizer and base model's API are standardized to easily switch between models.
|
||||
|
||||
- incorporate a subjective selection of promising tools for fine-tuning/investiguating these models:
|
||||
- incorporate a subjective selection of promising tools for fine-tuning/investigating these models:
|
||||
|
||||
- a simple/consistent way to add new tokens to the vocabulary and embeddings for fine-tuning,
|
||||
- simple ways to mask and prune transformer heads.
|
||||
@@ -33,7 +33,7 @@ A few other goals:
|
||||
|
||||
The library is built around three types of classes for each model:
|
||||
|
||||
- **model classes** which are PyTorch models (`torch.nn.Modules`) of the 6 models architectures currently provided in the library, e.g. `BertModel`
|
||||
- **model classes** which are PyTorch models (`torch.nn.Modules`) of the 8 models architectures currently provided in the library, e.g. `BertModel`
|
||||
- **configuration classes** which store all the parameters required to build a model, e.g. `BertConfig`. You don't always need to instantiate these yourself. In particular, if you are using a pretrained model without any modification, creating the model will automatically take care of instantiating the configuration (which is part of the model)
|
||||
- **tokenizer classes** which store the vocabulary for each model and provide methods for encoding/decoding strings into lists of token embedding indices to be fed to a model, e.g. `BertTokenizer`
|
||||
|
||||
@@ -51,7 +51,7 @@ We'll finish this quickstart tour by going through a few simple quick-start exam
|
||||
|
||||
Here are two examples showcasing a few `Bert` and `GPT2` classes and pre-trained models.
|
||||
|
||||
See full API reference for examples for each model classe.
|
||||
See full API reference for examples for each model class.
|
||||
|
||||
### BERT example
|
||||
|
||||
@@ -93,8 +93,8 @@ Let's see how we can use `BertModel` to encode our inputs in hidden-states:
|
||||
# Load pre-trained model (weights)
|
||||
model = BertModel.from_pretrained('bert-base-uncased')
|
||||
|
||||
# Set the model in evaluation mode to desactivate the DropOut modules
|
||||
# This is IMPORTANT to have reproductible results during evaluation!
|
||||
# Set the model in evaluation mode to deactivate the DropOut modules
|
||||
# This is IMPORTANT to have reproducible results during evaluation!
|
||||
model.eval()
|
||||
|
||||
# If you have a GPU, put everything on cuda
|
||||
@@ -168,8 +168,8 @@ Let's see how to use `GPT2LMHeadModel` to generate the next token following our
|
||||
# Load pre-trained model (weights)
|
||||
model = GPT2LMHeadModel.from_pretrained('gpt2')
|
||||
|
||||
# Set the model in evaluation mode to desactivate the DropOut modules
|
||||
# This is IMPORTANT to have reproductible results during evaluation!
|
||||
# Set the model in evaluation mode to deactivate the DropOut modules
|
||||
# This is IMPORTANT to have reproducible results during evaluation!
|
||||
model.eval()
|
||||
|
||||
# If you have a GPU, put everything on cuda
|
||||
|
||||
@@ -9,7 +9,7 @@ similar API between the different models.
|
||||
| [Language Generation](#language-generation) | Conditional text generation using the auto-regressive models of the library: GPT, GPT-2, Transformer-XL and XLNet. |
|
||||
| [GLUE](#glue) | Examples running BERT/XLM/XLNet/RoBERTa on the 9 GLUE tasks. Examples feature distributed training as well as half-precision. |
|
||||
| [SQuAD](#squad) | Using BERT for question answering, examples with distributed training. |
|
||||
| [Multiple Choice](#multiple choice) | Examples running BERT/XLNet/RoBERTa on the SWAG/RACE/ARC tasks.
|
||||
| [Multiple Choice](#multiple-choice) | Examples running BERT/XLNet/RoBERTa on the SWAG/RACE/ARC tasks.
|
||||
|
||||
## Language model fine-tuning
|
||||
|
||||
@@ -283,17 +283,17 @@ The results are the following:
|
||||
loss = 0.04755385363816904
|
||||
```
|
||||
|
||||
##Multiple Choice
|
||||
## Multiple Choice
|
||||
|
||||
Based on the script [`run_multiple_choice.py`]().
|
||||
|
||||
#### Fine-tuning on SWAG
|
||||
Download [swag](https://github.com/rowanz/swagaf/tree/master/data) data
|
||||
|
||||
```
|
||||
```bash
|
||||
#training on 4 tesla V100(16GB) GPUS
|
||||
export SWAG_DIR=/path/to/swag_data_dir
|
||||
python ./examples/single_model_scripts/run_multiple_choice.py \
|
||||
python ./examples/run_multiple_choice.py \
|
||||
--model_type roberta \
|
||||
--task_name swag \
|
||||
--model_name_or_path roberta-base \
|
||||
|
||||
@@ -1,22 +1,25 @@
|
||||
# DistilBERT
|
||||
# Distil*
|
||||
|
||||
This folder contains the original code used to train DistilBERT as well as examples showcasing how to use DistilBERT.
|
||||
This folder contains the original code used to train Distil* as well as examples showcasing how to use DistilBERT and DistilGPT2.
|
||||
|
||||
**2019, October 3rd - Update** We release our [NeurIPS workshop paper](https://arxiv.org/abs/1910.01108) explaining our approach on **DistilBERT**. It includes updated results and further experiments. We applied the same method to GPT2 and release the weights of **DistilGPT2**. DistilGPT2 is two times faster and 33% smaller than GPT2.
|
||||
|
||||
**2019, September 19th - Update:** We fixed bugs in the code and released an updated version of the weights trained with a modification of the distillation loss. DistilBERT now reaches 97% of `BERT-base`'s performance on GLUE, and 86.9 F1 score on SQuAD v1.1 dev set (compared to 88.5 for `BERT-base`). We will publish a formal write-up of our approach in the near future!
|
||||
|
||||
## What is DistilBERT
|
||||
## What is Distil*
|
||||
|
||||
DistilBERT stands for Distillated-BERT. DistilBERT is a small, fast, cheap and light Transformer model based on Bert architecture. It has 40% less parameters than `bert-base-uncased`, runs 60% faster while preserving 97% of BERT's performances as measured on the GLUE language understanding benchmark. DistilBERT is trained using knowledge distillation, a technique to compress a large model called the teacher into a smaller model called the student. By distillating Bert, we obtain a smaller Transformer model that bears a lot of similarities with the original BERT model while being lighter, smaller and faster to run. DistilBERT is thus an interesting option to put large-scaled trained Transformer model into production.
|
||||
Distil* is a class of compressed models that started with DistilBERT. DistilBERT stands for Distillated-BERT. DistilBERT is a small, fast, cheap and light Transformer model based on Bert architecture. It has 40% less parameters than `bert-base-uncased`, runs 60% faster while preserving 97% of BERT's performances as measured on the GLUE language understanding benchmark. DistilBERT is trained using knowledge distillation, a technique to compress a large model called the teacher into a smaller model called the student. By distillating Bert, we obtain a smaller Transformer model that bears a lot of similarities with the original BERT model while being lighter, smaller and faster to run. DistilBERT is thus an interesting option to put large-scaled trained Transformer model into production.
|
||||
|
||||
For more information on DistilBERT, please refer to our [detailed blog post](https://medium.com/huggingface/smaller-faster-cheaper-lighter-introducing-distilbert-a-distilled-version-of-bert-8cf3380435b5
|
||||
). *Please note that we will publish a formal write-up with updated and more complete results in the near future (September 19th).*
|
||||
We have applied the same method to GPT2 and release the weights of the compressed model. On the [WikiText-103](https://blog.einstein.ai/the-wikitext-long-term-dependency-language-modeling-dataset/) benchmark, GPT2 reaches a perplexity on the test set of 15.0 compared to 18.5 for DistilGPT2 (after fine-tuning on the train set).
|
||||
|
||||
Here's the updated results on the dev sets of GLUE:
|
||||
For more information on DistilBERT, please refer to our [NeurIPS workshop paper](https://arxiv.org/abs/1910.01108). The paper supersedes our [previous blogpost](https://medium.com/huggingface/distilbert-8cf3380435b5) with a different distillation loss and better performance.
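To make the teacher/student idea more concrete, here is a minimal, framework-free sketch of a temperature-scaled distillation loss on a single prediction. It is an illustration under assumptions rather than the training code of this folder: the function names, the temperature value and the toy logits are all made up for the example.

```python
import math
from typing import List


def softmax(logits: List[float], temperature: float = 1.0) -> List[float]:
    """Softmax over logits divided by a temperature (higher temperature = softer distribution)."""
    scaled = [x / temperature for x in logits]
    top = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - top) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def distillation_loss(student_logits: List[float],
                      teacher_logits: List[float],
                      temperature: float = 2.0) -> float:
    """KL(teacher || student) on temperature-softened distributions, rescaled by temperature ** 2."""
    p_teacher = softmax(teacher_logits, temperature)
    p_student = softmax(student_logits, temperature)
    kl = sum(pt * (math.log(pt) - math.log(ps))
             for pt, ps in zip(p_teacher, p_student) if pt > 0.0)
    return kl * temperature ** 2


teacher_logits = [4.0, 1.0, 0.5]   # toy values
close_student = [3.8, 1.1, 0.4]    # nearly agrees with the teacher
far_student = [0.5, 1.0, 4.0]      # prefers a different class

print(distillation_loss(close_student, teacher_logits))
print(distillation_loss(far_student, teacher_logits))
```

The loss is zero when the student reproduces the teacher's distribution and grows as the two diverge, which is the signal the student is trained on in addition to the usual language modeling loss.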
|
||||
|
||||
| Model | Macro-score | CoLA | MNLI | MRPC | QNLI | QQP | RTE | SST-2 | STS-B | WNLI |
|
||||
Here are the results on the dev sets of GLUE:
|
||||
|
||||
| Model | Macro-score | CoLA | MNLI | MRPC | QNLI | QQP | RTE | SST-2| STS-B| WNLI |
|
||||
| :---: | :---: | :---:| :---:| :---:| :---:| :---:| :---:| :---:| :---:| :---:|
|
||||
| BERT-base | **77.6** | 48.9 | 84.3 | 88.6 | 89.3 | 89.5 | 71.3 | 91.7 | 91.2 | 43.7 |
|
||||
| DistilBERT | **75.2** | 49.1 | 81.8 | 90.2 | 87.0 | 89.2 | 62.9 | 92.7 | 90.7 | 44.4 |
|
||||
| DistilBERT | **76.8** | 49.1 | 81.8 | 90.2 | 90.2 | 89.2 | 62.9 | 92.7 | 90.7 | 44.4 |
|
||||
|
||||
## Setup
|
||||
|
||||
@@ -26,10 +29,12 @@ This part of the library has only be tested with Python3.6+. There are few speci
|
||||
|
||||
## How to use DistilBERT
|
||||
|
||||
Transformers includes two pre-trained DistilBERT models, currently only provided for English (we are investigating the possibility to train and release a multilingual version of DistilBERT):
|
||||
Transformers includes three pre-trained Distil* models, currently only provided for English (we are investigating the possibility to train and release a multilingual version of DistilBERT):
|
||||
|
||||
- `distilbert-base-uncased`: DistilBERT English language model pretrained on the same data used to pretrain Bert (concatenation of the Toronto Book Corpus and full English Wikipedia) using distillation with the supervision of the `bert-base-uncased` version of Bert. The model has 6 layers, 768 dimensions and 12 heads, totaling 66M parameters.
|
||||
- `distilbert-base-uncased-distilled-squad`: A version of `distilbert-base-uncased` finetuned using (a second step of) knowledge distillation on SQuAD 1.0. This model reaches an F1 score of 86.9 on the dev set (for comparison, Bert `bert-base-uncased` version reaches an 88.5 F1 score).
|
||||
- `distilgpt2`: DistilGPT2 English language model pretrained with the supervision of `gpt2` (the smallest version of GPT2) on [OpenWebTextCorpus](https://skylion007.github.io/OpenWebTextCorpus/), a reproduction of OpenAI's WebText dataset. The model has 6 layers, 768 dimensions and 12 heads, totaling 82M parameters (compared to 124M parameters for GPT2). On average, DistilGPT2 is two times faster than GPT2.
|
||||
- and more to come! 🤗🤗🤗
|
||||
|
||||
Using DistilBERT is very similar to using BERT. DistilBERT shares the same tokenizer as BERT's `bert-base-uncased`, even though we provide a link to this tokenizer under the `DistilBertTokenizer` name to have a consistent naming between the library models.
|
||||
|
||||
@@ -42,9 +47,11 @@ outputs = model(input_ids)
|
||||
last_hidden_states = outputs[0] # The last hidden-state is the first element of the output tuple
|
||||
```
|
||||
|
||||
## How to train DistilBERT
|
||||
Similarly, using DistilGPT2 simply consists in calling the GPT2 classes from a different pretrained checkpoint: `model = GPT2Model.from_pretrained('distilgpt2')`.
|
||||
|
||||
In the following, we will explain how you can train your own compressed model.
|
||||
## How to train Distil*
|
||||
|
||||
In the following, we will explain how you can train DistilBERT.
|
||||
|
||||
### A. Preparing the data
|
||||
|
||||
@@ -57,7 +64,8 @@ First, we will binarize the data, i.e. tokenize the data and convert each token
|
||||
```bash
|
||||
python scripts/binarized_data.py \
|
||||
--file_path data/dump.txt \
|
||||
--bert_tokenizer bert-base-uncased \
|
||||
--tokenizer_type bert \
|
||||
--tokenizer_name bert-base-uncased \
|
||||
--dump_file data/binarized_text
|
||||
```
|
||||
|
||||
@@ -66,7 +74,8 @@ Our implementation of masked language modeling loss follows [XLM](https://github
|
||||
```bash
|
||||
python scripts/token_counts.py \
|
||||
--data_file data/binarized_text.bert-base-uncased.pickle \
|
||||
--token_counts_dump data/token_counts.bert-base-uncased.pickle
|
||||
--token_counts_dump data/token_counts.bert-base-uncased.pickle \
|
||||
--vocab_size 30522
|
||||
```
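For readers who want a feel for what the counting step produces, here is a small standard-library sketch. It assumes the binarized data is a list of token-id sequences; `count_tokens` and `masking_weights` are hypothetical helpers, and the smoothing exponent of 0.7 is only an illustrative assumption about how rare tokens might be emphasized, not a value taken from this README.

```python
from collections import Counter
from typing import List


def count_tokens(sequences: List[List[int]], vocab_size: int) -> List[int]:
    """Return a list where position i holds the number of occurrences of token id i."""
    counter = Counter(tok for seq in sequences for tok in seq)
    return [counter.get(i, 0) for i in range(vocab_size)]


def masking_weights(counts: List[int], smoothing: float = 0.7) -> List[float]:
    """Unnormalized weights that decrease with frequency (assumed inverse-power smoothing)."""
    return [max(c, 1) ** -smoothing for c in counts]


sequences = [[0, 5, 5, 2], [0, 5, 3, 2]]  # toy binarized data
counts = count_tokens(sequences, vocab_size=8)
print(counts)  # [2, 0, 2, 1, 0, 3, 0, 0]

weights = masking_weights(counts)
print(weights[3] > weights[5])  # True: the rarer token 3 gets a larger weight than token 5
```

Keeping the counts in a dense list indexed by token id is why the vocabulary size has to be known up front.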
|
||||
|
||||
### B. Training
|
||||
@@ -75,6 +84,12 @@ Training with distillation is really simple once you have pre-processed the data
|
||||
|
||||
```bash
|
||||
python train.py \
|
||||
--student_type distilbert \
|
||||
--student_config training_configs/distilbert-base-uncased.json \
|
||||
--teacher_type bert \
|
||||
--teacher_name bert-base-uncased \
|
||||
--alpha_ce 5.0 --alpha_mlm 2.0 --alpha_cos 1.0 --mlm \
|
||||
--freeze_pos_embs \
|
||||
--dump_path serialization_dir/my_first_training \
|
||||
--data_file data/binarized_text.bert-base-uncased.pickle \
|
||||
--token_counts data/token_counts.bert-base-uncased.pickle \
|
||||
@@ -83,7 +98,7 @@ python train.py \
|
||||
|
||||
By default, this will launch a training on a single GPU (even if more are available on the cluster). Other parameters are available in the command line, please look in `train.py` or run `python train.py --help` to list them.
|
||||
|
||||
We highly encourage you to use distributed training for training DistilBert as the training corpus is quite large. Here's an example that runs a distributed training on a single node having 4 GPUs:
|
||||
We highly encourage you to use distributed training for training DistilBERT as the training corpus is quite large. Here's an example that runs a distributed training on a single node having 4 GPUs:
|
||||
|
||||
```bash
|
||||
export NODE_RANK=0
|
||||
@@ -105,11 +120,17 @@ python -m torch.distributed.launch \
|
||||
train.py \
|
||||
--force \
|
||||
--n_gpu $WORLD_SIZE \
|
||||
--student_type distilbert \
|
||||
--student_config training_configs/distilbert-base-uncased.json \
|
||||
--teacher_type bert \
|
||||
--teacher_name bert-base-uncased \
|
||||
--alpha_ce 0.33 --alpha_mlm 0.33 --alpha_cos 0.33 --mlm \
|
||||
--freeze_pos_embs \
|
||||
--dump_path serialization_dir/my_first_training \
|
||||
--data_file data/binarized_text.bert-base-uncased.pickle \
|
||||
--token_counts data/token_counts.bert-base-uncased.pickle \
|
||||
--dump_path serialization_dir/my_first_distillation
|
||||
--token_counts data/token_counts.bert-base-uncased.pickle
|
||||
```
|
||||
|
||||
**Tips:** Starting distillated training with good initialization of the model weights is crucial to reach decent performance. In our experiments, we initialized our model from a few layers of the teacher (Bert) itself! Please refer to `scripts/extract_for_distil.py` to create a valid initialization checkpoint and use `--from_pretrained_weights` and `--from_pretrained_config` arguments to use this initialization for the distilled training!
|
||||
**Tips:** Starting distillated training with good initialization of the model weights is crucial to reach decent performance. In our experiments, we initialized our model from a few layers of the teacher (Bert) itself! Please refer to `scripts/extract.py` and `scripts/extract_distilbert.py` to create a valid initialization checkpoint and use the `--student_pretrained_weights` argument to use this initialization for the distilled training!
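As a rough illustration of what "initializing from a few layers of the teacher" can look like, the sketch below remaps a subset of teacher layers onto consecutive student layers. It is not the extraction script itself: `init_student_from_teacher` is a hypothetical helper, the parameter names are made up, the choice of layers is an assumption, and plain strings stand in for weight tensors. Parameters outside the layer stack (embeddings, heads) are left out of this sketch.

```python
from typing import Dict, List


def init_student_from_teacher(teacher_state: Dict[str, object],
                              teacher_layers: List[int],
                              prefix: str = "encoder.layer.") -> Dict[str, object]:
    """Copy the selected teacher layers into consecutive student layer slots."""
    student_state = {}
    for student_idx, teacher_idx in enumerate(teacher_layers):
        src = f"{prefix}{teacher_idx}."   # trailing dot keeps layer 1 distinct from layer 11
        dst = f"{prefix}{student_idx}."
        for name, value in teacher_state.items():
            if name.startswith(src):
                student_state[dst + name[len(src):]] = value
    return student_state


# Toy 12-layer teacher with one made-up parameter per layer.
teacher_state = {f"encoder.layer.{i}.weight": f"w{i}" for i in range(12)}
teacher_state["embeddings.word_embeddings.weight"] = "emb"  # not handled by this sketch

student_state = init_student_from_teacher(teacher_state, teacher_layers=[0, 2, 4, 7, 9, 11])
print(student_state["encoder.layer.3.weight"])  # w7
```

The resulting dictionary has six layers numbered 0 to 5, each carrying the weights of the teacher layer it was mapped from.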
|
||||
|
||||
Happy distillation!
|
||||
|
||||
@@ -12,8 +12,8 @@
|
||||
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
|
||||
# See the License for the specific language governing permissions and
|
||||
# limitations under the License.
|
||||
""" The distiller to distil DistilBERT
|
||||
adapted in part from Facebook, Inc XLM model (https://github.com/facebookresearch/XLM)
|
||||
""" The distiller to distil the student.
|
||||
Adapted in part from Facebook, Inc XLM model (https://github.com/facebookresearch/XLM)
|
||||
"""
|
||||
import os
|
||||
import math
|
||||
@@ -28,16 +28,19 @@ import torch
|
||||
import torch.nn as nn
|
||||
import torch.nn.functional as F
|
||||
from torch.optim import AdamW
|
||||
from torch.utils.data.distributed import DistributedSampler
|
||||
from torch.utils.data import RandomSampler, BatchSampler, DataLoader
|
||||
|
||||
from transformers import WarmupLinearSchedule
|
||||
|
||||
from utils import logger
|
||||
from dataset import Dataset
|
||||
from lm_seqs_dataset import LmSeqsDataset
|
||||
from grouped_batch_sampler import GroupedBatchSampler, create_lengths_groups
|
||||
|
||||
class Distiller:
|
||||
def __init__(self,
|
||||
params: dict,
|
||||
dataloader: Dataset,
|
||||
dataset: LmSeqsDataset,
|
||||
token_probs: torch.tensor,
|
||||
student: nn.Module,
|
||||
teacher: nn.Module):
|
||||
@@ -50,33 +53,47 @@ class Distiller:
|
||||
self.student = student
|
||||
self.teacher = teacher
|
||||
|
||||
self.dataloader = dataloader
|
||||
if self.params.n_gpu > 1:
|
||||
self.dataloader.split()
|
||||
self.get_iterator(seed=params.seed)
|
||||
self.student_config = student.config
|
||||
self.vocab_size = student.config.vocab_size
|
||||
|
||||
if params.n_gpu <= 1:
|
||||
sampler = RandomSampler(dataset)
|
||||
else:
|
||||
sampler = DistributedSampler(dataset)
|
||||
|
||||
if params.group_by_size:
|
||||
groups = create_lengths_groups(lengths=dataset.lengths, k=params.max_model_input_size)
|
||||
sampler = GroupedBatchSampler(sampler=sampler, group_ids=groups, batch_size=params.batch_size)
|
||||
else:
|
||||
sampler = BatchSampler(sampler=sampler, batch_size=params.batch_size, drop_last=False)
|
||||
|
||||
self.dataloader = DataLoader(dataset=dataset,
|
||||
batch_sampler=sampler,
|
||||
collate_fn=dataset.batch_sequences)
|
||||
|
||||
self.temperature = params.temperature
|
||||
assert self.temperature > 0.
|
||||
|
||||
self.alpha_ce = params.alpha_ce
|
||||
self.alpha_mlm = params.alpha_mlm
|
||||
self.alpha_clm = params.alpha_clm
|
||||
self.alpha_mse = params.alpha_mse
|
||||
self.alpha_cos = params.alpha_cos
|
||||
assert self.alpha_ce >= 0.
|
||||
assert self.alpha_mlm >= 0.
|
||||
assert self.alpha_mse >= 0.
|
||||
assert self.alpha_cos >= 0.
|
||||
assert self.alpha_ce + self.alpha_mlm + self.alpha_mse + self.alpha_cos > 0.
|
||||
|
||||
self.mlm_mask_prop = params.mlm_mask_prop
|
||||
assert 0.0 <= self.mlm_mask_prop <= 1.0
|
||||
assert params.word_mask + params.word_keep + params.word_rand == 1.0
|
||||
self.pred_probs = torch.FloatTensor([params.word_mask, params.word_keep, params.word_rand])
|
||||
self.pred_probs = self.pred_probs.to(f'cuda:{params.local_rank}') if params.n_gpu > 0 else self.pred_probs
|
||||
self.token_probs = token_probs.to(f'cuda:{params.local_rank}') if params.n_gpu > 0 else token_probs
|
||||
if self.fp16:
|
||||
self.pred_probs = self.pred_probs.half()
|
||||
self.token_probs = self.token_probs.half()
|
||||
self.mlm = params.mlm
|
||||
if self.mlm:
|
||||
logger.info(f'Using MLM loss for LM step.')
|
||||
self.mlm_mask_prop = params.mlm_mask_prop
|
||||
assert 0.0 <= self.mlm_mask_prop <= 1.0
|
||||
assert params.word_mask + params.word_keep + params.word_rand == 1.0
|
||||
self.pred_probs = torch.FloatTensor([params.word_mask, params.word_keep, params.word_rand])
|
||||
self.pred_probs = self.pred_probs.to(f'cuda:{params.local_rank}') if params.n_gpu > 0 else self.pred_probs
|
||||
self.token_probs = token_probs.to(f'cuda:{params.local_rank}') if params.n_gpu > 0 else token_probs
|
||||
if self.fp16:
|
||||
self.pred_probs = self.pred_probs.half()
|
||||
self.token_probs = self.token_probs.half()
|
||||
else:
|
||||
logger.info(f'Using CLM loss for LM step.')
|
||||
|
||||
self.epoch = 0
|
||||
self.n_iter = 0
|
||||
@@ -86,12 +103,13 @@ class Distiller:
|
||||
self.last_loss = 0
|
||||
self.last_loss_ce = 0
|
||||
self.last_loss_mlm = 0
|
||||
self.last_loss_clm = 0
|
||||
if self.alpha_mse > 0.: self.last_loss_mse = 0
|
||||
if self.alpha_cos > 0.: self.last_loss_cos = 0
|
||||
self.last_log = 0
|
||||
|
||||
self.ce_loss_fct = nn.KLDivLoss(reduction='batchmean')
|
||||
self.mlm_loss_fct = nn.CrossEntropyLoss(ignore_index=-1)
|
||||
self.lm_loss_fct = nn.CrossEntropyLoss(ignore_index=-1)
|
||||
if self.alpha_mse > 0.:
|
||||
self.mse_loss_fct = nn.MSELoss(reduction='sum')
|
||||
if self.alpha_cos > 0.:
|
||||
@@ -99,7 +117,7 @@ class Distiller:
|
||||
|
||||
logger.info('--- Initializing model optimizer')
|
||||
assert params.gradient_accumulation_steps >= 1
|
||||
self.num_steps_epoch = int(len(self.dataloader) / params.batch_size) + 1
|
||||
self.num_steps_epoch = len(self.dataloader)
|
||||
num_train_optimization_steps = int(self.num_steps_epoch / params.gradient_accumulation_steps * params.n_epoch) + 1
|
||||
|
||||
no_decay = ['bias', 'LayerNorm.weight']
|
||||
@@ -140,43 +158,18 @@ class Distiller:
|
||||
logger.info("Using nn.parallel.DistributedDataParallel for distributed training.")
|
||||
self.student = DistributedDataParallel(self.student,
|
||||
device_ids=[params.local_rank],
|
||||
output_device=params.local_rank)
|
||||
output_device=params.local_rank,
|
||||
find_unused_parameters=True)
|
||||
|
||||
self.is_master = params.is_master
|
||||
if self.is_master:
|
||||
logger.info('--- Initializing Tensorboard')
|
||||
self.tensorboard = SummaryWriter(log_dir=os.path.join(self.dump_path, 'log', 'train'))
|
||||
self.tensorboard.add_text(tag='config', text_string=str(self.params), global_step=0)
|
||||
self.tensorboard.add_text(tag='config/training', text_string=str(self.params), global_step=0)
|
||||
self.tensorboard.add_text(tag='config/student', text_string=str(self.student_config), global_step=0)
|
||||
|
||||
def get_iterator(self,
|
||||
seed: int = None):
|
||||
"""
|
||||
Initialize the data iterator.
|
||||
Each process has its own data iterator (iterating on his own random portion of the dataset).
|
||||
|
||||
Input:
|
||||
------
|
||||
seed: `int` - The random seed.
|
||||
"""
|
||||
logger.info('--- Initializing Data Iterator')
|
||||
self.data_iterator = self.dataloader.get_iterator(seed=seed)
|
||||
|
||||
def get_batch(self):
|
||||
"""
|
||||
Call the data iterator to output a new batch.
|
||||
If the data iterator went through the whole dataset, create a new iterator.
|
||||
"""
|
||||
assert hasattr(self, 'data_iterator')
|
||||
try:
|
||||
x = next(self.data_iterator)
|
||||
except StopIteration:
|
||||
logger.warning('--- Went through the whole dataset. Creating new data iterator.')
|
||||
self.data_iterator = self.dataloader.get_iterator()
|
||||
x = next(self.data_iterator)
|
||||
return x
|
||||
|
||||
def prepare_batch(self,
|
||||
batch):
|
||||
def prepare_batch_mlm(self,
|
||||
batch):
|
||||
"""
|
||||
Prepare the batch: from the token_ids and the lengths, compute the attention mask and the masked label for MLM.
|
||||
|
||||
@@ -222,7 +215,7 @@ class Distiller:
|
||||
assert pred_mask.sum().item() % 8 == 0, pred_mask.sum().item()
|
||||
|
||||
_token_ids_real = token_ids[pred_mask]
|
||||
_token_ids_rand = _token_ids_real.clone().random_(self.params.vocab_size)
|
||||
_token_ids_rand = _token_ids_real.clone().random_(self.vocab_size)
|
||||
_token_ids_mask = _token_ids_real.clone().fill_(self.params.special_tok_ids['mask_token'])
|
||||
probs = torch.multinomial(self.pred_probs, len(_token_ids_real), replacement=True)
|
||||
_token_ids = _token_ids_mask * (probs == 0).long() + _token_ids_real * (probs == 1).long() + _token_ids_rand * (probs == 2).long()
|
||||
@@ -230,8 +223,41 @@ class Distiller:
|
||||
|
||||
mlm_labels[~pred_mask] = -1 # previously `mlm_labels[1-pred_mask] = -1`, cf pytorch 1.2.0 compatibility
|
||||
|
||||
# sanity checks
|
||||
assert 0 <= token_ids.min() <= token_ids.max() < self.vocab_size
|
||||
|
||||
return token_ids, attn_mask, mlm_labels
|
||||
|
||||
def prepare_batch_clm(self,
|
||||
batch):
|
||||
"""
|
||||
Prepare the batch: from the token_ids and the lengths, compute the attention mask and the labels for CLM.
|
||||
|
||||
Input:
|
||||
------
|
||||
batch: `Tuple`
|
||||
token_ids: `torch.tensor(bs, seq_length)` - The token ids for each of the sequence. It is padded.
|
||||
lengths: `torch.tensor(bs)` - The lengths of each of the sequences in the batch.
|
||||
|
||||
Output:
|
||||
-------
|
||||
token_ids: `torch.tensor(bs, seq_length)` - The token ids (left unmasked for CLM, padded by `round_batch`).
|
||||
attn_mask: `torch.tensor(bs, seq_length)` - The attention mask for the self-attention.
|
||||
clm_labels: `torch.tensor(bs, seq_length)` - The causal language modeling labels. There is a -1 where there is nothing to predict.
|
||||
"""
|
||||
token_ids, lengths = batch
|
||||
token_ids, lengths = self.round_batch(x=token_ids, lengths=lengths)
|
||||
assert token_ids.size(0) == lengths.size(0)
|
||||
|
||||
attn_mask = (torch.arange(token_ids.size(1), dtype=torch.long, device=lengths.device) < lengths[:, None])
|
||||
clm_labels = token_ids.new(token_ids.size()).copy_(token_ids)
|
||||
clm_labels[~attn_mask] = -1 # previously `clm_labels[1-attn_mask] = -1`, cf pytorch 1.2.0 compatibility
|
||||
|
||||
# sanity checks
|
||||
assert 0 <= token_ids.min() <= token_ids.max() < self.vocab_size
|
||||
|
||||
return token_ids, attn_mask, clm_labels
|
||||
|
||||
def round_batch(self,
|
||||
x: torch.tensor,
|
||||
lengths: torch.tensor):
|
||||
@@ -269,7 +295,10 @@ class Distiller:
|
||||
if ml1 % 8 != 0:
|
||||
pad = 8 - (ml1 % 8)
|
||||
ml2 = ml1 + pad
|
||||
pad_id = self.params.special_tok_ids['pad_token']
|
||||
if self.mlm:
|
||||
pad_id = self.params.special_tok_ids['pad_token']
|
||||
else:
|
||||
pad_id = self.params.special_tok_ids['unk_token']
|
||||
padding_tensor = torch.zeros(bs2, pad, dtype=torch.long, device=x.device).fill_(pad_id)
|
||||
x = torch.cat([x, padding_tensor], 1)
|
||||
assert x.size() == (bs2, ml2)
|
||||
@@ -292,14 +321,16 @@ class Distiller:
|
||||
if self.multi_gpu:
|
||||
torch.distributed.barrier()
|
||||
|
||||
iter_bar = trange(self.num_steps_epoch, desc="-Iter", disable=self.params.local_rank not in [-1, 0])
|
||||
for __ in range(self.num_steps_epoch):
|
||||
batch = self.get_batch()
|
||||
iter_bar = tqdm(self.dataloader, desc="-Iter", disable=self.params.local_rank not in [-1, 0])
|
||||
for batch in iter_bar:
|
||||
if self.params.n_gpu > 0:
|
||||
batch = tuple(t.to(f'cuda:{self.params.local_rank}') for t in batch)
|
||||
token_ids, attn_mask, mlm_labels = self.prepare_batch(batch=batch)
|
||||
|
||||
self.step(input_ids=token_ids, attention_mask=attn_mask, mlm_labels=mlm_labels)
|
||||
if self.mlm:
|
||||
token_ids, attn_mask, lm_labels = self.prepare_batch_mlm(batch=batch)
|
||||
else:
|
||||
token_ids, attn_mask, lm_labels = self.prepare_batch_clm(batch=batch)
|
||||
self.step(input_ids=token_ids, attention_mask=attn_mask, lm_labels=lm_labels)
|
||||
|
||||
iter_bar.update()
|
||||
iter_bar.set_postfix({'Last_loss': f'{self.last_loss:.2f}',
|
||||
@@ -317,7 +348,7 @@ class Distiller:
|
||||
def step(self,
|
||||
input_ids: torch.tensor,
|
||||
attention_mask: torch.tensor,
|
||||
mlm_labels: torch.tensor):
|
||||
lm_labels: torch.tensor):
|
||||
"""
|
||||
One optimization step: forward of student AND teacher, backward on the loss (for gradient accumulation),
|
||||
and possibly a parameter update (depending on the gradient accumulation).
|
||||
@@ -326,17 +357,22 @@ class Distiller:
|
||||
------
|
||||
input_ids: `torch.tensor(bs, seq_length)` - The token ids.
|
||||
attention_mask: `torch.tensor(bs, seq_length)` - The attention mask for self attention.
|
||||
mlm_labels: `torch.tensor(bs, seq_length)` - The masked language modeling labels.
|
||||
lm_labels: `torch.tensor(bs, seq_length)` - The language modeling labels (mlm labels for MLM and clm labels for CLM).
|
||||
"""
|
||||
s_logits, s_hidden_states = self.student(input_ids=input_ids, attention_mask=attention_mask) # (bs, seq_length, voc_size)
|
||||
with torch.no_grad():
|
||||
t_logits, t_hidden_states = self.teacher(input_ids=input_ids, attention_mask=attention_mask) # (bs, seq_length, voc_size)
|
||||
if self.mlm:
|
||||
s_logits, s_hidden_states = self.student(input_ids=input_ids, attention_mask=attention_mask) # (bs, seq_length, voc_size)
|
||||
with torch.no_grad():
|
||||
t_logits, t_hidden_states = self.teacher(input_ids=input_ids, attention_mask=attention_mask) # (bs, seq_length, voc_size)
|
||||
else:
|
||||
s_logits, _, s_hidden_states = self.student(input_ids=input_ids, attention_mask=None) # (bs, seq_length, voc_size)
|
||||
with torch.no_grad():
|
||||
t_logits, _, t_hidden_states = self.teacher(input_ids=input_ids, attention_mask=None) # (bs, seq_length, voc_size)
|
||||
assert s_logits.size() == t_logits.size()
|
||||
|
||||
#https://github.com/peterliht/knowledge-distillation-pytorch/blob/master/model/net.py#L100
|
||||
#https://github.com/peterliht/knowledge-distillation-pytorch/issues/2
|
||||
if self.params.restrict_ce_to_mask:
|
||||
mask = (mlm_labels>-1).unsqueeze(-1).expand_as(s_logits) # (bs, seq_lenth, voc_size)
|
||||
mask = (lm_labels>-1).unsqueeze(-1).expand_as(s_logits) # (bs, seq_length, voc_size)
|
||||
else:
|
||||
mask = attention_mask.unsqueeze(-1).expand_as(s_logits) # (bs, seq_length, voc_size)
|
||||
s_logits_slct = torch.masked_select(s_logits, mask) # (bs * seq_length * voc_size) modulo the 1s in mask
|
||||
@@ -348,13 +384,20 @@ class Distiller:
|
||||
loss_ce = self.ce_loss_fct(F.log_softmax(s_logits_slct/self.temperature, dim=-1),
|
||||
F.softmax(t_logits_slct/self.temperature, dim=-1)) * (self.temperature)**2
|
||||
loss = self.alpha_ce*loss_ce
|
||||
|
||||
if self.alpha_mlm > 0.:
|
||||
loss_mlm = self.mlm_loss_fct(s_logits.view(-1, s_logits.size(-1)), mlm_labels.view(-1))
|
||||
loss_mlm = self.lm_loss_fct(s_logits.view(-1, s_logits.size(-1)), lm_labels.view(-1))
|
||||
loss += self.alpha_mlm * loss_mlm
|
||||
if self.alpha_clm > 0.:
|
||||
shift_logits = s_logits[..., :-1, :].contiguous()
|
||||
shift_labels = lm_labels[..., 1:].contiguous()
|
||||
loss_clm = self.lm_loss_fct(shift_logits.view(-1, shift_logits.size(-1)),
|
||||
shift_labels.view(-1))
|
||||
loss += self.alpha_clm * loss_clm
|
||||
|
||||
if self.alpha_mse > 0.:
|
||||
loss_mse = self.mse_loss_fct(s_logits_slct, t_logits_slct)/s_logits_slct.size(0) # Reproducing batchmean reduction
|
||||
loss += self.alpha_mse * loss_mse
|
||||
|
||||
if self.alpha_cos > 0.:
|
||||
s_hidden_states = s_hidden_states[-1] # (bs, seq_length, dim)
|
||||
t_hidden_states = t_hidden_states[-1] # (bs, seq_length, dim)
|
||||
@@ -376,6 +419,8 @@ class Distiller:
|
||||
self.last_loss_ce = loss_ce.item()
|
||||
if self.alpha_mlm > 0.:
|
||||
self.last_loss_mlm = loss_mlm.item()
|
||||
if self.alpha_clm > 0.:
|
||||
self.last_loss_clm = loss_clm.item()
|
||||
if self.alpha_mse > 0.:
|
||||
self.last_loss_mse = loss_mse.item()
|
||||
if self.alpha_cos > 0.:
|
||||
@@ -452,6 +497,8 @@ class Distiller:
|
||||
self.tensorboard.add_scalar(tag="losses/loss_ce", scalar_value=self.last_loss_ce, global_step=self.n_total_iter)
|
||||
if self.alpha_mlm > 0.:
|
||||
self.tensorboard.add_scalar(tag="losses/loss_mlm", scalar_value=self.last_loss_mlm, global_step=self.n_total_iter)
|
||||
if self.alpha_clm > 0.:
|
||||
self.tensorboard.add_scalar(tag="losses/loss_clm", scalar_value=self.last_loss_clm, global_step=self.n_total_iter)
|
||||
if self.alpha_mse > 0.:
|
||||
self.tensorboard.add_scalar(tag="losses/loss_mse", scalar_value=self.last_loss_mse, global_step=self.n_total_iter)
|
||||
if self.alpha_cos > 0.:
|
||||
|
||||
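The `loss_ce` term in `Distiller.step` above divides both student and teacher logits by a temperature before comparing them, then multiplies the result by the squared temperature. The following is a minimal plain-Python sketch of that computation for a single position, assuming `ce_loss_fct` is a KL divergence (as it is in `run_squad_w_distillation.py`). The names `softmax_t` and `distill_kl` are illustrative and do not appear in the diff.

```python
import math


def softmax_t(logits, temperature):
    """Softmax of logits / temperature, computed in a numerically stable way."""
    scaled = [x / temperature for x in logits]
    m = max(scaled)
    exps = [math.exp(x - m) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def distill_kl(student_logits, teacher_logits, temperature=2.0):
    """KL(teacher || student) on temperature-softened distributions, times T**2."""
    p_teacher = softmax_t(teacher_logits, temperature)
    p_student = softmax_t(student_logits, temperature)
    kl = sum(pt * (math.log(pt) - math.log(ps))
             for pt, ps in zip(p_teacher, p_student) if pt > 0.0)
    # The T**2 factor keeps gradient magnitudes comparable across temperatures.
    return kl * temperature ** 2


if __name__ == "__main__":
    teacher = [2.0, 1.0, 0.1]
    print(distill_kl(teacher, teacher))          # identical logits: no divergence
    print(distill_kl([0.1, 1.0, 2.0], teacher))  # mismatched logits: positive loss
```

A higher temperature flattens both distributions, so the student is also pushed to match the teacher's relative ranking of unlikely tokens rather than only its top prediction.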
105
examples/distillation/grouped_batch_sampler.py
Normal file
@@ -0,0 +1,105 @@
# coding=utf-8
# Copyright 2019-present, the HuggingFace Inc. team and Facebook, Inc.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
""" Adapted from PyTorch Vision (https://github.com/pytorch/vision/blob/master/references/detection/group_by_aspect_ratio.py)
"""
import bisect
import copy
from collections import defaultdict
import numpy as np

from torch.utils.data.sampler import BatchSampler, Sampler

from utils import logger

def _quantize(x, bins):
    bins = copy.deepcopy(bins)
    bins = sorted(bins)
    quantized = list(map(lambda y: bisect.bisect_right(bins, y), x))
    return quantized

def create_lengths_groups(lengths, k=0):
    bins = np.arange(start=3, stop=k, step=4).tolist() if k > 0 else [10]
    groups = _quantize(lengths, bins)
    # count number of elements per group
    counts = np.unique(groups, return_counts=True)[1]
    fbins = [0] + bins + [np.inf]
    logger.info("Using {} as bins for aspect lengths quantization".format(fbins))
    logger.info("Count of instances per bin: {}".format(counts))
    return groups

class GroupedBatchSampler(BatchSampler):
    """
    Wraps another sampler to yield a mini-batch of indices.
    It enforces that the batch only contain elements from the same group.
    It also tries to provide mini-batches which follows an ordering which is
    as close as possible to the ordering from the original sampler.
    Arguments:
        sampler (Sampler): Base sampler.
        group_ids (list[int]): If the sampler produces indices in range [0, N),
            `group_ids` must be a list of `N` ints which contains the group id of each sample.
            The group ids must be a continuous set of integers starting from
            0, i.e. they must be in the range [0, num_groups).
        batch_size (int): Size of mini-batch.
    """
    def __init__(self, sampler, group_ids, batch_size):
        if not isinstance(sampler, Sampler):
            raise ValueError(
                "sampler should be an instance of "
                "torch.utils.data.Sampler, but got sampler={}".format(sampler)
            )
        self.sampler = sampler
        self.group_ids = group_ids
        self.batch_size = batch_size

    def __iter__(self):
        buffer_per_group = defaultdict(list)
        samples_per_group = defaultdict(list)

        num_batches = 0
        for idx in self.sampler:
            group_id = self.group_ids[idx]
            buffer_per_group[group_id].append(idx)
            samples_per_group[group_id].append(idx)
            if len(buffer_per_group[group_id]) == self.batch_size:
                yield buffer_per_group[group_id] #TODO
                num_batches += 1
                del buffer_per_group[group_id]
            assert len(buffer_per_group[group_id]) < self.batch_size

        # now we have run out of elements that satisfy
        # the group criteria, let's return the remaining
        # elements so that the size of the sampler is
        # deterministic
        expected_num_batches = len(self)
        num_remaining = expected_num_batches - num_batches
        if num_remaining > 0:
            # for the remaining batches, group the batches by similar lengths
            batch_idx = []
            for group_id, idxs in sorted(buffer_per_group.items(), key=lambda x: x[0]):
                batch_idx.extend(idxs)
                if len(batch_idx) >= self.batch_size:
                    yield batch_idx[:self.batch_size]
                    batch_idx = batch_idx[self.batch_size:]
                    num_remaining -= 1
            if len(batch_idx) > 0:
                yield batch_idx
                num_remaining -= 1
        assert num_remaining == 0

    def __len__(self):
        """
        Return the number of mini-batches rather than the number of samples.
        """
        return (len(self.sampler) + self.batch_size - 1) // self.batch_size
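The sampler in `grouped_batch_sampler.py` first buckets sequences by length and then emits batches drawn from a single bucket, merging whatever is left at the end. The following is a simplified, dependency-free sketch of that idea, not the file's implementation: `quantize` and `grouped_batches` are hypothetical names, and the example bins `[10, 20]` are chosen only for illustration.

```python
import bisect
from collections import defaultdict


def quantize(lengths, bins):
    """Map each length to the index of the bin it falls into."""
    bins = sorted(bins)
    return [bisect.bisect_right(bins, n) for n in lengths]


def grouped_batches(indices, group_ids, batch_size):
    """Yield full single-group batches first, then leftovers merged in group order."""
    buffers = defaultdict(list)
    for idx in indices:
        gid = group_ids[idx]
        buffers[gid].append(idx)
        if len(buffers[gid]) == batch_size:
            # A bucket is full: emit it and start a fresh buffer for that group.
            yield buffers.pop(gid)
    # Incomplete buckets are concatenated by increasing group id, so that
    # neighbouring leftovers still have similar lengths.
    leftovers = [i for gid in sorted(buffers) for i in buffers[gid]]
    for start in range(0, len(leftovers), batch_size):
        yield leftovers[start:start + batch_size]


lengths = [4, 12, 5, 30, 11, 6, 28]
groups = quantize(lengths, bins=[10, 20])
batches = list(grouped_batches(range(len(lengths)), groups, batch_size=2))
print(groups)   # bucket index of each sequence
print(batches)  # index batches, each drawn from one bucket where possible
```

Batching sequences of similar length reduces the amount of padding per batch, which is the motivation for grouping before sampling.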
@@ -12,30 +12,33 @@
|
||||
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
|
||||
# See the License for the specific language governing permissions and
|
||||
# limitations under the License.
|
||||
""" Dataloaders to train DistilBERT
|
||||
""" Dataset to distilled models
|
||||
adapted in part from Facebook, Inc XLM model (https://github.com/facebookresearch/XLM)
|
||||
"""
|
||||
from typing import List
|
||||
import math
|
||||
from itertools import chain
|
||||
from collections import Counter
|
||||
import numpy as np
|
||||
import torch
|
||||
from torch.utils.data import Dataset
|
||||
|
||||
import numpy as np
|
||||
from utils import logger
|
||||
|
||||
class Dataset:
|
||||
class LmSeqsDataset(Dataset):
|
||||
"""Custom Dataset wrapping language modeling sequences.
|
||||
|
||||
Each sample will be retrieved by indexing the list of token_ids and their corresponding lengths.
|
||||
|
||||
Input:
|
||||
------
|
||||
params: `NameSpace` parameters
|
||||
data: `List[np.array[int]]`
|
||||
"""
|
||||
|
||||
def __init__(self,
|
||||
params,
|
||||
data):
|
||||
self.params = params
|
||||
self.tokens_per_batch = params.tokens_per_batch
|
||||
self.batch_size = params.batch_size
|
||||
self.shuffle = params.shuffle
|
||||
self.group_by_size = params.group_by_size
|
||||
|
||||
self.token_ids = np.array(data)
|
||||
self.lengths = np.uint16([len(t) for t in data])
|
||||
self.lengths = np.array([len(t) for t in data])
|
||||
|
||||
self.check()
|
||||
self.remove_long_sequences()
|
||||
@@ -43,6 +46,9 @@ class Dataset:
|
||||
self.check()
|
||||
self.print_statistics()
|
||||
|
||||
def __getitem__(self, index):
|
||||
return (self.token_ids[index], self.lengths[index])
|
||||
|
||||
def __len__(self):
|
||||
return len(self.lengths)
|
||||
|
||||
@@ -51,12 +57,14 @@ class Dataset:
|
||||
Some sanity checks
|
||||
"""
|
||||
assert len(self.token_ids) == len(self.lengths)
|
||||
assert all(self.lengths[i] == len(self.token_ids[i]) for i in range(len(self.lengths)))
|
||||
|
||||
def remove_long_sequences(self):
|
||||
"""
|
||||
Sequences that are too long are splitted by chunk of max_position_embeddings.
|
||||
Sequences that are too long are split into chunks of max_model_input_size.
|
||||
"""
|
||||
indices = self.lengths >= self.params.max_position_embeddings
|
||||
max_len = self.params.max_model_input_size
|
||||
indices = self.lengths > max_len
|
||||
logger.info(f'Splitting {sum(indices)} too long sequences.')
|
||||
|
||||
def divide_chunks(l, n):
|
||||
@@ -64,10 +72,13 @@ class Dataset:
|
||||
|
||||
new_tok_ids = []
|
||||
new_lengths = []
|
||||
cls_id, sep_id = self.params.special_tok_ids['cls_token'], self.params.special_tok_ids['sep_token']
|
||||
max_len = self.params.max_position_embeddings
|
||||
if self.params.mlm:
|
||||
cls_id, sep_id = self.params.special_tok_ids['cls_token'], self.params.special_tok_ids['sep_token']
|
||||
else:
|
||||
cls_id, sep_id = self.params.special_tok_ids['bos_token'], self.params.special_tok_ids['eos_token']
|
||||
|
||||
for seq_, len_ in zip(self.token_ids, self.lengths):
|
||||
assert (seq_[0] == cls_id) and (seq_[-1] == sep_id), seq_
|
||||
if len_ <= max_len:
|
||||
new_tok_ids.append(seq_)
|
||||
new_lengths.append(len_)
|
||||
@@ -79,6 +90,7 @@ class Dataset:
|
||||
if sub_s[-1] != sep_id:
|
||||
sub_s = np.insert(sub_s, len(sub_s), sep_id)
|
||||
assert len(sub_s) <= max_len
|
||||
assert (sub_s[0] == cls_id) and (sub_s[-1] == sep_id), sub_s
|
||||
sub_seqs.append(sub_s)
|
||||
|
||||
new_tok_ids.extend(sub_seqs)
|
||||
@@ -113,89 +125,27 @@ class Dataset:
|
||||
# nb_unkown = sum([(t==unk_idx).sum() for t in self.token_ids])
|
||||
# logger.info(f'{nb_unkown} unknown tokens (covering {100*nb_unkown/data_len:.2f}% of the data)')
|
||||
|
||||
def select_data(self, a: int, b: int):
|
||||
"""
|
||||
Select a subportion of the data.
|
||||
"""
|
||||
n_sequences = len(self)
|
||||
assert 0 <= a < b <= n_sequences, ValueError(f'`0 <= a < b <= n_sequences` is not met with a={a} and b={b}')
|
||||
|
||||
logger.info(f'Selecting sequences from {a} to {b} (excluded).')
|
||||
self.token_ids = self.token_ids[a:b]
|
||||
self.lengths = self.lengths[a:b]
|
||||
|
||||
self.check()
|
||||
|
||||
def split(self):
|
||||
"""
|
||||
Distributed training: split the data accross the processes.
|
||||
"""
|
||||
assert self.params.n_gpu > 1
|
||||
logger.info('Splitting the data accross the processuses.')
|
||||
n_seq = len(self)
|
||||
n_seq_per_procesus = n_seq // self.params.world_size
|
||||
a = n_seq_per_procesus * self.params.global_rank
|
||||
b = a + n_seq_per_procesus
|
||||
self.select_data(a=a, b=b)
|
||||
|
||||
def batch_sequences(self,
|
||||
token_ids: List[List[int]],
|
||||
lengths: List[int]):
|
||||
batch):
|
||||
"""
|
||||
Do the padding and transform into torch.tensor.
|
||||
"""
|
||||
token_ids = [t[0] for t in batch]
|
||||
lengths = [t[1] for t in batch]
|
||||
assert len(token_ids) == len(lengths)
|
||||
|
||||
# Max for paddings
|
||||
max_seq_len_ = max(lengths)
|
||||
|
||||
# Pad token ids
|
||||
pad_idx = self.params.special_tok_ids['pad_token']
|
||||
if self.params.mlm:
|
||||
pad_idx = self.params.special_tok_ids['pad_token']
|
||||
else:
|
||||
pad_idx = self.params.special_tok_ids['unk_token']
|
||||
tk_ = [list(t.astype(int)) + [pad_idx]*(max_seq_len_-len(t)) for t in token_ids]
|
||||
assert len(tk_) == len(token_ids)
|
||||
assert all(len(t) == max_seq_len_ for t in tk_)
|
||||
|
||||
tk_t = torch.tensor(tk_) # (bs, max_seq_len_)
|
||||
lg_t = torch.tensor(lengths.astype(int)) # (bs)
|
||||
tk_t = torch.tensor(tk_) # (bs, max_seq_len_)
|
||||
lg_t = torch.tensor(lengths) # (bs)
|
||||
return tk_t, lg_t
|
||||
|
||||
def get_batches_iterator(self,
|
||||
batches):
|
||||
"""
|
||||
Return an iterator over batches.
|
||||
"""
|
||||
for sequences_ids in batches:
|
||||
token_ids, lengths = self.batch_sequences(self.token_ids[sequences_ids],
|
||||
self.lengths[sequences_ids])
|
||||
yield (token_ids, lengths)
|
||||
|
||||
def get_iterator(self,
|
||||
seed: int = None):
|
||||
"""
|
||||
Return a data iterator.
|
||||
"""
|
||||
rng = np.random.RandomState(seed)
|
||||
|
||||
n_sequences = len(self)
|
||||
indices = np.arange(n_sequences)
|
||||
|
||||
if self.group_by_size:
|
||||
indices = indices[np.argsort(self.lengths[indices], kind='mergesort')]
|
||||
|
||||
if self.tokens_per_batch == -1:
|
||||
batches = np.array_split(indices, math.ceil(len(indices) * 1. / self.batch_size))
|
||||
else:
|
||||
assert self.tokens_per_batch > 0
|
||||
batch_ids = np.cumsum(self.lengths[indices]) // self.tokens_per_batch
|
||||
_, bounds = np.unique(batch_ids, return_index=True)
|
||||
batches = [indices[bounds[i]:bounds[i + 1]] for i in range(len(bounds) - 1)]
|
||||
if bounds[-1] < len(indices):
|
||||
batches.append(indices[bounds[-1]:])
|
||||
|
||||
if self.shuffle:
|
||||
rng.shuffle(batches)
|
||||
|
||||
assert n_sequences == sum([len(x) for x in batches])
|
||||
assert self.lengths[indices].sum() == sum([self.lengths[x].sum() for x in batches])
|
||||
|
||||
return self.get_batches_iterator(batches=batches)
|
||||
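Two steps in the dataset hunks above lend themselves to a small worked example: `remove_long_sequences` cuts over-long sequences into chunks that each start and end with the boundary tokens, and `batch_sequences` pads a batch to its longest member. The following is a plain-Python sketch under stated assumptions: the chunk size of `max_len - 2` is assumed because that part of the hunk is not shown, and `split_long_sequence` and `pad_batch` are hypothetical names that operate on lists rather than NumPy arrays.

```python
def split_long_sequence(seq, max_len, first_id, last_id):
    """Split seq into chunks of at most max_len tokens, each wrapped in first_id ... last_id."""
    if len(seq) <= max_len:
        return [list(seq)]
    chunks = []
    step = max_len - 2  # assumption: leave room for the two boundary tokens
    for start in range(0, len(seq), step):
        sub = list(seq[start:start + step])
        if sub[0] != first_id:
            sub.insert(0, first_id)
        if sub[-1] != last_id:
            sub.append(last_id)
        assert len(sub) <= max_len
        chunks.append(sub)
    return chunks


def pad_batch(seqs, pad_id):
    """Right-pad every sequence to the longest one and return (padded, lengths)."""
    lengths = [len(s) for s in seqs]
    width = max(lengths)
    padded = [list(s) + [pad_id] * (width - len(s)) for s in seqs]
    return padded, lengths


# 101 and 102 stand in for the cls/sep (or bos/eos) ids; 0 stands in for the pad id.
chunks = split_long_sequence([101, 1, 2, 3, 4, 5, 6, 7, 102], max_len=5,
                             first_id=101, last_id=102)
padded, lengths = pad_batch(chunks, pad_id=0)
print(chunks)
print(padded, lengths)
```

In the diff, the boundary and padding ids depend on `params.mlm`: `cls_token`/`sep_token` and `pad_token` for MLM, `bos_token`/`eos_token` and `unk_token` for CLM.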
@@ -3,4 +3,4 @@ tensorboard>=1.14.0
|
||||
tensorboardX==1.8
|
||||
psutil==5.6.3
|
||||
scipy==1.3.1
|
||||
pytorch_transformers==1.2.0
|
||||
transformers==2.0.0
|
||||
|
||||
585
examples/distillation/run_squad_w_distillation.py
Normal file
@@ -0,0 +1,585 @@
# coding=utf-8
# Copyright 2018 The Google AI Language Team Authors and The HuggingFace Inc. team.
# Copyright (c) 2018, NVIDIA CORPORATION. All rights reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
""" This is the exact same script as `examples/run_squad.py` (as of 2019, October 4th) with an additional and optional step of distillation."""

from __future__ import absolute_import, division, print_function

import argparse
import logging
import os
import random
import glob

import numpy as np
import torch
from torch.utils.data import (DataLoader, RandomSampler, SequentialSampler,
                              TensorDataset)
from torch.utils.data.distributed import DistributedSampler
import torch.nn.functional as F
import torch.nn as nn
from tqdm import tqdm, trange

from tensorboardX import SummaryWriter

from transformers import (WEIGHTS_NAME, BertConfig,
                          BertForQuestionAnswering, BertTokenizer,
                          XLMConfig, XLMForQuestionAnswering,
                          XLMTokenizer, XLNetConfig,
                          XLNetForQuestionAnswering,
                          XLNetTokenizer,
                          DistilBertConfig, DistilBertForQuestionAnswering, DistilBertTokenizer)

from transformers import AdamW, WarmupLinearSchedule

from ..utils_squad import (read_squad_examples, convert_examples_to_features,
                           RawResult, write_predictions,
                           RawResultExtended, write_predictions_extended)

# The following import is the official SQuAD evaluation script (2.0).
# You can remove it from the dependencies if you are using this script outside of the library
# We've added it here for automated tests (see examples/test_examples.py file)
from ..utils_squad_evaluate import EVAL_OPTS, main as evaluate_on_squad

logger = logging.getLogger(__name__)

ALL_MODELS = sum((tuple(conf.pretrained_config_archive_map.keys()) \
                  for conf in (BertConfig, XLNetConfig, XLMConfig)), ())

MODEL_CLASSES = {
    'bert': (BertConfig, BertForQuestionAnswering, BertTokenizer),
    'xlnet': (XLNetConfig, XLNetForQuestionAnswering, XLNetTokenizer),
    'xlm': (XLMConfig, XLMForQuestionAnswering, XLMTokenizer),
    'distilbert': (DistilBertConfig, DistilBertForQuestionAnswering, DistilBertTokenizer)
}

def set_seed(args):
    random.seed(args.seed)
    np.random.seed(args.seed)
    torch.manual_seed(args.seed)
    if args.n_gpu > 0:
        torch.cuda.manual_seed_all(args.seed)

def to_list(tensor):
    return tensor.detach().cpu().tolist()

def train(args, train_dataset, model, tokenizer, teacher=None):
    """ Train the model """
    if args.local_rank in [-1, 0]:
        tb_writer = SummaryWriter()

    args.train_batch_size = args.per_gpu_train_batch_size * max(1, args.n_gpu)
    train_sampler = RandomSampler(train_dataset) if args.local_rank == -1 else DistributedSampler(train_dataset)
    train_dataloader = DataLoader(train_dataset, sampler=train_sampler, batch_size=args.train_batch_size)

    if args.max_steps > 0:
        t_total = args.max_steps
        args.num_train_epochs = args.max_steps // (len(train_dataloader) // args.gradient_accumulation_steps) + 1
    else:
        t_total = len(train_dataloader) // args.gradient_accumulation_steps * args.num_train_epochs

    # Prepare optimizer and schedule (linear warmup and decay)
    no_decay = ['bias', 'LayerNorm.weight']
    optimizer_grouped_parameters = [
        {'params': [p for n, p in model.named_parameters() if not any(nd in n for nd in no_decay)], 'weight_decay': args.weight_decay},
        {'params': [p for n, p in model.named_parameters() if any(nd in n for nd in no_decay)], 'weight_decay': 0.0}
        ]
    optimizer = AdamW(optimizer_grouped_parameters, lr=args.learning_rate, eps=args.adam_epsilon)
    scheduler = WarmupLinearSchedule(optimizer, warmup_steps=args.warmup_steps, t_total=t_total)
    if args.fp16:
        try:
            from apex import amp
        except ImportError:
            raise ImportError("Please install apex from https://www.github.com/nvidia/apex to use fp16 training.")
        model, optimizer = amp.initialize(model, optimizer, opt_level=args.fp16_opt_level)

    # multi-gpu training (should be after apex fp16 initialization)
    if args.n_gpu > 1:
        model = torch.nn.DataParallel(model)

    # Distributed training (should be after apex fp16 initialization)
    if args.local_rank != -1:
        model = torch.nn.parallel.DistributedDataParallel(model, device_ids=[args.local_rank],
                                                          output_device=args.local_rank,
                                                          find_unused_parameters=True)

    # Train!
    logger.info("***** Running training *****")
    logger.info(" Num examples = %d", len(train_dataset))
    logger.info(" Num Epochs = %d", args.num_train_epochs)
    logger.info(" Instantaneous batch size per GPU = %d", args.per_gpu_train_batch_size)
    logger.info(" Total train batch size (w. parallel, distributed & accumulation) = %d",
                args.train_batch_size * args.gradient_accumulation_steps * (torch.distributed.get_world_size() if args.local_rank != -1 else 1))
    logger.info(" Gradient Accumulation steps = %d", args.gradient_accumulation_steps)
    logger.info(" Total optimization steps = %d", t_total)

    global_step = 0
    tr_loss, logging_loss = 0.0, 0.0
    model.zero_grad()
    train_iterator = trange(int(args.num_train_epochs), desc="Epoch", disable=args.local_rank not in [-1, 0])
    set_seed(args) # Added here for reproducibility (even between python 2 and 3)
    for _ in train_iterator:
        epoch_iterator = tqdm(train_dataloader, desc="Iteration", disable=args.local_rank not in [-1, 0])
        for step, batch in enumerate(epoch_iterator):
            model.train()
            if teacher is not None:
                teacher.eval()
            batch = tuple(t.to(args.device) for t in batch)
            inputs = {'input_ids': batch[0],
                      'attention_mask': batch[1],
                      'start_positions': batch[3],
                      'end_positions': batch[4]}
            if args.model_type != 'distilbert':
                inputs['token_type_ids'] = None if args.model_type == 'xlm' else batch[2]
            if args.model_type in ['xlnet', 'xlm']:
                inputs.update({'cls_index': batch[5],
                               'p_mask': batch[6]})
            outputs = model(**inputs)
            loss, start_logits_stu, end_logits_stu = outputs

            # Distillation loss
            if teacher is not None:
                if 'token_type_ids' not in inputs:
                    inputs['token_type_ids'] = None if args.teacher_type == 'xlm' else batch[2]
                with torch.no_grad():
                    start_logits_tea, end_logits_tea = teacher(input_ids=inputs['input_ids'],
                                                               token_type_ids=inputs['token_type_ids'],
                                                               attention_mask=inputs['attention_mask'])
                assert start_logits_tea.size() == start_logits_stu.size()
                assert end_logits_tea.size() == end_logits_stu.size()

                loss_fct = nn.KLDivLoss(reduction='batchmean')
                loss_start = loss_fct(F.log_softmax(start_logits_stu/args.temperature, dim=-1),
                                      F.softmax(start_logits_tea/args.temperature, dim=-1)) * (args.temperature**2)
                loss_end = loss_fct(F.log_softmax(end_logits_stu/args.temperature, dim=-1),
                                    F.softmax(end_logits_tea/args.temperature, dim=-1)) * (args.temperature**2)
                loss_ce = (loss_start + loss_end)/2.

                loss = args.alpha_ce*loss_ce + args.alpha_squad*loss

            if args.n_gpu > 1:
                loss = loss.mean() # mean() to average on multi-gpu parallel (not distributed) training
            if args.gradient_accumulation_steps > 1:
                loss = loss / args.gradient_accumulation_steps

            if args.fp16:
                with amp.scale_loss(loss, optimizer) as scaled_loss:
                    scaled_loss.backward()
                torch.nn.utils.clip_grad_norm_(amp.master_params(optimizer), args.max_grad_norm)
            else:
                loss.backward()
                torch.nn.utils.clip_grad_norm_(model.parameters(), args.max_grad_norm)

            tr_loss += loss.item()
            if (step + 1) % args.gradient_accumulation_steps == 0:
                optimizer.step()
                scheduler.step() # Update learning rate schedule
                model.zero_grad()
                global_step += 1

                if args.local_rank in [-1, 0] and args.logging_steps > 0 and global_step % args.logging_steps == 0:
                    # Log metrics
                    if args.local_rank == -1 and args.evaluate_during_training: # Only evaluate when single GPU otherwise metrics may not average well
                        results = evaluate(args, model, tokenizer)
                        for key, value in results.items():
                            tb_writer.add_scalar('eval_{}'.format(key), value, global_step)
                    tb_writer.add_scalar('lr', scheduler.get_lr()[0], global_step)
                    tb_writer.add_scalar('loss', (tr_loss - logging_loss)/args.logging_steps, global_step)
                    logging_loss = tr_loss

                if args.local_rank in [-1, 0] and args.save_steps > 0 and global_step % args.save_steps == 0:
                    # Save model checkpoint
                    output_dir = os.path.join(args.output_dir, 'checkpoint-{}'.format(global_step))
                    if not os.path.exists(output_dir):
                        os.makedirs(output_dir)
                    model_to_save = model.module if hasattr(model, 'module') else model # Take care of distributed/parallel training
                    model_to_save.save_pretrained(output_dir)
                    torch.save(args, os.path.join(output_dir, 'training_args.bin'))
                    logger.info("Saving model checkpoint to %s", output_dir)

            if args.max_steps > 0 and global_step > args.max_steps:
                epoch_iterator.close()
                break
        if args.max_steps > 0 and global_step > args.max_steps:
            train_iterator.close()
            break

    if args.local_rank in [-1, 0]:
        tb_writer.close()

    return global_step, tr_loss / global_step


def evaluate(args, model, tokenizer, prefix=""):
    dataset, examples, features = load_and_cache_examples(args, tokenizer, evaluate=True, output_examples=True)

    if not os.path.exists(args.output_dir) and args.local_rank in [-1, 0]:
        os.makedirs(args.output_dir)

    args.eval_batch_size = args.per_gpu_eval_batch_size * max(1, args.n_gpu)
    # Note that DistributedSampler samples randomly
    eval_sampler = SequentialSampler(dataset) if args.local_rank == -1 else DistributedSampler(dataset)
    eval_dataloader = DataLoader(dataset, sampler=eval_sampler, batch_size=args.eval_batch_size)

    # Eval!
    logger.info("***** Running evaluation {} *****".format(prefix))
    logger.info(" Num examples = %d", len(dataset))
    logger.info(" Batch size = %d", args.eval_batch_size)
    all_results = []
    for batch in tqdm(eval_dataloader, desc="Evaluating"):
        model.eval()
        batch = tuple(t.to(args.device) for t in batch)
        with torch.no_grad():
            inputs = {'input_ids': batch[0],
                      'attention_mask': batch[1]
                      }
            if args.model_type != 'distilbert':
                inputs['token_type_ids'] = None if args.model_type == 'xlm' else batch[2] # XLM doesn't use segment_ids
            example_indices = batch[3]
            if args.model_type in ['xlnet', 'xlm']:
                inputs.update({'cls_index': batch[4],
                               'p_mask': batch[5]})
            outputs = model(**inputs)

        for i, example_index in enumerate(example_indices):
            eval_feature = features[example_index.item()]
            unique_id = int(eval_feature.unique_id)
            if args.model_type in ['xlnet', 'xlm']:
                # XLNet uses a more complex post-processing procedure
                result = RawResultExtended(unique_id = unique_id,
                                           start_top_log_probs = to_list(outputs[0][i]),
                                           start_top_index = to_list(outputs[1][i]),
                                           end_top_log_probs = to_list(outputs[2][i]),
                                           end_top_index = to_list(outputs[3][i]),
                                           cls_logits = to_list(outputs[4][i]))
            else:
                result = RawResult(unique_id = unique_id,
                                   start_logits = to_list(outputs[0][i]),
                                   end_logits = to_list(outputs[1][i]))
            all_results.append(result)

    # Compute predictions
    output_prediction_file = os.path.join(args.output_dir, "predictions_{}.json".format(prefix))
    output_nbest_file = os.path.join(args.output_dir, "nbest_predictions_{}.json".format(prefix))
    if args.version_2_with_negative:
        output_null_log_odds_file = os.path.join(args.output_dir, "null_odds_{}.json".format(prefix))
    else:
        output_null_log_odds_file = None

    if args.model_type in ['xlnet', 'xlm']:
        # XLNet uses a more complex post-processing procedure
        write_predictions_extended(examples, features, all_results, args.n_best_size,
                                   args.max_answer_length, output_prediction_file,
                                   output_nbest_file, output_null_log_odds_file, args.predict_file,
                                   model.config.start_n_top, model.config.end_n_top,
                                   args.version_2_with_negative, tokenizer, args.verbose_logging)
    else:
        write_predictions(examples, features, all_results, args.n_best_size,
                          args.max_answer_length, args.do_lower_case, output_prediction_file,
                          output_nbest_file, output_null_log_odds_file, args.verbose_logging,
                          args.version_2_with_negative, args.null_score_diff_threshold)

    # Evaluate with the official SQuAD script
    evaluate_options = EVAL_OPTS(data_file=args.predict_file,
                                 pred_file=output_prediction_file,
                                 na_prob_file=output_null_log_odds_file)
    results = evaluate_on_squad(evaluate_options)
    return results
|
||||
|
||||
def load_and_cache_examples(args, tokenizer, evaluate=False, output_examples=False):
|
||||
if args.local_rank not in [-1, 0] and not evaluate:
|
||||
torch.distributed.barrier() # Make sure only the first process in distributed training processes the dataset, and the others will use the cache
|
||||
|
||||
# Load data features from cache or dataset file
|
||||
input_file = args.predict_file if evaluate else args.train_file
|
||||
cached_features_file = os.path.join(os.path.dirname(input_file), 'cached_{}_{}_{}'.format(
|
||||
'dev' if evaluate else 'train',
|
||||
list(filter(None, args.model_name_or_path.split('/'))).pop(),
|
||||
str(args.max_seq_length)))
|
||||
if os.path.exists(cached_features_file) and not args.overwrite_cache and not output_examples:
|
||||
logger.info("Loading features from cached file %s", cached_features_file)
|
||||
features = torch.load(cached_features_file)
|
||||
else:
|
||||
logger.info("Creating features from dataset file at %s", input_file)
|
||||
examples = read_squad_examples(input_file=input_file,
|
||||
is_training=not evaluate,
|
||||
version_2_with_negative=args.version_2_with_negative)
|
||||
features = convert_examples_to_features(examples=examples,
|
||||
tokenizer=tokenizer,
|
||||
max_seq_length=args.max_seq_length,
|
||||
doc_stride=args.doc_stride,
|
||||
max_query_length=args.max_query_length,
|
||||
is_training=not evaluate)
|
||||
if args.local_rank in [-1, 0]:
|
||||
logger.info("Saving features into cached file %s", cached_features_file)
|
||||
torch.save(features, cached_features_file)
|
||||
|
||||
if args.local_rank == 0 and not evaluate:
|
||||
torch.distributed.barrier() # Make sure only the first process in distributed training processes the dataset, and the others will use the cache
|
||||
|
||||
# Convert to Tensors and build dataset
|
||||
all_input_ids = torch.tensor([f.input_ids for f in features], dtype=torch.long)
|
||||
all_input_mask = torch.tensor([f.input_mask for f in features], dtype=torch.long)
|
||||
all_segment_ids = torch.tensor([f.segment_ids for f in features], dtype=torch.long)
|
||||
all_cls_index = torch.tensor([f.cls_index for f in features], dtype=torch.long)
|
||||
all_p_mask = torch.tensor([f.p_mask for f in features], dtype=torch.float)
|
||||
if evaluate:
|
||||
all_example_index = torch.arange(all_input_ids.size(0), dtype=torch.long)
|
||||
dataset = TensorDataset(all_input_ids, all_input_mask, all_segment_ids,
|
||||
all_example_index, all_cls_index, all_p_mask)
|
||||
else:
|
||||
all_start_positions = torch.tensor([f.start_position for f in features], dtype=torch.long)
|
||||
all_end_positions = torch.tensor([f.end_position for f in features], dtype=torch.long)
|
||||
dataset = TensorDataset(all_input_ids, all_input_mask, all_segment_ids,
|
||||
all_start_positions, all_end_positions,
|
||||
all_cls_index, all_p_mask)
|
||||
|
||||
if output_examples:
|
||||
return dataset, examples, features
|
||||
return dataset
|
||||
|
||||
|
||||
def main():
|
||||
parser = argparse.ArgumentParser()
|
||||
|
||||
## Required parameters
|
||||
parser.add_argument("--train_file", default=None, type=str, required=True,
|
||||
help="SQuAD json for training. E.g., train-v1.1.json")
|
||||
parser.add_argument("--predict_file", default=None, type=str, required=True,
|
||||
help="SQuAD json for predictions. E.g., dev-v1.1.json or test-v1.1.json")
|
||||
parser.add_argument("--model_type", default=None, type=str, required=True,
|
||||
help="Model type selected in the list: " + ", ".join(MODEL_CLASSES.keys()))
|
||||
parser.add_argument("--model_name_or_path", default=None, type=str, required=True,
|
||||
help="Path to pre-trained model or shortcut name selected in the list: " + ", ".join(ALL_MODELS))
|
||||
parser.add_argument("--output_dir", default=None, type=str, required=True,
|
||||
help="The output directory where the model checkpoints and predictions will be written.")
|
||||
|
||||
# Distillation parameters (optional)
|
||||
parser.add_argument('--teacher_type', default=None, type=str,
|
||||
help="Teacher type. Teacher tokenizer and student (model) tokenizer must output the same tokenization. Only for distillation.")
|
||||
parser.add_argument('--teacher_name_or_path', default=None, type=str,
|
||||
help="Path to the already SQuAD fine-tuned teacher model. Only for distillation.")
|
||||
parser.add_argument('--alpha_ce', default=0.5, type=float,
|
||||
help="Distillation loss linear weight. Only for distillation.")
|
||||
parser.add_argument('--alpha_squad', default=0.5, type=float,
|
||||
help="True SQuAD loss linear weight. Only for distillation.")
|
||||
parser.add_argument('--temperature', default=2.0, type=float,
|
||||
help="Distillation temperature. Only for distillation.")
|
||||
|
||||
## Other parameters
|
||||
parser.add_argument("--config_name", default="", type=str,
|
||||
help="Pretrained config name or path if not the same as model_name")
|
||||
parser.add_argument("--tokenizer_name", default="", type=str,
|
||||
help="Pretrained tokenizer name or path if not the same as model_name")
|
||||
parser.add_argument("--cache_dir", default="", type=str,
|
||||
help="Where do you want to store the pre-trained models downloaded from s3")
|
||||
|
||||
parser.add_argument('--version_2_with_negative', action='store_true',
|
||||
help='If true, the SQuAD examples contain some that do not have an answer.')
|
||||
parser.add_argument('--null_score_diff_threshold', type=float, default=0.0,
|
||||
help="If null_score - best_non_null is greater than the threshold predict null.")
|
||||
|
||||
parser.add_argument("--max_seq_length", default=384, type=int,
|
||||
help="The maximum total input sequence length after WordPiece tokenization. Sequences "
|
||||
"longer than this will be truncated, and sequences shorter than this will be padded.")
|
||||
parser.add_argument("--doc_stride", default=128, type=int,
|
||||
help="When splitting up a long document into chunks, how much stride to take between chunks.")
|
||||
parser.add_argument("--max_query_length", default=64, type=int,
|
||||
help="The maximum number of tokens for the question. Questions longer than this will "
|
||||
"be truncated to this length.")
|
||||
parser.add_argument("--do_train", action='store_true',
|
||||
help="Whether to run training.")
|
||||
parser.add_argument("--do_eval", action='store_true',
|
||||
help="Whether to run eval on the dev set.")
|
||||
parser.add_argument("--evaluate_during_training", action='store_true',
|
||||
help="Run evaluation during training at each logging step.")
|
||||
parser.add_argument("--do_lower_case", action='store_true',
|
||||
help="Set this flag if you are using an uncased model.")
|
||||
|
||||
parser.add_argument("--per_gpu_train_batch_size", default=8, type=int,
|
||||
help="Batch size per GPU/CPU for training.")
|
||||
parser.add_argument("--per_gpu_eval_batch_size", default=8, type=int,
|
||||
help="Batch size per GPU/CPU for evaluation.")
|
||||
parser.add_argument("--learning_rate", default=5e-5, type=float,
|
||||
help="The initial learning rate for Adam.")
|
||||
parser.add_argument('--gradient_accumulation_steps', type=int, default=1,
|
||||
help="Number of update steps to accumulate before performing a backward/update pass.")
|
||||
parser.add_argument("--weight_decay", default=0.0, type=float,
|
||||
help="Weight decay if we apply some.")
|
||||
parser.add_argument("--adam_epsilon", default=1e-8, type=float,
|
||||
help="Epsilon for Adam optimizer.")
|
||||
parser.add_argument("--max_grad_norm", default=1.0, type=float,
|
||||
help="Max gradient norm.")
|
||||
parser.add_argument("--num_train_epochs", default=3.0, type=float,
|
||||
help="Total number of training epochs to perform.")
|
||||
parser.add_argument("--max_steps", default=-1, type=int,
|
||||
help="If > 0: set total number of training steps to perform. Override num_train_epochs.")
|
||||
parser.add_argument("--warmup_steps", default=0, type=int,
|
||||
help="Linear warmup over warmup_steps.")
|
||||
parser.add_argument("--n_best_size", default=20, type=int,
|
||||
help="The total number of n-best predictions to generate in the nbest_predictions.json output file.")
|
||||
parser.add_argument("--max_answer_length", default=30, type=int,
|
||||
help="The maximum length of an answer that can be generated. This is needed because the start "
|
||||
"and end predictions are not conditioned on one another.")
|
||||
parser.add_argument("--verbose_logging", action='store_true',
|
||||
help="If true, all of the warnings related to data processing will be printed. "
|
||||
"A number of warnings are expected for a normal SQuAD evaluation.")
|
||||
|
||||
parser.add_argument('--logging_steps', type=int, default=50,
|
||||
help="Log every X update steps.")
|
||||
parser.add_argument('--save_steps', type=int, default=50,
|
||||
help="Save checkpoint every X update steps.")
|
||||
parser.add_argument("--eval_all_checkpoints", action='store_true',
|
||||
help="Evaluate all checkpoints starting with the same prefix as model_name and ending with the step number.")
|
||||
parser.add_argument("--no_cuda", action='store_true',
|
||||
help="Do not use CUDA even when it is available.")
|
||||
parser.add_argument('--overwrite_output_dir', action='store_true',
|
||||
help="Overwrite the content of the output directory")
|
||||
parser.add_argument('--overwrite_cache', action='store_true',
|
||||
help="Overwrite the cached training and evaluation sets")
|
||||
parser.add_argument('--seed', type=int, default=42,
|
||||
help="random seed for initialization")
|
||||
|
||||
parser.add_argument("--local_rank", type=int, default=-1,
|
||||
help="local_rank for distributed training on gpus")
|
||||
parser.add_argument('--fp16', action='store_true',
|
||||
help="Whether to use 16-bit (mixed) precision (through NVIDIA apex) instead of 32-bit")
|
||||
parser.add_argument('--fp16_opt_level', type=str, default='O1',
|
||||
help="For fp16: Apex AMP optimization level selected in ['O0', 'O1', 'O2', and 'O3']. "
|
||||
"See details at https://nvidia.github.io/apex/amp.html")
|
||||
parser.add_argument('--server_ip', type=str, default='', help="Can be used for distant debugging.")
|
||||
parser.add_argument('--server_port', type=str, default='', help="Can be used for distant debugging.")
|
||||
args = parser.parse_args()
|
||||
|
||||
if os.path.exists(args.output_dir) and os.listdir(args.output_dir) and args.do_train and not args.overwrite_output_dir:
|
||||
raise ValueError("Output directory ({}) already exists and is not empty. Use --overwrite_output_dir to overcome.".format(args.output_dir))
|
||||
|
||||
# Setup distant debugging if needed
|
||||
if args.server_ip and args.server_port:
|
||||
# Distant debugging - see https://code.visualstudio.com/docs/python/debugging#_attach-to-a-local-script
|
||||
import ptvsd
|
||||
print("Waiting for debugger attach")
|
||||
ptvsd.enable_attach(address=(args.server_ip, args.server_port), redirect_output=True)
|
||||
ptvsd.wait_for_attach()
|
||||
|
||||
# Setup CUDA, GPU & distributed training
|
||||
if args.local_rank == -1 or args.no_cuda:
|
||||
device = torch.device("cuda" if torch.cuda.is_available() and not args.no_cuda else "cpu")
|
||||
args.n_gpu = torch.cuda.device_count()
|
||||
else: # Initializes the distributed backend which will take care of synchronizing nodes/GPUs
|
||||
torch.cuda.set_device(args.local_rank)
|
||||
device = torch.device("cuda", args.local_rank)
|
||||
torch.distributed.init_process_group(backend='nccl')
|
||||
args.n_gpu = 1
|
||||
args.device = device
|
||||
|
||||
# Setup logging
|
||||
logging.basicConfig(format = '%(asctime)s - %(levelname)s - %(name)s - %(message)s',
|
||||
datefmt = '%m/%d/%Y %H:%M:%S',
|
||||
level = logging.INFO if args.local_rank in [-1, 0] else logging.WARN)
|
||||
logger.warning("Process rank: %s, device: %s, n_gpu: %s, distributed training: %s, 16-bits training: %s",
|
||||
args.local_rank, device, args.n_gpu, bool(args.local_rank != -1), args.fp16)
|
||||
|
||||
# Set seed
|
||||
set_seed(args)
|
||||
|
||||
# Load pretrained model and tokenizer
|
||||
if args.local_rank not in [-1, 0]:
|
||||
torch.distributed.barrier() # Make sure only the first process in distributed training will download model & vocab
|
||||
|
||||
args.model_type = args.model_type.lower()
|
||||
config_class, model_class, tokenizer_class = MODEL_CLASSES[args.model_type]
|
||||
config = config_class.from_pretrained(args.config_name if args.config_name else args.model_name_or_path)
|
||||
tokenizer = tokenizer_class.from_pretrained(args.tokenizer_name if args.tokenizer_name else args.model_name_or_path, do_lower_case=args.do_lower_case)
|
||||
model = model_class.from_pretrained(args.model_name_or_path, from_tf=bool('.ckpt' in args.model_name_or_path), config=config)
|
||||
|
||||
if args.teacher_type is not None:
|
||||
assert args.teacher_name_or_path is not None
|
||||
assert args.alpha_ce > 0.
|
||||
assert args.alpha_ce + args.alpha_squad > 0.
|
||||
assert args.teacher_type != 'distilbert', "We constrain teachers not to be of type DistilBERT."
|
||||
teacher_config_class, teacher_model_class, _ = MODEL_CLASSES[args.teacher_type]
|
||||
teacher_config = teacher_config_class.from_pretrained(args.teacher_name_or_path)
|
||||
teacher = teacher_model_class.from_pretrained(args.teacher_name_or_path, config=teacher_config)
|
||||
teacher.to(args.device)
|
||||
else:
|
||||
teacher = None
|
||||
|
||||
if args.local_rank == 0:
|
||||
torch.distributed.barrier() # Make sure only the first process in distributed training will download model & vocab
|
||||
|
||||
model.to(args.device)
|
||||
|
||||
logger.info("Training/evaluation parameters %s", args)
|
||||
|
||||
# Training
|
||||
if args.do_train:
|
||||
train_dataset = load_and_cache_examples(args, tokenizer, evaluate=False, output_examples=False)
|
||||
global_step, tr_loss = train(args, train_dataset, model, tokenizer, teacher=teacher)
|
||||
logger.info(" global_step = %s, average loss = %s", global_step, tr_loss)
|
||||
|
||||
|
||||
# Save the trained model and the tokenizer
|
||||
if args.do_train and (args.local_rank == -1 or torch.distributed.get_rank() == 0):
|
||||
# Create output directory if needed
|
||||
if not os.path.exists(args.output_dir) and args.local_rank in [-1, 0]:
|
||||
os.makedirs(args.output_dir)
|
||||
|
||||
logger.info("Saving model checkpoint to %s", args.output_dir)
|
||||
# Save a trained model, configuration and tokenizer using `save_pretrained()`.
|
||||
# They can then be reloaded using `from_pretrained()`
|
||||
model_to_save = model.module if hasattr(model, 'module') else model # Take care of distributed/parallel training
|
||||
model_to_save.save_pretrained(args.output_dir)
|
||||
tokenizer.save_pretrained(args.output_dir)
|
||||
|
||||
# Good practice: save your training arguments together with the trained model
|
||||
torch.save(args, os.path.join(args.output_dir, 'training_args.bin'))
|
||||
|
||||
# Load a trained model and vocabulary that you have fine-tuned
|
||||
model = model_class.from_pretrained(args.output_dir)
|
||||
tokenizer = tokenizer_class.from_pretrained(args.output_dir, do_lower_case=args.do_lower_case)
|
||||
model.to(args.device)
|
||||
|
||||
|
||||
# Evaluation - we can ask to evaluate all the checkpoints (sub-directories) in a directory
|
||||
results = {}
|
||||
if args.do_eval and args.local_rank in [-1, 0]:
|
||||
checkpoints = [args.output_dir]
|
||||
if args.eval_all_checkpoints:
|
||||
checkpoints = list(os.path.dirname(c) for c in sorted(glob.glob(args.output_dir + '/**/' + WEIGHTS_NAME, recursive=True)))
|
||||
logging.getLogger("transformers.modeling_utils").setLevel(logging.WARN) # Reduce model loading logs
|
||||
|
||||
logger.info("Evaluate the following checkpoints: %s", checkpoints)
|
||||
|
||||
for checkpoint in checkpoints:
|
||||
# Reload the model
|
||||
global_step = checkpoint.split('-')[-1] if len(checkpoints) > 1 else ""
|
||||
model = model_class.from_pretrained(checkpoint)
|
||||
model.to(args.device)
|
||||
|
||||
# Evaluate
|
||||
result = evaluate(args, model, tokenizer, prefix=global_step)
|
||||
|
||||
result = dict((k + ('_{}'.format(global_step) if global_step else ''), v) for k, v in result.items())
|
||||
results.update(result)
|
||||
|
||||
logger.info("Results: {}".format(results))
|
||||
|
||||
return results
|
||||
|
||||
|
||||
if __name__ == "__main__":
|
||||
main()
|
||||
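The distillation flags in the script above (`--alpha_ce`, `--alpha_squad`, `--temperature`) suggest a weighted sum of a soft-target term and the regular SQuAD loss. The training loop itself is not part of this excerpt, so the following is only a minimal sketch in plain Python, assuming a KL-divergence soft-target term scaled by the squared temperature. The helper names `softmax_t`, `kl_div`, and `distill_loss` are hypothetical and do not appear in the diff.

```python
import math


def softmax_t(logits, temperature):
    """Softmax of logits / temperature, computed in a numerically stable way."""
    scaled = [x / temperature for x in logits]
    m = max(scaled)
    exps = [math.exp(x - m) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def kl_div(p, q):
    """KL(p || q) for two discrete distributions (q is strictly positive here)."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0.0)


def distill_loss(student_logits, teacher_logits, squad_loss,
                 alpha_ce=0.5, alpha_squad=0.5, temperature=2.0):
    """Assumed combination: alpha_ce * soft-target loss + alpha_squad * hard loss."""
    teacher_probs = softmax_t(teacher_logits, temperature)
    student_probs = softmax_t(student_logits, temperature)
    # Scaling by T^2 is a common convention for temperature-softened targets (assumption).
    loss_ce = kl_div(teacher_probs, student_probs) * temperature ** 2
    return alpha_ce * loss_ce + alpha_squad * squad_loss


# When the student matches the teacher exactly, only the weighted SQuAD loss remains.
matched = distill_loss([1.0, 2.0, 3.0], [1.0, 2.0, 3.0], squad_loss=1.0)
mismatched = distill_loss([3.0, 2.0, 1.0], [1.0, 2.0, 3.0], squad_loss=1.0)
```

The defaults mirror the argument defaults in the parser (0.5, 0.5, 2.0); everything else is illustrative.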
@@ -13,14 +13,14 @@
|
||||
# See the License for the specific language governing permissions and
|
||||
# limitations under the License.
|
||||
"""
|
||||
Preprocessing script before training DistilBERT.
|
||||
Preprocessing script before distillation.
|
||||
"""
|
||||
import argparse
|
||||
import pickle
|
||||
import random
|
||||
import time
|
||||
import numpy as np
|
||||
from transformers import BertTokenizer, RobertaTokenizer
|
||||
from transformers import BertTokenizer, RobertaTokenizer, GPT2Tokenizer
|
||||
import logging
|
||||
|
||||
logging.basicConfig(format = '%(asctime)s - %(levelname)s - %(name)s - %(message)s',
|
||||
@@ -32,7 +32,7 @@ def main():
|
||||
parser = argparse.ArgumentParser(description="Preprocess the data to avoid re-doing it several times by (tokenization + token_to_ids).")
|
||||
parser.add_argument('--file_path', type=str, default='data/dump.txt',
|
||||
help='The path to the data.')
|
||||
parser.add_argument('--tokenizer_type', type=str, default='bert', choices=['bert', 'roberta'])
|
||||
parser.add_argument('--tokenizer_type', type=str, default='bert', choices=['bert', 'roberta', 'gpt2'])
|
||||
parser.add_argument('--tokenizer_name', type=str, default='bert-base-uncased',
|
||||
help="The tokenizer to use.")
|
||||
parser.add_argument('--dump_file', type=str, default='data/dump',
|
||||
@@ -43,10 +43,16 @@ def main():
|
||||
logger.info(f'Loading Tokenizer ({args.tokenizer_name})')
|
||||
if args.tokenizer_type == 'bert':
|
||||
tokenizer = BertTokenizer.from_pretrained(args.tokenizer_name)
|
||||
bos = tokenizer.special_tokens_map['cls_token'] # `[CLS]`
|
||||
sep = tokenizer.special_tokens_map['sep_token'] # `[SEP]`
|
||||
elif args.tokenizer_type == 'roberta':
|
||||
tokenizer = RobertaTokenizer.from_pretrained(args.tokenizer_name)
|
||||
bos = tokenizer.special_tokens_map['bos_token'] # `[CLS]` for bert, `<s>` for roberta
|
||||
sep = tokenizer.special_tokens_map['sep_token'] # `[SEP]` for bert, `</s>` for roberta
|
||||
bos = tokenizer.special_tokens_map['cls_token'] # `<s>`
|
||||
sep = tokenizer.special_tokens_map['sep_token'] # `</s>`
|
||||
elif args.tokenizer_type == 'gpt2':
|
||||
tokenizer = GPT2Tokenizer.from_pretrained(args.tokenizer_name)
|
||||
bos = tokenizer.special_tokens_map['bos_token'] # `<|endoftext|>`
|
||||
sep = tokenizer.special_tokens_map['eos_token'] # `<|endoftext|>`
|
||||
|
||||
logger.info(f'Loading text from {args.file_path}')
|
||||
with open(args.file_path, 'r', encoding='utf8') as fp:
|
||||
|
||||
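The hunk above picks a beginning and an end marker per tokenizer family. How these markers are applied to each line of text is outside the shown context, so the sketch below simply assumes they are placed around the stripped text. The `SPECIAL_TOKENS` table stands in for `tokenizer.special_tokens_map` and uses the token strings given in the diff comments; `boundary_tokens` and `wrap` are hypothetical helper names.

```python
# Stand-in for tokenizer.special_tokens_map, using the values noted in the diff comments.
SPECIAL_TOKENS = {
    "bert": {"cls_token": "[CLS]", "sep_token": "[SEP]"},
    "roberta": {"cls_token": "<s>", "sep_token": "</s>"},
    "gpt2": {"bos_token": "<|endoftext|>", "eos_token": "<|endoftext|>"},
}


def boundary_tokens(tokenizer_type):
    """Return the (bos, sep) pair, following the same branching as the diff."""
    tokens = SPECIAL_TOKENS[tokenizer_type]
    if tokenizer_type == "gpt2":
        return tokens["bos_token"], tokens["eos_token"]
    return tokens["cls_token"], tokens["sep_token"]


def wrap(text, tokenizer_type):
    """Assumed usage: surround the stripped text with the boundary tokens."""
    bos, sep = boundary_tokens(tokenizer_type)
    return f"{bos} {text.strip()} {sep}"


roberta_line = wrap(" hello world \n", "roberta")
gpt2_line = wrap("hi", "gpt2")
```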
89
examples/distillation/scripts/extract.py
Normal file
@@ -0,0 +1,89 @@
# coding=utf-8
# Copyright 2019-present, the HuggingFace Inc. team.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
"""
Preprocessing script before training the distilled model.
Specific to RoBERTa -> DistilRoBERTa and GPT2 -> DistilGPT2.
"""
from transformers import BertForMaskedLM, RobertaForMaskedLM, GPT2LMHeadModel
import torch
import argparse

if __name__ == '__main__':
    parser = argparse.ArgumentParser(description="Extract some layers of the full RobertaForMaskedLM or GPT2LMHeadModel for Transfer Learned Distillation")
    parser.add_argument("--model_type", default="roberta", choices=["roberta", "gpt2"])
    parser.add_argument("--model_name", default='roberta-large', type=str)
    parser.add_argument("--dump_checkpoint", default='serialization_dir/tf_roberta_048131723.pth', type=str)
    parser.add_argument("--vocab_transform", action='store_true')
    args = parser.parse_args()


    if args.model_type == 'roberta':
        model = RobertaForMaskedLM.from_pretrained(args.model_name)
        prefix = 'roberta'
    elif args.model_type == 'gpt2':
        model = GPT2LMHeadModel.from_pretrained(args.model_name)
        prefix = 'transformer'

    state_dict = model.state_dict()
    compressed_sd = {}

    ### Embeddings ###
    if args.model_type == 'gpt2':
        for param_name in ['wte.weight', 'wpe.weight']:
            compressed_sd[f'{prefix}.{param_name}'] = state_dict[f'{prefix}.{param_name}']
    else:
        for w in ['word_embeddings', 'position_embeddings', 'token_type_embeddings']:
            param_name = f'{prefix}.embeddings.{w}.weight'
            compressed_sd[param_name] = state_dict[param_name]
        for w in ['weight', 'bias']:
            param_name = f'{prefix}.embeddings.LayerNorm.{w}'
            compressed_sd[param_name] = state_dict[param_name]

    ### Transformer Blocks ###
    std_idx = 0
    for teacher_idx in [0, 2, 4, 7, 9, 11]:
        if args.model_type == 'gpt2':
            for layer in ['ln_1', 'attn.c_attn', 'attn.c_proj', 'ln_2', 'mlp.c_fc', 'mlp.c_proj']:
                for w in ['weight', 'bias']:
                    compressed_sd[f'{prefix}.h.{std_idx}.{layer}.{w}'] = \
                        state_dict[f'{prefix}.h.{teacher_idx}.{layer}.{w}']
            compressed_sd[f'{prefix}.h.{std_idx}.attn.bias'] = state_dict[f'{prefix}.h.{teacher_idx}.attn.bias']
        else:
            for layer in ['attention.self.query', 'attention.self.key', 'attention.self.value',
                          'attention.output.dense', 'attention.output.LayerNorm',
                          'intermediate.dense', 'output.dense', 'output.LayerNorm']:
                for w in ['weight', 'bias']:
                    compressed_sd[f'{prefix}.encoder.layer.{std_idx}.{layer}.{w}'] = \
                        state_dict[f'{prefix}.encoder.layer.{teacher_idx}.{layer}.{w}']
        std_idx += 1

    ### Language Modeling Head ###
    if args.model_type == 'roberta':
        for layer in ['lm_head.decoder.weight', 'lm_head.bias']:
            compressed_sd[f'{layer}'] = state_dict[f'{layer}']
        if args.vocab_transform:
            for w in ['weight', 'bias']:
                compressed_sd[f'lm_head.dense.{w}'] = state_dict[f'lm_head.dense.{w}']
                compressed_sd[f'lm_head.layer_norm.{w}'] = state_dict[f'lm_head.layer_norm.{w}']
    elif args.model_type == 'gpt2':
        for w in ['weight', 'bias']:
            compressed_sd[f'{prefix}.ln_f.{w}'] = state_dict[f'{prefix}.ln_f.{w}']
        compressed_sd[f'lm_head.weight'] = state_dict[f'lm_head.weight']

    print(f'N layers selected for distillation: {std_idx}')
    print(f'Number of params transferred for distillation: {len(compressed_sd.keys())}')

    print(f'Save transferred checkpoint to {args.dump_checkpoint}.')
    torch.save(compressed_sd, args.dump_checkpoint)
|
||||
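The core of `extract.py` is the renumbering of teacher blocks `[0, 2, 4, 7, 9, 11]` into consecutive student blocks `0..5`. The sketch below illustrates that renumbering on a toy dictionary of strings instead of a real `state_dict`. It matches keys by prefix rather than by the script's explicit list of parameter names, and the helper name `remap_layers` is hypothetical.

```python
TEACHER_LAYERS = [0, 2, 4, 7, 9, 11]  # teacher block indices used in the script


def remap_layers(state_dict, prefix="roberta.encoder.layer", teacher_layers=TEACHER_LAYERS):
    """Copy the selected teacher blocks into consecutive student block indices."""
    compressed = {}
    for student_idx, teacher_idx in enumerate(teacher_layers):
        # The trailing dot keeps layer 1 from matching layer 11.
        old = f"{prefix}.{teacher_idx}."
        new = f"{prefix}.{student_idx}."
        for name, value in state_dict.items():
            if name.startswith(old):
                compressed[new + name[len(old):]] = value
    return compressed


# Toy "checkpoint": one fake parameter per teacher block, 12 blocks in total.
toy = {f"roberta.encoder.layer.{i}.output.dense.weight": f"w{i}" for i in range(12)}
student = remap_layers(toy)
# Student block 3 now holds the parameter that came from teacher block 7.
```

Prefix matching keeps the sketch short, but it would also copy any parameter the explicit list in the script leaves out, so it is not a drop-in replacement.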
@@ -14,6 +14,7 @@
|
||||
# limitations under the License.
|
||||
"""
|
||||
Preprocessing script before training DistilBERT.
|
||||
Specific to BERT -> DistilBERT.
|
||||
"""
|
||||
from transformers import BertForMaskedLM, RobertaForMaskedLM
|
||||
import torch
|
||||
@@ -21,7 +22,7 @@ import argparse
|
||||
|
||||
if __name__ == '__main__':
|
||||
parser = argparse.ArgumentParser(description="Extract some layers of the full BertForMaskedLM or RobertaForMaskedLM for Transfer Learned Distillation")
|
||||
parser.add_argument("--model_type", default="bert", choices=["bert", "roberta"])
|
||||
parser.add_argument("--model_type", default="bert", choices=["bert"])
|
||||
parser.add_argument("--model_name", default='bert-base-uncased', type=str)
|
||||
parser.add_argument("--dump_checkpoint", default='serialization_dir/tf_bert-base-uncased_0247911.pth', type=str)
|
||||
parser.add_argument("--vocab_transform", action='store_true')
|
||||
@@ -31,9 +32,8 @@ if __name__ == '__main__':
|
||||
if args.model_type == 'bert':
|
||||
model = BertForMaskedLM.from_pretrained(args.model_name)
|
||||
prefix = 'bert'
|
||||
elif args.model_type == 'roberta':
|
||||
model = RobertaForMaskedLM.from_pretrained(args.model_name)
|
||||
prefix = 'roberta'
|
||||
else:
|
||||
raise ValueError(f'args.model_type should be "bert".')
|
||||
|
||||
state_dict = model.state_dict()
|
||||
compressed_sd = {}
|
||||
@@ -68,20 +68,12 @@ if __name__ == '__main__':
|
||||
state_dict[f'{prefix}.encoder.layer.{teacher_idx}.output.LayerNorm.{w}']
|
||||
std_idx += 1
|
||||
|
||||
if args.model_type == 'bert':
|
||||
compressed_sd[f'vocab_projector.weight'] = state_dict[f'cls.predictions.decoder.weight']
|
||||
compressed_sd[f'vocab_projector.bias'] = state_dict[f'cls.predictions.bias']
|
||||
if args.vocab_transform:
|
||||
for w in ['weight', 'bias']:
|
||||
compressed_sd[f'vocab_transform.{w}'] = state_dict[f'cls.predictions.transform.dense.{w}']
|
||||
compressed_sd[f'vocab_layer_norm.{w}'] = state_dict[f'cls.predictions.transform.LayerNorm.{w}']
|
||||
elif args.model_type == 'roberta':
|
||||
compressed_sd[f'vocab_projector.weight'] = state_dict[f'lm_head.decoder.weight']
|
||||
compressed_sd[f'vocab_projector.bias'] = state_dict[f'lm_head.bias']
|
||||
if args.vocab_transform:
|
||||
for w in ['weight', 'bias']:
|
||||
compressed_sd[f'vocab_transform.{w}'] = state_dict[f'lm_head.dense.{w}']
|
||||
compressed_sd[f'vocab_layer_norm.{w}'] = state_dict[f'lm_head.layer_norm.{w}']
|
||||
compressed_sd[f'vocab_projector.weight'] = state_dict[f'cls.predictions.decoder.weight']
|
||||
compressed_sd[f'vocab_projector.bias'] = state_dict[f'cls.predictions.bias']
|
||||
if args.vocab_transform:
|
||||
for w in ['weight', 'bias']:
|
||||
compressed_sd[f'vocab_transform.{w}'] = state_dict[f'cls.predictions.transform.dense.{w}']
|
||||
compressed_sd[f'vocab_layer_norm.{w}'] = state_dict[f'cls.predictions.transform.LayerNorm.{w}']
|
||||
|
||||
print(f'N layers selected for distillation: {std_idx}')
|
||||
print(f'Number of params transferred for distillation: {len(compressed_sd.keys())}')
|
||||
@@ -13,7 +13,7 @@
|
||||
# See the License for the specific language governing permissions and
|
||||
# limitations under the License.
|
||||
"""
|
||||
Preprocessing script before training DistilBERT.
|
||||
Preprocessing script before training the distilled model.
|
||||
"""
|
||||
from collections import Counter
|
||||
import argparse
|
||||
|
||||
@@ -13,7 +13,8 @@
|
||||
# See the License for the specific language governing permissions and
|
||||
# limitations under the License.
|
||||
"""
|
||||
Training DistilBERT.
|
||||
Training the distilled model.
|
||||
Supported architectures include: BERT -> DistilBERT, RoBERTa -> DistilRoBERTa, GPT2 -> DistilGPT2.
|
||||
"""
|
||||
import os
|
||||
import argparse
|
||||
@@ -23,68 +24,96 @@ import shutil
|
||||
import numpy as np
|
||||
import torch
|
||||
|
||||
from transformers import BertTokenizer, BertForMaskedLM, RobertaTokenizer, RobertaForMaskedLM
|
||||
from transformers import DistilBertForMaskedLM, DistilBertConfig
|
||||
from transformers import BertConfig, BertForMaskedLM, BertTokenizer
|
||||
from transformers import RobertaConfig, RobertaForMaskedLM, RobertaTokenizer
|
||||
from transformers import DistilBertConfig, DistilBertForMaskedLM, DistilBertTokenizer
|
||||
from transformers import GPT2Config, GPT2LMHeadModel, GPT2Tokenizer
|
||||
|
||||
from distiller import Distiller
|
||||
from utils import git_log, logger, init_gpu_params, set_seed
|
||||
from dataset import Dataset
|
||||
from lm_seqs_dataset import LmSeqsDataset
|
||||
|
||||
|
||||
MODEL_CLASSES = {
|
||||
'distilbert': (DistilBertConfig, DistilBertForMaskedLM, DistilBertTokenizer),
|
||||
'roberta': (RobertaConfig, RobertaForMaskedLM, RobertaTokenizer),
|
||||
'bert': (BertConfig, BertForMaskedLM, BertTokenizer),
|
||||
'gpt2': (GPT2Config, GPT2LMHeadModel, GPT2Tokenizer)
|
||||
}
|
||||
|
||||
def sanity_checks(args):
|
||||
"""
|
||||
A bunch of args sanity checks to perform before even starting...
|
||||
"""
|
||||
assert (args.mlm and args.alpha_mlm > 0.) or (not args.mlm and args.alpha_mlm == 0.)
|
||||
assert (args.alpha_mlm > 0. and args.alpha_clm == 0.) or (args.alpha_mlm == 0. and args.alpha_clm > 0.)
|
||||
if args.mlm:
|
||||
assert os.path.isfile(args.token_counts)
|
||||
assert (args.student_type in ['roberta', 'distilbert']) and (args.teacher_type in ['roberta', 'bert'])
|
||||
else:
|
||||
assert (args.student_type in ['gpt2']) and (args.teacher_type in ['gpt2'])
|
||||
|
||||
assert args.teacher_type == args.student_type or (args.student_type=='distilbert' and args.teacher_type=='bert')
|
||||
assert os.path.isfile(args.student_config)
|
||||
if args.student_pretrained_weights is not None:
|
||||
assert os.path.isfile(args.student_pretrained_weights)
|
||||
|
||||
if args.freeze_token_type_embds: assert args.student_type in ['roberta']
|
||||
|
||||
assert args.alpha_ce >= 0.
|
||||
assert args.alpha_mlm >= 0.
|
||||
assert args.alpha_clm >= 0.
|
||||
assert args.alpha_mse >= 0.
|
||||
assert args.alpha_cos >= 0.
|
||||
assert args.alpha_ce + args.alpha_mlm + args.alpha_clm + args.alpha_mse + args.alpha_cos > 0.
|
||||
|
||||
def freeze_pos_embeddings(student, args):
|
||||
if args.student_type == 'roberta':
|
||||
student.roberta.embeddings.position_embeddings.weight.requires_grad = False
|
||||
elif args.student_type == 'gpt2':
|
||||
student.transformer.wpe.weight.requires_grad = False
|
||||
|
||||
def freeze_token_type_embeddings(student, args):
|
||||
if args.student_type == 'roberta':
|
||||
student.roberta.embeddings.token_type_embeddings.weight.requires_grad = False
|
||||
|
||||
def main():
|
||||
parser = argparse.ArgumentParser(description="Training")
|
||||
parser.add_argument("--force", action='store_true',
|
||||
help="Overwrite dump_path if it already exists.")
|
||||
|
||||
parser.add_argument("--dump_path", type=str, required=True,
|
||||
help="The output directory (log, checkpoints, parameters, etc.)")
|
||||
parser.add_argument("--data_file", type=str, required=True,
|
||||
help="The binarized file (tokenized + tokens_to_ids) and grouped by sequence.")
|
||||
parser.add_argument("--token_counts", type=str, required=True,
|
||||
help="The token counts in the data_file for MLM.")
|
||||
parser.add_argument("--force", action='store_true',
|
||||
help="Overwrite dump_path if it already exists.")
|
||||
|
||||
parser.add_argument("--vocab_size", default=30522, type=int,
|
||||
help="The vocabulary size.")
|
||||
parser.add_argument("--max_position_embeddings", default=512, type=int,
|
||||
help="Maximum sequence length we can model (including [CLS] and [SEP]).")
|
||||
parser.add_argument("--sinusoidal_pos_embds", action='store_false',
|
||||
help="If true, the position embeddings are simply fixed with sinusoidal embeddings.")
|
||||
parser.add_argument("--n_layers", default=6, type=int,
|
||||
help="Number of Transformer blocks.")
|
||||
parser.add_argument("--n_heads", default=12, type=int,
|
||||
help="Number of heads in the self-attention module.")
|
||||
parser.add_argument("--dim", default=768, type=int,
|
||||
help="Dimension through the network. Must be divisible by n_heads")
|
||||
parser.add_argument("--hidden_dim", default=3072, type=int,
|
||||
help="Intermediate dimension in the FFN.")
|
||||
parser.add_argument("--dropout", default=0.1, type=float,
|
||||
help="Dropout.")
|
||||
parser.add_argument("--attention_dropout", default=0.1, type=float,
|
||||
help="Dropout in self-attention.")
|
||||
parser.add_argument("--activation", default='gelu', type=str,
|
||||
help="Activation to use in self-attention")
|
||||
parser.add_argument("--tie_weights_", action='store_false',
|
||||
help="If true, we tie the embeddings matrix with the projection over the vocabulary matrix. Default is true.")
|
||||
|
||||
parser.add_argument("--from_pretrained_weights", default=None, type=str,
|
||||
parser.add_argument("--student_type", type=str, choices=["distilbert", "roberta", "gpt2"], required=True,
|
||||
help="The student type (DistilBERT, RoBERTa, GPT-2).")
|
||||
parser.add_argument("--student_config", type=str, required=True,
|
||||
help="Path to the student configuration.")
|
||||
parser.add_argument("--student_pretrained_weights", default=None, type=str,
|
||||
help="Load student initialization checkpoint.")
|
||||
parser.add_argument("--from_pretrained_config", default=None, type=str,
|
||||
help="Load student initialization architecture config.")
|
||||
parser.add_argument("--teacher_type", default="bert", choices=["bert", "roberta"],
|
||||
|
||||
parser.add_argument("--teacher_type", choices=["bert", "roberta", "gpt2"], required=True,
|
||||
help="Teacher type (BERT, RoBERTa, GPT-2).")
|
||||
parser.add_argument("--teacher_name", default="bert-base-uncased", type=str,
|
||||
parser.add_argument("--teacher_name", type=str, required=True,
|
||||
help="The teacher model.")
|
||||
|
||||
parser.add_argument("--temperature", default=2., type=float,
|
||||
help="Temperature for the softmax temperature.")
|
||||
parser.add_argument("--alpha_ce", default=0.5, type=float,
|
||||
help="Linear weight for the distillation loss. Must be >=0.")
|
||||
parser.add_argument("--alpha_mlm", default=0.5, type=float,
|
||||
help="Linear weight for the MLM loss. Must be >=0.")
|
||||
parser.add_argument("--alpha_mlm", default=0.0, type=float,
|
||||
help="Linear weight for the MLM loss. Must be >=0. Should be used in conjunction with `mlm` flag.")
|
||||
parser.add_argument("--alpha_clm", default=0.5, type=float,
|
||||
help="Linear weight for the CLM loss. Must be >=0.")
|
||||
parser.add_argument("--alpha_mse", default=0.0, type=float,
|
||||
help="Linear weight of the MSE loss. Must be >=0.")
|
||||
parser.add_argument("--alpha_cos", default=0.0, type=float,
|
||||
help="Linear weight of the cosine embedding loss. Must be >=0.")
|
||||
|
||||
parser.add_argument("--mlm", action="store_true",
|
||||
help="The LM step: MLM or CLM. If `mlm` is True, the MLM is used over CLM.")
|
||||
parser.add_argument("--mlm_mask_prop", default=0.15, type=float,
|
||||
help="Proportion of tokens for which we need to make a prediction.")
|
||||
parser.add_argument("--word_mask", default=0.8, type=float,
|
||||
@@ -95,17 +124,20 @@ def main():
|
||||
help="Proportion of tokens to randomly replace.")
|
||||
parser.add_argument("--mlm_smoothing", default=0.7, type=float,
|
||||
help="Smoothing parameter to emphasize more rare tokens (see XLM, similar to word2vec).")
|
||||
parser.add_argument("--token_counts", type=str,
|
||||
help="The token counts in the data_file for MLM.")
|
||||
|
||||
parser.add_argument("--restrict_ce_to_mask", action='store_true',
|
||||
help="If true, compute the distillation loss only on the [MLM] prediction distribution.")
|
||||
parser.add_argument("--freeze_pos_embs", action="store_true",
|
||||
help="Freeze positional embeddings during distillation. For student_type in ['roberta', 'gpt2'] only.")
|
||||
parser.add_argument("--freeze_token_type_embds", action="store_true",
|
||||
help="Freeze token type embeddings during distillation if existent. For student_type in ['roberta'] only.")
|
||||
|
||||
parser.add_argument("--n_epoch", type=int, default=3,
|
||||
help="Number of pass on the whole dataset.")
|
||||
parser.add_argument("--batch_size", type=int, default=5,
|
||||
help="Batch size (for each process).")
|
||||
parser.add_argument("--tokens_per_batch", type=int, default=-1,
|
||||
help="If specified, modify the batches so that they have approximately this number of tokens.")
|
||||
parser.add_argument("--shuffle", action='store_false',
|
||||
help="If true, shuffle the sequence order. Default is true.")
|
||||
parser.add_argument("--group_by_size", action='store_false',
|
||||
help="If true, group sequences that have similar length into the same batch. Default is true.")
|
||||
|
||||
@@ -141,6 +173,7 @@ def main():
|
||||
parser.add_argument("--checkpoint_interval", type=int, default=4000,
|
||||
help="Checkpoint interval.")
|
||||
args = parser.parse_args()
|
||||
sanity_checks(args)
|
||||
|
||||
|
||||
## ARGS ##
|
||||
@@ -164,21 +197,19 @@ def main():
|
||||
with open(os.path.join(args.dump_path, 'parameters.json'), 'w') as f:
|
||||
json.dump(vars(args), f, indent=4)
|
||||
git_log(args.dump_path)
|
||||
assert (args.from_pretrained_weights is None and args.from_pretrained_config is None) or \
|
||||
(args.from_pretrained_weights is not None and args.from_pretrained_config is not None)
|
||||
|
||||
student_config_class, student_model_class, _ = MODEL_CLASSES[args.student_type]
|
||||
teacher_config_class, teacher_model_class, teacher_tokenizer_class = MODEL_CLASSES[args.teacher_type]
|
||||
|
||||
### TOKENIZER ###
|
||||
if args.teacher_type == 'bert':
|
||||
tokenizer = BertTokenizer.from_pretrained(args.teacher_name)
|
||||
elif args.teacher_type == 'roberta':
|
||||
tokenizer = RobertaTokenizer.from_pretrained(args.teacher_name)
|
||||
tokenizer = teacher_tokenizer_class.from_pretrained(args.teacher_name)
|
||||
special_tok_ids = {}
|
||||
for tok_name, tok_symbol in tokenizer.special_tokens_map.items():
|
||||
idx = tokenizer.all_special_tokens.index(tok_symbol)
|
||||
special_tok_ids[tok_name] = tokenizer.all_special_ids[idx]
|
||||
logger.info(f'Special tokens {special_tok_ids}')
|
||||
args.special_tok_ids = special_tok_ids
|
||||
args.max_model_input_size = tokenizer.max_model_input_sizes[args.teacher_name]
|
||||
|
||||
|
||||
## DATA LOADER ##
|
||||
@@ -187,35 +218,34 @@ def main():
|
||||
data = pickle.load(fp)
|
||||
|
||||
|
||||
assert os.path.isfile(args.token_counts)
|
||||
logger.info(f'Loading token counts from {args.token_counts} (already pre-computed)')
|
||||
with open(args.token_counts, 'rb') as fp:
|
||||
counts = pickle.load(fp)
|
||||
assert len(counts) == args.vocab_size
|
||||
token_probs = np.maximum(counts, 1) ** -args.mlm_smoothing
|
||||
for idx in special_tok_ids.values():
|
||||
token_probs[idx] = 0. # do not predict special tokens
|
||||
token_probs = torch.from_numpy(token_probs)
|
||||
if args.mlm:
|
||||
logger.info(f'Loading token counts from {args.token_counts} (already pre-computed)')
|
||||
with open(args.token_counts, 'rb') as fp:
|
||||
counts = pickle.load(fp)
|
||||
|
||||
token_probs = np.maximum(counts, 1) ** -args.mlm_smoothing
|
||||
for idx in special_tok_ids.values():
|
||||
token_probs[idx] = 0. # do not predict special tokens
|
||||
token_probs = torch.from_numpy(token_probs)
|
||||
else:
|
||||
token_probs = None
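The masking weights computed above raise each token count, floored at 1, to the power `-mlm_smoothing`, then zero out the special tokens so they are never selected for prediction. The same arithmetic on plain Python lists is sketched below. The helper name and the toy counts are assumptions for illustration, and the script itself works on NumPy arrays and a torch tensor.

```python
def smoothed_token_probs(counts, smoothing, special_ids):
    """Unnormalized masking weights: max(count, 1) ** -smoothing, 0 for special tokens."""
    probs = [max(c, 1) ** -smoothing for c in counts]
    for idx in special_ids:
        probs[idx] = 0.0  # never pick special tokens for prediction
    return probs


# Toy vocabulary of four ids; id 3 plays the role of a special token.
weights = smoothed_token_probs([1000, 10, 0, 50], smoothing=0.7, special_ids=[3])
```

Rare tokens get larger weights than frequent ones, which is the emphasis on rare tokens that the `--mlm_smoothing` help text describes.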
|
||||
|
||||
|
||||
train_dataloader = Dataset(params=args, data=data)
|
||||
train_lm_seq_dataset = LmSeqsDataset(params=args, data=data)
|
||||
logger.info(f'Data loader created.')
|
||||
|
||||
|
||||
## STUDENT ##
|
||||
if args.from_pretrained_weights is not None:
|
||||
assert os.path.isfile(args.from_pretrained_weights)
|
||||
assert os.path.isfile(args.from_pretrained_config)
|
||||
logger.info(f'Loading pretrained weights from {args.from_pretrained_weights}')
|
||||
logger.info(f'Loading pretrained config from {args.from_pretrained_config}')
|
||||
stu_architecture_config = DistilBertConfig.from_json_file(args.from_pretrained_config)
|
||||
stu_architecture_config.output_hidden_states = True
|
||||
student = DistilBertForMaskedLM.from_pretrained(args.from_pretrained_weights,
|
||||
config=stu_architecture_config)
|
||||
logger.info(f'Loading student config from {args.student_config}')
|
||||
stu_architecture_config = student_config_class.from_pretrained(args.student_config)
|
||||
stu_architecture_config.output_hidden_states = True
|
||||
|
||||
if args.student_pretrained_weights is not None:
|
||||
logger.info(f'Loading pretrained weights from {args.student_pretrained_weights}')
|
||||
student = student_model_class.from_pretrained(args.student_pretrained_weights,
|
||||
config=stu_architecture_config)
|
||||
else:
|
||||
args.vocab_size_or_config_json_file = args.vocab_size
|
||||
stu_architecture_config = DistilBertConfig(**vars(args), output_hidden_states=True)
|
||||
student = DistilBertForMaskedLM(stu_architecture_config)
|
||||
student = student_model_class(stu_architecture_config)
|
||||
|
||||
|
||||
if args.n_gpu > 0:
|
||||
@@ -224,18 +254,31 @@ def main():
|
||||
|
||||
|
||||
## TEACHER ##
|
||||
if args.teacher_type == 'bert':
|
||||
teacher = BertForMaskedLM.from_pretrained(args.teacher_name, output_hidden_states=True)
|
||||
elif args.teacher_type == 'roberta':
|
||||
teacher = RobertaForMaskedLM.from_pretrained(args.teacher_name, output_hidden_states=True)
|
||||
teacher = teacher_model_class.from_pretrained(args.teacher_name, output_hidden_states=True)
|
||||
if args.n_gpu > 0:
|
||||
teacher.to(f'cuda:{args.local_rank}')
|
||||
logger.info(f'Teacher loaded from {args.teacher_name}.')
|
||||
|
||||
|
||||
## FREEZING ##
|
||||
if args.freeze_pos_embs:
|
||||
freeze_pos_embeddings(student, args)
|
||||
if args.freeze_token_type_embds:
|
||||
freeze_token_type_embeddings(student, args)
|
||||
|
||||
|
||||
## SANITY CHECKS ##
|
||||
assert student.config.vocab_size == teacher.config.vocab_size
|
||||
assert student.config.hidden_size == teacher.config.hidden_size
|
||||
assert student.config.max_position_embeddings == teacher.config.max_position_embeddings
|
||||
if args.mlm:
|
||||
assert token_probs.size(0) == stu_architecture_config.vocab_size
|
||||
|
||||
|
||||
## DISTILLER ##
|
||||
torch.cuda.empty_cache()
|
||||
distiller = Distiller(params=args,
|
||||
dataloader=train_dataloader,
|
||||
dataset=train_lm_seq_dataset,
|
||||
token_probs=token_probs,
|
||||
student=student,
|
||||
teacher=teacher)
|
||||
|
||||
@@ -0,0 +1,15 @@
|
||||
{
    "activation": "gelu",
    "attention_dropout": 0.1,
    "dim": 768,
    "dropout": 0.1,
    "hidden_dim": 3072,
    "initializer_range": 0.02,
    "max_position_embeddings": 512,
    "n_heads": 12,
    "n_layers": 6,
    "sinusoidal_pos_embds": true,
    "tie_weights_": true,
    "vocab_size": 30522
}
|
||||
|
||||
10
examples/distillation/training_configs/distilgpt2.json
Normal file
@@ -0,0 +1,10 @@
|
||||
{
    "initializer_range": 0.02,
    "layer_norm_epsilon": 0.00001,
    "n_ctx": 1024,
    "n_embd": 768,
    "n_head": 12,
    "n_layer": 6,
    "n_positions": 1024,
    "vocab_size": 50257
}
|
||||
@@ -14,7 +14,7 @@
|
||||
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
|
||||
# See the License for the specific language governing permissions and
|
||||
# limitations under the License.
|
||||
""" Conditional text generation with the auto-regressive models of the library (GPT/GPT-2/Transformer-XL/XLNet)
|
||||
""" Conditional text generation with the auto-regressive models of the library (GPT/GPT-2/CTRL/Transformer-XL/XLNet)
|
||||
"""
|
||||
from __future__ import absolute_import, division, print_function, unicode_literals
|
||||
|
||||
@@ -26,12 +26,14 @@ import torch
|
||||
import torch.nn.functional as F
|
||||
import numpy as np
|
||||
|
||||
from transformers import GPT2Config, OpenAIGPTConfig, XLNetConfig, TransfoXLConfig
|
||||
from transformers import GPT2Config, OpenAIGPTConfig, XLNetConfig, TransfoXLConfig, XLMConfig, CTRLConfig
|
||||
|
||||
from transformers import GPT2LMHeadModel, GPT2Tokenizer
|
||||
from transformers import OpenAIGPTLMHeadModel, OpenAIGPTTokenizer
|
||||
from transformers import XLNetLMHeadModel, XLNetTokenizer
|
||||
from transformers import TransfoXLLMHeadModel, TransfoXLTokenizer
|
||||
from transformers import CTRLLMHeadModel, CTRLTokenizer
|
||||
from transformers import XLMWithLMHeadModel, XLMTokenizer
|
||||
|
||||
|
||||
logging.basicConfig(format = '%(asctime)s - %(levelname)s - %(name)s - %(message)s',
|
||||
@@ -41,13 +43,15 @@ logger = logging.getLogger(__name__)
|
||||
|
||||
MAX_LENGTH = int(10000) # Hardcoded max length to avoid infinite loop
|
||||
|
||||
ALL_MODELS = sum((tuple(conf.pretrained_config_archive_map.keys()) for conf in (GPT2Config, OpenAIGPTConfig, XLNetConfig, TransfoXLConfig)), ())
|
||||
ALL_MODELS = sum((tuple(conf.pretrained_config_archive_map.keys()) for conf in (GPT2Config, OpenAIGPTConfig, XLNetConfig, TransfoXLConfig, XLMConfig, CTRLConfig)), ())
|
||||
|
||||
MODEL_CLASSES = {
|
||||
'gpt2': (GPT2LMHeadModel, GPT2Tokenizer),
|
||||
'ctrl': (CTRLLMHeadModel, CTRLTokenizer),
|
||||
'openai-gpt': (OpenAIGPTLMHeadModel, OpenAIGPTTokenizer),
|
||||
'xlnet': (XLNetLMHeadModel, XLNetTokenizer),
|
||||
'transfo-xl': (TransfoXLLMHeadModel, TransfoXLTokenizer),
|
||||
'xlm': (XLMWithLMHeadModel, XLMTokenizer),
|
||||
}
|
||||
|
||||
# Padding text to help Transformer-XL and XLNet with short prompts as proposed by Aman Rusia
|
||||
@@ -103,7 +107,7 @@ def top_k_top_p_filtering(logits, top_k=0, top_p=0.0, filter_value=-float('Inf')
|
||||
return logits
|
||||
|
||||
|
||||
def sample_sequence(model, length, context, num_samples=1, temperature=1, top_k=0, top_p=0.0, is_xlnet=False, device='cpu'):
|
||||
def sample_sequence(model, length, context, num_samples=1, temperature=1, top_k=0, top_p=0.0, repetition_penalty=1.0, is_xlnet=False, xlm_lang=None, device='cpu'):
|
||||
context = torch.tensor(context, dtype=torch.long, device=device)
|
||||
context = context.unsqueeze(0).repeat(num_samples, 1)
|
||||
generated = context
|
||||
@@ -121,10 +125,21 @@ def sample_sequence(model, length, context, num_samples=1, temperature=1, top_k=
|
||||
target_mapping[0, 0, -1] = 1.0 # predict last token
|
||||
inputs = {'input_ids': input_ids, 'perm_mask': perm_mask, 'target_mapping': target_mapping}
|
||||
|
||||
if xlm_lang is not None:
|
||||
inputs["langs"] = torch.tensor([xlm_lang] * inputs["input_ids"].shape[1], device=device).view(1, -1)
|
||||
|
||||
outputs = model(**inputs) # Note: we could also use 'past' with GPT-2/Transfo-XL/XLNet (cached hidden-states)
|
||||
next_token_logits = outputs[0][0, -1, :] / temperature
|
||||
next_token_logits = outputs[0][0, -1, :] / (temperature if temperature > 0 else 1.)
|
||||
|
||||
# repetition penalty from CTRL (https://arxiv.org/abs/1909.05858)
|
||||
for _ in set(generated):
|
||||
next_token_logits[_] /= repetition_penalty
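The loop above divides the logit of every previously generated token by `repetition_penalty`. One caveat: dividing a negative logit by a penalty greater than 1 moves it toward zero, which makes that token more likely, not less. The sketch below is an assumption-laden variant, not the code in this diff: it works on plain lists of floats and integer ids, and the sign handling is added here for illustration.

```python
def apply_repetition_penalty(logits, generated_ids, penalty=1.2):
    """Return a copy of logits where already generated token ids are made less likely."""
    out = list(logits)
    for token_id in set(generated_ids):
        if out[token_id] > 0:
            out[token_id] /= penalty  # shrink positive scores
        else:
            out[token_id] *= penalty  # push negative scores further down
    return out


penalized = apply_repetition_penalty([2.4, -1.0, 0.5], generated_ids=[0, 1, 1])
```

Token 2 was never generated, so its score is left untouched; tokens 0 and 1 both end up with lower scores than before.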
|
||||
|
||||
filtered_logits = top_k_top_p_filtering(next_token_logits, top_k=top_k, top_p=top_p)
|
||||
next_token = torch.multinomial(F.softmax(filtered_logits, dim=-1), num_samples=1)
|
||||
if temperature == 0: #greedy sampling:
|
||||
next_token = torch.argmax(filtered_logits).unsqueeze(0)
|
||||
else:
|
||||
next_token = torch.multinomial(F.softmax(filtered_logits, dim=-1), num_samples=1)
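The branch above treats a temperature of 0 as a request for greedy decoding and otherwise samples from the softmax of the filtered logits. A dependency-free sketch of the same decision is shown below. `pick_next_token` is a hypothetical name, and the standard library's `random.choices` stands in for `torch.multinomial`.

```python
import math
import random


def pick_next_token(logits, temperature=1.0, rng=random):
    """Greedy argmax when temperature == 0, otherwise sample from softmax(logits / T)."""
    if temperature == 0:
        return max(range(len(logits)), key=lambda i: logits[i])
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    weights = [math.exp(x - peak) for x in scaled]
    return rng.choices(range(len(logits)), weights=weights, k=1)[0]


greedy = pick_next_token([0.1, 3.0, -2.0], temperature=0)
sampled = pick_next_token([0.1, 3.0, -2.0], temperature=0.7)
```

Checking `temperature == 0` before dividing avoids a division by zero, which is the same reason the diff guards the division with `temperature if temperature > 0 else 1.`.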
|
||||
generated = torch.cat((generated, next_token.unsqueeze(0)), dim=1)
|
||||
return generated
|
||||
|
||||
@@ -137,16 +152,25 @@ def main():
|
||||
help="Path to pre-trained model or shortcut name selected in the list: " + ", ".join(ALL_MODELS))
|
||||
parser.add_argument("--prompt", type=str, default="")
|
||||
parser.add_argument("--padding_text", type=str, default="")
|
||||
parser.add_argument("--xlm_lang", type=str, default="", help="Optional language when used with the XLM model.")
|
||||
parser.add_argument("--length", type=int, default=20)
|
||||
parser.add_argument("--temperature", type=float, default=1.0)
|
||||
parser.add_argument("--temperature", type=float, default=1.0,
|
||||
help="temperature of 0 implies greedy sampling")
|
||||
parser.add_argument("--repetition_penalty", type=float, default=1.0,
|
||||
help="primarily useful for CTRL model; in that case, use 1.2")
|
||||
parser.add_argument("--top_k", type=int, default=0)
|
||||
parser.add_argument("--top_p", type=float, default=0.9)
|
||||
parser.add_argument("--no_cuda", action='store_true',
|
||||
help="Avoid using CUDA when available")
|
||||
parser.add_argument('--seed', type=int, default=42,
|
||||
help="random seed for initialization")
|
||||
parser.add_argument('--stop_token', type=str, default=None,
|
||||
help="Token at which text generation is stopped")
|
||||
args = parser.parse_args()
|
||||
|
||||
if args.model_type in ["ctrl"]:
|
||||
if args.temperature > 0.7 :
|
||||
print('CTRL typically works better with lower temperatures (and lower top_k).')
|
||||
|
||||
args.device = torch.device("cuda" if torch.cuda.is_available() and not args.no_cuda else "cpu")
|
||||
args.n_gpu = torch.cuda.device_count()
|
||||
|
||||
@@ -168,6 +192,18 @@ def main():
|
||||
|
||||
print(args)
|
||||
while True:
|
||||
xlm_lang = None
|
||||
# XLM Language usage detailed in the issues #1414
|
||||
if args.model_type in ["xlm"] and hasattr(tokenizer, 'lang2id') and hasattr(model.config, 'use_lang_emb') \
|
||||
and model.config.use_lang_emb:
|
||||
if args.xlm_lang:
|
||||
language = args.xlm_lang
|
||||
else:
|
||||
language = None
|
||||
while language not in tokenizer.lang2id.keys():
|
||||
language = input("Using XLM. Select language in " + str(list(tokenizer.lang2id.keys())) + " >>> ")
|
||||
xlm_lang = tokenizer.lang2id[language]
|
||||
|
||||
raw_text = args.prompt if args.prompt else input("Model prompt >>> ")
|
||||
if args.model_type in ["transfo-xl", "xlnet"]:
|
||||
# Models with memory like to have a long prompt for short inputs.
|
||||
@@ -180,11 +216,16 @@ def main():
|
||||
temperature=args.temperature,
|
||||
top_k=args.top_k,
|
||||
top_p=args.top_p,
|
||||
device=args.device,
|
||||
repetition_penalty=args.repetition_penalty,
|
||||
is_xlnet=bool(args.model_type == "xlnet"),
|
||||
xlm_lang=xlm_lang,
|
||||
device=args.device,
|
||||
)
|
||||
out = out[0, len(context_tokens):].tolist()
|
||||
text = tokenizer.decode(out, clean_up_tokenization_spaces=True)
|
||||
|
||||
text = tokenizer.decode(out, clean_up_tokenization_spaces=True, skip_special_tokens=True)
|
||||
text = text[: text.find(args.stop_token) if args.stop_token else None]
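The one-line truncation above relies on `str.find`, which returns -1 when the stop token does not occur; the slice then becomes `text[:-1]` and silently drops the last character. A small sketch that makes the "not found" case explicit follows. The helper name `truncate_at_stop_token` is hypothetical.

```python
def truncate_at_stop_token(text, stop_token=None):
    """Cut text at the first occurrence of stop_token; leave it unchanged otherwise."""
    if not stop_token:
        return text
    position = text.find(stop_token)
    return text if position == -1 else text[:position]


cut = truncate_at_stop_token("Hello world. Next sentence", ".")
untouched = truncate_at_stop_token("no stop here", "<eos>")
```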
|
||||
|
||||
print(text)
|
||||
if args.prompt:
|
||||
break
|
||||
|
||||
@@ -53,7 +53,8 @@ from transformers import glue_convert_examples_to_features as convert_examples_t
|
||||
|
||||
logger = logging.getLogger(__name__)
|
||||
|
||||
ALL_MODELS = sum((tuple(conf.pretrained_config_archive_map.keys()) for conf in (BertConfig, XLNetConfig, XLMConfig, RobertaConfig)), ())
|
||||
ALL_MODELS = sum((tuple(conf.pretrained_config_archive_map.keys()) for conf in (BertConfig, XLNetConfig, XLMConfig,
|
||||
RobertaConfig, DistilBertConfig)), ())
|
||||
|
||||
MODEL_CLASSES = {
|
||||
'bert': (BertConfig, BertForSequenceClassification, BertTokenizer),
|
||||
@@ -134,8 +135,9 @@ def train(args, train_dataset, model, tokenizer):
|
||||
batch = tuple(t.to(args.device) for t in batch)
|
||||
inputs = {'input_ids': batch[0],
|
||||
'attention_mask': batch[1],
|
||||
'token_type_ids': batch[2] if args.model_type in ['bert', 'xlnet'] else None, # XLM, DistilBERT and RoBERTa don't use segment_ids
|
||||
'labels': batch[3]}
|
||||
if args.model_type != 'distilbert':
|
||||
inputs['token_type_ids'] = batch[2] if args.model_type in ['bert', 'xlnet'] else None # XLM, DistilBERT and RoBERTa don't use segment_ids
|
||||
outputs = model(**inputs)
|
||||
loss = outputs[0] # model outputs are always tuple in transformers (see doc)
|
||||
|
||||
@@ -224,8 +226,9 @@ def evaluate(args, model, tokenizer, prefix=""):
|
||||
with torch.no_grad():
|
||||
inputs = {'input_ids': batch[0],
|
||||
'attention_mask': batch[1],
|
||||
'token_type_ids': batch[2] if args.model_type in ['bert', 'xlnet'] else None, # XLM, DistilBERT and RoBERTa don't use segment_ids
|
||||
'labels': batch[3]}
|
||||
if args.model_type != 'distilbert':
|
||||
inputs['token_type_ids'] = batch[2] if args.model_type in ['bert', 'xlnet'] else None # XLM, DistilBERT and RoBERTa don't use segment_ids
|
||||
outputs = model(**inputs)
|
||||
tmp_eval_loss, logits = outputs[:2]
|
||||
|
||||
@@ -246,7 +249,7 @@ def evaluate(args, model, tokenizer, prefix=""):
|
||||
result = compute_metrics(eval_task, preds, out_label_ids)
|
||||
results.update(result)
|
||||
|
||||
output_eval_file = os.path.join(eval_output_dir, "eval_results.txt")
|
||||
output_eval_file = os.path.join(eval_output_dir, prefix, "eval_results.txt")
|
||||
with open(output_eval_file, "w") as writer:
|
||||
logger.info("***** Eval results {} *****".format(prefix))
|
||||
for key in sorted(result.keys()):
|
||||
@@ -268,7 +271,7 @@ def load_and_cache_examples(args, task, tokenizer, evaluate=False):
|
||||
list(filter(None, args.model_name_or_path.split('/'))).pop(),
|
||||
str(args.max_seq_length),
|
||||
str(task)))
|
||||
if os.path.exists(cached_features_file):
|
||||
if os.path.exists(cached_features_file) and not args.overwrite_cache:
|
||||
logger.info("Loading features from cached file %s", cached_features_file)
|
||||
features = torch.load(cached_features_file)
|
||||
else:
|
||||
@@ -487,9 +490,11 @@ def main():
|
||||
logger.info("Evaluate the following checkpoints: %s", checkpoints)
|
||||
for checkpoint in checkpoints:
|
||||
global_step = checkpoint.split('-')[-1] if len(checkpoints) > 1 else ""
|
||||
prefix = checkpoint.split('/')[-1] if checkpoint.find('checkpoint') != -1 else ""
|
||||
|
||||
model = model_class.from_pretrained(checkpoint)
|
||||
model.to(args.device)
|
||||
result = evaluate(args, model, tokenizer, prefix=global_step)
|
||||
result = evaluate(args, model, tokenizer, prefix=prefix)
|
||||
result = dict((k + '_{}'.format(global_step), v) for k, v in result.items())
|
||||
results.update(result)
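The new `prefix` line turns a checkpoint path into the name of a per-checkpoint sub-directory, which `evaluate` then joins into the path of `eval_results.txt`. The sketch below isolates that string logic. `eval_prefix` and `eval_results_path` are hypothetical names, and `posixpath` is used only so the joined paths look the same on every platform.

```python
import posixpath


def eval_prefix(checkpoint):
    """Last path component if the path mentions 'checkpoint', else an empty prefix."""
    return checkpoint.split('/')[-1] if checkpoint.find('checkpoint') != -1 else ""


def eval_results_path(eval_output_dir, checkpoint):
    return posixpath.join(eval_output_dir, eval_prefix(checkpoint), "eval_results.txt")


per_checkpoint = eval_results_path("out", "out/checkpoint-500")
final_model = eval_results_path("out", "out")
```

With an empty prefix the results file lands directly in the output directory, so the final model keeps its previous location.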
|
||||
|
||||
|
||||
@@ -27,6 +27,8 @@ import logging
|
||||
import os
|
||||
import pickle
|
||||
import random
|
||||
import re
|
||||
import shutil
|
||||
|
||||
import numpy as np
|
||||
import torch
|
||||
@@ -59,7 +61,7 @@ class TextDataset(Dataset):
|
||||
def __init__(self, tokenizer, file_path='train', block_size=512):
|
||||
assert os.path.isfile(file_path)
|
||||
directory, filename = os.path.split(file_path)
|
||||
cached_features_file = os.path.join(directory, f'cached_lm_{block_size}_{filename}')
|
||||
cached_features_file = os.path.join(directory, 'cached_lm_' + str(block_size) + '_' + filename)
|
||||
|
||||
if os.path.exists(cached_features_file):
|
||||
logger.info("Loading features from cached file %s", cached_features_file)
|
||||
@@ -74,9 +76,8 @@ class TextDataset(Dataset):
|
||||
|
||||
tokenized_text = tokenizer.convert_tokens_to_ids(tokenizer.tokenize(text))
|
||||
|
||||
while len(tokenized_text) >= block_size: # Truncate in block of block_size
|
||||
self.examples.append(tokenizer.add_special_tokens_single_sequence(tokenized_text[:block_size]))
|
||||
tokenized_text = tokenized_text[block_size:]
|
||||
for i in range(0, len(tokenized_text)-block_size+1, block_size): # Truncate in block of block_size
|
||||
self.examples.append(tokenizer.build_inputs_with_special_tokens(tokenized_text[i:i+block_size]))
|
||||
# Note that we are losing the last truncated example here for the sake of simplicity (no padding)
# If your dataset is small, first you should look for a bigger one :-) and second you
|
||||
# can change this behavior by adding (model specific) padding.
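The `range`-based loop above walks the token ids in strides of `block_size` and stops before any block that would come up short, so the trailing remainder is dropped. The indexing on its own, without the tokenizer's special-token wrapping, is sketched below; `split_into_blocks` is a hypothetical helper.

```python
def split_into_blocks(token_ids, block_size):
    """Return consecutive full blocks of block_size ids; the short tail is discarded."""
    return [token_ids[i:i + block_size]
            for i in range(0, len(token_ids) - block_size + 1, block_size)]


blocks = split_into_blocks(list(range(10)), block_size=4)
```

Unlike the `while` loop it replaces, this version does not re-slice the remaining list on every iteration.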
|
||||
@@ -105,11 +106,43 @@ def set_seed(args):
|
||||
torch.cuda.manual_seed_all(args.seed)
|
||||
|
||||
|
||||
def _rotate_checkpoints(args, checkpoint_prefix, use_mtime=False):
    if not args.save_total_limit:
        return
    if args.save_total_limit <= 0:
        return

    # Check if we should delete older checkpoint(s)
    glob_checkpoints = glob.glob(os.path.join(args.output_dir, '{}-*'.format(checkpoint_prefix)))
    if len(glob_checkpoints) <= args.save_total_limit:
        return

    ordering_and_checkpoint_path = []
    for path in glob_checkpoints:
        if use_mtime:
            ordering_and_checkpoint_path.append((os.path.getmtime(path), path))
        else:
            regex_match = re.match('.*{}-([0-9]+)'.format(checkpoint_prefix), path)
            if regex_match and regex_match.groups():
                ordering_and_checkpoint_path.append((int(regex_match.groups()[0]), path))

    checkpoints_sorted = sorted(ordering_and_checkpoint_path)
    checkpoints_sorted = [checkpoint[1] for checkpoint in checkpoints_sorted]
    number_of_checkpoints_to_delete = max(0, len(checkpoints_sorted) - args.save_total_limit)
    checkpoints_to_be_deleted = checkpoints_sorted[:number_of_checkpoints_to_delete]
    for checkpoint in checkpoints_to_be_deleted:
        logger.info("Deleting older checkpoint [{}] due to args.save_total_limit".format(checkpoint))
        shutil.rmtree(checkpoint)
|
||||
|
||||
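The rotation logic above orders checkpoints by the step number parsed from the directory name, not by the raw string, and then deletes the oldest ones beyond `save_total_limit`. The selection step alone, with no filesystem access, is sketched below. `checkpoints_to_delete` is a hypothetical helper, and the `re.escape` call is an addition not present in the diff.

```python
import re


def checkpoints_to_delete(paths, prefix, keep):
    """Return the oldest checkpoint paths (by step number) beyond the `keep` newest."""
    numbered = []
    for path in paths:
        match = re.match(r".*{}-([0-9]+)".format(re.escape(prefix)), path)
        if match:
            numbered.append((int(match.group(1)), path))
    ordered = [path for _, path in sorted(numbered)]
    return ordered[:max(0, len(ordered) - keep)]


stale = checkpoints_to_delete(
    ["out/checkpoint-1000", "out/checkpoint-50", "out/checkpoint-200"],
    prefix="checkpoint",
    keep=2,
)
```

Sorting on the integer step matters here: a plain string sort would place `checkpoint-1000` before `checkpoint-200` and delete the wrong directory.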
|
||||
def mask_tokens(inputs, tokenizer, args):
|
||||
""" Prepare masked tokens inputs/labels for masked language modeling: 80% MASK, 10% random, 10% original. """
|
||||
labels = inputs.clone()
|
||||
# We sample a few tokens in each sequence for masked-LM training (with probability args.mlm_probability defaults to 0.15 in Bert/RoBERTa)
|
||||
masked_indices = torch.bernoulli(torch.full(labels.shape, args.mlm_probability)).bool()
|
||||
probability_matrix = torch.full(labels.shape, args.mlm_probability)
|
||||
special_tokens_mask = [tokenizer.get_special_tokens_mask(val, already_has_special_tokens=True) for val in labels.tolist()]
|
||||
probability_matrix.masked_fill_(torch.tensor(special_tokens_mask, dtype=torch.bool), value=0.0)
|
||||
masked_indices = torch.bernoulli(probability_matrix).bool()
|
||||
labels[~masked_indices] = -1 # We only compute loss on masked tokens
|
||||
|
||||
# 80% of the time, we replace masked input tokens with tokenizer.mask_token ([MASK])
|
||||
@@ -223,8 +256,9 @@ def train(args, train_dataset, model, tokenizer):
|
||||
logging_loss = tr_loss
|
||||
|
||||
if args.local_rank in [-1, 0] and args.save_steps > 0 and global_step % args.save_steps == 0:
|
||||
checkpoint_prefix = 'checkpoint'
|
||||
# Save model checkpoint
|
||||
output_dir = os.path.join(args.output_dir, 'checkpoint-{}'.format(global_step))
|
||||
output_dir = os.path.join(args.output_dir, '{}-{}'.format(checkpoint_prefix, global_step))
|
||||
if not os.path.exists(output_dir):
|
||||
os.makedirs(output_dir)
|
||||
model_to_save = model.module if hasattr(model, 'module') else model # Take care of distributed/parallel training
|
||||
@@ -232,6 +266,8 @@ def train(args, train_dataset, model, tokenizer):
|
||||
torch.save(args, os.path.join(output_dir, 'training_args.bin'))
|
||||
logger.info("Saving model checkpoint to %s", output_dir)
|
||||
|
||||
_rotate_checkpoints(args, checkpoint_prefix)
|
||||
|
||||
if args.max_steps > 0 and global_step > args.max_steps:
|
||||
epoch_iterator.close()
|
||||
break
|
||||
@@ -283,7 +319,7 @@ def evaluate(args, model, tokenizer, prefix=""):
|
||||
"perplexity": perplexity
|
||||
}
|
||||
|
||||
output_eval_file = os.path.join(eval_output_dir, "eval_results.txt")
|
||||
output_eval_file = os.path.join(eval_output_dir, prefix, "eval_results.txt")
|
||||
with open(output_eval_file, "w") as writer:
|
||||
logger.info("***** Eval results {} *****".format(prefix))
|
||||
for key in sorted(result.keys()):
|
||||
@@ -360,6 +396,8 @@ def main():
|
||||
help="Log every X updates steps.")
|
||||
parser.add_argument('--save_steps', type=int, default=50,
|
||||
help="Save checkpoint every X updates steps.")
|
||||
parser.add_argument('--save_total_limit', type=int, default=None,
|
||||
help='Limit the total amount of checkpoints, delete the older checkpoints in the output_dir, does not delete by default')
|
||||
parser.add_argument("--eval_all_checkpoints", action='store_true',
|
||||
help="Evaluate all checkpoints starting with the same prefix as model_name_or_path ending and ending with step number")
|
||||
parser.add_argument("--no_cuda", action='store_true',
|
||||
@@ -485,9 +523,11 @@ def main():
|
||||
logger.info("Evaluate the following checkpoints: %s", checkpoints)
|
||||
for checkpoint in checkpoints:
|
||||
global_step = checkpoint.split('-')[-1] if len(checkpoints) > 1 else ""
|
||||
prefix = checkpoint.split('/')[-1] if checkpoint.find('checkpoint') != -1 else ""
|
||||
|
||||
model = model_class.from_pretrained(checkpoint)
|
||||
model.to(args.device)
|
||||
result = evaluate(args, model, tokenizer, prefix=global_step)
|
||||
result = evaluate(args, model, tokenizer, prefix=prefix)
|
||||
result = dict((k + '_{}'.format(global_step), v) for k, v in result.items())
|
||||
results.update(result)
|
||||
|
||||
|
||||
@@ -293,7 +293,7 @@ def load_and_cache_examples(args, task, tokenizer, evaluate=False, test=False):
|
||||
list(filter(None, args.model_name_or_path.split('/'))).pop(),
|
||||
str(args.max_seq_length),
|
||||
str(task)))
|
||||
if os.path.exists(cached_features_file):
|
||||
if os.path.exists(cached_features_file) and not args.overwrite_cache:
|
||||
logger.info("Loading features from cached file %s", cached_features_file)
|
||||
features = torch.load(cached_features_file)
|
||||
else:
|
||||
@@ -306,14 +306,14 @@ def load_and_cache_examples(args, task, tokenizer, evaluate=False, test=False):
|
||||
else:
|
||||
examples = processor.get_train_examples(args.data_dir)
|
||||
logger.info("Training number: %s", str(len(examples)))
|
||||
features = convert_examples_to_features(examples, label_list, args.max_seq_length, tokenizer,
|
||||
cls_token_at_end=bool(args.model_type in ['xlnet']), # xlnet has a cls token at the end
|
||||
cls_token=tokenizer.cls_token,
|
||||
sep_token=tokenizer.sep_token,
|
||||
sep_token_extra=bool(args.model_type in ['roberta']),
|
||||
cls_token_segment_id=2 if args.model_type in ['xlnet'] else 0,
|
||||
features = convert_examples_to_features(
|
||||
examples,
|
||||
label_list,
|
||||
args.max_seq_length,
|
||||
tokenizer,
|
||||
pad_on_left=bool(args.model_type in ['xlnet']), # pad on the left for xlnet
|
||||
pad_token_segment_id=4 if args.model_type in ['xlnet'] else 0)
|
||||
pad_token_segment_id=4 if args.model_type in ['xlnet'] else 0
|
||||
)
|
||||
if args.local_rank in [-1, 0]:
|
||||
logger.info("Saving features into cached file %s", cached_features_file)
|
||||
torch.save(features, cached_features_file)
|
||||
@@ -362,7 +362,7 @@ def main():
|
||||
help="Whether to run eval on the dev set.")
|
||||
parser.add_argument("--do_test", action='store_true', help='Whether to run test on the test set')
|
||||
parser.add_argument("--evaluate_during_training", action='store_true',
|
||||
help="Rul evaluation during training at each logging step.")
|
||||
help="Run evaluation during training at each logging step.")
|
||||
parser.add_argument("--do_lower_case", action='store_true',
|
||||
help="Set this flag if you are using an uncased model.")
|
||||
|
||||
@@ -512,9 +512,11 @@ def main():
|
||||
logger.info("Evaluate the following checkpoints: %s", checkpoints)
|
||||
for checkpoint in checkpoints:
|
||||
global_step = checkpoint.split('-')[-1] if len(checkpoints) > 1 else ""
|
||||
prefix = checkpoint.split('/')[-1] if checkpoint.find('checkpoint') != -1 else ""
|
||||
|
||||
model = model_class.from_pretrained(checkpoint)
|
||||
model.to(args.device)
|
||||
result = evaluate(args, model, tokenizer, prefix=global_step)
|
||||
result = evaluate(args, model, tokenizer, prefix=prefix)
|
||||
result = dict((k + '_{}'.format(global_step), v) for k, v in result.items())
|
||||
results.update(result)
|
||||
|
||||
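The hunk above derives two labels per checkpoint: a step suffix used to key the results, and a folder-name prefix passed to `evaluate`. A minimal sketch of that derivation follows; the function name `checkpoint_labels` and the `num_checkpoints` parameter are hypothetical, introduced only for illustration.

```python
def checkpoint_labels(checkpoint, num_checkpoints):
    """Return (global_step, prefix) using the same expressions as the hunk above.

    `checkpoint_labels` and `num_checkpoints` are illustrative names, not part
    of the original script.
    """
    # The step suffix is only used when several checkpoints are evaluated.
    global_step = checkpoint.split('-')[-1] if num_checkpoints > 1 else ""
    # The prefix is the last path component, but only for checkpoint folders.
    prefix = checkpoint.split('/')[-1] if checkpoint.find('checkpoint') != -1 else ""
    return global_step, prefix


if __name__ == "__main__":
    print(checkpoint_labels("out/checkpoint-500", 2))
    print(checkpoint_labels("out", 1))
```

With this split, a plain output directory yields an empty prefix, while a `checkpoint-N` folder keeps its full folder name as the prefix.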
@@ -528,9 +530,11 @@ def main():
|
||||
logger.info("Evaluate the following checkpoints: %s", checkpoints)
|
||||
for checkpoint in checkpoints:
|
||||
global_step = checkpoint.split('-')[-1] if len(checkpoints) > 1 else ""
|
||||
prefix = checkpoint.split('/')[-1] if checkpoint.find('checkpoint') != -1 else ""
|
||||
|
||||
model = model_class.from_pretrained(checkpoint)
|
||||
model.to(args.device)
|
||||
result = evaluate(args, model, tokenizer, prefix=global_step, test=True)
|
||||
result = evaluate(args, model, tokenizer, prefix=prefix, test=True)
|
||||
result = dict((k + '_{}'.format(global_step), v) for k, v in result.items())
|
||||
results.update(result)
|
||||
if best_steps:
|
||||
|
||||
@@ -13,7 +13,7 @@
|
||||
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
|
||||
# See the License for the specific language governing permissions and
|
||||
# limitations under the License.
|
||||
""" Finetuning the library models for question-answering on SQuAD (Bert, XLM, XLNet)."""
|
||||
""" Finetuning the library models for question-answering on SQuAD (DistilBERT, Bert, XLM, XLNet)."""
|
||||
|
||||
from __future__ import absolute_import, division, print_function
|
||||
|
||||
@@ -135,9 +135,10 @@ def train(args, train_dataset, model, tokenizer):
|
||||
batch = tuple(t.to(args.device) for t in batch)
|
||||
inputs = {'input_ids': batch[0],
|
||||
'attention_mask': batch[1],
|
||||
'token_type_ids': None if args.model_type == 'xlm' else batch[2],
|
||||
'start_positions': batch[3],
|
||||
'end_positions': batch[4]}
|
||||
if args.model_type != 'distilbert':
|
||||
inputs['token_type_ids'] = None if args.model_type == 'xlm' else batch[2]
|
||||
if args.model_type in ['xlnet', 'xlm']:
|
||||
inputs.update({'cls_index': batch[5],
|
||||
'p_mask': batch[6]})
|
||||
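The change above stops passing `token_type_ids` when the model type is `distilbert`. The sketch below reproduces that branching on a plain list so it can be run without PyTorch; `build_train_inputs` is a hypothetical helper name, and the batch layout is assumed to match the indices used in the hunk.

```python
def build_train_inputs(model_type, batch):
    """Assemble model keyword arguments the way the training loop above does.

    Hypothetical helper for illustration; `batch` is any indexable sequence
    laid out as in the hunk (ids, mask, segments, start, end, cls_index, p_mask).
    """
    inputs = {'input_ids': batch[0],
              'attention_mask': batch[1],
              'start_positions': batch[3],
              'end_positions': batch[4]}
    # DistilBERT receives no token_type_ids key at all; XLM receives None.
    if model_type != 'distilbert':
        inputs['token_type_ids'] = None if model_type == 'xlm' else batch[2]
    # XLNet and XLM additionally take cls_index and p_mask.
    if model_type in ['xlnet', 'xlm']:
        inputs.update({'cls_index': batch[5], 'p_mask': batch[6]})
    return inputs


if __name__ == "__main__":
    fake_batch = ['ids', 'mask', 'segments', 'start', 'end', 'cls', 'pmask']
    print(sorted(build_train_inputs('distilbert', fake_batch)))
```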
@@ -218,9 +219,10 @@ def evaluate(args, model, tokenizer, prefix=""):
|
||||
batch = tuple(t.to(args.device) for t in batch)
|
||||
with torch.no_grad():
|
||||
inputs = {'input_ids': batch[0],
|
||||
'attention_mask': batch[1],
|
||||
'token_type_ids': None if args.model_type == 'xlm' else batch[2] # XLM don't use segment_ids
|
||||
'attention_mask': batch[1]
|
||||
}
|
||||
if args.model_type != 'distilbert':
|
||||
inputs['token_type_ids'] = None if args.model_type == 'xlm' else batch[2] # XLM doesn't use segment_ids
|
||||
example_indices = batch[3]
|
||||
if args.model_type in ['xlnet', 'xlm']:
|
||||
inputs.update({'cls_index': batch[4],
|
||||
|
||||
@@ -1,6 +1,6 @@
|
||||
import tensorflow as tf
|
||||
import tensorflow_datasets
|
||||
from transformers import *
|
||||
from transformers import BertTokenizer, TFBertForSequenceClassification, glue_convert_examples_to_features, BertForSequenceClassification
|
||||
|
||||
# Load dataset, tokenizer, model from pretrained model/vocabulary
|
||||
tokenizer = BertTokenizer.from_pretrained('bert-base-cased')
|
||||
@@ -23,12 +23,6 @@ model.compile(optimizer=optimizer, loss=loss, metrics=[metric])
|
||||
history = model.fit(train_dataset, epochs=2, steps_per_epoch=115,
|
||||
validation_data=valid_dataset, validation_steps=7)
|
||||
|
||||
>>> Train for 115 steps, validate for 7 steps
|
||||
>>> Epoch 1/2
|
||||
>>> 115/115 [==============================] - 53s 459ms/step - loss: 0.6033 - accuracy: 0.6712 - val_loss: 0.4964 - val_accuracy: 0.7647
|
||||
>>> Epoch 2/2
|
||||
>>> 115/115 [==============================] - 33s 289ms/step - loss: 0.4141 - accuracy: 0.8160 - val_loss: 0.3914 - val_accuracy: 0.8382
|
||||
|
||||
# Load the TensorFlow model in PyTorch for inspection
|
||||
model.save_pretrained('./save/')
|
||||
pytorch_model = BertForSequenceClassification.from_pretrained('./save/', from_tf=True)
|
||||
@@ -44,5 +38,3 @@ pred_1 = pytorch_model(**inputs_1)[0].argmax().item()
|
||||
pred_2 = pytorch_model(**inputs_2)[0].argmax().item()
|
||||
print("sentence_1 is", "a paraphrase" if pred_1 else "not a paraphrase", "of sentence_0")
|
||||
print("sentence_2 is", "a paraphrase" if pred_2 else "not a paraphrase", "of sentence_0")
|
||||
>>> sentence_1 is a paraphrase of sentence_0
|
||||
>>> sentence_2 is not a paraphrase of sentence_0
|
||||
@@ -13,7 +13,7 @@
|
||||
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
|
||||
# See the License for the specific language governing permissions and
|
||||
# limitations under the License.
|
||||
""" BERT multiple choice fine-tuning: utilities to work with multiple choice tasks of reading comprehension """
|
||||
""" Multiple choice fine-tuning: utilities to work with multiple choice tasks of reading comprehension """
|
||||
|
||||
from __future__ import absolute_import, division, print_function
|
||||
|
||||
@@ -26,6 +26,8 @@ import json
|
||||
import csv
|
||||
import glob
|
||||
import tqdm
|
||||
from typing import List
|
||||
from transformers import PreTrainedTokenizer
|
||||
|
||||
|
||||
logger = logging.getLogger(__name__)
|
||||
@@ -34,13 +36,13 @@ logger = logging.getLogger(__name__)
|
||||
class InputExample(object):
|
||||
"""A single training/test example for multiple choice"""
|
||||
|
||||
def __init__(self, example_id, question, contexts, endings, label=None):
|
||||
def __init__(self, example_id, question, contexts, endings, label=None):
"""Constructs an InputExample.
|
||||
|
||||
Args:
|
||||
example_id: Unique id for the example.
|
||||
contexts: list of str. The untokenized text of the first sequence (context of corresponding question).
|
||||
question: string. The untokenized text of the second sequence (qustion).
|
||||
question: string. The untokenized text of the second sequence (question).
|
||||
endings: list of str. multiple choice's options. Its length must be equal to contexts' length.
|
||||
label: (Optional) string. The label of the example. This should be
|
||||
specified for train and dev examples, but not for test examples.
|
||||
@@ -66,7 +68,7 @@ class InputFeatures(object):
|
||||
'input_mask': input_mask,
|
||||
'segment_ids': segment_ids
|
||||
}
|
||||
for _, input_ids, input_mask, segment_ids in choices_features
|
||||
for input_ids, input_mask, segment_ids in choices_features
|
||||
]
|
||||
self.label = label
|
||||
|
||||
@@ -192,7 +194,7 @@ class SwagProcessor(DataProcessor):
|
||||
return lines
|
||||
|
||||
|
||||
def _create_examples(self, lines, type):
|
||||
def _create_examples(self, lines: List[List[str]], type: str):
|
||||
"""Creates examples for the training and dev sets."""
|
||||
if type == "train" and lines[0][-1] != 'label':
|
||||
raise ValueError(
|
||||
@@ -300,24 +302,18 @@ class ArcProcessor(DataProcessor):
|
||||
return examples
|
||||
|
||||
|
||||
def convert_examples_to_features(examples, label_list, max_seq_length,
|
||||
tokenizer,
|
||||
cls_token_at_end=False,
|
||||
cls_token='[CLS]',
|
||||
cls_token_segment_id=1,
|
||||
sep_token='[SEP]',
|
||||
sequence_a_segment_id=0,
|
||||
sequence_b_segment_id=1,
|
||||
sep_token_extra=False,
|
||||
pad_token_segment_id=0,
|
||||
pad_on_left=False,
|
||||
pad_token=0,
|
||||
mask_padding_with_zero=True):
|
||||
""" Loads a data file into a list of `InputBatch`s
|
||||
`cls_token_at_end` define the location of the CLS token:
|
||||
- False (Default, BERT/XLM pattern): [CLS] + A + [SEP] + B + [SEP]
|
||||
- True (XLNet/GPT pattern): A + [SEP] + B + [SEP] + [CLS]
|
||||
`cls_token_segment_id` define the segment id associated to the CLS token (0 for BERT, 2 for XLNet)
|
||||
def convert_examples_to_features(
|
||||
examples: List[InputExample],
|
||||
label_list: List[str],
|
||||
max_length: int,
|
||||
tokenizer: PreTrainedTokenizer,
|
||||
pad_token_segment_id=0,
|
||||
pad_on_left=False,
|
||||
pad_token=0,
|
||||
mask_padding_with_zero=True,
|
||||
) -> List[InputFeatures]:
|
||||
"""
|
||||
Loads a data file into a list of `InputFeatures`
|
||||
"""
|
||||
|
||||
label_map = {label : i for i, label in enumerate(label_list)}
|
||||
@@ -328,125 +324,70 @@ def convert_examples_to_features(examples, label_list, max_seq_length,
|
||||
logger.info("Writing example %d of %d" % (ex_index, len(examples)))
|
||||
choices_features = []
|
||||
for ending_idx, (context, ending) in enumerate(zip(example.contexts, example.endings)):
|
||||
tokens_a = tokenizer.tokenize(context)
|
||||
tokens_b = None
|
||||
text_a = context
|
||||
if example.question.find("_") != -1:
|
||||
#this is for cloze question
|
||||
tokens_b = tokenizer.tokenize(example.question.replace("_", ending))
|
||||
# this is for cloze question
|
||||
text_b = example.question.replace("_", ending)
|
||||
else:
|
||||
tokens_b = tokenizer.tokenize(example.question + " " + ending)
|
||||
# you can add seq token between quesiotn and ending. This does not make too much difference.
|
||||
# tokens_b = tokenizer.tokenize(example.question)
|
||||
# tokens_b += [sep_token]
|
||||
# if sep_token_extra:
|
||||
# tokens_b += [sep_token]
|
||||
# tokens_b += tokenizer.tokenize(ending)
|
||||
text_b = example.question + " " + ending
|
||||
|
||||
special_tokens_count = 4 if sep_token_extra else 3
|
||||
_truncate_seq_pair(tokens_a, tokens_b, max_seq_length - special_tokens_count)
|
||||
inputs = tokenizer.encode_plus(
|
||||
text_a,
|
||||
text_b,
|
||||
add_special_tokens=True,
|
||||
max_length=max_length,
|
||||
)
|
||||
if 'num_truncated_tokens' in inputs and inputs['num_truncated_tokens'] > 0:
|
||||
logger.info('Attention! you are cropping tokens (swag task is ok). '
|
||||
'If you are training ARC and RACE and you are popping question + options, '
|
||||
'you need to try to use a bigger max seq length!')
|
||||
|
||||
# The convention in BERT is:
|
||||
# (a) For sequence pairs:
|
||||
# tokens: [CLS] is this jack ##son ##ville ? [SEP] no it is not . [SEP]
|
||||
# type_ids: 0 0 0 0 0 0 0 0 1 1 1 1 1 1
|
||||
# (b) For single sequences:
|
||||
# tokens: [CLS] the dog is hairy . [SEP]
|
||||
# type_ids: 0 0 0 0 0 0 0
|
||||
#
|
||||
# Where "type_ids" are used to indicate whether this is the first
|
||||
# sequence or the second sequence. The embedding vectors for `type=0` and
|
||||
# `type=1` were learned during pre-training and are added to the wordpiece
|
||||
# embedding vector (and position vector). This is not *strictly* necessary
|
||||
# since the [SEP] token unambiguously separates the sequences, but it makes
|
||||
# it easier for the model to learn the concept of sequences.
|
||||
#
|
||||
# For classification tasks, the first vector (corresponding to [CLS]) is
|
||||
# used as as the "sentence vector". Note that this only makes sense because
|
||||
# the entire model is fine-tuned.
|
||||
tokens = tokens_a + [sep_token]
|
||||
if sep_token_extra:
|
||||
# roberta uses an extra separator b/w pairs of sentences
|
||||
tokens += [sep_token]
|
||||
|
||||
segment_ids = [sequence_a_segment_id] * len(tokens)
|
||||
|
||||
if tokens_b:
|
||||
tokens += tokens_b + [sep_token]
|
||||
segment_ids += [sequence_b_segment_id] * (len(tokens_b) + 1)
|
||||
|
||||
if cls_token_at_end:
|
||||
tokens = tokens + [cls_token]
|
||||
segment_ids = segment_ids + [cls_token_segment_id]
|
||||
else:
|
||||
tokens = [cls_token] + tokens
|
||||
segment_ids = [cls_token_segment_id] + segment_ids
|
||||
|
||||
input_ids = tokenizer.convert_tokens_to_ids(tokens)
|
||||
input_ids, token_type_ids = inputs["input_ids"], inputs["token_type_ids"]
|
||||
|
||||
# The mask has 1 for real tokens and 0 for padding tokens. Only real
|
||||
# tokens are attended to.
|
||||
input_mask = [1 if mask_padding_with_zero else 0] * len(input_ids)
|
||||
attention_mask = [1 if mask_padding_with_zero else 0] * len(input_ids)
|
||||
|
||||
# Zero-pad up to the sequence length.
|
||||
padding_length = max_seq_length - len(input_ids)
|
||||
padding_length = max_length - len(input_ids)
|
||||
if pad_on_left:
|
||||
input_ids = ([pad_token] * padding_length) + input_ids
|
||||
input_mask = ([0 if mask_padding_with_zero else 1] * padding_length) + input_mask
|
||||
segment_ids = ([pad_token_segment_id] * padding_length) + segment_ids
|
||||
attention_mask = ([0 if mask_padding_with_zero else 1] * padding_length) + attention_mask
|
||||
token_type_ids = ([pad_token_segment_id] * padding_length) + token_type_ids
|
||||
else:
|
||||
input_ids = input_ids + ([pad_token] * padding_length)
|
||||
input_mask = input_mask + ([0 if mask_padding_with_zero else 1] * padding_length)
|
||||
segment_ids = segment_ids + ([pad_token_segment_id] * padding_length)
|
||||
attention_mask = attention_mask + ([0 if mask_padding_with_zero else 1] * padding_length)
|
||||
token_type_ids = token_type_ids + ([pad_token_segment_id] * padding_length)
|
||||
|
||||
assert len(input_ids) == max_length
|
||||
assert len(attention_mask) == max_length
|
||||
assert len(token_type_ids) == max_length
|
||||
choices_features.append((input_ids, attention_mask, token_type_ids))
|
||||
|
||||
|
||||
assert len(input_ids) == max_seq_length
|
||||
assert len(input_mask) == max_seq_length
|
||||
assert len(segment_ids) == max_seq_length
|
||||
choices_features.append((tokens, input_ids, input_mask, segment_ids))
|
||||
label = label_map[example.label]
|
||||
|
||||
if ex_index < 2:
|
||||
logger.info("*** Example ***")
|
||||
logger.info("race_id: {}".format(example.example_id))
|
||||
for choice_idx, (tokens, input_ids, input_mask, segment_ids) in enumerate(choices_features):
|
||||
for choice_idx, (input_ids, attention_mask, token_type_ids) in enumerate(choices_features):
|
||||
logger.info("choice: {}".format(choice_idx))
|
||||
logger.info("tokens: {}".format(' '.join(tokens)))
|
||||
logger.info("input_ids: {}".format(' '.join(map(str, input_ids))))
|
||||
logger.info("input_mask: {}".format(' '.join(map(str, input_mask))))
|
||||
logger.info("segment_ids: {}".format(' '.join(map(str, segment_ids))))
|
||||
logger.info("attention_mask: {}".format(' '.join(map(str, attention_mask))))
|
||||
logger.info("token_type_ids: {}".format(' '.join(map(str, token_type_ids))))
|
||||
logger.info("label: {}".format(label))
|
||||
|
||||
features.append(
|
||||
InputFeatures(
|
||||
example_id = example.example_id,
|
||||
choices_features = choices_features,
|
||||
label = label
|
||||
example_id=example.example_id,
|
||||
choices_features=choices_features,
|
||||
label=label,
|
||||
)
|
||||
)
|
||||
|
||||
return features
|
||||
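The padding step in the rewritten `convert_examples_to_features` can be exercised on its own. This is a minimal sketch under the assumption that `input_ids` and `token_type_ids` are plain lists of equal length; `pad_choice` is a hypothetical name, while the keyword arguments mirror those in the hunk.

```python
def pad_choice(input_ids, token_type_ids, max_length, pad_on_left=False,
               pad_token=0, pad_token_segment_id=0, mask_padding_with_zero=True):
    """Pad one encoded choice to max_length and build its attention mask.

    Hypothetical helper; it follows the left/right padding branches shown above.
    """
    # Real tokens get 1 in the mask (or 0 if the convention is inverted).
    attention_mask = [1 if mask_padding_with_zero else 0] * len(input_ids)
    padding_length = max_length - len(input_ids)
    pad_mask = [0 if mask_padding_with_zero else 1] * padding_length
    if pad_on_left:
        input_ids = [pad_token] * padding_length + input_ids
        attention_mask = pad_mask + attention_mask
        token_type_ids = [pad_token_segment_id] * padding_length + token_type_ids
    else:
        input_ids = input_ids + [pad_token] * padding_length
        attention_mask = attention_mask + pad_mask
        token_type_ids = token_type_ids + [pad_token_segment_id] * padding_length
    # Same sanity checks as the original code.
    assert len(input_ids) == max_length
    assert len(attention_mask) == max_length
    assert len(token_type_ids) == max_length
    return input_ids, attention_mask, token_type_ids


if __name__ == "__main__":
    print(pad_choice([101, 7, 102], [0, 0, 0], 5))
    print(pad_choice([101, 7, 102], [0, 0, 0], 5, pad_on_left=True,
                     pad_token_segment_id=4))
```

Note that an input longer than `max_length` gives a negative `padding_length`, so nothing is padded and the length assertions fail; truncation has to happen before this step.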
|
||||
|
||||
def _truncate_seq_pair(tokens_a, tokens_b, max_length):
|
||||
"""Truncates a sequence pair in place to the maximum length."""
|
||||
|
||||
# This is a simple heuristic which will always truncate the longer sequence
|
||||
# one token at a time. This makes more sense than truncating an equal percent
|
||||
# of tokens from each, since if one sequence is very short then each token
|
||||
# that's truncated likely contains more information than a longer sequence.
|
||||
|
||||
# However, since we'd better not to remove tokens of options and questions, you can choose to use a bigger
|
||||
# length or only pop from context
|
||||
while True:
|
||||
total_length = len(tokens_a) + len(tokens_b)
|
||||
if total_length <= max_length:
|
||||
break
|
||||
if len(tokens_a) > len(tokens_b):
|
||||
tokens_a.pop()
|
||||
else:
|
||||
logger.info('Attention! you are removing from token_b (swag task is ok). '
|
||||
'If you are training ARC and RACE (you are poping question + options), '
|
||||
'you need to try to use a bigger max seq length!')
|
||||
tokens_b.pop()
|
||||
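The `_truncate_seq_pair` helper removed by this change trims whichever sequence is currently longer, one token at a time. A minimal standalone sketch of that heuristic, without the logging call, might look as follows; the name `truncate_seq_pair` is chosen here for illustration.

```python
def truncate_seq_pair(tokens_a, tokens_b, max_length):
    """Truncate two token lists in place until their combined length fits.

    Illustrative re-statement of the removed helper: the longer list loses
    its last token first; on a tie, tokens_b is shortened.
    """
    while len(tokens_a) + len(tokens_b) > max_length:
        if len(tokens_a) > len(tokens_b):
            tokens_a.pop()
        else:
            tokens_b.pop()


if __name__ == "__main__":
    context = list("abcdefg")
    question = list("xyz")
    truncate_seq_pair(context, question, 6)
    print(context, question)
```

Because ties remove from `tokens_b`, a short context can cause question or option tokens to be dropped, which is the situation the original log message warns about for ARC and RACE.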
|
||||
|
||||
processors = {
|
||||
@@ -456,7 +397,7 @@ processors = {
|
||||
}
|
||||
|
||||
|
||||
GLUE_TASKS_NUM_LABELS = {
|
||||
MULTIPLE_CHOICE_TASKS_NUM_LABELS = {
"race": 4,
"swag": 4,
"arc": 4
|
||||
|
||||
48
requirements-dev.txt
Normal file
@@ -0,0 +1,48 @@
absl-py==0.8.0
astor==0.8.0
atomicwrites==1.3.0
attrs==19.2.0
boto3==1.9.243
botocore==1.12.243
certifi==2019.9.11
chardet==3.0.4
Click==7.0
docutils==0.15.2
gast==0.2.2
google-pasta==0.1.7
grpcio==1.24.1
h5py==2.10.0
idna==2.8
importlib-metadata==0.23
jmespath==0.9.4
joblib==0.14.0
Keras-Applications==1.0.8
Keras-Preprocessing==1.1.0
Markdown==3.1.1
more-itertools==7.2.0
numpy==1.17.2
opt-einsum==3.1.0
packaging==19.2
pluggy==0.13.0
protobuf==3.10.0
py==1.8.0
pyparsing==2.4.2
pytest==5.2.1
python-dateutil==2.8.0
regex==2019.8.19
requests==2.22.0
s3transfer==0.2.1
sacremoses==0.0.35
sentencepiece==0.1.83
six==1.12.0
tensorboard==2.0.0
tensorflow==2.0.0
tensorflow-estimator==2.0.0
termcolor==1.1.0
torch==1.2.0
tqdm==4.36.1
urllib3==1.25.6
wcwidth==0.1.7
Werkzeug==0.16.0
wrapt==1.11.2
zipp==0.6.0
10
setup.py
@@ -3,7 +3,7 @@ Simple check list from AllenNLP repo: https://github.com/allenai/allennlp/blob/m
|
||||
|
||||
To create the package for pypi.
|
||||
|
||||
1. Change the version in __init__.py and setup.py.
|
||||
1. Change the version in __init__.py, setup.py as well as docs/source/conf.py.
|
||||
|
||||
2. Commit these changes with the message: "Release: VERSION"
|
||||
|
||||
@@ -38,13 +38,13 @@ from setuptools import find_packages, setup
|
||||
|
||||
setup(
|
||||
name="transformers",
|
||||
version="2.0.0",
|
||||
author="Thomas Wolf, Lysandre Debut, Victor Sanh, Julien Chaumond, Google AI Language Team Authors, Open AI team Authors",
|
||||
version="2.1.0",
|
||||
author="Thomas Wolf, Lysandre Debut, Victor Sanh, Julien Chaumond, Google AI Language Team Authors, Open AI team Authors, Facebook AI Authors, Carnegie Mellon University Authors",
|
||||
author_email="thomas@huggingface.co",
|
||||
description="Repository of pre-trained NLP Transformer models: BERT & RoBERTa, GPT & GPT-2, Transformer-XL, XLNet and XLM",
|
||||
description="State-of-the-art Natural Language Processing for TensorFlow 2.0 and PyTorch",
|
||||
long_description=open("README.md", "r", encoding='utf-8').read(),
|
||||
long_description_content_type="text/markdown",
|
||||
keywords='NLP deep learning transformer pytorch BERT GPT GPT-2 google openai CMU',
|
||||
keywords='NLP deep learning transformer pytorch tensorflow BERT GPT GPT-2 google openai CMU',
|
||||
license='Apache',
|
||||
url="https://github.com/huggingface/transformers",
|
||||
packages=find_packages(exclude=["*.tests", "*.tests.*",
|
||||
|
||||
@@ -1,4 +1,4 @@
|
||||
__version__ = "2.0.0"
|
||||
__version__ = "2.1.0"
|
||||
|
||||
# Work around to update TensorFlow's absl.logging threshold which alters the
|
||||
# default Python logging output behavior when present.
|
||||
@@ -37,6 +37,7 @@ from .tokenization_bert import BertTokenizer, BasicTokenizer, WordpieceTokenizer
|
||||
from .tokenization_openai import OpenAIGPTTokenizer
|
||||
from .tokenization_transfo_xl import (TransfoXLTokenizer, TransfoXLCorpus)
|
||||
from .tokenization_gpt2 import GPT2Tokenizer
|
||||
from .tokenization_ctrl import CTRLTokenizer
|
||||
from .tokenization_xlnet import XLNetTokenizer, SPIECE_UNDERLINE
|
||||
from .tokenization_xlm import XLMTokenizer
|
||||
from .tokenization_roberta import RobertaTokenizer
|
||||
@@ -49,7 +50,9 @@ from .configuration_bert import BertConfig, BERT_PRETRAINED_CONFIG_ARCHIVE_MAP
|
||||
from .configuration_openai import OpenAIGPTConfig, OPENAI_GPT_PRETRAINED_CONFIG_ARCHIVE_MAP
|
||||
from .configuration_transfo_xl import TransfoXLConfig, TRANSFO_XL_PRETRAINED_CONFIG_ARCHIVE_MAP
|
||||
from .configuration_gpt2 import GPT2Config, GPT2_PRETRAINED_CONFIG_ARCHIVE_MAP
|
||||
from .configuration_ctrl import CTRLConfig, CTRL_PRETRAINED_CONFIG_ARCHIVE_MAP
|
||||
from .configuration_xlnet import XLNetConfig, XLNET_PRETRAINED_CONFIG_ARCHIVE_MAP
|
||||
from .configuration_ctrl import CTRLConfig, CTRL_PRETRAINED_CONFIG_ARCHIVE_MAP
|
||||
from .configuration_xlm import XLMConfig, XLM_PRETRAINED_CONFIG_ARCHIVE_MAP
|
||||
from .configuration_roberta import RobertaConfig, ROBERTA_PRETRAINED_CONFIG_ARCHIVE_MAP
|
||||
from .configuration_distilbert import DistilBertConfig, DISTILBERT_PRETRAINED_CONFIG_ARCHIVE_MAP
|
||||
@@ -73,15 +76,19 @@ if is_torch_available():
|
||||
from .modeling_gpt2 import (GPT2PreTrainedModel, GPT2Model,
|
||||
GPT2LMHeadModel, GPT2DoubleHeadsModel,
|
||||
load_tf_weights_in_gpt2, GPT2_PRETRAINED_MODEL_ARCHIVE_MAP)
|
||||
from .modeling_ctrl import (CTRLPreTrainedModel, CTRLModel,
|
||||
CTRLLMHeadModel,
|
||||
CTRL_PRETRAINED_MODEL_ARCHIVE_MAP)
|
||||
from .modeling_xlnet import (XLNetPreTrainedModel, XLNetModel, XLNetLMHeadModel,
|
||||
XLNetForSequenceClassification, XLNetForQuestionAnsweringSimple,
|
||||
XLNetForQuestionAnswering,
|
||||
XLNetForSequenceClassification, XLNetForMultipleChoice,
|
||||
XLNetForQuestionAnsweringSimple, XLNetForQuestionAnswering,
|
||||
load_tf_weights_in_xlnet, XLNET_PRETRAINED_MODEL_ARCHIVE_MAP)
|
||||
from .modeling_xlm import (XLMPreTrainedModel, XLMModel,
|
||||
XLMWithLMHeadModel, XLMForSequenceClassification,
|
||||
XLMForQuestionAnswering, XLMForQuestionAnsweringSimple,
|
||||
XLM_PRETRAINED_MODEL_ARCHIVE_MAP)
|
||||
from .modeling_roberta import (RobertaForMaskedLM, RobertaModel, RobertaForSequenceClassification,
|
||||
from .modeling_roberta import (RobertaForMaskedLM, RobertaModel,
|
||||
RobertaForSequenceClassification, RobertaForMultipleChoice,
|
||||
ROBERTA_PRETRAINED_MODEL_ARCHIVE_MAP)
|
||||
from .modeling_distilbert import (DistilBertForMaskedLM, DistilBertModel,
|
||||
DistilBertForSequenceClassification, DistilBertForQuestionAnswering,
|
||||
@@ -148,6 +155,11 @@ if is_tf_available():
|
||||
load_distilbert_pt_weights_in_tf2,
|
||||
TF_DISTILBERT_PRETRAINED_MODEL_ARCHIVE_MAP)
|
||||
|
||||
from .modeling_tf_ctrl import (TFCTRLPreTrainedModel, TFCTRLModel,
|
||||
TFCTRLLMHeadModel,
|
||||
load_ctrl_pt_weights_in_tf2,
|
||||
TF_CTRL_PRETRAINED_MODEL_ARCHIVE_MAP)
|
||||
|
||||
# TF 2.0 <=> PyTorch conversion utilities
|
||||
if is_tf_available() and is_torch_available():
|
||||
from .modeling_tf_pytorch_utils import (convert_tf_weight_name_to_pt_weight_name,
|
||||
|
||||
@@ -26,6 +26,7 @@ from .configuration_xlnet import XLNetConfig
|
||||
from .configuration_xlm import XLMConfig
|
||||
from .configuration_roberta import RobertaConfig
|
||||
from .configuration_distilbert import DistilBertConfig
|
||||
from .configuration_ctrl import CTRLConfig
|
||||
|
||||
logger = logging.getLogger(__name__)
|
||||
|
||||
@@ -49,7 +50,7 @@ class AutoConfig(object):
|
||||
- contains `xlnet`: XLNetConfig (XLNet model)
|
||||
- contains `xlm`: XLMConfig (XLM model)
|
||||
- contains `roberta`: RobertaConfig (RoBERTa model)
|
||||
|
||||
- contains `ctrl` : CTRLConfig (CTRL model)
|
||||
This class cannot be instantiated using `__init__()` (throws an error).
|
||||
"""
|
||||
def __init__(self):
|
||||
@@ -71,7 +72,7 @@ class AutoConfig(object):
|
||||
- contains `xlnet`: XLNetConfig (XLNet model)
|
||||
- contains `xlm`: XLMConfig (XLM model)
|
||||
- contains `roberta`: RobertaConfig (RoBERTa model)
|
||||
|
||||
- contains `ctrl` : CTRLConfig (CTRL model)
|
||||
Params:
|
||||
pretrained_model_name_or_path: either:
|
||||
|
||||
@@ -129,7 +130,8 @@ class AutoConfig(object):
|
||||
return XLNetConfig.from_pretrained(pretrained_model_name_or_path, **kwargs)
|
||||
elif 'xlm' in pretrained_model_name_or_path:
|
||||
return XLMConfig.from_pretrained(pretrained_model_name_or_path, **kwargs)
|
||||
|
||||
elif 'ctrl' in pretrained_model_name_or_path:
|
||||
return CTRLConfig.from_pretrained(pretrained_model_name_or_path, **kwargs)
|
||||
raise ValueError("Unrecognized model identifier in {}. Should contain one of "
|
||||
"'bert', 'openai-gpt', 'gpt2', 'transfo-xl', 'xlnet', "
|
||||
"'xlm', 'roberta'".format(pretrained_model_name_or_path))
|
||||
"'xlm', 'roberta', 'ctrl'".format(pretrained_model_name_or_path))
|
||||
|
||||
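`AutoConfig.from_pretrained` picks a configuration class by checking which known substring appears in the model name. The sketch below illustrates that substring dispatch with string stand-ins instead of the real classes, so it runs without the library. The table `CONFIG_PATTERNS`, the function `resolve_config_name`, and the exact check order are assumptions made for this example; only the `xlnet`, `xlm`, `ctrl` sequence is visible in the hunk.

```python
# Hypothetical pattern table; values are class names as strings, not real classes.
# More specific patterns come first so 'distilbert' and 'roberta' are not
# swallowed by the shorter 'bert' pattern.
CONFIG_PATTERNS = [
    ('distilbert', 'DistilBertConfig'),
    ('roberta', 'RobertaConfig'),
    ('bert', 'BertConfig'),
    ('openai-gpt', 'OpenAIGPTConfig'),
    ('gpt2', 'GPT2Config'),
    ('transfo-xl', 'TransfoXLConfig'),
    ('xlnet', 'XLNetConfig'),
    ('xlm', 'XLMConfig'),
    ('ctrl', 'CTRLConfig'),
]


def resolve_config_name(pretrained_model_name_or_path):
    """Return the name of the first config whose pattern occurs in the input."""
    for pattern, config_name in CONFIG_PATTERNS:
        if pattern in pretrained_model_name_or_path:
            return config_name
    raise ValueError("Unrecognized model identifier in {}.".format(
        pretrained_model_name_or_path))


if __name__ == "__main__":
    print(resolve_config_name('ctrl'))
    print(resolve_config_name('distilgpt2'))
```

Since matching is by substring, a local directory whose path happens to contain one of these patterns would be routed the same way, which is worth keeping in mind when naming output folders.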
143
transformers/configuration_ctrl.py
Normal file
@@ -0,0 +1,143 @@
|
||||
# coding=utf-8
|
||||
# Copyright 2018 Salesforce and HuggingFace Inc. team.
|
||||
# Copyright (c) 2018, NVIDIA CORPORATION. All rights reserved.
|
||||
# Licensed under the Apache License, Version 2.0 (the "License");
|
||||
# you may not use this file except in compliance with the License.
|
||||
# You may obtain a copy of the License at
|
||||
#
|
||||
# http://www.apache.org/licenses/LICENSE-2.0
|
||||
#
|
||||
# Unless required by applicable law or agreed to in writing, software
|
||||
# distributed under the License is distributed on an "AS IS" BASIS,
|
||||
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
|
||||
# See the License for the specific language governing permissions and
|
||||
# limitations under the License.
|
||||
""" Salesforce CTRL configuration """
|
||||
|
||||
from __future__ import absolute_import, division, print_function, unicode_literals
|
||||
|
||||
import json
|
||||
import logging
|
||||
import sys
|
||||
from io import open
|
||||
|
||||
from .configuration_utils import PretrainedConfig
|
||||
|
||||
logger = logging.getLogger(__name__)
|
||||
|
||||
CTRL_PRETRAINED_CONFIG_ARCHIVE_MAP = {"ctrl": "https://storage.googleapis.com/sf-ctrl/pytorch/ctrl-config.json"}
|
||||
|
||||
class CTRLConfig(PretrainedConfig):
|
||||
"""Configuration class to store the configuration of a `CTRLModel`.
|
||||
|
||||
Args:
|
||||
vocab_size_or_config_json_file: Vocabulary size of `inputs_ids` in `CTRLModel` or a configuration json file.
|
||||
n_positions: Number of positional embeddings.
|
||||
n_ctx: Size of the causal mask (usually same as n_positions).
|
||||
dff: Size of the inner dimension of the FFN.
|
||||
n_embd: Dimensionality of the embeddings and hidden states.
|
||||
n_layer: Number of hidden layers in the Transformer encoder.
|
||||
n_head: Number of attention heads for each attention layer in
|
||||
the Transformer encoder.
|
||||
layer_norm_epsilon: epsilon to use in the layer norm layers
|
||||
resid_pdrop: The dropout probability for all fully connected
|
||||
layers in the embeddings, encoder, and pooler.
|
||||
attn_pdrop: The dropout ratio for the attention
|
||||
probabilities.
|
||||
embd_pdrop: The dropout ratio for the embeddings.
|
||||
initializer_range: The stddev of the truncated_normal_initializer for
|
||||
initializing all weight matrices.
|
||||
"""
|
||||
pretrained_config_archive_map = CTRL_PRETRAINED_CONFIG_ARCHIVE_MAP
|
||||
|
||||
def __init__(
|
||||
self,
|
||||
vocab_size_or_config_json_file=246534,
|
||||
n_positions=256,
|
||||
n_ctx=256,
|
||||
n_embd=1280,
|
||||
dff=8192,
|
||||
n_layer=48,
|
||||
n_head=16,
|
||||
resid_pdrop=0.1,
|
||||
embd_pdrop=0.1,
|
||||
attn_pdrop=0.1,
|
||||
layer_norm_epsilon=1e-6,
|
||||
initializer_range=0.02,
|
||||
|
||||
num_labels=1,
|
||||
summary_type='cls_index',
|
||||
summary_use_proj=True,
|
||||
summary_activation=None,
|
||||
summary_proj_to_labels=True,
|
||||
summary_first_dropout=0.1,
|
||||
**kwargs
|
||||
):
|
||||
"""Constructs CTRLConfig.
|
||||
|
||||
Args:
|
||||
vocab_size_or_config_json_file: Vocabulary size of `inputs_ids` in `CTRLModel` or a configuration json file.
|
||||
n_positions: Number of positional embeddings.
|
||||
n_ctx: Size of the causal mask (usually same as n_positions).
|
||||
dff: Size of the inner dimension of the FFN.
|
||||
n_embd: Dimensionality of the embeddings and hidden states.
|
||||
n_layer: Number of hidden layers in the Transformer encoder.
|
||||
n_head: Number of attention heads for each attention layer in
|
||||
the Transformer encoder.
|
||||
layer_norm_epsilon: epsilon to use in the layer norm layers
|
||||
resid_pdrop: The dropout probability for all fully connected
|
||||
layers in the embeddings, encoder, and pooler.
|
||||
attn_pdrop: The dropout ratio for the attention
|
||||
probabilities.
|
||||
embd_pdrop: The dropout ratio for the embeddings.
|
||||
initializer_range: The stddev of the truncated_normal_initializer for
|
||||
initializing all weight matrices.
|
||||
"""
|
||||
super(CTRLConfig, self).__init__(**kwargs)
|
||||
|
||||
self.vocab_size = vocab_size_or_config_json_file if isinstance(vocab_size_or_config_json_file, int) else -1
|
||||
self.n_ctx = n_ctx
|
||||
self.n_positions = n_positions
|
||||
self.n_embd = n_embd
|
||||
self.n_layer = n_layer
|
||||
self.n_head = n_head
|
||||
self.dff = dff
|
||||
self.resid_pdrop = resid_pdrop
|
||||
self.embd_pdrop = embd_pdrop
|
||||
self.attn_pdrop = attn_pdrop
|
||||
self.layer_norm_epsilon = layer_norm_epsilon
|
||||
self.initializer_range = initializer_range
|
||||
|
||||
self.num_labels = num_labels
|
||||
self.summary_type = summary_type
|
||||
self.summary_use_proj = summary_use_proj
|
||||
self.summary_activation = summary_activation
|
||||
self.summary_first_dropout = summary_first_dropout
|
||||
self.summary_proj_to_labels = summary_proj_to_labels
|
||||
if isinstance(vocab_size_or_config_json_file, str) or (sys.version_info[0] == 2
|
||||
and isinstance(vocab_size_or_config_json_file, unicode)):
|
||||
with open(vocab_size_or_config_json_file, "r", encoding="utf-8") as reader:
|
||||
json_config = json.loads(reader.read())
|
||||
for key, value in json_config.items():
|
||||
self.__dict__[key] = value
|
||||
elif not isinstance(vocab_size_or_config_json_file, int):
|
||||
raise ValueError(
"First argument must be either a vocabulary size (int) "
|
||||
"or the path to a pretrained model config file (str)"
|
||||
)
|
||||
|
||||
@property
|
||||
def max_position_embeddings(self):
|
||||
return self.n_positions
|
||||
|
||||
@property
|
||||
def hidden_size(self):
|
||||
return self.n_embd
|
||||
|
||||
@property
|
||||
def num_attention_heads(self):
|
||||
return self.n_head
|
||||
|
||||
@property
|
||||
def num_hidden_layers(self):
|
||||
return self.n_layer
|
||||
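The constructor above accepts either an integer vocabulary size or a path to a JSON file, and exposes generic names such as `hidden_size` as read-only aliases of the model-specific fields. The following is a reduced sketch of that pattern, not the real `CTRLConfig`: the class `TinyConfig` is hypothetical, keeps only three fields, and drops the Python 2 `unicode` branch.

```python
import json
import os
import tempfile


class TinyConfig(object):
    """Hypothetical miniature of the int-or-JSON-path constructor pattern."""

    def __init__(self, vocab_size_or_config_json_file=246534,
                 n_positions=256, n_embd=1280):
        is_int = isinstance(vocab_size_or_config_json_file, int)
        self.vocab_size = vocab_size_or_config_json_file if is_int else -1
        self.n_positions = n_positions
        self.n_embd = n_embd
        if isinstance(vocab_size_or_config_json_file, str):
            # A string is treated as a path; JSON keys override the defaults.
            with open(vocab_size_or_config_json_file, "r", encoding="utf-8") as reader:
                json_config = json.loads(reader.read())
            for key, value in json_config.items():
                self.__dict__[key] = value
        elif not is_int:
            raise ValueError("First argument must be either a vocabulary size (int) "
                             "or the path to a pretrained model config file (str)")

    @property
    def max_position_embeddings(self):
        return self.n_positions

    @property
    def hidden_size(self):
        return self.n_embd


if __name__ == "__main__":
    with tempfile.NamedTemporaryFile("w", suffix=".json", delete=False) as f:
        json.dump({"vocab_size": 100, "n_embd": 64}, f)
    loaded = TinyConfig(f.name)
    print(loaded.vocab_size, loaded.hidden_size)
    os.remove(f.name)
```

The property aliases let shared code read `hidden_size` or `max_position_embeddings` regardless of which naming scheme a given model's configuration uses.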
@@ -28,7 +28,8 @@ logger = logging.getLogger(__name__)
|
||||
|
||||
GPT2_PRETRAINED_CONFIG_ARCHIVE_MAP = {"gpt2": "https://s3.amazonaws.com/models.huggingface.co/bert/gpt2-config.json",
|
||||
"gpt2-medium": "https://s3.amazonaws.com/models.huggingface.co/bert/gpt2-medium-config.json",
|
||||
"gpt2-large": "https://s3.amazonaws.com/models.huggingface.co/bert/gpt2-large-config.json"}
|
||||
"gpt2-large": "https://s3.amazonaws.com/models.huggingface.co/bert/gpt2-large-config.json",
|
||||
"distilgpt2": "https://s3.amazonaws.com/models.huggingface.co/bert/distilgpt2-config.json",}
|
||||
|
||||
class GPT2Config(PretrainedConfig):
|
||||
"""Configuration class to store the configuration of a `GPT2Model`.
|
||||
|
||||
@@ -31,7 +31,8 @@ from transformers import (BertConfig, TFBertForPreTraining, TFBertForQuestionAns
|
||||
TransfoXLConfig, TFTransfoXLLMHeadModel, load_transfo_xl_pt_weights_in_tf2, TRANSFO_XL_PRETRAINED_CONFIG_ARCHIVE_MAP,
|
||||
OpenAIGPTConfig, TFOpenAIGPTLMHeadModel, load_openai_gpt_pt_weights_in_tf2, OPENAI_GPT_PRETRAINED_CONFIG_ARCHIVE_MAP,
|
||||
RobertaConfig, TFRobertaForMaskedLM, TFRobertaForSequenceClassification, load_roberta_pt_weights_in_tf2, ROBERTA_PRETRAINED_CONFIG_ARCHIVE_MAP,
|
||||
DistilBertConfig, TFDistilBertForMaskedLM, TFDistilBertForQuestionAnswering, load_distilbert_pt_weights_in_tf2, DISTILBERT_PRETRAINED_CONFIG_ARCHIVE_MAP)
|
||||
DistilBertConfig, TFDistilBertForMaskedLM, TFDistilBertForQuestionAnswering, load_distilbert_pt_weights_in_tf2, DISTILBERT_PRETRAINED_CONFIG_ARCHIVE_MAP,
|
||||
CTRLConfig, TFCTRLLMHeadModel, load_ctrl_pt_weights_in_tf2, CTRL_PRETRAINED_CONFIG_ARCHIVE_MAP)
|
||||
|
||||
if is_torch_available():
|
||||
import torch
|
||||
@@ -43,7 +44,8 @@ if is_torch_available():
|
||||
TransfoXLLMHeadModel, TRANSFO_XL_PRETRAINED_MODEL_ARCHIVE_MAP,
|
||||
OpenAIGPTLMHeadModel, OPENAI_GPT_PRETRAINED_MODEL_ARCHIVE_MAP,
|
||||
RobertaForMaskedLM, RobertaForSequenceClassification, ROBERTA_PRETRAINED_MODEL_ARCHIVE_MAP,
|
||||
DistilBertForMaskedLM, DistilBertForQuestionAnswering, DISTILBERT_PRETRAINED_MODEL_ARCHIVE_MAP)
|
||||
DistilBertForMaskedLM, DistilBertForQuestionAnswering, DISTILBERT_PRETRAINED_MODEL_ARCHIVE_MAP,
|
||||
CTRLLMHeadModel, CTRL_PRETRAINED_MODEL_ARCHIVE_MAP)
|
||||
else:
|
||||
(BertForPreTraining, BertForQuestionAnswering, BertForSequenceClassification, BERT_PRETRAINED_MODEL_ARCHIVE_MAP,
|
||||
GPT2LMHeadModel, GPT2_PRETRAINED_MODEL_ARCHIVE_MAP,
|
||||
@@ -52,7 +54,8 @@ else:
|
||||
TransfoXLLMHeadModel, TRANSFO_XL_PRETRAINED_MODEL_ARCHIVE_MAP,
|
||||
OpenAIGPTLMHeadModel, OPENAI_GPT_PRETRAINED_MODEL_ARCHIVE_MAP,
|
||||
RobertaForMaskedLM, RobertaForSequenceClassification, ROBERTA_PRETRAINED_MODEL_ARCHIVE_MAP,
|
||||
DistilBertForMaskedLM, DistilBertForQuestionAnswering, DISTILBERT_PRETRAINED_MODEL_ARCHIVE_MAP,) = (
|
||||
DistilBertForMaskedLM, DistilBertForQuestionAnswering, DISTILBERT_PRETRAINED_MODEL_ARCHIVE_MAP,
|
||||
CTRLLMHeadModel, CTRL_PRETRAINED_MODEL_ARCHIVE_MAP) = (
|
||||
None, None, None, None,
|
||||
None, None,
|
||||
None, None,
|
||||
@@ -60,7 +63,8 @@ else:
|
||||
None, None,
|
||||
None, None,
|
||||
None, None, None,
|
||||
None, None, None,)
|
||||
None, None, None,
|
||||
None, None)
|
||||
|
||||
|
||||
import logging
|
||||
@@ -80,6 +84,7 @@ MODEL_CLASSES = {
|
||||
'roberta-large-mnli': (RobertaConfig, TFRobertaForSequenceClassification, load_roberta_pt_weights_in_tf2, RobertaForSequenceClassification, ROBERTA_PRETRAINED_MODEL_ARCHIVE_MAP, ROBERTA_PRETRAINED_CONFIG_ARCHIVE_MAP),
|
||||
'distilbert': (DistilBertConfig, TFDistilBertForMaskedLM, load_distilbert_pt_weights_in_tf2, DistilBertForMaskedLM, DISTILBERT_PRETRAINED_MODEL_ARCHIVE_MAP, DISTILBERT_PRETRAINED_CONFIG_ARCHIVE_MAP),
|
||||
'distilbert-base-uncased-distilled-squad': (DistilBertConfig, TFDistilBertForQuestionAnswering, load_distilbert_pt_weights_in_tf2, DistilBertForQuestionAnswering, DISTILBERT_PRETRAINED_MODEL_ARCHIVE_MAP, DISTILBERT_PRETRAINED_CONFIG_ARCHIVE_MAP),
|
||||
'ctrl': (CTRLConfig, TFCTRLLMHeadModel, load_ctrl_pt_weights_in_tf2, CTRLLMHeadModel, CTRL_PRETRAINED_MODEL_ARCHIVE_MAP, CTRL_PRETRAINED_CONFIG_ARCHIVE_MAP)
|
||||
}
|
||||
|
||||
def convert_pt_checkpoint_to_tf(model_type, pytorch_checkpoint_path, config_file, tf_dump_path, compare_with_pt_model=False, use_cached_models=True):
|
||||
@@ -173,10 +178,12 @@ def convert_all_pt_checkpoints_to_tf(args_model_type, tf_dump_path, model_shortc
|
||||
else:
|
||||
model_file = cached_path(model_shortcut_name, force_download=not use_cached_models)
|
||||
|
||||
convert_pt_checkpoint_to_tf(model_type,
|
||||
model_file,
|
||||
config_file,
|
||||
os.path.join(tf_dump_path, model_shortcut_name + '-tf_model.h5'),
|
||||
if os.path.isfile(model_shortcut_name):
|
||||
model_shortcut_name = 'converted_model'
|
||||
convert_pt_checkpoint_to_tf(model_type=model_type,
|
||||
pytorch_checkpoint_path=model_file,
|
||||
config_file=config_file,
|
||||
tf_dump_path=os.path.join(tf_dump_path, model_shortcut_name + '-tf_model.h5'),
|
||||
compare_with_pt_model=compare_with_pt_model)
|
||||
os.remove(config_file)
|
||||
os.remove(model_file)
|
||||
@@ -228,6 +235,7 @@ if __name__ == "__main__":
|
||||
convert_all_pt_checkpoints_to_tf(args.model_type.lower() if args.model_type is not None else None,
|
||||
args.tf_dump_path,
|
||||
model_shortcut_names_or_path=[args.pytorch_checkpoint_path] if args.pytorch_checkpoint_path is not None else None,
|
||||
config_shortcut_names_or_path=[args.config_file] if args.config_file is not None else None,
|
||||
compare_with_pt_model=args.compare_with_pt_model,
|
||||
use_cached_models=args.use_cached_models,
|
||||
only_convert_finetuned_models=args.only_convert_finetuned_models)
|
||||
|
||||
@@ -79,17 +79,13 @@ def glue_convert_examples_to_features(examples, tokenizer,
|
||||
if ex_index % 10000 == 0:
|
||||
logger.info("Writing example %d" % (ex_index))
|
||||
if is_tf_dataset:
|
||||
example = InputExample(example['idx'].numpy(),
|
||||
example['sentence1'].numpy().decode('utf-8'),
|
||||
example['sentence2'].numpy().decode('utf-8'),
|
||||
str(example['label'].numpy()))
|
||||
example = processor.get_example_from_tensor_dict(example)
|
||||
|
||||
inputs = tokenizer.encode_plus(
|
||||
example.text_a,
|
||||
example.text_b,
|
||||
add_special_tokens=True,
|
||||
max_length=max_length,
|
||||
truncate_first_sequence=True # We're truncating the first sequence in priority
|
||||
)
|
||||
input_ids, token_type_ids = inputs["input_ids"], inputs["token_type_ids"]
|
||||
|
||||
@@ -157,6 +153,13 @@ def glue_convert_examples_to_features(examples, tokenizer,
|
||||
class MrpcProcessor(DataProcessor):
|
||||
"""Processor for the MRPC data set (GLUE version)."""
|
||||
|
||||
def get_example_from_tensor_dict(self, tensor_dict):
|
||||
"""See base class."""
|
||||
return InputExample(tensor_dict['idx'].numpy(),
|
||||
tensor_dict['sentence1'].numpy().decode('utf-8'),
|
||||
tensor_dict['sentence2'].numpy().decode('utf-8'),
|
||||
str(tensor_dict['label'].numpy()))
|
||||
|
||||
def get_train_examples(self, data_dir):
|
||||
"""See base class."""
|
||||
logger.info("LOOKING AT {}".format(os.path.join(data_dir, "train.tsv")))
|
||||
@@ -190,6 +193,13 @@ class MrpcProcessor(DataProcessor):
|
||||
class MnliProcessor(DataProcessor):
|
||||
"""Processor for the MultiNLI data set (GLUE version)."""
|
||||
|
||||
def get_example_from_tensor_dict(self, tensor_dict):
|
||||
"""See base class."""
|
||||
return InputExample(tensor_dict['idx'].numpy(),
|
||||
tensor_dict['premise'].numpy().decode('utf-8'),
|
||||
tensor_dict['hypothesis'].numpy().decode('utf-8'),
|
||||
str(tensor_dict['label'].numpy()))
|
||||
|
||||
def get_train_examples(self, data_dir):
|
||||
"""See base class."""
|
||||
return self._create_examples(
|
||||
@@ -233,6 +243,13 @@ class MnliMismatchedProcessor(MnliProcessor):
|
||||
class ColaProcessor(DataProcessor):
|
||||
"""Processor for the CoLA data set (GLUE version)."""
|
||||
|
||||
def get_example_from_tensor_dict(self, tensor_dict):
|
||||
"""See base class."""
|
||||
return InputExample(tensor_dict['idx'].numpy(),
|
||||
tensor_dict['sentence'].numpy().decode('utf-8'),
|
||||
None,
|
||||
str(tensor_dict['label'].numpy()))
|
||||
|
||||
def get_train_examples(self, data_dir):
|
||||
"""See base class."""
|
||||
return self._create_examples(
|
||||
@@ -262,6 +279,13 @@ class ColaProcessor(DataProcessor):
|
||||
class Sst2Processor(DataProcessor):
|
||||
"""Processor for the SST-2 data set (GLUE version)."""
|
||||
|
||||
def get_example_from_tensor_dict(self, tensor_dict):
|
||||
"""See base class."""
|
||||
return InputExample(tensor_dict['idx'].numpy(),
|
||||
tensor_dict['sentence'].numpy().decode('utf-8'),
|
||||
None,
|
||||
str(tensor_dict['label'].numpy()))
|
||||
|
||||
def get_train_examples(self, data_dir):
|
||||
"""See base class."""
|
||||
return self._create_examples(
|
||||
@@ -293,6 +317,13 @@ class Sst2Processor(DataProcessor):
|
||||
class StsbProcessor(DataProcessor):
|
||||
"""Processor for the STS-B data set (GLUE version)."""
|
||||
|
||||
def get_example_from_tensor_dict(self, tensor_dict):
|
||||
"""See base class."""
|
||||
return InputExample(tensor_dict['idx'].numpy(),
|
||||
tensor_dict['sentence1'].numpy().decode('utf-8'),
|
||||
tensor_dict['sentence2'].numpy().decode('utf-8'),
|
||||
str(tensor_dict['label'].numpy()))
|
||||
|
||||
def get_train_examples(self, data_dir):
|
||||
"""See base class."""
|
||||
return self._create_examples(
|
||||
@@ -325,6 +356,13 @@ class StsbProcessor(DataProcessor):
|
||||
class QqpProcessor(DataProcessor):
|
||||
"""Processor for the QQP data set (GLUE version)."""
|
||||
|
||||
def get_example_from_tensor_dict(self, tensor_dict):
|
||||
"""See base class."""
|
||||
return InputExample(tensor_dict['idx'].numpy(),
|
||||
tensor_dict['question1'].numpy().decode('utf-8'),
|
||||
tensor_dict['question2'].numpy().decode('utf-8'),
|
||||
str(tensor_dict['label'].numpy()))
|
||||
|
||||
def get_train_examples(self, data_dir):
|
||||
"""See base class."""
|
||||
return self._create_examples(
|
||||
@@ -360,6 +398,13 @@ class QqpProcessor(DataProcessor):
|
||||
class QnliProcessor(DataProcessor):
|
||||
"""Processor for the QNLI data set (GLUE version)."""
|
||||
|
||||
def get_example_from_tensor_dict(self, tensor_dict):
|
||||
"""See base class."""
|
||||
return InputExample(tensor_dict['idx'].numpy(),
|
||||
tensor_dict['question'].numpy().decode('utf-8'),
|
||||
tensor_dict['sentence'].numpy().decode('utf-8'),
|
||||
str(tensor_dict['label'].numpy()))
|
||||
|
||||
def get_train_examples(self, data_dir):
|
||||
"""See base class."""
|
||||
return self._create_examples(
|
||||
@@ -393,6 +438,13 @@ class QnliProcessor(DataProcessor):
|
||||
class RteProcessor(DataProcessor):
|
||||
"""Processor for the RTE data set (GLUE version)."""
|
||||
|
||||
def get_example_from_tensor_dict(self, tensor_dict):
|
||||
"""See base class."""
|
||||
return InputExample(tensor_dict['idx'].numpy(),
|
||||
tensor_dict['sentence1'].numpy().decode('utf-8'),
|
||||
tensor_dict['sentence2'].numpy().decode('utf-8'),
|
||||
str(tensor_dict['label'].numpy()))
|
||||
|
||||
def get_train_examples(self, data_dir):
|
||||
"""See base class."""
|
||||
return self._create_examples(
|
||||
@@ -425,6 +477,13 @@ class RteProcessor(DataProcessor):
|
||||
class WnliProcessor(DataProcessor):
|
||||
"""Processor for the WNLI data set (GLUE version)."""
|
||||
|
||||
def get_example_from_tensor_dict(self, tensor_dict):
|
||||
"""See base class."""
|
||||
return InputExample(tensor_dict['idx'].numpy(),
|
||||
tensor_dict['sentence1'].numpy().decode('utf-8'),
|
||||
tensor_dict['sentence2'].numpy().decode('utf-8'),
|
||||
str(tensor_dict['label'].numpy()))
|
||||
|
||||
def get_train_examples(self, data_dir):
|
||||
"""See base class."""
|
||||
return self._create_examples(
|
||||
|
||||
@@ -86,6 +86,15 @@ class InputFeatures(object):
|
||||
class DataProcessor(object):
|
||||
"""Base class for data converters for sequence classification data sets."""
|
||||
|
||||
def get_example_from_tensor_dict(self, tensor_dict):
|
||||
"""Gets an example from a dict with tensorflow tensors
|
||||
|
||||
Args:
|
||||
tensor_dict: Keys and values should match the corresponding Glue
|
||||
tensorflow_dataset examples.
|
||||
"""
|
||||
raise NotImplementedError()
|
||||
|
||||
def get_train_examples(self, data_dir):
|
||||
"""Gets a collection of `InputExample`s for the train set."""
|
||||
raise NotImplementedError()
|
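The new `get_example_from_tensor_dict` hook lets each processor decide which keys of a dataset record become `text_a`, `text_b` and `label`. The sketch below shows the idea without requiring TensorFlow; `FakeTensor`, `PairProcessor` and the local `InputExample` tuple are hypothetical stand-ins that only mimic the `.numpy()` accessor used in the diff.

```python
from collections import namedtuple

# Hypothetical stand-in for the library's InputExample class.
InputExample = namedtuple("InputExample", ["guid", "text_a", "text_b", "label"])


class FakeTensor(object):
    """Mimics the `.numpy()` accessor of an eager TensorFlow tensor."""

    def __init__(self, value):
        self._value = value

    def numpy(self):
        return self._value


class PairProcessor(object):
    """Illustrative processor for a sentence-pair task (names are assumptions)."""

    def get_example_from_tensor_dict(self, tensor_dict):
        # String features arrive as bytes and are decoded; the label is stringified.
        return InputExample(tensor_dict['idx'].numpy(),
                            tensor_dict['sentence1'].numpy().decode('utf-8'),
                            tensor_dict['sentence2'].numpy().decode('utf-8'),
                            str(tensor_dict['label'].numpy()))


record = {
    'idx': FakeTensor(7),
    'sentence1': FakeTensor(b"A cat sat."),
    'sentence2': FakeTensor(b"A cat was sitting."),
    'label': FakeTensor(1),
}
example = PairProcessor().get_example_from_tensor_dict(record)
print(example)
```

A single-sentence task would follow the same shape but pass `None` for `text_b`, as the CoLA and SST-2 processors in this diff do.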
||||
|
||||
@@ -27,7 +27,7 @@ logger = logging.getLogger(__name__) # pylint: disable=invalid-name
|
||||
|
||||
try:
|
||||
import tensorflow as tf
|
||||
assert int(tf.__version__[0]) >= 2
|
||||
assert hasattr(tf, '__version__') and int(tf.__version__[0]) >= 2
|
||||
_tf_available = True # pylint: disable=invalid-name
|
||||
logger.info("TensorFlow version {} available.".format(tf.__version__))
|
||||
except (ImportError, AssertionError):
|
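Note that `tf.__version__[0]` inspects only the first character of the version string, so a hypothetical two-digit major version would be misread. A sketch of a stricter parse, assuming a dotted version string; `major_version` is an illustrative helper and not part of this diff.

```python
def major_version(version_string):
    """Return the integer that precedes the first dot of a dotted version string."""
    return int(version_string.split(".")[0])


# The first-character check and the dotted parse agree for single-digit majors...
print(major_version("2.0.0"), int("2.0.0"[0]))
# ...but diverge once the major version has two digits.
print(major_version("10.1.0"), int("10.1.0"[0]))
```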
||||
|
||||
@@ -21,6 +21,7 @@ import logging
|
||||
from .modeling_bert import BertModel, BertForMaskedLM, BertForSequenceClassification, BertForQuestionAnswering
|
||||
from .modeling_openai import OpenAIGPTModel, OpenAIGPTLMHeadModel
|
||||
from .modeling_gpt2 import GPT2Model, GPT2LMHeadModel
|
||||
from .modeling_ctrl import CTRLModel, CTRLLMHeadModel
|
||||
from .modeling_transfo_xl import TransfoXLModel, TransfoXLLMHeadModel
|
||||
from .modeling_xlnet import XLNetModel, XLNetLMHeadModel, XLNetForSequenceClassification, XLNetForQuestionAnswering
|
||||
from .modeling_xlm import XLMModel, XLMWithLMHeadModel, XLMForSequenceClassification, XLMForQuestionAnswering
|
||||
@@ -51,6 +52,7 @@ class AutoModel(object):
|
||||
- contains `bert`: BertModel (Bert model)
|
||||
- contains `openai-gpt`: OpenAIGPTModel (OpenAI GPT model)
|
||||
- contains `gpt2`: GPT2Model (OpenAI GPT-2 model)
|
||||
- contains `ctrl`: CTRLModel (Salesforce CTRL model)
|
||||
- contains `transfo-xl`: TransfoXLModel (Transformer-XL model)
|
||||
- contains `xlnet`: XLNetModel (XLNet model)
|
||||
- contains `xlm`: XLMModel (XLM model)
|
||||
@@ -73,6 +75,7 @@ class AutoModel(object):
|
||||
- contains `bert`: BertModel (Bert model)
|
||||
- contains `openai-gpt`: OpenAIGPTModel (OpenAI GPT model)
|
||||
- contains `gpt2`: GPT2Model (OpenAI GPT-2 model)
|
||||
- contains `ctrl`: CTRLModel (Salesforce CTRL model)
|
||||
- contains `transfo-xl`: TransfoXLModel (Transformer-XL model)
|
||||
- contains `xlnet`: XLNetModel (XLNet model)
|
||||
- contains `xlm`: XLMModel (XLM model)
|
||||
@@ -149,10 +152,11 @@ class AutoModel(object):
|
||||
return XLNetModel.from_pretrained(pretrained_model_name_or_path, *model_args, **kwargs)
|
||||
elif 'xlm' in pretrained_model_name_or_path:
|
||||
return XLMModel.from_pretrained(pretrained_model_name_or_path, *model_args, **kwargs)
|
||||
|
||||
elif 'ctrl' in pretrained_model_name_or_path:
|
||||
return CTRLModel.from_pretrained(pretrained_model_name_or_path, *model_args, **kwargs)
|
||||
raise ValueError("Unrecognized model identifier in {}. Should contain one of "
|
||||
"'bert', 'openai-gpt', 'gpt2', 'transfo-xl', 'xlnet', "
|
||||
"'xlm', 'roberta'".format(pretrained_model_name_or_path))
|
||||
"'xlm', 'roberta', 'ctrl'".format(pretrained_model_name_or_path))
|
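The dispatch above selects a model class by testing whether a known substring occurs in the model name or path. The sketch below illustrates that mechanism with a plain lookup table; the `DISPATCH` list, its ordering and the `resolve` helper are assumptions made for illustration, and class names are returned as strings so the example runs without the library.

```python
# Hypothetical (substring, class name) table. Order matters because, for
# example, 'distilbert' and 'roberta' both contain the substring 'bert'.
DISPATCH = [
    ("distilbert", "DistilBertModel"),
    ("roberta", "RobertaModel"),
    ("bert", "BertModel"),
    ("openai-gpt", "OpenAIGPTModel"),
    ("gpt2", "GPT2Model"),
    ("transfo-xl", "TransfoXLModel"),
    ("xlnet", "XLNetModel"),
    ("xlm", "XLMModel"),
    ("ctrl", "CTRLModel"),
]


def resolve(name_or_path):
    """Return the class name for the first substring found in `name_or_path`."""
    for pattern, class_name in DISPATCH:
        if pattern in name_or_path:
            return class_name
    raise ValueError("Unrecognized model identifier in {}.".format(name_or_path))


print(resolve("ctrl"))
print(resolve("roberta-base"))
```

One consequence of substring matching is that a local path whose directory names happen to contain one of the patterns will also match, so the first hit in the table decides the class.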
||||
|
||||
|
||||
class AutoModelWithLMHead(object):
|
||||
@@ -172,6 +176,7 @@ class AutoModelWithLMHead(object):
|
||||
- contains `bert`: BertForMaskedLM (Bert model)
|
||||
- contains `openai-gpt`: OpenAIGPTLMHeadModel (OpenAI GPT model)
|
||||
- contains `gpt2`: GPT2LMHeadModel (OpenAI GPT-2 model)
|
||||
- contains `ctrl`: CTRLLMHeadModel (Salesforce CTRL model)
|
||||
- contains `transfo-xl`: TransfoXLLMHeadModel (Transformer-XL model)
|
||||
- contains `xlnet`: XLNetLMHeadModel (XLNet model)
|
||||
- contains `xlm`: XLMWithLMHeadModel (XLM model)
|
||||
@@ -273,10 +278,11 @@ class AutoModelWithLMHead(object):
|
||||
return XLNetLMHeadModel.from_pretrained(pretrained_model_name_or_path, *model_args, **kwargs)
|
||||
elif 'xlm' in pretrained_model_name_or_path:
|
||||
return XLMWithLMHeadModel.from_pretrained(pretrained_model_name_or_path, *model_args, **kwargs)
|
||||
|
||||
elif 'ctrl' in pretrained_model_name_or_path:
|
||||
return CTRLLMHeadModel.from_pretrained(pretrained_model_name_or_path, *model_args, **kwargs)
|
||||
raise ValueError("Unrecognized model identifier in {}. Should contain one of "
|
||||
"'bert', 'openai-gpt', 'gpt2', 'transfo-xl', 'xlnet', "
|
||||
"'xlm', 'roberta'".format(pretrained_model_name_or_path))
|
||||
"'xlm', 'roberta', 'ctrl'".format(pretrained_model_name_or_path))
|
||||
|
||||
|
||||
class AutoModelForSequenceClassification(object):
|
||||
|
||||
@@ -118,7 +118,7 @@ def load_tf_weights_in_bert(model, config, tf_checkpoint_path):
|
||||
|
||||
|
||||
def gelu(x):
|
||||
""" Original Implementation of the gelu activation function in Google Bert repo when initialy created.
|
||||
""" Original Implementation of the gelu activation function in Google Bert repo when initially created.
|
||||
For information: OpenAI GPT's gelu is slightly different (and gives slightly different results):
|
||||
0.5 * x * (1 + torch.tanh(math.sqrt(2 / math.pi) * (x + 0.044715 * torch.pow(x, 3))))
|
||||
Also see https://arxiv.org/abs/1606.08415
|
||||
|
||||
482
transformers/modeling_ctrl.py
Normal file
@@ -0,0 +1,482 @@
|
||||
# coding=utf-8
|
||||
# Copyright 2018 Salesforce and HuggingFace Inc. team.
|
||||
# Copyright (c) 2018, NVIDIA CORPORATION. All rights reserved.
|
||||
#
|
||||
# Licensed under the Apache License, Version 2.0 (the "License");
|
||||
# you may not use this file except in compliance with the License.
|
||||
# You may obtain a copy of the License at
|
||||
#
|
||||
# http://www.apache.org/licenses/LICENSE-2.0
|
||||
#
|
||||
# Unless required by applicable law or agreed to in writing, software
|
||||
# distributed under the License is distributed on an "AS IS" BASIS,
|
||||
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
|
||||
# See the License for the specific language governing permissions and
|
||||
# limitations under the License.
|
||||
""" PyTorch CTRL model."""
|
||||
|
||||
from __future__ import absolute_import, division, print_function, unicode_literals
|
||||
|
||||
import collections
|
||||
import json
|
||||
import logging
|
||||
import math
|
||||
import os
|
||||
import sys
|
||||
from io import open
|
||||
import numpy as np
|
||||
import torch
|
||||
import torch.nn as nn
|
||||
from torch.nn import CrossEntropyLoss
|
||||
from torch.nn.parameter import Parameter
|
||||
|
||||
from .modeling_utils import PreTrainedModel, Conv1D, prune_conv1d_layer, SequenceSummary
|
||||
from .configuration_ctrl import CTRLConfig
|
||||
from .file_utils import add_start_docstrings
|
||||
|
||||
logger = logging.getLogger(__name__)
|
||||
|
||||
CTRL_PRETRAINED_MODEL_ARCHIVE_MAP = {"ctrl": "https://storage.googleapis.com/sf-ctrl/pytorch/seqlen256_v1.bin"}
|
||||
|
||||
|
||||
def angle_defn(pos, i, d_model_size):
|
||||
angle_rates = 1 / torch.pow(10000, (2 * (i//2)) / d_model_size)
|
||||
return pos * angle_rates
|
||||
|
||||
def positional_encoding(position, d_model_size, dtype):
|
||||
# create the sinusoidal pattern for the positional encoding
|
||||
angle_rads = (angle_defn(torch.arange(position, dtype=dtype).unsqueeze(1),
|
||||
torch.arange(d_model_size, dtype=dtype).unsqueeze(0),
|
||||
d_model_size))
|
||||
|
||||
sines = torch.sin(angle_rads[:, 0::2])
|
||||
cosines = torch.cos(angle_rads[:, 1::2])
|
||||
|
||||
pos_encoding = torch.cat([sines, cosines], dim=-1)
|
||||
return pos_encoding
|
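`positional_encoding` builds a table with one row per position: the sines of the even-indexed angle columns come first, followed by the cosines of the odd-indexed columns (the two halves are concatenated rather than interleaved). A dependency-free sketch of the same arithmetic, using plain lists instead of tensors; the `_sketch` suffix marks the helper as illustrative.

```python
import math


def angle(pos, i, d_model_size):
    # Same rate as angle_defn: 1 / 10000^(2 * (i // 2) / d_model_size)
    return pos / (10000 ** (2 * (i // 2) / float(d_model_size)))


def positional_encoding_sketch(n_positions, d_model_size):
    """Return a list of rows laid out as [sines of even columns | cosines of odd columns]."""
    table = []
    for pos in range(n_positions):
        angles = [angle(pos, i, d_model_size) for i in range(d_model_size)]
        sines = [math.sin(a) for a in angles[0::2]]
        cosines = [math.cos(a) for a in angles[1::2]]
        table.append(sines + cosines)
    return table


table = positional_encoding_sketch(4, 6)
print(table[0])  # position 0: all sines are 0.0 and all cosines are 1.0
```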
||||
|
||||
def scaled_dot_product_attention(q, k, v, mask, attention_mask=None, head_mask=None):
|
||||
# calculate attention
|
||||
matmul_qk = torch.matmul(q, k.permute(0,1,3,2))
|
||||
|
||||
dk = k.shape[-1]
|
||||
scaled_attention_logits = matmul_qk / np.sqrt(dk)
|
||||
|
||||
if mask is not None:
|
||||
scaled_attention_logits += (mask * -1e4)
|
||||
|
||||
if attention_mask is not None:
|
||||
# Apply the attention mask
|
||||
scaled_attention_logits = scaled_attention_logits + attention_mask
|
||||
|
||||
attention_weights = torch.softmax(scaled_attention_logits, dim=-1)
|
||||
|
||||
# Mask heads if we want to
|
||||
if head_mask is not None:
|
||||
attention_weights = attention_weights * head_mask
|
||||
|
||||
output = torch.matmul(attention_weights, v)
|
||||
|
||||
return output, attention_weights
|
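`scaled_dot_product_attention` adds `mask * -1e4` to the scaled logits, so positions flagged with 1 in the upper-triangular causal mask receive effectively zero weight after the softmax. The following single-head sketch reproduces that behaviour on plain Python lists; it is an illustration under that simplification, not the batched implementation from the diff.

```python
import math


def softmax(row):
    # Subtract the maximum for numerical stability before exponentiating.
    peak = max(row)
    exps = [math.exp(x - peak) for x in row]
    total = sum(exps)
    return [e / total for e in exps]


def causal_attention(q, k, v):
    """Single-head attention over lists of vectors with a causal mask."""
    d_k = len(k[0])
    outputs, weights = [], []
    for i, q_i in enumerate(q):
        logits = []
        for j, k_j in enumerate(k):
            score = sum(a * b for a, b in zip(q_i, k_j)) / math.sqrt(d_k)
            if j > i:
                # Future position: same additive penalty as `mask * -1e4`.
                score += -1e4
            logits.append(score)
        w = softmax(logits)
        weights.append(w)
        outputs.append([sum(w_j * v_j[c] for w_j, v_j in zip(w, v))
                        for c in range(len(v[0]))])
    return outputs, weights


q = k = [[1.0, 0.0], [0.0, 1.0]]
v = [[1.0, 2.0], [3.0, 4.0]]
outputs, weights = causal_attention(q, k, v)
print(weights[0])  # the first position attends only to itself
```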
||||
|
||||
|
||||
class MultiHeadAttention(torch.nn.Module):
|
||||
def __init__(self, d_model_size, num_heads, output_attentions=False):
|
||||
super(MultiHeadAttention, self).__init__()
|
||||
self.output_attentions = output_attentions
|
||||
self.num_heads = num_heads
|
||||
self.d_model_size = d_model_size
|
||||
|
||||
self.depth = int(d_model_size / self.num_heads)
|
||||
|
||||
self.Wq = torch.nn.Linear(d_model_size, d_model_size)
|
||||
self.Wk = torch.nn.Linear(d_model_size, d_model_size)
|
||||
self.Wv = torch.nn.Linear(d_model_size, d_model_size)
|
||||
|
||||
self.dense = torch.nn.Linear(d_model_size, d_model_size)
|
||||
|
||||
def split_into_heads(self, x, batch_size):
|
||||
x = x.reshape(batch_size, -1, self.num_heads, self.depth)
|
||||
return x.permute([0, 2, 1, 3])
|
||||
|
||||
def forward(self, v, k, q, mask, layer_past=None, attention_mask=None, head_mask=None):
|
||||
batch_size = q.shape[0]
|
||||
|
||||
q = self.Wq(q)
|
||||
k = self.Wk(k)
|
||||
v = self.Wv(v)
|
||||
|
||||
q = self.split_into_heads(q, batch_size)
|
||||
k = self.split_into_heads(k, batch_size)
|
||||
v = self.split_into_heads(v, batch_size)
|
||||
if layer_past is not None:
|
||||
past_key, past_value = layer_past[0], layer_past[1]
|
||||
k = torch.cat((past_key, k), dim=-2)
|
||||
v = torch.cat((past_value, v), dim=-2)
|
||||
present = torch.stack((k, v))
|
||||
|
||||
output = scaled_dot_product_attention(q, k, v, mask, attention_mask, head_mask)
|
||||
scaled_attention = output[0].permute([0, 2, 1, 3])
|
||||
attn = output[1]
|
||||
original_size_attention = scaled_attention.reshape(batch_size, -1, self.d_model_size)
|
||||
output = self.dense(original_size_attention)
|
||||
|
||||
outputs = (output, present)
|
||||
if self.output_attentions:
|
||||
outputs = outputs + (attn,)
|
||||
return outputs
|
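When `layer_past` is provided, `MultiHeadAttention.forward` concatenates the cached keys and values with the new ones along the sequence axis and returns the result as `present`, so each decoding step only has to project the newest tokens. A list-based sketch of that bookkeeping for a single head; `extend_cache` is a hypothetical helper, not part of the diff.

```python
def extend_cache(layer_past, new_keys, new_values):
    """Append this step's keys/values to the cached ones along the sequence axis."""
    if layer_past is not None:
        past_keys, past_values = layer_past
        new_keys = past_keys + new_keys
        new_values = past_values + new_values
    # Plays the role of `present`: everything the next step needs to attend over.
    return new_keys, new_values


# Step 1: no cache yet, one token is processed.
step1 = extend_cache(None, [[1.0]], [[10.0]])
# Step 2: the cache from step 1 is extended with the next token.
step2 = extend_cache(step1, [[2.0]], [[20.0]])
print(len(step2[0]))  # cached sequence length after two steps
```

The cached sequence length is also what the model uses as `past_length` to offset the position ids of the new tokens.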
||||
|
||||
|
||||
|
||||
def point_wise_feed_forward_network(d_model_size, dff):
|
||||
return torch.nn.Sequential(torch.nn.Linear(d_model_size, dff),
|
||||
torch.nn.ReLU(),
|
||||
torch.nn.Linear(dff, d_model_size))
|
||||
|
||||
|
||||
class EncoderLayer(torch.nn.Module):
|
||||
def __init__(self, d_model_size, num_heads, dff, rate=0.1, output_attentions=False):
|
||||
super(EncoderLayer, self).__init__()
|
||||
|
||||
self.multi_head_attention = MultiHeadAttention(d_model_size, num_heads, output_attentions)
|
||||
self.ffn = point_wise_feed_forward_network(d_model_size, dff)
|
||||
|
||||
self.layernorm1 = torch.nn.LayerNorm(d_model_size, eps=1e-6)
|
||||
self.layernorm2 = torch.nn.LayerNorm(d_model_size, eps=1e-6)
|
||||
|
||||
self.dropout1 = torch.nn.Dropout(rate)
|
||||
self.dropout2 = torch.nn.Dropout(rate)
|
||||
|
||||
def forward(self, x, mask, layer_past=None, attention_mask=None, head_mask=None):
|
||||
normed = self.layernorm1(x)
|
||||
attn_outputs = self.multi_head_attention(normed, normed, normed, mask,
|
||||
layer_past=layer_past,
|
||||
attention_mask=attention_mask,
|
||||
head_mask=head_mask)
|
||||
attn_output = attn_outputs[0]
|
||||
attn_output = self.dropout1(attn_output)
|
||||
out1 = x + attn_output
|
||||
|
||||
out2 = self.layernorm2(out1)
|
||||
ffn_output = self.ffn(out2)
|
||||
ffn_output = self.dropout2(ffn_output)
|
||||
out2 = out1 + ffn_output
|
||||
|
||||
outputs = (out2,) + attn_outputs[1:]
|
||||
return outputs
|
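`EncoderLayer.forward` normalizes the input before each sub-layer and adds the sub-layer output back onto the un-normalized residual stream (a pre-LayerNorm arrangement). A toy sketch of that data flow on a list of floats, with dropout omitted and the sub-layers passed in as plain callables; all names here are illustrative.

```python
def pre_norm_block(x, norm1, attention, norm2, ffn):
    """Compute out1 = x + attention(norm1(x)), then out1 + ffn(norm2(out1))."""
    attn_out = attention(norm1(x))
    out1 = [a + b for a, b in zip(x, attn_out)]
    ffn_out = ffn(norm2(out1))
    return [a + b for a, b in zip(out1, ffn_out)]


def identity(values):
    return values


def double(values):
    return [2 * t for t in values]


def negate(values):
    return [-t for t in values]


# Toy sub-layers: identity norms, "attention" doubles, "ffn" negates.
result = pre_norm_block([1.0, 2.0], identity, double, identity, negate)
print(result)
```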
||||
|
||||
|
||||
class CTRLPreTrainedModel(PreTrainedModel):
|
||||
""" An abstract class to handle weights initialization and
|
||||
a simple interface for downloading and loading pretrained models.
|
||||
"""
|
||||
config_class = CTRLConfig
|
||||
pretrained_model_archive_map = CTRL_PRETRAINED_MODEL_ARCHIVE_MAP
|
||||
base_model_prefix = "transformer"
|
||||
|
||||
def _init_weights(self, module):
|
||||
""" Initialize the weights.
|
||||
"""
|
||||
if isinstance(module, (nn.Linear, nn.Embedding, Conv1D)):
|
||||
# Slightly different from the TF version which uses truncated_normal for initialization
|
||||
# cf https://github.com/pytorch/pytorch/pull/5617
|
||||
module.weight.data.normal_(mean=0.0, std=self.config.initializer_range)
|
||||
if isinstance(module, (nn.Linear, Conv1D)) and module.bias is not None:
|
||||
module.bias.data.zero_()
|
||||
elif isinstance(module, nn.LayerNorm):
|
||||
module.bias.data.zero_()
|
||||
module.weight.data.fill_(1.0)
|
||||
|
||||
|
||||
CTRL_START_DOCSTRING = r""" CTRL model was proposed in
|
||||
`CTRL: A Conditional Transformer Language Model for Controllable Generation`_
|
||||
by Nitish Shirish Keskar*, Bryan McCann*, Lav R. Varshney, Caiming Xiong and Richard Socher.
|
||||
It's a causal (unidirectional) transformer pre-trained using language modeling on a very large
|
||||
corpus of ~140 GB of text data with the first token reserved as a control code (such as Links, Books, Wikipedia etc.).
|
||||
|
||||
This model is a PyTorch `torch.nn.Module`_ sub-class. Use it as a regular PyTorch Module and
|
||||
refer to the PyTorch documentation for all matters related to general usage and behavior.
|
||||
|
||||
.. _`CTRL: A Conditional Transformer Language Model for Controllable Generation`:
|
||||
https://www.github.com/salesforce/ctrl
|
||||
|
||||
.. _`torch.nn.Module`:
|
||||
https://pytorch.org/docs/stable/nn.html#module
|
||||
|
||||
Parameters:
|
||||
config (:class:`~transformers.CTRLConfig`): Model configuration class with all the parameters of the model.
|
||||
Initializing with a config file does not load the weights associated with the model, only the configuration.
|
||||
Check out the :meth:`~transformers.PreTrainedModel.from_pretrained` method to load the model weights.
|
||||
"""
|
||||
|
||||
CTRL_INPUTS_DOCSTRING = r""" Inputs:
|
||||
**input_ids**: ``torch.LongTensor`` of shape ``(batch_size, sequence_length)``:
|
||||
Indices of input sequence tokens in the vocabulary.
|
||||
CTRL is a model with absolute position embeddings so it's usually advised to pad the inputs on
|
||||
the right rather than the left.
|
||||
Indices can be obtained using :class:`transformers.CTRLTokenizer`.
|
||||
See :func:`transformers.PreTrainedTokenizer.encode` and
|
||||
:func:`transformers.PreTrainedTokenizer.convert_tokens_to_ids` for details.
|
||||
**past**:
|
||||
list of ``torch.FloatTensor`` (one for each layer):
|
||||
that contains pre-computed hidden-states (key and values in the attention blocks) as computed by the model
|
||||
(see `past` output below). Can be used to speed up sequential decoding.
|
||||
**attention_mask**: (`optional`) ``torch.FloatTensor`` of shape ``(batch_size, sequence_length)``:
|
||||
Mask to avoid performing attention on padding token indices.
|
||||
Mask values selected in ``[0, 1]``:
|
||||
``1`` for tokens that are NOT MASKED, ``0`` for MASKED tokens.
|
||||
**token_type_ids**: (`optional`) ``torch.LongTensor`` of shape ``(batch_size, sequence_length)``:
|
||||
A parallel sequence of tokens (can be used to indicate various portions of the inputs).
|
||||
The embeddings from these tokens will be summed with the respective token embeddings.
|
||||
Indices are selected in the vocabulary (unlike BERT which has a specific vocabulary for segment indices).
|
||||
**position_ids**: (`optional`) ``torch.LongTensor`` of shape ``(batch_size, sequence_length)``:
|
||||
Indices of the position of each input sequence token in the position embeddings.
|
||||
Selected in the range ``[0, config.max_position_embeddings - 1]``.
|
||||
**head_mask**: (`optional`) ``torch.FloatTensor`` of shape ``(num_heads,)`` or ``(num_layers, num_heads)``:
|
||||
Mask to nullify selected heads of the self-attention modules.
|
||||
Mask values selected in ``[0, 1]``:
|
||||
``1`` indicates the head is **not masked**, ``0`` indicates the head is **masked**.
|
||||
"""
|
||||
|
||||
@add_start_docstrings("The bare CTRL Model transformer outputting raw hidden-states without any specific head on top.",
|
||||
CTRL_START_DOCSTRING, CTRL_INPUTS_DOCSTRING)
|
||||
class CTRLModel(CTRLPreTrainedModel):
|
||||
r"""
|
||||
Outputs: `Tuple` comprising various elements depending on the configuration (config) and inputs:
|
||||
**last_hidden_state**: ``torch.FloatTensor`` of shape ``(batch_size, sequence_length, hidden_size)``
|
||||
Sequence of hidden-states at the last layer of the model.
|
||||
**past**:
|
||||
list of ``torch.FloatTensor`` (one for each layer) of shape ``(2, batch_size, num_heads, sequence_length, hidden_size // num_heads)``:
|
||||
that contains pre-computed hidden-states (key and values in the attention blocks).
|
||||
Can be used (see `past` input) to speed up sequential decoding.
|
||||
**hidden_states**: (`optional`, returned when ``config.output_hidden_states=True``)
|
||||
list of ``torch.FloatTensor`` (one for the output of each layer + the output of the embeddings)
|
||||
of shape ``(batch_size, sequence_length, hidden_size)``:
|
||||
Hidden-states of the model at the output of each layer plus the initial embedding outputs.
|
||||
**attentions**: (`optional`, returned when ``config.output_attentions=True``)
|
||||
list of ``torch.FloatTensor`` (one for each layer) of shape ``(batch_size, num_heads, sequence_length, sequence_length)``:
|
||||
Attention weights after the attention softmax, used to compute the weighted average in the self-attention heads.
|
||||
|
||||
Examples::
|
||||
|
||||
tokenizer = CTRLTokenizer.from_pretrained('ctrl')
|
||||
model = CTRLModel.from_pretrained('ctrl')
|
||||
input_ids = torch.tensor(tokenizer.encode("Links Hello, my dog is cute")).unsqueeze(0) # Batch size 1
|
||||
outputs = model(input_ids)
|
||||
last_hidden_states = outputs[0] # The last hidden-state is the first element of the output tuple
|
||||
|
||||
"""
|
||||
def __init__(self, config):
|
||||
super(CTRLModel, self).__init__(config)
|
||||
self.output_hidden_states = config.output_hidden_states
|
||||
self.d_model_size = config.n_embd
|
||||
self.num_layers = config.n_layer
|
||||
|
||||
self.pos_encoding = positional_encoding(config.n_positions, self.d_model_size, torch.float)
|
||||
|
||||
self.output_attentions = config.output_attentions
|
||||
|
||||
self.w = nn.Embedding(config.vocab_size, config.n_embd)
|
||||
|
||||
|
||||
self.dropout = nn.Dropout(config.embd_pdrop)
|
||||
self.h = nn.ModuleList([EncoderLayer(config.n_embd,
|
||||
config.n_head,
|
||||
config.dff,
|
||||
config.resid_pdrop,
|
||||
config.output_attentions) for _ in range(config.n_layer)])
|
||||
self.layernorm = nn.LayerNorm(config.n_embd, eps=config.layer_norm_epsilon)
|
||||
|
||||
self.init_weights()
|
||||
|
||||
def _resize_token_embeddings(self, new_num_tokens):
|
||||
self.w = self._get_resized_embeddings(self.w, new_num_tokens)
|
||||
return self.w
|
||||
|
||||
def _prune_heads(self, heads_to_prune):
|
||||
""" Prunes heads of the model.
|
||||
heads_to_prune: dict of {layer_num: list of heads to prune in this layer}
|
||||
"""
|
||||
for layer, heads in heads_to_prune.items():
|
||||
self.h[layer].attn.prune_heads(heads)
|
||||
|
||||
def forward(self, input_ids, past=None, attention_mask=None, token_type_ids=None, position_ids=None, head_mask=None):
|
||||
input_shape = input_ids.size()
|
||||
input_ids = input_ids.view(-1, input_shape[-1])
|
||||
if past is None:
|
||||
past_length = 0
|
||||
past = [None] * len(self.h)
|
||||
else:
|
||||
past_length = past[0][0].size(-2)
|
||||
if position_ids is None:
|
||||
position_ids = torch.arange(past_length, input_ids.size(-1) + past_length, dtype=torch.long, device=input_ids.device)
|
||||
position_ids = position_ids.unsqueeze(0).expand_as(input_ids)
|
||||
|
||||
# Attention mask.
|
||||
if attention_mask is not None:
|
||||
attention_mask = attention_mask.view(-1, input_shape[-1])
|
||||
# We create a 3D attention mask from a 2D tensor mask.
|
||||
# Sizes are [batch_size, 1, 1, to_seq_length]
|
||||
# So we can broadcast to [batch_size, num_heads, from_seq_length, to_seq_length]
|
||||
# this attention mask is more simple than the triangular masking of causal attention
|
||||
# used in OpenAI GPT, we just need to prepare the broadcast dimension here.
|
||||
attention_mask = attention_mask.unsqueeze(1).unsqueeze(2)
|
||||
|
||||
# Since attention_mask is 1.0 for positions we want to attend and 0.0 for
|
||||
# masked positions, this operation will create a tensor which is 0.0 for
|
||||
# positions we want to attend and -10000.0 for masked positions.
|
||||
# Since we are adding it to the raw scores before the softmax, this is
|
||||
# effectively the same as removing these entirely.
|
||||
attention_mask = attention_mask.to(dtype=next(self.parameters()).dtype) # fp16 compatibility
|
||||
attention_mask = (1.0 - attention_mask) * -10000.0
|
||||
|
||||
# Prepare head mask if needed
|
||||
# 1.0 in head_mask indicates we keep the head
|
||||
# attention_probs has shape bsz x n_heads x N x N
|
||||
# head_mask has shape n_layer x batch x n_heads x N x N
|
||||
if head_mask is not None:
|
||||
if head_mask.dim() == 1:
|
||||
head_mask = head_mask.unsqueeze(0).unsqueeze(0).unsqueeze(-1).unsqueeze(-1)
|
||||
head_mask = head_mask.expand(self.config.n_layer, -1, -1, -1, -1)
|
||||
elif head_mask.dim() == 2:
|
||||
head_mask = head_mask.unsqueeze(1).unsqueeze(-1).unsqueeze(-1) # We can specify head_mask for each layer
|
||||
head_mask = head_mask.to(dtype=next(self.parameters()).dtype) # switch to fload if need + fp16 compatibility
|
||||
else:
|
||||
head_mask = [None] * self.config.n_layer
|
||||
|
||||
if token_type_ids is not None:
|
||||
token_type_ids = token_type_ids.view(-1, input_shape[-1])
|
||||
token_type_embeds = self.w(token_type_ids)
|
||||
token_type_embeds *= np.sqrt(self.d_model_size)
|
||||
else:
|
||||
token_type_embeds = 0
|
||||
position_ids = position_ids.view(-1, input_shape[-1])
|
||||
|
||||
inputs_embeds = self.w(input_ids)
|
||||
# inputs_embeds = embedded.unsqueeze(0) if len(input_ids.shape)<2 else embedded
|
||||
seq_len = input_ids.shape[-1]
|
||||
mask = torch.triu(torch.ones(seq_len, seq_len), 1).to(inputs_embeds.device)
|
||||
|
||||
inputs_embeds *= np.sqrt(self.d_model_size)
|
||||
|
||||
pos_embeds = self.pos_encoding[position_ids, :].to(inputs_embeds.device)
|
||||
|
||||
hidden_states = inputs_embeds + pos_embeds + token_type_embeds
|
||||
|
||||
hidden_states = self.dropout(hidden_states)
|
||||
|
||||
output_shape = input_shape + (inputs_embeds.size(-1),)
|
||||
presents = ()
|
||||
all_hidden_states = ()
|
||||
all_attentions = []
|
||||
for i, (h, layer_past) in enumerate(zip(self.h, past)):
|
||||
if self.output_hidden_states:
|
||||
all_hidden_states = all_hidden_states + (hidden_states.view(*output_shape),)
|
||||
outputs = h(hidden_states,
|
||||
mask,
|
||||
layer_past=layer_past,
|
||||
attention_mask=attention_mask,
|
||||
head_mask=head_mask[i])
|
||||
hidden_states, present = outputs[:2]
|
||||
presents = presents + (present,)
|
||||
|
||||
if self.output_attentions:
|
||||
all_attentions.append(outputs[2])
|
||||
|
||||
hidden_states = self.layernorm(hidden_states)
|
||||
hidden_states = hidden_states.view(*output_shape)
|
||||
if self.output_hidden_states:
|
||||
all_hidden_states = all_hidden_states + (hidden_states,)
|
||||
|
||||
outputs = (hidden_states, presents)
|
||||
if self.output_hidden_states:
|
||||
outputs = outputs + (all_hidden_states,)
|
||||
if self.output_attentions:
|
||||
# let the number of heads free (-1) so we can extract attention even after head pruning
|
||||
attention_output_shape = input_shape[:-1] + (-1,) + all_attentions[0].shape[-2:]
|
||||
all_attentions = tuple(t.view(*attention_output_shape) for t in all_attentions)
|
||||
outputs = outputs + (all_attentions,)
|
||||
return outputs
|
||||
|
||||
|
||||
@add_start_docstrings("""The CTRL Model transformer with a language modeling head on top
(linear layer with weights tied to the input embeddings). """, CTRL_START_DOCSTRING, CTRL_INPUTS_DOCSTRING)
class CTRLLMHeadModel(CTRLPreTrainedModel):
    r"""
        **labels**: (`optional`) ``torch.LongTensor`` of shape ``(batch_size, sequence_length)``:
            Labels for language modeling.
            Note that the labels **are shifted** inside the model, i.e. you can set ``labels = input_ids``
            Indices are selected in ``[-1, 0, ..., config.vocab_size]``
            All labels set to ``-1`` are ignored (masked), the loss is only
            computed for labels in ``[0, ..., config.vocab_size]``

    Outputs: `Tuple` comprising various elements depending on the configuration (config) and inputs:
        **loss**: (`optional`, returned when ``labels`` is provided) ``torch.FloatTensor`` of shape ``(1,)``:
            Language modeling loss.
        **prediction_scores**: ``torch.FloatTensor`` of shape ``(batch_size, sequence_length, config.vocab_size)``
            Prediction scores of the language modeling head (scores for each vocabulary token before SoftMax).
        **past**:
            list of ``torch.FloatTensor`` (one for each layer) of shape ``(batch_size, num_heads, sequence_length, sequence_length)``:
            that contains pre-computed hidden-states (key and values in the attention blocks).
            Can be used (see `past` input) to speed up sequential decoding.
        **hidden_states**: (`optional`, returned when ``config.output_hidden_states=True``)
            list of ``torch.FloatTensor`` (one for the output of each layer + the output of the embeddings)
            of shape ``(batch_size, sequence_length, hidden_size)``:
            Hidden-states of the model at the output of each layer plus the initial embedding outputs.
        **attentions**: (`optional`, returned when ``config.output_attentions=True``)
            list of ``torch.FloatTensor`` (one for each layer) of shape ``(batch_size, num_heads, sequence_length, sequence_length)``:
            Attention weights after the attention softmax, used to compute the weighted average in the self-attention heads.

    Examples::

        import torch
        from transformers import CTRLTokenizer, CTRLLMHeadModel

        tokenizer = CTRLTokenizer.from_pretrained('ctrl')
        model = CTRLLMHeadModel.from_pretrained('ctrl')

        input_ids = torch.tensor(tokenizer.encode("Links Hello, my dog is cute")).unsqueeze(0)  # Batch size 1
        outputs = model(input_ids, labels=input_ids)
        loss, logits = outputs[:2]

    """
    def __init__(self, config):
        super(CTRLLMHeadModel, self).__init__(config)
        self.transformer = CTRLModel(config)
        self.lm_head = nn.Linear(config.n_embd, config.vocab_size, bias=True)

        self.init_weights()
        self.tie_weights()

    def tie_weights(self):
        """ Make sure we are sharing the input and output embeddings.
            Export to TorchScript can't handle parameter sharing so we are cloning them instead.
        """
        self._tie_or_clone_weights(self.lm_head, self.transformer.w)

    def forward(self, input_ids, past=None, attention_mask=None, token_type_ids=None, position_ids=None, head_mask=None,
                labels=None):
        transformer_outputs = self.transformer(input_ids,
                                               past=past,
                                               attention_mask=attention_mask,
                                               token_type_ids=token_type_ids,
                                               position_ids=position_ids,
                                               head_mask=head_mask)

        hidden_states = transformer_outputs[0]

        lm_logits = self.lm_head(hidden_states)

        outputs = (lm_logits,) + transformer_outputs[1:]

        if labels is not None:
            # Shift so that tokens < n predict n
            shift_logits = lm_logits[..., :-1, :].contiguous()
            shift_labels = labels[..., 1:].contiguous()
            # Flatten the tokens
            loss_fct = CrossEntropyLoss(ignore_index=-1)
            loss = loss_fct(shift_logits.view(-1, shift_logits.size(-1)),
                            shift_labels.view(-1))
            outputs = (loss,) + outputs

        return outputs  # (loss), lm_logits, presents, (all hidden_states), (attentions)
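The label shift in `CTRLLMHeadModel.forward` can be illustrated without PyTorch. The following is a minimal sketch, assuming plain Python lists in place of tensors; `shifted_lm_loss` is a hypothetical helper written for this note, not part of the library. It scores the logits at position t against the label at position t + 1 and skips targets equal to `ignore_index`, which is the behaviour the diff obtains from `CrossEntropyLoss(ignore_index=-1)`.

```python
import math

def shifted_lm_loss(logits, labels, ignore_index=-1):
    """Mean cross-entropy where the logits at position t are scored against labels[t + 1].

    `logits` is a list of per-position score lists; `labels` is a list of token ids.
    Hypothetical helper for illustration only.
    """
    shift_logits = logits[:-1]   # the last position has no next token to predict
    shift_labels = labels[1:]    # the first token is never a prediction target
    total, count = 0.0, 0
    for scores, target in zip(shift_logits, shift_labels):
        if target == ignore_index:
            continue             # masked label: contributes nothing to the loss
        log_norm = math.log(sum(math.exp(s) for s in scores))
        total += log_norm - scores[target]
        count += 1
    return total / count

# Toy example: 3 positions, vocabulary of 2 tokens, uniform scores everywhere.
logits = [[0.0, 0.0], [0.0, 0.0], [0.0, 0.0]]
labels = [1, 0, 1]
loss = shifted_lm_loss(logits, labels)
```

With uniform scores over two tokens, every scored position contributes log(2), so the mean is log(2) as well. The sketch divides by the number of unmasked targets and would fail if every target were masked.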
@@ -159,8 +159,6 @@ class MultiHeadSelfAttention(nn.Module):

         dim_per_head = self.dim // self.n_heads

-        assert 2 <= mask.dim() <= 3
-        causal = (mask.dim() == 3)
         mask_reshp = (bs, 1, 1, k_length)

         def shape(x):
@@ -649,7 +647,7 @@ class DistilBertForQuestionAnswering(DistilBertPreTrainedModel):
         start_positions = torch.tensor([1])
         end_positions = torch.tensor([3])
         outputs = model(input_ids, start_positions=start_positions, end_positions=end_positions)
-        loss, start_scores, end_scores = outputs[:2]
+        loss, start_scores, end_scores = outputs[:3]

     """
     def __init__(self, config):

@@ -38,7 +38,8 @@ logger = logging.getLogger(__name__)

 GPT2_PRETRAINED_MODEL_ARCHIVE_MAP = {"gpt2": "https://s3.amazonaws.com/models.huggingface.co/bert/gpt2-pytorch_model.bin",
                                      "gpt2-medium": "https://s3.amazonaws.com/models.huggingface.co/bert/gpt2-medium-pytorch_model.bin",
-                                     "gpt2-large": "https://s3.amazonaws.com/models.huggingface.co/bert/gpt2-large-pytorch_model.bin"}
+                                     "gpt2-large": "https://s3.amazonaws.com/models.huggingface.co/bert/gpt2-large-pytorch_model.bin",
+                                     "distilgpt2": "https://s3.amazonaws.com/models.huggingface.co/bert/distilgpt2-pytorch_model.bin",}

 def load_tf_weights_in_gpt2(model, config, gpt2_checkpoint_path):
     """ Load tf checkpoints in a pytorch model

@@ -170,7 +170,7 @@ class Attention(nn.Module):
         # w = w * self.bias + -1e9 * (1 - self.bias)  # TF implem method: mask_attn_weights
         # XD: self.b may be larger than w, so we need to crop it
         b = self.bias[:, :, : w.size(-2), : w.size(-1)]
-        w = w * b + -1e9 * (1 - b)
+        w = w * b + - 1e4 * (1 - b)

         if attention_mask is not None:
             # Apply the attention mask
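The hunk above does not say why the fill constant drops from `-1e9` to `-1e4`. One plausible motivation is half-precision safety, and the sketch below illustrates it under that assumption, using only the standard library. `fits_in_fp16` and `masked_score` are hypothetical helpers introduced here; the second mirrors the scalar form of `w * b + fill * (1 - b)`.

```python
import math
import struct

def fits_in_fp16(value):
    """Return True if `value` can be packed as an IEEE 754 half-precision float."""
    try:
        struct.pack("<e", value)
        return True
    except OverflowError:
        return False

def masked_score(w, b, fill):
    """Scalar version of the diff's expression `w * b + fill * (1 - b)`."""
    return w * b + fill * (1 - b)

# -1e4 is inside the half-precision range (largest finite magnitude: 65504); -1e9 is not.
small_fill_ok = fits_in_fp16(-1e4)
large_fill_ok = fits_in_fp16(-1e9)

# Assumption: if the constant saturates to -inf when cast to half precision,
# a *kept* position (b == 1) turns into NaN, because -inf * 0 is undefined.
kept_with_inf = masked_score(0.5, 1.0, float("-inf"))
kept_with_1e4 = masked_score(0.5, 1.0, -1e4)
```

Whether a given framework saturates, raises, or clamps on such a cast is not covered by the diff, so the NaN case should be read as an illustration of the risk rather than a description of observed behaviour.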
@@ -43,6 +43,9 @@ class RobertaEmbeddings(BertEmbeddings):
     def __init__(self, config):
         super(RobertaEmbeddings, self).__init__(config)
         self.padding_idx = 1
+        self.word_embeddings = nn.Embedding(config.vocab_size, config.hidden_size, padding_idx=self.padding_idx)
+        self.position_embeddings = nn.Embedding(config.max_position_embeddings, config.hidden_size,
+                                                padding_idx=self.padding_idx)

     def forward(self, input_ids, token_type_ids=None, position_ids=None):
         seq_length = input_ids.size(1)
@@ -169,7 +172,8 @@ class RobertaModel(BertModel):
         if input_ids[:, 0].sum().item() != 0:
             logger.warning("A sequence with no special tokens has been passed to the RoBERTa model. "
                            "This model requires special tokens in order to work. "
-                           "Please specify add_special_tokens=True in your encoding.")
+                           "Please specify add_special_tokens=True in your tokenize.encode()"
+                           "or tokenizer.convert_tokens_to_ids().")
         return super(RobertaModel, self).forward(input_ids,
                                                  attention_mask=attention_mask,
                                                  token_type_ids=token_type_ids,

@@ -62,7 +62,7 @@ def load_bert_pt_weights_in_tf2(tf_model, pytorch_checkpoint_path):

 def gelu(x):
     """ Gaussian Error Linear Unit.
-    Original Implementation of the gelu activation function in Google Bert repo when initialy created.
+    Original Implementation of the gelu activation function in Google Bert repo when initially created.
         For information: OpenAI GPT's gelu is slightly different (and gives slightly different results):
         0.5 * x * (1 + torch.tanh(math.sqrt(2 / math.pi) * (x + 0.044715 * torch.pow(x, 3))))
         Also see https://arxiv.org/abs/1606.08415
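The docstring above says the OpenAI GPT variant "gives slightly different results". The size of that difference can be checked with a short standard-library sketch. It assumes the function body, which this hunk does not show, is the erf-based form; `gelu_erf` and `gelu_tanh` are names chosen here for illustration.

```python
import math

def gelu_erf(x):
    """Exact GELU: x * Phi(x), with Phi the standard normal CDF (assumed form of the body)."""
    return 0.5 * x * (1.0 + math.erf(x / math.sqrt(2.0)))

def gelu_tanh(x):
    """Tanh approximation quoted in the docstring (the OpenAI GPT variant)."""
    return 0.5 * x * (1.0 + math.tanh(math.sqrt(2.0 / math.pi) * (x + 0.044715 * x ** 3)))

# Compare both forms on a grid over [-5, 5].
points = [i / 10.0 for i in range(-50, 51)]
max_gap = max(abs(gelu_erf(x) - gelu_tanh(x)) for x in points)
```

On this grid the two forms agree closely but not exactly, which is consistent with the docstring's wording.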
491
transformers/modeling_tf_ctrl.py
Normal file
@@ -0,0 +1,491 @@
# coding=utf-8
# Copyright 2018 Salesforce and HuggingFace Inc. team.
# Copyright (c) 2018, NVIDIA CORPORATION.  All rights reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
#     http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
""" TF 2.0 CTRL model."""

from __future__ import absolute_import, division, print_function, unicode_literals

import logging
import os
import sys
from io import open
import numpy as np
import tensorflow as tf

from .configuration_ctrl import CTRLConfig
from .modeling_tf_utils import TFPreTrainedModel, get_initializer, shape_list, TFSharedEmbeddings
from .file_utils import add_start_docstrings
from .modeling_tf_pytorch_utils import load_pytorch_checkpoint_in_tf2_model

logger = logging.getLogger(__name__)

TF_CTRL_PRETRAINED_MODEL_ARCHIVE_MAP = {"ctrl": "https://s3.amazonaws.com/models.huggingface.co/bert/ctrl-tf_model.h5"}

def load_ctrl_pt_weights_in_tf2(tf_model, pytorch_checkpoint_path):
    # build the network
    inputs_list = [[7, 6, 0, 0, 1], [1, 2, 3, 0, 0], [0, 0, 0, 4, 5]]
    tf_inputs = tf.constant(inputs_list)
    tfo = tf_model(tf_inputs, training=False)
    return load_pytorch_checkpoint_in_tf2_model(tf_model, pytorch_checkpoint_path, tf_inputs=tf_inputs)


def angle_defn(pos, i, d_model_size):
    angle_rates = 1 / np.power(10000, (2 * (i//2)) / np.float32(d_model_size))
    return pos * angle_rates

def positional_encoding(position, d_model_size):
    # create the sinusoidal pattern for the positional encoding
    angle_rads = angle_defn(np.arange(position)[:, np.newaxis],
                            np.arange(d_model_size)[np.newaxis, :],
                            d_model_size)

    sines = np.sin(angle_rads[:, 0::2])
    cosines = np.cos(angle_rads[:, 1::2])

    # pos_encoding = tf.cast(np.concatenate([sines, cosines], axis=-1)[np.newaxis, ...], dtype=tf.float32)
    pos_encoding = tf.cast(np.concatenate([sines, cosines], axis=-1), dtype=tf.float32)
    return pos_encoding

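The layout produced by `positional_encoding` is worth checking on a small case: the sines and cosines are concatenated as two halves rather than interleaved column by column. The sketch below is a NumPy-only mirror of the function, with `tf.cast` swapped for `astype`; the name `sinusoidal_table` and the sizes used are assumptions for illustration.

```python
import numpy as np

def sinusoidal_table(n_positions, d_model):
    """NumPy-only mirror of `positional_encoding`: sines in the first half, cosines in the second."""
    pos = np.arange(n_positions)[:, np.newaxis]   # column of positions
    i = np.arange(d_model)[np.newaxis, :]         # row of feature indices
    angle_rates = 1.0 / np.power(10000.0, (2 * (i // 2)) / np.float32(d_model))
    angle_rads = pos * angle_rates
    sines = np.sin(angle_rads[:, 0::2])           # even feature indices
    cosines = np.cos(angle_rads[:, 1::2])         # odd feature indices
    return np.concatenate([sines, cosines], axis=-1).astype(np.float32)

table = sinusoidal_table(n_positions=8, d_model=6)
```

At position 0 every angle is zero, so the first half of the row is all zeros and the second half all ones, which makes the half-and-half layout easy to see.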
def scaled_dot_product_attention(q, k, v, mask, attention_mask=None, head_mask=None):
    # calculate attention
    matmul_qk = tf.matmul(q, k, transpose_b=True)

    dk = tf.cast(shape_list(k)[-1], tf.float32)
    scaled_attention_logits = matmul_qk / tf.math.sqrt(dk)

    if mask is not None:
        scaled_attention_logits += (mask * -1e4)

    if attention_mask is not None:
        # Apply the attention mask
        scaled_attention_logits = scaled_attention_logits + attention_mask

    attention_weights = tf.nn.softmax(scaled_attention_logits, axis=-1)

    # Mask heads if we want to
    if head_mask is not None:
        attention_weights = attention_weights * head_mask

    output = tf.matmul(attention_weights, v)

    return output, attention_weights

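The effect of adding `mask * -1e4` before the softmax can be checked numerically. The following is a minimal NumPy sketch of the same computation, assuming a causal mask where 1 marks future positions; `softmax` and `attention` are helpers written for this note, and the padding and head masks are left out.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)  # subtract the row max for numerical stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def attention(q, k, v, mask=None):
    """softmax(q k^T / sqrt(d_k) + mask * -1e4) v, without padding or head masks."""
    d_k = k.shape[-1]
    logits = q @ np.swapaxes(k, -1, -2) / np.sqrt(d_k)
    if mask is not None:
        logits = logits + mask * -1e4        # large negative score on masked positions
    weights = softmax(logits, axis=-1)
    return weights @ v, weights

rng = np.random.default_rng(0)
seq_len, depth = 4, 8
q = rng.standard_normal((1, 2, seq_len, depth))  # (batch, heads, seq, depth)
k = rng.standard_normal((1, 2, seq_len, depth))
v = rng.standard_normal((1, 2, seq_len, depth))
causal_mask = np.triu(np.ones((seq_len, seq_len)), 1)  # 1 marks future positions

out, weights = attention(q, k, v, causal_mask)
```

Each row of `weights` still sums to one, while the entries above the diagonal are driven to zero, so no position attends to a later one.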
class TFMultiHeadAttention(tf.keras.layers.Layer):
    def __init__(self, d_model_size, num_heads, output_attentions=False, **kwargs):
        super(TFMultiHeadAttention, self).__init__(**kwargs)
        self.output_attentions = output_attentions
        self.num_heads = num_heads
        self.d_model_size = d_model_size

        self.depth = int(d_model_size / self.num_heads)

        self.Wq = tf.keras.layers.Dense(d_model_size, name='Wq')
        self.Wk = tf.keras.layers.Dense(d_model_size, name='Wk')
        self.Wv = tf.keras.layers.Dense(d_model_size, name='Wv')

        self.dense = tf.keras.layers.Dense(d_model_size, name='dense')

    def split_into_heads(self, x, batch_size):
        x = tf.reshape(x, (batch_size, -1, self.num_heads, self.depth))
        return tf.transpose(x, perm=[0, 2, 1, 3])

    def call(self, inputs, training=False):
        v, k, q, mask, layer_past, attention_mask, head_mask = inputs
        batch_size = q.shape[0]

        q = self.Wq(q)
        k = self.Wk(k)
        v = self.Wv(v)

        q = self.split_into_heads(q, batch_size)
        k = self.split_into_heads(k, batch_size)
        v = self.split_into_heads(v, batch_size)
        if layer_past is not None:
            past_key, past_value = tf.unstack(layer_past, axis=1)
            # tf.concat takes `axis`, not the PyTorch-style `dim`
            k = tf.concat((past_key, k), axis=-2)
            v = tf.concat((past_value, v), axis=-2)
        present = tf.stack((k, v), axis=1)

        output = scaled_dot_product_attention(q, k, v, mask, attention_mask, head_mask)
        scaled_attention = tf.transpose(output[0], perm=[0, 2, 1, 3])
        attn = output[1]
        original_size_attention = tf.reshape(scaled_attention, (batch_size, -1, self.d_model_size))
        output = self.dense(original_size_attention)

        outputs = (output, present)
        if self.output_attentions:
            outputs = outputs + (attn,)
        return outputs

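The reshaping in `split_into_heads` and the way cached keys are extended along the sequence axis can be followed on small arrays. This is a NumPy sketch under assumed sizes (batch of 2, 4 heads, model width 16); the free function below mirrors the method but is not the layer itself.

```python
import numpy as np

def split_into_heads(x, num_heads):
    """(batch, seq, d_model) -> (batch, num_heads, seq, depth), mirroring the method above."""
    batch, seq, d_model = x.shape
    depth = d_model // num_heads
    return x.reshape(batch, seq, num_heads, depth).transpose(0, 2, 1, 3)

batch, num_heads, d_model = 2, 4, 16
past_k = split_into_heads(np.zeros((batch, 3, d_model)), num_heads)  # 3 cached positions
new_k = split_into_heads(np.ones((batch, 1, d_model)), num_heads)    # 1 new position

# The sequence axis is second to last, so cached and new keys are joined on axis=-2.
k = np.concatenate([past_k, new_k], axis=-2)
```

After the join, the key tensor covers four positions per head, with the newest position last. This is why the past length is read from the second-to-last dimension of the cached tensors.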
def point_wise_feed_forward_network(d_model_size, dff, name=""):
    return tf.keras.Sequential([
        tf.keras.layers.Dense(dff, activation='relu', name="0"),
        tf.keras.layers.Dense(d_model_size, name="2")
    ], name="ffn")


class TFEncoderLayer(tf.keras.layers.Layer):
    def __init__(self, d_model_size, num_heads, dff, rate=0.1, layer_norm_epsilon=1e-6, output_attentions=False, **kwargs):
        super(TFEncoderLayer, self).__init__(**kwargs)

        self.multi_head_attention = TFMultiHeadAttention(d_model_size,
                                                         num_heads,
                                                         output_attentions,
                                                         name="multi_head_attention")
        self.ffn = point_wise_feed_forward_network(d_model_size, dff, name="ffn")

        self.layernorm1 = tf.keras.layers.LayerNormalization(epsilon=layer_norm_epsilon, name="layernorm1")
        self.layernorm2 = tf.keras.layers.LayerNormalization(epsilon=layer_norm_epsilon, name="layernorm2")

        self.dropout1 = tf.keras.layers.Dropout(rate)
        self.dropout2 = tf.keras.layers.Dropout(rate)

    def call(self, inputs, training=False):
        x, mask, layer_past, attention_mask, head_mask = inputs
        normed = self.layernorm1(x)
        attn_outputs = self.multi_head_attention([normed, normed, normed, mask, layer_past,
                                                  attention_mask, head_mask], training=training)
        attn_output = attn_outputs[0]
        attn_output = self.dropout1(attn_output, training=training)
        out1 = x + attn_output

        out2 = self.layernorm2(out1)
        ffn_output = self.ffn(out2)
        ffn_output = self.dropout2(ffn_output, training=training)
        out2 = out1 + ffn_output

        outputs = (out2,) + attn_outputs[1:]
        return outputs


class TFCTRLMainLayer(tf.keras.layers.Layer):
    def __init__(self, config, **kwargs):
        super(TFCTRLMainLayer, self).__init__(**kwargs)
        self.output_hidden_states = config.output_hidden_states
        self.d_model_size = config.n_embd
        self.num_layers = config.n_layer

        self.pos_encoding = positional_encoding(config.n_positions, self.d_model_size)

        self.output_attentions = config.output_attentions

        self.w = TFSharedEmbeddings(config.vocab_size,
                                    config.n_embd,
                                    initializer_range=config.initializer_range,
                                    name="w")

        self.dropout = tf.keras.layers.Dropout(config.embd_pdrop)
        self.h = [TFEncoderLayer(config.n_embd,
                                 config.n_head,
                                 config.dff,
                                 config.resid_pdrop,
                                 config.layer_norm_epsilon,
                                 config.output_attentions,
                                 name='h_._{}'.format(i)) for i in range(config.n_layer)]
        self.layernorm = tf.keras.layers.LayerNormalization(epsilon=config.layer_norm_epsilon, name="layernorm")

    def _resize_token_embeddings(self, new_num_tokens):
        raise NotImplementedError

    def _prune_heads(self, heads_to_prune):
        """ Prunes heads of the model.
                heads_to_prune: dict of {layer_num: list of heads to prune in this layer}
        """
        raise NotImplementedError

    def call(self, inputs, past=None, attention_mask=None, token_type_ids=None, position_ids=None, head_mask=None, training=False):
        if isinstance(inputs, (tuple, list)):
            input_ids = inputs[0]
            past = inputs[1] if len(inputs) > 1 else past
            attention_mask = inputs[2] if len(inputs) > 2 else attention_mask
            token_type_ids = inputs[3] if len(inputs) > 3 else token_type_ids
            position_ids = inputs[4] if len(inputs) > 4 else position_ids
            head_mask = inputs[5] if len(inputs) > 5 else head_mask
            assert len(inputs) <= 6, "Too many inputs."
        elif isinstance(inputs, dict):
            input_ids = inputs.get('input_ids')
            past = inputs.get('past', past)
            attention_mask = inputs.get('attention_mask', attention_mask)
            token_type_ids = inputs.get('token_type_ids', token_type_ids)
            position_ids = inputs.get('position_ids', position_ids)
            head_mask = inputs.get('head_mask', head_mask)
            assert len(inputs) <= 6, "Too many inputs."
        else:
            input_ids = inputs

        input_shape = shape_list(input_ids)
        input_ids = tf.reshape(input_ids, [-1, input_shape[-1]])

        if past is None:
            past_length = 0
            past = [None] * len(self.h)
        else:
            past_length = shape_list(past[0][0])[-2]
        if position_ids is None:
            position_ids = tf.range(past_length, shape_list(input_ids)[-1] + past_length, dtype=tf.int32)[tf.newaxis, :]
            position_ids = tf.tile(position_ids, [shape_list(input_ids)[0], 1])

        # Attention mask.
        if attention_mask is not None:
            # We create a 4D attention mask from a 2D tensor mask.
            # Sizes are [batch_size, 1, 1, to_seq_length]
            # So we can broadcast to [batch_size, num_heads, from_seq_length, to_seq_length]
            # this attention mask is simpler than the triangular masking of causal attention
            # used in OpenAI GPT, we just need to prepare the broadcast dimension here.
            attention_mask = attention_mask[:, tf.newaxis, tf.newaxis, :]

            # Since attention_mask is 1.0 for positions we want to attend and 0.0 for
            # masked positions, this operation will create a tensor which is 0.0 for
            # positions we want to attend and -10000.0 for masked positions.
            # Since we are adding it to the raw scores before the softmax, this is
            # effectively the same as removing these entirely.

            attention_mask = tf.cast(attention_mask, tf.float32)
            attention_mask = (1.0 - attention_mask) * -10000.0
        else:
            attention_mask = None

        # Prepare head mask if needed
        # 1.0 in head_mask indicates we keep the head
        # attention_probs has shape bsz x n_heads x N x N
        # head_mask has shape n_layer x batch x n_heads x N x N
        if head_mask is not None:
            raise NotImplementedError
        else:
            head_mask = [None] * self.num_layers

        if token_type_ids is not None:
            token_type_ids = tf.reshape(token_type_ids, [-1, shape_list(token_type_ids)[-1]])
            token_type_embeds = self.w(token_type_ids, mode='embedding')
            token_type_embeds *= tf.math.sqrt(tf.cast(self.d_model_size, tf.float32))
        else:
            token_type_embeds = 0
        position_ids = tf.reshape(position_ids, [-1, shape_list(position_ids)[-1]])

        inputs_embeds = self.w(input_ids, mode='embedding')
        # x = embedded.unsqueeze(0) if len(input_ids.shape)<2 else embedded
        seq_len = input_shape[-1]
        mask = 1 - tf.linalg.band_part(tf.ones((seq_len, seq_len)), -1, 0)

        inputs_embeds *= tf.math.sqrt(tf.cast(self.d_model_size, tf.float32))

        pos_embeds = tf.gather(self.pos_encoding, position_ids)

        hidden_states = inputs_embeds + pos_embeds + token_type_embeds

        hidden_states = self.dropout(hidden_states, training=training)

        output_shape = input_shape + [shape_list(hidden_states)[-1]]
        presents = ()
        all_hidden_states = ()
        all_attentions = []
        for i, (h, layer_past) in enumerate(zip(self.h, past)):
            if self.output_hidden_states:
                all_hidden_states = all_hidden_states + (tf.reshape(hidden_states, output_shape),)
            outputs = h([hidden_states, mask, layer_past, attention_mask, head_mask[i]], training=training)
            hidden_states, present = outputs[:2]
            presents = presents + (present,)

            if self.output_attentions:
                all_attentions.append(outputs[2])

        hidden_states = self.layernorm(hidden_states)
        hidden_states = tf.reshape(hidden_states, output_shape)
        if self.output_hidden_states:
            all_hidden_states = all_hidden_states + (hidden_states,)

        outputs = (hidden_states, presents)
        if self.output_hidden_states:
            outputs = outputs + (all_hidden_states,)
        if self.output_attentions:
            # let the number of heads free (-1) so we can extract attention even after head pruning
            attention_output_shape = input_shape[:-1] + [-1] + shape_list(all_attentions[0])[-2:]
            all_attentions = tuple(tf.reshape(t, attention_output_shape) for t in all_attentions)
            outputs = outputs + (all_attentions,)
        return outputs

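The TF layer builds its causal mask as `1 - tf.linalg.band_part(ones, -1, 0)`, while the PyTorch model uses `torch.triu(ones, 1)`. The sketch below checks that the two constructions agree, using NumPy stand-ins (`np.tril` for the lower-triangular band, `np.triu` for the strict upper triangle); the sequence length is an arbitrary choice.

```python
import numpy as np

seq_len = 5
ones = np.ones((seq_len, seq_len))

# PyTorch side: triu(ones, 1) keeps the strictly upper triangle.
mask_triu = np.triu(ones, 1)

# TF side: band_part(ones, -1, 0) keeps the lower triangle (diagonal included),
# and subtracting it from 1 flips it into the strictly upper triangle.
mask_band = 1 - np.tril(ones)
```

In both cases a 1 marks a future position that must not be attended to, and the diagonal stays 0 so that each token can attend to itself.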
class TFCTRLPreTrainedModel(TFPreTrainedModel):
    """ An abstract class to handle weights initialization and
        a simple interface for downloading and loading pretrained models.
    """
    config_class = CTRLConfig
    pretrained_model_archive_map = TF_CTRL_PRETRAINED_MODEL_ARCHIVE_MAP
    base_model_prefix = "transformer"
    load_pt_weights = load_ctrl_pt_weights_in_tf2


CTRL_START_DOCSTRING = r"""    CTRL model was proposed in
    `CTRL: A Conditional Transformer Language Model for Controllable Generation`_
    by Nitish Shirish Keskar*, Bryan McCann*, Lav R. Varshney, Caiming Xiong and Richard Socher.
    It's a causal (unidirectional) transformer pre-trained using language modeling on a very large
    corpus of ~140 GB of text data with the first token reserved as a control code (such as Links, Books, Wikipedia etc.).

    This model is a `tf.keras.Model`_ sub-class. Use it as a regular TF 2.0 Keras Model and
    refer to the TF 2.0 documentation for all matters related to general usage and behavior.

    .. _`CTRL: A Conditional Transformer Language Model for Controllable Generation`:
        https://www.github.com/salesforce/ctrl

    .. _`tf.keras.Model`:
        https://www.tensorflow.org/versions/r2.0/api_docs/python/tf/keras/Model

    Parameters:
        config (:class:`~transformers.CTRLConfig`): Model configuration class with all the parameters of the model.
            Initializing with a config file does not load the weights associated with the model, only the configuration.
            Check out the :meth:`~transformers.PreTrainedModel.from_pretrained` method to load the model weights.
"""

CTRL_INPUTS_DOCSTRING = r"""    Inputs:
        **input_ids**: ``Numpy array`` or ``tf.Tensor`` of shape ``(batch_size, sequence_length)``:
            Indices of input sequence tokens in the vocabulary.
            CTRL is a model with absolute position embeddings so it's usually advised to pad the inputs on
            the right rather than the left.
            Indices can be obtained using :class:`transformers.CTRLTokenizer`.
            See :func:`transformers.PreTrainedTokenizer.encode` and
            :func:`transformers.PreTrainedTokenizer.convert_tokens_to_ids` for details.
        **past**:
            list of ``Numpy array`` or ``tf.Tensor`` (one for each layer):
            that contains pre-computed hidden-states (key and values in the attention blocks) as computed by the model
            (see `past` output below). Can be used to speed up sequential decoding.
        **attention_mask**: (`optional`) ``Numpy array`` or ``tf.Tensor`` of shape ``(batch_size, sequence_length)``:
            Mask to avoid performing attention on padding token indices.
            Mask values selected in ``[0, 1]``:
            ``1`` for tokens that are NOT MASKED, ``0`` for MASKED tokens.
        **token_type_ids**: (`optional`) ``Numpy array`` or ``tf.Tensor`` of shape ``(batch_size, sequence_length)``:
            A parallel sequence of tokens (can be used to indicate various portions of the inputs).
            The embeddings from these tokens will be summed with the respective token embeddings.
            Indices are selected in the vocabulary (unlike BERT which has a specific vocabulary for segment indices).
        **position_ids**: (`optional`) ``Numpy array`` or ``tf.Tensor`` of shape ``(batch_size, sequence_length)``:
            Indices of the position of each input sequence token in the position embeddings.
            Selected in the range ``[0, config.max_position_embeddings - 1]``.
        **head_mask**: (`optional`) ``Numpy array`` or ``tf.Tensor`` of shape ``(num_heads,)`` or ``(num_layers, num_heads)``:
            Mask to nullify selected heads of the self-attention modules.
            Mask values selected in ``[0, 1]``:
            ``1`` indicates the head is **not masked**, ``0`` indicates the head is **masked**.
"""

@add_start_docstrings("The bare CTRL Model transformer outputting raw hidden-states without any specific head on top.",
                      CTRL_START_DOCSTRING, CTRL_INPUTS_DOCSTRING)
class TFCTRLModel(TFCTRLPreTrainedModel):
    r"""
    Outputs: `Tuple` comprising various elements depending on the configuration (config) and inputs:
        **last_hidden_state**: ``tf.Tensor`` of shape ``(batch_size, sequence_length, hidden_size)``
            Sequence of hidden-states at the last layer of the model.
        **past**:
            list of ``tf.Tensor`` (one for each layer) of shape ``(batch_size, num_heads, sequence_length, sequence_length)``:
            that contains pre-computed hidden-states (key and values in the attention blocks).
            Can be used (see `past` input) to speed up sequential decoding.
        **hidden_states**: (`optional`, returned when ``config.output_hidden_states=True``)
            list of ``tf.Tensor`` (one for the output of each layer + the output of the embeddings)
            of shape ``(batch_size, sequence_length, hidden_size)``:
            Hidden-states of the model at the output of each layer plus the initial embedding outputs.
        **attentions**: (`optional`, returned when ``config.output_attentions=True``)
            list of ``tf.Tensor`` (one for each layer) of shape ``(batch_size, num_heads, sequence_length, sequence_length)``:
            Attention weights after the attention softmax, used to compute the weighted average in the self-attention heads.

    Examples::

        import tensorflow as tf
        from transformers import CTRLTokenizer, TFCTRLModel

        tokenizer = CTRLTokenizer.from_pretrained('ctrl')
        model = TFCTRLModel.from_pretrained('ctrl')
        input_ids = tf.constant(tokenizer.encode("Hello, my dog is cute"))[None, :]  # Batch size 1
        outputs = model(input_ids)
        last_hidden_states = outputs[0]  # The last hidden-state is the first element of the output tuple

    """
    def __init__(self, config, *inputs, **kwargs):
        super(TFCTRLModel, self).__init__(config, *inputs, **kwargs)
        self.transformer = TFCTRLMainLayer(config, name='transformer')

    def call(self, inputs, **kwargs):
        outputs = self.transformer(inputs, **kwargs)
        return outputs


class TFCTRLLMHead(tf.keras.layers.Layer):
    def __init__(self, config, input_embeddings, **kwargs):
        super(TFCTRLLMHead, self).__init__(**kwargs)
        self.vocab_size = config.vocab_size

        # The output weights are the same as the input embeddings, but there is
        # an output-only bias for each token.
        self.input_embeddings = input_embeddings

    def build(self, input_shape):
        self.bias = self.add_weight(shape=(self.vocab_size,),
                                    initializer='zeros',
                                    trainable=True,
                                    name='bias')
        super(TFCTRLLMHead, self).build(input_shape)

    def call(self, hidden_states):
        hidden_states = self.input_embeddings(hidden_states, mode="linear")
        hidden_states = hidden_states + self.bias
        return hidden_states


@add_start_docstrings("""The CTRL Model transformer with a language modeling head on top
|
||||
(linear layer with weights tied to the input embeddings). """, CTRL_START_DOCSTRING, CTRL_INPUTS_DOCSTRING)
|
||||
class TFCTRLLMHeadModel(TFCTRLPreTrainedModel):
|
||||
r"""
|
||||
Outputs: `Tuple` comprising various elements depending on the configuration (config) and inputs:
|
||||
**prediction_scores**: ``torch.FloatTensor`` of shape ``(batch_size, sequence_length, config.vocab_size)``
|
||||
Prediction scores of the language modeling head (scores for each vocabulary token before SoftMax).
|
||||
**past**:
|
||||
list of ``tf.Tensor`` (one for each layer) of shape ``(2, batch_size, num_heads, sequence_length, embed_size_per_head)``:
|
||||
that contains pre-computed hidden-states (key and values in the attention blocks).
|
||||
Can be used (see `past` input) to speed up sequential decoding.
|
||||
**hidden_states**: (`optional`, returned when ``config.output_hidden_states=True``)
|
||||
list of ``tf.Tensor`` (one for the output of each layer + the output of the embeddings)
|
||||
of shape ``(batch_size, sequence_length, hidden_size)``:
|
||||
Hidden-states of the model at the output of each layer plus the initial embedding outputs.
|
||||
**attentions**: (`optional`, returned when ``config.output_attentions=True``)
|
||||
list of ``tf.Tensor`` (one for each layer) of shape ``(batch_size, num_heads, sequence_length, sequence_length)``:
|
||||
Attention weights after the attention softmax, used to compute the weighted average in the self-attention heads.
|
||||
|
||||
Examples::
|
||||
|
||||
import tensorflow as tf
|
||||
from transformers import CTRLTokenizer, TFCTRLLMHeadModel
|
||||
|
||||
tokenizer = CTRLTokenizer.from_pretrained('ctrl')
|
||||
model = TFCTRLLMHeadModel.from_pretrained('ctrl')
|
||||
|
||||
input_ids = tf.constant(tokenizer.encode("Links Hello, my dog is cute"))[None, :] # Batch size 1
|
||||
outputs = model(input_ids)
|
||||
logits = outputs[0] # The TF model returns no loss; the prediction scores are the first element of the output tuple
|
||||
|
||||
"""
|
||||
def __init__(self, config, *inputs, **kwargs):
|
||||
super(TFCTRLLMHeadModel, self).__init__(config, *inputs, **kwargs)
|
||||
self.transformer = TFCTRLMainLayer(config, name='transformer')
|
||||
|
||||
self.lm_head = TFCTRLLMHead(config, self.transformer.w, name="lm_head")
|
||||
|
||||
def call(self, inputs, **kwargs):
|
||||
transformer_outputs = self.transformer(inputs, **kwargs)
|
||||
hidden_states = transformer_outputs[0]
|
||||
|
||||
lm_logits = self.lm_head(hidden_states)
|
||||
|
||||
outputs = (lm_logits,) + transformer_outputs[1:]
|
||||
|
||||
return outputs # lm_logits, presents, (all hidden_states), (attentions)
|
||||
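The `TFCTRLLMHead` layer above reuses the input embedding matrix as its output projection (`mode="linear"`) and then adds a per-token bias. A minimal pure-Python sketch of that computation follows, assuming an embedding matrix laid out as `vocab_size` rows of `hidden_size` values. The function name `tied_lm_head` and the toy numbers are illustrative only and are not part of the library.

```python
def tied_lm_head(hidden_states, embedding_matrix, bias):
    """Project hidden vectors onto the vocabulary using the tied embedding matrix.

    hidden_states:    seq_len rows of hidden_size floats
    embedding_matrix: vocab_size rows of hidden_size floats (shared with the input embeddings)
    bias:             vocab_size floats (the output-only bias)
    """
    logits = []
    for vector in hidden_states:
        row_scores = []
        for token_embedding, token_bias in zip(embedding_matrix, bias):
            # Dot product with the token's embedding, plus that token's bias.
            score = sum(h * w for h, w in zip(vector, token_embedding)) + token_bias
            row_scores.append(score)
        logits.append(row_scores)
    return logits


# Toy example: vocabulary of 3 tokens, hidden size 2, one position.
embedding_matrix = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
bias = [0.0, 0.5, -1.0]
hidden_states = [[2.0, 3.0]]
logits = tied_lm_head(hidden_states, embedding_matrix, bias)
```

Because the projection shares its weights with the embeddings, the only parameter the head itself adds is the bias vector, which matches the single `add_weight` call in `build`.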
@@ -45,7 +45,7 @@ TF_DISTILBERT_PRETRAINED_MODEL_ARCHIVE_MAP = {
|
||||
### UTILS AND BUILDING BLOCKS OF THE ARCHITECTURE ###
|
||||
def gelu(x):
|
||||
""" Gaussian Error Linear Unit.
|
||||
Original Implementation of the gelu activation function in Google Bert repo when initialy created.
|
||||
Original Implementation of the gelu activation function in Google Bert repo when initially created.
|
||||
For information: OpenAI GPT's gelu is slightly different (and gives slightly different results):
|
||||
0.5 * x * (1 + torch.tanh(math.sqrt(2 / math.pi) * (x + 0.044715 * torch.pow(x, 3))))
|
||||
Also see https://arxiv.org/abs/1606.08415
|
||||
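The docstring above notes that OpenAI GPT's tanh-based gelu gives slightly different results from the erf-based form. The sketch below compares the two on scalars using only the standard library; the function names `gelu_erf` and `gelu_tanh` are chosen for illustration, and the erf form is assumed to be the one the docstring calls the original implementation.

```python
import math


def gelu_erf(x):
    # Exact form: x * Phi(x), with Phi the standard normal CDF.
    return 0.5 * x * (1.0 + math.erf(x / math.sqrt(2.0)))


def gelu_tanh(x):
    # Tanh approximation quoted in the docstring (OpenAI GPT variant).
    return 0.5 * x * (1.0 + math.tanh(math.sqrt(2.0 / math.pi) * (x + 0.044715 * x ** 3)))


# Largest absolute gap between the two forms on a grid over [-5, 5].
grid = [i / 100.0 for i in range(-500, 501)]
max_gap = max(abs(gelu_erf(x) - gelu_tanh(x)) for x in grid)
```

The gap is small but not zero, which is why outputs of models using the two variants do not match bit for bit.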
@@ -226,8 +226,6 @@ class TFMultiHeadSelfAttention(tf.keras.layers.Layer):
|
||||
|
||||
dim_per_head = self.dim // self.n_heads
|
||||
|
||||
assert 2 <= len(tf.shape(mask)) <= 3
|
||||
causal = (len(tf.shape(mask)) == 3)
|
||||
mask_reshape = [bs, 1, 1, k_length]
|
||||
|
||||
def shape(x):
|
||||
@@ -603,7 +601,7 @@ class TFDistilBertForMaskedLM(TFDistilBertPreTrainedModel):
|
||||
tokenizer = DistilBertTokenizer.from_pretrained('distilbert-base-uncased')
|
||||
model = TFDistilBertForMaskedLM.from_pretrained('distilbert-base-uncased')
|
||||
input_ids = tf.constant(tokenizer.encode("Hello, my dog is cute"))[None, :] # Batch size 1
|
||||
outputs = model(input_ids, masked_lm_labels=input_ids)
|
||||
outputs = model(input_ids)
|
||||
prediction_scores = outputs[0]
|
||||
|
||||
"""
|
||||
@@ -715,9 +713,7 @@ class TFDistilBertForQuestionAnswering(TFDistilBertPreTrainedModel):
|
||||
tokenizer = DistilBertTokenizer.from_pretrained('distilbert-base-uncased')
|
||||
model = TFDistilBertForQuestionAnswering.from_pretrained('distilbert-base-uncased')
|
||||
input_ids = tf.constant(tokenizer.encode("Hello, my dog is cute"))[None, :] # Batch size 1
|
||||
start_positions = tf.constant([1])
|
||||
end_positions = tf.constant([3])
|
||||
outputs = model(input_ids, start_positions=start_positions, end_positions=end_positions)
|
||||
outputs = model(input_ids)
|
||||
start_scores, end_scores = outputs[:2]
|
||||
|
||||
"""
|
||||
|
||||
@@ -38,7 +38,8 @@ logger = logging.getLogger(__name__)
|
||||
|
||||
TF_GPT2_PRETRAINED_MODEL_ARCHIVE_MAP = {"gpt2": "https://s3.amazonaws.com/models.huggingface.co/bert/gpt2-tf_model.h5",
|
||||
"gpt2-medium": "https://s3.amazonaws.com/models.huggingface.co/bert/gpt2-medium-tf_model.h5",
|
||||
"gpt2-large": "https://s3.amazonaws.com/models.huggingface.co/bert/gpt2-large-tf_model.h5"}
|
||||
"gpt2-large": "https://s3.amazonaws.com/models.huggingface.co/bert/gpt2-large-tf_model.h5",
|
||||
"distilgpt2": "https://s3.amazonaws.com/models.huggingface.co/bert/distilgpt2-tf_model.h5",}
|
||||
|
||||
|
||||
def load_gpt2_pt_weights_in_tf2(tf_model, pytorch_checkpoint_path):
|
||||
|
||||
@@ -69,7 +69,7 @@ def create_sinusoidal_embeddings(n_pos, dim, out):
|
||||
|
||||
def gelu(x):
|
||||
""" Gaussian Error Linear Unit.
|
||||
Original Implementation of the gelu activation function in Google Bert repo when initialy created.
|
||||
Original Implementation of the gelu activation function in Google Bert repo when initially created.
|
||||
For information: OpenAI GPT's gelu is slightly different (and gives slightly different results):
|
||||
0.5 * x * (1 + torch.tanh(math.sqrt(2 / math.pi) * (x + 0.044715 * torch.pow(x, 3))))
|
||||
Also see https://arxiv.org/abs/1606.08415
|
||||
|
||||
@@ -501,7 +501,10 @@ class PoolerEndLogits(nn.Module):
|
||||
x = self.dense_1(x).squeeze(-1)
|
||||
|
||||
if p_mask is not None:
|
||||
x = x * (1 - p_mask) - 1e30 * p_mask
|
||||
if next(self.parameters()).dtype == torch.float16:
|
||||
x = x * (1 - p_mask) - 65500 * p_mask
|
||||
else:
|
||||
x = x * (1 - p_mask) - 1e30 * p_mask
|
||||
|
||||
return x
|
||||
|
||||
|
||||
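The float16 branch above swaps the `1e30` fill value for `65500`. A likely reason is that half precision cannot represent values beyond 65504, so `1e30` would overflow. The sketch below checks this with the standard library's half-precision `struct` format and repeats the masking arithmetic on plain Python floats. The helpers `fits_in_float16` and `mask_logits` are hypothetical names, not library functions.

```python
import struct


def fits_in_float16(value):
    """Return True if `value` can be packed as an IEEE 754 half-precision float."""
    try:
        struct.pack("<e", value)
        return True
    except OverflowError:
        return False


def mask_logits(x, p_mask, half_precision):
    # Same arithmetic as the hunk above: keep unmasked scores, push masked ones very low.
    fill = 65500.0 if half_precision else 1e30
    return [xi * (1 - m) - fill * m for xi, m in zip(x, p_mask)]


masked = mask_logits([0.5, -1.25, 2.0], [0, 1, 0], half_precision=True)
```

With `half_precision=True` every resulting value stays within the float16 range, while the masked position is still far below any realistic logit.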
215
transformers/tests/modeling_ctrl_test.py
Normal file
@@ -0,0 +1,215 @@
|
||||
# coding=utf-8
|
||||
# Copyright 2018 Salesforce and HuggingFace Inc. team.
|
||||
# Licensed under the Apache License, Version 2.0 (the "License");
|
||||
# you may not use this file except in compliance with the License.
|
||||
# You may obtain a copy of the License at
|
||||
#
|
||||
# http://www.apache.org/licenses/LICENSE-2.0
|
||||
#
|
||||
# Unless required by applicable law or agreed to in writing, software
|
||||
# distributed under the License is distributed on an "AS IS" BASIS,
|
||||
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
|
||||
# See the License for the specific language governing permissions and
|
||||
# limitations under the License.
|
||||
from __future__ import absolute_import
|
||||
from __future__ import division
|
||||
from __future__ import print_function
|
||||
|
||||
import unittest
|
||||
import pytest
|
||||
import shutil
|
||||
import pdb
|
||||
|
||||
from transformers import is_torch_available
|
||||
|
||||
if is_torch_available():
|
||||
from transformers import (CTRLConfig, CTRLModel, CTRL_PRETRAINED_MODEL_ARCHIVE_MAP,
|
||||
CTRLLMHeadModel)
|
||||
else:
|
||||
pytestmark = pytest.mark.skip("Require Torch")
|
||||
|
||||
from .modeling_common_test import (CommonTestCases, ids_tensor)
|
||||
from .configuration_common_test import ConfigTester
|
||||
|
||||
|
||||
class CTRLModelTest(CommonTestCases.CommonModelTester):
|
||||
|
||||
all_model_classes = (CTRLModel, CTRLLMHeadModel) if is_torch_available() else ()
|
||||
test_pruning = False
|
||||
test_torchscript = False
|
||||
test_resize_embeddings = False
|
||||
test_head_masking = False
|
||||
|
||||
class CTRLModelTester(object):
|
||||
|
||||
def __init__(self,
|
||||
parent,
|
||||
batch_size=13,
|
||||
seq_length=7,
|
||||
is_training=True,
|
||||
use_token_type_ids=True,
|
||||
use_input_mask=True,
|
||||
use_labels=True,
|
||||
use_mc_token_ids=True,
|
||||
vocab_size=99,
|
||||
hidden_size=32,
|
||||
num_hidden_layers=5,
|
||||
num_attention_heads=4,
|
||||
intermediate_size=37,
|
||||
hidden_act="gelu",
|
||||
hidden_dropout_prob=0.1,
|
||||
attention_probs_dropout_prob=0.1,
|
||||
max_position_embeddings=512,
|
||||
type_vocab_size=16,
|
||||
type_sequence_label_size=2,
|
||||
initializer_range=0.02,
|
||||
num_labels=3,
|
||||
num_choices=4,
|
||||
scope=None,
|
||||
):
|
||||
self.parent = parent
|
||||
self.batch_size = batch_size
|
||||
self.seq_length = seq_length
|
||||
self.is_training = is_training
|
||||
self.use_token_type_ids = use_token_type_ids
|
||||
self.use_input_mask = use_input_mask
|
||||
self.use_labels = use_labels
|
||||
self.use_mc_token_ids = use_mc_token_ids
|
||||
self.vocab_size = vocab_size
|
||||
self.hidden_size = hidden_size
|
||||
self.num_hidden_layers = num_hidden_layers
|
||||
self.num_attention_heads = num_attention_heads
|
||||
self.intermediate_size = intermediate_size
|
||||
self.hidden_act = hidden_act
|
||||
self.hidden_dropout_prob = hidden_dropout_prob
|
||||
self.attention_probs_dropout_prob = attention_probs_dropout_prob
|
||||
self.max_position_embeddings = max_position_embeddings
|
||||
self.type_vocab_size = type_vocab_size
|
||||
self.type_sequence_label_size = type_sequence_label_size
|
||||
self.initializer_range = initializer_range
|
||||
self.num_labels = num_labels
|
||||
self.num_choices = num_choices
|
||||
self.scope = scope
|
||||
|
||||
def prepare_config_and_inputs(self):
|
||||
input_ids = ids_tensor([self.batch_size, self.seq_length], self.vocab_size)
|
||||
|
||||
input_mask = None
|
||||
if self.use_input_mask:
|
||||
input_mask = ids_tensor([self.batch_size, self.seq_length], vocab_size=2)
|
||||
|
||||
token_type_ids = None
|
||||
if self.use_token_type_ids:
|
||||
token_type_ids = ids_tensor([self.batch_size, self.seq_length], self.type_vocab_size)
|
||||
|
||||
mc_token_ids = None
|
||||
if self.use_mc_token_ids:
|
||||
mc_token_ids = ids_tensor([self.batch_size, self.num_choices], self.seq_length)
|
||||
|
||||
sequence_labels = None
|
||||
token_labels = None
|
||||
choice_labels = None
|
||||
if self.use_labels:
|
||||
sequence_labels = ids_tensor([self.batch_size], self.type_sequence_label_size)
|
||||
token_labels = ids_tensor([self.batch_size, self.seq_length], self.num_labels)
|
||||
choice_labels = ids_tensor([self.batch_size], self.num_choices)
|
||||
|
||||
config = CTRLConfig(
|
||||
vocab_size_or_config_json_file=self.vocab_size,
|
||||
n_embd=self.hidden_size,
|
||||
n_layer=self.num_hidden_layers,
|
||||
n_head=self.num_attention_heads,
|
||||
# intermediate_size=self.intermediate_size,
|
||||
# hidden_act=self.hidden_act,
|
||||
# hidden_dropout_prob=self.hidden_dropout_prob,
|
||||
# attention_probs_dropout_prob=self.attention_probs_dropout_prob,
|
||||
n_positions=self.max_position_embeddings,
|
||||
n_ctx=self.max_position_embeddings
|
||||
# type_vocab_size=self.type_vocab_size,
|
||||
# initializer_range=self.initializer_range
|
||||
)
|
||||
|
||||
head_mask = ids_tensor([self.num_hidden_layers, self.num_attention_heads], 2)
|
||||
|
||||
return config, input_ids, input_mask, head_mask, token_type_ids, mc_token_ids, sequence_labels, token_labels, choice_labels
|
||||
|
||||
def check_loss_output(self, result):
|
||||
self.parent.assertListEqual(
|
||||
list(result["loss"].size()),
|
||||
[])
|
||||
|
||||
def create_and_check_ctrl_model(self, config, input_ids, input_mask, head_mask, token_type_ids, *args):
|
||||
model = CTRLModel(config=config)
|
||||
model.eval()
|
||||
|
||||
model(input_ids, token_type_ids=token_type_ids, head_mask=head_mask)
|
||||
model(input_ids, token_type_ids=token_type_ids)
|
||||
sequence_output, presents = model(input_ids)
|
||||
|
||||
result = {
|
||||
"sequence_output": sequence_output,
|
||||
"presents": presents,
|
||||
}
|
||||
self.parent.assertListEqual(
|
||||
list(result["sequence_output"].size()),
|
||||
[self.batch_size, self.seq_length, self.hidden_size])
|
||||
self.parent.assertEqual(len(result["presents"]), config.n_layer)
|
||||
|
||||
def create_and_check_lm_head_model(self, config, input_ids, input_mask, head_mask, token_type_ids, *args):
|
||||
model = CTRLLMHeadModel(config)
|
||||
model.eval()
|
||||
|
||||
loss, lm_logits, _ = model(input_ids, token_type_ids=token_type_ids, labels=input_ids)
|
||||
|
||||
result = {
|
||||
"loss": loss,
|
||||
"lm_logits": lm_logits
|
||||
}
|
||||
self.parent.assertListEqual(
|
||||
list(result["loss"].size()),
|
||||
[])
|
||||
self.parent.assertListEqual(
|
||||
list(result["lm_logits"].size()),
|
||||
[self.batch_size, self.seq_length, self.vocab_size])
|
||||
|
||||
|
||||
def prepare_config_and_inputs_for_common(self):
|
||||
config_and_inputs = self.prepare_config_and_inputs()
|
||||
|
||||
(config, input_ids, input_mask, head_mask, token_type_ids,
|
||||
mc_token_ids, sequence_labels, token_labels, choice_labels) = config_and_inputs
|
||||
|
||||
inputs_dict = {
|
||||
'input_ids': input_ids,
|
||||
'token_type_ids': token_type_ids,
|
||||
'head_mask': head_mask
|
||||
}
|
||||
|
||||
return config, inputs_dict
|
||||
|
||||
def setUp(self):
|
||||
self.model_tester = CTRLModelTest.CTRLModelTester(self)
|
||||
self.config_tester = ConfigTester(self, config_class=CTRLConfig, n_embd=37)
|
||||
|
||||
def test_config(self):
|
||||
self.config_tester.run_common_tests()
|
||||
|
||||
def test_ctrl_model(self):
|
||||
config_and_inputs = self.model_tester.prepare_config_and_inputs()
|
||||
self.model_tester.create_and_check_ctrl_model(*config_and_inputs)
|
||||
|
||||
def test_ctrl_lm_head_model(self):
|
||||
config_and_inputs = self.model_tester.prepare_config_and_inputs()
|
||||
self.model_tester.create_and_check_lm_head_model(*config_and_inputs)
|
||||
|
||||
@pytest.mark.slow
|
||||
def test_model_from_pretrained(self):
|
||||
cache_dir = "/tmp/transformers_test/"
|
||||
for model_name in list(CTRL_PRETRAINED_MODEL_ARCHIVE_MAP.keys())[:1]:
|
||||
model = CTRLModel.from_pretrained(model_name, cache_dir=cache_dir)
|
||||
shutil.rmtree(cache_dir)
|
||||
self.assertIsNotNone(model)
|
||||
|
||||
|
||||
if __name__ == "__main__":
|
||||
unittest.main()
|
||||
@@ -71,6 +71,8 @@ class TFCommonTestCases:
|
||||
if not is_torch_available():
|
||||
return
|
||||
|
||||
import torch
|
||||
import numpy as np
|
||||
import transformers
|
||||
|
||||
config, inputs_dict = self.model_tester.prepare_config_and_inputs_for_common()
|
||||
@@ -79,12 +81,23 @@ class TFCommonTestCases:
|
||||
pt_model_class_name = model_class.__name__[2:] # Skip the "TF" at the beginning
|
||||
pt_model_class = getattr(transformers, pt_model_class_name)
|
||||
|
||||
config.output_hidden_states = True
|
||||
tf_model = model_class(config)
|
||||
pt_model = pt_model_class(config)
|
||||
|
||||
# Check we can load pt model in tf and vice-versa (architecture similar)
|
||||
tf_model = transformers.load_pytorch_model_in_tf2_model(tf_model, pt_model, tf_inputs=inputs_dict)
|
||||
pt_model = transformers.load_tf2_model_in_pytorch_model(pt_model, tf_model)
|
||||
|
||||
# Check predictions on first output (logits/hidden-states) are close enough given low-level computational differences
|
||||
pt_model.eval()
|
||||
pt_inputs_dict = dict((name, torch.from_numpy(key.numpy()).to(torch.long))
|
||||
for name, key in inputs_dict.items())
|
||||
with torch.no_grad():
|
||||
pto = pt_model(**pt_inputs_dict)
|
||||
tfo = tf_model(inputs_dict)
|
||||
max_diff = np.amax(np.abs(tfo[0].numpy() - pto[0].numpy()))
|
||||
self.assertLessEqual(max_diff, 2e-2)
|
||||
|
||||
def test_keyword_and_dict_args(self):
|
||||
config, inputs_dict = self.model_tester.prepare_config_and_inputs_for_common()
|
||||
|
||||
201
transformers/tests/modeling_tf_ctrl_test.py
Normal file
@@ -0,0 +1,201 @@
|
||||
# coding=utf-8
|
||||
# Copyright 2018 The Google AI Language Team Authors.
|
||||
#
|
||||
# Licensed under the Apache License, Version 2.0 (the "License");
|
||||
# you may not use this file except in compliance with the License.
|
||||
# You may obtain a copy of the License at
|
||||
#
|
||||
# http://www.apache.org/licenses/LICENSE-2.0
|
||||
#
|
||||
# Unless required by applicable law or agreed to in writing, software
|
||||
# distributed under the License is distributed on an "AS IS" BASIS,
|
||||
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
|
||||
# See the License for the specific language governing permissions and
|
||||
# limitations under the License.
|
||||
from __future__ import absolute_import
|
||||
from __future__ import division
|
||||
from __future__ import print_function
|
||||
|
||||
import unittest
|
||||
import shutil
|
||||
import pytest
|
||||
import sys
|
||||
|
||||
from .modeling_tf_common_test import (TFCommonTestCases, ids_tensor)
|
||||
from .configuration_common_test import ConfigTester
|
||||
|
||||
from transformers import CTRLConfig, is_tf_available
|
||||
|
||||
if is_tf_available():
|
||||
import tensorflow as tf
|
||||
from transformers.modeling_tf_ctrl import (TFCTRLModel, TFCTRLLMHeadModel,
|
||||
TF_CTRL_PRETRAINED_MODEL_ARCHIVE_MAP)
|
||||
else:
|
||||
pytestmark = pytest.mark.skip("Require TensorFlow")
|
||||
|
||||
|
||||
class TFCTRLModelTest(TFCommonTestCases.TFCommonModelTester):
|
||||
|
||||
all_model_classes = (TFCTRLModel, TFCTRLLMHeadModel) if is_tf_available() else ()
|
||||
|
||||
class TFCTRLModelTester(object):
|
||||
|
||||
def __init__(self,
|
||||
parent,
|
||||
batch_size=13,
|
||||
seq_length=7,
|
||||
is_training=True,
|
||||
use_token_type_ids=True,
|
||||
use_input_mask=True,
|
||||
use_labels=True,
|
||||
use_mc_token_ids=True,
|
||||
vocab_size=99,
|
||||
hidden_size=32,
|
||||
num_hidden_layers=5,
|
||||
num_attention_heads=4,
|
||||
intermediate_size=37,
|
||||
hidden_act="gelu",
|
||||
hidden_dropout_prob=0.1,
|
||||
attention_probs_dropout_prob=0.1,
|
||||
max_position_embeddings=512,
|
||||
type_vocab_size=16,
|
||||
type_sequence_label_size=2,
|
||||
initializer_range=0.02,
|
||||
num_labels=3,
|
||||
num_choices=4,
|
||||
scope=None,
|
||||
):
|
||||
self.parent = parent
|
||||
self.batch_size = batch_size
|
||||
self.seq_length = seq_length
|
||||
self.is_training = is_training
|
||||
self.use_token_type_ids = use_token_type_ids
|
||||
self.use_input_mask = use_input_mask
|
||||
self.use_labels = use_labels
|
||||
self.use_mc_token_ids = use_mc_token_ids
|
||||
self.vocab_size = vocab_size
|
||||
self.hidden_size = hidden_size
|
||||
self.num_hidden_layers = num_hidden_layers
|
||||
self.num_attention_heads = num_attention_heads
|
||||
self.intermediate_size = intermediate_size
|
||||
self.hidden_act = hidden_act
|
||||
self.hidden_dropout_prob = hidden_dropout_prob
|
||||
self.attention_probs_dropout_prob = attention_probs_dropout_prob
|
||||
self.max_position_embeddings = max_position_embeddings
|
||||
self.type_vocab_size = type_vocab_size
|
||||
self.type_sequence_label_size = type_sequence_label_size
|
||||
self.initializer_range = initializer_range
|
||||
self.num_labels = num_labels
|
||||
self.num_choices = num_choices
|
||||
self.scope = scope
|
||||
|
||||
def prepare_config_and_inputs(self):
|
||||
input_ids = ids_tensor([self.batch_size, self.seq_length], self.vocab_size)
|
||||
|
||||
input_mask = None
|
||||
if self.use_input_mask:
|
||||
input_mask = ids_tensor([self.batch_size, self.seq_length], vocab_size=2)
|
||||
|
||||
token_type_ids = None
|
||||
if self.use_token_type_ids:
|
||||
token_type_ids = ids_tensor([self.batch_size, self.seq_length], self.type_vocab_size)
|
||||
|
||||
mc_token_ids = None
|
||||
if self.use_mc_token_ids:
|
||||
mc_token_ids = ids_tensor([self.batch_size, self.num_choices], self.seq_length)
|
||||
|
||||
sequence_labels = None
|
||||
token_labels = None
|
||||
choice_labels = None
|
||||
if self.use_labels:
|
||||
sequence_labels = ids_tensor([self.batch_size], self.type_sequence_label_size)
|
||||
token_labels = ids_tensor([self.batch_size, self.seq_length], self.num_labels)
|
||||
choice_labels = ids_tensor([self.batch_size], self.num_choices)
|
||||
|
||||
config = CTRLConfig(
|
||||
vocab_size_or_config_json_file=self.vocab_size,
|
||||
n_embd=self.hidden_size,
|
||||
n_layer=self.num_hidden_layers,
|
||||
n_head=self.num_attention_heads,
|
||||
# intermediate_size=self.intermediate_size,
|
||||
# hidden_act=self.hidden_act,
|
||||
# hidden_dropout_prob=self.hidden_dropout_prob,
|
||||
# attention_probs_dropout_prob=self.attention_probs_dropout_prob,
|
||||
n_positions=self.max_position_embeddings,
|
||||
n_ctx=self.max_position_embeddings
|
||||
# type_vocab_size=self.type_vocab_size,
|
||||
# initializer_range=self.initializer_range
|
||||
)
|
||||
|
||||
head_mask = ids_tensor([self.num_hidden_layers, self.num_attention_heads], 2)
|
||||
|
||||
return config, input_ids, input_mask, head_mask, token_type_ids, mc_token_ids, sequence_labels, token_labels, choice_labels
|
||||
|
||||
def create_and_check_ctrl_model(self, config, input_ids, input_mask, head_mask, token_type_ids, *args):
|
||||
model = TFCTRLModel(config=config)
|
||||
inputs = {'input_ids': input_ids,
|
||||
'attention_mask': input_mask,
|
||||
'token_type_ids': token_type_ids}
|
||||
sequence_output = model(inputs)[0]
|
||||
|
||||
inputs = [input_ids, None, input_mask] # None is the input for 'past'
|
||||
sequence_output = model(inputs)[0]
|
||||
|
||||
sequence_output = model(input_ids)[0]
|
||||
|
||||
result = {
|
||||
"sequence_output": sequence_output.numpy(),
|
||||
}
|
||||
self.parent.assertListEqual(
|
||||
list(result["sequence_output"].shape),
|
||||
[self.batch_size, self.seq_length, self.hidden_size])
|
||||
|
||||
|
||||
def create_and_check_ctrl_lm_head(self, config, input_ids, input_mask, head_mask, token_type_ids, *args):
|
||||
model = TFCTRLLMHeadModel(config=config)
|
||||
inputs = {'input_ids': input_ids,
|
||||
'attention_mask': input_mask,
|
||||
'token_type_ids': token_type_ids}
|
||||
prediction_scores = model(inputs)[0]
|
||||
result = {
|
||||
"prediction_scores": prediction_scores.numpy(),
|
||||
}
|
||||
self.parent.assertListEqual(
|
||||
list(result["prediction_scores"].shape),
|
||||
[self.batch_size, self.seq_length, self.vocab_size])
|
||||
|
||||
def prepare_config_and_inputs_for_common(self):
|
||||
config_and_inputs = self.prepare_config_and_inputs()
|
||||
|
||||
(config, input_ids, input_mask, head_mask, token_type_ids,
|
||||
mc_token_ids, sequence_labels, token_labels, choice_labels) = config_and_inputs
|
||||
|
||||
inputs_dict = {'input_ids': input_ids, 'token_type_ids': token_type_ids, 'attention_mask': input_mask}
|
||||
return config, inputs_dict
|
||||
|
||||
def setUp(self):
|
||||
self.model_tester = TFCTRLModelTest.TFCTRLModelTester(self)
|
||||
self.config_tester = ConfigTester(self, config_class=CTRLConfig, n_embd=37)
|
||||
|
||||
def test_config(self):
|
||||
self.config_tester.run_common_tests()
|
||||
|
||||
def test_ctrl_model(self):
|
||||
config_and_inputs = self.model_tester.prepare_config_and_inputs()
|
||||
self.model_tester.create_and_check_ctrl_model(*config_and_inputs)
|
||||
|
||||
def test_ctrl_lm_head(self):
|
||||
config_and_inputs = self.model_tester.prepare_config_and_inputs()
|
||||
self.model_tester.create_and_check_ctrl_lm_head(*config_and_inputs)
|
||||
|
||||
@pytest.mark.slow
|
||||
def test_model_from_pretrained(self):
|
||||
cache_dir = "/tmp/transformers_test/"
|
||||
for model_name in list(TF_CTRL_PRETRAINED_MODEL_ARCHIVE_MAP.keys())[:1]:
|
||||
model = TFCTRLModel.from_pretrained(model_name, cache_dir=cache_dir)
|
||||
shutil.rmtree(cache_dir)
|
||||
self.assertIsNotNone(model)
|
||||
|
||||
if __name__ == "__main__":
|
||||
unittest.main()
|
||||
|
||||
@@ -222,7 +222,7 @@ class TFGPT2ModelTest(TFCommonTestCases.TFCommonModelTester):
|
||||
@pytest.mark.slow
|
||||
def test_model_from_pretrained(self):
|
||||
cache_dir = "/tmp/transformers_test/"
|
||||
for model_name in list(TF_gpt2_PRETRAINED_MODEL_ARCHIVE_MAP.keys())[:1]:
|
||||
for model_name in list(TF_GPT2_PRETRAINED_MODEL_ARCHIVE_MAP.keys())[:1]:
|
||||
model = TFGPT2Model.from_pretrained(model_name, cache_dir=cache_dir)
|
||||
shutil.rmtree(cache_dir)
|
||||
self.assertIsNotNone(model)
|
||||
|
||||
@@ -131,8 +131,8 @@ class BertTokenizationTest(CommonTestCases.CommonTokenizerTester):
|
||||
text = tokenizer.encode("sequence builders")
|
||||
text_2 = tokenizer.encode("multi-sequence build")
|
||||
|
||||
encoded_sentence = tokenizer.add_special_tokens_single_sequence(text)
|
||||
encoded_pair = tokenizer.add_special_tokens_sequence_pair(text, text_2)
|
||||
encoded_sentence = tokenizer.build_inputs_with_special_tokens(text)
|
||||
encoded_pair = tokenizer.build_inputs_with_special_tokens(text, text_2)
|
||||
|
||||
assert encoded_sentence == [101] + text + [102]
|
||||
assert encoded_pair == [101] + text + [102] + text_2 + [102]
|
||||
|
||||
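The hunk above replaces the two older helpers with a single `build_inputs_with_special_tokens` call that accepts an optional second sequence. A minimal sketch of the BERT-style behaviour the assertions describe is given below. It is an illustration rather than the library's implementation, and the ids 101 and 102 are taken from the assertions in the test.

```python
CLS_ID, SEP_ID = 101, 102  # ids used in the BERT test assertions above


def build_inputs_with_special_tokens(token_ids_0, token_ids_1=None):
    """Wrap one sequence, or a pair of sequences, with [CLS] and [SEP] ids."""
    if token_ids_1 is None:
        # Single sequence: [CLS] A [SEP]
        return [CLS_ID] + token_ids_0 + [SEP_ID]
    # Pair of sequences: [CLS] A [SEP] B [SEP]
    return [CLS_ID] + token_ids_0 + [SEP_ID] + token_ids_1 + [SEP_ID]


single = build_inputs_with_special_tokens([7, 8])
pair = build_inputs_with_special_tokens([7, 8], [9])
```

Folding both cases into one function with an optional argument means callers no longer have to pick between two method names depending on the number of inputs.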
69
transformers/tests/tokenization_ctrl_test.py
Normal file
@@ -0,0 +1,69 @@
|
||||
# coding=utf-8
|
||||
# Copyright 2018 Salesforce and HuggingFace Inc. team.
|
||||
# Licensed under the Apache License, Version 2.0 (the "License");
|
||||
# you may not use this file except in compliance with the License.
|
||||
# You may obtain a copy of the License at
|
||||
#
|
||||
# http://www.apache.org/licenses/LICENSE-2.0
|
||||
#
|
||||
# Unless required by applicable law or agreed to in writing, software
|
||||
# distributed under the License is distributed on an "AS IS" BASIS,
|
||||
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
|
||||
# See the License for the specific language governing permissions and
|
||||
# limitations under the License.
|
||||
from __future__ import absolute_import, division, print_function, unicode_literals
|
||||
|
||||
import os
|
||||
import unittest
|
||||
import json
|
||||
from io import open
|
||||
|
||||
from transformers.tokenization_ctrl import CTRLTokenizer, VOCAB_FILES_NAMES
|
||||
|
||||
from .tokenization_tests_commons import CommonTestCases
|
||||
|
||||
class CTRLTokenizationTest(CommonTestCases.CommonTokenizerTester):
|
||||
|
||||
tokenizer_class = CTRLTokenizer
|
||||
|
||||
def setUp(self):
|
||||
super(CTRLTokenizationTest, self).setUp()
|
||||
|
||||
# Adapted from Sennrich et al. 2015 and https://github.com/rsennrich/subword-nmt
|
||||
vocab = ['adapt', 're@@', 'a@@', 'apt', 'c@@', 't', '<unk>']
|
||||
vocab_tokens = dict(zip(vocab, range(len(vocab))))
|
||||
merges = ["#version: 0.2", 'a p', 'ap t</w>', 'r e', 'a d', 'ad apt</w>', '']
|
||||
self.special_tokens_map = {"unk_token": "<unk>"}
|
||||
|
||||
self.vocab_file = os.path.join(self.tmpdirname, VOCAB_FILES_NAMES['vocab_file'])
|
||||
self.merges_file = os.path.join(self.tmpdirname, VOCAB_FILES_NAMES['merges_file'])
|
||||
with open(self.vocab_file, "w", encoding="utf-8") as fp:
|
||||
fp.write(json.dumps(vocab_tokens) + "\n")
|
||||
with open(self.merges_file, "w", encoding="utf-8") as fp:
|
||||
fp.write("\n".join(merges))
|
||||
|
||||
def get_tokenizer(self, **kwargs):
|
||||
kwargs.update(self.special_tokens_map)
|
||||
return CTRLTokenizer.from_pretrained(self.tmpdirname, **kwargs)
|
||||
|
||||
def get_input_output_texts(self):
|
||||
input_text = u"adapt react readapt apt"
|
||||
output_text = u"adapt react readapt apt"
|
||||
return input_text, output_text
|
||||
|
||||
def test_full_tokenizer(self):
|
||||
tokenizer = CTRLTokenizer(self.vocab_file, self.merges_file, **self.special_tokens_map)
|
||||
text = "adapt react readapt apt"
|
||||
bpe_tokens = 'adapt re@@ a@@ c@@ t re@@ adapt apt'.split()
|
||||
tokens = tokenizer.tokenize(text)
|
||||
self.assertListEqual(tokens, bpe_tokens)
|
||||
|
||||
input_tokens = tokens + [tokenizer.unk_token]
|
||||
|
||||
input_bpe_tokens = [0, 1, 2, 4, 5, 1, 0, 3, 6]
|
||||
self.assertListEqual(
|
||||
tokenizer.convert_tokens_to_ids(input_tokens), input_bpe_tokens)
|
||||
|
||||
|
||||
if __name__ == '__main__':
|
||||
unittest.main()
|
||||
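The expected tokens in `test_full_tokenizer` follow from applying the merge rules in priority order. The sketch below is a simplified greedy BPE written for illustration, not the `CTRLTokenizer` code itself. It assumes the first merges entry is a version header and the last is an empty string, as in the fixture above, and the helper names `bpe` and `tokenize` are hypothetical.

```python
def bpe(word, ranks):
    """Greedy byte-pair encoding of one word using ranked merge rules."""
    # Split into characters and mark the end of the word on the last symbol.
    symbols = list(word[:-1]) + [word[-1] + "</w>"]
    while len(symbols) > 1:
        pairs = [(symbols[i], symbols[i + 1]) for i in range(len(symbols) - 1)]
        best = min(pairs, key=lambda p: ranks.get(p, float("inf")))
        if best not in ranks:
            break  # no applicable merge rule left
        merged, i = [], 0
        while i < len(symbols):
            if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == best:
                merged.append(symbols[i] + symbols[i + 1])
                i += 2
            else:
                merged.append(symbols[i])
                i += 1
        symbols = merged
    # Non-final pieces get the '@@' continuation marker; the final piece drops '</w>'.
    return [s + "@@" for s in symbols[:-1]] + [symbols[-1][:-len("</w>")]]


def tokenize(text, ranks):
    return [piece for word in text.split() for piece in bpe(word, ranks)]


vocab = ['adapt', 're@@', 'a@@', 'apt', 'c@@', 't', '<unk>']
merges = ["#version: 0.2", 'a p', 'ap t</w>', 'r e', 'a d', 'ad apt</w>', '']
# Skip the version header and the trailing empty entry; lower index means higher priority.
ranks = {tuple(rule.split()): rank for rank, rule in enumerate(merges[1:-1])}
token_to_id = {token: index for index, token in enumerate(vocab)}

tokens = tokenize("adapt react readapt apt", ranks)
ids = [token_to_id.get(token, token_to_id['<unk>']) for token in tokens + ['<unk>']]
```

For example, `react` only matches the `r e` rule, so it stays split as `re@@ a@@ c@@ t`, while `readapt` merges all the way down to `re@@ adapt`.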
@@ -36,8 +36,8 @@ class DistilBertTokenizationTest(BertTokenizationTest):
|
||||
text = tokenizer.encode("sequence builders")
|
||||
text_2 = tokenizer.encode("multi-sequence build")
|
||||
|
||||
encoded_sentence = tokenizer.add_special_tokens_single_sequence(text)
|
||||
encoded_pair = tokenizer.add_special_tokens_sequence_pair(text, text_2)
|
||||
encoded_sentence = tokenizer.build_inputs_with_special_tokens(text)
|
||||
encoded_pair = tokenizer.build_inputs_with_special_tokens(text, text_2)
|
||||
|
||||
assert encoded_sentence == [tokenizer.cls_token_id] + text + [tokenizer.sep_token_id]
|
||||
assert encoded_pair == [tokenizer.cls_token_id] + text + [tokenizer.sep_token_id] + \
|
||||
|
||||
@@ -87,8 +87,8 @@ class RobertaTokenizationTest(CommonTestCases.CommonTokenizerTester):
|
||||
encoded_text_from_decode = tokenizer.encode("sequence builders", add_special_tokens=True)
|
||||
encoded_pair_from_decode = tokenizer.encode("sequence builders", "multi-sequence build", add_special_tokens=True)
|
||||
|
||||
encoded_sentence = tokenizer.add_special_tokens_single_sequence(text)
|
||||
encoded_pair = tokenizer.add_special_tokens_sequence_pair(text, text_2)
|
||||
encoded_sentence = tokenizer.build_inputs_with_special_tokens(text)
|
||||
encoded_pair = tokenizer.build_inputs_with_special_tokens(text, text_2)
|
||||
|
||||
assert encoded_sentence == encoded_text_from_decode
|
||||
assert encoded_pair == encoded_pair_from_decode
|
||||
|
||||
@@ -193,12 +193,12 @@ class CommonTestCases:
|
||||
|
||||
tokenizer = self.get_tokenizer()
|
||||
|
||||
if tokenizer.add_special_tokens_sequence_pair.__qualname__.split('.')[0] != "PreTrainedTokenizer":
|
||||
if tokenizer.build_inputs_with_special_tokens.__qualname__.split('.')[0] != "PreTrainedTokenizer":
|
||||
seq_0 = "Test this method."
|
||||
seq_1 = "With these inputs."
|
||||
information = tokenizer.encode_plus(seq_0, seq_1, add_special_tokens=True)
|
||||
sequences, mask = information["input_ids"], information["token_type_ids"]
|
||||
assert len(sequences) == len(mask)
|
||||
self.assertEqual(len(sequences), len(mask))
|
||||
|
||||
def test_number_of_added_tokens(self):
|
||||
tokenizer = self.get_tokenizer()
|
||||
@@ -211,7 +211,7 @@ class CommonTestCases:
|
||||
|
||||
# Method is implemented (e.g. not GPT-2)
|
||||
if len(attached_sequences) != 2:
|
||||
assert tokenizer.num_added_tokens(pair=True) == len(attached_sequences) - len(sequences)
|
||||
self.assertEqual(tokenizer.num_added_tokens(pair=True), len(attached_sequences) - len(sequences))
|
||||
|
||||
def test_maximum_encoding_length_single_input(self):
|
||||
tokenizer = self.get_tokenizer()
|
||||
@@ -227,10 +227,10 @@ class CommonTestCases:
|
||||
truncated_sequence = information["input_ids"]
|
||||
overflowing_tokens = information["overflowing_tokens"]
|
||||
|
||||
assert len(overflowing_tokens) == 2 + stride
|
||||
assert overflowing_tokens == sequence[-(2 + stride):]
|
||||
assert len(truncated_sequence) == total_length - 2
|
||||
assert truncated_sequence == tokenizer.add_special_tokens_single_sequence(sequence[:-2])
|
||||
self.assertEqual(len(overflowing_tokens), 2 + stride)
|
||||
self.assertEqual(overflowing_tokens, sequence[-(2 + stride):])
|
||||
self.assertEqual(len(truncated_sequence), total_length - 2)
|
||||
self.assertEqual(truncated_sequence, tokenizer.build_inputs_with_special_tokens(sequence[:-2]))
|
||||
|
||||
def test_maximum_encoding_length_pair_input(self):
|
||||
tokenizer = self.get_tokenizer()
|
||||
@@ -243,26 +243,26 @@ class CommonTestCases:
|
||||
sequence_1_no_special_tokens = tokenizer.encode(seq_1)
|
||||
|
||||
sequence = tokenizer.encode(seq_0, seq_1, add_special_tokens=True)
|
||||
truncated_second_sequence = tokenizer.add_special_tokens_sequence_pair(
|
||||
truncated_second_sequence = tokenizer.build_inputs_with_special_tokens(
|
||||
tokenizer.encode(seq_0),
|
||||
tokenizer.encode(seq_1)[:-2]
|
||||
)
|
||||
|
||||
information = tokenizer.encode_plus(seq_0, seq_1, max_length=len(sequence) - 2, add_special_tokens=True,
|
||||
stride=stride, truncate_first_sequence=False)
|
||||
stride=stride, truncation_strategy='only_second')
|
||||
information_first_truncated = tokenizer.encode_plus(seq_0, seq_1, max_length=len(sequence) - 2,
|
||||
add_special_tokens=True, stride=stride,
|
||||
truncate_first_sequence=True)
|
||||
truncation_strategy='only_first')
|
||||
|
||||
truncated_sequence = information["input_ids"]
|
||||
overflowing_tokens = information["overflowing_tokens"]
|
||||
overflowing_tokens_first_truncated = information_first_truncated["overflowing_tokens"]
|
||||
|
||||
assert len(overflowing_tokens) == 2 + stride
|
||||
assert overflowing_tokens == sequence_1_no_special_tokens[-(2 + stride):]
|
||||
assert overflowing_tokens_first_truncated == sequence_0_no_special_tokens[-(2 + stride):]
|
||||
assert len(truncated_sequence) == len(sequence) - 2
|
||||
assert truncated_sequence == truncated_second_sequence
|
||||
self.assertEqual(len(overflowing_tokens), 2 + stride)
|
||||
self.assertEqual(overflowing_tokens, sequence_1_no_special_tokens[-(2 + stride):])
|
||||
self.assertEqual(overflowing_tokens_first_truncated, sequence_0_no_special_tokens[-(2 + stride):])
|
||||
self.assertEqual(len(truncated_sequence), len(sequence) - 2)
|
||||
self.assertEqual(truncated_sequence, truncated_second_sequence)
|
||||
|
||||
def test_encode_input_type(self):
|
||||
tokenizer = self.get_tokenizer()
|
||||
@@ -273,5 +273,43 @@ class CommonTestCases:
|
||||
input_ids = tokenizer.convert_tokens_to_ids(tokens)
|
||||
formatted_input = tokenizer.encode(sequence, add_special_tokens=True)
|
||||
|
||||
assert tokenizer.encode(tokens, add_special_tokens=True) == formatted_input
|
||||
assert tokenizer.encode(input_ids, add_special_tokens=True) == formatted_input
|
||||
self.assertEqual(tokenizer.encode(tokens, add_special_tokens=True), formatted_input)
|
||||
self.assertEqual(tokenizer.encode(input_ids, add_special_tokens=True), formatted_input)
|
||||
|
||||
def test_special_tokens_mask(self):
|
||||
tokenizer = self.get_tokenizer()
|
||||
|
||||
sequence_0 = "Encode this."
|
||||
sequence_1 = "This one too please."
|
||||
|
||||
# Testing single inputs
|
||||
encoded_sequence = tokenizer.encode(sequence_0)
|
||||
encoded_sequence_dict = tokenizer.encode_plus(sequence_0, add_special_tokens=True)
|
||||
encoded_sequence_w_special = encoded_sequence_dict["input_ids"]
|
||||
special_tokens_mask = encoded_sequence_dict["special_tokens_mask"]
|
||||
self.assertEqual(len(special_tokens_mask), len(encoded_sequence_w_special))
|
||||
|
||||
filtered_sequence = [(x if not special_tokens_mask[i] else None) for i, x in enumerate(encoded_sequence_w_special)]
|
||||
filtered_sequence = [x for x in filtered_sequence if x is not None]
|
||||
self.assertEqual(encoded_sequence, filtered_sequence)
|
||||
|
||||
# Testing inputs pairs
|
||||
encoded_sequence = tokenizer.encode(sequence_0) + tokenizer.encode(sequence_1)
|
||||
encoded_sequence_dict = tokenizer.encode_plus(sequence_0, sequence_1, add_special_tokens=True)
|
||||
encoded_sequence_w_special = encoded_sequence_dict["input_ids"]
|
||||
special_tokens_mask = encoded_sequence_dict["special_tokens_mask"]
|
||||
self.assertEqual(len(special_tokens_mask), len(encoded_sequence_w_special))
|
||||
|
||||
filtered_sequence = [(x if not special_tokens_mask[i] else None) for i, x in enumerate(encoded_sequence_w_special)]
|
||||
filtered_sequence = [x for x in filtered_sequence if x is not None]
|
||||
self.assertEqual(encoded_sequence, filtered_sequence)
|
||||
|
||||
# Testing with already existing special tokens
|
||||
if tokenizer.cls_token_id == tokenizer.unk_token_id and tokenizer.sep_token_id == tokenizer.unk_token_id:
|
||||
tokenizer.add_special_tokens({'cls_token': '</s>', 'sep_token': '<s>'})
|
||||
encoded_sequence_dict = tokenizer.encode_plus(sequence_0, add_special_tokens=True)
|
||||
encoded_sequence_w_special = encoded_sequence_dict["input_ids"]
|
||||
special_tokens_mask_orig = encoded_sequence_dict["special_tokens_mask"]
|
||||
special_tokens_mask = tokenizer.get_special_tokens_mask(encoded_sequence_w_special, already_has_special_tokens=True)
|
||||
self.assertEqual(len(special_tokens_mask), len(encoded_sequence_w_special))
|
||||
self.assertEqual(special_tokens_mask_orig, special_tokens_mask)
|
||||
|
||||
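Reviewer note: the new test recovers the plain encoding by dropping every position the mask flags as special. A minimal sketch of that filtering step follows, assuming the mask uses 1 for special tokens and 0 for sequence tokens, as the implementations in this diff do. `strip_special_tokens` and the ids are illustrative only.

```python
def strip_special_tokens(input_ids, special_tokens_mask):
    """Return the ids whose mask entry is 0 (hypothetical helper)."""
    if len(input_ids) != len(special_tokens_mask):
        raise ValueError("input_ids and special_tokens_mask must have the same length")
    return [tok for tok, is_special in zip(input_ids, special_tokens_mask) if not is_special]


# Made-up ids: 101 and 102 stand in for the special tokens around three word ids.
ids_with_special = [101, 7, 8, 9, 102]
mask = [1, 0, 0, 0, 1]
plain_ids = strip_special_tokens(ids_with_special, mask)
```

This does in one pass what the test does with its two list comprehensions.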
@@ -72,8 +72,8 @@ class XLMTokenizationTest(CommonTestCases.CommonTokenizerTester):
|
||||
text = tokenizer.encode("sequence builders")
|
||||
text_2 = tokenizer.encode("multi-sequence build")
|
||||
|
||||
encoded_sentence = tokenizer.add_special_tokens_single_sequence(text)
|
||||
encoded_pair = tokenizer.add_special_tokens_sequence_pair(text, text_2)
|
||||
encoded_sentence = tokenizer.build_inputs_with_special_tokens(text)
|
||||
encoded_pair = tokenizer.build_inputs_with_special_tokens(text, text_2)
|
||||
|
||||
assert encoded_sentence == [1] + text + [1]
|
||||
assert encoded_pair == [1] + text + [1] + text_2 + [1]
|
||||
|
||||
@@ -95,8 +95,8 @@ class XLNetTokenizationTest(CommonTestCases.CommonTokenizerTester):
|
||||
text = tokenizer.encode("sequence builders")
|
||||
text_2 = tokenizer.encode("multi-sequence build")
|
||||
|
||||
encoded_sentence = tokenizer.add_special_tokens_single_sequence(text)
|
||||
encoded_pair = tokenizer.add_special_tokens_sequence_pair(text, text_2)
|
||||
encoded_sentence = tokenizer.build_inputs_with_special_tokens(text)
|
||||
encoded_pair = tokenizer.build_inputs_with_special_tokens(text, text_2)
|
||||
|
||||
assert encoded_sentence == text + [4, 3]
|
||||
assert encoded_pair == text + [4] + text_2 + [4, 3]
|
||||
|
||||
@@ -21,6 +21,7 @@ import logging
|
||||
from .tokenization_bert import BertTokenizer
|
||||
from .tokenization_openai import OpenAIGPTTokenizer
|
||||
from .tokenization_gpt2 import GPT2Tokenizer
|
||||
from .tokenization_ctrl import CTRLTokenizer
|
||||
from .tokenization_transfo_xl import TransfoXLTokenizer
|
||||
from .tokenization_xlnet import XLNetTokenizer
|
||||
from .tokenization_xlm import XLMTokenizer
|
||||
@@ -45,6 +46,7 @@ class AutoTokenizer(object):
|
||||
- contains `bert`: BertTokenizer (Bert model)
|
||||
- contains `openai-gpt`: OpenAIGPTTokenizer (OpenAI GPT model)
|
||||
- contains `gpt2`: GPT2Tokenizer (OpenAI GPT-2 model)
|
||||
- contains `ctrl`: CTRLTokenizer (Salesforce CTRL model)
|
||||
- contains `transfo-xl`: TransfoXLTokenizer (Transformer-XL model)
|
||||
- contains `xlnet`: XLNetTokenizer (XLNet model)
|
||||
- contains `xlm`: XLMTokenizer (XLM model)
|
||||
@@ -67,6 +69,7 @@ class AutoTokenizer(object):
|
||||
- contains `bert`: BertTokenizer (Bert model)
|
||||
- contains `openai-gpt`: OpenAIGPTTokenizer (OpenAI GPT model)
|
||||
- contains `gpt2`: GPT2Tokenizer (OpenAI GPT-2 model)
|
||||
- contains `ctrl`: CTRLTokenizer (Salesforce CTRL model)
|
||||
- contains `transfo-xl`: TransfoXLTokenizer (Transformer-XL model)
|
||||
- contains `xlnet`: XLNetTokenizer (XLNet model)
|
||||
- contains `xlm`: XLMTokenizer (XLM model)
|
||||
@@ -114,7 +117,8 @@ class AutoTokenizer(object):
|
||||
return XLNetTokenizer.from_pretrained(pretrained_model_name_or_path, *inputs, **kwargs)
|
||||
elif 'xlm' in pretrained_model_name_or_path:
|
||||
return XLMTokenizer.from_pretrained(pretrained_model_name_or_path, *inputs, **kwargs)
|
||||
|
||||
elif 'ctrl' in pretrained_model_name_or_path:
|
||||
return CTRLTokenizer.from_pretrained(pretrained_model_name_or_path, *inputs, **kwargs)
|
||||
raise ValueError("Unrecognized model identifier in {}. Should contain one of "
|
||||
"'bert', 'openai-gpt', 'gpt2', 'transfo-xl', 'xlnet', "
|
||||
"'xlm', 'roberta'".format(pretrained_model_name_or_path))
|
||||
"'xlm', 'roberta', 'ctrl'".format(pretrained_model_name_or_path))
|
||||
|
||||
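Reviewer note: `AutoTokenizer` picks a class by substring match on the model name, so the order of the checks matters whenever one pattern contains another (for example `bert` inside `roberta`). A minimal sketch of the same dispatch as an ordered table follows. The table, the string class names and `resolve_tokenizer_name` are assumptions for illustration, not the library's code.

```python
# Ordered (pattern, class name) pairs; more specific patterns come first.
# Class names are plain strings here so the sketch has no dependencies.
PATTERNS = [
    ("roberta", "RobertaTokenizer"),
    ("bert", "BertTokenizer"),
    ("openai-gpt", "OpenAIGPTTokenizer"),
    ("gpt2", "GPT2Tokenizer"),
    ("transfo-xl", "TransfoXLTokenizer"),
    ("xlnet", "XLNetTokenizer"),
    ("xlm", "XLMTokenizer"),
    ("ctrl", "CTRLTokenizer"),
]


def resolve_tokenizer_name(model_name):
    """Return the class name of the first pattern found in `model_name`."""
    for pattern, class_name in PATTERNS:
        if pattern in model_name:
            return class_name
    known = ", ".join(repr(pattern) for pattern, _ in PATTERNS)
    raise ValueError(
        "Unrecognized model identifier in {}. Should contain one of {}".format(model_name, known)
    )
```

A table like this also keeps the error message in step with the supported patterns, which this hunk has to update by hand.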
@@ -187,33 +187,59 @@ class BertTokenizer(PreTrainedTokenizer):
|
||||
out_string = ' '.join(tokens).replace(' ##', '').strip()
|
||||
return out_string
|
||||
|
||||
def add_special_tokens_single_sequence(self, token_ids):
|
||||
def build_inputs_with_special_tokens(self, token_ids_0, token_ids_1=None):
|
||||
"""
|
||||
Adds special tokens to the a sequence for sequence classification tasks.
|
||||
A BERT sequence has the following format: [CLS] X [SEP]
|
||||
Build model inputs from a sequence or a pair of sequences for sequence classification tasks
|
||||
by concatenating and adding special tokens.
|
||||
A BERT sequence has the following format:
|
||||
single sequence: [CLS] X [SEP]
|
||||
pair of sequences: [CLS] A [SEP] B [SEP]
|
||||
"""
|
||||
return [self.cls_token_id] + token_ids + [self.sep_token_id]
|
||||
|
||||
def add_special_tokens_sequence_pair(self, token_ids_0, token_ids_1):
|
||||
"""
|
||||
Adds special tokens to a sequence pair for sequence classification tasks.
|
||||
A BERT sequence pair has the following format: [CLS] A [SEP] B [SEP]
|
||||
"""
|
||||
sep = [self.sep_token_id]
|
||||
if token_ids_1 is None:
|
||||
return [self.cls_token_id] + token_ids_0 + [self.sep_token_id]
|
||||
cls = [self.cls_token_id]
|
||||
|
||||
sep = [self.sep_token_id]
|
||||
return cls + token_ids_0 + sep + token_ids_1 + sep
|
||||
|
||||
def create_token_type_ids_from_sequences(self, token_ids_0, token_ids_1):
|
||||
def get_special_tokens_mask(self, token_ids_0, token_ids_1=None, already_has_special_tokens=False):
|
||||
"""
|
||||
Retrieves sequence ids from a token list that has no special tokens added. This method is called when adding
|
||||
special tokens using the tokenizer ``prepare_for_model`` or ``encode_plus`` methods.
|
||||
|
||||
Args:
|
||||
token_ids_0: list of ids (must not contain special tokens)
|
||||
token_ids_1: Optional list of ids (must not contain special tokens), necessary when fetching sequence ids
|
||||
for sequence pairs
|
||||
already_has_special_tokens: (default False) Set to True if the token list is already formatted with
|
||||
special tokens for the model
|
||||
|
||||
Returns:
|
||||
A list of integers in the range [0, 1]: 1 for a special token, 0 for a sequence token.
|
||||
"""
|
||||
|
||||
if already_has_special_tokens:
|
||||
if token_ids_1 is not None:
|
||||
raise ValueError("You should not supply a second sequence if the provided sequence of "
|
||||
"ids is already formatted with special tokens for the model.")
|
||||
return list(map(lambda x: 1 if x in [self.sep_token_id, self.cls_token_id] else 0, token_ids_0))
|
||||
|
||||
if token_ids_1 is not None:
|
||||
return [1] + ([0] * len(token_ids_0)) + [1] + ([0] * len(token_ids_1)) + [1]
|
||||
return [1] + ([0] * len(token_ids_0)) + [1]
|
||||
|
||||
def create_token_type_ids_from_sequences(self, token_ids_0, token_ids_1=None):
|
||||
"""
|
||||
Creates a mask from the two sequences passed to be used in a sequence-pair classification task.
|
||||
A BERT sequence pair mask has the following format:
|
||||
0 0 0 0 0 0 0 0 0 0 1 1 1 1 1 1 1 1 1 1 1
|
||||
| first sequence | second sequence
|
||||
|
||||
if token_ids_1 is None, only returns the first portion of the mask (0's).
|
||||
"""
|
||||
sep = [self.sep_token_id]
|
||||
cls = [self.cls_token_id]
|
||||
|
||||
if token_ids_1 is None:
|
||||
return len(cls + token_ids_0 + sep) * [0]
|
||||
return len(cls + token_ids_0 + sep) * [0] + len(token_ids_1 + sep) * [1]
|
||||
|
||||
def save_vocabulary(self, vocab_path):
|
||||
|
||||
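Reviewer note: the three BERT methods touched in this hunk must agree on one layout, `[CLS] A [SEP] B [SEP]`. Below is a minimal standalone sketch of that layout, written as free functions with made-up ids (`CLS_ID` and `SEP_ID` are placeholders, not read from any vocabulary).

```python
CLS_ID, SEP_ID = 101, 102  # placeholder ids, for illustration only


def build_inputs(ids_0, ids_1=None):
    """[CLS] A [SEP] for one sequence, [CLS] A [SEP] B [SEP] for a pair."""
    if ids_1 is None:
        return [CLS_ID] + ids_0 + [SEP_ID]
    return [CLS_ID] + ids_0 + [SEP_ID] + ids_1 + [SEP_ID]


def special_tokens_mask(ids_0, ids_1=None):
    """1 marks a special token, 0 marks a sequence token."""
    if ids_1 is None:
        return [1] + [0] * len(ids_0) + [1]
    return [1] + [0] * len(ids_0) + [1] + [0] * len(ids_1) + [1]


def token_type_ids(ids_0, ids_1=None):
    """0 for [CLS] A [SEP], 1 for B [SEP]."""
    first = [0] * (len(ids_0) + 2)
    if ids_1 is None:
        return first
    return first + [1] * (len(ids_1) + 1)


a, b = [7, 8], [9]
inputs = build_inputs(a, b)        # [101, 7, 8, 102, 9, 102]
mask = special_tokens_mask(a, b)   # [1, 0, 0, 1, 0, 1]
types = token_type_ids(a, b)       # [0, 0, 0, 0, 1, 1]
```

All three lists have the same length for any input, which is the invariant the common tests check with `assertEqual(len(sequences), len(mask))`.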
239
transformers/tokenization_ctrl.py
Normal file
@@ -0,0 +1,239 @@
|
||||
# coding=utf-8
|
||||
# Copyright 2018 Salesforce and The HuggingFace Inc. team.
|
||||
#
|
||||
# Licensed under the Apache License, Version 2.0 (the "License");
|
||||
# you may not use this file except in compliance with the License.
|
||||
# You may obtain a copy of the License at
|
||||
#
|
||||
# http://www.apache.org/licenses/LICENSE-2.0
|
||||
#
|
||||
# Unless required by applicable law or agreed to in writing, software
|
||||
# distributed under the License is distributed on an "AS IS" BASIS,
|
||||
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
|
||||
# See the License for the specific language governing permissions and
|
||||
# limitations under the License.
|
||||
"""Tokenization classes for Salesforce CTRL."""
|
||||
from __future__ import (absolute_import, division, print_function,
|
||||
unicode_literals)
|
||||
|
||||
import json
|
||||
import logging
|
||||
import os
|
||||
import regex as re
|
||||
from io import open
|
||||
|
||||
from .tokenization_bert import BasicTokenizer
|
||||
|
||||
from .tokenization_utils import PreTrainedTokenizer
|
||||
|
||||
logger = logging.getLogger(__name__)
|
||||
|
||||
VOCAB_FILES_NAMES = {
|
||||
'vocab_file': 'vocab.json',
|
||||
'merges_file': 'merges.txt',
|
||||
}
|
||||
|
||||
PRETRAINED_VOCAB_FILES_MAP = {
|
||||
'vocab_file':
|
||||
{
|
||||
'ctrl': "https://raw.githubusercontent.com/salesforce/ctrl/master/ctrl-vocab.json",
|
||||
},
|
||||
'merges_file':
|
||||
{
|
||||
'ctrl': "https://raw.githubusercontent.com/salesforce/ctrl/master/ctrl-merges.txt",
|
||||
},
|
||||
}
|
||||
|
||||
PRETRAINED_POSITIONAL_EMBEDDINGS_SIZES = {
|
||||
'ctrl': 256,
|
||||
}
|
||||
|
||||
def text_standardize(text):
|
||||
"""
|
||||
fixes some issues the spacy tokenizer had on books corpus
|
||||
also does some whitespace standardization
|
||||
"""
|
||||
text = text.replace('—', '-')
|
||||
text = text.replace('–', '-')
|
||||
text = text.replace('―', '-')
|
||||
text = text.replace('…', '...')
|
||||
text = text.replace('´', "'")
|
||||
text = re.sub(r'''(-+|~+|!+|"+|;+|\?+|\++|,+|\)+|\(+|\\+|\/+|\*+|\[+|\]+|}+|{+|\|+|_+)''', r' \1 ', text)
|
||||
text = re.sub(r'\s*\n\s*', ' \n ', text)
|
||||
text = re.sub(r'[^\S\n]+', ' ', text)
|
||||
return text.strip()
|
||||
|
||||
|
||||
def get_pairs(word):
|
||||
"""Return set of symbol pairs in a word.
|
||||
|
||||
Word is represented as tuple of symbols (symbols being variable-length strings).
|
||||
"""
|
||||
# pairs = []
|
||||
# prev_char = word[0]
|
||||
# for i, char in enumerate(word[1:]):
|
||||
# #_i = i + 1
|
||||
# #if word[_i+1:] == tuple('</w>'):
|
||||
# # pairs.append((prev_char, char+'</w>'))
|
||||
# # break
|
||||
# #else:
|
||||
# if True:
|
||||
# pairs.append((prev_char, char))
|
||||
# prev_char = char
|
||||
|
||||
pairs = set()
|
||||
prev_char = word[0]
|
||||
for char in word[1:]:
|
||||
pairs.add((prev_char, char))
|
||||
prev_char = char
|
||||
|
||||
pairs = set(pairs)
|
||||
return pairs
|
||||
|
||||
class CTRLTokenizer(PreTrainedTokenizer):
|
||||
"""
|
||||
CTRL BPE tokenizer. Peculiarities:
|
||||
- Byte-Pair-Encoding
|
||||
- Requires a space to start the input string => the encoding methods should be called with the
|
||||
``add_prefix_space`` flag set to ``True``.
|
||||
Otherwise, this tokenizer ``encode`` and ``decode`` method will not conserve
|
||||
the absence of a space at the beginning of a string: `tokenizer.decode(tokenizer.encode("Hello")) = " Hello"`
|
||||
"""
|
||||
vocab_files_names = VOCAB_FILES_NAMES
|
||||
pretrained_vocab_files_map = PRETRAINED_VOCAB_FILES_MAP
|
||||
max_model_input_sizes = PRETRAINED_POSITIONAL_EMBEDDINGS_SIZES
|
||||
|
||||
def __init__(self, vocab_file, merges_file, unk_token="<unk>", **kwargs):
|
||||
super(CTRLTokenizer, self).__init__(unk_token=unk_token, **kwargs)
|
||||
self.max_len_single_sentence = self.max_len # no default special tokens - you can update this value if you add special tokens
|
||||
self.max_len_sentences_pair = self.max_len # no default special tokens - you can update this value if you add special tokens
|
||||
|
||||
try:
|
||||
import ftfy
|
||||
from spacy.lang.en import English
|
||||
_nlp = English()
|
||||
self.nlp = _nlp.Defaults.create_tokenizer(_nlp)
|
||||
self.fix_text = ftfy.fix_text
|
||||
except ImportError:
|
||||
logger.warning("ftfy or spacy is not installed; using BERT BasicTokenizer instead of SpaCy & ftfy.")
|
||||
self.nlp = BasicTokenizer(do_lower_case=True)
|
||||
self.fix_text = None
|
||||
|
||||
self.encoder = json.load(open(vocab_file, encoding="utf-8"))
|
||||
self.decoder = {v:k for k,v in self.encoder.items()}
|
||||
merges = open(merges_file, encoding='utf-8').read().split('\n')[1:-1]
|
||||
merges = [tuple(merge.split()) for merge in merges]
|
||||
self.bpe_ranks = dict(zip(merges, range(len(merges))))
|
||||
self.cache = {}
|
||||
|
||||
@property
|
||||
def vocab_size(self):
|
||||
return len(self.encoder)
|
||||
|
||||
def bpe(self, token):
|
||||
if token in self.cache:
|
||||
return self.cache[token]
|
||||
word = tuple(token)
|
||||
word = tuple(list(word[:-1]) + [word[-1]+'</w>'])
|
||||
pairs = get_pairs(word)
|
||||
|
||||
if not pairs:
|
||||
return token
|
||||
|
||||
while True:
|
||||
bigram = min(pairs, key = lambda pair: self.bpe_ranks.get(pair, float('inf')))
|
||||
if bigram not in self.bpe_ranks:
|
||||
break
|
||||
first, second = bigram
|
||||
new_word = []
|
||||
i = 0
|
||||
while i < len(word):
|
||||
try:
|
||||
j = word.index(first, i)
|
||||
new_word.extend(word[i:j])
|
||||
i = j
|
||||
except ValueError:
|
||||
new_word.extend(word[i:])
|
||||
break
|
||||
|
||||
if word[i] == first and i < len(word)-1 and word[i+1] == second:
|
||||
new_word.append(first+second)
|
||||
i += 2
|
||||
else:
|
||||
new_word.append(word[i])
|
||||
i += 1
|
||||
new_word = tuple(new_word)
|
||||
word = new_word
|
||||
if len(word) == 1:
|
||||
break
|
||||
else:
|
||||
pairs = get_pairs(word)
|
||||
word = '@@ '.join(word)
|
||||
word = word[:-4]
|
||||
self.cache[token] = word
|
||||
return word
|
||||
|
||||
def _tokenize(self, text):
|
||||
""" Tokenize a string.
|
||||
"""
|
||||
split_tokens = []
|
||||
if self.fix_text is None:
|
||||
# Using BERT's BasicTokenizer
|
||||
text = self.nlp.tokenize(text)
|
||||
for token in text:
|
||||
split_tokens.extend([t for t in self.bpe(token).split(' ')])
|
||||
else:
|
||||
# Using SpaCy & ftfy (original tokenization process of OpenAI GPT)
|
||||
text = self.nlp(text_standardize(self.fix_text(text)))
|
||||
for token in text:
|
||||
split_tokens.extend([t for t in self.bpe(token.text.lower()).split(' ')])
|
||||
# for token in text.split():
|
||||
# if sys.version_info[0] == 2:
|
||||
# token = ''.join(self.byte_encoder[ord(b)] for b in token) # Maps all our bytes to unicode strings, avoiding controle tokens of the BPE (spaces in our case)
|
||||
# else:
|
||||
# token = ''.join(self.byte_encoder[b] for b in token.encode('utf-8')) # Maps all our bytes to unicode strings, avoiding controle tokens of the BPE (spaces in our case)
|
||||
# bpe_tokens.extend(bpe_token for bpe_token in self.bpe(token).split(' '))
|
||||
return split_tokens
|
||||
|
||||
def _convert_token_to_id(self, token):
|
||||
""" Converts a token (str/unicode) into an id using the vocab. """
|
||||
return self.encoder.get(token, self.encoder.get(self.unk_token))
|
||||
|
||||
def _convert_id_to_token(self, index):
|
||||
"""Converts an index (integer) into a token (string/unicode) using the vocab."""
|
||||
return self.decoder.get(index, self.unk_token)
|
||||
|
||||
def convert_tokens_to_string(self, tokens):
|
||||
""" Converts a sequence of tokens (string) into a single string. """
|
||||
out_string = ' '.join(tokens).replace('@@ ', '').strip()
|
||||
return out_string
|
||||
|
||||
def save_vocabulary(self, save_directory):
|
||||
"""Save the tokenizer vocabulary and merge files to a directory."""
|
||||
if not os.path.isdir(save_directory):
|
||||
logger.error("Vocabulary path ({}) should be a directory".format(save_directory))
|
||||
return
|
||||
vocab_file = os.path.join(save_directory, VOCAB_FILES_NAMES['vocab_file'])
|
||||
merge_file = os.path.join(save_directory, VOCAB_FILES_NAMES['merges_file'])
|
||||
|
||||
with open(vocab_file, 'w', encoding='utf-8') as f:
|
||||
f.write(json.dumps(self.encoder, ensure_ascii=False))
|
||||
|
||||
index = 0
|
||||
with open(merge_file, "w", encoding="utf-8") as writer:
|
||||
writer.write(u'#version: 0.2\n')
|
||||
for bpe_tokens, token_index in sorted(self.bpe_ranks.items(), key=lambda kv: kv[1]):
|
||||
if index != token_index:
|
||||
logger.warning("Saving vocabulary to {}: BPE merge indices are not consecutive."
|
||||
" Please check that the tokenizer is not corrupted!".format(merge_file))
|
||||
index = token_index
|
||||
writer.write(' '.join(bpe_tokens) + u'\n')
|
||||
index += 1
|
||||
|
||||
return vocab_file, merge_file
|
||||
|
||||
# def decode(self, token_ids, skip_special_tokens=False, clean_up_tokenization_spaces=True):
|
||||
# filtered_tokens = ' '.join(self.convert_ids_to_tokens(token_ids, skip_special_tokens=skip_special_tokens))
|
||||
# tokens_generated_so_far = re.sub('(@@ )', '', string=filtered_tokens)
|
||||
# tokens_generated_so_far = re.sub('(@@ ?$)', '', string=tokens_generated_so_far)
|
||||
# return ''.join(tokens_generated_so_far)
|
||||
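Reviewer note: the `bpe` method in the new file marks word-internal pieces with a trailing `@@`, and `convert_tokens_to_string` removes those markers again. Here is a compact sketch of the same merge loop on a toy merge table. The two-entry `toy_ranks` table is invented for illustration, and the sketch leaves out the cache and assumes a non-empty token.

```python
def get_pairs(word):
    """Set of adjacent symbol pairs in a tuple of symbols."""
    return set(zip(word, word[1:]))


def bpe(token, bpe_ranks):
    """Apply ranked merges to a non-empty token; pieces are joined by '@@ '."""
    # Mark the end of the word on its last symbol, as the tokenizer does.
    word = tuple(token[:-1]) + (token[-1] + "</w>",)
    pairs = get_pairs(word)
    while pairs:
        # Pick the adjacent pair with the lowest (earliest learned) rank.
        bigram = min(pairs, key=lambda pair: bpe_ranks.get(pair, float("inf")))
        if bigram not in bpe_ranks:
            break
        first, second = bigram
        merged, i = [], 0
        while i < len(word):
            if i < len(word) - 1 and word[i] == first and word[i + 1] == second:
                merged.append(first + second)
                i += 2
            else:
                merged.append(word[i])
                i += 1
        word = tuple(merged)
        pairs = get_pairs(word)
    # Drop the trailing '</w>' (4 characters) from the final piece.
    return "@@ ".join(word)[:-4]


toy_ranks = {("l", "o"): 0, ("lo", "w</w>"): 1}  # invented merge table
pieces = bpe("lower", toy_ranks).split(" ")       # ['lo@@', 'w@@', 'e@@', 'r']
restored = " ".join(pieces).replace("@@ ", "")    # 'lower'
```

Since `</w>` only ever sits on the last symbol, slicing off four characters after the join is enough to remove it.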
@@ -46,12 +46,14 @@ PRETRAINED_VOCAB_FILES_MAP = {
|
||||
'gpt2': "https://s3.amazonaws.com/models.huggingface.co/bert/gpt2-vocab.json",
|
||||
'gpt2-medium': "https://s3.amazonaws.com/models.huggingface.co/bert/gpt2-medium-vocab.json",
|
||||
'gpt2-large': "https://s3.amazonaws.com/models.huggingface.co/bert/gpt2-large-vocab.json",
|
||||
'distilgpt2': "https://s3.amazonaws.com/models.huggingface.co/bert/distilgpt2-vocab.json",
|
||||
},
|
||||
'merges_file':
|
||||
{
|
||||
'gpt2': "https://s3.amazonaws.com/models.huggingface.co/bert/gpt2-merges.txt",
|
||||
'gpt2-medium': "https://s3.amazonaws.com/models.huggingface.co/bert/gpt2-medium-merges.txt",
|
||||
'gpt2-large': "https://s3.amazonaws.com/models.huggingface.co/bert/gpt2-large-merges.txt",
|
||||
'distilgpt2': "https://s3.amazonaws.com/models.huggingface.co/bert/distilgpt2-merges.txt",
|
||||
},
|
||||
}
|
||||
|
||||
@@ -59,6 +61,7 @@ PRETRAINED_POSITIONAL_EMBEDDINGS_SIZES = {
|
||||
'gpt2': 1024,
|
||||
'gpt2-medium': 1024,
|
||||
'gpt2-large': 1024,
|
||||
'distilgpt2': 1024,
|
||||
}
|
||||
|
||||
@lru_cache()
|
||||
@@ -101,9 +104,10 @@ class GPT2Tokenizer(PreTrainedTokenizer):
|
||||
"""
|
||||
GPT-2 BPE tokenizer. Peculiarities:
|
||||
- Byte-level Byte-Pair-Encoding
|
||||
- Requires a space to start the input string => will add a space is there isn't.
|
||||
As a consequence, this tokenizer `encode` and `decode` method will not conserve
|
||||
the absence of a space at the beginning of a string: `tokenizer.decode(tokenizer.encode("Hello")) = " Hello"
|
||||
- Requires a space to start the input string => the encoding methods should be called with the
|
||||
``add_prefix_space`` flag set to ``True``.
|
||||
Otherwise, this tokenizer ``encode`` and ``decode`` method will not conserve
|
||||
the absence of a space at the beginning of a string: `tokenizer.decode(tokenizer.encode("Hello")) = " Hello"`
|
||||
"""
|
||||
vocab_files_names = VOCAB_FILES_NAMES
|
||||
pretrained_vocab_files_map = PRETRAINED_VOCAB_FILES_MAP
|
||||
|
||||
@@ -66,9 +66,10 @@ class RobertaTokenizer(GPT2Tokenizer):
|
||||
"""
|
||||
RoBERTa BPE tokenizer, derived from the GPT-2 tokenizer. Peculiarities:
|
||||
- Byte-level Byte-Pair-Encoding
|
||||
- Requires a space to start the input string => will add a space is there isn't.
|
||||
As a consequence, this tokenizer `encode` and `decode` method will not conserve
|
||||
the absence of a space at the beginning of a string: `tokenizer.decode(tokenizer.encode("Hello")) = " Hello"
|
||||
- Requires a space to start the input string => the encoding methods should be called with the
|
||||
``add_prefix_space`` flag set to ``True``.
|
||||
Otherwise, this tokenizer ``encode`` and ``decode`` method will not conserve
|
||||
the absence of a space at the beginning of a string: `tokenizer.decode(tokenizer.encode("Hello")) = " Hello"`
|
||||
"""
|
||||
vocab_files_names = VOCAB_FILES_NAMES
|
||||
pretrained_vocab_files_map = PRETRAINED_VOCAB_FILES_MAP
|
||||
@@ -80,31 +81,60 @@ class RobertaTokenizer(GPT2Tokenizer):
|
||||
bos_token=bos_token, eos_token=eos_token, unk_token=unk_token,
|
||||
sep_token=sep_token, cls_token=cls_token, pad_token=pad_token,
|
||||
mask_token=mask_token, **kwargs)
|
||||
self.max_len_single_sentence = self.max_len - 2 # take into account special tokens
|
||||
self.max_len_sentences_pair = self.max_len - 4 # take into account special tokens
|
||||
|
||||
def add_special_tokens_single_sequence(self, token_ids):
|
||||
def build_inputs_with_special_tokens(self, token_ids_0, token_ids_1=None):
|
||||
"""
|
||||
Adds special tokens to a sequence for sequence classification tasks.
|
||||
A RoBERTa sequence has the following format: <s> X </s>
|
||||
Build model inputs from a sequence or a pair of sequences for sequence classification tasks
|
||||
by concatenating and adding special tokens.
|
||||
A RoBERTa sequence has the following format:
|
||||
single sequence: <s> X </s>
|
||||
pair of sequences: <s> A </s></s> B </s>
|
||||
"""
|
||||
return [self.cls_token_id] + token_ids + [self.sep_token_id]
|
||||
|
||||
def add_special_tokens_sequence_pair(self, token_ids_0, token_ids_1):
|
||||
"""
|
||||
Adds special tokens to a sequence pair for sequence classification tasks.
|
||||
A RoBERTa sequence pair has the following format: <s> A </s></s> B </s>
|
||||
"""
|
||||
sep = [self.sep_token_id]
|
||||
if token_ids_1 is None:
|
||||
return [self.cls_token_id] + token_ids_0 + [self.sep_token_id]
|
||||
cls = [self.cls_token_id]
|
||||
sep = [self.sep_token_id]
|
||||
return cls + token_ids_0 + sep + sep + token_ids_1 + sep
|
||||
|
||||
def create_token_type_ids_from_sequences(self, token_ids_0, token_ids_1):
|
||||
def get_special_tokens_mask(self, token_ids_0, token_ids_1=None, already_has_special_tokens=False):
|
||||
"""
|
||||
Retrieves sequence ids from a token list that has no special tokens added. This method is called when adding
|
||||
special tokens using the tokenizer ``prepare_for_model`` or ``encode_plus`` methods.
|
||||
|
||||
Args:
|
||||
token_ids_0: list of ids (must not contain special tokens)
|
||||
token_ids_1: Optional list of ids (must not contain special tokens), necessary when fetching sequence ids
|
||||
for sequence pairs
|
||||
already_has_special_tokens: (default False) Set to True if the token list is already formatted with
|
||||
special tokens for the model
|
||||
|
||||
Returns:
|
||||
A list of integers in the range [0, 1]: 1 for a special token, 0 for a sequence token.
|
||||
"""
|
||||
if already_has_special_tokens:
|
||||
if token_ids_1 is not None:
|
||||
raise ValueError("You should not supply a second sequence if the provided sequence of "
|
||||
"ids is already formatted with special tokens for the model.")
|
||||
return list(map(lambda x: 1 if x in [self.sep_token_id, self.cls_token_id] else 0, token_ids_0))
|
||||
|
||||
if token_ids_1 is None:
|
||||
return [1] + ([0] * len(token_ids_0)) + [1]
|
||||
return [1] + ([0] * len(token_ids_0)) + [1, 1] + ([0] * len(token_ids_1)) + [1]
|
||||
|
||||
def create_token_type_ids_from_sequences(self, token_ids_0, token_ids_1=None):
|
||||
"""
|
||||
Creates a mask from the two sequences passed to be used in a sequence-pair classification task.
|
||||
A RoBERTa sequence pair mask has the following format:
|
||||
0 0 0 0 0 0 0 0 0 0 1 1 1 1 1 1 1 1 1 1 1
|
||||
| first sequence | second sequence
|
||||
|
||||
if token_ids_1 is None, only returns the first portion of the mask (0's).
|
||||
"""
|
||||
sep = [self.sep_token_id]
|
||||
cls = [self.cls_token_id]
|
||||
|
||||
return len(cls + token_ids_0 + sep + sep) * [0] + len(token_ids_1 + sep) * [1]
|
||||
if token_ids_1 is None:
|
||||
return len(cls + token_ids_0 + sep) * [0]
|
||||
return len(cls + token_ids_0 + sep + sep) * [0] + len(token_ids_1 + sep) * [1]
|
||||
|
||||
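Reviewer note: RoBERTa differs from BERT only in the doubled separator between the two sequences, which is also where the `max_len - 2` and `max_len - 4` budgets in the constructor come from. Below is a minimal sketch with placeholder ids (`cls_id=0` and `sep_id=2` are illustrative defaults, and the function names are hypothetical).

```python
def roberta_build_inputs(ids_0, ids_1=None, cls_id=0, sep_id=2):
    """<s> A </s> for one sequence, <s> A </s></s> B </s> for a pair."""
    if ids_1 is None:
        return [cls_id] + ids_0 + [sep_id]
    return [cls_id] + ids_0 + [sep_id, sep_id] + ids_1 + [sep_id]


def roberta_special_tokens_mask(ids_0, ids_1=None):
    """1 marks a special token, 0 marks a sequence token."""
    if ids_1 is None:
        return [1] + [0] * len(ids_0) + [1]
    return [1] + [0] * len(ids_0) + [1, 1] + [0] * len(ids_1) + [1]


# Building from empty sequences counts the special tokens directly, the same
# trick the rewritten num_added_tokens uses further down in this diff.
added_single = len(roberta_build_inputs([]))    # 2
added_pair = len(roberta_build_inputs([], []))  # 4
```

Counting on empty inputs avoids encoding sample sentences just to measure the overhead.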
@@ -430,7 +430,7 @@ class PreTrainedTokenizer(object):
|
||||
- tokenizer instantiation positional and keywords inputs (e.g. do_lower_case for Bert).
|
||||
|
||||
This won't save modifications other than (added tokens and special token mapping) you may have
|
||||
applied to the tokenizer after the instantion (e.g. modifying tokenizer.do_lower_case after creation).
|
||||
applied to the tokenizer after the instantiation (e.g. modifying tokenizer.do_lower_case after creation).
|
||||
|
||||
This method makes sure the full tokenizer can then be re-loaded using the :func:`~transformers.PreTrainedTokenizer.from_pretrained` class method.
|
||||
"""
|
||||
@@ -512,7 +512,8 @@ class PreTrainedTokenizer(object):
|
||||
for token in new_tokens:
|
||||
assert isinstance(token, str) or (six.PY2 and isinstance(token, unicode))
|
||||
if token != self.unk_token and \
|
||||
self.convert_tokens_to_ids(token) == self.convert_tokens_to_ids(self.unk_token):
|
||||
self.convert_tokens_to_ids(token) == self.convert_tokens_to_ids(self.unk_token) and \
|
||||
token not in to_add_tokens:
|
||||
to_add_tokens.append(token)
|
||||
logger.info("Adding %s to the vocabulary", token)
|
||||
|
||||
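Reviewer note: the added `token not in to_add_tokens` condition stops a token that appears twice in one call from being queued twice. A minimal sketch of that selection logic follows, with a plain set standing in for the vocabulary lookup. `select_tokens_to_add` and its arguments are hypothetical.

```python
def select_tokens_to_add(new_tokens, known_vocab, unk_token="<unk>"):
    """Keep unknown tokens once each, in first-seen order (hypothetical helper)."""
    to_add = []
    for token in new_tokens:
        # Skip the unknown token itself, anything already in the vocabulary,
        # and anything already queued earlier in this same call.
        if token != unk_token and token not in known_vocab and token not in to_add:
            to_add.append(token)
    return to_add


queued = select_tokens_to_add(["foo", "bar", "foo", "the", "<unk>"], known_vocab={"the"})
```

A list is kept rather than a set so that the new ids are assigned in the order the caller supplied.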
@@ -538,15 +539,9 @@ class PreTrainedTokenizer(object):
|
||||
Returns:
|
||||
Number of tokens added to sequences
|
||||
"""
|
||||
|
||||
if pair:
|
||||
initial_tokens_len = len(self.encode("This is a sequence") + self.encode("This is another"))
|
||||
final_tokens_len = len(self.encode("This is a sequence", "This is another", add_special_tokens=True))
|
||||
else:
|
||||
initial_tokens_len = len(self.encode("This is a sequence"))
|
||||
final_tokens_len = len(self.encode("This is a sequence", add_special_tokens=True))
|
||||
|
||||
return final_tokens_len - initial_tokens_len
|
||||
token_ids_0 = []
|
||||
token_ids_1 = []
|
||||
return len(self.build_inputs_with_special_tokens(token_ids_0, token_ids_1 if pair else None))
|
||||
|
||||
def add_special_tokens(self, special_tokens_dict):
|
||||
"""
|
||||
@@ -698,7 +693,7 @@ class PreTrainedTokenizer(object):
|
||||
add_special_tokens=False,
|
||||
max_length=None,
|
||||
stride=0,
|
||||
truncate_first_sequence=True,
|
||||
truncation_strategy='longest_first',
|
||||
return_tensors=None,
|
||||
**kwargs):
|
||||
"""
|
||||
@@ -718,9 +713,13 @@ class PreTrainedTokenizer(object):
|
||||
max_length: if set to a number, will limit the total sequence returned so that it has a maximum length.
|
||||
If there are overflowing tokens, those will be added to the returned dictionary
|
||||
stride: if set to a number along with max_length, the overflowing tokens returned will contain some tokens
|
||||
from the main sequence returned. The value of this argument defined the number of additional tokens.
|
||||
truncate_first_sequence: if there is a specified max_length, this flag will choose which sequence
|
||||
will be truncated.
|
||||
from the main sequence returned. The value of this argument defines the number of additional tokens.
|
||||
truncation_strategy: string selected in the following options:
|
||||
- 'longest_first' (default): Iteratively shorten the input sequences until the total length fits within max_length,
|
||||
removing one token at a time from the longest sequence (when there is a pair of input sequences)
|
||||
- 'only_first': Only truncate the first sequence
|
||||
- 'only_second': Only truncate the second sequence
|
||||
- 'do_not_truncate': Does not truncate (raises an error if the input sequence is longer than max_length)
|
||||
return_tensors: (optional) can be set to 'tf' or 'pt' to return respectively TensorFlow tf.constant
|
||||
or PyTorch torch.Tensor instead of a list of python integers.
|
||||
**kwargs: passed to the `self.tokenize()` method
|
||||
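Reviewer note: the four `truncation_strategy` values documented above replace the old boolean flag. Here is a minimal sketch of how they could behave, assuming that `longest_first` removes from the second sequence when both have the same length (a tie-breaking choice made for this sketch). `truncate_sequences` here is an illustrative free function, not the library method.

```python
def truncate_sequences(ids, pair_ids=None, num_tokens_to_remove=0,
                       truncation_strategy="longest_first"):
    """Return (ids, pair_ids) shortened by `num_tokens_to_remove` tokens."""
    ids = list(ids)
    pair_ids = list(pair_ids) if pair_ids is not None else None
    if num_tokens_to_remove <= 0:
        return ids, pair_ids
    if truncation_strategy == "longest_first":
        # Remove one token at a time from whichever sequence is longer.
        for _ in range(num_tokens_to_remove):
            if pair_ids is None or len(ids) > len(pair_ids):
                ids = ids[:-1]
            else:
                pair_ids = pair_ids[:-1]
    elif truncation_strategy == "only_first":
        ids = ids[:-num_tokens_to_remove]
    elif truncation_strategy == "only_second":
        if pair_ids is None:
            raise ValueError("'only_second' needs a second sequence")
        pair_ids = pair_ids[:-num_tokens_to_remove]
    elif truncation_strategy == "do_not_truncate":
        raise ValueError("Input sequences are longer than max_length")
    else:
        raise ValueError("Unknown truncation strategy: {}".format(truncation_strategy))
    return ids, pair_ids


first, second = [1, 2, 3, 4, 5], [6, 7, 8]
balanced = truncate_sequences(first, second, 3, "longest_first")  # ([1, 2, 3], [6, 7])
```

In the example, `longest_first` takes two tokens from the longer first sequence and then one from the second once the lengths are equal.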
@@ -730,7 +729,7 @@ class PreTrainedTokenizer(object):
|
||||
max_length=max_length,
|
||||
add_special_tokens=add_special_tokens,
|
||||
stride=stride,
|
||||
truncate_first_sequence=truncate_first_sequence,
|
||||
truncation_strategy=truncation_strategy,
|
||||
return_tensors=return_tensors,
|
||||
**kwargs)
|
||||
|
||||
@@ -742,7 +741,7 @@ class PreTrainedTokenizer(object):
|
||||
add_special_tokens=False,
|
||||
max_length=None,
|
||||
stride=0,
|
||||
truncate_first_sequence=True,
|
||||
truncation_strategy='longest_first',
|
||||
return_tensors=None,
|
||||
**kwargs):
|
||||
"""
|
||||
@@ -761,9 +760,13 @@ class PreTrainedTokenizer(object):
|
||||
max_length: if set to a number, will limit the total sequence returned so that it has a maximum length.
|
||||
If there are overflowing tokens, those will be added to the returned dictionary
|
||||
stride: if set to a number along with max_length, the overflowing tokens returned will contain some tokens
|
||||
from the main sequence returned. The value of this argument defined the number of additional tokens.
|
||||
truncate_first_sequence: if there is a specified max_length, this flag will choose which sequence
|
||||
will be truncated.
|
||||
from the main sequence returned. The value of this argument defines the number of additional tokens.
|
||||
truncation_strategy: string selected in the following options:
|
||||
- 'longest_first' (default): Iteratively remove tokens until the input fits within max_length,
|
||||
removing one token at a time from the longest sequence (when there is a pair of input sequences)
|
||||
- 'only_first': Only truncate the first sequence
|
||||
- 'only_second': Only truncate the second sequence
|
||||
- 'do_not_truncate': Does not truncate (raise an error if the input sequence is longer than max_length)
|
||||
return_tensors: (optional) can be set to 'tf' or 'pt' to return respectively TensorFlow tf.constant
|
||||
or PyTorch torch.Tensor instead of a list of python integers.
|
||||
**kwargs: passed to the `self.tokenize()` method
|
||||
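The hunks above replace the boolean `truncate_first_sequence` with the string-valued `truncation_strategy`. A minimal sketch of how a caller might translate the old flag is shown below. The helper name `strategy_from_legacy_flag` is hypothetical and not part of the library; the mapping is an assumption based on the removed docstring (`True` truncated the first sequence, `False` the second).

```python
def strategy_from_legacy_flag(truncate_first_sequence):
    """Hypothetical helper: map the removed boolean flag to a strategy name.

    Assumption: True meant "truncate the first sequence" and False meant
    "truncate the second sequence", as the removed docstring describes.
    """
    return "only_first" if truncate_first_sequence else "only_second"


# The strategy names accepted by the new argument, as listed in the docstring.
VALID_STRATEGIES = ("longest_first", "only_first", "only_second", "do_not_truncate")

legacy_default = strategy_from_legacy_flag(True)
print(legacy_default, legacy_default in VALID_STRATEGIES)
```

Note that the new default, `'longest_first'`, has no equivalent under the old flag, so callers that relied on the previous default may see different truncation unless they pass a strategy explicitly.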
@@ -787,12 +790,11 @@ class PreTrainedTokenizer(object):
|
||||
max_length=max_length,
|
||||
add_special_tokens=add_special_tokens,
|
||||
stride=stride,
|
||||
truncate_first_sequence=truncate_first_sequence,
|
||||
truncation_strategy=truncation_strategy,
|
||||
return_tensors=return_tensors)
|
||||
|
||||
|
||||
def prepare_for_model(self, ids, pair_ids=None, max_length=None, add_special_tokens=False, stride=0,
|
||||
truncate_first_sequence=True, return_tensors=None):
|
||||
truncation_strategy='longest_first', return_tensors=None):
|
||||
"""
|
||||
Prepares a sequence of input id, or a pair of sequences of inputs ids so that it can be used by the model.
|
||||
It adds special tokens, truncates
|
||||
@@ -809,41 +811,50 @@ class PreTrainedTokenizer(object):
|
||||
to their model.
|
||||
stride: window stride for overflowing tokens. Can be useful for edge effect removal when using sequential
|
||||
list of inputs.
|
||||
truncate_first_sequence: if set to `True` and an optional second list of input ids is provided,
|
||||
alongside a specified `max_length`, will truncate the first sequence if the total size is superior
|
||||
than the specified `max_length`. If set to `False`, will truncate the second sequence instead.
|
||||
truncation_strategy: string selected in the following options:
|
||||
- 'longest_first' (default): Iteratively remove tokens until the input fits within max_length,
|
||||
removing one token at a time from the longest sequence (when there is a pair of input sequences)
|
||||
- 'only_first': Only truncate the first sequence
|
||||
- 'only_second': Only truncate the second sequence
|
||||
- 'do_not_truncate': Does not truncate (raise an error if the input sequence is longer than max_length)
|
||||
return_tensors: (optional) can be set to 'tf' or 'pt' to return respectively TensorFlow tf.constant
|
||||
or PyTorch torch.Tensor instead of a list of python integers.
|
||||
|
||||
Return:
|
||||
a dictionary containing the `input_ids` as well as the `overflowing_tokens` if a `max_length` was given.
|
||||
A Dictionary of shape::
|
||||
|
||||
{
|
||||
input_ids: list[int],
|
||||
overflowing_tokens: list[int] if a ``max_length`` is specified, else None
|
||||
special_tokens_mask: list[int] if ``add_special_tokens`` is set to ``True``
|
||||
}
|
||||
|
||||
With the fields:
|
||||
``input_ids``: list of tokens to be fed to a model
|
||||
|
||||
``overflowing_tokens``: list of overflowing tokens if a max length is specified.
|
||||
|
||||
``special_tokens_mask``: if adding special tokens, this is a list of 0s and 1s, with 1 specifying special added
|
||||
tokens and 0 specifying sequence tokens.
|
||||
"""
|
||||
pair = bool(pair_ids is not None)
|
||||
len_ids = len(ids)
|
||||
len_pair_ids = len(pair_ids) if pair else 0
|
||||
|
||||
encoded_inputs = {}
|
||||
if max_length:
|
||||
n_added_tokens = self.num_added_tokens(pair=pair) if add_special_tokens else 0
|
||||
if pair and n_added_tokens + (len_pair_ids if truncate_first_sequence else len_ids) >= max_length:
|
||||
logger.warning(
|
||||
"You supplied a pair of sequence in which the sequence that will not be truncated is longer than the maximum specified length."
|
||||
"This pair of sequences will not be truncated.")
|
||||
else:
|
||||
if n_added_tokens + len_ids + len_pair_ids > max_length:
|
||||
if truncate_first_sequence or not pair:
|
||||
encoded_inputs["overflowing_tokens"] = ids[max_length - len_pair_ids - n_added_tokens - stride:]
|
||||
ids = ids[:max_length - len_pair_ids - n_added_tokens]
|
||||
elif not truncate_first_sequence and pair:
|
||||
encoded_inputs["overflowing_tokens"] = pair_ids[max_length - len_ids - n_added_tokens - stride:]
|
||||
pair_ids = pair_ids[:max_length - len_ids - n_added_tokens]
|
||||
else:
|
||||
logger.warning(
|
||||
"Cannot truncate second sequence as it is not provided. No truncation.")
|
||||
total_len = len_ids + len_pair_ids + (self.num_added_tokens(pair=pair) if add_special_tokens else 0)
|
||||
if max_length and total_len > max_length:
|
||||
ids, pair_ids, overflowing_tokens = self.truncate_sequences(ids, pair_ids=pair_ids,
|
||||
num_tokens_to_remove=total_len-max_length,
|
||||
truncation_strategy=truncation_strategy,
|
||||
stride=stride)
|
||||
encoded_inputs["overflowing_tokens"] = overflowing_tokens
|
||||
encoded_inputs["num_truncated_tokens"] = total_len - max_length
|
||||
|
||||
if add_special_tokens:
|
||||
sequence = self.add_special_tokens_sequence_pair(ids, pair_ids) if pair else self.add_special_tokens_single_sequence(ids)
|
||||
token_type_ids = self.create_token_type_ids_from_sequences(ids, pair_ids) if pair else [0] * len(sequence)
|
||||
sequence = self.build_inputs_with_special_tokens(ids, pair_ids)
|
||||
token_type_ids = self.create_token_type_ids_from_sequences(ids, pair_ids)
|
||||
encoded_inputs["special_tokens_mask"] = self.get_special_tokens_mask(ids, pair_ids)
|
||||
else:
|
||||
sequence = ids + pair_ids if pair else ids
|
||||
token_type_ids = [0] * len(ids) + ([1] * len(pair_ids) if pair else [])
|
||||
@@ -860,20 +871,89 @@ class PreTrainedTokenizer(object):
|
||||
encoded_inputs["input_ids"] = sequence
|
||||
encoded_inputs["token_type_ids"] = token_type_ids
|
||||
|
||||
if max_length and len(encoded_inputs["input_ids"]) > max_length:
|
||||
encoded_inputs["input_ids"] = encoded_inputs["input_ids"][:max_length]
|
||||
encoded_inputs["token_type_ids"] = encoded_inputs["token_type_ids"][:max_length]
|
||||
encoded_inputs["special_tokens_mask"] = encoded_inputs["special_tokens_mask"][:max_length]
|
||||
|
||||
return encoded_inputs
|
||||
|
||||
def create_token_type_ids_from_sequences(self, token_ids_0, token_ids_1):
|
||||
def truncate_sequences(self, ids, pair_ids=None, num_tokens_to_remove=0, truncation_strategy='longest_first', stride=0):
|
||||
"""Truncates a sequence or a sequence pair and returns the truncated sequences with the overflowing tokens.
|
||||
truncation_strategy: string selected in the following options:
|
||||
- 'longest_first' (default): Iteratively remove tokens until the input fits within max_length,
|
||||
removing one token at a time from the longest sequence (when there is a pair of input sequences).
|
||||
Overflowing tokens only contain overflow from the first sequence.
|
||||
- 'only_first': Only truncate the first sequence. Raises an error if the first sequence is shorter than or equal to num_tokens_to_remove.
|
||||
- 'only_second': Only truncate the second sequence
|
||||
- 'do_not_truncate': Does not truncate (raise an error if the input sequence is longer than max_length)
|
||||
"""
|
||||
if num_tokens_to_remove <= 0:
|
||||
return ids, pair_ids, []
|
||||
|
||||
if truncation_strategy == 'longest_first':
|
||||
overflowing_tokens = []
|
||||
for _ in range(num_tokens_to_remove):
|
||||
if pair_ids is None or len(ids) > len(pair_ids):
|
||||
overflowing_tokens = [ids[-1]] + overflowing_tokens
|
||||
ids = ids[:-1]
|
||||
else:
|
||||
pair_ids = pair_ids[:-1]
|
||||
window_len = min(len(ids), stride)
|
||||
if window_len > 0:
|
||||
overflowing_tokens = ids[-window_len:] + overflowing_tokens
|
||||
elif truncation_strategy == 'only_first':
|
||||
assert len(ids) > num_tokens_to_remove
|
||||
window_len = min(len(ids), stride + num_tokens_to_remove)
|
||||
overflowing_tokens = ids[-window_len:]
|
||||
ids = ids[:-num_tokens_to_remove]
|
||||
elif truncation_strategy == 'only_second':
|
||||
assert pair_ids is not None and len(pair_ids) > num_tokens_to_remove
|
||||
window_len = min(len(pair_ids), stride + num_tokens_to_remove)
|
||||
overflowing_tokens = pair_ids[-window_len:]
|
||||
pair_ids = pair_ids[:-num_tokens_to_remove]
|
||||
elif truncation_strategy == 'do_not_truncate':
|
||||
raise ValueError("Input sequence are too long for max_length. Please select a truncation strategy.")
|
||||
else:
|
||||
raise ValueError("Truncation_strategy should be selected in ['longest_first', 'only_first', 'only_second', 'do_not_truncate']")
|
||||
return (ids, pair_ids, overflowing_tokens)
|
||||
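To see how the strategies differ, here is a standalone sketch that mirrors the logic of `truncate_sequences` above so it can be run without a tokenizer. It is an illustration under assumptions, not the library's code: it is a free function rather than a method, it raises `ValueError` where the hunk uses `assert`, and it folds `'do_not_truncate'` into the generic error branch.

```python
def truncate_sequences(ids, pair_ids=None, num_tokens_to_remove=0,
                       truncation_strategy="longest_first", stride=0):
    """Return (ids, pair_ids, overflowing_tokens) after truncation."""
    if num_tokens_to_remove <= 0:
        return ids, pair_ids, []

    if truncation_strategy == "longest_first":
        overflowing_tokens = []
        for _ in range(num_tokens_to_remove):
            # Drop one token from whichever sequence is currently longer;
            # on a tie the second sequence loses the token.
            if pair_ids is None or len(ids) > len(pair_ids):
                overflowing_tokens = [ids[-1]] + overflowing_tokens
                ids = ids[:-1]
            else:
                pair_ids = pair_ids[:-1]
        # Prepend up to `stride` kept tokens as overlap with the main sequence.
        window_len = min(len(ids), stride)
        if window_len > 0:
            overflowing_tokens = ids[-window_len:] + overflowing_tokens
    elif truncation_strategy == "only_first":
        if len(ids) <= num_tokens_to_remove:
            raise ValueError("First sequence is too short to truncate.")
        window_len = min(len(ids), stride + num_tokens_to_remove)
        overflowing_tokens = ids[-window_len:]
        ids = ids[:-num_tokens_to_remove]
    elif truncation_strategy == "only_second":
        if pair_ids is None or len(pair_ids) <= num_tokens_to_remove:
            raise ValueError("Second sequence is missing or too short.")
        window_len = min(len(pair_ids), stride + num_tokens_to_remove)
        overflowing_tokens = pair_ids[-window_len:]
        pair_ids = pair_ids[:-num_tokens_to_remove]
    else:
        raise ValueError("Unsupported truncation_strategy: %r" % truncation_strategy)
    return ids, pair_ids, overflowing_tokens


# Equal-length pair: the second sequence loses a token first, then the first.
tied = truncate_sequences([1, 2, 3, 4], [5, 6, 7, 8], num_tokens_to_remove=2)
# 'only_first' with stride=1 keeps one token of overlap in the overflow.
strided = truncate_sequences([1, 2, 3, 4, 5, 6], None, num_tokens_to_remove=2,
                             truncation_strategy="only_first", stride=1)
print(tied)
print(strided)
```

In the tied case the token removed from the second sequence does not appear in the overflow, which matches the docstring's note that overflowing tokens only come from the first sequence under `'longest_first'`.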
|
||||
def create_token_type_ids_from_sequences(self, token_ids_0, token_ids_1=None):
|
||||
logger.warning("This tokenizer does not make use of special tokens.")
|
||||
if token_ids_1 is None:
|
||||
return len(token_ids_0) * [0]
|
||||
return [0] * len(token_ids_0) + [1] * len(token_ids_1)
|
||||
|
||||
def add_special_tokens_single_sequence(self, token_ids):
|
||||
logger.warning("This tokenizer does not make use of special tokens. The sequence has been returned with no modification.")
|
||||
return token_ids
|
||||
|
||||
def add_special_tokens_sequence_pair(self, token_ids_0, token_ids_1):
|
||||
logger.warning("This tokenizer does not make use of special tokens. The two sequences have been concatenated.")
|
||||
def build_inputs_with_special_tokens(self, token_ids_0, token_ids_1=None):
|
||||
"""
|
||||
Build model inputs from a sequence or a pair of sequences for sequence classification tasks
|
||||
by concatenating and adding special tokens.
|
||||
This base implementation adds no special tokens, so the input has the following format:
|
||||
single sequence: X
|
||||
pair of sequences: A B
|
||||
"""
|
||||
logger.warning("This tokenizer does not make use of special tokens. Input is returned with no modification.")
|
||||
if token_ids_1 is None:
|
||||
return token_ids_0
|
||||
return token_ids_0 + token_ids_1
|
||||
|
||||
def get_special_tokens_mask(self, token_ids_0, token_ids_1=None, already_has_special_tokens=False):
|
||||
"""
|
||||
Retrieves sequence ids from a token list that has no special tokens added. This method is called when adding
|
||||
special tokens using the tokenizer ``prepare_for_model`` or ``encode_plus`` methods.
|
||||
|
||||
Args:
|
||||
token_ids_0: list of ids (must not contain special tokens)
|
||||
token_ids_1: Optional list of ids (must not contain special tokens), necessary when fetching sequence ids
|
||||
for sequence pairs
|
||||
already_has_special_tokens: (default False) Set to True if the token list is already formatted with
|
||||
special tokens for the model
|
||||
|
||||
Returns:
|
||||
A list of integers in the range [0, 1]: 1 for a special token, 0 for a sequence token.
|
||||
"""
|
||||
return [0] * ((len(token_ids_1) if token_ids_1 else 0) + len(token_ids_0))
|
||||
|
||||
def convert_ids_to_tokens(self, ids, skip_special_tokens=False):
|
||||
""" Converts a single index or a sequence of indices (integers) in a token "
|
||||
(resp.) a sequence of tokens (str/unicode), using the vocabulary and added tokens.
|
||||
@@ -911,6 +991,11 @@ class PreTrainedTokenizer(object):
|
||||
Converts a sequence of ids (integer) in a string, using the tokenizer and vocabulary
|
||||
with options to remove special tokens and clean up tokenization spaces.
|
||||
Similar to doing ``self.convert_tokens_to_string(self.convert_ids_to_tokens(token_ids))``.
|
||||
|
||||
Args:
|
||||
token_ids: list of tokenized input ids. Can be obtained using the `encode` or `encode_plus` methods.
|
||||
skip_special_tokens: if set to True, will remove special tokens from the decoded output.
|
||||
clean_up_tokenization_spaces: if set to True, will clean up the tokenization spaces.
|
||||
"""
|
||||
filtered_tokens = self.convert_ids_to_tokens(token_ids, skip_special_tokens=skip_special_tokens)
|
||||
|
||||
@@ -933,20 +1018,11 @@ class PreTrainedTokenizer(object):
|
||||
sub_texts.append(self.convert_tokens_to_string(current_sub_text))
|
||||
text = ''.join(sub_texts)
|
||||
|
||||
if self._sep_token is not None and self._sep_token in text:
|
||||
text = text.replace(self._cls_token, self._sep_token)
|
||||
split_text = list(filter(lambda sentence: len(sentence) > 0, text.split(self._sep_token)))
|
||||
if clean_up_tokenization_spaces:
|
||||
clean_text = [self.clean_up_tokenization(text) for text in split_text]
|
||||
return clean_text
|
||||
else:
|
||||
return split_text
|
||||
if clean_up_tokenization_spaces:
|
||||
clean_text = self.clean_up_tokenization(text)
|
||||
return clean_text
|
||||
else:
|
||||
if clean_up_tokenization_spaces:
|
||||
clean_text = self.clean_up_tokenization(text)
|
||||
return clean_text
|
||||
else:
|
||||
return text
|
||||
return text
|
||||
|
||||
@property
|
||||
def special_tokens_map(self):
|
||||
|
||||
@@ -754,32 +754,59 @@ class XLMTokenizer(PreTrainedTokenizer):
|
||||
out_string = ''.join(tokens).replace('</w>', ' ').strip()
|
||||
return out_string
|
||||
|
||||
def add_special_tokens_single_sequence(self, token_ids):
|
||||
def build_inputs_with_special_tokens(self, token_ids_0, token_ids_1=None):
|
||||
"""
|
||||
Adds special tokens to a sequence for sequence classification tasks.
|
||||
An XLM sequence has the following format: [CLS] X [SEP]
|
||||
"""
|
||||
return [self.cls_token_id] + token_ids + [self.sep_token_id]
|
||||
|
||||
def add_special_tokens_sequence_pair(self, token_ids_0, token_ids_1):
|
||||
"""
|
||||
Adds special tokens to a sequence pair for sequence classification tasks.
|
||||
An XLM sequence pair has the following format: [CLS] A [SEP] B [SEP]
|
||||
Build model inputs from a sequence or a pair of sequences for sequence classification tasks
|
||||
by concatenating and adding special tokens.
|
||||
An XLM sequence has the following format:
|
||||
single sequence: [CLS] X [SEP]
|
||||
pair of sequences: [CLS] A [SEP] B [SEP]
|
||||
"""
|
||||
if token_ids_1 is None:
|
||||
return [self.cls_token_id] + token_ids_0 + [self.sep_token_id]
|
||||
sep = [self.sep_token_id]
|
||||
cls = [self.cls_token_id]
|
||||
return cls + token_ids_0 + sep + token_ids_1 + sep
|
||||
|
||||
def create_token_type_ids_from_sequences(self, token_ids_0, token_ids_1):
|
||||
def get_special_tokens_mask(self, token_ids_0, token_ids_1=None, already_has_special_tokens=False):
|
||||
"""
|
||||
Retrieves sequence ids from a token list that has no special tokens added. This method is called when adding
|
||||
special tokens using the tokenizer ``prepare_for_model`` or ``encode_plus`` methods.
|
||||
|
||||
Args:
|
||||
token_ids_0: list of ids (must not contain special tokens)
|
||||
token_ids_1: Optional list of ids (must not contain special tokens), necessary when fetching sequence ids
|
||||
for sequence pairs
|
||||
already_has_special_tokens: (default False) Set to True if the token list is already formatted with
|
||||
special tokens for the model
|
||||
|
||||
Returns:
|
||||
A list of integers in the range [0, 1]: 1 for a special token, 0 for a sequence token.
|
||||
"""
|
||||
|
||||
if already_has_special_tokens:
|
||||
if token_ids_1 is not None:
|
||||
raise ValueError("You should not supply a second sequence if the provided sequence of "
|
||||
"ids is already formated with special tokens for the model.")
|
||||
return list(map(lambda x: 1 if x in [self.sep_token_id, self.cls_token_id] else 0, token_ids_0))
|
||||
|
||||
if token_ids_1 is not None:
|
||||
return [1] + ([0] * len(token_ids_0)) + [1] + ([0] * len(token_ids_1)) + [1]
|
||||
return [1] + ([0] * len(token_ids_0)) + [1]
|
||||
|
||||
def create_token_type_ids_from_sequences(self, token_ids_0, token_ids_1=None):
|
||||
"""
|
||||
Creates a mask from the two sequences passed to be used in a sequence-pair classification task.
|
||||
An XLM sequence pair mask has the following format:
|
||||
0 0 0 0 0 0 0 0 0 0 1 1 1 1 1 1 1 1 1 1 1
|
||||
| first sequence | second sequence
|
||||
|
||||
if token_ids_1 is None, only returns the first portion of the mask (0's).
|
||||
"""
|
||||
sep = [self.sep_token_id]
|
||||
cls = [self.cls_token_id]
|
||||
|
||||
if token_ids_1 is None:
|
||||
return len(cls + token_ids_0 + sep) * [0]
|
||||
return len(cls + token_ids_0 + sep) * [0] + len(token_ids_1 + sep) * [1]
|
||||
|
||||
def save_vocabulary(self, save_directory):
|
||||
|
||||
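The three XLM methods in this hunk produce lists that must line up position by position. The sketch below reproduces their pair and single-sequence layouts as free functions to make that alignment visible. The ids `CLS_ID` and `SEP_ID` are placeholder values chosen for illustration, not the real XLM vocabulary ids, and the `already_has_special_tokens` branch is omitted.

```python
CLS_ID, SEP_ID = 0, 1  # placeholder ids, assumed for this example only


def build_inputs_with_special_tokens(token_ids_0, token_ids_1=None):
    # [CLS] A [SEP] or [CLS] A [SEP] B [SEP]
    if token_ids_1 is None:
        return [CLS_ID] + token_ids_0 + [SEP_ID]
    return [CLS_ID] + token_ids_0 + [SEP_ID] + token_ids_1 + [SEP_ID]


def get_special_tokens_mask(token_ids_0, token_ids_1=None):
    # 1 marks a special token, 0 marks a sequence token.
    if token_ids_1 is not None:
        return [1] + [0] * len(token_ids_0) + [1] + [0] * len(token_ids_1) + [1]
    return [1] + [0] * len(token_ids_0) + [1]


def create_token_type_ids_from_sequences(token_ids_0, token_ids_1=None):
    # Segment 0 covers [CLS] A [SEP]; segment 1 covers B [SEP].
    first = [0] * (len(token_ids_0) + 2)
    if token_ids_1 is None:
        return first
    return first + [1] * (len(token_ids_1) + 1)


a, b = [10, 11], [20]
inputs = build_inputs_with_special_tokens(a, b)
mask = get_special_tokens_mask(a, b)
type_ids = create_token_type_ids_from_sequences(a, b)
print(inputs)    # [0, 10, 11, 1, 20, 1]
print(mask)      # [1, 0, 0, 1, 0, 1]
print(type_ids)  # [0, 0, 0, 0, 1, 1]
```

All three lists have the same length, and every position where the mask is 1 holds either `CLS_ID` or `SEP_ID` in the inputs.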
@@ -181,36 +181,61 @@ class XLNetTokenizer(PreTrainedTokenizer):
|
||||
out_string = ''.join(tokens).replace(SPIECE_UNDERLINE, ' ').strip()
|
||||
return out_string
|
||||
|
||||
def add_special_tokens_single_sequence(self, token_ids):
|
||||
def build_inputs_with_special_tokens(self, token_ids_0, token_ids_1=None):
|
||||
"""
|
||||
Adds special tokens to a sequence for sequence classification tasks.
|
||||
An XLNet sequence has the following format: X [SEP][CLS]
|
||||
Build model inputs from a sequence or a pair of sequences for sequence classification tasks
|
||||
by concatenating and adding special tokens.
|
||||
An XLNet sequence has the following format:
|
||||
single sequence: X [SEP][CLS]
|
||||
pair of sequences: A [SEP] B [SEP][CLS]
|
||||
"""
|
||||
sep = [self.sep_token_id]
|
||||
cls = [self.cls_token_id]
|
||||
return token_ids + sep + cls
|
||||
|
||||
def add_special_tokens_sequence_pair(self, token_ids_0, token_ids_1):
|
||||
"""
|
||||
Adds special tokens to a sequence pair for sequence classification tasks.
|
||||
An XLNet sequence pair has the following format: A [SEP] B [SEP][CLS]
|
||||
"""
|
||||
|
||||
sep = [self.sep_token_id]
|
||||
cls = [self.cls_token_id]
|
||||
if token_ids_1 is None:
|
||||
return token_ids_0 + sep + cls
|
||||
return token_ids_0 + sep + token_ids_1 + sep + cls
|
||||
|
||||
def create_token_type_ids_from_sequences(self, token_ids_0, token_ids_1):
|
||||
def get_special_tokens_mask(self, token_ids_0, token_ids_1=None, already_has_special_tokens=False):
|
||||
"""
|
||||
Retrieves sequence ids from a token list that has no special tokens added. This method is called when adding
|
||||
special tokens using the tokenizer ``prepare_for_model`` or ``encode_plus`` methods.
|
||||
|
||||
Args:
|
||||
token_ids_0: list of ids (must not contain special tokens)
|
||||
token_ids_1: Optional list of ids (must not contain special tokens), necessary when fetching sequence ids
|
||||
for sequence pairs
|
||||
already_has_special_tokens: (default False) Set to True if the token list is already formatted with
|
||||
special tokens for the model
|
||||
|
||||
Returns:
|
||||
A list of integers in the range [0, 1]: 1 for a special token, 0 for a sequence token.
|
||||
"""
|
||||
|
||||
if already_has_special_tokens:
|
||||
if token_ids_1 is not None:
|
||||
raise ValueError("You should not supply a second sequence if the provided sequence of "
|
||||
"ids is already formated with special tokens for the model.")
|
||||
return list(map(lambda x: 1 if x in [self.sep_token_id, self.cls_token_id] else 0, token_ids_0))
|
||||
|
||||
if token_ids_1 is not None:
|
||||
return ([0] * len(token_ids_0)) + [1] + ([0] * len(token_ids_1)) + [1, 1]
|
||||
return ([0] * len(token_ids_0)) + [1, 1]
|
||||
|
||||
def create_token_type_ids_from_sequences(self, token_ids_0, token_ids_1=None):
|
||||
"""
|
||||
Creates a mask from the two sequences passed to be used in a sequence-pair classification task.
|
||||
An XLNet sequence pair mask has the following format:
|
||||
0 0 0 0 0 0 0 0 0 0 1 1 1 1 1 1 1 1 1 1 1 2
|
||||
| first sequence | second sequence | CLS segment ID
|
||||
|
||||
if token_ids_1 is None, only returns the first portion of the mask (0's).
|
||||
"""
|
||||
sep = [self.sep_token_id]
|
||||
cls = [self.cls_token_id]
|
||||
cls_segment_id = [2]
|
||||
|
||||
if token_ids_1 is None:
|
||||
return len(token_ids_0 + sep + cls) * [0]
|
||||
return len(token_ids_0 + sep) * [0] + len(token_ids_1 + sep) * [1] + cls_segment_id
|
||||
|
||||
def save_vocabulary(self, save_directory):
|
||||
|
||||
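XLNet differs from XLM in that the special tokens trail the sequences and the final `[CLS]` gets its own segment id in the pair case. The sketch below copies the XLNet layouts from the hunk above into free functions. `SEP_ID` and `CLS_ID` are placeholder values assumed for the example, not the real XLNet vocabulary ids.

```python
SEP_ID, CLS_ID = 4, 3  # placeholder ids, assumed for this example only


def build_inputs_with_special_tokens(token_ids_0, token_ids_1=None):
    # A [SEP][CLS] or A [SEP] B [SEP][CLS]
    if token_ids_1 is None:
        return token_ids_0 + [SEP_ID, CLS_ID]
    return token_ids_0 + [SEP_ID] + token_ids_1 + [SEP_ID, CLS_ID]


def create_token_type_ids_from_sequences(token_ids_0, token_ids_1=None):
    cls_segment_id = [2]
    if token_ids_1 is None:
        # Single sequence: every position, including [CLS], is segment 0.
        return [0] * (len(token_ids_0) + 2)
    return ([0] * (len(token_ids_0) + 1)
            + [1] * (len(token_ids_1) + 1)
            + cls_segment_id)


pair_inputs = build_inputs_with_special_tokens([10, 11], [20])
pair_types = create_token_type_ids_from_sequences([10, 11], [20])
single_types = create_token_type_ids_from_sequences([10, 11])
print(pair_inputs)   # [10, 11, 4, 20, 4, 3]
print(pair_types)    # [0, 0, 0, 1, 1, 2]
print(single_types)  # [0, 0, 0, 0]
```

As written in the hunk, the segment id 2 for `[CLS]` only appears in the pair branch; in the single-sequence branch the `[CLS]` position is assigned 0.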