Merge branch 'master' into distilbert-german
This commit is contained in:
@@ -23,3 +23,4 @@ deploy_doc "fe02e45" v1.1.0
|
|||||||
deploy_doc "89fd345" v1.2.0
|
deploy_doc "89fd345" v1.2.0
|
||||||
deploy_doc "fc9faa8" v2.0.0
|
deploy_doc "fc9faa8" v2.0.0
|
||||||
deploy_doc "3ddce1d" v2.1.1
|
deploy_doc "3ddce1d" v2.1.1
|
||||||
|
deploy_doc "3616209" v2.2.0
|
||||||
|
|||||||
@@ -106,7 +106,7 @@ Follow these steps to start contributing:
|
|||||||
```bash
|
```bash
|
||||||
$ git clone git@github.com:<your Github handle>/transformers.git
|
$ git clone git@github.com:<your Github handle>/transformers.git
|
||||||
$ cd transformers
|
$ cd transformers
|
||||||
$ git remote add upstream git@github.com:huggingface/transformers.git
|
$ git remote add upstream https://github.com/huggingface/transformers.git
|
||||||
```
|
```
|
||||||
|
|
||||||
3. Create a new branch to hold your development changes:
|
3. Create a new branch to hold your development changes:
|
||||||
|
|||||||
19
README.md
@@ -58,7 +58,7 @@ Choose the right framework for every part of a model's lifetime
|
|||||||
| [Quick tour: Fine-tuning/usage scripts](#quick-tour-of-the-fine-tuningusage-scripts) | Using provided scripts: GLUE, SQuAD and Text generation |
|
| [Quick tour: Fine-tuning/usage scripts](#quick-tour-of-the-fine-tuningusage-scripts) | Using provided scripts: GLUE, SQuAD and Text generation |
|
||||||
| [Migrating from pytorch-transformers to transformers](#Migrating-from-pytorch-transformers-to-transformers) | Migrating your code from pytorch-transformers to transformers |
|
| [Migrating from pytorch-transformers to transformers](#Migrating-from-pytorch-transformers-to-transformers) | Migrating your code from pytorch-transformers to transformers |
|
||||||
| [Migrating from pytorch-pretrained-bert to pytorch-transformers](#Migrating-from-pytorch-pretrained-bert-to-transformers) | Migrating your code from pytorch-pretrained-bert to transformers |
|
| [Migrating from pytorch-pretrained-bert to pytorch-transformers](#Migrating-from-pytorch-pretrained-bert-to-transformers) | Migrating your code from pytorch-pretrained-bert to transformers |
|
||||||
| [Documentation](https://huggingface.co/transformers/) [(v2.1.1)](https://huggingface.co/transformers/v2.1.1) [(v2.0.0)](https://huggingface.co/transformers/v2.0.0) [(v1.2.0)](https://huggingface.co/transformers/v1.2.0) [(v1.1.0)](https://huggingface.co/transformers/v1.1.0) [(v1.0.0)](https://huggingface.co/transformers/v1.0.0) | Full API documentation and more |
|
| [Documentation](https://huggingface.co/transformers/) [(v2.2.0)](https://huggingface.co/transformers/v2.2.0) [(v2.1.1)](https://huggingface.co/transformers/v2.1.1) [(v2.0.0)](https://huggingface.co/transformers/v2.0.0) [(v1.2.0)](https://huggingface.co/transformers/v1.2.0) [(v1.1.0)](https://huggingface.co/transformers/v1.1.0) [(v1.0.0)](https://huggingface.co/transformers/v1.0.0) [(master)](https://huggingface.co/transformers) | Full API documentation and more |
|
||||||
|
|
||||||
## Installation
|
## Installation
|
||||||
|
|
||||||
@@ -86,6 +86,17 @@ When TensorFlow 2.0 and/or PyTorch has been installed, you can install from sour
|
|||||||
pip install [--editable] .
|
pip install [--editable] .
|
||||||
```
|
```
|
||||||
|
|
||||||
|
### Run the examples
|
||||||
|
|
||||||
|
Examples are included in the repository but are not shipped with the library.
|
||||||
|
Therefore, in order to run the latest versions of the examples you also need to install from source. To do so, create a new virtual environment and follow these steps:
|
||||||
|
|
||||||
|
```bash
|
||||||
|
git clone https://github.com/huggingface/transformers
|
||||||
|
cd transformers
|
||||||
|
pip install [--editable] .
|
||||||
|
```
|
||||||
|
|
||||||
### Tests
|
### Tests
|
||||||
|
|
||||||
A series of tests are included for the library and the example scripts. Library tests can be found in the [tests folder](https://github.com/huggingface/transformers/tree/master/transformers/tests) and examples tests in the [examples folder](https://github.com/huggingface/transformers/tree/master/examples).
|
A series of tests are included for the library and the example scripts. Library tests can be found in the [tests folder](https://github.com/huggingface/transformers/tree/master/transformers/tests) and examples tests in the [examples folder](https://github.com/huggingface/transformers/tree/master/examples).
|
||||||
@@ -123,6 +134,7 @@ At some point in the future, you'll be able to seamlessly move from pre-training
|
|||||||
8. **[DistilBERT](https://github.com/huggingface/transformers/tree/master/examples/distillation)** (from HuggingFace), released together with the paper [DistilBERT, a distilled version of BERT: smaller, faster, cheaper and lighter](https://arxiv.org/abs/1910.01108) by Victor Sanh, Lysandre Debut and Thomas Wolf. The same method has been applied to compress GPT2 into [DistilGPT2](https://github.com/huggingface/transformers/tree/master/examples/distillation).
|
8. **[DistilBERT](https://github.com/huggingface/transformers/tree/master/examples/distillation)** (from HuggingFace), released together with the paper [DistilBERT, a distilled version of BERT: smaller, faster, cheaper and lighter](https://arxiv.org/abs/1910.01108) by Victor Sanh, Lysandre Debut and Thomas Wolf. The same method has been applied to compress GPT2 into [DistilGPT2](https://github.com/huggingface/transformers/tree/master/examples/distillation).
|
||||||
9. **[CTRL](https://github.com/salesforce/ctrl/)** (from Salesforce) released with the paper [CTRL: A Conditional Transformer Language Model for Controllable Generation](https://arxiv.org/abs/1909.05858) by Nitish Shirish Keskar*, Bryan McCann*, Lav R. Varshney, Caiming Xiong and Richard Socher.
|
9. **[CTRL](https://github.com/salesforce/ctrl/)** (from Salesforce) released with the paper [CTRL: A Conditional Transformer Language Model for Controllable Generation](https://arxiv.org/abs/1909.05858) by Nitish Shirish Keskar*, Bryan McCann*, Lav R. Varshney, Caiming Xiong and Richard Socher.
|
||||||
10. **[CamemBERT](https://camembert-model.fr)** (from Inria/Facebook/Sorbonne) released with the paper [CamemBERT: a Tasty French Language Model](https://arxiv.org/abs/1911.03894) by Louis Martin*, Benjamin Muller*, Pedro Javier Ortiz Suárez*, Yoann Dupont, Laurent Romary, Éric Villemonte de la Clergerie, Djamé Seddah and Benoît Sagot.
|
10. **[CamemBERT](https://camembert-model.fr)** (from Inria/Facebook/Sorbonne) released with the paper [CamemBERT: a Tasty French Language Model](https://arxiv.org/abs/1911.03894) by Louis Martin*, Benjamin Muller*, Pedro Javier Ortiz Suárez*, Yoann Dupont, Laurent Romary, Éric Villemonte de la Clergerie, Djamé Seddah and Benoît Sagot.
|
||||||
|
11. **[ALBERT](https://github.com/google-research/google-research/tree/master/albert)** (from Google Research and the Toyota Technological Institute at Chicago) released with the paper [ALBERT: A Lite BERT for Self-supervised Learning of Language Representations](https://arxiv.org/abs/1909.11942), by Zhenzhong Lan, Mingda Chen, Sebastian Goodman, Kevin Gimpel, Piyush Sharma, Radu Soricut.
|
||||||
11. Want to contribute a new model? We have added a **detailed guide and templates** to guide you in the process of adding a new model. You can find them in the [`templates`](./templates) folder of the repository. Be sure to check the [contributing guidelines](./CONTRIBUTING.md) and contact the maintainers or open an issue to collect feedback before starting your PR.
|
11. Want to contribute a new model? We have added a **detailed guide and templates** to guide you in the process of adding a new model. You can find them in the [`templates`](./templates) folder of the repository. Be sure to check the [contributing guidelines](./CONTRIBUTING.md) and contact the maintainers or open an issue to collect feedback before starting your PR.
|
||||||
|
|
||||||
These implementations have been tested on several datasets (see the example scripts) and should match the performance of the original implementations (e.g. ~93 F1 on SQuAD for BERT Whole-Word-Masking, ~88 F1 on RocStories for OpenAI GPT, ~18.3 perplexity on WikiText 103 for Transformer-XL, ~0.916 Pearson R coefficient on STS-B for XLNet). You can find more details on performance in the Examples section of the [documentation](https://huggingface.co/transformers/examples.html).
|
These implementations have been tested on several datasets (see the example scripts) and should match the performance of the original implementations (e.g. ~93 F1 on SQuAD for BERT Whole-Word-Masking, ~88 F1 on RocStories for OpenAI GPT, ~18.3 perplexity on WikiText 103 for Transformer-XL, ~0.916 Pearson R coefficient on STS-B for XLNet). You can find more details on performance in the Examples section of the [documentation](https://huggingface.co/transformers/examples.html).
|
||||||
@@ -253,6 +265,11 @@ print("sentence_2 is", "a paraphrase" if pred_2 else "not a paraphrase", "of sen
|
|||||||
|
|
||||||
## Quick tour of the fine-tuning/usage scripts
|
## Quick tour of the fine-tuning/usage scripts
|
||||||
|
|
||||||
|
**Important**
|
||||||
|
Before running the fine-tuning scripts, please read the
|
||||||
|
[instructions](#run-the-examples) on how to
|
||||||
|
set up your environment to run the examples.
|
||||||
|
|
||||||
The library comprises several example scripts with SOTA performances for NLU and NLG tasks:
|
The library comprises several example scripts with SOTA performances for NLU and NLG tasks:
|
||||||
|
|
||||||
- `run_glue.py`: an example fine-tuning Bert, XLNet and XLM on nine different GLUE tasks (*sequence-level classification*)
|
- `run_glue.py`: an example fine-tuning Bert, XLNet and XLM on nine different GLUE tasks (*sequence-level classification*)
|
||||||
|
|||||||
22
deploy_multi_version_doc.sh
Normal file
@@ -0,0 +1,22 @@
|
|||||||
|
cd docs
|
||||||
|
|
||||||
|
function deploy_doc(){
|
||||||
|
echo "Creating doc at commit $1 and pushing to folder $2"
|
||||||
|
git checkout "$1"
|
||||||
|
if [ ! -z "$2" ]
|
||||||
|
then
|
||||||
|
echo "Pushing version $2"
|
||||||
|
make clean && make html && scp -r -oStrictHostKeyChecking=no _build/html $doc:$dir/$2
|
||||||
|
else
|
||||||
|
echo "Pushing master"
|
||||||
|
make clean && make html && scp -r -oStrictHostKeyChecking=no _build/html/* $doc:$dir
|
||||||
|
fi
|
||||||
|
}
|
||||||
|
|
||||||
|
deploy_doc "master"
|
||||||
|
deploy_doc "b33a385" v1.0.0
|
||||||
|
deploy_doc "fe02e45" v1.1.0
|
||||||
|
deploy_doc "89fd345" v1.2.0
|
||||||
|
deploy_doc "fc9faa8" v2.0.0
|
||||||
|
deploy_doc "3ddce1d" v2.1.1
|
||||||
|
deploy_doc "f2f3294" v2.2.0
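The branching inside `deploy_doc` can be checked without touching git or the documentation server. The following is a minimal dry-run sketch in Python, not part of the commit: the helper names `scp_source_and_target` and `plan` are hypothetical, and `$doc` and `$dir` are kept as literal placeholders because the script does not show where they are set.

```python
from typing import List, Optional, Tuple

# (commit, folder) pairs copied from the deploy_doc calls in the script.
# A folder of None corresponds to the call with no second argument (master).
VERSIONS: List[Tuple[str, Optional[str]]] = [
    ("master", None),
    ("b33a385", "v1.0.0"),
    ("fe02e45", "v1.1.0"),
    ("89fd345", "v1.2.0"),
    ("fc9faa8", "v2.0.0"),
    ("3ddce1d", "v2.1.1"),
    ("f2f3294", "v2.2.0"),
]


def scp_source_and_target(
    folder: Optional[str], doc: str = "$doc", directory: str = "$dir"
) -> Tuple[str, str]:
    """Mirror the if/else in deploy_doc (hypothetical helper).

    With a folder, the whole html directory is copied into a versioned
    subfolder; without one, the contents are copied to the site root.
    """
    if folder:
        return "_build/html", f"{doc}:{directory}/{folder}"
    return "_build/html/*", f"{doc}:{directory}"


def plan(versions: List[Tuple[str, Optional[str]]]) -> List[str]:
    """Return the shell commands the script would run, without running them."""
    steps: List[str] = []
    for commit, folder in versions:
        source, target = scp_source_and_target(folder)
        steps.append(f"git checkout {commit}")
        steps.append(
            "make clean && make html && "
            f"scp -r -oStrictHostKeyChecking=no {source} {target}"
        )
    return steps


if __name__ == "__main__":
    for step in plan(VERSIONS):
        print(step)
```

Printing the plan first makes it easy to confirm that only the `master` build lands at the site root while every tagged build goes to its own folder.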
|
||||||
@@ -1,5 +1,5 @@
|
|||||||
function addIcon() {
|
function addIcon() {
|
||||||
const huggingFaceLogo = "https://huggingface.co/assets/transformers-docs/huggingface_logo.svg";
|
const huggingFaceLogo = "https://huggingface.co/landing/assets/transformers-docs/huggingface_logo.svg";
|
||||||
const image = document.createElement("img");
|
const image = document.createElement("img");
|
||||||
image.setAttribute("src", huggingFaceLogo);
|
image.setAttribute("src", huggingFaceLogo);
|
||||||
|
|
||||||
@@ -24,10 +24,10 @@ function addCustomFooter() {
|
|||||||
social.classList.add("footer__Social");
|
social.classList.add("footer__Social");
|
||||||
|
|
||||||
const imageDetails = [
|
const imageDetails = [
|
||||||
{ link: "https://huggingface.co", imageLink: "https://huggingface.co/assets/transformers-docs/website.svg" },
|
{ link: "https://huggingface.co", imageLink: "https://huggingface.co/landing/assets/transformers-docs/website.svg" },
|
||||||
{ link: "https://twitter.com/huggingface", imageLink: "https://huggingface.co/assets/transformers-docs/twitter.svg" },
|
{ link: "https://twitter.com/huggingface", imageLink: "https://huggingface.co/landing/assets/transformers-docs/twitter.svg" },
|
||||||
{ link: "https://github.com/huggingface", imageLink: "https://huggingface.co/assets/transformers-docs/github.svg" },
|
{ link: "https://github.com/huggingface", imageLink: "https://huggingface.co/landing/assets/transformers-docs/github.svg" },
|
||||||
{ link: "https://www.linkedin.com/company/huggingface/", imageLink: "https://huggingface.co/assets/transformers-docs/linkedin.svg" }
|
{ link: "https://www.linkedin.com/company/huggingface/", imageLink: "https://huggingface.co/landing/assets/transformers-docs/linkedin.svg" }
|
||||||
];
|
];
|
||||||
|
|
||||||
imageDetails.forEach(imageLinks => {
|
imageDetails.forEach(imageLinks => {
|
||||||
|
|||||||
@@ -26,7 +26,7 @@ author = u'huggingface'
|
|||||||
# The short X.Y version
|
# The short X.Y version
|
||||||
version = u''
|
version = u''
|
||||||
# The full version, including alpha/beta/rc tags
|
# The full version, including alpha/beta/rc tags
|
||||||
release = u'2.1.1'
|
release = u'2.2.0'
|
||||||
|
|
||||||
|
|
||||||
# -- General configuration ---------------------------------------------------
|
# -- General configuration ---------------------------------------------------
|
||||||
|
|||||||
@@ -47,6 +47,9 @@ The library currently contains PyTorch and Tensorflow implementations, pre-train
|
|||||||
6. `XLM <https://github.com/facebookresearch/XLM>`_ (from Facebook) released together with the paper `Cross-lingual Language Model Pretraining <https://arxiv.org/abs/1901.07291>`_ by Guillaume Lample and Alexis Conneau.
|
6. `XLM <https://github.com/facebookresearch/XLM>`_ (from Facebook) released together with the paper `Cross-lingual Language Model Pretraining <https://arxiv.org/abs/1901.07291>`_ by Guillaume Lample and Alexis Conneau.
|
||||||
7. `RoBERTa <https://github.com/pytorch/fairseq/tree/master/examples/roberta>`_ (from Facebook), released together with the paper a `Robustly Optimized BERT Pretraining Approach <https://arxiv.org/abs/1907.11692>`_ by Yinhan Liu, Myle Ott, Naman Goyal, Jingfei Du, Mandar Joshi, Danqi Chen, Omer Levy, Mike Lewis, Luke Zettlemoyer, Veselin Stoyanov.
|
7. `RoBERTa <https://github.com/pytorch/fairseq/tree/master/examples/roberta>`_ (from Facebook), released together with the paper a `Robustly Optimized BERT Pretraining Approach <https://arxiv.org/abs/1907.11692>`_ by Yinhan Liu, Myle Ott, Naman Goyal, Jingfei Du, Mandar Joshi, Danqi Chen, Omer Levy, Mike Lewis, Luke Zettlemoyer, Veselin Stoyanov.
|
||||||
8. `DistilBERT <https://huggingface.co/transformers/model_doc/distilbert.html>`_ (from HuggingFace) released together with the paper `DistilBERT, a distilled version of BERT: smaller, faster, cheaper and lighter <https://arxiv.org/abs/1910.01108>`_ by Victor Sanh, Lysandre Debut and Thomas Wolf. The same method has been applied to compress GPT2 into `DistilGPT2 <https://github.com/huggingface/transformers/tree/master/examples/distillation>`_.
|
8. `DistilBERT <https://huggingface.co/transformers/model_doc/distilbert.html>`_ (from HuggingFace) released together with the paper `DistilBERT, a distilled version of BERT: smaller, faster, cheaper and lighter <https://arxiv.org/abs/1910.01108>`_ by Victor Sanh, Lysandre Debut and Thomas Wolf. The same method has been applied to compress GPT2 into `DistilGPT2 <https://github.com/huggingface/transformers/tree/master/examples/distillation>`_.
|
||||||
|
9. `CTRL <https://github.com/salesforce/ctrl>`_ (from Salesforce), released together with the paper `CTRL: A Conditional Transformer Language Model for Controllable Generation <https://arxiv.org/abs/1909.05858>`_ by Nitish Shirish Keskar*, Bryan McCann*, Lav R. Varshney, Caiming Xiong and Richard Socher.
|
||||||
|
10. `CamemBERT <https://huggingface.co/transformers/model_doc/camembert.html>`_ (from FAIR, Inria, Sorbonne Université) released together with the paper `CamemBERT: a Tasty French Language Model <https://arxiv.org/abs/1911.03894>`_ by Louis Martin, Benjamin Muller, Pedro Javier Ortiz Suarez, Yoann Dupont, Laurent Romary, Eric Villemonte de la Clergerie, Djame Seddah, and Benoît Sagot.
|
||||||
|
11. `ALBERT <https://github.com/google-research/google-research/tree/master/albert>`_ (from Google Research), released together with the paper `ALBERT: A Lite BERT for Self-supervised Learning of Language Representations <https://arxiv.org/abs/1909.11942>`_ by Zhenzhong Lan, Mingda Chen, Sebastian Goodman, Kevin Gimpel, Piyush Sharma, Radu Soricut.
|
||||||
|
|
||||||
.. toctree::
|
.. toctree::
|
||||||
:maxdepth: 2
|
:maxdepth: 2
|
||||||
@@ -89,3 +92,5 @@ The library currently contains PyTorch and Tensorflow implementations, pre-train
|
|||||||
model_doc/roberta
|
model_doc/roberta
|
||||||
model_doc/distilbert
|
model_doc/distilbert
|
||||||
model_doc/ctrl
|
model_doc/ctrl
|
||||||
|
model_doc/camembert
|
||||||
|
model_doc/albert
|
||||||
|
|||||||
@@ -55,4 +55,27 @@ Example usage
|
|||||||
^^^^^^^^^^^^^^^^^^^^^^^^^
|
^^^^^^^^^^^^^^^^^^^^^^^^^
|
||||||
|
|
||||||
An example using these processors is given in the
|
An example using these processors is given in the
|
||||||
`run_glue.py <https://github.com/huggingface/pytorch-transformers/blob/master/examples/run_glue.py>`__ script.
|
`run_glue.py <https://github.com/huggingface/pytorch-transformers/blob/master/examples/run_glue.py>`__ script.
|
||||||
|
|
||||||
|
|
||||||
|
XNLI
|
||||||
|
~~~~~~~~~~~~~~~~~~~~~
|
||||||
|
|
||||||
|
`The Cross-Lingual NLI Corpus (XNLI) <https://www.nyu.edu/projects/bowman/xnli/>`__ is a benchmark that evaluates
|
||||||
|
the quality of cross-lingual text representations.
|
||||||
|
XNLI is a crowd-sourced dataset based on `MultiNLI <http://www.nyu.edu/projects/bowman/multinli/>`__: pairs of text are labeled with textual entailment
|
||||||
|
annotations for 15 different languages (including both high-resource languages such as English and low-resource languages such as Swahili).
|
||||||
|
|
||||||
|
It was released together with the paper
|
||||||
|
`XNLI: Evaluating Cross-lingual Sentence Representations <https://arxiv.org/abs/1809.05053>`__
|
||||||
|
|
||||||
|
This library hosts the processor to load the XNLI data:
|
||||||
|
- :class:`~transformers.data.processors.utils.XnliProcessor`
|
||||||
|
|
||||||
|
Please note that since the gold labels are available on the test set, evaluation is performed on the test set.
|
||||||
|
|
||||||
|
Example usage
|
||||||
|
^^^^^^^^^^^^^^^^^^^^^^^^^
|
||||||
|
|
||||||
|
An example using these processors is given in the
|
||||||
|
`run_xnli.py <https://github.com/huggingface/transformers/blob/master/examples/run_xnli.py>`__ script.
|
||||||
64
docs/source/model_doc/albert.rst
Normal file
@@ -0,0 +1,64 @@
|
|||||||
|
ALBERT
|
||||||
|
----------------------------------------------------
|
||||||
|
|
||||||
|
``AlbertConfig``
|
||||||
|
~~~~~~~~~~~~~~~~~~~~~
|
||||||
|
|
||||||
|
.. autoclass:: transformers.AlbertConfig
|
||||||
|
:members:
|
||||||
|
|
||||||
|
|
||||||
|
``AlbertTokenizer``
|
||||||
|
~~~~~~~~~~~~~~~~~~~~~
|
||||||
|
|
||||||
|
.. autoclass:: transformers.AlbertTokenizer
|
||||||
|
:members:
|
||||||
|
|
||||||
|
|
||||||
|
``AlbertModel``
|
||||||
|
~~~~~~~~~~~~~~~~~~~~
|
||||||
|
|
||||||
|
.. autoclass:: transformers.AlbertModel
|
||||||
|
:members:
|
||||||
|
|
||||||
|
|
||||||
|
``AlbertForMaskedLM``
|
||||||
|
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
|
||||||
|
|
||||||
|
.. autoclass:: transformers.AlbertForMaskedLM
|
||||||
|
:members:
|
||||||
|
|
||||||
|
|
||||||
|
``AlbertForSequenceClassification``
|
||||||
|
~~~~~~~~~~~~~~~~~~~~~~~~~~
|
||||||
|
|
||||||
|
.. autoclass:: transformers.AlbertForSequenceClassification
|
||||||
|
:members:
|
||||||
|
|
||||||
|
|
||||||
|
``AlbertForQuestionAnswering``
|
||||||
|
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
|
||||||
|
|
||||||
|
.. autoclass:: transformers.AlbertForQuestionAnswering
|
||||||
|
:members:
|
||||||
|
|
||||||
|
|
||||||
|
``TFAlbertModel``
|
||||||
|
~~~~~~~~~~~~~~~~~~~~
|
||||||
|
|
||||||
|
.. autoclass:: transformers.TFAlbertModel
|
||||||
|
:members:
|
||||||
|
|
||||||
|
|
||||||
|
``TFAlbertForMaskedLM``
|
||||||
|
~~~~~~~~~~~~~~~~~~~~~~~~~~
|
||||||
|
|
||||||
|
.. autoclass:: transformers.TFAlbertForMaskedLM
|
||||||
|
:members:
|
||||||
|
|
||||||
|
|
||||||
|
``TFAlbertForSequenceClassification``
|
||||||
|
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
|
||||||
|
|
||||||
|
.. autoclass:: transformers.TFAlbertForSequenceClassification
|
||||||
|
:members:
|
||||||
50
docs/source/model_doc/camembert.rst
Normal file
@@ -0,0 +1,50 @@
|
|||||||
|
CamemBERT
|
||||||
|
----------------------------------------------------
|
||||||
|
|
||||||
|
``CamembertConfig``
|
||||||
|
~~~~~~~~~~~~~~~~~~~~~
|
||||||
|
|
||||||
|
.. autoclass:: transformers.CamembertConfig
|
||||||
|
:members:
|
||||||
|
|
||||||
|
|
||||||
|
``CamembertTokenizer``
|
||||||
|
~~~~~~~~~~~~~~~~~~~~~~~~~~
|
||||||
|
|
||||||
|
.. autoclass:: transformers.CamembertTokenizer
|
||||||
|
:members:
|
||||||
|
|
||||||
|
|
||||||
|
``CamembertModel``
|
||||||
|
~~~~~~~~~~~~~~~~~~~~
|
||||||
|
|
||||||
|
.. autoclass:: transformers.CamembertModel
|
||||||
|
:members:
|
||||||
|
|
||||||
|
|
||||||
|
``CamembertForMaskedLM``
|
||||||
|
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
|
||||||
|
|
||||||
|
.. autoclass:: transformers.CamembertForMaskedLM
|
||||||
|
:members:
|
||||||
|
|
||||||
|
|
||||||
|
``CamembertForSequenceClassification``
|
||||||
|
~~~~~~~~~~~~~~~~~~~~~~~~~~
|
||||||
|
|
||||||
|
.. autoclass:: transformers.CamembertForSequenceClassification
|
||||||
|
:members:
|
||||||
|
|
||||||
|
|
||||||
|
``CamembertForMultipleChoice``
|
||||||
|
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
|
||||||
|
|
||||||
|
.. autoclass:: transformers.CamembertForMultipleChoice
|
||||||
|
:members:
|
||||||
|
|
||||||
|
|
||||||
|
``CamembertForTokenClassification``
|
||||||
|
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
|
||||||
|
|
||||||
|
.. autoclass:: transformers.CamembertForTokenClassification
|
||||||
|
:members:
|
||||||
@@ -163,5 +163,38 @@ Here is the full list of the currently provided pretrained models together with
|
|||||||
| | | | CamemBERT using the BERT-base architecture |
|
| | | | CamemBERT using the BERT-base architecture |
|
||||||
| | | (see `details <https://github.com/pytorch/fairseq/tree/master/examples/camembert>`__) |
|
| | | (see `details <https://github.com/pytorch/fairseq/tree/master/examples/camembert>`__) |
|
||||||
+-------------------+------------------------------------------------------------+---------------------------------------------------------------------------------------------------------------------------------------+
|
+-------------------+------------------------------------------------------------+---------------------------------------------------------------------------------------------------------------------------------------+
|
||||||
|
| ALBERT | ``albert-base-v1`` | | 12 repeating layers, 128 embedding, 768-hidden, 12-heads, 11M parameters |
|
||||||
|
| | | | ALBERT base model |
|
||||||
|
| | | (see `details <https://github.com/google-research/google-research/tree/master/albert>`__) |
|
||||||
|
| +------------------------------------------------------------+---------------------------------------------------------------------------------------------------------------------------------------+
|
||||||
|
| | ``albert-large-v1`` | | 24 repeating layers, 128 embedding, 1024-hidden, 16-heads, 17M parameters |
|
||||||
|
| | | | ALBERT large model |
|
||||||
|
| | | (see `details <https://github.com/google-research/google-research/tree/master/albert>`__) |
|
||||||
|
| +------------------------------------------------------------+---------------------------------------------------------------------------------------------------------------------------------------+
|
||||||
|
| | ``albert-xlarge-v1`` | | 24 repeating layers, 128 embedding, 2048-hidden, 16-heads, 58M parameters |
|
||||||
|
| | | | ALBERT xlarge model |
|
||||||
|
| | | (see `details <https://github.com/google-research/google-research/tree/master/albert>`__) |
|
||||||
|
| +------------------------------------------------------------+---------------------------------------------------------------------------------------------------------------------------------------+
|
||||||
|
| | ``albert-xxlarge-v1`` | | 12 repeating layers, 128 embedding, 4096-hidden, 64-heads, 223M parameters |
|
||||||
|
| | | | ALBERT xxlarge model |
|
||||||
|
| | | (see `details <https://github.com/google-research/google-research/tree/master/albert>`__) |
|
||||||
|
| +------------------------------------------------------------+---------------------------------------------------------------------------------------------------------------------------------------+
|
||||||
|
| | ``albert-base-v2`` | | 12 repeating layers, 128 embedding, 768-hidden, 12-heads, 11M parameters |
|
||||||
|
| | | | ALBERT base model with no dropout, additional training data and longer training |
|
||||||
|
| | | (see `details <https://github.com/google-research/google-research/tree/master/albert>`__) |
|
||||||
|
| +------------------------------------------------------------+---------------------------------------------------------------------------------------------------------------------------------------+
|
||||||
|
| | ``albert-large-v2`` | | 24 repeating layers, 128 embedding, 1024-hidden, 16-heads, 17M parameters |
|
||||||
|
| | | | ALBERT large model with no dropout, additional training data and longer training |
|
||||||
|
| | | (see `details <https://github.com/google-research/google-research/tree/master/albert>`__) |
|
||||||
|
| +------------------------------------------------------------+---------------------------------------------------------------------------------------------------------------------------------------+
|
||||||
|
| | ``albert-xlarge-v2`` | | 24 repeating layers, 128 embedding, 2048-hidden, 16-heads, 58M parameters |
|
||||||
|
| | | | ALBERT xlarge model with no dropout, additional training data and longer training |
|
||||||
|
| | | (see `details <https://github.com/google-research/google-research/tree/master/albert>`__) |
|
||||||
|
| +------------------------------------------------------------+---------------------------------------------------------------------------------------------------------------------------------------+
|
||||||
|
| | ``albert-xxlarge-v2`` | | 12 repeating layers, 128 embedding, 4096-hidden, 64-heads, 223M parameters |
|
||||||
|
| | | | ALBERT xxlarge model with no dropout, additional training data and longer training |
|
||||||
|
| | | (see `details <https://github.com/google-research/google-research/tree/master/albert>`__) |
|
||||||
|
+-------------------+------------------------------------------------------------+---------------------------------------------------------------------------------------------------------------------------------------+
|
||||||
|
|
||||||
|
|
||||||
.. <https://huggingface.co/transformers/examples.html>`__
|
.. <https://huggingface.co/transformers/examples.html>`__
|
||||||
|
|||||||
@@ -3,6 +3,15 @@
|
|||||||
In this section a few examples are put together. All of these examples work for several models, making use of the very
|
In this section a few examples are put together. All of these examples work for several models, making use of the very
|
||||||
similar API between the different models.
|
similar API between the different models.
|
||||||
|
|
||||||
|
**Important**
|
||||||
|
To run the latest versions of the examples, you have to install from source. Execute the following steps in a new virtual environment:
|
||||||
|
|
||||||
|
```bash
|
||||||
|
git clone https://github.com/huggingface/transformers
|
||||||
|
cd transformers
|
||||||
|
pip install [--editable] .
|
||||||
|
```
|
||||||
|
|
||||||
| Section | Description |
|
| Section | Description |
|
||||||
|----------------------------|------------------------------------------------------------------------------------------------------------------------------------------------------------|
|
|----------------------------|------------------------------------------------------------------------------------------------------------------------------------------------------------|
|
||||||
| [TensorFlow 2.0 models on GLUE](#TensorFlow-2.0-Bert-models-on-GLUE) | Examples running BERT TensorFlow 2.0 model on the GLUE tasks.
|
| [TensorFlow 2.0 models on GLUE](#TensorFlow-2.0-Bert-models-on-GLUE) | Examples running BERT TensorFlow 2.0 model on the GLUE tasks.
|
||||||
@@ -12,6 +21,7 @@ similar API between the different models.
|
|||||||
| [SQuAD](#squad) | Using BERT/RoBERTa/XLNet/XLM for question answering, examples with distributed training. |
|
| [SQuAD](#squad) | Using BERT/RoBERTa/XLNet/XLM for question answering, examples with distributed training. |
|
||||||
| [Multiple Choice](#multiple-choice) | Examples running BERT/XLNet/RoBERTa on the SWAG/RACE/ARC tasks.
|
| [Multiple Choice](#multiple-choice) | Examples running BERT/XLNet/RoBERTa on the SWAG/RACE/ARC tasks.
|
||||||
| [Named Entity Recognition](#named-entity-recognition) | Using BERT for Named Entity Recognition (NER) on the CoNLL 2003 dataset, examples with distributed training. |
|
| [Named Entity Recognition](#named-entity-recognition) | Using BERT for Named Entity Recognition (NER) on the CoNLL 2003 dataset, examples with distributed training. |
|
||||||
|
| [XNLI](#xnli) | Examples running BERT/XLM on the XNLI benchmark. |
|
||||||
| [Abstractive summarization](#abstractive-summarization) | Fine-tuning the library models for abstractive summarization tasks on the CNN/Daily Mail dataset. |
|
| [Abstractive summarization](#abstractive-summarization) | Fine-tuning the library models for abstractive summarization tasks on the CNN/Daily Mail dataset. |
|
||||||
|
|
||||||
## TensorFlow 2.0 Bert models on GLUE
|
## TensorFlow 2.0 Bert models on GLUE
|
||||||
@@ -591,3 +601,43 @@ python run_summarization_finetuning.py \
|
|||||||
--do_train \
|
--do_train \
|
||||||
--data_path=$DATA_PATH \
|
--data_path=$DATA_PATH \
|
||||||
```
|
```
|
||||||
|
|
||||||
|
## XNLI
|
||||||
|
|
||||||
|
Based on the script [`run_xnli.py`](https://github.com/huggingface/transformers/blob/master/examples/run_xnli.py).
|
||||||
|
|
||||||
|
[XNLI](https://www.nyu.edu/projects/bowman/xnli/) is a crowd-sourced dataset based on [MultiNLI](http://www.nyu.edu/projects/bowman/multinli/). It is an evaluation benchmark for cross-lingual text representations. Pairs of text are labeled with textual entailment annotations for 15 different languages (including both high-resource languages such as English and low-resource languages such as Swahili).
|
||||||
|
|
||||||
|
#### Fine-tuning on XNLI
|
||||||
|
|
||||||
|
This example code fine-tunes mBERT (multi-lingual BERT) on the XNLI dataset. It runs in 106 mins
|
||||||
|
on a single Tesla V100 16GB. The data for XNLI can be downloaded from the following links; both archives should be saved (and unzipped) in a
|
||||||
|
`$XNLI_DIR` directory.
|
||||||
|
|
||||||
|
* [XNLI 1.0](https://www.nyu.edu/projects/bowman/xnli/XNLI-1.0.zip)
|
||||||
|
* [XNLI-MT 1.0](https://www.nyu.edu/projects/bowman/xnli/XNLI-MT-1.0.zip)
|
||||||
|
|
||||||
|
```bash
|
||||||
|
export XNLI_DIR=/path/to/XNLI
|
||||||
|
|
||||||
|
python run_xnli.py \
|
||||||
|
--model_type bert \
|
||||||
|
--model_name_or_path bert-base-multilingual-cased \
|
||||||
|
--language de \
|
||||||
|
--train_language en \
|
||||||
|
--do_train \
|
||||||
|
--do_eval \
|
||||||
|
--data_dir $XNLI_DIR \
|
||||||
|
--per_gpu_train_batch_size 32 \
|
||||||
|
--learning_rate 5e-5 \
|
||||||
|
--num_train_epochs 2.0 \
|
||||||
|
--max_seq_length 128 \
|
||||||
|
--output_dir /tmp/debug_xnli/ \
|
||||||
|
--save_steps -1
|
||||||
|
```
|
||||||
|
|
||||||
|
Training with the previously defined hyper-parameters yields the following results on the **test** set:
|
||||||
|
|
||||||
|
```bash
|
||||||
|
acc = 0.7093812375249501
|
||||||
|
```
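
For orientation, the reported `acc` value is a plain classification accuracy: the evaluation code in `run_xnli.py` takes the arg-max over the logits and compares it with the gold labels. The sketch below restates that computation in pure Python on made-up toy values; the function name `simple_accuracy` and the data are illustrative assumptions, and the script itself uses NumPy and `xnli_compute_metrics`.

```python
def simple_accuracy(logits, labels):
    """Fraction of rows whose arg-max class index equals the gold label."""
    preds = [max(range(len(row)), key=row.__getitem__) for row in logits]
    correct = sum(int(p == y) for p, y in zip(preds, labels))
    return correct / len(labels)


# Toy logits for 4 examples and 3 classes (made-up values, not XNLI data).
logits = [
    [2.0, 0.1, 0.3],
    [0.2, 1.5, 0.1],
    [0.3, 0.2, 0.9],
    [1.1, 0.4, 0.2],
]
labels = [0, 1, 2, 1]

print("acc =", simple_accuracy(logits, labels))  # prints: acc = 0.75
```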
|
||||||
|
|||||||
@@ -47,7 +47,11 @@ from transformers import (WEIGHTS_NAME, BertConfig,
|
|||||||
XLNetTokenizer,
|
XLNetTokenizer,
|
||||||
DistilBertConfig,
|
DistilBertConfig,
|
||||||
DistilBertForSequenceClassification,
|
DistilBertForSequenceClassification,
|
||||||
DistilBertTokenizer)
|
DistilBertTokenizer,
|
||||||
|
AlbertConfig,
|
||||||
|
AlbertForSequenceClassification,
|
||||||
|
AlbertTokenizer,
|
||||||
|
)
|
||||||
|
|
||||||
from transformers import AdamW, get_linear_schedule_with_warmup
|
from transformers import AdamW, get_linear_schedule_with_warmup
|
||||||
|
|
||||||
@@ -66,7 +70,8 @@ MODEL_CLASSES = {
|
|||||||
'xlnet': (XLNetConfig, XLNetForSequenceClassification, XLNetTokenizer),
|
'xlnet': (XLNetConfig, XLNetForSequenceClassification, XLNetTokenizer),
|
||||||
'xlm': (XLMConfig, XLMForSequenceClassification, XLMTokenizer),
|
'xlm': (XLMConfig, XLMForSequenceClassification, XLMTokenizer),
|
||||||
'roberta': (RobertaConfig, RobertaForSequenceClassification, RobertaTokenizer),
|
'roberta': (RobertaConfig, RobertaForSequenceClassification, RobertaTokenizer),
|
||||||
'distilbert': (DistilBertConfig, DistilBertForSequenceClassification, DistilBertTokenizer)
|
'distilbert': (DistilBertConfig, DistilBertForSequenceClassification, DistilBertTokenizer),
|
||||||
|
'albert': (AlbertConfig, AlbertForSequenceClassification, AlbertTokenizer)
|
||||||
}
|
}
|
||||||
|
|
||||||
|
|
||||||
@@ -99,6 +104,7 @@ def train(args, train_dataset, model, tokenizer):
|
|||||||
{'params': [p for n, p in model.named_parameters() if not any(nd in n for nd in no_decay)], 'weight_decay': args.weight_decay},
|
{'params': [p for n, p in model.named_parameters() if not any(nd in n for nd in no_decay)], 'weight_decay': args.weight_decay},
|
||||||
{'params': [p for n, p in model.named_parameters() if any(nd in n for nd in no_decay)], 'weight_decay': 0.0}
|
{'params': [p for n, p in model.named_parameters() if any(nd in n for nd in no_decay)], 'weight_decay': 0.0}
|
||||||
]
|
]
|
||||||
|
|
||||||
optimizer = AdamW(optimizer_grouped_parameters, lr=args.learning_rate, eps=args.adam_epsilon)
|
optimizer = AdamW(optimizer_grouped_parameters, lr=args.learning_rate, eps=args.adam_epsilon)
|
||||||
scheduler = get_linear_schedule_with_warmup(optimizer, num_warmup_steps=args.warmup_steps, num_training_steps=t_total)
|
scheduler = get_linear_schedule_with_warmup(optimizer, num_warmup_steps=args.warmup_steps, num_training_steps=t_total)
|
||||||
if args.fp16:
|
if args.fp16:
|
||||||
@@ -158,7 +164,7 @@ def train(args, train_dataset, model, tokenizer):
|
|||||||
loss.backward()
|
loss.backward()
|
||||||
|
|
||||||
tr_loss += loss.item()
|
tr_loss += loss.item()
|
||||||
if (step + 1) % args.gradient_accumulation_steps == 0 and not args.tpu:
|
if (step + 1) % args.gradient_accumulation_steps == 0:
|
||||||
if args.fp16:
|
if args.fp16:
|
||||||
torch.nn.utils.clip_grad_norm_(amp.master_params(optimizer), args.max_grad_norm)
|
torch.nn.utils.clip_grad_norm_(amp.master_params(optimizer), args.max_grad_norm)
|
||||||
else:
|
else:
|
||||||
@@ -189,11 +195,6 @@ def train(args, train_dataset, model, tokenizer):
|
|||||||
torch.save(args, os.path.join(output_dir, 'training_args.bin'))
|
torch.save(args, os.path.join(output_dir, 'training_args.bin'))
|
||||||
logger.info("Saving model checkpoint to %s", output_dir)
|
logger.info("Saving model checkpoint to %s", output_dir)
|
||||||
|
|
||||||
if args.tpu:
|
|
||||||
args.xla_model.optimizer_step(optimizer, barrier=True)
|
|
||||||
model.zero_grad()
|
|
||||||
global_step += 1
|
|
||||||
|
|
||||||
if args.max_steps > 0 and global_step > args.max_steps:
|
if args.max_steps > 0 and global_step > args.max_steps:
|
||||||
epoch_iterator.close()
|
epoch_iterator.close()
|
||||||
break
|
break
|
||||||
@@ -322,7 +323,7 @@ def load_and_cache_examples(args, task, tokenizer, evaluate=False):
|
|||||||
all_labels = torch.tensor([f.label for f in features], dtype=torch.long)
|
all_labels = torch.tensor([f.label for f in features], dtype=torch.long)
|
||||||
elif output_mode == "regression":
|
elif output_mode == "regression":
|
||||||
all_labels = torch.tensor([f.label for f in features], dtype=torch.float)
|
all_labels = torch.tensor([f.label for f in features], dtype=torch.float)
|
||||||
|
|
||||||
dataset = TensorDataset(all_input_ids, all_attention_mask, all_token_type_ids, all_labels)
|
dataset = TensorDataset(all_input_ids, all_attention_mask, all_token_type_ids, all_labels)
|
||||||
return dataset
|
return dataset
|
||||||
|
|
||||||
@@ -366,7 +367,7 @@ def main():
|
|||||||
parser.add_argument("--per_gpu_eval_batch_size", default=8, type=int,
|
parser.add_argument("--per_gpu_eval_batch_size", default=8, type=int,
|
||||||
help="Batch size per GPU/CPU for evaluation.")
|
help="Batch size per GPU/CPU for evaluation.")
|
||||||
parser.add_argument('--gradient_accumulation_steps', type=int, default=1,
|
parser.add_argument('--gradient_accumulation_steps', type=int, default=1,
|
||||||
help="Number of updates steps to accumulate before performing a backward/update pass.")
|
help="Number of updates steps to accumulate before performing a backward/update pass.")
|
||||||
parser.add_argument("--learning_rate", default=5e-5, type=float,
|
parser.add_argument("--learning_rate", default=5e-5, type=float,
|
||||||
help="The initial learning rate for Adam.")
|
help="The initial learning rate for Adam.")
|
||||||
parser.add_argument("--weight_decay", default=0.0, type=float,
|
parser.add_argument("--weight_decay", default=0.0, type=float,
|
||||||
@@ -397,15 +398,6 @@ def main():
|
|||||||
parser.add_argument('--seed', type=int, default=42,
|
parser.add_argument('--seed', type=int, default=42,
|
||||||
help="random seed for initialization")
|
help="random seed for initialization")
|
||||||
|
|
||||||
parser.add_argument('--tpu', action='store_true',
|
|
||||||
help="Whether to run on the TPU defined in the environment variables")
|
|
||||||
parser.add_argument('--tpu_ip_address', type=str, default='',
|
|
||||||
help="TPU IP address if none are set in the environment variables")
|
|
||||||
parser.add_argument('--tpu_name', type=str, default='',
|
|
||||||
help="TPU name if none are set in the environment variables")
|
|
||||||
parser.add_argument('--xrt_tpu_config', type=str, default='',
|
|
||||||
help="XRT TPU config if none are set in the environment variables")
|
|
||||||
|
|
||||||
parser.add_argument('--fp16', action='store_true',
|
parser.add_argument('--fp16', action='store_true',
|
||||||
help="Whether to use 16-bit (mixed) precision (through NVIDIA apex) instead of 32-bit")
|
help="Whether to use 16-bit (mixed) precision (through NVIDIA apex) instead of 32-bit")
|
||||||
parser.add_argument('--fp16_opt_level', type=str, default='O1',
|
parser.add_argument('--fp16_opt_level', type=str, default='O1',
|
||||||
@@ -439,23 +431,6 @@ def main():
|
|||||||
args.n_gpu = 1
|
args.n_gpu = 1
|
||||||
args.device = device
|
args.device = device
|
||||||
|
|
||||||
if args.tpu:
|
|
||||||
if args.tpu_ip_address:
|
|
||||||
os.environ["TPU_IP_ADDRESS"] = args.tpu_ip_address
|
|
||||||
if args.tpu_name:
|
|
||||||
os.environ["TPU_NAME"] = args.tpu_name
|
|
||||||
if args.xrt_tpu_config:
|
|
||||||
os.environ["XRT_TPU_CONFIG"] = args.xrt_tpu_config
|
|
||||||
|
|
||||||
assert "TPU_IP_ADDRESS" in os.environ
|
|
||||||
assert "TPU_NAME" in os.environ
|
|
||||||
assert "XRT_TPU_CONFIG" in os.environ
|
|
||||||
|
|
||||||
import torch_xla
|
|
||||||
import torch_xla.core.xla_model as xm
|
|
||||||
args.device = xm.xla_device()
|
|
||||||
args.xla_model = xm
|
|
||||||
|
|
||||||
# Setup logging
|
# Setup logging
|
||||||
logging.basicConfig(format = '%(asctime)s - %(levelname)s - %(name)s - %(message)s',
|
logging.basicConfig(format = '%(asctime)s - %(levelname)s - %(name)s - %(message)s',
|
||||||
datefmt = '%m/%d/%Y %H:%M:%S',
|
datefmt = '%m/%d/%Y %H:%M:%S',
|
||||||
@@ -509,7 +484,7 @@ def main():
|
|||||||
|
|
||||||
|
|
||||||
# Saving best-practices: if you use defaults names for the model, you can reload it using from_pretrained()
|
# Saving best-practices: if you use defaults names for the model, you can reload it using from_pretrained()
|
||||||
if args.do_train and (args.local_rank == -1 or torch.distributed.get_rank() == 0) and not args.tpu:
|
if args.do_train and (args.local_rank == -1 or torch.distributed.get_rank() == 0):
|
||||||
# Create output directory if needed
|
# Create output directory if needed
|
||||||
if not os.path.exists(args.output_dir) and args.local_rank in [-1, 0]:
|
if not os.path.exists(args.output_dir) and args.local_rank in [-1, 0]:
|
||||||
os.makedirs(args.output_dir)
|
os.makedirs(args.output_dir)
|
||||||
|
|||||||
@@ -68,7 +68,7 @@ class TextDataset(Dataset):
|
|||||||
directory, filename = os.path.split(file_path)
|
directory, filename = os.path.split(file_path)
|
||||||
cached_features_file = os.path.join(directory, args.model_name_or_path + '_cached_lm_' + str(block_size) + '_' + filename)
|
cached_features_file = os.path.join(directory, args.model_name_or_path + '_cached_lm_' + str(block_size) + '_' + filename)
|
||||||
|
|
||||||
if os.path.exists(cached_features_file):
|
if os.path.exists(cached_features_file) and not args.overwrite_cache:
|
||||||
logger.info("Loading features from cached file %s", cached_features_file)
|
logger.info("Loading features from cached file %s", cached_features_file)
|
||||||
with open(cached_features_file, 'rb') as handle:
|
with open(cached_features_file, 'rb') as handle:
|
||||||
self.examples = pickle.load(handle)
|
self.examples = pickle.load(handle)
|
||||||
@@ -215,6 +215,7 @@ def train(args, train_dataset, model, tokenizer):
|
|||||||
|
|
||||||
global_step = 0
|
global_step = 0
|
||||||
tr_loss, logging_loss = 0.0, 0.0
|
tr_loss, logging_loss = 0.0, 0.0
|
||||||
|
model.resize_token_embeddings(len(tokenizer))
|
||||||
model.zero_grad()
|
model.zero_grad()
|
||||||
train_iterator = trange(int(args.num_train_epochs), desc="Epoch", disable=args.local_rank not in [-1, 0])
|
train_iterator = trange(int(args.num_train_epochs), desc="Epoch", disable=args.local_rank not in [-1, 0])
|
||||||
set_seed(args) # Added here for reproducibility (even between python 2 and 3)
|
set_seed(args) # Added here for reproducibility (even between python 2 and 3)
|
||||||
|
|||||||
@@ -37,6 +37,7 @@ from transformers import AdamW, get_linear_schedule_with_warmup
|
|||||||
from transformers import WEIGHTS_NAME, BertConfig, BertForTokenClassification, BertTokenizer
|
from transformers import WEIGHTS_NAME, BertConfig, BertForTokenClassification, BertTokenizer
|
||||||
from transformers import RobertaConfig, RobertaForTokenClassification, RobertaTokenizer
|
from transformers import RobertaConfig, RobertaForTokenClassification, RobertaTokenizer
|
||||||
from transformers import DistilBertConfig, DistilBertForTokenClassification, DistilBertTokenizer
|
from transformers import DistilBertConfig, DistilBertForTokenClassification, DistilBertTokenizer
|
||||||
|
from transformers import CamembertConfig, CamembertForTokenClassification, CamembertTokenizer
|
||||||
|
|
||||||
logger = logging.getLogger(__name__)
|
logger = logging.getLogger(__name__)
|
||||||
|
|
||||||
@@ -47,7 +48,8 @@ ALL_MODELS = sum(
|
|||||||
MODEL_CLASSES = {
|
MODEL_CLASSES = {
|
||||||
"bert": (BertConfig, BertForTokenClassification, BertTokenizer),
|
"bert": (BertConfig, BertForTokenClassification, BertTokenizer),
|
||||||
"roberta": (RobertaConfig, RobertaForTokenClassification, RobertaTokenizer),
|
"roberta": (RobertaConfig, RobertaForTokenClassification, RobertaTokenizer),
|
||||||
"distilbert": (DistilBertConfig, DistilBertForTokenClassification, DistilBertTokenizer)
|
"distilbert": (DistilBertConfig, DistilBertForTokenClassification, DistilBertTokenizer),
|
||||||
|
"camembert": (CamembertConfig, CamembertForTokenClassification, CamembertTokenizer),
|
||||||
}
|
}
|
||||||
|
|
||||||
|
|
||||||
@@ -125,7 +127,7 @@ def train(args, train_dataset, model, tokenizer, labels, pad_token_label_id):
|
|||||||
"attention_mask": batch[1],
|
"attention_mask": batch[1],
|
||||||
"labels": batch[3]}
|
"labels": batch[3]}
|
||||||
if args.model_type != "distilbert":
|
if args.model_type != "distilbert":
|
||||||
inputs["token_type_ids"]: batch[2] if args.model_type in ["bert", "xlnet"] else None # XLM and RoBERTa don"t use segment_ids
|
inputs["token_type_ids"] = batch[2] if args.model_type in ["bert", "xlnet"] else None # XLM and RoBERTa don"t use segment_ids
|
||||||
|
|
||||||
outputs = model(**inputs)
|
outputs = model(**inputs)
|
||||||
loss = outputs[0] # model outputs are always tuple in pytorch-transformers (see doc)
|
loss = outputs[0] # model outputs are always tuple in pytorch-transformers (see doc)
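
The one-character change in this hunk (`:` to `=`) matters because `target[key]: expr` is an annotation statement in Python, not an assignment, so the old line never stored `token_type_ids` in `inputs`. A small self-contained illustration with made-up values:

```python
inputs = {"input_ids": [101, 102]}

# Old form: an annotation without a value. Python evaluates the pieces
# but does not call __setitem__, so nothing is stored in the dict.
inputs["token_type_ids"]: [0, 0]
stored_by_annotation = "token_type_ids" in inputs

# Fixed form: a real assignment.
inputs["token_type_ids"] = [0, 0]
stored_by_assignment = "token_type_ids" in inputs

print(stored_by_annotation, stored_by_assignment)  # prints: False True
```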
|
||||||
@@ -215,7 +217,7 @@ def evaluate(args, model, tokenizer, labels, pad_token_label_id, mode, prefix=""
|
|||||||
"attention_mask": batch[1],
|
"attention_mask": batch[1],
|
||||||
"labels": batch[3]}
|
"labels": batch[3]}
|
||||||
if args.model_type != "distilbert":
|
if args.model_type != "distilbert":
|
||||||
inputs["token_type_ids"]: batch[2] if args.model_type in ["bert", "xlnet"] else None # XLM and RoBERTa don"t use segment_ids
|
inputs["token_type_ids"] = batch[2] if args.model_type in ["bert", "xlnet"] else None # XLM and RoBERTa don"t use segment_ids
|
||||||
outputs = model(**inputs)
|
outputs = model(**inputs)
|
||||||
tmp_eval_loss, logits = outputs[:2]
|
tmp_eval_loss, logits = outputs[:2]
|
||||||
|
|
||||||
|
|||||||
@@ -43,7 +43,8 @@ from transformers import (WEIGHTS_NAME, BertConfig,
|
|||||||
XLMTokenizer, XLNetConfig,
|
XLMTokenizer, XLNetConfig,
|
||||||
XLNetForQuestionAnswering,
|
XLNetForQuestionAnswering,
|
||||||
XLNetTokenizer,
|
XLNetTokenizer,
|
||||||
DistilBertConfig, DistilBertForQuestionAnswering, DistilBertTokenizer)
|
DistilBertConfig, DistilBertForQuestionAnswering, DistilBertTokenizer,
|
||||||
|
AlbertConfig, AlbertForQuestionAnswering, AlbertTokenizer)
|
||||||
|
|
||||||
from transformers import AdamW, get_linear_schedule_with_warmup
|
from transformers import AdamW, get_linear_schedule_with_warmup
|
||||||
|
|
||||||
@@ -65,7 +66,8 @@ MODEL_CLASSES = {
|
|||||||
'bert': (BertConfig, BertForQuestionAnswering, BertTokenizer),
|
'bert': (BertConfig, BertForQuestionAnswering, BertTokenizer),
|
||||||
'xlnet': (XLNetConfig, XLNetForQuestionAnswering, XLNetTokenizer),
|
'xlnet': (XLNetConfig, XLNetForQuestionAnswering, XLNetTokenizer),
|
||||||
'xlm': (XLMConfig, XLMForQuestionAnswering, XLMTokenizer),
|
'xlm': (XLMConfig, XLMForQuestionAnswering, XLMTokenizer),
|
||||||
'distilbert': (DistilBertConfig, DistilBertForQuestionAnswering, DistilBertTokenizer)
|
'distilbert': (DistilBertConfig, DistilBertForQuestionAnswering, DistilBertTokenizer),
|
||||||
|
'albert': (AlbertConfig, AlbertForQuestionAnswering, AlbertTokenizer)
|
||||||
}
|
}
|
||||||
|
|
||||||
def set_seed(args):
|
def set_seed(args):
|
||||||
@@ -128,7 +130,7 @@ def train(args, train_dataset, model, tokenizer):
|
|||||||
logger.info(" Gradient Accumulation steps = %d", args.gradient_accumulation_steps)
|
logger.info(" Gradient Accumulation steps = %d", args.gradient_accumulation_steps)
|
||||||
logger.info(" Total optimization steps = %d", t_total)
|
logger.info(" Total optimization steps = %d", t_total)
|
||||||
|
|
||||||
global_step = 0
|
global_step = 1
|
||||||
tr_loss, logging_loss = 0.0, 0.0
|
tr_loss, logging_loss = 0.0, 0.0
|
||||||
model.zero_grad()
|
model.zero_grad()
|
||||||
train_iterator = trange(int(args.num_train_epochs), desc="Epoch", disable=args.local_rank not in [-1, 0])
|
train_iterator = trange(int(args.num_train_epochs), desc="Epoch", disable=args.local_rank not in [-1, 0])
|
||||||
@@ -537,7 +539,7 @@ def main():
|
|||||||
torch.save(args, os.path.join(args.output_dir, 'training_args.bin'))
|
torch.save(args, os.path.join(args.output_dir, 'training_args.bin'))
|
||||||
|
|
||||||
# Load a trained model and vocabulary that you have fine-tuned
|
# Load a trained model and vocabulary that you have fine-tuned
|
||||||
model = model_class.from_pretrained(args.output_dir)
|
model = model_class.from_pretrained(args.output_dir, force_download=True)
|
||||||
tokenizer = tokenizer_class.from_pretrained(args.output_dir, do_lower_case=args.do_lower_case)
|
tokenizer = tokenizer_class.from_pretrained(args.output_dir, do_lower_case=args.do_lower_case)
|
||||||
model.to(args.device)
|
model.to(args.device)
|
||||||
|
|
||||||
@@ -555,7 +557,7 @@ def main():
|
|||||||
for checkpoint in checkpoints:
|
for checkpoint in checkpoints:
|
||||||
# Reload the model
|
# Reload the model
|
||||||
global_step = checkpoint.split('-')[-1] if len(checkpoints) > 1 else ""
|
global_step = checkpoint.split('-')[-1] if len(checkpoints) > 1 else ""
|
||||||
model = model_class.from_pretrained(checkpoint)
|
model = model_class.from_pretrained(checkpoint, force_download=True)
|
||||||
model.to(args.device)
|
model.to(args.device)
|
||||||
|
|
||||||
# Evaluate
|
# Evaluate
|
||||||
|
|||||||
515
examples/run_xnli.py
Normal file
@@ -0,0 +1,515 @@
|
|||||||
|
# coding=utf-8
|
||||||
|
# Copyright 2018 The Google AI Language Team Authors and The HuggingFace Inc. team.
|
||||||
|
# Copyright (c) 2018, NVIDIA CORPORATION. All rights reserved.
|
||||||
|
#
|
||||||
|
# Licensed under the Apache License, Version 2.0 (the "License");
|
||||||
|
# you may not use this file except in compliance with the License.
|
||||||
|
# You may obtain a copy of the License at
|
||||||
|
#
|
||||||
|
# http://www.apache.org/licenses/LICENSE-2.0
|
||||||
|
#
|
||||||
|
# Unless required by applicable law or agreed to in writing, software
|
||||||
|
# distributed under the License is distributed on an "AS IS" BASIS,
|
||||||
|
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
|
||||||
|
# See the License for the specific language governing permissions and
|
||||||
|
# limitations under the License.
|
||||||
|
""" Finetuning multi-lingual models on XNLI (Bert, DistilBERT, XLM).
|
||||||
|
Adapted from `examples/run_glue.py`"""
|
||||||
|
|
||||||
|
from __future__ import absolute_import, division, print_function
|
||||||
|
|
||||||
|
import argparse
|
||||||
|
import glob
|
||||||
|
import logging
|
||||||
|
import os
|
||||||
|
import random
|
||||||
|
|
||||||
|
import numpy as np
|
||||||
|
import torch
|
||||||
|
from torch.utils.data import (DataLoader, RandomSampler, SequentialSampler,
|
||||||
|
TensorDataset)
|
||||||
|
from torch.utils.data.distributed import DistributedSampler
|
||||||
|
|
||||||
|
try:
|
||||||
|
from torch.utils.tensorboard import SummaryWriter
|
||||||
|
except ImportError:
|
||||||
|
from tensorboardX import SummaryWriter
|
||||||
|
|
||||||
|
from tqdm import tqdm, trange
|
||||||
|
|
||||||
|
from transformers import (WEIGHTS_NAME,
|
||||||
|
BertConfig, BertForSequenceClassification, BertTokenizer,
|
||||||
|
XLMConfig, XLMForSequenceClassification, XLMTokenizer,
|
||||||
|
DistilBertConfig, DistilBertForSequenceClassification, DistilBertTokenizer)
|
||||||
|
|
||||||
|
from transformers import AdamW, get_linear_schedule_with_warmup
|
||||||
|
|
||||||
|
from transformers import xnli_compute_metrics as compute_metrics
|
||||||
|
from transformers import xnli_output_modes as output_modes
|
||||||
|
from transformers import xnli_processors as processors
|
||||||
|
|
||||||
|
from transformers import glue_convert_examples_to_features as convert_examples_to_features
|
||||||
|
|
||||||
|
logger = logging.getLogger(__name__)
|
||||||
|
|
||||||
|
ALL_MODELS = sum((tuple(conf.pretrained_config_archive_map.keys()) for conf in (BertConfig, DistilBertConfig, XLMConfig)), ())
|
||||||
|
|
||||||
|
MODEL_CLASSES = {
|
||||||
|
'bert': (BertConfig, BertForSequenceClassification, BertTokenizer),
|
||||||
|
'xlm': (XLMConfig, XLMForSequenceClassification, XLMTokenizer),
|
||||||
|
'distilbert': (DistilBertConfig, DistilBertForSequenceClassification, DistilBertTokenizer)
|
||||||
|
}
|
||||||
|
|
||||||
|
|
||||||
|
def set_seed(args):
|
||||||
|
random.seed(args.seed)
|
||||||
|
np.random.seed(args.seed)
|
||||||
|
torch.manual_seed(args.seed)
|
||||||
|
if args.n_gpu > 0:
|
||||||
|
torch.cuda.manual_seed_all(args.seed)
|
||||||
|
|
||||||
|
|
||||||
|
def train(args, train_dataset, model, tokenizer):
|
||||||
|
""" Train the model """
|
||||||
|
if args.local_rank in [-1, 0]:
|
||||||
|
tb_writer = SummaryWriter()
|
||||||
|
|
||||||
|
args.train_batch_size = args.per_gpu_train_batch_size * max(1, args.n_gpu)
|
||||||
|
train_sampler = RandomSampler(train_dataset) if args.local_rank == -1 else DistributedSampler(train_dataset)
|
||||||
|
train_dataloader = DataLoader(train_dataset, sampler=train_sampler, batch_size=args.train_batch_size)
|
||||||
|
|
||||||
|
if args.max_steps > 0:
|
||||||
|
t_total = args.max_steps
|
||||||
|
args.num_train_epochs = args.max_steps // (len(train_dataloader) // args.gradient_accumulation_steps) + 1
|
||||||
|
else:
|
||||||
|
t_total = len(train_dataloader) // args.gradient_accumulation_steps * args.num_train_epochs
|
||||||
|
|
||||||
|
# Prepare optimizer and schedule (linear warmup and decay)
|
||||||
|
no_decay = ['bias', 'LayerNorm.weight']
|
||||||
|
optimizer_grouped_parameters = [
|
||||||
|
{'params': [p for n, p in model.named_parameters() if not any(nd in n for nd in no_decay)], 'weight_decay': args.weight_decay},
|
||||||
|
{'params': [p for n, p in model.named_parameters() if any(nd in n for nd in no_decay)], 'weight_decay': 0.0}
|
||||||
|
]
|
||||||
|
optimizer = AdamW(optimizer_grouped_parameters, lr=args.learning_rate, eps=args.adam_epsilon)
|
||||||
|
scheduler = get_linear_schedule_with_warmup(optimizer, num_warmup_steps=args.warmup_steps, num_training_steps=t_total)
|
||||||
|
if args.fp16:
|
||||||
|
try:
|
||||||
|
from apex import amp
|
||||||
|
except ImportError:
|
||||||
|
raise ImportError("Please install apex from https://www.github.com/nvidia/apex to use fp16 training.")
|
||||||
|
model, optimizer = amp.initialize(model, optimizer, opt_level=args.fp16_opt_level)
|
||||||
|
|
||||||
|
# multi-gpu training (should be after apex fp16 initialization)
|
||||||
|
if args.n_gpu > 1:
|
||||||
|
model = torch.nn.DataParallel(model)
|
||||||
|
|
||||||
|
# Distributed training (should be after apex fp16 initialization)
|
||||||
|
if args.local_rank != -1:
|
||||||
|
model = torch.nn.parallel.DistributedDataParallel(model, device_ids=[args.local_rank],
|
||||||
|
output_device=args.local_rank,
|
||||||
|
find_unused_parameters=True)
|
||||||
|
|
||||||
|
# Train!
|
||||||
|
logger.info("***** Running training *****")
|
||||||
|
logger.info(" Num examples = %d", len(train_dataset))
|
||||||
|
logger.info(" Num Epochs = %d", args.num_train_epochs)
|
||||||
|
logger.info(" Instantaneous batch size per GPU = %d", args.per_gpu_train_batch_size)
|
||||||
|
logger.info(" Total train batch size (w. parallel, distributed & accumulation) = %d",
|
||||||
|
args.train_batch_size * args.gradient_accumulation_steps * (torch.distributed.get_world_size() if args.local_rank != -1 else 1))
|
||||||
|
logger.info(" Gradient Accumulation steps = %d", args.gradient_accumulation_steps)
|
||||||
|
logger.info(" Total optimization steps = %d", t_total)
|
||||||
|
|
||||||
|
global_step = 0
|
||||||
|
tr_loss, logging_loss = 0.0, 0.0
|
||||||
|
model.zero_grad()
|
||||||
|
train_iterator = trange(int(args.num_train_epochs), desc="Epoch", disable=args.local_rank not in [-1, 0])
|
||||||
|
set_seed(args) # Added here for reproducibility (even between python 2 and 3)
|
||||||
|
for _ in train_iterator:
|
||||||
|
epoch_iterator = tqdm(train_dataloader, desc="Iteration", disable=args.local_rank not in [-1, 0])
|
||||||
|
for step, batch in enumerate(epoch_iterator):
|
||||||
|
model.train()
|
||||||
|
batch = tuple(t.to(args.device) for t in batch)
|
||||||
|
inputs = {'input_ids': batch[0],
|
||||||
|
'attention_mask': batch[1],
|
||||||
|
'labels': batch[3]}
|
||||||
|
if args.model_type != 'distilbert':
|
||||||
|
inputs['token_type_ids'] = batch[2] if args.model_type in ['bert'] else None # XLM and DistilBERT don't use segment_ids
|
||||||
|
outputs = model(**inputs)
|
||||||
|
loss = outputs[0] # model outputs are always tuple in transformers (see doc)
|
||||||
|
|
||||||
|
if args.n_gpu > 1:
|
||||||
|
loss = loss.mean() # mean() to average on multi-gpu parallel training
|
||||||
|
if args.gradient_accumulation_steps > 1:
|
||||||
|
loss = loss / args.gradient_accumulation_steps
|
||||||
|
|
||||||
|
if args.fp16:
|
||||||
|
with amp.scale_loss(loss, optimizer) as scaled_loss:
|
||||||
|
scaled_loss.backward()
|
||||||
|
else:
|
||||||
|
loss.backward()
|
||||||
|
|
||||||
|
tr_loss += loss.item()
|
||||||
|
if (step + 1) % args.gradient_accumulation_steps == 0:
|
||||||
|
if args.fp16:
|
||||||
|
torch.nn.utils.clip_grad_norm_(amp.master_params(optimizer), args.max_grad_norm)
|
||||||
|
else:
|
||||||
|
torch.nn.utils.clip_grad_norm_(model.parameters(), args.max_grad_norm)
|
||||||
|
|
||||||
|
optimizer.step()
|
||||||
|
scheduler.step() # Update learning rate schedule
|
||||||
|
model.zero_grad()
|
||||||
|
global_step += 1
|
||||||
|
|
||||||
|
if args.local_rank in [-1, 0] and args.logging_steps > 0 and global_step % args.logging_steps == 0:
|
||||||
|
# Log metrics
|
||||||
|
if args.local_rank == -1 and args.evaluate_during_training: # Only evaluate when single GPU otherwise metrics may not average well
|
||||||
|
results = evaluate(args, model, tokenizer)
|
||||||
|
for key, value in results.items():
|
||||||
|
tb_writer.add_scalar('eval_{}'.format(key), value, global_step)
|
||||||
|
tb_writer.add_scalar('lr', scheduler.get_lr()[0], global_step)
|
||||||
|
tb_writer.add_scalar('loss', (tr_loss - logging_loss)/args.logging_steps, global_step)
|
||||||
|
logging_loss = tr_loss
|
||||||
|
|
||||||
|
if args.local_rank in [-1, 0] and args.save_steps > 0 and global_step % args.save_steps == 0:
|
||||||
|
# Save model checkpoint
|
||||||
|
output_dir = os.path.join(args.output_dir, 'checkpoint-{}'.format(global_step))
|
||||||
|
if not os.path.exists(output_dir):
|
||||||
|
os.makedirs(output_dir)
|
||||||
|
model_to_save = model.module if hasattr(model, 'module') else model # Take care of distributed/parallel training
|
||||||
|
model_to_save.save_pretrained(output_dir)
|
||||||
|
torch.save(args, os.path.join(output_dir, 'training_args.bin'))
|
||||||
|
logger.info("Saving model checkpoint to %s", output_dir)
|
||||||
|
|
||||||
|
if args.max_steps > 0 and global_step > args.max_steps:
|
||||||
|
epoch_iterator.close()
|
||||||
|
break
|
||||||
|
if args.max_steps > 0 and global_step > args.max_steps:
|
||||||
|
train_iterator.close()
|
||||||
|
break
|
||||||
|
|
||||||
|
if args.local_rank in [-1, 0]:
|
||||||
|
tb_writer.close()
|
||||||
|
|
||||||
|
return global_step, tr_loss / global_step
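
As a reading aid for `train()` above, the sketch below restates two pieces of arithmetic: the total number of optimizer steps (`t_total`) and the learning-rate multiplier of a linear warmup followed by linear decay. The multiplier formula is an assumption about how `get_linear_schedule_with_warmup` behaves, not a copy of the library code, and both function names are hypothetical.

```python
def total_optimization_steps(num_batches, gradient_accumulation_steps, num_train_epochs):
    """Mirror of the `t_total` expression used when `max_steps` is not set."""
    return num_batches // gradient_accumulation_steps * num_train_epochs


def linear_warmup_decay(step, num_warmup_steps, num_training_steps):
    """Learning-rate multiplier: ramps 0 -> 1 during warmup, then decays linearly to 0."""
    if step < num_warmup_steps:
        return step / max(1, num_warmup_steps)
    remaining = num_training_steps - step
    return max(0.0, remaining / max(1, num_training_steps - num_warmup_steps))


# 100 batches per epoch, accumulating over 4 batches, for 2 epochs.
t_total = total_optimization_steps(100, 4, 2)

# Multiplier at a few points of a 100-step run with 10 warmup steps.
multipliers = [linear_warmup_decay(s, 10, 100) for s in (0, 5, 10, 55, 100)]
```

With gradient accumulation the optimizer and scheduler advance once every `gradient_accumulation_steps` batches, which is why `t_total` divides the batch count before multiplying by the number of epochs.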
|
||||||
|
|
||||||
|
|
||||||
|
def evaluate(args, model, tokenizer, prefix=""):
|
||||||
|
eval_task_names = (args.task_name,)
|
||||||
|
eval_outputs_dirs = (args.output_dir,)
|
||||||
|
|
||||||
|
results = {}
|
||||||
|
for eval_task, eval_output_dir in zip(eval_task_names, eval_outputs_dirs):
|
||||||
|
eval_dataset = load_and_cache_examples(args, eval_task, tokenizer, evaluate=True)
|
||||||
|
|
||||||
|
if not os.path.exists(eval_output_dir) and args.local_rank in [-1, 0]:
|
||||||
|
os.makedirs(eval_output_dir)
|
||||||
|
|
||||||
|
args.eval_batch_size = args.per_gpu_eval_batch_size * max(1, args.n_gpu)
|
||||||
|
# Note that DistributedSampler samples randomly
|
||||||
|
eval_sampler = SequentialSampler(eval_dataset) if args.local_rank == -1 else DistributedSampler(eval_dataset)
|
||||||
|
eval_dataloader = DataLoader(eval_dataset, sampler=eval_sampler, batch_size=args.eval_batch_size)
|
||||||
|
|
||||||
|
# multi-gpu eval
|
||||||
|
if args.n_gpu > 1:
|
||||||
|
model = torch.nn.DataParallel(model)
|
||||||
|
|
||||||
|
# Eval!
|
||||||
|
logger.info("***** Running evaluation {} *****".format(prefix))
|
||||||
|
logger.info(" Num examples = %d", len(eval_dataset))
|
||||||
|
logger.info(" Batch size = %d", args.eval_batch_size)
|
||||||
|
eval_loss = 0.0
|
||||||
|
nb_eval_steps = 0
|
||||||
|
preds = None
|
||||||
|
out_label_ids = None
|
||||||
|
for batch in tqdm(eval_dataloader, desc="Evaluating"):
|
||||||
|
model.eval()
|
||||||
|
batch = tuple(t.to(args.device) for t in batch)
|
||||||
|
|
||||||
|
with torch.no_grad():
|
||||||
|
inputs = {'input_ids': batch[0],
|
||||||
|
'attention_mask': batch[1],
|
||||||
|
'labels': batch[3]}
|
||||||
|
if args.model_type != 'distilbert':
|
||||||
|
inputs['token_type_ids'] = batch[2] if args.model_type in ['bert'] else None # XLM and DistilBERT don't use segment_ids
|
||||||
|
outputs = model(**inputs)
|
||||||
|
tmp_eval_loss, logits = outputs[:2]
|
||||||
|
|
||||||
|
eval_loss += tmp_eval_loss.mean().item()
|
||||||
|
nb_eval_steps += 1
|
||||||
|
if preds is None:
|
||||||
|
preds = logits.detach().cpu().numpy()
|
||||||
|
out_label_ids = inputs['labels'].detach().cpu().numpy()
|
||||||
|
else:
|
||||||
|
preds = np.append(preds, logits.detach().cpu().numpy(), axis=0)
|
||||||
|
out_label_ids = np.append(out_label_ids, inputs['labels'].detach().cpu().numpy(), axis=0)
|
||||||
|
|
||||||
|
eval_loss = eval_loss / nb_eval_steps
|
||||||
|
if args.output_mode == "classification":
|
||||||
|
preds = np.argmax(preds, axis=1)
|
||||||
|
else:
|
||||||
|
raise ValueError('No other `output_mode` for XNLI.')
|
||||||
|
result = compute_metrics(eval_task, preds, out_label_ids)
|
||||||
|
results.update(result)
|
||||||
|
|
||||||
|
output_eval_file = os.path.join(eval_output_dir, prefix, "eval_results.txt")
|
||||||
|
with open(output_eval_file, "w") as writer:
|
||||||
|
logger.info("***** Eval results {} *****".format(prefix))
|
||||||
|
for key in sorted(result.keys()):
|
||||||
|
logger.info(" %s = %s", key, str(result[key]))
|
||||||
|
writer.write("%s = %s\n" % (key, str(result[key])))
|
||||||
|
|
||||||
|
return results
|
||||||
|
|
||||||
|
|
||||||
|
def load_and_cache_examples(args, task, tokenizer, evaluate=False):
|
||||||
|
if args.local_rank not in [-1, 0] and not evaluate:
|
||||||
|
torch.distributed.barrier() # Make sure only the first process in distributed training processes the dataset; the others will use the cache
|
||||||
|
|
||||||
|
processor = processors[task](language=args.language, train_language=args.train_language)
|
||||||
|
output_mode = output_modes[task]
|
||||||
|
# Load data features from cache or dataset file
|
||||||
|
cached_features_file = os.path.join(args.data_dir, 'cached_{}_{}_{}_{}_{}'.format(
|
||||||
|
'test' if evaluate else 'train',
|
||||||
|
list(filter(None, args.model_name_or_path.split('/'))).pop(),
|
||||||
|
str(args.max_seq_length),
|
||||||
|
str(task),
|
||||||
|
str(args.train_language if (not evaluate and args.train_language is not None) else args.language)))
|
||||||
|
if os.path.exists(cached_features_file) and not args.overwrite_cache:
|
||||||
|
logger.info("Loading features from cached file %s", cached_features_file)
|
||||||
|
features = torch.load(cached_features_file)
|
||||||
|
else:
|
||||||
|
logger.info("Creating features from dataset file at %s", args.data_dir)
|
||||||
|
label_list = processor.get_labels()
|
||||||
|
examples = processor.get_test_examples(args.data_dir) if evaluate else processor.get_train_examples(args.data_dir)
|
||||||
|
features = convert_examples_to_features(examples,
|
||||||
|
tokenizer,
|
||||||
|
label_list=label_list,
|
||||||
|
max_length=args.max_seq_length,
|
||||||
|
output_mode=output_mode,
|
||||||
|
pad_on_left=False,
|
||||||
|
pad_token=tokenizer.convert_tokens_to_ids([tokenizer.pad_token])[0],
|
||||||
|
pad_token_segment_id=0,
|
||||||
|
)
|
||||||
|
if args.local_rank in [-1, 0]:
|
||||||
|
logger.info("Saving features into cached file %s", cached_features_file)
|
||||||
|
torch.save(features, cached_features_file)
|
||||||
|
|
||||||
|
if args.local_rank == 0 and not evaluate:
|
||||||
|
torch.distributed.barrier() # Make sure only the first process in distributed training processes the dataset; the others will use the cache
|
||||||
|
|
||||||
|
# Convert to Tensors and build dataset
|
||||||
|
all_input_ids = torch.tensor([f.input_ids for f in features], dtype=torch.long)
|
||||||
|
all_attention_mask = torch.tensor([f.attention_mask for f in features], dtype=torch.long)
|
||||||
|
all_token_type_ids = torch.tensor([f.token_type_ids for f in features], dtype=torch.long)
|
||||||
|
if output_mode == "classification":
|
||||||
|
all_labels = torch.tensor([f.label for f in features], dtype=torch.long)
|
||||||
|
else:
|
||||||
|
raise ValueError('No other `output_mode` for XNLI.')
|
||||||
|
|
||||||
|
dataset = TensorDataset(all_input_ids, all_attention_mask, all_token_type_ids, all_labels)
|
||||||
|
return dataset
|
||||||
|
|
||||||
|
|
||||||
|
def main():
|
||||||
|
parser = argparse.ArgumentParser()
|
||||||
|
|
||||||
|
## Required parameters
|
||||||
|
parser.add_argument("--data_dir", default=None, type=str, required=True,
|
||||||
|
help="The input data dir. Should contain the .tsv files (or other data files) for the task.")
|
||||||
|
parser.add_argument("--model_type", default=None, type=str, required=True,
|
||||||
|
help="Model type selected in the list: " + ", ".join(MODEL_CLASSES.keys()))
|
||||||
|
parser.add_argument("--model_name_or_path", default=None, type=str, required=True,
|
||||||
|
help="Path to pre-trained model or shortcut name selected in the list: " + ", ".join(ALL_MODELS))
|
||||||
|
parser.add_argument("--language", default=None, type=str, required=True,
|
||||||
|
help="Evaluation language. Also train language if `train_language` is set to None.")
|
||||||
|
parser.add_argument("--train_language", default=None, type=str,
|
||||||
|
help="Train language if it is different from the evaluation language.")
|
||||||
|
parser.add_argument("--output_dir", default=None, type=str, required=True,
|
||||||
|
help="The output directory where the model predictions and checkpoints will be written.")
|
||||||
|
|
||||||
|
## Other parameters
|
||||||
|
parser.add_argument("--config_name", default="", type=str,
|
||||||
|
help="Pretrained config name or path if not the same as model_name")
|
||||||
|
parser.add_argument("--tokenizer_name", default="", type=str,
|
||||||
|
help="Pretrained tokenizer name or path if not the same as model_name")
|
||||||
|
parser.add_argument("--cache_dir", default="", type=str,
|
||||||
|
help="Where do you want to store the pre-trained models downloaded from s3")
|
||||||
|
parser.add_argument("--max_seq_length", default=128, type=int,
|
||||||
|
help="The maximum total input sequence length after tokenization. Sequences longer "
|
||||||
|
"than this will be truncated, sequences shorter will be padded.")
|
||||||
|
parser.add_argument("--do_train", action='store_true',
|
||||||
|
help="Whether to run training.")
|
||||||
|
parser.add_argument("--do_eval", action='store_true',
|
||||||
|
help="Whether to run eval on the test set.")
|
||||||
|
parser.add_argument("--evaluate_during_training", action='store_true',
|
||||||
|
help="Run evaluation during training at each logging step.")
|
||||||
|
parser.add_argument("--do_lower_case", action='store_true',
|
||||||
|
help="Set this flag if you are using an uncased model.")
|
||||||
|
|
||||||
|
parser.add_argument("--per_gpu_train_batch_size", default=8, type=int,
|
||||||
|
help="Batch size per GPU/CPU for training.")
|
||||||
|
parser.add_argument("--per_gpu_eval_batch_size", default=8, type=int,
|
||||||
|
help="Batch size per GPU/CPU for evaluation.")
|
||||||
|
parser.add_argument('--gradient_accumulation_steps', type=int, default=1,
|
||||||
|
help="Number of update steps to accumulate before performing a backward/update pass.")
|
||||||
|
parser.add_argument("--learning_rate", default=5e-5, type=float,
|
||||||
|
help="The initial learning rate for Adam.")
|
||||||
|
parser.add_argument("--weight_decay", default=0.0, type=float,
|
||||||
|
help="Weight decay to apply, if any.")
|
||||||
|
parser.add_argument("--adam_epsilon", default=1e-8, type=float,
|
||||||
|
help="Epsilon for Adam optimizer.")
|
||||||
|
parser.add_argument("--max_grad_norm", default=1.0, type=float,
|
||||||
|
help="Max gradient norm.")
|
||||||
|
parser.add_argument("--num_train_epochs", default=3.0, type=float,
|
||||||
|
help="Total number of training epochs to perform.")
|
||||||
|
parser.add_argument("--max_steps", default=-1, type=int,
|
||||||
|
help="If > 0: set total number of training steps to perform. Override num_train_epochs.")
|
||||||
|
parser.add_argument("--warmup_steps", default=0, type=int,
|
||||||
|
help="Linear warmup over warmup_steps.")
|
||||||
|
|
||||||
|
parser.add_argument('--logging_steps', type=int, default=50,
|
||||||
|
help="Log every X update steps.")
|
||||||
|
parser.add_argument('--save_steps', type=int, default=50,
|
||||||
|
help="Save checkpoint every X update steps.")
|
||||||
|
parser.add_argument("--eval_all_checkpoints", action='store_true',
|
||||||
|
help="Evaluate all checkpoints starting with the same prefix as model_name and ending with step number")
|
||||||
|
parser.add_argument("--no_cuda", action='store_true',
|
||||||
|
help="Avoid using CUDA when available")
|
||||||
|
parser.add_argument('--overwrite_output_dir', action='store_true',
|
||||||
|
help="Overwrite the content of the output directory")
|
||||||
|
parser.add_argument('--overwrite_cache', action='store_true',
|
||||||
|
help="Overwrite the cached training and evaluation sets")
|
||||||
|
parser.add_argument('--seed', type=int, default=42,
|
||||||
|
help="random seed for initialization")
|
||||||
|
|
||||||
|
parser.add_argument('--fp16', action='store_true',
|
||||||
|
help="Whether to use 16-bit (mixed) precision (through NVIDIA apex) instead of 32-bit")
|
||||||
|
parser.add_argument('--fp16_opt_level', type=str, default='O1',
|
||||||
|
help="For fp16: Apex AMP optimization level selected in ['O0', 'O1', 'O2', and 'O3']."
|
||||||
|
"See details at https://nvidia.github.io/apex/amp.html")
|
||||||
|
parser.add_argument("--local_rank", type=int, default=-1,
|
||||||
|
help="For distributed training: local_rank")
|
||||||
|
parser.add_argument('--server_ip', type=str, default='', help="For distant debugging.")
|
||||||
|
parser.add_argument('--server_port', type=str, default='', help="For distant debugging.")
|
||||||
|
args = parser.parse_args()
|
||||||
|
|
||||||
|
if os.path.exists(args.output_dir) and os.listdir(args.output_dir) and args.do_train and not args.overwrite_output_dir:
|
||||||
|
raise ValueError("Output directory ({}) already exists and is not empty. Use --overwrite_output_dir to overcome.".format(args.output_dir))
|
||||||
|
|
||||||
|
# Setup distant debugging if needed
|
||||||
|
if args.server_ip and args.server_port:
|
||||||
|
# Distant debugging - see https://code.visualstudio.com/docs/python/debugging#_attach-to-a-local-script
|
||||||
|
import ptvsd
|
||||||
|
print("Waiting for debugger attach")
|
||||||
|
ptvsd.enable_attach(address=(args.server_ip, args.server_port), redirect_output=True)
|
||||||
|
ptvsd.wait_for_attach()
|
||||||
|
|
||||||
|
# Setup CUDA, GPU & distributed training
|
||||||
|
if args.local_rank == -1 or args.no_cuda:
|
||||||
|
device = torch.device("cuda" if torch.cuda.is_available() and not args.no_cuda else "cpu")
|
||||||
|
args.n_gpu = torch.cuda.device_count()
|
||||||
|
else: # Initializes the distributed backend which will take care of synchronizing nodes/GPUs
|
||||||
|
torch.cuda.set_device(args.local_rank)
|
||||||
|
device = torch.device("cuda", args.local_rank)
|
||||||
|
torch.distributed.init_process_group(backend='nccl')
|
||||||
|
args.n_gpu = 1
|
||||||
|
args.device = device
|
||||||
|
|
||||||
|
# Setup logging
|
||||||
|
logging.basicConfig(format = '%(asctime)s - %(levelname)s - %(name)s - %(message)s',
|
||||||
|
datefmt = '%m/%d/%Y %H:%M:%S',
|
||||||
|
level = logging.INFO if args.local_rank in [-1, 0] else logging.WARN)
|
||||||
|
logger.warning("Process rank: %s, device: %s, n_gpu: %s, distributed training: %s, 16-bits training: %s",
|
||||||
|
args.local_rank, device, args.n_gpu, bool(args.local_rank != -1), args.fp16)
|
||||||
|
|
||||||
|
# Set seed
|
||||||
|
set_seed(args)
|
||||||
|
|
||||||
|
# Prepare XNLI task
|
||||||
|
args.task_name = 'xnli'
|
||||||
|
if args.task_name not in processors:
|
||||||
|
raise ValueError("Task not found: %s" % (args.task_name))
|
||||||
|
processor = processors[args.task_name](language=args.language, train_language=args.train_language)
|
||||||
|
args.output_mode = output_modes[args.task_name]
|
||||||
|
label_list = processor.get_labels()
|
||||||
|
num_labels = len(label_list)
|
||||||
|
|
||||||
|
# Load pretrained model and tokenizer
|
||||||
|
if args.local_rank not in [-1, 0]:
|
||||||
|
torch.distributed.barrier() # Make sure only the first process in distributed training will download model & vocab
|
||||||
|
|
||||||
|
args.model_type = args.model_type.lower()
|
||||||
|
config_class, model_class, tokenizer_class = MODEL_CLASSES[args.model_type]
|
||||||
|
config = config_class.from_pretrained(args.config_name if args.config_name else args.model_name_or_path,
|
||||||
|
num_labels=num_labels,
|
||||||
|
finetuning_task=args.task_name,
|
||||||
|
cache_dir=args.cache_dir if args.cache_dir else None)
|
||||||
|
tokenizer = tokenizer_class.from_pretrained(args.tokenizer_name if args.tokenizer_name else args.model_name_or_path,
|
||||||
|
do_lower_case=args.do_lower_case,
|
||||||
|
cache_dir=args.cache_dir if args.cache_dir else None)
|
||||||
|
model = model_class.from_pretrained(args.model_name_or_path,
|
||||||
|
from_tf=bool('.ckpt' in args.model_name_or_path),
|
||||||
|
config=config,
|
||||||
|
cache_dir=args.cache_dir if args.cache_dir else None)
|
||||||
|
|
||||||
|
if args.local_rank == 0:
|
||||||
|
torch.distributed.barrier() # Make sure only the first process in distributed training will download model & vocab
|
||||||
|
|
||||||
|
model.to(args.device)
|
||||||
|
|
||||||
|
logger.info("Training/evaluation parameters %s", args)
|
||||||
|
|
||||||
|
|
||||||
|
# Training
|
||||||
|
if args.do_train:
|
||||||
|
train_dataset = load_and_cache_examples(args, args.task_name, tokenizer, evaluate=False)
|
||||||
|
global_step, tr_loss = train(args, train_dataset, model, tokenizer)
|
||||||
|
logger.info(" global_step = %s, average loss = %s", global_step, tr_loss)
|
||||||
|
|
||||||
|
|
||||||
|
# Saving best-practices: if you use default names for the model, you can reload it using from_pretrained()
|
||||||
|
if args.do_train and (args.local_rank == -1 or torch.distributed.get_rank() == 0):
|
||||||
|
# Create output directory if needed
|
||||||
|
if not os.path.exists(args.output_dir) and args.local_rank in [-1, 0]:
|
||||||
|
os.makedirs(args.output_dir)
|
||||||
|
|
||||||
|
logger.info("Saving model checkpoint to %s", args.output_dir)
|
||||||
|
# Save a trained model, configuration and tokenizer using `save_pretrained()`.
|
||||||
|
# They can then be reloaded using `from_pretrained()`
|
||||||
|
model_to_save = model.module if hasattr(model, 'module') else model # Take care of distributed/parallel training
|
||||||
|
model_to_save.save_pretrained(args.output_dir)
|
||||||
|
tokenizer.save_pretrained(args.output_dir)
|
||||||
|
|
||||||
|
# Good practice: save your training arguments together with the trained model
|
||||||
|
torch.save(args, os.path.join(args.output_dir, 'training_args.bin'))
|
||||||
|
|
||||||
|
# Load a trained model and vocabulary that you have fine-tuned
|
||||||
|
model = model_class.from_pretrained(args.output_dir)
|
||||||
|
tokenizer = tokenizer_class.from_pretrained(args.output_dir)
|
||||||
|
model.to(args.device)
|
||||||
|
|
||||||
|
|
||||||
|
# Evaluation
|
||||||
|
results = {}
|
||||||
|
if args.do_eval and args.local_rank in [-1, 0]:
|
||||||
|
tokenizer = tokenizer_class.from_pretrained(args.output_dir, do_lower_case=args.do_lower_case)
|
||||||
|
checkpoints = [args.output_dir]
|
||||||
|
if args.eval_all_checkpoints:
|
||||||
|
checkpoints = list(os.path.dirname(c) for c in sorted(glob.glob(args.output_dir + '/**/' + WEIGHTS_NAME, recursive=True)))
|
||||||
|
logging.getLogger("transformers.modeling_utils").setLevel(logging.WARN) # Reduce logging
|
||||||
|
logger.info("Evaluate the following checkpoints: %s", checkpoints)
|
||||||
|
for checkpoint in checkpoints:
|
||||||
|
global_step = checkpoint.split('-')[-1] if len(checkpoints) > 1 else ""
|
||||||
|
prefix = checkpoint.split('/')[-1] if checkpoint.find('checkpoint') != -1 else ""
|
||||||
|
|
||||||
|
model = model_class.from_pretrained(checkpoint)
|
||||||
|
model.to(args.device)
|
||||||
|
result = evaluate(args, model, tokenizer, prefix=prefix)
|
||||||
|
result = dict((k + '_{}'.format(global_step), v) for k, v in result.items())
|
||||||
|
results.update(result)
|
||||||
|
|
||||||
|
return results
|
||||||
|
|
||||||
|
|
||||||
|
if __name__ == "__main__":
|
||||||
|
main()
|
||||||
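Note on the feature cache used by `load_and_cache_examples` above: the cache file name is built from the split, the last component of the model path, the sequence length, the task, and a language. The following is a minimal standalone sketch of that naming rule, assuming the same precedence as in the script (training uses `train_language` when it is set, evaluation always uses `language`). The helper name `cached_features_path` is hypothetical and is not part of the script.

```python
import os


def cached_features_path(data_dir, model_name_or_path, max_seq_length, task,
                         language, train_language=None, evaluate=False):
    """Hypothetical helper mirroring the cache file naming in the script."""
    split = 'test' if evaluate else 'train'
    # Last non-empty path component, so 'models/xlm/' and 'models/xlm' agree.
    model_tag = list(filter(None, model_name_or_path.split('/'))).pop()
    # Training prefers train_language when given; evaluation uses language.
    if not evaluate and train_language is not None:
        lang = train_language
    else:
        lang = language
    name = 'cached_{}_{}_{}_{}_{}'.format(
        split, model_tag, str(max_seq_length), str(task), str(lang))
    return os.path.join(data_dir, name)


if __name__ == "__main__":
    print(cached_features_path('data', 'models/xlm/', 128, 'xnli', 'de',
                               train_language='en'))
    print(cached_features_path('data', 'models/xlm/', 128, 'xnli', 'de',
                               train_language='en', evaluate=True))
```

Because the language is part of the name, training on one language and evaluating on another yields two separate cache files rather than one overwriting the other.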
@@ -240,6 +240,7 @@ def convert_examples_to_features(examples, tokenizer, max_seq_length,
|
|||||||
|
|
||||||
# The -3 accounts for [CLS], [SEP] and [SEP]
|
# The -3 accounts for [CLS], [SEP] and [SEP]
|
||||||
max_tokens_for_doc = max_seq_length - len(query_tokens) - 3
|
max_tokens_for_doc = max_seq_length - len(query_tokens) - 3
|
||||||
|
assert max_tokens_for_doc > 0
|
||||||
|
|
||||||
# We can have documents that are longer than the maximum sequence length.
|
# We can have documents that are longer than the maximum sequence length.
|
||||||
# To deal with this we do a sliding window approach, where we take chunks
|
# To deal with this we do a sliding window approach, where we take chunks
|
||||||
|
|||||||
2
setup.py
@@ -38,7 +38,7 @@ from setuptools import find_packages, setup
|
|||||||
|
|
||||||
setup(
|
setup(
|
||||||
name="transformers",
|
name="transformers",
|
||||||
version="2.1.1",
|
version="2.2.0",
|
||||||
author="Thomas Wolf, Lysandre Debut, Victor Sanh, Julien Chaumond, Google AI Language Team Authors, Open AI team Authors, Facebook AI Authors, Carnegie Mellon University Authors",
|
author="Thomas Wolf, Lysandre Debut, Victor Sanh, Julien Chaumond, Google AI Language Team Authors, Open AI team Authors, Facebook AI Authors, Carnegie Mellon University Authors",
|
||||||
author_email="thomas@huggingface.co",
|
author_email="thomas@huggingface.co",
|
||||||
description="State-of-the-art Natural Language Processing for TensorFlow 2.0 and PyTorch",
|
description="State-of-the-art Natural Language Processing for TensorFlow 2.0 and PyTorch",
|
||||||
|
|||||||
@@ -1,4 +1,4 @@
|
|||||||
__version__ = "2.1.1"
|
__version__ = "2.2.0"
|
||||||
|
|
||||||
# Work around to update TensorFlow's absl.logging threshold which alters the
|
# Work around to update TensorFlow's absl.logging threshold which alters the
|
||||||
# default Python logging output behavior when present.
|
# default Python logging output behavior when present.
|
||||||
@@ -25,10 +25,11 @@ from .file_utils import (TRANSFORMERS_CACHE, PYTORCH_TRANSFORMERS_CACHE, PYTORCH
|
|||||||
from .data import (is_sklearn_available,
|
from .data import (is_sklearn_available,
|
||||||
InputExample, InputFeatures, DataProcessor,
|
InputExample, InputFeatures, DataProcessor,
|
||||||
glue_output_modes, glue_convert_examples_to_features,
|
glue_output_modes, glue_convert_examples_to_features,
|
||||||
glue_processors, glue_tasks_num_labels)
|
glue_processors, glue_tasks_num_labels,
|
||||||
|
xnli_output_modes, xnli_processors, xnli_tasks_num_labels)
|
||||||
|
|
||||||
if is_sklearn_available():
|
if is_sklearn_available():
|
||||||
from .data import glue_compute_metrics
|
from .data import glue_compute_metrics, xnli_compute_metrics
|
||||||
|
|
||||||
# Tokenizers
|
# Tokenizers
|
||||||
from .tokenization_utils import (PreTrainedTokenizer)
|
from .tokenization_utils import (PreTrainedTokenizer)
|
||||||
@@ -42,6 +43,7 @@ from .tokenization_xlnet import XLNetTokenizer, SPIECE_UNDERLINE
|
|||||||
from .tokenization_xlm import XLMTokenizer
|
from .tokenization_xlm import XLMTokenizer
|
||||||
from .tokenization_roberta import RobertaTokenizer
|
from .tokenization_roberta import RobertaTokenizer
|
||||||
from .tokenization_distilbert import DistilBertTokenizer
|
from .tokenization_distilbert import DistilBertTokenizer
|
||||||
|
from .tokenization_albert import AlbertTokenizer
|
||||||
from .tokenization_camembert import CamembertTokenizer
|
from .tokenization_camembert import CamembertTokenizer
|
||||||
|
|
||||||
# Configurations
|
# Configurations
|
||||||
@@ -57,6 +59,7 @@ from .configuration_ctrl import CTRLConfig, CTRL_PRETRAINED_CONFIG_ARCHIVE_MAP
|
|||||||
from .configuration_xlm import XLMConfig, XLM_PRETRAINED_CONFIG_ARCHIVE_MAP
|
from .configuration_xlm import XLMConfig, XLM_PRETRAINED_CONFIG_ARCHIVE_MAP
|
||||||
from .configuration_roberta import RobertaConfig, ROBERTA_PRETRAINED_CONFIG_ARCHIVE_MAP
|
from .configuration_roberta import RobertaConfig, ROBERTA_PRETRAINED_CONFIG_ARCHIVE_MAP
|
||||||
from .configuration_distilbert import DistilBertConfig, DISTILBERT_PRETRAINED_CONFIG_ARCHIVE_MAP
|
from .configuration_distilbert import DistilBertConfig, DISTILBERT_PRETRAINED_CONFIG_ARCHIVE_MAP
|
||||||
|
from .configuration_albert import AlbertConfig, ALBERT_PRETRAINED_CONFIG_ARCHIVE_MAP
|
||||||
from .configuration_camembert import CamembertConfig, CAMEMBERT_PRETRAINED_CONFIG_ARCHIVE_MAP
|
from .configuration_camembert import CamembertConfig, CAMEMBERT_PRETRAINED_CONFIG_ARCHIVE_MAP
|
||||||
|
|
||||||
# Modeling
|
# Modeling
|
||||||
@@ -100,9 +103,14 @@ if is_torch_available():
|
|||||||
DISTILBERT_PRETRAINED_MODEL_ARCHIVE_MAP)
|
DISTILBERT_PRETRAINED_MODEL_ARCHIVE_MAP)
|
||||||
from .modeling_camembert import (CamembertForMaskedLM, CamembertModel,
|
from .modeling_camembert import (CamembertForMaskedLM, CamembertModel,
|
||||||
CamembertForSequenceClassification, CamembertForMultipleChoice,
|
CamembertForSequenceClassification, CamembertForMultipleChoice,
|
||||||
|
CamembertForTokenClassification,
|
||||||
CAMEMBERT_PRETRAINED_MODEL_ARCHIVE_MAP)
|
CAMEMBERT_PRETRAINED_MODEL_ARCHIVE_MAP)
|
||||||
from .modeling_encoder_decoder import PreTrainedEncoderDecoder, Model2Model
|
from .modeling_encoder_decoder import PreTrainedEncoderDecoder, Model2Model
|
||||||
|
|
||||||
|
from .modeling_albert import (AlbertModel, AlbertForMaskedLM, AlbertForSequenceClassification,
|
||||||
|
AlbertForQuestionAnswering,
|
||||||
|
load_tf_weights_in_albert, ALBERT_PRETRAINED_MODEL_ARCHIVE_MAP)
|
||||||
|
|
||||||
# Optimization
|
# Optimization
|
||||||
from .optimization import (AdamW, get_constant_schedule, get_constant_schedule_with_warmup, get_cosine_schedule_with_warmup,
|
from .optimization import (AdamW, get_constant_schedule, get_constant_schedule_with_warmup, get_cosine_schedule_with_warmup,
|
||||||
get_cosine_with_hard_restarts_schedule_with_warmup, get_linear_schedule_with_warmup)
|
get_cosine_with_hard_restarts_schedule_with_warmup, get_linear_schedule_with_warmup)
|
||||||
@@ -161,6 +169,10 @@ if is_tf_available():
|
|||||||
TFCTRLLMHeadModel,
|
TFCTRLLMHeadModel,
|
||||||
TF_CTRL_PRETRAINED_MODEL_ARCHIVE_MAP)
|
TF_CTRL_PRETRAINED_MODEL_ARCHIVE_MAP)
|
||||||
|
|
||||||
|
from .modeling_tf_albert import (TFAlbertPreTrainedModel, TFAlbertModel, TFAlbertForMaskedLM,
|
||||||
|
TFAlbertForSequenceClassification,
|
||||||
|
TF_ALBERT_PRETRAINED_MODEL_ARCHIVE_MAP)
|
||||||
|
|
||||||
# TF 2.0 <=> PyTorch conversion utilities
|
# TF 2.0 <=> PyTorch conversion utilities
|
||||||
from .modeling_tf_pytorch_utils import (convert_tf_weight_name_to_pt_weight_name,
|
from .modeling_tf_pytorch_utils import (convert_tf_weight_name_to_pt_weight_name,
|
||||||
load_pytorch_checkpoint_in_tf2_model,
|
load_pytorch_checkpoint_in_tf2_model,
|
||||||
|
|||||||
100
transformers/configuration_albert.py
Normal file
@@ -0,0 +1,100 @@
|
|||||||
|
# coding=utf-8
|
||||||
|
# Copyright 2018 The Google AI Language Team Authors and The HuggingFace Inc. team.
|
||||||
|
# Copyright (c) 2018, NVIDIA CORPORATION. All rights reserved.
|
||||||
|
#
|
||||||
|
# Licensed under the Apache License, Version 2.0 (the "License");
|
||||||
|
# you may not use this file except in compliance with the License.
|
||||||
|
# You may obtain a copy of the License at
|
||||||
|
#
|
||||||
|
# http://www.apache.org/licenses/LICENSE-2.0
|
||||||
|
#
|
||||||
|
# Unless required by applicable law or agreed to in writing, software
|
||||||
|
# distributed under the License is distributed on an "AS IS" BASIS,
|
||||||
|
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
|
||||||
|
# See the License for the specific language governing permissions and
|
||||||
|
# limitations under the License.
|
||||||
|
""" ALBERT model configuration """
|
||||||
|
|
||||||
|
from .configuration_utils import PretrainedConfig
|
||||||
|
|
||||||
|
ALBERT_PRETRAINED_CONFIG_ARCHIVE_MAP = {
|
||||||
|
'albert-base-v1': "https://s3.amazonaws.com/models.huggingface.co/bert/albert-base-config.json",
|
||||||
|
'albert-large-v1': "https://s3.amazonaws.com/models.huggingface.co/bert/albert-large-config.json",
|
||||||
|
'albert-xlarge-v1': "https://s3.amazonaws.com/models.huggingface.co/bert/albert-xlarge-config.json",
|
||||||
|
'albert-xxlarge-v1': "https://s3.amazonaws.com/models.huggingface.co/bert/albert-xxlarge-config.json",
|
||||||
|
'albert-base-v2': "https://s3.amazonaws.com/models.huggingface.co/bert/albert-base-v2-config.json",
|
||||||
|
'albert-large-v2': "https://s3.amazonaws.com/models.huggingface.co/bert/albert-large-v2-config.json",
|
||||||
|
'albert-xlarge-v2': "https://s3.amazonaws.com/models.huggingface.co/bert/albert-xlarge-v2-config.json",
|
||||||
|
'albert-xxlarge-v2': "https://s3.amazonaws.com/models.huggingface.co/bert/albert-xxlarge-v2-config.json",
|
||||||
|
}
|
||||||
|
|
||||||
|
class AlbertConfig(PretrainedConfig):
|
||||||
|
"""Configuration for `AlbertModel`.
|
||||||
|
|
||||||
|
The default settings match the configuration of model `albert_xxlarge`.
|
||||||
|
"""
|
||||||
|
|
||||||
|
pretrained_config_archive_map = ALBERT_PRETRAINED_CONFIG_ARCHIVE_MAP
|
||||||
|
|
||||||
|
def __init__(self,
|
||||||
|
vocab_size_or_config_json_file=30000,
|
||||||
|
embedding_size=128,
|
||||||
|
hidden_size=4096,
|
||||||
|
num_hidden_layers=12,
|
||||||
|
num_hidden_groups=1,
|
||||||
|
num_attention_heads=64,
|
||||||
|
intermediate_size=16384,
|
||||||
|
inner_group_num=1,
|
||||||
|
hidden_act="gelu_new",
|
||||||
|
hidden_dropout_prob=0,
|
||||||
|
attention_probs_dropout_prob=0,
|
||||||
|
max_position_embeddings=512,
|
||||||
|
type_vocab_size=2,
|
||||||
|
initializer_range=0.02,
|
||||||
|
layer_norm_eps=1e-12, **kwargs):
|
||||||
|
"""Constructs AlbertConfig.
|
||||||
|
|
||||||
|
Args:
|
||||||
|
vocab_size: Vocabulary size of `input_ids` in `AlbertModel`.
|
||||||
|
embedding_size: size of voc embeddings.
|
||||||
|
hidden_size: Size of the encoder layers and the pooler layer.
|
||||||
|
num_hidden_layers: Number of hidden layers in the Transformer encoder.
|
||||||
|
num_hidden_groups: Number of groups for the hidden layers, parameters in
|
||||||
|
the same group are shared.
|
||||||
|
num_attention_heads: Number of attention heads for each attention layer in
|
||||||
|
the Transformer encoder.
|
||||||
|
intermediate_size: The size of the "intermediate" (i.e., feed-forward)
|
||||||
|
layer in the Transformer encoder.
|
||||||
|
inner_group_num: int, number of inner repetitions of attention and ffn.
|
||||||
|
down_scale_factor: float, the scale to apply
|
||||||
|
hidden_act: The non-linear activation function (function or string) in the
|
||||||
|
encoder and pooler.
|
||||||
|
hidden_dropout_prob: The dropout probability for all fully connected
|
||||||
|
layers in the embeddings, encoder, and pooler.
|
||||||
|
attention_probs_dropout_prob: The dropout ratio for the attention
|
||||||
|
probabilities.
|
||||||
|
max_position_embeddings: The maximum sequence length that this model might
|
||||||
|
ever be used with. Typically set this to something large just in case
|
||||||
|
(e.g., 512 or 1024 or 2048).
|
||||||
|
type_vocab_size: The vocabulary size of the `token_type_ids` passed into
|
||||||
|
`AlbertModel`.
|
||||||
|
initializer_range: The stdev of the truncated_normal_initializer for
|
||||||
|
initializing all weight matrices.
|
||||||
|
"""
|
||||||
|
super(AlbertConfig, self).__init__(**kwargs)
|
||||||
|
|
||||||
|
self.vocab_size = vocab_size_or_config_json_file
|
||||||
|
self.embedding_size = embedding_size
|
||||||
|
self.hidden_size = hidden_size
|
||||||
|
self.num_hidden_layers = num_hidden_layers
|
||||||
|
self.num_hidden_groups = num_hidden_groups
|
||||||
|
self.num_attention_heads = num_attention_heads
|
||||||
|
self.inner_group_num = inner_group_num
|
||||||
|
self.hidden_act = hidden_act
|
||||||
|
self.intermediate_size = intermediate_size
|
||||||
|
self.hidden_dropout_prob = hidden_dropout_prob
|
||||||
|
self.attention_probs_dropout_prob = attention_probs_dropout_prob
|
||||||
|
self.max_position_embeddings = max_position_embeddings
|
||||||
|
self.type_vocab_size = type_vocab_size
|
||||||
|
self.initializer_range = initializer_range
|
||||||
|
self.layer_norm_eps = layer_norm_eps
|
||||||
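Note on `embedding_size` versus `hidden_size` in `AlbertConfig` above: the two are separate arguments (defaults 128 and 4096). The sketch below assumes this reflects ALBERT's factorized embedding, in which a small embedding table is followed by a projection up to the hidden size. It only does the parameter arithmetic for the default values; the function name `embedding_parameter_counts` is hypothetical, and the projection bias is ignored.

```python
def embedding_parameter_counts(vocab_size=30000, embedding_size=128,
                               hidden_size=4096):
    """Compare factorized and unfactorized token-embedding parameter counts."""
    # Factorized: a vocab_size x embedding_size lookup table plus an
    # embedding_size x hidden_size projection (bias ignored in this sketch).
    factorized = vocab_size * embedding_size + embedding_size * hidden_size
    # Unfactorized: a single vocab_size x hidden_size lookup table.
    unfactorized = vocab_size * hidden_size
    return factorized, unfactorized


if __name__ == "__main__":
    factorized, unfactorized = embedding_parameter_counts()
    print(factorized)    # 4364288
    print(unfactorized)  # 122880000
```

With the defaults shown in the config, the factorized form needs roughly 4.4M token-embedding parameters against roughly 122.9M for a direct table at the hidden size.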
@@ -95,6 +95,9 @@ class AutoConfig(object):
|
|||||||
force_download: (`optional`) boolean, default False:
|
force_download: (`optional`) boolean, default False:
|
||||||
Force to (re-)download the model weights and configuration files and override the cached versions if they exist.
|
Force to (re-)download the model weights and configuration files and override the cached versions if they exist.
|
||||||
|
|
||||||
|
resume_download: (`optional`) boolean, default False:
|
||||||
|
Do not delete an incompletely received file. Attempt to resume the download if such a file exists.
|
||||||
|
|
||||||
proxies: (`optional`) dict, default None:
|
proxies: (`optional`) dict, default None:
|
||||||
A dictionary of proxy servers to use by protocol or endpoint, e.g.: {'http': 'foo.bar:3128', 'http://hostname': 'foo.bar:4012'}.
|
A dictionary of proxy servers to use by protocol or endpoint, e.g.: {'http': 'foo.bar:3128', 'http://hostname': 'foo.bar:4012'}.
|
||||||
The proxies are used on each request.
|
The proxies are used on each request.
|
||||||
|
|||||||
@@ -29,6 +29,7 @@ DISTILBERT_PRETRAINED_CONFIG_ARCHIVE_MAP = {
|
|||||||
'distilbert-base-uncased': "https://s3.amazonaws.com/models.huggingface.co/bert/distilbert-base-uncased-config.json",
|
'distilbert-base-uncased': "https://s3.amazonaws.com/models.huggingface.co/bert/distilbert-base-uncased-config.json",
|
||||||
'distilbert-base-uncased-distilled-squad': "https://s3.amazonaws.com/models.huggingface.co/bert/distilbert-base-uncased-distilled-squad-config.json",
|
'distilbert-base-uncased-distilled-squad': "https://s3.amazonaws.com/models.huggingface.co/bert/distilbert-base-uncased-distilled-squad-config.json",
|
||||||
'distilbert-base-german-cased': "https://s3.amazonaws.com/models.huggingface.co/bert/distilbert-base-german-cased-config.json",
|
'distilbert-base-german-cased': "https://s3.amazonaws.com/models.huggingface.co/bert/distilbert-base-german-cased-config.json",
|
||||||
|
'distilbert-base-multilingual-cased': "https://s3.amazonaws.com/models.huggingface.co/bert/distilbert-base-multilingual-cased-config.json",
|
||||||
}
|
}
|
||||||
|
|
||||||
|
|
||||||
|
|||||||
@@ -94,6 +94,9 @@ class PretrainedConfig(object):
|
|||||||
force_download: (`optional`) boolean, default False:
|
force_download: (`optional`) boolean, default False:
|
||||||
Force to (re-)download the model weights and configuration files and override the cached versions if they exist.
|
Force to (re-)download the model weights and configuration files and override the cached versions if they exist.
|
||||||
|
|
||||||
|
resume_download: (`optional`) boolean, default False:
|
||||||
|
Do not delete an incompletely received file. Attempt to resume the download if such a file exists.
|
||||||
|
|
||||||
proxies: (`optional`) dict, default None:
|
proxies: (`optional`) dict, default None:
|
||||||
A dictionary of proxy servers to use by protocol or endpoint, e.g.: {'http': 'foo.bar:3128', 'http://hostname': 'foo.bar:4012'}.
|
A dictionary of proxy servers to use by protocol or endpoint, e.g.: {'http': 'foo.bar:3128', 'http://hostname': 'foo.bar:4012'}.
|
||||||
The proxies are used on each request.
|
The proxies are used on each request.
|
||||||
@@ -120,6 +123,7 @@ class PretrainedConfig(object):
|
|||||||
"""
|
"""
|
||||||
cache_dir = kwargs.pop('cache_dir', None)
|
cache_dir = kwargs.pop('cache_dir', None)
|
||||||
force_download = kwargs.pop('force_download', False)
|
force_download = kwargs.pop('force_download', False)
|
||||||
|
resume_download = kwargs.pop('resume_download', False)
|
||||||
proxies = kwargs.pop('proxies', None)
|
proxies = kwargs.pop('proxies', None)
|
||||||
return_unused_kwargs = kwargs.pop('return_unused_kwargs', False)
|
return_unused_kwargs = kwargs.pop('return_unused_kwargs', False)
|
||||||
|
|
||||||
@@ -131,7 +135,8 @@ class PretrainedConfig(object):
|
|||||||
config_file = pretrained_model_name_or_path
|
config_file = pretrained_model_name_or_path
|
||||||
# redirect to the cache, if necessary
|
# redirect to the cache, if necessary
|
||||||
try:
|
try:
|
||||||
resolved_config_file = cached_path(config_file, cache_dir=cache_dir, force_download=force_download, proxies=proxies)
|
resolved_config_file = cached_path(config_file, cache_dir=cache_dir, force_download=force_download,
|
||||||
|
proxies=proxies, resume_download=resume_download)
|
||||||
except EnvironmentError:
|
except EnvironmentError:
|
||||||
if pretrained_model_name_or_path in cls.pretrained_config_archive_map:
|
if pretrained_model_name_or_path in cls.pretrained_config_archive_map:
|
||||||
msg = "Couldn't reach server at '{}' to download pretrained model configuration file.".format(
|
msg = "Couldn't reach server at '{}' to download pretrained model configuration file.".format(
|
||||||
|
|||||||
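Note on `resume_download` above: the docstring says an incompletely received file is kept and the download is resumed if such a file exists. One common way to do this over HTTP is a `Range` request for the bytes that are still missing. The sketch below illustrates that idea only; it is not the implementation inside `cached_path`, and both the helper name `resume_headers` and the `.incomplete` suffix are assumptions made for the example.

```python
import os
import tempfile


def resume_headers(partial_path):
    """Hypothetical helper: return (headers, resume_size) for a partial file."""
    # If a partial file is already on disk, request only the remaining bytes.
    if os.path.exists(partial_path):
        resume_size = os.path.getsize(partial_path)
    else:
        resume_size = 0
    if resume_size > 0:
        headers = {'Range': 'bytes={}-'.format(resume_size)}
    else:
        headers = {}
    return headers, resume_size


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        # The '.incomplete' suffix is an assumption for illustration.
        partial = os.path.join(tmp, 'config.json.incomplete')
        print(resume_headers(partial))  # nothing on disk yet
        with open(partial, 'wb') as handle:
            handle.write(b'x' * 10)
        print(resume_headers(partial))  # 10 bytes already received
```

A caller would send these headers with the request and append the response body to the partial file, falling back to a full download if the server does not honor range requests.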
@@ -0,0 +1,67 @@
|
|||||||
|
# coding=utf-8
|
||||||
|
# Copyright 2018 The HuggingFace Inc. team.
|
||||||
|
#
|
||||||
|
# Licensed under the Apache License, Version 2.0 (the "License");
|
||||||
|
# you may not use this file except in compliance with the License.
|
||||||
|
# You may obtain a copy of the License at
|
||||||
|
#
|
||||||
|
# http://www.apache.org/licenses/LICENSE-2.0
|
||||||
|
#
|
||||||
|
# Unless required by applicable law or agreed to in writing, software
|
||||||
|
# distributed under the License is distributed on an "AS IS" BASIS,
|
||||||
|
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
|
||||||
|
# See the License for the specific language governing permissions and
|
||||||
|
# limitations under the License.
|
||||||
|
"""Convert ALBERT checkpoint."""
|
||||||
|
|
||||||
|
from __future__ import absolute_import
|
||||||
|
from __future__ import division
|
||||||
|
from __future__ import print_function
|
||||||
|
|
||||||
|
import argparse
|
||||||
|
import torch
|
||||||
|
|
||||||
|
from transformers import AlbertConfig, AlbertForMaskedLM, load_tf_weights_in_albert
|
||||||
|
|
||||||
|
import logging
|
||||||
|
logging.basicConfig(level=logging.INFO)
|
||||||
|
|
||||||
|
|
||||||
|
def convert_tf_checkpoint_to_pytorch(tf_checkpoint_path, albert_config_file, pytorch_dump_path):
|
||||||
|
# Initialise PyTorch model
|
||||||
|
config = AlbertConfig.from_json_file(albert_config_file)
|
||||||
|
print("Building PyTorch model from configuration: {}".format(str(config)))
|
||||||
|
model = AlbertForMaskedLM(config)
|
||||||
|
|
||||||
|
# Load weights from tf checkpoint
|
||||||
|
load_tf_weights_in_albert(model, config, tf_checkpoint_path)
|
||||||
|
|
||||||
|
# Save pytorch-model
|
||||||
|
print("Save PyTorch model to {}".format(pytorch_dump_path))
|
||||||
|
torch.save(model.state_dict(), pytorch_dump_path)
|
||||||
|
|
||||||
|
|
||||||
|
if __name__ == "__main__":
|
||||||
|
parser = argparse.ArgumentParser()
|
||||||
|
## Required parameters
|
||||||
|
parser.add_argument("--tf_checkpoint_path",
|
||||||
|
default = None,
|
||||||
|
type = str,
|
||||||
|
required = True,
|
||||||
|
help = "Path to the TensorFlow checkpoint.")
|
||||||
|
parser.add_argument("--albert_config_file",
|
||||||
|
default = None,
|
||||||
|
type = str,
|
||||||
|
required = True,
|
||||||
|
help = "The config json file corresponding to the pre-trained ALBERT model. \n"
|
||||||
|
"This specifies the model architecture.")
|
||||||
|
parser.add_argument("--pytorch_dump_path",
|
||||||
|
default = None,
|
||||||
|
type = str,
|
||||||
|
required = True,
|
||||||
|
help = "Path to the output PyTorch model.")
|
||||||
|
args = parser.parse_args()
|
||||||
|
convert_tf_checkpoint_to_pytorch(args.tf_checkpoint_path,
|
||||||
|
args.albert_config_file,
|
||||||
|
args.pytorch_dump_path)
|
||||||
|
|
||||||
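For orientation, the command-line interface of the conversion script above can be sketched with the standard library alone. This is a minimal sketch, not the script itself: the three flag names come from the diff, while `build_parser` and the example paths are hypothetical and no conversion is performed.

```python
import argparse


def build_parser():
    """Build a parser exposing the three required flags of the conversion script."""
    parser = argparse.ArgumentParser(description="ALBERT checkpoint conversion (CLI sketch)")
    parser.add_argument("--tf_checkpoint_path", type=str, required=True,
                        help="Path to the TensorFlow checkpoint.")
    parser.add_argument("--albert_config_file", type=str, required=True,
                        help="JSON config file describing the ALBERT architecture.")
    parser.add_argument("--pytorch_dump_path", type=str, required=True,
                        help="Path to the output PyTorch model.")
    return parser


# Hypothetical paths, for illustration only.
args = build_parser().parse_args([
    "--tf_checkpoint_path", "albert_base/model.ckpt",
    "--albert_config_file", "albert_base/albert_config.json",
    "--pytorch_dump_path", "albert_base/pytorch_model.bin",
])
print(args.tf_checkpoint_path, args.albert_config_file, args.pytorch_dump_path)
```

Because every flag is marked `required=True`, the `default = None` arguments in the script have no effect; omitting a flag makes `argparse` exit with a usage error.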
@@ -33,7 +33,8 @@ from transformers import (load_pytorch_checkpoint_in_tf2_model,
|
|||||||
OpenAIGPTConfig, TFOpenAIGPTLMHeadModel, OPENAI_GPT_PRETRAINED_CONFIG_ARCHIVE_MAP,
|
OpenAIGPTConfig, TFOpenAIGPTLMHeadModel, OPENAI_GPT_PRETRAINED_CONFIG_ARCHIVE_MAP,
|
||||||
RobertaConfig, TFRobertaForMaskedLM, TFRobertaForSequenceClassification, ROBERTA_PRETRAINED_CONFIG_ARCHIVE_MAP,
|
RobertaConfig, TFRobertaForMaskedLM, TFRobertaForSequenceClassification, ROBERTA_PRETRAINED_CONFIG_ARCHIVE_MAP,
|
||||||
DistilBertConfig, TFDistilBertForMaskedLM, TFDistilBertForQuestionAnswering, DISTILBERT_PRETRAINED_CONFIG_ARCHIVE_MAP,
|
DistilBertConfig, TFDistilBertForMaskedLM, TFDistilBertForQuestionAnswering, DISTILBERT_PRETRAINED_CONFIG_ARCHIVE_MAP,
|
||||||
CTRLConfig, TFCTRLLMHeadModel, CTRL_PRETRAINED_CONFIG_ARCHIVE_MAP)
|
CTRLConfig, TFCTRLLMHeadModel, CTRL_PRETRAINED_CONFIG_ARCHIVE_MAP,
|
||||||
|
AlbertConfig, TFAlbertForMaskedLM, ALBERT_PRETRAINED_CONFIG_ARCHIVE_MAP)
|
||||||
|
|
||||||
if is_torch_available():
|
if is_torch_available():
|
||||||
import torch
|
import torch
|
||||||
@@ -46,7 +47,8 @@ if is_torch_available():
|
|||||||
OpenAIGPTLMHeadModel, OPENAI_GPT_PRETRAINED_MODEL_ARCHIVE_MAP,
|
OpenAIGPTLMHeadModel, OPENAI_GPT_PRETRAINED_MODEL_ARCHIVE_MAP,
|
||||||
RobertaForMaskedLM, RobertaForSequenceClassification, ROBERTA_PRETRAINED_MODEL_ARCHIVE_MAP,
|
RobertaForMaskedLM, RobertaForSequenceClassification, ROBERTA_PRETRAINED_MODEL_ARCHIVE_MAP,
|
||||||
DistilBertForMaskedLM, DistilBertForQuestionAnswering, DISTILBERT_PRETRAINED_MODEL_ARCHIVE_MAP,
|
DistilBertForMaskedLM, DistilBertForQuestionAnswering, DISTILBERT_PRETRAINED_MODEL_ARCHIVE_MAP,
|
||||||
CTRLLMHeadModel, CTRL_PRETRAINED_MODEL_ARCHIVE_MAP)
|
CTRLLMHeadModel, CTRL_PRETRAINED_MODEL_ARCHIVE_MAP,
|
||||||
|
AlbertForMaskedLM, ALBERT_PRETRAINED_MODEL_ARCHIVE_MAP)
|
||||||
else:
|
else:
|
||||||
(BertForPreTraining, BertForQuestionAnswering, BertForSequenceClassification, BERT_PRETRAINED_MODEL_ARCHIVE_MAP,
|
(BertForPreTraining, BertForQuestionAnswering, BertForSequenceClassification, BERT_PRETRAINED_MODEL_ARCHIVE_MAP,
|
||||||
GPT2LMHeadModel, GPT2_PRETRAINED_MODEL_ARCHIVE_MAP,
|
GPT2LMHeadModel, GPT2_PRETRAINED_MODEL_ARCHIVE_MAP,
|
||||||
@@ -56,7 +58,8 @@ else:
|
|||||||
OpenAIGPTLMHeadModel, OPENAI_GPT_PRETRAINED_MODEL_ARCHIVE_MAP,
|
OpenAIGPTLMHeadModel, OPENAI_GPT_PRETRAINED_MODEL_ARCHIVE_MAP,
|
||||||
RobertaForMaskedLM, RobertaForSequenceClassification, ROBERTA_PRETRAINED_MODEL_ARCHIVE_MAP,
|
RobertaForMaskedLM, RobertaForSequenceClassification, ROBERTA_PRETRAINED_MODEL_ARCHIVE_MAP,
|
||||||
DistilBertForMaskedLM, DistilBertForQuestionAnswering, DISTILBERT_PRETRAINED_MODEL_ARCHIVE_MAP,
|
DistilBertForMaskedLM, DistilBertForQuestionAnswering, DISTILBERT_PRETRAINED_MODEL_ARCHIVE_MAP,
|
||||||
CTRLLMHeadModel, CTRL_PRETRAINED_MODEL_ARCHIVE_MAP) = (
|
CTRLLMHeadModel, CTRL_PRETRAINED_MODEL_ARCHIVE_MAP,
|
||||||
|
AlbertForMaskedLM, ALBERT_PRETRAINED_MODEL_ARCHIVE_MAP) = (
|
||||||
None, None, None, None,
|
None, None, None, None,
|
||||||
None, None,
|
None, None,
|
||||||
None, None,
|
None, None,
|
||||||
@@ -65,6 +68,7 @@ else:
|
|||||||
None, None,
|
None, None,
|
||||||
None, None, None,
|
None, None, None,
|
||||||
None, None, None,
|
None, None, None,
|
||||||
|
None, None,
|
||||||
None, None)
|
None, None)
|
||||||
|
|
||||||
|
|
||||||
@@ -85,7 +89,8 @@ MODEL_CLASSES = {
|
|||||||
'roberta-large-mnli': (RobertaConfig, TFRobertaForSequenceClassification, RobertaForSequenceClassification, ROBERTA_PRETRAINED_MODEL_ARCHIVE_MAP, ROBERTA_PRETRAINED_CONFIG_ARCHIVE_MAP),
|
'roberta-large-mnli': (RobertaConfig, TFRobertaForSequenceClassification, RobertaForSequenceClassification, ROBERTA_PRETRAINED_MODEL_ARCHIVE_MAP, ROBERTA_PRETRAINED_CONFIG_ARCHIVE_MAP),
|
||||||
'distilbert': (DistilBertConfig, TFDistilBertForMaskedLM, DistilBertForMaskedLM, DISTILBERT_PRETRAINED_MODEL_ARCHIVE_MAP, DISTILBERT_PRETRAINED_CONFIG_ARCHIVE_MAP),
|
'distilbert': (DistilBertConfig, TFDistilBertForMaskedLM, DistilBertForMaskedLM, DISTILBERT_PRETRAINED_MODEL_ARCHIVE_MAP, DISTILBERT_PRETRAINED_CONFIG_ARCHIVE_MAP),
|
||||||
'distilbert-base-uncased-distilled-squad': (DistilBertConfig, TFDistilBertForQuestionAnswering, DistilBertForQuestionAnswering, DISTILBERT_PRETRAINED_MODEL_ARCHIVE_MAP, DISTILBERT_PRETRAINED_CONFIG_ARCHIVE_MAP),
|
'distilbert-base-uncased-distilled-squad': (DistilBertConfig, TFDistilBertForQuestionAnswering, DistilBertForQuestionAnswering, DISTILBERT_PRETRAINED_MODEL_ARCHIVE_MAP, DISTILBERT_PRETRAINED_CONFIG_ARCHIVE_MAP),
|
||||||
'ctrl': (CTRLConfig, TFCTRLLMHeadModel, CTRLLMHeadModel, CTRL_PRETRAINED_MODEL_ARCHIVE_MAP, CTRL_PRETRAINED_CONFIG_ARCHIVE_MAP)
|
'ctrl': (CTRLConfig, TFCTRLLMHeadModel, CTRLLMHeadModel, CTRL_PRETRAINED_MODEL_ARCHIVE_MAP, CTRL_PRETRAINED_CONFIG_ARCHIVE_MAP),
|
||||||
|
'albert': (AlbertConfig, TFAlbertForMaskedLM, AlbertForMaskedLM, ALBERT_PRETRAINED_MODEL_ARCHIVE_MAP, ALBERT_PRETRAINED_CONFIG_ARCHIVE_MAP)
|
||||||
}
|
}
|
||||||
|
|
||||||
def convert_pt_checkpoint_to_tf(model_type, pytorch_checkpoint_path, config_file, tf_dump_path, compare_with_pt_model=False, use_cached_models=True):
|
def convert_pt_checkpoint_to_tf(model_type, pytorch_checkpoint_path, config_file, tf_dump_path, compare_with_pt_model=False, use_cached_models=True):
|
||||||
|
|||||||
@@ -1,6 +1,7 @@
|
|||||||
from .processors import InputExample, InputFeatures, DataProcessor
|
from .processors import InputExample, InputFeatures, DataProcessor
|
||||||
from .processors import glue_output_modes, glue_processors, glue_tasks_num_labels, glue_convert_examples_to_features
|
from .processors import glue_output_modes, glue_processors, glue_tasks_num_labels, glue_convert_examples_to_features
|
||||||
|
from .processors import xnli_output_modes, xnli_processors, xnli_tasks_num_labels
|
||||||
|
|
||||||
from .metrics import is_sklearn_available
|
from .metrics import is_sklearn_available
|
||||||
if is_sklearn_available():
|
if is_sklearn_available():
|
||||||
from .metrics import glue_compute_metrics
|
from .metrics import glue_compute_metrics, xnli_compute_metrics
|
||||||
|
|||||||
@@ -81,3 +81,11 @@ if _has_sklearn:
|
|||||||
return {"acc": simple_accuracy(preds, labels)}
|
return {"acc": simple_accuracy(preds, labels)}
|
||||||
else:
|
else:
|
||||||
raise KeyError(task_name)
|
raise KeyError(task_name)
|
||||||
|
|
||||||
|
|
||||||
|
def xnli_compute_metrics(task_name, preds, labels):
|
||||||
|
assert len(preds) == len(labels)
|
||||||
|
if task_name == "xnli":
|
||||||
|
return {"acc": simple_accuracy(preds, labels)}
|
||||||
|
else:
|
||||||
|
raise KeyError(task_name)
|
||||||
|
|||||||
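The new `xnli_compute_metrics` helper reports plain accuracy under the key `"acc"`. The sketch below reproduces that behaviour with Python lists only; the `simple_accuracy` defined here is a stand-in written for this example (an assumption, since the library helper is not shown in this hunk), and the sample predictions are made up.

```python
def simple_accuracy(preds, labels):
    """Fraction of positions where the prediction equals the label (list-based stand-in)."""
    matches = sum(1 for pred, label in zip(preds, labels) if pred == label)
    return matches / len(labels)


def xnli_compute_metrics(task_name, preds, labels):
    """Mirror of the helper added in the diff: accuracy for 'xnli', KeyError otherwise."""
    assert len(preds) == len(labels)
    if task_name == "xnli":
        return {"acc": simple_accuracy(preds, labels)}
    raise KeyError(task_name)


# Made-up class ids: three of the four predictions match.
result = xnli_compute_metrics("xnli", [0, 1, 2, 1], [0, 1, 1, 1])
print(result)  # {'acc': 0.75}
```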
@@ -1,3 +1,3 @@
|
|||||||
from .utils import InputExample, InputFeatures, DataProcessor
|
from .utils import InputExample, InputFeatures, DataProcessor
|
||||||
from .glue import glue_output_modes, glue_processors, glue_tasks_num_labels, glue_convert_examples_to_features
|
from .glue import glue_output_modes, glue_processors, glue_tasks_num_labels, glue_convert_examples_to_features
|
||||||
|
from .xnli import xnli_output_modes, xnli_processors, xnli_tasks_num_labels
|
||||||
|
|||||||
85
transformers/data/processors/xnli.py
Normal file
@@ -0,0 +1,85 @@
|
|||||||
|
# coding=utf-8
|
||||||
|
# Copyright 2018 The Google AI Language Team Authors and The HuggingFace Inc. team.
|
||||||
|
# Copyright (c) 2018, NVIDIA CORPORATION. All rights reserved.
|
||||||
|
#
|
||||||
|
# Licensed under the Apache License, Version 2.0 (the "License");
|
||||||
|
# you may not use this file except in compliance with the License.
|
||||||
|
# You may obtain a copy of the License at
|
||||||
|
#
|
||||||
|
# http://www.apache.org/licenses/LICENSE-2.0
|
||||||
|
#
|
||||||
|
# Unless required by applicable law or agreed to in writing, software
|
||||||
|
# distributed under the License is distributed on an "AS IS" BASIS,
|
||||||
|
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
|
||||||
|
# See the License for the specific language governing permissions and
|
||||||
|
# limitations under the License.
|
||||||
|
""" XNLI utils (dataset loading and evaluation) """
|
||||||
|
|
||||||
|
from __future__ import absolute_import, division, print_function
|
||||||
|
|
||||||
|
import logging
|
||||||
|
import os
|
||||||
|
|
||||||
|
from .utils import DataProcessor, InputExample
|
||||||
|
|
||||||
|
logger = logging.getLogger(__name__)
|
||||||
|
|
||||||
|
class XnliProcessor(DataProcessor):
|
||||||
|
"""Processor for the XNLI dataset.
|
||||||
|
Adapted from https://github.com/google-research/bert/blob/f39e881b169b9d53bea03d2d341b31707a6c052b/run_classifier.py#L207"""
|
||||||
|
|
||||||
|
def __init__(self, language, train_language = None):
|
||||||
|
self.language = language
|
||||||
|
self.train_language = train_language
|
||||||
|
|
||||||
|
def get_train_examples(self, data_dir):
|
||||||
|
"""See base class."""
|
||||||
|
lg = self.language if self.train_language is None else self.train_language
|
||||||
|
lines = self._read_tsv(os.path.join(data_dir, "XNLI-MT-1.0/multinli/multinli.train.{}.tsv".format(lg)))
|
||||||
|
examples = []
|
||||||
|
for (i, line) in enumerate(lines):
|
||||||
|
if i == 0:
|
||||||
|
continue
|
||||||
|
guid = "%s-%s" % ('train', i)
|
||||||
|
text_a = line[0]
|
||||||
|
text_b = line[1]
|
||||||
|
label = "contradiction" if line[2] == "contradictory" else line[2]
|
||||||
|
assert isinstance(text_a, str) and isinstance(text_b, str) and isinstance(label, str)
|
||||||
|
examples.append(
|
||||||
|
InputExample(guid=guid, text_a=text_a, text_b=text_b, label=label))
|
||||||
|
return examples
|
||||||
|
|
||||||
|
def get_test_examples(self, data_dir):
|
||||||
|
"""See base class."""
|
||||||
|
lines = self._read_tsv(os.path.join(data_dir, "XNLI-1.0/xnli.test.tsv"))
|
||||||
|
examples = []
|
||||||
|
for (i, line) in enumerate(lines):
|
||||||
|
if i == 0:
|
||||||
|
continue
|
||||||
|
language = line[0]
|
||||||
|
if language != self.language:
|
||||||
|
continue
|
||||||
|
guid = "%s-%s" % ('test', i)
|
||||||
|
text_a = line[6]
|
||||||
|
text_b = line[7]
|
||||||
|
label = line[1]
|
||||||
|
assert isinstance(text_a, str) and isinstance(text_b, str) and isinstance(label, str)
|
||||||
|
examples.append(
|
||||||
|
InputExample(guid=guid, text_a=text_a, text_b=text_b, label=label))
|
||||||
|
return examples
|
||||||
|
|
||||||
|
def get_labels(self):
|
||||||
|
"""See base class."""
|
||||||
|
return ["contradiction", "entailment", "neutral"]
|
||||||
|
|
||||||
|
xnli_processors = {
|
||||||
|
"xnli": XnliProcessor,
|
||||||
|
}
|
||||||
|
|
||||||
|
xnli_output_modes = {
|
||||||
|
"xnli": "classification",
|
||||||
|
}
|
||||||
|
|
||||||
|
xnli_tasks_num_labels = {
|
||||||
|
"xnli": 3,
|
||||||
|
}
|
||||||
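`XnliProcessor.get_test_examples` reads one multilingual TSV and keeps only the rows whose first column matches the requested language, taking the label from column 1 and the sentence pair from columns 6 and 7. A minimal sketch of that filtering follows, assuming a tiny in-memory TSV; the header names, filler columns, and the `read_test_examples` helper are hypothetical, and plain dicts stand in for `InputExample`.

```python
import csv
import io

# Hypothetical miniature of xnli.test.tsv. Only the column positions used by the
# processor matter here: 0 = language, 1 = label, 6 and 7 = sentence pair.
SAMPLE_TSV = (
    "language\tgold_label\tc2\tc3\tc4\tc5\tsentence1\tsentence2\n"
    "en\tneutral\t-\t-\t-\t-\tA man eats.\tA man is hungry.\n"
    "de\tentailment\t-\t-\t-\t-\tEin Mann isst.\tJemand isst.\n"
    "en\tcontradiction\t-\t-\t-\t-\tA dog runs.\tA dog sleeps.\n"
)


def read_test_examples(tsv_text, language):
    """Return the test rows for one language as dicts (stand-ins for InputExample)."""
    rows = csv.reader(io.StringIO(tsv_text), delimiter="\t")
    examples = []
    for i, row in enumerate(rows):
        if i == 0:  # skip the header row
            continue
        if row[0] != language:  # keep only the requested language
            continue
        examples.append({
            "guid": "test-%d" % i,  # the index counts all rows, not only kept ones
            "text_a": row[6],
            "text_b": row[7],
            "label": row[1],
        })
    return examples


german = read_test_examples(SAMPLE_TSV, "de")
english = read_test_examples(SAMPLE_TSV, "en")
print(german)
```

Note that the `guid` is derived from the row position in the full file, so ids for a single language are not contiguous.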
@@ -22,6 +22,7 @@ from botocore.config import Config
|
|||||||
from botocore.exceptions import ClientError
|
from botocore.exceptions import ClientError
|
||||||
import requests
|
import requests
|
||||||
from tqdm import tqdm
|
from tqdm import tqdm
|
||||||
|
from contextlib import contextmanager
|
||||||
|
|
||||||
logger = logging.getLogger(__name__) # pylint: disable=invalid-name
|
logger = logging.getLogger(__name__) # pylint: disable=invalid-name
|
||||||
|
|
||||||
@@ -152,7 +153,7 @@ def filename_to_url(filename, cache_dir=None):
|
|||||||
return url, etag
|
return url, etag
|
||||||
|
|
||||||
|
|
||||||
def cached_path(url_or_filename, cache_dir=None, force_download=False, proxies=None):
|
def cached_path(url_or_filename, cache_dir=None, force_download=False, proxies=None, resume_download=False):
|
||||||
"""
|
"""
|
||||||
Given something that might be a URL (or might be a local path),
|
Given something that might be a URL (or might be a local path),
|
||||||
determine which. If it's a URL, download the file and cache it, and
|
determine which. If it's a URL, download the file and cache it, and
|
||||||
@@ -161,6 +162,7 @@ def cached_path(url_or_filename, cache_dir=None, force_download=False, proxies=N
|
|||||||
Args:
|
Args:
|
||||||
cache_dir: specify a cache directory to save the file to (overwrite the default cache dir).
|
cache_dir: specify a cache directory to save the file to (overwrite the default cache dir).
|
||||||
force_download: if True, re-download the file even if it's already cached in the cache dir.
|
force_download: if True, re-download the file even if it's already cached in the cache dir.
|
||||||
|
resume_download: if True, resume the download if an incompletely received file is found.
|
||||||
"""
|
"""
|
||||||
if cache_dir is None:
|
if cache_dir is None:
|
||||||
cache_dir = TRANSFORMERS_CACHE
|
cache_dir = TRANSFORMERS_CACHE
|
||||||
@@ -173,7 +175,9 @@ def cached_path(url_or_filename, cache_dir=None, force_download=False, proxies=N
|
|||||||
|
|
||||||
if parsed.scheme in ('http', 'https', 's3'):
|
if parsed.scheme in ('http', 'https', 's3'):
|
||||||
# URL, so get it from the cache (downloading if necessary)
|
# URL, so get it from the cache (downloading if necessary)
|
||||||
return get_from_cache(url_or_filename, cache_dir=cache_dir, force_download=force_download, proxies=proxies)
|
return get_from_cache(url_or_filename, cache_dir=cache_dir,
|
||||||
|
force_download=force_download, proxies=proxies,
|
||||||
|
resume_download=resume_download)
|
||||||
elif os.path.exists(url_or_filename):
|
elif os.path.exists(url_or_filename):
|
||||||
# File, and it exists.
|
# File, and it exists.
|
||||||
return url_or_filename
|
return url_or_filename
|
||||||
@@ -234,19 +238,22 @@ def s3_get(url, temp_file, proxies=None):
|
|||||||
s3_resource.Bucket(bucket_name).download_fileobj(s3_path, temp_file)
|
s3_resource.Bucket(bucket_name).download_fileobj(s3_path, temp_file)
|
||||||
|
|
||||||
|
|
||||||
def http_get(url, temp_file, proxies=None):
|
def http_get(url, temp_file, proxies=None, resume_size=0):
|
||||||
req = requests.get(url, stream=True, proxies=proxies)
|
headers={'Range':'bytes=%d-'%(resume_size,)} if resume_size > 0 else None
|
||||||
content_length = req.headers.get('Content-Length')
|
response = requests.get(url, stream=True, proxies=proxies, headers=headers)
|
||||||
total = int(content_length) if content_length is not None else None
|
if response.status_code == 416: # Range not satisfiable
|
||||||
progress = tqdm(unit="B", total=total)
|
return
|
||||||
for chunk in req.iter_content(chunk_size=1024):
|
content_length = response.headers.get('Content-Length')
|
||||||
|
total = resume_size + int(content_length) if content_length is not None else None
|
||||||
|
progress = tqdm(unit="B", total=total, initial=resume_size)
|
||||||
|
for chunk in response.iter_content(chunk_size=1024):
|
||||||
if chunk: # filter out keep-alive new chunks
|
if chunk: # filter out keep-alive new chunks
|
||||||
progress.update(len(chunk))
|
progress.update(len(chunk))
|
||||||
temp_file.write(chunk)
|
temp_file.write(chunk)
|
||||||
progress.close()
|
progress.close()
|
||||||
|
|
||||||
|
|
||||||
def get_from_cache(url, cache_dir=None, force_download=False, proxies=None, etag_timeout=10):
|
def get_from_cache(url, cache_dir=None, force_download=False, proxies=None, etag_timeout=10, resume_download=False):
|
||||||
"""
|
"""
|
||||||
Given a URL, look for the corresponding dataset in the local cache.
|
Given a URL, look for the corresponding dataset in the local cache.
|
||||||
If it's not there, download it. Then return the path to the cached file.
|
If it's not there, download it. Then return the path to the cached file.
|
||||||
@@ -289,17 +296,35 @@ def get_from_cache(url, cache_dir=None, force_download=False, proxies=None, etag
|
|||||||
if matching_files:
|
if matching_files:
|
||||||
cache_path = os.path.join(cache_dir, matching_files[-1])
|
cache_path = os.path.join(cache_dir, matching_files[-1])
|
||||||
|
|
||||||
|
if resume_download:
|
||||||
|
incomplete_path = cache_path + '.incomplete'
|
||||||
|
@contextmanager
|
||||||
|
def _resumable_file_manager():
|
||||||
|
with open(incomplete_path,'a+b') as f:
|
||||||
|
yield f
|
||||||
|
os.remove(incomplete_path)
|
||||||
|
temp_file_manager = _resumable_file_manager
|
||||||
|
if os.path.exists(incomplete_path):
|
||||||
|
resume_size = os.stat(incomplete_path).st_size
|
||||||
|
else:
|
||||||
|
resume_size = 0
|
||||||
|
else:
|
||||||
|
temp_file_manager = tempfile.NamedTemporaryFile
|
||||||
|
resume_size = 0
|
||||||
|
|
||||||
if not os.path.exists(cache_path) or force_download:
|
if not os.path.exists(cache_path) or force_download:
|
||||||
# Download to temporary file, then copy to cache dir once finished.
|
# Download to temporary file, then copy to cache dir once finished.
|
||||||
# Otherwise you get corrupt cache entries if the download gets interrupted.
|
# Otherwise you get corrupt cache entries if the download gets interrupted.
|
||||||
with tempfile.NamedTemporaryFile() as temp_file:
|
with temp_file_manager() as temp_file:
|
||||||
logger.info("%s not found in cache or force_download set to True, downloading to %s", url, temp_file.name)
|
logger.info("%s not found in cache or force_download set to True, downloading to %s", url, temp_file.name)
|
||||||
|
|
||||||
# GET file object
|
# GET file object
|
||||||
if url.startswith("s3://"):
|
if url.startswith("s3://"):
|
||||||
|
if resume_download:
|
||||||
|
logger.warning('Warning: resumable downloads are not implemented for "s3://" urls')
|
||||||
s3_get(url, temp_file, proxies=proxies)
|
s3_get(url, temp_file, proxies=proxies)
|
||||||
else:
|
else:
|
||||||
http_get(url, temp_file, proxies=proxies)
|
http_get(url, temp_file, proxies=proxies, resume_size=resume_size)
|
||||||
|
|
||||||
# we are copying the file before closing it, so flush to avoid truncation
|
# we are copying the file before closing it, so flush to avoid truncation
|
||||||
temp_file.flush()
|
temp_file.flush()
|
||||||
|
|||||||
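The resume logic in this file combines three pieces: the size of the `.incomplete` file becomes `resume_size`, `http_get` sends a `Range: bytes=<resume_size>-` header when that size is positive, and the response body is appended to the partial file opened in `'a+b'` mode. The sketch below walks through those steps without any network access. It is an illustration under stated assumptions: `range_headers`, `expected_total`, `fake_http_get`, and `PAYLOAD` are hypothetical names, and the fake getter stands in for a server that honours range requests.

```python
import os
import tempfile

PAYLOAD = b"0123456789"  # stand-in for the remote file contents


def range_headers(resume_size):
    """Headers for a resumed request, or None when starting from scratch."""
    return {"Range": "bytes=%d-" % resume_size} if resume_size > 0 else None


def expected_total(resume_size, content_length):
    """Progress-bar total: bytes already on disk plus bytes still to be sent."""
    return resume_size + int(content_length) if content_length is not None else None


def fake_http_get(temp_file, resume_size=0):
    """Stand-in for a ranged GET: write only the bytes from resume_size onward."""
    temp_file.write(PAYLOAD[resume_size:])


with tempfile.TemporaryDirectory() as cache_dir:
    incomplete_path = os.path.join(cache_dir, "model.bin.incomplete")

    # Simulate an interrupted download that left four bytes behind.
    with open(incomplete_path, "wb") as f:
        f.write(PAYLOAD[:4])

    # Same check as in get_from_cache: resume from the partial file's size.
    resume_size = os.stat(incomplete_path).st_size if os.path.exists(incomplete_path) else 0

    # Append mode keeps the existing bytes and adds the rest.
    with open(incomplete_path, "a+b") as f:
        fake_http_get(f, resume_size=resume_size)

    with open(incomplete_path, "rb") as f:
        resumed = f.read()

print(range_headers(resume_size), resumed)
```

In the diff, a 416 (Range Not Satisfiable) response makes `http_get` return without writing, which covers the case where the partial file is already complete.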
764
transformers/modeling_albert.py
Normal file
@@ -0,0 +1,764 @@
|
|||||||
|
|
||||||
|
# coding=utf-8
|
||||||
|
# Copyright 2018 Google AI, Google Brain and the HuggingFace Inc. team.
|
||||||
|
#
|
||||||
|
# Licensed under the Apache License, Version 2.0 (the "License");
|
||||||
|
# you may not use this file except in compliance with the License.
|
||||||
|
# You may obtain a copy of the License at
|
||||||
|
#
|
||||||
|
# http://www.apache.org/licenses/LICENSE-2.0
|
||||||
|
#
|
||||||
|
# Unless required by applicable law or agreed to in writing, software
|
||||||
|
# distributed under the License is distributed on an "AS IS" BASIS,
|
||||||
|
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
|
||||||
|
# See the License for the specific language governing permissions and
|
||||||
|
# limitations under the License.
|
||||||
|
"""PyTorch ALBERT model. """
|
||||||
|
|
||||||
|
import os
|
||||||
|
import math
|
||||||
|
import logging
|
||||||
|
import torch
|
||||||
|
import torch.nn as nn
|
||||||
|
from torch.nn import CrossEntropyLoss, MSELoss
|
||||||
|
from transformers.modeling_utils import PreTrainedModel
|
||||||
|
from transformers.configuration_albert import AlbertConfig
|
||||||
|
from transformers.modeling_bert import BertEmbeddings, BertSelfAttention, prune_linear_layer, ACT2FN
|
||||||
|
from .file_utils import add_start_docstrings
|
||||||
|
|
||||||
|
logger = logging.getLogger(__name__)
|
||||||
|
|
||||||
|
|
||||||
|
ALBERT_PRETRAINED_MODEL_ARCHIVE_MAP = {
|
||||||
|
'albert-base-v1': "https://s3.amazonaws.com/models.huggingface.co/bert/albert-base-pytorch_model.bin",
|
||||||
|
'albert-large-v1': "https://s3.amazonaws.com/models.huggingface.co/bert/albert-large-pytorch_model.bin",
|
||||||
|
'albert-xlarge-v1': "https://s3.amazonaws.com/models.huggingface.co/bert/albert-xlarge-pytorch_model.bin",
|
||||||
|
'albert-xxlarge-v1': "https://s3.amazonaws.com/models.huggingface.co/bert/albert-xxlarge-pytorch_model.bin",
|
||||||
|
'albert-base-v2': "https://s3.amazonaws.com/models.huggingface.co/bert/albert-base-v2-pytorch_model.bin",
|
||||||
|
'albert-large-v2': "https://s3.amazonaws.com/models.huggingface.co/bert/albert-large-v2-pytorch_model.bin",
|
||||||
|
'albert-xlarge-v2': "https://s3.amazonaws.com/models.huggingface.co/bert/albert-xlarge-v2-pytorch_model.bin",
|
||||||
|
'albert-xxlarge-v2': "https://s3.amazonaws.com/models.huggingface.co/bert/albert-xxlarge-v2-pytorch_model.bin",
|
||||||
|
}
|
||||||
|
|
||||||
|
|
||||||
|
def load_tf_weights_in_albert(model, config, tf_checkpoint_path):
|
||||||
|
""" Load tf checkpoints in a pytorch model."""
|
||||||
|
try:
|
||||||
|
import re
|
||||||
|
import numpy as np
|
||||||
|
import tensorflow as tf
|
||||||
|
except ImportError:
|
||||||
|
logger.error("Loading a TensorFlow model in PyTorch requires TensorFlow to be installed. Please see "
|
||||||
|
"https://www.tensorflow.org/install/ for installation instructions.")
|
||||||
|
raise
|
||||||
|
tf_path = os.path.abspath(tf_checkpoint_path)
|
||||||
|
logger.info("Converting TensorFlow checkpoint from {}".format(tf_path))
|
||||||
|
# Load weights from TF model
|
||||||
|
init_vars = tf.train.list_variables(tf_path)
|
||||||
|
names = []
|
||||||
|
arrays = []
|
||||||
|
for name, shape in init_vars:
|
||||||
|
logger.info("Loading TF weight {} with shape {}".format(name, shape))
|
||||||
|
array = tf.train.load_variable(tf_path, name)
|
||||||
|
names.append(name)
|
||||||
|
arrays.append(array)
|
||||||
|
|
||||||
|
for name, array in zip(names, arrays):
|
||||||
|
print(name)
|
||||||
|
|
||||||
|
for name, array in zip(names, arrays):
|
||||||
|
original_name = name
|
||||||
|
name = name.replace("ffn_1", "ffn")
|
||||||
|
name = name.replace("/bert/", "/albert/")
|
||||||
|
name = name.replace("ffn/intermediate/output", "ffn_output")
|
||||||
|
name = name.replace("attention_1", "attention")
|
||||||
|
name = name.replace("cls/predictions", "predictions")
|
||||||
|
name = name.replace("transform/", "")
|
||||||
|
name = name.replace("LayerNorm_1", "full_layer_layer_norm")
|
||||||
|
name = name.replace("LayerNorm", "attention/LayerNorm")
|
||||||
|
name = name.replace("inner_group_", "albert_layers/")
|
||||||
|
name = name.replace("group_", "albert_layer_groups/")
|
||||||
|
name = name.split('/')
|
||||||
|
pointer = model
|
||||||
|
for m_name in name:
|
||||||
|
if re.fullmatch(r'[A-Za-z]+_\d+', m_name):
|
||||||
|
l = re.split(r'_(\d+)', m_name)
|
||||||
|
else:
|
||||||
|
l = [m_name]
|
||||||
|
|
||||||
|
if l[0] == 'kernel' or l[0] == 'gamma':
|
||||||
|
pointer = getattr(pointer, 'weight')
|
||||||
|
elif l[0] == 'output_bias' or l[0] == 'beta':
|
||||||
|
pointer = getattr(pointer, 'bias')
|
||||||
|
elif l[0] == 'output_weights':
|
||||||
|
pointer = getattr(pointer, 'weight')
|
||||||
|
elif l[0] == 'squad':
|
||||||
|
pointer = getattr(pointer, 'classifier')
|
||||||
|
else:
|
||||||
|
try:
|
||||||
|
pointer = getattr(pointer, l[0])
|
||||||
|
except AttributeError:
|
||||||
|
logger.info("Skipping {}".format("/".join(name)))
|
||||||
|
continue
|
||||||
|
if len(l) >= 2:
|
||||||
|
num = int(l[1])
|
||||||
|
pointer = pointer[num]
|
||||||
|
|
||||||
|
if m_name[-11:] == '_embeddings':
|
||||||
|
pointer = getattr(pointer, 'weight')
|
||||||
|
elif m_name == 'kernel':
|
||||||
|
array = np.transpose(array)
|
||||||
|
try:
|
||||||
|
assert pointer.shape == array.shape
|
||||||
|
except AssertionError as e:
|
||||||
|
e.args += (pointer.shape, array.shape)
|
||||||
|
raise
|
||||||
|
print("Initialize PyTorch weight {} from {}".format(name, original_name))
|
||||||
|
pointer.data = torch.from_numpy(array)
|
||||||
|
|
||||||
|
return model
|
||||||
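The loader above maps each TensorFlow variable name onto a PyTorch attribute path through an ordered chain of string replacements, then splits the result on `/`. Order matters: `inner_group_` must be rewritten before `group_`, and `LayerNorm_1` before `LayerNorm`. The sketch below isolates that step; the replacement pairs are copied from the diff, while `map_tf_name`, `split_index`, and the sample variable names are hypothetical and chosen only to exercise the rules.

```python
import re

# Replacement pairs copied from the loader, applied in the same order.
REPLACEMENTS = [
    ("ffn_1", "ffn"),
    ("/bert/", "/albert/"),
    ("ffn/intermediate/output", "ffn_output"),
    ("attention_1", "attention"),
    ("cls/predictions", "predictions"),
    ("transform/", ""),
    ("LayerNorm_1", "full_layer_layer_norm"),
    ("LayerNorm", "attention/LayerNorm"),
    ("inner_group_", "albert_layers/"),
    ("group_", "albert_layer_groups/"),
]


def map_tf_name(name):
    """Apply the replacements in order and split into attribute path components."""
    for old, new in REPLACEMENTS:
        name = name.replace(old, new)
    return name.split("/")


def split_index(m_name):
    """Split names like 'layer_3' into a base name and an index, as the loader does."""
    if re.fullmatch(r"[A-Za-z]+_\d+", m_name):
        return re.split(r"_(\d+)", m_name)
    return [m_name]


# Hypothetical variable names, for illustration only.
path = map_tf_name("group_0/inner_group_0/LayerNorm_1/gamma")
print(path)
print(split_index("layer_3"), split_index("kernel"))
```

For the first sample name, the path resolves to `albert_layer_groups/0/albert_layers/0/full_layer_layer_norm/gamma`, and the trailing `gamma` is what the loader then maps to the `weight` attribute.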
|
|
||||||
|
|
||||||
|
class AlbertEmbeddings(BertEmbeddings):
|
||||||
|
"""
|
||||||
|
Construct the embeddings from word, position and token_type embeddings.
|
||||||
|
"""
|
||||||
|
def __init__(self, config):
|
||||||
|
super(AlbertEmbeddings, self).__init__(config)
|
||||||
|
|
||||||
|
self.word_embeddings = nn.Embedding(config.vocab_size, config.embedding_size, padding_idx=0)
|
||||||
|
self.position_embeddings = nn.Embedding(config.max_position_embeddings, config.embedding_size)
|
||||||
|
self.token_type_embeddings = nn.Embedding(config.type_vocab_size, config.embedding_size)
|
||||||
|
self.LayerNorm = torch.nn.LayerNorm(config.embedding_size, eps=config.layer_norm_eps)
|
||||||
|
|
||||||
|
|
||||||
|
class AlbertAttention(BertSelfAttention):
|
||||||
|
def __init__(self, config):
|
||||||
|
super(AlbertAttention, self).__init__(config)
|
||||||
|
|
||||||
|
self.output_attentions = config.output_attentions
|
||||||
|
self.num_attention_heads = config.num_attention_heads
|
||||||
|
self.hidden_size = config.hidden_size
|
||||||
|
self.attention_head_size = config.hidden_size // config.num_attention_heads
|
||||||
|
self.dropout = nn.Dropout(config.attention_probs_dropout_prob)
|
||||||
|
self.dense = nn.Linear(config.hidden_size, config.hidden_size)
|
||||||
|
self.LayerNorm = nn.LayerNorm(config.hidden_size, eps=config.layer_norm_eps)
|
||||||
|
self.pruned_heads = set()
|
||||||
|
|
||||||
|
def prune_heads(self, heads):
|
||||||
|
if len(heads) == 0:
|
||||||
|
return
|
||||||
|
mask = torch.ones(self.num_attention_heads, self.attention_head_size)
|
||||||
|
heads = set(heads) - self.pruned_heads  # Convert to set and remove already pruned heads
|
||||||
|
for head in heads:
|
||||||
|
# Compute how many pruned heads are before the head and move the index accordingly
|
||||||
|
head = head - sum(1 if h < head else 0 for h in self.pruned_heads)
|
||||||
|
mask[head] = 0
|
||||||
|
mask = mask.view(-1).contiguous().eq(1)
|
||||||
|
index = torch.arange(len(mask))[mask].long()
|
||||||
|
|
||||||
|
# Prune linear layers
|
||||||
|
self.query = prune_linear_layer(self.query, index)
|
||||||
|
self.key = prune_linear_layer(self.key, index)
|
||||||
|
self.value = prune_linear_layer(self.value, index)
|
||||||
|
self.dense = prune_linear_layer(self.dense, index, dim=1)
|
||||||
|
|
||||||
|
# Update hyper params and store pruned heads
|
||||||
|
self.num_attention_heads = self.num_attention_heads - len(heads)
|
||||||
|
self.all_head_size = self.attention_head_size * self.num_attention_heads
|
||||||
|
self.pruned_heads = self.pruned_heads.union(heads)
|
||||||
|
|
||||||
|
def forward(self, input_ids, attention_mask=None, head_mask=None):
|
||||||
|
mixed_query_layer = self.query(input_ids)
|
||||||
|
mixed_key_layer = self.key(input_ids)
|
||||||
|
mixed_value_layer = self.value(input_ids)
|
||||||
|
|
||||||
|
query_layer = self.transpose_for_scores(mixed_query_layer)
|
||||||
|
key_layer = self.transpose_for_scores(mixed_key_layer)
|
||||||
|
value_layer = self.transpose_for_scores(mixed_value_layer)
|
||||||
|
|
||||||
|
# Take the dot product between "query" and "key" to get the raw attention scores.
|
||||||
|
attention_scores = torch.matmul(query_layer, key_layer.transpose(-1, -2))
|
||||||
|
attention_scores = attention_scores / math.sqrt(self.attention_head_size)
|
||||||
|
if attention_mask is not None:
|
||||||
|
# Apply the attention mask (precomputed for all layers in BertModel forward() function)
|
||||||
|
attention_scores = attention_scores + attention_mask
|
||||||
|
|
||||||
|
# Normalize the attention scores to probabilities.
|
||||||
|
attention_probs = nn.Softmax(dim=-1)(attention_scores)
|
||||||
|
|
||||||
|
# This is actually dropping out entire tokens to attend to, which might
|
||||||
|
# seem a bit unusual, but is taken from the original Transformer paper.
|
||||||
|
attention_probs = self.dropout(attention_probs)
|
||||||
|
|
||||||
|
# Mask heads if we want to
|
||||||
|
if head_mask is not None:
|
||||||
|
attention_probs = attention_probs * head_mask
|
||||||
|
|
||||||
|
context_layer = torch.matmul(attention_probs, value_layer)
|
||||||
|
|
||||||
|
context_layer = context_layer.permute(0, 2, 1, 3).contiguous()
|
||||||
|
new_context_layer_shape = context_layer.size()[:-2] + (self.all_head_size,)
|
||||||
|
reshaped_context_layer = context_layer.view(*new_context_layer_shape)
|
||||||
|
|
||||||
|
|
||||||
|
# Should find a better way to do this
|
||||||
|
w = self.dense.weight.t().view(self.num_attention_heads, self.attention_head_size, self.hidden_size).to(context_layer.dtype)
|
||||||
|
b = self.dense.bias.to(context_layer.dtype)
|
||||||
|
|
||||||
|
projected_context_layer = torch.einsum("bfnd,ndh->bfh", context_layer, w) + b
|
||||||
|
projected_context_layer_dropout = self.dropout(projected_context_layer)
|
||||||
|
layernormed_context_layer = self.LayerNorm(input_ids + projected_context_layer_dropout)
|
||||||
|
return (layernormed_context_layer, attention_probs) if self.output_attentions else (layernormed_context_layer,)
|
||||||
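`prune_heads` builds a per-head mask, shifts each requested head index left by the number of previously pruned heads that precede it, and keeps the flat indices whose mask entry is still 1. The sketch below reproduces that index arithmetic with nested lists in place of tensors; `kept_indices` is a hypothetical helper written for this example, not part of the module.

```python
def kept_indices(num_heads, head_size, heads, already_pruned=frozenset()):
    """Flat indices of the rows that survive pruning `heads`.

    `num_heads` is the number of heads still present, i.e. after earlier pruning.
    """
    mask = [[1] * head_size for _ in range(num_heads)]
    for head in set(heads) - set(already_pruned):
        # Shift the index left by the number of previously pruned heads before it.
        head -= sum(1 for h in already_pruned if h < head)
        mask[head] = [0] * head_size
    flat = [value for row in mask for value in row]
    return [i for i, value in enumerate(flat) if value == 1]


# Four heads of size 2, pruning head 1 removes flat rows 2 and 3.
first = kept_indices(num_heads=4, head_size=2, heads={1})

# Head 0 was pruned earlier, so three heads remain and original head 2 now sits at index 1.
second = kept_indices(num_heads=3, head_size=2, heads={0, 2}, already_pruned={0})
print(first, second)
```

The resulting index list plays the role of `index` passed to `prune_linear_layer` for the query, key, value, and dense projections.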
|
|
||||||
|
|
||||||
|
class AlbertLayer(nn.Module):
|
||||||
|
def __init__(self, config):
|
||||||
|
super(AlbertLayer, self).__init__()
|
||||||
|
|
||||||
|
self.config = config
|
||||||
|
self.full_layer_layer_norm = nn.LayerNorm(config.hidden_size, eps=config.layer_norm_eps)
|
||||||
|
self.attention = AlbertAttention(config)
|
||||||
|
self.ffn = nn.Linear(config.hidden_size, config.intermediate_size)
|
||||||
|
self.ffn_output = nn.Linear(config.intermediate_size, config.hidden_size)
|
||||||
|
self.activation = ACT2FN[config.hidden_act]
|
||||||
|
|
||||||
|
def forward(self, hidden_states, attention_mask=None, head_mask=None):
|
||||||
|
attention_output = self.attention(hidden_states, attention_mask, head_mask)
|
||||||
|
ffn_output = self.ffn(attention_output[0])
|
||||||
|
ffn_output = self.activation(ffn_output)
|
||||||
|
ffn_output = self.ffn_output(ffn_output)
|
||||||
|
hidden_states = self.full_layer_layer_norm(ffn_output + attention_output[0])
|
||||||
|
|
||||||
|
return (hidden_states,) + attention_output[1:] # add attentions if we output them
|
||||||
|
|
||||||
|
|
||||||
|
class AlbertLayerGroup(nn.Module):
|
||||||
|
def __init__(self, config):
|
||||||
|
super(AlbertLayerGroup, self).__init__()
|
||||||
|
|
||||||
|
self.output_attentions = config.output_attentions
|
||||||
|
self.output_hidden_states = config.output_hidden_states
|
||||||
|
self.albert_layers = nn.ModuleList([AlbertLayer(config) for _ in range(config.inner_group_num)])
|
||||||
|
|
||||||
|
def forward(self, hidden_states, attention_mask=None, head_mask=None):
|
||||||
|
layer_hidden_states = ()
|
||||||
|
layer_attentions = ()
|
||||||
|
|
||||||
|
for layer_index, albert_layer in enumerate(self.albert_layers):
|
||||||
|
layer_output = albert_layer(hidden_states, attention_mask, head_mask[layer_index])
|
||||||
|
hidden_states = layer_output[0]
|
||||||
|
|
||||||
|
if self.output_attentions:
|
||||||
|
layer_attentions = layer_attentions + (layer_output[1],)
|
||||||
|
|
||||||
|
if self.output_hidden_states:
|
||||||
|
layer_hidden_states = layer_hidden_states + (hidden_states,)
|
||||||
|
|
||||||
|
outputs = (hidden_states,)
|
||||||
|
if self.output_hidden_states:
|
||||||
|
outputs = outputs + (layer_hidden_states,)
|
||||||
|
if self.output_attentions:
|
||||||
|
outputs = outputs + (layer_attentions,)
|
||||||
|
return outputs # last-layer hidden state, (layer hidden states), (layer attentions)
|
||||||
|
|
||||||
|
|
||||||
|
class AlbertTransformer(nn.Module):
|
||||||
|
def __init__(self, config):
|
||||||
|
super(AlbertTransformer, self).__init__()
|
||||||
|
|
||||||
|
self.config = config
|
||||||
|
self.output_attentions = config.output_attentions
|
||||||
|
self.output_hidden_states = config.output_hidden_states
|
||||||
|
self.embedding_hidden_mapping_in = nn.Linear(config.embedding_size, config.hidden_size)
|
||||||
|
self.albert_layer_groups = nn.ModuleList([AlbertLayerGroup(config) for _ in range(config.num_hidden_groups)])
|
||||||
|
|
||||||
|
def forward(self, hidden_states, attention_mask=None, head_mask=None):
|
||||||
|
hidden_states = self.embedding_hidden_mapping_in(hidden_states)
|
||||||
|
|
||||||
|
all_attentions = ()
|
||||||
|
|
||||||
|
if self.output_hidden_states:
|
||||||
|
all_hidden_states = (hidden_states,)
|
||||||
|
|
||||||
|
for i in range(self.config.num_hidden_layers):
|
||||||
|
# Number of layers in a hidden group
|
||||||
|
layers_per_group = int(self.config.num_hidden_layers / self.config.num_hidden_groups)
|
||||||
|
|
||||||
|
# Index of the hidden group
|
||||||
|
group_idx = int(i / (self.config.num_hidden_layers / self.config.num_hidden_groups))
|
||||||
|
|
||||||
|
# Index of the layer inside the group
|
||||||
|
layer_idx = int(i - group_idx * layers_per_group)
|
||||||
|
|
||||||
|
layer_group_output = self.albert_layer_groups[group_idx](hidden_states, attention_mask, head_mask[group_idx*layers_per_group:(group_idx+1)*layers_per_group])
|
||||||
|
hidden_states = layer_group_output[0]
|
||||||
|
|
||||||
|
if self.output_attentions:
|
||||||
|
all_attentions = all_attentions + layer_group_output[-1]
|
||||||
|
|
||||||
|
if self.output_hidden_states:
|
||||||
|
all_hidden_states = all_hidden_states + (hidden_states,)
|
||||||
|
|
||||||
|
|
||||||
|
outputs = (hidden_states,)
|
||||||
|
if self.output_hidden_states:
|
||||||
|
outputs = outputs + (all_hidden_states,)
|
||||||
|
if self.output_attentions:
|
||||||
|
outputs = outputs + (all_attentions,)
|
||||||
|
return outputs # last-layer hidden state, (all hidden states), (all attentions)
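The loop in `AlbertTransformer.forward` maps every layer index onto a shared layer group. The following is a minimal standalone sketch of that index arithmetic, assuming `num_hidden_layers` is divisible by `num_hidden_groups`; the helper name `group_schedule` is hypothetical and not part of the library.

```python
def group_schedule(num_hidden_layers, num_hidden_groups):
    """Return (layer index, group index, index inside the group) for each layer.

    Hypothetical helper mirroring the index arithmetic of the forward loop.
    Assumes num_hidden_layers is divisible by num_hidden_groups.
    """
    layers_per_group = num_hidden_layers // num_hidden_groups
    schedule = []
    for i in range(num_hidden_layers):
        group_idx = i // layers_per_group              # which shared group is applied
        layer_idx = i - group_idx * layers_per_group   # position inside that group
        schedule.append((i, group_idx, layer_idx))
    return schedule


schedule = group_schedule(num_hidden_layers=12, num_hidden_groups=2)
groups_visited = [group for _, group, _ in schedule]
print(groups_visited)
```

With a single hidden group, every iteration reuses group 0, which is how the weights end up shared across all layers.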
|
||||||
|
|
||||||
|
|
||||||
|
|
||||||
|
class AlbertPreTrainedModel(PreTrainedModel):
|
||||||
|
""" An abstract class to handle weights initialization and
|
||||||
|
a simple interface for downloading and loading pretrained models.
|
||||||
|
"""
|
||||||
|
config_class = AlbertConfig
|
||||||
|
pretrained_model_archive_map = ALBERT_PRETRAINED_MODEL_ARCHIVE_MAP
|
||||||
|
base_model_prefix = "albert"
|
||||||
|
|
||||||
|
def _init_weights(self, module):
|
||||||
|
""" Initialize the weights.
|
||||||
|
"""
|
||||||
|
if isinstance(module, (nn.Linear, nn.Embedding)):
|
||||||
|
# Slightly different from the TF version which uses truncated_normal for initialization
|
||||||
|
# cf https://github.com/pytorch/pytorch/pull/5617
|
||||||
|
module.weight.data.normal_(mean=0.0, std=self.config.initializer_range)
|
||||||
|
if isinstance(module, (nn.Linear)) and module.bias is not None:
|
||||||
|
module.bias.data.zero_()
|
||||||
|
elif isinstance(module, nn.LayerNorm):
|
||||||
|
module.bias.data.zero_()
|
||||||
|
module.weight.data.fill_(1.0)
|
||||||
|
|
||||||
|
|
||||||
|
ALBERT_START_DOCSTRING = r""" The ALBERT model was proposed in
|
||||||
|
`ALBERT: A Lite BERT for Self-supervised Learning of Language Representations`_
|
||||||
|
by Zhenzhong Lan, Mingda Chen, Sebastian Goodman, Kevin Gimpel, Piyush Sharma, Radu Soricut. It presents
|
||||||
|
two parameter-reduction techniques to lower memory consumption and increase the training speed of BERT.
|
||||||
|
|
||||||
|
This model is a PyTorch `torch.nn.Module`_ sub-class. Use it as a regular PyTorch Module and
|
||||||
|
refer to the PyTorch documentation for all matters related to general usage and behavior.
|
||||||
|
|
||||||
|
.. _`ALBERT: A Lite BERT for Self-supervised Learning of Language Representations`:
|
||||||
|
https://arxiv.org/abs/1909.11942
|
||||||
|
|
||||||
|
.. _`torch.nn.Module`:
|
||||||
|
https://pytorch.org/docs/stable/nn.html#module
|
||||||
|
|
||||||
|
Parameters:
|
||||||
|
config (:class:`~transformers.AlbertConfig`): Model configuration class with all the parameters of the model.
|
||||||
|
Initializing with a config file does not load the weights associated with the model, only the configuration.
|
||||||
|
Check out the :meth:`~transformers.PreTrainedModel.from_pretrained` method to load the model weights.
|
||||||
|
"""
|
||||||
|
|
||||||
|
ALBERT_INPUTS_DOCSTRING = r"""
|
||||||
|
Inputs:
|
||||||
|
**input_ids**: ``torch.LongTensor`` of shape ``(batch_size, sequence_length)``:
|
||||||
|
Indices of input sequence tokens in the vocabulary.
|
||||||
|
To match pre-training, ALBERT input sequences should be formatted with [CLS] and [SEP] tokens as follows:
|
||||||
|
|
||||||
|
(a) For sequence pairs:
|
||||||
|
|
||||||
|
``tokens: [CLS] is this jack ##son ##ville ? [SEP] no it is not . [SEP]``
|
||||||
|
|
||||||
|
``token_type_ids: 0 0 0 0 0 0 0 0 1 1 1 1 1 1``
|
||||||
|
|
||||||
|
(b) For single sequences:
|
||||||
|
|
||||||
|
``tokens: [CLS] the dog is hairy . [SEP]``
|
||||||
|
|
||||||
|
``token_type_ids: 0 0 0 0 0 0 0``
|
||||||
|
|
||||||
|
Albert is a model with absolute position embeddings so it's usually advised to pad the inputs on
|
||||||
|
the right rather than the left.
|
||||||
|
|
||||||
|
Indices can be obtained using :class:`transformers.AlbertTokenizer`.
|
||||||
|
See :func:`transformers.PreTrainedTokenizer.encode` and
|
||||||
|
:func:`transformers.PreTrainedTokenizer.convert_tokens_to_ids` for details.
|
||||||
|
**attention_mask**: (`optional`) ``torch.FloatTensor`` of shape ``(batch_size, sequence_length)``:
|
||||||
|
Mask to avoid performing attention on padding token indices.
|
||||||
|
Mask values selected in ``[0, 1]``:
|
||||||
|
``1`` for tokens that are NOT MASKED, ``0`` for MASKED tokens.
|
||||||
|
**token_type_ids**: (`optional`) ``torch.LongTensor`` of shape ``(batch_size, sequence_length)``:
|
||||||
|
Segment token indices to indicate first and second portions of the inputs.
|
||||||
|
Indices are selected in ``[0, 1]``: ``0`` corresponds to a `sentence A` token, ``1``
|
||||||
|
corresponds to a `sentence B` token
|
||||||
|
(see `BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding`_ for more details).
|
||||||
|
**position_ids**: (`optional`) ``torch.LongTensor`` of shape ``(batch_size, sequence_length)``:
|
||||||
|
Indices of the position of each input sequence token in the position embeddings.
|
||||||
|
Selected in the range ``[0, config.max_position_embeddings - 1]``.
|
||||||
|
**head_mask**: (`optional`) ``torch.FloatTensor`` of shape ``(num_heads,)`` or ``(num_layers, num_heads)``:
|
||||||
|
Mask to nullify selected heads of the self-attention modules.
|
||||||
|
Mask values selected in ``[0, 1]``:
|
||||||
|
``1`` indicates the head is **not masked**, ``0`` indicates the head is **masked**.
|
||||||
|
"""
|
||||||
|
|
||||||
|
@add_start_docstrings("The bare ALBERT Model transformer outputting raw hidden-states without any specific head on top.",
|
||||||
|
ALBERT_START_DOCSTRING, ALBERT_INPUTS_DOCSTRING)
|
||||||
|
class AlbertModel(AlbertPreTrainedModel):
|
||||||
|
r"""
|
||||||
|
Outputs: `Tuple` comprising various elements depending on the configuration (config) and inputs:
|
||||||
|
**last_hidden_state**: ``torch.FloatTensor`` of shape ``(batch_size, sequence_length, hidden_size)``
|
||||||
|
Sequence of hidden-states at the output of the last layer of the model.
|
||||||
|
**pooler_output**: ``torch.FloatTensor`` of shape ``(batch_size, hidden_size)``
|
||||||
|
Last layer hidden-state of the first token of the sequence (classification token)
|
||||||
|
further processed by a Linear layer and a Tanh activation function. The Linear
|
||||||
|
layer weights are trained from the next sentence prediction (classification)
|
||||||
|
objective during Bert pretraining. This output is usually *not* a good summary
|
||||||
|
of the semantic content of the input; you are often better off averaging or pooling
|
||||||
|
the sequence of hidden-states for the whole input sequence.
|
||||||
|
**hidden_states**: (`optional`, returned when ``config.output_hidden_states=True``)
|
||||||
|
list of ``torch.FloatTensor`` (one for the output of each layer + the output of the embeddings)
|
||||||
|
of shape ``(batch_size, sequence_length, hidden_size)``:
|
||||||
|
Hidden-states of the model at the output of each layer plus the initial embedding outputs.
|
||||||
|
**attentions**: (`optional`, returned when ``config.output_attentions=True``)
|
||||||
|
list of ``torch.FloatTensor`` (one for each layer) of shape ``(batch_size, num_heads, sequence_length, sequence_length)``:
|
||||||
|
Attention weights after the attention softmax, used to compute the weighted average in the self-attention heads.
|
||||||
|
"""
|
||||||
|
|
||||||
|
config_class = AlbertConfig
|
||||||
|
pretrained_model_archive_map = ALBERT_PRETRAINED_MODEL_ARCHIVE_MAP
|
||||||
|
load_tf_weights = load_tf_weights_in_albert
|
||||||
|
base_model_prefix = "albert"
|
||||||
|
|
||||||
|
def __init__(self, config):
|
||||||
|
super(AlbertModel, self).__init__(config)
|
||||||
|
|
||||||
|
self.config = config
|
||||||
|
self.embeddings = AlbertEmbeddings(config)
|
||||||
|
self.encoder = AlbertTransformer(config)
|
||||||
|
self.pooler = nn.Linear(config.hidden_size, config.hidden_size)
|
||||||
|
self.pooler_activation = nn.Tanh()
|
||||||
|
|
||||||
|
self.init_weights()
|
||||||
|
|
||||||
|
def get_input_embeddings(self):
|
||||||
|
return self.embeddings.word_embeddings
|
||||||
|
|
||||||
|
def set_input_embeddings(self, value):
|
||||||
|
self.embeddings.word_embeddings = value
|
||||||
|
|
||||||
|
def _resize_token_embeddings(self, new_num_tokens):
|
||||||
|
old_embeddings = self.embeddings.word_embeddings
|
||||||
|
new_embeddings = self._get_resized_embeddings(old_embeddings, new_num_tokens)
|
||||||
|
self.embeddings.word_embeddings = new_embeddings
|
||||||
|
return self.embeddings.word_embeddings
|
||||||
|
|
||||||
|
def _prune_heads(self, heads_to_prune):
|
||||||
|
""" Prunes heads of the model.
|
||||||
|
heads_to_prune: dict of {layer_num: list of heads to prune in this layer}
|
||||||
|
ALBERT has a different architecture in that its layers are shared across groups, which in turn have inner groups.
|
||||||
|
If an ALBERT model has 12 hidden layers and 2 hidden groups, with two inner groups, there
|
||||||
|
is a total of 4 different layers.
|
||||||
|
|
||||||
|
These layers are flattened: the indices [0,1] correspond to the two inner groups of the first hidden group,
|
||||||
|
while [2,3] correspond to the two inner groups of the second hidden group.
|
||||||
|
|
||||||
|
Any layer with an index other than [0,1,2,3] will result in an error.
|
||||||
|
See base class PreTrainedModel for more information about head pruning
|
||||||
|
"""
|
||||||
|
for layer, heads in heads_to_prune.items():
|
||||||
|
group_idx = int(layer / self.config.inner_group_num)
|
||||||
|
inner_group_idx = int(layer - group_idx * self.config.inner_group_num)
|
||||||
|
self.encoder.albert_layer_groups[group_idx].albert_layers[inner_group_idx].attention.prune_heads(heads)
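The flattened layer numbering used by `_prune_heads` can be illustrated outside the model. This is a small sketch, assuming `inner_group_num=2` as in the docstring example; `locate_layer` is a hypothetical helper name, not a library function.

```python
def locate_layer(flat_index, inner_group_num):
    """Split a flattened layer index into (group index, inner group index).

    Hypothetical helper reproducing the arithmetic used when pruning heads.
    """
    group_idx = flat_index // inner_group_num
    inner_group_idx = flat_index - group_idx * inner_group_num
    return group_idx, inner_group_idx


# Example request: prune heads 0 and 1 in flattened layer 0, head 2 in flattened layer 3.
heads_to_prune = {0: [0, 1], 3: [2]}
resolved = {layer: locate_layer(layer, inner_group_num=2) for layer in heads_to_prune}
print(resolved)
```

A flattened index whose group index exceeds the number of hidden groups would fail when indexing `albert_layer_groups`, which matches the error the docstring warns about.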
|
||||||
|
|
||||||
|
def forward(self, input_ids=None, attention_mask=None, token_type_ids=None, position_ids=None, head_mask=None,
|
||||||
|
inputs_embeds=None):
|
||||||
|
|
||||||
|
if input_ids is not None and inputs_embeds is not None:
|
||||||
|
raise ValueError("You cannot specify both input_ids and inputs_embeds at the same time")
|
||||||
|
elif input_ids is not None:
|
||||||
|
input_shape = input_ids.size()
|
||||||
|
elif inputs_embeds is not None:
|
||||||
|
input_shape = inputs_embeds.size()[:-1]
|
||||||
|
else:
|
||||||
|
raise ValueError("You have to specify either input_ids or inputs_embeds")
|
||||||
|
|
||||||
|
device = input_ids.device if input_ids is not None else inputs_embeds.device
|
||||||
|
|
||||||
|
if attention_mask is None:
|
||||||
|
attention_mask = torch.ones(input_shape, device=device)
|
||||||
|
if token_type_ids is None:
|
||||||
|
token_type_ids = torch.zeros(input_shape, dtype=torch.long, device=device)
|
||||||
|
|
||||||
|
extended_attention_mask = attention_mask.unsqueeze(1).unsqueeze(2)
|
||||||
|
extended_attention_mask = extended_attention_mask.to(dtype=next(self.parameters()).dtype) # fp16 compatibility
|
||||||
|
extended_attention_mask = (1.0 - extended_attention_mask) * -10000.0
|
||||||
|
if head_mask is not None:
|
||||||
|
if head_mask.dim() == 1:
|
||||||
|
head_mask = head_mask.unsqueeze(0).unsqueeze(0).unsqueeze(-1).unsqueeze(-1)
|
||||||
|
head_mask = head_mask.expand(self.config.num_hidden_layers, -1, -1, -1, -1)
|
||||||
|
elif head_mask.dim() == 2:
|
||||||
|
head_mask = head_mask.unsqueeze(1).unsqueeze(-1).unsqueeze(-1) # We can specify head_mask for each layer
|
||||||
|
head_mask = head_mask.to(dtype=next(self.parameters()).dtype) # switch to float if needed + fp16 compatibility
|
||||||
|
else:
|
||||||
|
head_mask = [None] * self.config.num_hidden_layers
|
||||||
|
|
||||||
|
embedding_output = self.embeddings(input_ids, position_ids=position_ids, token_type_ids=token_type_ids,
|
||||||
|
inputs_embeds=inputs_embeds)
|
||||||
|
encoder_outputs = self.encoder(embedding_output,
|
||||||
|
extended_attention_mask,
|
||||||
|
head_mask=head_mask)
|
||||||
|
|
||||||
|
sequence_output = encoder_outputs[0]
|
||||||
|
|
||||||
|
pooled_output = self.pooler_activation(self.pooler(sequence_output[:, 0]))
|
||||||
|
|
||||||
|
outputs = (sequence_output, pooled_output) + encoder_outputs[1:] # add hidden_states and attentions if they are here
|
||||||
|
return outputs
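The `extended_attention_mask` computed in `AlbertModel.forward` converts a 0/1 padding mask into an additive bias on the attention scores. Below is a plain-Python sketch of that idea (no tensors, no broadcasting dimensions); the helper names and the example scores are assumptions made for illustration.

```python
import math


def extend_attention_mask(attention_mask, penalty=-10000.0):
    """Turn a 0/1 padding mask into an additive mask: 0 where kept, `penalty` where padded."""
    return [[(1.0 - float(keep)) * penalty for keep in row] for row in attention_mask]


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


attention_mask = [[1, 1, 0]]          # last position is padding
additive_mask = extend_attention_mask(attention_mask)

raw_scores = [2.0, 1.0, 3.0]          # made-up attention scores for one query
masked_scores = [s + m for s, m in zip(raw_scores, additive_mask[0])]
probs = softmax(masked_scores)
print(probs)
```

Adding a large negative value before the softmax drives the probability of padded positions to effectively zero, even when their raw score was the highest.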
|
||||||
|
|
||||||
|
class AlbertMLMHead(nn.Module):
|
||||||
|
def __init__(self, config):
|
||||||
|
super(AlbertMLMHead, self).__init__()
|
||||||
|
|
||||||
|
self.LayerNorm = nn.LayerNorm(config.embedding_size)
|
||||||
|
self.bias = nn.Parameter(torch.zeros(config.vocab_size))
|
||||||
|
self.dense = nn.Linear(config.hidden_size, config.embedding_size)
|
||||||
|
self.decoder = nn.Linear(config.embedding_size, config.vocab_size)
|
||||||
|
self.activation = ACT2FN[config.hidden_act]
|
||||||
|
|
||||||
|
def forward(self, hidden_states):
|
||||||
|
hidden_states = self.dense(hidden_states)
|
||||||
|
hidden_states = self.activation(hidden_states)
|
||||||
|
hidden_states = self.LayerNorm(hidden_states)
|
||||||
|
hidden_states = self.decoder(hidden_states)
|
||||||
|
|
||||||
|
prediction_scores = hidden_states + self.bias
|
||||||
|
|
||||||
|
return prediction_scores
|
||||||
|
|
||||||
|
|
||||||
|
@add_start_docstrings("Albert Model with a `language modeling` head on top.", ALBERT_START_DOCSTRING, ALBERT_INPUTS_DOCSTRING)
|
||||||
|
class AlbertForMaskedLM(AlbertPreTrainedModel):
|
||||||
|
r"""
|
||||||
|
**masked_lm_labels**: (`optional`) ``torch.LongTensor`` of shape ``(batch_size, sequence_length)``:
|
||||||
|
Labels for computing the masked language modeling loss.
|
||||||
|
Indices should be in ``[-1, 0, ..., config.vocab_size - 1]`` (see ``input_ids`` docstring)
|
||||||
|
Tokens with indices set to ``-1`` are ignored (masked), the loss is only computed for the tokens with labels
|
||||||
|
in ``[0, ..., config.vocab_size - 1]``
|
||||||
|
|
||||||
|
Outputs: `Tuple` comprising various elements depending on the configuration (config) and inputs:
|
||||||
|
**loss**: (`optional`, returned when ``masked_lm_labels`` is provided) ``torch.FloatTensor`` of shape ``(1,)``:
|
||||||
|
Masked language modeling loss.
|
||||||
|
**prediction_scores**: ``torch.FloatTensor`` of shape ``(batch_size, sequence_length, config.vocab_size)``
|
||||||
|
Prediction scores of the language modeling head (scores for each vocabulary token before SoftMax).
|
||||||
|
**hidden_states**: (`optional`, returned when ``config.output_hidden_states=True``)
|
||||||
|
list of ``torch.FloatTensor`` (one for the output of each layer + the output of the embeddings)
|
||||||
|
of shape ``(batch_size, sequence_length, hidden_size)``:
|
||||||
|
Hidden-states of the model at the output of each layer plus the initial embedding outputs.
|
||||||
|
**attentions**: (`optional`, returned when ``config.output_attentions=True``)
|
||||||
|
list of ``torch.FloatTensor`` (one for each layer) of shape ``(batch_size, num_heads, sequence_length, sequence_length)``:
|
||||||
|
Attention weights after the attention softmax, used to compute the weighted average in the self-attention heads.
|
||||||
|
"""
|
||||||
|
|
||||||
|
def __init__(self, config):
|
||||||
|
super(AlbertForMaskedLM, self).__init__(config)
|
||||||
|
|
||||||
|
self.albert = AlbertModel(config)
|
||||||
|
self.predictions = AlbertMLMHead(config)
|
||||||
|
|
||||||
|
self.init_weights()
|
||||||
|
self.tie_weights()
|
||||||
|
|
||||||
|
def tie_weights(self):
|
||||||
|
""" Make sure we are sharing the input and output embeddings.
|
||||||
|
Export to TorchScript can't handle parameter sharing so we are cloning them instead.
|
||||||
|
"""
|
||||||
|
self._tie_or_clone_weights(self.predictions.decoder,
|
||||||
|
self.albert.embeddings.word_embeddings)
|
||||||
|
|
||||||
|
def get_output_embeddings(self):
|
||||||
|
return self.predictions.decoder
|
||||||
|
|
||||||
|
def forward(self, input_ids=None, attention_mask=None, token_type_ids=None, position_ids=None, head_mask=None, inputs_embeds=None,
|
||||||
|
masked_lm_labels=None):
|
||||||
|
outputs = self.albert(
|
||||||
|
input_ids=input_ids,
|
||||||
|
attention_mask=attention_mask,
|
||||||
|
token_type_ids=token_type_ids,
|
||||||
|
position_ids=position_ids,
|
||||||
|
head_mask=head_mask,
|
||||||
|
inputs_embeds=inputs_embeds
|
||||||
|
)
|
||||||
|
sequence_outputs = outputs[0]
|
||||||
|
|
||||||
|
prediction_scores = self.predictions(sequence_outputs)
|
||||||
|
|
||||||
|
outputs = (prediction_scores,) + outputs[2:] # Add hidden states and attention if they are here
|
||||||
|
if masked_lm_labels is not None:
|
||||||
|
loss_fct = CrossEntropyLoss(ignore_index=-1)
|
||||||
|
masked_lm_loss = loss_fct(prediction_scores.view(-1, self.config.vocab_size), masked_lm_labels.view(-1))
|
||||||
|
outputs = (masked_lm_loss,) + outputs
|
||||||
|
|
||||||
|
return outputs
|
||||||
|
|
||||||
|
|
||||||
|
@add_start_docstrings("""Albert Model transformer with a sequence classification/regression head on top (a linear layer on top of
|
||||||
|
the pooled output) e.g. for GLUE tasks. """,
|
||||||
|
ALBERT_START_DOCSTRING, ALBERT_INPUTS_DOCSTRING)
|
||||||
|
class AlbertForSequenceClassification(AlbertPreTrainedModel):
|
||||||
|
r"""
|
||||||
|
**labels**: (`optional`) ``torch.LongTensor`` of shape ``(batch_size,)``:
|
||||||
|
Labels for computing the sequence classification/regression loss.
|
||||||
|
Indices should be in ``[0, ..., config.num_labels - 1]``.
|
||||||
|
If ``config.num_labels == 1`` a regression loss is computed (Mean-Square loss),
|
||||||
|
If ``config.num_labels > 1`` a classification loss is computed (Cross-Entropy).
|
||||||
|
|
||||||
|
Outputs: `Tuple` comprising various elements depending on the configuration (config) and inputs:
|
||||||
|
**loss**: (`optional`, returned when ``labels`` is provided) ``torch.FloatTensor`` of shape ``(1,)``:
|
||||||
|
Classification (or regression if config.num_labels==1) loss.
|
||||||
|
**logits**: ``torch.FloatTensor`` of shape ``(batch_size, config.num_labels)``
|
||||||
|
Classification (or regression if config.num_labels==1) scores (before SoftMax).
|
||||||
|
**hidden_states**: (`optional`, returned when ``config.output_hidden_states=True``)
|
||||||
|
list of ``torch.FloatTensor`` (one for the output of each layer + the output of the embeddings)
|
||||||
|
of shape ``(batch_size, sequence_length, hidden_size)``:
|
||||||
|
Hidden-states of the model at the output of each layer plus the initial embedding outputs.
|
||||||
|
**attentions**: (`optional`, returned when ``config.output_attentions=True``)
|
||||||
|
list of ``torch.FloatTensor`` (one for each layer) of shape ``(batch_size, num_heads, sequence_length, sequence_length)``:
|
||||||
|
Attention weights after the attention softmax, used to compute the weighted average in the self-attention heads.
|
||||||
|
|
||||||
|
Examples::
|
||||||
|
|
||||||
|
tokenizer = AlbertTokenizer.from_pretrained('albert-base-v2')
|
||||||
|
model = AlbertForSequenceClassification.from_pretrained('albert-base-v2')
|
||||||
|
input_ids = torch.tensor(tokenizer.encode("Hello, my dog is cute")).unsqueeze(0) # Batch size 1
|
||||||
|
labels = torch.tensor([1]).unsqueeze(0) # Batch size 1
|
||||||
|
outputs = model(input_ids, labels=labels)
|
||||||
|
loss, logits = outputs[:2]
|
||||||
|
|
||||||
|
"""
|
||||||
|
def __init__(self, config):
|
||||||
|
super(AlbertForSequenceClassification, self).__init__(config)
|
||||||
|
self.num_labels = config.num_labels
|
||||||
|
|
||||||
|
self.albert = AlbertModel(config)
|
||||||
|
self.dropout = nn.Dropout(config.hidden_dropout_prob)
|
||||||
|
self.classifier = nn.Linear(config.hidden_size, self.config.num_labels)
|
||||||
|
|
||||||
|
self.init_weights()
|
||||||
|
|
||||||
|
def forward(self, input_ids=None, attention_mask=None, token_type_ids=None,
|
||||||
|
position_ids=None, head_mask=None, inputs_embeds=None, labels=None):
|
||||||
|
|
||||||
|
outputs = self.albert(
|
||||||
|
input_ids=input_ids,
|
||||||
|
attention_mask=attention_mask,
|
||||||
|
token_type_ids=token_type_ids,
|
||||||
|
position_ids=position_ids,
|
||||||
|
head_mask=head_mask,
|
||||||
|
inputs_embeds=inputs_embeds
|
||||||
|
)
|
||||||
|
|
||||||
|
pooled_output = outputs[1]
|
||||||
|
|
||||||
|
pooled_output = self.dropout(pooled_output)
|
||||||
|
logits = self.classifier(pooled_output)
|
||||||
|
|
||||||
|
outputs = (logits,) + outputs[2:] # add hidden states and attention if they are here
|
||||||
|
|
||||||
|
if labels is not None:
|
||||||
|
if self.num_labels == 1:
|
||||||
|
# We are doing regression
|
||||||
|
loss_fct = MSELoss()
|
||||||
|
loss = loss_fct(logits.view(-1), labels.view(-1))
|
||||||
|
else:
|
||||||
|
loss_fct = CrossEntropyLoss()
|
||||||
|
loss = loss_fct(logits.view(-1, self.num_labels), labels.view(-1))
|
||||||
|
outputs = (loss,) + outputs
|
||||||
|
|
||||||
|
return outputs # (loss), logits, (hidden_states), (attentions)
|
||||||
|
|
||||||
|
|
||||||
|
|
||||||
|
@add_start_docstrings("""Albert Model with a span classification head on top for extractive question-answering tasks like SQuAD (a linear layer on top of
|
||||||
|
the hidden-states output to compute `span start logits` and `span end logits`). """,
|
||||||
|
ALBERT_START_DOCSTRING, ALBERT_INPUTS_DOCSTRING)
|
||||||
|
class AlbertForQuestionAnswering(AlbertPreTrainedModel):
|
||||||
|
r"""
|
||||||
|
**start_positions**: (`optional`) ``torch.LongTensor`` of shape ``(batch_size,)``:
|
||||||
|
Labels for position (index) of the start of the labelled span for computing the token classification loss.
|
||||||
|
Positions are clamped to the length of the sequence (`sequence_length`).
|
||||||
|
Positions outside of the sequence are not taken into account for computing the loss.
|
||||||
|
**end_positions**: (`optional`) ``torch.LongTensor`` of shape ``(batch_size,)``:
|
||||||
|
Labels for position (index) of the end of the labelled span for computing the token classification loss.
|
||||||
|
Positions are clamped to the length of the sequence (`sequence_length`).
|
||||||
|
Positions outside of the sequence are not taken into account for computing the loss.
|
||||||
|
|
||||||
|
Outputs: `Tuple` comprising various elements depending on the configuration (config) and inputs:
|
||||||
|
**loss**: (`optional`, returned when ``start_positions`` and ``end_positions`` are provided) ``torch.FloatTensor`` of shape ``(1,)``:
|
||||||
|
Total span extraction loss is the average of the Cross-Entropy losses for the start and end positions.
|
||||||
|
**start_scores**: ``torch.FloatTensor`` of shape ``(batch_size, sequence_length,)``
|
||||||
|
Span-start scores (before SoftMax).
|
||||||
|
**end_scores**: ``torch.FloatTensor`` of shape ``(batch_size, sequence_length,)``
|
||||||
|
Span-end scores (before SoftMax).
|
||||||
|
**hidden_states**: (`optional`, returned when ``config.output_hidden_states=True``)
|
||||||
|
list of ``torch.FloatTensor`` (one for the output of each layer + the output of the embeddings)
|
||||||
|
of shape ``(batch_size, sequence_length, hidden_size)``:
|
||||||
|
Hidden-states of the model at the output of each layer plus the initial embedding outputs.
|
||||||
|
**attentions**: (`optional`, returned when ``config.output_attentions=True``)
|
||||||
|
list of ``torch.FloatTensor`` (one for each layer) of shape ``(batch_size, num_heads, sequence_length, sequence_length)``:
|
||||||
|
Attention weights after the attention softmax, used to compute the weighted average in the self-attention heads.
|
||||||
|
|
||||||
|
Examples::
|
||||||
|
|
||||||
|
tokenizer = AlbertTokenizer.from_pretrained('albert-base-v2')
|
||||||
|
model = AlbertForQuestionAnswering.from_pretrained('albert-base-v2')
|
||||||
|
question, text = "Who was Jim Henson?", "Jim Henson was a nice puppet"
|
||||||
|
input_text = "[CLS] " + question + " [SEP] " + text + " [SEP]"
|
||||||
|
input_ids = tokenizer.encode(input_text)
|
||||||
|
token_type_ids = [0 if i <= input_ids.index(tokenizer.sep_token_id) else 1 for i in range(len(input_ids))]
|
||||||
|
start_scores, end_scores = model(torch.tensor([input_ids]), token_type_ids=torch.tensor([token_type_ids]))
|
||||||
|
all_tokens = tokenizer.convert_ids_to_tokens(input_ids)
|
||||||
|
print(' '.join(all_tokens[torch.argmax(start_scores) : torch.argmax(end_scores)+1]))
|
||||||
|
# a nice puppet
|
||||||
|
|
||||||
|
|
||||||
|
"""
|
||||||
|
def __init__(self, config):
|
||||||
|
super(AlbertForQuestionAnswering, self).__init__(config)
|
||||||
|
self.num_labels = config.num_labels
|
||||||
|
|
||||||
|
self.albert = AlbertModel(config)
|
||||||
|
self.qa_outputs = nn.Linear(config.hidden_size, config.num_labels)
|
||||||
|
|
||||||
|
self.init_weights()
|
||||||
|
|
||||||
|
def forward(self, input_ids=None, attention_mask=None, token_type_ids=None, position_ids=None, head_mask=None,
|
||||||
|
inputs_embeds=None, start_positions=None, end_positions=None):
|
||||||
|
|
||||||
|
outputs = self.albert(
|
||||||
|
input_ids=input_ids,
|
||||||
|
attention_mask=attention_mask,
|
||||||
|
token_type_ids=token_type_ids,
|
||||||
|
position_ids=position_ids,
|
||||||
|
head_mask=head_mask,
|
||||||
|
inputs_embeds=inputs_embeds
|
||||||
|
)
|
||||||
|
|
||||||
|
sequence_output = outputs[0]
|
||||||
|
|
||||||
|
logits = self.qa_outputs(sequence_output)
|
||||||
|
start_logits, end_logits = logits.split(1, dim=-1)
|
||||||
|
start_logits = start_logits.squeeze(-1)
|
||||||
|
end_logits = end_logits.squeeze(-1)
|
||||||
|
|
||||||
|
outputs = (start_logits, end_logits,) + outputs[2:]
|
||||||
|
if start_positions is not None and end_positions is not None:
|
||||||
|
# If we are on multi-GPU, splitting adds a dimension
|
||||||
|
if len(start_positions.size()) > 1:
|
||||||
|
start_positions = start_positions.squeeze(-1)
|
||||||
|
if len(end_positions.size()) > 1:
|
||||||
|
end_positions = end_positions.squeeze(-1)
|
||||||
|
# sometimes the start/end positions are outside our model inputs, we ignore these terms
|
||||||
|
ignored_index = start_logits.size(1)
|
||||||
|
start_positions.clamp_(0, ignored_index)
|
||||||
|
end_positions.clamp_(0, ignored_index)
|
||||||
|
|
||||||
|
loss_fct = CrossEntropyLoss(ignore_index=ignored_index)
|
||||||
|
start_loss = loss_fct(start_logits, start_positions)
|
||||||
|
end_loss = loss_fct(end_logits, end_positions)
|
||||||
|
total_loss = (start_loss + end_loss) / 2
|
||||||
|
outputs = (total_loss,) + outputs
|
||||||
|
|
||||||
|
return outputs # (loss), start_logits, end_logits, (hidden_states), (attentions)
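The clamping step in `AlbertForQuestionAnswering.forward` moves out-of-range positions to `sequence_length`, which is then used as the ignored index of the loss. The sketch below shows that behaviour for a single example in plain Python; `span_loss` is a hypothetical helper, and the real code averages the loss over the non-ignored examples of a batch.

```python
import math


def span_loss(logits, position):
    """Cross-entropy for one labelled position, or None when it is ignored.

    Hypothetical single-example sketch: positions are clamped to
    [0, len(logits)], and a position equal to len(logits) is skipped.
    """
    ignored_index = len(logits)
    position = max(0, min(position, ignored_index))   # same effect as clamp_(0, ignored_index)
    if position == ignored_index:
        return None                                    # outside the sequence: no loss term
    top = max(logits)
    log_norm = top + math.log(sum(math.exp(x - top) for x in logits))
    return log_norm - logits[position]


start_logits = [0.0, 0.0]
inside = span_loss(start_logits, 1)     # valid position, contributes to the loss
outside = span_loss(start_logits, 5)    # beyond the sequence, ignored
print(inside, outside)
```

Note that negative positions are clamped to 0 rather than ignored, so they still contribute a loss term for the first token.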
|
||||||
@@ -27,6 +27,7 @@ from .modeling_xlnet import XLNetModel, XLNetLMHeadModel, XLNetForSequenceClassi
|
|||||||
from .modeling_xlm import XLMModel, XLMWithLMHeadModel, XLMForSequenceClassification, XLMForQuestionAnswering
|
from .modeling_xlm import XLMModel, XLMWithLMHeadModel, XLMForSequenceClassification, XLMForQuestionAnswering
|
||||||
from .modeling_roberta import RobertaModel, RobertaForMaskedLM, RobertaForSequenceClassification
|
from .modeling_roberta import RobertaModel, RobertaForMaskedLM, RobertaForSequenceClassification
|
||||||
from .modeling_distilbert import DistilBertModel, DistilBertForQuestionAnswering, DistilBertForMaskedLM, DistilBertForSequenceClassification
|
from .modeling_distilbert import DistilBertModel, DistilBertForQuestionAnswering, DistilBertForMaskedLM, DistilBertForSequenceClassification
|
||||||
|
from .modeling_camembert import CamembertModel, CamembertForMaskedLM, CamembertForSequenceClassification, CamembertForMultipleChoice
|
||||||
|
|
||||||
from .modeling_utils import PreTrainedModel, SequenceSummary
|
from .modeling_utils import PreTrainedModel, SequenceSummary
|
||||||
|
|
||||||
@@ -48,6 +49,7 @@ class AutoModel(object):
|
|||||||
The base model class to instantiate is selected as the first pattern matching
|
The base model class to instantiate is selected as the first pattern matching
|
||||||
in the `pretrained_model_name_or_path` string (in the following order):
|
in the `pretrained_model_name_or_path` string (in the following order):
|
||||||
- contains `distilbert`: DistilBertModel (DistilBERT model)
|
- contains `distilbert`: DistilBertModel (DistilBERT model)
|
||||||
|
- contains `camembert`: CamembertModel (CamemBERT model)
|
||||||
- contains `roberta`: RobertaModel (RoBERTa model)
|
- contains `roberta`: RobertaModel (RoBERTa model)
|
||||||
- contains `bert`: BertModel (Bert model)
|
- contains `bert`: BertModel (Bert model)
|
||||||
- contains `openai-gpt`: OpenAIGPTModel (OpenAI GPT model)
|
- contains `openai-gpt`: OpenAIGPTModel (OpenAI GPT model)
|
||||||
@@ -71,6 +73,7 @@ class AutoModel(object):
|
|||||||
The model class to instantiate is selected as the first pattern matching
|
The model class to instantiate is selected as the first pattern matching
|
||||||
in the `pretrained_model_name_or_path` string (in the following order):
|
in the `pretrained_model_name_or_path` string (in the following order):
|
||||||
- contains `distilbert`: DistilBertModel (DistilBERT model)
|
- contains `distilbert`: DistilBertModel (DistilBERT model)
|
||||||
|
- contains `camembert`: CamembertModel (CamemBERT model)
|
||||||
- contains `roberta`: RobertaModel (RoBERTa model)
|
- contains `roberta`: RobertaModel (RoBERTa model)
|
||||||
- contains `bert`: BertModel (Bert model)
|
- contains `bert`: BertModel (Bert model)
|
||||||
- contains `openai-gpt`: OpenAIGPTModel (OpenAI GPT model)
|
- contains `openai-gpt`: OpenAIGPTModel (OpenAI GPT model)
|
||||||
@@ -112,6 +115,9 @@ class AutoModel(object):
|
|||||||
force_download: (`optional`) boolean, default False:
|
force_download: (`optional`) boolean, default False:
|
||||||
Force to (re-)download the model weights and configuration files and override the cached versions if they exist.
|
Force to (re-)download the model weights and configuration files and override the cached versions if they exist.
|
||||||
|
|
||||||
|
resume_download: (`optional`) boolean, default False:
|
||||||
|
Do not delete an incompletely received file. Attempt to resume the download if such a file exists.
|
||||||
|
|
||||||
proxies: (`optional`) dict, default None:
|
proxies: (`optional`) dict, default None:
|
||||||
A dictionary of proxy servers to use by protocol or endpoint, e.g.: {'http': 'foo.bar:3128', 'http://hostname': 'foo.bar:4012'}.
|
A dictionary of proxy servers to use by protocol or endpoint, e.g.: {'http': 'foo.bar:3128', 'http://hostname': 'foo.bar:4012'}.
|
||||||
The proxies are used on each request.
|
The proxies are used on each request.
|
||||||
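The `resume_download` option documented above keeps a partially downloaded file so the transfer can continue. This diff does not show how the library does that, so the following is only a sketch of one common approach: an HTTP `Range` request that starts after the bytes already on disk. The helper name `resume_headers` is hypothetical and not part of the library.

```python
import os


def resume_headers(partial_path):
    """Build request headers that ask the server to skip bytes already on disk.

    Hypothetical helper for illustration; it assumes the server supports
    HTTP Range requests.
    """
    if os.path.exists(partial_path):
        resume_size = os.path.getsize(partial_path)
        if resume_size > 0:
            # Ask for everything from byte offset `resume_size` onwards.
            return {"Range": "bytes=%d-" % resume_size}
    # No partial file (or an empty one): download from the start.
    return {}
```

A caller would pass these headers to its HTTP client and open the partial file in append mode, so that the new bytes land after the existing ones.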
@@ -138,6 +144,8 @@ class AutoModel(object):
|
|||||||
"""
|
"""
|
||||||
if 'distilbert' in pretrained_model_name_or_path:
|
if 'distilbert' in pretrained_model_name_or_path:
|
||||||
return DistilBertModel.from_pretrained(pretrained_model_name_or_path, *model_args, **kwargs)
|
return DistilBertModel.from_pretrained(pretrained_model_name_or_path, *model_args, **kwargs)
|
||||||
|
elif 'camembert' in pretrained_model_name_or_path:
|
||||||
|
return CamembertModel.from_pretrained(pretrained_model_name_or_path, *model_args, **kwargs)
|
||||||
elif 'roberta' in pretrained_model_name_or_path:
|
elif 'roberta' in pretrained_model_name_or_path:
|
||||||
return RobertaModel.from_pretrained(pretrained_model_name_or_path, *model_args, **kwargs)
|
return RobertaModel.from_pretrained(pretrained_model_name_or_path, *model_args, **kwargs)
|
||||||
elif 'bert' in pretrained_model_name_or_path:
|
elif 'bert' in pretrained_model_name_or_path:
|
||||||
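The order of the `if`/`elif` chain above matters because the checks are substring matches: `distilbert`, `camembert` and `roberta` all contain `bert`. A minimal sketch of the dispatch, using class names as plain strings and only the patterns visible in this hunk (`PATTERNS` and `resolve_model_class` are illustrative names, not library API):

```python
# Most specific patterns first; 'bert' must come after every name that contains it.
PATTERNS = [
    ("distilbert", "DistilBertModel"),
    ("camembert", "CamembertModel"),
    ("roberta", "RobertaModel"),
    ("bert", "BertModel"),
    ("openai-gpt", "OpenAIGPTModel"),
]


def resolve_model_class(pretrained_model_name_or_path):
    """Return the class name of the first pattern found in the model name."""
    for pattern, class_name in PATTERNS:
        if pattern in pretrained_model_name_or_path:
            return class_name
    raise ValueError("Unrecognized model identifier: %s" % pretrained_model_name_or_path)


print(resolve_model_class("camembert-base"))                # CamembertModel
print(resolve_model_class("distilbert-base-german-cased"))  # DistilBertModel
print(resolve_model_class("bert-base-uncased"))             # BertModel
```

If the `bert` check came first, `camembert-base` would resolve to `BertModel`, which is why the new `camembert` branch is inserted above `roberta` and `bert`.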
@@ -172,6 +180,7 @@ class AutoModelWithLMHead(object):
|
|||||||
The model class to instantiate is selected as the first pattern matching
|
The model class to instantiate is selected as the first pattern matching
|
||||||
in the `pretrained_model_name_or_path` string (in the following order):
|
in the `pretrained_model_name_or_path` string (in the following order):
|
||||||
- contains `distilbert`: DistilBertForMaskedLM (DistilBERT model)
|
- contains `distilbert`: DistilBertForMaskedLM (DistilBERT model)
|
||||||
|
- contains `camembert`: CamembertForMaskedLM (CamemBERT model)
|
||||||
- contains `roberta`: RobertaForMaskedLM (RoBERTa model)
|
- contains `roberta`: RobertaForMaskedLM (RoBERTa model)
|
||||||
- contains `bert`: BertForMaskedLM (Bert model)
|
- contains `bert`: BertForMaskedLM (Bert model)
|
||||||
- contains `openai-gpt`: OpenAIGPTLMHeadModel (OpenAI GPT model)
|
- contains `openai-gpt`: OpenAIGPTLMHeadModel (OpenAI GPT model)
|
||||||
@@ -198,6 +207,7 @@ class AutoModelWithLMHead(object):
|
|||||||
The model class to instantiate is selected as the first pattern matching
|
The model class to instantiate is selected as the first pattern matching
|
||||||
in the `pretrained_model_name_or_path` string (in the following order):
|
in the `pretrained_model_name_or_path` string (in the following order):
|
||||||
- contains `distilbert`: DistilBertForMaskedLM (DistilBERT model)
|
- contains `distilbert`: DistilBertForMaskedLM (DistilBERT model)
|
||||||
|
- contains `camembert`: CamembertForMaskedLM (CamemBERT model)
|
||||||
- contains `roberta`: RobertaForMaskedLM (RoBERTa model)
|
- contains `roberta`: RobertaForMaskedLM (RoBERTa model)
|
||||||
- contains `bert`: BertForMaskedLM (Bert model)
|
- contains `bert`: BertForMaskedLM (Bert model)
|
||||||
- contains `openai-gpt`: OpenAIGPTLMHeadModel (OpenAI GPT model)
|
- contains `openai-gpt`: OpenAIGPTLMHeadModel (OpenAI GPT model)
|
||||||
@@ -237,6 +247,8 @@ class AutoModelWithLMHead(object):
|
|||||||
|
|
||||||
force_download: (`optional`) boolean, default False:
|
force_download: (`optional`) boolean, default False:
|
||||||
Force a (re-)download of the model weights and configuration files, overriding the cached versions if they exist.
|
Force a (re-)download of the model weights and configuration files, overriding the cached versions if they exist.
|
||||||
|
resume_download: (`optional`) boolean, default False:
|
||||||
|
Do not delete an incompletely received file. Attempt to resume the download if such a file exists.
|
||||||
|
|
||||||
proxies: (`optional`) dict, default None:
|
proxies: (`optional`) dict, default None:
|
||||||
A dictionary of proxy servers to use by protocol or endpoint, e.g.: {'http': 'foo.bar:3128', 'http://hostname': 'foo.bar:4012'}.
|
A dictionary of proxy servers to use by protocol or endpoint, e.g.: {'http': 'foo.bar:3128', 'http://hostname': 'foo.bar:4012'}.
|
||||||
@@ -264,6 +276,8 @@ class AutoModelWithLMHead(object):
|
|||||||
"""
|
"""
|
||||||
if 'distilbert' in pretrained_model_name_or_path:
|
if 'distilbert' in pretrained_model_name_or_path:
|
||||||
return DistilBertForMaskedLM.from_pretrained(pretrained_model_name_or_path, *model_args, **kwargs)
|
return DistilBertForMaskedLM.from_pretrained(pretrained_model_name_or_path, *model_args, **kwargs)
|
||||||
|
elif 'camembert' in pretrained_model_name_or_path:
|
||||||
|
return CamembertForMaskedLM.from_pretrained(pretrained_model_name_or_path, *model_args, **kwargs)
|
||||||
elif 'roberta' in pretrained_model_name_or_path:
|
elif 'roberta' in pretrained_model_name_or_path:
|
||||||
return RobertaForMaskedLM.from_pretrained(pretrained_model_name_or_path, *model_args, **kwargs)
|
return RobertaForMaskedLM.from_pretrained(pretrained_model_name_or_path, *model_args, **kwargs)
|
||||||
elif 'bert' in pretrained_model_name_or_path:
|
elif 'bert' in pretrained_model_name_or_path:
|
||||||
@@ -298,6 +312,7 @@ class AutoModelForSequenceClassification(object):
|
|||||||
The model class to instantiate is selected as the first pattern matching
|
The model class to instantiate is selected as the first pattern matching
|
||||||
in the `pretrained_model_name_or_path` string (in the following order):
|
in the `pretrained_model_name_or_path` string (in the following order):
|
||||||
- contains `distilbert`: DistilBertForSequenceClassification (DistilBERT model)
|
- contains `distilbert`: DistilBertForSequenceClassification (DistilBERT model)
|
||||||
|
- contains `camembert`: CamembertForSequenceClassification (CamemBERT model)
|
||||||
- contains `roberta`: RobertaForSequenceClassification (RoBERTa model)
|
- contains `roberta`: RobertaForSequenceClassification (RoBERTa model)
|
||||||
- contains `bert`: BertForSequenceClassification (Bert model)
|
- contains `bert`: BertForSequenceClassification (Bert model)
|
||||||
- contains `xlnet`: XLNetForSequenceClassification (XLNet model)
|
- contains `xlnet`: XLNetForSequenceClassification (XLNet model)
|
||||||
@@ -320,6 +335,7 @@ class AutoModelForSequenceClassification(object):
|
|||||||
The model class to instantiate is selected as the first pattern matching
|
The model class to instantiate is selected as the first pattern matching
|
||||||
in the `pretrained_model_name_or_path` string (in the following order):
|
in the `pretrained_model_name_or_path` string (in the following order):
|
||||||
- contains `distilbert`: DistilBertForSequenceClassification (DistilBERT model)
|
- contains `distilbert`: DistilBertForSequenceClassification (DistilBERT model)
|
||||||
|
- contains `camembert`: CamembertForSequenceClassification (CamemBERT model)
|
||||||
- contains `roberta`: RobertaForSequenceClassification (RoBERTa model)
|
- contains `roberta`: RobertaForSequenceClassification (RoBERTa model)
|
||||||
- contains `bert`: BertForSequenceClassification (Bert model)
|
- contains `bert`: BertForSequenceClassification (Bert model)
|
||||||
- contains `xlnet`: XLNetForSequenceClassification (XLNet model)
|
- contains `xlnet`: XLNetForSequenceClassification (XLNet model)
|
||||||
@@ -357,6 +373,9 @@ class AutoModelForSequenceClassification(object):
|
|||||||
force_download: (`optional`) boolean, default False:
|
force_download: (`optional`) boolean, default False:
|
||||||
Force a (re-)download of the model weights and configuration files, overriding the cached versions if they exist.
|
Force a (re-)download of the model weights and configuration files, overriding the cached versions if they exist.
|
||||||
|
|
||||||
|
resume_download: (`optional`) boolean, default False:
|
||||||
|
Do not delete an incompletely received file. Attempt to resume the download if such a file exists.
|
||||||
|
|
||||||
proxies: (`optional`) dict, default None:
|
proxies: (`optional`) dict, default None:
|
||||||
A dictionary of proxy servers to use by protocol or endpoint, e.g.: {'http': 'foo.bar:3128', 'http://hostname': 'foo.bar:4012'}.
|
A dictionary of proxy servers to use by protocol or endpoint, e.g.: {'http': 'foo.bar:3128', 'http://hostname': 'foo.bar:4012'}.
|
||||||
The proxies are used on each request.
|
The proxies are used on each request.
|
||||||
@@ -383,6 +402,8 @@ class AutoModelForSequenceClassification(object):
|
|||||||
"""
|
"""
|
||||||
if 'distilbert' in pretrained_model_name_or_path:
|
if 'distilbert' in pretrained_model_name_or_path:
|
||||||
return DistilBertForSequenceClassification.from_pretrained(pretrained_model_name_or_path, *model_args, **kwargs)
|
return DistilBertForSequenceClassification.from_pretrained(pretrained_model_name_or_path, *model_args, **kwargs)
|
||||||
|
elif 'camembert' in pretrained_model_name_or_path:
|
||||||
|
return CamembertForSequenceClassification.from_pretrained(pretrained_model_name_or_path, *model_args, **kwargs)
|
||||||
elif 'roberta' in pretrained_model_name_or_path:
|
elif 'roberta' in pretrained_model_name_or_path:
|
||||||
return RobertaForSequenceClassification.from_pretrained(pretrained_model_name_or_path, *model_args, **kwargs)
|
return RobertaForSequenceClassification.from_pretrained(pretrained_model_name_or_path, *model_args, **kwargs)
|
||||||
elif 'bert' in pretrained_model_name_or_path:
|
elif 'bert' in pretrained_model_name_or_path:
|
||||||
|
|||||||
@@ -278,7 +278,7 @@ class BertAttention(nn.Module):
|
|||||||
if len(heads) == 0:
|
if len(heads) == 0:
|
||||||
return
|
return
|
||||||
mask = torch.ones(self.self.num_attention_heads, self.self.attention_head_size)
|
mask = torch.ones(self.self.num_attention_heads, self.self.attention_head_size)
|
||||||
heads = set(heads) - self.pruned_heads # Convert to set and emove already pruned heads
|
heads = set(heads) - self.pruned_heads # Convert to set and remove already pruned heads
|
||||||
for head in heads:
|
for head in heads:
|
||||||
# Compute how many pruned heads are before the head and move the index accordingly
|
# Compute how many pruned heads are before the head and move the index accordingly
|
||||||
head = head - sum(1 if h < head else 0 for h in self.pruned_heads)
|
head = head - sum(1 if h < head else 0 for h in self.pruned_heads)
|
||||||
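The index adjustment in the loop above can be checked on a small case. This sketch lifts the same expression into a standalone function (`shifted_index` is an illustrative name) and assumes a layer that started with four heads, of which heads 0 and 2 were pruned earlier:

```python
def shifted_index(head, pruned_heads):
    """Position of `head` among the heads that remain after earlier pruning."""
    # Every previously pruned head with a smaller index moves `head` one slot left.
    return head - sum(1 if h < head else 0 for h in pruned_heads)


already_pruned = {0, 2}
# Heads 1 and 3 remain; they now sit at positions 0 and 1 of the weight matrices.
print(shifted_index(1, already_pruned))  # 0
print(shifted_index(3, already_pruned))  # 1
```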
@@ -597,7 +597,7 @@ class BertModel(BertPreTrainedModel):
|
|||||||
|
|
||||||
tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
|
tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
|
||||||
model = BertModel.from_pretrained('bert-base-uncased')
|
model = BertModel.from_pretrained('bert-base-uncased')
|
||||||
input_ids = torch.tensor(tokenizer.encode("Hello, my dog is cute")).unsqueeze(0) # Batch size 1
|
input_ids = torch.tensor(tokenizer.encode("Hello, my dog is cute", add_special_tokens=True)).unsqueeze(0) # Batch size 1
|
||||||
outputs = model(input_ids)
|
outputs = model(input_ids)
|
||||||
last_hidden_states = outputs[0] # The last hidden-state is the first element of the output tuple
|
last_hidden_states = outputs[0] # The last hidden-state is the first element of the output tuple
|
||||||
|
|
||||||
@@ -760,7 +760,7 @@ class BertForPreTraining(BertPreTrainedModel):
|
|||||||
|
|
||||||
tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
|
tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
|
||||||
model = BertForPreTraining.from_pretrained('bert-base-uncased')
|
model = BertForPreTraining.from_pretrained('bert-base-uncased')
|
||||||
input_ids = torch.tensor(tokenizer.encode("Hello, my dog is cute")).unsqueeze(0) # Batch size 1
|
input_ids = torch.tensor(tokenizer.encode("Hello, my dog is cute", add_special_tokens=True)).unsqueeze(0) # Batch size 1
|
||||||
outputs = model(input_ids)
|
outputs = model(input_ids)
|
||||||
prediction_scores, seq_relationship_scores = outputs[:2]
|
prediction_scores, seq_relationship_scores = outputs[:2]
|
||||||
|
|
||||||
@@ -836,7 +836,7 @@ class BertForMaskedLM(BertPreTrainedModel):
|
|||||||
|
|
||||||
tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
|
tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
|
||||||
model = BertForMaskedLM.from_pretrained('bert-base-uncased')
|
model = BertForMaskedLM.from_pretrained('bert-base-uncased')
|
||||||
input_ids = torch.tensor(tokenizer.encode("Hello, my dog is cute")).unsqueeze(0) # Batch size 1
|
input_ids = torch.tensor(tokenizer.encode("Hello, my dog is cute", add_special_tokens=True)).unsqueeze(0) # Batch size 1
|
||||||
outputs = model(input_ids, masked_lm_labels=input_ids)
|
outputs = model(input_ids, masked_lm_labels=input_ids)
|
||||||
loss, prediction_scores = outputs[:2]
|
loss, prediction_scores = outputs[:2]
|
||||||
|
|
||||||
@@ -919,7 +919,7 @@ class BertForNextSentencePrediction(BertPreTrainedModel):
|
|||||||
|
|
||||||
tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
|
tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
|
||||||
model = BertForNextSentencePrediction.from_pretrained('bert-base-uncased')
|
model = BertForNextSentencePrediction.from_pretrained('bert-base-uncased')
|
||||||
input_ids = torch.tensor(tokenizer.encode("Hello, my dog is cute")).unsqueeze(0) # Batch size 1
|
input_ids = torch.tensor(tokenizer.encode("Hello, my dog is cute", add_special_tokens=True)).unsqueeze(0) # Batch size 1
|
||||||
outputs = model(input_ids)
|
outputs = model(input_ids)
|
||||||
seq_relationship_scores = outputs[0]
|
seq_relationship_scores = outputs[0]
|
||||||
|
|
||||||
@@ -984,7 +984,7 @@ class BertForSequenceClassification(BertPreTrainedModel):
|
|||||||
|
|
||||||
tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
|
tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
|
||||||
model = BertForSequenceClassification.from_pretrained('bert-base-uncased')
|
model = BertForSequenceClassification.from_pretrained('bert-base-uncased')
|
||||||
input_ids = torch.tensor(tokenizer.encode("Hello, my dog is cute")).unsqueeze(0) # Batch size 1
|
input_ids = torch.tensor(tokenizer.encode("Hello, my dog is cute", add_special_tokens=True)).unsqueeze(0) # Batch size 1
|
||||||
labels = torch.tensor([1]).unsqueeze(0) # Batch size 1
|
labels = torch.tensor([1]).unsqueeze(0) # Batch size 1
|
||||||
outputs = model(input_ids, labels=labels)
|
outputs = model(input_ids, labels=labels)
|
||||||
loss, logits = outputs[:2]
|
loss, logits = outputs[:2]
|
||||||
@@ -1060,7 +1060,7 @@ class BertForMultipleChoice(BertPreTrainedModel):
|
|||||||
tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
|
tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
|
||||||
model = BertForMultipleChoice.from_pretrained('bert-base-uncased')
|
model = BertForMultipleChoice.from_pretrained('bert-base-uncased')
|
||||||
choices = ["Hello, my dog is cute", "Hello, my cat is amazing"]
|
choices = ["Hello, my dog is cute", "Hello, my cat is amazing"]
|
||||||
input_ids = torch.tensor([tokenizer.encode(s) for s in choices]).unsqueeze(0) # Batch size 1, 2 choices
|
input_ids = torch.tensor([tokenizer.encode(s, add_special_tokens=True) for s in choices]).unsqueeze(0) # Batch size 1, 2 choices
|
||||||
labels = torch.tensor(1).unsqueeze(0) # Batch size 1
|
labels = torch.tensor(1).unsqueeze(0) # Batch size 1
|
||||||
outputs = model(input_ids, labels=labels)
|
outputs = model(input_ids, labels=labels)
|
||||||
loss, classification_scores = outputs[:2]
|
loss, classification_scores = outputs[:2]
|
||||||
@@ -1134,7 +1134,7 @@ class BertForTokenClassification(BertPreTrainedModel):
|
|||||||
|
|
||||||
tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
|
tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
|
||||||
model = BertForTokenClassification.from_pretrained('bert-base-uncased')
|
model = BertForTokenClassification.from_pretrained('bert-base-uncased')
|
||||||
input_ids = torch.tensor(tokenizer.encode("Hello, my dog is cute")).unsqueeze(0) # Batch size 1
|
input_ids = torch.tensor(tokenizer.encode("Hello, my dog is cute", add_special_tokens=True)).unsqueeze(0) # Batch size 1
|
||||||
labels = torch.tensor([1] * input_ids.size(1)).unsqueeze(0) # Batch size 1
|
labels = torch.tensor([1] * input_ids.size(1)).unsqueeze(0) # Batch size 1
|
||||||
outputs = model(input_ids, labels=labels)
|
outputs = model(input_ids, labels=labels)
|
||||||
loss, scores = outputs[:2]
|
loss, scores = outputs[:2]
|
||||||
|
|||||||
@@ -20,7 +20,7 @@ from __future__ import (absolute_import, division, print_function,
|
|||||||
|
|
||||||
import logging
|
import logging
|
||||||
|
|
||||||
from .modeling_roberta import RobertaModel, RobertaForMaskedLM, RobertaForSequenceClassification, RobertaForMultipleChoice
|
from .modeling_roberta import RobertaModel, RobertaForMaskedLM, RobertaForSequenceClassification, RobertaForMultipleChoice, RobertaForTokenClassification
|
||||||
from .configuration_camembert import CamembertConfig
|
from .configuration_camembert import CamembertConfig
|
||||||
from .file_utils import add_start_docstrings
|
from .file_utils import add_start_docstrings
|
||||||
|
|
||||||
@@ -255,3 +255,39 @@ class CamembertForMultipleChoice(RobertaForMultipleChoice):
|
|||||||
"""
|
"""
|
||||||
config_class = CamembertConfig
|
config_class = CamembertConfig
|
||||||
pretrained_model_archive_map = CAMEMBERT_PRETRAINED_MODEL_ARCHIVE_MAP
|
pretrained_model_archive_map = CAMEMBERT_PRETRAINED_MODEL_ARCHIVE_MAP
|
||||||
|
|
||||||
|
|
||||||
|
@add_start_docstrings("""CamemBERT Model with a token classification head on top (a linear layer on top of
|
||||||
|
the hidden-states output) e.g. for Named-Entity-Recognition (NER) tasks. """,
|
||||||
|
CAMEMBERT_START_DOCSTRING, CAMEMBERT_INPUTS_DOCSTRING)
|
||||||
|
class CamembertForTokenClassification(RobertaForTokenClassification):
|
||||||
|
r"""
|
||||||
|
**labels**: (`optional`) ``torch.LongTensor`` of shape ``(batch_size, sequence_length)``:
|
||||||
|
Labels for computing the token classification loss.
|
||||||
|
Indices should be in ``[0, ..., config.num_labels - 1]``.
|
||||||
|
|
||||||
|
Outputs: `Tuple` comprising various elements depending on the configuration (config) and inputs:
|
||||||
|
**loss**: (`optional`, returned when ``labels`` is provided) ``torch.FloatTensor`` of shape ``(1,)``:
|
||||||
|
Classification loss.
|
||||||
|
**scores**: ``torch.FloatTensor`` of shape ``(batch_size, sequence_length, config.num_labels)``
|
||||||
|
Classification scores (before SoftMax).
|
||||||
|
**hidden_states**: (`optional`, returned when ``config.output_hidden_states=True``)
|
||||||
|
list of ``torch.FloatTensor`` (one for the output of each layer + the output of the embeddings)
|
||||||
|
of shape ``(batch_size, sequence_length, hidden_size)``:
|
||||||
|
Hidden-states of the model at the output of each layer plus the initial embedding outputs.
|
||||||
|
**attentions**: (`optional`, returned when ``config.output_attentions=True``)
|
||||||
|
list of ``torch.FloatTensor`` (one for each layer) of shape ``(batch_size, num_heads, sequence_length, sequence_length)``:
|
||||||
|
Attentions weights after the attention softmax, used to compute the weighted average in the self-attention heads.
|
||||||
|
|
||||||
|
Examples::
|
||||||
|
|
||||||
|
tokenizer = CamembertTokenizer.from_pretrained('camembert-base')
|
||||||
|
model = CamembertForTokenClassification.from_pretrained('camembert-base')
|
||||||
|
input_ids = torch.tensor(tokenizer.encode("J'aime le camembert !", add_special_tokens=True)).unsqueeze(0) # Batch size 1
|
||||||
|
labels = torch.tensor([1] * input_ids.size(1)).unsqueeze(0) # Batch size 1
|
||||||
|
outputs = model(input_ids, labels=labels)
|
||||||
|
loss, scores = outputs[:2]
|
||||||
|
|
||||||
|
"""
|
||||||
|
config_class = CamembertConfig
|
||||||
|
pretrained_model_archive_map = CAMEMBERT_PRETRAINED_MODEL_ARCHIVE_MAP
|
||||||
|
|||||||
@@ -44,6 +44,7 @@ DISTILBERT_PRETRAINED_MODEL_ARCHIVE_MAP = {
|
|||||||
'distilbert-base-uncased': "https://s3.amazonaws.com/models.huggingface.co/bert/distilbert-base-uncased-pytorch_model.bin",
|
'distilbert-base-uncased': "https://s3.amazonaws.com/models.huggingface.co/bert/distilbert-base-uncased-pytorch_model.bin",
|
||||||
'distilbert-base-uncased-distilled-squad': "https://s3.amazonaws.com/models.huggingface.co/bert/distilbert-base-uncased-distilled-squad-pytorch_model.bin",
|
'distilbert-base-uncased-distilled-squad': "https://s3.amazonaws.com/models.huggingface.co/bert/distilbert-base-uncased-distilled-squad-pytorch_model.bin",
|
||||||
'distilbert-base-german-cased': "https://s3.amazonaws.com/models.huggingface.co/bert/distilbert-base-german-cased-pytorch_model.bin",
|
'distilbert-base-german-cased': "https://s3.amazonaws.com/models.huggingface.co/bert/distilbert-base-german-cased-pytorch_model.bin",
|
||||||
|
'distilbert-base-multilingual-cased': "https://s3.amazonaws.com/models.huggingface.co/bert/distilbert-base-multilingual-cased-pytorch_model.bin",
|
||||||
}
|
}
|
||||||
|
|
||||||
|
|
||||||
|
|||||||
799
transformers/modeling_tf_albert.py
Normal file
@@ -0,0 +1,799 @@
|
|||||||
|
# coding=utf-8
|
||||||
|
# Copyright 2018 The OpenAI Team Authors and HuggingFace Inc. team.
|
||||||
|
# Copyright (c) 2018, NVIDIA CORPORATION. All rights reserved.
|
||||||
|
#
|
||||||
|
# Licensed under the Apache License, Version 2.0 (the "License");
|
||||||
|
# you may not use this file except in compliance with the License.
|
||||||
|
# You may obtain a copy of the License at
|
||||||
|
#
|
||||||
|
# http://www.apache.org/licenses/LICENSE-2.0
|
||||||
|
#
|
||||||
|
# Unless required by applicable law or agreed to in writing, software
|
||||||
|
# distributed under the License is distributed on an "AS IS" BASIS,
|
||||||
|
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
|
||||||
|
# See the License for the specific language governing permissions and
|
||||||
|
# limitations under the License.
|
||||||
|
""" TF 2.0 ALBERT model. """
|
||||||
|
from __future__ import absolute_import, division, print_function, unicode_literals
|
||||||
|
|
||||||
|
import json
|
||||||
|
import logging
|
||||||
|
import math
|
||||||
|
import os
|
||||||
|
import sys
|
||||||
|
from io import open
|
||||||
|
|
||||||
|
import numpy as np
|
||||||
|
import tensorflow as tf
|
||||||
|
|
||||||
|
from .configuration_albert import AlbertConfig
|
||||||
|
from .modeling_tf_utils import TFPreTrainedModel, get_initializer
|
||||||
|
from .modeling_tf_bert import ACT2FN, TFBertSelfAttention
|
||||||
|
from .file_utils import add_start_docstrings
|
||||||
|
|
||||||
|
import logging
|
||||||
|
|
||||||
|
logger = logging.getLogger(__name__)
|
||||||
|
|
||||||
|
TF_ALBERT_PRETRAINED_MODEL_ARCHIVE_MAP = {
|
||||||
|
'albert-base-v1': "https://s3.amazonaws.com/models.huggingface.co/bert/albert-base-tf_model.h5",
|
||||||
|
'albert-large-v1': "https://s3.amazonaws.com/models.huggingface.co/bert/albert-large-tf_model.h5",
|
||||||
|
'albert-xlarge-v1': "https://s3.amazonaws.com/models.huggingface.co/bert/albert-xlarge-tf_model.h5",
|
||||||
|
'albert-xxlarge-v1': "https://s3.amazonaws.com/models.huggingface.co/bert/albert-xxlarge-tf_model.h5",
|
||||||
|
'albert-base-v2': "https://s3.amazonaws.com/models.huggingface.co/bert/albert-base-v2-tf_model.h5",
|
||||||
|
'albert-large-v2': "https://s3.amazonaws.com/models.huggingface.co/bert/albert-large-v2-tf_model.h5",
|
||||||
|
'albert-xlarge-v2': "https://s3.amazonaws.com/models.huggingface.co/bert/albert-xlarge-v2-tf_model.h5",
|
||||||
|
'albert-xxlarge-v2': "https://s3.amazonaws.com/models.huggingface.co/bert/albert-xxlarge-v2-tf_model.h5",
|
||||||
|
}
|
||||||
|
|
||||||
|
|
||||||
|
class TFAlbertEmbeddings(tf.keras.layers.Layer):
|
||||||
|
"""Construct the embeddings from word, position and token_type embeddings.
|
||||||
|
"""
|
||||||
|
|
||||||
|
def __init__(self, config, **kwargs):
|
||||||
|
super(TFAlbertEmbeddings, self).__init__(**kwargs)
|
||||||
|
|
||||||
|
self.config = config
|
||||||
|
self.position_embeddings = tf.keras.layers.Embedding(config.max_position_embeddings,
|
||||||
|
config.embedding_size,
|
||||||
|
embeddings_initializer=get_initializer(
|
||||||
|
self.config.initializer_range),
|
||||||
|
name='position_embeddings')
|
||||||
|
self.token_type_embeddings = tf.keras.layers.Embedding(config.type_vocab_size,
|
||||||
|
config.embedding_size,
|
||||||
|
embeddings_initializer=get_initializer(
|
||||||
|
self.config.initializer_range),
|
||||||
|
name='token_type_embeddings')
|
||||||
|
|
||||||
|
# self.LayerNorm is not snake-cased to stick with TensorFlow model variable name and be able to load
|
||||||
|
# any TensorFlow checkpoint file
|
||||||
|
self.LayerNorm = tf.keras.layers.LayerNormalization(
|
||||||
|
epsilon=config.layer_norm_eps, name='LayerNorm')
|
||||||
|
self.dropout = tf.keras.layers.Dropout(config.hidden_dropout_prob)
|
||||||
|
|
||||||
|
def build(self, input_shape):
|
||||||
|
"""Build shared word embedding layer """
|
||||||
|
with tf.name_scope("word_embeddings"):
|
||||||
|
# Create and initialize weights. The random normal initializer was chosen
|
||||||
|
# arbitrarily, and works well.
|
||||||
|
self.word_embeddings = self.add_weight(
|
||||||
|
"weight",
|
||||||
|
shape=[self.config.vocab_size, self.config.embedding_size],
|
||||||
|
initializer=get_initializer(self.config.initializer_range))
|
||||||
|
super(TFAlbertEmbeddings, self).build(input_shape)
|
||||||
|
|
||||||
|
def call(self, inputs, mode="embedding", training=False):
|
||||||
|
"""Get token embeddings of inputs.
|
||||||
|
Args:
|
||||||
|
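inputs: in "embedding" mode, a list of four tensors (input_ids, position_ids, token_type_ids, inputs_embeds), where the id tensors have shape [batch_size, length]; in "linear" mode, a float32 tensor with shape [batch_size, length, embedding_size].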
|
||||||
|
mode: string, a valid value is one of "embedding" and "linear".
|
||||||
|
Returns:
|
||||||
|
outputs: (1) If mode == "embedding", output embedding tensor, float32 with
|
||||||
|
shape [batch_size, length, embedding_size]; (2) mode == "linear", output
|
||||||
|
linear tensor, float32 with shape [batch_size, length, vocab_size].
|
||||||
|
Raises:
|
||||||
|
ValueError: if mode is not valid.
|
||||||
|
|
||||||
|
Shared weights logic adapted from
|
||||||
|
https://github.com/tensorflow/models/blob/a009f4fb9d2fc4949e32192a944688925ef78659/official/transformer/v2/embedding_layer.py#L24
|
||||||
|
"""
|
||||||
|
if mode == "embedding":
|
||||||
|
return self._embedding(inputs, training=training)
|
||||||
|
elif mode == "linear":
|
||||||
|
return self._linear(inputs)
|
||||||
|
else:
|
||||||
|
raise ValueError("mode {} is not valid.".format(mode))
|
||||||
|
|
||||||
|
def _embedding(self, inputs, training=False):
|
||||||
|
"""Applies embedding based on inputs tensor."""
|
||||||
|
input_ids, position_ids, token_type_ids, inputs_embeds = inputs
|
||||||
|
|
||||||
|
if input_ids is not None:
|
||||||
|
input_shape = tf.shape(input_ids)
|
||||||
|
else:
|
||||||
|
input_shape = tf.shape(inputs_embeds)[:-1]
|
||||||
|
|
||||||
|
seq_length = input_shape[1]
|
||||||
|
if position_ids is None:
|
||||||
|
position_ids = tf.range(seq_length, dtype=tf.int32)[tf.newaxis, :]
|
||||||
|
if token_type_ids is None:
|
||||||
|
token_type_ids = tf.fill(input_shape, 0)
|
||||||
|
|
||||||
|
if inputs_embeds is None:
|
||||||
|
inputs_embeds = tf.gather(self.word_embeddings, input_ids)
|
||||||
|
position_embeddings = self.position_embeddings(position_ids)
|
||||||
|
token_type_embeddings = self.token_type_embeddings(token_type_ids)
|
||||||
|
|
||||||
|
embeddings = inputs_embeds + position_embeddings + token_type_embeddings
|
||||||
|
embeddings = self.LayerNorm(embeddings)
|
||||||
|
embeddings = self.dropout(embeddings, training=training)
|
||||||
|
return embeddings
|
||||||
|
|
||||||
|
def _linear(self, inputs):
|
||||||
|
"""Computes logits by running inputs through a linear layer.
|
||||||
|
Args:
|
||||||
|
inputs: A float32 tensor with shape [batch_size, length, embedding_size]
|
||||||
|
Returns:
|
||||||
|
float32 tensor with shape [batch_size, length, vocab_size].
|
||||||
|
"""
|
||||||
|
batch_size = tf.shape(inputs)[0]
|
||||||
|
length = tf.shape(inputs)[1]
|
||||||
|
x = tf.reshape(inputs, [-1, self.config.embedding_size])
|
||||||
|
logits = tf.matmul(x, self.word_embeddings, transpose_b=True)
|
||||||
|
return tf.reshape(logits, [batch_size, length, self.config.vocab_size])
|
||||||
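The two modes of `TFAlbertEmbeddings` share one weight matrix: `_embedding` looks rows up by token id, and `_linear` multiplies by the transpose of the same matrix to produce vocabulary logits. A dependency-free sketch of that weight tying, using nested lists in place of tensors and a made-up 3-token vocabulary with 2-dimensional embeddings (all names and values here are illustrative):

```python
# Toy shared matrix: vocab_size = 3 rows, embedding_size = 2 columns.
word_embeddings = [
    [1.0, 0.0],
    [0.0, 1.0],
    [1.0, 1.0],
]


def embed(input_ids):
    """'embedding' mode: gather one row per token id."""
    return [word_embeddings[token_id] for token_id in input_ids]


def linear(hidden_states):
    """'linear' mode: dot each hidden vector with every row, i.e. x @ W^T."""
    return [
        [sum(h * w for h, w in zip(vector, row)) for row in word_embeddings]
        for vector in hidden_states
    ]


print(embed([2, 0]))         # [[1.0, 1.0], [1.0, 0.0]]
print(linear([[1.0, 1.0]]))  # [[1.0, 1.0, 2.0]]
```

Reusing the matrix this way means the output projection adds no parameters beyond the input embeddings.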
|
|
||||||
|
|
||||||
|
class TFAlbertSelfAttention(tf.keras.layers.Layer):
|
||||||
|
def __init__(self, config, **kwargs):
|
||||||
|
super(TFAlbertSelfAttention, self).__init__(**kwargs)
|
||||||
|
if config.hidden_size % config.num_attention_heads != 0:
|
||||||
|
raise ValueError(
|
||||||
|
"The hidden size (%d) is not a multiple of the number of attention "
|
||||||
|
"heads (%d)" % (config.hidden_size, config.num_attention_heads))
|
||||||
|
self.output_attentions = config.output_attentions
|
||||||
|
|
||||||
|
self.num_attention_heads = config.num_attention_heads
|
||||||
|
assert config.hidden_size % config.num_attention_heads == 0
|
||||||
|
self.attention_head_size = int(
|
||||||
|
config.hidden_size / config.num_attention_heads)
|
||||||
|
self.all_head_size = self.num_attention_heads * self.attention_head_size
|
||||||
|
|
||||||
|
self.query = tf.keras.layers.Dense(self.all_head_size,
|
||||||
|
kernel_initializer=get_initializer(
|
||||||
|
config.initializer_range),
|
||||||
|
name='query')
|
||||||
|
self.key = tf.keras.layers.Dense(self.all_head_size,
|
||||||
|
kernel_initializer=get_initializer(
|
||||||
|
config.initializer_range),
|
||||||
|
name='key')
|
||||||
|
self.value = tf.keras.layers.Dense(self.all_head_size,
|
||||||
|
kernel_initializer=get_initializer(
|
||||||
|
config.initializer_range),
|
||||||
|
name='value')
|
||||||
|
|
||||||
|
self.dropout = tf.keras.layers.Dropout(
|
||||||
|
config.attention_probs_dropout_prob)
|
||||||
|
|
||||||
|
def transpose_for_scores(self, x, batch_size):
|
||||||
|
x = tf.reshape(
|
||||||
|
x, (batch_size, -1, self.num_attention_heads, self.attention_head_size))
|
||||||
|
return tf.transpose(x, perm=[0, 2, 1, 3])
|
||||||
|
|
||||||
|
def call(self, inputs, training=False):
|
||||||
|
hidden_states, attention_mask, head_mask = inputs
|
||||||
|
|
||||||
|
batch_size = tf.shape(hidden_states)[0]
|
||||||
|
mixed_query_layer = self.query(hidden_states)
|
||||||
|
mixed_key_layer = self.key(hidden_states)
|
||||||
|
mixed_value_layer = self.value(hidden_states)
|
||||||
|
|
||||||
|
query_layer = self.transpose_for_scores(mixed_query_layer, batch_size)
|
||||||
|
key_layer = self.transpose_for_scores(mixed_key_layer, batch_size)
|
||||||
|
value_layer = self.transpose_for_scores(mixed_value_layer, batch_size)
|
||||||
|
|
||||||
|
# Take the dot product between "query" and "key" to get the raw attention scores.
|
||||||
|
# (batch size, num_heads, seq_len_q, seq_len_k)
|
||||||
|
attention_scores = tf.matmul(query_layer, key_layer, transpose_b=True)
|
||||||
|
# scale attention_scores
|
||||||
|
dk = tf.cast(tf.shape(key_layer)[-1], tf.float32)
|
||||||
|
attention_scores = attention_scores / tf.math.sqrt(dk)
|
||||||
|
|
||||||
|
if attention_mask is not None:
|
||||||
|
# Apply the attention mask (precomputed for all layers in the TFAlbertModel call() function)
|
||||||
|
attention_scores = attention_scores + attention_mask
|
||||||
|
|
||||||
|
# Normalize the attention scores to probabilities.
|
||||||
|
attention_probs = tf.nn.softmax(attention_scores, axis=-1)
|
||||||
|
|
||||||
|
# This is actually dropping out entire tokens to attend to, which might
|
||||||
|
# seem a bit unusual, but is taken from the original Transformer paper.
|
||||||
|
attention_probs = self.dropout(attention_probs, training=training)
|
||||||
|
|
||||||
|
# Mask heads if we want to
|
||||||
|
if head_mask is not None:
|
||||||
|
attention_probs = attention_probs * head_mask
|
||||||
|
|
||||||
|
context_layer = tf.matmul(attention_probs, value_layer)
|
||||||
|
|
||||||
|
context_layer = tf.transpose(context_layer, perm=[0, 2, 1, 3])
|
||||||
|
context_layer = tf.reshape(context_layer,
|
||||||
|
(batch_size, -1, self.all_head_size)) # (batch_size, seq_len_q, all_head_size)
|
||||||
|
|
||||||
|
outputs = (context_layer, attention_probs) if self.output_attentions else (
|
||||||
|
context_layer,)
|
||||||
|
return outputs
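The attention computation above (dot product, scaling by the square root of the key size, additive mask, softmax, weighted sum of values) can be sketched without TensorFlow. This is only an illustrative, single-head version on nested Python lists; the helper names `softmax` and `attention` are hypothetical and not part of the library, and dropout and head masking are left out.

```python
import math


def softmax(row):
    """Numerically stable softmax over a list of floats."""
    peak = max(row)
    exps = [math.exp(v - peak) for v in row]
    total = sum(exps)
    return [e / total for e in exps]


def attention(queries, keys, values, additive_mask=None):
    """Single-head scaled dot-product attention (hypothetical helper).

    queries: [seq_q][d], keys: [seq_k][d], values: [seq_k][d_v].
    additive_mask: optional [seq_k] list holding 0.0 for keys to keep and a
    large negative number (e.g. -10000.0) for keys to hide.
    """
    d_k = len(keys[0])
    context = []
    for q in queries:
        # Raw scores: dot product of the query with every key, scaled by sqrt(d_k).
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d_k) for k in keys]
        if additive_mask is not None:
            scores = [s + m for s, m in zip(scores, additive_mask)]
        probs = softmax(scores)
        # Weighted average of the value vectors.
        width = len(values[0])
        context.append([sum(p * v[j] for p, v in zip(probs, values)) for j in range(width)])
    return context


if __name__ == "__main__":
    q = [[1.0, 0.0]]
    k = [[1.0, 0.0], [0.0, 1.0]]
    v = [[1.0, 2.0], [3.0, 4.0]]
    # Hiding the second key makes the output collapse onto the first value row.
    print(attention(q, k, v, additive_mask=[0.0, -10000.0]))
```

Adding the mask before the softmax, rather than zeroing probabilities afterwards, keeps each row of probabilities normalized over the visible keys.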
|
||||||
|
|
||||||
|
|
||||||
|
class TFAlbertSelfOutput(tf.keras.layers.Layer):
|
||||||
|
def __init__(self, config, **kwargs):
|
||||||
|
super(TFAlbertSelfOutput, self).__init__(**kwargs)
|
||||||
|
self.dense = tf.keras.layers.Dense(config.hidden_size,
|
||||||
|
kernel_initializer=get_initializer(
|
||||||
|
config.initializer_range),
|
||||||
|
name='dense')
|
||||||
|
self.LayerNorm = tf.keras.layers.LayerNormalization(
|
||||||
|
epsilon=config.layer_norm_eps, name='LayerNorm')
|
||||||
|
self.dropout = tf.keras.layers.Dropout(config.hidden_dropout_prob)
|
||||||
|
|
||||||
|
def call(self, inputs, training=False):
|
||||||
|
hidden_states, input_tensor = inputs
|
||||||
|
|
||||||
|
hidden_states = self.dense(hidden_states)
|
||||||
|
hidden_states = self.dropout(hidden_states, training=training)
|
||||||
|
hidden_states = self.LayerNorm(hidden_states + input_tensor)
|
||||||
|
return hidden_states
|
||||||
|
|
||||||
|
|
||||||
|
class TFAlbertAttention(TFBertSelfAttention):
|
||||||
|
def __init__(self, config, **kwargs):
|
||||||
|
super(TFAlbertAttention, self).__init__(config, **kwargs)
|
||||||
|
|
||||||
|
self.hidden_size = config.hidden_size
|
||||||
|
self.dense = tf.keras.layers.Dense(config.hidden_size,
|
||||||
|
kernel_initializer=get_initializer(
|
||||||
|
config.initializer_range),
|
||||||
|
name='dense')
|
||||||
|
self.LayerNorm = tf.keras.layers.LayerNormalization(
|
||||||
|
epsilon=config.layer_norm_eps, name='LayerNorm')
|
||||||
|
self.pruned_heads = set()
|
||||||
|
|
||||||
|
def prune_heads(self, heads):
|
||||||
|
raise NotImplementedError
|
||||||
|
|
||||||
|
def call(self, inputs, training=False):
|
||||||
|
input_tensor, attention_mask, head_mask = inputs
|
||||||
|
|
||||||
|
batch_size = tf.shape(input_tensor)[0]
|
||||||
|
mixed_query_layer = self.query(input_tensor)
|
||||||
|
mixed_key_layer = self.key(input_tensor)
|
||||||
|
mixed_value_layer = self.value(input_tensor)
|
||||||
|
|
||||||
|
query_layer = self.transpose_for_scores(mixed_query_layer, batch_size)
|
||||||
|
key_layer = self.transpose_for_scores(mixed_key_layer, batch_size)
|
||||||
|
value_layer = self.transpose_for_scores(mixed_value_layer, batch_size)
|
||||||
|
|
||||||
|
# Take the dot product between "query" and "key" to get the raw attention scores.
|
||||||
|
# (batch size, num_heads, seq_len_q, seq_len_k)
|
||||||
|
attention_scores = tf.matmul(query_layer, key_layer, transpose_b=True)
|
||||||
|
# scale attention_scores
|
||||||
|
dk = tf.cast(tf.shape(key_layer)[-1], tf.float32)
|
||||||
|
attention_scores = attention_scores / tf.math.sqrt(dk)
|
||||||
|
|
||||||
|
if attention_mask is not None:
|
||||||
|
# Apply the attention mask (precomputed for all layers in TFAlbertModel call() function)
|
||||||
|
attention_scores = attention_scores + attention_mask
|
||||||
|
|
||||||
|
# Normalize the attention scores to probabilities.
|
||||||
|
attention_probs = tf.nn.softmax(attention_scores, axis=-1)
|
||||||
|
|
||||||
|
# This is actually dropping out entire tokens to attend to, which might
|
||||||
|
# seem a bit unusual, but is taken from the original Transformer paper.
|
||||||
|
attention_probs = self.dropout(attention_probs, training=training)
|
||||||
|
|
||||||
|
# Mask heads if we want to
|
||||||
|
if head_mask is not None:
|
||||||
|
attention_probs = attention_probs * head_mask
|
||||||
|
|
||||||
|
context_layer = tf.matmul(attention_probs, value_layer)
|
||||||
|
|
||||||
|
context_layer = tf.transpose(context_layer, perm=[0, 2, 1, 3])
|
||||||
|
context_layer = tf.reshape(context_layer,
|
||||||
|
(batch_size, -1, self.all_head_size)) # (batch_size, seq_len_q, all_head_size)
|
||||||
|
|
||||||
|
self_outputs = (context_layer, attention_probs) if self.output_attentions else (
|
||||||
|
context_layer,)
|
||||||
|
|
||||||
|
hidden_states = self_outputs[0]
|
||||||
|
|
||||||
|
hidden_states = self.dense(hidden_states)
|
||||||
|
hidden_states = self.dropout(hidden_states, training=training)
|
||||||
|
attention_output = self.LayerNorm(hidden_states + input_tensor)
|
||||||
|
|
||||||
|
# add attentions if we output them
|
||||||
|
outputs = (attention_output,) + self_outputs[1:]
|
||||||
|
return outputs
|
||||||
|
|
||||||
|
|
||||||
|
class TFAlbertLayer(tf.keras.layers.Layer):
|
||||||
|
def __init__(self, config, **kwargs):
|
||||||
|
super(TFAlbertLayer, self).__init__(**kwargs)
|
||||||
|
self.attention = TFAlbertAttention(config, name='attention')
|
||||||
|
|
||||||
|
self.ffn = tf.keras.layers.Dense(config.intermediate_size, kernel_initializer=get_initializer(
|
||||||
|
config.initializer_range), name='ffn')
|
||||||
|
|
||||||
|
if isinstance(config.hidden_act, str) or (sys.version_info[0] == 2 and isinstance(config.hidden_act, unicode)):
|
||||||
|
self.activation = ACT2FN[config.hidden_act]
|
||||||
|
else:
|
||||||
|
self.activation = config.hidden_act
|
||||||
|
|
||||||
|
self.ffn_output = tf.keras.layers.Dense(config.hidden_size, kernel_initializer=get_initializer(
|
||||||
|
config.initializer_range), name='ffn_output')
|
||||||
|
self.full_layer_layer_norm = tf.keras.layers.LayerNormalization(
|
||||||
|
epsilon=config.layer_norm_eps, name='full_layer_layer_norm')
|
||||||
|
self.dropout = tf.keras.layers.Dropout(config.hidden_dropout_prob)
|
||||||
|
|
||||||
|
def call(self, inputs, training=False):
|
||||||
|
hidden_states, attention_mask, head_mask = inputs
|
||||||
|
|
||||||
|
attention_outputs = self.attention(
|
||||||
|
[hidden_states, attention_mask, head_mask], training=training)
|
||||||
|
ffn_output = self.ffn(attention_outputs[0])
|
||||||
|
ffn_output = self.activation(ffn_output)
|
||||||
|
ffn_output = self.ffn_output(ffn_output)
|
||||||
|
|
||||||
|
ffn_output = self.dropout(ffn_output, training=training)
|
||||||
|
hidden_states = self.full_layer_layer_norm(
|
||||||
|
ffn_output + attention_outputs[0])
|
||||||
|
|
||||||
|
# add attentions if we output them
|
||||||
|
outputs = (hidden_states,) + attention_outputs[1:]
|
||||||
|
return outputs
|
||||||
|
|
||||||
|
|
||||||
|
class TFAlbertLayerGroup(tf.keras.layers.Layer):
|
||||||
|
def __init__(self, config, **kwargs):
|
||||||
|
super(TFAlbertLayerGroup, self).__init__(**kwargs)
|
||||||
|
|
||||||
|
self.output_attentions = config.output_attentions
|
||||||
|
self.output_hidden_states = config.output_hidden_states
|
||||||
|
self.albert_layers = [TFAlbertLayer(config, name="albert_layers_._{}".format(
|
||||||
|
i)) for i in range(config.inner_group_num)]
|
||||||
|
|
||||||
|
def call(self, inputs, training=False):
|
||||||
|
hidden_states, attention_mask, head_mask = inputs
|
||||||
|
|
||||||
|
layer_hidden_states = ()
|
||||||
|
layer_attentions = ()
|
||||||
|
|
||||||
|
for layer_index, albert_layer in enumerate(self.albert_layers):
|
||||||
|
layer_output = albert_layer(
|
||||||
|
[hidden_states, attention_mask, head_mask[layer_index]], training=training)
|
||||||
|
hidden_states = layer_output[0]
|
||||||
|
|
||||||
|
if self.output_attentions:
|
||||||
|
layer_attentions = layer_attentions + (layer_output[1],)
|
||||||
|
|
||||||
|
if self.output_hidden_states:
|
||||||
|
layer_hidden_states = layer_hidden_states + (hidden_states,)
|
||||||
|
|
||||||
|
outputs = (hidden_states,)
|
||||||
|
if self.output_hidden_states:
|
||||||
|
outputs = outputs + (layer_hidden_states,)
|
||||||
|
if self.output_attentions:
|
||||||
|
outputs = outputs + (layer_attentions,)
|
||||||
|
# last-layer hidden state, (layer hidden states), (layer attentions)
|
||||||
|
return outputs
|
||||||
|
|
||||||
|
|
||||||
|
class TFAlbertTransformer(tf.keras.layers.Layer):
|
||||||
|
def __init__(self, config, **kwargs):
|
||||||
|
super(TFAlbertTransformer, self).__init__(**kwargs)
|
||||||
|
|
||||||
|
self.config = config
|
||||||
|
self.output_attentions = config.output_attentions
|
||||||
|
self.output_hidden_states = config.output_hidden_states
|
||||||
|
self.embedding_hidden_mapping_in = tf.keras.layers.Dense(config.hidden_size, kernel_initializer=get_initializer(
|
||||||
|
config.initializer_range), name='embedding_hidden_mapping_in')
|
||||||
|
self.albert_layer_groups = [TFAlbertLayerGroup(
|
||||||
|
config, name="albert_layer_groups_._{}".format(i)) for i in range(config.num_hidden_groups)]
|
||||||
|
|
||||||
|
def call(self, inputs, training=False):
|
||||||
|
hidden_states, attention_mask, head_mask = inputs
|
||||||
|
|
||||||
|
hidden_states = self.embedding_hidden_mapping_in(hidden_states)
|
||||||
|
all_attentions = ()
|
||||||
|
|
||||||
|
if self.output_hidden_states:
|
||||||
|
all_hidden_states = (hidden_states,)
|
||||||
|
|
||||||
|
for i in range(self.config.num_hidden_layers):
|
||||||
|
# Number of layers in a hidden group
|
||||||
|
layers_per_group = int(
|
||||||
|
self.config.num_hidden_layers / self.config.num_hidden_groups)
|
||||||
|
|
||||||
|
# Index of the hidden group
|
||||||
|
group_idx = int(
|
||||||
|
i / (self.config.num_hidden_layers / self.config.num_hidden_groups))
|
||||||
|
|
||||||
|
layer_group_output = self.albert_layer_groups[group_idx](
|
||||||
|
[hidden_states, attention_mask, head_mask[group_idx*layers_per_group:(group_idx+1)*layers_per_group]], training=training)
|
||||||
|
hidden_states = layer_group_output[0]
|
||||||
|
|
||||||
|
if self.output_attentions:
|
||||||
|
all_attentions = all_attentions + layer_group_output[-1]
|
||||||
|
|
||||||
|
if self.output_hidden_states:
|
||||||
|
all_hidden_states = all_hidden_states + (hidden_states,)
|
||||||
|
|
||||||
|
outputs = (hidden_states,)
|
||||||
|
if self.output_hidden_states:
|
||||||
|
outputs = outputs + (all_hidden_states,)
|
||||||
|
if self.output_attentions:
|
||||||
|
outputs = outputs + (all_attentions,)
|
||||||
|
|
||||||
|
# last-layer hidden state, (all hidden states), (all attentions)
|
||||||
|
return outputs
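The loop above reuses a small number of layer groups across all hidden layers, which is how ALBERT shares parameters between layers. A minimal sketch of the index arithmetic, mirroring the `group_idx` expression in `call()`; the function name `group_schedule` is hypothetical and only serves the illustration.

```python
def group_schedule(num_hidden_layers, num_hidden_groups):
    """Return, for each layer index, the index of the layer group that runs it.

    Mirrors: group_idx = int(i / (num_hidden_layers / num_hidden_groups))
    """
    layers_per_group = num_hidden_layers / num_hidden_groups
    return [int(i / layers_per_group) for i in range(num_hidden_layers)]


if __name__ == "__main__":
    # One group: every layer reuses the same weights.
    print(group_schedule(12, 1))
    # Three groups: layers 0-3, 4-7 and 8-11 each share one set of weights.
    print(group_schedule(12, 3))
```

With `num_hidden_groups=1`, every iteration calls the same `TFAlbertLayerGroup`, so the encoder depth grows without adding parameters.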
|
||||||
|
|
||||||
|
|
||||||
|
class TFAlbertPreTrainedModel(TFPreTrainedModel):
|
||||||
|
""" An abstract class to handle weights initialization and
|
||||||
|
a simple interface for downloading and loading pretrained models.
|
||||||
|
"""
|
||||||
|
config_class = AlbertConfig
|
||||||
|
pretrained_model_archive_map = TF_ALBERT_PRETRAINED_MODEL_ARCHIVE_MAP
|
||||||
|
base_model_prefix = "albert"
|
||||||
|
|
||||||
|
|
||||||
|
class TFAlbertMLMHead(tf.keras.layers.Layer):
|
||||||
|
def __init__(self, config, input_embeddings, **kwargs):
|
||||||
|
super(TFAlbertMLMHead, self).__init__(**kwargs)
|
||||||
|
self.vocab_size = config.vocab_size
|
||||||
|
|
||||||
|
self.dense = tf.keras.layers.Dense(config.embedding_size,
|
||||||
|
kernel_initializer=get_initializer(
|
||||||
|
config.initializer_range),
|
||||||
|
name='dense')
|
||||||
|
if isinstance(config.hidden_act, str) or (sys.version_info[0] == 2 and isinstance(config.hidden_act, unicode)):
|
||||||
|
self.activation = ACT2FN[config.hidden_act]
|
||||||
|
else:
|
||||||
|
self.activation = config.hidden_act
|
||||||
|
|
||||||
|
self.LayerNorm = tf.keras.layers.LayerNormalization(
|
||||||
|
epsilon=config.layer_norm_eps, name='LayerNorm')
|
||||||
|
|
||||||
|
# The output weights are the same as the input embeddings, but there is
|
||||||
|
# an output-only bias for each token.
|
||||||
|
self.decoder = input_embeddings
|
||||||
|
|
||||||
|
def build(self, input_shape):
|
||||||
|
self.bias = self.add_weight(shape=(self.vocab_size,),
|
||||||
|
initializer='zeros',
|
||||||
|
trainable=True,
|
||||||
|
name='bias')
|
||||||
|
self.decoder_bias = self.add_weight(shape=(self.vocab_size,),
|
||||||
|
initializer='zeros',
|
||||||
|
trainable=True,
|
||||||
|
name='decoder/bias')
|
||||||
|
super(TFAlbertMLMHead, self).build(input_shape)
|
||||||
|
|
||||||
|
def call(self, hidden_states):
|
||||||
|
hidden_states = self.dense(hidden_states)
|
||||||
|
hidden_states = self.activation(hidden_states)
|
||||||
|
hidden_states = self.LayerNorm(hidden_states)
|
||||||
|
hidden_states = self.decoder(hidden_states, mode="linear") + self.decoder_bias
|
||||||
|
hidden_states = hidden_states + self.bias
|
||||||
|
return hidden_states
|
||||||
|
|
||||||
|
|
||||||
|
ALBERT_START_DOCSTRING = r""" The ALBERT model was proposed in
|
||||||
|
`ALBERT: A Lite BERT for Self-supervised Learning of Language Representations`_
|
||||||
|
by Zhenzhong Lan, Mingda Chen, Sebastian Goodman, Kevin Gimpel, Piyush Sharma, Radu Soricut. It presents
|
||||||
|
two parameter-reduction techniques to lower memory consumption and increase the training speed of BERT.
|
||||||
|
|
||||||
|
This model is a `tf.keras.Model`_ sub-class. Use it as a regular TF 2.0 Keras Model and
|
||||||
|
refer to the TF 2.0 documentation for all matters related to general usage and behavior.
|
||||||
|
|
||||||
|
.. _`ALBERT: A Lite BERT for Self-supervised Learning of Language Representations`:
|
||||||
|
https://arxiv.org/abs/1909.11942
|
||||||
|
|
||||||
|
.. _`tf.keras.Model`:
|
||||||
|
https://www.tensorflow.org/versions/r2.0/api_docs/python/tf/keras/Model
|
||||||
|
|
||||||
|
Note on the model inputs:
|
||||||
|
TF 2.0 models accept two formats as inputs:
|
||||||
|
|
||||||
|
- having all inputs as keyword arguments (like PyTorch models), or
|
||||||
|
- having all inputs as a list, tuple or dict in the first positional argument.
|
||||||
|
|
||||||
|
This second option is useful when using the `tf.keras.Model.fit()` method, which currently requires having all the tensors in the first argument of the model call function: `model(inputs)`.
|
||||||
|
|
||||||
|
If you choose this second option, there are three ways to gather all the input Tensors in the first positional argument:
|
||||||
|
|
||||||
|
- a single Tensor with input_ids only and nothing else: `model(input_ids)`
|
||||||
|
- a list of varying length with one or several input Tensors IN THE ORDER given in the docstring:
|
||||||
|
`model([input_ids, attention_mask])` or `model([input_ids, attention_mask, token_type_ids])`
|
||||||
|
- a dictionary with one or several input Tensors associated with the input names given in the docstring:
|
||||||
|
`model({'input_ids': input_ids, 'token_type_ids': token_type_ids})`
|
||||||
|
|
||||||
|
Parameters:
|
||||||
|
config (:class:`~transformers.AlbertConfig`): Model configuration class with all the parameters of the model.
|
||||||
|
Initializing with a config file does not load the weights associated with the model, only the configuration.
|
||||||
|
Check out the :meth:`~transformers.PreTrainedModel.from_pretrained` method to load the model weights.
|
||||||
|
"""
|
||||||
|
|
||||||
|
ALBERT_INPUTS_DOCSTRING = r"""
|
||||||
|
Inputs:
|
||||||
|
**input_ids**: ``Numpy array`` or ``tf.Tensor`` of shape ``(batch_size, sequence_length)``:
|
||||||
|
Indices of input sequence tokens in the vocabulary.
|
||||||
|
To match pre-training, ALBERT input sequence should be formatted with [CLS] and [SEP] tokens as follows:
|
||||||
|
|
||||||
|
(a) For sequence pairs:
|
||||||
|
|
||||||
|
``tokens: [CLS] is this jack ##son ##ville ? [SEP] no it is not . [SEP]``
|
||||||
|
|
||||||
|
``token_type_ids: 0 0 0 0 0 0 0 0 1 1 1 1 1 1``
|
||||||
|
|
||||||
|
(b) For single sequences:
|
||||||
|
|
||||||
|
``tokens: [CLS] the dog is hairy . [SEP]``
|
||||||
|
|
||||||
|
``token_type_ids: 0 0 0 0 0 0 0``
|
||||||
|
|
||||||
|
Albert is a model with absolute position embeddings so it's usually advised to pad the inputs on
|
||||||
|
the right rather than the left.
|
||||||
|
|
||||||
|
Indices can be obtained using :class:`transformers.AlbertTokenizer`.
|
||||||
|
See :func:`transformers.PreTrainedTokenizer.encode` and
|
||||||
|
:func:`transformers.PreTrainedTokenizer.convert_tokens_to_ids` for details.
|
||||||
|
**attention_mask**: (`optional`) ``Numpy array`` or ``tf.Tensor`` of shape ``(batch_size, sequence_length)``:
|
||||||
|
Mask to avoid performing attention on padding token indices.
|
||||||
|
Mask values selected in ``[0, 1]``:
|
||||||
|
``1`` for tokens that are NOT MASKED, ``0`` for MASKED tokens.
|
||||||
|
**token_type_ids**: (`optional`) ``Numpy array`` or ``tf.Tensor`` of shape ``(batch_size, sequence_length)``:
|
||||||
|
Segment token indices to indicate first and second portions of the inputs.
|
||||||
|
Indices are selected in ``[0, 1]``: ``0`` corresponds to a `sentence A` token, ``1``
|
||||||
|
corresponds to a `sentence B` token
|
||||||
|
(see `ALBERT: A Lite BERT for Self-supervised Learning of Language Representations`_ for more details).
|
||||||
|
**position_ids**: (`optional`) ``Numpy array`` or ``tf.Tensor`` of shape ``(batch_size, sequence_length)``:
|
||||||
|
Indices of positions of each input sequence token in the position embeddings.
|
||||||
|
Selected in the range ``[0, config.max_position_embeddings - 1]``.
|
||||||
|
**head_mask**: (`optional`) ``Numpy array`` or ``tf.Tensor`` of shape ``(num_heads,)`` or ``(num_layers, num_heads)``:
|
||||||
|
Mask to nullify selected heads of the self-attention modules.
|
||||||
|
Mask values selected in ``[0, 1]``:
|
||||||
|
``1`` indicates the head is **not masked**, ``0`` indicates the head is **masked**.
|
||||||
|
"""
|
||||||
|
|
||||||
|
@add_start_docstrings("The bare Albert Model transformer outputting raw hidden-states without any specific head on top.",
|
||||||
|
ALBERT_START_DOCSTRING, ALBERT_INPUTS_DOCSTRING)
|
||||||
|
class TFAlbertModel(TFAlbertPreTrainedModel):
|
||||||
|
r"""
|
||||||
|
Outputs: `Tuple` comprising various elements depending on the configuration (config) and inputs:
|
||||||
|
**last_hidden_state**: ``tf.Tensor`` of shape ``(batch_size, sequence_length, hidden_size)``
|
||||||
|
Sequence of hidden-states at the output of the last layer of the model.
|
||||||
|
**pooler_output**: ``tf.Tensor`` of shape ``(batch_size, hidden_size)``
|
||||||
|
Last layer hidden-state of the first token of the sequence (classification token)
|
||||||
|
further processed by a Linear layer and a Tanh activation function. The Linear
|
||||||
|
layer weights are trained from the sentence order prediction (classification)
|
||||||
|
objective during Albert pretraining. This output is usually *not* a good summary
|
||||||
|
of the semantic content of the input; you are often better off averaging or pooling
|
||||||
|
the sequence of hidden-states for the whole input sequence.
|
||||||
|
**hidden_states**: (`optional`, returned when ``config.output_hidden_states=True``)
|
||||||
|
list of ``tf.Tensor`` (one for the output of each layer + the output of the embeddings)
|
||||||
|
of shape ``(batch_size, sequence_length, hidden_size)``:
|
||||||
|
Hidden-states of the model at the output of each layer plus the initial embedding outputs.
|
||||||
|
**attentions**: (`optional`, returned when ``config.output_attentions=True``)
|
||||||
|
list of ``tf.Tensor`` (one for each layer) of shape ``(batch_size, num_heads, sequence_length, sequence_length)``:
|
||||||
|
Attention weights after the attention softmax, used to compute the weighted average in the self-attention heads.
|
||||||
|
|
||||||
|
Examples::
|
||||||
|
|
||||||
|
import tensorflow as tf
|
||||||
|
from transformers import AlbertTokenizer, TFAlbertModel
|
||||||
|
|
||||||
|
tokenizer = AlbertTokenizer.from_pretrained('albert-base-v2')
|
||||||
|
model = TFAlbertModel.from_pretrained('albert-base-v2')
|
||||||
|
input_ids = tf.constant(tokenizer.encode("Hello, my dog is cute"))[None, :] # Batch size 1
|
||||||
|
outputs = model(input_ids)
|
||||||
|
last_hidden_states = outputs[0] # The last hidden-state is the first element of the output tuple
|
||||||
|
|
||||||
|
"""
|
||||||
|
|
||||||
|
def __init__(self, config, **kwargs):
|
||||||
|
super(TFAlbertModel, self).__init__(config, **kwargs)
|
||||||
|
self.num_hidden_layers = config.num_hidden_layers
|
||||||
|
|
||||||
|
self.embeddings = TFAlbertEmbeddings(config, name="embeddings")
|
||||||
|
self.encoder = TFAlbertTransformer(config, name="encoder")
|
||||||
|
self.pooler = tf.keras.layers.Dense(config.hidden_size, kernel_initializer=get_initializer(
|
||||||
|
config.initializer_range), activation='tanh', name='pooler')
|
||||||
|
|
||||||
|
def get_input_embeddings(self):
|
||||||
|
return self.embeddings
|
||||||
|
|
||||||
|
def _resize_token_embeddings(self, new_num_tokens):
|
||||||
|
raise NotImplementedError
|
||||||
|
|
||||||
|
def _prune_heads(self, heads_to_prune):
|
||||||
|
""" Prunes heads of the model.
|
||||||
|
heads_to_prune: dict of {layer_num: list of heads to prune in this layer}
|
||||||
|
See base class PreTrainedModel
|
||||||
|
"""
|
||||||
|
raise NotImplementedError
|
||||||
|
|
||||||
|
def call(self, inputs, attention_mask=None, token_type_ids=None, position_ids=None, head_mask=None, inputs_embeds=None, training=False):
|
||||||
|
if isinstance(inputs, (tuple, list)):
|
||||||
|
input_ids = inputs[0]
|
||||||
|
attention_mask = inputs[1] if len(inputs) > 1 else attention_mask
|
||||||
|
token_type_ids = inputs[2] if len(inputs) > 2 else token_type_ids
|
||||||
|
position_ids = inputs[3] if len(inputs) > 3 else position_ids
|
||||||
|
head_mask = inputs[4] if len(inputs) > 4 else head_mask
|
||||||
|
inputs_embeds = inputs[5] if len(inputs) > 5 else inputs_embeds
|
||||||
|
assert len(inputs) <= 6, "Too many inputs."
|
||||||
|
elif isinstance(inputs, dict):
|
||||||
|
input_ids = inputs.get('input_ids')
|
||||||
|
attention_mask = inputs.get('attention_mask', attention_mask)
|
||||||
|
token_type_ids = inputs.get('token_type_ids', token_type_ids)
|
||||||
|
position_ids = inputs.get('position_ids', position_ids)
|
||||||
|
head_mask = inputs.get('head_mask', head_mask)
|
||||||
|
inputs_embeds = inputs.get('inputs_embeds', inputs_embeds)
|
||||||
|
assert len(inputs) <= 6, "Too many inputs."
|
||||||
|
else:
|
||||||
|
input_ids = inputs
|
||||||
|
|
||||||
|
if input_ids is not None and inputs_embeds is not None:
|
||||||
|
raise ValueError("You cannot specify both input_ids and inputs_embeds at the same time")
|
||||||
|
elif input_ids is not None:
|
||||||
|
input_shape = tf.shape(input_ids)
|
||||||
|
elif inputs_embeds is not None:
|
||||||
|
input_shape = inputs_embeds.shape[:-1]
|
||||||
|
else:
|
||||||
|
raise ValueError("You have to specify either input_ids or inputs_embeds")
|
||||||
|
|
||||||
|
if attention_mask is None:
|
||||||
|
attention_mask = tf.fill(input_shape, 1)
|
||||||
|
if token_type_ids is None:
|
||||||
|
token_type_ids = tf.fill(input_shape, 0)
|
||||||
|
|
||||||
|
# We create a 4D attention mask from a 2D tensor mask.
|
||||||
|
# Sizes are [batch_size, 1, 1, to_seq_length]
|
||||||
|
# So we can broadcast to [batch_size, num_heads, from_seq_length, to_seq_length]
|
||||||
|
# this attention mask is simpler than the triangular masking of causal attention
|
||||||
|
# used in OpenAI GPT, we just need to prepare the broadcast dimension here.
|
||||||
|
extended_attention_mask = attention_mask[:, tf.newaxis, tf.newaxis, :]
|
||||||
|
|
||||||
|
# Since attention_mask is 1.0 for positions we want to attend and 0.0 for
|
||||||
|
# masked positions, this operation will create a tensor which is 0.0 for
|
||||||
|
# positions we want to attend and -10000.0 for masked positions.
|
||||||
|
# Since we are adding it to the raw scores before the softmax, this is
|
||||||
|
# effectively the same as removing these entirely.
|
||||||
|
|
||||||
|
extended_attention_mask = tf.cast(extended_attention_mask, tf.float32)
|
||||||
|
extended_attention_mask = (1.0 - extended_attention_mask) * -10000.0
|
||||||
|
|
||||||
|
# Prepare head mask if needed
|
||||||
|
# 1.0 in head_mask indicates we keep the head
|
||||||
|
# attention_probs has shape bsz x n_heads x N x N
|
||||||
|
# input head_mask has shape [num_heads] or [num_hidden_layers x num_heads]
|
||||||
|
# and head_mask is converted to shape [num_hidden_layers x batch x num_heads x seq_length x seq_length]
|
||||||
|
if head_mask is not None:
|
||||||
|
raise NotImplementedError
|
||||||
|
else:
|
||||||
|
head_mask = [None] * self.num_hidden_layers
|
||||||
|
# head_mask = tf.constant([0] * self.num_hidden_layers)
|
||||||
|
|
||||||
|
embedding_output = self.embeddings(
|
||||||
|
[input_ids, position_ids, token_type_ids, inputs_embeds], training=training)
|
||||||
|
encoder_outputs = self.encoder(
|
||||||
|
[embedding_output, extended_attention_mask, head_mask], training=training)
|
||||||
|
|
||||||
|
sequence_output = encoder_outputs[0]
|
||||||
|
pooled_output = self.pooler(sequence_output[:, 0])
|
||||||
|
|
||||||
|
# add hidden_states and attentions if they are here
|
||||||
|
outputs = (sequence_output, pooled_output,) + encoder_outputs[1:]
|
||||||
|
# sequence_output, pooled_output, (hidden_states), (attentions)
|
||||||
|
return outputs
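The mask preparation in `call()` turns a 0/1 padding mask into an additive mask with two extra broadcast axes. A rough pure-Python sketch of that conversion, assuming nested lists in place of tensors; `extend_attention_mask` is a hypothetical name used only here.

```python
def extend_attention_mask(attention_mask, penalty=-10000.0):
    """Convert a [batch][seq_len] 0/1 mask to an additive [batch][1][1][seq_len] mask.

    Positions to attend to (1) become 0.0; padded positions (0) become `penalty`,
    mirroring (1.0 - mask) * -10000.0 in the model code.
    """
    extended = []
    for row in attention_mask:
        additive = [(1.0 - float(keep)) * penalty for keep in row]
        # Two singleton axes stand in for the heads and query-position dimensions.
        extended.append([[additive]])
    return extended


if __name__ == "__main__":
    mask = [[1, 1, 0], [1, 0, 0]]
    extended = extend_attention_mask(mask)
    print(len(extended), len(extended[0]), len(extended[0][0]), len(extended[0][0][0]))
    print(extended[1][0][0])
```

Because the two middle axes have size 1, the same row of penalties is broadcast across every head and every query position when it is added to the attention scores.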
|
||||||
|
|
||||||
|
|
||||||
|
@add_start_docstrings("""Albert Model with a `language modeling` head on top. """,
|
||||||
|
ALBERT_START_DOCSTRING, ALBERT_INPUTS_DOCSTRING)
|
||||||
|
class TFAlbertForMaskedLM(TFAlbertPreTrainedModel):
|
||||||
|
r"""
|
||||||
|
Outputs: `Tuple` comprising various elements depending on the configuration (config) and inputs:
|
||||||
|
**prediction_scores**: ``Numpy array`` or ``tf.Tensor`` of shape ``(batch_size, sequence_length, config.vocab_size)``
|
||||||
|
Prediction scores of the language modeling head (scores for each vocabulary token before SoftMax).
|
||||||
|
**hidden_states**: (`optional`, returned when ``config.output_hidden_states=True``)
|
||||||
|
list of ``Numpy array`` or ``tf.Tensor`` (one for the output of each layer + the output of the embeddings)
|
||||||
|
of shape ``(batch_size, sequence_length, hidden_size)``:
|
||||||
|
Hidden-states of the model at the output of each layer plus the initial embedding outputs.
|
||||||
|
**attentions**: (`optional`, returned when ``config.output_attentions=True``)
|
||||||
|
list of ``Numpy array`` or ``tf.Tensor`` (one for each layer) of shape ``(batch_size, num_heads, sequence_length, sequence_length)``:
|
||||||
|
Attention weights after the attention softmax, used to compute the weighted average in the self-attention heads.
|
||||||
|
|
||||||
|
Examples::
|
||||||
|
|
||||||
|
import tensorflow as tf
|
||||||
|
from transformers import AlbertTokenizer, TFAlbertForMaskedLM
|
||||||
|
|
||||||
|
tokenizer = AlbertTokenizer.from_pretrained('albert-base-v2')
|
||||||
|
model = TFAlbertForMaskedLM.from_pretrained('albert-base-v2')
|
||||||
|
input_ids = tf.constant(tokenizer.encode("Hello, my dog is cute"))[None, :] # Batch size 1
|
||||||
|
outputs = model(input_ids)
|
||||||
|
prediction_scores = outputs[0]
|
||||||
|
|
||||||
|
"""
|
||||||
|
|
||||||
|
def __init__(self, config, *inputs, **kwargs):
|
||||||
|
super(TFAlbertForMaskedLM, self).__init__(config, *inputs, **kwargs)
|
||||||
|
|
||||||
|
self.albert = TFAlbertModel(config, name='albert')
|
||||||
|
self.predictions = TFAlbertMLMHead(
|
||||||
|
config, self.albert.embeddings, name='predictions')
|
||||||
|
|
||||||
|
def get_output_embeddings(self):
|
||||||
|
return self.albert.embeddings
|
||||||
|
|
||||||
|
def call(self, inputs, **kwargs):
|
||||||
|
outputs = self.albert(inputs, **kwargs)
|
||||||
|
|
||||||
|
sequence_output = outputs[0]
|
||||||
|
prediction_scores = self.predictions(
|
||||||
|
sequence_output, training=kwargs.get('training', False))
|
||||||
|
|
||||||
|
# Add hidden states and attention if they are here
|
||||||
|
outputs = (prediction_scores,) + outputs[2:]
|
||||||
|
|
||||||
|
return outputs # prediction_scores, (hidden_states), (attentions)
|
||||||
|
|
||||||
|
|
||||||
|
@add_start_docstrings("""Albert Model transformer with a sequence classification/regression head on top (a linear layer on top of
|
||||||
|
the pooled output) e.g. for GLUE tasks. """,
|
||||||
|
ALBERT_START_DOCSTRING, ALBERT_INPUTS_DOCSTRING)
|
||||||
|
class TFAlbertForSequenceClassification(TFAlbertPreTrainedModel):
|
||||||
|
r"""
|
||||||
|
Outputs: `Tuple` comprising various elements depending on the configuration (config) and inputs:
|
||||||
|
**logits**: ``Numpy array`` or ``tf.Tensor`` of shape ``(batch_size, config.num_labels)``
|
||||||
|
Classification (or regression if config.num_labels==1) scores (before SoftMax).
|
||||||
|
**hidden_states**: (`optional`, returned when ``config.output_hidden_states=True``)
|
||||||
|
list of ``Numpy array`` or ``tf.Tensor`` (one for the output of each layer + the output of the embeddings)
|
||||||
|
of shape ``(batch_size, sequence_length, hidden_size)``:
|
||||||
|
Hidden-states of the model at the output of each layer plus the initial embedding outputs.
|
||||||
|
**attentions**: (`optional`, returned when ``config.output_attentions=True``)
|
||||||
|
list of ``Numpy array`` or ``tf.Tensor`` (one for each layer) of shape ``(batch_size, num_heads, sequence_length, sequence_length)``:
|
||||||
|
Attention weights after the attention softmax, used to compute the weighted average in the self-attention heads.
|
||||||
|
|
||||||
|
Examples::
|
||||||
|
|
||||||
|
import tensorflow as tf
|
||||||
|
from transformers import AlbertTokenizer, TFAlbertForSequenceClassification
|
||||||
|
|
||||||
|
tokenizer = AlbertTokenizer.from_pretrained('albert-base-v2')
|
||||||
|
model = TFAlbertForSequenceClassification.from_pretrained('albert-base-v2')
|
||||||
|
input_ids = tf.constant(tokenizer.encode("Hello, my dog is cute"))[None, :] # Batch size 1
|
||||||
|
outputs = model(input_ids)
|
||||||
|
logits = outputs[0]
|
||||||
|
|
||||||
|
"""
|
||||||
|
def __init__(self, config, *inputs, **kwargs):
|
||||||
|
super(TFAlbertForSequenceClassification, self).__init__(config, *inputs, **kwargs)
|
||||||
|
self.num_labels = config.num_labels
|
||||||
|
|
||||||
|
self.albert = TFAlbertModel(config, name='albert')
|
||||||
|
self.dropout = tf.keras.layers.Dropout(config.hidden_dropout_prob)
|
||||||
|
self.classifier = tf.keras.layers.Dense(config.num_labels,
|
||||||
|
kernel_initializer=get_initializer(config.initializer_range),
|
||||||
|
name='classifier')
|
||||||
|
|
||||||
|
def call(self, inputs, **kwargs):
|
||||||
|
outputs = self.albert(inputs, **kwargs)
|
||||||
|
|
||||||
|
pooled_output = outputs[1]
|
||||||
|
|
||||||
|
pooled_output = self.dropout(pooled_output, training=kwargs.get('training', False))
|
||||||
|
logits = self.classifier(pooled_output)
|
||||||
|
|
||||||
|
outputs = (logits,) + outputs[2:] # add hidden states and attention if they are here
|
||||||
|
|
||||||
|
return outputs # logits, (hidden_states), (attentions)
|
||||||
@@ -109,6 +109,9 @@ class TFAutoModel(object):
|
|||||||
force_download: (`optional`) boolean, default False:
|
force_download: (`optional`) boolean, default False:
|
||||||
Force a (re-)download of the model weights and configuration files, overriding the cached versions if they exist.
|
Force a (re-)download of the model weights and configuration files, overriding the cached versions if they exist.
|
||||||
|
|
||||||
|
resume_download: (`optional`) boolean, default False:
|
||||||
|
Do not delete an incompletely received file. Attempt to resume the download if such a file exists.
|
||||||
|
|
||||||
proxies: (`optional`) dict, default None:
|
proxies: (`optional`) dict, default None:
|
||||||
A dictionary of proxy servers to use by protocol or endpoint, e.g.: {'http': 'foo.bar:3128', 'http://hostname': 'foo.bar:4012'}.
|
A dictionary of proxy servers to use by protocol or endpoint, e.g.: {'http': 'foo.bar:3128', 'http://hostname': 'foo.bar:4012'}.
|
||||||
The proxies are used on each request.
|
The proxies are used on each request.
|
||||||
@@ -237,6 +240,9 @@ class TFAutoModelWithLMHead(object):
|
|||||||
force_download: (`optional`) boolean, default False:
|
force_download: (`optional`) boolean, default False:
|
||||||
Force a (re-)download of the model weights and configuration files and override the cached versions if they exist.
|
Force a (re-)download of the model weights and configuration files and override the cached versions if they exist.
|
||||||
|
|
||||||
|
resume_download: (`optional`) boolean, default False:
|
||||||
|
Do not delete an incompletely received file. Attempt to resume the download if such a file exists.
|
||||||
|
|
||||||
proxies: (`optional`) dict, default None:
|
proxies: (`optional`) dict, default None:
|
||||||
A dictionary of proxy servers to use by protocol or endpoint, e.g.: {'http': 'foo.bar:3128', 'http://hostname': 'foo.bar:4012'}.
|
A dictionary of proxy servers to use by protocol or endpoint, e.g.: {'http': 'foo.bar:3128', 'http://hostname': 'foo.bar:4012'}.
|
||||||
The proxies are used on each request.
|
The proxies are used on each request.
|
||||||
@@ -360,6 +366,9 @@ class TFAutoModelForSequenceClassification(object):
|
|||||||
force_download: (`optional`) boolean, default False:
|
force_download: (`optional`) boolean, default False:
|
||||||
Force a (re-)download of the model weights and configuration files and override the cached versions if they exist.
|
Force a (re-)download of the model weights and configuration files and override the cached versions if they exist.
|
||||||
|
|
||||||
|
resume_download: (`optional`) boolean, default False:
|
||||||
|
Do not delete an incompletely received file. Attempt to resume the download if such a file exists.
|
||||||
|
|
||||||
proxies: (`optional`) dict, default None:
|
proxies: (`optional`) dict, default None:
|
||||||
A dictionary of proxy servers to use by protocol or endpoint, e.g.: {'http': 'foo.bar:3128', 'http://hostname': 'foo.bar:4012'}.
|
A dictionary of proxy servers to use by protocol or endpoint, e.g.: {'http': 'foo.bar:3128', 'http://hostname': 'foo.bar:4012'}.
|
||||||
The proxies are used on each request.
|
The proxies are used on each request.
|
||||||
@@ -472,6 +481,9 @@ class TFAutoModelForQuestionAnswering(object):
|
|||||||
force_download: (`optional`) boolean, default False:
|
force_download: (`optional`) boolean, default False:
|
||||||
Force a (re-)download of the model weights and configuration files and override the cached versions if they exist.
|
Force a (re-)download of the model weights and configuration files and override the cached versions if they exist.
|
||||||
|
|
||||||
|
resume_download: (`optional`) boolean, default False:
|
||||||
|
Do not delete an incompletely received file. Attempt to resume the download if such a file exists.
|
||||||
|
|
||||||
proxies: (`optional`) dict, default None:
|
proxies: (`optional`) dict, default None:
|
||||||
A dictionary of proxy servers to use by protocol or endpoint, e.g.: {'http': 'foo.bar:3128', 'http://hostname': 'foo.bar:4012'}.
|
A dictionary of proxy servers to use by protocol or endpoint, e.g.: {'http': 'foo.bar:3128', 'http://hostname': 'foo.bar:4012'}.
|
||||||
The proxies are used on each request.
|
The proxies are used on each request.
|
||||||
|
|||||||
@@ -191,6 +191,9 @@ class TFPreTrainedModel(tf.keras.Model):
|
|||||||
force_download: (`optional`) boolean, default False:
|
force_download: (`optional`) boolean, default False:
|
||||||
Force a (re-)download of the model weights and configuration files and override the cached versions if they exist.
|
Force a (re-)download of the model weights and configuration files and override the cached versions if they exist.
|
||||||
|
|
||||||
|
resume_download: (`optional`) boolean, default False:
|
||||||
|
Do not delete an incompletely received file. Attempt to resume the download if such a file exists.
|
||||||
|
|
||||||
proxies: (`optional`) dict, default None:
|
proxies: (`optional`) dict, default None:
|
||||||
A dictionary of proxy servers to use by protocol or endpoint, e.g.: {'http': 'foo.bar:3128', 'http://hostname': 'foo.bar:4012'}.
|
A dictionary of proxy servers to use by protocol or endpoint, e.g.: {'http': 'foo.bar:3128', 'http://hostname': 'foo.bar:4012'}.
|
||||||
The proxies are used on each request.
|
The proxies are used on each request.
|
||||||
@@ -216,6 +219,7 @@ class TFPreTrainedModel(tf.keras.Model):
|
|||||||
cache_dir = kwargs.pop('cache_dir', None)
|
cache_dir = kwargs.pop('cache_dir', None)
|
||||||
from_pt = kwargs.pop('from_pt', False)
|
from_pt = kwargs.pop('from_pt', False)
|
||||||
force_download = kwargs.pop('force_download', False)
|
force_download = kwargs.pop('force_download', False)
|
||||||
|
resume_download = kwargs.pop('resume_download', False)
|
||||||
proxies = kwargs.pop('proxies', None)
|
proxies = kwargs.pop('proxies', None)
|
||||||
|
|
||||||
# Load config
|
# Load config
|
||||||
@@ -224,6 +228,7 @@ class TFPreTrainedModel(tf.keras.Model):
|
|||||||
pretrained_model_name_or_path, *model_args,
|
pretrained_model_name_or_path, *model_args,
|
||||||
cache_dir=cache_dir, return_unused_kwargs=True,
|
cache_dir=cache_dir, return_unused_kwargs=True,
|
||||||
force_download=force_download,
|
force_download=force_download,
|
||||||
|
resume_download=resume_download,
|
||||||
**kwargs
|
**kwargs
|
||||||
)
|
)
|
||||||
else:
|
else:
|
||||||
@@ -251,7 +256,8 @@ class TFPreTrainedModel(tf.keras.Model):
|
|||||||
|
|
||||||
# redirect to the cache, if necessary
|
# redirect to the cache, if necessary
|
||||||
try:
|
try:
|
||||||
resolved_archive_file = cached_path(archive_file, cache_dir=cache_dir, force_download=force_download, proxies=proxies)
|
resolved_archive_file = cached_path(archive_file, cache_dir=cache_dir, force_download=force_download,
|
||||||
|
resume_download=resume_download, proxies=proxies)
|
||||||
except EnvironmentError as e:
|
except EnvironmentError as e:
|
||||||
if pretrained_model_name_or_path in cls.pretrained_model_archive_map:
|
if pretrained_model_name_or_path in cls.pretrained_model_archive_map:
|
||||||
logger.error(
|
logger.error(
|
||||||
|
|||||||
@@ -291,6 +291,9 @@ class PreTrainedModel(nn.Module):
|
|||||||
force_download: (`optional`) boolean, default False:
|
force_download: (`optional`) boolean, default False:
|
||||||
Force a (re-)download of the model weights and configuration files and override the cached versions if they exist.
|
Force a (re-)download of the model weights and configuration files and override the cached versions if they exist.
|
||||||
|
|
||||||
|
resume_download: (`optional`) boolean, default False:
|
||||||
|
Do not delete an incompletely received file. Attempt to resume the download if such a file exists.
|
||||||
|
|
||||||
proxies: (`optional`) dict, default None:
|
proxies: (`optional`) dict, default None:
|
||||||
A dictionary of proxy servers to use by protocol or endpoint, e.g.: {'http': 'foo.bar:3128', 'http://hostname': 'foo.bar:4012'}.
|
A dictionary of proxy servers to use by protocol or endpoint, e.g.: {'http': 'foo.bar:3128', 'http://hostname': 'foo.bar:4012'}.
|
||||||
The proxies are used on each request.
|
The proxies are used on each request.
|
||||||
@@ -315,11 +318,16 @@ class PreTrainedModel(nn.Module):
|
|||||||
model = BertModel.from_pretrained('./tf_model/my_tf_checkpoint.ckpt.index', from_tf=True, config=config)
|
model = BertModel.from_pretrained('./tf_model/my_tf_checkpoint.ckpt.index', from_tf=True, config=config)
|
||||||
|
|
||||||
"""
|
"""
|
||||||
|
if "albert" in pretrained_model_name_or_path and "v2" in pretrained_model_name_or_path:
|
||||||
|
logger.warning("There is currently an upstream reproducibility issue with ALBERT v2 models. Please see " +
|
||||||
|
"https://github.com/google-research/google-research/issues/119 for more information.")
|
||||||
|
|
||||||
config = kwargs.pop('config', None)
|
config = kwargs.pop('config', None)
|
||||||
state_dict = kwargs.pop('state_dict', None)
|
state_dict = kwargs.pop('state_dict', None)
|
||||||
cache_dir = kwargs.pop('cache_dir', None)
|
cache_dir = kwargs.pop('cache_dir', None)
|
||||||
from_tf = kwargs.pop('from_tf', False)
|
from_tf = kwargs.pop('from_tf', False)
|
||||||
force_download = kwargs.pop('force_download', False)
|
force_download = kwargs.pop('force_download', False)
|
||||||
|
resume_download = kwargs.pop('resume_download', False)
|
||||||
proxies = kwargs.pop('proxies', None)
|
proxies = kwargs.pop('proxies', None)
|
||||||
output_loading_info = kwargs.pop('output_loading_info', False)
|
output_loading_info = kwargs.pop('output_loading_info', False)
|
||||||
|
|
||||||
@@ -329,6 +337,7 @@ class PreTrainedModel(nn.Module):
|
|||||||
pretrained_model_name_or_path, *model_args,
|
pretrained_model_name_or_path, *model_args,
|
||||||
cache_dir=cache_dir, return_unused_kwargs=True,
|
cache_dir=cache_dir, return_unused_kwargs=True,
|
||||||
force_download=force_download,
|
force_download=force_download,
|
||||||
|
resume_download=resume_download,
|
||||||
proxies=proxies,
|
proxies=proxies,
|
||||||
**kwargs
|
**kwargs
|
||||||
)
|
)
|
||||||
@@ -361,7 +370,8 @@ class PreTrainedModel(nn.Module):
|
|||||||
|
|
||||||
# redirect to the cache, if necessary
|
# redirect to the cache, if necessary
|
||||||
try:
|
try:
|
||||||
resolved_archive_file = cached_path(archive_file, cache_dir=cache_dir, force_download=force_download, proxies=proxies)
|
resolved_archive_file = cached_path(archive_file, cache_dir=cache_dir, force_download=force_download,
|
||||||
|
proxies=proxies, resume_download=resume_download)
|
||||||
except EnvironmentError:
|
except EnvironmentError:
|
||||||
if pretrained_model_name_or_path in cls.pretrained_model_archive_map:
|
if pretrained_model_name_or_path in cls.pretrained_model_archive_map:
|
||||||
msg = "Couldn't reach server at '{}' to download pretrained weights.".format(
|
msg = "Couldn't reach server at '{}' to download pretrained weights.".format(
|
||||||
|
|||||||
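The `resume_download` flag threaded through `from_pretrained` and `cached_path` above keeps a partially received file and tries to continue it. The diff does not show how `cached_path` does this, so the following is only a minimal sketch of one common approach: an HTTP `Range` request that starts at the size of the partial file. The helper name `range_header_for_partial` is hypothetical and is not taken from the library.

```python
import os


def range_header_for_partial(partial_path):
    """Return HTTP headers asking the server to skip bytes already on disk.

    Hypothetical helper for illustration; not the library's implementation.
    """
    if os.path.exists(partial_path):
        resume_size = os.path.getsize(partial_path)
    else:
        resume_size = 0
    # Nothing on disk yet: issue a plain request for the whole file.
    if resume_size == 0:
        return {}
    # Ask only for the bytes from the current file size onward.
    return {"Range": "bytes={}-".format(resume_size)}
```

A caller would pass these headers to its HTTP client and append the response body to the partial file. A server that ignores `Range` answers with the full file, so a real implementation would also need to check the response status before appending.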
BIN
transformers/tests/fixtures/spiece.model
vendored
Normal file
Binary file not shown.
237
transformers/tests/modeling_albert_test.py
Normal file
@@ -0,0 +1,237 @@
|
|||||||
|
# coding=utf-8
|
||||||
|
# Copyright 2018 The Google AI Language Team Authors.
|
||||||
|
#
|
||||||
|
# Licensed under the Apache License, Version 2.0 (the "License");
|
||||||
|
# you may not use this file except in compliance with the License.
|
||||||
|
# You may obtain a copy of the License at
|
||||||
|
#
|
||||||
|
# http://www.apache.org/licenses/LICENSE-2.0
|
||||||
|
#
|
||||||
|
# Unless required by applicable law or agreed to in writing, software
|
||||||
|
# distributed under the License is distributed on an "AS IS" BASIS,
|
||||||
|
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
|
||||||
|
# See the License for the specific language governing permissions and
|
||||||
|
# limitations under the License.
|
||||||
|
from __future__ import absolute_import
|
||||||
|
from __future__ import division
|
||||||
|
from __future__ import print_function
|
||||||
|
|
||||||
|
import unittest
|
||||||
|
import shutil
|
||||||
|
import pytest
|
||||||
|
|
||||||
|
from transformers import is_torch_available
|
||||||
|
|
||||||
|
from .modeling_common_test import (CommonTestCases, ids_tensor)
|
||||||
|
from .configuration_common_test import ConfigTester
|
||||||
|
|
||||||
|
if is_torch_available():
|
||||||
|
from transformers import (AlbertConfig, AlbertModel, AlbertForMaskedLM,
|
||||||
|
AlbertForSequenceClassification, AlbertForQuestionAnswering,
|
||||||
|
)
|
||||||
|
from transformers.modeling_albert import ALBERT_PRETRAINED_MODEL_ARCHIVE_MAP
|
||||||
|
else:
|
||||||
|
pytestmark = pytest.mark.skip("Require Torch")
|
||||||
|
|
||||||
|
|
||||||
|
class AlbertModelTest(CommonTestCases.CommonModelTester):
|
||||||
|
|
||||||
|
all_model_classes = (AlbertModel, AlbertForMaskedLM) if is_torch_available() else ()
|
||||||
|
|
||||||
|
class AlbertModelTester(object):
|
||||||
|
|
||||||
|
def __init__(self,
|
||||||
|
parent,
|
||||||
|
batch_size=13,
|
||||||
|
seq_length=7,
|
||||||
|
is_training=True,
|
||||||
|
use_input_mask=True,
|
||||||
|
use_token_type_ids=True,
|
||||||
|
use_labels=True,
|
||||||
|
vocab_size=99,
|
||||||
|
embedding_size=16,
|
||||||
|
hidden_size=36,
|
||||||
|
num_hidden_layers=6,
|
||||||
|
num_hidden_groups=6,
|
||||||
|
num_attention_heads=6,
|
||||||
|
intermediate_size=37,
|
||||||
|
hidden_act="gelu",
|
||||||
|
hidden_dropout_prob=0.1,
|
||||||
|
attention_probs_dropout_prob=0.1,
|
||||||
|
max_position_embeddings=512,
|
||||||
|
type_vocab_size=16,
|
||||||
|
type_sequence_label_size=2,
|
||||||
|
initializer_range=0.02,
|
||||||
|
num_labels=3,
|
||||||
|
num_choices=4,
|
||||||
|
scope=None,
|
||||||
|
):
|
||||||
|
self.parent = parent
|
||||||
|
self.batch_size = batch_size
|
||||||
|
self.seq_length = seq_length
|
||||||
|
self.is_training = is_training
|
||||||
|
self.use_input_mask = use_input_mask
|
||||||
|
self.use_token_type_ids = use_token_type_ids
|
||||||
|
self.use_labels = use_labels
|
||||||
|
self.vocab_size = vocab_size
|
||||||
|
self.embedding_size = embedding_size
|
||||||
|
self.hidden_size = hidden_size
|
||||||
|
self.num_hidden_layers = num_hidden_layers
|
||||||
|
self.num_attention_heads = num_attention_heads
|
||||||
|
self.intermediate_size = intermediate_size
|
||||||
|
self.hidden_act = hidden_act
|
||||||
|
self.hidden_dropout_prob = hidden_dropout_prob
|
||||||
|
self.attention_probs_dropout_prob = attention_probs_dropout_prob
|
||||||
|
self.max_position_embeddings = max_position_embeddings
|
||||||
|
self.type_vocab_size = type_vocab_size
|
||||||
|
self.type_sequence_label_size = type_sequence_label_size
|
||||||
|
self.initializer_range = initializer_range
|
||||||
|
self.num_labels = num_labels
|
||||||
|
self.num_choices = num_choices
|
||||||
|
self.scope = scope
|
||||||
|
self.num_hidden_groups = num_hidden_groups
|
||||||
|
|
||||||
|
def prepare_config_and_inputs(self):
|
||||||
|
input_ids = ids_tensor([self.batch_size, self.seq_length], self.vocab_size)
|
||||||
|
|
||||||
|
input_mask = None
|
||||||
|
if self.use_input_mask:
|
||||||
|
input_mask = ids_tensor([self.batch_size, self.seq_length], vocab_size=2)
|
||||||
|
|
||||||
|
token_type_ids = None
|
||||||
|
if self.use_token_type_ids:
|
||||||
|
token_type_ids = ids_tensor([self.batch_size, self.seq_length], self.type_vocab_size)
|
||||||
|
|
||||||
|
sequence_labels = None
|
||||||
|
token_labels = None
|
||||||
|
choice_labels = None
|
||||||
|
if self.use_labels:
|
||||||
|
sequence_labels = ids_tensor([self.batch_size], self.type_sequence_label_size)
|
||||||
|
token_labels = ids_tensor([self.batch_size, self.seq_length], self.num_labels)
|
||||||
|
choice_labels = ids_tensor([self.batch_size], self.num_choices)
|
||||||
|
|
||||||
|
config = AlbertConfig(
|
||||||
|
vocab_size_or_config_json_file=self.vocab_size,
|
||||||
|
hidden_size=self.hidden_size,
|
||||||
|
num_hidden_layers=self.num_hidden_layers,
|
||||||
|
num_attention_heads=self.num_attention_heads,
|
||||||
|
intermediate_size=self.intermediate_size,
|
||||||
|
hidden_act=self.hidden_act,
|
||||||
|
hidden_dropout_prob=self.hidden_dropout_prob,
|
||||||
|
attention_probs_dropout_prob=self.attention_probs_dropout_prob,
|
||||||
|
max_position_embeddings=self.max_position_embeddings,
|
||||||
|
type_vocab_size=self.type_vocab_size,
|
||||||
|
initializer_range=self.initializer_range,
|
||||||
|
num_hidden_groups=self.num_hidden_groups)
|
||||||
|
|
||||||
|
return config, input_ids, token_type_ids, input_mask, sequence_labels, token_labels, choice_labels
|
||||||
|
|
||||||
|
def check_loss_output(self, result):
|
||||||
|
self.parent.assertListEqual(
|
||||||
|
list(result["loss"].size()),
|
||||||
|
[])
|
||||||
|
|
||||||
|
def create_and_check_albert_model(self, config, input_ids, token_type_ids, input_mask, sequence_labels, token_labels, choice_labels):
|
||||||
|
model = AlbertModel(config=config)
|
||||||
|
model.eval()
|
||||||
|
sequence_output, pooled_output = model(input_ids, attention_mask=input_mask, token_type_ids=token_type_ids)
|
||||||
|
sequence_output, pooled_output = model(input_ids, token_type_ids=token_type_ids)
|
||||||
|
sequence_output, pooled_output = model(input_ids)
|
||||||
|
|
||||||
|
result = {
|
||||||
|
"sequence_output": sequence_output,
|
||||||
|
"pooled_output": pooled_output,
|
||||||
|
}
|
||||||
|
self.parent.assertListEqual(
|
||||||
|
list(result["sequence_output"].size()),
|
||||||
|
[self.batch_size, self.seq_length, self.hidden_size])
|
||||||
|
self.parent.assertListEqual(list(result["pooled_output"].size()), [self.batch_size, self.hidden_size])
|
||||||
|
|
||||||
|
|
||||||
|
def create_and_check_albert_for_masked_lm(self, config, input_ids, token_type_ids, input_mask, sequence_labels, token_labels, choice_labels):
|
||||||
|
model = AlbertForMaskedLM(config=config)
|
||||||
|
model.eval()
|
||||||
|
loss, prediction_scores = model(input_ids, attention_mask=input_mask, token_type_ids=token_type_ids, masked_lm_labels=token_labels)
|
||||||
|
result = {
|
||||||
|
"loss": loss,
|
||||||
|
"prediction_scores": prediction_scores,
|
||||||
|
}
|
||||||
|
self.parent.assertListEqual(
|
||||||
|
list(result["prediction_scores"].size()),
|
||||||
|
[self.batch_size, self.seq_length, self.vocab_size])
|
||||||
|
self.check_loss_output(result)
|
||||||
|
|
||||||
|
def create_and_check_albert_for_question_answering(self, config, input_ids, token_type_ids, input_mask, sequence_labels, token_labels, choice_labels):
|
||||||
|
model = AlbertForQuestionAnswering(config=config)
|
||||||
|
model.eval()
|
||||||
|
loss, start_logits, end_logits = model(input_ids, attention_mask=input_mask, token_type_ids=token_type_ids,
|
||||||
|
start_positions=sequence_labels, end_positions=sequence_labels)
|
||||||
|
result = {
|
||||||
|
"loss": loss,
|
||||||
|
"start_logits": start_logits,
|
||||||
|
"end_logits": end_logits,
|
||||||
|
}
|
||||||
|
self.parent.assertListEqual(
|
||||||
|
list(result["start_logits"].size()),
|
||||||
|
[self.batch_size, self.seq_length])
|
||||||
|
self.parent.assertListEqual(
|
||||||
|
list(result["end_logits"].size()),
|
||||||
|
[self.batch_size, self.seq_length])
|
||||||
|
self.check_loss_output(result)
|
||||||
|
|
||||||
|
|
||||||
|
def create_and_check_albert_for_sequence_classification(self, config, input_ids, token_type_ids, input_mask, sequence_labels, token_labels, choice_labels):
|
||||||
|
config.num_labels = self.num_labels
|
||||||
|
model = AlbertForSequenceClassification(config)
|
||||||
|
model.eval()
|
||||||
|
loss, logits = model(input_ids, attention_mask=input_mask, token_type_ids=token_type_ids, labels=sequence_labels)
|
||||||
|
result = {
|
||||||
|
"loss": loss,
|
||||||
|
"logits": logits,
|
||||||
|
}
|
||||||
|
self.parent.assertListEqual(
|
||||||
|
list(result["logits"].size()),
|
||||||
|
[self.batch_size, self.num_labels])
|
||||||
|
self.check_loss_output(result)
|
||||||
|
|
||||||
|
|
||||||
|
def prepare_config_and_inputs_for_common(self):
|
||||||
|
config_and_inputs = self.prepare_config_and_inputs()
|
||||||
|
(config, input_ids, token_type_ids, input_mask,
|
||||||
|
sequence_labels, token_labels, choice_labels) = config_and_inputs
|
||||||
|
inputs_dict = {'input_ids': input_ids, 'token_type_ids': token_type_ids, 'attention_mask': input_mask}
|
||||||
|
return config, inputs_dict
|
||||||
|
|
||||||
|
def setUp(self):
|
||||||
|
self.model_tester = AlbertModelTest.AlbertModelTester(self)
|
||||||
|
self.config_tester = ConfigTester(self, config_class=AlbertConfig, hidden_size=37)
|
||||||
|
|
||||||
|
def test_config(self):
|
||||||
|
self.config_tester.run_common_tests()
|
||||||
|
|
||||||
|
def test_albert_model(self):
|
||||||
|
config_and_inputs = self.model_tester.prepare_config_and_inputs()
|
||||||
|
self.model_tester.create_and_check_albert_model(*config_and_inputs)
|
||||||
|
|
||||||
|
def test_for_masked_lm(self):
|
||||||
|
config_and_inputs = self.model_tester.prepare_config_and_inputs()
|
||||||
|
self.model_tester.create_and_check_albert_for_masked_lm(*config_and_inputs)
|
||||||
|
|
||||||
|
def test_for_question_answering(self):
|
||||||
|
config_and_inputs = self.model_tester.prepare_config_and_inputs()
|
||||||
|
self.model_tester.create_and_check_albert_for_question_answering(*config_and_inputs)
|
||||||
|
|
||||||
|
def test_for_sequence_classification(self):
|
||||||
|
config_and_inputs = self.model_tester.prepare_config_and_inputs()
|
||||||
|
self.model_tester.create_and_check_albert_for_sequence_classification(*config_and_inputs)
|
||||||
|
|
||||||
|
@pytest.mark.slow
|
||||||
|
def test_model_from_pretrained(self):
|
||||||
|
cache_dir = "/tmp/transformers_test/"
|
||||||
|
for model_name in list(ALBERT_PRETRAINED_MODEL_ARCHIVE_MAP.keys())[:1]:
|
||||||
|
model = AlbertModel.from_pretrained(model_name, cache_dir=cache_dir)
|
||||||
|
shutil.rmtree(cache_dir)
|
||||||
|
self.assertIsNotNone(model)
|
||||||
|
|
||||||
|
if __name__ == "__main__":
|
||||||
|
unittest.main()
|
||||||
231
transformers/tests/modeling_tf_albert_test.py
Normal file
@@ -0,0 +1,231 @@
|
|||||||
|
# coding=utf-8
|
||||||
|
# Copyright 2018 The Google AI Language Team Authors.
|
||||||
|
#
|
||||||
|
# Licensed under the Apache License, Version 2.0 (the "License");
|
||||||
|
# you may not use this file except in compliance with the License.
|
||||||
|
# You may obtain a copy of the License at
|
||||||
|
#
|
||||||
|
# http://www.apache.org/licenses/LICENSE-2.0
|
||||||
|
#
|
||||||
|
# Unless required by applicable law or agreed to in writing, software
|
||||||
|
# distributed under the License is distributed on an "AS IS" BASIS,
|
||||||
|
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
|
||||||
|
# See the License for the specific language governing permissions and
|
||||||
|
# limitations under the License.
|
||||||
|
from __future__ import absolute_import
|
||||||
|
from __future__ import division
|
||||||
|
from __future__ import print_function
|
||||||
|
|
||||||
|
import unittest
|
||||||
|
import shutil
|
||||||
|
import pytest
|
||||||
|
import sys
|
||||||
|
|
||||||
|
from .modeling_tf_common_test import (TFCommonTestCases, ids_tensor)
|
||||||
|
from .configuration_common_test import ConfigTester
|
||||||
|
|
||||||
|
from transformers import AlbertConfig, is_tf_available
|
||||||
|
|
||||||
|
if is_tf_available():
|
||||||
|
import tensorflow as tf
|
||||||
|
from transformers.modeling_tf_albert import (TFAlbertModel, TFAlbertForMaskedLM,
|
||||||
|
TFAlbertForSequenceClassification,
|
||||||
|
TF_ALBERT_PRETRAINED_MODEL_ARCHIVE_MAP)
|
||||||
|
else:
|
||||||
|
pytestmark = pytest.mark.skip("Require TensorFlow")
|
||||||
|
|
||||||
|
|
||||||
|
class TFAlbertModelTest(TFCommonTestCases.TFCommonModelTester):
|
||||||
|
|
||||||
|
all_model_classes = (
|
||||||
|
TFAlbertModel,
|
||||||
|
TFAlbertForMaskedLM,
|
||||||
|
TFAlbertForSequenceClassification
|
||||||
|
) if is_tf_available() else ()
|
||||||
|
|
||||||
|
class TFAlbertModelTester(object):
|
||||||
|
|
||||||
|
def __init__(self,
|
||||||
|
parent,
|
||||||
|
batch_size=13,
|
||||||
|
seq_length=7,
|
||||||
|
is_training=True,
|
||||||
|
use_input_mask=True,
|
||||||
|
use_token_type_ids=True,
|
||||||
|
use_labels=True,
|
||||||
|
vocab_size=99,
|
||||||
|
embedding_size=16,
|
||||||
|
hidden_size=32,
|
||||||
|
num_hidden_layers=5,
|
||||||
|
num_attention_heads=4,
|
||||||
|
intermediate_size=37,
|
||||||
|
hidden_act="gelu",
|
||||||
|
hidden_dropout_prob=0.1,
|
||||||
|
attention_probs_dropout_prob=0.1,
|
||||||
|
max_position_embeddings=512,
|
||||||
|
type_vocab_size=16,
|
||||||
|
type_sequence_label_size=2,
|
||||||
|
initializer_range=0.02,
|
||||||
|
num_labels=3,
|
||||||
|
num_choices=4,
|
||||||
|
scope=None,
|
||||||
|
):
|
||||||
|
self.parent = parent
|
||||||
|
self.batch_size = batch_size
|
||||||
|
self.seq_length = seq_length
|
||||||
|
self.is_training = is_training
|
||||||
|
self.use_input_mask = use_input_mask
|
||||||
|
self.use_token_type_ids = use_token_type_ids
|
||||||
|
self.use_labels = use_labels
|
||||||
|
self.vocab_size = vocab_size
|
||||||
|
self.embedding_size = embedding_size
|
||||||
|
self.hidden_size = hidden_size
|
||||||
|
self.num_hidden_layers = num_hidden_layers
|
||||||
|
self.num_attention_heads = num_attention_heads
|
||||||
|
self.intermediate_size = intermediate_size
|
||||||
|
self.hidden_act = hidden_act
|
||||||
|
self.hidden_dropout_prob = hidden_dropout_prob
|
||||||
|
self.attention_probs_dropout_prob = attention_probs_dropout_prob
|
||||||
|
self.max_position_embeddings = max_position_embeddings
|
||||||
|
self.type_vocab_size = type_vocab_size
|
||||||
|
self.type_sequence_label_size = type_sequence_label_size
|
||||||
|
self.initializer_range = initializer_range
|
||||||
|
self.num_labels = num_labels
|
||||||
|
self.num_choices = num_choices
|
||||||
|
self.scope = scope
|
||||||
|
|
||||||
|
def prepare_config_and_inputs(self):
|
||||||
|
input_ids = ids_tensor(
|
||||||
|
[self.batch_size, self.seq_length], self.vocab_size)
|
||||||
|
|
||||||
|
input_mask = None
|
||||||
|
if self.use_input_mask:
|
||||||
|
input_mask = ids_tensor(
|
||||||
|
[self.batch_size, self.seq_length], vocab_size=2)
|
||||||
|
|
||||||
|
token_type_ids = None
|
||||||
|
if self.use_token_type_ids:
|
||||||
|
token_type_ids = ids_tensor(
|
||||||
|
[self.batch_size, self.seq_length], self.type_vocab_size)
|
||||||
|
|
||||||
|
sequence_labels = None
|
||||||
|
token_labels = None
|
||||||
|
choice_labels = None
|
||||||
|
if self.use_labels:
|
||||||
|
sequence_labels = ids_tensor(
|
||||||
|
[self.batch_size], self.type_sequence_label_size)
|
||||||
|
token_labels = ids_tensor(
|
||||||
|
[self.batch_size, self.seq_length], self.num_labels)
|
||||||
|
choice_labels = ids_tensor([self.batch_size], self.num_choices)
|
||||||
|
|
||||||
|
config = AlbertConfig(
|
||||||
|
vocab_size_or_config_json_file=self.vocab_size,
|
||||||
|
hidden_size=self.hidden_size,
|
||||||
|
num_hidden_layers=self.num_hidden_layers,
|
||||||
|
num_attention_heads=self.num_attention_heads,
|
||||||
|
intermediate_size=self.intermediate_size,
|
||||||
|
hidden_act=self.hidden_act,
|
||||||
|
hidden_dropout_prob=self.hidden_dropout_prob,
|
||||||
|
attention_probs_dropout_prob=self.attention_probs_dropout_prob,
|
||||||
|
max_position_embeddings=self.max_position_embeddings,
|
||||||
|
type_vocab_size=self.type_vocab_size,
|
||||||
|
initializer_range=self.initializer_range)
|
||||||
|
|
||||||
|
return config, input_ids, token_type_ids, input_mask, sequence_labels, token_labels, choice_labels
|
||||||
|
|
||||||
|
def create_and_check_albert_model(self, config, input_ids, token_type_ids, input_mask, sequence_labels, token_labels, choice_labels):
|
||||||
|
model = TFAlbertModel(config=config)
|
||||||
|
# inputs = {'input_ids': input_ids,
|
||||||
|
# 'attention_mask': input_mask,
|
||||||
|
# 'token_type_ids': token_type_ids}
|
||||||
|
# sequence_output, pooled_output = model(**inputs)
|
||||||
|
inputs = {'input_ids': input_ids,
|
||||||
|
'attention_mask': input_mask,
|
||||||
|
'token_type_ids': token_type_ids}
|
||||||
|
sequence_output, pooled_output = model(inputs)
|
||||||
|
|
||||||
|
inputs = [input_ids, input_mask]
|
||||||
|
sequence_output, pooled_output = model(inputs)
|
||||||
|
|
||||||
|
sequence_output, pooled_output = model(input_ids)
|
||||||
|
|
||||||
|
result = {
|
||||||
|
"sequence_output": sequence_output.numpy(),
|
||||||
|
"pooled_output": pooled_output.numpy(),
|
||||||
|
}
|
||||||
|
self.parent.assertListEqual(
|
||||||
|
list(result["sequence_output"].shape),
|
||||||
|
[self.batch_size, self.seq_length, self.hidden_size])
|
||||||
|
self.parent.assertListEqual(list(result["pooled_output"].shape), [
|
||||||
|
self.batch_size, self.hidden_size])
|
||||||
|
|
||||||
|
def create_and_check_albert_for_masked_lm(self, config, input_ids, token_type_ids, input_mask, sequence_labels, token_labels, choice_labels):
|
||||||
|
model = TFAlbertForMaskedLM(config=config)
|
||||||
|
inputs = {'input_ids': input_ids,
|
||||||
|
'attention_mask': input_mask,
|
||||||
|
'token_type_ids': token_type_ids}
|
||||||
|
prediction_scores, = model(inputs)
|
||||||
|
result = {
|
||||||
|
"prediction_scores": prediction_scores.numpy(),
|
||||||
|
}
|
||||||
|
self.parent.assertListEqual(
|
||||||
|
list(result["prediction_scores"].shape),
|
||||||
|
[self.batch_size, self.seq_length, self.vocab_size])
|
||||||
|
|
||||||
|
def create_and_check_albert_for_sequence_classification(self, config, input_ids, token_type_ids, input_mask, sequence_labels, token_labels, choice_labels):
|
||||||
|
config.num_labels = self.num_labels
|
||||||
|
model = TFAlbertForSequenceClassification(config=config)
|
||||||
|
inputs = {'input_ids': input_ids,
|
||||||
|
'attention_mask': input_mask,
|
||||||
|
'token_type_ids': token_type_ids}
|
||||||
|
logits, = model(inputs)
|
||||||
|
result = {
|
||||||
|
"logits": logits.numpy(),
|
||||||
|
}
|
||||||
|
self.parent.assertListEqual(
|
||||||
|
list(result["logits"].shape),
|
||||||
|
[self.batch_size, self.num_labels])
|
||||||
|
|
||||||
|
def prepare_config_and_inputs_for_common(self):
|
||||||
|
config_and_inputs = self.prepare_config_and_inputs()
|
||||||
|
(config, input_ids, token_type_ids, input_mask,
|
||||||
|
sequence_labels, token_labels, choice_labels) = config_and_inputs
|
||||||
|
inputs_dict = {'input_ids': input_ids,
|
||||||
|
'token_type_ids': token_type_ids, 'attention_mask': input_mask}
|
||||||
|
return config, inputs_dict
|
||||||
|
|
||||||
|
def setUp(self):
|
||||||
|
self.model_tester = TFAlbertModelTest.TFAlbertModelTester(self)
|
||||||
|
self.config_tester = ConfigTester(
|
||||||
|
self, config_class=AlbertConfig, hidden_size=37)
|
||||||
|
|
||||||
|
def test_config(self):
|
||||||
|
self.config_tester.run_common_tests()
|
||||||
|
|
||||||
|
def test_albert_model(self):
|
||||||
|
config_and_inputs = self.model_tester.prepare_config_and_inputs()
|
||||||
|
self.model_tester.create_and_check_albert_model(*config_and_inputs)
|
||||||
|
|
||||||
|
def test_for_masked_lm(self):
|
||||||
|
config_and_inputs = self.model_tester.prepare_config_and_inputs()
|
||||||
|
self.model_tester.create_and_check_albert_for_masked_lm(
|
||||||
|
*config_and_inputs)
|
||||||
|
|
||||||
|
def test_for_sequence_classification(self):
|
||||||
|
config_and_inputs = self.model_tester.prepare_config_and_inputs()
|
||||||
|
self.model_tester.create_and_check_albert_for_sequence_classification(
|
||||||
|
*config_and_inputs)
|
||||||
|
|
||||||
|
@pytest.mark.slow
|
||||||
|
def test_model_from_pretrained(self):
|
||||||
|
cache_dir = "/tmp/transformers_test/"
|
||||||
|
# for model_name in list(TF_ALBERT_PRETRAINED_MODEL_ARCHIVE_MAP.keys())[:1]:
|
||||||
|
for model_name in ['albert-base-uncased']:
|
||||||
|
model = TFAlbertModel.from_pretrained(
|
||||||
|
model_name, cache_dir=cache_dir)
|
||||||
|
shutil.rmtree(cache_dir)
|
||||||
|
self.assertIsNotNone(model)
|
||||||
|
|
||||||
|
|
||||||
|
if __name__ == "__main__":
|
||||||
|
unittest.main()
|
||||||
@@ -426,9 +426,17 @@ class TFCommonTestCases:
|
|||||||
try:
|
try:
|
||||||
x = wte([input_ids], mode="embedding")
|
x = wte([input_ids], mode="embedding")
|
||||||
except:
|
except:
|
||||||
x = tf.ones(input_ids.shape + [self.model_tester.hidden_size], dtype=tf.dtypes.float32)
|
try:
|
||||||
|
x = wte([input_ids, None, None, None], mode="embedding")
|
||||||
|
except:
|
||||||
|
if hasattr(self.model_tester, "embedding_size"):
|
||||||
|
x = tf.ones(input_ids.shape + [self.model_tester.embedding_size], dtype=tf.dtypes.float32)
|
||||||
|
else:
|
||||||
|
x = tf.ones(input_ids.shape + [self.model_tester.hidden_size], dtype=tf.dtypes.float32)
|
||||||
# ^^ In our TF models, the input_embeddings can take slightly different forms,
|
# ^^ In our TF models, the input_embeddings can take slightly different forms,
|
||||||
# so we try two of them and fall back to just synthetically creating a dummy tensor of ones.
|
# so we try a few of them.
|
||||||
|
# We used to fall back to just synthetically creating a dummy tensor of ones:
|
||||||
|
#
|
||||||
inputs_dict["inputs_embeds"] = x
|
inputs_dict["inputs_embeds"] = x
|
||||||
outputs = model(inputs_dict)
|
outputs = model(inputs_dict)
|
||||||
|
|
||||||
|
|||||||
78
transformers/tests/tokenization_albert_test.py
Normal file
@@ -0,0 +1,78 @@
|
|||||||
|
# coding=utf-8
|
||||||
|
# Copyright 2019 Hugging Face inc.
|
||||||
|
#
|
||||||
|
# Licensed under the Apache License, Version 2.0 (the "License");
|
||||||
|
# you may not use this file except in compliance with the License.
|
||||||
|
# You may obtain a copy of the License at
|
||||||
|
#
|
||||||
|
# http://www.apache.org/licenses/LICENSE-2.0
|
||||||
|
#
|
||||||
|
# Unless required by applicable law or agreed to in writing, software
|
||||||
|
# distributed under the License is distributed on an "AS IS" BASIS,
|
||||||
|
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
|
||||||
|
# See the License for the specific language governing permissions and
|
||||||
|
# limitations under the License.
|
||||||
|
from __future__ import absolute_import, division, print_function, unicode_literals
|
||||||
|
|
||||||
|
import os
|
||||||
|
import unittest
|
||||||
|
|
||||||
|
from transformers.tokenization_albert import (AlbertTokenizer, SPIECE_UNDERLINE)
|
||||||
|
|
||||||
|
from .tokenization_tests_commons import CommonTestCases
|
||||||
|
|
||||||
|
SAMPLE_VOCAB = os.path.join(os.path.dirname(os.path.abspath(__file__)),
|
||||||
|
'fixtures/spiece.model')
|
||||||
|
|
||||||
|
class AlbertTokenizationTest(CommonTestCases.CommonTokenizerTester):
|
||||||
|
|
||||||
|
tokenizer_class = AlbertTokenizer
|
||||||
|
|
||||||
|
def setUp(self):
|
||||||
|
super(AlbertTokenizationTest, self).setUp()
|
||||||
|
|
||||||
|
# We have a SentencePiece fixture for testing
|
||||||
|
tokenizer = AlbertTokenizer(SAMPLE_VOCAB)
|
||||||
|
tokenizer.save_pretrained(self.tmpdirname)
|
||||||
|
|
||||||
|
def get_tokenizer(self, **kwargs):
|
||||||
|
return AlbertTokenizer.from_pretrained(self.tmpdirname, **kwargs)
|
||||||
|
|
||||||
|
def get_input_output_texts(self):
|
||||||
|
input_text = u"this is a test"
|
||||||
|
output_text = u"this is a test"
|
||||||
|
return input_text, output_text
|
||||||
|
|
||||||
|
|
||||||
|
def test_full_tokenizer(self):
|
||||||
|
tokenizer = AlbertTokenizer(SAMPLE_VOCAB, keep_accents=True)
|
||||||
|
|
||||||
|
tokens = tokenizer.tokenize(u'This is a test')
|
||||||
|
self.assertListEqual(tokens, [u'▁this', u'▁is', u'▁a', u'▁test'])
|
||||||
|
|
||||||
|
self.assertListEqual(
|
||||||
|
tokenizer.convert_tokens_to_ids(tokens), [48, 25, 21, 1289])
|
||||||
|
|
||||||
|
tokens = tokenizer.tokenize(u"I was born in 92000, and this is falsé.")
|
||||||
|
self.assertListEqual(tokens, [u'▁i', u'▁was', u'▁born', u'▁in', u'▁9', u'2000', u',', u'▁and', u'▁this', u'▁is', u'▁fal', u's', u'é', u'.'])
|
||||||
|
ids = tokenizer.convert_tokens_to_ids(tokens)
|
||||||
|
self.assertListEqual(ids, [31, 23, 386, 19, 561, 3050, 15, 17, 48, 25, 8256, 18, 1, 9])
|
||||||
|
|
||||||
|
back_tokens = tokenizer.convert_ids_to_tokens(ids)
|
||||||
|
self.assertListEqual(back_tokens, ['▁i', '▁was', '▁born', '▁in', '▁9', '2000', ',', '▁and', '▁this', '▁is', '▁fal', 's', '<unk>', '.'])
|
||||||
|
|
||||||
|
def test_sequence_builders(self):
|
||||||
|
tokenizer = AlbertTokenizer(SAMPLE_VOCAB)
|
||||||
|
|
||||||
|
text = tokenizer.encode("sequence builders")
|
||||||
|
text_2 = tokenizer.encode("multi-sequence build")
|
||||||
|
|
||||||
|
encoded_sentence = tokenizer.build_inputs_with_special_tokens(text)
|
||||||
|
encoded_pair = tokenizer.build_inputs_with_special_tokens(text, text_2)
|
||||||
|
|
||||||
|
assert encoded_sentence == [tokenizer.cls_token_id] + text + [tokenizer.sep_token_id]
|
||||||
|
assert encoded_pair == [tokenizer.cls_token_id] + text + [tokenizer.sep_token_id] + text_2 + [tokenizer.sep_token_id]
|
||||||
|
|
||||||
|
|
||||||
|
if __name__ == '__main__':
|
||||||
|
unittest.main()
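Note on the accent handling exercised above: test_full_tokenizer builds the tokenizer with keep_accents=True, so the "é" in "falsé" survives preprocessing and is then mapped to <unk>. With the default keep_accents=False, preprocess_text in tokenization_albert.py decomposes the text with NFKD and drops the combining marks. A minimal standalone sketch of that step follows; the helper name strip_accents is made up for illustration and is not part of the library.

```python
import unicodedata


def strip_accents(text):
    """Illustrative helper (hypothetical name): mirrors the keep_accents=False branch."""
    # NFKD splits an accented character into its base letter plus combining marks.
    decomposed = unicodedata.normalize("NFKD", text)
    # Keep only the characters that are not combining marks.
    return "".join(c for c in decomposed if not unicodedata.combining(c))


print(strip_accents("falsé"))  # false
```

Under this assumption the accented input would reach SentencePiece as plain "false", which is why the test has to opt in to keep_accents=True to exercise the <unk> path.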
|
||||||
@@ -110,6 +110,36 @@ class CommonTestCases:
|
|||||||
|
|
||||||
self.assertListEqual(subwords, subwords_loaded)
|
self.assertListEqual(subwords, subwords_loaded)
|
||||||
|
|
||||||
|
def test_added_tokens_do_lower_case(self):
|
||||||
|
tokenizer = self.get_tokenizer(do_lower_case=True)
|
||||||
|
|
||||||
|
text = "aaaaa bbbbbb low cccccccccdddddddd l"
|
||||||
|
text2 = "AAAAA BBBBBB low CCCCCCCCCDDDDDDDD l"
|
||||||
|
|
||||||
|
toks0 = tokenizer.tokenize(text) # toks before adding new_toks
|
||||||
|
|
||||||
|
new_toks = ["aaaaa bbbbbb", "cccccccccdddddddd", 'AAAAA BBBBBB', 'CCCCCCCCCDDDDDDDD']
|
||||||
|
added = tokenizer.add_tokens(new_toks)
|
||||||
|
self.assertEqual(added, 2)
|
||||||
|
|
||||||
|
toks = tokenizer.tokenize(text)
|
||||||
|
toks2 = tokenizer.tokenize(text2)
|
||||||
|
|
||||||
|
self.assertEqual(len(toks), len(toks2))
|
||||||
|
self.assertNotEqual(len(toks), len(toks0)) # toks0 should be longer
|
||||||
|
self.assertListEqual(toks, toks2)
|
||||||
|
|
||||||
|
tokenizer = self.get_tokenizer(do_lower_case=False)
|
||||||
|
|
||||||
|
added = tokenizer.add_tokens(new_toks)
|
||||||
|
self.assertEqual(added, 4)
|
||||||
|
|
||||||
|
toks = tokenizer.tokenize(text)
|
||||||
|
toks2 = tokenizer.tokenize(text2)
|
||||||
|
|
||||||
|
self.assertEqual(len(toks), len(toks2)) # Length should still be the same
|
||||||
|
self.assertNotEqual(len(toks), len(toks0))
|
||||||
|
self.assertNotEqual(toks[0], toks2[0]) # But at least the first tokens should differ
|
||||||
|
|
||||||
def test_add_tokens_tokenizer(self):
|
def test_add_tokens_tokenizer(self):
|
||||||
tokenizer = self.get_tokenizer()
|
tokenizer = self.get_tokenizer()
|
||||||
@@ -243,7 +273,11 @@ class CommonTestCases:
|
|||||||
sequence = tokenizer.encode(seq_0, add_special_tokens=False)
|
sequence = tokenizer.encode(seq_0, add_special_tokens=False)
|
||||||
num_added_tokens = tokenizer.num_added_tokens()
|
num_added_tokens = tokenizer.num_added_tokens()
|
||||||
total_length = len(sequence) + num_added_tokens
|
total_length = len(sequence) + num_added_tokens
|
||||||
information = tokenizer.encode_plus(seq_0, max_length=total_length - 2, add_special_tokens=True, stride=stride)
|
information = tokenizer.encode_plus(seq_0,
|
||||||
|
max_length=total_length - 2,
|
||||||
|
add_special_tokens=True,
|
||||||
|
stride=stride,
|
||||||
|
return_overflowing_tokens=True)
|
||||||
|
|
||||||
truncated_sequence = information["input_ids"]
|
truncated_sequence = information["input_ids"]
|
||||||
overflowing_tokens = information["overflowing_tokens"]
|
overflowing_tokens = information["overflowing_tokens"]
|
||||||
@@ -270,10 +304,12 @@ class CommonTestCases:
|
|||||||
)
|
)
|
||||||
|
|
||||||
information = tokenizer.encode_plus(seq_0, seq_1, max_length=len(sequence) - 2, add_special_tokens=True,
|
information = tokenizer.encode_plus(seq_0, seq_1, max_length=len(sequence) - 2, add_special_tokens=True,
|
||||||
stride=stride, truncation_strategy='only_second')
|
stride=stride, truncation_strategy='only_second',
|
||||||
|
return_overflowing_tokens=True)
|
||||||
information_first_truncated = tokenizer.encode_plus(seq_0, seq_1, max_length=len(sequence) - 2,
|
information_first_truncated = tokenizer.encode_plus(seq_0, seq_1, max_length=len(sequence) - 2,
|
||||||
add_special_tokens=True, stride=stride,
|
add_special_tokens=True, stride=stride,
|
||||||
truncation_strategy='only_first')
|
truncation_strategy='only_first',
|
||||||
|
return_overflowing_tokens=True)
|
||||||
|
|
||||||
truncated_sequence = information["input_ids"]
|
truncated_sequence = information["input_ids"]
|
||||||
overflowing_tokens = information["overflowing_tokens"]
|
overflowing_tokens = information["overflowing_tokens"]
|
||||||
@@ -305,7 +341,7 @@ class CommonTestCases:
|
|||||||
|
|
||||||
# Testing single inputs
|
# Testing single inputs
|
||||||
encoded_sequence = tokenizer.encode(sequence_0, add_special_tokens=False)
|
encoded_sequence = tokenizer.encode(sequence_0, add_special_tokens=False)
|
||||||
encoded_sequence_dict = tokenizer.encode_plus(sequence_0, add_special_tokens=True)
|
encoded_sequence_dict = tokenizer.encode_plus(sequence_0, add_special_tokens=True, return_special_tokens_mask=True)
|
||||||
encoded_sequence_w_special = encoded_sequence_dict["input_ids"]
|
encoded_sequence_w_special = encoded_sequence_dict["input_ids"]
|
||||||
special_tokens_mask = encoded_sequence_dict["special_tokens_mask"]
|
special_tokens_mask = encoded_sequence_dict["special_tokens_mask"]
|
||||||
self.assertEqual(len(special_tokens_mask), len(encoded_sequence_w_special))
|
self.assertEqual(len(special_tokens_mask), len(encoded_sequence_w_special))
|
||||||
@@ -317,7 +353,8 @@ class CommonTestCases:
|
|||||||
# Testing inputs pairs
|
# Testing inputs pairs
|
||||||
encoded_sequence = tokenizer.encode(sequence_0, add_special_tokens=False) + tokenizer.encode(sequence_1,
|
encoded_sequence = tokenizer.encode(sequence_0, add_special_tokens=False) + tokenizer.encode(sequence_1,
|
||||||
add_special_tokens=False)
|
add_special_tokens=False)
|
||||||
encoded_sequence_dict = tokenizer.encode_plus(sequence_0, sequence_1, add_special_tokens=True)
|
encoded_sequence_dict = tokenizer.encode_plus(sequence_0, sequence_1, add_special_tokens=True,
|
||||||
|
return_special_tokens_mask=True)
|
||||||
encoded_sequence_w_special = encoded_sequence_dict["input_ids"]
|
encoded_sequence_w_special = encoded_sequence_dict["input_ids"]
|
||||||
special_tokens_mask = encoded_sequence_dict["special_tokens_mask"]
|
special_tokens_mask = encoded_sequence_dict["special_tokens_mask"]
|
||||||
self.assertEqual(len(special_tokens_mask), len(encoded_sequence_w_special))
|
self.assertEqual(len(special_tokens_mask), len(encoded_sequence_w_special))
|
||||||
@@ -329,7 +366,9 @@ class CommonTestCases:
|
|||||||
# Testing with already existing special tokens
|
# Testing with already existing special tokens
|
||||||
if tokenizer.cls_token_id == tokenizer.unk_token_id and tokenizer.cls_token_id == tokenizer.unk_token_id:
|
if tokenizer.cls_token_id == tokenizer.unk_token_id and tokenizer.cls_token_id == tokenizer.unk_token_id:
|
||||||
tokenizer.add_special_tokens({'cls_token': '</s>', 'sep_token': '<s>'})
|
tokenizer.add_special_tokens({'cls_token': '</s>', 'sep_token': '<s>'})
|
||||||
encoded_sequence_dict = tokenizer.encode_plus(sequence_0, add_special_tokens=True)
|
encoded_sequence_dict = tokenizer.encode_plus(sequence_0,
|
||||||
|
add_special_tokens=True,
|
||||||
|
return_special_tokens_mask=True)
|
||||||
encoded_sequence_w_special = encoded_sequence_dict["input_ids"]
|
encoded_sequence_w_special = encoded_sequence_dict["input_ids"]
|
||||||
special_tokens_mask_orig = encoded_sequence_dict["special_tokens_mask"]
|
special_tokens_mask_orig = encoded_sequence_dict["special_tokens_mask"]
|
||||||
special_tokens_mask = tokenizer.get_special_tokens_mask(encoded_sequence_w_special, already_has_special_tokens=True)
|
special_tokens_mask = tokenizer.get_special_tokens_mask(encoded_sequence_w_special, already_has_special_tokens=True)
|
||||||
|
|||||||
252
transformers/tokenization_albert.py
Normal file
@@ -0,0 +1,252 @@
|
|||||||
|
# coding=utf-8
|
||||||
|
# Copyright 2018 Google AI, Google Brain and the HuggingFace Inc. team.
|
||||||
|
#
|
||||||
|
# Licensed under the Apache License, Version 2.0 (the "License");
|
||||||
|
# you may not use this file except in compliance with the License.
|
||||||
|
# You may obtain a copy of the License at
|
||||||
|
#
|
||||||
|
# http://www.apache.org/licenses/LICENSE-2.0
|
||||||
|
#
|
||||||
|
# Unless required by applicable law or agreed to in writing, software
|
||||||
|
# distributed under the License is distributed on an "AS IS" BASIS,
|
||||||
|
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
|
||||||
|
# See the License for the specific language governing permissions and
|
||||||
|
# limitations under the License.
|
||||||
|
""" Tokenization classes for ALBERT model."""
|
||||||
|
from __future__ import (absolute_import, division, print_function,
|
||||||
|
unicode_literals)
|
||||||
|
|
||||||
|
from .tokenization_utils import PreTrainedTokenizer
|
||||||
|
import logging
|
||||||
|
import unicodedata
|
||||||
|
import six
|
||||||
|
import os
|
||||||
|
from shutil import copyfile
|
||||||
|
|
||||||
|
logger = logging.getLogger(__name__)
|
||||||
|
VOCAB_FILES_NAMES = {'vocab_file': 'spiece.model'}
|
||||||
|
|
||||||
|
PRETRAINED_VOCAB_FILES_MAP = {
|
||||||
|
'vocab_file':
|
||||||
|
{
|
||||||
|
'albert-base-v1': "https://s3.amazonaws.com/models.huggingface.co/bert/albert-base-spiece.model",
|
||||||
|
'albert-large-v1': "https://s3.amazonaws.com/models.huggingface.co/bert/albert-large-spiece.model",
|
||||||
|
'albert-xlarge-v1': "https://s3.amazonaws.com/models.huggingface.co/bert/albert-xlarge-spiece.model",
|
||||||
|
'albert-xxlarge-v1': "https://s3.amazonaws.com/models.huggingface.co/bert/albert-xxlarge-spiece.model",
|
||||||
|
'albert-base-v2': "https://s3.amazonaws.com/models.huggingface.co/bert/albert-base-v2-spiece.model",
|
||||||
|
'albert-large-v2': "https://s3.amazonaws.com/models.huggingface.co/bert/albert-large-v2-spiece.model",
|
||||||
|
'albert-xlarge-v2': "https://s3.amazonaws.com/models.huggingface.co/bert/albert-xlarge-v2-spiece.model",
|
||||||
|
'albert-xxlarge-v2': "https://s3.amazonaws.com/models.huggingface.co/bert/albert-xxlarge-v2-spiece.model",
|
||||||
|
}
|
||||||
|
}
|
||||||
|
|
||||||
|
PRETRAINED_POSITIONAL_EMBEDDINGS_SIZES = {
|
||||||
|
'albert-base-v1': 512,
|
||||||
|
'albert-large-v1': 512,
|
||||||
|
'albert-xlarge-v1': 512,
|
||||||
|
'albert-xxlarge-v1': 512,
|
||||||
|
'albert-base-v2': 512,
|
||||||
|
'albert-large-v2': 512,
|
||||||
|
'albert-xlarge-v2': 512,
|
||||||
|
'albert-xxlarge-v2': 512,
|
||||||
|
}
|
||||||
|
|
||||||
|
SPIECE_UNDERLINE = u'▁'
|
||||||
|
|
||||||
|
class AlbertTokenizer(PreTrainedTokenizer):
|
||||||
|
"""
|
||||||
|
SentencePiece based tokenizer. Peculiarities:
|
||||||
|
|
||||||
|
- requires `SentencePiece <https://github.com/google/sentencepiece>`_
|
||||||
|
"""
|
||||||
|
vocab_files_names = VOCAB_FILES_NAMES
|
||||||
|
pretrained_vocab_files_map = PRETRAINED_VOCAB_FILES_MAP
|
||||||
|
max_model_input_sizes = PRETRAINED_POSITIONAL_EMBEDDINGS_SIZES
|
||||||
|
|
||||||
|
def __init__(self, vocab_file,
|
||||||
|
do_lower_case=True, remove_space=True, keep_accents=False,
|
||||||
|
bos_token="[CLS]", eos_token="[SEP]", unk_token="<unk>", sep_token="[SEP]",
|
||||||
|
pad_token="<pad>", cls_token="[CLS]", mask_token="[MASK]", **kwargs):
|
||||||
|
super(AlbertTokenizer, self).__init__(bos_token=bos_token, eos_token=eos_token,
|
||||||
|
unk_token=unk_token, sep_token=sep_token,
|
||||||
|
pad_token=pad_token, cls_token=cls_token,
|
||||||
|
mask_token=mask_token, **kwargs)
|
||||||
|
|
||||||
|
self.max_len_single_sentence = self.max_len - 2 # take into account special tokens
|
||||||
|
self.max_len_sentences_pair = self.max_len - 3 # take into account special tokens
|
||||||
|
|
||||||
|
try:
|
||||||
|
import sentencepiece as spm
|
||||||
|
except ImportError:
|
||||||
|
logger.warning("You need to install SentencePiece to use AlbertTokenizer: https://github.com/google/sentencepiece "
|
||||||
|
"pip install sentencepiece")
|
||||||
|
|
||||||
|
self.do_lower_case = do_lower_case
|
||||||
|
self.remove_space = remove_space
|
||||||
|
self.keep_accents = keep_accents
|
||||||
|
self.vocab_file = vocab_file
|
||||||
|
|
||||||
|
self.sp_model = spm.SentencePieceProcessor()
|
||||||
|
self.sp_model.Load(vocab_file)
|
||||||
|
|
||||||
|
@property
|
||||||
|
def vocab_size(self):
|
||||||
|
return len(self.sp_model)
|
||||||
|
|
||||||
|
def __getstate__(self):
|
||||||
|
state = self.__dict__.copy()
|
||||||
|
state["sp_model"] = None
|
||||||
|
return state
|
||||||
|
|
||||||
|
def __setstate__(self, d):
|
||||||
|
self.__dict__ = d
|
||||||
|
try:
|
||||||
|
import sentencepiece as spm
|
||||||
|
except ImportError:
|
||||||
|
logger.warning("You need to install SentencePiece to use AlbertTokenizer: https://github.com/google/sentencepiece "
|
||||||
|
"pip install sentencepiece")
|
||||||
|
self.sp_model = spm.SentencePieceProcessor()
|
||||||
|
self.sp_model.Load(self.vocab_file)
|
||||||
|
|
||||||
|
def preprocess_text(self, inputs):
|
||||||
|
if self.remove_space:
|
||||||
|
outputs = ' '.join(inputs.strip().split())
|
||||||
|
else:
|
||||||
|
outputs = inputs
|
||||||
|
outputs = outputs.replace("``", '"').replace("''", '"')
|
||||||
|
|
||||||
|
if six.PY2 and isinstance(outputs, str):
|
||||||
|
outputs = outputs.decode('utf-8')
|
||||||
|
|
||||||
|
if not self.keep_accents:
|
||||||
|
outputs = unicodedata.normalize('NFKD', outputs)
|
||||||
|
outputs = ''.join([c for c in outputs if not unicodedata.combining(c)])
|
||||||
|
if self.do_lower_case:
|
||||||
|
outputs = outputs.lower()
|
||||||
|
|
||||||
|
return outputs
|
||||||
|
|
||||||
|
def _tokenize(self, text, return_unicode=True, sample=False):
|
||||||
|
""" Tokenize a string.
|
||||||
|
return_unicode is used only for py2
|
||||||
|
"""
|
||||||
|
text = self.preprocess_text(text)
|
||||||
|
# note(zhiliny): in some systems, sentencepiece only accepts str for py2
|
||||||
|
if six.PY2 and isinstance(text, unicode):
|
||||||
|
text = text.encode('utf-8')
|
||||||
|
|
||||||
|
if not sample:
|
||||||
|
pieces = self.sp_model.EncodeAsPieces(text)
|
||||||
|
else:
|
||||||
|
pieces = self.sp_model.SampleEncodeAsPieces(text, 64, 0.1)
|
||||||
|
new_pieces = []
|
||||||
|
for piece in pieces:
|
||||||
|
if len(piece) > 1 and piece[-1] == ',' and piece[-2].isdigit():
|
||||||
|
cur_pieces = self.sp_model.EncodeAsPieces(
|
||||||
|
piece[:-1].replace(SPIECE_UNDERLINE, ''))
|
||||||
|
if piece[0] != SPIECE_UNDERLINE and cur_pieces[0][0] == SPIECE_UNDERLINE:
|
||||||
|
if len(cur_pieces[0]) == 1:
|
||||||
|
cur_pieces = cur_pieces[1:]
|
||||||
|
else:
|
||||||
|
cur_pieces[0] = cur_pieces[0][1:]
|
||||||
|
cur_pieces.append(piece[-1])
|
||||||
|
new_pieces.extend(cur_pieces)
|
||||||
|
else:
|
||||||
|
new_pieces.append(piece)
|
||||||
|
|
||||||
|
# note(zhiliny): convert back to unicode for py2
|
||||||
|
if six.PY2 and return_unicode:
|
||||||
|
ret_pieces = []
|
||||||
|
for piece in new_pieces:
|
||||||
|
if isinstance(piece, str):
|
||||||
|
piece = piece.decode('utf-8')
|
||||||
|
ret_pieces.append(piece)
|
||||||
|
new_pieces = ret_pieces
|
||||||
|
|
||||||
|
return new_pieces
|
||||||
|
|
||||||
|
def _convert_token_to_id(self, token):
|
||||||
|
""" Converts a token (str/unicode) to an id using the vocab. """
|
||||||
|
return self.sp_model.PieceToId(token)
|
||||||
|
|
||||||
|
def _convert_id_to_token(self, index, return_unicode=True):
|
||||||
|
"""Converts an index (integer) to a token (string/unicode) using the vocab."""
|
||||||
|
token = self.sp_model.IdToPiece(index)
|
||||||
|
if six.PY2 and return_unicode and isinstance(token, str):
|
||||||
|
token = token.decode('utf-8')
|
||||||
|
return token
|
||||||
|
|
||||||
|
def convert_tokens_to_string(self, tokens):
|
||||||
|
"""Converts a sequence of tokens (strings for sub-words) into a single string."""
|
||||||
|
out_string = ''.join(tokens).replace(SPIECE_UNDERLINE, ' ').strip()
|
||||||
|
return out_string
|
||||||
|
|
||||||
|
def build_inputs_with_special_tokens(self, token_ids_0, token_ids_1=None):
|
||||||
|
"""
|
||||||
|
Build model inputs from a sequence or a pair of sequences for sequence classification tasks
|
||||||
|
by concatenating and adding special tokens.
|
||||||
|
An ALBERT sequence has the following format:
|
||||||
|
single sequence: [CLS] X [SEP]
|
||||||
|
pair of sequences: [CLS] A [SEP] B [SEP]
|
||||||
|
"""
|
||||||
|
sep = [self.sep_token_id]
|
||||||
|
cls = [self.cls_token_id]
|
||||||
|
if token_ids_1 is None:
|
||||||
|
return cls + token_ids_0 + sep
|
||||||
|
return cls + token_ids_0 + sep + token_ids_1 + sep
|
||||||
|
|
||||||
|
def get_special_tokens_mask(self, token_ids_0, token_ids_1=None, already_has_special_tokens=False):
|
||||||
|
"""
|
||||||
|
Retrieves sequence ids from a token list that has no special tokens added. This method is called when adding
|
||||||
|
special tokens using the tokenizer ``prepare_for_model`` or ``encode_plus`` methods.
|
||||||
|
|
||||||
|
Args:
|
||||||
|
token_ids_0: list of ids (must not contain special tokens)
|
||||||
|
token_ids_1: Optional list of ids (must not contain special tokens), necessary when fetching sequence ids
|
||||||
|
for sequence pairs
|
||||||
|
already_has_special_tokens: (default False) Set to True if the token list is already formatted with
|
||||||
|
special tokens for the model
|
||||||
|
|
||||||
|
Returns:
|
||||||
|
A list of integers in the range [0, 1]: 1 for a special token, 0 for a sequence token.
|
||||||
|
"""
|
||||||
|
|
||||||
|
if already_has_special_tokens:
|
||||||
|
if token_ids_1 is not None:
|
||||||
|
raise ValueError("You should not supply a second sequence if the provided sequence of "
|
||||||
|
"ids is already formatted with special tokens for the model.")
|
||||||
|
return list(map(lambda x: 1 if x in [self.sep_token_id, self.cls_token_id] else 0, token_ids_0))
|
||||||
|
|
||||||
|
if token_ids_1 is not None:
|
||||||
|
return [1] + ([0] * len(token_ids_0)) + [1] + ([0] * len(token_ids_1)) + [1]
|
||||||
|
return [1] + ([0] * len(token_ids_0)) + [1]
|
||||||
|
|
||||||
|
def create_token_type_ids_from_sequences(self, token_ids_0, token_ids_1=None):
|
||||||
|
"""
|
||||||
|
Creates a mask from the two sequences passed to be used in a sequence-pair classification task.
|
||||||
|
An ALBERT sequence pair mask has the following format:
|
||||||
|
0 0 0 0 0 0 0 0 0 0 0 1 1 1 1 1 1 1 1 1 1
|
||||||
|
| first sequence | second sequence
|
||||||
|
|
||||||
|
if token_ids_1 is None, only returns the first portion of the mask (0's).
|
||||||
|
"""
|
||||||
|
sep = [self.sep_token_id]
|
||||||
|
cls = [self.cls_token_id]
|
||||||
|
|
||||||
|
if token_ids_1 is None:
|
||||||
|
return len(cls + token_ids_0 + sep) * [0]
|
||||||
|
return len(cls + token_ids_0 + sep) * [0] + len(token_ids_1 + sep) * [1]
|
||||||
|
|
||||||
|
def save_vocabulary(self, save_directory):
|
||||||
|
""" Save the sentencepiece vocabulary (copy original file) and special tokens file
|
||||||
|
to a directory.
|
||||||
|
"""
|
||||||
|
if not os.path.isdir(save_directory):
|
||||||
|
logger.error("Vocabulary path ({}) should be a directory".format(save_directory))
|
||||||
|
return
|
||||||
|
out_vocab_file = os.path.join(save_directory, VOCAB_FILES_NAMES['vocab_file'])
|
||||||
|
|
||||||
|
if os.path.abspath(self.vocab_file) != os.path.abspath(out_vocab_file):
|
||||||
|
copyfile(self.vocab_file, out_vocab_file)
|
||||||
|
|
||||||
|
return (out_vocab_file,)
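Note on the special-token layout: the three helpers in this file (build_inputs_with_special_tokens, get_special_tokens_mask, create_token_type_ids_from_sequences) all describe the same [CLS] A [SEP] B [SEP] layout. The sketch below restates them as standalone functions so the three outputs can be compared side by side. It is only an illustration: CLS_ID and SEP_ID are made-up ids rather than values read from spiece.model, and the function names are hypothetical.

```python
# Made-up ids for illustration only; the real ones come from the SentencePiece model.
CLS_ID, SEP_ID = 2, 3


def build_inputs(ids_0, ids_1=None):
    """[CLS] A [SEP] for one sequence, [CLS] A [SEP] B [SEP] for a pair."""
    out = [CLS_ID] + ids_0 + [SEP_ID]
    if ids_1 is not None:
        out += ids_1 + [SEP_ID]
    return out


def special_tokens_mask(ids_0, ids_1=None):
    """1 marks a special token, 0 marks a sequence token (as the code above returns)."""
    mask = [1] + [0] * len(ids_0) + [1]
    if ids_1 is not None:
        mask += [0] * len(ids_1) + [1]
    return mask


def token_type_ids(ids_0, ids_1=None):
    """0 for [CLS] A [SEP], 1 for B [SEP]."""
    types = [0] * (len(ids_0) + 2)
    if ids_1 is not None:
        types += [1] * (len(ids_1) + 1)
    return types


a, b = [10, 11], [20]
print(build_inputs(a, b))         # [2, 10, 11, 3, 20, 3]
print(special_tokens_mask(a, b))  # [1, 0, 0, 1, 0, 1]
print(token_type_ids(a, b))       # [0, 0, 0, 0, 1, 1]
```

All three lists have the same length, which is the property the common tokenizer tests check when they compare special_tokens_mask against input_ids.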
|
||||||
@@ -90,6 +90,9 @@ class AutoTokenizer(object):
|
|||||||
force_download: (`optional`) boolean, default False:
|
force_download: (`optional`) boolean, default False:
|
||||||
Force to (re-)download the vocabulary files and override the cached versions if they exist.
|
Force to (re-)download the vocabulary files and override the cached versions if they exist.
|
||||||
|
|
||||||
|
resume_download: (`optional`) boolean, default False:
|
||||||
|
Do not delete an incompletely received file. Attempt to resume the download if such a file exists.
|
||||||
|
|
||||||
proxies: (`optional`) dict, default None:
|
proxies: (`optional`) dict, default None:
|
||||||
A dictionary of proxy servers to use by protocol or endpoint, e.g.: {'http': 'foo.bar:3128', 'http://hostname': 'foo.bar:4012'}.
|
A dictionary of proxy servers to use by protocol or endpoint, e.g.: {'http': 'foo.bar:3128', 'http://hostname': 'foo.bar:4012'}.
|
||||||
The proxies are used on each request.
|
The proxies are used on each request.
|
||||||
|
|||||||
@@ -16,9 +16,14 @@
|
|||||||
from __future__ import (absolute_import, division, print_function,
|
from __future__ import (absolute_import, division, print_function,
|
||||||
unicode_literals)
|
unicode_literals)
|
||||||
|
|
||||||
|
import logging
|
||||||
|
import os
|
||||||
|
from shutil import copyfile
|
||||||
|
|
||||||
import sentencepiece as spm
|
import sentencepiece as spm
|
||||||
from transformers.tokenization_utils import PreTrainedTokenizer
|
from transformers.tokenization_utils import PreTrainedTokenizer
|
||||||
|
|
||||||
|
logger = logging.getLogger(__name__)
|
||||||
|
|
||||||
VOCAB_FILES_NAMES = {'vocab_file': 'sentencepiece.bpe.model'}
|
VOCAB_FILES_NAMES = {'vocab_file': 'sentencepiece.bpe.model'}
|
||||||
|
|
||||||
@@ -55,6 +60,7 @@ class CamembertTokenizer(PreTrainedTokenizer):
|
|||||||
self.max_len_sentences_pair = self.max_len - 4 # take into account special tokens
|
self.max_len_sentences_pair = self.max_len - 4 # take into account special tokens
|
||||||
self.sp_model = spm.SentencePieceProcessor()
|
self.sp_model = spm.SentencePieceProcessor()
|
||||||
self.sp_model.Load(str(vocab_file))
|
self.sp_model.Load(str(vocab_file))
|
||||||
|
self.vocab_file = vocab_file
|
||||||
# HACK: These tokens were added by fairseq but don't seem to be actually used when duplicated in the actual
|
# HACK: These tokens were added by fairseq but don't seem to be actually used when duplicated in the actual
|
||||||
# sentencepiece vocabulary (this is the case for <s> and </s>)
|
# sentencepiece vocabulary (this is the case for <s> and </s>)
|
||||||
self.fairseq_tokens_to_ids = {'<s>NOTUSED': 0, '<pad>': 1, '</s>NOTUSED': 2, '<unk>': 3}
|
self.fairseq_tokens_to_ids = {'<s>NOTUSED': 0, '<pad>': 1, '</s>NOTUSED': 2, '<unk>': 3}
|
||||||
@@ -135,3 +141,17 @@ class CamembertTokenizer(PreTrainedTokenizer):
|
|||||||
if index in self.fairseq_ids_to_tokens:
|
if index in self.fairseq_ids_to_tokens:
|
||||||
return self.fairseq_ids_to_tokens[index]
|
return self.fairseq_ids_to_tokens[index]
|
||||||
return self.sp_model.IdToPiece(index - self.fairseq_offset)
|
return self.sp_model.IdToPiece(index - self.fairseq_offset)
|
||||||
|
|
||||||
|
def save_vocabulary(self, save_directory):
|
||||||
|
""" Save the sentencepiece vocabulary (copy original file) and special tokens file
|
||||||
|
to a directory.
|
||||||
|
"""
|
||||||
|
if not os.path.isdir(save_directory):
|
||||||
|
logger.error("Vocabulary path ({}) should be a directory".format(save_directory))
|
||||||
|
return
|
||||||
|
out_vocab_file = os.path.join(save_directory, VOCAB_FILES_NAMES['vocab_file'])
|
||||||
|
|
||||||
|
if os.path.abspath(self.vocab_file) != os.path.abspath(out_vocab_file):
|
||||||
|
copyfile(self.vocab_file, out_vocab_file)
|
||||||
|
|
||||||
|
return (out_vocab_file,)
|
||||||
|
|||||||
@@ -34,6 +34,7 @@ PRETRAINED_VOCAB_FILES_MAP = {
|
|||||||
'distilbert-base-uncased': "https://s3.amazonaws.com/models.huggingface.co/bert/bert-base-uncased-vocab.txt",
|
'distilbert-base-uncased': "https://s3.amazonaws.com/models.huggingface.co/bert/bert-base-uncased-vocab.txt",
|
||||||
'distilbert-base-uncased-distilled-squad': "https://s3.amazonaws.com/models.huggingface.co/bert/bert-large-uncased-vocab.txt",
|
'distilbert-base-uncased-distilled-squad': "https://s3.amazonaws.com/models.huggingface.co/bert/bert-large-uncased-vocab.txt",
|
||||||
'distilbert-base-german-cased': "https://s3.amazonaws.com/models.huggingface.co/bert/distilbert-base-german-cased-vocab.txt",
|
'distilbert-base-german-cased': "https://s3.amazonaws.com/models.huggingface.co/bert/distilbert-base-german-cased-vocab.txt",
|
||||||
|
'distilbert-base-multilingual-cased': "https://s3.amazonaws.com/models.huggingface.co/bert/bert-base-multilingual-cased-vocab.txt",
|
||||||
}
|
}
|
||||||
}
|
}
|
||||||
|
|
||||||
@@ -41,6 +42,7 @@ PRETRAINED_POSITIONAL_EMBEDDINGS_SIZES = {
|
|||||||
'distilbert-base-uncased': 512,
|
'distilbert-base-uncased': 512,
|
||||||
'distilbert-base-uncased-distilled-squad': 512,
|
'distilbert-base-uncased-distilled-squad': 512,
|
||||||
'distilbert-base-german-cased': 512,
|
'distilbert-base-german-cased': 512,
|
||||||
|
'distilbert-base-multilingual-cased': 512,
|
||||||
}
|
}
|
||||||
|
|
||||||
|
|
||||||
|
|||||||
@@ -107,10 +107,10 @@ class GPT2Tokenizer(PreTrainedTokenizer):
|
|||||||
"""
|
"""
|
||||||
GPT-2 BPE tokenizer. Peculiarities:
|
GPT-2 BPE tokenizer. Peculiarities:
|
||||||
- Byte-level Byte-Pair-Encoding
|
- Byte-level Byte-Pair-Encoding
|
||||||
- Requires a space to start the input string => the encoding methods should be called with the
|
- Requires a space to start the input string => the encoding and tokenize methods should be called with the
|
||||||
``add_prefix_space`` flag set to ``True``.
|
``add_prefix_space`` flag set to ``True``.
|
||||||
Otherwise, this tokenizer ``encode`` and ``decode`` method will not conserve
|
Otherwise, this tokenizer's ``encode``, ``decode``, and ``tokenize`` methods will not conserve
|
||||||
the absence of a space at the beginning of a string: `tokenizer.decode(tokenizer.encode("Hello")) = " Hello"`
|
the spaces at the beginning of a string: `tokenizer.decode(tokenizer.encode(" Hello")) = "Hello"`
|
||||||
"""
|
"""
|
||||||
vocab_files_names = VOCAB_FILES_NAMES
|
vocab_files_names = VOCAB_FILES_NAMES
|
||||||
pretrained_vocab_files_map = PRETRAINED_VOCAB_FILES_MAP
|
pretrained_vocab_files_map = PRETRAINED_VOCAB_FILES_MAP
|
||||||
@@ -184,7 +184,7 @@ class GPT2Tokenizer(PreTrainedTokenizer):
|
|||||||
""" Tokenize a string.
|
""" Tokenize a string.
|
||||||
Args:
|
Args:
|
||||||
- add_prefix_space (boolean, default False):
|
- add_prefix_space (boolean, default False):
|
||||||
Begin the sentence with at least one space toto get invariance to word order in GPT-2 (and RoBERTa) tokenizers.
|
Begin the sentence with at least one space to get invariance to word order in GPT-2 (and RoBERTa) tokenizers.
|
||||||
"""
|
"""
|
||||||
if add_prefix_space:
|
if add_prefix_space:
|
||||||
text = ' ' + text
|
text = ' ' + text
|
||||||
|
|||||||
@@ -252,6 +252,9 @@ class PreTrainedTokenizer(object):
|
|||||||
force_download: (`optional`) boolean, default False:
|
force_download: (`optional`) boolean, default False:
|
||||||
Force to (re-)download the vocabulary files and override the cached versions if they exist.
|
Force to (re-)download the vocabulary files and override the cached versions if they exist.
|
||||||
|
|
||||||
|
resume_download: (`optional`) boolean, default False:
|
||||||
|
Do not delete an incompletely received file. Attempt to resume the download if such a file exists.
|
||||||
|
|
||||||
proxies: (`optional`) dict, default None:
|
proxies: (`optional`) dict, default None:
|
||||||
A dictionary of proxy servers to use by protocol or endpoint, e.g.: {'http': 'foo.bar:3128', 'http://hostname': 'foo.bar:4012'}.
|
A dictionary of proxy servers to use by protocol or endpoint, e.g.: {'http': 'foo.bar:3128', 'http://hostname': 'foo.bar:4012'}.
|
||||||
The proxies are used on each request.
|
The proxies are used on each request.
|
||||||
@@ -287,6 +290,7 @@ class PreTrainedTokenizer(object):
|
|||||||
def _from_pretrained(cls, pretrained_model_name_or_path, *init_inputs, **kwargs):
|
def _from_pretrained(cls, pretrained_model_name_or_path, *init_inputs, **kwargs):
|
||||||
cache_dir = kwargs.pop('cache_dir', None)
|
cache_dir = kwargs.pop('cache_dir', None)
|
||||||
force_download = kwargs.pop('force_download', False)
|
force_download = kwargs.pop('force_download', False)
|
||||||
|
resume_download = kwargs.pop('resume_download', False)
|
||||||
proxies = kwargs.pop('proxies', None)
|
proxies = kwargs.pop('proxies', None)
|
||||||
|
|
||||||
s3_models = list(cls.max_model_input_sizes.keys())
|
s3_models = list(cls.max_model_input_sizes.keys())
|
||||||
@@ -353,7 +357,7 @@ class PreTrainedTokenizer(object):
|
|||||||
if file_path is None:
|
if file_path is None:
|
||||||
resolved_vocab_files[file_id] = None
|
resolved_vocab_files[file_id] = None
|
||||||
else:
|
else:
|
||||||
resolved_vocab_files[file_id] = cached_path(file_path, cache_dir=cache_dir, force_download=force_download, proxies=proxies)
|
resolved_vocab_files[file_id] = cached_path(file_path, cache_dir=cache_dir, force_download=force_download, proxies=proxies, resume_download=resume_download)
|
||||||
except EnvironmentError:
|
except EnvironmentError:
|
||||||
if pretrained_model_name_or_path in s3_models:
|
if pretrained_model_name_or_path in s3_models:
|
||||||
msg = "Couldn't reach server at '{}' to download vocabulary files."
|
msg = "Couldn't reach server at '{}' to download vocabulary files."
|
||||||
@@ -513,6 +517,8 @@ class PreTrainedTokenizer(object):
|
|||||||
to_add_tokens = []
|
to_add_tokens = []
|
||||||
for token in new_tokens:
|
for token in new_tokens:
|
||||||
assert isinstance(token, str) or (six.PY2 and isinstance(token, unicode))
|
assert isinstance(token, str) or (six.PY2 and isinstance(token, unicode))
|
||||||
|
if self.init_kwargs.get('do_lower_case', False):
|
||||||
|
token = token.lower()
|
||||||
if token != self.unk_token and \
|
if token != self.unk_token and \
|
||||||
self.convert_tokens_to_ids(token) == self.convert_tokens_to_ids(self.unk_token) and \
|
self.convert_tokens_to_ids(token) == self.convert_tokens_to_ids(self.unk_token) and \
|
||||||
token not in to_add_tokens:
|
token not in to_add_tokens:
|
||||||
@@ -606,6 +612,9 @@ class PreTrainedTokenizer(object):
|
|||||||
|
|
||||||
Take care of added tokens.
|
Take care of added tokens.
|
||||||
"""
|
"""
|
||||||
|
if self.init_kwargs.get('do_lower_case', False):
|
||||||
|
text = text.lower()
|
||||||
|
|
||||||
def split_on_token(tok, text):
|
def split_on_token(tok, text):
|
||||||
result = []
|
result = []
|
||||||
split_text = text.split(tok)
|
split_text = text.split(tok)
|
||||||
@@ -741,6 +750,9 @@ class PreTrainedTokenizer(object):
|
|||||||
stride=0,
|
stride=0,
|
||||||
truncation_strategy='longest_first',
|
truncation_strategy='longest_first',
|
||||||
return_tensors=None,
|
return_tensors=None,
|
||||||
|
return_token_type_ids=True,
|
||||||
|
return_overflowing_tokens=False,
|
||||||
|
return_special_tokens_mask=False,
|
||||||
**kwargs):
|
**kwargs):
|
||||||
"""
|
"""
|
||||||
Returns a dictionary containing the encoded sequence or sequence pair and additional informations:
|
Returns a dictionary containing the encoded sequence or sequence pair and additional informations:
|
||||||
@@ -767,7 +779,30 @@ class PreTrainedTokenizer(object):
|
|||||||
- 'do_not_truncate': Does not truncate (raise an error if the input sequence is longer than max_length)
|
- 'do_not_truncate': Does not truncate (raise an error if the input sequence is longer than max_length)
|
||||||
return_tensors: (optional) can be set to 'tf' or 'pt' to return respectively TensorFlow tf.constant
|
return_tensors: (optional) can be set to 'tf' or 'pt' to return respectively TensorFlow tf.constant
|
||||||
or PyTorch torch.Tensor instead of a list of python integers.
|
or PyTorch torch.Tensor instead of a list of python integers.
|
||||||
|
return_token_type_ids: (optional) Set to False to avoid returning token_type_ids (default True).
|
||||||
|
return_overflowing_tokens: (optional) Set to True to return overflowing token information (default False).
|
||||||
|
return_special_tokens_mask: (optional) Set to True to return special tokens mask information (default False).
|
||||||
**kwargs: passed to the `self.tokenize()` method
|
**kwargs: passed to the `self.tokenize()` method
|
||||||
|
|
||||||
|
Return:
|
||||||
|
A Dictionary of shape::
|
||||||
|
|
||||||
|
{
|
||||||
|
input_ids: list[int],
|
||||||
|
token_type_ids: list[int] if return_token_type_ids is True (default)
|
||||||
|
overflowing_tokens: list[int] if a ``max_length`` is specified and return_overflowing_tokens is True
|
||||||
|
num_truncated_tokens: int if a ``max_length`` is specified and return_overflowing_tokens is True
|
||||||
|
special_tokens_mask: list[int] if ``add_special_tokens`` if set to ``True`` and return_special_tokens_mask is True
|
||||||
|
}
|
||||||
|
|
||||||
|
With the fields:
|
||||||
|
``input_ids``: list of token ids to be fed to a model
|
||||||
|
``token_type_ids``: list of token type ids to be fed to a model
|
||||||
|
|
||||||
|
``overflowing_tokens``: list of overflowing tokens if a max length is specified.
|
||||||
|
``num_truncated_tokens``: number of overflowing tokens a ``max_length`` is specified
|
||||||
|
``special_tokens_mask``: if adding special tokens, this is a list of [0, 1], with 0 specifying special added
|
||||||
|
tokens and 1 specifying sequence tokens.
|
||||||
"""
|
"""
|
||||||
|
|
||||||
def get_input_ids(text):
|
def get_input_ids(text):
|
||||||
@@ -789,10 +824,17 @@ class PreTrainedTokenizer(object):
|
|||||||
add_special_tokens=add_special_tokens,
|
add_special_tokens=add_special_tokens,
|
||||||
stride=stride,
|
stride=stride,
|
||||||
truncation_strategy=truncation_strategy,
|
truncation_strategy=truncation_strategy,
|
||||||
return_tensors=return_tensors)
|
return_tensors=return_tensors,
|
||||||
|
return_token_type_ids=return_token_type_ids,
|
||||||
|
return_overflowing_tokens=return_overflowing_tokens,
|
||||||
|
return_special_tokens_mask=return_special_tokens_mask)
|
||||||
|
|
||||||
def prepare_for_model(self, ids, pair_ids=None, max_length=None, add_special_tokens=True, stride=0,
|
def prepare_for_model(self, ids, pair_ids=None, max_length=None, add_special_tokens=True, stride=0,
|
||||||
truncation_strategy='longest_first', return_tensors=None):
|
truncation_strategy='longest_first',
|
||||||
|
return_tensors=None,
|
||||||
|
return_token_type_ids=True,
|
||||||
|
return_overflowing_tokens=False,
|
||||||
|
return_special_tokens_mask=False):
|
||||||
"""
|
"""
|
||||||
Prepares a sequence of input id, or a pair of sequences of inputs ids so that it can be used by the model.
|
Prepares a sequence of input id, or a pair of sequences of inputs ids so that it can be used by the model.
|
||||||
It adds special tokens, truncates
|
It adds special tokens, truncates
|
||||||
@@ -817,21 +859,27 @@ class PreTrainedTokenizer(object):
|
|||||||
- 'do_not_truncate': Does not truncate (raise an error if the input sequence is longer than max_length)
|
- 'do_not_truncate': Does not truncate (raise an error if the input sequence is longer than max_length)
|
||||||
return_tensors: (optional) can be set to 'tf' or 'pt' to return respectively TensorFlow tf.constant
|
return_tensors: (optional) can be set to 'tf' or 'pt' to return respectively TensorFlow tf.constant
|
||||||
or PyTorch torch.Tensor instead of a list of python integers.
|
or PyTorch torch.Tensor instead of a list of python integers.
|
||||||
|
return_token_type_ids: (optional) Set to False to avoid returning token_type_ids (default True).
|
||||||
|
return_overflowing_tokens: (optional) Set to True to return overflowing token information (default False).
|
||||||
|
return_special_tokens_mask: (optional) Set to True to return special tokens mask information (default False).
|
||||||
|
|
||||||
Return:
|
Return:
|
||||||
A Dictionary of shape::
|
A Dictionary of shape::
|
||||||
|
|
||||||
{
|
{
|
||||||
input_ids: list[int],
|
input_ids: list[int],
|
||||||
overflowing_tokens: list[int] if a ``max_length`` is specified, else None
|
token_type_ids: list[int] if return_token_type_ids is True (default)
|
||||||
special_tokens_mask: list[int] if ``add_special_tokens`` if set to ``True``
|
overflowing_tokens: list[int] if a ``max_length`` is specified and return_overflowing_tokens is True
|
||||||
|
num_truncated_tokens: int if a ``max_length`` is specified and return_overflowing_tokens is True
|
||||||
|
special_tokens_mask: list[int] if ``add_special_tokens`` if set to ``True`` and return_special_tokens_mask is True
|
||||||
}
|
}
|
||||||
|
|
||||||
With the fields:
|
With the fields:
|
||||||
``input_ids``: list of tokens to be fed to a model
|
``input_ids``: list of token ids to be fed to a model
|
||||||
|
``token_type_ids``: list of token type ids to be fed to a model
|
||||||
|
|
||||||
``overflowing_tokens``: list of overflowing tokens if a max length is specified.
|
``overflowing_tokens``: list of overflowing tokens if a max length is specified.
|
||||||
|
``num_truncated_tokens``: number of overflowing tokens a ``max_length`` is specified
|
||||||
``special_tokens_mask``: if adding special tokens, this is a list of [0, 1], with 0 specifying special added
|
``special_tokens_mask``: if adding special tokens, this is a list of [0, 1], with 0 specifying special added
|
||||||
tokens and 1 specifying sequence tokens.
|
tokens and 1 specifying sequence tokens.
|
||||||
"""
|
"""
|
||||||
@@ -840,23 +888,31 @@ class PreTrainedTokenizer(object):
|
|||||||
len_pair_ids = len(pair_ids) if pair else 0
|
len_pair_ids = len(pair_ids) if pair else 0
|
||||||
|
|
||||||
encoded_inputs = {}
|
encoded_inputs = {}
|
||||||
|
|
||||||
|
# Handle max sequence length
|
||||||
total_len = len_ids + len_pair_ids + (self.num_added_tokens(pair=pair) if add_special_tokens else 0)
|
total_len = len_ids + len_pair_ids + (self.num_added_tokens(pair=pair) if add_special_tokens else 0)
|
||||||
if max_length and total_len > max_length:
|
if max_length and total_len > max_length:
|
||||||
ids, pair_ids, overflowing_tokens = self.truncate_sequences(ids, pair_ids=pair_ids,
|
ids, pair_ids, overflowing_tokens = self.truncate_sequences(ids, pair_ids=pair_ids,
|
||||||
num_tokens_to_remove=total_len-max_length,
|
num_tokens_to_remove=total_len-max_length,
|
||||||
truncation_strategy=truncation_strategy,
|
truncation_strategy=truncation_strategy,
|
||||||
stride=stride)
|
stride=stride)
|
||||||
encoded_inputs["overflowing_tokens"] = overflowing_tokens
|
if return_overflowing_tokens:
|
||||||
encoded_inputs["num_truncated_tokens"] = total_len - max_length
|
encoded_inputs["overflowing_tokens"] = overflowing_tokens
|
||||||
|
encoded_inputs["num_truncated_tokens"] = total_len - max_length
|
||||||
|
|
||||||
|
# Handle special_tokens
|
||||||
if add_special_tokens:
|
if add_special_tokens:
|
||||||
sequence = self.build_inputs_with_special_tokens(ids, pair_ids)
|
sequence = self.build_inputs_with_special_tokens(ids, pair_ids)
|
||||||
token_type_ids = self.create_token_type_ids_from_sequences(ids, pair_ids)
|
token_type_ids = self.create_token_type_ids_from_sequences(ids, pair_ids)
|
||||||
encoded_inputs["special_tokens_mask"] = self.get_special_tokens_mask(ids, pair_ids)
|
special_tokens_mask = self.get_special_tokens_mask(ids, pair_ids)
|
||||||
else:
|
else:
|
||||||
sequence = ids + pair_ids if pair else ids
|
sequence = ids + pair_ids if pair else ids
|
||||||
token_type_ids = [0] * len(ids) + ([1] * len(pair_ids) if pair else [])
|
token_type_ids = [0] * len(ids) + ([1] * len(pair_ids) if pair else [])
|
||||||
|
special_tokens_mask = [0] * (len(ids) + (len(pair_ids) if pair else 0))
|
||||||
|
if return_special_tokens_mask:
|
||||||
|
encoded_inputs["special_tokens_mask"] = self.get_special_tokens_mask(ids, pair_ids)
|
||||||
|
|
||||||
|
# Prepare inputs as tensors if asked
|
||||||
if return_tensors == 'tf' and is_tf_available():
|
if return_tensors == 'tf' and is_tf_available():
|
||||||
sequence = tf.constant([sequence])
|
sequence = tf.constant([sequence])
|
||||||
token_type_ids = tf.constant([token_type_ids])
|
token_type_ids = tf.constant([token_type_ids])
|
||||||
@@ -867,12 +923,15 @@ class PreTrainedTokenizer(object):
|
|||||||
logger.warning("Unable to convert output to tensors format {}, PyTorch or TensorFlow is not available.".format(return_tensors))
|
logger.warning("Unable to convert output to tensors format {}, PyTorch or TensorFlow is not available.".format(return_tensors))
|
||||||
|
|
||||||
encoded_inputs["input_ids"] = sequence
|
encoded_inputs["input_ids"] = sequence
|
||||||
encoded_inputs["token_type_ids"] = token_type_ids
|
if return_token_type_ids:
|
||||||
|
encoded_inputs["token_type_ids"] = token_type_ids
|
||||||
|
|
||||||
if max_length and len(encoded_inputs["input_ids"]) > max_length:
|
if max_length and len(encoded_inputs["input_ids"]) > max_length:
|
||||||
encoded_inputs["input_ids"] = encoded_inputs["input_ids"][:max_length]
|
encoded_inputs["input_ids"] = encoded_inputs["input_ids"][:max_length]
|
||||||
encoded_inputs["token_type_ids"] = encoded_inputs["token_type_ids"][:max_length]
|
if return_token_type_ids:
|
||||||
encoded_inputs["special_tokens_mask"] = encoded_inputs["special_tokens_mask"][:max_length]
|
encoded_inputs["token_type_ids"] = encoded_inputs["token_type_ids"][:max_length]
|
||||||
|
if return_special_tokens_mask:
|
||||||
|
encoded_inputs["special_tokens_mask"] = encoded_inputs["special_tokens_mask"][:max_length]
|
||||||
|
|
||||||
if max_length is None and len(encoded_inputs["input_ids"]) > self.max_len:
|
if max_length is None and len(encoded_inputs["input_ids"]) > self.max_len:
|
||||||
logger.warning("Token indices sequence length is longer than the specified maximum sequence length "
|
logger.warning("Token indices sequence length is longer than the specified maximum sequence length "
|
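Note on the new return flags: the hunks above make `token_type_ids`, the overflow fields and `special_tokens_mask` optional keys of the returned dictionary. The sketch below is a simplified stand-in, not the method from this diff. It assumes a single sequence, adds no special tokens, and truncates only from the end; `prepare_sketch` is a hypothetical name.

```python
def prepare_sketch(ids, max_length=None,
                   return_token_type_ids=True,
                   return_overflowing_tokens=False,
                   return_special_tokens_mask=False):
    """Build the output dict, adding optional keys only when requested."""
    encoded = {}

    # Handle max sequence length (simplified: trim the tail of the sequence).
    if max_length and len(ids) > max_length:
        overflowing = ids[max_length:]
        num_truncated = len(ids) - max_length
        ids = ids[:max_length]
        if return_overflowing_tokens:
            encoded["overflowing_tokens"] = overflowing
            encoded["num_truncated_tokens"] = num_truncated

    encoded["input_ids"] = ids
    if return_token_type_ids:
        # Single sequence, so every position belongs to segment 0.
        encoded["token_type_ids"] = [0] * len(ids)
    if return_special_tokens_mask:
        # No special tokens were added in this sketch, so the mask is all zeros,
        # mirroring the `[0] * len(...)` fallback in the hunk above.
        encoded["special_tokens_mask"] = [0] * len(ids)
    return encoded


default_out = prepare_sketch([5, 6, 7, 8], max_length=3)
verbose_out = prepare_sketch([5, 6, 7, 8], max_length=3,
                             return_token_type_ids=False,
                             return_overflowing_tokens=True,
                             return_special_tokens_mask=True)
```

Callers that previously read `overflowing_tokens` or `special_tokens_mask` unconditionally would now need to pass the corresponding flag, since both default to `False`.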
||||||
|
|||||||
@@ -12,7 +12,7 @@
|
|||||||
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
|
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
|
||||||
# See the License for the specific language governing permissions and
|
# See the License for the specific language governing permissions and
|
||||||
# limitations under the License.
|
# limitations under the License.
|
||||||
"""Tokenization classes for OpenAI GPT."""
|
"""Tokenization classes for XLM."""
|
||||||
from __future__ import (absolute_import, division, print_function,
|
from __future__ import (absolute_import, division, print_function,
|
||||||
unicode_literals)
|
unicode_literals)
|
||||||
|
|
||||||
@@ -758,9 +758,9 @@ class XLMTokenizer(PreTrainedTokenizer):
|
|||||||
"""
|
"""
|
||||||
Build model inputs from a sequence or a pair of sequence for sequence classification tasks
|
Build model inputs from a sequence or a pair of sequence for sequence classification tasks
|
||||||
by concatenating and adding special tokens.
|
by concatenating and adding special tokens.
|
||||||
A RoBERTa sequence has the following format:
|
A XLM sequence has the following format:
|
||||||
single sequence: <s> X </s>
|
single sequence: <s> X </s>
|
||||||
pair of sequences: <s> A </s></s> B </s>
|
pair of sequences: <s> A </s> B </s>
|
||||||
"""
|
"""
|
||||||
if token_ids_1 is None:
|
if token_ids_1 is None:
|
||||||
return [self.cls_token_id] + token_ids_0 + [self.sep_token_id]
|
return [self.cls_token_id] + token_ids_0 + [self.sep_token_id]
|
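Note on the corrected XLM docstring: the layout `<s> X </s>` and `<s> A </s> B </s>` can be sketched as plain list concatenation. The ids below are hypothetical placeholders, and the pair branch is inferred from the docstring, since only the single-sequence return is visible in this hunk.

```python
CLS_ID = 0  # hypothetical id standing in for <s>
SEP_ID = 1  # hypothetical id standing in for </s>


def build_xlm_style_inputs(token_ids_0, token_ids_1=None):
    """Concatenate one or two id sequences with the XLM-style special tokens."""
    if token_ids_1 is None:
        # single sequence: <s> X </s>
        return [CLS_ID] + token_ids_0 + [SEP_ID]
    # pair of sequences: <s> A </s> B </s>
    return [CLS_ID] + token_ids_0 + [SEP_ID] + token_ids_1 + [SEP_ID]


single_xlm = build_xlm_style_inputs([10, 11])
pair_xlm = build_xlm_style_inputs([10, 11], [20, 21])
```

Unlike the RoBERTa layout that the old docstring described, only one separator sits between the two sequences.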
||||||
|
|||||||
@@ -185,9 +185,9 @@ class XLNetTokenizer(PreTrainedTokenizer):
|
|||||||
"""
|
"""
|
||||||
Build model inputs from a sequence or a pair of sequence for sequence classification tasks
|
Build model inputs from a sequence or a pair of sequence for sequence classification tasks
|
||||||
by concatenating and adding special tokens.
|
by concatenating and adding special tokens.
|
||||||
A RoBERTa sequence has the following format:
|
An XLNet sequence has the following format:
|
||||||
single sequence: <s> X </s>
|
single sequence: X <sep> <cls>
|
||||||
pair of sequences: <s> A </s></s> B </s>
|
pair of sequences: A <sep> B <sep> <cls>
|
||||||
"""
|
"""
|
||||||
sep = [self.sep_token_id]
|
sep = [self.sep_token_id]
|
||||||
cls = [self.cls_token_id]
|
cls = [self.cls_token_id]
|
||||||
@@ -224,7 +224,7 @@ class XLNetTokenizer(PreTrainedTokenizer):
|
|||||||
def create_token_type_ids_from_sequences(self, token_ids_0, token_ids_1=None):
|
def create_token_type_ids_from_sequences(self, token_ids_0, token_ids_1=None):
|
||||||
"""
|
"""
|
||||||
Creates a mask from the two sequences passed to be used in a sequence-pair classification task.
|
Creates a mask from the two sequences passed to be used in a sequence-pair classification task.
|
||||||
A BERT sequence pair mask has the following format:
|
An XLNet sequence pair mask has the following format:
|
||||||
0 0 0 0 0 0 0 0 0 0 1 1 1 1 1 1 1 1 1 1 1 2
|
0 0 0 0 0 0 0 0 0 0 1 1 1 1 1 1 1 1 1 1 1 2
|
||||||
| first sequence | second sequence | CLS segment ID
|
| first sequence | second sequence | CLS segment ID
|
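Note on the corrected XLNet docstrings: XLNet places its special tokens at the end (`A <sep> B <sep> <cls>`), and the segment mask ends with a dedicated CLS segment id of 2. A minimal sketch with hypothetical token ids; assigning each `<sep>` to the segment of the sequence it closes is an assumption read off the mask diagram above.

```python
SEP_ID = 4  # hypothetical id standing in for <sep>
CLS_ID = 3  # hypothetical id standing in for <cls>
CLS_SEGMENT_ID = 2  # segment id shown for the trailing <cls> in the diagram


def build_xlnet_style_inputs(token_ids_0, token_ids_1=None):
    """Append XLNet-style special tokens after the sequence(s)."""
    sep = [SEP_ID]
    cls = [CLS_ID]
    if token_ids_1 is None:
        # single sequence: X <sep> <cls>
        return token_ids_0 + sep + cls
    # pair of sequences: A <sep> B <sep> <cls>
    return token_ids_0 + sep + token_ids_1 + sep + cls


def xlnet_style_token_type_ids(token_ids_0, token_ids_1=None):
    """Segment ids: 0 for the first sequence, 1 for the second, 2 for <cls>."""
    first = [0] * (len(token_ids_0) + 1)  # +1 covers the closing <sep>
    if token_ids_1 is None:
        return first + [CLS_SEGMENT_ID]
    second = [1] * (len(token_ids_1) + 1)  # +1 covers the closing <sep>
    return first + second + [CLS_SEGMENT_ID]


pair_inputs = build_xlnet_style_inputs([10, 11], [20])
pair_segments = xlnet_style_token_type_ids([10, 11], [20])
single_inputs = build_xlnet_style_inputs([10, 11])
single_segments = xlnet_style_token_type_ids([10, 11])
```

The two lists always have the same length, so each input id has exactly one segment id.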
||||||
|
|
||||||
|
|||||||