Merge branch 'master' into auto_models
48
.github/ISSUE_TEMPLATE/bug-report.md
vendored
Normal file
@@ -0,0 +1,48 @@
+---
+name: "\U0001F41B Bug Report"
+about: Submit a bug report to help us improve PyTorch Transformers
+---
+
+## 🐛 Bug
+
+<!-- Important information -->
+
+Model I am using (Bert, XLNet....):
+
+Language I am using the model on (English, Chinese....):
+
+The problem arise when using:
+* [ ] the official example scripts: (give details)
+* [ ] my own modified scripts: (give details)
+
+The tasks I am working on is:
+* [ ] an official GLUE/SQUaD task: (give the name)
+* [ ] my own task or dataset: (give details)
+
+## To Reproduce
+
+Steps to reproduce the behavior:
+
+1.
+2.
+3.
+
+<!-- If you have a code sample, error messages, stack traces, please provide it here as well. -->
+
+## Expected behavior
+
+<!-- A clear and concise description of what you expected to happen. -->
+
+## Environment
+
+* OS:
+* Python version:
+* PyTorch version:
+* PyTorch Transformers version (or branch):
+* Using GPU ?
+* Distributed of parallel setup ?
+* Any other relevant information:
+
+## Additional context
+
+<!-- Add any other context about the problem here. -->
16
.github/ISSUE_TEMPLATE/feature-request.md
vendored
Normal file
@@ -0,0 +1,16 @@
+---
+name: "\U0001F680 Feature Request"
+about: Submit a proposal/request for a new PyTorch Transformers feature
+---
+
+## 🚀 Feature
+
+<!-- A clear and concise description of the feature proposal. Please provide a link to the paper and code in case they exist. -->
+
+## Motivation
+
+<!-- Please outline the motivation for the proposal. Is your feature request related to a problem? e.g., I'm always frustrated when [...]. If this is related to another GitHub issue, please link here too. -->
+
+## Additional context
+
+<!-- Add any other context or screenshots about the feature request here. -->
43
.github/ISSUE_TEMPLATE/migration.md
vendored
Normal file
@@ -0,0 +1,43 @@
+---
+name: "\U0001F4DA Migration from PyTorch-pretrained-Bert"
+about: Report a problem when migrating from PyTorch-pretrained-Bert to PyTorch-Transformers
+---
+
+## 📚 Migration
+
+<!-- Important information -->
+
+Model I am using (Bert, XLNet....):
+
+Language I am using the model on (English, Chinese....):
+
+The problem arise when using:
+* [ ] the official example scripts: (give details)
+* [ ] my own modified scripts: (give details)
+
+The tasks I am working on is:
+* [ ] an official GLUE/SQUaD task: (give the name)
+* [ ] my own task or dataset: (give details)
+
+Details of the issue:
+
+<!-- A clear and concise description of the migration issue. If you have code snippets, please provide it here as well. -->
+
+## Environment
+
+* OS:
+* Python version:
+* PyTorch version:
+* PyTorch Transformers version (or branch):
+* Using GPU ?
+* Distributed of parallel setup ?
+* Any other relevant information:
+
+## Checklist
+
+- [ ] I have read the migration guide in the readme.
+- [ ] I checked if a related official extension example runs on my machine.
+
+## Additional context
+
+<!-- Add any other context about the problem here. -->
8
.github/ISSUE_TEMPLATE/question-help.md
vendored
Normal file
@@ -0,0 +1,8 @@
+---
+name: "❓Questions & Help"
+about: Start a general discussion related to PyTorch Transformers
+---
+
+## ❓ Questions & Help
+
+<!-- A clear and concise description of the question. -->
19
README.md
@@ -18,7 +18,7 @@ These implementations have been tested on several datasets (see the example scri
 | Section | Description |
 |-|-|
 | [Installation](#installation) | How to install the package |
-| [Quick tour: Usage](#quick-tour-usage) | Tokenizers & models usage: Bert and GPT-2 |
+| [Quick tour: Usage](#quick-tour) | Tokenizers & models usage: Bert and GPT-2 |
 | [Quick tour: Fine-tuning/usage scripts](#quick-tour-of-the-fine-tuningusage-scripts) | Using provided scripts: GLUE, SQuAD and Text generation |
 | [Migrating from pytorch-pretrained-bert to pytorch-transformers](#Migrating-from-pytorch-pretrained-bert-to-pytorch-transformers) | Migrating your code from pytorch-pretrained-bert to pytorch-transformers |
 | [Documentation](https://huggingface.co/pytorch-transformers/) | Full API documentation and more |
@@ -56,6 +56,16 @@ python -m pytest -sv ./pytorch_transformers/tests/
 python -m pytest -sv ./examples/
 ```
 
+### Do you want to run a Transformer model on a mobile device?
+
+You should check out our [`swift-coreml-transformers`](https://github.com/huggingface/swift-coreml-transformers) repo.
+
+It contains an example of a conversion script from a Pytorch trained Transformer model (here, `GPT-2`) to a CoreML model that runs on iOS devices.
+
+At some point in the future, you'll be able to seamlessly move from pre-training or fine-tuning models in PyTorch to productizing them in CoreML,
+or prototype a model or an app in CoreML then research its hyperparameters or architecture from PyTorch. Super exciting!
+
+
 ## Quick tour
 
 Let's do a very quick overview of PyTorch-Transformers. Detailed examples for each model architecture (Bert, GPT, GPT-2, Transformer-XL, XLNet and XLM) can be found in the [full documentation](https://huggingface.co/pytorch-transformers/).
@@ -195,7 +205,7 @@ python ./examples/run_glue.py \
 --warmup_steps=120
 ```
 
-On this machine we thus have a batch size of 32, please increase `gradient_accumulation_steps` to reach the same batch size if you have a smaller machine. These hyper-parameters should results in a Pearson correlation coefficient of `+0.917` on the development set.
+On this machine we thus have a batch size of 32, please increase `gradient_accumulation_steps` to reach the same batch size if you have a smaller machine. These hyper-parameters should result in a Pearson correlation coefficient of `+0.917` on the development set.
 
 #### Fine-tuning Bert model on the MRPC classification task
 
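Review note: the README line changed in this hunk asks readers on smaller machines to raise `gradient_accumulation_steps` until they reach the same batch size of 32. The sketch below spells out that arithmetic. The helper names and the example hardware (4 GPUs with 8 examples each, or a single GPU) are assumptions chosen for illustration; they are not taken from the README or from `run_glue.py`.

```python
def effective_batch_size(per_gpu_batch_size, n_gpu, gradient_accumulation_steps=1):
    """Number of examples that contribute to one optimizer update."""
    # max(1, n_gpu) keeps the formula meaningful on a CPU-only machine (n_gpu == 0).
    return per_gpu_batch_size * max(1, n_gpu) * gradient_accumulation_steps


def accumulation_steps_needed(target_batch_size, per_gpu_batch_size, n_gpu):
    """Accumulation steps required to match a target batch size on smaller hardware."""
    per_step = per_gpu_batch_size * max(1, n_gpu)
    if target_batch_size % per_step != 0:
        raise ValueError("target batch size must be a multiple of the per-step batch")
    return target_batch_size // per_step


# Hypothetical setups: both reach the batch size of 32 quoted in the README.
big_machine = effective_batch_size(per_gpu_batch_size=8, n_gpu=4)
small_machine_steps = accumulation_steps_needed(32, per_gpu_batch_size=8, n_gpu=1)
```

Under these assumed numbers, a single GPU holding 8 examples per step would need 4 accumulation steps to match the larger machine.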
@@ -265,7 +275,7 @@ This is the model provided as `bert-large-uncased-whole-word-masking-finetuned-s
 ### `run_generation.py`: Text generation with GPT, GPT-2, Transformer-XL and XLNet
 
 A conditional generation script is also included to generate text from a prompt.
-The generation script include the [tricks](https://github.com/rusiaaman/XLNet-gen#methodology) proposed by by Aman Rusia to get high quality generation with memory models like Transformer-XL and XLNet (include a predefined text to make short inputs longer).
+The generation script includes the [tricks](https://github.com/rusiaaman/XLNet-gen#methodology) proposed by by Aman Rusia to get high quality generation with memory models like Transformer-XL and XLNet (include a predefined text to make short inputs longer).
 
 Here is how to run the script with the small version of OpenAI GPT-2 model:
 
@@ -284,7 +294,7 @@ Here is a quick summary of what you should take care of when migrating from `pyt
 
 The main breaking change when migrating from `pytorch-pretrained-bert` to `pytorch-transformers` is that the models forward method always outputs a `tuple` with various elements depending on the model and the configuration parameters.
 
-The exact content of the tuples for each model are detailled in the models' docstrings and the [documentation](https://huggingface.co/pytorch-transformers/).
+The exact content of the tuples for each model are detailed in the models' docstrings and the [documentation](https://huggingface.co/pytorch-transformers/).
 
 In pretty much every case, you will be fine by taking the first element of the output as the output you previously used in `pytorch-pretrained-bert`.
 
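Review note: the migration text in this hunk says that a model's forward method now always returns a `tuple`, and that taking its first element usually recovers the old output. The toy sketch below shows that calling pattern only. `old_style_forward` and `new_style_forward` are hypothetical stand-ins written for this note; they are not functions from either library, and the real tuple contents depend on the model and its configuration.

```python
def old_style_forward(input_ids):
    """Hypothetical stand-in for the old behaviour: returns the main output directly."""
    return [0.1 * i for i in input_ids]


def new_style_forward(input_ids, output_hidden_states=False):
    """Hypothetical stand-in for the new behaviour: always returns a tuple."""
    main_output = [0.1 * i for i in input_ids]
    outputs = (main_output,)
    if output_hidden_states:
        # Extra elements are appended depending on configuration.
        outputs = outputs + (["placeholder hidden states"],)
    return outputs


input_ids = [1, 2, 3]
legacy = old_style_forward(input_ids)

# Migrated call site: index the tuple instead of using the return value directly.
migrated = new_style_forward(input_ids, output_hidden_states=True)[0]
```

Indexing with `[0]` keeps the call site unchanged whether or not optional elements are appended to the tuple.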
@@ -383,6 +393,7 @@ for batch in train_data:
     torch.nn.utils.clip_grad_norm_(model.parameters(), max_grad_norm) # Gradient clipping is not in AdamW anymore (so you can use amp without issue)
     scheduler.step()
     optimizer.step()
+    optimizer.zero_grad()
 ```
 
 ## Citation
@@ -49,4 +49,17 @@ If you want to reproduce the original tokenization process of the ``OpenAI GPT``
 pip install spacy ftfy==4.4.3
 python -m spacy download en
 
-If you don't install ``ftfy`` and ``SpaCy``\ , the ``OpenAI GPT`` tokenizer defaults to tokenize using BERT's ``BasicTokenizer`` followed by Byte-Pair Encoding (which should be fine for most usage, don't worry).
+If you don't install ``ftfy`` and ``SpaCy``\ , the ``OpenAI GPT`` tokenizer will default to tokenize using BERT's ``BasicTokenizer`` followed by Byte-Pair Encoding (which should be fine for most usage, don't worry).
+
+
+Do you want to run a Transformer model on a mobile device?
+^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+
+You should check out our `swift-coreml-transformers <https://github.com/huggingface/swift-coreml-transformers>`_ repo.
+
+It contains an example of a conversion script from a Pytorch trained Transformer model (here, ``GPT-2``) to a CoreML model that runs on iOS devices.
+
+It also contains an implementation of BERT for Question answering.
+
+At some point in the future, you'll be able to seamlessly move from pre-training or fine-tuning models in PyTorch to productizing them in CoreML,
+or prototype a model or an app in CoreML then research its hyperparameters or architecture from PyTorch. Super exciting!
@@ -3,57 +3,98 @@ Pretrained models
 
 Here is the full list of the currently provided pretrained models together with a short presentation of each model.
 
-+===============+============================================================+===========================+
-| Architecture | Shortcut name | Details of the model |
-+===============+============================================================+===========================+
-| | ``bert-base-uncased`` | 12-layer, 768-hidden, 12-heads, 110M parameters
-| | | Trained on lower-cased English text |
-| +------------------------------------------------------------+---------------------------+
-| | ``bert-large-uncased`` | 24-layer, 1024-hidden, 16-heads, 340M parameters
-| | | Trained on lower-cased English text |
-| +------------------------------------------------------------+---------------------------+
-| | ``bert-base-cased`` | 12-layer, 768-hidden, 12-heads, 110M parameters
-| | | Trained on cased English text |
-| +------------------------------------------------------------+---------------------------+
-| | ``bert-large-cased`` | 24-layer, 1024-hidden, 16-heads, 340M parameters |
-| | | Trained on cased English text |
-| +------------------------------------------------------------+---------------------------+
-| | ``bert-base-multilingual-uncased`` | (Original, not recommended) 12-layer, 768-hidden, 12-heads, 110M parameters
-| | | Trained on lower-cased text in the top 102 languages with the largest Wikipedias
-| | | (see `details <https://github.com/google-research/bert/blob/master/multilingual.md>`_) |
-| +------------------------------------------------------------+---------------------------+
-| | ``bert-base-multilingual-cased`` | (New, **recommended**) 12-layer, 768-hidden, 12-heads, 110M parameters |
-| | | Trained on cased text in the top 104 languages with the largest Wikipedias
-| | | (see `details <https://github.com/google-research/bert/blob/master/multilingual.md>`_) |
-| +------------------------------------------------------------+---------------------------+
-| BERT | ``bert-base-chinese`` | 12-layer, 768-hidden, 12-heads, 110M parameters |
-| | | Trained on cased Chinese Simplified and Traditional text |
-| +------------------------------------------------------------+---------------------------+
-| | ``bert-base-german-cased`` | 12-layer, 768-hidden, 12-heads, 110M parameters |
-| | | Trained on cased German text by Deepset.ai |
-| | | (see `details on deepset.ai website <https://deepset.ai/german-bert>`_) |
-| +------------------------------------------------------------+---------------------------+
-| | ``bert-large-uncased-whole-word-masking`` | 24-layer, 1024-hidden, 16-heads, 340M parameters |
-| | | Trained on lower-cased English text using Whole-Word-Masking |
-| | | (see `details <https://github.com/google-research/bert/#bert>`_) |
-| +------------------------------------------------------------+---------------------------+
-| | ``bert-large-cased-whole-word-masking`` | 24-layer, 1024-hidden, 16-heads, 340M parameters |
-| | | Trained on cased English text using Whole-Word-Masking |
-| | | (see `details <https://github.com/google-research/bert/#bert>`_) |
-| +------------------------------------------------------------+---------------------------+
-| | ``bert-large-uncased-whole-word-masking-finetuned-squad`` | 24-layer, 1024-hidden, 16-heads, 340M parameters |
-| | | The ``bert-large-uncased-whole-word-masking`` model fine-tuned on SQuAD |
-| | | (see details of fine-tuning in the `example section`_) |
-| +------------------------------------------------------------+---------------------------+
-| | ``bert-large-cased-whole-word-masking-finetuned-squad`` | 24-layer, 1024-hidden, 16-heads, 340M parameters |
-| | | The ``bert-large-cased-whole-word-masking`` model fine-tuned on SQuAD |
-| | | (see `details of fine-tuning in the example section <https://huggingface.co/pytorch-transformers/examples.html>`_) |
-| +------------------------------------------------------------+---------------------------+
-| | ``bert-base-cased-finetuned-mrpc`` | 12-layer, 768-hidden, 12-heads, 110M parameters |
-| | | The ``bert-base-cased`` model fine-tuned on MRPC |
-| | | (see `details of fine-tuning in the example section <https://huggingface.co/pytorch-transformers/examples.html>`_) |
-+---------------+------------------------------------------------------------+---------------------------+
-| GPT | Cells may span columns. |
-+---------------+----------------------------------------------------------------------------------------+
 
-.. <https://huggingface.co/pytorch-transformers/examples.html>`_
++-------------------+------------------------------------------------------------+---------------------------------------------------------------------------------------------------------------------------+
+| Architecture      | Shortcut name                                              | Details of the model                                                                                                      |
++===================+============================================================+===========================================================================================================================+
+| BERT              | ``bert-base-uncased``                                      | 12-layer, 768-hidden, 12-heads, 110M parameters                                                                           |
+|                   |                                                            | Trained on lower-cased English text                                                                                       |
+|                   +------------------------------------------------------------+---------------------------------------------------------------------------------------------------------------------------+
+|                   | ``bert-large-uncased``                                     | 24-layer, 1024-hidden, 16-heads, 340M parameters                                                                          |
+|                   |                                                            | Trained on lower-cased English text                                                                                       |
+|                   +------------------------------------------------------------+---------------------------------------------------------------------------------------------------------------------------+
+|                   | ``bert-base-cased``                                        | 12-layer, 768-hidden, 12-heads, 110M parameters                                                                           |
+|                   |                                                            | Trained on cased English text                                                                                             |
+|                   +------------------------------------------------------------+---------------------------------------------------------------------------------------------------------------------------+
+|                   | ``bert-large-cased``                                       | 24-layer, 1024-hidden, 16-heads, 340M parameters                                                                          |
+|                   |                                                            | Trained on cased English text                                                                                             |
+|                   +------------------------------------------------------------+---------------------------------------------------------------------------------------------------------------------------+
+|                   | ``bert-base-multilingual-uncased``                         | (Original, not recommended) 12-layer, 768-hidden, 12-heads, 110M parameters                                               |
+|                   |                                                            | Trained on lower-cased text in the top 102 languages with the largest Wikipedias                                          |
+|                   |                                                            | (see `details <https://github.com/google-research/bert/blob/master/multilingual.md>`__)                                   |
+|                   +------------------------------------------------------------+---------------------------------------------------------------------------------------------------------------------------+
+|                   | ``bert-base-multilingual-cased``                           | (New, **recommended**) 12-layer, 768-hidden, 12-heads, 110M parameters                                                    |
+|                   |                                                            | Trained on cased text in the top 104 languages with the largest Wikipedias                                                |
+|                   |                                                            | (see `details <https://github.com/google-research/bert/blob/master/multilingual.md>`__)                                   |
+|                   +------------------------------------------------------------+---------------------------------------------------------------------------------------------------------------------------+
+|                   | ``bert-base-chinese``                                      | 12-layer, 768-hidden, 12-heads, 110M parameters                                                                           |
+|                   |                                                            | Trained on cased Chinese Simplified and Traditional text                                                                  |
+|                   +------------------------------------------------------------+---------------------------------------------------------------------------------------------------------------------------+
+|                   | ``bert-base-german-cased``                                 | 12-layer, 768-hidden, 12-heads, 110M parameters                                                                           |
+|                   |                                                            | Trained on cased German text by Deepset.ai                                                                                |
+|                   |                                                            | (see `details on deepset.ai website <https://deepset.ai/german-bert>`__)                                                  |
+|                   +------------------------------------------------------------+---------------------------------------------------------------------------------------------------------------------------+
+|                   | ``bert-large-uncased-whole-word-masking``                  | 24-layer, 1024-hidden, 16-heads, 340M parameters                                                                          |
+|                   |                                                            | Trained on lower-cased English text using Whole-Word-Masking                                                              |
+|                   |                                                            | (see `details <https://github.com/google-research/bert/#bert>`__)                                                         |
+|                   +------------------------------------------------------------+---------------------------------------------------------------------------------------------------------------------------+
+|                   | ``bert-large-cased-whole-word-masking``                    | 24-layer, 1024-hidden, 16-heads, 340M parameters                                                                          |
+|                   |                                                            | Trained on cased English text using Whole-Word-Masking                                                                    |
+|                   |                                                            | (see `details <https://github.com/google-research/bert/#bert>`__)                                                         |
+|                   +------------------------------------------------------------+---------------------------------------------------------------------------------------------------------------------------+
+|                   | ``bert-large-uncased-whole-word-masking-finetuned-squad``  | 24-layer, 1024-hidden, 16-heads, 340M parameters                                                                          |
+|                   |                                                            | The ``bert-large-uncased-whole-word-masking`` model fine-tuned on SQuAD (see details of fine-tuning in the                |
+|                   |                                                            | `example section <https://github.com/huggingface/pytorch-transformers/tree/master/examples>`__)                           |
+|                   +------------------------------------------------------------+---------------------------------------------------------------------------------------------------------------------------+
+|                   | ``bert-large-cased-whole-word-masking-finetuned-squad``    | 24-layer, 1024-hidden, 16-heads, 340M parameters                                                                          |
+|                   |                                                            | The ``bert-large-cased-whole-word-masking`` model fine-tuned on SQuAD                                                     |
+|                   |                                                            | (see `details of fine-tuning in the example section <https://huggingface.co/pytorch-transformers/examples.html>`__)       |
+|                   +------------------------------------------------------------+---------------------------------------------------------------------------------------------------------------------------+
+|                   | ``bert-base-cased-finetuned-mrpc``                         | 12-layer, 768-hidden, 12-heads, 110M parameters                                                                           |
+|                   |                                                            | The ``bert-base-cased`` model fine-tuned on MRPC                                                                          |
+|                   |                                                            | (see `details of fine-tuning in the example section <https://huggingface.co/pytorch-transformers/examples.html>`__)       |
++-------------------+------------------------------------------------------------+---------------------------------------------------------------------------------------------------------------------------+
+| GPT               | ``openai-gpt``                                             | 12-layer, 768-hidden, 12-heads, 110M parameters                                                                           |
+|                   |                                                            | OpenAI GPT English model                                                                                                  |
++-------------------+------------------------------------------------------------+---------------------------------------------------------------------------------------------------------------------------+
+| GPT-2             | ``gpt2``                                                   | 12-layer, 768-hidden, 12-heads, 117M parameters                                                                           |
+|                   |                                                            | OpenAI GPT-2 English model                                                                                                |
+|                   +------------------------------------------------------------+---------------------------------------------------------------------------------------------------------------------------+
+|                   | ``gpt2-medium``                                            | 24-layer, 1024-hidden, 16-heads, 345M parameters                                                                          |
+|                   |                                                            | OpenAI's Medium-sized GPT-2 English model                                                                                 |
++-------------------+------------------------------------------------------------+---------------------------------------------------------------------------------------------------------------------------+
+| Transformer-XL    | ``transfo-xl-wt103``                                       | 18-layer, 1024-hidden, 16-heads, 257M parameters                                                                          |
+|                   |                                                            | English model trained on wikitext-103                                                                                     |
++-------------------+------------------------------------------------------------+---------------------------------------------------------------------------------------------------------------------------+
+| XLNet             | ``xlnet-base-cased``                                       | 12-layer, 768-hidden, 12-heads, 110M parameters                                                                           |
+|                   |                                                            | XLNet English model                                                                                                       |
+|                   +------------------------------------------------------------+---------------------------------------------------------------------------------------------------------------------------+
+|                   | ``xlnet-large-cased``                                      | 24-layer, 1024-hidden, 16-heads, 340M parameters                                                                          |
+|                   |                                                            | XLNet Large English model                                                                                                 |
++-------------------+------------------------------------------------------------+---------------------------------------------------------------------------------------------------------------------------+
+| XLM               | ``xlm-mlm-en-2048``                                        | 12-layer, 1024-hidden, 8-heads                                                                                            |
+|                   |                                                            | XLM English model                                                                                                         |
+|                   +------------------------------------------------------------+---------------------------------------------------------------------------------------------------------------------------+
+|                   | ``xlm-mlm-ende-1024``                                      | 12-layer, 1024-hidden, 8-heads                                                                                            |
+|                   |                                                            | XLM English-German Multi-language model                                                                                   |
+|                   +------------------------------------------------------------+---------------------------------------------------------------------------------------------------------------------------+
+|                   | ``xlm-mlm-enfr-1024``                                      | 12-layer, 1024-hidden, 8-heads                                                                                            |
+|                   |                                                            | XLM English-French Multi-language model                                                                                   |
+|                   +------------------------------------------------------------+---------------------------------------------------------------------------------------------------------------------------+
+|                   | ``xlm-mlm-enro-1024``                                      | 12-layer, 1024-hidden, 8-heads                                                                                            |
+|                   |                                                            | XLM English-Romanian Multi-language model                                                                                 |
+|                   +------------------------------------------------------------+---------------------------------------------------------------------------------------------------------------------------+
+|                   | ``xlm-mlm-xnli15-1024``                                    | 12-layer, 1024-hidden, 8-heads                                                                                            |
+|                   |                                                            | XLM Model pre-trained with MLM on the `15 XNLI languages <https://github.com/facebookresearch/XNLI>`__.                   |
+|                   +------------------------------------------------------------+---------------------------------------------------------------------------------------------------------------------------+
+|                   | ``xlm-mlm-tlm-xnli15-1024``                                | 12-layer, 1024-hidden, 8-heads                                                                                            |
+|                   |                                                            | XLM Model pre-trained with MLM + TLM on the `15 XNLI languages <https://github.com/facebookresearch/XNLI>`__.             |
+|                   +------------------------------------------------------------+---------------------------------------------------------------------------------------------------------------------------+
+|                   | ``xlm-clm-enfr-1024``                                      | 12-layer, 1024-hidden, 8-heads                                                                                            |
+|                   |                                                            | XLM English model trained with CLM (Causal Language Modeling)                                                             |
+|                   +------------------------------------------------------------+---------------------------------------------------------------------------------------------------------------------------+
+|                   | ``xlm-clm-ende-1024``                                      | 12-layer, 1024-hidden, 8-heads                                                                                            |
+|                   |                                                            | XLM English-German Multi-language model trained with CLM (Causal Language Modeling)                                       |
++-------------------+------------------------------------------------------------+---------------------------------------------------------------------------------------------------------------------------+
+
+.. <https://huggingface.co/pytorch-transformers/examples.html>`__
@@ -132,4 +132,4 @@ Using the traced model for inference is as simple as using its ``__call__`` dund
 
 .. code-block:: python
 
-traced_model(tokens_tensor, segments_tensors)
+traced_model(tokens_tensor, segments_tensors)
@@ -92,6 +92,10 @@ def train(args, train_dataset, model, tokenizer):
             raise ImportError("Please install apex from https://www.github.com/nvidia/apex to use fp16 training.")
         model, optimizer = amp.initialize(model, optimizer, opt_level=args.fp16_opt_level)
 
+    # multi-gpu training (should be after apex fp16 initialization)
+    if args.n_gpu > 1:
+        model = torch.nn.DataParallel(model)
+
     # Distributed training (should be after apex fp16 initialization)
     if args.local_rank != -1:
         model = torch.nn.parallel.DistributedDataParallel(model, device_ids=[args.local_rank],
@@ -243,6 +247,9 @@ def evaluate(args, model, tokenizer, prefix=""):
 
 
 def load_and_cache_examples(args, task, tokenizer, evaluate=False):
+    if args.local_rank not in [-1, 0]:
+        torch.distributed.barrier() # Make sure only the first process in distributed training process the dataset, and the others will use the cache
+
     processor = processors[task]()
     output_mode = output_modes[task]
     # Load data features from cache or dataset file
@@ -269,6 +276,9 @@ def load_and_cache_examples(args, task, tokenizer, evaluate=False):
             logger.info("Saving features into cached file %s", cached_features_file)
             torch.save(features, cached_features_file)
 
+    if args.local_rank == 0:
+        torch.distributed.barrier() # Make sure only the first process in distributed training process the dataset, and the others will use the cache
+
     # Convert to Tensors and build dataset
     all_input_ids = torch.tensor([f.input_ids for f in features], dtype=torch.long)
     all_input_mask = torch.tensor([f.input_mask for f in features], dtype=torch.long)
@@ -418,8 +428,6 @@ def main():
|
|||||||
torch.distributed.barrier() # Make sure only the first process in distributed training will download model & vocab
|
torch.distributed.barrier() # Make sure only the first process in distributed training will download model & vocab
|
||||||
|
|
||||||
model.to(args.device)
|
model.to(args.device)
|
||||||
if args.n_gpu > 1:
|
|
||||||
model = torch.nn.DataParallel(model)
|
|
||||||
|
|
||||||
logger.info("Training/evaluation parameters %s", args)
|
logger.info("Training/evaluation parameters %s", args)
|
||||||
|
|
||||||
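Note on the barrier changes above: the two `torch.distributed.barrier()` calls added to `load_and_cache_examples` form a pair. Every rank other than 0 blocks at the first call, rank 0 builds and saves the feature cache, and rank 0's own barrier call then releases the others so they read the cache. The sketch below imitates that ordering with `threading.Barrier` from the standard library as a stand-in for the collective. The names `load_with_shared_cache` and `run` and the in-memory `cache` dict are hypothetical and are not part of the example scripts.

```python
import threading


def load_with_shared_cache(rank, barrier, cache, built_by):
    """Rank 0 builds the cache; every other rank waits, then reuses it."""
    if rank not in (-1, 0):
        # Non-zero ranks block here until rank 0 reaches its own wait() below.
        barrier.wait()

    if "features" not in cache:
        # Only the first process (or a non-distributed run, rank -1) gets here.
        cache["features"] = [1, 2, 3]
        built_by.append(rank)

    if rank == 0:
        # Rank 0 arrives last, which releases all the waiting ranks.
        barrier.wait()

    return cache["features"]


def run(world_size=4):
    """Start one thread per simulated rank and collect what each one loaded."""
    barrier = threading.Barrier(world_size)
    cache, built_by, results = {}, [], {}

    def worker(rank):
        results[rank] = load_with_shared_cache(rank, barrier, cache, built_by)

    threads = [threading.Thread(target=worker, args=(r,)) for r in range(world_size)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return built_by, results


if __name__ == "__main__":
    built_by, results = run(4)
    print("cache built by ranks:", built_by)
    print("ranks that loaded features:", sorted(results))
```

Because every participant must reach the barrier exactly once, the two conditions (`not in [-1, 0]` before the work, `== 0` after it) have to stay complementary; a rank that skipped both calls or hit both would leave the others waiting.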
|
|||||||
@@ -101,6 +101,10 @@ def train(args, train_dataset, model, tokenizer):
|
|||||||
raise ImportError("Please install apex from https://www.github.com/nvidia/apex to use fp16 training.")
|
raise ImportError("Please install apex from https://www.github.com/nvidia/apex to use fp16 training.")
|
||||||
model, optimizer = amp.initialize(model, optimizer, opt_level=args.fp16_opt_level)
|
model, optimizer = amp.initialize(model, optimizer, opt_level=args.fp16_opt_level)
|
||||||
|
|
||||||
|
# multi-gpu training (should be after apex fp16 initialization)
|
||||||
|
if args.n_gpu > 1:
|
||||||
|
model = torch.nn.DataParallel(model)
|
||||||
|
|
||||||
# Distributed training (should be after apex fp16 initialization)
|
# Distributed training (should be after apex fp16 initialization)
|
||||||
if args.local_rank != -1:
|
if args.local_rank != -1:
|
||||||
model = torch.nn.parallel.DistributedDataParallel(model, device_ids=[args.local_rank],
|
model = torch.nn.parallel.DistributedDataParallel(model, device_ids=[args.local_rank],
|
||||||
@@ -241,7 +245,10 @@ def evaluate(args, model, tokenizer, prefix=""):
|
|||||||
# Compute predictions
|
# Compute predictions
|
||||||
output_prediction_file = os.path.join(args.output_dir, "predictions_{}.json".format(prefix))
|
output_prediction_file = os.path.join(args.output_dir, "predictions_{}.json".format(prefix))
|
||||||
output_nbest_file = os.path.join(args.output_dir, "nbest_predictions_{}.json".format(prefix))
|
output_nbest_file = os.path.join(args.output_dir, "nbest_predictions_{}.json".format(prefix))
|
||||||
output_null_log_odds_file = os.path.join(args.output_dir, "null_odds_{}.json".format(prefix))
|
if args.version_2_with_negative:
|
||||||
|
output_null_log_odds_file = os.path.join(args.output_dir, "null_odds_{}.json".format(prefix))
|
||||||
|
else:
|
||||||
|
output_null_log_odds_file = None
|
||||||
|
|
||||||
if args.model_type in ['xlnet', 'xlm']:
|
if args.model_type in ['xlnet', 'xlm']:
|
||||||
# XLNet uses a more complex post-processing procedure
|
# XLNet uses a more complex post-processing procedure
|
||||||
@@ -265,6 +272,9 @@ def evaluate(args, model, tokenizer, prefix=""):
|
|||||||
|
|
||||||
|
|
||||||
def load_and_cache_examples(args, tokenizer, evaluate=False, output_examples=False):
|
def load_and_cache_examples(args, tokenizer, evaluate=False, output_examples=False):
|
||||||
|
if args.local_rank not in [-1, 0]:
|
||||||
|
torch.distributed.barrier() # Make sure only the first process in distributed training process the dataset, and the others will use the cache
|
||||||
|
|
||||||
# Load data features from cache or dataset file
|
# Load data features from cache or dataset file
|
||||||
input_file = args.predict_file if evaluate else args.train_file
|
input_file = args.predict_file if evaluate else args.train_file
|
||||||
cached_features_file = os.path.join(os.path.dirname(input_file), 'cached_{}_{}_{}'.format(
|
cached_features_file = os.path.join(os.path.dirname(input_file), 'cached_{}_{}_{}'.format(
|
||||||
@@ -289,6 +299,9 @@ def load_and_cache_examples(args, tokenizer, evaluate=False, output_examples=Fal
|
|||||||
logger.info("Saving features into cached file %s", cached_features_file)
|
logger.info("Saving features into cached file %s", cached_features_file)
|
||||||
torch.save(features, cached_features_file)
|
torch.save(features, cached_features_file)
|
||||||
|
|
||||||
|
if args.local_rank == 0:
|
||||||
|
torch.distributed.barrier() # Make sure only the first process in distributed training process the dataset, and the others will use the cache
|
||||||
|
|
||||||
# Convert to Tensors and build dataset
|
# Convert to Tensors and build dataset
|
||||||
all_input_ids = torch.tensor([f.input_ids for f in features], dtype=torch.long)
|
all_input_ids = torch.tensor([f.input_ids for f in features], dtype=torch.long)
|
||||||
all_input_mask = torch.tensor([f.input_mask for f in features], dtype=torch.long)
|
all_input_mask = torch.tensor([f.input_mask for f in features], dtype=torch.long)
|
||||||
@@ -457,8 +470,6 @@ def main():
|
|||||||
torch.distributed.barrier() # Make sure only the first process in distributed training will download model & vocab
|
torch.distributed.barrier() # Make sure only the first process in distributed training will download model & vocab
|
||||||
|
|
||||||
model.to(args.device)
|
model.to(args.device)
|
||||||
if args.n_gpu > 1:
|
|
||||||
model = torch.nn.DataParallel(model)
|
|
||||||
|
|
||||||
logger.info("Training/evaluation parameters %s", args)
|
logger.info("Training/evaluation parameters %s", args)
|
||||||
|
|
||||||
|
|||||||
@@ -10,20 +10,20 @@ from .tokenization_utils import (PreTrainedTokenizer)
|
|||||||
|
|
||||||
from .modeling_auto import (AutoConfig, AutoModel)
|
from .modeling_auto import (AutoConfig, AutoModel)
|
||||||
|
|
||||||
from .modeling_bert import (BertConfig, BertModel, BertForPreTraining,
|
from .modeling_bert import (BertConfig, BertPreTrainedModel, BertModel, BertForPreTraining,
|
||||||
BertForMaskedLM, BertForNextSentencePrediction,
|
BertForMaskedLM, BertForNextSentencePrediction,
|
||||||
BertForSequenceClassification, BertForMultipleChoice,
|
BertForSequenceClassification, BertForMultipleChoice,
|
||||||
BertForTokenClassification, BertForQuestionAnswering,
|
BertForTokenClassification, BertForQuestionAnswering,
|
||||||
load_tf_weights_in_bert, BERT_PRETRAINED_MODEL_ARCHIVE_MAP,
|
load_tf_weights_in_bert, BERT_PRETRAINED_MODEL_ARCHIVE_MAP,
|
||||||
BERT_PRETRAINED_CONFIG_ARCHIVE_MAP)
|
BERT_PRETRAINED_CONFIG_ARCHIVE_MAP)
|
||||||
from .modeling_openai import (OpenAIGPTConfig, OpenAIGPTModel,
|
from .modeling_openai import (OpenAIGPTConfig, OpenAIGPTPreTrainedModel, OpenAIGPTModel,
|
||||||
OpenAIGPTLMHeadModel, OpenAIGPTDoubleHeadsModel,
|
OpenAIGPTLMHeadModel, OpenAIGPTDoubleHeadsModel,
|
||||||
load_tf_weights_in_openai_gpt, OPENAI_GPT_PRETRAINED_CONFIG_ARCHIVE_MAP,
|
load_tf_weights_in_openai_gpt, OPENAI_GPT_PRETRAINED_CONFIG_ARCHIVE_MAP,
|
||||||
OPENAI_GPT_PRETRAINED_MODEL_ARCHIVE_MAP)
|
OPENAI_GPT_PRETRAINED_MODEL_ARCHIVE_MAP)
|
||||||
from .modeling_transfo_xl import (TransfoXLConfig, TransfoXLModel, TransfoXLLMHeadModel,
|
from .modeling_transfo_xl import (TransfoXLConfig, TransfoXLPreTrainedModel, TransfoXLModel, TransfoXLLMHeadModel,
|
||||||
load_tf_weights_in_transfo_xl, TRANSFO_XL_PRETRAINED_CONFIG_ARCHIVE_MAP,
|
load_tf_weights_in_transfo_xl, TRANSFO_XL_PRETRAINED_CONFIG_ARCHIVE_MAP,
|
||||||
TRANSFO_XL_PRETRAINED_MODEL_ARCHIVE_MAP)
|
TRANSFO_XL_PRETRAINED_MODEL_ARCHIVE_MAP)
|
||||||
from .modeling_gpt2 import (GPT2Config, GPT2Model,
|
from .modeling_gpt2 import (GPT2Config, GPT2PreTrainedModel, GPT2Model,
|
||||||
GPT2LMHeadModel, GPT2DoubleHeadsModel,
|
GPT2LMHeadModel, GPT2DoubleHeadsModel,
|
||||||
load_tf_weights_in_gpt2, GPT2_PRETRAINED_CONFIG_ARCHIVE_MAP,
|
load_tf_weights_in_gpt2, GPT2_PRETRAINED_CONFIG_ARCHIVE_MAP,
|
||||||
GPT2_PRETRAINED_MODEL_ARCHIVE_MAP)
|
GPT2_PRETRAINED_MODEL_ARCHIVE_MAP)
|
||||||
@@ -32,7 +32,7 @@ from .modeling_xlnet import (XLNetConfig,
|
|||||||
XLNetForSequenceClassification, XLNetForQuestionAnswering,
|
XLNetForSequenceClassification, XLNetForQuestionAnswering,
|
||||||
load_tf_weights_in_xlnet, XLNET_PRETRAINED_CONFIG_ARCHIVE_MAP,
|
load_tf_weights_in_xlnet, XLNET_PRETRAINED_CONFIG_ARCHIVE_MAP,
|
||||||
XLNET_PRETRAINED_MODEL_ARCHIVE_MAP)
|
XLNET_PRETRAINED_MODEL_ARCHIVE_MAP)
|
||||||
from .modeling_xlm import (XLMConfig, XLMModel,
|
from .modeling_xlm import (XLMConfig, XLMPreTrainedModel , XLMModel,
|
||||||
XLMWithLMHeadModel, XLMForSequenceClassification,
|
XLMWithLMHeadModel, XLMForSequenceClassification,
|
||||||
XLMForQuestionAnswering, XLM_PRETRAINED_CONFIG_ARCHIVE_MAP,
|
XLMForQuestionAnswering, XLM_PRETRAINED_CONFIG_ARCHIVE_MAP,
|
||||||
XLM_PRETRAINED_MODEL_ARCHIVE_MAP)
|
XLM_PRETRAINED_MODEL_ARCHIVE_MAP)
|
||||||
|
|||||||
@@ -41,7 +41,7 @@ def convert_pytorch_checkpoint_to_tf(model:BertModel, ckpt_dir:str, model_name:s
|
|||||||
N BertForQuestionAnswering
|
N BertForQuestionAnswering
|
||||||
"""
|
"""
|
||||||
|
|
||||||
tensors_to_transopse = (
|
tensors_to_transpose = (
|
||||||
"dense.weight",
|
"dense.weight",
|
||||||
"attention.self.query",
|
"attention.self.query",
|
||||||
"attention.self.key",
|
"attention.self.key",
|
||||||
@@ -62,34 +62,34 @@ def convert_pytorch_checkpoint_to_tf(model:BertModel, ckpt_dir:str, model_name:s
|
|||||||
if not os.path.isdir(ckpt_dir):
|
if not os.path.isdir(ckpt_dir):
|
||||||
os.makedirs(ckpt_dir)
|
os.makedirs(ckpt_dir)
|
||||||
|
|
||||||
session = tf.Session()
|
|
||||||
state_dict = model.state_dict()
|
state_dict = model.state_dict()
|
||||||
tf_vars = []
|
|
||||||
|
|
||||||
def to_tf_var_name(name:str):
|
def to_tf_var_name(name:str):
|
||||||
for patt, repl in iter(var_map):
|
for patt, repl in iter(var_map):
|
||||||
name = name.replace(patt, repl)
|
name = name.replace(patt, repl)
|
||||||
return 'bert/{}'.format(name)
|
return 'bert/{}'.format(name)
|
||||||
|
|
||||||
def assign_tf_var(tensor:np.ndarray, name:str):
|
def create_tf_var(tensor:np.ndarray, name:str, session:tf.Session):
|
||||||
tmp_var = tf.Variable(initial_value=tensor)
|
tf_dtype = tf.dtypes.as_dtype(tensor.dtype)
|
||||||
tf_var = tf.get_variable(dtype=tmp_var.dtype, shape=tmp_var.shape, name=name)
|
tf_var = tf.get_variable(dtype=tf_dtype, shape=tensor.shape, name=name, initializer=tf.zeros_initializer())
|
||||||
op = tf.assign(ref=tf_var, value=tmp_var)
|
session.run(tf.variables_initializer([tf_var]))
|
||||||
session.run(tf.variables_initializer([tmp_var, tf_var]))
|
session.run(tf_var)
|
||||||
session.run(fetches=[op, tf_var])
|
|
||||||
return tf_var
|
return tf_var
|
||||||
|
|
||||||
for var_name in state_dict:
|
tf.reset_default_graph()
|
||||||
tf_name = to_tf_var_name(var_name)
|
with tf.Session() as session:
|
||||||
torch_tensor = state_dict[var_name].numpy()
|
for var_name in state_dict:
|
||||||
if any([x in var_name for x in tensors_to_transopse]):
|
tf_name = to_tf_var_name(var_name)
|
||||||
torch_tensor = torch_tensor.T
|
torch_tensor = state_dict[var_name].numpy()
|
||||||
tf_tensor = assign_tf_var(tensor=torch_tensor, name=tf_name)
|
if any([x in var_name for x in tensors_to_transpose]):
|
||||||
tf_vars.append(tf_tensor)
|
torch_tensor = torch_tensor.T
|
||||||
print("{0}{1}initialized".format(tf_name, " " * (60 - len(tf_name))))
|
tf_var = create_tf_var(tensor=torch_tensor, name=tf_name, session=session)
|
||||||
|
tf.keras.backend.set_value(tf_var, torch_tensor)
|
||||||
|
tf_weight = session.run(tf_var)
|
||||||
|
print("Successfully created {}: {}".format(tf_name, np.allclose(tf_weight, torch_tensor)))
|
||||||
|
|
||||||
saver = tf.train.Saver(tf_vars)
|
saver = tf.train.Saver(tf.trainable_variables())
|
||||||
saver.save(session, os.path.join(ckpt_dir, model_name.replace("-", "_") + ".ckpt"))
|
saver.save(session, os.path.join(ckpt_dir, model_name.replace("-", "_") + ".ckpt"))
|
||||||
|
|
||||||
|
|
||||||
def main(raw_args=None):
|
def main(raw_args=None):
|
||||||
|
|||||||
@@ -24,11 +24,10 @@ from io import open
|
|||||||
import torch
|
import torch
|
||||||
|
|
||||||
import pytorch_transformers.tokenization_transfo_xl as data_utils
|
import pytorch_transformers.tokenization_transfo_xl as data_utils
|
||||||
from pytorch_transformers.modeling_transfo_xl import (CONFIG_NAME,
|
|
||||||
WEIGHTS_NAME,
|
from pytorch_transformers import CONFIG_NAME, WEIGHTS_NAME
|
||||||
TransfoXLConfig,
|
from pytorch_transformers.modeling_transfo_xl import (TransfoXLConfig, TransfoXLLMHeadModel,
|
||||||
TransfoXLLMHeadModel,
|
load_tf_weights_in_transfo_xl)
|
||||||
load_tf_weights_in_transfo_xl)
|
|
||||||
from pytorch_transformers.tokenization_transfo_xl import (CORPUS_NAME, VOCAB_FILES_NAMES)
|
from pytorch_transformers.tokenization_transfo_xl import (CORPUS_NAME, VOCAB_FILES_NAMES)
|
||||||
|
|
||||||
if sys.version_info[0] == 2:
|
if sys.version_info[0] == 2:
|
||||||
|
|||||||
@@ -538,7 +538,7 @@ class OpenAIGPTLMHeadModel(OpenAIGPTPreTrainedModel):
|
|||||||
r"""
|
r"""
|
||||||
**labels**: (`optional`) ``torch.LongTensor`` of shape ``(batch_size, sequence_length)``:
|
**labels**: (`optional`) ``torch.LongTensor`` of shape ``(batch_size, sequence_length)``:
|
||||||
Labels for language modeling.
|
Labels for language modeling.
|
||||||
Note that the labels **are shifted** inside the model, i.e. you can set ``lm_labels = input_ids``
|
Note that the labels **are shifted** inside the model, i.e. you can set ``labels = input_ids``
|
||||||
Indices are selected in ``[-1, 0, ..., config.vocab_size]``
|
Indices are selected in ``[-1, 0, ..., config.vocab_size]``
|
||||||
All labels set to ``-1`` are ignored (masked), the loss is only
|
All labels set to ``-1`` are ignored (masked), the loss is only
|
||||||
computed for labels in ``[0, ..., config.vocab_size]``
|
computed for labels in ``[0, ..., config.vocab_size]``
|
||||||
|
|||||||
@@ -39,6 +39,20 @@ WEIGHTS_NAME = "pytorch_model.bin"
|
|||||||
TF_WEIGHTS_NAME = 'model.ckpt'
|
TF_WEIGHTS_NAME = 'model.ckpt'
|
||||||
|
|
||||||
|
|
||||||
|
try:
|
||||||
|
from torch.nn import Identity
|
||||||
|
except ImportError:
|
||||||
|
# Older PyTorch compatibility
|
||||||
|
class Identity(nn.Module):
|
||||||
|
r"""A placeholder identity operator that is argument-insensitive.
|
||||||
|
"""
|
||||||
|
def __init__(self, *args, **kwargs):
|
||||||
|
super(Identity, self).__init__()
|
||||||
|
|
||||||
|
def forward(self, input):
|
||||||
|
return input
|
||||||
|
|
||||||
|
|
||||||
if not six.PY2:
|
if not six.PY2:
|
||||||
def add_start_docstrings(*docstr):
|
def add_start_docstrings(*docstr):
|
||||||
def docstring_decorator(fn):
|
def docstring_decorator(fn):
|
||||||
@@ -783,7 +797,7 @@ class SequenceSummary(nn.Module):
|
|||||||
# We can probably just use the multi-head attention module of PyTorch >=1.1.0
|
# We can probably just use the multi-head attention module of PyTorch >=1.1.0
|
||||||
raise NotImplementedError
|
raise NotImplementedError
|
||||||
|
|
||||||
self.summary = nn.Identity()
|
self.summary = Identity()
|
||||||
if hasattr(config, 'summary_use_proj') and config.summary_use_proj:
|
if hasattr(config, 'summary_use_proj') and config.summary_use_proj:
|
||||||
if hasattr(config, 'summary_proj_to_labels') and config.summary_proj_to_labels and config.num_labels > 0:
|
if hasattr(config, 'summary_proj_to_labels') and config.summary_proj_to_labels and config.num_labels > 0:
|
||||||
num_classes = config.num_labels
|
num_classes = config.num_labels
|
||||||
@@ -791,15 +805,15 @@ class SequenceSummary(nn.Module):
|
|||||||
num_classes = config.hidden_size
|
num_classes = config.hidden_size
|
||||||
self.summary = nn.Linear(config.hidden_size, num_classes)
|
self.summary = nn.Linear(config.hidden_size, num_classes)
|
||||||
|
|
||||||
self.activation = nn.Identity()
|
self.activation = Identity()
|
||||||
if hasattr(config, 'summary_activation') and config.summary_activation == 'tanh':
|
if hasattr(config, 'summary_activation') and config.summary_activation == 'tanh':
|
||||||
self.activation = nn.Tanh()
|
self.activation = nn.Tanh()
|
||||||
|
|
||||||
self.first_dropout = nn.Identity()
|
self.first_dropout = Identity()
|
||||||
if hasattr(config, 'summary_first_dropout') and config.summary_first_dropout > 0:
|
if hasattr(config, 'summary_first_dropout') and config.summary_first_dropout > 0:
|
||||||
self.first_dropout = nn.Dropout(config.summary_first_dropout)
|
self.first_dropout = nn.Dropout(config.summary_first_dropout)
|
||||||
|
|
||||||
self.last_dropout = nn.Identity()
|
self.last_dropout = Identity()
|
||||||
if hasattr(config, 'summary_last_dropout') and config.summary_last_dropout > 0:
|
if hasattr(config, 'summary_last_dropout') and config.summary_last_dropout > 0:
|
||||||
self.last_dropout = nn.Dropout(config.summary_last_dropout)
|
self.last_dropout = nn.Dropout(config.summary_last_dropout)
|
||||||
|
|
||||||
|
|||||||
@@ -226,26 +226,46 @@ class PreTrainedTokenizer(object):
|
|||||||
s3_models = list(cls.max_model_input_sizes.keys())
|
s3_models = list(cls.max_model_input_sizes.keys())
|
||||||
vocab_files = {}
|
vocab_files = {}
|
||||||
if pretrained_model_name_or_path in s3_models:
|
if pretrained_model_name_or_path in s3_models:
|
||||||
|
# Get the vocabulary from AWS S3 bucket
|
||||||
for file_id, map_list in cls.pretrained_vocab_files_map.items():
|
for file_id, map_list in cls.pretrained_vocab_files_map.items():
|
||||||
vocab_files[file_id] = map_list[pretrained_model_name_or_path]
|
vocab_files[file_id] = map_list[pretrained_model_name_or_path]
|
||||||
else:
|
else:
|
||||||
|
# Get the vocabulary from local files
|
||||||
logger.info(
|
logger.info(
|
||||||
"Model name '{}' not found in model shortcut name list ({}). "
|
"Model name '{}' not found in model shortcut name list ({}). "
|
||||||
"Assuming '{}' is a path or url to a directory containing tokenizer files.".format(
|
"Assuming '{}' is a path or url to a directory containing tokenizer files.".format(
|
||||||
pretrained_model_name_or_path, ', '.join(s3_models),
|
pretrained_model_name_or_path, ', '.join(s3_models),
|
||||||
pretrained_model_name_or_path))
|
pretrained_model_name_or_path))
|
||||||
all_vocab_files_names = {'added_tokens_file': ADDED_TOKENS_FILE,
|
|
||||||
'special_tokens_map_file': SPECIAL_TOKENS_MAP_FILE}
|
# Look for the tokenizer main vocabulary files
|
||||||
all_vocab_files_names.update(cls.vocab_files_names)
|
for file_id, file_name in cls.vocab_files_names.items():
|
||||||
for file_id, file_name in all_vocab_files_names.items():
|
|
||||||
if os.path.isdir(pretrained_model_name_or_path):
|
if os.path.isdir(pretrained_model_name_or_path):
|
||||||
|
# If a directory is provided we look for the standard filenames
|
||||||
full_file_name = os.path.join(pretrained_model_name_or_path, file_name)
|
full_file_name = os.path.join(pretrained_model_name_or_path, file_name)
|
||||||
else:
|
else:
|
||||||
|
# If a path to a file is provided we use it (will only work for non-BPE tokenizer using a single vocabulary file)
|
||||||
full_file_name = pretrained_model_name_or_path
|
full_file_name = pretrained_model_name_or_path
|
||||||
if not os.path.exists(full_file_name):
|
if not os.path.exists(full_file_name):
|
||||||
logger.info("Didn't find file {}. We won't load it.".format(full_file_name))
|
logger.info("Didn't find file {}. We won't load it.".format(full_file_name))
|
||||||
full_file_name = None
|
full_file_name = None
|
||||||
vocab_files[file_id] = full_file_name
|
vocab_files[file_id] = full_file_name
|
||||||
|
|
||||||
|
# Look for the additional tokens files
|
||||||
|
all_vocab_files_names = {'added_tokens_file': ADDED_TOKENS_FILE,
|
||||||
|
'special_tokens_map_file': SPECIAL_TOKENS_MAP_FILE}
|
||||||
|
|
||||||
|
# If a path to a file was provided, get the parent directory
|
||||||
|
saved_directory = pretrained_model_name_or_path
|
||||||
|
if os.path.exists(saved_directory) and not os.path.isdir(saved_directory):
|
||||||
|
saved_directory = os.path.dirname(saved_directory)
|
||||||
|
|
||||||
|
for file_id, file_name in all_vocab_files_names.items():
|
||||||
|
full_file_name = os.path.join(saved_directory, file_name)
|
||||||
|
if not os.path.exists(full_file_name):
|
||||||
|
logger.info("Didn't find file {}. We won't load it.".format(full_file_name))
|
||||||
|
full_file_name = None
|
||||||
|
vocab_files[file_id] = full_file_name
|
||||||
|
|
||||||
if all(full_file_name is None for full_file_name in vocab_files.values()):
|
if all(full_file_name is None for full_file_name in vocab_files.values()):
|
||||||
logger.error(
|
logger.error(
|
||||||
"Model name '{}' was not found in model name list ({}). "
|
"Model name '{}' was not found in model name list ({}). "
|
||||||
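Note on the hunk above: the lookup is now split in two passes. The main vocabulary files are resolved either inside a directory or as a single file path, while `added_tokens_file` and `special_tokens_map_file` are always looked up in a directory, falling back to the parent directory when a file path was given. A minimal standalone sketch of that logic follows. `resolve_vocab_files` and `demo` are hypothetical helpers, and the two constant values are assumed here for illustration rather than taken from the diff.

```python
import os
import tempfile

# Assumed values for the constants referenced in the diff (illustration only).
ADDED_TOKENS_FILE = "added_tokens.json"
SPECIAL_TOKENS_MAP_FILE = "special_tokens_map.json"


def resolve_vocab_files(path, vocab_files_names):
    """Map each file id to an existing path, or to None when it is missing."""
    vocab_files = {}

    # Pass 1: main vocabulary files.
    for file_id, file_name in vocab_files_names.items():
        if os.path.isdir(path):
            # A directory was given: look for the standard file name inside it.
            full_file_name = os.path.join(path, file_name)
        else:
            # A single file was given: use it directly.
            full_file_name = path
        vocab_files[file_id] = full_file_name if os.path.exists(full_file_name) else None

    # Pass 2: additional token files, always searched for in a directory.
    saved_directory = path
    if os.path.exists(saved_directory) and not os.path.isdir(saved_directory):
        saved_directory = os.path.dirname(saved_directory)

    extra_files = {"added_tokens_file": ADDED_TOKENS_FILE,
                   "special_tokens_map_file": SPECIAL_TOKENS_MAP_FILE}
    for file_id, file_name in extra_files.items():
        full_file_name = os.path.join(saved_directory, file_name)
        vocab_files[file_id] = full_file_name if os.path.exists(full_file_name) else None

    return vocab_files


def demo():
    """Resolve files from a path that points at a vocabulary file, not a directory."""
    with tempfile.TemporaryDirectory() as tmp_dir:
        vocab_path = os.path.join(tmp_dir, "vocab.txt")
        added_path = os.path.join(tmp_dir, ADDED_TOKENS_FILE)
        for file_path in (vocab_path, added_path):
            with open(file_path, "w", encoding="utf-8") as f:
                f.write("{}")
        found = resolve_vocab_files(vocab_path, {"vocab_file": "vocab.txt"})
        # Report base names only, since the temporary directory name varies.
        return {k: (os.path.basename(v) if v else None) for k, v in found.items()}


if __name__ == "__main__":
    print(demo())
```

With the earlier single-loop version, passing a file path would have made every entry, including the added-tokens file, point at that same vocabulary file, which appears to be what the split is meant to avoid.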
@@ -333,7 +353,7 @@ class PreTrainedTokenizer(object):
|
|||||||
|
|
||||||
with open(added_tokens_file, 'w', encoding='utf-8') as f:
|
with open(added_tokens_file, 'w', encoding='utf-8') as f:
|
||||||
if self.added_tokens_encoder:
|
if self.added_tokens_encoder:
|
||||||
out_str = json.dumps(self.added_tokens_decoder, ensure_ascii=False)
|
out_str = json.dumps(self.added_tokens_encoder, ensure_ascii=False)
|
||||||
else:
|
else:
|
||||||
out_str = u"{}"
|
out_str = u"{}"
|
||||||
f.write(out_str)
|
f.write(out_str)
|
||||||
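Note on the one-line fix above: the saved file now holds `added_tokens_encoder` (token to id) instead of `added_tokens_decoder` (id to token). A small sketch of why the direction matters for a JSON round trip is shown below. The token string and id are made-up sample values, and the assumption that the loader expects a token-to-id mapping is inferred from the fix, not shown in this diff.

```python
import json

# Made-up sample data: one added token and its id.
added_tokens_encoder = {"<new_tok>": 30522}  # token -> id
added_tokens_decoder = {v: k for k, v in added_tokens_encoder.items()}  # id -> token

# Round-trip both mappings through JSON, as a save followed by a load would.
saved_encoder = json.loads(json.dumps(added_tokens_encoder, ensure_ascii=False))
saved_decoder = json.loads(json.dumps(added_tokens_decoder, ensure_ascii=False))

# The encoder survives unchanged: string keys, integer values.
print(saved_encoder == added_tokens_encoder)

# The decoder does not: JSON object keys are always strings, so the integer
# ids come back as text and the mapping no longer matches the original.
print(saved_decoder == added_tokens_decoder)
```

Saving the encoder keeps the file loadable as a token-to-id table without any key conversion, and the decoder can be rebuilt by inverting it after loading.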
|
|||||||