Compare commits
77 Commits
| Author | SHA1 | Date |
|---|---|---|
| | 10d72390c0 | |
| | e0db6bbd65 | |
| | bd6e301832 | |
| | a086527727 | |
| | 9d2ce253de | |
| | 49296533ca | |
| | 271bedb485 | |
| | 865d4d595e | |
| | a3af8e86cb | |
| | eacea530c1 | |
| | fa2fbed3e5 | |
| | e708bb75bf | |
| | 49c06132df | |
| | cacb654c7f | |
| | 30a09f3827 | |
| | 14cb5b35fa | |
| | 6dc52c78d8 | |
| | ed5456daf4 | |
| | c76450e20c | |
| | 9907dc523a | |
| | efbc1c5a9d | |
| | 956c4c4eb4 | |
| | 48c3a70b4e | |
| | aa925a52fa | |
| | 5856999a9f | |
| | 07dd7c2fd8 | |
| | 8f1d047148 | |
| | 31eedff5a0 | |
| | 384f0eb2f9 | |
| | bf14ef75f1 | |
| | 5e7fe8b585 | |
| | 4c06893610 | |
| | 9de4afa897 | |
| | 42e8fbfc51 | |
| | 54065d68b8 | |
| | e28b7e2311 | |
| | 09b933f19d | |
| | 235777ccc9 | |
| | 9ddd3a6548 | |
| | c5aa114392 | |
| | ca4a3f4da9 | |
| | 24538df919 | |
| | a699525d25 | |
| | d9ece8233d | |
| | d39bf0ac2d | |
| | 590adb130b | |
| | 026a5d0888 | |
| | fa6113f9a0 | |
| | 757baee846 | |
| | a27c795908 | |
| | 31c799a0c9 | |
| | 8581a670e3 | |
| | 18d233d525 | |
| | 3e0f062106 | |
| | fc2a4c88ce | |
| | 55bda52555 | |
| | ad02c961c6 | |
| | 15550ce0d1 | |
| | 62427d0815 | |
| | 34706ba050 | |
| | edf9ac11d4 | |
| | b908f2e9dd | |
| | af2e6bf87c | |
| | 7defc6670f | |
| | 84894974bd | |
| | db0076a9df | |
| | 2d05480174 | |
| | 035678efdb | |
| | b9c9e05381 | |
| | 9535bf1977 | |
| | 7822cd38a0 | |
| | 448c467256 | |
| | c547f15a17 | |
| | 015f7812ed | |
| | ef46ccb05c | |
| | 94cb73c2d2 | |
| | a0eebdc404 | |
2
.github/workflows/github-torch-hub.yml
vendored
@@ -21,7 +21,7 @@ jobs:
|
||||
- name: Install dependencies
|
||||
run: |
|
||||
pip install torch
|
||||
pip install numpy tokenizers filelock requests tqdm regex sentencepiece sacremoses
|
||||
pip install numpy tokenizers filelock requests tqdm regex sentencepiece sacremoses packaging
|
||||
|
||||
- name: Torch hub list
|
||||
run: |
|
||||
|
||||
2
.github/workflows/self-push.yml
vendored
@@ -35,7 +35,7 @@ jobs:
|
||||
- name: Install dependencies
|
||||
run: |
|
||||
source .env/bin/activate
|
||||
pip install torch==1.4.0
|
||||
pip install torch
|
||||
pip install .[sklearn,testing]
|
||||
|
||||
- name: Are GPUs recognized by our DL frameworks
|
||||
|
||||
@@ -198,11 +198,12 @@ Follow these steps to start contributing:
|
||||
are useful to avoid duplicated work, and to differentiate it from PRs ready
|
||||
to be merged;
|
||||
4. Make sure existing tests pass;
|
||||
5. Add high-coverage tests. No quality test, no merge.
|
||||
5. Add high-coverage tests. No quality testing = no merge.
|
||||
- If you are adding a new model, make sure that you use `ModelTester.all_model_classes = (MyModel, MyModelWithLMHead,...)`, which triggers the common tests.
|
||||
- If you are adding new `@slow` tests, make sure they pass using `RUN_SLOW=1 python -m pytest tests/test_my_new_model.py`.
|
||||
- If you are adding a new tokenizer, write tests, and make sure `RUN_SLOW=1 python -m pytest tests/test_tokenization_{your_model_name}.py` passes.
|
||||
CircleCI does not run them.
|
||||
6. All public methods must have informative docstrings;
|
||||
6. All public methods must have informative docstrings that work nicely with sphinx. See `modeling_ctrl.py` for an example.
|
||||
|
||||
### Tests
|
||||
|
||||
|
||||
@@ -165,8 +165,9 @@ At some point in the future, you'll be able to seamlessly move from pre-training
|
||||
18. **[DialoGPT](https://huggingface.co/transformers/model_doc/dialogpt.html)** (from Microsoft Research) released with the paper [DialoGPT: Large-Scale Generative Pre-training for Conversational Response Generation](https://arxiv.org/abs/1911.00536) by Yizhe Zhang, Siqi Sun, Michel Galley, Yen-Chun Chen, Chris Brockett, Xiang Gao, Jianfeng Gao, Jingjing Liu, Bill Dolan.
|
||||
19. **[Reformer](https://huggingface.co/transformers/model_doc/reformer.html)** (from Google Research) released with the paper [Reformer: The Efficient Transformer](https://arxiv.org/abs/2001.04451) by Nikita Kitaev, Łukasz Kaiser, Anselm Levskaya.
|
||||
20. **[MarianMT](https://huggingface.co/transformers/model_doc/marian.html)** Machine translation models trained using [OPUS](http://opus.nlpl.eu/) data by Jörg Tiedemann. The [Marian Framework](https://marian-nmt.github.io/) is being developed by the Microsoft Translator Team.
|
||||
21. **[Other community models](https://huggingface.co/models)**, contributed by the [community](https://huggingface.co/users).
|
||||
22. Want to contribute a new model? We have added a **detailed guide and templates** to guide you in the process of adding a new model. You can find them in the [`templates`](./templates) folder of the repository. Be sure to check the [contributing guidelines](./CONTRIBUTING.md) and contact the maintainers or open an issue to collect feedbacks before starting your PR.
|
||||
21. **[Longformer](https://huggingface.co/transformers/model_doc/longformer.html)** (from AllenAI) released with the paper [Longformer: The Long-Document Transformer](https://arxiv.org/abs/2004.05150) by Iz Beltagy, Matthew E. Peters, Arman Cohan.
|
||||
22. **[Other community models](https://huggingface.co/models)**, contributed by the [community](https://huggingface.co/users).
|
||||
23. Want to contribute a new model? We have added a **detailed guide and templates** to guide you in the process of adding a new model. You can find them in the [`templates`](./templates) folder of the repository. Be sure to check the [contributing guidelines](./CONTRIBUTING.md) and contact the maintainers or open an issue to collect feedbacks before starting your PR.
|
||||
|
||||
These implementations have been tested on several datasets (see the example scripts) and should match the performances of the original implementations (e.g. ~93 F1 on SQuAD for BERT Whole-Word-Masking, ~88 F1 on RocStories for OpenAI GPT, ~18.3 perplexity on WikiText 103 for Transformer-XL, ~0.916 Pearson R coefficient on STS-B for XLNet). You can find more details on the performances in the Examples section of the [documentation](https://huggingface.co/transformers/examples.html).
|
||||
|
||||
|
||||
@@ -26,7 +26,7 @@ author = u'huggingface'
|
||||
# The short X.Y version
|
||||
version = u''
|
||||
# The full version, including alpha/beta/rc tags
|
||||
release = u'2.9.1'
|
||||
release = u'2.10.0'
|
||||
|
||||
|
||||
# -- General configuration ---------------------------------------------------
|
||||
|
||||
@@ -109,3 +109,4 @@ The library currently contains PyTorch and Tensorflow implementations, pre-train
|
||||
model_doc/dialogpt
|
||||
model_doc/reformer
|
||||
model_doc/marian
|
||||
model_doc/longformer
|
||||
|
||||
@@ -6,7 +6,7 @@ Overview
|
||||
|
||||
The ALBERT model was proposed in `ALBERT: A Lite BERT for Self-supervised Learning of Language Representations <https://arxiv.org/abs/1909.11942>`_
|
||||
by Zhenzhong Lan, Mingda Chen, Sebastian Goodman, Kevin Gimpel, Piyush Sharma, Radu Soricut. It presents
|
||||
two parameter-reduction techniques to lower memory consumption and increase the trainig speed of BERT:
|
||||
two parameter-reduction techniques to lower memory consumption and increase the training speed of BERT:
|
||||
|
||||
- Splitting the embedding matrix into two smaller matrices
|
||||
- Using repeating layers split among groups
|
||||
|
||||
69
docs/source/model_doc/longformer.rst
Normal file
@@ -0,0 +1,69 @@
Longformer
----------------------------------------------------
**DISCLAIMER:** This model is still a work in progress. If you see something strange,
file a `Github Issue <https://github.com/huggingface/transformers/issues/new?assignees=&labels=&template=bug-report.md&title>`_.

Overview
~~~~~~~~
The Longformer model was presented in `Longformer: The Long-Document Transformer <https://arxiv.org/pdf/2004.05150.pdf>`_ by Iz Beltagy, Matthew E. Peters, Arman Cohan.
Here is the abstract:

*Transformer-based models are unable to process long sequences due to their self-attention operation, which scales quadratically with the sequence length. To address this limitation, we introduce the Longformer with an attention mechanism that scales linearly with sequence length, making it easy to process documents of thousands of tokens or longer. Longformer's attention mechanism is a drop-in replacement for the standard self-attention and combines a local windowed attention with a task motivated global attention. Following prior work on long-sequence transformers, we evaluate Longformer on character-level language modeling and achieve state-of-the-art results on text8 and enwik8. In contrast to most prior work, we also pretrain Longformer and finetune it on a variety of downstream tasks. Our pretrained Longformer consistently outperforms RoBERTa on long document tasks and sets new state-of-the-art results on WikiHop and TriviaQA.*

The authors' code can be found `here <https://github.com/allenai/longformer>`_.

Longformer Self Attention
~~~~~~~~~~~~~~~~~~~~~~~~~
Longformer self attention employs self attention on both a "local" context and a "global" context.
Most tokens only attend "locally" to each other, meaning that each token attends to its :math:`\frac{1}{2} w` previous tokens and :math:`\frac{1}{2} w` succeeding tokens, with :math:`w` being the window length as defined in `config.attention_window`. Note that `config.attention_window` can be of type ``list`` to define a different :math:`w` for each layer.
A selected few tokens attend "globally" to all other tokens, as is conventionally done for all tokens in *e.g.* `BertSelfAttention`.

Note that "locally" and "globally" attending tokens are projected by different query, key and value matrices.
Also note that every "locally" attending token not only attends to tokens within its window :math:`w`, but also to all "globally" attending tokens so that global attention is *symmetric*.

The user can define which tokens are masked, which tokens attend "locally" and which tokens attend "globally" by setting the `attention_mask` input `torch.Tensor` appropriately. In contrast to other models, `Longformer` accepts the following values in `attention_mask`: `0` - the token is masked and not attended at all (as is done in other models), `1` - the token attends "locally", `2` - the token attends "globally". For more information, please also refer to the :func:`~transformers.LongformerModel.forward` method.
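The three mask values and the symmetric global attention described above can be illustrated with a small plain-Python sketch. It is not the library's implementation: `attended_positions` is a hypothetical helper, the mask is a plain list rather than a `torch.Tensor`, and it assumes that a token's window includes the token itself.

```python
from typing import List

# Mask values as described in the text: 0 = masked, 1 = local, 2 = global.
MASKED, LOCAL, GLOBAL = 0, 1, 2


def attended_positions(i: int, attention_mask: List[int], window: int) -> List[int]:
    """Return the positions that token `i` attends to (hypothetical helper)."""
    if attention_mask[i] == MASKED:
        return []  # a masked token takes no part in attention
    half = window // 2  # w/2 previous and w/2 succeeding tokens
    attended = []
    for j, value in enumerate(attention_mask):
        if value == MASKED:
            continue  # masked tokens are never attended to
        # A pair is connected if either side is global (symmetry),
        # or if the two tokens lie within the local window.
        if attention_mask[i] == GLOBAL or value == GLOBAL or abs(i - j) <= half:
            attended.append(j)
    return attended


# Illustrative mask: one global token, six local tokens, one masked (padding) token.
mask = [2, 1, 1, 1, 1, 1, 1, 0]
print(attended_positions(3, mask, window=2))  # [0, 2, 3, 4]
print(attended_positions(0, mask, window=2))  # [0, 1, 2, 3, 4, 5, 6]
```

The local token at position 3 sees its window plus the global token at position 0, while the global token sees every unmasked position.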

Using Longformer self attention, the memory and time complexity of the query-key matmul operation, which usually represents the memory and time bottleneck, can be reduced from :math:`\mathcal{O}(n_s \times n_s)` to :math:`\mathcal{O}(n_s \times w)`, with :math:`n_s` being the sequence length and :math:`w` being the average window size. It is assumed that the number of "globally" attending tokens is insignificant compared to the number of "locally" attending tokens.
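To make the complexity claim concrete, the following sketch counts the query-key score entries per attention head for full and windowed attention. The sequence length and window size are illustrative assumptions, `attention_score_entries` is a hypothetical helper, and global tokens are ignored, as the paragraph above assumes.

```python
from typing import Optional


def attention_score_entries(seq_len: int, window: Optional[int] = None) -> int:
    """Number of query-key scores per head (hypothetical helper).

    Full self attention scores every pair of tokens (n_s * n_s);
    windowed attention scores only `window` keys per query (n_s * w).
    """
    if window is None:
        return seq_len * seq_len
    return seq_len * window


# Illustrative values (assumed, not taken from the diff).
n_s, w = 4096, 512
full = attention_score_entries(n_s)
local = attention_score_entries(n_s, window=w)
print(full)           # 16777216
print(local)          # 2097152
print(full // local)  # 8
```

The saving factor is simply `n_s / w`, so it grows linearly with the sequence length when the window size stays fixed.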

For more information, please refer to the official `paper <https://arxiv.org/pdf/2004.05150.pdf>`_.


Training
~~~~~~~~~~~~~~~~~~~~
``LongformerForMaskedLM`` is trained exactly the same way as ``RobertaForMaskedLM`` and
should be used as follows:

::

    input_ids = tokenizer.encode('This is a sentence from [MASK] training data', return_tensors='pt')
    mlm_labels = tokenizer.encode('This is a sentence from the training data', return_tensors='pt')

    loss = model(input_ids, labels=input_ids, masked_lm_labels=mlm_labels)[0]


LongformerConfig
~~~~~~~~~~~~~~~~~~~~~

.. autoclass:: transformers.LongformerConfig
    :members:


LongformerTokenizer
~~~~~~~~~~~~~~~~~~~~~

.. autoclass:: transformers.LongformerTokenizer
    :members:


LongformerModel
~~~~~~~~~~~~~~~~~~~~

.. autoclass:: transformers.LongformerModel
    :members:


LongformerForMaskedLM
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

.. autoclass:: transformers.LongformerForMaskedLM
    :members:
@@ -305,3 +305,9 @@ For a list that includes community-uploaded models, refer to `https://huggingfac
|
||||
| MarianMT | ``Helsinki-NLP/opus-mt-{src}-{tgt}`` | | 12-layer, 512-hidden, 8-heads, ~74M parameter Machine translation models. Parameter counts vary depending on vocab size. |
|
||||
| | | | (see `model list <https://huggingface.co/Helsinki-NLP>`_) |
|
||||
+-------------------+------------------------------------------------------------+---------------------------------------------------------------------------------------------------------------------------------------+
|
||||
| Longformer | ``longformer-base-4096`` | | 12-layer, 768-hidden, 12-heads, ~149M parameters |
|
||||
| | | | Starting from RoBERTa-base checkpoint, trained on documents of max length 4,096 |
|
||||
| +------------------------------------------------------------+---------------------------------------------------------------------------------------------------------------------------------------+
|
||||
| | ``longformer-large-4096`` | | 24-layer, 1024-hidden, 16-heads, ~435M parameters |
|
||||
| | | | Starting from RoBERTa-large checkpoint, trained on documents of max length 4,096 |
|
||||
+-------------------+------------------------------------------------------------+---------------------------------------------------------------------------------------------------------------------------------------+
|
||||
|
||||
@@ -1,6 +1,6 @@
|
||||
# Examples
|
||||
## Examples
|
||||
|
||||
Version 2.9 of `transformers` introduces a new `Trainer` class for PyTorch, and its equivalent `TFTrainer` for TF 2.
|
||||
Version 2.9 of `transformers` introduces a new [`Trainer`](https://github.com/huggingface/transformers/blob/master/src/transformers/trainer.py) class for PyTorch, and its equivalent [`TFTrainer`](https://github.com/huggingface/transformers/blob/master/src/transformers/trainer_tf.py) for TF 2.
|
||||
|
||||
Here is the list of all our examples:
|
||||
- **grouped by task** (all official examples work for multiple models)
|
||||
@@ -12,32 +12,24 @@ Here is the list of all our examples:
|
||||
This is still a work-in-progress – in particular documentation is still sparse – so please **contribute improvements/pull requests.**
|
||||
|
||||
|
||||
## Tasks built on Trainer
|
||||
# The Big Table of Tasks
|
||||
|
||||
| Task | Example datasets | Trainer support | TFTrainer support | pytorch-lightning | Colab | One-click Deploy to Azure (wip) |
|
||||
|---|---|:---:|:---:|:---:|:---:|:---:|
|
||||
| [`language-modeling`](./language-modeling) | Raw text | ✅ | - | - | - | - |
|
||||
| [`text-classification`](./text-classification) | GLUE, XNLI | ✅ | ✅ | ✅ | [](https://colab.research.google.com/github/huggingface/blog/blob/master/notebooks/trainer/01_text_classification.ipynb) | [](https://portal.azure.com/#create/Microsoft.Template/uri/https%3A%2F%2Fraw.githubusercontent.com%2FAzure%2Fazure-quickstart-templates%2Fmaster%2F101-storage-account-create%2Fazuredeploy.json) |
|
||||
| [`token-classification`](./token-classification) | CoNLL NER | ✅ | ✅ | ✅ | - | - |
|
||||
| [`multiple-choice`](./multiple-choice) | SWAG, RACE, ARC | ✅ | ✅ | - | [](https://colab.research.google.com/github/ViktorAlm/notebooks/blob/master/MPC_GPU_Demo_for_TF_and_PT.ipynb) | - |
|
||||
| [`question-answering`](./question-answering) | SQuAD | - | ✅ | - | - | - |
|
||||
| Task | Example datasets | Trainer support | TFTrainer support | pytorch-lightning | Colab
|
||||
|---|---|:---:|:---:|:---:|:---:|
|
||||
| [**`language-modeling`**](./language-modeling) | Raw text | ✅ | - | - | [](https://colab.research.google.com/github/huggingface/blog/blob/master/notebooks/01_how_to_train.ipynb)
|
||||
| [**`text-classification`**](./text-classification) | GLUE, XNLI | ✅ | ✅ | ✅ | [](https://colab.research.google.com/github/huggingface/blog/blob/master/notebooks/trainer/01_text_classification.ipynb)
|
||||
| [**`token-classification`**](./token-classification) | CoNLL NER | ✅ | ✅ | ✅ | -
|
||||
| [**`multiple-choice`**](./multiple-choice) | SWAG, RACE, ARC | ✅ | ✅ | - | [](https://colab.research.google.com/github/ViktorAlm/notebooks/blob/master/MPC_GPU_Demo_for_TF_and_PT.ipynb)
|
||||
| [**`question-answering`**](./question-answering) | SQuAD | - | ✅ | - | -
|
||||
| [**`text-generation`**](./text-generation) | - | - | - | - | [](https://colab.research.google.com/github/huggingface/blog/blob/master/notebooks/02_how_to_generate.ipynb)
|
||||
| [**`distillation`**](./distillation) | All | - | - | - | -
|
||||
| [**`summarization`**](./summarization) | CNN/Daily Mail | - | - | - | -
|
||||
| [**`translation`**](./translation) | WMT | - | - | - | -
|
||||
| [**`bertology`**](./bertology) | - | - | - | - | -
|
||||
| [**`adversarial`**](./adversarial) | HANS | - | - | - | -
|
||||
|
||||
|
||||
|
||||
## Other examples and how-to's
|
||||
|
||||
| Section | Description |
|
||||
|---|---|
|
||||
| [TensorFlow 2.0 models on GLUE](./text-classification) | Examples running BERT TensorFlow 2.0 model on the GLUE tasks. |
|
||||
| [Running on TPUs](#running-on-tpus) | Examples on running fine-tuning tasks on Google TPUs to accelerate workloads. |
|
||||
| [Language Model training](./language-modeling) | Fine-tuning (or training from scratch) the library models for language modeling on a text dataset. Causal language modeling for GPT/GPT-2, masked language modeling for BERT/RoBERTa. |
|
||||
| [Language Generation](./text-generation) | Conditional text generation using the auto-regressive models of the library: GPT, GPT-2, Transformer-XL and XLNet. |
|
||||
| [GLUE](./text-classification) | Examples running BERT/XLM/XLNet/RoBERTa on the 9 GLUE tasks. Examples feature distributed training as well as half-precision. |
|
||||
| [SQuAD](./question-answering) | Using BERT/RoBERTa/XLNet/XLM for question answering, examples with distributed training. |
|
||||
| [Multiple Choice](./multiple-choice) | Examples running BERT/XLNet/RoBERTa on the SWAG/RACE/ARC tasks. |
|
||||
| [Named Entity Recognition](./token-classification) | Using BERT for Named Entity Recognition (NER) on the CoNLL 2003 dataset, examples with distributed training. |
|
||||
| [XNLI](./text-classification) | Examples running BERT/XLM on the XNLI benchmark. |
|
||||
| [Adversarial evaluation of model performances](./adversarial) | Testing a model with adversarial evaluation of natural language inference on the Heuristic Analysis for NLI Systems (HANS) dataset (McCoy et al., 2019.) |
|
||||
<br>
|
||||
|
||||
## Important note
|
||||
|
||||
@@ -52,6 +44,12 @@ pip install .
|
||||
pip install -r ./examples/requirements.txt
|
||||
```
|
||||
|
||||
## One-click Deploy to Cloud (wip)
|
||||
|
||||
#### Azure
|
||||
|
||||
[](https://portal.azure.com/#create/Microsoft.Template/uri/https%3A%2F%2Fraw.githubusercontent.com%2FAzure%2Fazure-quickstart-templates%2Fmaster%2F101-storage-account-create%2Fazuredeploy.json)
|
||||
|
||||
## Running on TPUs
|
||||
|
||||
When using Tensorflow, TPUs are supported out of the box as a `tf.distribute.Strategy`.
|
||||
|
||||
@@ -478,7 +478,7 @@ def _compute_pytorch(
|
||||
dictionary[model_name]["memory"][batch_size][slice_size] = "N/A"
|
||||
|
||||
if not no_speed:
|
||||
print_fn("Going through model with sequence of shape".format(sequence.shape))
|
||||
print_fn("Going through model with sequence of shape {}".format(sequence.shape))
|
||||
runtimes = timeit.repeat(lambda: inference(sequence), repeat=average_over, number=3)
|
||||
average_time = sum(runtimes) / float(len(runtimes)) / 3.0
|
||||
dictionary[model_name]["time"][batch_size][slice_size] = average_time
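The hunk above fixes a log message whose format string had no `{}` placeholder, and keeps the `timeit.repeat` averaging. A minimal standalone sketch of both points, using only the standard library. `average_runtime` is a hypothetical helper and the timed function is a stand-in for the model's `inference` call.

```python
import timeit


def average_runtime(fn, repeat=5, number=3):
    """Average seconds per call of `fn` (hypothetical helper).

    `timeit.repeat` returns one total per repetition, each covering
    `number` calls, so the mean is divided by `number` again.
    """
    runtimes = timeit.repeat(fn, repeat=repeat, number=number)
    return sum(runtimes) / float(len(runtimes)) / number


shape = (8, 128)  # illustrative shape, not taken from the benchmark script

# Without a placeholder, str.format silently drops its arguments.
print("sequence of shape".format(shape))  # sequence of shape

# With the placeholder, the shape appears in the message.
message = "sequence of shape {}".format(shape)
print(message)  # sequence of shape (8, 128)

# Stand-in workload in place of a model forward pass.
print(average_runtime(lambda: sum(range(1000))) > 0)
```

Because `str.format` ignores surplus positional arguments, the original message raised no error, which is why the missing shape was easy to overlook.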
|
||||
|
||||
@@ -64,7 +64,7 @@ def print_2d_tensor(tensor):
|
||||
|
||||
|
||||
def compute_heads_importance(
|
||||
args, model, eval_dataloader, compute_entropy=True, compute_importance=True, head_mask=None
|
||||
args, model, eval_dataloader, compute_entropy=True, compute_importance=True, head_mask=None, actually_pruned=False
|
||||
):
|
||||
""" This method shows how to compute:
|
||||
- head attention entropy
|
||||
@@ -77,7 +77,12 @@ def compute_heads_importance(
|
||||
|
||||
if head_mask is None:
|
||||
head_mask = torch.ones(n_layers, n_heads).to(args.device)
|
||||
|
||||
head_mask.requires_grad_(requires_grad=True)
|
||||
# If actually pruned attention multi-head, set head mask to None to avoid shape mismatch
|
||||
if actually_pruned:
|
||||
head_mask = None
|
||||
|
||||
preds = None
|
||||
labels = None
|
||||
tot_tokens = 0.0
|
||||
@@ -172,6 +177,7 @@ def mask_heads(args, model, eval_dataloader):
|
||||
new_head_mask = new_head_mask.view(-1)
|
||||
new_head_mask[current_heads_to_mask] = 0.0
|
||||
new_head_mask = new_head_mask.view_as(head_mask)
|
||||
new_head_mask = new_head_mask.clone().detach()
|
||||
print_2d_tensor(new_head_mask)
|
||||
|
||||
# Compute metric and head importance again
|
||||
@@ -181,7 +187,7 @@ def mask_heads(args, model, eval_dataloader):
|
||||
preds = np.argmax(preds, axis=1) if args.output_mode == "classification" else np.squeeze(preds)
|
||||
current_score = glue_compute_metrics(args.task_name, preds, labels)[args.metric_name]
|
||||
logger.info(
|
||||
"Masking: current score: %f, remaning heads %d (%.1f percents)",
|
||||
"Masking: current score: %f, remaining heads %d (%.1f percents)",
|
||||
current_score,
|
||||
new_head_mask.sum(),
|
||||
new_head_mask.sum() / new_head_mask.numel() * 100,
|
||||
@@ -209,14 +215,23 @@ def prune_heads(args, model, eval_dataloader, head_mask):
|
||||
original_time = datetime.now() - before_time
|
||||
|
||||
original_num_params = sum(p.numel() for p in model.parameters())
|
||||
heads_to_prune = dict((layer, (1 - head_mask[layer].long()).nonzero().tolist()) for layer in range(len(head_mask)))
|
||||
heads_to_prune = dict(
|
||||
(layer, (1 - head_mask[layer].long()).nonzero().squeeze().tolist()) for layer in range(len(head_mask))
|
||||
)
|
||||
|
||||
assert sum(len(h) for h in heads_to_prune.values()) == (1 - head_mask.long()).sum().item()
|
||||
model.prune_heads(heads_to_prune)
|
||||
pruned_num_params = sum(p.numel() for p in model.parameters())
|
||||
|
||||
before_time = datetime.now()
|
||||
_, _, preds, labels = compute_heads_importance(
|
||||
args, model, eval_dataloader, compute_entropy=False, compute_importance=False, head_mask=None
|
||||
args,
|
||||
model,
|
||||
eval_dataloader,
|
||||
compute_entropy=False,
|
||||
compute_importance=False,
|
||||
head_mask=None,
|
||||
actually_pruned=True,
|
||||
)
|
||||
preds = np.argmax(preds, axis=1) if args.output_mode == "classification" else np.squeeze(preds)
|
||||
score_pruning = glue_compute_metrics(args.task_name, preds, labels)[args.metric_name]
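The hunk above changes how `heads_to_prune` is built from `head_mask` and asserts that the number of listed heads matches the number of zeros in the mask. The same bookkeeping can be sketched in plain Python, assuming the mask is a nested list of `1.0` (keep) and `0.0` (prune) values instead of a tensor. `heads_to_prune_from_mask` is a hypothetical helper, not part of the script.

```python
from typing import Dict, List


def heads_to_prune_from_mask(head_mask: List[List[float]]) -> Dict[int, List[int]]:
    """Map each layer index to a flat list of head indices whose mask is 0."""
    return {
        layer: [head for head, keep in enumerate(row) if int(keep) == 0]
        for layer, row in enumerate(head_mask)
    }


# Illustrative mask: 3 layers with 4 heads each (values are assumptions).
head_mask = [
    [1.0, 0.0, 1.0, 0.0],
    [1.0, 1.0, 1.0, 1.0],
    [0.0, 1.0, 1.0, 1.0],
]
heads_to_prune = heads_to_prune_from_mask(head_mask)
print(heads_to_prune)  # {0: [1, 3], 1: [], 2: [0]}

# Same consistency check as in the diff: every masked head is listed exactly once.
n_masked = sum(1 - int(keep) for row in head_mask for keep in row)
assert sum(len(heads) for heads in heads_to_prune.values()) == n_masked
```

The consistency check relies on each layer mapping to a flat list of head indices, so that `len` counts heads rather than nested entries.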
|
||||
@@ -404,7 +419,7 @@ def main():
|
||||
logger.info("Training/evaluation parameters %s", args)
|
||||
|
||||
# Prepare dataset for the GLUE task
|
||||
eval_dataset = GlueDataset(args, tokenizer=tokenizer, evaluate=True)
|
||||
eval_dataset = GlueDataset(args, tokenizer=tokenizer, mode="dev")
|
||||
if args.data_subset > 0:
|
||||
eval_dataset = Subset(eval_dataset, list(range(min(args.data_subset, len(eval_dataset)))))
|
||||
eval_sampler = SequentialSampler(eval_dataset) if args.local_rank == -1 else DistributedSampler(eval_dataset)
|
||||
|
||||
@@ -80,7 +80,7 @@ def main():
|
||||
|
||||
# Load a pre-trained model
|
||||
model = TransfoXLLMHeadModel.from_pretrained(args.model_name)
|
||||
model = model.to(device)
|
||||
model.to(device)
|
||||
|
||||
logger.info(
|
||||
"Evaluating with bsz {} tgt_len {} ext_len {} mem_len {} clamp_len {}".format(
|
||||
|
||||
@@ -80,7 +80,7 @@ class Distiller:
|
||||
|
||||
self.mlm = params.mlm
|
||||
if self.mlm:
|
||||
logger.info(f"Using MLM loss for LM step.")
|
||||
logger.info("Using MLM loss for LM step.")
|
||||
self.mlm_mask_prop = params.mlm_mask_prop
|
||||
assert 0.0 <= self.mlm_mask_prop <= 1.0
|
||||
assert params.word_mask + params.word_keep + params.word_rand == 1.0
|
||||
@@ -91,7 +91,7 @@ class Distiller:
|
||||
self.pred_probs = self.pred_probs.half()
|
||||
self.token_probs = self.token_probs.half()
|
||||
else:
|
||||
logger.info(f"Using CLM loss for LM step.")
|
||||
logger.info("Using CLM loss for LM step.")
|
||||
|
||||
self.epoch = 0
|
||||
self.n_iter = 0
|
||||
@@ -365,8 +365,8 @@ class Distiller:
|
||||
self.end_epoch()
|
||||
|
||||
if self.is_master:
|
||||
logger.info(f"Save very last checkpoint as `pytorch_model.bin`.")
|
||||
self.save_checkpoint(checkpoint_name=f"pytorch_model.bin")
|
||||
logger.info("Save very last checkpoint as `pytorch_model.bin`.")
|
||||
self.save_checkpoint(checkpoint_name="pytorch_model.bin")
|
||||
logger.info("Training is finished")
|
||||
|
||||
def step(self, input_ids: torch.tensor, attention_mask: torch.tensor, lm_labels: torch.tensor):
|
||||
|
||||
@@ -60,7 +60,7 @@ def main():
|
||||
with open(args.file_path, "r", encoding="utf8") as fp:
|
||||
data = fp.readlines()
|
||||
|
||||
logger.info(f"Start encoding")
|
||||
logger.info("Start encoding")
|
||||
logger.info(f"{len(data)} examples to process.")
|
||||
|
||||
rslt = []
|
||||
|
||||
@@ -93,7 +93,7 @@ if __name__ == "__main__":
|
||||
elif args.model_type == "gpt2":
|
||||
for w in ["weight", "bias"]:
|
||||
compressed_sd[f"{prefix}.ln_f.{w}"] = state_dict[f"{prefix}.ln_f.{w}"]
|
||||
compressed_sd[f"lm_head.weight"] = state_dict[f"lm_head.weight"]
|
||||
compressed_sd["lm_head.weight"] = state_dict["lm_head.weight"]
|
||||
|
||||
print(f"N layers selected for distillation: {std_idx}")
|
||||
print(f"Number of params transfered for distillation: {len(compressed_sd.keys())}")
|
||||
|
||||
@@ -37,7 +37,7 @@ if __name__ == "__main__":
|
||||
model = BertForMaskedLM.from_pretrained(args.model_name)
|
||||
prefix = "bert"
|
||||
else:
|
||||
raise ValueError(f'args.model_type should be "bert".')
|
||||
raise ValueError('args.model_type should be "bert".')
|
||||
|
||||
state_dict = model.state_dict()
|
||||
compressed_sd = {}
|
||||
@@ -78,8 +78,8 @@ if __name__ == "__main__":
|
||||
]
|
||||
std_idx += 1
|
||||
|
||||
compressed_sd[f"vocab_projector.weight"] = state_dict[f"cls.predictions.decoder.weight"]
|
||||
compressed_sd[f"vocab_projector.bias"] = state_dict[f"cls.predictions.bias"]
|
||||
compressed_sd["vocab_projector.weight"] = state_dict["cls.predictions.decoder.weight"]
|
||||
compressed_sd["vocab_projector.bias"] = state_dict["cls.predictions.bias"]
|
||||
if args.vocab_transform:
|
||||
for w in ["weight", "bias"]:
|
||||
compressed_sd[f"vocab_transform.{w}"] = state_dict[f"cls.predictions.transform.dense.{w}"]
|
||||
|
||||
@@ -273,7 +273,7 @@ def main():
|
||||
token_probs = None
|
||||
|
||||
train_lm_seq_dataset = LmSeqsDataset(params=args, data=data)
|
||||
logger.info(f"Data loader created.")
|
||||
logger.info("Data loader created.")
|
||||
|
||||
# STUDENT #
|
||||
logger.info(f"Loading student config from {args.student_config}")
|
||||
@@ -288,7 +288,7 @@ def main():
|
||||
|
||||
if args.n_gpu > 0:
|
||||
student.to(f"cuda:{args.local_rank}")
|
||||
logger.info(f"Student loaded.")
|
||||
logger.info("Student loaded.")
|
||||
|
||||
# TEACHER #
|
||||
teacher = teacher_model_class.from_pretrained(args.teacher_name, output_hidden_states=True)
|
||||
|
||||
@@ -115,15 +115,13 @@ class DataTrainingArguments:
|
||||
)
|
||||
|
||||
|
||||
def get_dataset(args: DataTrainingArguments, tokenizer: PreTrainedTokenizer, evaluate=False, local_rank=-1):
|
||||
def get_dataset(args: DataTrainingArguments, tokenizer: PreTrainedTokenizer, evaluate=False):
|
||||
file_path = args.eval_data_file if evaluate else args.train_data_file
|
||||
if args.line_by_line:
|
||||
return LineByLineTextDataset(
|
||||
tokenizer=tokenizer, file_path=file_path, block_size=args.block_size, local_rank=local_rank
|
||||
)
|
||||
return LineByLineTextDataset(tokenizer=tokenizer, file_path=file_path, block_size=args.block_size)
|
||||
else:
|
||||
return TextDataset(
|
||||
tokenizer=tokenizer, file_path=file_path, block_size=args.block_size, local_rank=local_rank,
|
||||
tokenizer=tokenizer, file_path=file_path, block_size=args.block_size, overwrite_cache=args.overwrite_cache
|
||||
)
|
||||
|
||||
|
||||
@@ -220,16 +218,9 @@ def main():
|
||||
data_args.block_size = min(data_args.block_size, tokenizer.max_len)
|
||||
|
||||
# Get datasets
|
||||
train_dataset = (
|
||||
get_dataset(data_args, tokenizer=tokenizer, local_rank=training_args.local_rank)
|
||||
if training_args.do_train
|
||||
else None
|
||||
)
|
||||
eval_dataset = (
|
||||
get_dataset(data_args, tokenizer=tokenizer, local_rank=training_args.local_rank, evaluate=True)
|
||||
if training_args.do_eval
|
||||
else None
|
||||
)
|
||||
|
||||
train_dataset = get_dataset(data_args, tokenizer=tokenizer) if training_args.do_train else None
|
||||
eval_dataset = get_dataset(data_args, tokenizer=tokenizer, evaluate=True) if training_args.do_eval else None
|
||||
data_collator = DataCollatorForLanguageModeling(
|
||||
tokenizer=tokenizer, mlm=data_args.mlm, mlm_probability=data_args.mlm_probability
|
||||
)
|
||||
@@ -260,7 +251,7 @@ def main():
|
||||
|
||||
# Evaluation
|
||||
results = {}
|
||||
if training_args.do_eval and training_args.local_rank in [-1, 0]:
|
||||
if training_args.do_eval:
|
||||
logger.info("*** Evaluate ***")
|
||||
|
||||
eval_output = trainer.evaluate()
|
||||
@@ -269,11 +260,12 @@ def main():
|
||||
result = {"perplexity": perplexity}
|
||||
|
||||
output_eval_file = os.path.join(training_args.output_dir, "eval_results_lm.txt")
|
||||
with open(output_eval_file, "w") as writer:
|
||||
logger.info("***** Eval results *****")
|
||||
for key in sorted(result.keys()):
|
||||
logger.info(" %s = %s", key, str(result[key]))
|
||||
writer.write("%s = %s\n" % (key, str(result[key])))
|
||||
if trainer.is_world_master():
|
||||
with open(output_eval_file, "w") as writer:
|
||||
logger.info("***** Eval results *****")
|
||||
for key in sorted(result.keys()):
|
||||
logger.info(" %s = %s", key, str(result[key]))
|
||||
writer.write("%s = %s\n" % (key, str(result[key])))
|
||||
|
||||
results.update(result)
|
||||
|
||||
|
||||
@@ -159,7 +159,6 @@ def main():
|
||||
max_seq_length=data_args.max_seq_length,
|
||||
overwrite_cache=data_args.overwrite_cache,
|
||||
mode=Split.train,
|
||||
local_rank=training_args.local_rank,
|
||||
)
|
||||
if training_args.do_train
|
||||
else None
|
||||
@@ -172,7 +171,6 @@ def main():
|
||||
max_seq_length=data_args.max_seq_length,
|
||||
overwrite_cache=data_args.overwrite_cache,
|
||||
mode=Split.dev,
|
||||
local_rank=training_args.local_rank,
|
||||
)
|
||||
if training_args.do_eval
|
||||
else None
|
||||
@@ -204,19 +202,20 @@ def main():
|
||||
|
||||
# Evaluation
|
||||
results = {}
|
||||
if training_args.do_eval and training_args.local_rank in [-1, 0]:
|
||||
if training_args.do_eval:
|
||||
logger.info("*** Evaluate ***")
|
||||
|
||||
result = trainer.evaluate()
|
||||
|
||||
output_eval_file = os.path.join(training_args.output_dir, "eval_results.txt")
|
||||
with open(output_eval_file, "w") as writer:
|
||||
logger.info("***** Eval results *****")
|
||||
for key, value in result.items():
|
||||
logger.info(" %s = %s", key, value)
|
||||
writer.write("%s = %s\n" % (key, value))
|
||||
if trainer.is_world_master():
|
||||
with open(output_eval_file, "w") as writer:
|
||||
logger.info("***** Eval results *****")
|
||||
for key, value in result.items():
|
||||
logger.info(" %s = %s", key, value)
|
||||
writer.write("%s = %s\n" % (key, value))
|
||||
|
||||
results.update(result)
|
||||
results.update(result)
|
||||
|
||||
return results
|
||||
|
||||
|
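A minimal sketch of the pattern this hunk moves to: every process runs evaluation, but only the main process writes the results file. The function `write_eval_results` and its `is_world_master` flag are hypothetical names used for illustration; in the diff the check is the call `trainer.is_world_master()`.

```python
import os
import tempfile


def write_eval_results(result, output_dir, is_world_master):
    """Write `key = value` lines to eval_results.txt, but only on the main process."""
    output_eval_file = os.path.join(output_dir, "eval_results.txt")
    if is_world_master:
        with open(output_eval_file, "w") as writer:
            for key, value in result.items():
                writer.write("%s = %s\n" % (key, value))
    return output_eval_file


# Usage: a non-master rank computes the path but leaves the file untouched.
demo_dir = tempfile.mkdtemp()
write_eval_results({"eval_loss": 0.25}, demo_dir, is_world_master=False)
write_eval_results({"eval_loss": 0.25}, demo_dir, is_world_master=True)
```

Keeping the evaluation call outside the guard matters in distributed runs, since all ranks may need to take part in it even though only one of them writes to disk.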
||||
@@ -26,6 +26,7 @@ from enum import Enum
|
||||
from typing import List, Optional
|
||||
|
||||
import tqdm
|
||||
from filelock import FileLock
|
||||
|
||||
from transformers import PreTrainedTokenizer, is_tf_available, is_torch_available
|
||||
|
||||
@@ -77,7 +78,6 @@ class Split(Enum):
|
||||
if is_torch_available():
|
||||
import torch
|
||||
from torch.utils.data.dataset import Dataset
|
||||
from transformers import torch_distributed_zero_first
|
||||
|
||||
class MultipleChoiceDataset(Dataset):
|
||||
"""
|
||||
@@ -95,7 +95,6 @@ if is_torch_available():
|
||||
max_seq_length: Optional[int] = None,
|
||||
overwrite_cache=False,
|
||||
mode: Split = Split.train,
|
||||
local_rank=-1,
|
||||
):
|
||||
processor = processors[task]()
|
||||
|
||||
@@ -103,9 +102,11 @@ if is_torch_available():
|
||||
data_dir,
|
||||
"cached_{}_{}_{}_{}".format(mode.value, tokenizer.__class__.__name__, str(max_seq_length), task,),
|
||||
)
|
||||
with torch_distributed_zero_first(local_rank):
|
||||
# Make sure only the first process in distributed training processes the dataset,
|
||||
# and the others will use the cache.
|
||||
|
||||
# Make sure only the first process in distributed training processes the dataset,
|
||||
# and the others will use the cache.
|
||||
lock_path = cached_features_file + ".lock"
|
||||
with FileLock(lock_path):
|
||||
|
||||
if os.path.exists(cached_features_file) and not overwrite_cache:
|
||||
logger.info(f"Loading features from cached file {cached_features_file}")
|
||||
@@ -130,9 +131,8 @@ if is_torch_available():
|
||||
pad_token=tokenizer.pad_token_id,
|
||||
pad_token_segment_id=tokenizer.pad_token_type_id,
|
||||
)
|
||||
if local_rank in [-1, 0]:
|
||||
logger.info("Saving features into cached file %s", cached_features_file)
|
||||
torch.save(self.features, cached_features_file)
|
||||
logger.info("Saving features into cached file %s", cached_features_file)
|
||||
torch.save(self.features, cached_features_file)
|
||||
|
||||
def __len__(self):
|
||||
return len(self.features)
|
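The caching logic in this hunk can be sketched as a "check the cache while holding a lock" helper. This is an illustration under stated assumptions, not the code from the diff: `load_or_build_features` and `build_fn` are hypothetical names, `pickle` stands in for `torch.save`/`torch.load`, and a `threading.Lock` stands in for `FileLock(lock_path)` so that the sketch runs with the standard library alone. A `threading.Lock` only serializes threads within one process, whereas the `FileLock` in the diff also serializes separate processes.

```python
import os
import pickle
import tempfile
import threading

# Stand-in for FileLock(cached_features_file + ".lock"); see the note above.
_cache_lock = threading.Lock()


def load_or_build_features(cache_file, build_fn, overwrite_cache=False):
    """Return cached features if present, otherwise build them and cache them."""
    with _cache_lock:
        if os.path.exists(cache_file) and not overwrite_cache:
            with open(cache_file, "rb") as f:
                return pickle.load(f)
        # Whoever acquires the lock first builds the features and saves them;
        # later callers find the file and take the branch above.
        features = build_fn()
        with open(cache_file, "wb") as f:
            pickle.dump(features, f)
        return features


# Usage: the second call reads the cache instead of rebuilding.
cache_path = os.path.join(tempfile.mkdtemp(), "cached_train")
first = load_or_build_features(cache_path, lambda: ["feature-a", "feature-b"])
second = load_or_build_features(cache_path, lambda: ["never built"])
```

Because the existence check and the save both happen under the lock, the `local_rank in [-1, 0]` guard around the save is no longer needed, which is what the hunk removes.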
||||
@@ -535,7 +535,12 @@ def convert_examples_to_features(
|
||||
text_b = example.question + " " + ending
|
||||
|
||||
inputs = tokenizer.encode_plus(
|
||||
text_a, text_b, add_special_tokens=True, max_length=max_length, pad_to_max_length=True,
|
||||
text_a,
|
||||
text_b,
|
||||
add_special_tokens=True,
|
||||
max_length=max_length,
|
||||
pad_to_max_length=True,
|
||||
return_overflowing_tokens=True,
|
||||
)
|
||||
if "num_truncated_tokens" in inputs and inputs["num_truncated_tokens"] > 0:
|
||||
logger.info(
|
||||
|
||||
@@ -135,7 +135,8 @@ def main():
|
||||
|
||||
# Get datasets
|
||||
train_dataset = GlueDataset(data_args, tokenizer=tokenizer) if training_args.do_train else None
|
||||
eval_dataset = GlueDataset(data_args, tokenizer=tokenizer, evaluate=True) if training_args.do_eval else None
|
||||
eval_dataset = GlueDataset(data_args, tokenizer=tokenizer, mode="dev") if training_args.do_eval else None
|
||||
test_dataset = GlueDataset(data_args, tokenizer=tokenizer, mode="test") if training_args.do_predict else None
|
||||
|
||||
def compute_metrics(p: EvalPrediction) -> Dict:
|
||||
if output_mode == "classification":
|
||||
@@ -165,31 +166,57 @@ def main():
|
||||
tokenizer.save_pretrained(training_args.output_dir)
|
||||
|
||||
# Evaluation
|
||||
results = {}
|
||||
if training_args.do_eval and training_args.local_rank in [-1, 0]:
|
||||
eval_results = {}
|
||||
if training_args.do_eval:
|
||||
logger.info("*** Evaluate ***")
|
||||
|
||||
# Loop to handle MNLI double evaluation (matched, mis-matched)
|
||||
eval_datasets = [eval_dataset]
|
||||
if data_args.task_name == "mnli":
|
||||
mnli_mm_data_args = dataclasses.replace(data_args, task_name="mnli-mm")
|
||||
eval_datasets.append(GlueDataset(mnli_mm_data_args, tokenizer=tokenizer, evaluate=True))
|
||||
eval_datasets.append(GlueDataset(mnli_mm_data_args, tokenizer=tokenizer, mode="dev"))
|
||||
|
||||
for eval_dataset in eval_datasets:
|
||||
result = trainer.evaluate(eval_dataset=eval_dataset)
|
||||
eval_result = trainer.evaluate(eval_dataset=eval_dataset)
|
||||
|
||||
output_eval_file = os.path.join(
|
||||
training_args.output_dir, f"eval_results_{eval_dataset.args.task_name}.txt"
|
||||
)
|
||||
with open(output_eval_file, "w") as writer:
|
||||
logger.info("***** Eval results {} *****".format(eval_dataset.args.task_name))
|
||||
for key, value in result.items():
|
||||
logger.info(" %s = %s", key, value)
|
||||
writer.write("%s = %s\n" % (key, value))
|
||||
if trainer.is_world_master():
|
||||
with open(output_eval_file, "w") as writer:
|
||||
logger.info("***** Eval results {} *****".format(eval_dataset.args.task_name))
|
||||
for key, value in eval_result.items():
|
||||
logger.info(" %s = %s", key, value)
|
||||
writer.write("%s = %s\n" % (key, value))
|
||||
|
||||
results.update(result)
|
||||
eval_results.update(eval_result)
|
||||
|
||||
return results
|
||||
if training_args.do_predict:
|
||||
logger.info("*** Test ***")
|
||||
test_datasets = [test_dataset]
|
||||
if data_args.task_name == "mnli":
|
||||
mnli_mm_data_args = dataclasses.replace(data_args, task_name="mnli-mm")
|
||||
test_datasets.append(GlueDataset(mnli_mm_data_args, tokenizer=tokenizer, mode="test"))
|
||||
|
||||
for test_dataset in test_datasets:
|
||||
predictions = trainer.predict(test_dataset=test_dataset).predictions
|
||||
if output_mode == "classification":
|
||||
predictions = np.argmax(predictions, axis=1)
|
||||
|
||||
output_test_file = os.path.join(
|
||||
training_args.output_dir, f"test_results_{test_dataset.args.task_name}.txt"
|
||||
)
|
||||
if trainer.is_world_master():
|
||||
with open(output_test_file, "w") as writer:
|
||||
logger.info("***** Test results {} *****".format(test_dataset.args.task_name))
|
||||
writer.write("index\tprediction\n")
|
||||
for index, item in enumerate(predictions):
|
||||
if output_mode == "regression":
|
||||
writer.write("%d\t%3.3f\n" % (index, item))
|
||||
else:
|
||||
item = test_dataset.get_labels()[item]
|
||||
writer.write("%d\t%s\n" % (index, item))
|
||||
return eval_results
|
||||
|
||||
|
||||
def _mp_fn(index):
|
||||
|
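The MNLI handling in the hunk above relies on `dataclasses.replace`, which returns a copy of a dataclass instance with selected fields changed and leaves the original untouched. A minimal sketch, assuming a hypothetical `DataArgs` dataclass in place of the script's real data arguments and a hypothetical helper `eval_task_args`:

```python
import dataclasses
from dataclasses import dataclass


@dataclass
class DataArgs:
    # Hypothetical stand-in for the script's data arguments.
    task_name: str
    data_dir: str


def eval_task_args(data_args):
    """Return the argument sets to evaluate; MNLI adds the mismatched split."""
    args_list = [data_args]
    if data_args.task_name == "mnli":
        # Copy the arguments, changing only the task name.
        args_list.append(dataclasses.replace(data_args, task_name="mnli-mm"))
    return args_list


# Usage: MNLI yields two argument sets, every other task yields one.
mnli_args = eval_task_args(DataArgs(task_name="mnli", data_dir="glue_data/MNLI"))
```

Each returned argument set would then be used to build one dataset, which is why the loop in the diff writes one results file per task name.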
||||
@@ -171,7 +171,6 @@ def main():
|
||||
max_seq_length=data_args.max_seq_length,
|
||||
overwrite_cache=data_args.overwrite_cache,
|
||||
mode=Split.train,
|
||||
local_rank=training_args.local_rank,
|
||||
)
|
||||
if training_args.do_train
|
||||
else None
|
||||
@@ -185,7 +184,6 @@ def main():
|
||||
max_seq_length=data_args.max_seq_length,
|
||||
overwrite_cache=data_args.overwrite_cache,
|
||||
mode=Split.dev,
|
||||
local_rank=training_args.local_rank,
|
||||
)
|
||||
if training_args.do_eval
|
||||
else None
|
||||
@@ -237,22 +235,23 @@ def main():
|
||||
|
||||
# Evaluation
|
||||
results = {}
|
||||
if training_args.do_eval and training_args.local_rank in [-1, 0]:
|
||||
if training_args.do_eval:
|
||||
logger.info("*** Evaluate ***")
|
||||
|
||||
result = trainer.evaluate()
|
||||
|
||||
output_eval_file = os.path.join(training_args.output_dir, "eval_results.txt")
|
||||
with open(output_eval_file, "w") as writer:
|
||||
logger.info("***** Eval results *****")
|
||||
for key, value in result.items():
|
||||
logger.info(" %s = %s", key, value)
|
||||
writer.write("%s = %s\n" % (key, value))
|
||||
if trainer.is_world_master():
|
||||
with open(output_eval_file, "w") as writer:
|
||||
logger.info("***** Eval results *****")
|
||||
for key, value in result.items():
|
||||
logger.info(" %s = %s", key, value)
|
||||
writer.write("%s = %s\n" % (key, value))
|
||||
|
||||
results.update(result)
|
||||
|
||||
# Predict
|
||||
if training_args.do_predict and training_args.local_rank in [-1, 0]:
|
||||
if training_args.do_predict:
|
||||
test_dataset = NerDataset(
|
||||
data_dir=data_args.data_dir,
|
||||
tokenizer=tokenizer,
|
||||
@@ -261,33 +260,36 @@ def main():
|
||||
max_seq_length=data_args.max_seq_length,
|
||||
overwrite_cache=data_args.overwrite_cache,
|
||||
mode=Split.test,
|
||||
local_rank=training_args.local_rank,
|
||||
)
|
||||
|
||||
predictions, label_ids, metrics = trainer.predict(test_dataset)
|
||||
preds_list, _ = align_predictions(predictions, label_ids)
|
||||
|
||||
output_test_results_file = os.path.join(training_args.output_dir, "test_results.txt")
|
||||
with open(output_test_results_file, "w") as writer:
|
||||
for key, value in metrics.items():
|
||||
logger.info(" %s = %s", key, value)
|
||||
writer.write("%s = %s\n" % (key, value))
|
||||
if trainer.is_world_master():
|
||||
with open(output_test_results_file, "w") as writer:
|
||||
for key, value in metrics.items():
|
||||
logger.info(" %s = %s", key, value)
|
||||
writer.write("%s = %s\n" % (key, value))
|
||||
|
||||
# Save predictions
|
||||
output_test_predictions_file = os.path.join(training_args.output_dir, "test_predictions.txt")
|
||||
with open(output_test_predictions_file, "w") as writer:
|
||||
with open(os.path.join(data_args.data_dir, "test.txt"), "r") as f:
|
||||
example_id = 0
|
||||
for line in f:
|
||||
if line.startswith("-DOCSTART-") or line == "" or line == "\n":
|
||||
writer.write(line)
|
||||
if not preds_list[example_id]:
|
||||
example_id += 1
|
||||
elif preds_list[example_id]:
|
||||
output_line = line.split()[0] + " " + preds_list[example_id].pop(0) + "\n"
|
||||
writer.write(output_line)
|
||||
else:
|
||||
logger.warning("Maximum sequence length exceeded: No prediction for '%s'.", line.split()[0])
|
||||
if trainer.is_world_master():
|
||||
with open(output_test_predictions_file, "w") as writer:
|
||||
with open(os.path.join(data_args.data_dir, "test.txt"), "r") as f:
|
||||
example_id = 0
|
||||
for line in f:
|
||||
if line.startswith("-DOCSTART-") or line == "" or line == "\n":
|
||||
writer.write(line)
|
||||
if not preds_list[example_id]:
|
||||
example_id += 1
|
||||
elif preds_list[example_id]:
|
||||
output_line = line.split()[0] + " " + preds_list[example_id].pop(0) + "\n"
|
||||
writer.write(output_line)
|
||||
else:
|
||||
logger.warning(
|
||||
"Maximum sequence length exceeded: No prediction for '%s'.", line.split()[0]
|
||||
)
|
||||
|
||||
return results
|
||||
|
||||
|
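The prediction-writing loop in the hunk above pairs each token line of `test.txt` with the next predicted label for the current sentence. A minimal sketch of that pairing on in-memory data, assuming a hypothetical helper `write_predictions` that returns the output lines and the tokens that received no prediction instead of writing files and logging:

```python
def write_predictions(lines, preds_list):
    """Pair each token line with its predicted label, mirroring the loop in the diff."""
    preds = [list(p) for p in preds_list]  # copy so the caller's lists are not consumed
    output, skipped = [], []
    example_id = 0
    for line in lines:
        if line.startswith("-DOCSTART-") or line == "" or line == "\n":
            output.append(line)
            # A separator after a fully consumed sentence moves on to the next one.
            if not preds[example_id]:
                example_id += 1
        elif preds[example_id]:
            output.append(line.split()[0] + " " + preds[example_id].pop(0) + "\n")
        else:
            # No label left for this token (sequence longer than max_seq_length).
            skipped.append(line.split()[0])
    return output, skipped


# Usage: two sentences, the second with one token beyond the predicted labels.
demo_lines = ["EU B-ORG\n", "rejects O\n", "\n", "Peter B-PER\n", "Blackburn I-PER\n", "extra O\n"]
demo_output, demo_skipped = write_predictions(demo_lines, [["B-ORG", "O"], ["B-PER", "I-PER"]])
```

In the diff this loop is additionally wrapped in `if trainer.is_world_master():`, so only the main process writes `test_predictions.txt`.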
||||
@@ -22,6 +22,8 @@ from dataclasses import dataclass
|
||||
from enum import Enum
|
||||
from typing import List, Optional, Union
|
||||
|
||||
from filelock import FileLock
|
||||
|
||||
from transformers import PreTrainedTokenizer, is_tf_available, is_torch_available
|
||||
|
||||
|
||||
@@ -68,7 +70,6 @@ if is_torch_available():
|
||||
import torch
|
||||
from torch import nn
|
||||
from torch.utils.data.dataset import Dataset
|
||||
from transformers import torch_distributed_zero_first
|
||||
|
||||
class NerDataset(Dataset):
|
||||
"""
|
||||
@@ -90,16 +91,16 @@ if is_torch_available():
|
||||
max_seq_length: Optional[int] = None,
|
||||
overwrite_cache=False,
|
||||
mode: Split = Split.train,
|
||||
local_rank=-1,
|
||||
):
|
||||
# Load data features from cache or dataset file
|
||||
cached_features_file = os.path.join(
|
||||
data_dir, "cached_{}_{}_{}".format(mode.value, tokenizer.__class__.__name__, str(max_seq_length)),
|
||||
)
|
||||
|
||||
with torch_distributed_zero_first(local_rank):
|
||||
# Make sure only the first process in distributed training processes the dataset,
|
||||
# and the others will use the cache.
|
||||
# Make sure only the first process in distributed training processes the dataset,
|
||||
# and the others will use the cache.
|
||||
lock_path = cached_features_file + ".lock"
|
||||
with FileLock(lock_path):
|
||||
|
||||
if os.path.exists(cached_features_file) and not overwrite_cache:
|
||||
logger.info(f"Loading features from cached file {cached_features_file}")
|
||||
@@ -125,9 +126,8 @@ if is_torch_available():
|
||||
pad_token_segment_id=tokenizer.pad_token_type_id,
|
||||
pad_token_label_id=self.pad_token_label_id,
|
||||
)
|
||||
if local_rank in [-1, 0]:
|
||||
logger.info(f"Saving features into cached file {cached_features_file}")
|
||||
torch.save(self.features, cached_features_file)
|
||||
logger.info(f"Saving features into cached file {cached_features_file}")
|
||||
torch.save(self.features, cached_features_file)
|
||||
|
||||
def __len__(self):
|
||||
return len(self.features)
|
||||
|
||||
23
model_cards/Tereveni-AI/gpt2-124M-uk-fiction/README.md
Normal file
@@ -0,0 +1,23 @@
|
||||
Note: **default code snippet above won't work** because we are using `AlbertTokenizer` with `GPT2LMHeadModel`, see [issue](https://github.com/huggingface/transformers/issues/4285).
|
||||
|
||||
## GPT2 124M Trained on Ukrainian Fiction
|
||||
|
||||
Example usage:
|
||||
```python
|
||||
from transformers import AlbertTokenizer, GPT2LMHeadModel
|
||||
|
||||
tokenizer = AlbertTokenizer.from_pretrained("Tereveni-AI/gpt2-124M-uk-fiction")
|
||||
model = GPT2LMHeadModel.from_pretrained("Tereveni-AI/gpt2-124M-uk-fiction")
|
||||
|
||||
input_ids = tokenizer.encode('Но зла Юнона, суча дочка,', add_special_tokens=False, return_tensors='pt')
|
||||
|
||||
outputs = model.generate(
|
||||
input_ids,
|
||||
do_sample=True,
|
||||
num_return_sequences=3,
|
||||
max_length=50
|
||||
)
|
||||
|
||||
for i, out in enumerate(outputs):
|
||||
print('{}: {}'.format(i, tokenizer.decode(out)))
|
||||
```
|
||||
@@ -1,19 +1,25 @@
|
||||
---
|
||||
language: norwegian
|
||||
thumbnail: https://i.imgur.com/QqSEC5I.png
|
||||
---
|
||||
|
||||
# Norwegian Electra
|
||||
Image incoming, I'm going to have some fun with this one.
|
||||

|
||||
|
||||
Trained on OSCAR + Wikipedia + OpenSubtitles + some other data I had, with the awesome power of TPUs (v3-8).
|
||||
|
||||
Use with caution. I have no downstream tasks in Norwegian to test on so I have no idea of its performance yet.
|
||||
|
||||
# Model
|
||||
## Electra: Pre-training Text Encoders as Discriminators Rather Than Generators
|
||||
Kevin Clark and Minh-Thang Luong and Quoc V. Le and Christopher D. Manning
|
||||
- https://openreview.net/pdf?id=r1xMH1BtvB
|
||||
- https://github.com/google-research/electra
|
||||
# Acknowledgments
|
||||
|
||||
### TensorFlow Research Cloud
|
||||
Research supported with Cloud TPUs from Google's TensorFlow Research Cloud (TFRC). Thanks for providing access to the TFRC ❤️
|
||||
- https://www.tensorflow.org/tfrc
|
||||
|
||||
#### OSCAR corpus
|
||||
- https://oscar-corpus.com/
|
||||
|
||||
#### OPUS
|
||||
- http://opus.nlpl.eu/
|
||||
- http://www.opensubtitles.org/
|
||||
|
||||
43
model_cards/activebus/BERT-DK_laptop/README.md
Normal file
@@ -0,0 +1,43 @@
|
||||
# ReviewBERT
|
||||
|
||||
BERT (post-)trained from review corpus to understand sentiment, options and various e-commerce aspects.
|
||||
|
||||
`BERT-DK_laptop` is trained from 100MB laptop corpus under `Electronics/Computers & Accessories/Laptops`.
|
||||
|
||||
|
||||
## Model Description
|
||||
|
||||
The original model is from `BERT-base-uncased` trained from Wikipedia+BookCorpus.
|
||||
Models are post-trained from [Amazon Dataset](http://jmcauley.ucsd.edu/data/amazon/) and [Yelp Dataset](https://www.yelp.com/dataset/challenge/).
|
||||
|
||||
|
||||
## Instructions
|
||||
Loading the post-trained weights is as simple as, e.g.:
|
||||
|
||||
```python
|
||||
import torch
|
||||
from transformers import AutoModel, AutoTokenizer
|
||||
|
||||
tokenizer = AutoTokenizer.from_pretrained("activebus/BERT-DK_laptop")
|
||||
model = AutoModel.from_pretrained("activebus/BERT-DK_laptop")
|
||||
|
||||
```
|
||||
|
||||
|
||||
## Evaluation Results
|
||||
|
||||
Check our [NAACL paper](https://www.aclweb.org/anthology/N19-1242.pdf)
|
||||
|
||||
|
||||
## Citation
|
||||
If you find this work useful, please cite it as follows.
|
||||
```
|
||||
@inproceedings{xu_bert2019,
|
||||
title = "BERT Post-Training for Review Reading Comprehension and Aspect-based Sentiment Analysis",
|
||||
author = "Xu, Hu and Liu, Bing and Shu, Lei and Yu, Philip S.",
|
||||
booktitle = "Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics",
|
||||
month = "jun",
|
||||
year = "2019",
|
||||
}
|
||||
```
|
||||
41
model_cards/activebus/BERT-DK_rest/README.md
Normal file
@@ -0,0 +1,41 @@
|
||||
# ReviewBERT
|
||||
|
||||
BERT (post-)trained from review corpus to understand sentiment, options and various e-commerce aspects.
|
||||
|
||||
`BERT-DK_rest` is trained from 1G (19 types) restaurants from Yelp.
|
||||
|
||||
## Model Description
|
||||
|
||||
The original model is from `BERT-base-uncased` trained from Wikipedia+BookCorpus.
|
||||
Models are post-trained from [Amazon Dataset](http://jmcauley.ucsd.edu/data/amazon/) and [Yelp Dataset](https://www.yelp.com/dataset/challenge/).
|
||||
|
||||
|
||||
## Instructions
|
||||
Loading the post-trained weights is as simple as, e.g.:
|
||||
|
||||
```python
|
||||
import torch
|
||||
from transformers import AutoModel, AutoTokenizer
|
||||
|
||||
tokenizer = AutoTokenizer.from_pretrained("activebus/BERT-DK_rest")
|
||||
model = AutoModel.from_pretrained("activebus/BERT-DK_rest")
|
||||
|
||||
```
|
||||
|
||||
|
||||
## Evaluation Results
|
||||
|
||||
Check our [NAACL paper](https://www.aclweb.org/anthology/N19-1242.pdf)
|
||||
|
||||
|
||||
## Citation
|
||||
If you find this work useful, please cite it as follows.
|
||||
```
|
||||
@inproceedings{xu_bert2019,
|
||||
title = "BERT Post-Training for Review Reading Comprehension and Aspect-based Sentiment Analysis",
|
||||
author = "Xu, Hu and Liu, Bing and Shu, Lei and Yu, Philip S.",
|
||||
booktitle = "Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics",
|
||||
month = "jun",
|
||||
year = "2019",
|
||||
}
|
||||
```
|
||||
41
model_cards/activebus/BERT-PT_laptop/README.md
Normal file
@@ -0,0 +1,41 @@
|
||||
# ReviewBERT
|
||||
|
||||
BERT (post-)trained from review corpus to understand sentiment, options and various e-commerce aspects.
|
||||
|
||||
`BERT-DK_laptop` is trained from 100MB laptop corpus under `Electronics/Computers & Accessories/Laptops`.
|
||||
`BERT-PT_*` additionally uses SQuAD 1.1.
|
||||
|
||||
## Model Description
|
||||
|
||||
The original model is from `BERT-base-uncased` trained from Wikipedia+BookCorpus.
|
||||
Models are post-trained from [Amazon Dataset](http://jmcauley.ucsd.edu/data/amazon/) and [Yelp Dataset](https://www.yelp.com/dataset/challenge/).
|
||||
|
||||
|
||||
## Instructions
|
||||
Loading the post-trained weights is as simple as, e.g.:
|
||||
|
||||
```python
|
||||
import torch
|
||||
from transformers import AutoModel, AutoTokenizer
|
||||
|
||||
tokenizer = AutoTokenizer.from_pretrained("activebus/BERT-PT_laptop")
|
||||
model = AutoModel.from_pretrained("activebus/BERT-PT_laptop")
|
||||
|
||||
```
|
||||
|
||||
## Evaluation Results
|
||||
|
||||
Check our [NAACL paper](https://www.aclweb.org/anthology/N19-1242.pdf)
|
||||
|
||||
|
||||
## Citation
|
||||
If you find this work useful, please cite it as follows.
|
||||
```
|
||||
@inproceedings{xu_bert2019,
|
||||
title = "BERT Post-Training for Review Reading Comprehension and Aspect-based Sentiment Analysis",
|
||||
author = "Xu, Hu and Liu, Bing and Shu, Lei and Yu, Philip S.",
|
||||
booktitle = "Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics",
|
||||
month = "jun",
|
||||
year = "2019",
|
||||
}
|
||||
```
|
||||
42
model_cards/activebus/BERT-PT_rest/README.md
Normal file
@@ -0,0 +1,42 @@
|
||||
# ReviewBERT
|
||||
|
||||
BERT (post-)trained from review corpus to understand sentiment, options and various e-commerce aspects.
|
||||
|
||||
`BERT-DK_rest` is trained from 1G (19 types) restaurants from Yelp.
|
||||
`BERT-PT_*` additionally uses SQuAD 1.1.
|
||||
|
||||
## Model Description
|
||||
|
||||
The original model is from `BERT-base-uncased` trained from Wikipedia+BookCorpus.
|
||||
Models are post-trained from [Amazon Dataset](http://jmcauley.ucsd.edu/data/amazon/) and [Yelp Dataset](https://www.yelp.com/dataset/challenge/).
|
||||
|
||||
|
||||
## Instructions
|
||||
Loading the post-trained weights is as simple as, e.g.:
|
||||
|
||||
```python
|
||||
import torch
|
||||
from transformers import AutoModel, AutoTokenizer
|
||||
|
||||
tokenizer = AutoTokenizer.from_pretrained("activebus/BERT-PT_rest")
|
||||
model = AutoModel.from_pretrained("activebus/BERT-PT_rest")
|
||||
|
||||
```
|
||||
|
||||
|
||||
## Evaluation Results
|
||||
|
||||
Check our [NAACL paper](https://www.aclweb.org/anthology/N19-1242.pdf)
|
||||
|
||||
|
||||
## Citation
|
||||
If you find this work useful, please cite it as follows.
|
||||
```
|
||||
@inproceedings{xu_bert2019,
|
||||
title = "BERT Post-Training for Review Reading Comprehension and Aspect-based Sentiment Analysis",
|
||||
author = "Xu, Hu and Liu, Bing and Shu, Lei and Yu, Philip S.",
|
||||
booktitle = "Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics",
|
||||
month = "jun",
|
||||
year = "2019",
|
||||
}
|
||||
```
|
||||
44
model_cards/activebus/BERT-XD_Review/README.md
Normal file
@@ -0,0 +1,44 @@
|
||||
# ReviewBERT
|
||||
|
||||
BERT (post-)trained from review corpus to understand sentiment, options and various e-commerce aspects.
|
||||
Please visit https://github.com/howardhsu/BERT-for-RRC-ABSA for details.
|
||||
|
||||
`BERT-XD_Review` is a cross-domain (beyond just `laptop` and `restaurant`) language model, where each example is from a single product / restaurant with the same rating, post-trained (fine-tuned) on a combination of 5-core Amazon reviews and all Yelp data, expected to be 22 G in total. It is trained for 4 epochs on `bert-base-uncased`.
|
||||
The preprocessing code is available [here](https://github.com/howardhsu/BERT-for-RRC-ABSA/transformers).
|
||||
|
||||
## Model Description
|
||||
|
||||
The original model is from `BERT-base-uncased`.
|
||||
Models are post-trained from [Amazon Dataset](http://jmcauley.ucsd.edu/data/amazon/) and [Yelp Dataset](https://www.yelp.com/dataset/challenge/).
|
||||
|
||||
|
||||
## Instructions
|
||||
Loading the post-trained weights is as simple as, e.g.:
|
||||
|
||||
```python
|
||||
import torch
|
||||
from transformers import AutoModel, AutoTokenizer
|
||||
|
||||
tokenizer = AutoTokenizer.from_pretrained("activebus/BERT-XD_Review")
|
||||
model = AutoModel.from_pretrained("activebus/BERT-XD_Review")
|
||||
|
||||
```
|
||||
|
||||
|
||||
## Evaluation Results
|
||||
|
||||
Check our [NAACL paper](https://www.aclweb.org/anthology/N19-1242.pdf)
|
||||
`BERT_Review` is expected to have similar performance on domain-specific tasks (such as aspect extraction) as `BERT-DK`, but much better on general tasks such as aspect sentiment classification (different domains mostly share similar sentiment words).
|
||||
|
||||
|
||||
## Citation
|
||||
If you find this work useful, please cite it as follows.
|
||||
```
|
||||
@inproceedings{xu_bert2019,
|
||||
title = "BERT Post-Training for Review Reading Comprehension and Aspect-based Sentiment Analysis",
|
||||
author = "Xu, Hu and Liu, Bing and Shu, Lei and Yu, Philip S.",
|
||||
booktitle = "Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics",
|
||||
month = "jun",
|
||||
year = "2019",
|
||||
}
|
||||
```
|
||||
44
model_cards/activebus/BERT_Review/README.md
Normal file
@@ -0,0 +1,44 @@
|
||||
# ReviewBERT
|
||||
|
||||
BERT (post-)trained from review corpus to understand sentiment, options and various e-commerce aspects.
|
||||
|
||||
`BERT_Review` is a cross-domain (beyond just `laptop` and `restaurant`) language model with one example from randomly mixed domains, post-trained (fine-tuned) on a combination of 5-core Amazon reviews and all Yelp data, expected to be 22 G in total. It is trained for 4 epochs on `bert-base-uncased`.
|
||||
The preprocessing code is available [here](https://github.com/howardhsu/BERT-for-RRC-ABSA/transformers).
|
||||
|
||||
|
||||
## Model Description
|
||||
|
||||
The original model is from `BERT-base-uncased` trained from Wikipedia+BookCorpus.
|
||||
Models are post-trained from [Amazon Dataset](http://jmcauley.ucsd.edu/data/amazon/) and [Yelp Dataset](https://www.yelp.com/dataset/challenge/).
|
||||
|
||||
|
||||
## Instructions
|
||||
Loading the post-trained weights is as simple as, e.g.:
|
||||
|
||||
```python
|
||||
import torch
|
||||
from transformers import AutoModel, AutoTokenizer
|
||||
|
||||
tokenizer = AutoTokenizer.from_pretrained("activebus/BERT_Review")
|
||||
model = AutoModel.from_pretrained("activebus/BERT_Review")
|
||||
|
||||
```
|
||||
|
||||
|
||||
## Evaluation Results
|
||||
|
||||
Check our [NAACL paper](https://www.aclweb.org/anthology/N19-1242.pdf)
|
||||
`BERT_Review` is expected to have similar performance on domain-specific tasks (such as aspect extraction) as `BERT-DK`, but much better on general tasks such as aspect sentiment classification (different domains mostly share similar sentiment words).
|
||||
|
||||
|
||||
## Citation
|
||||
If you find this work useful, please cite it as follows.
|
||||
```
|
||||
@inproceedings{xu_bert2019,
|
||||
title = "BERT Post-Training for Review Reading Comprehension and Aspect-based Sentiment Analysis",
|
||||
author = "Xu, Hu and Liu, Bing and Shu, Lei and Yu, Philip S.",
|
||||
booktitle = "Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics",
|
||||
month = "jun",
|
||||
year = "2019",
|
||||
}
|
||||
```
|
||||
@@ -18,13 +18,16 @@ tags:
|
||||
**Eval data:** Conll03 (NER), GermEval14 (NER), GermEval18 (Classification), GNAD (Classification)
|
||||
**Infrastructure**: 1x TPU v2
|
||||
**Published**: Jun 14th, 2019
|
||||
|
||||
**Update April 3rd, 2020**: we updated the vocabulary file on deepset's s3 to conform with the default tokenization of punctuation tokens.
|
||||
For details see the related [FARM issue](https://github.com/deepset-ai/FARM/issues/60). If you want to use the old vocab we have also uploaded a ["deepset/bert-base-german-cased-oldvocab"](https://huggingface.co/deepset/bert-base-german-cased-oldvocab) model.
|
||||
|
||||
## Details
|
||||
- We trained using Google's Tensorflow code on a single cloud TPU v2 with standard settings.
|
||||
- We trained 810k steps with a batch size of 1024 for sequence length 128 and 30k steps with sequence length 512. Training took about 9 days.
|
||||
- As training data we used the latest German Wikipedia dump (6GB of raw txt files), the OpenLegalData dump (2.4 GB) and news articles (3.6 GB).
|
||||
- We cleaned the data dumps with tailored scripts and segmented sentences with spacy v2.1. To create tensorflow records we used the recommended sentencepiece library for creating the word piece vocabulary and tensorflow scripts to convert the text to data usable by BERT.
|
||||
- Update April 3rd, 2020: updated the vocab file on deepset s3 to adjust tokenization of punctuation.
|
||||
|
||||
|
||||
See https://deepset.ai/german-bert for more details
|
||||
|
||||
|
||||
@@ -0,0 +1,28 @@
|
||||
---
|
||||
language: german
|
||||
thumbnail: https://static.tildacdn.com/tild6438-3730-4164-b266-613634323466/german_bert.png
|
||||
tags:
|
||||
- exbert
|
||||
---
|
||||
|
||||
<a href="https://huggingface.co/exbert/?model=bert-base-german-cased">
|
||||
<img width="300px" src="https://hf-dinosaur.huggingface.co/exbert/button.png">
|
||||
</a>
|
||||
|
||||
# German BERT with old vocabulary
|
||||
For details see the related [FARM issue](https://github.com/deepset-ai/FARM/issues/60).
|
||||
|
||||
|
||||
## About us
|
||||

|
||||
|
||||
We bring NLP to the industry via open source!
|
||||
Our focus: Industry specific language models & large scale QA systems.
|
||||
|
||||
Some of our work:
|
||||
- [German BERT (aka "bert-base-german-cased")](https://deepset.ai/german-bert)
|
||||
- [FARM](https://github.com/deepset-ai/FARM)
|
||||
- [Haystack](https://github.com/deepset-ai/haystack/)
|
||||
|
||||
Get in touch:
|
||||
[Twitter](https://twitter.com/deepset_ai) | [LinkedIn](https://www.linkedin.com/company/deepset-ai/) | [Website](https://deepset.ai)
|
||||
@@ -0,0 +1,18 @@
|
||||
# COVID-Twitter-BERT (CT-BERT)
|
||||
BERT-large-uncased model, pretrained on a corpus of messages from Twitter about COVID-19
|
||||
|
||||
## Overview
|
||||
This model was trained on 160M tweets collected between January 12 and April 16, 2020 containing at least one of the keywords "wuhan", "ncov", "coronavirus", "covid", or "sars-cov-2". These tweets were filtered and preprocessed to reach a final sample of 22.5M tweets (containing 40.7M sentences and 633M tokens) which were used for training.
|
||||
|
||||
This model was evaluated based on downstream classification tasks, but it could be used for any other NLP task which can leverage contextual embeddings.
|
||||
|
||||
To achieve the best results, make sure to use the same text preprocessing as we did for pretraining. This involves replacing user mentions, URLs and emojis. You can find a script in our project's [GitHub repo](https://github.com/digitalepidemiologylab/covid-twitter-bert).
|
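As a rough illustration of the kind of preprocessing described above, the sketch below replaces user mentions and URLs with placeholder tokens using regular expressions. It is not the project's script: the function name `preprocess_tweet`, the placeholder strings `<user>` and `<url>`, and the regular expressions are all assumptions made for this example, and emoji replacement is left out. For real use, take the preprocessing script from the GitHub repo linked above.

```python
import re

# Placeholder tokens are assumptions for illustration, not the project's choices.
USER_TOKEN = "<user>"
URL_TOKEN = "<url>"

URL_PATTERN = re.compile(r"https?://\S+|www\.\S+")
MENTION_PATTERN = re.compile(r"(?<!\w)@\w+")


def preprocess_tweet(text):
    """Replace URLs and user mentions with fixed placeholder tokens."""
    # URLs go first so that an '@' inside a URL is not treated as a mention.
    text = URL_PATTERN.sub(URL_TOKEN, text)
    text = MENTION_PATTERN.sub(USER_TOKEN, text)
    return text


cleaned = preprocess_tweet("Thanks @WHO! See https://who.int/covid and www.cdc.gov")
```

The point of matching the pretraining preprocessing is that the model only ever saw the placeholder tokens, never raw handles or links.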
||||
|
||||
## Example usage
|
||||
```python
|
||||
from transformers import AutoTokenizer, TFAutoModel

tokenizer = AutoTokenizer.from_pretrained("digitalepidemiologylab/covid-twitter-bert")
|
||||
model = TFAutoModel.from_pretrained("digitalepidemiologylab/covid-twitter-bert")
|
||||
```
|
||||
|
||||
## References
|
||||
[1] Martin Müller, Marcel Salathé, Per E Kummervold. "COVID-Twitter-BERT: A Natural Language Processing Model to Analyse COVID-19 Content on Twitter" arXiv preprint arXiv:2005.07503 (2020).
|
||||
@@ -0,0 +1,48 @@
|
||||
---
|
||||
language: romanian
|
||||
---
|
||||
|
||||
# bert-base-romanian-cased-v1
|
||||
|
||||
The BERT **base**, **cased** model for Romanian, trained on a 15GB corpus, version 
|
||||
|
||||
### How to use
|
||||
|
||||
```python
|
||||
from transformers import AutoTokenizer, AutoModel
|
||||
import torch
|
||||
# load tokenizer and model
|
||||
tokenizer = AutoTokenizer.from_pretrained("dumitrescustefan/bert-base-romanian-cased-v1")
|
||||
model = AutoModel.from_pretrained("dumitrescustefan/bert-base-romanian-cased-v1")
|
||||
# tokenize a sentence and run through the model
|
||||
input_ids = torch.tensor(tokenizer.encode("Acesta este un test.", add_special_tokens=True)).unsqueeze(0) # Batch size 1
|
||||
outputs = model(input_ids)
|
||||
# get encoding
|
||||
last_hidden_states = outputs[0] # The last hidden-state is the first element of the output tuple
|
||||
```
|
||||
|
||||
### Evaluation
|
||||
|
||||
Evaluation is performed on Universal Dependencies [Romanian RRT](https://universaldependencies.org/treebanks/ro_rrt/index.html) UPOS, XPOS and LAS, and on a NER task based on [RONEC](https://github.com/dumitrescustefan/ronec). Details, as well as more in-depth tests not shown here, are given in the dedicated [evaluation page](https://github.com/dumitrescustefan/Romanian-Transformers/tree/master/evaluation/README.md).
|
||||
|
||||
The baseline is the [Multilingual BERT](https://github.com/google-research/bert/blob/master/multilingual.md) model ``bert-base-multilingual-(un)cased``, as at the time of writing it was the only available BERT model that works on Romanian.
|
||||
|
||||
| Model | UPOS | XPOS | NER | LAS |
|
||||
|--------------------------------|:-----:|:------:|:-----:|:-----:|
|
||||
| bert-base-multilingual-cased | 97.87 | 96.16 | 84.13 | 88.04 |
|
||||
| bert-base-romanian-cased-v1 | **98.00** | **96.46** | **85.88** | **89.69** |
|
||||
|
||||
### Corpus
|
||||
|
||||
The model is trained on the following corpora (stats in the table below are after cleaning):
|
||||
|
||||
| Corpus | Lines(M) | Words(M) | Chars(B) | Size(GB) |
|
||||
|----------- |:--------: |:--------: |:--------: |:--------: |
|
||||
| OPUS | 55.05 | 635.04 | 4.045 | 3.8 |
|
||||
| OSCAR | 33.56 | 1725.82 | 11.411 | 11 |
|
||||
| Wikipedia | 1.54 | 60.47 | 0.411 | 0.4 |
|
||||
| **Total** | **90.15** | **2421.33** | **15.867** | **15.2** |
|
||||
|
||||
#### Acknowledgements
|
||||
|
||||
- We'd like to thank [Sampo Pyysalo](https://github.com/spyysalo) from TurkuNLP for helping us out with the compute needed to pretrain the v1.0 BERT models. He's awesome!
|
||||
@@ -0,0 +1,51 @@
|
||||
---
|
||||
language: romanian
|
||||
---
|
||||
|
||||
# bert-base-romanian-uncased-v1
|
||||
|
||||
The BERT **base**, **uncased** model for Romanian, trained on a 15GB corpus, version 
|
||||
|
||||
### How to use
|
||||
|
||||
```python
|
||||
from transformers import AutoTokenizer, AutoModel
|
||||
import torch
|
||||
|
||||
# load tokenizer and model
|
||||
tokenizer = AutoTokenizer.from_pretrained("dumitrescustefan/bert-base-romanian-uncased-v1", do_lower_case=True)
|
||||
model = AutoModel.from_pretrained("dumitrescustefan/bert-base-romanian-uncased-v1")
|
||||
|
||||
# tokenize a sentence and run through the model
|
||||
input_ids = torch.tensor(tokenizer.encode("Acesta este un test.", add_special_tokens=True)).unsqueeze(0) # Batch size 1
|
||||
outputs = model(input_ids)
|
||||
|
||||
# get encoding
|
||||
last_hidden_states = outputs[0] # The last hidden-state is the first element of the output tuple
|
||||
```
|
||||
|
||||
### Evaluation
|
||||
|
||||
Evaluation is performed on Universal Dependencies [Romanian RRT](https://universaldependencies.org/treebanks/ro_rrt/index.html) UPOS, XPOS and LAS, and on a NER task based on [RONEC](https://github.com/dumitrescustefan/ronec). Details, as well as more in-depth tests not shown here, are given in the dedicated [evaluation page](https://github.com/dumitrescustefan/Romanian-Transformers/tree/master/evaluation/README.md).
|
||||
|
||||
The baseline is the [Multilingual BERT](https://github.com/google-research/bert/blob/master/multilingual.md) model ``bert-base-multilingual-(un)cased``, as at the time of writing it was the only available BERT model that works on Romanian.
|
||||
|
||||
| Model | UPOS | XPOS | NER | LAS |
|
||||
|--------------------------------|:-----:|:------:|:-----:|:-----:|
|
||||
| bert-base-multilingual-uncased | 97.65 | 95.72 | 83.91 | 87.65 |
|
||||
| bert-base-romanian-uncased-v1 | **98.18** | **96.84** | **85.26** | **89.61** |
|
||||
|
||||
### Corpus
|
||||
|
||||
The model is trained on the following corpora (stats in the table below are after cleaning):
|
||||
|
||||
| Corpus | Lines(M) | Words(M) | Chars(B) | Size(GB) |
|
||||
|----------- |:--------: |:--------: |:--------: |:--------: |
|
||||
| OPUS | 55.05 | 635.04 | 4.045 | 3.8 |
|
||||
| OSCAR | 33.56 | 1725.82 | 11.411 | 11 |
|
||||
| Wikipedia | 1.54 | 60.47 | 0.411 | 0.4 |
|
||||
| **Total** | **90.15** | **2421.33** | **15.867** | **15.2** |
|
||||
|
||||
#### Acknowledgements
|
||||
|
||||
- We'd like to thank [Sampo Pyysalo](https://github.com/spyysalo) from TurkuNLP for helping us out with the compute needed to pretrain the v1.0 BERT models. He's awesome!
|
||||
74
model_cards/huseinzol05/t5-base-bahasa-cased/README.md
Normal file
@@ -0,0 +1,74 @@
|
||||
---
|
||||
language: malay
|
||||
---
|
||||
|
||||
# Bahasa T5 Model
|
||||
|
||||
Pretrained T5 base language model for Malay and Indonesian.
|
||||
|
||||
## Pretraining Corpus
|
||||
|
||||
The `t5-base-bahasa-cased` model was pretrained on multiple tasks. Below is the list of tasks we trained on:
|
||||
|
||||
1. [Unsupervised](https://github.com/google-research/text-to-text-transfer-transformer/blob/master/t5/data/preprocessors.py#L1875) on [local Wikipedia](https://github.com/huseinzol05/Malaya-Dataset#wikipedia-1).
|
||||
2. [Unsupervised](https://github.com/google-research/text-to-text-transfer-transformer/blob/master/t5/data/preprocessors.py#L1875) on [local news](https://github.com/huseinzol05/Malaya-Dataset#public-news).
|
||||
3. [Unsupervised](https://github.com/google-research/text-to-text-transfer-transformer/blob/master/t5/data/preprocessors.py#L1875) on [local parliament text](https://github.com/huseinzol05/Malaya-Dataset#parliament).
|
||||
4. [Unsupervised](https://github.com/google-research/text-to-text-transfer-transformer/blob/master/t5/data/preprocessors.py#L1875) on [IIUM Confession](https://github.com/huseinzol05/Malaya-Dataset#iium-confession).
|
||||
5. [Unsupervised](https://github.com/google-research/text-to-text-transfer-transformer/blob/master/t5/data/preprocessors.py#L1875) on [Wattpad](https://github.com/huseinzol05/Malaya-Dataset#wattpad).
|
||||
6. [Unsupervised](https://github.com/google-research/text-to-text-transfer-transformer/blob/master/t5/data/preprocessors.py#L1875) on [Academia PDF](https://github.com/huseinzol05/Malaya-Dataset#academia-pdf).
|
||||
7. [Next sentence prediction](https://github.com/google-research/text-to-text-transfer-transformer/blob/master/t5/data/preprocessors.py#L1129) on [local Wikipedia](https://github.com/huseinzol05/Malaya-Dataset#wikipedia-1).
|
||||
8. [Next sentence prediction](https://github.com/google-research/text-to-text-transfer-transformer/blob/master/t5/data/preprocessors.py#L1129) on [local news](https://github.com/huseinzol05/Malaya-Dataset#public-news).
|
||||
9. [Next sentence prediction](https://github.com/google-research/text-to-text-transfer-transformer/blob/master/t5/data/preprocessors.py#L1129) on [local parliament text](https://github.com/huseinzol05/Malaya-Dataset#parliament).
|
||||
10. [Next sentence prediction](https://github.com/google-research/text-to-text-transfer-transformer/blob/master/t5/data/preprocessors.py#L1129) on [IIUM Confession](https://github.com/huseinzol05/Malaya-Dataset#iium-confession).
|
||||
11. [Next sentence prediction](https://github.com/google-research/text-to-text-transfer-transformer/blob/master/t5/data/preprocessors.py#L1129) on [Wattpad](https://github.com/huseinzol05/Malaya-Dataset#wattpad).
|
||||
12. [Next sentence prediction](https://github.com/google-research/text-to-text-transfer-transformer/blob/master/t5/data/preprocessors.py#L1129) on [Academia PDF](https://github.com/huseinzol05/Malaya-Dataset#academia-pdf).
|
||||
13. [Bahasa SNLI](https://github.com/huseinzol05/Malaya-Dataset#snli).
|
||||
14. [Bahasa Question Quora](https://github.com/huseinzol05/Malaya-Dataset#quora).
|
||||
15. [Bahasa Natural Questions](https://github.com/huseinzol05/Malaya-Dataset#natural-questions).
|
||||
16. [News title summarization](https://github.com/huseinzol05/Malaya-Dataset#crawled-news).
|
||||
17. [Stemming to original wikipedia](https://github.com/huseinzol05/Malaya/blob/master/pretrained-model/t5/generate-stemming.ipynb).
|
||||
18. [Synonym to original wikipedia](https://github.com/huseinzol05/Malaya/blob/master/pretrained-model/t5/generate-synonym.ipynb).
|
||||
|
||||
The preprocessing steps can be reproduced from [Malaya/pretrained-model/preprocess](https://github.com/huseinzol05/Malaya/tree/master/pretrained-model/preprocess).
|
||||
|
||||
## Pretraining details
|
||||
|
||||
- This model was trained using Google T5's github [repository](https://github.com/google-research/text-to-text-transfer-transformer) on v3-8 TPU.
|
||||
- All steps can be reproduced from [Malaya/pretrained-model/t5](https://github.com/huseinzol05/Malaya/tree/master/pretrained-model/t5).
|
||||
|
||||
## Load Pretrained Model
|
||||
|
||||
You can use this model by installing `torch` or `tensorflow` and the Hugging Face `transformers` library. You can then use it directly by initializing it like this:
|
||||
|
||||
```python
|
||||
from transformers import T5Tokenizer, T5Model
|
||||
|
||||
model = T5Model.from_pretrained('huseinzol05/t5-base-bahasa-cased')
|
||||
tokenizer = T5Tokenizer.from_pretrained('huseinzol05/t5-base-bahasa-cased')
|
||||
```
|
||||
|
||||
## Example using T5ForConditionalGeneration
|
||||
|
||||
```python
|
||||
from transformers import T5Tokenizer, T5ForConditionalGeneration
|
||||
|
||||
tokenizer = T5Tokenizer.from_pretrained('huseinzol05/t5-base-bahasa-cased')
|
||||
model = T5ForConditionalGeneration.from_pretrained('huseinzol05/t5-base-bahasa-cased')
|
||||
input_ids = tokenizer.encode('soalan: siapakah perdana menteri malaysia?', return_tensors = 'pt')
|
||||
outputs = model.generate(input_ids)
|
||||
print(tokenizer.decode(outputs[0]))
|
||||
```
|
||||
|
||||
Output is,
|
||||
|
||||
```
|
||||
'Mahathir Mohamad'
|
||||
```
|
||||
|
||||
## Results
|
||||
|
||||
For further details on the model performance, check the accuracy page from Malaya, https://malaya.readthedocs.io/en/latest/Accuracy.html, where we compare it with traditional models.
|
||||
|
||||
## Acknowledgement
|
||||
|
||||
Thanks to [Im Big](https://www.facebook.com/imbigofficial/), [LigBlou](https://www.facebook.com/ligblou), [Mesolitica](https://mesolitica.com/) and [KeyReply](https://www.keyreply.com/) for sponsoring AWS, Google and GPU clouds to train T5 for Bahasa.
|
||||
92
model_cards/mrm8488/RuPERTa-base-finetuned-ner/README.md
Normal file
@@ -0,0 +1,92 @@
|
||||
---
|
||||
language: spanish
|
||||
thumbnail:
|
||||
---
|
||||
|
||||
# RuPERTa-base (Spanish RoBERTa) + NER 🎃🏷
|
||||
|
||||
This model is a version of [RuPERTa-base](https://huggingface.co/mrm8488/RuPERTa-base) fine-tuned on [NER-C](https://www.kaggle.com/nltkdata/conll-corpora) for the **NER** downstream task.
|
||||
|
||||
## Details of the downstream task (NER) - Dataset
|
||||
|
||||
- [Dataset: CONLL Corpora ES](https://www.kaggle.com/nltkdata/conll-corpora) 📚
|
||||
|
||||
| Dataset | # Examples |
|
||||
| ---------------------- | ----- |
|
||||
| Train | 329 K |
|
||||
| Dev | 40 K |
|
||||
|
||||
|
||||
- [Fine-tune on NER script provided by Huggingface](https://github.com/huggingface/transformers/blob/master/examples/token-classification/run_ner.py)
|
||||
|
||||
- Labels covered:
|
||||
|
||||
```
|
||||
B-LOC
|
||||
B-MISC
|
||||
B-ORG
|
||||
B-PER
|
||||
I-LOC
|
||||
I-MISC
|
||||
I-ORG
|
||||
I-PER
|
||||
O
|
||||
```
|
||||
|
||||
## Metrics on evaluation set 🧾
|
||||
|
||||
| Metric | # score |
|
||||
| :------------------------------------------------------------------------------------: | :-------: |
|
||||
| F1 | **77.55** |
|
||||
| Precision | **75.53** |
|
||||
| Recall | **79.68** |
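The three scores in the table are related: F1 is the harmonic mean of precision and recall. A minimal sketch of that check, assuming the reported values are micro-averaged percentages; the helper name `f1_from_precision_recall` is made up for illustration and is not part of the fine-tuning script:

```python
def f1_from_precision_recall(precision: float, recall: float) -> float:
    """Return the harmonic mean of precision and recall (0.0 if both are zero)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Precision and recall as reported in the table above (in percent).
f1 = f1_from_precision_recall(75.53, 79.68)
print(round(f1, 2))  # 77.55
```

The result agrees with the F1 value reported in the table.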
|
||||
|
||||
## Model in action 🔨
|
||||
|
||||
|
||||
Example of usage:
|
||||
|
||||
```python
|
||||
import torch
|
||||
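from transformers import AutoModelForTokenClassification, AutoTokenizer

# load the fine-tuned tokenizer and model
tokenizer = AutoTokenizer.from_pretrained('mrm8488/RuPERTa-base-finetuned-ner')
model = AutoModelForTokenClassification.from_pretrained('mrm8488/RuPERTa-base-finetuned-ner')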
|
||||
|
||||
id2label = {
|
||||
"0": "B-LOC",
|
||||
"1": "B-MISC",
|
||||
"2": "B-ORG",
|
||||
"3": "B-PER",
|
||||
"4": "I-LOC",
|
||||
"5": "I-MISC",
|
||||
"6": "I-ORG",
|
||||
"7": "I-PER",
|
||||
"8": "O"
|
||||
}
|
||||
|
||||
text ="Julien, CEO de HF, nació en Francia."
|
||||
input_ids = torch.tensor(tokenizer.encode(text)).unsqueeze(0)
|
||||
|
||||
outputs = model(input_ids)
|
||||
last_hidden_states = outputs[0]
|
||||
|
||||
for m in last_hidden_states:
|
||||
for index, n in enumerate(m):
|
||||
if(index > 0 and index <= len(text.split(" "))):
|
||||
print(text.split(" ")[index-1] + ": " + id2label[str(torch.argmax(n).item())])
|
||||
|
||||
'''
|
||||
Output:
|
||||
--------
|
||||
Julien,: I-PER
|
||||
CEO: O
|
||||
de: O
|
||||
HF,: B-ORG
|
||||
nació: I-PER
|
||||
en: I-PER
|
||||
Francia.: I-LOC
|
||||
'''
|
||||
```
|
||||
Yeah! Not too bad 🎉
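The loop in the example pairs the i-th whitespace-separated word with the i-th model position, which only lines up when every word maps to exactly one subword token. A minimal sketch of one common alternative keeps the label predicted for the first subword of each word. It assumes a fast tokenizer whose encoding exposes `word_ids()` (one word index per token, `None` for special tokens). The function name `first_subword_labels` and the sample IDs below are hypothetical and are not model output:

```python
from typing import Dict, List, Optional, Tuple


def first_subword_labels(
    words: List[str],
    word_ids: List[Optional[int]],
    label_ids: List[int],
    id2label: Dict[str, str],
) -> List[Tuple[str, Optional[str]]]:
    """Assign each word the label predicted for its first subword token."""
    labels: Dict[int, str] = {}
    for position, word_index in enumerate(word_ids):
        if word_index is None:
            continue  # special tokens such as <s> and </s>
        if word_index in labels:
            continue  # later subwords of a word that already has a label
        labels[word_index] = id2label[str(label_ids[position])]
    # Words without any token (for example after truncation) get None.
    return [(word, labels.get(i)) for i, word in enumerate(words)]


# Hypothetical inputs: "Julien" and "Francia" are split into two subwords each.
id2label = {"0": "B-LOC", "3": "B-PER", "8": "O"}
words = ["Julien", "nació", "en", "Francia"]
word_ids = [None, 0, 0, 1, 2, 3, 3, None]
label_ids = [8, 3, 3, 8, 8, 0, 0, 8]

pairs = first_subword_labels(words, word_ids, label_ids, id2label)
for word, label in pairs:
    print(f"{word}: {label}")
```

With a real model, `label_ids` would come from taking the argmax over the last dimension of the logits for each token position.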
|
||||
|
||||
> Created by [Manuel Romero/@mrm8488](https://twitter.com/mrm8488)
|
||||
|
||||
> Made with <span style="color: #e25555;">♥</span> in Spain
|
||||
111
model_cards/mrm8488/RuPERTa-base-finetuned-pos/README.md
Normal file
@@ -0,0 +1,111 @@
|
||||
---
|
||||
language: spanish
|
||||
thumbnail:
|
||||
---
|
||||
|
||||
# RuPERTa-base (Spanish RoBERTa) + POS 🎃🏷
|
||||
|
||||
This model is a version of [RuPERTa-base](https://huggingface.co/mrm8488/RuPERTa-base) fine-tuned on [CONLL CORPORA](https://www.kaggle.com/nltkdata/conll-corpora) for the **POS** downstream task.
|
||||
|
||||
## Details of the downstream task (POS) - Dataset
|
||||
|
||||
- [Dataset: CONLL Corpora ES](https://www.kaggle.com/nltkdata/conll-corpora) 📚
|
||||
|
||||
| Dataset | # Examples |
|
||||
| ---------------------- | ----- |
|
||||
| Train | 445 K |
|
||||
| Dev | 55 K |
|
||||
|
||||
- [Fine-tune on NER script provided by Huggingface](https://github.com/huggingface/transformers/blob/master/examples/token-classification/run_ner.py)
|
||||
|
||||
- Labels covered:
|
||||
|
||||
```
|
||||
ADJ
|
||||
ADP
|
||||
ADV
|
||||
AUX
|
||||
CCONJ
|
||||
DET
|
||||
INTJ
|
||||
NOUN
|
||||
NUM
|
||||
PART
|
||||
PRON
|
||||
PROPN
|
||||
PUNCT
|
||||
SCONJ
|
||||
SYM
|
||||
VERB
|
||||
```
|
||||
|
||||
## Metrics on evaluation set 🧾
|
||||
|
||||
| Metric | # score |
|
||||
| :------------------------------------------------------------------------------------: | :-------: |
|
||||
| F1 | **97.39** |
|
||||
| Precision | **97.47** |
|
||||
| Recall | **97.32** |
|
||||
|
||||
## Model in action 🔨
|
||||
|
||||
|
||||
Example of usage
|
||||
|
||||
```python
|
||||
import torch
|
||||
from transformers import AutoModelForTokenClassification, AutoTokenizer
|
||||
|
||||
tokenizer = AutoTokenizer.from_pretrained('mrm8488/RuPERTa-base-finetuned-pos')
|
||||
model = AutoModelForTokenClassification.from_pretrained('mrm8488/RuPERTa-base-finetuned-pos')
|
||||
|
||||
id2label = {
|
||||
"0": "O",
|
||||
"1": "ADJ",
|
||||
"2": "ADP",
|
||||
"3": "ADV",
|
||||
"4": "AUX",
|
||||
"5": "CCONJ",
|
||||
"6": "DET",
|
||||
"7": "INTJ",
|
||||
"8": "NOUN",
|
||||
"9": "NUM",
|
||||
"10": "PART",
|
||||
"11": "PRON",
|
||||
"12": "PROPN",
|
||||
"13": "PUNCT",
|
||||
"14": "SCONJ",
|
||||
"15": "SYM",
|
||||
"16": "VERB"
|
||||
}
|
||||
|
||||
text ="Mis amigos están pensando viajar a Londres este verano."
|
||||
input_ids = torch.tensor(tokenizer.encode(text)).unsqueeze(0)
|
||||
|
||||
outputs = model(input_ids)
|
||||
last_hidden_states = outputs[0]
|
||||
|
||||
for m in last_hidden_states:
|
||||
for index, n in enumerate(m):
|
||||
if(index > 0 and index <= len(text.split(" "))):
|
||||
print(text.split(" ")[index-1] + ": " + id2label[str(torch.argmax(n).item())])
|
||||
|
||||
'''
|
||||
Output:
|
||||
--------
|
||||
Mis: NUM
|
||||
amigos: PRON
|
||||
están: AUX
|
||||
pensando: ADV
|
||||
viajar: VERB
|
||||
a: ADP
|
||||
Londres: PROPN
|
||||
este: DET
|
||||
verano.: NOUN
|
||||
'''
|
||||
```
|
||||
Yeah! Not too bad 🎉
|
||||
|
||||
> Created by [Manuel Romero/@mrm8488](https://twitter.com/mrm8488) | [LinkedIn](https://www.linkedin.com/in/manuel-romero-cs/)
|
||||
|
||||
> Made with <span style="color: #e25555;">♥</span> in Spain
|
||||
@@ -43,8 +43,8 @@ import torch
|
||||
discriminator = ElectraForPreTraining.from_pretrained("mrm8488/electricidad-small-discriminator")
|
||||
tokenizer = ElectraTokenizerFast.from_pretrained("mrm8488/electricidad-small-discriminator")
|
||||
|
||||
sentence = "El rápido zorro marrón salta sobre el perro perezoso"
|
||||
fake_sentence = "El rápido zorro marrón falsea sobre el perro perezoso"
|
||||
sentence = "el zorro rojo es muy rápido"
|
||||
fake_sentence = "el zorro rojo es muy ser"
|
||||
|
||||
fake_tokens = tokenizer.tokenize(sentence)
|
||||
fake_inputs = tokenizer.encode(sentence, return_tensors="pt")
|
||||
@@ -53,9 +53,16 @@ predictions = torch.round((torch.sign(discriminator_outputs[0]) + 1) / 2)
|
||||
|
||||
[print("%7s" % token, end="") for token in fake_tokens]
|
||||
|
||||
[print("%7s" % prediction, end="") for prediction in predictions.tolist()]
|
||||
[print("%7s" % int(prediction), end="") for prediction in predictions.tolist()[1:-1]]
|
||||
|
||||
# Output:
|
||||
'''
|
||||
el zorro rojo es muy ser 0 0 0 0 0 1[None, None, None, None, None, None]
|
||||
'''
|
||||
```
|
||||
|
||||
As you can see there is a **1** in the place where the model detected the fake token (**ser**). So, it works! 🎉
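The 0/1 row comes from the expression `torch.round((torch.sign(logits) + 1) / 2)`, which maps a negative logit to 0 (token judged original) and a positive logit to 1 (token judged replaced). A minimal pure-Python sketch of the same mapping; the logit values are hypothetical and are not taken from the model:

```python
from typing import List


def sign(x: float) -> int:
    """Return -1, 0 or 1 depending on the sign of x."""
    return (x > 0) - (x < 0)


def logits_to_flags(logits: List[float]) -> List[int]:
    """Map each logit to 0 (original token) or 1 (replaced token)."""
    return [round((sign(x) + 1) / 2) for x in logits]


# Hypothetical discriminator logits for: el zorro rojo es muy ser
hypothetical_logits = [-4.2, -3.1, -5.0, -2.7, -1.3, 2.9]
flags = logits_to_flags(hypothetical_logits)
print(flags)  # [0, 0, 0, 0, 0, 1]
```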
|
||||
|
||||
## Acknowledgments
|
||||
|
||||
I thank [🤗/transformers team](https://github.com/huggingface/transformers) for answering my doubts and Google for helping me with the [TensorFlow Research Cloud](https://www.tensorflow.org/tfrc) program.
|
||||
|
||||
@@ -20,6 +20,9 @@ A few examples of the model response to a query before and after optimisation:
|
||||
|I have watched 3 episodes |with this guy and he is such a talented actor...| but the show is just plain awful and there ne...| 2.681171| -4.512792|
|
||||
|We know that firefighters and| police officers are forced to become populari...| other chains have going to get this disaster ...| 1.367811| -3.34017|
|
||||
|
||||
## Training logs and metrics <img src="https://gblobscdn.gitbook.com/spaces%2F-Lqya5RvLedGEWPhtkjU%2Favatar.png?alt=media" width="25" height="25">
|
||||
Watch the whole training logs and metrics on [W&B](https://app.wandb.ai/mrm8488/gpt2-sentiment-negative?workspace=user-mrm8488)
|
||||
|
||||
|
||||
|
||||
> Created by [Manuel Romero/@mrm8488](https://twitter.com/mrm8488)
|
||||
|
||||
125
model_cards/oliverguhr/german-sentiment-bert/README.md
Normal file
@@ -0,0 +1,125 @@
|
||||
# German Sentiment Classification with Bert
|
||||
|
||||
This model was trained for sentiment classification of German-language texts. To achieve the best results, all model inputs need to be preprocessed with the same procedure that was applied during training. To simplify the usage of the model,
|
||||
we provide a Python package that bundles the code needed for preprocessing and inference.
|
||||
|
||||
The model uses Google's BERT architecture and was trained on 1.834 million German-language samples. The training data contains texts from various domains like Twitter, Facebook and movie, app and hotel reviews.
|
||||
You can find more information about the dataset and the training process in the [paper](http://www.lrec-conf.org/proceedings/lrec2020/pdf/2020.lrec-1.201.pdf).
|
||||
|
||||
## Using the Python package
|
||||
|
||||
To get started install the package from [pypi](https://pypi.org/project/germansentiment/):
|
||||
|
||||
```bash
|
||||
pip install germansentiment
|
||||
```
|
||||
|
||||
```python
|
||||
from germansentiment import SentimentModel
|
||||
|
||||
model = SentimentModel()
|
||||
|
||||
texts = [
|
||||
"Mit keinem guten Ergebniss","Das ist gar nicht mal so gut",
|
||||
"Total awesome!","nicht so schlecht wie erwartet",
|
||||
"Der Test verlief positiv.","Sie fährt ein grünes Auto."]
|
||||
|
||||
result = model.predict_sentiment(texts)
|
||||
print(result)
|
||||
```
|
||||
|
||||
The code above will output the following list:
|
||||
|
||||
```python
|
||||
["negative","negative","positive","positive","neutral", "neutral"]
|
||||
```
|
||||
|
||||
## Minimal working sample
|
||||
|
||||
|
||||
```python
|
||||
from transformers import AutoModelForSequenceClassification, AutoTokenizer
|
||||
from typing import List
|
||||
import torch
|
||||
import re
|
||||
|
||||
class SentimentModel():
|
||||
def __init__(self, model_name: str):
|
||||
self.model = AutoModelForSequenceClassification.from_pretrained(model_name)
|
||||
self.tokenizer = AutoTokenizer.from_pretrained(model_name)
|
||||
|
||||
self.clean_chars = re.compile(r'[^A-Za-züöäÖÜÄß ]', re.MULTILINE)
|
||||
self.clean_http_urls = re.compile(r'https*\S+', re.MULTILINE)
|
||||
self.clean_at_mentions = re.compile(r'@\S+', re.MULTILINE)
|
||||
|
||||
def predict_sentiment(self, texts: List[str])-> List[str]:
|
||||
texts = [self.clean_text(text) for text in texts]
|
||||
# Add special tokens takes care of adding [CLS], [SEP], <s>... tokens in the right way for each model.
|
||||
input_ids = self.tokenizer.batch_encode_plus(texts,pad_to_max_length=True, add_special_tokens=True)
|
||||
input_ids = torch.tensor(input_ids["input_ids"])
|
||||
|
||||
with torch.no_grad():
|
||||
logits = self.model(input_ids)
|
||||
|
||||
label_ids = torch.argmax(logits[0], axis=1)
|
||||
|
||||
labels = [self.model.config.id2label[label_id] for label_id in label_ids.tolist()]
|
||||
return labels
|
||||
|
||||
def replace_numbers(self,text: str) -> str:
|
||||
return text.replace("0"," null").replace("1"," eins").replace("2"," zwei").replace("3"," drei").replace("4"," vier").replace("5"," fünf").replace("6"," sechs").replace("7"," sieben").replace("8"," acht").replace("9"," neun")
|
||||
|
||||
def clean_text(self,text: str)-> str:
|
||||
text = text.replace("\n", " ")
|
||||
text = self.clean_http_urls.sub('',text)
|
||||
text = self.clean_at_mentions.sub('',text)
|
||||
text = self.replace_numbers(text)
|
||||
text = self.clean_chars.sub('', text) # use only text chars
|
||||
text = ' '.join(text.split()) # substitute multiple whitespace with single whitespace
|
||||
text = text.strip().lower()
|
||||
return text
|
||||
|
||||
texts = ["Mit keinem guten Ergebniss","Das war unfair", "Das ist gar nicht mal so gut",
|
||||
"Total awesome!","nicht so schlecht wie erwartet", "Das ist gar nicht mal so schlecht",
|
||||
"Der Test verlief positiv.","Sie fährt ein grünes Auto.", "Der Fall wurde an die Polzei übergeben."]
|
||||
|
||||
model = SentimentModel(model_name = "oliverguhr/german-sentiment-bert")
|
||||
|
||||
print(model.predict_sentiment(texts))
|
||||
```
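To see what the preprocessing does to a raw input, the cleaning steps can be run on their own, without loading the model. The sketch below reuses the regular expressions and the step order of `clean_text` from the sample above, rewritten as a module-level function; the input sentence and the `DIGIT_WORDS` name are made up for illustration:

```python
import re

CLEAN_CHARS = re.compile(r'[^A-Za-züöäÖÜÄß ]', re.MULTILINE)
CLEAN_HTTP_URLS = re.compile(r'https*\S+', re.MULTILINE)
CLEAN_AT_MENTIONS = re.compile(r'@\S+', re.MULTILINE)

# Each digit is spelled out in German, with a leading space.
DIGIT_WORDS = {
    "0": " null", "1": " eins", "2": " zwei", "3": " drei", "4": " vier",
    "5": " fünf", "6": " sechs", "7": " sieben", "8": " acht", "9": " neun",
}


def clean_text(text: str) -> str:
    """Apply the same cleaning steps as SentimentModel.clean_text."""
    text = text.replace("\n", " ")
    text = CLEAN_HTTP_URLS.sub('', text)    # drop URLs
    text = CLEAN_AT_MENTIONS.sub('', text)  # drop @mentions
    for digit, word in DIGIT_WORDS.items():
        text = text.replace(digit, word)    # spell out digits
    text = CLEAN_CHARS.sub('', text)        # keep letters and spaces only
    text = ' '.join(text.split())           # collapse repeated whitespace
    return text.strip().lower()


cleaned = clean_text("@user Das Hotel war super, 5 Sterne! https://example.com")
print(cleaned)  # das hotel war super fünf sterne
```

The mention, the URL and the punctuation are removed, the digit is spelled out, and the result is lowercased.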
|
||||
|
||||
## Model and Data
|
||||
|
||||
If you are interested in the code and data that were used to train this model, please have a look at [this repository](https://github.com/oliverguhr/german-sentiment) and our [paper](http://www.lrec-conf.org/proceedings/lrec2020/pdf/2020.lrec-1.201.pdf). Here is a table of the F1 scores that this model achieves on the following datasets. Since we trained this model on a newer version of the transformers library, the results are slightly better than reported in the paper.
|
||||
|
||||
| Dataset | F1 micro Score |
|
||||
| :----------------------------------------------------------- | -------------: |
|
||||
| [holidaycheck](https://github.com/oliverguhr/german-sentiment) | 0.9568 |
|
||||
| [scare](https://www.romanklinger.de/scare/) | 0.9418 |
|
||||
| [filmstarts](https://github.com/oliverguhr/german-sentiment) | 0.9021 |
|
||||
| [germeval](https://sites.google.com/view/germeval2017-absa/home) | 0.7536 |
|
||||
| [PotTS](https://www.aclweb.org/anthology/L16-1181/) | 0.6780 |
|
||||
| [emotions](https://github.com/oliverguhr/german-sentiment) | 0.9649 |
|
||||
| [sb10k](https://www.spinningbytes.com/resources/germansentiment/) | 0.7376 |
|
||||
| [Leipzig Wikipedia Corpus 2016](https://wortschatz.uni-leipzig.de/de/download/german) | 0.9967 |
|
||||
| all | 0.9639 |
|
||||
|
||||
## Cite
|
||||
|
||||
For feedback and questions, contact me via mail or Twitter [@oliverguhr](https://twitter.com/oliverguhr). Please cite us if you found this useful:
|
||||
|
||||
```
|
||||
@InProceedings{guhr-EtAl:2020:LREC,
|
||||
author = {Guhr, Oliver and Schumann, Anne-Kathrin and Bahrmann, Frank and Böhme, Hans Joachim},
|
||||
title = {Training a Broad-Coverage German Sentiment Classification Model for Dialog Systems},
|
||||
booktitle = {Proceedings of The 12th Language Resources and Evaluation Conference},
|
||||
month = {May},
|
||||
year = {2020},
|
||||
address = {Marseille, France},
|
||||
publisher = {European Language Resources Association},
|
||||
pages = {1620--1625},
|
||||
url = {https://www.aclweb.org/anthology/2020.lrec-1.201}
|
||||
}
|
||||
```
|
||||
|
||||
|
||||
@@ -5,12 +5,12 @@ https://huggingface.co/savasy/bert-base-turkish-sentiment-cased
|
||||
This model is used for Sentiment Analysis, which is based on BERTurk for Turkish Language https://huggingface.co/dbmdz/bert-base-turkish-cased
|
||||
|
||||
|
||||
# Dataset
|
||||
## Dataset
|
||||
|
||||
The dataset is taken from the studies [2] and [3] and merged.
|
||||
The dataset is taken from the studies [[2]](#paper-2) and [[3]](#paper-3), and merged.
|
||||
|
||||
* The study [2] gathered movie and product reviews. The products are book, DVD, electronics, and kitchen.
|
||||
The movie dataset is taken from a cinema Web page (www.beyazperde.com) with
|
||||
The movie dataset is taken from a cinema Web page ([Beyazperde](www.beyazperde.com)) with
|
||||
5331 positive and 5331 negative sentences. Reviews in the Web page are marked in
|
||||
scale from 0 to 5 by the users who made the reviews. The study considered a review
|
||||
sentiment positive if the rating is equal to or bigger than 4, and negative if it is less
|
||||
@@ -19,9 +19,9 @@ Web page. They constructed benchmark dataset consisting of reviews regarding som
|
||||
products (book, DVD, etc.). Likewise, reviews are marked in the range from 1 to 5,
|
||||
and majority class of reviews are 5. Each category has 700 positive and 700 negative
|
||||
reviews in which average rating of negative reviews is 2.27 and of positive reviews
|
||||
is 4.5. This dataset is also used the study [1]
|
||||
is 4.5. This dataset is also used by the study [[1]](#paper-1).
|
||||
|
||||
* The study[3] collected tweet dataset. They proposed a new approach for automatically classifying the sentiment of microblog messages. The proposed approach is based on utilizing robust feature representation and fusion.
|
||||
* The study [[3]](#paper-3) collected tweet dataset. They proposed a new approach for automatically classifying the sentiment of microblog messages. The proposed approach is based on utilizing robust feature representation and fusion.
|
||||
|
||||
*Merged Dataset*
|
||||
|
||||
@@ -32,20 +32,21 @@ is 4.5. This dataset is also used the study [1]
|
||||
| 32000 |train.tsv|
|
||||
| *48290* |*total*|
|
||||
|
||||
### The dataset is used by following papers
|
||||
|
||||
The dataset is used by following papers
|
||||
|
||||
* 1 Yildirim, Savaş. (2020). Comparing Deep Neural Networks to Traditional Models for Sentiment Analysis in Turkish Language. 10.1007/978-981-15-1216-2_12.
|
||||
* 2 Demirtas, Erkin and Mykola Pechenizkiy. 2013. Cross-lingual polarity detection with machine translation. In Proceedings of the Second International Workshop on Issues of Sentiment
|
||||
<a id="paper-1">[1]</a> Yildirim, Savaş. (2020). Comparing Deep Neural Networks to Traditional Models for Sentiment Analysis in Turkish Language. 10.1007/978-981-15-1216-2_12.
|
||||
|
||||
<a id="paper-2">[2]</a> Demirtas, Erkin and Mykola Pechenizkiy. 2013. Cross-lingual polarity detection with machine translation. In Proceedings of the Second International Workshop on Issues of Sentiment
|
||||
Discovery and Opinion Mining (WISDOM ’13)
|
||||
* [3] Hayran, A., Sert, M. (2017), "Sentiment Analysis on Microblog Data based on Word Embedding and Fusion Techniques", IEEE 25th Signal Processing and Communications Applications Conference (SIU 2017), Belek, Turkey
|
||||
|
||||
# Training
|
||||
<a id="paper-3">[3]</a> Hayran, A., Sert, M. (2017), "Sentiment Analysis on Microblog Data based on Word Embedding and Fusion Techniques", IEEE 25th Signal Processing and Communications Applications Conference (SIU 2017), Belek, Turkey
|
||||
|
||||
```
|
||||
|
||||
## Training
|
||||
|
||||
```shell
|
||||
export GLUE_DIR="./sst-2-newall"
|
||||
export TASK_NAME=SST-2
|
||||
|
||||
|
||||
python3 run_glue.py \
|
||||
--model_type bert \
|
||||
@@ -59,88 +60,79 @@ python3 run_glue.py \
|
||||
--learning_rate 2e-5 \
|
||||
--num_train_epochs 3.0 \
|
||||
--output_dir "./model"
|
||||
|
||||
```
|
||||
|
||||
|
||||
## Results
|
||||
|
||||
> 05/10/2020 17:00:43 - INFO - transformers.trainer - \*\*\*\*\* Running Evaluation \*\*\*\*\*
|
||||
> 05/10/2020 17:00:43 - INFO - transformers.trainer - Num examples = 7999
|
||||
> 05/10/2020 17:00:43 - INFO - transformers.trainer - Batch size = 8
|
||||
> Evaluation: 100% 1000/1000 [00:34<00:00, 29.04it/s]
|
||||
> 05/10/2020 17:01:17 - INFO - \_\_main__ - \*\*\*\*\* Eval results sst-2 \*\*\*\*\*
|
||||
> 05/10/2020 17:01:17 - INFO - \_\_main__ - acc = 0.9539942492811602
|
||||
> 05/10/2020 17:01:17 - INFO - \_\_main__ - loss = 0.16348013816401363
|
||||
|
||||
Accuracy is about **95.4%**
|
||||
|
||||
|
||||
# Results
|
||||
## Code Usage
|
||||
|
||||
> 05/10/2020 17:00:43 - INFO - transformers.trainer - ***** Running Evaluation *****
|
||||
|
||||
> 05/10/2020 17:00:43 - INFO - transformers.trainer - Num examples = 7999
|
||||
|
||||
> 05/10/2020 17:00:43 - INFO - transformers.trainer - Batch size = 8
|
||||
|
||||
>Evaluation: 100% 1000/1000 [00:34<00:00, 29.04it/s]
|
||||
|
||||
>05/10/2020 17:01:17 - INFO - __main__ - ***** Eval results sst-2 *****
|
||||
|
||||
>05/10/2020 17:01:17 - INFO - __main__ - acc = 0.9539942492811602
|
||||
|
||||
>05/10/2020 17:01:17 - INFO - __main__ - loss = 0.16348013816401363
|
||||
|
||||
|
||||
Accuracy is about *%95.4*
|
||||
# Code Usage
|
||||
|
||||
```
|
||||
```python
|
||||
from transformers import AutoModelForSequenceClassification, AutoTokenizer, pipeline
|
||||
|
||||
model = AutoModelForSequenceClassification.from_pretrained("savasy/bert-base-turkish-sentiment-cased")
|
||||
tokenizer = AutoTokenizer.from_pretrained("savasy/bert-base-turkish-sentiment-cased")
|
||||
sa= pipeline("sentiment-analysis", tokenizer=tokenizer, model=model)
|
||||
|
||||
p= sa("bu telefon modelleri çok kaliteli , her parçası çok özel bence")
|
||||
p = sa("bu telefon modelleri çok kaliteli , her parçası çok özel bence")
|
||||
print(p)
|
||||
#[{'label': 'LABEL_1', 'score': 0.9871089}]
|
||||
print (p[0]['label']=='LABEL_1')
|
||||
#True
|
||||
# [{'label': 'LABEL_1', 'score': 0.9871089}]
|
||||
print(p[0]['label'] == 'LABEL_1')
|
||||
# True
|
||||
|
||||
|
||||
p= sa("Film çok kötü ve çok sahteydi")
|
||||
p = sa("Film çok kötü ve çok sahteydi")
|
||||
print(p)
|
||||
#[{'label': 'LABEL_0', 'score': 0.9975505}]
|
||||
print (p[0]['label']=='LABEL_1')
|
||||
#False
|
||||
# [{'label': 'LABEL_0', 'score': 0.9975505}]
|
||||
print(p[0]['label'] == 'LABEL_1')
|
||||
# False
|
||||
```
|
||||
|
||||
# Test your data
|
||||
|
||||
## Test
|
||||
### Data
|
||||
|
||||
Suppose your file has many lines, each with a comment and a label (1 or 0) at the end, tab-separated
|
||||
|
||||
> comment1 ... \t label
|
||||
|
||||
> comment2 ... \t label
|
||||
|
||||
> comment1 ... \t label
|
||||
> comment2 ... \t label
|
||||
> ...
|
||||
|
||||
### Code
|
||||
|
||||
|
||||
```
|
||||
```python
|
||||
from transformers import AutoModelForSequenceClassification, AutoTokenizer, pipeline
|
||||
|
||||
f="/path/to/your/file/yourfile.tsv"
|
||||
model = AutoModelForSequenceClassification.from_pretrained("savasy/bert-base-turkish-sentiment-cased")
|
||||
tokenizer = AutoTokenizer.from_pretrained("savasy/bert-base-turkish-sentiment-cased")
|
||||
sa= pipeline("sentiment-analysis", tokenizer=tokenizer, model=model)
|
||||
sa = pipeline("sentiment-analysis", tokenizer=tokenizer, model=model)
|
||||
|
||||
i,crr=0,0
|
||||
for line in open(f):
|
||||
lines=line.strip().split("\t")
|
||||
if len(lines)==2:
|
||||
i=i+1
|
||||
if i%100==0:
|
||||
print(i)
|
||||
pred= sa(lines[0])
|
||||
pred=pred[0]["label"].split("_")[1]
|
||||
if pred== lines[1]:
|
||||
crr=crr+1
|
||||
input_file = "/path/to/your/file/yourfile.tsv"
|
||||
|
||||
i, crr = 0, 0
|
||||
for line in open(input_file):
|
||||
lines = line.strip().split("\t")
|
||||
if len(lines) == 2:
|
||||
|
||||
i = i + 1
|
||||
if i%100 == 0:
|
||||
print(i)
|
||||
|
||||
pred = sa(lines[0])
|
||||
pred = pred[0]["label"].split("_")[1]
|
||||
|
||||
if pred == lines[1]:
|
||||
crr = crr + 1
|
||||
|
||||
print(crr, i, crr/i)
|
||||
```
|
||||
|
||||
|
||||
|
||||
|
||||
|
||||
|
||||
67
model_cards/savasy/bert-base-turkish-squad/README.md
Normal file
@@ -0,0 +1,67 @@
|
||||
---
|
||||
language: turkish
|
||||
---
|
||||
# Turkish SQuAD Model : Question Answering
|
||||
|
||||
I fine-tuned a Turkish BERT model for question answering on TQuAD, a Turkish version of SQuAD.
|
||||
* BERT-base: https://huggingface.co/dbmdz/bert-base-turkish-uncased
|
||||
* TQuAD dataset: https://github.com/TQuad/turkish-nlp-qa-dataset
|
||||
|
||||
|
||||
# Training Code
|
||||
|
||||
```
|
||||
!python3 run_squad.py \
|
||||
--model_type bert \
|
||||
--model_name_or_path dbmdz/bert-base-turkish-uncased \
|
||||
--do_train \
|
||||
--do_eval \
|
||||
--train_file trainQ.json \
|
||||
--predict_file dev1.json \
|
||||
--per_gpu_train_batch_size 12 \
|
||||
--learning_rate 3e-5 \
|
||||
--num_train_epochs 5.0 \
|
||||
--max_seq_length 384 \
|
||||
--doc_stride 128 \
|
||||
--output_dir "./model"
|
||||
```
|
||||
|
||||
|
||||
# Example Usage
|
||||
|
||||
> Load Model
|
||||
```python
|
||||
from transformers import AutoTokenizer, AutoModelForQuestionAnswering, pipeline
|
||||
import torch
|
||||
|
||||
tokenizer = AutoTokenizer.from_pretrained("./model")
|
||||
model = AutoModelForQuestionAnswering.from_pretrained("./model")
|
||||
nlp = pipeline("question-answering", model=model, tokenizer=tokenizer)
|
||||
```
|
||||
|
||||
> Apply the model
|
||||
```python
|
||||
|
||||
sait="ABASIYANIK, Sait Faik. Hikayeci (Adapazarı 23 Kasım 1906-İstanbul 11 Mayıs 1954). \
|
||||
İlk öğrenimine Adapazarı’nda Rehber-i Terakki Mektebi’nde başladı. İki yıl kadar Adapazarı İdadisi’nde okudu.\
|
||||
İstanbul Erkek Lisesi’nde devam ettiği orta öğrenimini Bursa Lisesi’nde tamamladı (1928). İstanbul Edebiyat \
|
||||
Fakültesi’ne iki yıl devam ettikten sonra babasının isteği üzerine iktisat öğrenimi için İsviçre’ye gitti. \
|
||||
Kısa süre sonra iktisat öğrenimini bırakarak Lozan’dan Grenoble’a geçti. Üç yıl başıboş bir edebiyat öğrenimi \
|
||||
gördükten sonra babası tarafından geri çağrıldı (1933). Bir müddet Halıcıoğlu Ermeni Yetim Mektebi'nde Türkçe \
|
||||
gurup dersleri öğretmenliği yaptı. Ticarete atıldıysa da tutunamadı. Bir ay Haber gazetesinde adliye muhabirliği\
|
||||
yaptı (1942). Babasının ölümü üzerine aileden kalan emlakin geliri ile avare bir hayata başladı. Evlenemedi.\
|
||||
Yazları Burgaz adasındaki köşklerinde, kışları Şişli’deki apartmanlarında annesi ile beraber geçen bu fazla \
|
||||
içkili bohem hayatı ömrünün sonuna kadar sürdü."
|
||||
|
||||
print(nlp(question="Ne zaman avare bir hayata başladı?", context=sait))
|
||||
print(nlp(question="Sait Faik hangi Lisede orta öğrenimini tamamladı?", context=sait))
|
||||
|
||||
```
|
||||
```python
|
||||
# Try it yourself: type your own question
|
||||
print(nlp(question="...?", context=sait))
|
||||
```
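
The dictionary returned by the pipeline includes, among other keys, the `answer` text and a confidence `score`. The sketch below shows one way to discard low-confidence answers. It is an illustration under assumptions: `pick_answer` and the `min_score` threshold are hypothetical, and the two result dictionaries are hand-written examples, not real model output.

```python
def pick_answer(result, min_score=0.5):
    """Return the answer text if the score reaches the threshold, else None."""
    # `result` is assumed to look like the dict returned by the
    # question-answering pipeline (only 'score' and 'answer' are used here).
    if result["score"] >= min_score:
        return result["answer"]
    return None


# Hand-written, illustrative results (NOT real model output)
confident = {"score": 0.91, "answer": "Bursa Lisesi"}
uncertain = {"score": 0.12, "answer": "1942"}

print(pick_answer(confident))
print(pick_answer(uncertain))
```

With the real model you would call `pick_answer(nlp(question="...?", context=sait))`. A suitable threshold depends on your data.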
|
||||
|
||||
|
||||
Check out my other models:
|
||||
https://huggingface.co/savasy
|
||||
51
model_cards/seiya/oubiobert-base-uncased/README.md
Normal file
@@ -0,0 +1,51 @@
|
||||
---
|
||||
tags:
|
||||
- exbert
|
||||
license: apache-2.0
|
||||
---
|
||||
|
||||
# ouBioBERT-Base, Uncased
|
||||
|
||||
Bidirectional Encoder Representations from Transformers for Biomedical Text Mining by Osaka University (ouBioBERT) is a language model based on the BERT-Base (Devlin, et al., 2019) architecture. We pre-trained ouBioBERT on PubMed abstracts from the PubMed baseline (ftp://ftp.ncbi.nlm.nih.gov/pubmed/baseline) via our method.
|
||||
|
||||
The details of the pre-training procedure can be found in Wada, et al. (2020).
|
||||
|
||||
## Evaluation
|
||||
|
||||
We evaluated the performance of ouBioBERT in terms of the biomedical language understanding evaluation (BLUE) benchmark (Peng, et al., 2019). The numbers are mean (standard deviation) on five different random seeds.
|
||||
|
||||
|
||||
| Dataset | Task Type | Score |
|
||||
|:----------------|:-----------------------------|-------------:|
|
||||
| MedSTS | Sentence similarity | 84.9 (0.6) |
|
||||
| BIOSSES | Sentence similarity | 92.3 (0.8) |
|
||||
| BC5CDR-disease | Named-entity recognition | 87.4 (0.1) |
|
||||
| BC5CDR-chemical | Named-entity recognition | 93.7 (0.2) |
|
||||
| ShARe/CLEFE | Named-entity recognition | 80.1 (0.4) |
|
||||
| DDI | Relation extraction | 81.1 (1.5) |
|
||||
| ChemProt | Relation extraction | 75.0 (0.3) |
|
||||
| i2b2 2010 | Relation extraction | 74.0 (0.8) |
|
||||
| HoC | Document classification | 86.4 (0.5) |
|
||||
| MedNLI | Inference | 83.6 (0.7) |
|
||||
| **Total** | Macro average of the scores |**83.8 (0.3)**|
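
As a rough check, the macro average in the last row can be recomputed from the per-task means in the table. This is a minimal sketch, not part of the evaluation code: the variable names are illustrative, and because the inputs are already rounded to one decimal, the result only approximates the reported total.

```python
# Per-task mean scores copied from the table above (rounded to one decimal)
scores = {
    "MedSTS": 84.9,
    "BIOSSES": 92.3,
    "BC5CDR-disease": 87.4,
    "BC5CDR-chemical": 93.7,
    "ShARe/CLEFE": 80.1,
    "DDI": 81.1,
    "ChemProt": 75.0,
    "i2b2 2010": 74.0,
    "HoC": 86.4,
    "MedNLI": 83.6,
}

# Macro average: every task contributes equally, regardless of dataset size
macro_average = sum(scores.values()) / len(scores)
print(round(macro_average, 2))
```

The value obtained this way is close to the reported 83.8. The small difference presumably comes from rounding the per-task means before averaging.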
|
||||
|
||||
|
||||
## Code for Fine-tuning
|
||||
We made the source code for fine-tuning freely available at [our repository](https://github.com/sy-wada/blue_benchmark_with_transformers).
|
||||
|
||||
## Citation
|
||||
|
||||
If you use our work in your research, please kindly cite the following paper:
|
||||
|
||||
```bibtex
|
||||
@misc{2005.07202,
|
||||
Author = {Shoya Wada and Toshihiro Takeda and Shiro Manabe and Shozo Konishi and Jun Kamohara and Yasushi Matsumura},
|
||||
Title = {A pre-training technique to localize medical BERT and enhance BioBERT},
|
||||
Year = {2020},
|
||||
Eprint = {arXiv:2005.07202},
|
||||
}
|
||||
```
|
||||
|
||||
<a href="https://huggingface.co/exbert/?model=seiya/oubiobert-base-uncased&sentence=Coronavirus%20disease%20(COVID-19)%20is%20caused%20by%20SARS-COV2%20and%20represents%20the%20causative%20agent%20of%20a%20potentially%20fatal%20disease%20that%20is%20of%20great%20global%20public%20health%20concern.">
|
||||
<img width="300px" src="https://hf-dinosaur.huggingface.co/exbert/button.png">
|
||||
</a>
|
||||
38
model_cards/valhalla/t5-base-squad/README.md
Normal file
@@ -0,0 +1,38 @@
|
||||
# T5 for question-answering
|
||||
This is a T5-base model fine-tuned on SQuAD1.1 for QA using a text-to-text approach.
|
||||
|
||||
## Model training
|
||||
This model was trained on a Colab TPU with 35GB RAM for 4 epochs.
|
||||
|
||||
## Results:
|
||||
| Metric      | Value   |
|
||||
|-------------|---------|
|
||||
| Exact Match | 81.5610 |
|
||||
| F1 | 89.9601 |
|
||||
|
||||
## Model in Action 🚀
|
||||
```python
|
||||
from transformers import AutoModelWithLMHead, AutoTokenizer
|
||||
|
||||
tokenizer = AutoTokenizer.from_pretrained("valhalla/t5-base-squad")
|
||||
model = AutoModelWithLMHead.from_pretrained("valhalla/t5-base-squad")
|
||||
|
||||
def get_answer(question, context):
|
||||
input_text = "question: %s context: %s </s>" % (question, context)
|
||||
features = tokenizer.batch_encode_plus([input_text], return_tensors='pt')
|
||||
|
||||
out = model.generate(input_ids=features['input_ids'],
|
||||
attention_mask=features['attention_mask'])
|
||||
|
||||
return tokenizer.decode(out[0])
|
||||
|
||||
context = "In Norse mythology, Valhalla is a majestic, enormous hall located in Asgard, ruled over by the god Odin."
|
||||
question = "What is Valhalla ?"
|
||||
|
||||
get_answer(question, context)
|
||||
# output: 'a majestic, enormous hall located in Asgard, ruled over by the god Odin'
|
||||
```
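
The model expects its input in the `question: ... context: ... </s>` format used in `get_answer` above. The sketch below builds that string for several questions over the same context. It is only an illustration: `build_inputs` is a hypothetical helper and the second question is made up for the example. No model is loaded here.

```python
def build_inputs(questions, context):
    """Format each question with the shared context, as get_answer does."""
    return ["question: %s context: %s </s>" % (q, context) for q in questions]


context = (
    "In Norse mythology, Valhalla is a majestic, enormous hall located in "
    "Asgard, ruled over by the god Odin."
)
questions = ["What is Valhalla ?", "Who rules over Valhalla ?"]

inputs = build_inputs(questions, context)
for text in inputs:
    print(text)
```

The resulting list could then be passed to the tokenizer in place of `[input_text]`. Inputs of different lengths would additionally need padding, which is not shown here.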
|
||||
Play with this model [on Colab](https://colab.research.google.com/drive/1a5xpJiUjZybfU9Mi-aDkOp116PZ9-wni?usp=sharing)
|
||||
|
||||
> Created by Suraj Patil [](https://github.com/patil-suraj/)
|
||||
[](https://twitter.com/psuraj28)
|
||||
@@ -3,7 +3,6 @@
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"metadata": {
|
||||
"collapsed": true,
|
||||
"pycharm": {
|
||||
"is_executing": false,
|
||||
"name": "#%% md\n"
|
||||
@@ -77,7 +76,7 @@
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": 1,
|
||||
"execution_count": null,
|
||||
"metadata": {
|
||||
"pycharm": {
|
||||
"is_executing": false,
|
||||
@@ -85,77 +84,7 @@
|
||||
},
|
||||
"scrolled": true
|
||||
},
|
||||
"outputs": [
|
||||
{
|
||||
"name": "stdout",
|
||||
"output_type": "stream",
|
||||
"text": [
|
||||
"Requirement already satisfied: transformers in /usr/local/Caskroom/miniconda/base/envs/huggingface/lib/python3.7/site-packages (2.5.1)\n",
|
||||
"Requirement already satisfied: filelock in /usr/local/Caskroom/miniconda/base/envs/huggingface/lib/python3.7/site-packages (from transformers) (3.0.12)\n",
|
||||
"Requirement already satisfied: sentencepiece in /usr/local/Caskroom/miniconda/base/envs/huggingface/lib/python3.7/site-packages (from transformers) (0.1.83)\n",
|
||||
"Requirement already satisfied: boto3 in /usr/local/Caskroom/miniconda/base/envs/huggingface/lib/python3.7/site-packages (from transformers) (1.12.0)\n",
|
||||
"Requirement already satisfied: requests in /usr/local/Caskroom/miniconda/base/envs/huggingface/lib/python3.7/site-packages (from transformers) (2.22.0)\n",
|
||||
"Requirement already satisfied: numpy in /usr/local/Caskroom/miniconda/base/envs/huggingface/lib/python3.7/site-packages (from transformers) (1.18.1)\n",
|
||||
"Requirement already satisfied: sacremoses in /usr/local/Caskroom/miniconda/base/envs/huggingface/lib/python3.7/site-packages (from transformers) (0.0.35)\n",
|
||||
"Requirement already satisfied: tokenizers==0.5.2 in /usr/local/Caskroom/miniconda/base/envs/huggingface/lib/python3.7/site-packages (from transformers) (0.5.2)\n",
|
||||
"Requirement already satisfied: regex!=2019.12.17 in /usr/local/Caskroom/miniconda/base/envs/huggingface/lib/python3.7/site-packages (from transformers) (2020.1.8)\n",
|
||||
"Requirement already satisfied: tqdm>=4.27 in /usr/local/Caskroom/miniconda/base/envs/huggingface/lib/python3.7/site-packages (from transformers) (4.42.1)\n",
|
||||
"Requirement already satisfied: s3transfer<0.4.0,>=0.3.0 in /usr/local/Caskroom/miniconda/base/envs/huggingface/lib/python3.7/site-packages (from boto3->transformers) (0.3.3)\n",
|
||||
"Requirement already satisfied: botocore<1.16.0,>=1.15.0 in /usr/local/Caskroom/miniconda/base/envs/huggingface/lib/python3.7/site-packages (from boto3->transformers) (1.15.0)\n",
|
||||
"Requirement already satisfied: jmespath<1.0.0,>=0.7.1 in /usr/local/Caskroom/miniconda/base/envs/huggingface/lib/python3.7/site-packages (from boto3->transformers) (0.9.4)\n",
|
||||
"Requirement already satisfied: certifi>=2017.4.17 in /usr/local/Caskroom/miniconda/base/envs/huggingface/lib/python3.7/site-packages (from requests->transformers) (2019.11.28)\n",
|
||||
"Requirement already satisfied: idna<2.9,>=2.5 in /usr/local/Caskroom/miniconda/base/envs/huggingface/lib/python3.7/site-packages (from requests->transformers) (2.8)\n",
|
||||
"Requirement already satisfied: urllib3!=1.25.0,!=1.25.1,<1.26,>=1.21.1 in /usr/local/Caskroom/miniconda/base/envs/huggingface/lib/python3.7/site-packages (from requests->transformers) (1.25.8)\n",
|
||||
"Requirement already satisfied: chardet<3.1.0,>=3.0.2 in /usr/local/Caskroom/miniconda/base/envs/huggingface/lib/python3.7/site-packages (from requests->transformers) (3.0.4)\n",
|
||||
"Requirement already satisfied: joblib in /usr/local/Caskroom/miniconda/base/envs/huggingface/lib/python3.7/site-packages (from sacremoses->transformers) (0.14.0)\n",
|
||||
"Requirement already satisfied: click in /usr/local/Caskroom/miniconda/base/envs/huggingface/lib/python3.7/site-packages (from sacremoses->transformers) (7.0)\n",
|
||||
"Requirement already satisfied: six in /usr/local/Caskroom/miniconda/base/envs/huggingface/lib/python3.7/site-packages (from sacremoses->transformers) (1.14.0)\n",
|
||||
"Requirement already satisfied: docutils<0.16,>=0.10 in /usr/local/Caskroom/miniconda/base/envs/huggingface/lib/python3.7/site-packages (from botocore<1.16.0,>=1.15.0->boto3->transformers) (0.15.2)\n",
|
||||
"Requirement already satisfied: python-dateutil<3.0.0,>=2.1 in /usr/local/Caskroom/miniconda/base/envs/huggingface/lib/python3.7/site-packages (from botocore<1.16.0,>=1.15.0->boto3->transformers) (2.8.1)\n",
|
||||
"Requirement already satisfied: tensorflow==2.1.0 in /usr/local/Caskroom/miniconda/base/envs/huggingface/lib/python3.7/site-packages (2.1.0)\n",
|
||||
"Requirement already satisfied: termcolor>=1.1.0 in /usr/local/Caskroom/miniconda/base/envs/huggingface/lib/python3.7/site-packages (from tensorflow==2.1.0) (1.1.0)\n",
|
||||
"Requirement already satisfied: keras-preprocessing>=1.1.0 in /usr/local/Caskroom/miniconda/base/envs/huggingface/lib/python3.7/site-packages (from tensorflow==2.1.0) (1.1.0)\n",
|
||||
"Requirement already satisfied: opt-einsum>=2.3.2 in /usr/local/Caskroom/miniconda/base/envs/huggingface/lib/python3.7/site-packages (from tensorflow==2.1.0) (3.1.0)\n",
|
||||
"Requirement already satisfied: protobuf>=3.8.0 in /usr/local/Caskroom/miniconda/base/envs/huggingface/lib/python3.7/site-packages (from tensorflow==2.1.0) (3.11.4)\n",
|
||||
"Requirement already satisfied: numpy<2.0,>=1.16.0 in /usr/local/Caskroom/miniconda/base/envs/huggingface/lib/python3.7/site-packages (from tensorflow==2.1.0) (1.18.1)\n",
|
||||
"Requirement already satisfied: tensorboard<2.2.0,>=2.1.0 in /usr/local/Caskroom/miniconda/base/envs/huggingface/lib/python3.7/site-packages (from tensorflow==2.1.0) (2.1.0)\n",
|
||||
"Requirement already satisfied: keras-applications>=1.0.8 in /usr/local/Caskroom/miniconda/base/envs/huggingface/lib/python3.7/site-packages (from tensorflow==2.1.0) (1.0.8)\n",
|
||||
"Requirement already satisfied: wrapt>=1.11.1 in /usr/local/Caskroom/miniconda/base/envs/huggingface/lib/python3.7/site-packages (from tensorflow==2.1.0) (1.11.2)\n",
|
||||
"Requirement already satisfied: six>=1.12.0 in /usr/local/Caskroom/miniconda/base/envs/huggingface/lib/python3.7/site-packages (from tensorflow==2.1.0) (1.14.0)\n",
|
||||
"Requirement already satisfied: tensorflow-estimator<2.2.0,>=2.1.0rc0 in /usr/local/Caskroom/miniconda/base/envs/huggingface/lib/python3.7/site-packages (from tensorflow==2.1.0) (2.1.0)\n",
|
||||
"Requirement already satisfied: scipy==1.4.1; python_version >= \"3\" in /usr/local/Caskroom/miniconda/base/envs/huggingface/lib/python3.7/site-packages (from tensorflow==2.1.0) (1.4.1)\n",
|
||||
"Requirement already satisfied: google-pasta>=0.1.6 in /usr/local/Caskroom/miniconda/base/envs/huggingface/lib/python3.7/site-packages (from tensorflow==2.1.0) (0.1.8)\n",
|
||||
"Requirement already satisfied: wheel>=0.26; python_version >= \"3\" in /usr/local/Caskroom/miniconda/base/envs/huggingface/lib/python3.7/site-packages (from tensorflow==2.1.0) (0.34.2)\n",
|
||||
"Requirement already satisfied: grpcio>=1.8.6 in /usr/local/Caskroom/miniconda/base/envs/huggingface/lib/python3.7/site-packages (from tensorflow==2.1.0) (1.16.1)\n",
|
||||
"Requirement already satisfied: absl-py>=0.7.0 in /usr/local/Caskroom/miniconda/base/envs/huggingface/lib/python3.7/site-packages (from tensorflow==2.1.0) (0.9.0)\n",
|
||||
"Requirement already satisfied: gast==0.2.2 in /usr/local/Caskroom/miniconda/base/envs/huggingface/lib/python3.7/site-packages (from tensorflow==2.1.0) (0.2.2)\n",
|
||||
"Requirement already satisfied: astor>=0.6.0 in /usr/local/Caskroom/miniconda/base/envs/huggingface/lib/python3.7/site-packages (from tensorflow==2.1.0) (0.8.0)\n",
|
||||
"Requirement already satisfied: setuptools in /usr/local/Caskroom/miniconda/base/envs/huggingface/lib/python3.7/site-packages (from protobuf>=3.8.0->tensorflow==2.1.0) (45.2.0.post20200210)\n",
|
||||
"Requirement already satisfied: google-auth<2,>=1.6.3 in /usr/local/Caskroom/miniconda/base/envs/huggingface/lib/python3.7/site-packages (from tensorboard<2.2.0,>=2.1.0->tensorflow==2.1.0) (1.11.2)\n",
|
||||
"Requirement already satisfied: google-auth-oauthlib<0.5,>=0.4.1 in /usr/local/Caskroom/miniconda/base/envs/huggingface/lib/python3.7/site-packages (from tensorboard<2.2.0,>=2.1.0->tensorflow==2.1.0) (0.4.1)\n",
|
||||
"Requirement already satisfied: markdown>=2.6.8 in /usr/local/Caskroom/miniconda/base/envs/huggingface/lib/python3.7/site-packages (from tensorboard<2.2.0,>=2.1.0->tensorflow==2.1.0) (3.1.1)\n",
|
||||
"Requirement already satisfied: werkzeug>=0.11.15 in /usr/local/Caskroom/miniconda/base/envs/huggingface/lib/python3.7/site-packages (from tensorboard<2.2.0,>=2.1.0->tensorflow==2.1.0) (1.0.0)\n",
|
||||
"Requirement already satisfied: requests<3,>=2.21.0 in /usr/local/Caskroom/miniconda/base/envs/huggingface/lib/python3.7/site-packages (from tensorboard<2.2.0,>=2.1.0->tensorflow==2.1.0) (2.22.0)\n",
|
||||
"Requirement already satisfied: h5py in /usr/local/Caskroom/miniconda/base/envs/huggingface/lib/python3.7/site-packages (from keras-applications>=1.0.8->tensorflow==2.1.0) (2.10.0)\n",
|
||||
"Requirement already satisfied: rsa<4.1,>=3.1.4 in /usr/local/Caskroom/miniconda/base/envs/huggingface/lib/python3.7/site-packages (from google-auth<2,>=1.6.3->tensorboard<2.2.0,>=2.1.0->tensorflow==2.1.0) (4.0)\n",
|
||||
"Requirement already satisfied: cachetools<5.0,>=2.0.0 in /usr/local/Caskroom/miniconda/base/envs/huggingface/lib/python3.7/site-packages (from google-auth<2,>=1.6.3->tensorboard<2.2.0,>=2.1.0->tensorflow==2.1.0) (4.0.0)\n",
|
||||
"Requirement already satisfied: pyasn1-modules>=0.2.1 in /usr/local/Caskroom/miniconda/base/envs/huggingface/lib/python3.7/site-packages (from google-auth<2,>=1.6.3->tensorboard<2.2.0,>=2.1.0->tensorflow==2.1.0) (0.2.8)\n",
|
||||
"Requirement already satisfied: requests-oauthlib>=0.7.0 in /usr/local/Caskroom/miniconda/base/envs/huggingface/lib/python3.7/site-packages (from google-auth-oauthlib<0.5,>=0.4.1->tensorboard<2.2.0,>=2.1.0->tensorflow==2.1.0) (1.3.0)\n",
|
||||
"Requirement already satisfied: idna<2.9,>=2.5 in /usr/local/Caskroom/miniconda/base/envs/huggingface/lib/python3.7/site-packages (from requests<3,>=2.21.0->tensorboard<2.2.0,>=2.1.0->tensorflow==2.1.0) (2.8)\n",
|
||||
"Requirement already satisfied: certifi>=2017.4.17 in /usr/local/Caskroom/miniconda/base/envs/huggingface/lib/python3.7/site-packages (from requests<3,>=2.21.0->tensorboard<2.2.0,>=2.1.0->tensorflow==2.1.0) (2019.11.28)\n",
|
||||
"Requirement already satisfied: chardet<3.1.0,>=3.0.2 in /usr/local/Caskroom/miniconda/base/envs/huggingface/lib/python3.7/site-packages (from requests<3,>=2.21.0->tensorboard<2.2.0,>=2.1.0->tensorflow==2.1.0) (3.0.4)\n"
|
||||
]
|
||||
},
|
||||
{
|
||||
"name": "stdout",
|
||||
"output_type": "stream",
|
||||
"text": [
|
||||
"Requirement already satisfied: urllib3!=1.25.0,!=1.25.1,<1.26,>=1.21.1 in /usr/local/Caskroom/miniconda/base/envs/huggingface/lib/python3.7/site-packages (from requests<3,>=2.21.0->tensorboard<2.2.0,>=2.1.0->tensorflow==2.1.0) (1.25.8)\r\n",
|
||||
"Requirement already satisfied: pyasn1>=0.1.3 in /usr/local/Caskroom/miniconda/base/envs/huggingface/lib/python3.7/site-packages (from rsa<4.1,>=3.1.4->google-auth<2,>=1.6.3->tensorboard<2.2.0,>=2.1.0->tensorflow==2.1.0) (0.4.8)\r\n",
|
||||
"Requirement already satisfied: oauthlib>=3.0.0 in /usr/local/Caskroom/miniconda/base/envs/huggingface/lib/python3.7/site-packages (from requests-oauthlib>=0.7.0->google-auth-oauthlib<0.5,>=0.4.1->tensorboard<2.2.0,>=2.1.0->tensorflow==2.1.0) (3.1.0)\r\n"
|
||||
]
|
||||
}
|
||||
],
|
||||
"outputs": [],
|
||||
"source": [
|
||||
"!pip install transformers\n",
|
||||
"!pip install tensorflow==2.1.0"
|
||||
@@ -174,7 +103,7 @@
|
||||
{
|
||||
"data": {
|
||||
"text/plain": [
|
||||
"<torch.autograd.grad_mode.set_grad_enabled at 0x102c0ce10>"
|
||||
"<torch.autograd.grad_mode.set_grad_enabled at 0x7f10b441e890>"
|
||||
]
|
||||
},
|
||||
"execution_count": 2,
|
||||
@@ -441,7 +370,7 @@
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": null,
|
||||
"execution_count": 8,
|
||||
"metadata": {
|
||||
"pycharm": {
|
||||
"is_executing": false
|
||||
@@ -458,13 +387,22 @@
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": null,
|
||||
"execution_count": 9,
|
||||
"metadata": {
|
||||
"pycharm": {
|
||||
"is_executing": false
|
||||
}
|
||||
},
|
||||
"outputs": [],
|
||||
"outputs": [
|
||||
{
|
||||
"name": "stdout",
|
||||
"output_type": "stream",
|
||||
"text": [
|
||||
"output differences: 1.6236e-05\n",
|
||||
"pooled differences: -1.3039e-08\n"
|
||||
]
|
||||
}
|
||||
],
|
||||
"source": [
|
||||
"# transformers generates a ready to use dictionary with all the required parameters for the specific framework.\n",
|
||||
"input_tf = tokenizer.encode_plus(\"This is a sample input\", return_tensors=\"tf\")\n",
|
||||
@@ -476,7 +414,7 @@
|
||||
"# Models outputs 2 values (The value for each tokens, the pooled representation of the input sentence)\n",
|
||||
"# Here we compare the output differences between PyTorch and TensorFlow.\n",
|
||||
"for name, o_tf, o_pt in zip([\"output\", \"pooled\"], output_tf, output_pt):\n",
|
||||
" print(\"{} differences: {}\".format(name, (o_tf.numpy() - o_pt.numpy()).sum()))"
|
||||
" print(\"{} differences: {:.5}\".format(name, (o_tf.numpy() - o_pt.numpy()).sum()))"
|
||||
]
|
||||
},
|
||||
{
|
||||
@@ -504,13 +442,24 @@
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": null,
|
||||
"execution_count": 10,
|
||||
"metadata": {
|
||||
"pycharm": {
|
||||
"is_executing": false
|
||||
}
|
||||
},
|
||||
"outputs": [],
|
||||
"outputs": [
|
||||
{
|
||||
"name": "stdout",
|
||||
"output_type": "stream",
|
||||
"text": [
|
||||
"CPU times: user 232 ms, sys: 0 ns, total: 232 ms\n",
|
||||
"Wall time: 21.1 ms\n",
|
||||
"CPU times: user 511 ms, sys: 0 ns, total: 511 ms\n",
|
||||
"Wall time: 43.9 ms\n"
|
||||
]
|
||||
}
|
||||
],
|
||||
"source": [
|
||||
"from transformers import DistilBertModel\n",
|
||||
"\n",
|
||||
@@ -541,13 +490,25 @@
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": null,
|
||||
"execution_count": 11,
|
||||
"metadata": {
|
||||
"pycharm": {
|
||||
"is_executing": false
|
||||
}
|
||||
},
|
||||
"outputs": [],
|
||||
"outputs": [
|
||||
{
|
||||
"name": "stdout",
|
||||
"output_type": "stream",
|
||||
"text": [
|
||||
"Tokens (int) : [102, 12272, 9355, 5746, 30881, 215, 261, 5945, 4118, 212, 2414, 153, 1942, 232, 3532, 566, 103]\n",
|
||||
"Tokens (str) : ['[CLS]', 'Hug', '##ging', 'Fac', '##e', 'ist', 'eine', 'französische', 'Firma', 'mit', 'Sitz', 'in', 'New', '-', 'York', '.', '[SEP]']\n",
|
||||
"Tokens (attn_mask): [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]\n",
|
||||
"\n",
|
||||
"Token wise output: torch.Size([1, 7, 768]), Pooled output: torch.Size([1, 768])\n"
|
||||
]
|
||||
}
|
||||
],
|
||||
"source": [
|
||||
"# Let's load German BERT from the Bavarian State Library\n",
|
||||
"de_bert = BertModel.from_pretrained(\"dbmdz/bert-base-german-cased\")\n",
|
||||
@@ -557,7 +518,14 @@
|
||||
" \"Hugging Face ist eine französische Firma mit Sitz in New-York.\",\n",
|
||||
" return_tensors=\"pt\"\n",
|
||||
")\n",
|
||||
"output_de, pooled_de = de_bert(**de_input)"
|
||||
"print(\"Tokens (int) : {}\".format(de_input['input_ids'].tolist()[0]))\n",
|
||||
"print(\"Tokens (str) : {}\".format([de_tokenizer.convert_ids_to_tokens(s) for s in de_input['input_ids'].tolist()[0]]))\n",
|
||||
"print(\"Tokens (attn_mask): {}\".format(de_input['attention_mask'].tolist()[0]))\n",
|
||||
"print()\n",
|
||||
"\n",
|
||||
"output_de, pooled_de = de_bert(**de_input)\n",
|
||||
"\n",
|
||||
"print(\"Token wise output: {}, Pooled output: {}\".format(outputs.shape, pooled.shape))"
|
||||
]
|
||||
}
|
||||
],
|
||||
@@ -577,7 +545,7 @@
|
||||
"name": "python",
|
||||
"nbconvert_exporter": "python",
|
||||
"pygments_lexer": "ipython3",
|
||||
"version": "3.7.6"
|
||||
"version": "3.7.4"
|
||||
},
|
||||
"pycharm": {
|
||||
"stem_cell": {
|
||||
@@ -590,5 +558,5 @@
|
||||
}
|
||||
},
|
||||
"nbformat": 4,
|
||||
"nbformat_minor": 1
|
||||
"nbformat_minor": 4
|
||||
}
|
||||
|
||||
492
notebooks/04-onnx-export.ipynb
Normal file
File diff suppressed because one or more lines are too long
@@ -4,15 +4,25 @@ You can find here a list of the official notebooks provided by Hugging Face.
|
||||
|
||||
Also, we would like to list here interesting content created by the community.
|
||||
If you wrote some notebook(s) leveraging transformers and would like to be listed here, please open a
|
||||
Pull Request and we'll review it so it can be included here.
|
||||
Pull Request so it can be included under the Community notebooks.
|
||||
|
||||
|
||||
## Hugging Face's notebooks :hugs:
|
||||
|
||||
| Notebook | Description | |
|
||||
|:----------|:-------------:|------:|
|
||||
|:----------|:-------------|------:|
|
||||
| [Getting Started Tokenizers](https://github.com/huggingface/transformers/blob/master/notebooks/01-training-tokenizers.ipynb) | How to train and use your very own tokenizer |[](https://colab.research.google.com/github/huggingface/transformers/blob/master/notebooks/01-training-tokenizers.ipynb) |
|
||||
| [Getting Started Transformers](https://github.com/huggingface/transformers/blob/master/notebooks/02-transformers.ipynb) | How to easily start using transformers | [](https://colab.research.google.com/github/huggingface/transformers/blob/master/notebooks/02-transformers.ipynb) |
|
||||
| [How to use Pipelines](https://github.com/huggingface/transformers/blob/master/notebooks/03-pipelines.ipynb) | Simple and efficient way to use State-of-the-Art models on downstream tasks through transformers | [](https://colab.research.google.com/github/huggingface/transformers/blob/master/notebooks/03-pipelines.ipynb) |
|
||||
| [How to train a language model](https://github.com/huggingface/blog/blob/master/notebooks/01_how_to_train.ipynb)| Highlight all the steps to effectively train Transformer model on custom data | [](https://colab.research.google.com/github/huggingface/blog/blob/master/notebooks/01_how_to_train.ipynb)|
|
||||
| [How to generate text](https://github.com/huggingface/blog/blob/master/notebooks/02_how_to_generate.ipynb)| How to use different decoding methods for language generation with transformers | [](https://colab.research.google.com/github/huggingface/blog/blob/master/notebooks/02_how_to_generate.ipynb)|
|
||||
| [How to export model to ONNX](https://github.com/huggingface/transformers/blob/master/notebooks/04-onnx-export.ipynb) | Highlight how to export and run inference workloads through ONNX |
|
||||
|
||||
|
||||
## Community notebooks:
|
||||
|
||||
| Notebook | Description | Author | |
|
||||
|:----------|:-------------|:-------------|------:|
|
||||
| [Train T5 on TPU](https://github.com/patil-suraj/exploring-T5/blob/master/T5_on_TPU.ipynb) | How to train T5 on SQUAD with Transformers and Nlp | [Suraj Patil](https://github.com/patil-suraj) |[](https://colab.research.google.com/github/patil-suraj/exploring-T5/blob/master/T5_on_TPU.ipynb#scrollTo=QLGiFCDqvuil) |
|
||||
| [Fine-tune T5 for Classification and Multiple Choice](https://github.com/patil-suraj/exploring-T5/blob/master/t5_fine_tuning.ipynb) | How to fine-tune T5 for classification and multiple choice tasks using a text-to-text format with PyTorch Lightning | [Suraj Patil](https://github.com/patil-suraj) | [](https://colab.research.google.com/github/patil-suraj/exploring-T5/blob/master/t5_fine_tuning.ipynb) |
|
||||
| [Fine-tune DialoGPT on New Datasets and Languages](https://github.com/ncoop57/i-am-a-nerd/blob/master/_notebooks/2020-05-12-chatbot-part-1.ipynb) | How to fine-tune the DialoGPT model on a new dataset for open-dialog conversational chatbots | [Nathan Cooper](https://github.com/ncoop57) | [](https://colab.research.google.com/github/ncoop57/i-am-a-nerd/blob/master/_notebooks/2020-05-12-chatbot-part-1.ipynb)
|
||||
|
||||
@@ -36,5 +36,5 @@ multi_line_output = 3
|
||||
use_parentheses = True
|
||||
|
||||
[flake8]
|
||||
ignore = E203, E501, W503
|
||||
ignore = E203, E501, E741, W503
|
||||
max-line-length = 119
|
||||
|
||||
20
setup.py
@@ -67,8 +67,18 @@ extras = {}
|
||||
|
||||
extras["mecab"] = ["mecab-python3"]
|
||||
extras["sklearn"] = ["scikit-learn"]
|
||||
extras["tf"] = ["tensorflow"]
|
||||
extras["tf-cpu"] = ["tensorflow-cpu"]
|
||||
|
||||
# keras2onnx and onnxconverter-common versions are specified through a commit until 1.7.0 lands on PyPI
|
||||
extras["tf"] = [
|
||||
"tensorflow",
|
||||
"onnxconverter-common",
|
||||
"keras2onnx"
|
||||
]
|
||||
extras["tf-cpu"] = [
|
||||
"tensorflow-cpu",
|
||||
"onnxconverter-common",
|
||||
"keras2onnx"
|
||||
]
|
||||
extras["torch"] = ["torch"]
|
||||
|
||||
extras["serving"] = ["pydantic", "uvicorn", "fastapi", "starlette"]
|
||||
@@ -79,14 +89,14 @@ extras["docs"] = ["recommonmark", "sphinx", "sphinx-markdown-tables", "sphinx-rt
|
||||
extras["quality"] = [
|
||||
"black",
|
||||
"isort",
|
||||
"flake8==3.7.9",
|
||||
"flake8",
|
||||
]
|
||||
extras["dev"] = extras["testing"] + extras["quality"] + ["mecab-python3", "scikit-learn", "tensorflow", "torch"]
|
||||
|
||||
setup(
|
||||
name="transformers",
|
||||
version="2.9.1",
|
||||
author="Thomas Wolf, Lysandre Debut, Victor Sanh, Julien Chaumond, Sam Shleifer, Google AI Language Team Authors, Open AI team Authors, Facebook AI Authors, Carnegie Mellon University Authors",
|
||||
version="2.10.0",
|
||||
author="Thomas Wolf, Lysandre Debut, Victor Sanh, Julien Chaumond, Sam Shleifer, Patrick von Platen, Google AI Language Team Authors, Open AI team Authors, Facebook AI Authors, Carnegie Mellon University Authors",
|
||||
author_email="thomas@huggingface.co",
|
||||
description="State-of-the-art Natural Language Processing for TensorFlow 2.0 and PyTorch",
|
||||
long_description=open("README.md", "r", encoding="utf-8").read(),
|
||||
|
||||
@@ -2,7 +2,7 @@
|
||||
# There's no way to ignore "F401 '...' imported but unused" warnings in this
|
||||
# module, but to preserve other warnings. So, don't check this module at all.
|
||||
|
||||
__version__ = "2.9.1"
|
||||
__version__ = "2.10.0"
|
||||
|
||||
# Work around to update TensorFlow's absl.logging threshold which alters the
|
||||
# default Python logging output behavior when present.
|
||||
@@ -44,6 +44,7 @@ from .configuration_electra import ELECTRA_PRETRAINED_CONFIG_ARCHIVE_MAP, Electr
|
||||
from .configuration_encoder_decoder import EncoderDecoderConfig
|
||||
from .configuration_flaubert import FLAUBERT_PRETRAINED_CONFIG_ARCHIVE_MAP, FlaubertConfig
|
||||
from .configuration_gpt2 import GPT2_PRETRAINED_CONFIG_ARCHIVE_MAP, GPT2Config
|
||||
from .configuration_longformer import LONGFORMER_PRETRAINED_CONFIG_ARCHIVE_MAP, LongformerConfig
|
||||
from .configuration_marian import MarianConfig
|
||||
from .configuration_mmbt import MMBTConfig
|
||||
from .configuration_openai import OPENAI_GPT_PRETRAINED_CONFIG_ARCHIVE_MAP, OpenAIGPTConfig
|
||||
@@ -138,6 +139,7 @@ from .tokenization_distilbert import DistilBertTokenizer, DistilBertTokenizerFas
|
||||
from .tokenization_electra import ElectraTokenizer, ElectraTokenizerFast
|
||||
from .tokenization_flaubert import FlaubertTokenizer
|
||||
from .tokenization_gpt2 import GPT2Tokenizer, GPT2TokenizerFast
|
||||
from .tokenization_longformer import LongformerTokenizer
|
||||
from .tokenization_openai import OpenAIGPTTokenizer, OpenAIGPTTokenizerFast
|
||||
from .tokenization_reformer import ReformerTokenizer
|
||||
from .tokenization_roberta import RobertaTokenizer, RobertaTokenizerFast
|
||||
@@ -319,6 +321,7 @@ if is_torch_available():
|
||||
ElectraForMaskedLM,
|
||||
ElectraForTokenClassification,
|
||||
ElectraPreTrainedModel,
|
||||
ElectraForSequenceClassification,
|
||||
ElectraModel,
|
||||
load_tf_weights_in_electra,
|
||||
ELECTRA_PRETRAINED_MODEL_ARCHIVE_MAP,
|
||||
@@ -332,6 +335,8 @@ if is_torch_available():
|
||||
REFORMER_PRETRAINED_MODEL_ARCHIVE_MAP,
|
||||
)
|
||||
|
||||
from .modeling_longformer import LONGFORMER_PRETRAINED_MODEL_ARCHIVE_MAP, LongformerModel, LongformerForMaskedLM
|
||||
|
||||
# Optimization
|
||||
from .optimization import (
|
||||
AdamW,
|
||||
|
||||
@@ -28,6 +28,7 @@ from .configuration_electra import ELECTRA_PRETRAINED_CONFIG_ARCHIVE_MAP, Electr
|
||||
from .configuration_encoder_decoder import EncoderDecoderConfig
|
||||
from .configuration_flaubert import FLAUBERT_PRETRAINED_CONFIG_ARCHIVE_MAP, FlaubertConfig
|
||||
from .configuration_gpt2 import GPT2_PRETRAINED_CONFIG_ARCHIVE_MAP, GPT2Config
|
||||
from .configuration_longformer import LONGFORMER_PRETRAINED_CONFIG_ARCHIVE_MAP, LongformerConfig
|
||||
from .configuration_marian import MarianConfig
|
||||
from .configuration_openai import OPENAI_GPT_PRETRAINED_CONFIG_ARCHIVE_MAP, OpenAIGPTConfig
|
||||
from .configuration_reformer import ReformerConfig
|
||||
@@ -62,6 +63,7 @@ ALL_PRETRAINED_CONFIG_ARCHIVE_MAP = dict(
|
||||
XLM_ROBERTA_PRETRAINED_CONFIG_ARCHIVE_MAP,
|
||||
FLAUBERT_PRETRAINED_CONFIG_ARCHIVE_MAP,
|
||||
ELECTRA_PRETRAINED_CONFIG_ARCHIVE_MAP,
|
||||
LONGFORMER_PRETRAINED_CONFIG_ARCHIVE_MAP,
|
||||
]
|
||||
for key, value, in pretrained_map.items()
|
||||
)
|
||||
@@ -77,6 +79,7 @@ CONFIG_MAPPING = OrderedDict(
|
||||
("marian", MarianConfig,),
|
||||
("bart", BartConfig,),
|
||||
("reformer", ReformerConfig,),
|
||||
("longformer", LongformerConfig,),
|
||||
("roberta", RobertaConfig,),
|
||||
("flaubert", FlaubertConfig,),
|
||||
("bert", BertConfig,),
|
||||
@@ -133,6 +136,7 @@ class AutoConfig:
|
||||
- contains `albert`: :class:`~transformers.AlbertConfig` (ALBERT model)
|
||||
- contains `camembert`: :class:`~transformers.CamembertConfig` (CamemBERT model)
|
||||
- contains `xlm-roberta`: :class:`~transformers.XLMRobertaConfig` (XLM-RoBERTa model)
|
||||
- contains `longformer`: :class:`~transformers.LongformerConfig` (Longformer model)
|
||||
- contains `roberta`: :class:`~transformers.RobertaConfig` (RoBERTa model)
|
||||
- contains `reformer`: :class:`~transformers.ReformerConfig` (Reformer model)
|
||||
- contains `bert`: :class:`~transformers.BertConfig` (Bert model)
|
||||
@@ -145,7 +149,6 @@ class AutoConfig:
|
||||
- contains `flaubert` : :class:`~transformers.FlaubertConfig` (Flaubert model)
|
||||
- contains `electra` : :class:`~transformers.ElectraConfig` (ELECTRA model)
|
||||
|
||||
|
||||
Args:
|
||||
pretrained_model_name_or_path (:obj:`string`):
|
||||
Is either: \
|
||||
|
||||
69
src/transformers/configuration_longformer.py
Normal file
@@ -0,0 +1,69 @@
# coding=utf-8
# Copyright 2020 The Allen Institute for AI team and The HuggingFace Inc. team.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
#     http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
""" Longformer configuration """

import logging
from typing import List, Union

from .configuration_roberta import RobertaConfig


logger = logging.getLogger(__name__)

LONGFORMER_PRETRAINED_CONFIG_ARCHIVE_MAP = {
    "longformer-base-4096": "https://s3.amazonaws.com/models.huggingface.co/bert/allenai/longformer-base-4096/config.json",
    "longformer-large-4096": "https://s3.amazonaws.com/models.huggingface.co/bert/allenai/longformer-large-4096/config.json",
}


class LongformerConfig(RobertaConfig):
    r"""
        This is the configuration class to store the configuration of a :class:`~transformers.LongformerModel`.
        It is used to instantiate a Longformer model according to the specified arguments, defining the model
        architecture. Instantiating a configuration with the defaults will yield a similar configuration to that of
        the RoBERTa `roberta-base <https://huggingface.co/roberta-base>`__ architecture with a sequence length of 4,096.

        The :class:`~transformers.LongformerConfig` class directly inherits :class:`~transformers.RobertaConfig`.
        It reuses the same defaults. Please check the parent class for more information.

        Args:
            attention_window (:obj:`int` or :obj:`List[int]`, optional, defaults to 512):
                Size of an attention window around each token. If :obj:`int`, use the same size for all layers.
                To specify a different window size for each layer, use a :obj:`List[int]` where
                ``len(attention_window) == num_hidden_layers``.

        Example::

            from transformers import LongformerConfig, LongformerModel

            # Initializing a Longformer configuration
            configuration = LongformerConfig()

            # Initializing a model from the configuration
            model = LongformerModel(configuration)

            # Accessing the model configuration
            configuration = model.config

        Attributes:
            pretrained_config_archive_map (Dict[str, str]):
                A dictionary containing all the available pre-trained checkpoints.
    """
    pretrained_config_archive_map = LONGFORMER_PRETRAINED_CONFIG_ARCHIVE_MAP
    model_type = "longformer"

    def __init__(self, attention_window: Union[List[int], int] = 512, **kwargs):
        super().__init__(**kwargs)
        self.attention_window = attention_window
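Note on `attention_window`: the docstring above allows either one int or one value per layer. The sketch below shows one way a consumer of the config could normalize both forms. `expand_attention_window` is a hypothetical helper written for this note, not a function from the diff or from the library, and the length check is an assumption based only on the documented constraint ``len(attention_window) == num_hidden_layers``.

```python
from typing import List, Union


def expand_attention_window(attention_window: Union[int, List[int]], num_hidden_layers: int) -> List[int]:
    """Hypothetical helper: return one attention window size per layer."""
    if isinstance(attention_window, int):
        # A single int means the same window size is used for all layers
        return [attention_window] * num_hidden_layers
    if len(attention_window) != num_hidden_layers:
        # Assumed check, derived from the constraint stated in the docstring
        raise ValueError(
            "Expected {} window sizes, got {}".format(num_hidden_layers, len(attention_window))
        )
    return list(attention_window)


# The default of 512 expanded for a 12-layer model
per_layer = expand_attention_window(512, num_hidden_layers=12)
```

The config class itself stores `attention_window` unchanged, so any normalization of this kind would have to live in the code that reads the config.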
@@ -68,6 +68,6 @@ class RobertaConfig(BertConfig):
|
||||
model_type = "roberta"
|
||||
|
||||
def __init__(self, pad_token_id=1, bos_token_id=0, eos_token_id=2, **kwargs):
|
||||
"""Constructs FlaubertConfig.
|
||||
"""Constructs RobertaConfig.
|
||||
"""
|
||||
super().__init__(pad_token_id=pad_token_id, bos_token_id=bos_token_id, eos_token_id=eos_token_id, **kwargs)
|
||||
|
||||
@@ -39,10 +39,10 @@ class T5Config(PretrainedConfig):
|
||||
|
||||
Arguments:
|
||||
vocab_size_or_config_json_file: Vocabulary size of `inputs_ids` in `T5Model`.
|
||||
hidden_size: Size of the encoder layers and the pooler layer.
|
||||
num_hidden_layers: Number of hidden layers in the Transformer encoder.
|
||||
num_attention_heads: Number of attention heads for each attention layer in
|
||||
the Transformer encoder.
|
||||
d_model: Size of the encoder layers and the pooler layer. `d_model` can also be accessed via the property `hidden_size`.
|
||||
num_layers: Number of hidden layers in the Transformer encoder. `num_layers` can also be accessed via the property `num_hidden_layers`.
|
||||
num_heads: Number of attention heads for each attention layer in
|
||||
the Transformer encoder. `num_heads` can also be accessed via the property `num_attention_heads`.
|
||||
intermediate_size: The size of the "intermediate" (i.e., feed-forward)
|
||||
layer in the Transformer encoder.
|
||||
hidden_act: The non-linear activation function (function or string) in the
|
||||
@@ -51,9 +51,9 @@ class T5Config(PretrainedConfig):
|
||||
layers in the embeddings, encoder, and pooler.
|
||||
attention_probs_dropout_prob: The dropout ratio for the attention
|
||||
probabilities.
|
||||
max_position_embeddings: The maximum sequence length that this model might
|
||||
n_positions: The maximum sequence length that this model might
|
||||
ever be used with. Typically set this to something large just in case
|
||||
(e.g., 512 or 1024 or 2048).
|
||||
(e.g., 512 or 1024 or 2048). `n_positions` can also be accessed via the property `max_position_embeddings`.
|
||||
type_vocab_size: The vocabulary size of the `token_type_ids` passed into
|
||||
`T5Model`.
|
||||
initializer_factor: A factor for initializing all weight matrices (should be kept to 1.0, used for initialization testing).
|
||||
|
||||
220
src/transformers/convert_graph_to_onnx.py
Normal file
@@ -0,0 +1,220 @@
from argparse import ArgumentParser
from itertools import takewhile
from os import listdir, makedirs
from os.path import abspath, dirname, exists
from typing import Dict, List, Optional, Tuple

from transformers import is_tf_available, is_torch_available
from transformers.pipelines import Pipeline, pipeline
from transformers.tokenization_utils import BatchEncoding


class OnnxConverterArgumentParser(ArgumentParser):
    """
    Wraps all the script arguments supported to export transformers models to ONNX IR
    """

    def __init__(self):
        super(OnnxConverterArgumentParser, self).__init__("ONNX Converter")

        self.add_argument("--model", type=str, required=True, help="Model's id or path (ex: bert-base-cased)")
        self.add_argument("--tokenizer", type=str, help="Tokenizer's id or path (ex: bert-base-cased)")
        self.add_argument("--framework", type=str, choices=["pt", "tf"], help="Framework for loading the model")
        self.add_argument("--opset", type=int, default=11, help="ONNX opset to use")
        self.add_argument("--check-loading", action="store_true", help="Check ONNX is able to load the model")
        self.add_argument("--use-external-format", action="store_true", help="Allow exporting models >= 2 GB")
        self.add_argument("output")


def ensure_valid_input(model, tokens, input_names):
    """
    Ensure inputs are presented in the correct order, without any None
    Args:
        model: The model used to forward the input data
        tokens: BatchEncoding holding the input data
        input_names: The name of the inputs

    Returns: Tuple

    """
    model_args_name = model.forward.__code__.co_varnames
    model_args_pos = [(model_args_name.index(name) - 1, name) for name in input_names]
    model_args = [None] * (max(map(lambda x: x[0], model_args_pos)) + 1)

    for arg_pos, arg_name in model_args_pos:
        model_args[arg_pos] = tokens[arg_name]

    model_args = tuple(model_args)  # Need to be ordered
    return tuple(takewhile(lambda arg: arg is not None, model_args))


def infer_shapes(nlp: Pipeline, framework: str) -> Tuple[List[str], List[str], Dict, BatchEncoding]:
    def build_shape_dict(tensor, is_input: bool, seq_len: int):
        if isinstance(tensor, (tuple, list)):
            return [build_shape_dict(t, is_input, seq_len) for t in tensor]

        else:
            # Let's assume batch is the first axis with only 1 element (this might not always be true)
            axes = {[axis for axis, numel in enumerate(tensor.shape) if numel == 1][0]: "batch"}
            if is_input:
                if len(tensor.shape) == 2:
                    axes[1] = "sequence"
                else:
                    raise ValueError("Unable to infer tensor axes ({})".format(len(tensor.shape)))
            else:
                seq_axes = [dim for dim, shape in enumerate(tensor.shape) if shape == seq_len]
                axes.update({dim: "sequence" for dim in seq_axes})

        return axes

    tokens = nlp.tokenizer.encode_plus("This is a sample output", return_tensors=framework)
    seq_len = tokens.input_ids.shape[-1]
    outputs = nlp.model(**tokens) if framework == "pt" else nlp.model(tokens)

    if not isinstance(outputs, (list, tuple)):
        outputs = (outputs,)

    # Generate input names & axes
    input_vars = list(tokens.keys())
    input_dynamic_axes = {k: build_shape_dict(v, True, seq_len) for k, v in tokens.items()}

    # flatten potentially grouped outputs (past for gpt2, attentions)
    outputs_flat = []
    for output in outputs:
        if isinstance(output, (tuple, list)):
            outputs_flat.extend(output)
        else:
            outputs_flat.append(output)

    # Generate output names & axes
    output_names = ["output_{}".format(i) for i in range(len(outputs_flat))]
    output_dynamic_axes = {k: build_shape_dict(v, False, seq_len) for k, v in zip(output_names, outputs_flat)}

    # Create the aggregated axes representation
    dynamic_axes = dict(input_dynamic_axes, **output_dynamic_axes)
    return input_vars, output_names, dynamic_axes, tokens


def load_graph_from_args(framework: str, model: str, tokenizer: Optional[str] = None) -> Pipeline:
    # If no tokenizer provided
    if tokenizer is None:
        tokenizer = model

    print("Loading pipeline (model: {}, tokenizer: {})".format(model, tokenizer))

    # Allocate tokenizer and model
    return pipeline("feature-extraction", model=model, tokenizer=tokenizer, framework=framework)


def convert_pytorch(nlp: Pipeline, opset: int, output: str, use_external_format: bool):
    if not is_torch_available():
        raise Exception("Cannot convert because PyTorch is not installed. Please install torch first.")

    import torch
    from torch.onnx import export

    print("PyTorch: {}".format(torch.__version__))

    with torch.no_grad():
        input_names, output_names, dynamic_axes, tokens = infer_shapes(nlp, "pt")
        model_args = ensure_valid_input(nlp.model, tokens, input_names)

        export(
            nlp.model,
            model_args,
            f=output,
            input_names=input_names,
            output_names=output_names,
            dynamic_axes=dynamic_axes,
            do_constant_folding=True,
            use_external_data_format=use_external_format,
            enable_onnx_checker=True,
            opset_version=opset,
        )


def convert_tensorflow(nlp: Pipeline, opset: int, output: str):
    if not is_tf_available():
        # Do not reference the module-level `args` here: it only exists when the file runs as a script
        raise Exception("Cannot convert because TF is not installed. Please install tensorflow first.")

    print("/!\\ Please note TensorFlow doesn't support exporting model > 2Gb /!\\")

    try:
        import tensorflow as tf
        from keras2onnx import convert_keras, save_model, __version__ as k2ov

        print("TensorFlow: {}, keras2onnx: {}".format(tf.version.VERSION, k2ov))

        # Build
        input_names, output_names, dynamic_axes, tokens = infer_shapes(nlp, "tf")

        # Forward
        nlp.model.predict(tokens.data)
        onnx_model = convert_keras(nlp.model, nlp.model.name, target_opset=opset)
        save_model(onnx_model, output)

    except ImportError as e:
        raise Exception(
            "Cannot import {} required to convert TF model to ONNX. Please install {} first.".format(e.name, e.name)
        )


def convert(
    framework: str,
    model: str,
    output: str,
    opset: int,
    tokenizer: Optional[str] = None,
    use_external_format: bool = False,
):
    print("ONNX opset version set to: {}".format(opset))

    # Load the pipeline
    nlp = load_graph_from_args(framework, model, tokenizer)

    parent = dirname(output)
    if not exists(parent):
        print("Creating folder {}".format(parent))
        makedirs(parent)
    elif len(listdir(parent)) > 0:
        raise Exception("Folder {} is not empty, aborting conversion".format(parent))

    # Export the graph
    if framework == "pt":
        convert_pytorch(nlp, opset, output, use_external_format)
    else:
        convert_tensorflow(nlp, opset, output)


def verify(path: str):
    from onnxruntime import InferenceSession, SessionOptions
    from onnxruntime.capi.onnxruntime_pybind11_state import RuntimeException

    print("Checking ONNX model loading from: {}".format(path))
    try:
        onnx_options = SessionOptions()
        _ = InferenceSession(path, onnx_options, providers=["CPUExecutionProvider"])
        print("Model correctly loaded")
    except RuntimeException as re:
        print("Error while loading the model: {}".format(re))


if __name__ == "__main__":
    parser = OnnxConverterArgumentParser()
    args = parser.parse_args()

    # Make sure output is absolute path
    args.output = abspath(args.output)

    try:
        # Convert
        convert(args.framework, args.model, args.output, args.opset, args.tokenizer, args.use_external_format)

        # And verify
        if args.check_loading:
            verify(args.output)
    except Exception as e:
        print("Error while converting the model: {}".format(e))
        exit(1)
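Note on `ensure_valid_input`: the function orders the tokenizer outputs by their position in the signature of `model.forward` and drops everything from the first missing argument onward. The sketch below repeats that logic in a standalone form so it runs without transformers. `DummyModel` and the token values are made up for illustration.

```python
from itertools import takewhile


def ensure_valid_input(model, tokens, input_names):
    # Same logic as in convert_graph_to_onnx.py: map each input name to its
    # position in forward() (minus 1 to skip `self`), then fill a positional list
    model_args_name = model.forward.__code__.co_varnames
    model_args_pos = [(model_args_name.index(name) - 1, name) for name in input_names]
    model_args = [None] * (max(pos for pos, _ in model_args_pos) + 1)
    for arg_pos, arg_name in model_args_pos:
        model_args[arg_pos] = tokens[arg_name]
    # Stop at the first gap: positional export arguments cannot contain None
    return tuple(takewhile(lambda arg: arg is not None, model_args))


class DummyModel:
    # Hypothetical stand-in for a model; only the signature matters here
    def forward(self, input_ids=None, attention_mask=None, token_type_ids=None):
        return input_ids


# Keys arrive in tokenizer order, not in signature order
tokens = {"attention_mask": [1, 1], "input_ids": [101, 102]}
ordered = ensure_valid_input(DummyModel(), tokens, list(tokens.keys()))

# A gap at `attention_mask` truncates the tuple after `input_ids`
gapped = {"input_ids": [101, 102], "token_type_ids": [0, 0]}
truncated = ensure_valid_input(DummyModel(), gapped, list(gapped.keys()))
```

One consequence worth checking in review: when the tuple is truncated, `input_names` passed to the exporter may be longer than the arguments actually supplied.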
@@ -226,7 +226,7 @@ def lmap(f, x) -> List:
|
||||
def fetch_test_set(test_set_url):
|
||||
import wget
|
||||
|
||||
fname = wget.download(test_set_url, f"opus_test.txt")
|
||||
fname = wget.download(test_set_url, "opus_test.txt")
|
||||
lns = Path(fname).open().readlines()
|
||||
src = lmap(str.strip, lns[::4])
|
||||
gold = lmap(str.strip, lns[1::4])
|
||||
|
||||
@@ -2,7 +2,8 @@ import logging
|
||||
import os
|
||||
import time
|
||||
from dataclasses import dataclass, field
|
||||
from typing import List, Optional
|
||||
from enum import Enum
|
||||
from typing import List, Optional, Union
|
||||
|
||||
import torch
|
||||
from filelock import FileLock
|
||||
@@ -47,6 +48,12 @@ class GlueDataTrainingArguments:
|
||||
self.task_name = self.task_name.lower()
|
||||
|
||||
|
||||
class Split(Enum):
|
||||
train = "train"
|
||||
dev = "dev"
|
||||
test = "test"
|
||||
|
||||
|
||||
class GlueDataset(Dataset):
|
||||
"""
|
||||
This will be superseded by a framework-agnostic approach
|
||||
@@ -62,16 +69,21 @@ class GlueDataset(Dataset):
|
||||
args: GlueDataTrainingArguments,
|
||||
tokenizer: PreTrainedTokenizer,
|
||||
limit_length: Optional[int] = None,
|
||||
evaluate=False,
|
||||
mode: Union[str, Split] = Split.train,
|
||||
):
|
||||
self.args = args
|
||||
processor = glue_processors[args.task_name]()
|
||||
self.processor = glue_processors[args.task_name]()
|
||||
self.output_mode = glue_output_modes[args.task_name]
|
||||
if isinstance(mode, str):
|
||||
try:
|
||||
mode = Split[mode]
|
||||
except KeyError:
|
||||
raise KeyError("mode is not a valid split name")
|
||||
# Load data features from cache or dataset file
|
||||
cached_features_file = os.path.join(
|
||||
args.data_dir,
|
||||
"cached_{}_{}_{}_{}".format(
|
||||
"dev" if evaluate else "train", tokenizer.__class__.__name__, str(args.max_seq_length), args.task_name,
|
||||
mode.value, tokenizer.__class__.__name__, str(args.max_seq_length), args.task_name,
|
||||
),
|
||||
)
|
||||
|
||||
@@ -88,7 +100,7 @@ class GlueDataset(Dataset):
|
||||
)
|
||||
else:
|
||||
logger.info(f"Creating features from dataset file at {args.data_dir}")
|
||||
label_list = processor.get_labels()
|
||||
label_list = self.processor.get_labels()
|
||||
if args.task_name in ["mnli", "mnli-mm"] and tokenizer.__class__ in (
|
||||
RobertaTokenizer,
|
||||
RobertaTokenizerFast,
|
||||
@@ -96,11 +108,12 @@ class GlueDataset(Dataset):
|
||||
):
|
||||
# HACK(label indices are swapped in RoBERTa pretrained model)
|
||||
label_list[1], label_list[2] = label_list[2], label_list[1]
|
||||
examples = (
|
||||
processor.get_dev_examples(args.data_dir)
|
||||
if evaluate
|
||||
else processor.get_train_examples(args.data_dir)
|
||||
)
|
||||
if mode == Split.dev:
|
||||
examples = self.processor.get_dev_examples(args.data_dir)
|
||||
elif mode == Split.test:
|
||||
examples = self.processor.get_test_examples(args.data_dir)
|
||||
else:
|
||||
examples = self.processor.get_train_examples(args.data_dir)
|
||||
if limit_length is not None:
|
||||
examples = examples[:limit_length]
|
||||
self.features = glue_convert_examples_to_features(
|
||||
@@ -114,7 +127,7 @@ class GlueDataset(Dataset):
|
||||
torch.save(self.features, cached_features_file)
|
||||
# ^ This seems to take a lot of time so I want to investigate why and how we can improve.
|
||||
logger.info(
|
||||
f"Saving features into cached file %s [took %.3f s]", cached_features_file, time.time() - start
|
||||
"Saving features into cached file %s [took %.3f s]", cached_features_file, time.time() - start
|
||||
)
|
||||
|
||||
def __len__(self):
|
||||
@@ -122,3 +135,6 @@ class GlueDataset(Dataset):
|
||||
|
||||
def __getitem__(self, i) -> InputFeatures:
|
||||
return self.features[i]
|
||||
|
||||
def get_labels(self):
|
||||
return self.processor.get_labels()
|
||||
|
||||
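Note on the new `mode` argument: `GlueDataset` now accepts either a `Split` member or its name as a string, and uses `mode.value` in the cache file name. The sketch below isolates that lookup. `resolve_split` is a hypothetical helper written for this note, since the diff performs the lookup inline in `__init__`.

```python
from enum import Enum
from typing import Union


class Split(Enum):
    train = "train"
    dev = "dev"
    test = "test"


def resolve_split(mode: Union[str, Split]) -> Split:
    """Hypothetical helper mirroring the lookup in GlueDataset.__init__."""
    if isinstance(mode, str):
        try:
            # Split[...] looks a member up by name, not by value
            return Split[mode]
        except KeyError:
            raise KeyError("mode is not a valid split name")
    return mode


# The split value becomes part of the cache file name
cache_prefix = "cached_{}".format(resolve_split("dev").value)
```

Because member names and values are identical in `Split`, looking up by name and by value give the same result here; that would stop being true if a member were ever renamed without changing its value.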
@@ -4,10 +4,10 @@ import pickle
|
||||
import time
|
||||
|
||||
import torch
|
||||
from filelock import FileLock
|
||||
from torch.utils.data.dataset import Dataset
|
||||
|
||||
from ...tokenization_utils import PreTrainedTokenizer
|
||||
from ...trainer import torch_distributed_zero_first
|
||||
|
||||
|
||||
logger = logging.getLogger(__name__)
|
||||
@@ -20,7 +20,7 @@ class TextDataset(Dataset):
|
||||
"""
|
||||
|
||||
def __init__(
|
||||
self, tokenizer: PreTrainedTokenizer, file_path: str, block_size: int, overwrite_cache=False, local_rank=-1,
|
||||
self, tokenizer: PreTrainedTokenizer, file_path: str, block_size: int, overwrite_cache=False,
|
||||
):
|
||||
assert os.path.isfile(file_path)
|
||||
|
||||
@@ -31,9 +31,10 @@ class TextDataset(Dataset):
|
||||
directory, "cached_lm_{}_{}_{}".format(tokenizer.__class__.__name__, str(block_size), filename,),
|
||||
)
|
||||
|
||||
with torch_distributed_zero_first(local_rank):
|
||||
# Make sure only the first process in distributed training processes the dataset,
|
||||
# and the others will use the cache.
|
||||
# Make sure only the first process in distributed training processes the dataset,
|
||||
# and the others will use the cache.
|
||||
lock_path = cached_features_file + ".lock"
|
||||
with FileLock(lock_path):
|
||||
|
||||
if os.path.exists(cached_features_file) and not overwrite_cache:
|
||||
start = time.time()
|
||||
@@ -64,7 +65,7 @@ class TextDataset(Dataset):
|
||||
with open(cached_features_file, "wb") as handle:
|
||||
pickle.dump(self.examples, handle, protocol=pickle.HIGHEST_PROTOCOL)
|
||||
logger.info(
|
||||
f"Saving features into cached file %s [took %.3f s]", cached_features_file, time.time() - start
|
||||
"Saving features into cached file %s [took %.3f s]", cached_features_file, time.time() - start
|
||||
)
|
||||
|
||||
def __len__(self):
|
||||
@@ -80,7 +81,7 @@ class LineByLineTextDataset(Dataset):
|
||||
soon.
|
||||
"""
|
||||
|
||||
def __init__(self, tokenizer: PreTrainedTokenizer, file_path: str, block_size: int, local_rank=-1):
|
||||
def __init__(self, tokenizer: PreTrainedTokenizer, file_path: str, block_size: int):
|
||||
assert os.path.isfile(file_path)
|
||||
# Here, we do not cache the features, operating under the assumption
|
||||
# that we will soon use fast multithreaded tokenizers from the
|
||||
|
||||
@@ -126,7 +126,9 @@ def _glue_convert_examples_to_features(
|
||||
|
||||
label_map = {label: i for i, label in enumerate(label_list)}
|
||||
|
||||
def label_from_example(example: InputExample) -> Union[int, float]:
|
||||
def label_from_example(example: InputExample) -> Union[int, float, None]:
|
||||
if example.label is None:
|
||||
return None
|
||||
if output_mode == "classification":
|
||||
return label_map[example.label]
|
||||
elif output_mode == "regression":
|
||||
@@ -180,12 +182,16 @@ class MrpcProcessor(DataProcessor):
|
||||
"""See base class."""
|
||||
return self._create_examples(self._read_tsv(os.path.join(data_dir, "dev.tsv")), "dev")
|
||||
|
||||
def get_test_examples(self, data_dir):
|
||||
"""See base class."""
|
||||
return self._create_examples(self._read_tsv(os.path.join(data_dir, "test.tsv")), "test")
|
||||
|
||||
def get_labels(self):
|
||||
"""See base class."""
|
||||
return ["0", "1"]
|
||||
|
||||
def _create_examples(self, lines, set_type):
|
||||
"""Creates examples for the training and dev sets."""
|
||||
"""Creates examples for the training, dev and test sets."""
|
||||
examples = []
|
||||
for (i, line) in enumerate(lines):
|
||||
if i == 0:
|
||||
@@ -193,7 +199,7 @@ class MrpcProcessor(DataProcessor):
|
||||
guid = "%s-%s" % (set_type, i)
|
||||
text_a = line[3]
|
||||
text_b = line[4]
|
||||
label = line[0]
|
||||
label = None if set_type == "test" else line[0]
|
||||
examples.append(InputExample(guid=guid, text_a=text_a, text_b=text_b, label=label))
|
||||
return examples
|
||||
|
||||
@@ -218,12 +224,16 @@ class MnliProcessor(DataProcessor):
|
||||
"""See base class."""
|
||||
return self._create_examples(self._read_tsv(os.path.join(data_dir, "dev_matched.tsv")), "dev_matched")
|
||||
|
||||
def get_test_examples(self, data_dir):
|
||||
"""See base class."""
|
||||
return self._create_examples(self._read_tsv(os.path.join(data_dir, "test_matched.tsv")), "test_matched")
|
||||
|
||||
def get_labels(self):
|
||||
"""See base class."""
|
||||
return ["contradiction", "entailment", "neutral"]
|
||||
|
||||
def _create_examples(self, lines, set_type):
|
||||
"""Creates examples for the training and dev sets."""
|
||||
"""Creates examples for the training, dev and test sets."""
|
||||
examples = []
|
||||
for (i, line) in enumerate(lines):
|
||||
if i == 0:
|
||||
@@ -231,7 +241,7 @@ class MnliProcessor(DataProcessor):
|
||||
guid = "%s-%s" % (set_type, line[0])
|
||||
text_a = line[8]
|
||||
text_b = line[9]
|
||||
label = line[-1]
|
||||
label = None if set_type.startswith("test") else line[-1]
|
||||
examples.append(InputExample(guid=guid, text_a=text_a, text_b=text_b, label=label))
|
||||
return examples
|
||||
|
||||
@@ -241,7 +251,11 @@ class MnliMismatchedProcessor(MnliProcessor):
|
||||
|
||||
def get_dev_examples(self, data_dir):
|
||||
"""See base class."""
|
||||
return self._create_examples(self._read_tsv(os.path.join(data_dir, "dev_mismatched.tsv")), "dev_matched")
|
||||
return self._create_examples(self._read_tsv(os.path.join(data_dir, "dev_mismatched.tsv")), "dev_mismatched")
|
||||
|
||||
def get_test_examples(self, data_dir):
|
||||
"""See base class."""
|
||||
return self._create_examples(self._read_tsv(os.path.join(data_dir, "test_mismatched.tsv")), "test_mismatched")
|
||||
|
||||
|
||||
class ColaProcessor(DataProcessor):
|
||||
@@ -264,17 +278,25 @@ class ColaProcessor(DataProcessor):
|
||||
"""See base class."""
|
||||
return self._create_examples(self._read_tsv(os.path.join(data_dir, "dev.tsv")), "dev")
|
||||
|
||||
def get_test_examples(self, data_dir):
|
||||
"""See base class."""
|
||||
return self._create_examples(self._read_tsv(os.path.join(data_dir, "test.tsv")), "test")
|
||||
|
||||
def get_labels(self):
|
||||
"""See base class."""
|
||||
return ["0", "1"]
|
||||
|
||||
def _create_examples(self, lines, set_type):
|
||||
"""Creates examples for the training and dev sets."""
|
||||
"""Creates examples for the training, dev and test sets."""
|
||||
test_mode = set_type == "test"
|
||||
if test_mode:
|
||||
lines = lines[1:]
|
||||
text_index = 1 if test_mode else 3
|
||||
examples = []
|
||||
for (i, line) in enumerate(lines):
|
||||
guid = "%s-%s" % (set_type, i)
|
||||
text_a = line[3]
|
||||
label = line[1]
|
||||
text_a = line[text_index]
|
||||
label = None if test_mode else line[1]
|
||||
examples.append(InputExample(guid=guid, text_a=text_a, text_b=None, label=label))
|
||||
return examples
|
||||
|
||||
@@ -299,19 +321,23 @@ class Sst2Processor(DataProcessor):
|
||||
"""See base class."""
|
||||
return self._create_examples(self._read_tsv(os.path.join(data_dir, "dev.tsv")), "dev")
|
||||
|
||||
def get_test_examples(self, data_dir):
|
||||
"""See base class."""
|
||||
return self._create_examples(self._read_tsv(os.path.join(data_dir, "test.tsv")), "test")
|
||||
|
||||
def get_labels(self):
|
||||
"""See base class."""
|
||||
return ["0", "1"]
|
||||
|
||||
def _create_examples(self, lines, set_type):
|
||||
"""Creates examples for the training and dev sets."""
|
||||
"""Creates examples for the training, dev and test sets."""
|
||||
examples = []
|
||||
for (i, line) in enumerate(lines):
|
||||
if i == 0:
|
||||
continue
|
||||
guid = "%s-%s" % (set_type, i)
|
||||
text_a = line[0]
|
||||
label = line[1]
|
||||
label = None if set_type == "test" else line[1]
|
||||
examples.append(InputExample(guid=guid, text_a=text_a, text_b=None, label=label))
|
||||
return examples
|
||||
|
||||
@@ -336,12 +362,16 @@ class StsbProcessor(DataProcessor):
|
||||
"""See base class."""
|
||||
return self._create_examples(self._read_tsv(os.path.join(data_dir, "dev.tsv")), "dev")
|
||||
|
||||
def get_test_examples(self, data_dir):
|
||||
"""See base class."""
|
||||
return self._create_examples(self._read_tsv(os.path.join(data_dir, "test.tsv")), "test")
|
||||
|
||||
def get_labels(self):
|
||||
"""See base class."""
|
||||
return [None]
|
||||
|
||||
def _create_examples(self, lines, set_type):
|
||||
"""Creates examples for the training and dev sets."""
|
||||
"""Creates examples for the training, dev and test sets."""
|
||||
examples = []
|
||||
for (i, line) in enumerate(lines):
|
||||
if i == 0:
|
||||
@@ -349,7 +379,7 @@ class StsbProcessor(DataProcessor):
|
||||
guid = "%s-%s" % (set_type, line[0])
|
||||
text_a = line[7]
|
||||
text_b = line[8]
|
||||
label = line[-1]
|
||||
label = None if set_type == "test" else line[-1]
|
||||
examples.append(InputExample(guid=guid, text_a=text_a, text_b=text_b, label=label))
|
||||
return examples
|
||||
|
||||
@@ -374,21 +404,28 @@ class QqpProcessor(DataProcessor):
|
||||
"""See base class."""
|
||||
return self._create_examples(self._read_tsv(os.path.join(data_dir, "dev.tsv")), "dev")
|
||||
|
||||
def get_test_examples(self, data_dir):
|
||||
"""See base class."""
|
||||
return self._create_examples(self._read_tsv(os.path.join(data_dir, "test.tsv")), "test")
|
||||
|
||||
def get_labels(self):
|
||||
"""See base class."""
|
||||
return ["0", "1"]
|
||||
|
||||
def _create_examples(self, lines, set_type):
|
||||
"""Creates examples for the training and dev sets."""
|
||||
"""Creates examples for the training, dev and test sets."""
|
||||
test_mode = set_type == "test"
|
||||
q1_index = 1 if test_mode else 3
|
||||
q2_index = 2 if test_mode else 4
|
||||
examples = []
|
||||
for (i, line) in enumerate(lines):
|
||||
if i == 0:
|
||||
continue
|
||||
guid = "%s-%s" % (set_type, line[0])
|
||||
try:
|
||||
text_a = line[3]
|
||||
text_b = line[4]
|
||||
label = line[5]
|
||||
text_a = line[q1_index]
|
||||
text_b = line[q2_index]
|
||||
label = None if test_mode else line[5]
|
||||
except IndexError:
|
||||
continue
|
||||
examples.append(InputExample(guid=guid, text_a=text_a, text_b=text_b, label=label))
|
||||
@@ -413,14 +450,18 @@ class QnliProcessor(DataProcessor):
|
||||
|
||||
def get_dev_examples(self, data_dir):
|
||||
"""See base class."""
|
||||
return self._create_examples(self._read_tsv(os.path.join(data_dir, "dev.tsv")), "dev_matched")
|
||||
return self._create_examples(self._read_tsv(os.path.join(data_dir, "dev.tsv")), "dev")
|
||||
|
||||
def get_test_examples(self, data_dir):
|
||||
"""See base class."""
|
||||
return self._create_examples(self._read_tsv(os.path.join(data_dir, "test.tsv")), "test")
|
||||
|
||||
def get_labels(self):
|
||||
"""See base class."""
|
||||
return ["entailment", "not_entailment"]
|
||||
|
||||
def _create_examples(self, lines, set_type):
|
||||
"""Creates examples for the training and dev sets."""
|
||||
"""Creates examples for the training, dev and test sets."""
|
||||
examples = []
|
||||
for (i, line) in enumerate(lines):
|
||||
if i == 0:
|
||||
@@ -428,7 +469,7 @@ class QnliProcessor(DataProcessor):
|
||||
guid = "%s-%s" % (set_type, line[0])
|
||||
text_a = line[1]
|
||||
text_b = line[2]
|
||||
label = line[-1]
|
||||
label = None if set_type == "test" else line[-1]
|
||||
examples.append(InputExample(guid=guid, text_a=text_a, text_b=text_b, label=label))
|
||||
return examples
|
||||
|
||||
@@ -453,12 +494,16 @@ class RteProcessor(DataProcessor):
|
||||
"""See base class."""
|
||||
return self._create_examples(self._read_tsv(os.path.join(data_dir, "dev.tsv")), "dev")
|
||||
|
||||
def get_test_examples(self, data_dir):
|
||||
"""See base class."""
|
||||
return self._create_examples(self._read_tsv(os.path.join(data_dir, "test.tsv")), "test")
|
||||
|
||||
def get_labels(self):
|
||||
"""See base class."""
|
||||
return ["entailment", "not_entailment"]
|
||||
|
||||
def _create_examples(self, lines, set_type):
|
||||
"""Creates examples for the training and dev sets."""
|
||||
"""Creates examples for the training, dev and test sets."""
|
||||
examples = []
|
||||
for (i, line) in enumerate(lines):
|
||||
if i == 0:
|
||||
@@ -466,7 +511,7 @@ class RteProcessor(DataProcessor):
|
||||
guid = "%s-%s" % (set_type, line[0])
|
||||
text_a = line[1]
|
||||
text_b = line[2]
|
||||
label = line[-1]
|
||||
label = None if set_type == "test" else line[-1]
|
||||
examples.append(InputExample(guid=guid, text_a=text_a, text_b=text_b, label=label))
|
||||
return examples
|
||||
|
||||
@@ -491,12 +536,16 @@ class WnliProcessor(DataProcessor):
|
||||
"""See base class."""
|
||||
return self._create_examples(self._read_tsv(os.path.join(data_dir, "dev.tsv")), "dev")
|
||||
|
||||
def get_test_examples(self, data_dir):
|
||||
"""See base class."""
|
||||
return self._create_examples(self._read_tsv(os.path.join(data_dir, "test.tsv")), "test")
|
||||
|
||||
def get_labels(self):
|
||||
"""See base class."""
|
||||
return ["0", "1"]
|
||||
|
||||
def _create_examples(self, lines, set_type):
|
||||
"""Creates examples for the training and dev sets."""
|
||||
"""Creates examples for the training, dev and test sets."""
|
||||
examples = []
|
||||
for (i, line) in enumerate(lines):
|
||||
if i == 0:
|
||||
@@ -504,7 +553,7 @@ class WnliProcessor(DataProcessor):
|
||||
guid = "%s-%s" % (set_type, line[0])
|
||||
text_a = line[1]
|
||||
text_b = line[2]
|
||||
label = line[-1]
|
||||
label = None if set_type == "test" else line[-1]
|
||||
examples.append(InputExample(guid=guid, text_a=text_a, text_b=text_b, label=label))
|
||||
return examples
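The hunks above apply the same change to the QNLI, RTE and WNLI processors: test rows carry no gold label, so the label is set to `None` when `set_type == "test"`. The following is a minimal standalone sketch of that pattern; `Example`, `create_examples` and the sample rows are hypothetical stand-ins for illustration, not the library's `InputExample` or the processor methods themselves.

```python
from collections import namedtuple

# Hypothetical stand-in for InputExample, for illustration only.
Example = namedtuple("Example", ["guid", "text_a", "text_b", "label"])


def create_examples(lines, set_type):
    """Build examples from TSV rows, skipping the header row."""
    examples = []
    for i, line in enumerate(lines):
        if i == 0:  # header row
            continue
        guid = "%s-%s" % (set_type, line[0])
        # Test rows have no label column, so never read the last column as a label.
        label = None if set_type == "test" else line[-1]
        examples.append(Example(guid, line[1], line[2], label))
    return examples


dev_rows = [["index", "question", "sentence", "label"], ["0", "q", "s", "entailment"]]
test_rows = [["index", "question", "sentence"], ["0", "q", "s"]]

dev_examples = create_examples(dev_rows, "dev")
test_examples = create_examples(test_rows, "test")
```

Without the `set_type` check, `line[-1]` on a test row would silently pick up the last text column as the label.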
|
||||
|
||||
|
||||
@@ -195,18 +195,22 @@ def squad_convert_example_to_features(example, max_seq_length, doc_stride, max_q
|
||||
cls_index = span["input_ids"].index(tokenizer.cls_token_id)
|
||||
|
||||
# p_mask: mask with 1 for tokens that cannot be in the answer (0 for tokens which can be in an answer)
|
||||
# Original TF implem also keep the classification token (set to 0) (not sure why...)
|
||||
p_mask = np.array(span["token_type_ids"])
|
||||
|
||||
p_mask = np.minimum(p_mask, 1)
|
||||
|
||||
# Original TF implementation also keeps the classification token (set to 0)
|
||||
p_mask = np.ones_like(span["token_type_ids"])
|
||||
if tokenizer.padding_side == "right":
|
||||
# Limit positive values to one
|
||||
p_mask = 1 - p_mask
|
||||
p_mask[len(truncated_query) + sequence_added_tokens :] = 0
|
||||
else:
|
||||
p_mask[-len(span["tokens"]) : -(len(truncated_query) + sequence_added_tokens)] = 0
|
||||
|
||||
p_mask[np.where(np.array(span["input_ids"]) == tokenizer.sep_token_id)[0]] = 1
|
||||
pad_token_indices = np.where(span["input_ids"] == tokenizer.pad_token_id)
|
||||
special_token_indices = np.asarray(
|
||||
tokenizer.get_special_tokens_mask(span["input_ids"], already_has_special_tokens=True)
|
||||
).nonzero()
|
||||
|
||||
# Set the CLS index to '0'
|
||||
p_mask[pad_token_indices] = 1
|
||||
p_mask[special_token_indices] = 1
|
||||
|
||||
# Set the cls index to 0: the CLS index can be used for impossible answers
|
||||
p_mask[cls_index] = 0
|
||||
|
||||
span_is_impossible = example.is_impossible
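To make the new `p_mask` construction easier to follow, here is a sketch of the right-padded case using plain Python lists instead of NumPy arrays. The token strings, `query_len` and the value `sequence_added_tokens = 2` are assumptions chosen for a BERT-style layout; the real code derives them from the tokenizer.

```python
CLS, SEP, PAD = "[CLS]", "[SEP]", "[PAD]"

# Assumed layout: [CLS] query [SEP] context [SEP] padding
tokens = [CLS, "who", "wrote", SEP, "ada", "wrote", "it", SEP, PAD, PAD]
query_len = 2               # length of the truncated query (assumed)
sequence_added_tokens = 2   # [CLS] and the first [SEP] (assumed, BERT-style)

# Start with everything excluded from the answer.
p_mask = [1] * len(tokens)

# Everything after the query and its special tokens may hold the answer.
for i in range(query_len + sequence_added_tokens, len(tokens)):
    p_mask[i] = 0

# Padding and special tokens can never be part of an answer.
for i, tok in enumerate(tokens):
    if tok in (CLS, SEP, PAD):
        p_mask[i] = 1

# The CLS position stays available: it is used for impossible answers.
p_mask[tokens.index(CLS)] = 0
```

Only the three context tokens and the CLS position end up with a 0 in the mask.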
|
||||
|
||||
@@ -98,6 +98,10 @@ class DataProcessor:
|
||||
"""Gets a collection of `InputExample`s for the dev set."""
|
||||
raise NotImplementedError()
|
||||
|
||||
def get_test_examples(self, data_dir):
|
||||
"""Gets a collection of `InputExample`s for the test set."""
|
||||
raise NotImplementedError()
|
||||
|
||||
def get_labels(self):
|
||||
"""Gets the list of labels for this data set."""
|
||||
raise NotImplementedError()
|
||||
|
||||
@@ -550,7 +550,7 @@ class AlbertModel(AlbertPreTrainedModel):
|
||||
token_type_ids = torch.zeros(input_shape, dtype=torch.long, device=device)
|
||||
|
||||
extended_attention_mask = attention_mask.unsqueeze(1).unsqueeze(2)
|
||||
extended_attention_mask = extended_attention_mask.to(dtype=next(self.parameters()).dtype) # fp16 compatibility
|
||||
extended_attention_mask = extended_attention_mask.to(dtype=self.dtype) # fp16 compatibility
|
||||
extended_attention_mask = (1.0 - extended_attention_mask) * -10000.0
|
||||
head_mask = self.get_head_mask(head_mask, self.config.num_hidden_layers)
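The line `(1.0 - extended_attention_mask) * -10000.0` turns a 0/1 keep mask into an additive bias on the attention scores. A small sketch in plain Python (no tensors, helper names are made up for illustration) shows why the masked position ends up with essentially zero weight after the softmax:

```python
import math


def extend_attention_mask(attention_mask, masked_value=-10000.0):
    """Map 1 (keep) to 0.0 and 0 (mask) to a large negative additive bias."""
    return [[(1.0 - float(keep)) * masked_value for keep in row] for row in attention_mask]


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


mask = [[1, 1, 0]]                      # last position is padding
extended = extend_attention_mask(mask)
scores = [1.0, 2.0, 3.0]                # raw attention scores for one query
weights = softmax([s + m for s, m in zip(scores, extended[0])])
```

The padded position had the highest raw score, yet it receives no attention once the bias is added.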
|
||||
|
||||
|
||||
@@ -30,6 +30,7 @@ from .configuration_auto import (
|
||||
EncoderDecoderConfig,
|
||||
FlaubertConfig,
|
||||
GPT2Config,
|
||||
LongformerConfig,
|
||||
OpenAIGPTConfig,
|
||||
ReformerConfig,
|
||||
RobertaConfig,
|
||||
@@ -87,6 +88,7 @@ from .modeling_electra import (
|
||||
ELECTRA_PRETRAINED_MODEL_ARCHIVE_MAP,
|
||||
ElectraForMaskedLM,
|
||||
ElectraForPreTraining,
|
||||
ElectraForSequenceClassification,
|
||||
ElectraForTokenClassification,
|
||||
ElectraModel,
|
||||
)
|
||||
@@ -99,6 +101,7 @@ from .modeling_flaubert import (
|
||||
FlaubertWithLMHeadModel,
|
||||
)
|
||||
from .modeling_gpt2 import GPT2_PRETRAINED_MODEL_ARCHIVE_MAP, GPT2LMHeadModel, GPT2Model
|
||||
from .modeling_longformer import LONGFORMER_PRETRAINED_MODEL_ARCHIVE_MAP, LongformerForMaskedLM, LongformerModel
|
||||
from .modeling_marian import MarianMTModel
|
||||
from .modeling_openai import OPENAI_GPT_PRETRAINED_MODEL_ARCHIVE_MAP, OpenAIGPTLMHeadModel, OpenAIGPTModel
|
||||
from .modeling_reformer import ReformerModel, ReformerModelWithLMHead
|
||||
@@ -162,6 +165,7 @@ ALL_PRETRAINED_MODEL_ARCHIVE_MAP = dict(
|
||||
FLAUBERT_PRETRAINED_MODEL_ARCHIVE_MAP,
|
||||
XLM_ROBERTA_PRETRAINED_MODEL_ARCHIVE_MAP,
|
||||
ELECTRA_PRETRAINED_MODEL_ARCHIVE_MAP,
|
||||
LONGFORMER_PRETRAINED_MODEL_ARCHIVE_MAP,
|
||||
]
|
||||
for key, value, in pretrained_map.items()
|
||||
)
|
||||
@@ -174,6 +178,7 @@ MODEL_MAPPING = OrderedDict(
|
||||
(CamembertConfig, CamembertModel),
|
||||
(XLMRobertaConfig, XLMRobertaModel),
|
||||
(BartConfig, BartModel),
|
||||
(LongformerConfig, LongformerModel),
|
||||
(RobertaConfig, RobertaModel),
|
||||
(BertConfig, BertModel),
|
||||
(OpenAIGPTConfig, OpenAIGPTModel),
|
||||
@@ -196,6 +201,7 @@ MODEL_FOR_PRETRAINING_MAPPING = OrderedDict(
|
||||
(CamembertConfig, CamembertForMaskedLM),
|
||||
(XLMRobertaConfig, XLMRobertaForMaskedLM),
|
||||
(BartConfig, BartForConditionalGeneration),
|
||||
(LongformerConfig, LongformerForMaskedLM),
|
||||
(RobertaConfig, RobertaForMaskedLM),
|
||||
(BertConfig, BertForPreTraining),
|
||||
(OpenAIGPTConfig, OpenAIGPTLMHeadModel),
|
||||
@@ -218,6 +224,7 @@ MODEL_WITH_LM_HEAD_MAPPING = OrderedDict(
|
||||
(XLMRobertaConfig, XLMRobertaForMaskedLM),
|
||||
(MarianConfig, MarianMTModel),
|
||||
(BartConfig, BartForConditionalGeneration),
|
||||
(LongformerConfig, LongformerForMaskedLM),
|
||||
(RobertaConfig, RobertaForMaskedLM),
|
||||
(BertConfig, BertForMaskedLM),
|
||||
(OpenAIGPTConfig, OpenAIGPTLMHeadModel),
|
||||
@@ -245,6 +252,7 @@ MODEL_FOR_SEQUENCE_CLASSIFICATION_MAPPING = OrderedDict(
|
||||
(XLNetConfig, XLNetForSequenceClassification),
|
||||
(FlaubertConfig, FlaubertForSequenceClassification),
|
||||
(XLMConfig, XLMForSequenceClassification),
|
||||
(ElectraConfig, ElectraForSequenceClassification),
|
||||
]
|
||||
)
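In each mapping above, the Longformer entry is inserted before the RoBERTa entry. A plausible reason is that lookup walks the `OrderedDict` in order and uses `isinstance`, so a subclass config must come before its parent. The sketch below assumes the Longformer config subclasses the RoBERTa config (this diff does not show the config class); all class and function names are hypothetical.

```python
from collections import OrderedDict


class RobertaLikeConfig:
    """Hypothetical parent config."""


class LongformerLikeConfig(RobertaLikeConfig):
    """Hypothetical child config (assumed to subclass the parent)."""


def resolve(config, mapping):
    """Return the first model name whose config class matches via isinstance."""
    for config_class, model_name in mapping.items():
        if isinstance(config, config_class):
            return model_name
    raise ValueError("Unrecognized configuration class")


child_first = OrderedDict(
    [(LongformerLikeConfig, "LongformerModel"), (RobertaLikeConfig, "RobertaModel")]
)
parent_first = OrderedDict(
    [(RobertaLikeConfig, "RobertaModel"), (LongformerLikeConfig, "LongformerModel")]
)

good = resolve(LongformerLikeConfig(), child_first)
bad = resolve(LongformerLikeConfig(), parent_first)
```

With the parent listed first, the child config is shadowed and resolves to the wrong model.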
|
||||
|
||||
@@ -313,6 +321,7 @@ class AutoModel:
|
||||
The model class to instantiate is selected based on the configuration class:
|
||||
|
||||
- isInstance of `distilbert` configuration class: :class:`~transformers.DistilBertModel` (DistilBERT model)
|
||||
- isInstance of `longformer` configuration class: :class:`~transformers.LongformerModel` (Longformer model)
|
||||
- isInstance of `roberta` configuration class: :class:`~transformers.RobertaModel` (RoBERTa model)
|
||||
- isInstance of `bert` configuration class: :class:`~transformers.BertModel` (Bert model)
|
||||
- isInstance of `openai-gpt` configuration class: :class:`~transformers.OpenAIGPTModel` (OpenAI GPT model)
|
||||
@@ -355,6 +364,7 @@ class AutoModel:
|
||||
- contains `albert`: :class:`~transformers.AlbertModel` (ALBERT model)
|
||||
- contains `camembert`: :class:`~transformers.CamembertModel` (CamemBERT model)
|
||||
- contains `xlm-roberta`: :class:`~transformers.XLMRobertaModel` (XLM-RoBERTa model)
|
||||
- contains `longformer`: :class:`~transformers.LongformerModel` (Longformer model)
|
||||
- contains `roberta`: :class:`~transformers.RobertaModel` (RoBERTa model)
|
||||
- contains `bert`: :class:`~transformers.BertModel` (Bert model)
|
||||
- contains `openai-gpt`: :class:`~transformers.OpenAIGPTModel` (OpenAI GPT model)
|
||||
@@ -463,6 +473,7 @@ class AutoModelForPreTraining:
|
||||
The model class to instantiate is selected based on the configuration class:
|
||||
|
||||
- isInstance of `distilbert` configuration class: :class:`~transformers.DistilBertForMaskedLM` (DistilBERT model)
|
||||
- isInstance of `longformer` configuration class: :class:`~transformers.LongformerForMaskedLM` (Longformer model)
|
||||
- isInstance of `roberta` configuration class: :class:`~transformers.RobertaForMaskedLM` (RoBERTa model)
|
||||
- isInstance of `bert` configuration class: :class:`~transformers.BertForPreTraining` (Bert model)
|
||||
- isInstance of `openai-gpt` configuration class: :class:`~transformers.OpenAIGPTLMHeadModel` (OpenAI GPT model)
|
||||
@@ -504,6 +515,7 @@ class AutoModelForPreTraining:
|
||||
- contains `albert`: :class:`~transformers.AlbertForMaskedLM` (ALBERT model)
|
||||
- contains `camembert`: :class:`~transformers.CamembertForMaskedLM` (CamemBERT model)
|
||||
- contains `xlm-roberta`: :class:`~transformers.XLMRobertaForMaskedLM` (XLM-RoBERTa model)
|
||||
- contains `longformer`: :class:`~transformers.LongformerForMaskedLM` (Longformer model)
|
||||
- contains `roberta`: :class:`~transformers.RobertaForMaskedLM` (RoBERTa model)
|
||||
- contains `bert`: :class:`~transformers.BertForPreTraining` (Bert model)
|
||||
- contains `openai-gpt`: :class:`~transformers.OpenAIGPTLMHeadModel` (OpenAI GPT model)
|
||||
@@ -606,6 +618,7 @@ class AutoModelWithLMHead:
|
||||
The model class to instantiate is selected based on the configuration class:
|
||||
|
||||
- isInstance of `distilbert` configuration class: :class:`~transformers.DistilBertForMaskedLM` (DistilBERT model)
|
||||
- isInstance of `longformer` configuration class: :class:`~transformers.LongformerForMaskedLM` (Longformer model)
|
||||
- isInstance of `roberta` configuration class: :class:`~transformers.RobertaForMaskedLM` (RoBERTa model)
|
||||
- isInstance of `bert` configuration class: :class:`~transformers.BertForMaskedLM` (Bert model)
|
||||
- isInstance of `openai-gpt` configuration class: :class:`~transformers.OpenAIGPTLMHeadModel` (OpenAI GPT model)
|
||||
@@ -648,6 +661,7 @@ class AutoModelWithLMHead:
|
||||
- contains `albert`: :class:`~transformers.AlbertForMaskedLM` (ALBERT model)
|
||||
- contains `camembert`: :class:`~transformers.CamembertForMaskedLM` (CamemBERT model)
|
||||
- contains `xlm-roberta`: :class:`~transformers.XLMRobertaForMaskedLM` (XLM-RoBERTa model)
|
||||
- contains `longformer`: :class:`~transformers.LongformerForMaskedLM` (Longformer model)
|
||||
- contains `roberta`: :class:`~transformers.RobertaForMaskedLM` (RoBERTa model)
|
||||
- contains `bert`: :class:`~transformers.BertForMaskedLM` (Bert model)
|
||||
- contains `openai-gpt`: :class:`~transformers.OpenAIGPTLMHeadModel` (OpenAI GPT model)
|
||||
|
||||
@@ -703,9 +703,7 @@ class BertModel(BertPreTrainedModel):
|
||||
|
||||
# We can provide a self-attention mask of dimensions [batch_size, from_seq_length, to_seq_length]
|
||||
# ourselves in which case we just need to make it broadcastable to all heads.
|
||||
extended_attention_mask: torch.Tensor = self.get_extended_attention_mask(
|
||||
attention_mask, input_shape, self.device
|
||||
)
|
||||
extended_attention_mask: torch.Tensor = self.get_extended_attention_mask(attention_mask, input_shape, device)
|
||||
|
||||
# If a 2D or 3D attention mask is provided for the cross-attention
|
||||
# we need to make it broadcastable to [batch_size, num_heads, seq_length, seq_length]
|
||||
|
||||
@@ -3,6 +3,7 @@ import os
|
||||
|
||||
import torch
|
||||
import torch.nn as nn
|
||||
from torch.nn import CrossEntropyLoss, MSELoss
|
||||
|
||||
from .activations import get_activation
|
||||
from .configuration_electra import ElectraConfig
|
||||
@@ -330,6 +331,112 @@ class ElectraModel(ElectraPreTrainedModel):
|
||||
return hidden_states
|
||||
|
||||
|
||||
class ElectraClassificationHead(nn.Module):
|
||||
"""Head for sentence-level classification tasks."""
|
||||
|
||||
def __init__(self, config):
|
||||
super().__init__()
|
||||
self.dense = nn.Linear(config.hidden_size, config.hidden_size)
|
||||
self.dropout = nn.Dropout(config.hidden_dropout_prob)
|
||||
self.out_proj = nn.Linear(config.hidden_size, config.num_labels)
|
||||
|
||||
def forward(self, features, **kwargs):
|
||||
x = features[:, 0, :] # take <s> token (equiv. to [CLS])
|
||||
x = self.dropout(x)
|
||||
x = self.dense(x)
|
||||
x = get_activation("gelu")(x) # although BERT uses tanh here, it seems Electra authors used gelu here
|
||||
x = self.dropout(x)
|
||||
x = self.out_proj(x)
|
||||
return x
|
||||
|
||||
|
||||
@add_start_docstrings(
|
||||
"""ELECTRA Model transformer with a sequence classification/regression head on top (a linear layer on top of
|
||||
the pooled output) e.g. for GLUE tasks. """,
|
||||
ELECTRA_START_DOCSTRING,
|
||||
)
|
||||
class ElectraForSequenceClassification(ElectraPreTrainedModel):
|
||||
def __init__(self, config):
|
||||
super().__init__(config)
|
||||
self.num_labels = config.num_labels
|
||||
self.electra = ElectraModel(config)
|
||||
self.classifier = ElectraClassificationHead(config)
|
||||
|
||||
self.init_weights()
|
||||
|
||||
@add_start_docstrings_to_callable(ELECTRA_INPUTS_DOCSTRING)
|
||||
def forward(
|
||||
self,
|
||||
input_ids=None,
|
||||
attention_mask=None,
|
||||
token_type_ids=None,
|
||||
position_ids=None,
|
||||
head_mask=None,
|
||||
inputs_embeds=None,
|
||||
labels=None,
|
||||
):
|
||||
r"""
|
||||
labels (:obj:`torch.LongTensor` of shape :obj:`(batch_size,)`, `optional`, defaults to :obj:`None`):
|
||||
Labels for computing the sequence classification/regression loss.
|
||||
Indices should be in :obj:`[0, ..., config.num_labels - 1]`.
|
||||
If :obj:`config.num_labels == 1` a regression loss is computed (Mean-Square loss),
|
||||
If :obj:`config.num_labels > 1` a classification loss is computed (Cross-Entropy).
|
||||
|
||||
Returns:
|
||||
:obj:`tuple(torch.FloatTensor)` comprising various elements depending on the configuration (:class:`~transformers.ElectraConfig`) and inputs:
|
||||
loss (:obj:`torch.FloatTensor` of shape :obj:`(1,)`, `optional`, returned when :obj:`labels` is provided):
|
||||
Classification (or regression if config.num_labels==1) loss.
|
||||
logits (:obj:`torch.FloatTensor` of shape :obj:`(batch_size, config.num_labels)`):
|
||||
Classification (or regression if config.num_labels==1) scores (before SoftMax).
|
||||
hidden_states (:obj:`tuple(torch.FloatTensor)`, `optional`, returned when ``config.output_hidden_states=True``):
|
||||
Tuple of :obj:`torch.FloatTensor` (one for the output of the embeddings + one for the output of each layer)
|
||||
of shape :obj:`(batch_size, sequence_length, hidden_size)`.
|
||||
|
||||
Hidden-states of the model at the output of each layer plus the initial embedding outputs.
|
||||
attentions (:obj:`tuple(torch.FloatTensor)`, `optional`, returned when ``config.output_attentions=True``):
|
||||
Tuple of :obj:`torch.FloatTensor` (one for each layer) of shape
|
||||
:obj:`(batch_size, num_heads, sequence_length, sequence_length)`.
|
||||
|
||||
Attentions weights after the attention softmax, used to compute the weighted average in the self-attention
|
||||
heads.
|
||||
|
||||
Examples::
|
||||
|
||||
from transformers import ElectraTokenizer, ElectraForSequenceClassification
|
||||
import torch
|
||||
|
||||
tokenizer = ElectraTokenizer.from_pretrained('google/electra-small-discriminator')
|
||||
model = ElectraForSequenceClassification.from_pretrained('google/electra-small-discriminator')
|
||||
|
||||
input_ids = torch.tensor(tokenizer.encode("Hello, my dog is cute", add_special_tokens=True)).unsqueeze(0) # Batch size 1
|
||||
labels = torch.tensor([1]).unsqueeze(0) # Batch size 1
|
||||
outputs = model(input_ids, labels=labels)
|
||||
|
||||
loss, logits = outputs[:2]
|
||||
|
||||
"""
|
||||
discriminator_hidden_states = self.electra(
|
||||
input_ids, attention_mask, token_type_ids, position_ids, head_mask, inputs_embeds
|
||||
)
|
||||
|
||||
sequence_output = discriminator_hidden_states[0]
|
||||
logits = self.classifier(sequence_output)
|
||||
|
||||
outputs = (logits,) + discriminator_hidden_states[2:] # add hidden states and attention if they are here
|
||||
|
||||
if labels is not None:
|
||||
if self.num_labels == 1:
|
||||
# We are doing regression
|
||||
loss_fct = MSELoss()
|
||||
loss = loss_fct(logits.view(-1), labels.view(-1))
|
||||
else:
|
||||
loss_fct = CrossEntropyLoss()
|
||||
loss = loss_fct(logits.view(-1, self.num_labels), labels.view(-1))
|
||||
outputs = (loss,) + outputs
|
||||
|
||||
return outputs # (loss), logits, (hidden_states), (attentions)
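The loss branch in `forward` above switches on `num_labels`: a single label means regression with mean squared error, anything else means classification with cross-entropy. The following pure-Python sketch mirrors that decision on nested lists; the helper names are hypothetical and the arithmetic is a simplified stand-in for `MSELoss` and `CrossEntropyLoss` with their default mean reduction.

```python
import math


def mse_loss(preds, targets):
    """Mean squared error over a flat list of predictions."""
    return sum((p - t) ** 2 for p, t in zip(preds, targets)) / len(preds)


def cross_entropy_loss(logits, targets):
    """Mean negative log-likelihood of the target class under a softmax."""
    total = 0.0
    for row, target in zip(logits, targets):
        peak = max(row)
        log_sum_exp = peak + math.log(sum(math.exp(v - peak) for v in row))
        total += log_sum_exp - row[target]
    return total / len(logits)


def sequence_loss(logits, labels, num_labels):
    """Pick the loss the same way the forward pass does."""
    if num_labels == 1:
        # Regression: one score per example.
        return mse_loss([row[0] for row in logits], labels)
    return cross_entropy_loss(logits, labels)


regression = sequence_loss([[2.0], [0.0]], [1.0, 0.0], num_labels=1)
classification = sequence_loss([[0.0, 0.0]], [1], num_labels=2)
```

For two equal logits the classification loss is `log(2)`, the loss of a uniform guess over two classes.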
|
||||
|
||||
|
||||
@add_start_docstrings(
|
||||
"""
|
||||
Electra model with a binary classification head on top as used during pre-training for identifying generated
|
||||
|
||||
709
src/transformers/modeling_longformer.py
Normal file
@@ -0,0 +1,709 @@
|
||||
# coding=utf-8
|
||||
# Copyright 2020 The Allen Institute for AI team and The HuggingFace Inc. team.
|
||||
#
|
||||
# Licensed under the Apache License, Version 2.0 (the "License");
|
||||
# you may not use this file except in compliance with the License.
|
||||
# You may obtain a copy of the License at
|
||||
#
|
||||
# http://www.apache.org/licenses/LICENSE-2.0
|
||||
#
|
||||
# Unless required by applicable law or agreed to in writing, software
|
||||
# distributed under the License is distributed on an "AS IS" BASIS,
|
||||
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
|
||||
# See the License for the specific language governing permissions and
|
||||
# limitations under the License.
|
||||
"""PyTorch Longformer model. """
|
||||
|
||||
import logging
|
||||
import math
|
||||
|
||||
import torch
|
||||
import torch.nn as nn
|
||||
from torch.nn import CrossEntropyLoss
|
||||
from torch.nn import functional as F
|
||||
|
||||
from .configuration_longformer import LongformerConfig
|
||||
from .file_utils import add_start_docstrings, add_start_docstrings_to_callable
|
||||
from .modeling_bert import BertPreTrainedModel
|
||||
from .modeling_roberta import RobertaLMHead, RobertaModel
|
||||
|
||||
|
||||
logger = logging.getLogger(__name__)
|
||||
|
||||
LONGFORMER_PRETRAINED_MODEL_ARCHIVE_MAP = {
|
||||
"longformer-base-4096": "https://s3.amazonaws.com/models.huggingface.co/bert/allenai/longformer-base-4096/pytorch_model.bin",
|
||||
"longformer-large-4096": "https://s3.amazonaws.com/models.huggingface.co/bert/allenai/longformer-large-4096/pytorch_model.bin",
|
||||
}
|
||||
|
||||
|
||||
class LongformerSelfAttention(nn.Module):
|
||||
def __init__(self, config, layer_id):
|
||||
super().__init__()
|
||||
if config.hidden_size % config.num_attention_heads != 0:
|
||||
raise ValueError(
|
||||
"The hidden size (%d) is not a multiple of the number of attention "
|
||||
"heads (%d)" % (config.hidden_size, config.num_attention_heads)
|
||||
)
|
||||
self.output_attentions = config.output_attentions
|
||||
self.num_heads = config.num_attention_heads
|
||||
self.head_dim = int(config.hidden_size / config.num_attention_heads)
|
||||
self.embed_dim = config.hidden_size
|
||||
|
||||
self.query = nn.Linear(config.hidden_size, self.embed_dim)
|
||||
self.key = nn.Linear(config.hidden_size, self.embed_dim)
|
||||
self.value = nn.Linear(config.hidden_size, self.embed_dim)
|
||||
|
||||
# separate projection layers for tokens with global attention
|
||||
self.query_global = nn.Linear(config.hidden_size, self.embed_dim)
|
||||
self.key_global = nn.Linear(config.hidden_size, self.embed_dim)
|
||||
self.value_global = nn.Linear(config.hidden_size, self.embed_dim)
|
||||
|
||||
self.dropout = config.attention_probs_dropout_prob
|
||||
|
||||
self.layer_id = layer_id
|
||||
attention_window = config.attention_window[self.layer_id]
|
||||
assert (
|
||||
attention_window % 2 == 0
|
||||
), f"`attention_window` for layer {self.layer_id} has to be an even value. Given {attention_window}"
|
||||
assert (
|
||||
attention_window > 0
|
||||
), f"`attention_window` for layer {self.layer_id} has to be positive. Given {attention_window}"
|
||||
|
||||
self.one_sided_attention_window_size = attention_window // 2
|
||||
|
||||
@staticmethod
|
||||
def _skew(x, direction):
|
||||
"""Convert diagonals into columns (or columns into diagonals depending on `direction`)"""
|
||||
x_padded = F.pad(x, direction) # padding value is not important because it will be overwritten
|
||||
x_padded = x_padded.view(*x_padded.size()[:-2], x_padded.size(-1), x_padded.size(-2))
|
||||
return x_padded
|
||||
|
||||
@staticmethod
|
||||
def _skew2(x):
|
||||
"""shift every row 1 step to right converting columns into diagonals"""
|
||||
# X = B x C x M x L
|
||||
B, C, M, L = x.size()
|
||||
x = F.pad(x, (0, M + 1)) # B x C x M x (L+M+1). Padding value is not important because it'll be overwritten
|
||||
x = x.view(B, C, -1) # B x C x ML+MM+M
|
||||
x = x[:, :, :-M] # B x C x ML+MM
|
||||
x = x.view(B, C, M, M + L) # B x C, M x L+M
|
||||
x = x[:, :, :, :-1]
|
||||
return x
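The pad, flatten, trim and reshape sequence in `_skew2` is easier to see on a tiny matrix. This sketch reproduces the same steps on a 2D Python list (one `M x L` slice, ignoring the batch and chunk dimensions); `skew_rows` is a hypothetical name used only here.

```python
def skew_rows(rows):
    """Shift row i to the right by i positions using pad/flatten/trim/reshape."""
    m = len(rows)
    l = len(rows[0])

    # Pad every row with m + 1 filler values: M x (L + M + 1), then flatten.
    flat = []
    for row in rows:
        flat.extend(row + [0] * (m + 1))

    # Drop the last m values so the length becomes m * (m + l).
    flat = flat[:-m]

    # Reshape to M x (M + L): each row now starts one position later than the previous one.
    width = m + l
    reshaped = [flat[i * width:(i + 1) * width] for i in range(m)]

    # Drop the last column, as the original does with x[:, :, :, :-1].
    return [row[:-1] for row in reshaped]


skewed = skew_rows([[1, 2, 3], [4, 5, 6]])
skewed_tall = skew_rows([[1, 2], [3, 4], [5, 6]])
```

Because each padded row is one element longer than the reshaped row width, every subsequent row drifts one column to the right, which is what turns columns into diagonals.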
|
||||
|
||||
@staticmethod
|
||||
def _chunk(x, w):
|
||||
"""convert into overlapping chunks. Chunk size = 2w, overlap size = w"""
|
||||
|
||||
# non-overlapping chunks of size = 2w
|
||||
x = x.view(x.size(0), x.size(1) // (w * 2), w * 2, x.size(2))
|
||||
|
||||
# use `as_strided` to make the chunks overlap with an overlap size = w
|
||||
chunk_size = list(x.size())
|
||||
chunk_size[1] = chunk_size[1] * 2 - 1
|
||||
|
||||
chunk_stride = list(x.stride())
|
||||
chunk_stride[1] = chunk_stride[1] // 2
|
||||
return x.as_strided(size=chunk_size, stride=chunk_stride)
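`_chunk` builds its overlapping view with `as_strided`, which avoids copying. The result can be described more plainly as windows of size `2w` taken every `w` positions. The sketch below produces the same chunk layout on a flat Python list by slicing (so it copies, unlike the strided view); `overlapping_chunks` is a hypothetical helper.

```python
def overlapping_chunks(seq, w):
    """Windows of size 2w with stride w, i.e. consecutive windows overlap by w."""
    if len(seq) % (2 * w) != 0:
        raise ValueError("sequence length must be a multiple of 2 * w")
    return [seq[start:start + 2 * w] for start in range(0, len(seq) - w, w)]


chunks = overlapping_chunks(list(range(8)), w=2)
```

A sequence of length `n` yields `n // w - 1` chunks, which matches `chunks_count + 1` in the matmul helpers further down.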
|
||||
|
||||
def _mask_invalid_locations(self, input_tensor, w) -> torch.Tensor:
|
||||
affected_seqlen = w
|
||||
beginning_mask_2d = input_tensor.new_ones(w, w + 1).tril().flip(dims=[0])
|
||||
beginning_mask = beginning_mask_2d[None, :, None, :]
|
||||
ending_mask = beginning_mask.flip(dims=(1, 3))
|
||||
seqlen = input_tensor.size(1)
|
||||
beginning_input = input_tensor[:, :affected_seqlen, :, : w + 1]
|
||||
beginning_mask = beginning_mask[:, :seqlen].expand(beginning_input.size())
|
||||
beginning_input.masked_fill_(beginning_mask == 1, -float("inf")) # `== 1` converts to bool or uint8
|
||||
ending_input = input_tensor[:, -affected_seqlen:, :, -(w + 1) :]
|
||||
ending_mask = ending_mask[:, -seqlen:].expand(ending_input.size())
|
||||
ending_input.masked_fill_(ending_mask == 1, -float("inf")) # `== 1` converts to bool or uint8
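The masks in `_mask_invalid_locations` mark window slots that fall outside the sequence: the first `w` tokens lack some left neighbours, the last `w` tokens lack some right neighbours. This sketch rebuilds the two 2D masks with list comprehensions instead of `tril` and `flip`; the function names are hypothetical.

```python
def beginning_mask(w):
    """Row i is token i; column j is window offset j - w. 1 marks a slot before the sequence start."""
    return [[1 if j < w - i else 0 for j in range(w + 1)] for i in range(w)]


def ending_mask(w):
    """Flip rows and columns of the beginning mask to cover the sequence end."""
    return [row[::-1] for row in reversed(beginning_mask(w))]


begin = beginning_mask(2)
end = ending_mask(2)
```

For `w = 2`, the first token has both left slots masked and the second token has one, and the mirror image holds at the end of the sequence.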
|
||||
|
||||
def _sliding_chunks_matmul_qk(self, q: torch.Tensor, k: torch.Tensor, w: int):
|
||||
"""Matrix multiplication of query x key tensors using a sliding window attention pattern.
|
||||
This implementation splits the input into overlapping chunks of size 2w (e.g. 512 for pretrained Longformer)
|
||||
with an overlap of size w"""
|
||||
batch_size, seqlen, num_heads, head_dim = q.size()
|
||||
assert seqlen % (w * 2) == 0, f"Sequence length should be multiple of {w * 2}. Given {seqlen}"
|
||||
assert q.size() == k.size()
|
||||
|
||||
chunks_count = seqlen // w - 1
|
||||
|
||||
# group batch_size and num_heads dimensions into one, then chunk seqlen into chunks of size w * 2
|
||||
q = q.transpose(1, 2).reshape(batch_size * num_heads, seqlen, head_dim)
|
||||
k = k.transpose(1, 2).reshape(batch_size * num_heads, seqlen, head_dim)
|
||||
|
||||
chunk_q = self._chunk(q, w)
|
||||
chunk_k = self._chunk(k, w)
|
||||
|
||||
# matrix multiplication
|
||||
# bcxd: batch_size * num_heads x chunks x 2w x head_dim
|
||||
# bcyd: batch_size * num_heads x chunks x 2w x head_dim
|
||||
# bcxy: batch_size * num_heads x chunks x 2w x 2w
|
||||
chunk_attn = torch.einsum("bcxd,bcyd->bcxy", (chunk_q, chunk_k)) # multiply
|
||||
|
||||
# convert diagonals into columns
|
||||
diagonal_chunk_attn = self._skew(chunk_attn, direction=(0, 0, 0, 1))
|
||||
|
||||
# allocate space for the overall attention matrix where the chunks are combined. The last dimension
|
||||
# has (w * 2 + 1) columns. The first (w) columns are the w lower triangles (attention from a word to
|
||||
# w previous words). The following column is attention score from each word to itself, then
|
||||
# followed by w columns for the upper triangle.
|
||||
|
||||
diagonal_attn = diagonal_chunk_attn.new_empty((batch_size * num_heads, chunks_count + 1, w, w * 2 + 1))
|
||||
|
||||
# copy parts from diagonal_chunk_attn into the combined matrix of attentions
|
||||
# - copying the main diagonal and the upper triangle
|
||||
diagonal_attn[:, :-1, :, w:] = diagonal_chunk_attn[:, :, :w, : w + 1]
|
||||
diagonal_attn[:, -1, :, w:] = diagonal_chunk_attn[:, -1, w:, : w + 1]
|
||||
# - copying the lower triangle
|
||||
diagonal_attn[:, 1:, :, :w] = diagonal_chunk_attn[:, :, -(w + 1) : -1, w + 1 :]
|
||||
diagonal_attn[:, 0, 1:w, 1:w] = diagonal_chunk_attn[:, 0, : w - 1, 1 - w :]
|
||||
|
||||
# separate batch_size and num_heads dimensions again
|
||||
diagonal_attn = diagonal_attn.view(batch_size, num_heads, seqlen, 2 * w + 1).transpose(2, 1)
|
||||
|
||||
self._mask_invalid_locations(diagonal_attn, w)
|
||||
return diagonal_attn
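As a mental model for the output layout of `_sliding_chunks_matmul_qk`, it can help to compare against a naive reference. The sketch below is not the chunked implementation above: it is a plain double loop for a single head, written under the assumption (taken from the comments in the function) that the last dimension holds `w` previous positions, the token itself, then `w` following positions, with out-of-range slots set to `-inf`.

```python
def banded_scores(q, k, w):
    """Naive sliding-window scores: out[i][j] = q[i] . k[i - w + j], or -inf if out of range."""
    n = len(q)
    out = []
    for i in range(n):
        row = []
        for offset in range(-w, w + 1):
            j = i + offset
            if 0 <= j < n:
                row.append(sum(a * b for a, b in zip(q[i], k[j])))
            else:
                row.append(float("-inf"))
        out.append(row)
    return out


vectors = [[1.0], [2.0], [3.0], [4.0]]   # four tokens, head_dim = 1
scores = banded_scores(vectors, vectors, w=1)
```

This reference costs `O(n * w)` dot products, the same asymptotic work as the chunked version, but it runs one token at a time rather than as batched matrix multiplications.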
|
||||
|
||||
def _sliding_chunks_matmul_pv(self, prob: torch.Tensor, v: torch.Tensor, w: int):
|
||||
"""Same as _sliding_chunks_matmul_qk but for prob and value tensors. It is expecting the same output
|
||||
format from _sliding_chunks_matmul_qk"""
|
||||
batch_size, seqlen, num_heads, head_dim = v.size()
|
||||
assert seqlen % (w * 2) == 0
|
||||
assert prob.size()[:3] == v.size()[:3]
|
||||
assert prob.size(3) == 2 * w + 1
|
||||
chunks_count = seqlen // w - 1
|
||||
# group batch_size and num_heads dimensions into one, then chunk seqlen into chunks of size 2w
|
||||
chunk_prob = prob.transpose(1, 2).reshape(batch_size * num_heads, seqlen // w, w, 2 * w + 1)
|
||||
|
||||
# group batch_size and num_heads dimensions into one
|
||||
v = v.transpose(1, 2).reshape(batch_size * num_heads, seqlen, head_dim)
|
||||
|
||||
# pad seqlen with w at the beginning of the sequence and another w at the end
|
||||
padded_v = F.pad(v, (0, 0, w, w), value=-1)
|
||||
|
||||
# chunk padded_v into chunks of size 3w and an overlap of size w
|
||||
chunk_v_size = (batch_size * num_heads, chunks_count + 1, 3 * w, head_dim)
|
||||
chunk_v_stride = padded_v.stride()
|
||||
chunk_v_stride = chunk_v_stride[0], w * chunk_v_stride[1], chunk_v_stride[1], chunk_v_stride[2]
|
||||
chunk_v = padded_v.as_strided(size=chunk_v_size, stride=chunk_v_stride)
|
||||
|
||||
skewed_prob = self._skew2(chunk_prob)
|
||||
|
||||
context = torch.einsum("bcwd,bcdh->bcwh", (skewed_prob, chunk_v))
|
||||
return context.view(batch_size, num_heads, seqlen, head_dim).transpose(1, 2)
|
||||
|
||||
def forward(
|
||||
self,
|
||||
hidden_states,
|
||||
attention_mask=None,
|
||||
head_mask=None,
|
||||
encoder_hidden_states=None,
|
||||
encoder_attention_mask=None,
|
||||
):
|
||||
"""
|
||||
LongformerSelfAttention expects `len(hidden_states)` to be a multiple of `attention_window`.
|
||||
Padding to `attention_window` happens in LongformerModel.forward to avoid redoing the padding on each layer.
|
||||
|
||||
The `attention_mask` is changed in `BertModel.forward` from 0, 1, 2 to
|
||||
-ve: no attention
|
||||
0: local attention
|
||||
+ve: global attention
|
||||
|
||||
`encoder_hidden_states` and `encoder_attention_mask` are not supported and should be None
|
||||
"""
|
||||
# TODO: add support for `encoder_hidden_states` and `encoder_attention_mask`
|
||||
assert encoder_hidden_states is None, "`encoder_hidden_states` is not supported and should be None"
|
||||
assert encoder_attention_mask is None, "`encoder_attention_mask` is not supported and should be None"
|
||||
|
||||
if attention_mask is not None:
|
||||
attention_mask = attention_mask.squeeze(dim=2).squeeze(dim=1)
|
||||
key_padding_mask = attention_mask < 0
|
||||
extra_attention_mask = attention_mask > 0
|
||||
remove_from_windowed_attention_mask = attention_mask != 0
|
||||
|
||||
num_extra_indices_per_batch = extra_attention_mask.long().sum(dim=1)
|
||||
max_num_extra_indices_per_batch = num_extra_indices_per_batch.max()
|
||||
if max_num_extra_indices_per_batch <= 0:
|
||||
extra_attention_mask = None
|
||||
else:
|
||||
# To support the case of variable number of global attention in the rows of a batch,
|
||||
# we use the following three selection masks to select global attention embeddings
|
||||
# in a 3d tensor and pad it to `max_num_extra_indices_per_batch`
|
||||
# 1) selecting embeddings that correspond to global attention
|
||||
extra_attention_mask_nonzeros = extra_attention_mask.nonzero(as_tuple=True)
|
||||
zero_to_max_range = torch.arange(
|
||||
0, max_num_extra_indices_per_batch, device=num_extra_indices_per_batch.device
|
||||
)
|
||||
# mask indicating which values are actually going to be padding
|
||||
selection_padding_mask = zero_to_max_range < num_extra_indices_per_batch.unsqueeze(dim=-1)
|
||||
# 2) location of the non-padding values in the selected global attention
|
||||
selection_padding_mask_nonzeros = selection_padding_mask.nonzero(as_tuple=True)
|
||||
# 3) location of the padding values in the selected global attention
|
||||
selection_padding_mask_zeros = (selection_padding_mask == 0).nonzero(as_tuple=True)
|
||||
else:
|
||||
remove_from_windowed_attention_mask = None
|
||||
extra_attention_mask = None
|
||||
key_padding_mask = None
|
||||
|
||||
hidden_states = hidden_states.transpose(0, 1)
|
||||
seqlen, batch_size, embed_dim = hidden_states.size()
|
||||
assert embed_dim == self.embed_dim
|
||||
q = self.query(hidden_states)
|
||||
k = self.key(hidden_states)
|
||||
v = self.value(hidden_states)
|
||||
q /= math.sqrt(self.head_dim)
|
||||
|
||||
q = q.view(seqlen, batch_size, self.num_heads, self.head_dim).transpose(0, 1)
|
||||
k = k.view(seqlen, batch_size, self.num_heads, self.head_dim).transpose(0, 1)
|
||||
# attn_weights = (batch_size, seqlen, num_heads, window*2+1)
|
||||
attn_weights = self._sliding_chunks_matmul_qk(q, k, self.one_sided_attention_window_size)
|
||||
self._mask_invalid_locations(attn_weights, self.one_sided_attention_window_size)
|
||||
if remove_from_windowed_attention_mask is not None:
|
||||
# This implementation is fast and takes very little memory because num_heads x hidden_size = 1
|
||||
# from (batch_size x seqlen) to (batch_size x seqlen x num_heads x hidden_size)
|
||||
remove_from_windowed_attention_mask = remove_from_windowed_attention_mask.unsqueeze(dim=-1).unsqueeze(
|
||||
dim=-1
|
||||
)
|
||||
# cast to fp32/fp16 then replace 1's with a large negative value (-10000.0)
|
||||
float_mask = remove_from_windowed_attention_mask.type_as(q).masked_fill(
|
||||
remove_from_windowed_attention_mask, -10000.0
|
||||
)
|
||||
ones = float_mask.new_ones(size=float_mask.size()) # tensor of ones
|
||||
# diagonal mask with zeros everywhere and -inf in place of padding
|
||||
d_mask = self._sliding_chunks_matmul_qk(ones, float_mask, self.one_sided_attention_window_size)
|
||||
attn_weights += d_mask
|
||||
assert list(attn_weights.size()) == [
|
||||
batch_size,
|
||||
seqlen,
|
||||
self.num_heads,
|
||||
self.one_sided_attention_window_size * 2 + 1,
|
||||
]
|
||||
|
||||
# the extra attention
|
||||
if extra_attention_mask is not None:
|
||||
selected_k = k.new_zeros(batch_size, max_num_extra_indices_per_batch, self.num_heads, self.head_dim)
|
||||
selected_k[selection_padding_mask_nonzeros] = k[extra_attention_mask_nonzeros]
|
||||
# (batch_size, seqlen, num_heads, max_num_extra_indices_per_batch)
|
||||
selected_attn_weights = torch.einsum("blhd,bshd->blhs", (q, selected_k))
|
||||
selected_attn_weights[selection_padding_mask_zeros[0], :, :, selection_padding_mask_zeros[1]] = -10000
|
||||
# concat to attn_weights
|
||||
# (batch_size, seqlen, num_heads, extra attention count + 2*window+1)
|
||||
attn_weights = torch.cat((selected_attn_weights, attn_weights), dim=-1)
|
||||
|
||||
attn_weights_fp32 = F.softmax(attn_weights, dim=-1, dtype=torch.float32) # use fp32 for numerical stability
|
||||
attn_weights = attn_weights_fp32.type_as(attn_weights)
|
||||
|
||||
if key_padding_mask is not None:
|
||||
# softmax sometimes inserts NaN if all positions are masked, replace them with 0
|
||||
attn_weights = torch.masked_fill(attn_weights, key_padding_mask.unsqueeze(-1).unsqueeze(-1), 0.0)
|
||||
|
||||
attn_probs = F.dropout(attn_weights, p=self.dropout, training=self.training)
|
||||
v = v.view(seqlen, batch_size, self.num_heads, self.head_dim).transpose(0, 1)
|
||||
attn = None
|
||||
if extra_attention_mask is not None:
|
||||
selected_attn_probs = attn_probs.narrow(-1, 0, max_num_extra_indices_per_batch)
|
||||
selected_v = v.new_zeros(batch_size, max_num_extra_indices_per_batch, self.num_heads, self.head_dim)
|
||||
selected_v[selection_padding_mask_nonzeros] = v[extra_attention_mask_nonzeros]
|
||||
# use `matmul` because `einsum` crashes sometimes with fp16
|
||||
# attn = torch.einsum('blhs,bshd->blhd', (selected_attn_probs, selected_v))
|
||||
attn = torch.matmul(
|
||||
selected_attn_probs.transpose(1, 2), selected_v.transpose(1, 2).type_as(selected_attn_probs)
|
||||
).transpose(1, 2)
|
||||
attn_probs = attn_probs.narrow(
|
||||
-1, max_num_extra_indices_per_batch, attn_probs.size(-1) - max_num_extra_indices_per_batch
|
||||
).contiguous()
|
||||
if attn is None:
|
||||
attn = self._sliding_chunks_matmul_pv(attn_probs, v, self.one_sided_attention_window_size)
|
||||
else:
|
||||
attn += self._sliding_chunks_matmul_pv(attn_probs, v, self.one_sided_attention_window_size)
|
||||
|
||||
assert attn.size() == (batch_size, seqlen, self.num_heads, self.head_dim), "Unexpected size"
|
||||
attn = attn.transpose(0, 1).reshape(seqlen, batch_size, embed_dim).contiguous()
|
||||
|
||||
# For this case, we'll just recompute the attention for these indices
|
||||
# and overwrite the attn tensor.
|
||||
# TODO: remove the redundant computation
|
||||
if extra_attention_mask is not None:
|
||||
selected_hidden_states = hidden_states.new_zeros(max_num_extra_indices_per_batch, batch_size, embed_dim)
|
||||
selected_hidden_states[selection_padding_mask_nonzeros[::-1]] = hidden_states[
|
||||
extra_attention_mask_nonzeros[::-1]
|
||||
]
|
||||
|
||||
q = self.query_global(selected_hidden_states)
|
||||
k = self.key_global(hidden_states)
|
||||
v = self.value_global(hidden_states)
|
||||
q /= math.sqrt(self.head_dim)
|
||||
|
||||
q = (
|
||||
q.contiguous()
|
||||
.view(max_num_extra_indices_per_batch, batch_size * self.num_heads, self.head_dim)
|
||||
.transpose(0, 1)
|
||||
) # (batch_size * self.num_heads, max_num_extra_indices_per_batch, head_dim)
|
||||
k = (
|
||||
k.contiguous().view(-1, batch_size * self.num_heads, self.head_dim).transpose(0, 1)
|
||||
) # (batch_size * self.num_heads, seqlen, head_dim)
|
||||
v = (
|
||||
v.contiguous().view(-1, batch_size * self.num_heads, self.head_dim).transpose(0, 1)
|
||||
) # (batch_size * self.num_heads, seqlen, head_dim)
|
||||
attn_weights = torch.bmm(q, k.transpose(1, 2))
|
||||
assert list(attn_weights.size()) == [batch_size * self.num_heads, max_num_extra_indices_per_batch, seqlen]
|
||||
|
||||
attn_weights = attn_weights.view(batch_size, self.num_heads, max_num_extra_indices_per_batch, seqlen)
|
||||
attn_weights[selection_padding_mask_zeros[0], :, selection_padding_mask_zeros[1], :] = -10000.0
|
||||
if key_padding_mask is not None:
|
||||
attn_weights = attn_weights.masked_fill(key_padding_mask.unsqueeze(1).unsqueeze(2), -10000.0,)
|
||||
attn_weights = attn_weights.view(batch_size * self.num_heads, max_num_extra_indices_per_batch, seqlen)
|
||||
attn_weights_float = F.softmax(
|
||||
attn_weights, dim=-1, dtype=torch.float32
|
||||
) # use fp32 for numerical stability
|
||||
attn_probs = F.dropout(attn_weights_float.type_as(attn_weights), p=self.dropout, training=self.training)
|
||||
selected_attn = torch.bmm(attn_probs, v)
|
||||
assert list(selected_attn.size()) == [
|
||||
batch_size * self.num_heads,
|
||||
max_num_extra_indices_per_batch,
|
||||
self.head_dim,
|
||||
]
|
||||
|
||||
selected_attn_4d = selected_attn.view(
|
||||
batch_size, self.num_heads, max_num_extra_indices_per_batch, self.head_dim
|
||||
)
|
||||
nonzero_selected_attn = selected_attn_4d[
|
||||
selection_padding_mask_nonzeros[0], :, selection_padding_mask_nonzeros[1]
|
||||
]
|
||||
attn[extra_attention_mask_nonzeros[::-1]] = nonzero_selected_attn.view(
|
||||
len(selection_padding_mask_nonzeros[0]), -1
|
||||
).type_as(hidden_states)
|
||||
|
||||
context_layer = attn.transpose(0, 1)
|
||||
if self.output_attentions:
|
||||
if extra_attention_mask is not None:
|
||||
# With global attention, return global attention probabilities only
|
||||
# batch_size x num_heads x max_num_global_attention_tokens x sequence_length
|
||||
# which is the attention weights from tokens with global attention to all tokens
|
||||
# It does not return local attention
|
||||
# In case of a variable number of global attention tokens in the rows of a batch,
|
||||
# attn_weights are padded with -10000.0 attention scores
|
||||
attn_weights = attn_weights.view(batch_size, self.num_heads, max_num_extra_indices_per_batch, seqlen)
|
||||
else:
|
||||
# without global attention, return local attention probabilities
|
||||
# batch_size x num_heads x sequence_length x window_size
|
||||
# which is the attention weights of every token attending to its neighbours
|
||||
attn_weights = attn_weights.permute(0, 2, 1, 3)
|
||||
outputs = (context_layer, attn_weights) if self.output_attentions else (context_layer,)
|
||||
return outputs
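The additive masking used in the attention code above (adding `-10000.0` to masked scores before the softmax) can be illustrated with a minimal pure-Python sketch. The helper name `masked_softmax` is hypothetical and not part of the library; the code in the diff does the same thing on tensors and runs the softmax in fp32.

```python
import math

MASK_VALUE = -10000.0  # same large negative constant the diff adds to masked scores


def masked_softmax(scores, masked):
    """Softmax over `scores`, with positions flagged in `masked` suppressed.

    Hypothetical helper for illustration only; it works on plain Python lists.
    """
    # Add the large negative value to every masked position.
    shifted = [s + MASK_VALUE if m else s for s, m in zip(scores, masked)]
    # Subtract the maximum for numerical stability before exponentiating.
    top = max(shifted)
    exps = [math.exp(s - top) for s in shifted]
    total = sum(exps)
    return [e / total for e in exps]


# The third position is masked, so its probability collapses to zero.
probs = masked_softmax([1.0, 2.0, 3.0], [False, False, True])
```

In this sketch a fully masked row still yields finite values because the constant is finite; the tensor code in the diff additionally zeroes out padded rows after the softmax.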
|
||||
|
||||
|
||||
LONGFORMER_START_DOCSTRING = r"""
|
||||
|
||||
This model is a PyTorch `torch.nn.Module <https://pytorch.org/docs/stable/nn.html#torch.nn.Module>`_ sub-class.
|
||||
Use it as a regular PyTorch Module and refer to the PyTorch documentation for all matter related to general
|
||||
usage and behavior.
|
||||
|
||||
Parameters:
|
||||
config (:class:`~transformers.LongformerConfig`): Model configuration class with all the parameters of the
|
||||
model. Initializing with a config file does not load the weights associated with the model, only the configuration.
|
||||
Check out the :meth:`~transformers.PreTrainedModel.from_pretrained` method to load the model weights.
|
||||
"""
|
||||
|
||||
LONGFORMER_INPUTS_DOCSTRING = r"""
|
||||
Args:
|
||||
input_ids (:obj:`torch.LongTensor` of shape :obj:`(batch_size, sequence_length)`):
|
||||
Indices of input sequence tokens in the vocabulary.
|
||||
|
||||
Indices can be obtained using :class:`transformers.LongformerTokenizer`.
|
||||
See :func:`transformers.PreTrainedTokenizer.encode` and
|
||||
:func:`transformers.PreTrainedTokenizer.encode_plus` for details.
|
||||
|
||||
`What are input IDs? <../glossary.html#input-ids>`__
|
||||
attention_mask (:obj:`torch.FloatTensor` of shape :obj:`(batch_size, sequence_length)`, `optional`, defaults to :obj:`None`):
|
||||
Mask to decide the attention given to each token: local attention, global attention, or no attention (for padding tokens).
|
||||
Tokens with global attention attend to all other tokens, and all other tokens attend to them. This is important for
|
||||
task-specific finetuning because it makes the model more flexible at representing the task. For example,
|
||||
for classification, the <s> token should be given global attention. For QA, all question tokens should also have
|
||||
global attention. Please refer to the Longformer paper https://arxiv.org/abs/2004.05150 for more details.
|
||||
Mask values selected in ``[0, 1, 2]``:
|
||||
``0`` for no attention (padding tokens),
|
||||
``1`` for local attention (a sliding window attention),
|
||||
``2`` for global attention (tokens that attend to all other tokens, and all other tokens attend to them).
|
||||
|
||||
`What are attention masks? <../glossary.html#attention-mask>`__
|
||||
token_type_ids (:obj:`torch.LongTensor` of shape :obj:`(batch_size, sequence_length)`, `optional`, defaults to :obj:`None`):
|
||||
Segment token indices to indicate first and second portions of the inputs.
|
||||
Indices are selected in ``[0, 1]``: ``0`` corresponds to a `sentence A` token, ``1``
|
||||
corresponds to a `sentence B` token
|
||||
|
||||
`What are token type IDs? <../glossary.html#token-type-ids>`_
|
||||
position_ids (:obj:`torch.LongTensor` of shape :obj:`(batch_size, sequence_length)`, `optional`, defaults to :obj:`None`):
|
||||
Indices of positions of each input sequence tokens in the position embeddings.
|
||||
Selected in the range ``[0, config.max_position_embeddings - 1]``.
|
||||
|
||||
`What are position IDs? <../glossary.html#position-ids>`_
|
||||
inputs_embeds (:obj:`torch.FloatTensor` of shape :obj:`(batch_size, sequence_length, hidden_size)`, `optional`, defaults to :obj:`None`):
|
||||
Optionally, instead of passing :obj:`input_ids` you can choose to directly pass an embedded representation.
|
||||
This is useful if you want more control over how to convert `input_ids` indices into associated vectors
|
||||
than the model's internal embedding lookup matrix.
|
||||
"""
|
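The three mask values documented in `LONGFORMER_INPUTS_DOCSTRING` (`0` for padding, `1` for local attention, `2` for global attention) can be sketched for a single sequence as follows. The helper `build_attention_mask` and its parameters are hypothetical and only illustrate the convention; in practice the mask is a tensor of shape `(batch_size, sequence_length)`.

```python
NO_ATTENTION, LOCAL_ATTENTION, GLOBAL_ATTENTION = 0, 1, 2


def build_attention_mask(seq_len, num_real_tokens, global_positions):
    """Return a list of 0/1/2 mask values for one sequence (hypothetical helper)."""
    if not 0 <= num_real_tokens <= seq_len:
        raise ValueError("num_real_tokens must be between 0 and seq_len")
    # Real tokens default to local (sliding window) attention; the rest is padding.
    mask = [LOCAL_ATTENTION] * num_real_tokens + [NO_ATTENTION] * (seq_len - num_real_tokens)
    for pos in global_positions:
        if pos < 0 or pos >= num_real_tokens:
            raise ValueError(f"position {pos} is not a real token")
        mask[pos] = GLOBAL_ATTENTION
    return mask


# Classification-style setup: global attention on the first (<s>) token only.
mask = build_attention_mask(seq_len=8, num_real_tokens=6, global_positions=[0])
```

For question answering, the docstring suggests passing the positions of all question tokens as `global_positions` instead.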
||||
|
||||
|
||||
@add_start_docstrings(
|
||||
"The bare Longformer Model outputting raw hidden-states without any specific head on top.",
|
||||
LONGFORMER_START_DOCSTRING,
|
||||
)
|
||||
class LongformerModel(RobertaModel):
|
||||
"""
|
||||
This class overrides :class:`~transformers.RobertaModel` to provide the ability to process
|
||||
long sequences following the selfattention approach described in `Longformer: the Long-Document Transformer`_ by
|
||||
Iz Beltagy, Matthew E. Peters, and Arman Cohan. Longformer selfattention combines a local (sliding window)
|
||||
and global attention to extend to long documents without the O(n^2) increase in memory and compute.
|
||||
|
||||
The selfattention module `LongformerSelfAttention` implemented here supports the combination of local and
|
||||
global attention but it lacks support for autoregressive attention and dilated attention. Autoregressive
|
||||
and dilated attention are more relevant for autoregressive language modeling than finetuning on downstream
|
||||
tasks. A future release will add support for autoregressive attention, but the support for dilated attention
|
||||
requires a custom CUDA kernel to be memory and compute efficient.
|
||||
|
||||
.. _`Longformer: the Long-Document Transformer`:
|
||||
https://arxiv.org/abs/2004.05150
|
||||
|
||||
"""
|
||||
|
||||
config_class = LongformerConfig
|
||||
pretrained_model_archive_map = LONGFORMER_PRETRAINED_MODEL_ARCHIVE_MAP
|
||||
base_model_prefix = "longformer"
|
||||
|
||||
def __init__(self, config):
|
||||
super().__init__(config)
|
||||
|
||||
if isinstance(config.attention_window, int):
|
||||
assert config.attention_window % 2 == 0, "`config.attention_window` has to be an even value"
|
||||
assert config.attention_window > 0, "`config.attention_window` has to be positive"
|
||||
config.attention_window = [config.attention_window] * config.num_hidden_layers # one value per layer
|
||||
else:
|
||||
assert len(config.attention_window) == config.num_hidden_layers, (
|
||||
"`len(config.attention_window)` should equal `config.num_hidden_layers`. "
|
||||
f"Expected {config.num_hidden_layers}, given {len(config.attention_window)}"
|
||||
)
|
||||
|
||||
for i, layer in enumerate(self.encoder.layer):
|
||||
# replace the `modeling_bert.BertSelfAttention` object with `LongformerSelfAttention`
|
||||
layer.attention.self = LongformerSelfAttention(config, layer_id=i)
|
||||
|
||||
self.init_weights()
|
||||
|
||||
def _pad_to_window_size(
|
||||
self,
|
||||
input_ids: torch.Tensor,
|
||||
attention_mask: torch.Tensor,
|
||||
token_type_ids: torch.Tensor,
|
||||
position_ids: torch.Tensor,
|
||||
inputs_embeds: torch.Tensor,
|
||||
attention_window: int,
|
||||
pad_token_id: int,
|
||||
):
|
||||
"""A helper function to pad tokens and mask to work with implementation of Longformer selfattention."""
|
||||
|
||||
assert attention_window % 2 == 0, f"`attention_window` should be an even value. Given {attention_window}"
|
||||
input_shape = input_ids.shape if input_ids is not None else inputs_embeds.shape
|
||||
batch_size, seqlen = input_shape[:2]
|
||||
|
||||
padding_len = (attention_window - seqlen % attention_window) % attention_window
|
||||
if padding_len > 0:
|
||||
logger.info(
|
||||
"Input ids are automatically padded from {} to {} to be a multiple of `config.attention_window`: {}".format(
|
||||
seqlen, seqlen + padding_len, attention_window
|
||||
)
|
||||
)
|
||||
if input_ids is not None:
|
||||
input_ids = F.pad(input_ids, (0, padding_len), value=pad_token_id)
|
||||
if attention_mask is not None:
|
||||
attention_mask = F.pad(
|
||||
attention_mask, (0, padding_len), value=False
|
||||
) # no attention on the padding tokens
|
||||
if token_type_ids is not None:
|
||||
token_type_ids = F.pad(token_type_ids, (0, padding_len), value=0) # pad with token_type_id = 0
|
||||
if position_ids is not None:
|
||||
# pad with position_id = pad_token_id as in modeling_roberta.RobertaEmbeddings
|
||||
position_ids = F.pad(position_ids, (0, padding_len), value=pad_token_id)
|
||||
if inputs_embeds is not None:
|
||||
input_ids_padding = inputs_embeds.new_full(
|
||||
(batch_size, padding_len), self.config.pad_token_id, dtype=torch.long,
|
||||
)
|
||||
inputs_embeds_padding = self.embeddings(input_ids_padding)
|
||||
inputs_embeds = torch.cat([inputs_embeds, inputs_embeds_padding], dim=-2)
|
||||
|
||||
return padding_len, input_ids, attention_mask, token_type_ids, position_ids, inputs_embeds
|
||||
|
||||
@add_start_docstrings_to_callable(LONGFORMER_INPUTS_DOCSTRING)
|
||||
def forward(
|
||||
self,
|
||||
input_ids=None,
|
||||
attention_mask=None,
|
||||
token_type_ids=None,
|
||||
position_ids=None,
|
||||
inputs_embeds=None,
|
||||
masked_lm_labels=None,
|
||||
):
|
||||
r"""
|
||||
|
||||
Returns:
|
||||
:obj:`tuple(torch.FloatTensor)` comprising various elements depending on the configuration (:class:`~transformers.RobertaConfig`) and inputs:
|
||||
sequence_output (:obj:`torch.FloatTensor` of shape :obj:`(batch_size, sequence_length, hidden_size)`):
|
||||
Sequence of hidden-states at the output of the last layer of the model, with any automatically added padding removed.
|
||||
pooled_output (:obj:`torch.FloatTensor` of shape :obj:`(batch_size, hidden_size)`):
|
||||
Pooled output of the model; it is independent of the sequence length.
|
||||
hidden_states (:obj:`tuple(torch.FloatTensor)`, `optional`, returned when ``config.output_hidden_states=True``):
|
||||
Tuple of :obj:`torch.FloatTensor` (one for the output of the embeddings + one for the output of each layer)
|
||||
of shape :obj:`(batch_size, sequence_length, hidden_size)`.
|
||||
|
||||
Hidden-states of the model at the output of each layer plus the initial embedding outputs.
|
||||
attentions (:obj:`tuple(torch.FloatTensor)`, `optional`, returned when ``config.output_attentions=True``):
|
||||
Tuple of :obj:`torch.FloatTensor` (one for each layer) of shape
|
||||
:obj:`(batch_size, num_heads, sequence_length, sequence_length)`.
|
||||
|
||||
Attention weights after the attention softmax, used to compute the weighted average in the self-attention
|
||||
heads.
|
||||
|
||||
Examples::
|
||||
|
||||
import torch
|
||||
from transformers import LongformerModel, LongformerTokenizer
|
||||
|
||||
model = LongformerModel.from_pretrained('longformer-base-4096')
|
||||
tokenizer = LongformerTokenizer.from_pretrained('longformer-base-4096')
|
||||
|
||||
SAMPLE_TEXT = ' '.join(['Hello world! '] * 1000) # long input document
|
||||
input_ids = torch.tensor(tokenizer.encode(SAMPLE_TEXT)).unsqueeze(0) # batch of size 1
|
||||
|
||||
# Attention mask values -- 0: no attention, 1: local attention, 2: global attention
|
||||
attention_mask = torch.ones(input_ids.shape, dtype=torch.long, device=input_ids.device) # initialize to local attention
|
||||
attention_mask[:, [1, 4, 21,]] = 2 # Set global attention based on the task. For example,
|
||||
# classification: the <s> token
|
||||
# QA: question tokens
|
||||
# LM: potentially on the beginning of sentences and paragraphs
|
||||
sequence_output, pooled_output = model(input_ids, attention_mask=attention_mask)
|
||||
"""
|
||||
|
||||
# padding
|
||||
attention_window = (
|
||||
self.config.attention_window
|
||||
if isinstance(self.config.attention_window, int)
|
||||
else max(self.config.attention_window)
|
||||
)
|
||||
padding_len, input_ids, attention_mask, token_type_ids, position_ids, inputs_embeds = self._pad_to_window_size(
|
||||
input_ids=input_ids,
|
||||
attention_mask=attention_mask,
|
||||
token_type_ids=token_type_ids,
|
||||
position_ids=position_ids,
|
||||
inputs_embeds=inputs_embeds,
|
||||
attention_window=attention_window,
|
||||
pad_token_id=self.config.pad_token_id,
|
||||
)
|
||||
|
||||
# embed
|
||||
output = super().forward(
|
||||
input_ids=input_ids,
|
||||
attention_mask=attention_mask,
|
||||
token_type_ids=token_type_ids,
|
||||
position_ids=position_ids,
|
||||
head_mask=None,
|
||||
inputs_embeds=inputs_embeds,
|
||||
encoder_hidden_states=None,
|
||||
encoder_attention_mask=None,
|
||||
)
|
||||
|
||||
# undo padding
|
||||
if padding_len > 0:
|
||||
# `output` has the following tensors: sequence_output, pooled_output, (hidden_states), (attentions)
|
||||
# `sequence_output`: unpad because the calling function is expecting a length == input_ids.size(1)
|
||||
# `pooled_output`: independent of the sequence length
|
||||
# `hidden_states`: mainly used for debugging and analysis, so keep the padding
|
||||
# `attentions`: mainly used for debugging and analysis, so keep the padding
|
||||
output = output[0][:, :-padding_len], *output[1:]
|
||||
|
||||
return output
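The padding arithmetic in `_pad_to_window_size` can be checked in isolation with a small pure-Python sketch. The helper names `padding_length` and `pad_ids` are hypothetical; only the formula is taken from the diff.

```python
def padding_length(seqlen, attention_window):
    """Number of pad tokens needed to reach a multiple of `attention_window`."""
    if attention_window <= 0 or attention_window % 2 != 0:
        raise ValueError("attention_window must be a positive even value")
    # Same formula as in the diff; the outer modulo maps a full window back to 0.
    return (attention_window - seqlen % attention_window) % attention_window


def pad_ids(input_ids, attention_window, pad_token_id):
    """Right-pad a list of token ids and report how many pads were added."""
    pad = padding_length(len(input_ids), attention_window)
    return input_ids + [pad_token_id] * pad, pad


# A 10-token input with a window of 4 is padded to 12 tokens.
padded, pad = pad_ids(list(range(10)), attention_window=4, pad_token_id=1)
```

The caller then slices the padding off again, as `forward` does with `output[0][:, :-padding_len]`, so downstream code sees the original length.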
|
||||
|
||||
|
||||
@add_start_docstrings("""Longformer Model with a `language modeling` head on top. """, LONGFORMER_START_DOCSTRING)
|
||||
class LongformerForMaskedLM(BertPreTrainedModel):
|
||||
config_class = LongformerConfig
|
||||
pretrained_model_archive_map = LONGFORMER_PRETRAINED_MODEL_ARCHIVE_MAP
|
||||
base_model_prefix = "longformer"
|
||||
|
||||
def __init__(self, config):
|
||||
super().__init__(config)
|
||||
|
||||
self.longformer = LongformerModel(config)
|
||||
self.lm_head = RobertaLMHead(config)
|
||||
|
||||
self.init_weights()
|
||||
|
||||
@add_start_docstrings_to_callable(LONGFORMER_INPUTS_DOCSTRING)
|
||||
def forward(
|
||||
self,
|
||||
input_ids=None,
|
||||
attention_mask=None,
|
||||
token_type_ids=None,
|
||||
position_ids=None,
|
||||
inputs_embeds=None,
|
||||
masked_lm_labels=None,
|
||||
):
|
||||
r"""
|
||||
masked_lm_labels (:obj:`torch.LongTensor` of shape :obj:`(batch_size, sequence_length)`, `optional`, defaults to :obj:`None`):
|
||||
Labels for computing the masked language modeling loss.
|
||||
Indices should be in ``[-100, 0, ..., config.vocab_size]`` (see ``input_ids`` docstring)
|
||||
Tokens with indices set to ``-100`` are ignored (masked), the loss is only computed for the tokens with labels
|
||||
in ``[0, ..., config.vocab_size]``
|
||||
|
||||
Returns:
|
||||
:obj:`tuple(torch.FloatTensor)` comprising various elements depending on the configuration (:class:`~transformers.RobertaConfig`) and inputs:
|
||||
masked_lm_loss (`optional`, returned when ``masked_lm_labels`` is provided) ``torch.FloatTensor`` of shape ``(1,)``:
|
||||
Masked language modeling loss.
|
||||
prediction_scores (:obj:`torch.FloatTensor` of shape :obj:`(batch_size, sequence_length, config.vocab_size)`)
|
||||
Prediction scores of the language modeling head (scores for each vocabulary token before SoftMax).
|
||||
hidden_states (:obj:`tuple(torch.FloatTensor)`, `optional`, returned when ``config.output_hidden_states=True``):
|
||||
Tuple of :obj:`torch.FloatTensor` (one for the output of the embeddings + one for the output of each layer)
|
||||
of shape :obj:`(batch_size, sequence_length, hidden_size)`.
|
||||
|
||||
Hidden-states of the model at the output of each layer plus the initial embedding outputs.
|
||||
attentions (:obj:`tuple(torch.FloatTensor)`, `optional`, returned when ``config.output_attentions=True``):
|
||||
Tuple of :obj:`torch.FloatTensor` (one for each layer) of shape
|
||||
:obj:`(batch_size, num_heads, sequence_length, sequence_length)`.
|
||||
|
||||
Attention weights after the attention softmax, used to compute the weighted average in the self-attention
|
||||
heads.
|
||||
|
||||
Examples::
|
||||
|
||||
import torch
|
||||
from transformers import LongformerForMaskedLM, LongformerTokenizer
|
||||
|
||||
model = LongformerForMaskedLM.from_pretrained('longformer-base-4096')
|
||||
tokenizer = LongformerTokenizer.from_pretrained('longformer-base-4096')
|
||||
|
||||
SAMPLE_TEXT = ' '.join(['Hello world! '] * 1000) # long input document
|
||||
input_ids = torch.tensor(tokenizer.encode(SAMPLE_TEXT)).unsqueeze(0) # batch of size 1
|
||||
|
||||
attention_mask = None # default is local attention everywhere, which is a good choice for MaskedLM
|
||||
# check ``LongformerModel.forward`` for more details on how to set `attention_mask`
|
||||
loss, prediction_scores = model(input_ids, attention_mask=attention_mask, masked_lm_labels=input_ids)
|
||||
"""
|
||||
|
||||
outputs = self.longformer(
|
||||
input_ids,
|
||||
attention_mask=attention_mask,
|
||||
token_type_ids=token_type_ids,
|
||||
position_ids=position_ids,
|
||||
inputs_embeds=inputs_embeds,
|
||||
)
|
||||
sequence_output = outputs[0]
|
||||
prediction_scores = self.lm_head(sequence_output)
|
||||
|
||||
outputs = (prediction_scores,) + outputs[2:] # Add hidden states and attention if they are here
|
||||
|
||||
if masked_lm_labels is not None:
|
||||
loss_fct = CrossEntropyLoss()
|
||||
masked_lm_loss = loss_fct(prediction_scores.view(-1, self.config.vocab_size), masked_lm_labels.view(-1))
|
||||
outputs = (masked_lm_loss,) + outputs
|
||||
|
||||
return outputs # (masked_lm_loss), prediction_scores, (hidden_states), (attentions)
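The `masked_lm_labels` docstring says that positions labelled `-100` are ignored by the loss. A minimal pure-Python sketch of that behaviour, assuming a mean over the non-ignored positions, is shown below. The function `masked_lm_loss` here is a hypothetical stand-in for `CrossEntropyLoss` applied to flattened logits, not the library implementation.

```python
import math

IGNORE_INDEX = -100  # label value that marks positions excluded from the loss


def masked_lm_loss(logits, labels, ignore_index=IGNORE_INDEX):
    """Mean cross-entropy over positions whose label is not `ignore_index`."""
    losses = []
    for row, label in zip(logits, labels):
        if label == ignore_index:
            continue  # skipped positions contribute nothing to the loss
        # log-sum-exp with the maximum subtracted for numerical stability
        top = max(row)
        log_norm = top + math.log(sum(math.exp(x - top) for x in row))
        losses.append(log_norm - row[label])
    if not losses:
        return float("nan")  # no position left to average over in this sketch
    return sum(losses) / len(losses)


# Two positions over a vocabulary of 4; the second position is ignored.
logits = [[0.0, 0.0, 0.0, 0.0], [5.0, 0.0, 0.0, 0.0]]
loss = masked_lm_loss(logits, labels=[2, IGNORE_INDEX])
```

With uniform logits over four tokens the remaining position contributes `log(4)`, so the ignored position has no effect on the result.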
|
||||
@@ -149,8 +149,12 @@ class T5LayerNorm(nn.Module):
|
||||
self.variance_epsilon = eps
|
||||
|
||||
def forward(self, x):
|
||||
variance = x.pow(2).mean(-1, keepdim=True)
|
||||
# layer norm should always be calculated in float32
|
||||
variance = x.to(torch.float32).pow(2).mean(-1, keepdim=True)
|
||||
x = x / torch.sqrt(variance + self.variance_epsilon)
|
||||
|
||||
if self.weight.dtype == torch.float16:
|
||||
x = x.to(torch.float16)
|
||||
return self.weight * x
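The reason for computing the variance in float32 can be sketched without any tensor library by emulating half precision with the standard `struct` module. All names below are hypothetical, and Python's double precision stands in for float32. The point is only that squaring moderately large activations overflows the half-precision range.

```python
import math
import struct


def to_half(x):
    """Round a Python float to the nearest IEEE half-precision value (inf on overflow)."""
    try:
        return struct.unpack("<e", struct.pack("<e", x))[0]
    except OverflowError:
        return math.copysign(math.inf, x)


def rms_scale_half(values, eps=1e-6):
    """1 / sqrt(mean(x^2) + eps) with intermediates rounded to half precision."""
    squares = [to_half(v * v) for v in values]
    variance = to_half(sum(squares) / len(squares))
    return to_half(1.0 / math.sqrt(variance + eps))


def rms_scale_float(values, eps=1e-6):
    """The same scale factor computed in full (double) precision."""
    variance = sum(v * v for v in values) / len(values)
    return 1.0 / math.sqrt(variance + eps)


# 300^2 = 90000 exceeds the largest finite half-precision value (65504).
activations = [300.0, -300.0]
half_scale = rms_scale_half(activations)    # variance overflows to inf, scale becomes 0
float_scale = rms_scale_float(activations)  # close to 1 / 300
```

In the half-precision path the scale collapses to zero and the normalized output would be all zeros, which is the kind of failure the float32 cast in the diff avoids.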
|
||||
|
||||
|
||||
@@ -691,14 +695,16 @@ class T5Stack(T5PreTrainedModel):
|
||||
attention_mask = torch.ones(batch_size, mask_seq_length).to(inputs_embeds.device)
|
||||
if self.is_decoder and encoder_attention_mask is None and encoder_hidden_states is not None:
|
||||
encoder_seq_length = encoder_hidden_states.shape[1]
|
||||
encoder_attention_mask = torch.ones(batch_size, encoder_seq_length).to(inputs_embeds.device)
|
||||
encoder_attention_mask = torch.ones(
|
||||
batch_size, encoder_seq_length, device=inputs_embeds.device, dtype=torch.long
|
||||
)
|
||||
|
||||
# initialize past_key_value_states with `None` if past does not exist
|
||||
if past_key_value_states is None:
|
||||
past_key_value_states = [None] * len(self.block)
|
||||
|
||||
# ourselves in which case we just need to make it broadcastable to all heads.
|
||||
extended_attention_mask = self.get_extended_attention_mask(attention_mask, input_shape, self.device)
|
||||
extended_attention_mask = self.get_extended_attention_mask(attention_mask, input_shape, inputs_embeds.device)
|
||||
|
||||
if self.is_decoder and encoder_attention_mask is not None:
|
||||
encoder_extended_attention_mask = self.invert_attention_mask(encoder_attention_mask)
|
||||
@@ -733,6 +739,7 @@ class T5Stack(T5PreTrainedModel):
|
||||
# layer_outputs is a tuple with:
|
||||
# hidden-states, key-value-states, (self-attention weights), (self-attention position bias), (cross-attention weights), (cross-attention position bias)
|
||||
hidden_states, present_key_value_state = layer_outputs[:2]
|
||||
|
||||
if i == 0:
|
||||
# We share the position biases between the layers - the first layer stores them
|
||||
# layer_outputs = hidden-states, key-value-states (self-attention weights), (self-attention position bias), (cross-attention weights), (cross-attention position bias)
|
||||
|
||||
@@ -537,7 +537,7 @@ class TFT5MainLayer(tf.keras.layers.Layer):
|
||||
|
||||
def call(
|
||||
self,
|
||||
input_ids,
|
||||
inputs,
|
||||
attention_mask=None,
|
||||
encoder_hidden_states=None,
|
||||
encoder_attention_mask=None,
|
||||
@@ -548,19 +548,19 @@ class TFT5MainLayer(tf.keras.layers.Layer):
|
||||
training=False,
|
||||
):
|
||||
|
||||
if input_ids is not None and inputs_embeds is not None:
|
||||
raise ValueError("You cannot specify both input_ids and inputs_embeds at the same time")
|
||||
elif input_ids is not None:
|
||||
input_shape = shape_list(input_ids)
|
||||
input_ids = tf.reshape(input_ids, (-1, input_shape[-1]))
|
||||
if inputs is not None and inputs_embeds is not None:
|
||||
raise ValueError("You cannot specify both inputs and inputs_embeds at the same time")
|
||||
elif inputs is not None:
|
||||
input_shape = shape_list(inputs)
|
||||
inputs = tf.reshape(inputs, (-1, input_shape[-1]))
|
||||
elif inputs_embeds is not None:
|
||||
input_shape = shape_list(inputs_embeds)[:-1]
|
||||
else:
|
||||
raise ValueError("You have to specify either input_ids or inputs_embeds")
|
||||
raise ValueError("You have to specify either inputs or inputs_embeds")
|
||||
|
||||
if inputs_embeds is None:
|
||||
assert self.embed_tokens is not None, "You have to initialize the model with valid token embeddings"
|
||||
inputs_embeds = self.embed_tokens(input_ids)
|
||||
inputs_embeds = self.embed_tokens(inputs)
|
||||
|
||||
batch_size, seq_length = input_shape
|
||||
|
||||
@@ -725,11 +725,11 @@ class TFT5PreTrainedModel(TFPreTrainedModel):
|
||||
|
||||
@property
|
||||
def dummy_inputs(self):
|
||||
input_ids = tf.constant(DUMMY_INPUTS)
|
||||
inputs = tf.constant(DUMMY_INPUTS)
|
||||
input_mask = tf.constant(DUMMY_MASK)
|
||||
dummy_inputs = {
|
||||
"inputs": input_ids,
|
||||
"decoder_input_ids": input_ids,
|
||||
"inputs": inputs,
|
||||
"decoder_input_ids": inputs,
|
||||
"decoder_attention_mask": input_mask,
|
||||
}
|
||||
return dummy_inputs
|
||||
@@ -759,11 +759,11 @@ T5_START_DOCSTRING = r""" The T5 model was proposed in
|
||||
|
||||
If you choose this second option, there are three possibilities you can use to gather all the input Tensors in the first positional argument :
|
||||
|
||||
- a single Tensor with input_ids only and nothing else: `model(inputs_ids)
|
||||
- a single Tensor with inputs only and nothing else: `model(inputs_ids)`
|
||||
- a list of varying length with one or several input Tensors IN THE ORDER given in the docstring:
|
||||
`model([input_ids, attention_mask])` or `model([input_ids, attention_mask, token_type_ids])`
|
||||
`model([inputs, attention_mask])` or `model([inputs, attention_mask, token_type_ids])`
|
||||
- a dictionary with one or several input Tensors associated to the input names given in the docstring:
|
||||
`model({'input_ids': input_ids, 'token_type_ids': token_type_ids})`
|
||||
`model({'inputs': inputs, 'token_type_ids': token_type_ids})`
|
||||
|
||||
Parameters:
|
||||
config (:class:`~transformers.T5Config`): Model configuration class with all the parameters of the model.
|
||||
@@ -780,7 +780,7 @@ T5_INPUTS_DOCSTRING = r"""
|
||||
T5 is a model with relative position embeddings so you should be able to pad the inputs on
|
||||
the right or the left.
|
||||
Indices can be obtained using :class:`transformers.T5Tokenizer`.
|
||||
To know more on how to prepare :obj:`input_ids` for pre-training take a look at
|
||||
To know more on how to prepare :obj:`inputs` for pre-training take a look at
|
||||
`T5 Training <./t5.html#training>`_ .
|
||||
See :func:`transformers.PreTrainedTokenizer.encode` and
|
||||
:func:`transformers.PreTrainedTokenizer.convert_tokens_to_ids` for details.
|
||||
@@ -805,8 +805,8 @@ T5_INPUTS_DOCSTRING = r"""
|
||||
use_cache (:obj:`bool`, `optional`, defaults to :obj:`True`):
|
||||
If `use_cache` is True, `decoder_past_key_value_states` are returned and can be used to speed up decoding (see `decoder_past_key_value_states`).
|
||||
inputs_embeds (:obj:`tf.Tensor` of shape :obj:`(batch_size, sequence_length, hidden_size)`, `optional`, defaults to :obj:`None`):
|
||||
Optionally, instead of passing :obj:`input_ids` you can choose to directly pass an embedded representation.
|
||||
This is useful if you want more control over how to convert `input_ids` indices into associated vectors
|
||||
Optionally, instead of passing :obj:`inputs` you can choose to directly pass an embedded representation.
|
||||
This is useful if you want more control over how to convert `inputs` indices into associated vectors
|
||||
than the model's internal embedding lookup matrix.
|
||||
decoder_inputs_embeds (:obj:`tf.Tensor` of shape :obj:`(batch_size, target_sequence_length, hidden_size)`, `optional`, defaults to :obj:`None`):
|
||||
Optionally, instead of passing :obj:`decoder_input_ids` you can choose to directly pass an embedded representation.
|
||||
@@ -885,8 +885,8 @@ class TFT5Model(TFT5PreTrainedModel):
|
||||
|
||||
tokenizer = T5Tokenizer.from_pretrained('t5-small')
|
||||
model = TFT5Model.from_pretrained('t5-small')
|
||||
input_ids = tokenizer.encode("Hello, my dog is cute", return_tensors="tf") # Batch size 1
|
||||
outputs = model(input_ids, decoder_input_ids=input_ids)
|
||||
inputs = tokenizer.encode("Hello, my dog is cute", return_tensors="tf") # Batch size 1
|
||||
outputs = model(inputs, decoder_input_ids=inputs)
|
||||
last_hidden_states = outputs[0] # The last hidden-state is the first element of the output tuple
|
||||
|
||||
"""
|
||||
@@ -897,7 +897,7 @@ class TFT5Model(TFT5PreTrainedModel):
|
||||
kwargs["inputs"] = inputs
|
||||
|
||||
# retrieve arguments
|
||||
input_ids = kwargs.get("inputs", None)
|
||||
inputs = kwargs.get("inputs", None)
|
||||
inputs_embeds = kwargs.get("inputs_embeds", None)
|
||||
attention_mask = kwargs.get("attention_mask", None)
|
||||
encoder_outputs = kwargs.get("encoder_outputs", None)
|
||||
@@ -911,7 +911,7 @@ class TFT5Model(TFT5PreTrainedModel):
|
||||
# Encode if needed (training, first prediction pass)
|
||||
if encoder_outputs is None:
|
||||
encoder_outputs = self.encoder(
|
||||
input_ids, attention_mask=attention_mask, inputs_embeds=inputs_embeds, head_mask=head_mask,
|
||||
inputs, attention_mask=attention_mask, inputs_embeds=inputs_embeds, head_mask=head_mask,
|
||||
)
|
||||
|
||||
hidden_states = encoder_outputs[0]
|
||||
@@ -1006,14 +1006,14 @@ class TFT5ForConditionalGeneration(TFT5PreTrainedModel):
|
||||
|
||||
tokenizer = T5Tokenizer.from_pretrained('t5-small')
|
||||
model = TFT5ForConditionalGeneration.from_pretrained('t5-small')
|
||||
input_ids = tokenizer.encode("Hello, my dog is cute", return_tensors="tf") # Batch size 1
|
||||
outputs = model(input_ids, decoder_input_ids=input_ids)
|
||||
inputs = tokenizer.encode("Hello, my dog is cute", return_tensors="tf") # Batch size 1
|
||||
outputs = model(inputs, decoder_input_ids=inputs)
|
||||
prediction_scores = outputs[0]
|
||||
|
||||
tokenizer = T5Tokenizer.from_pretrained('t5-small')
|
||||
model = TFT5ForConditionalGeneration.from_pretrained('t5-small')
|
||||
input_ids = tokenizer.encode("summarize: Hello, my dog is cute", return_tensors="tf") # Batch size 1
|
||||
model.generate(input_ids)
|
||||
inputs = tokenizer.encode("summarize: Hello, my dog is cute", return_tensors="tf") # Batch size 1
|
||||
model.generate(inputs)
|
||||
|
||||
"""
|
||||
|
||||
@@ -1023,7 +1023,7 @@ class TFT5ForConditionalGeneration(TFT5PreTrainedModel):
|
||||
kwargs["inputs"] = inputs
|
||||
|
||||
# retrieve arguments
|
||||
input_ids = kwargs.get("inputs", None)
|
||||
inputs = kwargs.get("inputs", None)
|
||||
decoder_input_ids = kwargs.get("decoder_input_ids", None)
|
||||
attention_mask = kwargs.get("attention_mask", None)
|
||||
encoder_outputs = kwargs.get("encoder_outputs", None)
|
||||
@@ -1038,7 +1038,7 @@ class TFT5ForConditionalGeneration(TFT5PreTrainedModel):
|
||||
if encoder_outputs is None:
|
||||
# Convert encoder inputs in embeddings if needed
|
||||
encoder_outputs = self.encoder(
|
||||
input_ids, attention_mask=attention_mask, inputs_embeds=inputs_embeds, head_mask=head_mask,
|
||||
inputs, attention_mask=attention_mask, inputs_embeds=inputs_embeds, head_mask=head_mask,
|
||||
)
|
||||
|
||||
hidden_states = encoder_outputs[0]
|
||||
@@ -1076,7 +1076,7 @@ class TFT5ForConditionalGeneration(TFT5PreTrainedModel):
|
||||
|
||||
return decoder_outputs + encoder_outputs
|
||||
|
||||
def prepare_inputs_for_generation(self, input_ids, past, attention_mask, use_cache, **kwargs):
|
||||
def prepare_inputs_for_generation(self, inputs, past, attention_mask, use_cache, **kwargs):
|
||||
assert past is not None, "past has to be defined for encoder_outputs"
|
||||
|
||||
# first step
|
||||
@@ -1087,7 +1087,7 @@ class TFT5ForConditionalGeneration(TFT5PreTrainedModel):
|
||||
|
||||
return {
|
||||
"inputs": None, # inputs don't have to be defined, but still need to be passed to make Keras.layer.__call__ happy
|
||||
"decoder_input_ids": input_ids, # input_ids are the decoder_input_ids
|
||||
"decoder_input_ids": inputs, # inputs are the decoder_input_ids
|
||||
"decoder_past_key_value_states": decoder_past_key_value_states,
|
||||
"encoder_outputs": encoder_outputs,
|
||||
"attention_mask": attention_mask,
|
||||
|
||||
@@ -929,7 +929,9 @@ class TFPreTrainedModel(tf.keras.Model, TFModelUtilsMixin):
|
||||
else:
|
||||
tokens_to_add = next_token
|
||||
|
||||
# add token and increase length by one
|
||||
input_ids = tf.concat([input_ids, tf.expand_dims(tokens_to_add, -1)], 1)
|
||||
cur_len = cur_len + 1
|
||||
|
||||
if eos_token_id is not None:
|
||||
eos_in_sents = tokens_to_add == eos_token_id
|
||||
@@ -955,8 +957,6 @@ class TFPreTrainedModel(tf.keras.Model, TFModelUtilsMixin):
|
||||
[attention_mask, tf.ones((shape_list(attention_mask)[0], 1), dtype=tf.int32)], axis=-1
|
||||
)
|
||||
|
||||
cur_len = cur_len + 1
|
||||
|
||||
# if there are different sentences lengths in the batch, some batches have to be padded
|
||||
min_sent_length = tf.math.reduce_min(sent_lengths)
|
||||
max_sent_length = tf.math.reduce_max(sent_lengths)
|
||||
@@ -970,7 +970,7 @@ class TFPreTrainedModel(tf.keras.Model, TFModelUtilsMixin):
|
||||
tf.expand_dims(sent_lengths, -1), [batch_size, max_sent_length]
|
||||
)
|
||||
broad_casted_range = tf.transpose(
|
||||
tf.broadcast_to(tf.expand_dims(tf.range(max_length), -1), [max_length, batch_size])
|
||||
tf.broadcast_to(tf.expand_dims(tf.range(max_sent_length), -1), [max_sent_length, batch_size])
|
||||
)
|
||||
|
||||
decoded = tf.where(broad_casted_range < broad_casted_sent_lengths, input_ids, padding)
|
||||
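The hunk above sizes the broadcast range by `max_sent_length` instead of `max_length`, so the position grid has the same width as the generated `input_ids`. A minimal pure-Python sketch of that masking step, assuming nested lists stand in for tensors; the helper name `pad_after_length` is hypothetical and not part of the library:

```python
from typing import List


def pad_after_length(
    input_ids: List[List[int]], sent_lengths: List[int], pad_token_id: int
) -> List[List[int]]:
    """Keep the first sent_lengths[i] tokens of row i; overwrite the rest with pad_token_id."""
    max_sent_length = max(sent_lengths)
    decoded = []
    for row, length in zip(input_ids, sent_lengths):
        # The range runs to max_sent_length so it lines up with the row width.
        decoded.append(
            [row[pos] if pos < length else pad_token_id for pos in range(max_sent_length)]
        )
    return decoded


batch = [[5, 6, 7, 8], [5, 9, 2, 2]]
padded = pad_after_length(batch, sent_lengths=[4, 2], pad_token_id=0)
```

If the range were built from a larger `max_length`, the comparison grid would no longer match the shape of `input_ids`, which is the mismatch the change appears to address.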
@@ -1205,9 +1205,11 @@ class TFPreTrainedModel(tf.keras.Model, TFModelUtilsMixin):
|
||||
beam_tokens = tf.convert_to_tensor([x[1] for x in next_batch_beam], dtype=tf.int32)
|
||||
beam_idx = tf.convert_to_tensor([x[2] for x in next_batch_beam], dtype=tf.int32)
|
||||
|
||||
# re-order batch
|
||||
# re-order batch and update current length
|
||||
input_ids = tf.stack([tf.identity(input_ids[x, :]) for x in beam_idx])
|
||||
input_ids = tf.concat([input_ids, tf.expand_dims(beam_tokens, 1)], axis=-1)
|
||||
cur_len = cur_len + 1
|
||||
|
||||
# re-order internal states
|
||||
if past is not None:
|
||||
past = self._reorder_cache(past, beam_idx)
|
||||
@@ -1218,9 +1220,6 @@ class TFPreTrainedModel(tf.keras.Model, TFModelUtilsMixin):
|
||||
[attention_mask, tf.ones((shape_list(attention_mask)[0], 1), dtype=tf.int32)], axis=-1
|
||||
)
|
||||
|
||||
# update current length
|
||||
cur_len = cur_len + 1
|
||||
|
||||
# finalize all open beam hypotheses and end to generated hypotheses
|
||||
for batch_idx in range(batch_size):
|
||||
# Add all open beam hypothesis to generated_hyps
|
||||
|
||||
@@ -17,7 +17,7 @@
|
||||
import inspect
|
||||
import logging
|
||||
import os
|
||||
from typing import Callable, Tuple
|
||||
from typing import Callable, List, Tuple
|
||||
|
||||
import torch
|
||||
from torch import Tensor, device, dtype, nn
|
||||
@@ -110,11 +110,33 @@ class ModuleUtilsMixin:
|
||||
|
||||
@property
|
||||
def device(self) -> device:
|
||||
return next(self.parameters()).device
|
||||
try:
|
||||
return next(self.parameters()).device
|
||||
except StopIteration:
|
||||
# For nn.DataParallel compatibility in PyTorch 1.5
|
||||
|
||||
def find_tensor_attributes(module: nn.Module) -> List[Tuple[str, Tensor]]:
|
||||
tuples = [(k, v) for k, v in module.__dict__.items() if torch.is_tensor(v)]
|
||||
return tuples
|
||||
|
||||
gen = self._named_members(get_members_fn=find_tensor_attributes)
|
||||
first_tuple = next(gen)
|
||||
return first_tuple[1].device
|
||||
|
||||
@property
|
||||
def dtype(self) -> dtype:
|
||||
return next(self.parameters()).dtype
|
||||
try:
|
||||
return next(self.parameters()).dtype
|
||||
except StopIteration:
|
||||
# For nn.DataParallel compatibility in PyTorch 1.5
|
||||
|
||||
def find_tensor_attributes(module: nn.Module) -> List[Tuple[str, Tensor]]:
|
||||
tuples = [(k, v) for k, v in module.__dict__.items() if torch.is_tensor(v)]
|
||||
return tuples
|
||||
|
||||
gen = self._named_members(get_members_fn=find_tensor_attributes)
|
||||
first_tuple = next(gen)
|
||||
return first_tuple[1].dtype
|
||||
|
||||
def invert_attention_mask(self, encoder_attention_mask: Tensor) -> Tensor:
|
||||
"""type: torch.Tensor -> torch.Tensor"""
|
||||
@@ -128,7 +150,18 @@ class ModuleUtilsMixin:
|
||||
# encoder_extended_attention_mask = (encoder_extended_attention_mask ==
|
||||
# encoder_extended_attention_mask.transpose(-1, -2))
|
||||
encoder_extended_attention_mask = encoder_extended_attention_mask.to(dtype=self.dtype) # fp16 compatibility
|
||||
encoder_extended_attention_mask = (1.0 - encoder_extended_attention_mask) * -1e9
|
||||
|
||||
if self.dtype == torch.float16:
|
||||
encoder_extended_attention_mask = (1.0 - encoder_extended_attention_mask) * -1e4
|
||||
elif self.dtype == torch.float32:
|
||||
encoder_extended_attention_mask = (1.0 - encoder_extended_attention_mask) * -1e9
|
||||
else:
|
||||
raise ValueError(
|
||||
"{} not recognized. `dtype` should be set to either `torch.float32` or `torch.float16`".format(
|
||||
self.dtype
|
||||
)
|
||||
)
|
||||
|
||||
return encoder_extended_attention_mask
|
||||
|
||||
def get_extended_attention_mask(self, attention_mask: Tensor, input_shape: tuple, device: device):
|
||||
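The hunk above picks the additive mask value by dtype: `-1e4` for `torch.float16` and `-1e9` for `torch.float32`. The likely reason is range: the largest finite half-precision value is 65504, so `-1e9` cannot be represented. The sketch below illustrates this with the standard library only. It assumes plain lists instead of tensors, and the names `FILL_VALUES`, `fits_in_float16` and `invert_mask` are hypothetical. Note that `struct` raises on overflow, whereas tensor libraries typically produce `-inf` instead.

```python
import struct
from typing import List

# Fill values taken from the diff: -1e4 for float16, -1e9 for float32.
FILL_VALUES = {"float16": -1e4, "float32": -1e9}


def fits_in_float16(value: float) -> bool:
    """Return True if value can be packed as an IEEE 754 half-precision float."""
    try:
        struct.pack("<e", value)
        return True
    except OverflowError:
        return False


def invert_mask(mask: List[int], dtype_name: str) -> List[float]:
    """Map 1 (attend) to a zero bias and 0 (masked) to a large negative bias."""
    if dtype_name not in FILL_VALUES:
        raise ValueError(f"{dtype_name} not recognized. Use 'float32' or 'float16'.")
    fill = FILL_VALUES[dtype_name]
    return [(1.0 - m) * fill for m in mask]


half_mask = invert_mask([1, 1, 0], "float16")
```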
@@ -737,7 +770,7 @@ class PreTrainedModel(nn.Module, ModuleUtilsMixin):
|
||||
import torch_xla.core.xla_model as xm
|
||||
|
||||
model = xm.send_cpu_data_to_device(model, xm.xla_device())
|
||||
model = model.to(xm.xla_device())
|
||||
model.to(xm.xla_device())
|
||||
|
||||
return model
|
||||
|
||||
@@ -1236,13 +1269,15 @@ class PreTrainedModel(nn.Module, ModuleUtilsMixin):
|
||||
else:
|
||||
tokens_to_add = next_token
|
||||
|
||||
# add token and increase length by one
|
||||
input_ids = torch.cat([input_ids, tokens_to_add.unsqueeze(-1)], dim=-1)
|
||||
cur_len = cur_len + 1
|
||||
|
||||
if eos_token_id is not None:
|
||||
eos_in_sents = tokens_to_add == eos_token_id
|
||||
# if sentence is unfinished and the token to add is eos, sent_lengths is filled with current length
|
||||
is_sents_unfinished_and_token_to_add_is_eos = unfinished_sents.mul(eos_in_sents.long()).bool()
|
||||
sent_lengths.masked_fill_(is_sents_unfinished_and_token_to_add_is_eos, cur_len + 1)
|
||||
sent_lengths.masked_fill_(is_sents_unfinished_and_token_to_add_is_eos, cur_len)
|
||||
# unfinished_sents is set to zero if eos in sentence
|
||||
unfinished_sents.mul_((~eos_in_sents).long())
|
||||
|
||||
@@ -1256,8 +1291,6 @@ class PreTrainedModel(nn.Module, ModuleUtilsMixin):
|
||||
[attention_mask, attention_mask.new_ones((attention_mask.shape[0], 1))], dim=-1
|
||||
)
|
||||
|
||||
cur_len = cur_len + 1
|
||||
|
||||
# if there are different sentences lengths in the batch, some batches have to be padded
|
||||
if sent_lengths.min().item() != sent_lengths.max().item():
|
||||
assert pad_token_id is not None, "`Pad_token_id` has to be defined if batches have different lengths"
|
||||
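The two hunks above move `cur_len = cur_len + 1` to directly after the token is appended, and record `cur_len` (rather than `cur_len + 1`) as the sentence length when EOS is produced. A toy greedy loop showing the resulting invariant, assuming a single sequence held in a Python list; `greedy_decode` and the stub model are hypothetical:

```python
from typing import Callable, List, Optional, Tuple


def greedy_decode(
    prompt: List[int],
    next_token_fn: Callable[[List[int]], int],
    max_length: int,
    eos_token_id: Optional[int] = None,
) -> Tuple[List[int], int]:
    """Generate one sequence; return (tokens, sentence length including EOS)."""
    input_ids = list(prompt)
    cur_len = len(input_ids)
    sent_length = max_length  # used if EOS is never produced
    while cur_len < max_length:
        next_token = next_token_fn(input_ids)
        # add token and increase length by one, in the same step
        input_ids.append(next_token)
        cur_len = cur_len + 1
        assert cur_len == len(input_ids)
        if eos_token_id is not None and next_token == eos_token_id:
            # cur_len already counts the EOS token, so no "+ 1" is needed
            sent_length = cur_len
            break
    return input_ids, sent_length


# Stub model (an assumption for the example): emits EOS (id 2) once 4 tokens exist.
tokens, length = greedy_decode(
    [7, 8], lambda ids: 2 if len(ids) >= 4 else 9, max_length=10, eos_token_id=2
)
```

Keeping the increment next to the append means every later use of `cur_len` in the loop body sees the length of the sequence as it currently stands.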
@@ -1473,9 +1506,11 @@ class PreTrainedModel(nn.Module, ModuleUtilsMixin):
|
||||
beam_tokens = input_ids.new([x[1] for x in next_batch_beam])
|
||||
beam_idx = input_ids.new([x[2] for x in next_batch_beam])
|
||||
|
||||
# re-order batch
|
||||
# re-order batch and update current length
|
||||
input_ids = input_ids[beam_idx, :]
|
||||
input_ids = torch.cat([input_ids, beam_tokens.unsqueeze(1)], dim=-1)
|
||||
cur_len = cur_len + 1
|
||||
|
||||
# re-order internal states
|
||||
if past is not None:
|
||||
past = self._reorder_cache(past, beam_idx)
|
||||
@@ -1486,9 +1521,6 @@ class PreTrainedModel(nn.Module, ModuleUtilsMixin):
|
||||
[attention_mask, attention_mask.new_ones((attention_mask.shape[0], 1))], dim=-1
|
||||
)
|
||||
|
||||
# update current length
|
||||
cur_len = cur_len + 1
|
||||
|
||||
# finalize all open beam hypotheses and end to generated hypotheses
|
||||
for batch_idx in range(batch_size):
|
||||
if done[batch_idx]:
|
||||
|
||||
@@ -623,7 +623,7 @@ class XLNetModel(XLNetPreTrainedModel):
|
||||
mask_lo = torch.tril(attn_mask, diagonal=-1)
|
||||
ret = torch.cat([ret[:, :qlen] + mask_lo, ret[:, qlen:]], dim=1)
|
||||
|
||||
ret = ret.to(next(self.parameters()))
|
||||
ret = ret.to(self.device)
|
||||
return ret
|
||||
|
||||
def cache_mem(self, curr_out, prev_mem):
|
||||
@@ -685,7 +685,7 @@ class XLNetModel(XLNetPreTrainedModel):
|
||||
fwd_pos_seq = fwd_pos_seq.clamp(-self.clamp_len, self.clamp_len)
|
||||
pos_emb = self.positional_embedding(fwd_pos_seq, inv_freq, bsz)
|
||||
|
||||
pos_emb = pos_emb.to(next(self.parameters()))
|
||||
pos_emb = pos_emb.to(self.device)
|
||||
return pos_emb
|
||||
|
||||
@add_start_docstrings_to_callable(XLNET_INPUTS_DOCSTRING)
|
||||
@@ -761,8 +761,8 @@ class XLNetModel(XLNetPreTrainedModel):
|
||||
mlen = mems[0].shape[0] if mems is not None and mems[0] is not None else 0
|
||||
klen = mlen + qlen
|
||||
|
||||
dtype_float = next(self.parameters()).dtype
|
||||
device = next(self.parameters()).device
|
||||
dtype_float = self.dtype
|
||||
device = self.device
|
||||
|
||||
# Attention mask
|
||||
# causal attention mask
|
||||
|
||||
@@ -152,8 +152,8 @@ class AdamW(Optimizer):
|
||||
|
||||
# Decay the first and second moment running average coefficient
|
||||
# In-place operations to update the averages at the same time
|
||||
exp_avg.mul_(beta1).add_(1.0 - beta1, grad)
|
||||
exp_avg_sq.mul_(beta2).addcmul_(1.0 - beta2, grad, grad)
|
||||
exp_avg.mul_(beta1).add_(grad, alpha=1.0 - beta1)
|
||||
exp_avg_sq.mul_(beta2).addcmul_(grad, grad, value=1.0 - beta2)
|
||||
denom = exp_avg_sq.sqrt().add_(group["eps"])
|
||||
|
||||
step_size = group["lr"]
|
||||
@@ -162,7 +162,7 @@ class AdamW(Optimizer):
|
||||
bias_correction2 = 1.0 - beta2 ** state["step"]
|
||||
step_size = step_size * math.sqrt(bias_correction2) / bias_correction1
|
||||
|
||||
p.data.addcdiv_(-step_size, exp_avg, denom)
|
||||
p.data.addcdiv_(exp_avg, denom, value=-step_size)
|
||||
|
||||
# Just adding the square of the weights to the loss function is *not*
|
||||
# the correct way of using L2 regularization/weight decay with Adam,
|
||||
@@ -173,6 +173,6 @@ class AdamW(Optimizer):
|
||||
# of the weights to the loss with plain (non-momentum) SGD.
|
||||
# Add weight decay at the end (fixed version)
|
||||
if group["weight_decay"] > 0.0:
|
||||
p.data.add_(-group["lr"] * group["weight_decay"], p.data)
|
||||
p.data.add_(p.data, alpha=-group["lr"] * group["weight_decay"])
|
||||
|
||||
return loss
|
||||
|
||||
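The `AdamW` hunks above only switch to the keyword forms (`alpha=`, `value=`) of the in-place tensor calls; the arithmetic is unchanged. For reference, a scalar sketch of one update step following the lines visible in the diff. It is not the library implementation: it assumes a single float parameter, always applies bias correction, and the function name `adamw_step` is hypothetical.

```python
import math
from typing import Dict, Tuple


def adamw_step(
    p: float,
    grad: float,
    state: Dict[str, float],
    lr: float = 1e-3,
    betas: Tuple[float, float] = (0.9, 0.999),
    eps: float = 1e-6,
    weight_decay: float = 0.0,
) -> float:
    """Apply one AdamW update to a scalar parameter and return the new value."""
    beta1, beta2 = betas
    state["step"] += 1
    # exp_avg.mul_(beta1).add_(grad, alpha=1.0 - beta1)
    state["exp_avg"] = state["exp_avg"] * beta1 + (1.0 - beta1) * grad
    # exp_avg_sq.mul_(beta2).addcmul_(grad, grad, value=1.0 - beta2)
    state["exp_avg_sq"] = state["exp_avg_sq"] * beta2 + (1.0 - beta2) * grad * grad
    denom = math.sqrt(state["exp_avg_sq"]) + eps
    bias_correction1 = 1.0 - beta1 ** state["step"]
    bias_correction2 = 1.0 - beta2 ** state["step"]
    step_size = lr * math.sqrt(bias_correction2) / bias_correction1
    # p.data.addcdiv_(exp_avg, denom, value=-step_size)
    p = p - step_size * state["exp_avg"] / denom
    if weight_decay > 0.0:
        # Decoupled weight decay: p.data.add_(p.data, alpha=-lr * weight_decay)
        p = p - lr * weight_decay * p
    return p


state = {"step": 0, "exp_avg": 0.0, "exp_avg_sq": 0.0}
new_p = adamw_step(1.0, 0.5, state, lr=0.1, eps=0.0, weight_decay=0.01)
```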
@@ -75,7 +75,7 @@ def create_optimizer(init_lr, num_train_steps, num_warmup_steps, end_lr=0.0, opt
|
||||
beta_1=0.9,
|
||||
beta_2=0.999,
|
||||
epsilon=1e-6,
|
||||
exclude_from_weight_decay=["layer_norm", "bias"],
|
||||
exclude_from_weight_decay=["LayerNorm", "layer_norm", "bias"],
|
||||
)
|
||||
|
||||
return optimizer
|
||||
@@ -217,7 +217,7 @@ class GradientAccumulator(object):
|
||||
"""The accumulated gradients on the current replica."""
|
||||
if not self._gradients:
|
||||
raise ValueError("The accumulator should be called first to initialize the gradients")
|
||||
return list(gradient.value() for gradient in self._gradients)
|
||||
return list(gradient.value() if gradient is not None else gradient for gradient in self._gradients)
|
||||
|
||||
def __call__(self, gradients):
|
||||
"""Accumulates :obj:`gradients` on the current replica."""
|
||||
@@ -231,6 +231,8 @@ class GradientAccumulator(object):
|
||||
synchronization=tf.VariableSynchronization.ON_READ,
|
||||
aggregation=tf.VariableAggregation.ONLY_FIRST_REPLICA,
|
||||
)
|
||||
if gradient is not None
|
||||
else gradient
|
||||
for gradient in gradients
|
||||
]
|
||||
)
|
||||
@@ -238,7 +240,8 @@ class GradientAccumulator(object):
|
||||
raise ValueError("Expected %s gradients, but got %d" % (len(self._gradients), len(gradients)))
|
||||
|
||||
for accum_gradient, gradient in zip(self._gradients, gradients):
|
||||
accum_gradient.assign_add(gradient)
|
||||
if accum_gradient is not None and gradient is not None:
|
||||
accum_gradient.assign_add(gradient)
|
||||
|
||||
self._accum_steps.assign_add(1)
|
||||
|
||||
@@ -248,4 +251,5 @@ class GradientAccumulator(object):
|
||||
return
|
||||
self._accum_steps.assign(0)
|
||||
for gradient in self._gradients:
|
||||
gradient.assign(tf.zeros_like(gradient))
|
||||
if gradient is not None:
|
||||
gradient.assign(tf.zeros_like(gradient))
|
||||
|
||||
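The `GradientAccumulator` hunks above let the accumulator tolerate `None` entries, which appear when a variable receives no gradient. A minimal framework-free sketch of the same guard pattern, assuming gradients are lists of floats; the class name `ListGradientAccumulator` is hypothetical:

```python
from typing import List, Optional

Gradient = Optional[List[float]]


class ListGradientAccumulator:
    """Sums gradients across calls, passing None entries through untouched."""

    def __init__(self) -> None:
        self._gradients: List[Gradient] = []
        self._accum_steps = 0

    @property
    def gradients(self) -> List[Gradient]:
        if not self._gradients:
            raise ValueError("The accumulator should be called first to initialize the gradients")
        return [list(g) if g is not None else g for g in self._gradients]

    def __call__(self, gradients: List[Gradient]) -> None:
        if not self._gradients:
            # One zero buffer per real gradient; None stays None.
            self._gradients = [[0.0] * len(g) if g is not None else g for g in gradients]
        if len(gradients) != len(self._gradients):
            raise ValueError("Expected %s gradients, but got %d" % (len(self._gradients), len(gradients)))
        for accum_gradient, gradient in zip(self._gradients, gradients):
            if accum_gradient is not None and gradient is not None:
                for i, value in enumerate(gradient):
                    accum_gradient[i] += value
        self._accum_steps += 1

    def reset(self) -> None:
        self._accum_steps = 0
        for gradient in self._gradients:
            if gradient is not None:
                gradient[:] = [0.0] * len(gradient)


acc = ListGradientAccumulator()
acc([[1.0, 2.0], None])
acc([[0.5, 0.5], None])
summed = acc.gradients
```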
@@ -24,7 +24,7 @@ from abc import ABC, abstractmethod
|
||||
from contextlib import contextmanager
|
||||
from itertools import chain
|
||||
from os.path import abspath, exists
|
||||
from typing import Any, Dict, Iterable, List, Optional, Sequence, Tuple, Union
|
||||
from typing import TYPE_CHECKING, Any, Dict, Iterable, List, Optional, Sequence, Tuple, Union
|
||||
|
||||
import numpy as np
|
||||
|
||||
@@ -58,6 +58,10 @@ if is_torch_available():
|
||||
AutoModelWithLMHead,
|
||||
)
|
||||
|
||||
if TYPE_CHECKING:
|
||||
from .modeling_utils import PreTrainedModel
|
||||
from .modeling_tf_utils import TFPreTrainedModel
|
||||
|
||||
|
||||
logger = logging.getLogger(__name__)
|
||||
|
||||
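The hunk above imports `PreTrainedModel` and `TFPreTrainedModel` only under `TYPE_CHECKING`, which makes the names available for annotations without importing either framework at runtime. A small sketch of the pattern, using `fractions.Fraction` as a stand-in for a heavy optional dependency (that choice, and the function `describe`, are assumptions for illustration):

```python
from typing import TYPE_CHECKING

if TYPE_CHECKING:
    # Seen by static type checkers only; never executed at runtime.
    from fractions import Fraction


def describe(value: "Fraction") -> str:
    """The quoted annotation is not resolved at runtime, so calling this needs no import here."""
    return f"{value.numerator}/{value.denominator}"


import fractions

text = describe(fractions.Fraction(3, 4))
```

The annotation has to be quoted (or the module has to postpone annotation evaluation), because the name does not exist when the function is defined.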
@@ -864,6 +868,7 @@ class NerPipeline(Pipeline):
|
||||
binary_output: bool = False,
|
||||
ignore_labels=["O"],
|
||||
task: str = "",
|
||||
grouped_entities: bool = False,
|
||||
):
|
||||
super().__init__(
|
||||
model=model,
|
||||
@@ -878,6 +883,7 @@ class NerPipeline(Pipeline):
|
||||
|
||||
self._basic_tokenizer = BasicTokenizer(do_lower_case=False)
|
||||
self.ignore_labels = ignore_labels
|
||||
self.grouped_entities = grouped_entities
|
||||
|
||||
def __call__(self, *args, **kwargs):
|
||||
inputs = self._args_parser(*args, **kwargs)
|
||||
@@ -907,23 +913,74 @@ class NerPipeline(Pipeline):
|
||||
score = np.exp(entities) / np.exp(entities).sum(-1, keepdims=True)
|
||||
labels_idx = score.argmax(axis=-1)
|
||||
|
||||
answer = []
|
||||
for idx, label_idx in enumerate(labels_idx):
|
||||
if self.model.config.id2label[label_idx] not in self.ignore_labels:
|
||||
answer += [
|
||||
{
|
||||
"word": self.tokenizer.convert_ids_to_tokens(int(input_ids[idx])),
|
||||
"score": score[idx][label_idx].item(),
|
||||
"entity": self.model.config.id2label[label_idx],
|
||||
}
|
||||
]
|
||||
entities = []
|
||||
entity_groups = []
|
||||
entity_group_disagg = []
|
||||
# Filter to labels not in `self.ignore_labels`
|
||||
filtered_labels_idx = [
|
||||
(idx, label_idx)
|
||||
for idx, label_idx in enumerate(labels_idx)
|
||||
if self.model.config.id2label[label_idx] not in self.ignore_labels
|
||||
]
|
||||
|
||||
for idx, label_idx in filtered_labels_idx:
|
||||
|
||||
entity = {
|
||||
"word": self.tokenizer.convert_ids_to_tokens(int(input_ids[idx])),
|
||||
"score": score[idx][label_idx].item(),
|
||||
"entity": self.model.config.id2label[label_idx],
|
||||
"index": idx,
|
||||
}
|
||||
last_idx, _ = filtered_labels_idx[-1]
|
||||
if self.grouped_entities:
|
||||
if not entity_group_disagg:
|
||||
entity_group_disagg += [entity]
|
||||
if idx == last_idx:
|
||||
entity_groups += [self.group_entities(entity_group_disagg)]
|
||||
continue
|
||||
|
||||
# If the current entity is similar and adjacent to the previous entity, append it to the disaggregated entity group
|
||||
if (
|
||||
entity["entity"] == entity_group_disagg[-1]["entity"]
|
||||
and entity["index"] == entity_group_disagg[-1]["index"] + 1
|
||||
):
|
||||
entity_group_disagg += [entity]
|
||||
# Group the entities at the last entity
|
||||
if idx == last_idx:
|
||||
entity_groups += [self.group_entities(entity_group_disagg)]
|
||||
# If the current entity is different from the previous entity, aggregate the disaggregated entity group
|
||||
else:
|
||||
entity_groups += [self.group_entities(entity_group_disagg)]
|
||||
entity_group_disagg = [entity]
|
||||
|
||||
entities += [entity]
|
||||
|
||||
# Append
|
||||
answers += [answer]
|
||||
if self.grouped_entities:
|
||||
answers += [entity_groups]
|
||||
else:
|
||||
answers += [entities]
|
||||
|
||||
if len(answers) == 1:
|
||||
return answers[0]
|
||||
return answers
|
||||
|
||||
def group_entities(self, entities):
|
||||
"""
|
||||
Returns grouped entities
|
||||
"""
|
||||
# Get the last entity in the entity group
|
||||
entity = entities[-1]["entity"]
|
||||
scores = np.mean([entity["score"] for entity in entities])
|
||||
tokens = [entity["word"] for entity in entities]
|
||||
|
||||
entity_group = {
|
||||
"entity_group": entity,
|
||||
"score": np.mean(scores),
|
||||
"word": self.tokenizer.convert_tokens_to_string(tokens),
|
||||
}
|
||||
return entity_group
|
||||
|
||||
|
||||
TokenClassificationPipeline = NerPipeline
|
||||
|
||||
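With `grouped_entities=True`, the pipeline above merges consecutive tokens that share a label and have adjacent indices, then averages their scores. A simplified sketch of that grouping rule, not the pipeline code itself: it assumes entity dicts shaped like those built in the diff, joins words with a space instead of calling `tokenizer.convert_tokens_to_string`, and the name `group_adjacent_entities` is hypothetical.

```python
from statistics import mean
from typing import Dict, List


def group_adjacent_entities(entities: List[Dict]) -> List[Dict]:
    """Merge runs of entities with the same label and consecutive token indices."""
    groups: List[List[Dict]] = []
    for entity in entities:
        last = groups[-1][-1] if groups else None
        if (
            last is not None
            and entity["entity"] == last["entity"]
            and entity["index"] == last["index"] + 1
        ):
            groups[-1].append(entity)  # same label and adjacent: extend the run
        else:
            groups.append([entity])  # otherwise start a new group
    return [
        {
            "entity_group": group[-1]["entity"],
            "score": mean(e["score"] for e in group),
            "word": " ".join(e["word"] for e in group),
        }
        for group in groups
    ]


tagged = [
    {"word": "New", "score": 0.9, "entity": "I-LOC", "index": 1},
    {"word": "York", "score": 0.7, "entity": "I-LOC", "index": 2},
    {"word": "Alice", "score": 0.8, "entity": "I-PER", "index": 5},
]
grouped = group_adjacent_entities(tagged)
```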
@@ -1509,7 +1566,7 @@ class TranslationPipeline(Pipeline):
|
||||
return results
|
||||
|
||||
|
||||
# Register all the supported task here
|
||||
# Register all the supported tasks here
|
||||
SUPPORTED_TASKS = {
|
||||
"feature-extraction": {
|
||||
"impl": FeatureExtractionPipeline,
|
||||
@@ -1572,9 +1629,9 @@ SUPPORTED_TASKS = {
|
||||
"tf": TFAutoModelWithLMHead if is_tf_available() else None,
|
||||
"pt": AutoModelWithLMHead if is_torch_available() else None,
|
||||
"default": {
|
||||
"model": {"pt": "bart-large-cnn", "tf": None},
|
||||
"model": {"pt": "bart-large-cnn", "tf": "t5-small"},
|
||||
"config": None,
|
||||
"tokenizer": ("bart-large-cnn", {"use_fast": False}),
|
||||
"tokenizer": {"pt": ("bart-large-cnn", {"use_fast": False}), "tf": "t5-small"},
|
||||
},
|
||||
},
|
||||
"translation_en_to_fr": {
|
||||
|
||||
@@ -29,6 +29,7 @@ from .configuration_auto import (
|
||||
ElectraConfig,
|
||||
FlaubertConfig,
|
||||
GPT2Config,
|
||||
LongformerConfig,
|
||||
OpenAIGPTConfig,
|
||||
ReformerConfig,
|
||||
RobertaConfig,
|
||||
@@ -50,6 +51,7 @@ from .tokenization_distilbert import DistilBertTokenizer, DistilBertTokenizerFas
|
||||
from .tokenization_electra import ElectraTokenizer, ElectraTokenizerFast
|
||||
from .tokenization_flaubert import FlaubertTokenizer
|
||||
from .tokenization_gpt2 import GPT2Tokenizer, GPT2TokenizerFast
|
||||
from .tokenization_longformer import LongformerTokenizer
|
||||
from .tokenization_marian import MarianTokenizer
|
||||
from .tokenization_openai import OpenAIGPTTokenizer, OpenAIGPTTokenizerFast
|
||||
from .tokenization_reformer import ReformerTokenizer
|
||||
@@ -73,6 +75,7 @@ TOKENIZER_MAPPING = OrderedDict(
|
||||
(XLMRobertaConfig, (XLMRobertaTokenizer, None)),
|
||||
(MarianConfig, (MarianTokenizer, None)),
|
||||
(BartConfig, (BartTokenizer, None)),
|
||||
(LongformerConfig, (LongformerTokenizer, None)),
|
||||
(RobertaConfig, (RobertaTokenizer, RobertaTokenizerFast)),
|
||||
(ReformerConfig, (ReformerTokenizer, None)),
|
||||
(ElectraConfig, (ElectraTokenizer, ElectraTokenizerFast)),
|
||||
@@ -105,6 +108,7 @@ class AutoTokenizer:
|
||||
- contains `albert`: AlbertTokenizer (ALBERT model)
|
||||
- contains `camembert`: CamembertTokenizer (CamemBERT model)
|
||||
- contains `xlm-roberta`: XLMRobertaTokenizer (XLM-RoBERTa model)
|
||||
- contains `longformer`: LongformerTokenizer (AllenAI Longformer model)
|
||||
- contains `roberta`: RobertaTokenizer (RoBERTa model)
|
||||
- contains `bert`: BertTokenizer (Bert model)
|
||||
- contains `openai-gpt`: OpenAIGPTTokenizer (OpenAI GPT model)
|
||||
@@ -136,6 +140,7 @@ class AutoTokenizer:
|
||||
- contains `albert`: AlbertTokenizer (ALBERT model)
|
||||
- contains `camembert`: CamembertTokenizer (CamemBERT model)
|
||||
- contains `xlm-roberta`: XLMRobertaTokenizer (XLM-RoBERTa model)
|
||||
- contains `longformer`: LongformerTokenizer (AllenAI Longformer model)
|
||||
- contains `roberta`: RobertaTokenizer (RoBERTa model)
|
||||
- contains `bert-base-japanese`: BertJapaneseTokenizer (Bert model)
|
||||
- contains `bert`: BertTokenizer (Bert model)
|
||||
|
||||
@@ -27,8 +27,6 @@ vocab_url = "https://s3.amazonaws.com/models.huggingface.co/bert/roberta-large-v
|
||||
merges_url = "https://s3.amazonaws.com/models.huggingface.co/bert/roberta-large-merges.txt"
|
||||
_all_bart_models = ["bart-large", "bart-large-mnli", "bart-large-cnn", "bart-large-xsum"]
|
||||
|
||||
VOCAB_FILES_NAMES = {"vocab_file": "sentence.bpe.model"}
|
||||
|
||||
|
||||
class BartTokenizer(RobertaTokenizer):
|
||||
# merges and vocab same as Roberta
|
||||
@@ -44,6 +42,6 @@ SPM_URL = "https://s3.amazonaws.com/models.huggingface.co/bert/facebook/mbart-la
|
||||
|
||||
|
||||
class MBartTokenizer(XLMRobertaTokenizer):
|
||||
vocab_files_names = VOCAB_FILES_NAMES
|
||||
vocab_files_names = {"vocab_file": "sentencepiece.bpe.model"}
|
||||
max_model_input_sizes = {m: 1024 for m in _all_mbart_models}
|
||||
pretrained_vocab_files_map = {"vocab_file": {m: SPM_URL for m in _all_mbart_models}}
|
||||
|
||||
42
src/transformers/tokenization_longformer.py
Normal file
@@ -0,0 +1,42 @@
|
||||
# coding=utf-8
|
||||
# Copyright 2020 The Allen Institute for AI team and The HuggingFace Inc. team.
|
||||
#
|
||||
# Licensed under the Apache License, Version 2.0 (the "License");
|
||||
# you may not use this file except in compliance with the License.
|
||||
# You may obtain a copy of the License at
|
||||
#
|
||||
# http://www.apache.org/licenses/LICENSE-2.0
|
||||
#
|
||||
# Unless required by applicable law or agreed to in writing, software
|
||||
# distributed under the License is distributed on an "AS IS" BASIS,
|
||||
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
|
||||
# See the License for the specific language governing permissions and
|
||||
# limitations under the License.
|
||||
|
||||
import logging
|
||||
|
||||
from .tokenization_roberta import RobertaTokenizer
|
||||
|
||||
|
||||
logger = logging.getLogger(__name__)
|
||||
|
||||
|
||||
# vocab and merges same as roberta
|
||||
vocab_url = "https://s3.amazonaws.com/models.huggingface.co/bert/roberta-large-vocab.json"
|
||||
merges_url = "https://s3.amazonaws.com/models.huggingface.co/bert/roberta-large-merges.txt"
|
||||
_all_longformer_models = ["longformer-base-4096", "longformer-large-4096"]
|
||||
|
||||
|
||||
PRETRAINED_POSITIONAL_EMBEDDINGS_SIZES = {
|
||||
"longformer-base-4096": 4096,
|
||||
"longformer-large-4096": 4096,
|
||||
}
|
||||
|
||||
|
||||
class LongformerTokenizer(RobertaTokenizer):
|
||||
# merges and vocab same as Roberta
|
||||
max_model_input_sizes = PRETRAINED_POSITIONAL_EMBEDDINGS_SIZES
|
||||
pretrained_vocab_files_map = {
|
||||
"vocab_file": {m: vocab_url for m in _all_longformer_models},
|
||||
"merges_file": {m: merges_url for m in _all_longformer_models},
|
||||
}
|
||||
@@ -1,7 +1,9 @@
|
||||
import json
|
||||
import re
|
||||
import warnings
|
||||
from typing import Dict, List, Optional, Union
|
||||
from pathlib import Path
|
||||
from shutil import copyfile
|
||||
from typing import Dict, List, Optional, Tuple, Union
|
||||
|
||||
import sentencepiece
|
||||
|
||||
@@ -15,7 +17,7 @@ vocab_files_names = {
|
||||
"vocab": "vocab.json",
|
||||
"tokenizer_config_file": "tokenizer_config.json",
|
||||
}
|
||||
MODEL_NAMES = ("opus-mt-en-de",) # TODO(SS): the only required constant is vocab_files_names
|
||||
MODEL_NAMES = ("opus-mt-en-de",) # TODO(SS): delete this, the only required constant is vocab_files_names
|
||||
PRETRAINED_VOCAB_FILES_MAP = {
|
||||
k: {m: f"{S3_BUCKET_PREFIX}/Helsinki-NLP/{m}/{fname}" for m in MODEL_NAMES}
|
||||
for k, fname in vocab_files_names.items()
|
||||
@@ -55,14 +57,16 @@ class MarianTokenizer(PreTrainedTokenizer):
|
||||
eos_token="</s>",
|
||||
pad_token="<pad>",
|
||||
max_len=512,
|
||||
**kwargs,
|
||||
):
|
||||
|
||||
super().__init__(
|
||||
# bos_token=bos_token,
|
||||
# bos_token=bos_token, unused. Start decoding with config.decoder_start_token_id
|
||||
max_len=max_len,
|
||||
eos_token=eos_token,
|
||||
unk_token=unk_token,
|
||||
pad_token=pad_token,
|
||||
**kwargs,
|
||||
)
|
||||
self.encoder = load_json(vocab)
|
||||
if self.unk_token not in self.encoder:
|
||||
@@ -72,21 +76,23 @@ class MarianTokenizer(PreTrainedTokenizer):
|
||||
|
||||
self.source_lang = source_lang
|
||||
self.target_lang = target_lang
|
||||
self.supported_language_codes: list = [k for k in self.encoder if k.startswith(">>") and k.endswith("<<")]
|
||||
self.spm_files = [source_spm, target_spm]
|
||||
|
||||
# load SentencePiece model for pre-processing
|
||||
self.spm_source = sentencepiece.SentencePieceProcessor()
|
||||
self.spm_source.Load(source_spm)
|
||||
|
||||
self.spm_target = sentencepiece.SentencePieceProcessor()
|
||||
self.spm_target.Load(target_spm)
|
||||
self.spm_source = load_spm(source_spm)
|
||||
self.spm_target = load_spm(target_spm)
|
||||
self.current_spm = self.spm_source
|
||||
|
||||
# Multilingual target side: default to using first supported language code.
|
||||
self.supported_language_codes: list = [k for k in self.encoder if k.startswith(">>") and k.endswith("<<")]
|
||||
|
||||
self._setup_normalizer()
|
||||
|
||||
def _setup_normalizer(self):
|
||||
try:
|
||||
from mosestokenizer import MosesPunctuationNormalizer
|
||||
|
||||
self.punc_normalizer = MosesPunctuationNormalizer(source_lang)
|
||||
self.punc_normalizer = MosesPunctuationNormalizer(self.source_lang)
|
||||
except ImportError:
|
||||
warnings.warn("Recommended: pip install mosestokenizer")
|
||||
self.punc_normalizer = lambda x: x
|
||||
@@ -124,9 +130,6 @@ class MarianTokenizer(PreTrainedTokenizer):
|
||||
# We don't expect to process pairs, but leave the pair logic for API consistency
|
||||
return token_ids_0 + token_ids_1 + [self.eos_token_id]
|
||||
|
||||
def batch_decode(self, token_ids, **kwargs) -> List[str]:
|
||||
return [self.decode(ids, **kwargs) for ids in token_ids]
|
||||
|
||||
def prepare_translation_batch(
|
||||
self,
|
||||
src_texts: List[str],
|
||||
@@ -179,6 +182,65 @@ class MarianTokenizer(PreTrainedTokenizer):
|
||||
def vocab_size(self) -> int:
|
||||
return len(self.encoder)
|
||||
|
||||
def save_vocabulary(self, save_directory: str) -> Tuple[str]:
|
||||
"""save vocab file to json and copy spm files from their original path."""
|
||||
save_dir = Path(save_directory)
|
||||
assert save_dir.is_dir(), f"{save_directory} should be a directory"
|
||||
save_json(self.encoder, save_dir / self.vocab_files_names["vocab"])
|
||||
|
||||
for f in self.spm_files:
|
||||
dest_path = save_dir / Path(f).name
|
||||
if not dest_path.exists():
|
||||
copyfile(f, save_dir / Path(f).name)
|
||||
return tuple(save_dir / f for f in self.vocab_files_names)
|
||||
|
||||
def get_vocab(self) -> Dict:
|
||||
vocab = self.encoder.copy()
|
||||
vocab.update(self.added_tokens_encoder)
|
||||
return vocab
|
||||
|
||||
def __getstate__(self) -> Dict:
|
||||
state = self.__dict__.copy()
|
||||
state.update({k: None for k in ["spm_source", "spm_target", "current_spm", "punc_normalizer"]})
|
||||
return state
|
||||
|
||||
def __setstate__(self, d: Dict) -> None:
|
||||
self.__dict__ = d
|
||||
self.spm_source, self.spm_target = (load_spm(f) for f in self.spm_files)
|
||||
self.current_spm = self.spm_source
|
||||
self._setup_normalizer()
|
||||
|
||||
def num_special_tokens_to_add(self, **unused):
|
||||
"""Just EOS"""
|
||||
return 1
|
||||
|
||||
def _special_token_mask(self, seq):
|
||||
all_special_ids = set(self.all_special_ids) # call it once instead of inside list comp
|
||||
all_special_ids.remove(self.unk_token_id) # <unk> is only sometimes special
|
||||
return [1 if x in all_special_ids else 0 for x in seq]
|
||||
|
||||
def get_special_tokens_mask(
|
||||
self, token_ids_0: List, token_ids_1: Optional[List] = None, already_has_special_tokens: bool = False
|
||||
) -> List[int]:
|
||||
"""Get list where entries are [1] if a token is [eos] or [pad] else 0."""
|
||||
if already_has_special_tokens:
|
||||
return self._special_token_mask(token_ids_0)
|
||||
elif token_ids_1 is None:
|
||||
return self._special_token_mask(token_ids_0) + [1]
|
||||
else:
|
||||
return self._special_token_mask(token_ids_0 + token_ids_1) + [1]
|
||||
|
||||
|
||||
def load_spm(path: str) -> sentencepiece.SentencePieceProcessor:
|
||||
spm = sentencepiece.SentencePieceProcessor()
|
||||
spm.Load(path)
|
||||
return spm
|
||||
|
||||
|
||||
def save_json(data, path: str) -> None:
|
||||
with open(path, "w") as f:
|
||||
json.dump(data, f, indent=2)
|
||||
|
||||
|
||||
def load_json(path: str) -> Union[Dict, List]:
|
||||
with open(path, "r") as f:
|
||||
|
||||
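The `__getstate__` / `__setstate__` pair added to `MarianTokenizer` above drops members that cannot be serialized and rebuilds them from stored file paths on restore. A self-contained sketch of the same pattern, exercised with `copy.deepcopy`, which uses the same hooks as `pickle`. `Handle` and `ToyTokenizer` are hypothetical stand-ins; a `threading.Lock` plays the role of the non-serializable SentencePiece processor.

```python
import copy
import threading
from typing import Dict


class Handle:
    """Stand-in for a loaded native resource; the lock makes it impossible to pickle or deep-copy."""

    def __init__(self, path: str) -> None:
        self.path = path
        self._lock = threading.Lock()


class ToyTokenizer:
    def __init__(self, model_path: str) -> None:
        self.model_path = model_path
        self.handle = Handle(model_path)

    def __getstate__(self) -> Dict:
        state = self.__dict__.copy()
        state["handle"] = None  # drop the member that cannot be serialized
        return state

    def __setstate__(self, d: Dict) -> None:
        self.__dict__ = d
        self.handle = Handle(self.model_path)  # rebuild it from the stored path


original = ToyTokenizer("source.spm")
clone = copy.deepcopy(original)
```

Because the state keeps only the path, a restored object works only where that path is still readable, which is also true of the approach shown in the diff.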
@@ -199,7 +199,7 @@ class RobertaTokenizer(GPT2Tokenizer):
|
||||
if token_ids_1 is not None:
|
||||
raise ValueError(
|
||||
"You should not supply a second sequence if the provided sequence of "
|
||||
"ids is already formated with special tokens for the model."
|
||||
"ids is already formatted with special tokens for the model."
|
||||
)
|
||||
return list(map(lambda x: 1 if x in [self.sep_token_id, self.cls_token_id] else 0, token_ids_0))
|
||||
|
||||
|
||||
@@ -771,26 +771,26 @@ class PreTrainedTokenizer(SpecialTokensMixin):
|
||||
raise NotImplementedError
|
||||
|
||||
@property
|
||||
def is_fast(self):
|
||||
def is_fast(self) -> bool:
|
||||
return False
|
||||
|
||||
@property
|
||||
def max_len(self):
|
||||
def max_len(self) -> int:
|
||||
""" Kept here for backward compatibility.
|
||||
Now renamed to `model_max_length` to avoid ambiguity.
|
||||
"""
|
||||
return self.model_max_length
|
||||
|
||||
@property
|
||||
def max_len_single_sentence(self):
|
||||
def max_len_single_sentence(self) -> int:
|
||||
return self.model_max_length - self.num_special_tokens_to_add(pair=False)
|
||||
|
||||
@property
|
||||
def max_len_sentences_pair(self):
|
||||
def max_len_sentences_pair(self) -> int:
|
||||
return self.model_max_length - self.num_special_tokens_to_add(pair=True)
|
||||
|
||||
@max_len_single_sentence.setter
|
||||
def max_len_single_sentence(self, value):
|
||||
def max_len_single_sentence(self, value) -> int:
|
||||
""" For backward compatibility, allow to try to setup 'max_len_single_sentence' """
|
||||
if value == self.model_max_length - self.num_special_tokens_to_add(pair=False):
|
||||
logger.warning(
|
||||
@@ -802,7 +802,7 @@ class PreTrainedTokenizer(SpecialTokensMixin):
|
||||
)
|
||||
|
||||
@max_len_sentences_pair.setter
|
||||
def max_len_sentences_pair(self, value):
|
||||
def max_len_sentences_pair(self, value) -> int:
|
||||
""" For backward compatibility, allow to try to setup 'max_len_sentences_pair' """
|
||||
if value == self.model_max_length - self.num_special_tokens_to_add(pair=True):
|
||||
logger.warning(
|
||||
@@ -1118,7 +1118,7 @@ class PreTrainedTokenizer(SpecialTokensMixin):
|
||||
|
||||
return vocab_files + (special_tokens_map_file, added_tokens_file)
|
||||
|
||||
def save_vocabulary(self, save_directory):
|
||||
def save_vocabulary(self, save_directory) -> Tuple[str]:
|
||||
""" Save the tokenizer vocabulary to a directory. This method does *NOT* save added tokens
|
||||
and special token mappings.
|
||||
|
||||
@@ -1128,7 +1128,7 @@ class PreTrainedTokenizer(SpecialTokensMixin):
|
||||
"""
|
||||
raise NotImplementedError
|
||||
|
||||
def add_tokens(self, new_tokens):
|
||||
def add_tokens(self, new_tokens: Union[str, List[str]]) -> int:
|
||||
"""
|
||||
Add a list of new tokens to the tokenizer class. If the new tokens are not in the
|
||||
vocabulary, they are added to it with indices starting from length of the current vocabulary.
|
||||
@@ -1156,7 +1156,7 @@ class PreTrainedTokenizer(SpecialTokensMixin):
|
||||
if not isinstance(new_tokens, list):
|
||||
new_tokens = [new_tokens]
|
||||
|
||||
to_add_tokens = []
|
||||
tokens_to_add = []
|
||||
for token in new_tokens:
|
||||
assert isinstance(token, str)
|
||||
if self.init_kwargs.get("do_lower_case", False) and token not in self.all_special_tokens:
|
||||
@@ -1164,18 +1164,18 @@ class PreTrainedTokenizer(SpecialTokensMixin):
|
||||
if (
|
||||
token != self.unk_token
|
||||
and self.convert_tokens_to_ids(token) == self.convert_tokens_to_ids(self.unk_token)
|
||||
and token not in to_add_tokens
|
||||
and token not in tokens_to_add
|
||||
):
|
||||
to_add_tokens.append(token)
|
||||
tokens_to_add.append(token)
|
||||
logger.info("Adding %s to the vocabulary", token)
|
||||
|
||||
added_tok_encoder = dict((tok, len(self) + i) for i, tok in enumerate(to_add_tokens))
|
||||
added_tok_encoder = dict((tok, len(self) + i) for i, tok in enumerate(tokens_to_add))
|
||||
added_tok_decoder = {v: k for k, v in added_tok_encoder.items()}
|
||||
self.added_tokens_encoder.update(added_tok_encoder)
|
||||
self.unique_added_tokens_encoder = set(self.added_tokens_encoder.keys()).union(set(self.all_special_tokens))
|
||||
self.added_tokens_decoder.update(added_tok_decoder)
|
||||
|
||||
return len(to_add_tokens)
|
||||
return len(tokens_to_add)
|
||||
|
||||
def num_special_tokens_to_add(self, pair=False):
|
||||
"""
|
||||
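The `add_tokens` hunk above deduplicates the incoming tokens and then assigns ids starting at the current vocabulary length. A minimal sketch of that bookkeeping follows. The helper name `assign_new_token_ids` and the plain-dict `vocab` are assumptions made for illustration only; the real method decides "unknown" by comparing `convert_tokens_to_ids(token)` with the id of `unk_token`, which is approximated here with a simple membership test.

```python
from typing import Dict, List


def assign_new_token_ids(
    new_tokens: List[str], vocab: Dict[str, int], unk_token: str = "[UNK]"
) -> Dict[str, int]:
    """Hypothetical helper mirroring the dedupe-then-index logic of add_tokens."""
    tokens_to_add: List[str] = []
    for token in new_tokens:
        # Assumption: "not in vocab" stands in for "maps to the unk id".
        is_unknown = token not in vocab
        if token != unk_token and is_unknown and token not in tokens_to_add:
            tokens_to_add.append(token)
    # New ids continue from the current vocabulary size, in insertion order.
    return {tok: len(vocab) + i for i, tok in enumerate(tokens_to_add)}


vocab = {"[UNK]": 0, "hello": 1, "world": 2}
new_ids = assign_new_token_ids(["hello", "foo", "foo", "bar", "[UNK]"], vocab)
print(new_ids)
```

Already-known tokens, repeats, and the unknown token itself are skipped, so the number of entries returned matches the count that `add_tokens` reports.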
@@ -2080,10 +2080,7 @@ class PreTrainedTokenizer(SpecialTokensMixin):
|
||||
def build_inputs_with_special_tokens(self, token_ids_0: List, token_ids_1: Optional[List] = None) -> List:
|
||||
"""
|
||||
Build model inputs from a sequence or a pair of sequences for sequence classification tasks
|
||||
by concatenating and adding special tokens.
|
||||
A RoBERTa sequence has the following format:
|
||||
single sequence: <s> X </s>
|
||||
pair of sequences: <s> A </s></s> B </s>
|
||||
by concatenating and adding special tokens. This implementation does not add special tokens.
|
||||
"""
|
||||
if token_ids_1 is None:
|
||||
return token_ids_0
|
||||
@@ -2183,6 +2180,9 @@ class PreTrainedTokenizer(SpecialTokensMixin):
|
||||
else:
|
||||
return text
|
||||
|
||||
def batch_decode(self, sequences: List[List[int]], **kwargs) -> List[str]:
|
||||
return [self.decode(seq, **kwargs) for seq in sequences]
|
||||
|
||||
@staticmethod
|
||||
def clean_up_tokenization(out_string: str) -> str:
|
||||
""" Clean up a list of simple English tokenization artifacts like spaces before punctuation and abbreviated forms.
|
||||
|
||||
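The `batch_decode` method added in the hunk above is a thin wrapper that forwards its keyword arguments to `decode` for every sequence. The sketch below shows that pattern on a toy class. `ToyTokenizer`, its vocabulary, and its simplified `decode` are hypothetical stand-ins, not the library's `PreTrainedTokenizer`.

```python
from typing import Dict, List


class ToyTokenizer:
    """Hypothetical stand-in used only to illustrate the batch_decode pattern."""

    def __init__(self, id_to_token: Dict[int, str], special_ids: List[int]):
        self.id_to_token = id_to_token
        self.special_ids = set(special_ids)

    def decode(self, token_ids: List[int], skip_special_tokens: bool = False) -> str:
        # Simplified decode: look up each id and join with spaces.
        tokens = [
            self.id_to_token[i]
            for i in token_ids
            if not (skip_special_tokens and i in self.special_ids)
        ]
        return " ".join(tokens)

    def batch_decode(self, sequences: List[List[int]], **kwargs) -> List[str]:
        # Same shape as the method in the diff: one decode call per sequence.
        return [self.decode(seq, **kwargs) for seq in sequences]


tok = ToyTokenizer({0: "<s>", 1: "hello", 2: "world", 3: "</s>"}, special_ids=[0, 3])
decoded = tok.batch_decode([[0, 1, 2, 3], [0, 2, 3]], skip_special_tokens=True)
print(decoded)
```

Because the keyword arguments are passed through unchanged, a caller can replace a hand-written list comprehension over `decode` with a single `batch_decode` call, which is what the updated BART tests in this diff do.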
@@ -1,12 +1,13 @@
|
||||
import json
|
||||
import logging
|
||||
import math
|
||||
import os
|
||||
import random
|
||||
import re
|
||||
import shutil
|
||||
from contextlib import contextmanager
|
||||
from pathlib import Path
|
||||
from typing import Callable, Dict, List, Optional, Tuple, Union
|
||||
from typing import Callable, Dict, List, Optional, Tuple
|
||||
|
||||
import numpy as np
|
||||
import torch
|
||||
@@ -14,7 +15,7 @@ from torch import nn
|
||||
from torch.utils.data.dataloader import DataLoader
|
||||
from torch.utils.data.dataset import Dataset
|
||||
from torch.utils.data.distributed import DistributedSampler
|
||||
from torch.utils.data.sampler import RandomSampler
|
||||
from torch.utils.data.sampler import RandomSampler, Sampler, SequentialSampler
|
||||
from tqdm.auto import tqdm, trange
|
||||
|
||||
from .data.data_collator import DataCollator, DefaultDataCollator
|
||||
@@ -89,7 +90,7 @@ def set_seed(seed: int):
|
||||
@contextmanager
|
||||
def torch_distributed_zero_first(local_rank: int):
|
||||
"""
|
||||
Decorator to make all processes in distributed training wait for the first one (locally) to do something.
|
||||
Decorator to make all processes in distributed training wait for each local_master to do something.
|
||||
"""
|
||||
if local_rank not in [-1, 0]:
|
||||
torch.distributed.barrier()
|
||||
@@ -98,6 +99,50 @@ def torch_distributed_zero_first(local_rank: int):
|
||||
torch.distributed.barrier()
|
||||
|
||||
|
||||
class SequentialDistributedSampler(Sampler):
|
||||
"""
|
||||
Distributed Sampler that subsamples indices sequentially,
|
||||
making it easier to collate all results at the end.
|
||||
|
||||
Even though we only use this sampler for eval and predict (no training),
|
||||
which means that the model params won't have to be synced (i.e. will not hang
|
||||
for synchronization even if varied number of forward passes), we still add extra
|
||||
samples to the sampler to make it evenly divisible (like in `DistributedSampler`)
|
||||
to make it easy to `gather` or `reduce` resulting tensors at the end of the loop.
|
||||
"""
|
||||
|
||||
def __init__(self, dataset, num_replicas=None, rank=None):
|
||||
if num_replicas is None:
|
||||
if not torch.distributed.is_available():
|
||||
raise RuntimeError("Requires distributed package to be available")
|
||||
num_replicas = torch.distributed.get_world_size()
|
||||
if rank is None:
|
||||
if not torch.distributed.is_available():
|
||||
raise RuntimeError("Requires distributed package to be available")
|
||||
rank = torch.distributed.get_rank()
|
||||
self.dataset = dataset
|
||||
self.num_replicas = num_replicas
|
||||
self.rank = rank
|
||||
self.num_samples = int(math.ceil(len(self.dataset) * 1.0 / self.num_replicas))
|
||||
self.total_size = self.num_samples * self.num_replicas
|
||||
|
||||
def __iter__(self):
|
||||
indices = list(range(len(self.dataset)))
|
||||
|
||||
# add extra samples to make it evenly divisible
|
||||
indices += indices[: (self.total_size - len(indices))]
|
||||
assert len(indices) == self.total_size
|
||||
|
||||
# subsample
|
||||
indices = indices[self.rank * self.num_samples : (self.rank + 1) * self.num_samples]
|
||||
assert len(indices) == self.num_samples
|
||||
|
||||
return iter(indices)
|
||||
|
||||
def __len__(self):
|
||||
return self.num_samples
|
||||
|
||||
|
||||
def get_tpu_sampler(dataset: Dataset):
|
||||
if xm.xrt_world_size() <= 1:
|
||||
return RandomSampler(dataset)
|
||||
@@ -142,7 +187,7 @@ class Trainer:
|
||||
prediction_loss_only:
|
||||
(Optional) in evaluation and prediction, only return the loss
|
||||
"""
|
||||
self.model = model
|
||||
self.model = model.to(args.device)
|
||||
self.args = args
|
||||
if data_collator is not None:
|
||||
self.data_collator = data_collator
|
||||
@@ -155,7 +200,7 @@ class Trainer:
|
||||
self.optimizers = optimizers
|
||||
if tb_writer is not None:
|
||||
self.tb_writer = tb_writer
|
||||
elif is_tensorboard_available() and self.args.local_rank in [-1, 0]:
|
||||
elif is_tensorboard_available() and self.is_world_master():
|
||||
self.tb_writer = SummaryWriter(log_dir=self.args.logging_dir)
|
||||
if not is_tensorboard_available():
|
||||
logger.warning(
|
||||
@@ -170,7 +215,7 @@ class Trainer:
|
||||
)
|
||||
set_seed(self.args.seed)
|
||||
# Create output directory if needed
|
||||
if self.is_local_master():
|
||||
if self.is_world_master():
|
||||
os.makedirs(self.args.output_dir, exist_ok=True)
|
||||
if is_tpu_available():
|
||||
# Set an xla_device flag on the model's config.
|
||||
@@ -196,9 +241,6 @@ class Trainer:
|
||||
collate_fn=self.data_collator.collate_batch,
|
||||
)
|
||||
|
||||
if is_tpu_available():
|
||||
data_loader = pl.ParallelLoader(data_loader, [self.args.device]).per_device_loader(self.args.device)
|
||||
|
||||
return data_loader
|
||||
|
||||
def get_eval_dataloader(self, eval_dataset: Optional[Dataset] = None) -> DataLoader:
|
||||
@@ -207,36 +249,42 @@ class Trainer:
|
||||
|
||||
eval_dataset = eval_dataset if eval_dataset is not None else self.eval_dataset
|
||||
|
||||
sampler = get_tpu_sampler(eval_dataset) if is_tpu_available() else None
|
||||
if is_tpu_available():
|
||||
sampler = SequentialDistributedSampler(
|
||||
eval_dataset, num_replicas=xm.xrt_world_size(), rank=xm.get_ordinal()
|
||||
)
|
||||
elif self.args.local_rank != -1:
|
||||
sampler = SequentialDistributedSampler(eval_dataset)
|
||||
else:
|
||||
sampler = SequentialSampler(eval_dataset)
|
||||
|
||||
data_loader = DataLoader(
|
||||
eval_dataset,
|
||||
sampler=sampler,
|
||||
batch_size=self.args.eval_batch_size,
|
||||
shuffle=False,
|
||||
collate_fn=self.data_collator.collate_batch,
|
||||
)
|
||||
|
||||
if is_tpu_available():
|
||||
data_loader = pl.ParallelLoader(data_loader, [self.args.device]).per_device_loader(self.args.device)
|
||||
|
||||
return data_loader
|
||||
|
||||
def get_test_dataloader(self, test_dataset: Dataset) -> DataLoader:
|
||||
# We use the same batch_size as for eval.
|
||||
sampler = get_tpu_sampler(test_dataset) if is_tpu_available() else None
|
||||
if is_tpu_available():
|
||||
sampler = SequentialDistributedSampler(
|
||||
test_dataset, num_replicas=xm.xrt_world_size(), rank=xm.get_ordinal()
|
||||
)
|
||||
elif self.args.local_rank != -1:
|
||||
sampler = SequentialDistributedSampler(test_dataset)
|
||||
else:
|
||||
sampler = SequentialSampler(test_dataset)
|
||||
|
||||
data_loader = DataLoader(
|
||||
test_dataset,
|
||||
sampler=sampler,
|
||||
batch_size=self.args.eval_batch_size,
|
||||
shuffle=False,
|
||||
collate_fn=self.data_collator.collate_batch,
|
||||
)
|
||||
|
||||
if is_tpu_available():
|
||||
data_loader = pl.ParallelLoader(data_loader, [self.args.device]).per_device_loader(self.args.device)
|
||||
|
||||
return data_loader
|
||||
|
||||
def get_optimizers(
|
||||
@@ -293,15 +341,11 @@ class Trainer:
|
||||
self.model, log=os.getenv("WANDB_WATCH", "gradients"), log_freq=max(100, self.args.logging_steps)
|
||||
)
|
||||
|
||||
def num_examples(self, dataloader: Union[DataLoader, "pl.PerDeviceLoader"]) -> int:
|
||||
def num_examples(self, dataloader: DataLoader) -> int:
|
||||
"""
|
||||
Helper to get num of examples from a DataLoader, by accessing its Dataset.
|
||||
"""
|
||||
if is_tpu_available():
|
||||
assert isinstance(dataloader, pl.PerDeviceLoader)
|
||||
return len(dataloader._loader._loader.dataset)
|
||||
else:
|
||||
return len(dataloader.dataset)
|
||||
return len(dataloader.dataset)
|
||||
|
||||
def train(self, model_path: Optional[str] = None):
|
||||
"""
|
||||
@@ -331,11 +375,12 @@ class Trainer:
|
||||
and os.path.isfile(os.path.join(model_path, "scheduler.pt"))
|
||||
):
|
||||
# Load in optimizer and scheduler states
|
||||
optimizer.load_state_dict(torch.load(os.path.join(model_path, "optimizer.pt")))
|
||||
optimizer.load_state_dict(
|
||||
torch.load(os.path.join(model_path, "optimizer.pt"), map_location=self.args.device)
|
||||
)
|
||||
scheduler.load_state_dict(torch.load(os.path.join(model_path, "scheduler.pt")))
|
||||
|
||||
model = self.model
|
||||
model.to(self.args.device)
|
||||
if self.args.fp16:
|
||||
if not is_apex_available():
|
||||
raise ImportError("Please install apex from https://www.github.com/nvidia/apex to use fp16 training.")
|
||||
@@ -404,7 +449,17 @@ class Trainer:
|
||||
epochs_trained, int(num_train_epochs), desc="Epoch", disable=not self.is_local_master()
|
||||
)
|
||||
for epoch in train_iterator:
|
||||
epoch_iterator = tqdm(train_dataloader, desc="Iteration", disable=not self.is_local_master())
|
||||
if isinstance(train_dataloader, DataLoader) and isinstance(train_dataloader.sampler, DistributedSampler):
|
||||
train_dataloader.sampler.set_epoch(epoch)
|
||||
|
||||
if is_tpu_available():
|
||||
parallel_loader = pl.ParallelLoader(train_dataloader, [self.args.device]).per_device_loader(
|
||||
self.args.device
|
||||
)
|
||||
epoch_iterator = tqdm(parallel_loader, desc="Iteration", disable=not self.is_local_master())
|
||||
else:
|
||||
epoch_iterator = tqdm(train_dataloader, desc="Iteration", disable=not self.is_local_master())
|
||||
|
||||
for step, inputs in enumerate(epoch_iterator):
|
||||
|
||||
# Skip past any already trained steps if resuming training
|
||||
@@ -434,37 +489,43 @@ class Trainer:
|
||||
self.global_step += 1
|
||||
self.epoch = epoch + (step + 1) / len(epoch_iterator)
|
||||
|
||||
if self.is_local_master():
|
||||
if (self.args.logging_steps > 0 and self.global_step % self.args.logging_steps == 0) or (
|
||||
self.global_step == 1 and self.args.logging_first_step
|
||||
):
|
||||
logs: Dict[str, float] = {}
|
||||
logs["loss"] = (tr_loss - logging_loss) / self.args.logging_steps
|
||||
logs["learning_rate"] = scheduler.get_last_lr()[0]
|
||||
logging_loss = tr_loss
|
||||
if (self.args.logging_steps > 0 and self.global_step % self.args.logging_steps == 0) or (
|
||||
self.global_step == 1 and self.args.logging_first_step
|
||||
):
|
||||
logs: Dict[str, float] = {}
|
||||
logs["loss"] = (tr_loss - logging_loss) / self.args.logging_steps
|
||||
logs["learning_rate"] = (
|
||||
scheduler.get_last_lr()[0]
|
||||
)
|
||||
logging_loss = tr_loss
|
||||
|
||||
self._log(logs)
|
||||
self._log(logs)
|
||||
|
||||
if self.args.evaluate_during_training:
|
||||
self.evaluate()
|
||||
if self.args.evaluate_during_training:
|
||||
self.evaluate()
|
||||
|
||||
if self.args.save_steps > 0 and self.global_step % self.args.save_steps == 0:
|
||||
# In all cases (even distributed/parallel), self.model is always a reference
|
||||
# to the model we want to save.
|
||||
if hasattr(model, "module"):
|
||||
assert model.module is self.model
|
||||
else:
|
||||
assert model is self.model
|
||||
# Save model checkpoint
|
||||
output_dir = os.path.join(
|
||||
self.args.output_dir, f"{PREFIX_CHECKPOINT_DIR}-{self.global_step}"
|
||||
)
|
||||
if self.args.save_steps > 0 and self.global_step % self.args.save_steps == 0:
|
||||
# In all cases (even distributed/parallel), self.model is always a reference
|
||||
# to the model we want to save.
|
||||
if hasattr(model, "module"):
|
||||
assert model.module is self.model
|
||||
else:
|
||||
assert model is self.model
|
||||
# Save model checkpoint
|
||||
output_dir = os.path.join(self.args.output_dir, f"{PREFIX_CHECKPOINT_DIR}-{self.global_step}")
|
||||
|
||||
self.save_model(output_dir)
|
||||
self.save_model(output_dir)
|
||||
|
||||
if self.is_world_master():
|
||||
self._rotate_checkpoints()
|
||||
|
||||
if is_tpu_available():
|
||||
xm.rendezvous("saving_optimizer_states")
|
||||
xm.save(optimizer.state_dict(), os.path.join(output_dir, "optimizer.pt"))
|
||||
xm.save(scheduler.state_dict(), os.path.join(output_dir, "scheduler.pt"))
|
||||
elif self.is_world_master():
|
||||
torch.save(optimizer.state_dict(), os.path.join(output_dir, "optimizer.pt"))
|
||||
torch.save(scheduler.state_dict(), os.path.join(output_dir, "scheduler.pt"))
|
||||
logger.info("Saving optimizer and scheduler states to %s", output_dir)
|
||||
|
||||
if self.args.max_steps > 0 and self.global_step > self.args.max_steps:
|
||||
epoch_iterator.close()
|
||||
@@ -540,11 +601,30 @@ class Trainer:
|
||||
Saving best-practices: if you use default names for the model,
|
||||
you can reload it using from_pretrained().
|
||||
|
||||
Will only save from the master process.
|
||||
Will only save from the world_master process (unless in TPUs).
|
||||
"""
|
||||
if self.is_world_master():
|
||||
|
||||
if is_tpu_available():
|
||||
self._save_tpu(output_dir)
|
||||
elif self.is_world_master():
|
||||
self._save(output_dir)
|
||||
|
||||
def _save_tpu(self, output_dir: Optional[str] = None):
|
||||
output_dir = output_dir if output_dir is not None else self.args.output_dir
|
||||
logger.info("Saving model checkpoint to %s", output_dir)
|
||||
|
||||
if xm.is_master_ordinal():
|
||||
os.makedirs(output_dir, exist_ok=True)
|
||||
torch.save(self.args, os.path.join(output_dir, "training_args.bin"))
|
||||
|
||||
# Save a trained model and configuration using `save_pretrained()`.
|
||||
# They can then be reloaded using `from_pretrained()`
|
||||
if not isinstance(self.model, PreTrainedModel):
|
||||
raise ValueError("Trainer.model appears to not be a PreTrainedModel")
|
||||
|
||||
xm.rendezvous("saving_checkpoint")
|
||||
self.model.save_pretrained(output_dir)
|
||||
|
||||
def _save(self, output_dir: Optional[str] = None):
|
||||
output_dir = output_dir if output_dir is not None else self.args.output_dir
|
||||
os.makedirs(output_dir, exist_ok=True)
|
||||
@@ -627,6 +707,7 @@ class Trainer:
|
||||
In that case, this method will also return metrics, like in evaluate().
|
||||
"""
|
||||
test_dataloader = self.get_test_dataloader(test_dataset)
|
||||
|
||||
return self._prediction_loop(test_dataloader, description="Prediction")
|
||||
|
||||
def _prediction_loop(
|
||||
@@ -640,27 +721,29 @@ class Trainer:
|
||||
|
||||
prediction_loss_only = prediction_loss_only if prediction_loss_only is not None else self.prediction_loss_only
|
||||
|
||||
model = self.model
|
||||
# multi-gpu eval
|
||||
if self.args.n_gpu > 1 and not isinstance(self.model, torch.nn.DataParallel):
|
||||
model = torch.nn.DataParallel(self.model)
|
||||
if self.args.n_gpu > 1:
|
||||
model = torch.nn.DataParallel(model)
|
||||
else:
|
||||
model = self.model
|
||||
model.to(self.args.device)
|
||||
# Note: in torch.distributed mode, there's no point in wrapping the model
|
||||
# inside a DistributedDataParallel as we'll be under `no_grad` anyways.
|
||||
|
||||
if is_tpu_available():
|
||||
batch_size = dataloader._loader._loader.batch_size
|
||||
else:
|
||||
batch_size = dataloader.batch_size
|
||||
batch_size = dataloader.batch_size
|
||||
logger.info("***** Running %s *****", description)
|
||||
logger.info(" Num examples = %d", self.num_examples(dataloader))
|
||||
logger.info(" Batch size = %d", batch_size)
|
||||
eval_losses: List[float] = []
|
||||
preds: np.ndarray = None
|
||||
label_ids: np.ndarray = None
|
||||
preds: torch.Tensor = None
|
||||
label_ids: torch.Tensor = None
|
||||
model.eval()
|
||||
|
||||
if is_tpu_available():
|
||||
dataloader = pl.ParallelLoader(dataloader, [self.args.device]).per_device_loader(self.args.device)
|
||||
|
||||
for inputs in tqdm(dataloader, desc=description):
|
||||
has_labels = any(inputs.get(k) is not None for k in ["labels", "masked_lm_labels"])
|
||||
has_labels = any(inputs.get(k) is not None for k in ["labels", "lm_labels", "masked_lm_labels"])
|
||||
|
||||
for k, v in inputs.items():
|
||||
inputs[k] = v.to(self.args.device)
|
||||
@@ -675,19 +758,33 @@ class Trainer:
|
||||
|
||||
if not prediction_loss_only:
|
||||
if preds is None:
|
||||
preds = logits.detach().cpu().numpy()
|
||||
preds = logits.detach()
|
||||
else:
|
||||
preds = np.append(preds, logits.detach().cpu().numpy(), axis=0)
|
||||
preds = torch.cat((preds, logits.detach()), dim=0)
|
||||
if inputs.get("labels") is not None:
|
||||
if label_ids is None:
|
||||
label_ids = inputs["labels"].detach().cpu().numpy()
|
||||
label_ids = inputs["labels"].detach()
|
||||
else:
|
||||
label_ids = np.append(label_ids, inputs["labels"].detach().cpu().numpy(), axis=0)
|
||||
label_ids = torch.cat((label_ids, inputs["labels"].detach()), dim=0)
|
||||
|
||||
if is_tpu_available():
|
||||
if self.args.local_rank != -1:
|
||||
# In distributed mode, concatenate all results from all nodes:
|
||||
if preds is not None:
|
||||
preds = self.distributed_concat(preds, num_total_examples=self.num_examples(dataloader))
|
||||
if label_ids is not None:
|
||||
label_ids = self.distributed_concat(label_ids, num_total_examples=self.num_examples(dataloader))
|
||||
elif is_tpu_available():
|
||||
# tpu-comment: Get all predictions and labels from all worker shards of eval dataset
|
||||
preds = xm.mesh_reduce("eval_preds", preds, np.concatenate)
|
||||
label_ids = xm.mesh_reduce("eval_out_label_ids", label_ids, np.concatenate)
|
||||
if preds is not None:
|
||||
preds = xm.mesh_reduce("eval_preds", preds, torch.cat)
|
||||
if label_ids is not None:
|
||||
label_ids = xm.mesh_reduce("eval_label_ids", label_ids, torch.cat)
|
||||
|
||||
# Finally, turn the aggregated tensors into numpy arrays.
|
||||
if preds is not None:
|
||||
preds = preds.cpu().numpy()
|
||||
if label_ids is not None:
|
||||
label_ids = label_ids.cpu().numpy()
|
||||
|
||||
if self.compute_metrics is not None and preds is not None and label_ids is not None:
|
||||
metrics = self.compute_metrics(EvalPrediction(predictions=preds, label_ids=label_ids))
|
||||
@@ -702,3 +799,15 @@ class Trainer:
|
||||
metrics[f"eval_{key}"] = metrics.pop(key)
|
||||
|
||||
return PredictionOutput(predictions=preds, label_ids=label_ids, metrics=metrics)
|
||||
|
||||
def distributed_concat(self, tensor: torch.Tensor, num_total_examples: int) -> torch.Tensor:
|
||||
assert self.args.local_rank != -1
|
||||
|
||||
output_tensors = [tensor.clone() for _ in range(torch.distributed.get_world_size())]
|
||||
torch.distributed.all_gather(output_tensors, tensor)
|
||||
|
||||
concat = torch.cat(output_tensors, dim=0)
|
||||
|
||||
# truncate the dummy elements added by SequentialDistributedSampler
|
||||
output = concat[:num_total_examples]
|
||||
return output
|
||||
|
||||
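`SequentialDistributedSampler` and `distributed_concat` in the hunks above work as a pair: the sampler pads the index list so every rank gets the same number of samples, and the concat step truncates the gathered result back to the real dataset size. The sketch below reproduces only the index arithmetic in plain Python, with no `torch.distributed` involved. The function name `sequential_shards` is an assumption for illustration, and the all-gather is simulated by concatenating the per-rank lists in rank order.

```python
import math
from typing import List


def sequential_shards(num_examples: int, num_replicas: int) -> List[List[int]]:
    """Hypothetical helper: the per-rank index slices the sampler would yield."""
    num_samples = math.ceil(num_examples / num_replicas)
    total_size = num_samples * num_replicas

    indices = list(range(num_examples))
    # Pad by wrapping around to the start so the list divides evenly.
    # Like the original, this assumes the padding fits within the dataset.
    indices += indices[: total_size - len(indices)]
    assert len(indices) == total_size

    # Each rank takes one contiguous block.
    return [indices[r * num_samples : (r + 1) * num_samples] for r in range(num_replicas)]


shards = sequential_shards(num_examples=10, num_replicas=4)
print(shards)

# Simulated gather: concatenate in rank order, then drop the padded tail.
gathered = [i for shard in shards for i in shard]
restored = gathered[:10]
print(restored)
```

Since the blocks are contiguous and ordered by rank, truncating the concatenation to the dataset length recovers the original example order, which is why the padded duplicates never reach `compute_metrics`.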
@@ -141,7 +141,7 @@ class TFTrainer:
|
||||
self.optimizer = tf.keras.optimizers.get(
|
||||
{"class_name": self.args.optimizer_name, "config": {"learning_rate": self.args.learning_rate}}
|
||||
)
|
||||
logger.info("Created an/a {} optimizer".format(self.optimizer))
|
||||
logger.info("Created an/a {} optimizer".format(self.args.optimizer_name))
|
||||
|
||||
def _create_checkpoint_manager(self, max_to_keep: int = 5, load_model: bool = True) -> None:
|
||||
"""
|
||||
@@ -335,12 +335,8 @@ class TFTrainer:
|
||||
gradient / tf.cast(gradient_scale, gradient.dtype) for gradient in self.gradient_accumulator.gradients
|
||||
]
|
||||
gradients = [(tf.clip_by_value(grad, -self.args.max_grad_norm, self.args.max_grad_norm)) for grad in gradients]
|
||||
vars = self.model.trainable_variables
|
||||
|
||||
if self.args.mode in ["token-classification", "question-answering"]:
|
||||
vars = [var for var in self.model.trainable_variables if "pooler" not in var.name]
|
||||
|
||||
self.optimizer.apply_gradients(list(zip(gradients, vars)))
|
||||
self.optimizer.apply_gradients(list(zip(gradients, self.model.trainable_variables)))
|
||||
self.gradient_accumulator.reset()
|
||||
|
||||
def _accumulate_next_gradients(self):
|
||||
@@ -375,12 +371,10 @@ class TFTrainer:
|
||||
def _forward(self, features, labels):
|
||||
"""Forwards a training example and accumulates the gradients."""
|
||||
per_example_loss, _ = self._run_model(features, labels, True)
|
||||
vars = self.model.trainable_variables
|
||||
|
||||
if self.args.mode in ["token-classification", "question-answering"]:
|
||||
vars = [var for var in self.model.trainable_variables if "pooler" not in var.name]
|
||||
|
||||
gradients = self.optimizer.get_gradients(per_example_loss, vars)
|
||||
gradients = tf.gradients(per_example_loss, self.model.trainable_variables)
|
||||
gradients = [
|
||||
g if g is not None else tf.zeros_like(v) for g, v in zip(gradients, self.model.trainable_variables)
|
||||
]
|
||||
|
||||
self.gradient_accumulator(gradients)
|
||||
|
||||
|
||||
@@ -80,8 +80,9 @@ class AutoModelTest(unittest.TestCase):
|
||||
model, loading_info = AutoModelForPreTraining.from_pretrained(model_name, output_loading_info=True)
|
||||
self.assertIsNotNone(model)
|
||||
self.assertIsInstance(model, BertForPreTraining)
|
||||
for value in loading_info.values():
|
||||
self.assertEqual(len(value), 0)
|
||||
for key, value in loading_info.items():
|
||||
# Only one value should not be initialized and in the missing keys.
|
||||
self.assertEqual(len(value), 1 if key == "missing_keys" else 0)
|
||||
|
||||
@slow
|
||||
def test_lmhead_model_from_pretrained(self):
|
||||
|
||||
@@ -231,7 +231,7 @@ class BartTranslationTests(unittest.TestCase):
|
||||
"""Only load the model if needed."""
|
||||
if self._model is None:
|
||||
model = BartForConditionalGeneration.from_pretrained("mbart-large-en-ro")
|
||||
self._model = model
|
||||
self._model = model.to(torch_device)
|
||||
return self._model
|
||||
|
||||
@slow
|
||||
@@ -257,10 +257,7 @@ class BartTranslationTests(unittest.TestCase):
|
||||
)
|
||||
}
|
||||
translated_tokens = model.generate(input_ids=inputs["input_ids"].to(torch_device), num_beams=5,)
|
||||
decoded = [
|
||||
self.tokenizer.decode(g, skip_special_tokens=True, clean_up_tokenization_spaces=False)
|
||||
for g in translated_tokens
|
||||
]
|
||||
decoded = self.tokenizer.batch_decode(translated_tokens, skip_special_tokens=True)
|
||||
self.assertEqual(expected_translation_romanian, decoded[0])
|
||||
|
||||
def test_mbart_enro_config(self):
|
||||
@@ -576,11 +573,13 @@ class BartModelIntegrationTests(unittest.TestCase):
|
||||
|
||||
PGE_ARTICLE = """ PG&E stated it scheduled the blackouts in response to forecasts for high winds amid dry conditions. The aim is to reduce the risk of wildfires. Nearly 800 thousand customers were scheduled to be affected by the shutoffs which were expected to last through at least midday tomorrow."""
|
||||
EXPECTED_SUMMARY = "California's largest power company has begun shutting off power to tens of thousands of homes and businesses in the state."
|
||||
dct = tok.batch_encode_plus([PGE_ARTICLE], max_length=1024, pad_to_max_length=True, return_tensors="pt",)
|
||||
dct = tok.batch_encode_plus([PGE_ARTICLE], max_length=1024, pad_to_max_length=True, return_tensors="pt",).to(
|
||||
torch_device
|
||||
)
|
||||
|
||||
hypotheses_batch = model.generate(
|
||||
input_ids=dct["input_ids"].to(torch_device),
|
||||
attention_mask=dct["attention_mask"].to(torch_device),
|
||||
input_ids=dct["input_ids"],
|
||||
attention_mask=dct["attention_mask"],
|
||||
num_beams=2,
|
||||
max_length=62,
|
||||
min_length=11,
|
||||
@@ -590,9 +589,7 @@ class BartModelIntegrationTests(unittest.TestCase):
|
||||
decoder_start_token_id=model.config.eos_token_id,
|
||||
)
|
||||
|
||||
decoded = [
|
||||
tok.decode(g, skip_special_tokens=True, clean_up_tokenization_spaces=False) for g in hypotheses_batch
|
||||
]
|
||||
decoded = tok.batch_decode(hypotheses_batch, skip_special_tokens=True,)
|
||||
self.assertEqual(EXPECTED_SUMMARY, decoded[0])
|
||||
|
||||
def test_xsum_config_generation_params(self):
|
||||
|
||||
@@ -30,6 +30,7 @@ class CamembertModelIntegrationTest(unittest.TestCase):
|
||||
@slow
|
||||
def test_output_embeds_base_model(self):
|
||||
model = CamembertModel.from_pretrained("camembert-base")
|
||||
model.to(torch_device)
|
||||
|
||||
input_ids = torch.tensor(
|
||||
[[5, 121, 11, 660, 16, 730, 25543, 110, 83, 6]], device=torch_device, dtype=torch.long,
|
||||
|
||||
@@ -23,7 +23,7 @@ from typing import List
|
||||
|
||||
from transformers import is_torch_available
|
||||
|
||||
from .utils import require_torch, slow, torch_device
|
||||
from .utils import require_multigpu, require_torch, slow, torch_device
|
||||
|
||||
|
||||
if is_torch_available():
|
||||
@@ -758,6 +758,31 @@ class ModelTesterMixin:
|
||||
return True
|
||||
return False
|
||||
|
||||
@require_multigpu
|
||||
def test_multigpu_data_parallel_forward(self):
|
||||
config, inputs_dict = self.model_tester.prepare_config_and_inputs_for_common()
|
||||
|
||||
# some params shouldn't be scattered by nn.DataParallel
|
||||
# so just remove them if they are present.
|
||||
blacklist_non_batched_params = ["head_mask"]
|
||||
for k in blacklist_non_batched_params:
|
||||
inputs_dict.pop(k, None)
|
||||
|
||||
# move input tensors to cuda:0
|
||||
for k, v in inputs_dict.items():
|
||||
if torch.is_tensor(v):
|
||||
inputs_dict[k] = v.to(0)
|
||||
|
||||
for model_class in self.all_model_classes:
|
||||
model = model_class(config=config)
|
||||
model.to(0)
|
||||
model.eval()
|
||||
|
||||
# Wrap model in nn.DataParallel
|
||||
model = torch.nn.DataParallel(model)
|
||||
with torch.no_grad():
|
||||
_ = model(**inputs_dict)
|
||||
|
||||
|
||||
global_rng = random.Random()
|
||||
|
||||
|
||||
@@ -41,7 +41,7 @@ class CTRLModelTest(ModelTesterMixin, unittest.TestCase):
|
||||
def __init__(
|
||||
self,
|
||||
parent,
|
||||
batch_size=13,
|
||||
batch_size=14,
|
||||
seq_length=7,
|
||||
is_training=True,
|
||||
use_token_type_ids=True,
|
||||
@@ -219,6 +219,7 @@ class CTRLModelLanguageGenerationTest(unittest.TestCase):
|
||||
@slow
|
||||
def test_lm_generate_ctrl(self):
|
||||
model = CTRLLMHeadModel.from_pretrained("ctrl")
|
||||
model.to(torch_device)
|
||||
input_ids = torch.tensor(
|
||||
[[11859, 0, 1611, 8]], dtype=torch.long, device=torch_device
|
||||
) # Legal the president is
|
||||
|
||||
@@ -30,6 +30,7 @@ if is_torch_available():
|
||||
ElectraForMaskedLM,
|
||||
ElectraForTokenClassification,
|
||||
ElectraForPreTraining,
|
||||
ElectraForSequenceClassification,
|
||||
)
|
||||
from transformers.modeling_electra import ELECTRA_PRETRAINED_MODEL_ARCHIVE_MAP
|
||||
|
||||
@@ -242,6 +243,31 @@ class ElectraModelTest(ModelTesterMixin, unittest.TestCase):
|
||||
self.parent.assertListEqual(list(result["logits"].size()), [self.batch_size, self.seq_length])
|
||||
self.check_loss_output(result)
|
||||
|
||||
def create_and_check_electra_for_sequence_classification(
|
||||
self,
|
||||
config,
|
||||
input_ids,
|
||||
token_type_ids,
|
||||
input_mask,
|
||||
sequence_labels,
|
||||
token_labels,
|
||||
choice_labels,
|
||||
fake_token_labels,
|
||||
):
|
||||
config.num_labels = self.num_labels
|
||||
model = ElectraForSequenceClassification(config)
|
||||
model.to(torch_device)
|
||||
model.eval()
|
||||
loss, logits = model(
|
||||
input_ids, attention_mask=input_mask, token_type_ids=token_type_ids, labels=sequence_labels
|
||||
)
|
||||
result = {
|
||||
"loss": loss,
|
||||
"logits": logits,
|
||||
}
|
||||
self.parent.assertListEqual(list(result["logits"].size()), [self.batch_size, self.num_labels])
|
||||
self.check_loss_output(result)
|
||||
|
||||
def prepare_config_and_inputs_for_common(self):
|
||||
config_and_inputs = self.prepare_config_and_inputs()
|
||||
(
|
||||
@@ -280,6 +306,10 @@ class ElectraModelTest(ModelTesterMixin, unittest.TestCase):
|
||||
config_and_inputs = self.model_tester.prepare_config_and_inputs()
|
||||
self.model_tester.create_and_check_electra_for_pretraining(*config_and_inputs)
|
||||
|
||||
def test_for_sequence_classification(self):
|
||||
config_and_inputs = self.model_tester.prepare_config_and_inputs()
|
||||
self.model_tester.create_and_check_electra_for_sequence_classification(*config_and_inputs)
|
||||
|
||||
@slow
|
||||
def test_model_from_pretrained(self):
|
||||
for model_name in list(ELECTRA_PRETRAINED_MODEL_ARCHIVE_MAP.keys())[:1]:
|
||||
|
||||
@@ -329,5 +329,5 @@ class EncoderDecoderModelTest(unittest.TestCase):
|
||||
|
||||
@slow
|
||||
def test_real_bert_model_from_pretrained(self):
|
||||
model = EncoderDecoderModel.from_pretrained("bert-base-uncased", "bert-base-uncased")
|
||||
model = EncoderDecoderModel.from_encoder_decoder_pretrained("bert-base-uncased", "bert-base-uncased")
|
||||
self.assertIsNotNone(model)
|
||||
|
||||
@@ -46,7 +46,7 @@ class GPT2ModelTest(ModelTesterMixin, unittest.TestCase):
|
||||
def __init__(
|
||||
self,
|
||||
parent,
|
||||
batch_size=13,
|
||||
batch_size=14,
|
||||
seq_length=7,
|
||||
is_training=True,
|
||||
use_token_type_ids=True,
|
||||
@@ -343,6 +343,7 @@ class GPT2ModelLanguageGenerationTest(unittest.TestCase):
|
||||
@slow
|
||||
def test_lm_generate_gpt2(self):
|
||||
model = GPT2LMHeadModel.from_pretrained("gpt2")
|
||||
model.to(torch_device)
|
||||
input_ids = torch.tensor([[464, 3290]], dtype=torch.long, device=torch_device) # The dog
|
||||
expected_output_ids = [
|
||||
464,
|
||||
@@ -372,6 +373,7 @@ class GPT2ModelLanguageGenerationTest(unittest.TestCase):
|
||||
@slow
|
||||
def test_lm_generate_distilgpt2(self):
|
||||
model = GPT2LMHeadModel.from_pretrained("distilgpt2")
|
||||
model.to(torch_device)
|
||||
input_ids = torch.tensor([[464, 1893]], dtype=torch.long, device=torch_device) # The president
|
||||
expected_output_ids = [
|
||||
464,
|
||||
|
||||
253
tests/test_modeling_longformer.py
Normal file
@@ -0,0 +1,253 @@
|
||||
# coding=utf-8
|
||||
# Copyright 2018 The Google AI Language Team Authors.
|
||||
#
|
||||
# Licensed under the Apache License, Version 2.0 (the "License");
|
||||
# you may not use this file except in compliance with the License.
|
||||
# You may obtain a copy of the License at
|
||||
#
|
||||
# http://www.apache.org/licenses/LICENSE-2.0
|
||||
#
|
||||
# Unless required by applicable law or agreed to in writing, software
|
||||
# distributed under the License is distributed on an "AS IS" BASIS,
|
||||
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
|
||||
# See the License for the specific language governing permissions and
|
||||
# limitations under the License.
|
||||
|
||||
|
||||
import unittest
|
||||
|
||||
from transformers import is_torch_available
|
||||
|
||||
from .test_configuration_common import ConfigTester
|
||||
from .test_modeling_common import ModelTesterMixin, ids_tensor
|
||||
from .utils import require_torch, slow, torch_device
|
||||
|
||||
|
||||
if is_torch_available():
|
||||
import torch
|
||||
from transformers import (
|
||||
LongformerConfig,
|
||||
LongformerModel,
|
||||
LongformerForMaskedLM,
|
||||
)
|
||||
|
||||
|
||||
class LongformerModelTester(object):
|
||||
def __init__(
|
||||
self,
|
||||
parent,
|
||||
batch_size=13,
|
||||
seq_length=7,
|
||||
is_training=True,
|
||||
use_input_mask=True,
|
||||
use_token_type_ids=True,
|
||||
use_labels=True,
|
||||
vocab_size=99,
|
||||
hidden_size=32,
|
||||
num_hidden_layers=5,
|
||||
num_attention_heads=4,
|
||||
intermediate_size=37,
|
||||
hidden_act="gelu",
|
||||
hidden_dropout_prob=0.1,
|
||||
attention_probs_dropout_prob=0.1,
|
||||
max_position_embeddings=512,
|
||||
type_vocab_size=16,
|
||||
type_sequence_label_size=2,
|
||||
initializer_range=0.02,
|
||||
num_labels=3,
|
||||
num_choices=4,
|
||||
scope=None,
|
||||
attention_window=4,
|
||||
):
|
||||
self.parent = parent
|
||||
self.batch_size = batch_size
|
||||
self.seq_length = seq_length
|
||||
self.is_training = is_training
|
||||
self.use_input_mask = use_input_mask
|
||||
self.use_token_type_ids = use_token_type_ids
|
||||
self.use_labels = use_labels
|
||||
self.vocab_size = vocab_size
|
||||
self.hidden_size = hidden_size
|
||||
self.num_hidden_layers = num_hidden_layers
|
||||
self.num_attention_heads = num_attention_heads
|
||||
self.intermediate_size = intermediate_size
|
||||
self.hidden_act = hidden_act
|
||||
self.hidden_dropout_prob = hidden_dropout_prob
|
||||
self.attention_probs_dropout_prob = attention_probs_dropout_prob
|
||||
self.max_position_embeddings = max_position_embeddings
|
||||
self.type_vocab_size = type_vocab_size
|
||||
self.type_sequence_label_size = type_sequence_label_size
|
||||
self.initializer_range = initializer_range
|
||||
self.num_labels = num_labels
|
||||
self.num_choices = num_choices
|
||||
self.scope = scope
|
||||
self.attention_window = attention_window
|
||||
|
||||
# `ModelTesterMixin.test_attention_outputs` is expecting attention tensors to be of size
|
||||
# [num_attention_heads, encoder_seq_length, encoder_key_length], but LongformerSelfAttention
|
||||
# returns attention of shape [num_attention_heads, encoder_seq_length, self.attention_window + 1]
|
||||
# because its local attention only attends to `self.attention_window + 1` locations
|
||||
self.key_length = self.attention_window + 1
|
||||
|
||||
# because of padding, `encoder_seq_length` is different from `seq_length`. Relevant for
|
||||
# the `test_attention_outputs` and `test_hidden_states_output` tests
|
||||
self.encoder_seq_length = (
|
||||
self.seq_length + (self.attention_window - self.seq_length % self.attention_window) % self.attention_window
|
||||
)
|
||||
|
||||
def prepare_config_and_inputs(self):
|
||||
input_ids = ids_tensor([self.batch_size, self.seq_length], self.vocab_size)
|
||||
|
||||
input_mask = None
|
||||
if self.use_input_mask:
|
||||
input_mask = ids_tensor([self.batch_size, self.seq_length], vocab_size=2)
|
||||
|
||||
token_type_ids = None
|
||||
if self.use_token_type_ids:
|
||||
token_type_ids = ids_tensor([self.batch_size, self.seq_length], self.type_vocab_size)
|
||||
|
||||
sequence_labels = None
|
||||
token_labels = None
|
||||
choice_labels = None
|
||||
if self.use_labels:
|
||||
sequence_labels = ids_tensor([self.batch_size], self.type_sequence_label_size)
|
||||
token_labels = ids_tensor([self.batch_size, self.seq_length], self.num_labels)
|
||||
choice_labels = ids_tensor([self.batch_size], self.num_choices)
|
||||
|
||||
config = LongformerConfig(
|
||||
vocab_size=self.vocab_size,
|
||||
hidden_size=self.hidden_size,
|
||||
num_hidden_layers=self.num_hidden_layers,
|
||||
num_attention_heads=self.num_attention_heads,
|
||||
intermediate_size=self.intermediate_size,
|
||||
hidden_act=self.hidden_act,
|
||||
hidden_dropout_prob=self.hidden_dropout_prob,
|
||||
attention_probs_dropout_prob=self.attention_probs_dropout_prob,
|
||||
max_position_embeddings=self.max_position_embeddings,
|
||||
type_vocab_size=self.type_vocab_size,
|
||||
initializer_range=self.initializer_range,
|
||||
attention_window=self.attention_window,
|
||||
)
|
||||
|
||||
return config, input_ids, token_type_ids, input_mask, sequence_labels, token_labels, choice_labels
|
||||
|
||||
def check_loss_output(self, result):
|
||||
self.parent.assertListEqual(list(result["loss"].size()), [])
|
||||
|
||||
def create_and_check_longformer_model(
|
||||
self, config, input_ids, token_type_ids, input_mask, sequence_labels, token_labels, choice_labels
|
||||
):
|
||||
model = LongformerModel(config=config)
|
||||
model.to(torch_device)
|
||||
model.eval()
|
||||
sequence_output, pooled_output = model(input_ids, attention_mask=input_mask, token_type_ids=token_type_ids)
|
||||
sequence_output, pooled_output = model(input_ids, token_type_ids=token_type_ids)
|
||||
sequence_output, pooled_output = model(input_ids)
|
||||
|
||||
result = {
|
||||
"sequence_output": sequence_output,
|
||||
"pooled_output": pooled_output,
|
||||
}
|
||||
self.parent.assertListEqual(
|
||||
list(result["sequence_output"].size()), [self.batch_size, self.seq_length, self.hidden_size]
|
||||
)
|
||||
self.parent.assertListEqual(list(result["pooled_output"].size()), [self.batch_size, self.hidden_size])
|
||||
|
||||
def create_and_check_longformer_for_masked_lm(
|
||||
self, config, input_ids, token_type_ids, input_mask, sequence_labels, token_labels, choice_labels
|
||||
):
|
||||
model = LongformerForMaskedLM(config=config)
|
||||
model.to(torch_device)
|
||||
model.eval()
|
||||
loss, prediction_scores = model(
|
||||
input_ids, attention_mask=input_mask, token_type_ids=token_type_ids, masked_lm_labels=token_labels
|
||||
)
|
||||
result = {
|
||||
"loss": loss,
|
||||
"prediction_scores": prediction_scores,
|
||||
}
|
||||
self.parent.assertListEqual(
|
||||
list(result["prediction_scores"].size()), [self.batch_size, self.seq_length, self.vocab_size]
|
||||
)
|
||||
self.check_loss_output(result)
|
||||
|
||||
def prepare_config_and_inputs_for_common(self):
|
||||
config_and_inputs = self.prepare_config_and_inputs()
|
||||
(
|
||||
config,
|
||||
input_ids,
|
||||
token_type_ids,
|
||||
input_mask,
|
||||
sequence_labels,
|
||||
token_labels,
|
||||
choice_labels,
|
||||
) = config_and_inputs
|
||||
inputs_dict = {"input_ids": input_ids, "token_type_ids": token_type_ids, "attention_mask": input_mask}
|
||||
return config, inputs_dict
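Reviewer note: the `encoder_seq_length` expression in `LongformerModelTester.__init__` rounds `seq_length` up to the next multiple of `attention_window`. Below is a minimal standalone sketch of that arithmetic. The helper name `padded_length` is hypothetical and is not part of the test file.

```python
def padded_length(seq_length: int, attention_window: int) -> int:
    """Round seq_length up to the nearest multiple of attention_window."""
    # Same expression as in LongformerModelTester.__init__; the outer modulo
    # makes the padding 0 when seq_length is already a multiple of the window.
    padding = (attention_window - seq_length % attention_window) % attention_window
    return seq_length + padding


if __name__ == "__main__":
    # Tester defaults from the diff: seq_length=7, attention_window=4.
    print(padded_length(7, 4))
    # A length that is already a multiple of the window gets no padding.
    print(padded_length(8, 4))
```

With the tester defaults (`seq_length=7`, `attention_window=4`) the padded encoder length is 8, and `key_length` is `attention_window + 1 = 5`.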
|
||||
|
||||
|
||||
@require_torch
|
||||
class LongformerModelTest(ModelTesterMixin, unittest.TestCase):
|
||||
test_pruning = False # pruning is not supported
|
||||
test_headmasking = False # head masking is not supported
|
||||
test_torchscript = False
|
||||
|
||||
all_model_classes = (LongformerForMaskedLM, LongformerModel) if is_torch_available() else ()
|
||||
|
||||
def setUp(self):
|
||||
self.model_tester = LongformerModelTester(self)
|
||||
self.config_tester = ConfigTester(self, config_class=LongformerConfig, hidden_size=37)
|
||||
|
||||
def test_config(self):
|
||||
self.config_tester.run_common_tests()
|
||||
|
||||
def test_longformer_model(self):
|
||||
config_and_inputs = self.model_tester.prepare_config_and_inputs()
|
||||
self.model_tester.create_and_check_longformer_model(*config_and_inputs)
|
||||
|
||||
def test_longformer_for_masked_lm(self):
|
||||
config_and_inputs = self.model_tester.prepare_config_and_inputs()
|
||||
self.model_tester.create_and_check_longformer_for_masked_lm(*config_and_inputs)
|
||||
|
||||
|
||||
class LongformerModelIntegrationTest(unittest.TestCase):
|
||||
@slow
|
||||
def test_inference_no_head(self):
|
||||
model = LongformerModel.from_pretrained("longformer-base-4096")
|
||||
model.to(torch_device)
|
||||
|
||||
# 'Hello world! ' repeated 1000 times
|
||||
input_ids = torch.tensor(
|
||||
[[0] + [20920, 232, 328, 1437] * 1000 + [2]], dtype=torch.long, device=torch_device
|
||||
) # long input
|
||||
|
||||
attention_mask = torch.ones(input_ids.shape, dtype=torch.long, device=input_ids.device)
|
||||
attention_mask[:, [1, 4, 21]] = 2 # Set global attention on a few random positions
|
||||
|
||||
output = model(input_ids, attention_mask=attention_mask)[0]
|
||||
|
||||
expected_output_sum = torch.tensor(74585.8594, device=torch_device)
|
||||
expected_output_mean = torch.tensor(0.0243, device=torch_device)
|
||||
self.assertTrue(torch.allclose(output.sum(), expected_output_sum, atol=1e-4))
|
||||
self.assertTrue(torch.allclose(output.mean(), expected_output_mean, atol=1e-4))
|
||||
|
||||
@slow
|
||||
def test_inference_masked_lm(self):
|
||||
model = LongformerForMaskedLM.from_pretrained("longformer-base-4096")
|
||||
model.to(torch_device)
|
||||
|
||||
# 'Hello world! ' repeated 1000 times
|
||||
input_ids = torch.tensor(
|
||||
[[0] + [20920, 232, 328, 1437] * 1000 + [2]], dtype=torch.long, device=torch_device
|
||||
) # long input
|
||||
|
||||
loss, prediction_scores = model(input_ids, masked_lm_labels=input_ids)
|
||||
|
||||
expected_loss = torch.tensor(0.0620, device=torch_device)
|
||||
expected_prediction_scores_sum = torch.tensor(-6.1599e08, device=torch_device)
|
||||
expected_prediction_scores_mean = torch.tensor(-3.0622, device=torch_device)
|
||||
input_ids = input_ids.to(torch_device)
|
||||
|
||||
self.assertTrue(torch.allclose(loss, expected_loss, atol=1e-4))
|
||||
self.assertTrue(torch.allclose(prediction_scores.sum(), expected_prediction_scores_sum, atol=1e-4))
|
||||
self.assertTrue(torch.allclose(prediction_scores.mean(), expected_prediction_scores_mean, atol=1e-4))
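Reviewer note on the tolerances in `LongformerModelIntegrationTest`: `torch.allclose(input, other, rtol, atol)` checks `|input - other| <= atol + rtol * |other|`, and `rtol` defaults to `1e-05`. These tests pass only `atol`, so the relative term dominates for large expected values such as the prediction-score sum. The plain-Python sketch below mirrors that inequality for scalars. `scalar_allclose` and `tolerance` are hypothetical helpers, not PyTorch functions.

```python
def tolerance(expected: float, rtol: float = 1e-5, atol: float = 1e-8) -> float:
    """Largest absolute difference accepted for a given expected value."""
    return atol + rtol * abs(expected)


def scalar_allclose(actual: float, expected: float, rtol: float = 1e-5, atol: float = 1e-8) -> bool:
    """Scalar version of the closeness inequality documented for torch.allclose."""
    return abs(actual - expected) <= tolerance(expected, rtol=rtol, atol=atol)


if __name__ == "__main__":
    # Expected values copied from test_inference_masked_lm in the diff.
    print(tolerance(-6.1599e08, atol=1e-4))  # dominated by rtol * |expected|
    print(tolerance(0.0620, atol=1e-4))      # dominated by atol
```

Under these assumptions the sum check accepts deviations on the order of 6e3, while the loss check accepts only about 1e-4, so the loss and mean assertions are the strict ones.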
|
||||
@@ -129,11 +129,6 @@ class TestMarian_EN_DE_More(MarianIntegrationTest):
|
||||
max_indices = logits.argmax(-1)
|
||||
self.tokenizer.batch_decode(max_indices)
|
||||
|
||||
def test_tokenizer_equivalence(self):
|
||||
batch = self.tokenizer.prepare_translation_batch(["I am a small frog"]).to(torch_device)
|
||||
expected = [38, 121, 14, 697, 38848, 0]
|
||||
self.assertListEqual(expected, batch.input_ids[0].tolist())
|
||||
|
||||
def test_unk_support(self):
|
||||
t = self.tokenizer
|
||||
ids = t.prepare_translation_batch(["||"]).to(torch_device).input_ids[0].tolist()
|
||||
|
||||
@@ -227,6 +227,7 @@ class OPENAIGPTModelLanguageGenerationTest(unittest.TestCase):
|
||||
@slow
|
||||
def test_lm_generate_openai_gpt(self):
|
||||
model = OpenAIGPTLMHeadModel.from_pretrained("openai-gpt")
|
||||
model.to(torch_device)
|
||||
input_ids = torch.tensor([[481, 4735, 544]], dtype=torch.long, device=torch_device) # the president is
|
||||
expected_output_ids = [
|
||||
481,
|
||||
|
||||
@@ -19,7 +19,7 @@ from transformers import is_torch_available
|
||||
|
||||
from .test_configuration_common import ConfigTester
|
||||
from .test_modeling_common import ModelTesterMixin, floats_tensor, ids_tensor
|
||||
- from .utils import require_torch, slow, torch_device
|
||||
+ from .utils import require_multigpu, require_torch, slow, torch_device
|
||||
|
||||
|
||||
if is_torch_available():
|
||||
@@ -448,9 +448,14 @@ class ReformerTesterMixin:
|
||||
config_and_inputs = self.model_tester.prepare_config_and_inputs()
|
||||
self.model_tester.create_and_check_reformer_model_fp16_generate(*config_and_inputs)
|
||||
|
||||
@require_multigpu
|
||||
def test_multigpu_data_parallel_forward(self):
|
||||
# Opt-out of this test.
|
||||
pass
|
||||
|
||||
|
||||
@require_torch
|
||||
- class ReformerLocalAttnModelTest(ModelTesterMixin, ReformerTesterMixin, unittest.TestCase):
|
||||
+ class ReformerLocalAttnModelTest(ReformerTesterMixin, ModelTesterMixin, unittest.TestCase):
|
||||
all_model_classes = (ReformerModel, ReformerModelWithLMHead) if is_torch_available() else ()
|
||||
all_generative_model_classes = (ReformerModelWithLMHead,) if is_torch_available() else ()
|
||||
test_pruning = False
|
||||
@@ -504,7 +509,7 @@ class ReformerLocalAttnModelTest(ModelTesterMixin, ReformerTesterMixin, unittest
|
||||
|
||||
|
||||
@require_torch
|
||||
- class ReformerLSHAttnModelTest(ModelTesterMixin, unittest.TestCase, ReformerTesterMixin):
|
||||
+ class ReformerLSHAttnModelTest(ReformerTesterMixin, ModelTesterMixin, unittest.TestCase):
|
||||
all_model_classes = (ReformerModel, ReformerModelWithLMHead) if is_torch_available() else ()
|
||||
all_generative_model_classes = (ReformerModelWithLMHead,) if is_torch_available() else ()
|
||||
test_pruning = False
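Reviewer note: the Reformer hunks above add an opt-out `test_multigpu_data_parallel_forward` to `ReformerTesterMixin` and move that mixin to the front of the base-class list. A plausible reading (an assumption, since the commit message is not shown here) is that the reordering lets the mixin's override win in Python's method resolution order. The sketch below illustrates the mechanism with hypothetical stand-in classes; none of these names come from the repository.

```python
class CommonTests:
    """Stand-in for a shared mixin that defines the real test."""

    def test_forward(self):
        return "common test ran"


class OptOutTests:
    """Stand-in for a model-specific mixin that opts out of the test."""

    def test_forward(self):
        return "opted out"


class Base:
    """Stand-in for unittest.TestCase."""


class OldOrder(CommonTests, OptOutTests, Base):
    pass


class NewOrder(OptOutTests, CommonTests, Base):
    pass


if __name__ == "__main__":
    # The leftmost base class that defines the method wins.
    print(OldOrder().test_forward())
    print(NewOrder().test_forward())
    print([cls.__name__ for cls in NewOrder.__mro__])
```

With the old ordering the shared mixin's method shadows the opt-out. With the new ordering the opt-out is found first.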
|
||||
|
||||
@@ -304,6 +304,16 @@ class T5ModelTest(ModelTesterMixin, unittest.TestCase):
|
||||
output_with_past_cache = model.generate(input_ids[:1], num_beams=2, max_length=5, do_sample=True)
|
||||
self.parent.assertTrue(torch.all(output_with_past_cache == output_without_past_cache))
|
||||
|
||||
def create_and_check_t5_model_fp16_forward(
|
||||
self, config, input_ids, decoder_input_ids, attention_mask, decoder_attention_mask, lm_labels,
|
||||
):
|
||||
model = T5Model(config=config)
|
||||
model.to(torch_device)
|
||||
model.half()
|
||||
model.eval()
|
||||
output = model(input_ids, decoder_input_ids=input_ids, attention_mask=attention_mask)[0]
|
||||
self.parent.assertFalse(torch.isnan(output).any().item())
|
||||
|
||||
def prepare_config_and_inputs_for_common(self):
|
||||
config_and_inputs = self.prepare_config_and_inputs()
|
||||
(
|
||||
@@ -355,6 +365,11 @@ class T5ModelTest(ModelTesterMixin, unittest.TestCase):
|
||||
config_and_inputs = self.model_tester.prepare_config_and_inputs()
|
||||
self.model_tester.create_t5_and_check_t5_generate_with_past_key_value_states(*config_and_inputs)
|
||||
|
||||
@unittest.skipIf(torch_device == "cpu", "Cant do half precision")
|
||||
def test_t5_model_fp16_forward(self):
|
||||
config_and_inputs = self.model_tester.prepare_config_and_inputs()
|
||||
self.model_tester.create_and_check_t5_model_fp16_forward(*config_and_inputs)
|
||||
|
||||
@slow
|
||||
def test_model_from_pretrained(self):
|
||||
for model_name in list(T5_PRETRAINED_MODEL_ARCHIVE_MAP.keys())[:1]:
|
||||
@@ -429,6 +444,7 @@ class T5ModelIntegrationTests(unittest.TestCase):
|
||||
)
|
||||
|
||||
input_ids = tok.encode(model.config.prefix + original_input, return_tensors="pt")
|
||||
input_ids = input_ids.to(torch_device)
|
||||
|
||||
output = model.generate(
|
||||
input_ids=input_ids,
|
||||
@@ -456,6 +472,7 @@ class T5ModelIntegrationTests(unittest.TestCase):
|
||||
expected_translation = "Cette section d'images provenant de l'enregistrement infrarouge effectué par le télescope Spitzer montre un « portrait familial » de générations innombrables de étoiles : les plus anciennes sont observées sous forme de pointes bleues, alors que les « nouveau-nés » de couleur rose dans la salle des accouchements doivent être plus difficiles "
|
||||
|
||||
input_ids = tok.encode(model.config.prefix + original_input, return_tensors="pt")
|
||||
input_ids = input_ids.to(torch_device)
|
||||
|
||||
output = model.generate(
|
||||
input_ids=input_ids,
|
||||
@@ -483,6 +500,7 @@ class T5ModelIntegrationTests(unittest.TestCase):
|
||||
expected_translation = "Taco Bell a declarat că intenţionează să adauge 2 000 de locaţii în SUA până în 2022."
|
||||
|
||||
input_ids = tok.encode(model.config.prefix + original_input, return_tensors="pt")
|
||||
input_ids = input_ids.to(torch_device)
|
||||
|
||||
output = model.generate(
|
||||
input_ids=input_ids,
|
||||
|
||||
@@ -21,7 +21,7 @@ from transformers import is_torch_available
|
||||
|
||||
from .test_configuration_common import ConfigTester
|
||||
from .test_modeling_common import ModelTesterMixin, ids_tensor
|
||||
- from .utils import require_torch, slow, torch_device
|
||||
+ from .utils import require_multigpu, require_torch, slow, torch_device
|
||||
|
||||
|
||||
if is_torch_available():
|
||||
@@ -43,7 +43,7 @@ class TransfoXLModelTest(ModelTesterMixin, unittest.TestCase):
|
||||
def __init__(
|
||||
self,
|
||||
parent,
|
||||
- batch_size=13,
|
||||
+ batch_size=14,
|
||||
seq_length=7,
|
||||
mem_len=30,
|
||||
clamp_len=15,
|
||||
@@ -207,6 +207,11 @@ class TransfoXLModelTest(ModelTesterMixin, unittest.TestCase):
|
||||
output_result = self.model_tester.create_transfo_xl_lm_head(*config_and_inputs)
|
||||
self.model_tester.check_transfo_xl_lm_head_output(output_result)
|
||||
|
||||
@require_multigpu
|
||||
def test_multigpu_data_parallel_forward(self):
|
||||
# Opt-out of this test.
|
||||
pass
|
||||
|
||||
@slow
|
||||
def test_model_from_pretrained(self):
|
||||
for model_name in list(TRANSFO_XL_PRETRAINED_MODEL_ARCHIVE_MAP.keys())[:1]:
|
||||
@@ -218,6 +223,7 @@ class TransfoXLModelLanguageGenerationTest(unittest.TestCase):
|
||||
@slow
|
||||
def test_lm_generate_transfo_xl_wt103(self):
|
||||
model = TransfoXLLMHeadModel.from_pretrained("transfo-xl-wt103")
|
||||
model.to(torch_device)
|
||||
input_ids = torch.tensor(
|
||||
[
|
||||
[
|
||||
|
||||
@@ -434,6 +434,7 @@ class XLMModelLanguageGenerationTest(unittest.TestCase):
|
||||
@slow
|
||||
def test_lm_generate_xlm_mlm_en_2048(self):
|
||||
model = XLMWithLMHeadModel.from_pretrained("xlm-mlm-en-2048")
|
||||
model.to(torch_device)
|
||||
input_ids = torch.tensor([[14, 447]], dtype=torch.long, device=torch_device) # the president
|
||||
expected_output_ids = [
|
||||
14,
|
||||
@@ -459,4 +460,4 @@ class XLMModelLanguageGenerationTest(unittest.TestCase):
|
||||
] # the president the president the president the president the president the president the president the president the president the president
|
||||
# TODO(PVP): this and other input_ids I tried for generation give pretty bad results. Not sure why. Model might just not be made for auto-regressive inference
|
||||
output_ids = model.generate(input_ids, do_sample=False)
|
||||
- self.assertListEqual(output_ids[0].numpy().tolist(), expected_output_ids)
|
||||
+ self.assertListEqual(output_ids[0].cpu().numpy().tolist(), expected_output_ids)
|
||||
|
||||
Some files were not shown because too many files have changed in this diff.