Compare commits

...

57 Commits

Author SHA1 Message Date
LysandreJik
11c3257a18 unpin isort for pypi
Some checks failed
GitHub-hosted runner / check_code_quality (push) Has been cancelled
2020-04-06 10:06:41 -04:00
LysandreJik
36bffc81b3 Release: v2.8.0 2020-04-06 10:03:53 -04:00
Patrick von Platen
2ee410560e [Generate, Test] Split generate test function into beam search, no beam search (#3601)
* split beam search and no beam search test

* fix test

* clean generate tests
2020-04-06 10:37:05 +02:00
Patrick von Platen
1789c7daf1 fix argument order (#3637) 2020-04-05 12:33:41 +02:00
Patrick von Platen
b809d2f073 Fix TF T5 docstring (#3636) 2020-04-05 12:23:09 +02:00
Timo Moeller
4ab8ab4f50 Adjust model card to reflect changes to vocabulary
(cherry picked from commit 8e25c4bf2838211378db4d93e7f9722386cc1a04)
2020-04-04 15:27:41 -04:00
ktrapeznikov
ac40eed1a5 Create README.md
adding readme for 
ktrapeznikov/albert-xlarge-v2-squad-v2
2020-04-04 15:18:54 -04:00
ktrapeznikov
fd9995ebc5 Create README.md 2020-04-04 15:18:31 -04:00
Julien Chaumond
5d912e7ed4 Tweak typing for #3566 2020-04-04 15:04:03 -04:00
Julien Chaumond
94eb68d742 weigths*weights 2020-04-04 15:03:26 -04:00
Manuel Romero
243e687be6 Create model card 2020-04-04 08:20:34 -04:00
Julien Chaumond
3e4b4dd190 [model_cards] Link to ExBERT visualisation
Hat/tip @bhoov @HendrikStrobelt @sebastianGehrmann

Also cc @srush and @thomwolf
2020-04-03 20:03:29 -04:00
Max Ryabinin
c6acd246ec Speed up GELU computation with torch.jit (#2988)
* Compile gelu_new with torchscript

* Compile _gelu_python with torchscript

* Wrap gelu_new with torch.jit for torch>=1.4
2020-04-03 15:20:21 -04:00
Lysandre Debut
d5d7d88612 ELECTRA (#3257)
* Electra wip

* helpers

* Electra wip

* Electra v1

* ELECTRA may be saved/loaded

* Generator & Discriminator

* Embedding size instead of halving the hidden size

* ELECTRA Tokenizer

* Revert BERT helpers

* ELECTRA Conversion script

* Archive maps

* PyTorch tests

* Start fixing tests

* Tests pass

* Same configuration for both models

* Compatible with base + large

* Simplification + weight tying

* Archives

* Auto + Renaming to standard names

* ELECTRA is uncased

* Tests

* Slight API changes

* Update tests

* wip

* ElectraForTokenClassification

* temp

* Simpler arch + tests

Removed ElectraForPreTraining which will be in a script

* Conversion script

* Auto model

* Update links to S3

* Split ElectraForPreTraining and ElectraForTokenClassification

* Actually test PreTraining model

* Remove num_labels from configuration

* wip

* wip

* From discriminator and generator to electra

* Slight API changes

* Better naming

* TensorFlow ELECTRA tests

* Accurate conversion script

* Added to conversion script

* Fast ELECTRA tokenizer

* Style

* Add ELECTRA to README

* Modeling Pytorch Doc + Real style

* TF Docs

* Docs

* Correct links

* Correct model intialized

* random fixes

* style

* Addressing Patrick's and Sam's comments

* Correct links in docs
2020-04-03 14:10:54 -04:00
Yohei Tamura
8594dd80dd BertJapaneseTokenizer accept options for mecab (#3566)
* BertJapaneseTokenizer accept options for mecab

* black

* fix mecab_option to Option[str]
2020-04-03 11:12:19 -04:00
HUSEIN ZOLKEPLI
216e167ce6 Added albert-base-bahasa-cased README and fixed tiny-bert-bahasa-cased README (#3613)
* add bert bahasa readme

* update readme

* update readme

* added xlnet

* added tiny-bert and fix xlnet readme

* added albert base
2020-04-03 09:28:43 -04:00
ahotrod
1ac6a246d8 Update README.md (#3604)
Update AutoModel & AutoTokernizer loading.
2020-04-03 09:28:25 -04:00
ahotrod
e91692f4a3 Update README.md (#3603) 2020-04-03 09:27:57 -04:00
HenrykBorzymowski
8e287d507d corrected mistake in polish model cards (#3611)
* added model_cards for polish squad models

* corrected mistake in polish design cards

Co-authored-by: Henryk Borzymowski <henryk.borzymowski@pwc.com>
2020-04-03 09:07:15 -04:00
redewiedergabe
81484b447b Create README.md (#3568)
* Create README.md

* added meta block (language: german)

* Added additional information about test data
2020-04-02 21:48:31 -04:00
ahotrod
9f6349aba9 Create README.md 2020-04-02 21:43:12 -04:00
Henryk Borzymowski
ddb1ce7418 added model_cards for polish squad models 2020-04-02 21:40:16 -04:00
Patrick von Platen
f68d22850c delete bogus print statement (#3595) 2020-04-02 21:49:34 +02:00
Nicolas
c50aa67bff Resizing embedding matrix before sending it to the optimizer. (#3532)
* Resizing embedding matrix after sending it to the optimizer prevents from updating the newly resized matrix.

* Remove space for style matter
2020-04-02 15:00:05 -04:00
Mark Kockerbeck
1b10159950 Adding should_continue check for retraining (#3509) 2020-04-02 14:07:08 -04:00
Patrick von Platen
390c128592 [Encoder-Decoder] Force models outputs to always have batch_size as their first dim (#3536)
* solve conflicts

* improve comments
2020-04-02 15:18:33 +02:00
Patrick von Platen
ab5d06a094 [T5, examples] replace heavy t5 models with tiny random models (#3556)
* replace heavy t5 models with tiny random models as was done by sshleifer

* fix isort
2020-04-02 12:34:05 +02:00
Patrick von Platen
a4ee4da18a [T5, TF 2.2] change tf t5 argument naming (#3547)
* change tf t5 argument naming for TF 2.2

* correct bug in testing
2020-04-01 22:04:20 +02:00
Patrick von Platen
06dd597552 fix bug in warnings T5 pipelines (#3545) 2020-04-01 21:59:12 +02:00
Anirudh Srinivasan
9de9ceb6c5 Correct output shape for Bert NSP models in docs (#3482) 2020-04-01 15:04:38 -04:00
Patrick von Platen
b815edf69f [T5, Testst] Add extensive hard-coded integration tests and make sure PT and TF give equal results (#3550)
* add some t5 integration tests

* finish summarization and translation integration tests for T5 - results loook good

* add tf test

* fix == vs is bug

* fix tf beam search error and make tf t5 tests pass
2020-04-01 18:01:33 +02:00
HUSEIN ZOLKEPLI
8538ce9044 Add tiny-bert-bahasa-cased model card (#3567)
* add bert bahasa readme

* update readme

* update readme

* added xlnet

* added tiny-bert and fix xlnet readme
2020-04-01 07:15:00 -04:00
Manuel Romero
c1a6252be1 Create model card (#3557)
Create model card for: distilbert-multi-finetuned-for-xqua-on-tydiqa
2020-04-01 07:14:23 -04:00
Julien Chaumond
50e15c825c Tokenizers: Start cleaning examples a little (#3455)
* Start cleaning examples

* Fixup
2020-04-01 07:13:40 -04:00
Patrick von Platen
b38d552a92 [Generate] Add bad words list argument to the generate function (#3367)
* add bad words list

* make style

* add bad_words_tokens

* make style

* better naming

* make style

* fix typo
2020-03-31 18:42:31 +02:00
Patrick von Platen
ae6834e028 [Examples] Clean summarization and translation example testing files for T5 and Bart (#3514)
* fix conflicts

* add model size argument to summarization

* correct wrong import

* fix isort

* correct imports

* other isort make style

* make style
2020-03-31 17:54:13 +02:00
Manuel Romero
0373b60c4c Update README.md (#3552)
- Show that the last uploaded version was trained on more data (custom_license files)
2020-03-31 10:40:34 -04:00
Patrick von Platen
83d1fbcff6 [Docs] Add usage examples for translation and summarization (#3538) 2020-03-31 09:36:03 -04:00
Patrick von Platen
55bcae7f25 remove useless and confusing lm_labels line (#3531) 2020-03-31 09:32:25 -04:00
Patrick von Platen
42e1e3c67f Update usage doc regarding generate fn (#3504) 2020-03-31 09:31:46 -04:00
Patrick von Platen
57b0fab692 Add better explanation to check docs locally. (#3459) 2020-03-31 09:30:17 -04:00
Manuel Romero
a8d4dff0a1 Update README.md (#3470)
Fix typo
2020-03-31 08:01:09 -04:00
Manuel Romero
4a5663568f Create card for the model: GPT-2-finetuned-covid-bio-medrxiv (#3453) 2020-03-31 08:01:03 -04:00
Branden Chan
bbedb59675 Create README.md (#3393)
* Create README.md

* Update README.md
2020-03-31 08:00:35 -04:00
Manuel Romero
c2cf192943 Add link to 16 POS tags model (#3465) 2020-03-31 08:00:00 -04:00
Gabriele Sarti
c82ef72158 Added CovidBERT-NLI model card (#3477) 2020-03-31 07:59:49 -04:00
Manuel Romero
b48a1f08c1 Add text shown in example of usage (#3464) 2020-03-31 07:59:36 -04:00
Manuel Romero
99833a9cbf Create model card (#3487) 2020-03-31 07:59:22 -04:00
Sho Arora
ebceeeacda Add electra and alectra model cards (#3524) 2020-03-31 07:58:48 -04:00
Leandro von Werra
a6c4ee27fd Add model cards (#3537)
* feat: add model card bert-imdb

* feat: add model card gpt2-imdb-pos

* feat: add model card gpt2-imdb
2020-03-31 07:54:45 -04:00
Ethan Perez
e5c393dceb [Bug fix] Using loaded checkpoint with --do_predict (instead of… (#3437)
* Using loaded checkpoint with --do_predict

Without this fix, I'm getting near-random validation performance for a trained model, and the validation performance differs per validation run. I think this happens since the `model` variable isn't set with the loaded checkpoint, so I'm using a randomly initialized model. Looking at the model activations, they differ each time I run evaluation (but they don't with this fix).

* Update checkpoint loading

* Fixing model loading
2020-03-30 17:06:08 -04:00
Sam Shleifer
8deff3acf2 [bart-tiny-random] Put a 5MB model on S3 to allow faster exampl… (#3488) 2020-03-30 12:28:27 -04:00
dougian
1f72865726 [BART] Update encoder and decoder on set_input_embedding (#3501)
Co-authored-by: Ioannis Douratsos <ioannisd@amazon.com>
2020-03-30 12:20:37 -04:00
Julien Chaumond
cc598b312b [InputExample] Unfreeze for now, cf. #3423 2020-03-30 10:41:49 -04:00
Julien Plu
d38bbb225f Update the NER TF script (#3511)
* Update the NER TF script to remove the softmax and make the pad token label id to -1

* Reformat the quality and style

Co-authored-by: Julien Plu <julien.plu@adevinta.com>
2020-03-30 09:50:12 -04:00
LysandreJik
eff757f2e3 Re-pin isort version 2020-03-30 09:00:47 -04:00
LysandreJik
a009d751c2 Un-pin isort for v2.7.0 pypi 2020-03-30 08:55:10 -04:00
95 changed files with 4519 additions and 292 deletions

View File

@@ -164,8 +164,9 @@ At some point in the future, you'll be able to seamlessly move from pre-training
14. **[MMBT](https://github.com/facebookresearch/mmbt/)** (from Facebook), released together with the paper a [Supervised Multimodal Bitransformers for Classifying Images and Text](https://arxiv.org/pdf/1909.02950.pdf) by Douwe Kiela, Suvrat Bhooshan, Hamed Firooz, Davide Testuggine.
15. **[FlauBERT](https://github.com/getalp/Flaubert)** (from CNRS) released with the paper [FlauBERT: Unsupervised Language Model Pre-training for French](https://arxiv.org/abs/1912.05372) by Hang Le, Loïc Vial, Jibril Frej, Vincent Segonne, Maximin Coavoux, Benjamin Lecouteux, Alexandre Allauzen, Benoît Crabbé, Laurent Besacier, Didier Schwab.
16. **[BART](https://github.com/pytorch/fairseq/tree/master/examples/bart)** (from Facebook) released with the paper [BART: Denoising Sequence-to-Sequence Pre-training for Natural Language Generation, Translation, and Comprehension](https://arxiv.org/pdf/1910.13461.pdf) by Mike Lewis, Yinhan Liu, Naman Goyal, Marjan Ghazvininejad, Abdelrahman Mohamed, Omer Levy, Ves Stoyanov and Luke Zettlemoyer.
17. **[Other community models](https://huggingface.co/models)**, contributed by the [community](https://huggingface.co/users).
18. Want to contribute a new model? We have added a **detailed guide and templates** to guide you in the process of adding a new model. You can find them in the [`templates`](./templates) folder of the repository. Be sure to check the [contributing guidelines](./CONTRIBUTING.md) and contact the maintainers or open an issue to collect feedbacks before starting your PR.
17. **[ELECTRA](https://github.com/google-research/electra)** (from Google Research/Stanford University) released with the paper [ELECTRA: Pre-training text encoders as discriminators rather than generators](https://arxiv.org/abs/2003.10555) by Kevin Clark, Minh-Thang Luong, Quoc V. Le, Christopher D. Manning.
18. **[Other community models](https://huggingface.co/models)**, contributed by the [community](https://huggingface.co/users).
19. Want to contribute a new model? We have added a **detailed guide and templates** to guide you in the process of adding a new model. You can find them in the [`templates`](./templates) folder of the repository. Be sure to check the [contributing guidelines](./CONTRIBUTING.md) and contact the maintainers or open an issue to collect feedbacks before starting your PR.
These implementations have been tested on several datasets (see the example scripts) and should match the performances of the original implementations (e.g. ~93 F1 on SQuAD for BERT Whole-Word-Masking, ~88 F1 on RocStories for OpenAI GPT, ~18.3 perplexity on WikiText 103 for Transformer-XL, ~0.916 Peason R coefficient on STS-B for XLNet). You can find more details on the performances in the Examples section of the [documentation](https://huggingface.co/transformers/examples.html).

View File

@@ -47,6 +47,8 @@ Once you have setup `sphinx`, you can build the documentation by running the fol
make html
```
A folder called ``_build/html`` should have been created. You can now open the file ``_build/html/index.html`` in your browser.
---
**NOTE**

View File

@@ -26,7 +26,7 @@ author = u'huggingface'
# The short X.Y version
version = u''
# The full version, including alpha/beta/rc tags
release = u'2.7.0'
release = u'2.8.0'
# -- General configuration ---------------------------------------------------

View File

@@ -104,3 +104,4 @@ The library currently contains PyTorch and Tensorflow implementations, pre-train
model_doc/flaubert
model_doc/bart
model_doc/t5
model_doc/electra

View File

@@ -27,7 +27,7 @@ loss = outputs[0]
# In transformers you can also have access to the logits:
loss, logits = outputs[:2]
# And even the attention weigths if you configure the model to output them (and other outputs too, see the docstrings and documentation)
# And even the attention weights if you configure the model to output them (and other outputs too, see the docstrings and documentation)
model = BertForSequenceClassification.from_pretrained('bert-base-uncased', output_attentions=True)
outputs = model(input_ids, labels=labels)
loss, logits, attentions = outputs

View File

@@ -0,0 +1,115 @@
ELECTRA
----------------------------------------------------
The ELECTRA model was proposed in the paper.
`ELECTRA: Pre-training Text Encoders as Discriminators Rather Than Generators <https://openreview.net/pdf?id=r1xMH1BtvB>`__.
ELECTRA is a new pre-training approach which trains two transformer models: the generator and the discriminator. The
generator's role is to replace tokens in a sequence, and is therefore trained as a masked language model. The discriminator,
which is the model we're interested in, tries to identify which tokens were replaced by the generator in the sequence.
The abstract from the paper is the following:
*Masked language modeling (MLM) pre-training methods such as BERT corrupt
the input by replacing some tokens with [MASK] and then train a model to
reconstruct the original tokens. While they produce good results when transferred
to downstream NLP tasks, they generally require large amounts of compute to be
effective. As an alternative, we propose a more sample-efficient pre-training task
called replaced token detection. Instead of masking the input, our approach
corrupts it by replacing some tokens with plausible alternatives sampled from a small
generator network. Then, instead of training a model that predicts the original
identities of the corrupted tokens, we train a discriminative model that predicts
whether each token in the corrupted input was replaced by a generator sample
or not. Thorough experiments demonstrate this new pre-training task is more
efficient than MLM because the task is defined over all input tokens rather than
just the small subset that was masked out. As a result, the contextual representations
learned by our approach substantially outperform the ones learned by BERT
given the same model size, data, and compute. The gains are particularly strong
for small models; for example, we train a model on one GPU for 4 days that
outperforms GPT (trained using 30x more compute) on the GLUE natural language
understanding benchmark. Our approach also works well at scale, where it
performs comparably to RoBERTa and XLNet while using less than 1/4 of their
compute and outperforms them when using the same amount of compute.*
Tips:
- ELECTRA is the pre-training approach, therefore there is nearly no changes done to the underlying model: BERT. The
only change is the separation of the embedding size and the hidden size -> The embedding size is generally smaller,
while the hidden size is larger. An additional projection layer (linear) is used to project the embeddings from
their embedding size to the hidden size. In the case where the embedding size is the same as the hidden size, no
projection layer is used.
- The ELECTRA checkpoints saved using `Google Research's implementation <https://github.com/google-research/electra>`__
contain both the generator and discriminator. The conversion script requires the user to name which model to export
into the correct architecture. Once converted to the HuggingFace format, these checkpoints may be loaded into all
available ELECTRA models, however. This means that the discriminator may be loaded in the `ElectraForMaskedLM` model,
and the generator may be loaded in the `ElectraForPreTraining` model (the classification head will be randomly
initialized as it doesn't exist in the generator).
ElectraConfig
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
.. autoclass:: transformers.ElectraConfig
:members:
ElectraTokenizer
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
.. autoclass:: transformers.ElectraTokenizer
:members:
ElectraModel
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
.. autoclass:: transformers.ElectraModel
:members:
ElectraForPreTraining
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
.. autoclass:: transformers.ElectraForPreTraining
:members:
ElectraForMaskedLM
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
.. autoclass:: transformers.ElectraForMaskedLM
:members:
ElectraForTokenClassification
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
.. autoclass:: transformers.ElectraForTokenClassification
:members:
TFElectraModel
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
.. autoclass:: transformers.TFElectraModel
:members:
TFElectraForPreTraining
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
.. autoclass:: transformers.TFElectraForPreTraining
:members:
TFElectraForMaskedLM
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
.. autoclass:: transformers.TFElectraForMaskedLM
:members:
TFElectraForTokenClassification
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
.. autoclass:: transformers.TFElectraForTokenClassification
:members:

View File

@@ -420,7 +420,7 @@ to generate the tokens following the initial sequence in PyTorch, and creating a
sequence = f"Hugging Face is based in DUMBO, New York City, and is"
input = tokenizer.encode(sequence, return_tensors="pt")
generated = model.generate(input, max_length=50)
generated = model.generate(input, max_length=50, do_sample=True)
resulting_string = tokenizer.decode(generated.tolist()[0])
print(resulting_string)
@@ -432,14 +432,10 @@ to generate the tokens following the initial sequence in PyTorch, and creating a
model = TFAutoModelWithLMHead.from_pretrained("gpt2")
sequence = f"Hugging Face is based in DUMBO, New York City, and is"
generated = tokenizer.encode(sequence)
input = tokenizer.encode(sequence, return_tensors="tf")
generated = model.generate(input, max_length=50, do_sample=True)
for i in range(50):
predictions = model(tf.constant([generated]))[0]
token = tf.argmax(predictions[0], axis=1)[-1].numpy()
generated += [token]
resulting_string = tokenizer.decode(generated)
resulting_string = tokenizer.decode(generated.tolist()[0])
print(resulting_string)
@@ -594,4 +590,138 @@ following array should be the output:
::
[('[CLS]', 'O'), ('Hu', 'I-ORG'), ('##gging', 'I-ORG'), ('Face', 'I-ORG'), ('Inc', 'I-ORG'), ('.', 'O'), ('is', 'O'), ('a', 'O'), ('company', 'O'), ('based', 'O'), ('in', 'O'), ('New', 'I-LOC'), ('York', 'I-LOC'), ('City', 'I-LOC'), ('.', 'O'), ('Its', 'O'), ('headquarters', 'O'), ('are', 'O'), ('in', 'O'), ('D', 'I-LOC'), ('##UM', 'I-LOC'), ('##BO', 'I-LOC'), (',', 'O'), ('therefore', 'O'), ('very', 'O'), ('##c', 'O'), ('##lose', 'O'), ('to', 'O'), ('the', 'O'), ('Manhattan', 'I-LOC'), ('Bridge', 'I-LOC'), ('.', 'O'), ('[SEP]', 'O')]
[('[CLS]', 'O'), ('Hu', 'I-ORG'), ('##gging', 'I-ORG'), ('Face', 'I-ORG'), ('Inc', 'I-ORG'), ('.', 'O'), ('is', 'O'), ('a', 'O'), ('company', 'O'), ('based', 'O'), ('in', 'O'), ('New', 'I-LOC'), ('York', 'I-LOC'), ('City', 'I-LOC'), ('.', 'O'), ('Its', 'O'), ('headquarters', 'O'), ('are', 'O'), ('in', 'O'), ('D', 'I-LOC'), ('##UM', 'I-LOC'), ('##BO', 'I-LOC'), (',', 'O'), ('therefore', 'O'), ('very', 'O'), ('##c', 'O'), ('##lose', 'O'), ('to', 'O'), ('the', 'O'), ('Manhattan', 'I-LOC'), ('Bridge', 'I-LOC'), ('.', 'O'), ('[SEP]', 'O')]
Summarization
----------------------------------------------------
Summarization is the task of summarizing a text / an article into a shorter text.
An example of a summarization dataset is the CNN / Daily Mail dataset, which consists of long news articles and was created for the task of summarization.
If you would like to fine-tune a model on a summarization task, you may leverage the ``examples/summarization/bart/run_train.sh`` (leveraging pytorch-lightning) script.
Here is an example using the pipelines do to summarization.
It leverages a Bart model that was fine-tuned on the CNN / Daily Mail data set.
::
from transformers import pipeline
summarizer = pipeline("summarization")
ARTICLE = """ New York (CNN)When Liana Barrientos was 23 years old, she got married in Westchester County, New York.
A year later, she got married again in Westchester County, but to a different man and without divorcing her first husband.
Only 18 days after that marriage, she got hitched yet again. Then, Barrientos declared "I do" five more times, sometimes only within two weeks of each other.
In 2010, she married once more, this time in the Bronx. In an application for a marriage license, she stated it was her "first and only" marriage.
Barrientos, now 39, is facing two criminal counts of "offering a false instrument for filing in the first degree," referring to her false statements on the
2010 marriage license application, according to court documents.
Prosecutors said the marriages were part of an immigration scam.
On Friday, she pleaded not guilty at State Supreme Court in the Bronx, according to her attorney, Christopher Wright, who declined to comment further.
After leaving court, Barrientos was arrested and charged with theft of service and criminal trespass for allegedly sneaking into the New York subway through an emergency exit, said Detective
Annette Markowski, a police spokeswoman. In total, Barrientos has been married 10 times, with nine of her marriages occurring between 1999 and 2002.
All occurred either in Westchester County, Long Island, New Jersey or the Bronx. She is believed to still be married to four men, and at one time, she was married to eight men at once, prosecutors say.
Prosecutors said the immigration scam involved some of her husbands, who filed for permanent residence status shortly after the marriages.
Any divorces happened only after such filings were approved. It was unclear whether any of the men will be prosecuted.
The case was referred to the Bronx District Attorney\'s Office by Immigration and Customs Enforcement and the Department of Homeland Security\'s
Investigation Division. Seven of the men are from so-called "red-flagged" countries, including Egypt, Turkey, Georgia, Pakistan and Mali.
Her eighth husband, Rashid Rajput, was deported in 2006 to his native Pakistan after an investigation by the Joint Terrorism Task Force.
If convicted, Barrientos faces up to four years in prison. Her next court appearance is scheduled for May 18.
"""
print(summarizer(ARTICLE, max_length=130, min_length=30))
Because the summarization pipeline depends on the ``PretrainedModel.generate()`` method, we can override the default arguments
of ``PretrainedModel.generate()`` directly in the pipeline as is shown for ``max_length`` and ``min_length`` above.
This outputs the following summary:
::
Liana Barrientos has been married 10 times, sometimes within two weeks of each other. Prosecutors say the marriages were part of an immigration scam. She pleaded not guilty at State Supreme Court in the Bronx on Friday.
Here is an example doing summarization using a model and a tokenizer. The process is the following:
- Instantiate a tokenizer and a model from the checkpoint name. Summarization is usually done using an encoder-decoder model, such as ``Bart`` or ``T5``.
- Define the article that should be summarizaed.
- Leverage the ``PretrainedModel.generate()`` method.
- Add the T5 specific prefix "summarize: ".
Here Google`s T5 model is used that was only pre-trained on a multi-task mixed data set (including CNN / Daily Mail), but nevertheless yields very good results.
::
## PYTORCH CODE
from transformers import AutoModelWithLMHead, AutoTokenizer
model = AutoModelWithLMHead.from_pretrained("t5-base")
tokenizer = AutoTokenizer.from_pretrained("t5-base")
# T5 uses a max_length of 512 so we cut the article to 512 tokens.
inputs = tokenizer.encode("summarize: " + ARTICLE, return_tensors="pt", max_length=512)
outputs = model.generate(inputs, max_length=150, min_length=40, length_penalty=2.0, num_beams=4, early_stopping=True)
print(outputs)
## TENSORFLOW CODE
from transformers import TFAutoModelWithLMHead, AutoTokenizer
model = TFAutoModelWithLMHead.from_pretrained("t5-base")
tokenizer = AutoTokenizer.from_pretrained("t5-base")
# T5 uses a max_length of 512 so we cut the article to 512 tokens.
inputs = tokenizer.encode("summarize: " + ARTICLE, return_tensors="tf", max_length=512)
outputs = model.generate(inputs, max_length=150, min_length=40, length_penalty=2.0, num_beams=4, early_stopping=True)
print(outputs)
Translation
----------------------------------------------------
Translation is the task of translating a text from one language to another.
An example of a translation dataset is the WMT English to German dataset, which has English sentences as the input data
and German sentences as the target data.
Here is an example using the pipelines do to translation.
It leverages a T5 model that was only pre-trained on a multi-task mixture dataset (including WMT), but yields impressive
translation results nevertheless.
::
from transformers import pipeline
translator = pipeline("translation_en_to_de")
print(translator("Hugging Face is a technology company based in New York and Paris", max_length=40))
Because the translation pipeline depends on the ``PretrainedModel.generate()`` method, we can override the default arguments
of ``PretrainedModel.generate()`` directly in the pipeline as is shown for ``max_length`` above.
This outputs the following translation into German:
::
Hugging Face ist ein Technologieunternehmen mit Sitz in New York und Paris.
Here is an example doing translation using a model and a tokenizer. The process is the following:
- Instantiate a tokenizer and a model from the checkpoint name. Summarization is usually done using an encoder-decoder model, such as ``Bart`` or ``T5``.
- Define the article that should be summarizaed.
- Leverage the ``PretrainedModel.generate()`` method.
- Add the T5 specific prefix "translate English to German: "
::
## PYTORCH CODE
from transformers import AutoModelWithLMHead, AutoTokenizer
model = AutoModelWithLMHead.from_pretrained("t5-base")
tokenizer = AutoTokenizer.from_pretrained("t5-base")
inputs = tokenizer.encode("translate English to German: Hugging Face is a technology company based in New York and Paris", return_tensors="pt")
outputs = model.generate(inputs, max_length=40, num_beams=4, early_stopping=True)
print(outputs)
## TENSORFLOW CODE
from transformers import TFAutoModelWithLMHead, AutoTokenizer
model = TFAutoModelWithLMHead.from_pretrained("t5-base")
tokenizer = AutoTokenizer.from_pretrained("t5-base")
inputs = tokenizer.encode("translate English to German: Hugging Face is a technology company based in New York and Paris", return_tensors="tf")
outputs = model.generate(inputs, max_length=40, num_beams=4, early_stopping=True)
print(outputs)

View File

@@ -68,7 +68,7 @@ class GLUETransformer(BaseTransformer):
output_mode=args.glue_output_mode,
pad_on_left=bool(args.model_type in ["xlnet"]), # pad on the left for xlnet
pad_token=self.tokenizer.convert_tokens_to_ids([self.tokenizer.pad_token])[0],
pad_token_segment_id=4 if args.model_type in ["xlnet"] else 0,
pad_token_segment_id=self.tokenizer.pad_token_type_id,
)
logger.info("Saving features into cached file %s", cached_features_file)
torch.save(features, cached_features_file)
@@ -192,5 +192,5 @@ if __name__ == "__main__":
# Optionally, predict on dev set and write to output_dir
if args.do_predict:
checkpoints = list(sorted(glob.glob(os.path.join(args.output_dir, "checkpointepoch=*.ckpt"), recursive=True)))
GLUETransformer.load_from_checkpoint(checkpoints[-1])
model = model.load_from_checkpoint(checkpoints[-1])
trainer.test(model)

View File

@@ -342,8 +342,8 @@ def load_and_cache_examples(args, task, tokenizer, evaluate=False):
max_length=args.max_seq_length,
output_mode=output_mode,
pad_on_left=bool(args.model_type in ["xlnet"]), # pad on the left for xlnet
pad_token=tokenizer.convert_tokens_to_ids([tokenizer.pad_token])[0],
pad_token_segment_id=4 if args.model_type in ["xlnet"] else 0,
pad_token=tokenizer.pad_token_id,
pad_token_segment_id=tokenizer.pad_token_type_id,
)
if args.local_rank in [-1, 0]:
logger.info("Saving features into cached file %s", cached_features_file)

View File

@@ -348,8 +348,8 @@ def load_and_cache_examples(args, tokenizer, labels, pad_token_label_id, mode):
# roberta uses an extra separator b/w pairs of sentences, cf. github.com/pytorch/fairseq/commit/1684e166e3da03f5b600dbb7855cb98ddfcd0805
pad_on_left=bool(args.model_type in ["xlnet"]),
# pad on the left for xlnet
pad_token=tokenizer.convert_tokens_to_ids([tokenizer.pad_token])[0],
pad_token_segment_id=4 if args.model_type in ["xlnet"] else 0,
pad_token=tokenizer.pad_token_id,
pad_token_segment_id=tokenizer.pad_token_type_id,
pad_token_label_id=pad_token_label_id,
)
if args.local_rank in [-1, 0]:

View File

@@ -64,8 +64,8 @@ class NERTransformer(BaseTransformer):
sep_token=self.tokenizer.sep_token,
sep_token_extra=bool(args.model_type in ["roberta"]),
pad_on_left=bool(args.model_type in ["xlnet"]),
pad_token=self.tokenizer.convert_tokens_to_ids([self.tokenizer.pad_token])[0],
pad_token_segment_id=4 if args.model_type in ["xlnet"] else 0,
pad_token=self.tokenizer.pad_token_id,
pad_token_segment_id=self.tokenizer.pad_token_type_id,
pad_token_label_id=self.pad_token_label_id,
)
logger.info("Saving features into cached file %s", cached_features_file)
@@ -192,5 +192,5 @@ if __name__ == "__main__":
# https://github.com/PyTorchLightning/pytorch-lightning/blob/master\
# /pytorch_lightning/callbacks/model_checkpoint.py#L169
checkpoints = list(sorted(glob.glob(os.path.join(args.output_dir, "checkpointepoch=*.ckpt"), recursive=True)))
NERTransformer.load_from_checkpoint(checkpoints[-1])
model = model.load_from_checkpoint(checkpoints[-1])
trainer.test(model)

View File

@@ -157,7 +157,9 @@ def train(
writer = tf.summary.create_file_writer("/tmp/mylogs")
with strategy.scope():
loss_fct = tf.keras.losses.SparseCategoricalCrossentropy(reduction=tf.keras.losses.Reduction.NONE)
loss_fct = tf.keras.losses.SparseCategoricalCrossentropy(
from_logits=True, reduction=tf.keras.losses.Reduction.NONE
)
optimizer = create_optimizer(args["learning_rate"], num_train_steps, args["warmup_steps"])
if args["fp16"]:
@@ -205,11 +207,9 @@ def train(
with tf.GradientTape() as tape:
logits = model(train_features["input_ids"], **inputs)[0]
logits = tf.reshape(logits, (-1, len(labels) + 1))
active_loss = tf.reshape(train_features["input_mask"], (-1,))
active_logits = tf.boolean_mask(logits, active_loss)
train_labels = tf.reshape(train_labels, (-1,))
active_labels = tf.boolean_mask(train_labels, active_loss)
active_loss = tf.reshape(train_labels, (-1,)) != pad_token_label_id
active_logits = tf.boolean_mask(tf.reshape(logits, (-1, len(labels))), active_loss)
active_labels = tf.boolean_mask(tf.reshape(train_labels, (-1,)), active_loss)
cross_entropy = loss_fct(active_labels, active_logits)
loss = tf.reduce_sum(cross_entropy) * (1.0 / train_batch_size)
grads = tape.gradient(loss, model.trainable_variables)
@@ -329,11 +329,9 @@ def evaluate(args, strategy, model, tokenizer, labels, pad_token_label_id, mode)
with strategy.scope():
logits = model(eval_features["input_ids"], **inputs)[0]
tmp_logits = tf.reshape(logits, (-1, len(labels) + 1))
active_loss = tf.reshape(eval_features["input_mask"], (-1,))
active_logits = tf.boolean_mask(tmp_logits, active_loss)
tmp_eval_labels = tf.reshape(eval_labels, (-1,))
active_labels = tf.boolean_mask(tmp_eval_labels, active_loss)
active_loss = tf.reshape(eval_labels, (-1,)) != pad_token_label_id
active_logits = tf.boolean_mask(tf.reshape(logits, (-1, len(labels))), active_loss)
active_labels = tf.boolean_mask(tf.reshape(eval_labels, (-1,)), active_loss)
cross_entropy = loss_fct(active_labels, active_logits)
loss += tf.reduce_sum(cross_entropy) * (1.0 / eval_batch_size)
@@ -436,8 +434,8 @@ def load_and_cache_examples(args, tokenizer, labels, pad_token_label_id, batch_s
# roberta uses an extra separator b/w pairs of sentences, cf. github.com/pytorch/fairseq/commit/1684e166e3da03f5b600dbb7855cb98ddfcd0805
pad_on_left=bool(args["model_type"] in ["xlnet"]),
# pad on the left for xlnet
pad_token=tokenizer.convert_tokens_to_ids([tokenizer.pad_token])[0],
pad_token_segment_id=4 if args["model_type"] in ["xlnet"] else 0,
pad_token=tokenizer.pad_token_id,
pad_token_segment_id=tokenizer.pad_token_type_id,
pad_token_label_id=pad_token_label_id,
)
logging.info("Saving features into cached file %s", cached_features_file)
@@ -497,8 +495,8 @@ def main(_):
)
labels = get_labels(args["labels"])
num_labels = len(labels) + 1
pad_token_label_id = 0
num_labels = len(labels)
pad_token_label_id = -1
config = AutoConfig.from_pretrained(
args["config_name"] if args["config_name"] else args["model_name_or_path"],
num_labels=num_labels,
@@ -522,7 +520,6 @@ def main(_):
config=config,
cache_dir=args["cache_dir"] if args["cache_dir"] else None,
)
model.layers[-1].activation = tf.keras.activations.softmax
train_batch_size = args["per_device_train_batch_size"] * args["n_device"]
train_dataset, num_train_examples = load_and_cache_examples(

View File

@@ -360,8 +360,8 @@ def load_and_cache_examples(args, task, tokenizer, evaluate=False):
max_length=args.max_seq_length,
output_mode=output_mode,
pad_on_left=bool(args.model_type in ["xlnet"]), # pad on the left for xlnet
pad_token=tokenizer.convert_tokens_to_ids([tokenizer.pad_token])[0],
pad_token_segment_id=4 if args.model_type in ["xlnet"] else 0,
pad_token=tokenizer.pad_token_id,
pad_token_segment_id=tokenizer.pad_token_type_id,
)
if args.local_rank in [-1, 0]:
logger.info("Saving features into cached file %s", cached_features_file)

View File

@@ -233,6 +233,9 @@ def train(args, train_dataset, model: PreTrainedModel, tokenizer: PreTrainedToke
else:
t_total = len(train_dataloader) // args.gradient_accumulation_steps * args.num_train_epochs
model = model.module if hasattr(model, "module") else model # Take care of distributed/parallel training
model.resize_token_embeddings(len(tokenizer))
# Prepare optimizer and schedule (linear warmup and decay)
no_decay = ["bias", "LayerNorm.weight"]
optimizer_grouped_parameters = [
@@ -309,9 +312,6 @@ def train(args, train_dataset, model: PreTrainedModel, tokenizer: PreTrainedToke
tr_loss, logging_loss = 0.0, 0.0
model_to_resize = model.module if hasattr(model, "module") else model # Take care of distributed/parallel training
model_to_resize.resize_token_embeddings(len(tokenizer))
model.zero_grad()
train_iterator = trange(
epochs_trained, int(args.num_train_epochs), desc="Epoch", disable=args.local_rank not in [-1, 0]
@@ -624,6 +624,7 @@ def main():
and os.listdir(args.output_dir)
and args.do_train
and not args.overwrite_output_dir
and not args.should_continue
):
raise ValueError(
"Output directory ({}) already exists and is not empty. Use --overwrite_output_dir to overcome.".format(

View File

@@ -361,7 +361,7 @@ def load_and_cache_examples(args, task, tokenizer, evaluate=False, test=False):
args.max_seq_length,
tokenizer,
pad_on_left=bool(args.model_type in ["xlnet"]), # pad on the left for xlnet
pad_token_segment_id=4 if args.model_type in ["xlnet"] else 0,
pad_token_segment_id=tokenizer.pad_token_type_id,
)
if args.local_rank in [-1, 0]:
logger.info("Saving features into cached file %s", cached_features_file)

View File

@@ -350,8 +350,8 @@ def load_and_cache_examples(args, task, tokenizer, evaluate=False):
max_length=args.max_seq_length,
output_mode=output_mode,
pad_on_left=False,
pad_token=tokenizer.convert_tokens_to_ids([tokenizer.pad_token])[0],
pad_token_segment_id=0,
pad_token=tokenizer.pad_token_id,
pad_token_segment_id=tokenizer.pad_token_type_id,
)
if args.local_rank in [-1, 0]:
logger.info("Saving features into cached file %s", cached_features_file)

View File

@@ -16,15 +16,17 @@ def chunks(lst, n):
yield lst[i : i + n]
def generate_summaries(lns, out_file, batch_size=8, device=DEFAULT_DEVICE):
def generate_summaries(
examples: list, out_file: str, model_name: str, batch_size: int = 8, device: str = DEFAULT_DEVICE
):
fout = Path(out_file).open("w")
model = BartForConditionalGeneration.from_pretrained("bart-large-cnn", output_past=True,).to(device)
model = BartForConditionalGeneration.from_pretrained(model_name, output_past=True,).to(device)
tokenizer = BartTokenizer.from_pretrained("bart-large")
max_length = 140
min_length = 55
for batch in tqdm(list(chunks(lns, batch_size))):
for batch in tqdm(list(chunks(examples, batch_size))):
dct = tokenizer.batch_encode_plus(batch, max_length=1024, return_tensors="pt", pad_to_max_length=True)
summaries = model.generate(
input_ids=dct["input_ids"].to(device),
@@ -43,7 +45,7 @@ def generate_summaries(lns, out_file, batch_size=8, device=DEFAULT_DEVICE):
fout.flush()
def _run_generate():
def run_generate():
parser = argparse.ArgumentParser()
parser.add_argument(
"source_path", type=str, help="like cnn_dm/test.source",
@@ -51,6 +53,9 @@ def _run_generate():
parser.add_argument(
"output_path", type=str, help="where to save summaries",
)
parser.add_argument(
"model_name", type=str, default="bart-large-cnn", help="like bart-large-cnn",
)
parser.add_argument(
"--device", type=str, required=False, default=DEFAULT_DEVICE, help="cuda, cuda:1, cpu etc.",
)
@@ -58,9 +63,9 @@ def _run_generate():
"--bs", type=int, default=8, required=False, help="batch size: how many to summarize at a time",
)
args = parser.parse_args()
lns = [" " + x.rstrip() for x in open(args.source_path).readlines()]
generate_summaries(lns, args.output_path, batch_size=args.bs, device=args.device)
examples = [" " + x.rstrip() for x in open(args.source_path).readlines()]
generate_summaries(examples, args.output_path, args.model_name, batch_size=args.bs, device=args.device)
if __name__ == "__main__":
_run_generate()
run_generate()

View File

@@ -1,16 +1,13 @@
import logging
import os
import sys
import tempfile
import unittest
from pathlib import Path
from unittest.mock import patch
from .evaluate_cnn import _run_generate
from .evaluate_cnn import run_generate
output_file_name = "output_bart_sum.txt"
articles = [" New York (CNN)When Liana Barrientos was 23 years old, she got married in Westchester County."]
logging.basicConfig(level=logging.DEBUG)
@@ -25,8 +22,11 @@ class TestBartExamples(unittest.TestCase):
tmp = Path(tempfile.gettempdir()) / "utest_generations_bart_sum.hypo"
with tmp.open("w") as f:
f.write("\n".join(articles))
testargs = ["evaluate_cnn.py", str(tmp), output_file_name]
output_file_name = Path(tempfile.gettempdir()) / "utest_output_bart_sum.hypo"
testargs = ["evaluate_cnn.py", str(tmp), str(output_file_name), "sshleifer/bart-tiny-random"]
with patch.object(sys, "argv", testargs):
_run_generate()
run_generate()
self.assertTrue(Path(output_file_name).exists())
os.remove(Path(output_file_name))

View File

@@ -64,7 +64,7 @@ def run_generate():
parser.add_argument(
"model_size",
type=str,
help="T5 model size, either 't5-small', 't5-base' or 't5-large'. Defaults to base.",
help="T5 model size, either 't5-small', 't5-base', 't5-large', 't5-3b', 't5-11b'. Defaults to 't5-base'.",
default="t5-base",
)
parser.add_argument(

View File

@@ -1,5 +1,4 @@
import logging
import os
import sys
import tempfile
import unittest
@@ -26,10 +25,20 @@ class TestT5Examples(unittest.TestCase):
tmp = Path(tempfile.gettempdir()) / "utest_generations_t5_sum.hypo"
with tmp.open("w") as f:
f.write("\n".join(articles))
testargs = ["evaluate_cnn.py", "t5-small", str(tmp), output_file_name, str(tmp), score_file_name]
output_file_name = Path(tempfile.gettempdir()) / "utest_output_t5_sum.hypo"
score_file_name = Path(tempfile.gettempdir()) / "utest_score_t5_sum.hypo"
testargs = [
"evaluate_cnn.py",
"patrickvonplaten/t5-tiny-random",
str(tmp),
str(output_file_name),
str(tmp),
str(score_file_name),
]
with patch.object(sys, "argv", testargs):
run_generate()
self.assertTrue(Path(output_file_name).exists())
self.assertTrue(Path(score_file_name).exists())
os.remove(Path(output_file_name))
os.remove(Path(score_file_name))

View File

@@ -14,13 +14,13 @@ def chunks(lst, n):
yield lst[i : i + n]
def generate_translations(lns, output_file_path, batch_size, device):
def generate_translations(lns, output_file_path, model_size, batch_size, device):
output_file = Path(output_file_path).open("w")
model = T5ForConditionalGeneration.from_pretrained("t5-base")
model = T5ForConditionalGeneration.from_pretrained(model_size)
model.to(device)
tokenizer = T5Tokenizer.from_pretrained("t5-base")
tokenizer = T5Tokenizer.from_pretrained(model_size)
# update config with summarization specific params
task_specific_params = model.config.task_specific_params
@@ -52,6 +52,12 @@ def calculate_bleu_score(output_lns, refs_lns, score_path):
def run_generate():
parser = argparse.ArgumentParser()
parser.add_argument(
"model_size",
type=str,
help="T5 model size, either 't5-small', 't5-base', 't5-large', 't5-3b', 't5-11b'. Defaults to 't5-base'.",
default="t5-base",
)
parser.add_argument(
"input_path", type=str, help="like wmt/newstest2013.en",
)
@@ -78,7 +84,7 @@ def run_generate():
input_lns = [x.strip().replace(dash_pattern[0], dash_pattern[1]) for x in open(args.input_path).readlines()]
generate_translations(input_lns, args.output_path, args.batch_size, args.device)
generate_translations(input_lns, args.output_path, args.model_size, args.batch_size, args.device)
output_lns = [x.strip() for x in open(args.output_path).readlines()]
refs_lns = [x.strip().replace(dash_pattern[0], dash_pattern[1]) for x in open(args.reference_path).readlines()]

View File

@@ -1,5 +1,4 @@
import logging
import os
import sys
import tempfile
import unittest
@@ -33,11 +32,19 @@ class TestT5Examples(unittest.TestCase):
with tmp_target.open("w") as f:
f.write("\n".join(translation))
testargs = ["evaluate_wmt.py", str(tmp_source), output_file_name, str(tmp_target), score_file_name]
output_file_name = Path(tempfile.gettempdir()) / "utest_output_trans.hypo"
score_file_name = Path(tempfile.gettempdir()) / "utest_score.hypo"
testargs = [
"evaluate_wmt.py",
"patrickvonplaten/t5-tiny-random",
str(tmp_source),
str(output_file_name),
str(tmp_target),
str(score_file_name),
]
with patch.object(sys, "argv", testargs):
run_generate()
self.assertTrue(Path(output_file_name).exists())
self.assertTrue(Path(score_file_name).exists())
os.remove(Path(output_file_name))
os.remove(Path(score_file_name))

View File

@@ -64,27 +64,8 @@ TensorFlow: 2.1.0
Python: 3.7.6
```
### Inferencing / prediction works with the current Transformers v2.4.1
### Access this albert_xxlargev1_sqd2_512 fine-tuned model with "tried & true" code:
### Access this albert_xxlargev1_sqd2_512 fine-tuned model with:
```python
config_class, model_class, tokenizer_class = \
AlbertConfig, AlbertForQuestionAnswering, AlbertTokenizer
model_name_or_path = "ahotrod/albert_xxlargev1_squad2_512"
config = config_class.from_pretrained(model_name_or_path)
tokenizer = tokenizer_class.from_pretrained(model_name_or_path, do_lower_case=True)
model = model_class.from_pretrained(model_name_or_path, config=config)
```
### or the AutoModels (AutoConfig, AutoTokenizer & AutoModel) should also work, however I have yet to use them in my app & confirm:
```python
from transformers import AutoConfig, AutoTokenizer, AutoModel
model_name_or_path = "ahotrod/albert_xxlargev1_squad2_512"
config = AutoConfig.from_pretrained(model_name_or_path)
tokenizer = AutoTokenizer.from_pretrained(model_name_or_path, do_lower_case=True)
model = AutoModel.from_pretrained(model_name_or_path, config=config)
```
tokenizer = AutoTokenizer.from_pretrained("ahotrod/albert_xxlargev1_squad2_512")
model = AutoModelForQuestionAnswering.from_pretrained("ahotrod/albert_xxlargev1_squad2_512")

View File

@@ -0,0 +1,68 @@
## RoBERTa-large language model fine-tuned on SQuAD2.0
### with the following results:
```
"exact": 84.02257222269014,
"f1": 87.47063479332766,
"total": 11873,
"HasAns_exact": 81.19095816464238,
"HasAns_f1": 88.0969714745582,
"HasAns_total": 5928,
"NoAns_exact": 86.84608915054667,
"NoAns_f1": 86.84608915054667,
"NoAns_total": 5945,
"best_exact": 84.02257222269014,
"best_exact_thresh": 0.0,
"best_f1": 87.47063479332759,
"best_f1_thresh": 0.0
```
### from script:
```
python -m torch.distributed.launch --nproc_per_node=2 ${RUN_SQUAD_DIR}/run_squad.py \
--model_type roberta \
--model_name_or_path roberta-large \
--do_train \
--train_file ${SQUAD_DIR}/train-v2.0.json \
--predict_file ${SQUAD_DIR}/dev-v2.0.json \
--version_2_with_negative \
--num_train_epochs 2 \
--warmup_steps 328 \
--weight_decay 0.01 \
--do_lower_case \
--learning_rate 1.5e-5 \
--max_seq_length 512 \
--doc_stride 128 \
--save_steps 1000 \
--per_gpu_train_batch_size 1 \
--gradient_accumulation_steps 24 \
--logging_steps 50 \
--threads 10 \
--overwrite_cache \
--overwrite_output_dir \
--output_dir ${MODEL_PATH}
python ${RUN_SQUAD_DIR}/run_squad.py \
--model_type roberta \
--model_name_or_path ${MODEL_PATH} \
--do_eval \
--train_file ${SQUAD_DIR}/train-v2.0.json \
--predict_file ${SQUAD_DIR}/dev-v2.0.json \
--version_2_with_negative \
--do_lower_case \
--max_seq_length 512 \
--per_gpu_eval_batch_size 24 \
--eval_all_checkpoints \
--overwrite_output_dir \
--output_dir ${MODEL_PATH}
$@
```
### using the following system & software:
```
OS/Platform: Linux-4.15.0-91-generic-x86_64-with-debian-buster-sid
GPU/CPU: 2 x NVIDIA 1080Ti / Intel i7-8700
Transformers: 2.7.0
PyTorch: 1.4.0
TensorFlow: 2.1.0
Python: 3.7.7
```

View File

@@ -56,22 +56,8 @@ PyTorch: 1.4.0
TensorFlow: 2.1.0
Python: 3.7.6
```
### Inferencing / prediction works with Transformers v2.4.1, the latest version tested
### Utilize this xlnet_large_squad2_512 fine-tuned model with:
```python
config_class, model_class, tokenizer_class = \
XLNetConfig, XLNetforQuestionAnswering, XLNetTokenizer
model_name_or_path = "ahotrod/xlnet_large_squad2_512"
config = config_class.from_pretrained(model_name_or_path)
tokenizer = tokenizer_class.from_pretrained(model_name_or_path, do_lower_case=True)
model = model_class.from_pretrained(model_name_or_path, config=config)
```
### or the AutoModels (AutoConfig, AutoTokenizer & AutoModel) should also work, however I have yet to use them in my apps & confirm:
```python
from transformers import AutoConfig, AutoTokenizer, AutoModel
model_name_or_path = "ahotrod/xlnet_large_squad2_512"
config = AutoConfig.from_pretrained(model_name_or_path)
tokenizer = AutoTokenizer.from_pretrained(model_name_or_path, do_lower_case=True)
model = AutoModel.from_pretrained(model_name_or_path, config=config)
tokenizer = AutoTokenizer.from_pretrained("ahotrod/xlnet_large_squad2_512")
model = AutoModelForQuestionAnswering.from_pretrained("ahotrod/xlnet_large_squad2_512")
```

View File

@@ -0,0 +1,6 @@
---
tags:
- exbert
---
[![ExBERT](https://img.shields.io/badge/Visualize%20Attentions-ExBERT-green)](https://huggingface.co/exbert/?model=albert-base-v1)

View File

@@ -0,0 +1,6 @@
---
tags:
- exbert
---
[![ExBERT](https://img.shields.io/badge/Visualize%20Attentions-ExBERT-green)](https://huggingface.co/exbert/?model=albert-xxlarge-v2)

View File

@@ -0,0 +1,6 @@
---
tags:
- exbert
---
[![ExBERT](https://img.shields.io/badge/Visualize%20Attentions-ExBERT-green)](https://huggingface.co/exbert/?model=bert-base-cased)

View File

@@ -1,8 +1,12 @@
---
language: german
thumbnail: https://static.tildacdn.com/tild6438-3730-4164-b266-613634323466/german_bert.png
tags:
- exbert
---
[![ExBERT](https://img.shields.io/badge/Visualize%20Attentions-ExBERT-green)](https://huggingface.co/exbert/?model=bert-base-german-cased)
# German BERT
![bert_image](https://static.tildacdn.com/tild6438-3730-4164-b266-613634323466/german_bert.png)
## Overview
@@ -18,6 +22,7 @@ thumbnail: https://static.tildacdn.com/tild6438-3730-4164-b266-613634323466/germ
- We trained 810k steps with a batch size of 1024 for sequence length 128 and 30k steps with sequence length 512. Training took about 9 days.
- As training data we used the latest German Wikipedia dump (6GB of raw txt files), the OpenLegalData dump (2.4 GB) and news articles (3.6 GB).
- We cleaned the data dumps with tailored scripts and segmented sentences with spacy v2.1. To create tensorflow records we used the recommended sentencepiece library for creating the word piece vocabulary and tensorflow scripts to convert the text to data usable by BERT.
- Update April 3rd, 2020: updated the vocab file on deepset s3 to adjust tokenization of punctuation.
See https://deepset.ai/german-bert for more details

View File

@@ -0,0 +1,6 @@
---
tags:
- exbert
---
[![ExBERT](https://img.shields.io/badge/Visualize%20Attentions-ExBERT-green)](https://huggingface.co/exbert/?model=bert-base-uncased)

View File

@@ -1,3 +1,7 @@
---
language: french
---
# CamemBERT
CamemBERT is a state-of-the-art language model for French based on the RoBERTa architecture pretrained on the French subcorpus of the newly available multilingual corpus OSCAR.

View File

@@ -0,0 +1,59 @@
This language model is trained using sentence_transformers (https://github.com/UKPLab/sentence-transformers)
Started with bert-base-nli-stsb-mean-tokens
Continue training on quora questions deduplication dataset (https://www.kaggle.com/c/quora-question-pairs)
See train_script.py for script used
Below is the performance over the course of training
epoch,steps,cosine_pearson,cosine_spearman,euclidean_pearson,euclidean_spearman,manhattan_pearson,manhattan_spearman,dot_pearson,dot_spearman
0,1000,0.5944576426835938,0.6010801382777033,0.5942803776859142,0.5934485776801595,0.5939676679774666,0.593162725602328,0.5905591590826669,0.5921674789994058
0,2000,0.6404080440207146,0.6416811632113405,0.6384419354012121,0.6352050423100778,0.6379917744471867,0.6347884067391001,0.6410544760582826,0.6379252046791412
0,3000,0.6710168301884945,0.6676529324662036,0.6660195209784969,0.6618423144808695,0.6656461098096684,0.6615366331956389,0.6724401903484759,0.666073727723655
0,4000,0.6886373265097949,0.6808948140300153,0.67907655686838,0.6714218133850957,0.6786809551564443,0.6711577956884357,0.6926435869763303,0.68190855298609
0,5000,0.6991409753700026,0.6919630610321864,0.6991041519437052,0.6868961486499775,0.6987076032270729,0.6865385550504007,0.7035518148330993,0.6916275246101342
0,6000,0.7120367327025509,0.6975005265298305,0.7065567493967201,0.6922375503495235,0.7060005509843024,0.6916475765570651,0.7147094303373102,0.6981390706722722
0,7000,0.7254672394728687,0.7130118465900485,0.7261844956277705,0.7086213543110718,0.7257479964972307,0.7079315661881832,0.728729909455115,0.7122743793160531
0,8000,0.7402421930101399,0.7216774208330149,0.7367901914441078,0.7166256588352043,0.7362607046874481,0.7158881916281887,0.7433902441373252,0.7220998491980078
0,9000,0.7381005358120434,0.7197216844469877,0.7343228719349923,0.7139462687943793,0.7345247569255238,0.7145106206467152,0.7421843672419275,0.720686853053079
0,10000,0.7465436564646095,0.7260327107480364,0.7467524239596304,0.7230195666847953,0.7467721566237211,0.7231367593302213,0.749792199122442,0.7263143296580317
0,11000,0.7521805421706547,0.7323771570146701,0.7530672061250105,0.729223203496722,0.7530616532823367,0.7293818369675622,0.7552399002305836,0.7320808333541338
0,12000,0.7579359969644401,0.7340677616737238,0.7570017235719905,0.7305965412825544,0.7570601853520393,0.730718189957289,0.7611254136080384,0.7351501229591327
0,-1,0.7573407371218097,0.7329952035782198,0.755595312163209,0.7291445551777086,0.7557737117990928,0.7295404703700227,0.7607276219361719,0.7342415455980179
1,1000,0.7619907683805341,0.7374667949734767,0.7629820517114324,0.7330364216044966,0.7628369522755882,0.7331912674450544,0.7658583898073758,0.7381503446695727
1,2000,0.7618972640071228,0.7362151058969478,0.764582212425539,0.7335856230046062,0.7643125513700815,0.7334501607097152,0.7652852805583232,0.7369104639809163
1,3000,0.7687362955240467,0.7404674623181671,0.7708304819979073,0.7380959815601529,0.7707835692712482,0.7379796800453193,0.772074854759756,0.7414513460702766
1,4000,0.7685047787908202,0.7403088288815168,0.7703522257474043,0.7379787888808298,0.7701221475099808,0.7377898546753812,0.7713755359045312,0.7409415801952219
1,5000,0.7696438109797803,0.7410393893292365,0.773270389327895,0.7392953127251652,0.7729880866533291,0.7389853982789335,0.7726236305835863,0.7416278035580925
1,6000,0.7749538363837081,0.7436499342062207,0.774879168058157,0.7401827241766746,0.7745754601165837,0.739763415043146,0.7788801166152383,0.7446249060022169
1,7000,0.7794560817870597,0.7480970176267153,0.7803506944510302,0.7453305130502859,0.7799867949176531,0.7447100155494814,0.7828208193123926,0.7486740690324809
1,8000,0.7855844359073243,0.7496742172376921,0.7828816645965887,0.747176409009761,0.7827584875358967,0.7471037762845532,0.7879159073496309,0.7507349669102151
1,9000,0.7844110753729492,0.7507746252693759,0.7847208586489722,0.7485172180290892,0.7846408087474059,0.748491818820158,0.7872061334510225,0.7514470349769437
1,10000,0.7881311227435004,0.7530048509727403,0.7886917756879734,0.7508018068765787,0.7883332502188707,0.7505037008187275,0.7910707228932787,0.7537200382362567
1,11000,0.7883300109606874,0.7513494487126553,0.7879329130497712,0.749818368689255,0.7876525616593218,0.7494872882301785,0.7911454269743292,0.7522843165147303
1,12000,0.7853334933336618,0.7516809747712728,0.7893895316714998,0.749780492728257,0.7890075986655403,0.7494079715118533,0.7885959664070629,0.7523827940133203
1,-1,0.7887529238148887,0.7534076729932393,0.7896864404801204,0.7513080079201105,0.7894077512343298,0.7510009899066772,0.7919617393746149,0.7542173273241598
2,1000,0.7919209063905188,0.7550167329363414,0.7917464066515253,0.7523043685293455,0.7914371703225378,0.7520285423781206,0.7950297421784158,0.7562599556207076
2,2000,0.7924507768792486,0.7542908512484463,0.7934519001953887,0.7517491515010692,0.7931885648751081,0.751521004535999,0.7951637852162545,0.7551495215642072
2,3000,0.7937606244038364,0.755599577136169,0.7933633347508111,0.7527922999916203,0.7931581019714242,0.7527132061436363,0.797275652800117,0.7569827180764233
2,4000,0.7938389298721445,0.7578716892320315,0.7963783770097079,0.7555928931784702,0.796150381773947,0.7555438771581088,0.7972911620482322,0.759178632650707
2,5000,0.7935330563129844,0.7551129824372304,0.7970775059297484,0.7527285792572385,0.7967359830546507,0.7524478515463257,0.7966395126138969,0.756319220359678
2,6000,0.7929852776759999,0.7525490026774382,0.7952484474454824,0.7503695753216607,0.7950784132079611,0.7503677929234961,0.7956152082976395,0.7535275392698093
2,7000,0.794956504054517,0.756119591765251,0.7982025041673655,0.7532521587180684,0.7980261618830962,0.7532107179960499,0.7983222918908033,0.7571226363678287
2,8000,0.7934568432535339,0.7538336661192452,0.797015698241178,0.7514773358161916,0.7968076980315735,0.7513458838811067,0.7960694134685949,0.754143803399873
2,9000,0.7970040626682157,0.7576497805894974,0.7987855332059015,0.7550996144509958,0.7984693921009676,0.7548260162973456,0.7999509314900626,0.758347143906916
2,10000,0.7979442987735523,0.7585338500791028,0.8018677081664496,0.7557412777548302,0.8015397301245205,0.7552916678886369,0.8007921348414564,0.7589772216225288
2,11000,0.7985519561040211,0.7579986850302035,0.8021236875460913,0.7555826443181872,0.8019861620475348,0.7553763317660516,0.8009230128897853,0.7586541619907702
2,12000,0.7986842143860736,0.7599570950134775,0.8029131054823838,0.7577678644678973,0.8027922603736795,0.7575152095990927,0.8020896747930555,0.7608540869254408
2,-1,0.7994135319568432,0.7596286881516635,0.8022087183675333,0.7570593611974978,0.8020218401019292,0.7567291719729909,0.8026346812258125,0.7603928913647044
3,1000,0.7985505039929134,0.7592588405681144,0.8023296699449267,0.7569345933969436,0.8023622066009718,0.7570237132696928,0.8013054275981851,0.759643838536062
3,2000,0.7995482191699455,0.759205368623176,0.8026859405513612,0.7565709841358819,0.8024845263367439,0.7562920388231202,0.8021318586127523,0.7596496313300967
3,3000,0.7991070423195897,0.7582027696555826,0.8016352550470427,0.7555585819429662,0.8014268261947898,0.7551838327642736,0.8013136081494014,0.7584429477727118
3,4000,0.7999188836884763,0.7586764419322649,0.802987646214278,0.7561111254802977,0.8026549791861386,0.7556463650525692,0.8024068858366156,0.7591238238715613
3,5000,0.7988075932525881,0.7583533823004922,0.8019498750207454,0.755792967372457,0.8016459824731964,0.7553834613587099,0.8015528810821693,0.7589527136833425
3,6000,0.8003341798460688,0.7585432077405799,0.8032464035902267,0.7563722467405277,0.8028695045742804,0.7557626665682309,0.8027937010871594,0.7590404967573696
3,7000,0.799187592384933,0.7579358555659604,0.8028413548398412,0.7555875459131398,0.8025187078191003,0.7551196665011402,0.8018680475193432,0.7585565756912578
3,8000,0.797725037202641,0.757439012042047,0.802048241301358,0.7548888458326453,0.8017608103042271,0.7544606246736175,0.8005479449399782,0.758037452190282
3,9000,0.7990232649360067,0.7573703896772077,0.8021375332910405,0.754873027155089,0.8018733796679427,0.7545680141630304,0.8016400687760605,0.7579461042843499
3,10000,0.7994934439260372,0.758368978248884,0.8035693504115055,0.75619400688862,0.8032990505007025,0.7559016935896375,0.8022819185772518,0.7589558328445544
3,11000,0.8002954591825011,0.758710753096932,0.8043310859792212,0.7566387152306694,0.8040865016706966,0.7564221538891368,0.8030873114870971,0.7592722085543488
3,12000,0.8003726616196549,0.7588056657991931,0.8044000317617518,0.7566146528909147,0.8041705213966136,0.7563419459362758,0.8031760015719815,0.7593194421057111
3,-1,0.8004926728141455,0.7587192194882135,0.8043340929890026,0.756546030526114,0.8041028559910275,0.7563103085106637,0.8032542493776693,0.7592325501951863

View File

@@ -0,0 +1,6 @@
---
tags:
- exbert
---
[![ExBERT](https://img.shields.io/badge/Visualize%20Attentions-ExBERT-green)](https://huggingface.co/exbert/?model=distilbert-base-uncased)

View File

@@ -0,0 +1,6 @@
---
tags:
- exbert
---
[![ExBERT](https://img.shields.io/badge/Visualize%20Attentions-ExBERT-green)](https://huggingface.co/exbert/?model=distilgpt2)

View File

@@ -0,0 +1,6 @@
---
tags:
- exbert
---
[![ExBERT](https://img.shields.io/badge/Visualize%20Attentions-ExBERT-green)](https://huggingface.co/exbert/?model=distilroberta-base)

View File

@@ -0,0 +1,6 @@
---
tags:
- exbert
---
[![ExBERT](https://img.shields.io/badge/Visualize%20Attentions-ExBERT-green)](https://huggingface.co/exbert/?model=gpt2)

View File

@@ -0,0 +1,38 @@
# CovidBERT-NLI
This is the model **CovidBERT** trained by DeepSet on AllenAI's [CORD19 Dataset](https://pages.semanticscholar.org/coronavirus-research) of scientific articles about coronaviruses.
The model uses the original BERT wordpiece vocabulary and was subsequently fine-tuned on the [SNLI](https://nlp.stanford.edu/projects/snli/) and the [MultiNLI](https://www.nyu.edu/projects/bowman/multinli/) datasets using the [`sentence-transformers` library](https://github.com/UKPLab/sentence-transformers/) to produce universal sentence embeddings [1] using the **average pooling strategy** and a **softmax loss**.
Parameter details for the original training on CORD-19 are available on [DeepSet's MLFlow](https://public-mlflow.deepset.ai/#/experiments/2/runs/ba27d00c30044ef6a33b1d307b4a6cba)
**Base model**: `deepset/covid_bert_base` from HuggingFace's `AutoModel`.
**Training time**: ~6 hours on the NVIDIA Tesla P100 GPU provided in Kaggle Notebooks.
**Parameters**:
| Parameter | Value |
|------------------|-------|
| Batch size | 64 |
| Training steps | 23000 |
| Warmup steps | 1450 |
| Lowercasing | True |
| Max. Seq. Length | 128 |
**Performances**: The performance was evaluated on the test portion of the [STS dataset](http://ixa2.si.ehu.es/stswiki/index.php/STSbenchmark) using Spearman rank correlation and compared to the performances of similar models obtained with the same procedure to verify its performances.
| Model | Score |
|-------------------------------|-------------|
| `covidbert-nli` (this) | 67.52 |
| `gsarti/biobert-nli` | 73.40 |
| `gsarti/scibert-nli` | 74.50 |
| `bert-base-nli-mean-tokens`[2]| 77.12 |
An example usage for similarity-based scientific paper retrieval is provided in the [Covid-19 Semantic Browser](https://github.com/gsarti/covid-papers-browser) repository.
**References:**
[1] A. Conneau et al., [Supervised Learning of Universal Sentence Representations from Natural Language Inference Data](https://www.aclweb.org/anthology/D17-1070/)
[2] N. Reimers et I. Gurevych, [Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks](https://www.aclweb.org/anthology/D19-1410/)

View File

@@ -0,0 +1,94 @@
---
language: polish
---
# Multilingual + Polish SQuAD1.1
This model is the multilingual model provided by the Google research team with a fine-tuned polish Q&A downstream task.
## Details of the language model
Language model ([**bert-base-multilingual-cased**](https://github.com/google-research/bert/blob/master/multilingual.md)):
12-layer, 768-hidden, 12-heads, 110M parameters.
Trained on cased text in the top 104 languages with the largest Wikipedias.
## Details of the downstream task
Using the `mtranslate` Python module, [**SQuAD1.1**](https://rajpurkar.github.io/SQuAD-explorer/) was machine-translated. In order to find the start tokens, the direct translations of the answers were searched in the corresponding paragraphs. Due to the different translations depending on the context (missing context in the pure answer), the answer could not always be found in the text, and thus a loss of question-answer examples occurred. This is a potential problem where errors can occur in the data set.
| Dataset | # Q&A |
| ---------------------- | ----- |
| SQuAD1.1 Train | 87.7 K |
| Polish SQuAD1.1 Train | 39.5 K |
| SQuAD1.1 Dev | 10.6 K |
| Polish SQuAD1.1 Dev | 2.6 K |
## Model benchmark
| Model | EM | F1 |
| ---------------------- | ----- | ----- |
| [SlavicBERT](https://huggingface.co/DeepPavlov/bert-base-bg-cs-pl-ru-cased) | **60.89** | 71.68 |
| [polBERT](https://huggingface.co/dkleczek/bert-base-polish-uncased-v1) | 57.46 | 68.87 |
| [multiBERT](https://huggingface.co/bert-base-multilingual-cased) | 60.67 | **71.89** |
| [xlm](https://huggingface.co/xlm-mlm-100-1280) | 47.98 | 59.42 |
## Model training
The model was trained on a **Tesla V100** GPU with the following command:
```python
export SQUAD_DIR=path/to/pl_squad
python run_squad.py
--model_type bert \
--model_name_or_path bert-base-multilingual-cased \
--do_train \
--do_eval \
--train_file $SQUAD_DIR/pl_squadv1_train_clean.json \
--predict_file $SQUAD_DIR/pl_squadv1_dev_clean.json \
--num_train_epochs 2 \
--max_seq_length 384 \
--doc_stride 128 \
--save_steps=8000 \
--output_dir ../../output \
--overwrite_cache \
--overwrite_output_dir
```
**Results**:
{'exact': 60.670731707317074, 'f1': 71.8952193697293, 'total': 2624, 'HasAns_exact': 60.670731707317074, 'HasAns_f1': 71.8952193697293,
'HasAns_total': 2624, 'best_exact': 60.670731707317074, 'best_exact_thresh': 0.0, 'best_f1': 71.8952193697293, 'best_f1_thresh': 0.0}
## Model in action
Fast usage with **pipelines**:
```python
from transformers import pipeline
qa_pipeline = pipeline(
"question-answering",
model="henryk/bert-base-multilingual-cased-finetuned-polish-squad1",
tokenizer="henryk/bert-base-multilingual-cased-finetuned-polish-squad1"
)
qa_pipeline({
'context': "Warszawa jest największym miastem w Polsce pod względem liczby ludności i powierzchni",
'question': "Jakie jest największe miasto w Polsce?"})
```
# Output:
```json
{
"score": 0.9988,
"start": 0,
"end": 8,
"answer": "Warszawa"
}
```
## Contact
Please do not hesitate to contact me via [LinkedIn](https://www.linkedin.com/in/henryk-borzymowski-0755a2167/) if you want to discuss or get access to the Polish version of SQuAD.

View File

@@ -0,0 +1,96 @@
---
language: polish
---
# Multilingual + Polish SQuAD2.0
This model is the multilingual model provided by the Google research team with a fine-tuned polish Q&A downstream task.
## Details of the language model
Language model ([**bert-base-multilingual-cased**](https://github.com/google-research/bert/blob/master/multilingual.md)):
12-layer, 768-hidden, 12-heads, 110M parameters.
Trained on cased text in the top 104 languages with the largest Wikipedias.
## Details of the downstream task
Using the `mtranslate` Python module, [**SQuAD2.0**](https://rajpurkar.github.io/SQuAD-explorer/) was machine-translated. In order to find the start tokens, the direct translations of the answers were searched in the corresponding paragraphs. Due to the different translations depending on the context (missing context in the pure answer), the answer could not always be found in the text, and thus a loss of question-answer examples occurred. This is a potential problem where errors can occur in the data set.
| Dataset | # Q&A |
| ---------------------- | ----- |
| SQuAD2.0 Train | 130 K |
| Polish SQuAD2.0 Train | 83.1 K |
| SQuAD2.0 Dev | 12 K |
| Polish SQuAD2.0 Dev | 8.5 K |
## Model benchmark
| Model | EM/F1 |HasAns (EM/F1) | NoAns |
| ---------------------- | ----- | ----- | ----- |
| [SlavicBERT](https://huggingface.co/DeepPavlov/bert-base-bg-cs-pl-ru-cased) | 69.35/71.51 | 47.02/54.09 | 79.20 |
| [polBERT](https://huggingface.co/dkleczek/bert-base-polish-uncased-v1) | 67.33/69.80| 45.73/53.80 | 76.87 |
| [multiBERT](https://huggingface.co/bert-base-multilingual-cased) | **70.76**/**72.92** |45.00/52.04 | 82.13 |
## Model training
The model was trained on a **Tesla V100** GPU with the following command:
```python
export SQUAD_DIR=path/to/pl_squad
python run_squad.py
--model_type bert \
--model_name_or_path bert-base-multilingual-cased \
--do_train \
--do_eval \
--version_2_with_negative \
--train_file $SQUAD_DIR/pl_squadv2_train.json \
--predict_file $SQUAD_DIR/pl_squadv2_dev.json \
--num_train_epochs 2 \
--max_seq_length 384 \
--doc_stride 128 \
--save_steps=8000 \
--output_dir ../../output \
--overwrite_cache \
--overwrite_output_dir
```
**Results**:
{'exact': 70.76671723655035, 'f1': 72.92156947155917, 'total': 8569, 'HasAns_exact': 45.00762195121951, 'HasAns_f1': 52.04456128116991, 'HasAns_total': 2624, 'NoAns_exact': 82.13624894869638, '
NoAns_f1': 82.13624894869638, 'NoAns_total': 5945, 'best_exact': 71.72365503559342, 'best_exact_thresh': 0.0, 'best_f1': 73.62662512059369, 'best_f1_thresh': 0.0}
## Model in action
Fast usage with **pipelines**:
```python
from transformers import pipeline
qa_pipeline = pipeline(
"question-answering",
model="henryk/bert-base-multilingual-cased-finetuned-polish-squad2",
tokenizer="henryk/bert-base-multilingual-cased-finetuned-polish-squad2"
)
qa_pipeline({
'context': "Warszawa jest największym miastem w Polsce pod względem liczby ludności i powierzchni",
'question': "Jakie jest największe miasto w Polsce?"})
```
# Output:
```json
{
"score": 0.9986,
"start": 0,
"end": 8,
"answer": "Warszawa"
}
```
## Contact
Please do not hesitate to contact me via [LinkedIn](https://www.linkedin.com/in/henryk-borzymowski-0755a2167/) if you want to discuss or get access to the Polish version of SQuAD.

View File

@@ -0,0 +1,86 @@
---
language: malay
---
# Bahasa Albert Model
Pretrained Albert base language model for Malay and Indonesian.
## Pretraining Corpus
`albert-base-bahasa-cased` model was pretrained on ~1.8 Billion words. We trained on both standard and social media language structures, and below is list of data we trained on,
1. [dumping wikipedia](https://github.com/huseinzol05/Malaya-Dataset#wikipedia-1).
2. [local instagram](https://github.com/huseinzol05/Malaya-Dataset#instagram).
3. [local twitter](https://github.com/huseinzol05/Malaya-Dataset#twitter-1).
4. [local news](https://github.com/huseinzol05/Malaya-Dataset#public-news).
5. [local parliament text](https://github.com/huseinzol05/Malaya-Dataset#parliament).
6. [local singlish/manglish text](https://github.com/huseinzol05/Malaya-Dataset#singlish-text).
7. [IIUM Confession](https://github.com/huseinzol05/Malaya-Dataset#iium-confession).
8. [Wattpad](https://github.com/huseinzol05/Malaya-Dataset#wattpad).
9. [Academia PDF](https://github.com/huseinzol05/Malaya-Dataset#academia-pdf).
Preprocessing steps can reproduce from here, [Malaya/pretrained-model/preprocess](https://github.com/huseinzol05/Malaya/tree/master/pretrained-model/preprocess).
## Pretraining details
- This model was trained using Google Albert's github [repository](https://github.com/google-research/ALBERT) on v3-8 TPU.
- All steps can reproduce from here, [Malaya/pretrained-model/albert](https://github.com/huseinzol05/Malaya/tree/master/pretrained-model/albert).
## Load Pretrained Model
You can use this model by installing `torch` or `tensorflow` and Huggingface library `transformers`. And you can use it directly by initializing it like this:
```python
from transformers import AlbertTokenizer, AlbertModel
model = BertModel.from_pretrained('huseinzol05/albert-base-bahasa-cased')
tokenizer = AlbertTokenizer.from_pretrained(
'huseinzol05/albert-base-bahasa-cased',
do_lower_case = False,
)
```
## Example using AutoModelWithLMHead
```python
from transformers import AlbertTokenizer, AutoModelWithLMHead, pipeline
model = AutoModelWithLMHead.from_pretrained('huseinzol05/albert-base-bahasa-cased')
tokenizer = AlbertTokenizer.from_pretrained(
'huseinzol05/albert-base-bahasa-cased',
do_lower_case = False,
)
fill_mask = pipeline('fill-mask', model = model, tokenizer = tokenizer)
print(fill_mask('makan ayam dengan [MASK]'))
```
Output is,
```text
[{'sequence': '[CLS] makan ayam dengan ayam[SEP]',
'score': 0.044952988624572754,
'token': 629},
{'sequence': '[CLS] makan ayam dengan sayur[SEP]',
'score': 0.03621877357363701,
'token': 1639},
{'sequence': '[CLS] makan ayam dengan ikan[SEP]',
'score': 0.034429922699928284,
'token': 758},
{'sequence': '[CLS] makan ayam dengan nasi[SEP]',
'score': 0.032447945326566696,
'token': 453},
{'sequence': '[CLS] makan ayam dengan rendang[SEP]',
'score': 0.028885239735245705,
'token': 2451}]
```
## Results
For further details on the model performance, simply checkout accuracy page from Malaya, https://malaya.readthedocs.io/en/latest/Accuracy.html, we compared with traditional models.
## Acknowledgement
Thanks to [Im Big](https://www.facebook.com/imbigofficial/), [LigBlou](https://www.facebook.com/ligblou), [Mesolitica](https://mesolitica.com/) and [KeyReply](https://www.keyreply.com/) for sponsoring AWS, Google and GPU clouds to train Albert for Bahasa.

View File

@@ -0,0 +1,92 @@
---
language: malay
---
# Bahasa Tiny-BERT Model
General Distilled Tiny BERT language model for Malay and Indonesian.
## Pretraining Corpus
`tiny-bert-bahasa-cased` model was distilled on ~1.8 Billion words. We distilled on both standard and social media language structures, and below is list of data we distilled on,
1. [dumping wikipedia](https://github.com/huseinzol05/Malaya-Dataset#wikipedia-1).
2. [local instagram](https://github.com/huseinzol05/Malaya-Dataset#instagram).
3. [local twitter](https://github.com/huseinzol05/Malaya-Dataset#twitter-1).
4. [local news](https://github.com/huseinzol05/Malaya-Dataset#public-news).
5. [local parliament text](https://github.com/huseinzol05/Malaya-Dataset#parliament).
6. [local singlish/manglish text](https://github.com/huseinzol05/Malaya-Dataset#singlish-text).
7. [IIUM Confession](https://github.com/huseinzol05/Malaya-Dataset#iium-confession).
8. [Wattpad](https://github.com/huseinzol05/Malaya-Dataset#wattpad).
9. [Academia PDF](https://github.com/huseinzol05/Malaya-Dataset#academia-pdf).
Preprocessing steps can reproduce from here, [Malaya/pretrained-model/preprocess](https://github.com/huseinzol05/Malaya/tree/master/pretrained-model/preprocess).
## Distilling details
- This model was distilled using huawei-noah Tiny-BERT's github [repository](https://github.com/huawei-noah/Pretrained-Language-Model/tree/master/TinyBERT) on 3 Titan V100 32GB VRAM.
- All steps can reproduce from here, [Malaya/pretrained-model/tiny-bert](https://github.com/huseinzol05/Malaya/tree/master/pretrained-model/tiny-bert).
## Load Distilled Model
You can use this model by installing `torch` or `tensorflow` and Huggingface library `transformers`. And you can use it directly by initializing it like this:
```python
from transformers import AlbertTokenizer, BertModel
model = BertModel.from_pretrained('huseinzol05/tiny-bert-bahasa-cased')
tokenizer = AlbertTokenizer.from_pretrained(
'huseinzol05/tiny-bert-bahasa-cased',
unk_token = '[UNK]',
pad_token = '[PAD]',
do_lower_case = False,
)
```
We use [google/sentencepiece](https://github.com/google/sentencepiece) to train the tokenizer, so to use it, need to load from `AlbertTokenizer`.
## Example using AutoModelWithLMHead
```python
from transformers import AlbertTokenizer, AutoModelWithLMHead, pipeline
model = AutoModelWithLMHead.from_pretrained('huseinzol05/tiny-bert-bahasa-cased')
tokenizer = AlbertTokenizer.from_pretrained(
'huseinzol05/tiny-bert-bahasa-cased',
unk_token = '[UNK]',
pad_token = '[PAD]',
do_lower_case = False,
)
fill_mask = pipeline('fill-mask', model = model, tokenizer = tokenizer)
print(fill_mask('makan ayam dengan [MASK]'))
```
Output is,
```text
[{'sequence': '[CLS] makan ayam dengan berbual[SEP]',
'score': 0.00015769545279908925,
'token': 17859},
{'sequence': '[CLS] makan ayam dengan kembar[SEP]',
'score': 0.0001448775001335889,
'token': 8289},
{'sequence': '[CLS] makan ayam dengan memaklumkan[SEP]',
'score': 0.00013484008377417922,
'token': 6881},
{'sequence': '[CLS] makan ayam dengan Senarai[SEP]',
'score': 0.00013061291247140616,
'token': 11698},
{'sequence': '[CLS] makan ayam dengan Tiga[SEP]',
'score': 0.00012453157978598028,
'token': 4232}]
```
## Results
For further details on the model performance, simply checkout accuracy page from Malaya, https://malaya.readthedocs.io/en/latest/Accuracy.html, we compared with traditional models.
## Acknowledgement
Thanks to [Im Big](https://www.facebook.com/imbigofficial/), [LigBlou](https://www.facebook.com/ligblou), [Mesolitica](https://mesolitica.com/) and [KeyReply](https://www.keyreply.com/) for sponsoring AWS, Google and GPU clouds to train BERT for Bahasa.

View File

@@ -50,7 +50,7 @@ tokenizer = XLNetTokenizer.from_pretrained(
'huseinzol05/xlnet-base-bahasa-cased', do_lower_case = False
)
fill_mask = pipeline('fill-mask', model = model, tokenizer = tokenizer)
print(fill_mask('makan ayam dengan [MASK]'))
print(fill_mask('makan ayam dengan <mask>'))
```
## Results

View File

@@ -0,0 +1,61 @@
### Model
**[`albert-xlarge-v2`](https://huggingface.co/albert-xlarge-v2)** fine-tuned on **[`SQuAD V2`](https://rajpurkar.github.io/SQuAD-explorer/)** using **[`run_squad.py`](https://github.com/huggingface/transformers/blob/master/examples/run_squad.py)**
### Training Parameters
Trained on 4 NVIDIA GeForce RTX 2080 Ti 11Gb
```bash
BASE_MODEL=albert-xlarge-v2
python run_squad.py \
--version_2_with_negative \
--model_type albert \
--model_name_or_path $BASE_MODEL \
--output_dir $OUTPUT_MODEL \
--do_eval \
--do_lower_case \
--train_file $SQUAD_DIR/train-v2.0.json \
--predict_file $SQUAD_DIR/dev-v2.0.json \
--per_gpu_train_batch_size 3 \
--per_gpu_eval_batch_size 64 \
--learning_rate 3e-5 \
--num_train_epochs 3.0 \
--max_seq_length 384 \
--doc_stride 128 \
--save_steps 2000 \
--threads 24 \
--warmup_steps 814 \
--gradient_accumulation_steps 4 \
--fp16 \
--do_train
```
### Evaluation
Evaluation on the dev set. I did not sweep for best threshold.
| | val |
|-------------------|-------------------|
| exact | 84.41842836688285 |
| f1 | 87.4628460501696 |
| total | 11873.0 |
| HasAns_exact | 80.68488529014844 |
| HasAns_f1 | 86.78245127423482 |
| HasAns_total | 5928.0 |
| NoAns_exact | 88.1412952060555 |
| NoAns_f1 | 88.1412952060555 |
| NoAns_total | 5945.0 |
| best_exact | 84.41842836688285 |
| best_exact_thresh | 0.0 |
| best_f1 | 87.46284605016956 |
| best_f1_thresh | 0.0 |
### Usage
See [huggingface documentation](https://huggingface.co/transformers/model_doc/albert.html#albertforquestionanswering). Training on `SQuAD V2` allows the model to score if a paragraph contains an answer:
```python
start_scores, end_scores = model(input_ids)
span_scores = start_scores.softmax(dim=1).log()[:,:,None] + end_scores.softmax(dim=1).log()[:,None,:]
ignore_score = span_scores[:,0,0] #no answer scores
```

View File

@@ -0,0 +1,61 @@
### Model
**[`allenai/scibert_scivocab_uncased`](https://huggingface.co/allenai/scibert_scivocab_uncased)** fine-tuned on **[`SQuAD V2`](https://rajpurkar.github.io/SQuAD-explorer/)** using **[`run_squad.py`](https://github.com/huggingface/transformers/blob/master/examples/run_squad.py)**
### Training Parameters
Trained on 4 NVIDIA GeForce RTX 2080 Ti 11Gb
```bash
BASE_MODEL=allenai/scibert_scivocab_uncased
python run_squad.py \
--version_2_with_negative \
--model_type albert \
--model_name_or_path $BASE_MODEL \
--output_dir $OUTPUT_MODEL \
--do_eval \
--do_lower_case \
--train_file $SQUAD_DIR/train-v2.0.json \
--predict_file $SQUAD_DIR/dev-v2.0.json \
--per_gpu_train_batch_size 18 \
--per_gpu_eval_batch_size 64 \
--learning_rate 3e-5 \
--num_train_epochs 3.0 \
--max_seq_length 384 \
--doc_stride 128 \
--save_steps 2000 \
--threads 24 \
--warmup_steps 550 \
--gradient_accumulation_steps 1 \
--fp16 \
--logging_steps 50 \
--do_train
```
### Evaluation
Evaluation on the dev set. I did not sweep for best threshold.
| | val |
|-------------------|-------------------|
| exact | 75.07790785816559 |
| f1 | 78.47735207283013 |
| total | 11873.0 |
| HasAns_exact | 70.76585695006747 |
| HasAns_f1 | 77.57449412292718 |
| HasAns_total | 5928.0 |
| NoAns_exact | 79.37762825904122 |
| NoAns_f1 | 79.37762825904122 |
| NoAns_total | 5945.0 |
| best_exact | 75.08633032931863 |
| best_exact_thresh | 0.0 |
| best_f1 | 78.48577454398324 |
| best_f1_thresh | 0.0 |
### Usage
See [huggingface documentation](https://huggingface.co/transformers/model_doc/bert.html#bertforquestionanswering). Training on `SQuAD V2` allows the model to score if a paragraph contains an answer:
```python
start_scores, end_scores = model(input_ids)
span_scores = start_scores.softmax(dim=1).log()[:,:,None] + end_scores.softmax(dim=1).log()[:,None,:]
ignore_score = span_scores[:,0,0] #no answer scores
```

View File

@@ -0,0 +1,14 @@
# BERT-IMDB
## What is it?
BERT (`bert-large-cased`) trained for sentiment classification on the [IMDB dataset](https://www.kaggle.com/lakshmi25npathi/imdb-dataset-of-50k-movie-reviews).
## Training setting
The model was trained on 80% of the IMDB dataset for sentiment classification for three epochs with a learning rate of `1e-5` with the `simpletransformers` library. The library uses a learning rate schedule.
## Result
The model achieved 90% classification accuracy on the validation set.
## Reference
The full experiment is available in the [tlr repo](https://lvwerra.github.io/trl/03-bert-imdb-training/).

View File

@@ -0,0 +1,18 @@
# GPT2-IMDB-pos
## What is it?
A small GPT2 (`lvwerra/gpt2-imdb`) language model fine-tuned to produce positive movie reviews based the [IMDB dataset](https://www.kaggle.com/lakshmi25npathi/imdb-dataset-of-50k-movie-reviews). The model is trained with rewards from a BERT sentiment classifier (`lvwerra/gpt2-imdb`) via PPO.
## Training setting
The model was trained for `100` optimisation steps with a batch size of `256` which corresponds to `25600` training samples. The full experiment setup can be found in the Jupyter notebook in the [trl repo](https://lvwerra.github.io/trl/04-gpt2-sentiment-ppo-training/).
## Examples
A few examples of the model response to a query before and after optimisation:
| query | response (before) | response (after) | rewards (before) | rewards (after) |
|-------|-------------------|------------------|------------------|-----------------|
|I'd never seen a |heavier, woodier example of Victorian archite... |film of this caliber, and I think it's wonder... |3.297736 |4.158653|
|I love John's work |but I actually have to write language as in w... |and I hereby recommend this film. I am really... |-1.904006 |4.159198 |
|I's a big struggle |to see anyone who acts in that way. by Jim Th... |, but overall I'm happy with the changes even ... |-1.595925 |2.651260|

View File

@@ -0,0 +1,27 @@
# GPT2-IMDB
## What is it?
A GPT2 (`gpt2`) language model fine-tuned on the [IMDB dataset](https://www.kaggle.com/lakshmi25npathi/imdb-dataset-of-50k-movie-reviews).
## Training setting
The GPT2 language model was fine-tuned for 1 epoch on the IMDB dataset. All comments were joined into a single text file separated by the EOS token:
```
import pandas as pd
df = pd.read_csv("imdb-dataset.csv")
imdb_str = " <|endoftext|> ".join(df['review'].tolist())
with open ('imdb.txt', 'w') as f:
f.write(imdb_str)
```
To train the model the `run_language_modeling.py` script in the `transformer` library was used:
```
python run_language_modeling.py
--train_data_file imdb.txt
--output_dir gpt2-imdb
--model_type gpt2
--model_name_or_path gpt2
```

View File

@@ -5,15 +5,16 @@ thumbnail:
# GPT-2 + CORD19 dataset : 🦠 ✍ ⚕
**GPT-2** fine-tuned on **biorxiv_medrxiv** and **comm_use_subset files** from [CORD-19](https://www.kaggle.com/allen-institute-for-ai/CORD-19-research-challenge) dataset.
**GPT-2** fine-tuned on **biorxiv_medrxiv**, **comm_use_subset** and **custom_license** files from [CORD-19](https://www.kaggle.com/allen-institute-for-ai/CORD-19-research-challenge) dataset.
## Datasets details:
## Datasets details
| Dataset | # Files |
| ---------------------- | ----- |
| biorxiv_medrxiv | 885 |
| comm_use_subse | 9K |
| comm_use_subset | 9K |
| custom_license | 20.6K |
## Model training
@@ -37,7 +38,7 @@ python run_language_modeling.py \
<img alt="training loss" src="https://svgshare.com/i/JTf.svg' title='GTP-2-finetuned-CORDS19-loss" width="600" height="300" />
## Model in action / Example of usage:
## Model in action / Example of usage ✒
You can get the following script [here](https://github.com/huggingface/transformers/blob/master/examples/run_generation.py)

View File

@@ -0,0 +1,62 @@
---
language: english
thumbnail:
---
# GPT-2 + bio/medrxiv files from CORD19: 🦠 ✍ ⚕
**GPT-2** fine-tuned on **biorxiv_medrxiv** files from [CORD-19](https://www.kaggle.com/allen-institute-for-ai/CORD-19-research-challenge) dataset.
## Datasets details:
| Dataset | # Files |
| ---------------------- | ----- |
| biorxiv_medrxiv | 885 |
## Model training:
The model was trained on a Tesla P100 GPU and 25GB of RAM with the following command:
```bash
export TRAIN_FILE=/path/to/dataset/train.txt
python run_language_modeling.py \
--model_type gpt2 \
--model_name_or_path gpt2 \
--do_train \
--train_data_file $TRAIN_FILE \
--num_train_epochs 4 \
--output_dir model_output \
--overwrite_output_dir \
--save_steps 2000 \
--per_gpu_train_batch_size 3
```
## Model in action / Example of usage: ✒
You can get the following script [here](https://github.com/huggingface/transformers/blob/master/examples/run_generation.py)
```bash
python run_generation.py \
--model_type gpt2 \
--model_name_or_path mrm8488/GPT-2-finetuned-CORD19 \
--length 200
```
```txt
👵👴🦠
# Input: Old people with COVID-19 tends to suffer
# Output: === GENERATED SEQUENCE 1 ===
Old people with COVID-19 tends to suffer more symptom onset time and death. It is well known that many people with COVID-19 have high homozygous ZIKV infection in the face of severe symptoms in both severe and severe cases.
The origin of Wuhan Fever was investigated by Prof. Shen Jiang at the outbreak of Wuhan Fever [34]. As Huanan Province is the epicenter of this outbreak, Huanan, the epicenter of epidemic Wuhan Fever, is the most potential location for the direct transmission of infection (source: Zhongzhen et al., 2020). A negative risk ratio indicates more frequent underlying signs in the people in Huanan Province with COVID-19 patients. Further analysis of reported Huanan Fever onset data in the past two years indicated that the intensity of exposure is the key risk factor for developing MERS-CoV infection in this region, especially among children and elderly. To be continued to develop infected patients would be a very important area for
```
![Model in action](https://media.giphy.com/media/TgUdO72Iwk9h7hhm7G/giphy.gif)
> Created by [Manuel Romero/@mrm8488](https://twitter.com/mrm8488) | [LinkedIn](https://www.linkedin.com/in/manuel-romero-cs/)
> Made with <span style="color: #e25555;">&hearts;</span> in Spain

View File

@@ -66,6 +66,8 @@ nlp_ner = pipeline(
{"use_fast": False}
))
text = 'Mis amigos están pensando viajar a Londres este verano'
nlp_ner(text)
#Output: [{'entity': 'B-LOC', 'score': 0.9998720288276672, 'word': 'Londres'}]

View File

@@ -0,0 +1,83 @@
---
language: spanish
thumbnail:
---
# Spanish BERT (BETO) + Syntax POS tagging ✍🏷
This model is a fine-tuned version of the Spanish BERT [(BETO)](https://github.com/dccuchile/beto) on Spanish **syntax** annotations in [CONLL CORPORA](https://www.kaggle.com/nltkdata/conll-corpora) dataset for **syntax POS** (Part of Speech tagging) downstream task.
## Details of the downstream task (Syntax POS) - Dataset
- [Dataset: CONLL Corpora ES](https://www.kaggle.com/nltkdata/conll-corpora)
#### [Fine-tune script on NER dataset provided by Huggingface](https://github.com/huggingface/transformers/blob/master/examples/run_ner.py)
#### 21 Syntax annotations (Labels) covered:
- \_
- ATR
- ATR.d
- CAG
- CC
- CD
- CD.Q
- CI
- CPRED
- CPRED.CD
- CPRED.SUJ
- CREG
- ET
- IMPERS
- MOD
- NEG
- PASS
- PUNC
- ROOT
- SUJ
- VOC
## Metrics on test set 📋
| Metric | # score |
| :-------: | :-------: |
| F1 | **89.27** |
| Precision | **89.44** |
| Recall | **89.11** |
## Model in action 🔨
Fast usage with **pipelines** 🧪
```python
from transformers import pipeline
nlp_pos_syntax = pipeline(
"ner",
model="mrm8488/bert-spanish-cased-finetuned-pos-syntax",
tokenizer="mrm8488/bert-spanish-cased-finetuned-pos-syntax"
)
text = 'Mis amigos están pensando viajar a Londres este verano.'
nlp_pos_syntax(text)[1:len(nlp_pos_syntax(text))-1]
```
```json
[
{ "entity": "_", "score": 0.9999216794967651, "word": "Mis" },
{ "entity": "SUJ", "score": 0.999882698059082, "word": "amigos" },
{ "entity": "_", "score": 0.9998869299888611, "word": "están" },
{ "entity": "ROOT", "score": 0.9980518221855164, "word": "pensando" },
{ "entity": "_", "score": 0.9998420476913452, "word": "viajar" },
{ "entity": "CD", "score": 0.999351978302002, "word": "a" },
{ "entity": "_", "score": 0.999959409236908, "word": "Londres" },
{ "entity": "_", "score": 0.9998968839645386, "word": "este" },
{ "entity": "CC", "score": 0.99931401014328, "word": "verano" },
{ "entity": "PUNC", "score": 0.9998534917831421, "word": "." }
]
```
> Created by [Manuel Romero/@mrm8488](https://twitter.com/mrm8488)
> Made with <span style="color: #e25555;">&hearts;</span> in Spain

View File

@@ -21,7 +21,7 @@ I preprocessed the dataset and splitted it as train / dev (80/20)
- [Fine-tune on NER script provided by Huggingface](https://github.com/huggingface/transformers/blob/master/examples/run_ner.py)
- Labels covered:
- **60** Labels covered:
```
AO, AQ, CC, CS, DA, DD, DE, DI, DN, DP, DT, Faa, Fat, Fc, Fd, Fe, Fg, Fh, Fia, Fit, Fp, Fpa, Fpt, Fs, Ft, Fx, Fz, I, NC, NP, P0, PD, PI, PN, PP, PR, PT, PX, RG, RN, SP, VAI, VAM, VAN, VAP, VAS, VMG, VMI, VMM, VMN, VMP, VMS, VSG, VSI, VSM, VSN, VSP, VSS, Y and Z
@@ -74,6 +74,8 @@ nlp_pos(text)
```
![model in action](https://media.giphy.com/media/jVC9m1cNrdIWuAAtjy/giphy.gif)
16 POS tags version also available [here](https://huggingface.co/mrm8488/bert-spanish-cased-finetuned-pos-16-tags)
> Created by [Manuel Romero/@mrm8488](https://twitter.com/mrm8488)

View File

@@ -0,0 +1,82 @@
---
language: multilingual
thumbnail:
---
# DistilBERT multilingual fine-tuned on TydiQA (GoldP task) dataset for multilingual Q&A 😛🌍❓
## Details of the language model
[distilbert-base-multilingual-cased](https://huggingface.co/distilbert-base-multilingual-cased)
## Details of the Tydi QA dataset
TyDi QA contains 200k human-annotated question-answer pairs in 11 Typologically Diverse languages, written without seeing the answer and without the use of translation, and is designed for the **training and evaluation** of automatic question answering systems. This repository provides evaluation code and a baseline system for the dataset. https://ai.google.com/research/tydiqa
## Details of the downstream task (Gold Passage or GoldP aka the secondary task)
Given a passage that is guaranteed to contain the answer, predict the single contiguous span of characters that answers the question. the gold passage task differs from the [primary task](https://github.com/google-research-datasets/tydiqa/blob/master/README.md#the-tasks) in several ways:
* only the gold answer passage is provided rather than the entire Wikipedia article;
* unanswerable questions have been discarded, similar to MLQA and XQuAD;
* we evaluate with the SQuAD 1.1 metrics like XQuAD; and
* Thai and Japanese are removed since the lack of whitespace breaks some tools.
## Model training 💪🏋️‍
The model was fine-tuned on a Tesla P100 GPU and 25GB of RAM.
The script is the following:
```python
python transformers/examples/run_squad.py \
--model_type distilbert \
--model_name_or_path distilbert-base-multilingual-cased \
--do_train \
--do_eval \
--train_file /path/to/dataset/train.json \
--predict_file /path/to/dataset/dev.json \
--per_gpu_train_batch_size 24 \
--per_gpu_eval_batch_size 24 \
--learning_rate 3e-5 \
--num_train_epochs 5 \
--max_seq_length 384 \
--doc_stride 128 \
--output_dir /content/model_output \
--overwrite_output_dir \
--save_steps 1000 \
--threads 400
```
## Global Results (dev set) 📝
| Metric | # Value |
| --------- | ----------- |
| **EM** | **63.85** |
| **F1** | **75.70** |
## Specific Results (per language) 🌍📝
| Language | # Samples | # EM | # F1 |
| --------- | ----------- |--------| ------ |
| Arabic | 1314 | 66.66 | 80.02 |
| Bengali | 180 | 53.09 | 63.50 |
| English | 654 | 62.42 | 73.12 |
| Finnish | 1031 | 64.57 | 75.15 |
| Indonesian| 773 | 67.89 | 79.70 |
| Korean | 414 | 51.29 | 61.73 |
| Russian | 1079 | 55.42 | 70.08 |
| Swahili | 596 | 74.51 | 81.15 |
| Telegu | 874 | 66.21 | 79.85 |
## Similar models
You can also try [bert-multi-cased-finedtuned-xquad-tydiqa-goldp](https://huggingface.co/mrm8488/bert-multi-cased-finedtuned-xquad-tydiqa-goldp) that achieves **F1 = 82.16** and **EM = 71.06** (And of course better marks per language).
> Created by [Manuel Romero/@mrm8488](https://twitter.com/mrm8488)
> Made with <span style="color: #e25555;">&hearts;</span> in Spain

View File

@@ -0,0 +1,27 @@
# GPT2-IMDB-neg (LM + RL) 🎞😡✍
All credits to [@lvwerra](https://twitter.com/lvwerra)
## What is it?
A small GPT2 (`lvwerra/gpt2-imdb`) language model fine-tuned to produce **negative** movie reviews based the [IMDB dataset](https://www.kaggle.com/lakshmi25npathi/imdb-dataset-of-50k-movie-reviews). The model is trained with rewards from a BERT sentiment classifier (`lvwerra/gpt2-imdb`) via **PPO**.
## Why?
I wanted to reproduce the experiment [lvwerra/gpt2-imdb-pos](https://huggingface.co/lvwerra/gpt2-imdb-pos) but for generating **negative** movie reviews.
## Training setting
The model was trained for `100` optimisation steps with a batch size of `256` which corresponds to `25600` training samples. The full experiment setup (for positive samples) in [trl repo](https://lvwerra.github.io/trl/04-gpt2-sentiment-ppo-training/).
## Examples
A few examples of the model response to a query before and after optimisation:
| query | response (before) | response (after) | rewards (before) | rewards (after) |
|-------|-------------------|------------------|------------------|-----------------|
|This movie is a fine | attempt as far as live action is concerned, n...|example of how bad Hollywood in theatrics pla...| 2.118391 | -3.31625|
|I have watched 3 episodes |with this guy and he is such a talented actor...| but the show is just plain awful and there ne...| 2.681171| -4.512792|
|We know that firefighters and| police officers are forced to become populari...| other chains have going to get this disaster ...| 1.367811| -3.34017|
> Created by [Manuel Romero/@mrm8488](https://twitter.com/mrm8488)
> Made with <span style="color: #e25555;">&hearts;</span> in Spain

View File

@@ -0,0 +1,57 @@
---
language: german
---
# Model description
## Dataset
Trained on fictional and non-fictional German texts written between 1840 and 1920:
* Narrative texts from Digitale Bibliothek (https://textgrid.de/digitale-bibliothek)
* Fairy tales and sagas from Grimm Korpus (https://www1.ids-mannheim.de/kl/projekte/korpora/archiv/gri.html)
* Newspaper and magazine article from Mannheimer Korpus Historischer Zeitungen und Zeitschriften (https://repos.ids-mannheim.de/mkhz-beschreibung.html)
* Magazine article from the journal „Die Grenzboten“ (http://www.deutschestextarchiv.de/doku/textquellen#grenzboten)
* Fictional and non-fictional texts from Projekt Gutenberg (https://www.projekt-gutenberg.org)
## Hardware used
1 Tesla P4 GPU
## Hyperparameters
| Parameter | Value |
|-------------------------------|----------|
| Epochs | 3 |
| Gradient_accumulation_steps | 1 |
| Train_batch_size | 32 |
| Learning_rate | 0.00003 |
| Max_seq_len | 128 |
## Evaluation results: Automatic tagging of four forms of speech/thought/writing representation in historical fictional and non-fictional German texts
The language model was used in the task to tag direct, indirect, reported and free indirect speech/thought/writing representation in fictional and non-fictional German texts. The tagger is available and described in detail at https://github.com/redewiedergabe/tagger.
The tagging model was trained using the SequenceTagger Class of the Flair framework ([Akbik et al., 2019](https://www.aclweb.org/anthology/N19-4010)) which implements a BiLSTM-CRF architecture on top of a language embedding (as proposed by [Huang et al. (2015)](https://arxiv.org/abs/1508.01991)).
Hyperparameters
| Parameter | Value |
|-------------------------------|------------|
| Hidden_size | 256 |
| Learning_rate | 0.1 |
| Mini_batch_size | 8 |
| Max_epochs | 150 |
Results are reported below in comparison to a custom trained flair embedding, which was stacked onto a custom trained fastText-model. Both models were trained on the same dataset.
| | BERT ||| FastText+Flair |||Test data|
|----------------|----------|-----------|----------|------|-----------|--------|--------|
| | F1 | Precision | Recall | F1 | Precision | Recall ||
| Direct | 0.80 | 0.86 | 0.74 | 0.84 | 0.90 | 0.79 |historical German, fictional & non-fictional|
| Indirect | **0.76** | **0.79** | **0.73** | 0.73 | 0.78 | 0.68 |historical German, fictional & non-fictional|
| Reported | **0.58** | **0.69** | **0.51** | 0.56 | 0.68 | 0.48 |historical German, fictional & non-fictional|
| Free indirect | **0.57** | **0.80** | **0.44** | 0.47 | 0.78 | 0.34 |modern German, fictional|
## Intended use:
Historical German Texts (1840 to 1920)
(Showed good performance with modern German fictional texts as well)

View File

@@ -0,0 +1,6 @@
---
tags:
- exbert
---
[![ExBERT](https://img.shields.io/badge/Visualize%20Attentions-ExBERT-green)](https://huggingface.co/exbert/?model=roberta-base)

View File

@@ -0,0 +1,60 @@
# ALECTRA-small-OWT
This is an extension of
[ELECTRA](https://openreview.net/forum?id=r1xMH1BtvB) small model, trained on the
[OpenWebText corpus](https://skylion007.github.io/OpenWebTextCorpus/).
The training task (discriminative LM / replaced-token-detection) can be generalized to any transformer type. Here, we train an ALBERT model under the same scheme.
## Pretraining task
![electra task diagram](https://github.com/shoarora/lmtuners/raw/master/assets/electra.png)
(figure from [Clark et al. 2020](https://openreview.net/pdf?id=r1xMH1BtvB))
ELECTRA uses discriminative LM / replaced-token-detection for pretraining.
This involves a generator (a Masked LM model) creating examples for a discriminator
to classify as original or replaced for each token.
The generator generalizes to any `*ForMaskedLM` model and the discriminator could be
any `*ForTokenClassification` model. Therefore, we can extend the task to ALBERT models,
not just BERT as in the original paper.
## Usage
```python
from transformers import AlbertForSequenceClassification, BertTokenizer
# Both models use the bert-base-uncased tokenizer and vocab.
tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
alectra = AlbertForSequenceClassification.from_pretrained('shoarora/alectra-small-owt')
```
NOTE: this ALBERT model uses a BERT WordPiece tokenizer.
## Code
The pytorch module that implements this task is available [here](https://github.com/shoarora/lmtuners/blob/master/lmtuners/lightning_modules/discriminative_lm.py).
Further implementation information [here](https://github.com/shoarora/lmtuners/tree/master/experiments/disc_lm_small),
and [here](https://github.com/shoarora/lmtuners/blob/master/experiments/disc_lm_small/train_alectra_small.py) is the script that created this model.
This specific model was trained with the following params:
- `batch_size: 512`
- `training_steps: 5e5`
- `warmup_steps: 4e4`
- `learning_rate: 2e-3`
## Downstream tasks
#### GLUE Dev results
| Model | # Params | CoLA | SST | MRPC | STS | QQP | MNLI | QNLI | RTE |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| ELECTRA-Small++ | 14M | 57.0 | 91. | 88.0 | 87.5 | 89.0 | 81.3 | 88.4 | 66.7|
| ELECTRA-Small-OWT | 14M | 56.8 | 88.3| 87.4 | 86.8 | 88.3 | 78.9 | 87.9 | 68.5|
| ELECTRA-Small-OWT (ours) | 17M | 56.3 | 88.4| 75.0 | 86.1 | 89.1 | 77.9 | 83.0 | 67.1|
| ALECTRA-Small-OWT (ours) | 4M | 50.6 | 89.1| 86.3 | 87.2 | 89.1 | 78.2 | 85.9 | 69.6|
#### GLUE Test results
| Model | # Params | CoLA | SST | MRPC | STS | QQP | MNLI | QNLI | RTE |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| BERT-Base | 110M | 52.1 | 93.5| 84.8 | 85.9 | 89.2 | 84.6 | 90.5 | 66.4|
| GPT | 117M | 45.4 | 91.3| 75.7 | 80.0 | 88.5 | 82.1 | 88.1 | 56.0|
| ELECTRA-Small++ | 14M | 57.0 | 91.2| 88.0 | 87.5 | 89.0 | 81.3 | 88.4 | 66.7|
| ELECTRA-Small-OWT (ours) | 17M | 57.4 | 89.3| 76.2 | 81.9 | 87.5 | 78.1 | 82.4 | 68.1|
| ALECTRA-Small-OWT (ours) | 4M | 43.9 | 87.9| 82.1 | 82.0 | 87.6 | 77.9 | 85.8 | 67.5|

View File

@@ -0,0 +1,59 @@
# ELECTRA-small-OWT
This is an unnoficial implementation of an
[ELECTRA](https://openreview.net/forum?id=r1xMH1BtvB) small model, trained on the
[OpenWebText corpus](https://skylion007.github.io/OpenWebTextCorpus/).
Differences from official ELECTRA models:
- we use a `BertForMaskedLM` as the generator and `BertForTokenClassification` as the discriminator
- they use an embedding projection layer, but Bert doesn't have one
## Pretraining ttask
![electra task diagram](https://github.com/shoarora/lmtuners/raw/master/assets/electra.png)
(figure from [Clark et al. 2020](https://openreview.net/pdf?id=r1xMH1BtvB))
ELECTRA uses discriminative LM / replaced-token-detection for pretraining.
This involves a generator (a Masked LM model) creating examples for a discriminator
to classify as original or replaced for each token.
## Usage
```python
from transformers import BertForSequenceClassification, BertTokenizer
tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
electra = BertForSequenceClassification.from_pretrained('shoarora/electra-small-owt')
```
## Code
The pytorch module that implements this task is available [here](https://github.com/shoarora/lmtuners/blob/master/lmtuners/lightning_modules/discriminative_lm.py).
Further implementation information [here](https://github.com/shoarora/lmtuners/tree/master/experiments/disc_lm_small),
and [here](https://github.com/shoarora/lmtuners/blob/master/experiments/disc_lm_small/train_electra_small.py) is the script that created this model.
This specific model was trained with the following params:
- `batch_size: 512`
- `training_steps: 5e5`
- `warmup_steps: 4e4`
- `learning_rate: 2e-3`
## Downstream tasks
#### GLUE Dev results
| Model | # Params | CoLA | SST | MRPC | STS | QQP | MNLI | QNLI | RTE |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| ELECTRA-Small++ | 14M | 57.0 | 91. | 88.0 | 87.5 | 89.0 | 81.3 | 88.4 | 66.7|
| ELECTRA-Small-OWT | 14M | 56.8 | 88.3| 87.4 | 86.8 | 88.3 | 78.9 | 87.9 | 68.5|
| ELECTRA-Small-OWT (ours) | 17M | 56.3 | 88.4| 75.0 | 86.1 | 89.1 | 77.9 | 83.0 | 67.1|
| ALECTRA-Small-OWT (ours) | 4M | 50.6 | 89.1| 86.3 | 87.2 | 89.1 | 78.2 | 85.9 | 69.6|
- Table initialized from [ELECTRA github repo](https://github.com/google-research/electra)
#### GLUE Test results
| Model | # Params | CoLA | SST | MRPC | STS | QQP | MNLI | QNLI | RTE |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| BERT-Base | 110M | 52.1 | 93.5| 84.8 | 85.9 | 89.2 | 84.6 | 90.5 | 66.4|
| GPT | 117M | 45.4 | 91.3| 75.7 | 80.0 | 88.5 | 82.1 | 88.1 | 56.0|
| ELECTRA-Small++ | 14M | 57.0 | 91.2| 88.0 | 87.5 | 89.0 | 81.3 | 88.4 | 66.7|
| ELECTRA-Small-OWT (ours) | 17M | 57.4 | 89.3| 76.2 | 81.9 | 87.5 | 78.1 | 82.4 | 68.1|
| ALECTRA-Small-OWT (ours) | 4M | 43.9 | 87.9| 82.1 | 82.0 | 87.6 | 77.9 | 85.8 | 67.5|

View File

@@ -0,0 +1,6 @@
---
tags:
- exbert
---
[![ExBERT](https://img.shields.io/badge/Visualize%20Attentions-ExBERT-green)](https://huggingface.co/exbert/?model=xlm-mlm-en-2048)

View File

@@ -0,0 +1,6 @@
---
tags:
- exbert
---
[![ExBERT](https://img.shields.io/badge/Visualize%20Attentions-ExBERT-green)](https://huggingface.co/exbert/?model=xlm-roberta-base)

View File

@@ -76,14 +76,14 @@ extras["testing"] = ["pytest", "pytest-xdist"]
extras["docs"] = ["recommonmark", "sphinx", "sphinx-markdown-tables", "sphinx-rtd-theme"]
extras["quality"] = [
"black",
"isort @ git+git://github.com/timothycrosley/isort.git@e63ae06ec7d70b06df9e528357650281a3d3ec22#egg=isort",
"isort",
"flake8",
]
extras["dev"] = extras["testing"] + extras["quality"] + ["mecab-python3", "scikit-learn", "tensorflow", "torch"]
setup(
name="transformers",
version="2.7.0",
version="2.8.0",
author="Thomas Wolf, Lysandre Debut, Victor Sanh, Julien Chaumond, Sam Shleifer, Google AI Language Team Authors, Open AI team Authors, Facebook AI Authors, Carnegie Mellon University Authors",
author_email="thomas@huggingface.co",
description="State-of-the-art Natural Language Processing for TensorFlow 2.0 and PyTorch",

View File

@@ -2,7 +2,7 @@
# There's no way to ignore "F401 '...' imported but unused" warnings in this
# module, but to preserve other warnings. So, don't check this module at all.
__version__ = "2.7.0"
__version__ = "2.8.0"
# Work around to update TensorFlow's absl.logging threshold which alters the
# default Python logging output behavior when present.
@@ -38,6 +38,7 @@ from .configuration_bert import BERT_PRETRAINED_CONFIG_ARCHIVE_MAP, BertConfig
from .configuration_camembert import CAMEMBERT_PRETRAINED_CONFIG_ARCHIVE_MAP, CamembertConfig
from .configuration_ctrl import CTRL_PRETRAINED_CONFIG_ARCHIVE_MAP, CTRLConfig
from .configuration_distilbert import DISTILBERT_PRETRAINED_CONFIG_ARCHIVE_MAP, DistilBertConfig
from .configuration_electra import ELECTRA_PRETRAINED_CONFIG_ARCHIVE_MAP, ElectraConfig
from .configuration_flaubert import FLAUBERT_PRETRAINED_CONFIG_ARCHIVE_MAP, FlaubertConfig
from .configuration_gpt2 import GPT2_PRETRAINED_CONFIG_ARCHIVE_MAP, GPT2Config
from .configuration_mmbt import MMBTConfig
@@ -127,6 +128,7 @@ from .tokenization_bert_japanese import BertJapaneseTokenizer, CharacterTokenize
from .tokenization_camembert import CamembertTokenizer
from .tokenization_ctrl import CTRLTokenizer
from .tokenization_distilbert import DistilBertTokenizer, DistilBertTokenizerFast
from .tokenization_electra import ElectraTokenizer, ElectraTokenizerFast
from .tokenization_flaubert import FlaubertTokenizer
from .tokenization_gpt2 import GPT2Tokenizer, GPT2TokenizerFast
from .tokenization_openai import OpenAIGPTTokenizer, OpenAIGPTTokenizerFast
@@ -297,6 +299,15 @@ if is_torch_available():
FLAUBERT_PRETRAINED_MODEL_ARCHIVE_MAP,
)
from .modeling_electra import (
ElectraForPreTraining,
ElectraForMaskedLM,
ElectraForTokenClassification,
ElectraModel,
load_tf_weights_in_electra,
ELECTRA_PRETRAINED_MODEL_ARCHIVE_MAP,
)
# Optimization
from .optimization import (
AdamW,
@@ -463,6 +474,15 @@ if is_tf_available():
TF_T5_PRETRAINED_MODEL_ARCHIVE_MAP,
)
from .modeling_tf_electra import (
TFElectraPreTrainedModel,
TFElectraModel,
TFElectraForPreTraining,
TFElectraForMaskedLM,
TFElectraForTokenClassification,
TF_ELECTRA_PRETRAINED_MODEL_ARCHIVE_MAP,
)
# Optimization
from .optimization_tf import WarmUp, create_optimizer, AdamWeightDecay, GradientAccumulator

View File

@@ -18,12 +18,6 @@ def _gelu_python(x):
return x * 0.5 * (1.0 + torch.erf(x / math.sqrt(2.0)))
if torch.__version__ < "1.4.0":
gelu = _gelu_python
else:
gelu = F.gelu
def gelu_new(x):
""" Implementation of the gelu activation function currently in Google Bert repo (identical to OpenAI GPT).
Also see https://arxiv.org/abs/1606.08415
@@ -31,6 +25,12 @@ def gelu_new(x):
return 0.5 * x * (1 + torch.tanh(math.sqrt(2 / math.pi) * (x + 0.044715 * torch.pow(x, 3))))
if torch.__version__ < "1.4.0":
gelu = _gelu_python
else:
gelu = F.gelu
gelu_new = torch.jit.script(gelu_new)
ACT2FN = {
"relu": F.relu,
"swish": swish,

View File

@@ -24,6 +24,7 @@ from .configuration_bert import BERT_PRETRAINED_CONFIG_ARCHIVE_MAP, BertConfig
from .configuration_camembert import CAMEMBERT_PRETRAINED_CONFIG_ARCHIVE_MAP, CamembertConfig
from .configuration_ctrl import CTRL_PRETRAINED_CONFIG_ARCHIVE_MAP, CTRLConfig
from .configuration_distilbert import DISTILBERT_PRETRAINED_CONFIG_ARCHIVE_MAP, DistilBertConfig
from .configuration_electra import ELECTRA_PRETRAINED_CONFIG_ARCHIVE_MAP, ElectraConfig
from .configuration_flaubert import FLAUBERT_PRETRAINED_CONFIG_ARCHIVE_MAP, FlaubertConfig
from .configuration_gpt2 import GPT2_PRETRAINED_CONFIG_ARCHIVE_MAP, GPT2Config
from .configuration_openai import OPENAI_GPT_PRETRAINED_CONFIG_ARCHIVE_MAP, OpenAIGPTConfig
@@ -57,6 +58,7 @@ ALL_PRETRAINED_CONFIG_ARCHIVE_MAP = dict(
T5_PRETRAINED_CONFIG_ARCHIVE_MAP,
XLM_ROBERTA_PRETRAINED_CONFIG_ARCHIVE_MAP,
FLAUBERT_PRETRAINED_CONFIG_ARCHIVE_MAP,
ELECTRA_PRETRAINED_CONFIG_ARCHIVE_MAP,
]
for key, value, in pretrained_map.items()
)
@@ -79,6 +81,7 @@ CONFIG_MAPPING = OrderedDict(
("xlnet", XLNetConfig,),
("xlm", XLMConfig,),
("ctrl", CTRLConfig,),
("electra", ElectraConfig,),
]
)
@@ -133,6 +136,7 @@ class AutoConfig:
- contains `xlm`: :class:`~transformers.XLMConfig` (XLM model)
- contains `ctrl` : :class:`~transformers.CTRLConfig` (CTRL model)
- contains `flaubert` : :class:`~transformers.FlaubertConfig` (Flaubert model)
- contains `electra` : :class:`~transformers.ElectraConfig` (ELECTRA model)
Args:

View File

@@ -0,0 +1,132 @@
# coding=utf-8
# Copyright 2018 The Google AI Language Team Authors and The HuggingFace Inc. team.
# Copyright (c) 2018, NVIDIA CORPORATION. All rights reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
""" ELECTRA model configuration """
import logging
from .configuration_utils import PretrainedConfig
logger = logging.getLogger(__name__)
ELECTRA_PRETRAINED_CONFIG_ARCHIVE_MAP = {
"google/electra-small-generator": "https://s3.amazonaws.com/models.huggingface.co/bert/google/electra-small-generator/config.json",
"google/electra-base-generator": "https://s3.amazonaws.com/models.huggingface.co/bert/google/electra-base-generator/config.json",
"google/electra-large-generator": "https://s3.amazonaws.com/models.huggingface.co/bert/google/electra-large-generator/config.json",
"google/electra-small-discriminator": "https://s3.amazonaws.com/models.huggingface.co/bert/google/electra-small-discriminator/config.json",
"google/electra-base-discriminator": "https://s3.amazonaws.com/models.huggingface.co/bert/google/electra-base-discriminator/config.json",
"google/electra-large-discriminator": "https://s3.amazonaws.com/models.huggingface.co/bert/google/electra-large-discriminator/config.json",
}
class ElectraConfig(PretrainedConfig):
r"""
This is the configuration class to store the configuration of a :class:`~transformers.ElectraModel`.
It is used to instantiate an ELECTRA model according to the specified arguments, defining the model
architecture. Instantiating a configuration with the defaults will yield a similar configuration to that of
the ELECTRA `google/electra-small-discriminator <https://huggingface.co/google/electra-small-discriminator>`__
architecture.
Configuration objects inherit from :class:`~transformers.PretrainedConfig` and can be used
to control the model outputs. Read the documentation from :class:`~transformers.PretrainedConfig`
for more information.
Args:
vocab_size (:obj:`int`, optional, defaults to 30522):
Vocabulary size of the ELECTRA model. Defines the different tokens that
can be represented by the `inputs_ids` passed to the forward method of :class:`~transformers.ElectraModel`.
embedding_size (:obj:`int`, optional, defaults to 128):
Dimensionality of the encoder layers and the pooler layer.
hidden_size (:obj:`int`, optional, defaults to 256):
Dimensionality of the encoder layers and the pooler layer.
num_hidden_layers (:obj:`int`, optional, defaults to 12):
Number of hidden layers in the Transformer encoder.
num_attention_heads (:obj:`int`, optional, defaults to 4):
Number of attention heads for each attention layer in the Transformer encoder.
intermediate_size (:obj:`int`, optional, defaults to 1024):
Dimensionality of the "intermediate" (i.e., feed-forward) layer in the Transformer encoder.
hidden_act (:obj:`str` or :obj:`function`, optional, defaults to "gelu"):
The non-linear activation function (function or string) in the encoder and pooler.
If string, "gelu", "relu", "swish" and "gelu_new" are supported.
hidden_dropout_prob (:obj:`float`, optional, defaults to 0.1):
The dropout probabilitiy for all fully connected layers in the embeddings, encoder, and pooler.
attention_probs_dropout_prob (:obj:`float`, optional, defaults to 0.1):
The dropout ratio for the attention probabilities.
max_position_embeddings (:obj:`int`, optional, defaults to 512):
The maximum sequence length that this model might ever be used with.
Typically set this to something large just in case (e.g., 512 or 1024 or 2048).
type_vocab_size (:obj:`int`, optional, defaults to 2):
The vocabulary size of the `token_type_ids` passed into :class:`~transformers.ElectraModel`.
initializer_range (:obj:`float`, optional, defaults to 0.02):
The standard deviation of the truncated_normal_initializer for initializing all weight matrices.
layer_norm_eps (:obj:`float`, optional, defaults to 1e-12):
The epsilon used by the layer normalization layers.
Example::
from transformers import ElectraModel, ElectraConfig
# Initializing a ELECTRA electra-base-uncased style configuration
configuration = ElectraConfig()
# Initializing a model from the electra-base-uncased style configuration
model = ElectraModel(configuration)
# Accessing the model configuration
configuration = model.config
Attributes:
pretrained_config_archive_map (Dict[str, str]):
A dictionary containing all the available pre-trained checkpoints.
"""
pretrained_config_archive_map = ELECTRA_PRETRAINED_CONFIG_ARCHIVE_MAP
model_type = "electra"
def __init__(
self,
vocab_size=30522,
embedding_size=128,
hidden_size=256,
num_hidden_layers=12,
num_attention_heads=4,
intermediate_size=1024,
hidden_act="gelu",
hidden_dropout_prob=0.1,
attention_probs_dropout_prob=0.1,
max_position_embeddings=512,
type_vocab_size=2,
initializer_range=0.02,
layer_norm_eps=1e-12,
pad_token_id=0,
**kwargs
):
super().__init__(pad_token_id=pad_token_id, **kwargs)
self.vocab_size = vocab_size
self.embedding_size = embedding_size
self.hidden_size = hidden_size
self.num_hidden_layers = num_hidden_layers
self.num_attention_heads = num_attention_heads
self.intermediate_size = intermediate_size
self.hidden_act = hidden_act
self.hidden_dropout_prob = hidden_dropout_prob
self.attention_probs_dropout_prob = attention_probs_dropout_prob
self.max_position_embeddings = max_position_embeddings
self.type_vocab_size = type_vocab_size
self.initializer_range = initializer_range
self.layer_norm_eps = layer_norm_eps

View File

@@ -80,6 +80,7 @@ class PretrainedConfig(object):
self.repetition_penalty = kwargs.pop("repetition_penalty", 1.0)
self.length_penalty = kwargs.pop("length_penalty", 1.0)
self.no_repeat_ngram_size = kwargs.pop("no_repeat_ngram_size", 0)
self.bad_words_ids = kwargs.pop("bad_words_ids", None)
self.num_return_sequences = kwargs.pop("num_return_sequences", 1)
# Fine-tuning task arguments

View File

@@ -0,0 +1,79 @@
# coding=utf-8
# Copyright 2018 The HuggingFace Inc. team.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
"""Convert ELECTRA checkpoint."""
import argparse
import logging
import torch
from transformers import ElectraConfig, ElectraForMaskedLM, ElectraForPreTraining, load_tf_weights_in_electra
logging.basicConfig(level=logging.INFO)
def convert_tf_checkpoint_to_pytorch(tf_checkpoint_path, config_file, pytorch_dump_path, discriminator_or_generator):
# Initialise PyTorch model
config = ElectraConfig.from_json_file(config_file)
print("Building PyTorch model from configuration: {}".format(str(config)))
if discriminator_or_generator == "discriminator":
model = ElectraForPreTraining(config)
elif discriminator_or_generator == "generator":
model = ElectraForMaskedLM(config)
else:
raise ValueError("The discriminator_or_generator argument should be either 'discriminator' or 'generator'")
# Load weights from tf checkpoint
load_tf_weights_in_electra(
model, config, tf_checkpoint_path, discriminator_or_generator=discriminator_or_generator
)
# Save pytorch-model
print("Save PyTorch model to {}".format(pytorch_dump_path))
torch.save(model.state_dict(), pytorch_dump_path)
if __name__ == "__main__":
parser = argparse.ArgumentParser()
# Required parameters
parser.add_argument(
"--tf_checkpoint_path", default=None, type=str, required=True, help="Path to the TensorFlow checkpoint path."
)
parser.add_argument(
"--config_file",
default=None,
type=str,
required=True,
help="The config json file corresponding to the pre-trained model. \n"
"This specifies the model architecture.",
)
parser.add_argument(
"--pytorch_dump_path", default=None, type=str, required=True, help="Path to the output PyTorch model."
)
parser.add_argument(
"--discriminator_or_generator",
default=None,
type=str,
required=True,
help="Whether to export the generator or the discriminator. Should be a string, either 'discriminator' or "
"'generator'.",
)
args = parser.parse_args()
convert_tf_checkpoint_to_pytorch(
args.tf_checkpoint_path, args.config_file, args.pytorch_dump_path, args.discriminator_or_generator
)

View File

@@ -25,6 +25,7 @@ from transformers import (
CAMEMBERT_PRETRAINED_CONFIG_ARCHIVE_MAP,
CTRL_PRETRAINED_CONFIG_ARCHIVE_MAP,
DISTILBERT_PRETRAINED_CONFIG_ARCHIVE_MAP,
ELECTRA_PRETRAINED_CONFIG_ARCHIVE_MAP,
FLAUBERT_PRETRAINED_CONFIG_ARCHIVE_MAP,
GPT2_PRETRAINED_CONFIG_ARCHIVE_MAP,
OPENAI_GPT_PRETRAINED_CONFIG_ARCHIVE_MAP,
@@ -39,6 +40,7 @@ from transformers import (
CamembertConfig,
CTRLConfig,
DistilBertConfig,
ElectraConfig,
FlaubertConfig,
GPT2Config,
OpenAIGPTConfig,
@@ -52,6 +54,7 @@ from transformers import (
TFCTRLLMHeadModel,
TFDistilBertForMaskedLM,
TFDistilBertForQuestionAnswering,
TFElectraForPreTraining,
TFFlaubertWithLMHeadModel,
TFGPT2LMHeadModel,
TFOpenAIGPTLMHeadModel,
@@ -110,6 +113,8 @@ if is_torch_available():
ALBERT_PRETRAINED_MODEL_ARCHIVE_MAP,
T5ForConditionalGeneration,
T5_PRETRAINED_MODEL_ARCHIVE_MAP,
ElectraForPreTraining,
ELECTRA_PRETRAINED_MODEL_ARCHIVE_MAP,
)
else:
(
@@ -147,6 +152,8 @@ else:
ALBERT_PRETRAINED_MODEL_ARCHIVE_MAP,
T5ForConditionalGeneration,
T5_PRETRAINED_MODEL_ARCHIVE_MAP,
ElectraForPreTraining,
ELECTRA_PRETRAINED_MODEL_ARCHIVE_MAP,
) = (
None,
None,
@@ -182,6 +189,8 @@ else:
None,
None,
None,
None,
None,
)
@@ -321,6 +330,13 @@ MODEL_CLASSES = {
T5_PRETRAINED_MODEL_ARCHIVE_MAP,
T5_PRETRAINED_CONFIG_ARCHIVE_MAP,
),
"electra": (
ElectraConfig,
TFElectraForPreTraining,
ElectraForPreTraining,
ELECTRA_PRETRAINED_MODEL_ARCHIVE_MAP,
ELECTRA_PRETRAINED_CONFIG_ARCHIVE_MAP,
),
}

View File

@@ -28,7 +28,7 @@ from ...file_utils import is_tf_available, is_torch_available
logger = logging.getLogger(__name__)
@dataclass(frozen=True)
@dataclass(frozen=False)
class InputExample:
"""
A single training/test example for simple sequence classification.

View File

@@ -26,6 +26,7 @@ from .configuration_auto import (
CamembertConfig,
CTRLConfig,
DistilBertConfig,
ElectraConfig,
FlaubertConfig,
GPT2Config,
OpenAIGPTConfig,
@@ -76,6 +77,13 @@ from .modeling_distilbert import (
DistilBertForTokenClassification,
DistilBertModel,
)
from .modeling_electra import (
ELECTRA_PRETRAINED_MODEL_ARCHIVE_MAP,
ElectraForMaskedLM,
ElectraForPreTraining,
ElectraForTokenClassification,
ElectraModel,
)
from .modeling_flaubert import (
FLAUBERT_PRETRAINED_MODEL_ARCHIVE_MAP,
FlaubertForQuestionAnsweringSimple,
@@ -141,6 +149,7 @@ ALL_PRETRAINED_MODEL_ARCHIVE_MAP = dict(
T5_PRETRAINED_MODEL_ARCHIVE_MAP,
FLAUBERT_PRETRAINED_MODEL_ARCHIVE_MAP,
XLM_ROBERTA_PRETRAINED_MODEL_ARCHIVE_MAP,
ELECTRA_PRETRAINED_MODEL_ARCHIVE_MAP,
]
for key, value, in pretrained_map.items()
)
@@ -162,6 +171,7 @@ MODEL_MAPPING = OrderedDict(
(FlaubertConfig, FlaubertModel),
(XLMConfig, XLMModel),
(CTRLConfig, CTRLModel),
(ElectraConfig, ElectraModel),
]
)
@@ -182,6 +192,7 @@ MODEL_FOR_PRETRAINING_MAPPING = OrderedDict(
(FlaubertConfig, FlaubertWithLMHeadModel),
(XLMConfig, XLMWithLMHeadModel),
(CTRLConfig, CTRLLMHeadModel),
(ElectraConfig, ElectraForPreTraining),
]
)
@@ -202,6 +213,7 @@ MODEL_WITH_LM_HEAD_MAPPING = OrderedDict(
(FlaubertConfig, FlaubertWithLMHeadModel),
(XLMConfig, XLMWithLMHeadModel),
(CTRLConfig, CTRLLMHeadModel),
(ElectraConfig, ElectraForMaskedLM),
]
)
@@ -242,6 +254,7 @@ MODEL_FOR_TOKEN_CLASSIFICATION_MAPPING = OrderedDict(
(BertConfig, BertForTokenClassification),
(XLNetConfig, XLNetForTokenClassification),
(AlbertConfig, AlbertForTokenClassification),
(ElectraConfig, ElectraForTokenClassification),
]
)
@@ -281,7 +294,8 @@ class AutoModel(object):
- isInstance of `transfo-xl` configuration class: :class:`~transformers.TransfoXLModel` (Transformer-XL model)
- isInstance of `xlnet` configuration class: :class:`~transformers.XLNetModel` (XLNet model)
- isInstance of `xlm` configuration class: :class:`~transformers.XLMModel` (XLM model)
- isInstance of `flaubert` configuration class: :class:`~transformers.FlaubertModel` (XLM model)
- isInstance of `flaubert` configuration class: :class:`~transformers.FlaubertModel` (Flaubert model)
- isInstance of `electra` configuration class: :class:`~transformers.ElectraModel` (Electra model)
Examples::
@@ -322,7 +336,8 @@ class AutoModel(object):
- contains `xlnet`: :class:`~transformers.XLNetModel` (XLNet model)
- contains `xlm`: :class:`~transformers.XLMModel` (XLM model)
- contains `ctrl`: :class:`~transformers.CTRLModel` (Salesforce CTRL model)
- contains `flaubert`: :class:`~transformers.Flaubert` (Flaubert model)
- contains `flaubert`: :class:`~transformers.FlaubertModel` (Flaubert model)
- contains `electra`: :class:`~transformers.ElectraModel` (Electra model)
The model is set in evaluation mode by default using `model.eval()` (Dropout modules are deactivated)
To train the model, you should first set it back in training mode with `model.train()`
@@ -430,6 +445,7 @@ class AutoModelForPreTraining(object):
- isInstance of `xlnet` configuration class: :class:`~transformers.XLNetLMHeadModel` (XLNet model)
- isInstance of `xlm` configuration class: :class:`~transformers.XLMWithLMHeadModel` (XLM model)
- isInstance of `flaubert` configuration class: :class:`~transformers.FlaubertWithLMHeadModel` (Flaubert model)
- isInstance of `electra` configuration class: :class:`~transformers.ElectraForPreTraining` (Electra model)
Examples::
@@ -470,6 +486,7 @@ class AutoModelForPreTraining(object):
- contains `xlm`: :class:`~transformers.XLMWithLMHeadModel` (XLM model)
- contains `ctrl`: :class:`~transformers.CTRLLMHeadModel` (Salesforce CTRL model)
- contains `flaubert`: :class:`~transformers.FlaubertWithLMHeadModel` (Flaubert model)
- contains `electra`: :class:`~transformers.ElectraForPreTraining` (Electra model)
The model is set in evaluation mode by default using `model.eval()` (Dropout modules are deactivated)
To train the model, you should first set it back in training mode with `model.train()`
@@ -571,6 +588,7 @@ class AutoModelWithLMHead(object):
- isInstance of `xlnet` configuration class: :class:`~transformers.XLNetLMHeadModel` (XLNet model)
- isInstance of `xlm` configuration class: :class:`~transformers.XLMWithLMHeadModel` (XLM model)
- isInstance of `flaubert` configuration class: :class:`~transformers.FlaubertWithLMHeadModel` (Flaubert model)
- isInstance of `electra` configuration class: :class:`~transformers.ElectraForMaskedLM` (Electra model)
Examples::
@@ -612,6 +630,7 @@ class AutoModelWithLMHead(object):
- contains `xlm`: :class:`~transformers.XLMWithLMHeadModel` (XLM model)
- contains `ctrl`: :class:`~transformers.CTRLLMHeadModel` (Salesforce CTRL model)
- contains `flaubert`: :class:`~transformers.FlaubertWithLMHeadModel` (Flaubert model)
- contains `electra`: :class:`~transformers.ElectraForMaskedLM` (Electra model)
The model is set in evaluation mode by default using `model.eval()` (Dropout modules are deactivated)
To train the model, you should first set it back in training mode with `model.train()`
@@ -998,6 +1017,7 @@ class AutoModelForTokenClassification:
- isInstance of `xlnet` configuration class: :class:`~transformers.XLNetModelForTokenClassification` (XLNet model)
- isInstance of `camembert` configuration class: :class:`~transformers.CamembertModelForTokenClassification` (Camembert model)
- isInstance of `roberta` configuration class: :class:`~transformers.RobertaModelForTokenClassification` (Roberta model)
- isInstance of `electra` configuration class: :class:`~transformers.ElectraForTokenClassification` (Electra model)
Examples::
@@ -1035,6 +1055,7 @@ class AutoModelForTokenClassification:
- contains `bert`: :class:`~transformers.BertForTokenClassification` (Bert model)
- contains `xlnet`: :class:`~transformers.XLNetForTokenClassification` (XLNet model)
- contains `roberta`: :class:`~transformers.RobertaForTokenClassification` (Roberta model)
- contains `electra`: :class:`~transformers.ElectraForTokenClassification` (Electra model)
The model is set in evaluation mode by default using `model.eval()` (Dropout modules are deactivated)
To train the model, you should first set it back in training mode with `model.train()`

View File

@@ -116,7 +116,6 @@ class PretrainedBartModel(PreTrainedModel):
config_class = BartConfig
base_model_prefix = "model"
pretrained_model_archive_map = BART_PRETRAINED_MODEL_ARCHIVE_MAP
encoder_outputs_batch_dim_idx = 1 # outputs shaped (seq_len, bs, ...)
def _init_weights(self, module):
std = self.config.init_std
@@ -294,7 +293,10 @@ class BartEncoder(nn.Module):
if self.output_hidden_states:
encoder_states.append(x)
# T x B x C -> B x T x C
encoder_states = [hidden_state.transpose(0, 1) for hidden_state in encoder_states]
x = x.transpose(0, 1)
return x, encoder_states, all_attentions
@@ -448,7 +450,11 @@ class BartDecoder(nn.Module):
x = self.layernorm_embedding(x)
x = F.dropout(x, p=self.dropout, training=self.training)
x = x.transpose(0, 1) # (seq_len, BS, model_dim)
# Convert to Bart output format: (seq_len, BS, model_dim) -> (BS, seq_len, model_dim)
x = x.transpose(0, 1)
encoder_hidden_states = encoder_hidden_states.transpose(0, 1)
# decoder layers
all_hidden_states = ()
all_self_attns = ()
@@ -477,9 +483,10 @@ class BartDecoder(nn.Module):
if self.output_attentions:
all_self_attns += (layer_self_attn,)
# Convert shapes from (seq_len, BS, model_dim) to (BS, seq_len, model_dim)
# Convert to standart output format: (seq_len, BS, model_dim) -> (BS, seq_len, model_dim)
all_hidden_states = [hidden_state.transpose(0, 1) for hidden_state in all_hidden_states]
x = x.transpose(0, 1)
encoder_hidden_states = encoder_hidden_states.transpose(0, 1)
if self.output_past:
next_cache = ((encoder_hidden_states, encoder_padding_mask), next_decoder_cache)
@@ -805,6 +812,8 @@ class BartModel(PretrainedBartModel):
def set_input_embeddings(self, value):
self.shared = value
self.encoder.embed_tokens = self.shared
self.decoder.embed_tokens = self.shared
def get_output_embeddings(self):
return _make_linear_from_emb(self.shared) # make it on the fly
@@ -928,10 +937,9 @@ class BartForConditionalGeneration(PretrainedBartModel):
layer_past_new = {
attn_key: _reorder_buffer(attn_cache, beam_idx) for attn_key, attn_cache in layer_past.items()
}
# reordered_layer_past = [layer_past[:, i].unsqueeze(1).clone().detach() for i in beam_idx]
# reordered_layer_past = torch.cat(reordered_layer_past, dim=1)
reordered_past.append(layer_past_new)
new_enc_out = enc_out if enc_out is None else enc_out.index_select(1, beam_idx)
new_enc_out = enc_out if enc_out is None else enc_out.index_select(0, beam_idx)
new_enc_mask = enc_mask if enc_mask is None else enc_mask.index_select(0, beam_idx)
past = ((new_enc_out, new_enc_mask), reordered_past)

View File

@@ -183,7 +183,7 @@ class BertEmbeddings(nn.Module):
class BertSelfAttention(nn.Module):
def __init__(self, config):
super().__init__()
if config.hidden_size % config.num_attention_heads != 0:
if config.hidden_size % config.num_attention_heads != 0 and not hasattr(config, "embedding_size"):
raise ValueError(
"The hidden size (%d) is not a multiple of the number of attention "
"heads (%d)" % (config.hidden_size, config.num_attention_heads)
@@ -845,7 +845,7 @@ class BertForPreTraining(BertPreTrainedModel):
Total loss as the sum of the masked language modeling loss and the next sequence prediction (classification) loss.
prediction_scores (:obj:`torch.FloatTensor` of shape :obj:`(batch_size, sequence_length, config.vocab_size)`)
Prediction scores of the language modeling head (scores for each vocabulary token before SoftMax).
seq_relationship_scores (:obj:`torch.FloatTensor` of shape :obj:`(batch_size, sequence_length, 2)`):
seq_relationship_scores (:obj:`torch.FloatTensor` of shape :obj:`(batch_size, 2)`):
Prediction scores of the next sequence prediction (classification) head (scores of True/False
continuation before SoftMax).
hidden_states (:obj:`tuple(torch.FloatTensor)`, `optional`, returned when :obj:`config.output_hidden_states=True`):
@@ -1048,7 +1048,7 @@ class BertForNextSentencePrediction(BertPreTrainedModel):
:obj:`tuple(torch.FloatTensor)` comprising various elements depending on the configuration (:class:`~transformers.BertConfig`) and inputs:
loss (:obj:`torch.FloatTensor` of shape :obj:`(1,)`, `optional`, returned when :obj:`next_sentence_label` is provided):
Next sequence prediction (classification) loss.
seq_relationship_scores (:obj:`torch.FloatTensor` of shape :obj:`(batch_size, sequence_length, 2)`):
seq_relationship_scores (:obj:`torch.FloatTensor` of shape :obj:`(batch_size, 2)`):
Prediction scores of the next sequence prediction (classification) head (scores of True/False continuation before SoftMax).
hidden_states (:obj:`tuple(torch.FloatTensor)`, `optional`, returned when ``config.output_hidden_states=True``):
Tuple of :obj:`torch.FloatTensor` (one for the output of the embeddings + one for the output of each layer)

View File

@@ -0,0 +1,671 @@
import logging
import os
import torch
import torch.nn as nn
from transformers import ElectraConfig, add_start_docstrings
from transformers.activations import get_activation
from .file_utils import add_start_docstrings_to_callable
from .modeling_bert import BertEmbeddings, BertEncoder, BertLayerNorm, BertPreTrainedModel
logger = logging.getLogger(__name__)
ELECTRA_PRETRAINED_MODEL_ARCHIVE_MAP = {
"google/electra-small-generator": "https://s3.amazonaws.com/models.huggingface.co/bert/google/electra-small-generator/pytorch_model.bin",
"google/electra-base-generator": "https://s3.amazonaws.com/models.huggingface.co/bert/google/electra-base-generator/pytorch_model.bin",
"google/electra-large-generator": "https://s3.amazonaws.com/models.huggingface.co/bert/google/electra-large-generator/pytorch_model.bin",
"google/electra-small-discriminator": "https://s3.amazonaws.com/models.huggingface.co/bert/google/electra-small-discriminator/pytorch_model.bin",
"google/electra-base-discriminator": "https://s3.amazonaws.com/models.huggingface.co/bert/google/electra-base-discriminator/pytorch_model.bin",
"google/electra-large-discriminator": "https://s3.amazonaws.com/models.huggingface.co/bert/google/electra-large-discriminator/pytorch_model.bin",
}
def load_tf_weights_in_electra(model, config, tf_checkpoint_path, discriminator_or_generator="discriminator"):
""" Load tf checkpoints in a pytorch model.
"""
try:
import re
import numpy as np
import tensorflow as tf
except ImportError:
logger.error(
"Loading a TensorFlow model in PyTorch, requires TensorFlow to be installed. Please see "
"https://www.tensorflow.org/install/ for installation instructions."
)
raise
tf_path = os.path.abspath(tf_checkpoint_path)
logger.info("Converting TensorFlow checkpoint from {}".format(tf_path))
# Load weights from TF model
init_vars = tf.train.list_variables(tf_path)
names = []
arrays = []
for name, shape in init_vars:
logger.info("Loading TF weight {} with shape {}".format(name, shape))
array = tf.train.load_variable(tf_path, name)
names.append(name)
arrays.append(array)
for name, array in zip(names, arrays):
original_name: str = name
try:
if isinstance(model, ElectraForMaskedLM):
name = name.replace("electra/embeddings/", "generator/embeddings/")
if discriminator_or_generator == "generator":
name = name.replace("electra/", "discriminator/")
name = name.replace("generator/", "electra/")
name = name.replace("dense_1", "dense_prediction")
name = name.replace("generator_predictions/output_bias", "generator_lm_head/bias")
name = name.split("/")
# print(original_name, name)
# adam_v and adam_m are variables used in AdamWeightDecayOptimizer to calculated m and v
# which are not required for using pretrained model
if any(n in ["global_step", "temperature"] for n in name):
logger.info("Skipping {}".format(original_name))
continue
pointer = model
for m_name in name:
if re.fullmatch(r"[A-Za-z]+_\d+", m_name):
scope_names = re.split(r"_(\d+)", m_name)
else:
scope_names = [m_name]
if scope_names[0] == "kernel" or scope_names[0] == "gamma":
pointer = getattr(pointer, "weight")
elif scope_names[0] == "output_bias" or scope_names[0] == "beta":
pointer = getattr(pointer, "bias")
elif scope_names[0] == "output_weights":
pointer = getattr(pointer, "weight")
elif scope_names[0] == "squad":
pointer = getattr(pointer, "classifier")
else:
pointer = getattr(pointer, scope_names[0])
if len(scope_names) >= 2:
num = int(scope_names[1])
pointer = pointer[num]
if m_name.endswith("_embeddings"):
pointer = getattr(pointer, "weight")
elif m_name == "kernel":
array = np.transpose(array)
try:
assert pointer.shape == array.shape, original_name
except AssertionError as e:
e.args += (pointer.shape, array.shape)
raise
print("Initialize PyTorch weight {}".format(name), original_name)
pointer.data = torch.from_numpy(array)
except AttributeError as e:
print("Skipping {}".format(original_name), name, e)
continue
return model
class ElectraEmbeddings(BertEmbeddings):
"""Construct the embeddings from word, position and token_type embeddings."""
def __init__(self, config):
super().__init__(config)
self.word_embeddings = nn.Embedding(config.vocab_size, config.embedding_size, padding_idx=config.pad_token_id)
self.position_embeddings = nn.Embedding(config.max_position_embeddings, config.embedding_size)
self.token_type_embeddings = nn.Embedding(config.type_vocab_size, config.embedding_size)
# self.LayerNorm is not snake-cased to stick with TensorFlow model variable name and be able to load
# any TensorFlow checkpoint file
self.LayerNorm = BertLayerNorm(config.embedding_size, eps=config.layer_norm_eps)
class ElectraDiscriminatorPredictions(nn.Module):
"""Prediction module for the discriminator, made up of two dense layers."""
def __init__(self, config):
super().__init__()
self.dense = nn.Linear(config.hidden_size, config.hidden_size)
self.dense_prediction = nn.Linear(config.hidden_size, 1)
self.config = config
def forward(self, discriminator_hidden_states, attention_mask):
hidden_states = self.dense(discriminator_hidden_states)
hidden_states = get_activation(self.config.hidden_act)(hidden_states)
logits = self.dense_prediction(hidden_states).squeeze()
return logits
class ElectraGeneratorPredictions(nn.Module):
"""Prediction module for the generator, made up of two dense layers."""
def __init__(self, config):
super().__init__()
self.LayerNorm = BertLayerNorm(config.embedding_size)
self.dense = nn.Linear(config.hidden_size, config.embedding_size)
def forward(self, generator_hidden_states):
hidden_states = self.dense(generator_hidden_states)
hidden_states = get_activation("gelu")(hidden_states)
hidden_states = self.LayerNorm(hidden_states)
return hidden_states
class ElectraPreTrainedModel(BertPreTrainedModel):
""" An abstract class to handle weights initialization and
a simple interface for downloading and loading pretrained models.
"""
config_class = ElectraConfig
pretrained_model_archive_map = ELECTRA_PRETRAINED_MODEL_ARCHIVE_MAP
load_tf_weights = load_tf_weights_in_electra
base_model_prefix = "electra"
def get_extended_attention_mask(self, attention_mask, input_shape, device):
# We can provide a self-attention mask of dimensions [batch_size, from_seq_length, to_seq_length]
# ourselves in which case we just need to make it broadcastable to all heads.
if attention_mask.dim() == 3:
extended_attention_mask = attention_mask[:, None, :, :]
elif attention_mask.dim() == 2:
# Provided a padding mask of dimensions [batch_size, seq_length]
# - if the model is a decoder, apply a causal mask in addition to the padding mask
# - if the model is an encoder, make the mask broadcastable to [batch_size, num_heads, seq_length, seq_length]
if self.config.is_decoder:
batch_size, seq_length = input_shape
seq_ids = torch.arange(seq_length, device=device)
causal_mask = seq_ids[None, None, :].repeat(batch_size, seq_length, 1) <= seq_ids[None, :, None]
causal_mask = causal_mask.to(
attention_mask.dtype
) # causal and attention masks must have same type with pytorch version < 1.3
extended_attention_mask = causal_mask[:, None, :, :] * attention_mask[:, None, None, :]
else:
extended_attention_mask = attention_mask[:, None, None, :]
else:
raise ValueError(
"Wrong shape for input_ids (shape {}) or attention_mask (shape {})".format(
input_shape, attention_mask.shape
)
)
# Since attention_mask is 1.0 for positions we want to attend and 0.0 for
# masked positions, this operation will create a tensor which is 0.0 for
# positions we want to attend and -10000.0 for masked positions.
# Since we are adding it to the raw scores before the softmax, this is
# effectively the same as removing these entirely.
extended_attention_mask = extended_attention_mask.to(dtype=next(self.parameters()).dtype) # fp16 compatibility
extended_attention_mask = (1.0 - extended_attention_mask) * -10000.0
return extended_attention_mask
def get_head_mask(self, head_mask):
# Prepare head mask if needed
# 1.0 in head_mask indicate we keep the head
# attention_probs has shape bsz x n_heads x N x N
# input head_mask has shape [num_heads] or [num_hidden_layers x num_heads]
# and head_mask is converted to shape [num_hidden_layers x batch x num_heads x seq_length x seq_length]
num_hidden_layers = self.config.num_hidden_layers
if head_mask is not None:
if head_mask.dim() == 1:
head_mask = head_mask.unsqueeze(0).unsqueeze(0).unsqueeze(-1).unsqueeze(-1)
head_mask = head_mask.expand(num_hidden_layers, -1, -1, -1, -1)
elif head_mask.dim() == 2:
head_mask = (
head_mask.unsqueeze(1).unsqueeze(-1).unsqueeze(-1)
) # We can specify head_mask for each layer
head_mask = head_mask.to(
dtype=next(self.parameters()).dtype
) # switch to fload if need + fp16 compatibility
else:
head_mask = [None] * num_hidden_layers
return head_mask
ELECTRA_START_DOCSTRING = r"""
This model is a PyTorch `torch.nn.Module <https://pytorch.org/docs/stable/nn.html#torch.nn.Module>`_ sub-class.
Use it as a regular PyTorch Module and refer to the PyTorch documentation for all matter related to general
usage and behavior.
Parameters:
config (:class:`~transformers.ElectraConfig`): Model configuration class with all the parameters of the model.
Initializing with a config file does not load the weights associated with the model, only the configuration.
Check out the :meth:`~transformers.PreTrainedModel.from_pretrained` method to load the model weights.
"""
ELECTRA_INPUTS_DOCSTRING = r"""
Args:
input_ids (:obj:`torch.LongTensor` of shape :obj:`(batch_size, sequence_length)`):
Indices of input sequence tokens in the vocabulary.
Indices can be obtained using :class:`transformers.ElectraTokenizer`.
See :func:`transformers.PreTrainedTokenizer.encode` and
:func:`transformers.PreTrainedTokenizer.encode_plus` for details.
`What are input IDs? <../glossary.html#input-ids>`__
attention_mask (:obj:`torch.FloatTensor` of shape :obj:`(batch_size, sequence_length)`, `optional`, defaults to :obj:`None`):
Mask to avoid performing attention on padding token indices.
Mask values selected in ``[0, 1]``:
``1`` for tokens that are NOT MASKED, ``0`` for MASKED tokens.
`What are attention masks? <../glossary.html#attention-mask>`__
token_type_ids (:obj:`torch.LongTensor` of shape :obj:`(batch_size, sequence_length)`, `optional`, defaults to :obj:`None`):
Segment token indices to indicate first and second portions of the inputs.
Indices are selected in ``[0, 1]``: ``0`` corresponds to a `sentence A` token, ``1``
corresponds to a `sentence B` token
`What are token type IDs? <../glossary.html#token-type-ids>`_
position_ids (:obj:`torch.LongTensor` of shape :obj:`(batch_size, sequence_length)`, `optional`, defaults to :obj:`None`):
Indices of positions of each input sequence tokens in the position embeddings.
Selected in the range ``[0, config.max_position_embeddings - 1]``.
`What are position IDs? <../glossary.html#position-ids>`_
head_mask (:obj:`torch.FloatTensor` of shape :obj:`(num_heads,)` or :obj:`(num_layers, num_heads)`, `optional`, defaults to :obj:`None`):
Mask to nullify selected heads of the self-attention modules.
Mask values selected in ``[0, 1]``:
:obj:`1` indicates the head is **not masked**, :obj:`0` indicates the head is **masked**.
inputs_embeds (:obj:`torch.FloatTensor` of shape :obj:`(batch_size, sequence_length, hidden_size)`, `optional`, defaults to :obj:`None`):
Optionally, instead of passing :obj:`input_ids` you can choose to directly pass an embedded representation.
This is useful if you want more control over how to convert `input_ids` indices into associated vectors
than the model's internal embedding lookup matrix.
encoder_hidden_states (:obj:`torch.FloatTensor` of shape :obj:`(batch_size, sequence_length, hidden_size)`, `optional`, defaults to :obj:`None`):
Sequence of hidden-states at the output of the last layer of the encoder. Used in the cross-attention
if the model is configured as a decoder.
encoder_attention_mask (:obj:`torch.FloatTensor` of shape :obj:`(batch_size, sequence_length)`, `optional`, defaults to :obj:`None`):
Mask to avoid performing attention on the padding token indices of the encoder input. This mask
is used in the cross-attention if the model is configured as a decoder.
Mask values selected in ``[0, 1]``:
``1`` for tokens that are NOT MASKED, ``0`` for MASKED tokens.
"""
@add_start_docstrings(
"The bare Electra Model transformer outputting raw hidden-states without any specific head on top. Identical to "
"the BERT model except that it uses an additional linear layer between the embedding layer and the encoder if the "
"hidden size and embedding size are different."
""
"Both the generator and discriminator checkpoints may be loaded into this model.",
ELECTRA_START_DOCSTRING,
)
class ElectraModel(ElectraPreTrainedModel):
config_class = ElectraConfig
def __init__(self, config):
super().__init__(config)
self.embeddings = ElectraEmbeddings(config)
if config.embedding_size != config.hidden_size:
self.embeddings_project = nn.Linear(config.embedding_size, config.hidden_size)
self.encoder = BertEncoder(config)
self.config = config
self.init_weights()
def get_input_embeddings(self):
return self.embeddings.word_embeddings
def set_input_embeddings(self, value):
self.embeddings.word_embeddings = value
def _prune_heads(self, heads_to_prune):
""" Prunes heads of the model.
heads_to_prune: dict of {layer_num: list of heads to prune in this layer}
See base class PreTrainedModel
"""
for layer, heads in heads_to_prune.items():
self.encoder.layer[layer].attention.prune_heads(heads)
@add_start_docstrings_to_callable(ELECTRA_INPUTS_DOCSTRING)
def forward(
self,
input_ids=None,
attention_mask=None,
token_type_ids=None,
position_ids=None,
head_mask=None,
inputs_embeds=None,
):
r"""
Return:
:obj:`tuple(torch.FloatTensor)` comprising various elements depending on the configuration (:class:`~transformers.ElectraConfig`) and inputs:
last_hidden_state (:obj:`torch.FloatTensor` of shape :obj:`(batch_size, sequence_length, hidden_size)`):
Sequence of hidden-states at the output of the last layer of the model.
hidden_states (:obj:`tuple(torch.FloatTensor)`, `optional`, returned when ``config.output_hidden_states=True``):
Tuple of :obj:`torch.FloatTensor` (one for the output of the embeddings + one for the output of each layer)
of shape :obj:`(batch_size, sequence_length, hidden_size)`.
Hidden-states of the model at the output of each layer plus the initial embedding outputs.
attentions (:obj:`tuple(torch.FloatTensor)`, `optional`, returned when ``config.output_attentions=True``):
Tuple of :obj:`torch.FloatTensor` (one for each layer) of shape
:obj:`(batch_size, num_heads, sequence_length, sequence_length)`.
Attentions weights after the attention softmax, used to compute the weighted average in the self-attention
heads.
Examples::
from transformers import ElectraModel, ElectraTokenizer
import torch
tokenizer = ElectraTokenizer.from_pretrained('google/electra-small-discriminator')
model = ElectraModel.from_pretrained('google/electra-small-discriminator')
input_ids = torch.tensor(tokenizer.encode("Hello, my dog is cute", add_special_tokens=True)).unsqueeze(0) # Batch size 1
outputs = model(input_ids)
last_hidden_states = outputs[0] # The last hidden-state is the first element of the output tuple
"""
if input_ids is not None and inputs_embeds is not None:
raise ValueError("You cannot specify both input_ids and inputs_embeds at the same time")
elif input_ids is not None:
input_shape = input_ids.size()
elif inputs_embeds is not None:
input_shape = inputs_embeds.size()[:-1]
else:
raise ValueError("You have to specify either input_ids or inputs_embeds")
device = input_ids.device if input_ids is not None else inputs_embeds.device
if attention_mask is None:
attention_mask = torch.ones(input_shape, device=device)
if token_type_ids is None:
token_type_ids = torch.zeros(input_shape, dtype=torch.long, device=device)
extended_attention_mask = self.get_extended_attention_mask(attention_mask, input_shape, device)
head_mask = self.get_head_mask(head_mask)
hidden_states = self.embeddings(
input_ids=input_ids, position_ids=position_ids, token_type_ids=token_type_ids, inputs_embeds=inputs_embeds
)
if hasattr(self, "embeddings_project"):
hidden_states = self.embeddings_project(hidden_states)
hidden_states = self.encoder(hidden_states, attention_mask=extended_attention_mask, head_mask=head_mask)
return hidden_states
@add_start_docstrings(
"""
Electra model with a binary classification head on top as used during pre-training for identifying generated
tokens.
It is recommended to load the discriminator checkpoint into that model.""",
ELECTRA_START_DOCSTRING,
)
class ElectraForPreTraining(ElectraPreTrainedModel):
def __init__(self, config):
super().__init__(config)
self.electra = ElectraModel(config)
self.discriminator_predictions = ElectraDiscriminatorPredictions(config)
self.init_weights()
@add_start_docstrings_to_callable(ELECTRA_INPUTS_DOCSTRING)
def forward(
self,
input_ids=None,
attention_mask=None,
token_type_ids=None,
position_ids=None,
head_mask=None,
inputs_embeds=None,
labels=None,
):
r"""
labels (``torch.LongTensor`` of shape ``(batch_size, sequence_length)``, `optional`, defaults to :obj:`None`):
Labels for computing the ELECTRA loss. Input should be a sequence of tokens (see :obj:`input_ids` docstring)
Indices should be in ``[0, 1]``.
``0`` indicates the token is an original token,
``1`` indicates the token was replaced.
Returns:
:obj:`tuple(torch.FloatTensor)` comprising various elements depending on the configuration (:class:`~transformers.ElectraConfig`) and inputs:
loss (`optional`, returned when ``labels`` is provided) ``torch.FloatTensor`` of shape ``(1,)``:
Total loss of the ELECTRA objective.
scores (:obj:`torch.FloatTensor` of shape :obj:`(batch_size, sequence_length)`)
Prediction scores of the head (scores for each token before SoftMax).
hidden_states (:obj:`tuple(torch.FloatTensor)`, `optional`, returned when :obj:`config.output_hidden_states=True`):
Tuple of :obj:`torch.FloatTensor` (one for the output of the embeddings + one for the output of each layer)
of shape :obj:`(batch_size, sequence_length, hidden_size)`.
Hidden-states of the model at the output of each layer plus the initial embedding outputs.
attentions (:obj:`tuple(torch.FloatTensor)`, `optional`, returned when ``config.output_attentions=True``):
Tuple of :obj:`torch.FloatTensor` (one for each layer) of shape
:obj:`(batch_size, num_heads, sequence_length, sequence_length)`.
Attentions weights after the attention softmax, used to compute the weighted average in the self-attention
heads.
Examples::
from transformers import ElectraTokenizer, ElectraForPreTraining
import torch
tokenizer = ElectraTokenizer.from_pretrained('google/electra-small-discriminator')
model = ElectraForPreTraining.from_pretrained('google/electra-small-discriminator')
input_ids = torch.tensor(tokenizer.encode("Hello, my dog is cute", add_special_tokens=True)).unsqueeze(0) # Batch size 1
outputs = model(input_ids)
prediction_scores, seq_relationship_scores = outputs[:2]
"""
discriminator_hidden_states = self.electra(
input_ids, attention_mask, token_type_ids, position_ids, head_mask, inputs_embeds
)
discriminator_sequence_output = discriminator_hidden_states[0]
logits = self.discriminator_predictions(discriminator_sequence_output, attention_mask)
output = (logits,)
if labels is not None:
loss_fct = nn.BCEWithLogitsLoss()
if attention_mask is not None:
active_loss = attention_mask.view(-1, discriminator_sequence_output.shape[1]) == 1
active_logits = logits.view(-1, discriminator_sequence_output.shape[1])[active_loss]
active_labels = labels[active_loss]
loss = loss_fct(active_logits, active_labels.float())
else:
loss = loss_fct(logits.view(-1, discriminator_sequence_output.shape[1]), labels.float())
output = (loss,) + output
output += discriminator_hidden_states[1:]
return output # (loss), scores, (hidden_states), (attentions)
@add_start_docstrings(
"""
Electra model with a language modeling head on top.
Even though both the discriminator and generator may be loaded into this model, the generator is
the only model of the two to have been trained for the masked language modeling task.""",
ELECTRA_START_DOCSTRING,
)
class ElectraForMaskedLM(ElectraPreTrainedModel):
def __init__(self, config):
super().__init__(config)
self.electra = ElectraModel(config)
self.generator_predictions = ElectraGeneratorPredictions(config)
self.generator_lm_head = nn.Linear(config.embedding_size, config.vocab_size)
self.init_weights()
def get_output_embeddings(self):
return self.generator_lm_head
@add_start_docstrings_to_callable(ELECTRA_INPUTS_DOCSTRING)
def forward(
self,
input_ids=None,
attention_mask=None,
token_type_ids=None,
position_ids=None,
head_mask=None,
inputs_embeds=None,
masked_lm_labels=None,
):
r"""
masked_lm_labels (:obj:`torch.LongTensor` of shape :obj:`(batch_size, sequence_length)`, `optional`, defaults to :obj:`None`):
Labels for computing the masked language modeling loss.
Indices should be in ``[-100, 0, ..., config.vocab_size]`` (see ``input_ids`` docstring)
Tokens with indices set to ``-100`` are ignored (masked), the loss is only computed for the tokens with labels
in ``[0, ..., config.vocab_size]``
Returns:
:obj:`tuple(torch.FloatTensor)` comprising various elements depending on the configuration (:class:`~transformers.ElectraConfig`) and inputs:
masked_lm_loss (`optional`, returned when ``masked_lm_labels`` is provided) ``torch.FloatTensor`` of shape ``(1,)``:
Masked language modeling loss.
prediction_scores (:obj:`torch.FloatTensor` of shape :obj:`(batch_size, sequence_length, config.vocab_size)`)
Prediction scores of the language modeling head (scores for each vocabulary token before SoftMax).
hidden_states (:obj:`tuple(torch.FloatTensor)`, `optional`, returned when ``config.output_hidden_states=True``):
Tuple of :obj:`torch.FloatTensor` (one for the output of the embeddings + one for the output of each layer)
of shape :obj:`(batch_size, sequence_length, hidden_size)`.
Hidden-states of the model at the output of each layer plus the initial embedding outputs.
attentions (:obj:`tuple(torch.FloatTensor)`, `optional`, returned when ``config.output_attentions=True``):
Tuple of :obj:`torch.FloatTensor` (one for each layer) of shape
:obj:`(batch_size, num_heads, sequence_length, sequence_length)`.
Attentions weights after the attention softmax, used to compute the weighted average in the self-attention
heads.
Examples::
from transformers import ElectraTokenizer, ElectraForMaskedLM
import torch
tokenizer = ElectraTokenizer.from_pretrained('google/electra-small-generator')
model = ElectraForMaskedLM.from_pretrained('google/electra-small-generator')
input_ids = torch.tensor(tokenizer.encode("Hello, my dog is cute", add_special_tokens=True)).unsqueeze(0) # Batch size 1
outputs = model(input_ids, masked_lm_labels=input_ids)
loss, prediction_scores = outputs[:2]
"""
generator_hidden_states = self.electra(
input_ids, attention_mask, token_type_ids, position_ids, head_mask, inputs_embeds
)
generator_sequence_output = generator_hidden_states[0]
prediction_scores = self.generator_predictions(generator_sequence_output)
prediction_scores = self.generator_lm_head(prediction_scores)
output = (prediction_scores,)
# Masked language modeling softmax layer
if masked_lm_labels is not None:
loss_fct = nn.CrossEntropyLoss() # -100 index = padding token
loss = loss_fct(prediction_scores.view(-1, self.config.vocab_size), masked_lm_labels.view(-1))
output = (loss,) + output
output += generator_hidden_states[1:]
return output # (masked_lm_loss), prediction_scores, (hidden_states), (attentions)
@add_start_docstrings(
"""
Electra model with a token classification head on top.
Both the discriminator and generator may be loaded into this model.""",
ELECTRA_START_DOCSTRING,
)
class ElectraForTokenClassification(ElectraPreTrainedModel):
def __init__(self, config):
super().__init__(config)
self.electra = ElectraModel(config)
self.dropout = nn.Dropout(config.hidden_dropout_prob)
self.classifier = nn.Linear(config.hidden_size, config.num_labels)
self.init_weights()
@add_start_docstrings_to_callable(ELECTRA_INPUTS_DOCSTRING)
def forward(
self,
input_ids=None,
attention_mask=None,
token_type_ids=None,
position_ids=None,
head_mask=None,
inputs_embeds=None,
labels=None,
):
r"""
labels (:obj:`torch.LongTensor` of shape :obj:`(batch_size, sequence_length)`, `optional`, defaults to :obj:`None`):
Labels for computing the token classification loss.
Indices should be in ``[0, ..., config.num_labels - 1]``.
Returns:
:obj:`tuple(torch.FloatTensor)` comprising various elements depending on the configuration (:class:`~transformers.ElectraConfig`) and inputs:
loss (:obj:`torch.FloatTensor` of shape :obj:`(1,)`, `optional`, returned when ``labels`` is provided) :
Classification loss.
scores (:obj:`torch.FloatTensor` of shape :obj:`(batch_size, sequence_length, config.num_labels)`)
Classification scores (before SoftMax).
hidden_states (:obj:`tuple(torch.FloatTensor)`, `optional`, returned when ``config.output_hidden_states=True``):
Tuple of :obj:`torch.FloatTensor` (one for the output of the embeddings + one for the output of each layer)
of shape :obj:`(batch_size, sequence_length, hidden_size)`.
Hidden-states of the model at the output of each layer plus the initial embedding outputs.
attentions (:obj:`tuple(torch.FloatTensor)`, `optional`, returned when ``config.output_attentions=True``):
Tuple of :obj:`torch.FloatTensor` (one for each layer) of shape
:obj:`(batch_size, num_heads, sequence_length, sequence_length)`.
Attentions weights after the attention softmax, used to compute the weighted average in the self-attention
heads.
Examples::
from transformers import ElectraTokenizer, ElectraForTokenClassification
import torch
tokenizer = ElectraTokenizer.from_pretrained('google/electra-small-discriminator')
model = ElectraForTokenClassification.from_pretrained('google/electra-small-discriminator')
input_ids = torch.tensor(tokenizer.encode("Hello, my dog is cute", add_special_tokens=True)).unsqueeze(0) # Batch size 1
labels = torch.tensor([1] * input_ids.size(1)).unsqueeze(0) # Batch size 1
outputs = model(input_ids, labels=labels)
loss, scores = outputs[:2]
"""
discriminator_hidden_states = self.electra(
input_ids, attention_mask, token_type_ids, position_ids, head_mask, inputs_embeds
)
discriminator_sequence_output = discriminator_hidden_states[0]
discriminator_sequence_output = self.dropout(discriminator_sequence_output)
logits = self.classifier(discriminator_sequence_output)
output = (logits,)
if labels is not None:
loss_fct = nn.CrossEntropyLoss()
# Only keep active parts of the loss
if attention_mask is not None:
active_loss = attention_mask.view(-1) == 1
active_logits = logits.view(-1, self.config.num_labels)[active_loss]
active_labels = labels.view(-1)[active_loss]
loss = loss_fct(active_logits, active_labels)
else:
loss = loss_fct(logits.view(-1, self.num_labels), labels.view(-1))
output = (loss,) + output
output += discriminator_hidden_states[1:]
return output # (loss), scores, (hidden_states), (attentions)

View File

@@ -457,7 +457,6 @@ class T5PreTrainedModel(PreTrainedModel):
pretrained_model_archive_map = T5_PRETRAINED_MODEL_ARCHIVE_MAP
load_tf_weights = load_tf_weights_in_t5
base_model_prefix = "transformer"
encoder_outputs_batch_dim_idx = 0 # outputs shaped (bs, ...)
@property
def dummy_inputs(self):
@@ -901,7 +900,6 @@ class T5ForConditionalGeneration(T5PreTrainedModel):
lm_labels (:obj:`torch.LongTensor` of shape :obj:`(batch_size,)`, `optional`, defaults to :obj:`None`):
Labels for computing the sequence classification/regression loss.
Indices should be in :obj:`[-100, 0, ..., config.vocab_size - 1]`.
If :obj:`config.num_labels > 1` a classification loss is computed (Cross-Entropy).
All labels set to ``-100`` are ignored (masked), the loss is only
computed for labels in ``[0, ..., config.vocab_size]``

View File

@@ -0,0 +1,615 @@
import logging
import tensorflow as tf
from transformers import ElectraConfig
from .file_utils import add_start_docstrings, add_start_docstrings_to_callable
from .modeling_tf_bert import ACT2FN, TFBertEncoder, TFBertPreTrainedModel
from .modeling_tf_utils import get_initializer, shape_list
logger = logging.getLogger(__name__)
TF_ELECTRA_PRETRAINED_MODEL_ARCHIVE_MAP = {
"google/electra-small-generator": "https://s3.amazonaws.com/models.huggingface.co/bert/google/electra-small-generator/tf_model.h5",
"google/electra-base-generator": "https://s3.amazonaws.com/models.huggingface.co/bert/google/electra-base-generator/tf_model.h5",
"google/electra-large-generator": "https://s3.amazonaws.com/models.huggingface.co/bert/google/electra-large-generator/tf_model.h5",
"google/electra-small-discriminator": "https://s3.amazonaws.com/models.huggingface.co/bert/google/electra-small-discriminator/tf_model.h5",
"google/electra-base-discriminator": "https://s3.amazonaws.com/models.huggingface.co/bert/google/electra-base-discriminator/tf_model.h5",
"google/electra-large-discriminator": "https://s3.amazonaws.com/models.huggingface.co/bert/google/electra-large-discriminator/tf_model.h5",
}
class TFElectraEmbeddings(tf.keras.layers.Layer):
"""Construct the embeddings from word, position and token_type embeddings.
"""
def __init__(self, config, **kwargs):
super().__init__(**kwargs)
self.vocab_size = config.vocab_size
self.embedding_size = config.embedding_size
self.initializer_range = config.initializer_range
self.position_embeddings = tf.keras.layers.Embedding(
config.max_position_embeddings,
config.embedding_size,
embeddings_initializer=get_initializer(self.initializer_range),
name="position_embeddings",
)
self.token_type_embeddings = tf.keras.layers.Embedding(
config.type_vocab_size,
config.embedding_size,
embeddings_initializer=get_initializer(self.initializer_range),
name="token_type_embeddings",
)
# self.LayerNorm is not snake-cased to stick with TensorFlow model variable name and be able to load
# any TensorFlow checkpoint file
self.LayerNorm = tf.keras.layers.LayerNormalization(epsilon=config.layer_norm_eps, name="LayerNorm")
self.dropout = tf.keras.layers.Dropout(config.hidden_dropout_prob)
def build(self, input_shape):
"""Build shared word embedding layer """
with tf.name_scope("word_embeddings"):
# Create and initialize weights. The random normal initializer was chosen
# arbitrarily, and works well.
self.word_embeddings = self.add_weight(
"weight",
shape=[self.vocab_size, self.embedding_size],
initializer=get_initializer(self.initializer_range),
)
super().build(input_shape)
def call(self, inputs, mode="embedding", training=False):
"""Get token embeddings of inputs.
Args:
inputs: list of three int64 tensors with shape [batch_size, length]: (input_ids, position_ids, token_type_ids)
mode: string, a valid value is one of "embedding" and "linear".
Returns:
outputs: (1) If mode == "embedding", output embedding tensor, float32 with
shape [batch_size, length, embedding_size]; (2) mode == "linear", output
linear tensor, float32 with shape [batch_size, length, vocab_size].
Raises:
ValueError: if mode is not valid.
Shared weights logic adapted from
https://github.com/tensorflow/models/blob/a009f4fb9d2fc4949e32192a944688925ef78659/official/transformer/v2/embedding_layer.py#L24
"""
if mode == "embedding":
return self._embedding(inputs, training=training)
elif mode == "linear":
return self._linear(inputs)
else:
raise ValueError("mode {} is not valid.".format(mode))
def _embedding(self, inputs, training=False):
"""Applies embedding based on inputs tensor."""
input_ids, position_ids, token_type_ids, inputs_embeds = inputs
if input_ids is not None:
input_shape = shape_list(input_ids)
else:
input_shape = shape_list(inputs_embeds)[:-1]
seq_length = input_shape[1]
if position_ids is None:
position_ids = tf.range(seq_length, dtype=tf.int32)[tf.newaxis, :]
if token_type_ids is None:
token_type_ids = tf.fill(input_shape, 0)
if inputs_embeds is None:
inputs_embeds = tf.gather(self.word_embeddings, input_ids)
position_embeddings = self.position_embeddings(position_ids)
token_type_embeddings = self.token_type_embeddings(token_type_ids)
embeddings = inputs_embeds + position_embeddings + token_type_embeddings
embeddings = self.LayerNorm(embeddings)
embeddings = self.dropout(embeddings, training=training)
return embeddings
def _linear(self, inputs):
"""Computes logits by running inputs through a linear layer.
Args:
inputs: A float32 tensor with shape [batch_size, length, hidden_size]
Returns:
float32 tensor with shape [batch_size, length, vocab_size].
"""
batch_size = shape_list(inputs)[0]
length = shape_list(inputs)[1]
x = tf.reshape(inputs, [-1, self.embedding_size])
logits = tf.matmul(x, self.word_embeddings, transpose_b=True)
return tf.reshape(logits, [batch_size, length, self.vocab_size])
class TFElectraDiscriminatorPredictions(tf.keras.layers.Layer):
def __init__(self, config, **kwargs):
super().__init__(**kwargs)
self.dense = tf.keras.layers.Dense(config.hidden_size, name="dense")
self.dense_prediction = tf.keras.layers.Dense(1, name="dense_prediction")
self.config = config
def call(self, discriminator_hidden_states, training=False):
hidden_states = self.dense(discriminator_hidden_states)
hidden_states = ACT2FN[self.config.hidden_act](hidden_states)
logits = tf.squeeze(self.dense_prediction(hidden_states))
return logits
class TFElectraGeneratorPredictions(tf.keras.layers.Layer):
def __init__(self, config, **kwargs):
super().__init__(**kwargs)
self.LayerNorm = tf.keras.layers.LayerNormalization(epsilon=config.layer_norm_eps, name="LayerNorm")
self.dense = tf.keras.layers.Dense(config.embedding_size, name="dense")
def call(self, generator_hidden_states, training=False):
hidden_states = self.dense(generator_hidden_states)
hidden_states = ACT2FN["gelu"](hidden_states)
hidden_states = self.LayerNorm(hidden_states)
return hidden_states
class TFElectraPreTrainedModel(TFBertPreTrainedModel):
config_class = ElectraConfig
pretrained_model_archive_map = TF_ELECTRA_PRETRAINED_MODEL_ARCHIVE_MAP
base_model_prefix = "electra"
def get_extended_attention_mask(self, attention_mask, input_shape):
if attention_mask is None:
attention_mask = tf.fill(input_shape, 1)
# We create a 3D attention mask from a 2D tensor mask.
# Sizes are [batch_size, 1, 1, to_seq_length]
# So we can broadcast to [batch_size, num_heads, from_seq_length, to_seq_length]
# this attention mask is more simple than the triangular masking of causal attention
# used in OpenAI GPT, we just need to prepare the broadcast dimension here.
extended_attention_mask = attention_mask[:, tf.newaxis, tf.newaxis, :]
# Since attention_mask is 1.0 for positions we want to attend and 0.0 for
# masked positions, this operation will create a tensor which is 0.0 for
# positions we want to attend and -10000.0 for masked positions.
# Since we are adding it to the raw scores before the softmax, this is
# effectively the same as removing these entirely.
extended_attention_mask = tf.cast(extended_attention_mask, tf.float32)
extended_attention_mask = (1.0 - extended_attention_mask) * -10000.0
return extended_attention_mask
def get_head_mask(self, head_mask):
if head_mask is not None:
raise NotImplementedError
else:
head_mask = [None] * self.config.num_hidden_layers
return head_mask
class TFElectraMainLayer(TFElectraPreTrainedModel):
config_class = ElectraConfig
def __init__(self, config, **kwargs):
super().__init__(config, **kwargs)
self.embeddings = TFElectraEmbeddings(config, name="embeddings")
if config.embedding_size != config.hidden_size:
self.embeddings_project = tf.keras.layers.Dense(config.hidden_size, name="embeddings_project")
self.encoder = TFBertEncoder(config, name="encoder")
self.config = config
def get_input_embeddings(self):
return self.embeddings
def _resize_token_embeddings(self, new_num_tokens):
raise NotImplementedError
def _prune_heads(self, heads_to_prune):
""" Prunes heads of the model.
heads_to_prune: dict of {layer_num: list of heads to prune in this layer}
See base class PreTrainedModel
"""
raise NotImplementedError
def call(
self,
inputs,
attention_mask=None,
token_type_ids=None,
position_ids=None,
head_mask=None,
inputs_embeds=None,
training=False,
):
if isinstance(inputs, (tuple, list)):
input_ids = inputs[0]
attention_mask = inputs[1] if len(inputs) > 1 else attention_mask
token_type_ids = inputs[2] if len(inputs) > 2 else token_type_ids
position_ids = inputs[3] if len(inputs) > 3 else position_ids
head_mask = inputs[4] if len(inputs) > 4 else head_mask
inputs_embeds = inputs[5] if len(inputs) > 5 else inputs_embeds
assert len(inputs) <= 6, "Too many inputs."
elif isinstance(inputs, dict):
input_ids = inputs.get("input_ids")
attention_mask = inputs.get("attention_mask", attention_mask)
token_type_ids = inputs.get("token_type_ids", token_type_ids)
position_ids = inputs.get("position_ids", position_ids)
head_mask = inputs.get("head_mask", head_mask)
inputs_embeds = inputs.get("inputs_embeds", inputs_embeds)
assert len(inputs) <= 6, "Too many inputs."
else:
input_ids = inputs
if input_ids is not None and inputs_embeds is not None:
raise ValueError("You cannot specify both input_ids and inputs_embeds at the same time")
elif input_ids is not None:
input_shape = shape_list(input_ids)
elif inputs_embeds is not None:
input_shape = shape_list(inputs_embeds)[:-1]
else:
raise ValueError("You have to specify either input_ids or inputs_embeds")
if attention_mask is None:
attention_mask = tf.fill(input_shape, 1)
if token_type_ids is None:
token_type_ids = tf.fill(input_shape, 0)
extended_attention_mask = self.get_extended_attention_mask(attention_mask, input_shape)
head_mask = self.get_head_mask(head_mask)
hidden_states = self.embeddings([input_ids, position_ids, token_type_ids, inputs_embeds], training=training)
if hasattr(self, "embeddings_project"):
hidden_states = self.embeddings_project(hidden_states, training=training)
hidden_states = self.encoder([hidden_states, extended_attention_mask, head_mask], training=training)
return hidden_states
ELECTRA_START_DOCSTRING = r"""
This model is a `tf.keras.Model <https://www.tensorflow.org/api_docs/python/tf/keras/Model>`__ sub-class.
Use it as a regular TF 2.0 Keras Model and
refer to the TF 2.0 documentation for all matter related to general usage and behavior.
.. note::
TF 2.0 models accepts two formats as inputs:
- having all inputs as keyword arguments (like PyTorch models), or
- having all inputs as a list, tuple or dict in the first positional arguments.
This second option is useful when using :obj:`tf.keras.Model.fit()` method which currently requires having
all the tensors in the first argument of the model call function: :obj:`model(inputs)`.
If you choose this second option, there are three possibilities you can use to gather all the input Tensors
in the first positional argument :
- a single Tensor with input_ids only and nothing else: :obj:`model(inputs_ids)`
- a list of varying length with one or several input Tensors IN THE ORDER given in the docstring:
:obj:`model([input_ids, attention_mask])` or :obj:`model([input_ids, attention_mask, token_type_ids])`
- a dictionary with one or several input Tensors associated to the input names given in the docstring:
:obj:`model({'input_ids': input_ids, 'token_type_ids': token_type_ids})`
Parameters:
config (:class:`~transformers.ElectraConfig`): Model configuration class with all the parameters of the model.
Initializing with a config file does not load the weights associated with the model, only the configuration.
Check out the :meth:`~transformers.PreTrainedModel.from_pretrained` method to load the model weights.
"""
ELECTRA_INPUTS_DOCSTRING = r"""
Args:
input_ids (:obj:`Numpy array` or :obj:`tf.Tensor` of shape :obj:`(batch_size, sequence_length)`):
Indices of input sequence tokens in the vocabulary.
Indices can be obtained using :class:`transformers.ElectraTokenizer`.
See :func:`transformers.PreTrainedTokenizer.encode` and
:func:`transformers.PreTrainedTokenizer.encode_plus` for details.
`What are input IDs? <../glossary.html#input-ids>`__
attention_mask (:obj:`Numpy array` or :obj:`tf.Tensor` of shape :obj:`(batch_size, sequence_length)`, `optional`, defaults to :obj:`None`):
Mask to avoid performing attention on padding token indices.
Mask values selected in ``[0, 1]``:
``1`` for tokens that are NOT MASKED, ``0`` for MASKED tokens.
`What are attention masks? <../glossary.html#attention-mask>`__
head_mask (:obj:`Numpy array` or :obj:`tf.Tensor` of shape :obj:`(num_heads,)` or :obj:`(num_layers, num_heads)`, `optional`, defaults to :obj:`None`):
Mask to nullify selected heads of the self-attention modules.
Mask values selected in ``[0, 1]``:
:obj:`1` indicates the head is **not masked**, :obj:`0` indicates the head is **masked**.
inputs_embeds (:obj:`Numpy array` or :obj:`tf.Tensor` of shape :obj:`(batch_size, sequence_length, embedding_dim)`, `optional`, defaults to :obj:`None`):
Optionally, instead of passing :obj:`input_ids` you can choose to directly pass an embedded representation.
This is useful if you want more control over how to convert `input_ids` indices into associated vectors
than the model's internal embedding lookup matrix.
training (:obj:`boolean`, `optional`, defaults to :obj:`False`):
Whether to activate dropout modules (if set to :obj:`True`) during training or to de-activate them
(if set to :obj:`False`) for evaluation.
"""
@add_start_docstrings(
"The bare Electra Model transformer outputting raw hidden-states without any specific head on top. Identical to "
"the BERT model except that it uses an additional linear layer between the embedding layer and the encoder if the "
"hidden size and embedding size are different."
""
"Both the generator and discriminator checkpoints may be loaded into this model.",
ELECTRA_START_DOCSTRING,
)
class TFElectraModel(TFElectraPreTrainedModel):
def __init__(self, config, *inputs, **kwargs):
super().__init__(config, *inputs, **kwargs)
self.electra = TFElectraMainLayer(config, name="electra")
def get_input_embeddings(self):
return self.electra.embeddings
@add_start_docstrings_to_callable(ELECTRA_INPUTS_DOCSTRING)
def call(self, inputs, **kwargs):
r"""
Returns:
:obj:`tuple(torch.FloatTensor)` comprising various elements depending on the configuration (:class:`~transformers.ElectraConfig`) and inputs:
last_hidden_state (:obj:`tf.Tensor` of shape :obj:`(batch_size, sequence_length, hidden_size)`):
Sequence of hidden-states at the output of the last layer of the model.
hidden_states (:obj:`tuple(tf.Tensor)`, `optional`, returned when :obj:`config.output_hidden_states=True`):
tuple of :obj:`tf.Tensor` (one for the output of the embeddings + one for the output of each layer)
of shape :obj:`(batch_size, sequence_length, hidden_size)`.
Hidden-states of the model at the output of each layer plus the initial embedding outputs.
attentions (:obj:`tuple(tf.Tensor)`, `optional`, returned when ``config.output_attentions=True``):
tuple of :obj:`tf.Tensor` (one for each layer) of shape
:obj:`(batch_size, num_heads, sequence_length, sequence_length)`:
Attentions weights after the attention softmax, used to compute the weighted average in the self-attention heads.
Examples::
import tensorflow as tf
from transformers import ElectraTokenizer, TFElectraModel
tokenizer = ElectraTokenizer.from_pretrained('google/electra-small-discriminator')
model = TFElectraModel.from_pretrained('google/electra-small-discriminator')
input_ids = tf.constant(tokenizer.encode("Hello, my dog is cute"))[None, :] # Batch size 1
outputs = model(input_ids)
last_hidden_states = outputs[0] # The last hidden-state is the first element of the output tuple
"""
outputs = self.electra(inputs, **kwargs)
return outputs
@add_start_docstrings(
"""
Electra model with a binary classification head on top as used during pre-training for identifying generated
tokens.
Even though both the discriminator and generator may be loaded into this model, the discriminator is
the only model of the two to have the correct classification head to be used for this model.""",
ELECTRA_START_DOCSTRING,
)
class TFElectraForPreTraining(TFElectraPreTrainedModel):
def __init__(self, config, **kwargs):
super().__init__(config, **kwargs)
self.electra = TFElectraMainLayer(config, name="electra")
self.discriminator_predictions = TFElectraDiscriminatorPredictions(config, name="discriminator_predictions")
def get_input_embeddings(self):
return self.electra.embeddings
@add_start_docstrings_to_callable(ELECTRA_INPUTS_DOCSTRING)
def call(
self,
input_ids=None,
attention_mask=None,
token_type_ids=None,
position_ids=None,
head_mask=None,
inputs_embeds=None,
training=False,
):
r"""
Returns:
:obj:`tuple(torch.FloatTensor)` comprising various elements depending on the configuration (:class:`~transformers.ElectraConfig`) and inputs:
scores (:obj:`Numpy array` or :obj:`tf.Tensor` of shape :obj:`(batch_size, sequence_length, config.num_labels)`):
Prediction scores of the head (scores for each token before SoftMax).
hidden_states (:obj:`tuple(tf.Tensor)`, `optional`, returned when :obj:`config.output_hidden_states=True`):
tuple of :obj:`tf.Tensor` (one for the output of the embeddings + one for the output of each layer)
of shape :obj:`(batch_size, sequence_length, hidden_size)`.
Hidden-states of the model at the output of each layer plus the initial embedding outputs.
attentions (:obj:`tuple(tf.Tensor)`, `optional`, returned when ``config.output_attentions=True``):
tuple of :obj:`tf.Tensor` (one for each layer) of shape
:obj:`(batch_size, num_heads, sequence_length, sequence_length)`:
Attentions weights after the attention softmax, used to compute the weighted average in the self-attention heads.
Examples::
import tensorflow as tf
from transformers import ElectraTokenizer, TFElectraForPreTraining
tokenizer = ElectraTokenizer.from_pretrained('google/electra-small-discriminator')
model = TFElectraForPreTraining.from_pretrained('google/electra-small-discriminator')
input_ids = tf.constant(tokenizer.encode("Hello, my dog is cute"))[None, :] # Batch size 1
outputs = model(input_ids)
scores = outputs[0]
"""
discriminator_hidden_states = self.electra(
input_ids, attention_mask, token_type_ids, position_ids, head_mask, inputs_embeds, training=training
)
discriminator_sequence_output = discriminator_hidden_states[0]
logits = self.discriminator_predictions(discriminator_sequence_output)
output = (logits,)
output += discriminator_hidden_states[1:]
return output # (loss), scores, (hidden_states), (attentions)
class TFElectraMaskedLMHead(tf.keras.layers.Layer):
def __init__(self, config, input_embeddings, **kwargs):
super().__init__(**kwargs)
self.vocab_size = config.vocab_size
self.input_embeddings = input_embeddings
def build(self, input_shape):
self.bias = self.add_weight(shape=(self.vocab_size,), initializer="zeros", trainable=True, name="bias")
super().build(input_shape)
def call(self, hidden_states, training=False):
hidden_states = self.input_embeddings(hidden_states, mode="linear")
hidden_states = hidden_states + self.bias
return hidden_states
@add_start_docstrings(
"""
Electra model with a language modeling head on top.
Even though both the discriminator and generator may be loaded into this model, the generator is
the only model of the two to have been trained for the masked language modeling task.""",
ELECTRA_START_DOCSTRING,
)
class TFElectraForMaskedLM(TFElectraPreTrainedModel):
def __init__(self, config, **kwargs):
super().__init__(config, **kwargs)
self.vocab_size = config.vocab_size
self.electra = TFElectraMainLayer(config, name="electra")
self.generator_predictions = TFElectraGeneratorPredictions(config, name="generator_predictions")
if isinstance(config.hidden_act, str):
self.activation = ACT2FN[config.hidden_act]
else:
self.activation = config.hidden_act
self.generator_lm_head = TFElectraMaskedLMHead(config, self.electra.embeddings, name="generator_lm_head")
def get_input_embeddings(self):
return self.electra.embeddings
def get_output_embeddings(self):
return self.generator_lm_head
@add_start_docstrings_to_callable(ELECTRA_INPUTS_DOCSTRING)
def call(
self,
input_ids=None,
attention_mask=None,
token_type_ids=None,
position_ids=None,
head_mask=None,
inputs_embeds=None,
training=False,
):
r"""
Returns:
:obj:`tuple(torch.FloatTensor)` comprising various elements depending on the configuration (:class:`~transformers.ElectraConfig`) and inputs:
prediction_scores (:obj:`Numpy array` or :obj:`tf.Tensor` of shape :obj:`(batch_size, sequence_length, config.vocab_size)`):
Prediction scores of the language modeling head (scores for each vocabulary token before SoftMax).
hidden_states (:obj:`tuple(tf.Tensor)`, `optional`, returned when :obj:`config.output_hidden_states=True`):
tuple of :obj:`tf.Tensor` (one for the output of the embeddings + one for the output of each layer)
of shape :obj:`(batch_size, sequence_length, hidden_size)`.
Hidden-states of the model at the output of each layer plus the initial embedding outputs.
attentions (:obj:`tuple(tf.Tensor)`, `optional`, returned when ``config.output_attentions=True``):
tuple of :obj:`tf.Tensor` (one for each layer) of shape
:obj:`(batch_size, num_heads, sequence_length, sequence_length)`:
Attentions weights after the attention softmax, used to compute the weighted average in the self-attention heads.
Examples::
import tensorflow as tf
from transformers import ElectraTokenizer, TFElectraForMaskedLM
tokenizer = ElectraTokenizer.from_pretrained('google/electra-small-generator')
model = TFElectraForMaskedLM.from_pretrained('google/electra-small-generator')
input_ids = tf.constant(tokenizer.encode("Hello, my dog is cute"))[None, :] # Batch size 1
outputs = model(input_ids)
prediction_scores = outputs[0]
"""
generator_hidden_states = self.electra(
input_ids, attention_mask, token_type_ids, position_ids, head_mask, inputs_embeds, training=training
)
generator_sequence_output = generator_hidden_states[0]
prediction_scores = self.generator_predictions(generator_sequence_output, training=training)
prediction_scores = self.generator_lm_head(prediction_scores, training=training)
output = (prediction_scores,)
output += generator_hidden_states[1:]
return output # (masked_lm_loss), prediction_scores, (hidden_states), (attentions)
@add_start_docstrings(
"""
Electra model with a token classification head on top.
Both the discriminator and generator may be loaded into this model.""",
ELECTRA_START_DOCSTRING,
)
class TFElectraForTokenClassification(TFElectraPreTrainedModel):
def __init__(self, config, **kwargs):
super().__init__(config, **kwargs)
self.electra = TFElectraMainLayer(config, name="electra")
self.dropout = tf.keras.layers.Dropout(config.hidden_dropout_prob)
self.classifier = tf.keras.layers.Dense(config.num_labels, name="classifier")
@add_start_docstrings_to_callable(ELECTRA_INPUTS_DOCSTRING)
def call(
self,
input_ids=None,
attention_mask=None,
token_type_ids=None,
position_ids=None,
head_mask=None,
inputs_embeds=None,
training=False,
):
r"""
Returns:
:obj:`tuple(torch.FloatTensor)` comprising various elements depending on the configuration (:class:`~transformers.ElectraConfig`) and inputs:
scores (:obj:`Numpy array` or :obj:`tf.Tensor` of shape :obj:`(batch_size, sequence_length, config.num_labels)`):
Classification scores (before SoftMax).
hidden_states (:obj:`tuple(tf.Tensor)`, `optional`, returned when :obj:`config.output_hidden_states=True`):
tuple of :obj:`tf.Tensor` (one for the output of the embeddings + one for the output of each layer)
of shape :obj:`(batch_size, sequence_length, hidden_size)`.
Hidden-states of the model at the output of each layer plus the initial embedding outputs.
attentions (:obj:`tuple(tf.Tensor)`, `optional`, returned when ``config.output_attentions=True``):
tuple of :obj:`tf.Tensor` (one for each layer) of shape
:obj:`(batch_size, num_heads, sequence_length, sequence_length)`:
Attentions weights after the attention softmax, used to compute the weighted average in the self-attention heads.
Examples::
import tensorflow as tf
from transformers import ElectraTokenizer, TFElectraForTokenClassification
tokenizer = ElectraTokenizer.from_pretrained('google/electra-small-discriminator')
model = TFElectraForTokenClassification.from_pretrained('google/electra-small-discriminator')
input_ids = tf.constant(tokenizer.encode("Hello, my dog is cute"))[None, :] # Batch size 1
outputs = model(input_ids)
scores = outputs[0]
"""
discriminator_hidden_states = self.electra(
input_ids, attention_mask, token_type_ids, position_ids, head_mask, inputs_embeds, training=training
)
discriminator_sequence_output = discriminator_hidden_states[0]
discriminator_sequence_output = self.dropout(discriminator_sequence_output)
logits = self.classifier(discriminator_sequence_output)
output = (logits,)
output += discriminator_hidden_states[1:]
return output # (loss), scores, (hidden_states), (attentions)

View File

@@ -119,7 +119,7 @@ class TFT5Attention(tf.keras.layers.Layer):
if self.has_relative_attention_bias:
self.relative_attention_bias = tf.keras.layers.Embedding(
self.relative_attention_num_buckets, self.n_heads, name="relative_attention_bias"
self.relative_attention_num_buckets, self.n_heads, name="relative_attention_bias",
)
self.pruned_heads = set()
@@ -178,13 +178,15 @@ class TFT5Attention(tf.keras.layers.Layer):
memory_position = tf.range(klen)[None, :]
relative_position = memory_position - context_position # shape (qlen, klen)
rp_bucket = self._relative_position_bucket(
relative_position, bidirectional=not self.is_decoder, num_buckets=self.relative_attention_num_buckets
relative_position, bidirectional=not self.is_decoder, num_buckets=self.relative_attention_num_buckets,
)
values = self.relative_attention_bias(rp_bucket) # shape (qlen, klen, num_heads)
values = tf.expand_dims(tf.transpose(values, [2, 0, 1]), axis=0) # shape (1, num_heads, qlen, klen)
return values
def call(self, input, mask=None, kv=None, position_bias=None, cache=None, head_mask=None, training=False):
def call(
self, input, mask=None, kv=None, position_bias=None, cache=None, head_mask=None, training=False,
):
"""
Self-attention (if kv is None) or attention over source sentence (provided by kv).
"""
@@ -261,15 +263,17 @@ class TFT5LayerSelfAttention(tf.keras.layers.Layer):
def __init__(self, config, has_relative_attention_bias=False, **kwargs):
super().__init__(**kwargs)
self.SelfAttention = TFT5Attention(
config, has_relative_attention_bias=has_relative_attention_bias, name="SelfAttention"
config, has_relative_attention_bias=has_relative_attention_bias, name="SelfAttention",
)
self.layer_norm = TFT5LayerNorm(epsilon=config.layer_norm_epsilon, name="layer_norm")
self.dropout = tf.keras.layers.Dropout(config.dropout_rate)
def call(self, hidden_states, attention_mask=None, position_bias=None, head_mask=None, training=False):
def call(
self, hidden_states, attention_mask=None, position_bias=None, head_mask=None, training=False,
):
norm_x = self.layer_norm(hidden_states)
attention_output = self.SelfAttention(
norm_x, mask=attention_mask, position_bias=position_bias, head_mask=head_mask, training=training
norm_x, mask=attention_mask, position_bias=position_bias, head_mask=head_mask, training=training,
)
y = attention_output[0]
layer_output = hidden_states + self.dropout(y, training=training)
@@ -281,15 +285,17 @@ class TFT5LayerCrossAttention(tf.keras.layers.Layer):
def __init__(self, config, has_relative_attention_bias=False, **kwargs):
super().__init__(**kwargs)
self.EncDecAttention = TFT5Attention(
config, has_relative_attention_bias=has_relative_attention_bias, name="EncDecAttention"
config, has_relative_attention_bias=has_relative_attention_bias, name="EncDecAttention",
)
self.layer_norm = TFT5LayerNorm(epsilon=config.layer_norm_epsilon, name="layer_norm")
self.dropout = tf.keras.layers.Dropout(config.dropout_rate)
def call(self, hidden_states, kv, attention_mask=None, position_bias=None, head_mask=None, training=False):
def call(
self, hidden_states, kv, attention_mask=None, position_bias=None, head_mask=None, training=False,
):
norm_x = self.layer_norm(hidden_states)
attention_output = self.EncDecAttention(
norm_x, mask=attention_mask, kv=kv, position_bias=position_bias, head_mask=head_mask, training=training
norm_x, mask=attention_mask, kv=kv, position_bias=position_bias, head_mask=head_mask, training=training,
)
y = attention_output[0]
layer_output = hidden_states + self.dropout(y, training=training)
@@ -303,12 +309,12 @@ class TFT5Block(tf.keras.layers.Layer):
self.is_decoder = config.is_decoder
self.layer = []
self.layer.append(
TFT5LayerSelfAttention(config, has_relative_attention_bias=has_relative_attention_bias, name="layer_._0")
TFT5LayerSelfAttention(config, has_relative_attention_bias=has_relative_attention_bias, name="layer_._0",)
)
if self.is_decoder:
self.layer.append(
TFT5LayerCrossAttention(
config, has_relative_attention_bias=has_relative_attention_bias, name="layer_._1"
config, has_relative_attention_bias=has_relative_attention_bias, name="layer_._1",
)
)
self.layer.append(TFT5LayerFF(config, name="layer_._2"))
@@ -402,7 +408,7 @@ class TFT5MainLayer(tf.keras.layers.Layer):
self.num_hidden_layers = config.num_layers
self.block = [
TFT5Block(config, has_relative_attention_bias=bool(i == 0), name="block_._{}".format(i))
TFT5Block(config, has_relative_attention_bias=bool(i == 0), name="block_._{}".format(i),)
for i in range(config.num_layers)
]
self.final_layer_norm = TFT5LayerNorm(epsilon=config.layer_norm_epsilon, name="final_layer_norm")
@@ -469,7 +475,7 @@ class TFT5MainLayer(tf.keras.layers.Layer):
if self.config.is_decoder:
seq_ids = tf.range(seq_length)
causal_mask = tf.less_equal(
tf.tile(seq_ids[None, None, :], (batch_size, seq_length, 1)), seq_ids[None, :, None]
tf.tile(seq_ids[None, None, :], (batch_size, seq_length, 1)), seq_ids[None, :, None],
)
causal_mask = tf.cast(causal_mask, dtype=tf.float32)
extended_attention_mask = causal_mask[:, None, :, :] * attention_mask[:, None, None, :]
@@ -586,8 +592,8 @@ class TFT5PreTrainedModel(TFPreTrainedModel):
input_ids = tf.constant(DUMMY_INPUTS)
input_mask = tf.constant(DUMMY_MASK)
dummy_inputs = {
"inputs": input_ids,
"decoder_input_ids": input_ids,
"input_ids": input_ids,
"decoder_attention_mask": input_mask,
}
return dummy_inputs
@@ -631,11 +637,9 @@ T5_START_DOCSTRING = r""" The T5 model was proposed in
T5_INPUTS_DOCSTRING = r"""
Args:
decoder_input_ids are usually used as a `dict` (see T5 description above for more information) containing all the following.
decoder_input_ids (:obj:`tf.Tensor` of shape :obj:`(batch_size, target_sequence_length)`, `optional`, defaults to :obj:`None`):
Provide for sequence to sequence training. T5 uses the pad_token_id as the starting token for decoder_input_ids generation.
inputs are usually used as a `dict` (see T5 description above for more information) containing all the following.
input_ids (:obj:`tf.Tensor` of shape :obj:`(batch_size, sequence_length)`):
inputs (:obj:`tf.Tensor` of shape :obj:`(batch_size, sequence_length)`):
Indices of input sequence tokens in the vocabulary.
T5 is a model with relative position embeddings so you should be able to pad the inputs on
the right or the left.
@@ -644,6 +648,8 @@ T5_INPUTS_DOCSTRING = r"""
`T5 Training <./t5.html#training>`_ .
See :func:`transformers.PreTrainedTokenizer.encode` and
:func:`transformers.PreTrainedTokenizer.convert_tokens_to_ids` for details.
decoder_input_ids (:obj:`tf.Tensor` of shape :obj:`(batch_size, target_sequence_length)`, `optional`, defaults to :obj:`None`):
Provide for sequence to sequence training. T5 uses the pad_token_id as the starting token for decoder_input_ids generation.
attention_mask (:obj:`tf.Tensor` of shape :obj:`(batch_size, sequence_length)`, `optional`, defaults to :obj:`None`):
Mask to avoid performing attention on padding token indices.
Mask values selected in ``[0, 1]``:
@@ -700,7 +706,7 @@ class TFT5Model(TFT5PreTrainedModel):
return self.shared
@add_start_docstrings_to_callable(T5_INPUTS_DOCSTRING)
def call(self, decoder_input_ids, **kwargs):
def call(self, inputs, **kwargs):
r"""
Return:
:obj:`tuple(tf.Tensor)` comprising various elements depending on the configuration (:class:`~transformers.T5Config`) and inputs.
@@ -725,18 +731,18 @@ class TFT5Model(TFT5PreTrainedModel):
tokenizer = T5Tokenizer.from_pretrained('t5-small')
model = TFT5Model.from_pretrained('t5-small')
input_ids = tokenizer.encode("Hello, my dog is cute", return_tensors="tf") # Batch size 1
outputs = model(input_ids, input_ids=input_ids)
outputs = model(input_ids, decoder_input_ids=input_ids)
last_hidden_states = outputs[0] # The last hidden-state is the first element of the output tuple
"""
if isinstance(decoder_input_ids, dict):
kwargs.update(decoder_input_ids)
if isinstance(inputs, dict):
kwargs.update(inputs)
else:
kwargs["decoder_input_ids"] = decoder_input_ids
kwargs["inputs"] = inputs
# retrieve arguments
input_ids = kwargs.get("input_ids", None)
input_ids = kwargs.get("inputs", None)
decoder_input_ids = kwargs.get("decoder_input_ids", None)
attention_mask = kwargs.get("attention_mask", None)
encoder_outputs = kwargs.get("encoder_outputs", None)
@@ -748,7 +754,7 @@ class TFT5Model(TFT5PreTrainedModel):
# Encode if needed (training, first prediction pass)
if encoder_outputs is None:
encoder_outputs = self.encoder(
input_ids, attention_mask=attention_mask, inputs_embeds=inputs_embeds, head_mask=head_mask
input_ids, attention_mask=attention_mask, inputs_embeds=inputs_embeds, head_mask=head_mask,
)
hidden_states = encoder_outputs[0]
@@ -797,7 +803,7 @@ class TFT5ForConditionalGeneration(TFT5PreTrainedModel):
return self.encoder
@add_start_docstrings_to_callable(T5_INPUTS_DOCSTRING)
def call(self, decoder_input_ids, **kwargs):
def call(self, inputs, **kwargs):
r"""
Return:
:obj:`tuple(tf.Tensor)` comprising various elements depending on the configuration (:class:`~transformers.T5Config`) and inputs.
@@ -823,7 +829,7 @@ class TFT5ForConditionalGeneration(TFT5PreTrainedModel):
tokenizer = T5Tokenizer.from_pretrained('t5-small')
model = TFT5ForConditionalGeneration.from_pretrained('t5-small')
input_ids = tokenizer.encode("Hello, my dog is cute", return_tensors="tf") # Batch size 1
outputs = model(input_ids, input_ids=input_ids)
outputs = model(input_ids, decoder_input_ids=input_ids)
prediction_scores = outputs[0]
tokenizer = T5Tokenizer.from_pretrained('t5-small')
@@ -833,13 +839,13 @@ class TFT5ForConditionalGeneration(TFT5PreTrainedModel):
"""
if isinstance(decoder_input_ids, dict):
kwargs.update(decoder_input_ids)
if isinstance(inputs, dict):
kwargs.update(inputs)
else:
kwargs["decoder_input_ids"] = decoder_input_ids
kwargs["inputs"] = inputs
# retrieve arguments
input_ids = kwargs.get("input_ids", None)
input_ids = kwargs.get("inputs", None)
decoder_input_ids = kwargs.get("decoder_input_ids", None)
attention_mask = kwargs.get("attention_mask", None)
encoder_outputs = kwargs.get("encoder_outputs", None)
@@ -852,7 +858,7 @@ class TFT5ForConditionalGeneration(TFT5PreTrainedModel):
if encoder_outputs is None:
# Convert encoder inputs in embeddings if needed
encoder_outputs = self.encoder(
input_ids, attention_mask=attention_mask, inputs_embeds=inputs_embeds, head_mask=head_mask
input_ids, attention_mask=attention_mask, inputs_embeds=inputs_embeds, head_mask=head_mask,
)
hidden_states = encoder_outputs[0]
@@ -884,7 +890,8 @@ class TFT5ForConditionalGeneration(TFT5PreTrainedModel):
encoder_outputs = (past,)
return {
"inputs": input_ids,
"inputs": None, # inputs don't have to be defined, but still need to be passed to make Keras.layer.__call__ happy
"decoder_input_ids": input_ids, # input_ids are the decoder_input_ids
"encoder_outputs": encoder_outputs,
"attention_mask": attention_mask,
}

View File

@@ -488,7 +488,7 @@ class TFTransfoXLMainLayer(tf.keras.layers.Layer):
else:
return None
def _update_mems(self, hids, mems, qlen, mlen):
def _update_mems(self, hids, mems, mlen, qlen):
# does not deal with None
if mems is None:
return None

View File

@@ -467,6 +467,7 @@ class TFPreTrainedModel(tf.keras.Model, TFModelUtilsMixin):
top_k=None,
top_p=None,
repetition_penalty=None,
bad_words_ids=None,
bos_token_id=None,
pad_token_id=None,
eos_token_id=None,
@@ -532,6 +533,9 @@ class TFPreTrainedModel(tf.keras.Model, TFModelUtilsMixin):
no_repeat_ngram_size: (`optional`) int
If set to int > 0, all ngrams of size `no_repeat_ngram_size` can only occur once.
bad_words_ids: (`optional`) list of lists of int
`bad_words_ids` contains tokens that are not allowed to be generated. In order to get the tokens of the words that should not appear in the generated text, use `tokenizer.encode(bad_word, add_prefix_space=True)`.
num_return_sequences: (`optional`) int
The number of independently computed returned sequences for each element in the batch. Default to 1.
@@ -582,6 +586,12 @@ class TFPreTrainedModel(tf.keras.Model, TFModelUtilsMixin):
outputs = model.generate(input_ids=input_ids, max_length=50, temperature=0.7, repetition_penalty=1.2) # generate sequences
print('Generated: {}'.format(tokenizer.decode(outputs[0], skip_special_tokens=True)))
tokenizer = AutoTokenizer.from_pretrained('gpt2') # Initialize tokenizer
model = TFAutoModelWithLMHead.from_pretrained('gpt2') # Download model and configuration from S3 and cache.
input_context = 'My cute dog' # "Legal" is one of the control codes for ctrl
bad_words_ids = [tokenizer.encode(bad_word, add_prefix_space=True) for bad_word in ['idiot', 'stupid', 'shut up']]
input_ids = tokenizer.encode(input_context, return_tensors='tf') # encode input context
outputs = model.generate(input_ids=input_ids, max_length=100, do_sample=True, bad_words_ids=bad_words_ids) # generate sequences without allowing bad_words to be generated
"""
# We cannot generate if the model does not have a LM head
@@ -607,6 +617,7 @@ class TFPreTrainedModel(tf.keras.Model, TFModelUtilsMixin):
no_repeat_ngram_size = (
no_repeat_ngram_size if no_repeat_ngram_size is not None else self.config.no_repeat_ngram_size
)
bad_words_ids = bad_words_ids if bad_words_ids is not None else self.config.bad_words_ids
num_return_sequences = (
num_return_sequences if num_return_sequences is not None else self.config.num_return_sequences
)
@@ -641,6 +652,9 @@ class TFPreTrainedModel(tf.keras.Model, TFModelUtilsMixin):
assert (
isinstance(num_return_sequences, int) and num_return_sequences > 0
), "`num_return_sequences` should be a strictely positive integer."
assert (
bad_words_ids is None or isinstance(bad_words_ids, list) and isinstance(bad_words_ids[0], list)
), "`bad_words_ids` is either `None` or a list of lists of tokens that should not be generated"
if input_ids is None:
assert isinstance(bos_token_id, int) and bos_token_id >= 0, (
@@ -742,6 +756,7 @@ class TFPreTrainedModel(tf.keras.Model, TFModelUtilsMixin):
top_p=top_p,
repetition_penalty=repetition_penalty,
no_repeat_ngram_size=no_repeat_ngram_size,
bad_words_ids=bad_words_ids,
bos_token_id=bos_token_id,
pad_token_id=pad_token_id,
eos_token_id=eos_token_id,
@@ -766,6 +781,7 @@ class TFPreTrainedModel(tf.keras.Model, TFModelUtilsMixin):
top_p=top_p,
repetition_penalty=repetition_penalty,
no_repeat_ngram_size=no_repeat_ngram_size,
bad_words_ids=bad_words_ids,
bos_token_id=bos_token_id,
pad_token_id=pad_token_id,
eos_token_id=eos_token_id,
@@ -790,6 +806,7 @@ class TFPreTrainedModel(tf.keras.Model, TFModelUtilsMixin):
top_p,
repetition_penalty,
no_repeat_ngram_size,
bad_words_ids,
bos_token_id,
pad_token_id,
eos_token_id,
@@ -828,7 +845,7 @@ class TFPreTrainedModel(tf.keras.Model, TFModelUtilsMixin):
if no_repeat_ngram_size > 0:
# calculate a list of banned tokens to prevent repetitively generating the same ngrams
# from fairseq: https://github.com/pytorch/fairseq/blob/a07cb6f40480928c9e0548b737aadd36ee66ac76/fairseq/sequence_generator.py#L345
banned_tokens = calc_banned_tokens(input_ids, batch_size, no_repeat_ngram_size, cur_len)
banned_tokens = calc_banned_ngram_tokens(input_ids, batch_size, no_repeat_ngram_size, cur_len)
# create banned_tokens boolean mask
banned_tokens_indices_mask = []
for banned_tokens_slice in banned_tokens:
@@ -840,6 +857,20 @@ class TFPreTrainedModel(tf.keras.Model, TFModelUtilsMixin):
next_token_logits, tf.convert_to_tensor(banned_tokens_indices_mask, dtype=tf.bool), -float("inf")
)
if bad_words_ids is not None:
# calculate a list of banned tokens according to bad words
banned_tokens = calc_banned_bad_words_ids(input_ids, bad_words_ids)
banned_tokens_indices_mask = []
for banned_tokens_slice in banned_tokens:
banned_tokens_indices_mask.append(
[True if token in banned_tokens_slice else False for token in range(vocab_size)]
)
next_token_logits = set_tensor_by_indices_to_value(
next_token_logits, tf.convert_to_tensor(banned_tokens_indices_mask, dtype=tf.bool), -float("inf")
)
# set eos token prob to zero if min_length is not reached
if eos_token_id is not None and cur_len < min_length:
# create eos_token_id boolean mask
@@ -936,6 +967,7 @@ class TFPreTrainedModel(tf.keras.Model, TFModelUtilsMixin):
top_p,
repetition_penalty,
no_repeat_ngram_size,
bad_words_ids,
bos_token_id,
pad_token_id,
decoder_start_token_id,
@@ -1012,7 +1044,9 @@ class TFPreTrainedModel(tf.keras.Model, TFModelUtilsMixin):
# calculate a list of banned tokens to prevent repetitively generating the same ngrams
# from fairseq: https://github.com/pytorch/fairseq/blob/a07cb6f40480928c9e0548b737aadd36ee66ac76/fairseq/sequence_generator.py#L345
num_batch_hypotheses = batch_size * num_beams
banned_tokens = calc_banned_tokens(input_ids, num_batch_hypotheses, no_repeat_ngram_size, cur_len)
banned_tokens = calc_banned_ngram_tokens(
input_ids, num_batch_hypotheses, no_repeat_ngram_size, cur_len
)
# create banned_tokens boolean mask
banned_tokens_indices_mask = []
for banned_tokens_slice in banned_tokens:
@@ -1024,6 +1058,20 @@ class TFPreTrainedModel(tf.keras.Model, TFModelUtilsMixin):
scores, tf.convert_to_tensor(banned_tokens_indices_mask, dtype=tf.bool), -float("inf")
)
if bad_words_ids is not None:
# calculate a list of banned tokens according to bad words
banned_tokens = calc_banned_bad_words_ids(input_ids, bad_words_ids)
banned_tokens_indices_mask = []
for banned_tokens_slice in banned_tokens:
banned_tokens_indices_mask.append(
[True if token in banned_tokens_slice else False for token in range(vocab_size)]
)
scores = set_tensor_by_indices_to_value(
scores, tf.convert_to_tensor(banned_tokens_indices_mask, dtype=tf.bool), -float("inf")
)
assert shape_list(scores) == [batch_size * num_beams, vocab_size]
if do_sample:
@@ -1064,12 +1112,12 @@ class TFPreTrainedModel(tf.keras.Model, TFModelUtilsMixin):
assert shape_list(next_scores) == shape_list(next_tokens) == [batch_size, 2 * num_beams]
# next batch beam content
# list of (batch_size * num_beams) tuple(next hypothesis score, next token, current position in the batch)
next_batch_beam = []
# for each sentence
for batch_idx in range(batch_size):
# if we are done with this sentence
if done[batch_idx]:
assert (
len(generated_hyps[batch_idx]) >= num_beams
@@ -1087,14 +1135,13 @@ class TFPreTrainedModel(tf.keras.Model, TFModelUtilsMixin):
for beam_token_rank, (beam_token_id, beam_token_score) in enumerate(
zip(next_tokens[batch_idx], next_scores[batch_idx])
):
# get beam and token IDs
beam_id = beam_token_id // vocab_size
token_id = beam_token_id % vocab_size
effective_beam_id = batch_idx * num_beams + beam_id
# add to generated hypotheses if end of sentence or last iteration
if eos_token_id is not None and token_id.numpy() is eos_token_id:
if (eos_token_id is not None) and (token_id.numpy() == eos_token_id):
# if beam_token does not belong to top num_beams tokens, it should not be added
is_beam_token_worse_than_top_num_beams = beam_token_rank >= num_beams
if is_beam_token_worse_than_top_num_beams:
@@ -1110,9 +1157,9 @@ class TFPreTrainedModel(tf.keras.Model, TFModelUtilsMixin):
if len(next_sent_beam) == num_beams:
break
# if we are done with this sentence
# Check if were done so that we can save a pad step if all(done)
done[batch_idx] = done[batch_idx] or generated_hyps[batch_idx].is_done(
tf.reduce_max(next_scores[batch_idx]).numpy()
tf.reduce_max(next_scores[batch_idx]).numpy(), cur_len=cur_len
)
# update next beam content
@@ -1137,6 +1184,7 @@ class TFPreTrainedModel(tf.keras.Model, TFModelUtilsMixin):
if past is not None:
past = self._reorder_cache(past, beam_idx)
# extend attention_mask for new generated input if only decoder
if self.config.is_encoder_decoder is False:
attention_mask = tf.concat(
[attention_mask, tf.ones((shape_list(attention_mask)[0], 1), dtype=tf.int32)], axis=-1
@@ -1196,16 +1244,26 @@ class TFPreTrainedModel(tf.keras.Model, TFModelUtilsMixin):
# fill with hypothesis and eos_token_id if necessary
for i, hypo in enumerate(best):
padding = tf.ones((sent_max_len - shape_list(hypo)[0],), dtype=tf.int32) * pad_token_id
decoded_hypo = tf.concat([hypo, padding], axis=0)
assert sent_lengths[i] == shape_list(hypo)[0]
# if sent_length is max_len do not pad
if sent_lengths[i] == sent_max_len:
decoded_slice = hypo
else:
# else pad to sent_max_len
num_pad_tokens = sent_max_len - sent_lengths[i]
padding = pad_token_id * tf.ones((num_pad_tokens,), dtype=tf.int32)
decoded_slice = tf.concat([hypo, padding], axis=-1)
# finish sentence with EOS token
if sent_lengths[i] < max_length:
decoded_slice = tf.where(
tf.range(sent_max_len, dtype=tf.int32) == sent_lengths[i],
eos_token_id * tf.ones((sent_max_len,), dtype=tf.int32),
decoded_slice,
)
# add to list
decoded_list.append(decoded_slice)
if sent_lengths[i] < max_length:
decoded_hypo = tf.where(
tf.range(max_length) == sent_lengths[i],
eos_token_id * tf.ones((sent_max_len,), dtype=tf.int32),
decoded_hypo,
)
decoded_list.append(decoded_hypo)
decoded = tf.stack(decoded_list)
else:
# none of the hypotheses have an eos_token
@@ -1243,7 +1301,7 @@ def _create_next_token_logits_penalties(input_ids, logits, repetition_penalty):
return tf.convert_to_tensor(token_penalties, dtype=tf.float32)
def calc_banned_tokens(prev_input_ids, num_hypos, no_repeat_ngram_size, cur_len):
def calc_banned_ngram_tokens(prev_input_ids, num_hypos, no_repeat_ngram_size, cur_len):
# Copied from fairseq for no_repeat_ngram in beam_search"""
if cur_len + 1 < no_repeat_ngram_size:
# return no banned tokens if we haven't generated no_repeat_ngram_size tokens yet
@@ -1266,6 +1324,42 @@ def calc_banned_tokens(prev_input_ids, num_hypos, no_repeat_ngram_size, cur_len)
return banned_tokens
def calc_banned_bad_words_ids(prev_input_ids, bad_words_ids):
banned_tokens = []
def _tokens_match(prev_tokens, tokens):
if len(tokens) == 0:
# if bad word tokens is just one token always ban it
return True
if len(tokens) > len(prev_input_ids):
# if bad word tokens are longer then prev input_ids they can't be equal
return False
if prev_tokens[-len(tokens) :] == tokens:
# if tokens match
return True
else:
return False
for prev_input_ids_slice in prev_input_ids:
banned_tokens_slice = []
for banned_token_seq in bad_words_ids:
assert len(banned_token_seq) > 0, "Banned words token sequences {} cannot have an empty list".format(
bad_words_ids
)
if _tokens_match(prev_input_ids_slice.numpy().tolist(), banned_token_seq[:-1]) is False:
# if tokens do not match continue
continue
banned_tokens_slice.append(banned_token_seq[-1])
banned_tokens.append(banned_tokens_slice)
return banned_tokens
def tf_top_k_top_p_filtering(logits, top_k=0, top_p=1.0, filter_value=-float("Inf"), min_tokens_to_keep=1):
""" Filter a distribution of logits using top-k and/or nucleus (top-p) filtering
Args:

View File

@@ -136,7 +136,7 @@ def load_tf_weights_in_transfo_xl(model, config, tf_path):
if "kernel" in name or "proj" in name:
array = np.transpose(array)
if ("r_r_bias" in name or "r_w_bias" in name) and len(pointer) > 1:
# Here we will split the TF weigths
# Here we will split the TF weights
assert len(pointer) == array.shape[0]
for i, p_i in enumerate(pointer):
arr_i = array[i, ...]

View File

@@ -667,6 +667,7 @@ class PreTrainedModel(nn.Module, ModuleUtilsMixin):
top_k=None,
top_p=None,
repetition_penalty=None,
bad_words_ids=None,
bos_token_id=None,
pad_token_id=None,
eos_token_id=None,
@@ -731,6 +732,8 @@ class PreTrainedModel(nn.Module, ModuleUtilsMixin):
no_repeat_ngram_size: (`optional`) int
If set to int > 0, all ngrams of size `no_repeat_ngram_size` can only occur once.
bad_words_ids: (`optional`) list of lists of int
`bad_words_ids` contains tokens that are not allowed to be generated. In order to get the tokens of the words that should not appear in the generated text, use `tokenizer.encode(bad_word, add_prefix_space=True)`.
num_return_sequences: (`optional`) int
The number of independently computed returned sequences for each element in the batch. Default to 1.
@@ -782,6 +785,12 @@ class PreTrainedModel(nn.Module, ModuleUtilsMixin):
outputs = model.generate(input_ids=input_ids, max_length=50, temperature=0.7, repetition_penalty=1.2) # generate sequences
print('Generated: {}'.format(tokenizer.decode(outputs[0], skip_special_tokens=True)))
tokenizer = AutoTokenizer.from_pretrained('gpt2') # Initialize tokenizer
model = AutoModelWithLMHead.from_pretrained('gpt2') # Download model and configuration from S3 and cache.
input_context = 'My cute dog' # "Legal" is one of the control codes for ctrl
bad_words_ids = [tokenizer.encode(bad_word, add_prefix_space=True) for bad_word in ['idiot', 'stupid', 'shut up']]
input_ids = tokenizer.encode(input_context, return_tensors='pt') # encode input context
outputs = model.generate(input_ids=input_ids, max_length=100, do_sample=True, bad_words_ids=bad_words_ids) # generate sequences without allowing bad_words to be generated
"""
# We cannot generate if the model does not have a LM head
@@ -807,6 +816,7 @@ class PreTrainedModel(nn.Module, ModuleUtilsMixin):
no_repeat_ngram_size = (
no_repeat_ngram_size if no_repeat_ngram_size is not None else self.config.no_repeat_ngram_size
)
bad_words_ids = bad_words_ids if bad_words_ids is not None else self.config.bad_words_ids
num_return_sequences = (
num_return_sequences if num_return_sequences is not None else self.config.num_return_sequences
)
@@ -844,6 +854,9 @@ class PreTrainedModel(nn.Module, ModuleUtilsMixin):
assert (
isinstance(num_return_sequences, int) and num_return_sequences > 0
), "`num_return_sequences` should be a strictly positive integer."
assert (
bad_words_ids is None or isinstance(bad_words_ids, list) and isinstance(bad_words_ids[0], list)
), "`bad_words_ids` is either `None` or a list of lists of tokens that should not be generated"
if input_ids is None:
assert isinstance(bos_token_id, int) and bos_token_id >= 0, (
@@ -935,18 +948,22 @@ class PreTrainedModel(nn.Module, ModuleUtilsMixin):
device=next(self.parameters()).device,
)
cur_len = 1
batch_idx = self.encoder_outputs_batch_dim_idx
assert (
batch_size == encoder_outputs[0].shape[batch_idx]
), f"expected encoder_outputs[0] to have 1st dimension bs={batch_size}, got {encoder_outputs[0].shape[1]} "
expanded_idx = (
batch_size == encoder_outputs[0].shape[0]
), f"expected encoder_outputs[0] to have 1st dimension bs={batch_size}, got {encoder_outputs[0].shape[0]} "
# expand batch_idx to assign correct encoder output for expanded input_ids (due to num_beams > 1 and num_return_sequences > 1)
expanded_batch_idxs = (
torch.arange(batch_size)
.view(-1, 1)
.repeat(1, num_beams * effective_batch_mult)
.view(-1)
.to(input_ids.device)
)
encoder_outputs = (encoder_outputs[0].index_select(batch_idx, expanded_idx), *encoder_outputs[1:])
# expand encoder_outputs
encoder_outputs = (encoder_outputs[0].index_select(0, expanded_batch_idxs), *encoder_outputs[1:])
else:
encoder_outputs = None
cur_len = input_ids.shape[-1]
@@ -964,6 +981,7 @@ class PreTrainedModel(nn.Module, ModuleUtilsMixin):
top_p=top_p,
repetition_penalty=repetition_penalty,
no_repeat_ngram_size=no_repeat_ngram_size,
bad_words_ids=bad_words_ids,
bos_token_id=bos_token_id,
pad_token_id=pad_token_id,
decoder_start_token_id=decoder_start_token_id,
@@ -988,6 +1006,7 @@ class PreTrainedModel(nn.Module, ModuleUtilsMixin):
top_p=top_p,
repetition_penalty=repetition_penalty,
no_repeat_ngram_size=no_repeat_ngram_size,
bad_words_ids=bad_words_ids,
bos_token_id=bos_token_id,
pad_token_id=pad_token_id,
decoder_start_token_id=decoder_start_token_id,
@@ -1011,6 +1030,7 @@ class PreTrainedModel(nn.Module, ModuleUtilsMixin):
top_p,
repetition_penalty,
no_repeat_ngram_size,
bad_words_ids,
bos_token_id,
pad_token_id,
eos_token_id,
@@ -1045,7 +1065,14 @@ class PreTrainedModel(nn.Module, ModuleUtilsMixin):
if no_repeat_ngram_size > 0:
# calculate a list of banned tokens to prevent repetitively generating the same ngrams
# from fairseq: https://github.com/pytorch/fairseq/blob/a07cb6f40480928c9e0548b737aadd36ee66ac76/fairseq/sequence_generator.py#L345
banned_tokens = calc_banned_tokens(input_ids, batch_size, no_repeat_ngram_size, cur_len)
banned_tokens = calc_banned_ngram_tokens(input_ids, batch_size, no_repeat_ngram_size, cur_len)
for batch_idx in range(batch_size):
next_token_logits[batch_idx, banned_tokens[batch_idx]] = -float("inf")
if bad_words_ids is not None:
# calculate a list of banned tokens according to bad words
banned_tokens = calc_banned_bad_words_ids(input_ids, bad_words_ids)
for batch_idx in range(batch_size):
next_token_logits[batch_idx, banned_tokens[batch_idx]] = -float("inf")
@@ -1121,6 +1148,7 @@ class PreTrainedModel(nn.Module, ModuleUtilsMixin):
top_p,
repetition_penalty,
no_repeat_ngram_size,
bad_words_ids,
bos_token_id,
pad_token_id,
eos_token_id,
@@ -1187,12 +1215,19 @@ class PreTrainedModel(nn.Module, ModuleUtilsMixin):
# calculate a list of banned tokens to prevent repetitively generating the same ngrams
num_batch_hypotheses = batch_size * num_beams
# from fairseq: https://github.com/pytorch/fairseq/blob/a07cb6f40480928c9e0548b737aadd36ee66ac76/fairseq/sequence_generator.py#L345
banned_batch_tokens = calc_banned_tokens(
banned_batch_tokens = calc_banned_ngram_tokens(
input_ids, num_batch_hypotheses, no_repeat_ngram_size, cur_len
)
for i, banned_tokens in enumerate(banned_batch_tokens):
scores[i, banned_tokens] = -float("inf")
if bad_words_ids is not None:
# calculate a list of banned tokens according to bad words
banned_tokens = calc_banned_bad_words_ids(input_ids, bad_words_ids)
for i, banned_tokens in enumerate(banned_tokens):
scores[i, banned_tokens] = -float("inf")
assert scores.shape == (batch_size * num_beams, vocab_size), "Shapes of scores: {} != {}".format(
scores.shape, (batch_size * num_beams, vocab_size)
)
@@ -1253,14 +1288,13 @@ class PreTrainedModel(nn.Module, ModuleUtilsMixin):
for beam_token_rank, (beam_token_id, beam_token_score) in enumerate(
zip(next_tokens[batch_idx], next_scores[batch_idx])
):
# get beam and word IDs
# get beam and token IDs
beam_id = beam_token_id // vocab_size
token_id = beam_token_id % vocab_size
effective_beam_id = batch_idx * num_beams + beam_id
# add to generated hypotheses if end of sentence
if (eos_token_id is not None) and (token_id.item() is eos_token_id):
# add to generated hypotheses if end of sentence or last iteration
if (eos_token_id is not None) and (token_id.item() == eos_token_id):
# if beam_token does not belong to top num_beams tokens, it should not be added
is_beam_token_worse_than_top_num_beams = beam_token_rank >= num_beams
if is_beam_token_worse_than_top_num_beams:
@@ -1269,7 +1303,7 @@ class PreTrainedModel(nn.Module, ModuleUtilsMixin):
input_ids[effective_beam_id].clone(), beam_token_score.item(),
)
else:
# add next predicted word if it is not eos_token
# add next predicted token if it is not eos_token
next_sent_beam.append((beam_token_score, token_id, effective_beam_id))
# the beam for next step is full
@@ -1299,7 +1333,6 @@ class PreTrainedModel(nn.Module, ModuleUtilsMixin):
# re-order batch
input_ids = input_ids[beam_idx, :]
input_ids = torch.cat([input_ids, beam_tokens.unsqueeze(1)], dim=-1)
# re-order internal states
if past is not None:
past = self._reorder_cache(past, beam_idx)
@@ -1397,7 +1430,7 @@ class PreTrainedModel(nn.Module, ModuleUtilsMixin):
return past
def calc_banned_tokens(prev_input_ids, num_hypos, no_repeat_ngram_size, cur_len):
def calc_banned_ngram_tokens(prev_input_ids, num_hypos, no_repeat_ngram_size, cur_len):
# Copied from fairseq for no_repeat_ngram in beam_search"""
if cur_len + 1 < no_repeat_ngram_size:
# return no banned tokens if we haven't generated no_repeat_ngram_size tokens yet
@@ -1420,6 +1453,42 @@ def calc_banned_tokens(prev_input_ids, num_hypos, no_repeat_ngram_size, cur_len)
return banned_tokens
def calc_banned_bad_words_ids(prev_input_ids, bad_words_ids):
banned_tokens = []
def _tokens_match(prev_tokens, tokens):
if len(tokens) == 0:
# if bad word tokens is just one token always ban it
return True
if len(tokens) > len(prev_input_ids):
# if bad word tokens are longer then prev input_ids they can't be equal
return False
if prev_tokens[-len(tokens) :] == tokens:
# if tokens match
return True
else:
return False
for prev_input_ids_slice in prev_input_ids:
banned_tokens_slice = []
for banned_token_seq in bad_words_ids:
assert len(banned_token_seq) > 0, "Banned words token sequences {} cannot have an empty list".format(
bad_words_ids
)
if _tokens_match(prev_input_ids_slice.tolist(), banned_token_seq[:-1]) is False:
# if tokens do not match continue
continue
banned_tokens_slice.append(banned_token_seq[-1])
banned_tokens.append(banned_tokens_slice)
return banned_tokens
def top_k_top_p_filtering(logits, top_k=0, top_p=1.0, filter_value=-float("Inf"), min_tokens_to_keep=1):
""" Filter a distribution of logits using top-k and/or nucleus (top-p) filtering
Args:

View File

@@ -156,7 +156,7 @@ def load_tf_weights_in_xlnet(model, config, tf_path):
logger.info("Transposing")
array = np.transpose(array)
if isinstance(pointer, list):
# Here we will split the TF weigths
# Here we will split the TF weights
assert len(pointer) == array.shape[0]
for i, p_i in enumerate(pointer):
arr_i = array[i, ...]

View File

@@ -214,7 +214,7 @@ class GradientAccumulator(object):
raise ValueError("Expected %s gradients, but got %d" % (len(self._gradients), len(gradients)))
for accum_gradient, gradient in zip(self._get_replica_gradients(), gradients):
if accum_gradient is not None:
if accum_gradient is not None and gradient is not None:
accum_gradient.assign_add(gradient)
self._accum_steps.assign_add(1)
@@ -241,6 +241,7 @@ class GradientAccumulator(object):
return (
gradient.device_map.select_for_current_replica(gradient.values, replica_context)
for gradient in self._gradients
if gradient is not None
)
else:
return self._gradients

View File

@@ -1235,17 +1235,19 @@ class SummarizationPipeline(Pipeline):
elif self.framework == "tf":
input_length = tf.shape(inputs["input_ids"])[-1].numpy()
if input_length < self.model.config.min_length // 2:
min_length = generate_kwargs.get("min_length", self.model.config.min_length)
if input_length < min_length // 2:
logger.warning(
"Your min_length is set to {}, but you input_length is only {}. You might consider decreasing min_length manually, e.g. summarizer('...', min_length=10)".format(
self.model.config.min_length, input_length
min_length, input_length
)
)
if input_length < self.model.config.max_length:
max_length = generate_kwargs.get("max_length", self.model.config.max_length)
if input_length < max_length:
logger.warning(
"Your max_length is set to {}, but you input_length is only {}. You might consider decreasing max_length manually, e.g. summarizer('...', max_length=50)".format(
self.model.config.max_length, input_length
max_length, input_length
)
)
@@ -1349,10 +1351,11 @@ class TranslationPipeline(Pipeline):
elif self.framework == "tf":
input_length = tf.shape(inputs["input_ids"])[-1].numpy()
if input_length > 0.9 * self.model.config.max_length:
max_length = generate_kwargs.get("max_length", self.model.config.max_length)
if input_length > 0.9 * max_length:
logger.warning(
"Your input_length: {} is bigger than 0.9 * max_length: {}. You might consider increasing your max_length manually, e.g. translator('...', max_length=400)".format(
input_length, self.model.config.max_length
input_length, max_length
)
)

View File

@@ -26,6 +26,7 @@ from .configuration_auto import (
CamembertConfig,
CTRLConfig,
DistilBertConfig,
ElectraConfig,
FlaubertConfig,
GPT2Config,
OpenAIGPTConfig,
@@ -44,6 +45,7 @@ from .tokenization_bert_japanese import BertJapaneseTokenizer
from .tokenization_camembert import CamembertTokenizer
from .tokenization_ctrl import CTRLTokenizer
from .tokenization_distilbert import DistilBertTokenizer, DistilBertTokenizerFast
from .tokenization_electra import ElectraTokenizer, ElectraTokenizerFast
from .tokenization_flaubert import FlaubertTokenizer
from .tokenization_gpt2 import GPT2Tokenizer, GPT2TokenizerFast
from .tokenization_openai import OpenAIGPTTokenizer, OpenAIGPTTokenizerFast
@@ -67,6 +69,7 @@ TOKENIZER_MAPPING = OrderedDict(
(XLMRobertaConfig, (XLMRobertaTokenizer, None)),
(BartConfig, (BartTokenizer, None)),
(RobertaConfig, (RobertaTokenizer, RobertaTokenizerFast)),
(ElectraConfig, (ElectraTokenizer, ElectraTokenizerFast)),
(BertConfig, (BertTokenizer, BertTokenizerFast)),
(OpenAIGPTConfig, (OpenAIGPTTokenizer, OpenAIGPTTokenizerFast)),
(GPT2Config, (GPT2Tokenizer, GPT2TokenizerFast)),
@@ -104,6 +107,7 @@ class AutoTokenizer:
- contains `xlnet`: XLNetTokenizer (XLNet model)
- contains `xlm`: XLMTokenizer (XLM model)
- contains `ctrl`: CTRLTokenizer (Salesforce CTRL model)
- contains `electra`: ElectraTokenizer (Google ELECTRA model)
This class cannot be instantiated using `__init__()` (throw an error).
"""
@@ -135,6 +139,7 @@ class AutoTokenizer:
- contains `xlnet`: XLNetTokenizer (XLNet model)
- contains `xlm`: XLMTokenizer (XLM model)
- contains `ctrl`: CTRLTokenizer (Salesforce CTRL model)
- contains `electra`: ElectraTokenizer (Google ELECTRA model)
Params:
pretrained_model_name_or_path: either:

View File

@@ -19,6 +19,7 @@ import collections
import logging
import os
import unicodedata
from typing import Optional
from .tokenization_bert import BasicTokenizer, BertTokenizer, WordpieceTokenizer, load_vocab
@@ -89,6 +90,7 @@ class BertJapaneseTokenizer(BertTokenizer):
pad_token="[PAD]",
cls_token="[CLS]",
mask_token="[MASK]",
mecab_kwargs=None,
**kwargs
):
"""Constructs a MecabBertTokenizer.
@@ -106,6 +108,7 @@ class BertJapaneseTokenizer(BertTokenizer):
Type of word tokenizer.
**subword_tokenizer_type**: (`optional`) string (default "wordpiece")
Type of subword tokenizer.
**mecab_kwargs**: (`optional`) dict passed to `MecabTokenizer` constructor (default None)
"""
super(BertTokenizer, self).__init__(
unk_token=unk_token,
@@ -134,7 +137,9 @@ class BertJapaneseTokenizer(BertTokenizer):
do_lower_case=do_lower_case, never_split=never_split, tokenize_chinese_chars=False
)
elif word_tokenizer_type == "mecab":
self.word_tokenizer = MecabTokenizer(do_lower_case=do_lower_case, never_split=never_split)
self.word_tokenizer = MecabTokenizer(
do_lower_case=do_lower_case, never_split=never_split, **(mecab_kwargs or {})
)
else:
raise ValueError("Invalid word_tokenizer_type '{}' is specified.".format(word_tokenizer_type))
@@ -161,10 +166,10 @@ class BertJapaneseTokenizer(BertTokenizer):
return split_tokens
class MecabTokenizer(object):
class MecabTokenizer:
"""Runs basic tokenization with MeCab morphological parser."""
def __init__(self, do_lower_case=False, never_split=None, normalize_text=True):
def __init__(self, do_lower_case=False, never_split=None, normalize_text=True, mecab_option: Optional[str] = None):
"""Constructs a MecabTokenizer.
Args:
@@ -176,6 +181,7 @@ class MecabTokenizer(object):
List of token not to split.
**normalize_text**: (`optional`) boolean (default True)
Whether to apply unicode normalization to text before tokenization.
**mecab_option**: (`optional`) string passed to `MeCab.Tagger` constructor (default "")
"""
self.do_lower_case = do_lower_case
self.never_split = never_split if never_split is not None else []
@@ -183,7 +189,7 @@ class MecabTokenizer(object):
import MeCab
self.mecab = MeCab.Tagger()
self.mecab = MeCab.Tagger(mecab_option) if mecab_option is not None else MeCab.Tagger()
def tokenize(self, text, never_split=None, **kwargs):
"""Tokenizes a piece of text."""

View File

@@ -0,0 +1,80 @@
# coding=utf-8
# Copyright 2020 The Google AI Team, Stanford University and The HuggingFace Inc. team.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
from .tokenization_bert import BertTokenizer, BertTokenizerFast
VOCAB_FILES_NAMES = {"vocab_file": "vocab.txt"}
PRETRAINED_VOCAB_FILES_MAP = {
"vocab_file": {
"google/electra-small-generator": "https://s3.amazonaws.com/models.huggingface.co/bert/google/electra-small-generator/vocab.txt",
"google/electra-base-generator": "https://s3.amazonaws.com/models.huggingface.co/bert/google/electra-base-generator/vocab.txt",
"google/electra-large-generator": "https://s3.amazonaws.com/models.huggingface.co/bert/google/electra-large-generator/vocab.txt",
"google/electra-small-discriminator": "https://s3.amazonaws.com/models.huggingface.co/bert/google/electra-small-discriminator/vocab.txt",
"google/electra-base-discriminator": "https://s3.amazonaws.com/models.huggingface.co/bert/google/electra-base-discriminator/vocab.txt",
"google/electra-large-discriminator": "https://s3.amazonaws.com/models.huggingface.co/bert/google/electra-large-discriminator/vocab.txt",
}
}
PRETRAINED_POSITIONAL_EMBEDDINGS_SIZES = {
"google/electra-small-generator": 512,
"google/electra-base-generator": 512,
"google/electra-large-generator": 512,
"google/electra-small-discriminator": 512,
"google/electra-base-discriminator": 512,
"google/electra-large-discriminator": 512,
}
PRETRAINED_INIT_CONFIGURATION = {
"google/electra-small-generator": {"do_lower_case": True},
"google/electra-base-generator": {"do_lower_case": True},
"google/electra-large-generator": {"do_lower_case": True},
"google/electra-small-discriminator": {"do_lower_case": True},
"google/electra-base-discriminator": {"do_lower_case": True},
"google/electra-large-discriminator": {"do_lower_case": True},
}
class ElectraTokenizer(BertTokenizer):
r"""
Constructs an Electra tokenizer.
:class:`~transformers.ElectraTokenizer` is identical to :class:`~transformers.BertTokenizer` and runs end-to-end
tokenization: punctuation splitting + wordpiece.
Refer to superclass :class:`~transformers.BertTokenizer` for usage examples and documentation concerning
parameters.
"""
vocab_files_names = VOCAB_FILES_NAMES
pretrained_vocab_files_map = PRETRAINED_VOCAB_FILES_MAP
max_model_input_sizes = PRETRAINED_POSITIONAL_EMBEDDINGS_SIZES
pretrained_init_configuration = PRETRAINED_INIT_CONFIGURATION
class ElectraTokenizerFast(BertTokenizerFast):
r"""
Constructs an Electra Fast tokenizer.
:class:`~transformers.ElectraTokenizerFast` is identical to :class:`~transformers.BertTokenizerFast` and runs end-to-end
tokenization: punctuation splitting + wordpiece.
Refer to superclass :class:`~transformers.BertTokenizerFast` for usage examples and documentation concerning
parameters.
"""
vocab_files_names = VOCAB_FILES_NAMES
pretrained_vocab_files_map = PRETRAINED_VOCAB_FILES_MAP
max_model_input_sizes = PRETRAINED_POSITIONAL_EMBEDDINGS_SIZES
pretrained_init_configuration = PRETRAINED_INIT_CONFIGURATION

View File

@@ -59,4 +59,4 @@ You can then finish the addition step by adding imports for your classes in the
- [ ] add a link to your conversion script in the main conversion utility (in `commands/convert.py`)
- [ ] edit the PyTorch to TF 2.0 conversion script to add your model in the `convert_pytorch_checkpoint_to_tf2.py` file
- [ ] add a mention of your model in the doc: `README.md` and the documentation itself at `docs/source/pretrained_models.rst`.
- [ ] upload the pretrained weigths, configurations and vocabulary files.
- [ ] upload the pretrained weights, configurations and vocabulary files.

View File

@@ -27,7 +27,9 @@ from .utils import CACHE_DIR, require_torch, slow, torch_device
if is_torch_available():
import torch
from transformers import (
AutoModel,
AutoModelForSequenceClassification,
AutoTokenizer,
BartModel,
BartForConditionalGeneration,
BartForSequenceClassification,
@@ -183,6 +185,15 @@ class BARTModelTest(ModelTesterMixin, unittest.TestCase):
def test_inputs_embeds(self):
pass
def test_tiny_model(self):
model_name = "sshleifer/bart-tiny-random"
tiny = AutoModel.from_pretrained(model_name) # same vocab size
tok = AutoTokenizer.from_pretrained(model_name) # same tokenizer
inputs_dict = tok.batch_encode_plus(["Hello my friends"], return_tensors="pt")
with torch.no_grad():
tiny(**inputs_dict)
@require_torch
class BartHeadTests(unittest.TestCase):

View File

@@ -624,60 +624,114 @@ class ModelTesterMixin:
with torch.no_grad():
model(**inputs_dict)
def test_lm_head_model_random_generate(self):
def test_lm_head_model_random_no_beam_search_generate(self):
config, inputs_dict = self.model_tester.prepare_config_and_inputs_for_common()
input_ids = inputs_dict.get("input_ids")
input_ids = inputs_dict["input_ids"] if "input_ids" in inputs_dict else inputs_dict["inputs"]
# iterate over all generative models
for model_class in self.all_generative_model_classes:
model = model_class(config)
if config.bos_token_id is None:
# if bos token id is not defined mobel needs input_ids
with self.assertRaises(AssertionError):
model.generate(do_sample=True, max_length=5)
# num_return_sequences = 1
self._check_generated_ids(model.generate(input_ids, do_sample=True))
else:
# num_return_sequences = 1
self._check_generated_ids(model.generate(do_sample=True, max_length=5))
with self.assertRaises(AssertionError):
# generating multiple sequences when no beam search generation
# is not allowed as it would always generate the same sequences
model.generate(input_ids, do_sample=False, num_return_sequences=2)
# num_return_sequences > 1, sample
self._check_generated_ids(model.generate(input_ids, do_sample=True, num_return_sequences=2))
# check bad words tokens language generation
# create list of 1-seq bad token and list of 2-seq of bad tokens
bad_words_ids = [self._generate_random_bad_tokens(1, model), self._generate_random_bad_tokens(2, model)]
output_tokens = model.generate(
input_ids, do_sample=True, bad_words_ids=bad_words_ids, num_return_sequences=2
)
# only count generated tokens
generated_ids = output_tokens[:, input_ids.shape[-1] :]
self.assertFalse(self._check_match_tokens(generated_ids.tolist(), bad_words_ids))
def test_lm_head_model_random_beam_search_generate(self):
config, inputs_dict = self.model_tester.prepare_config_and_inputs_for_common()
input_ids = inputs_dict["input_ids"] if "input_ids" in inputs_dict else inputs_dict["inputs"]
if self.is_encoder_decoder:
config.output_past = True # needed for Bart TODO: might have to update for other encoder-decoder models
# needed for Bart beam search
config.output_past = True
for model_class in self.all_generative_model_classes:
model = model_class(config)
model.to(torch_device)
model.eval()
if config.bos_token_id is None:
with self.assertRaises(AssertionError):
model.generate(do_sample=True, max_length=5)
# batch_size = 1
self._check_generated_tokens(model.generate(input_ids, do_sample=True))
# batch_size = 1, num_beams > 1
self._check_generated_tokens(model.generate(input_ids, do_sample=True, num_beams=3))
# if bos token id is not defined mobel needs input_ids, num_return_sequences = 1
self._check_generated_ids(model.generate(input_ids, do_sample=True, num_beams=2))
else:
# batch_size = 1
self._check_generated_tokens(model.generate(do_sample=True, max_length=5))
# batch_size = 1, num_beams > 1
self._check_generated_tokens(model.generate(do_sample=True, max_length=5, num_beams=3))
with self.assertRaises(AssertionError):
# generating multiple sequences when greedy no beam generation
# is not allowed as it would always generate the same sequences
model.generate(input_ids, do_sample=False, num_return_sequences=2)
# num_return_sequences = 1
self._check_generated_ids(model.generate(do_sample=True, max_length=5, num_beams=2))
with self.assertRaises(AssertionError):
# generating more sequences than having beams leads is not possible
model.generate(input_ids, do_sample=False, num_return_sequences=3, num_beams=2)
# batch_size > 1, sample
self._check_generated_tokens(model.generate(input_ids, do_sample=True, num_return_sequences=3))
# batch_size > 1, greedy
self._check_generated_tokens(model.generate(input_ids, do_sample=False))
# num_return_sequences > 1, sample
self._check_generated_ids(model.generate(input_ids, do_sample=True, num_beams=2, num_return_sequences=2,))
# num_return_sequences > 1, greedy
self._check_generated_ids(model.generate(input_ids, do_sample=False, num_beams=2, num_return_sequences=2))
# batch_size > 1, num_beams > 1, sample
self._check_generated_tokens(
model.generate(input_ids, do_sample=True, num_beams=3, num_return_sequences=3,)
)
# batch_size > 1, num_beams > 1, greedy
self._check_generated_tokens(
model.generate(input_ids, do_sample=False, num_beams=3, num_return_sequences=3)
# check bad words tokens language generation
# create list of 1-seq bad token and list of 2-seq of bad tokens
bad_words_ids = [self._generate_random_bad_tokens(1, model), self._generate_random_bad_tokens(2, model)]
output_tokens = model.generate(
input_ids, do_sample=False, bad_words_ids=bad_words_ids, num_beams=2, num_return_sequences=2
)
# only count generated tokens
generated_ids = output_tokens[:, input_ids.shape[-1] :]
self.assertFalse(self._check_match_tokens(generated_ids.tolist(), bad_words_ids))
def _check_generated_tokens(self, output_ids):
def _generate_random_bad_tokens(self, num_bad_tokens, model):
# special tokens cannot be bad tokens
special_tokens = []
if model.config.bos_token_id is not None:
special_tokens.append(model.config.bos_token_id)
if model.config.pad_token_id is not None:
special_tokens.append(model.config.pad_token_id)
if model.config.eos_token_id is not None:
special_tokens.append(model.config.eos_token_id)
# create random bad tokens that are not special tokens
bad_tokens = []
while len(bad_tokens) < num_bad_tokens:
token = ids_tensor((1, 1), self.model_tester.vocab_size).squeeze(0).numpy()[0]
if token not in special_tokens:
bad_tokens.append(token)
return bad_tokens
def _check_generated_ids(self, output_ids):
for token_id in output_ids[0].tolist():
self.assertGreaterEqual(token_id, 0)
self.assertLess(token_id, self.model_tester.vocab_size)
def _check_match_tokens(self, generated_ids, bad_words_ids):
# for all bad word tokens
for bad_word_ids in bad_words_ids:
# for all slices in batch
for generated_ids_slice in generated_ids:
# for all word idx
for i in range(len(bad_word_ids), len(generated_ids_slice)):
# if tokens match
if generated_ids_slice[i - len(bad_word_ids) : i] == bad_word_ids:
return True
return False
global_rng = random.Random()

View File

@@ -0,0 +1,287 @@
# coding=utf-8
# Copyright 2018 The Google AI Language Team Authors.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
import unittest
from transformers import is_torch_available
from .test_configuration_common import ConfigTester
from .test_modeling_common import ModelTesterMixin, ids_tensor
from .utils import CACHE_DIR, require_torch, slow, torch_device
if is_torch_available():
from transformers import (
ElectraConfig,
ElectraModel,
ElectraForMaskedLM,
ElectraForTokenClassification,
ElectraForPreTraining,
)
from transformers.modeling_electra import ELECTRA_PRETRAINED_MODEL_ARCHIVE_MAP
@require_torch
class ElectraModelTest(ModelTesterMixin, unittest.TestCase):
all_model_classes = (
(ElectraModel, ElectraForMaskedLM, ElectraForTokenClassification,) if is_torch_available() else ()
)
class ElectraModelTester(object):
def __init__(
self,
parent,
batch_size=13,
seq_length=7,
is_training=True,
use_input_mask=True,
use_token_type_ids=True,
use_labels=True,
vocab_size=99,
hidden_size=32,
num_hidden_layers=5,
num_attention_heads=4,
intermediate_size=37,
hidden_act="gelu",
hidden_dropout_prob=0.1,
attention_probs_dropout_prob=0.1,
max_position_embeddings=512,
type_vocab_size=16,
type_sequence_label_size=2,
initializer_range=0.02,
num_labels=3,
num_choices=4,
scope=None,
):
self.parent = parent
self.batch_size = batch_size
self.seq_length = seq_length
self.is_training = is_training
self.use_input_mask = use_input_mask
self.use_token_type_ids = use_token_type_ids
self.use_labels = use_labels
self.vocab_size = vocab_size
self.hidden_size = hidden_size
self.num_hidden_layers = num_hidden_layers
self.num_attention_heads = num_attention_heads
self.intermediate_size = intermediate_size
self.hidden_act = hidden_act
self.hidden_dropout_prob = hidden_dropout_prob
self.attention_probs_dropout_prob = attention_probs_dropout_prob
self.max_position_embeddings = max_position_embeddings
self.type_vocab_size = type_vocab_size
self.type_sequence_label_size = type_sequence_label_size
self.initializer_range = initializer_range
self.num_labels = num_labels
self.num_choices = num_choices
self.scope = scope
def prepare_config_and_inputs(self):
input_ids = ids_tensor([self.batch_size, self.seq_length], self.vocab_size)
input_mask = None
if self.use_input_mask:
input_mask = ids_tensor([self.batch_size, self.seq_length], vocab_size=2)
token_type_ids = None
if self.use_token_type_ids:
token_type_ids = ids_tensor([self.batch_size, self.seq_length], self.type_vocab_size)
sequence_labels = None
token_labels = None
choice_labels = None
if self.use_labels:
sequence_labels = ids_tensor([self.batch_size], self.type_sequence_label_size)
token_labels = ids_tensor([self.batch_size, self.seq_length], self.num_labels)
choice_labels = ids_tensor([self.batch_size], self.num_choices)
fake_token_labels = ids_tensor([self.batch_size, self.seq_length], 1)
config = ElectraConfig(
vocab_size=self.vocab_size,
hidden_size=self.hidden_size,
num_hidden_layers=self.num_hidden_layers,
num_attention_heads=self.num_attention_heads,
intermediate_size=self.intermediate_size,
hidden_act=self.hidden_act,
hidden_dropout_prob=self.hidden_dropout_prob,
attention_probs_dropout_prob=self.attention_probs_dropout_prob,
max_position_embeddings=self.max_position_embeddings,
type_vocab_size=self.type_vocab_size,
is_decoder=False,
initializer_range=self.initializer_range,
)
return (
config,
input_ids,
token_type_ids,
input_mask,
sequence_labels,
token_labels,
choice_labels,
fake_token_labels,
)
def check_loss_output(self, result):
self.parent.assertListEqual(list(result["loss"].size()), [])
def create_and_check_electra_model(
self,
config,
input_ids,
token_type_ids,
input_mask,
sequence_labels,
token_labels,
choice_labels,
fake_token_labels,
):
model = ElectraModel(config=config)
model.to(torch_device)
model.eval()
(sequence_output,) = model(input_ids, attention_mask=input_mask, token_type_ids=token_type_ids)
(sequence_output,) = model(input_ids, token_type_ids=token_type_ids)
(sequence_output,) = model(input_ids)
result = {
"sequence_output": sequence_output,
}
self.parent.assertListEqual(
list(result["sequence_output"].size()), [self.batch_size, self.seq_length, self.hidden_size]
)
def create_and_check_electra_for_masked_lm(
self,
config,
input_ids,
token_type_ids,
input_mask,
sequence_labels,
token_labels,
choice_labels,
fake_token_labels,
):
model = ElectraForMaskedLM(config=config)
model.to(torch_device)
model.eval()
loss, prediction_scores = model(
input_ids, attention_mask=input_mask, token_type_ids=token_type_ids, masked_lm_labels=token_labels
)
result = {
"loss": loss,
"prediction_scores": prediction_scores,
}
self.parent.assertListEqual(
list(result["prediction_scores"].size()), [self.batch_size, self.seq_length, self.vocab_size]
)
self.check_loss_output(result)
def create_and_check_electra_for_token_classification(
self,
config,
input_ids,
token_type_ids,
input_mask,
sequence_labels,
token_labels,
choice_labels,
fake_token_labels,
):
config.num_labels = self.num_labels
model = ElectraForTokenClassification(config=config)
model.to(torch_device)
model.eval()
loss, logits = model(
input_ids, attention_mask=input_mask, token_type_ids=token_type_ids, labels=token_labels
)
result = {
"loss": loss,
"logits": logits,
}
self.parent.assertListEqual(
list(result["logits"].size()), [self.batch_size, self.seq_length, self.num_labels]
)
self.check_loss_output(result)
def create_and_check_electra_for_pretraining(
self,
config,
input_ids,
token_type_ids,
input_mask,
sequence_labels,
token_labels,
choice_labels,
fake_token_labels,
):
config.num_labels = self.num_labels
model = ElectraForPreTraining(config=config)
model.to(torch_device)
model.eval()
loss, logits = model(
input_ids, attention_mask=input_mask, token_type_ids=token_type_ids, labels=fake_token_labels
)
result = {
"loss": loss,
"logits": logits,
}
self.parent.assertListEqual(list(result["logits"].size()), [self.batch_size, self.seq_length])
self.check_loss_output(result)
def prepare_config_and_inputs_for_common(self):
config_and_inputs = self.prepare_config_and_inputs()
(
config,
input_ids,
token_type_ids,
input_mask,
sequence_labels,
token_labels,
choice_labels,
fake_token_labels,
) = config_and_inputs
inputs_dict = {"input_ids": input_ids, "token_type_ids": token_type_ids, "attention_mask": input_mask}
return config, inputs_dict
def setUp(self):
self.model_tester = ElectraModelTest.ElectraModelTester(self)
self.config_tester = ConfigTester(self, config_class=ElectraConfig, hidden_size=37)
def test_config(self):
self.config_tester.run_common_tests()
def test_electra_model(self):
config_and_inputs = self.model_tester.prepare_config_and_inputs()
self.model_tester.create_and_check_electra_model(*config_and_inputs)
def test_for_masked_lm(self):
config_and_inputs = self.model_tester.prepare_config_and_inputs()
self.model_tester.create_and_check_electra_for_masked_lm(*config_and_inputs)
def test_for_token_classification(self):
config_and_inputs = self.model_tester.prepare_config_and_inputs()
self.model_tester.create_and_check_electra_for_token_classification(*config_and_inputs)
def test_for_pre_training(self):
config_and_inputs = self.model_tester.prepare_config_and_inputs()
self.model_tester.create_and_check_electra_for_pretraining(*config_and_inputs)
@slow
def test_model_from_pretrained(self):
for model_name in list(ELECTRA_PRETRAINED_MODEL_ARCHIVE_MAP.keys())[:1]:
model = ElectraModel.from_pretrained(model_name, cache_dir=CACHE_DIR)
self.assertIsNotNone(model)

File diff suppressed because one or more lines are too long

View File

@@ -162,6 +162,10 @@ class TFModelTesterMixin:
pt_inputs_dict = dict(
(name, torch.from_numpy(key.numpy()).to(torch.long)) for name, key in inputs_dict.items()
)
# need to rename encoder-decoder "inputs" for PyTorch
if "inputs" in pt_inputs_dict and self.is_encoder_decoder:
pt_inputs_dict["input_ids"] = pt_inputs_dict.pop("inputs")
with torch.no_grad():
pto = pt_model(**pt_inputs_dict)
tfo = tf_model(inputs_dict, training=False)
@@ -201,6 +205,10 @@ class TFModelTesterMixin:
pt_inputs_dict = dict(
(name, torch.from_numpy(key.numpy()).to(torch.long)) for name, key in inputs_dict.items()
)
# need to rename encoder-decoder "inputs" for PyTorch
if "inputs" in pt_inputs_dict and self.is_encoder_decoder:
pt_inputs_dict["input_ids"] = pt_inputs_dict.pop("inputs")
with torch.no_grad():
pto = pt_model(**pt_inputs_dict)
tfo = tf_model(inputs_dict)
@@ -223,7 +231,7 @@ class TFModelTesterMixin:
if self.is_encoder_decoder:
input_ids = {
"decoder_input_ids": tf.keras.Input(batch_shape=(2, 2000), name="decoder_input_ids", dtype="int32"),
"input_ids": tf.keras.Input(batch_shape=(2, 2000), name="input_ids", dtype="int32"),
"inputs": tf.keras.Input(batch_shape=(2, 2000), name="inputs", dtype="int32"),
}
else:
input_ids = tf.keras.Input(batch_shape=(2, 2000), name="input_ids", dtype="int32")
@@ -259,7 +267,7 @@ class TFModelTesterMixin:
outputs_dict = model(inputs_dict)
inputs_keywords = copy.deepcopy(inputs_dict)
input_ids = inputs_keywords.pop("input_ids" if not self.is_encoder_decoder else "decoder_input_ids", None,)
input_ids = inputs_keywords.pop("input_ids" if not self.is_encoder_decoder else "inputs", None,)
outputs_keywords = model(input_ids, **inputs_keywords)
output_dict = outputs_dict[0].numpy()
@@ -395,9 +403,9 @@ class TFModelTesterMixin:
input_ids = inputs_dict["input_ids"]
del inputs_dict["input_ids"]
else:
encoder_input_ids = inputs_dict["input_ids"]
encoder_input_ids = inputs_dict["inputs"]
decoder_input_ids = inputs_dict["decoder_input_ids"]
del inputs_dict["input_ids"]
del inputs_dict["inputs"]
del inputs_dict["decoder_input_ids"]
for model_class in self.all_model_classes:
@@ -412,58 +420,114 @@ class TFModelTesterMixin:
model(inputs_dict)
def test_lm_head_model_random_generate(self):
def test_lm_head_model_random_no_beam_search_generate(self):
config, inputs_dict = self.model_tester.prepare_config_and_inputs_for_common()
input_ids = inputs_dict["input_ids"]
input_ids = inputs_dict["input_ids"] if "input_ids" in inputs_dict else inputs_dict["inputs"]
# iterate over all generative models
for model_class in self.all_generative_model_classes:
model = model_class(config)
if config.bos_token_id is None:
# if bos token id is not defined mobel needs input_ids
with self.assertRaises(AssertionError):
model.generate(do_sample=True, max_length=5)
# num_return_sequences = 1
self._check_generated_ids(model.generate(input_ids, do_sample=True))
else:
# num_return_sequences = 1
self._check_generated_ids(model.generate(do_sample=True, max_length=5))
with self.assertRaises(AssertionError):
# generating multiple sequences when no beam search generation
# is not allowed as it would always generate the same sequences
model.generate(input_ids, do_sample=False, num_return_sequences=2)
# num_return_sequences > 1, sample
self._check_generated_ids(model.generate(input_ids, do_sample=True, num_return_sequences=2))
# check bad words tokens language generation
# create list of 1-seq bad token and list of 2-seq of bad tokens
bad_words_ids = [self._generate_random_bad_tokens(1, model), self._generate_random_bad_tokens(2, model)]
output_tokens = model.generate(
input_ids, do_sample=True, bad_words_ids=bad_words_ids, num_return_sequences=2
)
# only count generated tokens
generated_ids = output_tokens[:, input_ids.shape[-1] :]
self.assertFalse(self._check_match_tokens(generated_ids.numpy().tolist(), bad_words_ids))
def test_lm_head_model_random_beam_search_generate(self):
config, inputs_dict = self.model_tester.prepare_config_and_inputs_for_common()
input_ids = inputs_dict["input_ids"] if "input_ids" in inputs_dict else inputs_dict["inputs"]
if self.is_encoder_decoder:
config.output_past = True # needed for Bart TODO: might have to update for other encoder-decoder models
# needed for Bart beam search
config.output_past = True
for model_class in self.all_generative_model_classes:
model = model_class(config)
if config.bos_token_id is None:
with self.assertRaises(AssertionError):
model.generate(do_sample=True, max_length=5)
# batch_size = 1
self._check_generated_tokens(model.generate(input_ids, do_sample=True))
# batch_size = 1, num_beams > 1
self._check_generated_tokens(model.generate(input_ids, do_sample=True, num_beams=3))
# if bos token id is not defined mobel needs input_ids, num_return_sequences = 1
self._check_generated_ids(model.generate(input_ids, do_sample=True, num_beams=2))
else:
# batch_size = 1
self._check_generated_tokens(model.generate(do_sample=True, max_length=5))
# batch_size = 1, num_beams > 1
self._check_generated_tokens(model.generate(do_sample=True, max_length=5, num_beams=3))
with self.assertRaises(AssertionError):
# generating multiple sequences when greedy no beam generation
# is not allowed as it would always generate the same sequences
model.generate(input_ids, do_sample=False, num_return_sequences=2)
# num_return_sequences = 1
self._check_generated_ids(model.generate(do_sample=True, max_length=5, num_beams=2))
with self.assertRaises(AssertionError):
# generating more sequences than having beams leads is not possible
model.generate(input_ids, do_sample=False, num_return_sequences=3, num_beams=2)
# batch_size > 1, sample
self._check_generated_tokens(model.generate(input_ids, do_sample=True, num_return_sequences=3))
# batch_size > 1, greedy
self._check_generated_tokens(model.generate(input_ids, do_sample=False))
# num_return_sequences > 1, sample
self._check_generated_ids(model.generate(input_ids, do_sample=True, num_beams=2, num_return_sequences=2,))
# num_return_sequences > 1, greedy
self._check_generated_ids(model.generate(input_ids, do_sample=False, num_beams=2, num_return_sequences=2))
# batch_size > 1, num_beams > 1, sample
self._check_generated_tokens(
model.generate(input_ids, do_sample=True, num_beams=3, num_return_sequences=3,)
)
# batch_size > 1, num_beams > 1, greedy
self._check_generated_tokens(
model.generate(input_ids, do_sample=False, num_beams=3, num_return_sequences=3)
# check bad words tokens language generation
# create list of 1-seq bad token and list of 2-seq of bad tokens
bad_words_ids = [self._generate_random_bad_tokens(1, model), self._generate_random_bad_tokens(2, model)]
output_tokens = model.generate(
input_ids, do_sample=False, bad_words_ids=bad_words_ids, num_beams=2, num_return_sequences=2
)
# only count generated tokens
generated_ids = output_tokens[:, input_ids.shape[-1] :]
self.assertFalse(self._check_match_tokens(generated_ids.numpy().tolist(), bad_words_ids))
def _check_generated_tokens(self, output_ids):
def _generate_random_bad_tokens(self, num_bad_tokens, model):
# special tokens cannot be bad tokens
special_tokens = []
if model.config.bos_token_id is not None:
special_tokens.append(model.config.bos_token_id)
if model.config.pad_token_id is not None:
special_tokens.append(model.config.pad_token_id)
if model.config.eos_token_id is not None:
special_tokens.append(model.config.eos_token_id)
# create random bad tokens that are not special tokens
bad_tokens = []
while len(bad_tokens) < num_bad_tokens:
token = tf.squeeze(ids_tensor((1, 1), self.model_tester.vocab_size), 0).numpy()[0]
if token not in special_tokens:
bad_tokens.append(token)
return bad_tokens
def _check_generated_ids(self, output_ids):
for token_id in output_ids[0].numpy().tolist():
self.assertGreaterEqual(token_id, 0)
self.assertLess(token_id, self.model_tester.vocab_size)
def _check_match_tokens(self, generated_ids, bad_words_ids):
# for all bad word tokens
for bad_word_ids in bad_words_ids:
# for all slices in batch
for generated_ids_slice in generated_ids:
# for all word idx
for i in range(len(bad_word_ids), len(generated_ids_slice)):
# if tokens match
if generated_ids_slice[i - len(bad_word_ids) : i] == bad_word_ids:
return True
return False
def ids_tensor(shape, vocab_size, rng=None, name=None, dtype=None):
"""Creates a random int32 tensor of the shape within the vocab size."""

View File

@@ -0,0 +1,227 @@
# coding=utf-8
# Copyright 2018 The Google AI Language Team Authors.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
import unittest
from transformers import ElectraConfig, is_tf_available
from .test_configuration_common import ConfigTester
from .test_modeling_tf_common import TFModelTesterMixin, ids_tensor
from .utils import CACHE_DIR, require_tf, slow
if is_tf_available():
from transformers.modeling_tf_electra import (
TFElectraModel,
TFElectraForMaskedLM,
TFElectraForPreTraining,
TFElectraForTokenClassification,
)
@require_tf
class TFElectraModelTest(TFModelTesterMixin, unittest.TestCase):
all_model_classes = (
(TFElectraModel, TFElectraForMaskedLM, TFElectraForPreTraining, TFElectraForTokenClassification,)
if is_tf_available()
else ()
)
class TFElectraModelTester(object):
def __init__(
self,
parent,
batch_size=13,
seq_length=7,
is_training=True,
use_input_mask=True,
use_token_type_ids=True,
use_labels=True,
vocab_size=99,
hidden_size=32,
num_hidden_layers=5,
num_attention_heads=4,
intermediate_size=37,
hidden_act="gelu",
hidden_dropout_prob=0.1,
attention_probs_dropout_prob=0.1,
max_position_embeddings=512,
type_vocab_size=16,
type_sequence_label_size=2,
initializer_range=0.02,
num_labels=3,
num_choices=4,
scope=None,
):
self.parent = parent
self.batch_size = batch_size
self.seq_length = seq_length
self.is_training = is_training
self.use_input_mask = use_input_mask
self.use_token_type_ids = use_token_type_ids
self.use_labels = use_labels
self.vocab_size = vocab_size
self.hidden_size = hidden_size
self.num_hidden_layers = num_hidden_layers
self.num_attention_heads = num_attention_heads
self.intermediate_size = intermediate_size
self.hidden_act = hidden_act
self.hidden_dropout_prob = hidden_dropout_prob
self.attention_probs_dropout_prob = attention_probs_dropout_prob
self.max_position_embeddings = max_position_embeddings
self.type_vocab_size = type_vocab_size
self.type_sequence_label_size = type_sequence_label_size
self.initializer_range = initializer_range
self.num_labels = num_labels
self.num_choices = num_choices
self.scope = scope
def prepare_config_and_inputs(self):
input_ids = ids_tensor([self.batch_size, self.seq_length], self.vocab_size)
input_mask = None
if self.use_input_mask:
input_mask = ids_tensor([self.batch_size, self.seq_length], vocab_size=2)
token_type_ids = None
if self.use_token_type_ids:
token_type_ids = ids_tensor([self.batch_size, self.seq_length], self.type_vocab_size)
sequence_labels = None
token_labels = None
choice_labels = None
if self.use_labels:
sequence_labels = ids_tensor([self.batch_size], self.type_sequence_label_size)
token_labels = ids_tensor([self.batch_size, self.seq_length], self.num_labels)
choice_labels = ids_tensor([self.batch_size], self.num_choices)
config = ElectraConfig(
vocab_size=self.vocab_size,
hidden_size=self.hidden_size,
num_hidden_layers=self.num_hidden_layers,
num_attention_heads=self.num_attention_heads,
intermediate_size=self.intermediate_size,
hidden_act=self.hidden_act,
hidden_dropout_prob=self.hidden_dropout_prob,
attention_probs_dropout_prob=self.attention_probs_dropout_prob,
max_position_embeddings=self.max_position_embeddings,
type_vocab_size=self.type_vocab_size,
initializer_range=self.initializer_range,
)
return config, input_ids, token_type_ids, input_mask, sequence_labels, token_labels, choice_labels
def create_and_check_electra_model(
self, config, input_ids, token_type_ids, input_mask, sequence_labels, token_labels, choice_labels
):
model = TFElectraModel(config=config)
inputs = {"input_ids": input_ids, "attention_mask": input_mask, "token_type_ids": token_type_ids}
(sequence_output,) = model(inputs)
inputs = [input_ids, input_mask]
(sequence_output,) = model(inputs)
(sequence_output,) = model(input_ids)
result = {
"sequence_output": sequence_output.numpy(),
}
self.parent.assertListEqual(
list(result["sequence_output"].shape), [self.batch_size, self.seq_length, self.hidden_size]
)
def create_and_check_electra_for_masked_lm(
self, config, input_ids, token_type_ids, input_mask, sequence_labels, token_labels, choice_labels
):
model = TFElectraForMaskedLM(config=config)
inputs = {"input_ids": input_ids, "attention_mask": input_mask, "token_type_ids": token_type_ids}
(prediction_scores,) = model(inputs)
result = {
"prediction_scores": prediction_scores.numpy(),
}
self.parent.assertListEqual(
list(result["prediction_scores"].shape), [self.batch_size, self.seq_length, self.vocab_size]
)
def create_and_check_electra_for_pretraining(
self, config, input_ids, token_type_ids, input_mask, sequence_labels, token_labels, choice_labels
):
model = TFElectraForPreTraining(config=config)
inputs = {"input_ids": input_ids, "attention_mask": input_mask, "token_type_ids": token_type_ids}
(prediction_scores,) = model(inputs)
result = {
"prediction_scores": prediction_scores.numpy(),
}
self.parent.assertListEqual(list(result["prediction_scores"].shape), [self.batch_size, self.seq_length])
def create_and_check_electra_for_token_classification(
self, config, input_ids, token_type_ids, input_mask, sequence_labels, token_labels, choice_labels
):
config.num_labels = self.num_labels
model = TFElectraForTokenClassification(config=config)
inputs = {"input_ids": input_ids, "attention_mask": input_mask, "token_type_ids": token_type_ids}
(logits,) = model(inputs)
result = {
"logits": logits.numpy(),
}
self.parent.assertListEqual(
list(result["logits"].shape), [self.batch_size, self.seq_length, self.num_labels]
)
def prepare_config_and_inputs_for_common(self):
config_and_inputs = self.prepare_config_and_inputs()
(
config,
input_ids,
token_type_ids,
input_mask,
sequence_labels,
token_labels,
choice_labels,
) = config_and_inputs
inputs_dict = {"input_ids": input_ids, "token_type_ids": token_type_ids, "attention_mask": input_mask}
return config, inputs_dict
def setUp(self):
self.model_tester = TFElectraModelTest.TFElectraModelTester(self)
self.config_tester = ConfigTester(self, config_class=ElectraConfig, hidden_size=37)
def test_config(self):
self.config_tester.run_common_tests()
def test_electra_model(self):
config_and_inputs = self.model_tester.prepare_config_and_inputs()
self.model_tester.create_and_check_electra_model(*config_and_inputs)
def test_for_masked_lm(self):
config_and_inputs = self.model_tester.prepare_config_and_inputs()
self.model_tester.create_and_check_electra_for_masked_lm(*config_and_inputs)
def test_for_pretraining(self):
config_and_inputs = self.model_tester.prepare_config_and_inputs()
self.model_tester.create_and_check_electra_for_pretraining(*config_and_inputs)
def test_for_token_classification(self):
config_and_inputs = self.model_tester.prepare_config_and_inputs()
self.model_tester.create_and_check_electra_for_token_classification(*config_and_inputs)
@slow
def test_model_from_pretrained(self):
# for model_name in list(TF_ELECTRA_PRETRAINED_MODEL_ARCHIVE_MAP.keys())[:1]:
for model_name in ["electra-small-discriminator"]:
model = TFElectraModel.from_pretrained(model_name, cache_dir=CACHE_DIR)
self.assertIsNotNone(model)

File diff suppressed because one or more lines are too long

View File

@@ -91,6 +91,20 @@ class BertJapaneseTokenizationTest(TokenizerTesterMixin, unittest.TestCase):
["アップルストア", "", "iphone", "8", "", "発売", "", "", "", ""],
)
def test_mecab_tokenizer_with_option(self):
try:
tokenizer = MecabTokenizer(
do_lower_case=True, normalize_text=False, mecab_option="-d /usr/local/lib/mecab/dic/jumandic"
)
except RuntimeError:
# if dict doesn't exist in the system, previous code raises this error.
return
self.assertListEqual(
tokenizer.tokenize(" \tアップルストアでiPhone\n 発売された 。 "),
["アップルストア", "", "iPhone", "", "", "発売", "", "れた", "\u3000", ""],
)
def test_mecab_tokenizer_no_normalize(self):
tokenizer = MecabTokenizer(normalize_text=False)