Compare commits

...

72 Commits

Author SHA1 Message Date
Lysandre
b0892fa0e8 Release: v3.0.2
Some checks failed
GitHub-hosted runner / check_code_quality (push) Has been cancelled
2020-07-06 18:49:44 -04:00
Sylvain Gugger
f1e2e423ab Fix fast tokenizers too (#5562) 2020-07-06 18:45:01 -04:00
Anthony MOI
5787e4c159 Various tokenizers fixes (#5558)
* BertTokenizerFast - Do not specify strip_accents by default

* Bump tokenizers to new version

* Add test for AddedToken serialization
2020-07-06 18:27:53 -04:00
Sylvain Gugger
21f28c34b7 Fix #5507 (#5559)
* Fix #5507

* Fix formatting
2020-07-06 17:26:48 -04:00
Lysandre Debut
9d9b872b66 The add_space_before_punct_symbol is only for TransfoXL (#5549) 2020-07-06 12:17:05 -04:00
Lysandre Debut
d6b0b9d451 GPT2 tokenizer should not output token type IDs (#5546)
* GPT2 tokenizer should not output token type IDs

* Same for OpenAIGPT
2020-07-06 11:33:57 -04:00
Sylvain Gugger
7833b21a5a Fix #5544 (#5551) 2020-07-06 11:22:24 -04:00
Thomas Wolf
c473484087 Fix the tokenization warning noted in #5505 (#5550)
* fix warning

* style and quality
2020-07-06 11:15:25 -04:00
Lysandre
1bbc28bee7 Imports organization 2020-07-06 10:27:10 -04:00
Mohamed Taher Alrefaie
1bc13697b1 Update convert_pytorch_checkpoint_to_tf2.py (#5531)
fixed ImportError: cannot import name 'hf_bucket_url'
2020-07-06 09:55:10 -04:00
Arnav Sharma
b2309cc6bf Typo fix in training doc (#5495) 2020-07-06 09:15:22 -04:00
ELanning
7ecff0ccbb Fix typo in training (#5510) 2020-07-06 09:14:57 -04:00
Sam Shleifer
58cca47c16 [cleanup] TF T5 tests only init t5-base once. (#5410) 2020-07-03 14:27:49 -04:00
Patrick von Platen
991172922f better error message (#5497) 2020-07-03 19:25:25 +02:00
Thomas Wolf
b58a15a31e unpining specific git versions in setup.py 2020-07-03 17:38:39 +02:00
Thomas Wolf
fedabcd154 Release: 3.0.1
Some checks failed
GitHub-hosted runner / check_code_quality (push) Has been cancelled
2020-07-03 17:02:44 +02:00
Lysandre Debut
17ade127b9 Exposing prepare_for_model for both slow & fast tokenizers (#5479)
* Exposing prepare_for_model for both slow & fast tokenizers

* Update method signature

* The traditional style commit

* Hide the warnings behind the verbose flag

* update default truncation strategy and prepare_for_model

* fix tests and prepare_for_models methods

Co-authored-by: Thomas Wolf <thomwolf@users.noreply.github.com>
2020-07-03 16:51:21 +02:00
Manuel Romero
814ed7ee76 Create model card (#5396)
Create model card for electicidad-small (Spanish Electra) fine-tuned on SQUAD-esv1
2020-07-03 08:29:09 -04:00
Moseli Motsoehli
49281ac939 grammar corrections and train data update (#5448)
- fixed grammar and spelling
- added an intro
- updated Training data references
2020-07-03 08:25:57 -04:00
chrisliu
97355339f6 Update upstream (#5456) 2020-07-03 08:16:27 -04:00
Manuel Romero
55b932a818 Create model card (#5464)
Create model card for electra-small-discriminator fine-tuned on SQUAD v2.0
2020-07-03 06:19:49 -04:00
Funtowicz Morgan
21cd8c4086 QA Pipelines fixes (#5429)
* Make QA pipeline supports models with more than 2 outputs such as BART assuming start/end are the two first outputs.

Signed-off-by: Morgan Funtowicz <funtowiczmo@gmail.com>

* When using the new padding/truncation paradigm setting padding="max_length" + max_length=X actually pads the input up to max_length.

This result in every sample going through QA pipelines to be of size 384 whatever the actual input size is making the overall pipeline very slow.

Signed-off-by: Morgan Funtowicz <funtowiczmo@gmail.com>

* Mask padding & question before applying softmax. Softmax has been refactored to operate in log space for speed and stability.

Signed-off-by: Morgan Funtowicz <funtowiczmo@gmail.com>

* Format.

Signed-off-by: Morgan Funtowicz <funtowiczmo@gmail.com>

* Use PaddingStrategy.LONGEST instead of DO_NOT_PAD

Signed-off-by: Morgan Funtowicz <funtowiczmo@gmail.com>

* Revert "When using the new padding/truncation paradigm setting padding="max_length" + max_length=X actually pads the input up to max_length."

This reverts commit 1b00a9a2

Signed-off-by: Morgan Funtowicz <funtowiczmo@gmail.com>

* Trigger CI after unattended failure

* Trigger CI
2020-07-03 10:29:20 +02:00
Pierric Cistac
8438bab38e Fix roberta model ordering for TFAutoModel (#5414) 2020-07-02 19:23:55 -04:00
Sylvain Gugger
6b735a7253 Tokenizer summary (#5467)
* Work on tokenizer summary

* Finish tutorial

* Link to it

* Apply suggestions from code review

Co-authored-by: Anthony MOI <xn1t0x@gmail.com>
Co-authored-by: Lysandre Debut <lysandre@huggingface.co>

* Add vocab definition

Co-authored-by: Anthony MOI <xn1t0x@gmail.com>
Co-authored-by: Lysandre Debut <lysandre@huggingface.co>
2020-07-02 17:07:42 -04:00
Shen
ef0e9d806c Update: ElectraDiscriminatorPredictions forward. (#5471)
`ElectraDiscriminatorPredictions.forward` should not need `attention_mask`.
2020-07-02 13:57:33 -04:00
Manuel Romero
13a8588f2d Create model card (#5432)
Create model card for electra-base-discriminator fine-tuned on SQUAD v1.1
2020-07-02 10:16:30 -04:00
Julien Chaumond
a0a6387a0d [model_cards] roberta-large-mnli: fix sep_token 2020-07-02 10:04:02 -04:00
Julien Chaumond
215db688da Create roberta-large-mnli-README.md 2020-07-02 09:43:54 -04:00
Lysandre Debut
69d313e808 Bans SentencePiece 0.1.92 (#5418) 2020-07-02 09:23:00 -04:00
George Ho
84e56669af Fix typo in glossary (#5466) 2020-07-02 09:19:33 -04:00
Teven
c6a510c6fa Fixing missing arguments for TransfoXL tokenizer when using TextGenerationPipeline (#5465)
* overriding _parse_and_tokenize in `TextGenerationPipeine` to allow for TransfoXl tokenizer arguments
2020-07-02 13:53:33 +02:00
Teven
6726416e4a Changed expected_output_ids in TransfoXL generation test (#5462)
* Changed expected_output_ids in TransfoXL generation test to match #4826 generation PR.

* making black happy

* making isort happy
2020-07-02 11:56:44 +02:00
tommccoy
812def00c9 fix use of mems in Transformer-XL (#4826)
Fixed duplicated memory use in Transformer-XL generation leading to bad predictions and performance.
2020-07-02 11:19:07 +02:00
Patrick von Platen
306f1a2695 Add Reformer MLM notebook (#5450)
* Add Reformer MLM notebook

* Update notebooks/README.md
2020-07-02 00:20:49 +02:00
Patrick von Platen
d16e36c7e5 [Reformer] Add Masked LM Reformer (#5426)
* fix conflicts

* fix

* happy rebasing
2020-07-01 22:43:18 +02:00
Funtowicz Morgan
f4323dbf8c Don't discard entity_group when token is the latest in the sequence. (#5439)
Signed-off-by: Morgan Funtowicz <funtowiczmo@gmail.com>
2020-07-01 20:30:42 +02:00
Joe Davison
35befd9ce3 Fix tensor label type inference in default collator (#5250)
* allow tensor label inputs to default collator

* replace try/except with type check
2020-07-01 10:40:14 -06:00
Patrick von Platen
fe81f7d12c finish reformer qa head (#5433) 2020-07-01 12:27:14 -04:00
Patrick von Platen
d697b6ca75 [Longformer] Major Refactor (#5219)
* refactor naming

* add small slow test

* refactor

* refactor naming

* rename selected to extra

* big global attention refactor

* make style

* refactor naming

* save intermed

* refactor functions

* finish function refactor

* fix tests

* fix longformer

* fix longformer

* fix longformer

* fix all tests but one

* finish longformer

* address sams and izs comments

* fix transpose
2020-07-01 17:43:32 +02:00
Sam Shleifer
e0d58ddb65 [fix] Marian tests import (#5442) 2020-07-01 11:42:22 -04:00
Funtowicz Morgan
608d5a7c44 Raises PipelineException on FillMaskPipeline when there are != 1 mask_token in the input (#5389)
* Added PipelineException

Signed-off-by: Morgan Funtowicz <funtowiczmo@gmail.com>

* fill-mask pipeline raises exception when more than one mask_token detected.

Signed-off-by: Morgan Funtowicz <funtowiczmo@gmail.com>

* Put everything in a function.

Signed-off-by: Morgan Funtowicz <funtowiczmo@gmail.com>

* Added tests on pipeline fill-mask when input has != 1 mask_token

Signed-off-by: Morgan Funtowicz <funtowiczmo@gmail.com>

* Fix numel() computation for TF

Signed-off-by: Morgan Funtowicz <funtowiczmo@gmail.com>

* Addressing PR comments.

Signed-off-by: Morgan Funtowicz <funtowiczmo@gmail.com>

* Remove function typing to avoid import on specific framework.

Signed-off-by: Morgan Funtowicz <funtowiczmo@gmail.com>

* Quality.

Signed-off-by: Morgan Funtowicz <funtowiczmo@gmail.com>

* Retry typing with @julien-c tip.

Signed-off-by: Morgan Funtowicz <funtowiczmo@gmail.com>

* Quality².

Signed-off-by: Morgan Funtowicz <funtowiczmo@gmail.com>

* Simplify fill-mask mask_token checking.

Signed-off-by: Morgan Funtowicz <funtowiczmo@gmail.com>

* Trigger CI
2020-07-01 17:27:47 +02:00
Sylvain Gugger
6c55e9fc32 Fix dropdown bug in searches (#5440)
* Trigger CI

* Fix dropdown bug in searches
2020-07-01 11:02:59 -04:00
Sylvain Gugger
734a28a767 Clean up diffs in Trainer/TFTrainer (#5417)
* Cleanup and unify Trainer/TFTrainer

* Forgot to adapt TFTrainingArgs

* In tf scripts n_gpu -> n_replicas

* Update src/transformers/training_args.py

Co-authored-by: Lysandre Debut <lysandre@huggingface.co>

* Address review comments

* Formatting

* Fix typo

Co-authored-by: Lysandre Debut <lysandre@huggingface.co>
2020-07-01 11:00:20 -04:00
Sam Shleifer
43cb03a93d MarianTokenizer.prepare_translation_batch uses new tokenizer API (#5182) 2020-07-01 10:32:50 -04:00
Sam Shleifer
13deb95a40 Move tests/utils.py -> transformers/testing_utils.py (#5350) 2020-07-01 10:31:17 -04:00
sgugger
9c219305f5 Trigger CI 2020-07-01 10:22:50 -04:00
Sylvain Gugger
64e3d966b1 Add support for past states (#5399)
* Add support for past states

* Style and forgotten self

* You mean, documenting is not enough? I have to actually add it too?

* Add memory support during evaluation

* Fix tests in eval and add TF support

* No need to change this line anymore
2020-07-01 08:11:55 -04:00
Sylvain Gugger
4ade7491f4 Fix examples titles and optimization doc page (#5408) 2020-07-01 08:11:25 -04:00
Moseli Motsoehli
d60d231ea4 Create README.md (#5422)
* Create README.md

* Update model_cards/MoseliMotsoehli/TswanaBert/README.md

Co-authored-by: Julien Chaumond <chaumond@gmail.com>
2020-07-01 05:01:51 -04:00
Jay
298bdab18a Create model card for schmidek/electra-small-cased (#5400) 2020-07-01 04:01:56 -04:00
Julien Plu
fcf0652460 Fix TensorFlow dataset generator (#4881)
* fix TensorFlow generator

* Better features handling

* Apply style

* Apply style

* Fix squad as well

* Apply style

* Better factorization of TF Tensors creation
2020-06-30 19:49:11 -04:00
Hong Xu
501040fd30 In the run_ner.py example, give the optional label arg a default value (#5326)
Otherwise, if label is not specified, the following error occurs:

	Traceback (most recent call last):
	  File "run_ner.py", line 303, in <module>
	    main()
	  File "run_ner.py", line 101, in main
	    model_args, data_args, training_args = parser.parse_json_file(json_file=os.path.abspath(sys.argv[1]))
	  File "/home/user/anaconda3/envs/bert/lib/python3.7/site-packages/transformers/hf_argparser.py", line 159, in parse_json_file
	    obj = dtype(**inputs)
	TypeError: __init__() missing 1 required positional argument: 'labels'
2020-06-30 19:45:35 -04:00
Sam Shleifer
b45e65efa0 Avoid deprecation warning for F.tanh (#5413) 2020-06-30 16:41:43 -04:00
Sam Shleifer
23231c0f78 [GH Runner] fix yaml indent (#5412) 2020-06-30 16:17:12 -04:00
Sam Shleifer
ac61114592 [CI] gh runner doesn't use -v, cats new result (#5409) 2020-06-30 16:12:14 -04:00
Sam Shleifer
27a7fe7a8d examples/seq2seq: never override $WANDB_PROJECT (#5407) 2020-06-30 15:29:13 -04:00
Sam Shleifer
32d2031458 [fix] slow fill_mask test failure (#5406) 2020-06-30 15:28:15 -04:00
Sam Shleifer
80aa4b8aa6 [CI] GH-runner stores artifacts like CircleCI (#5318) 2020-06-30 15:01:53 -04:00
Sylvain Gugger
87716a6d07 Documentation for the Trainer API (#5383)
* Documentation for the Trainer API

* Address review comments

* Address comments
2020-06-30 11:43:43 -04:00
Yacine Jernite
c4d4e8bdbd Move GenerationMixin to separate file (#5254)
* separate_generation_code

* isort

* renamed

* rename_files

* move_shapelit
2020-06-30 10:42:08 -04:00
Lysandre
90d13954c4 Repin versions 2020-06-30 09:16:36 -04:00
Sylvain Gugger
0607b88945 How to share model cards with the CLI (#5374)
* How to share model cards

* Switch the two options

* Fix bad copy/cut

* Julien's suggestion
2020-06-30 08:59:32 -04:00
Kevin Canwen Xu
331d8d2936 Upload DistilBART artwork (#5394) 2020-06-30 18:11:11 +08:00
Manuel Romero
09e841490c Model Card Fixing (#5369)
- Fix missing ```-``` in language meta
- T5 pic uploaded to a more permanent place
2020-06-30 18:02:24 +08:00
Manuel Romero
4c5bed192a Model Card Fixing (#5373)
- T5 pic uploaded to a more permanent place
2020-06-30 18:01:45 +08:00
Manuel Romero
02509d4b06 Model Card Fixing (#5371)
- Model pic uploaded to a more permanent place
2020-06-30 18:01:11 +08:00
Manuel Romero
79f0118c72 Model Card Fixing (#5370)
- Fix missing ```-``` in language meta
- T5 pic uploaded to a more permanent place
2020-06-30 18:00:29 +08:00
MichaelJanz
9a473f1e43 Update Bertabs example to work again (#5355)
* Fix the bug 'Attempted relative import with no known parent package' when using the bertabs example. Also change the used model from bertabs-finetuned-cnndm, since it seems not be accessible anymore

* Update run_summarization.py

Co-authored-by: Kevin Canwen Xu <canwenxu@126.com>
2020-06-30 14:05:01 +08:00
Sylvain Gugger
7f60e93ac5 Mention openAI model card and merge content (#5378)
* Mention openAI model card and merge content

* Fix sentence
2020-06-29 18:27:36 -04:00
chrisliu
482a5993c2 Fix model card folder name so that it is consistent with model hub (#5368)
* Merge upstream

* Merge upstream

* Add generate.py link

* Merge upstream

* Merge upstream

* Fix folder name
2020-06-29 12:54:30 -04:00
chrisliu
97f24303e8 Add link to file and fix typos in model card (#5367)
* Merge upstream

* Merge upstream

* Add generate.py link
2020-06-29 11:34:52 -04:00
Lysandre Debut
b9ee87f5c7 Doc for v3.0.0 (#5366)
* Doc for v3.0.0

* Update docs/source/_static/js/custom.js

Co-authored-by: Sylvain Gugger <35901082+sgugger@users.noreply.github.com>

* Update docs/source/_static/js/custom.js

Co-authored-by: Sylvain Gugger <35901082+sgugger@users.noreply.github.com>

Co-authored-by: Sylvain Gugger <35901082+sgugger@users.noreply.github.com>
2020-06-29 11:08:54 -04:00
138 changed files with 5346 additions and 3067 deletions

View File

@@ -57,7 +57,10 @@ jobs:
steps:
- checkout
- run: sudo pip install .[mecab,testing]
- run: python -m pytest -sv ./tests/test_tokenization_bert_japanese.py
- run: python -m pytest -s ./tests/test_tokenization_bert_japanese.py | tee output.txt
- store_artifacts:
path: ~/transformers/output.txt
destination: test_output.txt
run_examples_torch:
working_directory: ~/transformers
docker:

View File

@@ -46,4 +46,5 @@ deploy_doc "11c3257" v2.8.0
deploy_doc "e7cfc1a" v2.9.0
deploy_doc "7cb203f" v2.9.1
deploy_doc "10d7239" v2.10.0
deploy_doc "b42586e" #v2.11.0 Latest stable release
deploy_doc "b42586e" v2.11.0
deploy_doc "b62ca59" #v3.0.0 Latest stable release

View File

@@ -51,4 +51,11 @@ jobs:
USE_CUDA: yes
run: |
source .env/bin/activate
python -m pytest -n 2 --dist=loadfile -s -v ./tests/
python -m pytest -n 2 --dist=loadfile -s ./tests/ | tee output.txt
- name: cat output.txt
run: cat output.txt
- name: Upload output.txt
uses: actions/upload-artifact@v1
with:
name: pytest_output
path: output.txt

View File

@@ -46,5 +46,11 @@ jobs:
USE_CUDA: yes
run: |
source .env/bin/activate
python -m pytest -n 1 --dist=loadfile -s -v ./tests/
python -m pytest -n 1 --dist=loadfile -s ./tests/ | tee output.txt
- name: cat output.txt
run: cat output.txt
- name: Upload output.txt
uses: actions/upload-artifact@v1
with:
name: pytest_output
path: output.txt

View File

@@ -1,10 +1,11 @@
// These two things need to be updated at each release for the version selector.
// Last stable version
const stableVersion = "v2.11.0"
const stableVersion = "v3.0.0"
// Dictionary doc folder to label
const versionMapping = {
"master": "master",
"": "v2.11.0 (stable)",
"": "v3.0.0 (stable)",
"v2.11.0": "v2.11.0",
"v2.10.0": "v2.10.0",
"v2.9.1": "v2.9.0/v2.9.1",
"v2.8.0": "v2.8.0",
@@ -86,7 +87,7 @@ function addVersionControl() {
const parts = location.toString().split('/');
let versionIndex = parts.length - 2;
// Index page may not have a last part with filename.html so we need to go up
if (parts[parts.length - 1] != "" && ! parts[parts.length - 1].match(/\.html$/)) {
if (parts[parts.length - 1] != "" && ! parts[parts.length - 1].match(/\.html$|^search.html?/)) {
versionIndex = parts.length - 1;
}
// Main classes and models are nested so we need to go deeper

View File

@@ -26,7 +26,7 @@ author = u'huggingface'
# The short X.Y version
version = u''
# The full version, including alpha/beta/rc tags
release = u'3.0.0'
release = u'3.0.2'
# -- General configuration ---------------------------------------------------

View File

@@ -11,7 +11,7 @@ General terms
tokens at a certain timestep.
- MLM: masked language modeling, a pretraining task where the model sees a corrupted version of the texts, usually done
by masking some tokens randomly, and has to predict the original text.
- multimodal: a task taht combines texts with another kind of inputs (for instance images).
- multimodal: a task that combines texts with another kind of inputs (for instance images).
- NLG: natural language generation, all tasks related to generating text ( for instance talk with transformers,
translation)
- NLP: natural language processing, a generic way to say "deal with texts".

View File

@@ -142,6 +142,7 @@ conversion utilities for the following models:
preprocessing
training
model_sharing
tokenizer_summary
multilingual
.. toctree::
@@ -173,6 +174,7 @@ conversion utilities for the following models:
main_classes/pipelines
main_classes/optimizer_schedules
main_classes/processors
main_classes/trainer
model_doc/auto
model_doc/encoderdecoder
model_doc/bert

View File

@@ -1,4 +1,4 @@
Optimizer
Optimization
----------------------------------------------------
The ``.optimization`` module provides:
@@ -7,24 +7,25 @@ The ``.optimization`` module provides:
- several schedules in the form of schedule objects that inherit from ``_LRSchedule``:
- a gradient accumulation class to accumulate the gradients of multiple batches
``AdamW``
~~~~~~~~~~~~~~~~
``AdamW`` (PyTorch)
~~~~~~~~~~~~~~~~~~~
.. autoclass:: transformers.AdamW
:members:
``AdamWeightDecay``
~~~~~~~~~~~~~~~~~~~
``AdamWeightDecay`` (TensorFlow)
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
.. autoclass:: transformers.AdamWeightDecay
.. autofunction:: transformers.create_optimizer
Schedules
----------------------------------------------------
~~~~~~~~~~~~~~~~~~~
Learning Rate Schedules (Pytorch)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
Learning Rate Schedules
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
.. autofunction:: transformers.get_constant_schedule
@@ -56,16 +57,16 @@ Learning Rate Schedules
:target: /imgs/warmup_linear_schedule.png
:alt:
``Warmup``
~~~~~~~~~~~~~~~~
``Warmup`` (TensorFlow)
^^^^^^^^^^^^^^^^^^^^^^^
.. autoclass:: transformers.WarmUp
:members:
Gradient Strategies
----------------------------------------------------
~~~~~~~~~~~~~~~~~~~~
``GradientAccumulator``
~~~~~~~~~~~~~~~~~~~~~~~
``GradientAccumulator`` (TensorFlow)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
.. autoclass:: transformers.GradientAccumulator

View File

@@ -0,0 +1,45 @@
Trainer
----------
The :class:`~transformers.Trainer` and :class:`~transformers.TFTrainer` classes provide an API for feature-complete
training in most standard use cases. It's used in most of the :doc:`example scripts <../examples>`.
Before instantiating your :class:`~transformers.Trainer`/:class:`~transformers.TFTrainer`, create a
:class:`~transformers.TrainingArguments`/:class:`~transformers.TFTrainingArguments` to access all the points of
customization during training.
The API supports distributed training on multiple GPUs/TPUs, mixed precision through `NVIDIA Apex
<https://github.com/NVIDIA/apex>`__ for PyTorch and :obj:`tf.keras.mixed_precision` for TensorFlow.
``Trainer``
~~~~~~~~~~~
.. autoclass:: transformers.Trainer
:members:
``TFTrainer``
~~~~~~~~~~~~~
.. autoclass:: transformers.TFTrainer
:members:
``TrainingArguments``
~~~~~~~~~~~~~~~~~~~~~
.. autoclass:: transformers.TrainingArguments
:members:
``TFTrainingArguments``
~~~~~~~~~~~~~~~~~~~~~~~
.. autoclass:: transformers.TFTrainingArguments
:members:
Utilities
~~~~~~~~~
.. autoclass:: transformers.EvalPrediction
.. autofunction:: transformers.set_seed
.. autofunction:: transformers.torch_distributed_zero_first

View File

@@ -112,3 +112,17 @@ ReformerModelWithLMHead
.. autoclass:: transformers.ReformerModelWithLMHead
:members:
ReformerForMaskedLM
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
.. autoclass:: transformers.ReformerForMaskedLM
:members:
ReformerForQuestionAnswering
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
.. autoclass:: transformers.ReformerForQuestionAnswering
:members:

View File

@@ -171,8 +171,11 @@ Add a model card
^^^^^^^^^^^^^^^^
To make sure everyone knows what your model can do, what its limitations and potential bias or ethetical
considerations, please add a README.md model card to the 🤗 Transformers repo under `model_cards/`. It should be named
`README.md` and follow `this template <https://github.com/huggingface/model_card>`__.
considerations, please add a README.md model card to the 🤗 Transformers repo under `model_cards/`. It should then be
placed in a subfolder with your username or organization, then another subfolder named like your model
(`awesome-name-you-picked`). Or just click on the "Create a model card on GitHub" button on the model page, it will
get you directly to the right location. If you need one, `here <https://github.com/huggingface/model_card>`__ is a
model card template (meta-suggestions are welcome).
If your model is fine-tuned from another model coming from the model hub (all 🤗 Transformers pretrained models do),
don't forget to link to its model card so that people can fully trace how your model was built.
@@ -180,6 +183,11 @@ don't forget to link to its model card so that people can fully trace how your m
If you have never made a pull request to the 🤗 Transformers repo, look at the
:doc:`contributing guide <contributing>` to see the steps to follow.
.. Note::
You can also send your model card in the folder you uploaded with the CLI by placing it in a `README.md` file
inside `path/to/awesome-name-you-picked/`.
Using your model
^^^^^^^^^^^^^^^^

View File

@@ -146,8 +146,9 @@ Using the tokenizer
We mentioned the tokenizer is responsible for the preprocessing of your texts. First, it will split a given text in
words (or part of words, punctuation symbols, etc.) usually called `tokens`. There are multiple rules that can govern
that process, which is why we need to instantiate the tokenizer using the name of the model, to make sure we use the
same rules as when the model was pretrained.
that process (you can learn more about them in the :doc:`tokenizer_summary <tokenizer_summary>`, which is why we need
to instantiate the tokenizer using the name of the model, to make sure we use the same rules as when the model was
pretrained.
The second step is to convert those `tokens` into numbers, to be able to build a tensor out of them and feed them to
the model. To do this, the tokenizer has a `vocab`, which is the part we download when we instantiate it with the

View File

@@ -0,0 +1,243 @@
Tokenizer summary
-----------------
In this page, we will have a closer look at tokenization. As we saw in
:doc:`the preprocessing tutorial <preprocessing>`, tokenizing a text is splitting it into words or subwords, which then
are converted to ids. The second part is pretty straightforward, here we will focus on the first part. More
specifically, we will look at the three main different kinds of tokenizers used in 🤗 Transformers:
:ref:`Byte-Pair Encoding (BPE) <byte-pair-encoding>`, :ref:`WordPiece <wordpiece>` and
:ref:`SentencePiece <sentencepiece>`, and provide examples of models using each of those.
Note that on each model page, you can look at the documentation of the associated tokenizer to know which of those
algorithms the pretrained model used. For instance, if we look at :class:`~transformers.BertTokenizer`, we can see it's
using :ref:`WordPiece <wordpiece>`.
Introduction to tokenization
~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Splitting a text in smaller chunks is a task that's harder than it looks, and there are multiple ways of doing it. For
instance, let's look at the sentence "Don't you love 🤗 Transformers? We sure do." A first simple way of tokenizing
this text is just to split it by spaces, which would give:
::
["Don't", "you", "love", "🤗", "Transformers?", "We", "sure", "do."]
This is a nice first step, but if we look at the tokens "Transformers?" or "do.", we can see we can do better. Those
will be different than the tokens "Transformers" and "do" for our model, so we should probably take the punctuation
into account. This would give:
::
["Don", "'", "t", "you", "love", "🤗", "Transformers", "?", "We", "sure", "do", "."]
which is better already. One thing that is annoying though is how it dealt with "Don't". "Don't" stands for do not, so
it should probably be better tokenized as ``["Do", "n't"]``. This is where things start getting more complicated, and
part of the reason each kind of model has its own tokenizer class. Depending on the rules we apply to split our texts
into tokens, we'll get different tokenized versions of the same text. And of course, a given pretrained model won't
perform properly if you don't use the exact same rules as the persons who pretrained it.
`spaCy <https://spacy.io/>`__ and `Moses <http://www.statmt.org/moses/?n=Development.GetStarted>`__ are two popular
rule-based tokenizers. On the text above, they'd output something like:
::
["Do", "n't", "you", "love", "🤗", "Transformers", "?", "We", "sure", "do", "."]
Space/punctuation-tokenization and rule-based tokenization are both examples of word tokenization, which is splitting a
sentence into words. While it's the most intuitive way to separate texts in smaller chunks, it can have a problem when
you have a huge corpus: it usually yields a very big vocabulary (the set of all unique tokens used).
:doc:`Transformer XL <model_doc/transformerxl>` for instance uses space/punctuation-tokenization, and has a vocabulary
size of 267,735!
A huge vocabulary size means a huge embedding matrix at the start of the model, which will cause memory problems.
TransformerXL deals with it by using a special kind of embeddings called adaptive embeddings, but in general,
transformers model rarely have a vocabulary size greater than 50,000, especially if they are trained on a single
language.
So if tokenizing on words is unsatisfactory, we could go on the opposite direction and simply tokenize on characters.
While it's very simple and would save a lot of memory, this doesn't allow the model to learn representations of texts
as meaningful as when using a word tokenization, leading to a loss of performance. So to get the best of both worlds,
all transformers models use a hybrid between word-level and character-level tokenization called subword tokenization.
Subword tokenization
^^^^^^^^^^^^^^^^^^^^
Subword tokenization algorithms rely on the principle that most common words should be left as is, but rare words
should be decomposed in meaningful subword units. For instance "annoyingly" might be considered a rare word and
decomposed as "annoying" and "ly". This is especially useful in agglutinative languages such as Turkish, where you can
form (almost) arbitrarily long complex words by stringing together some subwords.
This allows the model to keep a reasonable vocabulary while still learning useful representations for common words or
subwords. This also gives the ability to the model to process words it has never seen before, by decomposing them into
subwords it knows. For instance, the base :class:`~transformers.BertTokenizer` will tokenize "I have a new GPU!" like
this:
::
>>> from transformers import BertTokenizer
>>> tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
>>> tokenizer.tokenize("I have a new GPU!")
['i', 'have', 'a', 'new', 'gp', '##u', '!']
Since we are considering the uncased model, the sentence was lowercased first. Then all the words were present in the
vocabulary of the tokenizer, except for "gpu", so the tokenizer split it in subwords it knows: "gp" and "##u". The "##"
means that the rest of the token should be attached to the previous one, without space (for when we need to decode
predictions and reverse the tokenization).
Another example is when we use the base :class:`~transformers.XLNetTokenizer` to tokenize our previous text:
::
>>> from transformers import XLNetTokenizer
>>> tokenizer = XLNetTokenizer.from_pretrained('xlnet-base-cased')
>>> tokenizer.tokenize("Don't you love 🤗 Transformers? We sure do.")
['▁Don', "'", 't', '▁you', '▁love', '▁', '🤗', '▁', 'Transform', 'ers', '?', '▁We', '▁sure', '▁do', '.']
We'll get back to the meaning of those '▁' when we look at :ref:`SentencePiece <sentencepiece>` but you can see
Transformers has been split into "Transform" and "ers".
Let's now look at how the different subword tokenization algorithms work. Note that they all rely on some form of
training which is usually done on the corpus the corresponding model will be trained on.
.. _byte-pair-encoding:
Byte-Pair Encoding
~~~~~~~~~~~~~~~~~~
Byte-Pair Encoding was introduced in `this paper <https://arxiv.org/abs/1508.07909>`__. It relies on a pretokenizer
splitting the training data into words, which can be a simple space tokenization
(:doc:`GPT-2 <model_doc/gpt2>` and :doc:`Roberta <model_doc/roberta>` uses this for instance) or a rule-based tokenizer
(:doc:`XLM <model_doc/xlm>` use Moses for most languages, as does :doc:`FlauBERT <model_doc/flaubert>`),
:doc:`GPT <model_doc/gpt>` uses Spacy and ftfy) and, counts the frequency of each word in the training corpus.
It then begins from the list of all characters, and will learn merge rules to form a new token from two symbols in the
vocabulary until it has learned a vocabulary of the desired size (this is a hyperparameter to pick).
Let's say that after the pre-tokenization we have the following words (the number indicating the frequency of each
word):
::
('hug', 10), ('pug', 5), ('pun', 12), ('bun', 4), ('hugs', 5)
Then the base vocabulary is ['b', 'g', 'h', 'n', 'p', 's', 'u'] and all our words are first split by character:
::
('h' 'u' 'g', 10), ('p' 'u' 'g', 5), ('p' 'u' 'n', 12), ('b' 'u' 'n', 4), ('h' 'u' 'g' 's', 5)
We then take each pair of symbols and look at the most frequent. For instance 'hu' is present `10 + 5 = 15` times (10
times in the 10 occurrences of 'hug', 5 times in the 5 occurrences of 'hugs'). The most frequent here is 'ug', present
`10 + 5 + 2 + 5 = 22` times in total. So the first merge rule the tokenizer learns is to group all 'u' and 'g' together
then it adds 'ug' to the vocabulary. Our corpus then becomes
::
('h' 'ug', 10), ('p' 'ug', 5), ('p' 'u' 'n', 12), ('b' 'u' 'n', 4), ('h' 'ug' 's', 5)
and we continue by looking at the next most common pair of symbols. It's 'un', present 16 times, so we merge those two
and add 'un' to the vocabulary. Then it's 'hug' (as 'h' + 'ug'), present 15 times, so we merge those two and add 'hug'
to the vocabulary.
At this stage, the vocabulary is ``['b', 'g', 'h', 'n', 'p', 's', 'u', 'ug', 'un', 'hug']`` and our corpus is
represented as
::
('hug', 10), ('p' 'ug', 5), ('p' 'un', 12), ('b' 'un', 4), ('hug' 's', 5)
If we stop there, the tokenizer can apply the rules it learned to new words (as long as they don't contain characters that
were not in the base vocabulary). For instance 'bug' would be tokenized as ``['b', 'ug']`` but mug would be tokenized as
``['<unk>', 'ug']`` since the 'm' is not in the base vocabulary. This doesn't happen to letters in general (since the
base corpus uses all of them), but to special characters like emojis.
As we said before, the vocabulary size (which is the base vocabulary size + the number of merges) is a hyperparameter
to choose. For instance :doc:`GPT <model_doc/gpt>` has a vocabulary size of 40,478 since they have 478 base characters
and chose to stop the training of the tokenizer at 40,000 merges.
Byte-level BPE
^^^^^^^^^^^^^^
To deal with the fact the base vocabulary needs to get all base characters, which can be quite big if one allows for
all unicode characters, the
`GPT-2 paper <https://cdn.openai.com/better-language-models/language_models_are_unsupervised_multitask_learners.pdf>`__
introduces a clever trick, which is to use bytes as the base vocabulary (which gives a size of 256). With some
additional rules to deal with punctuation, this manages to be able to tokenize every text without needing an unknown
token. For instance, the :doc:`GPT-2 model <model_doc/gpt>` has a vocabulary size of 50,257, which corresponds to the
256 bytes base tokens, a special end-of-text token and the symbols learned with 50,000 merges.
.. _wordpiece:
WordPiece
=========
WordPiece is the subword tokenization algorithm used for :doc:`BERT <model_doc/bert>` (as well as
:doc:`DistilBERT <model_doc/distilbert>` and :doc:`Electra <model_doc/electra>`) and was outlined in
`this paper <https://static.googleusercontent.com/media/research.google.com/ja//pubs/archive/37842.pdf>`__. It relies
on the same base as BPE, which is to initialize the vocabulary to every character present in the corpus and
progressively learn a given number of merge rules, the difference is that it doesn't choose the pair that is the most
frequent but the one that will maximize the likelihood on the corpus once merged.
What does this mean? Well, in the previous example, it means we would only merge 'u' and 'g' if the probability of
having 'ug' divided by the probability of having 'u' then 'g' is greater than for any other pair of symbols. It's
subtly different from what BPE does in the sense that it evaluates what it "loses" by merging two symbols and makes
sure it's `worth it`.
.. _unigram:
Unigram
=======
Unigram is a subword tokenization algorithm introduced in `this paper <https://arxiv.org/pdf/1804.10959.pdf>`__.
Instead of starting with a group of base symbols and learning merges with some rule, like BPE or WordPiece, it starts
from a large vocabulary (for instance, all pretokenized words and the most common substrings) that it will trim down
progressively. It's not used directly for any of the pretrained models in the library, but it's used in conjunction
with :ref:`SentencePiece <sentencepiece>`.
More specifically, at a given step, unigram computes a loss from the corpus we have and the current vocabulary, then,
for each subword, evaluate how much the loss would augment if the subword was removed from the vocabulary. It then
sorts the subwords by this quantity (that represents how worse the loss becomes if the token is removed) and removes
all the worst p tokens (for instance p could be 10% or 20%). It then repeats the process until the vocabulary has
reached the desired size, always keeping the base characters (to be able to tokenize any word written with them, like
BPE or WordPiece).
Contrary to BPE and WordPiece that work out rules in a certain order that you can then apply in the same order when
tokenizing new text, Unigram will have several ways of tokenizing a new text. For instance, if it ends up with the
vocabulary
::
['b', 'g', 'h', 'n', 'p', 's', 'u', 'ug', 'un', 'hug']
we had before, it could tokenize "hugs" as ``['hug', 's']``, ``['h', 'ug', 's']`` or ``['h', 'u', 'g', 's']``. So which
one choose? On top of saving the vocabulary, the trained tokenizer will save the probability of each token in the
training corpus. You can then give a probability to each tokenization (which is the product of the probabilities of the
tokens forming it) and pick the most likely one (or if you want to apply some data augmentation, you could sample one
of the tokenization according to their probabilities).
Those probabilities are what are used to define the loss that trains the tokenizer: if our corpus consists of the
words :math:`x_{1}, \dots, x_{N}` and if for the word :math:`x_{i}` we note :math:`S(x_{i})` the set of all possible
tokenizations of :math:`x_{i}` (with the current vocabulary), then the loss is defined as
.. math::
\mathcal{L} = -\sum_{i=1}^{N} \log \left ( \sum_{x \in S(x_{i})} p(x) \right )
.. _sentencepiece:
SentencePiece
=============
All the methods we have been looking at so far required some from of pretrokenization, which has a central problem: not
all languages use spaces to separate words. This is a problem :doc:`XLM <model_doc/xlm>` solves by using specific
pretokenizers for each of those languages (in this case, Chinese, Japanese and Thai). To solve this problem,
SentencePiece (introduced in `this paper <https://arxiv.org/pdf/1808.06226.pdf>`__) treats the input as a raw stream,
includes the space in the set of characters to use, then uses BPE or unigram to construct the appropriate vocabulary.
That's why in the example we saw before using :class:`~transformers.XLNetTokenizer` (which uses SentencePiece), we had
some '▁' characters, that represent spaces. Decoding a tokenized text is then super easy: we just have to concatenate
all of them together and replace those '▁' by spaces.
All transformers models in the library that use SentencePiece use it with unigram. Examples of models using it are
:doc:`ALBERT <model_doc/albert>`, :doc:`XLNet <model_doc/xlnet>` or the :doc:`Marian framework <model_doc/marian>`.

View File

@@ -39,7 +39,7 @@ of the specified model are used to initialize the model. The
library also includes a number of task-specific final layers or 'heads' whose
weights are instantiated randomly when not present in the specified
pre-trained model. For example, instantiating a model with
``BertForSequenceClassification.from_pretrained('bert-base-uncased', num_classes=2)``
``BertForSequenceClassification.from_pretrained('bert-base-uncased', num_labels=2)``
will create a BERT model instance with encoder weights copied from the
``bert-base-uncased`` model and a randomly initialized sequence
classification head on top of the encoder with an output size of 2. Models
@@ -272,7 +272,7 @@ optimize.
:func:`~transformers.Trainer` uses a built-in default function to collate
batches and prepare them to be fed into the model. If needed, you can also
use the ``data_collator`` argument to pass your own collator function which
takes in the data in the format provides by your dataset and returns a
takes in the data in the format provided by your dataset and returns a
batch ready to be fed into the model. Note that
:func:`~transformers.TFTrainer` expects the passed datasets to be dataset
objects from ``tensorflow_datasets``.

View File

@@ -1,4 +1,4 @@
## Examples
# Examples
Version 2.9 of 🤗 Transformers introduces a new [`Trainer`](https://github.com/huggingface/transformers/blob/master/src/transformers/trainer.py) class for PyTorch, and its equivalent [`TFTrainer`](https://github.com/huggingface/transformers/blob/master/src/transformers/trainer_tf.py) for TF 2.
Running the examples requires PyTorch 1.3.1+ or TensorFlow 2.1+.
@@ -13,7 +13,7 @@ Here is the list of all our examples:
This is still a work-in-progress in particular documentation is still sparse so please **contribute improvements/pull requests.**
# The Big Table of Tasks
## The Big Table of Tasks
| Task | Example datasets | Trainer support | TFTrainer support | pytorch-lightning | Colab
|---|---|:---:|:---:|:---:|:---:|

View File

@@ -108,7 +108,10 @@ def main():
level=logging.INFO,
)
logger.warning(
"device: %s, n_gpu: %s, 16-bits training: %s", training_args.device, training_args.n_gpu, training_args.fp16,
"device: %s, n_replicas: %s, 16-bits training: %s",
training_args.device,
training_args.n_replicas,
training_args.fp16,
)
logger.info("Training/evaluation parameters %s", training_args)

View File

@@ -137,9 +137,9 @@ def main():
level=logging.INFO,
)
logger.info(
"n_gpu: %s, distributed training: %s, 16-bits training: %s",
training_args.n_gpu,
bool(training_args.n_gpu > 1),
"n_replicas: %s, distributed training: %s, 16-bits training: %s",
training_args.n_replicas,
bool(training_args.n_replicas > 1),
training_args.fp16,
)
logger.info("Training/evaluation parameters %s", training_args)

View File

@@ -1,3 +1,5 @@
## Sequence to Sequence
This directory contains examples for finetuning and evaluating transformers on summarization and translation tasks.
Summarization support is more mature than translation support.
Please tag @sshleifer with any issues/unexpected behaviors, or send a PR!
@@ -69,7 +71,7 @@ Load it with `BartForConditionalGeneration.from_pretrained(f'{output_dir}/best_t
- If you want to run experiments on improving the summarization finetuning process, try the XSUM Shared Task (below). It's faster to train than CNNDM because the summaries are shorter.
- For CNN/DailyMail, the default `val_max_target_length` and `test_max_target_length` will truncate the ground truth labels, resulting in slightly higher rouge scores. To get accurate rouge scores, you should rerun calculate_rouge on the `{output_dir}/test_generations.txt` file saved by `trainer.test()`
- `--max_target_length=60 --val_max_target_length=60 --test_max_target_length=100 ` is a reasonable setting for XSUM.
- `wandb` can be used by specifying `--logger wandb_shared` or `--logger wandb`. It is useful for reproducibility.
- `wandb` can be used by specifying `--logger wandb`. It is useful for reproducibility. Specify the environment variable `WANDB_PROJECT='hf_xsum'` to do the XSUM shared task.
- This warning can be safely ignored:
> "Some weights of BartForConditionalGeneration were not initialized from the model checkpoint at facebook/bart-large-xsum and are newly initialized: ['final_logits_bias']"
- Both finetuning and eval are 30% faster with `--fp16`. For that you need to [install apex](https://github.com/NVIDIA/apex#quick-start).
@@ -109,14 +111,14 @@ Compare XSUM results with others by using `--logger wandb_shared`. This requires
Here is an example command, but you can do whatever you want. Hopefully this will make debugging and collaboration easier!
```bash
./finetune.sh \
WANDB_PROJECT='hf_xsum' ./finetune.sh \
--data_dir $XSUM_DIR \
--output_dir xsum_frozen_embs \
--model_name_or_path facebook/bart-large \
--logger wandb_shared \
--train_batch_size 16 --eval_batch_size 16 --freeze_embeds --freeze_encoder \
--num_train_epochs 6 \
--max_target_length=60 --val_max_target_length=60 --test_max_target_length=100
--max_target_length=60 --val_max_target_length=60 --test_max_target_length=100 \
--logger wandb
```
You can see your wandb logs [here](https://app.wandb.ai/sshleifer/hf_xsum?workspace=user-)
@@ -168,6 +170,7 @@ python run_eval.py sshleifer/distilbart-cnn-12-6 $DATA_DIR/val.source dbart_val_
### DistilBART
![DBART](https://huggingface.co/front/thumbnails/distilbart_large.png)
For the CNN/DailyMail dataset, (relatively longer, more extractive summaries), we found a simple technique that works:
you just copy alternating layers from `bart-large-cnn` and finetune more on the same data.

View File

@@ -30,7 +30,7 @@ Batch = namedtuple("Batch", ["document_names", "batch_size", "src", "segs", "mas
def evaluate(args):
tokenizer = BertTokenizer.from_pretrained("bert-base-uncased", do_lower_case=True)
model = BertAbs.from_pretrained("bertabs-finetuned-cnndm")
model = BertAbs.from_pretrained("remi/bertabs-finetuned-extractive-abstractive-summarization")
model.to(args.device)
model.eval()

View File

@@ -298,8 +298,6 @@ def main(args, model=None) -> SummarizationModule:
model: SummarizationModule = SummarizationModule(args)
else:
model: SummarizationModule = TranslationModule(args)
dataset = Path(args.data_dir).name
if (
args.logger == "default"
or args.fast_dev_run
@@ -310,12 +308,12 @@ def main(args, model=None) -> SummarizationModule:
elif args.logger == "wandb":
from pytorch_lightning.loggers import WandbLogger
logger = WandbLogger(name=model.output_dir.name, project=dataset)
logger = WandbLogger(name=model.output_dir.name)
elif args.logger == "wandb_shared":
from pytorch_lightning.loggers import WandbLogger
logger = WandbLogger(name=model.output_dir.name, project=f"hf_{dataset}")
logger = WandbLogger(name=model.output_dir.name)
trainer: pl.Trainer = generic_train(
model,
args,

View File

@@ -12,7 +12,7 @@ export OUTPUT_DIR=${CURRENT_DIR}/${OUTPUT_DIR_NAME}
# Make output directory if it doesn't exist
mkdir -p $OUTPUT_DIR
# Add parent directory to python path to access lightning_base.py and utils.py
# Add parent directory to python path to access lightning_base.py and testing_utils.py
export PYTHONPATH="../":"${PYTHONPATH}"
python finetune.py \
--data_dir=cnn_tiny/ \

View File

@@ -12,6 +12,7 @@ import torch
from torch.utils.data import DataLoader
from transformers import AutoTokenizer
from transformers.testing_utils import require_multigpu
from .distillation import distill_main, evaluate_checkpoint
from .finetune import main
@@ -107,7 +108,7 @@ class TestSummarizationDistiller(unittest.TestCase):
logging.disable(logging.CRITICAL) # remove noisy download output from tracebacks
return cls
@unittest.skipUnless(torch.cuda.device_count() > 1, "skipping multiGPU test")
@require_multigpu
def test_multigpu(self):
updates = dict(no_teacher=True, freeze_encoder=True, gpus=2, sortish_sampler=False,)
self._test_distiller_cli(updates)

View File

@@ -131,9 +131,9 @@ def main():
level=logging.INFO,
)
logger.info(
"n_gpu: %s, distributed training: %s, 16-bits training: %s",
training_args.n_gpu,
bool(training_args.n_gpu > 1),
"n_replicas: %s, distributed training: %s, 16-bits training: %s",
training_args.n_replicas,
bool(training_args.n_replicas > 1),
training_args.fp16,
)
logger.info("Training/evaluation parameters %s", training_args)

View File

@@ -214,8 +214,14 @@ def main():
if requires_preprocessing:
prepare_input = PREPROCESSING_FUNCTIONS.get(args.model_type)
preprocessed_prompt_text = prepare_input(args, model, tokenizer, prompt_text)
if model.__class__.__name__ in ["TransfoXLLMHeadModel"]:
tokenizer_kwargs = {"add_space_before_punct_symbol": True}
else:
tokenizer_kwargs = {}
encoded_prompt = tokenizer.encode(
preprocessed_prompt_text, add_special_tokens=False, return_tensors="pt", add_space_before_punct_symbol=True
preprocessed_prompt_text, add_special_tokens=False, return_tensors="pt", **tokenizer_kwargs
)
else:
encoded_prompt = tokenizer.encode(prompt_text, add_special_tokens=False, return_tensors="pt")

View File

@@ -75,7 +75,8 @@ class DataTrainingArguments:
metadata={"help": "The input data dir. Should contain the .txt files for a CoNLL-2003-formatted task."}
)
labels: Optional[str] = field(
metadata={"help": "Path to a file containing all labels. If not specified, CoNLL-2003 labels are used."}
default=None,
metadata={"help": "Path to a file containing all labels. If not specified, CoNLL-2003 labels are used."},
)
max_seq_length: int = field(
default=128,

View File

@@ -109,9 +109,9 @@ def main():
level=logging.INFO,
)
logger.info(
"n_gpu: %s, distributed training: %s, 16-bits training: %s",
training_args.n_gpu,
bool(training_args.n_gpu > 1),
"n_replicas: %s, distributed training: %s, 16-bits training: %s",
training_args.n_replicas,
bool(training_args.n_replicas > 1),
training_args.fp16,
)
logger.info("Training/evaluation parameters %s", training_args)

View File

@@ -0,0 +1,74 @@
---
language: setswana
---
# TswanaBert
Pretrained model on the Tswana language using a masked language modeling (MLM) objective.
## Model Description.
TswanaBERT is a transformer model pre-trained on a corpus of Setswana in a self-supervised fashion by masking part of the input words and training to predict the masks by using byte-level tokens.
## Intended uses & limitations
The model can be used for either masked language modeling or next word prediction. It can also be fine-tuned on a specific down-stream NLP application.
#### How to use
```python
>>> from transformers import pipeline
>>> from transformers import AutoTokenizer, AutoModelWithLMHead
>>> tokenizer = AutoTokenizer.from_pretrained("MoseliMotsoehli/TswanaBert")
>>> model = AutoModelWithLMHead.from_pretrained("MoseliMotsoehli/TswanaBert")
>>> unmasker = pipeline('fill-mask', model=model, tokenizer=tokenizer)
>>> unmasker("Ntshopotse <mask> e godile.")
[{'score': 0.32749542593955994,
'sequence': '<s>Ntshopotse setse e godile.</s>',
'token': 538,
'token_str': 'Ġsetse'},
{'score': 0.060260992497205734,
'sequence': '<s>Ntshopotse le e godile.</s>',
'token': 270,
'token_str': 'Ġle'},
{'score': 0.058460816740989685,
'sequence': '<s>Ntshopotse bone e godile.</s>',
'token': 364,
'token_str': 'Ġbone'},
{'score': 0.05694682151079178,
'sequence': '<s>Ntshopotse ga e godile.</s>',
'token': 298,
'token_str': 'Ġga'},
{'score': 0.0565204992890358,
'sequence': '<s>Ntshopotse, e godile.</s>',
'token': 16,
'token_str': ','}]
```
#### Limitations and bias
The model is trained on a relatively small collection of setwana, mostly from news articles and creative writtings, and so is not representative enough of the language as yet.
## Training data
1. The largest portion of this dataset (10k) sentences of text, comes from the [Leipzig Corpora Collection](https://wortschatz.uni-leipzig.de/en/download)
2. I Then added SABC news headlines collected by Marivate Vukosi, & Sefara Tshephisho, (2020) that is generously made available on [zenoodo](http://doi.org/10.5281/zenodo.3668495 ). This added 185 tswana sentences to my corpus.
3. I went on to add 300 more sentences by scrapping following news sites and blogs that mosty originate in Botswana. I actively continue to expand the dataset.
* http://setswana.blogspot.com/
* https://omniglot.com/writing/tswana.php
* http://www.dailynews.gov.bw/
* http://www.mmegi.bw/index.php
* https://tsena.co.bw
* http://www.botswana.co.za/Cultural_Issues-travel/botswana-country-guide-en-route.html
* https://www.poemhunter.com/poem/2013-setswana/
https://www.poemhunter.com/poem/ngwana-wa-mosetsana/
### BibTeX entry and citation info
```bibtex
@inproceedings{author = {Moseli Motsoehli},
year={2020}
}
```

View File

@@ -12,13 +12,13 @@ datasets:
## Model description
This GPT-2 (774M) model is capable of generating abstracts given paper titles. It was trained using all research papers under aritficial intelligence (AI), machine learning (LG), computation and language (CL), and computer vision and pattern recognition (CV) on arXiv.
This GPT-2 (774M) model is capable of generating abstracts given paper titles. It was trained using all research paper titles and abstracts under artificial intelligence (AI), machine learning (LG), computation and language (CL), and computer vision and pattern recognition (CV) on arXiv.
## Intended uses & limitations
#### How to use
To generate paper abstracts, use the provided `generate.py`. This file is very similar to HuggingFace's `run_generation.py` [here](https://github.com/huggingface/transformers/tree/master/examples/text-generation). You can simply replace the text with with your own model path (line 89) and change the input string to your paper title (line 127).
To generate paper abstracts, use the provided `generate.py` [here](https://gist.github.com/chrisliu298/ccb8144888eace069da64ad3e6472d64). This is very similar to the HuggingFace's `run_generation.py` [here](https://github.com/huggingface/transformers/tree/master/examples/text-generation). You can simply replace the text with with your own model path (line 89) and change the input string to your paper title (line 127). If you want to use your own script, make sure to prepend `<|startoftext|> ` at the front and append ` <|sep|>` at the end of the paper title.
## Training data
I selected a subset of the [arXiv Archive](https://github.com/staeiou/arxiv_archive) dataset (Geiger, 2019) as the training and evaluation data to fine-tune GPT-2. The original arXiv Archive dataset contains a full archive of metadata about papers on arxiv.org, from the start of the site in 1993 to the end of 2019. Our subset includes all the paper titles (query) and abstracts (context) under the Artificial Intelligence (cs.AI), Machine Learning (cs.LG), Computation and Language (cs.CL), and Computer Vision and Pattern Recognition (cs.CV) categories. I provide the information of the sub-dataset and the distribution of the training and evaluation dataset as follows.

View File

@@ -13,8 +13,9 @@ Pretrained model on English language using a causal language modeling (CLM) obje
[this paper](https://d4mucfpksywv.cloudfront.net/better-language-models/language_models_are_unsupervised_multitask_learners.pdf)
and first released at [this page](https://openai.com/blog/better-language-models/).
Disclaimer: The team releasing GPT-2 did not write a model card for this model so this model card has been written by
the Hugging Face team.
Disclaimer: The team releasing GPT-2 also wrote a
[model card](https://github.com/openai/gpt-2/blob/master/model_card.md) for their model. Content from this model card
has been written by the Hugging Face team to complete the information they provided and give specific examples of bias.
## Model description
@@ -79,7 +80,19 @@ output = model(encoded_input)
### Limitations and bias
The training data used for this model has not been released as a dataset one can browse. We know it contains a lot of
unfiltered from the internet, which is far from neutral. Therefore, the model can have biased predictions:
unfiltered content from the internet, which is far from neutral. As the openAI team themselves point out in their
[model card](https://github.com/openai/gpt-2/blob/master/model_card.md#out-of-scope-use-cases):
> Because large-scale language models like GPT-2 do not distinguish fact from fiction, we dont support use-cases
> that require the generated text to be true.
>
> Additionally, language models like GPT-2 reflect the biases inherent to the systems they were trained on, so we do
> not recommend that they be deployed into systems that interact with humans > unless the deployers first carry out a
> study of biases relevant to the intended use-case. We found no statistically significant difference in gender, race,
> and religious bias probes between 774M and 1.5B, implying all versions of GPT-2 should be approached with similar
> levels of caution around use cases that are sensitive to biases around human attributes.
Here's an example of how the model can have biased predictions:
```python
>>> from transformers import pipeline, set_seed
@@ -110,7 +123,8 @@ This bias will also affect all fine-tuned versions of this model.
The OpenAI team wanted to train this model on a corpus as large as possible. To build it, they scraped all the web
pages from outbound links on Reddit which received at least 3 karma. Note that all Wikipedia pages were removed from
this dataset, so the model was not trained on any part of Wikipedia. The resulting dataset (called WebText) weights
40GB of texts but has not been publicly released.
40GB of texts but has not been publicly released. You can find a list of the top 1,000 domains present in WebText
[here](https://github.com/openai/gpt-2/blob/master/domains.txt).
## Training procedure

View File

@@ -0,0 +1,85 @@
---
language: english
---
# Electra base ⚡ + SQuAD v1 ❓
[Electra-base-discriminator](https://huggingface.co/google/electra-base-discriminator) fine-tuned on [SQUAD v1.1 dataset](https://rajpurkar.github.io/SQuAD-explorer/explore/1.1/dev/) for **Q&A** downstream task.
## Details of the downstream task (Q&A) - Model 🧠
**ELECTRA** is a new method for self-supervised language representation learning. It can be used to pre-train transformer networks using relatively little compute. ELECTRA models are trained to distinguish "real" input tokens vs "fake" input tokens generated by another neural network, similar to the discriminator of a [GAN](https://arxiv.org/pdf/1406.2661.pdf). At small scale, ELECTRA achieves strong results even when trained on a single GPU. At large scale, ELECTRA achieves state-of-the-art results on the [SQuAD 2.0](https://rajpurkar.github.io/SQuAD-explorer/) dataset.
## Details of the downstream task (Q&A) - Dataset 📚
**S**tanford **Q**uestion **A**nswering **D**ataset (SQuAD) is a reading comprehension dataset, consisting of questions posed by crowdworkers on a set of Wikipedia articles, where the answer to every question is a segment of text, or span, from the corresponding reading passage, or the question might be unanswerable.
SQuAD v1.1 contains **100,000+** question-answer pairs on **500+** articles.
## Model training 🏋️‍
The model was trained on a Tesla P100 GPU and 25GB of RAM with the following command:
```bash
python transformers/examples/question-answering/run_squad.py \
--model_type electra \
--model_name_or_path 'google/electra-base-discriminator' \
--do_eval \
--do_train \
--do_lower_case \
--train_file '/content/dataset/train-v1.1.json' \
--predict_file '/content/dataset/dev-v1.1.json' \
--per_gpu_train_batch_size 16 \
--learning_rate 3e-5 \
--num_train_epochs 10 \
--max_seq_length 384 \
--doc_stride 128 \
--output_dir '/content/output' \
--overwrite_output_dir \
--save_steps 1000
```
## Test set Results 🧾
| Metric | # Value |
| ------ | --------- |
| **EM** | **83.03** |
| **F1** | **90.77** |
| **Size**| **+ 400 MB** |
Very good metrics for such a "small" model!
```json
{
'exact': 83.03689687795648,
'f1': 90.77486052446231,
'total': 10570,
'HasAns_exact': 83.03689687795648,
'HasAns_f1': 90.77486052446231,
'HasAns_total': 10570,
'best_exact': 83.03689687795648,
'best_exact_thresh': 0.0,
'best_f1': 90.77486052446231,
'best_f1_thresh': 0.0
}
```
### Model in action 🚀
Fast usage with **pipelines**:
```python
from transformers import pipeline
QnA_pipeline = pipeline('question-answering', model='mrm8488/electra-base-finetuned-squadv1')
QnA_pipeline({
'context': 'A new strain of flu that has the potential to become a pandemic has been identified in China by scientists.',
'question': 'What has been discovered by scientists from China ?'
})
# Output:
{'answer': 'A new strain of flu', 'end': 19, 'score': 0.9995211430099182, 'start': 0}
```
> Created by [Manuel Romero/@mrm8488](https://twitter.com/mrm8488) | [LinkedIn](https://www.linkedin.com/in/manuel-romero-cs/)
> Made with <span style="color: #e25555;">&hearts;</span> in Spain

View File

@@ -0,0 +1,86 @@
---
language: english
---
# Electra small ⚡ + SQuAD v2 ❓
[Electra-small-discriminator](https://huggingface.co/google/electra-small-discriminator) fine-tuned on [SQUAD v2.0 dataset](https://rajpurkar.github.io/SQuAD-explorer/explore/v2.0/dev/) for **Q&A** downstream task.
## Details of the downstream task (Q&A) - Model 🧠
**ELECTRA** is a new method for self-supervised language representation learning. It can be used to pre-train transformer networks using relatively little compute. ELECTRA models are trained to distinguish "real" input tokens vs "fake" input tokens generated by another neural network, similar to the discriminator of a [GAN](https://arxiv.org/pdf/1406.2661.pdf). At small scale, ELECTRA achieves strong results even when trained on a single GPU. At large scale, ELECTRA achieves state-of-the-art results on the [SQuAD 2.0](https://rajpurkar.github.io/SQuAD-explorer/) dataset.
## Details of the downstream task (Q&A) - Dataset 📚
**SQuAD2.0** combines the 100,000 questions in SQuAD1.1 with over 50,000 unanswerable questions written adversarially by crowdworkers to look similar to answerable ones. To do well on SQuAD2.0, systems must not only answer questions when possible, but also determine when no answer is supported by the paragraph and abstain from answering.
## Model training 🏋️‍
The model was trained on a Tesla P100 GPU and 25GB of RAM with the following command:
```bash
python transformers/examples/question-answering/run_squad.py \
--model_type electra \
--model_name_or_path 'google/electra-small-discriminator' \
--do_eval \
--do_train \
--do_lower_case \
--train_file '/content/dataset/train-v2.0.json' \
--predict_file '/content/dataset/dev-v2.0.json' \
--per_gpu_train_batch_size 16 \
--learning_rate 3e-5 \
--num_train_epochs 10 \
--max_seq_length 384 \
--doc_stride 128 \
--output_dir '/content/output' \
--overwrite_output_dir \
--save_steps 1000 \
--version_2_with_negative
```
## Test set Results 🧾
| Metric | # Value |
| ------ | --------- |
| **EM** | **69.71** |
| **F1** | **73.44** |
| **Size**| **50 MB** |
```json
{
'exact': 69.71279373368147,
'f1': 73.4439546123672,
'total': 11873,
'HasAns_exact': 69.92240215924427,
'HasAns_f1': 77.39542393937836,
'HasAns_total': 5928,
'NoAns_exact': 69.50378469301934,
'NoAns_f1': 69.50378469301934,
'NoAns_total': 5945,
'best_exact': 69.71279373368147,
'best_exact_thresh': 0.0,
'best_f1': 73.44395461236732,
'best_f1_thresh': 0.0
}
```
### Model in action 🚀
Fast usage with **pipelines**:
```python
from transformers import pipeline
QnA_pipeline = pipeline('question-answering', model='mrm8488/electra-base-finetuned-squadv2')
QnA_pipeline({
'context': 'A new strain of flu that has the potential to become a pandemic has been identified in China by scientists.',
'question': 'What has been discovered by scientists from China ?'
})
# Output:
{'answer': 'A new strain of flu', 'end': 19, 'score': 0.8650811568752914, 'start': 0}
```
> Created by [Manuel Romero/@mrm8488](https://twitter.com/mrm8488) | [LinkedIn](https://www.linkedin.com/in/manuel-romero-cs/)
> Made with <span style="color: #e25555;">&hearts;</span> in Spain

View File

@@ -0,0 +1,102 @@
---
language: spanish
thumbnail: https://imgur.com/uxAvBfh
---
# Electricidad small + Spanish SQuAD v1 ⚡❓
[Electricidad-small-discriminator](https://huggingface.co/mrm8488/electricidad-small-discriminator) fine-tuned on [Spanish SQUAD v1.1 dataset](https://github.com/ccasimiro88/TranslateAlignRetrieve/tree/master/SQuAD-es-v1.1) for **Q&A** downstream task.
## Details of the downstream task (Q&A) - Dataset 📚
[SQuAD-es-v1.1](https://github.com/ccasimiro88/TranslateAlignRetrieve/tree/master/SQuAD-es-v1.1)
| Dataset split | # Samples |
| ------------- | --------- |
| Train | 130 K |
| Test | 11 K |
## Model training 🏋️‍
The model was trained on a Tesla P100 GPU and 25GB of RAM with the following command:
```bash
python /content/transformers/examples/question-answering/run_squad.py \
--model_type electra \
--model_name_or_path 'mrm8488/electricidad-small-discriminator' \
--do_eval \
--do_train \
--do_lower_case \
--train_file '/content/dataset/train-v1.1-es.json' \
--predict_file '/content/dataset/dev-v1.1-es.json' \
--per_gpu_train_batch_size 16 \
--learning_rate 3e-5 \
--num_train_epochs 10 \
--max_seq_length 384 \
--doc_stride 128 \
--output_dir '/content/electricidad-small-finetuned-squadv1-es' \
--overwrite_output_dir \
--save_steps 1000
```
## Test set Results 🧾
| Metric | # Value |
| ------ | --------- |
| **EM** | **46.82** |
| **F1** | **64.79** |
```json
{
'exact': 46.82119205298013,
'f1': 64.79435260021918,
'total': 10570,
'HasAns_exact': 46.82119205298013,
HasAns_f1': 64.79435260021918,
'HasAns_total': 10570,
'best_exact': 46.82119205298013,
'best_exact_thresh': 0.0,
'best_f1': 64.79435260021918,
'best_f1_thresh': 0.0
}
```
### Model in action 🚀
Fast usage with **pipelines**:
```python
from transformers import pipeline
qa_pipeline = pipeline(
"question-answering",
model="mrm8488/electricidad-small-finetuned-squadv1-es",
tokenizer="mrm8488/electricidad-small-finetuned-squadv1-es"
)
context = "Manuel ha creado una versión del modelo Electra small en español que alcanza una puntuación F1 de 65 en el dataset SQUAD-es y sólo pesa 50 MB"
q1 = "Cuál es su marcador F1?"
q2 = "¿Cuál es el tamaño del modelo?"
q3 = "¿Quién lo ha creado?"
q4 = "¿Que es lo que ha hecho Manuel?"
questions = [q1, q2, q3, q4]
for question in questions:
result = qa_pipeline({
'context': context,
'question': question})
print(result)
# Output:
{'score': 0.14836778166355025, 'start': 98, 'end': 100, 'answer': '65'}
{'score': 0.32219420810758237, 'start': 136, 'end': 140, 'answer': '50 MB'}
{'score': 0.9672326951118713, 'start': 0, 'end': 6, 'answer': 'Manuel'}
{'score': 0.23552458113848118, 'start': 10, 'end': 53, 'answer': 'creado una versión del modelo Electra small'}
```
> Created by [Manuel Romero/@mrm8488](https://twitter.com/mrm8488) | [LinkedIn](https://www.linkedin.com/in/manuel-romero-cs/)
> Made with <span style="color: #e25555;">&hearts;</span> in Spain

View File

@@ -1,4 +1,4 @@
--
---
language: english
---
@@ -13,7 +13,7 @@ The **T5** model was presented in [Exploring the Limits of Transfer Learning wit
Transfer learning, where a model is first pre-trained on a data-rich task before being fine-tuned on a downstream task, has emerged as a powerful technique in natural language processing (NLP). The effectiveness of transfer learning has given rise to a diversity of approaches, methodology, and practice. In this paper, we explore the landscape of transfer learning techniques for NLP by introducing a unified framework that converts every language problem into a text-to-text format. Our systematic study compares pre-training objectives, architectures, unlabeled datasets, transfer approaches, and other factors on dozens of language understanding tasks. By combining the insights from our exploration with scale and our new “Colossal Clean Crawled Corpus”, we achieve state-of-the-art results on many benchmarks covering summarization, question answering, text classification, and more. To facilitate future work on transfer learning for NLP, we release our dataset, pre-trained models, and code.
![model image](https://camo.githubusercontent.com/623b4dea0b653f2ad3f36c71ebfe749a677ac0a1/68747470733a2f2f6d69726f2e6d656469756d2e636f6d2f6d61782f343030362f312a44304a31674e51663876727255704b657944387750412e706e67)
![model image](https://i.imgur.com/jVFMMWR.png)
## Details of the downstream task (Sentiment Recognition) - Dataset 📚

View File

@@ -1,4 +1,4 @@
--
---
language: english
---
@@ -11,8 +11,7 @@ The **T5** model was presented in [Exploring the Limits of Transfer Learning wit
Transfer learning, where a model is first pre-trained on a data-rich task before being fine-tuned on a downstream task, has emerged as a powerful technique in natural language processing (NLP). The effectiveness of transfer learning has given rise to a diversity of approaches, methodology, and practice. In this paper, we explore the landscape of transfer learning techniques for NLP by introducing a unified framework that converts every language problem into a text-to-text format. Our systematic study compares pre-training objectives, architectures, unlabeled datasets, transfer approaches, and other factors on dozens of language understanding tasks. By combining the insights from our exploration with scale and our new “Colossal Clean Crawled Corpus”, we achieve state-of-the-art results on many benchmarks covering summarization, question answering, text classification, and more. To facilitate future work on transfer learning for NLP, we release our dataset, pre-trained models, and code.
![model image](https://camo.githubusercontent.com/623b4dea0b653f2ad3f36c71ebfe749a677ac0a1/68747470733a2f2f6d69726f2e6d656469756d2e636f6d2f6d61782f343030362f312a44304a31674e51663876727255704b657944387750412e706e67)
![model image](https://i.imgur.com/jVFMMWR.png)
## Details of the downstream task (Sequence Classification as Text generation) - Dataset 📚
[ Twitter Sarcasm Dataset](https://github.com/EducationalTestingService/sarcasm)
@@ -105,7 +104,7 @@ conversation = twit1 + me
eval_conversation(conversation) #Output: 'derison'
# We will get 'normal' when not sarcasm detected and 'derison' when detected
# We will get 'normal' when sarcasm is not detected and 'derison' when detected
```
> Created by [Manuel Romero/@mrm8488](https://twitter.com/mrm8488) | [LinkedIn](https://www.linkedin.com/in/manuel-romero-cs/)

View File

@@ -14,14 +14,17 @@ The **T5** model was presented in [Exploring the Limits of Transfer Learning wit
Transfer learning, where a model is first pre-trained on a data-rich task before being fine-tuned on a downstream task, has emerged as a powerful technique in natural language processing (NLP). The effectiveness of transfer learning has given rise to a diversity of approaches, methodology, and practice. In this paper, we explore the landscape of transfer learning techniques for NLP by introducing a unified framework that converts every language problem into a text-to-text format. Our systematic study compares pre-training objectives, architectures, unlabeled datasets, transfer approaches, and other factors on dozens of language understanding tasks. By combining the insights from our exploration with scale and our new “Colossal Clean Crawled Corpus”, we achieve state-of-the-art results on many benchmarks covering summarization, question answering, text classification, and more. To facilitate future work on transfer learning for NLP, we release our dataset, pre-trained models, and code.
![model image](https://i.imgur.com/jVFMMWR.png)
## Details of the downstream task (Q&A) - Dataset 📚 🧐 ❓
Dataset ID: ```squad_v2``` from [HugginFace/NLP](https://github.com/huggingface/nlp)
| Dataset | Split | # samples |
| -------- | ----- | --------- |
| squad_v2 | train | 130319 |
| squad_v2 | valid | 11873 |
| squad_v2 | train | 130319 |
| squad_v2 | valid | 11873 |
How to load it from [nlp](https://github.com/huggingface/nlp)

View File

@@ -15,7 +15,7 @@ The **T5** model was presented in [Exploring the Limits of Transfer Learning wit
Transfer learning, where a model is first pre-trained on a data-rich task before being fine-tuned on a downstream task, has emerged as a powerful technique in natural language processing (NLP). The effectiveness of transfer learning has given rise to a diversity of approaches, methodology, and practice. In this paper, we explore the landscape of transfer learning techniques for NLP by introducing a unified framework that converts every language problem into a text-to-text format. Our systematic study compares pre-training objectives, architectures, unlabeled datasets, transfer approaches, and other factors on dozens of language understanding tasks. By combining the insights from our exploration with scale and our new “Colossal Clean Crawled Corpus”, we achieve state-of-the-art results on many benchmarks covering summarization, question answering, text classification, and more. To facilitate future work on transfer learning for NLP, we release our dataset, pre-trained models, and code.
![model image](https://camo.githubusercontent.com/623b4dea0b653f2ad3f36c71ebfe749a677ac0a1/68747470733a2f2f6d69726f2e6d656469756d2e636f6d2f6d61782f343030362f312a44304a31674e51663876727255704b657944387750412e706e67)
![model image](https://i.imgur.com/jVFMMWR.png)
## Details of the downstream task (Summarization) - Dataset 📚

View File

@@ -0,0 +1,22 @@
---
license: mit
widget:
- text: "I like you. </s></s> I love you."
---
## roberta-large-mnli
Trained by Facebook, [original source](https://github.com/pytorch/fairseq/tree/master/examples/roberta)
```bibtex
@article{liu2019roberta,
title = {RoBERTa: A Robustly Optimized BERT Pretraining Approach},
author = {Yinhan Liu and Myle Ott and Naman Goyal and Jingfei Du and
Mandar Joshi and Danqi Chen and Omer Levy and Mike Lewis and
Luke Zettlemoyer and Veselin Stoyanov},
journal={arXiv preprint arXiv:1907.11692},
year = {2019},
}
```

View File

@@ -0,0 +1,11 @@
---
language: english
license: apache-2.0
---
## ELECTRA-small-cased
This is a cased version of `google/electra-small-discriminator`, trained on the
[OpenWebText corpus](https://skylion007.github.io/OpenWebTextCorpus/).
Uses the same tokenizer and vocab from `bert-base-cased`

View File

@@ -39,3 +39,4 @@ Pull Request so it can be included under the Community notebooks.
|[Fine-tune BERT for Multi-label Classification](https://github.com/abhimishra91/transformers-tutorials/blob/master/transformers_multi_label_classification.ipynb)|How to fine-tune BERT for multi-label classification using PyTorch|[Abhishek Kumar Mishra](https://github.com/abhimishra91) |[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/abhimishra91/transformers-tutorials/blob/master/transformers_multi_label_classification.ipynb)|
|[Fine-tune T5 for Summarization](https://github.com/abhimishra91/transformers-tutorials/blob/master/transformers_summarization_wandb.ipynb)|How to fine-tune T5 for summarization in PyTorch and track experiments with WandB|[Abhishek Kumar Mishra](https://github.com/abhimishra91) |[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/abhimishra91/transformers-tutorials/blob/master/transformers_summarization_wandb.ipynb)|
|[Speed up Fine-Tuning in Transformers with Dynamic Padding / Bucketing](https://github.com/ELS-RD/transformers-notebook/blob/master/Divide_Hugging_Face_Transformers_training_time_by_2_or_more.ipynb)|How to speed up fine-tuning by a factor of 2 using dynamic padding / bucketing|[Michael Benesty](https://github.com/pommedeterresautee) |[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/drive/1CBfRU1zbfu7-ijiOqAAQUA-RJaxfcJoO?usp=sharing)|
|[Pretrain Reformer for Masked Language Modeling](https://github.com/patrickvonplaten/notebooks/blob/master/Reformer_For_Masked_LM.ipynb)| How to train a Reformer model with bi-directional self-attention layers | [Patrick von Platen](https://github.com/patrickvonplaten) | [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/drive/1tzzh0i8PgDQGV3SMFUGxM7_gGae3K-uW?usp=sharing)|

View File

@@ -73,11 +73,15 @@ extras["tf"] = [
"tensorflow",
"onnxconverter-common",
"keras2onnx"
# "onnxconverter-common @ git+git://github.com/microsoft/onnxconverter-common.git@f64ca15989b6dc95a1f3507ff6e4c395ba12dff5#egg=onnxconverter-common",
# "keras2onnx @ git+git://github.com/onnx/keras-onnx.git@cbdc75cb950b16db7f0a67be96a278f8d2953b48#egg=keras2onnx"
]
extras["tf-cpu"] = [
"tensorflow-cpu",
"onnxconverter-common",
"keras2onnx"
# "onnxconverter-common @ git+git://github.com/microsoft/onnxconverter-common.git@f64ca15989b6dc95a1f3507ff6e4c395ba12dff5#egg=onnxconverter-common",
# "keras2onnx @ git+git://github.com/onnx/keras-onnx.git@cbdc75cb950b16db7f0a67be96a278f8d2953b48#egg=keras2onnx"
]
extras["torch"] = ["torch"]
@@ -90,13 +94,14 @@ extras["docs"] = ["recommonmark", "sphinx", "sphinx-markdown-tables", "sphinx-rt
extras["quality"] = [
"black",
"isort",
# "isort @ git+git://github.com/timothycrosley/isort.git@e63ae06ec7d70b06df9e528357650281a3d3ec22#egg=isort",
"flake8",
]
extras["dev"] = extras["testing"] + extras["quality"] + ["mecab-python3<1", "scikit-learn", "tensorflow", "torch"]
setup(
name="transformers",
version="3.0.0",
version="3.0.2",
author="Thomas Wolf, Lysandre Debut, Victor Sanh, Julien Chaumond, Sam Shleifer, Patrick von Platen, Google AI Language Team Authors, Open AI team Authors, Facebook AI Authors, Carnegie Mellon University Authors",
author_email="thomas@huggingface.co",
description="State-of-the-art Natural Language Processing for TensorFlow 2.0 and PyTorch",
@@ -109,7 +114,7 @@ setup(
packages=find_packages("src"),
install_requires=[
"numpy",
"tokenizers == 0.8.0-rc4",
"tokenizers == 0.8.1.rc1",
# dataclasses for Python versions that don't have it
"dataclasses;python_version<'3.7'",
# utilities from PyPA to e.g. compare versions
@@ -123,7 +128,7 @@ setup(
# for OpenAI GPT
"regex != 2019.12.17",
# for XLNet
"sentencepiece",
"sentencepiece != 0.1.92",
# for XLM
"sacremoses",
],

View File

@@ -2,7 +2,7 @@
# There's no way to ignore "F401 '...' imported but unused" warnings in this
# module, but to preserve other warnings. So, don't check this module at all.
__version__ = "3.0.0"
__version__ = "3.0.2"
# Work around to update TensorFlow's absl.logging threshold which alters the
# default Python logging output behavior when present.
@@ -155,7 +155,7 @@ from .tokenization_xlm_roberta import XLMRobertaTokenizer
from .tokenization_xlnet import SPIECE_UNDERLINE, XLNetTokenizer
# Trainer
from .trainer_utils import EvalPrediction
from .trainer_utils import EvalPrediction, set_seed
from .training_args import TrainingArguments
from .training_args_tf import TFTrainingArguments
@@ -169,7 +169,8 @@ if is_sklearn_available():
# Modeling
if is_torch_available():
from .modeling_utils import PreTrainedModel, prune_layer, Conv1D, top_k_top_p_filtering, apply_chunking_to_forward
from .generation_utils import top_k_top_p_filtering
from .modeling_utils import PreTrainedModel, prune_layer, Conv1D, apply_chunking_to_forward
from .modeling_auto import (
AutoModel,
AutoModelForPreTraining,
@@ -365,7 +366,9 @@ if is_torch_available():
ReformerAttention,
ReformerLayer,
ReformerModel,
ReformerForMaskedLM,
ReformerModelWithLMHead,
ReformerForQuestionAnswering,
REFORMER_PRETRAINED_MODEL_ARCHIVE_LIST,
)
@@ -396,7 +399,7 @@ if is_torch_available():
)
# Trainer
from .trainer import Trainer, set_seed, torch_distributed_zero_first, EvalPrediction
from .trainer import Trainer, torch_distributed_zero_first
from .data.data_collator import default_data_collator, DataCollator, DataCollatorForLanguageModeling
from .data.datasets import GlueDataset, TextDataset, LineByLineTextDataset, GlueDataTrainingArguments
@@ -406,9 +409,9 @@ if is_torch_available():
# TensorFlow
if is_tf_available():
from .generation_tf_utils import tf_top_k_top_p_filtering
from .modeling_tf_utils import (
shape_list,
tf_top_k_top_p_filtering,
TFPreTrainedModel,
TFSequenceSummary,
TFSharedEmbeddings,

View File

@@ -71,10 +71,10 @@ from transformers import (
XLMRobertaConfig,
XLNetConfig,
cached_path,
hf_bucket_url,
is_torch_available,
load_pytorch_checkpoint_in_tf2_model,
)
from transformers.file_utils import hf_bucket_url
if is_torch_available():

View File

@@ -43,7 +43,8 @@ def default_data_collator(features: List[InputDataClass]) -> Dict[str, torch.Ten
# Ensure that tensor is created with the correct type
# (it should be automatically the case, but let's make sure of it.)
if "label" in first and first["label"] is not None:
dtype = torch.long if type(first["label"]) is int else torch.float
label = first["label"].item() if isinstance(first["label"], torch.Tensor) else first["label"]
dtype = torch.long if isinstance(label, int) else torch.float
batch["labels"] = torch.tensor([f["label"] for f in features], dtype=dtype)
elif "label_ids" in first and first["label_ids"] is not None:
if isinstance(first["label_ids"], torch.Tensor):

View File

@@ -17,6 +17,7 @@
import logging
import os
from dataclasses import asdict
from enum import Enum
from typing import List, Optional, Union
@@ -81,26 +82,16 @@ if is_tf_available():
def gen():
for ex in features:
yield (
{
"input_ids": ex.input_ids,
"attention_mask": ex.attention_mask,
"token_type_ids": ex.token_type_ids,
},
ex.label,
)
d = {k: v for k, v in asdict(ex).items() if v is not None}
label = d.pop("label")
yield (d, label)
input_names = ["input_ids"] + tokenizer.model_input_names
return tf.data.Dataset.from_generator(
gen,
({"input_ids": tf.int32, "attention_mask": tf.int32, "token_type_ids": tf.int32}, tf.int64),
(
{
"input_ids": tf.TensorShape([None]),
"attention_mask": tf.TensorShape([None]),
"token_type_ids": tf.TensorShape([None]),
},
tf.TensorShape([]),
),
({k: tf.int32 for k in input_names}, tf.int64),
({k: tf.TensorShape([None]) for k in input_names}, tf.TensorShape([])),
)

View File

@@ -389,57 +389,102 @@ def squad_convert_examples_to_features(
def gen():
for i, ex in enumerate(features):
yield (
{
"input_ids": ex.input_ids,
"attention_mask": ex.attention_mask,
"token_type_ids": ex.token_type_ids,
"feature_index": i,
"qas_id": ex.qas_id,
},
{
"start_positions": ex.start_position,
"end_positions": ex.end_position,
"cls_index": ex.cls_index,
"p_mask": ex.p_mask,
"is_impossible": ex.is_impossible,
},
)
if ex.token_type_ids is None:
yield (
{
"input_ids": ex.input_ids,
"attention_mask": ex.attention_mask,
"feature_index": i,
"qas_id": ex.qas_id,
},
{
"start_positions": ex.start_position,
"end_positions": ex.end_position,
"cls_index": ex.cls_index,
"p_mask": ex.p_mask,
"is_impossible": ex.is_impossible,
},
)
else:
yield (
{
"input_ids": ex.input_ids,
"attention_mask": ex.attention_mask,
"token_type_ids": ex.token_type_ids,
"feature_index": i,
"qas_id": ex.qas_id,
},
{
"start_positions": ex.start_position,
"end_positions": ex.end_position,
"cls_index": ex.cls_index,
"p_mask": ex.p_mask,
"is_impossible": ex.is_impossible,
},
)
# Why have we split the batch into a tuple? PyTorch just has a list of tensors.
train_types = (
{
"input_ids": tf.int32,
"attention_mask": tf.int32,
"token_type_ids": tf.int32,
"feature_index": tf.int64,
"qas_id": tf.string,
},
{
"start_positions": tf.int64,
"end_positions": tf.int64,
"cls_index": tf.int64,
"p_mask": tf.int32,
"is_impossible": tf.int32,
},
)
if "token_type_ids" in tokenizer.model_input_names:
train_types = (
{
"input_ids": tf.int32,
"attention_mask": tf.int32,
"token_type_ids": tf.int32,
"feature_index": tf.int64,
"qas_id": tf.string,
},
{
"start_positions": tf.int64,
"end_positions": tf.int64,
"cls_index": tf.int64,
"p_mask": tf.int32,
"is_impossible": tf.int32,
},
)
train_shapes = (
{
"input_ids": tf.TensorShape([None]),
"attention_mask": tf.TensorShape([None]),
"token_type_ids": tf.TensorShape([None]),
"feature_index": tf.TensorShape([]),
"qas_id": tf.TensorShape([]),
},
{
"start_positions": tf.TensorShape([]),
"end_positions": tf.TensorShape([]),
"cls_index": tf.TensorShape([]),
"p_mask": tf.TensorShape([None]),
"is_impossible": tf.TensorShape([]),
},
)
train_shapes = (
{
"input_ids": tf.TensorShape([None]),
"attention_mask": tf.TensorShape([None]),
"token_type_ids": tf.TensorShape([None]),
"feature_index": tf.TensorShape([]),
"qas_id": tf.TensorShape([]),
},
{
"start_positions": tf.TensorShape([]),
"end_positions": tf.TensorShape([]),
"cls_index": tf.TensorShape([]),
"p_mask": tf.TensorShape([None]),
"is_impossible": tf.TensorShape([]),
},
)
else:
train_types = (
{"input_ids": tf.int32, "attention_mask": tf.int32, "feature_index": tf.int64, "qas_id": tf.string},
{
"start_positions": tf.int64,
"end_positions": tf.int64,
"cls_index": tf.int64,
"p_mask": tf.int32,
"is_impossible": tf.int32,
},
)
train_shapes = (
{
"input_ids": tf.TensorShape([None]),
"attention_mask": tf.TensorShape([None]),
"feature_index": tf.TensorShape([]),
"qas_id": tf.TensorShape([]),
},
{
"start_positions": tf.TensorShape([]),
"end_positions": tf.TensorShape([]),
"cls_index": tf.TensorShape([]),
"p_mask": tf.TensorShape([None]),
"is_impossible": tf.TensorShape([]),
},
)
return tf.data.Dataset.from_generator(gen, train_types, train_shapes)
else:

File diff suppressed because it is too large Load Diff

View File

@@ -0,0 +1,993 @@
# coding=utf-8
# Copyright 2018 The Google AI Language Team Authors, Facebook AI Research authors and The HuggingFace Inc. team.
# Copyright (c) 2018, NVIDIA CORPORATION. All rights reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
import logging
from typing import Iterable, Optional, Tuple
import torch
from torch import Tensor
from torch.nn import functional as F
logger = logging.getLogger(__name__)
class GenerationMixin:
"""
A class contraining all of the functions supporting generation, to be used as a mixin in PreTrainedModel.
"""
def prepare_inputs_for_generation(self, input_ids, **kwargs):
return {"input_ids": input_ids}
def adjust_logits_during_generation(self, logits, **kwargs):
return logits
def _use_cache(self, outputs, use_cache):
"""During generation, decide whether to pass the `past` variable to the next forward pass."""
if len(outputs) <= 1 or use_cache is False:
return False
if hasattr(self.config, "mem_len") and self.config.mem_len == 0:
return False
return True
def enforce_repetition_penalty_(self, lprobs, batch_size, num_beams, prev_output_tokens, repetition_penalty):
"""repetition penalty (from CTRL paper https://arxiv.org/abs/1909.05858). """
for i in range(batch_size * num_beams):
for previous_token in set(prev_output_tokens[i].tolist()):
# if score < 0 then repetition penalty has to multiplied to reduce the previous token probability
if lprobs[i, previous_token] < 0:
lprobs[i, previous_token] *= repetition_penalty
else:
lprobs[i, previous_token] /= repetition_penalty
def postprocess_next_token_scores(
self,
scores,
input_ids,
no_repeat_ngram_size,
bad_words_ids,
cur_len,
min_length,
max_length,
eos_token_id,
repetition_penalty,
batch_size,
num_beams,
):
# repetition penalty (from CTRL paper https://arxiv.org/abs/1909.05858)
if repetition_penalty != 1.0:
self.enforce_repetition_penalty_(
scores, batch_size, num_beams, input_ids, repetition_penalty,
)
# set eos token prob to zero if min_length is not reached
if eos_token_id is not None and cur_len < min_length:
scores[:, eos_token_id] = -float("inf")
if no_repeat_ngram_size > 0:
# calculate a list of banned tokens to prevent repetitively generating the same ngrams
num_batch_hypotheses = batch_size * num_beams
# from fairseq: https://github.com/pytorch/fairseq/blob/a07cb6f40480928c9e0548b737aadd36ee66ac76/fairseq/sequence_generator.py#L345
banned_batch_tokens = calc_banned_ngram_tokens(
input_ids, num_batch_hypotheses, no_repeat_ngram_size, cur_len
)
for i, banned_tokens in enumerate(banned_batch_tokens):
scores[i, banned_tokens] = -float("inf")
if bad_words_ids is not None:
# calculate a list of banned tokens according to bad words
banned_tokens = calc_banned_bad_words_ids(input_ids, bad_words_ids)
for i, banned_tokens in enumerate(banned_tokens):
scores[i, banned_tokens] = -float("inf")
return scores
@torch.no_grad()
def generate(
self,
input_ids: Optional[torch.LongTensor] = None,
max_length: Optional[int] = None,
min_length: Optional[int] = None,
do_sample: Optional[bool] = None,
early_stopping: Optional[bool] = None,
num_beams: Optional[int] = None,
temperature: Optional[float] = None,
top_k: Optional[int] = None,
top_p: Optional[float] = None,
repetition_penalty: Optional[float] = None,
bad_words_ids: Optional[Iterable[int]] = None,
bos_token_id: Optional[int] = None,
pad_token_id: Optional[int] = None,
eos_token_id: Optional[int] = None,
length_penalty: Optional[float] = None,
no_repeat_ngram_size: Optional[int] = None,
num_return_sequences: Optional[int] = None,
attention_mask: Optional[torch.LongTensor] = None,
decoder_start_token_id: Optional[int] = None,
use_cache: Optional[bool] = None,
**model_specific_kwargs
) -> torch.LongTensor:
r""" Generates sequences for models with a LM head. The method currently supports greedy decoding, beam-search decoding, sampling with temperature, sampling with top-k or nucleus sampling.
Adapted in part from `Facebook's XLM beam search code`_.
.. _`Facebook's XLM beam search code`:
https://github.com/facebookresearch/XLM/blob/9e6f6814d17be4fe5b15f2e6c43eb2b2d76daeb4/src/model/transformer.py#L529
Parameters:
input_ids: (`optional`) `torch.LongTensor` of shape `(batch_size, sequence_length)`
The sequence used as a prompt for the generation. If `None` the method initializes
it as an empty `torch.LongTensor` of shape `(1,)`.
max_length: (`optional`) int
The max length of the sequence to be generated. Between `min_length` and infinity. Default to 20.
min_length: (`optional`) int
The min length of the sequence to be generated. Between 0 and infinity. Default to 0.
do_sample: (`optional`) bool
If set to `False` greedy decoding is used. Otherwise sampling is used. Defaults to `False` as defined in `configuration_utils.PretrainedConfig`.
early_stopping: (`optional`) bool
if set to `True` beam search is stopped when at least `num_beams` sentences finished per batch. Defaults to `False` as defined in `configuration_utils.PretrainedConfig`.
num_beams: (`optional`) int
Number of beams for beam search. Must be between 1 and infinity. 1 means no beam search. Default to 1.
temperature: (`optional`) float
The value used to module the next token probabilities. Must be strictly positive. Default to 1.0.
top_k: (`optional`) int
The number of highest probability vocabulary tokens to keep for top-k-filtering. Between 1 and infinity. Default to 50.
top_p: (`optional`) float
The cumulative probability of parameter highest probability vocabulary tokens to keep for nucleus sampling. Must be between 0 and 1. Default to 1.
repetition_penalty: (`optional`) float
The parameter for repetition penalty. Between 1.0 and infinity. 1.0 means no penalty. Default to 1.0.
pad_token_id: (`optional`) int
Padding token. Default to specicic model pad_token_id or None if it does not exist.
bos_token_id: (`optional`) int
BOS token. Defaults to `bos_token_id` as defined in the models config.
eos_token_id: (`optional`) int
EOS token. Defaults to `eos_token_id` as defined in the models config.
length_penalty: (`optional`) float
Exponential penalty to the length. Default to 1.
no_repeat_ngram_size: (`optional`) int
If set to int > 0, all ngrams of size `no_repeat_ngram_size` can only occur once.
bad_words_ids: (`optional`) list of lists of int
`bad_words_ids` contains tokens that are not allowed to be generated. In order to get the tokens of the words that should not appear in the generated text, use `tokenizer.encode(bad_word, add_prefix_space=True)`.
num_return_sequences: (`optional`) int
The number of independently computed returned sequences for each element in the batch. Default to 1.
attention_mask (`optional`) obj: `torch.LongTensor` of same shape as `input_ids`
Mask to avoid performing attention on padding token indices.
Mask values selected in ``[0, 1]``:
``1`` for tokens that are NOT MASKED, ``0`` for MASKED tokens.
Defaults to `None`.
`What are attention masks? <../glossary.html#attention-mask>`__
decoder_start_token_id=None: (`optional`) int
If an encoder-decoder model starts decoding with a different token than BOS.
Defaults to `None` and is changed to `BOS` later.
use_cache: (`optional`) bool
If `use_cache` is True, past key values are used to speed up decoding if applicable to model. Defaults to `True`.
model_specific_kwargs: (`optional`) dict
Additional model specific kwargs will be forwarded to the `forward` function of the model.
Return:
output: `torch.LongTensor` of shape `(batch_size * num_return_sequences, sequence_length)`
sequence_length is either equal to max_length or shorter if all batches finished early due to the `eos_token_id`
Examples::
tokenizer = AutoTokenizer.from_pretrained('distilgpt2') # Initialize tokenizer
model = AutoModelWithLMHead.from_pretrained('distilgpt2') # Download model and configuration from S3 and cache.
outputs = model.generate(max_length=40) # do greedy decoding
print('Generated: {}'.format(tokenizer.decode(outputs[0], skip_special_tokens=True)))
tokenizer = AutoTokenizer.from_pretrained('openai-gpt') # Initialize tokenizer
model = AutoModelWithLMHead.from_pretrained('openai-gpt') # Download model and configuration from S3 and cache.
input_context = 'The dog'
input_ids = tokenizer.encode(input_context, return_tensors='pt') # encode input context
outputs = model.generate(input_ids=input_ids, num_beams=5, num_return_sequences=3, temperature=1.5) # generate 3 independent sequences using beam search decoding (5 beams) with sampling from initial context 'The dog'
for i in range(3): # 3 output sequences were generated
print('Generated {}: {}'.format(i, tokenizer.decode(outputs[i], skip_special_tokens=True)))
tokenizer = AutoTokenizer.from_pretrained('distilgpt2') # Initialize tokenizer
model = AutoModelWithLMHead.from_pretrained('distilgpt2') # Download model and configuration from S3 and cache.
input_context = 'The dog'
input_ids = tokenizer.encode(input_context, return_tensors='pt') # encode input context
outputs = model.generate(input_ids=input_ids, max_length=40, temperature=0.7, num_return_sequences=3) # 3 generate sequences using by sampling
for i in range(3): # 3 output sequences were generated
print('Generated {}: {}'.format(i, tokenizer.decode(outputs[i], skip_special_tokens=True)))
tokenizer = AutoTokenizer.from_pretrained('ctrl') # Initialize tokenizer
model = AutoModelWithLMHead.from_pretrained('ctrl') # Download model and configuration from S3 and cache.
input_context = 'Legal My neighbor is' # "Legal" is one of the control codes for ctrl
input_ids = tokenizer.encode(input_context, return_tensors='pt') # encode input context
outputs = model.generate(input_ids=input_ids, max_length=50, temperature=0.7, repetition_penalty=1.2) # generate sequences
print('Generated: {}'.format(tokenizer.decode(outputs[0], skip_special_tokens=True)))
tokenizer = AutoTokenizer.from_pretrained('gpt2') # Initialize tokenizer
model = AutoModelWithLMHead.from_pretrained('gpt2') # Download model and configuration from S3 and cache.
input_context = 'My cute dog' # "Legal" is one of the control codes for ctrl
bad_words_ids = [tokenizer.encode(bad_word, add_prefix_space=True) for bad_word in ['idiot', 'stupid', 'shut up']]
input_ids = tokenizer.encode(input_context, return_tensors='pt') # encode input context
outputs = model.generate(input_ids=input_ids, max_length=100, do_sample=True, bad_words_ids=bad_words_ids) # generate sequences without allowing bad_words to be generated
"""
# We cannot generate if the model does not have a LM head
if self.get_output_embeddings() is None:
raise AttributeError(
"You tried to generate sequences with a model that does not have a LM Head."
"Please use another model class (e.g. `OpenAIGPTLMHeadModel`, `XLNetLMHeadModel`, `GPT2LMHeadModel`, `CTRLLMHeadModel`, `T5WithLMHeadModel`, `TransfoXLLMHeadModel`, `XLMWithLMHeadModel`, `BartForConditionalGeneration` )"
)
max_length = max_length if max_length is not None else self.config.max_length
min_length = min_length if min_length is not None else self.config.min_length
do_sample = do_sample if do_sample is not None else self.config.do_sample
early_stopping = early_stopping if early_stopping is not None else self.config.early_stopping
use_cache = use_cache if use_cache is not None else self.config.use_cache
num_beams = num_beams if num_beams is not None else self.config.num_beams
temperature = temperature if temperature is not None else self.config.temperature
top_k = top_k if top_k is not None else self.config.top_k
top_p = top_p if top_p is not None else self.config.top_p
repetition_penalty = repetition_penalty if repetition_penalty is not None else self.config.repetition_penalty
bos_token_id = bos_token_id if bos_token_id is not None else self.config.bos_token_id
pad_token_id = pad_token_id if pad_token_id is not None else self.config.pad_token_id
eos_token_id = eos_token_id if eos_token_id is not None else self.config.eos_token_id
length_penalty = length_penalty if length_penalty is not None else self.config.length_penalty
no_repeat_ngram_size = (
no_repeat_ngram_size if no_repeat_ngram_size is not None else self.config.no_repeat_ngram_size
)
bad_words_ids = bad_words_ids if bad_words_ids is not None else self.config.bad_words_ids
num_return_sequences = (
num_return_sequences if num_return_sequences is not None else self.config.num_return_sequences
)
decoder_start_token_id = (
decoder_start_token_id if decoder_start_token_id is not None else self.config.decoder_start_token_id
)
if input_ids is not None:
batch_size = input_ids.shape[0] # overriden by the input batch_size
else:
batch_size = 1
assert isinstance(max_length, int) and max_length > 0, "`max_length` should be a strictly positive integer."
assert isinstance(min_length, int) and min_length >= 0, "`min_length` should be a positive integer."
assert isinstance(do_sample, bool), "`do_sample` should be a boolean."
assert isinstance(early_stopping, bool), "`early_stopping` should be a boolean."
assert isinstance(use_cache, bool), "`use_cache` should be a boolean."
assert isinstance(num_beams, int) and num_beams > 0, "`num_beams` should be a strictly positive integer."
assert temperature > 0, "`temperature` should be strictly positive."
assert isinstance(top_k, int) and top_k >= 0, "`top_k` should be a positive integer."
assert 0 <= top_p <= 1, "`top_p` should be between 0 and 1."
assert repetition_penalty >= 1.0, "`repetition_penalty` should be >= 1."
assert input_ids is not None or (
isinstance(bos_token_id, int) and bos_token_id >= 0
), "If input_ids is not defined, `bos_token_id` should be a positive integer."
assert pad_token_id is None or (
isinstance(pad_token_id, int) and (pad_token_id >= 0)
), "`pad_token_id` should be a positive integer."
assert (eos_token_id is None) or (
isinstance(eos_token_id, int) and (eos_token_id >= 0)
), "`eos_token_id` should be a positive integer."
assert length_penalty > 0, "`length_penalty` should be strictly positive."
assert (
isinstance(no_repeat_ngram_size, int) and no_repeat_ngram_size >= 0
), "`no_repeat_ngram_size` should be a positive integer."
assert (
isinstance(num_return_sequences, int) and num_return_sequences > 0
), "`num_return_sequences` should be a strictly positive integer."
assert (
bad_words_ids is None or isinstance(bad_words_ids, list) and isinstance(bad_words_ids[0], list)
), "`bad_words_ids` is either `None` or a list of lists of tokens that should not be generated"
if input_ids is None:
assert isinstance(bos_token_id, int) and bos_token_id >= 0, (
"you should either supply a context to complete as `input_ids` input "
"or a `bos_token_id` (integer >= 0) as a first token to start the generation."
)
input_ids = torch.full(
(batch_size, 1), bos_token_id, dtype=torch.long, device=next(self.parameters()).device,
)
else:
assert input_ids.dim() == 2, "Input prompt should be of shape (batch_size, sequence length)."
# not allow to duplicate outputs when greedy decoding
if do_sample is False:
if num_beams == 1:
# no_beam_search greedy generation conditions
assert (
num_return_sequences == 1
), "Greedy decoding will always produce the same output for num_beams == 1 and num_return_sequences > 1. Please set num_return_sequences = 1"
else:
# beam_search greedy generation conditions
assert (
num_beams >= num_return_sequences
), "Greedy beam search decoding cannot return more sequences than it has beams. Please set num_beams >= num_return_sequences"
# create attention mask if necessary
# TODO (PVP): this should later be handled by the forward fn() in each model in the future see PR 3140
if (attention_mask is None) and (pad_token_id is not None) and (pad_token_id in input_ids):
attention_mask = input_ids.ne(pad_token_id).long()
elif attention_mask is None:
attention_mask = input_ids.new_ones(input_ids.shape)
# set pad_token_id to eos_token_id if not set. Important that this is done after
# attention_mask is created
if pad_token_id is None and eos_token_id is not None:
logger.warning(
"Setting `pad_token_id` to {} (first `eos_token_id`) to generate sequence".format(eos_token_id)
)
pad_token_id = eos_token_id
# current position and vocab size
if hasattr(self.config, "vocab_size"):
vocab_size = self.config.vocab_size
elif (
self.config.is_encoder_decoder
and hasattr(self.config, "decoder")
and hasattr(self.config.decoder, "vocab_size")
):
vocab_size = self.config.decoder.vocab_size
# set effective batch size and effective batch multiplier according to do_sample
if do_sample:
effective_batch_size = batch_size * num_return_sequences
effective_batch_mult = num_return_sequences
else:
effective_batch_size = batch_size
effective_batch_mult = 1
if self.config.is_encoder_decoder:
if decoder_start_token_id is None:
decoder_start_token_id = bos_token_id
assert (
decoder_start_token_id is not None
), "decoder_start_token_id or bos_token_id has to be defined for encoder-decoder generation"
assert hasattr(self, "get_encoder"), "{} should have a 'get_encoder' function defined".format(self)
assert callable(self.get_encoder), "{} should be a method".format(self.get_encoder)
# get encoder and store encoder outputs
encoder = self.get_encoder()
encoder_outputs: tuple = encoder(input_ids, attention_mask=attention_mask)
# Expand input ids if num_beams > 1 or num_return_sequences > 1
if num_return_sequences > 1 or num_beams > 1:
input_ids_len = input_ids.shape[-1]
input_ids = input_ids.unsqueeze(1).expand(batch_size, effective_batch_mult * num_beams, input_ids_len)
attention_mask = attention_mask.unsqueeze(1).expand(
batch_size, effective_batch_mult * num_beams, input_ids_len
)
input_ids = input_ids.contiguous().view(
effective_batch_size * num_beams, input_ids_len
) # shape: (batch_size * num_return_sequences * num_beams, cur_len)
attention_mask = attention_mask.contiguous().view(
effective_batch_size * num_beams, input_ids_len
) # shape: (batch_size * num_return_sequences * num_beams, cur_len)
if self.config.is_encoder_decoder:
# create empty decoder_input_ids
input_ids = torch.full(
(effective_batch_size * num_beams, 1),
decoder_start_token_id,
dtype=torch.long,
device=next(self.parameters()).device,
)
cur_len = 1
assert (
batch_size == encoder_outputs[0].shape[0]
), f"expected encoder_outputs[0] to have 1st dimension bs={batch_size}, got {encoder_outputs[0].shape[0]} "
# expand batch_idx to assign correct encoder output for expanded input_ids (due to num_beams > 1 and num_return_sequences > 1)
expanded_batch_idxs = (
torch.arange(batch_size)
.view(-1, 1)
.repeat(1, num_beams * effective_batch_mult)
.view(-1)
.to(input_ids.device)
)
# expand encoder_outputs
encoder_outputs = (encoder_outputs[0].index_select(0, expanded_batch_idxs), *encoder_outputs[1:])
else:
encoder_outputs = None
cur_len = input_ids.shape[-1]
assert (
cur_len < max_length
), f"The context has {cur_len} number of tokens, but `max_length` is only {max_length}. Please make sure that `max_length` is bigger than the number of tokens, by setting either `generate(max_length=...,...)` or `config.max_length = ...`"
if num_beams > 1:
output = self._generate_beam_search(
input_ids,
cur_len=cur_len,
max_length=max_length,
min_length=min_length,
do_sample=do_sample,
early_stopping=early_stopping,
temperature=temperature,
top_k=top_k,
top_p=top_p,
repetition_penalty=repetition_penalty,
no_repeat_ngram_size=no_repeat_ngram_size,
bad_words_ids=bad_words_ids,
pad_token_id=pad_token_id,
eos_token_id=eos_token_id,
batch_size=effective_batch_size,
num_return_sequences=num_return_sequences,
length_penalty=length_penalty,
num_beams=num_beams,
vocab_size=vocab_size,
encoder_outputs=encoder_outputs,
attention_mask=attention_mask,
use_cache=use_cache,
model_specific_kwargs=model_specific_kwargs,
)
else:
output = self._generate_no_beam_search(
input_ids,
cur_len=cur_len,
max_length=max_length,
min_length=min_length,
do_sample=do_sample,
temperature=temperature,
top_k=top_k,
top_p=top_p,
repetition_penalty=repetition_penalty,
no_repeat_ngram_size=no_repeat_ngram_size,
bad_words_ids=bad_words_ids,
pad_token_id=pad_token_id,
eos_token_id=eos_token_id,
batch_size=effective_batch_size,
encoder_outputs=encoder_outputs,
attention_mask=attention_mask,
use_cache=use_cache,
model_specific_kwargs=model_specific_kwargs,
)
return output
def _generate_no_beam_search(
self,
input_ids,
cur_len,
max_length,
min_length,
do_sample,
temperature,
top_k,
top_p,
repetition_penalty,
no_repeat_ngram_size,
bad_words_ids,
pad_token_id,
eos_token_id,
batch_size,
encoder_outputs,
attention_mask,
use_cache,
model_specific_kwargs,
):
""" Generate sequences for each example without beam search (num_beams == 1).
All returned sequence are generated independantly.
"""
# length of generated sentences / unfinished sentences
unfinished_sents = input_ids.new(batch_size).fill_(1)
sent_lengths = input_ids.new(batch_size).fill_(max_length)
past = (encoder_outputs, None) if encoder_outputs is not None else None
while cur_len < max_length:
model_inputs = self.prepare_inputs_for_generation(
input_ids, past=past, attention_mask=attention_mask, use_cache=use_cache, **model_specific_kwargs
)
outputs = self(**model_inputs)
next_token_logits = outputs[0][:, -1, :]
scores = self.postprocess_next_token_scores(
scores=next_token_logits,
input_ids=input_ids,
no_repeat_ngram_size=no_repeat_ngram_size,
bad_words_ids=bad_words_ids,
cur_len=cur_len,
min_length=min_length,
max_length=max_length,
eos_token_id=eos_token_id,
repetition_penalty=repetition_penalty,
batch_size=batch_size,
num_beams=1,
)
# if model has past, then set the past variable to speed up decoding
if self._use_cache(outputs, use_cache):
past = outputs[1]
if do_sample:
# Temperature (higher temperature => more likely to sample low probability tokens)
if temperature != 1.0:
scores = scores / temperature
# Top-p/top-k filtering
next_token_logscores = top_k_top_p_filtering(scores, top_k=top_k, top_p=top_p)
# Sample
probs = F.softmax(next_token_logscores, dim=-1)
next_token = torch.multinomial(probs, num_samples=1).squeeze(1)
else:
# Greedy decoding
next_token = torch.argmax(next_token_logits, dim=-1)
# update generations and finished sentences
if eos_token_id is not None:
# pad finished sentences if eos_token_id exist
tokens_to_add = next_token * unfinished_sents + (pad_token_id) * (1 - unfinished_sents)
else:
tokens_to_add = next_token
# add token and increase length by one
input_ids = torch.cat([input_ids, tokens_to_add.unsqueeze(-1)], dim=-1)
cur_len = cur_len + 1
if eos_token_id is not None:
eos_in_sents = tokens_to_add == eos_token_id
# if sentence is unfinished and the token to add is eos, sent_lengths is filled with current length
is_sents_unfinished_and_token_to_add_is_eos = unfinished_sents.mul(eos_in_sents.long()).bool()
sent_lengths.masked_fill_(is_sents_unfinished_and_token_to_add_is_eos, cur_len)
# unfinished_sents is set to zero if eos in sentence
unfinished_sents.mul_((~eos_in_sents).long())
# stop when there is a </s> in each sentence, or if we exceed the maximul length
if unfinished_sents.max() == 0:
break
# extend attention_mask for new generated input if only decoder
if self.config.is_encoder_decoder is False:
attention_mask = torch.cat(
[attention_mask, attention_mask.new_ones((attention_mask.shape[0], 1))], dim=-1
)
return input_ids
def _generate_beam_search(
self,
input_ids,
cur_len,
max_length,
min_length,
do_sample,
early_stopping,
temperature,
top_k,
top_p,
repetition_penalty,
no_repeat_ngram_size,
bad_words_ids,
pad_token_id,
eos_token_id,
batch_size,
num_return_sequences,
length_penalty,
num_beams,
vocab_size,
encoder_outputs,
attention_mask,
use_cache,
model_specific_kwargs,
):
""" Generate sequences for each example with beam search.
"""
# generated hypotheses
generated_hyps = [
BeamHypotheses(num_beams, max_length, length_penalty, early_stopping=early_stopping)
for _ in range(batch_size)
]
# scores for each sentence in the beam
beam_scores = torch.zeros((batch_size, num_beams), dtype=torch.float, device=input_ids.device)
# for greedy decoding it is made sure that only tokens of the first beam are considered to avoid sampling the exact same tokens three times
if do_sample is False:
beam_scores[:, 1:] = -1e9
beam_scores = beam_scores.view(-1) # shape (batch_size * num_beams,)
# cache compute states
past = (encoder_outputs, None) if encoder_outputs is not None else None
# done sentences
done = [False for _ in range(batch_size)]
while cur_len < max_length:
model_inputs = self.prepare_inputs_for_generation(
input_ids, past=past, attention_mask=attention_mask, use_cache=use_cache, **model_specific_kwargs
)
outputs = self(**model_inputs) # (batch_size * num_beams, cur_len, vocab_size)
next_token_logits = outputs[0][:, -1, :] # (batch_size * num_beams, vocab_size)
# if model has past, then set the past variable to speed up decoding
if self._use_cache(outputs, use_cache):
past = outputs[1]
if self.config.is_encoder_decoder and do_sample is False:
# TODO (PVP) still a bit hacky here - there might be a better solution
next_token_logits = self.adjust_logits_during_generation(
next_token_logits, cur_len=cur_len, max_length=max_length
)
scores = F.log_softmax(next_token_logits, dim=-1) # (batch_size * num_beams, vocab_size)
scores = self.postprocess_next_token_scores(
scores=scores,
input_ids=input_ids,
no_repeat_ngram_size=no_repeat_ngram_size,
bad_words_ids=bad_words_ids,
cur_len=cur_len,
min_length=min_length,
max_length=max_length,
eos_token_id=eos_token_id,
repetition_penalty=repetition_penalty,
batch_size=batch_size,
num_beams=num_beams,
)
assert scores.shape == (batch_size * num_beams, vocab_size), "Shapes of scores: {} != {}".format(
scores.shape, (batch_size * num_beams, vocab_size)
)
if do_sample:
_scores = scores + beam_scores[:, None].expand_as(scores) # (batch_size * num_beams, vocab_size)
# Temperature
if temperature != 1.0:
_scores = _scores / temperature
# Top-p/top-k filtering
_scores = top_k_top_p_filtering(
_scores, top_k=top_k, top_p=top_p, min_tokens_to_keep=2
) # (batch_size * num_beams, vocab_size)
# re-organize to group the beam together to sample from all beam_idxs
_scores = _scores.contiguous().view(
batch_size, num_beams * vocab_size
) # (batch_size, num_beams * vocab_size)
# Sample 2 next tokens for each beam (so we have some spare tokens and match output of greedy beam search)
probs = F.softmax(_scores, dim=-1)
next_tokens = torch.multinomial(probs, num_samples=2 * num_beams) # (batch_size, num_beams * 2)
# Compute next scores
next_scores = torch.gather(_scores, -1, next_tokens) # (batch_size, num_beams * 2)
# sort the sampled vector to make sure that the first num_beams samples are the best
next_scores, next_scores_indices = torch.sort(next_scores, descending=True, dim=1)
next_tokens = torch.gather(next_tokens, -1, next_scores_indices) # (batch_size, num_beams * 2)
else:
next_scores = scores + beam_scores[:, None].expand_as(scores) # (batch_size * num_beams, vocab_size)
# re-organize to group the beam together (we are keeping top hypothesis accross beams)
next_scores = next_scores.view(
batch_size, num_beams * vocab_size
) # (batch_size, num_beams * vocab_size)
next_scores, next_tokens = torch.topk(next_scores, 2 * num_beams, dim=1, largest=True, sorted=True)
assert next_scores.size() == next_tokens.size() == (batch_size, 2 * num_beams)
# next batch beam content
next_batch_beam = []
# for each sentence
for batch_idx in range(batch_size):
# if we are done with this sentence, add a pad token
if done[batch_idx]:
assert (
len(generated_hyps[batch_idx]) >= num_beams
), "Batch can only be done if at least {} beams have been generated".format(num_beams)
assert (
eos_token_id is not None and pad_token_id is not None
), "generated beams >= num_beams -> eos_token_id and pad_token have to be defined"
next_batch_beam.extend([(0, pad_token_id, 0)] * num_beams) # pad the batch
continue
# next sentence beam content, this will get added to next_batch_beam
next_sent_beam = []
# next tokens for this sentence
for beam_token_rank, (beam_token_id, beam_token_score) in enumerate(
zip(next_tokens[batch_idx], next_scores[batch_idx])
):
# get beam and token IDs
beam_id = beam_token_id // vocab_size
token_id = beam_token_id % vocab_size
effective_beam_id = batch_idx * num_beams + beam_id
# add to generated hypotheses if end of sentence
if (eos_token_id is not None) and (token_id.item() == eos_token_id):
# if beam_token does not belong to top num_beams tokens, it should not be added
is_beam_token_worse_than_top_num_beams = beam_token_rank >= num_beams
if is_beam_token_worse_than_top_num_beams:
continue
generated_hyps[batch_idx].add(
input_ids[effective_beam_id].clone(), beam_token_score.item(),
)
else:
# add next predicted token since it is not eos_token
next_sent_beam.append((beam_token_score, token_id, effective_beam_id))
# once the beam for next step is full, don't add more tokens to it.
if len(next_sent_beam) == num_beams:
break
# Check if we are done so that we can save a pad step if all(done)
done[batch_idx] = done[batch_idx] or generated_hyps[batch_idx].is_done(
next_scores[batch_idx].max().item(), cur_len
)
# update next beam content
assert len(next_sent_beam) == num_beams, "Beam should always be full"
next_batch_beam.extend(next_sent_beam)
assert len(next_batch_beam) == num_beams * (batch_idx + 1), "We should have added num_beams each step"
# stop when we are done with each sentence
if all(done):
break
# sanity check / prepare next batch
assert len(next_batch_beam) == batch_size * num_beams
beam_scores = beam_scores.new([x[0] for x in next_batch_beam])
beam_tokens = input_ids.new([x[1] for x in next_batch_beam])
beam_idx = input_ids.new([x[2] for x in next_batch_beam])
# re-order batch and update current length
input_ids = input_ids[beam_idx, :]
input_ids = torch.cat([input_ids, beam_tokens.unsqueeze(1)], dim=-1)
cur_len = cur_len + 1
# re-order internal states
if past is not None:
past = self._reorder_cache(past, beam_idx)
# extend attention_mask for new generated input if only decoder
if self.config.is_encoder_decoder is False:
attention_mask = torch.cat(
[attention_mask, attention_mask.new_ones((attention_mask.shape[0], 1))], dim=-1
)
# finalize all open beam hypotheses and add to generated hypotheses
for batch_idx in range(batch_size):
if done[batch_idx]:
continue
# test that beam scores match previously calculated scores if not eos and batch_idx not done
if eos_token_id is not None and all(
(token_id % vocab_size).item() != eos_token_id for token_id in next_tokens[batch_idx]
):
assert torch.all(
next_scores[batch_idx, :num_beams] == beam_scores.view(batch_size, num_beams)[batch_idx]
), "If batch_idx is not done, final next scores: {} have to equal to accumulated beam_scores: {}".format(
next_scores[:, :num_beams][batch_idx], beam_scores.view(batch_size, num_beams)[batch_idx],
)
# need to add best num_beams hypotheses to generated hyps
for beam_id in range(num_beams):
effective_beam_id = batch_idx * num_beams + beam_id
final_score = beam_scores[effective_beam_id].item()
final_tokens = input_ids[effective_beam_id]
generated_hyps[batch_idx].add(final_tokens, final_score)
# depending on whether greedy generation is wanted or not define different output_batch_size and output_num_return_sequences_per_batch
output_batch_size = batch_size if do_sample else batch_size * num_return_sequences
output_num_return_sequences_per_batch = 1 if do_sample else num_return_sequences
# select the best hypotheses
sent_lengths = input_ids.new(output_batch_size)
best = []
# retrieve best hypotheses
for i, hypotheses in enumerate(generated_hyps):
sorted_hyps = sorted(hypotheses.beams, key=lambda x: x[0])
for j in range(output_num_return_sequences_per_batch):
effective_batch_idx = output_num_return_sequences_per_batch * i + j
best_hyp = sorted_hyps.pop()[1]
sent_lengths[effective_batch_idx] = len(best_hyp)
best.append(best_hyp)
# shorter batches are padded
if sent_lengths.min().item() != sent_lengths.max().item():
assert pad_token_id is not None, "`Pad_token_id` has to be defined"
sent_max_len = min(sent_lengths.max().item() + 1, max_length)
decoded = input_ids.new(output_batch_size, sent_max_len).fill_(pad_token_id)
# fill with hypothesis and eos_token_id if necessary
for i, hypo in enumerate(best):
decoded[i, : sent_lengths[i]] = hypo
if sent_lengths[i] < max_length:
decoded[i, sent_lengths[i]] = eos_token_id
else:
# none of the hypotheses have an eos_token
assert (len(hypo) == max_length for hypo in best)
decoded = torch.stack(best).type(torch.long).to(next(self.parameters()).device)
return decoded
@staticmethod
def _reorder_cache(past: Tuple, beam_idx: Tensor) -> Tuple[Tensor]:
return tuple(layer_past.index_select(1, beam_idx) for layer_past in past)
def calc_banned_ngram_tokens(prev_input_ids: Tensor, num_hypos: int, no_repeat_ngram_size: int, cur_len: int) -> None:
"""Copied from fairseq for no_repeat_ngram in beam_search"""
if cur_len + 1 < no_repeat_ngram_size:
# return no banned tokens if we haven't generated no_repeat_ngram_size tokens yet
return [[] for _ in range(num_hypos)]
generated_ngrams = [{} for _ in range(num_hypos)]
for idx in range(num_hypos):
gen_tokens = prev_input_ids[idx].tolist()
generated_ngram = generated_ngrams[idx]
for ngram in zip(*[gen_tokens[i:] for i in range(no_repeat_ngram_size)]):
prev_ngram_tuple = tuple(ngram[:-1])
generated_ngram[prev_ngram_tuple] = generated_ngram.get(prev_ngram_tuple, []) + [ngram[-1]]
def _get_generated_ngrams(hypo_idx):
# Before decoding the next token, prevent decoding of ngrams that have already appeared
start_idx = cur_len + 1 - no_repeat_ngram_size
ngram_idx = tuple(prev_input_ids[hypo_idx, start_idx:cur_len].tolist())
return generated_ngrams[hypo_idx].get(ngram_idx, [])
banned_tokens = [_get_generated_ngrams(hypo_idx) for hypo_idx in range(num_hypos)]
return banned_tokens
def calc_banned_bad_words_ids(prev_input_ids: Iterable[int], bad_words_ids: Iterable[int]) -> Iterable[int]:
banned_tokens = []
def _tokens_match(prev_tokens, tokens):
if len(tokens) == 0:
# if bad word tokens is just one token always ban it
return True
if len(tokens) > len(prev_input_ids):
# if bad word tokens are longer then prev input_ids they can't be equal
return False
if prev_tokens[-len(tokens) :] == tokens:
# if tokens match
return True
else:
return False
for prev_input_ids_slice in prev_input_ids:
banned_tokens_slice = []
for banned_token_seq in bad_words_ids:
assert len(banned_token_seq) > 0, "Banned words token sequences {} cannot have an empty list".format(
bad_words_ids
)
if _tokens_match(prev_input_ids_slice.tolist(), banned_token_seq[:-1]) is False:
# if tokens do not match continue
continue
banned_tokens_slice.append(banned_token_seq[-1])
banned_tokens.append(banned_tokens_slice)
return banned_tokens
def top_k_top_p_filtering(
logits: Tensor,
top_k: int = 0,
top_p: float = 1.0,
filter_value: float = -float("Inf"),
min_tokens_to_keep: int = 1,
) -> Tensor:
""" Filter a distribution of logits using top-k and/or nucleus (top-p) filtering
Args:
logits: logits distribution shape (batch size, vocabulary size)
if top_k > 0: keep only top k tokens with highest probability (top-k filtering).
if top_p < 1.0: keep the top tokens with cumulative probability >= top_p (nucleus filtering).
Nucleus filtering is described in Holtzman et al. (http://arxiv.org/abs/1904.09751)
Make sure we keep at least min_tokens_to_keep per batch example in the output
From: https://gist.github.com/thomwolf/1a5a29f6962089e871b94cbd09daf317
"""
if top_k > 0:
top_k = min(max(top_k, min_tokens_to_keep), logits.size(-1)) # Safety check
# Remove all tokens with a probability less than the last token of the top-k
indices_to_remove = logits < torch.topk(logits, top_k)[0][..., -1, None]
logits[indices_to_remove] = filter_value
if top_p < 1.0:
sorted_logits, sorted_indices = torch.sort(logits, descending=True)
cumulative_probs = torch.cumsum(F.softmax(sorted_logits, dim=-1), dim=-1)
# Remove tokens with cumulative probability above the threshold (token with 0 are kept)
sorted_indices_to_remove = cumulative_probs > top_p
if min_tokens_to_keep > 1:
# Keep at least min_tokens_to_keep (set to min_tokens_to_keep-1 because we add the first one below)
sorted_indices_to_remove[..., :min_tokens_to_keep] = 0
# Shift the indices to the right to keep also the first token above the threshold
sorted_indices_to_remove[..., 1:] = sorted_indices_to_remove[..., :-1].clone()
sorted_indices_to_remove[..., 0] = 0
# scatter sorted tensors to original indexing
indices_to_remove = sorted_indices_to_remove.scatter(1, sorted_indices, sorted_indices_to_remove)
logits[indices_to_remove] = filter_value
return logits
class BeamHypotheses(object):
def __init__(self, num_beams, max_length, length_penalty, early_stopping):
"""
Initialize n-best list of hypotheses.
"""
self.max_length = max_length - 1 # ignoring bos_token
self.length_penalty = length_penalty
self.early_stopping = early_stopping
self.num_beams = num_beams
self.beams = []
self.worst_score = 1e9
def __len__(self):
"""
Number of hypotheses in the list.
"""
return len(self.beams)
def add(self, hyp, sum_logprobs):
"""
Add a new hypothesis to the list.
"""
score = sum_logprobs / len(hyp) ** self.length_penalty
if len(self) < self.num_beams or score > self.worst_score:
self.beams.append((score, hyp))
if len(self) > self.num_beams:
sorted_scores = sorted([(s, idx) for idx, (s, _) in enumerate(self.beams)])
del self.beams[sorted_scores[0][1]]
self.worst_score = sorted_scores[1][0]
else:
self.worst_score = min(score, self.worst_score)
def is_done(self, best_sum_logprobs, cur_len):
"""
If there are enough hypotheses and that none of the hypotheses being generated
can become better than the worst one in the heap, then we are done with this sentence.
"""
if len(self) < self.num_beams:
return False
elif self.early_stopping:
return True
else:
cur_score = best_sum_logprobs / cur_len ** self.length_penalty
ret = self.worst_score >= cur_score
return ret

View File

@@ -122,7 +122,12 @@ from .modeling_mobilebert import (
MobileBertModel,
)
from .modeling_openai import OpenAIGPTLMHeadModel, OpenAIGPTModel
from .modeling_reformer import ReformerModel, ReformerModelWithLMHead
from .modeling_reformer import (
ReformerForMaskedLM,
ReformerForQuestionAnswering,
ReformerModel,
ReformerModelWithLMHead,
)
from .modeling_retribert import RetriBertModel
from .modeling_roberta import (
RobertaForMaskedLM,
@@ -266,6 +271,7 @@ MODEL_FOR_MASKED_LM_MAPPING = OrderedDict(
(FlaubertConfig, FlaubertWithLMHeadModel),
(XLMConfig, XLMWithLMHeadModel),
(ElectraConfig, ElectraForMaskedLM),
(ReformerConfig, ReformerForMaskedLM),
]
)
@@ -310,6 +316,7 @@ MODEL_FOR_QUESTION_ANSWERING_MAPPING = OrderedDict(
(MobileBertConfig, MobileBertForQuestionAnswering),
(XLMConfig, XLMForQuestionAnsweringSimple),
(ElectraConfig, ElectraForQuestionAnswering),
(ReformerConfig, ReformerForQuestionAnswering),
]
)

View File

@@ -133,7 +133,7 @@ class ElectraDiscriminatorPredictions(nn.Module):
self.dense_prediction = nn.Linear(config.hidden_size, 1)
self.config = config
def forward(self, discriminator_hidden_states, attention_mask):
def forward(self, discriminator_hidden_states):
hidden_states = self.dense(discriminator_hidden_states)
hidden_states = get_activation(self.config.hidden_act)(hidden_states)
logits = self.dense_prediction(hidden_states).squeeze()
@@ -518,7 +518,7 @@ class ElectraForPreTraining(ElectraPreTrainedModel):
)
discriminator_sequence_output = discriminator_hidden_states[0]
logits = self.discriminator_predictions(discriminator_sequence_output, attention_mask)
logits = self.discriminator_predictions(discriminator_sequence_output)
output = (logits,)

File diff suppressed because it is too large Load Diff

View File

@@ -575,7 +575,7 @@ class MobileBertPooler(nn.Module):
return first_token_tensor
else:
pooled_output = self.dense(first_token_tensor)
pooled_output = F.tanh(pooled_output)
pooled_output = torch.tanh(pooled_output)
return pooled_output

View File

@@ -1704,6 +1704,7 @@ class ReformerModel(ReformerPreTrainedModel):
class ReformerModelWithLMHead(ReformerPreTrainedModel):
def __init__(self, config):
super().__init__(config)
assert config.is_decoder, "If you want to use `ReformerLMHeadModel` make sure that `is_decoder=True`."
self.reformer = ReformerModel(config)
self.lm_head = ReformerOnlyLMHead(config)
@@ -1789,3 +1790,190 @@ class ReformerModelWithLMHead(ReformerPreTrainedModel):
inputs_dict["num_hashes"] = kwargs["num_hashes"]
return inputs_dict
@add_start_docstrings("""Reformer Model with a `language modeling` head on top. """, REFORMER_START_DOCSTRING)
class ReformerForMaskedLM(ReformerPreTrainedModel):
def __init__(self, config):
super().__init__(config)
assert (
not config.is_decoder
), "If you want to use `ReformerForMaskedLM` make sure `config.is_decoder=False` for bi-directional self-attention."
self.reformer = ReformerModel(config)
self.lm_head = ReformerOnlyLMHead(config)
self.init_weights()
def get_output_embeddings(self):
return self.lm_head.decoder
def tie_weights(self):
# word embeddings are not tied in Reformer
pass
@add_start_docstrings_to_callable(REFORMER_INPUTS_DOCSTRING)
@add_code_sample_docstrings(tokenizer_class=_TOKENIZER_FOR_DOC, checkpoint="google/reformer-crime-and-punishment")
def forward(
self,
input_ids=None,
position_ids=None,
attention_mask=None,
head_mask=None,
inputs_embeds=None,
num_hashes=None,
labels=None,
output_hidden_states=None,
output_attentions=None,
):
r"""
labels (:obj:`torch.LongTensor` of shape :obj:`(batch_size, sequence_length)`, `optional`, defaults to :obj:`None`):
Labels for computing the masked language modeling loss.
Indices should be in ``[-100, 0, ..., config.vocab_size]`` (see ``input_ids`` docstring)
Tokens with indices set to ``-100`` are ignored (masked), the loss is only computed for the tokens with labels
Return:
:obj:`tuple(torch.FloatTensor)` comprising various elements depending on the configuration (:class:`~transformers.BertConfig`) and inputs:
loss (:obj:`torch.FloatTensor` of shape :obj:`(1,)`, `optional`, returned when :obj:`labels` is provided):
Classification loss (cross entropy).
prediction_scores (:obj:`torch.FloatTensor` of shape :obj:`(batch_size, sequence_length, config.vocab_size)`)
Prediction scores of the language modeling head (scores for each vocabulary token before SoftMax).
all_hidden_states (:obj:`tuple(torch.FloatTensor)`, `optional`, returned when ``output_hidden_states=True`` is passed or when ``config.output_hidden_states=True``):
Tuple of :obj:`torch.FloatTensor` (one for the output of the embeddings + one for the output of each layer)
of shape :obj:`(batch_size, sequence_length, hidden_size)`.
Hidden-states of the model at the output of each layer plus the initial embedding outputs.
all_attentions (:obj:`tuple(torch.FloatTensor)`, `optional`, returned when ``output_attentions=True`` is passed or when ``config.output_attentions=True``):
Tuple of :obj:`torch.FloatTensor` (one for each layer) of shape
:obj:`(batch_size, num_heads, sequence_length, sequence_length)`.
Attentions weights after the attention softmax, used to compute the weighted average in the self-attention
heads.
"""
reformer_outputs = self.reformer(
input_ids,
position_ids=position_ids,
attention_mask=attention_mask,
head_mask=head_mask,
inputs_embeds=inputs_embeds,
num_hashes=num_hashes,
output_hidden_states=output_hidden_states,
output_attentions=output_attentions,
)
sequence_output = reformer_outputs[0]
logits = self.lm_head(sequence_output)
outputs = (logits,) + reformer_outputs[1:]
if labels is not None:
loss_fct = CrossEntropyLoss() # -100 index = padding token
masked_lm_loss = loss_fct(logits.view(-1, self.config.vocab_size), labels.view(-1))
outputs = (masked_lm_loss,) + outputs
return outputs # (mlm_loss), lm_logits, (hidden_states), (attentions)
@add_start_docstrings(
"""Reformer Model with a span classification head on top for
extractive question-answering tasks like SQuAD / TriviaQA ( a linear layer on
top of hidden-states output to compute `span start logits` and `span end logits`. """,
REFORMER_START_DOCSTRING,
)
class ReformerForQuestionAnswering(ReformerPreTrainedModel):
def __init__(self, config):
super().__init__(config)
self.num_labels = config.num_labels
self.reformer = ReformerModel(config)
# 2 * config.hidden_size because we use reversible residual layers
self.qa_outputs = nn.Linear(2 * config.hidden_size, config.num_labels)
self.init_weights()
def tie_weights(self):
# word embeddings are not tied in Reformer
pass
@add_start_docstrings_to_callable(REFORMER_INPUTS_DOCSTRING)
@add_code_sample_docstrings(tokenizer_class=_TOKENIZER_FOR_DOC, checkpoint="google/reformer-crime-and-punishment")
def forward(
self,
input_ids=None,
position_ids=None,
attention_mask=None,
head_mask=None,
inputs_embeds=None,
num_hashes=None,
start_positions=None,
end_positions=None,
output_hidden_states=None,
output_attentions=None,
):
r"""
start_positions (:obj:`torch.LongTensor` of shape :obj:`(batch_size,)`, `optional`, defaults to :obj:`None`):
Labels for position (index) of the start of the labelled span for computing the token classification loss.
Positions are clamped to the length of the sequence (`sequence_length`).
Position outside of the sequence are not taken into account for computing the loss.
end_positions (:obj:`torch.LongTensor` of shape :obj:`(batch_size,)`, `optional`, defaults to :obj:`None`):
Labels for position (index) of the end of the labelled span for computing the token classification loss.
Positions are clamped to the length of the sequence (`sequence_length`).
Position outside of the sequence are not taken into account for computing the loss.
Return:
:obj:`tuple(torch.FloatTensor)` comprising various elements depending on the configuration (:class:`~transformers.ReformerConfig`) and inputs:
loss (:obj:`torch.FloatTensor` of shape :obj:`(1,)`, `optional`, returned when :obj:`labels` is provided):
Total span extraction loss is the sum of a Cross-Entropy for the start and end positions.
start_scores (:obj:`torch.FloatTensor` of shape :obj:`(batch_size, sequence_length,)`):
Span-start scores (before SoftMax).
end_scores (:obj:`torch.FloatTensor` of shape :obj:`(batch_size, sequence_length,)`):
Span-end scores (before SoftMax).
all_hidden_states (:obj:`tuple(torch.FloatTensor)`, `optional`, returned when ``output_hidden_states=True`` is passed or when ``config.output_hidden_states=True``):
Tuple of :obj:`torch.FloatTensor` (one for the output of the embeddings + one for the output of each layer)
of shape :obj:`(batch_size, sequence_length, hidden_size)`.
Hidden-states of the model at the output of each layer plus the initial embedding outputs.
all_attentions (:obj:`tuple(torch.FloatTensor)`, `optional`, returned when ``output_attentions=True`` is passed or when ``config.output_attentions=True``):
Tuple of :obj:`torch.FloatTensor` (one for each layer) of shape
:obj:`(batch_size, num_heads, sequence_length, sequence_length)`.
Attentions weights after the attention softmax, used to compute the weighted average in the self-attention
heads.
"""
reformer_outputs = self.reformer(
input_ids,
position_ids=position_ids,
attention_mask=attention_mask,
head_mask=head_mask,
inputs_embeds=inputs_embeds,
num_hashes=num_hashes,
output_hidden_states=output_hidden_states,
output_attentions=output_attentions,
)
sequence_output = reformer_outputs[0]
logits = self.qa_outputs(sequence_output)
start_logits, end_logits = logits.split(1, dim=-1)
start_logits = start_logits.squeeze(-1)
end_logits = end_logits.squeeze(-1)
outputs = (start_logits, end_logits,) + reformer_outputs[1:]
if start_positions is not None and end_positions is not None:
# If we are on multi-GPU, split add a dimension
if len(start_positions.size()) > 1:
start_positions = start_positions.squeeze(-1)
if len(end_positions.size()) > 1:
end_positions = end_positions.squeeze(-1)
# sometimes the start/end positions are outside our model inputs, we ignore these terms
ignored_index = start_logits.size(1)
start_positions.clamp_(0, ignored_index)
end_positions.clamp_(0, ignored_index)
loss_fct = CrossEntropyLoss(ignore_index=ignored_index)
start_loss = loss_fct(start_logits, start_positions)
end_loss = loss_fct(end_logits, end_positions)
total_loss = (start_loss + end_loss) / 2
outputs = (total_loss,) + outputs
return outputs # (loss), start_logits, end_logits, (hidden_states), (attentions)

View File

@@ -141,7 +141,6 @@ logger = logging.getLogger(__name__)
TF_MODEL_MAPPING = OrderedDict(
[
(AlbertConfig, TFAlbertModel),
(BertConfig, TFBertModel),
(CamembertConfig, TFCamembertModel),
(CTRLConfig, TFCTRLModel),
(DistilBertConfig, TFDistilBertModel),
@@ -151,6 +150,7 @@ TF_MODEL_MAPPING = OrderedDict(
(MobileBertConfig, TFMobileBertModel),
(OpenAIGPTConfig, TFOpenAIGPTModel),
(RobertaConfig, TFRobertaModel),
(BertConfig, TFBertModel),
(T5Config, TFT5Model),
(TransfoXLConfig, TFTransfoXLModel),
(XLMConfig, TFXLMModel),
@@ -162,7 +162,6 @@ TF_MODEL_MAPPING = OrderedDict(
TF_MODEL_FOR_PRETRAINING_MAPPING = OrderedDict(
[
(AlbertConfig, TFAlbertForPreTraining),
(BertConfig, TFBertForPreTraining),
(CamembertConfig, TFCamembertForMaskedLM),
(CTRLConfig, TFCTRLLMHeadModel),
(DistilBertConfig, TFDistilBertForMaskedLM),
@@ -172,6 +171,7 @@ TF_MODEL_FOR_PRETRAINING_MAPPING = OrderedDict(
(MobileBertConfig, TFMobileBertForPreTraining),
(OpenAIGPTConfig, TFOpenAIGPTLMHeadModel),
(RobertaConfig, TFRobertaForMaskedLM),
(BertConfig, TFBertForPreTraining),
(T5Config, TFT5ForConditionalGeneration),
(TransfoXLConfig, TFTransfoXLLMHeadModel),
(XLMConfig, TFXLMWithLMHeadModel),
@@ -183,7 +183,6 @@ TF_MODEL_FOR_PRETRAINING_MAPPING = OrderedDict(
TF_MODEL_WITH_LM_HEAD_MAPPING = OrderedDict(
[
(AlbertConfig, TFAlbertForMaskedLM),
(BertConfig, TFBertForMaskedLM),
(CamembertConfig, TFCamembertForMaskedLM),
(CTRLConfig, TFCTRLLMHeadModel),
(DistilBertConfig, TFDistilBertForMaskedLM),
@@ -193,6 +192,7 @@ TF_MODEL_WITH_LM_HEAD_MAPPING = OrderedDict(
(MobileBertConfig, TFMobileBertForMaskedLM),
(OpenAIGPTConfig, TFOpenAIGPTLMHeadModel),
(RobertaConfig, TFRobertaForMaskedLM),
(BertConfig, TFBertForMaskedLM),
(T5Config, TFT5ForConditionalGeneration),
(TransfoXLConfig, TFTransfoXLLMHeadModel),
(XLMConfig, TFXLMWithLMHeadModel),
@@ -204,12 +204,12 @@ TF_MODEL_WITH_LM_HEAD_MAPPING = OrderedDict(
TF_MODEL_FOR_MULTIPLE_CHOICE_MAPPING = OrderedDict(
[
(AlbertConfig, TFAlbertForMultipleChoice),
(BertConfig, TFBertForMultipleChoice),
(CamembertConfig, TFCamembertForMultipleChoice),
(DistilBertConfig, TFDistilBertForMultipleChoice),
(FlaubertConfig, TFFlaubertForMultipleChoice),
(MobileBertConfig, TFMobileBertForMultipleChoice),
(RobertaConfig, TFRobertaForMultipleChoice),
(BertConfig, TFBertForMultipleChoice),
(XLMConfig, TFXLMForMultipleChoice),
(XLMRobertaConfig, TFXLMRobertaForMultipleChoice),
(XLNetConfig, TFXLNetForMultipleChoice),
@@ -219,13 +219,13 @@ TF_MODEL_FOR_MULTIPLE_CHOICE_MAPPING = OrderedDict(
TF_MODEL_FOR_QUESTION_ANSWERING_MAPPING = OrderedDict(
[
(AlbertConfig, TFAlbertForQuestionAnswering),
(BertConfig, TFBertForQuestionAnswering),
(CamembertConfig, TFCamembertForQuestionAnswering),
(DistilBertConfig, TFDistilBertForQuestionAnswering),
(ElectraConfig, TFElectraForQuestionAnswering),
(FlaubertConfig, TFFlaubertForQuestionAnsweringSimple),
(MobileBertConfig, TFMobileBertForQuestionAnswering),
(RobertaConfig, TFRobertaForQuestionAnswering),
(BertConfig, TFBertForQuestionAnswering),
(XLMConfig, TFXLMForQuestionAnsweringSimple),
(XLMRobertaConfig, TFXLMRobertaForQuestionAnswering),
(XLNetConfig, TFXLNetForQuestionAnsweringSimple),
@@ -235,12 +235,12 @@ TF_MODEL_FOR_QUESTION_ANSWERING_MAPPING = OrderedDict(
TF_MODEL_FOR_SEQUENCE_CLASSIFICATION_MAPPING = OrderedDict(
[
(AlbertConfig, TFAlbertForSequenceClassification),
(BertConfig, TFBertForSequenceClassification),
(CamembertConfig, TFCamembertForSequenceClassification),
(DistilBertConfig, TFDistilBertForSequenceClassification),
(FlaubertConfig, TFFlaubertForSequenceClassification),
(MobileBertConfig, TFMobileBertForSequenceClassification),
(RobertaConfig, TFRobertaForSequenceClassification),
(BertConfig, TFBertForSequenceClassification),
(XLMConfig, TFXLMForSequenceClassification),
(XLMRobertaConfig, TFXLMRobertaForSequenceClassification),
(XLNetConfig, TFXLNetForSequenceClassification),
@@ -250,13 +250,13 @@ TF_MODEL_FOR_SEQUENCE_CLASSIFICATION_MAPPING = OrderedDict(
TF_MODEL_FOR_TOKEN_CLASSIFICATION_MAPPING = OrderedDict(
[
(AlbertConfig, TFAlbertForTokenClassification),
(BertConfig, TFBertForTokenClassification),
(CamembertConfig, TFCamembertForTokenClassification),
(DistilBertConfig, TFDistilBertForTokenClassification),
(ElectraConfig, TFElectraForTokenClassification),
(FlaubertConfig, TFFlaubertForTokenClassification),
(MobileBertConfig, TFMobileBertForTokenClassification),
(RobertaConfig, TFRobertaForTokenClassification),
(BertConfig, TFBertForTokenClassification),
(XLMConfig, TFXLMForTokenClassification),
(XLMRobertaConfig, TFXLMRobertaForTokenClassification),
(XLNetConfig, TFXLNetForTokenClassification),

View File

@@ -478,7 +478,7 @@ class TFT5Block(tf.keras.layers.Layer):
return outputs # hidden-states, present_key_value_states, (self-attention weights), (self-attention position bias), (cross-attention weights), (cross-attention position bias)
class _NoLayerEmbedTokens(object):
class _NoLayerEmbedTokens:
"""
this class wraps a the TFSharedEmbeddingTokens layer into a python 'no-keras-layer'
class to avoid problem with weight restoring. Also it makes sure that the layer is
@@ -655,7 +655,7 @@ class TFT5MainLayer(tf.keras.layers.Layer):
# Since we are adding it to the raw scores before the softmax, this is
# effectively the same as removing these entirely.
# T5 has a mask that can compare sequence ids, we can simulate this here with this transposistion
# T5 has a mask that can compare sequence ids, we can simulate this here with this transposition
# Cf. https://github.com/tensorflow/mesh/blob/8d2465e9bc93129b913b5ccc6a59aa97abd96ec6/mesh_tensorflow/transformer/transformer_layers.py#L270
# extended_attention_mask = tf.math.equal(extended_attention_mask,
# tf.transpose(extended_attention_mask, perm=(-1, -2)))
@@ -682,16 +682,8 @@ class TFT5MainLayer(tf.keras.layers.Layer):
else:
encoder_extended_attention_mask = None
# Prepare head mask if needed
# 1.0 in head_mask indicate we keep the head
# attention_probs has shape bsz x n_heads x N x N
# input head_mask has shape [num_heads] or [num_hidden_layers x num_heads]
# and head_mask is converted to shape [num_hidden_layers x batch x num_heads x seq_length x seq_length]
if head_mask is not None:
raise NotImplementedError
else:
head_mask = [None] * self.num_hidden_layers
# head_mask = tf.constant([0] * self.num_hidden_layers)
assert head_mask is None, "Head mask not supported"
head_mask = [None] * self.num_hidden_layers
present_key_value_states = ()
all_hidden_states = ()
@@ -1054,8 +1046,6 @@ class TFT5ForConditionalGeneration(TFT5PreTrainedModel):
r"""
Returns:
:obj:`tuple(tf.Tensor)` comprising various elements depending on the configuration (:class:`~transformers.T5Config`) and inputs:
loss (:obj:`tf.Tensor` of shape :obj:`(1,)`, `optional`, returned when :obj:`lm_label` is provided):
Classification loss (cross entropy).
prediction_scores (:obj:`tf.Tensor` of shape :obj:`(batch_size, sequence_length, config.vocab_size)`)
Prediction scores of the language modeling head (scores for each vocabulary token before SoftMax).
decoder_past_key_value_states (:obj:`tuple(tuple(tf.Tensor))` of length :obj:`config.n_layers` with each tuple having 4 tensors of shape :obj:`(batch_size, num_heads, sequence_length, embed_size_per_head)`, `optional`, returned when ``use_cache=True``):

File diff suppressed because it is too large Load Diff

View File

@@ -1016,11 +1016,14 @@ class TransfoXLLMHeadModel(TransfoXLPreTrainedModel):
return self.crit.out_layers[-1]
def prepare_inputs_for_generation(self, input_ids, past, **model_kwargs):
inputs = {"input_ids": input_ids}
inputs = {}
# if past is defined in model kwargs then use it for faster decoding
if past:
inputs["mems"] = past
inputs["input_ids"] = input_ids[:, -1].unsqueeze(-1)
else:
inputs["input_ids"] = input_ids
return inputs

View File

@@ -17,7 +17,7 @@
import inspect
import logging
import os
from typing import Callable, Dict, Iterable, List, Optional, Tuple
from typing import Callable, Dict, List, Optional, Tuple
import torch
from torch import Tensor, device, dtype, nn
@@ -35,6 +35,7 @@ from .file_utils import (
hf_bucket_url,
is_remote_url,
)
from .generation_utils import GenerationMixin
logger = logging.getLogger(__name__)
@@ -261,7 +262,7 @@ class ModuleUtilsMixin:
return head_mask
class PreTrainedModel(nn.Module, ModuleUtilsMixin):
class PreTrainedModel(nn.Module, ModuleUtilsMixin, GenerationMixin):
r""" Base class for all models.
:class:`~transformers.PreTrainedModel` takes care of storing the configuration of the models and handles methods for loading/downloading/saving models
@@ -801,967 +802,6 @@ class PreTrainedModel(nn.Module, ModuleUtilsMixin):
return model
def prepare_inputs_for_generation(self, input_ids, **kwargs):
return {"input_ids": input_ids}
def adjust_logits_during_generation(self, logits, **kwargs):
return logits
def _use_cache(self, outputs, use_cache):
"""During generation, decide whether to pass the `past` variable to the next forward pass."""
if len(outputs) <= 1 or use_cache is False:
return False
if hasattr(self.config, "mem_len") and self.config.mem_len == 0:
return False
return True
def enforce_repetition_penalty_(self, lprobs, batch_size, num_beams, prev_output_tokens, repetition_penalty):
"""repetition penalty (from CTRL paper https://arxiv.org/abs/1909.05858). """
for i in range(batch_size * num_beams):
for previous_token in set(prev_output_tokens[i].tolist()):
# if score < 0 then repetition penalty has to multiplied to reduce the previous token probability
if lprobs[i, previous_token] < 0:
lprobs[i, previous_token] *= repetition_penalty
else:
lprobs[i, previous_token] /= repetition_penalty
def postprocess_next_token_scores(
self,
scores,
input_ids,
no_repeat_ngram_size,
bad_words_ids,
cur_len,
min_length,
max_length,
eos_token_id,
repetition_penalty,
batch_size,
num_beams,
):
# repetition penalty (from CTRL paper https://arxiv.org/abs/1909.05858)
if repetition_penalty != 1.0:
self.enforce_repetition_penalty_(
scores, batch_size, num_beams, input_ids, repetition_penalty,
)
# set eos token prob to zero if min_length is not reached
if eos_token_id is not None and cur_len < min_length:
scores[:, eos_token_id] = -float("inf")
if no_repeat_ngram_size > 0:
# calculate a list of banned tokens to prevent repetitively generating the same ngrams
num_batch_hypotheses = batch_size * num_beams
# from fairseq: https://github.com/pytorch/fairseq/blob/a07cb6f40480928c9e0548b737aadd36ee66ac76/fairseq/sequence_generator.py#L345
banned_batch_tokens = calc_banned_ngram_tokens(
input_ids, num_batch_hypotheses, no_repeat_ngram_size, cur_len
)
for i, banned_tokens in enumerate(banned_batch_tokens):
scores[i, banned_tokens] = -float("inf")
if bad_words_ids is not None:
# calculate a list of banned tokens according to bad words
banned_tokens = calc_banned_bad_words_ids(input_ids, bad_words_ids)
for i, banned_tokens in enumerate(banned_tokens):
scores[i, banned_tokens] = -float("inf")
return scores
@torch.no_grad()
def generate(
self,
input_ids: Optional[torch.LongTensor] = None,
max_length: Optional[int] = None,
min_length: Optional[int] = None,
do_sample: Optional[bool] = None,
early_stopping: Optional[bool] = None,
num_beams: Optional[int] = None,
temperature: Optional[float] = None,
top_k: Optional[int] = None,
top_p: Optional[float] = None,
repetition_penalty: Optional[float] = None,
bad_words_ids: Optional[Iterable[int]] = None,
bos_token_id: Optional[int] = None,
pad_token_id: Optional[int] = None,
eos_token_id: Optional[int] = None,
length_penalty: Optional[float] = None,
no_repeat_ngram_size: Optional[int] = None,
num_return_sequences: Optional[int] = None,
attention_mask: Optional[torch.LongTensor] = None,
decoder_start_token_id: Optional[int] = None,
use_cache: Optional[bool] = None,
**model_specific_kwargs
) -> torch.LongTensor:
r""" Generates sequences for models with a LM head. The method currently supports greedy decoding, beam-search decoding, sampling with temperature, sampling with top-k or nucleus sampling.
Adapted in part from `Facebook's XLM beam search code`_.
.. _`Facebook's XLM beam search code`:
https://github.com/facebookresearch/XLM/blob/9e6f6814d17be4fe5b15f2e6c43eb2b2d76daeb4/src/model/transformer.py#L529
Parameters:
input_ids: (`optional`) `torch.LongTensor` of shape `(batch_size, sequence_length)`
The sequence used as a prompt for the generation. If `None` the method initializes
it as an empty `torch.LongTensor` of shape `(1,)`.
max_length: (`optional`) int
The max length of the sequence to be generated. Between `min_length` and infinity. Default to 20.
min_length: (`optional`) int
The min length of the sequence to be generated. Between 0 and infinity. Default to 0.
do_sample: (`optional`) bool
If set to `False` greedy decoding is used. Otherwise sampling is used. Defaults to `False` as defined in `configuration_utils.PretrainedConfig`.
early_stopping: (`optional`) bool
if set to `True` beam search is stopped when at least `num_beams` sentences finished per batch. Defaults to `False` as defined in `configuration_utils.PretrainedConfig`.
num_beams: (`optional`) int
Number of beams for beam search. Must be between 1 and infinity. 1 means no beam search. Default to 1.
temperature: (`optional`) float
The value used to module the next token probabilities. Must be strictly positive. Default to 1.0.
top_k: (`optional`) int
The number of highest probability vocabulary tokens to keep for top-k-filtering. Between 1 and infinity. Default to 50.
top_p: (`optional`) float
The cumulative probability of parameter highest probability vocabulary tokens to keep for nucleus sampling. Must be between 0 and 1. Default to 1.
repetition_penalty: (`optional`) float
The parameter for repetition penalty. Between 1.0 and infinity. 1.0 means no penalty. Default to 1.0.
pad_token_id: (`optional`) int
Padding token. Default to specicic model pad_token_id or None if it does not exist.
bos_token_id: (`optional`) int
BOS token. Defaults to `bos_token_id` as defined in the models config.
eos_token_id: (`optional`) int
EOS token. Defaults to `eos_token_id` as defined in the models config.
length_penalty: (`optional`) float
Exponential penalty to the length. Default to 1.
no_repeat_ngram_size: (`optional`) int
If set to int > 0, all ngrams of size `no_repeat_ngram_size` can only occur once.
bad_words_ids: (`optional`) list of lists of int
`bad_words_ids` contains tokens that are not allowed to be generated. In order to get the tokens of the words that should not appear in the generated text, use `tokenizer.encode(bad_word, add_prefix_space=True)`.
num_return_sequences: (`optional`) int
The number of independently computed returned sequences for each element in the batch. Default to 1.
attention_mask (`optional`) obj: `torch.LongTensor` of same shape as `input_ids`
Mask to avoid performing attention on padding token indices.
Mask values selected in ``[0, 1]``:
``1`` for tokens that are NOT MASKED, ``0`` for MASKED tokens.
Defaults to `None`.
`What are attention masks? <../glossary.html#attention-mask>`__
decoder_start_token_id=None: (`optional`) int
Start token id for the decoder. Defaults to ``decoder_start_token_id`` as defined the model's config or to the ``bos_token_id``
if no ``decoder_start_token_id`` is found in the config.
This is only relevant for encoder-decoder models.
use_cache: (`optional`) bool
If `use_cache` is True, past key values are used to speed up decoding if applicable to model. Defaults to `True`.
model_specific_kwargs: (`optional`) dict
Additional model specific kwargs will be forwarded to the `forward` function of the model.
Return:
output: `torch.LongTensor` of shape `(batch_size * num_return_sequences, sequence_length)`
sequence_length is either equal to max_length or shorter if all batches finished early due to the `eos_token_id`
Examples::
from transformers import AutoTokenizer, AutoModelForCausalLM
tokenizer = AutoTokenizer.from_pretrained('distilgpt2') # Initialize tokenizer
model = AutoModelForCausalLM.from_pretrained('distilgpt2') # Download model and configuration from S3 and cache.
outputs = model.generate(max_length=40) # do greedy decoding
print('Generated: {}'.format(tokenizer.decode(outputs[0], skip_special_tokens=True)))
tokenizer = AutoTokenizer.from_pretrained('openai-gpt') # Initialize tokenizer
model = AutoModelForCausalLM.from_pretrained('openai-gpt') # Download model and configuration from S3 and cache.
input_context = 'The dog'
input_ids = tokenizer.encode(input_context, return_tensors='pt') # encode input context
outputs = model.generate(input_ids=input_ids, num_beams=5, num_return_sequences=3, temperature=1.5) # generate 3 independent sequences using beam search decoding (5 beams) with sampling from initial context 'The dog'
for i in range(3): # 3 output sequences were generated
print('Generated {}: {}'.format(i, tokenizer.decode(outputs[i], skip_special_tokens=True)))
tokenizer = AutoTokenizer.from_pretrained('distilgpt2') # Initialize tokenizer
model = AutoModelForCausalLM.from_pretrained('distilgpt2') # Download model and configuration from S3 and cache.
input_context = 'The dog'
input_ids = tokenizer.encode(input_context, return_tensors='pt') # encode input context
outputs = model.generate(input_ids=input_ids, max_length=40, temperature=0.7, num_return_sequences=3, do_sample=True) # 3 generate sequences using by sampling
for i in range(3): # 3 output sequences were generated
print('Generated {}: {}'.format(i, tokenizer.decode(outputs[i], skip_special_tokens=True)))
tokenizer = AutoTokenizer.from_pretrained('ctrl') # Initialize tokenizer
model = AutoModelForCausalLM.from_pretrained('ctrl') # Download model and configuration from S3 and cache.
input_context = 'Legal My neighbor is' # "Legal" is one of the control codes for ctrl
input_ids = tokenizer.encode(input_context, return_tensors='pt') # encode input context
outputs = model.generate(input_ids=input_ids, max_length=50, temperature=0.7, repetition_penalty=1.2) # generate sequences
print('Generated: {}'.format(tokenizer.decode(outputs[0], skip_special_tokens=True)))
tokenizer = AutoTokenizer.from_pretrained('gpt2') # Initialize tokenizer
model = AutoModelForCausalLM.from_pretrained('gpt2') # Download model and configuration from S3 and cache.
input_context = 'My cute dog' # "Legal" is one of the control codes for ctrl
bad_words_ids = [tokenizer.encode(bad_word, add_prefix_space=True) for bad_word in ['idiot', 'stupid', 'shut up']]
input_ids = tokenizer.encode(input_context, return_tensors='pt') # encode input context
outputs = model.generate(input_ids=input_ids, max_length=100, do_sample=True, bad_words_ids=bad_words_ids) # generate sequences without allowing bad_words to be generated
"""
# We cannot generate if the model does not have a LM head
if self.get_output_embeddings() is None:
raise AttributeError(
"You tried to generate sequences with a model that does not have a LM Head."
"Please use another model class (e.g. `OpenAIGPTLMHeadModel`, `XLNetLMHeadModel`, `GPT2LMHeadModel`, `CTRLLMHeadModel`, `T5WithLMHeadModel`, `TransfoXLLMHeadModel`, `XLMWithLMHeadModel`, `BartForConditionalGeneration` )"
)
max_length = max_length if max_length is not None else self.config.max_length
min_length = min_length if min_length is not None else self.config.min_length
do_sample = do_sample if do_sample is not None else self.config.do_sample
early_stopping = early_stopping if early_stopping is not None else self.config.early_stopping
use_cache = use_cache if use_cache is not None else self.config.use_cache
num_beams = num_beams if num_beams is not None else self.config.num_beams
temperature = temperature if temperature is not None else self.config.temperature
top_k = top_k if top_k is not None else self.config.top_k
top_p = top_p if top_p is not None else self.config.top_p
repetition_penalty = repetition_penalty if repetition_penalty is not None else self.config.repetition_penalty
bos_token_id = bos_token_id if bos_token_id is not None else self.config.bos_token_id
pad_token_id = pad_token_id if pad_token_id is not None else self.config.pad_token_id
eos_token_id = eos_token_id if eos_token_id is not None else self.config.eos_token_id
length_penalty = length_penalty if length_penalty is not None else self.config.length_penalty
no_repeat_ngram_size = (
no_repeat_ngram_size if no_repeat_ngram_size is not None else self.config.no_repeat_ngram_size
)
bad_words_ids = bad_words_ids if bad_words_ids is not None else self.config.bad_words_ids
num_return_sequences = (
num_return_sequences if num_return_sequences is not None else self.config.num_return_sequences
)
decoder_start_token_id = (
decoder_start_token_id if decoder_start_token_id is not None else self.config.decoder_start_token_id
)
if input_ids is not None:
batch_size = input_ids.shape[0] # overriden by the input batch_size
else:
batch_size = 1
assert isinstance(max_length, int) and max_length > 0, "`max_length` should be a strictly positive integer."
assert isinstance(min_length, int) and min_length >= 0, "`min_length` should be a positive integer."
assert isinstance(do_sample, bool), "`do_sample` should be a boolean."
assert isinstance(early_stopping, bool), "`early_stopping` should be a boolean."
assert isinstance(use_cache, bool), "`use_cache` should be a boolean."
assert isinstance(num_beams, int) and num_beams > 0, "`num_beams` should be a strictly positive integer."
assert temperature > 0, "`temperature` should be strictly positive."
assert isinstance(top_k, int) and top_k >= 0, "`top_k` should be a positive integer."
assert 0 <= top_p <= 1, "`top_p` should be between 0 and 1."
assert repetition_penalty >= 1.0, "`repetition_penalty` should be >= 1."
assert input_ids is not None or (
isinstance(bos_token_id, int) and bos_token_id >= 0
), "If input_ids is not defined, `bos_token_id` should be a positive integer."
assert pad_token_id is None or (
isinstance(pad_token_id, int) and (pad_token_id >= 0)
), "`pad_token_id` should be a positive integer."
assert (eos_token_id is None) or (
isinstance(eos_token_id, int) and (eos_token_id >= 0)
), "`eos_token_id` should be a positive integer."
assert length_penalty > 0, "`length_penalty` should be strictly positive."
assert (
isinstance(no_repeat_ngram_size, int) and no_repeat_ngram_size >= 0
), "`no_repeat_ngram_size` should be a positive integer."
assert (
isinstance(num_return_sequences, int) and num_return_sequences > 0
), "`num_return_sequences` should be a strictly positive integer."
assert (
bad_words_ids is None or isinstance(bad_words_ids, list) and isinstance(bad_words_ids[0], list)
), "`bad_words_ids` is either `None` or a list of lists of tokens that should not be generated"
if input_ids is None:
assert isinstance(bos_token_id, int) and bos_token_id >= 0, (
"you should either supply a context to complete as `input_ids` input "
"or a `bos_token_id` (integer >= 0) as a first token to start the generation."
)
input_ids = torch.full(
(batch_size, 1), bos_token_id, dtype=torch.long, device=next(self.parameters()).device,
)
else:
assert input_ids.dim() == 2, "Input prompt should be of shape (batch_size, sequence length)."
# not allow to duplicate outputs when greedy decoding
if do_sample is False:
if num_beams == 1:
# no_beam_search greedy generation conditions
assert (
num_return_sequences == 1
), "Greedy decoding will always produce the same output for num_beams == 1 and num_return_sequences > 1. Please set num_return_sequences = 1"
else:
# beam_search greedy generation conditions
assert (
num_beams >= num_return_sequences
), "Greedy beam search decoding cannot return more sequences than it has beams. Please set num_beams >= num_return_sequences"
# create attention mask if necessary
# TODO (PVP): this should later be handled by the forward fn() in each model in the future see PR 3140
if (attention_mask is None) and (pad_token_id is not None) and (pad_token_id in input_ids):
attention_mask = input_ids.ne(pad_token_id).long()
elif attention_mask is None:
attention_mask = input_ids.new_ones(input_ids.shape)
# set pad_token_id to eos_token_id if not set. Important that this is done after
# attention_mask is created
if pad_token_id is None and eos_token_id is not None:
logger.warning(
"Setting `pad_token_id` to {} (first `eos_token_id`) to generate sequence".format(eos_token_id)
)
pad_token_id = eos_token_id
# current position and vocab size
if hasattr(self.config, "vocab_size"):
vocab_size = self.config.vocab_size
elif (
self.config.is_encoder_decoder
and hasattr(self.config, "decoder")
and hasattr(self.config.decoder, "vocab_size")
):
vocab_size = self.config.decoder.vocab_size
# set effective batch size and effective batch multiplier according to do_sample
if do_sample:
effective_batch_size = batch_size * num_return_sequences
effective_batch_mult = num_return_sequences
else:
effective_batch_size = batch_size
effective_batch_mult = 1
if self.config.is_encoder_decoder:
if decoder_start_token_id is None:
decoder_start_token_id = bos_token_id
assert (
decoder_start_token_id is not None
), "decoder_start_token_id or bos_token_id has to be defined for encoder-decoder generation"
assert hasattr(self, "get_encoder"), "{} should have a 'get_encoder' function defined".format(self)
assert callable(self.get_encoder), "{} should be a method".format(self.get_encoder)
# get encoder and store encoder outputs
encoder = self.get_encoder()
encoder_outputs: tuple = encoder(input_ids, attention_mask=attention_mask)
# Expand input ids if num_beams > 1 or num_return_sequences > 1
if num_return_sequences > 1 or num_beams > 1:
input_ids_len = input_ids.shape[-1]
input_ids = input_ids.unsqueeze(1).expand(batch_size, effective_batch_mult * num_beams, input_ids_len)
attention_mask = attention_mask.unsqueeze(1).expand(
batch_size, effective_batch_mult * num_beams, input_ids_len
)
input_ids = input_ids.contiguous().view(
effective_batch_size * num_beams, input_ids_len
) # shape: (batch_size * num_return_sequences * num_beams, cur_len)
attention_mask = attention_mask.contiguous().view(
effective_batch_size * num_beams, input_ids_len
) # shape: (batch_size * num_return_sequences * num_beams, cur_len)
if self.config.is_encoder_decoder:
# create empty decoder_input_ids
input_ids = torch.full(
(effective_batch_size * num_beams, 1),
decoder_start_token_id,
dtype=torch.long,
device=next(self.parameters()).device,
)
cur_len = 1
assert (
batch_size == encoder_outputs[0].shape[0]
), f"expected encoder_outputs[0] to have 1st dimension bs={batch_size}, got {encoder_outputs[0].shape[0]} "
# expand batch_idx to assign correct encoder output for expanded input_ids (due to num_beams > 1 and num_return_sequences > 1)
expanded_batch_idxs = (
torch.arange(batch_size)
.view(-1, 1)
.repeat(1, num_beams * effective_batch_mult)
.view(-1)
.to(input_ids.device)
)
# expand encoder_outputs
encoder_outputs = (encoder_outputs[0].index_select(0, expanded_batch_idxs), *encoder_outputs[1:])
else:
encoder_outputs = None
cur_len = input_ids.shape[-1]
if num_beams > 1:
output = self._generate_beam_search(
input_ids,
cur_len=cur_len,
max_length=max_length,
min_length=min_length,
do_sample=do_sample,
early_stopping=early_stopping,
temperature=temperature,
top_k=top_k,
top_p=top_p,
repetition_penalty=repetition_penalty,
no_repeat_ngram_size=no_repeat_ngram_size,
bad_words_ids=bad_words_ids,
pad_token_id=pad_token_id,
eos_token_id=eos_token_id,
batch_size=effective_batch_size,
num_return_sequences=num_return_sequences,
length_penalty=length_penalty,
num_beams=num_beams,
vocab_size=vocab_size,
encoder_outputs=encoder_outputs,
attention_mask=attention_mask,
use_cache=use_cache,
model_specific_kwargs=model_specific_kwargs,
)
else:
output = self._generate_no_beam_search(
input_ids,
cur_len=cur_len,
max_length=max_length,
min_length=min_length,
do_sample=do_sample,
temperature=temperature,
top_k=top_k,
top_p=top_p,
repetition_penalty=repetition_penalty,
no_repeat_ngram_size=no_repeat_ngram_size,
bad_words_ids=bad_words_ids,
pad_token_id=pad_token_id,
eos_token_id=eos_token_id,
batch_size=effective_batch_size,
encoder_outputs=encoder_outputs,
attention_mask=attention_mask,
use_cache=use_cache,
model_specific_kwargs=model_specific_kwargs,
)
return output
def _generate_no_beam_search(
self,
input_ids,
cur_len,
max_length,
min_length,
do_sample,
temperature,
top_k,
top_p,
repetition_penalty,
no_repeat_ngram_size,
bad_words_ids,
pad_token_id,
eos_token_id,
batch_size,
encoder_outputs,
attention_mask,
use_cache,
model_specific_kwargs,
):
""" Generate sequences for each example without beam search (num_beams == 1).
All returned sequence are generated independantly.
"""
# length of generated sentences / unfinished sentences
unfinished_sents = input_ids.new(batch_size).fill_(1)
sent_lengths = input_ids.new(batch_size).fill_(max_length)
past = (encoder_outputs, None) if encoder_outputs is not None else None
while cur_len < max_length:
model_inputs = self.prepare_inputs_for_generation(
input_ids, past=past, attention_mask=attention_mask, use_cache=use_cache, **model_specific_kwargs
)
outputs = self(**model_inputs)
next_token_logits = outputs[0][:, -1, :]
scores = self.postprocess_next_token_scores(
scores=next_token_logits,
input_ids=input_ids,
no_repeat_ngram_size=no_repeat_ngram_size,
bad_words_ids=bad_words_ids,
cur_len=cur_len,
min_length=min_length,
max_length=max_length,
eos_token_id=eos_token_id,
repetition_penalty=repetition_penalty,
batch_size=batch_size,
num_beams=1,
)
# if model has past, then set the past variable to speed up decoding
if self._use_cache(outputs, use_cache):
past = outputs[1]
if do_sample:
# Temperature (higher temperature => more likely to sample low probability tokens)
if temperature != 1.0:
scores = scores / temperature
# Top-p/top-k filtering
next_token_logscores = top_k_top_p_filtering(scores, top_k=top_k, top_p=top_p)
# Sample
probs = F.softmax(next_token_logscores, dim=-1)
next_token = torch.multinomial(probs, num_samples=1).squeeze(1)
else:
# Greedy decoding
next_token = torch.argmax(next_token_logits, dim=-1)
# update generations and finished sentences
if eos_token_id is not None:
# pad finished sentences if eos_token_id exist
tokens_to_add = next_token * unfinished_sents + (pad_token_id) * (1 - unfinished_sents)
else:
tokens_to_add = next_token
# add token and increase length by one
input_ids = torch.cat([input_ids, tokens_to_add.unsqueeze(-1)], dim=-1)
cur_len = cur_len + 1
if eos_token_id is not None:
eos_in_sents = tokens_to_add == eos_token_id
# if sentence is unfinished and the token to add is eos, sent_lengths is filled with current length
is_sents_unfinished_and_token_to_add_is_eos = unfinished_sents.mul(eos_in_sents.long()).bool()
sent_lengths.masked_fill_(is_sents_unfinished_and_token_to_add_is_eos, cur_len)
# unfinished_sents is set to zero if eos in sentence
unfinished_sents.mul_((~eos_in_sents).long())
# stop when there is a </s> in each sentence, or if we exceed the maximul length
if unfinished_sents.max() == 0:
break
# extend attention_mask for new generated input if only decoder
if self.config.is_encoder_decoder is False:
attention_mask = torch.cat(
[attention_mask, attention_mask.new_ones((attention_mask.shape[0], 1))], dim=-1
)
return input_ids
def _generate_beam_search(
self,
input_ids,
cur_len,
max_length,
min_length,
do_sample,
early_stopping,
temperature,
top_k,
top_p,
repetition_penalty,
no_repeat_ngram_size,
bad_words_ids,
pad_token_id,
eos_token_id,
batch_size,
num_return_sequences,
length_penalty,
num_beams,
vocab_size,
encoder_outputs,
attention_mask,
use_cache,
model_specific_kwargs,
):
""" Generate sequences for each example with beam search.
"""
# generated hypotheses
generated_hyps = [
BeamHypotheses(num_beams, max_length, length_penalty, early_stopping=early_stopping)
for _ in range(batch_size)
]
# scores for each sentence in the beam
beam_scores = torch.zeros((batch_size, num_beams), dtype=torch.float, device=input_ids.device)
# for greedy decoding it is made sure that only tokens of the first beam are considered to avoid sampling the exact same tokens three times
if do_sample is False:
beam_scores[:, 1:] = -1e9
beam_scores = beam_scores.view(-1) # shape (batch_size * num_beams,)
# cache compute states
past = (encoder_outputs, None) if encoder_outputs is not None else None
# done sentences
done = [False for _ in range(batch_size)]
while cur_len < max_length:
model_inputs = self.prepare_inputs_for_generation(
input_ids, past=past, attention_mask=attention_mask, use_cache=use_cache, **model_specific_kwargs
)
outputs = self(**model_inputs) # (batch_size * num_beams, cur_len, vocab_size)
next_token_logits = outputs[0][:, -1, :] # (batch_size * num_beams, vocab_size)
# if model has past, then set the past variable to speed up decoding
if self._use_cache(outputs, use_cache):
past = outputs[1]
if self.config.is_encoder_decoder and do_sample is False:
# TODO (PVP) still a bit hacky here - there might be a better solution
next_token_logits = self.adjust_logits_during_generation(
next_token_logits, cur_len=cur_len, max_length=max_length
)
scores = F.log_softmax(next_token_logits, dim=-1) # (batch_size * num_beams, vocab_size)
scores = self.postprocess_next_token_scores(
scores=scores,
input_ids=input_ids,
no_repeat_ngram_size=no_repeat_ngram_size,
bad_words_ids=bad_words_ids,
cur_len=cur_len,
min_length=min_length,
max_length=max_length,
eos_token_id=eos_token_id,
repetition_penalty=repetition_penalty,
batch_size=batch_size,
num_beams=num_beams,
)
assert scores.shape == (batch_size * num_beams, vocab_size), "Shapes of scores: {} != {}".format(
scores.shape, (batch_size * num_beams, vocab_size)
)
if do_sample:
_scores = scores + beam_scores[:, None].expand_as(scores) # (batch_size * num_beams, vocab_size)
# Temperature
if temperature != 1.0:
_scores = _scores / temperature
# Top-p/top-k filtering
_scores = top_k_top_p_filtering(
_scores, top_k=top_k, top_p=top_p, min_tokens_to_keep=2
) # (batch_size * num_beams, vocab_size)
# re-organize to group the beam together to sample from all beam_idxs
_scores = _scores.contiguous().view(
batch_size, num_beams * vocab_size
) # (batch_size, num_beams * vocab_size)
# Sample 2 next tokens for each beam (so we have some spare tokens and match output of greedy beam search)
probs = F.softmax(_scores, dim=-1)
next_tokens = torch.multinomial(probs, num_samples=2 * num_beams) # (batch_size, num_beams * 2)
# Compute next scores
next_scores = torch.gather(_scores, -1, next_tokens) # (batch_size, num_beams * 2)
# sort the sampled vector to make sure that the first num_beams samples are the best
next_scores, next_scores_indices = torch.sort(next_scores, descending=True, dim=1)
next_tokens = torch.gather(next_tokens, -1, next_scores_indices) # (batch_size, num_beams * 2)
else:
next_scores = scores + beam_scores[:, None].expand_as(scores) # (batch_size * num_beams, vocab_size)
# re-organize to group the beam together (we are keeping top hypothesis accross beams)
next_scores = next_scores.view(
batch_size, num_beams * vocab_size
) # (batch_size, num_beams * vocab_size)
next_scores, next_tokens = torch.topk(next_scores, 2 * num_beams, dim=1, largest=True, sorted=True)
assert next_scores.size() == next_tokens.size() == (batch_size, 2 * num_beams)
# next batch beam content
next_batch_beam = []
# for each sentence
for batch_idx in range(batch_size):
# if we are done with this sentence, add a pad token
if done[batch_idx]:
assert (
len(generated_hyps[batch_idx]) >= num_beams
), "Batch can only be done if at least {} beams have been generated".format(num_beams)
assert (
eos_token_id is not None and pad_token_id is not None
), "generated beams >= num_beams -> eos_token_id and pad_token have to be defined"
next_batch_beam.extend([(0, pad_token_id, 0)] * num_beams) # pad the batch
continue
# next sentence beam content, this will get added to next_batch_beam
next_sent_beam = []
# next tokens for this sentence
for beam_token_rank, (beam_token_id, beam_token_score) in enumerate(
zip(next_tokens[batch_idx], next_scores[batch_idx])
):
# get beam and token IDs
beam_id = beam_token_id // vocab_size
token_id = beam_token_id % vocab_size
effective_beam_id = batch_idx * num_beams + beam_id
# add to generated hypotheses if end of sentence
if (eos_token_id is not None) and (token_id.item() == eos_token_id):
# if beam_token does not belong to top num_beams tokens, it should not be added
is_beam_token_worse_than_top_num_beams = beam_token_rank >= num_beams
if is_beam_token_worse_than_top_num_beams:
continue
generated_hyps[batch_idx].add(
input_ids[effective_beam_id].clone(), beam_token_score.item(),
)
else:
# add next predicted token since it is not eos_token
next_sent_beam.append((beam_token_score, token_id, effective_beam_id))
# once the beam for next step is full, don't add more tokens to it.
if len(next_sent_beam) == num_beams:
break
# Check if we are done so that we can save a pad step if all(done)
done[batch_idx] = done[batch_idx] or generated_hyps[batch_idx].is_done(
next_scores[batch_idx].max().item(), cur_len
)
# update next beam content
assert len(next_sent_beam) == num_beams, "Beam should always be full"
next_batch_beam.extend(next_sent_beam)
assert len(next_batch_beam) == num_beams * (batch_idx + 1), "We should have added num_beams each step"
# stop when we are done with each sentence
if all(done):
break
# sanity check / prepare next batch
assert len(next_batch_beam) == batch_size * num_beams
beam_scores = beam_scores.new([x[0] for x in next_batch_beam])
beam_tokens = input_ids.new([x[1] for x in next_batch_beam])
beam_idx = input_ids.new([x[2] for x in next_batch_beam])
# re-order batch and update current length
input_ids = input_ids[beam_idx, :]
input_ids = torch.cat([input_ids, beam_tokens.unsqueeze(1)], dim=-1)
cur_len = cur_len + 1
# re-order internal states
if past is not None:
past = self._reorder_cache(past, beam_idx)
# extend attention_mask for new generated input if only decoder
if self.config.is_encoder_decoder is False:
attention_mask = torch.cat(
[attention_mask, attention_mask.new_ones((attention_mask.shape[0], 1))], dim=-1
)
# finalize all open beam hypotheses and add to generated hypotheses
for batch_idx in range(batch_size):
if done[batch_idx]:
continue
# test that beam scores match previously calculated scores if not eos and batch_idx not done
if eos_token_id is not None and all(
(token_id % vocab_size).item() != eos_token_id for token_id in next_tokens[batch_idx]
):
assert torch.all(
next_scores[batch_idx, :num_beams] == beam_scores.view(batch_size, num_beams)[batch_idx]
), "If batch_idx is not done, final next scores: {} have to equal to accumulated beam_scores: {}".format(
next_scores[:, :num_beams][batch_idx], beam_scores.view(batch_size, num_beams)[batch_idx],
)
# need to add best num_beams hypotheses to generated hyps
for beam_id in range(num_beams):
effective_beam_id = batch_idx * num_beams + beam_id
final_score = beam_scores[effective_beam_id].item()
final_tokens = input_ids[effective_beam_id]
generated_hyps[batch_idx].add(final_tokens, final_score)
# depending on whether greedy generation is wanted or not define different output_batch_size and output_num_return_sequences_per_batch
output_batch_size = batch_size if do_sample else batch_size * num_return_sequences
output_num_return_sequences_per_batch = 1 if do_sample else num_return_sequences
# select the best hypotheses
sent_lengths = input_ids.new(output_batch_size)
best = []
# retrieve best hypotheses
for i, hypotheses in enumerate(generated_hyps):
sorted_hyps = sorted(hypotheses.beams, key=lambda x: x[0])
for j in range(output_num_return_sequences_per_batch):
effective_batch_idx = output_num_return_sequences_per_batch * i + j
best_hyp = sorted_hyps.pop()[1]
sent_lengths[effective_batch_idx] = len(best_hyp)
best.append(best_hyp)
# shorter batches are padded
if sent_lengths.min().item() != sent_lengths.max().item():
assert pad_token_id is not None, "`Pad_token_id` has to be defined"
sent_max_len = min(sent_lengths.max().item() + 1, max_length)
decoded = input_ids.new(output_batch_size, sent_max_len).fill_(pad_token_id)
# fill with hypothesis and eos_token_id if necessary
for i, hypo in enumerate(best):
decoded[i, : sent_lengths[i]] = hypo
if sent_lengths[i] < max_length:
decoded[i, sent_lengths[i]] = eos_token_id
else:
# none of the hypotheses have an eos_token
assert (len(hypo) == max_length for hypo in best)
decoded = torch.stack(best).type(torch.long).to(next(self.parameters()).device)
return decoded
@staticmethod
def _reorder_cache(past: Tuple, beam_idx: Tensor) -> Tuple[Tensor]:
return tuple(layer_past.index_select(1, beam_idx) for layer_past in past)
def calc_banned_ngram_tokens(prev_input_ids: Tensor, num_hypos: int, no_repeat_ngram_size: int, cur_len: int) -> None:
"""Copied from fairseq for no_repeat_ngram in beam_search"""
if cur_len + 1 < no_repeat_ngram_size:
# return no banned tokens if we haven't generated no_repeat_ngram_size tokens yet
return [[] for _ in range(num_hypos)]
generated_ngrams = [{} for _ in range(num_hypos)]
for idx in range(num_hypos):
gen_tokens = prev_input_ids[idx].tolist()
generated_ngram = generated_ngrams[idx]
for ngram in zip(*[gen_tokens[i:] for i in range(no_repeat_ngram_size)]):
prev_ngram_tuple = tuple(ngram[:-1])
generated_ngram[prev_ngram_tuple] = generated_ngram.get(prev_ngram_tuple, []) + [ngram[-1]]
def _get_generated_ngrams(hypo_idx):
# Before decoding the next token, prevent decoding of ngrams that have already appeared
start_idx = cur_len + 1 - no_repeat_ngram_size
ngram_idx = tuple(prev_input_ids[hypo_idx, start_idx:cur_len].tolist())
return generated_ngrams[hypo_idx].get(ngram_idx, [])
banned_tokens = [_get_generated_ngrams(hypo_idx) for hypo_idx in range(num_hypos)]
return banned_tokens
def calc_banned_bad_words_ids(prev_input_ids: Iterable[int], bad_words_ids: Iterable[int]) -> Iterable[int]:
banned_tokens = []
def _tokens_match(prev_tokens, tokens):
if len(tokens) == 0:
# if bad word tokens is just one token always ban it
return True
if len(tokens) > len(prev_input_ids):
# if bad word tokens are longer then prev input_ids they can't be equal
return False
if prev_tokens[-len(tokens) :] == tokens:
# if tokens match
return True
else:
return False
for prev_input_ids_slice in prev_input_ids:
banned_tokens_slice = []
for banned_token_seq in bad_words_ids:
assert len(banned_token_seq) > 0, "Banned words token sequences {} cannot have an empty list".format(
bad_words_ids
)
if _tokens_match(prev_input_ids_slice.tolist(), banned_token_seq[:-1]) is False:
# if tokens do not match continue
continue
banned_tokens_slice.append(banned_token_seq[-1])
banned_tokens.append(banned_tokens_slice)
return banned_tokens
def top_k_top_p_filtering(
logits: Tensor,
top_k: int = 0,
top_p: float = 1.0,
filter_value: float = -float("Inf"),
min_tokens_to_keep: int = 1,
) -> Tensor:
""" Filter a distribution of logits using top-k and/or nucleus (top-p) filtering
Args:
logits: logits distribution shape (batch size, vocabulary size)
if top_k > 0: keep only top k tokens with highest probability (top-k filtering).
if top_p < 1.0: keep the top tokens with cumulative probability >= top_p (nucleus filtering).
Nucleus filtering is described in Holtzman et al. (http://arxiv.org/abs/1904.09751)
Make sure we keep at least min_tokens_to_keep per batch example in the output
From: https://gist.github.com/thomwolf/1a5a29f6962089e871b94cbd09daf317
"""
if top_k > 0:
top_k = min(max(top_k, min_tokens_to_keep), logits.size(-1)) # Safety check
# Remove all tokens with a probability less than the last token of the top-k
indices_to_remove = logits < torch.topk(logits, top_k)[0][..., -1, None]
logits[indices_to_remove] = filter_value
if top_p < 1.0:
sorted_logits, sorted_indices = torch.sort(logits, descending=True)
cumulative_probs = torch.cumsum(F.softmax(sorted_logits, dim=-1), dim=-1)
# Remove tokens with cumulative probability above the threshold (token with 0 are kept)
sorted_indices_to_remove = cumulative_probs > top_p
if min_tokens_to_keep > 1:
# Keep at least min_tokens_to_keep (set to min_tokens_to_keep-1 because we add the first one below)
sorted_indices_to_remove[..., :min_tokens_to_keep] = 0
# Shift the indices to the right to keep also the first token above the threshold
sorted_indices_to_remove[..., 1:] = sorted_indices_to_remove[..., :-1].clone()
sorted_indices_to_remove[..., 0] = 0
# scatter sorted tensors to original indexing
indices_to_remove = sorted_indices_to_remove.scatter(1, sorted_indices, sorted_indices_to_remove)
logits[indices_to_remove] = filter_value
return logits
class BeamHypotheses(object):
def __init__(self, num_beams, max_length, length_penalty, early_stopping):
"""
Initialize n-best list of hypotheses.
"""
self.max_length = max_length - 1 # ignoring bos_token
self.length_penalty = length_penalty
self.early_stopping = early_stopping
self.num_beams = num_beams
self.beams = []
self.worst_score = 1e9
def __len__(self):
"""
Number of hypotheses in the list.
"""
return len(self.beams)
def add(self, hyp, sum_logprobs):
"""
Add a new hypothesis to the list.
"""
score = sum_logprobs / len(hyp) ** self.length_penalty
if len(self) < self.num_beams or score > self.worst_score:
self.beams.append((score, hyp))
if len(self) > self.num_beams:
sorted_scores = sorted([(s, idx) for idx, (s, _) in enumerate(self.beams)])
del self.beams[sorted_scores[0][1]]
self.worst_score = sorted_scores[1][0]
else:
self.worst_score = min(score, self.worst_score)
def is_done(self, best_sum_logprobs, cur_len):
"""
If there are enough hypotheses and that none of the hypotheses being generated
can become better than the worst one in the heap, then we are done with this sentence.
"""
if len(self) < self.num_beams:
return False
elif self.early_stopping:
return True
else:
cur_score = best_sum_logprobs / cur_len ** self.length_penalty
ret = self.worst_score >= cur_score
return ret
class Conv1D(nn.Module):
def __init__(self, nf, nx):

View File

@@ -16,6 +16,7 @@
import logging
import math
from typing import Callable, Iterable, Tuple
import torch
from torch.optim import Optimizer
@@ -25,18 +26,40 @@ from torch.optim.lr_scheduler import LambdaLR
logger = logging.getLogger(__name__)
def get_constant_schedule(optimizer, last_epoch=-1):
""" Create a schedule with a constant learning rate.
def get_constant_schedule(optimizer: Optimizer, last_epoch: int = -1):
"""
Create a schedule with a constant learning rate, using the learning rate set in optimizer.
Args:
optimizer (:class:`~torch.optim.Optimizer`):
The optimizer for which to schedule the learning rate.
last_epoch (:obj:`int`, `optional`, defaults to -1):
The index of the last epoch when resuming training.
Return:
:obj:`torch.optim.lr_scheduler.LambdaLR` with the appropriate schedule.
"""
return LambdaLR(optimizer, lambda _: 1, last_epoch=last_epoch)
def get_constant_schedule_with_warmup(optimizer, num_warmup_steps, last_epoch=-1):
""" Create a schedule with a constant learning rate preceded by a warmup
period during which the learning rate increases linearly between 0 and 1.
def get_constant_schedule_with_warmup(optimizer: Optimizer, num_warmup_steps: int, last_epoch: int = -1):
"""
Create a schedule with a constant learning rate preceded by a warmup period during which the learning rate
increases linearly between 0 and the initial lr set in the optimizer.
Args:
optimizer (:class:`~torch.optim.Optimizer`):
The optimizer for which to schedule the learning rate.
num_warmup_steps (:obj:`int`):
The number of steps for the warmup phase.
last_epoch (:obj:`int`, `optional`, defaults to -1):
The index of the last epoch when resuming training.
Return:
:obj:`torch.optim.lr_scheduler.LambdaLR` with the appropriate schedule.
"""
def lr_lambda(current_step):
def lr_lambda(current_step: int):
if current_step < num_warmup_steps:
return float(current_step) / float(max(1.0, num_warmup_steps))
return 1.0
@@ -45,11 +68,25 @@ def get_constant_schedule_with_warmup(optimizer, num_warmup_steps, last_epoch=-1
def get_linear_schedule_with_warmup(optimizer, num_warmup_steps, num_training_steps, last_epoch=-1):
""" Create a schedule with a learning rate that decreases linearly after
linearly increasing during a warmup period.
"""
Create a schedule with a learning rate that decreases linearly from the initial lr set in the optimizer to 0,
after a warmup period during which it increases linearly from 0 to the initial lr set in the optimizer.
Args:
optimizer (:class:`~torch.optim.Optimizer`):
The optimizer for which to schedule the learning rate.
num_warmup_steps (:obj:`int`):
The number of steps for the warmup phase.
num_training_steps (:obj:`int`):
The totale number of training steps.
last_epoch (:obj:`int`, `optional`, defaults to -1):
The index of the last epoch when resuming training.
Return:
:obj:`torch.optim.lr_scheduler.LambdaLR` with the appropriate schedule.
"""
def lr_lambda(current_step):
def lr_lambda(current_step: int):
if current_step < num_warmup_steps:
return float(current_step) / float(max(1, num_warmup_steps))
return max(
@@ -59,10 +96,29 @@ def get_linear_schedule_with_warmup(optimizer, num_warmup_steps, num_training_st
return LambdaLR(optimizer, lr_lambda, last_epoch)
def get_cosine_schedule_with_warmup(optimizer, num_warmup_steps, num_training_steps, num_cycles=0.5, last_epoch=-1):
""" Create a schedule with a learning rate that decreases following the
values of the cosine function between 0 and `pi * cycles` after a warmup
period during which it increases linearly between 0 and 1.
def get_cosine_schedule_with_warmup(
optimizer: Optimizer, num_warmup_steps: int, num_training_steps: int, num_cycles: float = 0.5, last_epoch: int = -1
):
"""
Create a schedule with a learning rate that decreases following the values of the cosine function between the
initial lr set in the optimizer to 0, after a warmup period during which it increases linearly between 0 and the
initial lr set in the optimizer.
Args:
optimizer (:class:`~torch.optim.Optimizer`):
The optimizer for which to schedule the learning rate.
num_warmup_steps (:obj:`int`):
The number of steps for the warmup phase.
num_training_steps (:obj:`int`):
The total number of training steps.
num_cycles (:obj:`float`, `optional`, defaults to 0.5):
The number of waves in the cosine schedule (the defaults is to just decrease from the max value to 0
following a half-cosine).
last_epoch (:obj:`int`, `optional`, defaults to -1):
The index of the last epoch when resuming training.
Return:
:obj:`torch.optim.lr_scheduler.LambdaLR` with the appropriate schedule.
"""
def lr_lambda(current_step):
@@ -75,11 +131,27 @@ def get_cosine_schedule_with_warmup(optimizer, num_warmup_steps, num_training_st
def get_cosine_with_hard_restarts_schedule_with_warmup(
optimizer, num_warmup_steps, num_training_steps, num_cycles=1.0, last_epoch=-1
optimizer: Optimizer, num_warmup_steps: int, num_training_steps: int, num_cycles: int = 1, last_epoch: int = -1
):
""" Create a schedule with a learning rate that decreases following the
values of the cosine function with several hard restarts, after a warmup
period during which it increases linearly between 0 and 1.
"""
Create a schedule with a learning rate that decreases following the values of the cosine function between the
initial lr set in the optimizer to 0, with several hard restarts, after a warmup period during which it increases
linearly between 0 and the initial lr set in the optimizer.
Args:
optimizer (:class:`~torch.optim.Optimizer`):
The optimizer for which to schedule the learning rate.
num_warmup_steps (:obj:`int`):
The number of steps for the warmup phase.
num_training_steps (:obj:`int`):
The total number of training steps.
num_cycles (:obj:`int`, `optional`, defaults to 1):
The number of hard restarts to use.
last_epoch (:obj:`int`, `optional`, defaults to -1):
The index of the last epoch when resuming training.
Return:
:obj:`torch.optim.lr_scheduler.LambdaLR` with the appropriate schedule.
"""
def lr_lambda(current_step):
@@ -94,17 +166,34 @@ def get_cosine_with_hard_restarts_schedule_with_warmup(
class AdamW(Optimizer):
""" Implements Adam algorithm with weight decay fix.
"""
Implements Adam algorithm with weight decay fix as introduced in
`Decoupled Weight Decay Regularization <https://arxiv.org/abs/1711.05101>`__.
Parameters:
lr (float): learning rate. Default 1e-3.
betas (tuple of 2 floats): Adams beta parameters (b1, b2). Default: (0.9, 0.999)
eps (float): Adams epsilon. Default: 1e-6
weight_decay (float): Weight decay. Default: 0.0
correct_bias (bool): can be set to False to avoid correcting bias in Adam (e.g. like in Bert TF repository). Default True.
params (:obj:`Iterable[torch.nn.parameter.Parameter]`):
Iterable of parameters to optimize or dictionaries defining parameter groups.
lr (:obj:`float`, `optional`, defaults to 1e-3):
The learning rate to use.
betas (:obj:`Tuple[float,float]`, `optional`, defaults to (0.9, 0.999)):
Adam's betas parameters (b1, b2).
eps (:obj:`float`, `optional`, defaults to 1e-6):
Adam's epsilon for numerical stability.
weight_decay (:obj:`float`, `optional`, defaults to 0):
Decoupled weight decay to apply.
correct_bias (:obj:`bool`, `optional`, defaults to `True`):
Whether ot not to correct bias in Adam (for instance, in Bert TF repository they use :obj:`False`).
"""
def __init__(self, params, lr=1e-3, betas=(0.9, 0.999), eps=1e-6, weight_decay=0.0, correct_bias=True):
def __init__(
self,
params: Iterable[torch.nn.parameter.Parameter],
lr: float = 1e-3,
betas: Tuple[float, float] = (0.9, 0.999),
eps: float = 1e-6,
weight_decay: float = 0.0,
correct_bias: bool = True,
):
if lr < 0.0:
raise ValueError("Invalid learning rate: {} - should be >= 0.0".format(lr))
if not 0.0 <= betas[0] < 1.0:
@@ -116,12 +205,12 @@ class AdamW(Optimizer):
defaults = dict(lr=lr, betas=betas, eps=eps, weight_decay=weight_decay, correct_bias=correct_bias)
super().__init__(params, defaults)
def step(self, closure=None):
"""Performs a single optimization step.
def step(self, closure: Callable = None):
"""
Performs a single optimization step.
Arguments:
closure (callable, optional): A closure that reevaluates the model
and returns the loss.
closure (:obj:`Callable`, `optional`): A closure that reevaluates the model and returns the loss.
"""
loss = None
if closure is not None:

View File

@@ -16,15 +16,36 @@
import re
from typing import Callable, List, Optional, Union
import tensorflow as tf
class WarmUp(tf.keras.optimizers.schedules.LearningRateSchedule):
"""Applies a warmup schedule on a given learning rate decay schedule."""
"""
Applies a warmup schedule on a given learning rate decay schedule.
Args:
initial_learning_rate (:obj:`float`):
The initial learning rate for the schedule after the warmup (so this will be the learning rate at the end
of the warmup).
decay_schedule_fn (:obj:`Callable`):
The schedule function to apply after the warmup for the rest of training.
warmup_steps (:obj:`int`):
The number of steps for the warmup part of training.
power (:obj:`float`, `optional`, defaults to 1):
The power to use for the polynomial warmup (defaults is a linear warmup).
name (:obj:`str`, `optional`):
Optional name prefix for the returned tensors during the schedule.
"""
def __init__(
self, initial_learning_rate, decay_schedule_fn, warmup_steps, power=1.0, name=None,
self,
initial_learning_rate: float,
decay_schedule_fn: Callable,
warmup_steps: int,
power: float = 1.0,
name: str = None,
):
super().__init__()
self.initial_learning_rate = initial_learning_rate
@@ -59,15 +80,34 @@ class WarmUp(tf.keras.optimizers.schedules.LearningRateSchedule):
def create_optimizer(
init_lr,
num_train_steps,
num_warmup_steps,
min_lr_ratio=0.0,
adam_epsilon=1e-8,
weight_decay_rate=0.0,
include_in_weight_decay=None,
init_lr: float,
num_train_steps: int,
num_warmup_steps: int,
min_lr_ratio: float = 0.0,
adam_epsilon: float = 1e-8,
weight_decay_rate: float = 0.0,
include_in_weight_decay: Optional[List[str]] = None,
):
"""Creates an optimizer with learning rate schedule."""
"""
Creates an optimizer with a learning rate schedule using a warmup phase followed by a linear decay.
Args:
init_lr (:obj:`float`):
The desired learning rate at the end of the warmup phase.
num_train_step (:obj:`int`):
The total number of training steps.
num_warmup_steps (:obj:`int`):
The number of warmup steps.
min_lr_ratio (:obj:`float`, `optional`, defaults to 0):
The final learning rate at the end of the linear decay will be :obj:`init_lr * min_lr_ratio`.
adam_epsilon (:obj:`float`, `optional`, defaults to 1e-8):
The epsilon to use in Adam.
weight_decay_rate (:obj:`float`, `optional`, defaults to 0):
The weight decay to use.
include_in_weight_decay (:obj:`List[str]`, `optional`):
List of the parameter names (or re patterns) to apply weight decay to. If none is passed, weight decay is
applied to all parameters except bias and layer norm parameters.
"""
# Implements linear decay of the learning rate.
lr_schedule = tf.keras.optimizers.schedules.PolynomialDecay(
initial_learning_rate=init_lr,
@@ -96,26 +136,55 @@ def create_optimizer(
class AdamWeightDecay(tf.keras.optimizers.Adam):
"""Adam enables L2 weight decay and clip_by_global_norm on gradients.
Just adding the square of the weights to the loss function is *not* the
correct way of using L2 regularization/weight decay with Adam, since that will
interact with the m and v parameters in strange ways.
Instead we want ot decay the weights in a manner that doesn't interact with
the m/v parameters. This is equivalent to adding the square of the weights to
the loss with plain (non-momentum) SGD.
"""
Adam enables L2 weight decay and clip_by_global_norm on gradients. Just adding the square of the weights to the
loss function is *not* the correct way of using L2 regularization/weight decay with Adam, since that will interact
with the m and v parameters in strange ways as shown in
`Decoupled Weight Decay Regularization <https://arxiv.org/abs/1711.05101>`__.
Instead we want ot decay the weights in a manner that doesn't interact with the m/v parameters. This is equivalent
to adding the square of the weights to the loss with plain (non-momentum) SGD.
Args:
learning_rate (:obj:`Union[float, tf.keras.optimizers.schedules.LearningRateSchedule]`, `optional`, defaults to 1e-3):
The learning rate to use or a schedule.
beta_1 (:obj:`float`, `optional`, defaults to 0.9):
The beta1 parameter in Adam, which is the exponential decay rate for the 1st momentum estimates.
beta_2 (:obj:`float`, `optional`, defaults to 0.999):
The beta2 parameter in Adam, which is the exponential decay rate for the 2nd momentum estimates.
epsilon (:obj:`float`, `optional`, defaults to 1e-7):
The epsilon paramenter in Adam, which is a small constant for numerical stability.
amsgrad (:obj:`bool`, `optional`, default to `False`):
Wheter to apply AMSGrad varient of this algorithm or not, see
`On the Convergence of Adam and Beyond <https://arxiv.org/abs/1904.09237>`__.
weight_decay_rate (:obj:`float`, `optional`, defaults to 0):
The weight decay to apply.
include_in_weight_decay (:obj:`List[str]`, `optional`):
List of the parameter names (or re patterns) to apply weight decay to. If none is passed, weight decay is
applied to all parameters by default (unless they are in :obj:`exclude_from_weight_decay`).
exclude_from_weight_decay (:obj:`List[str]`, `optional`):
List of the parameter names (or re patterns) to exclude from applying weight decay to. If a
:obj:`include_in_weight_decay` is passed, the names in it will supersede this list.
name (:obj:`str`, `optional`, defaults to 'AdamWeightDecay'):
Optional name for the operations created when applying gradients.
kwargs:
Keyward arguments. Allowed to be {``clipnorm``, ``clipvalue``, ``lr``, ``decay``}. ``clipnorm`` is clip
gradients by norm; ``clipvalue`` is clip gradients by value, ``decay`` is included for backward
compatibility to allow time inverse decay of learning rate. ``lr`` is included for backward compatibility,
recommended to use ``learning_rate`` instead.
"""
def __init__(
self,
learning_rate=0.001,
beta_1=0.9,
beta_2=0.999,
epsilon=1e-7,
amsgrad=False,
weight_decay_rate=0.0,
include_in_weight_decay=None,
exclude_from_weight_decay=None,
name="AdamWeightDecay",
learning_rate: Union[float, tf.keras.optimizers.schedules.LearningRateSchedule] = 0.001,
beta_1: float = 0.9,
beta_2: float = 0.999,
epsilon: float = 1e-7,
amsgrad: bool = False,
weight_decay_rate: float = 0.0,
include_in_weight_decay: Optional[List[str]] = None,
exclude_from_weight_decay: Optional[List[str]] = None,
name: str = "AdamWeightDecay",
**kwargs
):
super().__init__(learning_rate, beta_1, beta_2, epsilon, amsgrad, name, **kwargs)

View File

@@ -87,6 +87,18 @@ def get_framework(model=None):
return framework
class PipelineException(Exception):
"""
Raised by pipelines when handling __call__
"""
def __init__(self, task: str, model: str, reason: str):
super().__init__(reason)
self.task = task
self.model = model
class ArgumentHandler(ABC):
"""
Base interface for handling varargs for each Pipeline
@@ -603,6 +615,28 @@ class TextGenerationPipeline(Pipeline):
"TFCTRLLMHeadModel",
]
# overriding _parse_and_tokenize to allow for unusual language-modeling tokenizer arguments
def _parse_and_tokenize(self, *args, padding=True, add_special_tokens=True, **kwargs):
"""
Parse arguments and tokenize
"""
# Parse arguments
if self.model.__class__.__name__ in ["TransfoXLLMHeadModel"]:
tokenizer_kwargs = {"add_space_before_punct_symbol": True}
else:
tokenizer_kwargs = {}
inputs = self._args_parser(*args, **kwargs)
inputs = self.tokenizer(
inputs,
add_special_tokens=add_special_tokens,
return_tensors=self.framework,
padding=padding,
**tokenizer_kwargs,
)
return inputs
def __call__(
self, *args, return_tensors=False, return_text=True, clean_up_tokenization_spaces=False, **generate_kwargs
):
@@ -808,6 +842,21 @@ class FillMaskPipeline(Pipeline):
self.topk = topk
def ensure_exactly_one_mask_token(self, masked_index: np.ndarray):
numel = np.prod(masked_index.shape)
if numel > 1:
raise PipelineException(
"fill-mask",
self.model.base_model_prefix,
f"More than one mask_token ({self.tokenizer.mask_token}) is not supported",
)
elif numel < 1:
raise PipelineException(
"fill-mask",
self.model.base_model_prefix,
f"No mask_token ({self.tokenizer.mask_token}) found on the input",
)
def __call__(self, *args, **kwargs):
inputs = self._parse_and_tokenize(*args, **kwargs)
outputs = self._forward(inputs, return_tensors=True)
@@ -820,14 +869,22 @@ class FillMaskPipeline(Pipeline):
result = []
if self.framework == "tf":
masked_index = tf.where(input_ids == self.tokenizer.mask_token_id).numpy().item()
logits = outputs[i, masked_index, :]
masked_index = tf.where(input_ids == self.tokenizer.mask_token_id).numpy()
# Fill mask pipeline supports only one ${mask_token} per sample
self.ensure_exactly_one_mask_token(masked_index)
logits = outputs[i, masked_index.item(), :]
probs = tf.nn.softmax(logits)
topk = tf.math.top_k(probs, k=self.topk)
values, predictions = topk.values.numpy(), topk.indices.numpy()
else:
masked_index = (input_ids == self.tokenizer.mask_token_id).nonzero().item()
logits = outputs[i, masked_index, :]
masked_index = (input_ids == self.tokenizer.mask_token_id).nonzero()
# Fill mask pipeline supports only one ${mask_token} per sample
self.ensure_exactly_one_mask_token(masked_index.numpy())
logits = outputs[i, masked_index.item(), :]
probs = logits.softmax(dim=0)
values, predictions = probs.topk(self.topk)
@@ -987,6 +1044,10 @@ class TokenClassificationPipeline(Pipeline):
entities += [entity]
# Ensure if an entity is the latest one in the sequence it gets appended to the output
if len(entity_group_disagg) > 0:
entity_groups.append(self.group_entities(entity_group_disagg))
# Append
if self.grouped_entities:
answers += [entity_groups]
@@ -1211,33 +1272,34 @@ class QuestionAnsweringPipeline(Pipeline):
with self.device_placement():
if self.framework == "tf":
fw_args = {k: tf.constant(v) for (k, v) in fw_args.items()}
start, end = self.model(fw_args)
start, end = self.model(fw_args)[:2]
start, end = start.numpy(), end.numpy()
else:
with torch.no_grad():
# Retrieve the score for the context tokens only (removing question tokens)
fw_args = {k: torch.tensor(v, device=self.device) for (k, v) in fw_args.items()}
start, end = self.model(**fw_args)
start, end = self.model(**fw_args)[:2]
start, end = start.cpu().numpy(), end.cpu().numpy()
min_null_score = 1000000 # large and positive
answers = []
for (feature, start_, end_) in zip(features, start, end):
# Normalize logits and spans to retrieve the answer
start_ = np.exp(start_) / np.sum(np.exp(start_))
end_ = np.exp(end_) / np.sum(np.exp(end_))
# Mask padding and question
start_, end_ = (
start_ * np.abs(np.array(feature.p_mask) - 1),
end_ * np.abs(np.array(feature.p_mask) - 1),
)
# Mask CLS
start_[0] = end_[0] = 0
# Normalize logits and spans to retrieve the answer
start_ = np.exp(start_ - np.log(np.sum(np.exp(start_), axis=-1, keepdims=True)))
end_ = np.exp(end_ - np.log(np.sum(np.exp(end_), axis=-1, keepdims=True)))
if kwargs["handle_impossible_answer"]:
min_null_score = min(min_null_score, (start_[0] * end_[0]).item())
start_[0] = end_[0] = 0
starts, ends, scores = self.decode(start_, end_, kwargs["topk"], kwargs["max_answer_len"])
char_to_word = np.array(example.char_to_word_offset)

View File

@@ -606,7 +606,7 @@ class BertTokenizerFast(PreTrainedTokenizerFast):
mask_token="[MASK]",
clean_text=True,
tokenize_chinese_chars=True,
strip_accents=True,
strip_accents=None,
wordpieces_prefix="##",
**kwargs
):

View File

@@ -102,17 +102,26 @@ def get_pairs(word):
class GPT2Tokenizer(PreTrainedTokenizer):
"""
GPT-2 BPE tokenizer. Peculiarities:
GPT-2 BPE tokenizer, using byte-level Byte-Pair-Encoding.
- Byte-level Byte-Pair-Encoding
- Requires a space to start the input string => the encoding methods should be called with the
``add_prefix_space`` flag set to ``True``.
Otherwise, this tokenizer ``encode`` and ``decode`` method will not conserve
the absence of a space at the beginning of a string:
This tokenizer has been trained to treat spaces like parts of the tokens (a bit like sentencepiece) so a word will
be encoded differently whether it is at the beginning of the sentence (without space) or not:
::
tokenizer.decode(tokenizer.encode("Hello")) = " Hello"
>>> from transformers import GPT2Tokenizer
>>> tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
>>> tokenizer("Hello world")['input_ids']
[15496, 995]
>>> tokenizer(" Hello world")['input_ids']
[18435, 995]
You can get around that behavior by passing ``add_prefix_space=True`` when instantiating this tokenizer or when you
call it on some text, but since the model was not pretrained this way, it might yield a decrease in performance.
.. note::
When used with ``is_pretokenized=True``, this tokenizer will add a space before each word (even the first one).
This tokenizer inherits from :class:`~transformers.PreTrainedTokenizer` which contains most of the methods. Users
should refer to the superclass for more information regarding methods.
@@ -137,6 +146,7 @@ class GPT2Tokenizer(PreTrainedTokenizer):
vocab_files_names = VOCAB_FILES_NAMES
pretrained_vocab_files_map = PRETRAINED_VOCAB_FILES_MAP
max_model_input_sizes = PRETRAINED_POSITIONAL_EMBEDDINGS_SIZES
model_input_names = ["attention_mask"]
def __init__(
self,
@@ -287,21 +297,30 @@ class GPT2Tokenizer(PreTrainedTokenizer):
class GPT2TokenizerFast(PreTrainedTokenizerFast):
"""
Constructs a "Fast" GPT-2 BPE tokenizer (backed by HuggingFace's `tokenizers` library).
Constructs a "Fast" GPT-2 BPE tokenizer (backed by HuggingFace's `tokenizers` library), using byte-level
Byte-Pair-Encoding.
Peculiarities:
- Byte-level Byte-Pair-Encoding
- Requires a space to start the input string => the encoding methods should be called with the
``add_prefix_space`` flag set to ``True``.
Otherwise, this tokenizer ``encode`` and ``decode`` method will not conserve
the absence of a space at the beginning of a string:
This tokenizer has been trained to treat spaces like parts of the tokens (a bit like sentencepiece) so a word will
be encoded differently whether it is at the beginning of the sentence (without space) or not:
::
tokenizer.decode(tokenizer.encode("Hello")) = " Hello"
>>> from transformers import GPT2TokenizerFast
>>> tokenizer = GPT2TokenizerFast.from_pretrained("gpt2")
>>> tokenizer("Hello world")['input_ids']
[15496, 995]
>>> tokenizer(" Hello world")['input_ids']
[18435, 995]
This tokenizer inherits from :class:`~transformers.PreTrainedTokenizerFast` which contains most of the methods. Users
You can get around that behavior by passing ``add_prefix_space=True`` when instantiating this tokenizer or when you
call it on some text, but since the model was not pretrained this way, it might yield a decrease in performance.
.. note::
When used with ``is_pretokenized=True``, this tokenizer needs to be instantiated with
``add_prefix_space=True``.
This tokenizer inherits from :class:`~transformers.PreTrainedTokenizer` which contains most of the methods. Users
should refer to the superclass for more information regarding methods.
Args:
@@ -330,6 +349,7 @@ class GPT2TokenizerFast(PreTrainedTokenizerFast):
vocab_files_names = VOCAB_FILES_NAMES
pretrained_vocab_files_map = PRETRAINED_VOCAB_FILES_MAP
max_model_input_sizes = PRETRAINED_POSITIONAL_EMBEDDINGS_SIZES
model_input_names = ["attention_mask"]
def __init__(
self,

View File

@@ -129,6 +129,8 @@ class MarianTokenizer(PreTrainedTokenizer):
max_length: Optional[int] = None,
pad_to_max_length: bool = True,
return_tensors: str = "pt",
truncation_strategy="only_first",
padding="longest",
) -> BatchEncoding:
"""Prepare model inputs for translation. For best performance, translate one sentence at a time.
Arguments:
@@ -147,24 +149,21 @@ class MarianTokenizer(PreTrainedTokenizer):
raise ValueError(f"found empty string in src_texts: {src_texts}")
self.current_spm = self.spm_source
src_texts = [self.normalize(t) for t in src_texts] # this does not appear to do much
model_inputs: BatchEncoding = self.batch_encode_plus(
src_texts,
tokenizer_kwargs = dict(
add_special_tokens=True,
return_tensors=return_tensors,
max_length=max_length,
pad_to_max_length=pad_to_max_length,
truncation_strategy=truncation_strategy,
padding=padding,
)
model_inputs: BatchEncoding = self(src_texts, **tokenizer_kwargs)
if tgt_texts is None:
return model_inputs
self.current_spm = self.spm_target
decoder_inputs: BatchEncoding = self.batch_encode_plus(
tgt_texts,
add_special_tokens=True,
return_tensors=return_tensors,
max_length=max_length,
pad_to_max_length=pad_to_max_length,
)
decoder_inputs: BatchEncoding = self(tgt_texts, **tokenizer_kwargs)
for k, v in decoder_inputs.items():
model_inputs[f"decoder_{k}"] = v
self.current_spm = self.spm_source

View File

@@ -96,6 +96,7 @@ class OpenAIGPTTokenizer(PreTrainedTokenizer):
vocab_files_names = VOCAB_FILES_NAMES
pretrained_vocab_files_map = PRETRAINED_VOCAB_FILES_MAP
max_model_input_sizes = PRETRAINED_POSITIONAL_EMBEDDINGS_SIZES
model_input_names = ["attention_mask"]
def __init__(self, vocab_file, merges_file, unk_token="<unk>", **kwargs):
super().__init__(unk_token=unk_token, **kwargs)
@@ -261,6 +262,7 @@ class OpenAIGPTTokenizerFast(PreTrainedTokenizerFast):
vocab_files_names = VOCAB_FILES_NAMES
pretrained_vocab_files_map = PRETRAINED_VOCAB_FILES_MAP
max_model_input_sizes = PRETRAINED_POSITIONAL_EMBEDDINGS_SIZES
model_input_names = ["attention_mask"]
def __init__(self, vocab_file, merges_file, unk_token="<unk>", **kwargs):
kwargs.setdefault("unk_token", unk_token)

View File

@@ -62,17 +62,26 @@ PRETRAINED_POSITIONAL_EMBEDDINGS_SIZES = {
class RobertaTokenizer(GPT2Tokenizer):
"""
Constructs a RoBERTa BPE tokenizer, derived from the GPT-2 tokenizer. Peculiarities:
Constructs a RoBERTa BPE tokenizer, derived from the GPT-2 tokenizer, using byte-level Byte-Pair-Encoding.
- Byte-level Byte-Pair-Encoding
- Requires a space to start the input string => the encoding methods should be called with the
``add_prefix_space`` flag set to ``True``.
Otherwise, this tokenizer ``encode`` and ``decode`` method will not conserve
the absence of a space at the beginning of a string:
This tokenizer has been trained to treat spaces like parts of the tokens (a bit like sentencepiece) so a word will
be encoded differently whether it is at the beginning of the sentence (without space) or not:
::
tokenizer.decode(tokenizer.encode("Hello")) = " Hello"
>>> from transformers import RobertaTokenizer
>>> tokenizer = RobertaTokenizer.from_pretrained("roberta-base")
>>> tokenizer("Hello world")['input_ids']
[0, 31414, 232, 328, 2]
>>> tokenizer(" Hello world")['input_ids']
[0, 20920, 232, 2]
You can get around that behavior by passing ``add_prefix_space=True`` when instantiating this tokenizer or when you
call it on some text, but since the model was not pretrained this way, it might yield a decrease in performance.
.. note::
When used with ``is_pretokenized=True``, this tokenizer will add a space before each word (even the first one).
This tokenizer inherits from :class:`~transformers.PreTrainedTokenizer` which contains most of the methods. Users
should refer to the superclass for more information regarding methods.
@@ -251,19 +260,28 @@ class RobertaTokenizer(GPT2Tokenizer):
class RobertaTokenizerFast(GPT2TokenizerFast):
"""
Constructs a "Fast" RoBERTa BPE tokenizer (backed by HuggingFace's `tokenizers` library).
Constructs a "Fast" RoBERTa BPE tokenizer (backed by HuggingFace's `tokenizers` library), derived from the GPT-2
tokenizer, using byte-level Byte-Pair-Encoding.
Peculiarities:
- Byte-level Byte-Pair-Encoding
- Requires a space to start the input string => the encoding methods should be called with the
``add_prefix_space`` flag set to ``True``.
Otherwise, this tokenizer ``encode`` and ``decode`` method will not conserve
the absence of a space at the beginning of a string:
This tokenizer has been trained to treat spaces like parts of the tokens (a bit like sentencepiece) so a word will
be encoded differently whether it is at the beginning of the sentence (without space) or not:
::
tokenizer.decode(tokenizer.encode("Hello")) = " Hello"
>>> from transformers import RobertaTokenizerFast
>>> tokenizer = RobertaTokenizerFast.from_pretrained("roberta-base")
>>> tokenizer("Hello world")['input_ids']
[0, 31414, 232, 328, 2]
>>> tokenizer(" Hello world")['input_ids']
[0, 20920, 232, 2]
You can get around that behavior by passing ``add_prefix_space=True`` when instantiating this tokenizer or when you
call it on some text, but since the model was not pretrained this way, it might yield a decrease in performance.
.. note::
When used with ``is_pretokenized=True``, this tokenizer needs to be instantiated with
``add_prefix_space=True``.
This tokenizer inherits from :class:`~transformers.PreTrainedTokenizerFast` which contains most of the methods. Users
should refer to the superclass for more information regarding methods.

View File

@@ -454,12 +454,12 @@ class PreTrainedTokenizer(PreTrainedTokenizerBase):
first_ids = get_input_ids(text)
second_ids = get_input_ids(text_pair) if text_pair is not None else None
return self._prepare_for_model(
return self.prepare_for_model(
first_ids,
pair_ids=second_ids,
add_special_tokens=add_special_tokens,
padding_strategy=padding_strategy,
truncation_strategy=truncation_strategy,
padding=padding_strategy.value,
truncation=truncation_strategy.value,
max_length=max_length,
stride=stride,
pad_to_multiple_of=pad_to_multiple_of,
@@ -584,12 +584,12 @@ class PreTrainedTokenizer(PreTrainedTokenizerBase):
batch_outputs = {}
for first_ids, second_ids in batch_ids_pairs:
outputs = self._prepare_for_model(
outputs = self.prepare_for_model(
first_ids,
second_ids,
add_special_tokens=add_special_tokens,
padding_strategy=PaddingStrategy.DO_NOT_PAD, # we pad in batch afterward
truncation_strategy=truncation_strategy,
padding=PaddingStrategy.DO_NOT_PAD.value, # we pad in batch afterward
truncation=truncation_strategy.value,
max_length=max_length,
stride=stride,
pad_to_multiple_of=None, # we pad in batch afterward
@@ -620,109 +620,6 @@ class PreTrainedTokenizer(PreTrainedTokenizerBase):
return batch_outputs
@add_end_docstrings(ENCODE_KWARGS_DOCSTRING, ENCODE_PLUS_ADDITIONAL_KWARGS_DOCSTRING)
def _prepare_for_model(
self,
ids: List[int],
pair_ids: Optional[List[int]] = None,
add_special_tokens: bool = True,
padding_strategy: PaddingStrategy = PaddingStrategy.DO_NOT_PAD,
truncation_strategy: TruncationStrategy = TruncationStrategy.DO_NOT_TRUNCATE,
max_length: Optional[int] = None,
stride: int = 0,
pad_to_multiple_of: Optional[int] = None,
return_tensors: Optional[str] = None,
prepend_batch_axis: bool = False,
return_token_type_ids: Optional[bool] = None,
return_attention_mask: Optional[bool] = None,
return_overflowing_tokens: bool = False,
return_special_tokens_mask: bool = False,
return_length: bool = False,
verbose: bool = True,
) -> BatchEncoding:
""" Prepares a sequence of input id, or a pair of sequences of inputs ids so that it can be used by the model.
It adds special tokens, truncates sequences if overflowing while taking into account the special tokens and
manages a moving window (with user defined stride) for overflowing tokens
Args:
ids: list of tokenized input ids. Can be obtained from a string by chaining the
`tokenize` and `convert_tokens_to_ids` methods.
pair_ids: Optional second list of input ids. Can be obtained from a string by chaining the
`tokenize` and `convert_tokens_to_ids` methods.
"""
pair = bool(pair_ids is not None)
len_ids = len(ids)
len_pair_ids = len(pair_ids) if pair else 0
# Load from model defaults
if return_token_type_ids is None:
return_token_type_ids = "token_type_ids" in self.model_input_names
if return_attention_mask is None:
return_attention_mask = "attention_mask" in self.model_input_names
encoded_inputs = {}
# Compute the total size of the returned encodings
total_len = len_ids + len_pair_ids + (self.num_special_tokens_to_add(pair=pair) if add_special_tokens else 0)
# Truncation: Handle max sequence length
if truncation_strategy != TruncationStrategy.DO_NOT_TRUNCATE and max_length and total_len > max_length:
ids, pair_ids, overflowing_tokens = self.truncate_sequences(
ids,
pair_ids=pair_ids,
num_tokens_to_remove=total_len - max_length,
truncation_strategy=truncation_strategy,
stride=stride,
)
if return_overflowing_tokens:
encoded_inputs["overflowing_tokens"] = overflowing_tokens
encoded_inputs["num_truncated_tokens"] = total_len - max_length
# Add special tokens
if add_special_tokens:
sequence = self.build_inputs_with_special_tokens(ids, pair_ids)
token_type_ids = self.create_token_type_ids_from_sequences(ids, pair_ids)
else:
sequence = ids + pair_ids if pair else ids
token_type_ids = [0] * len(ids) + ([1] * len(pair_ids) if pair else [])
# Build output dictionnary
encoded_inputs["input_ids"] = sequence
if return_token_type_ids:
encoded_inputs["token_type_ids"] = token_type_ids
if return_special_tokens_mask:
if add_special_tokens:
encoded_inputs["special_tokens_mask"] = self.get_special_tokens_mask(ids, pair_ids)
else:
encoded_inputs["special_tokens_mask"] = [0] * len(sequence)
# Check lengths
if max_length is None and len(encoded_inputs["input_ids"]) > self.model_max_length and verbose:
logger.warning(
"Token indices sequence length is longer than the specified maximum sequence length "
"for this model ({} > {}). Running this sequence through the model will result in "
"indexing errors".format(len(ids), self.model_max_length)
)
# Padding
if padding_strategy != PaddingStrategy.DO_NOT_PAD or return_attention_mask:
encoded_inputs = self.pad(
encoded_inputs,
max_length=max_length,
padding=padding_strategy.value,
pad_to_multiple_of=pad_to_multiple_of,
return_attention_mask=return_attention_mask,
)
if return_length:
encoded_inputs["length"] = len(encoded_inputs["input_ids"])
batch_outputs = BatchEncoding(
encoded_inputs, tensor_type=return_tensors, prepend_batch_axis=prepend_batch_axis
)
return batch_outputs
def prepare_for_tokenization(self, text: str, is_pretokenized=False, **kwargs) -> (str, dict):
""" Performs any necessary transformations before tokenization.
@@ -731,90 +628,6 @@ class PreTrainedTokenizer(PreTrainedTokenizerBase):
"""
return (text, kwargs)
def truncate_sequences(
self,
ids: List[int],
pair_ids: Optional[List[int]] = None,
num_tokens_to_remove: int = 0,
truncation_strategy: Union[str, TruncationStrategy] = "only_first",
stride: int = 0,
) -> Tuple[List[int], List[int], List[int]]:
""" Truncates a sequence pair in place to the maximum length.
Args:
ids: list of tokenized input ids. Can be obtained from a string by chaining the
`tokenize` and `convert_tokens_to_ids` methods.
pair_ids: Optional second list of input ids. Can be obtained from a string by chaining the
`tokenize` and `convert_tokens_to_ids` methods.
num_tokens_to_remove (:obj:`int`, `optional`, defaults to ``0``):
number of tokens to remove using the truncation strategy
truncation_strategy (:obj:`string`, `optional`, defaults to "only_first"):
String selected in the following options:
- 'only_first' (default): Only truncate the first sequence. raise an error if the first sequence is shorter or equal to than num_tokens_to_remove.
- 'only_second': Only truncate the second sequence
- 'longest_first': Iteratively reduce the inputs sequence until the input is under max_length
starting from the longest one at each token (when there is a pair of input sequences).
Overflowing tokens only contains overflow from the first sequence.
- 'do_not_truncate'
stride (:obj:`int`, `optional`, defaults to ``0``):
If set to a number along with max_length, the overflowing tokens returned will contain some tokens
from the main sequence returned. The value of this argument defines the number of additional tokens.
"""
if num_tokens_to_remove <= 0:
return ids, pair_ids, []
if not isinstance(truncation_strategy, TruncationStrategy):
truncation_strategy = TruncationStrategy(truncation_strategy)
overflowing_tokens = []
if truncation_strategy == TruncationStrategy.LONGEST_FIRST:
for _ in range(num_tokens_to_remove):
if pair_ids is None or len(ids) > len(pair_ids):
ids = ids[:-1]
else:
pair_ids = pair_ids[:-1]
elif truncation_strategy == TruncationStrategy.ONLY_FIRST:
if len(ids) > num_tokens_to_remove:
window_len = min(len(ids), stride + num_tokens_to_remove)
overflowing_tokens = ids[-window_len:]
ids = ids[:-num_tokens_to_remove]
else:
logger.error(
f"We need to remove {num_tokens_to_remove} to truncate the input"
f"but the first sequence has a length {len(ids)}. "
f"Please select another truncation strategy than {truncation_strategy}, "
f"for instance 'longest_first' or 'only_second'."
)
elif truncation_strategy == TruncationStrategy.ONLY_SECOND and pair_ids is not None:
if len(pair_ids) > num_tokens_to_remove:
window_len = min(len(pair_ids), stride + num_tokens_to_remove)
overflowing_tokens = pair_ids[-window_len:]
pair_ids = pair_ids[:-num_tokens_to_remove]
else:
logger.error(
f"We need to remove {num_tokens_to_remove} to truncate the input"
f"but the second sequence has a length {len(pair_ids)}. "
f"Please select another truncation strategy than {truncation_strategy}, "
f"for instance 'longest_first' or 'only_first'."
)
return (ids, pair_ids, overflowing_tokens)
def create_token_type_ids_from_sequences(self, token_ids_0: List, token_ids_1: Optional[List] = None) -> List[int]:
if token_ids_1 is None:
return len(token_ids_0) * [0]
return [0] * len(token_ids_0) + [1] * len(token_ids_1)
def build_inputs_with_special_tokens(self, token_ids_0: List, token_ids_1: Optional[List] = None) -> List:
"""
Build model inputs from a sequence or a pair of sequence for sequence classification tasks
by concatenating and adding special tokens. This implementation does not add special tokens.
"""
if token_ids_1 is None:
return token_ids_0
return token_ids_0 + token_ids_1
def get_special_tokens_mask(
self, token_ids_0: List, token_ids_1: Optional[List] = None, already_has_special_tokens: bool = False
) -> List[int]:
@@ -836,7 +649,7 @@ class PreTrainedTokenizer(PreTrainedTokenizerBase):
def convert_ids_to_tokens(
self, ids: Union[int, List[int]], skip_special_tokens: bool = False
) -> Union[int, List[int]]:
) -> Union[str, List[str]]:
""" Converts a single index or a sequence of indices (integers) in a token "
(resp.) a sequence of tokens (str), using the vocabulary and added tokens.

View File

@@ -945,9 +945,9 @@ ENCODE_KWARGS_DOCSTRING = r"""
`truncation` (:obj:`Union[bool, str]`, `optional`, defaults to :obj:`False`):
Activate and control truncation. Accepts the following values:
* `True` or `'only_first'`: truncate to a max length specified in `max_length` or to the max acceptable input length for the model if no length is provided (`max_length=None`). This will only truncate the first sequence of a pair if a pair of sequences (or a batch of pairs) is provided,
* `True` or `'longest_first'`: truncate to a max length specified in `max_length` or to the max acceptable input length for the model if no length is provided (`max_length=None`). This will truncate token by token, removing a token from the longest sequence in the pair if a pair of sequences (or a batch of pairs) is provided,
* `'only_first'`: truncate to a max length specified in `max_length` or to the max acceptable input length for the model if no length is provided (`max_length=None`). This will only truncate the first sequence of a pair if a pair of sequences (or a batch of pairs) is provided,
* `'only_second'`: truncate to a max length specified in `max_length` or to the max acceptable input length for the model if no length is provided (`max_length=None`). This will only truncate the second sequence of a pair if a pair of sequences (or a batch of pairs) is provided,
* `'longest_first'`: truncate to a max length specified in `max_length` or to the max acceptable input length for the model if no length is provided (`max_length=None`). This will truncate token by token, removing a token from the longest sequence in the pair if a pair of sequences (or a batch of pairs) is provided,
* `False` or `'do_not_truncate'` (default): No truncation (i.e. can output batch with sequences length greater than the model max admissible input size)
`max_length` (:obj:`Union[int, None]`, `optional`, defaults to :obj:`None`):
Control the length for padding/truncation. Accepts the following values
@@ -1446,10 +1446,11 @@ class PreTrainedTokenizerBase(SpecialTokensMixin):
logger.warning(
"Truncation was not explicitely activated but `max_length` is provided a specific value, "
"please use `truncation=True` to explicitely truncate examples to max length. "
"Defaulting to 'only_first' truncation strategy. "
"If you encode pairs of sequences (GLUE-style) with the tokenizer you may want to check this is the right behavior."
"Defaulting to 'longest_first' truncation strategy. "
"If you encode pairs of sequences (GLUE-style) with the tokenizer you can select this strategy "
"more precisely by providing a specific strategy to `truncation`."
)
truncation = "only_first"
truncation = "longest_first"
# Get padding strategy
if padding is False and old_pad_to_max_length:
@@ -1469,7 +1470,7 @@ class PreTrainedTokenizerBase(SpecialTokensMixin):
elif padding is not False:
if padding is True:
padding_strategy = PaddingStrategy.LONGEST # Default to pad to the longest sequence in the batch
else:
elif not isinstance(padding, PaddingStrategy):
padding_strategy = PaddingStrategy(padding)
else:
padding_strategy = PaddingStrategy.DO_NOT_PAD
@@ -1492,9 +1493,9 @@ class PreTrainedTokenizerBase(SpecialTokensMixin):
elif truncation is not False:
if truncation is True:
truncation_strategy = (
TruncationStrategy.ONLY_FIRST
) # Default to truncate the first sequences in pairs of inputs
else:
TruncationStrategy.LONGEST_FIRST
) # Default to truncate the longest sequences in pairs of inputs
elif not isinstance(truncation, TruncationStrategy):
truncation_strategy = TruncationStrategy(truncation)
else:
truncation_strategy = TruncationStrategy.DO_NOT_TRUNCATE
@@ -1960,6 +1961,225 @@ class PreTrainedTokenizerBase(SpecialTokensMixin):
return BatchEncoding(batch_outputs, tensor_type=return_tensors)
def create_token_type_ids_from_sequences(self, token_ids_0: List, token_ids_1: Optional[List] = None) -> List[int]:
if token_ids_1 is None:
return len(token_ids_0) * [0]
return [0] * len(token_ids_0) + [1] * len(token_ids_1)
def build_inputs_with_special_tokens(self, token_ids_0: List, token_ids_1: Optional[List] = None) -> List:
"""
Build model inputs from a sequence or a pair of sequence for sequence classification tasks
by concatenating and adding special tokens. This implementation does not add special tokens.
"""
if token_ids_1 is None:
return token_ids_0
return token_ids_0 + token_ids_1
@add_end_docstrings(ENCODE_KWARGS_DOCSTRING, ENCODE_PLUS_ADDITIONAL_KWARGS_DOCSTRING)
def prepare_for_model(
self,
ids: List[int],
pair_ids: Optional[List[int]] = None,
add_special_tokens: bool = True,
padding: Union[bool, str] = False,
truncation: Union[bool, str] = False,
max_length: Optional[int] = None,
stride: int = 0,
pad_to_multiple_of: Optional[int] = None,
return_tensors: Optional[Union[str, TensorType]] = None,
return_token_type_ids: Optional[bool] = None,
return_attention_mask: Optional[bool] = None,
return_overflowing_tokens: bool = False,
return_special_tokens_mask: bool = False,
return_offsets_mapping: bool = False,
return_length: bool = False,
verbose: bool = True,
prepend_batch_axis: bool = False,
**kwargs
) -> BatchEncoding:
""" Prepares a sequence of input id, or a pair of sequences of inputs ids so that it can be used by the model.
It adds special tokens, truncates sequences if overflowing while taking into account the special tokens and
manages a moving window (with user defined stride) for overflowing tokens
Args:
ids: list of tokenized input ids. Can be obtained from a string by chaining the
`tokenize` and `convert_tokens_to_ids` methods.
pair_ids: Optional second list of input ids. Can be obtained from a string by chaining the
`tokenize` and `convert_tokens_to_ids` methods.
"""
if "return_lengths" in kwargs:
if verbose:
warnings.warn(
"The PreTrainedTokenizerBase.prepare_for_model `return_lengths` parameter is deprecated. "
"Please use `return_length` instead.",
FutureWarning,
)
return_length = kwargs["return_lengths"]
# Backward compatibility for 'truncation_strategy', 'pad_to_max_length'
padding_strategy, truncation_strategy, max_length, kwargs = self._get_padding_truncation_strategies(
padding=padding,
truncation=truncation,
max_length=max_length,
pad_to_multiple_of=pad_to_multiple_of,
verbose=verbose,
**kwargs,
)
pair = bool(pair_ids is not None)
len_ids = len(ids)
len_pair_ids = len(pair_ids) if pair else 0
# Load from model defaults
if return_token_type_ids is None:
return_token_type_ids = "token_type_ids" in self.model_input_names
if return_attention_mask is None:
return_attention_mask = "attention_mask" in self.model_input_names
encoded_inputs = {}
# Compute the total size of the returned encodings
total_len = len_ids + len_pair_ids + (self.num_special_tokens_to_add(pair=pair) if add_special_tokens else 0)
# Truncation: Handle max sequence length
if truncation_strategy != TruncationStrategy.DO_NOT_TRUNCATE and max_length and total_len > max_length:
ids, pair_ids, overflowing_tokens = self.truncate_sequences(
ids,
pair_ids=pair_ids,
num_tokens_to_remove=total_len - max_length,
truncation_strategy=truncation_strategy,
stride=stride,
)
if return_overflowing_tokens:
encoded_inputs["overflowing_tokens"] = overflowing_tokens
encoded_inputs["num_truncated_tokens"] = total_len - max_length
# Add special tokens
if add_special_tokens:
sequence = self.build_inputs_with_special_tokens(ids, pair_ids)
token_type_ids = self.create_token_type_ids_from_sequences(ids, pair_ids)
else:
sequence = ids + pair_ids if pair else ids
token_type_ids = [0] * len(ids) + ([1] * len(pair_ids) if pair else [])
# Build output dictionnary
encoded_inputs["input_ids"] = sequence
if return_token_type_ids:
encoded_inputs["token_type_ids"] = token_type_ids
if return_special_tokens_mask:
if add_special_tokens:
encoded_inputs["special_tokens_mask"] = self.get_special_tokens_mask(ids, pair_ids)
else:
encoded_inputs["special_tokens_mask"] = [0] * len(sequence)
# Check lengths
if max_length is None and len(encoded_inputs["input_ids"]) > self.model_max_length and verbose:
logger.warning(
"Token indices sequence length is longer than the specified maximum sequence length "
"for this model ({} > {}). Running this sequence through the model will result in "
"indexing errors".format(len(ids), self.model_max_length)
)
# Padding
if padding_strategy != PaddingStrategy.DO_NOT_PAD or return_attention_mask:
encoded_inputs = self.pad(
encoded_inputs,
max_length=max_length,
padding=padding_strategy.value,
pad_to_multiple_of=pad_to_multiple_of,
return_attention_mask=return_attention_mask,
)
if return_length:
encoded_inputs["length"] = len(encoded_inputs["input_ids"])
batch_outputs = BatchEncoding(
encoded_inputs, tensor_type=return_tensors, prepend_batch_axis=prepend_batch_axis
)
return batch_outputs
def truncate_sequences(
self,
ids: List[int],
pair_ids: Optional[List[int]] = None,
num_tokens_to_remove: int = 0,
truncation_strategy: Union[str, TruncationStrategy] = "longest_first",
stride: int = 0,
) -> Tuple[List[int], List[int], List[int]]:
""" Truncates a sequence pair in place to the maximum length.
Args:
ids: list of tokenized input ids. Can be obtained from a string by chaining the
`tokenize` and `convert_tokens_to_ids` methods.
pair_ids: Optional second list of input ids. Can be obtained from a string by chaining the
`tokenize` and `convert_tokens_to_ids` methods.
num_tokens_to_remove (:obj:`int`, `optional`, defaults to ``0``):
number of tokens to remove using the truncation strategy
truncation_strategy (:obj:`string`, `optional`, defaults to "longest_first"):
String selected in the following options:
- 'longest_first' (default): Iteratively reduce the inputs sequence until the input is under max_length
starting from the longest one at each token (when there is a pair of input sequences).
Overflowing tokens only contains overflow from the first sequence.
- 'only_first': Only truncate the first sequence. raise an error if the first sequence is shorter or equal to than num_tokens_to_remove.
- 'only_second': Only truncate the second sequence
- 'do_not_truncate'
stride (:obj:`int`, `optional`, defaults to ``0``):
If set to a number along with max_length, the overflowing tokens returned will contain some tokens
from the main sequence returned. The value of this argument defines the number of additional tokens.
"""
if num_tokens_to_remove <= 0:
return ids, pair_ids, []
if not isinstance(truncation_strategy, TruncationStrategy):
truncation_strategy = TruncationStrategy(truncation_strategy)
overflowing_tokens = []
if truncation_strategy == TruncationStrategy.LONGEST_FIRST:
for _ in range(num_tokens_to_remove):
if pair_ids is None or len(ids) > len(pair_ids):
if not overflowing_tokens:
window_len = min(len(ids), stride + 1)
else:
window_len = 1
overflowing_tokens.extend(ids[-window_len:])
ids = ids[:-1]
else:
if not overflowing_tokens:
window_len = min(len(pair_ids), stride + 1)
else:
window_len = 1
overflowing_tokens.extend(pair_ids[-window_len:])
pair_ids = pair_ids[:-1]
elif truncation_strategy == TruncationStrategy.ONLY_FIRST:
if len(ids) > num_tokens_to_remove:
window_len = min(len(ids), stride + num_tokens_to_remove)
overflowing_tokens = ids[-window_len:]
ids = ids[:-num_tokens_to_remove]
else:
logger.error(
f"We need to remove {num_tokens_to_remove} to truncate the input"
f"but the first sequence has a length {len(ids)}. "
f"Please select another truncation strategy than {truncation_strategy}, "
f"for instance 'longest_first' or 'only_second'."
)
elif truncation_strategy == TruncationStrategy.ONLY_SECOND and pair_ids is not None:
if len(pair_ids) > num_tokens_to_remove:
window_len = min(len(pair_ids), stride + num_tokens_to_remove)
overflowing_tokens = pair_ids[-window_len:]
pair_ids = pair_ids[:-num_tokens_to_remove]
else:
logger.error(
f"We need to remove {num_tokens_to_remove} to truncate the input"
f"but the second sequence has a length {len(pair_ids)}. "
f"Please select another truncation strategy than {truncation_strategy}, "
f"for instance 'longest_first' or 'only_first'."
)
return (ids, pair_ids, overflowing_tokens)
def _pad(
self,
encoded_inputs: Union[Dict[str, EncodedInput], BatchEncoding],

View File

@@ -1,7 +1,6 @@
import logging
import math
import os
import random
import re
import shutil
import warnings
@@ -23,7 +22,14 @@ from .data.data_collator import DataCollator, default_data_collator
from .file_utils import is_apex_available, is_torch_tpu_available
from .modeling_utils import PreTrainedModel
from .optimization import AdamW, get_linear_schedule_with_warmup
from .trainer_utils import PREFIX_CHECKPOINT_DIR, EvalPrediction, PredictionOutput, TrainOutput, is_wandb_available
from .trainer_utils import (
PREFIX_CHECKPOINT_DIR,
EvalPrediction,
PredictionOutput,
TrainOutput,
is_wandb_available,
set_seed,
)
from .training_args import TrainingArguments
@@ -60,18 +66,13 @@ if is_wandb_available():
logger = logging.getLogger(__name__)
def set_seed(seed: int):
random.seed(seed)
np.random.seed(seed)
torch.manual_seed(seed)
torch.cuda.manual_seed_all(seed)
# ^^ safe to call this function even if cuda is not available
@contextmanager
def torch_distributed_zero_first(local_rank: int):
"""
Decorator to make all processes in distributed training wait for each local_master to do something.
Args:
local_rank (:obj:`int`): The rank of the local process.
"""
if local_rank not in [-1, 0]:
torch.distributed.barrier()
@@ -133,7 +134,31 @@ def get_tpu_sampler(dataset: Dataset):
class Trainer:
"""
Trainer is a simple but feature-complete training and eval loop for PyTorch,
optimized for Transformers.
optimized for 🤗 Transformers.
Args:
model (:class:`~transformers.PreTrainedModel`):
The model to train, evaluate or use for predictions.
args (:class:`~transformers.TrainingArguments`):
The arguments to tweak training.
data_collator (:obj:`DataCollator`, `optional`, defaults to :func:`~transformers.default_data_collator`):
The function to use to from a batch from a list of elements of :obj:`train_dataset` or
:obj:`eval_dataset`.
train_dataset (:obj:`Dataset`, `optional`):
The dataset to use for training.
eval_dataset (:obj:`Dataset`, `optional`):
The dataset to use for evaluation.
compute_metrics (:obj:`Callable[[EvalPrediction], Dict]`, `optional`):
The function that will be used to compute metrics at evaluation. Must take a
:class:`~transformers.EvalPrediction` and return a dictionary string to metric values.
prediction_loss_only (:obj:`bool`, `optional`, defaults to `False`):
When performing evaluation and predictions, only returns the loss.
tb_writer (:obj:`SummaryWriter`, `optional`):
Object to write to TensorBoard.
optimizers (:obj:`Tuple[torch.optim.Optimizer, torch.optim.lr_scheduler.LambdaLR`, `optional`):
A tuple containing the optimizer and the scheduler to use. Will default to an instance of
:class:`~transformers.AdamW` on your model and a scheduler given by
:func:`~transformers.get_linear_schedule_with_warmup` controlled by :obj:`args`.
"""
model: PreTrainedModel
@@ -160,14 +185,6 @@ class Trainer:
tb_writer: Optional["SummaryWriter"] = None,
optimizers: Tuple[torch.optim.Optimizer, torch.optim.lr_scheduler.LambdaLR] = None,
):
"""
Trainer is a simple but feature-complete training and eval loop for PyTorch,
optimized for Transformers.
Args:
prediction_loss_only:
(Optional) in evaluation and prediction, only return the loss
"""
self.model = model.to(args.device)
self.args = args
self.data_collator = data_collator if data_collator is not None else default_data_collator
@@ -210,6 +227,9 @@ class Trainer:
)
def get_train_dataloader(self) -> DataLoader:
"""
Returns the training :class:`~torch.utils.data.DataLoader`.
"""
if self.train_dataset is None:
raise ValueError("Trainer: training requires a train_dataset.")
if is_torch_tpu_available():
@@ -232,6 +252,13 @@ class Trainer:
return data_loader
def get_eval_dataloader(self, eval_dataset: Optional[Dataset] = None) -> DataLoader:
"""
Returns the evaluation :class:`~torch.utils.data.DataLoader`.
Args:
eval_dataset (:obj:`Dataset`, `optional`):
If provided, will override `self.eval_dataset`.
"""
if eval_dataset is None and self.eval_dataset is None:
raise ValueError("Trainer: evaluation requires an eval_dataset.")
@@ -257,6 +284,12 @@ class Trainer:
return data_loader
def get_test_dataloader(self, test_dataset: Dataset) -> DataLoader:
"""
Returns the test :class:`~torch.utils.data.DataLoader`.
Args:
test_dataset (obj:`Dataset`): The test dataset to use.
"""
# We use the same batch_size as for eval.
if is_torch_tpu_available():
sampler = SequentialDistributedSampler(
@@ -283,9 +316,8 @@ class Trainer:
"""
Setup the optimizer and the learning rate scheduler.
We provide a reasonable default that works well.
If you want to use something else, you can pass a tuple in the Trainer's init,
or override this method in a subclass.
We provide a reasonable default that works well. If you want to use something else, you can pass a tuple in the
Trainer's init through :obj:`optimizers`, or override this method in a subclass.
"""
if self.optimizers is not None:
return self.optimizers
@@ -336,7 +368,7 @@ class Trainer:
def num_examples(self, dataloader: DataLoader) -> int:
"""
Helper to get num of examples from a DataLoader, by accessing its Dataset.
Helper to get number of samples in a :class:`~torch.utils.data.DataLoader` by accessing its Dataset.
"""
return len(dataloader.dataset)
@@ -345,9 +377,9 @@ class Trainer:
Main training entry point.
Args:
model_path:
(Optional) Local path to model if model to train has been instantiated from a local path
If present, we will try reloading the optimizer/scheduler states from there.
model_path (:obj:`str`, `optional`):
Local path to the model if the model to train has been instantiated from a local path. If present,
training will resume from the optimizer/scheduler states loaded here.
"""
train_dataloader = self.get_train_dataloader()
if self.args.max_steps > 0:
@@ -453,6 +485,10 @@ class Trainer:
else:
epoch_iterator = tqdm(train_dataloader, desc="Iteration", disable=not self.is_local_master())
# Reset the past mems state at the beginning of each epoch if necessary.
if self.args.past_index >= 0:
self._past = None
for step, inputs in enumerate(epoch_iterator):
# Skip past any already trained steps if resuming training
@@ -497,8 +533,8 @@ class Trainer:
self._log(logs)
if self.args.evaluate_during_training:
self.evaluate()
if self.args.evaluate_during_training and self.global_step % self.args.eval_steps == 0:
self.evaluate()
if self.args.save_steps > 0 and self.global_step % self.args.save_steps == 0:
# In all cases (even distributed/parallel), self.model is always a reference
@@ -529,12 +565,15 @@ class Trainer:
if self.args.max_steps > 0 and self.global_step > self.args.max_steps:
train_iterator.close()
break
if self.args.tpu_metrics_debug:
if self.args.tpu_metrics_debug or self.args.debug:
# tpu-comment: Logging debug metrics for PyTorch/XLA (compile, execute times, ops, etc.)
xm.master_print(met.metrics_report())
if self.tb_writer:
self.tb_writer.close()
if self.args.past_index and hasattr(self, "_past"):
# Clean the state at the end of training
delattr(self, "_past")
logger.info("\n\nTraining completed. Do not forget to share your model on huggingface.co/models =)\n\n")
return TrainOutput(self.global_step, tr_loss / self.global_step)
@@ -577,9 +616,15 @@ class Trainer:
if isinstance(v, torch.Tensor):
inputs[k] = v.to(self.args.device)
if self.args.past_index >= 0 and self._past is not None:
inputs["mems"] = self._past
outputs = model(**inputs)
loss = outputs[0] # model outputs are always tuple in transformers (see doc)
if self.args.past_index >= 0:
self._past = outputs[self.args.past_index]
if self.args.n_gpu > 1:
loss = loss.mean() # mean() to average on multi-gpu parallel training
if self.args.gradient_accumulation_steps > 1:
@@ -611,8 +656,7 @@ class Trainer:
def save_model(self, output_dir: Optional[str] = None):
"""
Saving best-practices: if you use default names for the model,
you can reload it using from_pretrained().
Will save the model, so you can reload it using :obj:`from_pretrained()`.
Will only save from the world_master process (unless in TPUs).
"""
@@ -683,22 +727,18 @@ class Trainer:
logger.info("Deleting older checkpoint [{}] due to args.save_total_limit".format(checkpoint))
shutil.rmtree(checkpoint)
def evaluate(
self, eval_dataset: Optional[Dataset] = None, prediction_loss_only: Optional[bool] = None,
) -> Dict[str, float]:
def evaluate(self, eval_dataset: Optional[Dataset] = None) -> Dict[str, float]:
"""
Run evaluation and return metrics.
Run evaluation and returns metrics.
The calling script will be responsible for providing a method to compute metrics, as they are
task-dependent.
task-dependent (pass it to the init :obj:`compute_metrics` argument).
Args:
eval_dataset: (Optional) Pass a dataset if you wish to override
the one on the instance.
eval_dataset (:obj:`Dataset`, `optional`):
Pass a dataset if you wish to override :obj:`self.eval_dataset`.
Returns:
A dict containing:
- the eval loss
- the potential metrics computed from the predictions
A dictionary containing the evaluation loss and the potential metrics computed from the predictions.
"""
eval_dataloader = self.get_eval_dataloader(eval_dataset)
@@ -706,7 +746,7 @@ class Trainer:
self._log(output.metrics)
if self.args.tpu_metrics_debug:
if self.args.tpu_metrics_debug or self.args.debug:
# tpu-comment: Logging debug metrics for PyTorch/XLA (compile, execute times, ops, etc.)
xm.master_print(met.metrics_report())
@@ -714,10 +754,22 @@ class Trainer:
def predict(self, test_dataset: Dataset) -> PredictionOutput:
"""
Run prediction and return predictions and potential metrics.
Run prediction and returns predictions and potential metrics.
Depending on the dataset and your use case, your test dataset may contain labels.
In that case, this method will also return metrics, like in evaluate().
In that case, this method will also return metrics, like in :obj:`evaluate()`.
Args:
test_dataset (:obj:`Dataset`):
Dataset to run the predictions on.
Returns:
`NamedTuple`:
predictions (:obj:`np.ndarray`):
The predictions on :obj:`test_dataset`.
label_ids (:obj:`np.ndarray`, `optional`):
The labels (if the dataset contained some).
metrics (:obj:`Dict[str, float]`, `optional`):
The potential dictionary of metrics (if the dataset contained labels).
"""
test_dataloader = self.get_test_dataloader(test_dataset)
@@ -755,12 +807,17 @@ class Trainer:
if is_torch_tpu_available():
dataloader = pl.ParallelLoader(dataloader, [self.args.device]).per_device_loader(self.args.device)
if self.args.past_index >= 0:
past = None
for inputs in tqdm(dataloader, desc=description):
has_labels = any(inputs.get(k) is not None for k in ["labels", "lm_labels", "masked_lm_labels"])
for k, v in inputs.items():
if isinstance(v, torch.Tensor):
inputs[k] = v.to(self.args.device)
if self.args.past_index >= 0:
inputs["mems"] = past
with torch.no_grad():
outputs = model(**inputs)
@@ -769,6 +826,8 @@ class Trainer:
eval_losses += [step_eval_loss.mean().item()]
else:
logits = outputs[0]
if self.args.past_index >= 0:
past = outputs[self.args.past_index if has_labels else self.args.past_index - 1]
if not prediction_loss_only:
if preds is None:

View File

@@ -3,7 +3,6 @@
import logging
import math
import os
import random
from typing import Callable, Dict, Optional, Tuple
import numpy as np
@@ -11,7 +10,7 @@ import tensorflow as tf
from .modeling_tf_utils import TFPreTrainedModel
from .optimization_tf import GradientAccumulator, create_optimizer
from .trainer_utils import PREFIX_CHECKPOINT_DIR, EvalPrediction, PredictionOutput, is_wandb_available
from .trainer_utils import PREFIX_CHECKPOINT_DIR, EvalPrediction, PredictionOutput, is_wandb_available, set_seed
from .training_args_tf import TFTrainingArguments
@@ -22,13 +21,35 @@ if is_wandb_available():
logger = logging.getLogger(__name__)
def set_seed(seed: int):
random.seed(seed)
np.random.seed(seed)
tf.random.set_seed(seed)
class TFTrainer:
"""
TFTrainer is a simple but feature-complete training and eval loop for TensorFlow,
optimized for 🤗 Transformers.
Args:
model (:class:`~transformers.TFPreTrainedModel`):
The model to train, evaluate or use for predictions.
args (:class:`~transformers.TFTrainingArguments`):
The arguments to tweak training.
train_dataset (:class:`~tf.data.Dataset`, `optional`):
The dataset to use for training.
eval_dataset (:class:`~tf.data.Dataset`, `optional`):
The dataset to use for evaluation.
compute_metrics (:obj:`Callable[[EvalPrediction], Dict]`, `optional`):
The function that will be used to compute metrics at evaluation. Must take a
:class:`~transformers.EvalPrediction` and return a dictionary string to metric values.
prediction_loss_only (:obj:`bool`, `optional`, defaults to `False`):
When performing evaluation and predictions, only returns the loss.
tb_writer (:obj:`tf.summary.SummaryWriter`, `optional`):
Object to write to TensorBoard.
optimizers (:obj:`Tuple[tf.keras.optimizers.Optimizer, tf.keras.optimizers.schedules.LearningRateSchedule]`, `optional`):
A tuple containing the optimizer and the scheduler to use. The optimizer default to an instance of
:class:`tf.keras.optimizers.Adam` if :obj:`args.weight_decay_rate` is 0 else an instance of
:class:`~transformers.AdamWeightDecay`. The scheduler will default to an instance of
:class:`tf.keras.optimizers.schedules.PolynomialDecay` if :obj:`args.num_warmup_steps` is 0 else
an instance of :class:`~transformers.WarmUp`.
"""
model: TFPreTrainedModel
args: TFTrainingArguments
train_dataset: Optional[tf.data.Dataset]
@@ -78,6 +99,9 @@ class TFTrainer:
set_seed(self.args.seed)
def get_train_tfdataset(self) -> tf.data.Dataset:
"""
Returns the training :class:`~tf.data.Dataset`.
"""
if self.train_dataset is None:
raise ValueError("Trainer: training requires a train_dataset.")
@@ -101,6 +125,13 @@ class TFTrainer:
return self.args.strategy.experimental_distribute_dataset(ds)
def get_eval_tfdataset(self, eval_dataset: Optional[tf.data.Dataset] = None) -> tf.data.Dataset:
"""
Returns the evaluation :class:`~tf.data.Dataset`.
Args:
eval_dataset (:class:`~tf.data.Dataset`, `optional`):
If provided, will override `self.eval_dataset`.
"""
if eval_dataset is None and self.eval_dataset is None:
raise ValueError("Trainer: evaluation requires an eval_dataset.")
@@ -114,6 +145,12 @@ class TFTrainer:
return self.args.strategy.experimental_distribute_dataset(ds)
def get_test_tfdataset(self, test_dataset: tf.data.Dataset) -> tf.data.Dataset:
"""
Returns a test :class:`~tf.data.Dataset`.
Args:
test_dataset (:class:`~tf.data.Dataset`): The dataset to use.
"""
ds = test_dataset.batch(self.args.eval_batch_size, drop_remainder=self.args.dataloader_drop_last)
return self.args.strategy.experimental_distribute_dataset(ds)
@@ -124,9 +161,8 @@ class TFTrainer:
"""
Setup the optimizer and the learning rate scheduler.
We provide a reasonable default that works well.
If you want to use something else, you can pass a tuple in the Trainer's init,
or override this method in a subclass.
We provide a reasonable default that works well. If you want to use something else, you can pass a tuple in the
TFTrainer's init through :obj:`optimizers`, or override this method in a subclass.
"""
if self.optimizers is not None:
return self.optimizers
@@ -197,6 +233,10 @@ class TFTrainer:
step: int = 1
# Reset the past mems state at the beginning of the evaluation if necessary.
if self.args.past_index >= 0:
self._past = None
for features, labels in dataset:
step = tf.convert_to_tensor(step, dtype=tf.int64)
loss, logits = self._evaluate_steps(features, labels)
@@ -209,7 +249,7 @@ class TFTrainer:
if isinstance(labels, tuple):
labels = labels[0]
if self.args.n_gpu > 1:
if self.args.n_replicas > 1:
for val in logits.values:
if preds is None:
preds = val.numpy()
@@ -245,6 +285,10 @@ class TFTrainer:
if not key.startswith("eval_"):
metrics[f"eval_{key}"] = metrics.pop(key)
if self.args.past_index and hasattr(self, "_past"):
# Clean the state at the end of training
delattr(self, "_past")
return PredictionOutput(predictions=preds, label_ids=label_ids, metrics=metrics)
def _log(self, logs: Dict[str, float]) -> None:
@@ -263,11 +307,18 @@ class TFTrainer:
logger.info(output)
def evaluate(
self, eval_dataset: Optional[tf.data.Dataset] = None, prediction_loss_only: Optional[bool] = None
) -> Dict[str, float]:
def evaluate(self, eval_dataset: Optional[tf.data.Dataset] = None) -> Dict[str, float]:
"""
Prediction/evaluation loop, shared by `evaluate()` and `predict()`.
Run evaluation and returns metrics.
The calling script will be responsible for providing a method to compute metrics, as they are
task-dependent (pass it to the init :obj:`compute_metrics` argument).
Args:
eval_dataset (:class:`~tf.data.Dataset`, `optional`):
Pass a dataset if you wish to override :obj:`self.eval_dataset`.
Returns:
A dictionary containing the evaluation loss and the potential metrics computed from the predictions.
"""
eval_ds = self.get_eval_tfdataset(eval_dataset)
@@ -355,6 +406,9 @@ class TFTrainer:
logger.info(" Total optimization steps = %d", t_total)
for epoch_iter in range(epochs_trained, int(epochs + 1)):
# Reset the past mems state at the beginning of each epoch if necessary.
if self.args.past_index >= 0:
self._past = None
for step, training_loss in enumerate(self._training_steps(train_ds, optimizer)):
self.global_step = iterations.numpy()
self.epoch_logging = epoch_iter - 1 + (step + 1) / steps_per_epoch
@@ -394,6 +448,10 @@ class TFTrainer:
if self.args.max_steps > 0 and self.global_step % self.args.max_steps == 0:
break
if self.args.past_index and hasattr(self, "_past"):
# Clean the state at the end of training
delattr(self, "_past")
def _training_steps(self, ds, optimizer):
"""
Returns a generator over training steps (i.e. parameters update).
@@ -468,22 +526,37 @@ class TFTrainer:
labels: the batched labels.
training: run the model in training mode or not
"""
if self.args.past_index >= 0 and getattr(self, "_past", None) is not None:
features["mems"] = self._past
if isinstance(labels, (dict)):
loss, logits = self.model(features, training=training, **labels)[:2]
outputs = self.model(features, training=training, **labels)[:2]
else:
loss, logits = self.model(features, labels=labels, training=training)[:2]
loss += sum(self.model.losses) * (1.0 / self.args.n_gpu)
outputs = self.model(features, labels=labels, training=training)[:2]
loss, logits = outputs[:2]
if self.args.past_index >= 0:
self._past = outputs[self.args.past_index]
loss += sum(self.model.losses) * (1.0 / self.args.n_replicas)
return loss, logits
def predict(self, test_dataset: tf.data.Dataset) -> PredictionOutput:
"""
Run prediction and return predictions and potential metrics.
Run prediction and returns predictions and potential metrics.
Depending on the dataset and your use case, your test dataset may contain labels.
In that case, this method will also return metrics, like in evaluate().
In that case, this method will also return metrics, like in :obj:`evaluate()`.
Args:
test_dataset: something similar to a PT Dataset. This is just
temporary before to have a framework-agnostic approach for datasets.
test_dataset (:class:`~tf.data.Dataset`):
Dataset to run the predictions on.
Returns:
`NamedTuple`:
predictions (:obj:`np.ndarray`):
The predictions on :obj:`test_dataset`.
label_ids (:obj:`np.ndarray`, `optional`):
The labels (if the dataset contained some).
metrics (:obj:`Dict[str, float]`, `optional`):
The potential dictionary of metrics (if the dataset contained labels).
"""
test_ds = self.get_test_tfdataset(test_dataset)
@@ -491,7 +564,7 @@ class TFTrainer:
def save_model(self, output_dir: Optional[str] = None):
"""
Save the pretrained model.
Will save the model, so you can reload it using :obj:`from_pretrained()`.
"""
output_dir = output_dir if output_dir is not None else self.args.output_dir

View File

@@ -1,8 +1,11 @@
import os
import random
from typing import Dict, NamedTuple, Optional
import numpy as np
from .file_utils import is_tf_available, is_torch_available
try:
import wandb
@@ -21,10 +24,35 @@ def is_wandb_available():
return _has_wandb
def set_seed(seed: int):
"""
Helper function for reproducible behavior to set the seed in ``random``, ``numpy``, ``torch`` and/or ``tf``
(if installed).
Args:
seed (:obj:`int`): The seed to set.
"""
random.seed(seed)
np.random.seed(seed)
if is_torch_available():
import torch
torch.manual_seed(seed)
torch.cuda.manual_seed_all(seed)
# ^^ safe to call this function even if cuda is not available
if is_tf_available():
import tensorflow as tf
tf.random.set_seed(seed)
class EvalPrediction(NamedTuple):
"""
Evaluation output (always contains labels), to be used
to compute metrics.
Evaluation output (always contains labels), to be used to compute metrics.
Parameters:
predictions (:obj:`np.ndarray`): Predictions of the model.
label_ids (:obj:`np.ndarray`): Targets to be matched.
"""
predictions: np.ndarray

View File

@@ -35,9 +35,80 @@ class TrainingArguments:
TrainingArguments is the subset of the arguments we use in our example scripts
**which relate to the training loop itself**.
Using `HfArgumentParser` we can turn this class
into argparse arguments to be able to specify them on
the command line.
Using :class:`~transformers.HfArgumentParser` we can turn this class
into argparse arguments to be able to specify them on the command line.
Parameters:
output_dir (:obj:`str`):
The output directory where the model predictions and checkpoints will be written.
overwrite_output_dir (:obj:`bool`, `optional`, defaults to :obj:`False`):
If :obj:`True`, overwrite the content of the output directory. Use this to continue training if
:obj:`output_dir` points to a checkpoint directory.
do_train (:obj:`bool`, `optional`, defaults to :obj:`False`):
Whether to run training or not.
do_eval (:obj:`bool`, `optional`, defaults to :obj:`False`):
Whether to run evaluation on the dev set or not.
do_predict (:obj:`bool`, `optional`, defaults to :obj:`False`):
Whether to run predictions on the test set or not.
evaluate_during_training (:obj:`bool`, `optional`, defaults to :obj:`False`):
Whether to run evaluation during training at each logging step or not.
per_device_train_batch_size (:obj:`int`, `optional`, defaults to 8):
The batch size per GPU/TPU core/CPU for training.
per_device_eval_batch_size (:obj:`int`, `optional`, defaults to 8):
The batch size per GPU/TPU core/CPU for evaluation.
gradient_accumulation_steps: (:obj:`int`, `optional`, defaults to 1):
Number of updates steps to accumulate the gradients for, before performing a backward/update pass.
learning_rate (:obj:`float`, `optional`, defaults to 5e-5):
The initial learning rate for Adam.
weight_decay (:obj:`float`, `optional`, defaults to 0):
The weight decay to apply (if not zero).
adam_epsilon (:obj:`float`, `optional`, defaults to 1e-8):
Epsilon for the Adam optimizer.
max_grad_norm (:obj:`float`, `optional`, defaults to 1.0):
Maximum gradient norm (for gradient clipping).
num_train_epochs(:obj:`float`, `optional`, defaults to 3.0):
Total number of training epochs to perform.
max_steps (:obj:`int`, `optional`, defaults to -1):
If set to a positive number, the total number of training steps to perform. Overrides
:obj:`num_train_epochs`.
warmup_steps (:obj:`int`, `optional`, defaults to 0):
Number of steps used for a linear warmup from 0 to :obj:`learning_rate`.
logging_dir (:obj:`str`, `optional`):
Tensorboard log directory. Will default to `runs/**CURRENT_DATETIME_HOSTNAME**`.
logging_first_step (:obj:`bool`, `optional`, defaults to :obj:`False`):
Wheter to log and evalulate the first :obj:`global_step` or not.
logging_steps (:obj:`int`, `optional`, defaults to 500):
Number of update steps between two logs.
save_steps (:obj:`int`, `optional`, defaults to 500):
Number of updates steps before two checkpoint saves.
save_total_limit (:obj:`int`, `optional`):
If a value is passed, will limit the total amount of checkpoints. Deletes the older checkpoints in
:obj:`output_dir`.
no_cuda (:obj:`bool`, `optional`, defaults to :obj:`False`):
Wherher to not use CUDA even when it is available or not.
seed (:obj:`int`, `optional`, defaults to 42):
Random seed for initialization.
fp16 (:obj:`bool`, `optional`, defaults to :obj:`False`):
Whether to use 16-bit (mixed) precision training (through NVIDIA apex) instead of 32-bit training.
fp16_opt_level (:obj:`str`, `optional`, defaults to 'O1'):
For :obj:`fp16` training, apex AMP optimization level selected in ['O0', 'O1', 'O2', and 'O3']. See details
on the `apex documentation <https://nvidia.github.io/apex/amp.html>`__.
local_rank (:obj:`int`, `optional`, defaults to -1):
During distributed training, the rank of the process.
tpu_num_cores (:obj:`int`, `optional`):
When training on TPU, the mumber of TPU cores (automatically passed by launcher script).
debug (:obj:`bool`, `optional`, defaults to :obj:`False`):
When training on TPU, whether to print debug metrics or not.
dataloader_drop_last (:obj:`bool`, `optional`, defaults to :obj:`False`):
Whether to drop the last incomplete batch (if the length of the dataset is not divisible by the batch size)
or not.
eval_steps (:obj:`int`, `optional`, defaults to 1000):
Number of update steps between two evaluations.
past_index (:obj:`int`, `optional`, defaults to -1):
Some models like :doc:`TransformerXL <../model_doc/transformerxl>` or :doc`XLNet <../model_doc/xlnet>` can
make use of the past hidden states for their predictions. If this argument is set to a positive int, the
``Trainer`` will use the corresponding output (usually index 2) as the past state and feed it to the model
at the next training step under the keyword argument ``mems``.
"""
output_dir: str = field(
@@ -133,14 +204,27 @@ class TrainingArguments:
tpu_num_cores: Optional[int] = field(
default=None, metadata={"help": "TPU: Number of TPU cores (automatically passed by launcher script)"}
)
tpu_metrics_debug: bool = field(default=False, metadata={"help": "TPU: Whether to print debug metrics"})
tpu_metrics_debug: bool = field(
default=False,
metadata={"help": "Deprecated, the use of `--debug` is preferred. TPU: Whether to print debug metrics"},
)
debug: bool = field(default=False, metadata={"help": "Whether to print debug metrics on TPU"})
dataloader_drop_last: bool = field(
default=False, metadata={"help": "Drop the last incomplete batch if it is not divisible by the batch size."}
)
eval_steps: int = field(default=1000, metadata={"help": "Run an evaluation every X steps."})
past_index: int = field(
default=-1,
metadata={"help": "If >=0, uses the corresponding part of the output as the past state for next step."},
)
@property
def train_batch_size(self) -> int:
"""
The actual batch size for training (may differ from :obj:`per_gpu_train_batch_size` in distributed training).
"""
if self.per_gpu_train_batch_size:
logger.warning(
"Using deprecated `--per_gpu_train_batch_size` argument which will be removed in a future "
@@ -151,6 +235,9 @@ class TrainingArguments:
@property
def eval_batch_size(self) -> int:
"""
The actual batch size for evaluation (may differ from :obj:`per_gpu_eval_batch_size` in distributed training).
"""
if self.per_gpu_eval_batch_size:
logger.warning(
"Using deprecated `--per_gpu_eval_batch_size` argument which will be removed in a future "
@@ -193,11 +280,21 @@ class TrainingArguments:
@property
@torch_required
def device(self) -> "torch.device":
"""
The device used by this process.
"""
return self._setup_devices[0]
@property
@torch_required
def n_gpu(self):
"""
The number of GPUs used by this process.
Note:
This will only be greater than one when you have multiple GPUs available but are not using distributed
training. For distributed training, it will always be 1.
"""
return self._setup_devices[1]
def to_json_string(self):

View File

@@ -1,4 +1,5 @@
import logging
import warnings
from dataclasses import dataclass, field
from typing import Tuple
@@ -14,13 +15,91 @@ if is_tf_available():
@dataclass
class TFTrainingArguments(TrainingArguments):
"""
TrainingArguments is the subset of the arguments we use in our example scripts
**which relate to the training loop itself**.
Using :class:`~transformers.HfArgumentParser` we can turn this class
into argparse arguments to be able to specify them on the command line.
Parameters:
output_dir (:obj:`str`):
The output directory where the model predictions and checkpoints will be written.
overwrite_output_dir (:obj:`bool`, `optional`, defaults to :obj:`False`):
If :obj:`True`, overwrite the content of the output directory. Use this to continue training if
:obj:`output_dir` points to a checkpoint directory.
do_train (:obj:`bool`, `optional`, defaults to :obj:`False`):
Whether to run training or not.
do_eval (:obj:`bool`, `optional`, defaults to :obj:`False`):
Whether to run evaluation on the dev set or not.
do_predict (:obj:`bool`, `optional`, defaults to :obj:`False`):
Whether to run predictions on the test set or not.
evaluate_during_training (:obj:`bool`, `optional`, defaults to :obj:`False`):
Whether to run evaluation during training at each logging step or not.
per_device_train_batch_size (:obj:`int`, `optional`, defaults to 8):
The batch size per GPU/TPU core/CPU for training.
per_device_eval_batch_size (:obj:`int`, `optional`, defaults to 8):
The batch size per GPU/TPU core/CPU for evaluation.
gradient_accumulation_steps: (:obj:`int`, `optional`, defaults to 1):
Number of updates steps to accumulate the gradients for, before performing a backward/update pass.
learning_rate (:obj:`float`, `optional`, defaults to 5e-5):
The initial learning rate for Adam.
weight_decay (:obj:`float`, `optional`, defaults to 0):
The weight decay to apply (if not zero).
adam_epsilon (:obj:`float`, `optional`, defaults to 1e-8):
Epsilon for the Adam optimizer.
max_grad_norm (:obj:`float`, `optional`, defaults to 1.0):
Maximum gradient norm (for gradient clipping).
num_train_epochs(:obj:`float`, `optional`, defaults to 3.0):
Total number of training epochs to perform.
max_steps (:obj:`int`, `optional`, defaults to -1):
If set to a positive number, the total number of training steps to perform. Overrides
:obj:`num_train_epochs`.
warmup_steps (:obj:`int`, `optional`, defaults to 0):
Number of steps used for a linear warmup from 0 to :obj:`learning_rate`.
logging_dir (:obj:`str`, `optional`):
Tensorboard log directory. Will default to `runs/**CURRENT_DATETIME_HOSTNAME**`.
logging_first_step (:obj:`bool`, `optional`, defaults to :obj:`False`):
Wheter to log and evalulate the first :obj:`global_step` or not.
logging_steps (:obj:`int`, `optional`, defaults to 500):
Number of update steps between two logs.
save_steps (:obj:`int`, `optional`, defaults to 500):
Number of updates steps before two checkpoint saves.
save_total_limit (:obj:`int`, `optional`):
If a value is passed, will limit the total amount of checkpoints. Deletes the older checkpoints in
:obj:`output_dir`.
no_cuda (:obj:`bool`, `optional`, defaults to :obj:`False`):
Wherher to not use CUDA even when it is available or not.
seed (:obj:`int`, `optional`, defaults to 42):
Random seed for initialization.
fp16 (:obj:`bool`, `optional`, defaults to :obj:`False`):
Whether to use 16-bit (mixed) precision training (through NVIDIA apex) instead of 32-bit training.
fp16_opt_level (:obj:`str`, `optional`, defaults to 'O1'):
For :obj:`fp16` training, apex AMP optimization level selected in ['O0', 'O1', 'O2', and 'O3']. See details
on the `apex documentation <https://nvidia.github.io/apex/amp.html>`__.
local_rank (:obj:`int`, `optional`, defaults to -1):
During distributed training, the rank of the process.
tpu_num_cores (:obj:`int`, `optional`):
When training on TPU, the mumber of TPU cores (automatically passed by launcher script).
debug (:obj:`bool`, `optional`, defaults to :obj:`False`):
Wheter to activate the trace to record computation graphs and profiling information or not.
dataloader_drop_last (:obj:`bool`, `optional`, defaults to :obj:`False`):
Whether to drop the last incomplete batch (if the length of the dataset is not divisible by the batch size)
or not.
eval_steps (:obj:`int`, `optional`, defaults to 1000):
Number of update steps before two evaluations.
past_index (:obj:`int`, `optional`, defaults to -1):
Some models like :doc:`TransformerXL <../model_doc/transformerxl>` or :doc`XLNet <../model_doc/xlnet>` can
make use of the past hidden states for their predictions. If this argument is set to a positive int, the
``Trainer`` will use the corresponding output (usually index 2) as the past state and feed it to the model
at the next training step under the keyword argument ``mems``.
tpu_name (:obj:`str`, `optional`):
The name of the TPU the process is running on.
"""
tpu_name: str = field(
default=None, metadata={"help": "Name of TPU"},
)
eval_steps: int = field(default=1000, metadata={"help": "Run an evaluation every X steps."})
debug: bool = field(
default=False, metadata={"help": "Activate the trace to record computation graphs and profiling information"}
)
@cached_property
@tf_required
@@ -59,9 +138,53 @@ class TFTrainingArguments(TrainingArguments):
@property
@tf_required
def strategy(self) -> "tf.distribute.Strategy":
"""
The strategy used for distributed training.
"""
return self._setup_strategy
@property
@tf_required
def n_gpu(self) -> int:
def n_replicas(self) -> int:
"""
The number of replicas (CPUs, GPUs or TPU cores) used in this training.
"""
return self._setup_strategy.num_replicas_in_sync
@property
def train_batch_size(self) -> int:
"""
The actual batch size for training (may differ from :obj:`per_gpu_train_batch_size` in distributed training).
"""
if self.per_gpu_train_batch_size:
logger.warning(
"Using deprecated `--per_gpu_train_batch_size` argument which will be removed in a future "
"version. Using `--per_device_train_batch_size` is preferred."
)
per_device_batch_size = self.per_gpu_train_batch_size or self.per_device_train_batch_size
return per_device_batch_size * max(1, self.n_replicas)
@property
def eval_batch_size(self) -> int:
"""
The actual batch size for evaluation (may differ from :obj:`per_gpu_eval_batch_size` in distributed training).
"""
if self.per_gpu_eval_batch_size:
logger.warning(
"Using deprecated `--per_gpu_eval_batch_size` argument which will be removed in a future "
"version. Using `--per_device_eval_batch_size` is preferred."
)
per_device_batch_size = self.per_gpu_eval_batch_size or self.per_device_eval_batch_size
return per_device_batch_size * max(1, self.n_replicas)
@property
@tf_required
def n_gpu(self) -> int:
"""
The number of replicas (CPUs, GPUs or TPU cores) used in this training.
"""
warnings.warn(
"The n_gpu argument is deprecated and will be removed in a future version, use n_replicas instead.",
FutureWarning,
)
return self._setup_strategy.num_replicas_in_sync

View File

@@ -1,8 +1,7 @@
import unittest
from transformers import is_torch_available
from .utils import require_torch
from transformers.testing_utils import require_torch
if is_torch_available():

View File

@@ -4,8 +4,7 @@ import unittest
from pathlib import Path
from transformers import AutoConfig, is_torch_available
from .utils import require_torch, torch_device
from transformers.testing_utils import require_torch, torch_device
if is_torch_available():

View File

@@ -4,8 +4,7 @@ import unittest
from pathlib import Path
from transformers import AutoConfig, is_tf_available
from .utils import require_tf
from transformers.testing_utils import require_tf
if is_tf_available():

View File

@@ -19,8 +19,7 @@ import unittest
from transformers.configuration_auto import CONFIG_MAPPING, AutoConfig
from transformers.configuration_bert import BertConfig
from transformers.configuration_roberta import RobertaConfig
from .utils import DUMMY_UNKWOWN_IDENTIFIER
from transformers.testing_utils import DUMMY_UNKWOWN_IDENTIFIER
SAMPLE_ROBERTA_CONFIG = os.path.join(os.path.dirname(os.path.abspath(__file__)), "fixtures/dummy-config.json")

View File

@@ -21,8 +21,7 @@ from pathlib import Path
from typing import List, Union
import transformers
from .utils import require_tf, require_torch, slow
from transformers.testing_utils import require_tf, require_torch, slow
logger = logging.getLogger()

View File

@@ -17,10 +17,10 @@
import unittest
from transformers import is_torch_available
from transformers.testing_utils import require_torch, slow, torch_device
from .test_configuration_common import ConfigTester
from .test_modeling_common import ModelTesterMixin, ids_tensor
from .utils import require_torch, slow, torch_device
if is_torch_available():

View File

@@ -17,8 +17,7 @@
import unittest
from transformers import is_torch_available
from .utils import DUMMY_UNKWOWN_IDENTIFIER, SMALL_MODEL_IDENTIFIER, require_torch, slow
from transformers.testing_utils import DUMMY_UNKWOWN_IDENTIFIER, SMALL_MODEL_IDENTIFIER, require_torch, slow
if is_torch_available():

View File

@@ -20,10 +20,10 @@ import timeout_decorator # noqa
from transformers import is_torch_available
from transformers.file_utils import cached_property
from transformers.testing_utils import require_torch, slow, torch_device
from .test_configuration_common import ConfigTester
from .test_modeling_common import ModelTesterMixin, ids_tensor
from .utils import require_torch, slow, torch_device
if is_torch_available():

View File

@@ -17,10 +17,10 @@
import unittest
from transformers import is_torch_available
from transformers.testing_utils import require_torch, slow, torch_device
from .test_configuration_common import ConfigTester
from .test_modeling_common import ModelTesterMixin, floats_tensor, ids_tensor
from .utils import require_torch, slow, torch_device
if is_torch_available():

View File

@@ -16,8 +16,7 @@
import unittest
from transformers import is_torch_available
from .utils import require_torch, slow, torch_device
from transformers.testing_utils import require_torch, slow, torch_device
if is_torch_available():

View File

@@ -21,8 +21,7 @@ import unittest
from typing import List
from transformers import is_torch_available
from .utils import require_multigpu, require_torch, slow, torch_device
from transformers.testing_utils import require_multigpu, require_torch, slow, torch_device
if is_torch_available():
@@ -812,7 +811,7 @@ class ModelTesterMixin:
# Wrap model in nn.DataParallel
model = torch.nn.DataParallel(model)
with torch.no_grad():
_ = model(**inputs_dict)
_ = model(**self._prepare_for_class(inputs_dict, model_class))
global_rng = random.Random()

View File

@@ -16,10 +16,10 @@
import unittest
from transformers import is_torch_available
from transformers.testing_utils import require_torch, slow, torch_device
from .test_configuration_common import ConfigTester
from .test_modeling_common import ModelTesterMixin, ids_tensor
from .utils import require_torch, slow, torch_device
if is_torch_available():

View File

@@ -17,10 +17,10 @@
import unittest
from transformers import is_torch_available
from transformers.testing_utils import require_torch, torch_device
from .test_configuration_common import ConfigTester
from .test_modeling_common import ModelTesterMixin, ids_tensor
from .utils import require_torch, torch_device
if is_torch_available():

View File

@@ -17,10 +17,10 @@
import unittest
from transformers import is_torch_available
from transformers.testing_utils import require_torch, slow, torch_device
from .test_configuration_common import ConfigTester
from .test_modeling_common import ModelTesterMixin, ids_tensor
from .utils import require_torch, slow, torch_device
if is_torch_available():

View File

@@ -18,12 +18,12 @@ import tempfile
import unittest
from transformers import is_torch_available
from transformers.testing_utils import require_torch, slow, torch_device
# TODO(PVP): this line reruns all the tests in BertModelTest; not sure whether this can be prevented
# for now only run module with pytest tests/test_modeling_encoder_decoder.py::EncoderDecoderModelTest
from .test_modeling_bert import BertModelTester
from .test_modeling_common import ids_tensor
from .utils import require_torch, slow, torch_device
if is_torch_available():

View File

@@ -17,10 +17,10 @@
import unittest
from transformers import is_torch_available
from transformers.testing_utils import require_torch, slow, torch_device
from .test_configuration_common import ConfigTester
from .test_modeling_common import ModelTesterMixin, ids_tensor
from .utils import require_torch, slow, torch_device
if is_torch_available():

View File

@@ -17,10 +17,10 @@
import unittest
from transformers import is_torch_available
from transformers.testing_utils import require_torch, slow, torch_device
from .test_configuration_common import ConfigTester
from .test_modeling_common import ModelTesterMixin, ids_tensor
from .utils import require_torch, slow, torch_device
if is_torch_available():

View File

@@ -17,10 +17,10 @@
import unittest
from transformers import is_torch_available
from transformers.testing_utils import require_torch, slow, torch_device
from .test_configuration_common import ConfigTester
from .test_modeling_common import ModelTesterMixin, ids_tensor
from .utils import require_torch, slow, torch_device
if is_torch_available():
@@ -115,6 +115,18 @@ class LongformerModelTester:
def check_loss_output(self, result):
self.parent.assertListEqual(list(result["loss"].size()), [])
def create_and_check_attention_mask_determinism(
self, config, input_ids, token_type_ids, input_mask, sequence_labels, token_labels, choice_labels
):
model = LongformerModel(config=config)
model.to(torch_device)
model.eval()
attention_mask = torch.ones(input_ids.shape, dtype=torch.long, device=torch_device)
output_with_mask = model(input_ids, attention_mask=attention_mask)[0]
output_without_mask = model(input_ids)[0]
self.parent.assertTrue(torch.allclose(output_with_mask[0, 0, :5], output_without_mask[0, 0, :5], atol=1e-4))
def create_and_check_longformer_model(
self, config, input_ids, token_type_ids, input_mask, sequence_labels, token_labels, choice_labels
):
@@ -134,6 +146,36 @@ class LongformerModelTester:
)
self.parent.assertListEqual(list(result["pooled_output"].size()), [self.batch_size, self.hidden_size])
def create_and_check_longformer_model_with_global_attention_mask(
self, config, input_ids, token_type_ids, input_mask, sequence_labels, token_labels, choice_labels
):
model = LongformerModel(config=config)
model.to(torch_device)
model.eval()
global_attention_mask = input_mask.clone()
global_attention_mask[:, input_mask.shape[-1] // 2] = 0
global_attention_mask = global_attention_mask.to(torch_device)
sequence_output, pooled_output = model(
input_ids,
attention_mask=input_mask,
global_attention_mask=global_attention_mask,
token_type_ids=token_type_ids,
)
sequence_output, pooled_output = model(
input_ids, token_type_ids=token_type_ids, global_attention_mask=global_attention_mask
)
sequence_output, pooled_output = model(input_ids, global_attention_mask=global_attention_mask)
result = {
"sequence_output": sequence_output,
"pooled_output": pooled_output,
}
self.parent.assertListEqual(
list(result["sequence_output"].size()), [self.batch_size, self.seq_length, self.hidden_size]
)
self.parent.assertListEqual(list(result["pooled_output"].size()), [self.batch_size, self.hidden_size])
def create_and_check_longformer_for_masked_lm(
self, config, input_ids, token_type_ids, input_mask, sequence_labels, token_labels, choice_labels
):
@@ -243,7 +285,13 @@ class LongformerModelTester:
token_labels,
choice_labels,
) = config_and_inputs
inputs_dict = {"input_ids": input_ids, "token_type_ids": token_type_ids, "attention_mask": input_mask}
global_attention_mask = torch.zeros_like(input_ids)
inputs_dict = {
"input_ids": input_ids,
"token_type_ids": token_type_ids,
"attention_mask": input_mask,
"global_attention_mask": global_attention_mask,
}
return config, inputs_dict
def prepare_config_and_inputs_for_question_answering(self):
@@ -277,11 +325,10 @@ class LongformerModelTest(ModelTesterMixin, unittest.TestCase):
(
LongformerModel,
LongformerForMaskedLM,
# TODO: make tests pass for those models
# LongformerForSequenceClassification,
# LongformerForQuestionAnswering,
# LongformerForTokenClassification,
# LongformerForMultipleChoice,
LongformerForSequenceClassification,
LongformerForQuestionAnswering,
LongformerForTokenClassification,
LongformerForMultipleChoice,
)
if is_torch_available()
else ()
@@ -298,6 +345,14 @@ class LongformerModelTest(ModelTesterMixin, unittest.TestCase):
config_and_inputs = self.model_tester.prepare_config_and_inputs()
self.model_tester.create_and_check_longformer_model(*config_and_inputs)
def test_longformer_model_attention_mask_determinism(self):
config_and_inputs = self.model_tester.prepare_config_and_inputs()
self.model_tester.create_and_check_attention_mask_determinism(*config_and_inputs)
def test_longformer_model_global_attention_mask(self):
config_and_inputs = self.model_tester.prepare_config_and_inputs()
self.model_tester.create_and_check_longformer_model_with_global_attention_mask(*config_and_inputs)
def test_longformer_for_masked_lm(self):
config_and_inputs = self.model_tester.prepare_config_and_inputs()
self.model_tester.create_and_check_longformer_for_masked_lm(*config_and_inputs)
@@ -325,15 +380,31 @@ class LongformerModelIntegrationTest(unittest.TestCase):
model = LongformerModel.from_pretrained("allenai/longformer-base-4096")
model.to(torch_device)
# 'Hello world!'
input_ids = torch.tensor([[0, 20920, 232, 328, 1437, 2]], dtype=torch.long, device=torch_device)
attention_mask = torch.ones(input_ids.shape, dtype=torch.long, device=torch_device)
output = model(input_ids, attention_mask=attention_mask)[0]
output_without_mask = model(input_ids)[0]
expected_output_slice = torch.tensor([0.0549, 0.1087, -0.1119, -0.0368, 0.0250], device=torch_device)
self.assertTrue(torch.allclose(output[0, 0, -5:], expected_output_slice, atol=1e-4))
self.assertTrue(torch.allclose(output_without_mask[0, 0, -5:], expected_output_slice, atol=1e-4))
@slow
def test_inference_no_head_long(self):
model = LongformerModel.from_pretrained("allenai/longformer-base-4096")
model.to(torch_device)
# 'Hello world! ' repeated 1000 times
input_ids = torch.tensor(
[[0] + [20920, 232, 328, 1437] * 1000 + [2]], dtype=torch.long, device=torch_device
) # long input
attention_mask = torch.ones(input_ids.shape, dtype=torch.long, device=input_ids.device)
attention_mask[:, [1, 4, 21]] = 2 # Set global attention on a few random positions
global_attention_mask = torch.zeros(input_ids.shape, dtype=torch.long, device=input_ids.device)
global_attention_mask[:, [1, 4, 21]] = 1 # Set global attention on a few random positions
output = model(input_ids, attention_mask=attention_mask)[0]
output = model(input_ids, attention_mask=attention_mask, global_attention_mask=global_attention_mask)[0]
expected_output_sum = torch.tensor(74585.8594, device=torch_device)
expected_output_mean = torch.tensor(0.0243, device=torch_device)
@@ -341,7 +412,7 @@ class LongformerModelIntegrationTest(unittest.TestCase):
self.assertTrue(torch.allclose(output.mean(), expected_output_mean, atol=1e-4))
@slow
def test_inference_masked_lm(self):
def test_inference_masked_lm_long(self):
model = LongformerForMaskedLM.from_pretrained("allenai/longformer-base-4096")
model.to(torch_device)
@@ -352,9 +423,9 @@ class LongformerModelIntegrationTest(unittest.TestCase):
loss, prediction_scores = model(input_ids, labels=input_ids)
expected_loss = torch.tensor(0.0620, device=torch_device)
expected_prediction_scores_sum = torch.tensor(-6.1599e08, device=torch_device)
expected_prediction_scores_mean = torch.tensor(-3.0622, device=torch_device)
expected_loss = torch.tensor(0.0074, device=torch_device)
expected_prediction_scores_sum = torch.tensor(-6.1048e08, device=torch_device)
expected_prediction_scores_mean = torch.tensor(-3.0348, device=torch_device)
input_ids = input_ids.to(torch_device)
self.assertTrue(torch.allclose(loss, expected_loss, atol=1e-4))

View File

@@ -19,8 +19,7 @@ import unittest
from transformers import is_torch_available
from transformers.file_utils import cached_property
from transformers.hf_api import HfApi
from .utils import require_torch, slow, torch_device
from transformers.testing_utils import require_torch, slow, torch_device
if is_torch_available():

View File

@@ -17,10 +17,10 @@
import unittest
from transformers import is_torch_available
from transformers.testing_utils import require_torch, slow, torch_device
from .test_configuration_common import ConfigTester
from .test_modeling_common import ModelTesterMixin, floats_tensor, ids_tensor
from .utils import require_torch, slow, torch_device
if is_torch_available():

View File

@@ -17,10 +17,10 @@
import unittest
from transformers import is_torch_available
from transformers.testing_utils import require_torch, slow, torch_device
from .test_configuration_common import ConfigTester
from .test_modeling_common import ModelTesterMixin, ids_tensor
from .utils import require_torch, slow, torch_device
if is_torch_available():

View File

@@ -16,19 +16,21 @@
import unittest
from transformers import is_torch_available
from transformers.testing_utils import require_multigpu, require_torch, slow, torch_device
from .test_configuration_common import ConfigTester
from .test_modeling_common import ModelTesterMixin, floats_tensor, ids_tensor
from .utils import require_multigpu, require_torch, slow, torch_device
if is_torch_available():
from transformers import (
ReformerConfig,
ReformerForMaskedLM,
ReformerModel,
ReformerModelWithLMHead,
ReformerTokenizer,
ReformerLayer,
ReformerForQuestionAnswering,
REFORMER_PRETRAINED_MODEL_ARCHIVE_LIST,
)
import torch
@@ -43,6 +45,7 @@ class ReformerModelTester:
is_training=None,
is_decoder=None,
use_input_mask=None,
use_labels=None,
vocab_size=None,
attention_head_size=None,
hidden_size=None,
@@ -81,6 +84,7 @@ class ReformerModelTester:
self.is_training = is_training
self.is_decoder = is_decoder
self.use_input_mask = use_input_mask
self.use_labels = use_labels
self.vocab_size = vocab_size
self.attention_head_size = attention_head_size
self.hidden_size = hidden_size
@@ -128,6 +132,10 @@ class ReformerModelTester:
if self.use_input_mask:
input_mask = ids_tensor([self.batch_size, self.seq_length], vocab_size=2)
choice_labels = None
if self.use_labels:
choice_labels = ids_tensor([self.batch_size], 2)
config = ReformerConfig(
vocab_size=self.vocab_size,
hidden_size=self.hidden_size,
@@ -160,14 +168,13 @@ class ReformerModelTester:
config,
input_ids,
input_mask,
choice_labels,
)
def check_loss_output(self, result):
self.parent.assertListEqual(list(result["loss"].size()), [])
def create_and_check_reformer_model(
self, config, input_ids, input_mask,
):
def create_and_check_reformer_model(self, config, input_ids, input_mask, choice_labels):
model = ReformerModel(config=config)
model.to(torch_device)
model.eval()
@@ -182,18 +189,14 @@ class ReformerModelTester:
list(result["sequence_output"].size()), [self.batch_size, self.seq_length, 2 * self.hidden_size],
)
def create_and_check_reformer_model_with_lm_backward(
self, config, input_ids, input_mask,
):
def create_and_check_reformer_model_with_lm_backward(self, config, input_ids, input_mask, choice_labels):
model = ReformerModelWithLMHead(config=config)
model.to(torch_device)
model.eval()
loss = model(input_ids, attention_mask=input_mask, labels=input_ids)[0]
loss.backward()
def create_and_check_reformer_with_lm(
self, config, input_ids, input_mask,
):
def create_and_check_reformer_with_lm(self, config, input_ids, input_mask, choice_labels):
model = ReformerModelWithLMHead(config=config)
model.to(torch_device)
model.eval()
@@ -207,7 +210,24 @@ class ReformerModelTester:
)
self.check_loss_output(result)
def create_and_check_reformer_model_with_attn_mask(self, config, input_ids, input_mask, is_decoder):
def create_and_check_reformer_with_mlm(self, config, input_ids, input_mask, choice_labels):
config.is_decoder = False
model = ReformerForMaskedLM(config=config)
model.to(torch_device)
model.eval()
loss, prediction_scores = model(input_ids, attention_mask=input_mask, labels=input_ids)
result = {
"loss": loss,
"prediction_scores": prediction_scores,
}
self.parent.assertListEqual(
list(result["prediction_scores"].size()), [self.batch_size, self.seq_length, self.vocab_size],
)
self.check_loss_output(result)
def create_and_check_reformer_model_with_attn_mask(
self, config, input_ids, input_mask, choice_labels, is_decoder=False
):
# no special position embeddings
config.axial_pos_embds = False
config.is_decoder = is_decoder
@@ -248,7 +268,9 @@ class ReformerModelTester:
self.parent.assertTrue(torch.allclose(output_padded, output_padded_rolled, atol=1e-3))
def create_and_check_reformer_layer_dropout_seed(self, config, input_ids, input_mask, is_decoder):
def create_and_check_reformer_layer_dropout_seed(
self, config, input_ids, input_mask, choice_labels, is_decoder=False
):
config.is_decoder = is_decoder
layer = ReformerLayer(config).to(torch_device)
layer.train()
@@ -281,7 +303,7 @@ class ReformerModelTester:
torch.allclose(next_hidden_states, hidden_states + feed_forward_hidden_states, atol=1e-3,)
)
def create_and_check_reformer_feed_forward_chunking(self, config, input_ids, input_mask):
def create_and_check_reformer_feed_forward_chunking(self, config, input_ids, input_mask, choice_labels):
torch.manual_seed(0)
model = ReformerModel(config=config)
model.to(torch_device)
@@ -299,7 +321,7 @@ class ReformerModelTester:
hidden_states_with_chunk = model(input_ids, attention_mask=input_mask)[0]
self.parent.assertTrue(torch.allclose(hidden_states_no_chunk, hidden_states_with_chunk, atol=1e-3))
def create_and_check_reformer_feed_backward_chunking(self, config, input_ids, input_mask):
def create_and_check_reformer_feed_backward_chunking(self, config, input_ids, input_mask, choice_labels):
if not self.is_training:
return
@@ -341,7 +363,7 @@ class ReformerModelTester:
torch.allclose(grad_slice_position_factor_2_chunk, grad_slice_position_factor_2_no_chunk, atol=1e-3)
)
def create_and_check_reformer_random_seed(self, config, input_ids, input_mask):
def create_and_check_reformer_random_seed(self, config, input_ids, input_mask, choice_labels):
layer = ReformerLayer(config).to(torch_device)
layer.train()
@@ -372,7 +394,7 @@ class ReformerModelTester:
seeds.append(layer.feed_forward_seed)
self.parent.assertGreater(len(set(seeds)), 70)
def create_and_check_reformer_model_fp16_forward(self, config, input_ids, input_mask):
def create_and_check_reformer_model_fp16_forward(self, config, input_ids, input_mask, choice_labels):
model = ReformerModel(config=config)
model.to(torch_device)
model.half()
@@ -380,7 +402,7 @@ class ReformerModelTester:
output = model(input_ids, attention_mask=input_mask)[0]
self.parent.assertFalse(torch.isnan(output).any().item())
def create_and_check_reformer_model_fp16_generate(self, config, input_ids, input_mask):
def create_and_check_reformer_model_fp16_generate(self, config, input_ids, input_mask, choice_labels):
model = ReformerModelWithLMHead(config=config)
model.to(torch_device)
model.half()
@@ -388,7 +410,7 @@ class ReformerModelTester:
output = model.generate(input_ids, attention_mask=input_mask, do_sample=False)
self.parent.assertFalse(torch.isnan(output).any().item())
def create_and_check_reformer_no_chunking(self, config, input_ids, input_mask):
def create_and_check_reformer_no_chunking(self, config, input_ids, input_mask, choice_labels):
# force chunk length to be bigger than input_ids
config.lsh_attn_chunk_length = 2 * input_ids.shape[-1]
config.local_attn_chunk_length = 2 * input_ids.shape[-1]
@@ -398,9 +420,25 @@ class ReformerModelTester:
output_logits = model(input_ids, attention_mask=input_mask)[0]
self.parent.assertTrue(output_logits.shape[1] == input_ids.shape[-1])
def create_and_check_longformer_for_question_answering(self, config, input_ids, input_mask, choice_labels):
model = ReformerForQuestionAnswering(config=config)
model.to(torch_device)
model.eval()
loss, start_logits, end_logits = model(
input_ids, attention_mask=input_mask, start_positions=choice_labels, end_positions=choice_labels,
)
result = {
"loss": loss,
"start_logits": start_logits,
"end_logits": end_logits,
}
self.parent.assertListEqual(list(result["start_logits"].size()), [self.batch_size, self.seq_length])
self.parent.assertListEqual(list(result["end_logits"].size()), [self.batch_size, self.seq_length])
self.check_loss_output(result)
def prepare_config_and_inputs_for_common(self):
config_and_inputs = self.prepare_config_and_inputs()
(config, input_ids, input_mask,) = config_and_inputs
(config, input_ids, input_mask, choice_labels) = config_and_inputs
inputs_dict = {"input_ids": input_ids, "attention_mask": input_mask}
return config, inputs_dict
@@ -423,17 +461,21 @@ class ReformerTesterMixin:
def test_reformer_model_attn_masking(self):
config_and_inputs = self.model_tester.prepare_config_and_inputs()
self.model_tester.create_and_check_reformer_model_with_attn_mask(*config_and_inputs, True)
self.model_tester.create_and_check_reformer_model_with_attn_mask(*config_and_inputs, False)
self.model_tester.create_and_check_reformer_model_with_attn_mask(*config_and_inputs, is_decoder=True)
self.model_tester.create_and_check_reformer_model_with_attn_mask(*config_and_inputs, is_decoder=False)
def test_reformer_with_lm(self):
config_and_inputs = self.model_tester.prepare_config_and_inputs()
self.model_tester.create_and_check_reformer_with_lm(*config_and_inputs)
def test_reformer_with_mlm(self):
config_and_inputs = self.model_tester.prepare_config_and_inputs()
self.model_tester.create_and_check_reformer_with_mlm(*config_and_inputs)
def test_reformer_layer_training_dropout(self):
config_and_inputs = self.model_tester.prepare_config_and_inputs()
self.model_tester.create_and_check_reformer_layer_dropout_seed(*config_and_inputs, True)
self.model_tester.create_and_check_reformer_layer_dropout_seed(*config_and_inputs, False)
self.model_tester.create_and_check_reformer_layer_dropout_seed(*config_and_inputs, is_decoder=True)
self.model_tester.create_and_check_reformer_layer_dropout_seed(*config_and_inputs, is_decoder=False)
def test_reformer_chunking_forward_equality(self):
config_and_inputs = self.model_tester.prepare_config_and_inputs()
@@ -470,7 +512,9 @@ class ReformerTesterMixin:
@require_torch
class ReformerLocalAttnModelTest(ReformerTesterMixin, ModelTesterMixin, unittest.TestCase):
all_model_classes = (ReformerModel, ReformerModelWithLMHead) if is_torch_available() else ()
all_model_classes = (
(ReformerModel, ReformerModelWithLMHead, ReformerForQuestionAnswering) if is_torch_available() else ()
)
all_generative_model_classes = (ReformerModelWithLMHead,) if is_torch_available() else ()
test_pruning = False
test_headmasking = False
@@ -481,8 +525,9 @@ class ReformerLocalAttnModelTest(ReformerTesterMixin, ModelTesterMixin, unittest
"batch_size": 13,
"seq_length": 32,
"is_training": True,
"is_decoder": False,
"is_decoder": True,
"use_input_mask": True,
"use_labels": True,
"vocab_size": 32,
"attention_head_size": 16,
"hidden_size": 32,
@@ -524,7 +569,9 @@ class ReformerLocalAttnModelTest(ReformerTesterMixin, ModelTesterMixin, unittest
@require_torch
class ReformerLSHAttnModelTest(ReformerTesterMixin, ModelTesterMixin, unittest.TestCase):
all_model_classes = (ReformerModel, ReformerModelWithLMHead) if is_torch_available() else ()
all_model_classes = (
(ReformerModel, ReformerModelWithLMHead, ReformerForQuestionAnswering) if is_torch_available() else ()
)
all_generative_model_classes = (ReformerModelWithLMHead,) if is_torch_available() else ()
test_pruning = False
test_headmasking = False
@@ -535,8 +582,9 @@ class ReformerLSHAttnModelTest(ReformerTesterMixin, ModelTesterMixin, unittest.T
"batch_size": 13,
"seq_length": 13,
"use_input_mask": True,
"use_labels": True,
"is_training": False,
"is_decoder": False,
"is_decoder": True,
"vocab_size": 32,
"attention_head_size": 16,
"hidden_size": 64,
@@ -886,7 +934,7 @@ class ReformerIntegrationTests(unittest.TestCase):
config["num_buckets"] = [2, 4]
config["is_decoder"] = False
torch.manual_seed(0)
model = ReformerModelWithLMHead(ReformerConfig(**config)).to(torch_device)
model = ReformerForMaskedLM(ReformerConfig(**config)).to(torch_device)
model.eval()
input_ids, attn_mask = self._get_input_ids_and_mask()
hidden_states = model(input_ids=input_ids, attention_mask=attn_mask)[0]

View File

@@ -17,10 +17,10 @@
import unittest
from transformers import is_torch_available
from transformers.testing_utils import require_torch, slow, torch_device
from .test_configuration_common import ConfigTester
from .test_modeling_common import ModelTesterMixin, ids_tensor
from .utils import require_torch, slow, torch_device
if is_torch_available():

View File

@@ -17,10 +17,10 @@
import unittest
from transformers import is_torch_available
from transformers.testing_utils import require_torch, slow, torch_device
from .test_configuration_common import ConfigTester
from .test_modeling_common import ModelTesterMixin, ids_tensor
from .utils import require_torch, slow, torch_device
if is_torch_available():

View File

@@ -17,10 +17,10 @@
import unittest
from transformers import AlbertConfig, is_tf_available
from transformers.testing_utils import require_tf, slow
from .test_configuration_common import ConfigTester
from .test_modeling_tf_common import TFModelTesterMixin, ids_tensor
from .utils import require_tf, slow
if is_tf_available():

View File

@@ -17,8 +17,7 @@
import unittest
from transformers import is_tf_available
from .utils import DUMMY_UNKWOWN_IDENTIFIER, SMALL_MODEL_IDENTIFIER, require_tf, slow
from transformers.testing_utils import DUMMY_UNKWOWN_IDENTIFIER, SMALL_MODEL_IDENTIFIER, require_tf, slow
if is_tf_available():

Some files were not shown because too many files have changed in this diff Show More