Compare commits

...

261 Commits

Author SHA1 Message Date
Lysandre
818878dc88 Release: v3.5.0 2020-11-10 08:50:43 -05:00
Lysandre Debut
9cebee38ad Model sharing rst (#8439)
* Update RST

* Finer details

* Re-organize

* Style
2020-11-10 08:35:11 -05:00
Julien Chaumond
ad2303a401 Fix style 2020-11-10 14:28:30 +01:00
Julien Chaumond
55e8d0cea2 Update links from s3 to huggingface.co 2020-11-10 14:03:29 +01:00
Lysandre Debut
850afb422d Patch token classification pipeline (#8364)
* Patch token classification pipeline

* Some added tests for TokenClassificationArgumentHandler (#8366)

Co-authored-by: Nicolas Patry <patry.nicolas@protonmail.com>
2020-11-10 07:29:34 -05:00
Julien Chaumond
70f622fab4 Model versioning (#8324)
* fix typo

* rm use_cdn & references, and implement new hf_bucket_url

* I'm pretty sure we don't need to `read` this file

* same here

* [BIG] file_utils.networking: do not gobble up errors anymore

* Fix CI 😇

* Apply suggestions from code review

Co-authored-by: Sylvain Gugger <35901082+sgugger@users.noreply.github.com>

* Tiny doc tweak

* Add doc + pass kwarg everywhere

* Add more tests and explain

cc @sshleifer let me know if better

Co-Authored-By: Sam Shleifer <sshleifer@gmail.com>

* Also implement revision in pipelines

In the case where we're passing a task name or a string model identifier

* Fix CI 😇

* Fix CI

* [hf_api] new methods + command line implem

* make style

* Final endpoints post-migration

* Fix post-migration

* Py3.6 compat

cc @stefan-it

Thank you @stas00

Co-authored-by: Sylvain Gugger <35901082+sgugger@users.noreply.github.com>
Co-authored-by: Sam Shleifer <sshleifer@gmail.com>
2020-11-10 07:11:02 -05:00
Teven
4185b115d4 Changing XLNet default from not using memories to 512 context size following paper (#8417)
* Move XLNet memory length FutureWarning

* isort

* style

* Changed default XLNet memory length
2020-11-09 20:49:51 -05:00
Stas Bekman
190df58560 [github CI] add a multi-gpu job for all example tests (#8341)
* add a multi-gpu job for all example tests

* run only ported tests

* rename

* explain why env is re-activated on each step

* mark all unported/checked tests with @require_torch_non_multigpu_but_fix_me

* style

* Apply suggestions from code review

Co-authored-by: Sam Shleifer <sshleifer@gmail.com>

Co-authored-by: Sam Shleifer <sshleifer@gmail.com>
2020-11-09 15:47:38 -05:00
Sylvain Gugger
a39218b75b Check all models are in an auto class (#8425) 2020-11-09 15:44:54 -05:00
Stas Bekman
ef032ddd1e [docs] [testing] gpu decorators table (#8422)
* gpu decorators table

* whitespace

* Update docs/source/testing.rst

Co-authored-by: Lysandre Debut <lysandre@huggingface.co>

* whitespace

Co-authored-by: Lysandre Debut <lysandre@huggingface.co>
2020-11-09 14:27:42 -05:00
Sam Shleifer
a8339b9ecc Fix bart shape comment (#8423) 2020-11-09 13:25:33 -05:00
Sam Shleifer
46509d1c19 [docs] remove sshleifer from issue-template :( (#8418) 2020-11-09 12:51:38 -05:00
Patrick von Platen
9c83b96e62 [Tests] Add Common Test for Training + Fix a couple of bugs (#8415)
* add training tests

* correct longformer

* fix docs

* fix some tests

* fix some more train tests

* remove ipdb

* fix multiple edge case model training

* fix funnel and prophetnet

* clean gpt models

* undo renaming of albert
2020-11-09 18:24:41 +01:00
Sylvain Gugger
52040517b8 Deprecate old data/metrics functions (#8420) 2020-11-09 12:10:09 -05:00
Stas Bekman
d4d1fbfc5a [fsmt convert script] fairseq broke chkpt data - fixing that (#8377)
* fairseq broke chkpt data - fixing that

* style

* support older bpecodes filenames - specifically "code" in iwslt14
2020-11-09 11:57:42 -05:00
Sylvain Gugger
5c766ecb50 Fix typo 2020-11-09 11:50:51 -05:00
Sylvain Gugger
908a28894c Add new token classification example (#8340)
* Add new token classification example

* Remove txt file

* Add test

* With actual testing done

* Less warmup is better

* Update examples/token-classification/run_ner_new.py

Co-authored-by: Thomas Wolf <thomwolf@users.noreply.github.com>

* Address review comments

* Fix test

* Make Lysandre happy

* Last touches and rename

* Rename in tests

* Address review comments

* More run_ner -> run_ner_old

Co-authored-by: Thomas Wolf <thomwolf@users.noreply.github.com>
2020-11-09 11:39:55 -05:00
Sylvain Gugger
c7cb1aa26c Bump tokenizers (#8419) 2020-11-09 11:32:10 -05:00
Stas Bekman
78d706f3ae [fsmt tokenizer] support lowercase tokenizer (#8389)
* support lowercase tokenizer

* fix arg pos
2020-11-09 10:41:39 -05:00
Shashank Gupta
1e2acd0dcf Bug fix for permutation language modelling (#8409) 2020-11-09 10:23:26 -05:00
Philip May
bf8625e70b add evaluate doc - trainer.evaluate returns 'epoch' from training (#8273)
* add evaluate doc

* fix style with utils/style.doc

* Update src/transformers/trainer.py

Co-authored-by: Sylvain Gugger <35901082+sgugger@users.noreply.github.com>

Co-authored-by: Sylvain Gugger <35901082+sgugger@users.noreply.github.com>
2020-11-09 09:00:59 -05:00
Sam Shleifer
ebde57acac examples/docs: caveat that PL examples don't work on TPU (#8309) 2020-11-09 08:55:22 -05:00
Julien Plu
76e7a44dee Fix some tooling for windows (#8359)
* Fix some tooling for windows

* Fix conflict

* Trigger CI
2020-11-09 13:50:38 +01:00
dartrevan
507dfb40c3 Update README.md (#8406) 2020-11-09 16:44:43 +08:00
smanjil
7247d0b4ea updating tag for exbert viz (#8408) 2020-11-09 16:43:55 +08:00
Stas Bekman
4ab5617b0b comet_ml temporary fix(#8410) 2020-11-09 16:36:06 +08:00
Sam Shleifer
e6d9cdaafe [s2s/distill] remove run_distiller.sh, fix xsum script (#8412) 2020-11-08 16:57:43 -05:00
Stas Bekman
66582492d3 [s2s test_finetune_trainer] failing multigpu test (#8400) 2020-11-08 16:45:40 -05:00
Stas Bekman
f62755a600 [s2s examples test] fix data path (#8398) 2020-11-08 16:44:18 -05:00
Jonathan Chang
4a53e8e9e4 Fix DataCollatorForWholeWordMask again (#8397) 2020-11-08 09:53:01 -05:00
Manav Rathod
610730998f fixed default labels for QA model (#8399) 2020-11-08 09:08:14 -05:00
Chengxi Guo
0b02489b2c Add gpt2-medium-chinese model card (#8402)
* Create README.md

* Update model_cards/mymusise/gpt2-medium-chinese/README.md

Co-authored-by: Julien Chaumond <chaumond@gmail.com>
2020-11-08 05:00:19 -05:00
Stas Bekman
187554366f fix md table (#8395) 2020-11-08 04:25:14 -05:00
Jonathan Chang
77a257fc21 Fix DataCollatorForWholeWordMask (#8379)
* Fix DataCollatorForWholeWordMask

* Replace all tensorize_batch in data_collator.py
2020-11-07 12:51:56 -05:00
Stas Bekman
517eaf460b [make] rewrite modified_py_files in python to be cross-platform (#8371)
* rewrite modified_py_files in python to be cross-platform

* try a different way to test for variable not being ""

* improve comment
2020-11-07 18:45:16 +01:00
Patrick von Platen
07708793f2 fix encoder outputs (#8368) 2020-11-06 21:03:25 +01:00
Yossi Synett
bc0d26d1de [All Seq2Seq model + CLM models that can be used with EncoderDecoder] Add cross-attention weights to outputs (#8071)
* Output cross-attention with decoder attention output

* Update src/transformers/modeling_bert.py

* add cross-attention for t5 and bart as well

* fix tests

* correct typo in docs

* add sylvains and sams comments

* correct typo

Co-authored-by: Patrick von Platen <patrick.v.platen@gmail.com>
2020-11-06 19:34:48 +01:00
hassoudi
30f2507a07 Update README.md (#8360)
Fix websitr address
2020-11-06 11:45:46 -05:00
Jonathan Chang
5807ba3fa9 Fix typo (#8351) 2020-11-06 11:19:41 -05:00
hassoudi
82146496b6 Update README.md (#8338)
fixes
2020-11-06 06:20:58 -05:00
ktrapeznikov
9e5c4d39ab Create README.md (#8312)
* Create README.md

* Update model_cards/ktrapeznikov/gpt2-medium-topic-news/README.md

Co-authored-by: Julien Chaumond <chaumond@gmail.com>
2020-11-06 06:19:59 -05:00
hasantanvir79
06ebc37967 Create README.md (#8255)
* Create README.md

Initial commit

* Updated Read me

Updated

* Apply suggestions from code review

Co-authored-by: Julien Chaumond <chaumond@gmail.com>
2020-11-06 03:34:24 -05:00
Karthik Uppuluri
41cd031cf2 Create README.md (#8169) 2020-11-06 03:26:07 -05:00
Karthik Uppuluri
f932ddeff5 Create README.md (#8170) 2020-11-06 03:25:52 -05:00
Karthik Uppuluri
08b92f78fa Create README.md (#8168)
* Create README.md

* Update README.md
2020-11-06 03:25:33 -05:00
Karthik Uppuluri
77d62e78b0 Create README.md (#8167)
* Create README.md

Telugu BERTU Readme file

* Update model_cards/kuppuluri/telugu_bertu/README.md

Co-authored-by: Julien Chaumond <chaumond@gmail.com>
2020-11-06 03:24:31 -05:00
Yifan Peng
dd6bfcaefb Create README.md (#8327) 2020-11-06 03:22:52 -05:00
smanjil
ddeecf08e6 german medbert model details (#8266)
* model details

* Apply suggestions from code review

Co-authored-by: Julien Chaumond <chaumond@gmail.com>
2020-11-06 03:21:13 -05:00
Jiaxin Pei
96baaafd34 Create README.md (#8258) 2020-11-06 03:19:12 -05:00
Stefan Schweter
185259c261 [model_cards] Update Italian BERT models and introduce new Italian XXL ELECTRA model 🎉 (#8343) 2020-11-06 03:17:03 -05:00
Manuel Romero
34bbf60bf8 Model card: GPT-2 fine-tuned on CommonGen (#8248) 2020-11-06 03:15:11 -05:00
Manuel Romero
973218fd3b Model card: CodeBERT fine-tuned for Insecure Code Detection (#8247)
* Model card: CodeBERT fine-tuned for Insecure Code Detection

* Update model_cards/mrm8488/codebert-base-finetuned-detect-insecure-code/README.md

Co-authored-by: Julien Chaumond <chaumond@gmail.com>
2020-11-06 03:13:45 -05:00
Manuel Romero
f833ca418b Model card: T5-base fine-tuned on QuaRel (#8334) 2020-11-06 03:09:55 -05:00
Stas Bekman
9edafaebef [s2s] test_bash_script.py - actually learn something (#8318)
* use decorator

* remove hardcoded paths

* make the test use more data and do real quality tests

* shave off 10 secs

* add --eval_beams 2, reformat

* reduce train size, use smaller custom dataset
2020-11-05 23:15:14 -05:00
Leandro von Werra
17450397a7 Docs bart training ref (#8330)
Co-authored-by: Sam Shleifer <sshleifer@gmail.com>
2020-11-05 17:20:57 -05:00
Stas Bekman
d787935a14 [s2s] test_distributed_eval (#8315)
Co-authored-by: Sam Shleifer <sshleifer@gmail.com>
2020-11-05 16:01:15 -05:00
Sylvain Gugger
04e442d575 Make Trainer evaluation handle dynamic seq_length (#8336)
* Make Trainer evaluation handle dynamic seq_length

* Document behavior.

* Fix test

* Better fix

* Fixes for realsies this time

* Address review comments

* Without forgetting to save...
2020-11-05 15:13:51 -05:00
Guillaume Filion
27b402cab0 Output global_attentions in Longformer models (#7562)
* Output global_attentions in Longformer models

* make style

* small refactoring

* fix tests

* make fix-copies

* add for tf as well

* remove comments in test

* make fix-copies

* make style

* add docs

* make docstring pretty

Co-authored-by: patrickvonplaten <patrick.v.platen@gmail.com>
2020-11-05 21:10:43 +01:00
Sam Shleifer
7abc1d96d1 no warn (#8329) 2020-11-05 11:42:24 -05:00
Bobby Donchev
52f44dd6d2 change TokenClassificationTask class methods to static methods (#7902)
* change TokenClassificationTask class methods to static methods

Since we do not require self in the class methods of TokenClassificationTask we should probably switch to static methods. Also, since the class TokenClassificationTask does not contain a constructor it is currently unusable as is. By switching to static methods this fixes the issue of having to document the intent of the broken class.

Also, since the get_labels and read_examples_from_file methods are ought to be implemented. Static method definitions are unchanged even after inheritance, which means that it can be overridden, similar to other class methods.

* Trigger Build

Co-authored-by: Lysandre <lysandre.debut@reseau.eseo.fr>
2020-11-05 09:38:30 -05:00
Guillem García Subies
77c8f6c627 Corrected typo in readme (#8320) 2020-11-05 07:48:36 -05:00
Patrick von Platen
226b9debb7 Update PULL_REQUEST_TEMPLATE.md 2020-11-05 09:40:15 +01:00
Patrick von Platen
6f35c61f93 Update bug-report.md 2020-11-05 09:39:05 +01:00
Yifan Peng
638c0b7c50 Create README.md (#8223)
* Create README.md

* Update README.md

* Apply suggestions from code review

Co-authored-by: Kevin Canwen Xu <canwenxu@126.com>
Co-authored-by: Julien Chaumond <chaumond@gmail.com>
2020-11-05 03:03:19 -05:00
Sylvain Gugger
9c4aa4ac1a Clean up data collators and datasets (#8308)
* Clean up data collators and datasets

* Apply suggestions from code review

Co-authored-by: Lysandre Debut <lysandre@huggingface.co>

* Remove needless clone

Co-authored-by: Lysandre Debut <lysandre@huggingface.co>
2020-11-04 17:24:49 -05:00
Manuel Romero
b1d3e95eb5 Fix path to old run_language_modeling.py script (#8302) 2020-11-04 13:17:57 -05:00
Sylvain Gugger
b6e58db277 Speedup doc build (#8301)
* Try -j option

* Try other thing

* Bigger machine

* Test lower sphinx version

* Remove trailing space
2020-11-04 11:51:21 -05:00
Victor SANH
969ccac2e9 adding model cards for distilled models (#8300)
* adding model cards for distil models

* forgot the languages
2020-11-04 11:41:45 -05:00
Nicolas Patry
7342d9a583 Improve QA pipeline error handling (#8286)
- The issue is that with previous code we would have the following:

```python
qa_pipeline = (...)
qa_pipeline(question="Where was he born ?", context="")
-> IndexError: Dimension out of range (expected to be in range of [-1, 0], but got 1)
```

The goal here is to improve this to actually return a ValueError
wherever possible.

While at it, I tried to simplify QuestionArgumentHandler's code to
make it smaller and more compat while keeping backward compat.
2020-11-04 11:30:42 -05:00
Branden Chan
38630e7a87 Update model cards of deepset/roberta-base-squad2 v1 and v2 (#8241)
* update deepset/roberta-base-squad2 to v2

* Update model_cards/deepset/roberta-base-squad2/README.md

Co-authored-by: Julien Chaumond <chaumond@gmail.com>
2020-11-04 11:21:25 -05:00
Manuel Romero
04561ecbe6 Model card: T5-base fine-tuned on QASC (#8299) 2020-11-04 11:20:15 -05:00
Sylvain Gugger
854b44aa38 Revert size change as it doesn't change anything 2020-11-04 11:13:24 -05:00
Sylvain Gugger
414985c427 Upgrade resource for doc building 2020-11-04 10:44:19 -05:00
Sylvain Gugger
cf89724696 Fix validation file loading in scripts (#8298) 2020-11-04 10:42:18 -05:00
Patrick von Platen
cb966e640b [Generate Test] fix greedy generate test (#8293)
* fix greedy generate test

* delet ipdb
2020-11-04 15:44:36 +01:00
Pengzhi Gao
734afa37f6 Fix typo in language-modeling README.md (#8287) 2020-11-04 09:38:02 -05:00
Stas Bekman
7a7e2c2606 [blenderbot] regex fix (#8282)
Fixing:

```
src/transformers/tokenization_blenderbot.py:163: DeprecationWarning: invalid escape sequence \s
    token = re.sub("\s{2,}", " ", token)
```
2020-11-04 09:02:28 -05:00
Ceyda Cinarel
29b536a73a [WIP] Ner pipeline grouped_entities fixes (#5970)
* Bug fix: NER pipeline shouldn't group separate entities of same type

* style fix

* [Bug Fix] Shouldn't group entities that are both 'B' even if they are same type
	(B-type1 B-type1) != (B-type1 I-type1)
[Bug Fix] add an option `ignore_subwords` to ignore subsequent ##wordpieces in predictions. Because some models train on only the first token of a word and not on the subsequent wordpieces (BERT NER default). So it makes sense doing the same thing at inference time.
	The simplest fix is to just group the subwords with the first wordpiece.
	[TODO] how to handle ignored scores? just set them to 0 and calculate zero invariant mean ?
	[TODO] handle different wordpiece_prefix ## ? possible approaches:
		get it from tokenizer? but currently most tokenizers dont have a wordpiece_prefix property?
		have an _is_subword(token)
[Feature add] added option to `skip_special_tokens`. Cause It was harder to remove them after grouping.
[Additional Changes] remove B/I prefix on returned grouped_entities
[Feature Request/TODO] Return indexes?
[Bug TODO]  can't use fast tokenizer with grouped_entities ('BertTokenizerFast' object has no attribute 'convert_tokens_to_string')

* use offset_mapping to fix [UNK] token problem

* ignore score for subwords

* modify ner_pipeline test

* modify ner_pipeline test

* modify ner_pipeline test

* ner_pipeline change ignore_subwords default to true

* add ner_pipeline ignore_subword=False test case

* fix offset_mapping index

* fix style again duh

* change is_subword and convert_tokens_to_string logic

* merge tests with new test structure

* change test names

* remove old tests

* ner tests for fast tokenizer

* fast tokenizers have convert_tokens_to_string

* Fix the incorrect merge

Co-authored-by: Ceyda Cinarel <snu-ceyda@users.noreply.github.com>
Co-authored-by: Lysandre Debut <lysandre@huggingface.co>
Co-authored-by: Lysandre <lysandre.debut@reseau.eseo.fr>
2020-11-03 17:21:04 -05:00
Stas Bekman
1bb4bba53c [CIs] Better reports everywhere (#8275)
* make it possible to invoke testconf.py in both test suites without crashing on having the same option added

* perl -pi -e 's|--make_reports|--make-reports|' to be consistent with other opts

* add `pytest --make-reports` to all CIs (and artifacts)

* fix
2020-11-03 16:57:12 -05:00
Sylvain Gugger
7f556d2e39 Data collator for token classification (#8274)
* Add DataCollatorForTokenClassification and clean tests

* Make quality
2020-11-03 16:33:27 -05:00
Philip May
6a064447f2 improve documentation of training_args.py (#8270)
* improve documentation of training_args.py

- do_train
- do_eval
- do_predict

* fix line too long

* fix style with black on training_args.py

* Update src/transformers/training_args.py

Co-authored-by: Sylvain Gugger <35901082+sgugger@users.noreply.github.com>

* Update src/transformers/training_args.py

Co-authored-by: Sylvain Gugger <35901082+sgugger@users.noreply.github.com>

* Update src/transformers/training_args.py

Co-authored-by: Sylvain Gugger <35901082+sgugger@users.noreply.github.com>

* fix line length with utils/style_doc

* black reformatting

Co-authored-by: Sylvain Gugger <35901082+sgugger@users.noreply.github.com>
2020-11-03 15:57:17 -05:00
Sylvain Gugger
4c19f3baab Clean Trainer tests and datasets dep (#8268) 2020-11-03 15:50:55 -05:00
Patrick von Platen
068e6b5edd make files independent (#8267) 2020-11-03 21:13:33 +01:00
Stas Bekman
cd360dcb26 [examples] minimal version requirement run-time check in PL (#8133)
Co-authored-by: Sam Shleifer <sshleifer@gmail.com>
2020-11-03 13:17:11 -05:00
Stas Bekman
971c638ee9 forward the worker stderr to the parent process (#8262) 2020-11-03 12:04:53 -05:00
Lysandre
eb6313e823 Fix Tatoeba skip 2020-11-03 10:35:00 -05:00
guillaume-be
74f6f91a9d Updated ConversationalPipeline to work with encoder-decoder models (#8207)
* Updated ConversationalPipeline to work with encoder-decoder models (e.g. BlenderBot)

* Addition of integration test for EncoderDecoder conversation model

Co-authored-by: Lysandre Debut <lysandre@huggingface.co>
2020-11-03 10:33:01 -05:00
Nicolas Patry
c66ffa3a17 [FIX] TextGenerationPipeline is currently broken. (#8256)
* [FIX] TextGenerationPipeline is currently broken.

It's most likely due to #8180.
What's missing is a multi vs single string handler at the beginning of
the pipe.
And also there was no testing of this pipeline.

* Fixing Conversational tests too.
2020-11-03 10:10:22 -05:00
Patrick von Platen
a1bbcf3f6c Refactoring the generate() function (#6949)
* first draft

* show design proposition for new generate method

* up

* make better readable

* make first version

* gpt2 tests pass

* make beam search for gpt2 work

* add first encoder-decoder code

* delete typo

* make t5 work

* save indermediate

* make bart work with beam search

* finish beam search bart / t5

* add default kwargs

* make more tests pass

* fix no bad words sampler

* some fixes and tests for all distribution processors

* fix test

* fix rag slow tests

* merge to master

* add nograd to generate

* make all slow tests pass

* speed up generate

* fix edge case bug

* small fix

* correct typo

* add type hints and docstrings

* fix typos in tests

* add beam search tests

* add tests for beam scorer

* fix test rag

* finish beam search tests

* move generation tests in seperate file

* fix generation tests

* more tests

* add aggressive generation tests

* fix tests

* add gpt2 sample test

* add more docstring

* add more docs

* finish doc strings

* apply some more of sylvains and sams comments

* fix some typos

* make fix copies

* apply lysandres and sylvains comments

* final corrections on examples

* small fix for reformer
2020-11-03 16:04:22 +01:00
Sam Shleifer
b63beb743c Skip tatoeba tests if Tatoeba-Challenge not cloned (#8260) 2020-11-03 09:49:29 -05:00
Patrick von Platen
9f1747f999 [Seq2Seq] Correct import in Seq2Seq Trainer (#8254) 2020-11-03 07:56:41 -05:00
Stas Bekman
504ff7bb12 2 SinusoidalPositionalEmbedding fixes (#8226) 2020-11-02 18:50:26 -05:00
Patrick von Platen
f744b81572 add new notebooks (#8246) 2020-11-02 20:21:55 +01:00
Patrick von Platen
dc26726df2 fix encoder decoder bug (#8243) 2020-11-02 20:12:34 +01:00
Lysandre Debut
9a23af4aff Add XLMProphetNetTokenizer to tokenization auto (#8245) 2020-11-02 14:10:09 -05:00
Patrick von Platen
5b178f3c87 Create README.md 2020-11-02 20:03:44 +01:00
Sylvain Gugger
e1b1b614b1 Add line by line option to mlm/plm scripts (#8240)
* Make line by line optional in run_mlm

* Add option to disable dynamic padding

* Add option to plm too and update README

* Typos

* More typos

* Even more typos

* Apply suggestions from code review

Co-authored-by: Lysandre Debut <lysandre@huggingface.co>

Co-authored-by: Lysandre Debut <lysandre@huggingface.co>
2020-11-02 12:27:04 -05:00
Patrick von Platen
ebec410c71 Create README.md 2020-11-02 17:53:22 +01:00
Sylvain Gugger
5406f31a1a Fix TensorBoardCallback for older versions of PyTorch (#8239) 2020-11-02 10:43:28 -05:00
Sylvain Gugger
d1ad4bff44 Fix bad import with PyTorch <= 1.4.1 (#8237) 2020-11-02 10:26:37 -05:00
Lysandre Debut
3c8d401cf6 Patch reports (#8238) 2020-11-02 10:26:25 -05:00
Martin Monperrus
93354bc779 doc: fix typo (#8235) 2020-11-02 08:53:17 -05:00
Santiago Castro
0c92e7d9fa Fix ignore list behavior in doctests (#8213) 2020-11-02 08:47:37 -05:00
Nicolas Patry
84caa23301 Fix the behaviour of DefaultArgumentHandler (removing it). (#8180)
* Some work to fix the behaviour of DefaultArgumentHandler by removing it.

* Fixing specific pipelines argument checking.
2020-11-02 12:33:50 +01:00
Zhiqi Huang
00cc2d1df2 DynaBERT model cards update (#8192)
* Update README.md

* Update README.md
2020-11-02 13:19:38 +08:00
Kushal
aa79aa4e7d Added 12 model cards for Indian Language Models (#8198)
* Create README.md

* added model cards
2020-11-02 13:17:43 +08:00
Patrick von Platen
9bd30f7cf4 [Seq2SeqTrainer] Move import to init to make file self-contained (#8194)
* boom boom

* reverse order
2020-11-01 23:31:55 +01:00
guillaume-be
1f12934df4 [Bug fix] Fixed value for BlenderBot pad token (#8205) 2020-11-01 10:21:57 -05:00
Abi See
8f1c960ee7 Fix two bugs with --logging_first_step (#8193)
* make sure that logging_first_step evaluates

* fix bug with incorrect loss on logging_first_step

* fix style

* logging_first_step only logs, not evals
2020-10-30 16:45:38 -04:00
Avital Oliver
689ff74f99 Minor style improvements for the Flax BERT and RoBERTa examples (#8178)
* Minor style improvements:

1. Use `@nn.compact` rather than `@compact` (as to not make it seem
   like compact is a standard Python decorator.
2. Move attribute docstrings from two `__call__` methods to comments
   on the attributes themselves. (This was probably a remnant from
   the pre-Linen version where the attributes were arguments to
   `call`.)

* Use black on the Flax modeling code
2020-10-30 16:25:39 -04:00
Sylvain Gugger
9eb3a410cd Remove deprecated arguments from new run_clm (#8197) 2020-10-30 15:27:20 -04:00
TFUsers
00112c3539 Replace swish with silu (#8166)
* Replace swish with silu

* revert nn.silu to nn.swish due to older version

* simplify optimized silu conditional and fix format

* Update activations.py

* Update activations_tf.py

* Update modeling_flax_utils.py

* Update modeling_openai.py

* add swish testcase

* add pytorch swish testcase

* Add more robust python version check

* more formatting fixes

Co-authored-by: TFUsers <TFUsers@gmail.com>
2020-10-30 15:09:10 -04:00
Sylvain Gugger
cdc48ce92d Finalize lm examples (#8188)
* Finish the cleanup of the language-modeling examples

* Update main README

* Apply suggestions from code review

Co-authored-by: Lysandre Debut <lysandre@huggingface.co>

* Apply suggestions from code review

Co-authored-by: Thomas Wolf <thomwolf@users.noreply.github.com>

* Propagate changes

Co-authored-by: Lysandre Debut <lysandre@huggingface.co>
Co-authored-by: Thomas Wolf <thomwolf@users.noreply.github.com>
2020-10-30 14:20:18 -04:00
Sylvain Gugger
089cc1015e Doc fixes and filter warning in wandb (#8189) 2020-10-30 12:37:34 -04:00
Sam Shleifer
566b083eb1 TFMarian, TFMbart, TFPegasus, TFBlenderbot (#7987)
* Start plumbing

* Marian close

* Small stubs for all children

* Fixed bart

* marian working

* pegasus test is good, but failing

* Checkin tests

* More model files

* Subtle marian, pegasus integration test failures

* Works well

* rm print

* boom boom

* Still failing model2doc

* merge master

* Equivalence test failing, all others fixed

* cleanup

* Fix embed_scale

* Cleanup marian pipeline test

* Undo extra changes

* Smaller delta

* Cleanup model testers

* undo delta

* fix tests import structure

* cross test decorator

* Cleaner set_weights

* Respect authorized_unexpected_keys

* No warnings

* No warnings

* style

* Nest tf import

* black

* Apply suggestions from code review

Co-authored-by: Lysandre Debut <lysandre@huggingface.co>

* functional dropout

* fixup

* Fixup

* style_doc

* embs

* shape list

* delete slow force_token_id_to_be_generated func

* fixup

Co-authored-by: Lysandre Debut <lysandre@huggingface.co>
2020-10-30 11:23:16 -04:00
Santiago Castro
6279072f5f Fix typo: s/languaged/language/ (#8165) 2020-10-30 11:22:03 -04:00
Lysandre Debut
10f8c63620 Ci test tf super slow (#8007)
* Test TF GPU CI

* Change cache

* Fix missing torch requirement

* Fix some model tests


Style

* LXMERT

* MobileBERT

* Longformer skip test

* XLNet

* The rest of the tests

* RAG goes OOM in multi gpu setup

* YAML test files

* Last fixes

* Skip doctests

* Fill mask tests

* Yaml files

* Last test fix

* Style

* Update cache

* Change ONNX tests to slow + use tiny model
2020-10-30 10:25:48 -04:00
Nicolas Patry
7e36deec7a Fixing some warnings in DeBerta (#8176)
* Fixing some warnings in DeBerta

* Fixing docs with their rewritten version.
2020-10-30 09:15:41 -04:00
Stas Bekman
0538820737 [CI] Better reports #2 (#8163) 2020-10-29 19:30:05 -04:00
wlhgtc
9a21b50614 Fix eval ref miss in Chinese WWM. (#8115)
* ADD: add whole word mask proxy for both eng and chinese

* MOD: adjust format

* MOD: reformat code

* MOD: update import

* MOD: fix bug

* MOD: add import

* MOD: fix bug

* MOD: decouple code and update readme

* MOD: reformat code

* Update examples/language-modeling/README.md

Co-authored-by: Sylvain Gugger <35901082+sgugger@users.noreply.github.com>

* Update examples/language-modeling/README.md

Co-authored-by: Sylvain Gugger <35901082+sgugger@users.noreply.github.com>

* Update examples/language-modeling/run_language_modeling.py

Co-authored-by: Sylvain Gugger <35901082+sgugger@users.noreply.github.com>

* Update examples/language-modeling/run_language_modeling.py

Co-authored-by: Sylvain Gugger <35901082+sgugger@users.noreply.github.com>

* Update examples/language-modeling/run_language_modeling.py

Co-authored-by: Sylvain Gugger <35901082+sgugger@users.noreply.github.com>

* Update examples/language-modeling/run_language_modeling.py

Co-authored-by: Sylvain Gugger <35901082+sgugger@users.noreply.github.com>

* change wwm to whole_word_mask

* reformat code

* reformat

* format

* Code quality

* ADD: update chinese ref readme

* MOD: small changes

* MOD: small changes2

* update readme

* fix eval ref file miss bug

* format file

* MOD: move ref code to contrib

* MOD: add delimeter check

* reformat code

* refomat code

* Update examples/language-modeling/README.md

Co-authored-by: Sylvain Gugger <35901082+sgugger@users.noreply.github.com>
Co-authored-by: Sylvain Gugger <sylvain.gugger@gmail.com>
Co-authored-by: Lysandre Debut <lysandre@huggingface.co>
2020-10-29 17:08:39 -04:00
Santiago Castro
fdf893c441 Fix typo: indinces -> indices (#8159)
* Fix typo: indinces -> indices

* Fix some more

* Fix some more

* Fix some more

* Fix CI
2020-10-29 17:04:20 -04:00
Stas Bekman
c83cec44f8 improve error checking (#8157) 2020-10-29 14:05:24 -04:00
Sylvain Gugger
691176283d Add a template for examples and apply it for mlm and plm examples (#8153)
* Add a template for example scripts and apply it to mlm

* Formatting

* Fix test

* Add plm script

* Add a template for example scripts and apply it to mlm

* Formatting

* Fix test

* Add plm script

* Add a template for example scripts and apply it to mlm

* Formatting

* Fix test

* Add plm script

* Styling
2020-10-29 13:38:11 -04:00
Sam Shleifer
49e4fece5c [s2s] distillBART docs for paper replication (#8150) 2020-10-29 12:01:15 -04:00
Sylvain Gugger
acf56408d8 Smarter prediction loop and no- -> no_ in console args (#8151)
* Smarter prediction loop and no- -> no_ in console args

* Fix test
2020-10-29 10:56:25 -04:00
Sylvain Gugger
b0f1c0ee30 Document tokenizer_class in configurations (#8152) 2020-10-29 10:43:45 -04:00
Santiago Castro
969859d5f6 Fix doc errors and typos across the board (#8139)
* Fix doc errors and typos across the board

* Fix a typo

* Fix the CI

* Fix more typos

* Fix CI

* More fixes

* Fix CI

* More fixes

* More fixes
2020-10-29 10:33:33 -04:00
Ethan
4731a00c3e Update widget examples. (#8149)
Co-authored-by: yantan <yantan@effyic.com>
2020-10-29 08:49:16 -04:00
dartrevan
238876068c Update README.md (#8090) 2020-10-29 08:31:32 -04:00
Branden Chan
e566adc09c Add model_cards (#7969)
* add readme

* add readmes

* Add metadata
2020-10-29 08:29:54 -04:00
dartrevan
cc8941d881 Create README.md (#8089) 2020-10-29 08:23:43 -04:00
dartrevan
234a6dc388 Create README.md (#8088)
* Create README.md

* metadata

Co-authored-by: Julien Chaumond <chaumond@gmail.com>
2020-10-29 08:23:30 -04:00
gurkan08
5d76859531 Create README.md (#8075)
* Create README.md

* Update model_cards/gurkan08/bert-turkish-text-classification/README.md

Co-authored-by: Julien Chaumond <chaumond@gmail.com>
2020-10-29 08:22:33 -04:00
Ethan
b215090eed Add two model_cards: ethanyt/guwenbert-base and ethanyt/guwenbert-large (#8041) 2020-10-29 08:21:54 -04:00
Ashwani Tanwar
ba2ad3a98a Model Card for Gujarati-XLM-R-Base (#8038)
* Add model card for Gujarati-XLM-R-Base

* Update README.md

Add the model card for the Gujarati-XLM-R-Base.

* Apply suggestions from code review

Co-authored-by: Julien Chaumond <chaumond@gmail.com>
2020-10-29 08:21:11 -04:00
Manuel Romero
52cea7de75 Create README.md (#8017) 2020-10-29 08:19:47 -04:00
Manuel Romero
ff82a2aa93 Create README.md (#8015) 2020-10-29 08:19:35 -04:00
Zhiqi Huang
0a3b9733cb Add model_cards for DynaBERT (#8012)
* Update README.md

* Add dynabert_overview.png

* Update README.md

* Create README.md

* Add dynabert_overview.png

* Update README.md

* Update README.md

* Delete dynabert_overview.png

* Update README.md

* Delete dynabert_overview.png

* Update README.md
2020-10-29 08:19:17 -04:00
Patrick von Platen
afa21504b1 add tags (#8147) 2020-10-29 12:45:55 +01:00
Stas Bekman
825925dfaa [s2s test] cleanup (#8131) 2020-10-28 16:50:36 -04:00
Santiago Castro
e477eb919f Fix typo in AutoModelForMaskedLM docs (#8129) 2020-10-28 15:52:28 -04:00
Sean Naren
5e24982e58 Upgrade PyTorch Lightning to 1.0.2 (#7852)
Co-authored-by: Sam Shleifer <sshleifer@gmail.com>
2020-10-28 14:59:14 -04:00
Lysandre Debut
1b6c8d4811 Update CI cache (#8126) 2020-10-28 13:59:43 -04:00
Sylvain Gugger
378142afdf Rename add_start_docstrings_to_callable (#8120) 2020-10-28 13:42:31 -04:00
Sylvain Gugger
6241c873cd Document the various LM Auto models (#8118) 2020-10-28 13:41:56 -04:00
Bram Vanroy
5193172f12 [DOC] Improve pipeline() docstrings for config and tokenizer (#8123)
* Improve pipeline() docstrings

* make style

* Update wording for config
2020-10-28 13:26:12 -04:00
Boris Dayma
b4cacb7a63 fix(trainer_callback]: typo (#8121) 2020-10-28 12:15:30 -04:00
Stas Bekman
5423f2a9d4 [testing] port test_trainer_distributed to distributed pytest + TestCasePlus enhancements (#8107)
* move the helper code into testing_utils

* port test_trainer_distributed to work with pytest

* improve docs

* simplify notes

* doc

* doc

* style

* doc

* further improvements

* torch might not be available

* real fix

* Apply suggestions from code review

Co-authored-by: Sylvain Gugger <35901082+sgugger@users.noreply.github.com>

Co-authored-by: Sylvain Gugger <35901082+sgugger@users.noreply.github.com>
2020-10-28 11:51:32 -04:00
Sylvain Gugger
47dfa65b0c New run_clm script (#8105)
* New run_clm script

* Formatting

* More comments

* Remove unused imports

* Apply suggestions from code review

Co-authored-by: Thomas Wolf <thomwolf@users.noreply.github.com>

* Address review comments

* Change link to the hub

Co-authored-by: Thomas Wolf <thomwolf@users.noreply.github.com>
2020-10-28 10:38:58 -04:00
Stas Bekman
8065fea870 [gh actions] run artifacts job always (#8110) 2020-10-28 01:45:19 -04:00
Sylvain Gugger
1e01db3579 Remove header 2020-10-27 17:36:13 -04:00
Sylvain Gugger
b715e40ced Fix typo 2020-10-27 17:34:05 -04:00
Sylvain Gugger
41cc5f3f59 Move installation instructions to the top (#8106) 2020-10-27 17:32:20 -04:00
Joe Davison
556709ad92 rm multiclass option from model card 2020-10-27 17:11:43 -04:00
Sylvain Gugger
c5f3149f95 Adjust setup so that all extras run on Windows (#8102) 2020-10-27 14:39:49 -04:00
Davide Fiocco
995006eabb Add AzureML in integrations via dedicated callback (#8062)
* first attempt to add AzureML callbacks

* func arg fix

* var name fix, but still won't fix error...

* fixing as in https://discuss.huggingface.co/t/how-to-integrate-an-azuremlcallback-for-logging-in-azure/1713/2

* Avoid lint check of azureml import

* black compliance

* Make isort happy

* Fix point typo in docs

* Add AzureML to Callbacks docs

* Attempt to make sphinx happy

* Format callback docs

* Make documentation style happy

* Make docs compliant to style

Co-authored-by: Davide Fiocco <davide.fiocco@frontiersin.net>
2020-10-27 14:21:54 -04:00
Lysandre Debut
a0906068cf Fully remove codecov (#8093) 2020-10-27 14:14:13 -04:00
Joe Davison
3e58b6b7b8 infer entailment label id on zero shot pipeline (#8059)
* add entailment dim argument

* rename dim -> id

* fix last name change, style

* rm arg, auto-infer only

* typo

* rm superfluous import
2020-10-27 14:09:55 -04:00
Jason Wolosonovich
9fefdb0751 DEP: pinned sentencepiece to 0.1.91 in setup.py (#8069)
Co-authored-by: Lysandre Debut <lysandre@huggingface.co>
2020-10-27 14:09:31 -04:00
Stas Bekman
edd3721cd4 update/add setup targets (#8076) 2020-10-27 13:54:57 -04:00
Julien Chaumond
55bc0c599a [model_cards] Switch to a more explicit domain for the media bucket 2020-10-27 18:08:05 +01:00
Harutaka Kawamura
7bff0af0a4 Fix a bug for CallbackHandler.callback_list (#8052)
* Fix callback_list

* Add test

Signed-off-by: harupy <17039389+harupy@users.noreply.github.com>

* Fix test

Signed-off-by: harupy <17039389+harupy@users.noreply.github.com>
2020-10-27 10:37:04 -04:00
Harutaka Kawamura
8e28c327fc Fix assertion error message for MLflowCallback (#8091) 2020-10-27 10:34:51 -04:00
Sylvain Gugger
3220f21f14 Styling fix 2020-10-27 10:09:51 -04:00
Jonathan Chang
286dc19a4f Fix IterableDataset with __len__ in Trainer (#8095) 2020-10-27 09:52:35 -04:00
Sam Shleifer
d93acd6f13 Move style_doc to extra_quality_checks (#8081) 2020-10-27 09:42:07 -04:00
Stas Bekman
bfd5e370a7 [CI] generate separate report files as artifacts (#7995)
* better reports

* a whole bunch of reports in their own files

* clean up

* improvements

* github artifacts experiment

* style

* complete the report generator with multiple improvements/fixes

* fix

* save all reports under one dir to easy upload

* can remove temp failing tests

* doc fix

* some cleanup
2020-10-27 09:25:07 -04:00
Lysandre Debut
33f6ef733a Fix DeBERTa docs (#8092)
* Fix DeBERTa docs

* Tokenizer and config
2020-10-27 09:07:41 -04:00
Sylvain Gugger
c42596bc07 Doc styling fixes (#8074)
* Fix a few docstrings

* More fixes

* Styling
2020-10-27 07:54:50 -04:00
Doug Blank
1496931b49 Fix comet_ml import and add ensure availability (#7933)
* Fix comet_ml import and add ensure availability

* Make isort happy

* Make flake8 happy

* Don't show comet_ml warn if COMET_MODE=DISABLED

* Make isort happy
2020-10-27 07:31:07 -04:00
Chengxi Guo
985bba9096 fix doc bug (#8082)
Signed-off-by: mymusise <mymusise1@gmail.com>
2020-10-27 07:29:25 -04:00
Sylvain Gugger
08f534d2da Doc styling (#8067)
* Important files

* Styling them all

* Revert "Styling them all"

This reverts commit 7d029395fdae8513b8281cbc2a6c239f8093503e.

* Syling them for realsies

* Fix syntax error

* Fix benchmark_utils

* More fixes

* Fix modeling auto and script

* Remove new line

* Fixes

* More fixes

* Fix more files

* Style

* Add FSMT

* More fixes

* More fixes

* More fixes

* More fixes

* Fixes

* More fixes

* More fixes

* Last fixes

* Make sphinx happy
2020-10-26 18:26:02 -04:00
Sylvain Gugger
04a17f8550 Doc fixes in preparation for the docstyle PR (#8061)
* Fixes in preparation for doc styling

* More fixes

* Better syntax

* Fixes

* Style

* More fixes

* More fixes
2020-10-26 15:01:09 -04:00
Philip May
8bbb74f211 [Model Card] new cross lingual sentence model for German and English (#8026)
* mc for new cross lingual sentence model

* fat text

* url spelling fix

* more url spelling fixes

* slight thanks change

* small improvements in text

* multilingual word xchange

* change colab link

* xval fold number

* add model links

* line break in model names

* Update README.md

* Update README.md

* new examples link

* new examples link

* add evaluation dataset name

* add more about multi lingual

* typo fix

* typo

* typos

* hyperparameter typos

* hyperparameter typo

* add metadata

* add metadata

* Update README.md

* typo fix

* Small improvement
2020-10-26 14:48:26 -04:00
Lysandre Debut
3a10764574 Fix TF training arguments instantiation (#8063) 2020-10-26 14:39:25 -04:00
Sam Shleifer
bc9332b545 [TF] from_pt should respect authorized_unexpected_keys (#8056) 2020-10-26 13:53:27 -04:00
Stas Bekman
7ff7c4934b fixing crash (#8057) 2020-10-26 13:19:10 -04:00
Lysandre Debut
cbad90d86d Fix + Test (#8049) 2020-10-26 12:32:27 -04:00
Patrick von Platen
664c7ec453 [Seq2Seq Trainer] Make sure padding is implemented for models without pad_token (#8043)
* make sure padding is implemented for non-padding tokens models as well

* add better error message

* add better warning

* remove results files

* Update examples/seq2seq/seq2seq_trainer.py

* remove unnecessary copy line

* correct usage of labels

* delete test files
2020-10-26 17:28:16 +01:00
mohammadreza-Banaei73
098ddc2244 Update README.md (#8050)
--wwm cant be used as an argument given run_language_modeling.py and should be changed to --whole_word_mask
2020-10-26 12:00:18 -04:00
Joe Davison
fbcddb8544 add mutliclass field to default zero shot example 2020-10-26 11:07:51 -04:00
Yusuke Mori
a9ac1db276 Minor error fix of 'bart-large-cnn' details in the pretrained_models doc (#8053) 2020-10-26 11:05:16 -04:00
Samuel
fc2d6eac3c Minor typo fixes to the preprocessing tutorial in the docs (#8046)
* Fix minor typos

Fix minor typos in the docs.

* Update docs/source/preprocessing.rst

Clearer data structure description.

Co-authored-by: Sylvain Gugger <35901082+sgugger@users.noreply.github.com>

Co-authored-by: Sylvain Gugger <35901082+sgugger@users.noreply.github.com>
2020-10-26 10:22:29 -04:00
Joe Davison
b0a907615a minor model card description updates (#8051) 2020-10-26 10:04:20 -04:00
noise-field
c48b16b8da Mlflow integration callback (#8016)
* Add MLflow integration class

Add integration code for MLflow in integrations.py along with the code
that checks that MLflow is installed.

* Add MLflowCallback import

Add import of MLflowCallback in trainer.py

* Handle model argument

Allow the callback to handle model argument and store model config items as hyperparameters.

* Log parameters to MLflow in batches

MLflow cannot log more than a hundred parameters at once.
Code added to split the parameters into batches of 100 items and log the batches one by one.

* Fix style

* Add docs on MLflow callback

* Fix issue with unfinished runs

The "fluent" api used in MLflow integration allows only one run to be active at any given moment. If the Trainer is disposed off and a new one is created, but the training is not finished, it will refuse to log the results when the next trainer is created.

* Add MLflow integration class

Add integration code for MLflow in integrations.py along with the code
that checks that MLflow is installed.

* Add MLflowCallback import

Add import of MLflowCallback in trainer.py

* Handle model argument

Allow the callback to handle model argument and store model config items as hyperparameters.

* Log parameters to MLflow in batches

MLflow cannot log more than a hundred parameters at once.
Code added to split the parameters into batches of 100 items and log the batches one by one.

* Fix style

* Add docs on MLflow callback

* Fix issue with unfinished runs

The "fluent" api used in MLflow integration allows only one run to be active at any given moment. If the Trainer is disposed off and a new one is created, but the training is not finished, it will refuse to log the results when the next trainer is created.
2020-10-26 09:41:58 -04:00
Lysandre Debut
8be9cb0aef Tiny TF Bart fixes (#8023) 2020-10-26 09:29:56 -04:00
Sylvain Gugger
077478637d Fix label name in DataCollatorForNextSentencePrediction test (#8048) 2020-10-26 09:23:12 -04:00
Sam Shleifer
8bbe8247f1 Cleanup pytorch tests (#8033) 2020-10-26 08:59:06 -04:00
suliuzh
20a0894d1a update version for scipy (#7998) 2020-10-26 08:56:56 -04:00
Sam Shleifer
f20aec1de5 fsmt slow test uses lists (#8031) 2020-10-26 08:32:36 -04:00
Stas Bekman
101186bc1f [docs] [testing] distributed training (#7993)
* distributed training

* fix

* fix formatting

* wording
2020-10-26 08:15:05 -04:00
luyug
c153bcc5c8 Add mixed precision evaluation (#8036)
* Add mixed precision evaluation

* use original flag
2020-10-26 08:12:31 -04:00
Samuel
9aa2826687 Minor typo fixes to the tokenizer summary (#8045)
Minor typo fixes to the tokenizer summary
2020-10-26 08:08:33 -04:00
Lysandre
829b9f8cc3 Remove codecov.yml 2020-10-26 08:05:02 -04:00
Thomas Wolf
79eb391586 [tokenizers] Fixing #8001 - Adding tests on tokenizers serialization (#8006)
* fixing #8001

* make T5 tokenizer serialization more robust - style
2020-10-26 10:27:48 +01:00
Julien Chaumond
7087d9b1c0 [model_cards] bert-base-danish Fixup
#8030
2020-10-26 09:38:21 +01:00
Julien Chaumond
efc4a21ffa Fixup #8025
Close #8030
2020-10-26 09:32:07 +01:00
Sam Longenbach
5148f43309 [Model Card] DJSammy/bert-base-danish-uncased_BotXO,ai (#8025)
* Create README.md

* Update README.md
2020-10-25 15:20:46 +08:00
Suraj Patil
38f6739cd6 [doc prepare_seq2seq_batch] fix docs (#8013) 2020-10-24 15:33:47 -04:00
Yixin Nie
00602f7840 Create model card for pre-trained NLI models. (#7864)
* Create README.md

* Update model_cards/ynie/roberta-large-snli_mnli_fever_anli_R1_R2_R3-nli/README.md

Co-authored-by: Julien Chaumond <chaumond@gmail.com>

* Add Meta information for dataset identifier.

Co-authored-by: Julien Chaumond <chaumond@gmail.com>
2020-10-24 03:16:07 -04:00
Patrick von Platen
3c682ea15c [Examples] Allow EncoderDecoderModels to be trained with Seq2Seq (#7809)
* Make Seq2Seq Trainer more similar to Trainer

* fix typo

* fix seq2seq trainer

* remove from tests

* remove lock

* remove train files

* delete test files

* correct typo

* check at init

* make sure trainer is not slowed down on TPU

* correct isort

* remove use cache

* fix use cache

* add last use chache = false
2020-10-23 23:05:51 +02:00
Sacha Arbonel
59b5953d89 Create model card for bert-italian-cased-finetuned-pos (#8003)
* Create README.md

* Update model_cards/sachaarbonel/bert-italian-cased-finetuned-pos/README.md

* Apply suggestions from code review

Co-authored-by: Julien Chaumond <chaumond@gmail.com>
2020-10-23 10:58:05 -04:00
Zhiqi Huang
6e07c1f446 Add model cards for DynaBERT (#7999) 2020-10-23 10:53:53 -04:00
Zhiqi Huang
43fdafef89 Create README.md (#7997) 2020-10-23 10:53:37 -04:00
Blaise Cruz
627e813734 Added model cards for Tagalog ELECTRA models (#7996)
Co-authored-by: Jan Christian Blaise Cruz <jcblaise@Blaises-MacBook-Pro.local>
2020-10-23 10:52:21 -04:00
Philip May
9865e1fe52 model card for German Sentence Embeddings V2 (#7952)
* model card German Sentence Embeddings V2

- for German RoBERTa for Sentence Embeddings V2
- marked old as outdated

* small correction

* small improvement in description

* small spelling fix

* spelling fix

* add evaluation results

* spearman explanation

* add number of trials
2020-10-23 10:45:54 -04:00
Ethan Perez
d39da5a2ab Handling longformer model_type (#7990)
Updating the run_squad training script to handle the "longformer" `model_type`. The longformer is trained in the same was as RoBERTa, so I've added the "longformer" `model_type` (that's the right hugginface name for the LongFormer model, right?) everywhere there was a "roberta" `model_type` reference. The longformer (like RoBERTa) doesn't use `token_type_ids` (as I understand from looking at the [longformer notebook](https://github.com/patil-suraj/Notebooks/blob/master/longformer_qa_training.ipynb), which is what gets updated after this change.

This fix might be related to [this issue](https://github.com/huggingface/transformers/issues/7249) with SQuAD training when using run_squad.py
2020-10-23 10:34:06 -04:00
Anthony MOI
5e323017a4 Fix BatchEncoding.word_to_tokens for removed tokens (#7939) 2020-10-23 10:29:37 -04:00
Patrick von Platen
4acfd1a8dc [Reformer] remove reformer pad_token_id (#7991)
* remove reformer pad_token_id

* fix pegasus
2020-10-23 10:29:15 -04:00
Thomas Wolf
3a40cdf58d [tests|tokenizers] Refactoring pipelines test backbone - Small tokenizers improvements - General tests speedups (#7970)
* WIP refactoring pipeline tests - switching to fast tokenizers

* fix dialog pipeline and fill-mask

* refactoring pipeline tests backbone

* make large tests slow

* fix tests (tf Bart inactive for now)

* fix doc...

* clean up for merge

* fixing tests - remove bart from summarization until there is TF

* fix quality and RAG

* Add new translation pipeline tests - fix JAX tests

* only slow for dialog

* Fixing the missing TF-BART imports in modeling_tf_auto

* spin out pipeline tests in separate CI job

* adding pipeline test to CI YAML

* add slow pipeline tests

* speed up tf and pt join test to avoid redoing all the standalone pt and tf tests

* Update src/transformers/tokenization_utils_base.py

Co-authored-by: Sam Shleifer <sshleifer@gmail.com>

* Update src/transformers/pipelines.py

Co-authored-by: Sylvain Gugger <35901082+sgugger@users.noreply.github.com>

* Update src/transformers/pipelines.py

Co-authored-by: Lysandre Debut <lysandre@huggingface.co>

* Update src/transformers/testing_utils.py

Co-authored-by: Sylvain Gugger <35901082+sgugger@users.noreply.github.com>

* add require_torch and require_tf in is_pt_tf_cross_test

Co-authored-by: Sam Shleifer <sshleifer@gmail.com>
Co-authored-by: Sylvain Gugger <35901082+sgugger@users.noreply.github.com>
Co-authored-by: Lysandre Debut <lysandre@huggingface.co>
2020-10-23 15:58:19 +02:00
Lalit Pagaria
88b3a91e61 Handle the case when title is None (#7941) 2020-10-23 15:54:45 +02:00
Stas Bekman
023f0f3708 [s2s trainer] tests to use distributed on multi-gpu machine (#7965) 2020-10-22 17:26:22 -04:00
Joe Davison
64b24bb3c2 change zero shot widget default example (#7992) 2020-10-22 15:19:41 -06:00
Sam Shleifer
0397619ac6 Move NoLayerEmbedTokens (#7945)
* Move NoLayerEmbedTokens

* TFWrappedEmbeddings

* Add comment
2020-10-22 16:13:49 -04:00
Sam Shleifer
5ac07513e0 [gh ci] less output ( --durations=50) (#7989) 2020-10-22 16:10:15 -04:00
Sylvain Gugger
5ae935d233 Reload checkpoint (#7984)
* Fix checkpoint loading in Trainer

* Fix typo
2020-10-22 15:48:52 -04:00
Lysandre
467573ddde Fix documentation redirect 2020-10-22 15:37:51 -04:00
Joe Davison
077c99bb5f add zero shot pipeline tags & examples (#7983)
* add zero shot pipeline tags

* rm default and fix yaml format

* rm DS_Store

* add bart large default

* don't add more typos

Co-authored-by: Julien Chaumond <chaumond@gmail.com>

* add multiple multilingual examples

* improve multilingual examples for single-label

Co-authored-by: Julien Chaumond <chaumond@gmail.com>
2020-10-22 13:01:23 -06:00
Sylvain Gugger
06fc3954a1 Only log total_flos at the end of training (#7981)
* Only log total_flos at the end of training

* Fix test
2020-10-22 14:26:55 -04:00
Julien Chaumond
ff65beafa3 FillMaskPipeline: support passing top_k on __call__ (#7971)
* FillMaskPipeline: support passing top_k on __call__

Also move from topk to top_k

* migrate to new param name in tests

* Review from @sgugger
2020-10-22 12:54:25 -04:00
Sylvain Gugger
2e5052d4f1 New run glue script (#7917)
* Start simplification

* More progress

* Finished script

* Address comments and update tests instructions

* Wrong test

* Accept files as inputs and fix test

* Update src/transformers/trainer_utils.py

Co-authored-by: Julien Chaumond <chaumond@gmail.com>

* Fix labels and add combined score

* Add special labels

* Update TPU command

* Revert to old label strategy

* Use model labels

* Fix for STT-B

* Styling

* Apply suggestions from code review

Co-authored-by: Thomas Wolf <thomwolf@users.noreply.github.com>

* Code styling

* Fix review comments

Co-authored-by: Julien Chaumond <chaumond@gmail.com>
Co-authored-by: Thomas Wolf <thomwolf@users.noreply.github.com>
2020-10-22 11:42:22 -04:00
Nicolas Patry
18ce6b8ff3 Fixing the "translation", "translation_XX_to_YY" pipelines. (#7975)
* Actually make the "translation", "translation_XX_to_YY" task behave correctly.

Background:
- Currently "translation_cn_to_ar" does not work. (only 3 pairs are
supported)
- Some models, contain in their config the correct values for the (src,
tgt) pair they can translate. It's usually just one pair, and we can
infer it automatically from the `model.config.task_specific_params`. If
it's not defined we can still probably load the TranslationPipeline
nevertheless.

Proposed fix:
- A simplified version of what could become more general which is
a `parametrized` task. "translation" + (src, tgt) in this instance
it what we need in the general case. The way we go about it for now
is simply parsing "translation_XX_to_YY". If cases of parametrized task arise
we should preferably go in something closer to what `datasets` propose
which is having a secondary argument `task_options`? that will be close
to what that task requires.
- Should be backward compatible in all cases for instance
`pipeline(task="translation_en_to_de") should work out of the box.
- Should provide a warning when a specific translation pair has been
selected on behalf of the user using
`model.config.task_specific_params`.

* Update src/transformers/pipelines.py

Co-authored-by: Julien Chaumond <chaumond@gmail.com>

Co-authored-by: Julien Chaumond <chaumond@gmail.com>
2020-10-22 17:16:21 +02:00
Funtowicz Morgan
901e9b8eda Remove the else branch adding 0 to the hidden state if token_type_embeds is None. (#7977)
Signed-off-by: Morgan Funtowicz <funtowiczmo@gmail.com>
2020-10-22 16:41:41 +02:00
Patrick von Platen
f34372a9ff [PretrainedConfig] Fix save pretrained config for edge case (#7943)
* fix config save

* add test

* add config class variable and another test

* line break

* fix fsmt and typo

* god am I making many errors today :-/

* Update src/transformers/configuration_utils.py

Co-authored-by: Sylvain Gugger <35901082+sgugger@users.noreply.github.com>

Co-authored-by: Sylvain Gugger <35901082+sgugger@users.noreply.github.com>
2020-10-22 15:39:01 +02:00
Peter Bayerle
cc2e312ca3 adding text classification with DistilBERT/tf notebook (#7964)
Looking at the current community notebooks, it seems that few are targeted for absolute beginners and even fewer are written with TensorFlow. This notebook describes absolutely everything a beginner would need to know, including how to save/load their model and use it for new predictions (this is often omitted in tutorials)

Co-authored-by: Lysandre Debut <lysandre@huggingface.co>
2020-10-22 09:30:50 -04:00
wlhgtc
a16e568f22 # Add whole word mask support for lm fine-tune (#7925)
* ADD: add whole word mask proxy for both eng and chinese

* MOD: adjust format

* MOD: reformat code

* MOD: update import

* MOD: fix bug

* MOD: add import

* MOD: fix bug

* MOD: decouple code and update readme

* MOD: reformat code

* Update examples/language-modeling/README.md

Co-authored-by: Sylvain Gugger <35901082+sgugger@users.noreply.github.com>

* Update examples/language-modeling/README.md

Co-authored-by: Sylvain Gugger <35901082+sgugger@users.noreply.github.com>

* Update examples/language-modeling/run_language_modeling.py

Co-authored-by: Sylvain Gugger <35901082+sgugger@users.noreply.github.com>

* Update examples/language-modeling/run_language_modeling.py

Co-authored-by: Sylvain Gugger <35901082+sgugger@users.noreply.github.com>

* Update examples/language-modeling/run_language_modeling.py

Co-authored-by: Sylvain Gugger <35901082+sgugger@users.noreply.github.com>

* Update examples/language-modeling/run_language_modeling.py

Co-authored-by: Sylvain Gugger <35901082+sgugger@users.noreply.github.com>

* change wwm to whole_word_mask

* reformat code

* reformat

* format

* Code quality

* ADD: update chinese ref readme

* MOD: small changes

* MOD: small changes2

* update readme

Co-authored-by: Sylvain Gugger <35901082+sgugger@users.noreply.github.com>
Co-authored-by: Sylvain Gugger <sylvain.gugger@gmail.com>
2020-10-22 09:19:00 -04:00
Stas Bekman
64b4d25cf3 [fsmt test] basic config test with online model + super tiny model (#7860)
* basic config test with online model

* typo

* style

* better test
2020-10-22 09:14:54 -04:00
Julien Chaumond
3479787edc Disable inference API for t5-11b (#7978) 2020-10-22 09:08:37 -04:00
Julien Chaumond
a7db81c33f [model_card] t5-11b move disclaimer to top of page
cc @Narsil @patrickvonplaten
2020-10-22 14:35:31 +02:00
Haebin Shin
f774b2e8c4 support relative path for best_model_checkpoint (#7973) 2020-10-22 07:55:31 -04:00
Stas Bekman
8348105692 [testing] slow tests should be marked as slow (#7895)
* slow tests should be slow

* exception note

* style

* integrate LysandreJik's notes with some expansions

* Apply suggestions from code review

Co-authored-by: Sylvain Gugger <35901082+sgugger@users.noreply.github.com>

* another slow test

* fix link, and prose

* clarify.

* note from Sam

* typo

Co-authored-by: Sylvain Gugger <35901082+sgugger@users.noreply.github.com>
2020-10-22 06:34:05 -04:00
rmroczkowski
95792a948e Herbert tokenizer auto load (#7968) 2020-10-22 05:48:29 -04:00
zolekode
4abb7ffc18 added qg evaluation notebook (#7958)
* added qg evaluation notebook

* Update notebooks/README.md

Co-authored-by: Patrick von Platen <patrick.v.platen@gmail.com>
2020-10-22 11:02:12 +02:00
Stas Bekman
8b38173398 [seq2seq testing] multigpu test run via subprocess (#7281)
Co-authored-by: Sam Shleifer <sshleifer@gmail.com>
2020-10-21 17:20:53 -04:00
Julien Chaumond
f8d3695e8c [model_cards] camembert: dataset = oscar
Hat/tip @pjox
2020-10-21 14:17:56 -04:00
Evan Pete Walsh
16da877139 fix 'encode_plus' docstring for 'special_tokens_mask' (0s and 1s were reversed) (#7949)
* fix docstring for 'special_tokens_mask'

* revert auto formatter changes

* revert another auto format

* revert another auto format
2020-10-21 13:57:44 -04:00
Patrick von Platen
52decab371 fix test (#7947) 2020-10-21 19:06:23 +02:00
Patrick von Platen
9b6610f7f6 [ProphetNet] Correct Doc string example (#7944)
* correct xlm prophetnet auto model and examples

* fix line-break docs
2020-10-21 17:27:20 +02:00
François Lagunas
e174bfeb34 TensorBoard/Wandb/optuna/raytune integration improvements. (#7935)
Improved TensorBoard and Wandb integration, as well as optuna and ray/tune support, with minor modifications to trainer core code.
2020-10-21 17:18:52 +02:00
Ali Hamdi Ali Fadel
bf162ce8ca Add AI-SOCO models (#7867) 2020-10-21 09:24:43 -04:00
Fangyu Liu
58fb25f25b Create README.md (#7857)
* Create README.md

model card for cambridgeltl/BioRedditBERT-uncased.

* Update model_cards/cambridgeltl/BioRedditBERT-uncased/README.md

Co-authored-by: Julien Chaumond <chaumond@gmail.com>
2020-10-21 08:41:41 -04:00
Manuel Romero
2b07ec7823 Model card for German BERT fine-tuned for LER/NER (#7855) 2020-10-21 08:31:41 -04:00
MichalPleban
35d2ad5b83 Create README.md (#7819) 2020-10-21 08:30:01 -04:00
Wuwei Lan
bdda4f2249 Create README.md (#7625)
* Create README.md

* Update model_cards/lanwuwei/GigaBERT-v3-Arabic-and-English/README.md

* Update model_cards/lanwuwei/GigaBERT-v3-Arabic-and-English/README.md

Co-authored-by: Julien Chaumond <chaumond@gmail.com>
2020-10-21 08:29:39 -04:00
Manuel Romero
8e23749649 Add missing comma (#7870) 2020-10-21 08:24:12 -04:00
Manuel Romero
3eaa007d78 Create README.md (#7899) 2020-10-21 08:23:55 -04:00
Julien Chaumond
758572cad8 [model_cards] move hatmimoha/arabic-ner to correct location
see 16d3cc187d and https://github.com/huggingface/transformers/pull/7836
2020-10-21 14:13:17 +02:00
Stas Bekman
57516c0cc8 [multiple models] skip saving/loading deterministic state_dict keys (#7878)
* make the save_load special key tests common

* handle mbart

* cleaner solution

* fix

* move test_save_load_missing_keys back into fstm for now

* restore

* style

* add marian

* add pegasus

* blenderbot

* revert - no static embed
2020-10-21 08:06:07 -04:00
quentinheinrich
006a16483f update model cards of Illuin models (#7930) 2020-10-21 08:05:53 -04:00
hatmimoha
16d3cc187d model card for arabic-ner model (#7836)
* Create README.md

README file for the Arabic NER model

* Update README.md

* Update README.md

* Update hatmimoha/arabic-ner/README.md

Co-authored-by: Julien Chaumond <chaumond@gmail.com>
2020-10-21 08:02:40 -04:00
Sam Shleifer
829842159e Add TFBartForConditionalGeneration (#5411)
* half done

* doc improvement

* Cp test file

* brokedn

* broken test

* undo some mess

* ckpt

* borked

* Halfway

* 6 passing

* boom boom

* Much progress but still 6

* boom boom

* merged master

* 10 passing

* boom boom

* Style

* no t5 changes

* 13 passing

* Integration test failing, but not gibberish

* Frustrated

* Merged master

* 4 fail

* 4 fail

* fix return_dict

* boom boom

* Still only 4

* prepare method

* prepare method

* before delete classif

* Skip tests to avoid adding boilerplate

* boom boom

* fast tests passing

* style

* boom boom

* Switch to supporting many input types

* remove FIXMENORM

* working

* Fixed past_key_values/decoder_cached_states confusion

* new broken test

* Fix attention mask kwarg name

* undo accidental

* Style and reviewers

* style

* Docs and common tests

* Cleaner assert messages

* copy docs

* style issues

* Sphinx fix

* Simplify caching logic

* test does not require torch

* copy _NoLayerEmbedTokens

* Update src/transformers/modeling_tf_bart.py

Co-authored-by: Lysandre Debut <lysandre@huggingface.co>

* Update tests/test_modeling_tf_bart.py

Co-authored-by: Lysandre Debut <lysandre@huggingface.co>

* Update src/transformers/modeling_tf_bart.py

Co-authored-by: Lysandre Debut <lysandre@huggingface.co>

* Update src/transformers/modeling_tf_bart.py

Co-authored-by: Lysandre Debut <lysandre@huggingface.co>

* Update src/transformers/modeling_tf_bart.py

Co-authored-by: Lysandre Debut <lysandre@huggingface.co>

* Line length and dont document None

* Add pipeline test coverage

* assert msg

* At parity

* Assert messages

* mark slow

* Update compile test

* back in init

* Merge master

* Fix tests

Co-authored-by: Lysandre Debut <lysandre@huggingface.co>
2020-10-21 13:10:16 +02:00
Patrick von Platen
5cd9e2cba1 Update README.md 2020-10-21 12:43:42 +02:00
Patrick von Platen
220b5f97ca Create README.md 2020-10-21 12:34:46 +02:00
Patrick von Platen
8ffd7fb12d Update README.md 2020-10-21 12:27:09 +02:00
Patrick von Platen
613ab364eb Update README.md 2020-10-21 12:23:17 +02:00
Patrick von Platen
f7eb17dc47 Update README.md 2020-10-21 12:19:44 +02:00
Patrick von Platen
29792864cb [ProphetNet] Add Question Generation Model + Test (#7942)
* new prophetnet model

* correct name

* make style
2020-10-21 11:49:58 +02:00
Joe Davison
13842e413c PPL guide minor code snippet fix (#7938) 2020-10-20 16:17:39 -06:00
Stas Bekman
0e24e4c136 [s2s] create doc for pegasus/fsmt replication (#7934) 2020-10-20 15:07:52 -04:00
Lysandre Debut
96f4828ace Respect the 119 line chars (#7928) 2020-10-20 11:02:47 -04:00
Lysandre
ef0ac063c9 Docs for v3.4.0 2020-10-20 16:29:00 +02:00
667 changed files with 33043 additions and 16829 deletions

View File

@@ -74,21 +74,20 @@ jobs:
- checkout
- restore_cache:
keys:
- v0.3-torch_and_tf-{{ checksum "setup.py" }}
- v0.3-{{ checksum "setup.py" }}
- v0.4-torch_and_tf-{{ checksum "setup.py" }}
- v0.4-{{ checksum "setup.py" }}
- run: pip install --upgrade pip
- run: pip install git+https://github.com/huggingface/datasets
- run: pip install .[sklearn,tf-cpu,torch,testing]
- run: pip install codecov pytest-cov
- save_cache:
key: v0.3-{{ checksum "setup.py" }}
key: v0.4-{{ checksum "setup.py" }}
paths:
- '~/.cache/pip'
- run: python -m pytest -n 8 --dist=loadfile -rA -s ./tests/ --cov --durations=0 | tee output.txt
- run: codecov
- run: RUN_PT_TF_CROSS_TESTS=1 python -m pytest -n 8 --dist=loadfile -rA -s --make-reports=tests_torch_and_tf ./tests/ -m is_pt_tf_cross_test --durations=0 | tee tests_output.txt
- store_artifacts:
path: ~/transformers/output.txt
destination: test_output.txt
path: ~/transformers/tests_output.txt
- store_artifacts:
path: ~/transformers/reports
run_tests_torch:
working_directory: ~/transformers
docker:
@@ -101,19 +100,20 @@ jobs:
- checkout
- restore_cache:
keys:
- v0.3-torch-{{ checksum "setup.py" }}
- v0.3-{{ checksum "setup.py" }}
- v0.4-torch-{{ checksum "setup.py" }}
- v0.4-{{ checksum "setup.py" }}
- run: pip install --upgrade pip
- run: pip install git+https://github.com/huggingface/datasets
- run: pip install .[sklearn,torch,testing]
- save_cache:
key: v0.3-torch-{{ checksum "setup.py" }}
key: v0.4-torch-{{ checksum "setup.py" }}
paths:
- '~/.cache/pip'
- run: python -m pytest -n 8 --dist=loadfile -rA -s ./tests/ | tee output.txt
- run: python -m pytest -n 8 --dist=loadfile -s --make-reports=tests_torch ./tests/ | tee tests_output.txt
- store_artifacts:
path: ~/transformers/output.txt
destination: test_output.txt
path: ~/transformers/tests_output.txt
- store_artifacts:
path: ~/transformers/reports
run_tests_tf:
working_directory: ~/transformers
docker:
@@ -126,19 +126,20 @@ jobs:
- checkout
- restore_cache:
keys:
- v0.3-tf-{{ checksum "setup.py" }}
- v0.3-{{ checksum "setup.py" }}
- v0.4-tf-{{ checksum "setup.py" }}
- v0.4-{{ checksum "setup.py" }}
- run: pip install --upgrade pip
- run: pip install git+https://github.com/huggingface/datasets
- run: pip install .[sklearn,tf-cpu,testing]
- save_cache:
key: v0.3-tf-{{ checksum "setup.py" }}
key: v0.4-tf-{{ checksum "setup.py" }}
paths:
- '~/.cache/pip'
- run: python -m pytest -n 8 --dist=loadfile -rA -s ./tests/ | tee output.txt
- run: python -m pytest -n 8 --dist=loadfile -rA -s --make-reports=tests_tf ./tests/ | tee tests_output.txt
- store_artifacts:
path: ~/transformers/output.txt
destination: test_output.txt
path: ~/transformers/tests_output.txt
- store_artifacts:
path: ~/transformers/reports
run_tests_flax:
working_directory: ~/transformers
docker:
@@ -151,19 +152,72 @@ jobs:
- checkout
- restore_cache:
keys:
- v0.3-flax-{{ checksum "setup.py" }}
- v0.3-{{ checksum "setup.py" }}
- v0.4-flax-{{ checksum "setup.py" }}
- v0.4-{{ checksum "setup.py" }}
- run: pip install --upgrade pip
- run: pip install git+https://github.com/huggingface/datasets
- run: sudo pip install .[flax,sklearn,torch,testing]
- save_cache:
key: v0.3-flax-{{ checksum "setup.py" }}
key: v0.4-flax-{{ checksum "setup.py" }}
paths:
- '~/.cache/pip'
- run: python -m pytest -n 8 --dist=loadfile -rA -s ./tests/ | tee output.txt
- run: python -m pytest -n 8 --dist=loadfile -rA -s --make-reports=tests_flax ./tests/ | tee tests_output.txt
- store_artifacts:
path: ~/transformers/output.txt
destination: test_output.txt
path: ~/transformers/tests_output.txt
- store_artifacts:
path: ~/transformers/reports
run_tests_pipelines_torch:
working_directory: ~/transformers
docker:
- image: circleci/python:3.7
environment:
OMP_NUM_THREADS: 1
resource_class: xlarge
parallelism: 1
steps:
- checkout
- restore_cache:
keys:
- v0.4-torch-{{ checksum "setup.py" }}
- v0.4-{{ checksum "setup.py" }}
- run: pip install --upgrade pip
- run: pip install .[sklearn,torch,testing]
- save_cache:
key: v0.4-torch-{{ checksum "setup.py" }}
paths:
- '~/.cache/pip'
- run: RUN_PIPELINE_TESTS=1 python -m pytest -n 8 --dist=loadfile -rA -s --make-reports=tests_pipelines_torch -m is_pipeline_test ./tests/ | tee tests_output.txt
- store_artifacts:
path: ~/transformers/tests_output.txt
- store_artifacts:
path: ~/transformers/reports
run_tests_pipelines_tf:
working_directory: ~/transformers
docker:
- image: circleci/python:3.7
environment:
OMP_NUM_THREADS: 1
resource_class: xlarge
parallelism: 1
steps:
- checkout
- restore_cache:
keys:
- v0.4-tf-{{ checksum "setup.py" }}
- v0.4-{{ checksum "setup.py" }}
- run: pip install --upgrade pip
- run: pip install .[sklearn,tf-cpu,testing]
- save_cache:
key: v0.4-tf-{{ checksum "setup.py" }}
paths:
- '~/.cache/pip'
- run: RUN_PIPELINE_TESTS=1 python -m pytest -n 8 --dist=loadfile -rA -s --make-reports=tests_pipelines_tf ./tests/ -m is_pipeline_test | tee tests_output.txt
- store_artifacts:
path: ~/transformers/tests_output.txt
- store_artifacts:
path: ~/transformers/reports
run_tests_custom_tokenizers:
working_directory: ~/transformers
docker:
@@ -174,19 +228,21 @@ jobs:
- checkout
- restore_cache:
keys:
- v0.3-custom_tokenizers-{{ checksum "setup.py" }}
- v0.3-{{ checksum "setup.py" }}
- v0.4-custom_tokenizers-{{ checksum "setup.py" }}
- v0.4-{{ checksum "setup.py" }}
- run: pip install --upgrade pip
- run: pip install .[ja,testing]
- run: python -m unidic download
- save_cache:
key: v0.3-custom_tokenizers-{{ checksum "setup.py" }}
key: v0.4-custom_tokenizers-{{ checksum "setup.py" }}
paths:
- '~/.cache/pip'
- run: python -m pytest -s ./tests/test_tokenization_bert_japanese.py | tee output.txt
- run: python -m pytest -s --make-reports=tests_custom_tokenizers ./tests/test_tokenization_bert_japanese.py | tee tests_output.txt
- store_artifacts:
path: ~/transformers/output.txt
destination: test_output.txt
path: ~/transformers/tests_output.txt
- store_artifacts:
path: ~/transformers/reports
run_examples_torch:
working_directory: ~/transformers
docker:
@@ -199,19 +255,21 @@ jobs:
- checkout
- restore_cache:
keys:
- v0.3-torch_examples-{{ checksum "setup.py" }}
- v0.3-{{ checksum "setup.py" }}
- v0.4-torch_examples-{{ checksum "setup.py" }}
- v0.4-{{ checksum "setup.py" }}
- run: pip install --upgrade pip
- run: pip install .[sklearn,torch,testing]
- run: pip install -r examples/requirements.txt
- save_cache:
key: v0.3-torch_examples-{{ checksum "setup.py" }}
key: v0.4-torch_examples-{{ checksum "setup.py" }}
paths:
- '~/.cache/pip'
- run: python -m pytest -n 8 --dist=loadfile -rA -s ./examples/ | tee output.txt
- run: python -m pytest -n 8 --dist=loadfile -s --make-reports=examples_torch ./examples/ | tee examples_output.txt
- store_artifacts:
path: ~/transformers/output.txt
destination: test_output.txt
path: ~/transformers/examples_output.txt
- store_artifacts:
path: ~/transformers/reports
build_doc:
working_directory: ~/transformers
docker:
@@ -220,17 +278,18 @@ jobs:
- checkout
- restore_cache:
keys:
- v0.3-build_doc-{{ checksum "setup.py" }}
- v0.3-{{ checksum "setup.py" }}
- v0.4-build_doc-{{ checksum "setup.py" }}
- v0.4-{{ checksum "setup.py" }}
- run: pip install --upgrade pip
- run: pip install .[tf,torch,sentencepiece,docs]
- save_cache:
key: v0.3-build_doc-{{ checksum "setup.py" }}
key: v0.4-build_doc-{{ checksum "setup.py" }}
paths:
- '~/.cache/pip'
- run: cd docs && make html SPHINXOPTS="-W"
- store_artifacts:
path: ./docs/_build
deploy_doc:
working_directory: ~/transformers
docker:
@@ -242,14 +301,15 @@ jobs:
- checkout
- restore_cache:
keys:
- v0.3-deploy_doc-{{ checksum "setup.py" }}
- v0.3-{{ checksum "setup.py" }}
- v0.4-deploy_doc-{{ checksum "setup.py" }}
- v0.4-{{ checksum "setup.py" }}
- run: pip install .[tf,torch,sentencepiece,docs]
- save_cache:
key: v0.3-deploy_doc-{{ checksum "setup.py" }}
key: v0.4-deploy_doc-{{ checksum "setup.py" }}
paths:
- '~/.cache/pip'
- run: ./.circleci/deploy.sh
check_code_quality:
working_directory: ~/transformers
docker:
@@ -260,21 +320,23 @@ jobs:
- checkout
- restore_cache:
keys:
- v0.3-code_quality-{{ checksum "setup.py" }}
- v0.3-{{ checksum "setup.py" }}
- v0.4-code_quality-{{ checksum "setup.py" }}
- v0.4-{{ checksum "setup.py" }}
- run: pip install --upgrade pip
- run: pip install isort
- run: pip install .[tf,torch,flax,quality]
- save_cache:
key: v0.3-code_quality-{{ checksum "setup.py" }}
key: v0.4-code_quality-{{ checksum "setup.py" }}
paths:
- '~/.cache/pip'
- run: black --check examples templates tests src utils
- run: isort --check-only examples templates tests src utils
- run: flake8 examples templates tests src utils
- run: black --check examples tests src utils
- run: isort --check-only examples tests src utils
- run: flake8 examples tests src utils
- run: python utils/style_doc.py src/transformers docs/source --max_len 119 --check_only
- run: python utils/check_copies.py
- run: python utils/check_dummies.py
- run: python utils/check_repo.py
check_repository_consistency:
working_directory: ~/transformers
docker:
@@ -305,6 +367,7 @@ jobs:
- setup_remote_docker
- *build_push_docker
- *deploy_cluster
cleanup-gke-jobs:
docker:
- image: circleci/python:3.6
@@ -314,6 +377,7 @@ jobs:
cluster: $GKE_CLUSTER
perform-login: true
- *delete_gke_jobs
workflow_filters: &workflow_filters
filters:
branches:
@@ -331,6 +395,8 @@ workflows:
- run_tests_torch
- run_tests_tf
- run_tests_flax
- run_tests_pipelines_torch
- run_tests_pipelines_tf
- build_doc
- deploy_doc: *workflow_filters
tpu_testing_jobs:

View File

@@ -50,4 +50,5 @@ deploy_doc "b42586e" v2.11.0
deploy_doc "7fb8bdf" v3.0.2
deploy_doc "4b3ee9c" v3.1.0
deploy_doc "3ebb1b3" v3.2.0
deploy_doc "0613f05" # v3.3.0 Latest stable release
deploy_doc "0613f05" v3.3.1
deploy_doc "eb0e0ce" # v3.4.0 Latest stable release

View File

@@ -30,20 +30,22 @@ assignees: ''
Trainer: @sgugger
Speed and Memory Benchmarks: @patrickvonplaten
Model Cards: @julien-c
Translation: @sshleifer
Summarization: @sshleifer
TextGeneration: @TevenLeScao
examples/distillation: @VictorSanh
nlp datasets: [different repo](https://github.com/huggingface/nlp)
rust tokenizers: [different repo](https://github.com/huggingface/tokenizers)
Text Generation: @TevenLeScao
blenderbot: @mariamabarham
Bart: @sshleifer
Marian: @sshleifer
Text Generation: @patrickvonplaten @TevenLeScao
Blenderbot: @patrickvonplaten
Bart: @patrickvonplaten
Marian: @patrickvonplaten
Pegasus: @patrickvonplaten
mBART: @patrickvonplaten
T5: @patrickvonplaten
Longformer/Reformer: @patrickvonplaten
TransfoXL/XLNet: @TevenLeScao
examples/seq2seq: @sshleifer
TransfoXL/XLNet: @TevenLeScao
RAG: @patrickvonplaten, @lhoestq
FSMT: @stas00
examples/seq2seq: @patil-suraj
examples/bert-loses-patience: @JetRunner
tensorflow: @jplu
examples/token-classification: @stefan-it

View File

@@ -43,21 +43,20 @@ members/contributors which may be interested in your PR.
Trainer: @sgugger
Benchmarks: @patrickvonplaten
Model Cards: @julien-c
Translation: @sshleifer
Summarization: @sshleifer
examples/distillation: @VictorSanh
nlp datasets: [different repo](https://github.com/huggingface/nlp)
rust tokenizers: [different repo](https://github.com/huggingface/tokenizers)
Text Generation: @patrickvonplaten, @TevenLeScao
Blenderbot, Bart, Marian, Pegasus: @sshleifer
Blenderbot, Bart, Marian, Pegasus: @patrickvonplaten
T5: @patrickvonplaten
Rag: @patrickvonplaten, @lhoestq
EncoderDecoder: @patrickvonplaten
Longformer, Reformer: @patrickvonplaten
TransfoXL, XLNet: @TevenLeScao, @patrickvonplaten
examples/seq2seq: @sshleifer
examples/seq2seq: @patil-suraj
examples/bert-loses-patience: @JetRunner
tensorflow: @jplu
examples/token-classification: @stefan-it
documentation: @sgugger
FSTM: @stas00
-->

View File

@@ -1,10 +1,10 @@
name: Self-hosted runner (push)
on:
on:
push:
branches:
- master
paths:
paths:
- "src/**"
- "tests/**"
- ".github/**"
@@ -13,58 +13,70 @@ on:
jobs:
run_tests_torch_and_tf_gpu:
run_tests_torch_gpu:
runs-on: [self-hosted, single-gpu]
steps:
- uses: actions/checkout@v2
- name: Python version
run: |
which python
python --version
pip --version
- name: Current dir
run: pwd
- run: nvidia-smi
- uses: actions/checkout@v2
- name: Python version
run: |
which python
python --version
pip --version
- name: Loading cache.
uses: actions/cache@v2
id: cache
with:
path: .env
key: v0-tests_tf_torch_gpu-${{ hashFiles('setup.py') }}
- name: Current dir
run: pwd
- run: nvidia-smi
- name: Create new python env (on self-hosted runners we have to handle isolation ourselves)
run: |
python -m venv .env
source .env/bin/activate
which python
python --version
pip --version
- name: Install dependencies
run: |
source .env/bin/activate
pip install --upgrade pip
pip install torch!=1.6.0
pip install .[sklearn,testing,onnxruntime]
pip install git+https://github.com/huggingface/datasets
- name: Loading cache.
uses: actions/cache@v2
id: cache
with:
path: .env
key: v1.1-tests_torch_gpu-${{ hashFiles('setup.py') }}
- name: Are GPUs recognized by our DL frameworks
run: |
source .env/bin/activate
python -c "import torch; print('Cuda available:', torch.cuda.is_available())"
python -c "import torch; print('Number of GPUs available:', torch.cuda.device_count())"
- name: Create new python env (on self-hosted runners we have to handle isolation ourselves)
run: |
python -m venv .env
source .env/bin/activate
which python
python --version
pip --version
- name: Run all non-slow tests on GPU
env:
TF_FORCE_GPU_ALLOW_GROWTH: "true"
# TF_GPU_MEMORY_LIMIT: 4096
OMP_NUM_THREADS: 1
run: |
source .env/bin/activate
python -m pytest -n 2 --dist=loadfile -s ./tests/
- name: Install dependencies
run: |
source .env/bin/activate
pip install --upgrade pip
pip install .[torch,sklearn,testing,onnxruntime]
pip install git+https://github.com/huggingface/datasets
run_tests_torch_and_tf_multiple_gpu:
runs-on: [self-hosted, multi-gpu]
- name: Are GPUs recognized by our DL frameworks
run: |
source .env/bin/activate
python -c "import torch; print('Cuda available:', torch.cuda.is_available())"
python -c "import torch; print('Number of GPUs available:', torch.cuda.device_count())"
- name: Run all non-slow tests on GPU
env:
OMP_NUM_THREADS: 1
CUDA_VISIBLE_DEVICES: 0
run: |
source .env/bin/activate
python -m pytest -n 2 --dist=loadfile -s --make-reports=tests_torch_gpu tests
- name: Failure short reports
if: ${{ always() }}
run: cat reports/tests_torch_gpu_failures_short.txt
- name: Test suite reports artifacts
if: ${{ always() }}
uses: actions/upload-artifact@v2
with:
name: run_all_tests_torch_gpu_test_reports
path: reports
run_tests_tf_gpu:
runs-on: [self-hosted, single-gpu]
steps:
- uses: actions/checkout@v2
- name: Python version
@@ -81,7 +93,68 @@ jobs:
id: cache
with:
path: .env
key: v0-tests_tf_torch_multiple_gpu-${{ hashFiles('setup.py') }}
key: v1.1-tests_tf_gpu-${{ hashFiles('setup.py') }}
- name: Create new python env (on self-hosted runners we have to handle isolation ourselves)
run: |
python -m venv .env
source .env/bin/activate
which python
python --version
pip --version
- name: Install dependencies
run: |
source .env/bin/activate
pip install --upgrade pip
pip install .[tf,sklearn,testing,onnxruntime]
pip install git+https://github.com/huggingface/datasets
- name: Are GPUs recognized by our DL frameworks
run: |
source .env/bin/activate
TF_CPP_MIN_LOG_LEVEL=3 python -c "import tensorflow as tf; print('TF GPUs available:', bool(tf.config.list_physical_devices('GPU')))"
TF_CPP_MIN_LOG_LEVEL=3 python -c "import tensorflow as tf; print('Number of TF GPUs available:', len(tf.config.list_physical_devices('GPU')))"
- name: Run all non-slow tests on GPU
env:
OMP_NUM_THREADS: 1
CUDA_VISIBLE_DEVICES: 0
run: |
source .env/bin/activate
python -m pytest -n 2 --dist=loadfile -s --make-reports=tests_tf_gpu tests
- name: Failure short reports
if: ${{ always() }}
run: cat reports/tests_tf_gpu_failures_short.txt
- name: Test suite reports artifacts
if: ${{ always() }}
uses: actions/upload-artifact@v2
with:
name: run_all_tests_tf_gpu_test_reports
path: reports
run_tests_torch_multiple_gpu:
runs-on: [self-hosted, multi-gpu]
steps:
- uses: actions/checkout@v2
- name: Python version
run: |
which python
python --version
pip --version
- name: Current dir
run: pwd
- run: nvidia-smi
- name: Loading cache.
uses: actions/cache@v2
id: cache
with:
path: .env
key: v1.1-tests_torch_multiple_gpu-${{ hashFiles('setup.py') }}
- name: Create new python env (on self-hosted runners we have to handle isolation ourselves)
run: |
@@ -94,8 +167,7 @@ jobs:
run: |
source .env/bin/activate
pip install --upgrade pip
pip install torch!=1.6.0
pip install .[sklearn,testing,onnxruntime]
pip install .[torch,sklearn,testing,onnxruntime]
pip install git+https://github.com/huggingface/datasets
- name: Are GPUs recognized by our DL frameworks
@@ -106,9 +178,78 @@ jobs:
- name: Run all non-slow tests on GPU
env:
TF_FORCE_GPU_ALLOW_GROWTH: "true"
# TF_GPU_MEMORY_LIMIT: 4096
OMP_NUM_THREADS: 1
run: |
source .env/bin/activate
python -m pytest -n 2 --dist=loadfile -s ./tests/
python -m pytest -n 2 --dist=loadfile -s --make-reports=tests_torch_multiple_gpu tests
- name: Failure short reports
if: ${{ always() }}
run: cat reports/tests_torch_multiple_gpu_failures_short.txt
- name: Test suite reports artifacts
if: ${{ always() }}
uses: actions/upload-artifact@v2
with:
name: run_all_tests_torch_multi_gpu_test_reports
path: reports
run_tests_tf_multiple_gpu:
runs-on: [self-hosted, multi-gpu]
steps:
- uses: actions/checkout@v2
- name: Python version
run: |
which python
python --version
pip --version
- name: Current dir
run: pwd
- run: nvidia-smi
- name: Loading cache.
uses: actions/cache@v2
id: cache
with:
path: .env
key: v1.1-tests_tf_multiple_gpu-${{ hashFiles('setup.py') }}
- name: Create new python env (on self-hosted runners we have to handle isolation ourselves)
run: |
python -m venv .env
source .env/bin/activate
which python
python --version
pip --version
- name: Install dependencies
run: |
source .env/bin/activate
pip install --upgrade pip
pip install .[tf,sklearn,testing,onnxruntime]
pip install git+https://github.com/huggingface/datasets
- name: Are GPUs recognized by our DL frameworks
run: |
source .env/bin/activate
TF_CPP_MIN_LOG_LEVEL=3 python -c "import tensorflow as tf; print('TF GPUs available:', bool(tf.config.list_physical_devices('GPU')))"
TF_CPP_MIN_LOG_LEVEL=3 python -c "import tensorflow as tf; print('Number of TF GPUs available:', len(tf.config.list_physical_devices('GPU')))"
- name: Run all non-slow tests on GPU
env:
OMP_NUM_THREADS: 1
run: |
source .env/bin/activate
python -m pytest -n 2 --dist=loadfile -s --make-reports=tests_tf_multiple_gpu tests
- name: Failure short reports
if: ${{ always() }}
run: cat reports/tests_tf_multiple_gpu_failures_short.txt
- name: Test suite reports artifacts
if: ${{ always() }}
uses: actions/upload-artifact@v2
with:
name: run_all_tests_tf_multi_gpu_test_reports
path: reports

View File

@@ -1,3 +1,8 @@
# configuration notes:
#
# - `source .env/bin/activate` is currently needed to be run first thing first in each step. Otherwise
# the step uses the system-wide python interpreter.
name: Self-hosted runner (scheduled)
on:
@@ -9,70 +14,8 @@ on:
- cron: "0 0 * * *"
jobs:
run_all_tests_torch_and_tf_gpu:
run_all_tests_torch_gpu:
runs-on: [self-hosted, single-gpu]
steps:
- uses: actions/checkout@v2
- name: Loading cache.
uses: actions/cache@v2
id: cache
with:
path: .env
key: v0-slow_tests_tf_torch_gpu-${{ hashFiles('setup.py') }}
- name: Python version
run: |
which python
python --version
pip --version
- name: Current dir
run: pwd
- run: nvidia-smi
- name: Create new python env (on self-hosted runners we have to handle isolation ourselves)
if: steps.cache.outputs.cache-hit != 'true'
run: |
python -m venv .env
source .env/bin/activate
which python
python --version
pip --version
- name: Install dependencies
run: |
source .env/bin/activate
pip install --upgrade pip
pip install torch!=1.6.0
pip install .[sklearn,testing,onnxruntime]
pip install git+https://github.com/huggingface/datasets
- name: Are GPUs recognized by our DL frameworks
run: |
source .env/bin/activate
python -c "import torch; print('Cuda available:', torch.cuda.is_available())"
python -c "import torch; print('Number of GPUs available:', torch.cuda.device_count())"
- name: Run all tests on GPU
env:
TF_FORCE_GPU_ALLOW_GROWTH: "true"
OMP_NUM_THREADS: 1
RUN_SLOW: yes
run: |
source .env/bin/activate
python -m pytest -n 1 --dist=loadfile -s ./tests/ --durations=0
- name: Run examples tests on GPU
env:
TF_FORCE_GPU_ALLOW_GROWTH: "true"
OMP_NUM_THREADS: 1
RUN_SLOW: yes
run: |
source .env/bin/activate
pip install -r examples/requirements.txt
python -m pytest -n 1 --dist=loadfile -s examples --durations=0
run_all_tests_torch_and_tf_multiple_gpu:
runs-on: [self-hosted, multi-gpu]
steps:
- uses: actions/checkout@v2
@@ -81,16 +24,18 @@ jobs:
id: cache
with:
path: .env
key: v0-slow_tests_tf_torch_multi_gpu-${{ hashFiles('setup.py') }}
key: v 1.1-slow_tests_torch_gpu-${{ hashFiles('setup.py') }}
- name: Python version
run: |
which python
python --version
pip --version
- name: Current dir
run: pwd
- run: nvidia-smi
- name: Create new python env (on self-hosted runners we have to handle isolation ourselves)
if: steps.cache.outputs.cache-hit != 'true'
run: |
@@ -99,13 +44,14 @@ jobs:
which python
python --version
pip --version
- name: Install dependencies
run: |
source .env/bin/activate
pip install --upgrade pip
pip install torch!=1.6.0
pip install .[sklearn,testing,onnxruntime]
pip install .[torch,sklearn,testing,onnxruntime]
pip install git+https://github.com/huggingface/datasets
pip list
- name: Are GPUs recognized by our DL frameworks
run: |
@@ -115,19 +61,299 @@ jobs:
- name: Run all tests on GPU
env:
TF_FORCE_GPU_ALLOW_GROWTH: "true"
OMP_NUM_THREADS: 1
RUN_SLOW: yes
run: |
source .env/bin/activate
python -m pytest -n 1 --dist=loadfile -s ./tests/ --durations=0
python -m pytest -n 1 --dist=loadfile -s --make-reports=tests_torch_gpu tests
- name: Failure short reports
if: ${{ always() }}
run: cat reports/tests_torch_gpu_failures_short.txt
- name: Run examples tests on GPU
if: ${{ always() }}
env:
TF_FORCE_GPU_ALLOW_GROWTH: "true"
OMP_NUM_THREADS: 1
RUN_SLOW: yes
run: |
source .env/bin/activate
pip install -r examples/requirements.txt
python -m pytest -n 1 --dist=loadfile -s examples --durations=0
python -m pytest -n 1 --dist=loadfile -s --make-reports=examples_torch_gpu examples
- name: Failure short reports
if: ${{ always() }}
run: cat reports/examples_torch_gpu_failures_short.txt
- name: Run all pipeline tests on GPU
if: ${{ always() }}
env:
TF_FORCE_GPU_ALLOW_GROWTH: "true"
OMP_NUM_THREADS: 1
RUN_SLOW: yes
RUN_PIPELINE_TESTS: yes
run: |
source .env/bin/activate
python -m pytest -n 1 --dist=loadfile -s -m is_pipeline_test --make-reports=tests_torch_pipeline_gpu tests
- name: Failure short reports
if: ${{ always() }}
run: cat reports/tests_torch_pipeline_gpu_failures_short.txt
- name: Test suite reports artifacts
if: ${{ always() }}
uses: actions/upload-artifact@v2
with:
name: run_all_tests_torch_gpu_test_reports
path: reports
run_all_tests_tf_gpu:
runs-on: [self-hosted, single-gpu]
steps:
- uses: actions/checkout@v2
- name: Loading cache.
uses: actions/cache@v2
id: cache
with:
path: .env
key: v1.1-slow_tests_tf_gpu-${{ hashFiles('setup.py') }}
- name: Python version
run: |
which python
python --version
pip --version
- name: Current dir
run: pwd
- run: nvidia-smi
- name: Create new python env (on self-hosted runners we have to handle isolation ourselves)
if: steps.cache.outputs.cache-hit != 'true'
run: |
python -m venv .env
source .env/bin/activate
which python
python --version
pip --version
- name: Install dependencies
run: |
source .env/bin/activate
pip install --upgrade pip
pip install .[tf,sklearn,testing,onnxruntime]
pip install git+https://github.com/huggingface/datasets
pip list
- name: Are GPUs recognized by our DL frameworks
run: |
source .env/bin/activate
TF_CPP_MIN_LOG_LEVEL=3 python -c "import tensorflow as tf; print('TF GPUs available:', bool(tf.config.list_physical_devices('GPU')))"
TF_CPP_MIN_LOG_LEVEL=3 python -c "import tensorflow as tf; print('Number of TF GPUs available:', len(tf.config.list_physical_devices('GPU')))"
- name: Run all tests on GPU
env:
OMP_NUM_THREADS: 1
RUN_SLOW: yes
run: |
source .env/bin/activate
python -m pytest -n 1 --dist=loadfile -s --make-reports=tests_tf_gpu tests
- name: Failure short reports
if: ${{ always() }}
run: cat reports/tests_tf_gpu_failures_short.txt
- name: Run all pipeline tests on GPU
if: ${{ always() }}
env:
TF_FORCE_GPU_ALLOW_GROWTH: "true"
OMP_NUM_THREADS: 1
RUN_SLOW: yes
RUN_PIPELINE_TESTS: yes
run: |
source .env/bin/activate
python -m pytest -n 1 --dist=loadfile -s -m is_pipeline_test --make-reports=tests_tf_pipelines_gpu tests
- name: Failure short reports
if: ${{ always() }}
run: cat reports/tests_tf_pipelines_gpu_failures_short.txt
- name: Test suite reports artifacts
if: ${{ always() }}
uses: actions/upload-artifact@v2
with:
name: run_all_tests_tf_gpu_test_reports
path: reports
run_all_tests_torch_multiple_gpu:
runs-on: [self-hosted, multi-gpu]
steps:
- uses: actions/checkout@v2
- name: Loading cache.
uses: actions/cache@v2
id: cache
with:
path: .env
key: v1.1-slow_tests_torch_multi_gpu-${{ hashFiles('setup.py') }}
- name: Python version
run: |
which python
python --version
pip --version
- name: Current dir
run: pwd
- run: nvidia-smi
- name: Create new python env (on self-hosted runners we have to handle isolation ourselves)
if: steps.cache.outputs.cache-hit != 'true'
run: |
python -m venv .env
source .env/bin/activate
which python
python --version
pip --version
- name: Install dependencies
run: |
source .env/bin/activate
pip install --upgrade pip
pip install .[torch,sklearn,testing,onnxruntime]
pip install git+https://github.com/huggingface/datasets
pip list
- name: Are GPUs recognized by our DL frameworks
run: |
source .env/bin/activate
python -c "import torch; print('Cuda available:', torch.cuda.is_available())"
python -c "import torch; print('Number of GPUs available:', torch.cuda.device_count())"
- name: Run all tests on multi-GPU
env:
OMP_NUM_THREADS: 1
RUN_SLOW: yes
run: |
source .env/bin/activate
python -m pytest -n 1 --dist=loadfile -s --make-reports=tests_torch_multiple_gpu tests
- name: Failure short reports
if: ${{ always() }}
run: cat reports/tests_torch_multiple_gpu_failures_short.txt
- name: Run examples tests on multi-GPU
env:
OMP_NUM_THREADS: 1
RUN_SLOW: yes
run: |
source .env/bin/activate
python -m pytest -n 1 --dist=loadfile -s --make-reports=examples_torch_multiple_gpu examples
- name: Failure short reports
if: ${{ always() }}
run: cat reports/examples_torch_multiple_gpu_failures_short.txt
- name: Run all pipeline tests on multi-GPU
if: ${{ always() }}
env:
TF_FORCE_GPU_ALLOW_GROWTH: "true"
OMP_NUM_THREADS: 1
RUN_SLOW: yes
RUN_PIPELINE_TESTS: yes
run: |
source .env/bin/activate
python -m pytest -n 1 --dist=loadfile -s -m is_pipeline_test --make-reports=tests_torch_pipeline_multiple_gpu tests
- name: Failure short reports
if: ${{ always() }}
run: cat reports/tests_torch_pipeline_multiple_gpu_failures_short.txt
- name: Test suite reports artifacts
if: ${{ always() }}
uses: actions/upload-artifact@v2
with:
name: run_all_tests_torch_multi_gpu_test_reports
path: reports
run_all_tests_tf_multiple_gpu:
runs-on: [self-hosted, multi-gpu]
steps:
- uses: actions/checkout@v2
- name: Loading cache.
uses: actions/cache@v2
id: cache
with:
path: .env
key: v1.1-slow_tests_tf_multi_gpu-${{ hashFiles('setup.py') }}
- name: Python version
run: |
which python
python --version
pip --version
- name: Current dir
run: pwd
- run: nvidia-smi
- name: Create new python env (on self-hosted runners we have to handle isolation ourselves)
if: steps.cache.outputs.cache-hit != 'true'
run: |
python -m venv .env
source .env/bin/activate
which python
python --version
pip --version
- name: Install dependencies
run: |
source .env/bin/activate
pip install --upgrade pip
pip install .[tf,sklearn,testing,onnxruntime]
pip install git+https://github.com/huggingface/datasets
pip list
- name: Are GPUs recognized by our DL frameworks
run: |
source .env/bin/activate
TF_CPP_MIN_LOG_LEVEL=3 python -c "import tensorflow as tf; print('TF GPUs available:', bool(tf.config.list_physical_devices('GPU')))"
TF_CPP_MIN_LOG_LEVEL=3 python -c "import tensorflow as tf; print('Number of TF GPUs available:', len(tf.config.list_physical_devices('GPU')))"
- name: Run all tests on multi-GPU
env:
OMP_NUM_THREADS: 1
RUN_SLOW: yes
run: |
source .env/bin/activate
python -m pytest -n 1 --dist=loadfile -s --make-reports=tests_tf_multiple_gpu tests
- name: Failure short reports
if: ${{ always() }}
run: cat reports/tests_tf_multiple_gpu_failures_short.txt
- name: Run all pipeline tests on multi-GPU
if: ${{ always() }}
env:
TF_FORCE_GPU_ALLOW_GROWTH: "true"
OMP_NUM_THREADS: 1
RUN_SLOW: yes
RUN_PIPELINE_TESTS: yes
run: |
source .env/bin/activate
python -m pytest -n 1 --dist=loadfile -s -m is_pipeline_test --make-reports=tests_tf_pipelines_multiple_gpu tests
- name: Failure short reports
if: ${{ always() }}
run: cat reports/tests_tf_multiple_gpu_pipelines_failures_short.txt
- name: Test suite reports artifacts
if: ${{ always() }}
uses: actions/upload-artifact@v2
with:
name: run_all_tests_tf_multi_gpu_test_reports
path: reports

View File

@@ -96,7 +96,7 @@ folder.
## Start contributing! (Pull Requests)
Before writing code, we strongly advise you to search through the exising PRs or
Before writing code, we strongly advise you to search through the existing PRs or
issues to make sure that nobody is already working on the same thing. If you are
unsure, it is always a good idea to open an issue to get some feedback.
@@ -235,7 +235,7 @@ Follow these steps to start contributing:
### Checklist
1. The title of your pull request should be a summary of its contribution;
2. If your pull request adresses an issue, please mention the issue number in
2. If your pull request addresses an issue, please mention the issue number in
the pull request description to make sure they are linked (and people
consulting the issue know you are working on it);
3. To indicate a work in progress please prefix the title with `[WIP]`. These

View File

@@ -1,16 +1,11 @@
.PHONY: modified_only_fixup extra_quality_checks quality style fixup fix-copies test test-examples docs
check_dirs := examples templates tests src utils
# get modified files since the branch was made
fork_point_sha := $(shell git merge-base --fork-point master)
joined_dirs := $(shell echo $(check_dirs) | tr " " "|")
modified_py_files := $(shell git diff --name-only $(fork_point_sha) | egrep '^($(joined_dirs))' | egrep '\.py$$')
#$(info modified files are: $(modified_py_files))
check_dirs := examples tests src utils
modified_only_fixup:
@if [ -n "$(modified_py_files)" ]; then \
$(eval modified_py_files := $(shell python utils/get_modified_files.py $(check_dirs)))
@if test -n "$(modified_py_files)"; then \
echo "Checking/fixing $(modified_py_files)"; \
black $(modified_py_files); \
isort $(modified_py_files); \
@@ -25,12 +20,14 @@ extra_quality_checks:
python utils/check_copies.py
python utils/check_dummies.py
python utils/check_repo.py
python utils/style_doc.py src/transformers docs/source --max_len 119
# this target runs checks on all files
quality:
black --check $(check_dirs)
isort --check-only $(check_dirs)
flake8 $(check_dirs)
python utils/style_doc.py src/transformers docs/source --max_len 119 --check_only
${MAKE} extra_quality_checks
# Format source code automatically and check is there are any problems left that need manual fixing
@@ -38,6 +35,7 @@ quality:
style:
black $(check_dirs)
isort $(check_dirs)
python utils/style_doc.py src/transformers docs/source --max_len 119
# Super fast fix and check target that only works on relevant modified files since the branch was made

View File

@@ -180,7 +180,7 @@ Min, Patrick Lewis, Ledell Wu, Sergey Edunov, Danqi Chen, and Wen-tau Yih.
1. **[Longformer](https://huggingface.co/transformers/model_doc/longformer.html)** (from AllenAI) released with the paper [Longformer: The Long-Document Transformer](https://arxiv.org/abs/2004.05150) by Iz Beltagy, Matthew E. Peters, Arman Cohan.
1. **[LXMERT](https://huggingface.co/transformers/model_doc/lxmert.html)** (from UNC Chapel Hill) released with the paper [LXMERT: Learning Cross-Modality Encoder Representations from Transformers for Open-Domain Question Answering](https://arxiv.org/abs/1908.07490) by Hao Tan and Mohit Bansal.
1. **[MarianMT](https://huggingface.co/transformers/model_doc/marian.html)** Machine translation models trained using [OPUS](http://opus.nlpl.eu/) data by Jörg Tiedemann. The [Marian Framework](https://marian-nmt.github.io/) is being developed by the Microsoft Translator Team.
1. **[MBart](https://huggingface.co/transformers/model_doc/mbart.html)** (from Facebook) released with the paper [Multilingual Denoising Pre-training for Neural Machine Translation](https://arxiv.org/abs/2001.08210) by Yinhan Liu, Jiatao Gu, Naman Goyal, Xian Li, Sergey Edunov, Marjan Ghazvininejad, Mike Lewis, Luke Zettlemoyer.
1. **[MBart](https://huggingface.co/transformers/model_doc/mbart.html)** (from Facebook) released with the paper [Multilingual Denoising Pre-training for Neural Machine Translation](https://arxiv.org/abs/2001.08210) by Yinhan Liu, Jiatao Gu, Naman Goyal, Xian Li, Sergey Edunov, Marjan Ghazvininejad, Mike Lewis, Luke Zettlemoyer.
1. **[Pegasus](https://huggingface.co/transformers/model_doc/pegasus.html)** (from Google) released with the paper [PEGASUS: Pre-training with Extracted Gap-sentences for Abstractive Summarization](https://arxiv.org/abs/1912.08777)> by Jingqing Zhang, Yao Zhao, Mohammad Saleh and Peter J. Liu.
1. **[ProphetNet](https://huggingface.co/transformers/model_doc/prophetnet.html)** (from Microsoft Research) released with the paper [ProphetNet: Predicting Future N-gram for Sequence-to-Sequence Pre-training](https://arxiv.org/abs/2001.04063) by Yu Yan, Weizhen Qi, Yeyun Gong, Dayiheng Liu, Nan Duan, Jiusheng Chen, Ruofei Zhang and Ming Zhou.
1. **[Reformer](https://huggingface.co/transformers/model_doc/reformer.html)** (from Google Research) released with the paper [Reformer: The Efficient Transformer](https://arxiv.org/abs/2001.04451) by Nikita Kitaev, Łukasz Kaiser, Anselm Levskaya.

View File

@@ -1,7 +0,0 @@
coverage:
status:
project:
default:
informational: true
patch: off
comment: false

View File

@@ -1,10 +1,11 @@
// These two things need to be updated at each release for the version selector.
// Last stable version
const stableVersion = "v3.3.0"
const stableVersion = "v3.4.0"
// Dictionary doc folder to label
const versionMapping = {
"master": "master",
"": "v3.3.0/v3.3.1",
"": "v3.4.0",
"v3.3.1": "v3.3.0/v3.3.1",
"v3.2.0": "v3.2.0",
"v3.1.0": "v3.1.0 (stable)",
"v3.0.2": "v3.0.0/v3.0.1/v3.0.2",

View File

@@ -3,21 +3,27 @@ Benchmarks
Let's take a look at how 🤗 Transformer models can be benchmarked, best practices, and already available benchmarks.
A notebook explaining in more detail how to benchmark 🤗 Transformer models can be found `here <https://github.com/huggingface/transformers/blob/master/notebooks/05-benchmark.ipynb>`__.
A notebook explaining in more detail how to benchmark 🤗 Transformer models can be found `here
<https://github.com/huggingface/transformers/blob/master/notebooks/05-benchmark.ipynb>`__.
How to benchmark 🤗 Transformer models
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
The classes :class:`~transformers.PyTorchBenchmark` and :class:`~transformers.TensorFlowBenchmark` allow to flexibly benchmark 🤗 Transformer models.
The benchmark classes allow us to measure the `peak memory usage` and `required time` for both
`inference` and `training`.
The classes :class:`~transformers.PyTorchBenchmark` and :class:`~transformers.TensorFlowBenchmark` allow to flexibly
benchmark 🤗 Transformer models. The benchmark classes allow us to measure the `peak memory usage` and `required time`
for both `inference` and `training`.
.. note::
Hereby, `inference` is defined by a single forward pass, and `training` is defined by a single forward pass and backward pass.
Hereby, `inference` is defined by a single forward pass, and `training` is defined by a single forward pass and
backward pass.
The benchmark classes :class:`~transformers.PyTorchBenchmark` and :class:`~transformers.TensorFlowBenchmark` expect an object of type :class:`~transformers.PyTorchBenchmarkArguments` and :class:`~transformers.TensorFlowBenchmarkArguments`, respectively, for instantiation. :class:`~transformers.PyTorchBenchmarkArguments` and :class:`~transformers.TensorFlowBenchmarkArguments` are data classes and contain all relevant configurations for their corresponding benchmark class.
In the following example, it is shown how a BERT model of type `bert-base-cased` can be benchmarked.
The benchmark classes :class:`~transformers.PyTorchBenchmark` and :class:`~transformers.TensorFlowBenchmark` expect an
object of type :class:`~transformers.PyTorchBenchmarkArguments` and
:class:`~transformers.TensorFlowBenchmarkArguments`, respectively, for instantiation.
:class:`~transformers.PyTorchBenchmarkArguments` and :class:`~transformers.TensorFlowBenchmarkArguments` are data
classes and contain all relevant configurations for their corresponding benchmark class. In the following example, it
is shown how a BERT model of type `bert-base-cased` can be benchmarked.
.. code-block::
@@ -34,11 +40,15 @@ In the following example, it is shown how a BERT model of type `bert-base-cased`
>>> benchmark = TensorFlowBenchmark(args)
Here, three arguments are given to the benchmark argument data classes, namely ``models``, ``batch_sizes``, and ``sequence_lengths``. The argument ``models`` is required and expects a :obj:`list` of model identifiers from the `model hub <https://huggingface.co/models>`__
The :obj:`list` arguments ``batch_sizes`` and ``sequence_lengths`` define the size of the ``input_ids`` on which the model is benchmarked.
There are many more parameters that can be configured via the benchmark argument data classes. For more detail on these one can either directly consult the files
``src/transformers/benchmark/benchmark_args_utils.py``, ``src/transformers/benchmark/benchmark_args.py`` (for PyTorch) and ``src/transformers/benchmark/benchmark_args_tf.py`` (for Tensorflow).
Alternatively, running the following shell commands from root will print out a descriptive list of all configurable parameters for PyTorch and Tensorflow respectively.
Here, three arguments are given to the benchmark argument data classes, namely ``models``, ``batch_sizes``, and
``sequence_lengths``. The argument ``models`` is required and expects a :obj:`list` of model identifiers from the
`model hub <https://huggingface.co/models>`__ The :obj:`list` arguments ``batch_sizes`` and ``sequence_lengths`` define
the size of the ``input_ids`` on which the model is benchmarked. There are many more parameters that can be configured
via the benchmark argument data classes. For more detail on these one can either directly consult the files
``src/transformers/benchmark/benchmark_args_utils.py``, ``src/transformers/benchmark/benchmark_args.py`` (for PyTorch)
and ``src/transformers/benchmark/benchmark_args_tf.py`` (for Tensorflow). Alternatively, running the following shell
commands from root will print out a descriptive list of all configurable parameters for PyTorch and Tensorflow
respectively.
.. code-block:: bash
@@ -65,7 +75,7 @@ An instantiated benchmark object can then simply be run by calling ``benchmark.r
bert-base-uncased 8 128 0.018
bert-base-uncased 8 512 0.088
--------------------------------------------------------------------------------
==================== INFERENCE - MEMORY - RESULT ====================
--------------------------------------------------------------------------------
Model Name Batch Size Seq Length Memory in MB
@@ -75,7 +85,7 @@ An instantiated benchmark object can then simply be run by calling ``benchmark.r
bert-base-uncased 8 128 1307
bert-base-uncased 8 512 1539
--------------------------------------------------------------------------------
==================== ENVIRONMENT INFORMATION ====================
- transformers_version: 2.11.0
- framework: PyTorch
@@ -98,7 +108,7 @@ An instantiated benchmark object can then simply be run by calling ``benchmark.r
- gpu_power_watts: 280.0
- gpu_performance_state: 2
- use_tpu: False
>>> ## TENSORFLOW CODE
>>> results = benchmark.run()
>>> print(results)
@@ -111,7 +121,7 @@ An instantiated benchmark object can then simply be run by calling ``benchmark.r
bert-base-uncased 8 128 0.022
bert-base-uncased 8 512 0.105
--------------------------------------------------------------------------------
==================== INFERENCE - MEMORY - RESULT ====================
--------------------------------------------------------------------------------
Model Name Batch Size Seq Length Memory in MB
@@ -121,7 +131,7 @@ An instantiated benchmark object can then simply be run by calling ``benchmark.r
bert-base-uncased 8 128 1330
bert-base-uncased 8 512 1770
--------------------------------------------------------------------------------
==================== ENVIRONMENT INFORMATION ====================
- transformers_version: 2.11.0
- framework: Tensorflow
@@ -145,14 +155,17 @@ An instantiated benchmark object can then simply be run by calling ``benchmark.r
- gpu_performance_state: 2
- use_tpu: False
By default, the `time` and the `required memory` for `inference` are benchmarked.
In the example output above the first two sections show the result corresponding to `inference time` and `inference memory`.
In addition, all relevant information about the computing environment, `e.g.` the GPU type, the system, the library versions, etc... are printed out in the third section under `ENVIRONMENT INFORMATION`.
This information can optionally be saved in a `.csv` file when adding the argument :obj:`save_to_csv=True` to :class:`~transformers.PyTorchBenchmarkArguments` and :class:`~transformers.TensorFlowBenchmarkArguments` respectively.
In this case, every section is saved in a separate `.csv` file. The path to each `.csv` file can optionally be defined via the argument data classes.
By default, the `time` and the `required memory` for `inference` are benchmarked. In the example output above the first
two sections show the result corresponding to `inference time` and `inference memory`. In addition, all relevant
information about the computing environment, `e.g.` the GPU type, the system, the library versions, etc... are printed
out in the third section under `ENVIRONMENT INFORMATION`. This information can optionally be saved in a `.csv` file
when adding the argument :obj:`save_to_csv=True` to :class:`~transformers.PyTorchBenchmarkArguments` and
:class:`~transformers.TensorFlowBenchmarkArguments` respectively. In this case, every section is saved in a separate
`.csv` file. The path to each `.csv` file can optionally be defined via the argument data classes.
Instead of benchmarking pre-trained models via their model identifier, `e.g.` `bert-base-uncased`, the user can alternatively benchmark an arbitrary configuration of any available model class.
In this case, a :obj:`list` of configurations must be inserted with the benchmark args as follows.
Instead of benchmarking pre-trained models via their model identifier, `e.g.` `bert-base-uncased`, the user can
alternatively benchmark an arbitrary configuration of any available model class. In this case, a :obj:`list` of
configurations must be inserted with the benchmark args as follows.
.. code-block::
@@ -183,7 +196,7 @@ In this case, a :obj:`list` of configurations must be inserted with the benchmar
bert-6-lay 8 128 0.009
bert-6-lay 8 512 0.044
--------------------------------------------------------------------------------
==================== INFERENCE - MEMORY - RESULT ====================
--------------------------------------------------------------------------------
Model Name Batch Size Seq Length Memory in MB
@@ -201,7 +214,7 @@ In this case, a :obj:`list` of configurations must be inserted with the benchmar
bert-6-lay 8 128 1127
bert-6-lay 8 512 1359
--------------------------------------------------------------------------------
==================== ENVIRONMENT INFORMATION ====================
- transformers_version: 2.11.0
- framework: PyTorch
@@ -252,7 +265,7 @@ In this case, a :obj:`list` of configurations must be inserted with the benchmar
bert-6-lay 8 128 0.0011
bert-6-lay 8 512 0.074
--------------------------------------------------------------------------------
==================== INFERENCE - MEMORY - RESULT ====================
--------------------------------------------------------------------------------
Model Name Batch Size Seq Length Memory in MB
@@ -270,7 +283,7 @@ In this case, a :obj:`list` of configurations must be inserted with the benchmar
bert-6-lay 8 128 1330
bert-6-lay 8 512 1540
--------------------------------------------------------------------------------
==================== ENVIRONMENT INFORMATION ====================
- transformers_version: 2.11.0
- framework: Tensorflow
@@ -295,8 +308,9 @@ In this case, a :obj:`list` of configurations must be inserted with the benchmar
- use_tpu: False
Again, `inference time` and `required memory` for `inference` are measured, but this time for customized configurations of the :obj:`BertModel` class. This feature can especially be helpful when
deciding for which configuration the model should be trained.
Again, `inference time` and `required memory` for `inference` are measured, but this time for customized configurations
of the :obj:`BertModel` class. This feature can especially be helpful when deciding for which configuration the model
should be trained.
Benchmark best practices
@@ -304,19 +318,28 @@ Benchmark best practices
This section lists a couple of best practices one should be aware of when benchmarking a model.
- Currently, only single device benchmarking is supported. When benchmarking on GPU, it is recommended that the user
specifies on which device the code should be run by setting the ``CUDA_VISIBLE_DEVICES`` environment variable in the shell, `e.g.` ``export CUDA_VISIBLE_DEVICES=0`` before running the code.
- The option :obj:`no_multi_processing` should only be set to :obj:`True` for testing and debugging. To ensure accurate memory measurement it is recommended to run each memory benchmark in a separate process by making sure :obj:`no_multi_processing` is set to :obj:`True`.
- One should always state the environment information when sharing the results of a model benchmark. Results can vary heavily between different GPU devices, library versions, etc., so that benchmark results on their own are not very useful for the community.
- Currently, only single device benchmarking is supported. When benchmarking on GPU, it is recommended that the user
specifies on which device the code should be run by setting the ``CUDA_VISIBLE_DEVICES`` environment variable in the
shell, `e.g.` ``export CUDA_VISIBLE_DEVICES=0`` before running the code.
- The option :obj:`no_multi_processing` should only be set to :obj:`True` for testing and debugging. To ensure accurate
memory measurement it is recommended to run each memory benchmark in a separate process by making sure
:obj:`no_multi_processing` is set to :obj:`True`.
- One should always state the environment information when sharing the results of a model benchmark. Results can vary
heavily between different GPU devices, library versions, etc., so that benchmark results on their own are not very
useful for the community.
Sharing your benchmark
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Previously all available core models (10 at the time) have been benchmarked for `inference time`, across many different settings: using PyTorch, with
and without TorchScript, using TensorFlow, with and without XLA. All of those tests were done across CPUs (except for
TensorFlow XLA) and GPUs.
Previously all available core models (10 at the time) have been benchmarked for `inference time`, across many different
settings: using PyTorch, with and without TorchScript, using TensorFlow, with and without XLA. All of those tests were
done across CPUs (except for TensorFlow XLA) and GPUs.
The approach is detailed in the `following blogpost <https://medium.com/huggingface/benchmarking-transformers-pytorch-and-tensorflow-e2917fb891c2>`__ and the results are available `here <https://docs.google.com/spreadsheets/d/1sryqufw2D0XlUH4sq3e9Wnxu5EAQkaohzrJbd5HdQ_w/edit?usp=sharing>`__.
The approach is detailed in the `following blogpost
<https://medium.com/huggingface/benchmarking-transformers-pytorch-and-tensorflow-e2917fb891c2>`__ and the results are
available `here
<https://docs.google.com/spreadsheets/d/1sryqufw2D0XlUH4sq3e9Wnxu5EAQkaohzrJbd5HdQ_w/edit?usp=sharing>`__.
With the new `benchmark` tools, it is easier than ever to share your benchmark results with the community `here <https://github.com/huggingface/transformers/blob/master/examples/benchmarking/README.md>`__.
With the new `benchmark` tools, it is easier than ever to share your benchmark results with the community `here
<https://github.com/huggingface/transformers/blob/master/examples/benchmarking/README.md>`__.

View File

@@ -1,18 +1,26 @@
BERTology
-----------------------------------------------------------------------------------------------------------------------
There is a growing field of study concerned with investigating the inner working of large-scale transformers like BERT (that some call "BERTology"). Some good examples of this field are:
There is a growing field of study concerned with investigating the inner working of large-scale transformers like BERT
(that some call "BERTology"). Some good examples of this field are:
* BERT Rediscovers the Classical NLP Pipeline by Ian Tenney, Dipanjan Das, Ellie Pavlick: https://arxiv.org/abs/1905.05950
* BERT Rediscovers the Classical NLP Pipeline by Ian Tenney, Dipanjan Das, Ellie Pavlick:
https://arxiv.org/abs/1905.05950
* Are Sixteen Heads Really Better than One? by Paul Michel, Omer Levy, Graham Neubig: https://arxiv.org/abs/1905.10650
* What Does BERT Look At? An Analysis of BERT's Attention by Kevin Clark, Urvashi Khandelwal, Omer Levy, Christopher D. Manning: https://arxiv.org/abs/1906.04341
* What Does BERT Look At? An Analysis of BERT's Attention by Kevin Clark, Urvashi Khandelwal, Omer Levy, Christopher D.
Manning: https://arxiv.org/abs/1906.04341
In order to help this new field develop, we have included a few additional features in the BERT/GPT/GPT-2 models to help people access the inner representations, mainly adapted from the great work of Paul Michel (https://arxiv.org/abs/1905.10650):
In order to help this new field develop, we have included a few additional features in the BERT/GPT/GPT-2 models to
help people access the inner representations, mainly adapted from the great work of Paul Michel
(https://arxiv.org/abs/1905.10650):
* accessing all the hidden-states of BERT/GPT/GPT-2,
* accessing all the attention weights for each head of BERT/GPT/GPT-2,
* retrieving heads output values and gradients to be able to compute head importance score and prune head as explained in https://arxiv.org/abs/1905.10650.
* retrieving heads output values and gradients to be able to compute head importance score and prune head as explained
in https://arxiv.org/abs/1905.10650.
To help you understand and use these features, we have added a specific example script: `bertology.py <https://github.com/huggingface/transformers/blob/master/examples/bertology/run_bertology.py>`_ while extract information and prune a model pre-trained on GLUE.
To help you understand and use these features, we have added a specific example script: `bertology.py
<https://github.com/huggingface/transformers/blob/master/examples/bertology/run_bertology.py>`_ while extract
information and prune a model pre-trained on GLUE.

View File

@@ -26,7 +26,7 @@ author = u'huggingface'
# The short X.Y version
version = u''
# The full version, including alpha/beta/rc tags
release = u'3.4.0'
release = u'3.5.0'
# -- General configuration ---------------------------------------------------

View File

@@ -1,24 +1,40 @@
Converting Tensorflow Checkpoints
=======================================================================================================================
A command-line interface is provided to convert original Bert/GPT/GPT-2/Transformer-XL/XLNet/XLM checkpoints in models than be loaded using the ``from_pretrained`` methods of the library.
A command-line interface is provided to convert original Bert/GPT/GPT-2/Transformer-XL/XLNet/XLM checkpoints in models
than be loaded using the ``from_pretrained`` methods of the library.
.. note::
Since 2.3.0 the conversion script is now part of the transformers CLI (**transformers-cli**)
available in any transformers >= 2.3.0 installation.
Since 2.3.0 the conversion script is now part of the transformers CLI (**transformers-cli**) available in any
transformers >= 2.3.0 installation.
The documentation below reflects the **transformers-cli convert** command format.
BERT
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
You can convert any TensorFlow checkpoint for BERT (in particular `the pre-trained models released by Google <https://github.com/google-research/bert#pre-trained-models>`_\ ) in a PyTorch save file by using the `convert_bert_original_tf_checkpoint_to_pytorch.py <https://github.com/huggingface/transformers/blob/master/src/transformers/convert_bert_original_tf_checkpoint_to_pytorch.py>`_ script.
You can convert any TensorFlow checkpoint for BERT (in particular `the pre-trained models released by Google
<https://github.com/google-research/bert#pre-trained-models>`_\ ) in a PyTorch save file by using the
`convert_bert_original_tf_checkpoint_to_pytorch.py
<https://github.com/huggingface/transformers/blob/master/src/transformers/convert_bert_original_tf_checkpoint_to_pytorch.py>`_
script.
This CLI takes as input a TensorFlow checkpoint (three files starting with ``bert_model.ckpt``\ ) and the associated configuration file (\ ``bert_config.json``\ ), and creates a PyTorch model for this configuration, loads the weights from the TensorFlow checkpoint in the PyTorch model and saves the resulting model in a standard PyTorch save file that can be imported using ``torch.load()`` (see examples in `run_bert_extract_features.py <https://github.com/huggingface/pytorch-pretrained-BERT/tree/master/examples/run_bert_extract_features.py>`_\ , `run_bert_classifier.py <https://github.com/huggingface/pytorch-pretrained-BERT/tree/master/examples/run_bert_classifier.py>`_ and `run_bert_squad.py <https://github.com/huggingface/pytorch-pretrained-BERT/tree/master/examples/run_bert_squad.py>`_\ ).
This CLI takes as input a TensorFlow checkpoint (three files starting with ``bert_model.ckpt``\ ) and the associated
configuration file (\ ``bert_config.json``\ ), and creates a PyTorch model for this configuration, loads the weights
from the TensorFlow checkpoint in the PyTorch model and saves the resulting model in a standard PyTorch save file that
can be imported using ``torch.load()`` (see examples in `run_bert_extract_features.py
<https://github.com/huggingface/pytorch-pretrained-BERT/tree/master/examples/run_bert_extract_features.py>`_\ ,
`run_bert_classifier.py
<https://github.com/huggingface/pytorch-pretrained-BERT/tree/master/examples/run_bert_classifier.py>`_ and
`run_bert_squad.py <https://github.com/huggingface/pytorch-pretrained-BERT/tree/master/examples/run_bert_squad.py>`_\
).
You only need to run this conversion script **once** to get a PyTorch model. You can then disregard the TensorFlow checkpoint (the three files starting with ``bert_model.ckpt``\ ) but be sure to keep the configuration file (\ ``bert_config.json``\ ) and the vocabulary file (\ ``vocab.txt``\ ) as these are needed for the PyTorch model too.
You only need to run this conversion script **once** to get a PyTorch model. You can then disregard the TensorFlow
checkpoint (the three files starting with ``bert_model.ckpt``\ ) but be sure to keep the configuration file (\
``bert_config.json``\ ) and the vocabulary file (\ ``vocab.txt``\ ) as these are needed for the PyTorch model too.
To run this specific conversion script you will need to have TensorFlow and PyTorch installed (\ ``pip install tensorflow``\ ). The rest of the repository only requires PyTorch.
To run this specific conversion script you will need to have TensorFlow and PyTorch installed (\ ``pip install
tensorflow``\ ). The rest of the repository only requires PyTorch.
Here is an example of the conversion process for a pre-trained ``BERT-Base Uncased`` model:
@@ -31,14 +47,20 @@ Here is an example of the conversion process for a pre-trained ``BERT-Base Uncas
--config $BERT_BASE_DIR/bert_config.json \
--pytorch_dump_output $BERT_BASE_DIR/pytorch_model.bin
You can download Google's pre-trained models for the conversion `here <https://github.com/google-research/bert#pre-trained-models>`__.
You can download Google's pre-trained models for the conversion `here
<https://github.com/google-research/bert#pre-trained-models>`__.
ALBERT
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
Convert TensorFlow model checkpoints of ALBERT to PyTorch using the `convert_albert_original_tf_checkpoint_to_pytorch.py <https://github.com/huggingface/transformers/blob/master/src/transformers/convert_bert_original_tf_checkpoint_to_pytorch.py>`_ script.
Convert TensorFlow model checkpoints of ALBERT to PyTorch using the
`convert_albert_original_tf_checkpoint_to_pytorch.py
<https://github.com/huggingface/transformers/blob/master/src/transformers/convert_bert_original_tf_checkpoint_to_pytorch.py>`_
script.
The CLI takes as input a TensorFlow checkpoint (three files starting with ``model.ckpt-best``\ ) and the accompanying configuration file (\ ``albert_config.json``\ ), then creates and saves a PyTorch model. To run this conversion you will need to have TensorFlow and PyTorch installed.
The CLI takes as input a TensorFlow checkpoint (three files starting with ``model.ckpt-best``\ ) and the accompanying
configuration file (\ ``albert_config.json``\ ), then creates and saves a PyTorch model. To run this conversion you
will need to have TensorFlow and PyTorch installed.
Here is an example of the conversion process for the pre-trained ``ALBERT Base`` model:
@@ -51,12 +73,15 @@ Here is an example of the conversion process for the pre-trained ``ALBERT Base``
--config $ALBERT_BASE_DIR/albert_config.json \
--pytorch_dump_output $ALBERT_BASE_DIR/pytorch_model.bin
You can download Google's pre-trained models for the conversion `here <https://github.com/google-research/albert#pre-trained-models>`__.
You can download Google's pre-trained models for the conversion `here
<https://github.com/google-research/albert#pre-trained-models>`__.
OpenAI GPT
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
Here is an example of the conversion process for a pre-trained OpenAI GPT model, assuming that your NumPy checkpoint save as the same format than OpenAI pretrained model (see `here <https://github.com/openai/finetune-transformer-lm>`__\ )
Here is an example of the conversion process for a pre-trained OpenAI GPT model, assuming that your NumPy checkpoint
save as the same format than OpenAI pretrained model (see `here <https://github.com/openai/finetune-transformer-lm>`__\
)
.. code-block:: shell
@@ -72,7 +97,8 @@ Here is an example of the conversion process for a pre-trained OpenAI GPT model,
OpenAI GPT-2
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
Here is an example of the conversion process for a pre-trained OpenAI GPT-2 model (see `here <https://github.com/openai/gpt-2>`__\ )
Here is an example of the conversion process for a pre-trained OpenAI GPT-2 model (see `here
<https://github.com/openai/gpt-2>`__\ )
.. code-block:: shell
@@ -87,7 +113,8 @@ Here is an example of the conversion process for a pre-trained OpenAI GPT-2 mode
Transformer-XL
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
Here is an example of the conversion process for a pre-trained Transformer-XL model (see `here <https://github.com/kimiyoung/transformer-xl/tree/master/tf#obtain-and-evaluate-pretrained-sota-models>`__\ )
Here is an example of the conversion process for a pre-trained Transformer-XL model (see `here
<https://github.com/kimiyoung/transformer-xl/tree/master/tf#obtain-and-evaluate-pretrained-sota-models>`__\ )
.. code-block:: shell
@@ -130,4 +157,4 @@ Here is an example of the conversion process for a pre-trained XLM model:
--tf_checkpoint $XLM_CHECKPOINT_PATH \
--pytorch_dump_output $PYTORCH_DUMP_OUTPUT
[--config XML_CONFIG] \
[--finetuning_task_name XML_FINETUNED_TASK]
[--finetuning_task_name XML_FINETUNED_TASK]

View File

@@ -3,15 +3,15 @@ Fine-tuning with custom datasets
.. note::
The datasets used in this tutorial are available and can be more easily accessed using the
`🤗 NLP library <https://github.com/huggingface/nlp>`_. We do not use this library to access the datasets here
since this tutorial meant to illustrate how to work with your own data. A brief of introduction can be found
at the end of the tutorial in the section ":ref:`nlplib`".
The datasets used in this tutorial are available and can be more easily accessed using the `🤗 NLP library
<https://github.com/huggingface/nlp>`_. We do not use this library to access the datasets here since this tutorial
meant to illustrate how to work with your own data. A brief of introduction can be found at the end of the tutorial
in the section ":ref:`nlplib`".
This tutorial will take you through several examples of using 🤗 Transformers models with your own datasets. The
guide shows one of many valid workflows for using these models and is meant to be illustrative rather than
definitive. We show examples of reading in several data formats, preprocessing the data for several types of tasks,
and then preparing the data into PyTorch/TensorFlow ``Dataset`` objects which can easily be used either with
This tutorial will take you through several examples of using 🤗 Transformers models with your own datasets. The guide
shows one of many valid workflows for using these models and is meant to be illustrative rather than definitive. We
show examples of reading in several data formats, preprocessing the data for several types of tasks, and then preparing
the data into PyTorch/TensorFlow ``Dataset`` objects which can easily be used either with
:class:`~transformers.Trainer`/:class:`~transformers.TFTrainer` or with native PyTorch/TensorFlow.
We include several examples, each of which demonstrates a different type of common downstream task:
@@ -28,13 +28,13 @@ Sequence Classification with IMDb Reviews
.. note::
This dataset can be explored in the Hugging Face model hub (`IMDb <https://huggingface.co/datasets/imdb>`_), and can
be alternatively downloaded with the 🤗 NLP library with ``load_dataset("imdb")``.
This dataset can be explored in the Hugging Face model hub (`IMDb <https://huggingface.co/datasets/imdb>`_), and
can be alternatively downloaded with the 🤗 NLP library with ``load_dataset("imdb")``.
In this example, we'll show how to download, tokenize, and train a model on the IMDb reviews dataset. This task
takes the text of a review and requires the model to predict whether the sentiment of the review is positive or
negative. Let's start by downloading the dataset from the
`Large Movie Review Dataset <http://ai.stanford.edu/~amaas/data/sentiment/>`_ webpage.
In this example, we'll show how to download, tokenize, and train a model on the IMDb reviews dataset. This task takes
the text of a review and requires the model to predict whether the sentiment of the review is positive or negative.
Let's start by downloading the dataset from the `Large Movie Review Dataset
<http://ai.stanford.edu/~amaas/data/sentiment/>`_ webpage.
.. code-block:: bash
@@ -62,9 +62,8 @@ read this in.
train_texts, train_labels = read_imdb_split('aclImdb/train')
test_texts, test_labels = read_imdb_split('aclImdb/test')
We now have a train and test dataset, but let's also also create a validation set which we can use for for
evaluation and tuning without training our test set results. Sklearn has a convenient utility for creating such
splits:
We now have a train and test dataset, but let's also also create a validation set which we can use for for evaluation
and tuning without training our test set results. Sklearn has a convenient utility for creating such splits:
.. code-block:: python
@@ -80,8 +79,8 @@ pre-trained DistilBert, so let's use the DistilBert tokenizer.
tokenizer = DistilBertTokenizerFast.from_pretrained('distilbert-base-uncased')
Now we can simply pass our texts to the tokenizer. We'll pass ``truncation=True`` and ``padding=True``, which will
ensure that all of our sequences are padded to the same length and are truncated to be no longer model's maximum
input length. This will allow us to feed batches of sequences into the model at the same time.
ensure that all of our sequences are padded to the same length and are truncated to be no longer model's maximum input
length. This will allow us to feed batches of sequences into the model at the same time.
.. code-block:: python
@@ -90,9 +89,9 @@ input length. This will allow us to feed batches of sequences into the model at
test_encodings = tokenizer(test_texts, truncation=True, padding=True)
Now, let's turn our labels and encodings into a Dataset object. In PyTorch, this is done by subclassing a
``torch.utils.data.Dataset`` object and implementing ``__len__`` and ``__getitem__``. In TensorFlow, we pass our input encodings and
labels to the ``from_tensor_slices`` constructor method. We put the data in this format so that the data can be
easily batched such that each key in the batch encoding corresponds to a named parameter of the
``torch.utils.data.Dataset`` object and implementing ``__len__`` and ``__getitem__``. In TensorFlow, we pass our input
encodings and labels to the ``from_tensor_slices`` constructor method. We put the data in this format so that the data
can be easily batched such that each key in the batch encoding corresponds to a named parameter of the
:meth:`~transformers.DistilBertForSequenceClassification.forward` method of the model we will train.
.. code-block:: python
@@ -133,17 +132,17 @@ easily batched such that each key in the batch encoding corresponds to a named p
))
Now that our datasets our ready, we can fine-tune a model either with the 🤗
:class:`~transformers.Trainer`/:class:`~transformers.TFTrainer` or with native PyTorch/TensorFlow. See
:doc:`training <training>`.
:class:`~transformers.Trainer`/:class:`~transformers.TFTrainer` or with native PyTorch/TensorFlow. See :doc:`training
<training>`.
.. _ft_trainer:
Fine-tuning with Trainer
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
The steps above prepared the datasets in the way that the trainer is expected. Now all we need to do is create a
model to fine-tune, define the :class:`~transformers.TrainingArguments`/:class:`~transformers.TFTrainingArguments`
and instantiate a :class:`~transformers.Trainer`/:class:`~transformers.TFTrainer`.
The steps above prepared the datasets in the way that the trainer is expected. Now all we need to do is create a model
to fine-tune, define the :class:`~transformers.TrainingArguments`/:class:`~transformers.TFTrainingArguments` and
instantiate a :class:`~transformers.Trainer`/:class:`~transformers.TFTrainer`.
.. code-block:: python
@@ -248,15 +247,15 @@ Token Classification with W-NUT Emerging Entities
.. note::
This dataset can be explored in the Hugging Face model hub (`WNUT-17 <https://huggingface.co/datasets/wnut_17>`_), and can
be alternatively downloaded with the 🤗 NLP library with ``load_dataset("wnut_17")``.
This dataset can be explored in the Hugging Face model hub (`WNUT-17 <https://huggingface.co/datasets/wnut_17>`_),
and can be alternatively downloaded with the 🤗 NLP library with ``load_dataset("wnut_17")``.
Next we will look at token classification. Rather than classifying an entire sequence, this task classifies token by
token. We'll demonstrate how to do this with
`Named Entity Recognition <http://nlpprogress.com/english/named_entity_recognition.html>`_, which involves
identifying tokens which correspond to a predefined set of "entities". Specifically, we'll use the
`W-NUT Emerging and Rare entities <http://noisy-text.github.io/2017/emerging-rare-entities.html>`_ corpus. The data
is given as a collection of pre-tokenized documents where each token is assigned a tag.
token. We'll demonstrate how to do this with `Named Entity Recognition
<http://nlpprogress.com/english/named_entity_recognition.html>`_, which involves identifying tokens which correspond to
a predefined set of "entities". Specifically, we'll use the `W-NUT Emerging and Rare entities
<http://noisy-text.github.io/2017/emerging-rare-entities.html>`_ corpus. The data is given as a collection of
pre-tokenized documents where each token is assigned a tag.
Let's start by downloading the data.
@@ -264,10 +263,10 @@ Let's start by downloading the data.
wget http://noisy-text.github.io/2017/files/wnut17train.conll
In this case, we'll just download the train set, which is a single text file. Each line of the file contains either
(1) a word and tag separated by a tab, or (2) a blank line indicating the end of a document. Let's write a
function to read this in. We'll take in the file path and return ``token_docs`` which is a list of lists of token
strings, and ``token_tags`` which is a list of lists of tag strings.
In this case, we'll just download the train set, which is a single text file. Each line of the file contains either (1)
a word and tag separated by a tab, or (2) a blank line indicating the end of a document. Let's write a function to read
this in. We'll take in the file path and return ``token_docs`` which is a list of lists of token strings, and
``token_tags`` which is a list of lists of tag strings.
.. code-block:: python
@@ -290,11 +289,11 @@ strings, and ``token_tags`` which is a list of lists of tag strings.
tags.append(tag)
token_docs.append(tokens)
tag_docs.append(tags)
return token_docs, tag_docs
texts, tags = read_wnut('wnut17train.conll')
Just to see what this data looks like, let's take a look at a segment of the first document.
.. code-block:: python
@@ -303,8 +302,8 @@ Just to see what this data looks like, let's take a look at a segment of the fir
['for', 'two', 'weeks', '.', 'Empire', 'State', 'Building']
['O', 'O', 'O', 'O', 'B-location', 'I-location', 'I-location']
``location`` is an entity type, ``B-`` indicates the beginning of an entity, and ``I-`` indicates consecutive positions of
the same entity ("Empire State Building" is considered one entity). ``O`` indicates the token does not correspond to
``location`` is an entity type, ``B-`` indicates the beginning of an entity, and ``I-`` indicates consecutive positions
of the same entity ("Empire State Building" is considered one entity). ``O`` indicates the token does not correspond to
any entity.
Now that we've read the data in, let's create a train/validation split:
@@ -314,8 +313,8 @@ Now that we've read the data in, let's create a train/validation split:
from sklearn.model_selection import train_test_split
train_texts, val_texts, train_tags, val_tags = train_test_split(texts, tags, test_size=.2)
Next, let's create encodings for our tokens and tags. For the tags, we can start by just create a simple mapping
which we'll use in a moment:
Next, let's create encodings for our tokens and tags. For the tags, we can start by just create a simple mapping which
we'll use in a moment:
.. code-block:: python
@@ -323,11 +322,11 @@ which we'll use in a moment:
tag2id = {tag: id for id, tag in enumerate(unique_tags)}
id2tag = {id: tag for tag, id in tag2id.items()}
To encode the tokens, we'll use a pre-trained DistilBert tokenizer. We can tell the tokenizer that we're dealing
with ready-split tokens rather than full sentence strings by passing ``is_split_into_words=True``. We'll also pass
``padding=True`` and ``truncation=True`` to pad the sequences to be the same length. Lastly, we can tell the model
to return information about the tokens which are split by the wordpiece tokenization process, which we will need in
a moment.
To encode the tokens, we'll use a pre-trained DistilBert tokenizer. We can tell the tokenizer that we're dealing with
ready-split tokens rather than full sentence strings by passing ``is_split_into_words=True``. We'll also pass
``padding=True`` and ``truncation=True`` to pad the sequences to be the same length. Lastly, we can tell the model to
return information about the tokens which are split by the wordpiece tokenization process, which we will need in a
moment.
.. code-block:: python
@@ -339,26 +338,26 @@ a moment.
Great, so now our tokens are nicely encoded in the format that they need to be in to feed them into our DistilBert
model below.
Now we arrive at a common obstacle with using pre-trained models for token-level classification: many of the tokens
in the W-NUT corpus are not in DistilBert's vocabulary. Bert and many models like it use a method called WordPiece
Tokenization, meaning that single words are split into multiple tokens such that each token is likely to be in
the vocabulary. For example, DistilBert's tokenizer would split the Twitter handle ``@huggingface`` into the tokens
``['@', 'hugging', '##face']``. This is a problem for us because we have exactly one tag per token. If the tokenizer
splits a token into multiple sub-tokens, then we will end up with a mismatch between our tokens and our labels.
Now we arrive at a common obstacle with using pre-trained models for token-level classification: many of the tokens in
the W-NUT corpus are not in DistilBert's vocabulary. Bert and many models like it use a method called WordPiece
Tokenization, meaning that single words are split into multiple tokens such that each token is likely to be in the
vocabulary. For example, DistilBert's tokenizer would split the Twitter handle ``@huggingface`` into the tokens ``['@',
'hugging', '##face']``. This is a problem for us because we have exactly one tag per token. If the tokenizer splits a
token into multiple sub-tokens, then we will end up with a mismatch between our tokens and our labels.
One way to handle this is to only train on the tag labels for the first subtoken of a split token. We can do this in
🤗 Transformers by setting the labels we wish to ignore to ``-100``. In the example above, if the label for
One way to handle this is to only train on the tag labels for the first subtoken of a split token. We can do this in 🤗
Transformers by setting the labels we wish to ignore to ``-100``. In the example above, if the label for
``@HuggingFace`` is ``3`` (indexing ``B-corporation``), we would set the labels of ``['@', 'hugging', '##face']`` to
``[3, -100, -100]``.
Let's write a function to do this. This is where we will use the ``offset_mapping`` from the tokenizer as mentioned
above. For each sub-token returned by the tokenizer, the offset mapping gives us a tuple indicating the sub-token's
start position and end position relative to the original token it was split from. That means that if the first
position in the tuple is anything other than ``0``, we will set its corresponding label to ``-100``. While we're at
it, we can also set labels to ``-100`` if the second position of the offset mapping is ``0``, since this means it must
be a special token like ``[PAD]`` or ``[CLS]``.
start position and end position relative to the original token it was split from. That means that if the first position
in the tuple is anything other than ``0``, we will set its corresponding label to ``-100``. While we're at it, we can
also set labels to ``-100`` if the second position of the offset mapping is ``0``, since this means it must be a
special token like ``[PAD]`` or ``[CLS]``.
.. note::
.. note::
Due to a recently fixed bug, -1 must be used instead of -100 when using TensorFlow in 🤗 Transformers <= 3.02.
@@ -379,7 +378,7 @@ be a special token like ``[PAD]`` or ``[CLS]``.
encoded_labels.append(doc_enc_labels.tolist())
return encoded_labels
train_labels = encode_tags(train_tags, train_encodings)
val_labels = encode_tags(val_tags, val_encodings)
@@ -447,8 +446,9 @@ Question Answering with SQuAD 2.0
.. note::
This dataset can be explored in the Hugging Face model hub (`SQuAD V2 <https://huggingface.co/datasets/squad_v2>`_), and can
be alternatively downloaded with the 🤗 NLP library with ``load_dataset("squad_v2")``.
This dataset can be explored in the Hugging Face model hub (`SQuAD V2
<https://huggingface.co/datasets/squad_v2>`_), and can be alternatively downloaded with the 🤗 NLP library with
``load_dataset("squad_v2")``.
Question answering comes in many forms. In this example, we'll look at the particular type of extractive QA that
involves answering a question about a passage by highlighting the segment of the passage that answers the question.
@@ -464,8 +464,8 @@ We will start by downloading the data:
wget https://rajpurkar.github.io/SQuAD-explorer/dataset/dev-v2.0.json -O squad/dev-v2.0.json
Each split is in a structured json file with a number of questions and answers for each passage (or context). We'll
take this apart into parallel lists of contexts, questions, and answers (note that the contexts here are repeated
since there are multiple questions per context):
take this apart into parallel lists of contexts, questions, and answers (note that the contexts here are repeated since
there are multiple questions per context):
.. code-block:: python
@@ -491,17 +491,17 @@ since there are multiple questions per context):
answers.append(answer)
return contexts, questions, answers
train_contexts, train_questions, train_answers = read_squad('squad/train-v2.0.json')
val_contexts, val_questions, val_answers = read_squad('squad/dev-v2.0.json')
The contexts and questions are just strings. The answers are dicts containing the subsequence of the passage with
the correct answer as well as an integer indicating the character at which the answer begins. In order to train a
model on this data we need (1) the tokenized context/question pairs, and (2) integers indicating at which *token*
positions the answer begins and ends.
The contexts and questions are just strings. The answers are dicts containing the subsequence of the passage with the
correct answer as well as an integer indicating the character at which the answer begins. In order to train a model on
this data we need (1) the tokenized context/question pairs, and (2) integers indicating at which *token* positions the
answer begins and ends.
First, let's get the *character* position at which the answer ends in the passage (we are given the starting
position). Sometimes SQuAD answers are off by one or two characters, so we will also adjust for that.
First, let's get the *character* position at which the answer ends in the passage (we are given the starting position).
Sometimes SQuAD answers are off by one or two characters, so we will also adjust for that.
.. code-block:: python
@@ -510,7 +510,7 @@ position). Sometimes SQuAD answers are off by one or two characters, so we will
gold_text = answer['text']
start_idx = answer['answer_start']
end_idx = start_idx + len(gold_text)
# sometimes squad answers are off by a character or two fix this
if context[start_idx:end_idx] == gold_text:
answer['answer_end'] = end_idx
@@ -524,9 +524,9 @@ position). Sometimes SQuAD answers are off by one or two characters, so we will
add_end_idx(train_answers, train_contexts)
add_end_idx(val_answers, val_contexts)
Now ``train_answers`` and ``val_answers`` include the character end positions and the corrected start positions.
Next, let's tokenize our context/question pairs. 🤗 Tokenizers can accept parallel lists of sequences and encode
them together as sequence pairs.
Now ``train_answers`` and ``val_answers`` include the character end positions and the corrected start positions. Next,
let's tokenize our context/question pairs. 🤗 Tokenizers can accept parallel lists of sequences and encode them together
as sequence pairs.
.. code-block:: python
@@ -536,8 +536,8 @@ them together as sequence pairs.
train_encodings = tokenizer(train_contexts, train_questions, truncation=True, padding=True)
val_encodings = tokenizer(val_contexts, val_questions, truncation=True, padding=True)
Next we need to convert our character start/end positions to token start/end positions. When using 🤗 Fast
Tokenizers, we can use the built in :func:`~transformers.BatchEncoding.char_to_token` method.
Next we need to convert our character start/end positions to token start/end positions. When using 🤗 Fast Tokenizers,
we can use the built in :func:`~transformers.BatchEncoding.char_to_token` method.
.. code-block:: python
@@ -557,9 +557,9 @@ Tokenizers, we can use the built in :func:`~transformers.BatchEncoding.char_to_t
add_token_positions(train_encodings, train_answers)
add_token_positions(val_encodings, val_answers)
Our data is ready. Let's just put it in a PyTorch/TensorFlow dataset so that we can easily use it for
training. In PyTorch, we define a custom ``Dataset`` class. In TensorFlow, we pass a tuple of
``(inputs_dict, labels_dict)`` to the ``from_tensor_slices`` method.
Our data is ready. Let's just put it in a PyTorch/TensorFlow dataset so that we can easily use it for training. In
PyTorch, we define a custom ``Dataset`` class. In TensorFlow, we pass a tuple of ``(inputs_dict, labels_dict)`` to the
``from_tensor_slices`` method.
.. code-block:: python
@@ -575,7 +575,7 @@ training. In PyTorch, we define a custom ``Dataset`` class. In TensorFlow, we pa
def __len__(self):
return len(self.encodings.input_ids)
train_dataset = SquadDataset(train_encodings)
val_dataset = SquadDataset(val_encodings)
## TENSORFLOW CODE
@@ -668,12 +668,11 @@ Additional Resources
Using the 🤗 NLP Datasets & Metrics library
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
This tutorial demonstrates how to read in datasets from various raw text formats and prepare them for training with
🤗 Transformers so that you can do the same thing with your own custom datasets. However, we recommend users use the
`🤗 NLP library <https://github.com/huggingface/nlp>`_ for working with the 150+ datasets included in the
`hub <https://huggingface.co/datasets>`_, including the three datasets used in this tutorial. As a very brief overview,
we will show how to use the NLP library to download and prepare the IMDb dataset from the first example,
:ref:`seq_imdb`.
This tutorial demonstrates how to read in datasets from various raw text formats and prepare them for training with 🤗
Transformers so that you can do the same thing with your own custom datasets. However, we recommend users use the `🤗
NLP library <https://github.com/huggingface/nlp>`_ for working with the 150+ datasets included in the `hub
<https://huggingface.co/datasets>`_, including the three datasets used in this tutorial. As a very brief overview, we
will show how to use the NLP library to download and prepare the IMDb dataset from the first example, :ref:`seq_imdb`.
Start by downloading the dataset:
@@ -689,8 +688,8 @@ Each dataset has multiple columns corresponding to different features. Let's see
>>> print(train.column_names)
['label', 'text']
Great. Now let's tokenize the text. We can do this using the ``map`` method. We'll also rename the ``label`` column
to ``labels`` to match the model's input arguments.
Great. Now let's tokenize the text. We can do this using the ``map`` method. We'll also rename the ``label`` column to
``labels`` to match the model's input arguments.
.. code-block:: python
@@ -711,5 +710,5 @@ dataset elements.
>>> {key: val.shape for key, val in train[0].items()})
{'labels': TensorShape([]), 'input_ids': TensorShape([512]), 'attention_mask': TensorShape([512])}
We now have a fully-prepared dataset. Check out `the 🤗 NLP docs <https://huggingface.co/nlp/processing.html>`_ for
a more thorough introduction.
We now have a fully-prepared dataset. Check out `the 🤗 NLP docs <https://huggingface.co/nlp/processing.html>`_ for a
more thorough introduction.

View File

@@ -57,8 +57,8 @@ The tokenizer takes care of splitting the sequence into tokens available in the
>>> tokenized_sequence = tokenizer.tokenize(sequence)
The tokens are either words or subwords. Here for instance, "VRAM" wasn't in the model vocabulary, so it's been split
in "V", "RA" and "M". To indicate those tokens are not separate words but parts of the same word, a double-hash prefix is
added for "RA" and "M":
in "V", "RA" and "M". To indicate those tokens are not separate words but parts of the same word, a double-hash prefix
is added for "RA" and "M":
.. code-block::
@@ -66,8 +66,8 @@ added for "RA" and "M":
['A', 'Titan', 'R', '##T', '##X', 'has', '24', '##GB', 'of', 'V', '##RA', '##M']
These tokens can then be converted into IDs which are understandable by the model. This can be done by directly feeding
the sentence to the tokenizer, which leverages the Rust implementation of
`huggingface/tokenizers <https://github.com/huggingface/tokenizers>`__ for peak performance.
the sentence to the tokenizer, which leverages the Rust implementation of `huggingface/tokenizers
<https://github.com/huggingface/tokenizers>`__ for peak performance.
.. code-block::
@@ -105,8 +105,8 @@ because this is the way a :class:`~transformers.BertModel` is going to expect it
Attention mask
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
The attention mask is an optional argument used when batching sequences together. This argument indicates to the
model which tokens should be attended to, and which should not.
The attention mask is an optional argument used when batching sequences together. This argument indicates to the model
which tokens should be attended to, and which should not.
For example, consider these two sequences:
@@ -145,10 +145,10 @@ We can see that 0s have been added on the right of the first sentence to make it
>>> padded_sequences["input_ids"]
[[101, 1188, 1110, 170, 1603, 4954, 119, 102, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], [101, 1188, 1110, 170, 1897, 1263, 4954, 119, 1135, 1110, 1120, 1655, 2039, 1190, 1103, 4954, 138, 119, 102]]
This can then be converted into a tensor in PyTorch or TensorFlow. The attention mask is a binary tensor indicating
the position of the padded indices so that the model does not attend to them. For the
:class:`~transformers.BertTokenizer`, :obj:`1` indicates a value that should be attended to, while :obj:`0` indicates
a padded value. This attention mask is in the dictionary returned by the tokenizer under the key "attention_mask":
This can then be converted into a tensor in PyTorch or TensorFlow. The attention mask is a binary tensor indicating the
position of the padded indices so that the model does not attend to them. For the :class:`~transformers.BertTokenizer`,
:obj:`1` indicates a value that should be attended to, while :obj:`0` indicates a padded value. This attention mask is
in the dictionary returned by the tokenizer under the key "attention_mask":
.. code-block::
@@ -161,15 +161,16 @@ Token Type IDs
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Some models' purpose is to do sequence classification or question answering. These require two different sequences to
be joined in a single "input_ids" entry, which usually is performed with the help of special tokens, such as the classifier (``[CLS]``) and separator (``[SEP]``)
tokens. For example, the BERT model builds its two sequence input as such:
be joined in a single "input_ids" entry, which usually is performed with the help of special tokens, such as the
classifier (``[CLS]``) and separator (``[SEP]``) tokens. For example, the BERT model builds its two sequence input as
such:
.. code-block::
>>> # [CLS] SEQUENCE_A [SEP] SEQUENCE_B [SEP]
We can use our tokenizer to automatically generate such a sentence by passing the two sequences to ``tokenizer`` as two arguments (and
not a list, like before) like this:
We can use our tokenizer to automatically generate such a sentence by passing the two sequences to ``tokenizer`` as two
arguments (and not a list, like before) like this:
.. code-block::
@@ -189,8 +190,8 @@ which will return:
[CLS] HuggingFace is based in NYC [SEP] Where is HuggingFace based? [SEP]
This is enough for some models to understand where one sequence ends and where another begins. However, other models,
such as BERT, also deploy token type IDs (also called segment IDs). They are represented as a binary
mask identifying the two types of sequence in the model.
such as BERT, also deploy token type IDs (also called segment IDs). They are represented as a binary mask identifying
the two types of sequence in the model.
The tokenizer returns this mask as the "token_type_ids" entry:
@@ -209,14 +210,15 @@ Some models, like :class:`~transformers.XLNetModel` use an additional token repr
Position IDs
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Contrary to RNNs that have the position of each token embedded within them,
transformers are unaware of the position of each token. Therefore, the position IDs (``position_ids``) are used by the model to identify each token's position in the list of tokens.
Contrary to RNNs that have the position of each token embedded within them, transformers are unaware of the position of
each token. Therefore, the position IDs (``position_ids``) are used by the model to identify each token's position in
the list of tokens.
They are an optional parameter. If no ``position_ids`` is passed to the model, the IDs are automatically created as absolute
positional embeddings.
They are an optional parameter. If no ``position_ids`` is passed to the model, the IDs are automatically created as
absolute positional embeddings.
Absolute positional embeddings are selected in the range ``[0, config.max_position_embeddings - 1]``. Some models
use other types of positional embeddings, such as sinusoidal position embeddings or relative position embeddings.
Absolute positional embeddings are selected in the range ``[0, config.max_position_embeddings - 1]``. Some models use
other types of positional embeddings, such as sinusoidal position embeddings or relative position embeddings.
.. _labels:
@@ -224,43 +226,41 @@ Labels
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
The labels are an optional argument which can be passed in order for the model to compute the loss itself. These labels
should be the expected prediction of the model: it will use the standard loss in order to compute the loss between
its predictions and the expected value (the label).
should be the expected prediction of the model: it will use the standard loss in order to compute the loss between its
predictions and the expected value (the label).
These labels are different according to the model head, for example:
- For sequence classification models (e.g., :class:`~transformers.BertForSequenceClassification`), the model expects
a tensor of dimension :obj:`(batch_size)` with each value of the batch corresponding to the expected label of the
- For sequence classification models (e.g., :class:`~transformers.BertForSequenceClassification`), the model expects a
tensor of dimension :obj:`(batch_size)` with each value of the batch corresponding to the expected label of the
entire sequence.
- For token classification models (e.g., :class:`~transformers.BertForTokenClassification`), the model expects
a tensor of dimension :obj:`(batch_size, seq_length)` with each value corresponding to the expected label of each
individual token.
- For masked language modeling (e.g., :class:`~transformers.BertForMaskedLM`), the model expects
a tensor of dimension :obj:`(batch_size, seq_length)` with each value corresponding to the expected label of each
individual token: the labels being the token ID for the masked token, and values to be ignored for the rest (usually
-100).
- For token classification models (e.g., :class:`~transformers.BertForTokenClassification`), the model expects a tensor
of dimension :obj:`(batch_size, seq_length)` with each value corresponding to the expected label of each individual
token.
- For masked language modeling (e.g., :class:`~transformers.BertForMaskedLM`), the model expects a tensor of dimension
:obj:`(batch_size, seq_length)` with each value corresponding to the expected label of each individual token: the
labels being the token ID for the masked token, and values to be ignored for the rest (usually -100).
- For sequence to sequence tasks,(e.g., :class:`~transformers.BartForConditionalGeneration`,
:class:`~transformers.MBartForConditionalGeneration`), the model expects a tensor of dimension
:obj:`(batch_size, tgt_seq_length)` with each value corresponding to the target sequences associated with each
input sequence. During training, both `BART` and `T5` will make the appropriate `decoder_input_ids` and decoder
attention masks internally. They usually do not need to be supplied. This does not apply to models leveraging the
Encoder-Decoder framework.
See the documentation of each model for more information on each specific model's labels.
:class:`~transformers.MBartForConditionalGeneration`), the model expects a tensor of dimension :obj:`(batch_size,
tgt_seq_length)` with each value corresponding to the target sequences associated with each input sequence. During
training, both `BART` and `T5` will make the appropriate `decoder_input_ids` and decoder attention masks internally.
They usually do not need to be supplied. This does not apply to models leveraging the Encoder-Decoder framework. See
the documentation of each model for more information on each specific model's labels.
The base models (e.g., :class:`~transformers.BertModel`) do not accept labels, as these are the base transformer models,
simply outputting features.
The base models (e.g., :class:`~transformers.BertModel`) do not accept labels, as these are the base transformer
models, simply outputting features.
.. _decoder-input-ids:
Decoder input IDs
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
This input is specific to encoder-decoder models, and contains the input IDs that will be fed to the decoder.
These inputs should be used for sequence to sequence tasks, such as translation or summarization, and are usually
built in a way specific to each model.
This input is specific to encoder-decoder models, and contains the input IDs that will be fed to the decoder. These
inputs should be used for sequence to sequence tasks, such as translation or summarization, and are usually built in a
way specific to each model.
Most encoder-decoder models (BART, T5) create their :obj:`decoder_input_ids` on their own from the :obj:`labels`.
In such models, passing the :obj:`labels` is the preferred way to handle training.
Most encoder-decoder models (BART, T5) create their :obj:`decoder_input_ids` on their own from the :obj:`labels`. In
such models, passing the :obj:`labels` is the preferred way to handle training.
Please check each model's docs to see how they handle these input IDs for sequence to sequence training.
@@ -270,18 +270,18 @@ Feed Forward Chunking
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
In each residual attention block in transformers the self-attention layer is usually followed by 2 feed forward layers.
The intermediate embedding size of the feed forward layers is often bigger than the hidden size of the model (e.g.,
for ``bert-base-uncased``).
The intermediate embedding size of the feed forward layers is often bigger than the hidden size of the model (e.g., for
``bert-base-uncased``).
For an input of size ``[batch_size, sequence_length]``, the memory required to store the intermediate feed forward
embeddings ``[batch_size, sequence_length, config.intermediate_size]`` can account for a large fraction of the memory
use. The authors of `Reformer: The Efficient Transformer <https://arxiv.org/abs/2001.04451>`_ noticed that since the
computation is independent of the ``sequence_length`` dimension, it is mathematically equivalent to compute the output
embeddings of both feed forward layers ``[batch_size, config.hidden_size]_0, ..., [batch_size, config.hidden_size]_n``
individually and concat them afterward to ``[batch_size, sequence_length, config.hidden_size]`` with
``n = sequence_length``, which trades increased computation time against reduced memory use, but yields a
mathematically **equivalent** result.
individually and concat them afterward to ``[batch_size, sequence_length, config.hidden_size]`` with ``n =
sequence_length``, which trades increased computation time against reduced memory use, but yields a mathematically
**equivalent** result.
For models employing the function :func:`~.transformers.apply_chunking_to_forward`, the ``chunk_size`` defines the
number of output embeddings that are computed in parallel and thus defines the trade-off between memory and time
complexity. If ``chunk_size`` is set to 0, no feed forward chunking is done.
complexity. If ``chunk_size`` is set to 0, no feed forward chunking is done.

View File

@@ -47,6 +47,7 @@ The documentation is organized in five parts:
- **RESEARCH** focuses on tutorials that have less to do with how to use the library but more about general resarch in
transformers model
- The three last section contain the documentation of each public class and function, grouped in:
- **MAIN CLASSES** for the main classes exposing the important APIs of the library.
- **MODELS** for the classes and functions related to each model implemented in the library.
- **INTERNAL HELPERS** for the classes and functions we use internally.
@@ -122,7 +123,7 @@ conversion utilities for the following models:
20. :doc:`MarianMT <model_doc/marian>` Machine translation models trained using `OPUS <http://opus.nlpl.eu/>`__ data by
Jörg Tiedemann. The `Marian Framework <https://marian-nmt.github.io/>`__ is being developed by the Microsoft
Translator Team.
21. :doc:`MBart <model_doc/mbart>` (from Facebook) released with the paper `Multilingual Denoising Pre-training for
21. :doc:`MBart <model_doc/mbart>` (from Facebook) released with the paper `Multilingual Denoising Pre-training for
Neural Machine Translation <https://arxiv.org/abs/2001.08210>`__ by Yinhan Liu, Jiatao Gu, Naman Goyal, Xian Li,
Sergey Edunov, Marjan Ghazvininejad, Mike Lewis, Luke Zettlemoyer.
22. :doc:`Pegasus <model_doc/pegasus>` (from Google) released with the paper `PEGASUS: Pre-training with Extracted
@@ -271,3 +272,4 @@ conversion utilities for the following models:
internal/pipelines_utils
internal/tokenization_utils
internal/trainer_utils
internal/generation_utils

View File

@@ -80,9 +80,9 @@ cache home followed by ``/transformers/`` (even if you don't have PyTorch instal
So if you don't have any specific environment variable set, the cache directory will be at
``~/.cache/torch/transformers/``.
**Note:** If you have set a shell enviromnent variable for one of the predecessors of this library
**Note:** If you have set a shell environment variable for one of the predecessors of this library
(``PYTORCH_TRANSFORMERS_CACHE`` or ``PYTORCH_PRETRAINED_BERT_CACHE``), those will be used if there is no shell
enviromnent variable for ``TRANSFORMERS_CACHE``.
environment variable for ``TRANSFORMERS_CACHE``.
### Note on model downloads (Continuous Integration or large-scale deployments)

View File

@@ -0,0 +1,50 @@
Utilities for Generation
-----------------------------------------------------------------------------------------------------------------------
This page lists all the utility functions used by :meth:`~transformers.PretrainedModel.generate`,
:meth:`~transformers.PretrainedModel.greedy_search`, :meth:`~transformers.PretrainedModel.sample`,
:meth:`~transformers.PretrainedModel.beam_search`, and :meth:`~transformers.PretrainedModel.beam_sample`.
Most of those are only useful if you are studying the code of the generate methods in the library.
LogitsProcessor
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
A :class:`~transformers.LogitsProcessor` can be used to modify the prediction scores of a language model head for
generation.
.. autoclass:: transformers.LogitsProcessor
:members: __call__
.. autoclass:: transformers.LogitsProcessorList
:members: __call__
.. autoclass:: transformers.MinLengthLogitsProcessor
:members: __call__
.. autoclass:: transformers.TemperatureLogitsWarper
:members: __call__
.. autoclass:: transformers.RepetitionPenaltyLogitsProcessor
:members: __call__
.. autoclass:: transformers.TopPLogitsWarper
:members: __call__
.. autoclass:: transformers.TopKLogitsWarper
:members: __call__
.. autoclass:: transformers.NoRepeatNGramLogitsProcessor
:members: __call__
.. autoclass:: transformers.NoBadWordsLogitsProcessor
:members: __call__
BeamSearch
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
.. autoclass:: transformers.BeamScorer
:members: process, finalize
.. autoclass:: transformers.BeamSearchScorer
:members: process, finalize

View File

@@ -85,4 +85,4 @@ TensorFlow Helper Functions
.. autofunction:: transformers.modeling_tf_utils.keras_serializable
.. autofunction:: transformers.modeling_tf_utils.shape_list
.. autofunction:: transformers.modeling_tf_utils.shape_list

View File

@@ -25,6 +25,7 @@ SpecialTokensMixin
Enums and namedtuples
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
.. autoclass:: transformers.tokenization_utils_base.ExplicitEnum
.. autoclass:: transformers.tokenization_utils_base.PaddingStrategy

View File

@@ -24,4 +24,4 @@ Distributed Evaluation
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
.. autoclass:: transformers.trainer_pt_utils.DistributedTensorGatherer
:members:
:members:

View File

@@ -13,13 +13,16 @@ subclass :class:`~transformers.Trainer` and override the methods you need (see :
By default a :class:`~transformers.Trainer` will use the following callbacks:
- :class:`~transformers.DefaultFlowCallback` which handles the default behavior for logging, saving and evaluation.
- :class:`~transformers.PrinterCallback` or :class:`~transformers.ProrgressCallback` to display progress and print the
- :class:`~transformers.PrinterCallback` or :class:`~transformers.ProgressCallback` to display progress and print the
logs (the first one is used if you deactivate tqdm through the :class:`~transformers.TrainingArguments`, otherwise
it's the second one).
- :class:`~transformers.integrations.TensorBoardCallback` if tensorboard is accessible (either through PyTorch >= 1.4
or tensorboardX).
- :class:`~transformers.integrations.WandbCallback` if `wandb <https://www.wandb.com/>`__ is installed.
- :class:`~transformers.integrations.CometCallback` if `comet_ml <https://www.comet.ml/site/>`__ is installed.
- :class:`~transformers.integrations.MLflowCallback` if `mlflow <https://www.mlflow.org/>`__ is installed.
- :class:`~transformers.integrations.AzureMLCallback` if `azureml-sdk <https://pypi.org/project/azureml-sdk/>`__ is
installed.
The main class that implements callbacks is :class:`~transformers.TrainerCallback`. It gets the
:class:`~transformers.TrainingArguments` used to instantiate the :class:`~transformers.Trainer`, can access that
@@ -46,6 +49,10 @@ Here is the list of the available :class:`~transformers.TrainerCallback` in the
.. autoclass:: transformers.integrations.WandbCallback
:members: setup
.. autoclass:: transformers.integrations.MLflowCallback
:members: setup
.. autoclass:: transformers.integrations.AzureMLCallback
TrainerCallback
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

View File

@@ -17,7 +17,7 @@ You can also use the environment variable ``TRANSFORMERS_VERBOSITY`` to override
to one of the following: ``debug``, ``info``, ``warning``, ``error``, ``critical``. For example:
.. code-block:: bash
TRANSFORMERS_VERBOSITY=error ./myprogram.py
All the methods of this logging module are documented below, the main ones are
@@ -55,4 +55,4 @@ Other functions
.. autofunction:: transformers.logging.enable_explicit_format
.. autofunction:: transformers.logging.reset_format
.. autofunction:: transformers.logging.reset_format

View File

@@ -45,11 +45,11 @@ TFModelUtilsMixin
:members:
Generative models
Generation
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
.. autoclass:: transformers.generation_utils.GenerationMixin
:members:
.. autoclass:: transformers.generation_tf_utils.TFGenerationMixin
:members:
:members:

View File

@@ -65,12 +65,34 @@ BaseModelOutputWithPooling
:members:
BaseModelOutputWithCrossAttentions
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
.. autoclass:: transformers.modeling_outputs.BaseModelOutputWithCrossAttentions
:members:
BaseModelOutputWithPoolingAndCrossAttentions
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
.. autoclass:: transformers.modeling_outputs.BaseModelOutputWithPoolingAndCrossAttentions
:members:
BaseModelOutputWithPast
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
.. autoclass:: transformers.modeling_outputs.BaseModelOutputWithPast
:members:
BaseModelOutputWithPastAndCrossAttentions
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
.. autoclass:: transformers.modeling_outputs.BaseModelOutputWithPastAndCrossAttentions
:members:
Seq2SeqModelOutput
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
@@ -85,6 +107,20 @@ CausalLMOutput
:members:
CausalLMOutputWithCrossAttentions
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
.. autoclass:: transformers.modeling_outputs.CausalLMOutputWithCrossAttentions
:members:
CausalLMOutputWithPastAndCrossAttentions
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
.. autoclass:: transformers.modeling_outputs.CausalLMOutputWithPastAndCrossAttentions
:members:
CausalLMOutputWithPast
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

View File

@@ -1,8 +1,8 @@
Pipelines
-----------------------------------------------------------------------------------------------------------------------
The pipelines are a great and easy way to use models for inference. These pipelines are objects that abstract most
of the complex code from the library, offering a simple API dedicated to several tasks, including Named Entity
The pipelines are a great and easy way to use models for inference. These pipelines are objects that abstract most of
the complex code from the library, offering a simple API dedicated to several tasks, including Named Entity
Recognition, Masked Language Modeling, Sentiment Analysis, Feature Extraction and Question Answering. See the
:doc:`task summary <../task_summary>` for examples of use.
@@ -26,8 +26,8 @@ There are two categories of pipeline abstractions to be aware about:
The pipeline abstraction
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
The `pipeline` abstraction is a wrapper around all the other available pipelines. It is instantiated as any
other pipeline but requires an additional argument which is the `task`.
The `pipeline` abstraction is a wrapper around all the other available pipelines. It is instantiated as any other
pipeline but requires an additional argument which is the `task`.
.. autofunction:: transformers.pipeline

View File

@@ -8,8 +8,8 @@ Processors
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
All processors follow the same architecture which is that of the
:class:`~transformers.data.processors.utils.DataProcessor`. The processor returns a list
of :class:`~transformers.data.processors.utils.InputExample`. These
:class:`~transformers.data.processors.utils.DataProcessor`. The processor returns a list of
:class:`~transformers.data.processors.utils.InputExample`. These
:class:`~transformers.data.processors.utils.InputExample` can be converted to
:class:`~transformers.data.processors.utils.InputFeatures` in order to be fed to the model.
@@ -28,14 +28,16 @@ of :class:`~transformers.data.processors.utils.InputExample`. These
GLUE
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
`General Language Understanding Evaluation (GLUE) <https://gluebenchmark.com/>`__ is a benchmark that evaluates
the performance of models across a diverse set of existing NLU tasks. It was released together with the paper
`GLUE: A multi-task benchmark and analysis platform for natural language understanding <https://openreview.net/pdf?id=rJ4km2R5t7>`__
`General Language Understanding Evaluation (GLUE) <https://gluebenchmark.com/>`__ is a benchmark that evaluates the
performance of models across a diverse set of existing NLU tasks. It was released together with the paper `GLUE: A
multi-task benchmark and analysis platform for natural language understanding
<https://openreview.net/pdf?id=rJ4km2R5t7>`__
This library hosts a total of 10 processors for the following tasks: MRPC, MNLI, MNLI (mismatched),
CoLA, SST2, STSB, QQP, QNLI, RTE and WNLI.
This library hosts a total of 10 processors for the following tasks: MRPC, MNLI, MNLI (mismatched), CoLA, SST2, STSB,
QQP, QNLI, RTE and WNLI.
Those processors are:
- :class:`~transformers.data.processors.utils.MrpcProcessor`
- :class:`~transformers.data.processors.utils.MnliProcessor`
- :class:`~transformers.data.processors.utils.MnliMismatchedProcessor`
@@ -46,7 +48,7 @@ Those processors are:
- :class:`~transformers.data.processors.utils.RteProcessor`
- :class:`~transformers.data.processors.utils.WnliProcessor`
Additionally, the following method can be used to load values from a data file and convert them to a list of
Additionally, the following method can be used to load values from a data file and convert them to a list of
:class:`~transformers.data.processors.utils.InputExample`.
.. automethod:: transformers.data.processors.glue.glue_convert_examples_to_features
@@ -54,36 +56,39 @@ Additionally, the following method can be used to load values from a data file
Example usage
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
An example using these processors is given in the `run_glue.py <https://github.com/huggingface/pytorch-transformers/blob/master/examples/text-classification/run_glue.py>`__ script.
An example using these processors is given in the `run_glue.py
<https://github.com/huggingface/pytorch-transformers/blob/master/examples/text-classification/run_glue.py>`__ script.
XNLI
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
`The Cross-Lingual NLI Corpus (XNLI) <https://www.nyu.edu/projects/bowman/xnli/>`__ is a benchmark that evaluates
the quality of cross-lingual text representations.
XNLI is crowd-sourced dataset based on `MultiNLI <http://www.nyu.edu/projects/bowman/multinli/>`: pairs of text are labeled with textual entailment
annotations for 15 different languages (including both high-resource language such as English and low-resource languages such as Swahili).
`The Cross-Lingual NLI Corpus (XNLI) <https://www.nyu.edu/projects/bowman/xnli/>`__ is a benchmark that evaluates the
quality of cross-lingual text representations. XNLI is crowd-sourced dataset based on `MultiNLI
<http://www.nyu.edu/projects/bowman/multinli/>`: pairs of text are labeled with textual entailment annotations for 15
different languages (including both high-resource language such as English and low-resource languages such as Swahili).
It was released together with the paper
`XNLI: Evaluating Cross-lingual Sentence Representations <https://arxiv.org/abs/1809.05053>`__
It was released together with the paper `XNLI: Evaluating Cross-lingual Sentence Representations
<https://arxiv.org/abs/1809.05053>`__
This library hosts the processor to load the XNLI data:
- :class:`~transformers.data.processors.utils.XnliProcessor`
Please note that since the gold labels are available on the test set, evaluation is performed on the test set.
An example using these processors is given in the
`run_xnli.py <https://github.com/huggingface/pytorch-transformers/blob/master/examples/text-classification/run_xnli.py>`__ script.
An example using these processors is given in the `run_xnli.py
<https://github.com/huggingface/pytorch-transformers/blob/master/examples/text-classification/run_xnli.py>`__ script.
SQuAD
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
`The Stanford Question Answering Dataset (SQuAD) <https://rajpurkar.github.io/SQuAD-explorer//>`__ is a benchmark that evaluates
the performance of models on question answering. Two versions are available, v1.1 and v2.0. The first version (v1.1) was released together with the paper
`SQuAD: 100,000+ Questions for Machine Comprehension of Text <https://arxiv.org/abs/1606.05250>`__. The second version (v2.0) was released alongside
the paper `Know What You Don't Know: Unanswerable Questions for SQuAD <https://arxiv.org/abs/1806.03822>`__.
`The Stanford Question Answering Dataset (SQuAD) <https://rajpurkar.github.io/SQuAD-explorer//>`__ is a benchmark that
evaluates the performance of models on question answering. Two versions are available, v1.1 and v2.0. The first version
(v1.1) was released together with the paper `SQuAD: 100,000+ Questions for Machine Comprehension of Text
<https://arxiv.org/abs/1606.05250>`__. The second version (v2.0) was released alongside the paper `Know What You Don't
Know: Unanswerable Questions for SQuAD <https://arxiv.org/abs/1806.03822>`__.
This library hosts a processor for each of the two versions:
@@ -91,6 +96,7 @@ Processors
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
Those processors are:
- :class:`~transformers.data.processors.utils.SquadV1Processor`
- :class:`~transformers.data.processors.utils.SquadV2Processor`
@@ -99,20 +105,21 @@ They both inherit from the abstract class :class:`~transformers.data.processors.
.. autoclass:: transformers.data.processors.squad.SquadProcessor
:members:
Additionally, the following method can be used to convert SQuAD examples into :class:`~transformers.data.processors.utils.SquadFeatures`
that can be used as model inputs.
Additionally, the following method can be used to convert SQuAD examples into
:class:`~transformers.data.processors.utils.SquadFeatures` that can be used as model inputs.
.. automethod:: transformers.data.processors.squad.squad_convert_examples_to_features
These processors as well as the aforementionned method can be used with files containing the data as well as with the `tensorflow_datasets` package.
Examples are given below.
These processors as well as the aforementionned method can be used with files containing the data as well as with the
`tensorflow_datasets` package. Examples are given below.
Example usage
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
Here is an example using the processors as well as the conversion method using data files:
Example::
.. code-block::
# Loading a V2 processor
processor = SquadV2Processor()
@@ -133,7 +140,7 @@ Example::
Using `tensorflow_datasets` is as easy as using a data file:
Example::
.. code-block::
# tensorflow_datasets only handle Squad V1.
tfds_examples = tfds.load("squad")
@@ -149,5 +156,5 @@ Example::
)
Another example using these processors is given in the
`run_squad.py <https://github.com/huggingface/transformers/blob/master/examples/question-answering/run_squad.py>`__ script.
Another example using these processors is given in the `run_squad.py
<https://github.com/huggingface/transformers/blob/master/examples/question-answering/run_squad.py>`__ script.

View File

@@ -29,11 +29,12 @@ methods for using all the tokenizers:
:class:`~transformers.BatchEncoding` holds the output of the tokenizer's encoding methods (``__call__``,
``encode_plus`` and ``batch_encode_plus``) and is derived from a Python dictionary. When the tokenizer is a pure python
tokenizer, this class behaves just like a standard python dictionary and holds the various model inputs computed by these
methods (``input_ids``, ``attention_mask``...). When the tokenizer is a "Fast" tokenizer (i.e., backed by HuggingFace
`tokenizers library <https://github.com/huggingface/tokenizers>`__), this class provides in addition several advanced
alignment methods which can be used to map between the original string (character and words) and the token space (e.g.,
getting the index of the token comprising a given character or the span of characters corresponding to a given token).
tokenizer, this class behaves just like a standard python dictionary and holds the various model inputs computed by
these methods (``input_ids``, ``attention_mask``...). When the tokenizer is a "Fast" tokenizer (i.e., backed by
HuggingFace `tokenizers library <https://github.com/huggingface/tokenizers>`__), this class provides in addition
several advanced alignment methods which can be used to map between the original string (character and words) and the
token space (e.g., getting the index of the token comprising a given character or the span of characters corresponding
to a given token).
PreTrainedTokenizer

View File

@@ -4,7 +4,7 @@ Trainer
The :class:`~transformers.Trainer` and :class:`~transformers.TFTrainer` classes provide an API for feature-complete
training in most standard use cases. It's used in most of the :doc:`example scripts <../examples>`.
Before instantiating your :class:`~transformers.Trainer`/:class:`~transformers.TFTrainer`, create a
Before instantiating your :class:`~transformers.Trainer`/:class:`~transformers.TFTrainer`, create a
:class:`~transformers.TrainingArguments`/:class:`~transformers.TFTrainingArguments` to access all the points of
customization during training.

View File

@@ -20,7 +20,7 @@ Here is a quick summary of what you should take care of when migrating from `pyt
The main breaking change when migrating from `pytorch-pretrained-bert` to 🤗 Transformers is that the models forward method always outputs a `tuple` with various elements depending on the model and the configuration parameters.
The exact content of the tuples for each model are detailled in the models' docstrings and the [documentation](https://huggingface.co/transformers/).
The exact content of the tuples for each model are detailed in the models' docstrings and the [documentation](https://huggingface.co/transformers/).
In pretty much every case, you will be fine by taking the first element of the output as the output you previously used in `pytorch-pretrained-bert`.
@@ -109,7 +109,7 @@ for batch in train_data:
loss.backward()
optimizer.step()
### In 🤗 Transformers, optimizer and schedules are splitted and instantiated like this:
### In 🤗 Transformers, optimizer and schedules are split and instantiated like this:
optimizer = AdamW(model.parameters(), lr=lr, correct_bias=False) # To reproduce BertAdam specific behavior set correct_bias=False
scheduler = get_linear_schedule_with_warmup(optimizer, num_warmup_steps=num_warmup_steps, num_training_steps=num_training_steps) # PyTorch scheduler
### and used like this:

View File

@@ -19,14 +19,14 @@ downstream tasks. However, at some point further model increases become harder d
longer training times, and unexpected model degradation. To address these problems, we present two parameter-reduction
techniques to lower memory consumption and increase the training speed of BERT. Comprehensive empirical evidence shows
that our proposed methods lead to models that scale much better compared to the original BERT. We also use a
self-supervised loss that focuses on modeling inter-sentence coherence, and show it consistently helps downstream
tasks with multi-sentence inputs. As a result, our best model establishes new state-of-the-art results on the GLUE,
RACE, and SQuAD benchmarks while having fewer parameters compared to BERT-large.*
self-supervised loss that focuses on modeling inter-sentence coherence, and show it consistently helps downstream tasks
with multi-sentence inputs. As a result, our best model establishes new state-of-the-art results on the GLUE, RACE, and
SQuAD benchmarks while having fewer parameters compared to BERT-large.*
Tips:
- ALBERT is a model with absolute position embeddings so it's usually advised to pad the inputs on
the right rather than the left.
- ALBERT is a model with absolute position embeddings so it's usually advised to pad the inputs on the right rather
than the left.
- ALBERT uses repeating layers which results in a small memory footprint, however the computational cost remains
similar to a BERT-like architecture with the same number of hidden layers as it has to iterate through the same
number of (repeating) layers.

View File

@@ -1,10 +1,9 @@
AutoClasses
Auto Classes
-----------------------------------------------------------------------------------------------------------------------
In many cases, the architecture you want to use can be guessed from the name or the path of the pretrained model you
are supplying to the :obj:`from_pretrained()` method.
AutoClasses are here to do this job for you so that you automatically retrieve the relevant model given the name/path
to the pretrained weights/config/vocabulary.
are supplying to the :obj:`from_pretrained()` method. AutoClasses are here to do this job for you so that you
automatically retrieve the relevant model given the name/path to the pretrained weights/config/vocabulary.
Instantiating one of :class:`~transformers.AutoConfig`, :class:`~transformers.AutoModel`, and
:class:`~transformers.AutoTokenizer` will directly create a class of the relevant architecture. For instance
@@ -47,10 +46,24 @@ AutoModelForPreTraining
:members:
AutoModelWithLMHead
AutoModelForCausalLM
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
.. autoclass:: transformers.AutoModelWithLMHead
.. autoclass:: transformers.AutoModelForCausalLM
:members:
AutoModelForMaskedLM
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
.. autoclass:: transformers.AutoModelForMaskedLM
:members:
AutoModelForSeq2SeqLM
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
.. autoclass:: transformers.AutoModelForSeq2SeqLM
:members:
@@ -68,6 +81,13 @@ AutoModelForMultipleChoice
:members:
AutoModelForNextSentencePrediction
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
.. autoclass:: transformers.AutoModelForNextSentencePrediction
:members:
AutoModelForTokenClassification
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
@@ -96,10 +116,24 @@ TFAutoModelForPreTraining
:members:
TFAutoModelWithLMHead
TFAutoModelForCausalLM
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
.. autoclass:: transformers.TFAutoModelWithLMHead
.. autoclass:: transformers.TFAutoModelForCausalLM
:members:
TFAutoModelForMaskedLM
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
.. autoclass:: transformers.TFAutoModelForMaskedLM
:members:
TFAutoModelForSeq2SeqLM
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
.. autoclass:: transformers.TFAutoModelForSeq2SeqLM
:members:

View File

@@ -3,7 +3,7 @@ BART
**DISCLAIMER:** If you see something strange, file a `Github Issue
<https://github.com/huggingface/transformers/issues/new?assignees=&labels=&template=bug-report.md&title>`__ and assign
@sshleifer
@patrickvonplaten
Overview
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
@@ -26,13 +26,23 @@ According to the abstract,
The Authors' code can be found `here <https://github.com/pytorch/fairseq/tree/master/examples/bart>`__.
Examples
_______________________________________________________________________________________________________________________
- Examples and scripts for fine-tuning BART and other models for sequence to sequence tasks can be found in
`examples/seq2seq/ <https://github.com/huggingface/transformers/blob/master/examples/seq2seq/README.md>`__.
- An example of how to train :class:`~transformers.BartForConditionalGeneration` with a Hugging Face :obj:`datasets`
object can be found in this `forum discussion
<https://discuss.huggingface.co/t/train-bart-for-conditional-generation-e-g-summarization/1904>`__.
Implementation Notes
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
- Bart doesn't use :obj:`token_type_ids` for sequence classification. Use :class:`~transformers.BartTokenizer`
or :meth:`~transformers.BartTokenizer.encode` to get the proper splitting.
- Bart doesn't use :obj:`token_type_ids` for sequence classification. Use :class:`~transformers.BartTokenizer` or
:meth:`~transformers.BartTokenizer.encode` to get the proper splitting.
- The forward pass of :class:`~transformers.BartModel` will create decoder inputs (using the helper function
:func:`transformers.modeling_bart._prepare_bart_decoder_inputs`) if they are not passed. This is different than some
:func:`transformers.modeling_bart._prepare_bart_decoder_inputs`) if they are not passed. This is different than some
other modeling APIs.
- Model predictions are intended to be identical to the original implementation. This only works, however, if the
string you pass to :func:`fairseq.encode` starts with a space.
@@ -86,3 +96,18 @@ BartForQuestionAnswering
.. autoclass:: transformers.BartForQuestionAnswering
:members: forward
TFBartModel
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
.. autoclass:: transformers.TFBartModel
:members: call
TFBartForConditionalGeneration
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
.. autoclass:: transformers.TFBartForConditionalGeneration
:members: call

View File

@@ -25,8 +25,8 @@ improvement) and SQuAD v2.0 Test F1 to 83.1 (5.1 point absolute improvement).*
Tips:
- BERT is a model with absolute position embeddings so it's usually advised to pad the inputs on
the right rather than the left.
- BERT is a model with absolute position embeddings so it's usually advised to pad the inputs on the right rather than
the left.
- BERT was trained with the masked language modeling (MLM) and next sentence prediction (NSP) objectives. It is
efficient at predicting masked tokens and at NLU in general, but is not optimal for text generation.

View File

@@ -24,15 +24,15 @@ Usage:
- The model can be used in combination with the :class:`~transformers.EncoderDecoderModel` to leverage two pretrained
BERT checkpoints for subsequent fine-tuning.
:: code-block
.. code-block::
# leverage checkpoints for Bert2Bert model...
# use BERT's cls token as BOS token and sep token as EOS token
encoder = BertGenerationEncoder.from_pretrained("bert-large-uncased", bos_token_id=101, eos_token_id=102)
# add cross attention layers and use BERT's cls token as BOS token and sep token as EOS token
decoder = BertGenerationDecoder.from_pretrained("bert-large-uncased", add_cross_attention=True, is_decoder=True, bos_token_id=101, eos_token_id=102)
bert2bert = EncoderDecoderModel(encoder=encoder, decoder=decoder)
# create tokenizer...
tokenizer = BertTokenizer.from_pretrained("bert-large-uncased")
@@ -47,7 +47,7 @@ Usage:
- Pretrained :class:`~transformers.EncoderDecoderModel` are also directly available in the model hub, e.g.,
:: code-block
.. code-block::
# instantiate sentence fusion model
sentence_fuser = EncoderDecoderModel.from_pretrained("google/roberta2roberta_L-24_discofuse")

View File

@@ -1,16 +1,28 @@
Blenderbot
-----------------------------------------------------------------------------------------------------------------------
**DISCLAIMER:** If you see something strange,
file a `Github Issue <https://github.com/huggingface/transformers/issues/new?assignees=&labels=&template=bug-report.md&title>`__ .
**DISCLAIMER:** If you see something strange, file a `Github Issue
<https://github.com/huggingface/transformers/issues/new?assignees=&labels=&template=bug-report.md&title>`__ .
Overview
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
The Blender chatbot model was proposed in `Recipes for building an open-domain chatbot <https://arxiv.org/pdf/2004.13637.pdf>`__ Stephen Roller, Emily Dinan, Naman Goyal, Da Ju, Mary Williamson, Yinhan Liu, Jing Xu, Myle Ott, Kurt Shuster, Eric M. Smith, Y-Lan Boureau, Jason Weston on 30 Apr 2020.
The Blender chatbot model was proposed in `Recipes for building an open-domain chatbot
<https://arxiv.org/pdf/2004.13637.pdf>`__ Stephen Roller, Emily Dinan, Naman Goyal, Da Ju, Mary Williamson, Yinhan Liu,
Jing Xu, Myle Ott, Kurt Shuster, Eric M. Smith, Y-Lan Boureau, Jason Weston on 30 Apr 2020.
The abstract of the paper is the following:
*Building open-domain chatbots is a challenging area for machine learning research. While prior work has shown that scaling neural models in the number of parameters and the size of the data they are trained on gives improved results, we show that other ingredients are important for a high-performing chatbot. Good conversation requires a number of skills that an expert conversationalist blends in a seamless way: providing engaging talking points and listening to their partners, and displaying knowledge, empathy and personality appropriately, while maintaining a consistent persona. We show that large scale models can learn these skills when given appropriate training data and choice of generation strategy. We build variants of these recipes with 90M, 2.7B and 9.4B parameter models, and make our models and code publicly available. Human evaluations show our best models are superior to existing approaches in multi-turn dialogue in terms of engagingness and humanness measurements. We then discuss the limitations of this work by analyzing failure cases of our models.*
*Building open-domain chatbots is a challenging area for machine learning research. While prior work has shown that
scaling neural models in the number of parameters and the size of the data they are trained on gives improved results,
we show that other ingredients are important for a high-performing chatbot. Good conversation requires a number of
skills that an expert conversationalist blends in a seamless way: providing engaging talking points and listening to
their partners, and displaying knowledge, empathy and personality appropriately, while maintaining a consistent
persona. We show that large scale models can learn these skills when given appropriate training data and choice of
generation strategy. We build variants of these recipes with 90M, 2.7B and 9.4B parameter models, and make our models
and code publicly available. Human evaluations show our best models are superior to existing approaches in multi-turn
dialogue in terms of engagingness and humanness measurements. We then discuss the limitations of this work by analyzing
failure cases of our models.*
The authors' code can be found `here <https://github.com/facebookresearch/ParlAI>`__ .
@@ -20,15 +32,20 @@ Implementation Notes
- Blenderbot uses a standard `seq2seq model transformer <https://arxiv.org/pdf/1706.03762.pdf>`__ based architecture.
- It inherits completely from :class:`~transformers.BartForConditionalGeneration`
- Even though blenderbot is one model, it uses two tokenizers :class:`~transformers.BlenderbotSmallTokenizer` for 90M checkpoint and :class:`~transformers.BlenderbotTokenizer` for all other checkpoints.
- :class:`~transformers.BlenderbotSmallTokenizer` will always return :class:`~transformers.BlenderbotSmallTokenizer`, regardless of checkpoint. To use the 3B parameter checkpoint, you must call :class:`~transformers.BlenderbotTokenizer` directly.
- Even though blenderbot is one model, it uses two tokenizers :class:`~transformers.BlenderbotSmallTokenizer` for 90M
checkpoint and :class:`~transformers.BlenderbotTokenizer` for all other checkpoints.
- :class:`~transformers.BlenderbotSmallTokenizer` will always return :class:`~transformers.BlenderbotSmallTokenizer`,
regardless of checkpoint. To use the 3B parameter checkpoint, you must call
:class:`~transformers.BlenderbotTokenizer` directly.
- Available checkpoints can be found in the `model hub <https://huggingface.co/models?search=blenderbot>`__.
Usage
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Model Usage:
Here is an example of model usage:
.. code-block::
>>> from transformers import BlenderbotSmallTokenizer, BlenderbotForConditionalGeneration
>>> mname = 'facebook/blenderbot-90M'
@@ -40,7 +57,10 @@ Model Usage:
>>> print([tokenizer.decode(g, skip_special_tokens=True, clean_up_tokenization_spaces=False) for g in reply_ids])
See Config Values:
Here is how you can check out config values:
.. code-block::
>>> from transformers import BlenderbotConfig
>>> config_90 = BlenderbotConfig.from_pretrained("facebook/blenderbot-90M")
@@ -51,6 +71,7 @@ See Config Values:
BlenderbotConfig
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
.. autoclass:: transformers.BlenderbotConfig
:members:
@@ -69,7 +90,17 @@ BlenderbotSmallTokenizer
BlenderbotForConditionalGeneration
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
See :obj:`transformers.BartForConditionalGeneration` for arguments to `forward` and `generate`
.. autoclass:: transformers.BlenderbotForConditionalGeneration
:members:
TFBlenderbotForConditionalGeneration
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
See :obj:`transformers.TFBartForConditionalGeneration` for arguments to `forward` and `generate`
.. autoclass:: transformers.TFBlenderbotForConditionalGeneration
:members:

View File

@@ -4,26 +4,26 @@ CamemBERT
Overview
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
The CamemBERT model was proposed in `CamemBERT: a Tasty French Language Model <https://arxiv.org/abs/1911.03894>`__
by Louis Martin, Benjamin Muller, Pedro Javier Ortiz Suárez, Yoann Dupont, Laurent Romary, Éric Villemonte de la
The CamemBERT model was proposed in `CamemBERT: a Tasty French Language Model <https://arxiv.org/abs/1911.03894>`__ by
Louis Martin, Benjamin Muller, Pedro Javier Ortiz Suárez, Yoann Dupont, Laurent Romary, Éric Villemonte de la
Clergerie, Djamé Seddah, and Benoît Sagot. It is based on Facebook's RoBERTa model released in 2019. It is a model
trained on 138GB of French text.
The abstract from the paper is the following:
*Pretrained language models are now ubiquitous in Natural Language Processing. Despite their success,
most available models have either been trained on English data or on the concatenation of data in multiple
languages. This makes practical use of such models --in all languages except English-- very limited. Aiming
to address this issue for French, we release CamemBERT, a French version of the Bi-directional Encoders for
Transformers (BERT). We measure the performance of CamemBERT compared to multilingual models in multiple
downstream tasks, namely part-of-speech tagging, dependency parsing, named-entity recognition, and natural
language inference. CamemBERT improves the state of the art for most of the tasks considered. We release the
pretrained model for CamemBERT hoping to foster research and downstream applications for French NLP.*
*Pretrained language models are now ubiquitous in Natural Language Processing. Despite their success, most available
models have either been trained on English data or on the concatenation of data in multiple languages. This makes
practical use of such models --in all languages except English-- very limited. Aiming to address this issue for French,
we release CamemBERT, a French version of the Bi-directional Encoders for Transformers (BERT). We measure the
performance of CamemBERT compared to multilingual models in multiple downstream tasks, namely part-of-speech tagging,
dependency parsing, named-entity recognition, and natural language inference. CamemBERT improves the state of the art
for most of the tasks considered. We release the pretrained model for CamemBERT hoping to foster research and
downstream applications for French NLP.*
Tips:
- This implementation is the same as RoBERTa. Refer to the :doc:`documentation of RoBERTa <roberta>` for usage
examples as well as the information relative to the inputs and outputs.
- This implementation is the same as RoBERTa. Refer to the :doc:`documentation of RoBERTa <roberta>` for usage examples
as well as the information relative to the inputs and outputs.
The original code can be found `here <https://camembert-model.fr/>`__.
@@ -130,4 +130,4 @@ TFCamembertForQuestionAnswering
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
.. autoclass:: transformers.TFCamembertForQuestionAnswering
:members:
:members:

View File

@@ -6,33 +6,33 @@ Overview
CTRL model was proposed in `CTRL: A Conditional Transformer Language Model for Controllable Generation
<https://arxiv.org/abs/1909.05858>`_ by Nitish Shirish Keskar*, Bryan McCann*, Lav R. Varshney, Caiming Xiong and
Richard Socher. It's a causal (unidirectional) transformer pre-trained using language modeling on a very large
corpus of ~140 GB of text data with the first token reserved as a control code (such as Links, Books, Wikipedia etc.).
Richard Socher. It's a causal (unidirectional) transformer pre-trained using language modeling on a very large corpus
of ~140 GB of text data with the first token reserved as a control code (such as Links, Books, Wikipedia etc.).
The abstract from the paper is the following:
*Large-scale language models show promising text generation capabilities, but users cannot easily control particular
aspects of the generated text. We release CTRL, a 1.63 billion-parameter conditional transformer language model,
trained to condition on control codes that govern style, content, and task-specific behavior. Control codes were
derived from structure that naturally co-occurs with raw text, preserving the advantages of unsupervised learning
while providing more explicit control over text generation. These codes also allow CTRL to predict which parts of
the training data are most likely given a sequence. This provides a potential method for analyzing large amounts
of data via model-based source attribution.*
derived from structure that naturally co-occurs with raw text, preserving the advantages of unsupervised learning while
providing more explicit control over text generation. These codes also allow CTRL to predict which parts of the
training data are most likely given a sequence. This provides a potential method for analyzing large amounts of data
via model-based source attribution.*
Tips:
- CTRL makes use of control codes to generate text: it requires generations to be started by certain words, sentences
or links to generate coherent text. Refer to the `original implementation <https://github.com/salesforce/ctrl>`__
for more information.
- CTRL is a model with absolute position embeddings so it's usually advised to pad the inputs on
the right rather than the left.
or links to generate coherent text. Refer to the `original implementation <https://github.com/salesforce/ctrl>`__ for
more information.
- CTRL is a model with absolute position embeddings so it's usually advised to pad the inputs on the right rather than
the left.
- CTRL was trained with a causal language modeling (CLM) objective and is therefore powerful at predicting the next
token in a sequence. Leveraging this feature allows CTRL to generate syntactically coherent text as
it can be observed in the `run_generation.py` example script.
token in a sequence. Leveraging this feature allows CTRL to generate syntactically coherent text as it can be
observed in the `run_generation.py` example script.
- The PyTorch models can take the `past` as input, which is the previously computed key/value attention pairs. Using
this `past` value prevents the model from re-computing pre-computed values in the context of text generation.
See `reusing the past in generative models <../quickstart.html#using-the-past>`__ for more information on the usage
of this argument.
this `past` value prevents the model from re-computing pre-computed values in the context of text generation. See
`reusing the past in generative models <../quickstart.html#using-the-past>`__ for more information on the usage of
this argument.
The original code can be found `here <https://github.com/salesforce/ctrl>`__.

View File

@@ -1,40 +1,43 @@
DeBERTa
----------------------------------------------------
-----------------------------------------------------------------------------------------------------------------------
Overview
~~~~~~~~~~~~~~~~~~~~~
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
The DeBERTa model was proposed in `DeBERTa: Decoding-enhanced BERT with Disentangled Attention <https://arxiv.org/abs/2006.03654>`__
by Pengcheng He, Xiaodong Liu, Jianfeng Gao, Weizhu Chen
It is based on Google's BERT model released in 2018 and Facebook's RoBERTa model released in 2019.
The DeBERTa model was proposed in `DeBERTa: Decoding-enhanced BERT with Disentangled Attention
<https://arxiv.org/abs/2006.03654>`__ by Pengcheng He, Xiaodong Liu, Jianfeng Gao, Weizhu Chen It is based on Google's
BERT model released in 2018 and Facebook's RoBERTa model released in 2019.
It builds on RoBERTa with disentangled attention and enhanced mask decoder training with half of the data used in RoBERTa.
It builds on RoBERTa with disentangled attention and enhanced mask decoder training with half of the data used in
RoBERTa.
The abstract from the paper is the following:
*Recent progress in pre-trained neural language models has significantly improved the performance of many natural language processing (NLP) tasks.
In this paper we propose a new model architecture DeBERTa (Decoding-enhanced BERT with disentangled attention) that improves the BERT and RoBERTa
models using two novel techniques. The first is the disentangled attention mechanism, where each word is represented using two vectors that encode
its content and position, respectively, and the attention weights among words are computed using disentangled matrices on their contents and
relative positions. Second, an enhanced mask decoder is used to replace the output softmax layer to predict the masked tokens for model pretraining.
We show that these two techniques significantly improve the efficiency of model pre-training and performance of downstream tasks. Compared to
RoBERTa-Large, a DeBERTa model trained on half of the training data performs consistently better on a wide range of NLP tasks, achieving improvements
on MNLI by +0.9% (90.2% vs. 91.1%), on SQuAD v2.0 by +2.3% (88.4% vs. 90.7%) and RACE by +3.6% (83.2% vs. 86.8%). The DeBERTa code and pre-trained
models will be made publicly available at https://github.com/microsoft/DeBERTa.*
*Recent progress in pre-trained neural language models has significantly improved the performance of many natural
language processing (NLP) tasks. In this paper we propose a new model architecture DeBERTa (Decoding-enhanced BERT with
disentangled attention) that improves the BERT and RoBERTa models using two novel techniques. The first is the
disentangled attention mechanism, where each word is represented using two vectors that encode its content and
position, respectively, and the attention weights among words are computed using disentangled matrices on their
contents and relative positions. Second, an enhanced mask decoder is used to replace the output softmax layer to
predict the masked tokens for model pretraining. We show that these two techniques significantly improve the efficiency
of model pre-training and performance of downstream tasks. Compared to RoBERTa-Large, a DeBERTa model trained on half
of the training data performs consistently better on a wide range of NLP tasks, achieving improvements on MNLI by +0.9%
(90.2% vs. 91.1%), on SQuAD v2.0 by +2.3% (88.4% vs. 90.7%) and RACE by +3.6% (83.2% vs. 86.8%). The DeBERTa code and
pre-trained models will be made publicly available at https://github.com/microsoft/DeBERTa.*
The original code can be found `here <https://github.com/microsoft/DeBERTa>`__.
DebertaConfig
~~~~~~~~~~~~~~~~~~~~~
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
.. autoclass:: transformers.DebertaConfig
:members:
DebertaTokenizer
~~~~~~~~~~~~~~~~~~~~~
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
.. autoclass:: transformers.DebertaTokenizer
:members: build_inputs_with_special_tokens, get_special_tokens_mask,
@@ -42,21 +45,21 @@ DebertaTokenizer
DebertaModel
~~~~~~~~~~~~~~~~~~~~
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
.. autoclass:: transformers.DebertaModel
:members:
DebertaPreTrainedModel
~~~~~~~~~~~~~~~~~~~~~~~~~~
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
.. autoclass:: transformers.DebertaPreTrainedModel
:members:
DebertaForSequenceClassification
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
.. autoclass:: transformers.DebertaForSequenceClassification
:members:

View File

@@ -4,36 +4,39 @@ DialoGPT
Overview
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
DialoGPT was proposed in
`DialoGPT: Large-Scale Generative Pre-training for Conversational Response Generation <https://arxiv.org/abs/1911.00536>`_
by Yizhe Zhang, Siqi Sun, Michel Galley, Yen-Chun Chen, Chris Brockett, Xiang Gao, Jianfeng Gao, Jingjing Liu, Bill Dolan.
It's a GPT2 Model trained on 147M conversation-like exchanges extracted from Reddit.
DialoGPT was proposed in `DialoGPT: Large-Scale Generative Pre-training for Conversational Response Generation
<https://arxiv.org/abs/1911.00536>`_ by Yizhe Zhang, Siqi Sun, Michel Galley, Yen-Chun Chen, Chris Brockett, Xiang Gao,
Jianfeng Gao, Jingjing Liu, Bill Dolan. It's a GPT2 Model trained on 147M conversation-like exchanges extracted from
Reddit.
The abstract from the paper is the following:
*We present a large, tunable neural conversational response generation model, DialoGPT (dialogue generative pre-trained transformer).
Trained on 147M conversation-like exchanges extracted from Reddit comment chains over a period spanning from 2005 through 2017, DialoGPT extends the Hugging Face PyTorch transformer to attain a performance close to human both in terms of automatic and human evaluation in single-turn dialogue settings.
We show that conversational systems that leverage DialoGPT generate more relevant, contentful and context-consistent responses than strong baseline systems.
The pre-trained model and training pipeline are publicly released to facilitate research into neural response generation and the development of more intelligent open-domain dialogue systems.*
*We present a large, tunable neural conversational response generation model, DialoGPT (dialogue generative pre-trained
transformer). Trained on 147M conversation-like exchanges extracted from Reddit comment chains over a period spanning
from 2005 through 2017, DialoGPT extends the Hugging Face PyTorch transformer to attain a performance close to human
both in terms of automatic and human evaluation in single-turn dialogue settings. We show that conversational systems
that leverage DialoGPT generate more relevant, contentful and context-consistent responses than strong baseline
systems. The pre-trained model and training pipeline are publicly released to facilitate research into neural response
generation and the development of more intelligent open-domain dialogue systems.*
Tips:
- DialoGPT is a model with absolute position embeddings so it's usually advised to pad the inputs on
the right rather than the left.
- DialoGPT was trained with a causal language modeling (CLM) objective on conversational data and is therefore powerful at response generation in open-domain dialogue systems.
- DialoGPT enables the user to create a chat bot in just 10 lines of code as shown on `DialoGPT's model card <https://huggingface.co/microsoft/DialoGPT-medium>`_.
- DialoGPT is a model with absolute position embeddings so it's usually advised to pad the inputs on the right rather
than the left.
- DialoGPT was trained with a causal language modeling (CLM) objective on conversational data and is therefore powerful
at response generation in open-domain dialogue systems.
- DialoGPT enables the user to create a chat bot in just 10 lines of code as shown on `DialoGPT's model card
<https://huggingface.co/microsoft/DialoGPT-medium>`_.
Training:
In order to train or fine-tune DialoGPT, one can use causal language modeling training.
To cite the official paper:
*We follow the OpenAI GPT-2 to model a multiturn dialogue session
as a long text and frame the generation task as language modeling. We first
concatenate all dialog turns within a dialogue session into a long text
x_1,..., x_N (N is the sequence length), ended by the end-of-text token.*
For more information please confer to the original paper.
In order to train or fine-tune DialoGPT, one can use causal language modeling training. To cite the official paper: *We
follow the OpenAI GPT-2 to model a multiturn dialogue session as a long text and frame the generation task as language
modeling. We first concatenate all dialog turns within a dialogue session into a long text x_1,..., x_N (N is the
sequence length), ended by the end-of-text token.* For more information please confer to the original paper.
DialoGPT's architecture is based on the GPT2 model, so one can refer to GPT2's `docstring <https://huggingface.co/transformers/model_doc/gpt2.html>`_.
DialoGPT's architecture is based on the GPT2 model, so one can refer to GPT2's `docstring
<https://huggingface.co/transformers/model_doc/gpt2.html>`_.
The original code can be found `here <https://github.com/microsoft/DialoGPT>`_.

View File

@@ -4,13 +4,12 @@ DistilBERT
Overview
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
The DistilBERT model was proposed in the blog post
`Smaller, faster, cheaper, lighter: Introducing DistilBERT, a distilled version of BERT
<https://medium.com/huggingface/distilbert-8cf3380435b5>`__, and the paper `DistilBERT, a distilled version of BERT:
smaller, faster, cheaper and lighter <https://arxiv.org/abs/1910.01108>`__.
DistilBERT is a small, fast, cheap and light Transformer model trained by distilling BERT base. It has 40% less
parameters than `bert-base-uncased`, runs 60% faster while preserving over 95% of BERT's performances as measured on
the GLUE language understanding benchmark.
The DistilBERT model was proposed in the blog post `Smaller, faster, cheaper, lighter: Introducing DistilBERT, a
distilled version of BERT <https://medium.com/huggingface/distilbert-8cf3380435b5>`__, and the paper `DistilBERT, a
distilled version of BERT: smaller, faster, cheaper and lighter <https://arxiv.org/abs/1910.01108>`__. DistilBERT is a
small, fast, cheap and light Transformer model trained by distilling BERT base. It has 40% less parameters than
`bert-base-uncased`, runs 60% faster while preserving over 95% of BERT's performances as measured on the GLUE language
understanding benchmark.
The abstract from the paper is the following:
@@ -18,13 +17,13 @@ The abstract from the paper is the following:
operating these large models in on-the-edge and/or under constrained computational training or inference budgets
remains challenging. In this work, we propose a method to pre-train a smaller general-purpose language representation
model, called DistilBERT, which can then be fine-tuned with good performances on a wide range of tasks like its larger
counterparts. While most prior work investigated the use of distillation for building task-specific models, we
leverage knowledge distillation during the pre-training phase and show that it is possible to reduce the size of a
BERT model by 40%, while retaining 97% of its language understanding capabilities and being 60% faster. To leverage
the inductive biases learned by larger models during pre-training, we introduce a triple loss combining language
modeling, distillation and cosine-distance losses. Our smaller, faster and lighter model is cheaper to pre-train
and we demonstrate its capabilities for on-device computations in a proof-of-concept experiment and a comparative
on-device study.*
counterparts. While most prior work investigated the use of distillation for building task-specific models, we leverage
knowledge distillation during the pre-training phase and show that it is possible to reduce the size of a BERT model by
40%, while retaining 97% of its language understanding capabilities and being 60% faster. To leverage the inductive
biases learned by larger models during pre-training, we introduce a triple loss combining language modeling,
distillation and cosine-distance losses. Our smaller, faster and lighter model is cheaper to pre-train and we
demonstrate its capabilities for on-device computations in a proof-of-concept experiment and a comparative on-device
study.*
Tips:
@@ -33,7 +32,8 @@ Tips:
- DistilBERT doesn't have options to select the input positions (:obj:`position_ids` input). This could be added if
necessary though, just let us know if you need this option.
The original code can be found `here <https://github.com/huggingface/transformers/tree/master/examples/distillation>`__.
The original code can be found `here
<https://github.com/huggingface/transformers/tree/master/examples/distillation>`__.
DistilBertConfig

View File

@@ -4,9 +4,9 @@ DPR
Overview
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Dense Passage Retrieval (DPR) is a set of tools and models for state-of-the-art open-domain Q&A research.
It was intorduced in `Dense Passage Retrieval for Open-Domain Question Answering <https://arxiv.org/abs/2004.04906>`__
by Vladimir Karpukhin, Barlas Oğuz, Sewon Min, Patrick Lewis, Ledell Wu, Sergey Edunov, Danqi Chen, Wen-tau Yih.
Dense Passage Retrieval (DPR) is a set of tools and models for state-of-the-art open-domain Q&A research. It was
intorduced in `Dense Passage Retrieval for Open-Domain Question Answering <https://arxiv.org/abs/2004.04906>`__ by
Vladimir Karpukhin, Barlas Oğuz, Sewon Min, Patrick Lewis, Ledell Wu, Sergey Edunov, Danqi Chen, Wen-tau Yih.
The abstract from the paper is the following:

View File

@@ -12,34 +12,28 @@ identify which tokens were replaced by the generator in the sequence.
The abstract from the paper is the following:
*Masked language modeling (MLM) pre-training methods such as BERT corrupt
the input by replacing some tokens with [MASK] and then train a model to
reconstruct the original tokens. While they produce good results when transferred
to downstream NLP tasks, they generally require large amounts of compute to be
effective. As an alternative, we propose a more sample-efficient pre-training task
called replaced token detection. Instead of masking the input, our approach
corrupts it by replacing some tokens with plausible alternatives sampled from a small
generator network. Then, instead of training a model that predicts the original
identities of the corrupted tokens, we train a discriminative model that predicts
whether each token in the corrupted input was replaced by a generator sample
or not. Thorough experiments demonstrate this new pre-training task is more
efficient than MLM because the task is defined over all input tokens rather than
just the small subset that was masked out. As a result, the contextual representations
learned by our approach substantially outperform the ones learned by BERT
given the same model size, data, and compute. The gains are particularly strong
for small models; for example, we train a model on one GPU for 4 days that
outperforms GPT (trained using 30x more compute) on the GLUE natural language
understanding benchmark. Our approach also works well at scale, where it
performs comparably to RoBERTa and XLNet while using less than 1/4 of their
compute and outperforms them when using the same amount of compute.*
*Masked language modeling (MLM) pre-training methods such as BERT corrupt the input by replacing some tokens with
[MASK] and then train a model to reconstruct the original tokens. While they produce good results when transferred to
downstream NLP tasks, they generally require large amounts of compute to be effective. As an alternative, we propose a
more sample-efficient pre-training task called replaced token detection. Instead of masking the input, our approach
corrupts it by replacing some tokens with plausible alternatives sampled from a small generator network. Then, instead
of training a model that predicts the original identities of the corrupted tokens, we train a discriminative model that
predicts whether each token in the corrupted input was replaced by a generator sample or not. Thorough experiments
demonstrate this new pre-training task is more efficient than MLM because the task is defined over all input tokens
rather than just the small subset that was masked out. As a result, the contextual representations learned by our
approach substantially outperform the ones learned by BERT given the same model size, data, and compute. The gains are
particularly strong for small models; for example, we train a model on one GPU for 4 days that outperforms GPT (trained
using 30x more compute) on the GLUE natural language understanding benchmark. Our approach also works well at scale,
where it performs comparably to RoBERTa and XLNet while using less than 1/4 of their compute and outperforms them when
using the same amount of compute.*
Tips:
- ELECTRA is the pretraining approach, therefore there is nearly no changes done to the underlying model: BERT. The
only change is the separation of the embedding size and the hidden size: the embedding size is generally smaller,
while the hidden size is larger. An additional projection layer (linear) is used to project the embeddings from
their embedding size to the hidden size. In the case where the embedding size is the same as the hidden size, no
projection layer is used.
while the hidden size is larger. An additional projection layer (linear) is used to project the embeddings from their
embedding size to the hidden size. In the case where the embedding size is the same as the hidden size, no projection
layer is used.
- The ELECTRA checkpoints saved using `Google Research's implementation <https://github.com/google-research/electra>`__
contain both the generator and discriminator. The conversion script requires the user to name which model to export
into the correct architecture. Once converted to the HuggingFace format, these checkpoints may be loaded into all

View File

@@ -13,7 +13,7 @@ any other models (see the examples for more information).
An application of this architecture could be to leverage two pretrained :class:`~transformers.BertModel` as the encoder
and decoder for a summarization model as was shown in: `Text Summarization with Pretrained Encoders
<https://arxiv.org/abs/1908.08345>`__ by Yang Liu and Mirella Lapata.
<https://arxiv.org/abs/1908.08345>`__ by Yang Liu and Mirella Lapata.
EncoderDecoderConfig

View File

@@ -11,17 +11,17 @@ modeling (MLM) objective (like BERT).
The abstract from the paper is the following:
*Language models have become a key step to achieve state-of-the art results in many different Natural Language
Processing (NLP) tasks. Leveraging the huge amount of unlabeled texts nowadays available, they provide an efficient
way to pre-train continuous word representations that can be fine-tuned for a downstream task, along with their
Processing (NLP) tasks. Leveraging the huge amount of unlabeled texts nowadays available, they provide an efficient way
to pre-train continuous word representations that can be fine-tuned for a downstream task, along with their
contextualization at the sentence level. This has been widely demonstrated for English using contextualized
representations (Dai and Le, 2015; Peters et al., 2018; Howard and Ruder, 2018; Radford et al., 2018; Devlin et
al., 2019; Yang et al., 2019b). In this paper, we introduce and share FlauBERT, a model learned on a very large
and heterogeneous French corpus. Models of different sizes are trained using the new CNRS (French National Centre
for Scientific Research) Jean Zay supercomputer. We apply our French language models to diverse NLP tasks (text
classification, paraphrasing, natural language inference, parsing, word sense disambiguation) and show that most
of the time they outperform other pre-training approaches. Different versions of FlauBERT as well as a unified
evaluation protocol for the downstream tasks, called FLUE (French Language Understanding Evaluation), are shared
to the research community for further reproducible experiments in French NLP.*
representations (Dai and Le, 2015; Peters et al., 2018; Howard and Ruder, 2018; Radford et al., 2018; Devlin et al.,
2019; Yang et al., 2019b). In this paper, we introduce and share FlauBERT, a model learned on a very large and
heterogeneous French corpus. Models of different sizes are trained using the new CNRS (French National Centre for
Scientific Research) Jean Zay supercomputer. We apply our French language models to diverse NLP tasks (text
classification, paraphrasing, natural language inference, parsing, word sense disambiguation) and show that most of the
time they outperform other pre-training approaches. Different versions of FlauBERT as well as a unified evaluation
protocol for the downstream tasks, called FLUE (French Language Understanding Evaluation), are shared to the research
community for further reproducible experiments in French NLP.*
The original code can be found `here <https://github.com/getalp/Flaubert>`__.

View File

@@ -58,4 +58,4 @@ FSMTForConditionalGeneration
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
.. autoclass:: transformers.FSMTForConditionalGeneration
:members: forward
:members: forward

View File

@@ -30,8 +30,8 @@ Tips:
directly for tasks that just require a sentence summary (like sequence classification or multiple choice). For other
tasks, the full model is used; this full model has a decoder that upsamples the final hidden states to the same
sequence length as the input.
- The Funnel Transformer checkpoints are all available with a full version and a base version. The first ones should
be used for :class:`~transformers.FunnelModel`, :class:`~transformers.FunnelForPreTraining`,
- The Funnel Transformer checkpoints are all available with a full version and a base version. The first ones should be
used for :class:`~transformers.FunnelModel`, :class:`~transformers.FunnelForPreTraining`,
:class:`~transformers.FunnelForMaskedLM`, :class:`~transformers.FunnelForTokenClassification` and
class:`~transformers.FunnelForQuestionAnswering`. The second ones should be used for
:class:`~transformers.FunnelBaseModel`, :class:`~transformers.FunnelForSequenceClassification` and

View File

@@ -6,51 +6,47 @@ Overview
OpenAI GPT model was proposed in `Improving Language Understanding by Generative Pre-Training
<https://s3-us-west-2.amazonaws.com/openai-assets/research-covers/language-unsupervised/language_understanding_paper.pdf>`__
by Alec Radford, Karthik Narasimhan, Tim Salimans and Ilya Sutskever. It's a causal (unidirectional)
transformer pre-trained using language modeling on a large corpus will long range dependencies, the Toronto Book
Corpus.
by Alec Radford, Karthik Narasimhan, Tim Salimans and Ilya Sutskever. It's a causal (unidirectional) transformer
pre-trained using language modeling on a large corpus will long range dependencies, the Toronto Book Corpus.
The abstract from the paper is the following:
*Natural language understanding comprises a wide range of diverse tasks such
as textual entailment, question answering, semantic similarity assessment, and
document classification. Although large unlabeled text corpora are abundant,
labeled data for learning these specific tasks is scarce, making it challenging for
discriminatively trained models to perform adequately. We demonstrate that large
gains on these tasks can be realized by generative pre-training of a language model
on a diverse corpus of unlabeled text, followed by discriminative fine-tuning on each
specific task. In contrast to previous approaches, we make use of task-aware input
transformations during fine-tuning to achieve effective transfer while requiring
minimal changes to the model architecture. We demonstrate the effectiveness of
our approach on a wide range of benchmarks for natural language understanding.
Our general task-agnostic model outperforms discriminatively trained models that
use architectures specifically crafted for each task, significantly improving upon the
state of the art in 9 out of the 12 tasks studied.*
*Natural language understanding comprises a wide range of diverse tasks such as textual entailment, question answering,
semantic similarity assessment, and document classification. Although large unlabeled text corpora are abundant,
labeled data for learning these specific tasks is scarce, making it challenging for discriminatively trained models to
perform adequately. We demonstrate that large gains on these tasks can be realized by generative pre-training of a
language model on a diverse corpus of unlabeled text, followed by discriminative fine-tuning on each specific task. In
contrast to previous approaches, we make use of task-aware input transformations during fine-tuning to achieve
effective transfer while requiring minimal changes to the model architecture. We demonstrate the effectiveness of our
approach on a wide range of benchmarks for natural language understanding. Our general task-agnostic model outperforms
discriminatively trained models that use architectures specifically crafted for each task, significantly improving upon
the state of the art in 9 out of the 12 tasks studied.*
Tips:
- GPT is a model with absolute position embeddings so it's usually advised to pad the inputs on
the right rather than the left.
- GPT is a model with absolute position embeddings so it's usually advised to pad the inputs on the right rather than
the left.
- GPT was trained with a causal language modeling (CLM) objective and is therefore powerful at predicting the next
token in a sequence. Leveraging this feature allows GPT-2 to generate syntactically coherent text as
it can be observed in the `run_generation.py` example script.
token in a sequence. Leveraging this feature allows GPT-2 to generate syntactically coherent text as it can be
observed in the `run_generation.py` example script.
`Write With Transformer <https://transformer.huggingface.co/doc/gpt>`__ is a webapp created and hosted by
Hugging Face showcasing the generative capabilities of several models. GPT is one of them.
`Write With Transformer <https://transformer.huggingface.co/doc/gpt>`__ is a webapp created and hosted by Hugging Face
showcasing the generative capabilities of several models. GPT is one of them.
The original code can be found `here <https://github.com/openai/finetune-transformer-lm>`__.
Note:
If you want to reproduce the original tokenization process of the `OpenAI GPT` paper, you will need to install
``ftfy`` and ``SpaCy``::
If you want to reproduce the original tokenization process of the `OpenAI GPT` paper, you will need to install ``ftfy``
and ``SpaCy``::
.. code-block:: bash
pip install spacy ftfy==4.4.3
python -m spacy download en
If you don't install ``ftfy`` and ``SpaCy``, the :class:`~transformers.OpenAIGPTTokenizer` will default to tokenize
using BERT's :obj:`BasicTokenizer` followed by Byte-Pair Encoding (which should be fine for most usage, don't
worry).
using BERT's :obj:`BasicTokenizer` followed by Byte-Pair Encoding (which should be fine for most usage, don't worry).
OpenAIGPTConfig
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

View File

@@ -5,29 +5,29 @@ Overview
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
OpenAI GPT-2 model was proposed in `Language Models are Unsupervised Multitask Learners
<https://cdn.openai.com/better-language-models/language_models_are_unsupervised_multitask_learners.pdf>`_
by Alec Radford, Jeffrey Wu, Rewon Child, David Luan, Dario Amodei and Ilya Sutskever. It's a causal (unidirectional)
transformer pretrained using language modeling on a very large corpus of ~40 GB of text data.
<https://cdn.openai.com/better-language-models/language_models_are_unsupervised_multitask_learners.pdf>`_ by Alec
Radford, Jeffrey Wu, Rewon Child, David Luan, Dario Amodei and Ilya Sutskever. It's a causal (unidirectional)
transformer pretrained using language modeling on a very large corpus of ~40 GB of text data.
The abstract from the paper is the following:
*GPT-2 is a large transformer-based language model with 1.5 billion parameters, trained on a dataset[1]
of 8 million web pages. GPT-2 is trained with a simple objective: predict the next word, given all of the previous
words within some text. The diversity of the dataset causes this simple goal to contain naturally occurring
demonstrations of many tasks across diverse domains. GPT-2 is a direct scale-up of GPT, with more than 10X
the parameters and trained on more than 10X the amount of data.*
*GPT-2 is a large transformer-based language model with 1.5 billion parameters, trained on a dataset[1] of 8 million
web pages. GPT-2 is trained with a simple objective: predict the next word, given all of the previous words within some
text. The diversity of the dataset causes this simple goal to contain naturally occurring demonstrations of many tasks
across diverse domains. GPT-2 is a direct scale-up of GPT, with more than 10X the parameters and trained on more than
10X the amount of data.*
Tips:
- GPT-2 is a model with absolute position embeddings so it's usually advised to pad the inputs on
the right rather than the left.
- GPT-2 is a model with absolute position embeddings so it's usually advised to pad the inputs on the right rather than
the left.
- GPT-2 was trained with a causal language modeling (CLM) objective and is therefore powerful at predicting the next
token in a sequence. Leveraging this feature allows GPT-2 to generate syntactically coherent text as
it can be observed in the `run_generation.py` example script.
token in a sequence. Leveraging this feature allows GPT-2 to generate syntactically coherent text as it can be
observed in the `run_generation.py` example script.
- The PyTorch models can take the `past` as input, which is the previously computed key/value attention pairs. Using
this `past` value prevents the model from re-computing pre-computed values in the context of text generation.
See `reusing the past in generative models <../quickstart.html#using-the-past>`__ for more information on the usage
of this argument.
this `past` value prevents the model from re-computing pre-computed values in the context of text generation. See
`reusing the past in generative models <../quickstart.html#using-the-past>`__ for more information on the usage of
this argument.
`Write With Transformer <https://transformer.huggingface.co/doc/gpt2-large>`__ is a webapp created and hosted by
Hugging Face showcasing the generative capabilities of several models. GPT-2 is one of them and is available in five

View File

@@ -1,55 +1,66 @@
LayoutLM
----------------------------------------------------
-----------------------------------------------------------------------------------------------------------------------
Overview
~~~~~~~~~~~~~~~~~~~~~
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
The LayoutLM model was proposed in the paper `LayoutLM: Pre-training of Text and Layout for Document Image Understanding <https://arxiv.org/abs/1912.13318>`__
by Yiheng Xu, Minghao Li, Lei Cui, Shaohan Huang, Furu Wei, and Ming Zhou. It's a simple but effective pre-training method
of text and layout for document image understanding and information extraction tasks, such as form understanding and receipt understanding.
The LayoutLM model was proposed in the paper `LayoutLM: Pre-training of Text and Layout for Document Image
Understanding <https://arxiv.org/abs/1912.13318>`__ by Yiheng Xu, Minghao Li, Lei Cui, Shaohan Huang, Furu Wei, and
Ming Zhou. It's a simple but effective pre-training method of text and layout for document image understanding and
information extraction tasks, such as form understanding and receipt understanding.
The abstract from the paper is the following:
*Pre-training techniques have been verified successfully in a variety of NLP tasks in recent years. Despite the widespread use of pre-training models for NLP applications, they almost exclusively focus on text-level manipulation, while neglecting layout and style information that is vital for document image understanding. In this paper, we propose the \textbf{LayoutLM} to jointly model interactions between text and layout information across scanned document images, which is beneficial for a great number of real-world document image understanding tasks such as information extraction from scanned documents. Furthermore, we also leverage image features to incorporate words' visual information into LayoutLM. To the best of our knowledge, this is the first time that text and layout are jointly learned in a single framework for document-level pre-training. It achieves new state-of-the-art results in several downstream tasks, including form understanding (from 70.72 to 79.27), receipt understanding (from 94.02 to 95.24) and document image classification (from 93.07 to 94.42).*
*Pre-training techniques have been verified successfully in a variety of NLP tasks in recent years. Despite the
widespread use of pre-training models for NLP applications, they almost exclusively focus on text-level manipulation,
while neglecting layout and style information that is vital for document image understanding. In this paper, we propose
the \textbf{LayoutLM} to jointly model interactions between text and layout information across scanned document images,
which is beneficial for a great number of real-world document image understanding tasks such as information extraction
from scanned documents. Furthermore, we also leverage image features to incorporate words' visual information into
LayoutLM. To the best of our knowledge, this is the first time that text and layout are jointly learned in a single
framework for document-level pre-training. It achieves new state-of-the-art results in several downstream tasks,
including form understanding (from 70.72 to 79.27), receipt understanding (from 94.02 to 95.24) and document image
classification (from 93.07 to 94.42).*
Tips:
- LayoutLM has an extra input called :obj:`bbox`, which is the bounding boxes of the input tokens.
- The :obj:`bbox` requires the data that on 0-1000 scale, which means you should normalize the bounding box before passing them into model.
- The :obj:`bbox` requires the data that on 0-1000 scale, which means you should normalize the bounding box before
passing them into model.
The original code can be found `here <https://github.com/microsoft/unilm/tree/master/layoutlm>`_.
LayoutLMConfig
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
.. autoclass:: transformers.LayoutLMConfig
:members:
LayoutLMTokenizer
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
.. autoclass:: transformers.LayoutLMTokenizer
:members:
LayoutLMModel
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
.. autoclass:: transformers.LayoutLMModel
:members:
LayoutLMForMaskedLM
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
.. autoclass:: transformers.LayoutLMForMaskedLM
:members:
LayoutLMForTokenClassification
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
.. autoclass:: transformers.LayoutLMForTokenClassification
:members:

View File

@@ -27,20 +27,20 @@ The Authors' code can be found `here <https://github.com/allenai/longformer>`__.
Longformer Self Attention
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Longformer self attention employs self attention on both a "local" context and a "global" context.
Most tokens only attend "locally" to each other meaning that each token attends to its :math:`\frac{1}{2} w` previous
tokens and :math:`\frac{1}{2} w` succeding tokens with :math:`w` being the window length as defined in
Longformer self attention employs self attention on both a "local" context and a "global" context. Most tokens only
attend "locally" to each other meaning that each token attends to its :math:`\frac{1}{2} w` previous tokens and
:math:`\frac{1}{2} w` succeding tokens with :math:`w` being the window length as defined in
:obj:`config.attention_window`. Note that :obj:`config.attention_window` can be of type :obj:`List` to define a
different :math:`w` for each layer. A selected few tokens attend "globally" to all other tokens, as it is
conventionally done for all tokens in :obj:`BertSelfAttention`.
Note that "locally" and "globally" attending tokens are projected by different query, key and value matrices.
Also note that every "locally" attending token not only attends to tokens within its window :math:`w`, but also to all
"globally" attending tokens so that global attention is *symmetric*.
Note that "locally" and "globally" attending tokens are projected by different query, key and value matrices. Also note
that every "locally" attending token not only attends to tokens within its window :math:`w`, but also to all "globally"
attending tokens so that global attention is *symmetric*.
The user can define which tokens attend "locally" and which tokens attend "globally" by setting the tensor
:obj:`global_attention_mask` at run-time appropriately. All Longformer models employ the following logic for
:obj:`global_attention_mask`:
:obj:`global_attention_mask`:
- 0: the token attends "locally",
- 1: the token attends "globally".
@@ -90,6 +90,32 @@ LongformerTokenizerFast
.. autoclass:: transformers.LongformerTokenizerFast
:members:
Longformer specific outputs
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
.. autoclass:: transformers.modeling_longformer.LongformerBaseModelOutput
:members:
.. autoclass:: transformers.modeling_longformer.LongformerBaseModelOutputWithPooling
:members:
.. autoclass:: transformers.modeling_longformer.LongformerMultipleChoiceModelOutput
:members:
.. autoclass:: transformers.modeling_longformer.LongformerQuestionAnsweringModelOutput
:members:
.. autoclass:: transformers.modeling_tf_longformer.TFLongformerBaseModelOutput
:members:
.. autoclass:: transformers.modeling_tf_longformer.TFLongformerBaseModelOutputWithPooling
:members:
.. autoclass:: transformers.modeling_tf_longformer.TFLongformerQuestionAnsweringModelOutput
:members:
LongformerModel
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
LongformerModel
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

View File

@@ -8,9 +8,8 @@ The LXMERT model was proposed in `LXMERT: Learning Cross-Modality Encoder Repres
<https://arxiv.org/abs/1908.07490>`__ by Hao Tan & Mohit Bansal. It is a series of bidirectional transformer encoders
(one for the vision modality, one for the language modality, and then one to fuse both modalities) pretrained using a
combination of masked language modeling, visual-language text alignment, ROI-feature regression, masked
visual-attribute modeling, masked visual-object modeling, and visual-question answering objectives.
The pretraining consists of multiple multi-modal datasets: MSCOCO, Visual-Genome + Visual-Genome Question Answering,
VQA 2.0, and GQA.
visual-attribute modeling, masked visual-object modeling, and visual-question answering objectives. The pretraining
consists of multiple multi-modal datasets: MSCOCO, Visual-Genome + Visual-Genome Question Answering, VQA 2.0, and GQA.
The abstract from the paper is the following:

View File

@@ -3,7 +3,7 @@ MarianMT
**Bugs:** If you see something strange, file a `Github Issue
<https://github.com/huggingface/transformers/issues/new?assignees=sshleifer&labels=&template=bug-report.md&title>`__
and assign @sshleifer.
and assign @patrickvonplaten.
Translations should be similar, but not identical to, output in the test set linked to in each model card.
@@ -12,13 +12,14 @@ Implementation Notes
- Each model is about 298 MB on disk, there are more than 1,000 models.
- The list of supported language pairs can be found `here <https://huggingface.co/Helsinki-NLP>`__.
- Models were originally trained by
`Jörg Tiedemann <https://researchportal.helsinki.fi/en/persons/j%C3%B6rg-tiedemann>`__ using the
`Marian <https://marian-nmt.github.io/>`__ C++ library, which supports fast training and translation.
- Models were originally trained by `Jörg Tiedemann
<https://researchportal.helsinki.fi/en/persons/j%C3%B6rg-tiedemann>`__ using the `Marian
<https://marian-nmt.github.io/>`__ C++ library, which supports fast training and translation.
- All models are transformer encoder-decoders with 6 layers in each component. Each model's performance is documented
in a model card.
- The 80 opus models that require BPE preprocessing are not supported.
- The modeling code is the same as :class:`~transformers.BartForConditionalGeneration` with a few minor modifications:
- static (sinusoid) positional embeddings (:obj:`MarianConfig.static_position_embeddings=True`)
- a new final_logits_bias (:obj:`MarianConfig.add_bias_logits=True`)
- no layernorm_embedding (:obj:`MarianConfig.normalize_embedding=False`)
@@ -29,17 +30,17 @@ Implementation Notes
Naming
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
- All model names use the following format: :obj:`Helsinki-NLP/opus-mt-{src}-{tgt}`
- All model names use the following format: :obj:`Helsinki-NLP/opus-mt-{src}-{tgt}`
- The language codes used to name models are inconsistent. Two digit codes can usually be found `here
<https://developers.google.com/admin-sdk/directory/v1/languages>`__, three digit codes require googling
"language code {code}".
<https://developers.google.com/admin-sdk/directory/v1/languages>`__, three digit codes require googling "language
code {code}".
- Codes formatted like :obj:`es_AR` are usually :obj:`code_{region}`. That one is Spanish from Argentina.
Multilingual Models
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
All model names use the following format: :obj:`Helsinki-NLP/opus-mt-{src}-{tgt}`:
All model names use the following format: :obj:`Helsinki-NLP/opus-mt-{src}-{tgt}`:
- If :obj:`src` is in all caps, the model supports multiple input languages, you can figure out which ones by
looking at the model card, or the Group Members `mapping
@@ -112,6 +113,7 @@ Code to see available pretrained models:
MarianConfig
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
.. autoclass:: transformers.MarianConfig
:members:
@@ -127,3 +129,9 @@ MarianMTModel
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
.. autoclass:: transformers.MarianMTModel
TFMarianMTModel
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
.. autoclass:: transformers.TFMarianMTModel

View File

@@ -3,13 +3,14 @@ MBart
**DISCLAIMER:** If you see something strange, file a `Github Issue
<https://github.com/huggingface/transformers/issues/new?assignees=&labels=&template=bug-report.md&title>`__ and assign
@sshleifer
@patrickvonplaten
Overview
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
The MBart model was presented in `Multilingual Denoising Pre-training for Neural Machine Translation
<https://arxiv.org/abs/2001.08210>`_ by Yinhan Liu, Jiatao Gu, Naman Goyal, Xian Li, Sergey Edunov
Marjan Ghazvininejad, Mike Lewis, Luke Zettlemoyer.
<https://arxiv.org/abs/2001.08210>`_ by Yinhan Liu, Jiatao Gu, Naman Goyal, Xian Li, Sergey Edunov Marjan
Ghazvininejad, Mike Lewis, Luke Zettlemoyer.
According to the abstract, MBART is a sequence-to-sequence denoising auto-encoder pretrained on large-scale monolingual
corpora in many languages using the BART objective. mBART is one of the first methods for pre-training a complete
@@ -21,12 +22,13 @@ The Authors' code can be found `here <https://github.com/pytorch/fairseq/tree/ma
Training
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
MBart is a multilingual encoder-decoder (seq-to-seq) model primarily intended for translation task.
As the model is multilingual it expects the sequences in a different format. A special language id token
is added in both the source and target text. The source text format is :obj:`X [eos, src_lang_code]`
where :obj:`X` is the source text. The target text format is :obj:`[tgt_lang_code] X [eos]`. :obj:`bos` is never used.
The :meth:`~transformers.MBartTokenizer.prepare_seq2seq_batch` handles this automatically and should be used to encode
MBart is a multilingual encoder-decoder (seq-to-seq) model primarily intended for translation task. As the model is
multilingual it expects the sequences in a different format. A special language id token is added in both the source
and target text. The source text format is :obj:`X [eos, src_lang_code]` where :obj:`X` is the source text. The target
text format is :obj:`[tgt_lang_code] X [eos]`. :obj:`bos` is never used.
The :meth:`~transformers.MBartTokenizer.prepare_seq2seq_batch` handles this automatically and should be used to encode
the sequences for sequence-to-sequence fine-tuning.
- Supervised training
@@ -44,8 +46,8 @@ the sequences for sequence-to-sequence fine-tuning.
- Generation
While generating the target text set the :obj:`decoder_start_token_id` to the target language id.
The following example shows how to translate English to Romanian using the `facebook/mbart-large-en-ro` model.
While generating the target text set the :obj:`decoder_start_token_id` to the target language id. The following
example shows how to translate English to Romanian using the `facebook/mbart-large-en-ro` model.
.. code-block::
@@ -77,4 +79,11 @@ MBartForConditionalGeneration
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
.. autoclass:: transformers.MBartForConditionalGeneration
:members: forward
:members:
TFMBartForConditionalGeneration
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
.. autoclass:: transformers.TFMBartForConditionalGeneration
:members:

View File

@@ -14,23 +14,23 @@ The abstract from the paper is the following:
*Natural Language Processing (NLP) has recently achieved great success by using huge pre-trained models with hundreds
of millions of parameters. However, these models suffer from heavy model sizes and high latency such that they cannot
be deployed to resource-limited mobile devices. In this paper, we propose MobileBERT for compressing and accelerating
the popular BERT model. Like the original BERT, MobileBERT is task-agnostic, that is, it can be generically applied
to various downstream NLP tasks via simple fine-tuning. Basically, MobileBERT is a thin version of BERT_LARGE, while
equipped with bottleneck structures and a carefully designed balance between self-attentions and feed-forward
networks. To train MobileBERT, we first train a specially designed teacher model, an inverted-bottleneck incorporated
BERT_LARGE model. Then, we conduct knowledge transfer from this teacher to MobileBERT. Empirical studies show that
MobileBERT is 4.3x smaller and 5.5x faster than BERT_BASE while achieving competitive results on well-known
benchmarks. On the natural language inference tasks of GLUE, MobileBERT achieves a GLUEscore o 77.7
(0.6 lower than BERT_BASE), and 62 ms latency on a Pixel 4 phone. On the SQuAD v1.1/v2.0 question answering task,
MobileBERT achieves a dev F1 score of 90.0/79.2 (1.5/2.1 higher than BERT_BASE).*
the popular BERT model. Like the original BERT, MobileBERT is task-agnostic, that is, it can be generically applied to
various downstream NLP tasks via simple fine-tuning. Basically, MobileBERT is a thin version of BERT_LARGE, while
equipped with bottleneck structures and a carefully designed balance between self-attentions and feed-forward networks.
To train MobileBERT, we first train a specially designed teacher model, an inverted-bottleneck incorporated BERT_LARGE
model. Then, we conduct knowledge transfer from this teacher to MobileBERT. Empirical studies show that MobileBERT is
4.3x smaller and 5.5x faster than BERT_BASE while achieving competitive results on well-known benchmarks. On the
natural language inference tasks of GLUE, MobileBERT achieves a GLUEscore o 77.7 (0.6 lower than BERT_BASE), and 62 ms
latency on a Pixel 4 phone. On the SQuAD v1.1/v2.0 question answering task, MobileBERT achieves a dev F1 score of
90.0/79.2 (1.5/2.1 higher than BERT_BASE).*
Tips:
- MobileBERT is a model with absolute position embeddings so it's usually advised to pad the inputs on
the right rather than the left.
- MobileBERT is similar to BERT and therefore relies on the masked language modeling (MLM) objective.
It is therefore efficient at predicting masked tokens and at NLU in general, but is not optimal for
text generation. Models trained with a causal language modeling (CLM) objective are better in that regard.
- MobileBERT is a model with absolute position embeddings so it's usually advised to pad the inputs on the right rather
than the left.
- MobileBERT is similar to BERT and therefore relies on the masked language modeling (MLM) objective. It is therefore
efficient at predicting masked tokens and at NLU in general, but is not optimal for text generation. Models trained
with a causal language modeling (CLM) objective are better in that regard.
The original code can be found `here <https://github.com/google-research/mobilebert>`__.

View File

@@ -3,15 +3,14 @@ Pegasus
**DISCLAIMER:** If you see something strange, file a `Github Issue
<https://github.com/huggingface/transformers/issues/new?assignees=sshleifer&labels=&template=bug-report.md&title>`__
and assign @sshleifer.
and assign @patrickvonplaten.
Overview
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
The Pegasus model was proposed in `PEGASUS: Pre-training with Extracted Gap-sentences for
Abstractive Summarization <https://arxiv.org/pdf/1912.08777.pdf>`__ by Jingqing Zhang, Yao Zhao, Mohammad Saleh and
Peter J. Liu on Dec 18, 2019.
The Pegasus model was proposed in `PEGASUS: Pre-training with Extracted Gap-sentences for Abstractive Summarization
<https://arxiv.org/pdf/1912.08777.pdf>`__ by Jingqing Zhang, Yao Zhao, Mohammad Saleh and Peter J. Liu on Dec 18, 2019.
According to the abstract,
@@ -26,7 +25,7 @@ The Authors' code can be found `here <https://github.com/google-research/pegasus
Checkpoints
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
All the `checkpoints <https://huggingface.co/models?search=pegasus>`__ are fine-tuned for summarization, besides
All the `checkpoints <https://huggingface.co/models?search=pegasus>`__ are fine-tuned for summarization, besides
`pegasus-large`, whence the other checkpoints are fine-tuned:
- Each checkpoint is 2.2 GB on disk and 568M parameters.
@@ -44,6 +43,7 @@ Implementation Notes
- All models are transformer encoder-decoders with 16 layers in each component.
- The implementation is completely inherited from :class:`~transformers.BartForConditionalGeneration`
- Some key configuration differences:
- static, sinusoidal position embeddings
- no :obj:`layernorm_embedding` (:obj`PegasusConfig.normalize_embedding=False`)
- the model starts generating with pad_token_id (which has 0 token_embedding) as the prefix.
@@ -84,6 +84,7 @@ PegasusConfig
PegasusTokenizer
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
warning: ``add_tokens`` does not work at the moment.
.. autoclass:: transformers.PegasusTokenizer
@@ -94,3 +95,9 @@ PegasusForConditionalGeneration
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
.. autoclass:: transformers.PegasusForConditionalGeneration
TFPegasusForConditionalGeneration
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
.. autoclass:: transformers.TFPegasusForConditionalGeneration

View File

@@ -8,13 +8,24 @@ ProphetNet
Overview
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
The ProphetNet model was proposed in `ProphetNet: Predicting Future N-gram for Sequence-to-Sequence Pre-training, <https://arxiv.org/abs/2001.04063>`__ by Yu Yan, Weizhen Qi, Yeyun Gong, Dayiheng Liu, Nan Duan, Jiusheng Chen, Ruofei Zhang, Ming Zhou on 13 Jan, 2020.
The ProphetNet model was proposed in `ProphetNet: Predicting Future N-gram for Sequence-to-Sequence Pre-training,
<https://arxiv.org/abs/2001.04063>`__ by Yu Yan, Weizhen Qi, Yeyun Gong, Dayiheng Liu, Nan Duan, Jiusheng Chen, Ruofei
Zhang, Ming Zhou on 13 Jan, 2020.
ProphetNet is an encoder-decoder model and can predict n-future tokens for "ngram" language modeling instead of just the next token.
ProphetNet is an encoder-decoder model and can predict n-future tokens for "ngram" language modeling instead of just
the next token.
The abstract from the paper is the following:
*In this paper, we present a new sequence-to-sequence pre-training model called ProphetNet, which introduces a novel self-supervised objective named future n-gram prediction and the proposed n-stream self-attention mechanism. Instead of the optimization of one-step ahead prediction in traditional sequence-to-sequence model, the ProphetNet is optimized by n-step ahead prediction which predicts the next n tokens simultaneously based on previous context tokens at each time step. The future n-gram prediction explicitly encourages the model to plan for the future tokens and prevent overfitting on strong local correlations. We pre-train ProphetNet using a base scale dataset (16GB) and a large scale dataset (160GB) respectively. Then we conduct experiments on CNN/DailyMail, Gigaword, and SQuAD 1.1 benchmarks for abstractive summarization and question generation tasks. Experimental results show that ProphetNet achieves new state-of-the-art results on all these datasets compared to the models using the same scale pre-training corpus.*
*In this paper, we present a new sequence-to-sequence pre-training model called ProphetNet, which introduces a novel
self-supervised objective named future n-gram prediction and the proposed n-stream self-attention mechanism. Instead of
the optimization of one-step ahead prediction in traditional sequence-to-sequence model, the ProphetNet is optimized by
n-step ahead prediction which predicts the next n tokens simultaneously based on previous context tokens at each time
step. The future n-gram prediction explicitly encourages the model to plan for the future tokens and prevent
overfitting on strong local correlations. We pre-train ProphetNet using a base scale dataset (16GB) and a large scale
dataset (160GB) respectively. Then we conduct experiments on CNN/DailyMail, Gigaword, and SQuAD 1.1 benchmarks for
abstractive summarization and question generation tasks. Experimental results show that ProphetNet achieves new
state-of-the-art results on all these datasets compared to the models using the same scale pre-training corpus.*
The Authors' code can be found `here <https://github.com/microsoft/ProphetNet>`__.

View File

@@ -1,8 +1,8 @@
RAG
----------------------------------------------------
-----------------------------------------------------------------------------------------------------------------------
Overview
~~~~~~~~~~~~~~~~~~~~~
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Retrieval-augmented generation ("RAG") models combine the powers of pretrained dense retrieval (DPR) and
sequence-to-sequence models. RAG models retrieve documents, pass them to a seq2seq model, then marginalize to generate
@@ -15,46 +15,40 @@ Karpukhin, Naman Goyal, Heinrich Küttler, Mike Lewis, Wen-tau Yih, Tim Rocktäs
The abstract from the paper is the following:
*Large pre-trained language models have been shown to store factual knowledge
in their parameters, and achieve state-of-the-art results when fine-tuned on
downstream NLP tasks. However, their ability to access and precisely manipulate
knowledge is still limited, and hence on knowledge-intensive tasks, their
performance lags behind task-specific architectures. Additionally, providing
provenance for their decisions and updating their world knowledge remain open
research problems. Pre-trained models with a differentiable access mechanism to
explicit nonparametric memory can overcome this issue, but have so far been only
investigated for extractive downstream tasks. We explore a general-purpose
fine-tuning recipe for retrieval-augmented generation (RAG) — models which combine
pre-trained parametric and non-parametric memory for language generation. We
introduce RAG models where the parametric memory is a pre-trained seq2seq model and
the non-parametric memory is a dense vector index of Wikipedia, accessed with
a pre-trained neural retriever. We compare two RAG formulations, one which
conditions on the same retrieved passages across the whole generated sequence, the
other can use different passages per token. We fine-tune and evaluate our models
on a wide range of knowledge-intensive NLP tasks and set the state-of-the-art
on three open domain QA tasks, outperforming parametric seq2seq models and
task-specific retrieve-and-extract architectures. For language generation tasks, we
find that RAG models generate more specific, diverse and factual language than a
state-of-the-art parametric-only seq2seq baseline.*
*Large pre-trained language models have been shown to store factual knowledge in their parameters, and achieve
state-of-the-art results when fine-tuned on downstream NLP tasks. However, their ability to access and precisely
manipulate knowledge is still limited, and hence on knowledge-intensive tasks, their performance lags behind
task-specific architectures. Additionally, providing provenance for their decisions and updating their world knowledge
remain open research problems. Pre-trained models with a differentiable access mechanism to explicit nonparametric
memory can overcome this issue, but have so far been only investigated for extractive downstream tasks. We explore a
general-purpose fine-tuning recipe for retrieval-augmented generation (RAG) — models which combine pre-trained
parametric and non-parametric memory for language generation. We introduce RAG models where the parametric memory is a
pre-trained seq2seq model and the non-parametric memory is a dense vector index of Wikipedia, accessed with a
pre-trained neural retriever. We compare two RAG formulations, one which conditions on the same retrieved passages
across the whole generated sequence, the other can use different passages per token. We fine-tune and evaluate our
models on a wide range of knowledge-intensive NLP tasks and set the state-of-the-art on three open domain QA tasks,
outperforming parametric seq2seq models and task-specific retrieve-and-extract architectures. For language generation
tasks, we find that RAG models generate more specific, diverse and factual language than a state-of-the-art
parametric-only seq2seq baseline.*
RagConfig
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
.. autoclass:: transformers.RagConfig
:members:
RagTokenizer
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
.. autoclass:: transformers.RagTokenizer
:members: prepare_seq2seq_batch
Rag specific outputs
~~~~~~~~~~~~~~~~~~~~~
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
.. autoclass:: transformers.modeling_rag.RetrievAugLMMarginOutput
:members:
@@ -63,28 +57,28 @@ Rag specific outputs
:members:
RagRetriever
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
.. autoclass:: transformers.RagRetriever
:members:
RagModel
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
.. autoclass:: transformers.RagModel
:members: forward
RagSequenceForGeneration
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
.. autoclass:: transformers.RagSequenceForGeneration
:members: forward, generate
RagTokenForGeneration
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
.. autoclass:: transformers.RagTokenForGeneration
:members: forward, generate

View File

@@ -10,7 +10,7 @@ Overview
The Reformer model was proposed in the paper `Reformer: The Efficient Transformer
<https://arxiv.org/abs/2001.04451.pdf>`__ by Nikita Kitaev, Łukasz Kaiser, Anselm Levskaya.
The abstract from the paper is the following:
The abstract from the paper is the following:
*Large Transformer models routinely achieve state-of-the-art results on a number of tasks but training these models can
be prohibitively costly, especially on long sequences. We introduce two techniques to improve the efficiency of
@@ -36,12 +36,12 @@ would result in a position encoding matrix:
.. math::
X_{i,j}, \text{ with } i \in \left[1,\ldots, d\right] \text{ and } j \in \left[1,\ldots, n_s\right]
which alone has over 500M parameters to store. Axial positional encodings factorize :math:`X_{i,j}` into two matrices:
which alone has over 500M parameters to store. Axial positional encodings factorize :math:`X_{i,j}` into two matrices:
.. math::
X^{1}_{i,j}, \text{ with } i \in \left[1,\ldots, d^1\right] \text{ and } j \in \left[1,\ldots, n_s^1\right]
and
and
.. math::
X^{2}_{i,j}, \text{ with } i \in \left[1,\ldots, d^2\right] \text{ and } j \in \left[1,\ldots, n_s^2\right]
@@ -67,22 +67,23 @@ factorized embedding vectors: :math:`x^1_{k, l} + x^2_{l, k}`, where as the :obj
Using the above example again, axial position encoding with :math:`d^1 = 2^5, d^2 = 2^5, n_s^1 = 2^9, n_s^2 = 2^{10}`
can drastically reduced the number of parameters to :math:`2^{14} + 2^{15} \approx 49000` parameters.
In practice, the parameter :obj:`config.axial_pos_embds_dim` is set to a tuple :math:`(d^1, d^2)` which sum has to
be equal to :obj:`config.hidden_size` and :obj:`config.axial_pos_shape` is set to a tuple :math:`(n_s^1, n_s^2)` which
product has to be equal to :obj:`config.max_embedding_size`, which during training has to be equal to the
`sequence length` of the :obj:`input_ids`.
In practice, the parameter :obj:`config.axial_pos_embds_dim` is set to a tuple :math:`(d^1, d^2)` which sum has to be
equal to :obj:`config.hidden_size` and :obj:`config.axial_pos_shape` is set to a tuple :math:`(n_s^1, n_s^2)` which
product has to be equal to :obj:`config.max_embedding_size`, which during training has to be equal to the `sequence
length` of the :obj:`input_ids`.
LSH Self Attention
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
In Locality sensitive hashing (LSH) self attention the key and query projection weights are tied. Therefore, the key
query embedding vectors are also tied. LSH self attention uses the locality sensitive hashing mechanism proposed in
`Practical and Optimal LSH for Angular Distance <https://arxiv.org/abs/1509.02897>`__ to assign each of the tied key
query embedding vectors to one of :obj:`config.num_buckets` possible buckets. The premise is that the more "similar"
key query embedding vectors (in terms of *cosine similarity*) are to each other, the more likely they are assigned to
the same bucket.
the same bucket.
The accuracy of the LSH mechanism can be improved by increasing :obj:`config.num_hashes` or directly the argument
The accuracy of the LSH mechanism can be improved by increasing :obj:`config.num_hashes` or directly the argument
:obj:`num_hashes` of the forward function so that the output of the LSH self attention better approximates the output
of the "normal" full self attention. The buckets are then sorted and chunked into query key embedding vector chunks
each of length :obj:`config.lsh_chunk_length`. For each chunk, the query embedding vectors attend to its key vectors
@@ -92,11 +93,11 @@ neighboring chunks and :obj:`config.lsh_num_chunks_after` following neighboring
For more information, see the `original Paper <https://arxiv.org/abs/2001.04451>`__ or this great `blog post
<https://www.pragmatic.ml/reformer-deep-dive/>`__.
Note that :obj:`config.num_buckets` can also be factorized into a list
:math:`(n_{\text{buckets}}^1, n_{\text{buckets}}^2)`. This way instead of assigning the query key embedding vectors to
one of :math:`(1,\ldots, n_{\text{buckets}})` they are assigned to one of
:math:`(1-1,\ldots, n_{\text{buckets}}^1-1, \ldots, 1-n_{\text{buckets}}^2, \ldots, n_{\text{buckets}}^1-n_{\text{buckets}}^2)`.
This is crucial for very long sequences to save memory.
Note that :obj:`config.num_buckets` can also be factorized into a list :math:`(n_{\text{buckets}}^1,
n_{\text{buckets}}^2)`. This way instead of assigning the query key embedding vectors to one of :math:`(1,\ldots,
n_{\text{buckets}})` they are assigned to one of :math:`(1-1,\ldots, n_{\text{buckets}}^1-1, \ldots,
1-n_{\text{buckets}}^2, \ldots, n_{\text{buckets}}^1-n_{\text{buckets}}^2)`. This is crucial for very long sequences to
save memory.
When training a model from scratch, it is recommended to leave :obj:`config.num_buckets=None`, so that depending on the
sequence length a good value for :obj:`num_buckets` is calculated on the fly. This value will then automatically be
@@ -128,7 +129,7 @@ multiple of :obj:`config.lsh_chunk_length` and :obj:`config.local_chunk_length`
Positional Encodings are correctly set as described above. Reformer is very memory efficient so that the model can
easily be trained on sequences as long as 64000 tokens.
For training, the :class:`~transformers.ReformerModelWithLMHead` should be used as follows:
For training, the :class:`~transformers.ReformerModelWithLMHead` should be used as follows:
.. code-block::

View File

@@ -8,8 +8,8 @@ The RoBERTa model was proposed in `RoBERTa: A Robustly Optimized BERT Pretrainin
<https://arxiv.org/abs/1907.11692>`_ by Yinhan Liu, Myle Ott, Naman Goyal, Jingfei Du, Mandar Joshi, Danqi Chen, Omer
Levy, Mike Lewis, Luke Zettlemoyer, Veselin Stoyanov. It is based on Google's BERT model released in 2018.
It builds on BERT and modifies key hyperparameters, removing the next-sentence pretraining
objective and training with much larger mini-batches and learning rates.
It builds on BERT and modifies key hyperparameters, removing the next-sentence pretraining objective and training with
much larger mini-batches and learning rates.
The abstract from the paper is the following:
@@ -17,15 +17,15 @@ The abstract from the paper is the following:
approaches is challenging. Training is computationally expensive, often done on private datasets of different sizes,
and, as we will show, hyperparameter choices have significant impact on the final results. We present a replication
study of BERT pretraining (Devlin et al., 2019) that carefully measures the impact of many key hyperparameters and
training data size. We find that BERT was significantly undertrained, and can match or exceed the performance of
every model published after it. Our best model achieves state-of-the-art results on GLUE, RACE and SQuAD. These
results highlight the importance of previously overlooked design choices, and raise questions about the source
of recently reported improvements. We release our models and code.*
training data size. We find that BERT was significantly undertrained, and can match or exceed the performance of every
model published after it. Our best model achieves state-of-the-art results on GLUE, RACE and SQuAD. These results
highlight the importance of previously overlooked design choices, and raise questions about the source of recently
reported improvements. We release our models and code.*
Tips:
- This implementation is the same as :class:`~transformers.BertModel` with a tiny embeddings tweak as well as a
setup for Roberta pretrained models.
- This implementation is the same as :class:`~transformers.BertModel` with a tiny embeddings tweak as well as a setup
for Roberta pretrained models.
- RoBERTa has the same architecture as BERT, but uses a byte-level BPE as a tokenizer (same as GPT-2) and uses a
different pretraining scheme.
- RoBERTa doesn't have :obj:`token_type_ids`, you don't need to indicate which token belongs to which segment. Just

View File

@@ -4,38 +4,34 @@ SqueezeBERT
Overview
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
The SqueezeBERT model was proposed in
`SqueezeBERT: What can computer vision teach NLP about efficient neural networks?
<https://arxiv.org/abs/2006.11316>`__
by Forrest N. Iandola, Albert E. Shaw, Ravi Krishna, Kurt W. Keutzer.
It's a bidirectional transformer similar to the BERT model.
The key difference between the BERT architecture and the SqueezeBERT architecture
is that SqueezeBERT uses `grouped convolutions <https://blog.yani.io/filter-group-tutorial>`__
The SqueezeBERT model was proposed in `SqueezeBERT: What can computer vision teach NLP about efficient neural networks?
<https://arxiv.org/abs/2006.11316>`__ by Forrest N. Iandola, Albert E. Shaw, Ravi Krishna, Kurt W. Keutzer. It's a
bidirectional transformer similar to the BERT model. The key difference between the BERT architecture and the
SqueezeBERT architecture is that SqueezeBERT uses `grouped convolutions <https://blog.yani.io/filter-group-tutorial>`__
instead of fully-connected layers for the Q, K, V and FFN layers.
The abstract from the paper is the following:
*Humans read and write hundreds of billions of messages every day. Further, due to the availability of
large datasets, large computing systems, and better neural network models, natural language processing (NLP)
technology has made significant strides in understanding, proofreading, and organizing these messages.
Thus, there is a significant opportunity to deploy NLP in myriad applications to help web users,
social networks, and businesses. In particular, we consider smartphones and other mobile devices as
crucial platforms for deploying NLP models at scale. However, today's highly-accurate NLP neural network
models such as BERT and RoBERTa are extremely computationally expensive, with BERT-base taking 1.7 seconds
to classify a text snippet on a Pixel 3 smartphone. In this work, we observe that methods such as grouped
convolutions have yielded significant speedups for computer vision networks, but many of these techniques
have not been adopted by NLP neural network designers. We demonstrate how to replace several operations in
self-attention layers with grouped convolutions, and we use this technique in a novel network architecture
called SqueezeBERT, which runs 4.3x faster than BERT-base on the Pixel 3 while achieving competitive
accuracy on the GLUE test set. The SqueezeBERT code will be released.*
*Humans read and write hundreds of billions of messages every day. Further, due to the availability of large datasets,
large computing systems, and better neural network models, natural language processing (NLP) technology has made
significant strides in understanding, proofreading, and organizing these messages. Thus, there is a significant
opportunity to deploy NLP in myriad applications to help web users, social networks, and businesses. In particular, we
consider smartphones and other mobile devices as crucial platforms for deploying NLP models at scale. However, today's
highly-accurate NLP neural network models such as BERT and RoBERTa are extremely computationally expensive, with
BERT-base taking 1.7 seconds to classify a text snippet on a Pixel 3 smartphone. In this work, we observe that methods
such as grouped convolutions have yielded significant speedups for computer vision networks, but many of these
techniques have not been adopted by NLP neural network designers. We demonstrate how to replace several operations in
self-attention layers with grouped convolutions, and we use this technique in a novel network architecture called
SqueezeBERT, which runs 4.3x faster than BERT-base on the Pixel 3 while achieving competitive accuracy on the GLUE test
set. The SqueezeBERT code will be released.*
Tips:
- SqueezeBERT is a model with absolute position embeddings so it's usually advised to pad the inputs on
the right rather than the left.
- SqueezeBERT is similar to BERT and therefore relies on the masked language modeling (MLM) objective.
It is therefore efficient at predicting masked tokens and at NLU in general, but is not optimal for
text generation. Models trained with a causal language modeling (CLM) objective are better in that regard.
- SqueezeBERT is a model with absolute position embeddings so it's usually advised to pad the inputs on the right
rather than the left.
- SqueezeBERT is similar to BERT and therefore relies on the masked language modeling (MLM) objective. It is therefore
efficient at predicting masked tokens and at NLU in general, but is not optimal for text generation. Models trained
with a causal language modeling (CLM) objective are better in that regard.
- For best results when finetuning on sequence classification tasks, it is recommended to start with the
`squeezebert/squeezebert-mnli-headless` checkpoint.

View File

@@ -29,13 +29,12 @@ Tips:
each task is converted into a text-to-text format. T5 works well on a variety of tasks out-of-the-box by prepending a
different prefix to the input corresponding to each task, e.g., for translation: *translate English to German: ...*,
for summarization: *summarize: ...*.
For more information about which prefix to use, it is easiest to look into Appendix D of the `paper
<https://arxiv.org/pdf/1910.10683.pdf>`__.
- For sequence-to-sequence generation, it is recommended to use :obj:`T5ForConditionalGeneration.generate()``. This
method takes care of feeding the encoded input via cross-attention layers to the decoder and auto-regressively
generates the decoder output.
- T5 uses relative scalar embeddings. Encoder input padding can be done on the left and on the right.
<https://arxiv.org/pdf/1910.10683.pdf>`__. - For sequence-to-sequence generation, it is recommended to use
:obj:`T5ForConditionalGeneration.generate()``. This method takes care of feeding the encoded input via
cross-attention layers to the decoder and auto-regressively generates the decoder output. - T5 uses relative scalar
embeddings. Encoder input padding can be done on the left and on the right.
The original code can be found `here <https://github.com/google-research/text-to-text-transfer-transformer>`__.
@@ -51,14 +50,14 @@ token. T5 can be trained / fine-tuned both in a supervised and unsupervised fash
- Unsupervised denoising training
In this setup spans of the input sequence are masked by so-called sentinel tokens (*a.k.a* unique mask tokens)
and the output sequence is formed as a concatenation of the same sentinel tokens and the *real* masked tokens.
Each sentinel token represents a unique mask token for this sentence and should start with :obj:`<extra_id_0>`,
In this setup spans of the input sequence are masked by so-called sentinel tokens (*a.k.a* unique mask tokens) and
the output sequence is formed as a concatenation of the same sentinel tokens and the *real* masked tokens. Each
sentinel token represents a unique mask token for this sentence and should start with :obj:`<extra_id_0>`,
:obj:`<extra_id_1>`, ... up to :obj:`<extra_id_99>`. As a default, 100 sentinel tokens are available in
:class:`~transformers.T5Tokenizer`.
For instance, the sentence "The cute dog walks in the park" with the masks put on "cute dog" and "the" should be
processed as follows:
processed as follows:
.. code-block::
@@ -69,10 +68,10 @@ token. T5 can be trained / fine-tuned both in a supervised and unsupervised fash
- Supervised training
In this setup the input sequence and output sequence are standard sequence-to-sequence input output mapping.
In translation, for instance with the input sequence "The house is wonderful." and output sequence "Das Haus ist
In this setup the input sequence and output sequence are standard sequence-to-sequence input output mapping. In
translation, for instance with the input sequence "The house is wonderful." and output sequence "Das Haus ist
wunderbar.", the sentences should be processed as follows:
.. code-block::
input_ids = tokenizer('translate English to German: The house is wonderful.', return_tensors='pt').input_ids

View File

@@ -14,19 +14,19 @@ The abstract from the paper is the following:
*Transformers have a potential of learning longer-term dependency, but are limited by a fixed-length context in the
setting of language modeling. We propose a novel neural architecture Transformer-XL that enables learning dependency
beyond a fixed length without disrupting temporal coherence. It consists of a segment-level recurrence mechanism and
a novel positional encoding scheme. Our method not only enables capturing longer-term dependency, but also resolves
the context fragmentation problem. As a result, Transformer-XL learns dependency that is 80% longer than RNNs and
450% longer than vanilla Transformers, achieves better performance on both short and long sequences, and is up
to 1,800+ times faster than vanilla Transformers during evaluation. Notably, we improve the state-of-the-art results
of bpc/perplexity to 0.99 on enwiki8, 1.08 on text8, 18.3 on WikiText-103, 21.8 on One Billion Word, and 54.5 on
Penn Treebank (without finetuning). When trained only on WikiText-103, Transformer-XL manages to generate reasonably
beyond a fixed length without disrupting temporal coherence. It consists of a segment-level recurrence mechanism and a
novel positional encoding scheme. Our method not only enables capturing longer-term dependency, but also resolves the
context fragmentation problem. As a result, Transformer-XL learns dependency that is 80% longer than RNNs and 450%
longer than vanilla Transformers, achieves better performance on both short and long sequences, and is up to 1,800+
times faster than vanilla Transformers during evaluation. Notably, we improve the state-of-the-art results of
bpc/perplexity to 0.99 on enwiki8, 1.08 on text8, 18.3 on WikiText-103, 21.8 on One Billion Word, and 54.5 on Penn
Treebank (without finetuning). When trained only on WikiText-103, Transformer-XL manages to generate reasonably
coherent, novel text articles with thousands of tokens.*
Tips:
- Transformer-XL uses relative sinusoidal positional embeddings. Padding can be done on the left or on the right.
The original implementation trains on SQuAD with padding on the left, therefore the padding defaults are set to left.
- Transformer-XL uses relative sinusoidal positional embeddings. Padding can be done on the left or on the right. The
original implementation trains on SQuAD with padding on the left, therefore the padding defaults are set to left.
- Transformer-XL is one of the few models that has no sequence length limit.
The original code can be found `here <https://github.com/kimiyoung/transformer-xl>`__.

View File

@@ -14,21 +14,21 @@ Guillaume Lample, Alexis Conneau. It's a transformer pretrained using one of the
The abstract from the paper is the following:
*Recent studies have demonstrated the efficiency of generative pretraining for English natural language understanding.
In this work, we extend this approach to multiple languages and show the effectiveness of cross-lingual pretraining.
We propose two methods to learn cross-lingual language models (XLMs): one unsupervised that only relies on monolingual
In this work, we extend this approach to multiple languages and show the effectiveness of cross-lingual pretraining. We
propose two methods to learn cross-lingual language models (XLMs): one unsupervised that only relies on monolingual
data, and one supervised that leverages parallel data with a new cross-lingual language model objective. We obtain
state-of-the-art results on cross-lingual classification, unsupervised and supervised machine translation. On XNLI,
our approach pushes the state of the art by an absolute gain of 4.9% accuracy. On unsupervised machine translation,
we obtain 34.3 BLEU on WMT'16 German-English, improving the previous state of the art by more than 9 BLEU. On
supervised machine translation, we obtain a new state of the art of 38.5 BLEU on WMT'16 Romanian-English, outperforming
the previous best approach by more than 4 BLEU. Our code and pretrained models will be made publicly available.*
state-of-the-art results on cross-lingual classification, unsupervised and supervised machine translation. On XNLI, our
approach pushes the state of the art by an absolute gain of 4.9% accuracy. On unsupervised machine translation, we
obtain 34.3 BLEU on WMT'16 German-English, improving the previous state of the art by more than 9 BLEU. On supervised
machine translation, we obtain a new state of the art of 38.5 BLEU on WMT'16 Romanian-English, outperforming the
previous best approach by more than 4 BLEU. Our code and pretrained models will be made publicly available.*
Tips:
- XLM has many different checkpoints, which were trained using different objectives: CLM, MLM or TLM. Make sure to
select the correct objective for your task (e.g. MLM checkpoints are not suitable for generation).
- XLM has multilingual checkpoints which leverage a specific :obj:`lang` parameter. Check out the
:doc:`multi-lingual <../multilingual>` page for more information.
- XLM has multilingual checkpoints which leverage a specific :obj:`lang` parameter. Check out the :doc:`multi-lingual
<../multilingual>` page for more information.
The original code can be found `here <https://github.com/facebookresearch/XLM/>`__.

View File

@@ -9,13 +9,25 @@ XLM-ProphetNet
Overview
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
The XLM-ProphetNet model was proposed in `ProphetNet: Predicting Future N-gram for Sequence-to-Sequence Pre-training, <https://arxiv.org/abs/2001.04063>`__ by Yu Yan, Weizhen Qi, Yeyun Gong, Dayiheng Liu, Nan Duan, Jiusheng Chen, Ruofei Zhang, Ming Zhou on 13 Jan, 2020.
The XLM-ProphetNet model was proposed in `ProphetNet: Predicting Future N-gram for Sequence-to-Sequence Pre-training,
<https://arxiv.org/abs/2001.04063>`__ by Yu Yan, Weizhen Qi, Yeyun Gong, Dayiheng Liu, Nan Duan, Jiusheng Chen, Ruofei
Zhang, Ming Zhou on 13 Jan, 2020.
XLM-ProphetNet is an encoder-decoder model and can predict n-future tokens for "ngram" language modeling instead of just the next token. Its architecture is identical to ProhpetNet, but the model was trained on the multi-lingual "wiki100" Wikipedia dump.
XLM-ProphetNet is an encoder-decoder model and can predict n-future tokens for "ngram" language modeling instead of
just the next token. Its architecture is identical to ProhpetNet, but the model was trained on the multi-lingual
"wiki100" Wikipedia dump.
The abstract from the paper is the following:
*In this paper, we present a new sequence-to-sequence pre-training model called ProphetNet, which introduces a novel self-supervised objective named future n-gram prediction and the proposed n-stream self-attention mechanism. Instead of the optimization of one-step ahead prediction in traditional sequence-to-sequence model, the ProphetNet is optimized by n-step ahead prediction which predicts the next n tokens simultaneously based on previous context tokens at each time step. The future n-gram prediction explicitly encourages the model to plan for the future tokens and prevent overfitting on strong local correlations. We pre-train ProphetNet using a base scale dataset (16GB) and a large scale dataset (160GB) respectively. Then we conduct experiments on CNN/DailyMail, Gigaword, and SQuAD 1.1 benchmarks for abstractive summarization and question generation tasks. Experimental results show that ProphetNet achieves new state-of-the-art results on all these datasets compared to the models using the same scale pre-training corpus.*
*In this paper, we present a new sequence-to-sequence pre-training model called ProphetNet, which introduces a novel
self-supervised objective named future n-gram prediction and the proposed n-stream self-attention mechanism. Instead of
the optimization of one-step ahead prediction in traditional sequence-to-sequence model, the ProphetNet is optimized by
n-step ahead prediction which predicts the next n tokens simultaneously based on previous context tokens at each time
step. The future n-gram prediction explicitly encourages the model to plan for the future tokens and prevent
overfitting on strong local correlations. We pre-train ProphetNet using a base scale dataset (16GB) and a large scale
dataset (160GB) respectively. Then we conduct experiments on CNN/DailyMail, Gigaword, and SQuAD 1.1 benchmarks for
abstractive summarization and question generation tasks. Experimental results show that ProphetNet achieves new
state-of-the-art results on all these datasets compared to the models using the same scale pre-training corpus.*
The Authors' code can be found `here <https://github.com/microsoft/ProphetNet>`__.

View File

@@ -12,25 +12,25 @@ data.
The abstract from the paper is the following:
*This paper shows that pretraining multilingual language models at scale leads to significant performance gains for
a wide range of cross-lingual transfer tasks. We train a Transformer-based masked language model on one hundred
*This paper shows that pretraining multilingual language models at scale leads to significant performance gains for a
wide range of cross-lingual transfer tasks. We train a Transformer-based masked language model on one hundred
languages, using more than two terabytes of filtered CommonCrawl data. Our model, dubbed XLM-R, significantly
outperforms multilingual BERT (mBERT) on a variety of cross-lingual benchmarks, including +13.8% average accuracy
on XNLI, +12.3% average F1 score on MLQA, and +2.1% average F1 score on NER. XLM-R performs particularly well on
low-resource languages, improving 11.8% in XNLI accuracy for Swahili and 9.2% for Urdu over the previous XLM model.
We also present a detailed empirical evaluation of the key factors that are required to achieve these gains,
including the trade-offs between (1) positive transfer and capacity dilution and (2) the performance of high and
low resource languages at scale. Finally, we show, for the first time, the possibility of multilingual modeling
without sacrificing per-language performance; XLM-Ris very competitive with strong monolingual models on the GLUE
and XNLI benchmarks. We will make XLM-R code, data, and models publicly available.*
outperforms multilingual BERT (mBERT) on a variety of cross-lingual benchmarks, including +13.8% average accuracy on
XNLI, +12.3% average F1 score on MLQA, and +2.1% average F1 score on NER. XLM-R performs particularly well on
low-resource languages, improving 11.8% in XNLI accuracy for Swahili and 9.2% for Urdu over the previous XLM model. We
also present a detailed empirical evaluation of the key factors that are required to achieve these gains, including the
trade-offs between (1) positive transfer and capacity dilution and (2) the performance of high and low resource
languages at scale. Finally, we show, for the first time, the possibility of multilingual modeling without sacrificing
per-language performance; XLM-Ris very competitive with strong monolingual models on the GLUE and XNLI benchmarks. We
will make XLM-R code, data, and models publicly available.*
Tips:
- XLM-RoBERTa is a multilingual model trained on 100 different languages. Unlike some XLM multilingual models, it does
not require :obj:`lang` tensors to understand which language is used, and should be able to determine the correct
language from the input ids.
- This implementation is the same as RoBERTa. Refer to the :doc:`documentation of RoBERTa <roberta>` for usage
examples as well as the information relative to the inputs and outputs.
- This implementation is the same as RoBERTa. Refer to the :doc:`documentation of RoBERTa <roberta>` for usage examples
as well as the information relative to the inputs and outputs.
The original code can be found `here <https://github.com/pytorch/fairseq/tree/master/examples/xlmr>`__.

View File

@@ -16,11 +16,11 @@ The abstract from the paper is the following:
better performance than pretraining approaches based on autoregressive language modeling. However, relying on
corrupting the input with masks, BERT neglects dependency between the masked positions and suffers from a
pretrain-finetune discrepancy. In light of these pros and cons, we propose XLNet, a generalized autoregressive
pretraining method that (1) enables learning bidirectional contexts by maximizing the expected likelihood over
all permutations of the factorization order and (2) overcomes the limitations of BERT thanks to its autoregressive
formulation. Furthermore, XLNet integrates ideas from Transformer-XL, the state-of-the-art autoregressive model,
into pretraining. Empirically, under comparable experiment settings, XLNet outperforms BERT on 20 tasks, often by
a large margin, including question answering, natural language inference, sentiment analysis, and document ranking.*
pretraining method that (1) enables learning bidirectional contexts by maximizing the expected likelihood over all
permutations of the factorization order and (2) overcomes the limitations of BERT thanks to its autoregressive
formulation. Furthermore, XLNet integrates ideas from Transformer-XL, the state-of-the-art autoregressive model, into
pretraining. Empirically, under comparable experiment settings, XLNet outperforms BERT on 20 tasks, often by a large
margin, including question answering, natural language inference, sentiment analysis, and document ranking.*
Tips:

View File

@@ -15,42 +15,68 @@ Prepare your model for uploading
We have seen in the :doc:`training tutorial <training>`: how to fine-tune a model on a given task. You have probably
done something similar on your task, either using the model directly in your own training loop or using the
:class:`~.transformers.Trainer`/:class:`~.transformers.TFTrainer` class. Let's see how you can share the result on
the `model hub <https://huggingface.co/models>`__.
:class:`~.transformers.Trainer`/:class:`~.transformers.TFTrainer` class. Let's see how you can share the result on the
`model hub <https://huggingface.co/models>`__.
Model versioning
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
Since version v3.5.0, the model hub has built-in model versioning based on git and git-lfs. It is based on the paradigm
that one model *is* one repo.
This allows:
- built-in versioning
- access control
- scalability
This is built around *revisions*, which is a way to pin a specific version of a model, using a commit hash, tag or
branch.
For instance:
.. code-block::
>>> tokenizer = AutoTokenizer.from_pretrained(
>>> "julien-c/EsperBERTo-small",
>>> revision="v2.0.1" # tag name, or branch name, or commit hash
>>> )
Basic steps
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
..
When #5258 is merged, we can remove the need to create the directory.
In order to upload a model, you'll need to first create a git repo. This repo will live on the model hub, allowing
users to clone it and you (and your organization members) to push to it. First, you should ensure you are logged in the
``transformers-cli``:
First, pick a directory with the name you want your model to have on the model hub (its full name will then be
`username/awesome-name-you-picked` or `organization/awesome-name-you-picked`) and create it with either
Go in a terminal and run the following command. It should be in the virtual environment where you installed 🤗
Transformers, since that command :obj:`transformers-cli` comes from the library.
.. code-block::
mkdir path/to/awesome-name-you-picked
transformers-cli login
or in python
Once you are logged in with your model hub credentials, you can start building your repositories. To create a repo:
.. code-block::
import os
os.makedirs("path/to/awesome-name-you-picked")
transformers-cli repo create your-model-name
then you can save your model and tokenizer with:
This creates a repo on the model hub, which can be cloned. You can then add/remove from that repo as you would with any
other git repo.
.. code-block::
model.save_pretrained("path/to/awesome-name-you-picked")
tokenizer.save_pretrained("path/to/awesome-name-you-picked")
git clone https://huggingface.co/username/your-model-name
Or, if you're using the Trainer API
# Then commit as usual
cd your-model-name
echo "hello" >> README.md
git add . && git commit -m "Update from $USER"
.. code-block::
We are intentionally not wrapping git too much, so as to stay intuitive and easy-to-use.
trainer.save_model("path/to/awesome-name-you-picked")
tokenizer.save_pretrained("path/to/awesome-name-you-picked")
Make your model work on all frameworks
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
@@ -60,26 +86,24 @@ Make your model work on all frameworks
You probably have your favorite framework, but so will other users! That's why it's best to upload your model with both
PyTorch `and` TensorFlow checkpoints to make it easier to use (if you skip this step, users will still be able to load
your model in another framework, but it will be slower, as it will have to be converted on the fly). Don't worry, it's super easy to do (and in a future version,
it will all be automatic). You will need to install both PyTorch and TensorFlow for this step, but you don't need to
worry about the GPU, so it should be very easy. Check the
`TensorFlow installation page <https://www.tensorflow.org/install/pip#tensorflow-2.0-rc-is-available>`__
and/or the `PyTorch installation page <https://pytorch.org/get-started/locally/#start-locally>`__ to see how.
your model in another framework, but it will be slower, as it will have to be converted on the fly). Don't worry, it's
super easy to do (and in a future version, it will all be automatic). You will need to install both PyTorch and
TensorFlow for this step, but you don't need to worry about the GPU, so it should be very easy. Check the `TensorFlow
installation page <https://www.tensorflow.org/install/pip#tensorflow-2.0-rc-is-available>`__ and/or the `PyTorch
installation page <https://pytorch.org/get-started/locally/#start-locally>`__ to see how.
First check that your model class exists in the other framework, that is try to import the same model by either adding
or removing TF. For instance, if you trained a :class:`~transformers.DistilBertForSequenceClassification`, try to
type
or removing TF. For instance, if you trained a :class:`~transformers.DistilBertForSequenceClassification`, try to type
.. code-block::
from transformers import TFDistilBertForSequenceClassification
>>> from transformers import TFDistilBertForSequenceClassification
and if you trained a :class:`~transformers.TFDistilBertForSequenceClassification`, try to
type
and if you trained a :class:`~transformers.TFDistilBertForSequenceClassification`, try to type
.. code-block::
from transformers import DistilBertForSequenceClassification
>>> from transformers import DistilBertForSequenceClassification
This will give back an error if your model does not exist in the other framework (something that should be pretty rare
since we're aiming for full parity between the two frameworks). In this case, skip this and go to the next step.
@@ -89,20 +113,20 @@ model class:
.. code-block::
tf_model = TFDistilBertForSequenceClassification.from_pretrained("path/to/awesome-name-you-picked", from_pt=True)
tf_model.save_pretrained("path/to/awesome-name-you-picked")
>>> tf_model = TFDistilBertForSequenceClassification.from_pretrained("path/to/awesome-name-you-picked", from_pt=True)
>>> tf_model.save_pretrained("path/to/awesome-name-you-picked")
and if you trained your model in TensorFlow and have to create a PyTorch version, adapt the following code to your
model class:
.. code-block::
pt_model = DistilBertForSequenceClassification.from_pretrained("path/to/awesome-name-you-picked", from_tf=True)
pt_model.save_pretrained("path/to/awesome-name-you-picked")
>>> pt_model = DistilBertForSequenceClassification.from_pretrained("path/to/awesome-name-you-picked", from_tf=True)
>>> pt_model.save_pretrained("path/to/awesome-name-you-picked")
That's all there is to it!
Check the directory before uploading
Check the directory before pushing to the model hub.
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
Make sure there are no garbage files in the directory you'll upload. It should only have:
@@ -112,66 +136,52 @@ Make sure there are no garbage files in the directory you'll upload. It should o
- a `tf_model.h5` file, which is the TensorFlow checkpoint (unless you can't have it for some reason) ;
- a `special_tokens_map.json`, which is part of your :doc:`tokenizer <main_classes/tokenizer>` save;
- a `tokenizer_config.json`, which is part of your :doc:`tokenizer <main_classes/tokenizer>` save;
- files named `vocab.json`, `vocab.txt`, `merges.txt`, or similar, which contain the vocabulary of your tokenizer, part of your :doc:`tokenizer <main_classes/tokenizer>` save;
- files named `vocab.json`, `vocab.txt`, `merges.txt`, or similar, which contain the vocabulary of your tokenizer, part
of your :doc:`tokenizer <main_classes/tokenizer>` save;
- maybe a `added_tokens.json`, which is part of your :doc:`tokenizer <main_classes/tokenizer>` save.
Other files can safely be deleted.
Upload your model with the CLI
Uploading your files
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Now go in a terminal and run the following command. It should be in the virtual enviromnent where you installed 🤗
Transformers, since that command :obj:`transformers-cli` comes from the library.
Once the repo is cloned, you can add the model, configuration and tokenizer files. For instance, saving the model and
tokenizer files:
.. code-block::
transformers-cli login
>>> model.save_pretrained("path/to/repo/clone/your-model-name")
>>> tokenizer.save_pretrained("path/to/repo/clone/your-model-name")
Then log in using the same credentials as on huggingface.co. To upload your model, just type
Or, if you're using the Trainer API
.. code-block::
transformers-cli upload path/to/awesome-name-you-picked/
>>> trainer.save_model("path/to/awesome-name-you-picked")
This will upload the folder containing the weights, tokenizer and configuration we prepared in the previous section.
By default you will be prompted to confirm that you want these files to be uploaded. If you are uploading multiple models and need to script that process, you can add `-y` to bypass the prompt. For example:
You can then add these files to the staging environment and verify that they have been correctly staged with the ``git
status`` command:
.. code-block::
transformers-cli upload -y path/to/awesome-name-you-picked/
git add --all
git status
If you want to upload a single file (a new version of your model, or the other framework checkpoint you want to add),
just type:
Finally, the files should be comitted:
.. code-block::
transformers-cli upload path/to/awesome-name-you-picked/that-file
git commit -m "First version of the your-model-name model and tokenizer."
or
And pushed to the remote:
.. code-block::
transformers-cli upload path/to/awesome-name-you-picked/that-file --filename awesome-name-you-picked/new_name
git push
if you want to change its filename.
This will upload the folder containing the weights, tokenizer and configuration we have just prepared.
This uploads the model to your personal account. If you want your model to be namespaced by your organization name
rather than your username, add the following flag to any command:
.. code-block::
--organization organization_name
so for instance:
.. code-block::
transformers-cli upload path/to/awesome-name-you-picked/ --organization organization_name
Your model will then be accessible through its identifier, which is, as we saw above,
`username/awesome-name-you-picked` or `organization/awesome-name-you-picked`.
Add a model card
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
@@ -179,15 +189,15 @@ Add a model card
To make sure everyone knows what your model can do, what its limitations and potential bias or ethetical
considerations, please add a README.md model card to the 🤗 Transformers repo under `model_cards/`. It should then be
placed in a subfolder with your username or organization, then another subfolder named like your model
(`awesome-name-you-picked`). Or just click on the "Create a model card on GitHub" button on the model page, it will
get you directly to the right location. If you need one, `here <https://github.com/huggingface/model_card>`__ is a
model card template (meta-suggestions are welcome).
(`awesome-name-you-picked`). Or just click on the "Create a model card on GitHub" button on the model page, it will get
you directly to the right location. If you need one, `here <https://github.com/huggingface/model_card>`__ is a model
card template (meta-suggestions are welcome).
If your model is fine-tuned from another model coming from the model hub (all 🤗 Transformers pretrained models do),
don't forget to link to its model card so that people can fully trace how your model was built.
If you have never made a pull request to the 🤗 Transformers repo, look at the
:doc:`contributing guide <contributing>` to see the steps to follow.
If you have never made a pull request to the 🤗 Transformers repo, look at the :doc:`contributing guide <contributing>`
to see the steps to follow.
.. Note::
@@ -203,20 +213,15 @@ Anyone can load it from code:
.. code-block::
tokenizer = AutoTokenizer.from_pretrained("namespace/awesome-name-you-picked")
model = AutoModel.from_pretrained("namespace/awesome-name-you-picked")
>>> tokenizer = AutoTokenizer.from_pretrained("namespace/awesome-name-you-picked")
>>> model = AutoModel.from_pretrained("namespace/awesome-name-you-picked")
Additional commands
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
You can list all the files you uploaded on the hub like this:
You may specify a revision by using the ``revision`` flag in the ``from_pretrained`` method:
.. code-block::
transformers-cli s3 ls
You can also delete unneeded files with
.. code-block::
transformers-cli s3 rm awesome-name-you-picked/filename
>>> tokenizer = AutoTokenizer.from_pretrained(
>>> "julien-c/EsperBERTo-small",
>>> revision="v2.0.1" # tag name, or branch name, or commit hash
>>> )

View File

@@ -1,12 +1,12 @@
Summary of the models
=======================================================================================================================
This is a summary of the models available in 🤗 Transformers. It assumes youre familiar with the original
`transformer model <https://arxiv.org/abs/1706.03762>`_. For a gentle introduction check the `annotated transformer
This is a summary of the models available in 🤗 Transformers. It assumes youre familiar with the original `transformer
model <https://arxiv.org/abs/1706.03762>`_. For a gentle introduction check the `annotated transformer
<http://nlp.seas.harvard.edu/2018/04/03/attention.html>`_. Here we focus on the high-level differences between the
models. You can check them more in detail in their respective documentation. Also checkout the
:doc:`pretrained model page </pretrained_models>` to see the checkpoints available for each type of model and all `the
community models <https://huggingface.co/models>`_.
models. You can check them more in detail in their respective documentation. Also checkout the :doc:`pretrained model
page </pretrained_models>` to see the checkpoints available for each type of model and all `the community models
<https://huggingface.co/models>`_.
Each one of the models in the library falls into one of the following categories:
@@ -19,8 +19,8 @@ Each one of the models in the library falls into one of the following categories
Autoregressive models are pretrained on the classic language modeling task: guess the next token having read all the
previous ones. They correspond to the decoder of the original transformer model, and a mask is used on top of the full
sentence so that the attention heads can only see what was before in the next, and not whats after. Although those
models can be fine-tuned and achieve great results on many tasks, the most natural application is text generation.
A typical example of such models is GPT.
models can be fine-tuned and achieve great results on many tasks, the most natural application is text generation. A
typical example of such models is GPT.
Autoencoding models are pretrained by corrupting the input tokens in some way and trying to reconstruct the original
sentence. They correspond to the encoder of the original transformer model in the sense that they get access to the
@@ -30,8 +30,8 @@ sentence classification or token classification. A typical example of such model
Note that the only difference between autoregressive models and autoencoding models is in the way the model is
pretrained. Therefore, the same architecture can be used for both autoregressive and autoencoding models. When a given
model has been used for both types of pretraining, we have put it in the category corresponding to the article where it was first
introduced.
model has been used for both types of pretraining, we have put it in the category corresponding to the article where it
was first introduced.
Sequence-to-sequence models use both the encoder and the decoder of the original transformer, either for translation
tasks or by transforming other tasks to sequence-to-sequence problems. They can be fine-tuned to many tasks but their
@@ -60,8 +60,8 @@ Original GPT
<img alt="Doc" src="https://img.shields.io/badge/Model_documentation-openai--gpt-blueviolet">
</a>
`Improving Language Understanding by Generative Pre-Training <https://cdn.openai.com/research-covers/language-unsupervised/language_understanding_paper.pdf>`_,
Alec Radford et al.
`Improving Language Understanding by Generative Pre-Training
<https://cdn.openai.com/research-covers/language-unsupervised/language_understanding_paper.pdf>`_, Alec Radford et al.
The first autoregressive model based on the transformer architecture, pretrained on the Book Corpus dataset.
@@ -80,7 +80,8 @@ GPT-2
<img alt="Doc" src="https://img.shields.io/badge/Model_documentation-gpt2-blueviolet">
</a>
`Language Models are Unsupervised Multitask Learners <https://d4mucfpksywv.cloudfront.net/better-language-models/language_models_are_unsupervised_multitask_learners.pdf>`_,
`Language Models are Unsupervised Multitask Learners
<https://d4mucfpksywv.cloudfront.net/better-language-models/language_models_are_unsupervised_multitask_learners.pdf>`_,
Alec Radford et al.
A bigger and better version of GPT, pretrained on WebText (web pages from outgoing links in Reddit with 3 karmas or
@@ -122,8 +123,8 @@ Transformer-XL
<img alt="Doc" src="https://img.shields.io/badge/Model_documentation-transfo--xl-blueviolet">
</a>
`Transformer-XL: Attentive Language Models Beyond a Fixed-Length Context <https://arxiv.org/abs/1901.02860>`_,
Zihang Dai et al.
`Transformer-XL: Attentive Language Models Beyond a Fixed-Length Context <https://arxiv.org/abs/1901.02860>`_, Zihang
Dai et al.
Same as a regular GPT model, but introduces a recurrence mechanism for two consecutive segments (similar to a regular
RNNs with two consecutive inputs). In this context, a segment is a number of consecutive tokens (for instance 512) that
@@ -153,8 +154,7 @@ Reformer
<img alt="Doc" src="https://img.shields.io/badge/Model_documentation-reformer-blueviolet">
</a>
`Reformer: The Efficient Transformer <https://arxiv.org/abs/2001.04451>`_,
Nikita Kitaev et al .
`Reformer: The Efficient Transformer <https://arxiv.org/abs/2001.04451>`_, Nikita Kitaev et al .
An autoregressive transformer model with lots of tricks to reduce memory footprint and compute time. Those tricks
include:
@@ -188,8 +188,8 @@ XLNet
<img alt="Doc" src="https://img.shields.io/badge/Model_documentation-xlnet-blueviolet">
</a>
`XLNet: Generalized Autoregressive Pretraining for Language Understanding <https://arxiv.org/abs/1906.08237>`_,
Zhilin Yang et al.
`XLNet: Generalized Autoregressive Pretraining for Language Understanding <https://arxiv.org/abs/1906.08237>`_, Zhilin
Yang et al.
XLNet is not a traditional autoregressive model but uses a training strategy that builds on that. It permutes the
tokens in the sentence, then allows the model to use the last n tokens to predict the token n+1. Since this is all done
@@ -207,7 +207,8 @@ Autoencoding models
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
As mentioned before, these models rely on the encoder part of the original transformer and use no mask so the model can
look at all the tokens in the attention heads. For pretraining, targets are the original sentences and inputs are their corrupted versions.
look at all the tokens in the attention heads. For pretraining, targets are the original sentences and inputs are their
corrupted versions.
BERT
-----------------------------------------------------------------------------------------------------------------------
@@ -260,8 +261,8 @@ Same as BERT but with a few tweaks:
sequence of tokens) so it's more logical to have H >> E. Also, the embedding matrix is large since it's V x E (V
being the vocab size). If E < H, it has less parameters.
* Layers are split in groups that share parameters (to save memory).
* Next sentence prediction is replaced by a sentence ordering prediction: in the inputs, we have two sentences A and B
(that are consecutive) and we either feed A followed by B or B followed by A. The model must predict if they have
* Next sentence prediction is replaced by a sentence ordering prediction: in the inputs, we have two sentences A and
B (that are consecutive) and we either feed A followed by B or B followed by A. The model must predict if they have
been swapped or not.
The library provides a version of the model for masked language modeling, token classification, sentence
@@ -279,8 +280,7 @@ RoBERTa
<img alt="Doc" src="https://img.shields.io/badge/Model_documentation-roberta-blueviolet">
</a>
`RoBERTa: A Robustly Optimized BERT Pretraining Approach <https://arxiv.org/abs/1907.11692>`_,
Yinhan Liu et al.
`RoBERTa: A Robustly Optimized BERT Pretraining Approach <https://arxiv.org/abs/1907.11692>`_, Yinhan Liu et al.
Same as BERT with better pretraining tricks:
@@ -339,8 +339,8 @@ library provides checkpoints for all of them:
previous section as well). One of the languages is selected for each training sample, and the model input is a
sentence of 256 tokens, that may span over several documents in one of those languages.
* Masked language modeling (MLM) which is like RoBERTa. One of the languages is selected for each training sample,
and the model input is a sentence of 256 tokens, that may span over several documents in one of those languages, with
dynamic masking of the tokens.
and the model input is a sentence of 256 tokens, that may span over several documents in one of those languages,
with dynamic masking of the tokens.
* A combination of MLM and translation language modeling (TLM). This consists of concatenating a sentence in two
different languages, with random masking. To predict one of the masked tokens, the model can use both, the
surrounding context in language 1 and the context given by language 2.
@@ -500,8 +500,8 @@ BART
<https://arxiv.org/abs/1910.13461>`_, Mike Lewis et al.
Sequence-to-sequence model with an encoder and a decoder. Encoder is fed a corrupted version of the tokens, decoder is
fed the original tokens (but has a mask to hide the future words like a regular transformers decoder). For the encoder, on the
pretraining tasks, a composition of the following transformations are applied:
fed the original tokens (but has a mask to hide the future words like a regular transformers decoder). For the encoder
, on the pretraining tasks, a composition of the following transformations are applied:
* mask random tokens (like in BERT)
* delete random tokens
@@ -523,15 +523,21 @@ Pegasus
<img alt="Doc" src="https://img.shields.io/badge/Model_documentation-pegasus-blueviolet">
</a>
`PEGASUS: Pre-training with Extracted Gap-sentences forAbstractive Summarization
`PEGASUS: Pre-training with Extracted Gap-sentences forAbstractive Summarization
<https://arxiv.org/pdf/1912.08777.pdf>`_, Jingqing Zhang, Yao Zhao, Mohammad Saleh and Peter J. Liu on Dec 18, 2019.
Sequence-to-sequence model with the same encoder-decoder model architecture as BART. Pegasus is pre-trained jointly on two self-supervised objective functions: Masked Language Modeling (MLM) and a novel summarization specific pre-training objective, called Gap Sentence Generation (GSG).
Sequence-to-sequence model with the same encoder-decoder model architecture as BART. Pegasus is pre-trained jointly on
two self-supervised objective functions: Masked Language Modeling (MLM) and a novel summarization specific pre-training
objective, called Gap Sentence Generation (GSG).
* MLM: encoder input tokens are randomely replaced by a mask tokens and have to be predicted by the encoder (like in BERT)
* GSG: whole encoder input sentences are replaced by a second mask token and fed to the decoder, but which has a causal mask to hide the future words like a regular auto-regressive transformer decoder.
* MLM: encoder input tokens are randomely replaced by a mask tokens and have to be predicted by the encoder (like in
BERT)
* GSG: whole encoder input sentences are replaced by a second mask token and fed to the decoder, but which has a
causal mask to hide the future words like a regular auto-regressive transformer decoder.
In contrast to BART, Pegasus' pretraining task is intentionally similar to summarization: important sentences are masked and are generated together as one output sequence from the remaining sentences, similar to an extractive summary.
In contrast to BART, Pegasus' pretraining task is intentionally similar to summarization: important sentences are
masked and are generated together as one output sequence from the remaining sentences, similar to an extractive
summary.
The library provides a version of this model for conditional generation, which should be used for summarization.
@@ -566,22 +572,23 @@ T5
<img alt="Doc" src="https://img.shields.io/badge/Model_documentation-t5-blueviolet">
</a>
`Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer <https://arxiv.org/abs/1910.10683>`_,
Colin Raffel et al.
`Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer
<https://arxiv.org/abs/1910.10683>`_, Colin Raffel et al.
Uses the traditional transformer model (with a slight change in the positional embeddings, which are learned at
each layer). To be able to operate on all NLP tasks, it transforms them into text-to-text problems by using specific
Uses the traditional transformer model (with a slight change in the positional embeddings, which are learned at each
layer). To be able to operate on all NLP tasks, it transforms them into text-to-text problems by using specific
prefixes: “summarize: ”, “question: ”, “translate English to German: ” and so forth.
The pretraining includes both supervised and self-supervised training. Supervised training is conducted on downstream
tasks provided by the GLUE and SuperGLUE benchmarks (converting them into text-to-text tasks as explained above).
Self-supervised training uses corrupted tokens, by randomly removing 15% of the tokens and
replacing them with individual sentinel tokens (if several consecutive tokens are marked for removal, the whole group is replaced with a single sentinel token). The input of the encoder is the corrupted sentence, the input of the decoder is the
original sentence and the target is then the dropped out tokens delimited by their sentinel tokens.
Self-supervised training uses corrupted tokens, by randomly removing 15% of the tokens and replacing them with
individual sentinel tokens (if several consecutive tokens are marked for removal, the whole group is replaced with a
single sentinel token). The input of the encoder is the corrupted sentence, the input of the decoder is the original
sentence and the target is then the dropped out tokens delimited by their sentinel tokens.
For instance, if we have the sentence “My dog is very cute .”, and we decide to remove the tokens: "dog", "is" and "cute", the encoder
input becomes “My <x> very <y> .” and the target input becomes “<x> dog is <y> cute .<z>”
For instance, if we have the sentence “My dog is very cute .”, and we decide to remove the tokens: "dog", "is" and
"cute", the encoder input becomes “My <x> very <y> .” and the target input becomes “<x> dog is <y> cute .<z>”
The library provides a version of this model for conditional generation.
@@ -597,20 +604,21 @@ MBart
<img alt="Doc" src="https://img.shields.io/badge/Model_documentation-mbart-blueviolet">
</a>
`Multilingual Denoising Pre-training for Neural Machine Translation <https://arxiv.org/abs/2001.08210>`_ by Yinhan Liu, Jiatao Gu, Naman Goyal, Xian Li, Sergey Edunov
Marjan Ghazvininejad, Mike Lewis, Luke Zettlemoyer.
`Multilingual Denoising Pre-training for Neural Machine Translation <https://arxiv.org/abs/2001.08210>`_ by Yinhan Liu,
Jiatao Gu, Naman Goyal, Xian Li, Sergey Edunov Marjan Ghazvininejad, Mike Lewis, Luke Zettlemoyer.
The model architecture and pre-training objective is same as BART, but MBart is trained on 25 languages
and is intended for supervised and unsupervised machine translation. MBart is one of the first methods
for pre-training a complete sequence-to-sequence model by denoising full texts in multiple languages,
The model architecture and pre-training objective is same as BART, but MBart is trained on 25 languages and is intended
for supervised and unsupervised machine translation. MBart is one of the first methods for pre-training a complete
sequence-to-sequence model by denoising full texts in multiple languages,
The library provides a version of this model for conditional generation.
The `mbart-large-en-ro checkpoint <https://huggingface.co/facebook/mbart-large-en-ro>`_ can be used for english -> romanian translation.
The `mbart-large-en-ro checkpoint <https://huggingface.co/facebook/mbart-large-en-ro>`_ can be used for english ->
romanian translation.
The `mbart-large-cc25 <https://huggingface.co/facebook/mbart-large-cc25>`_ checkpoint can be finetuned for other translation and summarization tasks, using code in ```examples/seq2seq/``` , but is not very useful without finetuning.
The `mbart-large-cc25 <https://huggingface.co/facebook/mbart-large-cc25>`_ checkpoint can be finetuned for other
translation and summarization tasks, using code in ```examples/seq2seq/``` , but is not very useful without finetuning.
.. _multimodal-models:
ProphetNet
-----------------------------------------------------------------------------------------------------------------------
@@ -624,12 +632,18 @@ ProphetNet
<img alt="Doc" src="https://img.shields.io/badge/Model_documentation-prophetnet-blueviolet">
</a>
`ProphetNet: Predicting Future N-gram for Sequence-to-Sequence Pre-training, <https://arxiv.org/abs/2001.04063>`__ by Yu Yan, Weizhen Qi, Yeyun Gong, Dayiheng Liu, Nan Duan, Jiusheng Chen, Ruofei Zhang, Ming Zhou.
`ProphetNet: Predicting Future N-gram for Sequence-to-Sequence Pre-training, <https://arxiv.org/abs/2001.04063>`__ by
Yu Yan, Weizhen Qi, Yeyun Gong, Dayiheng Liu, Nan Duan, Jiusheng Chen, Ruofei Zhang, Ming Zhou.
ProphetNet introduces a novel *sequence-to-sequence* pre-training objective, called *future n-gram prediction*. In future n-gram prediction, the model predicts the next n tokens simultaneously based on previous context tokens at each time step instead instead of just the single next token. The future n-gram prediction explicitly encourages the model to plan for the future tokens and prevent overfitting on strong local correlations.
The model architecture is based on the original Transformer, but replaces the "standard" self-attention mechanism in the decoder by a a main self-attention mechanism and a self and n-stream (predict) self-attention mechanism.
ProphetNet introduces a novel *sequence-to-sequence* pre-training objective, called *future n-gram prediction*. In
future n-gram prediction, the model predicts the next n tokens simultaneously based on previous context tokens at each
time step instead instead of just the single next token. The future n-gram prediction explicitly encourages the model
to plan for the future tokens and prevent overfitting on strong local correlations. The model architecture is based on
the original Transformer, but replaces the "standard" self-attention mechanism in the decoder by a a main
self-attention mechanism and a self and n-stream (predict) self-attention mechanism.
The library provides a pre-trained version of this model for conditional generation and a fine-tuned version for summarization.
The library provides a pre-trained version of this model for conditional generation and a fine-tuned version for
summarization.
XLM-ProphetNet
-----------------------------------------------------------------------------------------------------------------------
@@ -643,11 +657,16 @@ XLM-ProphetNet
<img alt="Doc" src="https://img.shields.io/badge/Model_documentation-xprophetnet-blueviolet">
</a>
`ProphetNet: Predicting Future N-gram for Sequence-to-Sequence Pre-training, <https://arxiv.org/abs/2001.04063>`__ by Yu Yan, Weizhen Qi, Yeyun Gong, Dayiheng Liu, Nan Duan, Jiusheng Chen, Ruofei Zhang, Ming Zhou.
`ProphetNet: Predicting Future N-gram for Sequence-to-Sequence Pre-training, <https://arxiv.org/abs/2001.04063>`__ by
Yu Yan, Weizhen Qi, Yeyun Gong, Dayiheng Liu, Nan Duan, Jiusheng Chen, Ruofei Zhang, Ming Zhou.
XLM-ProphetNet's model architecture and pre-training objective is same as ProphetNet, but XLM-ProphetNet was pre-trained on the cross-lingual dataset `XGLUE <https://arxiv.org/abs/2004.01401>`__.
XLM-ProphetNet's model architecture and pre-training objective is same as ProphetNet, but XLM-ProphetNet was
pre-trained on the cross-lingual dataset `XGLUE <https://arxiv.org/abs/2004.01401>`__.
The library provides a pre-trained version of this model for multi-lingual conditional generation and fine-tuned versions for headline generation and question generation, respectively.
The library provides a pre-trained version of this model for multi-lingual conditional generation and fine-tuned
versions for headline generation and question generation, respectively.
.. _multimodal-models:
Multimodal models
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
@@ -663,8 +682,8 @@ et al.
A transformers model used in multimodal settings, combining a text and an image to make predictions. The transformer
model takes as inputs the embeddings of the tokenized text and the final activations of a pretrained on images resnet
(after the pooling layer) that goes through a linear layer (to go from number of features at the end of the
resnet to the hidden state dimension of the transformer).
(after the pooling layer) that goes through a linear layer (to go from number of features at the end of the resnet to
the hidden state dimension of the transformer).
The different inputs are concatenated, and on top of the positional embeddings, a segment embedding is added to let the
model know which part of the input vector corresponds to the text and which to the image.
@@ -672,8 +691,7 @@ model know which part of the input vector corresponds to the text and which to t
The pretrained model only works for classification.
..
More information in this :doc:`model documentation </model_doc/mmbt.html>`.
TODO: write this page
More information in this :doc:`model documentation </model_doc/mmbt.html>`. TODO: write this page
.. _retrieval-based-models:
@@ -695,19 +713,22 @@ DPR
<img alt="Doc" src="https://img.shields.io/badge/Model_documentation-dpr-blueviolet">
</a>
`Dense Passage Retrieval for Open-Domain Question Answering <https://arxiv.org/abs/2004.04906>`_,
Vladimir Karpukhin et al.
`Dense Passage Retrieval for Open-Domain Question Answering <https://arxiv.org/abs/2004.04906>`_, Vladimir Karpukhin et
al.
Dense Passage Retrieval (DPR) - is a set of tools and models for state-of-the-art open-domain question-answering research.
Dense Passage Retrieval (DPR) - is a set of tools and models for state-of-the-art open-domain question-answering
research.
DPR consists in three models:
* Question encoder: encode questions as vectors
* Context encoder: encode contexts as vectors
* Reader: extract the answer of the questions inside retrieved contexts, along with a relevance score (high if the inferred span actually answers the question).
* Reader: extract the answer of the questions inside retrieved contexts, along with a relevance score (high if the
inferred span actually answers the question).
DPR's pipeline (not implemented yet) uses a retrieval step to find the top k contexts given a certain question, and then it calls the reader with the question and the retrieved documents to get the answer.
DPR's pipeline (not implemented yet) uses a retrieval step to find the top k contexts given a certain question, and
then it calls the reader with the question and the retrieved documents to get the answer.
RAG
-----------------------------------------------------------------------------------------------------------------------
@@ -721,12 +742,14 @@ RAG
<img alt="Doc" src="https://img.shields.io/badge/Model_documentation-rag-blueviolet">
</a>
`Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks <https://arxiv.org/abs/2005.11401>`_,
Patrick Lewis, Ethan Perez, Aleksandara Piktus, Fabio Petroni, Vladimir Karpukhin, Naman Goyal, Heinrich Küttler, Mike Lewis, Wen-tau Yih, Tim Rocktäschel, Sebastian Riedel, Douwe Kiela
`Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks <https://arxiv.org/abs/2005.11401>`_, Patrick Lewis,
Ethan Perez, Aleksandara Piktus, Fabio Petroni, Vladimir Karpukhin, Naman Goyal, Heinrich Küttler, Mike Lewis, Wen-tau
Yih, Tim Rocktäschel, Sebastian Riedel, Douwe Kiela
Retrieval-augmented generation ("RAG") models combine the powers of pretrained dense retrieval (DPR) and Seq2Seq models.
RAG models retrieve docs, pass them to a seq2seq model, then marginalize to generate outputs.
The retriever and seq2seq modules are initialized from pretrained models, and fine-tuned jointly, allowing both retrieval and generation to adapt to downstream tasks.
Retrieval-augmented generation ("RAG") models combine the powers of pretrained dense retrieval (DPR) and Seq2Seq
models. RAG models retrieve docs, pass them to a seq2seq model, then marginalize to generate outputs. The retriever and
seq2seq modules are initialized from pretrained models, and fine-tuned jointly, allowing both retrieval and generation
to adapt to downstream tasks.
The two models RAG-Token and RAG-Sequence are available for generation.
@@ -745,19 +768,19 @@ use a sparse version of the attention matrix to speed up training.
**LSH attention**
:ref:`Reformer <reformer>` uses LSH attention. In the softmax(QK^t), only the biggest elements (in the softmax
dimension) of the matrix QK^t are going to give useful contributions. So for each query q in Q, we can consider only
dimension) of the matrix QK^t are going to give useful contributions. So for each query q in Q, we can consider only
the keys k in K that are close to q. A hash function is used to determine if q and k are close. The attention mask is
modified to mask the current token (except at the first position), because it will give a query and a key equal (so very
similar to each other). Since the hash can be a bit random, several hash functions are used in practice (determined by
a n_rounds parameter) and then are averaged together.
modified to mask the current token (except at the first position), because it will give a query and a key equal (so
very similar to each other). Since the hash can be a bit random, several hash functions are used in practice
(determined by a n_rounds parameter) and then are averaged together.
.. _local-attention:
**Local attention**
:ref:`Longformer <longformer>` uses local attention: often, the local context (e.g., what are the two tokens to the left and
right?) is enough to take action for a given token. Also, by stacking attention layers that have a small window, the
last layer will have a receptive field of more than just the tokens in the window, allowing them to build a
:ref:`Longformer <longformer>` uses local attention: often, the local context (e.g., what are the two tokens to the
left and right?) is enough to take action for a given token. Also, by stacking attention layers that have a small
window, the last layer will have a receptive field of more than just the tokens in the window, allowing them to build a
representation of the whole sentence.
Some preselected input tokens are also given global attention: for those few tokens, the attention matrix can access
@@ -780,8 +803,9 @@ Other tricks
:ref:`Reformer <reformer>` uses axial positional encodings: in traditional transformer models, the positional encoding
E is a matrix of size :math:`l` by :math:`d`, :math:`l` being the sequence length and :math:`d` the dimension of the
hidden state. If you have very long texts, this matrix can be huge and take way too much space on the GPU. To alleviate that, axial positional encodings consist of factorizing that big matrix E in two smaller matrices E1 and
E2, with dimensions :math:`l_{1} \times d_{1}` and :math:`l_{2} \times d_{2}`, such that :math:`l_{1} \times l_{2} = l`
and :math:`d_{1} + d_{2} = d` (with the product for the lengths, this ends up being way smaller). The embedding for
time step :math:`j` in E is obtained by concatenating the embeddings for timestep :math:`j \% l1` in E1 and
:math:`j // l1` in E2.
hidden state. If you have very long texts, this matrix can be huge and take way too much space on the GPU. To alleviate
that, axial positional encodings consist of factorizing that big matrix E in two smaller matrices E1 and E2, with
dimensions :math:`l_{1} \times d_{1}` and :math:`l_{2} \times d_{2}`, such that :math:`l_{1} \times l_{2} = l` and
:math:`d_{1} + d_{2} = d` (with the product for the lengths, this ends up being way smaller). The embedding for time
step :math:`j` in E is obtained by concatenating the embeddings for timestep :math:`j \% l1` in E1 and :math:`j // l1`
in E2.

View File

@@ -1,9 +1,9 @@
Multi-lingual models
=======================================================================================================================
Most of the models available in this library are mono-lingual models (English, Chinese and German). A few
multi-lingual models are available and have a different mechanisms than mono-lingual models.
This page details the usage of these models.
Most of the models available in this library are mono-lingual models (English, Chinese and German). A few multi-lingual
models are available and have a different mechanisms than mono-lingual models. This page details the usage of these
models.
The two models that currently support multiple languages are BERT and XLM.
@@ -28,8 +28,8 @@ This section concerns the following checkpoints:
These checkpoints require language embeddings that will specify the language used at inference time. These language
embeddings are represented as a tensor that is of the same shape as the input ids passed to the model. The values in
these tensors depend on the language used and are identifiable using the ``lang2id`` and ``id2lang`` attributes
from the tokenizer.
these tensors depend on the language used and are identifiable using the ``lang2id`` and ``id2lang`` attributes from
the tokenizer.
Here is an example using the ``xlm-clm-enfr-1024`` checkpoint (Causal language modeling, English-French):
@@ -78,8 +78,9 @@ You can then feed it all as input to your model:
>>> outputs = model(input_ids, langs=langs)
The example `run_generation.py <https://github.com/huggingface/transformers/blob/master/examples/text-generation/run_generation.py>`__
can generate text using the CLM checkpoints from XLM, using the language embeddings.
The example `run_generation.py
<https://github.com/huggingface/transformers/blob/master/examples/text-generation/run_generation.py>`__ can generate
text using the CLM checkpoints from XLM, using the language embeddings.
XLM without Language Embeddings
-----------------------------------------------------------------------------------------------------------------------
@@ -89,8 +90,8 @@ This section concerns the following checkpoints:
- ``xlm-mlm-17-1280`` (Masked language modeling, 17 languages)
- ``xlm-mlm-100-1280`` (Masked language modeling, 100 languages)
These checkpoints do not require language embeddings at inference time. These models are used to have generic
sentence representations, differently from previously-mentioned XLM checkpoints.
These checkpoints do not require language embeddings at inference time. These models are used to have generic sentence
representations, differently from previously-mentioned XLM checkpoints.
BERT
@@ -101,15 +102,15 @@ BERT has two checkpoints that can be used for multi-lingual tasks:
- ``bert-base-multilingual-uncased`` (Masked language modeling + Next sentence prediction, 102 languages)
- ``bert-base-multilingual-cased`` (Masked language modeling + Next sentence prediction, 104 languages)
These checkpoints do not require language embeddings at inference time. They should identify the language
used in the context and infer accordingly.
These checkpoints do not require language embeddings at inference time. They should identify the language used in the
context and infer accordingly.
XLM-RoBERTa
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
XLM-RoBERTa was trained on 2.5TB of newly created clean CommonCrawl data in 100 languages. It provides strong
gains over previously released multi-lingual models like mBERT or XLM on downstream taks like classification,
sequence labeling and question answering.
XLM-RoBERTa was trained on 2.5TB of newly created clean CommonCrawl data in 100 languages. It provides strong gains
over previously released multi-lingual models like mBERT or XLM on downstream taks like classification, sequence
labeling and question answering.
Two XLM-RoBERTa checkpoints can be used for multi-lingual tasks:

View File

@@ -1,86 +1,69 @@
Perplexity of fixed-length models
=======================================================================================================================
Perplexity (PPL) is one of the most common metrics for evaluating language
models. Before diving in, we should note that the metric applies specifically
to classical language models (sometimes called autoregressive or causal
language models) and is not well defined for masked language models like BERT
(see :doc:`summary of the models <model_summary>`).
Perplexity (PPL) is one of the most common metrics for evaluating language models. Before diving in, we should note
that the metric applies specifically to classical language models (sometimes called autoregressive or causal language
models) and is not well defined for masked language models like BERT (see :doc:`summary of the models
<model_summary>`).
Perplexity is defined as the exponentiated average log-likelihood of a
sequence. If we have a tokenized sequence :math:`X = (x_0, x_1, \dots, x_t)`,
then the perplexity of :math:`X` is,
Perplexity is defined as the exponentiated average log-likelihood of a sequence. If we have a tokenized sequence
:math:`X = (x_0, x_1, \dots, x_t)`, then the perplexity of :math:`X` is,
.. math::
\text{PPL}(X)
= \exp \left\{ {-\frac{1}{t}\sum_i^t \log p_\theta (x_i|x_{<i}) } \right\}
where :math:`\log p_\theta (x_i|x_{<i})` is the log-likelihood of the ith
token conditioned on the preceding tokens :math:`x_{<i}` according to our
model. Intuitively, it can be thought of as an evaluation of the model's
ability to predict uniformly among the set of specified tokens in a corpus.
Importantly, this means that the tokenization procedure has a direct impact
on a model's perplexity which should always be taken into consideration when
comparing different models.
where :math:`\log p_\theta (x_i|x_{<i})` is the log-likelihood of the ith token conditioned on the preceding tokens
:math:`x_{<i}` according to our model. Intuitively, it can be thought of as an evaluation of the model's ability to
predict uniformly among the set of specified tokens in a corpus. Importantly, this means that the tokenization
procedure has a direct impact on a model's perplexity which should always be taken into consideration when comparing
different models.
This is also equivalent to the exponentiation of the cross-entropy between
the data and model predictions. For more intuition about perplexity and its
relationship to Bits Per Character (BPC) and data compression, check out this
`fantastic blog post on The Gradient
<https://thegradient.pub/understanding-evaluation-metrics-for-language-models/>`_.
This is also equivalent to the exponentiation of the cross-entropy between the data and model predictions. For more
intuition about perplexity and its relationship to Bits Per Character (BPC) and data compression, check out this
`fantastic blog post on The Gradient <https://thegradient.pub/understanding-evaluation-metrics-for-language-models/>`_.
Calculating PPL with fixed-length models
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
If we weren't limited by a model's context size, we would evaluate the
model's perplexity by autoregressively factorizing a sequence and
conditioning on the entire preceding subsequence at each step, as shown
below.
If we weren't limited by a model's context size, we would evaluate the model's perplexity by autoregressively
factorizing a sequence and conditioning on the entire preceding subsequence at each step, as shown below.
.. image:: imgs/ppl_full.gif
:width: 600
:alt: Full decomposition of a sequence with unlimited context length
When working with approximate models, however, we typically have a constraint
on the number of tokens the model can process. The largest version
of :doc:`GPT-2 <model_doc/gpt2>`, for example, has a fixed length of 1024
tokens, so we cannot calculate :math:`p_\theta(x_t|x_{<t})` directly when
:math:`t` is greater than 1024.
When working with approximate models, however, we typically have a constraint on the number of tokens the model can
process. The largest version of :doc:`GPT-2 <model_doc/gpt2>`, for example, has a fixed length of 1024 tokens, so we
cannot calculate :math:`p_\theta(x_t|x_{<t})` directly when :math:`t` is greater than 1024.
Instead, the sequence is typically broken into subsequences equal to the
model's maximum input size. If a model's max input size is :math:`k`, we
then approximate the likelihood of a token :math:`x_t` by conditioning only
on the :math:`k-1` tokens that precede it rather than the entire context.
When evaluating the model's perplexity of a sequence, a tempting but
suboptimal approach is to break the sequence into disjoint chunks and
add up the decomposed log-likelihoods of each segment independently.
Instead, the sequence is typically broken into subsequences equal to the model's maximum input size. If a model's max
input size is :math:`k`, we then approximate the likelihood of a token :math:`x_t` by conditioning only on the
:math:`k-1` tokens that precede it rather than the entire context. When evaluating the model's perplexity of a
sequence, a tempting but suboptimal approach is to break the sequence into disjoint chunks and add up the decomposed
log-likelihoods of each segment independently.
.. image:: imgs/ppl_chunked.gif
:width: 600
:alt: Suboptimal PPL not taking advantage of full available context
This is quick to compute since the perplexity of each segment can be computed
in one forward pass, but serves as a poor approximation of the
fully-factorized perplexity and will typically yield a higher (worse) PPL
because the model will have less context at most of the prediction steps.
This is quick to compute since the perplexity of each segment can be computed in one forward pass, but serves as a poor
approximation of the fully-factorized perplexity and will typically yield a higher (worse) PPL because the model will
have less context at most of the prediction steps.
Instead, the PPL of fixed-length models should be evaluated with a
sliding-window strategy. This involves repeatedly sliding the
context window so that the model has more context when making each
prediction.
Instead, the PPL of fixed-length models should be evaluated with a sliding-window strategy. This involves repeatedly
sliding the context window so that the model has more context when making each prediction.
.. image:: imgs/ppl_sliding.gif
:width: 600
:alt: Sliding window PPL taking advantage of all available context
This is a closer approximation to the true decomposition of the
sequence probability and will typically yield a more favorable score.
The downside is that it requires a separate forward pass for each token in
the corpus. A good practical compromise is to employ a strided sliding
window, moving the context by larger strides rather than sliding by 1 token a
time. This allows computation to procede much faster while still giving the
model a large context to make predictions at each step.
This is a closer approximation to the true decomposition of the sequence probability and will typically yield a more
favorable score. The downside is that it requires a separate forward pass for each token in the corpus. A good
practical compromise is to employ a strided sliding window, moving the context by larger strides rather than sliding by
1 token a time. This allows computation to procede much faster while still giving the model a large context to make
predictions at each step.
Example: Calculating perplexity with GPT-2 in 🤗 Transformers
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
@@ -95,10 +78,9 @@ Let's demonstrate this process with GPT-2.
model = GPT2LMHeadModel.from_pretrained(model_id).to(device)
tokenizer = GPT2TokenizerFast.from_pretrained(model_id)
We'll load in the WikiText-2 dataset and evaluate the perplexity using a few
different sliding-window strategies. Since this dataset is small and we're
just doing one forward pass over the set, we can just load and encode the
entire dataset in memory.
We'll load in the WikiText-2 dataset and evaluate the perplexity using a few different sliding-window strategies. Since
this dataset is small and we're just doing one forward pass over the set, we can just load and encode the entire
dataset in memory.
.. code-block:: python
@@ -106,16 +88,13 @@ entire dataset in memory.
test = load_dataset('wikitext', 'wikitext-2-raw-v1', split='test')
encodings = tokenizer('\n\n'.join(test['text']), return_tensors='pt')
With 🤗 Transformers, we can simply pass the ``input_ids`` as the ``labels``
to our model, and the average log-likelihood for each token is returned as
the loss. With our sliding window approach, however, there is overlap in the
tokens we pass to the model at each iteration. We don't want the
log-likelihood for the tokens we're just treating as context to be included
in our loss, so we can set these targets to ``-100`` so that they are
ignored. The following is an example of how we could do this with a stride of
``512``. This means that the model will have at least 512 tokens for context
when calculating the conditional likelihood of any one token (provided there
are 512 preceding tokens available to condition on).
With 🤗 Transformers, we can simply pass the ``input_ids`` as the ``labels`` to our model, and the average
log-likelihood for each token is returned as the loss. With our sliding window approach, however, there is overlap in
the tokens we pass to the model at each iteration. We don't want the log-likelihood for the tokens we're just treating
as context to be included in our loss, so we can set these targets to ``-100`` so that they are ignored. The following
is an example of how we could do this with a stride of ``512``. This means that the model will have at least 512 tokens
for context when calculating the conditional likelihood of any one token (provided there are 512 preceding tokens
available to condition on).
.. code-block:: python
@@ -125,27 +104,25 @@ are 512 preceding tokens available to condition on).
lls = []
for i in tqdm(range(0, encodings.input_ids.size(1), stride)):
begin_loc = max(i + stride - max_length, 0)
end_loc = i + stride
end_loc = min(i + stride, encodings.input_ids.size(1))
trg_len = end_loc - i # may be different from stride on last loop
input_ids = encodings.input_ids[:,begin_loc:end_loc].to(device)
target_ids = input_ids.clone()
target_ids[:,:-stride] = -100
target_ids[:,:-trg_len] = -100
with torch.no_grad():
outputs = model(input_ids, labels=target_ids)
log_likelihood = outputs[0] * stride
log_likelihood = outputs[0] * trg_len
lls.append(log_likelihood)
ppl = torch.exp(torch.stack(lls).sum() / i)
Running this with the stride length equal to the max input length is
equivalent to the suboptimal, non-sliding-window strategy we discussed above.
The smaller the stride, the more context the model will have in making each
prediction, and the better the reported perplexity will typically be.
ppl = torch.exp(torch.stack(lls).sum() / end_loc)
When we run the above with ``stride = 1024``, i.e. no overlap, the resulting
PPL is ``19.64``, which is about the same as the ``19.93`` reported in the
GPT-2 paper. By using ``stride = 512`` and thereby employing our striding
window strategy, this jumps down to ``16.53``. This is not only a more
favorable score, but is calculated in a way that is closer to the true
autoregressive decomposition of a sequence likelihood.
Running this with the stride length equal to the max input length is equivalent to the suboptimal, non-sliding-window
strategy we discussed above. The smaller the stride, the more context the model will have in making each prediction,
and the better the reported perplexity will typically be.
When we run the above with ``stride = 1024``, i.e. no overlap, the resulting PPL is ``19.64``, which is about the same
as the ``19.93`` reported in the GPT-2 paper. By using ``stride = 512`` and thereby employing our striding window
strategy, this jumps down to ``16.53``. This is not only a more favorable score, but is calculated in a way that is
closer to the true autoregressive decomposition of a sequence likelihood.

View File

@@ -12,15 +12,15 @@ The library was designed with two strong goals in mind:
- Be as easy and fast to use as possible:
- We strongly limited the number of user-facing abstractions to learn, in fact, there are almost no abstractions,
just three standard classes required to use each model: :doc:`configuration <main_classes/configuration>`,
just three standard classes required to use each model: :doc:`configuration <main_classes/configuration>`,
:doc:`models <main_classes/model>` and :doc:`tokenizer <main_classes/tokenizer>`.
- All of these classes can be initialized in a simple and unified way from pretrained instances by using a common
:obj:`from_pretrained()` instantiation method which will take care of downloading (if needed), caching and
loading the related class instance and associated data (configurations' hyper-parameters, tokenizers' vocabulary,
and models' weights) from a pretrained checkpoint provided on
`Hugging Face Hub <https://huggingface.co/models>`__ or your own saved checkpoint.
loading the related class instance and associated data (configurations' hyper-parameters, tokenizers' vocabulary,
and models' weights) from a pretrained checkpoint provided on `Hugging Face Hub
<https://huggingface.co/models>`__ or your own saved checkpoint.
- On top of those three base classes, the library provides two APIs: :func:`~transformers.pipeline` for quickly
using a model (plus its associated tokenizer and configuration) on a given task and
using a model (plus its associated tokenizer and configuration) on a given task and
:func:`~transformers.Trainer`/:func:`~transformers.TFTrainer` to quickly train or fine-tune a given model.
- As a consequence, this library is NOT a modular toolbox of building blocks for neural nets. If you want to
extend/build-upon the library, just use regular Python/PyTorch/TensorFlow/Keras modules and inherit from the base
@@ -52,10 +52,10 @@ Main concepts
The library is built around three types of classes for each model:
- **Model classes** such as :class:`~transformers.BertModel`, which are 30+ PyTorch models
(`torch.nn.Module <https://pytorch.org/docs/stable/nn.html#torch.nn.Module>`__) or Keras models
(`tf.keras.Model <https://www.tensorflow.org/api_docs/python/tf/keras/Model>`__) that work with the pretrained
weights provided in the library.
- **Model classes** such as :class:`~transformers.BertModel`, which are 30+ PyTorch models (`torch.nn.Module
<https://pytorch.org/docs/stable/nn.html#torch.nn.Module>`__) or Keras models (`tf.keras.Model
<https://www.tensorflow.org/api_docs/python/tf/keras/Model>`__) that work with the pretrained weights provided in the
library.
- **Configuration classes** such as :class:`~transformers.BertConfig`, which store all the parameters required to build
a model. You don't always need to instantiate these yourself. In particular, if you are using a pretrained model
without any modification, creating the model will automatically take care of instantiating the configuration (which
@@ -66,8 +66,8 @@ The library is built around three types of classes for each model:
All these classes can be instantiated from pretrained instances and saved locally using two methods:
- :obj:`from_pretrained()` lets you instantiate a model/configuration/tokenizer from a pretrained version either
provided by the library itself (the supported models are provided in the list :doc:`here <pretrained_models>`
or stored locally (or on a server) by the user,
provided by the library itself (the supported models are provided in the list :doc:`here <pretrained_models>` or
stored locally (or on a server) by the user,
- :obj:`save_pretrained()` lets you save a model/configuration/tokenizer locally so that it can be reloaded using
:obj:`from_pretrained()`.

View File

@@ -17,7 +17,7 @@ work properly.
the text you give it in tokens the same way for the pretraining corpus, and it will use the same correspondence
token to index (that we usually call a `vocab`) as during pretraining.
To automatically download the vocab used during pretraining or fine-tuning a given model, you can use the
To automatically download the vocab used during pretraining or fine-tuning a given model, you can use the
:func:`~transformers.AutoTokenizer.from_pretrained` method:
.. code-block::
@@ -39,10 +39,10 @@ is its ``__call__``: you just need to feed your sentence to your tokenizer objec
'token_type_ids': [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]}
This returns a dictionary string to list of ints.
The `input_ids <glossary.html#input-ids>`__ are the indices corresponding to each token in our sentence. We will see
below what the `attention_mask <glossary.html#attention-mask>`__ is used for and in
:ref:`the next section <sentence-pairs>` the goal of `token_type_ids <glossary.html#token-type-ids>`__.
This returns a dictionary string to list of ints. The `input_ids <glossary.html#input-ids>`__ are the indices
corresponding to each token in our sentence. We will see below what the `attention_mask
<glossary.html#attention-mask>`__ is used for and in :ref:`the next section <sentence-pairs>` the goal of
`token_type_ids <glossary.html#token-type-ids>`__.
The tokenizer can decode a list of token ids in a proper sentence:
@@ -51,10 +51,10 @@ The tokenizer can decode a list of token ids in a proper sentence:
>>> tokenizer.decode(encoded_input["input_ids"])
"[CLS] Hello, I'm a single sentence! [SEP]"
As you can see, the tokenizer automatically added some special tokens that the model expect. Not all model need special
tokens; for instance, if we had used` gtp2-medium` instead of `bert-base-cased` to create our tokenizer, we would have
seen the same sentence as the original one here. You can disable this behavior (which is only advised if you have added
those special tokens yourself) by passing ``add_special_tokens=False``.
As you can see, the tokenizer automatically added some special tokens that the model expects. Not all models need
special tokens; for instance, if we had used` gtp2-medium` instead of `bert-base-cased` to create our tokenizer, we
would have seen the same sentence as the original one here. You can disable this behavior (which is only advised if you
have added those special tokens yourself) by passing ``add_special_tokens=False``.
If you have several sentences you want to process, you can do this efficiently by sending them as a list to the
tokenizer:
@@ -76,7 +76,7 @@ tokenizer:
[1, 1, 1, 1, 1],
[1, 1, 1, 1, 1, 1, 1, 1]]}
We get back a dictionary once again, this time with values being list of list of ints.
We get back a dictionary once again, this time with values being lists of lists of ints.
If the purpose of sending several sentences at a time to the tokenizer is to build a batch to feed the model, you will
probably want:
@@ -114,9 +114,9 @@ You can do all of this by using the following options when feeding your list of
[1, 1, 1, 1, 1, 0, 0, 0, 0],
[1, 1, 1, 1, 1, 1, 1, 1, 0]])}
It returns a dictionary string to tensor. We can now see what the `attention_mask <glossary.html#attention-mask>`__ is
all about: it points out which tokens the model should pay attention to and which ones it should not (because they
represent padding in this case).
It returns a dictionary with string keys and tensor values. We can now see what the `attention_mask
<glossary.html#attention-mask>`__ is all about: it points out which tokens the model should pay attention to and which
ones it should not (because they represent padding in this case).
Note that if your model does not have a maximum length associated to it, the command above will throw a warning. You
@@ -127,9 +127,9 @@ can safely ignore it. You can also pass ``verbose=False`` to stop the tokenizer
Preprocessing pairs of sentences
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Sometimes you need to feed pair of sentences to your model. For instance, if you want to classify if two sentences in a
pair are similar, or for question-answering models, which take a context and a question. For BERT models, the input is
then represented like this: :obj:`[CLS] Sequence A [SEP] Sequence B [SEP]`
Sometimes you need to feed a pair of sentences to your model. For instance, if you want to classify if two sentences in
a pair are similar, or for question-answering models, which take a context and a question. For BERT models, the input
is then represented like this: :obj:`[CLS] Sequence A [SEP] Sequence B [SEP]`
You can encode a pair of sentences in the format expected by your model by supplying the two sentences as two arguments
(not a list since a list of two sentences will be interpreted as a batch of two single sentences, as we saw before).
@@ -146,8 +146,8 @@ This will once again return a dict string to list of ints:
This shows us what the `token_type_ids <glossary.html#token-type-ids>`__ are for: they indicate to the model which part
of the inputs correspond to the first sentence and which part corresponds to the second sentence. Note that
`token_type_ids` are not required or handled by all models. By default, a tokenizer will only return the inputs that
its associated model expects. You can force the return (or the non-return) of any of those special arguments by
using ``return_input_ids`` or ``return_token_type_ids``.
its associated model expects. You can force the return (or the non-return) of any of those special arguments by using
``return_input_ids`` or ``return_token_type_ids``.
If we decode the token ids we obtained, we will see that the special tokens have been properly added.
@@ -179,7 +179,7 @@ list of first sentences and the list of second sentences:
[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1],
[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]]}
As we can see, it returns a dictionary with the values being list of lists of ints.
As we can see, it returns a dictionary where each value is a list of lists of ints.
To double-check what is fed to the model, we can decode each list in `input_ids` one by one:
@@ -215,7 +215,7 @@ three arguments you need to know for this are :obj:`padding`, :obj:`truncation`
a single sequence).
- :obj:`'max_length'` to pad to a length specified by the :obj:`max_length` argument or the maximum length accepted
by the model if no :obj:`max_length` is provided (``max_length=None``). If you only provide a single sequence,
padding will still be applied to it.
padding will still be applied to it.
- :obj:`False` or :obj:`'do_not_pad'` to not pad the sequences. As we have seen before, this is the default
behavior.
@@ -238,9 +238,9 @@ three arguments you need to know for this are :obj:`padding`, :obj:`truncation`
truncation/padding to :obj:`max_length` is deactivated.
Here is a table summarizing the recommend way to setup padding and truncation. If you use pair of inputs sequence in
any of the following examples, you can replace :obj:`truncation=True` by a :obj:`STRATEGY` selected in
:obj:`['only_first', 'only_second', 'longest_first']`, i.e. :obj:`truncation='only_second'` or
:obj:`truncation= 'longest_first'` to control how both sequence in the pair are truncated as detailed before.
any of the following examples, you can replace :obj:`truncation=True` by a :obj:`STRATEGY` selected in
:obj:`['only_first', 'only_second', 'longest_first']`, i.e. :obj:`truncation='only_second'` or :obj:`truncation=
'longest_first'` to control how both sequence in the pair are truncated as detailed before.
+--------------------------------------+-----------------------------------+---------------------------------------------------------------------------------------------+
| Truncation | Padding | Instruction |
@@ -286,7 +286,7 @@ predictions in `named entity recognition (NER) <https://en.wikipedia.org/wiki/Na
.. warning::
Pre-tokenized does not mean your inputs are already tokenized (you wouldn't need to pass them though the tokenizer
Pre-tokenized does not mean your inputs are already tokenized (you wouldn't need to pass them through the tokenizer
if that was the case) but just split into words (which is often the first step in subword tokenization algorithms
like BPE).

View File

@@ -3,7 +3,8 @@ Pretrained models
Here is the full list of the currently provided pretrained models together with a short presentation of each model.
For a list that includes community-uploaded models, refer to `https://huggingface.co/models <https://huggingface.co/models>`__.
For a list that includes community-uploaded models, refer to `https://huggingface.co/models
<https://huggingface.co/models>`__.
+--------------------+------------------------------------------------------------+---------------------------------------------------------------------------------------------------------------------------------------+
| Architecture | Shortcut name | Details of the model |
@@ -329,7 +330,7 @@ For a list that includes community-uploaded models, refer to `https://huggingfac
| | ``facebook/bart-large-mnli`` | | Adds a 2 layer classification head with 1 million parameters |
| | | | bart-large base architecture with a classification head, finetuned on MNLI |
| +------------------------------------------------------------+---------------------------------------------------------------------------------------------------------------------------------------+
| | ``facebook/bart-large-cnn`` | | 12-layer, 1024-hidden, 16-heads, 406M parameters (same as base) |
| | ``facebook/bart-large-cnn`` | | 24-layer, 1024-hidden, 16-heads, 406M parameters (same as large) |
| | | | bart-large base architecture finetuned on cnn summarization task |
+--------------------+------------------------------------------------------------+---------------------------------------------------------------------------------------------------------------------------------------+
| DialoGPT | ``DialoGPT-small`` | | 12-layer, 768-hidden, 12-heads, 124M parameters |

View File

@@ -1,8 +1,8 @@
Quick tour
=======================================================================================================================
Let's have a quick look at the 🤗 Transformers library features. The library downloads pretrained models for
Natural Language Understanding (NLU) tasks, such as analyzing the sentiment of a text, and Natural Language Generation (NLG),
Let's have a quick look at the 🤗 Transformers library features. The library downloads pretrained models for Natural
Language Understanding (NLU) tasks, such as analyzing the sentiment of a text, and Natural Language Generation (NLG),
such as completing a prompt with new text or translating in another language.
First we will see how to easily leverage the pipeline API to quickly use those pretrained models at inference. Then, we
@@ -29,8 +29,8 @@ provides the following tasks out of the box:
- Translation: translate a text in another language.
- Feature extraction: return a tensor representation of the text.
Let's see how this work for sentiment analysis (the other tasks are all covered in the
:doc:`task summary </task_summary>`):
Let's see how this work for sentiment analysis (the other tasks are all covered in the :doc:`task summary
</task_summary>`):
.. code-block::
@@ -160,9 +160,10 @@ To apply these steps on a given text, we can just feed it to our tokenizer:
>>> inputs = tokenizer("We are very happy to show you the 🤗 Transformers library.")
This returns a dictionary string to list of ints. It contains the `ids of the tokens <glossary.html#input-ids>`__,
as mentioned before, but also additional arguments that will be useful to the model. Here for instance, we also have an
`attention mask <glossary.html#attention-mask>`__ that the model will use to have a better understanding of the sequence:
This returns a dictionary string to list of ints. It contains the `ids of the tokens <glossary.html#input-ids>`__, as
mentioned before, but also additional arguments that will be useful to the model. Here for instance, we also have an
`attention mask <glossary.html#attention-mask>`__ that the model will use to have a better understanding of the
sequence:
.. code-block::
@@ -191,8 +192,8 @@ and get tensors back. You can specify all of that to the tokenizer:
... return_tensors="tf"
... )
The padding is automatically applied on the side expected by the model (in this case, on the right), with the
padding token the model was pretrained with. The attention mask is also adapted to take the padding into account:
The padding is automatically applied on the side expected by the model (in this case, on the right), with the padding
token the model was pretrained with. The attention mask is also adapted to take the padding into account:
.. code-block::
@@ -213,8 +214,8 @@ Using the model
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
Once your input has been preprocessed by the tokenizer, you can send it directly to the model. As we mentioned, it will
contain all the relevant information the model needs. If you're using a TensorFlow model, you can pass the
dictionary keys directly to tensors, for a PyTorch model, you need to unpack the dictionary by adding :obj:`**`.
contain all the relevant information the model needs. If you're using a TensorFlow model, you can pass the dictionary
keys directly to tensors, for a PyTorch model, you need to unpack the dictionary by adding :obj:`**`.
.. code-block::
@@ -223,8 +224,8 @@ dictionary keys directly to tensors, for a PyTorch model, you need to unpack the
>>> ## TENSORFLOW CODE
>>> tf_outputs = tf_model(tf_batch)
In 🤗 Transformers, all outputs are tuples (with only one element potentially). Here, we get a tuple with just the
final activations of the model.
In 🤗 Transformers, all outputs are tuples (with only one element potentially). Here, we get a tuple with just the final
activations of the model.
.. code-block::
@@ -239,11 +240,10 @@ final activations of the model.
[ 0.08181786, -0.04179301]], dtype=float32)>,)
The model can return more than just the final activations, which is why the output is a tuple. Here we only asked for
the final activations, so we get a tuple with one element.
.. note::
the final activations, so we get a tuple with one element. .. note::
All 🤗 Transformers models (PyTorch or TensorFlow) return the activations of the model *before* the final
activation function (like SoftMax) since this final activation function is often fused with the loss.
All 🤗 Transformers models (PyTorch or TensorFlow) return the activations of the model *before* the final activation
function (like SoftMax) since this final activation function is often fused with the loss.
Let's apply the SoftMax activation to get predictions.
@@ -281,11 +281,11 @@ If you have labels, you can provide them to the model, it will return a tuple wi
>>> import tensorflow as tf
>>> tf_outputs = tf_model(tf_batch, labels = tf.constant([1, 0]))
Models are standard `torch.nn.Module <https://pytorch.org/docs/stable/nn.html#torch.nn.Module>`__ or
`tf.keras.Model <https://www.tensorflow.org/api_docs/python/tf/keras/Model>`__ so you can use them in your usual
training loop. 🤗 Transformers also provides a :class:`~transformers.Trainer` (or :class:`~transformers.TFTrainer` if
you are using TensorFlow) class to help with your training (taking care of things such as distributed training, mixed
precision, etc.). See the :doc:`training tutorial <training>` for more details.
Models are standard `torch.nn.Module <https://pytorch.org/docs/stable/nn.html#torch.nn.Module>`__ or `tf.keras.Model
<https://www.tensorflow.org/api_docs/python/tf/keras/Model>`__ so you can use them in your usual training loop. 🤗
Transformers also provides a :class:`~transformers.Trainer` (or :class:`~transformers.TFTrainer` if you are using
TensorFlow) class to help with your training (taking care of things such as distributed training, mixed precision,
etc.). See the :doc:`training tutorial <training>` for more details.
.. note::
@@ -336,13 +336,13 @@ The :obj:`AutoModel` and :obj:`AutoTokenizer` classes are just shortcuts that wi
pretrained model. Behind the scenes, the library has one model class per combination of architecture plus class, so the
code is easy to access and tweak if you need to.
In our previous example, the model was called "distilbert-base-uncased-finetuned-sst-2-english", which means it's
using the :doc:`DistilBERT </model_doc/distilbert>` architecture. As
:class:`~transformers.AutoModelForSequenceClassification` (or :class:`~transformers.TFAutoModelForSequenceClassification`
if you are using TensorFlow) was used, the model automatically created is then a
:class:`~transformers.DistilBertForSequenceClassification`. You can look at its documentation for all details relevant
to that specific model, or browse the source code. This is how you would directly instantiate model and tokenizer
without the auto magic:
In our previous example, the model was called "distilbert-base-uncased-finetuned-sst-2-english", which means it's using
the :doc:`DistilBERT </model_doc/distilbert>` architecture. As
:class:`~transformers.AutoModelForSequenceClassification` (or
:class:`~transformers.TFAutoModelForSequenceClassification` if you are using TensorFlow) was used, the model
automatically created is then a :class:`~transformers.DistilBertForSequenceClassification`. You can look at its
documentation for all details relevant to that specific model, or browse the source code. This is how you would
directly instantiate model and tokenizer without the auto magic:
.. code-block::

View File

@@ -5,16 +5,18 @@ Exporting transformers models
ONNX / ONNXRuntime
=======================================================================================================================
Projects `ONNX (Open Neural Network eXchange) <http://onnx.ai>`_ and `ONNXRuntime (ORT) <https://microsoft.github.io/onnxruntime/>`_ are part of an effort from leading industries in the AI field
to provide a unified and community-driven format to store and, by extension, efficiently execute neural network leveraging a variety
Projects `ONNX (Open Neural Network eXchange) <http://onnx.ai>`_ and `ONNXRuntime (ORT)
<https://microsoft.github.io/onnxruntime/>`_ are part of an effort from leading industries in the AI field to provide a
unified and community-driven format to store and, by extension, efficiently execute neural network leveraging a variety
of hardware and dedicated optimizations.
Starting from transformers v2.10.0 we partnered with ONNX Runtime to provide an easy export of transformers models to
the ONNX format. You can have a look at the effort by looking at our joint blog post `Accelerate your NLP pipelines using
Hugging Face Transformers and ONNX Runtime <https://medium.com/microsoftazure/accelerate-your-nlp-pipelines-using-hugging-face-transformers-and-onnx-runtime-2443578f4333>`_.
the ONNX format. You can have a look at the effort by looking at our joint blog post `Accelerate your NLP pipelines
using Hugging Face Transformers and ONNX Runtime
<https://medium.com/microsoftazure/accelerate-your-nlp-pipelines-using-hugging-face-transformers-and-onnx-runtime-2443578f4333>`_.
Exporting a model is done through the script `convert_graph_to_onnx.py` at the root of the transformers sources.
The following command shows how easy it is to export a BERT model from the library, simply run:
Exporting a model is done through the script `convert_graph_to_onnx.py` at the root of the transformers sources. The
following command shows how easy it is to export a BERT model from the library, simply run:
.. code-block:: bash
@@ -27,62 +29,66 @@ The conversion tool works for both PyTorch and Tensorflow models and ensures:
* The generated model can be correctly loaded through onnxruntime.
.. note::
Currently, inputs and outputs are always exported with dynamic sequence axes preventing some optimizations
on the ONNX Runtime. If you would like to see such support for fixed-length inputs/outputs, please
open up an issue on transformers.
Currently, inputs and outputs are always exported with dynamic sequence axes preventing some optimizations on the
ONNX Runtime. If you would like to see such support for fixed-length inputs/outputs, please open up an issue on
transformers.
Also, the conversion tool supports different options which let you tune the behavior of the generated model:
* **Change the target opset version of the generated model.** (More recent opset generally supports more operators and enables faster inference)
* **Change the target opset version of the generated model.** (More recent opset generally supports more operators and
enables faster inference)
* **Export pipeline-specific prediction heads.** (Allow to export model along with its task-specific prediction head(s))
* **Export pipeline-specific prediction heads.** (Allow to export model along with its task-specific prediction
head(s))
* **Use the external data format (PyTorch only).** (Lets you export model which size is above 2Gb (`More info <https://github.com/pytorch/pytorch/pull/33062>`_))
* **Use the external data format (PyTorch only).** (Lets you export model which size is above 2Gb (`More info
<https://github.com/pytorch/pytorch/pull/33062>`_))
Optimizations
-----------------------------------------------------------------------------------------------------------------------
ONNXRuntime includes some transformers-specific transformations to leverage optimized operations in the graph.
Below are some of the operators which can be enabled to speed up inference through ONNXRuntime (*see note below*):
ONNXRuntime includes some transformers-specific transformations to leverage optimized operations in the graph. Below
are some of the operators which can be enabled to speed up inference through ONNXRuntime (*see note below*):
* Constant folding
* Attention Layer fusing
* Skip connection LayerNormalization fusing
* FastGeLU approximation
Some of the optimizations performed by ONNX runtime can be hardware specific and thus lead to different performances
if used on another machine with a different hardware configuration than the one used for exporting the model.
For this reason, when using ``convert_graph_to_onnx.py`` optimizations are not enabled,
ensuring the model can be easily exported to various hardware.
Optimizations can then be enabled when loading the model through ONNX runtime for inference.
Some of the optimizations performed by ONNX runtime can be hardware specific and thus lead to different performances if
used on another machine with a different hardware configuration than the one used for exporting the model. For this
reason, when using ``convert_graph_to_onnx.py`` optimizations are not enabled, ensuring the model can be easily
exported to various hardware. Optimizations can then be enabled when loading the model through ONNX runtime for
inference.
.. note::
When quantization is enabled (see below), ``convert_graph_to_onnx.py`` script will enable optimizations on the model
because quantization would modify the underlying graph making it impossible for ONNX runtime to do the optimizations
afterwards.
When quantization is enabled (see below), ``convert_graph_to_onnx.py`` script will enable optimizations on the
model because quantization would modify the underlying graph making it impossible for ONNX runtime to do the
optimizations afterwards.
.. note::
For more information about the optimizations enabled by ONNXRuntime, please have a look at the (`ONNXRuntime Github <https://github.com/microsoft/onnxruntime/tree/master/onnxruntime/python/tools/transformers>`_)
For more information about the optimizations enabled by ONNXRuntime, please have a look at the (`ONNXRuntime Github
<https://github.com/microsoft/onnxruntime/tree/master/onnxruntime/python/tools/transformers>`_)
Quantization
-----------------------------------------------------------------------------------------------------------------------
ONNX exporter supports generating a quantized version of the model to allow efficient inference.
Quantization works by converting the memory representation of the parameters in the neural network
to a compact integer format. By default, weights of a neural network are stored as single-precision float (`float32`)
which can express a wide-range of floating-point numbers with decent precision.
These properties are especially interesting at training where you want fine-grained representation.
Quantization works by converting the memory representation of the parameters in the neural network to a compact integer
format. By default, weights of a neural network are stored as single-precision float (`float32`) which can express a
wide-range of floating-point numbers with decent precision. These properties are especially interesting at training
where you want fine-grained representation.
On the other hand, after the training phase, it has been shown one can greatly reduce the range and the precision of `float32` numbers
without changing the performances of the neural network.
On the other hand, after the training phase, it has been shown one can greatly reduce the range and the precision of
`float32` numbers without changing the performances of the neural network.
More technically, `float32` parameters are converted to a type requiring fewer bits to represent each number, thus reducing
the overall size of the model. Here, we are enabling `float32` mapping to `int8` values (a non-floating, single byte, number representation)
according to the following formula:
More technically, `float32` parameters are converted to a type requiring fewer bits to represent each number, thus
reducing the overall size of the model. Here, we are enabling `float32` mapping to `int8` values (a non-floating,
single byte, number representation) according to the following formula:
.. math::
y_{float32} = scale * x_{int8} - zero\_point
@@ -96,9 +102,9 @@ Leveraging tiny-integers has numerous advantages when it comes to inference:
* Integer operations execute a magnitude faster on modern hardware
* Integer operations require less power to do the computations
In order to convert a transformers model to ONNX IR with quantized weights you just need to specify ``--quantize``
when using ``convert_graph_to_onnx.py``. Also, you can have a look at the ``quantize()`` utility-method in this
same script file.
In order to convert a transformers model to ONNX IR with quantized weights you just need to specify ``--quantize`` when
using ``convert_graph_to_onnx.py``. Also, you can have a look at the ``quantize()`` utility-method in this same script
file.
Example of quantized BERT model export:
@@ -111,26 +117,27 @@ Example of quantized BERT model export:
.. note::
When exporting quantized model you will end up with two different ONNX files. The one specified at the end of the
above command will contain the original ONNX model storing `float32` weights.
The second one, with ``-quantized`` suffix, will hold the quantized parameters.
above command will contain the original ONNX model storing `float32` weights. The second one, with ``-quantized``
suffix, will hold the quantized parameters.
TorchScript
=======================================================================================================================
.. note::
This is the very beginning of our experiments with TorchScript and we are still exploring its capabilities
with variable-input-size models. It is a focus of interest to us and we will deepen our analysis in upcoming
releases, with more code examples, a more flexible implementation, and benchmarks comparing python-based codes
with compiled TorchScript.
This is the very beginning of our experiments with TorchScript and we are still exploring its capabilities with
variable-input-size models. It is a focus of interest to us and we will deepen our analysis in upcoming releases,
with more code examples, a more flexible implementation, and benchmarks comparing python-based codes with compiled
TorchScript.
According to Pytorch's documentation: "TorchScript is a way to create serializable and optimizable models from PyTorch code".
Pytorch's two modules `JIT and TRACE <https://pytorch.org/docs/stable/jit.html>`_ allow the developer to export
According to Pytorch's documentation: "TorchScript is a way to create serializable and optimizable models from PyTorch
code". Pytorch's two modules `JIT and TRACE <https://pytorch.org/docs/stable/jit.html>`_ allow the developer to export
their model to be re-used in other programs, such as efficiency-oriented C++ programs.
We have provided an interface that allows the export of 🤗 Transformers models to TorchScript so that they can
be reused in a different environment than a Pytorch-based python program. Here we explain how to export and use our models using TorchScript.
We have provided an interface that allows the export of 🤗 Transformers models to TorchScript so that they can be reused
in a different environment than a Pytorch-based python program. Here we explain how to export and use our models using
TorchScript.
Exporting a model requires two things:
@@ -145,13 +152,14 @@ Implications
TorchScript flag and tied weights
-----------------------------------------------------------------------------------------------------------------------
This flag is necessary because most of the language models in this repository have tied weights between their
``Embedding`` layer and their ``Decoding`` layer. TorchScript does not allow the export of models that have tied weights, therefore
it is necessary to untie and clone the weights beforehand.
This implies that models instantiated with the ``torchscript`` flag have their ``Embedding`` layer and ``Decoding`` layer
separate, which means that they should not be trained down the line. Training would de-synchronize the two layers,
leading to unexpected results.
This flag is necessary because most of the language models in this repository have tied weights between their
``Embedding`` layer and their ``Decoding`` layer. TorchScript does not allow the export of models that have tied
weights, therefore it is necessary to untie and clone the weights beforehand.
This implies that models instantiated with the ``torchscript`` flag have their ``Embedding`` layer and ``Decoding``
layer separate, which means that they should not be trained down the line. Training would de-synchronize the two
layers, leading to unexpected results.
This is not the case for models that do not have a Language Model head, as those do not have tied weights. These models
can be safely exported without the ``torchscript`` flag.
@@ -160,8 +168,8 @@ Dummy inputs and standard lengths
-----------------------------------------------------------------------------------------------------------------------
The dummy inputs are used to do a model forward pass. While the inputs' values are propagating through the layers,
Pytorch keeps track of the different operations executed on each tensor. These recorded operations are then used
to create the "trace" of the model.
Pytorch keeps track of the different operations executed on each tensor. These recorded operations are then used to
create the "trace" of the model.
The trace is created relatively to the inputs' dimensions. It is therefore constrained by the dimensions of the dummy
input, and will not work for any other sequence length or batch size. When trying with a different size, an error such
@@ -185,8 +193,8 @@ Below is an example, showing how to save, load models as well as how to use the
Saving a model
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
This snippet shows how to use TorchScript to export a ``BertModel``. Here the ``BertModel`` is instantiated
according to a ``BertConfig`` class and then saved to disk under the filename ``traced_bert.pt``
This snippet shows how to use TorchScript to export a ``BertModel``. Here the ``BertModel`` is instantiated according
to a ``BertConfig`` class and then saved to disk under the filename ``traced_bert.pt``
.. code-block:: python

View File

@@ -2,30 +2,30 @@ Summary of the tasks
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
This page shows the most frequent use-cases when using the library. The models available allow for many different
configurations and a great versatility in use-cases. The most simple ones are presented here, showcasing usage
for tasks such as question answering, sequence classification, named entity recognition and others.
configurations and a great versatility in use-cases. The most simple ones are presented here, showcasing usage for
tasks such as question answering, sequence classification, named entity recognition and others.
These examples leverage auto-models, which are classes that will instantiate a model according to a given checkpoint,
automatically selecting the correct model architecture. Please check the :class:`~transformers.AutoModel` documentation
for more information.
Feel free to modify the code to be more specific and adapt it to your specific use-case.
for more information. Feel free to modify the code to be more specific and adapt it to your specific use-case.
In order for a model to perform well on a task, it must be loaded from a checkpoint corresponding to that task. These
checkpoints are usually pre-trained on a large corpus of data and fine-tuned on a specific task. This means the
following:
- Not all models were fine-tuned on all tasks. If you want to fine-tune a model on a specific task, you can leverage
one of the `run_$TASK.py` scripts in the
`examples <https://github.com/huggingface/transformers/tree/master/examples>`__ directory.
- Fine-tuned models were fine-tuned on a specific dataset. This dataset may or may not overlap with your use-case
and domain. As mentioned previously, you may leverage the
`examples <https://github.com/huggingface/transformers/tree/master/examples>`__ scripts to fine-tune your model, or you
may create your own training script.
one of the `run_$TASK.py` scripts in the `examples
<https://github.com/huggingface/transformers/tree/master/examples>`__ directory.
- Fine-tuned models were fine-tuned on a specific dataset. This dataset may or may not overlap with your use-case and
domain. As mentioned previously, you may leverage the `examples
<https://github.com/huggingface/transformers/tree/master/examples>`__ scripts to fine-tune your model, or you may
create your own training script.
In order to do an inference on a task, several mechanisms are made available by the library:
- Pipelines: very easy-to-use abstractions, which require as little as two lines of code.
- Direct model use: Less abstractions, but more flexibility and power via a direct access to a tokenizer (PyTorch/TensorFlow) and full inference capacity.
- Direct model use: Less abstractions, but more flexibility and power via a direct access to a tokenizer
(PyTorch/TensorFlow) and full inference capacity.
Both approaches are showcased here.
@@ -40,15 +40,17 @@ Both approaches are showcased here.
Sequence Classification
-----------------------------------------------------------------------------------------------------------------------
Sequence classification is the task of classifying sequences according to a given number of classes. An example
of sequence classification is the GLUE dataset, which is entirely based on that task. If you would like to fine-tune
a model on a GLUE sequence classification task, you may leverage the
`run_glue.py <https://github.com/huggingface/transformers/tree/master/examples/text-classification/run_glue.py>`__ and
`run_pl_glue.py <https://github.com/huggingface/transformers/tree/master/examples/text-classification/run_pl_glue.py>`__ or
`run_tf_glue.py <https://github.com/huggingface/transformers/tree/master/examples/text-classification/run_tf_glue.py>`__ scripts.
Sequence classification is the task of classifying sequences according to a given number of classes. An example of
sequence classification is the GLUE dataset, which is entirely based on that task. If you would like to fine-tune a
model on a GLUE sequence classification task, you may leverage the `run_glue.py
<https://github.com/huggingface/transformers/tree/master/examples/text-classification/run_glue.py>`__ and
`run_pl_glue.py
<https://github.com/huggingface/transformers/tree/master/examples/text-classification/run_pl_glue.py>`__ or
`run_tf_glue.py
<https://github.com/huggingface/transformers/tree/master/examples/text-classification/run_tf_glue.py>`__ scripts.
Here is an example of using pipelines to do sentiment analysis: identifying if a sequence is positive or negative.
It leverages a fine-tuned model on sst2, which is a GLUE task.
Here is an example of using pipelines to do sentiment analysis: identifying if a sequence is positive or negative. It
leverages a fine-tuned model on sst2, which is a GLUE task.
This returns a label ("POSITIVE" or "NEGATIVE") alongside a score, as follows:
@@ -67,18 +69,16 @@ This returns a label ("POSITIVE" or "NEGATIVE") alongside a score, as follows:
label: POSITIVE, with score: 0.9999
Here is an example of doing a sequence classification using a model to determine if two sequences are paraphrases
of each other. The process is the following:
Here is an example of doing a sequence classification using a model to determine if two sequences are paraphrases of
each other. The process is the following:
1. Instantiate a tokenizer and a model from the checkpoint name. The model is
identified as a BERT model and loads it with the weights stored in the
checkpoint.
2. Build a sequence from the two sentences, with the correct model-specific
separators token type ids and attention masks
(:func:`~transformers.PreTrainedTokenizer.encode` and
:func:`~transformers.PreTrainedTokenizer.__call__` take care of this).
3. Pass this sequence through the model so that it is classified in one of the
two available classes: 0 (not a paraphrase) and 1 (is a paraphrase).
1. Instantiate a tokenizer and a model from the checkpoint name. The model is identified as a BERT model and loads it
with the weights stored in the checkpoint.
2. Build a sequence from the two sentences, with the correct model-specific separators token type ids and attention
masks (:func:`~transformers.PreTrainedTokenizer.encode` and :func:`~transformers.PreTrainedTokenizer.__call__` take
care of this).
3. Pass this sequence through the model so that it is classified in one of the two available classes: 0 (not a
paraphrase) and 1 (is a paraphrase).
4. Compute the softmax of the result to get probabilities over the classes.
5. Print the results.
@@ -155,14 +155,15 @@ Extractive Question Answering
-----------------------------------------------------------------------------------------------------------------------
Extractive Question Answering is the task of extracting an answer from a text given a question. An example of a
question answering dataset is the SQuAD dataset, which is entirely based on that task. If you would like to fine-tune
a model on a SQuAD task, you may leverage the
`run_squad.py <https://github.com/huggingface/transformers/tree/master/examples/question-answering/run_squad.py>`__ and
`run_tf_squad.py <https://github.com/huggingface/transformers/tree/master/examples/question-answering/run_tf_squad.py>`__ scripts.
question answering dataset is the SQuAD dataset, which is entirely based on that task. If you would like to fine-tune a
model on a SQuAD task, you may leverage the `run_squad.py
<https://github.com/huggingface/transformers/tree/master/examples/question-answering/run_squad.py>`__ and
`run_tf_squad.py
<https://github.com/huggingface/transformers/tree/master/examples/question-answering/run_tf_squad.py>`__ scripts.
Here is an example of using pipelines to do question answering: extracting an answer from a text given a question.
It leverages a fine-tuned model on SQuAD.
Here is an example of using pipelines to do question answering: extracting an answer from a text given a question. It
leverages a fine-tuned model on SQuAD.
.. code-block::
@@ -176,8 +177,8 @@ It leverages a fine-tuned model on SQuAD.
... a model on a SQuAD task, you may leverage the examples/question-answering/run_squad.py script.
... """
This returns an answer extracted from the text, a confidence score, alongside "start" and "end" values, which
are the positions of the extracted answer in the text.
This returns an answer extracted from the text, a confidence score, alongside "start" and "end" values, which are the
positions of the extracted answer in the text.
.. code-block::
@@ -192,16 +193,13 @@ are the positions of the extracted answer in the text.
Here is an example of question answering using a model and a tokenizer. The process is the following:
1. Instantiate a tokenizer and a model from the checkpoint name. The model is
identified as a BERT model and loads it with the weights stored in the
checkpoint.
1. Instantiate a tokenizer and a model from the checkpoint name. The model is identified as a BERT model and loads it
with the weights stored in the checkpoint.
2. Define a text and a few questions.
3. Iterate over the questions and build a sequence from the text and the current
question, with the correct model-specific separators token type ids and
attention masks.
4. Pass this sequence through the model. This outputs a range of scores across
the entire sequence tokens (question and text), for both the start and end
positions.
3. Iterate over the questions and build a sequence from the text and the current question, with the correct
model-specific separators token type ids and attention masks.
4. Pass this sequence through the model. This outputs a range of scores across the entire sequence tokens (question and
text), for both the start and end positions.
5. Compute the softmax of the result to get probabilities over the tokens.
6. Fetch the tokens from the identified start and stop values, convert those tokens to a string.
7. Print the results.
@@ -299,22 +297,22 @@ Here is an example of question answering using a model and a tokenizer. The proc
Language Modeling
-----------------------------------------------------------------------------------------------------------------------
Language modeling is the task of fitting a model to a corpus, which can be domain specific. All popular transformer-based
models are trained using a variant of language modeling, e.g. BERT with masked language modeling, GPT-2 with
causal language modeling.
Language modeling is the task of fitting a model to a corpus, which can be domain specific. All popular
transformer-based models are trained using a variant of language modeling, e.g. BERT with masked language modeling,
GPT-2 with causal language modeling.
Language modeling can be useful outside of pre-training as well, for example to shift the model distribution to be
domain-specific: using a language model trained over a very large corpus, and then fine-tuning it to a news dataset
or on scientific papers e.g. `LysandreJik/arxiv-nlp <https://huggingface.co/lysandre/arxiv-nlp>`__.
domain-specific: using a language model trained over a very large corpus, and then fine-tuning it to a news dataset or
on scientific papers e.g. `LysandreJik/arxiv-nlp <https://huggingface.co/lysandre/arxiv-nlp>`__.
Masked Language Modeling
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Masked language modeling is the task of masking tokens in a sequence with a masking token, and prompting the model to
fill that mask with an appropriate token. This allows the model to attend to both the right context (tokens on the
right of the mask) and the left context (tokens on the left of the mask). Such a training creates a strong basis
for downstream tasks, requiring bi-directional context such as SQuAD (question answering,
see `Lewis, Lui, Goyal et al. <https://arxiv.org/abs/1910.13461>`__, part 4.2).
right of the mask) and the left context (tokens on the left of the mask). Such a training creates a strong basis for
downstream tasks, requiring bi-directional context such as SQuAD (question answering, see `Lewis, Lui, Goyal et al.
<https://arxiv.org/abs/1910.13461>`__, part 4.2).
Here is an example of using pipelines to replace a mask from a sequence:
@@ -324,8 +322,7 @@ Here is an example of using pipelines to replace a mask from a sequence:
>>> nlp = pipeline("fill-mask")
This outputs the sequences with the mask filled, the confidence score, and the token id in the tokenizer
vocabulary:
This outputs the sequences with the mask filled, the confidence score, and the token id in the tokenizer vocabulary:
.. code-block::
@@ -359,14 +356,12 @@ vocabulary:
Here is an example of doing masked language modeling using a model and a tokenizer. The process is the following:
1. Instantiate a tokenizer and a model from the checkpoint name. The model is
identified as a DistilBERT model and loads it with the weights stored in the
checkpoint.
1. Instantiate a tokenizer and a model from the checkpoint name. The model is identified as a DistilBERT model and
loads it with the weights stored in the checkpoint.
2. Define a sequence with a masked token, placing the :obj:`tokenizer.mask_token` instead of a word.
3. Encode that sequence into a list of IDs and find the position of the masked token in that list.
4. Retrieve the predictions at the index of the mask token: this tensor has the
same size as the vocabulary, and the values are the scores attributed to each
token. The model gives higher score to tokens it deems probable in that
4. Retrieve the predictions at the index of the mask token: this tensor has the same size as the vocabulary, and the
values are the scores attributed to each token. The model gives higher score to tokens it deems probable in that
context.
5. Retrieve the top 5 tokens using the PyTorch :obj:`topk` or TensorFlow :obj:`top_k` methods.
6. Replace the mask token by the tokens and print the results
@@ -427,9 +422,12 @@ Causal language modeling is the task of predicting the token following a sequenc
model only attends to the left context (tokens on the left of the mask). Such a training is particularly interesting
for generation tasks.
Usually, the next token is predicted by sampling from the logits of the last hidden state the model produces from the input sequence.
Usually, the next token is predicted by sampling from the logits of the last hidden state the model produces from the
input sequence.
Here is an example of using the tokenizer and model and leveraging the :func:`~transformers.PreTrainedModel.top_k_top_p_filtering` method to sample the next token following an input sequence of tokens.
Here is an example of using the tokenizer and model and leveraging the
:func:`~transformers.PreTrainedModel.top_k_top_p_filtering` method to sample the next token following an input sequence
of tokens.
.. code-block::
@@ -490,12 +488,16 @@ This outputs a (hopefully) coherent next token following the original sequence,
>>> print(resulting_string)
Hugging Face is based in DUMBO, New York City, and has
In the next section, we show how this functionality is leveraged in :func:`~transformers.PreTrainedModel.generate` to generate multiple tokens up to a user-defined length.
In the next section, we show how this functionality is leveraged in :func:`~transformers.PreTrainedModel.generate` to
generate multiple tokens up to a user-defined length.
Text Generation
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
In text generation (*a.k.a* *open-ended text generation*) the goal is to create a coherent portion of text that is a continuation from the given context. The following example shows how *GPT-2* can be used in pipelines to generate text. As a default all models apply *Top-K* sampling when used in pipelines, as configured in their respective configurations (see `gpt-2 config <https://s3.amazonaws.com/models.huggingface.co/bert/gpt2-config.json>`__ for example).
In text generation (*a.k.a* *open-ended text generation*) the goal is to create a coherent portion of text that is a
continuation from the given context. The following example shows how *GPT-2* can be used in pipelines to generate text.
As a default all models apply *Top-K* sampling when used in pipelines, as configured in their respective configurations
(see `gpt-2 config <https://s3.amazonaws.com/models.huggingface.co/bert/gpt2-config.json>`__ for example).
.. code-block::
@@ -507,8 +509,9 @@ In text generation (*a.k.a* *open-ended text generation*) the goal is to create
Here, the model generates a random text with a total maximal length of *50* tokens from context *"As far as I am concerned, I will"*.
The default arguments of ``PreTrainedModel.generate()`` can be directly overriden in the pipeline, as is shown above for the argument ``max_length``.
Here, the model generates a random text with a total maximal length of *50* tokens from context *"As far as I am
concerned, I will"*. The default arguments of ``PreTrainedModel.generate()`` can be directly overridden in the
pipeline, as is shown above for the argument ``max_length``.
Here is an example of text generation using ``XLNet`` and its tokenzier.
@@ -569,25 +572,30 @@ Here is an example of text generation using ``XLNet`` and its tokenzier.
>>> print(generated)
Today the weather is really nice and I am planning on anning on taking a nice...... of a great time!<eop>...............
Text generation is currently possible with *GPT-2*, *OpenAi-GPT*, *CTRL*, *XLNet*, *Transfo-XL* and *Reformer* in PyTorch and for most models in Tensorflow as well. As can be seen in the example above *XLNet* and *Transfo-XL* often need to be padded to work well.
GPT-2 is usually a good choice for *open-ended text generation* because it was trained on millions of webpages with a causal language modeling objective.
Text generation is currently possible with *GPT-2*, *OpenAi-GPT*, *CTRL*, *XLNet*, *Transfo-XL* and *Reformer* in
PyTorch and for most models in Tensorflow as well. As can be seen in the example above *XLNet* and *Transfo-XL* often
need to be padded to work well. GPT-2 is usually a good choice for *open-ended text generation* because it was trained
on millions of webpages with a causal language modeling objective.
For more information on how to apply different decoding strategies for text generation, please also refer to our text generation blog post `here <https://huggingface.co/blog/how-to-generate>`__.
For more information on how to apply different decoding strategies for text generation, please also refer to our text
generation blog post `here <https://huggingface.co/blog/how-to-generate>`__.
Named Entity Recognition
-----------------------------------------------------------------------------------------------------------------------
Named Entity Recognition (NER) is the task of classifying tokens according to a class, for example, identifying a
token as a person, an organisation or a location.
An example of a named entity recognition dataset is the CoNLL-2003 dataset, which is entirely based on that task.
If you would like to fine-tune a model on an NER task, you may leverage the
`run_ner.py <https://github.com/huggingface/transformers/tree/master/examples/token-classification/run_ner.py>`__ (PyTorch),
`run_pl_ner.py <https://github.com/huggingface/transformers/tree/master/examples/token-classification/run_pl_ner.py>`__ (leveraging pytorch-lightning) or the
`run_tf_ner.py <https://github.com/huggingface/transformers/tree/master/examples/token-classification/run_tf_ner.py>`__ (TensorFlow) scripts.
Named Entity Recognition (NER) is the task of classifying tokens according to a class, for example, identifying a token
as a person, an organisation or a location. An example of a named entity recognition dataset is the CoNLL-2003 dataset,
which is entirely based on that task. If you would like to fine-tune a model on an NER task, you may leverage the
`run_ner.py <https://github.com/huggingface/transformers/tree/master/examples/token-classification/run_ner.py>`__
(PyTorch), `run_pl_ner.py
<https://github.com/huggingface/transformers/tree/master/examples/token-classification/run_pl_ner.py>`__ (leveraging
pytorch-lightning) or the `run_tf_ner.py
<https://github.com/huggingface/transformers/tree/master/examples/token-classification/run_tf_ner.py>`__ (TensorFlow)
scripts.
Here is an example of using pipelines to do named entity recognition, specifically, trying to identify tokens as belonging to one
of 9 classes:
Here is an example of using pipelines to do named entity recognition, specifically, trying to identify tokens as
belonging to one of 9 classes:
- O, Outside of a named entity
- B-MIS, Beginning of a miscellaneous entity right after another miscellaneous entity
@@ -599,8 +607,8 @@ of 9 classes:
- B-LOC, Beginning of a location right after another location
- I-LOC, Location
It leverages a fine-tuned model on CoNLL-2003, fine-tuned by `@stefan-it <https://github.com/stefan-it>`__ from
`dbmdz <https://github.com/dbmdz>`__.
It leverages a fine-tuned model on CoNLL-2003, fine-tuned by `@stefan-it <https://github.com/stefan-it>`__ from `dbmdz
<https://github.com/dbmdz>`__.
.. code-block::
@@ -612,8 +620,8 @@ It leverages a fine-tuned model on CoNLL-2003, fine-tuned by `@stefan-it <https:
... "close to the Manhattan Bridge which is visible from the window."
This outputs a list of all words that have been identified as one of the entities from the 9 classes defined above. Here are the
expected results:
This outputs a list of all words that have been identified as one of the entities from the 9 classes defined above.
Here are the expected results:
.. code-block::
@@ -633,24 +641,21 @@ expected results:
{'word': 'Bridge', 'score': 0.990249514579773, 'entity': 'I-LOC'}
]
Note, how the tokens of the sequence "Hugging Face" have been identified as an organisation, and "New York City", "DUMBO" and
"Manhattan Bridge" have been identified as locations.
Note, how the tokens of the sequence "Hugging Face" have been identified as an organisation, and "New York City",
"DUMBO" and "Manhattan Bridge" have been identified as locations.
Here is an example of doing named entity recognition, using a model and a tokenizer. The process is the following:
1. Instantiate a tokenizer and a model from the checkpoint name. The model is
identified as a BERT model and loads it with the weights stored in the
checkpoint.
1. Instantiate a tokenizer and a model from the checkpoint name. The model is identified as a BERT model and loads it
with the weights stored in the checkpoint.
2. Define the label list with which the model was trained on.
3. Define a sequence with known entities, such as "Hugging Face" as an organisation and "New York City" as a location.
4. Split words into tokens so that they can be mapped to predictions. We use a
small hack by, first, completely encoding and decoding the sequence, so that
we're left with a string that contains the special tokens.
4. Split words into tokens so that they can be mapped to predictions. We use a small hack by, first, completely
encoding and decoding the sequence, so that we're left with a string that contains the special tokens.
5. Encode that sequence into IDs (special tokens are added automatically).
6. Retrieve the predictions by passing the input to the model and getting the
first output. This results in a distribution over the 9 possible classes for
each token. We take the argmax to retrieve the most likely class for each
token.
6. Retrieve the predictions by passing the input to the model and getting the first output. This results in a
distribution over the 9 possible classes for each token. We take the argmax to retrieve the most likely class for
each token.
7. Zip together each token with its prediction and print it.
.. code-block::
@@ -713,9 +718,9 @@ Here is an example of doing named entity recognition, using a model and a tokeni
>>> predictions = tf.argmax(outputs, axis=2)
This outputs a list of each token mapped to its corresponding prediction. Differently from the pipeline, here every token has
a prediction as we didn't remove the "0"th class, which means that no particular entity was found on that token. The
following array should be the output:
This outputs a list of each token mapped to its corresponding prediction. Differently from the pipeline, here every
token has a prediction as we didn't remove the "0"th class, which means that no particular entity was found on that
token. The following array should be the output:
.. code-block::
@@ -727,11 +732,13 @@ Summarization
Summarization is the task of summarizing a document or an article into a shorter text.
An example of a summarization dataset is the CNN / Daily Mail dataset, which consists of long news articles and was created for the task of summarization.
If you would like to fine-tune a model on a summarization task, various approaches are described in this
`document <https://github.com/huggingface/transformers/blob/master/examples/seq2seq/README.md>`__.
An example of a summarization dataset is the CNN / Daily Mail dataset, which consists of long news articles and was
created for the task of summarization. If you would like to fine-tune a model on a summarization task, various
approaches are described in this `document
<https://github.com/huggingface/transformers/blob/master/examples/seq2seq/README.md>`__.
Here is an example of using the pipelines to do summarization. It leverages a Bart model that was fine-tuned on the CNN / Daily Mail data set.
Here is an example of using the pipelines to do summarization. It leverages a Bart model that was fine-tuned on the CNN
/ Daily Mail data set.
.. code-block::
@@ -758,9 +765,9 @@ Here is an example of using the pipelines to do summarization. It leverages a Ba
... If convicted, Barrientos faces up to four years in prison. Her next court appearance is scheduled for May 18.
... """
Because the summarization pipeline depends on the ``PreTrainedModel.generate()`` method, we can override the default arguments
of ``PreTrainedModel.generate()`` directly in the pipeline for ``max_length`` and ``min_length`` as shown below.
This outputs the following summary:
Because the summarization pipeline depends on the ``PreTrainedModel.generate()`` method, we can override the default
arguments of ``PreTrainedModel.generate()`` directly in the pipeline for ``max_length`` and ``min_length`` as shown
below. This outputs the following summary:
.. code-block::
@@ -769,12 +776,14 @@ This outputs the following summary:
Here is an example of doing summarization using a model and a tokenizer. The process is the following:
1. Instantiate a tokenizer and a model from the checkpoint name. Summarization is usually done using an encoder-decoder model, such as ``Bart`` or ``T5``.
1. Instantiate a tokenizer and a model from the checkpoint name. Summarization is usually done using an encoder-decoder
model, such as ``Bart`` or ``T5``.
2. Define the article that should be summarized.
3. Add the T5 specific prefix "summarize: ".
4. Use the ``PreTrainedModel.generate()`` method to generate the summary.
In this example we use Google`s T5 model. Even though it was pre-trained only on a multi-task mixed dataset (including CNN / Daily Mail), it yields very good results.
In this example we use Google`s T5 model. Even though it was pre-trained only on a multi-task mixed dataset (including
CNN / Daily Mail), it yields very good results.
.. code-block::
@@ -802,14 +811,13 @@ Translation
Translation is the task of translating a text from one language to another.
An example of a translation dataset is the WMT English to German dataset, which has sentences in English as the input data
and the corresponding sentences in German as the target data.
If you would like to fine-tune a model on a translation task, various approaches are described in this
`document <https://github.com/huggingface/transformers/blob/master/examples/seq2seq/README.md>`__.
An example of a translation dataset is the WMT English to German dataset, which has sentences in English as the input
data and the corresponding sentences in German as the target data. If you would like to fine-tune a model on a
translation task, various approaches are described in this `document
<https://github.com/huggingface/transformers/blob/master/examples/seq2seq/README.md>`__.
Here is an example of using the pipelines to do translation.
It leverages a T5 model that was only pre-trained on a multi-task mixture dataset (including WMT), yet, yielding impressive
translation results.
Here is an example of using the pipelines to do translation. It leverages a T5 model that was only pre-trained on a
multi-task mixture dataset (including WMT), yet, yielding impressive translation results.
.. code-block::
@@ -819,12 +827,13 @@ translation results.
>>> print(translator("Hugging Face is a technology company based in New York and Paris", max_length=40))
[{'translation_text': 'Hugging Face ist ein Technologieunternehmen mit Sitz in New York und Paris.'}]
Because the translation pipeline depends on the ``PreTrainedModel.generate()`` method, we can override the default arguments
of ``PreTrainedModel.generate()`` directly in the pipeline as is shown for ``max_length`` above.
Because the translation pipeline depends on the ``PreTrainedModel.generate()`` method, we can override the default
arguments of ``PreTrainedModel.generate()`` directly in the pipeline as is shown for ``max_length`` above.
Here is an example of doing translation using a model and a tokenizer. The process is the following:
1. Instantiate a tokenizer and a model from the checkpoint name. Summarization is usually done using an encoder-decoder model, such as ``Bart`` or ``T5``.
1. Instantiate a tokenizer and a model from the checkpoint name. Summarization is usually done using an encoder-decoder
model, such as ``Bart`` or ``T5``.
2. Define the article that should be summarizaed.
3. Add the T5 specific prefix "translate English to German: "
4. Use the ``PreTrainedModel.generate()`` method to perform the translation.

View File

@@ -12,17 +12,26 @@ There are 2 test suites in the repository:
How transformers are tested
-----------------------------------------------------------------------------------------------------------------------
1. Once a PR is submitted it gets tested with 9 CircleCi jobs. Every new commit to that PR gets retested. These jobs are defined in this `config file <https://github.com/huggingface/transformers/blob/master/.circleci/config.yml>`__, so that if needed you can reproduce the same environment on your machine.
1. Once a PR is submitted it gets tested with 9 CircleCi jobs. Every new commit to that PR gets retested. These jobs
are defined in this `config file <https://github.com/huggingface/transformers/blob/master/.circleci/config.yml>`__,
so that if needed you can reproduce the same environment on your machine.
These CI jobs don't run ``@slow`` tests.
2. There are 3 jobs run by `github actions <https://github.com/huggingface/transformers/actions>`__:
* `torch hub integration <https://github.com/huggingface/transformers/blob/master/.github/workflows/github-torch-hub.yml>`__: checks whether torch hub integration works.
* `torch hub integration
<https://github.com/huggingface/transformers/blob/master/.github/workflows/github-torch-hub.yml>`__: checks
whether torch hub integration works.
* `self-hosted (push) <https://github.com/huggingface/transformers/blob/master/.github/workflows/self-push.yml>`__: runs fast tests on GPU only on commits on ``master``. It only runs if a commit on ``master`` has updated the code in one of the following folders: ``src``, ``tests``, ``.github`` (to prevent running on added model cards, notebooks, etc.)
* `self-hosted runner <https://github.com/huggingface/transformers/blob/master/.github/workflows/self-scheduled.yml>`__: runs normal and slow tests on GPU in ``tests`` and ``examples``:
* `self-hosted (push) <https://github.com/huggingface/transformers/blob/master/.github/workflows/self-push.yml>`__:
runs fast tests on GPU only on commits on ``master``. It only runs if a commit on ``master`` has updated the code
in one of the following folders: ``src``, ``tests``, ``.github`` (to prevent running on added model cards,
notebooks, etc.)
* `self-hosted runner
<https://github.com/huggingface/transformers/blob/master/.github/workflows/self-scheduled.yml>`__: runs normal and
slow tests on GPU in ``tests`` and ``examples``:
.. code-block:: bash
@@ -43,7 +52,8 @@ Running tests
Choosing which tests to run
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
This document goes into many details of how tests can be run. If after reading everything, you need even more details you will find them `here <https://docs.pytest.org/en/latest/usage.html>`__.
This document goes into many details of how tests can be run. If after reading everything, you need even more details
you will find them `here <https://docs.pytest.org/en/latest/usage.html>`__.
Here are some most useful ways of running tests.
@@ -90,7 +100,7 @@ All tests of a given test file:
pytest tests/test_optimization.py --collect-only -q
Run a specific test module
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
@@ -99,12 +109,13 @@ To run an individual test module:
.. code-block:: bash
pytest tests/test_logging.py
Run specific tests
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Since unittest is used inside most of the tests, to run specific subtests you need to know the name of the unittest class containing those tests. For example, it could be:
Since unittest is used inside most of the tests, to run specific subtests you need to know the name of the unittest
class containing those tests. For example, it could be:
.. code-block:: bash
@@ -131,7 +142,7 @@ As mentioned earlier you can see what tests are contained inside the ``Optimizat
pytest tests/test_optimization.py::OptimizationTest --collect-only -q
You can run tests by keyword expressions.
To run only tests whose name contains ``adam``:
@@ -158,7 +169,9 @@ And you can combine the two patterns in one:
Run only modified tests
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
You can run the tests related to the unstaged files or the current branch (according to Git) by using `pytest-picked <https://github.com/anapaulagomes/pytest-picked>`__. This is a great way of quickly testing your changes didn't break anything, since it won't run the tests related to files you didn't touch.
You can run the tests related to the unstaged files or the current branch (according to Git) by using `pytest-picked
<https://github.com/anapaulagomes/pytest-picked>`__. This is a great way of quickly testing your changes didn't break
anything, since it won't run the tests related to files you didn't touch.
.. code-block:: bash
@@ -168,17 +181,14 @@ You can run the tests related to the unstaged files or the current branch (accor
pytest --picked
All tests will be run from files and folders which are modified, but not
yet committed.
All tests will be run from files and folders which are modified, but not yet committed.
Automatically rerun failed tests on source modification
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
`pytest-xdist <https://github.com/pytest-dev/pytest-xdist>`__ provides a
very useful feature of detecting all failed tests, and then waiting for
you to modify files and continuously re-rerun those failing tests until
they pass while you fix them. So that you don't need to re start pytest
after you made the fix. This is repeated until all tests pass after
`pytest-xdist <https://github.com/pytest-dev/pytest-xdist>`__ provides a very useful feature of detecting all failed
tests, and then waiting for you to modify files and continuously re-rerun those failing tests until they pass while you
fix them. So that you don't need to re start pytest after you made the fix. This is repeated until all tests pass after
which again a full run is performed.
.. code-block:: bash
@@ -187,10 +197,9 @@ which again a full run is performed.
To enter the mode: ``pytest -f`` or ``pytest --looponfail``
File changes are detected by looking at ``looponfailroots`` root
directories and all of their contents (recursively). If the default for
this value does not work for you, you can change it in your project by
setting a configuration option in ``setup.cfg``:
File changes are detected by looking at ``looponfailroots`` root directories and all of their contents (recursively).
If the default for this value does not work for you, you can change it in your project by setting a configuration
option in ``setup.cfg``:
.. code-block:: ini
@@ -204,17 +213,17 @@ or ``pytest.ini``/``tox.ini`` files:
[pytest]
looponfailroots = transformers tests
This would lead to only looking for file changes in the respective
directories, specified relatively to the ini-files directory.
This would lead to only looking for file changes in the respective directories, specified relatively to the ini-files
directory.
`pytest-watch <https://github.com/joeyespo/pytest-watch>`__ is an
alternative implementation of this functionality.
`pytest-watch <https://github.com/joeyespo/pytest-watch>`__ is an alternative implementation of this functionality.
Skip a test module
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
If you want to run all test modules, except a few you can exclude them by giving an explicit list of tests to run. For example, to run all except ``test_modeling_*.py`` tests:
If you want to run all test modules, except a few you can exclude them by giving an explicit list of tests to run. For
example, to run all except ``test_modeling_*.py`` tests:
.. code-block:: bash
@@ -224,8 +233,7 @@ If you want to run all test modules, except a few you can exclude them by giving
Clearing state
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
CI builds and when isolation is important (against speed), cache should
be cleared:
CI builds and when isolation is important (against speed), cache should be cleared:
.. code-block:: bash
@@ -234,24 +242,23 @@ be cleared:
Running tests in parallel
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
As mentioned earlier ``make test`` runs tests in parallel via ``pytest-xdist`` plugin (``-n X`` argument, e.g. ``-n 2`` to run 2 parallel jobs).
As mentioned earlier ``make test`` runs tests in parallel via ``pytest-xdist`` plugin (``-n X`` argument, e.g. ``-n 2``
to run 2 parallel jobs).
``pytest-xdist``'s ``--dist=`` option allows one to control how the tests are grouped. ``--dist=loadfile`` puts the tests located in one file onto the same process.
``pytest-xdist``'s ``--dist=`` option allows one to control how the tests are grouped. ``--dist=loadfile`` puts the
tests located in one file onto the same process.
Since the order of executed tests is different and unpredictable, if
running the test suite with ``pytest-xdist`` produces failures (meaning
we have some undetected coupled tests), use
`pytest-replay <https://github.com/ESSS/pytest-replay>`__ to replay the
tests in the same order, which should help with then somehow reducing
that failing sequence to a minimum.
Since the order of executed tests is different and unpredictable, if running the test suite with ``pytest-xdist``
produces failures (meaning we have some undetected coupled tests), use `pytest-replay
<https://github.com/ESSS/pytest-replay>`__ to replay the tests in the same order, which should help with then somehow
reducing that failing sequence to a minimum.
Test order and repetition
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
It's good to repeat the tests several times, in sequence, randomly, or
in sets, to detect any potential inter-dependency and state-related bugs
(tear down). And the straightforward multiple repetition is just good to
detect some problems that get uncovered by randomness of DL.
It's good to repeat the tests several times, in sequence, randomly, or in sets, to detect any potential
inter-dependency and state-related bugs (tear down). And the straightforward multiple repetition is just good to detect
some problems that get uncovered by randomness of DL.
Repeat tests
@@ -268,10 +275,10 @@ And then run every test multiple times (50 by default):
.. code-block:: bash
pytest --flake-finder --flake-runs=5 tests/test_failing_test.py
.. note::
This plugin doesn't work with ``-n`` flag from ``pytest-xdist``.
.. note::
There is another plugin ``pytest-repeat``, but it doesn't work with ``unittest``.
@@ -283,14 +290,11 @@ Run tests in a random order
pip install pytest-random-order
Important: the presence of ``pytest-random-order`` will automatically
randomize tests, no configuration change or command line options is
required.
Important: the presence of ``pytest-random-order`` will automatically randomize tests, no configuration change or
command line options is required.
As explained earlier this allows detection of coupled tests - where one
test's state affects the state of another. When ``pytest-random-order``
is installed it will print the random seed it used for that session,
e.g:
As explained earlier this allows detection of coupled tests - where one test's state affects the state of another. When
``pytest-random-order`` is installed it will print the random seed it used for that session, e.g:
.. code-block:: bash
@@ -299,8 +303,7 @@ e.g:
Using --random-order-bucket=module
Using --random-order-seed=573663
So that if the given particular sequence fails, you can reproduce it by
adding that exact seed, e.g.:
So that if the given particular sequence fails, you can reproduce it by adding that exact seed, e.g.:
.. code-block:: bash
@@ -309,11 +312,9 @@ adding that exact seed, e.g.:
Using --random-order-bucket=module
Using --random-order-seed=573663
It will only reproduce the exact order if you use the exact same list of
tests (or no list at all). Once you start to manually narrowing
down the list you can no longer rely on the seed, but have to list them
manually in the exact order they failed and tell pytest to not randomize
them instead using ``--random-order-bucket=none``, e.g.:
It will only reproduce the exact order if you use the exact same list of tests (or no list at all). Once you start to
manually narrowing down the list you can no longer rely on the seed, but have to list them manually in the exact order
they failed and tell pytest to not randomize them instead using ``--random-order-bucket=none``, e.g.:
.. code-block:: bash
@@ -325,12 +326,13 @@ To disable the shuffling for all tests:
pytest --random-order-bucket=none
By default ``--random-order-bucket=module`` is implied, which will
shuffle the files on the module levels. It can also shuffle on
``class``, ``package``, ``global`` and ``none`` levels. For the complete
details please see its `documentation <https://github.com/jbasko/pytest-random-order>`__.
By default ``--random-order-bucket=module`` is implied, which will shuffle the files on the module levels. It can also
shuffle on ``class``, ``package``, ``global`` and ``none`` levels. For the complete details please see its
`documentation <https://github.com/jbasko/pytest-random-order>`__.
Another randomization alternative is: ``pytest-randomly`` <https://github.com/pytest-dev/pytest-randomly>`__. This module has a very similar functionality/interface, but it doesn't have the bucket modes available in ``pytest-random-order``. It has the same problem of imposing itself once installed.
Another randomization alternative is: ``pytest-randomly`` <https://github.com/pytest-dev/pytest-randomly>`__. This
module has a very similar functionality/interface, but it doesn't have the bucket modes available in
``pytest-random-order``. It has the same problem of imposing itself once installed.
Look and feel variations
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
@@ -338,13 +340,11 @@ Look and feel variations
pytest-sugar
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
`pytest-sugar <https://github.com/Frozenball/pytest-sugar>`__ is a
plugin that improves the look-n-feel, adds a progressbar, and show tests
that fail and the assert instantly. It gets activated automatically upon
installation.
`pytest-sugar <https://github.com/Frozenball/pytest-sugar>`__ is a plugin that improves the look-n-feel, adds a
progressbar, and show tests that fail and the assert instantly. It gets activated automatically upon installation.
.. code-block:: bash
pip install pytest-sugar
To run tests without it, run:
@@ -360,8 +360,7 @@ or uninstall it.
Report each sub-test name and its progress
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
For a single or a group of tests via ``pytest`` (after
``pip install pytest-pspec``):
For a single or a group of tests via ``pytest`` (after ``pip install pytest-pspec``):
.. code-block:: bash
@@ -372,9 +371,8 @@ For a single or a group of tests via ``pytest`` (after
Instantly shows failed tests
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
`pytest-instafail <https://github.com/pytest-dev/pytest-instafail>`__
shows failures and errors instantly instead of waiting until the end of
test session.
`pytest-instafail <https://github.com/pytest-dev/pytest-instafail>`__ shows failures and errors instantly instead of
waiting until the end of test session.
.. code-block:: bash
@@ -390,18 +388,20 @@ To GPU or not to GPU
On a GPU-enabled setup, to test in CPU-only mode add ``CUDA_VISIBLE_DEVICES=""``:
.. code-block:: bash
CUDA_VISIBLE_DEVICES="" pytest tests/test_logging.py
or if you have multiple gpus, you can specify which one is to be used by ``pytest``. For example, to use only the second gpu if you have gpus ``0`` and ``1``, you can run:
or if you have multiple gpus, you can specify which one is to be used by ``pytest``. For example, to use only the
second gpu if you have gpus ``0`` and ``1``, you can run:
.. code-block:: bash
CUDA_VISIBLE_DEVICES="1" pytest tests/test_logging.py
This is handy when you want to run different tasks on different GPUs.
Some tests must be run on CPU-only, others on either CPU or GPU or TPU, yet others on multiple-GPUs. The following skip decorators are used to set the requirements of tests CPU/GPU/TPU-wise:
Some tests must be run on CPU-only, others on either CPU or GPU or TPU, yet others on multiple-GPUs. The following skip
decorators are used to set the requirements of tests CPU/GPU/TPU-wise:
* ``require_torch`` - this test will run only under torch
* ``require_torch_gpu`` - as ``require_torch`` plus requires at least 1 GPU
@@ -409,6 +409,22 @@ Some tests must be run on CPU-only, others on either CPU or GPU or TPU, yet othe
* ``require_torch_non_multigpu`` - as ``require_torch`` plus requires 0 or 1 GPUs
* ``require_torch_tpu`` - as ``require_torch`` plus requires at least 1 TPU
Let's depict the GPU requirements in the following table:
+----------+---------------------------------+
| n gpus | decorator |
+==========+=================================+
| ``>= 0`` | ``@require_torch`` |
+----------+---------------------------------+
| ``>= 1`` | ``@require_torch_gpu`` |
+----------+---------------------------------+
| ``>= 2`` | ``@require_torch_multigpu`` |
+----------+---------------------------------+
| ``< 2`` | ``@require_torch_non_multigpu`` |
+----------+---------------------------------+
For example, here is a test that must be run only when there are 2 or more GPUs available and pytorch is installed:
.. code-block:: python
@@ -423,7 +439,8 @@ If a test requires ``tensorflow`` use the ``require_tf`` decorator. For example:
@require_tf
def test_tf_thing_with_tensorflow():
These decorators can be stacked. For example, if a test is slow and requires at least one GPU under pytorch, here is how to set it up:
These decorators can be stacked. For example, if a test is slow and requires at least one GPU under pytorch, here is
how to set it up:
.. code-block:: python
@@ -431,7 +448,8 @@ These decorators can be stacked. For example, if a test is slow and requires at
@slow
def test_example_slow_on_gpu():
Some decorators like ``@parametrized`` rewrite test names, therefore ``@require_*`` skip decorators have to be listed last for them to work correctly. Here is an example of the correct usage:
Some decorators like ``@parametrized`` rewrite test names, therefore ``@require_*`` skip decorators have to be listed
last for them to work correctly. Here is an example of the correct usage:
.. code-block:: python
@@ -439,7 +457,8 @@ Some decorators like ``@parametrized`` rewrite test names, therefore ``@require_
@require_torch_multigpu
def test_integration_foo():
This order problem doesn't exist with ``@pytest.mark.parametrize``, you can put it first or last and it will still work. But it only works with non-unittests.
This order problem doesn't exist with ``@pytest.mark.parametrize``, you can put it first or last and it will still
work. But it only works with non-unittests.
Inside tests:
@@ -447,21 +466,44 @@ Inside tests:
.. code-block:: bash
torch.cuda.device_count()
from transformers.testing_utils import get_gpu_count
n_gpu = get_gpu_count() # works with torch and tf
Distributed training
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
``pytest`` can't deal with distributed training directly. If this is attempted - the sub-processes don't do the right
thing and end up thinking they are ``pytest`` and start running the test suite in loops. It works, however, if one
spawns a normal process that then spawns off multiple workers and manages the IO pipes.
This is still under development but you can study 2 different tests that perform this successfully:
* `test_seq2seq_examples_multi_gpu.py
<https://github.com/huggingface/transformers/blob/master/examples/seq2seq/test_seq2seq_examples_multi_gpu.py>`__ - a
``pytorch-lightning``-running test (had to use PL's ``ddp`` spawning method which is the default)
* `test_finetune_trainer.py
<https://github.com/huggingface/transformers/blob/master/examples/seq2seq/test_finetune_trainer.py>`__ - a normal
(non-PL) test
To jump right into the execution point, search for the ``execute_subprocess_async`` function in those tests.
You will need at least 2 GPUs to see these tests in action:
.. code-block:: bash
CUDA_VISIBLE_DEVICES="0,1" RUN_SLOW=1 pytest -sv examples/seq2seq/test_finetune_trainer.py \
examples/seq2seq/test_seq2seq_examples_multi_gpu.py
Output capture
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
During test execution any output sent to ``stdout`` and ``stderr`` is
captured. If a test or a setup method fails, its according captured
output will usually be shown along with the failure traceback.
During test execution any output sent to ``stdout`` and ``stderr`` is captured. If a test or a setup method fails, its
according captured output will usually be shown along with the failure traceback.
To disable output capturing and to get the ``stdout`` and ``stderr``
normally, use ``-s`` or ``--capture=no``:
To disable output capturing and to get the ``stdout`` and ``stderr`` normally, use ``-s`` or ``--capture=no``:
.. code-block:: bash
@@ -494,9 +536,8 @@ Creating a URL for each test failure:
pytest --pastebin=failed tests/test_logging.py
This will submit test run information to a remote Paste service and
provide a URL for each failure. You may select tests as usual or add for
example -x if you only want to send one particular failure.
This will submit test run information to a remote Paste service and provide a URL for each failure. You may select
tests as usual or add for example -x if you only want to send one particular failure.
Creating a URL for a whole test session log:
@@ -509,18 +550,22 @@ Creating a URL for a whole test session log:
Writing tests
-----------------------------------------------------------------------------------------------------------------------
🤗 transformers tests are based on ``unittest``, but run by ``pytest``, so most of the time features from both systems can be used.
🤗 transformers tests are based on ``unittest``, but run by ``pytest``, so most of the time features from both systems
can be used.
You can read `here <https://docs.pytest.org/en/stable/unittest.html>`__ which features are supported, but the important thing to remember is that most ``pytest`` fixtures don't work. Neither parametrization, but we use the module ``parameterized`` that works in a similar way.
You can read `here <https://docs.pytest.org/en/stable/unittest.html>`__ which features are supported, but the important
thing to remember is that most ``pytest`` fixtures don't work. Neither parametrization, but we use the module
``parameterized`` that works in a similar way.
Parametrization
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Often, there is a need to run the same test multiple times, but with different arguments. It could be done from within the test, but then there is no way of running that test for just one set of arguments.
Often, there is a need to run the same test multiple times, but with different arguments. It could be done from within
the test, but then there is no way of running that test for just one set of arguments.
.. code-block:: python
# test_this1.py
import unittest
from parameterized import parameterized
@@ -533,7 +578,8 @@ Often, there is a need to run the same test multiple times, but with different a
def test_floor(self, name, input, expected):
assert_equal(math.floor(input), expected)
Now, by default this test will be run 3 times, each time with the last 3 arguments of ``test_floor`` being assigned the corresponding arguments in the parameter list.
Now, by default this test will be run 3 times, each time with the last 3 arguments of ``test_floor`` being assigned the
corresponding arguments in the parameter list.
and you could run just the ``negative`` and ``integer`` sets of params with:
@@ -547,14 +593,15 @@ or all but ``negative`` sub-tests, with:
pytest -k "not negative" tests/test_mytest.py
Besides using the ``-k`` filter that was just mentioned, you can find out the exact name of each sub-test and run any or all of them using their exact names.
Besides using the ``-k`` filter that was just mentioned, you can find out the exact name of each sub-test and run any
or all of them using their exact names.
.. code-block:: bash
pytest test_this1.py --collect-only -q
and it will list:
.. code-block:: bash
test_this1.py::TestMathUnitTest::test_floor_0_negative
@@ -566,10 +613,12 @@ So now you can run just 2 specific sub-tests:
.. code-block:: bash
pytest test_this1.py::TestMathUnitTest::test_floor_0_negative test_this1.py::TestMathUnitTest::test_floor_1_integer
The module `parameterized <https://pypi.org/project/parameterized/>`__ which is already in the developer dependencies of ``transformers`` works for both: ``unittests`` and ``pytest`` tests.
If, however, the test is not a ``unittest``, you may use ``pytest.mark.parametrize`` (or you may see it being used in some existing tests, mostly under ``examples``).
The module `parameterized <https://pypi.org/project/parameterized/>`__ which is already in the developer dependencies
of ``transformers`` works for both: ``unittests`` and ``pytest`` tests.
If, however, the test is not a ``unittest``, you may use ``pytest.mark.parametrize`` (or you may see it being used in
some existing tests, mostly under ``examples``).
Here is the same example, this time using ``pytest``'s ``parametrize`` marker:
@@ -588,14 +637,16 @@ Here is the same example, this time using ``pytest``'s ``parametrize`` marker:
def test_floor(name, input, expected):
assert_equal(math.floor(input), expected)
Same as with ``parameterized``, with ``pytest.mark.parametrize`` you can have a fine control over which sub-tests are run, if the ``-k`` filter doesn't do the job. Except, this parametrization function creates a slightly different set of names for the sub-tests. Here is what they look like:
Same as with ``parameterized``, with ``pytest.mark.parametrize`` you can have a fine control over which sub-tests are
run, if the ``-k`` filter doesn't do the job. Except, this parametrization function creates a slightly different set of
names for the sub-tests. Here is what they look like:
.. code-block:: bash
pytest test_this2.py --collect-only -q
and it will list:
.. code-block:: bash
test_this2.py::test_floor[integer-1-1.0]
@@ -610,16 +661,69 @@ So now you can run just the specific test:
as in the previous example.
Files and directories
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
In tests often we need to know where things are relative to the current test file, and it's not trivial since the test
could be invoked from more than one directory or could reside in sub-directories with different depths. A helper class
:obj:`transformers.test_utils.TestCasePlus` solves this problem by sorting out all the basic paths and provides easy
accessors to them:
* ``pathlib`` objects (all fully resolved):
- ``test_file_path`` - the current test file path, i.e. ``__file__``
- ``test_file_dir`` - the directory containing the current test file
- ``tests_dir`` - the directory of the ``tests`` test suite
- ``examples_dir`` - the directory of the ``examples`` test suite
- ``repo_root_dir`` - the directory of the repository
- ``src_dir`` - the directory of ``src`` (i.e. where the ``transformers`` sub-dir resides)
* stringified paths---same as above but these return paths as strings, rather than ``pathlib`` objects:
- ``test_file_path_str``
- ``test_file_dir_str``
- ``tests_dir_str``
- ``examples_dir_str``
- ``repo_root_dir_str``
- ``src_dir_str``
To start using those all you need is to make sure that the test resides in a subclass of
:obj:`transformers.test_utils.TestCasePlus`. For example:
.. code-block:: python
from transformers.testing_utils import TestCasePlus
class PathExampleTest(TestCasePlus):
def test_something_involving_local_locations(self):
data_dir = self.examples_dir / "seq2seq/test_data/wmt_en_ro"
If you don't need to manipulated paths via ``pathlib`` or you just need a path as a string, you can always invoked
``str()`` on the ``pathlib`` oboject or use the accessors ending with ``_str``. For example:
.. code-block:: python
from transformers.testing_utils import TestCasePlus
class PathExampleTest(TestCasePlus):
def test_something_involving_stringified_locations(self):
examples_dir = self.examples_dir_str
Temporary files and directories
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Using unique temporary files and directories are essential for parallel test running, so that the tests won't overwrite each other's data. Also we want to get the temp files and directories removed at the end of each test that created them. Therefore, using packages like ``tempfile``, which address these needs is essential.
Using unique temporary files and directories are essential for parallel test running, so that the tests won't overwrite
each other's data. Also we want to get the temp files and directories removed at the end of each test that created
them. Therefore, using packages like ``tempfile``, which address these needs is essential.
However, when debugging tests, you need to be able to see what goes into the temp file or directory and you want to know it's exact path and not having it randomized on every test re-run.
However, when debugging tests, you need to be able to see what goes into the temp file or directory and you want to
know it's exact path and not having it randomized on every test re-run.
A helper class :obj:`transformers.test_utils.TestCasePlus` is best used for such purposes. It's a sub-class of :obj:`unittest.TestCase`, so we can easily inherit from it in the test modules.
A helper class :obj:`transformers.test_utils.TestCasePlus` is best used for such purposes. It's a sub-class of
:obj:`unittest.TestCase`, so we can easily inherit from it in the test modules.
Here is an example of its usage:
@@ -627,28 +731,32 @@ Here is an example of its usage:
from transformers.testing_utils import TestCasePlus
class ExamplesTests(TestCasePlus):
def test_whatever(self):
tmp_dir = self.get_auto_remove_tmp_dir()
def test_whatever(self):
tmp_dir = self.get_auto_remove_tmp_dir()
This code creates a unique temporary directory, and sets :obj:`tmp_dir` to its location.
In this and all the following scenarios the temporary directory will be auto-removed at the end of test, unless ``after=False`` is passed to the helper function.
In this and all the following scenarios the temporary directory will be auto-removed at the end of test, unless
``after=False`` is passed to the helper function.
* Create a temporary directory of my choice and delete it at the end - useful for debugging when you want to monitor a specific directory:
* Create a temporary directory of my choice and delete it at the end - useful for debugging when you want to monitor a
specific directory:
.. code-block:: python
def test_whatever(self):
tmp_dir = self.get_auto_remove_tmp_dir(tmp_dir="./tmp/run/test")
* Create a temporary directory of my choice and do not delete it at the end---useful for when you want to look at the temp results:
* Create a temporary directory of my choice and do not delete it at the end---useful for when you want to look at the
temp results:
.. code-block:: python
def test_whatever(self):
tmp_dir = self.get_auto_remove_tmp_dir(tmp_dir="./tmp/run/test", after=False)
* Create a temporary directory of my choice and ensure to delete it right away---useful for when you disabled deletion in the previous test run and want to make sure the that temporary directory is empty before the new test is run:
* Create a temporary directory of my choice and ensure to delete it right away---useful for when you disabled deletion
in the previous test run and want to make sure the that temporary directory is empty before the new test is run:
.. code-block:: python
@@ -656,38 +764,33 @@ In this and all the following scenarios the temporary directory will be auto-rem
tmp_dir = self.get_auto_remove_tmp_dir(tmp_dir="./tmp/run/test", before=True)
.. note::
In order to run the equivalent of ``rm -r`` safely, only subdirs of the project repository checkout are allowed if an explicit obj:`tmp_dir` is used, so that by mistake no ``/tmp`` or similar important part of the filesystem will get nuked. i.e. please always pass paths that start with ``./``.
In order to run the equivalent of ``rm -r`` safely, only subdirs of the project repository checkout are allowed if
an explicit obj:`tmp_dir` is used, so that by mistake no ``/tmp`` or similar important part of the filesystem will
get nuked. i.e. please always pass paths that start with ``./``.
.. note::
Each test can register multiple temporary directories and they all will get auto-removed, unless requested otherwise.
Each test can register multiple temporary directories and they all will get auto-removed, unless requested
otherwise.
Skipping tests
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
This is useful when a bug is found and a new test is written, yet the
bug is not fixed yet. In order to be able to commit it to the main
repository we need make sure it's skipped during ``make test``.
This is useful when a bug is found and a new test is written, yet the bug is not fixed yet. In order to be able to
commit it to the main repository we need make sure it's skipped during ``make test``.
Methods:
- A **skip** means that you expect your test to pass only if some
conditions are met, otherwise pytest should skip running the test
altogether. Common examples are skipping windows-only tests on
non-windows platforms, or skipping tests that depend on an external
resource which is not available at the moment (for example a
database).
- A **skip** means that you expect your test to pass only if some conditions are met, otherwise pytest should skip
running the test altogether. Common examples are skipping windows-only tests on non-windows platforms, or skipping
tests that depend on an external resource which is not available at the moment (for example a database).
- A **xfail** means that you expect a test to fail for some reason. A
common example is a test for a feature not yet implemented, or a bug
not yet fixed. When a test passes despite being expected to fail
(marked with pytest.mark.xfail), its an xpass and will be reported
in the test summary.
- A **xfail** means that you expect a test to fail for some reason. A common example is a test for a feature not yet
implemented, or a bug not yet fixed. When a test passes despite being expected to fail (marked with
pytest.mark.xfail), its an xpass and will be reported in the test summary.
One of the important differences between the two is that ``skip``
doesn't run the test, and ``xfail`` does. So if the code that's buggy
causes some bad state that will affect other tests, do not use
``xfail``.
One of the important differences between the two is that ``skip`` doesn't run the test, and ``xfail`` does. So if the
code that's buggy causes some bad state that will affect other tests, do not use ``xfail``.
Implementation
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
@@ -754,7 +857,7 @@ or:
@unittest.skipIf(torch_device == "cpu", "Can't do half precision")
def test_feature_x():
or skip the whole module:
.. code-block:: python
@@ -765,12 +868,12 @@ or skip the whole module:
More details, example and ways are `here <https://docs.pytest.org/en/latest/skipping.html>`__.
Custom markers
Slow tests
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
* Slow tests
Tests that are too slow (e.g. once downloading huge model files) are marked with:
The library of tests is ever-growing, and some of the tests take minutes to run, therefore we can't afford waiting for
an hour for the test suite to complete on CI. Therefore, with some exceptions for essential tests, slow tests should be
marked as in the example below:
.. code-block:: python
@@ -778,13 +881,14 @@ Tests that are too slow (e.g. once downloading huge model files) are marked with
@slow
def test_integration_foo():
To run such tests set ``RUN_SLOW=1`` env var, e.g.:
Once a test is marked as ``@slow``, to run such tests set ``RUN_SLOW=1`` env var, e.g.:
.. code-block:: bash
RUN_SLOW=1 pytest tests
Some decorators like ``@parametrized`` rewrite test names, therefore ``@slow`` and the rest of the skip decorators ``@require_*`` have to be listed last for them to work correctly. Here is an example of the correct usage:
Some decorators like ``@parameterized`` rewrite test names, therefore ``@slow`` and the rest of the skip decorators
``@require_*`` have to be listed last for them to work correctly. Here is an example of the correct usage:
.. code-block:: python
@@ -792,13 +896,55 @@ Some decorators like ``@parametrized`` rewrite test names, therefore ``@slow`` a
@slow
def test_integration_foo():
As explained at the beginning of this document, slow tests get to run on a scheduled basis, rather than in PRs CI
checks. So it's possible that some problems will be missed during a PR submission and get merged. Such problems will
get caught during the next scheduled CI job. But it also means that it's important to run the slow tests on your
machine before submitting the PR.
Here is a rough decision making mechanism for choosing which tests should be marked as slow:
If the test is focused on one of the library's internal components (e.g., modeling files, tokenization files,
pipelines), then we should run that test in the non-slow test suite. If it's focused on an other aspect of the library,
such as the documentation or the examples, then we should run these tests in the slow test suite. And then, to refine
this approach we should have exceptions:
* All tests that need to download a heavy set of weights (e.g., model or tokenizer integration tests, pipeline
integration tests) should be set to slow. If you're adding a new model, you should create and upload to the hub a
tiny version of it (with random weights) for integration tests. This is discussed in the following paragraphs.
* All tests that need to do a training not specifically optimized to be fast should be set to slow.
* We can introduce exceptions if some of these should-be-non-slow tests are excruciatingly slow, and set them to
``@slow``. Auto-modeling tests, which save and load large files to disk, are a good example of tests that are marked
as ``@slow``.
* If a test completes under 1 second on CI (including downloads if any) then it should be a normal test regardless.
Collectively, all the non-slow tests need to cover entirely the different internals, while remaining fast. For example,
a significant coverage can be achieved by testing with specially created tiny models with random weights. Such models
have the very minimal number of layers (e.g., 2), vocab size (e.g., 1000), etc. Then the ``@slow`` tests can use large
slow models to do qualitative testing. To see the use of these simply look for *tiny* models with:
.. code-block:: bash
grep tiny tests examples
Here is a an example of a `script
<https://github.com/huggingface/transformers/blob/master/scripts/fsmt/fsmt-make-tiny-model.py>`__ that created the tiny
model `stas/tiny-wmt19-en-de <https://huggingface.co/stas/tiny-wmt19-en-de>`__. You can easily adjust it to your
specific model's architecture.
It's easy to measure the run-time incorrectly if for example there is an overheard of downloading a huge model, but if
you test it locally the downloaded files would be cached and thus the download time not measured. Hence check the
execution speed report in CI logs instead (the output of ``pytest --durations=0 tests``).
That report is also useful to find slow outliers that aren't marked as such, or which need to be re-written to be fast.
If you notice that the test suite starts getting slow on CI, the top listing of this report will show the slowest
tests.
Testing the stdout/stderr output
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
In order to test functions that write to ``stdout`` and/or ``stderr``,
the test can access those streams using the ``pytest``'s `capsys
system <https://docs.pytest.org/en/latest/capture.html>`__. Here is how
this is accomplished:
In order to test functions that write to ``stdout`` and/or ``stderr``, the test can access those streams using the
``pytest``'s `capsys system <https://docs.pytest.org/en/latest/capture.html>`__. Here is how this is accomplished:
.. code-block:: python
@@ -817,8 +963,8 @@ this is accomplished:
assert msg in out
assert msg in err
And, of course, most of the time, ``stderr`` will come as a part of an
exception, so try/except has to be used in such a case:
And, of course, most of the time, ``stderr`` will come as a part of an exception, so try/except has to be used in such
a case:
.. code-block:: python
@@ -850,16 +996,13 @@ Another approach to capturing stdout is via ``contextlib.redirect_stdout``:
# test:
assert msg in out
An important potential issue with capturing stdout is that it may
contain ``\r`` characters that in normal ``print`` reset everything that
has been printed so far. There is no problem with ``pytest``, but with
``pytest -s`` these characters get included in the buffer, so to be able
to have the test run with and without ``-s``, you have to make an extra
cleanup to the captured output, using ``re.sub(r'~.*\r', '', buf, 0, re.M)``.
An important potential issue with capturing stdout is that it may contain ``\r`` characters that in normal ``print``
reset everything that has been printed so far. There is no problem with ``pytest``, but with ``pytest -s`` these
characters get included in the buffer, so to be able to have the test run with and without ``-s``, you have to make an
extra cleanup to the captured output, using ``re.sub(r'~.*\r', '', buf, 0, re.M)``.
But, then we have a helper context manager wrapper to automatically take
care of it all, regardless of whether it has some ``\r``'s in it or
not, so it's a simple:
But, then we have a helper context manager wrapper to automatically take care of it all, regardless of whether it has
some ``\r``'s in it or not, so it's a simple:
.. code-block:: python
@@ -879,8 +1022,7 @@ Here is a full test example:
print(msg + final)
assert cs.out == final+"\n", f"captured: {cs.out}, expecting {final}"
If you'd like to capture ``stderr`` use the :obj:`CaptureStderr` class
instead:
If you'd like to capture ``stderr`` use the :obj:`CaptureStderr` class instead:
.. code-block:: python
@@ -889,8 +1031,7 @@ instead:
function_that_writes_to_stderr()
print(cs.err)
If you need to capture both streams at once, use the parent
:obj:`CaptureStd` class:
If you need to capture both streams at once, use the parent :obj:`CaptureStd` class:
.. code-block:: python
@@ -922,7 +1063,8 @@ If you need to validate the output of a logger, you can use :obj:`CaptureLogger`
Testing with environment variables
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
If you want to test the impact of environment variables for a specific test you can use a helper decorator ``transformers.testing_utils.mockenv``
If you want to test the impact of environment variables for a specific test you can use a helper decorator
``transformers.testing_utils.mockenv``
.. code-block:: python
@@ -932,12 +1074,30 @@ If you want to test the impact of environment variables for a specific test you
def test_env_override(self):
env_level_str = os.getenv("TRANSFORMERS_VERBOSITY", None)
At times an external program needs to be called, which requires setting ``PYTHONPATH`` in ``os.environ`` to include
multiple local paths. A helper class :obj:`transformers.test_utils.TestCasePlus` comes to help:
.. code-block:: python
from transformers.testing_utils import TestCasePlus
class EnvExampleTest(TestCasePlus):
def test_external_prog(self):
env = self.get_env()
# now call the external program, passing ``env`` to it
Depending on whether the test file was under the ``tests`` test suite or ``examples`` it'll correctly set up
``env[PYTHONPATH]`` to include one of these two directories, and also the ``src`` directory to ensure the testing is
done against the current repo, and finally with whatever ``env[PYTHONPATH]`` was already set to before the test was
called if anything.
This helper method creates a copy of the ``os.environ`` object, so the original remains intact.
Getting reproducible results
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
In some situations you may want to remove randomness for your tests. To
get identical reproducable results set, you will need to fix the seed:
In some situations you may want to remove randomness for your tests. To get identical reproducable results set, you
will need to fix the seed:
.. code-block:: python

View File

@@ -1,12 +1,12 @@
Tokenizer summary
-----------------------------------------------------------------------------------------------------------------------
In this page, we will have a closer look at tokenization. As we saw in
:doc:`the preprocessing tutorial <preprocessing>`, tokenizing a text is splitting it into words or subwords, which then
are converted to ids. The second part is pretty straightforward, here we will focus on the first part. More
specifically, we will look at the three main different kinds of tokenizers used in 🤗 Transformers:
:ref:`Byte-Pair Encoding (BPE) <byte-pair-encoding>`, :ref:`WordPiece <wordpiece>` and
:ref:`SentencePiece <sentencepiece>`, and provide examples of models using each of those.
In this page, we will have a closer look at tokenization. As we saw in :doc:`the preprocessing tutorial
<preprocessing>`, tokenizing a text is splitting it into words or subwords, which then are converted to ids. The second
part is pretty straightforward, here we will focus on the first part. More specifically, we will look at the three main
different kinds of tokenizers used in 🤗 Transformers: :ref:`Byte-Pair Encoding (BPE) <byte-pair-encoding>`,
:ref:`WordPiece <wordpiece>` and :ref:`SentencePiece <sentencepiece>`, and provide examples of models using each of
those.
Note that on each model page, you can look at the documentation of the associated tokenizer to know which of those
algorithms the pretrained model used. For instance, if we look at :class:`~transformers.BertTokenizer`, we can see it's
@@ -16,8 +16,8 @@ Introduction to tokenization
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Splitting a text in smaller chunks is a task that's harder than it looks, and there are multiple ways of doing it. For
instance, let's look at the sentence "Don't you love 🤗 Transformers? We sure do." A first simple way of tokenizing
this text is just to split it by spaces, which would give:
instance, let's look at the sentence "Don't you love 🤗 Transformers? We sure do." A first simple way of tokenizing this
text is just to split it by spaces, which would give:
.. code-block::
@@ -46,9 +46,8 @@ rule-based tokenizers. On the text above, they'd output something like:
Space/punctuation-tokenization and rule-based tokenization are both examples of word tokenization, which is splitting a
sentence into words. While it's the most intuitive way to separate texts in smaller chunks, it can have a problem when
you have a huge corpus: it usually yields a very big vocabulary (the set of all unique tokens used).
:doc:`Transformer XL <model_doc/transformerxl>` for instance uses space/punctuation-tokenization, and has a vocabulary
size of 267,735!
you have a huge corpus: it usually yields a very big vocabulary (the set of all unique tokens used). :doc:`Transformer
XL <model_doc/transformerxl>` for instance uses space/punctuation-tokenization, and has a vocabulary size of 267,735!
A huge vocabulary size means a huge embedding matrix at the start of the model, which will cause memory problems.
TransformerXL deals with it by using a special kind of embeddings called adaptive embeddings, but in general,
@@ -69,9 +68,8 @@ decomposed as "annoying" and "ly". This is especially useful in agglutinative la
form (almost) arbitrarily long complex words by stringing together some subwords.
This allows the model to keep a reasonable vocabulary while still learning useful representations for common words or
subwords. This also enables the model to process words it has never seen before, by decomposing them into
subwords it knows. For instance, the base :class:`~transformers.BertTokenizer` will tokenize "I have a new GPU!" like
this:
subwords. This also enables the model to process words it has never seen before, by decomposing them into subwords it
knows. For instance, the base :class:`~transformers.BertTokenizer` will tokenize "I have a new GPU!" like this:
.. code-block::
@@ -81,8 +79,8 @@ this:
['i', 'have', 'a', 'new', 'gp', '##u', '!']
Since we are considering the uncased model, the sentence was lowercased first. Then all the words were present in the
vocabulary of the tokenizer, except for "gpu", so the tokenizer split it in subwords it knows: "gp" and "##u". The "##"
means that the rest of the token should be attached to the previous one, without space (for when we need to decode
vocabulary of the tokenizer, except for "gpu", so the tokenizer splits it in subwords it knows: "gp" and "##u". The
"##" means that the rest of the token should be attached to the previous one, without space (for when we need to decode
predictions and reverse the tokenization).
Another example is when we use the base :class:`~transformers.XLNetTokenizer` to tokenize our previous text:
@@ -106,13 +104,13 @@ Byte-Pair Encoding
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Byte-Pair Encoding was introduced in `this paper <https://arxiv.org/abs/1508.07909>`__. It relies on a pretokenizer
splitting the training data into words, which can be a simple space tokenization
(:doc:`GPT-2 <model_doc/gpt2>` and :doc:`Roberta <model_doc/roberta>` uses this for instance) or a rule-based tokenizer
(:doc:`XLM <model_doc/xlm>` use Moses for most languages, as does :doc:`FlauBERT <model_doc/flaubert>`),
splitting the training data into words, which can be a simple space tokenization (:doc:`GPT-2 <model_doc/gpt2>` and
:doc:`Roberta <model_doc/roberta>` uses this for instance) or a rule-based tokenizer (:doc:`XLM <model_doc/xlm>` use
Moses for most languages, as does :doc:`FlauBERT <model_doc/flaubert>`),
:doc:`GPT <model_doc/gpt>` uses Spacy and ftfy, and counts the frequency of each word in the training corpus.
It then begins from the list of all characters, and will learn merge rules to form a new token from two symbols in the
It then begins from the list of all characters and will learn merge rules to form a new token from two symbols in the
vocabulary until it has learned a vocabulary of the desired size (this is a hyperparameter to pick).
Let's say that after the pre-tokenization we have the following words (the number indicating the frequency of each
@@ -148,10 +146,10 @@ represented as
('hug', 10), ('p' 'ug', 5), ('p' 'un', 12), ('b' 'un', 4), ('hug' 's', 5)
If we stop there, the tokenizer can apply the rules it learned to new words (as long as they don't contain characters that
were not in the base vocabulary). For instance 'bug' would be tokenized as ``['b', 'ug']`` but mug would be tokenized as
``['<unk>', 'ug']`` since the 'm' is not in the base vocabulary. This doesn't happen to letters in general (since the
base corpus uses all of them), but to special characters like emojis.
If we stop there, the tokenizer can apply the rules it learned to new words (as long as they don't contain characters
that were not in the base vocabulary). For instance 'bug' would be tokenized as ``['b', 'ug']`` but mug would be
tokenized as ``['<unk>', 'ug']`` since the 'm' is not in the base vocabulary. This doesn't happen to letters in general
(since the base corpus uses all of them), but to special characters like emojis.
As we said before, the vocabulary size (which is the base vocabulary size + the number of merges) is a hyperparameter
to choose. For instance :doc:`GPT <model_doc/gpt>` has a vocabulary size of 40,478 since they have 478 base characters
@@ -161,24 +159,24 @@ Byte-level BPE
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
To deal with the fact the base vocabulary needs to get all base characters, which can be quite big if one allows for
all unicode characters, the
`GPT-2 paper <https://cdn.openai.com/better-language-models/language_models_are_unsupervised_multitask_learners.pdf>`__
introduces a clever trick, which is to use bytes as the base vocabulary (which gives a size of 256). With some
additional rules to deal with punctuation, this manages to be able to tokenize every text without needing an unknown
token. For instance, the :doc:`GPT-2 model <model_doc/gpt>` has a vocabulary size of 50,257, which corresponds to the
256 bytes base tokens, a special end-of-text token and the symbols learned with 50,000 merges.
all unicode characters, the `GPT-2 paper
<https://cdn.openai.com/better-language-models/language_models_are_unsupervised_multitask_learners.pdf>`__ introduces a
clever trick, which is to use bytes as the base vocabulary (which gives a size of 256). With some additional rules to
deal with punctuation, this manages to be able to tokenize every text without needing an unknown token. For instance,
the :doc:`GPT-2 model <model_doc/gpt>` has a vocabulary size of 50,257, which corresponds to the 256 bytes base tokens,
a special end-of-text token and the symbols learned with 50,000 merges.
.. _wordpiece:
WordPiece
=======================================================================================================================
WordPiece is the subword tokenization algorithm used for :doc:`BERT <model_doc/bert>` (as well as
:doc:`DistilBERT <model_doc/distilbert>` and :doc:`Electra <model_doc/electra>`) and was outlined in
`this paper <https://static.googleusercontent.com/media/research.google.com/ja//pubs/archive/37842.pdf>`__. It relies
on the same base as BPE, which is to initialize the vocabulary to every character present in the corpus and
progressively learn a given number of merge rules, the difference is that it doesn't choose the pair that is the most
frequent but the one that will maximize the likelihood on the corpus once merged.
WordPiece is the subword tokenization algorithm used for :doc:`BERT <model_doc/bert>` (as well as :doc:`DistilBERT
<model_doc/distilbert>` and :doc:`Electra <model_doc/electra>`) and was outlined in `this paper
<https://static.googleusercontent.com/media/research.google.com/ja//pubs/archive/37842.pdf>`__. It relies on the same
base as BPE, which is to initialize the vocabulary to every character present in the corpus and progressively learn a
given number of merge rules, the difference is that it doesn't choose the pair that is the most frequent but the one
that will maximize the likelihood on the corpus once merged.
What does this mean? Well, in the previous example, it means we would only merge 'u' and 'g' if the probability of
having 'ug' divided by the probability of having 'u' then 'g' is greater than for any other pair of symbols. It's
@@ -197,11 +195,11 @@ progressively. It's not used directly for any of the pretrained models in the li
with :ref:`SentencePiece <sentencepiece>`.
More specifically, at a given step, unigram computes a loss from the corpus we have and the current vocabulary, then,
for each subword, evaluate how much the loss would augment if the subword was removed from the vocabulary. It then
sorts the subwords by this quantity (that represents how worse the loss becomes if the token is removed) and removes
all the worst p tokens (for instance p could be 10% or 20%). It then repeats the process until the vocabulary has
reached the desired size, always keeping the base characters (to be able to tokenize any word written with them, like
BPE or WordPiece).
for each subword, evaluate how much the loss would increase if the subword was removed from the vocabulary. It then
sorts the subwords by this quantity (that represents how much worse the loss becomes if the token is removed) and
removes all the worst p tokens (for instance p could be 10% or 20%). It then repeats the process until the vocabulary
has reached the desired size, always keeping the base characters (to be able to tokenize any word written with them,
like BPE or WordPiece).
Contrary to BPE and WordPiece that work out rules in a certain order that you can then apply in the same order when
tokenizing new text, Unigram will have several ways of tokenizing a new text. For instance, if it ends up with the
@@ -217,9 +215,9 @@ training corpus. You can then give a probability to each tokenization (which is
tokens forming it) and pick the most likely one (or if you want to apply some data augmentation, you could sample one
of the tokenization according to their probabilities).
Those probabilities define the loss that trains the tokenizer: if our corpus consists of the
words :math:`x_{1}, \dots, x_{N}` and if for the word :math:`x_{i}` we note :math:`S(x_{i})` the set of all possible
tokenizations of :math:`x_{i}` (with the current vocabulary), then the loss is defined as
Those probabilities define the loss that trains the tokenizer: if our corpus consists of the words :math:`x_{1}, \dots,
x_{N}` and if for the word :math:`x_{i}` we note :math:`S(x_{i})` the set of all possible tokenizations of
:math:`x_{i}` (with the current vocabulary), then the loss is defined as
.. math::
\mathcal{L} = -\sum_{i=1}^{N} \log \left ( \sum_{x \in S(x_{i})} p(x) \right )
@@ -236,8 +234,8 @@ SentencePiece (introduced in `this paper <https://arxiv.org/pdf/1808.06226.pdf>`
includes the space in the set of characters to use, then uses BPE or unigram to construct the appropriate vocabulary.
That's why in the example we saw before using :class:`~transformers.XLNetTokenizer` (which uses SentencePiece), we had
the '▁' character, that represents space. Decoding a tokenized text is then super easy: we just have to concatenate
all of them together and replace '▁' with space.
the '▁' character, that represents space. Decoding a tokenized text is then super easy: we just have to concatenate all
of them together and replace '▁' with space.
All transformers models in the library that use SentencePiece use it with unigram. Examples of models using it are
:doc:`ALBERT <model_doc/albert>`, :doc:`XLNet <model_doc/xlnet>` or the :doc:`Marian framework <model_doc/marian>`.

View File

@@ -1,18 +1,14 @@
Training and fine-tuning
=======================================================================================================================
Model classes in 🤗 Transformers are designed to be compatible with native
PyTorch and TensorFlow 2 and can be used seemlessly with either. In this
quickstart, we will show how to fine-tune (or train from scratch) a model
using the standard training tools available in either framework. We will also
show how to use our included :func:`~transformers.Trainer` class which
handles much of the complexity of training for you.
Model classes in 🤗 Transformers are designed to be compatible with native PyTorch and TensorFlow 2 and can be used
seemlessly with either. In this quickstart, we will show how to fine-tune (or train from scratch) a model using the
standard training tools available in either framework. We will also show how to use our included
:func:`~transformers.Trainer` class which handles much of the complexity of training for you.
This guide assume that you are already familiar with loading and use our
models for inference; otherwise, see the :doc:`task summary <task_summary>`. We also assume
that you are familiar with training deep neural networks in either PyTorch or
TF2, and focus specifically on the nuances and tools for training models in
🤗 Transformers.
This guide assume that you are already familiar with loading and use our models for inference; otherwise, see the
:doc:`task summary <task_summary>`. We also assume that you are familiar with training deep neural networks in either
PyTorch or TF2, and focus specifically on the nuances and tools for training models in 🤗 Transformers.
Sections:
@@ -26,25 +22,19 @@ Sections:
Fine-tuning in native PyTorch
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
Model classes in 🤗 Transformers that don't begin with ``TF`` are
`PyTorch Modules <https://pytorch.org/docs/master/generated/torch.nn.Module.html>`_,
meaning that you can use them just as you would any model in PyTorch for
both inference and optimization.
Model classes in 🤗 Transformers that don't begin with ``TF`` are `PyTorch Modules
<https://pytorch.org/docs/master/generated/torch.nn.Module.html>`_, meaning that you can use them just as you would any
model in PyTorch for both inference and optimization.
Let's consider the common task of fine-tuning a masked language model like
BERT on a sequence classification dataset. When we instantiate a model with
:func:`~transformers.PreTrainedModel.from_pretrained`, the model
configuration and pre-trained weights
of the specified model are used to initialize the model. The
library also includes a number of task-specific final layers or 'heads' whose
weights are instantiated randomly when not present in the specified
Let's consider the common task of fine-tuning a masked language model like BERT on a sequence classification dataset.
When we instantiate a model with :func:`~transformers.PreTrainedModel.from_pretrained`, the model configuration and
pre-trained weights of the specified model are used to initialize the model. The library also includes a number of
task-specific final layers or 'heads' whose weights are instantiated randomly when not present in the specified
pre-trained model. For example, instantiating a model with
``BertForSequenceClassification.from_pretrained('bert-base-uncased', num_labels=2)``
will create a BERT model instance with encoder weights copied from the
``bert-base-uncased`` model and a randomly initialized sequence
classification head on top of the encoder with an output size of 2. Models
are initialized in ``eval`` mode by default. We can call ``model.train()`` to
put it in train mode.
``BertForSequenceClassification.from_pretrained('bert-base-uncased', num_labels=2)`` will create a BERT model instance
with encoder weights copied from the ``bert-base-uncased`` model and a randomly initialized sequence classification
head on top of the encoder with an output size of 2. Models are initialized in ``eval`` mode by default. We can call
``model.train()`` to put it in train mode.
.. code-block:: python
@@ -52,20 +42,17 @@ put it in train mode.
model = BertForSequenceClassification.from_pretrained('bert-base-uncased', return_dict=True)
model.train()
This is useful because it allows us to make use of the pre-trained BERT
encoder and easily train it on whatever sequence classification dataset we
choose. We can use any PyTorch optimizer, but our library also provides the
:func:`~transformers.AdamW` optimizer which implements gradient bias
correction as well as weight decay.
This is useful because it allows us to make use of the pre-trained BERT encoder and easily train it on whatever
sequence classification dataset we choose. We can use any PyTorch optimizer, but our library also provides the
:func:`~transformers.AdamW` optimizer which implements gradient bias correction as well as weight decay.
.. code-block:: python
from transformers import AdamW
optimizer = AdamW(model.parameters(), lr=1e-5)
The optimizer allows us to apply different hyperpameters for specific
parameter groups. For example, we can apply weight decay to all parameters
other than bias and layer normalization terms:
The optimizer allows us to apply different hyperpameters for specific parameter groups. For example, we can apply
weight decay to all parameters other than bias and layer normalization terms:
.. code-block:: python
@@ -75,11 +62,9 @@ other than bias and layer normalization terms:
{'params': [p for n, p in model.named_parameters() if any(nd in n for nd in no_decay)], 'weight_decay': 0.0}
]
optimizer = AdamW(optimizer_grouped_parameters, lr=1e-5)
Now we can set up a simple dummy training batch using
:func:`~transformers.PreTrainedTokenizer.__call__`. This returns a
:func:`~transformers.BatchEncoding` instance which
prepares everything we might need to pass to the model.
Now we can set up a simple dummy training batch using :func:`~transformers.PreTrainedTokenizer.__call__`. This returns
a :func:`~transformers.BatchEncoding` instance which prepares everything we might need to pass to the model.
.. code-block:: python
@@ -90,10 +75,9 @@ prepares everything we might need to pass to the model.
input_ids = encoding['input_ids']
attention_mask = encoding['attention_mask']
When we call a classification model with the ``labels`` argument, the first
returned element is the Cross Entropy loss between the predictions and the
passed labels. Having already set up our optimizer, we can then do a
backwards pass and update the weights:
When we call a classification model with the ``labels`` argument, the first returned element is the Cross Entropy loss
between the predictions and the passed labels. Having already set up our optimizer, we can then do a backwards pass and
update the weights:
.. code-block:: python
@@ -103,8 +87,8 @@ backwards pass and update the weights:
loss.backward()
optimizer.step()
Alternatively, you can just get the logits and calculate the loss yourself.
The following is equivalent to the previous example:
Alternatively, you can just get the logits and calculate the loss yourself. The following is equivalent to the previous
example:
.. code-block:: python
@@ -115,12 +99,10 @@ The following is equivalent to the previous example:
loss.backward()
optimizer.step()
Of course, you can train on GPU by calling ``to('cuda')`` on the model and
inputs as usual.
Of course, you can train on GPU by calling ``to('cuda')`` on the model and inputs as usual.
We also provide a few learning rate scheduling tools. With the following, we
can set up a scheduler which warms up for ``num_warmup_steps`` and then
linearly decays to 0 by the end of training.
We also provide a few learning rate scheduling tools. With the following, we can set up a scheduler which warms up for
``num_warmup_steps`` and then linearly decays to 0 by the end of training.
.. code-block:: python
@@ -135,19 +117,16 @@ Then all we have to do is call ``scheduler.step()`` after ``optimizer.step()``.
optimizer.step()
scheduler.step()
We highly recommend using :func:`~transformers.Trainer`, discussed below,
which conveniently handles the moving parts of training 🤗 Transformers models
with features like mixed precision and easy tensorboard logging.
We highly recommend using :func:`~transformers.Trainer`, discussed below, which conveniently handles the moving parts
of training 🤗 Transformers models with features like mixed precision and easy tensorboard logging.
Freezing the encoder
-----------------------------------------------------------------------------------------------------------------------
In some cases, you might be interested in keeping the weights of the
pre-trained encoder frozen and optimizing only the weights of the head
layers. To do so, simply set the ``requires_grad`` attribute to ``False`` on
the encoder parameters, which can be accessed with the ``base_model``
submodule on any task-specific model in the library:
In some cases, you might be interested in keeping the weights of the pre-trained encoder frozen and optimizing only the
weights of the head layers. To do so, simply set the ``requires_grad`` attribute to ``False`` on the encoder
parameters, which can be accessed with the ``base_model`` submodule on any task-specific model in the library:
.. code-block:: python
@@ -160,10 +139,8 @@ submodule on any task-specific model in the library:
Fine-tuning in native TensorFlow 2
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
Models can also be trained natively in TensorFlow 2. Just as with PyTorch,
TensorFlow models can be instantiated with
:func:`~transformers.PreTrainedModel.from_pretrained` to load the weights of
the encoder from a pretrained model.
Models can also be trained natively in TensorFlow 2. Just as with PyTorch, TensorFlow models can be instantiated with
:func:`~transformers.PreTrainedModel.from_pretrained` to load the weights of the encoder from a pretrained model.
.. code-block:: python
@@ -171,11 +148,9 @@ the encoder from a pretrained model.
model = TFBertForSequenceClassification.from_pretrained('bert-base-uncased')
Let's use ``tensorflow_datasets`` to load in the `MRPC dataset
<https://www.tensorflow.org/datasets/catalog/glue#gluemrpc>`_ from GLUE. We
can then use our built-in
:func:`~transformers.data.processors.glue.glue_convert_examples_to_features`
to tokenize MRPC and convert it to a TensorFlow ``Dataset`` object. Note that
tokenizers are framework-agnostic, so there is no need to prepend ``TF`` to
<https://www.tensorflow.org/datasets/catalog/glue#gluemrpc>`_ from GLUE. We can then use our built-in
:func:`~transformers.data.processors.glue.glue_convert_examples_to_features` to tokenize MRPC and convert it to a
TensorFlow ``Dataset`` object. Note that tokenizers are framework-agnostic, so there is no need to prepend ``TF`` to
the pretrained tokenizer name.
.. code-block:: python
@@ -197,8 +172,8 @@ The model can then be compiled and trained as any Keras model:
model.compile(optimizer=optimizer, loss=loss)
model.fit(train_dataset, epochs=2, steps_per_epoch=115)
With the tight interoperability between TensorFlow and PyTorch models, you
can even save the model and then reload it as a PyTorch model (or vice-versa):
With the tight interoperability between TensorFlow and PyTorch models, you can even save the model and then reload it
as a PyTorch model (or vice-versa):
.. code-block:: python
@@ -212,12 +187,9 @@ can even save the model and then reload it as a PyTorch model (or vice-versa):
Trainer
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
We also provide a simple but feature-complete training and evaluation
interface through :func:`~transformers.Trainer` and
:func:`~transformers.TFTrainer`. You can train, fine-tune,
and evaluate any 🤗 Transformers model with a wide range of training options and
with built-in features like logging, gradient accumulation, and mixed
precision.
We also provide a simple but feature-complete training and evaluation interface through :func:`~transformers.Trainer`
and :func:`~transformers.TFTrainer`. You can train, fine-tune, and evaluate any 🤗 Transformers model with a wide range
of training options and with built-in features like logging, gradient accumulation, and mixed precision.
.. code-block:: python
@@ -264,21 +236,16 @@ precision.
eval_dataset=tfds_test_dataset # tensorflow_datasets evaluation dataset
)
Now simply call ``trainer.train()`` to train and ``trainer.evaluate()`` to
evaluate. You can use your own module as well, but the first
argument returned from ``forward`` must be the loss which you wish to
optimize.
Now simply call ``trainer.train()`` to train and ``trainer.evaluate()`` to evaluate. You can use your own module as
well, but the first argument returned from ``forward`` must be the loss which you wish to optimize.
:func:`~transformers.Trainer` uses a built-in default function to collate
batches and prepare them to be fed into the model. If needed, you can also
use the ``data_collator`` argument to pass your own collator function which
takes in the data in the format provided by your dataset and returns a
batch ready to be fed into the model. Note that
:func:`~transformers.TFTrainer` expects the passed datasets to be dataset
objects from ``tensorflow_datasets``.
:func:`~transformers.Trainer` uses a built-in default function to collate batches and prepare them to be fed into the
model. If needed, you can also use the ``data_collator`` argument to pass your own collator function which takes in the
data in the format provided by your dataset and returns a batch ready to be fed into the model. Note that
:func:`~transformers.TFTrainer` expects the passed datasets to be dataset objects from ``tensorflow_datasets``.
To calculate additional metrics in addition to the loss, you can also define
your own ``compute_metrics`` function and pass it to the trainer.
To calculate additional metrics in addition to the loss, you can also define your own ``compute_metrics`` function and
pass it to the trainer.
.. code-block:: python
@@ -296,8 +263,8 @@ your own ``compute_metrics`` function and pass it to the trainer.
'recall': recall
}
Finally, you can view the results, including any calculated metrics, by
launching tensorboard in your specified ``logging_dir`` directory.
Finally, you can view the results, including any calculated metrics, by launching tensorboard in your specified
``logging_dir`` directory.
.. _additional-resources:
@@ -308,11 +275,12 @@ Additional resources
- `A lightweight colab demo <https://colab.research.google.com/drive/1-JIJlao4dI-Ilww_NnTc0rxtp-ymgDgM?usp=sharing>`_
which uses ``Trainer`` for IMDb sentiment classification.
- `🤗 Transformers Examples <https://github.com/huggingface/transformers/tree/master/examples>`_
including scripts for training and fine-tuning on GLUE, SQuAD, and several other tasks.
- `🤗 Transformers Examples <https://github.com/huggingface/transformers/tree/master/examples>`_ including scripts for
training and fine-tuning on GLUE, SQuAD, and several other tasks.
- `How to train a language model <https://colab.research.google.com/github/huggingface/blog/blob/master/notebooks/01_how_to_train.ipynb>`_,
a detailed colab notebook which uses ``Trainer`` to train a masked language model from scratch on Esperanto.
- `How to train a language model
<https://colab.research.google.com/github/huggingface/blog/blob/master/notebooks/01_how_to_train.ipynb>`_, a detailed
colab notebook which uses ``Trainer`` to train a masked language model from scratch on Esperanto.
- `🤗 Transformers Notebooks <notebooks.html>`_ which contain dozens of example notebooks from the community for
training and using 🤗 Transformers on a variety of tasks.

View File

@@ -1,41 +1,22 @@
# Examples
Version 2.9 of 🤗 Transformers introduces a new [`Trainer`](https://github.com/huggingface/transformers/blob/master/src/transformers/trainer.py) class for PyTorch, and its equivalent [`TFTrainer`](https://github.com/huggingface/transformers/blob/master/src/transformers/trainer_tf.py) for TF 2.
Version 2.9 of 🤗 Transformers introduced a new [`Trainer`](https://github.com/huggingface/transformers/blob/master/src/transformers/trainer.py) class for PyTorch, and its equivalent [`TFTrainer`](https://github.com/huggingface/transformers/blob/master/src/transformers/trainer_tf.py) for TF 2.
Running the examples requires PyTorch 1.3.1+ or TensorFlow 2.2+.
Here is the list of all our examples:
- **grouped by task** (all official examples work for multiple models)
- with information on whether they are **built on top of `Trainer`/`TFTrainer`** (if not, they still work, they might just lack some features),
- whether they also include examples for **`pytorch-lightning`**, which is a great fully-featured, general-purpose training library for PyTorch,
- with information on whether they are **built on top of `Trainer`/`TFTrainer`** (if not, they still work, they might
just lack some features),
- whether or not they leverage the [🤗 Datasets](https://github.com/huggingface/datasets) library.
- links to **Colab notebooks** to walk through the scripts and run them easily,
- links to **Cloud deployments** to be able to deploy large-scale trainings in the Cloud with little to no setup.
This is still a work-in-progress in particular documentation is still sparse so please **contribute improvements/pull requests.**
## The Big Table of Tasks
| Task | Example datasets | Trainer support | TFTrainer support | pytorch-lightning | Colab
|---|---|:---:|:---:|:---:|:---:|
| [**`language-modeling`**](https://github.com/huggingface/transformers/tree/master/examples/language-modeling) | Raw text | ✅ | - | - | [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/huggingface/blog/blob/master/notebooks/01_how_to_train.ipynb)
| [**`text-classification`**](https://github.com/huggingface/transformers/tree/master/examples/text-classification) | GLUE, XNLI | ✅ | ✅ | ✅ | [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/huggingface/blog/blob/master/notebooks/trainer/01_text_classification.ipynb)
| [**`token-classification`**](https://github.com/huggingface/transformers/tree/master/examples/token-classification) | CoNLL NER | ✅ | ✅ | ✅ | -
| [**`multiple-choice`**](https://github.com/huggingface/transformers/tree/master/examples/multiple-choice) | SWAG, RACE, ARC | ✅ | ✅ | - | [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/ViktorAlm/notebooks/blob/master/MPC_GPU_Demo_for_TF_and_PT.ipynb)
| [**`question-answering`**](https://github.com/huggingface/transformers/tree/master/examples/question-answering) | SQuAD | ✅ | ✅ | - | -
| [**`text-generation`**](https://github.com/huggingface/transformers/tree/master/examples/text-generation) | - | n/a | n/a | n/a | [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/huggingface/blog/blob/master/notebooks/02_how_to_generate.ipynb)
| [**`distillation`**](https://github.com/huggingface/transformers/tree/master/examples/distillation) | All | - | - | - | -
| [**`summarization`**](https://github.com/huggingface/transformers/tree/master/examples/seq2seq) | CNN/Daily Mail | ✅ | - | ✅ | -
| [**`translation`**](https://github.com/huggingface/transformers/tree/master/examples/seq2seq) | WMT | ✅ | - | ✅ | -
| [**`bertology`**](https://github.com/huggingface/transformers/tree/master/examples/bertology) | - | - | - | - | -
| [**`adversarial`**](https://github.com/huggingface/transformers/tree/master/examples/adversarial) | HANS | ✅ | - | - | -
<br>
## Important note
**Important**
To make sure you can successfully run the latest versions of the example scripts, you have to install the library from source and install some example-specific requirements.
To make sure you can successfully run the latest versions of the example scripts, you have to **install the library from source** and install some example-specific requirements.
Execute the following steps in a new virtual environment:
```bash
@@ -45,6 +26,30 @@ pip install .
pip install -r ./examples/requirements.txt
```
Alternatively, you can run the version of the examples as they were for your current version of Transformers via (for instance with v3.4.0):
```bash
git checkout tags/v3.4.0
```
## The Big Table of Tasks
| Task | Example datasets | Trainer support | TFTrainer support | 🤗 Datasets | Colab
|---|---|:---:|:---:|:---:|:---:|
| [**`language-modeling`**](https://github.com/huggingface/transformers/tree/master/examples/language-modeling) | Raw text | ✅ | - | ✅ | [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/huggingface/blog/blob/master/notebooks/01_how_to_train.ipynb)
| [**`text-classification`**](https://github.com/huggingface/transformers/tree/master/examples/text-classification) | GLUE, XNLI | ✅ | ✅ | ✅ | [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://github.com/huggingface/notebooks/blob/master/examples/text_classification.ipynb)
| [**`token-classification`**](https://github.com/huggingface/transformers/tree/master/examples/token-classification) | CoNLL NER | ✅ | ✅ | ✅ | -
| [**`multiple-choice`**](https://github.com/huggingface/transformers/tree/master/examples/multiple-choice) | SWAG, RACE, ARC | ✅ | ✅ | - | [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/ViktorAlm/notebooks/blob/master/MPC_GPU_Demo_for_TF_and_PT.ipynb)
| [**`question-answering`**](https://github.com/huggingface/transformers/tree/master/examples/question-answering) | SQuAD | ✅ | ✅ | - | -
| [**`text-generation`**](https://github.com/huggingface/transformers/tree/master/examples/text-generation) | - | n/a | n/a | - | [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/huggingface/blog/blob/master/notebooks/02_how_to_generate.ipynb)
| [**`distillation`**](https://github.com/huggingface/transformers/tree/master/examples/distillation) | All | - | - | - | -
| [**`summarization`**](https://github.com/huggingface/transformers/tree/master/examples/seq2seq) | CNN/Daily Mail | ✅ | - | - | -
| [**`translation`**](https://github.com/huggingface/transformers/tree/master/examples/seq2seq) | WMT | ✅ | - | - | -
| [**`bertology`**](https://github.com/huggingface/transformers/tree/master/examples/bertology) | - | - | - | - | -
| [**`adversarial`**](https://github.com/huggingface/transformers/tree/master/examples/adversarial) | HANS | ✅ | - | - | -
<br>
## One-click Deploy to Cloud (wip)
**Coming soon!**
@@ -57,13 +62,14 @@ When using PyTorch, we support TPUs thanks to `pytorch/xla`. For more context an
very detailed [pytorch/xla README](https://github.com/pytorch/xla/blob/master/README.md).
In this repo, we provide a very simple launcher script named [xla_spawn.py](https://github.com/huggingface/transformers/tree/master/examples/xla_spawn.py) that lets you run our example scripts on multiple TPU cores without any boilerplate.
Just pass a `--num_cores` flag to this script, then your regular training script with its arguments (this is similar to the `torch.distributed.launch` helper for torch.distributed).
Just pass a `--num_cores` flag to this script, then your regular training script with its arguments (this is similar to the `torch.distributed.launch` helper for torch.distributed).
Note that this approach does not work for examples that use `pytorch-lightning`.
For example for `run_glue`:
```bash
python examples/xla_spawn.py --num_cores 8 \
examples/text-classification/run_glue.py
examples/text-classification/run_glue.py \
--model_name_or_path bert-base-cased \
--task_name mnli \
--data_dir ./data/glue_data/MNLI \

View File

@@ -291,10 +291,9 @@ def hans_convert_examples_to_features(
Args:
examples: List of ``InputExamples`` containing the examples.
tokenizer: Instance of a tokenizer that will tokenize the examples.
max_length: Maximum example length.
label_list: List of labels. Can be obtained from the processor using the ``processor.get_labels()`` method.
output_mode: String indicating the output mode. Either ``regression`` or ``classification``.
max_length: Maximum example length.
tokenizer: Instance of a tokenizer that will tokenize the examples.
Returns:
A list of task-specific ``InputFeatures`` which can be fed to the model.

View File

@@ -20,7 +20,7 @@ import torch
import torch.nn as nn
from torch.nn import CrossEntropyLoss, MSELoss
from transformers.file_utils import add_start_docstrings, add_start_docstrings_to_callable
from transformers.file_utils import add_start_docstrings, add_start_docstrings_to_model_forward
from transformers.modeling_albert import (
ALBERT_INPUTS_DOCSTRING,
ALBERT_START_DOCSTRING,
@@ -87,7 +87,7 @@ class AlbertModelWithPabee(AlbertModel):
message = f"*** Patience = {self.patience} Avg. Inference Layers = {avg_inf_layers:.2f} Speed Up = {1 - avg_inf_layers / self.config.num_hidden_layers:.2f} ***"
print(message)
@add_start_docstrings_to_callable(ALBERT_INPUTS_DOCSTRING)
@add_start_docstrings_to_model_forward(ALBERT_INPUTS_DOCSTRING)
def forward(
self,
input_ids=None,
@@ -230,7 +230,7 @@ class AlbertForSequenceClassificationWithPabee(AlbertPreTrainedModel):
self.init_weights()
@add_start_docstrings_to_callable(ALBERT_INPUTS_DOCSTRING)
@add_start_docstrings_to_model_forward(ALBERT_INPUTS_DOCSTRING)
def forward(
self,
input_ids=None,

View File

@@ -22,7 +22,7 @@ import torch
from torch import nn
from torch.nn import CrossEntropyLoss, MSELoss
from transformers.file_utils import add_start_docstrings, add_start_docstrings_to_callable
from transformers.file_utils import add_start_docstrings, add_start_docstrings_to_model_forward
from transformers.modeling_bert import (
BERT_INPUTS_DOCSTRING,
BERT_START_DOCSTRING,
@@ -92,7 +92,7 @@ class BertModelWithPabee(BertModel):
message = f"*** Patience = {self.patience} Avg. Inference Layers = {avg_inf_layers:.2f} Speed Up = {1 - avg_inf_layers / self.config.num_hidden_layers:.2f} ***"
print(message)
@add_start_docstrings_to_callable(BERT_INPUTS_DOCSTRING)
@add_start_docstrings_to_model_forward(BERT_INPUTS_DOCSTRING)
def forward(
self,
input_ids=None,
@@ -155,7 +155,7 @@ class BertModelWithPabee(BertModel):
extended_attention_mask: torch.Tensor = self.get_extended_attention_mask(attention_mask, input_shape, device)
# If a 2D ou 3D attention mask is provided for the cross-attention
# we need to make broadcastabe to [batch_size, num_heads, seq_length, seq_length]
# we need to make broadcastable to [batch_size, num_heads, seq_length, seq_length]
if self.config.is_decoder and encoder_hidden_states is not None:
encoder_batch_size, encoder_sequence_length, _ = encoder_hidden_states.size()
encoder_hidden_shape = (encoder_batch_size, encoder_sequence_length)
@@ -254,7 +254,7 @@ class BertForSequenceClassificationWithPabee(BertPreTrainedModel):
self.init_weights()
@add_start_docstrings_to_callable(BERT_INPUTS_DOCSTRING)
@add_start_docstrings_to_model_forward(BERT_INPUTS_DOCSTRING)
def forward(
self,
input_ids=None,

View File

@@ -4,7 +4,7 @@ import sys
from unittest.mock import patch
import run_glue_with_pabee
from transformers.testing_utils import TestCasePlus
from transformers.testing_utils import TestCasePlus, require_torch_non_multigpu_but_fix_me
logging.basicConfig(level=logging.DEBUG)
@@ -20,6 +20,7 @@ def get_setup_file():
class PabeeTests(TestCasePlus):
@require_torch_non_multigpu_but_fix_me
def test_run_glue(self):
stream_handler = logging.StreamHandler(sys.stdout)
logger.addHandler(stream_handler)

View File

@@ -14,3 +14,17 @@ sys.path.insert(1, git_repo_path)
# silence FutureWarning warnings in tests since often we can't act on them until
# they become normal warnings - i.e. the tests still need to test the current functionality
warnings.simplefilter(action="ignore", category=FutureWarning)
def pytest_addoption(parser):
from transformers.testing_utils import pytest_addoption_shared
pytest_addoption_shared(parser)
def pytest_terminal_summary(terminalreporter):
from transformers.testing_utils import pytest_terminal_summary_main
make_reports = terminalreporter.config.getoption("--make-reports")
if make_reports:
pytest_terminal_summary_main(terminalreporter, id=make_reports)

View File

@@ -37,8 +37,10 @@ from transformers import (
AutoTokenizer,
DataCollatorForLanguageModeling,
DataCollatorForPermutationLanguageModeling,
DataCollatorForWholeWordMask,
HfArgumentParser,
LineByLineTextDataset,
LineByLineWithRefDataset,
PreTrainedTokenizer,
TextDataset,
Trainer,
@@ -101,6 +103,14 @@ class DataTrainingArguments:
default=None,
metadata={"help": "An optional input evaluation data file to evaluate the perplexity on (a text file)."},
)
train_ref_file: Optional[str] = field(
default=None,
metadata={"help": "An optional input train ref data file for whole word mask in Chinese."},
)
eval_ref_file: Optional[str] = field(
default=None,
metadata={"help": "An optional input eval ref data file for whole word mask in Chinese."},
)
line_by_line: bool = field(
default=False,
metadata={"help": "Whether distinct lines of text in the dataset are to be handled as distinct sequences."},
@@ -109,6 +119,7 @@ class DataTrainingArguments:
mlm: bool = field(
default=False, metadata={"help": "Train with masked-language modeling loss instead of language modeling."}
)
whole_word_mask: bool = field(default=False, metadata={"help": "Whether ot not to use whole word mask."})
mlm_probability: float = field(
default=0.15, metadata={"help": "Ratio of tokens to mask for masked language modeling loss"}
)
@@ -141,8 +152,18 @@ def get_dataset(
evaluate: bool = False,
cache_dir: Optional[str] = None,
):
def _dataset(file_path):
def _dataset(file_path, ref_path=None):
if args.line_by_line:
if ref_path is not None:
if not args.whole_word_mask or not args.mlm:
raise ValueError("You need to set world whole masking and mlm to True for Chinese Whole Word Mask")
return LineByLineWithRefDataset(
tokenizer=tokenizer,
file_path=file_path,
block_size=args.block_size,
ref_path=ref_path,
)
return LineByLineTextDataset(tokenizer=tokenizer, file_path=file_path, block_size=args.block_size)
else:
return TextDataset(
@@ -154,11 +175,11 @@ def get_dataset(
)
if evaluate:
return _dataset(args.eval_data_file)
return _dataset(args.eval_data_file, args.eval_ref_file)
elif args.train_data_files:
return ConcatDataset([_dataset(f) for f in glob(args.train_data_files)])
else:
return _dataset(args.train_data_file)
return _dataset(args.train_data_file, args.train_ref_file)
def main():
@@ -174,7 +195,6 @@ def main():
"Cannot do evaluation without an evaluation data file. Either supply a file to --eval_data_file "
"or remove the --do_eval argument."
)
if (
os.path.exists(training_args.output_dir)
and os.listdir(training_args.output_dir)
@@ -270,9 +290,14 @@ def main():
max_span_length=data_args.max_span_length,
)
else:
data_collator = DataCollatorForLanguageModeling(
tokenizer=tokenizer, mlm=data_args.mlm, mlm_probability=data_args.mlm_probability
)
if data_args.mlm and data_args.whole_word_mask:
data_collator = DataCollatorForWholeWordMask(
tokenizer=tokenizer, mlm_probability=data_args.mlm_probability
)
else:
data_collator = DataCollatorForLanguageModeling(
tokenizer=tokenizer, mlm=data_args.mlm, mlm_probability=data_args.mlm_probability
)
# Initialize our Trainer
trainer = Trainer(

View File

@@ -0,0 +1,147 @@
import argparse
import json
from typing import List
from ltp import LTP
from transformers.tokenization_bert import BertTokenizer
def _is_chinese_char(cp):
"""Checks whether CP is the codepoint of a CJK character."""
# This defines a "chinese character" as anything in the CJK Unicode block:
# https://en.wikipedia.org/wiki/CJK_Unified_Ideographs_(Unicode_block)
#
# Note that the CJK Unicode block is NOT all Japanese and Korean characters,
# despite its name. The modern Korean Hangul alphabet is a different block,
# as is Japanese Hiragana and Katakana. Those alphabets are used to write
# space-separated words, so they are not treated specially and handled
# like the all of the other languages.
if (
(cp >= 0x4E00 and cp <= 0x9FFF)
or (cp >= 0x3400 and cp <= 0x4DBF) #
or (cp >= 0x20000 and cp <= 0x2A6DF) #
or (cp >= 0x2A700 and cp <= 0x2B73F) #
or (cp >= 0x2B740 and cp <= 0x2B81F) #
or (cp >= 0x2B820 and cp <= 0x2CEAF) #
or (cp >= 0xF900 and cp <= 0xFAFF)
or (cp >= 0x2F800 and cp <= 0x2FA1F) #
): #
return True
return False
def is_chinese(word: str):
# word like '180' or '身高' or '神'
for char in word:
char = ord(char)
if not _is_chinese_char(char):
return 0
return 1
def get_chinese_word(tokens: List[str]):
word_set = set()
for token in tokens:
chinese_word = len(token) > 1 and is_chinese(token)
if chinese_word:
word_set.add(token)
word_list = list(word_set)
return word_list
def add_sub_symbol(bert_tokens: List[str], chinese_word_set: set()):
if not chinese_word_set:
return bert_tokens
max_word_len = max([len(w) for w in chinese_word_set])
bert_word = bert_tokens
start, end = 0, len(bert_word)
while start < end:
single_word = True
if is_chinese(bert_word[start]):
l = min(end - start, max_word_len)
for i in range(l, 1, -1):
whole_word = "".join(bert_word[start : start + i])
if whole_word in chinese_word_set:
for j in range(start + 1, start + i):
bert_word[j] = "##" + bert_word[j]
start = start + i
single_word = False
break
if single_word:
start += 1
return bert_word
def prepare_ref(lines: List[str], ltp_tokenizer: LTP, bert_tokenizer: BertTokenizer):
ltp_res = []
for i in range(0, len(lines), 100):
res = ltp_tokenizer.seg(lines[i : i + 100])[0]
res = [get_chinese_word(r) for r in res]
ltp_res.extend(res)
assert len(ltp_res) == len(lines)
bert_res = []
for i in range(0, len(lines), 100):
res = bert_tokenizer(lines[i : i + 100], add_special_tokens=True, truncation=True, max_length=512)
bert_res.extend(res["input_ids"])
assert len(bert_res) == len(lines)
ref_ids = []
for input_ids, chinese_word in zip(bert_res, ltp_res):
input_tokens = []
for id in input_ids:
token = bert_tokenizer._convert_id_to_token(id)
input_tokens.append(token)
input_tokens = add_sub_symbol(input_tokens, chinese_word)
ref_id = []
# We only save pos of chinese subwords start with ##, which mean is part of a whole word.
for i, token in enumerate(input_tokens):
if token[:2] == "##":
clean_token = token[2:]
# save chinese tokens' pos
if len(clean_token) == 1 and _is_chinese_char(ord(clean_token)):
ref_id.append(i)
ref_ids.append(ref_id)
assert len(ref_ids) == len(bert_res)
return ref_ids
def main(args):
# For Chinese (Ro)Bert, the best result is from : RoBERTa-wwm-ext (https://github.com/ymcui/Chinese-BERT-wwm)
# If we want to fine-tune these model, we have to use same tokenizer : LTP (https://github.com/HIT-SCIR/ltp)
with open(args.file_name, "r", encoding="utf-8") as f:
data = f.readlines()
data = [line.strip() for line in data if len(line) > 0 and not line.isspace()] # avoid delimiter like '\u2029'
ltp_tokenizer = LTP(args.ltp) # faster in GPU device
bert_tokenizer = BertTokenizer.from_pretrained(args.bert)
ref_ids = prepare_ref(data, ltp_tokenizer, bert_tokenizer)
with open(args.save_path, "w", encoding="utf-8") as f:
data = [json.dumps(ref) + "\n" for ref in ref_ids]
f.writelines(data)
if __name__ == "__main__":
parser = argparse.ArgumentParser(description="prepare_chinese_ref")
parser.add_argument(
"--file_name",
type=str,
default="./resources/chinese-demo.txt",
help="file need process, same as training data in lm",
)
parser.add_argument(
"--ltp", type=str, default="./resources/ltp", help="resources for LTP tokenizer, usually a path"
)
parser.add_argument("--bert", type=str, default="./resources/robert", help="resources for Bert tokenizer")
parser.add_argument("--save_path", type=str, default="./resources/ref.txt", help="path to save res")
args = parser.parse_args()
main(args)

View File

@@ -2,7 +2,7 @@ import torch
from torch import nn
from torch.nn import CrossEntropyLoss, MSELoss
from transformers.file_utils import add_start_docstrings, add_start_docstrings_to_callable
from transformers.file_utils import add_start_docstrings, add_start_docstrings_to_model_forward
from transformers.modeling_bert import (
BERT_INPUTS_DOCSTRING,
BERT_START_DOCSTRING,
@@ -134,7 +134,7 @@ class DeeBertModel(BertPreTrainedModel):
for layer, heads in heads_to_prune.items():
self.encoder.layer[layer].attention.prune_heads(heads)
@add_start_docstrings_to_callable(BERT_INPUTS_DOCSTRING)
@add_start_docstrings_to_model_forward(BERT_INPUTS_DOCSTRING)
def forward(
self,
input_ids=None,
@@ -198,7 +198,7 @@ class DeeBertModel(BertPreTrainedModel):
extended_attention_mask: torch.Tensor = self.get_extended_attention_mask(attention_mask, input_shape, device)
# If a 2D ou 3D attention mask is provided for the cross-attention
# we need to make broadcastabe to [batch_size, num_heads, seq_length, seq_length]
# we need to make broadcastable to [batch_size, num_heads, seq_length, seq_length]
if encoder_attention_mask.dim() == 3:
encoder_extended_attention_mask = encoder_attention_mask[:, None, :, :]
if encoder_attention_mask.dim() == 2:
@@ -260,7 +260,7 @@ class BertHighway(nn.Module):
# BertModel
bmodel_output = (pooler_input, pooler_output) + encoder_outputs[1:]
# "return" bodel_output
# "return" bmodel_output
# Dropout and classification
pooled_output = bmodel_output[1]
@@ -288,7 +288,7 @@ class DeeBertForSequenceClassification(BertPreTrainedModel):
self.init_weights()
@add_start_docstrings_to_callable(BERT_INPUTS_DOCSTRING)
@add_start_docstrings_to_model_forward(BERT_INPUTS_DOCSTRING)
def forward(
self,
input_ids=None,

View File

@@ -4,7 +4,7 @@ import torch.nn as nn
from torch.nn import CrossEntropyLoss, MSELoss
from transformers.configuration_roberta import RobertaConfig
from transformers.file_utils import add_start_docstrings, add_start_docstrings_to_callable
from transformers.file_utils import add_start_docstrings, add_start_docstrings_to_model_forward
from transformers.modeling_roberta import ROBERTA_INPUTS_DOCSTRING, ROBERTA_START_DOCSTRING, RobertaEmbeddings
from .modeling_highway_bert import BertPreTrainedModel, DeeBertModel, HighwayException, entropy
@@ -45,7 +45,7 @@ class DeeRobertaForSequenceClassification(BertPreTrainedModel):
self.dropout = nn.Dropout(config.hidden_dropout_prob)
self.classifier = nn.Linear(config.hidden_size, self.config.num_labels)
@add_start_docstrings_to_callable(ROBERTA_INPUTS_DOCSTRING)
@add_start_docstrings_to_model_forward(ROBERTA_INPUTS_DOCSTRING)
def forward(
self,
input_ids=None,

View File

@@ -5,7 +5,7 @@ import unittest
from unittest.mock import patch
import run_glue_deebert
from transformers.testing_utils import slow
from transformers.testing_utils import require_torch_non_multigpu_but_fix_me, slow
logging.basicConfig(level=logging.DEBUG)
@@ -26,6 +26,7 @@ class DeeBertTests(unittest.TestCase):
logger.addHandler(stream_handler)
@slow
@require_torch_non_multigpu_but_fix_me
def test_glue_deebert_train(self):
train_args = """

View File

@@ -12,7 +12,7 @@ This folder contains the original code used to train Distil* as well as examples
**October 3, 2019 - Update** We release our [NeurIPS workshop paper](https://arxiv.org/abs/1910.01108) explaining our approach on **DistilBERT**. It includes updated results and further experiments. We applied the same method to GPT2 and release the weights of **DistilGPT2**. DistilGPT2 is two times faster and 33% smaller than GPT2. **The paper supersedes our [previous blogpost](https://medium.com/huggingface/distilbert-8cf3380435b5) with a different distillation loss and better performances. Please use the paper as a reference when comparing/reporting results on DistilBERT.**
**September 19, 2019 - Update:** We fixed bugs in the code and released an upadted version of the weights trained with a modification of the distillation loss. DistilBERT now reaches 99% of `BERT-base`'s performance on GLUE, and 86.9 F1 score on SQuAD v1.1 dev set (compared to 88.5 for `BERT-base`). We will publish a formal write-up of our approach in the near future!
**September 19, 2019 - Update:** We fixed bugs in the code and released an updated version of the weights trained with a modification of the distillation loss. DistilBERT now reaches 99% of `BERT-base`'s performance on GLUE, and 86.9 F1 score on SQuAD v1.1 dev set (compared to 88.5 for `BERT-base`). We will publish a formal write-up of our approach in the near future!
## What is Distil*

View File

@@ -265,7 +265,7 @@ class Distiller:
-------
token_ids: `torch.tensor(bs, seq_length)` - The token ids after the modifications for MLM.
attn_mask: `torch.tensor(bs, seq_length)` - The attention mask for the self-attention.
clm_labels: `torch.tensor(bs, seq_length)` - The causal languge modeling labels. There is a -100 where there is nothing to predict.
clm_labels: `torch.tensor(bs, seq_length)` - The causal language modeling labels. There is a -100 where there is nothing to predict.
"""
token_ids, lengths = batch
token_ids, lengths = self.round_batch(x=token_ids, lengths=lengths)
@@ -401,9 +401,9 @@ class Distiller:
# https://github.com/peterliht/knowledge-distillation-pytorch/blob/master/model/net.py#L100
# https://github.com/peterliht/knowledge-distillation-pytorch/issues/2
if self.params.restrict_ce_to_mask:
mask = (lm_labels > -1).unsqueeze(-1).expand_as(s_logits) # (bs, seq_lenth, voc_size)
mask = (lm_labels > -1).unsqueeze(-1).expand_as(s_logits) # (bs, seq_length, voc_size)
else:
mask = attention_mask.unsqueeze(-1).expand_as(s_logits) # (bs, seq_lenth, voc_size)
mask = attention_mask.unsqueeze(-1).expand_as(s_logits) # (bs, seq_length, voc_size)
s_logits_slct = torch.masked_select(s_logits, mask) # (bs * seq_length * voc_size) modulo the 1s in mask
s_logits_slct = s_logits_slct.view(-1, s_logits.size(-1)) # (bs * seq_length, voc_size) modulo the 1s in mask
t_logits_slct = torch.masked_select(t_logits, mask) # (bs * seq_length * voc_size) modulo the 1s in mask

View File

@@ -61,7 +61,7 @@ class LmSeqsDataset(Dataset):
def remove_long_sequences(self):
"""
Sequences that are too long are splitted by chunk of max_model_input_size.
Sequences that are too long are split by chunk of max_model_input_size.
"""
max_len = self.params.max_model_input_size
indices = self.lengths > max_len
@@ -138,8 +138,8 @@ class LmSeqsDataset(Dataset):
# logger.info(f'{data_len} tokens ({nb_unique_tokens} unique)')
# unk_idx = self.params.special_tok_ids['unk_token']
# nb_unkown = sum([(t==unk_idx).sum() for t in self.token_ids])
# logger.info(f'{nb_unkown} unknown tokens (covering {100*nb_unkown/data_len:.2f}% of the data)')
# nb_unknown = sum([(t==unk_idx).sum() for t in self.token_ids])
# logger.info(f'{nb_unknown} unknown tokens (covering {100*nb_unknown/data_len:.2f}% of the data)')
def batch_sequences(self, batch):
"""

View File

@@ -4,4 +4,4 @@ gitpython==3.0.2
tensorboard>=1.14.0
tensorboardX==1.8
psutil==5.6.6
scipy==1.3.1
scipy>=1.4.1

View File

@@ -96,7 +96,7 @@ if __name__ == "__main__":
compressed_sd["lm_head.weight"] = state_dict["lm_head.weight"]
print(f"N layers selected for distillation: {std_idx}")
print(f"Number of params transfered for distillation: {len(compressed_sd.keys())}")
print(f"Number of params transferred for distillation: {len(compressed_sd.keys())}")
print(f"Save transfered checkpoint to {args.dump_checkpoint}.")
print(f"Save transferred checkpoint to {args.dump_checkpoint}.")
torch.save(compressed_sd, args.dump_checkpoint)

View File

@@ -1,16 +1,19 @@
## Language model training
Based on the script [`run_language_modeling.py`](https://github.com/huggingface/transformers/blob/master/examples/language-modeling/run_language_modeling.py).
Fine-tuning (or training from scratch) the library models for language modeling on a text dataset for GPT, GPT-2,
ALBERT, BERT, DistilBERT, RoBERTa, XLNet... GPT and GPT-2 are trained or fine-tuned using a causal language modeling
(CLM) loss while ALBERT, BERT, DistilBERT and RoBERTa are trained or fine-tuned using a masked language modeling (MLM)
loss. XLNet uses permutation language modeling (PLM), you can find more information about the differences between those
objectives in our [model summary](https://huggingface.co/transformers/model_summary.html).
Fine-tuning (or training from scratch) the library models for language modeling on a text dataset for GPT, GPT-2, BERT, DistilBERT and RoBERTa. GPT and GPT-2 are fine-tuned using a causal language modeling (CLM) loss while BERT, DistilBERT and RoBERTa
are fine-tuned using a masked language modeling (MLM) loss.
These scripts leverage the 🤗 Datasets library and the Trainer API. You can easily customize them to your needs if you
need extra processing on your datasets.
Before running the following example, you should get a file that contains text on which the language model will be
trained or fine-tuned. A good example of such text is the [WikiText-2 dataset](https://blog.einstein.ai/the-wikitext-long-term-dependency-language-modeling-dataset/).
**Note:** The old script `run_language_modeling.py` is still available
[here](https://github.com/huggingface/transformers/blob/master/examples/contrib/legacy/run_language_modeling.py).
We will refer to two different files: `$TRAIN_FILE`, which contains text for training, and `$TEST_FILE`, which contains
text that will be used for evaluation.
The following examples, will run on a datasets hosted on our [hub](https://huggingface.co/datasets) or with your own
text files for training and validation. We give examples of both below.
### GPT-2/GPT and causal language modeling
@@ -18,48 +21,137 @@ The following example fine-tunes GPT-2 on WikiText-2. We're using the raw WikiTe
the tokenization). The loss here is that of causal language modeling.
```bash
export TRAIN_FILE=/path/to/dataset/wiki.train.raw
export TEST_FILE=/path/to/dataset/wiki.test.raw
python run_language_modeling.py \
--output_dir=output \
--model_type=gpt2 \
--model_name_or_path=gpt2 \
python run_clm.py \
--model_name_or_path gpt2 \
--dataset_name wikitext \
--dataset_config_name wikitext-2-raw-v1 \
--do_train \
--train_data_file=$TRAIN_FILE \
--do_eval \
--eval_data_file=$TEST_FILE
--output_dir /tmp/test-clm
```
This takes about half an hour to train on a single K80 GPU and about one minute for the evaluation to run. It reaches
a score of ~20 perplexity once fine-tuned on the dataset.
To run on your own training and validation files, use the following command:
```bash
python run_clm.py \
--model_name_or_path gpt2 \
--train_file path_to_train_file \
--validation_file path_to_validation_file \
--do_train \
--do_eval \
--output_dir /tmp/test-clm
```
### RoBERTa/BERT/DistilBERT and masked language modeling
The following example fine-tunes RoBERTa on WikiText-2. Here too, we're using the raw WikiText-2. The loss is different
as BERT/RoBERTa have a bidirectional mechanism; we're therefore using the same loss that was used during their
pre-training: masked language modeling.
In accordance to the RoBERTa paper, we use dynamic masking rather than static masking. The model may, therefore, converge
slightly slower (over-fitting takes more epochs).
In accordance to the RoBERTa paper, we use dynamic masking rather than static masking. The model may, therefore,
converge slightly slower (over-fitting takes more epochs).
```bash
python run_mlm.py \
--model_name_or_path roberta-base \
--dataset_name wikitext \
--dataset_config_name wikitext-2-raw-v1 \
--do_train \
--do_eval \
--output_dir /tmp/test-mlm
```
To run on your own training and validation files, use the following command:
```bash
python run_mlm.py \
--model_name_or_path roberta-base \
--train_file path_to_train_file \
--validation_file path_to_validation_file \
--do_train \
--do_eval \
--output_dir /tmp/test-mlm
```
If your dataset is organized with one sample per line, you can use the `--line_by_line` flag (otherwise the script
concatenates all texts and then splits them in blocks of the same length).
**Note:** On TPU, you should use the flag `--pad_to_max_length` in conjunction with the `--line_by_line` flag to make
sure all your batches have the same length.
### Whole word masking
The BERT authors released a new version of BERT using Whole Word Masking in May 2019. Instead of masking randomly
selected tokens (which may be part of words), they mask randomly selected words (masking all the tokens corresponding
to that word). This technique has been refined for Chinese in [this paper](https://arxiv.org/abs/1906.08101).
To fine-tune a model using whole word masking, use the following script:
python run_mlm_wwm.py \
--model_name_or_path roberta-base \
--dataset_name wikitext \
--dataset_config_name wikitext-2-raw-v1 \
--do_train \
--do_eval \
--output_dir /tmp/test-mlm-wwm
```
For Chinese models, we need to generate a reference files (which requires the ltp library), because it's tokenized at
the character level.
**Q :** Why a reference file?
**A :** Suppose we have a Chinese sentence like: `我喜欢你` The original Chinese-BERT will tokenize it as
`['我','喜','欢','你']` (character level). But `喜欢` is a whole word. For whole word masking proxy, we need a result
like `['我','喜','##欢','你']`, so we need a reference file to tell the model which position of the BERT original token
should be added `##`.
**Q :** Why LTP ?
**A :** Cause the best known Chinese WWM BERT is [Chinese-BERT-wwm](https://github.com/ymcui/Chinese-BERT-wwm) by HIT.
It works well on so many Chines Task like CLUE (Chinese GLUE). They use LTP, so if we want to fine-tune their model,
we need LTP.
Now LTP only only works well on `transformers==3.2.0`. So we don't add it to requirements.txt.
You need to create a separate environment with this version of Transformers to run the `run_chinese_ref.py` script that
will create the reference files. The script is in `examples/contrib`. Once in the proper environment, run the
following:
We use the `--mlm` flag so that the script may change its loss function.
```bash
export TRAIN_FILE=/path/to/dataset/wiki.train.raw
export TEST_FILE=/path/to/dataset/wiki.test.raw
export LTP_RESOURCE=/path/to/ltp/tokenizer
export BERT_RESOURCE=/path/to/bert/tokenizer
export SAVE_PATH=/path/to/data/ref.txt
python run_language_modeling.py \
--output_dir=output \
--model_type=roberta \
--model_name_or_path=roberta-base \
--do_train \
--train_data_file=$TRAIN_FILE \
--do_eval \
--eval_data_file=$TEST_FILE \
--mlm
python examples/contrib/run_chinese_ref.py \
--file_name=path_to_train_or_eval_file \
--ltp=path_to_ltp_tokenizer \
--bert=path_to_bert_tokenizer \
--save_path=path_to_reference_file
```
Then you can run the script like this:
```bash
python run_mlm_wwm.py \
--model_name_or_path roberta-base \
--train_file path_to_train_file \
--validation_file path_to_validation_file \
--train_ref_file path_to_train_chinese_ref_file \
--validation_ref_file path_to_validation_chinese_ref_file \
--do_train \
--do_eval \
--output_dir /tmp/test-mlm-wwm
```
**Note:** On TPU, you should the flag `--pad_to_max_length` to make sure all your batches have the same length.
### XLNet and permutation language modeling
XLNet uses a different training objective, which is permutation language modeling. It is an autoregressive method
@@ -72,15 +164,32 @@ context length for permutation language modeling.
The `--max_span_length` flag may also be used to limit the length of a span of masked tokens used
for permutation language modeling.
```bash
export TRAIN_FILE=/path/to/dataset/wiki.train.raw
export TEST_FILE=/path/to/dataset/wiki.test.raw
Here is how to fine-tun XLNet on wikitext-2:
python run_language_modeling.py \
--output_dir=output \
```bash
python run_plm.py \
--model_name_or_path=xlnet-base-cased \
--dataset_name wikitext \
--dataset_config_name wikitext-2-raw-v1 \
--do_train \
--train_data_file=$TRAIN_FILE \
--do_eval \
--eval_data_file=$TEST_FILE \
--output_dir /tmp/test-plm
```
To fine-tune it on your own training and validation file, run:
```bash
python run_plm.py \
--model_name_or_path=xlnet-base-cased \
--train_file path_to_train_file \
--validation_file path_to_validation_file \
--do_train \
--do_eval \
--output_dir /tmp/test-plm
```
If your dataset is organized with one sample per line, you can use the `--line_by_line` flag (otherwise the script
concatenates all texts and then splits them in blocks of the same length).
**Note:** On TPU, you should use the flag `--pad_to_max_length` in conjunction with the `--line_by_line` flag to make
sure all your batches have the same length.

View File

@@ -0,0 +1,345 @@
# coding=utf-8
# Copyright 2020 The HuggingFace Inc. team. All rights reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
"""
Fine-tuning the library models for causal language modeling (GPT, GPT-2, CTRL, ...) on a text file or a dataset.
Here is the full list of checkpoints on the hub that can be fine-tuned by this script:
https://huggingface.co/models?filter=causal-lm
"""
# You can also adapt this script on your own causal language modeling task. Pointers for this are left as comments.
import logging
import math
import os
import sys
from dataclasses import dataclass, field
from typing import Optional
from datasets import load_dataset
import transformers
from transformers import (
CONFIG_MAPPING,
MODEL_FOR_CAUSAL_LM_MAPPING,
AutoConfig,
AutoModelForCausalLM,
AutoTokenizer,
HfArgumentParser,
Trainer,
TrainingArguments,
default_data_collator,
set_seed,
)
from transformers.trainer_utils import is_main_process
logger = logging.getLogger(__name__)
MODEL_CONFIG_CLASSES = list(MODEL_FOR_CAUSAL_LM_MAPPING.keys())
MODEL_TYPES = tuple(conf.model_type for conf in MODEL_CONFIG_CLASSES)
@dataclass
class ModelArguments:
"""
Arguments pertaining to which model/config/tokenizer we are going to fine-tune, or train from scratch.
"""
model_name_or_path: Optional[str] = field(
default=None,
metadata={
"help": "The model checkpoint for weights initialization."
"Don't set if you want to train a model from scratch."
},
)
model_type: Optional[str] = field(
default=None,
metadata={"help": "If training from scratch, pass a model type from the list: " + ", ".join(MODEL_TYPES)},
)
config_name: Optional[str] = field(
default=None, metadata={"help": "Pretrained config name or path if not the same as model_name"}
)
tokenizer_name: Optional[str] = field(
default=None, metadata={"help": "Pretrained tokenizer name or path if not the same as model_name"}
)
cache_dir: Optional[str] = field(
default=None, metadata={"help": "Where do you want to store the pretrained models downloaded from s3"}
)
use_fast_tokenizer: bool = field(
default=True,
metadata={"help": "Whether to use one of the fast tokenizer (backed by the tokenizers library) or not."},
)
@dataclass
class DataTrainingArguments:
"""
Arguments pertaining to what data we are going to input our model for training and eval.
"""
dataset_name: Optional[str] = field(
default=None, metadata={"help": "The name of the dataset to use (via the datasets library)."}
)
dataset_config_name: Optional[str] = field(
default=None, metadata={"help": "The configuration name of the dataset to use (via the datasets library)."}
)
train_file: Optional[str] = field(default=None, metadata={"help": "The input training data file (a text file)."})
validation_file: Optional[str] = field(
default=None,
metadata={"help": "An optional input evaluation data file to evaluate the perplexity on (a text file)."},
)
block_size: int = field(
default=-1,
metadata={
"help": "Optional input sequence length after tokenization."
"The training dataset will be truncated in block of this size for training."
"Default to the model max input length for single sentence inputs (take into account special tokens)."
},
)
overwrite_cache: bool = field(
default=False, metadata={"help": "Overwrite the cached training and evaluation sets"}
)
preprocessing_num_workers: Optional[int] = field(
default=None,
metadata={"help": "The number of processes to use for the preprocessing."},
)
def __post_init__(self):
if self.dataset_name is None and self.train_file is None and self.validation_file is None:
raise ValueError("Need either a dataset name or a training/validation file.")
else:
if self.train_file is not None:
extension = self.train_file.split(".")[-1]
assert extension in ["csv", "json", "txt"], "`train_file` should be a csv, a json or a txt file."
if self.validation_file is not None:
extension = self.validation_file.split(".")[-1]
assert extension in ["csv", "json", "txt"], "`validation_file` should be a csv, a json or a txt file."
def main():
# See all possible arguments in src/transformers/training_args.py
# or by passing the --help flag to this script.
# We now keep distinct sets of args, for a cleaner separation of concerns.
parser = HfArgumentParser((ModelArguments, DataTrainingArguments, TrainingArguments))
if len(sys.argv) == 2 and sys.argv[1].endswith(".json"):
# If we pass only one argument to the script and it's the path to a json file,
# let's parse it to get our arguments.
model_args, data_args, training_args = parser.parse_json_file(json_file=os.path.abspath(sys.argv[1]))
else:
model_args, data_args, training_args = parser.parse_args_into_dataclasses()
if (
os.path.exists(training_args.output_dir)
and os.listdir(training_args.output_dir)
and training_args.do_train
and not training_args.overwrite_output_dir
):
raise ValueError(
f"Output directory ({training_args.output_dir}) already exists and is not empty."
"Use --overwrite_output_dir to overcome."
)
# Setup logging
logging.basicConfig(
format="%(asctime)s - %(levelname)s - %(name)s - %(message)s",
datefmt="%m/%d/%Y %H:%M:%S",
level=logging.INFO if is_main_process(training_args.local_rank) else logging.WARN,
)
# Log on each process the small summary:
logger.warning(
f"Process rank: {training_args.local_rank}, device: {training_args.device}, n_gpu: {training_args.n_gpu}"
+ f"distributed training: {bool(training_args.local_rank != -1)}, 16-bits training: {training_args.fp16}"
)
# Set the verbosity to info of the Transformers logger (on main process only):
if is_main_process(training_args.local_rank):
transformers.utils.logging.set_verbosity_info()
logger.info("Training/evaluation parameters %s", training_args)
# Set seed before initializing model.
set_seed(training_args.seed)
# Get the datasets: you can either provide your own CSV/JSON/TXT training and evaluation files (see below)
# or just provide the name of one of the public datasets available on the hub at https://huggingface.co/datasets/
# (the dataset will be downloaded automatically from the datasets Hub).
#
# For CSV/JSON files, this script will use the column called 'text' or the first column if no column called
# 'text' is found. You can easily tweak this behavior (see below).
#
# In distributed training, the load_dataset function guarantee that only one local process can concurrently
# download the dataset.
if data_args.dataset_name is not None:
# Downloading and loading a dataset from the hub.
datasets = load_dataset(data_args.dataset_name, data_args.dataset_config_name)
else:
data_files = {}
if data_args.train_file is not None:
data_files["train"] = data_args.train_file
if data_args.validation_file is not None:
data_files["validation"] = data_args.validation_file
extension = data_args.train_file.split(".")[-1]
if extension == "txt":
extension = "text"
datasets = load_dataset(extension, data_files=data_files)
# See more about loading any type of standard or custom dataset (from files, python dict, pandas DataFrame, etc) at
# https://huggingface.co/docs/datasets/loading_datasets.html.
# Load pretrained model and tokenizer
#
# Distributed training:
# The .from_pretrained methods guarantee that only one local process can concurrently
# download model & vocab.
if model_args.config_name:
config = AutoConfig.from_pretrained(model_args.config_name, cache_dir=model_args.cache_dir)
elif model_args.model_name_or_path:
config = AutoConfig.from_pretrained(model_args.model_name_or_path, cache_dir=model_args.cache_dir)
else:
config = CONFIG_MAPPING[model_args.model_type]()
logger.warning("You are instantiating a new config instance from scratch.")
if model_args.tokenizer_name:
tokenizer = AutoTokenizer.from_pretrained(
model_args.tokenizer_name, cache_dir=model_args.cache_dir, use_fast=model_args.use_fast_tokenizer
)
elif model_args.model_name_or_path:
tokenizer = AutoTokenizer.from_pretrained(
model_args.model_name_or_path, cache_dir=model_args.cache_dir, use_fast=model_args.use_fast_tokenizer
)
else:
raise ValueError(
"You are instantiating a new tokenizer from scratch. This is not supported by this script."
"You can do it from another script, save it, and load it from here, using --tokenizer_name."
)
if model_args.model_name_or_path:
model = AutoModelForCausalLM.from_pretrained(
model_args.model_name_or_path,
from_tf=bool(".ckpt" in model_args.model_name_or_path),
config=config,
cache_dir=model_args.cache_dir,
)
else:
logger.info("Training new model from scratch")
model = AutoModelForCausalLM.from_config(config)
model.resize_token_embeddings(len(tokenizer))
# Preprocessing the datasets.
# First we tokenize all the texts.
if training_args.do_train:
column_names = datasets["train"].column_names
else:
column_names = datasets["validation"].column_names
text_column_name = "text" if "text" in column_names else column_names[0]
def tokenize_function(examples):
return tokenizer(examples[text_column_name])
tokenized_datasets = datasets.map(
tokenize_function,
batched=True,
num_proc=data_args.preprocessing_num_workers,
remove_columns=[text_column_name],
load_from_cache_file=not data_args.overwrite_cache,
)
if data_args.block_size <= 0:
block_size = tokenizer.model_max_length
else:
if data_args.block_size > tokenizer.model_max_length:
logger.warn(
f"The block_size passed ({data_args.block_size}) is larger than the maximum length for the model"
f"({tokenizer.model_max_length}). Using block_size={tokenizer.model_max_length}."
)
block_size = min(data_args.block_size, tokenizer.model_max_length)
# Main data processing function that will concatenate all texts from our dataset and generate chunks of block_size.
def group_texts(examples):
# Concatenate all texts.
concatenated_examples = {k: sum(examples[k], []) for k in examples.keys()}
total_length = len(concatenated_examples[list(examples.keys())[0]])
# We drop the small remainder, we could add padding if the model supported it instead of this drop, you can
# customize this part to your needs.
total_length = (total_length // block_size) * block_size
# Split by chunks of max_len.
result = {
k: [t[i : i + block_size] for i in range(0, total_length, block_size)]
for k, t in concatenated_examples.items()
}
result["labels"] = result["input_ids"].copy()
return result
# Note that with `batched=True`, this map processes 1,000 texts together, so group_texts throws away a remainder
# for each of those groups of 1,000 texts. You can adjust that batch_size here but a higher value might be slower
# to preprocess.
#
# To speed up this part, we use multiprocessing. See the documentation of the map method for more information:
# https://huggingface.co/docs/datasets/package_reference/main_classes.html#datasets.Dataset.map
lm_datasets = tokenized_datasets.map(
group_texts,
batched=True,
num_proc=data_args.preprocessing_num_workers,
load_from_cache_file=not data_args.overwrite_cache,
)
# Initialize our Trainer
trainer = Trainer(
model=model,
args=training_args,
train_dataset=lm_datasets["train"] if training_args.do_train else None,
eval_dataset=lm_datasets["validation"] if training_args.do_eval else None,
tokenizer=tokenizer,
# Data collator will default to DataCollatorWithPadding, so we change it.
data_collator=default_data_collator,
)
# Training
if training_args.do_train:
trainer.train(
model_path=model_args.model_name_or_path if os.path.isdir(model_args.model_name_or_path) else None
)
trainer.save_model() # Saves the tokenizer too for easy upload
# Evaluation
results = {}
if training_args.do_eval:
logger.info("*** Evaluate ***")
eval_output = trainer.evaluate()
perplexity = math.exp(eval_output["eval_loss"])
results["perplexity"] = perplexity
output_eval_file = os.path.join(training_args.output_dir, "eval_results_clm.txt")
if trainer.is_world_process_zero():
with open(output_eval_file, "w") as writer:
logger.info("***** Eval results *****")
for key, value in results.items():
logger.info(f" {key} = {value}")
writer.write(f"{key} = {value}\n")
return results
def _mp_fn(index):
# For xla_spawn (TPUs)
main()
if __name__ == "__main__":
main()

Some files were not shown because too many files have changed in this diff Show More