Compare commits

...

169 Commits

Author SHA1 Message Date
Lysandre
64e78564a5 Release: v4.6.0
Some checks failed
Release - Conda / build_and_package (push) Has been cancelled
2021-05-12 17:03:03 +02:00
Patrick von Platen
fd6204b2a7 [Lazy init] Force fall back to slow init for composite models (#11705)
* fix encoder-decoder & RAG

* finalize

* Update src/transformers/models/encoder_decoder/modeling_encoder_decoder.py

Co-authored-by: Lysandre Debut <lysandre@huggingface.co>

* Update src/transformers/models/rag/modeling_rag.py

Co-authored-by: Lysandre Debut <lysandre@huggingface.co>

Co-authored-by: Patrick von Platen <patrick@huggingface.co>
Co-authored-by: Lysandre Debut <lysandre@huggingface.co>
2021-05-12 10:52:54 -04:00
Suraj Patil
5c1cda9d3c fix example in config doc (#11696) 2021-05-12 09:48:52 -04:00
Philip May
77f4c46b50 remove defaults to None if optional (#11703) 2021-05-12 09:11:10 -04:00
Marc van Zee
6797cdc077 Updates README and fixes bug (#11701) 2021-05-12 13:52:52 +01:00
Suraj Patil
f063c56d94 Fix clip docs (#11694)
* fix doc url

* fix example
2021-05-12 15:28:30 +05:30
Suraj Patil
8719afa1ad CLIP (#11445)
* begin second draft

* fix import, style

* add loss

* fix embeds, logits_scale, and projection

* fix imports

* add conversion script

* add feature_extractor and processor

* style

* add tests for tokenizer, extractor and processor

* add vision model tests

* add weight init

* add more tests

* fix save_load  test

* model output, dosstrings, causal mask

* config doc

* add clip model tests

* return dict

* bigin integration test

* add integration tests

* fix-copies

* fix init

* Clip => CLIP

* fix module name

* docs

* fix doc

* output_dim => projection_dim

* fix checkpoint names

* remoe fast tokenizer file

* fix conversion script

* fix tests, quality

* put causal mask on device

* Apply suggestions from code review

Co-authored-by: Sylvain Gugger <35901082+sgugger@users.noreply.github.com>

* fix attribute test

* style

* address sylvains comments

* style

* fix docstrings

* add qucik_gelu in activations, docstrings

* clean-up attention test

* fix act fun

* fix config

* fix torchscript tests

* even batch_size

* remove comment

* fix ouput tu_tuple

* fix save load tests

* fix add tokens test

* add fast tokenizer

* update copyright

* new processor API

* fix docs

* docstrings

* docs

* fix doc

* fix doc

* fix tokenizer

* fix import in doc example

* Apply suggestions from code review

Co-authored-by: Sylvain Gugger <35901082+sgugger@users.noreply.github.com>

* check types of config

* valhalla => openai

* load image using url

* fix test

* typo

Co-authored-by: Sylvain Gugger <35901082+sgugger@users.noreply.github.com>
2021-05-12 13:48:15 +05:30
Marc van Zee
4ce6bcc310 Adds Flax BERT finetuning example on GLUE (#11564)
* Adds Flax BERT finetuning example

* fix traced jax tensor type

* Use Optax losses and learning schedulers

* Add 1GPU training results

* merge into master & make style

* fix input

* del file

* Fix bug in loss and add torch runs

* finish bert flax fine-tune

* Update examples/flax/text-classification/README.md

* Update examples/flax/text-classification/run_flax_glue.py

* add requirements

* finalize

* finalize

Co-authored-by: Patrick von Platen <patrick.v.platen@gmail.com>
Co-authored-by: Patrick von Platen <patrick@huggingface.co>
2021-05-11 19:02:59 +01:00
Sylvain Gugger
f13f1f8fb8 Test checkpointing (#11682)
* Add test and see where CI is unhappy

* Load with strict=False
2021-05-11 12:02:48 -04:00
Julien Plu
d9b286272c Fix TF Roberta for mixed precision training (#11675) 2021-05-11 12:01:03 -04:00
Sylvain Gugger
a135f59536 Auto modelcard (#11599)
* Autogenerate model cards from the Trainer

* ModelCard deprecated

* Fix test

* Style

* Apply suggestions from code review

Co-authored-by: Patrick von Platen <patrick.v.platen@gmail.com>

* Address review comments

* Quality

* With all metadata

* Metadata

* Post-merge conflict mess

* Data args and all examples

* Default license and languages when possible

Co-authored-by: Patrick von Platen <patrick.v.platen@gmail.com>
2021-05-11 11:30:34 -04:00
Matt
b3429ab678 Grammar and style edits for the frontpage README (#11679)
* Grammar and style edits for the frontpage README

* Going all-in on em-dashes because you only live once

* Update README.md

Co-authored-by: Sylvain Gugger <35901082+sgugger@users.noreply.github.com>

Co-authored-by: Sylvain Gugger <35901082+sgugger@users.noreply.github.com>
2021-05-11 15:49:34 +01:00
nxznm
901153c61e Fix docstring of description about input_ids (#11672) 2021-05-11 08:12:02 -04:00
Jonathan Chang
64232bc0df Add --text_column to run_summarization_no_trainer (#11673) 2021-05-11 07:58:38 -04:00
Julien Plu
024cd19bb7 Add MacOS TF version (#11674)
Co-authored-by: Julien Plu <jplu@argos.local>
2021-05-11 05:42:21 -04:00
Pavel Soriano
9120ae7d66 Fixes NoneType exception when topk is larger than one coupled with a small context in the Question-Answering pipeline (#11628)
* added fix to decode function. added test to qa pipeline tests

* completed topk docstring

* fixed formatting with black

* applied style_doc to fix line length
2021-05-10 13:28:10 -04:00
Patrick von Platen
dcb0e61430 push (#11667) 2021-05-10 17:38:17 +01:00
Sylvain Gugger
05a930671f Save scaler state dict when checkpointing (#11663) 2021-05-10 10:58:30 -04:00
Matt
ef8d32c5ea Fix suggested by @bhadreshpsavani (#11660) 2021-05-10 14:28:04 +01:00
Vasudev Gupta
575c979144 Update community.md (#11654) 2021-05-10 09:48:21 +01:00
Tanmay Laud
f7f872955d Big Bird Fast Tokenizer implementation (#11075)
* Added Big Bird Fast Tokenizer initial file

* style fixes

* flake fixes

* Added big bird fast tokenizer to init files

* Added big bird fast to Auto tokenization

* fix styles

* minor quality fixes

* Added initial test code

* Fix SpmConverter when precompiled_charsmap doesn't exist

* fixed post processor

* minor style fix

* minor fix input names

* Actually fix identity normalization

* style

* Added token type ids to fast tokenizer

* style

* flake fix

* fix copies

Co-authored-by: Anthony MOI <m.anthony.moi@gmail.com>
2021-05-10 03:01:23 -04:00
Bhavitvya Malik
80da304a0f updated user permissions based on umask (#11119)
* updated user permissions based on umask

* updated user permissions based on umask

* changes as per suggestions

* minor changes
2021-05-10 02:45:29 -04:00
Quentin Lhoest
1a0b41781d Update requirements.txt (#11634) 2021-05-10 11:19:52 +05:30
NielsRogge
f785c51692 Update code example (#11631)
* Update code example

* Code review
2021-05-10 11:18:43 +05:30
Tommy Chiang
7e406f4a65 [Examples] Fix invalid links after reorg (#11650) 2021-05-10 11:16:48 +05:30
Tommy Chiang
f2ffcaf49f [Examples] Check key exists in datasets first (#11503) 2021-05-09 15:42:38 -04:00
Stas Bekman
ba0d50f214 [examples] fix sys.path in conftest.py (#11636)
* restore conftest.py

* fix conftest and make copies

* remove unneeded parts

* remove unwanted files
2021-05-07 14:44:22 -07:00
Stas Bekman
cd9b8d7efe [self-push CI] sync with self-scheduled (#11637)
forgot to add the missing `libaio-dev` to this workflow
2021-05-07 14:06:33 -07:00
Lysandre Debut
da37eb8e43 Reduce to 1 worker and set timeout for GPU TF tests (#11633) 2021-05-07 11:55:20 -04:00
Lysandre Debut
39084ca663 Add the ImageClassificationPipeline (#11598)
* Add the ImageClassificationPipeline

* Code review

Co-authored-by: patrickvonplaten <patrick.v.platen@gmail.com>

* Have `load_image` at the module level

Co-authored-by: patrickvonplaten <patrick.v.platen@gmail.com>
2021-05-07 08:08:40 -04:00
Patrick von Platen
e7bff0aabe make fix copy (#11627) 2021-05-07 07:48:51 -04:00
Vasudev Gupta
dc3f6758cf Add BigBirdPegasus (#10991)
* init bigbird pegasus

* add debugging nb ; update config

* init conversion

* update conversion script

* complete conversion script

* init forward()

* complete forward()

* add tokenizer

* add some slow tests

* commit current

* fix copies

* add docs

* add conversion script for bigbird-roberta-summarization

* remove TODO

* small fixups

* correct tokenizer

* add bigbird core for now

* fix config

* fix more

* revert pegasus-tokenizer back

* make style

* everything working for pubmed; yayygit status

* complete tests finally

* remove bigbird pegasus tok

* correct tokenizer

* correct tests

* add tokenizer files

* finish make style

* fix test

* update

* make style

* fix tok utils base file

* make fix-copies

* clean a bit

* small update

* fix some suggestions

* add to readme

* fix a bit, clean tests

* fix more tests

* Update src/transformers/__init__.py

* Update src/transformers/__init__.py

* make fix-copies

* complete attn switching, auto-padding left

* make style

* fix auto-padding test

* make style

* fix batched attention tests

* put tolerance at 1e-1 for stand-alone decoder test

* fix docs

* fix tests

* correct slow tokenizer conversion

* Apply suggestions from code review

Co-authored-by: Suraj Patil <surajp815@gmail.com>
Co-authored-by: Sylvain Gugger <35901082+sgugger@users.noreply.github.com>

* complete remaining suggestions

* fix test

Co-authored-by: Patrick von Platen <patrick.v.platen@gmail.com>
Co-authored-by: Suraj Patil <surajp815@gmail.com>
Co-authored-by: Sylvain Gugger <35901082+sgugger@users.noreply.github.com>
2021-05-07 09:27:43 +02:00
Jonathan Chang
6f40e31766 Fix comment in run_clm_no_trainer.py (#11624) 2021-05-07 12:32:30 +05:30
Sylvain Gugger
33fd83bc01 Fix RNG saves in distributed mode. (#11620)
* Fix RNG saves in distributed mode.

* Update src/transformers/trainer.py

Co-authored-by: Stas Bekman <stas00@users.noreply.github.com>

Co-authored-by: Stas Bekman <stas00@users.noreply.github.com>
2021-05-06 17:14:12 -04:00
Stas Bekman
619200cc42 [cuda ext tests] fixing tests (#11619)
* fixing tests

* cleanup
2021-05-06 13:35:28 -07:00
Patrick von Platen
44c5621db0 fix tests (#11615) 2021-05-06 20:42:51 +02:00
Sylvain Gugger
7eee950ac3 Re-styling in seq2seq attention (#11613) 2021-05-06 14:24:19 -04:00
Eldar Kurtic
cf409e5594 Fix docstring typo (#11611) 2021-05-06 17:09:28 +05:30
Vipul Raheja
f594090a93 fix typo in command (#11605) 2021-05-06 12:32:54 +05:30
Lysandre Debut
079557c1c5 Fix Python version (#11607) 2021-05-06 02:50:11 -04:00
baeseongsu
c1780ce7a4 fix head_mask for albert encoder part(AlbertTransformer) (#11596)
* fix head mask for albert encoder part

* fix head_mask for albert encoder part
2021-05-06 02:18:02 -04:00
Mats Sjöberg
864c1dfe34 Accept tensorflow-rocm package when checking TF availability (#11595) 2021-05-05 14:44:29 -04:00
Patrick von Platen
3e3e41ae20 Pytorch - Lazy initialization of models (#11471)
* lazy_init_weights

* remove ipdb

* save int

* add necessary code

* remove unnecessary utils

* Update src/transformers/models/t5/modeling_t5.py

* clean

* add tests

* correct

* finish tests

* finish tests

* fix some more tests

* fix xlnet & transfo-xl

* fix more tests

* make sure tests are independent

* fix tests more

* finist tests

* final touches

* Update src/transformers/modeling_utils.py

* Apply suggestions from code review

* Update src/transformers/modeling_utils.py

Co-authored-by: Stas Bekman <stas00@users.noreply.github.com>

* Update src/transformers/modeling_utils.py

Co-authored-by: Stas Bekman <stas00@users.noreply.github.com>

* clean tests

* give arg positive name

* add more mock weights to xlnet

Co-authored-by: Stas Bekman <stas00@users.noreply.github.com>
2021-05-05 17:22:20 +02:00
Lysandre
8fa8e19429 Skip Funnel test 2021-05-05 12:38:01 +02:00
Deepali
83e59d8e0b add importlib_metadata and huggingface_hub as dependency in the conda recipe (#11591)
* add importlib_metadata as dependency (#11490)

Co-authored-by: Deepali Chourasia <deepch23@us.ibm.com>

* add huggingface_hub dependency

Co-authored-by: Deepali Chourasia <deepch23@us.ibm.com>
2021-05-05 03:36:18 -04:00
Stas Bekman
bf0dfa98d3 copies need to be fixed too (#11585) 2021-05-05 03:35:15 -04:00
Stas Bekman
c065025c47 [trainer] document resume randomness (#11588)
* document resume randomness

* fix link

* reword

* fix

* reword

* style
2021-05-04 14:17:11 -07:00
Sylvain Gugger
6b241e0e3b Reproducible checkpoint (#11582)
* Set generator in dataloader

* Use generator in all random samplers

* Checkpoint all RNG states

* Final version

* Quality

* Test

* Address review comments

* Quality

* Remove debug util

* Add python and numpy RNGs

* Split states in different files in distributed

* Quality

* local_rank for TPUs

* Only use generator when accepted

* Add test

* Set seed to avoid flakiness

* Make test less flaky

* Quality
2021-05-04 16:20:56 -04:00
Patrick Fernandes
0afe4a90f9 [Flax] Add Electra models (#11426)
* add electra model to flax

* Remove Electra Next Sentence Prediction model added by mistake

* fix parameter sharing and loosen equality threshold

* fix styling issues

* add mistaken removen imports

* fix electra table

* Add FlaxElectra to automodels and fixe docs

* fix issues pointed out the PR

* fix flax electra to comply with latest changes

* remove stale class

* add copied from

Co-authored-by: Patrick von Platen <patrick.v.platen@gmail.com>
2021-05-04 20:56:09 +02:00
Philipp Schmid
226e74b610 Removes SageMakerTrainer code but keeps class as wrapper (#11587)
* removed all old code

* make quality
2021-05-04 14:31:18 -04:00
Patrick von Platen
084a187da3 [FlaxRoberta] Add FlaxRobertaModels & adapt run_mlm_flax.py (#11470)
* add flax roberta

* make style

* correct initialiazation

* modify model to save weights

* fix copied from

* fix copied from

* correct some more code

* add more roberta models

* Apply suggestions from code review

* merge from master

* finish

* finish docs

Co-authored-by: Patrick von Platen <patrick@huggingface.co>
2021-05-04 19:57:59 +02:00
Sylvain Gugger
2ce0fb84cc Make quality scripts work when one backend is missing. (#11573)
* Make quality scripts work when one backend is missing.

* Check env variable is properly set

* Add default

* With print statements

* Fix typo

* Set env variable

* Remove debug code
2021-05-04 09:53:44 -04:00
Lysandre Debut
09b0bcfea9 Enable added tokens (#11325)
* Fix tests

* Reorganize

* Update tests/test_modeling_mobilebert.py

* Remove unnecessary addition
2021-05-04 08:13:57 -04:00
abhishek thakur
c40c7e213b Add multi-class, multi-label and regression to transformers (#11012)
* add to  bert

* review comments

* Update src/transformers/configuration_utils.py

Co-authored-by: Sylvain Gugger <35901082+sgugger@users.noreply.github.com>

* Update src/transformers/configuration_utils.py

Co-authored-by: Sylvain Gugger <35901082+sgugger@users.noreply.github.com>

* self.config.problem_type

* fix style

* fix

* fin

* fix

* update doc

* fix

* test

* Test more problem types

* Update src/transformers/configuration_utils.py

Co-authored-by: Sylvain Gugger <35901082+sgugger@users.noreply.github.com>

* fix

* remove

* fix

* quality

* make fix-copies

* remove test

Co-authored-by: abhishek thakur <abhishekkrthakur@users.noreply.github.com>
Co-authored-by: Sylvain Gugger <35901082+sgugger@users.noreply.github.com>
Co-authored-by: Lysandre <lysandre.debut@reseau.eseo.fr>
2021-05-04 02:23:40 -04:00
Stas Bekman
7c622482e8 fix resize_token_embeddings (#11572) 2021-05-03 13:12:06 -07:00
Sylvain Gugger
fe82b1bfa0 Update training tutorial (#11533)
* Update training tutorial

* Apply suggestions from code review

Co-authored-by: Hamel Husain <hamelsmu@github.com>

* Address review comments

* Update docs/source/training.rst

Co-authored-by: Lysandre Debut <lysandre@huggingface.co>

* More review comments

* Last review comments

Co-authored-by: Hamel Husain <hamelsmu@github.com>
Co-authored-by: Lysandre Debut <lysandre@huggingface.co>
2021-05-03 13:18:46 -04:00
Sylvain Gugger
f4c9a7e62e Accumulate opt state dict on do_rank 0 (#11481) 2021-05-03 13:18:27 -04:00
Nicolas Patry
1e8e06862f Fixes a useless warning. (#11566)
Fixes #11525
2021-05-03 18:48:13 +02:00
Sylvain Gugger
87dd1a00ef Fix metric computation in run_glue_no_trainer (#11569) 2021-05-03 11:42:55 -04:00
Muktan
a721a5eefd [Wav2vec2] Fixed tokenization mistakes while adding single-char tokens to tokenizer (#11538)
* Fixed tokenization mistakes while adding single-char tokens to tokenizer

* Added tests and Removed unnecessary comments.

* finalize wav2vec2 tok

* add more aggressive tests

* Apply suggestions from code review

* fix useless import

Co-authored-by: Patrick von Platen <patrick.v.platen@gmail.com>
2021-05-03 17:19:12 +02:00
NielsRogge
f3cf8ae7b3 Add LUKE (#11223)
* Rebase with master

* Minor bug fix in docs

* Copy files from adding_luke_v2 and improve docs

* change the default value of use_entity_aware_attention to True

* remove word_hidden_states

* fix head models

* fix tests

* fix the conversion script

* add integration tests for the pretrained large model

* improve docstring

* Improve docs, make style

* fix _init_weights for pytorch 1.8

* improve docs

* fix tokenizer to construct entity sequence with [MASK] entity when entities=None

* Make fix-copies

* Make style & quality

* Bug fixes

* Add LukeTokenizer to init

* Address most comments by @patil-suraj and @LysandreJik

* rename _compute_extended_attention_mask to get_extended_attention_mask

* add comments to LukeSelfAttention

* fix the documentation of the tokenizer

* address comments by @patil-suraj, @LysandreJik, and @sgugger

* improve docs

* Make style, quality and fix-copies

* Improve docs

* fix docs

* add "entity_span_classification" task

* update example code for LukeForEntitySpanClassification

* improve docs

* improve docs

* improve the code example in luke.rst

* rename the classification layer in LukeForEntityClassification from typing to classifier

* add bias to the classifier in LukeForEntitySpanClassification

* update docs to use fine-tuned hub models in code examples of the head models

* update the example sentences

* Make style & quality

* Add require_torch to tokenizer tests

* Add require_torch to tokenizer tests

* Address comments by @sgugger and add community notebooks

* Make fix-copies

Co-authored-by: Ikuya Yamada <ikuya@ikuya.net>
2021-05-03 09:07:29 -04:00
Frederik Bode
6a11e4c2ad fix the mlm longformer example by changing [MASK] to <mask> (#11559) 2021-05-03 12:43:30 +01:00
Lysandre Debut
1c86157d9d Remove datasets submodule. (#11563) 2021-05-03 06:02:33 -04:00
Patrick von Platen
c448c01f25 [Wav2Vec2] Fix convert (#11562)
* push

* small change

* correct other typo
2021-05-03 11:53:30 +02:00
Suraj Patil
623281aa12 [Flax BERT/Roberta] few small fixes (#11558)
* small fixes

* style
2021-05-03 10:35:06 +02:00
lewtun
a5d2967bd8 Fix examples in M2M100 docstrings (#11540)
Replaces `tok` with `tokenizer` so examples can run with copy-paste
2021-05-03 10:56:31 +05:30
jingyihe
980208650a Fixed docs for the shape of scores in generate() (#10057)
* Fixed the doc for the shape of return scores tuples in generation_utils.py.

* Fix the output shape of `scores` for `DecoderOnlyOutput`.

* style fix
2021-05-02 10:10:47 +02:00
Stas Bekman
4e7bf94e72 [DeepSpeed] fp32 support (#11499)
* prep for deepspeed==0.3.16

* new version

* too soon

* support and test fp32 mode

* troubleshooting doc start

* workaround no longer needed

* add fp32 doc

* style

* cleanup, add tf32 note

* clarify

* release was made
2021-04-30 12:51:48 -07:00
Stas Bekman
282f3ac3ef [debug utils] activation/weights underflow/overflow detector (#11274)
* sync

* add activation overflow debug utility

* cleanup

* document detect_overflow

* import torch

* add deprecation warning

* Apply suggestions from code review

Co-authored-by: Sylvain Gugger <35901082+sgugger@users.noreply.github.com>

* convert to rst, add note

* add class

* fix docs

* improve the doc

* rework to dump a lot more info about each frame

* complete expansion

* cleanup

* format

* cleanup

* doesn't have to be transformers

* Apply suggestions from code review

Co-authored-by: Sylvain Gugger <35901082+sgugger@users.noreply.github.com>

* wrap long line

* style

Co-authored-by: Sylvain Gugger <35901082+sgugger@users.noreply.github.com>
2021-04-30 11:15:46 -07:00
Hamel Husain
804c2974d5 Improve task summary docs (#11513)
* fix task summary docs

* refactor to use model.config.id2label instead of list

* fix nit

* Update docs/source/task_summary.rst

Co-authored-by: Sylvain Gugger <35901082+sgugger@users.noreply.github.com>

Co-authored-by: Sylvain Gugger <35901082+sgugger@users.noreply.github.com>
2021-04-30 09:06:47 -04:00
Sylvain Gugger
bc80f8bc37 Add Stas and Suraj as authors (#11526) 2021-04-30 09:03:13 -04:00
Bhadresh Savani
84326a28f8 [Examples] Added support for test-file in QA examples with no trainer (#11510)
* added support for test-file

* fixed typo

* added suggested changes

* reformatted code

* modifed files

* fix post processing error

* Trigger CI

* removed extra lines
2021-04-30 09:02:50 -04:00
Lysandre Debut
af0692a2ca Run model templates on master (#11527) 2021-04-30 08:47:12 -04:00
Suraj Patil
57c8e822f7 reszie token embeds (#11524) 2021-04-30 08:47:01 -04:00
Matt
20d6931e32 Update TF text classification example (#11496)
Big refactor, fixes and multi-GPU/TPU support
2021-04-30 13:45:33 +01:00
bonniehyeon
8b945ef03e Fix do_eval default value in training_args.py (#11511)
* Fix do_eval default value in training_args.py

* Update PULL_REQUEST_TEMPLATE.md
2021-04-30 08:35:12 -04:00
Takuya Makino
c2cd02ac62 Accepts BatchEncoding in LengthSampler (#11431) 2021-04-30 08:27:46 -04:00
Shubham Sanghavi
30ede8994e Implement Fast Tokenization for Deberta (#11387) 2021-04-30 08:08:15 -04:00
Nicolas Patry
db9dd09cf9 Adding AutomaticSpeechRecognitionPipeline. (#11337)
* Adding `AutomaticSpeechRecognitionPipeline`.

- Because we added everything to enable this pipeline, we probably
should add it to `transformers`.
- This PR tries to limit the scope and focuses only on the pipeline part
(what should go in, and out).
- The tests are very specific for S2T and Wav2vec2 to make sure both
architectures are supported by the pipeline. We don't use the mixin for
tests right now, because that requires more work in the `pipeline`
function (will be done in a follow up PR).
- Unsure about the "helper" function `ffmpeg_read`. It makes a lot of
  sense from a user perspective, it does not add any additional
dependencies (as in hard dependency, because users can always use their
own load mechanism). Meanwhile, it feels slightly clunky to have so much
optional preprocessing.
- The pipeline is not done to support streaming audio right now.

Future work:

- Add `automatic-speech-recognition` as a `task`. And add the
FeatureExtractor.from_pretrained within `pipeline` function.
- Add small models within tests
- Add the Mixin to tests.
- Make the logic between ForCTC vs ForConditionalGeneration better.

* Update tests/test_pipelines_automatic_speech_recognition.py

Co-authored-by: Lysandre Debut <lysandre@huggingface.co>

* Adding docs + main import + type checking + LICENSE.

* Doc style !.

* Fixing TYPE_HINT.

* Specifying waveform shape in the docs.

* Adding asserts + specify in the documentation the shape of the input
np.ndarray.

* Update src/transformers/pipelines/automatic_speech_recognition.py

Co-authored-by: Patrick von Platen <patrick.v.platen@gmail.com>

* Adding require to tests + move the `feature_extractor` doc.

Co-authored-by: Lysandre Debut <lysandre@huggingface.co>
Co-authored-by: Patrick von Platen <patrick.v.platen@gmail.com>
2021-04-30 11:54:08 +02:00
CeShine Lee
76116f479b T5 Gradient Checkpointing (#11353)
* Implement gradient checkpoinging for T5Stack

* A bit more robust type checking

* Add `gradient_checkpointing` to T5Config

* Formatting

* Set requires_grad only when training

* None return value will only cause problems when training

* Change the output tuple according to `use_cache`

* Enable gradient checkpointing for the decoder

Squashed commit of the following:

commit 658bdd0bd1215353a8770f558bda2ea69a0ad0c7
Author: Ceshine Lee <shuanck@gmail.com>
Date:   Sat Apr 24 14:08:17 2021 +0800

    Only set `require_grad` for gradient checkpointing

commit acaeee6b2e675045fb28ce2176444c1d63e908bd
Author: Ceshine Lee <shuanck@gmail.com>
Date:   Sat Apr 24 13:59:35 2021 +0800

    Make gradient checkpointing work with the decoder

* Formatting
2021-04-30 14:13:55 +05:30
Manuel Romero
58c789e3d2 Update README.md (#11489)
Add link to code
2021-04-30 04:29:59 -04:00
Patrick von Platen
022a1e9e67 make style (#11520) 2021-04-30 09:54:58 +02:00
Philip May
e0db8276a6 add sp_model_kwargs to unpickle of xlm roberta tok (#11430)
add test for pickle

simplify test

fix test code style

add missing pickle import

fix test

fix test

fix test
2021-04-30 03:44:58 -04:00
Frederik Bode
b43e3f93ac correct the dimension comment of matrix multiplication (#11494)
Co-authored-by: Frederik Bode <frederik@paperbox.ai>
2021-04-30 09:42:13 +02:00
Lysandre Debut
f37f2adb68 Pin HuggingFace Hub dependency (#11502) 2021-04-30 02:57:50 -04:00
Lysandre
60d5bda4fd Patch notification service 2021-04-30 08:56:18 +02:00
Sylvain Gugger
b29eb247d3 Split checkpoint from model_name_or_path in examples (#11492)
* Split checkpoint from model_name_or_path in examples

* Address review comments

* Address review comments
2021-04-29 18:33:47 -04:00
Michael Benayoun
d6ec54ba36 solved coefficient issue for the TF version of gelu_fast (#11514)
Co-authored-by: Michael Benayoun <michael@huggingface.co>
2021-04-29 21:47:26 +02:00
Sylvain Gugger
ad1f7bef13 Reformat to make code clearer in tokenizer call (#11497)
* Reformat to make code clearer

* Reformat to make code clearer
2021-04-29 07:51:09 -04:00
Patrick von Platen
f748bd4242 [Flax] Add docstrings & model outputs (#11498)
* add attentions & hidden states

* add model outputs + docs

* finish docs

* finish tests

* finish impl

* del @

* finish

* finish

* correct test

* apply sylvains suggestions

* Update src/transformers/models/bert/modeling_flax_bert.py

Co-authored-by: Sylvain Gugger <35901082+sgugger@users.noreply.github.com>

* simplify more

Co-authored-by: Sylvain Gugger <35901082+sgugger@users.noreply.github.com>
2021-04-29 12:04:51 +02:00
Hamel Husain
3f6add8bab fix #1149 (#11493) 2021-04-28 11:16:41 -04:00
Hamel Husain
c0eb218a55 Update PreTrainedTokenizerBase to check/handle batch length for text_pair parameter (#11486)
* Update tokenization_utils_base.py

* add assertion

* check batch len

* Update src/transformers/tokenization_utils_base.py

Co-authored-by: Sylvain Gugger <35901082+sgugger@users.noreply.github.com>

* add error message

Co-authored-by: Sylvain Gugger <35901082+sgugger@users.noreply.github.com>
2021-04-28 10:11:17 -04:00
Sylvain Gugger
2d27900b5d Update min versions in README and add Flax (#11472)
* Update min versions in README and add Flax

* Adapt index
2021-04-28 09:10:06 -04:00
Suraj Patil
8d43c71a1c fix docs for decoder_input_ids (#11466)
* fix docs for decoder_input_ids

* revert the changes for bart and mbart
2021-04-27 19:36:36 +05:30
Hamel Husain
7ceff67e1a Finish Making Quick Tour respect the model object (#11467)
* finish quicktour

* fix import

* fix print

* explain config default better

* Update docs/source/quicktour.rst

Co-authored-by: Sylvain Gugger <35901082+sgugger@users.noreply.github.com>

Co-authored-by: Sylvain Gugger <35901082+sgugger@users.noreply.github.com>
2021-04-27 10:04:12 -04:00
Hamel Husain
88ac60f7b5 update QuickTour docs to reflect model output object (#11462)
* update docs to reflect model output object

* run make style`
2021-04-26 22:18:37 -04:00
Ashwin Geet D'Sa
741d48f5c7 Remove max length beam scorer (#11378)
* removed max_len

* removed max_length from BeamSearchScorer

* correct max length

* finish

* del vim

* finish & add test

Co-authored-by: Patrick von Platen <patrick.v.platen@gmail.com>
2021-04-27 00:28:40 +02:00
Stas Bekman
bc2571e61c [Deepspeed] ZeRO-Infinity integration plus config revamp (#11418)
* adding Z-inf

* revamp config process

* up version requirement

* wip

* massive rewrite

* cleanup

* cleanup

* Apply suggestions from code review

Co-authored-by: Sylvain Gugger <35901082+sgugger@users.noreply.github.com>

* consistent json commas

* act on suggestions

* leave this feature for 0.3.16

* style

Co-authored-by: Sylvain Gugger <35901082+sgugger@users.noreply.github.com>
2021-04-26 10:40:32 -07:00
Jaimeen Ahn
0661abc545 Variable Correction for Consistency in Distillation Example (#11444)
As the error comes from the inconsistency of variable meaning number of gpus in parser and its actual usage in the train.py script, 'gpus' and 'n_gpu' respectively,  the correction makes the example work
2021-04-26 13:30:48 -04:00
Bhadresh Savani
1d30ec95c7 [Examples] Fixes inconsistency around eval vs val and predict vs test (#11380)
* added changes for uniformity

* modified files

* corrected typo

* fixed qa scripts

* fix typos

* fixed predict typo in qa no trainer

* fixed test file

* reverted trainer changes

* reverted trainer changes in custom exmaples

* updated readme

* added changes in deepspeed test

* added changes for predict and eval
2021-04-26 09:24:31 -07:00
Sylvain Gugger
7959d83599 Give each test a different repo name (#11453) 2021-04-26 11:52:23 -04:00
Sylvain Gugger
b03b2a653d Style 2021-04-26 11:45:04 -04:00
Stas Bekman
ce11318e7e make sure to test against the local checkout (#11437) 2021-04-26 08:42:43 -07:00
Stas Bekman
a753cafdc0 [docs] fix invalid class name (#11438)
* fix invalid class name

* proper ref

* proper ref
2021-04-26 08:37:32 -07:00
Kostas Stathoulopoulos
6715e3b6a1 Clarify description of the is_split_into_words argument (#11449)
* Improve documentation for is_split_into_words argument

* Change description wording
2021-04-26 11:29:36 -04:00
Sylvain Gugger
ab2cabb964 Pass along seed to DistributedSampler (#11406)
* Pass along seed to DistributedSampler

* Add seed to DistributedLengthGroupedSampler
2021-04-26 10:26:52 -04:00
LSinev
b24ead87e1 fix some typos in docs, comments, logging/errors (#11432) 2021-04-26 09:14:25 -04:00
Amine Abdaoui
e3e70f9551 docs(examples): fix link to TPU launcher script (#11427) 2021-04-26 09:08:43 -04:00
Sylvain Gugger
d7633a4e46 Add basic support for FP16 in SageMaker model parallelism (#11407)
* Add FP16 support for SageMaker MP

* Add print debugs

* Squeeze

* Remove debug statements

* Add defensive check

* Typo
2021-04-26 08:55:14 -04:00
Daniel Stancl
38a716cd41 TF BART models - Add cross_attentions to model output and fix cross-attention head masking (#10699)
* Add cross_attn_head_mask to BART

* Fix cross_attentions in TFBart-like models

* This commit enables returning of `cross_attentions`
for TFBart-like models

* It also fixes attention head masking in cross-attenion module

* Update TF model templates

* Fix missing , in TF model templates

* Fix typo: congig -> config
2021-04-26 14:16:21 +02:00
Sylvain Gugger
4bd6b54fa4 Pin black to 21.4b0 2021-04-26 08:12:54 -04:00
Sylvain Gugger
c1625b3261 With style 2021-04-26 08:07:29 -04:00
Sylvain Gugger
4b72cfd958 Pin black to 20.8.b1 2021-04-26 08:06:50 -04:00
Patrick von Platen
32dbb2d954 make style (#11442) 2021-04-26 13:50:34 +02:00
Vasudev Gupta
04ab2ca639 add pooling layer support (#11439) 2021-04-26 09:05:53 +02:00
abiolaTresor
30f065890e updating the checkpoint for GPT2ForSequence Classification to one with classification head (#11434) 2021-04-26 10:28:51 +05:30
cronoik
35cd8eed88 EncoderDecoderConfigs should not create new objects (#11300)
* removes the creation of separate config objects and uses the existing ones instead+overwrite resize_token_embeddings from parent class because it is not working for the EncoderDecoderModel

* rollback to current version of the huggingface master branch

* reworked version that ties the encoder and decoder config of the parent encoderdecoder instance

* overwrite of resize_token_embeddings throws an error now

* review comment suggestion

Co-authored-by: Suraj Patil <surajp815@gmail.com>

* implemented warning in case encoderdecoder is created with differing configs of encoderdecoderconfig and decoderconfig or encoderconfig

* added test to avoid diverging configs of wrapper class and wrapped classes

* Update src/transformers/models/encoder_decoder/modeling_encoder_decoder.py

* make style

Co-authored-by: Suraj Patil <surajp815@gmail.com>
Co-authored-by: Patrick von Platen <patrick.v.platen@gmail.com>
2021-04-25 11:45:46 +02:00
Daniel Stancl
f45cb66bf6 Add head_mask, decoder_head_mask, cross_head_mask to ProphetNet (#9964)
* Add head_mask & decoder_head_mask + some corrections

* Fix head masking for N-grams

* Enable test_headmasking for encoder and decod

* Fix one typo regarding in modeling_propgetnet.py

* Enable test_headmasking for ProphetNetStandaloneDecoderModelTest
and ProphetNetStandaloneEncoderModelTest in test_modeling_prophetnet.py

* make style

* Fix cross_head_mask

* Fix attention head mask naming

* `cross_head_mask` -> `cross_attn_head_mask`

* `cross_layer_head_mask` -> `cross_attn_layer_head_mask`

* Still need to merge #10605 to master to pass the tests
2021-04-25 11:06:16 +02:00
Sylvain Gugger
52166f672e Style 2021-04-23 20:40:17 -04:00
cronoik
9cac4fab07 documentation linked to the parent class PreTrainedTokenizerFast but it should be the slow tokenizer (#11410) 2021-04-23 20:19:15 -04:00
Sylvain Gugger
b7fc043fce Merge branch 'master' of github.com:huggingface/transformers 2021-04-23 18:47:55 -04:00
Sylvain Gugger
81a6c7cd39 Use 3 workers for torch tests 2021-04-23 18:47:46 -04:00
Philip May
195bfd118a Enable option for subword regularization in XLMRobertaTokenizer (#11149)
* enable subword regularization.

* fix tokenizer storage

* fix docstring formatting

* Update src/transformers/models/xlm_roberta/tokenization_xlm_roberta.py

Co-authored-by: Stefan Schweter <stefan@schweter.it>

* fix docstring formatting

* add test for subword regularization tokenizer

* improve comments of test

* add sp_model_kwargs

* reformat docstring to match the style

* add some more documentation

* Update src/transformers/models/xlm_roberta/tokenization_xlm_roberta.py

Co-authored-by: Sylvain Gugger <35901082+sgugger@users.noreply.github.com>

* improve docstring

* empty commit to trigger CI

* Update src/transformers/models/xlm_roberta/tokenization_xlm_roberta.py

Co-authored-by: Sylvain Gugger <35901082+sgugger@users.noreply.github.com>

* fix docstring formatting for sphinx

Co-authored-by: Stefan Schweter <stefan@schweter.it>
Co-authored-by: Sylvain Gugger <35901082+sgugger@users.noreply.github.com>
2021-04-23 17:52:31 -04:00
Sylvain Gugger
1ef152eb48 Default to accuracy metric (#11405) 2021-04-23 14:49:59 -04:00
Daniel Stancl
e3ff165aa5 Fix cross-attention head mask for Torch encoder-decoder models (#10605)
* Fix cross-attention head mask for Torch BART models

* Fix head masking for cross-attention module for the following
models: BART, Blenderbot, Blenderbot_small, M2M_100, Marian, MBart,
Pegasus

* Enable test_headmasking for M2M_100 model

* Fix cross_head_mask for FSMT, LED and T5

* This commit fixes `head_mask` for cross-attention modules
in the following models: FSMT, LED, T5

* It also contains some smaller changes in doc so that
it is be perfectly clear the shape of `cross_head_mask`
is the same as of `decoder_head_mask`

* Update template

* Fix template for BartForCausalLM

* Fix cross_head_mask for Speech2Text models

* Fix cross_head_mask in templates

* Fix args order in BartForCausalLM template

* Fix doc in BART templates

* Make more explicit naming

* `cross_head_mask` -> `cross_attn_head_mask`

* `cross_layer_head_mask` -> `cross_attn_layer_head_mask`

* Fix doc

* make style quality

* Fix speech2text docstring
2021-04-23 18:58:06 +02:00
Sylvain Gugger
ca6b80cadb Wrong branch Sylvain... 2021-04-23 12:46:54 -04:00
Sylvain Gugger
3951fc55ee Try to trigger failure more 2021-04-23 12:44:54 -04:00
Sylvain Gugger
bd41a0f74d Style 2021-04-23 12:32:37 -04:00
Nicola De Cao
1811883e80 Fixing bug in generation (#11297)
When passing `inputs_embeds` and not `input_ids=None` the generation function fails because `input_ids` is created but the function but it should not.
2021-04-23 18:24:26 +02:00
Kiran R
5c00918681 added support for exporting of t5 to onnx with past_key_values (#10651) 2021-04-23 18:14:20 +02:00
Patrick von Platen
50f4539b82 push (#11400) 2021-04-23 15:36:27 +02:00
Sylvain Gugger
bf2e0cf70b Trainer push to hub (#11328)
* Initial support for upload to hub

* push -> upload

* Fixes + examples

* Fix torchhub test

* Torchhub test I hate you

* push_model_to_hub -> push_to_hub

* Apply mixin to other pretrained models

* Remove ABC inheritance

* Add tests

* Typo

* Run tests

* Install git-lfs

* Change approach

* Add push_to_hub to all

* Staging test suite

* Typo

* Maybe like this?

* More deps

* Cache

* Adapt name

* Quality

* MOAR tests

* Put it in testing_utils

* Docs + torchhub last hope

* Styling

* Wrong method

* Typos

* Update src/transformers/file_utils.py

Co-authored-by: Julien Chaumond <julien@huggingface.co>

* Address review comments

* Apply suggestions from code review

Co-authored-by: Patrick von Platen <patrick.v.platen@gmail.com>

Co-authored-by: Julien Chaumond <julien@huggingface.co>
Co-authored-by: Patrick von Platen <patrick.v.platen@gmail.com>
2021-04-23 09:17:37 -04:00
Teven
7bc86bea68 Fixed trainer total_flos relaoding in distributed mode (#11383)
* Fixed trainer total_flos relaoding in distributed mode

* logging flos at the end of training
2021-04-23 07:53:33 -04:00
Patrick von Platen
74e84f1fa6 make blenderbot test slow (#11395) 2021-04-23 07:49:09 -04:00
Yoshitomo Matsubara
c3d6f33918 fixed typos (#11391) 2021-04-23 07:48:42 -04:00
Max Del
a90d3f1862 Fix typo in text (#11396) 2021-04-23 07:37:19 -04:00
Patrick von Platen
2dc2d79ac7 correct conversion (#11394) 2021-04-23 11:59:34 +02:00
Patrick von Platen
b48cf7124c correct typo (#11393) 2021-04-23 11:34:59 +02:00
Patrick von Platen
8c9b5fcbaf [Flax] Big FlaxBert Refactor (#11364)
* improve flax

* refactor

* typos

* Update src/transformers/modeling_flax_utils.py

* Apply suggestions from code review

* Update src/transformers/modeling_flax_utils.py

* fix typo

* improve error tolerance

* typo

* correct nasty saving bug

* fix from pretrained

* correct tree map

* add note

* correct weight tying
2021-04-23 09:53:09 +02:00
Sylvain Gugger
3ed5e97ba0 Fix Trainer with remove_unused_columns=False (#11382)
* Fix Trainer with remove_unused_columns=False

* Typo
2021-04-22 11:16:24 -04:00
PenutChen
0f3ad1507e Fix typo (#11369) 2021-04-22 10:10:16 -04:00
Matt
2617396094 Correctly cast num_train_epochs to int (#11379) 2021-04-22 13:49:59 +01:00
Takuya Makino
881945c0b5 Add space (#11373) 2021-04-22 17:48:58 +05:30
johnson7788
5b5e4ca366 [run_translation.py] fix typo (#11372)
fix typo

Co-authored-by: johnson <johnson@github.com>
2021-04-22 17:47:11 +05:30
Patrick von Platen
58d8795d74 [Flax] Correct typo (#11374)
* finish

* fix copy
2021-04-22 13:11:44 +02:00
Patrick von Platen
880154d2e1 [Wav2Vec2] Fix special tokens for Wav2Vec2 tokenizer (#11349)
* fix wav2vec2 tok

* up
2021-04-22 12:23:08 +02:00
Sylvain Gugger
6f14eab50b Add in torchhub 2021-04-21 19:17:29 -04:00
Sylvain Gugger
ff26f8ee3a Add huggingface_hub dep for #11328 2021-04-21 19:12:58 -04:00
wlhgtc
5e04d70868 Fix token_type_ids error for big_bird model. (#11355)
* MOD: fit chinese wwm to new datasets

* MOD: move wwm to new folder

* MOD: formate code

* Styling

* MOD add param and recover trainer

* MOD: add token_type_ids method for big bird

* MOD: format code

* MOD: format code

Co-authored-by: Sylvain Gugger <sylvain.gugger@gmail.com>
2021-04-21 19:37:57 +02:00
Stas Bekman
5aaf5aac0b [contributing doc] explain/link to good first issue (#11346)
* explain/link to good first issue

* Apply suggestions from code review

Co-authored-by: Sylvain Gugger <35901082+sgugger@users.noreply.github.com>

Co-authored-by: Sylvain Gugger <35901082+sgugger@users.noreply.github.com>
2021-04-21 10:10:11 -07:00
Matt
6fe79e57d7 Move old TF text classification script to legacy (#11361)
And update README to explain the work-in-progress!
2021-04-21 17:36:18 +01:00
Patrick von Platen
50595a3336 Remove boiler plate code (#11340)
* remove boiler plate code

* adapt roberta

* correct docs

* finish refactor
2021-04-21 18:34:38 +02:00
Matt
ac588594e2 Merge new TF example script (#11360)
First of the new and more idiomatic TF examples!
2021-04-21 17:04:55 +01:00
Stas Bekman
9f72e8f4e1 [testing doc] bring doc up to date (#11359)
* bring doc up to date

* fix
2021-04-21 08:51:00 -07:00
lewtun
41f3133a3a Extract metric_key_prefix during NotebookProgressCallback.on_evaluate (#11347)
* Pass metric_key_prefix as kwarg to on_evaluate

* Replace eval_loss with metric_key_prefix_loss

* Default to "eval" if metric_key_prefix not in kwargs

* Add kwargs to CallbackHandler.on_evaluate signature

* Revert "Add kwargs to CallbackHandler.on_evaluate signature"

This reverts commit 8d4c85ed512f558f7579d36771e907b3379947b7.

* Revert "Pass metric_key_prefix as kwarg to on_evaluate"

This reverts commit 7766bfe2718601230ae593d37b1317bd53cfc075.

* Extract metric_key_prefix from metrics
2021-04-21 11:12:09 -04:00
Sylvain Gugger
dabeb15292 Examples reorg (#11350)
* Base move

* Examples reorganization

* Update references

* Put back test data

* Move conftest

* More fixes

* Move test data to test fixtures

* Update path

* Apply suggestions from code review

Co-authored-by: Lysandre Debut <lysandre@huggingface.co>

* Address review comments and clean

Co-authored-by: Lysandre Debut <lysandre@huggingface.co>
2021-04-21 11:11:20 -04:00
Stas Bekman
ca7ff64f5b [deepspeed] fix resume from checkpoint (#11352)
This PR fixes a bug that most likely somehow got exposed (not caused) by https://github.com/huggingface/transformers/pull/11318 - surprisingly the same test worked just fine before that other PR.
2021-04-21 07:48:15 -07:00
Sylvain Gugger
74712e22f3 Honor contributors to models (#11329)
* Honor contributors to models

* Fix typo

* Address review comments

* Add more authors
2021-04-21 09:47:27 -04:00
Nicolas Patry
aad95c7cde Removed max_length from being mandatory within generate. (#11314)
* Removed `max_length` from being mandatory within `generate`.

- Moving on to fully using `StoppingCriteria` for `greedy` and `sample`
modes.
- `max_length` still used for `beam_search` and `group_beam_search`
(Follow up PR)
- Fixes a bug with MaxLengthStoppingCriteria (we should stop as soon a
we hit the max_length, the comparison needs to be or equal, that affects
the tests).
- Added options to use `logits_processor` and `stopping_criteria`
directly within `generate` function (so some users can define their own
`logits_processor` and `stopping_criteria`).
- Modified the backward compat tests to make sure we issue a warning.

* Fix `max_length` argument in `generate`.

* Moving validate to being functional.

- Renamed `smax_length` to `stoppping_max_length`.

* Removing `logits_processor` and `stopping_criteria` from `generate`
arguments.

* Deepcopy.

* Fix global variable name.
2021-04-21 11:56:45 +02:00
Yusuke Mori
95dab34d55 Add an error message that fires when Reformer is not in training mode, but one runs .backward() (#11117) 2021-04-21 00:23:37 +02:00
Sylvain Gugger
f1b938fda8 Update to use datasets remove_cloumns method (#11343)
* Update to use datasets remove_cloumns method

* Quality
2021-04-20 14:12:01 -04:00
Suraj Patil
cfd2eaa8cf [GPTNeo] create local attention mask ones (#11335)
* create local attention mask ones

* remove old method, address patricks comment
2021-04-20 18:37:44 +05:30
Patrick von Platen
f464f10a2c [Generate] Remove outdated code (#11331)
* remove update function

* update

* refactor more

* refactor
2021-04-20 15:16:02 +03:00
rajvi-k
bfd83c17a7 Added translation example script (#11196)
* initial changes

* modified evaluation

* updated evaluation

* updated evaluation on text translation example script

* added translation example script

* Formatted translation example script

* Reformatted translation example

* Fixed evaluation bug and added support for other tokenisers

* Fixed evaluation bug and added support for other tokenisers

* Added translation example script

* Formatted summarization example script

* Removed typos from summarization example script
2021-04-20 07:18:47 -04:00
Sylvain Gugger
c0328a6c26 Load checkpoint without re-creating the model (#11318) 2021-04-19 20:31:29 -04:00
Sylvain Gugger
95037a169f [Trainer] Add a progress bar for batches skipped (#11324) 2021-04-19 19:04:52 -04:00
Stas Bekman
95ffbe1686 [Trainer] fix the placement on device with fp16_full_eval (#11322)
* fix the placement on device with fp16_full_eval

* deepspeed never goes on device
2021-04-19 11:55:33 -07:00
TAE YOUNGDON
3981ce3dd2 modify double considering special tokens in language_modeling.py (#11275)
* Update language_modeling.py

in "class TextDatasetForNextSentencePrediction(Dataset)", double considering "self.tokenizer.num_special_tokens_to_add(pair=True)" 

so, i remove self.block_size, and add parameter for "def create_examples_from_document". like "class LineByLineWithSOPTextDataset" do

* Update language_modeling.py
2021-04-19 11:24:43 -04:00
e
5a34d8d982 move device statements outside if statements (#11292) 2021-04-19 08:25:40 -04:00
499 changed files with 29584 additions and 5038 deletions

View File

@@ -145,7 +145,7 @@ jobs:
key: v0.4-torch-{{ checksum "setup.py" }}
paths:
- '~/.cache/pip'
- run: python -m pytest -n 4 --dist=loadfile -s --make-reports=tests_torch ./tests/ | tee tests_output.txt
- run: python -m pytest -n 3 --dist=loadfile -s --make-reports=tests_torch ./tests/ | tee tests_output.txt
- store_artifacts:
path: ~/transformers/tests_output.txt
- store_artifacts:
@@ -306,35 +306,44 @@ jobs:
- v0.4-{{ checksum "setup.py" }}
- run: pip install --upgrade pip
- run: pip install .[sklearn,torch,sentencepiece,testing]
- run: pip install -r examples/_tests_requirements.txt
- run: pip install -r examples/pytorch/_tests_requirements.txt
- save_cache:
key: v0.4-torch_examples-{{ checksum "setup.py" }}
paths:
- '~/.cache/pip'
- run: TRANSFORMERS_IS_CI=1 python -m pytest -n 8 --dist=loadfile -s --make-reports=examples_torch ./examples/ | tee examples_output.txt
- run: TRANSFORMERS_IS_CI=1 python -m pytest -n 8 --dist=loadfile -s --make-reports=examples_torch ./examples/pytorch/ | tee examples_output.txt
- store_artifacts:
path: ~/transformers/examples_output.txt
- store_artifacts:
path: ~/transformers/reports
run_tests_git_lfs:
run_tests_hub:
working_directory: ~/transformers
docker:
- image: circleci/python:3.7
environment:
HUGGINGFACE_CO_STAGING: yes
RUN_GIT_LFS_TESTS: yes
TRANSFORMERS_IS_CI: yes
resource_class: xlarge
parallelism: 1
steps:
- checkout
- restore_cache:
keys:
- v0.4-hub-{{ checksum "setup.py" }}
- v0.4-{{ checksum "setup.py" }}
- run: sudo apt-get install git-lfs
- run: |
git config --global user.email "ci@dummy.com"
git config --global user.name "ci"
- run: pip install --upgrade pip
- run: pip install .[testing]
- run: python -m pytest -sv ./tests/test_hf_api.py -k "HfLargefilesTest"
- run: pip install .[torch,sentencepiece,testing]
- save_cache:
key: v0.4-hub-{{ checksum "setup.py" }}
paths:
- '~/.cache/pip'
- run: python -m pytest -sv ./tests/ -m is_staging_test
build_doc:
working_directory: ~/transformers
@@ -382,6 +391,8 @@ jobs:
docker:
- image: circleci/python:3.6
resource_class: medium
environment:
TRANSFORMERS_IS_CI: yes
parallelism: 1
steps:
- checkout
@@ -469,7 +480,7 @@ workflows:
- run_tests_flax
- run_tests_pipelines_torch
- run_tests_pipelines_tf
- run_tests_git_lfs
- run_tests_hub
- build_doc
- deploy_doc: *workflow_filters
# tpu_testing_jobs:

View File

@@ -16,6 +16,8 @@ requirements:
- pip
- numpy >=1.17
- dataclasses
- importlib_metadata
- huggingface_hub
- packaging
- filelock
- requests
@@ -28,6 +30,8 @@ requirements:
- python
- numpy >=1.17
- dataclasses
- importlib_metadata
- huggingface_hub
- packaging
- filelock
- requests

View File

@@ -1,6 +1,9 @@
name: Model templates runner
on:
push:
branches:
- master
pull_request:
paths:
- "src/**"
@@ -34,6 +37,7 @@ jobs:
- name: Install dependencies
run: |
pip install --upgrade pip
sudo apt -y update && sudo apt install -y libsndfile1-dev
pip install .[dev]
- name: Create model files
run: |
@@ -46,6 +50,7 @@ jobs:
make style
python utils/check_table.py --fix_and_overwrite
python utils/check_dummies.py --fix_and_overwrite
python utils/check_copies.py --fix_and_overwrite
- name: Run all non-slow tests
run: |

View File

@@ -24,6 +24,7 @@ jobs:
with:
auto-update-conda: true
auto-activate-base: false
python-version: 3.8
activate-environment: "build-transformers"
channels: huggingface

View File

@@ -63,6 +63,7 @@ jobs:
run_tests_tf_gpu:
runs-on: [self-hosted, docker-gpu, single-gpu]
timeout-minutes: 120
container:
image: tensorflow/tensorflow:2.4.1-gpu
options: --gpus 0 --shm-size "16gb" --ipc host -v /mnt/cache/.cache/huggingface:/mnt/cache/
@@ -89,7 +90,7 @@ jobs:
TF_NUM_INTRAOP_THREADS: 8
TF_NUM_INTEROP_THREADS: 1
run: |
python -m pytest -n 2 --dist=loadfile --make-reports=tests_tf_gpu tests
python -m pytest -n 1 --dist=loadfile --make-reports=tests_tf_gpu tests
- name: Failure short reports
if: ${{ always() }}
@@ -148,6 +149,7 @@ jobs:
run_tests_tf_multi_gpu:
runs-on: [self-hosted, docker-gpu, multi-gpu]
timeout-minutes: 120
container:
image: tensorflow/tensorflow:2.4.1-gpu
options: --gpus all --shm-size "16gb" --ipc host -v /mnt/cache/.cache/huggingface:/mnt/cache/
@@ -174,7 +176,7 @@ jobs:
TF_NUM_INTRAOP_THREADS: 8
TF_NUM_INTEROP_THREADS: 1
run: |
python -m pytest -n 2 --dist=loadfile --make-reports=tests_tf_multi_gpu tests
python -m pytest -n 1 --dist=loadfile --make-reports=tests_tf_multi_gpu tests
- name: Failure short reports
if: ${{ always() }}
@@ -202,6 +204,7 @@ jobs:
- name: Install dependencies
run: |
apt -y update && apt install -y libaio-dev
pip install --upgrade pip
pip install .[testing,deepspeed]
@@ -242,6 +245,7 @@ jobs:
- name: Install dependencies
run: |
apt -y update && apt install -y libaio-dev
pip install --upgrade pip
pip install .[testing,deepspeed,fairscale]
@@ -292,4 +296,4 @@ jobs:
run: |
pip install slack_sdk
python utils/notification_service.py push
python utils/notification_service.py push

View File

@@ -59,7 +59,7 @@ jobs:
HF_HOME: /mnt/cache
TRANSFORMERS_IS_CI: yes
run: |
pip install -r examples/_tests_requirements.txt
pip install -r examples/pytorch/_tests_requirements.txt
python -m pytest -n 1 --dist=loadfile --make-reports=examples_torch_gpu examples
- name: Failure short reports
@@ -261,6 +261,7 @@ jobs:
- name: Install dependencies
run: |
apt -y update && apt install -y libaio-dev
pip install --upgrade pip
pip install .[testing,deepspeed]
@@ -301,6 +302,7 @@ jobs:
- name: Install dependencies
run: |
apt -y update && apt install -y libaio-dev
pip install --upgrade pip
pip install .[testing,deepspeed,fairscale]

View File

@@ -36,6 +36,13 @@ There are 4 ways you can contribute to transformers:
* Contributing to the examples or to the documentation;
* Submitting issues related to bugs or desired new features.
In particular there is a special [Good First
Issue](https://github.com/huggingface/transformers/contribute) listing. Tt will give you a list of
open Issues that are open to anybody to work on. Just comment in the issue that you'd like to work
on it. In that same listing you will also find some Issues with `Good Second Issue` label. These are
typically slightly more complicated than the Issues with just `Good First Issue` label. But if you
feel you know what you're doing, go for it.
*All are equally valuable to the community.*
## Submitting a new issue or feature request
@@ -46,7 +53,7 @@ feedback.
### Did you find a bug?
The transformers are robust and reliable thanks to the users who notify us of
The 🤗 Transformers library is robust and reliable thanks to the users who notify us of
the problems they encounter. So thank you for reporting an issue.
First, we would really appreciate it if you could **make sure the bug was not
@@ -285,7 +292,7 @@ $ python -m pytest -n auto --dist=loadfile -s -v ./tests/
and for the examples:
```bash
$ pip install -r examples/requirements.txt # only needed the first time
$ pip install -r examples/xxx/requirements.txt # only needed the first time
$ python -m pytest -n auto --dist=loadfile -s -v ./examples/
```
In fact, that's how `make test` and `make test-examples` are implemented (sans the `pip install` line)!
@@ -343,7 +350,7 @@ You can now use `make` from any terminal (Powershell, cmd.exe, etc) 🎉
### Syncing forked master with upstream (HuggingFace) master
To avoid pinging the upstream repository which adds reference notes to each upstream PR and sends unnessary notifications to the developers involved in these PRs,
To avoid pinging the upstream repository which adds reference notes to each upstream PR and sends unnessary notifications to the developers involved in these PRs,
when syncing the master branch of a forked repository, please, follow these steps:
1. When possible, avoid syncing with the upstream using a branch and PR on the forked repository. Instead merge directly into the forked master.
2. If a PR is absolutely necessary, use the following steps after checking out your branch:

View File

@@ -1,5 +1,7 @@
.PHONY: deps_table_update modified_only_fixup extra_quality_checks quality style fixup fix-copies test test-examples docs
# make sure to test the local checkout in scripts and not the pre-installed one (don't use quotes!)
export PYTHONPATH = src
check_dirs := examples tests src utils
@@ -73,7 +75,7 @@ test:
# Run tests for examples
test-examples:
python -m pytest -n auto --dist=loadfile -s -v ./examples/
python -m pytest -n auto --dist=loadfile -s -v ./examples/pytorch/
# Run tests for SageMaker DLC release

View File

@@ -38,18 +38,18 @@ limitations under the License.
</p>
<h3 align="center">
<p>State-of-the-art Natural Language Processing for PyTorch and TensorFlow 2.0
<p>State-of-the-art Natural Language Processing for Jax, PyTorch and TensorFlow
</h3>
🤗 Transformers provides thousands of pretrained models to perform tasks on texts such as classification, information extraction, question answering, summarization, translation, text generation, etc in 100+ languages. Its aim is to make cutting-edge NLP easier to use for everyone.
🤗 Transformers provides thousands of pretrained models to perform tasks on texts such as classification, information extraction, question answering, summarization, translation, text generation and more in over 100 languages. Its aim is to make cutting-edge NLP easier to use for everyone.
🤗 Transformers provides APIs to quickly download and use those pretrained models on a given text, fine-tune them on your own datasets then share them with the community on our [model hub](https://huggingface.co/models). At the same time, each python module defining an architecture can be used as a standalone and modified to enable quick research experiments.
🤗 Transformers provides APIs to quickly download and use those pretrained models on a given text, fine-tune them on your own datasets and then share them with the community on our [model hub](https://huggingface.co/models). At the same time, each python module defining an architecture is fully standalone and can be modified to enable quick research experiments.
🤗 Transformers is backed by the two most popular deep learning libraries, [PyTorch](https://pytorch.org/) and [TensorFlow](https://www.tensorflow.org/), with a seamless integration between them, allowing you to train your models with one then load it for inference with the other.
🤗 Transformers is backed by the three most popular deep learning libraries — [Jax](https://jax.readthedocs.io/en/latest/), [PyTorch](https://pytorch.org/) and [TensorFlow](https://www.tensorflow.org/) with a seamless integration between them. It's straightforward to train your models with one before loading them for inference with the other.
## Online demos
You can test most of our models directly on their pages from the [model hub](https://huggingface.co/models). We also offer [private model hosting, versioning, & an inference API](https://huggingface.co/pricing) to use those models.
You can test most of our models directly on their pages from the [model hub](https://huggingface.co/models). We also offer [private model hosting, versioning, & an inference API](https://huggingface.co/pricing) for public and private models.
Here are a few examples:
- [Masked word completion with BERT](https://huggingface.co/bert-base-uncased?text=Paris+is+the+%5BMASK%5D+of+France)
@@ -64,20 +64,20 @@ Here are a few examples:
## Quick tour
To immediately use a model on a given text, we provide the `pipeline` API. Pipelines group together a pretrained model with the preprocessing that was used during that model training. Here is how to quickly use a pipeline to classify positive versus negative texts
To immediately use a model on a given text, we provide the `pipeline` API. Pipelines group together a pretrained model with the preprocessing that was used during that model's training. Here is how to quickly use a pipeline to classify positive versus negative texts:
```python
>>> from transformers import pipeline
# Allocate a pipeline for sentiment-analysis
>>> classifier = pipeline('sentiment-analysis')
>>> classifier('We are very happy to include pipeline into the transformers repository.')
[{'label': 'POSITIVE', 'score': 0.9978193640708923}]
>>> classifier('We are very happy to introduce pipeline to the transformers repository.')
[{'label': 'POSITIVE', 'score': 0.9996980428695679}]
```
The second line of code downloads and caches the pretrained model used by the pipeline, the third line evaluates it on the given text. Here the answer is "positive" with a confidence of 99.8%.
The second line of code downloads and caches the pretrained model used by the pipeline, while the third evaluates it on the given text. Here the answer is "positive" with a confidence of 99.97%.
This is another example of pipeline used for that can extract question answers from some context:
Many NLP tasks have a pre-trained `pipeline` ready to go. For example, we can easily extract question answers given context:
``` python
>>> from transformers import pipeline
@@ -86,15 +86,15 @@ This is another example of pipeline used for that can extract question answers f
>>> question_answerer = pipeline('question-answering')
>>> question_answerer({
... 'question': 'What is the name of the repository ?',
... 'context': 'Pipeline have been included in the huggingface/transformers repository'
... 'context': 'Pipeline has been included in the huggingface/transformers repository'
... })
{'score': 0.5135612454720828, 'start': 35, 'end': 59, 'answer': 'huggingface/transformers'}
{'score': 0.30970096588134766, 'start': 34, 'end': 58, 'answer': 'huggingface/transformers'}
```
On top of the answer, the pretrained model used here returned its confidence score, along with the start position and its end position in the tokenized sentence. You can learn more about the tasks supported by the `pipeline` API in [this tutorial](https://huggingface.co/transformers/task_summary.html).
In addition to the answer, the pretrained model used here returned its confidence score, along with the start position and end position of the answer in the tokenized sentence. You can learn more about the tasks supported by the `pipeline` API in [this tutorial](https://huggingface.co/transformers/task_summary.html).
To download and use any of the pretrained models on your given task, you just need to use those three lines of codes (PyTorch version):
To download and use any of the pretrained models on your given task, all it takes is three lines of code. Here is the PyTorch version:
```python
>>> from transformers import AutoTokenizer, AutoModel
@@ -104,7 +104,7 @@ To download and use any of the pretrained models on your given task, you just ne
>>> inputs = tokenizer("Hello world!", return_tensors="pt")
>>> outputs = model(**inputs)
```
or for TensorFlow:
And here is the equivalent code for TensorFlow:
```python
>>> from transformers import AutoTokenizer, TFAutoModel
@@ -115,9 +115,9 @@ or for TensorFlow:
>>> outputs = model(**inputs)
```
The tokenizer is responsible for all the preprocessing the pretrained model expects, and can be called directly on one (or list) of texts (as we can see on the fourth line of both code examples). It will output a dictionary you can directly pass to your model (which is done on the fifth line).
The tokenizer is responsible for all the preprocessing the pretrained model expects, and can be called directly on a single string (as in the above examples) or a list. It will output a dictionary that you can use in downstream code or simply directly pass to your model using the ** argument unpacking operator.
The model itself is a regular [Pytorch `nn.Module`](https://pytorch.org/docs/stable/nn.html#torch.nn.Module) or a [TensorFlow `tf.keras.Model`](https://www.tensorflow.org/api_docs/python/tf/keras/Model) (depending on your backend) which you can use normally. For instance, [this tutorial](https://huggingface.co/transformers/training.html) explains how to integrate such a model in classic PyTorch or TensorFlow training loop, or how to use our `Trainer` API to quickly fine-tune the on a new dataset.
The model itself is a regular [Pytorch `nn.Module`](https://pytorch.org/docs/stable/nn.html#torch.nn.Module) or a [TensorFlow `tf.keras.Model`](https://www.tensorflow.org/api_docs/python/tf/keras/Model) (depending on your backend) which you can use normally. [This tutorial](https://huggingface.co/transformers/training.html) explains how to integrate such a model into a classic PyTorch or TensorFlow training loop, or how to use our `Trainer` API to quickly fine-tune on a new dataset.
## Why should I use transformers?
@@ -135,16 +135,16 @@ The model itself is a regular [Pytorch `nn.Module`](https://pytorch.org/docs/sta
1. Choose the right framework for every part of a model's lifetime:
- Train state-of-the-art models in 3 lines of code.
- Move a single model between TF2.0/PyTorch frameworks at will.
- Seamlessly pick the right framework for training, evaluation, production.
- Seamlessly pick the right framework for training, evaluation and production.
1. Easily customize a model or an example to your needs:
- Examples for each architecture to reproduce the results by the official authors of said architecture.
- Expose the models internal as consistently as possible.
- We provide examples for each architecture to reproduce the results published by its original authors.
- Model internals are exposed as consistently as possible.
- Model files can be used independently of the library for quick experiments.
## Why shouldn't I use transformers?
- This library is not a modular toolbox of building blocks for neural nets. The code in the model files is not refactored with additional abstractions on purpose, so that researchers can quickly iterate on each of the models without diving in additional abstractions/files.
- This library is not a modular toolbox of building blocks for neural nets. The code in the model files is not refactored with additional abstractions on purpose, so that researchers can quickly iterate on each of the models without diving into additional abstractions/files.
- The training API is not intended to work on any model but is optimized to work with the models provided by the library. For generic machine learning loops, you should use another library.
- While we strive to present as many use cases as possible, the scripts in our [examples folder](https://github.com/huggingface/transformers/tree/master/examples) are just that: examples. It is expected that they won't work out-of-the box on your specific problem and that you will be required to change a few lines of code to adapt them to your needs.
@@ -152,16 +152,16 @@ The model itself is a regular [Pytorch `nn.Module`](https://pytorch.org/docs/sta
### With pip
This repository is tested on Python 3.6+, PyTorch 1.0.0+ (PyTorch 1.3.1+ for [examples](https://github.com/huggingface/transformers/tree/master/examples)) and TensorFlow 2.0.
This repository is tested on Python 3.6+, Flax 0.3.2+, PyTorch 1.3.1+ and TensorFlow 2.3+.
You should install 🤗 Transformers in a [virtual environment](https://docs.python.org/3/library/venv.html). If you're unfamiliar with Python virtual environments, check out the [user guide](https://packaging.python.org/guides/installing-using-pip-and-virtual-environments/).
First, create a virtual environment with the version of Python you're going to use and activate it.
Then, you will need to install at least one of TensorFlow 2.0, PyTorch or Flax.
Please refer to [TensorFlow installation page](https://www.tensorflow.org/install/pip#tensorflow-2.0-rc-is-available), [PyTorch installation page](https://pytorch.org/get-started/locally/#start-locally) regarding the specific install command for your platform and/or [Flax installation page](https://github.com/google/flax#quick-install).
Then, you will need to install at least one of Flax, PyTorch or TensorFlow.
Please refer to [TensorFlow installation page](https://www.tensorflow.org/install/), [PyTorch installation page](https://pytorch.org/get-started/locally/#start-locally) and/or [Flax installation page](https://github.com/google/flax#quick-install) regarding the specific install command for your platform.
When TensorFlow 2.0 and/or PyTorch has been installed, 🤗 Transformers can be installed using pip as follows:
When one of those backends has been installed, 🤗 Transformers can be installed using pip as follows:
```bash
pip install transformers
@@ -179,9 +179,9 @@ Since Transformers version v4.0.0, we now have a conda channel: `huggingface`.
conda install -c huggingface transformers
```
Follow the installation pages of TensorFlow, PyTorch or Flax to see how to install them with conda.
Follow the installation pages of Flax, PyTorch or TensorFlow to see how to install them with conda.
## Models architectures
## Model architectures
**[All the model checkpoints](https://huggingface.co/models)** provided by 🤗 Transformers are seamlessly integrated from the huggingface.co [model hub](https://huggingface.co) where they are uploaded directly by [users](https://huggingface.co/users) and [organizations](https://huggingface.co/organizations).
@@ -195,10 +195,12 @@ Current number of checkpoints: ![](https://img.shields.io/endpoint?url=https://h
1. **[BERT](https://huggingface.co/transformers/model_doc/bert.html)** (from Google) released with the paper [BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding](https://arxiv.org/abs/1810.04805) by Jacob Devlin, Ming-Wei Chang, Kenton Lee and Kristina Toutanova.
1. **[BERT For Sequence Generation](https://huggingface.co/transformers/model_doc/bertgeneration.html)** (from Google) released with the paper [Leveraging Pre-trained Checkpoints for Sequence Generation Tasks](https://arxiv.org/abs/1907.12461) by Sascha Rothe, Shashi Narayan, Aliaksei Severyn.
1. **[BigBird-RoBERTa](https://huggingface.co/transformers/model_doc/bigbird.html)** (from Google Research) released with the paper [Big Bird: Transformers for Longer Sequences](https://arxiv.org/abs/2007.14062) by Manzil Zaheer, Guru Guruganesh, Avinava Dubey, Joshua Ainslie, Chris Alberti, Santiago Ontanon, Philip Pham, Anirudh Ravula, Qifan Wang, Li Yang, Amr Ahmed.
1. **[BigBird-Pegasus](https://huggingface.co/transformers/model_doc/bigbird_pegasus.html)** (from Google Research) released with the paper [Big Bird: Transformers for Longer Sequences](https://arxiv.org/abs/2007.14062) by Manzil Zaheer, Guru Guruganesh, Avinava Dubey, Joshua Ainslie, Chris Alberti, Santiago Ontanon, Philip Pham, Anirudh Ravula, Qifan Wang, Li Yang, Amr Ahmed.
1. **[Blenderbot](https://huggingface.co/transformers/model_doc/blenderbot.html)** (from Facebook) released with the paper [Recipes for building an open-domain chatbot](https://arxiv.org/abs/2004.13637) by Stephen Roller, Emily Dinan, Naman Goyal, Da Ju, Mary Williamson, Yinhan Liu, Jing Xu, Myle Ott, Kurt Shuster, Eric M. Smith, Y-Lan Boureau, Jason Weston.
1. **[BlenderbotSmall](https://huggingface.co/transformers/model_doc/blenderbot_small.html)** (from Facebook) released with the paper [Recipes for building an open-domain chatbot](https://arxiv.org/abs/2004.13637) by Stephen Roller, Emily Dinan, Naman Goyal, Da Ju, Mary Williamson, Yinhan Liu, Jing Xu, Myle Ott, Kurt Shuster, Eric M. Smith, Y-Lan Boureau, Jason Weston.
1. **[BORT](https://huggingface.co/transformers/model_doc/bort.html)** (from Alexa) released with the paper [Optimal Subarchitecture Extraction For BERT](https://arxiv.org/abs/2010.10499) by Adrian de Wynter and Daniel J. Perry.
1. **[CamemBERT](https://huggingface.co/transformers/model_doc/camembert.html)** (from Inria/Facebook/Sorbonne) released with the paper [CamemBERT: a Tasty French Language Model](https://arxiv.org/abs/1911.03894) by Louis Martin*, Benjamin Muller*, Pedro Javier Ortiz Suárez*, Yoann Dupont, Laurent Romary, Éric Villemonte de la Clergerie, Djamé Seddah and Benoît Sagot.
1. **[CLIP](https://huggingface.co/transformers/model_doc/clip.html)** from (OpenAI) released with the paper [Learning Transferable Visual Models From Natural Language Supervision](https://arxiv.org/abs/2103.00020) by Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, Gretchen Krueger, Ilya Sutskever.
1. **[ConvBERT](https://huggingface.co/transformers/model_doc/convbert.html)** (from YituTech) released with the paper [ConvBERT: Improving BERT with Span-based Dynamic Convolution](https://arxiv.org/abs/2008.02496) by Zihang Jiang, Weihao Yu, Daquan Zhou, Yunpeng Chen, Jiashi Feng, Shuicheng Yan.
1. **[CPM](https://huggingface.co/transformers/model_doc/cpm.html)** (from Tsinghua University) released with the paper [CPM: A Large-scale Generative Chinese Pre-trained Language Model](https://arxiv.org/abs/2012.00413) by Zhengyan Zhang, Xu Han, Hao Zhou, Pei Ke, Yuxian Gu, Deming Ye, Yujia Qin, Yusheng Su, Haozhe Ji, Jian Guan, Fanchao Qi, Xiaozhi Wang, Yanan Zheng, Guoyang Zeng, Huanqi Cao, Shengqi Chen, Daixuan Li, Zhenbo Sun, Zhiyuan Liu, Minlie Huang, Wentao Han, Jie Tang, Juanzi Li, Xiaoyan Zhu, Maosong Sun.
1. **[CTRL](https://huggingface.co/transformers/model_doc/ctrl.html)** (from Salesforce) released with the paper [CTRL: A Conditional Transformer Language Model for Controllable Generation](https://arxiv.org/abs/1909.05858) by Nitish Shirish Keskar*, Bryan McCann*, Lav R. Varshney, Caiming Xiong and Richard Socher.
@@ -220,6 +222,7 @@ Min, Patrick Lewis, Ledell Wu, Sergey Edunov, Danqi Chen, and Wen-tau Yih.
1. **[LayoutLM](https://huggingface.co/transformers/model_doc/layoutlm.html)** (from Microsoft Research Asia) released with the paper [LayoutLM: Pre-training of Text and Layout for Document Image Understanding](https://arxiv.org/abs/1912.13318) by Yiheng Xu, Minghao Li, Lei Cui, Shaohan Huang, Furu Wei, Ming Zhou.
1. **[LED](https://huggingface.co/transformers/model_doc/led.html)** (from AllenAI) released with the paper [Longformer: The Long-Document Transformer](https://arxiv.org/abs/2004.05150) by Iz Beltagy, Matthew E. Peters, Arman Cohan.
1. **[Longformer](https://huggingface.co/transformers/model_doc/longformer.html)** (from AllenAI) released with the paper [Longformer: The Long-Document Transformer](https://arxiv.org/abs/2004.05150) by Iz Beltagy, Matthew E. Peters, Arman Cohan.
1. **[LUKE](https://huggingface.co/transformers/model_doc/luke.html)** (from Studio Ousia) released with the paper [LUKE: Deep Contextualized Entity Representations with Entity-aware Self-attention](https://arxiv.org/abs/2010.01057) by Ikuya Yamada, Akari Asai, Hiroyuki Shindo, Hideaki Takeda, Yuji Matsumoto.
1. **[LXMERT](https://huggingface.co/transformers/model_doc/lxmert.html)** (from UNC Chapel Hill) released with the paper [LXMERT: Learning Cross-Modality Encoder Representations from Transformers for Open-Domain Question Answering](https://arxiv.org/abs/1908.07490) by Hao Tan and Mohit Bansal.
1. **[M2M100](https://huggingface.co/transformers/model_doc/m2m_100.html)** (from Facebook) released with the paper [Beyond English-Centric Multilingual Machine Translation](https://arxiv.org/abs/2010.11125) by by Angela Fan, Shruti Bhosale, Holger Schwenk, Zhiyi Ma, Ahmed El-Kishky, Siddharth Goyal, Mandeep Baines, Onur Celebi, Guillaume Wenzek, Vishrav Chaudhary, Naman Goyal, Tom Birch, Vitaliy Liptchinsky, Sergey Edunov, Edouard Grave, Michael Auli, Armand Joulin.
1. **[MarianMT](https://huggingface.co/transformers/model_doc/marian.html)** Machine translation models trained using [OPUS](http://opus.nlpl.eu/) data by Jörg Tiedemann. The [Marian Framework](https://marian-nmt.github.io/) is being developed by the Microsoft Translator Team.
@@ -247,9 +250,9 @@ Min, Patrick Lewis, Ledell Wu, Sergey Edunov, Danqi Chen, and Wen-tau Yih.
1. **[XLSR-Wav2Vec2](https://huggingface.co/transformers/model_doc/xlsr_wav2vec2.html)** (from Facebook AI) released with the paper [Unsupervised Cross-Lingual Representation Learning For Speech Recognition](https://arxiv.org/abs/2006.13979) by Alexis Conneau, Alexei Baevski, Ronan Collobert, Abdelrahman Mohamed, Michael Auli.
1. Want to contribute a new model? We have added a **detailed guide and templates** to guide you in the process of adding a new model. You can find them in the [`templates`](./templates) folder of the repository. Be sure to check the [contributing guidelines](./CONTRIBUTING.md) and contact the maintainers or open an issue to collect feedbacks before starting your PR.
To check if each model has an implementation in PyTorch/TensorFlow/Flax or has an associated tokenizer backed by the 🤗 Tokenizers library, refer to [this table](https://huggingface.co/transformers/index.html#bigtable)
To check if each model has an implementation in Flax, PyTorch or TensorFlow, or has an associated tokenizer backed by the 🤗 Tokenizers library, refer to [this table](https://huggingface.co/transformers/index.html#bigtable).
These implementations have been tested on several datasets (see the example scripts) and should match the performances of the original implementations. You can find more details on the performances in the Examples section of the [documentation](https://huggingface.co/transformers/examples.html).
These implementations have been tested on several datasets (see the example scripts) and should match the performance of the original implementations. You can find more details on performance in the Examples section of the [documentation](https://huggingface.co/transformers/examples.html).
## Learn more

View File

@@ -53,7 +53,7 @@ RUN git clone https://github.com/huggingface/transformers.git && \
git checkout CI && \
cd .. && \
pip install ./transformers && \
pip install -r ./transformers/examples/requirements.txt && \
pip install -r ./transformers/examples/pytorch/_test_requirements.txt && \
pip install pytest
RUN python -c "import torch_xla; print(torch_xla.__version__)"

View File

@@ -27,7 +27,7 @@ local bertBaseCased = base.BaseTest {
},
command: utils.scriptCommand(
|||
python -m pytest -s transformers/examples/test_xla_examples.py -v
python -m pytest -s transformers/examples/pytorch/test_xla_examples.py -v
test_exit_code=$?
echo "\nFinished running commands.\n"
test $test_exit_code -eq 0

View File

@@ -65,10 +65,10 @@ respectively.
.. code-block:: bash
## PYTORCH CODE
python examples/benchmarking/run_benchmark.py --help
python examples/pytorch/benchmarking/run_benchmark.py --help
## TENSORFLOW CODE
python examples/benchmarking/run_benchmark_tf.py --help
python examples/tensorflow/benchmarking/run_benchmark_tf.py --help
An instantiated benchmark object can then simply be run by calling ``benchmark.run()``.

View File

@@ -52,3 +52,7 @@ This page regroups resources around 🤗 Transformers developed by the community
|[Fine-tune BART for summarization in two languages with Trainer class](https://github.com/elsanns/xai-nlp-notebooks/blob/master/fine_tune_bart_summarization_two_langs.ipynb) | How to fine-tune BART for summarization in two languages with Trainer class | [Eliza Szczechla](https://github.com/elsanns) | [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/elsanns/xai-nlp-notebooks/blob/master/fine_tune_bart_summarization_two_langs.ipynb)|
|[Evaluate Big Bird on Trivia QA](https://github.com/patrickvonplaten/notebooks/blob/master/Evaluating_Big_Bird_on_TriviaQA.ipynb) | How to evaluate BigBird on long document question answering on Trivia QA | [Patrick von Platen](https://github.com/patrickvonplaten) | [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/patrickvonplaten/notebooks/blob/master/Evaluating_Big_Bird_on_TriviaQA.ipynb)|
| [Create video captions using Wav2Vec2](https://github.com/Muennighoff/ytclipcc/blob/main/wav2vec_youtube_captions.ipynb) | How to create YouTube captions from any video by transcribing the audio with Wav2Vec | [Niklas Muennighoff](https://github.com/Muennighoff) |[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/Muennighoff/ytclipcc/blob/main/wav2vec_youtube_captions.ipynb) |
| [Evaluate LUKE on Open Entity, an entity typing dataset](https://github.com/studio-ousia/luke/blob/master/notebooks/huggingface_open_entity.ipynb) | How to evaluate *LukeForEntityClassification* on the Open Entity dataset | [Ikuya Yamada](https://github.com/ikuyamada) |[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/studio-ousia/luke/blob/master/notebooks/huggingface_open_entity.ipynb) |
| [Evaluate LUKE on TACRED, a relation extraction dataset](https://github.com/studio-ousia/luke/blob/master/notebooks/huggingface_tacred.ipynb) | How to evaluate *LukeForEntityPairClassification* on the TACRED dataset | [Ikuya Yamada](https://github.com/ikuyamada) |[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/studio-ousia/luke/blob/master/notebooks/huggingface_tacred.ipynb) |
| [Evaluate LUKE on CoNLL-2003, an important NER benchmark](https://github.com/studio-ousia/luke/blob/master/notebooks/huggingface_conll_2003.ipynb) | How to evaluate *LukeForEntitySpanClassification* on the CoNLL-2003 dataset | [Ikuya Yamada](https://github.com/ikuyamada) |[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/studio-ousia/luke/blob/master/notebooks/huggingface_conll_2003.ipynb) |
| [Evaluate BigBird-Pegasus on PubMed dataset](https://github.com/vasudevgupta7/bigbird/blob/main/notebooks/bigbird_pegasus_evaluation.ipynb) | How to evaluate *BigBirdPegasusForConditionalGeneration* on PubMed dataset | [Vasudev Gupta](https://github.com/vasudevgupta7) | [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/vasudevgupta7/bigbird/blob/main/notebooks/bigbird_pegasus_evaluation.ipynb) |

View File

@@ -33,8 +33,8 @@ You can convert any TensorFlow checkpoint for BERT (in particular `the pre-train
This CLI takes as input a TensorFlow checkpoint (three files starting with ``bert_model.ckpt``\ ) and the associated
configuration file (\ ``bert_config.json``\ ), and creates a PyTorch model for this configuration, loads the weights
from the TensorFlow checkpoint in the PyTorch model and saves the resulting model in a standard PyTorch save file that
can be imported using ``from_pretrained()`` (see example in :doc:`quicktour` , `run_glue.py
<https://github.com/huggingface/transformers/blob/master/examples/text-classification/run_glue.py>`_\ ).
can be imported using ``from_pretrained()`` (see example in :doc:`quicktour` , :prefix_link:`run_glue.py
<examples/pytorch/text-classification/run_glue.py>` \ ).
You only need to run this conversion script **once** to get a PyTorch model. You can then disregard the TensorFlow
checkpoint (the three files starting with ``bert_model.ckpt``\ ) but be sure to keep the configuration file (\

295
docs/source/debugging.rst Normal file
View File

@@ -0,0 +1,295 @@
..
Copyright 2021 The HuggingFace Team. All rights reserved.
Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with
the License. You may obtain a copy of the License at
http://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on
an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the
specific language governing permissions and limitations under the License.
Debugging
=======================================================================================================================
Underflow and Overflow Detection
-----------------------------------------------------------------------------------------------------------------------
.. note::
This feature is currently available for PyTorch-only.
.. note::
This feature can be used with any ``nn.Module``-based model
If you start getting ``loss=NaN`` or the model inhibits some other abnormal behavior due to ``inf`` or ``nan`` in
activations or weights one needs to discover where the first underflow or overflow happens and what led to it. Luckily
you can accomplish that easily by activating a special module that will do the detection automatically.
If you're using :class:`~transformers.Trainer`, you just need to add:
.. code-block:: bash
--debug underflow_overflow
to the normal command line arguments, or pass ``debug="underflow_overflow"`` when creating the
:class:`~transformers.TrainingArguments` object.
If you're using your own training loop or another Trainer you can accomplish the same with:
.. code-block:: python
from .debug_utils import DebugUnderflowOverflow
debug_overflow = DebugUnderflowOverflow(model)
:class:`~transformers.debug_utils.DebugUnderflowOverflow` inserts hooks into the model that immediately after each
forward call will test input and output variables and also the corresponding module's weights. As soon as ``inf`` or
``nan`` is detected in at least one element of the activations or weights, the program will assert and print a report
like this (this was caught with ``google/mt5-small`` under fp16 mixed precision):
.. code-block::
Detected inf/nan during batch_number=0
Last 21 forward frames:
abs min abs max metadata
encoder.block.1.layer.1.DenseReluDense.dropout Dropout
0.00e+00 2.57e+02 input[0]
0.00e+00 2.85e+02 output
[...]
encoder.block.2.layer.0 T5LayerSelfAttention
6.78e-04 3.15e+03 input[0]
2.65e-04 3.42e+03 output[0]
None output[1]
2.25e-01 1.00e+04 output[2]
encoder.block.2.layer.1.layer_norm T5LayerNorm
8.69e-02 4.18e-01 weight
2.65e-04 3.42e+03 input[0]
1.79e-06 4.65e+00 output
encoder.block.2.layer.1.DenseReluDense.wi_0 Linear
2.17e-07 4.50e+00 weight
1.79e-06 4.65e+00 input[0]
2.68e-06 3.70e+01 output
encoder.block.2.layer.1.DenseReluDense.wi_1 Linear
8.08e-07 2.66e+01 weight
1.79e-06 4.65e+00 input[0]
1.27e-04 2.37e+02 output
encoder.block.2.layer.1.DenseReluDense.dropout Dropout
0.00e+00 8.76e+03 input[0]
0.00e+00 9.74e+03 output
encoder.block.2.layer.1.DenseReluDense.wo Linear
1.01e-06 6.44e+00 weight
0.00e+00 9.74e+03 input[0]
3.18e-04 6.27e+04 output
encoder.block.2.layer.1.DenseReluDense T5DenseGatedGeluDense
1.79e-06 4.65e+00 input[0]
3.18e-04 6.27e+04 output
encoder.block.2.layer.1.dropout Dropout
3.18e-04 6.27e+04 input[0]
0.00e+00 inf output
The example output has been trimmed in the middle for brevity.
The second column shows the value of the absolute largest element, so if you have a closer look at the last few frames,
the inputs and outputs were in the range of ``1e4``. So when this training was done under fp16 mixed precision the very
last step overflowed (since under ``fp16`` the largest number before ``inf`` is ``64e3``). To avoid overflows under
``fp16`` the activations must remain way below ``1e4``, because ``1e4 * 1e4 = 1e8`` so any matrix multiplication with
large activations is going to lead to a numerical overflow condition.
At the very start of the trace you can discover at which batch number the problem occurred (here ``Detected inf/nan
during batch_number=0`` means the problem occurred on the first batch).
Each reported frame starts by declaring the fully qualified entry for the corresponding module this frame is reporting
for. If we look just at this frame:
.. code-block::
encoder.block.2.layer.1.layer_norm T5LayerNorm
8.69e-02 4.18e-01 weight
2.65e-04 3.42e+03 input[0]
1.79e-06 4.65e+00 output
Here, ``encoder.block.2.layer.1.layer_norm`` indicates that it was a layer norm for the first layer, of the second
block of the encoder. And the specific calls of the ``forward`` is ``T5LayerNorm``.
Let's look at the last few frames of that report:
.. code-block::
Detected inf/nan during batch_number=0
Last 21 forward frames:
abs min abs max metadata
[...]
encoder.block.2.layer.1.DenseReluDense.wi_0 Linear
2.17e-07 4.50e+00 weight
1.79e-06 4.65e+00 input[0]
2.68e-06 3.70e+01 output
encoder.block.2.layer.1.DenseReluDense.wi_1 Linear
8.08e-07 2.66e+01 weight
1.79e-06 4.65e+00 input[0]
1.27e-04 2.37e+02 output
encoder.block.2.layer.1.DenseReluDense.wo Linear
1.01e-06 6.44e+00 weight
0.00e+00 9.74e+03 input[0]
3.18e-04 6.27e+04 output
encoder.block.2.layer.1.DenseReluDense T5DenseGatedGeluDense
1.79e-06 4.65e+00 input[0]
3.18e-04 6.27e+04 output
encoder.block.2.layer.1.dropout Dropout
3.18e-04 6.27e+04 input[0]
0.00e+00 inf output
The last frame reports for ``Dropout.forward`` function with the first entry for the only input and the second for the
only output. You can see that it was called from an attribute ``dropout`` inside ``DenseReluDense`` class. We can see
that it happened during the first layer, of the 2nd block, during the very first batch. Finally, the absolute largest
input elements was ``6.27e+04`` and same for the output was ``inf``.
You can see here, that ``T5DenseGatedGeluDense.forward`` resulted in output activations, whose absolute max value was
around 62.7K, which is very close to fp16's top limit of 64K. In the next frame we have ``Dropout`` which renormalizes
the weights, after it zeroed some of the elements, which pushes the absolute max value to more than 64K, and we get an
overlow (``inf``).
As you can see it's the previous frames that we need to look into when the numbers start going into very large for fp16
numbers.
Let's match the report to the code from ``models/t5/modeling_t5.py``:
.. code-block:: python
class T5DenseGatedGeluDense(nn.Module):
def __init__(self, config):
super().__init__()
self.wi_0 = nn.Linear(config.d_model, config.d_ff, bias=False)
self.wi_1 = nn.Linear(config.d_model, config.d_ff, bias=False)
self.wo = nn.Linear(config.d_ff, config.d_model, bias=False)
self.dropout = nn.Dropout(config.dropout_rate)
self.gelu_act = ACT2FN["gelu_new"]
def forward(self, hidden_states):
hidden_gelu = self.gelu_act(self.wi_0(hidden_states))
hidden_linear = self.wi_1(hidden_states)
hidden_states = hidden_gelu * hidden_linear
hidden_states = self.dropout(hidden_states)
hidden_states = self.wo(hidden_states)
return hidden_states
Now it's easy to see the ``dropout`` call, and all the previous calls as well.
Since the detection is happening in a forward hook, these reports are printed immediately after each ``forward``
returns.
Going back to the full report, to act on it and to fix the problem, we need to go a few frames up where the numbers
started to go up and most likely switch to the ``fp32`` mode here, so that the numbers don't overflow when multiplied
or summed up. Of course, there might be other solutions. For example, we could turn off ``amp`` temporarily if it's
enabled, after moving the original ``forward`` into a helper wrapper, like so:
.. code-block:: python
def _forward(self, hidden_states):
hidden_gelu = self.gelu_act(self.wi_0(hidden_states))
hidden_linear = self.wi_1(hidden_states)
hidden_states = hidden_gelu * hidden_linear
hidden_states = self.dropout(hidden_states)
hidden_states = self.wo(hidden_states)
return hidden_states
import torch
def forward(self, hidden_states):
if torch.is_autocast_enabled():
with torch.cuda.amp.autocast(enabled=False):
return self._forward(hidden_states)
else:
return self._forward(hidden_states)
Since the automatic detector only reports on inputs and outputs of full frames, once you know where to look, you may
want to analyse the intermediary stages of any specific ``forward`` function as well. In such a case you can use the
``detect_overflow`` helper function to inject the detector where you want it, for example:
.. code-block:: python
from debug_utils import detect_overflow
class T5LayerFF(nn.Module):
[...]
def forward(self, hidden_states):
forwarded_states = self.layer_norm(hidden_states)
detect_overflow(forwarded_states, "after layer_norm")
forwarded_states = self.DenseReluDense(forwarded_states)
detect_overflow(forwarded_states, "after DenseReluDense")
return hidden_states + self.dropout(forwarded_states)
You can see that we added 2 of these and now we track if ``inf`` or ``nan`` for ``forwarded_states`` was detected
somewhere in between.
Actually, the detector already reports these because each of the calls in the example above is a `nn.Module``, but
let's say if you had some local direct calculations this is how you'd do that.
Additionally, if you're instantiating the debugger in your own code, you can adjust the number of frames printed from
its default, e.g.:
.. code-block:: python
from .debug_utils import DebugUnderflowOverflow
debug_overflow = DebugUnderflowOverflow(model, max_frames_to_save=100)
Specific batch absolute mix and max value tracing
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
The same debugging class can be used for per-batch tracing with the underflow/overflow detection feature turned off.
Let's say you want to watch the absolute min and max values for all the ingredients of each ``forward`` call of a given
batch, and only do that for batches 1 and 3. Then you instantiate this class as:
.. code-block:: python
debug_overflow = DebugUnderflowOverflow(model, trace_batch_nums=[1,3])
And now full batches 1 and 3 will be traced using the same format as the underflow/overflow detector does.
Batches are 0-indexed.
This is helpful if you know that the program starts misbehaving after a certain batch number, so you can fast-forward
right to that area. Here is a sample truncated output for such configuration:
.. code-block::
*** Starting batch number=1 ***
abs min abs max metadata
shared Embedding
1.01e-06 7.92e+02 weight
0.00e+00 2.47e+04 input[0]
5.36e-05 7.92e+02 output
[...]
decoder.dropout Dropout
1.60e-07 2.27e+01 input[0]
0.00e+00 2.52e+01 output
decoder T5Stack
not a tensor output
lm_head Linear
1.01e-06 7.92e+02 weight
0.00e+00 1.11e+00 input[0]
6.06e-02 8.39e+01 output
T5ForConditionalGeneration
not a tensor output
*** Starting batch number=3 ***
abs min abs max metadata
shared Embedding
1.01e-06 7.92e+02 weight
0.00e+00 2.78e+04 input[0]
5.36e-05 7.92e+02 output
[...]
Here you will get a huge number of frames dumped - as many as there were forward calls in your model, so it may or may
not what you want, but sometimes it can be easier to use for debugging purposes than a normal debugger. For example, if
a problem starts happening at batch number 150. So you can dump traces for batches 149 and 150 and compare where
numbers started to diverge.
You can also specify the batch number after which to stop the training, with:
.. code-block:: python
debug_overflow = DebugUnderflowOverflow(model, trace_batch_nums=[1,3], abort_after_batch_num=3)

View File

@@ -1,12 +1,12 @@
Transformers
=======================================================================================================================
State-of-the-art Natural Language Processing for Pytorch and TensorFlow 2.0.
State-of-the-art Natural Language Processing for Jax, Pytorch and TensorFlow
🤗 Transformers (formerly known as `pytorch-transformers` and `pytorch-pretrained-bert`) provides general-purpose
architectures (BERT, GPT-2, RoBERTa, XLM, DistilBert, XLNet...) for Natural Language Understanding (NLU) and Natural
Language Generation (NLG) with over 32+ pretrained models in 100+ languages and deep interoperability between
TensorFlow 2.0 and PyTorch.
Language Generation (NLG) with over 32+ pretrained models in 100+ languages and deep interoperability between Jax,
PyTorch and TensorFlow.
This is the documentation of our repository `transformers <https://github.com/huggingface/transformers>`_.
@@ -43,11 +43,11 @@ Lower compute costs, smaller carbon footprint:
Choose the right framework for every part of a model's lifetime:
- Train state-of-the-art models in 3 lines of code
- Deep interoperability between TensorFlow 2.0 and PyTorch models
- Move a single model between TF2.0/PyTorch frameworks at will
- Deep interoperability between Jax, Pytorch and TensorFlow models
- Move a single model between Jax/PyTorch/TensorFlow frameworks at will
- Seamlessly pick the right framework for training, evaluation, production
Experimental support for Flax with a few models right now, expected to grow in the coming months.
The support for Jax is still experimental (with a few models right now), expect to see it grow in the coming months!
`All the model checkpoints <https://huggingface.co/models>`__ are seamlessly integrated from the huggingface.co `model
hub <https://huggingface.co>`__ where they are uploaded directly by `users <https://huggingface.co/users>`__ and
@@ -74,8 +74,8 @@ The documentation is organized in five parts:
- **MODELS** for the classes and functions related to each model implemented in the library.
- **INTERNAL HELPERS** for the classes and functions we use internally.
The library currently contains PyTorch, Tensorflow and Flax implementations, pretrained model weights, usage scripts
and conversion utilities for the following models:
The library currently contains Jax, PyTorch and Tensorflow implementations, pretrained model weights, usage scripts and
conversion utilities for the following models:
..
This list is updated automatically from the README with `make fix-copies`. Do not update manually!
@@ -100,150 +100,160 @@ and conversion utilities for the following models:
6. :doc:`BigBird-RoBERTa <model_doc/bigbird>` (from Google Research) released with the paper `Big Bird: Transformers
for Longer Sequences <https://arxiv.org/abs/2007.14062>`__ by Manzil Zaheer, Guru Guruganesh, Avinava Dubey, Joshua
Ainslie, Chris Alberti, Santiago Ontanon, Philip Pham, Anirudh Ravula, Qifan Wang, Li Yang, Amr Ahmed.
7. :doc:`Blenderbot <model_doc/blenderbot>` (from Facebook) released with the paper `Recipes for building an
7. :doc:`BigBird-Pegasus <model_doc/bigbird_pegasus>` (from Google Research) released with the paper `Big Bird:
Transformers for Longer Sequences <https://arxiv.org/abs/2007.14062>`__ by Manzil Zaheer, Guru Guruganesh, Avinava
Dubey, Joshua Ainslie, Chris Alberti, Santiago Ontanon, Philip Pham, Anirudh Ravula, Qifan Wang, Li Yang, Amr Ahmed.
8. :doc:`Blenderbot <model_doc/blenderbot>` (from Facebook) released with the paper `Recipes for building an
open-domain chatbot <https://arxiv.org/abs/2004.13637>`__ by Stephen Roller, Emily Dinan, Naman Goyal, Da Ju, Mary
Williamson, Yinhan Liu, Jing Xu, Myle Ott, Kurt Shuster, Eric M. Smith, Y-Lan Boureau, Jason Weston.
8. :doc:`BlenderbotSmall <model_doc/blenderbot_small>` (from Facebook) released with the paper `Recipes for building an
9. :doc:`BlenderbotSmall <model_doc/blenderbot_small>` (from Facebook) released with the paper `Recipes for building an
open-domain chatbot <https://arxiv.org/abs/2004.13637>`__ by Stephen Roller, Emily Dinan, Naman Goyal, Da Ju, Mary
Williamson, Yinhan Liu, Jing Xu, Myle Ott, Kurt Shuster, Eric M. Smith, Y-Lan Boureau, Jason Weston.
9. :doc:`BORT <model_doc/bort>` (from Alexa) released with the paper `Optimal Subarchitecture Extraction For BERT
<https://arxiv.org/abs/2010.10499>`__ by Adrian de Wynter and Daniel J. Perry.
10. :doc:`CamemBERT <model_doc/camembert>` (from Inria/Facebook/Sorbonne) released with the paper `CamemBERT: a Tasty
10. :doc:`BORT <model_doc/bort>` (from Alexa) released with the paper `Optimal Subarchitecture Extraction For BERT
<https://arxiv.org/abs/2010.10499>`__ by Adrian de Wynter and Daniel J. Perry.
11. :doc:`CamemBERT <model_doc/camembert>` (from Inria/Facebook/Sorbonne) released with the paper `CamemBERT: a Tasty
French Language Model <https://arxiv.org/abs/1911.03894>`__ by Louis Martin*, Benjamin Muller*, Pedro Javier Ortiz
Suárez*, Yoann Dupont, Laurent Romary, Éric Villemonte de la Clergerie, Djamé Seddah and Benoît Sagot.
11. :doc:`ConvBERT <model_doc/convbert>` (from YituTech) released with the paper `ConvBERT: Improving BERT with
12. :doc:`CLIP <model_doc/clip>` from (OpenAI) released with the paper `Learning Transferable Visual Models From
Natural Language Supervision <https://arxiv.org/abs/2103.00020>`__ by Alec Radford, Jong Wook Kim, Chris Hallacy,
Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, Gretchen
Krueger, Ilya Sutskever.
13. :doc:`ConvBERT <model_doc/convbert>` (from YituTech) released with the paper `ConvBERT: Improving BERT with
Span-based Dynamic Convolution <https://arxiv.org/abs/2008.02496>`__ by Zihang Jiang, Weihao Yu, Daquan Zhou,
Yunpeng Chen, Jiashi Feng, Shuicheng Yan.
12. :doc:`CPM <model_doc/cpm>` (from Tsinghua University) released with the paper `CPM: A Large-scale Generative
14. :doc:`CPM <model_doc/cpm>` (from Tsinghua University) released with the paper `CPM: A Large-scale Generative
Chinese Pre-trained Language Model <https://arxiv.org/abs/2012.00413>`__ by Zhengyan Zhang, Xu Han, Hao Zhou, Pei
Ke, Yuxian Gu, Deming Ye, Yujia Qin, Yusheng Su, Haozhe Ji, Jian Guan, Fanchao Qi, Xiaozhi Wang, Yanan Zheng,
Guoyang Zeng, Huanqi Cao, Shengqi Chen, Daixuan Li, Zhenbo Sun, Zhiyuan Liu, Minlie Huang, Wentao Han, Jie Tang,
Juanzi Li, Xiaoyan Zhu, Maosong Sun.
13. :doc:`CTRL <model_doc/ctrl>` (from Salesforce) released with the paper `CTRL: A Conditional Transformer Language
15. :doc:`CTRL <model_doc/ctrl>` (from Salesforce) released with the paper `CTRL: A Conditional Transformer Language
Model for Controllable Generation <https://arxiv.org/abs/1909.05858>`__ by Nitish Shirish Keskar*, Bryan McCann*,
Lav R. Varshney, Caiming Xiong and Richard Socher.
14. :doc:`DeBERTa <model_doc/deberta>` (from Microsoft) released with the paper `DeBERTa: Decoding-enhanced BERT with
16. :doc:`DeBERTa <model_doc/deberta>` (from Microsoft) released with the paper `DeBERTa: Decoding-enhanced BERT with
Disentangled Attention <https://arxiv.org/abs/2006.03654>`__ by Pengcheng He, Xiaodong Liu, Jianfeng Gao, Weizhu
Chen.
15. :doc:`DeBERTa-v2 <model_doc/deberta_v2>` (from Microsoft) released with the paper `DeBERTa: Decoding-enhanced BERT
17. :doc:`DeBERTa-v2 <model_doc/deberta_v2>` (from Microsoft) released with the paper `DeBERTa: Decoding-enhanced BERT
with Disentangled Attention <https://arxiv.org/abs/2006.03654>`__ by Pengcheng He, Xiaodong Liu, Jianfeng Gao,
Weizhu Chen.
16. :doc:`DeiT <model_doc/deit>` (from Facebook) released with the paper `Training data-efficient image transformers &
18. :doc:`DeiT <model_doc/deit>` (from Facebook) released with the paper `Training data-efficient image transformers &
distillation through attention <https://arxiv.org/abs/2012.12877>`__ by Hugo Touvron, Matthieu Cord, Matthijs
Douze, Francisco Massa, Alexandre Sablayrolles, Hervé Jégou.
17. :doc:`DialoGPT <model_doc/dialogpt>` (from Microsoft Research) released with the paper `DialoGPT: Large-Scale
19. :doc:`DialoGPT <model_doc/dialogpt>` (from Microsoft Research) released with the paper `DialoGPT: Large-Scale
Generative Pre-training for Conversational Response Generation <https://arxiv.org/abs/1911.00536>`__ by Yizhe
Zhang, Siqi Sun, Michel Galley, Yen-Chun Chen, Chris Brockett, Xiang Gao, Jianfeng Gao, Jingjing Liu, Bill Dolan.
18. :doc:`DistilBERT <model_doc/distilbert>` (from HuggingFace), released together with the paper `DistilBERT, a
20. :doc:`DistilBERT <model_doc/distilbert>` (from HuggingFace), released together with the paper `DistilBERT, a
distilled version of BERT: smaller, faster, cheaper and lighter <https://arxiv.org/abs/1910.01108>`__ by Victor
Sanh, Lysandre Debut and Thomas Wolf. The same method has been applied to compress GPT2 into `DistilGPT2
<https://github.com/huggingface/transformers/tree/master/examples/distillation>`__, RoBERTa into `DistilRoBERTa
<https://github.com/huggingface/transformers/tree/master/examples/distillation>`__, Multilingual BERT into
`DistilmBERT <https://github.com/huggingface/transformers/tree/master/examples/distillation>`__ and a German
version of DistilBERT.
19. :doc:`DPR <model_doc/dpr>` (from Facebook) released with the paper `Dense Passage Retrieval for Open-Domain
21. :doc:`DPR <model_doc/dpr>` (from Facebook) released with the paper `Dense Passage Retrieval for Open-Domain
Question Answering <https://arxiv.org/abs/2004.04906>`__ by Vladimir Karpukhin, Barlas Oğuz, Sewon Min, Patrick
Lewis, Ledell Wu, Sergey Edunov, Danqi Chen, and Wen-tau Yih.
20. :doc:`ELECTRA <model_doc/electra>` (from Google Research/Stanford University) released with the paper `ELECTRA:
22. :doc:`ELECTRA <model_doc/electra>` (from Google Research/Stanford University) released with the paper `ELECTRA:
Pre-training text encoders as discriminators rather than generators <https://arxiv.org/abs/2003.10555>`__ by Kevin
Clark, Minh-Thang Luong, Quoc V. Le, Christopher D. Manning.
21. :doc:`FlauBERT <model_doc/flaubert>` (from CNRS) released with the paper `FlauBERT: Unsupervised Language Model
23. :doc:`FlauBERT <model_doc/flaubert>` (from CNRS) released with the paper `FlauBERT: Unsupervised Language Model
Pre-training for French <https://arxiv.org/abs/1912.05372>`__ by Hang Le, Loïc Vial, Jibril Frej, Vincent Segonne,
Maximin Coavoux, Benjamin Lecouteux, Alexandre Allauzen, Benoît Crabbé, Laurent Besacier, Didier Schwab.
22. :doc:`Funnel Transformer <model_doc/funnel>` (from CMU/Google Brain) released with the paper `Funnel-Transformer:
24. :doc:`Funnel Transformer <model_doc/funnel>` (from CMU/Google Brain) released with the paper `Funnel-Transformer:
Filtering out Sequential Redundancy for Efficient Language Processing <https://arxiv.org/abs/2006.03236>`__ by
Zihang Dai, Guokun Lai, Yiming Yang, Quoc V. Le.
23. :doc:`GPT <model_doc/gpt>` (from OpenAI) released with the paper `Improving Language Understanding by Generative
25. :doc:`GPT <model_doc/gpt>` (from OpenAI) released with the paper `Improving Language Understanding by Generative
Pre-Training <https://blog.openai.com/language-unsupervised/>`__ by Alec Radford, Karthik Narasimhan, Tim Salimans
and Ilya Sutskever.
24. :doc:`GPT-2 <model_doc/gpt2>` (from OpenAI) released with the paper `Language Models are Unsupervised Multitask
26. :doc:`GPT-2 <model_doc/gpt2>` (from OpenAI) released with the paper `Language Models are Unsupervised Multitask
Learners <https://blog.openai.com/better-language-models/>`__ by Alec Radford*, Jeffrey Wu*, Rewon Child, David
Luan, Dario Amodei** and Ilya Sutskever**.
25. :doc:`GPT Neo <model_doc/gpt_neo>` (from EleutherAI) released in the repository `EleutherAI/gpt-neo
27. :doc:`GPT Neo <model_doc/gpt_neo>` (from EleutherAI) released in the repository `EleutherAI/gpt-neo
<https://github.com/EleutherAI/gpt-neo>`__ by Sid Black, Stella Biderman, Leo Gao, Phil Wang and Connor Leahy.
26. :doc:`I-BERT <model_doc/ibert>` (from Berkeley) released with the paper `I-BERT: Integer-only BERT Quantization
28. :doc:`I-BERT <model_doc/ibert>` (from Berkeley) released with the paper `I-BERT: Integer-only BERT Quantization
<https://arxiv.org/abs/2101.01321>`__ by Sehoon Kim, Amir Gholami, Zhewei Yao, Michael W. Mahoney, Kurt Keutzer
27. :doc:`LayoutLM <model_doc/layoutlm>` (from Microsoft Research Asia) released with the paper `LayoutLM: Pre-training
29. :doc:`LayoutLM <model_doc/layoutlm>` (from Microsoft Research Asia) released with the paper `LayoutLM: Pre-training
of Text and Layout for Document Image Understanding <https://arxiv.org/abs/1912.13318>`__ by Yiheng Xu, Minghao Li,
Lei Cui, Shaohan Huang, Furu Wei, Ming Zhou.
28. :doc:`LED <model_doc/led>` (from AllenAI) released with the paper `Longformer: The Long-Document Transformer
30. :doc:`LED <model_doc/led>` (from AllenAI) released with the paper `Longformer: The Long-Document Transformer
<https://arxiv.org/abs/2004.05150>`__ by Iz Beltagy, Matthew E. Peters, Arman Cohan.
29. :doc:`Longformer <model_doc/longformer>` (from AllenAI) released with the paper `Longformer: The Long-Document
31. :doc:`Longformer <model_doc/longformer>` (from AllenAI) released with the paper `Longformer: The Long-Document
Transformer <https://arxiv.org/abs/2004.05150>`__ by Iz Beltagy, Matthew E. Peters, Arman Cohan.
30. :doc:`LXMERT <model_doc/lxmert>` (from UNC Chapel Hill) released with the paper `LXMERT: Learning Cross-Modality
32. :doc:`LUKE <model_doc/luke>` (from Studio Ousia) released with the paper `LUKE: Deep Contextualized Entity
Representations with Entity-aware Self-attention <https://arxiv.org/abs/2010.01057>`__ by Ikuya Yamada, Akari Asai,
Hiroyuki Shindo, Hideaki Takeda, Yuji Matsumoto.
33. :doc:`LXMERT <model_doc/lxmert>` (from UNC Chapel Hill) released with the paper `LXMERT: Learning Cross-Modality
Encoder Representations from Transformers for Open-Domain Question Answering <https://arxiv.org/abs/1908.07490>`__
by Hao Tan and Mohit Bansal.
31. :doc:`M2M100 <model_doc/m2m_100>` (from Facebook) released with the paper `Beyond English-Centric Multilingual
34. :doc:`M2M100 <model_doc/m2m_100>` (from Facebook) released with the paper `Beyond English-Centric Multilingual
Machine Translation <https://arxiv.org/abs/2010.11125>`__ by by Angela Fan, Shruti Bhosale, Holger Schwenk, Zhiyi
Ma, Ahmed El-Kishky, Siddharth Goyal, Mandeep Baines, Onur Celebi, Guillaume Wenzek, Vishrav Chaudhary, Naman
Goyal, Tom Birch, Vitaliy Liptchinsky, Sergey Edunov, Edouard Grave, Michael Auli, Armand Joulin.
32. :doc:`MarianMT <model_doc/marian>` Machine translation models trained using `OPUS <http://opus.nlpl.eu/>`__ data by
35. :doc:`MarianMT <model_doc/marian>` Machine translation models trained using `OPUS <http://opus.nlpl.eu/>`__ data by
Jörg Tiedemann. The `Marian Framework <https://marian-nmt.github.io/>`__ is being developed by the Microsoft
Translator Team.
33. :doc:`MBart <model_doc/mbart>` (from Facebook) released with the paper `Multilingual Denoising Pre-training for
36. :doc:`MBart <model_doc/mbart>` (from Facebook) released with the paper `Multilingual Denoising Pre-training for
Neural Machine Translation <https://arxiv.org/abs/2001.08210>`__ by Yinhan Liu, Jiatao Gu, Naman Goyal, Xian Li,
Sergey Edunov, Marjan Ghazvininejad, Mike Lewis, Luke Zettlemoyer.
34. :doc:`MBart-50 <model_doc/mbart>` (from Facebook) released with the paper `Multilingual Translation with Extensible
37. :doc:`MBart-50 <model_doc/mbart>` (from Facebook) released with the paper `Multilingual Translation with Extensible
Multilingual Pretraining and Finetuning <https://arxiv.org/abs/2008.00401>`__ by Yuqing Tang, Chau Tran, Xian Li,
Peng-Jen Chen, Naman Goyal, Vishrav Chaudhary, Jiatao Gu, Angela Fan.
35. :doc:`Megatron-BERT <model_doc/megatron_bert>` (from NVIDIA) released with the paper `Megatron-LM: Training
38. :doc:`Megatron-BERT <model_doc/megatron_bert>` (from NVIDIA) released with the paper `Megatron-LM: Training
Multi-Billion Parameter Language Models Using Model Parallelism <https://arxiv.org/abs/1909.08053>`__ by Mohammad
Shoeybi, Mostofa Patwary, Raul Puri, Patrick LeGresley, Jared Casper and Bryan Catanzaro.
36. :doc:`Megatron-GPT2 <model_doc/megatron_gpt2>` (from NVIDIA) released with the paper `Megatron-LM: Training
39. :doc:`Megatron-GPT2 <model_doc/megatron_gpt2>` (from NVIDIA) released with the paper `Megatron-LM: Training
Multi-Billion Parameter Language Models Using Model Parallelism <https://arxiv.org/abs/1909.08053>`__ by Mohammad
Shoeybi, Mostofa Patwary, Raul Puri, Patrick LeGresley, Jared Casper and Bryan Catanzaro.
37. :doc:`MPNet <model_doc/mpnet>` (from Microsoft Research) released with the paper `MPNet: Masked and Permuted
40. :doc:`MPNet <model_doc/mpnet>` (from Microsoft Research) released with the paper `MPNet: Masked and Permuted
Pre-training for Language Understanding <https://arxiv.org/abs/2004.09297>`__ by Kaitao Song, Xu Tan, Tao Qin,
Jianfeng Lu, Tie-Yan Liu.
38. :doc:`MT5 <model_doc/mt5>` (from Google AI) released with the paper `mT5: A massively multilingual pre-trained
41. :doc:`MT5 <model_doc/mt5>` (from Google AI) released with the paper `mT5: A massively multilingual pre-trained
text-to-text transformer <https://arxiv.org/abs/2010.11934>`__ by Linting Xue, Noah Constant, Adam Roberts, Mihir
Kale, Rami Al-Rfou, Aditya Siddhant, Aditya Barua, Colin Raffel.
39. :doc:`Pegasus <model_doc/pegasus>` (from Google) released with the paper `PEGASUS: Pre-training with Extracted
42. :doc:`Pegasus <model_doc/pegasus>` (from Google) released with the paper `PEGASUS: Pre-training with Extracted
Gap-sentences for Abstractive Summarization <https://arxiv.org/abs/1912.08777>`__> by Jingqing Zhang, Yao Zhao,
Mohammad Saleh and Peter J. Liu.
40. :doc:`ProphetNet <model_doc/prophetnet>` (from Microsoft Research) released with the paper `ProphetNet: Predicting
43. :doc:`ProphetNet <model_doc/prophetnet>` (from Microsoft Research) released with the paper `ProphetNet: Predicting
Future N-gram for Sequence-to-Sequence Pre-training <https://arxiv.org/abs/2001.04063>`__ by Yu Yan, Weizhen Qi,
Yeyun Gong, Dayiheng Liu, Nan Duan, Jiusheng Chen, Ruofei Zhang and Ming Zhou.
41. :doc:`Reformer <model_doc/reformer>` (from Google Research) released with the paper `Reformer: The Efficient
44. :doc:`Reformer <model_doc/reformer>` (from Google Research) released with the paper `Reformer: The Efficient
Transformer <https://arxiv.org/abs/2001.04451>`__ by Nikita Kitaev, Łukasz Kaiser, Anselm Levskaya.
42. :doc:`RoBERTa <model_doc/roberta>` (from Facebook), released together with the paper a `Robustly Optimized BERT
45. :doc:`RoBERTa <model_doc/roberta>` (from Facebook), released together with the paper a `Robustly Optimized BERT
Pretraining Approach <https://arxiv.org/abs/1907.11692>`__ by Yinhan Liu, Myle Ott, Naman Goyal, Jingfei Du, Mandar
Joshi, Danqi Chen, Omer Levy, Mike Lewis, Luke Zettlemoyer, Veselin Stoyanov.
43. :doc:`SpeechToTextTransformer <model_doc/speech_to_text>` (from Facebook), released together with the paper
46. :doc:`SpeechToTextTransformer <model_doc/speech_to_text>` (from Facebook), released together with the paper
`fairseq S2T: Fast Speech-to-Text Modeling with fairseq <https://arxiv.org/abs/2010.05171>`__ by Changhan Wang, Yun
Tang, Xutai Ma, Anne Wu, Dmytro Okhonko, Juan Pino.
44. :doc:`SqueezeBert <model_doc/squeezebert>` released with the paper `SqueezeBERT: What can computer vision teach NLP
47. :doc:`SqueezeBert <model_doc/squeezebert>` released with the paper `SqueezeBERT: What can computer vision teach NLP
about efficient neural networks? <https://arxiv.org/abs/2006.11316>`__ by Forrest N. Iandola, Albert E. Shaw, Ravi
Krishna, and Kurt W. Keutzer.
45. :doc:`T5 <model_doc/t5>` (from Google AI) released with the paper `Exploring the Limits of Transfer Learning with a
48. :doc:`T5 <model_doc/t5>` (from Google AI) released with the paper `Exploring the Limits of Transfer Learning with a
Unified Text-to-Text Transformer <https://arxiv.org/abs/1910.10683>`__ by Colin Raffel and Noam Shazeer and Adam
Roberts and Katherine Lee and Sharan Narang and Michael Matena and Yanqi Zhou and Wei Li and Peter J. Liu.
46. :doc:`TAPAS <model_doc/tapas>` (from Google AI) released with the paper `TAPAS: Weakly Supervised Table Parsing via
49. :doc:`TAPAS <model_doc/tapas>` (from Google AI) released with the paper `TAPAS: Weakly Supervised Table Parsing via
Pre-training <https://arxiv.org/abs/2004.02349>`__ by Jonathan Herzig, Paweł Krzysztof Nowak, Thomas Müller,
Francesco Piccinno and Julian Martin Eisenschlos.
47. :doc:`Transformer-XL <model_doc/transformerxl>` (from Google/CMU) released with the paper `Transformer-XL:
50. :doc:`Transformer-XL <model_doc/transformerxl>` (from Google/CMU) released with the paper `Transformer-XL:
Attentive Language Models Beyond a Fixed-Length Context <https://arxiv.org/abs/1901.02860>`__ by Zihang Dai*,
Zhilin Yang*, Yiming Yang, Jaime Carbonell, Quoc V. Le, Ruslan Salakhutdinov.
48. :doc:`Vision Transformer (ViT) <model_doc/vit>` (from Google AI) released with the paper `An Image is Worth 16x16
51. :doc:`Vision Transformer (ViT) <model_doc/vit>` (from Google AI) released with the paper `An Image is Worth 16x16
Words: Transformers for Image Recognition at Scale <https://arxiv.org/abs/2010.11929>`__ by Alexey Dosovitskiy,
Lucas Beyer, Alexander Kolesnikov, Dirk Weissenborn, Xiaohua Zhai, Thomas Unterthiner, Mostafa Dehghani, Matthias
Minderer, Georg Heigold, Sylvain Gelly, Jakob Uszkoreit, Neil Houlsby.
49. :doc:`Wav2Vec2 <model_doc/wav2vec2>` (from Facebook AI) released with the paper `wav2vec 2.0: A Framework for
52. :doc:`Wav2Vec2 <model_doc/wav2vec2>` (from Facebook AI) released with the paper `wav2vec 2.0: A Framework for
Self-Supervised Learning of Speech Representations <https://arxiv.org/abs/2006.11477>`__ by Alexei Baevski, Henry
Zhou, Abdelrahman Mohamed, Michael Auli.
50. :doc:`XLM <model_doc/xlm>` (from Facebook) released together with the paper `Cross-lingual Language Model
53. :doc:`XLM <model_doc/xlm>` (from Facebook) released together with the paper `Cross-lingual Language Model
Pretraining <https://arxiv.org/abs/1901.07291>`__ by Guillaume Lample and Alexis Conneau.
51. :doc:`XLM-ProphetNet <model_doc/xlmprophetnet>` (from Microsoft Research) released with the paper `ProphetNet:
54. :doc:`XLM-ProphetNet <model_doc/xlmprophetnet>` (from Microsoft Research) released with the paper `ProphetNet:
Predicting Future N-gram for Sequence-to-Sequence Pre-training <https://arxiv.org/abs/2001.04063>`__ by Yu Yan,
Weizhen Qi, Yeyun Gong, Dayiheng Liu, Nan Duan, Jiusheng Chen, Ruofei Zhang and Ming Zhou.
52. :doc:`XLM-RoBERTa <model_doc/xlmroberta>` (from Facebook AI), released together with the paper `Unsupervised
55. :doc:`XLM-RoBERTa <model_doc/xlmroberta>` (from Facebook AI), released together with the paper `Unsupervised
Cross-lingual Representation Learning at Scale <https://arxiv.org/abs/1911.02116>`__ by Alexis Conneau*, Kartikay
Khandelwal*, Naman Goyal, Vishrav Chaudhary, Guillaume Wenzek, Francisco Guzmán, Edouard Grave, Myle Ott, Luke
Zettlemoyer and Veselin Stoyanov.
53. :doc:`XLNet <model_doc/xlnet>` (from Google/CMU) released with the paper `XLNet: Generalized Autoregressive
56. :doc:`XLNet <model_doc/xlnet>` (from Google/CMU) released with the paper `XLNet: Generalized Autoregressive
Pretraining for Language Understanding <https://arxiv.org/abs/1906.08237>`__ by Zhilin Yang*, Zihang Dai*, Yiming
Yang, Jaime Carbonell, Ruslan Salakhutdinov, Quoc V. Le.
54. :doc:`XLSR-Wav2Vec2 <model_doc/xlsr_wav2vec2>` (from Facebook AI) released with the paper `Unsupervised
57. :doc:`XLSR-Wav2Vec2 <model_doc/xlsr_wav2vec2>` (from Facebook AI) released with the paper `Unsupervised
Cross-Lingual Representation Learning For Speech Recognition <https://arxiv.org/abs/2006.13979>`__ by Alexis
Conneau, Alexei Baevski, Ronan Collobert, Abdelrahman Mohamed, Michael Auli.
@@ -251,8 +261,8 @@ and conversion utilities for the following models:
.. _bigtable:
The table below represents the current support in the library for each of those models, whether they have a Python
tokenizer (called "slow"). A "fast" tokenizer backed by the 🤗 Tokenizers library, whether they have support in PyTorch,
TensorFlow and/or Flax.
tokenizer (called "slow"). A "fast" tokenizer backed by the 🤗 Tokenizers library, whether they have support in Jax (via
Flax), PyTorch, and/or TensorFlow.
..
This table is updated automatically from the auto modules with `make fix-copies`. Do not update manually!
@@ -270,12 +280,16 @@ TensorFlow and/or Flax.
+-----------------------------+----------------+----------------+-----------------+--------------------+--------------+
| Bert Generation | ✅ | ❌ | ✅ | ❌ | ❌ |
+-----------------------------+----------------+----------------+-----------------+--------------------+--------------+
| BigBird | ✅ | | ✅ | ❌ | ❌ |
| BigBird | ✅ | | ✅ | ❌ | ❌ |
+-----------------------------+----------------+----------------+-----------------+--------------------+--------------+
| BigBirdPegasus | ❌ | ❌ | ✅ | ❌ | ❌ |
+-----------------------------+----------------+----------------+-----------------+--------------------+--------------+
| Blenderbot | ✅ | ❌ | ✅ | ✅ | ❌ |
+-----------------------------+----------------+----------------+-----------------+--------------------+--------------+
| BlenderbotSmall | ✅ | ❌ | ✅ | ✅ | ❌ |
+-----------------------------+----------------+----------------+-----------------+--------------------+--------------+
| CLIP | ✅ | ✅ | ✅ | ❌ | ❌ |
+-----------------------------+----------------+----------------+-----------------+--------------------+--------------+
| CTRL | ✅ | ❌ | ✅ | ✅ | ❌ |
+-----------------------------+----------------+----------------+-----------------+--------------------+--------------+
| CamemBERT | ✅ | ✅ | ✅ | ✅ | ❌ |
@@ -284,7 +298,7 @@ TensorFlow and/or Flax.
+-----------------------------+----------------+----------------+-----------------+--------------------+--------------+
| DPR | ✅ | ✅ | ✅ | ✅ | ❌ |
+-----------------------------+----------------+----------------+-----------------+--------------------+--------------+
| DeBERTa | ✅ | | ✅ | ❌ | ❌ |
| DeBERTa | ✅ | | ✅ | ❌ | ❌ |
+-----------------------------+----------------+----------------+-----------------+--------------------+--------------+
| DeBERTa-v2 | ✅ | ❌ | ✅ | ❌ | ❌ |
+-----------------------------+----------------+----------------+-----------------+--------------------+--------------+
@@ -292,7 +306,7 @@ TensorFlow and/or Flax.
+-----------------------------+----------------+----------------+-----------------+--------------------+--------------+
| DistilBERT | ✅ | ✅ | ✅ | ✅ | ❌ |
+-----------------------------+----------------+----------------+-----------------+--------------------+--------------+
| ELECTRA | ✅ | ✅ | ✅ | ✅ | |
| ELECTRA | ✅ | ✅ | ✅ | ✅ | |
+-----------------------------+----------------+----------------+-----------------+--------------------+--------------+
| Encoder decoder | ❌ | ❌ | ✅ | ❌ | ❌ |
+-----------------------------+----------------+----------------+-----------------+--------------------+--------------+
@@ -308,6 +322,8 @@ TensorFlow and/or Flax.
+-----------------------------+----------------+----------------+-----------------+--------------------+--------------+
| LED | ✅ | ✅ | ✅ | ✅ | ❌ |
+-----------------------------+----------------+----------------+-----------------+--------------------+--------------+
| LUKE | ✅ | ❌ | ✅ | ❌ | ❌ |
+-----------------------------+----------------+----------------+-----------------+--------------------+--------------+
| LXMERT | ✅ | ✅ | ✅ | ✅ | ❌ |
+-----------------------------+----------------+----------------+-----------------+--------------------+--------------+
| LayoutLM | ✅ | ✅ | ✅ | ✅ | ❌ |
@@ -405,6 +421,7 @@ TensorFlow and/or Flax.
add_new_model
fast_tokenizers
testing
debugging
serialization
.. toctree::
@@ -445,10 +462,12 @@ TensorFlow and/or Flax.
model_doc/bertgeneration
model_doc/bert_japanese
model_doc/bigbird
model_doc/bigbird_pegasus
model_doc/blenderbot
model_doc/blenderbot_small
model_doc/bort
model_doc/camembert
model_doc/clip
model_doc/convbert
model_doc/cpm
model_doc/ctrl
@@ -468,6 +487,7 @@ TensorFlow and/or Flax.
model_doc/layoutlm
model_doc/led
model_doc/longformer
model_doc/luke
model_doc/lxmert
model_doc/marian
model_doc/m2m_100

View File

@@ -149,12 +149,6 @@ So if you don't have any specific environment variable set, the cache directory
(``PYTORCH_TRANSFORMERS_CACHE`` or ``PYTORCH_PRETRAINED_BERT_CACHE``), those will be used if there is no shell
environment variable for ``TRANSFORMERS_CACHE``.
### Note on model downloads (Continuous Integration or large-scale deployments)
If you expect to be downloading large volumes of models (more than 10,000) from huggingface.co (for instance through
your CI setup, or a large-scale production deployment), please cache the model files on your end. It will be way
faster, and cheaper. Feel free to contact us privately, we'd love to help with this.
### Offline mode
It's possible to run 🤗 Transformers in a firewalled or a no-network environment.
@@ -168,13 +162,13 @@ Here is an example of how this can be used on a filesystem that is shared betwee
On the instance with the normal network run your program which will download and cache models (and optionally datasets if you use 🤗 Datasets). For example:
```
python examples/seq2seq/run_translation.py --model_name_or_path t5-small --dataset_name wmt16 --dataset_config ro-en ...
python examples/pytorch/translation/run_translation.py --model_name_or_path t5-small --dataset_name wmt16 --dataset_config ro-en ...
```
and then with the same filesystem you can now run the same program on a firewalled instance:
```
HF_DATASETS_OFFLINE=1 TRANSFORMERS_OFFLINE=1 \
python examples/seq2seq/run_translation.py --model_name_or_path t5-small --dataset_name wmt16 --dataset_config ro-en ...
python examples/pytorch/translation/run_translation.py --model_name_or_path t5-small --dataset_name wmt16 --dataset_config ro-en ...
```
and it should succeed without any hanging waiting to timeout.

View File

@@ -1,4 +1,4 @@
..
..
Copyright 2020 The HuggingFace Team. All rights reserved.
Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with
@@ -46,3 +46,9 @@ Distributed Evaluation
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
.. autoclass:: transformers.HfArgumentParser
Debug Utilities
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
.. autoclass:: transformers.debug_utils.DebugUnderflowOverflow

View File

@@ -73,3 +73,10 @@ Generation
.. autoclass:: transformers.generation_tf_utils.TFGenerationMixin
:members:
Pushing to the Hub
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
.. autoclass:: transformers.file_utils.PushToHubMixin
:members:

View File

@@ -13,8 +13,8 @@
Model outputs
-----------------------------------------------------------------------------------------------------------------------
PyTorch models have outputs that are instances of subclasses of :class:`~transformers.file_utils.ModelOutput`. Those
are data structures containing all the information returned by the model, but that can also be used as tuples or
All models have outputs that are instances of subclasses of :class:`~transformers.file_utils.ModelOutput`. Those are
data structures containing all the information returned by the model, but that can also be used as tuples or
dictionaries.
Let's see of this looks on an example:

View File

@@ -23,6 +23,7 @@ There are two categories of pipeline abstractions to be aware about:
- The :func:`~transformers.pipeline` which is the most powerful object encapsulating all other pipelines.
- The other task-specific pipelines:
- :class:`~transformers.AutomaticSpeechRecognitionPipeline`
- :class:`~transformers.ConversationalPipeline`
- :class:`~transformers.FeatureExtractionPipeline`
- :class:`~transformers.FillMaskPipeline`
@@ -35,6 +36,7 @@ There are two categories of pipeline abstractions to be aware about:
- :class:`~transformers.ZeroShotClassificationPipeline`
- :class:`~transformers.Text2TextGenerationPipeline`
- :class:`~transformers.TableQuestionAnsweringPipeline`
- :class:`~transformers.ImageClassificationPipeline`
The pipeline abstraction
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
@@ -48,6 +50,13 @@ pipeline but requires an additional argument which is the `task`.
The task specific pipelines
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
AutomaticSpeechRecognitionPipeline
=======================================================================================================================
.. autoclass:: transformers.AutomaticSpeechRecognitionPipeline
:special-members: __call__
:members:
ConversationalPipeline
=======================================================================================================================
@@ -71,6 +80,13 @@ FillMaskPipeline
:special-members: __call__
:members:
ImageClassificationPipeline
=======================================================================================================================
.. autoclass:: transformers.ImageClassificationPipeline
:special-members: __call__
:members:
NerPipeline
=======================================================================================================================

View File

@@ -68,8 +68,8 @@ Additionally, the following method can be used to load values from a data file a
Example usage
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
An example using these processors is given in the `run_glue.py
<https://github.com/huggingface/pytorch-transformers/blob/master/examples/text-classification/run_glue.py>`__ script.
An example using these processors is given in the :prefix_link:`run_glue.py
<examples/legacy/text-classification/run_glue.py>` script.
XNLI
@@ -89,8 +89,8 @@ This library hosts the processor to load the XNLI data:
Please note that since the gold labels are available on the test set, evaluation is performed on the test set.
An example using these processors is given in the `run_xnli.py
<https://github.com/huggingface/pytorch-transformers/blob/master/examples/text-classification/run_xnli.py>`__ script.
An example using these processors is given in the :prefix_link:`run_xnli.py
<examples/legacy/text-classification/run_xnli.py>` script.
SQuAD
@@ -169,4 +169,4 @@ Using `tensorflow_datasets` is as easy as using a data file:
Another example using these processors is given in the :prefix_link:`run_squad.py
<examples/question-answering/run_squad.py>` script.
<examples/legacy/question-answering/run_squad.py>` script.

File diff suppressed because it is too large Load Diff

View File

@@ -169,8 +169,8 @@ Regarding the `TFTrainer` class:
- The `TFTrainer` method `_setup_wandb` is deprecated in favor of `setup_wandb`.
- The `TFTrainer` method `_run_model` is deprecated in favor of `run_model`.
Regarding the `TrainerArgument` class:
- The `TrainerArgument` argument `evaluate_during_training` is deprecated in favor of `evaluation_strategy`.
Regarding the `TrainingArguments` class:
- The `TrainingArguments` argument `evaluate_during_training` is deprecated in favor of `evaluation_strategy`.
Regarding the Transfo-XL model:
- The Transfo-XL configuration attribute `tie_weight` becomes `tie_words_embeddings`.

View File

@@ -43,7 +43,8 @@ Tips:
similar to a BERT-like architecture with the same number of hidden layers as it has to iterate through the same
number of (repeating) layers.
The original code can be found `here <https://github.com/google-research/ALBERT>`__.
This model was contributed by `lysandre <https://huggingface.co/lysandre>`__. The original code can be found `here
<https://github.com/google-research/ALBERT>`__.
AlbertConfig
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

View File

@@ -128,6 +128,13 @@ AutoModelForTableQuestionAnswering
:members:
AutoModelForImageClassification
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
.. autoclass:: transformers.AutoModelForImageClassification
:members:
TFAutoModel
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

View File

@@ -35,14 +35,15 @@ According to the abstract,
state-of-the-art results on a range of abstractive dialogue, question answering, and summarization tasks, with gains
of up to 6 ROUGE.
The Authors' code can be found `here <https://github.com/pytorch/fairseq/tree/master/examples/bart>`__.
This model was contributed by `sshleifer <https://huggingface.co/sshleifer>`__. The Authors' code can be found `here
<https://github.com/pytorch/fairseq/tree/master/examples/bart>`__.
Examples
_______________________________________________________________________________________________________________________
- Examples and scripts for fine-tuning BART and other models for sequence to sequence tasks can be found in
:prefix_link:`examples/seq2seq/ <examples/seq2seq/README.md>`.
:prefix_link:`examples/pytorch/summarization/ <examples/pytorch/summarization/README.md>`.
- An example of how to train :class:`~transformers.BartForConditionalGeneration` with a Hugging Face :obj:`datasets`
object can be found in this `forum discussion
<https://discuss.huggingface.co/t/train-bart-for-conditional-generation-e-g-summarization/1904>`__.

View File

@@ -16,7 +16,7 @@ BARThez
Overview
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
The BARThez model was proposed in `BARThez: a Skilled Pretrained French Sequence-to-Sequence Model`
The BARThez model was proposed in `BARThez: a Skilled Pretrained French Sequence-to-Sequence Model
<https://arxiv.org/abs/2010.12321>`__ by Moussa Kamal Eddine, Antoine J.-P. Tixier, Michalis Vazirgiannis on 23 Oct,
2020.
@@ -35,14 +35,15 @@ summarization dataset, OrangeSum, that we release with this paper. We also conti
pretrained multilingual BART on BARThez's corpus, and we show that the resulting model, which we call mBARTHez,
provides a significant boost over vanilla BARThez, and is on par with or outperforms CamemBERT and FlauBERT.*
The Authors' code can be found `here <https://github.com/moussaKam/BARThez>`__.
This model was contributed by `moussakam <https://huggingface.co/moussakam>`__. The Authors' code can be found `here
<https://github.com/moussaKam/BARThez>`__.
Examples
_______________________________________________________________________________________________________________________
- BARThez can be fine-tuned on sequence-to-sequence tasks in a similar way as BART, check:
:prefix_link:`examples/seq2seq/ <examples/seq2seq/README.md>`.
:prefix_link:`examples/pytorch/summarization/ <examples/pytorch/summarization/README.md>`.
BarthezTokenizer

View File

@@ -42,7 +42,8 @@ Tips:
- BERT was trained with the masked language modeling (MLM) and next sentence prediction (NSP) objectives. It is
efficient at predicting masked tokens and at NLU in general, but is not optimal for text generation.
The original code can be found `here <https://github.com/google-research/bert>`__.
This model was contributed by `thomwolf <https://huggingface.co/thomwolf>`__. The original code can be found `here
<https://github.com/google-research/bert>`__.
BertConfig
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

View File

@@ -71,6 +71,8 @@ Tips:
- This implementation is the same as BERT, except for tokenization method. Refer to the :doc:`documentation of BERT
<bert>` for more usage examples.
This model was contributed by `cl-tohoku <https://huggingface.co/cl-tohoku>`__.
BertJapaneseTokenizer
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

View File

@@ -79,7 +79,8 @@ Tips:
- For summarization, sentence splitting, sentence fusion and translation, no special tokens are required for the input.
Therefore, no EOS token should be added to the end of the input.
The original code can be found `here <https://tfhub.dev/s?module-type=text-generation&subtype=module,placeholder>`__.
This model was contributed by `patrickvonplaten <https://huggingface.co/patrickvonplaten>`__. The original code can be
found `here <https://tfhub.dev/s?module-type=text-generation&subtype=module,placeholder>`__.
BertGenerationConfig
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

View File

@@ -54,8 +54,8 @@ Example of use:
>>> # from transformers import TFAutoModel
>>> # bertweet = TFAutoModel.from_pretrained("vinai/bertweet-base")
The original code can be found `here <https://github.com/VinAIResearch/BERTweet>`__.
This model was contributed by `dqnguyen <https://huggingface.co/dqnguyen>`__. The original code can be found `here
<https://github.com/VinAIResearch/BERTweet>`__.
BertweetTokenizer
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

View File

@@ -50,7 +50,8 @@ Tips:
- Current implementation supports only **ITC**.
- Current implementation doesn't support **num_random_blocks = 0**
The original code can be found `here <https://github.com/google-research/bigbird>`__.
This model was contributed by `vasudevgupta <https://huggingface.co/vasudevgupta>`__. The original code can be found
`here <https://github.com/google-research/bigbird>`__.
BigBirdConfig
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
@@ -66,6 +67,11 @@ BigBirdTokenizer
:members: build_inputs_with_special_tokens, get_special_tokens_mask,
create_token_type_ids_from_sequences, save_vocabulary
BigBirdTokenizerFast
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
.. autoclass:: transformers.BigBirdTokenizerFast
:members:
BigBird specific outputs
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

View File

@@ -0,0 +1,98 @@
..
Copyright 2021 The HuggingFace Team. All rights reserved.
Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with
the License. You may obtain a copy of the License at
http://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on
an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the
specific language governing permissions and limitations under the License.
BigBirdPegasus
-----------------------------------------------------------------------------------------------------------------------
Overview
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
The BigBird model was proposed in `Big Bird: Transformers for Longer Sequences <https://arxiv.org/abs/2007.14062>`__ by
Zaheer, Manzil and Guruganesh, Guru and Dubey, Kumar Avinava and Ainslie, Joshua and Alberti, Chris and Ontanon,
Santiago and Pham, Philip and Ravula, Anirudh and Wang, Qifan and Yang, Li and others. BigBird, is a sparse-attention
based transformer which extends Transformer based models, such as BERT to much longer sequences. In addition to sparse
attention, BigBird also applies global attention as well as random attention to the input sequence. Theoretically, it
has been shown that applying sparse, global, and random attention approximates full attention, while being
computationally much more efficient for longer sequences. As a consequence of the capability to handle longer context,
BigBird has shown improved performance on various long document NLP tasks, such as question answering and
summarization, compared to BERT or RoBERTa.
The abstract from the paper is the following:
*Transformers-based models, such as BERT, have been one of the most successful deep learning models for NLP.
Unfortunately, one of their core limitations is the quadratic dependency (mainly in terms of memory) on the sequence
length due to their full attention mechanism. To remedy this, we propose, BigBird, a sparse attention mechanism that
reduces this quadratic dependency to linear. We show that BigBird is a universal approximator of sequence functions and
is Turing complete, thereby preserving these properties of the quadratic, full attention model. Along the way, our
theoretical analysis reveals some of the benefits of having O(1) global tokens (such as CLS), that attend to the entire
sequence as part of the sparse attention mechanism. The proposed sparse attention can handle sequences of length up to
8x of what was previously possible using similar hardware. As a consequence of the capability to handle longer context,
BigBird drastically improves performance on various NLP tasks such as question answering and summarization. We also
propose novel applications to genomics data.*
Tips:
- For an in-detail explanation on how BigBird's attention works, see `this blog post
<https://huggingface.co/blog/big-bird>`__.
- BigBird comes with 2 implementations: **original_full** & **block_sparse**. For the sequence length < 1024, using
**original_full** is advised as there is no benefit in using **block_sparse** attention.
- The code currently uses window size of 3 blocks and 2 global blocks.
- Sequence length must be divisible by block size.
- Current implementation supports only **ITC**.
- Current implementation doesn't support **num_random_blocks = 0**.
- BigBirdPegasus uses the `PegasusTokenizer
<https://github.com/huggingface/transformers/blob/master/src/transformers/models/pegasus/tokenization_pegasus.py>`__.
The original code can be found `here <https://github.com/google-research/bigbird>`__.
BigBirdPegasusConfig
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
.. autoclass:: transformers.BigBirdPegasusConfig
:members:
BigBirdPegasusModel
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
.. autoclass:: transformers.BigBirdPegasusModel
:members: forward
BigBirdPegasusForConditionalGeneration
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
.. autoclass:: transformers.BigBirdPegasusForConditionalGeneration
:members: forward
BigBirdPegasusForSequenceClassification
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
.. autoclass:: transformers.BigBirdPegasusForSequenceClassification
:members: forward
BigBirdPegasusForQuestionAnswering
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
.. autoclass:: transformers.BigBirdPegasusForQuestionAnswering
:members: forward
BigBirdPegasusForCausalLM
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
.. autoclass:: transformers.BigBirdPegasusForCausalLM
:members: forward

View File

@@ -36,7 +36,8 @@ and code publicly available. Human evaluations show our best models are superior
dialogue in terms of engagingness and humanness measurements. We then discuss the limitations of this work by analyzing
failure cases of our models.*
The authors' code can be found `here <https://github.com/facebookresearch/ParlAI>`__ .
This model was contributed by `sshleifer <https://huggingface.co/sshleifer>`__. The authors' code can be found `here
<https://github.com/facebookresearch/ParlAI>`__ .
Implementation Notes

View File

@@ -39,7 +39,8 @@ and code publicly available. Human evaluations show our best models are superior
dialogue in terms of engagingness and humanness measurements. We then discuss the limitations of this work by analyzing
failure cases of our models.*
The authors' code can be found `here <https://github.com/facebookresearch/ParlAI>`__ .
This model was contributed by `patrickvonplaten <https://huggingface.co/patrickvonplaten>`__. The authors' code can be
found `here <https://github.com/facebookresearch/ParlAI>`__ .
BlenderbotSmallConfig
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

View File

@@ -43,4 +43,5 @@ Tips:
that is sadly not open-sourced yet. It would be very useful for the community, if someone tries to implement the
algorithm to make BORT fine-tuning work.
The original code can be found `here <https://github.com/alexa/bort/>`__.
This model was contributed by `stefan-it <https://huggingface.co/stefan-it>`__. The original code can be found `here
<https://github.com/alexa/bort/>`__.

View File

@@ -37,7 +37,8 @@ Tips:
- This implementation is the same as RoBERTa. Refer to the :doc:`documentation of RoBERTa <roberta>` for usage examples
as well as the information relative to the inputs and outputs.
The original code can be found `here <https://camembert-model.fr/>`__.
This model was contributed by `camembert <https://huggingface.co/camembert>`__. The original code can be found `here
<https://camembert-model.fr/>`__.
CamembertConfig
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

View File

@@ -0,0 +1,154 @@
..
Copyright 2021 The HuggingFace Team. All rights reserved.
Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with
the License. You may obtain a copy of the License at
http://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on
an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the
specific language governing permissions and limitations under the License.
CLIP
-----------------------------------------------------------------------------------------------------------------------
Overview
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
The CLIP model was proposed in `Learning Transferable Visual Models From Natural Language Supervision
<https://arxiv.org/abs/2103.00020>`__ by Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh,
Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, Gretchen Krueger, Ilya Sutskever. CLIP
(Contrastive Language-Image Pre-Training) is a neural network trained on a variety of (image, text) pairs. It can be
instructed in natural language to predict the most relevant text snippet, given an image, without directly optimizing
for the task, similarly to the zero-shot capabilities of GPT-2 and 3.
The abstract from the paper is the following:
*State-of-the-art computer vision systems are trained to predict a fixed set of predetermined object categories. This
restricted form of supervision limits their generality and usability since additional labeled data is needed to specify
any other visual concept. Learning directly from raw text about images is a promising alternative which leverages a
much broader source of supervision. We demonstrate that the simple pre-training task of predicting which caption goes
with which image is an efficient and scalable way to learn SOTA image representations from scratch on a dataset of 400
million (image, text) pairs collected from the internet. After pre-training, natural language is used to reference
learned visual concepts (or describe new ones) enabling zero-shot transfer of the model to downstream tasks. We study
the performance of this approach by benchmarking on over 30 different existing computer vision datasets, spanning tasks
such as OCR, action recognition in videos, geo-localization, and many types of fine-grained object classification. The
model transfers non-trivially to most tasks and is often competitive with a fully supervised baseline without the need
for any dataset specific training. For instance, we match the accuracy of the original ResNet-50 on ImageNet zero-shot
without needing to use any of the 1.28 million training examples it was trained on. We release our code and pre-trained
model weights at this https URL.*
Usage
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
CLIP is a multi-modal vision and language model. It can be used for image-text similarity and for zero-shot image
classification. CLIP uses a ViT like transformer to get visual features and a causal language model to get the text
features. Both the text and visual features are then projected to a latent space with identical dimension. The dot
product between the projected image and text features is then used as a similar score.
To feed images to the Transformer encoder, each image is split into a sequence of fixed-size non-overlapping patches,
which are then linearly embedded. A [CLS] token is added to serve as representation of an entire image. The authors
also add absolute position embeddings, and feed the resulting sequence of vectors to a standard Transformer encoder.
The :class:`~transformers.CLIPFeatureExtractor` can be used to resize (or rescale) and normalize images for the model.
The :class:`~transformers.CLIPTokenizer` is used to encode the text. The :class:`~transformers.CLIPProcessor` wraps
:class:`~transformers.CLIPFeatureExtractor` and :class:`~transformers.CLIPTokenizer` into a single instance to both
encode the text and prepare the images. The following example shows how to get the image-text similarity scores using
:class:`~transformers.CLIPProcessor` and :class:`~transformers.CLIPModel`.
.. code-block::
>>> import torch
>>> from PIL import Image
>>> import requests
>>> from transformers import CLIPProcessor, CLIPModel
>>> model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
>>> processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")
>>> url = "http://images.cocodataset.org/val2017/000000039769.jpg"
>>> image = Image.open(requests.get(url, stream=True).raw)
>>> inputs = processor(text=["a photo of a cat", "a photo of a dog"], images=image, return_tensors="pt", padding=True)
>>> outputs = model(**inputs)
>>> logits_per_image = outputs.logits_per_image # this is the image-text similarity score
>>> probs = logits_per_image.softmax(dim=1) # we can take the softmax to get the label probabilities
This model was contributed by `valhalla <https://huggingface.co/valhalla>`__. The original code can be found `here
<https://github.com/openai/CLIP>`__.
CLIPConfig
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
.. autoclass:: transformers.CLIPConfig
:members: from_text_vision_configs
CLIPTextConfig
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
.. autoclass:: transformers.CLIPTextConfig
:members:
CLIPVisionConfig
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
.. autoclass:: transformers.CLIPVisionConfig
:members:
CLIPTokenizer
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
.. autoclass:: transformers.CLIPTokenizer
:members: build_inputs_with_special_tokens, get_special_tokens_mask,
create_token_type_ids_from_sequences, save_vocabulary
CLIPTokenizerFast
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
.. autoclass:: transformers.CLIPTokenizerFast
:members:
CLIPFeatureExtractor
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
.. autoclass:: transformers.CLIPFeatureExtractor
:members:
CLIPProcessor
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
.. autoclass:: transformers.CLIPProcessor
:members:
CLIPModel
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
.. autoclass:: transformers.CLIPModel
:members: forward, get_text_features, get_image_features
CLIPTextModel
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
.. autoclass:: transformers.CLIPTextModel
:members: forward
CLIPVisionModel
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
.. autoclass:: transformers.CLIPVisionModel
:members: forward

View File

@@ -34,8 +34,10 @@ ConvBERT significantly outperforms BERT and its variants in various downstream t
fewer model parameters. Remarkably, ConvBERTbase model achieves 86.4 GLUE score, 0.7 higher than ELECTRAbase, while
using less than 1/4 training cost. Code and pre-trained models will be released.*
ConvBERT training tips are similar to those of BERT. The original implementation can be found here:
https://github.com/yitu-opensource/ConvBert
ConvBERT training tips are similar to those of BERT.
This model was contributed by `abhishek <https://huggingface.co/abhishek>`__. The original implementation can be found
here: https://github.com/yitu-opensource/ConvBert
ConvBertConfig
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

View File

@@ -33,7 +33,8 @@ language model, which could facilitate several downstream Chinese NLP tasks, suc
cloze test, and language understanding. Extensive experiments demonstrate that CPM achieves strong performance on many
NLP tasks in the settings of few-shot (even zero-shot) learning.*
The original implementation can be found here: https://github.com/TsinghuaAI/CPM-Generate
This model was contributed by `canwenxu <https://huggingface.co/canwenxu>`__. The original implementation can be found
here: https://github.com/TsinghuaAI/CPM-Generate
Note: We only have a tokenizer here, since the model architecture is the same as GPT-2.

View File

@@ -46,7 +46,8 @@ Tips:
`reusing the past in generative models <../quickstart.html#using-the-past>`__ for more information on the usage of
this argument.
The original code can be found `here <https://github.com/salesforce/ctrl>`__.
This model was contributed by `keskarnitishr <https://huggingface.co/keskarnitishr>`__. The original code can be found
`here <https://github.com/salesforce/ctrl>`__.
CTRLConfig

View File

@@ -38,7 +38,8 @@ the training data performs consistently better on a wide range of NLP tasks, ach
pre-trained models will be made publicly available at https://github.com/microsoft/DeBERTa.*
The original code can be found `here <https://github.com/microsoft/DeBERTa>`__.
This model was contributed by `DeBERTa <https://huggingface.co/DeBERTa>`__. The original code can be found `here
<https://github.com/microsoft/DeBERTa>`__.
DebertaConfig
@@ -55,6 +56,12 @@ DebertaTokenizer
:members: build_inputs_with_special_tokens, get_special_tokens_mask,
create_token_type_ids_from_sequences, save_vocabulary
DebertaTokenizerFast
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
.. autoclass:: transformers.DebertaTokenizerFast
:members: build_inputs_with_special_tokens, create_token_type_ids_from_sequences
DebertaModel
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

View File

@@ -58,7 +58,8 @@ New in v2:
- **900M model & 1.5B model** Two additional model sizes are available: 900M and 1.5B, which significantly improves the
performance of downstream tasks.
The original code can be found `here <https://github.com/microsoft/DeBERTa>`__.
This model was contributed by `DeBERTa <https://huggingface.co/DeBERTa>`__. The original code can be found `here
<https://github.com/microsoft/DeBERTa>`__.
DebertaV2Config

View File

@@ -73,6 +73,8 @@ Tips:
`facebook/deit-base-patch16-384`. Note that one should use :class:`~transformers.DeiTFeatureExtractor` in order to
prepare images for the model.
This model was contributed by `nielsr <https://huggingface.co/nielsr>`__.
DeiTConfig
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

View File

@@ -44,8 +44,8 @@ Tips:
- DistilBERT doesn't have options to select the input positions (:obj:`position_ids` input). This could be added if
necessary though, just let us know if you need this option.
The original code can be found `here
<https://github.com/huggingface/transformers/tree/master/examples/distillation>`__.
This model was contributed by `victorsanh <https://huggingface.co/victorsanh>`__. The original code can be found
:prefix_link:`here <examples/research-projects/distillation>`.
DistilBertConfig

View File

@@ -30,7 +30,8 @@ our dense retriever outperforms a strong Lucene-BM25 system largely by 9%-19% ab
retrieval accuracy, and helps our end-to-end QA system establish new state-of-the-art on multiple open-domain QA
benchmarks.*
The original code can be found `here <https://github.com/facebookresearch/DPR>`__.
This model was contributed by `lhoestq <https://huggingface.co/lhoestq>`__. The original code can be found `here
<https://github.com/facebookresearch/DPR>`__.
DPRConfig

View File

@@ -54,7 +54,8 @@ Tips:
:class:`~transformers.ElectraForPreTraining` model (the classification head will be randomly initialized as it
doesn't exist in the generator).
The original code can be found `here <https://github.com/google-research/electra>`__.
This model was contributed by `lysandre <https://huggingface.co/lysandre>`__. The original code can be found `here
<https://github.com/google-research/electra>`__.
ElectraConfig
@@ -184,3 +185,52 @@ TFElectraForQuestionAnswering
.. autoclass:: transformers.TFElectraForQuestionAnswering
:members: call
FlaxElectraModel
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
.. autoclass:: transformers.FlaxElectraModel
:members: __call__
FlaxElectraForPreTraining
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
.. autoclass:: transformers.FlaxElectraForPreTraining
:members: __call__
FlaxElectraForMaskedLM
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
.. autoclass:: transformers.FlaxElectraForMaskedLM
:members: __call__
FlaxElectraForSequenceClassification
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
.. autoclass:: transformers.FlaxElectraForSequenceClassification
:members: __call__
FlaxElectraForMultipleChoice
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
.. autoclass:: transformers.FlaxElectraForMultipleChoice
:members: __call__
FlaxElectraForTokenClassification
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
.. autoclass:: transformers.FlaxElectraForTokenClassification
:members: __call__
FlaxElectraForQuestionAnswering
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
.. autoclass:: transformers.FlaxElectraForQuestionAnswering
:members: __call__

View File

@@ -35,7 +35,8 @@ time they outperform other pretraining approaches. Different versions of FlauBER
protocol for the downstream tasks, called FLUE (French Language Understanding Evaluation), are shared to the research
community for further reproducible experiments in French NLP.*
The original code can be found `here <https://github.com/getalp/Flaubert>`__.
This model was contributed by `formiel <https://huggingface.co/formiel>`__. The original code can be found `here
<https://github.com/getalp/Flaubert>`__.
FlaubertConfig

View File

@@ -34,7 +34,8 @@ data, then decode using noisy channel model reranking. Our submissions are ranke
human evaluation campaign. On En->De, our system significantly outperforms other systems as well as human translations.
This system improves upon our WMT'18 submission by 4.5 BLEU points.*
The original code can be found here <https://github.com/pytorch/fairseq/tree/master/examples/wmt19>__.
This model was contributed by `stas <https://huggingface.co/stas>`__. The original code can be found here
<https://github.com/pytorch/fairseq/tree/master/examples/wmt19>__.
Implementation Notes
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

View File

@@ -49,7 +49,8 @@ Tips:
:class:`~transformers.FunnelBaseModel`, :class:`~transformers.FunnelForSequenceClassification` and
:class:`~transformers.FunnelForMultipleChoice`.
The original code can be found `here <https://github.com/laiguokun/Funnel-Transformer>`__.
This model was contributed by `sgugger <https://huggingface.co/sgugger>`__. The original code can be found `here
<https://github.com/laiguokun/Funnel-Transformer>`__.
FunnelConfig

View File

@@ -45,7 +45,8 @@ Tips:
`Write With Transformer <https://transformer.huggingface.co/doc/gpt>`__ is a webapp created and hosted by Hugging Face
showcasing the generative capabilities of several models. GPT is one of them.
The original code can be found `here <https://github.com/openai/finetune-transformer-lm>`__.
This model was contributed by `thomwolf <https://huggingface.co/thomwolf>`__. The original code can be found `here
<https://github.com/openai/finetune-transformer-lm>`__.
Note:

View File

@@ -45,7 +45,8 @@ Tips:
Hugging Face showcasing the generative capabilities of several models. GPT-2 is one of them and is available in five
different sizes: small, medium, large, xl and a distilled version of the small checkpoint: `distilgpt-2`.
The original code can be found `here <https://openai.com/blog/better-language-models/>`__.
This model was contributed by `thomwolf <https://huggingface.co/thomwolf>`__. The original code can be found `here
<https://openai.com/blog/better-language-models/>`__.
GPT2Config

View File

@@ -23,6 +23,8 @@ Black, Stella Biderman, Leo Gao, Phil Wang and Connor Leahy. It is a GPT2 like c
The architecture is similar to GPT2 except that GPT Neo uses local attention in every other layer with a window size of
256 tokens.
This model was contributed by `valhalla <https://huggingface.co/valhalla>`__.
Generation
_______________________________________________________________________________________________________________________

View File

@@ -56,7 +56,9 @@ Examples of use:
>>> model = AutoModel.from_pretrained("allegro/herbert-klej-cased-v1")
The original code can be found `here <https://github.com/allegro/HerBERT>`__.
This model was contributed by `rmroczkowski <https://huggingface.co/rmroczkowski>`__. The original code can be found
`here <https://github.com/allegro/HerBERT>`__.
HerbertTokenizer
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

View File

@@ -36,8 +36,9 @@ the full-precision baseline. Furthermore, our preliminary implementation of I-BE
INT8 inference on a T4 GPU system as compared to FP32 inference. The framework has been developed in PyTorch and has
been open-sourced.*
This model was contributed by `kssteven <https://huggingface.co/kssteven>`__. The original code can be found `here
<https://github.com/kssteven418/I-BERT>`__.
The original code can be found `here <https://github.com/kssteven418/I-BERT>`__.
IBertConfig
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

View File

@@ -80,7 +80,8 @@ occurs. Those can be obtained using the Python Image Library (PIL) library for e
<https://github.com/NielsRogge/Transformers-Tutorials/blob/master/LayoutLM/Fine_tuning_LayoutLMForTokenClassification_on_FUNSD.ipynb>`__.
It includes an inference part, which shows how to use Google's Tesseract on a new document.
The original code can be found `here <https://github.com/microsoft/unilm/tree/master/layoutlm>`_.
This model was contributed by `liminghao1630 <https://huggingface.co/liminghao1630>`__. The original code can be found
`here <https://github.com/microsoft/unilm/tree/master/layoutlm>`_.
LayoutLMConfig

View File

@@ -53,6 +53,8 @@ Tips:
- A notebook showing how to fine-tune LED, can be accessed `here
<https://colab.research.google.com/drive/12LjJazBl7Gam0XBPy_y0CTOJZeZ34c2v?usp=sharing>`__.
This model was contributed by `patrickvonplaten <https://huggingface.co/patrickvonplaten>`__.
LEDConfig
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

View File

@@ -40,7 +40,8 @@ Tips:
token belongs to which segment. Just separate your segments with the separation token :obj:`tokenizer.sep_token` (or
:obj:`</s>`).
The Authors' code can be found `here <https://github.com/allenai/longformer>`__.
This model was contributed by `beltagy <https://huggingface.co/beltagy>`__. The Authors' code can be found `here
<https://github.com/allenai/longformer>`__.
Longformer Self Attention
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

View File

@@ -0,0 +1,159 @@
..
Copyright 2021 The HuggingFace Team. All rights reserved.
Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with
the License. You may obtain a copy of the License at
http://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on
an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the
specific language governing permissions and limitations under the License.
LUKE
-----------------------------------------------------------------------------------------------------------------------
Overview
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
The LUKE model was proposed in `LUKE: Deep Contextualized Entity Representations with Entity-aware Self-attention
<https://arxiv.org/abs/2010.01057>`_ by Ikuya Yamada, Akari Asai, Hiroyuki Shindo, Hideaki Takeda and Yuji Matsumoto.
It is based on RoBERTa and adds entity embeddings as well as an entity-aware self-attention mechanism, which helps
improve performance on various downstream tasks involving reasoning about entities such as named entity recognition,
extractive and cloze-style question answering, entity typing, and relation classification.
The abstract from the paper is the following:
*Entity representations are useful in natural language tasks involving entities. In this paper, we propose new
pretrained contextualized representations of words and entities based on the bidirectional transformer. The proposed
model treats words and entities in a given text as independent tokens, and outputs contextualized representations of
them. Our model is trained using a new pretraining task based on the masked language model of BERT. The task involves
predicting randomly masked words and entities in a large entity-annotated corpus retrieved from Wikipedia. We also
propose an entity-aware self-attention mechanism that is an extension of the self-attention mechanism of the
transformer, and considers the types of tokens (words or entities) when computing attention scores. The proposed model
achieves impressive empirical performance on a wide range of entity-related tasks. In particular, it obtains
state-of-the-art results on five well-known datasets: Open Entity (entity typing), TACRED (relation classification),
CoNLL-2003 (named entity recognition), ReCoRD (cloze-style question answering), and SQuAD 1.1 (extractive question
answering).*
Tips:
- This implementation is the same as :class:`~transformers.RobertaModel` with the addition of entity embeddings as well
as an entity-aware self-attention mechanism, which improves performance on tasks involving reasoning about entities.
- LUKE treats entities as input tokens; therefore, it takes :obj:`entity_ids`, :obj:`entity_attention_mask`,
:obj:`entity_token_type_ids` and :obj:`entity_position_ids` as extra input. You can obtain those using
:class:`~transformers.LukeTokenizer`.
- :class:`~transformers.LukeTokenizer` takes :obj:`entities` and :obj:`entity_spans` (character-based start and end
positions of the entities in the input text) as extra input. :obj:`entities` typically consist of [MASK] entities or
Wikipedia entities. The brief description when inputting these entities are as follows:
- *Inputting [MASK] entities to compute entity representations*: The [MASK] entity is used to mask entities to be
predicted during pretraining. When LUKE receives the [MASK] entity, it tries to predict the original entity by
gathering the information about the entity from the input text. Therefore, the [MASK] entity can be used to address
downstream tasks requiring the information of entities in text such as entity typing, relation classification, and
named entity recognition.
- *Inputting Wikipedia entities to compute knowledge-enhanced token representations*: LUKE learns rich information
(or knowledge) about Wikipedia entities during pretraining and stores the information in its entity embedding. By
using Wikipedia entities as input tokens, LUKE outputs token representations enriched by the information stored in
the embeddings of these entities. This is particularly effective for tasks requiring real-world knowledge, such as
question answering.
- There are three head models for the former use case:
- :class:`~transformers.LukeForEntityClassification`, for tasks to classify a single entity in an input text such as
entity typing, e.g. the `Open Entity dataset <https://www.cs.utexas.edu/~eunsol/html_pages/open_entity.html>`__.
This model places a linear head on top of the output entity representation.
- :class:`~transformers.LukeForEntityPairClassification`, for tasks to classify the relationship between two entities
such as relation classification, e.g. the `TACRED dataset <https://nlp.stanford.edu/projects/tacred/>`__. This
model places a linear head on top of the concatenated output representation of the pair of given entities.
- :class:`~transformers.LukeForEntitySpanClassification`, for tasks to classify the sequence of entity spans, such as
named entity recognition (NER). This model places a linear head on top of the output entity representations. You
can address NER using this model by inputting all possible entity spans in the text to the model.
:class:`~transformers.LukeTokenizer` has a ``task`` argument, which enables you to easily create an input to these
head models by specifying ``task="entity_classification"``, ``task="entity_pair_classification"``, or
``task="entity_span_classification"``. Please refer to the example code of each head models.
There are also 3 notebooks available, which showcase how you can reproduce the results as reported in the paper with
the HuggingFace implementation of LUKE. They can be found `here
<https://github.com/studio-ousia/luke/tree/master/notebooks>`__.
Example:
.. code-block::
>>> from transformers import LukeTokenizer, LukeModel, LukeForEntityPairClassification
>>> model = LukeModel.from_pretrained("studio-ousia/luke-base")
>>> tokenizer = LukeTokenizer.from_pretrained("studio-ousia/luke-base")
# Example 1: Computing the contextualized entity representation corresponding to the entity mention "Beyoncé"
>>> text = "Beyoncé lives in Los Angeles."
>>> entity_spans = [(0, 7)] # character-based entity span corresponding to "Beyoncé"
>>> inputs = tokenizer(text, entity_spans=entity_spans, add_prefix_space=True, return_tensors="pt")
>>> outputs = model(**inputs)
>>> word_last_hidden_state = outputs.last_hidden_state
>>> entity_last_hidden_state = outputs.entity_last_hidden_state
# Example 2: Inputting Wikipedia entities to obtain enriched contextualized representations
>>> entities = ["Beyoncé", "Los Angeles"] # Wikipedia entity titles corresponding to the entity mentions "Beyoncé" and "Los Angeles"
>>> entity_spans = [(0, 7), (17, 28)] # character-based entity spans corresponding to "Beyoncé" and "Los Angeles"
>>> inputs = tokenizer(text, entities=entities, entity_spans=entity_spans, add_prefix_space=True, return_tensors="pt")
>>> outputs = model(**inputs)
>>> word_last_hidden_state = outputs.last_hidden_state
>>> entity_last_hidden_state = outputs.entity_last_hidden_state
# Example 3: Classifying the relationship between two entities using LukeForEntityPairClassification head model
>>> model = LukeForEntityPairClassification.from_pretrained("studio-ousia/luke-large-finetuned-tacred")
>>> tokenizer = LukeTokenizer.from_pretrained("studio-ousia/luke-large-finetuned-tacred")
>>> entity_spans = [(0, 7), (17, 28)] # character-based entity spans corresponding to "Beyoncé" and "Los Angeles"
>>> inputs = tokenizer(text, entity_spans=entity_spans, return_tensors="pt")
>>> outputs = model(**inputs)
>>> logits = outputs.logits
>>> predicted_class_idx = int(logits[0].argmax())
>>> print("Predicted class:", model.config.id2label[predicted_class_idx])
This model was contributed by `ikuyamada <https://huggingface.co/ikuyamada>`__ and `nielsr
<https://huggingface.co/nielsr>`__. The original code can be found `here <https://github.com/studio-ousia/luke>`__.
LukeConfig
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
.. autoclass:: transformers.LukeConfig
:members:
LukeTokenizer
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
.. autoclass:: transformers.LukeTokenizer
:members: __call__, save_vocabulary
LukeModel
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
.. autoclass:: transformers.LukeModel
:members: forward
LukeForEntityClassification
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
.. autoclass:: transformers.LukeForEntityClassification
:members: forward
LukeForEntityPairClassification
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
.. autoclass:: transformers.LukeForEntityPairClassification
:members: forward
LukeForEntitySpanClassification
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
.. autoclass:: transformers.LukeForEntitySpanClassification
:members: forward

View File

@@ -52,7 +52,8 @@ Tips:
contains self-attention for each respective modality and cross-attention, only the cross attention is returned and
both self attention outputs are disregarded.
The original code can be found `here <https://github.com/airsplay/lxmert>`__.
This model was contributed by `eltoto1219 <https://huggingface.co/eltoto1219>`__. The original code can be found `here
<https://github.com/airsplay/lxmert>`__.
LxmertConfig

View File

@@ -34,6 +34,8 @@ to create high quality models. Our focus on non-English-Centric models brings ga
translating between non-English directions while performing competitively to the best single systems of WMT. We
open-source our scripts so that others may reproduce the data, evaluation, and final M2M-100 model.*
This model was contributed by `valhalla <https://huggingface.co/valhalla>`__.
Training and Generation
_______________________________________________________________________________________________________________________

View File

@@ -37,6 +37,7 @@ Implementation Notes
- the model starts generating with :obj:`pad_token_id` (which has 0 as a token_embedding) as the prefix (Bart uses
:obj:`<s/>`),
- Code to bulk convert models can be found in ``convert_marian_to_pytorch.py``.
- This model was contributed by `sshleifer <https://huggingface.co/sshleifer>`__.
Naming
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

View File

@@ -29,7 +29,8 @@ corpora in many languages using the BART objective. mBART is one of the first me
sequence-to-sequence model by denoising full texts in multiple languages, while previous approaches have focused only
on the encoder, decoder, or reconstructing parts of the text.
The Authors' code can be found `here <https://github.com/pytorch/fairseq/tree/master/examples/mbart>`__
This model was contributed by `valhalla <https://huggingface.co/valhalla>`__. The Authors' code can be found `here
<https://github.com/pytorch/fairseq/tree/master/examples/mbart>`__
Training of MBart
_______________________________________________________________________________________________________________________

View File

@@ -77,9 +77,10 @@ The following commands allow you to do the conversion. We assume that the folder
python3 $PATH_TO_TRANSFORMERS/models/megatron_bert/convert_megatron_bert_checkpoint.py megatron_bert_345m_v0_1_cased.zip
The original code can be found `here <https://github.com/NVIDIA/Megatron-LM>`__. That repository contains a multi-GPU
and multi-node implementation of the Megatron Language models. In particular, it contains a hybrid model parallel
approach using "tensor parallel" and "pipeline parallel" techniques.
This model was contributed by `jdemouth <https://huggingface.co/jdemouth>`__. The original code can be found `here
<https://github.com/NVIDIA/Megatron-LM>`__. That repository contains a multi-GPU and multi-node implementation of the
Megatron Language models. In particular, it contains a hybrid model parallel approach using "tensor parallel" and
"pipeline parallel" techniques.
MegatronBertConfig
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

View File

@@ -64,7 +64,8 @@ The following command allows you to do the conversion. We assume that the folder
python3 $PATH_TO_TRANSFORMERS/models/megatron_gpt2/convert_megatron_gpt2_checkpoint.py megatron_gpt2_345m_v0_0.zip
The original code can be found `here <https://github.com/NVIDIA/Megatron-LM>`__. That repository contains a multi-GPU
and multi-node implementation of the Megatron Language models. In particular, it contains a hybrid model parallel
approach using "tensor parallel" and "pipeline parallel" techniques.
This model was contributed by `jdemouth <https://huggingface.co/jdemouth>`__. The original code can be found `here
<https://github.com/NVIDIA/Megatron-LM>`__. That repository contains a multi-GPU and multi-node implementation of the
Megatron Language models. In particular, it contains a hybrid model parallel approach using "tensor parallel" and
"pipeline parallel" techniques.

View File

@@ -44,7 +44,8 @@ Tips:
efficient at predicting masked tokens and at NLU in general, but is not optimal for text generation. Models trained
with a causal language modeling (CLM) objective are better in that regard.
The original code can be found `here <https://github.com/google-research/mobilebert>`__.
This model was contributed by `vshampor <https://huggingface.co/vshampor>`__. The original code can be found `here
<https://github.com/google-research/mobilebert>`__.
MobileBertConfig
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

View File

@@ -28,7 +28,8 @@ multilingual variant of T5 that was pre-trained on a new Common Crawl-based data
the design and modified training of mT5 and demonstrate its state-of-the-art performance on many multilingual
benchmarks. All of the code and model checkpoints*
The original code can be found `here <https://github.com/google-research/multilingual-t5>`__.
This model was contributed by `patrickvonplaten <https://huggingface.co/patrickvonplaten>`__. The original code can be
found `here <https://github.com/google-research/multilingual-t5>`__.
MT5Config
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

View File

@@ -31,7 +31,8 @@ According to the abstract,
extractive summary.
- Pegasus achieves SOTA summarization performance on all 12 downstream tasks, as measured by ROUGE and human eval.
The Authors' code can be found `here <https://github.com/google-research/pegasus>`__.
This model was contributed by `sshleifer <https://huggingface.co/sshleifer>`__. The Authors' code can be found `here
<https://github.com/google-research/pegasus>`__.
Checkpoints
@@ -52,7 +53,8 @@ Examples
_______________________________________________________________________________________________________________________
- :prefix_link:`Script <examples/research_projects/seq2seq-distillation/finetune_pegasus_xsum.sh>` to fine-tune pegasus
on the XSUM dataset. Data download instructions at :prefix_link:`examples/seq2seq/ <examples/seq2seq/README.md>`.
on the XSUM dataset. Data download instructions at :prefix_link:`examples/pytorch/summarization/
<examples/pytorch/summarization/README.md>`.
- FP16 is not supported (help/ideas on this appreciated!).
- The adafactor optimizer is recommended for pegasus fine-tuning.

View File

@@ -50,7 +50,7 @@ Example of use:
>>> # phobert = TFAutoModel.from_pretrained("vinai/phobert-base")
The original code can be found `here <https://github.com/VinAIResearch/PhoBERT>`__.
This model was contributed by `dqnguyen <https://huggingface.co/dqnguyen>`__. The original code can be found `here <https://github.com/VinAIResearch/PhoBERT>`__.
PhobertTokenizer
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

View File

@@ -43,6 +43,7 @@ outperforming parametric seq2seq models and task-specific retrieve-and-extract a
tasks, we find that RAG models generate more specific, diverse and factual language than a state-of-the-art
parametric-only seq2seq baseline.*
This model was contributed by `ola13 <https://huggingface.co/ola13>`__.
RagConfig

View File

@@ -32,7 +32,8 @@ layers instead of the standard residuals, which allows storing activations only
N times, where N is the number of layers. The resulting model, the Reformer, performs on par with Transformer models
while being much more memory-efficient and much faster on long sequences.*
The Authors' code can be found `here <https://github.com/google/trax/tree/master/trax/models/reformer>`__.
This model was contributed by `patrickvonplaten <https://huggingface.co/patrickvonplaten>`__. The Authors' code can be
found `here <https://github.com/google/trax/tree/master/trax/models/reformer>`__.
Axial Positional Encodings
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

View File

@@ -20,8 +20,8 @@ The RetriBERT model was proposed in the blog post `Explain Anything Like I'm Fiv
Question Answering <https://yjernite.github.io/lfqa.html>`__. RetriBERT is a small model that uses either a single or
pair of BERT encoders with lower-dimension projection for dense semantic indexing of text.
Code to train and use the model can be found `here
<https://github.com/huggingface/transformers/tree/master/examples/distillation>`__.
This model was contributed by `yjernite <https://huggingface.co/yjernite>`__. Code to train and use the model can be
found :prefix_link:`here <examples/research-projects/distillation>`.
RetriBertConfig

View File

@@ -44,7 +44,8 @@ Tips:
separate your segments with the separation token :obj:`tokenizer.sep_token` (or :obj:`</s>`)
- :doc:`CamemBERT <camembert>` is a wrapper around RoBERTa. Refer to this page for usage examples.
The original code can be found `here <https://github.com/pytorch/fairseq/tree/master/examples/roberta>`_.
This model was contributed by `julien-c <https://huggingface.co/julien-c>`__. The original code can be found `here
<https://github.com/pytorch/fairseq/tree/master/examples/roberta>`_.
RobertaConfig
@@ -165,3 +166,38 @@ FlaxRobertaModel
.. autoclass:: transformers.FlaxRobertaModel
:members: __call__
FlaxRobertaForMaskedLM
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
.. autoclass:: transformers.FlaxRobertaForMaskedLM
:members: __call__
FlaxRobertaForSequenceClassification
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
.. autoclass:: transformers.FlaxRobertaForSequenceClassification
:members: __call__
FlaxRobertaForMultipleChoice
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
.. autoclass:: transformers.FlaxRobertaForMultipleChoice
:members: __call__
FlaxRobertaForTokenClassification
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
.. autoclass:: transformers.FlaxRobertaForTokenClassification
:members: __call__
FlaxRobertaForQuestionAnswering
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
.. autoclass:: transformers.FlaxRobertaForQuestionAnswering
:members: __call__

View File

@@ -25,7 +25,8 @@ transcripts/translations autoregressively. Speech2Text has been fine-tuned on se
`LibriSpeech <http://www.openslr.org/12>`__, `CoVoST 2 <https://github.com/facebookresearch/covost>`__, `MuST-C
<https://ict.fbk.eu/must-c/>`__.
The original code can be found `here <https://github.com/pytorch/fairseq/tree/master/examples/speech_to_text>`__.
This model was contributed by `valhalla <https://huggingface.co/valhalla>`__. The original code can be found `here
<https://github.com/pytorch/fairseq/tree/master/examples/speech_to_text>`__.
Inference

View File

@@ -47,6 +47,9 @@ Tips:
- For best results when finetuning on sequence classification tasks, it is recommended to start with the
`squeezebert/squeezebert-mnli-headless` checkpoint.
This model was contributed by `forresti <https://huggingface.co/forresti>`__.
SqueezeBertConfig
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

View File

@@ -48,7 +48,8 @@ Tips:
layers to the decoder and auto-regressively generates the decoder output. - T5 uses relative scalar embeddings.
Encoder input padding can be done on the left and on the right.
The original code can be found `here <https://github.com/google-research/text-to-text-transfer-transformer>`__.
This model was contributed by `thomwolf <https://huggingface.co/thomwolf>`__. The original code can be found `here
<https://github.com/google-research/text-to-text-transfer-transformer>`__.
Training
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

View File

@@ -49,7 +49,8 @@ entailment (a binary classification task). For more details, see their follow-up
intermediate pre-training <https://www.aclweb.org/anthology/2020.findings-emnlp.27/>`__ by Julian Martin Eisenschlos,
Syrine Krichene and Thomas Müller.
The original code can be found `here <https://github.com/google-research/tapas>`__.
This model was contributed by `nielsr <https://huggingface.co/nielsr>`__. The original code can be found `here
<https://github.com/google-research/tapas>`__.
Tips:

View File

@@ -41,7 +41,8 @@ Tips:
original implementation trains on SQuAD with padding on the left, therefore the padding defaults are set to left.
- Transformer-XL is one of the few models that has no sequence length limit.
The original code can be found `here <https://github.com/kimiyoung/transformer-xl>`__.
This model was contributed by `thomwolf <https://huggingface.co/thomwolf>`__. The original code can be found `here
<https://github.com/kimiyoung/transformer-xl>`__.
TransfoXLConfig

View File

@@ -67,7 +67,8 @@ Tips:
improvement of 2% to training from scratch, but still 4% behind supervised pre-training.
The original code (written in JAX) can be found `here <https://github.com/google-research/vision_transformer>`__.
This model was contributed by `nielsr <https://huggingface.co/nielsr>`__. The original code (written in JAX) can be
found `here <https://github.com/google-research/vision_transformer>`__.
Note that we converted the weights from Ross Wightman's `timm library
<https://github.com/rwightman/pytorch-image-models>`__, who already converted the weights from JAX to PyTorch. Credits

View File

@@ -36,6 +36,8 @@ Tips:
- Wav2Vec2 model was trained using connectionist temporal classification (CTC) so the model output has to be decoded
using :class:`~transformers.Wav2Vec2CTCTokenizer`.
This model was contributed by `patrickvonplaten <https://huggingface.co/patrickvonplaten>`__.
Wav2Vec2Config
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

View File

@@ -42,7 +42,8 @@ Tips:
- XLM has multilingual checkpoints which leverage a specific :obj:`lang` parameter. Check out the :doc:`multi-lingual
<../multilingual>` page for more information.
The original code can be found `here <https://github.com/facebookresearch/XLM/>`__.
This model was contributed by `thomwolf <https://huggingface.co/thomwolf>`__. The original code can be found `here
<https://github.com/facebookresearch/XLM/>`__.
XLMConfig

View File

@@ -44,7 +44,8 @@ Tips:
- This implementation is the same as RoBERTa. Refer to the :doc:`documentation of RoBERTa <roberta>` for usage examples
as well as the information relative to the inputs and outputs.
The original code can be found `here <https://github.com/pytorch/fairseq/tree/master/examples/xlmr>`__.
This model was contributed by `stefan-it <https://huggingface.co/stefan-it>`__. The original code can be found `here
<https://github.com/pytorch/fairseq/tree/master/examples/xlmr>`__.
XLMRobertaConfig

View File

@@ -41,10 +41,11 @@ Tips:
using only a sub-set of the output tokens as target which are selected with the :obj:`target_mapping` input.
- To use XLNet for sequential decoding (i.e. not in fully bi-directional setting), use the :obj:`perm_mask` and
:obj:`target_mapping` inputs to control the attention span and outputs (see examples in
`examples/text-generation/run_generation.py`)
`examples/pytorch/text-generation/run_generation.py`)
- XLNet is one of the few models that has no sequence length limit.
The original code can be found `here <https://github.com/zihangdai/xlnet/>`__.
This model was contributed by `thomwolf <https://huggingface.co/thomwolf>`__. The original code can be found `here
<https://github.com/zihangdai/xlnet/>`__.
XLNetConfig

View File

@@ -22,8 +22,6 @@ the `model hub <https://huggingface.co/models>`__.
Optionally, you can join an existing organization or create a new one.
Prepare your model for uploading
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
We have seen in the :doc:`training tutorial <training>`: how to fine-tune a model on a given task. You have probably
done something similar on your task, either using the model directly in your own training loop or using the
@@ -31,7 +29,7 @@ done something similar on your task, either using the model directly in your own
`model hub <https://huggingface.co/models>`__.
Model versioning
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Since version v3.5.0, the model hub has built-in model versioning based on git and git-lfs. It is based on the paradigm
that one model *is* one repo.
@@ -54,6 +52,106 @@ For instance:
>>> revision="v2.0.1" # tag name, or branch name, or commit hash
>>> )
Push your model from Python
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Preparation
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
The first step is to make sure your credentials to the hub are stored somewhere. This can be done in two ways. If you
have access to a terminal, you cam just run the following command in the virtual environment where you installed 🤗
Transformers:
.. code-block:: bash
transformers-cli login
It will store your access token in the Hugging Face cache folder (by default :obj:`~/.cache/`).
If you don't have an easy access to a terminal (for instance in a Colab session), you can find a token linked to your
acount by going on `huggingface.co <https://huggingface.co/>`, click on your avatar on the top left corner, then on
`Edit profile` on the left, just beneath your profile picture. In the submenu `API Tokens`, you will find your API
token that you can just copy.
Directly push your model to the hub
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
Once you have an API token (either stored in the cache or copied and pasted in your notebook), you can directly push a
finetuned model you saved in :obj:`save_drectory` by calling:
.. code-block:: python
finetuned_model.push_to_hub("my-awesome-model")
If you have your API token not stored in the cache, you will need to pass it with :obj:`use_auth_token=your_token`.
This is also be the case for all the examples below, so we won't mention it again.
This will create a repository in your namespace name :obj:`my-awesome-model`, so anyone can now run:
.. code-block:: python
from transformers import AutoModel
model = AutoModel.from_pretrained("your_username/my-awesome-model")
Even better, you can combine this push to the hub with the call to :obj:`save_pretrained`:
.. code-block:: python
finetuned_model.save_pretrained(save_directory, push_to_hub=True, repo_name="my-awesome-model")
If you are a premium user and want your model to be private, just add :obj:`private=True` to this call.
If you are a member of an organization and want to push it inside the namespace of the organization instead of yours,
just add :obj:`organization=my_amazing_org`.
Add new files to your model repo
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
Once you have pushed your model to the hub, you might want to add the tokenizer, or a version of your model for another
framework (TensorFlow, PyTorch, Flax). This is super easy to do! Let's begin with the tokenizer. You can add it to the
repo you created before like this
.. code-block:: python
tokenizer.push_to_hub("my-awesome-model")
If you know its URL (it should be :obj:`https://huggingface.co/username/repo_name`), you can also do:
.. code-block:: python
tokenizer.push_to_hub(repo_url=my_repo_url)
And that's all there is to it! It's also a very easy way to fix a mistake if one of the files online had a bug.
To add a model for another backend, it's also super easy. Let's say you have fine-tuned a TensorFlow model and want to
add the pytorch model files to your model repo, so that anyone in the community can use it. The following allows you to
directly create a PyTorch version of your TensorFlow model:
.. code-block:: python
from transfomers import AutoModel
model = AutoModel.from_pretrained(save_directory, from_tf=True)
You can also replace :obj:`save_directory` by the identifier of your model (:obj:`username/repo_name`) if you don't
have a local save of it anymore. Then, just do the same as before:
.. code-block:: python
model.push_to_hub("my-awesome-model")
or
.. code-block:: python
model.push_to_hub(repo_url=my_repo_url)
Use your terminal and git
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Basic steps
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

View File

@@ -682,7 +682,8 @@ The `mbart-large-en-ro checkpoint <https://huggingface.co/facebook/mbart-large-e
romanian translation.
The `mbart-large-cc25 <https://huggingface.co/facebook/mbart-large-cc25>`_ checkpoint can be finetuned for other
translation and summarization tasks, using code in ```examples/seq2seq/``` , but is not very useful without finetuning.
translation and summarization tasks, using code in ```examples/pytorch/translation/``` , but is not very useful without
finetuning.
ProphetNet

View File

@@ -90,8 +90,8 @@ You can then feed it all as input to your model:
>>> outputs = model(input_ids, langs=langs)
The example :prefix_link:`run_generation.py <examples/text-generation/run_generation.py>` can generate text using the
CLM checkpoints from XLM, using the language embeddings.
The example :prefix_link:`run_generation.py <examples/pytorch/text-generation/run_generation.py>` can generate text
using the CLM checkpoints from XLM, using the language embeddings.
XLM without Language Embeddings
-----------------------------------------------------------------------------------------------------------------------

View File

@@ -238,23 +238,22 @@ keys directly to tensors, for a PyTorch model, you need to unpack the dictionary
>>> ## TENSORFLOW CODE
>>> tf_outputs = tf_model(tf_batch)
In 🤗 Transformers, all outputs are tuples (with only one element potentially). Here, we get a tuple with just the final
activations of the model.
In 🤗 Transformers, all outputs are objects that contain the model's final activations along with other metadata. These
objects are described in greater detail :doc:`here <main_classes/output>`. For now, let's inspect the output ourselves:
.. code-block::
>>> ## PYTORCH CODE
>>> print(pt_outputs)
(tensor([[-4.0833, 4.3364],
[ 0.0818, -0.0418]], grad_fn=<AddmmBackward>),)
SequenceClassifierOutput(loss=None, logits=tensor([[-4.0833, 4.3364],
[ 0.0818, -0.0418]], grad_fn=<AddmmBackward>), hidden_states=None, attentions=None)
>>> ## TENSORFLOW CODE
>>> print(tf_outputs)
(<tf.Tensor: shape=(2, 2), dtype=float32, numpy=
array([[-4.0832963 , 4.336414 ],
[ 0.08181786, -0.04179301]], dtype=float32)>,)
TFSequenceClassifierOutput(loss=None, logits=<tf.Tensor: shape=(2, 2), dtype=float32, numpy=
array([[-4.0832963 , 4.3364143 ],
[ 0.081807 , -0.04178282]], dtype=float32)>, hidden_states=None, attentions=None)
The model can return more than just the final activations, which is why the output is a tuple. Here we only asked for
the final activations, so we get a tuple with one element.
Notice how the output object has a ``logits`` attribute. You can use this to access the model's final activations.
.. note::
@@ -267,10 +266,10 @@ Let's apply the SoftMax activation to get predictions.
>>> ## PYTORCH CODE
>>> import torch.nn.functional as F
>>> pt_predictions = F.softmax(pt_outputs[0], dim=-1)
>>> pt_predictions = F.softmax(pt_outputs.logits, dim=-1)
>>> ## TENSORFLOW CODE
>>> import tensorflow as tf
>>> tf_predictions = tf.nn.softmax(tf_outputs[0], axis=-1)
>>> tf.nn.softmax(tf_outputs.logits, axis=-1)
We can see we get the numbers from before:
@@ -286,16 +285,24 @@ We can see we get the numbers from before:
tensor([[2.2043e-04, 9.9978e-01],
[5.3086e-01, 4.6914e-01]], grad_fn=<SoftmaxBackward>)
If you have labels, you can provide them to the model, it will return a tuple with the loss and the final activations.
If you provide the model with labels in addition to inputs, the model output object will also contain a ``loss``
attribute:
.. code-block::
>>> ## PYTORCH CODE
>>> import torch
>>> pt_outputs = pt_model(**pt_batch, labels = torch.tensor([1, 0]))
>>> print(pt_outputs)
SequenceClassifierOutput(loss=tensor(0.3167, grad_fn=<NllLossBackward>), logits=tensor([[-4.0833, 4.3364],
[ 0.0818, -0.0418]], grad_fn=<AddmmBackward>), hidden_states=None, attentions=None)
>>> ## TENSORFLOW CODE
>>> import tensorflow as tf
>>> tf_outputs = tf_model(tf_batch, labels = tf.constant([1, 0]))
>>> print(tf_outputs)
TFSequenceClassifierOutput(loss=<tf.Tensor: shape=(2,), dtype=float32, numpy=array([2.2051287e-04, 6.3326043e-01], dtype=float32)>, logits=<tf.Tensor: shape=(2, 2), dtype=float32, numpy=
array([[-4.0832963 , 4.3364143 ],
[ 0.081807 , -0.04178282]], dtype=float32)>, hidden_states=None, attentions=None)
Models are standard `torch.nn.Module <https://pytorch.org/docs/stable/nn.html#torch.nn.Module>`__ or `tf.keras.Model
<https://www.tensorflow.org/api_docs/python/tf/keras/Model>`__ so you can use them in your usual training loop. 🤗
@@ -323,6 +330,7 @@ loading a saved PyTorch model in a TensorFlow model, use :func:`~transformers.TF
.. code-block::
from transformers import TFAutoModel
tokenizer = AutoTokenizer.from_pretrained(save_directory)
model = TFAutoModel.from_pretrained(save_directory, from_pt=True)
@@ -330,6 +338,7 @@ and if you are loading a saved TensorFlow model in a PyTorch model, you should u
.. code-block::
from transformers import AutoModel
tokenizer = AutoTokenizer.from_pretrained(save_directory)
model = AutoModel.from_pretrained(save_directory, from_tf=True)
@@ -340,10 +349,12 @@ Lastly, you can also ask the model to return all hidden states and all attention
>>> ## PYTORCH CODE
>>> pt_outputs = pt_model(**pt_batch, output_hidden_states=True, output_attentions=True)
>>> all_hidden_states, all_attentions = pt_outputs[-2:]
>>> all_hidden_states = pt_outputs.hidden_states
>>> all_attentions = pt_outputs.attentions
>>> ## TENSORFLOW CODE
>>> tf_outputs = tf_model(tf_batch, output_hidden_states=True, output_attentions=True)
>>> all_hidden_states, all_attentions = tf_outputs[-2:]
>>> all_hidden_states = tf_outputs.hidden_states
>>> all_attentions = tf_outputs.attentions
Accessing the code
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
@@ -376,16 +387,16 @@ directly instantiate model and tokenizer without the auto magic:
Customizing the model
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
If you want to change how the model itself is built, you can define your custom configuration class. Each architecture
comes with its own relevant configuration (in the case of DistilBERT, :class:`~transformers.DistilBertConfig`) which
allows you to specify any of the hidden dimension, dropout rate, etc. If you do core modifications, like changing the
hidden size, you won't be able to use a pretrained model anymore and will need to train from scratch. You would then
instantiate the model directly from this configuration.
If you want to change how the model itself is built, you can define a custom configuration class. Each architecture
comes with its own relevant configuration. For example, :class:`~transformers.DistilBertConfig` allows you to specify
parameters such as the hidden dimension, dropout rate, etc for DistilBERT. If you do core modifications, like changing
the hidden size, you won't be able to use a pretrained model anymore and will need to train from scratch. You would
then instantiate the model directly from this configuration.
Here we use the predefined vocabulary of DistilBERT (hence load the tokenizer with the
:func:`~transformers.DistilBertTokenizer.from_pretrained` method) and initialize the model from scratch (hence
instantiate the model from the configuration instead of using the
:func:`~transformers.DistilBertForSequenceClassification.from_pretrained` method).
Below, we load a predefined vocabulary for a tokenizer with the
:func:`~transformers.DistilBertTokenizer.from_pretrained` method. However, unlike the tokenizer, we wish to initialize
the model from scratch. Therefore, we instantiate the model from a configuration instead of using the
:func:`~transformers.DistilBertForSequenceClassification.from_pretrained` method.
.. code-block::
@@ -402,9 +413,9 @@ instantiate the model from the configuration instead of using the
For something that only changes the head of the model (for instance, the number of labels), you can still use a
pretrained model for the body. For instance, let's define a classifier for 10 different labels using a pretrained body.
We could create a configuration with all the default values and just change the number of labels, but more easily, you
can directly pass any argument a configuration would take to the :func:`from_pretrained` method and it will update the
default configuration with it:
Instead of creating a new configuration with all the default values just to change the number of labels, we can instead
pass any argument a configuration would take to the :func:`from_pretrained` method and it will update the default
configuration appropriately:
.. code-block::

View File

@@ -325,7 +325,7 @@ When you create a `HuggingFace` Estimator, you can specify a [training script th
If you are using `git_config` to run the [🤗 Transformers examples scripts](https://github.com/huggingface/transformers/tree/master/examples) keep in mind that you need to configure the right `'branch'` for you `transformers_version`, e.g. if you use `transformers_version='4.4.2` you have to use `'branch':'v4.4.2'`.
As an example to use `git_config` with an [example script from the transformers repository](https://github.com/huggingface/transformers/tree/master/examples/text-classification).
As an example to use `git_config` with an [example script from the transformers repository](https://github.com/huggingface/transformers/tree/master/examples/pytorch/text-classification).
_Tip: define `output_dir` as `/opt/ml/model` in the hyperparameter for the script to save your model to S3 after training._
@@ -338,7 +338,7 @@ git_config = {'repo': 'https://github.com/huggingface/transformers.git','branch'
# create the Estimator
huggingface_estimator = HuggingFace(
entry_point='run_glue.py',
source_dir='./examples/text-classification',
source_dir='./examples/pytorch/text-classification',
git_config=git_config,
instance_type='ml.p3.2xlarge',
instance_count=1,

View File

@@ -55,10 +55,10 @@ Sequence Classification
Sequence classification is the task of classifying sequences according to a given number of classes. An example of
sequence classification is the GLUE dataset, which is entirely based on that task. If you would like to fine-tune a
model on a GLUE sequence classification task, you may leverage the :prefix_link:`run_glue.py
<examples/text-classification/run_glue.py>`, :prefix_link:`run_tf_glue.py
<examples/text-classification/run_tf_glue.py>`, :prefix_link:`run_tf_text_classification.py
<examples/text-classification/run_tf_text_classification.py>` or :prefix_link:`run_xnli.py
<examples/text-classification/run_xnli.py>` scripts.
<examples/pytorch/text-classification/run_glue.py>`, :prefix_link:`run_tf_glue.py
<examples/tensorflow/text-classification/run_tf_glue.py>`, :prefix_link:`run_tf_text_classification.py
<examples/tensorflow/text-classification/run_tf_text_classification.py>` or :prefix_link:`run_xnli.py
<examples/pytorch/text-classification/run_xnli.py>` scripts.
Here is an example of using pipelines to do sentiment analysis: identifying if a sequence is positive or negative. It
leverages a fine-tuned model on sst2, which is a GLUE task.
@@ -85,9 +85,8 @@ each other. The process is the following:
1. Instantiate a tokenizer and a model from the checkpoint name. The model is identified as a BERT model and loads it
with the weights stored in the checkpoint.
2. Build a sequence from the two sentences, with the correct model-specific separators token type ids and attention
masks (:func:`~transformers.PreTrainedTokenizer.encode` and :func:`~transformers.PreTrainedTokenizer.__call__` take
care of this).
2. Build a sequence from the two sentences, with the correct model-specific separators, token type ids and attention
masks (which will be created automatically by the tokenizer).
3. Pass this sequence through the model so that it is classified in one of the two available classes: 0 (not a
paraphrase) and 1 (is a paraphrase).
4. Compute the softmax of the result to get probabilities over the classes.
@@ -108,6 +107,7 @@ each other. The process is the following:
>>> sequence_1 = "Apples are especially bad for your health"
>>> sequence_2 = "HuggingFace's headquarters are situated in Manhattan"
>>> # The tokekenizer will automatically add any model specific separators (i.e. <CLS> and <SEP>) and tokens to the sequence, as well as compute the attention masks.
>>> paraphrase = tokenizer(sequence_0, sequence_2, return_tensors="pt")
>>> not_paraphrase = tokenizer(sequence_0, sequence_1, return_tensors="pt")
@@ -141,6 +141,7 @@ each other. The process is the following:
>>> sequence_1 = "Apples are especially bad for your health"
>>> sequence_2 = "HuggingFace's headquarters are situated in Manhattan"
>>> # The tokekenizer will automatically add any model specific separators (i.e. <CLS> and <SEP>) and tokens to the sequence, as well as compute the attention masks.
>>> paraphrase = tokenizer(sequence_0, sequence_2, return_tensors="tf")
>>> not_paraphrase = tokenizer(sequence_0, sequence_1, return_tensors="tf")
@@ -168,8 +169,10 @@ Extractive Question Answering
Extractive Question Answering is the task of extracting an answer from a text given a question. An example of a
question answering dataset is the SQuAD dataset, which is entirely based on that task. If you would like to fine-tune a
model on a SQuAD task, you may leverage the `run_qa.py
<https://github.com/huggingface/transformers/tree/master/examples/question-answering/run_qa.py>`__ and `run_tf_squad.py
<https://github.com/huggingface/transformers/tree/master/examples/question-answering/run_tf_squad.py>`__ scripts.
<https://github.com/huggingface/transformers/tree/master/examples/pytorch/question-answering/run_qa.py>`__ and
`run_tf_squad.py
<https://github.com/huggingface/transformers/tree/master/examples/tensorflow/question-answering/run_tf_squad.py>`__
scripts.
Here is an example of using pipelines to do question answering: extracting an answer from a text given a question. It
@@ -184,7 +187,7 @@ leverages a fine-tuned model on SQuAD.
>>> context = r"""
... Extractive Question Answering is the task of extracting an answer from a text given a question. An example of a
... question answering dataset is the SQuAD dataset, which is entirely based on that task. If you would like to fine-tune
... a model on a SQuAD task, you may leverage the examples/question-answering/run_squad.py script.
... a model on a SQuAD task, you may leverage the examples/pytorch/question-answering/run_squad.py script.
... """
This returns an answer extracted from the text, a confidence score, alongside "start" and "end" values, which are the
@@ -325,8 +328,7 @@ fill that mask with an appropriate token. This allows the model to attend to bot
right of the mask) and the left context (tokens on the left of the mask). Such a training creates a strong basis for
downstream tasks requiring bi-directional context, such as SQuAD (question answering, see `Lewis, Lui, Goyal et al.
<https://arxiv.org/abs/1910.13461>`__, part 4.2). If you would like to fine-tune a model on a masked language modeling
task, you may leverage the `run_mlm.py
<https://github.com/huggingface/transformers/tree/master/examples/language-modeling/run_mlm.py>`__ script.
task, you may leverage the :prefix_link:`run_mlm.py <examples/pytorch/language-modeling/run_mlm.py>` script.
Here is an example of using pipelines to replace a mask from a sequence:
@@ -435,7 +437,7 @@ Causal Language Modeling
Causal language modeling is the task of predicting the token following a sequence of tokens. In this situation, the
model only attends to the left context (tokens on the left of the mask). Such a training is particularly interesting
for generation tasks. If you would like to fine-tune a model on a causal language modeling task, you may leverage the
`run_clm.py <https://github.com/huggingface/transformers/tree/master/examples/language-modeling/run_clm.py>`__ script.
:prefix_link:`run_clm.py <examples/pytorch/language-modeling/run_clm.py>` script.
Usually, the next token is predicted by sampling from the logits of the last hidden state the model produces from the
input sequence.
@@ -503,8 +505,8 @@ This outputs a (hopefully) coherent next token following the original sequence,
>>> print(resulting_string)
Hugging Face is based in DUMBO, New York City, and has
In the next section, we show how this functionality is leveraged in :func:`~transformers.PreTrainedModel.generate` to
generate multiple tokens up to a user-defined length.
In the next section, we show how :func:`~transformers.PreTrainedModel.generate` can be used to generate multiple tokens
up to a specified length instead of one token at a time.
Text Generation
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
@@ -525,10 +527,11 @@ As a default all models apply *Top-K* sampling when used in pipelines, as config
Here, the model generates a random text with a total maximal length of *50* tokens from context *"As far as I am
concerned, I will"*. The default arguments of ``PreTrainedModel.generate()`` can be directly overridden in the
pipeline, as is shown above for the argument ``max_length``.
concerned, I will"*. Behind the scenes, the pipeline object calls the method
:func:`~transformers.PreTrainedModel.generate` to generate text. The default arguments for this method can be
overridden in the pipeline, as is shown above for the arguments ``max_length`` and ``do_sample``.
Here is an example of text generation using ``XLNet`` and its tokenizer.
Below is an example of text generation using ``XLNet`` and its tokenizer, which includes calling ``generate`` directly:
.. code-block::
@@ -602,8 +605,7 @@ Named Entity Recognition
Named Entity Recognition (NER) is the task of classifying tokens according to a class, for example, identifying a token
as a person, an organisation or a location. An example of a named entity recognition dataset is the CoNLL-2003 dataset,
which is entirely based on that task. If you would like to fine-tune a model on an NER task, you may leverage the
`run_ner.py <https://github.com/huggingface/transformers/tree/master/examples/token-classification/run_ner.py>`__
script.
:prefix_link:`run_ner.py <examples/pytorch/token-classification/run_ner.py>` script.
Here is an example of using pipelines to do named entity recognition, specifically, trying to identify tokens as
belonging to one of 9 classes:
@@ -627,8 +629,8 @@ It leverages a fine-tuned model on CoNLL-2003, fine-tuned by `@stefan-it <https:
>>> nlp = pipeline("ner")
>>> sequence = "Hugging Face Inc. is a company based in New York City. Its headquarters are in DUMBO, therefore very"
... "close to the Manhattan Bridge which is visible from the window."
>>> sequence = """Hugging Face Inc. is a company based in New York City. Its headquarters are in DUMBO,
... therefore very close to the Manhattan Bridge which is visible from the window."""
This outputs a list of all words that have been identified as one of the entities from the 9 classes defined above.
@@ -659,15 +661,14 @@ Here is an example of doing named entity recognition, using a model and a tokeni
1. Instantiate a tokenizer and a model from the checkpoint name. The model is identified as a BERT model and loads it
with the weights stored in the checkpoint.
2. Define the label list with which the model was trained on.
3. Define a sequence with known entities, such as "Hugging Face" as an organisation and "New York City" as a location.
4. Split words into tokens so that they can be mapped to predictions. We use a small hack by, first, completely
2. Define a sequence with known entities, such as "Hugging Face" as an organisation and "New York City" as a location.
3. Split words into tokens so that they can be mapped to predictions. We use a small hack by, first, completely
encoding and decoding the sequence, so that we're left with a string that contains the special tokens.
5. Encode that sequence into IDs (special tokens are added automatically).
6. Retrieve the predictions by passing the input to the model and getting the first output. This results in a
4. Encode that sequence into IDs (special tokens are added automatically).
5. Retrieve the predictions by passing the input to the model and getting the first output. This results in a
distribution over the 9 possible classes for each token. We take the argmax to retrieve the most likely class for
each token.
7. Zip together each token with its prediction and print it.
6. Zip together each token with its prediction and print it.
.. code-block::
@@ -706,18 +707,6 @@ Here is an example of doing named entity recognition, using a model and a tokeni
>>> model = TFAutoModelForTokenClassification.from_pretrained("dbmdz/bert-large-cased-finetuned-conll03-english")
>>> tokenizer = AutoTokenizer.from_pretrained("bert-base-cased")
>>> label_list = [
... "O", # Outside of a named entity
... "B-MISC", # Beginning of a miscellaneous entity right after another miscellaneous entity
... "I-MISC", # Miscellaneous entity
... "B-PER", # Beginning of a person's name right after another person's name
... "I-PER", # Person's name
... "B-ORG", # Beginning of an organisation right after another organisation
... "I-ORG", # Organisation
... "B-LOC", # Beginning of a location right after another location
... "I-LOC" # Location
... ]
>>> sequence = "Hugging Face Inc. is a company based in New York City. Its headquarters are in DUMBO, therefore very" \
... "close to the Manhattan Bridge."
@@ -731,23 +720,61 @@ Here is an example of doing named entity recognition, using a model and a tokeni
This outputs a list of each token mapped to its corresponding prediction. Differently from the pipeline, here every
token has a prediction as we didn't remove the "0"th class, which means that no particular entity was found on that
token. The following array should be the output:
token.
In the above example, ``predictions`` is an integer that corresponds to the predicted class. We can use the
``model.config.id2label`` property in order to recover the class name corresponding to the class number, which is
illustrated below:
.. code-block::
>>> print([(token, label_list[prediction]) for token, prediction in zip(tokens, predictions[0].numpy())])
[('[CLS]', 'O'), ('Hu', 'I-ORG'), ('##gging', 'I-ORG'), ('Face', 'I-ORG'), ('Inc', 'I-ORG'), ('.', 'O'), ('is', 'O'), ('a', 'O'), ('company', 'O'), ('based', 'O'), ('in', 'O'), ('New', 'I-LOC'), ('York', 'I-LOC'), ('City', 'I-LOC'), ('.', 'O'), ('Its', 'O'), ('headquarters', 'O'), ('are', 'O'), ('in', 'O'), ('D', 'I-LOC'), ('##UM', 'I-LOC'), ('##BO', 'I-LOC'), (',', 'O'), ('therefore', 'O'), ('very', 'O'), ('##c', 'O'), ('##lose', 'O'), ('to', 'O'), ('the', 'O'), ('Manhattan', 'I-LOC'), ('Bridge', 'I-LOC'), ('.', 'O'), ('[SEP]', 'O')]
>>> for token, prediction in zip(tokens, predictions[0].numpy()):
... print((token, model.config.id2label[prediction]))
('[CLS]', 'O')
('Hu', 'I-ORG')
('##gging', 'I-ORG')
('Face', 'I-ORG')
('Inc', 'I-ORG')
('.', 'O')
('is', 'O')
('a', 'O')
('company', 'O')
('based', 'O')
('in', 'O')
('New', 'I-LOC')
('York', 'I-LOC')
('City', 'I-LOC')
('.', 'O')
('Its', 'O')
('headquarters', 'O')
('are', 'O')
('in', 'O')
('D', 'I-LOC')
('##UM', 'I-LOC')
('##BO', 'I-LOC')
(',', 'O')
('therefore', 'O')
('very', 'O')
('##c', 'O')
('##lose', 'O')
('to', 'O')
('the', 'O')
('Manhattan', 'I-LOC')
('Bridge', 'I-LOC')
('.', 'O')
('[SEP]', 'O')
Summarization
-----------------------------------------------------------------------------------------------------------------------
Summarization is the task of summarizing a document or an article into a shorter text. If you would like to fine-tune a
model on a summarization task, you may leverage the `run_summarization.py
<https://github.com/huggingface/transformers/tree/master/examples/seq2seq/run_summarization.py>`__ script.
<https://github.com/huggingface/transformers/tree/master/examples/pytorch/summarization/run_summarization.py>`__
script.
An example of a summarization dataset is the CNN / Daily Mail dataset, which consists of long news articles and was
created for the task of summarization. If you would like to fine-tune a model on a summarization task, various
approaches are described in this :prefix_link:`document <examples/seq2seq/README.md>`.
approaches are described in this :prefix_link:`document <examples/pytorch/summarization/README.md>`.
Here is an example of using the pipelines to do summarization. It leverages a Bart model that was fine-tuned on the CNN
/ Daily Mail data set.
@@ -794,7 +821,7 @@ Here is an example of doing summarization using a model and a tokenizer. The pro
3. Add the T5 specific prefix "summarize: ".
4. Use the ``PreTrainedModel.generate()`` method to generate the summary.
In this example we use Google`s T5 model. Even though it was pre-trained only on a multi-task mixed dataset (including
In this example we use Google's T5 model. Even though it was pre-trained only on a multi-task mixed dataset (including
CNN / Daily Mail), it yields very good results.
.. code-block::
@@ -818,16 +845,23 @@ CNN / Daily Mail), it yields very good results.
>>> inputs = tokenizer.encode("summarize: " + ARTICLE, return_tensors="tf", max_length=512)
>>> outputs = model.generate(inputs, max_length=150, min_length=40, length_penalty=2.0, num_beams=4, early_stopping=True)
.. code-block::
>>> print(tokenizer.decode(outputs[0]))
<pad> prosecutors say the marriages were part of an immigration scam. if convicted, barrientos faces two criminal counts of "offering a false instrument for filing in the first degree" she has been married 10 times, nine of them between 1999 and 2002.</s>
Translation
-----------------------------------------------------------------------------------------------------------------------
Translation is the task of translating a text from one language to another. If you would like to fine-tune a model on a
translation task, you may leverage the `run_translation.py
<https://github.com/huggingface/transformers/tree/master/examples/seq2seq/run_translation.py>`__ script.
<https://github.com/huggingface/transformers/tree/master/examples/pytorch/translation/run_translation.py>`__ script.
An example of a translation dataset is the WMT English to German dataset, which has sentences in English as the input
data and the corresponding sentences in German as the target data. If you would like to fine-tune a model on a
translation task, various approaches are described in this :prefix_link:`document <examples/seq2seq/README.md>`.
translation task, various approaches are described in this :prefix_link:`document
<examples/pytorch.translation/README.md>`.
Here is an example of using the pipelines to do translation. It leverages a T5 model that was only pre-trained on a
multi-task mixture dataset (including WMT), yet, yielding impressive translation results.

View File

@@ -502,20 +502,18 @@ Distributed training
thing and end up thinking they are ``pytest`` and start running the test suite in loops. It works, however, if one
spawns a normal process that then spawns off multiple workers and manages the IO pipes.
This is still under development but you can study 2 different tests that perform this successfully:
Here are some tests that use it:
* :prefix_link:`test_seq2seq_examples_multi_gpu.py <examples/seq2seq/test_seq2seq_examples_multi_gpu.py>` - a
``pytorch-lightning``-running test (had to use PL's ``ddp`` spawning method which is the default)
* :prefix_link:`test_finetune_trainer.py <examples/seq2seq/test_finetune_trainer.py>` - a normal (non-PL) test
* :prefix_link:`test_trainer_distributed.py <tests/test_trainer_distributed.py>`
* :prefix_link:`test_deepspeed.py <tests/deepspeed/test_deepspeed.py>`
To jump right into the execution point, search for the ``execute_subprocess_async`` function in those tests.
To jump right into the execution point, search for the ``execute_subprocess_async`` call in those tests.
You will need at least 2 GPUs to see these tests in action:
.. code-block:: bash
CUDA_VISIBLE_DEVICES="0,1" RUN_SLOW=1 pytest -sv examples/seq2seq/test_finetune_trainer.py \
examples/seq2seq/test_seq2seq_examples_multi_gpu.py
CUDA_VISIBLE_DEVICES=0,1 RUN_SLOW=1 pytest -sv tests/test_trainer_distributed.py
Output capture
@@ -718,10 +716,10 @@ To start using those all you need is to make sure that the test resides in a sub
from transformers.testing_utils import TestCasePlus
class PathExampleTest(TestCasePlus):
def test_something_involving_local_locations(self):
data_dir = self.examples_dir / "seq2seq/test_data/wmt_en_ro"
data_dir = self.tests_dir / "fixtures/tests_samples/wmt_en_ro"
If you don't need to manipulated paths via ``pathlib`` or you just need a path as a string, you can always invoked
``str()`` on the ``pathlib`` oboject or use the accessors ending with ``_str``. For example:
If you don't need to manipulate paths via ``pathlib`` or you just need a path as a string, you can always invoked
``str()`` on the ``pathlib`` object or use the accessors ending with ``_str``. For example:
.. code-block:: python

View File

@@ -10,274 +10,377 @@
an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the
specific language governing permissions and limitations under the License.
Training and fine-tuning
Fine-tuning a pretrained model
=======================================================================================================================
Model classes in 🤗 Transformers are designed to be compatible with native PyTorch and TensorFlow 2 and can be used
seamlessly with either. In this quickstart, we will show how to fine-tune (or train from scratch) a model using the
standard training tools available in either framework. We will also show how to use our included
:func:`~transformers.Trainer` class which handles much of the complexity of training for you.
In this tutorial, we will show you how to fine-tune a pretrained model from the Transformers library. In TensorFlow,
models can be directly trained using Keras and the :obj:`fit` method. In PyTorch, there is no generic training loop so
the 🤗 Transformers library provides an API with the class :class:`~transformers.Trainer` to let you fine-tune or train
a model from scratch easily. Then we will show you how to alternatively write the whole training loop in PyTorch.
This guide assume that you are already familiar with loading and use our models for inference; otherwise, see the
:doc:`task summary <task_summary>`. We also assume that you are familiar with training deep neural networks in either
PyTorch or TF2, and focus specifically on the nuances and tools for training models in 🤗 Transformers.
Before we can fine-tune a model, we need a dataset. In this tutorial, we will show you how to fine-tune BERT on the
`IMDB dataset <https://www.imdb.com/interfaces/>`__: the task is to classify whether movie reviews are positive or
negative. For examples of other tasks, refer to the :ref:`additional-resources` section!
Sections:
.. _data-processing:
- :ref:`pytorch`
- :ref:`tensorflow`
- :ref:`trainer`
- :ref:`additional-resources`
.. _pytorch:
Fine-tuning in native PyTorch
Preparing the datasets
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
Model classes in 🤗 Transformers that don't begin with ``TF`` are `PyTorch Modules
<https://pytorch.org/docs/master/generated/torch.nn.Module.html>`_, meaning that you can use them just as you would any
model in PyTorch for both inference and optimization.
We will use the `🤗 Datasets <https:/github.com/huggingface/datasets/>`__ library to download and preprocess the IMDB
datasets. We will go over this part pretty quickly. Since the focus of this tutorial is on training, you should refer
to the 🤗 Datasets `documentation <https://huggingface.co/docs/datasets/>`__ or the :doc:`preprocessing` tutorial for
more information.
Let's consider the common task of fine-tuning a masked language model like BERT on a sequence classification dataset.
When we instantiate a model with :func:`~transformers.PreTrainedModel.from_pretrained`, the model configuration and
pre-trained weights of the specified model are used to initialize the model. The library also includes a number of
task-specific final layers or 'heads' whose weights are instantiated randomly when not present in the specified
pre-trained model. For example, instantiating a model with
``BertForSequenceClassification.from_pretrained('bert-base-uncased', num_labels=2)`` will create a BERT model instance
with encoder weights copied from the ``bert-base-uncased`` model and a randomly initialized sequence classification
head on top of the encoder with an output size of 2. Models are initialized in ``eval`` mode by default. We can call
``model.train()`` to put it in train mode.
First, we can use the :obj:`load_dataset` function to download and cache the dataset:
.. code-block:: python
from transformers import BertForSequenceClassification
model = BertForSequenceClassification.from_pretrained('bert-base-uncased')
model.train()
from datasets import load_dataset
This is useful because it allows us to make use of the pre-trained BERT encoder and easily train it on whatever
sequence classification dataset we choose. We can use any PyTorch optimizer, but our library also provides the
:func:`~transformers.AdamW` optimizer which implements gradient bias correction as well as weight decay.
raw_datasets = load_dataset("imdb")
This works like the :obj:`from_pretrained` method we saw for the models and tokenizers (except the cache directory is
`~/.cache/huggingface/dataset` by default).
The :obj:`raw_datasets` object is a dictionary with three keys: :obj:`"train"`, :obj:`"test"` and :obj:`"unsupervised"`
(which correspond to the three splits of that dataset). We will use the :obj:`"train"` split for training and the
:obj:`"test"` split for validation.
To preprocess our data, we will need a tokenizer:
.. code-block:: python
from transformers import AdamW
optimizer = AdamW(model.parameters(), lr=1e-5)
from transformers import AutoTokenizer
The optimizer allows us to apply different hyperpameters for specific parameter groups. For example, we can apply
weight decay to all parameters other than bias and layer normalization terms:
tokenizer = AutoTokenizer.from_pretrained("bert-base-cased")
As we saw in :doc:`preprocessing`, we can prepare the text inputs for the model with the following command (this is an
example, not a command you can execute):
.. code-block:: python
no_decay = ['bias', 'LayerNorm.weight']
optimizer_grouped_parameters = [
{'params': [p for n, p in model.named_parameters() if not any(nd in n for nd in no_decay)], 'weight_decay': 0.01},
{'params': [p for n, p in model.named_parameters() if any(nd in n for nd in no_decay)], 'weight_decay': 0.0}
]
optimizer = AdamW(optimizer_grouped_parameters, lr=1e-5)
inputs = tokenizer(sentences, padding="max_length", truncation=True)
Now we can set up a simple dummy training batch using :func:`~transformers.PreTrainedTokenizer.__call__`. This returns
a :func:`~transformers.BatchEncoding` instance which prepares everything we might need to pass to the model.
This will make all the samples have the maximum length the model can accept (here 512), either by padding or truncating
them.
However, we can instead apply these preprocessing steps to all the splits of our dataset at once by using the
:obj:`map` method:
.. code-block:: python
from transformers import BertTokenizer
tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
text_batch = ["I love Pixar.", "I don't care for Pixar."]
encoding = tokenizer(text_batch, return_tensors='pt', padding=True, truncation=True)
input_ids = encoding['input_ids']
attention_mask = encoding['attention_mask']
def tokenize_function(examples):
return tokenizer(examples["text"], padding="max_length", truncation=True)
When we call a classification model with the ``labels`` argument, the first returned element is the Cross Entropy loss
between the predictions and the passed labels. Having already set up our optimizer, we can then do a backwards pass and
update the weights:
tokenized_datasets = raw_datasets.map(tokenize_function, batched=True)
You can learn more about the map method or the other ways to preprocess the data in the 🤗 Datasets `documentation
<https://huggingface.co/docs/datasets/>`__.
Next we will generate a small subset of the training and validation set, to enable faster training:
.. code-block:: python
labels = torch.tensor([1,0]).unsqueeze(0)
outputs = model(input_ids, attention_mask=attention_mask, labels=labels)
loss = outputs.loss
loss.backward()
optimizer.step()
small_train_dataset = tokenized_datasets["train"].shuffle(seed=42).select(range(1000))
small_eval_dataset = tokenized_datasets["test"].shuffle(seed=42).select(range(1000))
full_train_dataset = tokenized_datasets["train"]
full_eval_dataset = tokenized_datasets["test"]
Alternatively, you can just get the logits and calculate the loss yourself. The following is equivalent to the previous
example:
In all the examples below, we will always use :obj:`small_train_dataset` and :obj:`small_eval_dataset`. Just replace
them by their `full` equivalent to train or evaluate on the full dataset.
.. code-block:: python
.. _trainer:
from torch.nn import functional as F
labels = torch.tensor([1,0])
outputs = model(input_ids, attention_mask=attention_mask)
loss = F.cross_entropy(outputs.logits, labels)
loss.backward()
optimizer.step()
Of course, you can train on GPU by calling ``to('cuda')`` on the model and inputs as usual.
We also provide a few learning rate scheduling tools. With the following, we can set up a scheduler which warms up for
``num_warmup_steps`` and then linearly decays to 0 by the end of training.
.. code-block:: python
from transformers import get_linear_schedule_with_warmup
scheduler = get_linear_schedule_with_warmup(optimizer, num_warmup_steps, num_train_steps)
Then all we have to do is call ``scheduler.step()`` after ``optimizer.step()``.
.. code-block:: python
loss.backward()
optimizer.step()
scheduler.step()
We highly recommend using :func:`~transformers.Trainer`, discussed below, which conveniently handles the moving parts
of training 🤗 Transformers models with features like mixed precision and easy tensorboard logging.
Freezing the encoder
-----------------------------------------------------------------------------------------------------------------------
In some cases, you might be interested in keeping the weights of the pre-trained encoder frozen and optimizing only the
weights of the head layers. To do so, simply set the ``requires_grad`` attribute to ``False`` on the encoder
parameters, which can be accessed with the ``base_model`` submodule on any task-specific model in the library:
.. code-block:: python
for param in model.base_model.parameters():
param.requires_grad = False
.. _tensorflow:
Fine-tuning in native TensorFlow 2
Fine-tuning in PyTorch with the Trainer API
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
Models can also be trained natively in TensorFlow 2. Just as with PyTorch, TensorFlow models can be instantiated with
:func:`~transformers.PreTrainedModel.from_pretrained` to load the weights of the encoder from a pretrained model.
Since PyTorch does not provide a training loop, the 🤗 Transformers library provides a :class:`~transformers.Trainer`
API that is optimized for 🤗 Transformers models, with a wide range of training options and with built-in features like
logging, gradient accumulation, and mixed precision.
First, let's define our model:
.. code-block:: python
from transformers import TFBertForSequenceClassification
model = TFBertForSequenceClassification.from_pretrained('bert-base-uncased')
from transformers import AutoModelForSequenceClassification
Let's use ``tensorflow_datasets`` to load in the `MRPC dataset
<https://www.tensorflow.org/datasets/catalog/glue#gluemrpc>`_ from GLUE. We can then use our built-in
:func:`~transformers.data.processors.glue.glue_convert_examples_to_features` to tokenize MRPC and convert it to a
TensorFlow ``Dataset`` object. Note that tokenizers are framework-agnostic, so there is no need to prepend ``TF`` to
the pretrained tokenizer name.
model = AutoModelForSequenceClassification.from_pretrained("bert-base-cased", num_labels=2)
This will issue a warning about some of the pretrained weights not being used and some weights being randomly
initialized. That's because we are throwing away the pretraining head of the BERT model to replace it with a
classification head which is randomly initialized. We will fine-tune this model on our task, transferring the knowledge
of the pretrained model to it (which is why doing this is called transfer learning).
Then, to define our :class:`~transformers.Trainer`, we will need to instantiate a
:class:`~transformers.TrainingArguments`. This class contains all the hyperparameters we can tune for the
:class:`~transformers.Trainer` or the flags to activate the different training options it supports. Let's begin by
using all the defaults, the only thing we then have to provide is a directory in which the checkpoints will be saved:
.. code-block:: python
from transformers import TrainingArguments
training_args = TrainingArguments("test_trainer")
Then we can instantiate a :class:`~transformers.Trainer` like this:
.. code-block:: python
from transformers import Trainer
trainer = Trainer(
model=model, args=training_args, train_dataset=small_train_dataset, eval_dataset=small_eval_dataset
)
To fine-tune our model, we just need to call
.. code-block:: python
trainer.train()
which will start a training that you can follow with a progress bar, which should take a couple of minutes to complete
(as long as you hav access to a GPU). It won't actually tell you anything useful about how well (or badly) your model
is performing however as by default, there is no evaluation during training, and we didn't tell the
:class:`~transformers.Trainer` to compute any metrics. Let's have a look on how to do that now!
To have the :class:`~transformers.Trainer` compute and report metrics, we need to give it a :obj:`compute_metrics`
function that takes predictions and labels (grouped in a namedtuple called :class:`~transformers.EvalPrediction`) and
return a dictionary with string items (the metric names) and float values (the metric values).
The 🤗 Datasets library provides an easy way to get the common metrics used in NLP with the :obj:`load_metric` function.
here we simply use accuracy. Then we define the :obj:`compute_metrics` function that just convert logits to predictions
(remember that all 🤗 Transformers models return the logits) and feed them to :obj:`compute` method of this metric.
.. code-block:: python
import numpy as np
from datasets import load_metric
metric = load_metric("accuracy")
def compute_metrics(eval_pred):
logits, labels = eval_pred
predictions = np.argmax(logits, axis=-1)
return metric.compute(predictions=predictions, references=labels)
The compute function needs to receive a tuple (with logits and labels) and has to return a dictionary with string keys
(the name of the metric) and float values. It will be called at the end of each evaluation phase on the whole arrays of
predictions/labels.
To check if this works on practice, let's create a new :class:`~transformers.Trainer` with our fine-tuned model:
.. code-block:: python
trainer = Trainer(
model=model,
args=training_args,
train_dataset=small_train_dataset,
eval_dataset=small_eval_dataset,
compute_metrics=compute_metrics,
)
trainer.evaluate()
which showed an accuracy of 87.5% in our case.
If you want to fine-tune your model and regularly report the evaluation metrics (for instance at the end of each
epoch), here is how you should define your training arguments:
.. code-block:: python
from transformers import TrainingArguments
training_args = TrainingArguments("test_trainer", evaluation_strategy="epoch")
See the documentation of :class:`~transformers.TrainingArguments` for more options.
.. _keras:
Fine-tuning with Keras
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
Models can also be trained natively in TensorFlow using the Keras API. First, let's define our model:
.. code-block:: python
from transformers import BertTokenizer, glue_convert_examples_to_features
import tensorflow as tf
import tensorflow_datasets as tfds
tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
data = tfds.load('glue/mrpc')
train_dataset = glue_convert_examples_to_features(data['train'], tokenizer, max_length=128, task='mrpc')
train_dataset = train_dataset.shuffle(100).batch(32).repeat(2)
from transformers import TFAutoModelForSequenceClassification
The model can then be compiled and trained as any Keras model:
model = TFAutoModelForSequenceClassification.from_pretrained("bert-base-cased", num_labels=2)
Then we will need to convert our datasets from before in standard :obj:`tf.data.Dataset`. Since we have fixed shapes,
it can easily be done like this. First we remove the `"text"` column from our datasets and set them in TensorFlow
format:
.. code-block:: python
optimizer = tf.keras.optimizers.Adam(learning_rate=3e-5)
loss = tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True)
model.compile(optimizer=optimizer, loss=loss)
model.fit(train_dataset, epochs=2, steps_per_epoch=115)
tf_train_dataset = small_train_dataset.remove_columns(["text"]).with_format("tensorflow")
tf_eval_dataset = small_eval_dataset.remove_columns(["text"]).with_format("tensorflow")
Then we convert everything in big tensors and use the :obj:`tf.data.Dataset.from_tensor_slices` method:
.. code-block:: python
train_features = {x: tf_train_dataset[x].to_tensor() for x in tokenizer.model_input_names}
train_tf_dataset = tf.data.Dataset.from_tensor_slices((train_features, tf_train_dataset["label"]))
train_tf_dataset = train_tf_dataset.shuffle(len(tf_train_dataset)).batch(8)
eval_features = {x: tf_eval_dataset[x].to_tensor() for x in tokenizer.model_input_names}
eval_tf_dataset = tf.data.Dataset.from_tensor_slices((eval_features, tf_eval_dataset["label"]))
eval_tf_dataset = eval_tf_dataset.batch(8)
With this done, the model can then be compiled and trained as any Keras model:
.. code-block:: python
model.compile(
optimizer=tf.keras.optimizers.Adam(learning_rate=5e-5),
loss=tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True),
metrics=tf.metrics.SparseCategoricalAccuracy(),
)
model.fit(train_tf_dataset, validation_data=eval_tf_dataset, epochs=3)
With the tight interoperability between TensorFlow and PyTorch models, you can even save the model and then reload it
as a PyTorch model (or vice-versa):
.. code-block:: python
from transformers import BertForSequenceClassification
model.save_pretrained('./my_mrpc_model/')
pytorch_model = BertForSequenceClassification.from_pretrained('./my_mrpc_model/', from_tf=True)
from transformers import AutoModelForSequenceClassification
model.save_pretrained("my_imdb_model")
pytorch_model = AutoModelForSequenceClassification.from_pretrained("my_imdb_model", from_tf=True)
.. _trainer:
.. _pytorch_native:
Trainer
Fine-tuning in native PyTorch
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
We also provide a simple but feature-complete training and evaluation interface through :func:`~transformers.Trainer`
and :func:`~transformers.TFTrainer`. You can train, fine-tune, and evaluate any 🤗 Transformers model with a wide range
of training options and with built-in features like logging, gradient accumulation, and mixed precision.
You might need to restart your notebook at this stage to free some memory, or excute the following code:
.. code-block:: python
## PYTORCH CODE
from transformers import BertForSequenceClassification, Trainer, TrainingArguments
del model
del pytorch_model
del trainer
torch.cuda.empty_cache()
model = BertForSequenceClassification.from_pretrained("bert-large-uncased")
Let's now see how to achieve the same results as in :ref:`trainer section <trainer>` in PyTorch. First we need to
define the dataloaders, which we will use to iterate over batches. We just need to apply a bit of post-processing to
our :obj:`tokenized_datasets` before doing that to:
training_args = TrainingArguments(
output_dir='./results', # output directory
num_train_epochs=3, # total # of training epochs
per_device_train_batch_size=16, # batch size per device during training
per_device_eval_batch_size=64, # batch size for evaluation
warmup_steps=500, # number of warmup steps for learning rate scheduler
weight_decay=0.01, # strength of weight decay
logging_dir='./logs', # directory for storing logs
)
- remove the columns corresponding to values the model does not expect (here the :obj:`"text"` column)
- rename the column :obj:`"label"` to :obj:`"labels"` (because the model expect the argument to be named :obj:`labels`)
- set the format of the datasets so they return PyTorch Tensors instead of lists.
trainer = Trainer(
model=model, # the instantiated 🤗 Transformers model to be trained
args=training_args, # training arguments, defined above
train_dataset=train_dataset, # training dataset
eval_dataset=test_dataset # evaluation dataset
)
## TENSORFLOW CODE
from transformers import TFBertForSequenceClassification, TFTrainer, TFTrainingArguments
model = TFBertForSequenceClassification.from_pretrained("bert-large-uncased")
training_args = TFTrainingArguments(
output_dir='./results', # output directory
num_train_epochs=3, # total # of training epochs
per_device_train_batch_size=16, # batch size per device during training
per_device_eval_batch_size=64, # batch size for evaluation
warmup_steps=500, # number of warmup steps for learning rate scheduler
weight_decay=0.01, # strength of weight decay
logging_dir='./logs', # directory for storing logs
)
trainer = TFTrainer(
model=model, # the instantiated 🤗 Transformers model to be trained
args=training_args, # training arguments, defined above
train_dataset=tfds_train_dataset, # tensorflow_datasets training dataset
eval_dataset=tfds_test_dataset # tensorflow_datasets evaluation dataset
)
Now simply call ``trainer.train()`` to train and ``trainer.evaluate()`` to evaluate. You can use your own module as
well, but the first argument returned from ``forward`` must be the loss which you wish to optimize.
:func:`~transformers.Trainer` uses a built-in default function to collate batches and prepare them to be fed into the
model. If needed, you can also use the ``data_collator`` argument to pass your own collator function which takes in the
data in the format provided by your dataset and returns a batch ready to be fed into the model. Note that
:func:`~transformers.TFTrainer` expects the passed datasets to be dataset objects from ``tensorflow_datasets``.
To calculate additional metrics in addition to the loss, you can also define your own ``compute_metrics`` function and
pass it to the trainer.
Our `tokenized_datasets` has one method for each of those steps:
.. code-block:: python
from sklearn.metrics import accuracy_score, precision_recall_fscore_support
tokenized_datasets = tokenized_datasets.remove_columns(["text"])
tokenized_datasets = tokenized_datasets.rename_column("label", "labels")
tokenized_datasets.set_format("torch")
def compute_metrics(pred):
labels = pred.label_ids
preds = pred.predictions.argmax(-1)
precision, recall, f1, _ = precision_recall_fscore_support(labels, preds, average='binary')
acc = accuracy_score(labels, preds)
return {
'accuracy': acc,
'f1': f1,
'precision': precision,
'recall': recall
}
small_train_dataset = tokenized_datasets["train"].shuffle(seed=42).select(range(1000))
small_eval_dataset = tokenized_datasets["test"].shuffle(seed=42).select(range(1000))
Finally, you can view the results, including any calculated metrics, by launching tensorboard in your specified
``logging_dir`` directory.
Now that this is done, we can easily define our dataloaders:
.. code-block:: python
from torch.utils.data import DataLoader
train_dataloader = DataLoader(small_train_dataset, shuffle=True, batch_size=8)
eval_dataloader = DataLoader(small_eval_dataset, batch_size=8)
Next, we define our model:
.. code-block:: python
from transformers import AutoModelForSequenceClassification
model = AutoModelForSequenceClassification.from_pretrained("bert-base-cased", num_labels=2)
We are almost ready to write our training loop, the only two things are missing are an optimizer and a learning rate
scheduler. The default optimizer used by the :class:`~transformers.Trainer` is :class:`~transformers.AdamW`:
.. code-block:: python
from transformers import AdamW
optimizer = AdamW(model.parameters(), lr=5e-5)
Finally, the learning rate scheduler used by default it just a linear decay form the maximum value (5e-5 here) to 0:
.. code-block:: python
from transformers import get_scheduler
num_epochs = 3
num_training_steps = num_epochs * len(train_dataloader)
lr_scheduler = get_scheduler(
"linear",
optimizer=optimizer,
num_warmup_steps=0,
num_training_steps=num_training_steps
)
One last thing, we will want to use the GPU if we have access to one (otherwise training might take several hours
instead of a couple of minutes). To do this, we define a :obj:`device` we will put our model and our batches on.
.. code-block:: python
import torch
device = torch.device("cuda") if torch.cuda.is_available() else torch.device("cpu")
model.to(device)
We now are ready to train! To get some sense of when it will be finished, we add a progress bar over our number of
training steps, using the `tqdm` library.
.. code-block:: python
from tqdm.auto import tqdm
progress_bar = tqdm(range(num_training_steps))
model.train()
for epoch in range(num_epochs):
for batch in train_dataloader:
batch = {k: v.to(device) for k, v in batch.items()}
outputs = model(**batch)
loss = outputs.loss
loss.backward()
optimizer.step()
lr_scheduler.step()
optimizer.zero_grad()
progress_bar.update(1)
Note that if you are used to freezing the body of your pretrained model (like in computer vision) the above may seem a
bit strange, as we are directly fine-tuning the whole model without taking any precaution. It actually works better
this way for Transformers model (so this is not an oversight on our side). If you're not familiar with what "freezing
the body" of the model means, forget you read this paragraph.
Now to check the results, we need to write the evaluation loop. Like in the :ref:`trainer section <trainer>` we will
use a metric from the datasets library. Here we accumulate the predictions at each batch before computing the final
result when the loop is finished.
.. code-block:: python
metric= load_metric("accuracy")
model.eval()
for batch in eval_dataloader:
batch = {k: v.to(device) for k, v in batch.items()}
with torch.no_grad():
outputs = model(**batch)
logits = outputs.logits
predictions = torch.argmax(logits, dim=-1)
metric.add_batch(predictions=predictions, references=batch["labels"])
metric.compute()
.. _additional-resources:
@@ -285,15 +388,10 @@ Finally, you can view the results, including any calculated metrics, by launchin
Additional resources
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
- `A lightweight colab demo <https://colab.research.google.com/drive/1-JIJlao4dI-Ilww_NnTc0rxtp-ymgDgM?usp=sharing>`_
which uses ``Trainer`` for IMDb sentiment classification.
To look at more fine-tuning examples you can refer to:
- `🤗 Transformers Examples <https://github.com/huggingface/transformers/tree/master/examples>`_ including scripts for
training and fine-tuning on GLUE, SQuAD, and several other tasks.
- `🤗 Transformers Examples <https://github.com/huggingface/transformers/tree/master/examples>`__ which includes scripts
to train on all common NLP tasks in PyTorch and TensorFlow.
- `How to train a language model
<https://colab.research.google.com/github/huggingface/blog/blob/master/notebooks/01_how_to_train.ipynb>`_, a detailed
colab notebook which uses ``Trainer`` to train a masked language model from scratch on Esperanto.
- `🤗 Transformers Notebooks <notebooks.html>`_ which contain dozens of example notebooks from the community for
training and using 🤗 Transformers on a variety of tasks.
- `🤗 Transformers Notebooks <notebooks.html>`__ which contains various notebooks and in particular one per task (look
for the `how to finetune a model on xxx`).

View File

@@ -15,9 +15,9 @@ limitations under the License.
# Examples
This folder contains actively maintained examples of use of 🤗 Transformers organized along NLP tasks. If you are looking for an example that used to be in this folder, it may have moved to our [research projects](https://github.com/huggingface/transformers/tree/master/examples/research_projects) subfolder (which contains frozen snapshots of research projects) or to the [legacy](https://github.com/huggingface/transformers/tree/master/examples/legacy) subfolder.
This folder contains actively maintained examples of use of 🤗 Transformers organized along NLP tasks. If you are looking for an example that used to be in this folder, it may have moved to the corresponding framework subfolder (pytorch, tensorflow or flax), our [research projects](https://github.com/huggingface/transformers/tree/master/examples/research_projects) subfolder (which contains frozen snapshots of research projects) or to the [legacy](https://github.com/huggingface/transformers/tree/master/examples/legacy) subfolder.
While we strive to present as many use cases as possible, the scripts in this folder are just examples. It is expected that they won't work out-of-the box on your specific problem and that you will be required to change a few lines of code to adapt them to your needs. To help you with that, all the PyTorch versions of the examples fully expose the preprocessing of the data. This way, you can easily tweak them.
While we strive to present as many use cases as possible, the scripts in this folder are just examples. It is expected that they won't work out-of-the box on your specific problem and that you will be required to change a few lines of code to adapt them to your needs. To help you with that, most of the examples fully expose the preprocessing of the data. This way, you can easily tweak them.
This is similar if you want the scripts to report another metric than the one they currently use: look at the `compute_metrics` function inside the script. It takes the full arrays of predictions and labels and has to return a dictionary of string keys and float values. Just change it to add (or replace) your own metric to the ones already reported.
@@ -42,7 +42,8 @@ To browse the examples corresponding to released versions of 🤗 Transformers,
<details>
<summary>Examples for older versions of 🤗 Transformers</summary>
- [v4.5.1](https://github.com/huggingface/transformers/tree/v4.5.1/examples)
- [v4.4.2](https://github.com/huggingface/transformers/tree/v4.4.2/examples)
- [v4.3.3](https://github.com/huggingface/transformers/tree/v4.3.3/examples)
- [v4.2.2](https://github.com/huggingface/transformers/tree/v4.2.2/examples)
- [v4.1.1](https://github.com/huggingface/transformers/tree/v4.1.1/examples)
@@ -75,193 +76,3 @@ Alternatively, you can find switch your cloned 🤗 Transformers to a specific v
git checkout tags/v3.5.1
```
and run the example command as usual afterward.
## The Big Table of Tasks
Here is the list of all our examples:
- with information on whether they are **built on top of `Trainer`/`TFTrainer`** (if not, they still work, they might
just lack some features),
- whether or not they leverage the [🤗 Datasets](https://github.com/huggingface/datasets) library.
- links to **Colab notebooks** to walk through the scripts and run them easily,
<!--
Coming soon!
- links to **Cloud deployments** to be able to deploy large-scale trainings in the Cloud with little to no setup.
-->
| Task | Example datasets | Trainer support | TFTrainer support | 🤗 Datasets | Colab
|---|---|:---:|:---:|:---:|:---:|
| [**`language-modeling`**](https://github.com/huggingface/transformers/tree/master/examples/language-modeling) | WikiText-2 | ✅ | - | ✅ | [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/huggingface/notebooks/blob/master/examples/language_modeling.ipynb)
| [**`multiple-choice`**](https://github.com/huggingface/transformers/tree/master/examples/multiple-choice) | SWAG | ✅ | ✅ | ✅ | [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/huggingface/notebooks/blob/master/examples/multiple_choice.ipynb)
| [**`question-answering`**](https://github.com/huggingface/transformers/tree/master/examples/question-answering) | SQuAD | ✅ | ✅ | ✅ | [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/huggingface/notebooks/blob/master/examples/question_answering.ipynb)
| [**`summarization`**](https://github.com/huggingface/transformers/tree/master/examples/seq2seq) | XSum | ✅ | - | ✅ | [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/huggingface/notebooks/blob/master/examples/summarization.ipynb)
| [**`text-classification`**](https://github.com/huggingface/transformers/tree/master/examples/text-classification) | GLUE | ✅ | ✅ | ✅ | [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/huggingface/notebooks/blob/master/examples/text_classification.ipynb)
| [**`text-generation`**](https://github.com/huggingface/transformers/tree/master/examples/text-generation) | - | n/a | n/a | - | [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/huggingface/blog/blob/master/notebooks/02_how_to_generate.ipynb)
| [**`token-classification`**](https://github.com/huggingface/transformers/tree/master/examples/token-classification) | CoNLL NER | ✅ | ✅ | ✅ | [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/huggingface/notebooks/blob/master/examples/token_classification.ipynb)
| [**`translation`**](https://github.com/huggingface/transformers/tree/master/examples/seq2seq) | WMT | ✅ | - | ✅ | [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/huggingface/notebooks/blob/master/examples/translation.ipynb)
## Running quick tests
Most examples are equipped with a mechanism to truncate the number of dataset samples to the desired length. This is useful for debugging purposes, for example to quickly check that all stages of the programs can complete, before running the same setup on the full dataset which may take hours to complete.
For example here is how to truncate all three splits to just 50 samples each:
```
examples/token-classification/run_ner.py \
--max_train_samples 50 \
--max_val_samples 50 \
--max_test_samples 50 \
[...]
```
Most example scripts should have the first two command line arguments and some have the third one. You can quickly check if a given example supports any of these by passing a `-h` option, e.g.:
```
examples/token-classification/run_ner.py -h
```
## Resuming training
You can resume training from a previous checkpoint like this:
1. Pass `--output_dir previous_output_dir` without `--overwrite_output_dir` to resume training from the latest checkpoint in `output_dir` (what you would use if the training was interrupted, for instance).
2. Pass `--model_name_or_path path_to_a_specific_checkpoint` to resume training from that checkpoint folder.
Should you want to turn an example into a notebook where you'd no longer have access to the command
line, 🤗 Trainer supports resuming from a checkpoint via `trainer.train(resume_from_checkpoint)`.
1. If `resume_from_checkpoint` is `True` it will look for the last checkpoint in the value of `output_dir` passed via `TrainingArguments`.
2. If `resume_from_checkpoint` is a path to a specific checkpoint it will use that saved checkpoint folder to resume the training from.
## Distributed training and mixed precision
All the PyTorch scripts mentioned above work out of the box with distributed training and mixed precision, thanks to
the [Trainer API](https://huggingface.co/transformers/main_classes/trainer.html). To launch one of them on _n_ GPUS,
use the following command:
```bash
python -m torch.distributed.launch \
--nproc_per_node number_of_gpu_you_have path_to_script.py \
--all_arguments_of_the_script
```
As an example, here is how you would fine-tune the BERT large model (with whole word masking) on the text
classification MNLI task using the `run_glue` script, with 8 GPUs:
```bash
python -m torch.distributed.launch \
--nproc_per_node 8 text-classification/run_glue.py \
--model_name_or_path bert-large-uncased-whole-word-masking \
--task_name mnli \
--do_train \
--do_eval \
--max_seq_length 128 \
--per_device_train_batch_size 8 \
--learning_rate 2e-5 \
--num_train_epochs 3.0 \
--output_dir /tmp/mnli_output/
```
If you have a GPU with mixed precision capabilities (architecture Pascal or more recent), you can use mixed precision
training with PyTorch 1.6.0 or latest, or by installing the [Apex](https://github.com/NVIDIA/apex) library for previous
versions. Just add the flag `--fp16` to your command launching one of the scripts mentioned above!
Using mixed precision training usually results in 2x-speedup for training with the same final results (as shown in
[this table](https://github.com/huggingface/transformers/tree/master/examples/text-classification#mixed-precision-training)
for text classification).
## Running on TPUs
When using Tensorflow, TPUs are supported out of the box as a `tf.distribute.Strategy`.
When using PyTorch, we support TPUs thanks to `pytorch/xla`. For more context and information on how to setup your TPU environment refer to Google's documentation and to the
very detailed [pytorch/xla README](https://github.com/pytorch/xla/blob/master/README.md).
In this repo, we provide a very simple launcher script named
[xla_spawn.py](https://github.com/huggingface/transformers/tree/master/examples/xla_spawn.py) that lets you run our
example scripts on multiple TPU cores without any boilerplate. Just pass a `--num_cores` flag to this script, then your
regular training script with its arguments (this is similar to the `torch.distributed.launch` helper for
`torch.distributed`):
```bash
python xla_spawn.py --num_cores num_tpu_you_have \
path_to_script.py \
--all_arguments_of_the_script
```
As an example, here is how you would fine-tune the BERT large model (with whole word masking) on the text
classification MNLI task using the `run_glue` script, with 8 TPUs:
```bash
python xla_spawn.py --num_cores 8 \
text-classification/run_glue.py \
--model_name_or_path bert-large-uncased-whole-word-masking \
--task_name mnli \
--do_train \
--do_eval \
--max_seq_length 128 \
--per_device_train_batch_size 8 \
--learning_rate 2e-5 \
--num_train_epochs 3.0 \
--output_dir /tmp/mnli_output/
```
## Logging & Experiment tracking
You can easily log and monitor your runs code. The following are currently supported:
* [TensorBoard](https://www.tensorflow.org/tensorboard)
* [Weights & Biases](https://docs.wandb.ai/integrations/huggingface)
* [Comet ML](https://www.comet.ml/docs/python-sdk/huggingface/)
### Weights & Biases
To use Weights & Biases, install the wandb package with:
```bash
pip install wandb
```
Then log in the command line:
```bash
wandb login
```
If you are in Jupyter or Colab, you should login with:
```python
import wandb
wandb.login()
```
To enable logging to W&B, include `"wandb"` in the `report_to` of your `TrainingArguments` or script. Or just pass along `--report_to all` if you have `wandb` installed.
Whenever you use `Trainer` or `TFTrainer` classes, your losses, evaluation metrics, model topology and gradients (for `Trainer` only) will automatically be logged.
Advanced configuration is possible by setting environment variables:
| Environment Variable | Value |
|---|---|
| WANDB_LOG_MODEL | Log the model as artifact (log the model as artifact at the end of training (`false` by default) |
| WANDB_WATCH | one of `gradients` (default) to log histograms of gradients, `all` to log histograms of both gradients and parameters, or `false` for no histogram logging |
| WANDB_PROJECT | Organize runs by project |
Set run names with `run_name` argument present in scripts or as part of `TrainingArguments`.
Additional configuration options are available through generic [wandb environment variables](https://docs.wandb.com/library/environment-variables).
Refer to related [documentation & examples](https://docs.wandb.ai/integrations/huggingface).
### Comet.ml
To use `comet_ml`, install the Python package with:
```bash
pip install comet_ml
```
or if in a Conda environment:
```bash
conda install -c comet_ml -c anaconda -c conda-forge comet_ml
```

View File

@@ -45,7 +45,7 @@ from transformers import (
MODEL_FOR_MASKED_LM_MAPPING,
AutoConfig,
AutoTokenizer,
FlaxBertForMaskedLM,
FlaxAutoModelForMaskedLM,
HfArgumentParser,
PreTrainedTokenizerBase,
TensorType,
@@ -105,6 +105,12 @@ class ModelArguments:
default=True,
metadata={"help": "Whether to use one of the fast tokenizer (backed by the tokenizers library) or not."},
)
dtype: Optional[str] = field(
default="float32",
metadata={
"help": "Floating-point format in which the model weights should be initialized and trained. Choose one of `[float32, float16, bfloat16]`."
},
)
@dataclass
@@ -162,6 +168,10 @@ class DataTrainingArguments:
"If False, will pad the samples dynamically when batching to the maximum length in the batch."
},
)
line_by_line: bool = field(
default=False,
metadata={"help": "Whether distinct lines of text in the dataset are to be handled as distinct sequences."},
)
def __post_init__(self):
if self.dataset_name is None and self.train_file is None and self.validation_file is None:
@@ -537,27 +547,76 @@ if __name__ == "__main__":
column_names = datasets["validation"].column_names
text_column_name = "text" if "text" in column_names else column_names[0]
padding = "max_length" if data_args.pad_to_max_length else False
max_seq_length = min(data_args.max_seq_length, tokenizer.model_max_length)
def tokenize_function(examples):
# Remove empty lines
examples = [line for line in examples if len(line) > 0 and not line.isspace()]
return tokenizer(
examples,
return_special_tokens_mask=True,
padding=padding,
truncation=True,
max_length=data_args.max_seq_length,
if data_args.line_by_line:
# When using line_by_line, we just tokenize each nonempty line.
padding = "max_length" if data_args.pad_to_max_length else False
def tokenize_function(examples):
# Remove empty lines
examples = [line for line in examples if len(line) > 0 and not line.isspace()]
return tokenizer(
examples,
return_special_tokens_mask=True,
padding=padding,
truncation=True,
max_length=max_seq_length,
)
tokenized_datasets = datasets.map(
tokenize_function,
input_columns=[text_column_name],
batched=True,
num_proc=data_args.preprocessing_num_workers,
remove_columns=column_names,
load_from_cache_file=not data_args.overwrite_cache,
)
tokenized_datasets = datasets.map(
tokenize_function,
input_columns=[text_column_name],
batched=True,
num_proc=data_args.preprocessing_num_workers,
remove_columns=column_names,
load_from_cache_file=not data_args.overwrite_cache,
)
else:
# Otherwise, we tokenize every text, then concatenate them together before splitting them in smaller parts.
# We use `return_special_tokens_mask=True` because DataCollatorForLanguageModeling (see below) is more
# efficient when it receives the `special_tokens_mask`.
def tokenize_function(examples):
return tokenizer(examples[text_column_name], return_special_tokens_mask=True)
tokenized_datasets = datasets.map(
tokenize_function,
batched=True,
num_proc=data_args.preprocessing_num_workers,
remove_columns=column_names,
load_from_cache_file=not data_args.overwrite_cache,
)
# Main data processing function that will concatenate all texts from our dataset and generate chunks of
# max_seq_length.
def group_texts(examples):
# Concatenate all texts.
concatenated_examples = {k: sum(examples[k], []) for k in examples.keys()}
total_length = len(concatenated_examples[list(examples.keys())[0]])
# We drop the small remainder, we could add padding if the model supported it instead of this drop, you can
# customize this part to your needs.
total_length = (total_length // max_seq_length) * max_seq_length
# Split by chunks of max_len.
result = {
k: [t[i : i + max_seq_length] for i in range(0, total_length, max_seq_length)]
for k, t in concatenated_examples.items()
}
return result
# Note that with `batched=True`, this map processes 1,000 texts together, so group_texts throws away a
# remainder for each of those groups of 1,000 texts. You can adjust that batch_size here but a higher value
# might be slower to preprocess.
#
# To speed up this part, we use multiprocessing. See the documentation of the map method for more information:
# https://huggingface.co/docs/datasets/package_reference/main_classes.html#datasets.Dataset.map
tokenized_datasets = tokenized_datasets.map(
group_texts,
batched=True,
num_proc=data_args.preprocessing_num_workers,
load_from_cache_file=not data_args.overwrite_cache,
)
# Enable tensorboard only on the master node
if has_tensorboard and jax.host_id() == 0:
@@ -571,13 +630,7 @@ if __name__ == "__main__":
rng = jax.random.PRNGKey(training_args.seed)
dropout_rngs = jax.random.split(rng, jax.local_device_count())
model = FlaxBertForMaskedLM.from_pretrained(
"bert-base-cased",
dtype=jnp.float32,
input_shape=(training_args.train_batch_size, config.max_position_embeddings),
seed=training_args.seed,
dropout_rate=0.1,
)
model = FlaxAutoModelForMaskedLM.from_config(config, seed=training_args.seed, dtype=getattr(jnp, model_args.dtype))
# Setup optimizer
optimizer = Adam(
@@ -590,7 +643,7 @@ if __name__ == "__main__":
# Create learning rate scheduler
# warmup_steps = 0 causes the Flax optimizer to return NaNs; warmup_steps = 1 is functionally equivalent.
lr_scheduler_fn = create_learning_rate_scheduler(
base_learning_rate=training_args.learning_rate, warmup_steps=min(training_args.warmup_steps, 1)
base_learning_rate=training_args.learning_rate, warmup_steps=max(training_args.warmup_steps, 1)
)
# Create parallel version of the training and evaluation steps
@@ -602,8 +655,8 @@ if __name__ == "__main__":
# Store some constant
nb_epochs = int(training_args.num_train_epochs)
batch_size = int(training_args.train_batch_size)
eval_batch_size = int(training_args.eval_batch_size)
batch_size = int(training_args.per_device_train_batch_size) * jax.device_count()
eval_batch_size = int(training_args.per_device_eval_batch_size) * jax.device_count()
epochs = tqdm(range(nb_epochs), desc=f"Epoch ... (1/{nb_epochs})", position=0)
for epoch in epochs:
@@ -657,3 +710,8 @@ if __name__ == "__main__":
if has_tensorboard and jax.host_id() == 0:
for name, value in eval_summary.items():
summary_writer.scalar(name, value, epoch)
# save last checkpoint
if jax.host_id() == 0:
params = jax.device_get(jax.tree_map(lambda x: x[0], optimizer.target))
model.save_pretrained(training_args.output_dir, params=params)

View File

@@ -0,0 +1,106 @@
<!---
Copyright 2021 The Google Flax Team Authors and HuggingFace Team. All rights reserved.
Licensed under the Apache License, Version 2.0 (the "License");
you may not use this file except in compliance with the License.
You may obtain a copy of the License at
http://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License.
-->
# Text classification examples
## GLUE tasks
Based on the script [`run_flax_glue.py`](https://github.com/huggingface/transformers/blob/master/examples/flax/text-classification/run_flax_glue.py).
Fine-tuning the library models for sequence classification on the GLUE benchmark: [General Language Understanding
Evaluation](https://gluebenchmark.com/). This script can fine-tune any of the models on the [hub](https://huggingface.co/models).
GLUE is made up of a total of 9 different tasks. Here is how to run the script on one of them:
```bash
export TASK_NAME=mrpc
python run_flax_glue.py \
--model_name_or_path bert-base-cased \
--task_name $TASK_NAME \
--max_length 128 \
--learning_rate 2e-5 \
--num_train_epochs 3 \
--per_device_train_batch_size 4 \
--output_dir /tmp/$TASK_NAME/
```
where task name can be one of cola, mnli, mnli-mm, mrpc, qnli, qqp, rte, sst2, stsb, wnli.
Using the command above, the script will train for 3 epochs and run eval after each epoch.
Metrics and hyperparameters are stored in Tensorflow event files in `---output_dir`.
You can see the results by running `tensorboard` in that directory:
```bash
$ tensorboard --logdir .
```
### Accuracy Evaluation
We train five replicas and report mean accuracy and stdev on the dev set below.
We use the settings as in the command above (with an exception for MRPC and
WNLI which are tiny and where we used 5 epochs instead of 3), and we use a total
train batch size of 32 (we train on 8 Cloud v3 TPUs, so a per-device batch size of 4),
On the task other than MRPC and WNLI we train for 3 these epochs because this is the standard,
but looking at the training curves of some of them (e.g., SST-2, STS-b), it appears the models
are undertrained and we could get better results when training longer.
In the Tensorboard results linked below, the random seed of each model is equal to the ID of the run. So in order to reproduce run 1, run the command above with `--seed=1`. The best run used random seed 2, which is the default in the script. The results of all runs are in [this Google Sheet](https://docs.google.com/spreadsheets/d/1zKL_xn32HwbxkFMxB3ftca-soTHAuBFgIhYhOhCnZ4E/edit?usp=sharing).
| Task | Metric | Acc (best run) | Acc (avg/5runs) | Stdev | Metrics |
|-------|------------------------------|----------------|-----------------|-----------|--------------------------------------------------------------------------|
| CoLA | Matthew's corr | 59.57 | 58.04 | 1.81 | [tfhub.dev](https://tensorboard.dev/experiment/f4OvQpWtRq6CvddpxGBd0A/) |
| SST-2 | Accuracy | 92.43 | 91.79 | 0.59 | [tfhub.dev](https://tensorboard.dev/experiment/BYFwa49MRTaLIn93DgAEtA/) |
| MRPC | F1/Accuracy | 89.50/84.8 | 88.70/84.02 | 0.56/0.48 | [tfhub.dev](https://tensorboard.dev/experiment/9ZWH5xwXRS6zEEUE4RaBhQ/) |
| STS-B | Pearson/Spearman corr. | 90.00/88.71 | 89.09/88.61 | 0.51/0.07 | [tfhub.dev](https://tensorboard.dev/experiment/mUlI5B9QQ0WGEJip7p3Tng/) |
| QQP | Accuracy/F1 | 90.88/87.64 | 90.75/87.53 | 0.11/0.13 | [tfhub.dev](https://tensorboard.dev/experiment/pO6h75L3SvSXSWRcgljXKA/) |
| MNLI | Matched acc. | 84.06 | 83.88 | 0.16 | [tfhub.dev](https://tensorboard.dev/experiment/LKwaOH18RMuo7nJkESrpKg/) |
| QNLI | Accuracy | 91.01 | 90.86 | 0.18 | [tfhub.dev](https://tensorboard.dev/experiment/qesXxNcaQhmKxPmbw1sOoA/) |
| RTE | Accuracy | 66.80 | 65.27 | 1.07 | [tfhub.dev](https://tensorboard.dev/experiment/Z84xC0r6RjyzT4SLqiAbzQ/) |
| WNLI | Accuracy | 39.44 | 32.96 | 5.85 | [tfhub.dev](https://tensorboard.dev/experiment/gV73w9v0RIKrqVw32PZbAQ/) |
Some of these results are significantly different from the ones reported on the test set of GLUE benchmark on the
website. For QQP and WNLI, please refer to [FAQ #12](https://gluebenchmark.com/faq) on the website.
### Runtime evaluation
We also ran each task once on a single V100 GPU, 8 V100 GPUs, and 8 Cloud v3 TPUs and report the
overall training time below. For comparison we ran Pytorch's [run_glue.py](https://github.com/huggingface/transformers/blob/master/examples/pytorch/text-classification/run_glue.py) on a single GPU (last column).
| Task | TPU v3-8 | 8 GPU | 1 GPU | 1 GPU (Pytorch) |
|-------|-----------|------------|------------|-----------------|
| CoLA | 1m 46s | 1m 26s | 3m 6s | 4m 6s |
| SST-2 | 5m 30s | 6m 28s | 22m 6s | 34m 37s |
| MRPC | 1m 32s | 1m 14s | 2m 17s | 2m 56s |
| STS-B | 1m 33s | 1m 12s | 2m 11s | 2m 48s |
| QQP | 24m 40s | 31m 48s | 1h 20m 15s | 2h 54m |
| MNLI | 26m 30s | 33m 55s | 2h 7m 30s | 3h 7m 6s |
| QNLI | 8m | 9m 40s | 34m 20s | 49m 8s |
| RTE | 1m 21s | 55s | 1m 8s | 1m 16s |
| WNLI | 1m 12s | 48s | 38s | 36s |
|-------|
| **TOTAL** | 1h 13m | 1h 28m | 4h 34m | 6h 37m |
| **COST*** | $9.60 | $29.10 | $11.33 | $16.41 |
*All experiments are ran on Google Cloud Platform. Prices are on-demand prices
(not preemptible), obtained from the following tables:
[TPU pricing table](https://cloud.google.com/tpu/pricing),
[GPU pricing table](https://cloud.google.com/compute/gpus-pricing). GPU
experiments are ran without further optimizations besides JAX transformations.

View File

@@ -0,0 +1,5 @@
datasets >= 1.1.3
jax>=0.2.8
jaxlib>=0.1.59
git+https://github.com/google/flax.git
git+https://github.com/deepmind/optax.git

View File

@@ -0,0 +1,517 @@
#!/usr/bin/env python
# coding=utf-8
# Copyright 2021 The HuggingFace Inc. team. All rights reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
""" Finetuning a 🤗 Flax Transformers model for sequence classification on GLUE."""
import argparse
import logging
import os
import random
import time
from itertools import chain
from typing import Any, Callable, Dict, Tuple
import datasets
from datasets import load_dataset, load_metric
import jax
import jax.numpy as jnp
import optax
import transformers
from flax import linen as nn
from flax import struct, traverse_util
from flax.jax_utils import replicate, unreplicate
from flax.metrics import tensorboard
from flax.training import train_state
from flax.training.common_utils import get_metrics, onehot, shard, shard_prng_key
from transformers import AutoConfig, AutoTokenizer, FlaxAutoModelForSequenceClassification, PretrainedConfig
logger = logging.getLogger(__name__)
Array = Any
Dataset = datasets.arrow_dataset.Dataset
PRNGKey = Any
task_to_keys = {
"cola": ("sentence", None),
"mnli": ("premise", "hypothesis"),
"mrpc": ("sentence1", "sentence2"),
"qnli": ("question", "sentence"),
"qqp": ("question1", "question2"),
"rte": ("sentence1", "sentence2"),
"sst2": ("sentence", None),
"stsb": ("sentence1", "sentence2"),
"wnli": ("sentence1", "sentence2"),
}
def parse_args():
parser = argparse.ArgumentParser(description="Finetune a transformers model on a text classification task")
parser.add_argument(
"--task_name",
type=str,
default=None,
help="The name of the glue task to train on.",
choices=list(task_to_keys.keys()),
)
parser.add_argument(
"--train_file", type=str, default=None, help="A csv or a json file containing the training data."
)
parser.add_argument(
"--validation_file", type=str, default=None, help="A csv or a json file containing the validation data."
)
parser.add_argument(
"--max_length",
type=int,
default=128,
help=(
"The maximum total input sequence length after tokenization. Sequences longer than this will be truncated,"
" sequences shorter will be padded."
),
)
parser.add_argument(
"--model_name_or_path",
type=str,
help="Path to pretrained model or model identifier from huggingface.co/models.",
required=True,
)
parser.add_argument(
"--use_slow_tokenizer",
action="store_true",
help="If passed, will use a slow tokenizer (not backed by the 🤗 Tokenizers library).",
)
parser.add_argument(
"--per_device_train_batch_size",
type=int,
default=8,
help="Batch size (per device) for the training dataloader.",
)
parser.add_argument(
"--per_device_eval_batch_size",
type=int,
default=8,
help="Batch size (per device) for the evaluation dataloader.",
)
parser.add_argument(
"--learning_rate",
type=float,
default=5e-5,
help="Initial learning rate (after the potential warmup period) to use.",
)
parser.add_argument("--weight_decay", type=float, default=0.0, help="Weight decay to use.")
parser.add_argument("--num_train_epochs", type=int, default=3, help="Total number of training epochs to perform.")
parser.add_argument(
"--max_train_steps",
type=int,
default=None,
help="Total number of training steps to perform. If provided, overrides num_train_epochs.",
)
parser.add_argument(
"--gradient_accumulation_steps",
type=int,
default=1,
help="Number of updates steps to accumulate before performing a backward/update pass.",
)
parser.add_argument(
"--num_warmup_steps", type=int, default=0, help="Number of steps for the warmup in the lr scheduler."
)
parser.add_argument("--output_dir", type=str, default=None, help="Where to store the final model.")
parser.add_argument("--seed", type=int, default=2, help="A seed for reproducible training.")
args = parser.parse_args()
# Sanity checks
if args.task_name is None and args.train_file is None and args.validation_file is None:
raise ValueError("Need either a task name or a training/validation file.")
else:
if args.train_file is not None:
extension = args.train_file.split(".")[-1]
assert extension in ["csv", "json"], "`train_file` should be a csv or a json file."
if args.validation_file is not None:
extension = args.validation_file.split(".")[-1]
assert extension in ["csv", "json"], "`validation_file` should be a csv or a json file."
if args.output_dir is not None:
os.makedirs(args.output_dir, exist_ok=True)
return args
def create_train_state(
model: FlaxAutoModelForSequenceClassification,
learning_rate_fn: Callable[[int], float],
is_regression: bool,
num_labels: int,
) -> train_state.TrainState:
"""Create initial training state."""
class TrainState(train_state.TrainState):
"""Train state with an Optax optimizer.
The two functions below differ depending on whether the task is classification
or regression.
Args:
logits_fn: Applied to last layer to obtain the logits.
loss_fn: Function to compute the loss.
"""
logits_fn: Callable = struct.field(pytree_node=False)
loss_fn: Callable = struct.field(pytree_node=False)
# Creates a multi-optimizer consisting of two "Adam with weight decay" optimizers.
def adamw(weight_decay):
return optax.adamw(learning_rate=learning_rate_fn, b1=0.9, b2=0.999, eps=1e-6, weight_decay=weight_decay)
def traverse(fn):
def mask(data):
flat = traverse_util.flatten_dict(data)
return traverse_util.unflatten_dict({k: fn(k, v) for k, v in flat.items()})
return mask
# We use Optax's "masking" functionality to create a multi-optimizer, one
# with weight decay and the other without. Note masking means the optimizer
# will ignore these paths.
decay_path = lambda p: not any(x in p for x in ["bias", "LayerNorm.weight"]) # noqa: E731
tx = optax.chain(
optax.masked(adamw(0.0), mask=traverse(lambda path, _: decay_path(path))),
optax.masked(adamw(0.01), mask=traverse(lambda path, _: not decay_path(path))),
)
if is_regression:
def mse_loss(logits, labels):
return jnp.mean((logits[..., 0] - labels) ** 2)
return TrainState.create(
apply_fn=model.__call__,
params=model.params,
tx=tx,
logits_fn=lambda logits: logits[..., 0],
loss_fn=mse_loss,
)
else: # Classification.
def cross_entropy_loss(logits, labels):
logits = nn.log_softmax(logits)
xentropy = optax.softmax_cross_entropy(logits, onehot(labels, num_classes=num_labels))
return jnp.mean(xentropy)
return TrainState.create(
apply_fn=model.__call__,
params=model.params,
tx=tx,
logits_fn=lambda logits: logits.argmax(-1),
loss_fn=cross_entropy_loss,
)
def create_learning_rate_fn(
train_ds_size: int, train_batch_size: int, num_train_epochs: int, num_warmup_steps: int, learning_rate: float
) -> Callable[[int], jnp.array]:
"""Returns a linear warmup, linear_decay learning rate function."""
steps_per_epoch = train_ds_size // train_batch_size
num_train_steps = steps_per_epoch * num_train_epochs
warmup_fn = optax.linear_schedule(init_value=0.0, end_value=learning_rate, transition_steps=num_warmup_steps)
decay_fn = optax.linear_schedule(
init_value=learning_rate, end_value=0, transition_steps=num_train_steps - num_warmup_steps
)
schedule_fn = optax.join_schedules(schedules=[warmup_fn, decay_fn], boundaries=[num_warmup_steps])
return schedule_fn
def glue_train_data_collator(rng: PRNGKey, dataset: Dataset, batch_size: int):
"""Returns shuffled batches of size `batch_size` from truncated `train dataset`, sharded over all local devices."""
steps_per_epoch = len(dataset) // batch_size
perms = jax.random.permutation(rng, len(dataset))
perms = perms[: steps_per_epoch * batch_size] # Skip incomplete batch.
perms = perms.reshape((steps_per_epoch, batch_size))
for perm in perms:
batch = dataset[perm]
batch = {k: jnp.array(v) for k, v in batch.items()}
batch = shard(batch)
yield batch
def glue_eval_data_collator(dataset: Dataset, batch_size: int):
"""Returns batches of size `batch_size` from `eval dataset`, sharded over all local devices."""
for i in range(len(dataset) // batch_size):
batch = dataset[i * batch_size : (i + 1) * batch_size]
batch = {k: jnp.array(v) for k, v in batch.items()}
batch = shard(batch)
yield batch
def main():
args = parse_args()
# Make one log on every process with the configuration for debugging.
logging.basicConfig(
format="%(asctime)s - %(levelname)s - %(name)s - %(message)s",
datefmt="%m/%d/%Y %H:%M:%S",
level=logging.INFO,
)
# Setup logging, we only want one process per machine to log things on the screen.
logger.setLevel(logging.INFO if jax.process_index() == 0 else logging.ERROR)
if jax.process_index() == 0:
datasets.utils.logging.set_verbosity_warning()
transformers.utils.logging.set_verbosity_info()
else:
datasets.utils.logging.set_verbosity_error()
transformers.utils.logging.set_verbosity_error()
# Get the datasets: you can either provide your own CSV/JSON training and evaluation files (see below)
# or specify a GLUE benchmark task (the dataset will be downloaded automatically from the datasets Hub).
# For CSV/JSON files, this script will use as labels the column called 'label' and as pair of sentences the
# sentences in columns called 'sentence1' and 'sentence2' if such column exists or the first two columns not named
# label if at least two columns are provided.
# If the CSVs/JSONs contain only one non-label column, the script does single sentence classification on this
# single column. You can easily tweak this behavior (see below)
# In distributed training, the load_dataset function guarantee that only one local process can concurrently
# download the dataset.
if args.task_name is not None:
# Downloading and loading a dataset from the hub.
raw_datasets = load_dataset("glue", args.task_name)
else:
# Loading the dataset from local csv or json file.
data_files = {}
if args.train_file is not None:
data_files["train"] = args.train_file
if args.validation_file is not None:
data_files["validation"] = args.validation_file
extension = (args.train_file if args.train_file is not None else args.valid_file).split(".")[-1]
raw_datasets = load_dataset(extension, data_files=data_files)
# See more about loading any type of standard or custom dataset at
# https://huggingface.co/docs/datasets/loading_datasets.html.
# Labels
if args.task_name is not None:
is_regression = args.task_name == "stsb"
if not is_regression:
label_list = raw_datasets["train"].features["label"].names
num_labels = len(label_list)
else:
num_labels = 1
else:
# Trying to have good defaults here, don't hesitate to tweak to your needs.
is_regression = raw_datasets["train"].features["label"].dtype in ["float32", "float64"]
if is_regression:
num_labels = 1
else:
# A useful fast method:
# https://huggingface.co/docs/datasets/package_reference/main_classes.html#datasets.Dataset.unique
label_list = raw_datasets["train"].unique("label")
label_list.sort() # Let's sort it for determinism
num_labels = len(label_list)
# Load pretrained model and tokenizer
config = AutoConfig.from_pretrained(args.model_name_or_path, num_labels=num_labels, finetuning_task=args.task_name)
tokenizer = AutoTokenizer.from_pretrained(args.model_name_or_path, use_fast=not args.use_slow_tokenizer)
model = FlaxAutoModelForSequenceClassification.from_pretrained(args.model_name_or_path, config=config)
# Preprocessing the datasets
if args.task_name is not None:
sentence1_key, sentence2_key = task_to_keys[args.task_name]
else:
# Again, we try to have some nice defaults but don't hesitate to tweak to your use case.
non_label_column_names = [name for name in raw_datasets["train"].column_names if name != "label"]
if "sentence1" in non_label_column_names and "sentence2" in non_label_column_names:
sentence1_key, sentence2_key = "sentence1", "sentence2"
else:
if len(non_label_column_names) >= 2:
sentence1_key, sentence2_key = non_label_column_names[:2]
else:
sentence1_key, sentence2_key = non_label_column_names[0], None
# Some models have set the order of the labels to use, so let's make sure we do use it.
label_to_id = None
if (
model.config.label2id != PretrainedConfig(num_labels=num_labels).label2id
and args.task_name is not None
and not is_regression
):
# Some have all caps in their config, some don't.
label_name_to_id = {k.lower(): v for k, v in model.config.label2id.items()}
if list(sorted(label_name_to_id.keys())) == list(sorted(label_list)):
logger.info(
f"The configuration of the model provided the following label correspondence: {label_name_to_id}. "
"Using it!"
)
label_to_id = {i: label_name_to_id[label_list[i]] for i in range(num_labels)}
else:
logger.warning(
"Your model seems to have been trained with labels, but they don't match the dataset: ",
f"model labels: {list(sorted(label_name_to_id.keys()))}, dataset labels: {list(sorted(label_list))}."
"\nIgnoring the model labels as a result.",
)
elif args.task_name is None:
label_to_id = {v: i for i, v in enumerate(label_list)}
def preprocess_function(examples):
# Tokenize the texts
texts = (
(examples[sentence1_key],) if sentence2_key is None else (examples[sentence1_key], examples[sentence2_key])
)
result = tokenizer(*texts, padding="max_length", max_length=args.max_length, truncation=True)
if "label" in examples:
if label_to_id is not None:
# Map labels to IDs (not necessary for GLUE tasks)
result["labels"] = [label_to_id[l] for l in examples["label"]]
else:
# In all cases, rename the column to labels because the model will expect that.
result["labels"] = examples["label"]
return result
processed_datasets = raw_datasets.map(
preprocess_function, batched=True, remove_columns=raw_datasets["train"].column_names
)
train_dataset = processed_datasets["train"]
eval_dataset = processed_datasets["validation_matched" if args.task_name == "mnli" else "validation"]
# Log a few random samples from the training set:
for index in random.sample(range(len(train_dataset)), 3):
logger.info(f"Sample {index} of the training set: {train_dataset[index]}.")
# Define a summary writer
summary_writer = tensorboard.SummaryWriter(args.output_dir)
summary_writer.hparams(vars(args))
def write_metric(train_metrics, eval_metrics, train_time, step):
summary_writer.scalar("train_time", train_time, step)
train_metrics = get_metrics(train_metrics)
for key, vals in train_metrics.items():
tag = f"train_{key}"
for i, val in enumerate(vals):
summary_writer.scalar(tag, val, step - len(vals) + i + 1)
for metric_name, value in eval_metrics.items():
summary_writer.scalar(f"eval_{metric_name}", value, step)
num_epochs = int(args.num_train_epochs)
rng = jax.random.PRNGKey(args.seed)
train_batch_size = args.per_device_train_batch_size * jax.local_device_count()
eval_batch_size = args.per_device_eval_batch_size * jax.local_device_count()
learning_rate_fn = create_learning_rate_fn(
len(train_dataset), train_batch_size, args.num_train_epochs, args.num_warmup_steps, args.learning_rate
)
state = create_train_state(model, learning_rate_fn, is_regression, num_labels=num_labels)
# define step functions
def train_step(
state: train_state.TrainState, batch: Dict[str, Array], dropout_rng: PRNGKey
) -> Tuple[train_state.TrainState, float]:
"""Trains model with an optimizer (both in `state`) on `batch`, returning a pair `(new_state, loss)`."""
targets = batch.pop("labels")
def loss_fn(params):
logits = state.apply_fn(**batch, params=params, dropout_rng=dropout_rng, train=True)[0]
loss = state.loss_fn(logits, targets)
return loss, logits
grad_fn = jax.value_and_grad(loss_fn, has_aux=True)
(loss, logits), grad = grad_fn(state.params)
grad = jax.lax.pmean(grad, "batch")
new_state = state.apply_gradients(grads=grad)
metrics = jax.lax.pmean({"loss": loss, "learning_rate": learning_rate_fn(state.step)}, axis_name="batch")
return new_state, metrics
p_train_step = jax.pmap(train_step, axis_name="batch", donate_argnums=(0,))
def eval_step(state, batch):
logits = state.apply_fn(**batch, params=state.params, train=False)[0]
return state.logits_fn(logits)
p_eval_step = jax.pmap(eval_step, axis_name="batch")
if args.task_name is not None:
metric = load_metric("glue", args.task_name)
else:
metric = load_metric("accuracy")
logger.info(f"===== Starting training ({num_epochs} epochs) =====")
train_time = 0
for epoch in range(1, num_epochs + 1):
logger.info(f"Epoch {epoch}")
logger.info(" Training...")
# make sure weights are replicated on each device
state = replicate(state)
train_start = time.time()
train_metrics = []
rng, input_rng, dropout_rng = jax.random.split(rng, 3)
# train
for batch in glue_train_data_collator(input_rng, train_dataset, train_batch_size):
dropout_rngs = shard_prng_key(dropout_rng)
state, metrics = p_train_step(state, batch, dropout_rngs)
train_metrics.append(metrics)
train_time += time.time() - train_start
logger.info(f" Done! Training metrics: {unreplicate(metrics)}")
logger.info(" Evaluating...")
rng, input_rng = jax.random.split(rng)
# evaluate
for batch in glue_eval_data_collator(eval_dataset, eval_batch_size):
labels = batch.pop("labels")
predictions = p_eval_step(state, batch)
metric.add_batch(predictions=chain(*predictions), references=chain(*labels))
# evaluate also on leftover examples (not divisible by batch_size)
num_leftover_samples = len(eval_dataset) % eval_batch_size
# make sure leftover batch is evaluated on one device
if num_leftover_samples > 0 and jax.process_index() == 0:
# put weights on single device
state = unreplicate(state)
# take leftover samples
batch = eval_dataset[-num_leftover_samples:]
batch = {k: jnp.array(v) for k, v in batch.items()}
labels = batch.pop("labels")
predictions = eval_step(state, batch)
metric.add_batch(predictions=predictions, references=labels)
eval_metric = metric.compute()
logger.info(f" Done! Eval metrics: {eval_metric}")
cur_step = epoch * (len(train_dataset) // train_batch_size)
write_metric(train_metrics, eval_metric, train_time, cur_step)
# save last checkpoint
if jax.process_index() == 0:
params = jax.device_get(jax.tree_map(lambda x: x[0], state.params))
model.save_pretrained(args.output_dir, params=params)
if __name__ == "__main__":
main()

View File

@@ -74,7 +74,7 @@ def to_list(tensor):
def train(args, train_dataset, model, tokenizer):
""" Train the model """
"""Train the model"""
if args.local_rank in [-1, 0]:
tb_writer = SummaryWriter()

View File

@@ -61,7 +61,7 @@ def accuracy(out, labels):
def load_rocstories_dataset(dataset_path):
""" Output a list of tuples(story, 1st continuation, 2nd continuation, label) """
"""Output a list of tuples(story, 1st continuation, 2nd continuation, label)"""
with open(dataset_path, encoding="utf_8") as f:
f = csv.reader(f)
output = []
@@ -184,7 +184,7 @@ def main():
# Load and encode the datasets
def tokenize_and_encode(obj):
""" Tokenize and encode a nested object """
"""Tokenize and encode a nested object"""
if isinstance(obj, str):
return tokenizer.convert_tokens_to_ids(tokenizer.tokenize(obj))
elif isinstance(obj, int):

Some files were not shown because too many files have changed in this diff Show More