Compare commits

...

349 Commits

Author SHA1 Message Date
Sylvain Gugger
cd48078ce5 Release: v4.3.2
Some checks failed
Model templates runner / run_tests_templates (push) Has been cancelled
Release - Conda / build_and_package (push) Has been cancelled
2021-02-09 14:07:52 -05:00
Suraj Patil
727ab9d398 [RAG] fix generate (#10094)
* fix rag generate and tests

* put back adjust_logits_during_generation

* tests are okay

Co-authored-by: Patrick von Platen <patrick.v.platen@gmail.com>
2021-02-09 14:04:32 -05:00
Patrick von Platen
c95fae6d65 fix import (#10103) 2021-02-09 14:03:17 -05:00
Lysandre
cc86472c78 Release: v4.3.1
Some checks failed
Model templates runner / run_tests_templates (push) Has been cancelled
Release - Conda / build_and_package (push) Has been cancelled
2021-02-09 09:55:55 +01:00
Patrick von Platen
02451cda74 Deprecate Wav2Vec2ForMaskedLM and add Wav2Vec2ForCTC (#10089)
* add wav2vec2CTC and deprecate for maskedlm

* remove from docs
2021-02-09 09:55:55 +01:00
Lysandre
800f385d78 Release: v4.3.0
Some checks failed
Model templates runner / run_tests_templates (push) Has been cancelled
Release - Conda / build_and_package (push) Has been cancelled
2021-02-08 18:31:49 +01:00
Anthony MOI
bcf49c0438 Update tokenizers requirement (#10077) 2021-02-08 18:29:16 +01:00
Patrick von Platen
15a8906c71 Bump minimum Jax requirement to 2.8.0 (#10027)
* Bump minimum Jax requirement to 2.8.0

* update table
2021-02-08 18:18:26 +01:00
Sylvain Gugger
4cd22512de Release: 4.3.0.rc1
Some checks failed
Model templates runner / run_tests_templates (push) Has been cancelled
Release - Conda / build_and_package (push) Has been cancelled
2021-02-04 15:41:19 -05:00
Sylvain Gugger
4739ce177d Fix test for sagemaker and TPU integrations 2021-02-04 15:06:58 -05:00
Sylvain Gugger
21b3922e35 Authorize last version of tokenizer (#9799)
* Authorize last version of tokenizer

* Update version table

* Fix conversion of spm tokenizers and fix some hub links

* Bump tokenizers version to 0.10.1rc1

* Add script to check tokenizers conversion with XNLI

* Add some more mask_token lstrip support

* Must modify mask_token in slow tokenizers too

* Keep using the old method for Pegasus

* add missing import

Co-authored-by: Anthony MOI <m.anthony.moi@gmail.com>
2021-02-04 14:18:33 -05:00
Nicolas Patry
d5888ef0ab Hotfixing tests (blenderbot decoderonly tests, also need to remove (#10003)
`encoder_no_repeat_ngram_size` from their config.
2021-02-04 11:41:34 -05:00
Stas Bekman
8c3b1fcb67 [trainer] a few fixes (#9993)
* trainer fixes

* don't switch the model  just for deepspeed and mp

* correct the fix
2021-02-04 07:44:56 -08:00
Daniel Stancl
714855bd8f Remove "double" assignment in TF-BART like models (#9997)
* Replace `attn_weights = attn_wegihts = tf.reshape(...)`
with `attn_weights = tf.reshape(...)` and thus remove
unintentionally used "double" assignment.
2021-02-04 10:24:47 -05:00
Sylvain Gugger
b72f16b3ec Fix doc for TFConverBertModel 2021-02-04 10:14:46 -05:00
Nicolas Patry
aeb18b9224 Adding new encoder_no_repeat_ngram_size to generate. (#9984)
Adding new `encoder_no_repeat_ngram_size` to `generate`.

Blenderbot results seemed off compared to original ParlAI script:
`https://parl.ai/projects/recipes/`. Notably the model seems
to repeat a lot what was said during the conversation.

The actual problem was that `no_repeat_ngram_size` actually applies
to the `encoder_input_ids` but HF's `no_repeat_ngram_size` applies
to the previously generated ids (within the decoder). The history
conversation of blenderbot is within the `encoder` part so that
explains why HF's implementation had the repetitions.

This fix was focused on blenderbot *not* small and added tests
for those because they are quite different in configuration.

This change includes:

- Adding a new EncoderNoRepeatLogitProcessor.
- Adding 1 new arg to `generate` (`encoder_no_repeat_ngram_size`)
- Adding 1 new config parameter `encoder_no_repeat_ngram_size`.
- Adding 2 tests, one for the pipeline (high level, inputs exhibited
repeat behavior, one low level for EncoderNoRepeatLogitProcessor)
- Factored NoRepeatLogitProcessor so that logic could be reused.

Further work:

- Blenderbot conversational pipeline still does not behave correctly
 as they way input is prepared within the pipeline is still incorrect
(follow up PR)
- Blenderbot allows the bot to have personas, which is done by
prepending "your personna: XXXX" to the input, this could be explored
too in a follow up PR.

@patrickvonplaten
@LysandreJik

* Update src/transformers/generation_logits_process.py

Co-authored-by: Patrick von Platen <patrick.v.platen@gmail.com>

* Update src/transformers/generation_utils.py

Co-authored-by: Patrick von Platen <patrick.v.platen@gmail.com>

* Update src/transformers/generation_utils.py

Co-authored-by: Patrick von Platen <patrick.v.platen@gmail.com>

* Update src/transformers/configuration_utils.py

Co-authored-by: Patrick von Platen <patrick.v.platen@gmail.com>

* Doc quality.

* Fixing test.

* Last fixes.

* Fixing to account for batch_size.

* Update src/transformers/configuration_utils.py

Co-authored-by: Sylvain Gugger <35901082+sgugger@users.noreply.github.com>

* Update src/transformers/generation_utils.py

Co-authored-by: Sylvain Gugger <35901082+sgugger@users.noreply.github.com>

Co-authored-by: Patrick von Platen <patrick.v.platen@gmail.com>
Co-authored-by: Sylvain Gugger <35901082+sgugger@users.noreply.github.com>
2021-02-04 15:00:18 +01:00
Lysandre Debut
e89c959af9 Fix model templates (#9999) 2021-02-04 07:47:26 -05:00
Daniel Hug
804cd185d8 Added Integration testing for DistilBert model from issue #9948' (#9995) 2021-02-04 04:24:59 -05:00
demSd
00031785a8 BartForCausalLM analogs to ProphetNetForCausalLM (#9128)
* initiliaze bart4causalLM

* create BartDecoderWrapper, setters/getters

* delete spaces

* forward and additional methods

* update cache function, loss function, remove ngram* params in data class.

* add bartcausallm, bartdecoder testing

* correct bart for causal lm

* remove at

* add mbart as well

* up

* fix typo

* up

* correct

* add pegasusforcausallm

* add blenderbotforcausallm

* add blenderbotsmallforcausallm

* add marianforcausallm

* add test for MarianForCausalLM

* add Pegasus test

* add BlenderbotSmall test

* add blenderbot test

* fix a fail

* fix an import fail

* a fix

* fix

* Update modeling_pegasus.py

* fix models

* fix inputs_embeds setting getter

* adapt tests

* correct repo utils check

* finish test improvement

* fix tf models as well

* make style

* make fix-copies

* fix copies

* run all tests

* last changes

* fix all tests

Co-authored-by: Patrick von Platen <patrick.v.platen@gmail.com>
2021-02-04 11:56:12 +03:00
Sylvain Gugger
7898fc03b1 Add from_slow in fast tokenizers build and fixes some bugs (#9987) 2021-02-04 03:34:23 -05:00
Stefan Schweter
6244727e05 distilbert: fix creation of sinusoidal embeddings when using PyTorch 1.8+ (#9917) 2021-02-03 11:42:16 -05:00
sandip
2f06f2bcd6 Alber model integration testing added (#9980) 2021-02-03 11:41:10 -05:00
sandip
75fd00fb25 Integration test added for TF MPnet (#9979) 2021-02-03 11:39:40 -05:00
sandip
ce08043f7a Integration test for mobilebert (#9978) 2021-02-03 11:36:45 -05:00
sandip
1486205d23 TF DistilBERT integration tests (#9975)
* TF DistilBERT integration test

* Update test_modeling_tf_distilbert.py
2021-02-03 09:51:00 -05:00
sandip
f2d5c04e1f Added integration tests for TensorFlow implementation of the ALBERT model (#9976)
* TF Albert integration test

* TF Alber integration test added
2021-02-03 09:49:18 -05:00
Suraj Patil
bca0dd5ee3 [run_clm.py] fix getting extention 2021-02-03 20:14:42 +05:30
yylun
5442a11f5f fix steps_in_epoch variable in trainer when using max_steps (#9969)
* fix steps_in_epoch variable when using max_steps

* redundant sentence

* Revert "redundant sentence"

This reverts commit ad5c0e9b6e66d65732dee2239cdc9c76dfa0dc5a.

* remove redundant sentence

Co-authored-by: wujindou <wujindou@sogou-inc.com>
2021-02-03 09:30:37 -05:00
Julien Plu
3f77c26d74 Fix Longformer and LED (#9942)
* Fix Longformer and LED

* Add a test for graph execution with inputs_embeds

* Apply style
2021-02-03 12:26:32 +01:00
Stas Bekman
d55e10beab [research proj] [lxmert] rm bleach dependency (#9970)
Looks like a vulnerability and it's not really used anywhere in the code, so just as well remove it completely from deps.
https://github.com/huggingface/transformers/security/dependabot/examples/research_projects/lxmert/requirements.txt/bleach/open
2021-02-03 05:24:40 -05:00
abhishek thakur
a1a67a3ced Fix GroupedLinearLayer in TF ConvBERT (#9972) 2021-02-03 04:49:07 -05:00
Daniel Stancl
71bdc076dd Add head_mask and decoder_head_mask to PyTorch LED (#9856)
* Add {decoder_,}head_mask to LED

* Fix create_custom_forward signatue in encoder

* Add head_mask to longformer

* Add head_mask to longformer to fix dependencies
of LED on Longformer.

* Not working yet

* Add mising one input in longofrmer_modeling.py

* make fix-copies
2021-02-02 11:06:52 -08:00
Patrick von Platen
d6217fb30c Wav2Vec2 (#9659)
* add raw scaffold

* implement feat extract layers

* make style

* remove +

* correctly convert weights

* make feat extractor work

* make feature extraction proj work

* run forward pass

* finish forward pass

* Succesful decoding example

* remove unused files

* more changes

* add wav2vec tokenizer

* add new structure

* fix run forward

* add other layer norm architecture

* finish 2nd structure

* add model tests

* finish tests for tok and model

* clean-up

* make style

* finish docstring for model and config

* make style

* correct docstring

* correct tests

* change checkpoints to fairseq

* fix examples

* finish wav2vec2

* make style

* apply sylvains suggestions

* apply lysandres suggestions

* change print to log.info

* re-add assert statement

* add input_values as required input name

* finish wav2vec2 tokenizer

* Update tests/test_tokenization_wav2vec2.py

Co-authored-by: Lysandre Debut <lysandre@huggingface.co>

* apply sylvains suggestions

Co-authored-by: Lysandre Debut <lysandre@huggingface.co>
2021-02-02 15:52:10 +03:00
Sylvain Gugger
d996024af7 Use compute_loss in prediction_step (#9935) 2021-02-02 07:00:17 -05:00
Stefan Schweter
aa438a4265 convbert: minor fixes for conversion script (#9937) 2021-02-02 06:09:24 -05:00
Sylvain Gugger
62024453c3 Bump numpy (#9934) 2021-02-02 05:46:33 -05:00
Sylvain Gugger
de38a6e4d2 Fix 9918 (#9932)
* Initial work

* Fix doc styler and other models
2021-02-02 05:22:20 -05:00
Lysandre Debut
1809de5165 ALBERT Tokenizer integration test (#9943)
* ALBERT Tokenizer integration test

* Batching

* Style
2021-02-02 04:39:33 -05:00
Patrick von Platen
0f4dc5d864 fix typo in naming (#9944) 2021-02-02 12:22:42 +03:00
Patrick von Platen
538b3b4607 [Tokenizer Utils Base] Make pad function more flexible (#9928)
* change tokenizer requirement

* split line

* Correct typo from list to str

* improve style

* make other function pretty as well

* add comment

* correct typo

* add new test

* pass tests for tok without padding token

* Apply suggestions from code review
2021-02-02 10:35:27 +03:00
Jan Jitse Venselaar
d1b14c9b54 Tensorflow doc changes on loss output size (#9922)
* Change documentation to correctly specify loss tensor size

* Change documentation to correct input format for labels

* Corrected output size of loss tensor for sequence classifier, multiple choice model and question answering
2021-02-01 11:17:50 -05:00
Suraj Patil
343057e141 Fix bart conversion script (#9923)
* fix conversion script

* typo

* import nn
2021-02-01 19:17:14 +03:00
Patrick von Platen
0e3be1ac8f Add new model docs (#9667)
* add new model logic

* fix docs

* change structure

* improve add_new_model

* push new changes

* up

* up

* correct spelling

* improve docstring

* correct line length

* update readme

* correct links

* correct typos

* only add rst file for now

* Apply suggestions from code review 1

Co-authored-by: Stas Bekman <stas00@users.noreply.github.com>
Co-authored-by: Bram Vanroy <Bram.Vanroy@UGent.be>

* Apply suggestions from code review

Co-authored-by: Bram Vanroy <Bram.Vanroy@UGent.be>
Co-authored-by: Stas Bekman <stas00@users.noreply.github.com>

* Apply suggestions from code review

Co-authored-by: Stas Bekman <stas00@users.noreply.github.com>

* Apply suggestions from code review

Co-authored-by: Stas Bekman <stas00@users.noreply.github.com>
Co-authored-by: Stefan Schweter <stefan@schweter.it>
Co-authored-by: Bram Vanroy <Bram.Vanroy@UGent.be>

* Apply suggestions from code review

Co-authored-by: Stas Bekman <stas00@users.noreply.github.com>
Co-authored-by: Pierric Cistac <Pierrci@users.noreply.github.com>

* finish adding all suggestions

* make style

* apply Niels feedback

* Apply suggestions from code review

Co-authored-by: Sylvain Gugger <35901082+sgugger@users.noreply.github.com>

* apply sylvains suggestions

Co-authored-by: Stas Bekman <stas00@users.noreply.github.com>
Co-authored-by: Bram Vanroy <Bram.Vanroy@UGent.be>
Co-authored-by: Stefan Schweter <stefan@schweter.it>
Co-authored-by: Pierric Cistac <Pierrci@users.noreply.github.com>
Co-authored-by: Sylvain Gugger <35901082+sgugger@users.noreply.github.com>
2021-02-01 17:55:10 +03:00
Suraj Patil
0842c33edd fix typos (#9924) 2021-02-01 08:17:45 -05:00
CeShine Lee
8672bcda1f Adafactor: avoid updating group["lr"] attributes (#9751)
This affects Adafactor with relative_step=False and scale_parameter=True.
Updating group["lr"] makes the result of ._get_lr() depends on the previous call,
i.e., on the scale of other parameters. This isn't supposed to happen.
2021-02-01 08:07:33 -05:00
Sylvain Gugger
115d97dd2f Remove subclass for sortish sampler (#9907)
* Remove subclass for sortish sampler

* Use old Seq2SeqTrainer in script

* Styling
2021-02-01 08:06:32 -05:00
wlhgtc
1682804ebd Fit chinese wwm to new datasets (#9887)
* MOD: fit chinese wwm to new datasets

* MOD: move wwm to new folder

* MOD: formate code

* Styling

* MOD add param and recover trainer

Co-authored-by: Sylvain Gugger <sylvain.gugger@gmail.com>
2021-02-01 03:37:59 -05:00
Stas Bekman
24881008a6 [wandb] restore WANDB_DISABLED=true to disable wandb (#9896)
* [t5 doc] typos

a few run away backticks

@sgugger

* style

* [trainer] put fp16 args together

this PR proposes a purely cosmetic change that puts all the fp16 args together - so they are easier to manager/read

@sgugger

* style

* [wandb] make WANDB_DISABLED disable wandb with any value

This PR solves part of https://github.com/huggingface/transformers/issues/9623

It tries to actually do what https://github.com/huggingface/transformers/issues/9699 requested/discussed and that is any value of `WANDB_DISABLED` should disable wandb.

The current behavior is that it has to be one of `ENV_VARS_TRUE_VALUES = {"1", "ON", "YES"}`

I have been using `WANDB_DISABLED=true` everywhere in scripts as it was originally advertised. I have no idea why this was changed to a sub-set of possible values. And it's not documented anywhere.

@sgugger

* WANDB_DISABLED=true to disable; make tf trainer consistent

* style
2021-02-01 03:14:06 -05:00
Stas Bekman
6bab83683b fix logger format for non-main process (#9911) 2021-02-01 03:08:12 -05:00
Sylvain Gugger
d85691ac75 Doc title in the template (#9910) 2021-02-01 03:05:31 -05:00
Daniel Stancl
0c6c0afc0e Add head_mask and decoder_head_mask to FSMT (#9819)
* Add {decoder_,}head_mask to fsmt_modeling.py

* Enable test_headmasking and some changes to docs

* Remove test_head_masking flag from fsmt test file

Remove test_head_masking flag from test_modeling_fsmt.py
since test_head_masking is set to be True by default (thus it is redundant to store).

* Merge master and remove test_head_masking = True

* Rebase necessary due to an update of jaxlib

* Remove test_head_masking=True in tests/test_modeling_fsmt.py
as it is redundant.
2021-02-01 09:30:21 +03:00
Kiyoung Kim
74f16b8276 TFBart lables consider both pad token and -100 (#9847)
* TFBart lables consider both pad token and -100

* make style

* fix for all other models

Co-authored-by: kykim <kykim>
Co-authored-by: patrickvonplaten <patrick.v.platen@gmail.com>
2021-02-01 01:31:29 +03:00
lewtun
22121e813e Clarify definition of seed argument in TrainingArguments (#9903)
* Clarify definition of seed argument in Trainer

* Update src/transformers/training_args.py

Co-authored-by: Sylvain Gugger <35901082+sgugger@users.noreply.github.com>

* Update src/transformers/training_args_tf.py

Co-authored-by: Sylvain Gugger <35901082+sgugger@users.noreply.github.com>

* Fix style

* Update src/transformers/training_args.py

Co-authored-by: Sylvain Gugger <35901082+sgugger@users.noreply.github.com>

Co-authored-by: Sylvain Gugger <35901082+sgugger@users.noreply.github.com>
2021-01-31 11:09:31 -05:00
Stas Bekman
40cfc355f1 [doc] nested markup is invalid in rst (#9898)
Apparently nested markup in RST is invalid: https://docutils.sourceforge.io/FAQ.html#is-nested-inline-markup-possible

So currently this line doesn't get rendered properly, leaving inner markdown unrendered, resulting in:
```
https://docutils.sourceforge.io/FAQ.html#is-nested-inline-markup-possible
```

This PR removes the bold which fixes the link.
2021-01-30 09:59:19 -05:00
Stas Bekman
1420b5ff67 refactor deepspeed setup devices (#9880) 2021-01-29 08:18:04 -08:00
Stas Bekman
6bf94bc0b6 correctly handle mt5 (#9879) 2021-01-29 08:11:22 -08:00
Sylvain Gugger
7eadfe166e When on sagemaker use their env variables for saves (#9876)
* When on sagemaker use their env variables for saves

* Address review comments

* Quality
2021-01-29 09:52:26 -05:00
Julien Plu
fdcde144d8 Add XLA test (#9848) 2021-01-29 11:25:03 +01:00
Ethan Chau
99b9affa02 Clarify use of unk_token in tokenizer docstrings (#9875) 2021-01-29 05:11:53 -05:00
Nicolas Patry
c2d0ffec8c Adding a new return_full_text parameter to TextGenerationPipeline. (#9852)
* Adding a new `return_full_text` parameter to TextGenerationPipeline.

For text-generation, it's sometimes used as prompting text.
In that context, prefixing `generated_text` with the actual input
forces the caller to take an extra step to remove it.

The proposed change adds a new parameter (for backward compatibility).
`return_full_text` that enables the caller to prevent adding the prefix.

* Doc quality.
2021-01-29 10:27:32 +01:00
abhishek thakur
bc109ae5b8 pin_memory -> dataloader_pin_memory (#9874) 2021-01-28 21:10:46 +01:00
abhishek thakur
80e4184fb0 on_log event should occur *after* the current log is written (#9872) 2021-01-28 19:11:04 +01:00
Stas Bekman
15e4ce353a [docs] expand install instructions (#9817)
* expand install instructions

* fix

* white space

* rewrite as discussed in the PR

* Apply suggestions from code review

Co-authored-by: Sylvain Gugger <35901082+sgugger@users.noreply.github.com>

* change the wording to encourage issue report

Co-authored-by: Sylvain Gugger <35901082+sgugger@users.noreply.github.com>
2021-01-28 09:36:46 -08:00
Daniel Stancl
4c3ae89ad3 Remove redundant test_head_masking = True flags in test files (#9858)
* Remove redundant test_head_masking = True flags

* Remove all redundant test_head_masking flags in PyTorch test_modeling_* files

* Make test_head_masking = True as a default choice in test_modeling_tf_commong.py

* Remove all redundant test_head_masking flags in TensorFlow
test_modeling_tf_* files

* Put back test_head_masking=False fot TFT5 models
2021-01-28 10:09:13 -05:00
Joe Davison
caddf9126b tutorial typo 2021-01-28 09:21:58 -05:00
Sylvain Gugger
b4e559cfa1 Deprecate model_path in Trainer.train (#9854) 2021-01-28 08:32:46 -05:00
Funtowicz Morgan
2ee9f9b69e Fix computation of attention_probs when head_mask is provided. (#9853)
* Fix computation of attention_probs when head_mask is provided.

Signed-off-by: Morgan Funtowicz <funtowiczmo@gmail.com>

* Apply changes to the template

Co-authored-by: Lysandre <lysandre.debut@reseau.eseo.fr>
2021-01-28 06:11:52 -05:00
Nicolas Patry
b936582f71 Fixing flaky conversational test + flag it as a pipeline test. (#9837) 2021-01-28 10:19:55 +01:00
Lysandre Debut
58fbef9ebc Remove submodule (#9868) 2021-01-28 04:03:53 -05:00
Lysandre Debut
6cb0a6f01a Partial local tokenizer load (#9807)
* Allow partial loading of a cached tokenizer

* Warning > Info

* Update src/transformers/tokenization_utils_base.py

Co-authored-by: Sylvain Gugger <35901082+sgugger@users.noreply.github.com>

* Raise error if not local_files_only

Co-authored-by: Sylvain Gugger <35901082+sgugger@users.noreply.github.com>
2021-01-28 03:29:12 -05:00
abhishek thakur
25fcb5c171 Pin memory in Trainer by default (#9857)
Co-authored-by: Sylvain Gugger <35901082+sgugger@users.noreply.github.com>
Co-authored-by: Stas Bekman <stas00@users.noreply.github.com>
2021-01-28 08:50:46 +01:00
Stefan Schweter
5ed5a54684 ADD BORT (#9813)
* tests: add integration tests for new Bort model

* bort: add conversion script from Gluonnlp to Transformers 🚀

* bort: minor cleanup (BORT -> Bort)

* add docs

* make fix-copies

* clean doc a bit

* correct docs

* Update docs/source/model_doc/bort.rst

Co-authored-by: Sylvain Gugger <35901082+sgugger@users.noreply.github.com>

* Update docs/source/model_doc/bort.rst

Co-authored-by: Sylvain Gugger <35901082+sgugger@users.noreply.github.com>

* correct dialogpt doc

* correct link

* Update docs/source/model_doc/bort.rst

* Update docs/source/model_doc/dialogpt.rst

Co-authored-by: Patrick von Platen <patrick.v.platen@gmail.com>

* make style

Co-authored-by: Patrick von Platen <patrick.v.platen@gmail.com>
Co-authored-by: Sylvain Gugger <35901082+sgugger@users.noreply.github.com>
2021-01-27 21:25:11 +03:00
Stas Bekman
7c6d63298f [traner] fix --lr_scheduler_type choices (#9800)
* fix --lr_scheduler_type choices

* rewrite to fix for all enum-based cl args

* cleanup

* adjust test

* style

* Proposal that should work

* Remove needless code

* Fix test

Co-authored-by: Sylvain Gugger <sylvain.gugger@gmail.com>
2021-01-27 10:12:15 -05:00
Sylvain Gugger
893120facc Allow --arg Value for booleans in HfArgumentParser (#9823)
* Allow --arg Value for booleans in HfArgumentParser

* Update last test

* Better error message
2021-01-27 09:31:42 -05:00
Sylvain Gugger
35d55b7b84 When resuming training from checkpoint, Trainer loads model (#9818)
* Whenresuming training from checkpoint, Trainer loads model

* Finish cleaning tests

* Address review comment

* Use global_step from state
2021-01-27 09:31:18 -05:00
Lysandre Debut
6b6c2b487f Test (#9851) 2021-01-27 09:11:53 -05:00
Lysandre Debut
56c3f07a13 Labeled pull requests (#9849) 2021-01-27 08:45:54 -05:00
Kiyoung Kim
20932e5520 Add tpu_zone and gcp_project in training_args_tf.py (#9825)
* add tpu_zone and gcp_project in training_args_tf.py

* make style

Co-authored-by: kykim <kykim>
2021-01-27 08:45:09 -05:00
Lysandre Debut
763ece2fea Fix model templates (#9842) 2021-01-27 08:20:58 -05:00
Julien Plu
bd701ab1a0 Fix template (#9840) 2021-01-27 07:40:30 -05:00
Sylvain Gugger
c7b7bd9963 Add a flag for find_unused_parameters (#9820)
* Add a flag for find_unused_parameters

* Apply suggestions from code review

Co-authored-by: Stas Bekman <stas00@users.noreply.github.com>

* Remove negation

Co-authored-by: Stas Bekman <stas00@users.noreply.github.com>
2021-01-27 06:18:06 -05:00
Julien Plu
4adbdce5ee Clean TF Bert (#9788)
* Start cleaning BERT

* Clean BERT and all those depends of it

* Fix attribute name

* Apply style

* Apply Sylvain's comments

* Apply Lysandre's comments

* remove unused import
2021-01-27 11:28:11 +01:00
tomohideshibata
f0329ea516 Delete a needless duplicate condition (#9826)
Co-authored-by: Tomohide Shibata <tomshiba@yahoo-corp.jp>
2021-01-27 13:15:23 +03:00
Julien Plu
a1720694a5 Remove a TF usage warning and rework the documentation (#9756)
* Rework documentation

* Update the template

* Trigger CI

* Restore the warning but with the TF logger

* Update convbert doc
2021-01-27 10:45:42 +01:00
Nicolas Patry
285c6262a8 Adding a test to prevent late failure in the Table question answering (#9808)
pipeline.

- If table is empty then the line that contain `answer[0]` will fail.
- This PR add a check to prevent `answer[0]`.
- Also adds an early check for presence of `table` and `query` to
prevent late failure and give better error message.
- Adds a few tests to make sure these errors are correctly raised.
2021-01-27 04:10:53 -05:00
Patrick von Platen
a46050d0f5 fix typo with mt5 init (#9830) 2021-01-27 04:09:56 -05:00
jncasey
f4bf0dea46 Fix auto-resume training from checkpoint (#9822)
* Fix auto-resume training from checkpoint

* style fixes
2021-01-27 03:48:18 -05:00
Sylvain Gugger
f2fabedbab Setup logging with a stdout handler (#9816) 2021-01-27 03:39:11 -05:00
Julien Plu
2c891c156d Add a test for mixed precision (#9806) 2021-01-27 03:36:49 -05:00
Patrick von Platen
d5b40d6693 [Setup.py] update jaxlib (#9831)
* update jaxlib

* Update setup.py

* update table
2021-01-27 11:34:21 +03:00
abhishek thakur
f617490e71 ConvBERT Model (#9717)
* finalize convbert

* finalize convbert

* fix

* fix

* fix

* push

* fix

* tf image patches

* fix torch model

* tf tests

* conversion

* everything aligned

* remove print

* tf tests

* fix tf

* make tf tests pass

* everything works

* fix init

* fix

* special treatment for sepconv1d

* style

* 🙏🏽

* add doc and cleanup

* add electra test again

* fix doc

* fix doc again

* fix doc again

* Update src/transformers/modeling_tf_pytorch_utils.py

Co-authored-by: Lysandre Debut <lysandre@huggingface.co>

* Update src/transformers/models/conv_bert/configuration_conv_bert.py

Co-authored-by: Lysandre Debut <lysandre@huggingface.co>

* Update docs/source/model_doc/conv_bert.rst

Co-authored-by: Sylvain Gugger <35901082+sgugger@users.noreply.github.com>

* Update src/transformers/models/auto/configuration_auto.py

Co-authored-by: Sylvain Gugger <35901082+sgugger@users.noreply.github.com>

* Update src/transformers/models/conv_bert/configuration_conv_bert.py

Co-authored-by: Sylvain Gugger <35901082+sgugger@users.noreply.github.com>

* conv_bert -> convbert

* more fixes from review

* add conversion script

* dont use pretrained embed

* unused config

* suggestions from julien

* some more fixes

* p -> param

* fix copyright

* fix doc

* Update src/transformers/models/convbert/configuration_convbert.py

Co-authored-by: Patrick von Platen <patrick.v.platen@gmail.com>

* comments from reviews

* fix-copies

* fix style

* revert shape_list

Co-authored-by: Lysandre Debut <lysandre@huggingface.co>
Co-authored-by: Sylvain Gugger <35901082+sgugger@users.noreply.github.com>
Co-authored-by: Patrick von Platen <patrick.v.platen@gmail.com>
2021-01-27 03:20:09 -05:00
Patrick von Platen
e575e06287 fix led not defined (#9828) 2021-01-27 10:43:14 +03:00
Yusuke Mori
059bb25817 Fix a bug in run_glue.py (#9812) (#9815) 2021-01-26 14:32:19 -05:00
Tristan Deleu
eba418ac5d Commit the last step on world_process_zero in WandbCallback (#9805)
* Commit the last step on world_process_zero in WandbCallback

* Use the environment variable WANDB_LOG_MODEL as a default value in WandbCallback
2021-01-26 13:21:26 -05:00
Derrick Blakely
8edc98bb70 Allow RAG to output decoder cross-attentions (#9789)
* get cross attns

* add cross-attns doc strings

* fix typo

* line length

* Apply suggestions from code review

Co-authored-by: Quentin Lhoest <42851186+lhoestq@users.noreply.github.com>

Co-authored-by: Quentin Lhoest <42851186+lhoestq@users.noreply.github.com>
2021-01-26 20:32:46 +03:00
Magdalena Biesialska
8f6c12d306 Fix fine-tuning translation scripts (#9809) 2021-01-26 11:30:31 -05:00
Michael Glass
c37dcff764 Fixed parameter name for logits_processor (#9790) 2021-01-26 18:44:02 +03:00
Sylvain Gugger
0d0efd3a0e Smdistributed trainer (#9798)
* Add a debug print

* Adapt Trainer to use smdistributed if available

* Forgotten parenthesis

* Real check for sagemaker

* Donforget to define device...

* Woopsie, local)rank is defined differently

* Update since local_rank has the proper value

* Remove debug statement

* More robust check for smdistributed

* Quality

* Deal with key not present error
2021-01-26 10:28:21 -05:00
Lysandre
897a24c869 Fix head_mask for model templates 2021-01-26 11:02:48 +01:00
Andrea Cappelli
10e5f28212 Improve pytorch examples for fp16 (#9796)
* Pad to 8x for fp16 multiple choice example (#9752)

* Pad to 8x for fp16 squad trainer example (#9752)

* Pad to 8x for fp16 ner example (#9752)

* Pad to 8x for fp16 swag example (#9752)

* Pad to 8x for fp16 qa beam search example (#9752)

* Pad to 8x for fp16 qa example (#9752)

* Pad to 8x for fp16 seq2seq example (#9752)

* Pad to 8x for fp16 glue example (#9752)

* Pad to 8x for fp16 new ner example (#9752)

* update script template #9752

* Update examples/multiple-choice/run_swag.py

Co-authored-by: Sylvain Gugger <35901082+sgugger@users.noreply.github.com>

* Update examples/question-answering/run_qa.py

Co-authored-by: Sylvain Gugger <35901082+sgugger@users.noreply.github.com>

* Update examples/question-answering/run_qa_beam_search.py

Co-authored-by: Sylvain Gugger <35901082+sgugger@users.noreply.github.com>

* improve code quality #9752

Co-authored-by: Sylvain Gugger <35901082+sgugger@users.noreply.github.com>
2021-01-26 04:47:07 -05:00
Nicolas Patry
781e4b1384 Adding skip_special_tokens=True to FillMaskPipeline (#9783)
* We most likely don't want special tokens in this output.

* Adding `skip_special_tokens=True` to FillMaskPipeline

- It's backward incompatible.
- It makes for sense for pipelines to remove references to
special_tokens (all of the other pipelines do that).
- Keeping special tokens makes it hard for users to actually remove them
  because all models have different tokens (<s>, <cls>, [CLS], ....)

* Fixing `token_str` in the same vein, and actually fix the tests too !
2021-01-26 10:06:28 +01:00
Daniel Stancl
1867d9a8d7 Add head_mask/decoder_head_mask for TF BART models (#9639)
* Add head_mask/decoder_head_mask for TF BART models

* Add head_mask and decoder_head_mask input arguments for TF BART-based
models as a TF counterpart to the PR #9569

* Add test_headmasking functionality to tests/test_modeling_tf_common.py

* TODO: Add a test to verify that we can get a gradient back for
importance score computation

* Remove redundant #TODO note

Remove redundant #TODO note from tests/test_modeling_tf_common.py

* Fix assertions

* Make style

* Fix ...Model input args and adjust one new test

* Add back head_mask and decoder_head_mask to BART-based ...Model
after the last commit

* Remove head_mask ande decoder_head_mask from input_dict
in TF test_train_pipeline_custom_model as these two have different
shape than other input args (Necessary for passing this test)

* Revert adding global_rng in test_modeling_tf_common.py
2021-01-26 03:50:00 -05:00
Yusuke Mori
cb73ab5a38 Fix broken links in the converting tf ckpt document (#9791)
* Fix broken links in the converting tf ckpt document

* Update docs/source/converting_tensorflow_models.rst

Co-authored-by: Sylvain Gugger <35901082+sgugger@users.noreply.github.com>

* Reflect the review

Co-authored-by: Sylvain Gugger <35901082+sgugger@users.noreply.github.com>
2021-01-26 03:37:57 -05:00
Patrick von Platen
d94cc2f904 [Flaky Generation Tests] Make sure that no early stopping is happening for beam search (#9794)
* fix ci

* fix ci

* renaming

* fix dup line
2021-01-26 03:21:44 -05:00
Stas Bekman
0fdbf0850a [PR/Issue templates] normalize, group, sort + add myself for deepspeed (#9706)
* normalize, group, sort + add myself for deepspeed

* new structure

* add ray

* typo

* more suggestions

* more suggestions

* white space

* Update .github/ISSUE_TEMPLATE/bug-report.md

Co-authored-by: Suraj Patil <surajp815@gmail.com>

* add bullets

* sync

* Apply suggestions from code review

Co-authored-by: Suraj Patil <surajp815@gmail.com>

* Apply suggestions from code review

Co-authored-by: Lysandre Debut <lysandre@huggingface.co>

* sync

Co-authored-by: Suraj Patil <surajp815@gmail.com>
Co-authored-by: Lysandre Debut <lysandre@huggingface.co>
2021-01-25 21:09:01 -08:00
Sylvain Gugger
af41da5097 Fix style 2021-01-25 12:40:58 -05:00
Sylvain Gugger
caf4abf768 Auto-resume training from checkpoint (#9776)
* Auto-resume training from checkpoint

* Update examples/text-classification/run_glue.py

Co-authored-by: Lysandre Debut <lysandre@huggingface.co>

* Roll out to other examples

Co-authored-by: Lysandre Debut <lysandre@huggingface.co>
2021-01-25 12:03:51 -05:00
Lysandre Debut
0f443436fb Actual fix (#9787) 2021-01-25 11:12:07 -05:00
Stas Bekman
fac7cfb16a [fsmt] onnx triu workaround (#9738)
* onnx triu workaround

* style

* working this time

* add test

* more efficient version
2021-01-25 08:57:37 -05:00
Sorami Hisamoto
626116b7d7 Fix a typo in Trainer.hyperparameter_search docstring (#9762)
`compute_objectie` => `compute_objective`
2021-01-25 06:40:03 -05:00
Kai Fricke
d63ab61525 Use object store to pass trainer object to Ray Tune (#9749) 2021-01-25 05:01:55 -05:00
Maria Janina Sarol
6312fed47d Fix TFTrainer prediction output (#9662)
* Fix TFTrainer prediction output

* Update trainer_tf.py

* Fix TFTrainer prediction output

* Fix evaluation_loss update in TFTrainer

* Fix TFTrainer prediction output
2021-01-25 10:27:12 +01:00
Wilfried L. Bounsi
9152f16023 Fix broken [Open in Colab] links (#9761) 2021-01-23 15:11:46 +05:30
Stas Bekman
b7b7e5d049 token_type_ids isn't used (#9736) 2021-01-22 20:38:53 -08:00
Julien Plu
a449ffcbd2 Fix test (#9755) 2021-01-22 17:40:16 +01:00
Sylvain Gugger
82d46febeb Add report_to training arguments to control the reporting integrations used (#9735) 2021-01-22 10:34:34 -05:00
Sylvain Gugger
411c582109 Fixes to run_seq2seq and instructions (#9734)
* Fixes to run_seq2seq and instructions

* Add more defaults for summarization
2021-01-22 10:03:57 -05:00
Julien Plu
d7c31abf38 Fix some TF slow tests (#9728)
* Fix saved model tests + fix a graph issue in longformer

* Apply style
2021-01-22 14:50:46 +01:00
Stefan Schweter
08b22722c7 examples: fix XNLI url (#9741) 2021-01-22 18:13:52 +05:30
Sylvain Gugger
5f80c15ef5 Fix memory regression in Seq2Seq example (#9713)
* Fix memory regression in Seq2Seq example

* Fix test and properly deal with -100

* Easier condition with device safety

* Patch for MBartTokenzierFast
2021-01-21 12:05:46 -05:00
Julien Plu
a7dabfb3d1 Fix TF s2s models (#9478)
* Fix Seq2Seq models for serving

* Apply style

* Fix lonfgormer

* Fix mBart/Pegasus/Blenderbot

* Apply style

* Add a main intermediate layer

* Apply style

* Remove import

* Apply tf.function to Longformer

* Fix utils check_copy

* Update S2S template

* Fix BART + Blenderbot

* Fix BlenderbotSmall

* Fix BlenderbotSmall

* Fix BlenderbotSmall

* Fix MBart

* Fix Marian

* Fix Pegasus + template

* Apply style

* Fix common attributes test

* Forgot to fix the LED test

* Apply Patrick's comment on LED Decoder
2021-01-21 17:03:29 +01:00
Nicolas Patry
23e5a36ee6 Changing model default for TableQuestionAnsweringPipeline. (#9729)
* Changing model default for TableQuestionAnsweringPipeline.

- Discussion: https://discuss.huggingface.co/t/table-question-answering-is-not-an-available-task-under-pipeline/3284/6

* Updating slow tests that were out of sync.
2021-01-21 14:31:51 +01:00
Julien Plu
3f290e6c84 Fix mixed precision in TF models (#9163)
* Fix Gelu precision

* Fix gelu_fast

* Naming

* Fix usage and apply style

* add TF gelu approximate version

* add TF gelu approximate version

* add TF gelu approximate version

* Apply style

* Fix albert

* Remove the usage of the Activation layer
2021-01-21 07:00:11 -05:00
Suraj Patil
248fa1ae72 fix T5 head mask in model_parallel (#9726)
* fix head mask in model_parallel

* pass correct head mask
2021-01-21 12:16:14 +01:00
Patrick von Platen
ca422e3d7d finish (#9721) 2021-01-21 05:17:13 -05:00
Patrick von Platen
c8ea582ed6 reduce led memory (#9723) 2021-01-21 05:16:15 -05:00
guillaume-be
fb36c273a2 Allow text generation for ProphetNetForCausalLM (#9707)
* Moved ProphetNetForCausalLM's parent initialization after config update

* Added unit tests for generation for ProphetNetForCausalLM
2021-01-21 11:13:38 +01:00
Lysandre Debut
910aa89671 Temporarily deactivate TPU tests while we work on fixing them (#9720) 2021-01-21 04:17:39 -05:00
Muennighoff
6a346f0358 fix typo (#9708)
* fix typo

Co-authored-by: Suraj Patil <surajp815@gmail.com>
2021-01-21 13:51:01 +05:30
Stas Bekman
4a20b7c450 [trainer] no --deepspeed and --sharded_ddp together (#9712)
* no --deepspeed and --sharded_ddp together

* Update src/transformers/trainer.py

Co-authored-by: Sylvain Gugger <35901082+sgugger@users.noreply.github.com>

* style

Co-authored-by: Sylvain Gugger <35901082+sgugger@users.noreply.github.com>
2021-01-20 16:50:21 -08:00
Sylvain Gugger
7acfa95afb Add missing new line 2021-01-20 14:13:16 -05:00
Darigov Research
5a307ece82 Adds flashcards to Glossary & makes small corrections (#8949)
* fix: Makes small typo corrections & standardises glossary

* feat: Adds introduction & links to transformer flashcards

* feat: Adds attribution & adjustments requested in #8949

* feat: Adds flashcards to community.md

* refactor: Removes flashcards from glossary
2021-01-20 13:28:40 -05:00
Sylvain Gugger
3cd91e8162 Fix WAND_DISABLED test (#9703)
* Fix WAND_DISABLED test

* Remove duplicate import

* Make a test that actually works...

* Fix style
2021-01-20 12:30:24 -05:00
Sylvain Gugger
2a703773aa Fix style 2021-01-20 12:17:40 -05:00
Stas Bekman
cd5565bed3 fix the backward for deepspeed (#9705) 2021-01-20 09:07:07 -08:00
Gunjan Chhablani
538245b0c2 Fix Trainer and Args to mention AdamW, not Adam. (#9685)
* Fix Trainer and Args to mention AdamW, not Adam.

* Update the docs for Training Arguments.

* Change arguments adamw_* to adam_*

* Fixed links to AdamW in TrainerArguments docs

* Fix line length in Training Args docs.
2021-01-20 11:59:31 -05:00
NielsRogge
88583d4958 Add notebook (#9696) 2021-01-20 10:19:26 -05:00
NielsRogge
d1370d29b1 Add DeBERTa head models (#9691)
* Add DebertaForMaskedLM, DebertaForTokenClassification, DebertaForQuestionAnswering

* Add docs and fix quality

* Fix Deberta not having pooler
2021-01-20 10:18:50 -05:00
Sylvain Gugger
a7b62fece5 Fix Funnel Transformer conversion script (#9683) 2021-01-20 09:50:20 -05:00
acul3
8940c7662d Add t5 convert to transformers-cli (#9654)
* Update run_mlm.py

* add t5 model to transformers-cli convert

* update rum_mlm.py same as master

* update converting model docs

* update converting model docs

* Update convert.py

* Trigger notification

* update import sorted

* fix typo t5
2021-01-20 09:34:27 -05:00
Julien Plu
7251a4736d Fix template (#9697) 2021-01-20 09:04:53 -05:00
Julien Plu
14042d560f New TF embeddings (cleaner and faster) (#9418)
* Create new embeddings + add to BERT

* Add Albert

* Add DistilBert

* Add Albert + Electra + Funnel

* Add Longformer + Lxmert

* Add last models

* Apply style

* Update the template

* Remove unused imports

* Rename attribute

* Import embeddings in their own model file

* Replace word_embeddings per weight

* fix naming

* Fix Albert

* Fix Albert

* Fix Longformer

* Fix Lxmert Mobilebert and MPNet

* Fix copy

* Fix template

* Update the get weights function

* Update src/transformers/modeling_tf_utils.py

Co-authored-by: Sylvain Gugger <35901082+sgugger@users.noreply.github.com>

* Update src/transformers/models/electra/modeling_tf_electra.py

Co-authored-by: Sylvain Gugger <35901082+sgugger@users.noreply.github.com>

* address Sylvain's comments

Co-authored-by: Sylvain Gugger <35901082+sgugger@users.noreply.github.com>
2021-01-20 12:08:12 +01:00
Julien Plu
12f0d7e8e0 Fix label datatype in TF Trainer (#9616)
* Fix label datatype

* Apply style
2021-01-20 12:08:00 +01:00
Sylvain Gugger
76f36e183a Add a community page to the docs (#9682) 2021-01-20 04:54:36 -05:00
Sylvain Gugger
582f516adb Use datasets squad_v2 metric in run_qa (#9677) 2021-01-20 04:52:13 -05:00
LSinev
a98173cc45 make RepetitionPenaltyLogitsProcessor faster (#9600) 2021-01-20 10:23:01 +01:00
Sylvain Gugger
a1ad16a446 Restrain tokenizer.model_max_length default (#9681)
* Restrain tokenizer.model_max_length default

* Fix indent
2021-01-20 04:17:39 -05:00
Sylvain Gugger
7e662e6a3b Fix model templates and use less than 119 chars (#9684)
* Fix model templates and use less than 119 chars

* Missing new line
2021-01-19 17:11:22 -05:00
Daniel Stancl
2ebbbf558c Add separated decoder_head_mask for T5 Models (#9634)
* Add decoder_head_mask for PyTorch T5 model

* Add decoder_head_mask args into T5Model and T5ForConditionalGeneration

* Slightly change the order of input args to be in accordance
with the convention from BART-based models introduced within the PR #9569.

* Make style for modeling_t5.py

* Add decoder_head_mask for TF T5 models

* Separate head_mask and decoder_head_mask args in TF T5 models

* Slightly change the order of input args to follow convention
of BART-based models updated in PR #9569

* Update test_forward_signature tests/test_modeling_tf_common.py
w.r.t. the changed order of input args

* Add FutureWarnings for T5 and TFT5 models

* Add FutureWarnings for T5 and TFT5 models warning a user that
input argument `head_mask` was split into two arguments -
`head_mask` and `decoder_head_mask`

* Add default behaviour - `decoder_head_mask` is set to copy
`head_mask`

* Fix T5 modeling and FutureWarning

* Make proper usage of head_mask and decoder_head_mask
in cross_attention

* Fix conditions for raising FutureWarning

* Reformat FutureWarning in T5 modeling

* Refactor the warning message
2021-01-19 22:50:25 +01:00
Sylvain Gugger
e4c06ed664 New run_seq2seq script (#9605)
* New run_seq2seq script

* Add tests

* Mark as slow

* Update examples/seq2seq/run_seq2seq.py

Co-authored-by: Patrick von Platen <patrick.v.platen@gmail.com>

* Update src/transformers/data/data_collator.py

Co-authored-by: Suraj Patil <surajp815@gmail.com>

* Update src/transformers/data/data_collator.py

Co-authored-by: Suraj Patil <surajp815@gmail.com>

* Address review comments

Co-authored-by: Patrick von Platen <patrick.v.platen@gmail.com>
Co-authored-by: Suraj Patil <surajp815@gmail.com>
2021-01-19 15:22:17 -05:00
Julien Plu
fa876aee2a Fix TF Flaubert and XLM (#9661)
* Fix Flaubert and XLM

* Fix Flaubert and XLM

* Apply style
2021-01-19 18:02:57 +01:00
max yue
11ec74905a Update integrations.py (#9652)
File "/share/apps/anaconda3/envs/my_env/lib/python3.7/site-packages/transformers/integrations.py", line 419, in __init__
    self._SummaryWriter = SummaryWriter
UnboundLocalError: local variable 'SummaryWriter' referenced before assignment
2021-01-19 11:39:49 -05:00
Yusuke Mori
b020a736c3 Update past_key_values in GPT-2 (#9596)
* Update past_key_values in gpt2 (#9391)

* Update generation_utils, and rename some items

* Update modeling_gpt2 to avoid an error in gradient_checkpointing

* Remove 'reorder_cache' from util and add variations to XLNet, TransfoXL, GPT-2

* Change the location of '_reorder_cache' in modeling files

* Add '_reorder_cache' in modeling_ctrl

* Fix a bug of my last commit in CTRL

* Add '_reorder_cache' to GPT2DoubleHeadsModel

* Manage 'use_cache' in config of test_modeling_gpt2

* Clean up the doc string

* Update src/transformers/models/gpt2/modeling_gpt2.py

Co-authored-by: Sylvain Gugger <35901082+sgugger@users.noreply.github.com>

* Fix the doc string (GPT-2, CTRL)

* improve gradient_checkpointing_behavior

Co-authored-by: Sylvain Gugger <35901082+sgugger@users.noreply.github.com>
Co-authored-by: Patrick von Platen <patrick.v.platen@gmail.com>
2021-01-19 16:00:15 +01:00
Sylvain Gugger
97b787fb4e Fix old Seq2SeqTrainer (#9675) 2021-01-19 09:56:25 -05:00
Sylvain Gugger
d302d88b47 Fix GPT conversion script (#9676) 2021-01-19 09:55:37 -05:00
Sylvain Gugger
053efc5d2d Fix imports in conversion scripts (#9674) 2021-01-19 09:40:15 -05:00
Patrick von Platen
2390c16fd2 add mbart to automodel for masked lm (#9673) 2021-01-19 15:19:11 +01:00
Patrick von Platen
b39bd763e8 Update README.md 2021-01-19 12:25:51 +01:00
Sergey Mkrtchyan
917dbb15e0 Fix DPRReaderTokenizer's attention_mask (#9663)
* Fix the attention_mask in DPRReaderTokenizer

* Add an integration test for DPRReader inference

* Run make style
2021-01-19 05:43:11 -05:00
Patrick von Platen
12c1b5b8f4 fix test (#9669) 2021-01-19 09:06:24 +01:00
Daniel Stancl
357fb1c5d8 Add head_mask/decoder_head_mask for BART (#9569)
* Add head_mask/decoder_head_mask for BART

This branch implement head_mask and decoder_head_mask
for BART-based models. Full list below:
- BART
- MBart
- Blenderbot
- BlenderbotSmall
- Marian
- Pegasus

Everything is accompanied with updated testing.

* Fix test_headmasking for BART models

* Fix text_headmasking for BART-like models
which has only 2 layers in each modules.
The condition
```
self.assertNotEqual(attentions[1][..., 0, :, :].flatten().sum().item(), 0.0)
```
is, therefore, invalid for encoder-decoder models considering
the `head_mask`
```
head_mask = torch.ones(
    self.model_tester.num_hidden_layers,
    self.model_tester.num_attention_heads,
    device=torch_device,
)
head_mask[0, 0] = 0
head_mask[-1, :-1] = 0
```
specified in the `test_headmasking` test/function.

* Adjust test_modeling_common.py to reflect T5 input args

* Update tests/test_modeling_common.py

Co-authored-by: Lysandre Debut <lysandre@huggingface.co>

* Apply suggestions from code review

Co-authored-by: Sylvain Gugger <35901082+sgugger@users.noreply.github.com>

* make style

* make fix-copies

Co-authored-by: Patrick von Platen <patrick.v.platen@gmail.com>
Co-authored-by: Lysandre Debut <lysandre@huggingface.co>
Co-authored-by: Sylvain Gugger <35901082+sgugger@users.noreply.github.com>
2021-01-18 13:35:22 +01:00
Devrim
65eb5d9ac5 Fix: torch.utils.checkpoint import error. (#9626) 2021-01-18 04:33:39 -05:00
Anthony MOI
72fc9abf17 Remove duplicated extra["retrieval"] (#9621) 2021-01-18 04:24:21 -05:00
Stas Bekman
c60e0e1ee4 deepspeed + grad acumm (#9622) 2021-01-15 10:12:26 -08:00
Lysandre Debut
6d3b688b04 Ignore lm_head decoder bias warning (#9615)
* Ignore lm_head decoder bias warning

* Revert "Ignore lm_head decoder bias warning"

This reverts commit f25177a9da6ca898e351f46c8b1515971de5c670.

* predictions -> lm_head
2021-01-15 09:40:21 -05:00
Julien Plu
8eba1f8ca8 Remove unused token_type_ids in MPNet (#9564)
* Add warning

* Remove unused import

* Fix missing call

* Fix missing call

* Completely remove token_type_ids

* Apply style

* Remove unused import

* Update src/transformers/models/mpnet/modeling_tf_mpnet.py

Co-authored-by: Lysandre Debut <lysandre@huggingface.co>

Co-authored-by: Lysandre Debut <lysandre@huggingface.co>
2021-01-15 08:06:29 -05:00
Patrick von Platen
90ca8d36e9 [TF Led] Fix wrong decoder attention mask behavior (#9601)
* fix tf led

* remove loop file
2021-01-15 06:40:27 -05:00
Kiyoung Kim
85788bae5c Revert "Gradient accumulation for TFTrainer (#9585)"
This reverts commit 3f40070c88.
2021-01-15 10:47:01 +01:00
Stas Bekman
82498cbc37 [deepspeed doc] install issues + 1-gpu deployment (#9582)
* [doc] install + 1-gpu deployment

* Apply suggestions from code review

Co-authored-by: Lysandre Debut <lysandre@huggingface.co>
Co-authored-by: Sylvain Gugger <35901082+sgugger@users.noreply.github.com>

* improvements

Co-authored-by: Lysandre Debut <lysandre@huggingface.co>
Co-authored-by: Sylvain Gugger <35901082+sgugger@users.noreply.github.com>
2021-01-14 11:05:04 -08:00
Sylvain Gugger
329fe2746a Upstream (and rename) sortish sampler (#9574)
* Upstream (and rename) sortish sampler

* Use proper sampler

* Update src/transformers/trainer_pt_utils.py

Co-authored-by: Lysandre Debut <lysandre@huggingface.co>

Co-authored-by: Lysandre Debut <lysandre@huggingface.co>
2021-01-14 10:38:14 -05:00
Kiyoung Kim
3f40070c88 Gradient accumulation for TFTrainer (#9585)
* gradient accumulation for tftrainer

* label naming

Co-authored-by: Sylvain Gugger <35901082+sgugger@users.noreply.github.com>

* label naming

Co-authored-by: Sylvain Gugger <35901082+sgugger@users.noreply.github.com>

Co-authored-by: Sylvain Gugger <35901082+sgugger@users.noreply.github.com>
2021-01-14 10:16:39 -05:00
Lysandre
e43f3b6190 v4.2.1 in docs 2021-01-14 14:25:30 +01:00
Lysandre Debut
280db79ac1 BatchEncoding.to with device with tests (#9584) 2021-01-14 07:57:58 -05:00
Lysandre Debut
8bf27075a2 Fix conda build (#9589)
* conda build -> conda-build

* Syntax error

* conda build -> conda-build + 4.2.0

* Prepare to merge in `master`
2021-01-14 05:51:52 -05:00
Stas Bekman
c99751dd9d [setup.py] note on how to get to transformers exact dependencies from shell (#9553)
* note on how to get to deps from shell

* Apply suggestions from code review

Co-authored-by: Sylvain Gugger <35901082+sgugger@users.noreply.github.com>

* fix text

Co-authored-by: Sylvain Gugger <35901082+sgugger@users.noreply.github.com>
2021-01-14 05:04:08 -05:00
Julien Plu
a26536f0c8 Make logs tf compliant (#9565) 2021-01-14 04:56:53 -05:00
Julien Plu
14d677ca4a Compliancy with tf-nightly (#9570)
* Compliancy with tf-nightly

* Add more version + restore min version check
2021-01-14 04:35:35 -05:00
Sylvain Gugger
46ed56cfd1 Switch metrics in run_ner to datasets (#9567)
* Switch metrics in run_ner to datasets

* Add flag to return all metrics

* Upstream (and rename) sortish_sampler

* Revert "Upstream (and rename) sortish_sampler"

This reverts commit e07d0dcf650c2bae36da011dd76c77a8bb4feb0d.
2021-01-14 03:37:07 -05:00
Sylvain Gugger
5e1bea4f16 Fix Trainer with a parallel model (#9578)
* Fix Trainer with a parallel model

* More clean up
2021-01-14 03:23:41 -05:00
Patrick von Platen
126fd281bc Update README.md 2021-01-13 16:55:59 +01:00
Lysandre
e63cad7936 v4.3.0.dev0 2021-01-13 16:16:54 +01:00
Lysandre
33a8497db8 v4.2.0 documentation 2021-01-13 16:15:40 +01:00
Lysandre
7d9a9d0c72 Release: v4.2.0
Some checks failed
Model templates runner / run_tests_templates (push) Has been cancelled
Release - Conda / build_and_package (push) Has been cancelled
2021-01-13 16:01:51 +01:00
Lysandre Debut
c949516695 Fix slow tests v4.2.0 (#9561)
* Fix conversational pipeline test

* LayoutLM

* ProphetNet

* BART

* Blenderbot & small

* Marian

* mBART

* Pegasus

* Tapas tokenizer

* BERT2BERT test

* Style

* Example requirements

* TF BERT2BERT test
2021-01-13 09:55:48 -05:00
Sylvain Gugger
04dc65e5c6 Fix data parallelism in Trainer (#9566)
* Fix data parallelism in Trainer

* Update src/transformers/training_args.py

Co-authored-by: Lysandre Debut <lysandre@huggingface.co>

Co-authored-by: Lysandre Debut <lysandre@huggingface.co>
2021-01-13 09:54:41 -05:00
Stas Bekman
b2dfcc567b use correct deps for torchhub (#9552) 2021-01-13 08:02:53 -05:00
Yusuke Mori
eabad8fd9c Update run_glue for do_predict with local test data (#9442) (#9486)
* Update run_glue for do_predict with local test data (#9442)

* Update run_glue (#9442): fix comments ('files' to 'a file')

* Update run_glue (#9442): reflect the code review

* Update run_glue (#9442): auto format

* Update run_glue (#9442): reflect the code review
2021-01-13 07:48:35 -05:00
LSinev
0c9f01a8e5 Speed up TopKLogitsWarper and TopPLogitsWarper (pytorch) (#9557)
* make TopKLogitsWarper faster

* make TopPLogitsWarper faster
2021-01-13 07:47:47 -05:00
Pavel Tarashkevich
27d0e01d75 Fix classification script: enable dynamic padding with truncation (#9554)
Co-authored-by: Pavel Tarashkevich <Pavel.Tarashkievich@orange.com>
2021-01-13 07:46:48 -05:00
Lysandre Debut
245cdb469d Fix barthez tokenizer (#9562) 2021-01-13 06:24:10 -05:00
Julien Chaumond
247a7b2029 Doc: Update pretrained_models wording (#9545)
* Update pretrained_models.rst

To clarify things cf. this tweet for instance https://twitter.com/RTomMcCoy/status/1349094111505211395

* format
2021-01-13 05:58:05 -05:00
Suraj Patil
69ed36063a fix BlenderbotSmallTokenizer (#9538)
* add model_input_names

* fix test
2021-01-13 10:53:43 +05:30
Stas Bekman
2df34f4aba [trainer] deepspeed integration (#9211)
* deepspeed integration

* style

* add test

* ds wants to do its own backward

* fp16 assert

* Update src/transformers/training_args.py

Co-authored-by: Sylvain Gugger <35901082+sgugger@users.noreply.github.com>

* style

* for clarity extract what args are being passed to deepspeed

* introduce the concept of self.wrapped_model

* s/self.wrapped_model/self.model_wrapped/

* complete transition to self.wrapped_model / self.model

* fix

* doc

* give ds its own init

* add custom overrides, handle bs correctly

* fix test

* clean up model_init logic, fix small bug

* complete fix

* collapse --deepspeed_config into --deepspeed

* style

* start adding doc notes

* style

* implement hf2ds optimizer and scheduler configuration remapping

* oops

* call get_num_training_steps absolutely when needed

* workaround broken auto-formatter

* deepspeed_config arg is no longer needed - fixed in deepspeed master

* use hf's fp16 args in config

* clean

* start on the docs

* rebase cleanup

* finish up --fp16

* clarify the supported stages

* big refactor thanks to discovering deepspeed.init_distributed

* cleanup

* revert fp16 part

* add checkpoint-support

* more init ds into integrations

* extend docs

* cleanup

* unfix docs

* clean up old code

* imports

* move docs

* fix logic

* make it clear which file it's referring to

* document nodes/gpus

* style

* wrong format

* style

* deepspeed handles gradient clipping

* easier to read

* major doc rewrite

* Apply suggestions from code review

Co-authored-by: Sylvain Gugger <35901082+sgugger@users.noreply.github.com>

* docs

* switch to AdamW optimizer

* style

* Apply suggestions from code review

Co-authored-by: Lysandre Debut <lysandre@huggingface.co>

* clarify doc

Co-authored-by: Sylvain Gugger <35901082+sgugger@users.noreply.github.com>
Co-authored-by: Lysandre Debut <lysandre@huggingface.co>
2021-01-12 19:05:18 -08:00
Sylvain Gugger
5f6721032a Use the right version of tokenizers (#9550)
* Use the right version of tokenizers

* Try another way

* Try another way

* Deps are installed from there...

* Deps are installed from there...

* Revert last

* remove needless comment
2021-01-12 18:55:45 -05:00
Sylvain Gugger
063d8d27f4 Refactor prepare_seq2seq_batch (#9524)
* Add target contextmanager and rework prepare_seq2seq_batch

* Fix tests, treat BART and Barthez

* Add last tokenizers

* Fix test

* Set src token before calling the superclass

* Remove special behavior for T5

* Remove needless imports

* Remove needless asserts
2021-01-12 18:19:38 -05:00
Sylvain Gugger
e6ecef711e Revert, it was not the issue. 2021-01-12 18:00:22 -05:00
Sylvain Gugger
250f27f207 Fix tokenizers install for now 2021-01-12 17:50:27 -05:00
Lysandre Debut
dfbf0f5598 topk -> top_k (#9541) 2021-01-12 16:21:29 -05:00
Lysandre Debut
a1100fac67 LayoutLM Config (#9539) 2021-01-12 10:03:50 -05:00
NielsRogge
e45eba3b1c Improve LayoutLM (#9476)
* Add LayoutLMForSequenceClassification and integration tests

Improve docs

Add LayoutLM notebook to list of community notebooks

* Make style & quality

* Address comments by @sgugger, @patrickvonplaten and @LysandreJik

* Fix rebase with master

* Reformat in one line

* Improve code examples as requested by @patrickvonplaten

Co-authored-by: Lysandre <lysandre.debut@reseau.eseo.fr>
Co-authored-by: Lysandre Debut <lysandre@huggingface.co>
2021-01-12 09:26:32 -05:00
Suraj Patil
ccd1923f46 [T5] enable T5 fp16 (#9487)
* fix t5 fp16
2021-01-12 17:12:33 +05:30
Patrick von Platen
2aa9c2f204 fix blenderbot tok (#9532) 2021-01-12 05:53:32 -05:00
Lysandre Debut
406cbf58b2 Shouldn't stale issues/PRs with feature request label (#9511) 2021-01-12 04:49:15 -05:00
Simon Brandeis
3b67c5abb0 Update 'Develop on Windows' guidelines (#9519) 2021-01-12 04:15:16 -05:00
Patrick von Platen
a051d8928a [ProphetNet] Fix naming and wrong config (#9514)
* fix naming issues

* better names
2021-01-12 04:10:05 -05:00
Patrick von Platen
7f28613213 [TFBart] Split TF-Bart (#9497)
* make templates ready

* make add_new_model_command_ready

* finish tf bart

* prepare tf mbart

* finish tf bart

* add tf mbart

* add marian

* prep pegasus

* add tf pegasus

* push blenderbot tf

* add blenderbot

* add blenderbot small

* clean-up

* make fix copy

* define blend bot tok

* fix

* up

* make style

* add to docs

* add copy statements

* overwrite changes

* improve

* fix docs

* finish

* fix last slow test

* fix missing git conflict line

* fix blenderbot

* up

* fix blenderbot small

* load changes

* finish copied from

* upload fix
2021-01-12 02:06:32 +01:00
Stas Bekman
0ecbb69806 [make docs] parallel build (#9522)
After experimenting with different number of workers https://github.com/huggingface/transformers/issues/9496#issuecomment-758145868 4-5 workers seems to be the most optimal - let's go with 4 as surely we wouldn't find a cpu with less cores these days.

Fixes part of https://github.com/huggingface/transformers/issues/9496

@sgugger
2021-01-11 13:00:08 -08:00
Stas Bekman
e6f211cade [trainer] round numbers in trainer state (#9491)
* round numbers

* style

* round only on logging
2021-01-11 10:17:49 -08:00
Sylvain Gugger
01a1684078 Make doc styler behave properly on Windows (#9516) 2021-01-11 10:25:24 -05:00
Sylvain Gugger
6009668c63 Add link to forums thread 2021-01-11 10:00:59 -05:00
Julien Plu
ba702966ba Fix cardinality (#9505) 2021-01-11 09:42:19 -05:00
Stas Bekman
33b7422839 [trainer] remove --model_parallel (#9451)
* fix bad merge - dropped code

* remove --model_parallel

* Deal with TrainingArguments

* Use a private attr and fix batch sizes

* fix _n_gpu

* add is_parallel helper wrapper

* fix attribute

* introduce a new attribute is_model_parallel

* docs

* docs

* Put back init False and rearrange doc

* Ignore non-init args in HFArgumentParser

Co-authored-by: Sylvain Gugger <sylvain.gugger@gmail.com>
2021-01-11 09:39:28 -05:00
Stas Bekman
6f63501383 [doc] How To Request Support document stab (#9288)
* How To Request Support document stab

* integrate suggestions

* Apply suggestions from code review

Co-authored-by: Sylvain Gugger <35901082+sgugger@users.noreply.github.com>

* small corrections

* expand on how to search for issues with examples

* address issues

* Update ISSUES.md

Co-authored-by: Lysandre Debut <lysandre@huggingface.co>

* patrick's suggestion

* patrick's suggestion

* small fix

Co-authored-by: Sylvain Gugger <35901082+sgugger@users.noreply.github.com>
Co-authored-by: Lysandre Debut <lysandre@huggingface.co>
2021-01-11 09:23:51 -05:00
Nicolas Patry
d20e9c7299 Enable TruncationStrategy override for pipelines (#9432)
* Enable TruncationStrategy override for pipelines

* Update isort.

* Fixing test

* Fixing text_generation pipeline.

* Using same DummyTok as other PR  for easier merge later.

* Some more import guards.

* Remove bogus file.

* Do not pass `generate_kwargs` to `_parse_and_tokenize`.
@patrickvonplaten

* Removed DummyTok.

* Doc quality.
2021-01-11 09:23:28 -05:00
Sylvain Gugger
8d25df2c7a Make doc styler detect lists on rst (#9488) 2021-01-11 08:53:41 -05:00
Aakash Tripathi
5a442a8db1 New Updated DistilGPT-2 Finetuning and Generation (#9494)
https://github.com/huggingface/transformers/pull/3177
2021-01-11 14:34:39 +01:00
Patrick von Platen
6c8ec2a931 fix tf led pt test (#9513) 2021-01-11 14:14:48 +01:00
Julien Plu
1e3c362235 Fix template (#9512) 2021-01-11 08:03:28 -05:00
Lysandre Debut
d415882b41 Remove tolerance + drop_rows_to_fit by default (#9507)
* Remove tolerance + drop_rows_to_fit by default

* remove drop_rows_to_fit
2021-01-11 08:02:41 -05:00
Julien Plu
1243ee7d0c Full rework of the TF input/output embeddings and bias resizing (#9193)
* Start rework resizing

* Rework bias/decoder resizing

* Full resizing rework

* Full resizing rework

* Start to update the models with the new approach

* Finish to update the models

* Update all the tests

* Update the template

* Fix tests

* Fix tests

* Test a new approach

* Refactoring

* Refactoring

* Refactoring

* New rework

* Rework BART

* Rework bert+blenderbot

* Rework CTRL

* Rework Distilbert

* Rework DPR

* Rework Electra

* Rework Flaubert

* Rework Funnel

* Rework GPT2

* Rework Longformer

* Rework Lxmert

* Rework marian+mbart

* Rework mobilebert

* Rework mpnet

* Rework openai

* Rework pegasus

* Rework Roberta

* Rework T5

* Rework xlm+xlnet

* Rework template

* Fix TFT5EncoderOnly + DPRs

* Restore previous methods

* Fix Funnel

* Fix CTRL and TransforXL

* Apply style

* Apply Sylvain's comments

* Restore a test in DPR

* Address the comments

* Fix bug

* Apply style

* remove unused import

* Fix test

* Forgot a method

* missing test

* Trigger CI

* naming update

* Rebase

* Trigger CI
2021-01-11 06:27:28 -05:00
Julien Plu
cf416764f4 Fix template (#9504) 2021-01-11 05:21:25 -05:00
Richard Liaw
09926c8e86 fix-template (#9499)
Signed-off-by: Richard Liaw <rliaw@berkeley.edu>
2021-01-10 20:34:17 -05:00
Julien Plu
4f7022d68d Reformat (#9482) 2021-01-10 15:10:15 +01:00
Nicolas Patry
96f1f74aaf Fixing tests. It seems master changed something in the warnings. (#9483)
Trying to keep warning tests for now. Should be discarded if it becomes
too hard to maintain.
2021-01-10 15:08:20 +01:00
Boris Dayma
1c19b423bf fix(wandb): fix config (#9489) 2021-01-08 14:32:02 -05:00
Nicolas Patry
02e05fb0a5 Making Conversation possible to create directly a full conversation (#9434)
* Cleaning up conversation tests.

* Adding tests that don't require downloading models + conversation can be

fully created from static state.

* Making tests non flaky (by fixing generation length)

* Bumping isort version.

* Doc cleanup.

* Remove unused test in this PR.

* Torch import guard for TF.

* Missing torch guard.

* Small mistake in doc.

* Actual uses `_history` and `_index` cache.

+ remove dead enumerate
+ improve warning message.

* Update src/transformers/pipelines/conversational.py

Co-authored-by: Sylvain Gugger <35901082+sgugger@users.noreply.github.com>

* Update src/transformers/pipelines/conversational.py

Co-authored-by: Sylvain Gugger <35901082+sgugger@users.noreply.github.com>

* Update src/transformers/pipelines/conversational.py

Co-authored-by: Sylvain Gugger <35901082+sgugger@users.noreply.github.com>

* Adding comments and cleaner code to address history copy.

* Improving pipeline name in tests.

* Change tokenizer to a real one (still created at runtime with no

external dependency)

* Simplify DummyTok, reverse changes on tokenization.

* Removing DummyTok.

Co-authored-by: Sylvain Gugger <35901082+sgugger@users.noreply.github.com>
2021-01-08 14:33:25 +01:00
Julien Plu
4fbcf8ea49 Fix TF input for np.ndarray (#9294)
* Fix input for np.ndarray"

* add a test

* add a test

* Add a test

* Apply style

* Fix test
2021-01-08 08:23:29 -05:00
Thomas Tanon
e34e45536f Makes HfArgumentParser compatible with Python 3.9 (#9479)
Python 3.9 changed the format of the string serialization of `typing.Optional`.
For example, `str(typing.Optional[str])` is
`typing.Union[str, NoneType]` in python 3.8 and
`typing.Optional[str]` in Python 3.9.
2021-01-08 08:10:44 -05:00
Sylvain Gugger
1bdf42409c Fast imports part 3 (#9474)
* New intermediate inits

* Update template

* Avoid importing torch/tf/flax in tokenization unless necessary

* Styling

* Shutup flake8

* Better python version check
2021-01-08 07:40:59 -05:00
Patrick von Platen
79bbcc5260 [Generation] Fix bug for manual decoder_input_ids + warning message (#9472)
* up

* improve style
2021-01-08 05:50:39 -05:00
Patrick von Platen
9e1ea846bc [README] Add new models (#9465)
* add new models

* make fix-copies
2021-01-08 05:49:43 -05:00
Nicolas Patry
bf9056442a Removing duplicated code for Translation,Summarization and Text2TextGeneration pipelines (#9433)
* Merging all duplicated codes for Text2TextPipeline while preserving
backward compat.

* Fixing TranslationPipeline Hierarchy + return_name

* torch import guard.

* Update isort version.

* Remove code from other PR disentanglement.

* Removed named example to something more agnostic.
2021-01-07 23:10:16 +01:00
Patrick von Platen
f33a6f3446 [TFGPT2] - Fix flaky past_key_values test (#9460)
* fix tf flakey

* remove test files
2021-01-07 16:12:08 +01:00
Sylvain Gugger
758ed3332b Transformers fast import part 2 (#9446)
* Main init work

* Add version

* Change from absolute to relative imports

* Fix imports

* One more typo

* More typos

* Styling

* Make quality script pass

* Add necessary replace in template

* Fix typos

* Spaces are ignored in replace for some reason

* Forgot one models.

* Fixes for import

Co-authored-by: LysandreJik <lysandre.debut@reseau.eseo.fr>

* Add documentation

* Styling

Co-authored-by: LysandreJik <lysandre.debut@reseau.eseo.fr>
2021-01-07 09:36:14 -05:00
Patrick von Platen
a400fe8931 [LED Test] fix common inputs pt for flaky pt-tf led test (#9459)
* fix common inputs pt flakey led

* fix other tests correspondingly
2021-01-07 12:29:03 +01:00
Patrick von Platen
ae5a32bb0d up (#9454) 2021-01-07 11:51:02 +01:00
Julien Plu
812045adcc New serving (#9419)
* Add a serving method

* Add albert

* Add serving for BERT and BART

* Add more models

* Finish the serving addition

* Temp fix

* Restore DPR

* Fix funnel attribute

* Fix attributes GPT2

* Fix OpenAIGPT attribute

* Fix T5 attributes

* Fix Bart attributes

* Fix TransfoXL attributes

* Add versioning

* better test

* Update template

* Fix Flaubert

* Fix T5

* Apply style

* Remove unused imports

* Deactivate extra parameters

* Remove too long test + saved_model default to False

* Ignore the saved model test for some models

* Fix some inputs

* Fix mpnet serving

* Trigger CI

* Address all comments
2021-01-07 11:48:49 +01:00
guillaume-be
390cf16bc8 Prophetnet optimization (#9453)
* Vectorized `ngram_attention_bias` calculation

* updated formatting with black

* Further optimization

* one (last) optimization
2021-01-07 11:41:58 +01:00
Stas Bekman
28d74872cc a more reliable version of branching point discovery (#9449) 2021-01-07 04:47:50 -05:00
Sylvain Gugger
3ec40299c1 Remove nested lxmert (#9440) 2021-01-07 04:10:41 -05:00
Patrick von Platen
b8462b5b2a [GenerationOutputs] Fix GenerationOutputs Tests (#9443)
* fix generation models

* fix led

* fix docs

* add is_decoder

* fix last docstrings

* make style

* fix t5 cross attentions

* correct t5
2021-01-06 19:37:02 +01:00
Sylvain Gugger
0c96262f7d Fast transformers import part 1 (#9441)
* Don't import libs to check they are available

* Don't import integrations at init

* Add importlib_metdata to deps

* Remove old vars references

* Avoid syntax error

* Adapt testing utils

* Try to appease torchhub

* Add dependency

* Remove more private variables

* Fix typo

* Another typo

* Refine the tf availability test
2021-01-06 12:17:24 -05:00
Simon Brandeis
c89f1bc92e Add flags to return scores, hidden states and / or attention weights in GenerationMixin (#9150)
* Define new output dataclasses for greedy generation

* Add output_[...] flags in greedy generation methods

Added output_attentions, output_hidden_states, output_scores flags in
generate and greedy_search methods in GenerationMixin.

* [WIP] Implement logic and tests for output flags in generation

* Update GreedySearchOutput classes & docstring

* Implement greedy search output accumulation logic

Update greedy_search unittests

Fix generate method return value docstring

Properly init flags with the default config

* Update configuration to add output_scores flag

* Fix test_generation_utils

Sort imports and fix isinstance tests for GreedySearchOutputs

* Fix typo in generation_utils

* Add return_dict_in_generate for backwards compatibility

* Add return_dict_in_generate flag in config

* Fix tyPo in configuration

* Fix handling of attentions and hidden_states flags

* Make style & quality

* first attempt attentions

* some corrections

* improve tests

* special models requires special test

* disable xlm test for now

* clean tests

* fix for tf

* isort

* Add output dataclasses for other generation methods

* Add logic to return dict in sample generation

* Complete test for sample generation

- Pass output_attentions and output_hidden_states flags to encoder in
encoder-decoder models
- Fix import satements order in test_generation_utils file

* Add logic to return dict in sample generation

- Refactor tests to avoid using self.assertTrue, which provides
scarce information when the test fails
- Add tests for the three beam_search methods: vanilla, sample and
grouped

* Style doc

* Fix copy-paste error in generation tests

* Rename logits to scores and refactor

* Refactor group_beam_search for consistency

* make style

* add sequences_scores

* fix all tests

* add docs

* fix beam search finalize test

* correct docstring

* clean some files

* Made suggested changes to the documentation

* Style doc ?

* Style doc using the Python util

* Update src/transformers/generation_utils.py

* fix empty lines

* fix all test

Co-authored-by: Patrick von Platen <patrick.v.platen@gmail.com>
2021-01-06 17:11:42 +01:00
Kevin Canwen Xu
7a9f1b5c99 Store transformers version info when saving the model (#9421)
* Store transformers version info when saving the model

* Store transformers version info when saving the model

* fix format

* fix format

* fix format

* Update src/transformers/configuration_utils.py

Co-authored-by: Lysandre Debut <lysandre@huggingface.co>

* Update configuration_utils.py

Co-authored-by: Lysandre Debut <lysandre@huggingface.co>
2021-01-06 23:34:48 +08:00
Qbiwan
ecfcac223c Improve documentation coverage for Phobert (#9427)
* first commit

* change phobert to phoBERT as per author in overview

* v3 and v4 both runs on same code hence there is no need to differentiate them

Co-authored-by: Sylvain Gugger <35901082+sgugger@users.noreply.github.com>
2021-01-06 10:04:32 -05:00
Qbiwan
be898998bb Improve documentation coverage for Herbert (#9428)
* first commit

* changed XLMTokenizer to HerbertTokenizer in code example
2021-01-06 09:13:43 -05:00
Patrick von Platen
b972c1bfb0 finalize (#9431) 2021-01-06 14:36:55 +01:00
Sylvain Gugger
bcb55d33ce Upgrade styler to better handle lists (#9423)
* Add missing lines before a new list.

* Update doc styler and restyle some files.

* Fix docstrings of LED and Longformer
2021-01-06 07:46:17 -05:00
NielsRogge
b7e548976f Fix URLs to TAPAS notebooks (#9435) 2021-01-06 07:20:41 -05:00
Stas Bekman
9f675b05d4 [trainer] self.model_wrapped + _model_unwrap (#9390)
* model wrapped + model_unwrap

* cleanup

* Apply suggestions from code review

Co-authored-by: Sylvain Gugger <35901082+sgugger@users.noreply.github.com>

* style

* deprecation warning

* Apply suggestions from code review

Co-authored-by: Sylvain Gugger <35901082+sgugger@users.noreply.github.com>

Co-authored-by: Sylvain Gugger <35901082+sgugger@users.noreply.github.com>
2021-01-06 06:50:11 -05:00
Sylvain Gugger
453a70d4cb Allow example to use a revision and work with private models (#9407)
* Allow example to use a revision and work with private models

* Copy to other examples and template

* Styling
2021-01-06 06:49:23 -05:00
Manuel Romero
7988edc031 Fix link to Notebook to fine-tune TAPAS (#9413)
Co-authored-by: Lysandre Debut <lysandre@huggingface.co>
2021-01-06 03:44:52 -05:00
Manuel Romero
c9553c0352 Fix link to Evaluate TAPAS Notebook (#9414) 2021-01-06 03:42:50 -05:00
Nicolas Patry
090d28e32d [Refactor] Splitting pipelines.py into its own module. (#9279)
* Splitting pipelines into its own module.

* Moving everything into base.py

* Moving FeatureExtractionPipeline into its own file.

* TextGenerationPipeline.

* TextClassifictionPipeline

* ZeroShot + get_framework import.

* FillMaskPipeline

* NerPipeline + TokenClassificationPipeline

* QuestionAnsweringPipeline

* TableQuestionAnsweringPipeline

* ConversationnalPipeline

* Text2TextGenerationPipeline, TranslationPipeline, SummarizationPipeline

* Typo import fix.

* Relative imports.
2021-01-06 09:33:50 +01:00
Stas Bekman
d64372fdfc [docs] outline sharded ddp doc (#9208)
* outline sharded dpp doc

* fix link

* add example

* Apply suggestions from code review

Co-authored-by: Sylvain Gugger <35901082+sgugger@users.noreply.github.com>

* narrow the command and remove non-essentials

Co-authored-by: Sylvain Gugger <35901082+sgugger@users.noreply.github.com>
2021-01-05 17:34:15 -08:00
Patrick von Platen
eef66035a2 [PyTorch Bart] Split Bart into different models (#9343)
* first try

* remove old template

* finish bart

* finish mbart

* delete unnecessary line

* init pegasus

* save intermediate

* correct pegasus

* finish pegasus

* remove cookie cutter leftover

* add marian

* finish blenderbot

* replace in file

* correctly split blenderbot

* delete "old" folder

* correct "add statement"

* adapt config for tf comp

* correct configs for tf

* remove ipdb

* fix more stuff

* fix mbart

* push pegasus fix

* fix mbart

* more fixes

* fix research projects code

* finish docs for bart, mbart, and marian

* delete unnecessary file

* correct attn typo

* correct configs

* remove pegasus for seq class

* correct peg docs

* correct peg docs

* finish configs

* further improve docs

* add copied from statements to mbart

* fix copied from in mbart

* add copy statements to marian

* add copied from to marian

* add pegasus copied from

* finish pegasus

* finish copied from

* Apply suggestions from code review

* make style

* backward comp blenderbot

* apply lysandres and sylvains suggestions

* apply suggestions

* push last fixes

* fix docs

* fix tok tests

* fix imports code style

* fix doc
2021-01-05 22:00:05 +01:00
Clement
4eec5d0cf6 improve readme text to private models/versioning/api (#9424) 2021-01-05 15:02:46 -05:00
Stas Bekman
d9e848c1d6 add experimental warning (#9412) 2021-01-05 10:05:32 -05:00
Stas Bekman
29acabd886 [trainer] group fp16 args together (#9409)
* [t5 doc] typos

a few run away backticks

@sgugger

* style

* [trainer] put fp16 args together

this PR proposes a purely cosmetic change that puts all the fp16 args together - so they are easier to manager/read

@sgugger

* style
2021-01-05 09:39:38 -05:00
Yusuke Mori
57a6626929 [examples/text-classification] Fix a bug for using one's own dataset of a regression task (#9411) 2021-01-05 08:15:06 -05:00
Patrick von Platen
189387e9b2 LED (#9278)
* create model

* add integration

* save current state

* make integration tests pass

* add one more test

* add explanation to tests

* remove from bart

* add padding

* remove unnecessary test

* make all tests pass

* re-add cookie cutter tests

* finish PyTorch

* fix attention test

* Update tests/test_modeling_common.py

* revert change

* remove unused file

* add string to doc

* save intermediate

* make tf integration tests pass

* finish tf

* fix doc

* fix docs again

* add led to doctree

* add to auto tokenizer

* added tips for led

* make style

* apply jplus statements

* correct tf longformer

* apply lysandres suggestions

* apply sylvains suggestions

* Apply suggestions from code review
2021-01-05 13:14:30 +01:00
Sugeeth
314cca2842 Fix documentation links always pointing to master. (#9217)
* Use extlinks to point hyperlink with the version of code

* Point to version on release and master until then

* Apply style

* Correct links

* Add missing backtick

* Simple missing backtick after all.

Co-authored-by: Raghavendra Sugeeth P S <raghav-5305@raghav-5305.csez.zohocorpin.com>
Co-authored-by: Lysandre <lysandre.debut@reseau.eseo.fr>
2021-01-05 06:18:48 -05:00
Julien Plu
52d62e686c Fix TF Funnel (#9300)
* Fix Funnel

* Apply Patrick's comment

* Remove comment

* Fix dummy value

* Apply style
2021-01-05 05:54:50 -05:00
Stas Bekman
748006c0b3 [trainer] --model_parallel hasn't been implemented for most models (#9347)
* --model_parallel hasn't been implemented for most models

* make the help clear as well

* implement is_parallelizable; use it

* oops

* remove property
2021-01-05 04:01:30 -05:00
Julien Plu
4225740a7b Use stable functions (#9369) 2021-01-05 03:58:26 -05:00
Stas Bekman
4aa8f6ad99 [logging] autoflush (#9385)
This PR proposes to:

* auto-flush `transformers` logging 

When using logging for tracing signals from different parts of the code and which could be mixed with print debug this aids to get all the logging events synchronized. 

I don't think this change will introduce any performance impacts.

If it helps someone here is the code I used to sync `transformers` logging with various other debug prints.

I was porting bart to MP and I needed to trace that the device switching happens correctly and I added a bunch of logger.info calls inside `modeling_bart.py` and also had some other helpers `print` debug messages which weren't logger based:

```

# auto flush std streams
from sys import stdout, stderr
def stdout_write_flush(args, w=stderr.write): w(args); stderr.flush()
def stderr_write_flush(args, w=stderr.write): w(args); stderr.flush()
stdout.write = stdout_write_flush
stderr.write = stderr_write_flush

from transformers import BartTokenizer, BartForConditionalGeneration, BartConfig

import logging
import transformers.utils.logging
import transformers.models.bart.modeling_bart

# I wanted a shorter simpler format
handlers = transformers.utils.logging._get_library_root_logger().handlers
for handler in handlers:
    formatter = logging.Formatter("[%(funcName)s] %(message)s")
    handler.setFormatter(formatter)

transformers.models.bart.modeling_bart.logger.setLevel(transformers.logging.INFO)
```

@LysandreJik, @sgugger, @patrickvonplaten
2021-01-05 03:57:57 -05:00
Julien Plu
83eec97ec6 Fix TF Longformer (#9348)
* Fix longformer

* Apply style

* Remove serving content

* Forgot a condition

* Apply style

* Address Patrick's comments

* Fix dtype
2021-01-05 03:49:54 -05:00
Boris Dayma
30fa0b780f feat(wandb): save model as artifact (#8119)
* feat(wandb): log artifacts

* fix: typo

* feat(wandb): ensure name is allowed

* feat(wandb): log artifact

* feat(wandb): saving logic

* style: improve formatting

* fix: unrelated typo

* feat: use a fake trainer

* fix: simplify

* feat(wandb): log model files as artifact

* style: fix style

* docs(wandb): correct description

* feat: unpack model + allow env Truethy values

* feat: TrainerCallback can access tokenizer

* style: fix style

* feat(wandb): log more interesting metadata

* feat: unpack tokenizer

* feat(wandb): metadata with load_best_model_at_end

* feat(wandb): more robust metadata

* style(wandb): fix formatting
2021-01-05 03:30:46 -05:00
Stas Bekman
143289dcf7 [test_model_parallelization] multiple fixes (#9354) 2021-01-04 12:09:12 -08:00
Qbiwan
086718ac6e Improve documentation coverage for Bertweet (#9379)
* bertweet docs coverage

* style doc max len 119

* maxlen style rst

* run main() from style_doc

* changed according to  comments
2021-01-04 13:12:59 -05:00
Stas Bekman
47ca0eaaac replace apex.normalization.FusedLayerNorm with torch.nn.LayerNorm (#9386) 2021-01-04 19:00:08 +01:00
Patrick von Platen
75ff530551 correct docs (#9378) 2021-01-04 17:27:29 +01:00
Julien Plu
ec54d70e16 Fix TF DPR (#9283)
* Fix DPR

* Keep usual models

* Apply style

* Address Sylvain's comments
2021-01-04 17:26:56 +01:00
Julien Plu
de29ff9bd2 Fix open (#9368) 2021-01-04 10:22:15 -05:00
Stas Bekman
d018afced0 [trainer] parametrize default output_dir (#9352)
This PR:

* fixes trainer to have the logger agree with the actual default `output_dir`, but setting it one place and passing it as an argument to both places

@sgugger
2021-01-04 10:14:32 -05:00
Julien Plu
d735b074d7 Fix Flaubert (#9292) 2021-01-04 16:06:28 +01:00
dependabot[bot]
5dd389d1c7 Bump notebook from 6.1.4 to 6.1.5 in /examples/research_projects/lxmert (#9402)
Bumps [notebook](https://github.com/jupyter/jupyterhub) from 6.1.4 to 6.1.5.
- [Release notes](https://github.com/jupyter/jupyterhub/releases)
- [Changelog](https://github.com/jupyterhub/jupyterhub/blob/master/CHECKLIST-Release.md)
- [Commits](https://github.com/jupyter/jupyterhub/commits)

Signed-off-by: dependabot[bot] <support@github.com>

Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
2021-01-04 10:02:07 -05:00
Sylvain Gugger
23a71449c0 Put back LXMert example (#9401) 2021-01-04 09:59:07 -05:00
Julien Plu
6c03d4ac70 Fix CTRL (#9291) 2021-01-04 09:56:51 -05:00
Charles
c581d8af7a Add utility function for retrieving locally cached models (#8836)
* add get_cached_models function

* add List type to import

* fix code quality

* Update src/transformers/file_utils.py

Co-authored-by: Sylvain Gugger <35901082+sgugger@users.noreply.github.com>

* Update src/transformers/file_utils.py

Co-authored-by: Sylvain Gugger <35901082+sgugger@users.noreply.github.com>

* Update src/transformers/file_utils.py

Co-authored-by: Sylvain Gugger <35901082+sgugger@users.noreply.github.com>

* Update src/transformers/file_utils.py

Co-authored-by: Sylvain Gugger <35901082+sgugger@users.noreply.github.com>

* Update src/transformers/file_utils.py

Co-authored-by: Sylvain Gugger <35901082+sgugger@users.noreply.github.com>

* Fix style

Co-authored-by: Sylvain Gugger <35901082+sgugger@users.noreply.github.com>
2021-01-04 09:53:54 -05:00
Sam Shleifer
8eb7f26d5d simplify marian distillation script (#9394) 2021-01-04 11:21:24 +05:30
Yoshitomo Matsubara
d944966b19 Fix typos in README and bugs in RAG example code for end-to-end evaluation and finetuning (#9355)
* fix a bug in eval_batch_retrieval

* should return parser as well as other staticmethod

* remove duplicate argument

* these kwargs are no longer accepted (cause TypeError in self.generator.generate of modeling_rag.py)

* fixed file paths in README

* moved an arg to add_ray_specific_args
2021-01-03 16:00:30 +01:00
Chris Kennedy
c4fd609afb file_utils.py: TF examples outputs.last_hidden_states -> state (#9382) 2021-01-02 17:58:16 +01:00
Patrick von Platen
b01f451ca3 [Docs] past_key_values return a tuple of tuple as a default (#9381)
* push

* make style
2021-01-02 15:55:07 +01:00
Derrick Blakely
5f7a07c0c8 use return dict for rag encoder (#9363) 2021-01-02 12:39:14 +01:00
Stas Bekman
ae333d04b2 torch.cuda.is_available() is redundant as apex handles that internally (#9350) 2020-12-30 10:09:51 +01:00
Stas Bekman
8217d4e37f [prophetnet] wrong import (#9349)
```
python -c "from apex.normalization import FusedProphetNetLayerNorm"
Traceback (most recent call last):
  File "<string>", line 1, in <module>
ImportError: cannot import name 'FusedProphetNetLayerNorm' from 'apex.normalization' (/home/stas/anaconda3/envs/main-38/lib/python3.8/site-packages/apex/normalization/__init__.py)
```
It looks like this code has never been tested, so it silently fails inside try/except.

Discovered this by accident in https://github.com/huggingface/transformers/issues/9338#issuecomment-752217708
2020-12-29 22:32:07 +01:00
Patrick von Platen
912f6881d2 add import math (#9346) 2020-12-29 19:35:06 +01:00
Patrick von Platen
785e52cd30 improve templates (#9342) 2020-12-29 16:48:44 +01:00
Julien Plu
64103fb6be Fix TransfoXL (#9302) 2020-12-28 20:52:18 +01:00
Julien Plu
d97d06d05f Fix TF T5 (#9301)
* Fix T5

* Fix test

* Fix test
2020-12-28 20:51:40 +01:00
Patrick von Platen
83fdd252f6 [Seq2Seq Templates] Correct some TF-serving errors and add gradient checkpointing to PT by default. (#9334)
* correct tests

* correct shape and get_tf_activation

* more correction tf

* add gradient checkpointing to templates

* correct typo
2020-12-28 17:51:04 +01:00
Patrick von Platen
8e74eca7f2 push (#9320) 2020-12-27 21:57:50 +01:00
Patrick von Platen
61443cd7d9 [GPT2] Correct gradient checkpointing (#9308)
* correct gpt2

* fix gpt2

* fix use_cache ordering

* correct past tolerance

* fix for all cases

* style
2020-12-25 23:28:12 +01:00
Vasudev Gupta
21fc676645 add translation example (#9303)
* Created using Colaboratory

* mbart-training examples add

* link add

* Update description

Co-authored-by: Suraj Patil <surajp815@gmail.com>
2020-12-25 14:47:49 +05:30
Patrick von Platen
52b3a05e83 [Bart doc] Fix outdated statement (#9299)
* fix bart doc

* fix docs
2020-12-24 14:47:53 +01:00
Bram Vanroy
7777db159f Update tokenization_utils_base.py (#9293)
Missing "s" typo
2020-12-24 14:43:14 +01:00
Daniele Sartiano
71963a6633 fix typo in modeling_encoder_decoder.py (#9297)
* Update modeling_encoder_decoder.py

Fixed typo.

* typo

Co-authored-by: Suraj Patil <surajp815@gmail.com>
2020-12-24 14:38:08 +01:00
Ratthachat (Jung)
f3a3b91d6f Proposed Fix : [RagSequenceForGeneration] generate "without" input_ids (#9220)
* Create modeling_tf_dpr.py

* Add TFDPR

* Add back TFPegasus, TFMarian, TFMBart, TFBlenderBot

last commit accidentally deleted these 4 lines, so I recover them back

* Add TFDPR

* Add TFDPR

* clean up some comments, add TF input-style doc string

* Add TFDPR

* Make return_dict=False as default

* Fix return_dict bug (in .from_pretrained)

* Add get_input_embeddings()

* Create test_modeling_tf_dpr.py

The current version is already passed all 27 tests!
Please see the test run at : 
https://colab.research.google.com/drive/1czS_m9zy5k-iSJbzA_DP1k1xAAC_sdkf?usp=sharing

* fix quality

* delete init weights

* run fix copies

* fix repo consis

* del config_class, load_tf_weights

They shoud be 'pytorch only'

* add config_class back

after removing it, test failed ... so totally only removing "use_tf_weights = None" on Lysandre suggestion

* newline after .. note::

* import tf, np (Necessary for ModelIntegrationTest)

* slow_test from_pretrained with from_pt=True

At the moment we don't have TF weights (since we don't have official official TF model)
Previously, I did not run slow test, so I missed this bug

* Add simple TFDPRModelIntegrationTest

Note that this is just a test that TF and Pytorch gives approx. the same output.
However, I could not test with the official DPR repo's output yet

* upload correct tf model

* remove position_ids as missing keys

* fix RagSeq generate with context_input_ids

fix RagSeq generate with context_input_ids

* apply style

* delete unused lines

* Add test_rag_sequence_generate_batch_from_context_input_ids

* Readability improved

* stylying

* Stylize

* typos

* add check_model_generate_from_context_input_ids

* make style

* Apply suggestions from code review

* make style2

Co-authored-by: Patrick von Platen <patrick.v.platen@gmail.com>
Co-authored-by: patrickvonplaten <patrick@huggingface.co>
2020-12-24 13:38:00 +01:00
Suraj Patil
2a18b70998 enable cache by default (#9296) 2020-12-24 17:47:36 +05:30
Jungwhan
6189ae9960 Fix typo in file_utils.py (#9289) 2020-12-24 13:48:33 +05:30
Jethro Kuan
222dbdb203 allow integer device for BatchEncoding (#9271)
Fixes #9244

Co-authored-by: Jethro Kuan <jethro.kuan@bytedance.com>
2020-12-24 09:01:56 +01:00
Patrick von Platen
6c091abef2 [Templates] Adapt Bert (#9284)
* adapt templates

* adapt config

* add test as well

* fix output type

* fix cache false naming

* finish tests

* last fix
2020-12-24 01:44:33 +01:00
Suraj Patil
88ef8893cd Add caching mechanism to BERT, RoBERTa (#9183)
* add past_key_values

* add use_cache option

* make mask before cutting ids

* adjust position_ids according to past_key_values

* flatten past_key_values

* fix positional embeds

* fix _reorder_cache

* set use_cache to false when not decoder, fix attention mask init

* add test for caching

* add past_key_values for Roberta

* fix position embeds

* add caching test for roberta

* add doc

* make style

* doc, fix attention mask, test

* small fixes

* adress patrick's comments

* input_ids shouldn't start with pad token

* use_cache only when decoder

* make consistent with bert

* make copies consistent

* add use_cache to encoder

* add past_key_values to tapas attention

* apply suggestions from code review

* make coppies consistent

* add attn mask in tests

* remove copied from longformer

* apply suggestions from code review

* fix bart test

* nit

* simplify model outputs

* fix doc

* fix output ordering
2020-12-23 23:01:32 +05:30
Sylvain Gugger
a1cb6e9866 Adapt to new name of label_smoothing_factor training arg (#9282) 2020-12-23 11:05:21 -05:00
Connor Brinton
bcc87c639f Minor documentation revisions from copyediting (#9266)
* typo: Revise "checkout" to "check out"

* typo: Change "seemlessly" to "seamlessly"

* typo: Close parentheses in "Using the tokenizer"

* typo: Add closing parenthesis to supported models aside

* docs: Treat ``position_ids`` as plural

Alternatively, the word "argument" could be added to make the subject singular.

* docs: Remove comma, making subordinate clause

* docs: Remove comma separating verb and direct object

* docs: Fix typo ("next" -> "text")

* docs: Reverse phrase order to simplify sentence

* docs: "quicktour" -> "quick tour"

* docs: "to throw" -> "from throwing"

* docs: Remove disruptive newline in padding/truncation section

* docs: "show exemplary" -> "show examples of"

* docs: "much harder as" -> "much harder than"

* docs: Fix typo "seach" -> "search"

* docs: Fix subject-verb disagreement in WordPiece description

* docs: Fix style in preprocessing.rst
2020-12-23 10:15:49 -05:00
Patrick von Platen
d5db6c37d4 [Seq2Seq Templates] Fix check_repo.py templates file (#9277)
* add enc dec pt model to check repo

* fix indent
2020-12-23 11:40:20 +01:00
Xu Song
4bafc43b0e Fix param error (#9273)
TypeError: forward() got an unexpected keyword argument 'token_type_ids'
2020-12-23 11:34:57 +01:00
Xu Song
58e8a7611f Fix gpt2 document (#9272) 2020-12-23 11:34:15 +01:00
Patrick von Platen
cbe63949d7 Model Templates for Seq2Seq (#9251)
* adapt cookie cutter

* fix copy past statement

* delete copy statements for now

* remove unused import from template

* make doc rst

* correct config docstring

* correct training

* correct inputs processing tf enc dec

* make style

* adapt templates

* clean tabs

* correct tensor -> Tensor naming

* correct indent

* correct templates

* fix the test

* break lines to avoid > 119

* Apply suggestions from code review
2020-12-22 23:41:20 +01:00
Sylvain Gugger
e6c1f1cad8 Revert renaming in finetune_trainer (#9262) 2020-12-22 15:42:34 -05:00
Sylvain Gugger
ab17758874 Add speed metrics to all example scripts + template (#9260) 2020-12-22 14:02:26 -05:00
Julien Chaumond
5b5f7dd09c [hf_api] Fix incorrect typing 2020-12-22 19:52:47 +01:00
Julien Plu
1558d191e6 Fix TF BART for saved model creation (#9252)
* Fix TF BART for saved model creation

* Apply style

* Update src/transformers/models/bart/modeling_tf_bart.py

Co-authored-by: Sylvain Gugger <35901082+sgugger@users.noreply.github.com>

* Update src/transformers/models/bart/modeling_tf_bart.py

Co-authored-by: Sylvain Gugger <35901082+sgugger@users.noreply.github.com>

* Rework the fix

* Fix condition

* Apply style

* Fix condition

* Fix shape_list

* Apply Patrick's solution

* Apply Patrick's solution

* Rebase

* make tests pass

Co-authored-by: Sylvain Gugger <35901082+sgugger@users.noreply.github.com>
Co-authored-by: patrickvonplaten <patrick.v.platen@gmail.com>
2020-12-22 18:07:04 +01:00
Manuel Romero
37d6fb5d04 Fix link to bertabs/README.md (#9255) 2020-12-22 11:41:23 -05:00
Manuel Romero
189c1b91a6 Fix link to old language modeling script (#9254) 2020-12-22 11:40:47 -05:00
Sylvain Gugger
490b39e614 Seq2seq trainer (#9241)
* Add label smoothing in Trainer

* Add options for scheduler and Adafactor in Trainer

* Put Seq2SeqTrainer in the main lib

* Apply suggestions from code review

Co-authored-by: Stas Bekman <stas00@users.noreply.github.com>
Co-authored-by: Patrick von Platen <patrick.v.platen@gmail.com>

* Address review comments and adapt scripts

* Documentation

* Move test not using script to tests folder

Co-authored-by: Stas Bekman <stas00@users.noreply.github.com>
Co-authored-by: Patrick von Platen <patrick.v.platen@gmail.com>
2020-12-22 11:33:44 -05:00
Sylvain Gugger
1fc7119181 Fix script that check objects are documented (#9259) 2020-12-22 11:12:58 -05:00
Patrick von Platen
e9d77ccd5a [EncoderDecoder] Make tests more aggressive (#9256)
* add tests

* make style and fix bart bug

* fix bart past key value edge case

* correct tf bart test

* fix gpt2 tf

* fix t5 test
2020-12-22 17:00:04 +01:00
Sylvain Gugger
ec07da65e2 Update the README of the text classification example (#9237)
* Update the README of the text classification example

* Update examples/README.md

Co-authored-by: Patrick von Platen <patrick.v.platen@gmail.com>

* Adapt comment from review

Co-authored-by: Patrick von Platen <patrick.v.platen@gmail.com>
2020-12-21 15:23:40 -05:00
Teven
4eef5889ac Adding performer fine-tuning research exampke (#9239)
* added run_mlm_performer.py research example

* make styke

* make styke

* Added a README !
2020-12-21 21:19:41 +01:00
Patrick von Platen
9a12b9696f [MPNet] Add slow to fast tokenizer converter (#9233)
* add converter

* delet unnecessary comments
2020-12-21 15:41:34 +01:00
Suraj Patil
f4432b7e01 add base model classes to bart subclassed models (#9230)
* add base model classes to  bart subclassed models

* add doc
2020-12-21 19:56:46 +05:30
TobiasNorlund
08abdabda1 Fixed beam search generation for GPT2 and T5 (#9219) 2020-12-21 08:05:23 -05:00
Julien Plu
161a6461db Fix TF template (#9234) 2020-12-21 13:52:16 +01:00
Julien Plu
5a8a4eb187 Improve BERT-like models performance with better self attention (#9124)
* Improve BERT-like models attention layers

* Apply style

* Put back error raising instead of assert

* Update template

* Fix copies

* Apply raising valueerror in MPNet

* Restore the copy check for the Intermediate layer in Longformer

* Update longformer
2020-12-21 13:10:15 +01:00
Patrick von Platen
6b034309ca fix warning (#9231) 2020-12-21 10:41:34 +01:00
Amog Kamsetty
a4b21cdd20 [RAG] Add Ray implementation for distributed retrieval (#9197)
* wip

* wip

* wip

* wip

* wip

* wip

* wip

* wip

* uncomment

* uncomment

* wip

* updates

* add docstring

* updates

* fix arg

* fixes

* add unit tests

* update readme

* update readme

* update finetune script

* update test

* add test

* add ray to test dependencies

* separate ray and ray tune

* formatting

* shutdown ray at end of test

* fix tests

* formatting

* formatting

* even more formatting

* address comments

* formatting

* add files

* Update examples/research_projects/rag/test_distributed_retriever.py

Co-authored-by: Sylvain Gugger <35901082+sgugger@users.noreply.github.com>

* address comments

* addressing comments

Co-authored-by: Ubuntu <ubuntu@ip-172-31-21-208.us-west-2.compute.internal>
Co-authored-by: Sylvain Gugger <35901082+sgugger@users.noreply.github.com>
2020-12-21 10:39:30 +01:00
Stas Bekman
f38c4ad302 better logging and help (#9203) 2020-12-20 10:28:28 -08:00
sandip
e0e255be1f Added TF TransfoXL Sequence Classification (#9169)
* TF Transfoxl seq classification

* Update test_modeling_tf_transfo_xl.py

Added num_labels to config level

* TF Transfoxl seq classification

* Update test_modeling_tf_transfo_xl.py

Added num_labels to config level

* code refactor

* code refactor

* code refator
2020-12-19 14:44:04 +01:00
Stas Bekman
6b850b671d [run_glue] add speed metrics (#9198)
* add speed metrics

* suggestions
2020-12-18 17:09:30 -08:00
Stas Bekman
3ff5e8955a [t5 doc] typos (#9199)
* [t5 doc] typos

a few run away backticks

@sgugger

* style
2020-12-18 16:03:26 -08:00
Aleksey Tikhonov
291974c65c GPT-model attention heads pruning example (#9189)
* Pruning for GPT attn heads

* The code formatted according to the transformers requirements

* Update run_prune_gpt.py

* Update run_prune_gpt.py
2020-12-18 16:32:10 -05:00
Sylvain Gugger
1198ba8fba Add timing inside Trainer (#9196)
* Add timing inside Trainer

* Fix tests

* Add n_objs for train

* Sort logs
2020-12-18 15:10:39 -05:00
Sylvain Gugger
9a25c5bd3a Add new run_swag example (#9175)
* Add new run_swag example

* Add check

* Add sample

* Apply suggestions from code review

Co-authored-by: Lysandre Debut <lysandre@huggingface.co>

* Very important change to make Lysandre happy

Co-authored-by: Lysandre Debut <lysandre@huggingface.co>
2020-12-18 14:19:24 -05:00
Sylvain Gugger
3e56e2ce04 Fix typo 2020-12-18 10:11:07 -05:00
Manuel Romero
077a5dce32 Fix link to old SQUAD fine-tuning script (#9181) 2020-12-18 09:12:10 -05:00
Stas Bekman
84d5879eaf [setup] correct transformers version format (#9176)
setuptools has a pretty fixed expectation of version numbers.

This PR fixes the dev version number and adds a comment with correct formats for the future editors

This fix removes this warning on `make fixup|style|etc` or any other time `setup.py` is being run.
```
setuptools/dist.py:452: UserWarning: Normalizing '4.2.0dev0' to '4.2.0.dev0'
  warnings.warn(tmpl.format(**locals()))
```
and the alternative:
```
/setuptools/dist.py:452: UserWarning: Normalizing '4.0.0-rc-1' to '4.0.0rc1
```

Fixes: #8749

@LysandreJik, @sgugger
2020-12-18 08:55:55 -05:00
Wissam Antoun
fd7b6a5274 fixed JSON error in run_qa with fp16 (#9186) 2020-12-18 07:53:23 -05:00
Manuel Romero
66a14a2f6f Fix link to old NER fine-tuning script (#9182) 2020-12-17 19:50:01 -05:00
Stas Bekman
f06d0fadc9 [trainer] apex fixes and tests (#9180) 2020-12-17 16:49:11 -08:00
sandip
467e9158b4 Added TF CTRL Sequence Classification (#9151)
* Added TF CTRL Sequence Classification

* code refactor
2020-12-17 18:10:57 -05:00
Stas Bekman
63841c559b add tests for the new sharded ddp fairscale integration (#9177) 2020-12-17 14:24:03 -08:00
Lysandre
bf713cdec7 setup.py development version 2020-12-17 11:29:31 -05:00
Lysandre
bd40345d3e v4.1.1 docs 2020-12-17 11:28:38 -05:00
Lysandre
bfa4ccf77d Release: v4.1.1
Some checks failed
Model templates runner / run_tests_templates (push) Has been cancelled
Release - Conda / build_and_package (push) Has been cancelled
2020-12-17 11:25:49 -05:00
Lysandre
e0790cca78 Fix TAPAS doc 2020-12-17 11:25:05 -05:00
Sylvain Gugger
6d2e864db7 Put all models in the constants (#9170)
* Put all models in the constants

* Add Google AI mention in the main README
2020-12-17 11:23:21 -05:00
Lysandre
f83d9c8da7 v4.1.0 docs 2020-12-17 10:16:07 -05:00
510 changed files with 73199 additions and 15657 deletions

View File

@@ -421,15 +421,15 @@ workflows:
- run_tests_git_lfs
- build_doc
- deploy_doc: *workflow_filters
tpu_testing_jobs:
triggers:
- schedule:
# Set to run at the first minute of every hour.
cron: "0 8 * * *"
filters:
branches:
only:
- master
jobs:
- cleanup-gke-jobs
- run_examples_tpu
# tpu_testing_jobs:
# triggers:
# - schedule:
# # Set to run at the first minute of every hour.
# cron: "0 8 * * *"
# filters:
# branches:
# only:
# - master
# jobs:
# - cleanup-gke-jobs
# - run_examples_tpu

View File

@@ -53,4 +53,6 @@ deploy_doc "3ebb1b3" v3.2.0
deploy_doc "0613f05" v3.3.1
deploy_doc "eb0e0ce" v3.4.0
deploy_doc "818878d" v3.5.1
deploy_doc "c781171" # v4.0.0 Latest stable release
deploy_doc "c781171" v4.0.0
deploy_doc "bfa4ccf" v4.1.1
deploy_doc "7d9a9d0" # v4.2.0 Latest stable release

View File

@@ -11,7 +11,7 @@ assignees: ''
## Environment info
<!-- You can run the command `transformers-cli env` and copy-and-paste its output below.
Don't forget to fill out the missing fields in that output! -->
- `transformers` version:
- Platform:
- Python version:
@@ -24,32 +24,41 @@ assignees: ''
<!-- Your issue will be replied to more quickly if you can figure out the right person to tag with @
If you know how to use git blame, that is the easiest way, otherwise, here is a rough guide of **who to tag**.
Please tag fewer than 3 people.
albert, bert, GPT2, XLM: @LysandreJik
tokenizers: @mfuntowicz
Trainer: @sgugger
Speed and Memory Benchmarks: @patrickvonplaten
Model Cards: @julien-c
TextGeneration: @TevenLeScao
examples/distillation: @VictorSanh
nlp datasets: [different repo](https://github.com/huggingface/nlp)
rust tokenizers: [different repo](https://github.com/huggingface/tokenizers)
Text Generation: @patrickvonplaten @TevenLeScao
Blenderbot: @patrickvonplaten
Bart: @patrickvonplaten
Marian: @patrickvonplaten
Pegasus: @patrickvonplaten
mBART: @patrickvonplaten
T5: @patrickvonplaten
Longformer/Reformer: @patrickvonplaten
TransfoXL/XLNet: @TevenLeScao
RAG: @patrickvonplaten, @lhoestq
FSMT: @stas00
examples/seq2seq: @patil-suraj
examples/bert-loses-patience: @JetRunner
tensorflow: @jplu
examples/token-classification: @stefan-it
documentation: @sgugger
Models:
- albert, bert, xlm: @LysandreJik
- blenderbot, bart, marian, pegasus, encoderdecoder, t5: @patrickvonplaten, @patil-suraj
- longformer, reformer, transfoxl, xlnet: @patrickvonplaten
- fsmt: @stas00
- funnel: @sgugger
- gpt2: @patrickvonplaten, @LysandreJik
- rag: @patrickvonplaten, @lhoestq
- tensorflow: @jplu
Library:
- benchmarks: @patrickvonplaten
- deepspeed: @stas00
- ray/raytune: @richardliaw, @amogkam
- text generation: @patrickvonplaten
- tokenizers: @n1t0, @LysandreJik
- trainer: @sgugger
- pipelines: @LysandreJik
Documentation: @sgugger
HF projects:
- nlp datasets: [different repo](https://github.com/huggingface/nlp)
- rust tokenizers: [different repo](https://github.com/huggingface/tokenizers)
Examples:
- maintained examples (not research project or legacy): @sgugger, @patil-suraj
- research_projects/bert-loses-patience: @JetRunner
- research_projects/distillation: @VictorSanh
-->
## Information

View File

@@ -37,26 +37,38 @@ members/contributors which may be interested in your PR.
If you know how to use git blame, that is the easiest way, otherwise, here is a rough guide of **who to tag**.
Please tag fewer than 3 people.
albert, bert, XLM: @LysandreJik
GPT2: @LysandreJik, @patrickvonplaten
tokenizers: @mfuntowicz
Trainer: @sgugger
Benchmarks: @patrickvonplaten
Model Cards: @julien-c
examples/distillation: @VictorSanh
nlp datasets: [different repo](https://github.com/huggingface/nlp)
rust tokenizers: [different repo](https://github.com/huggingface/tokenizers)
Text Generation: @patrickvonplaten, @TevenLeScao
Blenderbot, Bart, Marian, Pegasus: @patrickvonplaten
T5: @patrickvonplaten
Rag: @patrickvonplaten, @lhoestq
EncoderDecoder: @patrickvonplaten
Longformer, Reformer: @patrickvonplaten
TransfoXL, XLNet: @TevenLeScao, @patrickvonplaten
examples/seq2seq: @patil-suraj
examples/bert-loses-patience: @JetRunner
tensorflow: @jplu
examples/token-classification: @stefan-it
documentation: @sgugger
FSMT: @stas00
Models:
- albert, bert, xlm: @LysandreJik
- blenderbot, bart, marian, pegasus, encoderdecoder, t5: @patrickvonplaten, @patil-suraj
- longformer, reformer, transfoxl, xlnet: @patrickvonplaten
- fsmt: @stas00
- funnel: @sgugger
- gpt2: @patrickvonplaten, @LysandreJik
- rag: @patrickvonplaten, @lhoestq
- tensorflow: @jplu
Library:
- benchmarks: @patrickvonplaten
- deepspeed: @stas00
- ray/raytune: @richardliaw, @amogkam
- text generation: @patrickvonplaten
- tokenizers: @n1t0, @LysandreJik
- trainer: @sgugger
- pipelines: @LysandreJik
Documentation: @sgugger
HF projects:
- nlp datasets: [different repo](https://github.com/huggingface/nlp)
- rust tokenizers: [different repo](https://github.com/huggingface/tokenizers)
Examples:
- maintained examples (not research project or legacy): @sgugger, @patil-suraj
- research_projects/bert-loses-patience: @JetRunner
- research_projects/distillation: @VictorSanh
-->

1
.github/stale.yml vendored
View File

@@ -6,6 +6,7 @@ daysUntilClose: 7
exemptLabels:
- pinned
- security
- Feature request
# Label to use when marking an issue as stale
staleLabel: wontfix
# Comment to post when marking an issue as stale. Set to `false` to disable

View File

@@ -1,6 +1,6 @@
name: Torch hub integration
on:
on:
push:
branches:
- "*"
@@ -32,8 +32,10 @@ jobs:
- name: Install dependencies
run: |
pip install --upgrade pip
pip install torch
pip install numpy filelock protobuf requests tqdm regex sentencepiece sacremoses tokenizers packaging
# install torch-hub specific dependencies
pip install -e git+https://github.com/huggingface/transformers.git#egg=transformers[torchhub]
# no longer needed
pip uninstall -y transformers
- name: Torch hub list
run: |

View File

@@ -7,6 +7,9 @@ on:
- "tests/**"
- ".github/**"
- "templates/**"
pull_request_target:
branches:
- master
jobs:
run_tests_templates:
@@ -40,6 +43,8 @@ jobs:
transformers-cli add-new-model --testing --testing_file=templates/adding_a_new_model/tests/pt-encoder-bert-tokenizer.json --path=templates/adding_a_new_model
transformers-cli add-new-model --testing --testing_file=templates/adding_a_new_model/tests/standalone.json --path=templates/adding_a_new_model
transformers-cli add-new-model --testing --testing_file=templates/adding_a_new_model/tests/tf-encoder-bert-tokenizer.json --path=templates/adding_a_new_model
transformers-cli add-new-model --testing --testing_file=templates/adding_a_new_model/tests/tf-seq-2-seq-bart-tokenizer.json --path=templates/adding_a_new_model
transformers-cli add-new-model --testing --testing_file=templates/adding_a_new_model/tests/pt-seq-2-seq-bart-tokenizer.json --path=templates/adding_a_new_model
make style
python utils/check_table.py --fix_and_overwrite
python utils/check_dummies.py --fix_and_overwrite

View File

@@ -37,7 +37,8 @@ jobs:
- name: Build conda packages
run: |
conda info
conda build .github/conda
conda list
conda-build .github/conda
- name: Upload to Anaconda
run: anaconda upload `conda build .github/conda --output` --force
run: anaconda upload `conda-build .github/conda --output` --force

View File

@@ -75,7 +75,7 @@ jobs:
RUN_SLOW: yes
run: |
source .env/bin/activate
pip install -r examples/requirements.txt
pip install -r examples/_tests_requirements.txt
python -m pytest -n 1 --dist=loadfile -s --make-reports=examples_torch_gpu examples
- name: Failure short reports

View File

@@ -328,11 +328,18 @@ for more information.
### Develop on Windows
On windows, you need to configure git to transform Windows `CRLF` line endings to Linux `LF` line endings:
`git config core.autocrlf input`
One way one can run the make command on Window is to pass by MSYS2:
1. [Download MSYS2](https://www.msys2.org/), we assume to have it installed in C:\msys64
2. Open the command line C:\msys64\msys2.exe (it should be available from the start menu)
3. Run in the shell: `pacman -Syu` and install make with `pacman -S make`
4. Add `C:\msys64\usr\bin` to your PATH environment variable.
You can now use `make` from any terminal (Powershell, cmd.exe, etc) 🎉
### Syncing forked master with upstream (HuggingFace) master

275
ISSUES.md Normal file
View File

@@ -0,0 +1,275 @@
<!---
Copyright 2020 The HuggingFace Team. All rights reserved.
Licensed under the Apache License, Version 2.0 (the "License");
you may not use this file except in compliance with the License.
You may obtain a copy of the License at
http://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License.
-->
# How To Request Support
This is an Open Source Project so please be mindful that like in any other project of this kind there is no obligation to answer all requests for help.
However, we want to encourage you to ask for help whenever you think it's needed! We are happy about every question we get because it allows us to better understand your needs, possible misunderstandings, and most importantly a way for you to help us make this library better. That being said, this document's main purpose is to provide guidelines at how you can formulate your requests to increase your chances to be understood and to get support.
There are two main venues to receive support: [the forums](https://discuss.huggingface.co/) and [the GitHub issues](https://github.com/huggingface/transformers/issues).
## The Forums
[The user forums](https://discuss.huggingface.co/) are supported by the wide community of the library users and backed up by developers when needed.
If you have a difficulty with deploying this library or some questions, or you'd like to discuss a new feature, please first consider discussing those things at the forums. Only when you feel your subject matter has been crystalized and you still need support from the library developers do proceed to file an [issue](https://github.com/huggingface/transformers/issues).
In particular all "Please explain" questions or objectively very user-specific feature requests belong to the forums. Here are some example of such questions:
* "I would like to use a BertModel within a RL-Agent for a customer support service. How can I use a BertForMaskedLM in my ChatBotModel?"
* "Could you please explain why T5 has no positional embedding matrix under T5Model?"
* "How should I set my generation parameters for translation?"
* "How to train T5 on De->En translation?"
## The GitHub Issues
Everything which hints at a bug should be opened as an [issue](https://github.com/huggingface/transformers/issues).
You are not required to read the following guidelines before opening an issue. However, if you notice that your issue doesn't get any replies, chances are that the developers have one or several difficulties with its quality. In this case, reading the following points and adjusting your issue accordingly could help.
1. Before posting an issue, first search for already posted issues, since chances are someone has already asked a similar question before you.
If you use Google your search query should be:
```
"huggingface" "transformers" your query
```
The first two quoted words tell Google to limit the search to the context of the Huggingface Transformers. The remainder is your query - most commonly this would be the error message the software fails with. We will go deeper into details shortly.
The results of such a query will typically match GitHub issues, Hugging Face forums, StackExchange, and blogs.
If you find relevant hints, you may choose to continue the discussion there if you have follow up questions.
If what you found is similar but doesn't quite answer your problem, please, post a new issue and do include links to similar issues or forum discussions you may have found.
Let's look at some examples:
The error message, often referred to as an assertion, tells us what went wrong. Here is an example of an assertion:
```python
Traceback (most recent call last):
File "<string>", line 1, in <module>
File "/transformers/src/transformers/__init__.py", line 34, in <module>
from . import dependency_versions_check
File "/transformers/src/transformers/dependency_versions_check.py", line 34, in <module>
from .file_utils import is_tokenizers_available
File "/transformers/src/transformers/file_utils.py", line 40, in <module>
from tqdm.auto import tqdm
ModuleNotFoundError: No module named 'tqdm.auto'
```
and it typically includes a traceback, so that we can see the full stack of calls the program made before it fails. This gives us the context to know why the program failed.
Going back to the above example. If you received this error search, look at the very last line of the error which is:
```python
ModuleNotFoundError: No module named 'tqdm.auto'
```
And now we can use it to do the searching on your favorite search engine:
1. first for `"huggingface" "transformers" "ModuleNotFoundError: No module named 'tqdm.auto'"`
2. if you don't find relevant results, then search for just `"ModuleNotFoundError: No module named 'tqdm.auto'"`
3. and finally if nothing still comes up, then remove the outside quotes: `ModuleNotFoundError: No module named 'tqdm.auto'`
If the error includes any messages that include bits unique to your filesystem, always remove those in the search query since other users will not have the same filesystem as yours. For example:
```bash
python -c 'open("/tmp/wrong_path.txt", "r")'
Traceback (most recent call last):
File "<string>", line 1, in <module>
FileNotFoundError: [Errno 2] No such file or directory: '/tmp/wrong_path.txt'
```
Here you'd search for just: `"FileNotFoundError: [Errno 2] No such file or directory"`
If the local information that you removed were inside the error message and you removed them you may need to remove double quotes since your query is no longer exact. So if the error message was something like:
```bash
ValueError: '/tmp/wrong_path.txt' cannot be found
```
then you'd search for `"ValueError" "cannot be found"`
As you search you will notice that when you don't use quotes often the search engines will return a variety of unrelated hits, which may or may not be what you want.
Experiment with different ways and find which approach gives the most satisfactory results.
2. Keep the issue short, providing the information that you think will aid the developers to understand your situation. Put yourself in the shoes of the person who has never seen your code or knows anything about your custom setup. This mental exercise will help to develop an intuition to what/what not to share"
3. If there is a software failure, always provide the full traceback, for example:
```python
$ python -c 'import transformers'
Traceback (most recent call last):
File "<string>", line 1, in <module>
File "/transformers/src/transformers/__init__.py", line 34, in <module>
from . import dependency_versions_check
File "/transformers/src/transformers/dependency_versions_check.py", line 34, in <module>
from .file_utils import is_tokenizers_available
File "/transformers/src/transformers/file_utils.py", line 40, in <module>
from tqdm.auto import tqdm
ModuleNotFoundError: No module named 'tqdm.auto'
```
As compared to providing just the last line of the error message, e.g.:
```python
ModuleNotFoundError: No module named 'tqdm.auto'
```
which is not sufficient.
If your application is running on more than one GPU (e.g. under `DistributedDataParallel`) and typically getting every log and traceback printed multiple times, please make sure that you paste only one copy of it. At times the traceback from parallel processes may get interleaved - so either disentangle these or change the loggers to log only for `local_rank==0` so that only one process logs things.
4. When quoting a traceback, command line instructions and any type of code always enclose it in triple backticks inside the editor window, that is:
````
```
git clone https://github.com/huggingface/transformers
cd transformers
pip install .
```
````
If it's a command line with a long argument list, please consider breaking it down using backslashes and new lines. Here is an example of a good command line quote:
```bash
cd examples/seq2seq
python -m torch.distributed.launch --nproc_per_node=2 ./finetune_trainer.py \
--model_name_or_path sshleifer/distill-mbart-en-ro-12-4 --data_dir wmt_en_ro \
--output_dir output_dir --overwrite_output_dir \
--do_train --n_train 500 --num_train_epochs 1 \
--per_device_train_batch_size 1 --freeze_embeds \
--src_lang en_XX --tgt_lang ro_RO --task translation \
--fp16 --sharded_ddp
```
If you don't break it up, one has to scroll horizontally which often makes it quite difficult to quickly see what's happening.
The backslashes allow us to copy the command directly into the console to run it, without needing to edit it.
5. Include only the important information that you think will help the developer to quickly identify the problem.
For example applications often create huge amounts of logs. Ask yourself whether providing all or parts of the log is useful.
Pasting a 100-1000 lines of log into the issue is an immediate turn off, since it will take a lot of time to figure out where the pertinent parts of the log are.
Attaching a full log can be helpful if it's done as an attachment, if it's enclosed in the following html code in the comment editor window:
```
<details>
<summary>Full log</summary>
<pre>
many
lines
go
here
</pre>
</details>
```
which would result in the following entry, which can be opened if desired, but otherwise takes little space.
<details>
<summary>Full log</summary>
<pre>
many
lines
go
here
</pre>
</details>
You could also provide a link to a pastebin service, but this is less beneficial since those links tend to expire quickly and future readers of your issue might not be able to access that log file anymore and may lack some context.
6. If this is an issue in your code, do try to reduce that code to a minimal example that still demonstrates the problem. Please ask at the forums if you have a hard time figuring how to do that. Please realize that we don't have the luxury of having time to try and understand all of your custom code.
If you really tried to make a short reproducible code but couldn't figure it out, it might be that having a traceback will give the developer enough information to know what's going on. But if it is not enough and we can't reproduce the problem, we can't really solve it.
Do not dispair if you can't figure it out from the begining, just share what you can and perhaps someone else will be able to help you at the forums.
7. If you forked off some of this project's code or example applications, please, do not ask us to go into your code repository and figure out what you may have done. The code is already very complex and unless there is an easy way to do a diff and it's a small diff, it won't be possible to find someone with time on their hands to make a lengthy investigation. Albeit, you might find someone at the forums who will be generous to do this for you.
8. Before reporting an issue, first, always try to update your environment to the latest official version of this library. We have no resources to go and debug older revisions, which could easily have bugs that have been fixed in the latest released version.
We understand that this is not always possible, especially when APIs change, in which case file an issue against the highest library version your environment can support.
Of course, if you upgrade the library, always retest that the problem is still there.
9. Please do not ask us to reproduce an issue with your custom data, since we don't have it. So, either you should use some existing dataset supported by HF datasets or you need to supply a code that generates a small sample on the fly, or some another quick and simple way to get it.
Please do not send us any non-public domain data that may require a license or a permission to be used.
10. Do not tag multiple developers on the issue unless you know this is expected, either because you asked them and they gave you an explicit permission to tag them or the issue template instructs you to do so.
The "who to tag for what domain" part of the issue template is there to help users direct their questions to the right developers who are designated maintainers of project's specific domains. They can then decide at their own discretion to tag other developers if they feel it'd help move the issue forward.
We currently don't have a triage service and we trust your capacity to identify the right domain and thus the persons to tag in your issue. If you are not sure, please use the forums to ask for guidance.
When in doubt, err on the side of not tagging a given person. If you tag multiple people out of context or permission don't be surprised if you get no response at all. Please remember that every time you tag someone, they get a notification and you're taking their time without their permission. Please be sensitive to that.
If you got helped by one of the developers in the past please don't tag them in future issues, unless they are listed in the issue template for the domain you are asking about or that developer gave you an explicit permission to tag them in future issues.
If you see a certain developer doing multiple and/or recent commits into a specific area of the project that you feel is relevant to your issue, it is not a good reason to tag them. Various developers may be fixing things that prevent them from moving forward, but often their work is focused on a totally different domain. And while they may or may not know how to help you with the problem at hand, it would benefit the whole community much more if they focus on the domain of their unique expertise.
11. Use the Edit button. Take your time, and re-read and improve the wording and formatting to make your posts and comments as easy to understand as possible.
Avoid posting multiple comments in a row, as each comment generates a notification for the developers tagged in that issue. If you happened to post multiple comments in a row, and nobody followed up yet - consider merging those into one or a few comments while editing the combined content to be coherent.
If you choose to edit your older comments after others posted follow up comments you need to be aware that your modifications might not be noticed, so if it's not a typo fixing, try to write a new comment flagging that something has been changed in the previous comments.
For example, the very first comment is the most important one. If while the thread unfolds you realize that things aren't as they seemed to you originally you may want to edit the first post to reflect the up-to-date understanding of the issue at hand so that it helps those who read your issue in the future quickly understand what's going on and not need to sift through dozens of comments. It also helps to indicate that the post was edited. So, those reading the thread later can understand why there might be certain discontinuity in the information flow.
Use bullets and items if you have lists of items and the outcome improves overall readability.
Use backticks to refer to class and function names, e.g. `BartModel` and `generate` as these stand out and improve the speed of a reader's comprehension.
Try not use italics and bold text too much as these often make the text more difficult to read.
12. If you are cross-referencing a specific comment in a given thread or another issue, always link to that specific comment, rather than using the issue link. If you do the latter it could be quite impossible to find which specific comment you're referring to.
To get the link to the specific comment do not copy the url from the location bar of your browser, but instead, click the `...` icon in the upper right corner of the comment and then select "Copy Link".
For example the first link is a link to an issue, and the second to a specific comment in the same issue:
1. https://github.com/huggingface/transformers/issues/9257
2. https://github.com/huggingface/transformers/issues/9257#issuecomment-749945162
13. If you are replying to a last comment, it's totally fine to make your reply with just your comment in it. The readers can follow the information flow here.
But if you're replying to a comment that happened some comments back it's always a good practice to quote just the relevant lines you're replying it. The `>` is used for quoting, or you can always use the menu to do so. For example your editor box will look like:
```
> How big is your gpu cluster?
Our cluster is made of 256 gpus.
```
If you are addressing multiple comments, quote the relevant parts of each before your answer. Some people use the same comment to do multiple replies, others separate them into separate comments. Either way works. The latter approach helps for linking to a specific comment.
In general the best way to figure out what works the best is learn from issues posted by other people - see which issues get great responses and which get little to no response - observe what the posters who received great responses did differently from those who did not.
Thank you for reading this somewhat lengthy document. We would like to conclude that these are not absolute rules, but a friendly advice that will help maximize the chances for us to understand what you are trying to communicate, reproduce the problem then resolve it to your satisfaction and the benefit of the whole community.
If after reading this document there are remaining questions on how and why or there is a need for further elucidation, please, don't hesitate to ask your question in [this thread](https://discuss.huggingface.co/t/how-to-request-support/3128).

View File

@@ -67,4 +67,4 @@ test-examples:
# Check that docs can build
docs:
cd docs && make html SPHINXOPTS="-W"
cd docs && make html SPHINXOPTS="-W -j 4"

View File

@@ -49,7 +49,7 @@ limitations under the License.
## Online demos
You can test most of our models directly on their pages from the [model hub](https://huggingface.co/models). We also offer an [inference API](https://huggingface.co/pricing) to use those models.
You can test most of our models directly on their pages from the [model hub](https://huggingface.co/models). We also offer [private model hosting, versioning, & an inference API](https://huggingface.co/pricing) to use those models.
Here are a few examples:
- [Masked word completion with BERT](https://huggingface.co/bert-base-uncased?text=Paris+is+the+%5BMASK%5D+of+France)
@@ -167,7 +167,7 @@ When TensorFlow 2.0 and/or PyTorch has been installed, 🤗 Transformers can be
pip install transformers
```
If you'd like to play with the examples, you must [install the library from source](https://huggingface.co/transformers/installation.html#installing-from-source).
If you'd like to play with the examples or need the bleeding edge of the code and can't wait for a new release, you must [install the library from source](https://huggingface.co/transformers/installation.html#installing-from-source).
### With conda
@@ -179,7 +179,7 @@ Since Transformers version v4.0.0, we now have a conda channel: `huggingface`.
conda install -c huggingface transformers
```
Follow the installation pages of TensorFlow, PyTorch or Flax to see how to install them with conda.
Follow the installation pages of TensorFlow, PyTorch or Flax to see how to install them with conda.
## Models architectures
@@ -195,7 +195,10 @@ Current number of checkpoints: ![](https://img.shields.io/endpoint?url=https://h
1. **[BERT](https://huggingface.co/transformers/model_doc/bert.html)** (from Google) released with the paper [BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding](https://arxiv.org/abs/1810.04805) by Jacob Devlin, Ming-Wei Chang, Kenton Lee and Kristina Toutanova.
1. **[BERT For Sequence Generation](https://huggingface.co/transformers/model_doc/bertgeneration.html)** (from Google) released with the paper [Leveraging Pre-trained Checkpoints for Sequence Generation Tasks](https://arxiv.org/abs/1907.12461) by Sascha Rothe, Shashi Narayan, Aliaksei Severyn.
1. **[Blenderbot](https://huggingface.co/transformers/model_doc/blenderbot.html)** (from Facebook) released with the paper [Recipes for building an open-domain chatbot](https://arxiv.org/abs/2004.13637) by Stephen Roller, Emily Dinan, Naman Goyal, Da Ju, Mary Williamson, Yinhan Liu, Jing Xu, Myle Ott, Kurt Shuster, Eric M. Smith, Y-Lan Boureau, Jason Weston.
1. **[BlenderbotSmall](https://huggingface.co/transformers/model_doc/blenderbot_small.html)** (from Facebook) released with the paper [Recipes for building an open-domain chatbot](https://arxiv.org/abs/2004.13637) by Stephen Roller, Emily Dinan, Naman Goyal, Da Ju, Mary Williamson, Yinhan Liu, Jing Xu, Myle Ott, Kurt Shuster, Eric M. Smith, Y-Lan Boureau, Jason Weston.
1. **[BORT](https://huggingface.co/transformers/model_doc/bort.html)** (from Alexa) released with the paper [Optimal Subarchitecture Extraction For BERT](https://arxiv.org/abs/2010.10499) by Adrian de Wynter and Daniel J. Perry.
1. **[CamemBERT](https://huggingface.co/transformers/model_doc/camembert.html)** (from Inria/Facebook/Sorbonne) released with the paper [CamemBERT: a Tasty French Language Model](https://arxiv.org/abs/1911.03894) by Louis Martin*, Benjamin Muller*, Pedro Javier Ortiz Suárez*, Yoann Dupont, Laurent Romary, Éric Villemonte de la Clergerie, Djamé Seddah and Benoît Sagot.
1. **[ConvBERT](https://huggingface.co/transformers/model_doc/convbert.html)** (from YituTech) released with the paper [ConvBERT: Improving BERT with Span-based Dynamic Convolution](https://arxiv.org/abs/2008.02496) by Zihang Jiang, Weihao Yu, Daquan Zhou, Yunpeng Chen, Jiashi Feng, Shuicheng Yan.
1. **[CTRL](https://huggingface.co/transformers/model_doc/ctrl.html)** (from Salesforce) released with the paper [CTRL: A Conditional Transformer Language Model for Controllable Generation](https://arxiv.org/abs/1909.05858) by Nitish Shirish Keskar*, Bryan McCann*, Lav R. Varshney, Caiming Xiong and Richard Socher.
1. **[DeBERTa](https://huggingface.co/transformers/model_doc/deberta.html)** (from Microsoft Research) released with the paper [DeBERTa: Decoding-enhanced BERT with Disentangled Attention](https://arxiv.org/abs/2006.03654) by Pengcheng He, Xiaodong Liu, Jianfeng Gao, Weizhu Chen.
1. **[DialoGPT](https://huggingface.co/transformers/model_doc/dialogpt.html)** (from Microsoft Research) released with the paper [DialoGPT: Large-Scale Generative Pre-training for Conversational Response Generation](https://arxiv.org/abs/1911.00536) by Yizhe Zhang, Siqi Sun, Michel Galley, Yen-Chun Chen, Chris Brockett, Xiang Gao, Jianfeng Gao, Jingjing Liu, Bill Dolan.
@@ -209,6 +212,7 @@ Min, Patrick Lewis, Ledell Wu, Sergey Edunov, Danqi Chen, and Wen-tau Yih.
1. **[GPT](https://huggingface.co/transformers/model_doc/gpt.html)** (from OpenAI) released with the paper [Improving Language Understanding by Generative Pre-Training](https://blog.openai.com/language-unsupervised/) by Alec Radford, Karthik Narasimhan, Tim Salimans and Ilya Sutskever.
1. **[GPT-2](https://huggingface.co/transformers/model_doc/gpt2.html)** (from OpenAI) released with the paper [Language Models are Unsupervised Multitask Learners](https://blog.openai.com/better-language-models/) by Alec Radford*, Jeffrey Wu*, Rewon Child, David Luan, Dario Amodei** and Ilya Sutskever**.
1. **[LayoutLM](https://huggingface.co/transformers/model_doc/layoutlm.html)** (from Microsoft Research Asia) released with the paper [LayoutLM: Pre-training of Text and Layout for Document Image Understanding](https://arxiv.org/abs/1912.13318) by Yiheng Xu, Minghao Li, Lei Cui, Shaohan Huang, Furu Wei, Ming Zhou.
1. **[LED](https://huggingface.co/transformers/model_doc/led.html)** (from AllenAI) released with the paper [Longformer: The Long-Document Transformer](https://arxiv.org/abs/2004.05150) by Iz Beltagy, Matthew E. Peters, Arman Cohan.
1. **[Longformer](https://huggingface.co/transformers/model_doc/longformer.html)** (from AllenAI) released with the paper [Longformer: The Long-Document Transformer](https://arxiv.org/abs/2004.05150) by Iz Beltagy, Matthew E. Peters, Arman Cohan.
1. **[LXMERT](https://huggingface.co/transformers/model_doc/lxmert.html)** (from UNC Chapel Hill) released with the paper [LXMERT: Learning Cross-Modality Encoder Representations from Transformers for Open-Domain Question Answering](https://arxiv.org/abs/1908.07490) by Hao Tan and Mohit Bansal.
1. **[MarianMT](https://huggingface.co/transformers/model_doc/marian.html)** Machine translation models trained using [OPUS](http://opus.nlpl.eu/) data by Jörg Tiedemann. The [Marian Framework](https://marian-nmt.github.io/) is being developed by the Microsoft Translator Team.
@@ -219,11 +223,11 @@ Min, Patrick Lewis, Ledell Wu, Sergey Edunov, Danqi Chen, and Wen-tau Yih.
1. **[ProphetNet](https://huggingface.co/transformers/model_doc/prophetnet.html)** (from Microsoft Research) released with the paper [ProphetNet: Predicting Future N-gram for Sequence-to-Sequence Pre-training](https://arxiv.org/abs/2001.04063) by Yu Yan, Weizhen Qi, Yeyun Gong, Dayiheng Liu, Nan Duan, Jiusheng Chen, Ruofei Zhang and Ming Zhou.
1. **[Reformer](https://huggingface.co/transformers/model_doc/reformer.html)** (from Google Research) released with the paper [Reformer: The Efficient Transformer](https://arxiv.org/abs/2001.04451) by Nikita Kitaev, Łukasz Kaiser, Anselm Levskaya.
1. **[RoBERTa](https://huggingface.co/transformers/model_doc/roberta.html)** (from Facebook), released together with the paper a [Robustly Optimized BERT Pretraining Approach](https://arxiv.org/abs/1907.11692) by Yinhan Liu, Myle Ott, Naman Goyal, Jingfei Du, Mandar Joshi, Danqi Chen, Omer Levy, Mike Lewis, Luke Zettlemoyer, Veselin Stoyanov.
ultilingual BERT into [DistilmBERT](https://github.com/huggingface/transformers/tree/master/examples/distillation) and a German version of DistilBERT.
1. **[SqueezeBert](https://huggingface.co/transformers/model_doc/squeezebert.html)** released with the paper [SqueezeBERT: What can computer vision teach NLP about efficient neural networks?](https://arxiv.org/abs/2006.11316) by Forrest N. Iandola, Albert E. Shaw, Ravi Krishna, and Kurt W. Keutzer.
1. **[T5](https://huggingface.co/transformers/model_doc/t5.html)** (from Google AI) released with the paper [Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer](https://arxiv.org/abs/1910.10683) by Colin Raffel and Noam Shazeer and Adam Roberts and Katherine Lee and Sharan Narang and Michael Matena and Yanqi Zhou and Wei Li and Peter J. Liu.
1. **[TAPAS](https://huggingface.co/transformers/master/model_doc/tapas.html)** released with the paper [TAPAS: Weakly Supervised Table Parsing via Pre-training](https://arxiv.org/abs/2004.02349) by Jonathan Herzig, Paweł Krzysztof Nowak, Thomas Müller, Francesco Piccinno and Julian Martin Eisenschlos.
1. **[TAPAS](https://huggingface.co/transformers/model_doc/tapas.html)** (from Google AI) released with the paper [TAPAS: Weakly Supervised Table Parsing via Pre-training](https://arxiv.org/abs/2004.02349) by Jonathan Herzig, Paweł Krzysztof Nowak, Thomas Müller, Francesco Piccinno and Julian Martin Eisenschlos.
1. **[Transformer-XL](https://huggingface.co/transformers/model_doc/transformerxl.html)** (from Google/CMU) released with the paper [Transformer-XL: Attentive Language Models Beyond a Fixed-Length Context](https://arxiv.org/abs/1901.02860) by Zihang Dai*, Zhilin Yang*, Yiming Yang, Jaime Carbonell, Quoc V. Le, Ruslan Salakhutdinov.
1. **[Wav2Vec2](https://huggingface.co/transformers/model_doc/wav2vec2.html)** (from Facebook AI) released with the paper [wav2vec 2.0: A Framework for Self-Supervised Learning of Speech Representations](https://arxiv.org/abs/2006.11477) by Alexei Baevski, Henry Zhou, Abdelrahman Mohamed, Michael Auli.
1. **[XLM](https://huggingface.co/transformers/model_doc/xlm.html)** (from Facebook) released together with the paper [Cross-lingual Language Model Pretraining](https://arxiv.org/abs/1901.07291) by Guillaume Lample and Alexis Conneau.
1. **[XLM-ProphetNet](https://huggingface.co/transformers/model_doc/xlmprophetnet.html)** (from Microsoft Research) released with the paper [ProphetNet: Predicting Future N-gram for Sequence-to-Sequence Pre-training](https://arxiv.org/abs/2001.04063) by Yu Yan, Weizhen Qi, Yeyun Gong, Dayiheng Liu, Nan Duan, Jiusheng Chen, Ruofei Zhang and Ming Zhou.
1. **[XLM-RoBERTa](https://huggingface.co/transformers/model_doc/xlmroberta.html)** (from Facebook AI), released together with the paper [Unsupervised Cross-lingual Representation Learning at Scale](https://arxiv.org/abs/1911.02116) by Alexis Conneau*, Kartikay Khandelwal*, Naman Goyal, Vishrav Chaudhary, Guillaume Wenzek, Francisco Guzmán, Edouard Grave, Myle Ott, Luke Zettlemoyer and Veselin Stoyanov.

View File

@@ -1,10 +1,12 @@
// These two things need to be updated at each release for the version selector.
// Last stable version
const stableVersion = "v4.0.0"
const stableVersion = "v4.2.0"
// Dictionary doc folder to label. The last stable version should have an empty key.
const versionMapping = {
"master": "master",
"": "v4.0.0 (stable)",
"": "v4.2.0/v4.2.1 (stable)",
"v4.1.1": "v4.1.0/v4.1.1",
"v4.0.1": "v4.0.0/v4.0.1",
"v3.5.1": "v3.5.0/v3.5.1",
"v3.4.0": "v3.4.0",
"v3.3.1": "v3.3.0/v3.3.1",

View File

@@ -0,0 +1,844 @@
..
Copyright 2020 The HuggingFace Team. All rights reserved.
Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with
the License. You may obtain a copy of the License at
http://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on
an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the
How to add a model to 🤗 Transformers?
=======================================================================================================================
Adding a new model is often difficult and requires an in-depth knowledge of the 🤗 Transformers library and ideally also
of the model's original repository. At Hugging Face, we are trying to empower the community more and more to add models
independently. Thus, for some new models that the community wants to be added to 🤗 Transformers, we create a customized
*call-for-model-addition* that explains step-by-step how to add the requested model. With this
*call-for-model-addition*, we want to teach a motivated and experienced contributor of the community how to port a
model to 🤗 Transformers.
If this sounds like something you would be interested in, feel free to check out the currently open
“calls-for-model-addition” `here
<https://github.com/huggingface/transformers/tree/master/templates/adding_a_new_model/open_model_proposals/README.md>`__
and to contact us.
If selected, you will then work closely with one member of the Hugging Face team to integrate the model into 🤗
Transformers. By doing so, you will both gain a theoretical and deep practical understanding of the proposed model. But
more importantly, you will have made a major open-source contribution to 🤗 Transformers. Along the way, you will:
- get insights into open-source best practices
- understand the design principles of one of the most popular NLP libraries
- learn how to do efficiently test large NLP models
- learn how to integrate Python utilities like ``black``, ``isort``, ``make fix-copies`` into a library to always
ensure clean and readable code
We are also more than happy if you want to add a model that cannot be found in the “calls-for-model-addition” folder.
The following sections explain in detail how to add a new model. It might also be very helpful to check out already
added models to see if those resemble the model you would like to add `here
<https://github.com/huggingface/transformers/pulls?q=is%3Apr+label%3A%22PR+for+Model+Addition%22+is%3Aclosed>`__.
To start, let's try to get a general overview of the Transformers library.
General overview of 🤗 Transformers
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
First, you should get a general overview of 🤗 Transformers. 🤗 Transformers is a very opinionated library, so there is a
chance that you don't agree with some of the library's philosophies or design choices. From our experience, however, we
found that the fundamental design choices and philosophies of the library are crucial to efficiently scale 🤗
Transformers while keeping maintenance costs at a reasonable level.
A good first starting point to better understand the library is to read the :doc:`documentation of our philosophy
<philosophy>`. As a result of our way of working, there are some choices that we try to apply to all models:
- Composition is generally favored over-abstraction
- Duplicating code is not always bad if it strongly improves the readability or accessibility of a model
- Model files are as self-contained as possible so that when you read the code of a specific model, you ideally only
have to look into the respective ``modeling_....py`` file.
In our opinion, the library's code is not just a means to provide a product, *e.g.* the ability to use BERT for
inference, but also as the very product that we want to improve. Hence, when adding a model, the user is not only the
person that will use your model, but also everybody that will read, try to understand, and possibly tweak your code.
With this in mind, let's go a bit deeper into the general library design.
Overview of models
-----------------------------------------------------------------------------------------------------------------------
To successfully add a model, it is important to understand the interaction between your model and its config,
:class:`~transformers.PreTrainedModel`, and :class:`~transformers.PretrainedConfig`. For exemplary purposes, we will
call the model to be added to 🤗 Transformers ``BrandNewBert``.
Let's take a look:
.. image:: ./imgs/transformers_overview.png
As you can see, we do make use of inheritance in 🤗 Transformers, but we keep the level of abstraction to an absolute
minimum. There are never more than two levels of abstraction for any model in the library. :obj:`BrandNewBertModel`
inherits from :obj:`BrandNewBertPreTrainedModel` which in turn inherits from :class:`~transformres.PreTrainedModel` and
that's it. As a general rule, we want to make sure that a new model only depends on
:class:`~transformers.PreTrainedModel`. The important functionalities that are automatically provided to every new
model are :meth:`~transformers.PreTrainedModel.from_pretrained` and
:meth:`~transformers.PreTrainedModel.save_pretrained`, which are used for serialization and deserialization. All of the
other important functionalities, such as :meth:`BrandNewBertModel.forward` should be completely defined in the new
``modeling_brand_new_bert.py`` script. Next, we want to make sure that a model with a specific head layer, such as
:obj:`BrandNewBertForMaskedLM` does not inherit from :obj:`BrandNewBertModel`, but rather uses :obj:`BrandNewBertModel`
as a component that can be called in its forward pass to keep the level of abstraction low. Every new model requires a
configuration class, called :obj:`BrandNewBertConfig`. This configuration is always stored as an attribute in
:class:`~transformers.PreTrainedModel`, and thus can be accessed via the ``config`` attribute for all classes
inheriting from :obj:`BrandNewBertPreTrainedModel`:
.. code:: python
model = BrandNewBertModel.from_pretrained("brandy/brand_new_bert")
model.config # model has access to its config
Similar to the model, the configuration inherits basic serialization and deserialization functionalities from
:class:`~transformers.PretrainedConfig`. Note that the configuration and the model are always serialized into two
different formats - the model to a `pytorch_model.bin` file and the configuration to a `config.json` file. Calling
:meth:`~transformers.PreTrainedModel.save_pretrained` will automatically call
:meth:`~transformers.PretrainedConfig.save_pretrained`, so that both model and configuration are saved.
Overview of tokenizers
-----------------------------------------------------------------------------------------------------------------------
Not quite ready yet :-( This section will be added soon!
Step-by-step recipe to add a model to 🤗 Transformers
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Everyone has different preferences of how to port a model so it can be very helpful for you to take a look at summaries
of how other contributors ported models to Hugging Face. Here is a list of community blog posts on how to port a model:
1. `Porting GPT2 Model <https://medium.com/huggingface/from-tensorflow-to-pytorch-265f40ef2a28>`__ by `Thomas
<https://huggingface.co/thomwolf>`__
2. `Porting WMT19 MT Model <https://huggingface.co/blog/porting-fsmt>`__ by `Stas <https://huggingface.co/stas>`__
From experience, we can tell you that the most important things to keep in mind when adding a model are:
- Don't reinvent the wheel! Most parts of the code you will add for the new 🤗 Transformers model already exist
somewhere in 🤗 Transformers. Take some time to find similar, already existing models and tokenizers you can copy
from. `grep <https://www.gnu.org/software/grep/>`__ and `rg <https://github.com/BurntSushi/ripgrep>`__ are your
friends. Note that it might very well happen that your model's tokenizer is based on one model implementation, and
your model's modeling code on another one. *E.g.* FSMT's modeling code is based on BART, while FSMT's tokenizer code
is based on XLM.
- It's more of an engineering challenge than a scientific challenge. You should spend more time on creating an
efficient debugging environment than trying to understand all theoretical aspects of the model in the paper.
- Ask for help, when you're stuck! Models are the core component of 🤗 Transformers so that we at Hugging Face are more
than happy to help you at every step to add your model. Don't hesitate to ask if you notice you are not making
progress.
In the following, we try to give you a general recipe that we found most useful when porting a model to 🤗 Transformers.
The following list is a summary of everything that has to be done to add a model and can be used by you as a To-Do
List:
- 1. ☐ (Optional) Understood theoretical aspects
- 2. ☐ Prepared transformers dev environment
- 3. ☐ Set up debugging environment of the original repository
- 4. ☐ Created script that successfully runs forward pass using original repository and checkpoint
- 5. ☐ Successfully added the model skeleton to Transformers
- 6. ☐ Successfully converted original checkpoint to Transformers checkpoint
- 7. ☐ Successfully ran forward pass in Transformers that gives identical output to original checkpoint
- 8. ☐ Finished model tests in Transformers
- 9. ☐ Successfully added Tokenizer in Transformers
- 10. ☐ Run end-to-end integration tests
- 11. ☐ Finished docs
- 12. ☐ Uploaded model weights to the hub
- 13. ☐ Submitted the pull request
- 14. ☐ (Optional) Added a demo notebook
To begin with, we usually recommend to start by getting a good theoretical understanding of ``BrandNewBert``. However,
if you prefer to understand the theoretical aspects of the model *on-the-job*, then it is totally fine to directly dive
into the ``BrandNewBert``'s code-base. This option might suit you better, if your engineering skills are better than
your theoretical skill, if you have trouble understanding ``BrandNewBert``'s paper, or if you just enjoy programming
much more than reading scientific papers.
1. (Optional) Theoretical aspects of BrandNewBert
-----------------------------------------------------------------------------------------------------------------------
You should take some time to read *BrandNewBert's* paper, if such descriptive work exists. There might be large
sections of the paper that are difficult to understand. If this is the case, this is fine - don't worry! The goal is
not to get a deep theoretical understanding of the paper, but to extract the necessary information required to
effectively re-implement the model in 🤗 Transformers. That being said, you don't have to spend too much time on the
theoretical aspects, but rather focus on the practical ones, namely:
- What type of model is *brand_new_bert*? BERT-like encoder-only model? GPT2-like decoder-only model? BART-like
encoder-decoder model? Look at the :doc:`model_summary` if you're not familiar with the differences between those.
- What are the applications of *brand_new_bert*? Text classification? Text generation? Seq2Seq tasks, *e.g.,*
summarization?
- What is the novel feature of the model making it different from BERT/GPT-2/BART?
- Which of the already existing `🤗 Transformers models <https://huggingface.co/transformers/#contents>`__ is most
similar to *brand_new_bert*?
- What type of tokenizer is used? A sentencepiece tokenizer? Word piece tokenizer? Is it the same tokenizer as used
for BERT or BART?
After you feel like you have gotten a good overview of the architecture of the model, you might want to write to the
Hugging Face team with any questions you might have. This might include questions regarding the model's architecture,
its attention layer, etc. We will be more than happy to help you.
2. Next prepare your environment
-----------------------------------------------------------------------------------------------------------------------
1. Fork the `repository <https://github.com/huggingface/transformers>`__ by clicking on the Fork' button on the
repository's page. This creates a copy of the code under your GitHub user account.
2. Clone your ``transformers`` fork to your local disk, and add the base repository as a remote:
.. code:: bash
git clone https://github.com/[your Github handle]/transformers.git
cd transformers
git remote add upstream https://github.com/huggingface/transformers.git
3. Set up a development environment, for instance by running the following command:
.. code:: bash
python -m venv .env
source .env/bin/activate
pip install -e ".[dev]"
and return to the parent directory
.. code:: bash
cd ..
4. We recommend adding the PyTorch version of *brand_new_bert* to Transformers. To install PyTorch, please follow the
instructions on https://pytorch.org/get-started/locally/.
**Note:** You don't need to have CUDA installed. Making the new model work on CPU is sufficient.
5. To port *brand_new_bert*, you will also need access to its original repository:
.. code:: bash
git clone https://github.com/org_that_created_brand_new_bert_org/brand_new_bert.git
cd brand_new_bert
pip install -e .
Now you have set up a development environment to port *brand_new_bert* to 🤗 Transformers.
3.-4. Run a pretrained checkpoint using the original repository
-----------------------------------------------------------------------------------------------------------------------
At first, you will work on the original *brand_new_bert* repository. Often, the original implementation is very
“researchy”. Meaning that documentation might be lacking and the code can be difficult to understand. But this should
be exactly your motivation to reimplement *brand_new_bert*. At Hugging Face, one of our main goals is to *make people
stand on the shoulders of giants* which translates here very well into taking a working model and rewriting it to make
it as **accessible, user-friendly, and beautiful** as possible. This is the number-one motivation to re-implement
models into 🤗 Transformers - trying to make complex new NLP technology accessible to **everybody**.
You should start thereby by diving into the original repository.
Successfully running the official pretrained model in the original repository is often **the most difficult** step.
From our experience, it is very important to spend some time getting familiar with the original code-base. You need to
figure out the following:
- Where to find the pretrained weights?
- How to load the pretrained weights into the corresponding model?
- How to run the tokenizer independently from the model?
- Trace one forward pass so that you know which classes and functions are required for a simple forward pass. Usually,
you only have to reimplement those functions.
- Be able to locate the important components of the model: Where is the model's class? Are there model sub-classes,
*e.g.* EncoderModel, DecoderModel? Where is the self-attention layer? Are there multiple different attention layers,
*e.g.* *self-attention*, *cross-attention*...?
- How can you debug the model in the original environment of the repo? Do you have to add `print` statements, can you
work with an interactive debugger like `ipdb`, or should you use an efficient IDE to debug the model, like PyCharm?
It is very important that before you start the porting process, that you can **efficiently** debug code in the original
repository! Also, remember that you are working with an open-source library, so do not hesitate to open an issue, or
even a pull request in the original repository. The maintainers of this repository are most likely very happy about
someone looking into their code!
At this point, it is really up to you which debugging environment and strategy you prefer to use to debug the original
model. We strongly advise against setting up a costly GPU environment, but simply work on a CPU both when starting to
dive into the original repository and also when starting to write the 🤗 Transformers implementation of the model. Only
at the very end, when the model has already been successfully ported to 🤗 Transformers, one should verify that the
model also works as expected on GPU.
In general, there are two possible debugging environments for running the original model
- `Jupyter notebooks <https://jupyter.org/>`__ / `google colab
<https://colab.research.google.com/notebooks/intro.ipynb>`__
- Local python scripts.
Jupyter notebooks have the advantage that they allow for cell-by-cell execution which can be helpful to better split
logical components from one another and to have faster debugging cycles as intermediate results can be stored. Also,
notebooks are often easier to share with other contributors, which might be very helpful if you want to ask the Hugging
Face team for help. If you are familiar with Jupiter notebooks, we strongly recommend you to work with them.
The obvious disadvantage of Jupyther notebooks is that if you are not used to working with them you will have to spend
some time adjusting to the new programming environment and that you might not be able to use your known debugging tools
anymore, like ``ipdb``.
For each code-base, a good first step is always to load a **small** pretrained checkpoint and to be able to reproduce a
single forward pass using a dummy integer vector of input IDs as an input. Such a script could look like this (in
pseudocode):
.. code:: bash
model = BrandNewBertModel.load_pretrained_checkpoint(/path/to/checkpoint/)
input_ids = [0, 4, 5, 2, 3, 7, 9] # vector of input ids
original_output = model.predict(input_ids)
Next, regarding the debugging strategy, there are generally a few from which to choose from:
- Decompose the original model into many small testable components and run a forward pass on each of those for
verification
- Decompose the original model only into the original *tokenizer* and the original *model*, run a forward pass on
those, and use intermediate print statements or breakpoints for verification
Again, it is up to you which strategy to choose. Often, one or the other is advantageous depending on the original code
base.
If the original code-base allows you to decompose the model into smaller sub-components, *e.g.* if the original
code-base can easily be run in eager mode, it is usually worth the effort to do so. There are some important advantages
to taking the more difficult road in the beginning:
- at a later stage when comparing the original model to the Hugging Face implementation, you can verify automatically
for each component individually that the corresponding component of the 🤗 Transformers implementation matches instead
of relying on visual comparison via print statements
- it can give you some rope to decompose the big problem of porting a model into smaller problems of just porting
individual components and thus structure your work better
- separating the model into logical meaningful components will help you to get a better overview of the model's design
and thus to better understand the model
- at a later stage those component-by-component tests help you to ensure that no regression occurs as you continue
changing your code
`Lysandre's <https://gist.github.com/LysandreJik/db4c948f6b4483960de5cbac598ad4ed>`__ integration checks for ELECTRA
gives a nice example of how this can be done.
However, if the original code-base is very complex or only allows intermediate components to be run in a compiled mode,
it might be too time-consuming or even impossible to separate the model into smaller testable sub-components. A good
example is `T5's MeshTensorFlow <https://github.com/tensorflow/mesh/tree/master/mesh_tensorflow>`__ library which is
very complex and does not offer a simple way to decompose the model into its sub-components. For such libraries, one
often relies on verifying print statements.
No matter which strategy you choose, the recommended procedure is often the same in that you should start to debug the
starting layers first and the ending layers last.
It is recommended that you retrieve the output, either by print statements or sub-component functions, of the following
layers in the following order:
1. Retrieve the input IDs passed to the model
2. Retrieve the word embeddings
3. Retrieve the input of the first Transformer layer
4. Retrieve the output of the first Transformer layer
5. Retrieve the output of the following n - 1 Transformer layers
6. Retrieve the output of the whole BrandNewBert Model
Input IDs should thereby consists of an array of integers, *e.g.* ``input_ids = [0, 4, 4, 3, 2, 4, 1, 7, 19]``
The outputs of the following layers often consist of multi-dimensional float arrays and can look like this:
.. code:: bash
[[
[-0.1465, -0.6501, 0.1993, ..., 0.1451, 0.3430, 0.6024],
[-0.4417, -0.5920, 0.3450, ..., -0.3062, 0.6182, 0.7132],
[-0.5009, -0.7122, 0.4548, ..., -0.3662, 0.6091, 0.7648],
...,
[-0.5613, -0.6332, 0.4324, ..., -0.3792, 0.7372, 0.9288],
[-0.5416, -0.6345, 0.4180, ..., -0.3564, 0.6992, 0.9191],
[-0.5334, -0.6403, 0.4271, ..., -0.3339, 0.6533, 0.8694]]],
We expect that every model added to 🤗 Transformers passes a couple of integration tests, meaning that the original
model and the reimplemented version in 🤗 Transformers have to give the exact same output up to a precision of 0.001!
Since it is normal that the exact same model written in different libraries can give a slightly different output
depending on the library framework, we accept an error tolerance of 1e-3 (0.001). It is not enough if the model gives
nearly the same output, they have to be the almost identical. Therefore, you will certainly compare the intermediate
outputs of the 🤗 Transformers version multiple times against the intermediate outputs of the original implementation of
*brand_new_bert* in which case an **efficient** debugging environment of the original repository is absolutely
important. Here is some advice is to make your debugging environment as efficient as possible.
- Find the best way of debugging intermediate results. Is the original repository written in PyTorch? Then you should
probably take the time to write a longer script that decomposes the original model into smaller sub-components to
retrieve intermediate values. Is the original repository written in Tensorflow 1? Then you might have to rely on
TensorFlow print operations like `tf.print <https://www.tensorflow.org/api_docs/python/tf/print>`__ to output
intermediate values. Is the original repository written in Jax? Then make sure that the model is **not jitted** when
running the forward pass, *e.g.* check-out `this link <https://github.com/google/jax/issues/196>`__.
- Use the smallest pretrained checkpoint you can find. The smaller the checkpoint, the faster your debug cycle
becomes. It is not efficient if your pretrained model is so big that your forward pass takes more than 10 seconds.
In case only very large checkpoints are available, it might make more sense to create a dummy model in the new
environment with randomly initialized weights and save those weights for comparison with the 🤗 Transformers version
of your model
- Make sure you are using the easiest way of calling a forward pass in the original repository. Ideally, you want to
find the function in the original repository that **only** calls a single forward pass, *i.e.* that is often called
``predict``, ``evaluate``, ``forward`` or ``__call__``. You don't want to debug a function that calls ``forward``
multiple times, *e.g.* to generate text, like ``autoregressive_sample``, ``generate``.
- Try to separate the tokenization from the model's `forward` pass. If the original repository shows examples where
you have to input a string, then try to find out where in the forward call the string input is changed to input ids
and start from this point. This might mean that you have to possibly write a small script yourself or change the
original code so that you can directly input the ids instead of an input string.
- Make sure that the model in your debugging setup is **not** in training mode, which often causes the model to yield
random outputs due to multiple dropout layers in the model. Make sure that the forward pass in your debugging
environment is **deterministic** so that the dropout layers are not used. Or use `transformers.file_utils.set_seed`
if the old and new implementations are in the same framework.
The following section gives you more specific details/tips on how you can do this for *brand_new_bert*.
5.-14. Port BrandNewBert to 🤗 Transformers
-----------------------------------------------------------------------------------------------------------------------
Next, you can finally start adding new code to 🤗 Transformers. Go into the clone of your 🤗 Transformers' fork:
::
cd transformers
In the special case that you are adding a model whose architecture exactly matches the model architecture of an
existing model you only have to add a conversion script as described in `this section <#write-a-conversion-script>`__.
In this case, you can just re-use the whole model architecture of the already existing model.
Otherwise, let's start generating a new model with the amazing Cookiecutter!
**Use the Cookiecutter to automatically generate the model's code**
To begin with head over to the `🤗 Transformers templates
<https://github.com/huggingface/transformers/tree/master/templates/adding_a_new_model>`__ to make use of our
``cookiecutter`` implementation to automatically generate all the relevant files for your model. Again, we recommend
only adding the PyTorch version of the model at first. Make sure you follow the instructions of the ``README.md`` on
the `🤗 Transformers templates <https://github.com/huggingface/transformers/tree/master/templates/adding_a_new_model>`__
carefully.
**Open a Pull Request on the main huggingface/transformers repo**
Before starting to adapt the automatically generated code, now is the time to open a “Work in progress (WIP)” pull
request, *e.g.* “[WIP] Add *brand_new_bert*”, in 🤗 Transformers so that you and the Hugging Face team can work
side-by-side on integrating the model into 🤗 Transformers.
You should do the following:
1. Create a branch with a descriptive name from your master branch
::
git checkout -b add_brand_new_bert
2. Commit the automatically generated code:
::
git add .
git commit
3. Fetch and rebase to current master
::
git fetch upstream
git rebase upstream/master
4. Push the changes to your account using:
::
git push -u origin a-descriptive-name-for-my-changes
5. Once you are satisfied, go to the webpage of your fork on GitHub. Click on “Pull request”. Make sure to add the
GitHub handle of some members of the Hugging Face team as reviewers, so that the Hugging Face team gets notified for
future changes.
6. Change the PR into a draft by clicking on “Convert to draft” on the right of the GitHub pull request web page.
In the following, whenever you have done some progress, don't forget to commit your work and push it to your account so
that it shows in the pull request. Additionally, you should make sure to update your work with the current master from
time to time by doing:
::
git fetch upstream
git merge upstream/master
In general, all questions you might have regarding the model or your implementation should be asked in your PR and
discussed/solved in the PR. This way, the Hugging Face team will always be notified when you are committing new code or
if you have a question. It is often very helpful to point the Hugging Face team to your added code so that the Hugging
Face team can efficiently understand your problem or question.
To do so, you can go to the “Files changed” tab where you see all of your changes, go to a line regarding which you
want to ask a question, and click on the “+” symbol to add a comment. Whenever a question or problem has been solved,
you can click on the “Resolve” button of the created comment.
In the same way, the Hugging Face team will open comments when reviewing your code. We recommend asking most questions
on GitHub on your PR. For some very general questions that are not very useful for the public, feel free to ping the
Hugging Face team by Slack or email.
**5. Adapt the generated models code for brand_new_bert**
At first, we will focus only on the model itself and not care about the tokenizer. All the relevant code should be
found in the generated files ``src/transformers/models/brand_new_bert/modeling_brand_new_bert.py`` and
``src/transformers/models/brand_new_bert/configuration_brand_new_bert.py``.
Now you can finally start coding :). The generated code in
``src/transformers/models/brand_new_bert/modeling_brand_new_bert.py`` will either have the same architecture as BERT if
it's an encoder-only model or BART if it's an encoder-decoder model. At this point, you should remind yourself what
you've learned in the beginning about the theoretical aspects of the model: *How is the model different from BERT or
BART?*". Implement those changes which often means to change the *self-attention* layer, the order of the normalization
layer, etc… Again, it is often useful to look at the similar architecture of already existing models in Transformers to
get a better feeling of how your model should be implemented.
**Note** that at this point, you don't have to be very sure that your code is fully correct or clean. Rather, it is
advised to add a first *unclean*, copy-pasted version of the original code to
``src/transformers/models/brand_new_bert/modeling_brand_new_bert.py`` until you feel like all the necessary code is
added. From our experience, it is much more efficient to quickly add a first version of the required code and
improve/correct the code iteratively with the conversion script as described in the next section. The only thing that
has to work at this point is that you can instantiate the 🤗 Transformers implementation of *brand_new_bert*, *i.e.* the
following command should work:
.. code:: python
from transformers import BrandNewBertModel, BrandNewBertConfig
model = BrandNewBertModel(BrandNewBertConfig())
The above command will create a model according to the default parameters as defined in ``BrandNewBertConfig()`` with
random weights, thus making sure that the ``init()`` methods of all components works.
**6. Write a conversion script**
Next, you should write a conversion script that lets you convert the checkpoint you used to debug *brand_new_bert* in
the original repository to a checkpoint compatible with your just created 🤗 Transformers implementation of
*brand_new_bert*. It is not advised to write the conversion script from scratch, but rather to look through already
existing conversion scripts in 🤗 Transformers for one that has been used to convert a similar model that was written in
the same framework as *brand_new_bert*. Usually, it is enough to copy an already existing conversion script and
slightly adapt it for your use case. Don't hesitate to ask the Hugging Face team to point you to a similar already
existing conversion script for your model.
- If you are porting a model from TensorFlow to PyTorch, a good starting point might be BERT's conversion script `here
<https://github.com/huggingface/transformers/blob/7acfa95afb8194f8f9c1f4d2c6028224dbed35a2/src/transformers/models/bert/modeling_bert.py#L91>`__
- If you are porting a model from PyTorch to PyTorch, a good starting point might be BART's conversion script `here
<https://github.com/huggingface/transformers/blob/master/src/transformers/models/bart/convert_bart_original_pytorch_checkpoint_to_pytorch.py>`__
In the following, we'll quickly explain how PyTorch models store layer weights and define layer names. In PyTorch, the
name of a layer is defined by the name of the class attribute you give the layer. Let's define a dummy model in
PyTorch, called ``SimpleModel`` as follows:
.. code:: python
import torch.nn as nn
class SimpleModel(nn.Module):
def __init__(self):
super().__init__()
self.dense = nn.Linear(10, 10)
self.intermediate = nn.Linear(10, 10)
self.layer_norm = nn.LayerNorm(10)
Now we can create an instance of this model definition which will fill all weights: ``dense``, ``intermediate``,
``layer_norm`` with random weights. We can print the model to see its architecture
.. code:: python
model = SimpleModel()
print(model)
This will print out the following:
.. code:: bash
SimpleModel(
(dense): Linear(in_features=10, out_features=10, bias=True)
(intermediate): Linear(in_features=10, out_features=10, bias=True)
(layer_norm): LayerNorm((10,), eps=1e-05, elementwise_affine=True)
)
We can see that the layer names are defined by the name of the class attribute in PyTorch. You can print out the weight
values of a specific layer:
.. code:: python
print(model.dense.weight.data)
to see that the weights were randomly initialized
.. code:: bash
tensor([[-0.0818, 0.2207, -0.0749, -0.0030, 0.0045, -0.1569, -0.1598, 0.0212,
-0.2077, 0.2157],
[ 0.1044, 0.0201, 0.0990, 0.2482, 0.3116, 0.2509, 0.2866, -0.2190,
0.2166, -0.0212],
[-0.2000, 0.1107, -0.1999, -0.3119, 0.1559, 0.0993, 0.1776, -0.1950,
-0.1023, -0.0447],
[-0.0888, -0.1092, 0.2281, 0.0336, 0.1817, -0.0115, 0.2096, 0.1415,
-0.1876, -0.2467],
[ 0.2208, -0.2352, -0.1426, -0.2636, -0.2889, -0.2061, -0.2849, -0.0465,
0.2577, 0.0402],
[ 0.1502, 0.2465, 0.2566, 0.0693, 0.2352, -0.0530, 0.1859, -0.0604,
0.2132, 0.1680],
[ 0.1733, -0.2407, -0.1721, 0.1484, 0.0358, -0.0633, -0.0721, -0.0090,
0.2707, -0.2509],
[-0.1173, 0.1561, 0.2945, 0.0595, -0.1996, 0.2988, -0.0802, 0.0407,
0.1829, -0.1568],
[-0.1164, -0.2228, -0.0403, 0.0428, 0.1339, 0.0047, 0.1967, 0.2923,
0.0333, -0.0536],
[-0.1492, -0.1616, 0.1057, 0.1950, -0.2807, -0.2710, -0.1586, 0.0739,
0.2220, 0.2358]]).
In the conversion script, you should fill those randomly initialized weights with the exact weights of the
corresponding layer in the checkpoint. *E.g.*
.. code:: python
# retrieve matching layer weights, e.g. by
# recursive algorithm
layer_name = "dense"
pretrained_weight = array_of_dense_layer
model_pointer = getattr(model, "dense")
model_pointer.weight.data = torch.from_numpy(pretrained_weight)
While doing so, you must verify that each randomly initialized weight of your PyTorch model and its corresponding
pretrained checkpoint weight exactly match in both **shape and name**. To do so, it is **necessary** to add assert
statements for the shape and print out the names of the checkpoints weights. E.g. you should add statements like:
.. code:: python
assert (
model_pointer.weight.shape == pretrained_weight.shape
), f"Pointer shape of random weight {model_pointer.shape} and array shape of checkpoint weight {pretrained_weight.shape} mismatched"
Besides, you should also print out the names of both weights to make sure they match, *e.g.*
.. code:: python
logger.info(f"Initialize PyTorch weight {layer_name} from {pretrained_weight.name}")
If either the shape or the name doesn't match, you probably assigned the wrong checkpoint weight to a randomly
initialized layer of the 🤗 Transformers implementation.
An incorrect shape is most likely due to an incorrect setting of the config parameters in ``BrandNewBertConfig()`` that
do not exactly match those that were used for the checkpoint you want to convert. However, it could also be that
PyTorch's implementation of a layer requires the weight to be transposed beforehand.
Finally, you should also check that **all** required weights are initialized and print out all checkpoint weights that
were not used for initialization to make sure the model is correctly converted. It is completely normal, that the
conversion trials fail with either a wrong shape statement or wrong name assignment. This is most likely because either
you used incorrect parameters in ``BrandNewBertConfig()``, have a wrong architecture in the 🤗 Transformers
implementation, you have a bug in the ``init()`` functions of one of the components of the 🤗 Transformers
implementation or you need to transpose one of the checkpoint weights.
This step should be iterated with the previous step until all weights of the checkpoint are correctly loaded in the
Transformers model. Having correctly loaded the checkpoint into the 🤗 Transformers implementation, you can then save
the model under a folder of your choice ``/path/to/converted/checkpoint/folder`` that should then contain both a
``pytorch_model.bin`` file and a ``config.json`` file:
.. code:: python
model.save_pretrained("/path/to/converted/checkpoint/folder")
**7. Implement the forward pass**
Having managed to correctly load the pretrained weights into the 🤗 Transformers implementation, you should now make
sure that the forward pass is correctly implemented. In `Get familiar with the original repository
<#run-a-pretrained-checkpoint-using-the-original-repository>`__, you have already created a script that runs a forward
pass of the model using the original repository. Now you should write an analogous script using the 🤗 Transformers
implementation instead of the original one. It should look as follows:
.. code:: python
model = BrandNewBertModel.from_pretrained(/path/to/converted/checkpoint/folder)
input_ids = [0, 4, 4, 3, 2, 4, 1, 7, 19]
output = model(input_ids).last_hidden_states
It is very likely that the 🤗 Transformers implementation and the original model implementation don't give the exact
same output the very first time or that the forward pass throws an error. Don't be disappointed - it's expected! First,
you should make sure that the forward pass doesn't throw any errors. It often happens that the wrong dimensions are
used leading to a `Dimensionality mismatch` error or that the wrong data type object is used, *e.g.* ``torch.long``
instead of ``torch.float32``. Don't hesitate to ask the Hugging Face team for help, if you don't manage to solve
certain errors.
The final part to make sure the 🤗 Transformers implementation works correctly is to ensure that the outputs are
equivalent to a precision of ``1e-3``. First, you should ensure that the output shapes are identical, *i.e.*
``outputs.shape`` should yield the same value for the script of the 🤗 Transformers implementation and the original
implementation. Next, you should make sure that the output values are identical as well. This one of the most difficult
parts of adding a new model. Common mistakes why the outputs are not identical are:
- Some layers were not added, *i.e.* an `activation` layer was not added, or the residual connection was forgotten
- The word embedding matrix was not tied
- The wrong positional embeddings are used because the original implementation uses on offset
- Dropout is applied during the forward pass. To fix this make sure `model.training is False` and that no dropout
layer is falsely activated during the forward pass, *i.e.* pass `self.training` to `PyTorch's functional dropout
<https://pytorch.org/docs/stable/nn.functional.html?highlight=dropout#torch.nn.functional.dropout>`_
The best way to fix the problem is usually to look at the forward pass of the original implementation and the 🤗
Transformers implementation side-by-side and check if there are any differences. Ideally, you should debug/print out
intermediate outputs of both implementations of the forward pass to find the exact position in the network where the 🤗
Transformers implementation shows a different output than the original implementation. First, make sure that the
hard-coded ``input_ids`` in both scripts are identical. Next, verify that the outputs of the first transformation of
the ``input_ids`` (usually the word embeddings) are identical. And then work your way up to the very last layer of the
network. At some point, you will notice a difference between the two implementations, which should point you to the bug
in the 🤗 Transformers implementation. From our experience, a simple and efficient way is to add many print statements
in both the original implementation and 🤗 Transformers implementation, at the same positions in the network
respectively, and to successively remove print statements showing the same values for intermediate presentions.
When you're confident that both implementations yield the same output, verifying the outputs with
``torch.allclose(original_output, output, atol=1e-3)``, you're done with the most difficult part! Congratulations - the
work left to be done should be a cakewalk 😊.
**8. Adding all necessary model tests**
At this point, you have successfully added a new model. However, it is very much possible that the model does not yet
fully comply with the required design. To make sure, the implementation is fully compatible with 🤗 Transformers, all
common tests should pass. The Cookiecutter should have automatically added a test file for your model, probably under
the same ``tests/test_modeling_brand_new_bert.py``. Run this test file to verify that all common tests pass:
.. code:: python
pytest tests/test_modeling_brand_new_bert.py
Having fixed all common tests, it is now crucial to ensure that all the nice work you have done is well tested, so that
-
a) The community can easily understand your work by looking at specific tests of *brand_new_bert*
-
b) Future changes to your model will not break any important feature of the model.
At first, integration tests should be added. Those integration tests essentially do the same as the debugging scripts
you used earlier to implement the model to 🤗 Transformers. A template of those model tests is already added by the
Cookiecutter, called ``BrandNewBertModelIntegrationTests`` and only has to be filled out by you. To ensure that those
tests are passing, run
.. code:: python
RUN_SLOW=1 pytest -sv tests/test_modeling_brand_new_bert.py::BrandNewBertModelIntegrationTests
.. note::
In case you are using Windows, you should replace ``RUN_SLOW=1`` with ``SET RUN_SLOW=1``
Second, all features that are special to *brand_new_bert* should be tested additionally in a separate test under
``BrandNewBertModelTester``/``BrandNewBertModelTest``. This part is often forgotten but is extremely useful in two
ways:
- It helps to transfer the knowledge you have acquired during the model addition to the community by showing how the
special features of *brand_new_bert* should work.
- Future contributors can quickly test changes to the model by running those special tests.
**9. Implement the tokenizer**
Next, we should add the tokenizer of *brand_new_bert*. Usually, the tokenizer is equivalent or very similar to an
already existing tokenizer of 🤗 Transformers.
It is very important to find/extract the original tokenizer file and to manage to load this file into the 🤗
Transformers' implementation of the tokenizer.
To ensure that the tokenizer works correctly, it is recommended to first create a script in the original repository
that inputs a string and returns the ``input_ids``. It could look similar to this (in pseudo-code):
.. code:: bash
input_str = "This is a long example input string containing special characters .$?-, numbers 2872 234 12 and words."
model = BrandNewBertModel.load_pretrained_checkpoint(/path/to/checkpoint/)
input_ids = model.tokenize(input_str)
You might have to take a deeper look again into the original repository to find the correct tokenizer function or you
might even have to do changes to your clone of the original repository to only output the ``input_ids``. Having written
a functional tokenization script that uses the original repository, an analogous script for 🤗 Transformers should be
created. It should look similar to this:
.. code:: python
from transformers import BrandNewBertTokenizer
input_str = "This is a long example input string containing special characters .$?-, numbers 2872 234 12 and words."
tokenizer = BrandNewBertTokenizer.from_pretrained(/path/to/tokenizer/folder/)
input_ids = tokenizer(input_str).input_ids
When both ``input_ids`` yield the same values, as a final step a tokenizer test file should also be added.
Analogous to the modeling test files of *brand_new_bert*, the tokenization test files of *brand_new_bert* should
contain a couple of hard-coded integration tests.
**10. Run End-to-end integration tests**
Having added the tokenizer, you should also add a couple of end-to-end integration tests using both the model and the
tokenizer to ``tests/test_modeling_brand_new_bert.py`` in 🤗 Transformers. Such a test should show on a meaningful
text-to-text sample that the 🤗 Transformers implementation works as expected. A meaningful text-to-text sample can
include *e.g.* a source-to-target-translation pair, an article-to-summary pair, a question-to-answer pair, etc… If none
of the ported checkpoints has been fine-tuned on a downstream task it is enough to simply rely on the model tests. In a
final step to ensure that the model is fully functional, it is advised that you also run all tests on GPU. It can
happen that you forgot to add some ``.to(self.device)`` statements to internal tensors of the model, which in such a
test would show in an error. In case you have no access to a GPU, the Hugging Face team can take care of running those
tests for you.
**11. Add Docstring**
Now, all the necessary functionality for *brand_new_bert* is added - you're almost done! The only thing left to add is
a nice docstring and a doc page. The Cookiecutter should have added a template file called
``docs/source/model_doc/brand_new_bert.rst`` that you should fill out. Users of your model will usually first look at
this page before using your model. Hence, the documentation must be understandable and concise. It is very useful for
the community to add some *Tips* to show how the model should be used. Don't hesitate to ping the Hugging Face team
regarding the docstrings.
Next, make sure that the docstring added to ``src/transformers/models/brand_new_bert/modeling_brand_new_bert.py`` is
correct and included all necessary inputs and outputs. It is always to good to remind oneself that documentation should
be treated at least as carefully as the code in 🤗 Transformers since the documentation is usually the first contact
point of the community with the model.
**Code refactor**
Great, now you have added all the necessary code for *brand_new_bert*. At this point, you should correct some potential
incorrect code style by running:
.. code:: bash
make style
and verify that your coding style passes the quality check:
.. code:: bash
make quality
There are a couple of other very strict design tests in 🤗 Transformers that might still be failing, which shows up in
the tests of your pull request. This is often because of some missing information in the docstring or some incorrect
naming. The Hugging Face team will surely help you if you're stuck here.
Lastly, it is always a good idea to refactor one's code after having ensured that the code works correctly. With all
tests passing, now it's a good time to go over the added code again and do some refactoring.
You have now finished the coding part, congratulation! 🎉 You are Awesome! 😎
**12. Upload the models to the model hub**
In this final part, you should convert and upload all checkpoints to the model hub and add a model card for each
uploaded model checkpoint. You should work alongside the Hugging Face team here to decide on a fitting name for each
checkpoint and to get the required access rights to be able to upload the model under the author's organization of
*brand_new_bert*.
It is worth spending some time to create fitting model cards for each checkpoint. The model cards should highlight the
specific characteristics of this particular checkpoint, *e.g.* On which dataset was the checkpoint
pretrained/fine-tuned on? On what down-stream task should the model be used? And also include some code on how to
correctly use the model.
**13. (Optional) Add notebook**
It is very helpful to add a notebook that showcases in-detail how *brand_new_bert* can be used for inference and/or
fine-tuned on a downstream task. This is not mandatory to merge your PR, but very useful for the community.
**14. Submit your finished PR**
You're done programming now and can move to the last step, which is getting your PR merged into master. Usually, the
Hugging Face team should have helped you already at this point, but it is worth taking some time to give your finished
PR a nice description and eventually add comments to your code, if you want to point out certain design choices to your
reviewer.
Share your work!!
-----------------------------------------------------------------------------------------------------------------------
Now, it's time to get some credit from the community for your work! Having completed a model addition is a major
contribution to Transformers and the whole NLP community. Your code and the ported pre-trained models will certainly be
used by hundreds and possibly even thousands of developers and researchers. You should be proud of your work and share
your achievement with the community.
**You have made another model that is super easy to access for everyone in the community! 🤯**

View File

@@ -15,8 +15,8 @@ Benchmarks
Let's take a look at how 🤗 Transformer models can be benchmarked, best practices, and already available benchmarks.
A notebook explaining in more detail how to benchmark 🤗 Transformer models can be found `here
<https://github.com/huggingface/transformers/blob/master/notebooks/05-benchmark.ipynb>`__.
A notebook explaining in more detail how to benchmark 🤗 Transformer models can be found :prefix_link:`here
<notebooks/05-benchmark.ipynb>`.
How to benchmark 🤗 Transformer models
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
@@ -99,6 +99,7 @@ An instantiated benchmark object can then simply be run by calling ``benchmark.r
--------------------------------------------------------------------------------
==================== ENVIRONMENT INFORMATION ====================
- transformers_version: 2.11.0
- framework: PyTorch
- use_torchscript: False
@@ -145,6 +146,7 @@ An instantiated benchmark object can then simply be run by calling ``benchmark.r
--------------------------------------------------------------------------------
==================== ENVIRONMENT INFORMATION ====================
- transformers_version: 2.11.0
- framework: Tensorflow
- use_xla: False
@@ -228,6 +230,7 @@ configurations must be inserted with the benchmark args as follows.
--------------------------------------------------------------------------------
==================== ENVIRONMENT INFORMATION ====================
- transformers_version: 2.11.0
- framework: PyTorch
- use_torchscript: False
@@ -297,6 +300,7 @@ configurations must be inserted with the benchmark args as follows.
--------------------------------------------------------------------------------
==================== ENVIRONMENT INFORMATION ====================
- transformers_version: 2.11.0
- framework: Tensorflow
- use_xla: False
@@ -353,5 +357,5 @@ The approach is detailed in the `following blogpost
available `here
<https://docs.google.com/spreadsheets/d/1sryqufw2D0XlUH4sq3e9Wnxu5EAQkaohzrJbd5HdQ_w/edit?usp=sharing>`__.
With the new `benchmark` tools, it is easier than ever to share your benchmark results with the community `here
<https://github.com/huggingface/transformers/blob/master/examples/benchmarking/README.md>`__.
With the new `benchmark` tools, it is easier than ever to share your benchmark results with the community
:prefix_link:`here <examples/benchmarking/README.md>`.

View File

@@ -33,6 +33,6 @@ help people access the inner representations, mainly adapted from the great work
* retrieving heads output values and gradients to be able to compute head importance score and prune head as explained
in https://arxiv.org/abs/1905.10650.
To help you understand and use these features, we have added a specific example script: `bertology.py
<https://github.com/huggingface/transformers/blob/master/examples/research_projects/bertology/run_bertology.py>`_ while
extract information and prune a model pre-trained on GLUE.
To help you understand and use these features, we have added a specific example script: :prefix_link:`bertology.py
<examples/research_projects/bertology/run_bertology.py>` while extract information and prune a model pre-trained on
GLUE.

49
docs/source/community.md Normal file
View File

@@ -0,0 +1,49 @@
# Community
This page regroups resources around 🤗 Transformers developed by the community.
## Community resources:
| Resource | Description | Author |
|:----------|:-------------|------:|
| [Hugging Face Transformers Glossary Flashcards](https://www.darigovresearch.com/huggingface-transformers-glossary-flashcards) | A set of flashcards based on the [Transformers Docs Glossary](https://huggingface.co/transformers/master/glossary.html) that has been put into a form which can be easily learnt/revised using [Anki ](https://apps.ankiweb.net/) an open source, cross platform app specifically designed for long term knowledge retention. See this [Introductory video on how to use the flashcards](https://www.youtube.com/watch?v=Dji_h7PILrw). | [Darigov Research](https://www.darigovresearch.com/) |
## Community notebooks:
| Notebook | Description | Author | |
|:----------|:-------------|:-------------|------:|
| [Train T5 in Tensorflow 2 ](https://github.com/snapthat/TF-T5-text-to-text) | How to train T5 for any task using Tensorflow 2. This notebook demonstrates a Question & Answer task implemented in Tensorflow 2 using SQUAD | [Muhammad Harris](https://github.com/HarrisDePerceptron) |[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/snapthat/TF-T5-text-to-text/blob/master/snapthatT5/notebooks/TF-T5-Datasets%20Training.ipynb) |
| [Train T5 on TPU](https://github.com/patil-suraj/exploring-T5/blob/master/T5_on_TPU.ipynb) | How to train T5 on SQUAD with Transformers and Nlp | [Suraj Patil](https://github.com/patil-suraj) |[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/patil-suraj/exploring-T5/blob/master/T5_on_TPU.ipynb#scrollTo=QLGiFCDqvuil) |
| [Fine-tune T5 for Classification and Multiple Choice](https://github.com/patil-suraj/exploring-T5/blob/master/t5_fine_tuning.ipynb) | How to fine-tune T5 for classification and multiple choice tasks using a text-to-text format with PyTorch Lightning | [Suraj Patil](https://github.com/patil-suraj) | [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/patil-suraj/exploring-T5/blob/master/t5_fine_tuning.ipynb) |
| [Fine-tune DialoGPT on New Datasets and Languages](https://github.com/ncoop57/i-am-a-nerd/blob/master/_notebooks/2020-05-12-chatbot-part-1.ipynb) | How to fine-tune the DialoGPT model on a new dataset for open-dialog conversational chatbots | [Nathan Cooper](https://github.com/ncoop57) | [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/ncoop57/i-am-a-nerd/blob/master/_notebooks/2020-05-12-chatbot-part-1.ipynb) |
| [Long Sequence Modeling with Reformer](https://github.com/patrickvonplaten/notebooks/blob/master/PyTorch_Reformer.ipynb) | How to train on sequences as long as 500,000 tokens with Reformer | [Patrick von Platen](https://github.com/patrickvonplaten) | [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/patrickvonplaten/notebooks/blob/master/PyTorch_Reformer.ipynb) |
| [Fine-tune BART for Summarization](https://github.com/ohmeow/ohmeow_website/blob/master/_notebooks/2020-05-23-text-generation-with-blurr.ipynb) | How to fine-tune BART for summarization with fastai using blurr | [Wayde Gilliam](https://ohmeow.com/) | [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/ohmeow/ohmeow_website/blob/master/_notebooks/2020-05-23-text-generation-with-blurr.ipynb) |
| [Fine-tune a pre-trained Transformer on anyone's tweets](https://colab.research.google.com/github/borisdayma/huggingtweets/blob/master/huggingtweets-demo.ipynb) | How to generate tweets in the style of your favorite Twitter account by fine-tune a GPT-2 model | [Boris Dayma](https://github.com/borisdayma) | [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/borisdayma/huggingtweets/blob/master/huggingtweets-demo.ipynb) |
| [A Step by Step Guide to Tracking Hugging Face Model Performance](https://colab.research.google.com/drive/1NEiqNPhiouu2pPwDAVeFoN4-vTYMz9F8) | A quick tutorial for training NLP models with HuggingFace and & visualizing their performance with Weights & Biases | [Jack Morris](https://github.com/jxmorris12) | [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/drive/1NEiqNPhiouu2pPwDAVeFoN4-vTYMz9F8) |
| [Pretrain Longformer](https://github.com/allenai/longformer/blob/master/scripts/convert_model_to_long.ipynb) | How to build a "long" version of existing pretrained models | [Iz Beltagy](https://beltagy.net) | [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/allenai/longformer/blob/master/scripts/convert_model_to_long.ipynb) |
| [Fine-tune Longformer for QA](https://github.com/patil-suraj/Notebooks/blob/master/longformer_qa_training.ipynb) | How to fine-tune longformer model for QA task | [Suraj Patil](https://github.com/patil-suraj) | [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/patil-suraj/Notebooks/blob/master/longformer_qa_training.ipynb) |
| [Evaluate Model with 🤗nlp](https://github.com/patrickvonplaten/notebooks/blob/master/How_to_evaluate_Longformer_on_TriviaQA_using_NLP.ipynb) | How to evaluate longformer on TriviaQA with `nlp` | [Patrick von Platen](https://github.com/patrickvonplaten) | [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/drive/1m7eTGlPmLRgoPkkA7rkhQdZ9ydpmsdLE?usp=sharing) |
| [Fine-tune T5 for Sentiment Span Extraction](https://github.com/enzoampil/t5-intro/blob/master/t5_qa_training_pytorch_span_extraction.ipynb) | How to fine-tune T5 for sentiment span extraction using a text-to-text format with PyTorch Lightning | [Lorenzo Ampil](https://github.com/enzoampil) | [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/enzoampil/t5-intro/blob/master/t5_qa_training_pytorch_span_extraction.ipynb) |
| [Fine-tune DistilBert for Multiclass Classification](https://github.com/abhimishra91/transformers-tutorials/blob/master/transformers_multiclass_classification.ipynb) | How to fine-tune DistilBert for multiclass classification with PyTorch | [Abhishek Kumar Mishra](https://github.com/abhimishra91) | [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/abhimishra91/transformers-tutorials/blob/master/transformers_multiclass_classification.ipynb)|
|[Fine-tune BERT for Multi-label Classification](https://github.com/abhimishra91/transformers-tutorials/blob/master/transformers_multi_label_classification.ipynb)|How to fine-tune BERT for multi-label classification using PyTorch|[Abhishek Kumar Mishra](https://github.com/abhimishra91) |[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/abhimishra91/transformers-tutorials/blob/master/transformers_multi_label_classification.ipynb)|
|[Fine-tune T5 for Summarization](https://github.com/abhimishra91/transformers-tutorials/blob/master/transformers_summarization_wandb.ipynb)|How to fine-tune T5 for summarization in PyTorch and track experiments with WandB|[Abhishek Kumar Mishra](https://github.com/abhimishra91) |[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/abhimishra91/transformers-tutorials/blob/master/transformers_summarization_wandb.ipynb)|
|[Speed up Fine-Tuning in Transformers with Dynamic Padding / Bucketing](https://github.com/ELS-RD/transformers-notebook/blob/master/Divide_Hugging_Face_Transformers_training_time_by_2_or_more.ipynb)|How to speed up fine-tuning by a factor of 2 using dynamic padding / bucketing|[Michael Benesty](https://github.com/pommedeterresautee) |[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/drive/1CBfRU1zbfu7-ijiOqAAQUA-RJaxfcJoO?usp=sharing)|
|[Pretrain Reformer for Masked Language Modeling](https://github.com/patrickvonplaten/notebooks/blob/master/Reformer_For_Masked_LM.ipynb)| How to train a Reformer model with bi-directional self-attention layers | [Patrick von Platen](https://github.com/patrickvonplaten) | [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/drive/1tzzh0i8PgDQGV3SMFUGxM7_gGae3K-uW?usp=sharing)|
|[Expand and Fine Tune Sci-BERT](https://github.com/lordtt13/word-embeddings/blob/master/COVID-19%20Research%20Data/COVID-SciBERT.ipynb)| How to increase vocabulary of a pretrained SciBERT model from AllenAI on the CORD dataset and pipeline it. | [Tanmay Thakur](https://github.com/lordtt13) | [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/drive/1rqAR40goxbAfez1xvF3hBJphSCsvXmh8)|
|[Fine-tune Electra and interpret with Integrated Gradients](https://github.com/elsanns/xai-nlp-notebooks/blob/master/electra_fine_tune_interpret_captum_ig.ipynb) | How to fine-tune Electra for sentiment analysis and interpret predictions with Captum Integrated Gradients | [Eliza Szczechla](https://elsanns.github.io) | [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/elsanns/xai-nlp-notebooks/blob/master/electra_fine_tune_interpret_captum_ig.ipynb)|
|[fine-tune a non-English GPT-2 Model with Trainer class](https://github.com/philschmid/fine-tune-GPT-2/blob/master/Fine_tune_a_non_English_GPT_2_Model_with_Huggingface.ipynb) | How to fine-tune a non-English GPT-2 Model with Trainer class | [Philipp Schmid](https://www.philschmid.de) | [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/philschmid/fine-tune-GPT-2/blob/master/Fine_tune_a_non_English_GPT_2_Model_with_Huggingface.ipynb)|
|[Fine-tune a DistilBERT Model for Multi Label Classification task](https://github.com/DhavalTaunk08/Transformers_scripts/blob/master/Transformers_multilabel_distilbert.ipynb) | How to fine-tune a DistilBERT Model for Multi Label Classification task | [Dhaval Taunk](https://github.com/DhavalTaunk08) | [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/DhavalTaunk08/Transformers_scripts/blob/master/Transformers_multilabel_distilbert.ipynb)|
|[Fine-tune ALBERT for sentence-pair classification](https://github.com/NadirEM/nlp-notebooks/blob/master/Fine_tune_ALBERT_sentence_pair_classification.ipynb) | How to fine-tune an ALBERT model or another BERT-based model for the sentence-pair classification task | [Nadir El Manouzi](https://github.com/NadirEM) | [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/NadirEM/nlp-notebooks/blob/master/Fine_tune_ALBERT_sentence_pair_classification.ipynb)|
|[Fine-tune Roberta for sentiment analysis](https://github.com/DhavalTaunk08/NLP_scripts/blob/master/sentiment_analysis_using_roberta.ipynb) | How to fine-tune an Roberta model for sentiment analysis | [Dhaval Taunk](https://github.com/DhavalTaunk08) | [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/DhavalTaunk08/NLP_scripts/blob/master/sentiment_analysis_using_roberta.ipynb)|
|[Evaluating Question Generation Models](https://github.com/flexudy-pipe/qugeev) | How accurate are the answers to questions generated by your seq2seq transformer model? | [Pascal Zoleko](https://github.com/zolekode) | [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/drive/1bpsSqCQU-iw_5nNoRm_crPq6FRuJthq_?usp=sharing)|
|[Classify text with DistilBERT and Tensorflow](https://github.com/peterbayerle/huggingface_notebook/blob/main/distilbert_tf.ipynb) | How to fine-tune DistilBERT for text classification in TensorFlow | [Peter Bayerle](https://github.com/peterbayerle) | [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/peterbayerle/huggingface_notebook/blob/main/distilbert_tf.ipynb)|
|[Leverage BERT for Encoder-Decoder Summarization on CNN/Dailymail](https://github.com/patrickvonplaten/notebooks/blob/master/BERT2BERT_for_CNN_Dailymail.ipynb) | How to warm-start a *EncoderDecoderModel* with a *bert-base-uncased* checkpoint for summarization on CNN/Dailymail | [Patrick von Platen](https://github.com/patrickvonplaten) | [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/patrickvonplaten/notebooks/blob/master/BERT2BERT_for_CNN_Dailymail.ipynb)|
|[Leverage RoBERTa for Encoder-Decoder Summarization on BBC XSum](https://github.com/patrickvonplaten/notebooks/blob/master/RoBERTaShared_for_BBC_XSum.ipynb) | How to warm-start a shared *EncoderDecoderModel* with a *roberta-base* checkpoint for summarization on BBC/XSum | [Patrick von Platen](https://github.com/patrickvonplaten) | [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/patrickvonplaten/notebooks/blob/master/RoBERTaShared_for_BBC_XSum.ipynb)|
|[Fine-tune TAPAS on Sequential Question Answering (SQA)](https://github.com/NielsRogge/Transformers-Tutorials/blob/master/TAPAS/Fine_tuning_TapasForQuestionAnswering_on_SQA.ipynb) | How to fine-tune *TapasForQuestionAnswering* with a *tapas-base* checkpoint on the Sequential Question Answering (SQA) dataset | [Niels Rogge](https://github.com/nielsrogge) | [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/NielsRogge/Transformers-Tutorials/blob/master/TAPAS/Fine_tuning_TapasForQuestionAnswering_on_SQA.ipynb)|
|[Evaluate TAPAS on Table Fact Checking (TabFact)](https://github.com/NielsRogge/Transformers-Tutorials/blob/master/TAPAS/Evaluating_TAPAS_on_the_Tabfact_test_set.ipynb) | How to evaluate a fine-tuned *TapasForSequenceClassification* with a *tapas-base-finetuned-tabfact* checkpoint using a combination of the 🤗 datasets and 🤗 transformers libraries | [Niels Rogge](https://github.com/nielsrogge) | [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/NielsRogge/Transformers-Tutorials/blob/master/TAPAS/Evaluating_TAPAS_on_the_Tabfact_test_set.ipynb)|
|[Fine-tuning mBART for translation](https://colab.research.google.com/github/vasudevgupta7/huggingface-tutorials/blob/main/translation_training.ipynb) | How to fine-tune mBART using Seq2SeqTrainer for Hindi to English translation | [Vasudev Gupta](https://github.com/vasudevgupta7) | [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/vasudevgupta7/huggingface-tutorials/blob/main/translation_training.ipynb)|
|[Fine-tune LayoutLM on FUNSD (a form understanding dataset)](https://github.com/NielsRogge/Transformers-Tutorials/blob/master/LayoutLM/Fine_tuning_LayoutLMForTokenClassification_on_FUNSD.ipynb) | How to fine-tune *LayoutLMForTokenClassification* on the FUNSD dataset for information extraction from scanned documents | [Niels Rogge](https://github.com/nielsrogge) | [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/NielsRogge/Transformers-Tutorials/blob/master/LayoutLM/Fine_tuning_LayoutLMForTokenClassification_on_FUNSD.ipynb)|
|[Fine-Tune DistilGPT2 and Generate Text](https://colab.research.google.com/github/tripathiaakash/DistilGPT2-Tutorial/blob/main/distilgpt2_fine_tuning.ipynb) | How to fine-tune DistilGPT2 and generate text | [Aakash Tripathi](https://github.com/tripathiaakash) | [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/tripathiaakash/DistilGPT2-Tutorial/blob/main/distilgpt2_fine_tuning.ipynb)|
|[Fine-Tune LED on up to 8K tokens](https://github.com/patrickvonplaten/notebooks/blob/master/Fine_tune_Longformer_Encoder_Decoder_(LED)_for_Summarization_on_pubmed.ipynb) | How to fine-tune LED on pubmed for long-range summarization | [Patrick von Platen](https://github.com/patrickvonplaten) | [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/patrickvonplaten/notebooks/blob/master/Fine_tune_Longformer_Encoder_Decoder_(LED)_for_Summarization_on_pubmed.ipynb)|
|[Evaluate LED on Arxiv](https://github.com/patrickvonplaten/notebooks/blob/master/LED_on_Arxiv.ipynb) | How to effectively evaluate LED on long-range summarization | [Patrick von Platen](https://github.com/patrickvonplaten) | [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/patrickvonplaten/notebooks/blob/master/LED_on_Arxiv.ipynb)|
|[Fine-tune LayoutLM on RVL-CDIP (a document image classification dataset)](https://github.com/NielsRogge/Transformers-Tutorials/blob/master/LayoutLM/Fine_tuning_LayoutLMForSequenceClassification_on_RVL_CDIP.ipynb) | How to fine-tune *LayoutLMForSequenceClassification* on the RVL-CDIP dataset for scanned document classification | [Niels Rogge](https://github.com/nielsrogge) | [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/NielsRogge/Transformers-Tutorials/blob/master/LayoutLM/Fine_tuning_LayoutLMForSequenceClassification_on_RVL_CDIP.ipynb)|

View File

@@ -26,8 +26,11 @@ author = u'huggingface'
# The short X.Y version
version = u''
# The full version, including alpha/beta/rc tags
release = u'4.1.0'
release = u'4.3.0'
# Prefix link to point to master, comment this during version release and uncomment below line
extlinks = {'prefix_link': ('https://github.com/huggingface/transformers/blob/master/%s', '')}
# Prefix link to always point to corresponding version, uncomment this during version release
# extlinks = {'prefix_link': ('https://github.com/huggingface/transformers/blob/v'+ release + '/%s', '')}
# -- General configuration ---------------------------------------------------
@@ -40,6 +43,7 @@ release = u'4.1.0'
# ones.
extensions = [
'sphinx.ext.autodoc',
'sphinx.ext.extlinks',
'sphinx.ext.coverage',
'sphinx.ext.napoleon',
'recommonmark',

View File

@@ -27,19 +27,14 @@ BERT
You can convert any TensorFlow checkpoint for BERT (in particular `the pre-trained models released by Google
<https://github.com/google-research/bert#pre-trained-models>`_\ ) in a PyTorch save file by using the
`convert_bert_original_tf_checkpoint_to_pytorch.py
<https://github.com/huggingface/transformers/blob/master/src/transformers/convert_bert_original_tf_checkpoint_to_pytorch.py>`_
script.
:prefix_link:`convert_bert_original_tf_checkpoint_to_pytorch.py
<src/transformers/models/bert/convert_bert_original_tf_checkpoint_to_pytorch.py>` script.
This CLI takes as input a TensorFlow checkpoint (three files starting with ``bert_model.ckpt``\ ) and the associated
configuration file (\ ``bert_config.json``\ ), and creates a PyTorch model for this configuration, loads the weights
from the TensorFlow checkpoint in the PyTorch model and saves the resulting model in a standard PyTorch save file that
can be imported using ``torch.load()`` (see examples in `run_bert_extract_features.py
<https://github.com/huggingface/pytorch-pretrained-BERT/tree/master/examples/run_bert_extract_features.py>`_\ ,
`run_bert_classifier.py
<https://github.com/huggingface/pytorch-pretrained-BERT/tree/master/examples/run_bert_classifier.py>`_ and
`run_bert_squad.py <https://github.com/huggingface/pytorch-pretrained-BERT/tree/master/examples/run_bert_squad.py>`_\
).
can be imported using ``from_pretrained()`` (see example in :doc:`quicktour` , `run_glue.py
<https://github.com/huggingface/transformers/blob/master/examples/text-classification/run_glue.py>`_\ ).
You only need to run this conversion script **once** to get a PyTorch model. You can then disregard the TensorFlow
checkpoint (the three files starting with ``bert_model.ckpt``\ ) but be sure to keep the configuration file (\
@@ -66,9 +61,8 @@ ALBERT
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
Convert TensorFlow model checkpoints of ALBERT to PyTorch using the
`convert_albert_original_tf_checkpoint_to_pytorch.py
<https://github.com/huggingface/transformers/blob/master/src/transformers/convert_bert_original_tf_checkpoint_to_pytorch.py>`_
script.
:prefix_link:`convert_albert_original_tf_checkpoint_to_pytorch.py
<src/transformers/models/albert/convert_albert_original_tf_checkpoint_to_pytorch.py>` script.
The CLI takes as input a TensorFlow checkpoint (three files starting with ``model.ckpt-best``\ ) and the accompanying
configuration file (\ ``albert_config.json``\ ), then creates and saves a PyTorch model. To run this conversion you
@@ -170,3 +164,18 @@ Here is an example of the conversion process for a pre-trained XLM model:
--pytorch_dump_output $PYTORCH_DUMP_OUTPUT
[--config XML_CONFIG] \
[--finetuning_task_name XML_FINETUNED_TASK]
T5
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
Here is an example of the conversion process for a pre-trained T5 model:
.. code-block:: shell
export T5=/path/to/t5/uncased_L-12_H-768_A-12
transformers-cli convert --model_type t5 \
--tf_checkpoint $T5/t5_model.ckpt \
--config $T5/t5_config.json \
--pytorch_dump_output $T5/pytorch_model.bin

View File

@@ -75,7 +75,7 @@ read this in.
test_texts, test_labels = read_imdb_split('aclImdb/test')
We now have a train and test dataset, but let's also also create a validation set which we can use for for evaluation
and tuning without training our test set results. Sklearn has a convenient utility for creating such splits:
and tuning without tainting our test set results. Sklearn has a convenient utility for creating such splits:
.. code-block:: python
@@ -558,12 +558,15 @@ we can use the built in :func:`~transformers.BatchEncoding.char_to_token` method
end_positions = []
for i in range(len(answers)):
start_positions.append(encodings.char_to_token(i, answers[i]['answer_start']))
end_positions.append(encodings.char_to_token(i, answers[i]['answer_end'] - 1))
# if None, the answer passage has been truncated
end_positions.append(encodings.char_to_token(i, answers[i]['answer_end']))
# if start position is None, the answer passage has been truncated
if start_positions[-1] is None:
start_positions[-1] = tokenizer.model_max_length
# if end position is None, the 'char_to_token' function points to the space before the correct token - > add + 1
if end_positions[-1] is None:
end_positions[-1] = tokenizer.model_max_length
end_positions[-1] = encodings.char_to_token(i, answers[i]['answer_end'] + 1)
encodings.update({'start_positions': start_positions, 'end_positions': end_positions})
add_token_positions(train_encodings, train_answers)

View File

@@ -24,11 +24,11 @@ General terms
- MLM: masked language modeling, a pretraining task where the model sees a corrupted version of the texts, usually done
by masking some tokens randomly, and has to predict the original text.
- multimodal: a task that combines texts with another kind of inputs (for instance images).
- NLG: natural language generation, all tasks related to generating text ( for instance talk with transformers,
translation)
- NLG: natural language generation, all tasks related to generating text (for instance talk with transformers,
translation).
- NLP: natural language processing, a generic way to say "deal with texts".
- NLU: natural language understanding, all tasks related to understanding what is in a text (for instance classifying
the whole text, individual words)
the whole text, individual words).
- pretrained model: a model that has been pretrained on some data (for instance all of Wikipedia). Pretraining methods
involve a self-supervised objective, which can be reading the text and trying to predict the next word (see CLM) or
masking some words and trying to predict them (see MLM).
@@ -226,7 +226,7 @@ Contrary to RNNs that have the position of each token embedded within them, tran
each token. Therefore, the position IDs (``position_ids``) are used by the model to identify each token's position in
the list of tokens.
They are an optional parameter. If no ``position_ids`` is passed to the model, the IDs are automatically created as
They are an optional parameter. If no ``position_ids`` are passed to the model, the IDs are automatically created as
absolute positional embeddings.
Absolute positional embeddings are selected in the range ``[0, config.max_position_embeddings - 1]``. Some models use

Binary file not shown.

After

Width:  |  Height:  |  Size: 145 KiB

View File

@@ -100,98 +100,109 @@ and conversion utilities for the following models:
6. :doc:`Blenderbot <model_doc/blenderbot>` (from Facebook) released with the paper `Recipes for building an
open-domain chatbot <https://arxiv.org/abs/2004.13637>`__ by Stephen Roller, Emily Dinan, Naman Goyal, Da Ju, Mary
Williamson, Yinhan Liu, Jing Xu, Myle Ott, Kurt Shuster, Eric M. Smith, Y-Lan Boureau, Jason Weston.
7. :doc:`CamemBERT <model_doc/camembert>` (from Inria/Facebook/Sorbonne) released with the paper `CamemBERT: a Tasty
7. :doc:`BlenderbotSmall <model_doc/blenderbot_small>` (from Facebook) released with the paper `Recipes for building an
open-domain chatbot <https://arxiv.org/abs/2004.13637>`__ by Stephen Roller, Emily Dinan, Naman Goyal, Da Ju, Mary
Williamson, Yinhan Liu, Jing Xu, Myle Ott, Kurt Shuster, Eric M. Smith, Y-Lan Boureau, Jason Weston.
8. :doc:`BORT <model_doc/bort>` (from Alexa) released with the paper `Optimal Subarchitecture Extraction For BERT
<https://arxiv.org/abs/2010.10499>`__ by Adrian de Wynter and Daniel J. Perry.
9. :doc:`CamemBERT <model_doc/camembert>` (from Inria/Facebook/Sorbonne) released with the paper `CamemBERT: a Tasty
French Language Model <https://arxiv.org/abs/1911.03894>`__ by Louis Martin*, Benjamin Muller*, Pedro Javier Ortiz
Suárez*, Yoann Dupont, Laurent Romary, Éric Villemonte de la Clergerie, Djamé Seddah and Benoît Sagot.
8. :doc:`CTRL <model_doc/ctrl>` (from Salesforce) released with the paper `CTRL: A Conditional Transformer Language
Model for Controllable Generation <https://arxiv.org/abs/1909.05858>`__ by Nitish Shirish Keskar*, Bryan McCann*,
Lav R. Varshney, Caiming Xiong and Richard Socher.
9. :doc:`DeBERTa <model_doc/deberta>` (from Microsoft Research) released with the paper `DeBERTa: Decoding-enhanced
BERT with Disentangled Attention <https://arxiv.org/abs/2006.03654>`__ by Pengcheng He, Xiaodong Liu, Jianfeng Gao,
Weizhu Chen.
10. :doc:`DialoGPT <model_doc/dialogpt>` (from Microsoft Research) released with the paper `DialoGPT: Large-Scale
10. :doc:`ConvBERT <model_doc/convbert>` (from YituTech) released with the paper `ConvBERT: Improving BERT with
Span-based Dynamic Convolution <https://arxiv.org/abs/2008.02496>`__ by Zihang Jiang, Weihao Yu, Daquan Zhou,
Yunpeng Chen, Jiashi Feng, Shuicheng Yan.
11. :doc:`CTRL <model_doc/ctrl>` (from Salesforce) released with the paper `CTRL: A Conditional Transformer Language
Model for Controllable Generation <https://arxiv.org/abs/1909.05858>`__ by Nitish Shirish Keskar*, Bryan McCann*,
Lav R. Varshney, Caiming Xiong and Richard Socher.
12. :doc:`DeBERTa <model_doc/deberta>` (from Microsoft Research) released with the paper `DeBERTa: Decoding-enhanced
BERT with Disentangled Attention <https://arxiv.org/abs/2006.03654>`__ by Pengcheng He, Xiaodong Liu, Jianfeng Gao,
Weizhu Chen.
13. :doc:`DialoGPT <model_doc/dialogpt>` (from Microsoft Research) released with the paper `DialoGPT: Large-Scale
Generative Pre-training for Conversational Response Generation <https://arxiv.org/abs/1911.00536>`__ by Yizhe
Zhang, Siqi Sun, Michel Galley, Yen-Chun Chen, Chris Brockett, Xiang Gao, Jianfeng Gao, Jingjing Liu, Bill Dolan.
11. :doc:`DistilBERT <model_doc/distilbert>` (from HuggingFace), released together with the paper `DistilBERT, a
14. :doc:`DistilBERT <model_doc/distilbert>` (from HuggingFace), released together with the paper `DistilBERT, a
distilled version of BERT: smaller, faster, cheaper and lighter <https://arxiv.org/abs/1910.01108>`__ by Victor
Sanh, Lysandre Debut and Thomas Wolf. The same method has been applied to compress GPT2 into `DistilGPT2
<https://github.com/huggingface/transformers/tree/master/examples/distillation>`__, RoBERTa into `DistilRoBERTa
<https://github.com/huggingface/transformers/tree/master/examples/distillation>`__, Multilingual BERT into
`DistilmBERT <https://github.com/huggingface/transformers/tree/master/examples/distillation>`__ and a German
version of DistilBERT.
12. :doc:`DPR <model_doc/dpr>` (from Facebook) released with the paper `Dense Passage Retrieval for Open-Domain
15. :doc:`DPR <model_doc/dpr>` (from Facebook) released with the paper `Dense Passage Retrieval for Open-Domain
Question Answering <https://arxiv.org/abs/2004.04906>`__ by Vladimir Karpukhin, Barlas Oğuz, Sewon Min, Patrick
Lewis, Ledell Wu, Sergey Edunov, Danqi Chen, and Wen-tau Yih.
13. :doc:`ELECTRA <model_doc/electra>` (from Google Research/Stanford University) released with the paper `ELECTRA:
16. :doc:`ELECTRA <model_doc/electra>` (from Google Research/Stanford University) released with the paper `ELECTRA:
Pre-training text encoders as discriminators rather than generators <https://arxiv.org/abs/2003.10555>`__ by Kevin
Clark, Minh-Thang Luong, Quoc V. Le, Christopher D. Manning.
14. :doc:`FlauBERT <model_doc/flaubert>` (from CNRS) released with the paper `FlauBERT: Unsupervised Language Model
17. :doc:`FlauBERT <model_doc/flaubert>` (from CNRS) released with the paper `FlauBERT: Unsupervised Language Model
Pre-training for French <https://arxiv.org/abs/1912.05372>`__ by Hang Le, Loïc Vial, Jibril Frej, Vincent Segonne,
Maximin Coavoux, Benjamin Lecouteux, Alexandre Allauzen, Benoît Crabbé, Laurent Besacier, Didier Schwab.
15. :doc:`Funnel Transformer <model_doc/funnel>` (from CMU/Google Brain) released with the paper `Funnel-Transformer:
18. :doc:`Funnel Transformer <model_doc/funnel>` (from CMU/Google Brain) released with the paper `Funnel-Transformer:
Filtering out Sequential Redundancy for Efficient Language Processing <https://arxiv.org/abs/2006.03236>`__ by
Zihang Dai, Guokun Lai, Yiming Yang, Quoc V. Le.
16. :doc:`GPT <model_doc/gpt>` (from OpenAI) released with the paper `Improving Language Understanding by Generative
19. :doc:`GPT <model_doc/gpt>` (from OpenAI) released with the paper `Improving Language Understanding by Generative
Pre-Training <https://blog.openai.com/language-unsupervised/>`__ by Alec Radford, Karthik Narasimhan, Tim Salimans
and Ilya Sutskever.
17. :doc:`GPT-2 <model_doc/gpt2>` (from OpenAI) released with the paper `Language Models are Unsupervised Multitask
20. :doc:`GPT-2 <model_doc/gpt2>` (from OpenAI) released with the paper `Language Models are Unsupervised Multitask
Learners <https://blog.openai.com/better-language-models/>`__ by Alec Radford*, Jeffrey Wu*, Rewon Child, David
Luan, Dario Amodei** and Ilya Sutskever**.
18. :doc:`LayoutLM <model_doc/layoutlm>` (from Microsoft Research Asia) released with the paper `LayoutLM: Pre-training
21. :doc:`LayoutLM <model_doc/layoutlm>` (from Microsoft Research Asia) released with the paper `LayoutLM: Pre-training
of Text and Layout for Document Image Understanding <https://arxiv.org/abs/1912.13318>`__ by Yiheng Xu, Minghao Li,
Lei Cui, Shaohan Huang, Furu Wei, Ming Zhou.
19. :doc:`Longformer <model_doc/longformer>` (from AllenAI) released with the paper `Longformer: The Long-Document
22. :doc:`LED <model_doc/led>` (from AllenAI) released with the paper `Longformer: The Long-Document Transformer
<https://arxiv.org/abs/2004.05150>`__ by Iz Beltagy, Matthew E. Peters, Arman Cohan.
23. :doc:`Longformer <model_doc/longformer>` (from AllenAI) released with the paper `Longformer: The Long-Document
Transformer <https://arxiv.org/abs/2004.05150>`__ by Iz Beltagy, Matthew E. Peters, Arman Cohan.
20. :doc:`LXMERT <model_doc/lxmert>` (from UNC Chapel Hill) released with the paper `LXMERT: Learning Cross-Modality
24. :doc:`LXMERT <model_doc/lxmert>` (from UNC Chapel Hill) released with the paper `LXMERT: Learning Cross-Modality
Encoder Representations from Transformers for Open-Domain Question Answering <https://arxiv.org/abs/1908.07490>`__
by Hao Tan and Mohit Bansal.
21. :doc:`MarianMT <model_doc/marian>` Machine translation models trained using `OPUS <http://opus.nlpl.eu/>`__ data by
25. :doc:`MarianMT <model_doc/marian>` Machine translation models trained using `OPUS <http://opus.nlpl.eu/>`__ data by
Jörg Tiedemann. The `Marian Framework <https://marian-nmt.github.io/>`__ is being developed by the Microsoft
Translator Team.
22. :doc:`MBart <model_doc/mbart>` (from Facebook) released with the paper `Multilingual Denoising Pre-training for
26. :doc:`MBart <model_doc/mbart>` (from Facebook) released with the paper `Multilingual Denoising Pre-training for
Neural Machine Translation <https://arxiv.org/abs/2001.08210>`__ by Yinhan Liu, Jiatao Gu, Naman Goyal, Xian Li,
Sergey Edunov, Marjan Ghazvininejad, Mike Lewis, Luke Zettlemoyer.
23. :doc:`MPNet <model_doc/mpnet>` (from Microsoft Research) released with the paper `MPNet: Masked and Permuted
27. :doc:`MPNet <model_doc/mpnet>` (from Microsoft Research) released with the paper `MPNet: Masked and Permuted
Pre-training for Language Understanding <https://arxiv.org/abs/2004.09297>`__ by Kaitao Song, Xu Tan, Tao Qin,
Jianfeng Lu, Tie-Yan Liu.
24. :doc:`MT5 <model_doc/mt5>` (from Google AI) released with the paper `mT5: A massively multilingual pre-trained
28. :doc:`MT5 <model_doc/mt5>` (from Google AI) released with the paper `mT5: A massively multilingual pre-trained
text-to-text transformer <https://arxiv.org/abs/2010.11934>`__ by Linting Xue, Noah Constant, Adam Roberts, Mihir
Kale, Rami Al-Rfou, Aditya Siddhant, Aditya Barua, Colin Raffel.
25. :doc:`Pegasus <model_doc/pegasus>` (from Google) released with the paper `PEGASUS: Pre-training with Extracted
29. :doc:`Pegasus <model_doc/pegasus>` (from Google) released with the paper `PEGASUS: Pre-training with Extracted
Gap-sentences for Abstractive Summarization <https://arxiv.org/abs/1912.08777>`__> by Jingqing Zhang, Yao Zhao,
Mohammad Saleh and Peter J. Liu.
26. :doc:`ProphetNet <model_doc/prophetnet>` (from Microsoft Research) released with the paper `ProphetNet: Predicting
30. :doc:`ProphetNet <model_doc/prophetnet>` (from Microsoft Research) released with the paper `ProphetNet: Predicting
Future N-gram for Sequence-to-Sequence Pre-training <https://arxiv.org/abs/2001.04063>`__ by Yu Yan, Weizhen Qi,
Yeyun Gong, Dayiheng Liu, Nan Duan, Jiusheng Chen, Ruofei Zhang and Ming Zhou.
27. :doc:`Reformer <model_doc/reformer>` (from Google Research) released with the paper `Reformer: The Efficient
31. :doc:`Reformer <model_doc/reformer>` (from Google Research) released with the paper `Reformer: The Efficient
Transformer <https://arxiv.org/abs/2001.04451>`__ by Nikita Kitaev, Łukasz Kaiser, Anselm Levskaya.
28. :doc:`RoBERTa <model_doc/roberta>` (from Facebook), released together with the paper a `Robustly Optimized BERT
32. :doc:`RoBERTa <model_doc/roberta>` (from Facebook), released together with the paper a `Robustly Optimized BERT
Pretraining Approach <https://arxiv.org/abs/1907.11692>`__ by Yinhan Liu, Myle Ott, Naman Goyal, Jingfei Du, Mandar
Joshi, Danqi Chen, Omer Levy, Mike Lewis, Luke Zettlemoyer, Veselin Stoyanov. ultilingual BERT into `DistilmBERT
<https://github.com/huggingface/transformers/tree/master/examples/distillation>`__ and a German version of
DistilBERT.
29. :doc:`SqueezeBert <model_doc/squeezebert>` released with the paper `SqueezeBERT: What can computer vision teach NLP
Joshi, Danqi Chen, Omer Levy, Mike Lewis, Luke Zettlemoyer, Veselin Stoyanov.
33. :doc:`SqueezeBert <model_doc/squeezebert>` released with the paper `SqueezeBERT: What can computer vision teach NLP
about efficient neural networks? <https://arxiv.org/abs/2006.11316>`__ by Forrest N. Iandola, Albert E. Shaw, Ravi
Krishna, and Kurt W. Keutzer.
30. :doc:`T5 <model_doc/t5>` (from Google AI) released with the paper `Exploring the Limits of Transfer Learning with a
34. :doc:`T5 <model_doc/t5>` (from Google AI) released with the paper `Exploring the Limits of Transfer Learning with a
Unified Text-to-Text Transformer <https://arxiv.org/abs/1910.10683>`__ by Colin Raffel and Noam Shazeer and Adam
Roberts and Katherine Lee and Sharan Narang and Michael Matena and Yanqi Zhou and Wei Li and Peter J. Liu.
31. `TAPAS <https://huggingface.co/transformers/master/model_doc/tapas.html>`__ released with the paper `TAPAS: Weakly
Supervised Table Parsing via Pre-training <https://arxiv.org/abs/2004.02349>`__ by Jonathan Herzig, Paweł Krzysztof
Nowak, Thomas Müller, Francesco Piccinno and Julian Martin Eisenschlos.
32. :doc:`Transformer-XL <model_doc/transformerxl>` (from Google/CMU) released with the paper `Transformer-XL:
35. :doc:`TAPAS <model_doc/tapas>` (from Google AI) released with the paper `TAPAS: Weakly Supervised Table Parsing via
Pre-training <https://arxiv.org/abs/2004.02349>`__ by Jonathan Herzig, Paweł Krzysztof Nowak, Thomas Müller,
Francesco Piccinno and Julian Martin Eisenschlos.
36. :doc:`Transformer-XL <model_doc/transformerxl>` (from Google/CMU) released with the paper `Transformer-XL:
Attentive Language Models Beyond a Fixed-Length Context <https://arxiv.org/abs/1901.02860>`__ by Zihang Dai*,
Zhilin Yang*, Yiming Yang, Jaime Carbonell, Quoc V. Le, Ruslan Salakhutdinov.
33. :doc:`XLM <model_doc/xlm>` (from Facebook) released together with the paper `Cross-lingual Language Model
37. :doc:`Wav2Vec2 <model_doc/wav2vec2>` (from Facebook AI) released with the paper `wav2vec 2.0: A Framework for
Self-Supervised Learning of Speech Representations <https://arxiv.org/abs/2006.11477>`__ by Alexei Baevski, Henry
Zhou, Abdelrahman Mohamed, Michael Auli.
38. :doc:`XLM <model_doc/xlm>` (from Facebook) released together with the paper `Cross-lingual Language Model
Pretraining <https://arxiv.org/abs/1901.07291>`__ by Guillaume Lample and Alexis Conneau.
34. :doc:`XLM-ProphetNet <model_doc/xlmprophetnet>` (from Microsoft Research) released with the paper `ProphetNet:
39. :doc:`XLM-ProphetNet <model_doc/xlmprophetnet>` (from Microsoft Research) released with the paper `ProphetNet:
Predicting Future N-gram for Sequence-to-Sequence Pre-training <https://arxiv.org/abs/2001.04063>`__ by Yu Yan,
Weizhen Qi, Yeyun Gong, Dayiheng Liu, Nan Duan, Jiusheng Chen, Ruofei Zhang and Ming Zhou.
35. :doc:`XLM-RoBERTa <model_doc/xlmroberta>` (from Facebook AI), released together with the paper `Unsupervised
40. :doc:`XLM-RoBERTa <model_doc/xlmroberta>` (from Facebook AI), released together with the paper `Unsupervised
Cross-lingual Representation Learning at Scale <https://arxiv.org/abs/1911.02116>`__ by Alexis Conneau*, Kartikay
Khandelwal*, Naman Goyal, Vishrav Chaudhary, Guillaume Wenzek, Francisco Guzmán, Edouard Grave, Myle Ott, Luke
Zettlemoyer and Veselin Stoyanov.
36. :doc:`XLNet <model_doc/xlnet>` (from Google/CMU) released with the paper `XLNet: Generalized Autoregressive
41. :doc:`XLNet <model_doc/xlnet>` (from Google/CMU) released with the paper `XLNet: Generalized Autoregressive
Pretraining for Language Understanding <https://arxiv.org/abs/1906.08237>`__ by Zhilin Yang*, Zihang Dai*, Yiming
Yang, Jaime Carbonell, Ruslan Salakhutdinov, Quoc V. Le.
@@ -220,10 +231,14 @@ TensorFlow and/or Flax.
+-----------------------------+----------------+----------------+-----------------+--------------------+--------------+
| Blenderbot | ✅ | ❌ | ✅ | ✅ | ❌ |
+-----------------------------+----------------+----------------+-----------------+--------------------+--------------+
| BlenderbotSmall | ✅ | ❌ | ✅ | ✅ | ❌ |
+-----------------------------+----------------+----------------+-----------------+--------------------+--------------+
| CTRL | ✅ | ❌ | ✅ | ✅ | ❌ |
+-----------------------------+----------------+----------------+-----------------+--------------------+--------------+
| CamemBERT | ✅ | ✅ | ✅ | ✅ | ❌ |
+-----------------------------+----------------+----------------+-----------------+--------------------+--------------+
| ConvBERT | ✅ | ✅ | ✅ | ✅ | ❌ |
+-----------------------------+----------------+----------------+-----------------+--------------------+--------------+
| DPR | ✅ | ✅ | ✅ | ✅ | ❌ |
+-----------------------------+----------------+----------------+-----------------+--------------------+--------------+
| DeBERTa | ✅ | ❌ | ✅ | ❌ | ❌ |
@@ -240,6 +255,8 @@ TensorFlow and/or Flax.
+-----------------------------+----------------+----------------+-----------------+--------------------+--------------+
| Funnel Transformer | ✅ | ✅ | ✅ | ✅ | ❌ |
+-----------------------------+----------------+----------------+-----------------+--------------------+--------------+
| LED | ✅ | ✅ | ✅ | ✅ | ❌ |
+-----------------------------+----------------+----------------+-----------------+--------------------+--------------+
| LXMERT | ✅ | ✅ | ✅ | ✅ | ❌ |
+-----------------------------+----------------+----------------+-----------------+--------------------+--------------+
| LayoutLM | ✅ | ✅ | ✅ | ❌ | ❌ |
@@ -276,6 +293,8 @@ TensorFlow and/or Flax.
+-----------------------------+----------------+----------------+-----------------+--------------------+--------------+
| Transformer-XL | ✅ | ❌ | ✅ | ✅ | ❌ |
+-----------------------------+----------------+----------------+-----------------+--------------------+--------------+
| Wav2Vec2 | ✅ | ❌ | ✅ | ❌ | ❌ |
+-----------------------------+----------------+----------------+-----------------+--------------------+--------------+
| XLM | ✅ | ❌ | ✅ | ✅ | ❌ |
+-----------------------------+----------------+----------------+-----------------+--------------------+--------------+
| XLM-RoBERTa | ✅ | ✅ | ✅ | ✅ | ❌ |
@@ -318,9 +337,11 @@ TensorFlow and/or Flax.
examples
custom_datasets
notebooks
community
converting_tensorflow_models
migration
contributing
add_new_model
testing
serialization
@@ -356,9 +377,13 @@ TensorFlow and/or Flax.
model_doc/bart
model_doc/barthez
model_doc/bert
model_doc/bertweet
model_doc/bertgeneration
model_doc/blenderbot
model_doc/blenderbot_small
model_doc/bort
model_doc/camembert
model_doc/convbert
model_doc/ctrl
model_doc/deberta
model_doc/dialogpt
@@ -369,7 +394,9 @@ TensorFlow and/or Flax.
model_doc/flaubert
model_doc/fsmt
model_doc/funnel
model_doc/herbert
model_doc/layoutlm
model_doc/led
model_doc/longformer
model_doc/lxmert
model_doc/marian
@@ -380,6 +407,7 @@ TensorFlow and/or Flax.
model_doc/gpt
model_doc/gpt2
model_doc/pegasus
model_doc/phobert
model_doc/prophetnet
model_doc/rag
model_doc/reformer
@@ -389,6 +417,7 @@ TensorFlow and/or Flax.
model_doc/t5
model_doc/tapas
model_doc/transformerxl
model_doc/wav2vec2
model_doc/xlm
model_doc/xlmprophetnet
model_doc/xlmroberta

View File

@@ -19,7 +19,7 @@ limitations under the License.
🤗 Transformers is tested on Python 3.6+, and PyTorch 1.1.0+ or TensorFlow 2.0+.
You should install 🤗 Transformers in a [virtual environment](https://docs.python.org/3/library/venv.html). If you're
unfamiliar with Python virtual environments, check out the [user guide](https://packaging.python.org/guides/installing-using-pip-and-virtual-environments/). Create a virtual environment with the version of Python you're going
unfamiliar with Python virtual environments, check out the [user guide](https://packaging.python.org/guides/installing-using-pip-and-virtual-environments/). Create a virtual environment with the version of Python you're going
to use and activate it.
Now, if you want to use 🤗 Transformers, you can install it with pip. If you'd like to play with the examples, you
@@ -28,8 +28,8 @@ must install it from source.
## Installation with pip
First you need to install one of, or both, TensorFlow 2.0 and PyTorch.
Please refer to [TensorFlow installation page](https://www.tensorflow.org/install/pip#tensorflow-2.0-rc-is-available),
[PyTorch installation page](https://pytorch.org/get-started/locally/#start-locally) and/or
Please refer to [TensorFlow installation page](https://www.tensorflow.org/install/pip#tensorflow-2.0-rc-is-available),
[PyTorch installation page](https://pytorch.org/get-started/locally/#start-locally) and/or
[Flax installation page](https://github.com/google/flax#quick-install)
regarding the specific install command for your platform.
@@ -73,7 +73,27 @@ It should download a pretrained model then print something like
## Installing from source
To install from source, clone the repository and install with the following commands:
Here is how to quickly install `transformers` from source:
```bash
pip install git+https://github.com/huggingface/transformers
```
Note that this will install not the latest released version, but the bleeding edge `master` version, which you may want to use in case a bug has been fixed since the last official release and a new release hasn't been yet rolled out.
While we strive to keep `master` operational at all times, if you notice some issues, they usually get fixed within a few hours or a day and and you're more than welcome to help us detect any problems by opening an [Issue](https://github.com/huggingface/transformers/issues) and this way, things will get fixed even sooner.
Again, you can run:
```bash
python -c "from transformers import pipeline; print(pipeline('sentiment-analysis')('I hate you'))"
```
to check 🤗 Transformers is properly installed.
## Editable install
If you want to constantly use the bleeding edge `master` version of the source code, or if you want to contribute to the library and need to test the changes in the code you're making, you will need an editable install. This is done by cloning the repository and installing with the following commands:
``` bash
git clone https://github.com/huggingface/transformers.git
@@ -81,13 +101,22 @@ cd transformers
pip install -e .
```
Again, you can run
This command performs a magical link between the folder you cloned the repository to and your python library paths, and it'll look inside this folder in addition to the normal library-wide paths. So if normally your python packages get installed into:
```
~/anaconda3/envs/main/lib/python3.7/site-packages/
```
now this editable install will reside where you clone the folder to, e.g. `~/transformers/` and python will search it too.
```bash
python -c "from transformers import pipeline; print(pipeline('sentiment-analysis')('I hate you'))"
Do note that you have to keep that `transformers` folder around and not delete it to continue using the `transfomers` library.
Now, let's get to the real benefit of this installation approach. Say, you saw some new feature has been just committed into `master`. If you have already performed all the steps above, to update your transformers to include all the latest commits, all you need to do is to `cd` into that cloned repository folder and update the clone to the latest version:
```
cd ~/transformers/
git pull
```
to check 🤗 Transformers is properly installed.
There is nothing else to do. Your python environment will find the bleeding edge version of `transformers` on the next run.
## With conda
@@ -100,7 +129,7 @@ Since Transformers version v4.0.0, we now have a conda channel: `huggingface`.
conda install -c huggingface transformers
```
Follow the installation pages of TensorFlow, PyTorch or Flax to see how to install them with conda.
Follow the installation pages of TensorFlow, PyTorch or Flax to see how to install them with conda.
## Caching models
@@ -109,7 +138,7 @@ This library provides pretrained models that will be downloaded and cached local
folder given by the shell environment variable ``TRANSFORMERS_CACHE``. The default value for it will be the Hugging
Face cache home followed by ``/transformers/``. This is (by order of priority):
* shell environment variable ``HF_HOME``
* shell environment variable ``HF_HOME``
* shell environment variable ``XDG_CACHE_HOME`` + ``/huggingface/``
* default: ``~/.cache/huggingface/``
@@ -130,7 +159,7 @@ faster, and cheaper. Feel free to contact us privately if you need any help.
You should check out our [swift-coreml-transformers](https://github.com/huggingface/swift-coreml-transformers) repo.
It contains a set of tools to convert PyTorch or TensorFlow 2.0 trained Transformer models (currently contains `GPT-2`,
It contains a set of tools to convert PyTorch or TensorFlow 2.0 trained Transformer models (currently contains `GPT-2`,
`DistilGPT-2`, `BERT`, and `DistilBERT`) to CoreML models that run on iOS devices.
At some point in the future, you'll be able to seamlessly move from pretraining or fine-tuning models in PyTorch or

View File

@@ -13,13 +13,102 @@
Utilities for Generation
-----------------------------------------------------------------------------------------------------------------------
This page lists all the utility functions used by :meth:`~transformers.PretrainedModel.generate`,
:meth:`~transformers.PretrainedModel.greedy_search`, :meth:`~transformers.PretrainedModel.sample`,
:meth:`~transformers.PretrainedModel.beam_search`, :meth:`~transformers.PretrainedModel.beam_sample`, and
:meth:`~transformers.PretrainedModel.group_beam_search`.
This page lists all the utility functions used by :meth:`~transformers.PreTrainedModel.generate`,
:meth:`~transformers.PreTrainedModel.greedy_search`, :meth:`~transformers.PreTrainedModel.sample`,
:meth:`~transformers.PreTrainedModel.beam_search`, :meth:`~transformers.PreTrainedModel.beam_sample`, and
:meth:`~transformers.PreTrainedModel.group_beam_search`.
Most of those are only useful if you are studying the code of the generate methods in the library.
Generate Outputs
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
The output of :meth:`~transformers.PreTrainedModel.generate` is an instance of a subclass of
:class:`~transformers.file_utils.ModelOutput`. This output is a data structure containing all the information returned
by :meth:`~transformers.PreTrainedModel.generate`, but that can also be used as tuple or dictionary.
Here's an example:
.. code-block::
from transformers import GPT2Tokenizer, GPT2LMHeadModel
tokenizer = GPT2Tokenizer.from_pretrained('gpt2')
model = GPT2LMHeadModel.from_pretrained('gpt2')
inputs = tokenizer("Hello, my dog is cute and ", return_tensors="pt")
generation_output = model.generate(**inputs, return_dict_in_generate=True, output_scores=True)
The ``generation_output`` object is a :class:`~transformers.generation_utils.GreedySearchDecoderOnlyOutput`, as we can
see in the documentation of that class below, it means it has the following attributes:
- ``sequences``: the generated sequences of tokens
- ``scores`` (optional): the prediction scores of the language modelling head, for each generation step
- ``hidden_states`` (optional): the hidden states of the model, for each generation step
- ``attentions`` (optional): the attention weights of the model, for each generation step
Here we have the ``scores`` since we passed along ``output_scores=True``, but we don't have ``hidden_states`` and
``attentions`` because we didn't pass ``output_hidden_states=True`` or ``output_attentions=True``.
You can access each attribute as you would usually do, and if that attribute has not been returned by the model, you
will get ``None``. Here for instance ``generation_output.scores`` are all the generated prediction scores of the
language modeling head, and ``generation_output.attentions`` is ``None``.
When using our ``generation_output`` object as a tuple, it only keeps the attributes that don't have ``None`` values.
Here, for instance, it has two elements, ``loss`` then ``logits``, so
.. code-block::
generation_output[:2]
will return the tuple ``(generation_output.sequences, generation_output.scores)`` for instance.
When using our ``generation_output`` object as a dictionary, it only keeps the attributes that don't have ``None``
values. Here, for instance, it has two keys that are ``sequences`` and ``scores``.
We document here all output types.
GreedySearchOutput
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
.. autoclass:: transformers.generation_utils.GreedySearchDecoderOnlyOutput
:members:
.. autoclass:: transformers.generation_utils.GreedySearchEncoderDecoderOutput
:members:
SampleOutput
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
.. autoclass:: transformers.generation_utils.SampleDecoderOnlyOutput
:members:
.. autoclass:: transformers.generation_utils.SampleEncoderDecoderOutput
:members:
BeamSearchOutput
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
.. autoclass:: transformers.generation_utils.BeamSearchDecoderOnlyOutput
:members:
.. autoclass:: transformers.generation_utils.BeamSearchEncoderDecoderOutput
:members:
BeamSampleOutput
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
.. autoclass:: transformers.generation_utils.BeamSampleDecoderOnlyOutput
:members:
.. autoclass:: transformers.generation_utils.BeamSampleEncoderDecoderOutput
:members:
LogitsProcessor
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

View File

@@ -43,6 +43,10 @@ Schedules
Learning Rate Schedules (Pytorch)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
.. autoclass:: transformers.SchedulerType
.. autofunction:: transformers.get_scheduler
.. autofunction:: transformers.get_constant_schedule

View File

@@ -126,13 +126,6 @@ CausalLMOutputWithCrossAttentions
:members:
CausalLMOutputWithPastAndCrossAttentions
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
.. autoclass:: transformers.modeling_outputs.CausalLMOutputWithPastAndCrossAttentions
:members:
CausalLMOutputWithPast
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

View File

@@ -168,5 +168,5 @@ Using `tensorflow_datasets` is as easy as using a data file:
)
Another example using these processors is given in the `run_squad.py
<https://github.com/huggingface/transformers/blob/master/examples/question-answering/run_squad.py>`__ script.
Another example using these processors is given in the :prefix_link:`run_squad.py
<examples/question-answering/run_squad.py>` script.

View File

@@ -56,6 +56,8 @@ PreTrainedTokenizer
:special-members: __call__
:members:
.. automethod:: encode
PreTrainedTokenizerFast
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
@@ -64,6 +66,8 @@ PreTrainedTokenizerFast
:special-members: __call__
:members:
.. automethod:: encode
BatchEncoding
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

View File

@@ -1,4 +1,4 @@
..
..
Copyright 2020 The HuggingFace Team. All rights reserved.
Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with
@@ -63,6 +63,13 @@ Trainer
:members:
Seq2SeqTrainer
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
.. autoclass:: transformers.Seq2SeqTrainer
:members: evaluate, predict
TFTrainer
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
@@ -77,8 +84,639 @@ TrainingArguments
:members:
Seq2SeqTrainingArguments
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
.. autoclass:: transformers.Seq2SeqTrainingArguments
:members:
TFTrainingArguments
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
.. autoclass:: transformers.TFTrainingArguments
:members:
Trainer Integrations
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
The :class:`~transformers.Trainer` has been extended to support libraries that may dramatically improve your training
time and fit much bigger models.
Currently it supports third party solutions, `DeepSpeed <https://github.com/microsoft/DeepSpeed>`__ and `FairScale
<https://github.com/facebookresearch/fairscale/>`__, which implement parts of the paper `ZeRO: Memory Optimizations
Toward Training Trillion Parameter Models, by Samyam Rajbhandari, Jeff Rasley, Olatunji Ruwase, Yuxiong He
<https://arxiv.org/abs/1910.02054>`__.
This provided support is new and experimental as of this writing.
Installation Notes
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
As of this writing, both FairScale and Deepspeed require compilation of CUDA C++ code, before they can be used.
While all installation issues should be dealt with through the corresponding GitHub Issues of `FairScale
<https://github.com/facebookresearch/fairscale/issues>`__ and `Deepspeed
<https://github.com/microsoft/DeepSpeed/issues>`__, there are a few common issues that one may encounter while building
any PyTorch extension that needs to build CUDA extensions.
Therefore, if you encounter a CUDA-related build issue while doing one of the following or both:
.. code-block:: bash
pip install fairscale
pip install deepspeed
please, read the following notes first.
In these notes we give examples for what to do when ``pytorch`` has been built with CUDA ``10.2``. If your situation is
different remember to adjust the version number to the one you are after.
**Possible problem #1:**
While, Pytorch comes with its own CUDA toolkit, to build these two projects you must have an identical version of CUDA
installed system-wide.
For example, if you installed ``pytorch`` with ``cudatoolkit==10.2`` in the Python environment, you also need to have
CUDA ``10.2`` installed system-wide.
The exact location may vary from system to system, but ``/usr/local/cuda-10.2`` is the most common location on many
Unix systems. When CUDA is correctly set up and added to the ``PATH`` environment variable, one can find the
installation location by doing:
.. code-block:: bash
which nvcc
If you don't have CUDA installed system-wide, install it first. You will find the instructions by using your favorite
search engine. For example, if you're on Ubuntu you may want to search for: `ubuntu cuda 10.2 install
<https://www.google.com/search?q=ubuntu+cuda+10.2+install>`__.
**Possible problem #2:**
Another possible common problem is that you may have more than one CUDA toolkit installed system-wide. For example you
may have:
.. code-block:: bash
/usr/local/cuda-10.2
/usr/local/cuda-11.0
Now, in this situation you need to make sure that your ``PATH`` and ``LD_LIBRARY_PATH`` environment variables contain
the correct paths to the desired CUDA version. Typically, package installers will set these to contain whatever the
last version was installed. If you encounter the problem, where the package build fails because it can't find the right
CUDA version despite you having it installed system-wide, it means that you need to adjust the 2 aforementioned
environment variables.
First, you may look at their contents:
.. code-block:: bash
echo $PATH
echo $LD_LIBRARY_PATH
so you get an idea of what is inside.
It's possible that ``LD_LIBRARY_PATH`` is empty.
``PATH`` lists the locations of where executables can be found and ``LD_LIBRARY_PATH`` is for where shared libraries
are to looked for. In both cases, earlier entries have priority over the later ones. ``:`` is used to separate multiple
entries.
Now, to tell the build program where to find the specific CUDA toolkit, insert the desired paths to be listed first by
doing:
.. code-block:: bash
export PATH=/usr/local/cuda-10.2/bin:$PATH
export LD_LIBRARY_PATH=/usr/local/cuda-10.2/lib64:$LD_LIBRARY_PATH
Note that we aren't overwriting the existing values, but prepending instead.
Of course, adjust the version number, the full path if need be. Check that the directories you assign actually do
exist. ``lib64`` sub-directory is where the various CUDA ``.so`` objects, like ``libcudart.so`` reside, it's unlikely
that your system will have it named differently, but if it is adjust it to reflect your reality.
**Possible problem #3:**
Some older CUDA versions may refuse to build with newer compilers. For example, you my have ``gcc-9`` but it wants
``gcc-7``.
There are various ways to go about it.
If you can install the latest CUDA toolkit it typically should support the newer compiler.
Alternatively, you could install the lower version of the compiler in addition to the one you already have, or you may
already have it but it's not the default one, so the build system can't see it. If you have ``gcc-7`` installed but the
build system complains it can't find it, the following might do the trick:
.. code-block:: bash
sudo ln -s /usr/bin/gcc-7 /usr/local/cuda-10.2/bin/gcc
sudo ln -s /usr/bin/g++-7 /usr/local/cuda-10.2/bin/g++
Here, we are making a symlink to ``gcc-7`` from ``/usr/local/cuda-10.2/bin/gcc`` and since
``/usr/local/cuda-10.2/bin/`` should be in the ``PATH`` environment variable (see the previous problem's solution), it
should find ``gcc-7`` (and ``g++7``) and then the build will succeed.
As always make sure to edit the paths in the example to match your situation.
**If still unsuccessful:**
If after addressing these you still encounter build issues, please, proceed with the GitHub Issue of `FairScale
<https://github.com/facebookresearch/fairscale/issues>`__ and `Deepspeed
<https://github.com/microsoft/DeepSpeed/issues>`__, depending on the project you have the problem with.
FairScale
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
By integrating `FairScale <https://github.com/facebookresearch/fairscale/>`__ the :class:`~transformers.Trainer`
provides support for the following features from `the ZeRO paper <https://arxiv.org/abs/1910.02054>`__:
1. Optimizer State Sharding
2. Gradient Sharding
You will need at least two GPUs to use this feature.
To deploy this feature:
1. Install the library via pypi:
.. code-block:: bash
pip install fairscale
or find more details on `the FairScale's GitHub page
<https://github.com/facebookresearch/fairscale/#installation>`__.
2. Add ``--sharded_ddp`` to the command line arguments, and make sure you have added the distributed launcher ``-m
torch.distributed.launch --nproc_per_node=NUMBER_OF_GPUS_YOU_HAVE`` if you haven't been using it already.
For example here is how you could use it for ``finetune_trainer.py`` with 2 GPUs:
.. code-block:: bash
cd examples/seq2seq
python -m torch.distributed.launch --nproc_per_node=2 ./finetune_trainer.py \
--model_name_or_path sshleifer/distill-mbart-en-ro-12-4 --data_dir wmt_en_ro \
--output_dir output_dir --overwrite_output_dir \
--do_train --n_train 500 --num_train_epochs 1 \
--per_device_train_batch_size 1 --freeze_embeds \
--src_lang en_XX --tgt_lang ro_RO --task translation \
--fp16 --sharded_ddp
Notes:
- This feature requires distributed training (so multiple GPUs).
- It is not implemented for TPUs.
- It works with ``--fp16`` too, to make things even faster.
- One of the main benefits of enabling ``--sharded_ddp`` is that it uses a lot less GPU memory, so you should be able
to use significantly larger batch sizes using the same hardware (e.g. 3x and even bigger) which should lead to
significantly shorter training time.
DeepSpeed
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
`DeepSpeed <https://github.com/microsoft/DeepSpeed>`__ implements everything described in the `ZeRO paper
<https://arxiv.org/abs/1910.02054>`__, except ZeRO's stage 3. "Parameter Partitioning (Pos+g+p)". Currently it provides
full support for:
1. Optimizer State Partitioning (ZeRO stage 1)
2. Add Gradient Partitioning (ZeRO stage 2)
Installation
=======================================================================================================================
Install the library via pypi:
.. code-block:: bash
pip install deepspeed
or find more details on `the DeepSpeed's GitHub page <https://github.com/microsoft/deepspeed#installation>`__.
Deployment with multiple GPUs
=======================================================================================================================
To deploy this feature with multiple GPUs adjust the :class:`~transformers.Trainer` command line arguments as
following:
1. replace ``python -m torch.distributed.launch`` with ``deepspeed``.
2. add a new argument ``--deepspeed ds_config.json``, where ``ds_config.json`` is the DeepSpeed configuration file as
documented `here <https://www.deepspeed.ai/docs/config-json/>`__. The file naming is up to you.
Therefore, if your original command line looked as following:
.. code-block:: bash
python -m torch.distributed.launch --nproc_per_node=2 your_program.py <normal cl args>
Now it should be:
.. code-block:: bash
deepspeed --num_gpus=2 your_program.py <normal cl args> --deepspeed ds_config.json
Unlike, ``torch.distributed.launch`` where you have to specify how many GPUs to use with ``--nproc_per_node``, with the
``deepspeed`` launcher you don't have to use the corresponding ``--num_gpus`` if you want all of your GPUs used. The
full details on how to configure various nodes and GPUs can be found `here
<https://www.deepspeed.ai/getting-started/#resource-configuration-multi-node>`__.
Here is an example of running ``finetune_trainer.py`` under DeepSpeed deploying all available GPUs:
.. code-block:: bash
cd examples/seq2seq
deepspeed ./finetune_trainer.py --deepspeed ds_config.json \
--model_name_or_path sshleifer/distill-mbart-en-ro-12-4 --data_dir wmt_en_ro \
--output_dir output_dir --overwrite_output_dir \
--do_train --n_train 500 --num_train_epochs 1 \
--per_device_train_batch_size 1 --freeze_embeds \
--src_lang en_XX --tgt_lang ro_RO --task translation
Note that in the DeepSpeed documentation you are likely to see ``--deepspeed --deepspeed_config ds_config.json`` - i.e.
two DeepSpeed-related arguments, but for the sake of simplicity, and since there are already so many arguments to deal
with, we combined the two into a single argument.
For some practical usage examples, please, see this `post
<https://github.com/huggingface/transformers/issues/8771#issuecomment-759248400>`__.
Deployment with one GPU
=======================================================================================================================
To deploy DeepSpeed with one GPU adjust the :class:`~transformers.Trainer` command line arguments as following:
.. code-block:: bash
cd examples/seq2seq
deepspeed --num_gpus=1 ./finetune_trainer.py --deepspeed ds_config.json \
--model_name_or_path sshleifer/distill-mbart-en-ro-12-4 --data_dir wmt_en_ro \
--output_dir output_dir --overwrite_output_dir \
--do_train --n_train 500 --num_train_epochs 1 \
--per_device_train_batch_size 1 --freeze_embeds \
--src_lang en_XX --tgt_lang ro_RO --task translation
This is almost the same as with multiple-GPUs, but here we tell DeepSpeed explicitly to use just one GPU. By default,
DeepSpeed deploys all GPUs it can see. If you have only 1 GPU to start with, then you don't need this argument. The
following `documentation <https://www.deepspeed.ai/getting-started/#resource-configuration-multi-node>`__ discusses the
launcher options.
Why would you want to use DeepSpeed with just one GPU?
1. It has a ZeRO-offload feature which can delegate some computations and memory to the host's CPU and RAM, and thus
leave more GPU resources for model's needs - e.g. larger batch size, or enabling a fitting of a very big model which
normally won't fit.
2. It provides a smart GPU memory management system, that minimizes memory fragmentation, which again allows you to fit
bigger models and data batches.
While we are going to discuss the configuration in details next, the key to getting a huge improvement on a single GPU
with DeepSpeed is to have at least the following configuration in the configuration file:
.. code-block:: json
{
"zero_optimization": {
"stage": 2,
"allgather_partitions": true,
"allgather_bucket_size": 2e8,
"reduce_scatter": true,
"reduce_bucket_size": 2e8,
"overlap_comm": true,
"contiguous_gradients": true,
"cpu_offload": true
},
}
which enables ``cpu_offload`` and some other important features. You may experiment with the buffer sizes, you will
find more details in the discussion below.
For a practical usage example of this type of deployment, please, see this `post
<https://github.com/huggingface/transformers/issues/8771#issuecomment-759176685>`__.
Configuration
=======================================================================================================================
For the complete guide to the DeepSpeed configuration options that can be used in its configuration file please refer
to the `following documentation <https://www.deepspeed.ai/docs/config-json/>`__.
While you always have to supply the DeepSpeed configuration file, you can configure the DeepSpeed integration in
several ways:
1. Supply most of the configuration inside the file, and just use a few required command line arguments. This is the
recommended way as it puts most of the configuration params in one place.
2. Supply just the ZeRO configuration params inside the file, and configure the rest using the normal
:class:`~transformers.Trainer` command line arguments.
3. Any variation of the first two ways.
To get an idea of what DeepSpeed configuration file looks like, here is one that activates ZeRO stage 2 features,
enables FP16, uses AdamW optimizer and WarmupLR scheduler:
.. code-block:: json
{
"fp16": {
"enabled": true,
"loss_scale": 0,
"loss_scale_window": 1000,
"hysteresis": 2,
"min_loss_scale": 1
},
"zero_optimization": {
"stage": 2,
"allgather_partitions": true,
"allgather_bucket_size": 5e8,
"overlap_comm": true,
"reduce_scatter": true,
"reduce_bucket_size": 5e8,
"contiguous_gradients": true,
"cpu_offload": true
},
"optimizer": {
"type": "AdamW",
"params": {
"lr": 3e-5,
"betas": [ 0.8, 0.999 ],
"eps": 1e-8,
"weight_decay": 3e-7
}
},
"zero_allow_untested_optimizer": true,
"scheduler": {
"type": "WarmupLR",
"params": {
"warmup_min_lr": 0,
"warmup_max_lr": 3e-5,
"warmup_num_steps": 500
}
}
}
If you already have a command line that you have been using with :class:`transformers.Trainer` args, you can continue
using those and the :class:`~transformers.Trainer` will automatically convert them into the corresponding DeepSpeed
configuration at run time. For example, you could use the following configuration file:
.. code-block:: json
{
"zero_optimization": {
"stage": 2,
"allgather_partitions": true,
"allgather_bucket_size": 5e8,
"overlap_comm": true,
"reduce_scatter": true,
"reduce_bucket_size": 5e8,
"contiguous_gradients": true,
"cpu_offload": true
}
}
and the following command line arguments:
.. code-block:: bash
--learning_rate 3e-5 --warmup_steps 500 --adam_beta1 0.8 --adam_beta2 0.999 --adam_epsilon 1e-8 \
--weight_decay 3e-7 --lr_scheduler_type constant_with_warmup --fp16 --fp16_backend amp
to achieve the same configuration as provided by the longer json file in the first example.
When you execute the program, DeepSpeed will log the configuration it received from the :class:`~transformers.Trainer`
to the console, so you can see exactly what the final configuration was passed to it.
Shared Configuration
=======================================================================================================================
Some configuration information is required by both the :class:`~transformers.Trainer` and DeepSpeed to function
correctly, therefore, to prevent conflicting definitions, which could lead to hard to detect errors, we chose to
configure those via the :class:`~transformers.Trainer` command line arguments.
Therefore, the following DeepSpeed configuration params shouldn't be used with the :class:`~transformers.Trainer`:
* ``train_batch_size``
* ``train_micro_batch_size_per_gpu``
* ``gradient_accumulation_steps``
as these will be automatically derived from the run time environment and the following 2 command line arguments:
.. code-block:: bash
--per_device_train_batch_size 8 --gradient_accumulation_steps 2
which are always required to be supplied.
Of course, you will need to adjust the values in this example to your situation.
ZeRO
=======================================================================================================================
The ``zero_optimization`` section of the configuration file is the most important part (`docs
<https://www.deepspeed.ai/docs/config-json/#zero-optimizations-for-fp16-training>`__), since that is where you define
which ZeRO stages you want to enable and how to configure them.
.. code-block:: json
{
"zero_optimization": {
"stage": 2,
"allgather_partitions": true,
"allgather_bucket_size": 5e8,
"overlap_comm": true,
"reduce_scatter": true,
"reduce_bucket_size": 5e8,
"contiguous_gradients": true,
"cpu_offload": true
}
}
Notes:
- enabling ``cpu_offload`` should reduce GPU RAM usage (it requires ``"stage": 2``)
- ``"overlap_comm": true`` trades off increased GPU RAM usage to lower all-reduce latency. ``overlap_comm`` uses 4.5x
the ``allgather_bucket_size`` and ``reduce_bucket_size`` values. So if they are set to 5e8, this requires a 9GB
footprint (``5e8 x 2Bytes x 2 x 4.5``). Therefore, if you have a GPU with 8GB or less RAM, to avoid getting
OOM-errors you will need to reduce those parameters to about ``2e8``, which would require 3.6GB.
This section has to be configured exclusively via DeepSpeed configuration - the :class:`~transformers.Trainer` provides
no equivalent command line arguments.
Optimizer
=======================================================================================================================
DeepSpeed's main optimizers are Adam, OneBitAdam, and Lamb. These have been thoroughly tested with ZeRO and are thus
recommended to be used. It, however, can import other optimizers from ``torch``. The full documentation is `here
<https://www.deepspeed.ai/docs/config-json/#optimizer-parameters>`__.
If you don't configure the ``optimizer`` entry in the configuration file, the :class:`~transformers.Trainer` will
automatically set it to ``AdamW`` and will use the supplied values or the defaults for the following command line
arguments: ``--learning_rate``, ``--adam_beta1``, ``--adam_beta2``, ``--adam_epsilon`` and ``--weight_decay``.
Here is an example of the pre-configured ``optimizer`` entry for AdamW:
.. code-block:: json
{
"zero_allow_untested_optimizer": true,
"optimizer": {
"type": "AdamW",
"params": {
"lr": 0.001,
"betas": [0.8, 0.999],
"eps": 1e-8,
"weight_decay": 3e-7
}
}
}
Since AdamW isn't on the list of tested with DeepSpeed/ZeRO optimizers, we have to add
``zero_allow_untested_optimizer`` flag.
If you want to use one of the officially supported optimizers, configure them explicitly in the configuration file, and
make sure to adjust the values. e.g. if use Adam you will want ``weight_decay`` around ``0.01``.
Scheduler
=======================================================================================================================
DeepSpeed supports LRRangeTest, OneCycle, WarmupLR and WarmupDecayLR LR schedulers. The full documentation is `here
<https://www.deepspeed.ai/docs/config-json/#scheduler-parameters>`__.
If you don't configure the ``scheduler`` entry in the configuration file, the :class:`~transformers.Trainer` will use
the value of ``--lr_scheduler_type`` to configure it. Currently the :class:`~transformers.Trainer` supports only 2 LR
schedulers that are also supported by DeepSpeed:
* ``WarmupLR`` via ``--lr_scheduler_type constant_with_warmup``
* ``WarmupDecayLR`` via ``--lr_scheduler_type linear``. This is also the default value for ``--lr_scheduler_type``,
therefore, if you don't configure the scheduler this is scheduler that will get configured by default.
In either case, the values of ``--learning_rate`` and ``--warmup_steps`` will be used for the configuration.
In other words, if you don't use the configuration file to set the ``scheduler`` entry, provide either:
.. code-block:: bash
--lr_scheduler_type constant_with_warmup --learning_rate 3e-5 --warmup_steps 500
or
.. code-block:: bash
--lr_scheduler_type linear --learning_rate 3e-5 --warmup_steps 500
with the desired values. If you don't pass these arguments, reasonable default values will be used instead.
In the case of WarmupDecayLR ``total_num_steps`` gets set either via the ``--max_steps`` command line argument, or if
it is not provided, derived automatically at run time based on the environment and the size of the dataset and other
command line arguments.
Here is an example of the pre-configured ``scheduler`` entry for WarmupLR (``constant_with_warmup`` in the
:class:`~transformers.Trainer` API):
.. code-block:: json
{
"scheduler": {
"type": "WarmupLR",
"params": {
"warmup_min_lr": 0,
"warmup_max_lr": 0.001,
"warmup_num_steps": 1000
}
}
}
Automatic Mixed Precision
=======================================================================================================================
You can work with FP16 in one of the following ways:
1. Pytorch native amp, as documented `here <https://www.deepspeed.ai/docs/config-json/#fp16-training-options>`__.
2. NVIDIA's apex, as documented `here
<https://www.deepspeed.ai/docs/config-json/#automatic-mixed-precision-amp-training-options>`__.
If you want to use an equivalent of the Pytorch native amp, you can either configure the ``fp16`` entry in the
configuration file, or use the following command line arguments: ``--fp16 --fp16_backend amp``.
Here is an example of the ``fp16`` configuration:
.. code-block:: json
{
"fp16": {
"enabled": true,
"loss_scale": 0,
"loss_scale_window": 1000,
"hysteresis": 2,
"min_loss_scale": 1
},
}
If you want to use NVIDIA's apex instead, you can can either configure the ``amp`` entry in the configuration file, or
use the following command line arguments: ``--fp16 --fp16_backend apex --fp16_opt_level 01``.
Here is an example of the ``amp`` configuration:
.. code-block:: json
{
"amp": {
"enabled": true,
"opt_level": "O1"
}
}
Gradient Clipping
=======================================================================================================================
If you don't configure the ``gradient_clipping`` entry in the configuration file, the :class:`~transformers.Trainer`
will use the value of the ``--max_grad_norm`` command line argument to set it.
Here is an example of the ``gradient_clipping`` configuration:
.. code-block:: json
{
"gradient_clipping": 1.0,
}
Notes
=======================================================================================================================
* DeepSpeed works with the PyTorch :class:`~transformers.Trainer` but not TF :class:`~transformers.TFTrainer`.
* While DeepSpeed has a pip installable PyPI package, it is highly recommended that it gets installed from `source
<https://github.com/microsoft/deepspeed#installation>`__ to best match your hardware and also if you need to enable
certain features, like 1-bit Adam, which aren't available in the pypi distribution.
* You don't have to use the :class:`~transformers.Trainer` to use DeepSpeed with HuggingFace ``transformers`` - you can
use any model with your own trainer, and you will have to adapt the latter according to `the DeepSpeed integration
instructions <https://www.deepspeed.ai/getting-started/#writing-deepspeed-models>`__.
Main DeepSpeed Resources
=======================================================================================================================
- `Project's github <https://github.com/microsoft/deepspeed>`__
- `Usage docs <https://www.deepspeed.ai/getting-started/>`__
- `API docs <https://deepspeed.readthedocs.io/en/latest/index.html>`__
- `Blog posts <https://www.microsoft.com/en-us/research/search/?q=deepspeed>`__
Finally, please, remember that, HuggingFace :class:`~transformers.Trainer` only integrates DeepSpeed, therefore if you
have any problems or questions with regards to DeepSpeed usage, please, file an issue with `DeepSpeed GitHub
<https://github.com/microsoft/DeepSpeed/issues>`__.

View File

@@ -42,7 +42,7 @@ Examples
_______________________________________________________________________________________________________________________
- Examples and scripts for fine-tuning BART and other models for sequence to sequence tasks can be found in
`examples/seq2seq/ <https://github.com/huggingface/transformers/blob/master/examples/seq2seq/README.md>`__.
:prefix_link:`examples/seq2seq/ <examples/seq2seq/README.md>`.
- An example of how to train :class:`~transformers.BartForConditionalGeneration` with a Hugging Face :obj:`datasets`
object can be found in this `forum discussion
<https://discuss.huggingface.co/t/train-bart-for-conditional-generation-e-g-summarization/1904>`__.
@@ -55,9 +55,8 @@ Implementation Notes
- Bart doesn't use :obj:`token_type_ids` for sequence classification. Use :class:`~transformers.BartTokenizer` or
:meth:`~transformers.BartTokenizer.encode` to get the proper splitting.
- The forward pass of :class:`~transformers.BartModel` will create decoder inputs (using the helper function
:func:`transformers.models.bart.modeling_bart._prepare_bart_decoder_inputs`) if they are not passed. This is
different than some other modeling APIs.
- The forward pass of :class:`~transformers.BartModel` will create the ``decoder_input_ids`` if they are not passed.
This is different than some other modeling APIs. A typical use case of this feature is mask filling.
- Model predictions are intended to be identical to the original implementation when
:obj:`force_bos_token_to_be_generated=True`. This only works, however, if the string you pass to
:func:`fairseq.encode` starts with a space.
@@ -65,7 +64,6 @@ Implementation Notes
summarization, see the example in that docstrings.
- Models that load the `facebook/bart-large-cnn` weights will not have a :obj:`mask_token_id`, or be able to perform
mask-filling tasks.
- For training/forward passes that don't involve beam search, pass :obj:`use_cache=False`.
Mask Filling
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
@@ -132,6 +130,12 @@ BartForQuestionAnswering
.. autoclass:: transformers.BartForQuestionAnswering
:members: forward
BartForCausalLM
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
.. autoclass:: transformers.BartForCausalLM
:members: forward
TFBartModel

View File

@@ -41,8 +41,8 @@ The Authors' code can be found `here <https://github.com/moussaKam/BARThez>`__.
Examples
_______________________________________________________________________________________________________________________
- BARThez can be fine-tuned on sequence-to-sequence tasks in a similar way as BART, check: `examples/seq2seq/
<https://github.com/huggingface/transformers/blob/master/examples/seq2seq/README.md>`__.
- BARThez can be fine-tuned on sequence-to-sequence tasks in a similar way as BART, check:
:prefix_link:`examples/seq2seq/ <examples/seq2seq/README.md>`.
BarthezTokenizer

View File

@@ -0,0 +1,64 @@
..
Copyright 2020 The HuggingFace Team. All rights reserved.
Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with
the License. You may obtain a copy of the License at
http://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on
an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the
specific language governing permissions and limitations under the License.
Bertweet
-----------------------------------------------------------------------------------------------------------------------
Overview
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
The BERTweet model was proposed in `BERTweet: A pre-trained language model for English Tweets
<https://www.aclweb.org/anthology/2020.emnlp-demos.2.pdf>`__ by Dat Quoc Nguyen, Thanh Vu, Anh Tuan Nguyen.
The abstract from the paper is the following:
*We present BERTweet, the first public large-scale pre-trained language model for English Tweets. Our BERTweet, having
the same architecture as BERT-base (Devlin et al., 2019), is trained using the RoBERTa pre-training procedure (Liu et
al., 2019). Experiments show that BERTweet outperforms strong baselines RoBERTa-base and XLM-R-base (Conneau et al.,
2020), producing better performance results than the previous state-of-the-art models on three Tweet NLP tasks:
Part-of-speech tagging, Named-entity recognition and text classification.*
Example of use:
.. code-block::
import torch
from transformers import AutoModel, AutoTokenizer
bertweet = AutoModel.from_pretrained("vinai/bertweet-base")
# For transformers v4.x+:
tokenizer = AutoTokenizer.from_pretrained("vinai/bertweet-base", use_fast=False)
# For transformers v3.x:
# tokenizer = AutoTokenizer.from_pretrained("vinai/bertweet-base")
# INPUT TWEET IS ALREADY NORMALIZED!
line = "SC has first two presumptive cases of coronavirus , DHEC confirms HTTPURL via @USER :cry:"
input_ids = torch.tensor([tokenizer.encode(line)])
with torch.no_grad():
features = bertweet(input_ids) # Models outputs are now tuples
## With TensorFlow 2.0+:
# from transformers import TFAutoModel
# bertweet = TFAutoModel.from_pretrained("vinai/bertweet-base")
The original code can be found `here <https://github.com/VinAIResearch/BERTweet>`__.
BertweetTokenizer
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
.. autoclass:: transformers.BertweetTokenizer
:members:

View File

@@ -43,13 +43,10 @@ Implementation Notes
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
- Blenderbot uses a standard `seq2seq model transformer <https://arxiv.org/pdf/1706.03762.pdf>`__ based architecture.
- It inherits completely from :class:`~transformers.BartForConditionalGeneration`
- Even though blenderbot is one model, it uses two tokenizers :class:`~transformers.BlenderbotSmallTokenizer` for 90M
checkpoint and :class:`~transformers.BlenderbotTokenizer` for all other checkpoints.
- :class:`~transformers.BlenderbotSmallTokenizer` will always return :class:`~transformers.BlenderbotSmallTokenizer`,
regardless of checkpoint. To use the 3B parameter checkpoint, you must call
:class:`~transformers.BlenderbotTokenizer` directly.
- Available checkpoints can be found in the `model hub <https://huggingface.co/models?search=blenderbot>`__.
- This is the `default` Blenderbot model class. However, some smaller checkpoints, such as
``facebook/blenderbot_small_90M``, have a different architecture and consequently should be used with
`BlenderbotSmall <https://huggingface.co/transformers/master/model_doc/blenderbot_small.html>`__.
Usage
@@ -59,26 +56,15 @@ Here is an example of model usage:
.. code-block::
>>> from transformers import BlenderbotSmallTokenizer, BlenderbotForConditionalGeneration
>>> mname = 'facebook/blenderbot-90M'
>>> from transformers import BlenderbotTokenizer, BlenderbotForConditionalGeneration
>>> mname = 'facebook/blenderbot-400M-distill'
>>> model = BlenderbotForConditionalGeneration.from_pretrained(mname)
>>> tokenizer = BlenderbotSmallTokenizer.from_pretrained(mname)
>>> tokenizer = BlenderbotTokenizer.from_pretrained(mname)
>>> UTTERANCE = "My friends are cool but they eat too many carbs."
>>> inputs = tokenizer([UTTERANCE], return_tensors='pt')
>>> reply_ids = model.generate(**inputs)
>>> print([tokenizer.decode(g, skip_special_tokens=True, clean_up_tokenization_spaces=False) for g in reply_ids])
Here is how you can check out config values:
.. code-block::
>>> from transformers import BlenderbotConfig
>>> config_90 = BlenderbotConfig.from_pretrained("facebook/blenderbot-90M")
>>> config_90.to_diff_dict() # show interesting Values.
>>> configuration_3B = BlenderbotConfig("facebook/blenderbot-3B")
>>> configuration_3B.to_diff_dict()
>>> print(tokenizer.batch_decode(reply_ids))
["<s> That's unfortunate. Are they trying to lose weight or are they just trying to be healthier?</s>"]
BlenderbotConfig
@@ -93,11 +79,14 @@ BlenderbotTokenizer
.. autoclass:: transformers.BlenderbotTokenizer
:members: build_inputs_with_special_tokens
BlenderbotSmallTokenizer
BlenderbotModel
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
.. autoclass:: transformers.BlenderbotSmallTokenizer
:members:
See :obj:`transformers.BartModel` for arguments to `forward` and `generate`
.. autoclass:: transformers.BlenderbotModel
:members: forward
BlenderbotForConditionalGeneration
@@ -106,13 +95,25 @@ BlenderbotForConditionalGeneration
See :obj:`transformers.BartForConditionalGeneration` for arguments to `forward` and `generate`
.. autoclass:: transformers.BlenderbotForConditionalGeneration
:members:
:members: forward
BlenderbotForCausalLM
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
.. autoclass:: transformers.BlenderbotForCausalLM
:members: forward
TFBlenderbotModel
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
.. autoclass:: transformers.TFBlenderbotModel
:members: call
TFBlenderbotForConditionalGeneration
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
See :obj:`transformers.TFBartForConditionalGeneration` for arguments to `forward` and `generate`
.. autoclass:: transformers.TFBlenderbotForConditionalGeneration
:members:
:members: call

View File

@@ -0,0 +1,91 @@
..
Copyright 2020 The HuggingFace Team. All rights reserved.
Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with
the License. You may obtain a copy of the License at
http://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on
an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the
specific language governing permissions and limitations under the License.
Blenderbot Small
-----------------------------------------------------------------------------------------------------------------------
Note that :class:`~transformers.BlenderbotSmallModel` and
:class:`~transformers.BlenderbotSmallForConditionalGeneration` are only used in combination with the checkpoint
`facebook/blenderbot-90M <https://huggingface.co/facebook/blenderbot-90M>`__. Larger Blenderbot checkpoints should
instead be used with :class:`~transformers.BlenderbotModel` and
:class:`~transformers.BlenderbotForConditionalGeneration`
Overview
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
The Blender chatbot model was proposed in `Recipes for building an open-domain chatbot
<https://arxiv.org/pdf/2004.13637.pdf>`__ Stephen Roller, Emily Dinan, Naman Goyal, Da Ju, Mary Williamson, Yinhan Liu,
Jing Xu, Myle Ott, Kurt Shuster, Eric M. Smith, Y-Lan Boureau, Jason Weston on 30 Apr 2020.
The abstract of the paper is the following:
*Building open-domain chatbots is a challenging area for machine learning research. While prior work has shown that
scaling neural models in the number of parameters and the size of the data they are trained on gives improved results,
we show that other ingredients are important for a high-performing chatbot. Good conversation requires a number of
skills that an expert conversationalist blends in a seamless way: providing engaging talking points and listening to
their partners, and displaying knowledge, empathy and personality appropriately, while maintaining a consistent
persona. We show that large scale models can learn these skills when given appropriate training data and choice of
generation strategy. We build variants of these recipes with 90M, 2.7B and 9.4B parameter models, and make our models
and code publicly available. Human evaluations show our best models are superior to existing approaches in multi-turn
dialogue in terms of engagingness and humanness measurements. We then discuss the limitations of this work by analyzing
failure cases of our models.*
The authors' code can be found `here <https://github.com/facebookresearch/ParlAI>`__ .
BlenderbotSmallConfig
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
.. autoclass:: transformers.BlenderbotSmallConfig
:members:
BlenderbotSmallTokenizer
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
.. autoclass:: transformers.BlenderbotSmallTokenizer
:members: build_inputs_with_special_tokens, get_special_tokens_mask,
create_token_type_ids_from_sequences, save_vocabulary
BlenderbotSmallModel
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
.. autoclass:: transformers.BlenderbotSmallModel
:members: forward
BlenderbotSmallForConditionalGeneration
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
.. autoclass:: transformers.BlenderbotSmallForConditionalGeneration
:members: forward
BlenderbotSmallForCausalLM
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
.. autoclass:: transformers.BlenderbotSmallForCausalLM
:members: forward
TFBlenderbotSmallModel
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
.. autoclass:: transformers.TFBlenderbotSmallModel
:members: call
TFBlenderbotSmallForConditionalGeneration
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
.. autoclass:: transformers.TFBlenderbotSmallForConditionalGeneration
:members: call

View File

@@ -0,0 +1,46 @@
..
Copyright 2020 The HuggingFace Team. All rights reserved.
Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with
the License. You may obtain a copy of the License at
http://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on
an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the
specific language governing permissions and limitations under the License.
BORT
-----------------------------------------------------------------------------------------------------------------------
Overview
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
The BORT model was proposed in `Optimal Subarchitecture Extraction for BERT <https://arxiv.org/abs/2010.10499>`__ by
Adrian de Wynter and Daniel J. Perry. It is an optimal subset of architectural parameters for the BERT, which the
authors refer to as "Bort".
The abstract from the paper is the following:
*We extract an optimal subset of architectural parameters for the BERT architecture from Devlin et al. (2018) by
applying recent breakthroughs in algorithms for neural architecture search. This optimal subset, which we refer to as
"Bort", is demonstrably smaller, having an effective (that is, not counting the embedding layer) size of 5.5% the
original BERT-large architecture, and 16% of the net size. Bort is also able to be pretrained in 288 GPU hours, which
is 1.2% of the time required to pretrain the highest-performing BERT parametric architectural variant, RoBERTa-large
(Liu et al., 2019), and about 33% of that of the world-record, in GPU hours, required to train BERT-large on the same
hardware. It is also 7.9x faster on a CPU, as well as being better performing than other compressed variants of the
architecture, and some of the non-compressed variants: it obtains performance improvements of between 0.3% and 31%,
absolute, with respect to BERT-large, on multiple public natural language understanding (NLU) benchmarks.*
Tips:
- BORT's model architecture is based on BERT, so one can refer to :doc:`BERT's documentation page <bert>` for the
model's API as well as usage examples.
- BORT uses the RoBERTa tokenizer instead of the BERT tokenizer, so one can refer to :doc:`RoBERTa's documentation page
<roberta>` for the tokenizer's API as well as usage examples.
- BORT requires a specific fine-tuning algorithm, called `Agora
<https://adewynter.github.io/notes/bort_algorithms_and_applications.html#fine-tuning-with-algebraic-topology>`__ ,
that is sadly not open-sourced yet. It would be very useful for the community, if someone tries to implement the
algorithm to make BORT fine-tuning work.
The original code can be found `here <https://github.com/alexa/bort/>`__.

View File

@@ -0,0 +1,144 @@
..
Copyright 2020 The HuggingFace Team. All rights reserved.
Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with
the License. You may obtain a copy of the License at
http://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on
an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the
specific language governing permissions and limitations under the License.
ConvBERT
-----------------------------------------------------------------------------------------------------------------------
Overview
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
The ConvBERT model was proposed in `ConvBERT: Improving BERT with Span-based Dynamic Convolution
<https://arxiv.org/abs/2008.02496>`__ by Zihang Jiang, Weihao Yu, Daquan Zhou, Yunpeng Chen, Jiashi Feng, Shuicheng
Yan.
The abstract from the paper is the following:
*Pre-trained language models like BERT and its variants have recently achieved impressive performance in various
natural language understanding tasks. However, BERT heavily relies on the global self-attention block and thus suffers
large memory footprint and computation cost. Although all its attention heads query on the whole input sequence for
generating the attention map from a global perspective, we observe some heads only need to learn local dependencies,
which means the existence of computation redundancy. We therefore propose a novel span-based dynamic convolution to
replace these self-attention heads to directly model local dependencies. The novel convolution heads, together with the
rest self-attention heads, form a new mixed attention block that is more efficient at both global and local context
learning. We equip BERT with this mixed attention design and build a ConvBERT model. Experiments have shown that
ConvBERT significantly outperforms BERT and its variants in various downstream tasks, with lower training cost and
fewer model parameters. Remarkably, ConvBERTbase model achieves 86.4 GLUE score, 0.7 higher than ELECTRAbase, while
using less than 1/4 training cost. Code and pre-trained models will be released.*
ConvBERT training tips are similar to those of BERT. The original implementation can be found here:
https://github.com/yitu-opensource/ConvBert
ConvBertConfig
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
.. autoclass:: transformers.ConvBertConfig
:members:
ConvBertTokenizer
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
.. autoclass:: transformers.ConvBertTokenizer
:members: build_inputs_with_special_tokens, get_special_tokens_mask,
create_token_type_ids_from_sequences, save_vocabulary
ConvBertTokenizerFast
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
.. autoclass:: transformers.ConvBertTokenizerFast
:members: build_inputs_with_special_tokens, get_special_tokens_mask,
create_token_type_ids_from_sequences, save_vocabulary
ConvBertModel
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
.. autoclass:: transformers.ConvBertModel
:members: forward
ConvBertForMaskedLM
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
.. autoclass:: transformers.ConvBertForMaskedLM
:members: forward
ConvBertForSequenceClassification
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
.. autoclass:: transformers.ConvBertForSequenceClassification
:members: forward
ConvBertForMultipleChoice
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
.. autoclass:: transformers.ConvBertForMultipleChoice
:members: forward
ConvBertForTokenClassification
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
.. autoclass:: transformers.ConvBertForTokenClassification
:members: forward
ConvBertForQuestionAnswering
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
.. autoclass:: transformers.ConvBertForQuestionAnswering
:members: forward
TFConvBertModel
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
.. autoclass:: transformers.TFConvBertModel
:members: call
TFConvBertForMaskedLM
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
.. autoclass:: transformers.TFConvBertForMaskedLM
:members: call
TFConvBertForSequenceClassification
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
.. autoclass:: transformers.TFConvBertForSequenceClassification
:members: call
TFConvBertForMultipleChoice
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
.. autoclass:: transformers.TFConvBertForMultipleChoice
:members: call
TFConvBertForTokenClassification
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
.. autoclass:: transformers.TFConvBertForTokenClassification
:members: call
TFConvBertForQuestionAnswering
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
.. autoclass:: transformers.TFConvBertForQuestionAnswering
:members: call

View File

@@ -97,3 +97,8 @@ TFCTRLLMHeadModel
.. autoclass:: transformers.TFCTRLLMHeadModel
:members: call
TFCTRLForSequenceClassification
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
.. autoclass:: transformers.TFCTRLForSequenceClassification
:members: call

View File

@@ -70,8 +70,29 @@ DebertaPreTrainedModel
:members:
DebertaForMaskedLM
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
.. autoclass:: transformers.DebertaForMaskedLM
:members:
DebertaForSequenceClassification
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
.. autoclass:: transformers.DebertaForSequenceClassification
:members:
DebertaForTokenClassification
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
.. autoclass:: transformers.DebertaForTokenClassification
:members:
DebertaForQuestionAnswering
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
.. autoclass:: transformers.DebertaForQuestionAnswering
:members:

View File

@@ -48,7 +48,6 @@ modeling. We first concatenate all dialog turns within a dialogue session into a
sequence length), ended by the end-of-text token.* For more information please confer to the original paper.
DialoGPT's architecture is based on the GPT2 model, so one can refer to GPT2's `docstring
<https://huggingface.co/transformers/model_doc/gpt2.html>`_.
DialoGPT's architecture is based on the GPT2 model, so one can refer to :doc:`GPT2's documentation page <gpt2>`.
The original code can be found `here <https://github.com/microsoft/DialoGPT>`_.

View File

@@ -0,0 +1,71 @@
..
Copyright 2020 The HuggingFace Team. All rights reserved.
Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with
the License. You may obtain a copy of the License at
http://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on
an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the
specific language governing permissions and limitations under the License.
herBERT
-----------------------------------------------------------------------------------------------------------------------
Overview
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
The herBERT model was proposed in `KLEJ: Comprehensive Benchmark for Polish Language Understanding
<https://www.aclweb.org/anthology/2020.acl-main.111.pdf>`__ by Piotr Rybak, Robert Mroczkowski, Janusz Tracz, and
Ireneusz Gawlik. It is a BERT-based Language Model trained on Polish Corpora using only MLM objective with dynamic
masking of whole words.
The abstract from the paper is the following:
*In recent years, a series of Transformer-based models unlocked major improvements in general natural language
understanding (NLU) tasks. Such a fast pace of research would not be possible without general NLU benchmarks, which
allow for a fair comparison of the proposed methods. However, such benchmarks are available only for a handful of
languages. To alleviate this issue, we introduce a comprehensive multi-task benchmark for the Polish language
understanding, accompanied by an online leaderboard. It consists of a diverse set of tasks, adopted from existing
datasets for named entity recognition, question-answering, textual entailment, and others. We also introduce a new
sentiment analysis task for the e-commerce domain, named Allegro Reviews (AR). To ensure a common evaluation scheme and
promote models that generalize to different NLU tasks, the benchmark includes datasets from varying domains and
applications. Additionally, we release HerBERT, a Transformer-based model trained specifically for the Polish language,
which has the best average performance and obtains the best results for three out of nine tasks. Finally, we provide an
extensive evaluation, including several standard baselines and recently proposed, multilingual Transformer-based
models.*
Examples of use:
.. code-block::
from transformers import HerbertTokenizer, RobertaModel
tokenizer = HerbertTokenizer.from_pretrained("allegro/herbert-klej-cased-tokenizer-v1")
model = RobertaModel.from_pretrained("allegro/herbert-klej-cased-v1")
encoded_input = tokenizer.encode("Kto ma lepszą sztukę, ma lepszy rząd to jasne.", return_tensors='pt')
outputs = model(encoded_input)
# HerBERT can also be loaded using AutoTokenizer and AutoModel:
import torch
from transformers import AutoModel, AutoTokenizer
tokenizer = AutoTokenizer.from_pretrained("allegro/herbert-klej-cased-tokenizer-v1")
model = AutoModel.from_pretrained("allegro/herbert-klej-cased-v1")
The original code can be found `here <https://github.com/allegro/HerBERT>`__.
HerbertTokenizer
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
.. autoclass:: transformers.HerbertTokenizer
:members:
HerbertTokenizerFast
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
.. autoclass:: transformers.HerbertTokenizerFast
:members:

View File

@@ -13,32 +13,72 @@
LayoutLM
-----------------------------------------------------------------------------------------------------------------------
.. _Overview:
Overview
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
The LayoutLM model was proposed in the paper `LayoutLM: Pre-training of Text and Layout for Document Image
Understanding <https://arxiv.org/abs/1912.13318>`__ by Yiheng Xu, Minghao Li, Lei Cui, Shaohan Huang, Furu Wei, and
Ming Zhou. It's a simple but effective pretraining method of text and layout for document image understanding and
information extraction tasks, such as form understanding and receipt understanding.
information extraction tasks, such as form understanding and receipt understanding. It obtains state-of-the-art results
on several downstream tasks:
- form understanding: the `FUNSD <https://guillaumejaume.github.io/FUNSD/>`__ dataset (a collection of 199 annotated
forms comprising more than 30,000 words).
- receipt understanding: the `SROIE <https://rrc.cvc.uab.es/?ch=13>`__ dataset (a collection of 626 receipts for
training and 347 receipts for testing).
- document image classification: the `RVL-CDIP <https://www.cs.cmu.edu/~aharley/rvl-cdip/>`__ dataset (a collection of
400,000 images belonging to one of 16 classes).
The abstract from the paper is the following:
*Pre-training techniques have been verified successfully in a variety of NLP tasks in recent years. Despite the
widespread use of pretraining models for NLP applications, they almost exclusively focus on text-level manipulation,
while neglecting layout and style information that is vital for document image understanding. In this paper, we propose
the \textbf{LayoutLM} to jointly model interactions between text and layout information across scanned document images,
which is beneficial for a great number of real-world document image understanding tasks such as information extraction
from scanned documents. Furthermore, we also leverage image features to incorporate words' visual information into
LayoutLM. To the best of our knowledge, this is the first time that text and layout are jointly learned in a single
framework for document-level pretraining. It achieves new state-of-the-art results in several downstream tasks,
including form understanding (from 70.72 to 79.27), receipt understanding (from 94.02 to 95.24) and document image
classification (from 93.07 to 94.42).*
the LayoutLM to jointly model interactions between text and layout information across scanned document images, which is
beneficial for a great number of real-world document image understanding tasks such as information extraction from
scanned documents. Furthermore, we also leverage image features to incorporate words' visual information into LayoutLM.
To the best of our knowledge, this is the first time that text and layout are jointly learned in a single framework for
document-level pretraining. It achieves new state-of-the-art results in several downstream tasks, including form
understanding (from 70.72 to 79.27), receipt understanding (from 94.02 to 95.24) and document image classification
(from 93.07 to 94.42).*
Tips:
- LayoutLM has an extra input called :obj:`bbox`, which is the bounding boxes of the input tokens.
- The :obj:`bbox` requires the data that on 0-1000 scale, which means you should normalize the bounding box before
passing them into model.
- In addition to `input_ids`, :meth:`~transformer.LayoutLMModel.forward` also expects the input :obj:`bbox`, which are
the bounding boxes (i.e. 2D-positions) of the input tokens. These can be obtained using an external OCR engine such
as Google's `Tesseract <https://github.com/tesseract-ocr/tesseract>`__ (there's a `Python wrapper
<https://pypi.org/project/pytesseract/>`__ available). Each bounding box should be in (x0, y0, x1, y1) format, where
(x0, y0) corresponds to the position of the upper left corner in the bounding box, and (x1, y1) represents the
position of the lower right corner. Note that one first needs to normalize the bounding boxes to be on a 0-1000
scale. To normalize, you can use the following function:
.. code-block::
def normalize_bbox(bbox, width, height):
return [
int(1000 * (bbox[0] / width)),
int(1000 * (bbox[1] / height)),
int(1000 * (bbox[2] / width)),
int(1000 * (bbox[3] / height)),
]
Here, :obj:`width` and :obj:`height` correspond to the width and height of the original document in which the token
occurs. Those can be obtained using the Python Image Library (PIL) library for example, as follows:
.. code-block::
from PIL import Image
image = Image.open("name_of_your_document - can be a png file, pdf, etc.")
width, height = image.size
- For a demo which shows how to fine-tune :class:`LayoutLMForTokenClassification` on the `FUNSD dataset
<https://guillaumejaume.github.io/FUNSD/>`__ (a collection of annotated forms), see `this notebook
<https://github.com/NielsRogge/Transformers-Tutorials/blob/master/LayoutLM/Fine_tuning_LayoutLMForTokenClassification_on_FUNSD.ipynb>`__.
It includes an inference part, which shows how to use Google's Tesseract on a new document.
The original code can be found `here <https://github.com/microsoft/unilm/tree/master/layoutlm>`_.
@@ -78,6 +118,13 @@ LayoutLMForMaskedLM
:members:
LayoutLMForSequenceClassification
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
.. autoclass:: transformers.LayoutLMForSequenceClassification
:members:
LayoutLMForTokenClassification
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

View File

@@ -0,0 +1,149 @@
..
Copyright 2020 The HuggingFace Team. All rights reserved.
Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with
the License. You may obtain a copy of the License at
http://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on
an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the
specific language governing permissions and limitations under the License.
LED
-----------------------------------------------------------------------------------------------------------------------
Overview
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
The LED model was proposed in `Longformer: The Long-Document Transformer <https://arxiv.org/abs/2004.05150>`__ by Iz
Beltagy, Matthew E. Peters, Arman Cohan.
The abstract from the paper is the following:
*Transformer-based models are unable to process long sequences due to their self-attention operation, which scales
quadratically with the sequence length. To address this limitation, we introduce the Longformer with an attention
mechanism that scales linearly with sequence length, making it easy to process documents of thousands of tokens or
longer. Longformer's attention mechanism is a drop-in replacement for the standard self-attention and combines a local
windowed attention with a task motivated global attention. Following prior work on long-sequence transformers, we
evaluate Longformer on character-level language modeling and achieve state-of-the-art results on text8 and enwik8. In
contrast to most prior work, we also pretrain Longformer and finetune it on a variety of downstream tasks. Our
pretrained Longformer consistently outperforms RoBERTa on long document tasks and sets new state-of-the-art results on
WikiHop and TriviaQA. We finally introduce the Longformer-Encoder-Decoder (LED), a Longformer variant for supporting
long document generative sequence-to-sequence tasks, and demonstrate its effectiveness on the arXiv summarization
dataset.*
Tips:
- :class:`~transformers.LEDForConditionalGeneration` is an extension of
:class:`~transformers.BartForConditionalGeneration` exchanging the traditional *self-attention* layer with
*Longformer*'s *chunked self-attention* layer. :class:`~transformers.LEDTokenizer` is an alias of
:class:`~transformers.BartTokenizer`.
- LED works very well on long-range *sequence-to-sequence* tasks where the ``input_ids`` largely exceed a length of
1024 tokens.
- LED pads the ``input_ids`` to be a multiple of ``config.attention_window`` if required. Therefore a small speed-up is
gained, when :class:`~transformers.LEDTokenizer` is used with the ``pad_to_multiple_of`` argument.
- LED makes use of *global attention* by means of the ``global_attention_mask`` (see
:class:`~transformers.LongformerModel`). For summarization, it is advised to put *global attention* only on the first
``<s>`` token. For question answering, it is advised to put *global attention* on all tokens of the question.
- To fine-tune LED on all 16384, it is necessary to enable *gradient checkpointing* by setting
``config.gradient_checkpointing = True``.
- A notebook showing how to evaluate LED, can be accessed `here
<https://colab.research.google.com/drive/12INTTR6n64TzS4RrXZxMSXfrOd9Xzamo?usp=sharing>`__.
- A notebook showing how to fine-tune LED, can be accessed `here
<https://colab.research.google.com/drive/12LjJazBl7Gam0XBPy_y0CTOJZeZ34c2v?usp=sharing>`__.
LEDConfig
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
.. autoclass:: transformers.LEDConfig
:members:
LEDTokenizer
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
.. autoclass:: transformers.LEDTokenizer
:members: build_inputs_with_special_tokens, get_special_tokens_mask,
create_token_type_ids_from_sequences, save_vocabulary
LEDTokenizerFast
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
.. autoclass:: transformers.LEDTokenizerFast
:members: build_inputs_with_special_tokens, get_special_tokens_mask,
create_token_type_ids_from_sequences, save_vocabulary
LED specific outputs
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
.. autoclass:: transformers.models.led.modeling_led.LEDEncoderBaseModelOutput
:members:
.. autoclass:: transformers.models.led.modeling_led.LEDSeq2SeqModelOutput
:members:
.. autoclass:: transformers.models.led.modeling_led.LEDSeq2SeqLMOutput
:members:
.. autoclass:: transformers.models.led.modeling_led.LEDSeq2SeqSequenceClassifierOutput
:members:
.. autoclass:: transformers.models.led.modeling_led.LEDSeq2SeqQuestionAnsweringModelOutput
:members:
.. autoclass:: transformers.models.led.modeling_tf_led.TFLEDEncoderBaseModelOutput
:members:
.. autoclass:: transformers.models.led.modeling_tf_led.TFLEDSeq2SeqModelOutput
:members:
.. autoclass:: transformers.models.led.modeling_tf_led.TFLEDSeq2SeqLMOutput
:members:
LEDModel
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
.. autoclass:: transformers.LEDModel
:members: forward
LEDForConditionalGeneration
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
.. autoclass:: transformers.LEDForConditionalGeneration
:members: forward
LEDForSequenceClassification
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
.. autoclass:: transformers.LEDForSequenceClassification
:members: forward
LEDForQuestionAnswering
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
.. autoclass:: transformers.LEDForQuestionAnswering
:members: forward
TFLEDModel
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
.. autoclass:: transformers.TFLEDModel
:members: call
TFLEDForConditionalGeneration
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
.. autoclass:: transformers.TFLEDForConditionalGeneration
:members: call

View File

@@ -33,7 +33,6 @@ Implementation Notes
- The modeling code is the same as :class:`~transformers.BartForConditionalGeneration` with a few minor modifications:
- static (sinusoid) positional embeddings (:obj:`MarianConfig.static_position_embeddings=True`)
- a new final_logits_bias (:obj:`MarianConfig.add_bias_logits=True`)
- no layernorm_embedding (:obj:`MarianConfig.normalize_embedding=False`)
- the model starts generating with :obj:`pad_token_id` (which has 0 as a token_embedding) as the prefix (Bart uses
:obj:`<s/>`),
@@ -56,12 +55,10 @@ Examples
- Since Marian models are smaller than many other translation models available in the library, they can be useful for
fine-tuning experiments and integration tests.
- `Fine-tune on TPU
<https://github.com/huggingface/transformers/blob/master/examples/seq2seq/builtin_trainer/train_distil_marian_enro_tpu.sh>`__
- `Fine-tune on GPU
<https://github.com/huggingface/transformers/blob/master/examples/seq2seq/builtin_trainer/train_distil_marian_enro.sh>`__
<https://github.com/huggingface/transformers/blob/master/examples/research_projects/seq2seq-distillation/train_distil_marian_enro_teacher.sh>`__
- `Fine-tune on GPU with pytorch-lightning
<https://github.com/huggingface/transformers/blob/master/examples/seq2seq/distil_marian_no_teacher.sh>`__
<https://github.com/huggingface/transformers/blob/master/examples/research_projects/seq2seq-distillation/train_distil_marian_no_teacher.sh>`__
Multilingual Models
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
@@ -182,13 +179,36 @@ MarianTokenizer
:members: prepare_seq2seq_batch
MarianModel
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
.. autoclass:: transformers.MarianModel
:members: forward
MarianMTModel
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
.. autoclass:: transformers.MarianMTModel
:members: forward
MarianForCausalLM
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
.. autoclass:: transformers.MarianForCausalLM
:members: forward
TFMarianModel
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
.. autoclass:: transformers.TFMarianModel
:members: call
TFMarianMTModel
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
.. autoclass:: transformers.TFMarianMTModel
:members: call

View File

@@ -35,7 +35,7 @@ Examples
_______________________________________________________________________________________________________________________
- Examples and scripts for fine-tuning mBART and other models for sequence to sequence tasks can be found in
`examples/seq2seq/ <https://github.com/huggingface/transformers/blob/master/examples/seq2seq/README.md>`__.
:prefix_link:`examples/seq2seq/ <examples/seq2seq/README.md>`.
- Given the large embeddings table, mBART consumes a large amount of GPU RAM, especially for fine-tuning.
:class:`MarianMTModel` is usually a better choice for bilingual machine translation.
@@ -97,6 +97,13 @@ MBartTokenizerFast
:members:
MBartModel
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
.. autoclass:: transformers.MBartModel
:members:
MBartForConditionalGeneration
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
@@ -104,8 +111,35 @@ MBartForConditionalGeneration
:members:
MBartForQuestionAnswering
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
.. autoclass:: transformers.MBartForQuestionAnswering
:members:
MBartForSequenceClassification
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
.. autoclass:: transformers.MBartForSequenceClassification
MBartForCausalLM
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
.. autoclass:: transformers.MBartForCausalLM
:members: forward
TFMBartModel
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
.. autoclass:: transformers.TFMBartModel
:members: call
TFMBartForConditionalGeneration
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
.. autoclass:: transformers.TFMBartForConditionalGeneration
:members:
:members: call

View File

@@ -51,9 +51,8 @@ All the `checkpoints <https://huggingface.co/models?search=pegasus>`__ are fine-
Examples
_______________________________________________________________________________________________________________________
- `Script <https://github.com/huggingface/transformers/blob/master/examples/seq2seq/finetune_pegasus_xsum.sh>`__ to
fine-tune pegasus on the XSUM dataset. Data download instructions at `examples/seq2seq/
<https://github.com/huggingface/transformers/blob/master/examples/seq2seq/README.md>`__.
- :prefix_link:`Script <examples/seq2seq/finetune_pegasus_xsum.sh>` to fine-tune pegasus on the XSUM dataset. Data
download instructions at :prefix_link:`examples/seq2seq/ <examples/seq2seq/README.md>`.
- FP16 is not supported (help/ideas on this appreciated!).
- The adafactor optimizer is recommended for pegasus fine-tuning.
@@ -66,7 +65,6 @@ Implementation Notes
- Some key configuration differences:
- static, sinusoidal position embeddings
- no :obj:`layernorm_embedding` (:obj:`PegasusConfig.normalize_embedding=False`)
- the model starts generating with pad_token_id (which has 0 token_embedding) as the prefix.
- more beams are used (:obj:`num_beams=8`)
- All pretrained pegasus checkpoints are the same besides three attributes: :obj:`tokenizer.model_max_length` (maximum
@@ -119,13 +117,36 @@ PegasusTokenizerFast
:members:
PegasusModel
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
.. autoclass:: transformers.PegasusModel
:members: forward
PegasusForConditionalGeneration
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
.. autoclass:: transformers.PegasusForConditionalGeneration
:members: forward
PegasusForCausalLM
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
.. autoclass:: transformers.PegasusForCausalLM
:members: forward
TFPegasusModel
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
.. autoclass:: transformers.TFPegasusModel
:members: call
TFPegasusForConditionalGeneration
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
.. autoclass:: transformers.TFPegasusForConditionalGeneration
:members: call

View File

@@ -0,0 +1,59 @@
..
Copyright 2020 The HuggingFace Team. All rights reserved.
Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with
the License. You may obtain a copy of the License at
http://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on
an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the
specific language governing permissions and limitations under the License.
PhoBERT
-----------------------------------------------------------------------------------------------------------------------
Overview
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
The PhoBERT model was proposed in `PhoBERT: Pre-trained language models for Vietnamese
<https://www.aclweb.org/anthology/2020.findings-emnlp.92.pdf>`__ by Dat Quoc Nguyen, Anh Tuan Nguyen.
The abstract from the paper is the following:
*We present PhoBERT with two versions, PhoBERT-base and PhoBERT-large, the first public large-scale monolingual
language models pre-trained for Vietnamese. Experimental results show that PhoBERT consistently outperforms the recent
best pre-trained multilingual model XLM-R (Conneau et al., 2020) and improves the state-of-the-art in multiple
Vietnamese-specific NLP tasks including Part-of-speech tagging, Dependency parsing, Named-entity recognition and
Natural language inference.*
Example of use:
.. code-block::
import torch
from transformers import AutoModel, AutoTokenizer
phobert = AutoModel.from_pretrained("vinai/phobert-base")
tokenizer = AutoTokenizer.from_pretrained("vinai/phobert-base")
# INPUT TEXT MUST BE ALREADY WORD-SEGMENTED!
line = "Tôi là sinh_viên trường đại_học Công_nghệ ."
input_ids = torch.tensor([tokenizer.encode(line)])
with torch.no_grad():
features = phobert(input_ids) # Models outputs are now tuples
## With TensorFlow 2.0+:
# from transformers import TFAutoModel
# phobert = TFAutoModel.from_pretrained("vinai/phobert-base")
The original code can be found `here <https://github.com/VinAIResearch/PhoBERT>`__.
PhobertTokenizer
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
.. autoclass:: transformers.PhobertTokenizer
:members:

View File

@@ -44,9 +44,9 @@ Tips:
For more information about which prefix to use, it is easiest to look into Appendix D of the `paper
<https://arxiv.org/pdf/1910.10683.pdf>`__. - For sequence-to-sequence generation, it is recommended to use
:obj:`T5ForConditionalGeneration.generate()``. This method takes care of feeding the encoded input via
cross-attention layers to the decoder and auto-regressively generates the decoder output. - T5 uses relative scalar
embeddings. Encoder input padding can be done on the left and on the right.
:obj:`T5ForConditionalGeneration.generate()`. This method takes care of feeding the encoded input via cross-attention
layers to the decoder and auto-regressively generates the decoder output. - T5 uses relative scalar embeddings.
Encoder input padding can be done on the left and on the right.
The original code can be found `here <https://github.com/google-research/text-to-text-transfer-transformer>`__.
@@ -55,7 +55,7 @@ Training
T5 is an encoder-decoder model and converts all NLP problems into a text-to-text format. It is trained using teacher
forcing. This means that for training we always need an input sequence and a target sequence. The input sequence is fed
to the model using :obj:`input_ids``. The target sequence is shifted to the right, i.e., prepended by a start-sequence
to the model using :obj:`input_ids`. The target sequence is shifted to the right, i.e., prepended by a start-sequence
token and fed to the decoder using the :obj:`decoder_input_ids`. In teacher-forcing style, the target sequence is then
appended by the EOS token and corresponds to the :obj:`labels`. The PAD token is hereby used as the start-sequence
token. T5 can be trained / fine-tuned both in a supervised and unsupervised fashion.

View File

@@ -265,7 +265,7 @@ conversational**. In case your dataset involves conversational questions (such a
together the ``queries``, ``answer_coordinates`` and ``answer_text`` per table (in the order of their ``position``
index) and batch encode each table with its questions. This will make sure that the ``prev_labels`` token types (see
docs of :class:`~transformers.TapasTokenizer`) are set correctly. See `this notebook
<https://github.com/NielsRogge/Transformers-Tutorials/blob/master/Fine_tuning_TapasForQuestionAnswering_on_SQA.ipynb>`__
<https://github.com/NielsRogge/Transformers-Tutorials/blob/master/TAPAS/Fine_tuning_TapasForQuestionAnswering_on_SQA.ipynb>`__
for more info.
**STEP 4: Train (fine-tune) TapasForQuestionAnswering**
@@ -346,7 +346,7 @@ of that:
... inputs,
... outputs.logits.detach(),
... outputs.logits_aggregation.detach()
...)
... )
>>> # let's print out the results:
>>> id2aggregation = {0: "NONE", 1: "SUM", 2: "AVERAGE", 3:"COUNT"}
@@ -382,7 +382,7 @@ of that:
In case of a conversational set-up, then each table-question pair must be provided **sequentially** to the model, such
that the ``prev_labels`` token types can be overwritten by the predicted ``labels`` of the previous table-question
pair. Again, more info can be found in `this notebook
<https://github.com/NielsRogge/Transformers-Tutorials/blob/master/Fine_tuning_TapasForQuestionAnswering_on_SQA.ipynb>`__.
<https://github.com/NielsRogge/Transformers-Tutorials/blob/master/TAPAS/Fine_tuning_TapasForQuestionAnswering_on_SQA.ipynb>`__.
Tapas specific outputs

View File

@@ -87,12 +87,14 @@ TransfoXLLMHeadModel
.. autoclass:: transformers.TransfoXLLMHeadModel
:members: forward
TransfoXLForSequenceClassification
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
.. autoclass:: transformers.TransfoXLForSequenceClassification
:members: forward
TFTransfoXLModel
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
@@ -107,6 +109,13 @@ TFTransfoXLLMHeadModel
:members: call
TFTransfoXLForSequenceClassification
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
.. autoclass:: transformers.TFTransfoXLForSequenceClassification
:members: call
Internal Layers
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

View File

@@ -0,0 +1,65 @@
..
Copyright 2021 The HuggingFace Team. All rights reserved.
Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with
the License. You may obtain a copy of the License at
http://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on
an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the
specific language governing permissions and limitations under the License.
Wav2Vec2
-----------------------------------------------------------------------------------------------------------------------
Overview
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
The Wav2Vec2 model was proposed in `wav2vec 2.0: A Framework for Self-Supervised Learning of Speech Representations
<https://arxiv.org/abs/2006.11477>`__ by Alexei Baevski, Henry Zhou, Abdelrahman Mohamed, Michael Auli.
The abstract from the paper is the following:
*We show for the first time that learning powerful representations from speech audio alone followed by fine-tuning on
transcribed speech can outperform the best semi-supervised methods while being conceptually simpler. wav2vec 2.0 masks
the speech input in the latent space and solves a contrastive task defined over a quantization of the latent
representations which are jointly learned. Experiments using all labeled data of Librispeech achieve 1.8/3.3 WER on the
clean/other test sets. When lowering the amount of labeled data to one hour, wav2vec 2.0 outperforms the previous state
of the art on the 100 hour subset while using 100 times less labeled data. Using just ten minutes of labeled data and
pre-training on 53k hours of unlabeled data still achieves 4.8/8.2 WER. This demonstrates the feasibility of speech
recognition with limited amounts of labeled data.*
Tips:
- Wav2Vec2 is a speech model that accepts a float array corresponding to the raw waveform of the speech signal.
- Wav2Vec2 model was trained using connectionist temporal classification (CTC) so the model output has to be decoded
using :class:`~transformers.Wav2Vec2Tokenizer`.
Wav2Vec2Config
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
.. autoclass:: transformers.Wav2Vec2Config
:members:
Wav2Vec2Tokenizer
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
.. autoclass:: transformers.Wav2Vec2Tokenizer
:members: __call__, save_vocabulary
Wav2Vec2Model
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
.. autoclass:: transformers.Wav2Vec2Model
:members: forward
Wav2Vec2ForCTC
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
.. autoclass:: transformers.Wav2Vec2ForCTC
:members: forward

View File

@@ -60,7 +60,7 @@ Basic steps
In order to upload a model, you'll need to first create a git repo. This repo will live on the model hub, allowing
users to clone it and you (and your organization members) to push to it.
You can create a model repo **directly from `the /new page on the website <https://huggingface.co/new>`__.**
You can create a model repo directly from `the /new page on the website <https://huggingface.co/new>`__.
Alternatively, you can use the ``transformers-cli``. The next steps describe that process:
@@ -78,6 +78,12 @@ Once you are logged in with your model hub credentials, you can start building y
transformers-cli repo create your-model-name
If you want to create a repo under a specific organization, you should add a `--organization` flag:
.. code-block:: bash
transformers-cli repo create your-model-name --organization your-org-name
This creates a repo on the model hub, which can be cloned.
.. code-block:: bash
@@ -105,6 +111,9 @@ The only learning curve you might have compared to regular git is the one for gi
`git-lfs.github.com <https://git-lfs.github.com/>`__ is decent, but we'll work on a tutorial with some tips and tricks
in the coming weeks!
Additionally, if you want to change multiple repos at once, the `change_config.py script
<https://github.com/huggingface/efficient_scripts/blob/main/change_config.py>`__ can probably save you some time.
Make your model work on all frameworks
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

View File

@@ -16,7 +16,7 @@ Summary of the models
This is a summary of the models available in 🤗 Transformers. It assumes youre familiar with the original `transformer
model <https://arxiv.org/abs/1706.03762>`_. For a gentle introduction check the `annotated transformer
<http://nlp.seas.harvard.edu/2018/04/03/attention.html>`_. Here we focus on the high-level differences between the
models. You can check them more in detail in their respective documentation. Also checkout the :doc:`pretrained model
models. You can check them more in detail in their respective documentation. Also check out the :doc:`pretrained model
page </pretrained_models>` to see the checkpoints available for each type of model and all `the community models
<https://huggingface.co/models>`_.
@@ -30,7 +30,7 @@ Each one of the models in the library falls into one of the following categories
Autoregressive models are pretrained on the classic language modeling task: guess the next token having read all the
previous ones. They correspond to the decoder of the original transformer model, and a mask is used on top of the full
sentence so that the attention heads can only see what was before in the next, and not whats after. Although those
sentence so that the attention heads can only see what was before in the text, and not whats after. Although those
models can be fine-tuned and achieve great results on many tasks, the most natural application is text generation. A
typical example of such models is GPT.
@@ -330,6 +330,36 @@ the same probabilities as the larger model. The actual objective is a combinatio
The library provides a version of the model for masked language modeling, token classification, sentence classification
and question answering.
ConvBERT
-----------------------------------------------------------------------------------------------------------------------
.. raw:: html
<a href="https://huggingface.co/models?filter=convbert">
<img alt="Models" src="https://img.shields.io/badge/All_model_pages-convbert-blueviolet">
</a>
<a href="model_doc/convbert.html">
<img alt="Doc" src="https://img.shields.io/badge/Model_documentation-convbert-blueviolet">
</a>
`ConvBERT: Improving BERT with Span-based Dynamic Convolution <https://arxiv.org/abs/1910.01108>`_, Zihang Jiang,
Weihao Yu, Daquan Zhou, Yunpeng Chen, Jiashi Feng, Shuicheng Yan.
Pre-trained language models like BERT and its variants have recently achieved impressive performance in various natural
language understanding tasks. However, BERT heavily relies on the global self-attention block and thus suffers large
memory footprint and computation cost. Although all its attention heads query on the whole input sequence for
generating the attention map from a global perspective, we observe some heads only need to learn local dependencies,
which means the existence of computation redundancy. We therefore propose a novel span-based dynamic convolution to
replace these self-attention heads to directly model local dependencies. The novel convolution heads, together with the
rest self-attention heads, form a new mixed attention block that is more efficient at both global and local context
learning. We equip BERT with this mixed attention design and build a ConvBERT model. Experiments have shown that
ConvBERT significantly outperforms BERT and its variants in various downstream tasks, with lower training cost and
fewer model parameters. Remarkably, ConvBERTbase model achieves 86.4 GLUE score, 0.7 higher than ELECTRAbase, while
using less than 1/4 training cost.
The library provides a version of the model for masked language modeling, token classification, sentence classification
and question answering.
XLM
-----------------------------------------------------------------------------------------------------------------------
@@ -512,8 +542,8 @@ BART
<https://arxiv.org/abs/1910.13461>`_, Mike Lewis et al.
Sequence-to-sequence model with an encoder and a decoder. Encoder is fed a corrupted version of the tokens, decoder is
fed the original tokens (but has a mask to hide the future words like a regular transformers decoder). For the encoder
, on the pretraining tasks, a composition of the following transformations are applied:
fed the original tokens (but has a mask to hide the future words like a regular transformers decoder). A composition of
the following transformations are applied on the pretraining tasks for the encoder:
* mask random tokens (like in BERT)
* delete random tokens

View File

@@ -90,9 +90,8 @@ You can then feed it all as input to your model:
>>> outputs = model(input_ids, langs=langs)
The example `run_generation.py
<https://github.com/huggingface/transformers/blob/master/examples/text-generation/run_generation.py>`__ can generate
text using the CLM checkpoints from XLM, using the language embeddings.
The example :prefix_link:`run_generation.py <examples/text-generation/run_generation.py>` can generate text using the
CLM checkpoints from XLM, using the language embeddings.
XLM without Language Embeddings
-----------------------------------------------------------------------------------------------------------------------

View File

@@ -78,7 +78,7 @@ The library is built around three types of classes for each model:
All these classes can be instantiated from pretrained instances and saved locally using two methods:
- :obj:`from_pretrained()` lets you instantiate a model/configuration/tokenizer from a pretrained version either
provided by the library itself (the supported models are provided in the list :doc:`here <pretrained_models>` or
provided by the library itself (the supported models are provided in the list :doc:`here <pretrained_models>`) or
stored locally (or on a server) by the user,
- :obj:`save_pretrained()` lets you save a model/configuration/tokenizer locally so that it can be reloaded using
:obj:`from_pretrained()`.

View File

@@ -10,17 +10,17 @@
an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the
specific language governing permissions and limitations under the License.
reprocessing data
Preprocessing data
=======================================================================================================================
In this tutorial, we'll explore how to preprocess your data using 🤗 Transformers. The main tool for this is what we
call a :doc:`tokenizer <main_classes/tokenizer>`. You can build one using the tokenizer class associated to the model
you would like to use, or directly with the :class:`~transformers.AutoTokenizer` class.
As we saw in the :doc:`quicktour </quicktour>`, the tokenizer will first split a given text in words (or part of words,
punctuation symbols, etc.) usually called `tokens`. Then it will convert those `tokens` into numbers, to be able to
build a tensor out of them and feed them to the model. It will also add any additional inputs the model might expect to
work properly.
As we saw in the :doc:`quick tour </quicktour>`, the tokenizer will first split a given text in words (or part of
words, punctuation symbols, etc.) usually called `tokens`. Then it will convert those `tokens` into numbers, to be able
to build a tensor out of them and feed them to the model. It will also add any additional inputs the model might expect
to work properly.
.. note::
@@ -131,7 +131,7 @@ ones it should not (because they represent padding in this case).
Note that if your model does not have a maximum length associated to it, the command above will throw a warning. You
can safely ignore it. You can also pass ``verbose=False`` to stop the tokenizer to throw those kinds of warnings.
can safely ignore it. You can also pass ``verbose=False`` to stop the tokenizer from throwing those kinds of warnings.
.. _sentence-pairs:
@@ -216,7 +216,6 @@ Everything you always wanted to know about padding and truncation
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
We have seen the commands that will work for most cases (pad your batch to the length of the maximum sentence and
truncate to the maximum length the mode can accept). However, the API supports more strategies if you need them. The
three arguments you need to know for this are :obj:`padding`, :obj:`truncation` and :obj:`max_length`.

View File

@@ -13,10 +13,9 @@
Pretrained models
=======================================================================================================================
Here is the full list of the currently provided pretrained models together with a short presentation of each model.
Here is a partial list of some of the available pretrained models together with a short presentation of each model.
For a list that includes all community-uploaded models, refer to `https://huggingface.co/models
<https://huggingface.co/models>`__.
For the full list, refer to `https://huggingface.co/models <https://huggingface.co/models>`__.
+--------------------+------------------------------------------------------------+---------------------------------------------------------------------------------------------------------------------------------------+
| Architecture | Model id | Details of the model |

View File

@@ -158,7 +158,7 @@ Using the tokenizer
We mentioned the tokenizer is responsible for the preprocessing of your texts. First, it will split a given text in
words (or part of words, punctuation symbols, etc.) usually called `tokens`. There are multiple rules that can govern
that process (you can learn more about them in the :doc:`tokenizer summary <tokenizer_summary>`, which is why we need
that process (you can learn more about them in the :doc:`tokenizer summary <tokenizer_summary>`), which is why we need
to instantiate the tokenizer using the name of the model, to make sure we use the same rules as when the model was
pretrained.

View File

@@ -327,7 +327,7 @@ Masked Language Modeling
Masked language modeling is the task of masking tokens in a sequence with a masking token, and prompting the model to
fill that mask with an appropriate token. This allows the model to attend to both the right context (tokens on the
right of the mask) and the left context (tokens on the left of the mask). Such a training creates a strong basis for
downstream tasks, requiring bi-directional context such as SQuAD (question answering, see `Lewis, Lui, Goyal et al.
downstream tasks requiring bi-directional context, such as SQuAD (question answering, see `Lewis, Lui, Goyal et al.
<https://arxiv.org/abs/1910.13461>`__, part 4.2).
Here is an example of using pipelines to replace a mask from a sequence:
@@ -657,7 +657,7 @@ Here are the expected results:
{'word': 'Bridge', 'score': 0.990249514579773, 'entity': 'I-LOC'}
]
Note, how the tokens of the sequence "Hugging Face" have been identified as an organisation, and "New York City",
Note how the tokens of the sequence "Hugging Face" have been identified as an organisation, and "New York City",
"DUMBO" and "Manhattan Bridge" have been identified as locations.
Here is an example of doing named entity recognition, using a model and a tokenizer. The process is the following:
@@ -750,8 +750,7 @@ Summarization is the task of summarizing a document or an article into a shorter
An example of a summarization dataset is the CNN / Daily Mail dataset, which consists of long news articles and was
created for the task of summarization. If you would like to fine-tune a model on a summarization task, various
approaches are described in this `document
<https://github.com/huggingface/transformers/blob/master/examples/seq2seq/README.md>`__.
approaches are described in this :prefix_link:`document <examples/seq2seq/README.md>`.
Here is an example of using the pipelines to do summarization. It leverages a Bart model that was fine-tuned on the CNN
/ Daily Mail data set.
@@ -829,8 +828,7 @@ Translation is the task of translating a text from one language to another.
An example of a translation dataset is the WMT English to German dataset, which has sentences in English as the input
data and the corresponding sentences in German as the target data. If you would like to fine-tune a model on a
translation task, various approaches are described in this `document
<https://github.com/huggingface/transformers/blob/master/examples/seq2seq/README.md>`__.
translation task, various approaches are described in this :prefix_link:`document <examples/seq2seq/README.md>`.
Here is an example of using the pipelines to do translation. It leverages a T5 model that was only pre-trained on a
multi-task mixture dataset (including WMT), yet, yielding impressive translation results.

View File

@@ -25,25 +25,22 @@ How transformers are tested
-----------------------------------------------------------------------------------------------------------------------
1. Once a PR is submitted it gets tested with 9 CircleCi jobs. Every new commit to that PR gets retested. These jobs
are defined in this `config file <https://github.com/huggingface/transformers/blob/master/.circleci/config.yml>`__,
so that if needed you can reproduce the same environment on your machine.
are defined in this :prefix_link:`config file <.circleci/config.yml>`, so that if needed you can reproduce the same
environment on your machine.
These CI jobs don't run ``@slow`` tests.
2. There are 3 jobs run by `github actions <https://github.com/huggingface/transformers/actions>`__:
* `torch hub integration
<https://github.com/huggingface/transformers/blob/master/.github/workflows/github-torch-hub.yml>`__: checks
whether torch hub integration works.
* :prefix_link:`torch hub integration <.github/workflows/github-torch-hub.yml>`: checks whether torch hub
integration works.
* `self-hosted (push) <https://github.com/huggingface/transformers/blob/master/.github/workflows/self-push.yml>`__:
runs fast tests on GPU only on commits on ``master``. It only runs if a commit on ``master`` has updated the code
in one of the following folders: ``src``, ``tests``, ``.github`` (to prevent running on added model cards,
notebooks, etc.)
* :prefix_link:`self-hosted (push) <.github/workflows/self-push.yml>`: runs fast tests on GPU only on commits on
``master``. It only runs if a commit on ``master`` has updated the code in one of the following folders: ``src``,
``tests``, ``.github`` (to prevent running on added model cards, notebooks, etc.)
* `self-hosted runner
<https://github.com/huggingface/transformers/blob/master/.github/workflows/self-scheduled.yml>`__: runs normal and
slow tests on GPU in ``tests`` and ``examples``:
* :prefix_link:`self-hosted runner <.github/workflows/self-scheduled.yml>`: runs normal and slow tests on GPU in
``tests`` and ``examples``:
.. code-block:: bash
@@ -492,12 +489,9 @@ spawns a normal process that then spawns off multiple workers and manages the IO
This is still under development but you can study 2 different tests that perform this successfully:
* `test_seq2seq_examples_multi_gpu.py
<https://github.com/huggingface/transformers/blob/master/examples/seq2seq/test_seq2seq_examples_multi_gpu.py>`__ - a
* :prefix_link:`test_seq2seq_examples_multi_gpu.py <examples/seq2seq/test_seq2seq_examples_multi_gpu.py>` - a
``pytorch-lightning``-running test (had to use PL's ``ddp`` spawning method which is the default)
* `test_finetune_trainer.py
<https://github.com/huggingface/transformers/blob/master/examples/seq2seq/test_finetune_trainer.py>`__ - a normal
(non-PL) test
* :prefix_link:`test_finetune_trainer.py <examples/seq2seq/test_finetune_trainer.py>` - a normal (non-PL) test
To jump right into the execution point, search for the ``execute_subprocess_async`` function in those tests.
@@ -940,10 +934,9 @@ slow models to do qualitative testing. To see the use of these simply look for *
grep tiny tests examples
Here is a an example of a `script
<https://github.com/huggingface/transformers/blob/master/scripts/fsmt/fsmt-make-tiny-model.py>`__ that created the tiny
model `stas/tiny-wmt19-en-de <https://huggingface.co/stas/tiny-wmt19-en-de>`__. You can easily adjust it to your
specific model's architecture.
Here is a an example of a :prefix_link:`script <scripts/fsmt/fsmt-make-tiny-model.py>` that created the tiny model
`stas/tiny-wmt19-en-de <https://huggingface.co/stas/tiny-wmt19-en-de>`__. You can easily adjust it to your specific
model's architecture.
It's easy to measure the run-time incorrectly if for example there is an overheard of downloading a huge model, but if
you test it locally the downloaded files would be cached and thus the download time not measured. Hence check the

View File

@@ -18,7 +18,7 @@ On this page, we will have a closer look at tokenization. As we saw in :doc:`the
look-up table. Converting words or subwords to ids is straightforward, so in this summary, we will focus on splitting a
text into words or subwords (i.e. tokenizing a text). More specifically, we will look at the three main types of
tokenizers used in 🤗 Transformers: :ref:`Byte-Pair Encoding (BPE) <byte-pair-encoding>`, :ref:`WordPiece <wordpiece>`,
and :ref:`SentencePiece <sentencepiece>`, and show exemplary which tokenizer type is used by which model.
and :ref:`SentencePiece <sentencepiece>`, and show examples of which tokenizer type is used by which model.
Note that on each model page, you can look at the documentation of the associated tokenizer to know which tokenizer
type was used by the pretrained model. For instance, if we look at :class:`~transformers.BertTokenizer`, we can see
@@ -72,7 +72,7 @@ greater than 50,000, especially if they are pretrained only on a single language
So if simple space and punctuation tokenization is unsatisfactory, why not simply tokenize on characters? While
character tokenization is very simple and would greatly reduce memory and time complexity it makes it much harder for
the model to learn meaningful input representations. *E.g.* learning a meaningful context-independent representation
for the letter ``"t"`` is much harder as learning a context-independent representation for the word ``"today"``.
for the letter ``"t"`` is much harder than learning a context-independent representation for the word ``"today"``.
Therefore, character tokenization is often accompanied by a loss of performance. So to get the best of both worlds,
transformers models use a hybrid between word-level and character-level tokenization called **subword** tokenization.
@@ -202,10 +202,10 @@ WordPiece
WordPiece is the subword tokenization algorithm used for :doc:`BERT <model_doc/bert>`, :doc:`DistilBERT
<model_doc/distilbert>`, and :doc:`Electra <model_doc/electra>`. The algorithm was outlined in `Japanese and Korean
Voice Seach (Schuster et al., 2012)
Voice Search (Schuster et al., 2012)
<https://static.googleusercontent.com/media/research.google.com/ja//pubs/archive/37842.pdf>`__ and is very similar to
BPE. WordPiece first initializes the vocabulary to include every character present in the training data and
progressively learn a given number of merge rules. In contrast to BPE, WordPiece does not choose the most frequent
progressively learns a given number of merge rules. In contrast to BPE, WordPiece does not choose the most frequent
symbol pair, but the one that maximizes the likelihood of the training data once added to the vocabulary.
So what does this mean exactly? Referring to the previous example, maximizing the likelihood of the training data is

View File

@@ -14,7 +14,7 @@ Training and fine-tuning
=======================================================================================================================
Model classes in 🤗 Transformers are designed to be compatible with native PyTorch and TensorFlow 2 and can be used
seemlessly with either. In this quickstart, we will show how to fine-tune (or train from scratch) a model using the
seamlessly with either. In this quickstart, we will show how to fine-tune (or train from scratch) a model using the
standard training tools available in either framework. We will also show how to use our included
:func:`~transformers.Trainer` class which handles much of the complexity of training for you.
@@ -279,6 +279,7 @@ Finally, you can view the results, including any calculated metrics, by launchin
``logging_dir`` directory.
.. _additional-resources:
Additional resources

View File

@@ -54,12 +54,12 @@ Coming soon!
| Task | Example datasets | Trainer support | TFTrainer support | 🤗 Datasets | Colab
|---|---|:---:|:---:|:---:|:---:|
| [**`language-modeling`**](https://github.com/huggingface/transformers/tree/master/examples/language-modeling) | Raw text | ✅ | - | ✅ | [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/huggingface/blog/blob/master/notebooks/01_how_to_train.ipynb)
| [**`multiple-choice`**](https://github.com/huggingface/transformers/tree/master/examples/multiple-choice) | SWAG, RACE, ARC | ✅ | ✅ | - | [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/ViktorAlm/notebooks/blob/master/MPC_GPU_Demo_for_TF_and_PT.ipynb)
| [**`question-answering`**](https://github.com/huggingface/transformers/tree/master/examples/question-answering) | SQuAD | ✅ | ✅ | ✅ | [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://github.com/huggingface/notebooks/blob/master/examples/question_answering.ipynb)
| [**`multiple-choice`**](https://github.com/huggingface/transformers/tree/master/examples/multiple-choice) | SWAG, RACE, ARC | ✅ | ✅ | | [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/ViktorAlm/notebooks/blob/master/MPC_GPU_Demo_for_TF_and_PT.ipynb)
| [**`question-answering`**](https://github.com/huggingface/transformers/tree/master/examples/question-answering) | SQuAD | ✅ | ✅ | ✅ | [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/huggingface/notebooks/blob/master/examples/question_answering.ipynb)
| [**`summarization`**](https://github.com/huggingface/transformers/tree/master/examples/seq2seq) | CNN/Daily Mail | ✅ | - | - | -
| [**`text-classification`**](https://github.com/huggingface/transformers/tree/master/examples/text-classification) | GLUE, XNLI | ✅ | ✅ | ✅ | [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://github.com/huggingface/notebooks/blob/master/examples/text_classification.ipynb)
| [**`text-classification`**](https://github.com/huggingface/transformers/tree/master/examples/text-classification) | GLUE, XNLI | ✅ | ✅ | ✅ | [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/huggingface/notebooks/blob/master/examples/text_classification.ipynb)
| [**`text-generation`**](https://github.com/huggingface/transformers/tree/master/examples/text-generation) | - | n/a | n/a | - | [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/huggingface/blog/blob/master/notebooks/02_how_to_generate.ipynb)
| [**`token-classification`**](https://github.com/huggingface/transformers/tree/master/examples/token-classification) | CoNLL NER | ✅ | ✅ | ✅ | [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://github.com/huggingface/notebooks/blob/master/examples/token_classification.ipynb)
| [**`token-classification`**](https://github.com/huggingface/transformers/tree/master/examples/token-classification) | CoNLL NER | ✅ | ✅ | ✅ | [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/huggingface/notebooks/blob/master/examples/token_classification.ipynb)
| [**`translation`**](https://github.com/huggingface/transformers/tree/master/examples/seq2seq) | WMT | ✅ | - | - | -
@@ -69,6 +69,43 @@ Coming soon!
**Coming soon!**
-->
## Distributed training and mixed precision
All the PyTorch scripts mentioned above work out of the box with distributed training and mixed precision, thanks to
the [Trainer API](https://huggingface.co/transformers/main_classes/trainer.html). To launch one of them on _n_ GPUS,
use the following command:
```bash
python -m torch.distributed.launch \
--nproc_per_node number_of_gpu_you_have path_to_script.py \
--all_arguments_of_the_script
```
As an example, here is how you would fine-tune the BERT large model (with whole word masking) on the text
classification MNLI task using the `run_glue` script, with 8 GPUs:
```bash
python -m torch.distributed.launch \
--nproc_per_node 8 text-classification/run_glue.py \
--model_name_or_path bert-large-uncased-whole-word-masking \
--task_name mnli \
--do_train \
--do_eval \
--max_seq_length 128 \
--per_device_train_batch_size 8 \
--learning_rate 2e-5 \
--num_train_epochs 3.0 \
--output_dir /tmp/mnli_output/
```
If you have a GPU with mixed precision capabilities (architecture Pascal or more recent), you can use mixed precision
training with PyTorch 1.6.0 or latest, or by installing the [Apex](https://github.com/NVIDIA/apex) library for previous
versions. Just add the flag `--fp16` to your command launching one of the scripts mentioned above!
Using mixed precision training usually results in 2x-speedup for training with the same final results (as shown in
[this table](https://github.com/huggingface/transformers/tree/master/examples/text-classification#mixed-precision-training)
for text classification).
## Running on TPUs
When using Tensorflow, TPUs are supported out of the box as a `tf.distribute.Strategy`.
@@ -76,27 +113,34 @@ When using Tensorflow, TPUs are supported out of the box as a `tf.distribute.Str
When using PyTorch, we support TPUs thanks to `pytorch/xla`. For more context and information on how to setup your TPU environment refer to Google's documentation and to the
very detailed [pytorch/xla README](https://github.com/pytorch/xla/blob/master/README.md).
In this repo, we provide a very simple launcher script named [xla_spawn.py](https://github.com/huggingface/transformers/tree/master/examples/xla_spawn.py) that lets you run our example scripts on multiple TPU cores without any boilerplate.
Just pass a `--num_cores` flag to this script, then your regular training script with its arguments (this is similar to the `torch.distributed.launch` helper for torch.distributed).
Note that this approach does not work for examples that use `pytorch-lightning`.
For example for `run_glue`:
In this repo, we provide a very simple launcher script named
[xla_spawn.py](https://github.com/huggingface/transformers/tree/master/examples/xla_spawn.py) that lets you run our
example scripts on multiple TPU cores without any boilerplate. Just pass a `--num_cores` flag to this script, then your
regular training script with its arguments (this is similar to the `torch.distributed.launch` helper for
`torch.distributed`):
```bash
python examples/xla_spawn.py --num_cores 8 \
examples/text-classification/run_glue.py \
--model_name_or_path bert-base-cased \
--task_name mnli \
--data_dir ./data/glue_data/MNLI \
--output_dir ./models/tpu \
--overwrite_output_dir \
--do_train \
--do_eval \
--num_train_epochs 1 \
--save_steps 20000
python xla_spawn.py --num_cores num_tpu_you_have \
path_to_script.py \
--all_arguments_of_the_script
```
Feedback and more use cases and benchmarks involving TPUs are welcome, please share with the community.
As an example, here is how you would fine-tune the BERT large model (with whole word masking) on the text
classification MNLI task using the `run_glue` script, with 8 TPUs:
```bash
python xla_spawn.py --num_cores 8 \
text-classification/run_glue.py \
--model_name_or_path bert-large-uncased-whole-word-masking \
--task_name mnli \
--do_train \
--do_eval \
--max_seq_length 128 \
--per_device_train_batch_size 8 \
--learning_rate 2e-5 \
--num_train_epochs 3.0 \
--output_dir /tmp/mnli_output/
```
## Logging & Experiment tracking

View File

@@ -25,8 +25,7 @@ objectives in our [model summary](https://huggingface.co/transformers/model_summ
These scripts leverage the 🤗 Datasets library and the Trainer API. You can easily customize them to your needs if you
need extra processing on your datasets.
**Note:** The old script `run_language_modeling.py` is still available
[here](https://github.com/huggingface/transformers/blob/master/examples/contrib/legacy/run_language_modeling.py).
**Note:** The old script `run_language_modeling.py` is still available [here](https://github.com/huggingface/transformers/blob/master/examples/legacy/run_language_modeling.py).
The following examples, will run on a datasets hosted on our [hub](https://huggingface.co/datasets) or with your own
text files for training and validation. We give examples of both below.
@@ -101,72 +100,7 @@ sure all your batches have the same length.
### Whole word masking
The BERT authors released a new version of BERT using Whole Word Masking in May 2019. Instead of masking randomly
selected tokens (which may be part of words), they mask randomly selected words (masking all the tokens corresponding
to that word). This technique has been refined for Chinese in [this paper](https://arxiv.org/abs/1906.08101).
To fine-tune a model using whole word masking, use the following script:
```bash
python run_mlm_wwm.py \
--model_name_or_path roberta-base \
--dataset_name wikitext \
--dataset_config_name wikitext-2-raw-v1 \
--do_train \
--do_eval \
--output_dir /tmp/test-mlm-wwm
```
For Chinese models, we need to generate a reference files (which requires the ltp library), because it's tokenized at
the character level.
**Q :** Why a reference file?
**A :** Suppose we have a Chinese sentence like: `我喜欢你` The original Chinese-BERT will tokenize it as
`['我','喜','欢','你']` (character level). But `喜欢` is a whole word. For whole word masking proxy, we need a result
like `['我','喜','##欢','你']`, so we need a reference file to tell the model which position of the BERT original token
should be added `##`.
**Q :** Why LTP ?
**A :** Cause the best known Chinese WWM BERT is [Chinese-BERT-wwm](https://github.com/ymcui/Chinese-BERT-wwm) by HIT.
It works well on so many Chines Task like CLUE (Chinese GLUE). They use LTP, so if we want to fine-tune their model,
we need LTP.
Now LTP only only works well on `transformers==3.2.0`. So we don't add it to requirements.txt.
You need to create a separate environment with this version of Transformers to run the `run_chinese_ref.py` script that
will create the reference files. The script is in `examples/contrib`. Once in the proper environment, run the
following:
```bash
export TRAIN_FILE=/path/to/dataset/wiki.train.raw
export LTP_RESOURCE=/path/to/ltp/tokenizer
export BERT_RESOURCE=/path/to/bert/tokenizer
export SAVE_PATH=/path/to/data/ref.txt
python examples/contrib/run_chinese_ref.py \
--file_name=path_to_train_or_eval_file \
--ltp=path_to_ltp_tokenizer \
--bert=path_to_bert_tokenizer \
--save_path=path_to_reference_file
```
Then you can run the script like this:
```bash
python run_mlm_wwm.py \
--model_name_or_path roberta-base \
--train_file path_to_train_file \
--validation_file path_to_validation_file \
--train_ref_file path_to_train_chinese_ref_file \
--validation_ref_file path_to_validation_chinese_ref_file \
--do_train \
--do_eval \
--output_dir /tmp/test-mlm-wwm
```
**Note:** On TPU, you should the flag `--pad_to_max_length` to make sure all your batches have the same length.
This part was moved to `examples/research_projects/mlm_wwm`.
### XLNet and permutation language modeling

View File

@@ -42,7 +42,7 @@ from transformers import (
default_data_collator,
set_seed,
)
from transformers.trainer_utils import is_main_process
from transformers.trainer_utils import get_last_checkpoint, is_main_process
logger = logging.getLogger(__name__)
@@ -83,6 +83,17 @@ class ModelArguments:
default=True,
metadata={"help": "Whether to use one of the fast tokenizer (backed by the tokenizers library) or not."},
)
model_revision: str = field(
default="main",
metadata={"help": "The specific model version to use (can be a branch name, tag name or commit id)."},
)
use_auth_token: bool = field(
default=False,
metadata={
"help": "Will use the token generated when running `transformers-cli login` (necessary to use this script "
"with private models)."
},
)
@dataclass
@@ -149,23 +160,28 @@ def main():
else:
model_args, data_args, training_args = parser.parse_args_into_dataclasses()
if (
os.path.exists(training_args.output_dir)
and os.listdir(training_args.output_dir)
and training_args.do_train
and not training_args.overwrite_output_dir
):
raise ValueError(
f"Output directory ({training_args.output_dir}) already exists and is not empty."
"Use --overwrite_output_dir to overcome."
)
# Detecting last checkpoint.
last_checkpoint = None
if os.path.isdir(training_args.output_dir) and training_args.do_train and not training_args.overwrite_output_dir:
last_checkpoint = get_last_checkpoint(training_args.output_dir)
if last_checkpoint is None and len(os.listdir(training_args.output_dir)) > 0:
raise ValueError(
f"Output directory ({training_args.output_dir}) already exists and is not empty. "
"Use --overwrite_output_dir to overcome."
)
elif last_checkpoint is not None:
logger.info(
f"Checkpoint detected, resuming training at {last_checkpoint}. To avoid this behavior, change "
"the `--output_dir` or add `--overwrite_output_dir` to train from scratch."
)
# Setup logging
logging.basicConfig(
format="%(asctime)s - %(levelname)s - %(name)s - %(message)s",
datefmt="%m/%d/%Y %H:%M:%S",
level=logging.INFO if is_main_process(training_args.local_rank) else logging.WARN,
handlers=[logging.StreamHandler(sys.stdout)],
)
logger.setLevel(logging.INFO if is_main_process(training_args.local_rank) else logging.WARN)
# Log on each process the small summary:
logger.warning(
@@ -211,7 +227,11 @@ def main():
data_files["train"] = data_args.train_file
if data_args.validation_file is not None:
data_files["validation"] = data_args.validation_file
extension = data_args.train_file.split(".")[-1]
extension = (
data_args.train_file.split(".")[-1]
if data_args.train_file is not None
else data_args.validation_file.split(".")[-1]
)
if extension == "txt":
extension = "text"
datasets = load_dataset(extension, data_files=data_files)
@@ -224,22 +244,29 @@ def main():
# The .from_pretrained methods guarantee that only one local process can concurrently
# download model & vocab.
config_kwargs = {
"cache_dir": model_args.cache_dir,
"revision": model_args.model_revision,
"use_auth_token": True if model_args.use_auth_token else None,
}
if model_args.config_name:
config = AutoConfig.from_pretrained(model_args.config_name, cache_dir=model_args.cache_dir)
config = AutoConfig.from_pretrained(model_args.config_name, **config_kwargs)
elif model_args.model_name_or_path:
config = AutoConfig.from_pretrained(model_args.model_name_or_path, cache_dir=model_args.cache_dir)
config = AutoConfig.from_pretrained(model_args.model_name_or_path, **config_kwargs)
else:
config = CONFIG_MAPPING[model_args.model_type]()
logger.warning("You are instantiating a new config instance from scratch.")
tokenizer_kwargs = {
"cache_dir": model_args.cache_dir,
"use_fast": model_args.use_fast_tokenizer,
"revision": model_args.model_revision,
"use_auth_token": True if model_args.use_auth_token else None,
}
if model_args.tokenizer_name:
tokenizer = AutoTokenizer.from_pretrained(
model_args.tokenizer_name, cache_dir=model_args.cache_dir, use_fast=model_args.use_fast_tokenizer
)
tokenizer = AutoTokenizer.from_pretrained(model_args.tokenizer_name, **tokenizer_kwargs)
elif model_args.model_name_or_path:
tokenizer = AutoTokenizer.from_pretrained(
model_args.model_name_or_path, cache_dir=model_args.cache_dir, use_fast=model_args.use_fast_tokenizer
)
tokenizer = AutoTokenizer.from_pretrained(model_args.model_name_or_path, **tokenizer_kwargs)
else:
raise ValueError(
"You are instantiating a new tokenizer from scratch. This is not supported by this script."
@@ -252,6 +279,8 @@ def main():
from_tf=bool(".ckpt" in model_args.model_name_or_path),
config=config,
cache_dir=model_args.cache_dir,
revision=model_args.model_revision,
use_auth_token=True if model_args.use_auth_token else None,
)
else:
logger.info("Training new model from scratch")
@@ -336,14 +365,26 @@ def main():
# Training
if training_args.do_train:
model_path = (
model_args.model_name_or_path
if (model_args.model_name_or_path is not None and os.path.isdir(model_args.model_name_or_path))
else None
)
trainer.train(model_path=model_path)
if last_checkpoint is not None:
checkpoint = last_checkpoint
elif model_args.model_name_or_path is not None and os.path.isdir(model_args.model_name_or_path):
checkpoint = model_args.model_name_or_path
else:
checkpoint = None
train_result = trainer.train(resume_from_checkpoint=checkpoint)
trainer.save_model() # Saves the tokenizer too for easy upload
output_train_file = os.path.join(training_args.output_dir, "train_results.txt")
if trainer.is_world_process_zero():
with open(output_train_file, "w") as writer:
logger.info("***** Train results *****")
for key, value in sorted(train_result.metrics.items()):
logger.info(f" {key} = {value}")
writer.write(f"{key} = {value}\n")
# Need to save the state, since Trainer.save_model saves only the tokenizer with the model
trainer.state.save_to_json(os.path.join(training_args.output_dir, "trainer_state.json"))
# Evaluation
results = {}
if training_args.do_eval:
@@ -358,7 +399,7 @@ def main():
if trainer.is_world_process_zero():
with open(output_eval_file, "w") as writer:
logger.info("***** Eval results *****")
for key, value in results.items():
for key, value in sorted(results.items()):
logger.info(f" {key} = {value}")
writer.write(f"{key} = {value}\n")

View File

@@ -42,7 +42,7 @@ from transformers import (
TrainingArguments,
set_seed,
)
from transformers.trainer_utils import is_main_process
from transformers.trainer_utils import get_last_checkpoint, is_main_process
logger = logging.getLogger(__name__)
@@ -81,6 +81,17 @@ class ModelArguments:
default=True,
metadata={"help": "Whether to use one of the fast tokenizer (backed by the tokenizers library) or not."},
)
model_revision: str = field(
default="main",
metadata={"help": "The specific model version to use (can be a branch name, tag name or commit id)."},
)
use_auth_token: bool = field(
default=False,
metadata={
"help": "Will use the token generated when running `transformers-cli login` (necessary to use this script "
"with private models)."
},
)
@dataclass
@@ -160,23 +171,28 @@ def main():
else:
model_args, data_args, training_args = parser.parse_args_into_dataclasses()
if (
os.path.exists(training_args.output_dir)
and os.listdir(training_args.output_dir)
and training_args.do_train
and not training_args.overwrite_output_dir
):
raise ValueError(
f"Output directory ({training_args.output_dir}) already exists and is not empty."
"Use --overwrite_output_dir to overcome."
)
# Detecting last checkpoint.
last_checkpoint = None
if os.path.isdir(training_args.output_dir) and training_args.do_train and not training_args.overwrite_output_dir:
last_checkpoint = get_last_checkpoint(training_args.output_dir)
if last_checkpoint is None and len(os.listdir(training_args.output_dir)) > 0:
raise ValueError(
f"Output directory ({training_args.output_dir}) already exists and is not empty. "
"Use --overwrite_output_dir to overcome."
)
elif last_checkpoint is not None:
logger.info(
f"Checkpoint detected, resuming training at {last_checkpoint}. To avoid this behavior, change "
"the `--output_dir` or add `--overwrite_output_dir` to train from scratch."
)
# Setup logging
logging.basicConfig(
format="%(asctime)s - %(levelname)s - %(name)s - %(message)s",
datefmt="%m/%d/%Y %H:%M:%S",
level=logging.INFO if is_main_process(training_args.local_rank) else logging.WARN,
handlers=[logging.StreamHandler(sys.stdout)],
)
logger.setLevel(logging.INFO if is_main_process(training_args.local_rank) else logging.WARN)
# Log on each process the small summary:
logger.warning(
@@ -234,22 +250,29 @@ def main():
# Distributed training:
# The .from_pretrained methods guarantee that only one local process can concurrently
# download model & vocab.
config_kwargs = {
"cache_dir": model_args.cache_dir,
"revision": model_args.model_revision,
"use_auth_token": True if model_args.use_auth_token else None,
}
if model_args.config_name:
config = AutoConfig.from_pretrained(model_args.config_name, cache_dir=model_args.cache_dir)
config = AutoConfig.from_pretrained(model_args.config_name, **config_kwargs)
elif model_args.model_name_or_path:
config = AutoConfig.from_pretrained(model_args.model_name_or_path, cache_dir=model_args.cache_dir)
config = AutoConfig.from_pretrained(model_args.model_name_or_path, **config_kwargs)
else:
config = CONFIG_MAPPING[model_args.model_type]()
logger.warning("You are instantiating a new config instance from scratch.")
tokenizer_kwargs = {
"cache_dir": model_args.cache_dir,
"use_fast": model_args.use_fast_tokenizer,
"revision": model_args.model_revision,
"use_auth_token": True if model_args.use_auth_token else None,
}
if model_args.tokenizer_name:
tokenizer = AutoTokenizer.from_pretrained(
model_args.tokenizer_name, cache_dir=model_args.cache_dir, use_fast=model_args.use_fast_tokenizer
)
tokenizer = AutoTokenizer.from_pretrained(model_args.tokenizer_name, **tokenizer_kwargs)
elif model_args.model_name_or_path:
tokenizer = AutoTokenizer.from_pretrained(
model_args.model_name_or_path, cache_dir=model_args.cache_dir, use_fast=model_args.use_fast_tokenizer
)
tokenizer = AutoTokenizer.from_pretrained(model_args.model_name_or_path, **tokenizer_kwargs)
else:
raise ValueError(
"You are instantiating a new tokenizer from scratch. This is not supported by this script."
@@ -262,6 +285,8 @@ def main():
from_tf=bool(".ckpt" in model_args.model_name_or_path),
config=config,
cache_dir=model_args.cache_dir,
revision=model_args.model_revision,
use_auth_token=True if model_args.use_auth_token else None,
)
else:
logger.info("Training new model from scratch")
@@ -318,6 +343,12 @@ def main():
if data_args.max_seq_length is None:
max_seq_length = tokenizer.model_max_length
if max_seq_length > 1024:
logger.warn(
f"The tokenizer picked seems to have a very large `model_max_length` ({tokenizer.model_max_length}). "
"Picking 1024 instead. You can change that default value by passing --max_seq_length xxx."
)
max_seq_length = 1024
else:
if data_args.max_seq_length > tokenizer.model_max_length:
logger.warn(
@@ -371,14 +402,26 @@ def main():
# Training
if training_args.do_train:
model_path = (
model_args.model_name_or_path
if (model_args.model_name_or_path is not None and os.path.isdir(model_args.model_name_or_path))
else None
)
trainer.train(model_path=model_path)
if last_checkpoint is not None:
checkpoint = last_checkpoint
elif model_args.model_name_or_path is not None and os.path.isdir(model_args.model_name_or_path):
checkpoint = model_args.model_name_or_path
else:
checkpoint = None
train_result = trainer.train(resume_from_checkpoint=checkpoint)
trainer.save_model() # Saves the tokenizer too for easy upload
output_train_file = os.path.join(training_args.output_dir, "train_results.txt")
if trainer.is_world_process_zero():
with open(output_train_file, "w") as writer:
logger.info("***** Train results *****")
for key, value in sorted(train_result.metrics.items()):
logger.info(f" {key} = {value}")
writer.write(f"{key} = {value}\n")
# Need to save the state, since Trainer.save_model saves only the tokenizer with the model
trainer.state.save_to_json(os.path.join(training_args.output_dir, "trainer_state.json"))
# Evaluation
results = {}
if training_args.do_eval:
@@ -393,7 +436,7 @@ def main():
if trainer.is_world_process_zero():
with open(output_eval_file, "w") as writer:
logger.info("***** Eval results *****")
for key, value in results.items():
for key, value in sorted(results.items()):
logger.info(f" {key} = {value}")
writer.write(f"{key} = {value}\n")

View File

@@ -38,7 +38,7 @@ from transformers import (
XLNetLMHeadModel,
set_seed,
)
from transformers.trainer_utils import is_main_process
from transformers.trainer_utils import get_last_checkpoint, is_main_process
logger = logging.getLogger(__name__)
@@ -71,6 +71,17 @@ class ModelArguments:
default=True,
metadata={"help": "Whether to use one of the fast tokenizer (backed by the tokenizers library) or not."},
)
model_revision: str = field(
default="main",
metadata={"help": "The specific model version to use (can be a branch name, tag name or commit id)."},
)
use_auth_token: bool = field(
default=False,
metadata={
"help": "Will use the token generated when running `transformers-cli login` (necessary to use this script "
"with private models)."
},
)
@dataclass
@@ -157,23 +168,28 @@ def main():
else:
model_args, data_args, training_args = parser.parse_args_into_dataclasses()
if (
os.path.exists(training_args.output_dir)
and os.listdir(training_args.output_dir)
and training_args.do_train
and not training_args.overwrite_output_dir
):
raise ValueError(
f"Output directory ({training_args.output_dir}) already exists and is not empty."
"Use --overwrite_output_dir to overcome."
)
# Detecting last checkpoint.
last_checkpoint = None
if os.path.isdir(training_args.output_dir) and training_args.do_train and not training_args.overwrite_output_dir:
last_checkpoint = get_last_checkpoint(training_args.output_dir)
if last_checkpoint is None and len(os.listdir(training_args.output_dir)) > 0:
raise ValueError(
f"Output directory ({training_args.output_dir}) already exists and is not empty. "
"Use --overwrite_output_dir to overcome."
)
elif last_checkpoint is not None:
logger.info(
f"Checkpoint detected, resuming training at {last_checkpoint}. To avoid this behavior, change "
"the `--output_dir` or add `--overwrite_output_dir` to train from scratch."
)
# Setup logging
logging.basicConfig(
format="%(asctime)s - %(levelname)s - %(name)s - %(message)s",
datefmt="%m/%d/%Y %H:%M:%S",
level=logging.INFO if is_main_process(training_args.local_rank) else logging.WARN,
handlers=[logging.StreamHandler(sys.stdout)],
)
logger.setLevel(logging.INFO if is_main_process(training_args.local_rank) else logging.WARN)
# Log on each process the small summary:
logger.warning(
@@ -231,22 +247,29 @@ def main():
# Distributed training:
# The .from_pretrained methods guarantee that only one local process can concurrently
# download model & vocab.
config_kwargs = {
"cache_dir": model_args.cache_dir,
"revision": model_args.model_revision,
"use_auth_token": True if model_args.use_auth_token else None,
}
if model_args.config_name:
config = AutoConfig.from_pretrained(model_args.config_name, cache_dir=model_args.cache_dir)
config = AutoConfig.from_pretrained(model_args.config_name, **config_kwargs)
elif model_args.model_name_or_path:
config = AutoConfig.from_pretrained(model_args.model_name_or_path, cache_dir=model_args.cache_dir)
config = AutoConfig.from_pretrained(model_args.model_name_or_path, **config_kwargs)
else:
config = XLNetConfig()
logger.warning("You are instantiating a new config instance from scratch.")
tokenizer_kwargs = {
"cache_dir": model_args.cache_dir,
"use_fast": model_args.use_fast_tokenizer,
"revision": model_args.model_revision,
"use_auth_token": True if model_args.use_auth_token else None,
}
if model_args.tokenizer_name:
tokenizer = AutoTokenizer.from_pretrained(
model_args.tokenizer_name, cache_dir=model_args.cache_dir, use_fast=model_args.use_fast_tokenizer
)
tokenizer = AutoTokenizer.from_pretrained(model_args.tokenizer_name, **tokenizer_kwargs)
elif model_args.model_name_or_path:
tokenizer = AutoTokenizer.from_pretrained(
model_args.model_name_or_path, cache_dir=model_args.cache_dir, use_fast=model_args.use_fast_tokenizer
)
tokenizer = AutoTokenizer.from_pretrained(model_args.model_name_or_path, **tokenizer_kwargs)
else:
raise ValueError(
"You are instantiating a new tokenizer from scratch. This is not supported by this script."
@@ -259,6 +282,8 @@ def main():
from_tf=bool(".ckpt" in model_args.model_name_or_path),
config=config,
cache_dir=model_args.cache_dir,
revision=model_args.model_revision,
use_auth_token=True if model_args.use_auth_token else None,
)
else:
logger.info("Training new model from scratch")
@@ -358,14 +383,26 @@ def main():
# Training
if training_args.do_train:
model_path = (
model_args.model_name_or_path
if (model_args.model_name_or_path is not None and os.path.isdir(model_args.model_name_or_path))
else None
)
trainer.train(model_path=model_path)
if last_checkpoint is not None:
checkpoint = last_checkpoint
elif model_args.model_name_or_path is not None and os.path.isdir(model_args.model_name_or_path):
checkpoint = model_args.model_name_or_path
else:
checkpoint = None
train_result = trainer.train(resume_from_checkpoint=checkpoint)
trainer.save_model() # Saves the tokenizer too for easy upload
output_train_file = os.path.join(training_args.output_dir, "train_results.txt")
if trainer.is_world_process_zero():
with open(output_train_file, "w") as writer:
logger.info("***** Train results *****")
for key, value in sorted(train_result.metrics.items()):
logger.info(f" {key} = {value}")
writer.write(f"{key} = {value}\n")
# Need to save the state, since Trainer.save_model saves only the tokenizer with the model
trainer.state.save_to_json(os.path.join(training_args.output_dir, "trainer_state.json"))
# Evaluation
results = {}
if training_args.do_eval:
@@ -380,7 +417,7 @@ def main():
if trainer.is_world_process_zero():
with open(output_eval_file, "w") as writer:
logger.info("***** Eval results *****")
for key, value in results.items():
for key, value in sorted(results.items()):
logger.info(f" {key} = {value}")
writer.write(f"{key} = {value}\n")

View File

@@ -28,6 +28,7 @@ from transformers import (
AutoConfig,
AutoModelForMultipleChoice,
AutoTokenizer,
DataCollatorWithPadding,
EvalPrediction,
HfArgumentParser,
Trainer,
@@ -188,6 +189,9 @@ def main():
preds = np.argmax(p.predictions, axis=1)
return {"acc": simple_accuracy(preds, p.label_ids)}
# Data collator
data_collator = DataCollatorWithPadding(tokenizer, pad_to_multiple_of=8) if training_args.fp16 else None
# Initialize our Trainer
trainer = Trainer(
model=model,
@@ -195,6 +199,7 @@ def main():
train_dataset=train_dataset,
eval_dataset=eval_dataset,
compute_metrics=compute_metrics,
data_collator=data_collator,
)
# Training

View File

@@ -0,0 +1,579 @@
# coding=utf-8
# Copyright 2018 The Google AI Language Team Authors and The HuggingFace Inc. team.
# Copyright (c) 2018, NVIDIA CORPORATION. All rights reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
""" Multiple choice fine-tuning: utilities to work with multiple choice tasks of reading comprehension """
import csv
import glob
import json
import logging
import os
from dataclasses import dataclass
from enum import Enum
from typing import List, Optional
import tqdm
from filelock import FileLock
from transformers import PreTrainedTokenizer, is_tf_available, is_torch_available
logger = logging.getLogger(__name__)
@dataclass(frozen=True)
class InputExample:
"""
A single training/test example for multiple choice
Args:
example_id: Unique id for the example.
question: string. The untokenized text of the second sequence (question).
contexts: list of str. The untokenized text of the first sequence (context of corresponding question).
endings: list of str. multiple choice's options. Its length must be equal to contexts' length.
label: (Optional) string. The label of the example. This should be
specified for train and dev examples, but not for test examples.
"""
example_id: str
question: str
contexts: List[str]
endings: List[str]
label: Optional[str]
@dataclass(frozen=True)
class InputFeatures:
"""
A single set of features of data.
Property names are the same names as the corresponding inputs to a model.
"""
example_id: str
input_ids: List[List[int]]
attention_mask: Optional[List[List[int]]]
token_type_ids: Optional[List[List[int]]]
label: Optional[int]
class Split(Enum):
train = "train"
dev = "dev"
test = "test"
if is_torch_available():
import torch
from torch.utils.data.dataset import Dataset
class MultipleChoiceDataset(Dataset):
"""
This will be superseded by a framework-agnostic approach
soon.
"""
features: List[InputFeatures]
def __init__(
self,
data_dir: str,
tokenizer: PreTrainedTokenizer,
task: str,
max_seq_length: Optional[int] = None,
overwrite_cache=False,
mode: Split = Split.train,
):
processor = processors[task]()
cached_features_file = os.path.join(
data_dir,
"cached_{}_{}_{}_{}".format(
mode.value,
tokenizer.__class__.__name__,
str(max_seq_length),
task,
),
)
# Make sure only the first process in distributed training processes the dataset,
# and the others will use the cache.
lock_path = cached_features_file + ".lock"
with FileLock(lock_path):
if os.path.exists(cached_features_file) and not overwrite_cache:
logger.info(f"Loading features from cached file {cached_features_file}")
self.features = torch.load(cached_features_file)
else:
logger.info(f"Creating features from dataset file at {data_dir}")
label_list = processor.get_labels()
if mode == Split.dev:
examples = processor.get_dev_examples(data_dir)
elif mode == Split.test:
examples = processor.get_test_examples(data_dir)
else:
examples = processor.get_train_examples(data_dir)
logger.info("Training examples: %s", len(examples))
self.features = convert_examples_to_features(
examples,
label_list,
max_seq_length,
tokenizer,
)
logger.info("Saving features into cached file %s", cached_features_file)
torch.save(self.features, cached_features_file)
def __len__(self):
return len(self.features)
def __getitem__(self, i) -> InputFeatures:
return self.features[i]
if is_tf_available():
import tensorflow as tf
class TFMultipleChoiceDataset:
"""
This will be superseded by a framework-agnostic approach
soon.
"""
features: List[InputFeatures]
def __init__(
self,
data_dir: str,
tokenizer: PreTrainedTokenizer,
task: str,
max_seq_length: Optional[int] = 128,
overwrite_cache=False,
mode: Split = Split.train,
):
processor = processors[task]()
logger.info(f"Creating features from dataset file at {data_dir}")
label_list = processor.get_labels()
if mode == Split.dev:
examples = processor.get_dev_examples(data_dir)
elif mode == Split.test:
examples = processor.get_test_examples(data_dir)
else:
examples = processor.get_train_examples(data_dir)
logger.info("Training examples: %s", len(examples))
self.features = convert_examples_to_features(
examples,
label_list,
max_seq_length,
tokenizer,
)
def gen():
for (ex_index, ex) in tqdm.tqdm(enumerate(self.features), desc="convert examples to features"):
if ex_index % 10000 == 0:
logger.info("Writing example %d of %d" % (ex_index, len(examples)))
yield (
{
"example_id": 0,
"input_ids": ex.input_ids,
"attention_mask": ex.attention_mask,
"token_type_ids": ex.token_type_ids,
},
ex.label,
)
self.dataset = tf.data.Dataset.from_generator(
gen,
(
{
"example_id": tf.int32,
"input_ids": tf.int32,
"attention_mask": tf.int32,
"token_type_ids": tf.int32,
},
tf.int64,
),
(
{
"example_id": tf.TensorShape([]),
"input_ids": tf.TensorShape([None, None]),
"attention_mask": tf.TensorShape([None, None]),
"token_type_ids": tf.TensorShape([None, None]),
},
tf.TensorShape([]),
),
)
def get_dataset(self):
self.dataset = self.dataset.apply(tf.data.experimental.assert_cardinality(len(self.features)))
return self.dataset
def __len__(self):
return len(self.features)
def __getitem__(self, i) -> InputFeatures:
return self.features[i]
class DataProcessor:
"""Base class for data converters for multiple choice data sets."""
def get_train_examples(self, data_dir):
"""Gets a collection of `InputExample`s for the train set."""
raise NotImplementedError()
def get_dev_examples(self, data_dir):
"""Gets a collection of `InputExample`s for the dev set."""
raise NotImplementedError()
def get_test_examples(self, data_dir):
"""Gets a collection of `InputExample`s for the test set."""
raise NotImplementedError()
def get_labels(self):
"""Gets the list of labels for this data set."""
raise NotImplementedError()
class RaceProcessor(DataProcessor):
"""Processor for the RACE data set."""
def get_train_examples(self, data_dir):
"""See base class."""
logger.info("LOOKING AT {} train".format(data_dir))
high = os.path.join(data_dir, "train/high")
middle = os.path.join(data_dir, "train/middle")
high = self._read_txt(high)
middle = self._read_txt(middle)
return self._create_examples(high + middle, "train")
def get_dev_examples(self, data_dir):
"""See base class."""
logger.info("LOOKING AT {} dev".format(data_dir))
high = os.path.join(data_dir, "dev/high")
middle = os.path.join(data_dir, "dev/middle")
high = self._read_txt(high)
middle = self._read_txt(middle)
return self._create_examples(high + middle, "dev")
def get_test_examples(self, data_dir):
"""See base class."""
logger.info("LOOKING AT {} test".format(data_dir))
high = os.path.join(data_dir, "test/high")
middle = os.path.join(data_dir, "test/middle")
high = self._read_txt(high)
middle = self._read_txt(middle)
return self._create_examples(high + middle, "test")
def get_labels(self):
"""See base class."""
return ["0", "1", "2", "3"]
def _read_txt(self, input_dir):
lines = []
files = glob.glob(input_dir + "/*txt")
for file in tqdm.tqdm(files, desc="read files"):
with open(file, "r", encoding="utf-8") as fin:
data_raw = json.load(fin)
data_raw["race_id"] = file
lines.append(data_raw)
return lines
def _create_examples(self, lines, set_type):
"""Creates examples for the training and dev sets."""
examples = []
for (_, data_raw) in enumerate(lines):
race_id = "%s-%s" % (set_type, data_raw["race_id"])
article = data_raw["article"]
for i in range(len(data_raw["answers"])):
truth = str(ord(data_raw["answers"][i]) - ord("A"))
question = data_raw["questions"][i]
options = data_raw["options"][i]
examples.append(
InputExample(
example_id=race_id,
question=question,
contexts=[article, article, article, article], # this is not efficient but convenient
endings=[options[0], options[1], options[2], options[3]],
label=truth,
)
)
return examples
class SynonymProcessor(DataProcessor):
"""Processor for the Synonym data set."""
def get_train_examples(self, data_dir):
"""See base class."""
logger.info("LOOKING AT {} train".format(data_dir))
return self._create_examples(self._read_csv(os.path.join(data_dir, "mctrain.csv")), "train")
def get_dev_examples(self, data_dir):
"""See base class."""
logger.info("LOOKING AT {} dev".format(data_dir))
return self._create_examples(self._read_csv(os.path.join(data_dir, "mchp.csv")), "dev")
def get_test_examples(self, data_dir):
"""See base class."""
logger.info("LOOKING AT {} dev".format(data_dir))
return self._create_examples(self._read_csv(os.path.join(data_dir, "mctest.csv")), "test")
def get_labels(self):
"""See base class."""
return ["0", "1", "2", "3", "4"]
def _read_csv(self, input_file):
with open(input_file, "r", encoding="utf-8") as f:
return list(csv.reader(f))
def _create_examples(self, lines: List[List[str]], type: str):
"""Creates examples for the training and dev sets."""
examples = [
InputExample(
example_id=line[0],
question="", # in the swag dataset, the
# common beginning of each
# choice is stored in "sent2".
contexts=[line[1], line[1], line[1], line[1], line[1]],
endings=[line[2], line[3], line[4], line[5], line[6]],
label=line[7],
)
for line in lines # we skip the line with the column names
]
return examples
class SwagProcessor(DataProcessor):
"""Processor for the SWAG data set."""
def get_train_examples(self, data_dir):
"""See base class."""
logger.info("LOOKING AT {} train".format(data_dir))
return self._create_examples(self._read_csv(os.path.join(data_dir, "train.csv")), "train")
def get_dev_examples(self, data_dir):
"""See base class."""
logger.info("LOOKING AT {} dev".format(data_dir))
return self._create_examples(self._read_csv(os.path.join(data_dir, "val.csv")), "dev")
def get_test_examples(self, data_dir):
"""See base class."""
logger.info("LOOKING AT {} dev".format(data_dir))
raise ValueError(
"For swag testing, the input file does not contain a label column. It can not be tested in current code"
"setting!"
)
return self._create_examples(self._read_csv(os.path.join(data_dir, "test.csv")), "test")
def get_labels(self):
"""See base class."""
return ["0", "1", "2", "3"]
def _read_csv(self, input_file):
with open(input_file, "r", encoding="utf-8") as f:
return list(csv.reader(f))
def _create_examples(self, lines: List[List[str]], type: str):
"""Creates examples for the training and dev sets."""
if type == "train" and lines[0][-1] != "label":
raise ValueError("For training, the input file must contain a label column.")
examples = [
InputExample(
example_id=line[2],
question=line[5], # in the swag dataset, the
# common beginning of each
# choice is stored in "sent2".
contexts=[line[4], line[4], line[4], line[4]],
endings=[line[7], line[8], line[9], line[10]],
label=line[11],
)
for line in lines[1:] # we skip the line with the column names
]
return examples
class ArcProcessor(DataProcessor):
"""Processor for the ARC data set (request from allennlp)."""
def get_train_examples(self, data_dir):
"""See base class."""
logger.info("LOOKING AT {} train".format(data_dir))
return self._create_examples(self._read_json(os.path.join(data_dir, "train.jsonl")), "train")
def get_dev_examples(self, data_dir):
"""See base class."""
logger.info("LOOKING AT {} dev".format(data_dir))
return self._create_examples(self._read_json(os.path.join(data_dir, "dev.jsonl")), "dev")
def get_test_examples(self, data_dir):
logger.info("LOOKING AT {} test".format(data_dir))
return self._create_examples(self._read_json(os.path.join(data_dir, "test.jsonl")), "test")
def get_labels(self):
"""See base class."""
return ["0", "1", "2", "3"]
def _read_json(self, input_file):
with open(input_file, "r", encoding="utf-8") as fin:
lines = fin.readlines()
return lines
def _create_examples(self, lines, type):
"""Creates examples for the training and dev sets."""
# There are two types of labels. They should be normalized
def normalize(truth):
if truth in "ABCD":
return ord(truth) - ord("A")
elif truth in "1234":
return int(truth) - 1
else:
logger.info("truth ERROR! %s", str(truth))
return None
examples = []
three_choice = 0
four_choice = 0
five_choice = 0
other_choices = 0
# we deleted example which has more than or less than four choices
for line in tqdm.tqdm(lines, desc="read arc data"):
data_raw = json.loads(line.strip("\n"))
if len(data_raw["question"]["choices"]) == 3:
three_choice += 1
continue
elif len(data_raw["question"]["choices"]) == 5:
five_choice += 1
continue
elif len(data_raw["question"]["choices"]) != 4:
other_choices += 1
continue
four_choice += 1
truth = str(normalize(data_raw["answerKey"]))
assert truth != "None"
question_choices = data_raw["question"]
question = question_choices["stem"]
id = data_raw["id"]
options = question_choices["choices"]
if len(options) == 4:
examples.append(
InputExample(
example_id=id,
question=question,
contexts=[
options[0]["para"].replace("_", ""),
options[1]["para"].replace("_", ""),
options[2]["para"].replace("_", ""),
options[3]["para"].replace("_", ""),
],
endings=[options[0]["text"], options[1]["text"], options[2]["text"], options[3]["text"]],
label=truth,
)
)
if type == "train":
assert len(examples) > 1
assert examples[0].label is not None
logger.info("len examples: %s}", str(len(examples)))
logger.info("Three choices: %s", str(three_choice))
logger.info("Five choices: %s", str(five_choice))
logger.info("Other choices: %s", str(other_choices))
logger.info("four choices: %s", str(four_choice))
return examples
def convert_examples_to_features(
examples: List[InputExample],
label_list: List[str],
max_length: int,
tokenizer: PreTrainedTokenizer,
) -> List[InputFeatures]:
"""
Loads a data file into a list of `InputFeatures`
"""
label_map = {label: i for i, label in enumerate(label_list)}
features = []
for (ex_index, example) in tqdm.tqdm(enumerate(examples), desc="convert examples to features"):
if ex_index % 10000 == 0:
logger.info("Writing example %d of %d" % (ex_index, len(examples)))
choices_inputs = []
for ending_idx, (context, ending) in enumerate(zip(example.contexts, example.endings)):
text_a = context
if example.question.find("_") != -1:
# this is for cloze question
text_b = example.question.replace("_", ending)
else:
text_b = example.question + " " + ending
inputs = tokenizer(
text_a,
text_b,
add_special_tokens=True,
max_length=max_length,
padding="max_length",
truncation=True,
return_overflowing_tokens=True,
)
if "num_truncated_tokens" in inputs and inputs["num_truncated_tokens"] > 0:
logger.info(
"Attention! you are cropping tokens (swag task is ok). "
"If you are training ARC and RACE and you are poping question + options,"
"you need to try to use a bigger max seq length!"
)
choices_inputs.append(inputs)
label = label_map[example.label]
input_ids = [x["input_ids"] for x in choices_inputs]
attention_mask = (
[x["attention_mask"] for x in choices_inputs] if "attention_mask" in choices_inputs[0] else None
)
token_type_ids = (
[x["token_type_ids"] for x in choices_inputs] if "token_type_ids" in choices_inputs[0] else None
)
features.append(
InputFeatures(
example_id=example.example_id,
input_ids=input_ids,
attention_mask=attention_mask,
token_type_ids=token_type_ids,
label=label,
)
)
for f in features[:2]:
logger.info("*** Example ***")
logger.info("feature: %s" % f)
return features
processors = {"race": RaceProcessor, "swag": SwagProcessor, "arc": ArcProcessor, "syn": SynonymProcessor}
MULTIPLE_CHOICE_TASKS_NUM_LABELS = {"race", 4, "swag", 4, "arc", 4, "syn", 5}

View File

@@ -19,3 +19,4 @@ pytest
conllu
sentencepiece != 0.1.92
protobuf
ray

View File

@@ -23,7 +23,14 @@ from dataclasses import dataclass, field
from typing import Optional
import transformers
from transformers import AutoConfig, AutoModelForQuestionAnswering, AutoTokenizer, HfArgumentParser, SquadDataset
from transformers import (
AutoConfig,
AutoModelForQuestionAnswering,
AutoTokenizer,
DataCollatorWithPadding,
HfArgumentParser,
SquadDataset,
)
from transformers import SquadDataTrainingArguments as DataTrainingArguments
from transformers import Trainer, TrainingArguments
from transformers.trainer_utils import is_main_process
@@ -145,12 +152,16 @@ def main():
else None
)
# Data collator
data_collator = DataCollatorWithPadding(tokenizer, pad_to_multiple_of=8) if training_args.fp16 else None
# Initialize our Trainer
trainer = Trainer(
model=model,
args=training_args,
train_dataset=train_dataset,
eval_dataset=eval_dataset,
data_collator=data_collator,
)
# Training

View File

@@ -30,6 +30,7 @@ from transformers import (
AutoConfig,
AutoModelForTokenClassification,
AutoTokenizer,
DataCollatorWithPadding,
EvalPrediction,
HfArgumentParser,
Trainer,
@@ -237,6 +238,9 @@ def main():
"f1": f1_score(out_label_list, preds_list),
}
# Data collator
data_collator = DataCollatorWithPadding(tokenizer, pad_to_multiple_of=8) if training_args.fp16 else None
# Initialize our Trainer
trainer = Trainer(
model=model,
@@ -244,6 +248,7 @@ def main():
train_dataset=train_dataset,
eval_dataset=eval_dataset,
compute_metrics=compute_metrics,
data_collator=data_collator,
)
# Training

View File

@@ -16,27 +16,20 @@ limitations under the License.
## Multiple Choice
Based on the script [`run_multiple_choice.py`]().
Based on the script [`run_swag.py`]().
#### Fine-tuning on SWAG
Download [swag](https://github.com/rowanz/swagaf/tree/master/data) data
```bash
#training on 4 tesla V100(16GB) GPUS
export SWAG_DIR=/path/to/swag_data_dir
python ./examples/multiple-choice/run_multiple_choice.py \
--task_name swag \
python examples/multiple-choice/run_swag.py \
--model_name_or_path roberta-base \
--do_train \
--do_eval \
--data_dir $SWAG_DIR \
--learning_rate 5e-5 \
--num_train_epochs 3 \
--max_seq_length 80 \
--output_dir models_bert/swag_base \
--output_dir /tmp/swag_base \
--per_gpu_eval_batch_size=16 \
--per_device_train_batch_size=16 \
--gradient_accumulation_steps 2 \
--overwrite_output
```
Training with the defined hyper-parameters yields the following results:

View File

@@ -0,0 +1,388 @@
# coding=utf-8
# Copyright The HuggingFace Team and The HuggingFace Inc. team. All rights reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
"""
Fine-tuning the library models for multiple choice.
"""
# You can also adapt this script on your own multiple choice task. Pointers for this are left as comments.
import logging
import os
import sys
from dataclasses import dataclass, field
from typing import Optional, Union
import numpy as np
import torch
from datasets import load_dataset
import transformers
from transformers import (
AutoConfig,
AutoModelForMultipleChoice,
AutoTokenizer,
HfArgumentParser,
Trainer,
TrainingArguments,
default_data_collator,
set_seed,
)
from transformers.tokenization_utils_base import PaddingStrategy, PreTrainedTokenizerBase
from transformers.trainer_utils import get_last_checkpoint, is_main_process
logger = logging.getLogger(__name__)
@dataclass
class ModelArguments:
"""
Arguments pertaining to which model/config/tokenizer we are going to fine-tune from.
"""
model_name_or_path: str = field(
metadata={"help": "Path to pretrained model or model identifier from huggingface.co/models"}
)
config_name: Optional[str] = field(
default=None, metadata={"help": "Pretrained config name or path if not the same as model_name"}
)
tokenizer_name: Optional[str] = field(
default=None, metadata={"help": "Pretrained tokenizer name or path if not the same as model_name"}
)
cache_dir: Optional[str] = field(
default=None,
metadata={"help": "Where do you want to store the pretrained models downloaded from huggingface.co"},
)
use_fast_tokenizer: bool = field(
default=True,
metadata={"help": "Whether to use one of the fast tokenizer (backed by the tokenizers library) or not."},
)
model_revision: str = field(
default="main",
metadata={"help": "The specific model version to use (can be a branch name, tag name or commit id)."},
)
use_auth_token: bool = field(
default=False,
metadata={
"help": "Will use the token generated when running `transformers-cli login` (necessary to use this script "
"with private models)."
},
)
@dataclass
class DataTrainingArguments:
"""
Arguments pertaining to what data we are going to input our model for training and eval.
"""
train_file: Optional[str] = field(default=None, metadata={"help": "The input training data file (a text file)."})
validation_file: Optional[str] = field(
default=None,
metadata={"help": "An optional input evaluation data file to evaluate the perplexity on (a text file)."},
)
overwrite_cache: bool = field(
default=False, metadata={"help": "Overwrite the cached training and evaluation sets"}
)
preprocessing_num_workers: Optional[int] = field(
default=None,
metadata={"help": "The number of processes to use for the preprocessing."},
)
max_seq_length: int = field(
default=None,
metadata={
"help": "The maximum total input sequence length after tokenization. If passed, sequences longer "
"than this will be truncated, sequences shorter will be padded."
},
)
pad_to_max_length: bool = field(
default=False,
metadata={
"help": "Whether to pad all samples to the maximum sentence length. "
"If False, will pad the samples dynamically when batching to the maximum length in the batch. More "
"efficient on GPU but very bad for TPU."
},
)
def __post_init__(self):
if self.train_file is not None:
extension = self.train_file.split(".")[-1]
assert extension in ["csv", "json"], "`train_file` should be a csv or a json file."
if self.validation_file is not None:
extension = self.validation_file.split(".")[-1]
assert extension in ["csv", "json"], "`validation_file` should be a csv or a json file."
@dataclass
class DataCollatorForMultipleChoice:
"""
Data collator that will dynamically pad the inputs for multiple choice received.
Args:
tokenizer (:class:`~transformers.PreTrainedTokenizer` or :class:`~transformers.PreTrainedTokenizerFast`):
The tokenizer used for encoding the data.
padding (:obj:`bool`, :obj:`str` or :class:`~transformers.tokenization_utils_base.PaddingStrategy`, `optional`, defaults to :obj:`True`):
Select a strategy to pad the returned sequences (according to the model's padding side and padding index)
among:
* :obj:`True` or :obj:`'longest'`: Pad to the longest sequence in the batch (or no padding if only a single
sequence if provided).
* :obj:`'max_length'`: Pad to a maximum length specified with the argument :obj:`max_length` or to the
maximum acceptable input length for the model if that argument is not provided.
* :obj:`False` or :obj:`'do_not_pad'` (default): No padding (i.e., can output a batch with sequences of
different lengths).
max_length (:obj:`int`, `optional`):
Maximum length of the returned list and optionally padding length (see above).
pad_to_multiple_of (:obj:`int`, `optional`):
If set will pad the sequence to a multiple of the provided value.
This is especially useful to enable the use of Tensor Cores on NVIDIA hardware with compute capability >=
7.5 (Volta).
"""
tokenizer: PreTrainedTokenizerBase
padding: Union[bool, str, PaddingStrategy] = True
max_length: Optional[int] = None
pad_to_multiple_of: Optional[int] = None
def __call__(self, features):
label_name = "label" if "label" in features[0].keys() else "labels"
labels = [feature.pop(label_name) for feature in features]
batch_size = len(features)
num_choices = len(features[0]["input_ids"])
flattened_features = [
[{k: v[i] for k, v in feature.items()} for i in range(num_choices)] for feature in features
]
flattened_features = sum(flattened_features, [])
batch = self.tokenizer.pad(
flattened_features,
padding=self.padding,
max_length=self.max_length,
pad_to_multiple_of=self.pad_to_multiple_of,
return_tensors="pt",
)
# Un-flatten
batch = {k: v.view(batch_size, num_choices, -1) for k, v in batch.items()}
# Add back labels
batch["labels"] = torch.tensor(labels, dtype=torch.int64)
return batch
def main():
# See all possible arguments in src/transformers/training_args.py
# or by passing the --help flag to this script.
# We now keep distinct sets of args, for a cleaner separation of concerns.
parser = HfArgumentParser((ModelArguments, DataTrainingArguments, TrainingArguments))
if len(sys.argv) == 2 and sys.argv[1].endswith(".json"):
# If we pass only one argument to the script and it's the path to a json file,
# let's parse it to get our arguments.
model_args, data_args, training_args = parser.parse_json_file(json_file=os.path.abspath(sys.argv[1]))
else:
model_args, data_args, training_args = parser.parse_args_into_dataclasses()
# Detecting last checkpoint.
last_checkpoint = None
if os.path.isdir(training_args.output_dir) and training_args.do_train and not training_args.overwrite_output_dir:
last_checkpoint = get_last_checkpoint(training_args.output_dir)
if last_checkpoint is None and len(os.listdir(training_args.output_dir)) > 0:
raise ValueError(
f"Output directory ({training_args.output_dir}) already exists and is not empty. "
"Use --overwrite_output_dir to overcome."
)
elif last_checkpoint is not None:
logger.info(
f"Checkpoint detected, resuming training at {last_checkpoint}. To avoid this behavior, change "
"the `--output_dir` or add `--overwrite_output_dir` to train from scratch."
)
# Setup logging
logging.basicConfig(
format="%(asctime)s - %(levelname)s - %(name)s - %(message)s",
datefmt="%m/%d/%Y %H:%M:%S",
handlers=[logging.StreamHandler(sys.stdout)],
)
logger.setLevel(logging.INFO if is_main_process(training_args.local_rank) else logging.WARN)
# Log on each process the small summary:
logger.warning(
f"Process rank: {training_args.local_rank}, device: {training_args.device}, n_gpu: {training_args.n_gpu}"
+ f"distributed training: {bool(training_args.local_rank != -1)}, 16-bits training: {training_args.fp16}"
)
# Set the verbosity to info of the Transformers logger (on main process only):
if is_main_process(training_args.local_rank):
transformers.utils.logging.set_verbosity_info()
logger.info("Training/evaluation parameters %s", training_args)
# Set seed before initializing model.
set_seed(training_args.seed)
# Get the datasets: you can either provide your own CSV/JSON/TXT training and evaluation files (see below)
# or just provide the name of one of the public datasets available on the hub at https://huggingface.co/datasets/
# (the dataset will be downloaded automatically from the datasets Hub).
# For CSV/JSON files, this script will use the column called 'text' or the first column if no column called
# 'text' is found. You can easily tweak this behavior (see below).
# In distributed training, the load_dataset function guarantee that only one local process can concurrently
# download the dataset.
if data_args.train_file is not None or data_args.validation_file is not None:
data_files = {}
if data_args.train_file is not None:
data_files["train"] = data_args.train_file
if data_args.validation_file is not None:
data_files["validation"] = data_args.validation_file
extension = data_args.train_file.split(".")[-1]
datasets = load_dataset(extension, data_files=data_files)
else:
# Downloading and loading the swag dataset from the hub.
datasets = load_dataset("swag", "regular")
# See more about loading any type of standard or custom dataset (from files, python dict, pandas DataFrame, etc) at
# https://huggingface.co/docs/datasets/loading_datasets.html.
# Load pretrained model and tokenizer
# Distributed training:
# The .from_pretrained methods guarantee that only one local process can concurrently
# download model & vocab.
config = AutoConfig.from_pretrained(
model_args.config_name if model_args.config_name else model_args.model_name_or_path,
cache_dir=model_args.cache_dir,
revision=model_args.model_revision,
use_auth_token=True if model_args.use_auth_token else None,
)
tokenizer = AutoTokenizer.from_pretrained(
model_args.tokenizer_name if model_args.tokenizer_name else model_args.model_name_or_path,
cache_dir=model_args.cache_dir,
use_fast=model_args.use_fast_tokenizer,
revision=model_args.model_revision,
use_auth_token=True if model_args.use_auth_token else None,
)
model = AutoModelForMultipleChoice.from_pretrained(
model_args.model_name_or_path,
from_tf=bool(".ckpt" in model_args.model_name_or_path),
config=config,
cache_dir=model_args.cache_dir,
revision=model_args.model_revision,
use_auth_token=True if model_args.use_auth_token else None,
)
# When using your own dataset or a different dataset from swag, you will probably need to change this.
ending_names = [f"ending{i}" for i in range(4)]
context_name = "sent1"
question_header_name = "sent2"
# Preprocessing the datasets.
def preprocess_function(examples):
first_sentences = [[context] * 4 for context in examples[context_name]]
question_headers = examples[question_header_name]
second_sentences = [
[f"{header} {examples[end][i]}" for end in ending_names] for i, header in enumerate(question_headers)
]
# Flatten out
first_sentences = sum(first_sentences, [])
second_sentences = sum(second_sentences, [])
# Tokenize
tokenized_examples = tokenizer(
first_sentences,
second_sentences,
truncation=True,
max_length=data_args.max_seq_length,
padding="max_length" if data_args.pad_to_max_length else False,
)
# Un-flatten
return {k: [v[i : i + 4] for i in range(0, len(v), 4)] for k, v in tokenized_examples.items()}
tokenized_datasets = datasets.map(
preprocess_function,
batched=True,
num_proc=data_args.preprocessing_num_workers,
load_from_cache_file=not data_args.overwrite_cache,
)
# Data collator
data_collator = (
default_data_collator
if data_args.pad_to_max_length
else DataCollatorForMultipleChoice(tokenizer=tokenizer, pad_to_multiple_of=8 if training_args.fp16 else None)
)
# Metric
def compute_metrics(eval_predictions):
predictions, label_ids = eval_predictions
preds = np.argmax(predictions, axis=1)
return {"accuracy": (preds == label_ids).astype(np.float32).mean().item()}
# Initialize our Trainer
trainer = Trainer(
model=model,
args=training_args,
train_dataset=tokenized_datasets["train"] if training_args.do_train else None,
eval_dataset=tokenized_datasets["validation"] if training_args.do_eval else None,
tokenizer=tokenizer,
data_collator=data_collator,
compute_metrics=compute_metrics,
)
# Training
if training_args.do_train:
if last_checkpoint is not None:
checkpoint = last_checkpoint
elif os.path.isdir(model_args.model_name_or_path):
checkpoint = model_args.model_name_or_path
else:
checkpoint = None
train_result = trainer.train(resume_from_checkpoint=checkpoint)
trainer.save_model() # Saves the tokenizer too for easy upload
output_train_file = os.path.join(training_args.output_dir, "train_results.txt")
if trainer.is_world_process_zero():
with open(output_train_file, "w") as writer:
logger.info("***** Train results *****")
for key, value in sorted(train_result.metrics.items()):
logger.info(f" {key} = {value}")
writer.write(f"{key} = {value}\n")
# Need to save the state, since Trainer.save_model saves only the tokenizer with the model
trainer.state.save_to_json(os.path.join(training_args.output_dir, "trainer_state.json"))
# Evaluation
results = {}
if training_args.do_eval:
logger.info("*** Evaluate ***")
results = trainer.evaluate()
output_eval_file = os.path.join(training_args.output_dir, "eval_results_swag.txt")
if trainer.is_world_process_zero():
with open(output_eval_file, "w") as writer:
logger.info("***** Eval results *****")
for key, value in sorted(results.items()):
logger.info(f" {key} = {value}")
writer.write(f"{key} = {value}\n")
return results
def _mp_fn(index):
# For xla_spawn (TPUs)
main()
if __name__ == "__main__":
main()

View File

@@ -23,8 +23,7 @@ uses special features of those tokenizers. You can check if your favorite model
[this table](https://huggingface.co/transformers/index.html#bigtable), if it doesn't you can still use the old version
of the script.
The old version of this script can be found [here](https://github.com/huggingface/transformers/blob/master/examples/contrib/legacy/question-answering/run_squad.py).
The old version of this script can be found [here](https://github.com/huggingface/transformers/tree/master/examples/legacy/question-answering).
#### Fine-tuning BERT on SQuAD1.0
This example code fine-tunes BERT on the SQuAD1.0 dataset. It runs in 24 min (with BERT-base) or 68 min (with BERT-large)

View File

@@ -1 +1 @@
datasets >= 1.1.3
datasets >= 1.2.1

View File

@@ -39,7 +39,7 @@ from transformers import (
default_data_collator,
set_seed,
)
from transformers.trainer_utils import is_main_process
from transformers.trainer_utils import get_last_checkpoint, is_main_process
from utils_qa import postprocess_qa_predictions
@@ -65,6 +65,17 @@ class ModelArguments:
default=None,
metadata={"help": "Path to directory to store the pretrained models downloaded from huggingface.co"},
)
model_revision: str = field(
default="main",
metadata={"help": "The specific model version to use (can be a branch name, tag name or commit id)."},
)
use_auth_token: bool = field(
default=False,
metadata={
"help": "Will use the token generated when running `transformers-cli login` (necessary to use this script "
"with private models)."
},
)
@dataclass
@@ -158,21 +169,26 @@ def main():
else:
model_args, data_args, training_args = parser.parse_args_into_dataclasses()
if (
os.path.exists(training_args.output_dir)
and os.listdir(training_args.output_dir)
and training_args.do_train
and not training_args.overwrite_output_dir
):
raise ValueError(
f"Output directory ({training_args.output_dir}) already exists and is not empty."
"Use --overwrite_output_dir to overcome."
)
# Detecting last checkpoint.
last_checkpoint = None
if os.path.isdir(training_args.output_dir) and training_args.do_train and not training_args.overwrite_output_dir:
last_checkpoint = get_last_checkpoint(training_args.output_dir)
if last_checkpoint is None and len(os.listdir(training_args.output_dir)) > 0:
raise ValueError(
f"Output directory ({training_args.output_dir}) already exists and is not empty. "
"Use --overwrite_output_dir to overcome."
)
elif last_checkpoint is not None:
logger.info(
f"Checkpoint detected, resuming training at {last_checkpoint}. To avoid this behavior, change "
"the `--output_dir` or add `--overwrite_output_dir` to train from scratch."
)
# Setup logging
logging.basicConfig(
format="%(asctime)s - %(levelname)s - %(name)s - %(message)s",
datefmt="%m/%d/%Y %H:%M:%S",
handlers=[logging.StreamHandler(sys.stdout)],
)
logger.setLevel(logging.INFO if is_main_process(training_args.local_rank) else logging.WARN)
@@ -220,17 +236,23 @@ def main():
config = AutoConfig.from_pretrained(
model_args.config_name if model_args.config_name else model_args.model_name_or_path,
cache_dir=model_args.cache_dir,
revision=model_args.model_revision,
use_auth_token=True if model_args.use_auth_token else None,
)
tokenizer = AutoTokenizer.from_pretrained(
model_args.tokenizer_name if model_args.tokenizer_name else model_args.model_name_or_path,
cache_dir=model_args.cache_dir,
use_fast=True,
revision=model_args.model_revision,
use_auth_token=True if model_args.use_auth_token else None,
)
model = AutoModelForQuestionAnswering.from_pretrained(
model_args.model_name_or_path,
from_tf=bool(".ckpt" in model_args.model_name_or_path),
config=config,
cache_dir=model_args.cache_dir,
revision=model_args.model_revision,
use_auth_token=True if model_args.use_auth_token else None,
)
# Tokenizer check: this script requires a fast tokenizer.
@@ -390,7 +412,11 @@ def main():
# Data collator
# We have already padded to max length if the corresponding flag is True, otherwise we need to pad in the data
# collator.
data_collator = default_data_collator if data_args.pad_to_max_length else DataCollatorWithPadding(tokenizer)
data_collator = (
default_data_collator
if data_args.pad_to_max_length
else DataCollatorWithPadding(tokenizer, pad_to_multiple_of=8 if training_args.fp16 else None)
)
# Post-processing:
def post_processing_function(examples, features, predictions):
@@ -416,9 +442,7 @@ def main():
references = [{"id": ex["id"], "answers": ex[answer_column_name]} for ex in datasets["validation"]]
return EvalPrediction(predictions=formatted_predictions, label_ids=references)
# TODO: Once the fix lands in a Datasets release, remove the _local here and the squad_v2_local folder.
current_dir = os.path.sep.join(os.path.join(__file__).split(os.path.sep)[:-1])
metric = load_metric(os.path.join(current_dir, "squad_v2_local") if data_args.version_2_with_negative else "squad")
metric = load_metric("squad_v2" if data_args.version_2_with_negative else "squad")
def compute_metrics(p: EvalPrediction):
return metric.compute(predictions=p.predictions, references=p.label_ids)
@@ -438,11 +462,26 @@ def main():
# Training
if training_args.do_train:
trainer.train(
model_path=model_args.model_name_or_path if os.path.isdir(model_args.model_name_or_path) else None
)
if last_checkpoint is not None:
checkpoint = last_checkpoint
elif os.path.isdir(model_args.model_name_or_path):
checkpoint = model_args.model_name_or_path
else:
checkpoint = None
train_result = trainer.train(resume_from_checkpoint=checkpoint)
trainer.save_model() # Saves the tokenizer too for easy upload
output_train_file = os.path.join(training_args.output_dir, "train_results.txt")
if trainer.is_world_process_zero():
with open(output_train_file, "w") as writer:
logger.info("***** Train results *****")
for key, value in sorted(train_result.metrics.items()):
logger.info(f" {key} = {value}")
writer.write(f"{key} = {value}\n")
# Need to save the state, since Trainer.save_model saves only the tokenizer with the model
trainer.state.save_to_json(os.path.join(training_args.output_dir, "trainer_state.json"))
# Evaluation
results = {}
if training_args.do_eval:
@@ -453,7 +492,7 @@ def main():
if trainer.is_world_process_zero():
with open(output_eval_file, "w") as writer:
logger.info("***** Eval results *****")
for key, value in results.items():
for key, value in sorted(results.items()):
logger.info(f" {key} = {value}")
writer.write(f"{key} = {value}\n")

View File

@@ -38,7 +38,7 @@ from transformers import (
default_data_collator,
set_seed,
)
from transformers.trainer_utils import is_main_process
from transformers.trainer_utils import get_last_checkpoint, is_main_process
from utils_qa import postprocess_qa_predictions_with_beam_search
@@ -64,9 +64,16 @@ class ModelArguments:
default=None,
metadata={"help": "Where do you want to store the pretrained models downloaded from huggingface.co"},
)
use_fast_tokenizer: bool = field(
default=True,
metadata={"help": "Whether to use one of the fast tokenizer (backed by the tokenizers library) or not."},
model_revision: str = field(
default="main",
metadata={"help": "The specific model version to use (can be a branch name, tag name or commit id)."},
)
use_auth_token: bool = field(
default=False,
metadata={
"help": "Will use the token generated when running `transformers-cli login` (necessary to use this script "
"with private models)."
},
)
@@ -161,21 +168,26 @@ def main():
else:
model_args, data_args, training_args = parser.parse_args_into_dataclasses()
if (
os.path.exists(training_args.output_dir)
and os.listdir(training_args.output_dir)
and training_args.do_train
and not training_args.overwrite_output_dir
):
raise ValueError(
f"Output directory ({training_args.output_dir}) already exists and is not empty."
"Use --overwrite_output_dir to overcome."
)
# Detecting last checkpoint.
last_checkpoint = None
if os.path.isdir(training_args.output_dir) and training_args.do_train and not training_args.overwrite_output_dir:
last_checkpoint = get_last_checkpoint(training_args.output_dir)
if last_checkpoint is None and len(os.listdir(training_args.output_dir)) > 0:
raise ValueError(
f"Output directory ({training_args.output_dir}) already exists and is not empty. "
"Use --overwrite_output_dir to overcome."
)
elif last_checkpoint is not None:
logger.info(
f"Checkpoint detected, resuming training at {last_checkpoint}. To avoid this behavior, change "
"the `--output_dir` or add `--overwrite_output_dir` to train from scratch."
)
# Setup logging
logging.basicConfig(
format="%(asctime)s - %(levelname)s - %(name)s - %(message)s",
datefmt="%m/%d/%Y %H:%M:%S",
handlers=[logging.StreamHandler(sys.stdout)],
)
logger.setLevel(logging.INFO if is_main_process(training_args.local_rank) else logging.WARN)
@@ -223,16 +235,22 @@ def main():
config = XLNetConfig.from_pretrained(
model_args.config_name if model_args.config_name else model_args.model_name_or_path,
cache_dir=model_args.cache_dir,
revision=model_args.model_revision,
use_auth_token=True if model_args.use_auth_token else None,
)
tokenizer = XLNetTokenizerFast.from_pretrained(
model_args.tokenizer_name if model_args.tokenizer_name else model_args.model_name_or_path,
cache_dir=model_args.cache_dir,
revision=model_args.model_revision,
use_auth_token=True if model_args.use_auth_token else None,
)
model = XLNetForQuestionAnswering.from_pretrained(
model_args.model_name_or_path,
from_tf=bool(".ckpt" in model_args.model_name_or_path),
config=config,
cache_dir=model_args.cache_dir,
revision=model_args.model_revision,
use_auth_token=True if model_args.use_auth_token else None,
)
# Preprocessing the datasets.
@@ -431,7 +449,11 @@ def main():
# Data collator
# We have already padded to max length if the corresponding flag is True, otherwise we need to pad in the data
# collator.
data_collator = default_data_collator if data_args.pad_to_max_length else DataCollatorWithPadding(tokenizer)
data_collator = (
default_data_collator
if data_args.pad_to_max_length
else DataCollatorWithPadding(tokenizer, pad_to_multiple_of=8 if training_args.fp16 else None)
)
# Post-processing:
def post_processing_function(examples, features, predictions):
@@ -459,9 +481,7 @@ def main():
references = [{"id": ex["id"], "answers": ex[answer_column_name]} for ex in datasets["validation"]]
return EvalPrediction(predictions=formatted_predictions, label_ids=references)
# TODO: Once the fix lands in a Datasets release, remove the _local here and the squad_v2_local folder.
current_dir = os.path.sep.join(os.path.join(__file__).split(os.path.sep)[:-1])
metric = load_metric(os.path.join(current_dir, "squad_v2_local") if data_args.version_2_with_negative else "squad")
metric = load_metric("squad_v2" if data_args.version_2_with_negative else "squad")
def compute_metrics(p: EvalPrediction):
return metric.compute(predictions=p.predictions, references=p.label_ids)
@@ -481,11 +501,26 @@ def main():
# Training
if training_args.do_train:
trainer.train(
model_path=model_args.model_name_or_path if os.path.isdir(model_args.model_name_or_path) else None
)
if last_checkpoint is not None:
checkpoint = last_checkpoint
elif os.path.isdir(model_args.model_name_or_path):
checkpoint = model_args.model_name_or_path
else:
checkpoint = None
train_result = trainer.train(resume_from_checkpoint=checkpoint)
trainer.save_model() # Saves the tokenizer too for easy upload
output_train_file = os.path.join(training_args.output_dir, "train_results.txt")
if trainer.is_world_process_zero():
with open(output_train_file, "w") as writer:
logger.info("***** Train results *****")
for key, value in sorted(train_result.metrics.items()):
logger.info(f" {key} = {value}")
writer.write(f"{key} = {value}\n")
# Need to save the state, since Trainer.save_model saves only the tokenizer with the model
trainer.state.save_to_json(os.path.join(training_args.output_dir, "trainer_state.json"))
# Evaluation
results = {}
if training_args.do_eval:
@@ -496,7 +531,7 @@ def main():
if trainer.is_world_process_zero():
with open(output_eval_file, "w") as writer:
logger.info("***** Eval results *****")
for key, value in results.items():
for key, value in sorted(results.items()):
logger.info(f" {key} = {value}")
writer.write(f"{key} = {value}\n")

View File

@@ -1,322 +0,0 @@
"""Official evaluation script for SQuAD version 2.0.
In addition to basic functionality, we also compute additional statistics and
plot precision-recall curves if an additional na_prob.json file is provided.
This file is expected to map question ID's to the model's predicted probability
that a question is unanswerable.
"""
import argparse
import collections
import json
import os
import re
import string
import sys
import numpy as np
OPTS = None
def parse_args():
parser = argparse.ArgumentParser("Official evaluation script for SQuAD version 2.0.")
parser.add_argument("data_file", metavar="data.json", help="Input data JSON file.")
parser.add_argument("pred_file", metavar="pred.json", help="Model predictions.")
parser.add_argument(
"--out-file", "-o", metavar="eval.json", help="Write accuracy metrics to file (default is stdout)."
)
parser.add_argument(
"--na-prob-file", "-n", metavar="na_prob.json", help="Model estimates of probability of no answer."
)
parser.add_argument(
"--na-prob-thresh",
"-t",
type=float,
default=1.0,
help='Predict "" if no-answer probability exceeds this (default = 1.0).',
)
parser.add_argument(
"--out-image-dir", "-p", metavar="out_images", default=None, help="Save precision-recall curves to directory."
)
parser.add_argument("--verbose", "-v", action="store_true")
if len(sys.argv) == 1:
parser.print_help()
sys.exit(1)
return parser.parse_args()
def make_qid_to_has_ans(dataset):
qid_to_has_ans = {}
for article in dataset:
for p in article["paragraphs"]:
for qa in p["qas"]:
qid_to_has_ans[qa["id"]] = bool(qa["answers"]["text"])
return qid_to_has_ans
def normalize_answer(s):
"""Lower text and remove punctuation, articles and extra whitespace."""
def remove_articles(text):
regex = re.compile(r"\b(a|an|the)\b", re.UNICODE)
return re.sub(regex, " ", text)
def white_space_fix(text):
return " ".join(text.split())
def remove_punc(text):
exclude = set(string.punctuation)
return "".join(ch for ch in text if ch not in exclude)
def lower(text):
return text.lower()
return white_space_fix(remove_articles(remove_punc(lower(s))))
def get_tokens(s):
if not s:
return []
return normalize_answer(s).split()
def compute_exact(a_gold, a_pred):
return int(normalize_answer(a_gold) == normalize_answer(a_pred))
def compute_f1(a_gold, a_pred):
gold_toks = get_tokens(a_gold)
pred_toks = get_tokens(a_pred)
common = collections.Counter(gold_toks) & collections.Counter(pred_toks)
num_same = sum(common.values())
if len(gold_toks) == 0 or len(pred_toks) == 0:
# If either is no-answer, then F1 is 1 if they agree, 0 otherwise
return int(gold_toks == pred_toks)
if num_same == 0:
return 0
precision = 1.0 * num_same / len(pred_toks)
recall = 1.0 * num_same / len(gold_toks)
f1 = (2 * precision * recall) / (precision + recall)
return f1
def get_raw_scores(dataset, preds):
exact_scores = {}
f1_scores = {}
for article in dataset:
for p in article["paragraphs"]:
for qa in p["qas"]:
qid = qa["id"]
gold_answers = [t for t in qa["answers"]["text"] if normalize_answer(t)]
if not gold_answers:
# For unanswerable questions, only correct answer is empty string
gold_answers = [""]
if qid not in preds:
print("Missing prediction for %s" % qid)
continue
a_pred = preds[qid]
# Take max over all gold answers
exact_scores[qid] = max(compute_exact(a, a_pred) for a in gold_answers)
f1_scores[qid] = max(compute_f1(a, a_pred) for a in gold_answers)
return exact_scores, f1_scores
def apply_no_ans_threshold(scores, na_probs, qid_to_has_ans, na_prob_thresh):
new_scores = {}
for qid, s in scores.items():
pred_na = na_probs[qid] > na_prob_thresh
if pred_na:
new_scores[qid] = float(not qid_to_has_ans[qid])
else:
new_scores[qid] = s
return new_scores
def make_eval_dict(exact_scores, f1_scores, qid_list=None):
if not qid_list:
total = len(exact_scores)
return collections.OrderedDict(
[
("exact", 100.0 * sum(exact_scores.values()) / total),
("f1", 100.0 * sum(f1_scores.values()) / total),
("total", total),
]
)
else:
total = len(qid_list)
return collections.OrderedDict(
[
("exact", 100.0 * sum(exact_scores[k] for k in qid_list) / total),
("f1", 100.0 * sum(f1_scores[k] for k in qid_list) / total),
("total", total),
]
)
def merge_eval(main_eval, new_eval, prefix):
for k in new_eval:
main_eval["%s_%s" % (prefix, k)] = new_eval[k]
def plot_pr_curve(precisions, recalls, out_image, title):
plt.step(recalls, precisions, color="b", alpha=0.2, where="post")
plt.fill_between(recalls, precisions, step="post", alpha=0.2, color="b")
plt.xlabel("Recall")
plt.ylabel("Precision")
plt.xlim([0.0, 1.05])
plt.ylim([0.0, 1.05])
plt.title(title)
plt.savefig(out_image)
plt.clf()
def make_precision_recall_eval(scores, na_probs, num_true_pos, qid_to_has_ans, out_image=None, title=None):
qid_list = sorted(na_probs, key=lambda k: na_probs[k])
true_pos = 0.0
cur_p = 1.0
cur_r = 0.0
precisions = [1.0]
recalls = [0.0]
avg_prec = 0.0
for i, qid in enumerate(qid_list):
if qid_to_has_ans[qid]:
true_pos += scores[qid]
cur_p = true_pos / float(i + 1)
cur_r = true_pos / float(num_true_pos)
if i == len(qid_list) - 1 or na_probs[qid] != na_probs[qid_list[i + 1]]:
# i.e., if we can put a threshold after this point
avg_prec += cur_p * (cur_r - recalls[-1])
precisions.append(cur_p)
recalls.append(cur_r)
if out_image:
plot_pr_curve(precisions, recalls, out_image, title)
return {"ap": 100.0 * avg_prec}
def run_precision_recall_analysis(main_eval, exact_raw, f1_raw, na_probs, qid_to_has_ans, out_image_dir):
if out_image_dir and not os.path.exists(out_image_dir):
os.makedirs(out_image_dir)
num_true_pos = sum(1 for v in qid_to_has_ans.values() if v)
if num_true_pos == 0:
return
pr_exact = make_precision_recall_eval(
exact_raw,
na_probs,
num_true_pos,
qid_to_has_ans,
out_image=os.path.join(out_image_dir, "pr_exact.png"),
title="Precision-Recall curve for Exact Match score",
)
pr_f1 = make_precision_recall_eval(
f1_raw,
na_probs,
num_true_pos,
qid_to_has_ans,
out_image=os.path.join(out_image_dir, "pr_f1.png"),
title="Precision-Recall curve for F1 score",
)
oracle_scores = {k: float(v) for k, v in qid_to_has_ans.items()}
pr_oracle = make_precision_recall_eval(
oracle_scores,
na_probs,
num_true_pos,
qid_to_has_ans,
out_image=os.path.join(out_image_dir, "pr_oracle.png"),
title="Oracle Precision-Recall curve (binary task of HasAns vs. NoAns)",
)
merge_eval(main_eval, pr_exact, "pr_exact")
merge_eval(main_eval, pr_f1, "pr_f1")
merge_eval(main_eval, pr_oracle, "pr_oracle")
def histogram_na_prob(na_probs, qid_list, image_dir, name):
if not qid_list:
return
x = [na_probs[k] for k in qid_list]
weights = np.ones_like(x) / float(len(x))
plt.hist(x, weights=weights, bins=20, range=(0.0, 1.0))
plt.xlabel("Model probability of no-answer")
plt.ylabel("Proportion of dataset")
plt.title("Histogram of no-answer probability: %s" % name)
plt.savefig(os.path.join(image_dir, "na_prob_hist_%s.png" % name))
plt.clf()
def find_best_thresh(preds, scores, na_probs, qid_to_has_ans):
num_no_ans = sum(1 for k in qid_to_has_ans if not qid_to_has_ans[k])
cur_score = num_no_ans
best_score = cur_score
best_thresh = 0.0
qid_list = sorted(na_probs, key=lambda k: na_probs[k])
for i, qid in enumerate(qid_list):
if qid not in scores:
continue
if qid_to_has_ans[qid]:
diff = scores[qid]
else:
if preds[qid]:
diff = -1
else:
diff = 0
cur_score += diff
if cur_score > best_score:
best_score = cur_score
best_thresh = na_probs[qid]
return 100.0 * best_score / len(scores), best_thresh
def find_all_best_thresh(main_eval, preds, exact_raw, f1_raw, na_probs, qid_to_has_ans):
best_exact, exact_thresh = find_best_thresh(preds, exact_raw, na_probs, qid_to_has_ans)
best_f1, f1_thresh = find_best_thresh(preds, f1_raw, na_probs, qid_to_has_ans)
main_eval["best_exact"] = best_exact
main_eval["best_exact_thresh"] = exact_thresh
main_eval["best_f1"] = best_f1
main_eval["best_f1_thresh"] = f1_thresh
def main():
with open(OPTS.data_file) as f:
dataset_json = json.load(f)
dataset = dataset_json["data"]
with open(OPTS.pred_file) as f:
preds = json.load(f)
if OPTS.na_prob_file:
with open(OPTS.na_prob_file) as f:
na_probs = json.load(f)
else:
na_probs = {k: 0.0 for k in preds}
qid_to_has_ans = make_qid_to_has_ans(dataset) # maps qid to True/False
has_ans_qids = [k for k, v in qid_to_has_ans.items() if v]
no_ans_qids = [k for k, v in qid_to_has_ans.items() if not v]
exact_raw, f1_raw = get_raw_scores(dataset, preds)
exact_thresh = apply_no_ans_threshold(exact_raw, na_probs, qid_to_has_ans, OPTS.na_prob_thresh)
f1_thresh = apply_no_ans_threshold(f1_raw, na_probs, qid_to_has_ans, OPTS.na_prob_thresh)
out_eval = make_eval_dict(exact_thresh, f1_thresh)
if has_ans_qids:
has_ans_eval = make_eval_dict(exact_thresh, f1_thresh, qid_list=has_ans_qids)
merge_eval(out_eval, has_ans_eval, "HasAns")
if no_ans_qids:
no_ans_eval = make_eval_dict(exact_thresh, f1_thresh, qid_list=no_ans_qids)
merge_eval(out_eval, no_ans_eval, "NoAns")
if OPTS.na_prob_file:
find_all_best_thresh(out_eval, preds, exact_raw, f1_raw, na_probs, qid_to_has_ans)
if OPTS.na_prob_file and OPTS.out_image_dir:
run_precision_recall_analysis(out_eval, exact_raw, f1_raw, na_probs, qid_to_has_ans, OPTS.out_image_dir)
histogram_na_prob(na_probs, has_ans_qids, OPTS.out_image_dir, "hasAns")
histogram_na_prob(na_probs, no_ans_qids, OPTS.out_image_dir, "noAns")
if OPTS.out_file:
with open(OPTS.out_file, "w") as f:
json.dump(out_eval, f)
else:
print(json.dumps(out_eval, indent=2))
if __name__ == "__main__":
OPTS = parse_args()
if OPTS.out_image_dir:
import matplotlib
matplotlib.use("Agg")
import matplotlib.pyplot as plt
main()

View File

@@ -1,128 +0,0 @@
# coding=utf-8
# Copyright 2020 The HuggingFace Datasets Authors.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
""" SQuAD v2 metric. """
import datasets
from .evaluate import (
apply_no_ans_threshold,
find_all_best_thresh,
get_raw_scores,
make_eval_dict,
make_qid_to_has_ans,
merge_eval,
)
_CITATION = """\
@inproceedings{Rajpurkar2016SQuAD10,
title={SQuAD: 100, 000+ Questions for Machine Comprehension of Text},
author={Pranav Rajpurkar and Jian Zhang and Konstantin Lopyrev and Percy Liang},
booktitle={EMNLP},
year={2016}
}
"""
_DESCRIPTION = """
This metric wrap the official scoring script for version 2 of the Stanford Question
Answering Dataset (SQuAD).
Stanford Question Answering Dataset (SQuAD) is a reading comprehension dataset, consisting of questions posed by
crowdworkers on a set of Wikipedia articles, where the answer to every question is a segment of text, or span,
from the corresponding reading passage, or the question might be unanswerable.
SQuAD2.0 combines the 100,000 questions in SQuAD1.1 with over 50,000 unanswerable questions
written adversarially by crowdworkers to look similar to answerable ones.
To do well on SQuAD2.0, systems must not only answer questions when possible, but also
determine when no answer is supported by the paragraph and abstain from answering.
"""
_KWARGS_DESCRIPTION = """
Computes SQuAD v2 scores (F1 and EM).
Args:
predictions: List of triple for question-answers to score with the following elements:
- the question-answer 'id' field as given in the references (see below)
- the text of the answer
- the probability that the question has no answer
references: List of question-answers dictionaries with the following key-values:
- 'id': id of the question-answer pair (see above),
- 'answers': a list of Dict {'text': text of the answer as a string}
no_answer_threshold: float
Probability threshold to decide that a question has no answer.
Returns:
'exact': Exact match (the normalized answer exactly match the gold answer)
'f1': The F-score of predicted tokens versus the gold answer
'total': Number of score considered
'HasAns_exact': Exact match (the normalized answer exactly match the gold answer)
'HasAns_f1': The F-score of predicted tokens versus the gold answer
'HasAns_total': Number of score considered
'NoAns_exact': Exact match (the normalized answer exactly match the gold answer)
'NoAns_f1': The F-score of predicted tokens versus the gold answer
'NoAns_total': Number of score considered
'best_exact': Best exact match (with varying threshold)
'best_exact_thresh': No-answer probability threshold associated to the best exact match
'best_f1': Best F1 (with varying threshold)
'best_f1_thresh': No-answer probability threshold associated to the best F1
"""
class SquadV2(datasets.Metric):
def _info(self):
return datasets.MetricInfo(
description=_DESCRIPTION,
citation=_CITATION,
inputs_description=_KWARGS_DESCRIPTION,
features=datasets.Features(
{
"predictions": {
"id": datasets.Value("string"),
"prediction_text": datasets.Value("string"),
"no_answer_probability": datasets.Value("float32"),
},
"references": {
"id": datasets.Value("string"),
"answers": datasets.features.Sequence(
{"text": datasets.Value("string"), "answer_start": datasets.Value("int32")}
),
},
}
),
codebase_urls=["https://rajpurkar.github.io/SQuAD-explorer/"],
reference_urls=["https://rajpurkar.github.io/SQuAD-explorer/"],
)
def _compute(self, predictions, references, no_answer_threshold=1.0):
no_answer_probabilities = dict((p["id"], p["no_answer_probability"]) for p in predictions)
dataset = [{"paragraphs": [{"qas": references}]}]
predictions = dict((p["id"], p["prediction_text"]) for p in predictions)
qid_to_has_ans = make_qid_to_has_ans(dataset) # maps qid to True/False
has_ans_qids = [k for k, v in qid_to_has_ans.items() if v]
no_ans_qids = [k for k, v in qid_to_has_ans.items() if not v]
exact_raw, f1_raw = get_raw_scores(dataset, predictions)
exact_thresh = apply_no_ans_threshold(exact_raw, no_answer_probabilities, qid_to_has_ans, no_answer_threshold)
f1_thresh = apply_no_ans_threshold(f1_raw, no_answer_probabilities, qid_to_has_ans, no_answer_threshold)
out_eval = make_eval_dict(exact_thresh, f1_thresh)
if has_ans_qids:
has_ans_eval = make_eval_dict(exact_thresh, f1_thresh, qid_list=has_ans_qids)
merge_eval(out_eval, has_ans_eval, "HasAns")
if no_ans_qids:
no_ans_eval = make_eval_dict(exact_thresh, f1_thresh, qid_list=no_ans_qids)
merge_eval(out_eval, no_ans_eval, "NoAns")
find_all_best_thresh(out_eval, predictions, exact_raw, f1_raw, no_answer_probabilities, qid_to_has_ans)
return out_eval

View File

@@ -206,7 +206,7 @@ def postprocess_qa_predictions(
# Make `predictions` JSON-serializable by casting np.float back to float.
all_nbest_json[example["id"]] = [
{k: (float(v) if isinstance(v, (np.float32, np.float64)) else v) for k, v in pred.items()}
{k: (float(v) if isinstance(v, (np.float16, np.float32, np.float64)) else v) for k, v in pred.items()}
for pred in predictions
]
@@ -394,7 +394,7 @@ def postprocess_qa_predictions_with_beam_search(
# Make `predictions` JSON-serializable by casting np.float back to float.
all_nbest_json[example["id"]] = [
{k: (float(v) if isinstance(v, (np.float32, np.float64)) else v) for k, v in pred.items()}
{k: (float(v) if isinstance(v, (np.float16, np.float32, np.float64)) else v) for k, v in pred.items()}
for pred in predictions
]

View File

@@ -0,0 +1,388 @@
#!/usr/bin/env python3
""" This script is adapted from the Bertology pruning code (https://github.com/huggingface/transformers/blob/783d7d2629e97c5f0c5f9ef01b8c66410275c204/examples/research_projects/bertology/run_bertology.py)
to prune GPT-like models. The author is @altsoph.
"""
import argparse
import logging
import os
from datetime import datetime
import numpy as np
import torch
from torch.utils.data import DataLoader, RandomSampler, TensorDataset
from tqdm import tqdm
from transformers import GPT2LMHeadModel
logger = logging.getLogger(__name__)
def save_model(model, dirpath):
# save results
if os.path.exists(dirpath):
if os.path.exists(os.path.join(dirpath, "config.json")) and os.path.isfile(
os.path.join(dirpath, "config.json")
):
os.remove(os.path.join(dirpath, "config.json"))
if os.path.exists(os.path.join(dirpath, "pytorch_model.bin")) and os.path.isfile(
os.path.join(dirpath, "pytorch_model.bin")
):
os.remove(os.path.join(dirpath, "pytorch_model.bin"))
else:
os.makedirs(dirpath)
model.save_pretrained(dirpath)
def entropy(p, unlogit=False):
""" Compute the entropy of a probability distribution """
exponent = 2
if unlogit:
p = torch.pow(p, exponent)
plogp = p * torch.log(p)
plogp[p == 0] = 0
return -plogp.sum(dim=-1)
def print_2d_tensor(tensor):
""" Print a 2D tensor """
logger.info("lv, h >\t" + "\t".join(f"{x + 1}" for x in range(len(tensor))))
for row in range(len(tensor)):
if tensor.dtype != torch.long:
logger.info(f"layer {row + 1}:\t" + "\t".join(f"{x:.5f}" for x in tensor[row].cpu().data))
else:
logger.info(f"layer {row + 1}:\t" + "\t".join(f"{x:d}" for x in tensor[row].cpu().data))
def compute_heads_importance(
args, model, eval_dataloader, compute_entropy=True, compute_importance=True, head_mask=None, actually_pruned=False
):
"""This method shows how to compute:
- head attention entropy
- head importance scores according to http://arxiv.org/abs/1905.10650
"""
# Prepare our tensors
n_layers, n_heads = model.config.num_hidden_layers, model.config.num_attention_heads
head_importance = torch.zeros(n_layers, n_heads).to(args.device)
attn_entropy = torch.zeros(n_layers, n_heads).to(args.device)
if head_mask is None:
head_mask = torch.ones(n_layers, n_heads).to(args.device)
head_mask.requires_grad_(requires_grad=True)
# If actually pruned attention multi-head, set head mask to None to avoid shape mismatch
if actually_pruned:
head_mask = None
tot_tokens = 0.0
total_loss = 0.0
for step, inputs in enumerate(tqdm(eval_dataloader, desc="Iteration", disable=args.local_rank not in [-1, 0])):
inputs = tuple(t.to(args.device) for t in inputs)
(input_ids,) = inputs
# Do a forward pass (not with torch.no_grad() since we need gradients for importance score - see below)
outputs = model(input_ids, labels=input_ids, head_mask=head_mask)
# (loss), lm_logits, presents, (all hidden_states), (attentions)
loss, _, all_attentions = (
outputs[0],
outputs[1],
outputs[-1],
) # Loss and logits are the first, attention the last
loss.backward() # Backpropagate to populate the gradients in the head mask
total_loss += loss.detach().cpu().numpy()
if compute_entropy:
for layer, attn in enumerate(all_attentions):
masked_entropy = entropy(attn.detach(), True)
attn_entropy[layer] += masked_entropy.sum(-1).sum(0).sum(0).detach()
if compute_importance:
head_importance += head_mask.grad.abs().detach()
tot_tokens += torch.ones_like(input_ids).float().detach().sum().data
# Normalize
attn_entropy /= tot_tokens
head_importance /= tot_tokens
# Layerwise importance normalization
if not args.dont_normalize_importance_by_layer:
exponent = 2
norm_by_layer = torch.pow(torch.pow(head_importance, exponent).sum(-1), 1 / exponent)
head_importance /= norm_by_layer.unsqueeze(-1) + 1e-20
if not args.dont_normalize_global_importance:
head_importance = (head_importance - head_importance.min()) / (head_importance.max() - head_importance.min())
# Print matrices
if compute_entropy:
logger.info("Attention entropies")
print_2d_tensor(attn_entropy)
if compute_importance:
logger.info("Head importance scores")
print_2d_tensor(head_importance)
logger.info("Head ranked by importance scores")
head_ranks = torch.zeros(head_importance.numel(), dtype=torch.long, device=args.device)
head_ranks[head_importance.view(-1).sort(descending=True)[1]] = torch.arange(
head_importance.numel(), device=args.device
)
head_ranks = head_ranks.view_as(head_importance)
print_2d_tensor(head_ranks)
return attn_entropy, head_importance, total_loss
def mask_heads(args, model, eval_dataloader):
"""This method shows how to mask head (set some heads to zero), to test the effect on the network,
based on the head importance scores, as described in Michel et al. (http://arxiv.org/abs/1905.10650)
"""
_, head_importance, loss = compute_heads_importance(args, model, eval_dataloader, compute_entropy=False)
original_score = 1 / loss # instead of downsteam score use the LM loss
logger.info("Pruning: original score: %f, threshold: %f", original_score, original_score * args.masking_threshold)
new_head_mask = torch.ones_like(head_importance)
num_to_mask = max(1, int(new_head_mask.numel() * args.masking_amount))
current_score = original_score
while current_score >= original_score * args.masking_threshold:
head_mask = new_head_mask.clone().detach() # save current head mask
# heads from least important to most - keep only not-masked heads
head_importance[head_mask == 0.0] = float("Inf")
current_heads_to_mask = head_importance.view(-1).sort()[1]
if len(current_heads_to_mask) <= num_to_mask:
print("BREAK BY num_to_mask")
break
# mask heads
current_heads_to_mask = current_heads_to_mask[:num_to_mask]
logger.info("Heads to mask: %s", str(current_heads_to_mask.tolist()))
new_head_mask = new_head_mask.view(-1)
new_head_mask[current_heads_to_mask] = 0.0
new_head_mask = new_head_mask.view_as(head_mask)
new_head_mask = new_head_mask.clone().detach()
print_2d_tensor(new_head_mask)
# Compute metric and head importance again
_, head_importance, loss = compute_heads_importance(
args, model, eval_dataloader, compute_entropy=False, head_mask=new_head_mask
)
current_score = 1 / loss
logger.info(
"Masking: current score: %f, remaining heads %d (%.1f percents)",
current_score,
new_head_mask.sum(),
new_head_mask.sum() / new_head_mask.numel() * 100,
)
logger.info("Final head mask")
print_2d_tensor(head_mask)
np.save(os.path.join(args.output_dir, "head_mask.npy"), head_mask.detach().cpu().numpy())
return head_mask
def prune_heads(args, model, eval_dataloader, head_mask):
"""This method shows how to prune head (remove heads weights) based on
the head importance scores as described in Michel et al. (http://arxiv.org/abs/1905.10650)
"""
# Try pruning and test time speedup
# Pruning is like masking but we actually remove the masked weights
before_time = datetime.now()
_, _, loss = compute_heads_importance(
args, model, eval_dataloader, compute_entropy=False, compute_importance=False, head_mask=head_mask
)
score_masking = 1 / loss
original_time = datetime.now() - before_time
original_num_params = sum(p.numel() for p in model.parameters())
heads_to_prune = dict(
(layer, (1 - head_mask[layer].long()).nonzero().squeeze().tolist()) for layer in range(len(head_mask))
)
for k, v in heads_to_prune.items():
if isinstance(v, int):
heads_to_prune[k] = [
v,
]
assert sum(len(h) for h in heads_to_prune.values()) == (1 - head_mask.long()).sum().item()
model.prune_heads(heads_to_prune)
pruned_num_params = sum(p.numel() for p in model.parameters())
before_time = datetime.now()
_, _, loss = compute_heads_importance(
args,
model,
eval_dataloader,
compute_entropy=False,
compute_importance=False,
head_mask=None,
actually_pruned=True,
)
score_pruning = 1 / loss
new_time = datetime.now() - before_time
logger.info(
"Pruning: original num of params: %.2e, after pruning %.2e (%.1f percents)",
original_num_params,
pruned_num_params,
pruned_num_params / original_num_params * 100,
)
logger.info("Pruning: score with masking: %f score with pruning: %f", score_masking, score_pruning)
logger.info("Pruning: speed ratio (original timing / new timing): %f percents", original_time / new_time * 100)
save_model(model, args.output_dir)
def main():
parser = argparse.ArgumentParser()
# Required parameters
parser.add_argument(
"--data_dir",
default=None,
type=str,
required=True,
help="The input data dir. Should contain the .tsv files (or other data files) for the task.",
)
parser.add_argument(
"--model_name_or_path",
default=None,
type=str,
required=True,
help="Path to pretrained model or model identifier from huggingface.co/models",
)
parser.add_argument(
"--output_dir",
default=None,
type=str,
required=True,
help="The output directory where the model predictions and checkpoints will be written.",
)
# Other parameters
parser.add_argument(
"--config_name",
default="",
type=str,
help="Pretrained config name or path if not the same as model_name_or_path",
)
parser.add_argument(
"--tokenizer_name",
default="",
type=str,
help="Pretrained tokenizer name or path if not the same as model_name_or_path",
)
parser.add_argument(
"--cache_dir",
default=None,
type=str,
help="Where do you want to store the pre-trained models downloaded from s3",
)
parser.add_argument(
"--data_subset", type=int, default=-1, help="If > 0: limit the data to a subset of data_subset instances."
)
parser.add_argument(
"--overwrite_output_dir", action="store_true", help="Whether to overwrite data in output directory"
)
parser.add_argument(
"--overwrite_cache", action="store_true", help="Overwrite the cached training and evaluation sets"
)
parser.add_argument(
"--dont_normalize_importance_by_layer", action="store_true", help="Don't normalize importance score by layers"
)
parser.add_argument(
"--dont_normalize_global_importance",
action="store_true",
help="Don't normalize all importance scores between 0 and 1",
)
parser.add_argument(
"--try_masking", action="store_true", help="Whether to try to mask head until a threshold of accuracy."
)
parser.add_argument(
"--masking_threshold",
default=0.9,
type=float,
help="masking threshold in term of metrics (stop masking when metric < threshold * original metric value).",
)
parser.add_argument(
"--masking_amount", default=0.1, type=float, help="Amount to heads to masking at each masking step."
)
parser.add_argument("--metric_name", default="acc", type=str, help="Metric to use for head masking.")
parser.add_argument(
"--max_seq_length",
default=128,
type=int,
help="The maximum total input sequence length after WordPiece tokenization. \n"
"Sequences longer than this will be truncated, sequences shorter padded.",
)
parser.add_argument("--batch_size", default=1, type=int, help="Batch size.")
parser.add_argument("--seed", type=int, default=42)
parser.add_argument("--local_rank", type=int, default=-1, help="local_rank for distributed training on gpus")
parser.add_argument("--no_cuda", action="store_true", help="Whether not to use CUDA when available")
parser.add_argument("--server_ip", type=str, default="", help="Can be used for distant debugging.")
parser.add_argument("--server_port", type=str, default="", help="Can be used for distant debugging.")
args = parser.parse_args()
if args.server_ip and args.server_port:
# Distant debugging - see https://code.visualstudio.com/docs/python/debugging#_attach-to-a-local-script
import ptvsd
print("Waiting for debugger attach")
ptvsd.enable_attach(address=(args.server_ip, args.server_port), redirect_output=True)
ptvsd.wait_for_attach()
# Setup devices and distributed training
if args.local_rank == -1 or args.no_cuda:
args.device = torch.device("cuda" if torch.cuda.is_available() and not args.no_cuda else "cpu")
args.n_gpu = 0 if args.no_cuda else torch.cuda.device_count()
else:
torch.cuda.set_device(args.local_rank)
args.device = torch.device("cuda", args.local_rank)
args.n_gpu = 1
torch.distributed.init_process_group(backend="nccl") # Initializes the distributed backend
# Setup logging
logging.basicConfig(level=logging.INFO if args.local_rank in [-1, 0] else logging.WARN)
logger.info("device: {} n_gpu: {}, distributed: {}".format(args.device, args.n_gpu, bool(args.local_rank != -1)))
model = GPT2LMHeadModel.from_pretrained(args.model_name_or_path)
# Distributed and parallel training
model.to(args.device)
if args.local_rank != -1:
model = torch.nn.parallel.DistributedDataParallel(
model, device_ids=[args.local_rank], output_device=args.local_rank, find_unused_parameters=True
)
elif args.n_gpu > 1:
model = torch.nn.DataParallel(model)
# Print/save training arguments
os.makedirs(args.output_dir, exist_ok=True)
torch.save(args, os.path.join(args.output_dir, "run_args.bin"))
logger.info("Training/evaluation parameters %s", args)
# Prepare dataset
numpy_data = np.concatenate(
[
np.loadtxt(args.data_dir, dtype=np.int64),
]
)
train_tensor_dataset = (torch.from_numpy(numpy_data),)
train_data = TensorDataset(*train_tensor_dataset)
train_sampler = RandomSampler(train_data)
eval_dataloader = DataLoader(train_data, sampler=train_sampler, batch_size=args.batch_size)
# Compute head entropy and importance score
compute_heads_importance(args, model, eval_dataloader)
# Try head masking (set heads to zero until the score goes under a threshole)
# and head pruning (remove masked heads and see the effect on the network)
if args.try_masking and args.masking_threshold > 0.0 and args.masking_threshold < 1.0:
head_mask = mask_heads(args, model, eval_dataloader)
prune_heads(args, model, eval_dataloader, head_mask)
if __name__ == "__main__":
main()

View File

@@ -3,7 +3,6 @@ argon2-cffi==20.1.0
async-generator==1.10
attrs==20.2.0
backcall==0.2.0
bleach==3.1.5
CacheControl==0.12.6
certifi==2020.6.20
cffi==1.14.2
@@ -90,7 +89,7 @@ torchvision==0.7.0
tornado==6.0.4
tqdm==4.48.2
traitlets
transformers==3.5.1
git+https://github.com/huggingface/transformers.git
urllib3==1.25.8
wcwidth==0.2.5
webencodings==0.5.1

View File

@@ -0,0 +1,92 @@
<!---
Copyright 2020 The HuggingFace Team. All rights reserved.
Licensed under the Apache License, Version 2.0 (the "License");
you may not use this file except in compliance with the License.
You may obtain a copy of the License at
http://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License.
-->
## Whole Word Mask Language Model
These scripts leverage the 🤗 Datasets library and the Trainer API. You can easily customize them to your needs if you
need extra processing on your datasets.
The following examples, will run on a datasets hosted on our [hub](https://huggingface.co/datasets) or with your own
text files for training and validation. We give examples of both below.
The BERT authors released a new version of BERT using Whole Word Masking in May 2019. Instead of masking randomly
selected tokens (which may be part of words), they mask randomly selected words (masking all the tokens corresponding
to that word). This technique has been refined for Chinese in [this paper](https://arxiv.org/abs/1906.08101).
To fine-tune a model using whole word masking, use the following script:
```bash
python run_mlm_wwm.py \
--model_name_or_path roberta-base \
--dataset_name wikitext \
--dataset_config_name wikitext-2-raw-v1 \
--do_train \
--do_eval \
--output_dir /tmp/test-mlm-wwm
```
For Chinese models, we need to generate a reference files (which requires the ltp library), because it's tokenized at
the character level.
**Q :** Why a reference file?
**A :** Suppose we have a Chinese sentence like: `我喜欢你` The original Chinese-BERT will tokenize it as
`['我','喜','欢','你']` (character level). But `喜欢` is a whole word. For whole word masking proxy, we need a result
like `['我','喜','##欢','你']`, so we need a reference file to tell the model which position of the BERT original token
should be added `##`.
**Q :** Why LTP ?
**A :** Cause the best known Chinese WWM BERT is [Chinese-BERT-wwm](https://github.com/ymcui/Chinese-BERT-wwm) by HIT.
It works well on so many Chines Task like CLUE (Chinese GLUE). They use LTP, so if we want to fine-tune their model,
we need LTP.
You could run the following:
```bash
export TRAIN_FILE=/path/to/dataset/wiki.train.raw
export LTP_RESOURCE=/path/to/ltp/tokenizer
export BERT_RESOURCE=/path/to/bert/tokenizer
export SAVE_PATH=/path/to/data/ref.txt
python run_chinese_ref.py \
--file_name=path_to_train_or_eval_file \
--ltp=path_to_ltp_tokenizer \
--bert=path_to_bert_tokenizer \
--save_path=path_to_reference_file
```
Then you can run the script like this:
```bash
python run_mlm_wwm.py \
--model_name_or_path roberta-base \
--train_file path_to_train_file \
--validation_file path_to_validation_file \
--train_ref_file path_to_train_chinese_ref_file \
--validation_ref_file path_to_validation_chinese_ref_file \
--do_train \
--do_eval \
--output_dir /tmp/test-mlm-wwm
```
**Note1:** On TPU, you should the flag `--pad_to_max_length` to make sure all your batches have the same length.
**Note2:** And if you have any questions or something goes wrong when runing this code, don't hesitate to pin @wlhgtc.

View File

@@ -0,0 +1,4 @@
datasets >= 1.1.3
sentencepiece != 0.1.92
protobuf
ltp

View File

@@ -0,0 +1,147 @@
import argparse
import json
from typing import List
from ltp import LTP
from transformers.models.bert.tokenization_bert import BertTokenizer
def _is_chinese_char(cp):
"""Checks whether CP is the codepoint of a CJK character."""
# This defines a "chinese character" as anything in the CJK Unicode block:
# https://en.wikipedia.org/wiki/CJK_Unified_Ideographs_(Unicode_block)
#
# Note that the CJK Unicode block is NOT all Japanese and Korean characters,
# despite its name. The modern Korean Hangul alphabet is a different block,
# as is Japanese Hiragana and Katakana. Those alphabets are used to write
# space-separated words, so they are not treated specially and handled
# like the all of the other languages.
if (
(cp >= 0x4E00 and cp <= 0x9FFF)
or (cp >= 0x3400 and cp <= 0x4DBF) #
or (cp >= 0x20000 and cp <= 0x2A6DF) #
or (cp >= 0x2A700 and cp <= 0x2B73F) #
or (cp >= 0x2B740 and cp <= 0x2B81F) #
or (cp >= 0x2B820 and cp <= 0x2CEAF) #
or (cp >= 0xF900 and cp <= 0xFAFF)
or (cp >= 0x2F800 and cp <= 0x2FA1F) #
): #
return True
return False
def is_chinese(word: str):
# word like '180' or '身高' or '神'
for char in word:
char = ord(char)
if not _is_chinese_char(char):
return 0
return 1
def get_chinese_word(tokens: List[str]):
word_set = set()
for token in tokens:
chinese_word = len(token) > 1 and is_chinese(token)
if chinese_word:
word_set.add(token)
word_list = list(word_set)
return word_list
def add_sub_symbol(bert_tokens: List[str], chinese_word_set: set()):
if not chinese_word_set:
return bert_tokens
max_word_len = max([len(w) for w in chinese_word_set])
bert_word = bert_tokens
start, end = 0, len(bert_word)
while start < end:
single_word = True
if is_chinese(bert_word[start]):
l = min(end - start, max_word_len)
for i in range(l, 1, -1):
whole_word = "".join(bert_word[start : start + i])
if whole_word in chinese_word_set:
for j in range(start + 1, start + i):
bert_word[j] = "##" + bert_word[j]
start = start + i
single_word = False
break
if single_word:
start += 1
return bert_word
def prepare_ref(lines: List[str], ltp_tokenizer: LTP, bert_tokenizer: BertTokenizer):
ltp_res = []
for i in range(0, len(lines), 100):
res = ltp_tokenizer.seg(lines[i : i + 100])[0]
res = [get_chinese_word(r) for r in res]
ltp_res.extend(res)
assert len(ltp_res) == len(lines)
bert_res = []
for i in range(0, len(lines), 100):
res = bert_tokenizer(lines[i : i + 100], add_special_tokens=True, truncation=True, max_length=512)
bert_res.extend(res["input_ids"])
assert len(bert_res) == len(lines)
ref_ids = []
for input_ids, chinese_word in zip(bert_res, ltp_res):
input_tokens = []
for id in input_ids:
token = bert_tokenizer._convert_id_to_token(id)
input_tokens.append(token)
input_tokens = add_sub_symbol(input_tokens, chinese_word)
ref_id = []
# We only save pos of chinese subwords start with ##, which mean is part of a whole word.
for i, token in enumerate(input_tokens):
if token[:2] == "##":
clean_token = token[2:]
# save chinese tokens' pos
if len(clean_token) == 1 and _is_chinese_char(ord(clean_token)):
ref_id.append(i)
ref_ids.append(ref_id)
assert len(ref_ids) == len(bert_res)
return ref_ids
def main(args):
# For Chinese (Ro)Bert, the best result is from : RoBERTa-wwm-ext (https://github.com/ymcui/Chinese-BERT-wwm)
# If we want to fine-tune these model, we have to use same tokenizer : LTP (https://github.com/HIT-SCIR/ltp)
with open(args.file_name, "r", encoding="utf-8") as f:
data = f.readlines()
data = [line.strip() for line in data if len(line) > 0 and not line.isspace()] # avoid delimiter like '\u2029'
ltp_tokenizer = LTP(args.ltp) # faster in GPU device
bert_tokenizer = BertTokenizer.from_pretrained(args.bert)
ref_ids = prepare_ref(data, ltp_tokenizer, bert_tokenizer)
with open(args.save_path, "w", encoding="utf-8") as f:
data = [json.dumps(ref) + "\n" for ref in ref_ids]
f.writelines(data)
if __name__ == "__main__":
parser = argparse.ArgumentParser(description="prepare_chinese_ref")
parser.add_argument(
"--file_name",
type=str,
default="./resources/chinese-demo.txt",
help="file need process, same as training data in lm",
)
parser.add_argument(
"--ltp", type=str, default="./resources/ltp", help="resources for LTP tokenizer, usually a path"
)
parser.add_argument("--bert", type=str, default="./resources/robert", help="resources for Bert tokenizer")
parser.add_argument("--save_path", type=str, default="./resources/ref.txt", help="path to save res")
args = parser.parse_args()
main(args)

View File

@@ -44,7 +44,7 @@ from transformers import (
TrainingArguments,
set_seed,
)
from transformers.trainer_utils import is_main_process
from transformers.trainer_utils import get_last_checkpoint, is_main_process
logger = logging.getLogger(__name__)
@@ -83,6 +83,17 @@ class ModelArguments:
default=True,
metadata={"help": "Whether to use one of the fast tokenizer (backed by the tokenizers library) or not."},
)
model_revision: str = field(
default="main",
metadata={"help": "The specific model version to use (can be a branch name, tag name or commit id)."},
)
use_auth_token: bool = field(
default=False,
metadata={
"help": "Will use the token generated when running `transformers-cli login` (necessary to use this script "
"with private models)."
},
)
@dataclass
@@ -173,23 +184,28 @@ def main():
else:
model_args, data_args, training_args = parser.parse_args_into_dataclasses()
if (
os.path.exists(training_args.output_dir)
and os.listdir(training_args.output_dir)
and training_args.do_train
and not training_args.overwrite_output_dir
):
raise ValueError(
f"Output directory ({training_args.output_dir}) already exists and is not empty."
"Use --overwrite_output_dir to overcome."
)
# Detecting last checkpoint.
last_checkpoint = None
if os.path.isdir(training_args.output_dir) and training_args.do_train and not training_args.overwrite_output_dir:
last_checkpoint = get_last_checkpoint(training_args.output_dir)
if last_checkpoint is None and len(os.listdir(training_args.output_dir)) > 0:
raise ValueError(
f"Output directory ({training_args.output_dir}) already exists and is not empty. "
"Use --overwrite_output_dir to overcome."
)
elif last_checkpoint is not None:
logger.info(
f"Checkpoint detected, resuming training at {last_checkpoint}. To avoid this behavior, change "
"the `--output_dir` or add `--overwrite_output_dir` to train from scratch."
)
# Setup logging
logging.basicConfig(
format="%(asctime)s - %(levelname)s - %(name)s - %(message)s",
datefmt="%m/%d/%Y %H:%M:%S",
level=logging.INFO if is_main_process(training_args.local_rank) else logging.WARN,
handlers=[logging.StreamHandler(sys.stdout)],
)
logger.setLevel(logging.INFO if is_main_process(training_args.local_rank) else logging.WARN)
# Log on each process the small summary:
logger.warning(
@@ -247,22 +263,29 @@ def main():
# Distributed training:
# The .from_pretrained methods guarantee that only one local process can concurrently
# download model & vocab.
config_kwargs = {
"cache_dir": model_args.cache_dir,
"revision": model_args.model_revision,
"use_auth_token": True if model_args.use_auth_token else None,
}
if model_args.config_name:
config = AutoConfig.from_pretrained(model_args.config_name, cache_dir=model_args.cache_dir)
config = AutoConfig.from_pretrained(model_args.config_name, **config_kwargs)
elif model_args.model_name_or_path:
config = AutoConfig.from_pretrained(model_args.model_name_or_path, cache_dir=model_args.cache_dir)
config = AutoConfig.from_pretrained(model_args.model_name_or_path, **config_kwargs)
else:
config = CONFIG_MAPPING[model_args.model_type]()
logger.warning("You are instantiating a new config instance from scratch.")
tokenizer_kwargs = {
"cache_dir": model_args.cache_dir,
"use_fast": model_args.use_fast_tokenizer,
"revision": model_args.model_revision,
"use_auth_token": True if model_args.use_auth_token else None,
}
if model_args.tokenizer_name:
tokenizer = AutoTokenizer.from_pretrained(
model_args.tokenizer_name, cache_dir=model_args.cache_dir, use_fast=model_args.use_fast_tokenizer
)
tokenizer = AutoTokenizer.from_pretrained(model_args.tokenizer_name, **tokenizer_kwargs)
elif model_args.model_name_or_path:
tokenizer = AutoTokenizer.from_pretrained(
model_args.model_name_or_path, cache_dir=model_args.cache_dir, use_fast=model_args.use_fast_tokenizer
)
tokenizer = AutoTokenizer.from_pretrained(model_args.model_name_or_path, **tokenizer_kwargs)
else:
raise ValueError(
"You are instantiating a new tokenizer from scratch. This is not supported by this script."
@@ -275,6 +298,8 @@ def main():
from_tf=bool(".ckpt" in model_args.model_name_or_path),
config=config,
cache_dir=model_args.cache_dir,
revision=model_args.model_revision,
use_auth_token=True if model_args.use_auth_token else None,
)
else:
logger.info("Training new model from scratch")
@@ -312,6 +337,10 @@ def main():
tokenized_datasets["validation"] = add_chinese_references(
tokenized_datasets["validation"], data_args.validation_ref_file
)
# If we have ref files, need to avoid it removed by trainer
has_ref = data_args.train_ref_file or data_args.validation_ref_file
if has_ref:
training_args.remove_unused_columns = False
# Data collator
# This one will take care of randomly masking the tokens.
@@ -329,14 +358,26 @@ def main():
# Training
if training_args.do_train:
model_path = (
model_args.model_name_or_path
if (model_args.model_name_or_path is not None and os.path.isdir(model_args.model_name_or_path))
else None
)
trainer.train(model_path=model_path)
if last_checkpoint is not None:
checkpoint = last_checkpoint
elif model_args.model_name_or_path is not None and os.path.isdir(model_args.model_name_or_path):
checkpoint = model_args.model_name_or_path
else:
checkpoint = None
train_result = trainer.train(resume_from_checkpoint=checkpoint)
trainer.save_model() # Saves the tokenizer too for easy upload
output_train_file = os.path.join(training_args.output_dir, "train_results.txt")
if trainer.is_world_process_zero():
with open(output_train_file, "w") as writer:
logger.info("***** Train results *****")
for key, value in sorted(train_result.metrics.items()):
logger.info(f" {key} = {value}")
writer.write(f"{key} = {value}\n")
# Need to save the state, since Trainer.save_model saves only the tokenizer with the model
trainer.state.save_to_json(os.path.join(training_args.output_dir, "trainer_state.json"))
# Evaluation
results = {}
if training_args.do_eval:
@@ -351,7 +392,7 @@ def main():
if trainer.is_world_process_zero():
with open(output_eval_file, "w") as writer:
logger.info("***** Eval results *****")
for key, value in results.items():
for key, value in sorted(results.items()):
logger.info(f" {key} = {value}")
writer.write(f"{key} = {value}\n")

View File

@@ -0,0 +1,25 @@
# Performer fine-tuning
Example authors: @TevenLeScao, @Patrickvonplaten
Paper authors: Krzysztof Choromanski, Valerii Likhosherstov, David Dohan, Xingyou Song, Andreea Gane, Tamas Sarlos, Peter Hawkins, Jared Davis, Afroz Mohiuddin, Lukasz Kaiser, David Belanger, Lucy Colwell, Adrian Weller
## Requirements
`datasets`, `flax` and `jax`. `wandb` integration is built-in if you want to use it.
## Examples
`sanity_script.sh` will launch performer fine-tuning from the bert-base-cased checkpoint on the Simple Wikipedia dataset (a small, easy-language English Wikipedia) from `datasets`.
`full_script.sh` will launch performer fine-tuning from the bert-large-cased checkpoint on the English Wikipedia dataset from `datasets`.
Here are a few key arguments:
- Remove the `--performer` argument to use a standard Bert model.
- Add `--reinitialize` to start from a blank model rather than a Bert checkpoint.
- You may change the Bert size by passing a different [checkpoint](https://huggingface.co/transformers/pretrained_models.html) to the `--model_name_or_path` argument.
- Passing your user name to the `--wandb_user_name` argument will trigger weights and biases logging.
- You can choose a dataset with `--dataset_name` and `--dataset_config`. Our [viewer](https://huggingface.co/datasets/viewer/) will help you find what you need.

View File

@@ -0,0 +1 @@
TOKENIZERS_PARALLELISM=true python run_mlm_performer.py --output_dir experiments --dataset_name wikipedia --dataset_config_name 20200501.en --model_name_or_path bert-large-cased --tokenizer_name bert-large-cased --do_train --overwrite_output_dir --per_device_train_batch_size 4 --learning_rate 5e-4 --warmup_steps 100 --num_train_epochs 3 --performer

View File

@@ -0,0 +1,553 @@
# coding=utf-8
# Copyright 2018 The Google Flax Team Authors and The HuggingFace Inc. team.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
from typing import Callable, Dict, Tuple
import numpy as np
import flax.linen as nn
import jax
import jax.numpy as jnp
from jax.random import PRNGKey
from modeling_flax_performer_utils import make_fast_softmax_attention
from transformers.file_utils import add_start_docstrings
from transformers.modeling_flax_utils import ACT2FN
from transformers.models.bert.configuration_bert import BertConfig
from transformers.models.bert.modeling_flax_bert import FlaxBertOnlyMLMHead, FlaxBertPreTrainedModel
from transformers.utils import logging
logger = logging.get_logger(__name__)
_CONFIG_FOR_DOC = "BertConfig"
_TOKENIZER_FOR_DOC = "BertTokenizer"
BERT_START_DOCSTRING = r"""
This model inherits from :class:`~transformers.PreTrainedModel`. Check the superclass documentation for the generic
methods the library implements for all its model (such as downloading or saving, resizing the input embeddings,
pruning heads etc.)
This model is also a PyTorch `torch.nn.Module <https://pytorch.org/docs/stable/nn.html#torch.nn.Module>`__
subclass. Use it as a regular PyTorch Module and refer to the PyTorch documentation for all matter related to
general usage and behavior.
Parameters:
config (:class:`~transformers.BertConfig`): Model configuration class with all the parameters of the model.
Initializing with a config file does not load the weights associated with the model, only the
configuration. Check out the :meth:`~transformers.PreTrainedModel.from_pretrained` method to load the model
weights.
"""
BERT_INPUTS_DOCSTRING = r"""
Args:
input_ids (:obj:`torch.LongTensor` of shape :obj:`({0})`):
Indices of input sequence tokens in the vocabulary.
Indices can be obtained using :class:`~transformers.BertTokenizer`. See
:meth:`transformers.PreTrainedTokenizer.encode` and :meth:`transformers.PreTrainedTokenizer.__call__` for
details.
`What are input IDs? <../glossary.html#input-ids>`__
attention_mask (:obj:`torch.FloatTensor` of shape :obj:`({0})`, `optional`):
Mask to avoid performing attention on padding token indices. Mask values selected in ``[0, 1]``:
- 1 for tokens that are **not masked**,
- 0 for tokens that are **masked**.
`What are attention masks? <../glossary.html#attention-mask>`__
token_type_ids (:obj:`torch.LongTensor` of shape :obj:`({0})`, `optional`):
Segment token indices to indicate first and second portions of the inputs. Indices are selected in ``[0,
1]``:
- 0 corresponds to a `sentence A` token,
- 1 corresponds to a `sentence B` token.
`What are token type IDs? <../glossary.html#token-type-ids>`_
position_ids (:obj:`torch.LongTensor` of shape :obj:`({0})`, `optional`):
Indices of positions of each input sequence tokens in the position embeddings. Selected in the range ``[0,
config.max_position_embeddings - 1]``.
`What are position IDs? <../glossary.html#position-ids>`_
head_mask (:obj:`torch.FloatTensor` of shape :obj:`(num_heads,)` or :obj:`(num_layers, num_heads)`, `optional`):
Mask to nullify selected heads of the self-attention modules. Mask values selected in ``[0, 1]``:
- 1 indicates the head is **not masked**,
- 0 indicates the head is **masked**.
inputs_embeds (:obj:`torch.FloatTensor` of shape :obj:`({0}, hidden_size)`, `optional`):
Optionally, instead of passing :obj:`input_ids` you can choose to directly pass an embedded representation.
This is useful if you want more control over how to convert :obj:`input_ids` indices into associated
vectors than the model's internal embedding lookup matrix.
output_attentions (:obj:`bool`, `optional`):
Whether or not to return the attentions tensors of all attention layers. See ``attentions`` under returned
tensors for more detail.
output_hidden_states (:obj:`bool`, `optional`):
Whether or not to return the hidden states of all layers. See ``hidden_states`` under returned tensors for
more detail.
return_dict (:obj:`bool`, `optional`):
Whether or not to return a :class:`~transformers.file_utils.ModelOutput` instead of a plain tuple.
"""
class FlaxPerformerLayerNorm(nn.Module):
"""
Layer normalization (https://arxiv.org/abs/1607.06450). Operates on the last axis of the input data.
"""
epsilon: float = 1e-6
dtype: jnp.dtype = jnp.float32 # the dtype of the computation
bias: bool = True # If True, bias (beta) is added.
scale: bool = True # If True, multiply by scale (gamma). When the next layer is linear
# (also e.g. nn.relu), this can be disabled since the scaling will be
# done by the next layer.
bias_init: jnp.ndarray = nn.initializers.zeros
scale_init: jnp.ndarray = nn.initializers.ones
@nn.compact
def __call__(self, x):
"""
Applies layer normalization on the input. It normalizes the activations of the layer for each given example in
a batch independently, rather than across a batch like Batch Normalization. i.e. applies a transformation that
maintains the mean activation within each example close to 0 and the activation standard deviation close to 1
Args:
x: the inputs
Returns:
Normalized inputs (the same shape as inputs).
"""
features = x.shape[-1]
mean = jnp.mean(x, axis=-1, keepdims=True)
mean2 = jnp.mean(jax.lax.square(x), axis=-1, keepdims=True)
var = mean2 - jax.lax.square(mean)
mul = jax.lax.rsqrt(var + self.epsilon)
if self.scale:
mul = mul * jnp.asarray(self.param("gamma", self.scale_init, (features,)), self.dtype)
y = (x - mean) * mul
if self.bias:
y = y + jnp.asarray(self.param("beta", self.bias_init, (features,)), self.dtype)
return y
class FlaxPerformerEmbedding(nn.Module):
"""
Specify a new class for doing the embedding stuff as Flax's one use 'embedding' for the parameter name and PyTorch
use 'weight'
"""
vocab_size: int
hidden_size: int
emb_init: Callable[..., np.ndarray] = nn.initializers.normal(stddev=0.1)
@nn.compact
def __call__(self, inputs):
embedding = self.param("weight", self.emb_init, (self.vocab_size, self.hidden_size))
return jnp.take(embedding, inputs, axis=0)
class FlaxPerformerEmbeddings(nn.Module):
"""Construct the embeddings from word, position and token_type embeddings."""
vocab_size: int
hidden_size: int
type_vocab_size: int
max_length: int
@nn.compact
def __call__(self, input_ids, token_type_ids, position_ids, attention_mask):
# Embed
w_emb = FlaxPerformerEmbedding(self.vocab_size, self.hidden_size, name="word_embeddings")(
jnp.atleast_2d(input_ids.astype("i4"))
)
p_emb = FlaxPerformerEmbedding(self.max_length, self.hidden_size, name="position_embeddings")(
jnp.atleast_2d(position_ids.astype("i4"))
)
t_emb = FlaxPerformerEmbedding(self.type_vocab_size, self.hidden_size, name="token_type_embeddings")(
jnp.atleast_2d(token_type_ids.astype("i4"))
)
# Sum all embeddings
summed_emb = w_emb + jnp.broadcast_to(p_emb, w_emb.shape) + t_emb
# Layer Norm
layer_norm = FlaxPerformerLayerNorm(name="layer_norm")(summed_emb)
return layer_norm
class FlaxPerformerAttention(nn.Module):
num_heads: int
head_size: int
@nn.compact
def __call__(self, hidden_state, attention_mask):
single_head_dim = self.head_size // self.num_heads
fast_softmax_attention = make_fast_softmax_attention(qkv_dim=single_head_dim)
self_att = nn.attention.SelfAttention(
num_heads=self.num_heads, qkv_features=self.head_size, name="self", attention_fn=fast_softmax_attention
)(hidden_state, attention_mask)
layer_norm = FlaxPerformerLayerNorm(name="layer_norm")(self_att + hidden_state)
return layer_norm
class FlaxPerformerIntermediate(nn.Module):
output_size: int
hidden_act: str = "gelu"
@nn.compact
def __call__(self, hidden_state):
# TODO: Add ACT2FN reference to change activation function
dense = nn.Dense(features=self.output_size, name="dense")(hidden_state)
return ACT2FN[self.hidden_act](dense)
class FlaxPerformerOutput(nn.Module):
@nn.compact
def __call__(self, intermediate_output, attention_output):
hidden_state = nn.Dense(attention_output.shape[-1], name="dense")(intermediate_output)
hidden_state = FlaxPerformerLayerNorm(name="layer_norm")(hidden_state + attention_output)
return hidden_state
class FlaxPerformerLayer(nn.Module):
num_heads: int
head_size: int
intermediate_size: int
hidden_act: str = "gelu"
@nn.compact
def __call__(self, hidden_state, attention_mask):
attention = FlaxPerformerAttention(self.num_heads, self.head_size, name="attention")(
hidden_state, attention_mask
)
intermediate = FlaxPerformerIntermediate(
self.intermediate_size, name="intermediate", hidden_act=self.hidden_act
)(attention)
output = FlaxPerformerOutput(name="output")(intermediate, attention)
return output
class FlaxPerformerLayerCollection(nn.Module):
"""
Stores N BertLayer(s)
"""
num_layers: int
num_heads: int
head_size: int
intermediate_size: int
hidden_act: str = "gelu"
@nn.compact
def __call__(self, inputs, attention_mask):
assert self.num_layers > 0, f"num_layers should be >= 1, got ({self.num_layers})"
# Initialize input / output
input_i = inputs
# Forward over all encoders
for i in range(self.num_layers):
layer = FlaxPerformerLayer(
self.num_heads, self.head_size, self.intermediate_size, hidden_act=self.hidden_act, name=f"{i}"
)
input_i = layer(input_i, attention_mask)
return input_i
class FlaxPerformerEncoder(nn.Module):
num_layers: int
num_heads: int
head_size: int
intermediate_size: int
hidden_act: str = "gelu"
@nn.compact
def __call__(self, hidden_state, attention_mask):
layer = FlaxPerformerLayerCollection(
self.num_layers,
self.num_heads,
self.head_size,
self.intermediate_size,
name="layer",
hidden_act=self.hidden_act,
)(hidden_state, attention_mask)
return layer
class FlaxPerformerPooler(nn.Module):
@nn.compact
def __call__(self, hidden_state):
cls_token = hidden_state[:, 0]
out = nn.Dense(hidden_state.shape[-1], name="dense")(cls_token)
return jax.lax.tanh(out)
class FlaxPerformerModule(nn.Module):
vocab_size: int
hidden_size: int
type_vocab_size: int
max_length: int
num_encoder_layers: int
num_heads: int
head_size: int
intermediate_size: int
hidden_act: str = "gelu"
add_pooling_layer: bool = True
@nn.compact
def __call__(self, input_ids, token_type_ids, position_ids, attention_mask):
# Embedding
embeddings = FlaxPerformerEmbeddings(
self.vocab_size, self.hidden_size, self.type_vocab_size, self.max_length, name="embeddings"
)(input_ids, token_type_ids, position_ids, attention_mask)
# N stacked encoding layers
encoder = FlaxPerformerEncoder(
self.num_encoder_layers,
self.num_heads,
self.head_size,
self.intermediate_size,
hidden_act=self.hidden_act,
name="encoder",
)(embeddings, attention_mask)
if not self.add_pooling_layer:
return encoder
pooled = FlaxPerformerPooler(name="pooler")(encoder)
return encoder, pooled
@add_start_docstrings(
"The bare Bert Model transformer outputting raw hidden-states without any specific head on top.",
BERT_START_DOCSTRING,
)
class FlaxPerformerModel(FlaxBertPreTrainedModel):
"""
The model can behave as an encoder (with only self-attention) as well as a decoder, in which case a layer of
cross-attention is added between the self-attention layers, following the architecture described in `Attention is
all you need <https://arxiv.org/abs/1706.03762>`__ by Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit,
Llion Jones, Aidan N. Gomez, Lukasz Kaiser and Illia Polosukhin.
"""
model_class = FlaxPerformerModule
config_class = BertConfig
base_model_prefix = "bert"
@staticmethod
def convert_from_pytorch(pt_state: Dict, config: BertConfig) -> Dict:
jax_state = dict(pt_state)
# Need to change some parameters name to match Flax names so that we don't have to fork any layer
for key, tensor in pt_state.items():
# Key parts
key_parts = set(key.split("."))
# Every dense layer has "kernel" parameters instead of "weight"
if "dense.weight" in key:
del jax_state[key]
key = key.replace("weight", "kernel")
jax_state[key] = tensor
# SelfAttention needs also to replace "weight" by "kernel"
if {"query", "key", "value"} & key_parts:
# Flax SelfAttention decomposes the heads (num_head, size // num_heads)
if "bias" in key:
jax_state[key] = tensor.reshape((config.num_attention_heads, -1))
elif "weight":
del jax_state[key]
key = key.replace("weight", "kernel")
tensor = tensor.reshape((config.num_attention_heads, -1, config.hidden_size)).transpose((2, 0, 1))
jax_state[key] = tensor
# SelfAttention output is not a separate layer, remove one nesting
if "attention.output.dense" in key:
del jax_state[key]
key = key.replace("attention.output.dense", "attention.self.out")
jax_state[key] = tensor
# SelfAttention output is not a separate layer, remove nesting on layer norm
if "attention.output.LayerNorm" in key:
del jax_state[key]
key = key.replace("attention.output.LayerNorm", "attention.LayerNorm")
jax_state[key] = tensor
# There are some transposed parameters w.r.t their PyTorch counterpart
if "intermediate.dense.kernel" in key or "output.dense.kernel" in key:
jax_state[key] = tensor.T
# Self Attention output projection needs to be transposed
if "out.kernel" in key:
jax_state[key] = tensor.reshape((config.hidden_size, config.num_attention_heads, -1)).transpose(
1, 2, 0
)
# Pooler needs to transpose its kernel
if "pooler.dense.kernel" in key:
jax_state[key] = tensor.T
# Handle LayerNorm conversion
if "LayerNorm" in key:
del jax_state[key]
# Replace LayerNorm by layer_norm
new_key = key.replace("LayerNorm", "layer_norm")
if "weight" in key:
new_key = new_key.replace("weight", "gamma")
elif "bias" in key:
new_key = new_key.replace("bias", "beta")
jax_state[new_key] = tensor
return jax_state
def __init__(
self, config: BertConfig, input_shape: Tuple = (1, 1), seed: int = 0, dtype: jnp.dtype = jnp.float32, **kwargs
):
module = FlaxPerformerModule(
vocab_size=config.vocab_size,
hidden_size=config.hidden_size,
type_vocab_size=config.type_vocab_size,
max_length=config.max_position_embeddings,
num_encoder_layers=config.num_hidden_layers,
num_heads=config.num_attention_heads,
head_size=config.hidden_size,
intermediate_size=config.intermediate_size,
dropout_rate=config.hidden_dropout_prob,
hidden_act=config.hidden_act,
)
super().__init__(config, module, input_shape=input_shape, seed=seed, dtype=dtype)
@property
def module(self) -> nn.Module:
return self._module
def __call__(
self, input_ids, token_type_ids=None, position_ids=None, dropout_rng: PRNGKey = None, attention_mask=None
):
input_ids, attention_mask, token_type_ids, position_ids = self._check_inputs(
input_ids, attention_mask, token_type_ids, position_ids
)
# Handle any PRNG if needed
rngs = {}
if dropout_rng is not None:
rngs["dropout"] = dropout_rng
return self.module.apply(
{"params": self.params},
jnp.array(input_ids, dtype="i4"),
jnp.array(token_type_ids, dtype="i4"),
jnp.array(position_ids, dtype="i4"),
jnp.array(attention_mask, dtype="i4"),
rng=rngs,
)
class FlaxPerformerForMaskedLM(FlaxBertPreTrainedModel):
def __init__(
self, config: BertConfig, input_shape: Tuple = (1, 1), seed: int = 0, dtype: jnp.dtype = jnp.float32, **kwargs
):
module = FlaxPerformerForMaskedLMModule(
vocab_size=config.vocab_size,
type_vocab_size=config.type_vocab_size,
hidden_size=config.hidden_size,
intermediate_size=config.intermediate_size,
head_size=config.hidden_size,
num_heads=config.num_attention_heads,
num_encoder_layers=config.num_hidden_layers,
max_length=config.max_position_embeddings,
hidden_act=config.hidden_act,
**kwargs,
)
super().__init__(config, module, input_shape=input_shape, seed=seed, dtype=dtype)
def __call__(
self,
input_ids,
attention_mask=None,
token_type_ids=None,
position_ids=None,
params: dict = None,
train: bool = False,
dropout_rng: PRNGKey = None,
):
input_ids, attention_mask, token_type_ids, position_ids = self._check_inputs(
input_ids, attention_mask, token_type_ids, position_ids
)
# Handle any PRNG if needed
rngs = {}
if dropout_rng is not None:
rngs["dropout"] = dropout_rng
return self.module.apply(
{"params": params or self.params},
jnp.array(input_ids, dtype="i4"),
jnp.array(attention_mask, dtype="i4"),
jnp.array(token_type_ids, dtype="i4"),
jnp.array(position_ids, dtype="i4"),
not train,
rngs=rngs,
)
class FlaxPerformerForMaskedLMModule(nn.Module):
vocab_size: int
hidden_size: int
intermediate_size: int
head_size: int
num_heads: int
num_encoder_layers: int
type_vocab_size: int
max_length: int
hidden_act: str
dropout_rate: float = 0.0
dtype: jnp.dtype = jnp.float32
@nn.compact
def __call__(
self, input_ids, attention_mask=None, token_type_ids=None, position_ids=None, deterministic: bool = True
):
# Model
encoder = FlaxPerformerModule(
vocab_size=self.vocab_size,
hidden_size=self.hidden_size,
type_vocab_size=self.type_vocab_size,
max_length=self.max_length,
num_encoder_layers=self.num_encoder_layers,
num_heads=self.num_heads,
head_size=self.hidden_size,
intermediate_size=self.intermediate_size,
hidden_act=self.hidden_act,
add_pooling_layer=False,
name="bert",
)(input_ids, attention_mask, token_type_ids, position_ids)
# Compute the prediction scores
encoder = nn.Dropout(rate=self.dropout_rate)(encoder, deterministic=deterministic)
logits = FlaxBertOnlyMLMHead(
vocab_size=self.vocab_size, hidden_act=self.hidden_act, name="cls", dtype=self.dtype
)(encoder)
return (logits,)

View File

@@ -0,0 +1,660 @@
# coding=utf-8
# Copyright 2020 The Google Research Authors.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
"""
IMPORTANT:
This code was copied from
https://github.com/google-research/google-research/blob/master/performer/fast_self_attention/fast_self_attention.py on
6/11/2020. This is very new code, so it might be prone to change soon -> make sure to check the original code and
update accordingly
Core Fast Attention Module for Flax. Implementation of the approximate fast softmax and generalized attention mechanism
leveraging structured random feature maps [RFM] techniques and low rank decomposition of the attention matrix.
"""
# pylint: disable=invalid-name, missing-function-docstring, line-too-long
import abc
import functools
from collections.abc import Iterable # pylint: disable=g-importing-member
import numpy as onp
from absl import logging
import jax
import jax.numpy as jnp
from jax import lax, random
def nonnegative_softmax_kernel_feature_creator(
data, projection_matrix, attention_dims_t, batch_dims_t, precision, is_query, normalize_data=True, eps=0.0001
):
"""
Constructs nonnegative kernel features for fast softmax attention
Args:
data: input for which features are computes
projection_matrix: random matrix used to compute features
attention_dims_t: tuple of attention dimensions
batch_dims_t: tuple of batch dimensions
precision: precision parameter
is_query: predicate indicating whether input data corresponds to queries or
keys
normalize_data: predicate indicating whether data should be normalized,
eps: numerical stabilizer
Returns:
Random features for fast softmax attention.
"""
del attention_dims_t
if normalize_data:
# We have e^{qk^T/sqrt{d}} = e^{q_norm k_norm^T}, where
# w_norm = w * data_normalizer for w in {q,k}.
data_normalizer = 1.0 / (jnp.sqrt(jnp.sqrt(data.shape[-1])))
else:
data_normalizer = 1.0
ratio = 1.0 / jnp.sqrt(projection_matrix.shape[0])
data_mod_shape = data.shape[0 : len(batch_dims_t)] + projection_matrix.shape
data_thick_random_matrix = jnp.zeros(data_mod_shape) + projection_matrix
data_dash = lax.dot_general(
data_normalizer * data,
data_thick_random_matrix,
(((data.ndim - 1,), (data_thick_random_matrix.ndim - 1,)), (batch_dims_t, batch_dims_t)),
precision=precision,
)
diag_data = jnp.square(data)
diag_data = jnp.sum(diag_data, axis=data.ndim - 1)
diag_data = (diag_data / 2.0) * data_normalizer * data_normalizer
diag_data = jnp.expand_dims(diag_data, axis=data.ndim - 1)
if is_query:
last_dims_t = (len(data_dash.shape) - 1,)
data_dash = ratio * (
jnp.exp(data_dash - diag_data - jnp.max(data_dash, axis=last_dims_t, keepdims=True)) + eps
)
else:
data_dash = ratio * (jnp.exp(data_dash - diag_data - jnp.max(data_dash)) + eps)
return data_dash
def sincos_softmax_kernel_feature_creator(
data, projection_matrix, attention_dims_t, batch_dims_t, precision, normalize_data=True
):
"""
Constructs kernel sin-cos features for fast softmax attention
Args:
data: input for which features are computes
projection_matrix: random matrix used to compute features
attention_dims_t: tuple of attention dimensions
batch_dims_t: tuple of batch dimensions
precision: precision parameter
normalize_data: predicate indicating whether data should be normalized
Returns:
Random features for fast softmax attention.
"""
if normalize_data:
# We have: exp(qk^T/sqrt{d}) = exp(|q|^2/2sqrt{d}) * exp(|k|^2/2sqrt{d}) *
# exp(-(|q*c-k*c|^2)/2), where c = 1.0 / sqrt{sqrt{d}}.
data_normalizer = 1.0 / (jnp.sqrt(jnp.sqrt(data.shape[-1])))
else:
data_normalizer = 1.0
ratio = 1.0 / jnp.sqrt(projection_matrix.shape[0])
data_mod_shape = data.shape[0 : len(batch_dims_t)] + projection_matrix.shape
data_thick_random_matrix = jnp.zeros(data_mod_shape) + projection_matrix
data_dash = lax.dot_general(
data_normalizer * data,
data_thick_random_matrix,
(((data.ndim - 1,), (data_thick_random_matrix.ndim - 1,)), (batch_dims_t, batch_dims_t)),
precision=precision,
)
data_dash_cos = ratio * jnp.cos(data_dash)
data_dash_sin = ratio * jnp.sin(data_dash)
data_dash = jnp.concatenate((data_dash_cos, data_dash_sin), axis=-1)
# Constructing D_data and data^{'}
diag_data = jnp.square(data)
diag_data = jnp.sum(diag_data, axis=data.ndim - 1)
diag_data = (diag_data / 2.0) * data_normalizer * data_normalizer
diag_data = jnp.expand_dims(diag_data, axis=data.ndim - 1)
# Additional renormalization for numerical stability
data_renormalizer = jnp.max(diag_data, attention_dims_t, keepdims=True)
diag_data -= data_renormalizer
diag_data = jnp.exp(diag_data)
data_prime = data_dash * diag_data
return data_prime
def generalized_kernel_feature_creator(
data, projection_matrix, batch_dims_t, precision, kernel_fn, kernel_epsilon, normalize_data
):
"""
Constructs kernel features for fast generalized attention
Args:
data: input for which features are computes
projection_matrix: matrix used to compute features
batch_dims_t: tuple of batch dimensions
precision: precision parameter
kernel_fn: kernel function used
kernel_epsilon: additive positive term added to every feature for numerical
stability
normalize_data: predicate indicating whether data should be normalized
Returns:
Random features for fast generalized attention.
"""
if normalize_data:
data_normalizer = 1.0 / (jnp.sqrt(jnp.sqrt(data.shape[-1])))
else:
data_normalizer = 1.0
if projection_matrix is None:
return kernel_fn(data_normalizer * data) + kernel_epsilon
else:
data_mod_shape = data.shape[0 : len(batch_dims_t)] + projection_matrix.shape
data_thick_random_matrix = jnp.zeros(data_mod_shape) + projection_matrix
data_dash = lax.dot_general(
data_normalizer * data,
data_thick_random_matrix,
(((data.ndim - 1,), (data_thick_random_matrix.ndim - 1,)), (batch_dims_t, batch_dims_t)),
precision=precision,
)
data_prime = kernel_fn(data_dash) + kernel_epsilon
return data_prime
def make_fast_softmax_attention(
qkv_dim,
renormalize_attention=True,
numerical_stabilizer=0.000001,
nb_features=256,
ortho_features=True,
ortho_scaling=0.0,
redraw_features=True,
unidirectional=False,
nonnegative_features=True,
lax_scan_unroll=1,
):
"""Construct a fast softmax attention method."""
logging.info(
"Fast softmax attention: %s features and orthogonal=%s, renormalize=%s",
nb_features,
ortho_features,
renormalize_attention,
)
if ortho_features:
matrix_creator = functools.partial(GaussianOrthogonalRandomMatrix, nb_features, qkv_dim, scaling=ortho_scaling)
else:
matrix_creator = functools.partial(GaussianUnstructuredRandomMatrix, nb_features, qkv_dim)
if nonnegative_features:
def kernel_feature_creator(
data, projection_matrix, attention_dims_t, batch_dims_t, precision, is_query, normalize_data=True
):
return nonnegative_softmax_kernel_feature_creator(
data,
projection_matrix,
attention_dims_t,
batch_dims_t,
precision,
is_query,
normalize_data,
numerical_stabilizer,
)
else:
def kernel_feature_creator(
data, projection_matrix, attention_dims_t, batch_dims_t, precision, is_query, normalize_data=True
):
del is_query
return sincos_softmax_kernel_feature_creator(
data, projection_matrix, attention_dims_t, batch_dims_t, precision, normalize_data
)
attention_fn = FastAttentionviaLowRankDecomposition(
matrix_creator,
kernel_feature_creator,
renormalize_attention=renormalize_attention,
numerical_stabilizer=numerical_stabilizer,
redraw_features=redraw_features,
unidirectional=unidirectional,
lax_scan_unroll=lax_scan_unroll,
).dot_product_attention
return attention_fn
def make_fast_generalized_attention(
qkv_dim,
renormalize_attention=True,
numerical_stabilizer=0.0,
nb_features=256,
features_type="deterministic",
kernel_fn=jax.nn.relu,
kernel_epsilon=0.001,
redraw_features=False,
unidirectional=False,
lax_scan_unroll=1,
):
"""Construct a fast generalized attention menthod."""
logging.info("Fast generalized attention.: %s features and renormalize=%s", nb_features, renormalize_attention)
if features_type == "ortho":
matrix_creator = functools.partial(GaussianOrthogonalRandomMatrix, nb_features, qkv_dim, scaling=False)
elif features_type == "iid":
matrix_creator = functools.partial(GaussianUnstructuredRandomMatrix, nb_features, qkv_dim)
elif features_type == "deterministic":
matrix_creator = None
else:
raise ValueError("Unknown feature value type")
def kernel_feature_creator(
data, projection_matrix, attention_dims_t, batch_dims_t, precision, is_query, normalize_data=False
):
del attention_dims_t
del is_query
return generalized_kernel_feature_creator(
data, projection_matrix, batch_dims_t, precision, kernel_fn, kernel_epsilon, normalize_data
)
attention_fn = FastAttentionviaLowRankDecomposition(
matrix_creator,
kernel_feature_creator,
renormalize_attention=renormalize_attention,
numerical_stabilizer=numerical_stabilizer,
redraw_features=redraw_features,
unidirectional=unidirectional,
lax_scan_unroll=lax_scan_unroll,
).dot_product_attention
return attention_fn
class RandomMatrix(object):
r"""
Abstract class providing a method for constructing 2D random arrays. Class is responsible for constructing 2D
random arrays.
"""
__metaclass__ = abc.ABCMeta
@abc.abstractmethod
def get_2d_array(self):
raise NotImplementedError("Abstract method")
class GaussianUnstructuredRandomMatrix(RandomMatrix):
def __init__(self, nb_rows, nb_columns, key):
self.nb_rows = nb_rows
self.nb_columns = nb_columns
self.key = key
def get_2d_array(self):
return random.normal(self.key, (self.nb_rows, self.nb_columns))
class GaussianOrthogonalRandomMatrix(RandomMatrix):
r"""
Class providing a method to create Gaussian orthogonal matrix. Class is responsible for constructing 2D Gaussian
orthogonal arrays.
"""
def __init__(self, nb_rows, nb_columns, key, scaling=0):
self.nb_rows = nb_rows
self.nb_columns = nb_columns
self.key = key
self.scaling = scaling
def get_2d_array(self):
nb_full_blocks = int(self.nb_rows / self.nb_columns)
block_list = []
rng = self.key
for _ in range(nb_full_blocks):
rng, rng_input = jax.random.split(rng)
unstructured_block = random.normal(rng_input, (self.nb_columns, self.nb_columns))
q, _ = jnp.linalg.qr(unstructured_block)
q = jnp.transpose(q)
block_list.append(q)
remaining_rows = self.nb_rows - nb_full_blocks * self.nb_columns
if remaining_rows > 0:
rng, rng_input = jax.random.split(rng)
unstructured_block = random.normal(rng_input, (self.nb_columns, self.nb_columns))
q, _ = jnp.linalg.qr(unstructured_block)
q = jnp.transpose(q)
block_list.append(q[0:remaining_rows])
final_matrix = jnp.vstack(block_list)
if self.scaling == 0:
multiplier = jnp.linalg.norm(random.normal(self.key, (self.nb_rows, self.nb_columns)), axis=1)
elif self.scaling == 1:
multiplier = jnp.sqrt(float(self.nb_columns)) * jnp.ones((self.nb_rows))
else:
raise ValueError("Scaling must be one of {0, 1}. Was %s" % self._scaling)
return jnp.matmul(jnp.diag(multiplier), final_matrix)
class FastAttention(object):
r"""
Abstract class providing a method for fast attention. Class is responsible for providing a method
<dot_product_attention> for fast approximate attention.
"""
__metaclass__ = abc.ABCMeta
@abc.abstractmethod
def dot_product_attention(
self,
query,
key,
value,
dtype=jnp.float32,
bias=None,
axis=None,
broadcast_dropout=True,
dropout_rng=None,
dropout_rate=0.0,
deterministic=False,
precision=None,
):
"""
Computes dot-product attention given query, key, and value. This is the core function for applying fast
approximate dot-product attention. It calculates the attention weights given query and key and combines the
values using the attention weights. This function supports multi-dimensional inputs
Args:
query: queries for calculating attention with shape of [batch_size, dim1,
dim2, ..., dimN, num_heads, mem_channels].
key: keys for calculating attention with shape of [batch_size, dim1, dim2,
..., dimN, num_heads, mem_channels].
value: values to be used in attention with shape of [batch_size, dim1,
dim2,..., dimN, num_heads, value_channels].
dtype: the dtype of the computation (default: float32)
bias: bias for the attention weights. This can be used for incorporating
autoregressive mask, padding mask, proximity bias.
axis: axises over which the attention is applied.
broadcast_dropout: bool: use a broadcasted dropout along batch dims.
dropout_rng: JAX PRNGKey: to be used for dropout.
dropout_rate: dropout rate.
deterministic: bool, deterministic or not (to apply dropout).
precision: numerical precision of the computation see `jax.lax.Precision`
for details
Returns:
Output of shape [bs, dim1, dim2, ..., dimN,, num_heads, value_channels].
"""
raise NotImplementedError("Abstract method")
def _numerator(z_slice_shape, precision, unroll=1):
def fwd(qs, ks, vs):
def body(p, qkv):
(q, k, v) = qkv
p += jnp.einsum("...m,...d->...md", k, v, precision=precision)
X_slice = jnp.einsum("...m,...md->...d", q, p, precision=precision)
return p, X_slice
init_value = jnp.zeros(z_slice_shape)
p, W = lax.scan(body, init_value, (qs, ks, vs), unroll=unroll)
return W, (p, qs, ks, vs)
def bwd(pqkv, W_ct):
def body(carry, qkv_xct):
p, p_ct = carry
q, k, v, x_ct = qkv_xct
q_ct = jnp.einsum("...d,...md->...m", x_ct, p, precision=precision)
p_ct += jnp.einsum("...d,...m->...md", x_ct, q, precision=precision)
k_ct = jnp.einsum("...md,...d->...m", p_ct, v, precision=precision)
v_ct = jnp.einsum("...md,...m->...d", p_ct, k, precision=precision)
p -= jnp.einsum("...m,...d->...md", k, v, precision=precision)
return (p, p_ct), (q_ct, k_ct, v_ct)
p, qs, ks, vs = pqkv
_, (qs_ct, ks_ct, vs_ct) = lax.scan(
body, (p, jnp.zeros_like(p)), (qs, ks, vs, W_ct), reverse=True, unroll=unroll
)
return qs_ct, ks_ct, vs_ct
@jax.custom_vjp
def _numerator_impl(qs, ks, vs):
W, _ = fwd(qs, ks, vs)
return W
_numerator_impl.defvjp(fwd, bwd)
return _numerator_impl
def _denominator(t_slice_shape, precision, unroll=1):
def fwd(qs, ks):
def body(p, qk):
q, k = qk
p += k
x = jnp.einsum("...m,...m->...", q, p, precision=precision)
return p, x
p = jnp.zeros(t_slice_shape)
p, R = lax.scan(body, p, (qs, ks), unroll=unroll)
return R, (qs, ks, p)
def bwd(qkp, R_ct):
def body(carry, qkx):
p, p_ct = carry
q, k, x_ct = qkx
q_ct = jnp.einsum("...,...m->...m", x_ct, p, precision=precision)
p_ct += jnp.einsum("...,...m->...m", x_ct, q, precision=precision)
k_ct = p_ct
p -= k
return (p, p_ct), (q_ct, k_ct)
qs, ks, p = qkp
_, (qs_ct, ks_ct) = lax.scan(body, (p, jnp.zeros_like(p)), (qs, ks, R_ct), reverse=True, unroll=unroll)
return (qs_ct, ks_ct)
@jax.custom_vjp
def _denominator_impl(qs, ks):
R, _ = fwd(qs, ks)
return R
_denominator_impl.defvjp(fwd, bwd)
return _denominator_impl
class FastAttentionviaLowRankDecomposition(FastAttention):
r"""
Class providing a method for fast attention via low rank decomposition. Class is responsible for providing a method
<dot_product_attention> for fast dot-product attention with the use of low rank decomposition (e.g. with random
feature maps).
"""
def __init__(
self,
matrix_creator,
kernel_feature_creator,
renormalize_attention,
numerical_stabilizer,
redraw_features,
unidirectional,
lax_scan_unroll=1,
): # For optimal GPU performance, set to 16.
rng = random.PRNGKey(0)
self.matrix_creator = matrix_creator
self.projection_matrix = self.draw_weights(rng)
self.kernel_feature_creator = kernel_feature_creator
self.renormalize_attention = renormalize_attention
self.numerical_stabilizer = numerical_stabilizer
self.redraw_features = redraw_features
self.unidirectional = unidirectional
self.lax_scan_unroll = lax_scan_unroll
def draw_weights(self, key):
if self.matrix_creator is None:
return None
matrixrng, _ = random.split(key)
projection_matrix = self.matrix_creator(key=matrixrng).get_2d_array()
return projection_matrix
def dot_product_attention(
self,
query,
key,
value,
dtype=jnp.float32,
bias=None,
axis=None,
broadcast_dropout=True,
dropout_rng=None,
dropout_rate=0.0,
deterministic=False,
precision=None,
):
assert key.shape[:-1] == value.shape[:-1]
assert query.shape[0:1] == key.shape[0:1] and query.shape[-1] == key.shape[-1]
if axis is None:
axis = tuple(range(1, key.ndim - 2))
if not isinstance(axis, Iterable):
axis = (axis,)
assert key.ndim == query.ndim
assert key.ndim == value.ndim
for ax in axis:
if not (query.ndim >= 3 and 1 <= ax < query.ndim - 2):
raise ValueError("Attention axis must be between the batch " "axis and the last-two axes.")
n = key.ndim
# Constructing projection tensor.
if self.redraw_features:
# TODO(kchoro): Get rid of the constant below.
query_seed = lax.convert_element_type(jnp.ceil(jnp.sum(query) * 10000000.0), jnp.int32)
rng = random.PRNGKey(query_seed)
self.projection_matrix = self.draw_weights(rng)
# batch_dims is <bs, <non-attention dims>, num_heads>
batch_dims = tuple(onp.delete(range(n), axis + (n - 1,)))
# q & k -> (bs, <non-attention dims>, num_heads, <attention dims>, channels)
qk_perm = batch_dims + axis + (n - 1,)
k_extra_perm = axis + batch_dims + (n - 1,)
key_extra = key.transpose(k_extra_perm)
key = key.transpose(qk_perm)
query = query.transpose(qk_perm)
# v -> (bs, <non-attention dims>, num_heads, <attention dims>, channels)
v_perm = batch_dims + axis + (n - 1,)
value = value.transpose(v_perm)
batch_dims_t = tuple(range(len(batch_dims)))
attention_dims_t = tuple(range(len(batch_dims), len(batch_dims) + len(axis)))
# Constructing tensors Q^{'} and K^{'}.
query_prime = self.kernel_feature_creator(
query, self.projection_matrix, attention_dims_t, batch_dims_t, precision, True
)
key_prime = self.kernel_feature_creator(
key, self.projection_matrix, attention_dims_t, batch_dims_t, precision, False
)
if self.unidirectional:
index = attention_dims_t[0]
z_slice_shape = key_prime.shape[0 : len(batch_dims_t)] + (key_prime.shape[-1],) + (value.shape[-1],)
numerator_fn = _numerator(z_slice_shape, precision, self.lax_scan_unroll)
W = numerator_fn(
jnp.moveaxis(query_prime, index, 0), jnp.moveaxis(key_prime, index, 0), jnp.moveaxis(value, index, 0)
)
# Constructing W = (Q^{'}(K^{'})^{T})_{masked}V
W = jnp.moveaxis(W, 0, index)
if not self.renormalize_attention:
# Unidirectional, not-normalized attention.
perm_inv = _invert_perm(qk_perm)
result = W.transpose(perm_inv)
return result
else:
# Unidirectional, normalized attention.
thick_all_ones = jnp.zeros(key.shape[0:-1]) + jnp.ones(key_extra.shape[0 : len(axis)])
index = attention_dims_t[0]
t_slice_shape = key_prime.shape[0 : len(batch_dims_t)] + (key_prime.shape[-1],)
denominator_fn = _denominator(t_slice_shape, precision, self.lax_scan_unroll)
R = denominator_fn(jnp.moveaxis(query_prime, index, 0), jnp.moveaxis(key_prime, index, 0))
R = jnp.moveaxis(R, 0, index)
else:
contract_query = tuple(range(len(batch_dims) + len(axis), len(batch_dims) + len(axis) + 1))
contract_z = tuple(range(len(batch_dims), len(batch_dims) + 1))
# Constructing Z = (K^{'})^{T}V
# Z (bs, <non-attention dims>, num_heads, channels_m, channels_v)
Z = lax.dot_general(
key_prime,
value,
((attention_dims_t, attention_dims_t), (batch_dims_t, batch_dims_t)),
precision=precision,
)
# Constructing W = Q^{'}Z = Q^{'}(K^{'})^{T}V
# q (bs, <non-attention dims>, num_heads, <attention dims>, channels_m)
# Z (bs, <non-attention dims>, num_heads, channels_m, channels_v)
# W (bs, <non-attention dims>, num_heads, <attention dims>, channels_v)
W = lax.dot_general(
query_prime, Z, ((contract_query, contract_z), (batch_dims_t, batch_dims_t)), precision=precision
)
if not self.renormalize_attention:
# Bidirectional, not-normalized attention.
perm_inv = _invert_perm(qk_perm)
result = W.transpose(perm_inv)
return result
else:
# Bidirectional, normalized attention.
thick_all_ones = jnp.zeros(key.shape[0:-1]) + jnp.ones(key_extra.shape[0 : len(axis)])
contract_key = tuple(range(len(batch_dims), len(batch_dims) + len(axis)))
contract_thick_all_ones = tuple(range(thick_all_ones.ndim - len(axis), thick_all_ones.ndim))
# Construct T = (K^{'})^{T} 1_L
# k (bs, <non-attention dims>, num_heads, <attention dims>, channels)
T = lax.dot_general(
key_prime,
thick_all_ones,
((contract_key, contract_thick_all_ones), (batch_dims_t, batch_dims_t)),
precision=precision,
)
# Construct partition function: R = Q^{'} T = Q^{'}(K^{'})^{T} 1_L
# q_p (bs, <non-attention dims>, num_heads, <attention dims>, channs_m)
# T (bs, <non-attention dims>, num_heads, channels_m)
R = lax.dot_general(
query_prime,
T,
(((query_prime.ndim - 1,), (T.ndim - 1,)), (batch_dims_t, range(0, len(T.shape) - 1))),
precision=precision,
)
R = R + 2 * self.numerical_stabilizer * (jnp.abs(R) <= self.numerical_stabilizer)
R = jnp.reciprocal(R)
R = jnp.expand_dims(R, len(R.shape))
# W (bs, <non-attention dims>, num_heads, <attention dims>, channels_v)
# R (bs, <non-attention dims>, num_heads, <attention dims>, extra_channel)
result = W * R
# back to (bs, dim1, dim2, ..., dimN, num_heads, channels)
perm_inv = _invert_perm(qk_perm)
result = result.transpose(perm_inv)
return result
def _invert_perm(perm):
perm_inv = [0] * len(perm)
for i, j in enumerate(perm):
perm_inv[j] = i
return tuple(perm_inv)

View File

@@ -0,0 +1,685 @@
# coding=utf-8
# Copyright 2020 The HuggingFace Team All rights reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
"""
Fine-tuning the library models for masked language modeling (BERT, ALBERT, RoBERTa...) with whole word masking on a
text file or a dataset.
Here is the full list of checkpoints on the hub that can be fine-tuned by this script:
https://huggingface.co/models?filter=masked-lm
"""
import logging
import os
import sys
from dataclasses import dataclass, field
# You can also adapt this script on your own masked language modeling task. Pointers for this are left as comments.
from pathlib import Path
from typing import Dict, List, Optional, Tuple
import numpy as np
from datasets import load_dataset
from tqdm import tqdm
import jax
import jax.numpy as jnp
from flax import jax_utils
from flax.optim import Adam
from flax.training import common_utils
from flax.training.common_utils import get_metrics
from jax.nn import log_softmax
from modeling_flax_performer import FlaxPerformerForMaskedLM
from transformers import (
MODEL_FOR_MASKED_LM_MAPPING,
AutoTokenizer,
BertConfig,
FlaxBertForMaskedLM,
HfArgumentParser,
PreTrainedTokenizerBase,
TensorType,
TrainingArguments,
is_tensorboard_available,
set_seed,
)
# Cache the result
has_tensorboard = is_tensorboard_available()
if has_tensorboard:
try:
from flax.metrics.tensorboard import SummaryWriter
except ImportError as ie:
has_tensorboard = False
print(f"Unable to display metrics through TensorBoard because some package are not installed: {ie}")
else:
print(
"Unable to display metrics through TensorBoard because the package is not installed: "
"Please run pip install tensorboard to enable."
)
MODEL_CONFIG_CLASSES = list(MODEL_FOR_MASKED_LM_MAPPING.keys())
MODEL_TYPES = tuple(conf.model_type for conf in MODEL_CONFIG_CLASSES)
@dataclass
class WandbArguments:
"""
Arguments for logging
"""
wandb_user_name: Optional[str] = field(
default=None,
metadata={"help": "The WandB user name for potential logging. If left None, no logging"},
)
wandb_project_name: Optional[str] = field(
default="performer-experiments",
metadata={"help": "The WandB project name for potential logging"},
)
@dataclass
class ModelArguments:
"""
Arguments pertaining to which model/config/tokenizer we are going to fine-tune, or train from scratch.
"""
model_name_or_path: Optional[str] = field(
default=None,
metadata={
"help": "The model checkpoint for weights initialization."
"Don't set if you want to train a model from scratch."
},
)
performer: bool = field(
default=False,
metadata={"help": "Whether to use FAVOR+ attention"},
)
reinitialize: bool = field(
default=False,
metadata={"help": "Whether to use a blank model without pretraining"},
)
tokenizer_name: Optional[str] = field(
default=None, metadata={"help": "Pretrained tokenizer name or path if not the same as model_name"}
)
use_fast_tokenizer: bool = field(
default=True,
metadata={"help": "Whether to use one of the fast tokenizer (backed by the tokenizers library) or not."},
)
cache_dir: Optional[str] = field(
default=None, metadata={"help": "Where do you want to store the pretrained models downloaded from s3"}
)
@dataclass
class DataTrainingArguments:
"""
Arguments pertaining to what data we are going to input our model for training and eval.
"""
dataset_name: Optional[str] = field(
default=None, metadata={"help": "The name of the dataset to use (via the datasets library)."}
)
dataset_config_name: Optional[str] = field(
default=None, metadata={"help": "The configuration name of the dataset to use (via the datasets library)."}
)
train_file: Optional[str] = field(default=None, metadata={"help": "The input training data file (a text file)."})
validation_file: Optional[str] = field(
default=None,
metadata={"help": "An optional input evaluation data file to evaluate the perplexity on (a text file)."},
)
train_ref_file: Optional[str] = field(
default=None,
metadata={"help": "An optional input train ref data file for whole word masking in Chinese."},
)
validation_ref_file: Optional[str] = field(
default=None,
metadata={"help": "An optional input validation ref data file for whole word masking in Chinese."},
)
overwrite_cache: bool = field(
default=False, metadata={"help": "Overwrite the cached training and evaluation sets"}
)
validation_split_percentage: Optional[int] = field(
default=5,
metadata={
"help": "The percentage of the train set used as validation set in case there's no validation split"
},
)
max_seq_length: Optional[int] = field(
default=None,
metadata={
"help": "The maximum total input sequence length after tokenization. Sequences longer "
"than this will be truncated. Default to the max input length of the model."
},
)
preprocessing_num_workers: Optional[int] = field(
default=None,
metadata={"help": "The number of processes to use for the preprocessing."},
)
mlm_probability: float = field(
default=0.15, metadata={"help": "Ratio of tokens to mask for masked language modeling loss"}
)
pad_to_max_length: bool = field(
default=False,
metadata={
"help": "Whether to pad all samples to `max_seq_length`. "
"If False, will pad the samples dynamically when batching to the maximum length in the batch."
},
)
def __post_init__(self):
if self.dataset_name is None and self.train_file is None and self.validation_file is None:
raise ValueError("Need either a dataset name or a training/validation file.")
else:
if self.train_file is not None:
extension = self.train_file.split(".")[-1]
assert extension in ["csv", "json", "txt"], "`train_file` should be a csv, a json or a txt file."
if self.validation_file is not None:
extension = self.validation_file.split(".")[-1]
assert extension in ["csv", "json", "txt"], "`validation_file` should be a csv, a json or a txt file."
# Adapted from transformers/data/data_collator.py
# Letting here for now, let's discuss where it should live
@dataclass
class FlaxDataCollatorForLanguageModeling:
"""
Data collator used for language modeling. Inputs are dynamically padded to the maximum length of a batch if they
are not all of the same length.
Args:
tokenizer (:class:`~transformers.PreTrainedTokenizer` or :class:`~transformers.PreTrainedTokenizerFast`):
The tokenizer used for encoding the data.
mlm (:obj:`bool`, `optional`, defaults to :obj:`True`):
Whether or not to use masked language modeling. If set to :obj:`False`, the labels are the same as the
inputs with the padding tokens ignored (by setting them to -100). Otherwise, the labels are -100 for
non-masked tokens and the value to predict for the masked token.
mlm_probability (:obj:`float`, `optional`, defaults to 0.15):
The probability with which to (randomly) mask tokens in the input, when :obj:`mlm` is set to :obj:`True`.
.. note::
For best performance, this data collator should be used with a dataset having items that are dictionaries or
BatchEncoding, with the :obj:`"special_tokens_mask"` key, as returned by a
:class:`~transformers.PreTrainedTokenizer` or a :class:`~transformers.PreTrainedTokenizerFast` with the
argument :obj:`return_special_tokens_mask=True`.
"""
tokenizer: PreTrainedTokenizerBase
mlm: bool = True
mlm_probability: float = 0.15
def __post_init__(self):
if self.mlm and self.tokenizer.mask_token is None:
raise ValueError(
"This tokenizer does not have a mask token which is necessary for masked language modeling. "
"You should pass `mlm=False` to train on causal language modeling instead."
)
def __call__(self, examples: List[Dict[str, np.ndarray]], pad_to_multiple_of: int) -> Dict[str, np.ndarray]:
# Handle dict or lists with proper padding and conversion to tensor.
batch = self.tokenizer.pad(examples, pad_to_multiple_of=pad_to_multiple_of, return_tensors=TensorType.NUMPY)
# If special token mask has been preprocessed, pop it from the dict.
special_tokens_mask = batch.pop("special_tokens_mask", None)
if self.mlm:
batch["input_ids"], batch["labels"] = self.mask_tokens(
batch["input_ids"], special_tokens_mask=special_tokens_mask
)
else:
labels = batch["input_ids"].copy()
if self.tokenizer.pad_token_id is not None:
labels[labels == self.tokenizer.pad_token_id] = -100
batch["labels"] = labels
return batch
def mask_tokens(
self, inputs: np.ndarray, special_tokens_mask: Optional[np.ndarray]
) -> Tuple[jnp.ndarray, jnp.ndarray]:
"""
Prepare masked tokens inputs/labels for masked language modeling: 80% MASK, 10% random, 10% original.
"""
labels = inputs.copy()
# We sample a few tokens in each sequence for MLM training (with probability `self.mlm_probability`)
probability_matrix = np.full(labels.shape, self.mlm_probability)
special_tokens_mask = special_tokens_mask.astype("bool")
probability_matrix[special_tokens_mask] = 0.0
masked_indices = np.random.binomial(1, probability_matrix).astype("bool")
labels[~masked_indices] = -100 # We only compute loss on masked tokens
# 80% of the time, we replace masked input tokens with tokenizer.mask_token ([MASK])
indices_replaced = np.random.binomial(1, np.full(labels.shape, 0.8)).astype("bool") & masked_indices
inputs[indices_replaced] = self.tokenizer.convert_tokens_to_ids(self.tokenizer.mask_token)
# 10% of the time, we replace masked input tokens with random word
indices_random = np.random.binomial(1, np.full(labels.shape, 0.5)).astype("bool")
indices_random &= masked_indices & ~indices_replaced
random_words = np.random.randint(self.tokenizer.vocab_size, size=labels.shape, dtype="i4")
inputs[indices_random] = random_words[indices_random]
# The rest of the time (10% of the time) we keep the masked input tokens unchanged
return inputs, labels
def create_learning_rate_scheduler(
factors="constant * linear_warmup * rsqrt_decay",
base_learning_rate=0.5,
warmup_steps=1000,
decay_factor=0.5,
steps_per_decay=20000,
steps_per_cycle=100000,
):
"""Creates learning rate schedule.
Interprets factors in the factors string which can consist of:
* constant: interpreted as the constant value,
* linear_warmup: interpreted as linear warmup until warmup_steps,
* rsqrt_decay: divide by square root of max(step, warmup_steps)
* rsqrt_normalized_decay: divide by square root of max(step/warmup_steps, 1)
* decay_every: Every k steps decay the learning rate by decay_factor.
* cosine_decay: Cyclic cosine decay, uses steps_per_cycle parameter.
Args:
factors: string, factors separated by "*" that defines the schedule.
base_learning_rate: float, the starting constant for the lr schedule.
warmup_steps: int, how many steps to warm up for in the warmup schedule.
decay_factor: float, the amount to decay the learning rate by.
steps_per_decay: int, how often to decay the learning rate.
steps_per_cycle: int, steps per cycle when using cosine decay.
Returns:
a function learning_rate(step): float -> {"learning_rate": float}, the
step-dependent lr.
"""
factors = [n.strip() for n in factors.split("*")]
def step_fn(step):
"""Step to learning rate function."""
ret = 1.0
for name in factors:
if name == "constant":
ret *= base_learning_rate
elif name == "linear_warmup":
ret *= jnp.minimum(1.0, step / warmup_steps)
elif name == "rsqrt_decay":
ret /= jnp.sqrt(jnp.maximum(step, warmup_steps))
elif name == "rsqrt_normalized_decay":
ret *= jnp.sqrt(warmup_steps)
ret /= jnp.sqrt(jnp.maximum(step, warmup_steps))
elif name == "decay_every":
ret *= decay_factor ** (step // steps_per_decay)
elif name == "cosine_decay":
progress = jnp.maximum(0.0, (step - warmup_steps) / float(steps_per_cycle))
ret *= jnp.maximum(0.0, 0.5 * (1.0 + jnp.cos(jnp.pi * (progress % 1.0))))
else:
raise ValueError("Unknown factor %s." % name)
return jnp.asarray(ret, dtype=jnp.float32)
return step_fn
def compute_metrics(logits, labels, weights, label_smoothing=0.0):
"""Compute summary metrics."""
loss, normalizer = cross_entropy(logits, labels, weights, label_smoothing)
acc, _ = accuracy(logits, labels, weights)
metrics = {"loss": loss, "accuracy": acc, "normalizer": normalizer}
metrics = jax.lax.psum(metrics, axis_name="batch")
return metrics
def accuracy(logits, targets, weights=None):
"""Compute weighted accuracy for log probs and targets.
Args:
logits: [batch, length, num_classes] float array.
targets: categorical targets [batch, length] int array.
weights: None or array of shape [batch, length]
Returns:
Tuple of scalar loss and batch normalizing factor.
"""
if logits.ndim != targets.ndim + 1:
raise ValueError(
"Incorrect shapes. Got shape %s logits and %s targets" % (str(logits.shape), str(targets.shape))
)
loss = jnp.equal(jnp.argmax(logits, axis=-1), targets)
loss *= weights
return loss.sum(), weights.sum()
def cross_entropy(logits, targets, weights=None, label_smoothing=0.0):
"""Compute cross entropy and entropy for log probs and targets.
Args:
logits: [batch, length, num_classes] float array.
targets: categorical targets [batch, length] int array.
weights: None or array of shape [batch, length]
label_smoothing: label smoothing constant, used to determine the on and off values.
Returns:
Tuple of scalar loss and batch normalizing factor.
"""
if logits.ndim != targets.ndim + 1:
raise ValueError(
"Incorrect shapes. Got shape %s logits and %s targets" % (str(logits.shape), str(targets.shape))
)
vocab_size = logits.shape[-1]
confidence = 1.0 - label_smoothing
low_confidence = (1.0 - confidence) / (vocab_size - 1)
normalizing_constant = -(
confidence * jnp.log(confidence) + (vocab_size - 1) * low_confidence * jnp.log(low_confidence + 1e-20)
)
soft_targets = common_utils.onehot(targets, vocab_size, on_value=confidence, off_value=low_confidence)
loss = -jnp.sum(soft_targets * log_softmax(logits), axis=-1)
loss = loss - normalizing_constant
if weights is not None:
loss = loss * weights
normalizing_factor = weights.sum()
else:
normalizing_factor = np.prod(targets.shape)
return loss.sum(), normalizing_factor
def training_step(optimizer, batch, dropout_rng):
dropout_rng, new_dropout_rng = jax.random.split(dropout_rng)
def loss_fn(params):
targets = batch.pop("labels")
# Hide away tokens which doesn't participate in the optimization
token_mask = jnp.where(targets > 0, 1.0, 0.0)
logits = model(**batch, params=params, dropout_rng=dropout_rng, train=True)[0]
loss, weight_sum = cross_entropy(logits, targets, token_mask)
return loss / weight_sum
step = optimizer.state.step
lr = lr_scheduler_fn(step)
grad_fn = jax.value_and_grad(loss_fn)
loss, grad = grad_fn(optimizer.target)
grad = jax.lax.pmean(grad, "batch")
optimizer = optimizer.apply_gradient(grad, learning_rate=lr)
return loss, optimizer, new_dropout_rng
def eval_step(params, batch):
"""
Calculate evaluation metrics on a batch.
"""
targets = batch.pop("labels")
# Hide away tokens which doesn't participate in the optimization
token_mask = jnp.where(targets > 0, 1.0, 0.0)
logits = model(**batch, params=params, train=False)[0]
return compute_metrics(logits, targets, token_mask)
def generate_batch_splits(samples_idx: jnp.ndarray, batch_size: int) -> jnp.ndarray:
nb_samples = len(samples_idx)
samples_to_remove = nb_samples % batch_size
if samples_to_remove != 0:
samples_idx = samples_idx[:-samples_to_remove]
sections_split = nb_samples // batch_size
batch_idx = np.split(samples_idx, sections_split)
return batch_idx
if __name__ == "__main__":
# See all possible arguments in src/transformers/training_args.py
# or by passing the --help flag to this script.
# We now keep distinct sets of args, for a cleaner separation of concerns.
parser = HfArgumentParser((ModelArguments, DataTrainingArguments, TrainingArguments, WandbArguments))
if len(sys.argv) == 2 and sys.argv[1].endswith(".json"):
# If we pass only one argument to the script and it's the path to a json file,
# let's parse it to get our arguments.
model_args, data_args, training_args, wandb_args = parser.parse_json_file(
json_file=os.path.abspath(sys.argv[1])
)
else:
model_args, data_args, training_args, wandb_args = parser.parse_args_into_dataclasses()
if (
os.path.exists(training_args.output_dir)
and os.listdir(training_args.output_dir)
and training_args.do_train
and not training_args.overwrite_output_dir
):
raise ValueError(
f"Output directory ({training_args.output_dir}) already exists and is not empty."
"Use --overwrite_output_dir to overcome."
)
# Setup logging
logging.basicConfig(
format="%(asctime)s - %(levelname)s - %(name)s - %(message)s",
level="NOTSET",
datefmt="[%X]",
)
# Log on each process the small summary:
logger = logging.getLogger(__name__)
logger.warning(
f"Process rank: {training_args.local_rank}, device: {training_args.device}, n_gpu: {training_args.n_gpu}"
+ f"distributed training: {bool(training_args.local_rank != -1)}, 16-bits training: {training_args.fp16}"
)
# Set the verbosity to info of the Transformers logger (on main process only):
logger.info("Training/evaluation parameters %s", training_args)
# Set seed before initializing model.
set_seed(training_args.seed)
# Get the datasets: you can either provide your own CSV/JSON/TXT training and evaluation files (see below)
# or just provide the name of one of the public datasets available on the hub at https://huggingface.co/datasets/
# (the dataset will be downloaded automatically from the datasets Hub).
#
# For CSV/JSON files, this script will use the column called 'text' or the first column if no column called
# 'text' is found. You can easily tweak this behavior (see below).
#
# In distributed training, the load_dataset function guarantees that only one local process can concurrently
# download the dataset.
if data_args.dataset_name is not None:
# Downloading and loading a dataset from the hub.
datasets = load_dataset(data_args.dataset_name, data_args.dataset_config_name)
if "validation" not in datasets.keys():
datasets["validation"] = load_dataset(
data_args.dataset_name,
data_args.dataset_config_name,
split=f"train[:{data_args.validation_split_percentage}%]",
)
datasets["train"] = load_dataset(
data_args.dataset_name,
data_args.dataset_config_name,
split=f"train[{data_args.validation_split_percentage}%:]",
)
else:
data_files = {}
if data_args.train_file is not None:
data_files["train"] = data_args.train_file
if data_args.validation_file is not None:
data_files["validation"] = data_args.validation_file
extension = data_args.train_file.split(".")[-1]
if extension == "txt":
extension = "text"
datasets = load_dataset(extension, data_files=data_files)
# See more about loading any type of standard or custom dataset (from files, python dict, pandas DataFrame, etc) at
# https://huggingface.co/docs/datasets/loading_datasets.html.
# Load pretrained model and tokenizer
# Distributed training:
# The .from_pretrained methods guarantee that only one local process can concurrently
# download model & vocab.
rng = jax.random.PRNGKey(training_args.seed)
dropout_rngs = jax.random.split(rng, jax.local_device_count())
config = BertConfig.from_pretrained(model_args.model_name_or_path, cache_dir=model_args.cache_dir)
lm_class = FlaxPerformerForMaskedLM if model_args.performer else FlaxBertForMaskedLM
if model_args.reinitialize:
model = lm_class(config=BertConfig.from_pretrained(model_args.model_name_or_path))
else:
model = lm_class.from_pretrained(
model_args.model_name_or_path,
dtype=jnp.float32,
input_shape=(training_args.train_batch_size, config.max_position_embeddings),
seed=training_args.seed,
dropout_rate=0.1,
)
if model_args.tokenizer_name:
tokenizer = AutoTokenizer.from_pretrained(
model_args.tokenizer_name, cache_dir=model_args.cache_dir, use_fast=model_args.use_fast_tokenizer
)
elif model_args.model_name_or_path:
tokenizer = AutoTokenizer.from_pretrained(
model_args.model_name_or_path, cache_dir=model_args.cache_dir, use_fast=model_args.use_fast_tokenizer
)
else:
raise ValueError(
"You are instantiating a new tokenizer from scratch. This is not supported by this script."
"You can do it from another script, save it, and load it from here, using --tokenizer_name."
)
# Preprocessing the datasets.
# First we tokenize all the texts.
if training_args.do_train:
column_names = datasets["train"].column_names
else:
column_names = datasets["validation"].column_names
text_column_name = "text" if "text" in column_names else column_names[0]
padding = "max_length" if data_args.pad_to_max_length else False
def tokenize_function(examples):
# Remove empty lines
examples = [line for line in examples if len(line) > 0 and not line.isspace()]
return tokenizer(
examples,
return_special_tokens_mask=True,
padding=padding,
truncation=True,
max_length=data_args.max_seq_length,
)
tokenized_datasets = datasets.map(
tokenize_function,
input_columns=[text_column_name],
batched=True,
num_proc=data_args.preprocessing_num_workers,
remove_columns=column_names,
load_from_cache_file=not data_args.overwrite_cache,
)
# Enable tensorboard only on the master node
if has_tensorboard and jax.host_id() == 0:
summary_writer = SummaryWriter(log_dir=Path(training_args.output_dir).joinpath("logs").as_posix())
# Data collator
# This one will take care of randomly masking the tokens.
data_collator = FlaxDataCollatorForLanguageModeling(tokenizer=tokenizer, mlm_probability=data_args.mlm_probability)
# Setup optimizer
optimizer = Adam(
learning_rate=training_args.learning_rate,
weight_decay=training_args.weight_decay,
beta1=training_args.adam_beta1,
beta2=training_args.adam_beta2,
).create(model.params)
# Create learning rate scheduler
lr_scheduler_fn = create_learning_rate_scheduler(
base_learning_rate=training_args.learning_rate, warmup_steps=max(training_args.warmup_steps, 1)
)
# Create parallel version of the training and evaluation steps
p_training_step = jax.pmap(training_step, "batch", donate_argnums=(0,))
p_eval_step = jax.pmap(eval_step, "batch", donate_argnums=(0,))
# Replicate the optimizer on each device
optimizer = jax_utils.replicate(optimizer)
# Store some constant
nb_epochs = int(training_args.num_train_epochs)
batch_size = int(training_args.train_batch_size)
eval_batch_size = int(training_args.eval_batch_size)
if wandb_args.wandb_user_name is not None:
import wandb
wandb.init(project=wandb_args.wandb_project_name, entity=wandb_args.wandb_user_name)
epochs = tqdm(range(nb_epochs), desc=f"Epoch ... (1/{nb_epochs})", position=0)
for epoch in epochs:
# ======================== Training ================================
# Create sampling rng
rng, training_rng, eval_rng = jax.random.split(rng, 3)
# Generate an epoch by shuffling sampling indices from the train dataset
nb_training_samples = len(tokenized_datasets["train"])
training_samples_idx = jax.random.permutation(training_rng, jnp.arange(nb_training_samples))
training_batch_idx = generate_batch_splits(training_samples_idx, batch_size)
# Gather the indexes for creating the batch and do a training step
for batch_idx in tqdm(training_batch_idx, desc="Training...", position=1):
samples = [tokenized_datasets["train"][int(idx)] for idx in batch_idx]
model_inputs = data_collator(samples, pad_to_multiple_of=16)
# Model forward
model_inputs = common_utils.shard(model_inputs.data)
loss, optimizer, dropout_rngs = p_training_step(optimizer, model_inputs, dropout_rngs)
if wandb_args.wandb_user_name is not None:
wandb.log({"Training loss": np.array(loss).mean()})
epochs.write(f"Loss: {loss}")
# ======================== Evaluating ==============================
nb_eval_samples = len(tokenized_datasets["validation"])
eval_samples_idx = jnp.arange(nb_eval_samples)
eval_batch_idx = generate_batch_splits(eval_samples_idx, eval_batch_size)
eval_metrics = []
for i, batch_idx in enumerate(tqdm(eval_batch_idx, desc="Evaluating ...", position=2)):
samples = [tokenized_datasets["validation"][int(idx)] for idx in batch_idx]
model_inputs = data_collator(samples, pad_to_multiple_of=16)
# Model forward
model_inputs = common_utils.shard(model_inputs.data)
metrics = p_eval_step(optimizer.target, model_inputs)
eval_metrics.append(metrics)
eval_metrics_np = get_metrics(eval_metrics)
eval_metrics_np = jax.tree_map(jnp.sum, eval_metrics_np)
eval_normalizer = eval_metrics_np.pop("normalizer")
eval_summary = jax.tree_map(lambda x: x / eval_normalizer, eval_metrics_np)
# Update progress bar
epochs.desc = (
f"Epoch... ({epoch + 1}/{nb_epochs} | Loss: {eval_summary['loss']}, Acc: {eval_summary['accuracy']})"
)
if wandb_args.wandb_user_name is not None:
wandb.log({"Eval loss": np.array(eval_summary["loss"]).mean()})
# Save metrics
if has_tensorboard and jax.host_id() == 0:
for name, value in eval_summary.items():
summary_writer.scalar(name, value, epoch)

Some files were not shown because too many files have changed in this diff Show More