Compare commits

..

1540 Commits

Author SHA1 Message Date
Lysandre
e7cfc1a313 Release: v2.9.0
Some checks failed
GitHub-hosted runner / check_code_quality (push) Has been cancelled
2020-05-07 14:15:20 -04:00
Julien Chaumond
0ae96ff8a7 BIG Reorganize examples (#4213)
* Created using Colaboratory

* [examples] reorganize files

* remove run_tpu_glue.py as superseded by TPU support in Trainer

* Bugfix: int, not tuple

* move files around
2020-05-07 13:48:44 -04:00
Julien Chaumond
cafa6a9e29 [Trainer] Ability to specify optimizer/scheduler at init
cc @patrickvonplaten @thomwolf
2020-05-07 11:25:26 -04:00
Bram Vanroy
e4fd5e3999 Use with_extension to change the extension (#4203)
As per https://github.com/huggingface/transformers/pull/3934#discussion_r421307659
2020-05-07 11:14:56 -04:00
Lysandre Debut
ebf80e2e70 Tpu trainer (#4146)
* wip

* wip

* a last wip

* Better logging when using TPUs

* Correct argument name

* Tests

* fix

* Metrics in evaluation

* Update src/transformers/training_args.py

* [tpu] Use launcher script instead

* [tpu] lots of tweaks

* Fix formatting

Co-authored-by: Julien Chaumond <chaumond@gmail.com>
2020-05-07 10:34:04 -04:00
Funtowicz Morgan
026097b9ee Ensure fast tokenizer can construct tensor without pad token if only one sample is provided. (#4201) 2020-05-07 10:02:53 -04:00
Funtowicz Morgan
0a6cbea0a5 Rewritten batch support in pipelines. (#4154)
* Rewritten batch support in pipelines.

Signed-off-by: Morgan Funtowicz <morgan@huggingface.co>

* Fix imports sorting 🔧

Signed-off-by: Morgan Funtowicz <morgan@huggingface.co>

* Set pad_to_max_length=True by default on Pipeline.

* Set pad_to_max_length=False for generation pipelines.

Most of generation models doesn't have padding token.

* Address @joeddav review comment: Uniformized *args.

Signed-off-by: Morgan Funtowicz <morgan@huggingface.co>

* Address @joeddav review comment: Uniformized *args (second).

Signed-off-by: Morgan Funtowicz <morgan@huggingface.co>
2020-05-07 09:52:40 -04:00
Patrick von Platen
99d1a69444 fix examples (#4192) 2020-05-07 10:54:48 +02:00
Patrick von Platen
74ffc9ea6b [Reformer] Fix example and error message (#4191)
* fix example reformer

* fix error message and example docstring

* improved error message
2020-05-07 10:50:11 +02:00
Patrick von Platen
96c78396ce fix docstring reformer (#4190) 2020-05-07 10:28:31 +02:00
Patrick von Platen
dca34695d0 Reformer (#3351)
* first copy & past commit from Bert and morgans LSH code

* add easy way to compare to trax original code

* translate most of function

* make trax lsh self attention deterministic with numpy seed + copy paste code

* add same config

* add same config

* make layer init work

* implemented hash_vectors function for lsh attention

* continue reformer translation

* hf LSHSelfAttentionLayer gives same output as trax layer

* refactor code

* refactor code

* refactor code

* refactor

* refactor + add reformer config

* delete bogus file

* split reformer attention layer into two layers

* save intermediate step

* save intermediate step

* make test work

* add complete reformer block layer

* finish reformer layer

* implement causal and self mask

* clean reformer test and refactor code

* fix merge conflicts

* fix merge conflicts

* update init

* fix device for GPU

* fix chunk length init for tests

* include morgans optimization

* improve memory a bit

* improve comment

* factorize num_buckets

* better testing parameters

* make whole model work

* make lm model work

* add t5 copy paste tokenizer

* add chunking feed forward

* clean config

* add improved assert statements

* make tokenizer work

* improve test

* correct typo

* extend config

* add complexer test

* add new axial position embeddings

* add local block attention layer

* clean tests

* refactor

* better testing

* save intermediate progress

* clean test file

* make shorter input length work for model

* allow variable input length

* refactor

* make forward pass for pretrained model work

* add generation possibility

* finish dropout and init

* make style

* refactor

* add first version of RevNet Layers

* make forward pass work and add convert file

* make uploaded model forward pass work

* make uploaded model forward pass work

* refactor code

* add namedtuples and cache buckets

* correct head masks

* refactor

* made reformer more flexible

* make style

* remove set max length

* add attention masks

* fix up tests

* fix lsh attention mask

* make random seed optional for the moment

* improve memory in reformer

* add tests

* make style

* make sure masks work correctly

* detach gradients

* save intermediate

* correct backprob through gather

* make style

* change back num hashes

* rename to labels

* fix rotation shape

* fix detach

* update

* fix trainer

* fix backward dropout

* make reformer more flexible

* fix conflict

* fix

* fix

* add tests for fixed seed in reformer layer

* fix trainer typo

* fix typo in activations

* add fp16 tests

* add fp16 training

* support fp16

* correct gradient bug in reformer

* add fast gelu

* re-add dropout for embedding dropout

* better naming

* better naming

* renaming

* finalize test branch

* finalize tests

* add more tests

* finish tests

* fix

* fix type trainer

* fix fp16 tests

* fix tests

* fix tests

* fix tests

* fix issue with dropout

* fix dropout seeds

* correct random seed on gpu

* finalize random seed for dropout

* finalize random seed for dropout

* remove duplicate line

* correct half precision bug

* make style

* refactor

* refactor

* docstring

* remove sinusoidal position encodings for reformer

* move chunking to modeling_utils

* make style

* clean config

* make style

* fix tests

* fix auto tests

* pretrained models

* fix docstring

* update conversion file

* Update pretrained_models.rst

* fix rst

* fix rst

* update copyright

* fix test path

* fix test path

* fix small issue in test

* include reformer in generation tests

* add docs for axial position encoding

* finish docs

* Update convert_reformer_trax_checkpoint_to_pytorch.py

* remove isort

* include sams comments

* remove wrong comment in utils

* correct typos

* fix typo

* Update reformer.rst

* applied morgans optimization

* make style

* make gpu compatible

* remove bogus file

* big test refactor

* add example for chunking

* fix typo

* add to README
2020-05-07 10:17:01 +02:00
Clement
877fc56410 change order pytorch/tf in readme (#4167) 2020-05-06 16:31:07 -04:00
Julien Plu
aad50151f3 TF version of the trainer (#4017)
* First commit to add a TF version of the trainer.

* Make the TF trainer closer to what looks the PT trainer

* Refactoring common code between the PT and TF trainer into an util file.

* Some bugfix + better similarity with the PT trainer

* Add missing class in transformers init

* Bugfix over prediction + use classification report instead of simple metrics

* Fix name error

* Fix optimization tests + style

* Apply style

* Several bugfix for multi-gpu training

* Apply style

* Apply style

* Add glue example for the TF trainer

* Several bugix + address the reviews

* Fix on the TF training args file

* Add a debug mode

* Bugfix in utils_ner.py when segment_ids is None

* Apply style

* Apply style

* Add TPU strategy

* Fix selection strategy
2020-05-06 12:56:52 -04:00
Simone Primarosa
25296b12aa Fix overwrite_cache behaviour for pytorch lightning examples (#4093) 2020-05-06 12:24:49 -04:00
kumapo
9972562d33 Include ElectraPreTrainedModel into __init__ (#4173) 2020-05-06 12:00:23 -04:00
martindh
ff8ed52dd8 Camembert-large-fquad model card (#4143)
Description for the model card describing the camembert-large-fquad model.
2020-05-06 10:41:07 -04:00
Julien Plu
4c3be2e718 Add model card for the NER model (#4162) 2020-05-06 10:40:55 -04:00
Manuel Romero
17ae0363db Fix markdown to show the results table properly (#4119) 2020-05-06 10:38:29 -04:00
Patrick von Platen
a638e986f4 fix hard wired pad token id (#4138) 2020-05-06 00:42:34 +02:00
Julien Chaumond
fd2174664c [Trainer] W&B: Enable model watch
See https://github.com/huggingface/transformers/pull/3916
2020-05-05 10:59:23 -04:00
Lysandre Debut
79b1c6966b Pytorch 1.5.0 (#3973)
* Standard deviation can no longer be set to 0

* Remove torch pinned version

* 9th instead of 10th, silly me
2020-05-05 10:23:01 -04:00
Boris Dayma
818463ee8e Trainer: add logging through Weights & Biases (#3916)
* feat: add logging through Weights & Biases

* feat(wandb): make logging compatible with all scripts

* style(trainer.py): fix formatting

* [Trainer] Tweak wandb integration

Co-authored-by: Julien Chaumond <chaumond@gmail.com>
2020-05-04 22:42:27 -04:00
jaymody
858b1d1e5a allow an already created tensorboard SummaryWriter be passed to Trainer 2020-05-04 19:58:24 -04:00
Patrick von Platen
8e67573a64 [EncoderDecoder Tests] Improve tests (#4046)
* Hoist bert model tester for patric

* indent

* make tests work

* Update tests/test_modeling_bert.py

Co-authored-by: Julien Chaumond <chaumond@gmail.com>

Co-authored-by: sshleifer <sshleifer@gmail.com>
Co-authored-by: Julien Chaumond <chaumond@gmail.com>
2020-05-04 02:18:36 +02:00
Lorenzo Ampil
6af3306a1d Add decoder specific error message for T5Stack.forward (#4128) 2020-05-03 12:40:08 +02:00
Zhiyu Lin
1cdd2ad2af Fix #2941 (#4109)
* Fix of issue #2941

Reshaped score array to avoid `numpy` ValueError.

* Update src/transformers/pipelines.py

* Update src/transformers/pipelines.py

Co-authored-by: Julien Chaumond <chaumond@gmail.com>
2020-05-02 11:20:30 -04:00
Manuel Romero
5f4f6b65b3 distilroberta-base-finetuned-sentiment (#4115)
* Create model card

Create Model card for distilroberta-base-finetuned-sentiment

* Update model_cards/mrm8488/distilroberta-base-finetuned-sentiment/README.md

* Update model_cards/mrm8488/distilroberta-base-finetuned-sentiment/README.md

Co-authored-by: Julien Chaumond <chaumond@gmail.com>
2020-05-02 11:19:31 -04:00
Suraj Parmar
7da051f135 model card for surajp/albert-base-sanskrit (#4114)
* Create README.md

* Update model_cards/surajp/albert-base-sanskrit/README.md

Co-authored-by: Julien Chaumond <chaumond@gmail.com>
2020-05-02 11:15:39 -04:00
Zhen Wang
14911e2e12 Create README.md (#4112) 2020-05-02 10:52:12 -04:00
HUSEIN ZOLKEPLI
9e97c87539 Added huseinzol05/gpt2-345M-bahasa-cased (#4102) 2020-05-02 10:51:15 -04:00
William Falcon
4c5bd92183 Update run_pl_glue.py (#4117) 2020-05-02 10:38:30 -04:00
William Falcon
5282b31df4 Update run_pl_ner.py (#4118) 2020-05-02 10:38:21 -04:00
Stefan Schweter
1e616c0af3 NER: parse args from .args file or JSON (#4110)
* ner: parse args from .args file or JSON

* examples: mention json-based configuration file support for run_ner script
2020-05-02 10:29:17 -04:00
Patrick von Platen
abb1fa3f37 Update README.md 2020-05-02 10:32:00 +02:00
Patrick von Platen
0ccbfd2868 Update Reformer ReadME 2020-05-02 10:31:00 +02:00
Patrick von Platen
2d8340a91f [Reformer] Move model card to google model (#4113)
* correct model card

* remove model card from patrick von platen
2020-05-02 10:25:22 +02:00
Julien Chaumond
d713cfc5eb GePpeTto 🇮🇹: Fixpath to model card 2020-05-01 11:48:58 -04:00
Lorenzo De Mattei
f3d44301cc GePpeTto model 🇮🇹 (#4099)
* Create GePpeTto.md

* Update model_cards/LorenzoDeMattei/GePpeTto.md

* Update model_cards/LorenzoDeMattei/GePpeTto.md

Co-authored-by: Julien Chaumond <chaumond@gmail.com>
2020-05-01 11:46:42 -04:00
Julien Chaumond
27d55125e6 Configs: saner num_labels in configs. (#3967) 2020-05-01 11:28:55 -04:00
Stefan Schweter
e80be7f1d0 docs: add xlm-roberta section to multi-lingual section (#4101) 2020-05-01 11:06:58 -04:00
Sam Shleifer
18db92dd9a [testing] add timeout_decorator (#3543) 2020-05-01 09:05:47 -04:00
Julien Chaumond
b8686174be Merge pull request #3934 from huggingface/examples_args_from_files
[qol] example scripts: parse args from .args file or JSON
2020-04-30 22:40:13 -04:00
Julien Chaumond
f39217a5ec [tests] Light cleanup of tempfile in tests/ 2020-04-30 22:30:15 -04:00
Julien Chaumond
f54dc3f4d5 [ci] Load pretrained models into the default (long-lived) cache
There's an inconsistency right now where:
- we load some models into CACHE_DIR
- and some models in the default cache
- and often, in both for the same models

When running the RUN_SLOW tests, this takes a lot of disk space, time, and bandwidth.

I'd rather always use the default cache
2020-04-30 22:30:15 -04:00
Scottish_Fold007
6b410bedfc Model Card: gaochangkuan README.md (#4033)
* Create README.md

* Update README.md

* tweak

Co-authored-by: Julien Chaumond <chaumond@gmail.com>
2020-04-30 22:26:58 -04:00
husein zolkepli
8829ace4aa added gpt2 117m bahasa readme
(cherry picked from commit a4a673a1d0bec0bf4085eef021acb788ca1f5eb5)
2020-04-30 22:20:00 -04:00
Benjamin Muller
1851a64b6f create model_card camembert-base-wikipedia-4gb 2020-04-30 22:16:12 -04:00
Benjamin Muller
443e5e34af Create README.md 2020-04-30 22:16:00 -04:00
Benjamin Muller
60e1556a44 Create model_card camembert-base-ccnet-4gb 2020-04-30 22:15:47 -04:00
Benjamin Muller
fa9365eca5 Create README.md 2020-04-30 22:15:38 -04:00
Benjamin Muller
afe002b04c Create README.md 2020-04-30 22:15:23 -04:00
Suraj Parmar
8b5e5ebcf9 Continue training args and tqdm in notebooks (#3939)
* Continue training args

* Continue training args

* added explaination

* added explaination

* added explaination

* Fixed tqdm auto

* Update src/transformers/training_args.py

Co-Authored-By: Julien Chaumond <chaumond@gmail.com>

* Update src/transformers/training_args.py

* Update src/transformers/training_args.py

Co-authored-by: Julien Chaumond <chaumond@gmail.com>
2020-04-30 22:14:08 -04:00
Julien Chaumond
ab90353f1a [cli] {login, upload, s3} display more helpful error messages 2020-04-30 12:51:06 -04:00
Julien Chaumond
452dd0e4d9 [ci] Align test_hf_api.py with API change 2020-04-30 12:06:01 -04:00
Jordan
7f9193ef09 Fixed Style Inconsistency (#3976) 2020-04-30 14:33:09 +02:00
Jared T Nielsen
64070cbb88 Fix TF input docstrings to refer to tf.Tensor rather than torch.FloatTensor. (#4051) 2020-04-30 14:28:56 +02:00
Lysandre Debut
e73595bd64 Remove jitted method so that our models are pickable. (#4050) 2020-04-29 09:53:19 -04:00
Sam Shleifer
2c77842887 [Fix common tests on GPU] send model, ids to torch_device (#4014) 2020-04-29 09:47:20 -04:00
Julien Chaumond
6faca88ee0 Align MarianMT with #4030
cc @sshleifer
2020-04-28 20:35:20 -04:00
Julien Chaumond
211e130811 [github] Issue templates: populate some labels
cc @bramvanroy @stefan-it
2020-04-28 20:34:34 -04:00
Julien Chaumond
455c639093 CDN urls (#4030)
* [file_utils] use_cdn + documentation

* Move to cdn. urls for weights

* [urls] Hotfix for bert-base-japanese
2020-04-28 20:27:14 -04:00
Thomas Wolf
8ba4c5885f Allow a more backward compatible behavior of max_len_single_sentence and max_len_sentences_pair (#3994)
* Allow a more backward compatible behavior of max_len_single_sentence and max_len_sentences_pair and

* The style and quality are now top-notch
2020-04-29 01:13:59 +02:00
Sam Shleifer
847e7f3379 MarianMTModel.from_pretrained('Helsinki-NLP/opus-marian-en-de') (#3908)
Co-Authored-By: Stefan Schweter <stefan@schweter.it>
2020-04-28 18:22:37 -04:00
Sam Shleifer
d714dfeaa8 [isort] add known 3rd party to setup.cfg (#4053)
* add known 3rd party to setup.cfg

* comment

* Update CONTRIBUTING.md

Co-authored-by: Julien Chaumond <chaumond@gmail.com>
2020-04-28 17:12:00 -04:00
MichalMalyska
d52b0e294a Minor Readme Fixes (#4056)
Added contact info and fixed typos.
2020-04-28 16:42:15 -04:00
Alex Combessie
55adefe428 Add license information to model cards (#3864)
Close #3357
2020-04-28 16:40:21 -04:00
ydaigo
0ac6d0bf33 Create README.md
I create japanese binary classification.
2020-04-28 15:35:30 -04:00
Louis MARTIN
c73c83b0e6 Small cosmetic changes to CamemBERT model card 2020-04-28 15:32:55 -04:00
Bogdan Kostić
4a94c062a4 Provide model card for roberta-base-squad2-covid 2020-04-28 15:29:30 -04:00
jazzcook15
c7d06b79ae Fix #3954 - GPT2 is not traceable (#3955)
* Update sqrt computation so it can survive a torch.jit.trace

* Update modeling_gpt2.py

Co-authored-by: Patrick von Platen <patrick.v.platen@gmail.com>
2020-04-28 21:18:56 +02:00
Patrick von Platen
9a0a8c1c6f add examples to doc (#4045) 2020-04-28 16:33:23 +02:00
Patrick von Platen
fa49b9afea Clean Encoder-Decoder models with Bart/T5-like API and add generate possibility (#3383)
* change encoder decoder style to bart & t5 style

* make encoder decoder generation dummy work for bert

* make style

* clean init config in encoder decoder

* add tests for encoder decoder models

* refactor and add last tests

* refactor and add last tests

* fix attn masks for bert encoder decoder

* make style

* refactor prepare inputs for Bert

* refactor

* finish encoder decoder

* correct typo

* add docstring to config

* finish

* add tests

* better naming

* make style

* fix flake8

* clean docstring

* make style

* rename
2020-04-28 15:11:09 +02:00
Patrick von Platen
180585741c [Generation] Generation should allow to start with empty prompt (#3993)
* fix empty prompt

* fix length in generation pipeline
2020-04-28 14:33:15 +02:00
Patrick von Platen
52679fbc2e add dialogpt training tips (#3996) 2020-04-28 14:32:31 +02:00
Stefan Schweter
b5c6d3d4c7 notebooks: minor fix for community provided models example (#4025) 2020-04-28 09:12:25 +02:00
martindh
2fade302ac camembert-base-fquad
Model card for illuin release of camembert-base-fquad
2020-04-27 18:29:55 -04:00
Manuel Romero
20c3b8cab4 Create model card 2020-04-27 18:27:46 -04:00
Manuel Romero
b3f272ffcb Create model card 2020-04-27 18:27:04 -04:00
Nick Doiron
518f291eef add model card for Hindi-BERT 2020-04-27 18:25:16 -04:00
monologg
d7b3bf547c Model cards for KoELECTRA 2020-04-27 18:21:01 -04:00
Sai Saketh Aluru
db9d56c08a Add modelcard for Hate-speech-CNERG/dehatebert-mono-arabic model (#3979)
* Add dehatebert-mono-arabic readme card

* Update dehatebert-mono-arabic model card
2020-04-27 18:18:54 -04:00
sshleifer
41750a6cff Fix typos 2020-04-27 13:25:53 -04:00
Lorenzo Ampil
12bb7fe770 Fix t5 doc typos (#3978)
* Fix tpo in into and add line under

* Add missing blank line under

* Correct types under
2020-04-27 18:27:15 +02:00
Julien Chaumond
97a375484c rm boto3 dependency 2020-04-27 11:17:14 -04:00
Txus
4e817ff418 Create README.md (#3966) 2020-04-25 09:16:40 -04:00
Junyi_Li
73d6a2f901 [model_cards] xlnet_chinese_large & roberta_chinese_large 2020-04-24 16:12:42 -04:00
Manuel Romero
623ba0236d Create README.md (#3882) 2020-04-24 15:57:01 -04:00
Leandro von Werra
f4078e0db6 Feat/add model card (#3923)
* add model card for gpt2-imdb-ctrl

* fix title

* add sentiment control description
2020-04-24 10:24:28 -04:00
YuvalPeleg
03322b4261 Create README.md (#3917) 2020-04-24 10:24:00 -04:00
Julien Chaumond
c811526004 [examples] For convenience, also save the tokenizer
Close #3921
2020-04-24 09:52:42 -04:00
Cola
b0167632ce Shuffle train subset for summarization example (#3909)
* Shuffle train subset

* Cleaner shuffle
2020-04-24 07:55:34 -04:00
Julien Chaumond
c53cc018de [Trainer] Fix _rotate_checkpoints
Close #3920
2020-04-23 23:59:43 +00:00
Julien Chaumond
cbbb3c43c5 [hubconf] Modify pythonpath to get canonical imports to work
See https://github.com/huggingface/transformers/pull/3881/files#r412292660

Should we remove SRC_DIR from sys.path right after the imports, @aaugustin?
2020-04-23 16:27:43 -04:00
mneilly-et
77b75d2c78 Fix for #3873 to change type of exponent parameter for torch.pow() call from int to float (#3924) 2020-04-23 14:25:31 -04:00
Clement
6ba254ee54 quick fix wording readme for community models (#3900) 2020-04-23 14:19:45 -04:00
Jared T Nielsen
a79a9e1241 Fix TFAlbertForSequenceClassification classifier dropout probability. It was set to config.hidden_dropout_prob, but should be config.classifier_dropout_prob. (#3928) 2020-04-23 13:18:16 -04:00
peterandluc
8e093e5981 Remove 50k limits bug 2020-04-23 11:15:09 -04:00
Julien Chaumond
6af5a54c28 [Trainer] reuse constant 2020-04-23 11:02:05 -04:00
Julien Chaumond
7c2a32ff88 [housekeeping] super() 2020-04-23 10:43:22 -04:00
Julien Chaumond
a946b6b51b [housekeeping] Upgrade # type Python 2 syntax
cc @sshleifer
2020-04-23 10:39:24 -04:00
Manuel Romero
cb3c2212c7 Create model card (#3890)
Model: TinyBERT-spanish-uncased-finetuned-ner
2020-04-22 14:56:43 -04:00
Manuel Romero
d698b87f20 Update comparison table (#3889) 2020-04-22 14:54:17 -04:00
Anthony MOI
13dd2acca4 Bump tokenizers version to final 0.7.0 (#3898) 2020-04-22 11:02:29 -04:00
Lorenzo Ampil
f16540fcba Pipeline for Text Generation: GenerationPipeline (#3758)
* Add GenerationPipeline

* Fix parameter names

* Correct parameter __call__ parameters

* Add model type attribute and correct function calls for prepare_input

* Take out trailing commas from init attributes

* Remove unnecessary tokenization line

* Implement support for multiple text inputs

* Apply generation support for multiple input text prompts

* Take out tensor coersion

* Take out batch index

* Add text prompt to return sequence

* Squeeze token tensore before decoding

* Return only a single list of sequences if only one prompt was used

* Correct results variable name

* Add GenerationPipeline to SUPPORTED_TASKS with the alias , initalized w GPT2

* Registedred AutoModelWithLMHead for both pt and t

* Update docstring for GenerationPipeline

* Add kwargs parameter to mode.generate

* Take out kwargs parameter after all

* Add generation pipeline example in pipeline docstring

* Fix max length by squeezing tokens tensor

* Apply ensure_tensor_on_device to pytorch tensor

* Include generation step in torch.no_grad

* Take out input from prepare_xlm_input and set 'en' as default xlm_language

* Apply framework specific encoding during prepare_input

* Format w make style

* Move GenerationPipeline import to follow proper import sorting

* Take out training comma from generation dict

* Apply requested changes

* Change name to TextGenerationPipeline

* Apply TextGenerationPipeline rename to __init___

* Changing alias to

* Set input mapping as input to ensure_tensor_on_device

* Fix assertion placement

* Add test_text_generation

* Add TextGenerationPipeline to PipelineCommonTests

* Take out whitespace

* Format __init__ w black

* Fix __init__ style

* Forman __init___

* Add line to end of __init__

* Correct model tokenizer set for test_text_generation

* Ensure to return list of list, not list of string (to pass test)

* Limit test models to only 3 to limit runtime to address circleCI timeout error

* Update src/transformers/pipelines.py

Co-Authored-By: Patrick von Platen <patrick.v.platen@gmail.com>

* Update src/transformers/pipelines.py

Co-Authored-By: Patrick von Platen <patrick.v.platen@gmail.com>

* Update src/transformers/pipelines.py

Co-Authored-By: Patrick von Platen <patrick.v.platen@gmail.com>

* Update src/transformers/pipelines.py

Co-Authored-By: Patrick von Platen <patrick.v.platen@gmail.com>

* Update src/transformers/pipelines.py

Co-Authored-By: Patrick von Platen <patrick.v.platen@gmail.com>

* Update tests/test_pipelines.py

Co-Authored-By: Patrick von Platen <patrick.v.platen@gmail.com>

* Update src/transformers/pipelines.py

Co-Authored-By: Patrick von Platen <patrick.v.platen@gmail.com>

* Update src/transformers/pipelines.py

Co-Authored-By: Patrick von Platen <patrick.v.platen@gmail.com>

* Update src/transformers/pipelines.py

Co-Authored-By: Patrick von Platen <patrick.v.platen@gmail.com>

* Remove argument docstring, __init__, add additional __call__ arguments, and reformat results to list of dict

* Fix blank result list

* Add TextGenerationPipeline to pipelines.rst

* Update src/transformers/pipelines.py

Co-Authored-By: Patrick von Platen <patrick.v.platen@gmail.com>

* Update src/transformers/pipelines.py

Co-Authored-By: Patrick von Platen <patrick.v.platen@gmail.com>

* Fix typos from adding PADDING_TEXT_TOKEN_LENGTH

* Fix incorrectly moved result list

* Update src/transformers/pipelines.py

Co-Authored-By: Patrick von Platen <patrick.v.platen@gmail.com>

* Update src/transformers/pipelines.py

* Update src/transformers/pipelines.py

* Update src/transformers/pipelines.py

* Update src/transformers/pipelines.py

* Update src/transformers/pipelines.py

* Update src/transformers/pipelines.py

* Update src/transformers/pipelines.py

* Update src/transformers/pipelines.py

* Update src/transformers/pipelines.py

* Update src/transformers/pipelines.py

* Update src/transformers/pipelines.py

* Update src/transformers/pipelines.py

Co-Authored-By: Patrick von Platen <patrick.v.platen@gmail.com>

* Add back generation line and make style

* Take out blank whitespace

* Apply new alis, text-generation, to test_pipelines

* Fix text generation alias in test

* Update src/transformers/pipelines.py

Co-authored-by: Patrick von Platen <patrick.v.platen@gmail.com>
Co-authored-by: Julien Chaumond <chaumond@gmail.com>
2020-04-22 09:37:03 -04:00
Julien Chaumond
1dc9b3c784 Fixes #3877 2020-04-22 01:15:10 +00:00
Julien Chaumond
dd9d483d03 Trainer (#3800)
* doc

* [tests] Add sample files for a regression task

* [HUGE] Trainer

* Feedback from @sshleifer

* Feedback from @thomwolf + logging tweak

* [file_utils] when downloading concurrently, get_from_cache will use the cached file for subsequent processes

* [glue] Use default max_seq_length of 128 like before

* [glue] move DataTrainingArguments around

* [ner] Change interface of InputExample, and align run_{tf,pl}

* Re-align the pl scripts a little bit

* ner

* [ner] Add integration test

* Fix language_modeling with API tweak

* [ci] Tweak loss target

* Don't break console output

* amp.initialize: model must be on right device before

* [multiple-choice] update for Trainer

* Re-align to 827d6d6ef0
2020-04-21 20:11:56 -04:00
Julien Chaumond
eb5601b0a5 [ci] Pin torch version while we update 2020-04-21 15:46:18 -04:00
Spencer Adams
53f5ef6df5 create readme for spentaur/yelp model (#3874)
* create readme for spentaur/yelp model

* update spentaur/yelp/README.md

* remove typo
2020-04-21 15:31:36 -04:00
Julien Chaumond
d32585a304 Fix Torch.hub + Integration test 2020-04-21 14:13:30 -04:00
Bharat Raghunathan
7d40901ce3 Fix Documentation issue in BertForMaskedLM forward (#3855) 2020-04-21 09:08:20 +02:00
Andrey Kulagin
b1ff0b2ae7 Fix bug in examples: double wrap into DataParallel during eval 2020-04-20 19:37:44 -04:00
husein zolkepli
7f23af1684 added electra model
(cherry picked from commit b5f2dc5d627d44b8cbb0ccf8ad2b46bea211a236)
2020-04-20 17:17:58 -04:00
Punyajoy Saha
03121deba3 New model added
The first model added to the repo
2020-04-20 17:10:01 -04:00
Manuel Romero
15b9868f8b Create model card 2020-04-20 17:07:34 -04:00
Funtowicz Morgan
2c05b8a56c Remove tqdm logging when using pipelines. (#3833)
Introduce tqdm_enabled parameter on squad_convert_examples_to_features() default to True and set to False in QA pipelines.
2020-04-20 22:58:52 +02:00
Jared T Nielsen
c79b550dd0 Add qas_id to SquadResult and SquadExample (#3745)
* Add qas_id

* Fix incorrect name in squad.py

* Make output files optional for squad eval
2020-04-20 16:08:57 -04:00
Patrick von Platen
c4158a6314 [Pipelines] Encode to max length of input not max length of tokenizer for batch input (#3857)
* remove max_length = tokenizer.max_length when encoding

* make style
2020-04-20 14:39:16 -04:00
Mohamed El-Geish
857ccdb259 exbert links for my albert model cards (#3729)
* exbert links for my albert model cards

* Added exbert tag to the metadata block

* Adding "how to cite"
2020-04-20 10:54:39 -04:00
Sam Shleifer
a504cb49ec [examples] fix summarization do_predict (#3866) 2020-04-20 10:49:56 -04:00
ahotrod
52c85f847a Update README.md 2020-04-20 10:10:56 -04:00
Patrick von Platen
a21d4fa410 add "by" to ReadMe 2020-04-18 18:07:17 +02:00
Thomas Wolf
827d6d6ef0 Cleanup fast tokenizers integration (#3706)
* First pass on utility classes and python tokenizers

* finishing cleanup pass

* style and quality

* Fix tests

* Updating following @mfuntowicz comment

* style and quality

* Fix Roberta

* fix batch_size/seq_length inBatchEncoding

* add alignement methods + tests

* Fix OpenAI and Transfo-XL tokenizers

* adding trim_offsets=True default for GPT2 et RoBERTa

* style and quality

* fix tests

* add_prefix_space in roberta

* bump up tokenizers to rc7

* style

* unfortunately tensorfow does like these - removing shape/seq_len for now

* Update src/transformers/tokenization_utils.py

Co-Authored-By: Stefan Schweter <stefan@schweter.it>

* Adding doc and docstrings

* making flake8 happy

Co-authored-by: Stefan Schweter <stefan@schweter.it>
2020-04-18 13:43:57 +02:00
Julien Chaumond
60a42ef1c0 [model_cards] Fix CamemBERT table markdown
see https://github.com/huggingface/transformers/pull/3836
2020-04-17 20:21:15 -04:00
Julien Chaumond
88aecee6a2 [ci] GitHub-hosted runner has no space left on device 2020-04-17 20:16:00 -04:00
Benjamin Muller
73efa694e6 Update camembert-base-README.md (#3836) 2020-04-17 20:08:13 -04:00
Patrick von Platen
e9d0bc027a [Config, Serialization] more readable config serialization (#3797)
* better config serialization

* finish configuration utils
2020-04-17 20:07:18 -04:00
Lysandre Debut
8b63a01d95 XLM tokenizer should encode with bos token (#3791)
* XLM tokenizer should encode with bos token

* Update tests
2020-04-17 11:28:55 -04:00
Patrick von Platen
1d4a35b396 Higher tolerance for past testing in TF T5 (#3844) 2020-04-17 11:26:16 -04:00
Patrick von Platen
d13eca11e2 Higher tolerance for past testing in T5 (#3843) 2020-04-17 11:25:14 -04:00
Harutaka Kawamura
b0c9fbb293 Add workflow to build docs (#3763) 2020-04-17 11:23:18 -04:00
Santiago Castro
c19727fd38 Add support for the null answer in QuestionAnsweringPipeline (#3441)
* Add support for the null answer in `QuestionAnsweringPipeline`

* black

* Fix min null score computation

* Fix a PR comment
2020-04-17 11:17:21 -04:00
Simon Böhm
edf0582c0b Fix token_type_id in BERT question-answering example (#3790)
token_type_id is converted into the segment embedding. For question answering,
this needs to highlight whether a token belongs to sequence 0 or 1.
encode_plus takes care of correctly setting this parameter automatically.
2020-04-17 11:14:12 -04:00
Pierric Cistac
6d00033e97 Question Answering support for Albert and Roberta in TF (#3812)
* Add TFAlbertForQuestionAnswering

* Add TFRobertaForQuestionAnswering

* Update TFAutoModel with Roberta/Albert for QA

* Clean `super` TF Albert calls
2020-04-17 10:45:30 -04:00
Patrick von Platen
f399c00610 Update README 2020-04-17 09:42:22 +02:00
Sam Shleifer
f0c96fafd1 [examples] summarization/bart/finetune.py supports t5 (#3824)
renames `run_bart_sum.py` to `finetune.py`
2020-04-16 15:15:19 -04:00
Jonathan Sum
0cec4fab7d typo: fine-grained token-leven
Changing from "fine-grained token-leven" to "fine-grained token-level"
2020-04-16 15:11:23 -04:00
Aryansh Omray
14cdeee75a Tanh torch warnings 2020-04-16 15:10:35 -04:00
Sam Shleifer
16469fedbd [PretrainedTokenizer] Factor out tensor conversion method (#3777) 2020-04-16 15:02:43 -04:00
Patrick von Platen
80a1694514 [Examples, T5] Change newstest2013 to newstest2014 and clean up (#3817)
* Refactored use of newstest2013 to newstest2014. Fixed bug where argparse consumed first command line argument as model_size argument rather than using default model_size by forcing explicit --model_size flag inclusion

* More pythonic file handling through 'with' context

* COSMETIC - ran Black and isort

* Fixed reference to number of lines in newstest2014

* Fixed failing test. More pythonic file handling

* finish PR from tholiao

* remove outcommented lines

* make style

* make isort happy

Co-authored-by: Thomas Liao <tholiao@gmail.com>
2020-04-16 20:00:41 +02:00
Lysandre Debut
d486795158 JIT not compatible with PyTorch/XLA (#3743) 2020-04-16 11:19:24 -04:00
Davide Fiocco
b1e2368b32 Typo fix (#3821) 2020-04-16 11:04:32 -04:00
Patrick von Platen
baca8fa8e6 clean pipelines (#3795) 2020-04-16 10:21:34 -04:00
Patrick von Platen
38f7461df3 [TFT5, Cache] Add cache to TFT5 (#3772)
* correct gpt2 test inputs

* make style

* delete modeling_gpt2 change in test file

* translate from pytorch

* correct tests

* fix conflicts

* fix conflicts

* fix conflicts

* fix conflicts

* make tensorflow t5 caching work

* make style

* clean reorder cache

* remove unnecessary spaces

* fix test
2020-04-16 16:14:52 +02:00
Patrick von Platen
a5b249472e change pad token id to config pad token id (#3793) 2020-04-16 15:58:57 +02:00
Sam Shleifer
dbd041243d [cleanup] factor out get_head_mask, invert_attn_mask, get_exten… (#3806)
* Delete some copy pasted code
2020-04-16 09:55:25 -04:00
Patrick von Platen
d22894dfd4 [Docs] Add DialoGPT (#3755)
* add dialoGPT

* update README.md

* fix conflict

* update readme

* add code links to docs

* Update README.md

* Update dialo_gpt2.rst

* Update pretrained_models.rst

* Update docs/source/model_doc/dialo_gpt2.rst

Co-Authored-By: Julien Chaumond <chaumond@gmail.com>

* change filename of dialogpt

Co-authored-by: Julien Chaumond <chaumond@gmail.com>
2020-04-16 09:04:32 +02:00
Sam Shleifer
c59b1e682d [examples] unit test for run_bart_sum (#3544)
- adds pytorch-lightning dependency
2020-04-15 18:35:01 -04:00
Patrick von Platen
301bf8d1b4 Create Modelcard for Reformer Model 2020-04-15 16:26:24 +02:00
Patrick von Platen
01c37dcdb5 [Config, Caching] Remove output_past everywhere and replace by use_cache argument (#3734)
* remove output_past from pt

* make style

* add optional input length for gpt2

* add use cache to prepare input

* save memory in gpt2

* correct gpt2 test inputs

* make past input optional for gpt2

* finish use_cache for all models

* make style

* delete modeling_gpt2 change in test file

* correct docstring

* correct is true statements for gpt2
2020-04-14 14:40:28 -04:00
Patrick von Platen
092cf881a5 [Generation, EncoderDecoder] Apply Encoder Decoder 1.5GB memory… (#3778) 2020-04-13 22:29:28 -04:00
Teven
352d5472b0 Shift labels internally within TransfoXLLMHeadModel when called with labels (#3716)
* Shifting labels inside TransfoXLLMHead

* Changed doc to reflect change

* Updated pytorch test

* removed IDE whitespace changes

* black reformat

Co-authored-by: TevenLeScao <teven.lescao@gmail.com>
2020-04-13 18:11:23 +02:00
elk-cloner
5ebd898953 fix dataset shuffling for Distributed training (#huggingface#3721) (#3766) 2020-04-13 10:11:18 -04:00
HenrykBorzymowski
7972a4019f updated dutch squad model card (#3736)
* added model_cards for polish squad models

* corrected mistake in polish design cards

* updated model_cards for squad2_dutch model

* added links to benchmark models

Co-authored-by: Henryk Borzymowski <henryk.borzymowski@pwc.com>
2020-04-11 06:44:59 -04:00
HUSEIN ZOLKEPLI
f8c1071c51 Added README huseinzol05/albert-tiny-bahasa-cased (#3746)
* add bert bahasa readme

* update readme

* update readme

* added xlnet

* added tiny-bert and fix xlnet readme

* added albert base

* added albert tiny
2020-04-11 06:42:06 -04:00
Jin Young Sohn
700ccf6e35 Fix glue_convert_examples_to_features API breakage (#3742) 2020-04-10 16:03:27 -04:00
Anthony MOI
b7cf9f43d2 Update tokenizers to 0.7.0-rc5 (#3705) 2020-04-10 14:23:49 -04:00
Jin Young Sohn
551b450527 Add run_glue_tpu.py that trains models on TPUs (#3702)
* Initial commit to get BERT + run_glue.py on TPU

* Add README section for TPU and address comments.

* Cleanup TPU bits from run_glue.py (#3)

TPU runner is currently implemented in:
https://github.com/pytorch-tpu/transformers/blob/tpu/examples/run_glue_tpu.py.

We plan to upstream this directly into `huggingface/transformers`
(either `master` or `tpu`) branch once it's been more thoroughly tested.

* Cleanup TPU bits from run_glue.py

TPU runner is currently implemented in:
https://github.com/pytorch-tpu/transformers/blob/tpu/examples/run_glue_tpu.py.

We plan to upstream this directly into `huggingface/transformers`
(either `master` or `tpu`) branch once it's been more thoroughly tested.

* No need to call `xm.mark_step()` explicitly (#4)

Since for gradient accumulation we're accumulating on batches from
`ParallelLoader` instance which on next() marks the step itself.

* Resolve R/W conflicts from multiprocessing (#5)

* Add XLNet in list of models for `run_glue_tpu.py` (#6)

* Add RoBERTa to list of models in TPU GLUE (#7)

* Add RoBERTa and DistilBert to list of models in TPU GLUE (#8)

* Use barriers to reduce duplicate work/resources (#9)

* Shard eval dataset and aggregate eval metrics (#10)

* Shard eval dataset and aggregate eval metrics

Also, instead of calling `eval_loss.item()` every time do summation with
tensors on device.

* Change defaultdict to float

* Reduce the pred, label tensors instead of metrics

As brought up during review some metrics like f1 cannot be aggregated
via averaging. GLUE task metrics depends largely on the dataset, so
instead we sync the prediction and label tensors so that the metrics can
be computed accurately on those instead.

* Only use tb_writer from master (#11)

* Apply huggingface black code formatting

* Style

* Remove `--do_lower_case` as example uses cased

* Add option to specify tensorboard logdir

This is needed for our testing framework which checks regressions
against key metrics writtern by the summary writer.

* Using configuration for `xla_device`

* Prefix TPU specific comments.

* num_cores clarification and namespace eval metrics

* Cache features file under `args.cache_dir`

Instead of under `args.data_dir`. This is needed as our test infra uses
data_dir with a read-only filesystem.

* Rename `run_glue_tpu` to `run_tpu_glue`

Co-authored-by: LysandreJik <lysandre.debut@reseau.eseo.fr>
2020-04-10 12:53:54 -04:00
Julien Chaumond
cbad305ce6 [docs] The use of do_lower_case in scripts is on its way to deprecation (#3738) 2020-04-10 12:34:04 -04:00
Julien Chaumond
b169ac9c2b [examples] Generate argparsers from type hints on dataclasses (#3669)
* [examples] Generate argparsers from type hints on dataclasses

* [HfArgumentParser] way simpler API

* Restore run_language_modeling.py for easier diff

* [HfArgumentParser] final tweaks from code review
2020-04-10 12:21:58 -04:00
Sam Shleifer
7a7fdf71f8 Multilingual BART - (#3602)
- support mbart-en-ro weights
- add MBartTokenizer
2020-04-10 11:25:39 -04:00
Julien Chaumond
f98d0ef2a2 Big cleanup of glue_convert_examples_to_features (#3688)
* Big cleanup of `glue_convert_examples_to_features`

* Use batch_encode_plus

* Cleaner wrapping of glue_convert_examples_to_features for TF

@lysandrejik

* Cleanup syntax, thanks to @mfuntowicz

* Raise explicit error in case of user error
2020-04-10 10:20:18 -04:00
Patrick von Platen
ce2298fb5f [T5, generation] Add decoder caching for T5 (#3682)
* initial commit to add decoder caching for T5

* better naming for caching

* finish T5 decoder caching

* correct test

* added extensive past testing for T5

* clean files

* make tests cleaner

* improve docstring

* improve docstring

* better reorder cache

* make style

* Update src/transformers/modeling_t5.py

Co-Authored-By: Yacine Jernite <yjernite@users.noreply.github.com>

* make set output past work for all layers

* improve docstring

* improve docstring

Co-authored-by: Yacine Jernite <yjernite@users.noreply.github.com>
2020-04-10 01:02:50 +02:00
calpt
9384e5f6de Fix force_download of files on Windows (#3697) 2020-04-09 14:44:57 -04:00
Julien Chaumond
bc65afc4df [Exbert] Change style of button 2020-04-09 10:44:42 -04:00
LysandreJik
31baeed614 Update quotes
cc @julien-c
2020-04-09 09:09:00 -04:00
Teven
f8208fa456 Correct transformers-cli env call 2020-04-09 09:03:19 +02:00
Lysandre Debut
6435b9f908 Updating the TensorFlow models to work as expected with tokenizers v3.0.0 (#3684)
* Updating modeling tf files; adding tests

* Merge `encode_plus` and `batch_encode_plus`
2020-04-08 16:22:44 -04:00
LysandreJik
500aa12318 close #3699 2020-04-08 14:32:47 -04:00
Julien Chaumond
a594ee9c84 More doc for model cards (#3698)
see https://github.com/huggingface/transformers/pull/3679#pullrequestreview-389368270
2020-04-08 12:12:52 -04:00
Julien Chaumond
83703cd077 Update doc for {Summarization,Translation}Pipeline and other tweaks 2020-04-08 09:45:00 -04:00
Seyone Chithrananda
a1b3b4167e Created README.md for model card ChemBERTa (#3666)
* created readme.md

* update readme with fixes

Fixes from PR comments
2020-04-08 09:10:20 -04:00
Lorenzo Ampil
747907dc5e Fix typo in FeatureExtractionPipeline docstring 2020-04-08 09:08:56 -04:00
Sam Shleifer
715aa5b135 [Bart] Replace config.output_past with use_cache kwarg (#3632) 2020-04-07 19:08:26 -04:00
Sam Shleifer
e344e3d402 [examples] SummarizationDataset cleanup (#3451) 2020-04-07 19:05:58 -04:00
Patrick von Platen
b0ad069517 [Tokenization] fix edge case for bert tokenization (#3517)
* fix egde gase for bert tokenization

* add Lysandres comments for improvement

* use new is_pretokenized_flag
2020-04-07 16:26:31 -04:00
Patrick von Platen
80fa0f7812 [Examples, Benchmark] Improve benchmark utils (#3674)
* improve and add features to benchmark utils

* update benchmark style

* remove output files
2020-04-07 16:25:57 -04:00
Michael Pang
05deb52dc1 Optimize causal mask using torch.where (#2715)
* Optimize causal mask using torch.where

Instead of multiplying by 1.0 float mask, use torch.where with a bool mask for increased performance.

* Maintain compatiblity with torch 1.0.0 - thanks for PR feedback

* Fix typo

* reformat line for CI
2020-04-07 22:19:18 +02:00
Sam Shleifer
0a4b1068e1 Speedup torch summarization tests (#3663) 2020-04-07 14:01:30 -04:00
Myle Ott
5aa8a278a3 Fix roberta checkpoint conversion script (#3642) 2020-04-07 12:03:23 -04:00
Julien Chaumond
11cc1e168b [model_cards] Turn down spurious warnings
Close #3639 + spurious warning mentioned in #3227

cc @lysandrejik @thomwolf
2020-04-07 10:20:19 -04:00
Teven
0a9d09b42a fixed TransfoXLLMHeadModel documentation (#3661)
Co-authored-by: TevenLeScao <teven.lescao@gmail.com>
2020-04-07 00:47:51 +02:00
Funtowicz Morgan
96ab75b8dd Tokenizers v3.0.0 (#3185)
* Renamed num_added_tokens to num_special_tokens_to_add

Signed-off-by: Morgan Funtowicz <morgan@huggingface.co>

* Cherry-Pick: Partially fix space only input without special tokens added to the output #3091

Signed-off-by: Morgan Funtowicz <morgan@huggingface.co>

* Added property is_fast on PretrainedTokenizer and PretrainedTokenizerFast

Signed-off-by: Morgan Funtowicz <morgan@huggingface.co>

* Make fast tokenizers unittests work on Windows.

* Entirely refactored unittest for tokenizers fast.

* Remove ABC class for CommonFastTokenizerTest

* Added embeded_special_tokens tests from allenai @dirkgr

* Make embeded_special_tokens tests from allenai more generic

* Uniformize vocab_size as a property for both Fast and normal tokenizers

* Move special tokens handling out of PretrainedTokenizer (SpecialTokensMixin)

* Ensure providing None input raise the same ValueError than Python tokenizer + tests.

* Fix invalid input for assert_padding when testing batch_encode_plus

* Move add_special_tokens from constructor to tokenize/encode/[batch_]encode_plus methods parameter.

* Ensure tokenize() correctly forward add_special_tokens to rust.

* Adding None checking on top on encode / encode_batch for TransfoXLTokenizerFast.
Avoid stripping on None values.

* unittests ensure tokenize() also throws a ValueError if provided None

* Added add_special_tokens unittest for all supported models.

* Style

* Make sure TransfoXL test run only if PyTorch is provided.

* Split up tokenizers tests for each model type.

* Fix invalid unittest with new tokenizers API.

* Filter out Roberta openai detector models from unittests.

* Introduce BatchEncoding on fast tokenizers path.

This new structure exposes all the mappings retrieved from Rust.
It also keeps the current behavior with model forward.

* Introduce BatchEncoding on slow tokenizers path.

Backward compatibility.

* Improve error message on BatchEncoding for slow path

* Make add_prefix_space True by default on Roberta fast to match Python in majority of cases.

* Style and format.

* Added typing on all methods for PretrainedTokenizerFast

* Style and format

* Added path for feeding pretokenized (List[str]) input to PretrainedTokenizerFast.

* Style and format

* encode_plus now supports pretokenized inputs.

* Remove user warning about add_special_tokens when working on pretokenized inputs.

* Always go through the post processor.

* Added support for pretokenized input pairs on encode_plus

* Added is_pretokenized flag on encode_plus for clarity and improved error message on input TypeError.

* Added pretokenized inputs support on batch_encode_plus

* Update BatchEncoding methods name to match Encoding.

* Bump setup.py tokenizers dependency to 0.7.0rc1

* Remove unused parameters in BertTokenizerFast

* Make sure Roberta returns token_type_ids for unittests.

* Added missing typings

* Update add_tokens prototype to match tokenizers side and allow AddedToken

* Bumping tokenizers to 0.7.0rc2

* Added documentation for BatchEncoding

* Added (unused) is_pretokenized parameter on PreTrainedTokenizer encode_plus/batch_encode_plus methods.

* Added higher-level typing for tokenize / encode_plus / batch_encode_plus.

* Fix unittests failing because add_special_tokens was defined as a constructor parameter on Rust Tokenizers.

* Fix text-classification pipeline using the wrong tokenizer

* Make pipelines works with BatchEncoding

* Turn off add_special_tokens on tokenize by default.

Signed-off-by: Morgan Funtowicz <morgan@huggingface.co>

* Remove add_prefix_space from tokenize call in unittest.

Signed-off-by: Morgan Funtowicz <morgan@huggingface.co>

* Style and quality

Signed-off-by: Morgan Funtowicz <morgan@huggingface.co>

* Correct message for batch_encode_plus none input exception.

Signed-off-by: Morgan Funtowicz <morgan@huggingface.co>

* Fix invalid list comprehension for offset_mapping overriding content every iteration.

Signed-off-by: Morgan Funtowicz <morgan@huggingface.co>

* TransfoXL uses Strip normalizer.

Signed-off-by: Morgan Funtowicz <morgan@huggingface.co>

* Bump tokenizers dependency to 0.7.0rc3

Signed-off-by: Morgan Funtowicz <morgan@huggingface.co>

* Support AddedTokens for special_tokens and use left stripping on mask for Roberta.

Signed-off-by: Morgan Funtowicz <morgan@huggingface.co>

* SpecilaTokenMixin can use slots to faster access to underlying attributes.

Signed-off-by: Morgan Funtowicz <morgan@huggingface.co>

* Remove update_special_tokens from fast tokenizers.

* Ensure TransfoXL unittests are run only when torch is available.

* Style.

Signed-off-by: Morgan Funtowicz <morgan@huggingface.co>

* Style

* Style 🙏🙏

* Remove slots on SpecialTokensMixin, need deep dive into pickle protocol.

* Remove Roberta warning on __init__.

* Move documentation to Google style.

Co-authored-by: LysandreJik <lysandre.debut@reseau.eseo.fr>
2020-04-07 00:29:15 +02:00
Ethan Perez
e52d1258e0 Fix RoBERTa/XLNet Pad Token in run_multiple_choice.py (#3631)
* Fix RoBERTa/XLNet Pad Token in run_multiple_choice.py

`convert_examples_to_fes atures` sets `pad_token=0` by default, which is correct for BERT but incorrect for RoBERTa (`pad_token=1`) and XLNet (`pad_token=5`). I think the other arguments to `convert_examples_to_features` are correct, but it might be helpful if someone checked who is more familiar with this part of the codebase.

* Simplifying change to match recent commits
2020-04-06 16:52:22 -04:00
ktrapeznikov
0ac33ddd8d Create README.md 2020-04-06 16:35:29 -04:00
Manuel Romero
326e6ebae7 Add model card 2020-04-06 16:30:01 -04:00
Manuel Romero
43eca3f878 Add model card 2020-04-06 16:29:51 -04:00
Manuel Romero
6bec88ca42 Create README.md 2020-04-06 16:29:44 -04:00
Manuel Romero
769b60f935 Add model card (#3655)
* Add model card

* Fix model name in fine-tuning script
2020-04-06 16:29:36 -04:00
Manuel Romero
c4bcb01906 Create model card (#3654)
* Create model card

* Fix model name in fine-tuning script
2020-04-06 16:29:25 -04:00
Manuel Romero
6903a987b8 Create README.md 2020-04-06 16:29:02 -04:00
MichalMalyska
760872dbde Create README.md (#3662) 2020-04-06 16:27:50 -04:00
jjacampos
47e1334c0b Add model card for BERTeus (#3649)
* Add model card for BERTeus

* Update README
2020-04-06 16:21:25 -04:00
Suchin
529534dc2f BioMed Roberta-Base (AllenAI) (#3643)
* added model card

* updated README

* updated README

* updated README

* added evals

* removed pico eval

* Tweaks

Co-authored-by: Julien Chaumond <chaumond@gmail.com>
2020-04-06 16:12:09 -04:00
Lysandre Debut
261c4ff4e2 Update notebooks (#3620)
* Update notebooks

* From local to global link

* from local links to *actual* global links
2020-04-06 14:32:39 -04:00
Julien Chaumond
39a34cc375 [model_cards] ELECTRA (w/ examples of usage)
Co-Authored-By: Kevin Clark <clarkkev@users.noreply.github.com>
Co-Authored-By: Lysandre Debut <lysandre.debut@reseau.eseo.fr>
2020-04-06 11:43:33 -04:00
LysandreJik
ea6dba2787 Re-pin isort 2020-04-06 10:09:54 -04:00
LysandreJik
11c3257a18 unpin isort for pypi
Some checks failed
GitHub-hosted runner / check_code_quality (push) Has been cancelled
2020-04-06 10:06:41 -04:00
LysandreJik
36bffc81b3 Release: v2.8.0 2020-04-06 10:03:53 -04:00
Patrick von Platen
2ee410560e [Generate, Test] Split generate test function into beam search, no beam search (#3601)
* split beam search and no beam search test

* fix test

* clean generate tests
2020-04-06 10:37:05 +02:00
Patrick von Platen
1789c7daf1 fix argument order (#3637) 2020-04-05 12:33:41 +02:00
Patrick von Platen
b809d2f073 Fix TF T5 docstring (#3636) 2020-04-05 12:23:09 +02:00
Timo Moeller
4ab8ab4f50 Adjust model card to reflect changes to vocabulary
(cherry picked from commit 8e25c4bf2838211378db4d93e7f9722386cc1a04)
2020-04-04 15:27:41 -04:00
ktrapeznikov
ac40eed1a5 Create README.md
adding readme for 
ktrapeznikov/albert-xlarge-v2-squad-v2
2020-04-04 15:18:54 -04:00
ktrapeznikov
fd9995ebc5 Create README.md 2020-04-04 15:18:31 -04:00
Julien Chaumond
5d912e7ed4 Tweak typing for #3566 2020-04-04 15:04:03 -04:00
Julien Chaumond
94eb68d742 weigths*weights 2020-04-04 15:03:26 -04:00
Manuel Romero
243e687be6 Create model card 2020-04-04 08:20:34 -04:00
Julien Chaumond
3e4b4dd190 [model_cards] Link to ExBERT visualisation
Hat/tip @bhoov @HendrikStrobelt @sebastianGehrmann

Also cc @srush and @thomwolf
2020-04-03 20:03:29 -04:00
Max Ryabinin
c6acd246ec Speed up GELU computation with torch.jit (#2988)
* Compile gelu_new with torchscript

* Compile _gelu_python with torchscript

* Wrap gelu_new with torch.jit for torch>=1.4
2020-04-03 15:20:21 -04:00
Lysandre Debut
d5d7d88612 ELECTRA (#3257)
* Electra wip

* helpers

* Electra wip

* Electra v1

* ELECTRA may be saved/loaded

* Generator & Discriminator

* Embedding size instead of halving the hidden size

* ELECTRA Tokenizer

* Revert BERT helpers

* ELECTRA Conversion script

* Archive maps

* PyTorch tests

* Start fixing tests

* Tests pass

* Same configuration for both models

* Compatible with base + large

* Simplification + weight tying

* Archives

* Auto + Renaming to standard names

* ELECTRA is uncased

* Tests

* Slight API changes

* Update tests

* wip

* ElectraForTokenClassification

* temp

* Simpler arch + tests

Removed ElectraForPreTraining which will be in a script

* Conversion script

* Auto model

* Update links to S3

* Split ElectraForPreTraining and ElectraForTokenClassification

* Actually test PreTraining model

* Remove num_labels from configuration

* wip

* wip

* From discriminator and generator to electra

* Slight API changes

* Better naming

* TensorFlow ELECTRA tests

* Accurate conversion script

* Added to conversion script

* Fast ELECTRA tokenizer

* Style

* Add ELECTRA to README

* Modeling Pytorch Doc + Real style

* TF Docs

* Docs

* Correct links

* Correct model intialized

* random fixes

* style

* Addressing Patrick's and Sam's comments

* Correct links in docs
2020-04-03 14:10:54 -04:00
Yohei Tamura
8594dd80dd BertJapaneseTokenizer accept options for mecab (#3566)
* BertJapaneseTokenizer accept options for mecab

* black

* fix mecab_option to Option[str]
2020-04-03 11:12:19 -04:00
HUSEIN ZOLKEPLI
216e167ce6 Added albert-base-bahasa-cased README and fixed tiny-bert-bahasa-cased README (#3613)
* add bert bahasa readme

* update readme

* update readme

* added xlnet

* added tiny-bert and fix xlnet readme

* added albert base
2020-04-03 09:28:43 -04:00
ahotrod
1ac6a246d8 Update README.md (#3604)
Update AutoModel & AutoTokernizer loading.
2020-04-03 09:28:25 -04:00
ahotrod
e91692f4a3 Update README.md (#3603) 2020-04-03 09:27:57 -04:00
HenrykBorzymowski
8e287d507d corrected mistake in polish model cards (#3611)
* added model_cards for polish squad models

* corrected mistake in polish design cards

Co-authored-by: Henryk Borzymowski <henryk.borzymowski@pwc.com>
2020-04-03 09:07:15 -04:00
redewiedergabe
81484b447b Create README.md (#3568)
* Create README.md

* added meta block (language: german)

* Added additional information about test data
2020-04-02 21:48:31 -04:00
ahotrod
9f6349aba9 Create README.md 2020-04-02 21:43:12 -04:00
Henryk Borzymowski
ddb1ce7418 added model_cards for polish squad models 2020-04-02 21:40:16 -04:00
Patrick von Platen
f68d22850c delete bogus print statement (#3595) 2020-04-02 21:49:34 +02:00
Nicolas
c50aa67bff Resizing embedding matrix before sending it to the optimizer. (#3532)
* Resizing embedding matrix after sending it to the optimizer prevents from updating the newly resized matrix.

* Remove space for style matter
2020-04-02 15:00:05 -04:00
Mark Kockerbeck
1b10159950 Adding should_continue check for retraining (#3509) 2020-04-02 14:07:08 -04:00
Patrick von Platen
390c128592 [Encoder-Decoder] Force models outputs to always have batch_size as their first dim (#3536)
* solve conflicts

* improve comments
2020-04-02 15:18:33 +02:00
Patrick von Platen
ab5d06a094 [T5, examples] replace heavy t5 models with tiny random models (#3556)
* replace heavy t5 models with tiny random models as was done by sshleifer

* fix isort
2020-04-02 12:34:05 +02:00
Patrick von Platen
a4ee4da18a [T5, TF 2.2] change tf t5 argument naming (#3547)
* change tf t5 argument naming for TF 2.2

* correct bug in testing
2020-04-01 22:04:20 +02:00
Patrick von Platen
06dd597552 fix bug in warnings T5 pipelines (#3545) 2020-04-01 21:59:12 +02:00
Anirudh Srinivasan
9de9ceb6c5 Correct output shape for Bert NSP models in docs (#3482) 2020-04-01 15:04:38 -04:00
Patrick von Platen
b815edf69f [T5, Testst] Add extensive hard-coded integration tests and make sure PT and TF give equal results (#3550)
* add some t5 integration tests

* finish summarization and translation integration tests for T5 - results loook good

* add tf test

* fix == vs is bug

* fix tf beam search error and make tf t5 tests pass
2020-04-01 18:01:33 +02:00
HUSEIN ZOLKEPLI
8538ce9044 Add tiny-bert-bahasa-cased model card (#3567)
* add bert bahasa readme

* update readme

* update readme

* added xlnet

* added tiny-bert and fix xlnet readme
2020-04-01 07:15:00 -04:00
Manuel Romero
c1a6252be1 Create model card (#3557)
Create model card for: distilbert-multi-finetuned-for-xqua-on-tydiqa
2020-04-01 07:14:23 -04:00
Julien Chaumond
50e15c825c Tokenizers: Start cleaning examples a little (#3455)
* Start cleaning examples

* Fixup
2020-04-01 07:13:40 -04:00
Patrick von Platen
b38d552a92 [Generate] Add bad words list argument to the generate function (#3367)
* add bad words list

* make style

* add bad_words_tokens

* make style

* better naming

* make style

* fix typo
2020-03-31 18:42:31 +02:00
Patrick von Platen
ae6834e028 [Examples] Clean summarization and translation example testing files for T5 and Bart (#3514)
* fix conflicts

* add model size argument to summarization

* correct wrong import

* fix isort

* correct imports

* other isort make style

* make style
2020-03-31 17:54:13 +02:00
Manuel Romero
0373b60c4c Update README.md (#3552)
- Show that the last uploaded version was trained on more data (custom_license files)
2020-03-31 10:40:34 -04:00
Patrick von Platen
83d1fbcff6 [Docs] Add usage examples for translation and summarization (#3538) 2020-03-31 09:36:03 -04:00
Patrick von Platen
55bcae7f25 remove useless and confusing lm_labels line (#3531) 2020-03-31 09:32:25 -04:00
Patrick von Platen
42e1e3c67f Update usage doc regarding generate fn (#3504) 2020-03-31 09:31:46 -04:00
Patrick von Platen
57b0fab692 Add better explanation to check docs locally. (#3459) 2020-03-31 09:30:17 -04:00
Manuel Romero
a8d4dff0a1 Update README.md (#3470)
Fix typo
2020-03-31 08:01:09 -04:00
Manuel Romero
4a5663568f Create card for the model: GPT-2-finetuned-covid-bio-medrxiv (#3453) 2020-03-31 08:01:03 -04:00
Branden Chan
bbedb59675 Create README.md (#3393)
* Create README.md

* Update README.md
2020-03-31 08:00:35 -04:00
Manuel Romero
c2cf192943 Add link to 16 POS tags model (#3465) 2020-03-31 08:00:00 -04:00
Gabriele Sarti
c82ef72158 Added CovidBERT-NLI model card (#3477) 2020-03-31 07:59:49 -04:00
Manuel Romero
b48a1f08c1 Add text shown in example of usage (#3464) 2020-03-31 07:59:36 -04:00
Manuel Romero
99833a9cbf Create model card (#3487) 2020-03-31 07:59:22 -04:00
Sho Arora
ebceeeacda Add electra and alectra model cards (#3524) 2020-03-31 07:58:48 -04:00
Leandro von Werra
a6c4ee27fd Add model cards (#3537)
* feat: add model card bert-imdb

* feat: add model card gpt2-imdb-pos

* feat: add model card gpt2-imdb
2020-03-31 07:54:45 -04:00
Ethan Perez
e5c393dceb [Bug fix] Using loaded checkpoint with --do_predict (instead of… (#3437)
* Using loaded checkpoint with --do_predict

Without this fix, I'm getting near-random validation performance for a trained model, and the validation performance differs per validation run. I think this happens since the `model` variable isn't set with the loaded checkpoint, so I'm using a randomly initialized model. Looking at the model activations, they differ each time I run evaluation (but they don't with this fix).

* Update checkpoint loading

* Fixing model loading
2020-03-30 17:06:08 -04:00
Sam Shleifer
8deff3acf2 [bart-tiny-random] Put a 5MB model on S3 to allow faster exampl… (#3488) 2020-03-30 12:28:27 -04:00
dougian
1f72865726 [BART] Update encoder and decoder on set_input_embedding (#3501)
Co-authored-by: Ioannis Douratsos <ioannisd@amazon.com>
2020-03-30 12:20:37 -04:00
Julien Chaumond
cc598b312b [InputExample] Unfreeze for now, cf. #3423 2020-03-30 10:41:49 -04:00
Julien Plu
d38bbb225f Update the NER TF script (#3511)
* Update the NER TF script to remove the softmax and make the pad token label id to -1

* Reformat the quality and style

Co-authored-by: Julien Plu <julien.plu@adevinta.com>
2020-03-30 09:50:12 -04:00
LysandreJik
eff757f2e3 Re-pin isort version 2020-03-30 09:00:47 -04:00
LysandreJik
a009d751c2 Un-pin isort for v2.7.0 pypi 2020-03-30 08:55:10 -04:00
LysandreJik
6f5a12a583 Release: v2.7.0
Some checks failed
GitHub-hosted runner / check_code_quality (push) Has been cancelled
2020-03-30 08:49:24 -04:00
Patrick von Platen
296252c49e fix lm lables in docstring (#3529) 2020-03-30 14:26:24 +02:00
Patrick von Platen
75ec6c9e3a [T5] make decoder input ids optional for t5 training (#3521)
* make decoder input ids optional for t5 training

* lm_lables should not be shifted in t5

* add tests

* finish shift right functionality for PT T5

* move shift right to correct class

* cleaner code

* replace -100 values with pad token id

* add assert statement

* remove unnecessary for loop

* make style
2020-03-30 13:45:26 +02:00
Patrick von Platen
5b44e0a31b [T5] Add training documenation (#3507)
* Add clear description of how to train T5

* correct docstring in T5

* correct typo

* correct docstring format

* update t5 model docs

* implement collins feedback

* fix typo and add more explanation for sentinal tokens

* delete unnecessary todos
2020-03-30 13:35:53 +02:00
Sam Shleifer
33ef7002e1 [Docs] examples/summarization/bart: Simplify CNN/DM preprocessi… (#3516) 2020-03-29 13:25:42 -04:00
Sam Shleifer
f6a23d1911 [BART] add bart-large-xsum weights (#3422) 2020-03-29 10:51:13 -04:00
Stefan Schweter
601ac5b1dc [model_cards]: use MIT license for all dbmdz models 2020-03-27 18:06:25 -04:00
Patrick von Platen
17dceae7a1 Fix circle ci flaky fail of wmt example (#3485)
* force bleu

* fix wrong file name

* rename file

* different filenames for each example test

* test files should clean up after themselves

* test files should clean up after themselves

* do not force bleu

* correct typo

* fix isort
2020-03-27 13:01:28 -04:00
Patrick von Platen
00ea100e96 add summarization and translation to notebook (#3478) 2020-03-27 11:05:37 -04:00
Funtowicz Morgan
b08259a120 run_ner.py / bert-base-multilingual-cased can output empty tokens (#2991)
* Use tokenizer.num_added_tokens to count number of added special_tokens instead of hardcoded numbers.

Signed-off-by: Morgan Funtowicz <morgan@huggingface.co>

* run_ner.py - Do not add a label to the labels_ids if word_tokens is empty.

This can happen when using bert-base-multilingual-cased with an input containing an unique space.
In this case, the tokenizer will output just an empty word_tokens thus leading to an non-consistent behavior
over the labels_ids tokens adding one more tokens than tokens vector.

Signed-off-by: Morgan Funtowicz <morgan@huggingface.co>
2020-03-27 10:59:55 -04:00
Patrick von Platen
f4f4946836 Rename t5-large to t5-base in README.md 2020-03-27 15:57:58 +01:00
Patrick von Platen
fa9af2468a Add T5 to docs (#3461)
* add t5 docs basis

* improve docs

* add t5 docs

* improve t5 docstring

* add t5 tokenizer docstring

* finish docstring

* make style

* add pretrained models

* correct typo

* make examples work

* finalize docs
2020-03-27 10:57:16 -04:00
Lysandre Debut
ff80b73157 Add option to choose T5 model size. (#3480)
T5-small in test


isort
2020-03-27 15:56:59 +01:00
LysandreJik
e2c05f06ef Correct indentation in docstring
For some reason Sphinx extremely dislikes this and crashes.
2020-03-27 09:28:52 -04:00
Sam Shleifer
3ee431dd4c [Bart/Memory] Two separate, smaller decoder attention masks (#3371) 2020-03-26 21:34:15 -04:00
Manuel Romero
53fe733805 Model Cards: Fix grammar error (#3467) 2020-03-26 21:33:33 -04:00
Sam Shleifer
c10decf7a0 [Bart: example] drop columns that are exclusively pad_token_id… (#3400)
* trim seq_len below 1024 if there are columns full of pad_token_id
* Centralize trim_batch so SummarizationDataset can use it too
2020-03-26 19:33:54 -04:00
Sam Shleifer
63f4d8cad0 [Bart/Memory] SelfAttention only returns weights if config.outp… (#3369) 2020-03-26 18:42:39 -04:00
Sam Shleifer
2b2a2f8df2 [Bart] Fix: put dummy_inputs on correct device (#3398)
* Dummy inputs to model.device

* Move self.device to ModuleUtilsMixin
2020-03-26 18:42:09 -04:00
Sam Shleifer
1a5aefc95c [Seq2Seq Generation] Call encoder before expanding input_ids (#3370) 2020-03-26 18:41:19 -04:00
Sam Shleifer
39371ee454 [Bart/Memory] don't create lm_head (#3323)
* delete lm_head, skips weight tying
* Fixed s3
2020-03-26 18:40:39 -04:00
Patrick von Platen
5ad2ea06af Add wmt translation example (#3428)
* add translation example

* make style

* adapt docstring

* add gpu device as input for example

* small renaming

* better README
2020-03-26 19:07:59 +01:00
Patrick von Platen
b4fb94fe6d revert unpin isort commit 2020-03-26 13:19:18 -04:00
Patrick von Platen
e703e923ca Add t5 summarization example (#3411)
* rebase to master

* change tf to pytorch

* change to pytorch

* small fix

* renaming

* add gpu training possibility

* renaming

* improve README

* incoorporate collins feedback

* better Readme

* better README.md
2020-03-26 18:17:55 +01:00
sakares saengkaew
1a6c546c6f Add missing token classification for XLM (#3277)
* Add the missing token classification for XLM

* fix styling

* Add XLMForTokenClassification to AutoModelForTokenClassification class

* Fix docstring typo for non-existing class

* Add the missing token classification for XLM

* fix styling

* fix styling

* Add XLMForTokenClassification to AutoModelForTokenClassification class

* Fix docstring typo for non-existing class

* Add missing description for AlbertForTokenClassification

* fix styling

* Add missing docstring for AlBert

* Slow tests should be slow

Co-authored-by: Sakares Saengkaew <s.sakares@gmail.com>
Co-authored-by: LysandreJik <lysandre.debut@reseau.eseo.fr>
2020-03-26 10:22:13 -04:00
Patrick von Platen
311970546f rename string in pipeline 2020-03-26 14:59:49 +01:00
Manuel Romero
7420a6a9cc Create card for model GPT-2-finetuned-CORD19 2020-03-26 09:10:09 -04:00
Patrick von Platen
022e8fab97 Adds translation pipeline (#3419)
* fix merge conflicts

* add t5 summarization example

* change parameters for t5 summarization

* make style

* add first code snippet for translation

* only add prefixes

* add prefix patterns

* make style

* renaming

* fix conflicts

* remove unused patterns

* solve conflicts

* fix merge conflicts

* remove translation example

* remove summarization example

* make sure tensors are in numpy for float comparsion

* re-add t5 config

* fix t5 import config typo

* make style

* remove unused numpy statements

* update doctstring

* import translation pipeline
2020-03-26 13:50:58 +01:00
HUSEIN ZOLKEPLI
3c5c567507 Update model card huseinzol05/bert-base-bahasa-cased (#3425)
* add bert bahasa readme

* update readme

* update readme

* added xlnet
2020-03-26 07:50:27 -04:00
Patrick von Platen
9c683ef01e Add t5 to pipeline(task='summarization') (#3413)
* solve conflicts

* move warnings below

* incorporate changes

* add pad_to_max_length to pipelines

* add bug fix for T5 beam search

* add prefix patterns

* make style

* fix conflicts

* adapt pipelines for task specific parameters

* improve docstring

* remove unused patterns
2020-03-26 11:03:13 +01:00
Lysandre Debut
ffcffebe85 Force the return of token type IDs (#3439) 2020-03-26 09:41:36 +01:00
Travis McGuire
010e0460b2 Updated/added model cards (#3435) 2020-03-25 16:40:03 -04:00
Patrick von Platen
ffa17fe322 Extend config with task specific configs. (#3433)
* add new default configs

* change prefix default to None
2020-03-25 21:32:04 +01:00
Julien Chaumond
83272a3853 Experiment w/ dataclasses (including Py36) (#3423)
* [ci] Also run test_examples in py37

(will revert at the end of the experiment)

* InputExample: use immutable dataclass

* [deps] Install dataclasses for Py<3.7

* [skip ci] Revert "[ci] Also run test_examples in py37"

This reverts commit d29afd9959786b77759b0b8fa4e6b4335b952015.
2020-03-25 11:10:20 -04:00
Gabriele Sarti
ccbe839ee0 Added BioBERT-NLI model card (#3421) 2020-03-24 21:15:55 -04:00
Andre Carrera
3d76df3a12 BART for summarization training with CNN/DM using pytorch-lightning 2020-03-24 21:00:24 -04:00
Julien Chaumond
eaabaaf750 [run_language_modeling] Fix: initialize a new model from a config object 2020-03-24 17:56:40 -04:00
Julien Chaumond
f8823bad9a Expose missing mappings (see #3415) 2020-03-24 17:46:25 -04:00
Julien Chaumond
d0c36a7b72 [ci] Partial revert of 18eec3a984 due to fbc5bf10cf 2020-03-24 12:10:43 -04:00
LysandreJik
fbc5bf10cf v2.6.0 release: isort un-pinned
Some checks failed
GitHub-hosted runner / check_code_quality (push) Has been cancelled
2020-03-24 11:52:02 -04:00
Manuel Romero
b88bda6af3 Add right model and tokenizer path in example 2020-03-24 11:30:12 -04:00
Stefan Schweter
b31ef225cf [model_cards] 🇹🇷 Add new (uncased, 128k) BERTurk model 2020-03-24 11:29:06 -04:00
Stefan Schweter
b4009cb001 [model_cards] 🇹🇷 Add new (cased, 128k) BERTurk model 2020-03-24 11:29:06 -04:00
Stefan Schweter
d3283490ef [model_cards] 🇹🇷 Add new (uncased) BERTurk model 2020-03-24 11:29:06 -04:00
Mohamed El-Geish
e279a312d6 Model cards for CS224n SQuAD2.0 models (#3406)
* Model cards for CS224n SQuAD2.0 models

* consistent spacing
2020-03-24 11:28:33 -04:00
Gabriele Sarti
7372e62b2c Added precisions in SciBERT-NLI model card (#3410) 2020-03-24 11:01:56 -04:00
LysandreJik
471cce24b3 Release: v2.6.0 2020-03-24 10:37:32 -04:00
Patrick von Platen
e392ba6938 Add camembert integration tests (#3375)
* add integration tests for camembert

* use jplu/tf-camembert fro the moment

* make style
2020-03-24 10:18:37 +01:00
Julien Chaumond
a8e3336a85 [examples] Use AutoModels in more examples 2020-03-23 20:11:14 -04:00
Julien Chaumond
ec6766a363 [deps] scikit-learn's transient issue was fixed 2020-03-23 18:38:09 -04:00
Julien Chaumond
f7dcf8fcea [BertAbs] Move files around for more consistent naming 2020-03-23 13:58:49 -04:00
Julien Chaumond
e25c4f4027 [ALBERT] move things around for more consistent naming
see #3359

cc @lysandrejik
2020-03-23 13:58:21 -04:00
Manuel Romero
85b324bee5 Add comparison table with older brother in family 2020-03-23 12:11:20 -04:00
Manuel Romero
b7aa077a63 Create card for the model 2020-03-23 12:10:41 -04:00
Manuel Romero
f740177c87 Add comparison table with new models 2020-03-23 12:10:23 -04:00
LysandreJik
e52482909b Correct order for dev/quality dependencies
cc @julien-c
2020-03-23 12:01:23 -04:00
Gabriele Sarti
28424906c2 Added scibert-nli model card 2020-03-23 11:55:41 -04:00
Julien Chaumond
18eec3a984 [ci] simpler way to load correct version of isort
hat/tip @bramvanroy
2020-03-23 10:03:22 -04:00
Julien Chaumond
cf72479bf1 One last reorder of {scheduler,optimizer}.step() 2020-03-20 18:05:50 -04:00
Elijah Rippeth
634bf6cf7e fixes lr_scheduler warning
For more details, see https://pytorch.org/docs/stable/optim.html#how-to-adjust-learning-rate
2020-03-20 18:03:50 -04:00
Travis McGuire
265709f5cd New model, new model cards 2020-03-20 18:01:01 -04:00
Bram Vanroy
115abd2166 Handle pinned version of isort
The CONTRIBUTING file pins to a specific version of isort, so we might as well install that in `dev` . This makes it easier for contributors so they don't have to manually install the specific commit.
2020-03-20 18:00:04 -04:00
Patrick von Platen
95e00d0808 Clean special token init in modeling_....py (#3264)
* make style

* fix conflicts
2020-03-20 21:41:04 +01:00
Nitish Shirish Keskar
8becb73293 removing torch.cuda.empty_cache() from TF function (#3267)
torch.cuda.empty_cache() was being called from a TF function (even when torch is unavailable)
not sure any replacement is needed if TF OOMs
2020-03-19 23:25:30 +01:00
Julien Chaumond
ecfd336318 Simpler Error message when loading config/model with .from_pretrained() (#3341) 2020-03-19 23:23:03 +01:00
Kyeongpil Kang
8eeefcb576 Update 01-training-tokenizers.ipynb (typo issue) (#3343)
I found there are two grammar errors or typo issues in the explanation of the encoding properties.

The original sentences:
If your was made of multiple \"parts\" such as (question, context), then this would be a vector with for each token the segment it belongs to
If your has been truncated into multiple subparts because of a length limit (for BERT for example the sequence length is limited to 512), this will contain all the remaining overflowing parts.

I think "input" should be inserted after the phrase "If your".
2020-03-19 23:21:49 +01:00
Patrick von Platen
bbf26c4e61 Support T5 Generation (#3228)
* fix conflicts

* update bart max length test

* correct spelling mistakes

* implemented model specific encode function

* fix merge conflicts

* better naming

* save intermediate state -> need to rethink strucuture a bit

* leave tf problem as it is for now

* current version

* add layers.pop

* remove ipdb

* make style

* clean return cut decoding

* remove ipdbs

* Fix restoring layers in the decoders that doesnt exists.

* push good intermediate solution for now

* fix conflicts

* always good to refuse to merge conflicts when rebasing

* fix small bug

* improve function calls

* remove unused file

* add correct scope behavior for t5_generate

Co-authored-by: Morgan Funtowicz <funtowiczmo@gmail.com>
2020-03-19 23:18:23 +01:00
Julien Chaumond
656e1386a2 Fix #3305: run_ner only possible on ModelForTokenClassification models 2020-03-19 16:41:28 -04:00
husein zolkepli
0c44b11917 add bert bahasa readme 2020-03-19 15:08:19 -04:00
Manuel Romero
e99af3b17b Create model card for bert-small-finetuned-squadv2 2020-03-19 15:07:55 -04:00
Manuel Romero
39db055268 Merge pull request #3348 from mrm8488/patch-28
Create card for BERT-Mini finetuned on SQuAD v2
2020-03-19 15:07:39 -04:00
Manuel Romero
dedc7a8fdb Create card for BERT-Tiny fine-tuned on SQuAD v2
- Only 17MB of Model weights!!
2020-03-19 15:07:22 -04:00
Manuel Romero
676adf8625 Created card for spanbert-finetuned-squadv1 2020-03-19 15:06:35 -04:00
Antti Virtanen
11d8bcc9d7 Add model cards for FinBERT. (#3331)
* Add a model card for FinBERT

This is a copy of https://github.com/TurkuNLP/FinBERT/blob/master/README.md.

* Added a file for uncased.

* Add metadata for cased.

* Added metadata for uncased.
2020-03-19 15:06:01 -04:00
Lysandre Debut
f049be7ad4 Export ALBERT main layer in TensorFlow (#3354) 2020-03-19 13:53:05 -04:00
Kyeongpil Kang
3bedfd3347 Fix wrong link for the notebook file (#3344)
For the tutorial of "How to generate text", the URL link was wrong (it was linked to the tutorial of "How to train a language model").

I fixed the URL.
2020-03-19 17:22:47 +01:00
Serkan Karakulak
b2c2c31c60 Minor Bug Fix for Running Roberta on Glue (#3240)
* added return_token_type_ids argument for tokenizers which do not generate return_type_ids by default

* fixed styling

* Style

Co-authored-by: LysandreJik <lysandre.debut@reseau.eseo.fr>
2020-03-19 12:08:31 -04:00
Sam Shleifer
4e4403c9b4 [BART] torch 1.0 compatibility (#3322)
* config.activation_function
2020-03-19 11:56:54 -04:00
mataney
c44a17db1b [FIX] not training when epoch is small (#3006)
* solving bug where for small epochs and large gradient_accumulation_steps we never train

* black formatting

* no need to change these files
2020-03-19 11:21:21 -04:00
Sam Shleifer
ad7233fc01 [BART] cleanup: remove redundant kwargs, improve docstrings (#3319) 2020-03-19 11:16:51 -04:00
Mohamed El-Geish
cd21d8bc00 Typo in warning message (#3219)
`T5Tokenizer` instead of `XLNetTokenizer`
2020-03-19 09:49:25 -04:00
Matthew Goldey
8d3e218ea6 fix typo in docstring demonstrating usage (#3213) 2020-03-19 09:47:54 -04:00
Patrick von Platen
cec3cdda15 Fix input ids can be none attn mask (#3345)
* fix issue 3289

* fix attention mask if input_ids None behavior
2020-03-19 09:55:17 +01:00
Junyi_Li
f6d813aaaa Create README.md 2020-03-18 23:45:02 -04:00
Junyi_Li
939328111b Create README.md
roberta_chinese_base card
2020-03-18 23:44:12 -04:00
Junyi_Li
29442d2edf Create README.md
albert_chinese_tiny card
2020-03-18 23:43:49 -04:00
Kyle Lo
20139b7c8d Added model cards for SciBERT models uploaded under AllenAI org (#3330)
* Create README.md

* model card

* add model card for cased
2020-03-18 15:45:11 -04:00
Morgan Funtowicz
cae334c43c Improve fill-mask pipeline example in 03-pipelines notebook.
Remove hardcoded mask_token and use the value provided by the tokenizer.
2020-03-18 17:11:42 +01:00
Branden Chan
4b1970bb4c Create README.md 2020-03-18 11:37:17 -04:00
Lysandre Debut
d6afbd323d XLM-R Tokenizer now passes common tests + Integration tests (#3198)
* XLM-R now passes common tests + Integration tests

* Correct mask index

* Model input names

* Style

* Remove text preprocessing

* Unneccessary import
2020-03-18 09:52:49 -04:00
Patrick von Platen
292186a3e7 Adding LM Head to Transfo-XL and first step to fixing problem with Adaptive Embeddings in TransfoXL (#3286)
* first commit

* work in progress

* make language generation task pass

* update to working version for LM

* delete print

* remove dead code

* make style
2020-03-18 09:24:27 -04:00
Patrick von Platen
efdb46b6e2 add link to blog post (#3326) 2020-03-18 13:24:28 +01:00
Patrick von Platen
ddb10c6447 improve doctstring (#3327) 2020-03-18 13:24:09 +01:00
Junyi_Li
d7f98cd3ef Init card for model 2020-03-18 07:55:27 -04:00
Sam Shleifer
38a555a83c Add Summarization to Pipelines (#3128)
* passing

* Undo stupid chg

* docs

* undo rename

* delete-cruft

* only import if you have torch

* Dont rely on dict ordering

* Fix dict ordering upstream

* docstring link

* docstring link

* remove trailing comma for 3.5 compat

* new name

* delegate kwarging

* Update kwargs
2020-03-17 18:04:21 -04:00
J.P Lee
2b60a26b46 Update examples/ner/run_ner.py to use AutoModel (#3305)
* Update examples/ner/run_ner.py to use AutoModel

* Fix missing code and apply `make style` command
2020-03-17 12:30:10 -04:00
Manuel Romero
e41212c715 Create model card for CodeBERTaPy (#3309) 2020-03-17 12:29:11 -04:00
Julien Chaumond
0f1bc0d68e [model_cards] Add google thumbnail 2020-03-17 12:02:51 -04:00
Nathan Raw
930c9412b4 [WIP] Lightning glue example (#3290)
*  Alter base pl transformer to use automodels

* 🐛 Add batch size env variable to function call

* 💄 Apply black code style from Makefile

* 🚚 Move lightning base out of ner directory

*  Add lightning glue example

* 💄 self

* move _feature_file to base class

*  Move eval logging to custom callback

* 💄 Apply black code style

* 🐛 Add parent to pythonpath, remove copy command

* 🐛 Add missing max_length kwarg
2020-03-17 11:46:42 -04:00
Patrick von Platen
e8f44af5bf [generate] do_sample default back to False (#3298)
* change do_samples back

* None better default as boolean

* adapt do_sample to True in test example

* make style
2020-03-17 10:52:37 -04:00
Thomas Wolf
2187c49f5c CPU/GPU memory benchmarking utilities - Remove support for python 3.5 (now only 3.6+) (#3186)
* memory benchmark rss

* have both forward pass and line-by-line mem tracing

* cleaned up tracing

* refactored and cleaning up API

* no f-strings yet...

* add GPU mem logging

* fix GPU memory monitoring

* style and quality

* clean up and doc

* update with comments

* Switching to python 3.6+

* fix quality
2020-03-17 10:17:11 -04:00
Jannes
bd3feddf67 Create README.md (#3306)
* Create README.md

* Updated README.md
2020-03-17 09:05:11 -04:00
Julien Chaumond
68ef0a111f [model_cards] Symlink all Google AI's BERT Miniatures to source model card 2020-03-16 23:37:42 -04:00
Sam Shleifer
b2c1a447fe [BART] Delete redundant unit test (#3302) 2020-03-16 23:09:10 -04:00
iuliaturc-google
b2028cc26b Add model card for Google AI's BERT Miniatures (#3301)
This model card is intended to be shared among all models under google/bert_uncased_*
(We'll need some support from HuggingFace to get this card cross-linked from all models)
2020-03-16 21:51:46 -04:00
Patrick von Platen
4759176313 add camembert for Question answering for examples 2020-03-16 14:42:11 -04:00
Sam Shleifer
11573231c6 [BART] generation_mode as a kwarg not a class attribute (#3278) 2020-03-16 12:47:53 -04:00
Manuel Romero
de697935a2 Create model card for spanbert-finetuned-squadv2 2020-03-16 12:32:46 -04:00
Manuel Romero
3ddd2029bc Create CodeBERTaJS model card 2020-03-16 12:23:01 -04:00
Julien Plu
879e1d3234 Add TF2 version of FlauBERT (#2700)
* Add TF2 version of FlauBERT

* Add TF2 version of FlauBERT

* Add documentation

* Apply style and quality

* Apply style once again

Co-authored-by: Lysandre Debut <lysandre@huggingface.co>
2020-03-16 09:29:21 -04:00
Patrick von Platen
af471ce5e8 Improved Error message when loading config/model with .from_pretrained() (#3247)
* better error message

* better error message

* update to model identifier instead of url

* update to model identifier instead of ur
2020-03-16 09:48:30 +01:00
Sam Shleifer
5ea8ba67b4 [BART] Remove unused kwargs (#3279)
* Remove unused kwargs
* dont call forward in tests
2020-03-15 23:00:44 -04:00
Thomas Wolf
3814e167d9 Merge pull request #3225 from patrickvonplaten/finalize_merge_bart_generate_into_default_generate
Complete merge Seq-2-Seq generation into default generation
2020-03-14 15:08:59 +01:00
Sam Shleifer
2bd79e23de [BART] FP16 testing fixes (#3266) 2020-03-13 19:48:26 -04:00
Julien Chaumond
8320feec09 [model_cards] CodeBERTa 2020-03-13 18:28:09 -04:00
Patrick von Platen
ab756f713c add gpt2-xl for tf 2020-03-13 16:40:35 -04:00
Patrick von Platen
4f75d380a4 make style 2020-03-13 16:35:52 +01:00
Patrick von Platen
c2ee3840ae update file to new starting token logic 2020-03-13 16:34:44 +01:00
Benjamin Muller
cc4c37952a Create camembert-base-README.md 2020-03-13 09:35:53 -04:00
dependabot[bot]
afea70c01c Bump psutil from 5.6.3 to 5.6.6 in /examples/distillation
Bumps [psutil](https://github.com/giampaolo/psutil) from 5.6.3 to 5.6.6.
- [Release notes](https://github.com/giampaolo/psutil/releases)
- [Changelog](https://github.com/giampaolo/psutil/blob/master/HISTORY.rst)
- [Commits](https://github.com/giampaolo/psutil/compare/release-5.6.3...release-5.6.6)

Signed-off-by: dependabot[bot] <support@github.com>
2020-03-12 21:14:56 -04:00
Sam Shleifer
087465b943 add BART to README (#3255) 2020-03-12 19:38:05 -04:00
Patrick von Platen
6a82f774f2 fix typo 2020-03-12 21:10:51 +01:00
Patrick von Platen
f1c71da115 fix eos_token_ids in test 2020-03-12 21:00:54 +01:00
Patrick von Platen
6047f46b19 re-add eos token to get good bart results 2020-03-12 20:17:50 +01:00
Patrick von Platen
c11160114a small clean-up 2020-03-12 20:02:35 +01:00
Sam Shleifer
2e81b9d8d7 Bart: update example for #3140 compatibility (#3233)
* Update bart example docs
2020-03-12 10:36:37 -04:00
Julien Chaumond
72768b6b9c [model_cards] polbert: simplify usage example with pipelines
Co-Authored-By: Darek Kłeczek <darek.kleczek@gmail.com>
2020-03-12 10:05:40 -04:00
Julien Chaumond
a4c75f1492 [ci] last resort 2020-03-11 19:11:19 -04:00
Julien Chaumond
824e320d96 [ci] Fixup c6cf925 2020-03-11 18:52:10 -04:00
Julien Chaumond
c6cf925ff8 [ci] last resort
while looking for fix to https://twitter.com/julien_c/status/1237864185821708291
2020-03-11 18:49:19 -04:00
Stefan Schweter
14e455b716 [model_cards] 🇹🇷 Add new (cased) DistilBERTurk model 2020-03-11 18:40:38 -04:00
Emily Alsentzer
f65f74bbce Create README.md (#3230) 2020-03-11 12:37:00 -04:00
Emily Alsentzer
324292cfc7 Add Bio+ Clinical BERT model card (#3229)
* Create README.md

* Update README.md
2020-03-11 12:36:33 -04:00
Julien Chaumond
e43afb1bb8 [model_cards] DialoGPT: How to use + thumbnail + conversational tag
cc @dreasysnail

Co-Authored-By: Patrick von Platen <patrick.v.platen@gmail.com>
2020-03-11 11:36:47 -04:00
Julien Chaumond
5085df995f [model_cards] PolBERT tweaks 2020-03-11 09:29:22 -04:00
Darek Kłeczek
19a63d8245 Create Readme.md model card (#3221) 2020-03-11 09:12:48 -04:00
Yizhe
dc848c2994 Create README.md 2020-03-11 09:10:57 -04:00
Yizhe
6ad221daf3 Create README.md 2020-03-11 09:10:23 -04:00
Yizhe
735180aa14 Create README.md 2020-03-11 09:10:13 -04:00
Manuel Romero
6c61c0801e Create README.md 2020-03-11 09:03:30 -04:00
Manuel Romero
235616686a Update README.md
- Update title
- Remove metrics
2020-03-11 09:03:20 -04:00
Manuel Romero
5bb00c817f Update README.md
Change title to clarify the model description
2020-03-11 09:03:07 -04:00
Manuel Romero
601e424750 Update README.md 2020-03-11 09:02:56 -04:00
Manuel Romero
1b9e765b21 Update README.md
- Remove metrics until tested on other xquad benchmarks
2020-03-11 09:02:18 -04:00
Thomas Wolf
db29ffc978 Merge pull request #3140 from patrickvonplaten/merge_bart_generate_into_default_generate
Merge bart generate into default generate
2020-03-11 13:21:53 +01:00
Patrick von Platen
ac303eae46 fix problem with half 2020-03-11 12:24:30 +01:00
Patrick von Platen
bc9d5d917c make all tensors half precision 2020-03-11 12:15:38 +01:00
Patrick von Platen
a332cc9f7f finalize generation merge 2020-03-11 11:53:36 +01:00
Patrick von Platen
1ba21f96ca fix bug in tf no_repeat_ngram_size 2020-03-11 11:06:56 +01:00
Patrick von Platen
d997ac7810 fix typo 2020-03-11 11:06:56 +01:00
Patrick von Platen
7351a8dbaf re-add scoring filtering 2020-03-11 11:06:56 +01:00
Patrick von Platen
9b8ee8cea0 delete print and make style 2020-03-11 11:06:56 +01:00
Patrick von Platen
ca1330f0b2 do not mess with the negative sign 2020-03-11 11:06:56 +01:00
Patrick von Platen
10989715d0 rename variable 2020-03-11 11:06:56 +01:00
Patrick von Platen
cf06290565 remove ipdb 2020-03-11 11:06:56 +01:00
Patrick von Platen
374deef48d fixed typo 2020-03-11 11:06:56 +01:00
Patrick von Platen
a2c8e516c2 fix torch to tf translation 2020-03-11 11:06:56 +01:00
Patrick von Platen
ca2047bc35 refactor variable naming and improve tf generate in line with torch generate 2020-03-11 11:06:56 +01:00
patrickvonplaten
41b437ea3a add draft version of propsoed changes for ROGUE score 2020-03-11 11:06:56 +01:00
patrickvonplaten
a5751f7578 fix bug with attention_mask as optional input argument 2020-03-11 11:06:56 +01:00
patrickvonplaten
629aac92ec do not allow do_sample and weird force bos token things 2020-03-11 11:06:56 +01:00
patrickvonplaten
d880a5fbde finalized PR 2020-03-11 11:06:56 +01:00
patrickvonplaten
2acfe63964 best current version and make style 2020-03-11 11:06:56 +01:00
patrickvonplaten
c62444da39 fix conflicts 2020-03-11 11:06:56 +01:00
Patrick von Platen
77e6775065 add current changes 2020-03-11 11:06:56 +01:00
Patrick von Platen
333affcb81 add current changes 2020-03-11 11:06:56 +01:00
Patrick von Platen
421216997b comment out stuff 2020-03-11 11:06:56 +01:00
Patrick von Platen
7a11e925cf work in progress 2020-03-11 11:06:56 +01:00
Patrick von Platen
5b3000d933 renamed min_len to min_length 2020-03-11 11:06:56 +01:00
Patrick von Platen
aceb3fbaf4 only do output_past=True for language generation in bart 2020-03-11 11:06:56 +01:00
Patrick von Platen
7cba11fb9b better naming 2020-03-11 11:06:56 +01:00
Patrick von Platen
ff648221bd fix conflicts 2020-03-11 11:06:56 +01:00
Patrick von Platen
c0d9dd3ba9 refactored code a bit and made more generic 2020-03-11 11:06:56 +01:00
Patrick von Platen
d8e2b3c547 fix conflicts 2020-03-11 11:06:56 +01:00
Julien Chaumond
d6de6423ba [doc] --organization tweak
Co-Authored-By: Thomas Wolf <thomwolf@users.noreply.github.com>
2020-03-10 16:52:44 -04:00
Julien Chaumond
0e56dc3078 [doc] Document the new --organization flag of CLI 2020-03-10 16:42:01 -04:00
Julien Chaumond
270dfa1c8e [dialogpt] conversion script
Reference: https://github.com/huggingface/transformers/pull/1778#issuecomment-567675530

cc @patrickvonplaten and @dreasysnail
2020-03-10 15:09:29 -04:00
Manuel Romero
2661d80687 Update README.md
- Clarify that the model is not trained on the evaluation dataset
2020-03-10 10:59:34 -04:00
Manuel Romero
6a13448ad2 Update README.md
- Fix path of tokenizer
- Clarify that the model is not trained on the evaluation set
2020-03-10 10:59:16 -04:00
Manuel Romero
e57533cca5 Create README.md 2020-03-10 10:58:13 -04:00
Patrick von Platen
31f2437f07 Merge pull request #3191 from patrickvonplaten/add_integration_tests_lm_generate_torch_tf
Add integration tests lm generate torch tf
2020-03-10 11:29:17 +01:00
Shubham Agarwal
5ca356a464 NER - pl example (#3180)
* 1. seqeval required by ner pl example. install from examples/requirements. 2. unrecognized arguments: save_steps

* pl checkpoint callback filenotfound error: make directory and pass

* #3159 pl checkpoint path difference

* 1. Updated Readme for pl 2. pl script now also correct displays logs 3. pass gpu ids compared to number of gpus

* Updated results in readme

* 1. updated readme 2. removing deprecated pl methods 3. finalizing scripts

* comment length check

* using deprecated validation_end for stable results

* style related changes
2020-03-09 20:43:38 -04:00
Travis McGuire
f51ba059b9 Model card for albert-base-v2-squad2 2020-03-09 19:37:15 -04:00
Julien Chaumond
cbf8f5d32b [model upload] Support for organizations 2020-03-09 17:33:57 -04:00
Lysandre
525b6b1c54 TFQA pipeline marked as slow test 2020-03-09 16:52:30 -04:00
Sam Shleifer
3aca02efb3 Bart example: model.to(device) (#3194) 2020-03-09 15:09:35 -04:00
Lysandre Debut
5164ea91a7 Skipping outputs (#3116)
* Minimal example

* Proposal 2

* Proposal 2 for fast tokenizers

* Typings

* Docs

* Revert "Docs" for easier review

This reverts commit eaf0f97062e809887704a542144c537f769d5223.

* Remove unnecessary assignments

* Tests

* Fix faulty type

* Remove prints

* return_outputs -> model_input_names

* Revert "Revert "Docs" for easier review"

This reverts commit 6fdc69408102bf695797f2dfddbb6350c6b9e722.

* code quality
2020-03-09 13:48:58 -04:00
Patrick von Platen
49debe62fd Merge pull request #3190 from patrickvonplaten/fix_repetition_penalty_in_tf_generate
fix repetition penalty mask in tf
2020-03-09 16:29:57 +01:00
Patrick von Platen
847d370301 fix typo 2020-03-09 16:18:29 +01:00
Lysandre
eb3e6cb04f cased -> uncased in BERT SQuAD example
closes #3183
2020-03-09 10:54:18 -04:00
Patrick von Platen
9050ffe035 delete w! -> need to be more careful with vim 2020-03-09 15:43:12 +01:00
Patrick von Platen
efb619235c add print statement to avoid code quality problem 2020-03-09 15:31:21 +01:00
Patrick von Platen
b12541c4dc test ctrl 2020-03-09 13:58:01 +00:00
Patrick von Platen
3e624c64ca fix repetition penalty mask in tf 2020-03-09 14:55:11 +01:00
Patrick von Platen
b73dd1a0e4 fix typo in test xlm tf 2020-03-09 11:34:31 +01:00
Patrick von Platen
4620caa864 fix if use lang embeddings in tf xlm 2020-03-09 11:18:54 +01:00
patrickvonplaten
fbd02d4693 fixed all tests, still need to check ctrl tf and pt and xlm tf 2020-03-08 21:45:55 +01:00
patrickvonplaten
b4a3a64744 fix xlnet & transfotests 2020-03-08 16:25:03 +01:00
Param bhavsar
b29fed790b Updated Tokenw ise in print statement to Token wise 2020-03-08 10:55:30 -04:00
patrickvonplaten
66c827656f fix typo in test gpt2 2020-03-08 15:35:08 +01:00
patrickvonplaten
314bdc7c14 fix typo in test 2020-03-08 15:34:20 +01:00
patrickvonplaten
575976144a updated all tests 2020-03-08 15:29:10 +01:00
Julien Chaumond
e03129ad44 [model_cards] Small formatting fix
cc @mrm8488
2020-03-06 18:07:44 -05:00
Julien Chaumond
08a70fb392 [model_cards] Fixup d6df9a8f 2020-03-06 17:50:38 -05:00
Lysandre Debut
0ae91c80aa Change back pipeline signatures (#3105)
* Change back pipeline signatures

* String types for non-imported objects
2020-03-06 17:26:18 -05:00
voidful
d6df9a8ffe [model_cards]Add albert chinese model(tiny,small,base,large,xlarge,xxlarge) 2020-03-06 17:23:31 -05:00
Manuel Romero
c52716d46c Create README.md 2020-03-06 17:20:19 -05:00
Aleksei Lymar
73a0c25376 remove excess line breaks in DeepPavlov model cards 2020-03-06 17:19:35 -05:00
Sam Shleifer
ed37f9fa4f [Bart] _prepare_decoder_inputs should use large negative (#3158) 2020-03-06 16:06:36 -05:00
Thomas Wolf
0416d437fb Merge pull request #3148 from patrickvonplaten/refactoring_beam_search_for_tf_2
refactored beam search according to torch implementation
2020-03-06 22:01:46 +01:00
Funtowicz Morgan
db9279dedb Fix QA models binding for Flaubert, XLNet and XLM. (#3100)
Signed-off-by: Morgan Funtowicz <morgan@huggingface.co>

Format & quality

Signed-off-by: Morgan Funtowicz <morgan@huggingface.co>

Again.

Signed-off-by: Morgan Funtowicz <morgan@huggingface.co>
2020-03-06 13:04:29 -05:00
Sam Shleifer
e58b3ec5df add imports to examples (#3160) 2020-03-06 11:15:33 -05:00
Thomas Wolf
6ffe03a0a1 Merge pull request #3137 from tomhosking/bart-refactor
Refactor BartModel so that input checks are handled within enc/dec
2020-03-06 13:06:34 +01:00
Thomas Wolf
3e5da38dae Merge pull request #3132 from huggingface/hf_api_model_list
[hf_api] Get the public list of all the models on huggingface
2020-03-06 13:05:52 +01:00
Thomas Wolf
9499a3778e Merge pull request #3103 from gthb/keras-serialization
Support keras JSON/HDF5 serialization of main layers
2020-03-06 12:59:13 +01:00
patrickvonplaten
9362eb4a07 refactored beam search according to torch implementation 2020-03-06 00:46:29 +01:00
Patrick von Platen
c8035e11e8 Merge pull request #3149 from patrickvonplaten/fix_renaming_error
fix missed BartForMaskedLM renaming
2020-03-06 00:45:08 +01:00
patrickvonplaten
58fc8f97a3 fix renaming problem 2020-03-06 00:35:47 +01:00
Sam Shleifer
857e0a0d3b Rename BartForMaskedLM -> BartForConditionalGeneration (#3114)
* improved documentation
2020-03-05 17:41:18 -05:00
Lysandre Debut
fa2aa699da Merge pull request #3011 from patrickvonplaten/add_models_special_tokens_to_specific_configs
Add models special tokens to its pretrained configs
2020-03-05 17:26:48 -05:00
Lysandre Debut
146c521235 Merge branch 'master' into add_models_special_tokens_to_specific_configs 2020-03-05 17:24:42 -05:00
Lysandre Debut
b623ddc000 Pass kwargs to configuration (#3147)
* Pass kwargs to configuration

* Setter

* test
2020-03-05 17:16:57 -05:00
Lysandre Debut
0001d05686 Correct missing keys + test (#3143) 2020-03-05 17:01:54 -05:00
Thomas Wolf
1741d740f2 Merge pull request #3145 from sshleifer/bartfp16
[Bart] FP16 Support
2020-03-05 22:14:35 +01:00
Thomas Wolf
bbabbc1613 Merge pull request #3135 from patrickvonplaten/refactor_beam_search_generate
Refactoring and bug fixing beam search generate
2020-03-05 22:12:56 +01:00
sshleifer
14d40584b2 remove newline 2020-03-05 13:06:35 -05:00
sshleifer
1360dacaa3 cleanup deltas 2020-03-05 12:57:42 -05:00
sshleifer
810079de1f no ipdb 2020-03-05 12:48:14 -05:00
sshleifer
c203509d5b undo chg 2020-03-05 12:34:08 -05:00
sshleifer
c36fdc88d4 tests pass 2020-03-05 12:33:08 -05:00
Morgan Funtowicz
7ac47bfe69 Updated notebook dependencies for Colab.
Signed-off-by: Morgan Funtowicz <morgan@huggingface.co>
2020-03-05 16:07:51 +01:00
Morgan Funtowicz
be02176a4b Fixing sentiment pipeline in 03-pipelines notebook.
Signed-off-by: Morgan Funtowicz <morgan@huggingface.co>
2020-03-05 16:07:51 +01:00
Julien Chaumond
8a2d9bc9ef Add model cards for DeepPavlov models (#3138)
* add empty model cards for every current DeepPavlov model

* fix: replace cyrillic `с` with `c`

* docs: add model cards for current DeepPavlov BERT models

* docs: add links for arXiv preprints
2020-03-05 09:34:43 -05:00
Morgan Funtowicz
012cbdb0f5 Updating colab links in notebooks README.
Signed-off-by: Morgan Funtowicz <morgan@huggingface.co>
2020-03-05 15:34:15 +01:00
Tom Hosking
31acb8dc52 Remove rogue .DS_Store 2020-03-05 13:51:30 +00:00
Tom Hosking
06a6cb6f36 Refactor BartModel so that input checks are handled within BartEncoder and BartDecoder 2020-03-05 13:45:41 +00:00
Patrick von Platen
e33ed12c3b uncomment expression 2020-03-05 13:41:04 +01:00
Patrick von Platen
4220fd52b9 remove ipdb 2020-03-05 13:36:21 +01:00
Patrick von Platen
c47394b0c9 refactoring and bug fixing beam search generate 2020-03-05 13:12:50 +01:00
Gunnlaugur Thor Briem
4c91a3af94 Document keras_serializable decorator 2020-03-05 11:48:10 +00:00
Gunnlaugur Thor Briem
4be01e5cbf Use name transformers_config in Keras serialization
Be explicit that this is config for the transformers package (as these
layers may coexist with other custom stuff in a Keras model, plus the
Keras container itself is called config, and config["config"] is not
great)

Add explicit error handling for initializer calls that have neither
the `config` nor the `transformers_config` argument, or have both.
2020-03-05 11:47:35 +00:00
Gunnlaugur Thor Briem
a355f4f0fc Add functools.wraps for wrapper initializer
Preserve the original initializer function's metadata. See
https://docs.python.org/3/library/functools.html#functools.update_wrapper
2020-03-05 11:18:50 +00:00
Gunnlaugur Thor Briem
d262a5d48e fix: remove unused import 2020-03-05 11:05:29 +00:00
Morgan Funtowicz
30624f7056 Fix Colab links + install dependencies first.
Signed-off-by: Morgan Funtowicz <morgan@huggingface.co>
2020-03-05 11:40:15 +01:00
Julien Chaumond
3f067f4409 [hf_api] slightly more doc 2020-03-04 23:55:46 -05:00
Julien Chaumond
f564f93c84 [hf_api] Get the public list of all the models on huggingface 2020-03-04 23:33:09 -05:00
Julien Chaumond
ff9e79ba3a make style 2020-03-04 20:18:07 -05:00
Lysandre
07a79db505 Fix failing doc samples 2020-03-04 19:11:31 -05:00
Gunnlaugur Thor Briem
4f338ed407 Explicit config_class instead of module inspection 2020-03-04 23:45:29 +00:00
Gunnlaugur Thor Briem
6fe1cc0874 fix: clean up inadvertent change in tf_t5
This was the beginnings of an attempt to address the test failure on
this layer, and instead I backed out of making this layer
keras-serializable at all ... so it was a mistake to commit this.
2020-03-04 23:24:15 +00:00
Thomas Wolf
bdd3d0c76d Merge pull request #3118 from patrickvonplaten/add_beam_search_to_generation_tf_2_0
Add beam search to generation tf 2 0
2020-03-04 23:28:00 +01:00
Julien Chaumond
c440030e99 [model_cards] Tag AR model languages 2020-03-04 16:33:10 -05:00
Thomas Wolf
3b7f95a506 Merge pull request #3115 from gthb/fix-bogus-param-to-layer-init
fix: passing config as Layer trainable param
2020-03-04 21:59:09 +01:00
Morgan Funtowicz
1bca97ec7f Update notebook link and fix few working issues.
Signed-off-by: Morgan Funtowicz <morgan@huggingface.co>
2020-03-04 21:19:33 +01:00
Manuel Romero
189113d891 Create README.md 2020-03-04 13:57:23 -05:00
Julien Chaumond
76111a3d3a [model_cards] Add card by @lvwerra
(the current way to submit a model card to have it displayed on the website is to open a PR on the `transformers` repo itself)

Thanks for sharing!
2020-03-04 12:55:20 -05:00
Julien Chaumond
a43c388abb [model_cards] Add card by @djstrong
(the current way to submit a model card to have it displayed on the website is to open a PR on the `transformers` repo itself)

Thanks for sharing!
2020-03-04 12:53:02 -05:00
Manuel Romero
ec60e0ae7a Create README.md 2020-03-04 12:06:05 -05:00
Wissam Antoun
6a143bf282 model cards for both aubmindlab/bert-base-arabert models (#3113)
* Added readme for AraBERTv0.1

* Added readme to AraBERT
2020-03-04 12:04:39 -05:00
Patrick von Platen
932eab943d include tf gpt2 tests for attn mask and past variable (#3122) 2020-03-04 12:03:46 -05:00
Julien Chaumond
256cbbc4a2 [doc] Fix link to how-to-train Colab 2020-03-04 12:01:45 -05:00
Patrick von Platen
006097f8ad rename variables named 'word' to 'token' in generate fn (#3119)
* fix conflits

* fixed naming bug

* make style
2020-03-04 12:01:17 -05:00
Gunnlaugur Thor Briem
18f4b9274f fix: work with Tensorflow < 2.1.0
tf.keras.utils.register_keras_serializable was added in TF 2.1.0, so
don't rely on it being there; just decorate the class with it if it
exists.
2020-03-04 16:57:29 +00:00
Funtowicz Morgan
71c8711970 Adding Docker images for transformers + notebooks (#3051)
* Added transformers-pytorch-cpu and gpu Docker images

Signed-off-by: Morgan Funtowicz <morgan@huggingface.co>

* Added automatic jupyter launch for Docker image.

Signed-off-by: Morgan Funtowicz <morgan@huggingface.co>

* Move image from alpine to Ubuntu to align with NVidia container images.

Signed-off-by: Morgan Funtowicz <morgan@huggingface.co>

* Added TRANSFORMERS_VERSION argument to Dockerfile.

Signed-off-by: Morgan Funtowicz <morgan@huggingface.co>

* Added Pytorch-GPU based Docker image

Signed-off-by: Morgan Funtowicz <morgan@huggingface.co>

* Added Tensorflow images.

Signed-off-by: Morgan Funtowicz <morgan@huggingface.co>

* Use python 3.7 as Tensorflow doesnt provide 3.8 compatible wheel.

Signed-off-by: Morgan Funtowicz <morgan@huggingface.co>

* Remove double FROM instructions on transformers-pytorch-cpu image.

Signed-off-by: Morgan Funtowicz <morgan@huggingface.co>

* Added transformers-tensorflow-gpu Docker image.

Signed-off-by: Morgan Funtowicz <morgan@huggingface.co>

* use the correct ubuntu version for tensorflow-gpu

Signed-off-by: Morgan Funtowicz <morgan@huggingface.co>

* Added pipelines example notebook

Signed-off-by: Morgan Funtowicz <morgan@huggingface.co>

* Added transformers-cpu and transformers-gpu (including both PyTorch and TensorFlow) images.

Signed-off-by: Morgan Funtowicz <morgan@huggingface.co>

* Docker images doesnt start jupyter notebook by default.

Signed-off-by: Morgan Funtowicz <morgan@huggingface.co>

* Tokenizers notebook

Signed-off-by: Morgan Funtowicz <morgan@huggingface.co>

* Update images links

Signed-off-by: Morgan Funtowicz <morgan@huggingface.co>

* Update Docker images to python 3.7.6 and transformers 2.5.1

Signed-off-by: Morgan Funtowicz <morgan@huggingface.co>

* Added 02-transformers notebook.

Signed-off-by: Morgan Funtowicz <morgan@huggingface.co>

* Trying to realign 02-transformers notebook ?

Signed-off-by: Morgan Funtowicz <morgan@huggingface.co>

* Added Transformer image schema

* Some tweaks on tokenizers notebook

* Removed old notebooks.

Signed-off-by: Morgan Funtowicz <morgan@huggingface.co>

* Attempt to provide table of content for each notebooks

Signed-off-by: Morgan Funtowicz <morgan@huggingface.co>

* Second attempt.

Signed-off-by: Morgan Funtowicz <morgan@huggingface.co>

* Reintroduce transformer image.

Signed-off-by: Morgan Funtowicz <morgan@huggingface.co>

* Keep trying

Signed-off-by: Morgan Funtowicz <morgan@huggingface.co>

* It's going to fly !

Signed-off-by: Morgan Funtowicz <morgan@huggingface.co>

* Remaining of the Table of Content

Signed-off-by: Morgan Funtowicz <morgan@huggingface.co>

* Fix inlined elements for the table of content

Signed-off-by: Morgan Funtowicz <morgan@huggingface.co>

* Removed anaconda dependencies for Docker images.

Signed-off-by: Morgan Funtowicz <morgan@huggingface.co>

* Removing notebooks ToC

Signed-off-by: Morgan Funtowicz <morgan@huggingface.co>

* Added LABEL to each docker image.

Signed-off-by: Morgan Funtowicz <morgan@huggingface.co>

* Removed old Dockerfile

Signed-off-by: Morgan Funtowicz <morgan@huggingface.co>

* Directly use the context and include transformers from here.

Signed-off-by: Morgan Funtowicz <morgan@huggingface.co>

* Reduce overall size of compiled Docker images.

Signed-off-by: Morgan Funtowicz <morgan@huggingface.co>

* Install jupyter by default and use CMD for easier launching of the images.

Signed-off-by: Morgan Funtowicz <morgan@huggingface.co>

* Reduce number of layers in the images.

Signed-off-by: Morgan Funtowicz <morgan@huggingface.co>

* Added README.md for notebooks.

Signed-off-by: Morgan Funtowicz <morgan@huggingface.co>

* Fix notebooks link in README

Signed-off-by: Morgan Funtowicz <morgan@huggingface.co>

* Fix some wording issues.

Signed-off-by: Morgan Funtowicz <morgan@huggingface.co>

* Added blog notebooks too.

Signed-off-by: Morgan Funtowicz <morgan@huggingface.co>

* Addressing spelling errors in review comments.

Signed-off-by: Morgan Funtowicz <morgan@huggingface.co>

Co-authored-by: MOI Anthony <xn1t0x@gmail.com>
2020-03-04 11:45:57 -05:00
Patrick von Platen
7a89a3e493 correct beam search sampling 2020-03-04 17:27:47 +01:00
Patrick von Platen
c4c4c9998a make GPT2 and CTRL shape consistent between torch and TF 2020-03-04 17:27:47 +01:00
patrickvonplaten
2529b2d37e set redorder past sort dimension to its default 2020-03-04 17:27:47 +01:00
patrickvonplaten
61fef6e957 added beam_search generation for tf 2.0 2020-03-04 17:27:47 +01:00
Patrick von Platen
34de670dbe fix sklearn release circle ci [temporary] (#3123) 2020-03-04 11:25:23 -05:00
Patrick von Platen
6701fb7859 fix beam_search behavior when sampling (#3106)
* fix beam_search behavior when sampling

* delete print

* make correct style
2020-03-04 09:30:51 -05:00
Gunnlaugur Thor Briem
b1116fd673 fix: passing config as Layer trainable param
Lurking bugs discovered while working on other stuff.
2020-03-03 23:05:40 +00:00
Gunnlaugur Thor Briem
96c4990165 fix unused imports and style 2020-03-03 22:57:05 +00:00
Gunnlaugur Thor Briem
470753bcf5 Put @keras_serializable only on layers it works on
And only run the test on TF*MainLayer classes so marked.
2020-03-03 22:44:45 +00:00
Gunnlaugur Thor Briem
0c716ede8c Use class decorator instead of superclass
When supplied by Keras deserialization, the config parameter to initializers
will be a dict. So intercept it and convert to PretrainedConfig object (and
store in instance attribute for get_config to get at it) before passing to the
actual initializer. To accomplish this, and repeat as little code as possible,
use a class decorator on TF*MainLayer classes.
2020-03-03 22:31:42 +00:00
Sam Shleifer
e9e6efdc45 BartForSequenceClassification: fix num_labels, add test (#3110) 2020-03-03 15:54:29 -05:00
Julien Chaumond
f631e01d2c [ci] Re-run integration ground truth from fairseq
Adopted best practice set by @patrickvonplaten of commenting lines run on fairseq, for easy comparison

also see #3020
2020-03-03 15:31:40 -05:00
Sam Shleifer
5b396457e5 Summarization Examples: add Bart CNN Evaluation (#3082)
* Rename and improve example

* Add test

* slightly faster test

* style

* This breaks remy prolly

* shorter test string

* no slow

* newdir structure

* New tree

* Style

* shorter

* docs

* clean

* Attempt future import

* more import hax
2020-03-03 15:29:59 -05:00
Sam Shleifer
5c5af879b6 [Bart] dont call .forward (#3094) 2020-03-03 15:14:12 -05:00
Gunnlaugur Thor Briem
b8da16f390 Add (failing) tests for Keras save/load 2020-03-03 15:22:34 +00:00
Gunnlaugur Thor Briem
ba28170717 Support keras JSON/HDF5 serialization of main layers
Fixes #3101
2020-03-03 15:21:41 +00:00
Julien Chaumond
a088d75e51 [model_cards] Fix incorrect path 2020-03-03 09:52:32 -05:00
Patrick von Platen
4134100363 Add generate() functionality to TF 2.0 (#3063)
* add first copy past test to tf 2 generate

* add tf top_k_top_p_filter fn

* add generate function for TF

* add generate function for TF

* implemented generate for all models expect transfoXL

* implemented generate for all models expect transfoXL

* implemented generate for all models expect transfoXL

* make style

* change permission of test file to correct ones

* delete ipdb

* delete ipdb

* fix bug and finish simple gpt2 integration test

* clean test file

* clean test file

* make style

* make style

* make style

* make style

* change import style

* change import style

* make style

* make style

* add decorators

* add decorators

* fix tf ctrl bug dim => axis in TF

* make style

* make style

* refactored test file

* refactored test file

* take out test_torch_tf_conversion if nothing is defined

* take out test_torch_tf_conversion if nothing is defined

* remove useless files

* remove useless files

* fix conflicts

* fix conflicts

* fix conflicts

* fix conflicts

* fix conflicts

* solve conflicts

* solve conflicts

* fix conflicts

* fix conflicts

* merge conflicts

* delete ipdb

* exposed top_k_top_p_filtering fns

* delete weirdly created w! file

* add comment to test tf common modeling

* fix conflicts

* fix conflicts

* make style

* merge conflicts

* make style

* change tf.tensor.shape to shape_list(tensor)
2020-03-03 09:42:15 -05:00
ali safaya
b31f715019 bert-base-arabic model card 2020-03-03 09:29:28 -05:00
Davide Fiocco
c0c7ec3458 Don't crash if fine-tuned model doesn't end with a number (#3099)
That's the same fix applied in https://github.com/huggingface/transformers/issues/2258 , but for the GLUE example
2020-03-03 08:59:47 -05:00
Julien Chaumond
eec5ec8071 [BART] to each its own config + make BART compatible w/ Pipelines
cc @sshleifer
2020-03-02 18:56:17 -05:00
Felix MIKAELIAN
6b1558bad8 add models cards for camembert-base-fquad camembert-base-squad (#3089)
* add models cards for camembert-base-fquad camembert-base-squad

* typo fix
2020-03-02 17:07:13 -05:00
Julien Chaumond
f169957d0c TF GPU CI (#3085)
* debug env

* Restrict TF GPU memory

* Fixup

* One more test

* rm debug logs

* Fixup
2020-03-02 15:45:25 -05:00
Lysandre Debut
d3eb7d23a4 Pipeline doc (#3055)
* Pipeline doc initial commit

* pipeline abstraction

* Remove modelcard argument from pipeline

* Task-specific pipelines can be instantiated with no model or tokenizer

* All pipelines doc
2020-03-02 14:07:10 -05:00
Manuel Romero
2c7749784c Update README.md
- Add example of usage
- Update metrics
2020-03-02 13:35:34 -05:00
Julien Chaumond
0e56b37e80 rm bogus file
cc @patrickvonplaten
2020-03-02 12:27:12 -05:00
Patrick von Platen
2fdc7f6ce8 correct greedy generation when doing beam search (#3078)
* correct greedy generation when doing beam search

* improve comment
2020-03-02 12:00:09 -05:00
Julien Chaumond
13afb71208 [ci] Ensure that TF does not preempt all GPU memory for itself
see https://www.tensorflow.org/guide/gpu#limiting_gpu_memory_growth

Co-Authored-By: Funtowicz Morgan <mfuntowicz@users.noreply.github.com>
Co-Authored-By: Lysandre Debut <lysandre.debut@reseau.eseo.fr>
2020-03-02 11:56:45 -05:00
Patrick von Platen
c0135194eb Force pad_token_id to be set before padding for standard tokenizer (#3035)
* force pad_token_id to be set before padding

* fix tests and forbid padding without having a padding_token_id set
2020-03-02 10:53:55 -05:00
Sam Shleifer
b54ef78d0c Bart-CNN (#3059)
`generate` code that produces 99% identical summarizations to fairseq on CNN test data, with caching.
2020-03-02 10:35:53 -05:00
Victor SANH
6b1ff25084 fix n_gpu count when no_cuda flag is activated (#3077)
* fix n_gpu count when no_cuda flag is activated

* someone was left behind
2020-03-02 10:20:21 -05:00
Julien Chaumond
298bed16a8 make style 2020-03-01 14:08:01 -05:00
VictorSanh
852e032ca6 include roberta in run_squad_w_distillation - cc @graviraja 2020-03-01 01:56:50 +00:00
VictorSanh
b5509abb36 --do_lower_case will always trick me... 2020-03-01 01:39:24 +00:00
Julien Chaumond
d6ef587a10 [ci] Fixup e36bd94345 2020-02-28 23:19:17 -05:00
Julien Chaumond
e36bd94345 [ci] Run all tests on (self-hosted) GPU (#3020)
* Create self-hosted.yml

* Update self-hosted.yml

* Update self-hosted.yml

* Update self-hosted.yml

* Update self-hosted.yml

* Update self-hosted.yml

* do not run slow tests, for now

* [ci] For comparison with circleci, let's also run CPU-tests

* [ci] reorganize

* clearer filenames

* [ci] Final tweaks before merging

* rm slow tests on circle ci

* Trigger CI

* On GPU this concurrency was way too high
2020-02-28 21:11:08 -05:00
srush
908fa43b54 Changes to NER examples for PLT and TPU (#3053)
* changes to allow for tpu training

* black

* tpu

* tpu
2020-02-27 16:45:32 -05:00
Lysandre Debut
8bcb37bfb8 NER support for Albert in run_ner.py and NerPipeline (#2983)
* * Added support for Albert when fine-tuning for NER

* Added support for Albert in NER pipeline

* Added command-line options to examples/ner/run_ner.py to better control tokenization

* Added class AlbertForTokenClassification

* Changed output for NerPipeline to use .convert_ids_to_tokens(...) instead of .decode(...) to better reflect tokens

* Added ,

* Now passes style guide enforcement

* Changes from reviews.

* Code now passes style enforcement

* Added test for AlbertForTokenClassification

* Added test for AlbertForTokenClassification
2020-02-27 10:22:55 -05:00
Sam Shleifer
6a37588041 spelling: strictly (#3042) 2020-02-27 10:22:35 -05:00
Cola
f4ff44a6d9 Fix batch_encode_plus (#3041) 2020-02-27 09:56:47 -05:00
Martin Malmsten
f71157529e Added test for AlbertForTokenClassification 2020-02-27 12:24:20 +01:00
Martin Malmsten
aceb6a0907 Added test for AlbertForTokenClassification 2020-02-27 11:52:46 +01:00
Martin Malmsten
d762d4289c Code now passes style enforcement 2020-02-26 23:50:40 +01:00
Martin Malmsten
9495d38b0d Changes from reviews. 2020-02-26 23:36:39 +01:00
Julien Chaumond
b370cc7e99 [gpu] Fixup fdd61b1992 2020-02-26 21:48:49 +00:00
Julien Chaumond
f5516805c2 Fix bart slow test 2020-02-26 20:47:49 +00:00
Andrew Walker
5bc99e7f33 fix several typos in Distil* readme (#3034) 2020-02-26 12:39:54 -05:00
Patrick von Platen
fdd61b1992 Fix attn mask gpt2 when using past (#3033)
* fix issue and add some tests

* fix issue and add some tests

* updated doc string gpt2
2020-02-26 12:04:37 -05:00
Julien Chaumond
9cda3620b6 Fix (non-slow) tests on GPU (torch) (#3024)
* Fix tests on GPU (torch)

* Fix bart slow tests

Co-authored-by: Sam Shleifer <sshleifer@gmail.com>
2020-02-26 11:59:25 -05:00
Sam Shleifer
9df74b8bc4 Delete all mentions of Model2Model (#3019) 2020-02-26 11:36:27 -05:00
Lysandre Debut
bb7c468520 Documentation (#2989)
* All Tokenizers

BertTokenizer + few fixes
RobertaTokenizer
OpenAIGPTTokenizer + Fixes
GPT2Tokenizer + fixes
TransfoXLTokenizer
Correct rst for TransformerXL
XLMTokenizer + fixes
XLNet Tokenizer + Style
DistilBERT + Fix XLNet RST
CTRLTokenizer
CamemBERT Tokenizer
FlaubertTokenizer
XLMRobertaTokenizer
cleanup

* cleanup
2020-02-25 18:43:36 -05:00
Patrick von Platen
c913eb9c38 Add integration tests for xlm roberta modelling and xlm roberta tokenzier (#3014)
* add first files

* add xlm roberta integration tests

* make style

* flake 8 issues solved
2020-02-25 16:51:25 -05:00
srush
e8ce63ff21 Change masking to direct labeling for TPU support. (#2982)
* change masking to direct labelings

* fix black

* switch to ignore index

* .

* fix black
2020-02-25 14:47:43 -05:00
Jhuo IH
7a7ee28cb9 missing ner link (#2967) 2020-02-25 14:06:57 -05:00
Lysandre Debut
65e7c90a77 Adding usage examples for common tasks (#2850)
* Usage: Sequence Classification & Question Answering

* Pipeline example

* Language modeling

* TensorFlow code for Sequence classification

* Custom TF/PT toggler in docs

* QA + LM for TensorFlow

* Finish Usage for both PyTorch and TensorFlow

* Addressing Julien's comments

* More assertive

* cleanup

* Favicon
- added favicon option in conf.py along with the favicon image
- udpated 🤗 logo. slightly smaller and should appear more consistent across editing programs (no more tongue on the outside of the mouth)

Co-authored-by: joshchagani <joshua@joshuachagani.com>
2020-02-25 13:48:24 -05:00
Patrick von Platen
f5b50c6b8e make style 2020-02-25 16:41:54 +01:00
Patrick von Platen
ec16142ee5 add special tokens to pretrain configs of respective lm head models 2020-02-25 16:37:59 +01:00
Patrick von Platen
e645dcbb70 add special tokens to pretrain configs of respective lm head models 2020-02-25 16:37:56 +01:00
Julien Chaumond
e693cd1e87 [ci] Run slow tests every day 2020-02-24 19:54:47 -05:00
Julien Chaumond
4fc63151af [ci] Attempt to fix #2844 2020-02-24 19:51:34 -05:00
Lysandre Debut
b90745c590 Test correct tokenizers after default switch (#3003) 2020-02-24 18:45:53 -05:00
Lysandre Debut
3716c3d8af False by default (#3002) 2020-02-24 18:30:57 -05:00
Lysandre
f9ec5ca90b Release: v2.5.1 2020-02-24 18:22:54 -05:00
Funtowicz Morgan
4cd9c0971c Fix for fast tokenizers save_pretrained compatibility with Python. (#2933)
* Renamed file generate by tokenizers when calling save_pretrained to match python.

Signed-off-by: Morgan Funtowicz <morgan@huggingface.co>

* Added save_vocabulary tests.

Signed-off-by: Morgan Funtowicz <morgan@huggingface.co>

* Remove python quick and dirty fix for clean Rust impl.

Signed-off-by: Morgan Funtowicz <morgan@huggingface.co>

* Bump tokenizers dependency to 0.5.1

Signed-off-by: Morgan Funtowicz <morgan@huggingface.co>

* TransfoXLTokenizerFast uses a json vocabulary file + warning about incompatibility between Python and Rust

Signed-off-by: Morgan Funtowicz <morgan@huggingface.co>

* Added some save_pretrained / from_pretrained unittests.

Signed-off-by: Morgan Funtowicz <morgan@huggingface.co>

* Update tokenizers to 0.5.2

Signed-off-by: Morgan Funtowicz <morgan@huggingface.co>

* Quality and format.

Signed-off-by: Morgan Funtowicz <morgan@huggingface.co>

* flake8

Signed-off-by: Morgan Funtowicz <morgan@huggingface.co>

* Making sure there is really a bug in unittest

* Fix TransfoXL constructor vocab_file / pretrained_vocab_file mixin.

Signed-off-by: Morgan Funtowicz <morgan@huggingface.co>
2020-02-24 18:20:42 -05:00
Sandro Cavallari
ee60840ee6 fix _update_memory fn call in transformer-xl (#2971) 2020-02-24 17:50:24 -05:00
Patrick von Platen
6a50d501ec add explaining example to XLNet LM modeling (#2997)
* add explaining example to XLNet LM modeling

* improve docstring for xlnet
2020-02-24 15:42:38 -05:00
Patrick von Platen
65d74c4965 Add preprocessing step for transfo-xl tokenization to avoid tokenizing words followed by punction to <unk> (#2987)
* add preprocessing to add space before punctuation for transfo_xl

* improve warning messages

* make style

* compile regex at instantination of tokenizer object
2020-02-24 15:11:10 -05:00
Bram Vanroy
a143d9479e Add local_files_only parameter to pretrained items (#2930)
* Add disable_outgoing to pretrained items

Setting disable_outgoing=True disables outgonig traffic:
- etags are not looked up
- models are not downloaded

* parameter name change

* Remove forgotten print
2020-02-24 14:58:15 -05:00
Manuel Romero
286d1ec746 Create README.md 2020-02-24 14:33:49 -05:00
Lysandre Debut
7984a70ee4 kwargs are passed to both model and configuration in AutoModels (#2998) 2020-02-24 14:19:39 -05:00
Lysandre Debut
21d8b6a33e Testing that batch_encode_plus is the same as encode_plus (#2973)
* Testing that encode_plus and batch_encode_plus behave the same way

Spoiler alert: they don't

* Testing rest of arguments in batch_encode_plus

* Test tensor return in batch_encode_plus

* Addressing Sam's comments

* flake8

* Simplified with `num_added_tokens`
2020-02-24 12:09:46 -05:00
Patrick von Platen
17c45c39ed Add slow generate tests for pretrained lm models (#2909)
* add slow generate lm_model tests

* fix conflicts

* merge conflicts

* fix conflicts

* add slow generate lm_model tests

* make style

* delete unused variable

* fix conflicts

* fix conflicts

* fix conflicts

* delete unused variable

* fix conflicts

* finished hard coded tests
2020-02-24 11:51:57 -05:00
Lysandre Debut
8194df8e0c Warning on add_special_tokens (#2966)
Warning on `add_special_tokens` when passed to `encode`, `encode_plus` and `batch_encode_plus`
2020-02-24 08:42:54 -05:00
Patrick von Platen
38f5fe9e02 add_ctags_to_git_ignore (#2984) 2020-02-23 16:55:32 -05:00
Martin Malmsten
105dcb4162 Now passes style guide enforcement 2020-02-23 21:47:59 +01:00
Martin Malmsten
33eb8a165d Added , 2020-02-23 21:43:31 +01:00
Martin Malmsten
869b66f6b3 * Added support for Albert when fine-tuning for NER
* Added support for Albert in NER pipeline

* Added command-line options to examples/ner/run_ner.py to better control tokenization

* Added class AlbertForTokenClassification

* Changed output for NerPipeline to use .convert_ids_to_tokens(...) instead of .decode(...) to better reflect tokens
2020-02-23 21:13:03 +01:00
Sam Shleifer
129f0604ac Delete untested, broken Model2LSTM (#2968) 2020-02-23 11:28:48 -05:00
Lysandre Debut
0e84559d64 Correct special_tokens_mask when add_special_tokens=False (#2965)
Don't know of a use case where that would be useful, but this is more consistent
2020-02-23 09:50:39 -05:00
Sam Shleifer
92487a1dc0 Bart: fix layerdrop and cached decoder_input_ids for generation (#2969) 2020-02-22 16:25:04 -05:00
Joe Davison
c36416e53c Add standardized get_vocab method to tokenizers 2020-02-22 12:09:01 -05:00
saippuakauppias
cafc4dfc7c fix hardcoded path in examples readme 2020-02-22 11:12:38 -05:00
Malte Pietsch
34b4b5a9ed Update modelcard of bert-base-german-cased
Add image
2020-02-22 11:08:42 -05:00
Manuel Romero
7df12d7bf8 Update README.md
- I added an example using the model with pipelines to show that we have set```{"use_fast": False}``` in the tokenizer.
- I added a Colab to play with the model and pipelines
- I added a Colab to discover Huggingface pipelines at the end of the document
2020-02-22 11:06:41 -05:00
Funtowicz Morgan
cc6775cdf5 Fix max_length not taken into account when using pad_to_max_length on fast tokenizers (#2961)
* enable_padding should pad up to max_length if set.

Signed-off-by: Morgan Funtowicz <morgan@huggingface.co>

* Added more testing on padding.

Signed-off-by: Morgan Funtowicz <morgan@huggingface.co>
2020-02-22 09:27:47 -05:00
Lysandre Debut
94ff2d6ee8 Remove double bias (#2958) 2020-02-21 17:10:18 -05:00
Sam Shleifer
b5b3445c4f Only use F.gelu for torch >=1.4.0 (#2955)
* Only use F.gelu for torch >=1.4.0

* Use F.gelu for newer torch
2020-02-21 16:10:21 -05:00
Patrick von Platen
fc38d4c86f Improve special_token_id logic in run_generation.py and add tests (#2885)
* improving generation

* finalized special token behaviour for no_beam_search generation

* solved modeling_utils merge conflict

* solve merge conflicts in modeling_utils.py

* add run_generation improvements from PR #2749

* adapted language generation to not use hardcoded -1 if no padding token is available

* remove the -1 removal as hard coded -1`s are not necessary anymore

* add lightweight language generation testing for randomely initialized models - just checking whether no errors are thrown

* add slow language generation tests for pretrained models using hardcoded output with pytorch seed

* delete ipdb

* check that all generated tokens are valid

* renaming

* renaming Generation -> Generate

* make style

* updated so that generate_beam_search has same token behavior than generate_no_beam_search

* consistent return format for run_generation.py

* deleted pretrain lm generate tests -> will be added in another PR

* cleaning of unused if statements and renaming

* run_generate will always return an iterable

* make style

* consistent renaming

* improve naming, make sure generate function always returns the same tensor, add docstring

* add slow tests for all lmhead models

* make style and improve example comments modeling_utils

* better naming and refactoring in modeling_utils

* improving generation

* finalized special token behaviour for no_beam_search generation

* solved modeling_utils merge conflict

* solve merge conflicts in modeling_utils.py

* add run_generation improvements from PR #2749

* adapted language generation to not use hardcoded -1 if no padding token is available

* remove the -1 removal as hard coded -1`s are not necessary anymore

* add lightweight language generation testing for randomely initialized models - just checking whether no errors are thrown

* add slow language generation tests for pretrained models using hardcoded output with pytorch seed

* delete ipdb

* check that all generated tokens are valid

* renaming

* renaming Generation -> Generate

* make style

* updated so that generate_beam_search has same token behavior than generate_no_beam_search

* consistent return format for run_generation.py

* deleted pretrain lm generate tests -> will be added in another PR

* cleaning of unused if statements and renaming

* run_generate will always return an iterable

* make style

* consistent renaming

* improve naming, make sure generate function always returns the same tensor, add docstring

* add slow tests for all lmhead models

* make style and improve example comments modeling_utils

* better naming and refactoring in modeling_utils

* changed fast random lm generation testing design to more general one

* delete in old testing design in gpt2

* correct old variable name

* temporary fix for encoder_decoder lm generation tests - has to be updated when t5 is fixed

* adapted all fast random generate tests to new design

* better warning description in modeling_utils

* better comment

* better comment and error message

Co-authored-by: Thomas Wolf <thomwolf@users.noreply.github.com>
2020-02-21 12:09:59 -05:00
maximeilluin
c749a543fa Added CamembertForQuestionAnswering (#2746)
* Added CamembertForQuestionAnswering

* fixed camembert tokenizer case
2020-02-21 12:01:02 -05:00
Bram Vanroy
5211d333bb Update modeling_tf_utils.py (#2924)
Tensorflow does not use .eval() vs .train().

closes https://github.com/huggingface/transformers/issues/2906
2020-02-21 11:28:32 -05:00
ahotrod
3e98f27e4a Create README.md for xlnet_large_squad (#2942) 2020-02-21 08:54:41 -05:00
Martin Malmsten
4452b44b90 Labels are now added to model config under id2label and label2id (#2945) 2020-02-21 08:53:05 -05:00
Sam Shleifer
53ce3854a1 New BartModel (#2745)
* Results same as fairseq
* Wrote a ton of tests
* Struggled with api signatures
* added some docs
2020-02-20 18:11:13 -05:00
guillaume-be
564fd75d65 Removed unused fields in DistilBert TransformerBlock (#2710)
* Removed unused fields in DistilBert TransformerBlock
2020-02-20 16:08:21 -05:00
srush
889d3bfdbb default arg fix (#2937) 2020-02-20 15:31:17 -05:00
Joe Davison
197d74f988 Add get_vocab method to PretrainedTokenizer 2020-02-20 15:26:49 -05:00
Scott Gigante
ea8eba35e2 Fix InputExample docstring (#2891) 2020-02-20 15:25:15 -05:00
Funtowicz Morgan
e2a6445ebb Tokenizer fast warnings (#2922)
* Remove warning when pad_to_max_length is not set.

Signed-off-by: Morgan Funtowicz <morgan@huggingface.co>

* Move RoberTa warning to RoberTa and not GPT2 base tokenizer.

Signed-off-by: Morgan Funtowicz <morgan@huggingface.co>
2020-02-20 11:55:03 -05:00
Funtowicz Morgan
9b3093311f Expose all constructor parameter for BertTokenizerFast (#2921)
Signed-off-by: Morgan Funtowicz <morgan@huggingface.co>
2020-02-20 11:53:32 -05:00
srush
b662f0e625 Support for torch-lightning in NER examples (#2890)
* initial pytorch lightning commit

* tested multigpu

* Fix learning rate schedule

* black formatting

* fix flake8

* isort

* isort

* .

Co-authored-by: Check your git settings! <chris@chris-laptop>
2020-02-20 11:50:05 -05:00
Ilias Chalkidis
ab1238393c Update to include example of LM
The model files have been updated in order to include the classification layers, based on https://github.com/huggingface/transformers/issues/2901, and now can be also used as a LM.
2020-02-20 10:57:59 -05:00
Santiago Castro
976e9afece Add syntax highlighting to the BibTeX in README 2020-02-20 10:06:15 -05:00
Cong
cbc5705541 Fix spell: EsperBERTo, not EspertBERTo 2020-02-20 10:02:07 -05:00
Funtowicz Morgan
d490b5d500 Fast Tokenizers save pretrained should return the list of generated file paths. (#2918)
* Correctly return the tuple of generated file(s) when calling save_pretrained

Signed-off-by: Morgan Funtowicz <morgan@huggingface.co>

* Quality and format.

Signed-off-by: Morgan Funtowicz <morgan@huggingface.co>
2020-02-20 00:58:04 +01:00
Lysandre
2708b44ee9 Patch ALBERT with heads in TensorFlow 2020-02-19 18:46:25 -05:00
Lysandre
1abd53b1aa Patch ALBERT with heads in TensorFlow 2020-02-19 18:24:40 -05:00
Funtowicz Morgan
e676764241 Override build_inputs_with_special_tokens for fast tokenizers (#2912)
* Override build_inputs_with_special_tokens for fast impl + unittest.

Signed-off-by: Morgan Funtowicz <morgan@huggingface.co>

* Quality + format.

Signed-off-by: Morgan Funtowicz <morgan@huggingface.co>
2020-02-19 16:09:51 -05:00
Lysandre
59c23ad9c9 README link + better instructions for release 2020-02-19 11:57:17 -05:00
Lysandre
22b2b5790e Documentation v2.5.0 2020-02-19 11:53:30 -05:00
Lysandre
fb560dcb07 Release: v2.5.0
Welcome Rust Tokenizers
2020-02-19 11:46:19 -05:00
Funtowicz Morgan
3f3fa7f7da Integrate fast tokenizers library inside transformers (#2674)
* Implemented fast version of tokenizers

Signed-off-by: Morgan Funtowicz <morgan@huggingface.co>

* Bumped tokenizers version requirements to latest 0.2.1

Signed-off-by: Morgan Funtowicz <morgan@huggingface.co>

* Added matching tests

Signed-off-by: Morgan Funtowicz <morgan@huggingface.co>

* Matching OpenAI GPT tokenization !

Signed-off-by: Morgan Funtowicz <morgan@huggingface.co>

* Matching GPT2 on tokenizers

Signed-off-by: Morgan Funtowicz <morgan@huggingface.co>

* Expose add_prefix_space as constructor parameter for GPT2

Signed-off-by: Morgan Funtowicz <morgan@huggingface.co>

* Matching Roberta tokenization !

Signed-off-by: Morgan Funtowicz <morgan@huggingface.co>

* Removed fast implementation of CTRL.

Signed-off-by: Morgan Funtowicz <morgan@huggingface.co>

* Binding TransformerXL tokenizers to Rust.

Signed-off-by: Morgan Funtowicz <morgan@huggingface.co>

* Updating tests accordingly.

Signed-off-by: Morgan Funtowicz <morgan@huggingface.co>

* Added tokenizers as top-level modules.

Signed-off-by: Morgan Funtowicz <morgan@huggingface.co>

* Black & isort.

Signed-off-by: Morgan Funtowicz <morgan@huggingface.co>

* Rename LookupTable to WordLevel to match Rust side.

Signed-off-by: Morgan Funtowicz <morgan@huggingface.co>

* Black.

Signed-off-by: Morgan Funtowicz <morgan@huggingface.co>

* Use "fast" suffix instead of "ru" for rust tokenizers implementations.

Signed-off-by: Morgan Funtowicz <morgan@huggingface.co>

* Introduce tokenize() method on fast tokenizers.

Signed-off-by: Morgan Funtowicz <morgan@huggingface.co>

* encode_plus dispatchs to batch_encode_plus

Signed-off-by: Morgan Funtowicz <morgan@huggingface.co>

* batch_encode_plus now dispatchs to encode if there is only one input element.

Signed-off-by: Morgan Funtowicz <morgan@huggingface.co>

* Bind all the encode_plus parameter to the forwarded batch_encode_plus call.

Signed-off-by: Morgan Funtowicz <morgan@huggingface.co>

* Bump tokenizers dependency to 0.3.0

Signed-off-by: Morgan Funtowicz <morgan@huggingface.co>

* Formatting.

Signed-off-by: Morgan Funtowicz <morgan@huggingface.co>

* Fix tokenization_auto with support for new (python, fast) mapping schema.

Signed-off-by: Morgan Funtowicz <morgan@huggingface.co>

* Give correct fixtures path in test_tokenization_fast.py for the CLI.

Signed-off-by: Morgan Funtowicz <morgan@huggingface.co>

* Expose max_len_ properties on BertTokenizerFast

Signed-off-by: Morgan Funtowicz <morgan@huggingface.co>

* Move max_len_ properties to PreTrainedTokenizerFast and override in specific subclasses.

Signed-off-by: Morgan Funtowicz <morgan@huggingface.co>

* _convert_encoding should keep the batch axis tensor if only one sample in the batch.

Signed-off-by: Morgan Funtowicz <morgan@huggingface.co>

* Add warning message for RobertaTokenizerFast if used for MLM.

Signed-off-by: Morgan Funtowicz <morgan@huggingface.co>

* Added use_fast (bool) parameter on AutoTokenizer.from_pretrained().

This allows to easily enable/disable Rust-based tokenizer instantiation.

Signed-off-by: Morgan Funtowicz <morgan@huggingface.co>

* Let's tokenizers handle all the truncation and padding stuff.

Signed-off-by: Morgan Funtowicz <morgan@huggingface.co>

* Allow to provide tokenizer arguments during pipeline creation.

Signed-off-by: Morgan Funtowicz <morgan@huggingface.co>

* Update test_fill_mask pipeline to not use fast tokenizers.

Signed-off-by: Morgan Funtowicz <morgan@huggingface.co>

* Fix too much parameters for convert_encoding.

Signed-off-by: Morgan Funtowicz <morgan@huggingface.co>

* When enabling padding, max_length should be set to None.

Signed-off-by: Morgan Funtowicz <morgan@huggingface.co>

* Avoid returning nested tensors of length 1 when calling encode_plus

Signed-off-by: Morgan Funtowicz <morgan@huggingface.co>

* Ensure output is padded when return_tensor is not None.

Tensor creation requires the inital list input to be of the exact same size.

Signed-off-by: Morgan Funtowicz <morgan@huggingface.co>

* Disable transfoxl unittest if pytorch is not available (required to load the model)

Signed-off-by: Morgan Funtowicz <morgan@huggingface.co>

* encode_plus should not remove the leading batch axis if return_tensor is set

Signed-off-by: Morgan Funtowicz <morgan@huggingface.co>

* Temporary disable fast tokenizers on QA pipelines.

Signed-off-by: Morgan Funtowicz <morgan@huggingface.co>

* Fix formatting issues.

Signed-off-by: Morgan Funtowicz <morgan@huggingface.co>

* Update tokenizers to 0.4.0

* Update style

* Enable truncation + stride unit test on fast tokenizers.

Signed-off-by: Morgan Funtowicz <morgan@huggingface.co>

* Add unittest ensuring special_tokens set match between Python and Rust.

Signed-off-by: Morgan Funtowicz <morgan@huggingface.co>

* Ensure special_tokens are correctly set during construction.

Signed-off-by: Morgan Funtowicz <morgan@huggingface.co>

* Give more warning feedback to the user in case of padding without pad_token.

Signed-off-by: Morgan Funtowicz <morgan@huggingface.co>

* quality & format.

Signed-off-by: Morgan Funtowicz <morgan@huggingface.co>

* Added possibility to add a single token as str

Signed-off-by: Morgan Funtowicz <morgan@huggingface.co>

* Added unittest for add_tokens and add_special_tokens on fast tokenizers.

Signed-off-by: Morgan Funtowicz <morgan@huggingface.co>

* Fix rebase mismatch on pipelines qa default model.

QA requires cased input while the tokenizers would be uncased.

Signed-off-by: Morgan Funtowicz <morgan@huggingface.co>

* Addressing review comment: Using offset mapping relative to the original string + unittest.

Signed-off-by: Morgan Funtowicz <morgan@huggingface.co>

* Addressing review comment: save_vocabulary requires folder and file name

Signed-off-by: Morgan Funtowicz <morgan@huggingface.co>

* Addressing review comment: Simplify import for Bert.

Signed-off-by: Morgan Funtowicz <morgan@huggingface.co>

* Addressing review comment: truncate_and_pad disables padding according to the same heuristic than the one enabling padding.

Signed-off-by: Morgan Funtowicz <morgan@huggingface.co>

* Addressing review comment: Remove private member access in tokenize()

Signed-off-by: Morgan Funtowicz <morgan@huggingface.co>

* Addressing review comment: Bump tokenizers dependency to 0.4.2

Signed-off-by: Morgan Funtowicz <morgan@huggingface.co>

* format & quality.

Signed-off-by: Morgan Funtowicz <morgan@huggingface.co>

* Addressing review comment: Use named arguments when applicable.

Signed-off-by: Morgan Funtowicz <morgan@huggingface.co>

* Addressing review comment: Add Github link to Roberta/GPT2 space issue on masked input.

Signed-off-by: Morgan Funtowicz <morgan@huggingface.co>

* Addressing review comment: Move max_len_single_sentence / max_len_sentences_pair to PreTrainedTokenizerFast + tests.

Signed-off-by: Morgan Funtowicz <morgan@huggingface.co>

* Addressing review comment: Relax type checking to include tuple and list object.

Signed-off-by: Morgan Funtowicz <morgan@huggingface.co>

* Addressing review comment: Document the truncate_and_pad manager behavior.

Signed-off-by: Morgan Funtowicz <morgan@huggingface.co>

* Raise an exception if return_offsets_mapping is not available with the current tokenizer.

Signed-off-by: Morgan Funtowicz <morgan@huggingface.co>

* Ensure padding is set on the tokenizers before setting any padding strategy + unittest.

Signed-off-by: Morgan Funtowicz <morgan@huggingface.co>

* On pytorch we need to stack tensor to get proper new axis.

Signed-off-by: Morgan Funtowicz <morgan@huggingface.co>

* Generalize tests to different framework removing hard written return_tensors="..."

Signed-off-by: Morgan Funtowicz <morgan@huggingface.co>

* Bump tokenizer dependency for num_special_tokens_to_add

Signed-off-by: Morgan Funtowicz <morgan@huggingface.co>

* Overflowing tokens in batch_encode_plus are now stacked over the batch axis.

Signed-off-by: Morgan Funtowicz <morgan@huggingface.co>

* Improved error message for padding strategy without pad token.

Signed-off-by: Morgan Funtowicz <morgan@huggingface.co>

* Bumping tokenizers dependency to 0.5.0 for release.

Signed-off-by: Morgan Funtowicz <morgan@huggingface.co>

* Optimizing convert_encoding around 4x improvement. 🚀

Signed-off-by: Morgan Funtowicz <morgan@huggingface.co>

* expose pad_to_max_length in encode_plus to avoid duplicating the parameters in kwargs

Signed-off-by: Morgan Funtowicz <morgan@huggingface.co>

* Generate a proper overflow_to_sampling_mapping when return_overflowing_tokens is True.

Signed-off-by: Morgan Funtowicz <morgan@huggingface.co>

* Fix unittests for overflow_to_sampling_mapping not being returned as tensor.

Signed-off-by: Morgan Funtowicz <morgan@huggingface.co>

* Format & quality.

Signed-off-by: Morgan Funtowicz <morgan@huggingface.co>

* Remove perfect alignment constraint for Roberta (allowing 1% difference max)

Signed-off-by: Morgan Funtowicz <morgan@huggingface.co>

* Triggering final CI

Co-authored-by: MOI Anthony <xn1t0x@gmail.com>
2020-02-19 11:35:40 -05:00
Bin Wang
ffb93ec0cc Create README.md 2020-02-19 10:51:16 -05:00
Sam Shleifer
20fc18fbda Skip flaky test_tf_question_answering (#2845)
* Skip flaky test

* Style
2020-02-18 16:14:50 -05:00
VictorSanh
2ae98336d1 fix vocab size in binarized_data (distil): int16 vs int32 2020-02-18 16:17:35 +00:00
VictorSanh
0dbddba6d2 fix typo in hans example call 2020-02-17 20:19:57 +00:00
Manuel Romero
29ab4b7f40 Create README.md 2020-02-17 10:58:43 -05:00
Stefan Schweter
c88ed74ccf [model_cards] 🇹🇷 Add new (cased) BERTurk model 2020-02-17 09:54:46 -05:00
Thomas Wolf
5b2d4f2657 Merge pull request #2881 from patrickvonplaten/add_vim_swp_to_gitignore
update .gitignore to ignore .swp files created when using vim
2020-02-17 14:36:49 +01:00
Patrick von Platen
fb4d8d0832 update .gitignore to ignore .swp files created when using vim 2020-02-17 14:26:32 +01:00
Manuel Romero
6083c1566e Update README.md
I trained the model for more epochs so I improved the results. This commit will update the results of the model and add a gif using it with **transformers/pipelines**
2020-02-16 10:09:34 -05:00
Julien Chaumond
73028c5df0 [model_cards] EsperBERTo 2020-02-14 15:16:33 -05:00
Timo Moeller
81fb8d3251 Update model card: new performance chart (#2864)
* Update model performance for correct German conll03 dataset

* Adjust text

* Adjust line spacing
2020-02-14 13:39:23 -05:00
Julien Chaumond
4e69104a1f [model_cards] Also use the thumbnail as meta
Co-Authored-By: Ilias Chalkidis <ihalk@di.uoa.gr>
2020-02-14 10:27:11 -05:00
Julien Chaumond
73d79d42b4 [model_cards] nlptown/bert-base-multilingual-uncased-sentiment
cc @yvespeirsman

Co-Authored-By: Yves Peirsman <yvespeirsman@users.noreply.github.com>
2020-02-14 09:51:11 -05:00
Yves Peirsman
47b735f994 Added model card for bert-base-multilingual-uncased-sentiment (#2859)
* Created model card for nlptown/bert-base-multilingual-sentiment

* Delete model card

* Created model card for bert-base-multilingual-uncased-sentiment as README
2020-02-14 09:31:15 -05:00
Julien Chaumond
7d22fefd37 [pipeline] Alias NerPipeline as TokenClassificationPipeline 2020-02-14 09:18:10 -05:00
Manuel Romero
61a2b7dc9d Fix typo 2020-02-14 09:13:07 -05:00
Ilias Chalkidis
6e261d3a22 Fix typos 2020-02-14 09:11:07 -05:00
Manuel Romero
4e597c8e4d Fix typo 2020-02-14 09:07:42 -05:00
Julien Chaumond
925a13ced1 [model_cards] mv README.md 2020-02-13 23:07:29 -05:00
Manuel Romero
575a3b7aa1 Create distill-bert-base-spanish-wwm-cased-finetuned-spa-squad2-es.md 2020-02-13 23:04:52 -05:00
Julien Chaumond
4d36472b96 [run_ner] Don't crash if fine-tuning local model that doesn't end with digit 2020-02-14 03:25:29 +00:00
Ilias Chalkidis
8514018300 Update with additional information
Added a "Pre-training details" section
2020-02-13 21:54:42 -05:00
Ilias Chalkidis
1eec69a900 Create README.md 2020-02-13 19:27:22 -05:00
Felix MIKAELIAN
8744402f1e add model_card flaubert-base-uncased-squad (#2833)
* add model_card

* Add tag

cc @fmikaelian

Co-authored-by: Julien Chaumond <chaumond@gmail.com>
2020-02-13 17:19:13 -05:00
Severin Simmler
7f98edd7e3 Model card: Literary German BERT (#2843)
* feat: create model card

* chore: add description

* feat: stats plot

* Delete prosa-jahre.svg

* feat: years plot (again)

* chore: add more details

* fix: typos

* feat: kfold plot

* feat: kfold plot

* Rename model_cards/severinsimmler/literary-german-bert.md to model_cards/severinsimmler/literary-german-bert/README.md

* Support for linked images + add tags

cc @severinsimmler

Co-authored-by: Julien Chaumond <chaumond@gmail.com>
2020-02-13 15:43:44 -05:00
Joe Davison
f1e8a51f08 Preserve spaces in GPT-2 tokenizers (#2778)
* Preserve spaces in GPT-2 tokenizers

Preserves spaces after special tokens in GPT-2 and inhereted (RoBERTa)
tokenizers, enabling correct BPE encoding. Automatically inserts a space
in front of first token in encode function when adding special tokens.

* Add tokenization preprocessing method

* Add framework argument to pipeline factory

Also fixes pipeline test issue. Each test input now treated as a
distinct sequence.
2020-02-13 13:29:43 -05:00
Sam Shleifer
0ed630f139 Attempt to increase timeout for circleci slow tests (#2844) 2020-02-13 09:11:03 -05:00
Sam Shleifer
ef74b0f07a get_activation('relu') provides a simple mapping from strings i… (#2807)
* activations.py contains a mapping from string to activation function
* resolves some `gelu` vs `gelu_new` ambiguity
2020-02-13 08:28:33 -05:00
Lysandre
f54a5bd37f Raise error when using an mlm flag for a clm model + correct TextDataset 2020-02-12 13:23:14 -05:00
Lysandre
569897ce2c Fix a few issues regarding the language modeling script 2020-02-12 13:23:14 -05:00
Julien Chaumond
21da895013 [model_cards] Better image for social sharing 2020-02-11 20:30:08 -05:00
Julien Chaumond
9a70910d47 [model_cards] Tweak @mrm8488's model card 2020-02-11 20:20:39 -05:00
Julien Chaumond
9274734a0d [model_cards] mv to correct location + tweak tag 2020-02-11 20:13:57 -05:00
Manuel Romero
69f948461f Create bert-base-spanish-wwm-cased-finetuned-spa-squad2-es.md 2020-02-11 20:07:15 -05:00
Julien Chaumond
e0b6247cf7 [model_cards] Change formatting slightly as we updated our markdown engine
cc @tholor @loretoparisi @simonefrancia
2020-02-11 18:25:21 -05:00
sshleifer
5f2dd71d1b Smaller diff 2020-02-11 17:20:09 -05:00
sshleifer
31158af57c formatting 2020-02-11 17:20:09 -05:00
sshleifer
5dd61fb9a9 Add more specific testing advice to Contributing.md 2020-02-11 17:20:09 -05:00
Oleksiy Syvokon
ee5de0ba44 BERT decoder: Fix causal mask dtype.
PyTorch < 1.3 requires multiplication operands to be of the same type.
This was violated when using default attention mask (i.e.,
attention_mask=None in arguments) given BERT in the decoder mode.

In particular, this was breaking Model2Model and made tutorial
from the quickstart failing.
2020-02-11 15:19:22 -05:00
jiyeon
bed38d3afe Fix typo in src/transformers/data/processors/squad.py 2020-02-11 11:22:24 -05:00
Stefan Schweter
498d06e914 [model_cards] Add new German Europeana BERT models (#2805)
* [model_cards] New German Europeana BERT models from dbmdz

* [model_cards] Update German Europeana BERT models from dbmdz
2020-02-11 10:49:39 -05:00
Funtowicz Morgan
3e3a9e2c01 Merge pull request #2793 from huggingface/tensorflow-210-circleci-fix
Fix circleci cuInit error on Tensorflow >= 2.1.0.
2020-02-11 10:48:42 +00:00
Julien Chaumond
1f5db9a13c [model_cards] Rm extraneous tag 2020-02-10 17:45:13 -05:00
Julien Chaumond
95bac8dabb [model_cards] Add language metadata to existing model cards
This will enable filtering on language (amongst other tags) on the website

cc @loretoparisi, @stefan-it, @HenrykBorzymowski, @marma
2020-02-10 17:42:42 -05:00
ahotrod
ba498eac38 Create README.md (#2785)
* Create README.md

* Update README.md

* Update README.md

* Update README.md

* [model_cards] Use code fences for consistency

Co-authored-by: Julien Chaumond <chaumond@gmail.com>
2020-02-10 17:27:59 -05:00
Malte Pietsch
68ccc04ee6 Add model readme for deepset/roberta-base-squad2 (#2797)
* Add readme for deepset/roberta-base-squad2

* update model readme
2020-02-10 15:21:48 -05:00
Lysandre
539f601be7 intermediate_size > hidden_dim in distilbert config docstrings 2020-02-10 13:45:57 -05:00
Lysandre
cfb7d108bd FlauBERT lang embeddings only when n_langs > 1 2020-02-10 13:24:04 -05:00
Julien Chaumond
b4691a438d [model_cards] BERT-of-Theseus: use the visual as thumbnail
cc @jetrunner

Co-Authored-By: Kevin Canwen Xu <canwenxu@outlook.com>
2020-02-10 11:27:08 -05:00
Julien Chaumond
fc325e97cd [model_cards] Showcase model tag syntax 2020-02-10 11:27:08 -05:00
Lysandre
fd639e5be3 Correct quickstart example when using the past 2020-02-10 11:25:56 -05:00
Julien Chaumond
63a5399bc4 [model_cards] Specify language meta + thumbnail
cc @tholor

see #2799
2020-02-10 11:20:05 -05:00
Lysandre
125a75a121 Correctly compute tokens when padding on the left 2020-02-10 10:47:42 -05:00
Malte Pietsch
9c64d1da35 Add model readme for bert-base-german-cased (#2799)
* add readme for bert-base-german-cased

* update readme
2020-02-10 10:27:29 -05:00
Kevin Canwen Xu
bf99014c46 Create BERT-of-Theseus model card 2020-02-10 09:58:40 -05:00
Thomas Wolf
92e974196f Merge pull request #2765 from huggingface/extract-cached-archives
Add option to `cached_path` to automatically extract archives
2020-02-10 14:05:16 +01:00
Morgan Funtowicz
6aa7973aec Fix circleci cuInit error on Tensorflow >= 2.1.0.
Tensorflow 2.1.0 introduce a new dependency model where pip install tensorflow would install tf with GPU support.
Before it would just install with CPU support, thus CircleCI is looking for NVidia driver version at initialization of the
tensorflow related tests but fails as their is no NVidia Driver running.

Signed-off-by: Morgan Funtowicz <morgan@huggingface.co>
2020-02-10 13:24:37 +01:00
Lysandre
520e7f2119 Correct docstring for xlnet 2020-02-07 16:42:35 -05:00
Lysandre
dd28830327 Update RoBERTa tips 2020-02-07 16:42:35 -05:00
Lysandre
db97930122 Update XLM-R tips 2020-02-07 16:42:35 -05:00
Lysandre
7046de2991 E231 2020-02-07 15:28:13 -05:00
VictorSanh
0d3aa3c04c styling 2020-02-07 15:28:13 -05:00
VictorSanh
d8b43600fd omission 2020-02-07 15:28:13 -05:00
VictorSanh
ee5a6856ca distilbert-base-cased weights + Readmes + omissions 2020-02-07 15:28:13 -05:00
monologg
73368963b2 Fix importing unofficial TF models with extra optimizer weights 2020-02-07 10:25:31 -05:00
Ari
d7dabfeff5 Fix documentation in ProjectedAdaptiveLogSoftmax 2020-02-07 10:14:58 -05:00
Julien Chaumond
42f08e596f [examples] rename run_lm_finetuning to run_language_modeling 2020-02-07 09:15:28 -05:00
Julien Chaumond
4f7bdb0958 [examples] Fix broken markdown 2020-02-07 09:15:28 -05:00
thomwolf
c6c5c3fd4e style and quality 2020-02-07 08:58:06 +01:00
thomwolf
961c69776f @julien-c proposal for TF/PT compat in hf_buckets 2020-02-07 08:53:17 +01:00
thomwolf
d311f87bca cleanup 2020-02-07 00:05:28 +01:00
thomwolf
7d99e05f76 file_cache has options to extract archives 2020-02-07 00:03:12 +01:00
dchurchwell
2c12464a20 Changed vocabulary save function. Variable name was inconsistent, causing an error to be thrown when passing a file name instead of a directory. 2020-02-06 16:40:07 -05:00
Peter Izsak
6fc3d34abd Fix multi-gpu evaluation in run_glue.py 2020-02-06 16:38:55 -05:00
Julien Chaumond
7748cbbe7d Oopsie 2020-02-06 15:30:02 -05:00
Julien Chaumond
432c12521e [docs] Add menu w/ links to other pages on hf.co 2020-02-06 15:30:02 -05:00
Clement
c069932f5d Add contributors snapshot
powered by https://github.com/sourcerer-io/hall-of-fame
2020-02-06 15:25:47 -05:00
Lysandre Debut
33d3072e1c Arxiv README (#2747)
* Arxiv README

* ArXiv-NLP readme
2020-02-05 15:26:28 -05:00
Julien Chaumond
eae8ee0389 [doc] model sharing: mention README.md + tweaks
cc @lysandrejik @thomwolf
2020-02-05 14:20:03 -05:00
James Betker
6bb6a01765 Fix GPT2 config set to trainable
This prevents the model from being saved, and who knows
what else.
2020-02-05 13:55:41 -05:00
Julien Chaumond
ada24def22 [run_lm_finetuning] Tweak fix for non-long tensor, close #2728
see 1ebfeb7946 and #2728

Co-Authored-By: Lysandre Debut <lysandre.debut@reseau.eseo.fr>
2020-02-05 12:49:18 -05:00
Lysandre
2184f87003 RoBERTa TensorFlow Tests 2020-02-04 18:05:35 -05:00
Lysandre
e615269cb8 Correct slow test 2020-02-04 18:05:35 -05:00
Lysandre
5f96ebc0be Style 2020-02-04 18:05:35 -05:00
Lysandre
950c6a4f09 Flaubert PyTorch tests 2020-02-04 18:05:35 -05:00
Lysandre
d28b81dc29 RoBERTa Pytorch tests 2020-02-04 18:05:35 -05:00
Yuval Pinter
d1ab1fab1b pass langs parameter to certain XLM models (#2734)
* pass langs parameter to certain XLM models

Adding an argument that specifies the language the SQuAD dataset is in so language-sensitive XLMs (e.g. `xlm-mlm-tlm-xnli15-1024`) don't default to language `0`.
Allows resolution of issue #1799 .

* fixing from `make style`

* fixing style (again)
2020-02-04 17:12:42 -05:00
sshleifer
9e5b549b4d fix default getattr 2020-02-04 16:38:52 -05:00
sshleifer
25848a6094 double quotes 2020-02-04 16:38:52 -05:00
sshleifer
cbcb83f21d minor cleanup of test_attention_outputs 2020-02-04 16:38:52 -05:00
Lysandre
3bf5417258 Revert erroneous fix 2020-02-04 16:31:07 -05:00
Lysandre
1ebfeb7946 Cast to long when masking tokens 2020-02-04 15:56:16 -05:00
Lysandre
9c67196b83 Update quickstart 2020-02-04 11:11:37 -05:00
Lysandre
90ab15cb7a Remove redundant hidden states 2020-02-04 10:59:32 -05:00
Julien Chaumond
9a50828b5c Pipelines: fix crash when modelcard is None
cc @mfuntowicz does this seem correct?
2020-02-03 17:53:39 -05:00
Lysandre
6c1b23554f Sample instead of greedy decoding by default in generate 2020-02-03 17:23:53 -05:00
Lysandre
239dd23f64 [Follow up 213]
Masked indices should have -1 and not -100. Updating documentation + scripts that were forgotten
2020-02-03 16:08:05 -05:00
Martin Malmsten
522c5b5533 Added README.md to Swedish BERT models from National Library of Sweden 2020-02-03 09:09:34 -05:00
Julien Plu
9329e59700 Add READMEs to Tensorflow versions of CamemBERT and XLM-RoBERTa 2020-02-03 09:04:34 -05:00
Antonio Carlos Falcão Petri
2ba147ecff Fix typo in examples/utils_ner.py
"%s-%d".format() -> "{}-{}".format()
2020-02-01 11:10:57 -05:00
Bram Vanroy
9773e5e0d9 CLI script to gather environment info (#2699)
* add "info" command to CLI

As a convenience, add the info directive to CLI. Running `python transformers-cli info` will return a string containing the transformers version, platform, python version, PT/TF version and GPU support

* Swap f-strings for .format

Still supporting 3.5 so can't use f-strings (sad face)

* Add reference in issue to CLI

* Add the expected fields to issue template

This way, people can still add the information manually if they want. (Though I fear they'll just ignore it.)

* Remove heading from output

* black-ify

* order of imports

Should ensure isort test passes

* use is_X_available over import..pass

* style

* fix copy-paste bug

* Rename command info -> env

Also adds the command to CONTRIBUTING.md in "Did you find a bug" section
2020-02-01 10:38:14 -05:00
Julien Chaumond
ddb6f9476b [model_cards] dbmdz models
Co-Authored-By: Stefan Schweter <stefan-it@users.noreply.github.com>
2020-01-31 18:39:09 -05:00
Julien Chaumond
6636826f04 [model_cards] Multilingual + Dutch SQuAD2.0
Co-Authored-By: HenrykBorzymowski <henrykborzymowski@users.noreply.github.com>
2020-01-31 18:39:09 -05:00
Julien Chaumond
98dadc98e1 [model_cards] UmBERTo
Co-Authored-By: Loreto Parisi <loretoparisi@gmail.com>
Co-Authored-By: Simone Francia <francia.simone1@gmail.com>
2020-01-31 18:39:09 -05:00
Julien Chaumond
d6fc34b459 [model_cards] add mine 2020-01-31 18:39:09 -05:00
Lysandre
d426b58b9e Patch: v2.4.1 2020-01-31 14:55:33 -05:00
Lysandre
1e82cd8457 Flaubert auto tokenizer + tests
cc @julien-c
2020-01-31 14:16:52 -05:00
Lysandre
d18d47be67 run_generation style 2020-01-31 12:05:48 -05:00
Lysandre
ff6f1492e8 FlauBERT load in AutoModel
The FlauBERT configuration file inherits from XLMConfig, and is recognized as such when loading from AutoModels as the XLMConfig is checked before the FlaubertConfig.

Changing the order solves this problem, but a test should be added.
2020-01-31 12:05:15 -05:00
Lysandre
7365f01d43 do_sample should be set to True in run_generation.py 2020-01-31 11:49:32 -05:00
Arnaud
3a21d6da6b Typo on markdown link in README.md 2020-01-31 10:58:49 -05:00
Lysandre
0aa40e9569 v2.4.0 documentation 2020-01-31 09:55:34 -05:00
Lysandre
8036ceb7c5 Update commands for pypi test 2020-01-31 09:48:15 -05:00
Lysandre
6664ea943d Release: v2.4.0 2020-01-31 09:40:32 -05:00
Julien Chaumond
5a6b138b00 [Umberto] model shortcuts (#2661)
* [Umberto] model shortcuts

cc @loretoparisi @simonefrancia

see #2485

* Ensure that tokenizers will be correctly configured
2020-01-30 21:05:53 -05:00
Julien Chaumond
7fe294bf07 Hotfix: same handling of non-existent files as for config 2020-01-30 20:05:04 -05:00
Julien Chaumond
b85c59f997 config.architectures 2020-01-30 19:26:59 -05:00
Julien Chaumond
f9bc3f5771 style tweak 2020-01-30 19:26:59 -05:00
Julien Chaumond
0b13fb822a No need for a model_type here
cc @lysandrejik
2020-01-30 19:26:59 -05:00
Jared Nielsen
71a382319f Correct documentation 2020-01-30 18:41:24 -05:00
Lysandre
01a14ebd8d Add FlauBERT to automodels 2020-01-30 18:40:22 -05:00
Julien Chaumond
9fa836a73f fill_mask helper (#2576)
* fill_mask helper

* [poc] FillMaskPipeline

* Revert "[poc] FillMaskPipeline"

This reverts commit 67eeea55b0f97b46c2b828de0f4ee97d87338335.

* Revert "fill_mask helper"

This reverts commit cacc17b884e14bb6b07989110ffe884ad9e36eaa.

* README: clarify that Pipelines can also do text-classification

cf. question at the AI&ML meetup last week, @mfuntowicz

* Fix test: test feature-extraction pipeline

* Test tweaks

* Slight refactor of existing pipeline (in preparation of new FillMaskPipeline)

* Extraneous doc

* More robust way of doing this

@mfuntowicz as we don't rely on the model name anymore (see AutoConfig)

* Also add RobertaConfig as a quickfix for wrong token_type_ids

* cs

* [BIG] FillMaskPipeline
2020-01-30 18:15:42 -05:00
Hang Le
b43cb09aaa Add layerdrop 2020-01-30 12:05:01 -05:00
Lysandre
df27648bd9 Rename test_examples to test_doc_samples 2020-01-30 10:07:22 -05:00
Lysandre
93dccf527b Pretrained models 2020-01-30 10:04:18 -05:00
Lysandre
90787fed81 Style 2020-01-30 10:04:18 -05:00
Lysandre
73306d028b FlauBERT documentation 2020-01-30 10:04:18 -05:00
Lysandre
ce2f4227ab Fix failing FlauBERT test 2020-01-30 10:04:18 -05:00
Hang Le
f0a4fc6cd6 Add Flaubert 2020-01-30 10:04:18 -05:00
Peter Izsak
a5381495e6 Added classifier dropout rate in ALBERT 2020-01-30 09:52:34 -05:00
Bram Vanroy
83446a88d9 Use _pad_token of pad_token_id
Requesting pad_token_id would cause an error message when it is None. Use private _pad_token instead.
2020-01-29 17:44:58 -05:00
BramVanroy
9fde13a3ac Add check to verify existence of pad_token_id
In batch_encode_plus we have to ensure that the tokenizer has a pad_token_id so that, when padding, no None values are added as padding. That would happen with gpt2, openai, transfoxl.

closes https://github.com/huggingface/transformers/issues/2640
2020-01-29 17:44:58 -05:00
Lysandre
e63a81dd25 Style 2020-01-29 16:29:20 -05:00
Lysandre
217349016a Copy object instead of passing the reference 2020-01-29 16:15:39 -05:00
Jared Nielsen
adb8c93134 Remove lines causing a KeyError 2020-01-29 14:01:16 -05:00
Lysandre
c69b082601 Update documentation 2020-01-29 12:06:13 -05:00
Julien Plu
ca1d66734d Apply quality and style requirements once again 2020-01-29 12:06:13 -05:00
Julien Plu
5e3c72842d bugfix on model name 2020-01-29 12:06:13 -05:00
Julien Plu
0731fa1587 Apply quality and style requirements 2020-01-29 12:06:13 -05:00
Julien Plu
a3998e76ae Add TF2 CamemBERT model 2020-01-29 12:06:13 -05:00
Lysandre
b5625f131d Style 2020-01-29 11:47:49 -05:00
Lysandre
44a5b4bbe7 Update documentation 2020-01-29 11:47:49 -05:00
Julien Plu
7fc628d98e Apply style 2020-01-29 11:47:49 -05:00
Julien Plu
64ca855617 Add TF2 XLM-RoBERTa model 2020-01-29 11:47:49 -05:00
BramVanroy
9d87eafd11 Streamlining
- mostly stylistic streamlining
- removed 'additional context' sections. They seem to be rarely used and might cause confusion. If more details are needed, users can add them to the 'details' section
2020-01-28 10:41:10 -05:00
BramVanroy
a3b3638f6f phrasing 2020-01-28 10:41:10 -05:00
BramVanroy
c96ca70f25 Update ---new-benchmark.md 2020-01-28 10:41:10 -05:00
BramVanroy
7b5eda32bb Update --new-model-addition.md
Motivate users to @-tag authors of models to increase visibility and expand the community
2020-01-28 10:41:10 -05:00
BramVanroy
c63d91dd1c Update bug-report.md
- change references to pytorch-transformers to transformers
- link to code formatting guidelines
2020-01-28 10:41:10 -05:00
BramVanroy
b2907cd06e Update feature-request.md
- add 'your contribution' section
- add code formatting link to 'additional context'
2020-01-28 10:41:10 -05:00
BramVanroy
2fec88ee02 Update question-help.md
Prefer that general questions are asked on Stack Overflow
2020-01-28 10:41:10 -05:00
BramVanroy
7e03d2bd7c update migration guide
Streamlines usages of pytorch-transformers and pytorch-pretrained-bert. Add link to the README for the migration guide.
2020-01-28 10:41:10 -05:00
Lysandre
335dd5e68a Default save steps 50 to 500 in all scripts 2020-01-28 09:42:11 -05:00
Lysandre
ea2600bd5f Absolute definitive HeisenDistilBug solve
cc @julien-c @thomwolf
2020-01-27 21:58:36 -05:00
Wietse de Vries
5c3d441ee1 Fix formatting 2020-01-27 21:00:34 -05:00
Wietse de Vries
f5a236c3ca Add Dutch pre-trained BERT model 2020-01-27 21:00:34 -05:00
Julien Chaumond
6b4c3ee234 [run_lm_finetuning] GPT2 tokenizer doesn't have a pad_token
ping @lysandrejik
2020-01-27 20:14:02 -05:00
Julien Chaumond
79815bf666 [serving] Fix typo 2020-01-27 19:58:25 -05:00
Julien Chaumond
5004d5af42 [serving] Update dependencies 2020-01-27 19:58:00 -05:00
Lysandre
9ca21c838b Style 2020-01-27 14:49:12 -05:00
thomwolf
e0849a66ac adding in the doc 2020-01-27 14:27:07 -05:00
thomwolf
6b081f04e6 style and quality 2020-01-27 14:27:07 -05:00
thomwolf
0e31e06a75 Add AutoModelForPreTraining 2020-01-27 14:27:07 -05:00
Julien Chaumond
ea56d305be make style 2020-01-27 12:13:32 -05:00
Malte Pietsch
d440e21f5b add mapping of roberta for QA 2020-01-27 12:12:46 -05:00
Lysandre
875c4ae48f Definitive HeisenDistilBug fix
cc @julien-c @@thomwolf
2020-01-27 12:09:58 -05:00
Lysandre
f09f42d4d3 Input Embeddings should be assigned
cc @julien-c
2020-01-27 11:46:00 -05:00
Maksym Del
bac51fba3a Fix token_type_ids for XLM-R 2020-01-27 11:08:31 -05:00
Lysandre
babd41e7fa Code quality 2020-01-24 17:06:55 -05:00
Lysandre
974d083c7b Accurate model for configuration 2020-01-24 16:46:03 -05:00
Lysandre
983fef469c AutoModels doc 2020-01-24 16:37:30 -05:00
Lysandre
009fcb0ec1 Configuration utils 2020-01-24 16:37:30 -05:00
Julien Chaumond
11b13e94a3 Add type to help my IDE out 2020-01-24 14:00:57 -05:00
VictorSanh
1ce3fb5cc7 update correct eval metrics (distilbert & co) 2020-01-24 11:45:22 -05:00
Nicholas Lourie
62f5804608 Update the doc string for T5WithLMHeadModel
T5WithLMHeadModel's doc string claims that indices of -1 are
ignored while computing the cross-entropy loss in the forward
pass; however, indices of -1 throw an error while indices of -100
are ignored. This commit updates the doc string to be consistent
with the class's behavior.
2020-01-24 10:28:20 -05:00
Lysandre
908230d261 Pickle CamemBERT tokenizer 2020-01-24 10:08:59 -05:00
Lysandre
24d5ad1dcc Run the examples in slow 2020-01-23 09:38:45 -05:00
Lysandre
9ddf60b694 Tips + whitespaces 2020-01-23 09:38:45 -05:00
Lysandre
0e9899f451 Fixes 2020-01-23 09:38:45 -05:00
Lysandre
48ac24020d TF CTRL 2020-01-23 09:38:45 -05:00
Lysandre
7511f3dd89 PyTorch CTRL + Style 2020-01-23 09:38:45 -05:00
Lysandre
980211a63a XLM-RoBERTa 2020-01-23 09:38:45 -05:00
Lysandre
6bc966793a TF DistilBERT 2020-01-23 09:38:45 -05:00
Lysandre
db1a7f27a1 PyTorch DistilBERT 2020-01-23 09:38:45 -05:00
Lysandre
b28020f590 TF RoBERTa 2020-01-23 09:38:45 -05:00
Lysandre
3e1bc27e1b Pytorch RoBERTa 2020-01-23 09:38:45 -05:00
Lysandre
f44ff574d3 Camembert 2020-01-23 09:38:45 -05:00
Lysandre
264eb23912 TF XLM 2020-01-23 09:38:45 -05:00
Lysandre
ccebcae75f PyTorch XLM 2020-01-23 09:38:45 -05:00
Lysandre
92b3cb786d TF XLNet 2020-01-23 09:38:45 -05:00
Lysandre
cd656fb21a PyTorch XLNet 2020-01-23 09:38:45 -05:00
Lysandre
83fa8d9fb5 TF Transformer-XL 2020-01-23 09:38:45 -05:00
Lysandre
98edad418e PyTorch Transformer-XL 2020-01-23 09:38:45 -05:00
Lysandre
96d21ad06b TF OpenAI GPT 2020-01-23 09:38:45 -05:00
Lysandre
850795c487 Pytorch GPT 2020-01-23 09:38:45 -05:00
Lysandre
1487b840d3 TF GPT2 2020-01-23 09:38:45 -05:00
Lysandre
bd0d3fd76e GPT-2 PyTorch models + better tips for BERT 2020-01-23 09:38:45 -05:00
Lysandre
dbeb7fb4e6 BERT TensorFlow 2020-01-23 09:38:45 -05:00
Lysandre
cd77c750c5 BERT PyTorch models 2020-01-23 09:38:45 -05:00
Lysandre
3922a2497e TF ALBERT + TF Utilities + Fix warnings 2020-01-23 09:38:45 -05:00
Lysandre
00df3d4de0 ALBERT Modeling + required changes to utilities 2020-01-23 09:38:45 -05:00
Lysandre
f81b6c95f2 Flake8 violation 2020-01-23 09:38:45 -05:00
Lysandre
632675ea88 Can test examples spread over multiple blocks 2020-01-23 09:38:45 -05:00
Lysandre
eaa6b9afc6 Require Torch when testing examples 2020-01-23 09:38:45 -05:00
Lysandre
9bab9b83d2 Glossary 2020-01-23 09:38:45 -05:00
Lysandre
64abd3e0aa Multi-line examples can be tested + ALBERT patch for CircleCI
All tests should now work fine.
2020-01-23 09:38:45 -05:00
Lysandre
837577256b Automatic testing of examples
The CircleCI test should fail.
2020-01-23 09:38:45 -05:00
Julien Chaumond
90b7df444f Upload CLI: on win32, use slashes, not os.sep 2020-01-22 22:41:21 -05:00
Julien Chaumond
119dc50e2a Doc tweak on model sharing 2020-01-22 22:40:38 -05:00
Julien Chaumond
34a3c25a30 Fix for XLMRobertaConfig inherits from RobertaConfig
hat/tip @stefan-it
2020-01-22 17:50:24 -05:00
Julien Chaumond
1a8e87be4e Line-by-line text dataset (including padding) 2020-01-21 16:57:38 -05:00
Julien Chaumond
b94cf7faac change order 2020-01-21 16:57:38 -05:00
Julien Chaumond
2eaa8b6e56 Easier to not support this, as it could be confusing
cc @lysandrejik
2020-01-21 16:57:38 -05:00
Julien Chaumond
801aaa5508 make style 2020-01-21 16:57:38 -05:00
Julien Chaumond
56d4ba8ddb [run_lm_finetuning] Train from scratch 2020-01-21 16:57:38 -05:00
Lysandre
c7f79815e7 Cleanup unused variables 2020-01-21 11:40:24 -05:00
Lysandre
15579e2d55 [SQuAD v2] Code quality 2020-01-21 11:36:46 -05:00
Lysandre
088fa7b759 Correct segment ID for XLNet single sequence 2020-01-21 11:33:45 -05:00
Lysandre
073219b43f Manage impossible examples SQuAD v2 2020-01-21 11:24:43 -05:00
Branden Chan
983c484fa2 add __getstate__ and __setstate__ to XLMRobertaTokenizer 2020-01-21 10:18:24 -05:00
James Betker
cefd51c50c Fix glue processor failing on tf datasets 2020-01-20 11:46:43 -05:00
Lysandre
ca6ce3040d Fix style 2020-01-20 10:56:23 -05:00
Morgan Funtowicz
908cd5ea27 Make forward asynchrone to avoid long computation timing out.
Signed-off-by: Morgan Funtowicz <morgan@huggingface.co>
2020-01-20 10:56:23 -05:00
Morgan Funtowicz
6e6c8c52ed Fix bad handling of env variable USE_TF / USE_TORCH leading to invalid framework being used.
Signed-off-by: Morgan Funtowicz <morgan@huggingface.co>
2020-01-20 10:56:23 -05:00
Brendan Roof
23c6998bf4 Add lower bound to tqdm for tqdm.auto
- It appears that `tqdm` only introduced `tqdm.auto` in 4.27.
- See https://github.com/tqdm/tqdm/releases/tag/v4.27.0.
- Without the lower bound I received the following stack trace in an environment where I already had tqdm installed:
```
  File "/home/brendanr/anaconda3/envs/allennlp/lib/python3.6/site-packages/transformers/__init__.py", line 20, in <module>
    from .file_utils import (TRANSFORMERS_CACHE, PYTORCH_TRANSFORMERS_CACHE, PYTORCH_PRETRAINED_BERT_CACHE,
  File "/home/brendanr/anaconda3/envs/allennlp/lib/python3.6/site-packages/transformers/file_utils.py", line 24, in <module>
    from tqdm.auto import tqdm
ModuleNotFoundError: No module named 'tqdm.auto'
```
2020-01-17 18:29:11 -05:00
Mark Neumann
65a89a8976 Fix BasicTokenizer to respect never_split parameters (#2557)
* add failing test

* fix call to _run_split_on_punc

* format with black
2020-01-17 14:57:56 -05:00
jiyeon_baek
6d5049a24d Fix typo in examples/run_squad.py
Rul -> Run
2020-01-17 11:22:51 -05:00
Julien Chaumond
23a2cea8cb Tokenizer.from_pretrained: fetch all possible files remotely 2020-01-16 16:47:19 -05:00
Julien Chaumond
99f9243de5 same here, try to not serialize too much if unneeded 2020-01-16 16:47:19 -05:00
Julien Chaumond
9d8fd2d40e tokenizer.save_pretrained: only save file if non-empty 2020-01-16 16:47:19 -05:00
Lysandre
6e2c28a14a Run SQuAD warning when the doc stride may be too high 2020-01-16 13:59:26 -05:00
Thomas Wolf
b8f43cb273 Merge pull request #2239 from ns-moosavi/HANS-evaluation-example
HANS evaluation
2020-01-16 13:28:25 +01:00
thomwolf
258ed2eaa8 adding details in readme 2020-01-16 13:21:30 +01:00
thomwolf
50ee59578d update formating - make flake8 happy 2020-01-16 13:21:30 +01:00
thomwolf
1c9333584a formating 2020-01-16 13:21:30 +01:00
thomwolf
e25b6fe354 updating readme 2020-01-16 13:21:30 +01:00
thomwolf
27c7b99015 adding details in readme - moving file 2020-01-16 13:21:30 +01:00
Nafise Sadat Moosavi
99d4515572 HANS evaluation 2020-01-16 13:21:30 +01:00
Thomas Wolf
dc17f2a111 Merge pull request #2538 from huggingface/py3_super
💄 super
2020-01-16 13:17:15 +01:00
Thomas Wolf
880854846b Merge pull request #2540 from huggingface/torch14_fix
[PyTorch 1.4] Fix failing torchscript test for xlnet
2020-01-16 13:16:59 +01:00
Julien Chaumond
d9fa1bad72 Fix failing torchscript test for xlnet
model.parameters() order is apparently not stable (only for xlnet, for some reason)
2020-01-15 20:22:21 -05:00
Julien Chaumond
a98b2ca8c0 Style + fixup BertJapaneseTokenizer 2020-01-15 19:05:51 -05:00
Julien Chaumond
83a41d39b3 💄 super 2020-01-15 18:33:50 -05:00
Julien Chaumond
cd51893d37 Merge branch 'Rexhaif-patch-1' 2020-01-15 18:25:15 -05:00
Julien Chaumond
248aeaa842 Merge branch 'patch-1' of https://github.com/Rexhaif/transformers into Rexhaif-patch-1 2020-01-15 18:22:01 -05:00
Aditya Bhargava
c76c3cebed Add check for token_type_ids before tensorizing
Fix an issue where `prepare_for_model()` gives a `KeyError` when
`return_token_type_ids` is set to `False` and `return_tensors` is
enabled.
2020-01-15 12:31:43 -05:00
Julien Chaumond
eb59e9f705 Graduate sst-2 to a canonical one 2020-01-15 16:28:50 +00:00
Julien Chaumond
e184ad13cf Close #2392 2020-01-15 15:43:44 +00:00
Lysandre
dfe012ad9d Fix misleading RoBERTa token type ids 2020-01-14 17:47:28 -05:00
Lysandre
c024ab98df Improve padding side documentation 2020-01-14 17:44:23 -05:00
Lysandre
9aeb0b9b8a Improve padding side documentation 2020-01-14 17:43:00 -05:00
Julien Chaumond
715fa638a7 Merge branch 'master' into from_scratch_training 2020-01-14 18:58:21 +00:00
Lysandre
100e3b6f21 Bias should be resized with the weights
Created a link between the linear layer bias and the model attribute bias. This does not change anything for the user nor for the conversion scripts, but allows the `resize_token_embeddings` method to resize the bias as well as the weights of the decoder.

Added a test.
2020-01-14 13:43:45 -05:00
Lysandre
6c32d8bb95 Size > Dimensionality + Remove final TODOs 2020-01-14 14:09:09 +01:00
Lysandre
760164d63b RoBERTa example 2020-01-14 14:09:09 +01:00
Lysandre
387217bd3e Added example usage 2020-01-14 14:09:09 +01:00
Lysandre
7d1bb7f256 Add missing XLNet and XLM models 2020-01-14 14:09:09 +01:00
Lysandre
a1cb100460 Wrap up configurations 2020-01-14 14:09:09 +01:00
Lysandre
c11b6fd393 Update links in all configurations 2020-01-14 14:09:09 +01:00
Lysandre Debut
632682726f Updated Configurations 2020-01-14 14:09:09 +01:00
Thomas Wolf
2b566c182e Merge pull request #2384 from dimagalat/master
Releasing file lock
2020-01-14 13:19:01 +01:00
Julien Chaumond
764f836d52 Update test_tokenization_auto.py 2020-01-13 22:50:34 -05:00
Julien Chaumond
d5831acb07 Update test_tokenization_auto.py 2020-01-13 22:47:33 -05:00
Julien Chaumond
ed6cd597cc Update test_tokenization_auto.py 2020-01-13 22:46:35 -05:00
Julien Chaumond
5cb463a714 Update test_tokenization_auto.py 2020-01-13 22:38:29 -05:00
Julien Chaumond
afc24ea5d4 In a parallel setup this could fail 2020-01-13 23:44:08 +00:00
Julien Chaumond
894812c652 Fixup mapping 2020-01-13 23:34:19 +00:00
Julien Chaumond
b20f11d4ca 🔫 Python35 2020-01-13 23:20:44 +00:00
Julien Chaumond
0304628590 Map configs to models and tokenizers 2020-01-13 23:11:44 +00:00
Julien Chaumond
1fc855e456 [tests] Safety checks on CONFIG_MAPPING 2020-01-13 21:52:55 +00:00
Julien Chaumond
3c86b6f3c5 Py35 doesn't like inline variable types 2020-01-13 20:44:33 +00:00
Julien Chaumond
b803b067bf Config to Model mapping 2020-01-13 20:05:20 +00:00
Thomas Wolf
896a0eb1fd Merge pull request #2459 from Perseus14/patch-4
Update pipelines.py
2020-01-13 16:02:54 +01:00
Morgan Funtowicz
0d6c17fc1b black formatting 2020-01-13 11:18:27 +01:00
IWillPull
a3085020ed Added repetition penalty to PPLM example (#2436)
* Added repetition penalty

* Default PPLM repetition_penalty to neutral

* Minor modifications to comply with reviewer's suggestions. (j -> token_idx)

* Formatted code with `make style`
2020-01-10 23:00:07 -05:00
Julien Chaumond
cf8a70bf68 More AutoConfig tests 2020-01-11 03:43:57 +00:00
Julien Chaumond
6bb3edc300 Serialize model_type if exists 2020-01-11 03:18:56 +00:00
Julien Chaumond
c6f682c1eb flake 2020-01-11 03:18:31 +00:00
Julien Chaumond
4d1c98c012 AutoConfig + other Auto classes honor model_type 2020-01-11 02:46:17 +00:00
Julien Chaumond
2f32dfd33b Convention: name mixins mixins 2020-01-11 01:24:29 +00:00
VictorSanh
e83d9f1c1d cleaning - change ' to " (black requirements) 2020-01-10 19:34:25 -05:00
VictorSanh
ebba9e929d minor spring cleaning - missing configs + processing 2020-01-10 19:14:58 -05:00
Julien Chaumond
055e80cfad rm old ConfigTester 2020-01-10 21:36:18 +00:00
Thomas Wolf
b1e1a9f9b2 Merge pull request #2495 from mschrimpf/patch-1
T5: move rp_bucket to relative_attention_bias' device
2020-01-10 22:18:54 +01:00
Julien Chaumond
fd8423321f keep list sorted 2020-01-10 20:36:46 +00:00
Julien Chaumond
0cd81fb99f [isort] declare more third-parties in case no tf install 2020-01-10 20:35:45 +00:00
Martin Schrimpf
90d3b787f6 move rp_bucket to relative_attention_bias' device
otherwise, `rp_bucket` will always be on cpu and fail if `self.relative_attention_bias` is on cuda
2020-01-10 15:09:10 -05:00
Julien Chaumond
84c0aa1868 num_parameters helper 2020-01-10 17:40:02 +00:00
Victor SANH
331065e62d missing import 2020-01-10 11:42:53 +01:00
Victor SANH
414e9e7122 indents test 2020-01-10 11:42:53 +01:00
Victor SANH
3cdb38a7c0 indents 2020-01-10 11:42:53 +01:00
Victor SANH
ebd45980a0 Align with run_squad + fix some errors 2020-01-10 11:42:53 +01:00
Victor SANH
45634f87f8 fix Sampler in distributed training - evaluation 2020-01-10 11:42:53 +01:00
Victor SANH
af1ee9e648 Move torch.nn.utils.clip_grad_norm_ 2020-01-10 11:42:53 +01:00
Lysandre
164c794eb3 New SQuAD API for distillation script 2020-01-10 11:42:53 +01:00
Lysandre
801f2ac8c7 Add PRETRAINED_INIT_CONFIGURATION to DistilBERT tokenizer 2020-01-10 11:42:21 +01:00
Yohei Tamura
bfec203d4e modified: src/transformers/tokenization_utils.py 2020-01-09 12:54:28 +01:00
Julien Chaumond
f599623a99 PreTrainedTokenizerFast: hotfix _convert_encoding
cc @n1t0
2020-01-08 15:46:37 -05:00
Rishabh Manoj
f26a353057 Update pipelines.py
Modified QA pipeline to consider all features for each example before generating topk answers. 
Current pipeline only takes one SquadExample, one SquadFeature, one start logit list, one end logit list to retrieve the answer, this is not correct as one SquadExample can produce multiple SquadFeatures.
2020-01-08 21:12:34 +05:30
Lysandre
16ce15ed4b DistilBERT token type ids removed from inputs in run_squad 2020-01-08 13:18:30 +01:00
Lysandre Debut
f24232cd1b Fix error with global step in run_squad.py 2020-01-08 11:39:00 +01:00
thomwolf
1b59b57b57 ignore_index equal -100 in T5 model 2020-01-08 09:52:10 +01:00
Romain Keramitas
569da80ced Make doc regarding masked indices more clear.
Signed-off-by: Romain Keramitas <r.keramitas@gmail.com>
2020-01-07 17:37:27 +01:00
Oren Amsalem
43114b89ba spelling correction (#2434) 2020-01-07 17:25:25 +01:00
Genta Indra Winata
d6a677b14b Fix typograpical errors (#2438) 2020-01-07 17:21:23 +01:00
Lysandre Debut
27c1b656cc Fix error with global step in run_lm_finetuning.py 2020-01-07 16:16:12 +01:00
Lysandre
24df44d9c7 Black version python 3.5 2020-01-07 15:53:42 +01:00
Lysandre Debut
73be60c47b Quotes 2020-01-07 15:34:23 +01:00
Lysandre
6806f8204e fix #2410 2020-01-07 15:20:45 +01:00
Simone Primarosa
176d3b3079 Add support for Albert and XLMRoberta for the Glue example (#2403)
* Add support for Albert and XLMRoberta for the Glue example
2020-01-07 14:55:55 +01:00
Morgan Funtowicz
9261c7f771 Remove f-string device creation on PyTorch GPU pipelines.
Signed-off-by: Morgan Funtowicz <morgan@huggingface.co>
2020-01-07 11:46:44 +01:00
Morgan Funtowicz
91d33c798b Fix issue on pipelines where pytorch's tensors are not copied on the user-specified GPU device.
Signed-off-by: Morgan Funtowicz <morgan@huggingface.co>
2020-01-07 11:12:31 +01:00
Dima Galat
2926852f14 fixed formatting 2020-01-07 11:56:03 +11:00
Dima Galat
e2810edc8f removing redundant .flush 2020-01-07 11:47:25 +11:00
Julien Chaumond
c301faa92b Distributed or parallel setup 2020-01-06 18:41:08 -05:00
alberduris
81d6841b4b GPU text generation: mMoved the encoded_prompt to correct device 2020-01-06 15:11:12 +01:00
alberduris
dd4df80f0b Moved the encoded_prompts to correct device 2020-01-06 15:11:12 +01:00
Lysandre Debut
1efc208ff3 Complete DataProcessor class 2020-01-06 15:02:25 +01:00
Simone Primarosa
c45d0cf60f Improve logging message in the single sentence classification processor 2020-01-06 14:54:36 +01:00
Simone Primarosa
bf89be77b9 Improve logging message in the single sentence classification processor 2020-01-06 14:54:36 +01:00
Simone Primarosa
bf8d4bc674 Improve logging message in glue feature conversion 2020-01-06 14:54:36 +01:00
Lysandre
74755c89b9 Example snippet for BertForQuestionAnswering 2020-01-06 14:41:53 +01:00
Aymeric Augustin
0ffc8eaf53 Enforce target version for black.
This should stabilize formatting.
2020-01-05 12:52:14 -05:00
karajan1001
f01b3e6680 fix #2399 an ImportError in official example (#2400)
* fix #2399 an ImportError in official example

* style

Co-authored-by: Julien Chaumond <chaumond@gmail.com>
2020-01-05 12:50:20 -05:00
Julien Chaumond
78528742f1 Fix syntax + link to community page 2020-01-05 12:43:39 -05:00
Clement
12e0aa4368 Proposition to include community models in readme 2020-01-05 12:37:11 -05:00
Morgan Funtowicz
80faf22b4a Updating documentation for converting tensorflow model to reflect the new cli convert format.
Signed-off-by: Morgan Funtowicz <morgan@huggingface.co>
2020-01-04 13:41:18 +01:00
Dima
d0e594f9db Releasing file lock 2020-01-02 09:45:48 +11:00
Julien Chaumond
629b22adcf [run_lm_finetuning] mask_tokens: document types 2020-01-01 12:55:10 -05:00
Julien Chaumond
594ca6dead [debug] Debug Heisenbug, the old school way. 2019-12-29 10:07:21 -05:00
Julien Chaumond
0df4e62da0 [http] Tweak http user-agent (#2353) 2019-12-29 10:06:50 -05:00
Thomas Wolf
f75bf05ce6 Merge pull request #2352 from huggingface/cli_tweaks
Cli tweaks
2019-12-28 15:40:00 +01:00
Julien Chaumond
0d467fd6de Typo 2019-12-27 23:06:48 -05:00
Julien Chaumond
d8293e84f3 [cli] upload: max number of files at the same time 2019-12-27 23:02:53 -05:00
Julien Chaumond
4d6c93e923 Kill __main__ 2019-12-27 22:55:22 -05:00
Julien Chaumond
9b2badf3c9 [cli] Update doc 2019-12-27 22:54:29 -05:00
Julien Chaumond
f78ebc22ad [cli] Add ability to delete remote object 2019-12-27 22:53:49 -05:00
Anthony MOI
bfe870be65 Hotfix tokenizers version for sdist installs 2019-12-27 11:05:52 -05:00
Thomas Wolf
74ea432847 Merge pull request #2286 from adelevie/patch-2
Typo in tokenization_utils.py
2019-12-27 10:50:47 +01:00
Thomas Wolf
492bea9aa0 Merge pull request #2292 from patrickvonplaten/add_cached_past_for_language_generation
Add cached past for language generation
2019-12-27 10:33:27 +01:00
Thomas Wolf
e213900fa2 Merge pull request #2290 from patrickvonplaten/fix_typo_in_doc_for_language_generation
duplicated line for repeating_words_penalty_for_language_generation
2019-12-27 10:29:06 +01:00
Thomas Wolf
9f5f646442 Merge pull request #2211 from huggingface/fast-tokenizers
Fast tokenizers
2019-12-27 10:24:29 +01:00
Aymeric Augustin
9024b19994 Auto-format (fixes previous commit). 2019-12-27 10:13:52 +01:00
Aymeric Augustin
3233b58ad4 Quote square brackets in shell commands.
This ensures compatibility with zsh.

Fix #2316.
2019-12-27 08:50:25 +01:00
Anthony MOI
e6ec24fa88 Better added_tokens handling 2019-12-26 16:49:48 -05:00
Anthony MOI
599db139f9 Code style update 2019-12-26 15:13:30 -05:00
Anthony MOI
835b76a46f Handle unk_token
As we discussed, this is handled here directly 
cc @thomwolf
2019-12-26 14:42:55 -05:00
Anthony MOI
7ead04ce14 FastPreTrainedTokenizer => PreTrainedTokenizerFast 2019-12-26 14:39:39 -05:00
Anthony MOI
1f82a5d910 Update for changes in tokenizers API 2019-12-26 14:37:55 -05:00
Thomas Wolf
8c67b529f6 Merge pull request #2324 from kashif/patch-1
Typo in serving.py
2019-12-26 12:38:06 +01:00
Kashif Rasul
7211541ade Typo in serving.py 2019-12-26 12:21:40 +01:00
patrickvonplaten
0f6017bee3 improve comments for examples 2019-12-26 00:35:11 +01:00
patrickvonplaten
87c8fca9bc add example for ctrl text generation in docs 2019-12-26 00:29:19 +01:00
patrickvonplaten
88def24c45 merge conflicts - renamed to previous_token singular 2019-12-26 00:27:16 +01:00
patrickvonplaten
822f725a07 duplicated line for repeating_words_penalty_for_language_generation 2019-12-26 00:25:29 +01:00
patrickvonplaten
fc84bd5254 adapt style to predefined style layout 2019-12-25 23:32:44 +01:00
patrickvonplaten
deff792bb6 add prepare inputs for transfo_xl and xlnet 2019-12-25 23:17:24 +01:00
patrickvonplaten
9398058e19 add easy tensor shape match test 2019-12-25 23:17:24 +01:00
patrickvonplaten
90cda45e9e add past re-ordering for beam search 2019-12-25 23:17:24 +01:00
patrickvonplaten
6bca56fdb0 check for self.config.mem_len instead of self.mem_len in _do_output_past 2019-12-25 23:17:24 +01:00
patrickvonplaten
365ccd0af2 make if statements cleaner for prepare_inputs_for_generation 2019-12-25 23:17:24 +01:00
patrickvonplaten
d039c679d2 better naming for if statement 2019-12-25 23:17:24 +01:00
patrickvonplaten
7e0c5c731a changed do_output_past function to check for self.config.output_past instead of self.output_past 2019-12-25 23:17:24 +01:00
patrickvonplaten
eeaa402cd4 rename comments 2019-12-25 23:17:24 +01:00
patrickvonplaten
7bb4271291 remove ipdb debugging statements 2019-12-25 23:17:24 +01:00
patrickvonplaten
267587c258 add and improve comments 2019-12-25 23:17:24 +01:00
patrickvonplaten
d891fd0ae0 add past hidden key states for more efficient language generation & add prepare_inputs for gpt2 and ctrl model 2019-12-25 23:17:24 +01:00
Thomas Wolf
aeef4823ab Merge pull request #2303 from patrickvonplaten/fix_error_with_repetition_penalty
fix repetition penalty error in modeling_utils.py
2019-12-25 22:39:20 +01:00
Thomas Wolf
0412f3d929 Merge pull request #2291 from aaugustin/fix-flake8-F841
Fix F841 flake8 warning
2019-12-25 22:37:42 +01:00
Thomas Wolf
8742c95461 Merge pull request #2289 from patrickvonplaten/fix_effective_batch_size_lang_gen_xlm
fix bug in prepare inputs for language generation for xlm for effective batch_size > 1
2019-12-25 22:30:46 +01:00
Thomas Wolf
1240be3ed9 Merge pull request #2312 from vitaliyradchenko/fix_special_and_add_tokens_loading
Correct tokenization for special and added tokens
2019-12-25 20:52:30 +01:00
vitaliyradchenko
b262577d17 add special tokens to unique_added_tokens_encoder 2019-12-25 18:31:35 +02:00
vitaliyradchenko
83a2347952 fixed lack of added and special tokens 2019-12-25 18:03:19 +02:00
Thomas Wolf
cea04a2443 Merge pull request #2310 from ShnitzelKiller/scatter-unfix
revert erroneous fix #2276
2019-12-25 12:43:22 +01:00
James Noeckel
e1844d9a45 use positional arguments due to inconsistent API 2019-12-25 01:34:02 -08:00
James Noeckel
9fb7addd4d revert erroneous fix 2019-12-24 22:26:09 -08:00
Anthony MOI
734d29b03d tokenizers is now a real dependency 2019-12-24 13:32:41 -05:00
Anthony MOI
2818e50569 Add tests for fast tokenizers 2019-12-24 13:29:01 -05:00
Anthony MOI
31c56f2e0b Fix style 2019-12-24 12:43:27 -05:00
Anthony MOI
951ae99bea BertTokenizerFast 2019-12-24 12:24:24 -05:00
Anthony MOI
041eac2d6d GPT2TokenizerFast 2019-12-24 12:24:14 -05:00
Anthony MOI
3471ff0d35 FastPreTrainedTokenizer 2019-12-24 12:23:30 -05:00
patrickvonplaten
18e5bdbec5 fix repetition penalty error in modeling_utils.py 2019-12-24 17:18:05 +01:00
patrickvonplaten
f18ac4c28e fix sequence length for prepare_inputs for xlnet 2019-12-24 16:43:24 +01:00
patrickvonplaten
359dc43837 fix effective batch_size error in prepare_inputs also for xlnet 2019-12-24 16:33:20 +01:00
patrickvonplaten
d98a384cb0 fix bug in prepare inputs for language generation for xlm for effective batch_size > 1 2019-12-24 16:29:54 +01:00
thomwolf
3e0cf49514 adding back last dropout in TF 2.0 T5 2019-12-24 11:30:56 +01:00
thomwolf
35d32308de adding back final dropout in T5 2019-12-24 11:29:49 +01:00
Thomas Wolf
81db12c3ba Merge pull request #2271 from aaugustin/improve-setup-and-requirements
Improve setup and requirements
2019-12-24 11:21:20 +01:00
Aymeric Augustin
10724a8123 Run the slow tests every Monday morning. 2019-12-24 09:09:43 +01:00
Aymeric Augustin
a8d34e534e Remove [--editable] in install instructions.
Use -e only in docs targeted at contributors.

If a user copy-pastes  command line with [--editable], they will hit
an error. If they don't know the --editable option, we're giving them
a choice to make before they can move forwards, but this isn't a choice
they need to make right now.
2019-12-24 08:46:08 +01:00
Aymeric Augustin
e74c73a85d Enable F841 warning in flake8. 2019-12-23 22:38:23 +01:00
Aymeric Augustin
e6c0019c80 Remove unused variables in tests. 2019-12-23 22:38:18 +01:00
Aymeric Augustin
495580dad1 Remove unused variables in templates. 2019-12-23 22:38:18 +01:00
Aymeric Augustin
71f94a8a1c Remove unused variables in src. 2019-12-23 22:38:09 +01:00
Aymeric Augustin
81422c4e6d Remove unused variables in examples. 2019-12-23 22:29:02 +01:00
Aymeric Augustin
072750f4dc Merge pull request #2288 from aaugustin/better-handle-optional-imports
Improve handling of optional imports
2019-12-23 22:28:47 +01:00
Aymeric Augustin
4621ad6f9d Use the same pattern as everywhere else.
This is really just for consistency.
2019-12-23 21:30:04 +01:00
Aymeric Augustin
a31d4a2971 Reraise ImportError when sentencepiece isn't installed.
Else, the next line fails with a confusion exception because the spm
variable isn't defined.
2019-12-23 21:27:42 +01:00
Aymeric Augustin
c8b0c1e551 Improve exception type.
ImportError isn't really appropriate when there's no import involved.
2019-12-23 21:27:38 +01:00
Aymeric Augustin
4c09a96096 Simplify re-raising exceptions.
Most module use the simpler `raise` version. Normalize those that don't.
2019-12-23 21:20:54 +01:00
Aymeric Augustin
5565dcdd35 Remove warning when scikit-learn isn't available.
Most users don't need it.
2019-12-23 21:16:26 +01:00
Aymeric Augustin
8a6881822a Run some tests on Python 3.7.
This will improve version coverage.
2019-12-23 21:06:23 +01:00
Aymeric Augustin
7a865821d9 Remove stray egg-info directory automatically.
If a user or contributor ran `pip install -e .` on transformers < 3.0,
pip created a transformers.egg-info directory next to the transformers
directory at the root of the repository.

In transformers 3.0, the source is in a `src` subdirectory.
`pip install -e .` creates a transformers.egg-info directory there.
However, pip will still pick transformers.egg-info from the previous
location. This is a bug: https://github.com/pypa/pip/issues/5466

Users and contributors are likely to hit this problem because the
documentation for transformers 3.0 relies heavily on extra_requires
which didn't exist in earlier versions, so aren't defined in a stale
transformers.egg-info directory.

If such a directory exists, remove it. It's autogenerated, gitignored
and not supposed to contain anything of value.
2019-12-23 21:06:23 +01:00
Aymeric Augustin
70373a5f7c Update contribution instructions.
Also provide shortcuts in a Makefile.
2019-12-23 21:05:30 +01:00
Aymeric Augustin
c3783399db Remove redundant requirements with transformers. 2019-12-23 19:17:27 +01:00
Aymeric Augustin
d79e9c9a9a Remove docs/requirements.txt.
It's superseded by the "docs" extras.
2019-12-23 19:17:07 +01:00
Aymeric Augustin
d73eb552e8 Remove requirements.txt.
It's redundant with setup.py and, also, incomplete (e.g. numpy).
2019-12-23 19:15:08 +01:00
Aymeric Augustin
9fcc532df6 Remove requirements-dev.txt.
It was generated once, likely in a non-reproducible way (pip freeze
in a contributor's local environment), and never updated.
2019-12-23 19:14:36 +01:00
Aymeric Augustin
76a1417f2a Include all optional dependencies in extras.
Take advantage of this to simplify the Circle CI configuration.

Don't bother with tensorboardX: it's a fallback for PyTorch < 1.1.0.
2019-12-23 19:14:31 +01:00
Aymeric Augustin
9fc8dcb2a0 Standardize import.
Every other file uses this pattern.
2019-12-23 18:45:42 +01:00
Aymeric Augustin
f2522869ea Review and update setup.py. 2019-12-23 18:45:42 +01:00
Alan deLevie
7cef764ec0 Typo in tokenization_utils.py
avoir -> avoid
2019-12-23 12:14:50 -05:00
Aymeric Augustin
23dad8447c Install deps from setup.py for building docs.
requirements.txt isn't up to date.
2019-12-23 17:06:32 +01:00
Aymeric Augustin
d8e33dbd67 Fix path to source code in docs config.
This should fix API docs, which went AWOL with yesterday's changes.
2019-12-23 16:49:35 +01:00
thomwolf
59b123bc50 fix tqdm logging level 2019-12-23 16:47:24 +01:00
Thomas Wolf
ba2378ced5 Merge pull request #2264 from upura/fix-doclink
Fix doc link in README
2019-12-23 12:31:00 +01:00
Thomas Wolf
e4e2a666c9 Merge pull request #2276 from ShnitzelKiller/scatterfix
fix error due to wrong argument name to Tensor.scatter()
2019-12-23 12:19:48 +01:00
James Noeckel
398bb03f98 fix out-of-place call to scatter, whose named argument name is source, not src 2019-12-22 23:30:52 -08:00
Aymeric Augustin
ce50305e5b Merge pull request #2270 from aaugustin/remove-python-2
Remove support for Python 2
2019-12-22 23:04:37 +01:00
Aymeric Augustin
1a948d7020 Switch from comments to annotations for types. 2019-12-22 18:56:01 +01:00
Aymeric Augustin
1c62e87b34 Use built-in open().
On Python 3, `open is io.open`.
2019-12-22 18:38:56 +01:00
Aymeric Augustin
d6eaf4e6d2 Update comments mentioning Python 2. 2019-12-22 18:38:56 +01:00
Aymeric Augustin
45841eaf7b Remove references to Python 2 in documentation. 2019-12-22 18:38:56 +01:00
Aymeric Augustin
0dddc1494d Remove py3 marker. 2019-12-22 18:38:56 +01:00
Aymeric Augustin
75a23d24af Remove import fallbacks. 2019-12-22 18:38:56 +01:00
Aymeric Augustin
798b3b3899 Remove sys.version_info[0] == 2 or 3. 2019-12-22 18:38:42 +01:00
Aymeric Augustin
8af25b1664 Remove six. 2019-12-22 17:56:09 +01:00
Aymeric Augustin
6b2200fc88 Remove u-prefixes. 2019-12-22 17:47:54 +01:00
Aymeric Augustin
c824d15aa1 Remove __future__ imports. 2019-12-22 17:47:54 +01:00
Aymeric Augustin
b6ea0f43ae Remove duplicate -v flag. 2019-12-22 17:47:27 +01:00
Thomas Wolf
5daca95ddd Merge pull request #2268 from aaugustin/improve-repository-structure
Improve repository structure
2019-12-22 16:41:53 +01:00
Thomas Wolf
54abc67aec Merge pull request #2255 from aaugustin/implement-best-practices
Implement some Python best practices
2019-12-22 16:31:11 +01:00
Aymeric Augustin
00204f2b4c Replace CommonTestCases for tokenizers with a mixin.
This is the same change as for (TF)CommonTestCases for modeling.
2019-12-22 15:35:25 +01:00
Aymeric Augustin
a3c5883f2c Rename file for consistency. 2019-12-22 15:35:25 +01:00
Aymeric Augustin
daf8bebcdd Remove unused GPTModelTester.
It isn't imported anywhere.
2019-12-22 15:35:25 +01:00
Aymeric Augustin
345c23a60f Replace (TF)CommonTestCases for modeling with a mixin.
I suspect the wrapper classes were created in order to prevent the
abstract base class (TF)CommonModelTester from being included in test
discovery and running, because that would fail.

I solved this by replacing the abstract base class with a mixin.

Code changes are just de-indenting and automatic reformattings
performed by black to use the extra line space.
2019-12-22 15:35:18 +01:00
Aymeric Augustin
7e98e211f0 Remove unittest.main() in test modules.
This construct isn't used anymore these days.

Running python tests/test_foo.py puts the tests/ directory on
PYTHONPATH, which isn't representative of how we run tests.

Use python -m unittest tests/test_foo.py instead.
2019-12-22 14:42:03 +01:00
Aymeric Augustin
6be7cdda66 Move source code inside a src subdirectory.
This prevents transformers from being importable simply because the CWD
is the root of the git repository, while not being importable from other
directories. That led to inconsistent behavior, especially in examples.

Once you fetch this commit, in your dev environment, you must run:

    $ pip uninstall transformers
    $ pip install -e .
2019-12-22 14:15:13 +01:00
Aymeric Augustin
ced0a94204 Switch test files to the standard test_*.py scheme. 2019-12-22 14:15:13 +01:00
Aymeric Augustin
067395d5c5 Move tests outside of library. 2019-12-22 13:47:17 +01:00
Aymeric Augustin
698f9e3d7a Remove trailing whitespace in README. 2019-12-22 13:29:58 +01:00
Aymeric Augustin
c11b3e2926 Sort imports for optional third-party libraries.
These libraries aren't always installed in the virtual environment where
isort is running. Declaring them properly avoids mixing these
third-party imports with local imports.
2019-12-22 11:19:13 +01:00
Aymeric Augustin
2a34d5b71b Stabilize import order for packaging.
I don't want to consider it a dependency of transformers, but it's
usually there in local development and usually not there in CI.
2019-12-22 11:07:31 +01:00
Aymeric Augustin
c9270086ea Disable flake8 F841 in CI to get a passing run.
I'll fix it later.
2019-12-22 11:00:06 +01:00
Aymeric Augustin
577a03664d Enforce flake8 in CI. 2019-12-22 11:00:04 +01:00
Aymeric Augustin
7c6812645a Restore proper import for HTTPError. 2019-12-22 10:59:08 +01:00
Aymeric Augustin
939148b050 Fix F401 flake8 warning (x28).
Do manually what autoflake couldn't manage.
2019-12-22 10:59:08 +01:00
Aymeric Augustin
783a616999 Fix F401 flake8 warning (x88 / 116).
This change is mostly autogenerated with:

    $ python -m autoflake --in-place --recursive --remove-all-unused-imports --ignore-init-module-imports examples templates transformers utils hubconf.py setup.py

I made minor changes in the generated diff.
2019-12-22 10:59:08 +01:00
Aymeric Augustin
80327a13ea Fix F401 flake8 warning (x152 / 268).
This change is mostly autogenerated with:

    $ python -m autoflake --in-place --recursive examples templates transformers utils hubconf.py setup.py

I made minor changes in the generated diff.
2019-12-22 10:59:08 +01:00
Aymeric Augustin
654e051e2a Ignore F401 flake8 warning (x326 / 594). 2019-12-22 10:59:08 +01:00
Aymeric Augustin
fa2ccbc081 Fix E266 flake8 warning (x90). 2019-12-22 10:59:08 +01:00
Aymeric Augustin
2ab78325f0 Fix F821 flake8 warning (x47).
Ignore warnings related to Python 2, because it's going away soon.
2019-12-22 10:59:07 +01:00
Aymeric Augustin
631be27078 Fix E722 flake8 warnings (x26). 2019-12-22 10:59:07 +01:00
Aymeric Augustin
b0f7db73cd Fix E741 flake8 warning (x14). 2019-12-22 10:59:07 +01:00
Aymeric Augustin
ea89bec185 Fix E231 flake8 warning (x9). 2019-12-22 10:59:07 +01:00
Aymeric Augustin
fd2f17a7a1 Fix E714 flake8 warning (x8). 2019-12-22 10:59:07 +01:00
Aymeric Augustin
5eab3cf6bc Fix W605 flake8 warning (x5). 2019-12-22 10:59:07 +01:00
Aymeric Augustin
7dce8dc7ac Fix E731 flake8 warning (x3). 2019-12-22 10:59:07 +01:00
Aymeric Augustin
eed46f38b7 Fix E302 flake8 warning (x3). 2019-12-22 10:59:07 +01:00
Aymeric Augustin
b1de7ae08a Fix F811 flake8 warning (x1). 2019-12-22 10:59:07 +01:00
Aymeric Augustin
357db7098c Fix E712 flake8 warning (x1). 2019-12-22 10:59:07 +01:00
Aymeric Augustin
f9c5317db2 Fix E265 flake8 warning (x1). 2019-12-22 10:59:07 +01:00
Aymeric Augustin
28e608a2c2 Remove trailing whitespace from all Python files.
Fixes flake8 warning W291 (x224).
2019-12-22 10:59:07 +01:00
Aymeric Augustin
1efa0a7552 Add black-compatible flake8 configuration. 2019-12-22 10:59:07 +01:00
Aymeric Augustin
d0c9fe277a Fix circular import in transformers.pipelines.
Submodules shouldn't import from their parent in general.
2019-12-22 10:59:07 +01:00
Aymeric Augustin
5ca054757f Update "make style" to sort imports with isort. 2019-12-22 10:59:07 +01:00
Aymeric Augustin
9e80fc7b2f Enforce isort in CI.
We need https://github.com/timothycrosley/isort/pull/1000 but there's no
release with this fix yet, so we'll install from GitHub.
2019-12-22 10:59:00 +01:00
Aymeric Augustin
158e82e061 Sort imports with isort.
This is the result of:

    $ isort --recursive examples templates transformers utils hubconf.py setup.py
2019-12-22 10:57:46 +01:00
upura
9d00f78f16 fix doc link 2019-12-22 16:07:05 +09:00
Daniil Larionov
b668a740ca Fixing incorrect link in model docstring
The docstring contains a link to Salesforce/CTRL repo, while the model itself is Facebookresearch/mmbt. It may be the wrong copy\paste.
2019-12-22 00:01:14 +03:00
Aymeric Augustin
bc1715c1e0 Add black-compatible isort configuration.
lines_after_imports = 2 is a matter of taste; I like it.
2019-12-21 17:53:18 +01:00
Aymeric Augustin
36883c1192 Add "make style" to format code with black. 2019-12-21 17:53:18 +01:00
Aymeric Augustin
6e5291a915 Enforce black in CI. 2019-12-21 17:53:18 +01:00
Aymeric Augustin
fa84ae26d6 Reformat source code with black.
This is the result of:

    $ black --line-length 119 examples templates transformers utils hubconf.py setup.py

There's a lot of fairly long lines in the project. As a consequence, I'm
picking the longest widely accepted line length, 119 characters.

This is also Thomas' preference, because it allows for explicit variable
names, to make the code easier to understand.
2019-12-21 17:52:29 +01:00
Aymeric Augustin
63e3827c6b Remove empty file.
Likely it was added by accident.
2019-12-21 15:38:08 +01:00
Thomas Wolf
645713e2cb Merge pull request #2254 from huggingface/fix-tfroberta
adding positional embeds masking to TFRoBERTa
2019-12-21 15:33:22 +01:00
Thomas Wolf
73f6e9817c Merge pull request #2115 from suvrat96/add_mmbt_model
[WIP] Add MMBT Model to Transformers Repo
2019-12-21 15:26:08 +01:00
thomwolf
77676c27d2 adding positional embeds masking to TFRoBERTa 2019-12-21 15:24:48 +01:00
thomwolf
344126fe58 move example to mm-imdb folder 2019-12-21 15:06:52 +01:00
Thomas Wolf
5b7fb6a4a1 Merge pull request #2134 from bkkaggle/saving-and-resuming
closes #1960 Add saving and resuming functionality for remaining examples
2019-12-21 15:03:53 +01:00
Thomas Wolf
6f68d559ab Merge pull request #2130 from huggingface/ignored-index-coherence
[BREAKING CHANGE] Setting all ignored index to the PyTorch standard
2019-12-21 14:55:40 +01:00
thomwolf
1ab25c49d3 Merge branch 'master' into pr/2115 2019-12-21 14:54:30 +01:00
thomwolf
b03872aae0 fix merge 2019-12-21 14:49:54 +01:00
Thomas Wolf
518ba748e0 Merge branch 'master' into saving-and-resuming 2019-12-21 14:41:39 +01:00
Thomas Wolf
18601c3b6e Merge pull request #2173 from erenup/master
run_squad with roberta
2019-12-21 14:33:16 +01:00
Thomas Wolf
6e7102cfb3 Merge pull request #2203 from gthb/patch-1
fix: wrong architecture count in README
2019-12-21 14:31:44 +01:00
Thomas Wolf
deceb00161 Merge pull request #2177 from mandubian/issue-2106
:zip: #2106 tokenizer.tokenize speed improvement (3-8x) by caching added_tokens in a Set
2019-12-21 14:31:20 +01:00
Thomas Wolf
eeb70cdd77 Merge branch 'master' into saving-and-resuming 2019-12-21 14:29:59 +01:00
Thomas Wolf
ed9b84816e Merge pull request #1840 from huggingface/generation_sampler
[WIP] Sampling sequence generator for transformers
2019-12-21 14:27:35 +01:00
thomwolf
f86ed23189 update doc 2019-12-21 14:13:06 +01:00
thomwolf
cfa0380515 Merge branch 'master' into generation_sampler 2019-12-21 14:12:52 +01:00
thomwolf
300ec3003c fixing run_generation example - using torch.no_grad 2019-12-21 14:02:19 +01:00
thomwolf
1c37746892 fixing run_generation 2019-12-21 13:52:49 +01:00
Thomas Wolf
7e17f09fb5 Merge pull request #1803 from importpandas/fix-xlnet-squad2.0
fix run_squad.py during fine-tuning xlnet on squad2.0
2019-12-21 13:38:48 +01:00
thomwolf
8a2be93b4e fix merge 2019-12-21 13:31:28 +01:00
Thomas Wolf
562f864038 Merge branch 'master' into fix-xlnet-squad2.0 2019-12-21 12:48:10 +01:00
Thomas Wolf
8618bf15d6 Merge pull request #1736 from huggingface/fix-tf-xlnet
Fix TFXLNet
2019-12-21 12:42:05 +01:00
Thomas Wolf
2fa8737c44 Merge pull request #1586 from enzoampil/include_special_tokens_in_bert_examples
Add special tokens to documentation for bert examples to resolve issue: #1561
2019-12-21 12:36:11 +01:00
Thomas Wolf
f15f087143 Merge pull request #1764 from DomHudson/bug-fix-1761
Bug-fix: Roberta Embeddings Not Masked
2019-12-21 12:13:27 +01:00
Thomas Wolf
fae4d1c266 Merge pull request #2217 from aaugustin/test-parallelization
Support running tests in parallel
2019-12-21 11:54:23 +01:00
Aymeric Augustin
b8e924e10d Restore test.
This looks like debug code accidentally committed in b18509c2.

Refs #2250.
2019-12-21 08:50:15 +01:00
Aymeric Augustin
767bc3ca68 Fix typo in model name.
This looks like a copy/paste mistake. Probably this test was never run.

Refs #2250.
2019-12-21 08:46:26 +01:00
Aymeric Augustin
343c094f21 Run examples separately from tests.
This optimizes the total run time of the Circle CI test suite.
2019-12-21 08:43:19 +01:00
Aymeric Augustin
80caf79d07 Prevent excessive parallelism in PyTorch.
We're already using as many processes in parallel as we have CPU cores.
Furthermore, the number of core may be incorrectly calculated as 36
(we've seen this in pytest-xdist) which make compound the problem.

PyTorch performance craters without this.
2019-12-21 08:43:19 +01:00
Aymeric Augustin
bb3bfa2d29 Distribute tests from the same file to the same worker.
This should prevent two issues:

- hitting API rate limits for tests that hit the HF API
- multiplying the cost of expensive test setups
2019-12-21 08:43:19 +01:00
Aymeric Augustin
29cbab98f0 Parallelize tests on Circle CI.
Set the number of CPUs manually based on the Circle CI resource class,
or else we're getting 36 CPUs, which is far too much (perhaps that's
the underlying hardware and not what Circle CI allocates to us).

Don't parallelize the custom tokenizers tests because they take less
than one second to run and parallelization actually makes them slower.
2019-12-21 08:43:19 +01:00
Aymeric Augustin
a4c9338b83 Prevent parallel downloads of the same file with a lock.
Since the file is written to the filesystem, a filesystem lock is the
way to go here. Add a dependency on the third-party filelock library to
get cross-platform functionality.
2019-12-21 08:43:19 +01:00
Aymeric Augustin
b670c26684 Take advantage of the cache when running tests.
Caching models across test cases and across runs of the test suite makes
slow tests somewhat more bearable.

Use gettempdir() instead of /tmp in tests. This makes it easier to
change the location of the cache with semi-standard TMPDIR/TEMP/TMP
environment variables.

Fix #2222.
2019-12-21 08:43:19 +01:00
Aymeric Augustin
b67fa1a8d2 Download models directly to cache_dir.
This allows moving the file instead of copying it, which is more
reliable. Also it avoids writing large amounts of data to /tmp,
which may not be large enough to accomodate it.

Refs #2222.
2019-12-21 08:43:19 +01:00
Aymeric Augustin
286d5bb6b7 Use a random temp dir for writing pruned models in tests. 2019-12-21 08:43:19 +01:00
Aymeric Augustin
478e456e83 Use a random temp dir for writing file in tests. 2019-12-21 08:43:19 +01:00
Aymeric Augustin
12726f8556 Remove redundant torch.jit.trace in tests.
This looks like it could be expensive, so don't run it twice.
2019-12-21 08:43:19 +01:00
Julien Chaumond
ac1b449cc9 [doc] move distilroberta to more appropriate place
cc @lysandrejik
2019-12-21 00:09:01 -05:00
Julien Chaumond
3e52915fa7 [RoBERTa] Embeddings: fix dimensionality bug 2019-12-20 19:01:27 -05:00
Dom Hudson
228f52867c Bug fix: 1764 2019-12-20 18:27:35 -05:00
Francesco
a80778f40e small refactoring (only esthetic, not functional) 2019-12-20 17:21:24 -05:00
Francesco
3df1d2d144 - Create the output directory (whose name is passed by the user in the "save_directory" parameter) where it will be saved encoder and decoder, if not exists.
- Empty the output directory, if it contains any files or subdirectories.
- Create the "encoder" directory inside "save_directory", if not exists.
- Create the "decoder" directory inside "save_directory", if not exists.
- Save the encoder and the decoder in the previous two directories, respectively.
2019-12-20 17:21:24 -05:00
Lysandre
a436574bfd Release: v2.3.0 2019-12-20 16:22:20 -05:00
Thomas Wolf
d0f8b9a978 Merge pull request #2244 from huggingface/fix-tok-pipe
Fix Camembert and XLM-R `decode` method- Fix NER pipeline alignement
2019-12-20 22:10:39 +01:00
Thomas Wolf
a557836a70 Merge pull request #2191 from huggingface/fix_sp_np
Numpy compatibility for sentence piece
2019-12-20 22:08:08 +01:00
thomwolf
655fd06853 clean up 2019-12-20 21:57:49 +01:00
thomwolf
e5812462fc clean up debug and less verbose tqdm 2019-12-20 21:51:48 +01:00
thomwolf
4775ec354b add overwrite - fix ner decoding 2019-12-20 21:47:15 +01:00
Lysandre
cb6d54bfda Numpy compatibility for sentence piece
convert to int earlier
2019-12-20 15:06:28 -05:00
thomwolf
f79a7dc661 fix NER pipeline 2019-12-20 20:57:45 +01:00
thomwolf
a241011057 fix pipeline NER 2019-12-20 20:43:48 +01:00
thomwolf
e37ca8e11a fix camembert and XLM-R tokenizer 2019-12-20 20:43:42 +01:00
thomwolf
ceae85ad60 fix mc loading 2019-12-20 19:52:24 +01:00
thomwolf
71883b6ddc update link in readme 2019-12-20 19:40:23 +01:00
Thomas Wolf
8d5a47c79b Merge pull request #2243 from huggingface/fix-xlm-roberta
fixing xlm-roberta tokenizer max_length and automodels
2019-12-20 19:34:08 +01:00
thomwolf
79e4a6a25c update serving API 2019-12-20 19:33:12 +01:00
thomwolf
bbaaec046c fixing CLI pipeline 2019-12-20 19:19:20 +01:00
thomwolf
1c12ee0e55 fixing xlm-roberta tokenizer max_length and automodels 2019-12-20 18:28:27 +01:00
Lysandre
65c75fc587 Clean special tokens test 2019-12-20 11:34:16 -05:00
Lysandre
fb393ad994 Added test for all special tokens 2019-12-20 11:29:58 -05:00
Dirk Groeneveld
90debb9ff2 Keep even the first of the special tokens intact while lowercasing. 2019-12-20 11:29:43 -05:00
Morgan Funtowicz
b98ff88544 Added pipelines quick tour in README 2019-12-20 15:52:50 +01:00
Thomas Wolf
3a2c4e6f63 Merge pull request #1548 from huggingface/cli
[2.2] - Command-line interface - Pipeline class
2019-12-20 15:28:29 +01:00
Rémi Louf
4e3f745ba4 add example for Model2Model in quickstart 2019-12-20 09:12:31 -05:00
thomwolf
db0795b5d0 defaults models for tf and pt - update tests 2019-12-20 15:07:00 +01:00
Morgan Funtowicz
7f74084528 Fix leading axis added when saving through the command run 2019-12-20 14:47:04 +01:00
thomwolf
c37815f130 clean up PT <=> TF 2.0 conversion and config loading 2019-12-20 14:35:40 +01:00
thomwolf
73fcebf7ec update serving command 2019-12-20 13:47:35 +01:00
Thomas Wolf
59941c5d1f Merge pull request #2189 from stefan-it/xlmr
Add support for XLM-RoBERTa
2019-12-20 13:26:38 +01:00
thomwolf
15dda5ea32 remove python 2 tests for circle-ci cc @aaugustin @julien-c @LysandreJik 2019-12-20 13:20:41 +01:00
thomwolf
01ffc65e9b update tests to remove unittest.patch 2019-12-20 13:16:23 +01:00
thomwolf
825697cad4 fix tests 2019-12-20 12:51:10 +01:00
thomwolf
1fa93ca1ea Clean up framework handling 2019-12-20 12:34:19 +01:00
thomwolf
ca6bdb28f6 fix pipelines and rename model_card => modelcard 2019-12-20 12:10:40 +01:00
Morgan Funtowicz
61d9ee45e3 All tests are green. 2019-12-20 11:47:56 +01:00
Thomas Wolf
ff36e6d8d7 Merge pull request #2231 from huggingface/requests_user_agent
[http] customizable requests user-agent
2019-12-20 10:28:10 +01:00
Morgan Funtowicz
e516a34a15 Use BasicTokenizer to split over whitespaces. 2019-12-20 09:38:08 +01:00
Morgan Funtowicz
9d0d1cd339 Filter out entity for NER task. 2019-12-20 09:30:37 +01:00
Julien Chaumond
15d897ff4a [http] customizable requests user-agent 2019-12-19 18:29:22 -05:00
Julien Chaumond
f25e9b6f77 [hf_bucket_url] support for cloudfront urls 2019-12-19 18:28:17 -05:00
Julien Chaumond
a5a06a851e [doc] Param name consistency 2019-12-19 16:24:20 -05:00
Aidan Kierans
1718fb9e74 Minor/basic text fixes (#2229)
* Small clarification

Matches line 431 to line 435 for additional clarity and consistency.

* Fixed minor typo

The letter "s" was previously omitted from the word "docstrings".
2019-12-19 16:23:18 -05:00
Julien Chaumond
9a399ead25 Revert incorrect #1778 2019-12-19 15:45:48 -05:00
Stefan Schweter
3376adc051 configuration/modeling/tokenization: add various fine-tuned XLM-RoBERTa models for English, German, Spanish and Dutch (CoNLL datasets) 2019-12-19 21:30:23 +01:00
thomwolf
e4baa68ddb tick-tock cc @julien-c 2019-12-19 20:37:26 +01:00
thomwolf
149dc376aa fix tests 2019-12-19 20:34:28 +01:00
thomwolf
407093b3fa Merge branch 'cli' of https://github.com/huggingface/transformers into cli 2019-12-19 20:26:51 +01:00
thomwolf
c7be096c39 Merge branch 'master' into cli 2019-12-19 20:26:08 +01:00
Morgan Funtowicz
a305067f2d Removed __main__ 2019-12-19 19:41:48 +01:00
Morgan Funtowicz
3492a6ec17 Addressing Thom's comments. 2019-12-19 19:06:44 +01:00
Lysandre
33adab2b91 Fix albert example 2019-12-19 12:40:43 -05:00
Lysandre
a1f1dce0ae Correct max position for SQUAD and TFDS 2019-12-19 12:25:55 -05:00
Francesco
62c1fc3c1e Removed duplicate XLMConfig, XLMForQuestionAnswering and XLMTokenizer from import statement of run_squad.py script 2019-12-19 09:50:56 -05:00
Ejar
284572efc0 Updated typo on the link
Updated documentation due to typo
2019-12-19 09:36:43 -05:00
patrickvonplaten
ed6ba93912 corrected typo in example for t5 model input argument 2019-12-19 09:34:55 -05:00
Morgan Funtowicz
81a911cce5 Doc, doc, ... doc. 2019-12-19 15:12:06 +01:00
Morgan Funtowicz
faef6f6191 Fix logic order for USE_TF/USE_TORCH 2019-12-19 12:28:17 +01:00
Morgan Funtowicz
5664327c24 Hide train command for now. 2019-12-19 12:27:54 +01:00
Morgan Funtowicz
3b29322d4c Expose all the pipeline argument on serve command. 2019-12-19 12:24:17 +01:00
Morgan Funtowicz
fc624716aa Renaming framework env variables flags from NO_ to USE_ 2019-12-19 11:49:06 +01:00
Morgan Funtowicz
f516cf3956 Allow pipeline to write output in binary format 2019-12-19 11:42:33 +01:00
Morgan Funtowicz
d72fa2a0f6 Fix inputs_for_model call in QuestionAnsweringPipeline accessing __dict__ on list. 2019-12-19 10:54:10 +01:00
Morgan Funtowicz
bcc99fd92e Fix wrong automatic config allocation through AutoConfig 2019-12-19 10:32:21 +01:00
Stefan Schweter
a26ce4dee1 examples: add XLM-RoBERTa to glue script 2019-12-19 02:23:01 +01:00
Morgan Funtowicz
ec5d6c6a70 Adressing issue with NER task omitting first and last word. 2019-12-19 00:12:10 +01:00
Stefan Schweter
fe9aab1055 tokenization: use S3 location for XLM-RoBERTa model 2019-12-18 23:47:48 +01:00
Stefan Schweter
5c5f67a256 modeling: use S3 location for XLM-RoBERTa model 2019-12-18 23:47:00 +01:00
Stefan Schweter
db90e12114 configuration: use S3 location for XLM-RoBERTa model 2019-12-18 23:46:33 +01:00
Morgan Funtowicz
d0724d0794 Add PipedPipelineDataFormat 2019-12-18 23:27:26 +01:00
Morgan Funtowicz
7711403bbd Expose config through the cli arguments 2019-12-18 22:59:51 +01:00
Morgan Funtowicz
8bb166db5d Expose more information in the output of TextClassificationPipeline 2019-12-18 22:53:19 +01:00
Stefan Schweter
f09d999641 docs: fix numbering 😅 2019-12-18 19:49:33 +01:00
Stefan Schweter
dd7a958fd6 docs: add XLM-RoBERTa to pretrained model list (incl. all parameters) 2019-12-18 19:45:46 +01:00
Stefan Schweter
d35405b7a3 docs: add XLM-RoBERTa to index page 2019-12-18 19:45:10 +01:00
Stefan Schweter
3e89fca543 readme: add XLM-RoBERTa to model architecture list 2019-12-18 19:44:23 +01:00
Stefan Schweter
128cfdee9b tokenization add XLM-RoBERTa base model 2019-12-18 19:28:16 +01:00
Stefan Schweter
e778dd854d modeling: add XLM-RoBERTa base model 2019-12-18 19:27:34 +01:00
Morgan Funtowicz
04b602f96f Put module import on top of the module. 2019-12-18 18:28:39 +01:00
Stefan Schweter
64a971a915 auto: add XLM-RoBERTa to auto tokenization 2019-12-18 18:24:32 +01:00
Stefan Schweter
036831e279 auto: add XLM-RoBERTa to audo modeling 2019-12-18 18:23:42 +01:00
Stefan Schweter
41a13a6375 auto: add XLMRoBERTa to auto configuration 2019-12-18 18:20:27 +01:00
Morgan Funtowicz
0c88c856d5 Unnest QuestionAnsweringArgumentHandler 2019-12-18 18:18:16 +01:00
Lysandre
8efc6dd544 fix #2214 2019-12-18 10:47:59 -05:00
Gunnlaugur Thor Briem
a2978465a2 Merge branch 'master' into patch-1 2019-12-18 14:54:46 +00:00
Stefan Schweter
01b68be34f converter: remove XLM-RoBERTa specific script (can be done with the script for RoBERTa now) 2019-12-18 12:24:46 +01:00
thomwolf
3d2096f516 further cleanup 2019-12-18 11:50:54 +01:00
Stefan Schweter
ca31abc6d6 tokenization: *align* fairseq and spm vocab to fix some tokenization errors 2019-12-18 11:36:54 +01:00
thomwolf
8e5587fb79 few fixes on sampling 2019-12-18 11:32:37 +01:00
Stefan Schweter
cce3089b65 Merge remote-tracking branch 'upstream/master' into xlmr 2019-12-18 11:05:16 +01:00
thomwolf
641a8decdc clean up code and add arbitrary number of return sequences 2019-12-18 10:43:48 +01:00
Morgan Funtowicz
e347725d8c More fine-grained control over pipeline creation with config argument. 2019-12-18 10:41:24 +01:00
Julien Chaumond
94c99db34c [FinBERT] fix incorrect url 2019-12-17 20:35:25 -05:00
Julien Chaumond
7ffa817390 [s3] mv files and update links 2019-12-17 20:35:25 -05:00
Antti Virtanen
c5f35e61db Uploaded files to AWS. 2019-12-17 20:35:25 -05:00
Antti Virtanen
abc43ffbff Add pretrained model documentation for FinBERT. 2019-12-17 20:35:25 -05:00
Antti Virtanen
8ac840ff87 Adding Finnish BERT. 2019-12-17 20:35:25 -05:00
Julien Chaumond
a0d386455b Fix outdated tokenizer doc 2019-12-17 20:07:39 -05:00
Julien Chaumond
ea636440d1 [roberta.conversion] Do not hardcode vocab size
and support for fairseq 0.9+
2019-12-17 18:12:22 -05:00
Arman Cohan
a4df2e0113 update roberta conversion
- update to fix conversion for the updated fairseq model
- create save directory if not exist
2019-12-17 18:12:22 -05:00
thomwolf
77d397202b clean up dead code 2019-12-17 23:28:46 +01:00
thomwolf
bbc0c86f9b beam search + single beam decoding 2019-12-17 23:27:02 +01:00
Lysandre
5e289f69bc regex 2019.12.17 install fails with Python 2 2019-12-17 15:54:05 -05:00
Lysandre
2cff4bd8f3 Fix segmentation fault 2019-12-17 15:54:05 -05:00
Julien Chaumond
55397dfb9b CsvPipelineDataFormat: Fix for single-column 2019-12-17 13:10:51 -05:00
thomwolf
b6938916ac adding beam search 2019-12-17 17:23:36 +01:00
Gunnlaugur Thor Briem
d303f84e7b fix: wrong architecture count in README
Just say “the following” so that this intro doesn't so easily fall out of date :) )
2019-12-17 16:18:00 +00:00
Morgan Funtowicz
2fde5a2489 Initial bunch of documentation. 2019-12-17 12:16:07 +01:00
thomwolf
2f1c745cde update conversion script 2019-12-17 11:47:54 +01:00
thomwolf
83bc5235cf Merge branch 'master' into pr/2189 2019-12-17 11:47:32 +01:00
Morgan Funtowicz
d7c62661a3 Provide serving dependencies for tensorflow and pytorch (serving-tf, serving-torch) 2019-12-17 11:23:39 +01:00
Stefan Schweter
f349826a57 model: fix cls and sep token for XLM-RoBERTa documentation 2019-12-17 10:36:04 +01:00
Thomas Wolf
f061606277 Merge pull request #2164 from huggingface/cleanup-configs
[SMALL BREAKING CHANGE] Cleaning up configuration classes - Adding Model Cards
2019-12-17 09:10:16 +01:00
erenup
805c21aeba tried to fix the failed checks 2019-12-17 11:36:00 +08:00
erenup
d000195ee6 add comment for example_index and unique_id in single process 2019-12-17 11:28:34 +08:00
erenup
3c6efd0ca3 updated usage example in modeling_roberta for question and answering 2019-12-17 11:18:12 +08:00
Julien Chaumond
3f5ccb183e [doc] Clarify uploads
cf 855ff0e91d (commitcomment-36452545)
2019-12-16 18:20:29 -05:00
thomwolf
3cb51299c3 Fix #2109 2019-12-16 16:58:44 -05:00
Lysandre
18a879f475 fix #2180 2019-12-16 16:44:29 -05:00
Lysandre
d803409215 Fix run squad evaluate during training 2019-12-16 16:31:38 -05:00
thomwolf
a468870fd2 refactoring generation 2019-12-16 22:22:30 +01:00
Julien Chaumond
855ff0e91d [doc] Model upload and sharing
ping @lysandrejik @thomwolf

Is this clear enough? Anything we should add?
2019-12-16 12:42:22 -05:00
Stefan Schweter
d064009b72 converter: fix vocab size 2019-12-16 17:23:25 +01:00
Stefan Schweter
a701a0cee1 configuration: fix model name for large XLM-RoBERTa model 2019-12-16 17:17:56 +01:00
Stefan Schweter
59a1aefb1c tokenization: add support for new XLM-RoBERTa model. Add wrapper around fairseq tokenization logic 2019-12-16 17:00:55 +01:00
Stefan Schweter
69f4f058fa model: add support for new XLM-RoBERTa model 2019-12-16 17:00:12 +01:00
Stefan Schweter
a648ff738c configuration: add support for XLM-RoBERTa model 2019-12-16 16:47:39 +01:00
Stefan Schweter
9ed09cb4a3 converter: add conversion script for original XLM-RoBERTa weights to Transformers-compatible weights 2019-12-16 16:46:58 +01:00
Stefan Schweter
d3549b66af module: add support for XLM-RoBERTa (__init__) 2019-12-16 16:38:39 +01:00
Morgan Funtowicz
a096e2a88b WIP serving through HTTP internally using pipelines. 2019-12-16 16:38:02 +01:00
Stefan Schweter
71b4750517 examples: add support for XLM-RoBERTa to run_ner script 2019-12-16 16:37:27 +01:00
Morgan Funtowicz
43a4e1bbe4 Adressing issue in varargs handling for question answering. 2019-12-16 16:00:41 +01:00
Morgan Funtowicz
46ccbb42fc Make CLI run command use integer mapping for device argument. 2019-12-16 15:49:41 +01:00
Morgan Funtowicz
bbc707cf39 Fix non-keyworded varargs handling in DefaultArgumentHandler for pipeline. 2019-12-16 15:49:09 +01:00
Morgan Funtowicz
9c391277cc Allow tensors placement on specific device through CLI and pipeline. 2019-12-16 15:19:13 +01:00
thomwolf
1bbdbacd5b update __init__ and saving 2019-12-16 14:38:20 +01:00
Morgan Funtowicz
955d7ecb57 Refactored Pipeline with dedicated argument handler. 2019-12-16 14:34:54 +01:00
thomwolf
031ad4eb37 improving JSON error messages (for model card and configurations) 2019-12-16 14:20:57 +01:00
thomwolf
db0a9ee6e0 adding albert to TF auto models cc @LysandreJik 2019-12-16 14:08:08 +01:00
thomwolf
a4d07b983a dict of all config and model files cc @LysandreJik 2019-12-16 14:00:32 +01:00
thomwolf
d3418a94ff update tests 2019-12-16 13:52:41 +01:00
thomwolf
56e98ba81a add model cards cc @mfuntowicz 2019-12-16 11:07:27 +01:00
thomwolf
8669598abd update t5 tf 2019-12-16 09:59:36 +01:00
thomwolf
1b8613acb3 updating t5 config class 2019-12-16 09:51:42 +01:00
Morgan Funtowicz
8e3b1c860f Added FeatureExtraction pipeline. 2019-12-15 01:37:52 +01:00
Morgan Funtowicz
f1971bf303 Binding pipelines to the cli. 2019-12-15 01:37:16 +01:00
Pascal Voitot
cc0135134b :zip: #2106 basic tokenizer.tokenize global speed improvement (3-8x) by simply caching added_tokens in a Set 2019-12-14 15:25:13 +01:00
thomwolf
dc667ce1a7 double check cc @LysandreJik 2019-12-14 09:56:27 +01:00
thomwolf
7140363e09 update bertabs 2019-12-14 09:44:53 +01:00
Thomas Wolf
a52d56c8d9 Merge branch 'master' into cleanup-configs 2019-12-14 09:43:07 +01:00
Thomas Wolf
e92bcb7eb6 Merge pull request #1739 from huggingface/t5
[WIP] Adding Google T5 model
2019-12-14 09:40:43 +01:00
thomwolf
cbb368ca06 distilbert tests 2019-12-14 09:31:18 +01:00
Julien Chaumond
b6d4284b26 [cli] Uploads: fix + test edge case 2019-12-13 22:44:57 -05:00
erenup
a1faaf9962 deleted useless file 2019-12-14 08:57:13 +08:00
erenup
c7780700f5 Merge branch 'refs/heads/squad_roberta'
# Conflicts:
#	transformers/data/processors/squad.py
2019-12-14 08:53:59 +08:00
erenup
76f0d99f02 Merge remote-tracking branch 'refs/remotes/huggingface/master' 2019-12-14 08:45:17 +08:00
erenup
8e9526b4b5 add multiple processing 2019-12-14 08:43:58 +08:00
Lysandre
7bd11dda6f Release: v2.2.2 2019-12-13 16:45:30 -05:00
LysandreJik
c3248cf122 Tests for all tokenizers 2019-12-13 16:41:44 -05:00
Pascal Voitot
f2ac50cb55 better for python2.x 2019-12-13 16:41:44 -05:00
Pascal Voitot
4cbdc7d910 missed space 2019-12-13 16:41:44 -05:00
Pascal Voitot
dd2add9f6e more tests 2019-12-13 16:41:44 -05:00
Pascal Voitot
df160af736 🐛 #2096 in tokenizer.decode, space is not joined between all subtexts instead of before added tokens 2019-12-13 16:41:44 -05:00
Pascal Voitot
5b7b78e088 🐛 #2096 in tokenizer.decode, adds a space after special tokens to return right formatted string 2019-12-13 16:41:44 -05:00
Julien Chaumond
866d73ca26 [cli] Upload is now compatible with folders 2019-12-13 16:39:08 -05:00
Lysandre
d461472948 return for SQuAD [BLACKED] 2019-12-13 15:31:52 -05:00
Lysandre
f24a228a93 Speed up tokenization process 2019-12-13 14:50:35 -05:00
Lysandre
c8ed1c82c8 [SQUAD] Load checkpoint when evaluating without training 2019-12-13 12:13:48 -05:00
thomwolf
5c00e344c1 update model doc - swith 3B/11B to 3b/11b 2019-12-13 16:33:29 +01:00
Morgan Funtowicz
0b51532ce9 Reintroducing the batch_encode_plus method 2019-12-13 16:22:50 +01:00
Thomas Wolf
110394b2ba Merge branch 'master' into t5 2019-12-13 16:03:32 +01:00
Pierric Cistac
5a5c4349e8 Fix summarization to_cpu doc 2019-12-13 10:02:33 -05:00
thomwolf
8ade204098 fix tf 2019-12-13 14:48:47 +01:00
thomwolf
47f0e3cfb7 cleaning up configuration classes 2019-12-13 14:33:24 +01:00
Morgan Funtowicz
8938b546bf Removed from_config 2019-12-13 14:27:04 +01:00
Morgan Funtowicz
1ca52567a4 Allow model conversion in the pipeline allocator. 2019-12-13 14:13:14 +01:00
Morgan Funtowicz
28e64ad5a4 Raise an exception if the pipeline allocator can't determine the tokenizer from the model. 2019-12-13 14:12:54 +01:00
Morgan Funtowicz
be5bf7b81b Added NER pipeline. 2019-12-13 14:12:17 +01:00
Morgan Funtowicz
80eacb8f16 Adding labels mapping for classification models in their respective config. 2019-12-13 14:10:22 +01:00
thomwolf
33e72b08d5 fix inner dimensions for 3B/11B models 2019-12-13 11:33:05 +01:00
erenup
9b312f9d41 initial version for roberta squad 2019-12-13 14:51:40 +08:00
erenup
40ed717232 Merge remote-tracking branch 'refs/remotes/huggingface/master' 2019-12-13 09:10:17 +08:00
LysandreJik
7296f1010b Cleanup squad and add allow train_file and predict_file usage 2019-12-12 13:01:04 -05:00
Julien Chaumond
5d67aa21ae [doc] Replicate doc from #2144 2019-12-12 12:39:41 -05:00
LysandreJik
3fd71c4431 Update example scripts 2019-12-12 12:08:54 -05:00
LysandreJik
fe92755b99 Fix special tokens mask in encode 2019-12-12 11:37:19 -05:00
Alan deLevie
fbf5455a86 Fix typo in examples/run_glue.py args declaration.
deay -> decay
2019-12-12 11:16:19 -05:00
thomwolf
f19dad61c7 fixing XLM conversion tests with dummy input 2019-12-12 14:46:30 +01:00
Morgan Funtowicz
f69dbecc38 Expose classification labels mapping (and reverse) in model config. 2019-12-12 10:25:36 +01:00
Thomas Wolf
90df44f0aa Merge pull request #2063 from guillaume-be/special_tokens_mask_value_not_used
special_tokens_mask value was unused and calculated twice
2019-12-12 08:21:46 +01:00
Thomas Wolf
707f9e9241 Merge pull request #2081 from pglock/patch-1
handle string with only whitespaces as empty
2019-12-12 08:20:43 +01:00
Thomas Wolf
137e20a846 Merge pull request #2075 from huggingface/check-link-validity
Check link validity
2019-12-12 08:09:12 +01:00
Thomas Wolf
d5712f7cac Merge branch 'master' into check-link-validity 2019-12-12 08:00:51 +01:00
Thomas Wolf
9c58b236ef Merge pull request #2144 from huggingface/from-pretrained-from-url
Allowing from_pretrained to load from url directly
2019-12-12 07:43:40 +01:00
thomwolf
413f41921b fix merge 2019-12-12 07:34:42 +01:00
Thomas Wolf
386a93f0f8 Merge branch 'master' into from-pretrained-from-url 2019-12-12 07:31:05 +01:00
Thomas Wolf
2d103546ef Merge pull request #2148 from huggingface/fix_encode_plus
Fix encode plus
2019-12-12 07:24:47 +01:00
Julien Chaumond
1748fdf657 [doc] Fix rst table 2019-12-11 18:32:27 -05:00
Julien Chaumond
36fc52a3b4 Update links to weights 2019-12-11 18:32:27 -05:00
Julien Chaumond
371c5ddfad Py2 tests for Lysandre 2019-12-11 18:32:27 -05:00
Julien Chaumond
5505cf7014 Run tests on Py2 too, for Lysandre 2019-12-11 18:32:27 -05:00
Julien Chaumond
9cb97c0c0f Actually run the tests 2019-12-11 18:32:27 -05:00
Julien Chaumond
95854c4a2f Actually run the tests 2019-12-11 18:32:27 -05:00
Julien Chaumond
d2100428d3 Update to new test infra and only run conditionally 2019-12-11 18:32:27 -05:00
Masatoshi Suzuki
597ba7feb3 Support testing Japanese BERT tokenizers 2019-12-11 18:32:27 -05:00
Masatoshi Suzuki
6a43dc9d7d Support Python 2 2019-12-11 18:32:27 -05:00
Masatoshi Suzuki
a09da4eeb0 Add a test for Japanese BERT tokenizers 2019-12-11 18:32:27 -05:00
Masatoshi Suzuki
57b5cb3eaa Fix loading BertJapaneseTokenizer 2019-12-11 18:32:27 -05:00
Masatoshi Suzuki
c03c0dfd23 Add support for Japanese BERT models by cl-tohoku 2019-12-11 18:32:27 -05:00
Julien Chaumond
4f15e5a267 Add tests.
Maybe not the best possible place for the tests, lmk.
2019-12-11 17:41:51 -05:00
Julien Chaumond
18e1f751f1 TF support 2019-12-11 17:07:46 -05:00
Julien Chaumond
31e5b5ff22 Fix tests + first example of doc 2019-12-11 15:22:02 -05:00
LysandreJik
3d57c51111 Fix encode plus 2019-12-11 15:10:17 -05:00
Julien Chaumond
c999a3e505 Allow from_pretrained to take a remote identifier 2019-12-11 12:29:58 -05:00
Stefan Schweter
030faccb8d doc: fix pretrained models table 2019-12-11 12:19:21 -05:00
thomwolf
6709739a05 allowing from_pretrained to load from url directly 2019-12-11 18:15:45 +01:00
thomwolf
29570db25b allowing from_pretrained to load from url directly 2019-12-11 17:19:18 +01:00
Julien Chaumond
2e2f9fed55 rm duplicate imports 2019-12-11 11:11:56 -05:00
Morgan Funtowicz
c28273793e Add missing DistilBert and Roberta to AutoModelForTokenClassification 2019-12-11 15:31:45 +01:00
LysandreJik
4c12860f7a Remove misleading documentation 2019-12-11 09:22:37 -05:00
Morgan Funtowicz
b040bff6df Added supported model to AutoModelTokenClassification 2019-12-11 14:13:58 +01:00
thomwolf
fafd4c86ec fix TF 2.0 version of T5 - update conversion script 2019-12-11 13:47:27 +01:00
Bilal Khan
6aa919469d Update run_xnli to save optimizer and scheduler states, then resume training from a checkpoint 2019-12-10 19:31:22 -06:00
Bilal Khan
89896fe04f Update run_ner to save optimizer and scheduler states, then resume training from a checkpoint 2019-12-10 19:31:22 -06:00
Bilal Khan
fdc05cd68f Update run_squad to save optimizer and scheduler states, then resume training from a checkpoint 2019-12-10 19:31:22 -06:00
Bilal Khan
854ec5784e Update run_glue to save optimizer and scheduler states, then resume training from a checkpoint 2019-12-10 19:30:36 -06:00
Morgan Funtowicz
9a24e0cf76 Refactored qa pipeline argument handling + unittests 2019-12-11 00:33:25 +01:00
LysandreJik
b72f9d340e Correct index in script 2019-12-10 18:33:17 -05:00
Thomas Wolf
51ae203290 Merge pull request #2129 from leopd/master
Progress indicator improvements when downloading pre-trained models.
2019-12-10 22:18:55 +01:00
LysandreJik
ec6fb25c21 Patch documentation 2019-12-10 15:49:20 -05:00
LysandreJik
418589244d Uniforming the ignored indices 2019-12-10 15:26:19 -05:00
Leo Dirac
58d75aa310 Progress indicator improvements when downloading pre-trained models. 2019-12-10 11:36:56 -08:00
LysandreJik
6a73382706 Complete warning + cleanup 2019-12-10 14:33:24 -05:00
Lysandre
dc4e9e5cb3 DataParallel for SQuAD + fix XLM 2019-12-10 19:21:20 +00:00
thomwolf
67a8be8e90 fix backward in tests 2019-12-10 17:50:32 +01:00
Rémi Louf
07bc8efbc3 add greedy decoding and sampling 2019-12-10 17:27:50 +01:00
Morgan Funtowicz
63e36007ee Make sure padding, cls and another non-context tokens cannot appear in the answer. 2019-12-10 16:47:35 +01:00
thomwolf
f2538c1274 all tests in torch no grad 2019-12-10 16:33:11 +01:00
thomwolf
a5df980c5b updating distilbert test 2019-12-10 16:01:15 +01:00
Morgan Funtowicz
40a39ab650 Reuse recent SQuAD refactored data structure inside QA pipelines. 2019-12-10 15:59:38 +01:00
thomwolf
7c3a15ace9 Merge branch 'master' into t5 2019-12-10 15:36:54 +01:00
thomwolf
981a5c8c17 updating models urls 2019-12-10 15:36:19 +01:00
Thomas Wolf
e6cff60b4c Merge pull request #2069 from huggingface/cleaner-pt-tf-conversion
clean up PT <=> TF conversion
2019-12-10 15:34:08 +01:00
Rémi Louf
4b82c485de remove misplaced summarization documentation 2019-12-10 09:13:33 -05:00
thomwolf
8ae1044f80 updating tests and TF 2.0 model 2019-12-10 15:11:07 +01:00
Morgan Funtowicz
aae74065df Added QuestionAnsweringPipeline unit tests. 2019-12-10 13:37:20 +01:00
Morgan Funtowicz
a7d3794a29 Remove token_type_ids for compatibility with DistilBert 2019-12-10 13:37:20 +01:00
Morgan Funtowicz
fe0f552e00 Use attention_mask everywhere. 2019-12-10 13:37:20 +01:00
Morgan Funtowicz
348e19aa21 Expose attention_masks and input_lengths arguments to batch_encode_plus 2019-12-10 13:37:18 +01:00
Morgan Funtowicz
c2407fdd88 Enable the Tensorflow backend. 2019-12-10 13:37:14 +01:00
Morgan Funtowicz
f116cf599c Allow hidding frameworks through environment variables (NO_TF, NO_TORCH). 2019-12-10 13:37:07 +01:00
Morgan Funtowicz
6e61e06051 batch_encode_plus generates the encoder_attention_mask to avoid attending over padded values. 2019-12-10 13:37:07 +01:00
Morgan Funtowicz
02110485b0 Added batching, topk, chars index and scores. 2019-12-10 13:36:55 +01:00
Morgan Funtowicz
e1d89cb24d Added QuestionAnsweringPipeline with batch support. 2019-12-10 13:36:55 +01:00
thomwolf
0558c9cb9b Merge branch 'master' into t5 2019-12-10 12:58:48 +01:00
Morgan Funtowicz
81babb227e Added download command through the cli.
It allows to predownload models and tokenizers.
2019-12-10 12:18:59 +01:00
thomwolf
31a3a73ee3 updating CLI 2019-12-10 12:18:59 +01:00
thomwolf
7c1697562a compatibility with sklearn and keras 2019-12-10 12:12:22 +01:00
thomwolf
b81ab431f2 updating AutoModels and AutoConfiguration - adding pipelines 2019-12-10 12:11:33 +01:00
thomwolf
2d8559731a add pipeline - train 2019-12-10 11:34:16 +01:00
thomwolf
72c36b9ea2 [WIP] - CLI 2019-12-10 11:33:14 +01:00
Thomas Wolf
e57d00ee10 Merge pull request #1984 from huggingface/squad-refactor
[WIP] Squad refactor
2019-12-10 11:07:26 +01:00
Thomas Wolf
ecabbf6d28 Merge pull request #2107 from huggingface/encoder-mask-shape
create encoder attention mask from shape of hidden states
2019-12-10 10:07:56 +01:00
thomwolf
608a8f5b56 updating tf 2.0 layer_norm to T5 layer norm 2019-12-10 10:01:01 +01:00
Suvrat Bhooshan
df3961121f Add MMBT Model to Transformers Repo 2019-12-09 18:36:48 -08:00
Julien Chaumond
1d18930462 Harmonize no_cuda flag with other scripts 2019-12-09 20:37:55 -05:00
Rémi Louf
f7eba09007 clean for release 2019-12-09 20:37:55 -05:00
Rémi Louf
2a64107e44 improve device usage 2019-12-09 20:37:55 -05:00
Rémi Louf
c0707a85d2 add README 2019-12-09 20:37:55 -05:00
Rémi Louf
ade3cdf5ad integrate ROUGE 2019-12-09 20:37:55 -05:00
Rémi Louf
076602bdc4 prevent BERT weights from being downloaded twice 2019-12-09 20:37:55 -05:00
Rémi Louf
5909f71028 add py-rouge dependency 2019-12-09 20:37:55 -05:00
Rémi Louf
a1994a71ee simplified model and configuration 2019-12-09 20:37:55 -05:00
Rémi Louf
3a9a9f7861 default output dir to documents dir 2019-12-09 20:37:55 -05:00
Rémi Louf
693606a75c update the docs 2019-12-09 20:37:55 -05:00
Rémi Louf
c0443df593 remove beam search 2019-12-09 20:37:55 -05:00
Rémi Louf
2403a66598 give transformers API to BertAbs 2019-12-09 20:37:55 -05:00
Rémi Louf
4d18199902 cast bool tensor to long for pytorch < 1.3 2019-12-09 20:37:55 -05:00
Rémi Louf
9f75565ea8 setup training 2019-12-09 20:37:55 -05:00
Rémi Louf
4735c2af07 tweaks to the BeamSearch API 2019-12-09 20:37:55 -05:00
Rémi Louf
ba089c780b share pretrained embeddings 2019-12-09 20:37:55 -05:00
Rémi Louf
9660ba1cbd Add beam search 2019-12-09 20:37:55 -05:00
Rémi Louf
1c71ecc880 load the pretrained weights for encoder-decoder
We currently save the pretrained_weights of the encoder and decoder in
two separate directories `encoder` and `decoder`. However, for the
`from_pretrained` function to operate with automodels we need to
specify the type of model in the path to the weights.

The path to the encoder/decoder weights is handled by the
`PreTrainedEncoderDecoder` class in the `save_pretrained` function. Sice
there is no easy way to infer the type of model that was initialized for
the encoder and decoder we add a parameter `model_type` to the function.
This is not an ideal solution as it is error prone, and the model type
should be carried by the Model classes somehow.

This is a temporary fix that should be changed before merging.
2019-12-09 20:37:55 -05:00
Rémi Louf
07f4cd73f6 update function to add special tokens
Since I started my PR the `add_special_token_single_sequence` function
has been deprecated for another; I replaced it with the new function.
2019-12-09 20:37:55 -05:00
Pierric Cistac
5c877fe94a fix albert links 2019-12-09 18:53:00 -05:00
Bilal Khan
79526f82f5 Remove unnecessary epoch variable 2019-12-09 16:24:35 -05:00
Bilal Khan
9626e0458c Add functionality to continue training from last saved global_step 2019-12-09 16:24:35 -05:00
Bilal Khan
2d73591a18 Stop saving current epoch 2019-12-09 16:24:35 -05:00
Bilal Khan
0eb973b0d9 Use saved optimizer and scheduler states if available 2019-12-09 16:24:35 -05:00
Bilal Khan
a03fcf570d Save tokenizer after each epoch to be able to resume training from a checkpoint 2019-12-09 16:24:35 -05:00
Bilal Khan
f71b1bb05a Save optimizer state, scheduler state and current epoch 2019-12-09 16:24:35 -05:00
thomwolf
8e651f56b7 fix tf tests 2019-12-09 22:13:57 +01:00
thomwolf
808bb8da7e fix transfo xl tests 2019-12-09 21:48:34 +01:00
thomwolf
b016dd16c9 fix tests on python 3.5 2019-12-09 21:38:07 +01:00
LysandreJik
2a4ef098d6 Add ALBERT and XLM to SQuAD script 2019-12-09 10:46:47 -05:00
Lysandre Debut
00c4e39581 Merge branch 'master' into squad-refactor 2019-12-09 10:41:15 -05:00
thomwolf
169fea6855 updating T5 2019-12-09 16:25:33 +01:00
Rémi Louf
3520be7824 create encoder attention mask from shape of hidden states
We currently create encoder attention masks (when they're not provided)
based on the shape of the inputs to the encoder. This is obviously
wrong; sequences can be of different lengths. We now create the encoder
attention mask based on the batch_size and sequence_length of the
encoder hidden states.
2019-12-09 11:19:45 +01:00
Aymeric Augustin
0cb163865a Remove pytest dependency. (#2093) 2019-12-07 07:46:14 -05:00
Michael Watkins
2670b0d682 Fix bug which lowercases special tokens 2019-12-06 16:15:53 -05:00
Aymeric Augustin
35401fe50f Remove dependency on pytest for running tests (#2055)
* Switch to plain unittest for skipping slow tests.

Add a RUN_SLOW environment variable for running them.

* Switch to plain unittest for PyTorch dependency.

* Switch to plain unittest for TensorFlow dependency.

* Avoid leaking open files in the test suite.

This prevents spurious warnings when running tests.

* Fix unicode warning on Python 2 when running tests.

The warning was:

    UnicodeWarning: Unicode equal comparison failed to convert both arguments to Unicode - interpreting them as being unequal

* Support running PyTorch tests on a GPU.

Reverts 27e015bd.

* Tests no longer require pytest.

* Make tests pass on cuda
2019-12-06 13:57:38 -05:00
Julien Chaumond
e4679cddce [cli] Uploads: add progress bar (#2078)
* [cli] Uploads: add progress bar

see https://github.com/huggingface/transformers/pull/2044#discussion_r354057827 for context

* rename + documentation

* Add auto-referential comment
2019-12-06 11:56:23 -05:00
thomwolf
1d87b37d10 updating 2019-12-06 15:30:09 +01:00
Thomas Wolf
4cb9b60558 Merge pull request #2077 from patrickvonplaten/change_documentation_for_past_output_shape
corrected documentation for past tensor shape for ctrl and gpt2 model
2019-12-06 12:14:48 +01:00
Thomas Wolf
5482822a2b Merge pull request #2046 from jplu/tf2-ner-example
Add NER TF2 example.
2019-12-06 12:12:22 +01:00
Thomas Wolf
fc1bb1f867 Merge pull request #2068 from huggingface/fix-2042
Nicer error message when Bert's input is missing batch size
2019-12-06 12:06:42 +01:00
Philipp Glock
21451ec6ba handle string with only whitespaces as empty 2019-12-06 10:32:43 +01:00
Rémi Louf
f230d91b43 check the validity of links
We add a script and a CI workflow to check that all download links
present in the source code are valid.
2019-12-06 09:41:28 +01:00
patrickvonplaten
d0383e4daf corrected documentation for past tensor shape for ctrl and gpt2 model 2019-12-06 01:24:22 +01:00
LysandreJik
e9217da5ff Cleanup
Improve global visibility on the run_squad script, remove unused files and fixes related to XLNet.
2019-12-05 16:01:51 -05:00
LysandreJik
9ecd83dace Patch evaluation for impossible values + cleanup 2019-12-05 14:44:57 -05:00
VictorSanh
35ff345fc9 update requirements 2019-12-05 12:07:04 -05:00
VictorSanh
552c44a9b1 release distilm-bert 2019-12-05 10:14:58 -05:00
Rosanne Liu
ee53de7aac Pr for pplm (#2060)
* license

* changes

* ok

* Update paper link and commands to run

* pointer to uber repo
2019-12-05 09:20:07 -05:00
thomwolf
f8fb4335c9 clean up a little bit PT <=> TF conversion 2019-12-05 15:19:32 +01:00
Thomas Wolf
bebaa14039 Merge pull request #2045 from aaugustin/remove-dead-code
Remove dead code in tests.
2019-12-05 14:41:56 +01:00
thomwolf
18fb93530b fixing #2042 - Nicer error message 2019-12-05 14:36:34 +01:00
thomwolf
2d5d86e037 fix #2031 2019-12-05 14:06:29 +01:00
Thomas Wolf
af077b15e2 Merge pull request #2065 from huggingface/fixing-camembert
Fixing camembert tokenization
2019-12-05 13:45:44 +01:00
thomwolf
3268ebd229 fix xlnet test 2019-12-05 13:35:29 +01:00
thomwolf
6c5297a423 Fixing camembert tokenization 2019-12-05 13:27:58 +01:00
Julien Plu
9200a759d7 Add few tests on the TF optimization file with some info in the documentation. Complete the README. 2019-12-05 12:56:43 +01:00
Thomas Wolf
1f179f095f Merge pull request #2011 from AdityaSoni19031997/patch-1
typo fix on the docs as per Pytorch v1.1+
2019-12-05 12:39:04 +01:00
Thomas Wolf
1eaf44e713 Merge pull request #2007 from roskoN/xlnet_attention_fix
fixed XLNet attention output for both attention streams whenever target_mapping is provided
2019-12-05 12:32:39 +01:00
thomwolf
71e4693f08 fix #1968 2019-12-05 12:14:24 +01:00
Thomas Wolf
f9f395b21c Merge pull request #1735 from ondewo/tf-do-not-use-gpu-on-import
Do not use GPU when importing transformers
2019-12-05 11:56:48 +01:00
thomwolf
75a97af6bc fix #1450 - add doc 2019-12-05 11:26:55 +01:00
thomwolf
8b388827b5 fix #1920 2019-12-05 11:18:43 +01:00
Thomas Wolf
d425a4d60b Merge pull request #1870 from alexzubiaga/xlnet-for-token-classification
XLNet for Token classification
2019-12-05 09:54:09 +01:00
Thomas Wolf
1eb89ddf73 Merge pull request #2044 from huggingface/cli_upload
CLI for authenticated file sharing
2019-12-05 09:44:07 +01:00
Guillaume B
7f998b1b83 special_tokens_mask value was unused and calculated twice 2019-12-05 09:01:39 +01:00
VictorSanh
fb0d2f1da1 preparing release distil-mBERT 2019-12-05 03:00:16 -05:00
Julien Chaumond
3ba417e1a8 [cli] ls: Tabular formatting 2019-12-04 18:40:52 -05:00
LysandreJik
ce158a076f Return dataset (pytorch) 2019-12-04 17:55:52 -05:00
LysandreJik
7a03519975 Documentation 2019-12-04 17:24:35 -05:00
Julien Chaumond
96fa9a8a70 Python 2 + Post mime-type to S3 2019-12-04 17:22:50 -05:00
LysandreJik
33508ae310 Remove only_first 2019-12-04 16:26:45 -05:00
LysandreJik
f7e4a7cdfa Cleanup 2019-12-04 16:24:15 -05:00
LysandreJik
a7ca6d738b Padding side is tokenizer-dependant 2019-12-04 15:43:34 -05:00
LysandreJik
cca75e7884 Kill the demon spawn 2019-12-04 15:42:29 -05:00
LysandreJik
bf119c0568 TFDS dataset can now be evaluated 2019-12-04 11:34:59 -05:00
Julien Plu
ff98b041da Fix whitespace issue 2019-12-04 16:53:06 +01:00
LysandreJik
9ddc3f1a12 Naming update + XLNet/XLM evaluation 2019-12-04 10:37:00 -05:00
thomwolf
5bfcd0485e fix #1991 2019-12-04 14:53:11 +01:00
Thomas Wolf
cae641ff26 Merge pull request #1846 from tamuhey/patch/iss1845
fix summary_type value of SequenceSummary
2019-12-04 13:28:39 +01:00
Julien Plu
254ebb979c Bugfix on init file. Missing comma. 2019-12-04 10:00:25 +01:00
Julien Plu
ecb923da9c Create a NER example similar to the Pytorch one. It takes the same options, and can be run the same way. 2019-12-04 09:43:15 +01:00
Aymeric Augustin
40255ab002 Remove dead code in tests. 2019-12-04 08:21:02 +01:00
Julien Chaumond
e4fbf3e2cc CLI for authenticated file sharing 2019-12-04 00:52:23 -05:00
LysandreJik
de276de1c1 Working evaluation 2019-12-03 17:15:51 -05:00
Julien Chaumond
7edb51f3a5 [pplm] split classif head into its own file 2019-12-03 22:07:25 +00:00
LysandreJik
c835bc85c2 Compute predictions 2019-12-03 15:28:16 -05:00
LysandreJik
285b1241e3 Added SquadResult 2019-12-03 15:00:49 -05:00
thomwolf
f3776df0f3 WIP debugging 2019-12-02 15:47:00 +01:00
Aditya Soni
c356290c8d typo fix as per Pytorch v1.1+ 2019-12-01 14:08:14 +05:30
Rostislav Nedelchev
76c0bc06d5 [XLNet] Changed post-processing of attention w.r.t to target_mapping
Whenever target_mapping is provided to the input, XLNet outputs two different attention streams.
Based on that the attention output would be on of the two:
- a list of tensors (usual case for most transformers)
- a list of 2-tuples of tensors, one tesor for each of attention streams
Docs and unit-tests have been updated
2019-11-30 21:01:04 +01:00
Rostislav Nedelchev
b90791e950 fixed XLNet attenttion output for both attention streams 2019-11-30 15:57:51 +01:00
Lysandre
1e9ac5a7cf New -> normal 2019-11-28 17:43:47 -05:00
Lysandre
0b84b9fd8a Add processors to __init__ 2019-11-28 17:38:52 -05:00
Lysandre
f671997ef7 Interface with TFDS 2019-11-28 17:17:20 -05:00
Lysandre
bd41e8292a Cleanup & Evaluation now works 2019-11-28 16:03:56 -05:00
Lysandre
0669c1fcd1 SQuAD v2 BERT + XLNet 2019-11-25 19:22:21 -05:00
Lysandre
e0e55bc550 Manage training example & refactor the refactor 2019-11-22 16:27:45 -05:00
Lysandre
c3ba645237 Works for XLNet 2019-11-22 16:27:37 -05:00
LysandreJik
a5a8a6175f Works for BERT 2019-11-22 16:27:31 -05:00
LysandreJik
a7dafe2f41 Padding strategy (left and right) rather than boolean flag 2019-11-22 16:27:25 -05:00
LysandreJik
9f374c8252 encode and encode_plus handle attention masks and padding 2019-11-22 16:27:15 -05:00
Lysandre
72e506b22e wip 2019-11-22 16:26:00 -05:00
Lysandre
ea52f82455 Moved some SQuAD logic to /data 2019-11-22 16:25:52 -05:00
alexzubiaga
4193aa9f81 add TFXLNetForTokenClassification implementation and unit test
add XLNetForTokenClassification implementation and unit tests
2019-11-19 12:47:54 +01:00
Yohei Tamura
d08a338c3b modified: transformers/modeling_utils.py 2019-11-16 18:47:37 +09:00
Xu Hongshen
ca99a2d500 Update example readme 2019-11-15 14:55:26 +08:00
Xu Hongshen
7da3ef24cd add is_impossible tensor to model inputs during fine-tuning xlnet on squad2.0 2019-11-15 14:18:53 +08:00
thomwolf
268d4f2099 fix position biases + better tests 2019-11-08 16:41:55 +01:00
thomwolf
b4fcd59a5a add sentinels in tokenizer 2019-11-08 14:38:53 +01:00
thomwolf
15e53c4e87 maybe fix tests 2019-11-08 12:43:21 +01:00
thomwolf
f03c0c1423 adding models in readme and auto classes 2019-11-08 11:49:46 +01:00
thomwolf
4321c54125 fix tests 2019-11-08 11:49:32 +01:00
thomwolf
727a79b305 added TF2 model and tests - updated templates 2019-11-08 11:35:03 +01:00
thomwolf
8fda532c3c fix python 2 sentencepiece tokenization 2019-11-07 17:09:50 +01:00
thomwolf
ba10065c4b update model, conversion script, tests and template 2019-11-07 15:55:36 +01:00
thomwolf
076a207935 adding tests and updating model 2019-11-06 11:52:50 +01:00
thomwolf
73f2c342f5 fixing template 2019-11-06 11:52:39 +01:00
thomwolf
3835e1e651 adding tokenizer 2019-11-06 11:52:29 +01:00
thomwolf
88e5bef58f share position biases 2019-11-05 17:02:52 +01:00
thomwolf
568c0ffb7e adding T5 model 2019-11-05 16:40:29 +01:00
thomwolf
60a5babd57 adding files 2019-11-05 12:01:23 +01:00
Filip Povolny
124409d075 Make dummy inputs a property of TFPreTrainedModel. 2019-11-05 11:48:45 +01:00
thomwolf
dfb61caf77 fix #1692 2019-11-05 11:25:13 +01:00
Filip Povolny
8df7dfd2a7 Make dummy inputs a local variable in TFPreTrainedModel. 2019-11-05 11:09:16 +01:00
Lorenzo Ampil
d36680df54 Rever changes to TF distilbert due to failed test: TFDistilBertModelTest.test_pt_tf_model_equivalence 2019-10-27 14:51:36 +08:00
Lorenzo Ampil
ec276d6aba Add special tokens to documentation for the tensorflow model examples #1561 2019-10-27 14:00:40 +08:00
Lorenzo Ampil
6e011690a9 Add special tokens to documentation for the rest of pytorch model examples #1561 2019-10-27 13:59:14 +08:00
Lorenzo Ampil
3a52b65795 Add special tokens to documentation for bert examples to resolve issue: #1561 2019-10-21 12:55:51 +08:00
erenup
86a630702d Merge branch 'huggingface/master' 2019-10-21 12:06:09 +08:00
erenup
b5d73976ad Revert "fixing for roberta tokenizer decoding"
This reverts commit 22e7c4edaf.
2019-10-03 20:48:17 +08:00
erenup
22e7c4edaf fixing for roberta tokenizer decoding 2019-10-03 18:33:53 +08:00
709 changed files with 96882 additions and 42159 deletions

View File

@@ -1,87 +1,116 @@
version: 2
jobs:
build_py3_torch_and_tf:
run_tests_torch_and_tf:
working_directory: ~/transformers
docker:
- image: circleci/python:3.5
- image: circleci/python:3.6
environment:
OMP_NUM_THREADS: 1
resource_class: xlarge
parallelism: 1
steps:
- checkout
- run: sudo pip install torch
- run: sudo pip install tensorflow
- run: sudo pip install --progress-bar off .
- run: sudo pip install pytest codecov pytest-cov
- run: sudo pip install tensorboardX scikit-learn
- run: python -m pytest -sv ./transformers/tests/ --cov
- run: sudo pip install .[sklearn,tf-cpu,torch,testing]
- run: sudo pip install codecov pytest-cov
- run: python -m pytest -n 8 --dist=loadfile -s -v ./tests/ --cov
- run: codecov
build_py3_torch:
run_tests_torch:
working_directory: ~/transformers
docker:
- image: circleci/python:3.5
- image: circleci/python:3.7
environment:
OMP_NUM_THREADS: 1
resource_class: xlarge
parallelism: 1
steps:
- checkout
- run: sudo pip install torch
- run: sudo pip install --progress-bar off .
- run: sudo pip install pytest codecov pytest-cov
- run: sudo pip install tensorboardX scikit-learn
- run: python -m pytest -sv ./transformers/tests/ --cov
- run: python -m pytest -sv ./examples/
- run: sudo pip install .[sklearn,torch,testing]
- run: sudo pip install codecov pytest-cov
- run: python -m pytest -n 8 --dist=loadfile -s -v ./tests/ --cov
- run: codecov
build_py3_tf:
run_tests_tf:
working_directory: ~/transformers
docker:
- image: circleci/python:3.5
- image: circleci/python:3.7
environment:
OMP_NUM_THREADS: 1
resource_class: xlarge
parallelism: 1
steps:
- checkout
- run: sudo pip install tensorflow
- run: sudo pip install --progress-bar off .
- run: sudo pip install pytest codecov pytest-cov
- run: sudo pip install tensorboardX scikit-learn
- run: python -m pytest -sv ./transformers/tests/ --cov
- run: sudo pip install .[sklearn,tf-cpu,testing]
- run: sudo pip install codecov pytest-cov
- run: python -m pytest -n 8 --dist=loadfile -s -v ./tests/ --cov
- run: codecov
build_py2_torch:
run_tests_custom_tokenizers:
working_directory: ~/transformers
resource_class: large
parallelism: 1
docker:
- image: circleci/python:2.7
- image: circleci/python:3.6
environment:
RUN_CUSTOM_TOKENIZERS: yes
steps:
- checkout
- run: sudo pip install torch
- run: sudo pip install --progress-bar off .
- run: sudo pip install pytest codecov pytest-cov
- run: python -m pytest -sv ./transformers/tests/ --cov
- run: codecov
build_py2_tf:
- run: sudo pip install .[mecab,testing]
- run: python -m pytest -sv ./tests/test_tokenization_bert_japanese.py
run_examples_torch:
working_directory: ~/transformers
resource_class: large
parallelism: 1
docker:
- image: circleci/python:2.7
- image: circleci/python:3.6
environment:
OMP_NUM_THREADS: 1
resource_class: xlarge
parallelism: 1
steps:
- checkout
- run: sudo pip install tensorflow
- run: sudo pip install --progress-bar off .
- run: sudo pip install pytest codecov pytest-cov
- run: python -m pytest -sv ./transformers/tests/ --cov
- run: codecov
- run: sudo pip install .[sklearn,torch,testing]
- run: sudo pip install -r examples/requirements.txt
- run: python -m pytest -n 8 --dist=loadfile -s -v ./examples/
build_doc:
working_directory: ~/transformers
docker:
- image: circleci/python:3.6
steps:
- checkout
- run: sudo pip install .[tf,torch,docs]
- run: cd docs && make html
- store_artifacts:
path: ./docs/_build
deploy_doc:
working_directory: ~/transformers
docker:
- image: circleci/python:3.5
- image: circleci/python:3.6
steps:
- add_ssh_keys:
fingerprints:
- "5b:7a:95:18:07:8c:aa:76:4c:60:35:88:ad:60:56:71"
fingerprints:
- "5b:7a:95:18:07:8c:aa:76:4c:60:35:88:ad:60:56:71"
- checkout
- run: sudo pip install --progress-bar off -r docs/requirements.txt
- run: sudo pip install --progress-bar off -r requirements.txt
- run: sudo pip install .[tf,torch,docs]
- run: ./.circleci/deploy.sh
check_code_quality:
working_directory: ~/transformers
docker:
- image: circleci/python:3.6
resource_class: medium
parallelism: 1
steps:
- checkout
# we need a version of isort with https://github.com/timothycrosley/isort/pull/1000
- run: sudo pip install git+git://github.com/timothycrosley/isort.git@e63ae06ec7d70b06df9e528357650281a3d3ec22#egg=isort
- run: sudo pip install .[tf,torch,quality]
- run: black --check --line-length 119 --target-version py35 examples templates tests src utils
- run: isort --check-only --recursive examples templates tests src utils
- run: flake8 examples templates tests src utils
check_repository_consistency:
working_directory: ~/transformers
docker:
- image: circleci/python:3.6
resource_class: small
parallelism: 1
steps:
- checkout
- run: sudo pip install requests
- run: python ./utils/link_tester.py
workflow_filters: &workflow_filters
filters:
branches:
@@ -91,9 +120,12 @@ workflows:
version: 2
build_and_test:
jobs:
- build_py3_torch_and_tf
- build_py3_torch
- build_py3_tf
- build_py2_torch
- build_py2_tf
- check_code_quality
- check_repository_consistency
- run_examples_torch
- run_tests_custom_tokenizers
- run_tests_torch_and_tf
- run_tests_torch
- run_tests_tf
- build_doc
- deploy_doc: *workflow_filters

View File

@@ -3,7 +3,7 @@ cd docs
function deploy_doc(){
echo "Creating doc at commit $1 and pushing to folder $2"
git checkout $1
if [ ! -z "$2" ]
if [ ! -z "$2" ]
then
if [ -d "$dir/$2" ]; then
echo "Directory" $2 "already exists"
@@ -17,10 +17,13 @@ function deploy_doc(){
fi
}
deploy_doc "master"
deploy_doc "master"
deploy_doc "b33a385" v1.0.0
deploy_doc "fe02e45" v1.1.0
deploy_doc "89fd345" v1.2.0
deploy_doc "fc9faa8" v2.0.0
deploy_doc "3ddce1d" v2.1.1
deploy_doc "3616209" v2.2.0
deploy_doc "d0f8b9a" v2.3.0
deploy_doc "6664ea9" v2.4.0
deploy_doc "fb560dc" v2.5.0

View File

@@ -1,17 +1,17 @@
---
name: "\U0001F5A5 New Benchmark"
about: You benchmark a part of this library and would like to share your results
name: "\U0001F5A5 New benchmark"
about: Benchmark a part of this library and share your results
title: "[Benchmark]"
labels: ''
assignees: ''
---
# Benchmarking Transformers
# 🖥 Benchmarking `transformers`
## Benchmark
Which part of Transformers did you benchmark?
Which part of `transformers` did you benchmark?
## Set-up

View File

@@ -1,24 +1,20 @@
---
name: "\U0001F31FNew model addition"
name: "\U0001F31F New model addition"
about: Submit a proposal/request to implement a new Transformer-based model
title: ''
labels: ''
labels: New model
assignees: ''
---
# 🌟New model addition
# 🌟 New model addition
## Model description
<!-- Important information -->
## Open Source status
## Open source status
* [ ] the model implementation is available: (give details)
* [ ] the model weights are available: (give details)
* [ ] who are the authors: (mention them)
## Additional context
<!-- Add any other context about the problem here. -->
* [ ] who are the authors: (mention them, if possible by @gh-username)

View File

@@ -1,29 +1,29 @@
---
name: "\U0001F41B Bug Report"
about: Submit a bug report to help us improve PyTorch Transformers
about: Submit a bug report to help us improve transformers
title: ''
labels: ''
assignees: ''
---
## 🐛 Bug
# 🐛 Bug
<!-- Important information -->
## Information
Model I am using (Bert, XLNet....):
Model I am using (Bert, XLNet ...):
Language I am using the model on (English, Chinese....):
Language I am using the model on (English, Chinese ...):
The problem arise when using:
* [ ] the official example scripts: (give details)
* [ ] my own modified scripts: (give details)
The problem arises when using:
* [ ] the official example scripts: (give details below)
* [ ] my own modified scripts: (give details below)
The tasks I am working on is:
* [ ] an official GLUE/SQUaD task: (give the name)
* [ ] my own task or dataset: (give details)
* [ ] my own task or dataset: (give details below)
## To Reproduce
## To reproduce
Steps to reproduce the behavior:
@@ -31,22 +31,22 @@ Steps to reproduce the behavior:
2.
3.
<!-- If you have a code sample, error messages, stack traces, please provide it here as well. -->
<!-- If you have code snippets, error messages, stack traces please provide them here as well.
Important! Use code tags to correctly format your code. See https://help.github.com/en/github/writing-on-github/creating-and-highlighting-code-blocks#syntax-highlighting
Do not use screenshots, as they are hard to read and (more importantly) don't allow others to copy-and-paste your code.-->
## Expected behavior
<!-- A clear and concise description of what you expected to happen. -->
<!-- A clear and concise description of what you would expect to happen. -->
## Environment
* OS:
* Python version:
* PyTorch version:
* PyTorch Transformers version (or branch):
* Using GPU ?
* Distributed of parallel setup ?
* Any other relevant information:
## Additional context
<!-- Add any other context about the problem here. -->
## Environment info
<!-- You can run the command `transformers-cli env` and copy-and-paste its output below.
Don't forget to fill out the missing fields in that output! -->
- `transformers` version:
- Platform:
- Python version:
- PyTorch version (GPU?):
- Tensorflow version (GPU?):
- Using GPU in script?:
- Using distributed or parallel set-up in script?:

View File

@@ -1,20 +1,25 @@
---
name: "\U0001F680 Feature Request"
about: Submit a proposal/request for a new PyTorch Transformers feature
name: "\U0001F680 Feature request"
about: Submit a proposal/request for a new transformers feature
title: ''
labels: ''
assignees: ''
---
## 🚀 Feature
# 🚀 Feature request
<!-- A clear and concise description of the feature proposal. Please provide a link to the paper and code in case they exist. -->
<!-- A clear and concise description of the feature proposal.
Please provide a link to the paper and code in case they exist. -->
## Motivation
<!-- Please outline the motivation for the proposal. Is your feature request related to a problem? e.g., I'm always frustrated when [...]. If this is related to another GitHub issue, please link here too. -->
<!-- Please outline the motivation for the proposal. Is your feature request
related to a problem? e.g., I'm always frustrated when [...]. If this is related
to another GitHub issue, please link here too. -->
## Additional context
## Your contribution
<!-- Add any other context or screenshots about the feature request here. -->
<!-- Is there any way that you could help, e.g. by submitting a PR?
Make sure to read the CONTRIBUTING.MD readme:
https://github.com/huggingface/transformers/blob/master/CONTRIBUTING.md -->

View File

@@ -1,47 +1,58 @@
---
name: "\U0001F4DA Migration from PyTorch-pretrained-Bert"
about: Report a problem when migrating from PyTorch-pretrained-Bert to Transformers
name: "\U0001F4DA Migration from pytorch-pretrained-bert or pytorch-transformers"
about: Report a problem when migrating from pytorch-pretrained-bert or pytorch-transformers
to transformers
title: ''
labels: ''
labels: Migration
assignees: ''
---
## 📚 Migration
# 📚 Migration
## Information
<!-- Important information -->
Model I am using (Bert, XLNet....):
Model I am using (Bert, XLNet ...):
Language I am using the model on (English, Chinese....):
Language I am using the model on (English, Chinese ...):
The problem arise when using:
* [ ] the official example scripts: (give details)
* [ ] my own modified scripts: (give details)
The problem arises when using:
* [ ] the official example scripts: (give details below)
* [ ] my own modified scripts: (give details below)
The tasks I am working on is:
* [ ] an official GLUE/SQUaD task: (give the name)
* [ ] my own task or dataset: (give details)
* [ ] my own task or dataset: (give details below)
Details of the issue:
## Details
<!-- A clear and concise description of the migration issue. If you have code snippets, please provide it here as well. -->
<!-- A clear and concise description of the migration issue.
If you have code snippets, please provide it here as well.
Important! Use code tags to correctly format your code. See https://help.github.com/en/github/writing-on-github/creating-and-highlighting-code-blocks#syntax-highlighting
Do not use screenshots, as they are hard to read and (more importantly) don't allow others to copy-and-paste your code.
-->
## Environment
## Environment info
<!-- You can run the command `python transformers-cli env` and copy-and-paste its output below.
Don't forget to fill out the missing fields in that output! -->
- `transformers` version:
- Platform:
- Python version:
- PyTorch version (GPU?):
- Tensorflow version (GPU?):
- Using GPU in script?:
- Using distributed or parallel set-up in script?:
<!-- IMPORTANT: which version of the former library do you use? -->
* `pytorch-transformers` or `pytorch-pretrained-bert` version (or branch):
* OS:
* Python version:
* PyTorch version:
* PyTorch Transformers version (or branch):
* Using GPU ?
* Distributed of parallel setup ?
* Any other relevant information:
## Checklist
- [ ] I have read the migration guide in the readme.
([pytorch-transformers](https://github.com/huggingface/transformers#migrating-from-pytorch-transformers-to-transformers);
[pytorch-pretrained-bert](https://github.com/huggingface/transformers#migrating-from-pytorch-pretrained-bert-to-transformers))
- [ ] I checked if a related official extension example runs on my machine.
## Additional context
<!-- Add any other context about the problem here. -->

View File

@@ -1,12 +1,29 @@
---
name: "❓Questions & Help"
about: Start a general discussion related to PyTorch Transformers
name: "❓ Questions & Help"
about: Post your general questions on Stack Overflow tagged huggingface-transformers
title: ''
labels: ''
assignees: ''
---
## ❓ Questions & Help
# ❓ Questions & Help
<!-- A clear and concise description of the question. -->
<!-- The GitHub issue tracker is primarly intended for bugs, feature requests,
new models and benchmarks, and migration questions. For all other questions,
we direct you to Stack Overflow (SO) where a whole community of PyTorch and
Tensorflow enthusiast can help you out. Make sure to tag your question with the
right deep learning framework as well as the huggingface-transformers tag:
https://stackoverflow.com/questions/tagged/huggingface-transformers
If your question wasn't answered after a period of time on Stack Overflow, you
can always open a question on GitHub. You should then link to the SO question
that you posted.
-->
## Details
<!-- Description of your issue -->
<!-- You should first ask your question on SO, and only if
you didn't get an answer ask it here on GitHub. -->
**A link to original question on Stack Overflow**:

19
.github/workflows/github-push.yml vendored Normal file
View File

@@ -0,0 +1,19 @@
name: GitHub-hosted runner
on: push
jobs:
check_code_quality:
runs-on: ubuntu-18.04
steps:
- uses: actions/checkout@v2
- name: Set up Python
uses: actions/setup-python@v1
with:
python-version: 3.7
# - name: Install dependencies
# run: |
# pip install .[tf,torch,quality]

32
.github/workflows/github-torch-hub.yml vendored Normal file
View File

@@ -0,0 +1,32 @@
name: Torch hub integration
on:
push:
branches:
- "*"
jobs:
torch_hub_integration:
runs-on: ubuntu-latest
steps:
# no checkout necessary here.
- name: Extract branch name
run: echo "::set-env name=BRANCH::${GITHUB_REF#refs/heads/}"
- name: Check branch name
run: echo $BRANCH
- name: Set up Python
uses: actions/setup-python@v1
with:
python-version: 3.7
- name: Install dependencies
run: |
pip install torch
pip install numpy tokenizers filelock requests tqdm regex sentencepiece sacremoses
- name: Torch hub list
run: |
python -c "import torch; print(torch.hub.list('huggingface/transformers:$BRANCH'))"
- name: Torch hub help
run: |
python -c "import torch; print(torch.hub.help('huggingface/transformers:$BRANCH', 'modelForSequenceClassification'))"

50
.github/workflows/self-push.yml vendored Normal file
View File

@@ -0,0 +1,50 @@
name: Self-hosted runner (push)
on:
# push:
# branches:
# - master
# pull_request:
repository_dispatch:
jobs:
run_tests_torch_and_tf_gpu:
runs-on: self-hosted
steps:
- uses: actions/checkout@v2
- name: Python version
run: |
which python
python --version
pip --version
- name: Current dir
run: pwd
- run: nvidia-smi
- name: Create new python env (on self-hosted runners we have to handle isolation ourselves)
run: |
python -m venv .env
source .env/bin/activate
which python
python --version
pip --version
- name: Install dependencies
run: |
source .env/bin/activate
pip install .[sklearn,tf,torch,testing]
pip uninstall -y tensorflow
- name: Are GPUs recognized by our DL frameworks
run: |
source .env/bin/activate
python -c "import torch; print(torch.cuda.is_available())"
- name: Run all non-slow tests on GPU
env:
TF_FORCE_GPU_ALLOW_GROWTH: "true"
# TF_GPU_MEMORY_LIMIT: 4096
OMP_NUM_THREADS: 1
USE_CUDA: yes
run: |
source .env/bin/activate
python -m pytest -n 2 --dist=loadfile -s -v ./tests/

51
.github/workflows/self-scheduled.yml vendored Normal file
View File

@@ -0,0 +1,51 @@
name: Self-hosted runner (scheduled)
on:
push:
branches:
- ci_*
repository_dispatch:
schedule:
- cron: "0 0 * * *"
jobs:
run_all_tests_torch_and_tf_gpu:
runs-on: self-hosted
steps:
- uses: actions/checkout@v2
- name: Python version
run: |
which python
python --version
pip --version
- name: Current dir
run: pwd
- run: nvidia-smi
- name: Create new python env (on self-hosted runners we have to handle isolation ourselves)
run: |
python -m venv .env
source .env/bin/activate
which python
python --version
pip --version
- name: Install dependencies
run: |
source .env/bin/activate
pip install .[sklearn,tf,torch,testing]
- name: Are GPUs recognized by our DL frameworks
run: |
source .env/bin/activate
python -c "import torch; print(torch.cuda.is_available())"
python -c "import tensorflow as tf; print(tf.test.is_built_with_cuda(), tf.config.list_physical_devices('GPU'))"
- name: Run all tests on GPU
env:
TF_FORCE_GPU_ALLOW_GROWTH: "true"
OMP_NUM_THREADS: 1
RUN_SLOW: yes
USE_CUDA: yes
run: |
source .env/bin/activate
python -m pytest -n 1 --dist=loadfile -s -v ./tests/

11
.gitignore vendored
View File

@@ -130,7 +130,10 @@ proc_data
# examples
runs
examples/runs
/runs_old
/wandb
/examples/runs
/examples/**/*.args
# data
/data
@@ -139,3 +142,9 @@ serialization_dir
# emacs
*.*~
debug.env
# vim
.*.swp
#ctags
tags

View File

@@ -41,14 +41,10 @@ Did not find it? :( So we can act quickly on it, please follow these steps:
less than 30s;
* Provide the *full* traceback if an exception is raised.
To get the OS and software versions, execute the following code and copy-paste
the output:
To get the OS and software versions automatically, you can run the following command:
```
import platform; print("Platform", platform.platform())
import sys; print("Python", sys.version)
import torch; print("PyTorch", torch.__version__)
import tensorflow; print("Tensorflow", tensorflow.__version__)
```bash
python transformers-cli env
```
### Do you want to implement a new model?
@@ -100,9 +96,10 @@ Follow these steps to start contributing:
1. Fork the [repository](https://github.com/huggingface/transformers) by
clicking on the 'Fork' button on the repository's page. This creates a copy of the code
under your github user account.
under your GitHub user account.
2. Clone your fork to your local disk, and add the base repository as a remote:
```bash
$ git clone git@github.com:<your Github handle>/transformers.git
$ cd transformers
@@ -114,43 +111,77 @@ Follow these steps to start contributing:
```bash
$ git checkout -b a-descriptive-name-for-my-changes
```
**do not** work on the `master` branch.
4. Set up a development environment by running the following command in a virtual environment:
```bash
$ pip install -r requirements-dev.txt
$ pip install -e ".[dev]"
```
5. Develop the features on your branch. Add changed files using `git add` and
then `git commit` to record your changes locally:
(If transformers was already installed in the virtual environment, remove
it with `pip uninstall transformers` before reinstalling it in editable
mode with the `-e` flag.)
Right now, we need an unreleased version of `isort` to avoid a
[bug](https://github.com/timothycrosley/isort/pull/1000):
```bash
$ pip install -U git+git://github.com/timothycrosley/isort.git@e63ae06ec7d70b06df9e528357650281a3d3ec22#egg=isort
```
5. Develop the features on your branch.
As you work on the features, you should make sure that the test suite
passes:
```bash
$ make test
```
`transformers` relies on `black` and `isort` to format its source code
consistently. After you make changes, format them with:
```bash
$ make style
```
`transformers` also uses `flake8` to check for coding mistakes. Quality
control runs in CI, however you can also run the same checks with:
```bash
$ make quality
```
Once you're happy with your changes, add changed files using `git add` and
make a commit with `git commit` to record your changes locally:
```bash
$ git add modified_file.py
$ git commit
```
Please write [good commit
messages](https://chris.beams.io/posts/git-commit/). It
is a good idea to sync your copy of the code with the original repository
regularly. This way you can quickly account for changes:
messages](https://chris.beams.io/posts/git-commit/).
It is a good idea to sync your copy of the code with the original
repository regularly. This way you can quickly account for changes:
```bash
$ git fetch upstream
$ git rebase upstream/master
```
Push the changes to your account using:
```bash
$ git push -u origin a-descriptive-name-for-my-changes
```
6. Once you are satisfied (**and the checklist below is happy too**), go to the
webpage of your fork on Github. Click on 'Pull request' to send your changes
webpage of your fork on GitHub. Click on 'Pull request' to send your changes
to the project maintainers for review.
7. It's ok if maintainers ask you for changes. It happens to core contributors
too! So everyone can see the changes in the Pull request, work in your local
branch and push the changes to your fork. They will automatically appear in
@@ -166,9 +197,58 @@ Follow these steps to start contributing:
3. To indicate a work in progress please prefix the title with `[WIP]`. These
are useful to avoid duplicated work, and to differentiate it from PRs ready
to be merged;
4. Make sure pre-existing tests still pass;
5. Add high-coverage tests. No quality test, no merge;
6. All public methods must have informative doctrings;
4. Make sure existing tests pass;
5. Add high-coverage tests. No quality test, no merge.
- If you are adding a new model, make sure that you use `ModelTester.all_model_classes = (MyModel, MyModelWithLMHead,...)`, which triggers the common tests.
- If you are adding new `@slow` tests, make sure they pass using `RUN_SLOW=1 python -m pytest tests/test_my_new_model.py`.
CircleCI does not run them.
6. All public methods must have informative docstrings;
### Tests
You can run 🤗 Transformers tests with `unittest` or `pytest`.
We like `pytest` and `pytest-xdist` because it's faster. From the root of the
repository, here's how to run tests with `pytest` for the library:
```bash
$ python -m pytest -n auto --dist=loadfile -s -v ./tests/
```
and for the examples:
```bash
$ pip install -r examples/requirements.txt # only needed the first time
$ python -m pytest -n auto --dist=loadfile -s -v ./examples/
```
In fact, that's how `make test` and `make test-examples` are implemented!
You can specify a smaller set of tests in order to test only the feature
you're working on.
By default, slow tests are skipped. Set the `RUN_SLOW` environment variable to
`yes` to run them. This will download many gigabytes of models — make sure you
have enough disk space and a good Internet connection, or a lot of patience!
```bash
$ RUN_SLOW=yes python -m pytest -n auto --dist=loadfile -s -v ./tests/
$ RUN_SLOW=yes python -m pytest -n auto --dist=loadfile -s -v ./examples/
```
Likewise, set the `RUN_CUSTOM_TOKENIZERS` environment variable to `yes` to run
tests for custom tokenizers, which don't run by default either.
🤗 Transformers uses `pytest` as a test runner only. It doesn't use any
`pytest`-specific features in the test suite itself.
This means `unittest` is fully supported. Here's how to run tests with
`unittest`:
```bash
$ python -m unittest discover -s tests -t . -v
$ python -m unittest discover -s examples -t examples -v
```
### Style guide

24
Makefile Normal file
View File

@@ -0,0 +1,24 @@
.PHONY: quality style test test-examples
# Check that source code meets quality standards
quality:
black --check --line-length 119 --target-version py35 examples templates tests src utils
isort --check-only --recursive examples templates tests src utils
flake8 examples templates tests src utils
# Format source code automatically
style:
black --line-length 119 --target-version py35 examples templates tests src utils
isort --recursive examples templates tests src utils
# Run tests for the library
test:
python -m pytest -n auto --dist=loadfile -s -v ./tests/
# Run tests for examples
test-examples:
python -m pytest -n auto --dist=loadfile -s -v ./examples/

222
README.md
View File

@@ -19,15 +19,14 @@
</p>
<h3 align="center">
<p>State-of-the-art Natural Language Processing for TensorFlow 2.0 and PyTorch
<p>State-of-the-art Natural Language Processing for PyTorch and TensorFlow 2.0
</h3>
🤗 Transformers (formerly known as `pytorch-transformers` and `pytorch-pretrained-bert`) provides state-of-the-art general-purpose architectures (BERT, GPT-2, RoBERTa, XLM, DistilBert, XLNet, CTRL...) for Natural Language Understanding (NLU) and Natural Language Generation (NLG) with over 32+ pretrained models in 100+ languages and deep interoperability between TensorFlow 2.0 and PyTorch.
🤗 Transformers (formerly known as `pytorch-transformers` and `pytorch-pretrained-bert`) provides state-of-the-art general-purpose architectures (BERT, GPT-2, RoBERTa, XLM, DistilBert, XLNet, T5, CTRL...) for Natural Language Understanding (NLU) and Natural Language Generation (NLG) with over thousands of pretrained models in 100+ languages and deep interoperability between PyTorch & TensorFlow 2.0.
[![](https://sourcerer.io/fame/clmnt/huggingface/transformers/images/0)](https://sourcerer.io/fame/clmnt/huggingface/transformers/links/0)[![](https://sourcerer.io/fame/clmnt/huggingface/transformers/images/1)](https://sourcerer.io/fame/clmnt/huggingface/transformers/links/1)[![](https://sourcerer.io/fame/clmnt/huggingface/transformers/images/2)](https://sourcerer.io/fame/clmnt/huggingface/transformers/links/2)[![](https://sourcerer.io/fame/clmnt/huggingface/transformers/images/3)](https://sourcerer.io/fame/clmnt/huggingface/transformers/links/3)[![](https://sourcerer.io/fame/clmnt/huggingface/transformers/images/4)](https://sourcerer.io/fame/clmnt/huggingface/transformers/links/4)[![](https://sourcerer.io/fame/clmnt/huggingface/transformers/images/5)](https://sourcerer.io/fame/clmnt/huggingface/transformers/links/5)[![](https://sourcerer.io/fame/clmnt/huggingface/transformers/images/6)](https://sourcerer.io/fame/clmnt/huggingface/transformers/links/6)[![](https://sourcerer.io/fame/clmnt/huggingface/transformers/images/7)](https://sourcerer.io/fame/clmnt/huggingface/transformers/links/7)
### Features
- As easy to use as pytorch-transformers
- As powerful and concise as Keras
- High performance on NLU and NLG tasks
- Low barrier to entry for educators and practitioners
@@ -39,7 +38,7 @@ State-of-the-art NLP for everyone
Lower compute costs, smaller carbon footprint
- Researchers can share trained models instead of always retraining
- Practitioners can reduce compute time and production costs
- 10 architectures with over 30 pretrained models, some in more than 100 languages
- Dozens of architectures with over 1,000 pretrained models, some in more than 100 languages
Choose the right framework for every part of a model's lifetime
- Train state-of-the-art models in 3 lines of code
@@ -55,14 +54,22 @@ Choose the right framework for every part of a model's lifetime
| [Online demo](#online-demo) | Experimenting with this repos text generation capabilities |
| [Quick tour: Usage](#quick-tour) | Tokenizers & models usage: Bert and GPT-2 |
| [Quick tour: TF 2.0 and PyTorch ](#Quick-tour-TF-20-training-and-PyTorch-interoperability) | Train a TF 2.0 model in 10 lines of code, load it in PyTorch |
| [Quick tour: pipelines](#quick-tour-of-pipelines) | Using Pipelines: Wrapper around tokenizer and models to use finetuned models |
| [Quick tour: Fine-tuning/usage scripts](#quick-tour-of-the-fine-tuningusage-scripts) | Using provided scripts: GLUE, SQuAD and Text generation |
| [Quick tour: Share your models ](#Quick-tour-of-model-sharing) | Upload and share your fine-tuned models with the community |
| [Migrating from pytorch-transformers to transformers](#Migrating-from-pytorch-transformers-to-transformers) | Migrating your code from pytorch-transformers to transformers |
| [Migrating from pytorch-pretrained-bert to pytorch-transformers](#Migrating-from-pytorch-pretrained-bert-to-transformers) | Migrating your code from pytorch-pretrained-bert to transformers |
| [Documentation][(v2.2.0/v2.2.1)](https://huggingface.co/transformers/v2.2.0) [(v2.1.1)](https://huggingface.co/transformers/v2.1.1) [(v2.0.0)](https://huggingface.co/transformers/v2.0.0) [(v1.2.0)](https://huggingface.co/transformers/v1.2.0) [(v1.1.0)](https://huggingface.co/transformers/v1.1.0) [(v1.0.0)](https://huggingface.co/transformers/v1.0.0) [(master)](https://huggingface.co/transformers) | Full API documentation and more |
| [Documentation][(v2.5.0)](https://huggingface.co/transformers/v2.5.0)[(v2.4.0/v2.4.1)](https://huggingface.co/transformers/v2.4.0)[(v2.3.0)](https://huggingface.co/transformers/v2.3.0)[(v2.2.0/v2.2.1/v2.2.2)](https://huggingface.co/transformers/v2.2.0) [(v2.1.1)](https://huggingface.co/transformers/v2.1.1) [(v2.0.0)](https://huggingface.co/transformers/v2.0.0) [(v1.2.0)](https://huggingface.co/transformers/v1.2.0) [(v1.1.0)](https://huggingface.co/transformers/v1.1.0) [(v1.0.0)](https://huggingface.co/transformers/v1.0.0) [(master)](https://huggingface.co/transformers) | Full API documentation and more |
## Installation
This repo is tested on Python 2.7 and 3.5+ (examples are tested only on python 3.5+), PyTorch 1.0.0+ and TensorFlow 2.0.0-rc1
This repo is tested on Python 3.6+, PyTorch 1.0.0+ and TensorFlow 2.0.
You should install 🤗 Transformers in a [virtual environment](https://docs.python.org/3/library/venv.html). If you're unfamiliar with Python virtual environments, check out the [user guide](https://packaging.python.org/guides/installing-using-pip-and-virtual-environments/).
Create a virtual environment with the version of Python you're going to use and activate it.
Now, if you want to use 🤗 Transformers, you can install it with pip. If you'd like to play with the examples, you must install it from source.
### With pip
@@ -83,35 +90,49 @@ Please refer to [TensorFlow installation page](https://www.tensorflow.org/instal
When TensorFlow 2.0 and/or PyTorch has been installed, you can install from source by cloning the repository and running:
```bash
pip install [--editable] .
git clone https://github.com/huggingface/transformers
cd transformers
pip install .
```
When you update the repository, you should upgrade the transformers installation and its dependencies as follows:
```bash
git pull
pip install --upgrade .
```
### Run the examples
Examples are included in the repository but are not shipped with the library.
Therefore, in order to run the latest versions of the examples you also need to install from source. To do so, create a new virtual environment and follow these steps:
```bash
git clone https://github.com/huggingface/transformers
cd transformers
pip install [--editable] .
```
Therefore, in order to run the latest versions of the examples, you need to install from source, as described above.
Look at the [README](https://github.com/huggingface/transformers/blob/master/examples/README.md) for how to run examples.
### Tests
A series of tests are included for the library and the example scripts. Library tests can be found in the [tests folder](https://github.com/huggingface/transformers/tree/master/transformers/tests) and examples tests in the [examples folder](https://github.com/huggingface/transformers/tree/master/examples).
These tests can be run using `pytest` (install pytest if needed with `pip install pytest`).
A series of tests are included for the library and for some example scripts. Library tests can be found in the [tests folder](https://github.com/huggingface/transformers/tree/master/tests) and examples tests in the [examples folder](https://github.com/huggingface/transformers/tree/master/examples).
Depending on which framework is installed (TensorFlow 2.0 and/or PyTorch), the irrelevant tests will be skipped. Ensure that both frameworks are installed if you want to execute all tests.
You can run the tests from the root of the cloned repository with the commands:
Here's the easiest way to run tests for the library:
```bash
python -m pytest -sv ./transformers/tests/
python -m pytest -sv ./examples/
pip install -e ".[testing]"
make test
```
and for the examples:
```bash
pip install -e ".[testing]"
pip install -r examples/requirements.txt
make test-examples
```
For details, refer to the [contributing guide](https://github.com/huggingface/transformers/blob/master/CONTRIBUTING.md#tests).
### Do you want to run a Transformer model on a mobile device?
You should check out our [`swift-coreml-transformers`](https://github.com/huggingface/swift-coreml-transformers) repo.
@@ -122,20 +143,29 @@ At some point in the future, you'll be able to seamlessly move from pre-training
## Model architectures
🤗 Transformers currently provides 10 NLU/NLG architectures:
🤗 Transformers currently provides the following NLU/NLG architectures:
1. **[BERT](https://github.com/google-research/bert)** (from Google) released with the paper [BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding](https://arxiv.org/abs/1810.04805) by Jacob Devlin, Ming-Wei Chang, Kenton Lee and Kristina Toutanova.
2. **[GPT](https://github.com/openai/finetune-transformer-lm)** (from OpenAI) released with the paper [Improving Language Understanding by Generative Pre-Training](https://blog.openai.com/language-unsupervised/) by Alec Radford, Karthik Narasimhan, Tim Salimans and Ilya Sutskever.
3. **[GPT-2](https://blog.openai.com/better-language-models/)** (from OpenAI) released with the paper [Language Models are Unsupervised Multitask Learners](https://blog.openai.com/better-language-models/) by Alec Radford*, Jeffrey Wu*, Rewon Child, David Luan, Dario Amodei** and Ilya Sutskever**.
4. **[Transformer-XL](https://github.com/kimiyoung/transformer-xl)** (from Google/CMU) released with the paper [Transformer-XL: Attentive Language Models Beyond a Fixed-Length Context](https://arxiv.org/abs/1901.02860) by Zihang Dai*, Zhilin Yang*, Yiming Yang, Jaime Carbonell, Quoc V. Le, Ruslan Salakhutdinov.
5. **[XLNet](https://github.com/zihangdai/xlnet/)** (from Google/CMU) released with the paper [XLNet: Generalized Autoregressive Pretraining for Language Understanding](https://arxiv.org/abs/1906.08237) by Zhilin Yang*, Zihang Dai*, Yiming Yang, Jaime Carbonell, Ruslan Salakhutdinov, Quoc V. Le.
6. **[XLM](https://github.com/facebookresearch/XLM/)** (from Facebook) released together with the paper [Cross-lingual Language Model Pretraining](https://arxiv.org/abs/1901.07291) by Guillaume Lample and Alexis Conneau.
7. **[RoBERTa](https://github.com/pytorch/fairseq/tree/master/examples/roberta)** (from Facebook), released together with the paper a [Robustly Optimized BERT Pretraining Approach](https://arxiv.org/abs/1907.11692) by Yinhan Liu, Myle Ott, Naman Goyal, Jingfei Du, Mandar Joshi, Danqi Chen, Omer Levy, Mike Lewis, Luke Zettlemoyer, Veselin Stoyanov.
8. **[DistilBERT](https://github.com/huggingface/transformers/tree/master/examples/distillation)** (from HuggingFace), released together with the paper [DistilBERT, a distilled version of BERT: smaller, faster, cheaper and lighter](https://arxiv.org/abs/1910.01108) by Victor Sanh, Lysandre Debut and Thomas Wolf. The same method has been applied to compress GPT2 into [DistilGPT2](https://github.com/huggingface/transformers/tree/master/examples/distillation).
9. **[CTRL](https://github.com/salesforce/ctrl/)** (from Salesforce) released with the paper [CTRL: A Conditional Transformer Language Model for Controllable Generation](https://arxiv.org/abs/1909.05858) by Nitish Shirish Keskar*, Bryan McCann*, Lav R. Varshney, Caiming Xiong and Richard Socher.
10. **[CamemBERT](https://camembert-model.fr)** (from Inria/Facebook/Sorbonne) released with the paper [CamemBERT: a Tasty French Language Model](https://arxiv.org/abs/1911.03894) by Louis Martin*, Benjamin Muller*, Pedro Javier Ortiz Suárez*, Yoann Dupont, Laurent Romary, Éric Villemonte de la Clergerie, Djamé Seddah and Benoît Sagot.
11. **[ALBERT](https://github.com/google-research/google-research/tree/master/albert)** (from Google Research and the Toyota Technological Institute at Chicago) released with the paper [ALBERT: A Lite BERT for Self-supervised Learning of Language Representations](https://arxiv.org/abs/1909.11942), by Zhenzhong Lan, Mingda Chen, Sebastian Goodman, Kevin Gimpel, Piyush Sharma, Radu Soricut.
11. Want to contribute a new model? We have added a **detailed guide and templates** to guide you in the process of adding a new model. You can find them in the [`templates`](./templates) folder of the repository. Be sure to check the [contributing guidelines](./CONTRIBUTING.md) and contact the maintainers or open an issue to collect feedbacks before starting your PR.
1. **[BERT](https://huggingface.co/transformers/model_doc/bert.html)** (from Google) released with the paper [BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding](https://arxiv.org/abs/1810.04805) by Jacob Devlin, Ming-Wei Chang, Kenton Lee and Kristina Toutanova.
2. **[GPT](https://huggingface.co/transformers/model_doc/gpt.html)** (from OpenAI) released with the paper [Improving Language Understanding by Generative Pre-Training](https://blog.openai.com/language-unsupervised/) by Alec Radford, Karthik Narasimhan, Tim Salimans and Ilya Sutskever.
3. **[GPT-2](https://huggingface.co/transformers/model_doc/gpt2.html)** (from OpenAI) released with the paper [Language Models are Unsupervised Multitask Learners](https://blog.openai.com/better-language-models/) by Alec Radford*, Jeffrey Wu*, Rewon Child, David Luan, Dario Amodei** and Ilya Sutskever**.
4. **[Transformer-XL](https://huggingface.co/transformers/model_doc/transformerxl.html)** (from Google/CMU) released with the paper [Transformer-XL: Attentive Language Models Beyond a Fixed-Length Context](https://arxiv.org/abs/1901.02860) by Zihang Dai*, Zhilin Yang*, Yiming Yang, Jaime Carbonell, Quoc V. Le, Ruslan Salakhutdinov.
5. **[XLNet](https://huggingface.co/transformers/model_doc/xlnet.html)** (from Google/CMU) released with the paper [XLNet: Generalized Autoregressive Pretraining for Language Understanding](https://arxiv.org/abs/1906.08237) by Zhilin Yang*, Zihang Dai*, Yiming Yang, Jaime Carbonell, Ruslan Salakhutdinov, Quoc V. Le.
6. **[XLM](https://huggingface.co/transformers/model_doc/xlm.html)** (from Facebook) released together with the paper [Cross-lingual Language Model Pretraining](https://arxiv.org/abs/1901.07291) by Guillaume Lample and Alexis Conneau.
7. **[RoBERTa](https://huggingface.co/transformers/model_doc/roberta.html)** (from Facebook), released together with the paper a [Robustly Optimized BERT Pretraining Approach](https://arxiv.org/abs/1907.11692) by Yinhan Liu, Myle Ott, Naman Goyal, Jingfei Du, Mandar Joshi, Danqi Chen, Omer Levy, Mike Lewis, Luke Zettlemoyer, Veselin Stoyanov.
8. **[DistilBERT](https://huggingface.co/transformers/model_doc/distilbert.html)** (from HuggingFace), released together with the paper [DistilBERT, a distilled version of BERT: smaller, faster, cheaper and lighter](https://arxiv.org/abs/1910.01108) by Victor Sanh, Lysandre Debut and Thomas Wolf. The same method has been applied to compress GPT2 into [DistilGPT2](https://github.com/huggingface/transformers/tree/master/examples/distillation), RoBERTa into [DistilRoBERTa](https://github.com/huggingface/transformers/tree/master/examples/distillation), Multilingual BERT into [DistilmBERT](https://github.com/huggingface/transformers/tree/master/examples/distillation) and a German version of DistilBERT.
9. **[CTRL](https://huggingface.co/transformers/model_doc/ctrl.html)** (from Salesforce) released with the paper [CTRL: A Conditional Transformer Language Model for Controllable Generation](https://arxiv.org/abs/1909.05858) by Nitish Shirish Keskar*, Bryan McCann*, Lav R. Varshney, Caiming Xiong and Richard Socher.
10. **[CamemBERT](https://huggingface.co/transformers/model_doc/camembert.html)** (from Inria/Facebook/Sorbonne) released with the paper [CamemBERT: a Tasty French Language Model](https://arxiv.org/abs/1911.03894) by Louis Martin*, Benjamin Muller*, Pedro Javier Ortiz Suárez*, Yoann Dupont, Laurent Romary, Éric Villemonte de la Clergerie, Djamé Seddah and Benoît Sagot.
11. **[ALBERT](https://huggingface.co/transformers/model_doc/albert.html)** (from Google Research and the Toyota Technological Institute at Chicago) released with the paper [ALBERT: A Lite BERT for Self-supervised Learning of Language Representations](https://arxiv.org/abs/1909.11942), by Zhenzhong Lan, Mingda Chen, Sebastian Goodman, Kevin Gimpel, Piyush Sharma, Radu Soricut.
12. **[T5](https://huggingface.co/transformers/model_doc/t5.html)** (from Google AI) released with the paper [Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer](https://arxiv.org/abs/1910.10683) by Colin Raffel and Noam Shazeer and Adam Roberts and Katherine Lee and Sharan Narang and Michael Matena and Yanqi Zhou and Wei Li and Peter J. Liu.
13. **[XLM-RoBERTa](https://huggingface.co/transformers/model_doc/xlmroberta.html)** (from Facebook AI), released together with the paper [Unsupervised Cross-lingual Representation Learning at Scale](https://arxiv.org/abs/1911.02116) by Alexis Conneau*, Kartikay Khandelwal*, Naman Goyal, Vishrav Chaudhary, Guillaume Wenzek, Francisco Guzmán, Edouard Grave, Myle Ott, Luke Zettlemoyer and Veselin Stoyanov.
14. **[MMBT](https://github.com/facebookresearch/mmbt/)** (from Facebook), released together with the paper a [Supervised Multimodal Bitransformers for Classifying Images and Text](https://arxiv.org/pdf/1909.02950.pdf) by Douwe Kiela, Suvrat Bhooshan, Hamed Firooz, Davide Testuggine.
15. **[FlauBERT](https://huggingface.co/transformers/model_doc/flaubert.html)** (from CNRS) released with the paper [FlauBERT: Unsupervised Language Model Pre-training for French](https://arxiv.org/abs/1912.05372) by Hang Le, Loïc Vial, Jibril Frej, Vincent Segonne, Maximin Coavoux, Benjamin Lecouteux, Alexandre Allauzen, Benoît Crabbé, Laurent Besacier, Didier Schwab.
16. **[BART](https://huggingface.co/transformers/model_doc/bart.html)** (from Facebook) released with the paper [BART: Denoising Sequence-to-Sequence Pre-training for Natural Language Generation, Translation, and Comprehension](https://arxiv.org/pdf/1910.13461.pdf) by Mike Lewis, Yinhan Liu, Naman Goyal, Marjan Ghazvininejad, Abdelrahman Mohamed, Omer Levy, Ves Stoyanov and Luke Zettlemoyer.
17. **[ELECTRA](https://huggingface.co/transformers/model_doc/electra.html)** (from Google Research/Stanford University) released with the paper [ELECTRA: Pre-training text encoders as discriminators rather than generators](https://arxiv.org/abs/2003.10555) by Kevin Clark, Minh-Thang Luong, Quoc V. Le, Christopher D. Manning.
18. **[DialoGPT](https://huggingface.co/transformers/model_doc/dialogpt.html)** (from Microsoft Research) released with the paper [DialoGPT: Large-Scale Generative Pre-training for Conversational Response Generation](https://arxiv.org/abs/1911.00536) by Yizhe Zhang, Siqi Sun, Michel Galley, Yen-Chun Chen, Chris Brockett, Xiang Gao, Jianfeng Gao, Jingjing Liu, Bill Dolan.
19. **[Reformer](https://huggingface.co/transformers/model_doc/reformer.html)** (from Google Research) released with the paper [Reformer: The Efficient Transformer](https://arxiv.org/abs/2001.04451) by Nikita Kitaev, Łukasz Kaiser, Anselm Levskaya.
20. **[Other community models](https://huggingface.co/models)**, contributed by the [community](https://huggingface.co/users).
21. Want to contribute a new model? We have added a **detailed guide and templates** to guide you in the process of adding a new model. You can find them in the [`templates`](./templates) folder of the repository. Be sure to check the [contributing guidelines](./CONTRIBUTING.md) and contact the maintainers or open an issue to collect feedbacks before starting your PR.
These implementations have been tested on several datasets (see the example scripts) and should match the performances of the original implementations (e.g. ~93 F1 on SQuAD for BERT Whole-Word-Masking, ~88 F1 on RocStories for OpenAI GPT, ~18.3 perplexity on WikiText 103 for Transformer-XL, ~0.916 Peason R coefficient on STS-B for XLNet). You can find more details on the performances in the Examples section of the [documentation](https://huggingface.co/transformers/examples.html).
@@ -157,7 +187,7 @@ import torch
from transformers import *
# Transformers has a unified API
# for 8 transformer architectures and 30 pretrained weights.
# for 10 transformer architectures and 30 pretrained weights.
# Model | Tokenizer | Pretrained weights shortcut
MODELS = [(BertModel, BertTokenizer, 'bert-base-uncased'),
(OpenAIGPTModel, OpenAIGPTTokenizer, 'openai-gpt'),
@@ -166,8 +196,10 @@ MODELS = [(BertModel, BertTokenizer, 'bert-base-uncased'),
(TransfoXLModel, TransfoXLTokenizer, 'transfo-xl-wt103'),
(XLNetModel, XLNetTokenizer, 'xlnet-base-cased'),
(XLMModel, XLMTokenizer, 'xlm-mlm-enfr-1024'),
(DistilBertModel, DistilBertTokenizer, 'distilbert-base-uncased'),
(RobertaModel, RobertaTokenizer, 'roberta-base')]
(DistilBertModel, DistilBertTokenizer, 'distilbert-base-cased'),
(RobertaModel, RobertaTokenizer, 'roberta-base'),
(XLMRobertaModel, XLMRobertaTokenizer, 'xlm-roberta-base'),
]
# To use TensorFlow 2.0 versions of the models, simply prefix the class names with 'TF', e.g. `TFRobertaModel` is the TF 2.0 counterpart of the PyTorch model `RobertaModel`
@@ -235,7 +267,7 @@ valid_dataset = glue_convert_examples_to_features(data['validation'], tokenizer,
train_dataset = train_dataset.shuffle(100).batch(32).repeat(2)
valid_dataset = valid_dataset.batch(64)
# Prepare training: Compile tf.keras model with optimizer, loss and learning rate schedule
# Prepare training: Compile tf.keras model with optimizer, loss and learning rate schedule
optimizer = tf.keras.optimizers.Adam(learning_rate=3e-5, epsilon=1e-08, clipnorm=1.0)
loss = tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True)
metric = tf.keras.metrics.SparseCategoricalAccuracy('accuracy')
@@ -265,15 +297,16 @@ print("sentence_2 is", "a paraphrase" if pred_2 else "not a paraphrase", "of sen
## Quick tour of the fine-tuning/usage scripts
**Important**
**Important**
Before running the fine-tuning scripts, please read the
[instructions](#run-the-examples) on how to
setup your environment to run the examples.
The library comprises several example scripts with SOTA performances for NLU and NLG tasks:
- `run_glue.py`: an example fine-tuning Bert, XLNet and XLM on nine different GLUE tasks (*sequence-level classification*)
- `run_squad.py`: an example fine-tuning Bert, XLNet and XLM on the question answering dataset SQuAD 2.0 (*token-level classification*)
- `run_glue.py`: an example fine-tuning sequence classification models on nine different GLUE tasks (*sequence-level classification*)
- `run_squad.py`: an example fine-tuning question answering models on the question answering dataset SQuAD 2.0 (*token-level classification*)
- `run_ner.py`: an example fine-tuning token classification models on named entity recognition (*token-level classification*)
- `run_generation.py`: an example using GPT, GPT-2, CTRL, Transformer-XL and XLNet for conditional language generation
- other model-specific examples (see the documentation).
@@ -283,7 +316,7 @@ Here are three quick usage examples for these scripts:
The [General Language Understanding Evaluation (GLUE) benchmark](https://gluebenchmark.com/) is a collection of nine sentence- or sentence-pair language understanding tasks for evaluating and analyzing natural language understanding systems.
Before running anyone of these GLUE tasks you should download the
Before running any of these GLUE tasks you should download the
[GLUE data](https://gluebenchmark.com/tasks) by running
[this script](https://gist.github.com/W4ngatang/60c2bdb54d156a41194446737ce03e2e)
and unpack it to some directory `$GLUE_DIR`.
@@ -298,13 +331,11 @@ pip install -r ./examples/requirements.txt
export GLUE_DIR=/path/to/glue
export TASK_NAME=MRPC
python ./examples/run_glue.py \
--model_type bert \
python ./examples/text-classification/run_glue.py \
--model_name_or_path bert-base-uncased \
--task_name $TASK_NAME \
--do_train \
--do_eval \
--do_lower_case \
--data_dir $GLUE_DIR/$TASK_NAME \
--max_seq_length 128 \
--per_gpu_eval_batch_size=8 \
@@ -326,8 +357,7 @@ Parallel training is a simple way to use several GPUs (but is slower and less fl
```shell
export GLUE_DIR=/path/to/glue
python ./examples/run_glue.py \
--model_type xlnet \
python ./examples/text-classification/run_glue.py \
--model_name_or_path xlnet-large-cased \
--do_train \
--do_eval \
@@ -352,13 +382,11 @@ On this machine we thus have a batch size of 32, please increase `gradient_accum
This example code fine-tunes the Bert Whole Word Masking model on the Microsoft Research Paraphrase Corpus (MRPC) corpus using distributed training on 8 V100 GPUs to reach a F1 > 92.
```bash
python -m torch.distributed.launch --nproc_per_node 8 ./examples/run_glue.py \
--model_type bert \
python -m torch.distributed.launch --nproc_per_node 8 ./examples/text-classification/run_glue.py \
--model_name_or_path bert-large-uncased-whole-word-masking \
--task_name MRPC \
--do_train \
--do_eval \
--do_lower_case \
--data_dir $GLUE_DIR/MRPC/ \
--max_seq_length 128 \
--per_gpu_eval_batch_size=8 \
@@ -391,7 +419,6 @@ python -m torch.distributed.launch --nproc_per_node=8 ./examples/run_squad.py \
--model_name_or_path bert-large-uncased-whole-word-masking \
--do_train \
--do_eval \
--do_lower_case \
--train_file $SQUAD_DIR/train-v1.1.json \
--predict_file $SQUAD_DIR/dev-v1.1.json \
--learning_rate 3e-5 \
@@ -426,7 +453,7 @@ python ./examples/run_generation.py \
--model_name_or_path=gpt2 \
```
and from the Salesforce CTRL model:
and from the Salesforce CTRL model:
```shell
python ./examples/run_generation.py \
--model_type=ctrl \
@@ -436,6 +463,95 @@ python ./examples/run_generation.py \
--repetition_penalty=1.2 \
```
## Quick tour of model sharing
Starting with `v2.2.2`, you can now upload and share your fine-tuned models with the community, using the <abbr title="Command-line interface">CLI</abbr> that's built-in to the library.
**First, create an account on [https://huggingface.co/join](https://huggingface.co/join)**. Optionally, join an existing organization or create a new one. Then:
```shell
transformers-cli login
# log in using the same credentials as on huggingface.co
```
Upload your model:
```shell
transformers-cli upload ./path/to/pretrained_model/
# ^^ Upload folder containing weights/tokenizer/config
# saved via `.save_pretrained()`
transformers-cli upload ./config.json [--filename folder/foobar.json]
# ^^ Upload a single file
# (you can optionally override its filename, which can be nested inside a folder)
```
If you want your model to be namespaced by your organization name rather than your username, add the following flag to any command:
```shell
--organization organization_name
```
Your model will then be accessible through its identifier, a concatenation of your username (or organization name) and the folder name above:
```python
"username/pretrained_model"
# or if an org:
"organization_name/pretrained_model"
```
**Please add a README.md model card** to the repo under `model_cards/` with: model description, training params (dataset, preprocessing, hardware used, hyperparameters), evaluation results, intended uses & limitations, etc.
Your model now has a page on huggingface.co/models 🔥
Anyone can load it from code:
```python
tokenizer = AutoTokenizer.from_pretrained("namespace/pretrained_model")
model = AutoModel.from_pretrained("namespace/pretrained_model")
```
List all your files on S3:
```shell
transformers-cli s3 ls
```
You can also delete unneeded files:
```shell
transformers-cli s3 rm …
```
## Quick tour of pipelines
New in version `v2.3`: `Pipeline` are high-level objects which automatically handle tokenization, running your data through a transformers model
and outputting the result in a structured object.
You can create `Pipeline` objects for the following down-stream tasks:
- `feature-extraction`: Generates a tensor representation for the input sequence
- `ner`: Generates named entity mapping for each word in the input sequence.
- `sentiment-analysis`: Gives the polarity (positive / negative) of the whole input sequence.
- `text-classification`: Initialize a `TextClassificationPipeline` directly, or see `sentiment-analysis` for an example.
- `question-answering`: Provided some context and a question refering to the context, it will extract the answer to the question in the context.
- `fill-mask`: Takes an input sequence containing a masked token (e.g. `<mask>`) and return list of most probable filled sequences, with their probabilities.
- `summarization`
- `translation_xx_to_yy`
```python
from transformers import pipeline
# Allocate a pipeline for sentiment-analysis
nlp = pipeline('sentiment-analysis')
nlp('We are very happy to include pipeline into the transformers repository.')
>>> {'label': 'POSITIVE', 'score': 0.99893874}
# Allocate a pipeline for question-answering
nlp = pipeline('question-answering')
nlp({
'question': 'What is the name of the repository ?',
'context': 'Pipeline have been included in the huggingface/transformers repository'
})
>>> {'score': 0.28756016668193496, 'start': 35, 'end': 59, 'answer': 'huggingface/transformers'}
```
## Migrating from pytorch-transformers to transformers
Here is a quick summary of what you should take care of when migrating from `pytorch-transformers` to `transformers`.
@@ -567,7 +683,7 @@ for batch in train_data:
## Citation
We now have a paper you can cite for the 🤗 Transformers library:
```
```bibtex
@article{Wolf2019HuggingFacesTS,
title={HuggingFace's Transformers: State-of-the-art Natural Language Processing},
author={Thomas Wolf and Lysandre Debut and Victor Sanh and Julien Chaumond and Clement Delangue and Anthony Moi and Pierric Cistac and Tim Rault and R'emi Louf and Morgan Funtowicz and Jamie Brew},

View File

@@ -19,4 +19,5 @@ deploy_doc "fe02e45" v1.1.0
deploy_doc "89fd345" v1.2.0
deploy_doc "fc9faa8" v2.0.0
deploy_doc "3ddce1d" v2.1.1
deploy_doc "f2f3294" v2.2.0
deploy_doc "f2f3294" v2.2.0
deploy_doc "d0f8b9a" v2.3.0

View File

@@ -1,7 +0,0 @@
FROM pytorch/pytorch:latest
RUN git clone https://github.com/NVIDIA/apex.git && cd apex && python setup.py install --cuda_ext --cpp_ext
RUN pip install transformers
WORKDIR /workspace

View File

@@ -0,0 +1,26 @@
FROM ubuntu:18.04
LABEL maintainer="Hugging Face"
LABEL repository="transformers"
RUN apt update && \
apt install -y bash \
build-essential \
git \
curl \
ca-certificates \
python3 \
python3-pip && \
rm -rf /var/lib/apt/lists
RUN python3 -m pip install --no-cache-dir --upgrade pip && \
python3 -m pip install --no-cache-dir \
jupyter \
tensorflow-cpu \
torch
WORKDIR /workspace
COPY . transformers/
RUN cd transformers/ && \
python3 -m pip install --no-cache-dir .
CMD ["/bin/bash"]

View File

@@ -0,0 +1,26 @@
FROM nvidia/cuda:10.1-cudnn7-runtime-ubuntu18.04
LABEL maintainer="Hugging Face"
LABEL repository="transformers"
RUN apt update && \
apt install -y bash \
build-essential \
git \
curl \
ca-certificates \
python3 \
python3-pip && \
rm -rf /var/lib/apt/lists
RUN python3 -m pip install --no-cache-dir --upgrade pip && \
python3 -m pip install --no-cache-dir \
jupyter \
tensorflow \
torch
WORKDIR /workspace
COPY . transformers/
RUN cd transformers/ && \
python3 -m pip install --no-cache-dir .
CMD ["/bin/bash"]

View File

@@ -0,0 +1,25 @@
FROM ubuntu:18.04
LABEL maintainer="Hugging Face"
LABEL repository="transformers"
RUN apt update && \
apt install -y bash \
build-essential \
git \
curl \
ca-certificates \
python3 \
python3-pip && \
rm -rf /var/lib/apt/lists
RUN python3 -m pip install --no-cache-dir --upgrade pip && \
python3 -m pip install --no-cache-dir \
jupyter \
torch
WORKDIR /workspace
COPY . transformers/
RUN cd transformers/ && \
python3 -m pip install --no-cache-dir .
CMD ["/bin/bash"]

View File

@@ -0,0 +1,25 @@
FROM nvidia/cuda:10.1-cudnn7-runtime-ubuntu18.04
LABEL maintainer="Hugging Face"
LABEL repository="transformers"
RUN apt update && \
apt install -y bash \
build-essential \
git \
curl \
ca-certificates \
python3 \
python3-pip && \
rm -rf /var/lib/apt/lists
RUN python3 -m pip install --no-cache-dir --upgrade pip && \
python3 -m pip install --no-cache-dir \
mkl \
torch
WORKDIR /workspace
COPY . transformers/
RUN cd transformers/ && \
python3 -m pip install --no-cache-dir .
CMD ["/bin/bash"]

View File

@@ -0,0 +1,25 @@
FROM ubuntu:18.04
LABEL maintainer="Hugging Face"
LABEL repository="transformers"
RUN apt update && \
apt install -y bash \
build-essential \
git \
curl \
ca-certificates \
python3 \
python3-pip && \
rm -rf /var/lib/apt/lists
RUN python3 -m pip install --no-cache-dir --upgrade pip && \
python3 -m pip install --no-cache-dir \
mkl \
tensorflow-cpu
WORKDIR /workspace
COPY . transformers/
RUN cd transformers/ && \
python3 -m pip install --no-cache-dir .
CMD ["/bin/bash"]

View File

@@ -0,0 +1,25 @@
FROM nvidia/cuda:10.1-cudnn7-runtime-ubuntu18.04
LABEL maintainer="Hugging Face"
LABEL repository="transformers"
RUN apt update && \
apt install -y bash \
build-essential \
git \
curl \
ca-certificates \
python3 \
python3-pip && \
rm -rf /var/lib/apt/lists
RUN python3 -m pip install --no-cache-dir --upgrade pip && \
python3 -m pip install --no-cache-dir \
mkl \
tensorflow
WORKDIR /workspace
COPY . transformers/
RUN cd transformers/ && \
python3 -m pip install --no-cache-dir .
CMD ["/bin/bash"]

View File

@@ -1,25 +1,25 @@
# Generating the documentation
To generate the documentation, you first have to build it. Several packages are necessary to build the doc,
you can install them using:
you can install them with the following command, at the root of the code repository:
```bash
pip install -r requirements.txt
pip install -e ".[docs]"
```
## Packages installed
Here's an overview of all the packages installed. If you ran the previous command installing all packages from
Here's an overview of all the packages installed. If you ran the previous command installing all packages from
`requirements.txt`, you do not need to run the following commands.
Building it requires the package `sphinx` that you can
Building it requires the package `sphinx` that you can
install using:
```bash
pip install -U sphinx
```
You would also need the custom installed [theme](https://github.com/readthedocs/sphinx_rtd_theme) by
You would also need the custom installed [theme](https://github.com/readthedocs/sphinx_rtd_theme) by
[Read The Docs](https://readthedocs.org/). You can install it using the following command:
```bash
@@ -34,7 +34,7 @@ pip install recommonmark
## Building the documentation
Make sure that there is a symlink from the `example` file (in /examples) inside the source folder. Run the following
Make sure that there is a symlink from the `example` file (in /examples) inside the source folder. Run the following
command to generate it:
```bash
@@ -47,6 +47,8 @@ Once you have setup `sphinx`, you can build the documentation by running the fol
make html
```
A folder called ``_build/html`` should have been created. You can now open the file ``_build/html/index.html`` in your browser.
---
**NOTE**

View File

@@ -1,32 +0,0 @@
alabaster==0.7.12
Babel==2.7.0
certifi==2019.6.16
chardet==3.0.4
commonmark==0.9.0
docutils==0.14
future==0.17.1
idna==2.8
imagesize==1.1.0
Jinja2==2.10.1
MarkupSafe==1.1.1
packaging==19.0
Pygments==2.4.2
pyparsing==2.4.0
pytz==2019.1
recommonmark==0.5.0
requests==2.22.0
six==1.12.0
snowballstemmer==1.9.0
Sphinx==2.1.2
sphinx-rtd-theme==0.4.3
sphinxcontrib-applehelp==1.0.1
sphinxcontrib-devhelp==1.0.1
sphinxcontrib-htmlhelp==1.0.2
sphinxcontrib-jsmath==1.0.1
sphinxcontrib-qthelp==1.0.2
sphinxcontrib-serializinghtml==1.1.3
urllib3==1.25.3
sphinx-markdown-tables==0.0.9
numpy==1.17.2
tensorflow==2.0.0rc2
torch==1.2.0

View File

@@ -1,3 +1,25 @@
/* Our DOM objects */
.framework-selector {
display: flex;
flex-direction: row;
justify-content: flex-end;
}
.framework-selector > button {
background-color: white;
color: #6670FF;
border: 1px solid #6670FF;
padding: 5px;
}
.framework-selector > button.selected{
background-color: #6670FF;
color: white;
border: 1px solid #6670FF;
padding: 5px;
}
/* The literal code blocks */
.rst-content tt.literal, .rst-content tt.literal, .rst-content code.literal {
color: #6670FF;
@@ -194,3 +216,41 @@ h2, .rst-content .toctree-wrapper p.caption, h3, h4, h5, h6, legend{
src: url(./Calibre-Thin.otf);
font-weight:400;
}
/**
* Nav Links to other parts of huggingface.co
*/
div.menu {
position: absolute;
top: 0;
right: 0;
padding-top: 20px;
padding-right: 20px;
z-index: 1000;
}
div.menu a {
font-size: 14px;
letter-spacing: 0.3px;
text-transform: uppercase;
color: white;
-webkit-font-smoothing: antialiased;
background: linear-gradient(0deg, #6671ffb8, #9a66ffb8 50%);
padding: 10px 16px 6px 16px;
border-radius: 3px;
margin-left: 12px;
position: relative;
}
div.menu a:active {
top: 1px;
}
@media (min-width: 768px) and (max-width: 1750px) {
.wy-breadcrumbs {
margin-top: 32px;
}
}
@media (max-width: 768px) {
div.menu {
display: none;
}
}

View File

@@ -58,6 +58,84 @@ function addGithubButton() {
document.querySelector(".wy-side-nav-search .icon-home").insertAdjacentHTML('afterend', div);
}
function addHfMenu() {
const div = `
<div class="menu">
<a href="/welcome">🔥 Sign in</a>
<a href="/models">🚀 Models</a>
</div>
`;
document.body.insertAdjacentHTML('afterbegin', div);
}
function platformToggle() {
const codeBlocks = Array.from(document.getElementsByClassName("highlight"));
const pytorchIdentifier = "## PYTORCH CODE";
const tensorflowIdentifier = "## TENSORFLOW CODE";
const pytorchSpanIdentifier = `<span class="c1">${pytorchIdentifier}</span>`;
const tensorflowSpanIdentifier = `<span class="c1">${tensorflowIdentifier}</span>`;
const getFrameworkSpans = filteredCodeBlock => {
const spans = filteredCodeBlock.element.innerHTML;
const pytorchSpanPosition = spans.indexOf(pytorchSpanIdentifier);
const tensorflowSpanPosition = spans.indexOf(tensorflowSpanIdentifier);
let pytorchSpans;
let tensorflowSpans;
if(pytorchSpanPosition < tensorflowSpanPosition){
pytorchSpans = spans.slice(pytorchSpanPosition + pytorchSpanIdentifier.length + 1, tensorflowSpanPosition);
tensorflowSpans = spans.slice(tensorflowSpanPosition + tensorflowSpanIdentifier.length + 1, spans.length);
}else{
tensorflowSpans = spans.slice(tensorflowSpanPosition + tensorflowSpanIdentifier.length + 1, pytorchSpanPosition);
pytorchSpans = spans.slice(pytorchSpanPosition + pytorchSpanIdentifier.length + 1, spans.length);
}
return {
...filteredCodeBlock,
pytorchSample: pytorchSpans ,
tensorflowSample: tensorflowSpans
}
};
const createFrameworkButtons = sample => {
const pytorchButton = document.createElement("button");
pytorchButton.innerText = "PyTorch";
const tensorflowButton = document.createElement("button");
tensorflowButton.innerText = "TensorFlow";
const selectorDiv = document.createElement("div");
selectorDiv.classList.add("framework-selector");
selectorDiv.appendChild(pytorchButton);
selectorDiv.appendChild(tensorflowButton);
sample.element.parentElement.prepend(selectorDiv);
// Init on PyTorch
sample.element.innerHTML = sample.pytorchSample;
pytorchButton.classList.add("selected");
tensorflowButton.classList.remove("selected");
pytorchButton.addEventListener("click", () => {
sample.element.innerHTML = sample.pytorchSample;
pytorchButton.classList.add("selected");
tensorflowButton.classList.remove("selected");
});
tensorflowButton.addEventListener("click", () => {
sample.element.innerHTML = sample.tensorflowSample;
tensorflowButton.classList.add("selected");
pytorchButton.classList.remove("selected");
});
};
codeBlocks
.map(element => {return {element: element.firstChild, innerText: element.innerText}})
.filter(codeBlock => codeBlock.innerText.includes(pytorchIdentifier) && codeBlock.innerText.includes(tensorflowIdentifier))
.map(getFrameworkSpans)
.forEach(createFrameworkButtons);
}
/*!
* github-buttons v2.2.10
* (c) 2019 なつき
@@ -74,6 +152,8 @@ function onLoad() {
addCustomFooter();
addGithubButton();
parseGithubButtons();
addHfMenu();
platformToggle();
}
window.addEventListener("load", onLoad);

File diff suppressed because one or more lines are too long

Before

Width:  |  Height:  |  Size: 14 KiB

After

Width:  |  Height:  |  Size: 7.6 KiB

View File

@@ -8,7 +8,7 @@ There is a growing field of study concerned with investigating the inner working
* Are Sixteen Heads Really Better than One? by Paul Michel, Omer Levy, Graham Neubig: https://arxiv.org/abs/1905.10650
* What Does BERT Look At? An Analysis of BERT's Attention by Kevin Clark, Urvashi Khandelwal, Omer Levy, Christopher D. Manning: https://arxiv.org/abs/1906.04341
In order to help this new field develop, we have included a few additional features in the BERT/GPT/GPT-2 models to help people access the inner representations, mainly adapted from the great work of Paul Michel (https://arxiv.org/abs/1905.10650):
In order to help this new field develop, we have included a few additional features in the BERT/GPT/GPT-2 models to help people access the inner representations, mainly adapted from the great work of Paul Michel (https://arxiv.org/abs/1905.10650):
* accessing all the hidden-states of BERT/GPT/GPT-2,

View File

@@ -14,19 +14,19 @@
#
import os
import sys
sys.path.insert(0, os.path.abspath('../..'))
sys.path.insert(0, os.path.abspath('../../src'))
# -- Project information -----------------------------------------------------
project = u'transformers'
copyright = u'2019, huggingface'
copyright = u'2020, huggingface'
author = u'huggingface'
# The short X.Y version
version = u''
# The full version, including alpha/beta/rc tags
release = u'2.2.1'
release = u'2.9.0'
# -- General configuration ---------------------------------------------------
@@ -105,6 +105,12 @@ html_static_path = ['_static']
#
# html_sidebars = {}
# This must be the name of an image file (path relative to the configuration
# directory) that is the favicon of the docs. Modern browsers use this as
# the icon for tabs, windows and bookmarks. It should be a Windows-style
# icon file (.ico).
html_favicon = 'favicon.ico'
# -- Options for HTMLHelp output ---------------------------------------------

View File

@@ -3,6 +3,12 @@ Converting Tensorflow Checkpoints
A command-line interface is provided to convert original Bert/GPT/GPT-2/Transformer-XL/XLNet/XLM checkpoints in models than be loaded using the ``from_pretrained`` methods of the library.
.. note::
Since 2.3.0 the conversion script is now part of the transformers CLI (**transformers-cli**)
available in any transformers >= 2.3.0 installation.
The documentation below reflects the **transformers-cli convert** command format.
BERT
^^^^
@@ -20,10 +26,10 @@ Here is an example of the conversion process for a pre-trained ``BERT-Base Uncas
export BERT_BASE_DIR=/path/to/bert/uncased_L-12_H-768_A-12
transformers bert \
$BERT_BASE_DIR/bert_model.ckpt \
$BERT_BASE_DIR/bert_config.json \
$BERT_BASE_DIR/pytorch_model.bin
transformers-cli convert --model_type bert \
--tf_checkpoint $BERT_BASE_DIR/bert_model.ckpt \
--config $BERT_BASE_DIR/bert_config.json \
--pytorch_dump_output $BERT_BASE_DIR/pytorch_model.bin
You can download Google's pre-trained models for the conversion `here <https://github.com/google-research/bert#pre-trained-models>`__.
@@ -36,10 +42,12 @@ Here is an example of the conversion process for a pre-trained OpenAI GPT model,
export OPENAI_GPT_CHECKPOINT_FOLDER_PATH=/path/to/openai/pretrained/numpy/weights
transformers gpt \
$OPENAI_GPT_CHECKPOINT_FOLDER_PATH \
$PYTORCH_DUMP_OUTPUT \
[OPENAI_GPT_CONFIG]
transformers-cli convert --model_type gpt \
--tf_checkpoint $OPENAI_GPT_CHECKPOINT_FOLDER_PATH \
--pytorch_dump_output $PYTORCH_DUMP_OUTPUT \
[--config OPENAI_GPT_CONFIG] \
[--finetuning_task_name OPENAI_GPT_FINETUNED_TASK] \
OpenAI GPT-2
^^^^^^^^^^^^
@@ -50,10 +58,11 @@ Here is an example of the conversion process for a pre-trained OpenAI GPT-2 mode
export OPENAI_GPT2_CHECKPOINT_PATH=/path/to/gpt2/pretrained/weights
transformers gpt2 \
$OPENAI_GPT2_CHECKPOINT_PATH \
$PYTORCH_DUMP_OUTPUT \
[OPENAI_GPT2_CONFIG]
transformers-cli convert --model_type gpt2 \
--tf_checkpoint $OPENAI_GPT2_CHECKPOINT_PATH \
--pytorch_dump_output $PYTORCH_DUMP_OUTPUT \
[--config OPENAI_GPT2_CONFIG] \
[--finetuning_task_name OPENAI_GPT2_FINETUNED_TASK]
Transformer-XL
^^^^^^^^^^^^^^
@@ -64,27 +73,28 @@ Here is an example of the conversion process for a pre-trained Transformer-XL mo
export TRANSFO_XL_CHECKPOINT_FOLDER_PATH=/path/to/transfo/xl/checkpoint
transformers transfo_xl \
$TRANSFO_XL_CHECKPOINT_FOLDER_PATH \
$PYTORCH_DUMP_OUTPUT \
[TRANSFO_XL_CONFIG]
transformers-cli convert --model_type transfo_xl \
--tf_checkpoint $TRANSFO_XL_CHECKPOINT_FOLDER_PATH \
--pytorch_dump_output $PYTORCH_DUMP_OUTPUT \
[--config TRANSFO_XL_CONFIG] \
[--finetuning_task_name TRANSFO_XL_FINETUNED_TASK]
XLNet
^^^^^
Here is an example of the conversion process for a pre-trained XLNet model, fine-tuned on STS-B using the TensorFlow script:
Here is an example of the conversion process for a pre-trained XLNet model:
.. code-block:: shell
export TRANSFO_XL_CHECKPOINT_PATH=/path/to/xlnet/checkpoint
export TRANSFO_XL_CONFIG_PATH=/path/to/xlnet/config
transformers xlnet \
$TRANSFO_XL_CHECKPOINT_PATH \
$TRANSFO_XL_CONFIG_PATH \
$PYTORCH_DUMP_OUTPUT \
STS-B \
transformers-cli convert --model_type xlnet \
--tf_checkpoint $TRANSFO_XL_CHECKPOINT_PATH \
--config $TRANSFO_XL_CONFIG_PATH \
--pytorch_dump_output $PYTORCH_DUMP_OUTPUT \
[--finetuning_task_name XLNET_FINETUNED_TASK] \
XLM
@@ -96,6 +106,8 @@ Here is an example of the conversion process for a pre-trained XLM model:
export XLM_CHECKPOINT_PATH=/path/to/xlm/checkpoint
transformers xlm \
$XLM_CHECKPOINT_PATH \
$PYTORCH_DUMP_OUTPUT \
transformers-cli convert --model_type xlm \
--tf_checkpoint $XLM_CHECKPOINT_PATH \
--pytorch_dump_output $PYTORCH_DUMP_OUTPUT
[--config XML_CONFIG] \
[--finetuning_task_name XML_FINETUNED_TASK]

View File

@@ -1 +0,0 @@
../../examples/README.md

649
docs/source/examples.md Normal file
View File

@@ -0,0 +1,649 @@
# Examples
In this section a few examples are put together. All of these examples work for several models, making use of the very
similar API between the different models.
**Important**
To run the latest versions of the examples, you have to install from source and install some specific requirements for the examples.
Execute the following steps in a new virtual environment:
```bash
git clone https://github.com/huggingface/transformers
cd transformers
pip install .
pip install -r ./examples/requirements.txt
```
| Section | Description |
|----------------------------|------------------------------------------------------------------------------------------------------------------------------------------
| [TensorFlow 2.0 models on GLUE](#TensorFlow-2.0-Bert-models-on-GLUE) | Examples running BERT TensorFlow 2.0 model on the GLUE tasks. |
| [Running on TPUs](#running-on-tpus) | Examples on running fine-tuning tasks on Google TPUs to accelerate workloads. |
| [Language Model training](#language-model-training) | Fine-tuning (or training from scratch) the library models for language modeling on a text dataset. Causal language modeling for GPT/GPT-2, masked language modeling for BERT/RoBERTa. |
| [Language Generation](#language-generation) | Conditional text generation using the auto-regressive models of the library: GPT, GPT-2, Transformer-XL and XLNet. |
| [GLUE](#glue) | Examples running BERT/XLM/XLNet/RoBERTa on the 9 GLUE tasks. Examples feature distributed training as well as half-precision. |
| [SQuAD](#squad) | Using BERT/RoBERTa/XLNet/XLM for question answering, examples with distributed training. |
| [Multiple Choice](#multiple-choice) | Examples running BERT/XLNet/RoBERTa on the SWAG/RACE/ARC tasks. |
| [Named Entity Recognition](https://github.com/huggingface/transformers/tree/master/examples/ner) | Using BERT for Named Entity Recognition (NER) on the CoNLL 2003 dataset, examples with distributed training. |
| [XNLI](#xnli) | Examples running BERT/XLM on the XNLI benchmark. |
| [Adversarial evaluation of model performances](#adversarial-evaluation-of-model-performances) | Testing a model with adversarial evaluation of natural language inference on the Heuristic Analysis for NLI Systems (HANS) dataset (McCoy et al., 2019.) |
## TensorFlow 2.0 Bert models on GLUE
Based on the script [`run_tf_glue.py`](https://github.com/huggingface/transformers/blob/master/examples/run_tf_glue.py).
Fine-tuning the library TensorFlow 2.0 Bert model for sequence classification on the MRPC task of the GLUE benchmark: [General Language Understanding Evaluation](https://gluebenchmark.com/).
This script has an option for mixed precision (Automatic Mixed Precision / AMP) to run models on Tensor Cores (NVIDIA Volta/Turing GPUs) and future hardware and an option for XLA, which uses the XLA compiler to reduce model runtime.
Options are toggled using `USE_XLA` or `USE_AMP` variables in the script.
These options and the below benchmark are provided by @tlkh.
Quick benchmarks from the script (no other modifications):
| GPU | Mode | Time (2nd epoch) | Val Acc (3 runs) |
| --------- | -------- | ----------------------- | ----------------------|
| Titan V | FP32 | 41s | 0.8438/0.8281/0.8333 |
| Titan V | AMP | 26s | 0.8281/0.8568/0.8411 |
| V100 | FP32 | 35s | 0.8646/0.8359/0.8464 |
| V100 | AMP | 22s | 0.8646/0.8385/0.8411 |
| 1080 Ti | FP32 | 55s | - |
Mixed precision (AMP) reduces the training time considerably for the same hardware and hyper-parameters (same batch size was used).
## Running on TPUs
You can accelerate your workloads on Google's TPUs. For information on how to setup your TPU environment refer to this
[README](https://github.com/pytorch/xla/blob/master/README.md).
The following are some examples of running the `*_tpu.py` finetuning scripts on TPUs. All steps for data preparation are
identical to your normal GPU + Huggingface setup.
### GLUE
Before running anyone of these GLUE tasks you should download the
[GLUE data](https://gluebenchmark.com/tasks) by running
[this script](https://gist.github.com/W4ngatang/60c2bdb54d156a41194446737ce03e2e)
and unpack it to some directory `$GLUE_DIR`.
For running your GLUE task on MNLI dataset you can run something like the following:
```
export XRT_TPU_CONFIG="tpu_worker;0;$TPU_IP_ADDRESS:8470"
export GLUE_DIR=/path/to/glue
export TASK_NAME=MNLI
python run_glue_tpu.py \
--model_type bert \
--model_name_or_path bert-base-cased \
--task_name $TASK_NAME \
--do_train \
--do_eval \
--data_dir $GLUE_DIR/$TASK_NAME \
--max_seq_length 128 \
--train_batch_size 32 \
--learning_rate 3e-5 \
--num_train_epochs 3.0 \
--output_dir /tmp/$TASK_NAME \
--overwrite_output_dir \
--logging_steps 50 \
--save_steps 200 \
--num_cores=8 \
--only_log_master
```
## Language model training
Based on the script [`run_language_modeling.py`](https://github.com/huggingface/transformers/blob/master/examples/run_language_modeling.py).
Fine-tuning (or training from scratch) the library models for language modeling on a text dataset for GPT, GPT-2, BERT and RoBERTa (DistilBERT
to be added soon). GPT and GPT-2 are fine-tuned using a causal language modeling (CLM) loss while BERT and RoBERTa
are fine-tuned using a masked language modeling (MLM) loss.
Before running the following example, you should get a file that contains text on which the language model will be
trained or fine-tuned. A good example of such text is the [WikiText-2 dataset](https://blog.einstein.ai/the-wikitext-long-term-dependency-language-modeling-dataset/).
We will refer to two different files: `$TRAIN_FILE`, which contains text for training, and `$TEST_FILE`, which contains
text that will be used for evaluation.
### GPT-2/GPT and causal language modeling
The following example fine-tunes GPT-2 on WikiText-2. We're using the raw WikiText-2 (no tokens were replaced before
the tokenization). The loss here is that of causal language modeling.
```bash
export TRAIN_FILE=/path/to/dataset/wiki.train.raw
export TEST_FILE=/path/to/dataset/wiki.test.raw
python run_language_modeling.py \
--output_dir=output \
--model_type=gpt2 \
--model_name_or_path=gpt2 \
--do_train \
--train_data_file=$TRAIN_FILE \
--do_eval \
--eval_data_file=$TEST_FILE
```
This takes about half an hour to train on a single K80 GPU and about one minute for the evaluation to run. It reaches
a score of ~20 perplexity once fine-tuned on the dataset.
### RoBERTa/BERT and masked language modeling
The following example fine-tunes RoBERTa on WikiText-2. Here too, we're using the raw WikiText-2. The loss is different
as BERT/RoBERTa have a bidirectional mechanism; we're therefore using the same loss that was used during their
pre-training: masked language modeling.
In accordance to the RoBERTa paper, we use dynamic masking rather than static masking. The model may, therefore, converge
slightly slower (over-fitting takes more epochs).
We use the `--mlm` flag so that the script may change its loss function.
```bash
export TRAIN_FILE=/path/to/dataset/wiki.train.raw
export TEST_FILE=/path/to/dataset/wiki.test.raw
python run_language_modeling.py \
--output_dir=output \
--model_type=roberta \
--model_name_or_path=roberta-base \
--do_train \
--train_data_file=$TRAIN_FILE \
--do_eval \
--eval_data_file=$TEST_FILE \
--mlm
```
## Language generation
Based on the script [`run_generation.py`](https://github.com/huggingface/transformers/blob/master/examples/run_generation.py).
Conditional text generation using the auto-regressive models of the library: GPT, GPT-2, Transformer-XL, XLNet, CTRL.
A similar script is used for our official demo [Write With Transfomer](https://transformer.huggingface.co), where you
can try out the different models available in the library.
Example usage:
```bash
python run_generation.py \
--model_type=gpt2 \
--model_name_or_path=gpt2
```
## GLUE
Based on the script [`run_glue.py`](https://github.com/huggingface/transformers/blob/master/examples/text-classification/run_glue.py).
Fine-tuning the library models for sequence classification on the GLUE benchmark: [General Language Understanding
Evaluation](https://gluebenchmark.com/). This script can fine-tune the following models: BERT, XLM, XLNet and RoBERTa.
GLUE is made up of a total of 9 different tasks. We get the following results on the dev set of the benchmark with an
uncased BERT base model (the checkpoint `bert-base-uncased`). All experiments ran single V100 GPUs with a total train
batch sizes between 16 and 64. Some of these tasks have a small dataset and training can lead to high variance in the results
between different runs. We report the median on 5 runs (with different seeds) for each of the metrics.
| Task | Metric | Result |
|-------|------------------------------|-------------|
| CoLA | Matthew's corr | 49.23 |
| SST-2 | Accuracy | 91.97 |
| MRPC | F1/Accuracy | 89.47/85.29 |
| STS-B | Person/Spearman corr. | 83.95/83.70 |
| QQP | Accuracy/F1 | 88.40/84.31 |
| MNLI | Matched acc./Mismatched acc. | 80.61/81.08 |
| QNLI | Accuracy | 87.46 |
| RTE | Accuracy | 61.73 |
| WNLI | Accuracy | 45.07 |
Some of these results are significantly different from the ones reported on the test set
of GLUE benchmark on the website. For QQP and WNLI, please refer to [FAQ #12](https://gluebenchmark.com/faq) on the webite.
Before running any one of these GLUE tasks you should download the
[GLUE data](https://gluebenchmark.com/tasks) by running
[this script](https://gist.github.com/W4ngatang/60c2bdb54d156a41194446737ce03e2e)
and unpack it to some directory `$GLUE_DIR`.
```bash
export GLUE_DIR=/path/to/glue
export TASK_NAME=MRPC
python run_glue.py \
--model_type bert \
--model_name_or_path bert-base-cased \
--task_name $TASK_NAME \
--do_train \
--do_eval \
--data_dir $GLUE_DIR/$TASK_NAME \
--max_seq_length 128 \
--per_gpu_train_batch_size 32 \
--learning_rate 2e-5 \
--num_train_epochs 3.0 \
--output_dir /tmp/$TASK_NAME/
```
where task name can be one of CoLA, SST-2, MRPC, STS-B, QQP, MNLI, QNLI, RTE, WNLI.
The dev set results will be present within the text file `eval_results.txt` in the specified output_dir.
In case of MNLI, since there are two separate dev sets (matched and mismatched), there will be a separate
output folder called `/tmp/MNLI-MM/` in addition to `/tmp/MNLI/`.
The code has not been tested with half-precision training with apex on any GLUE task apart from MRPC, MNLI,
CoLA, SST-2. The following section provides details on how to run half-precision training with MRPC. With that being
said, there shouldnt be any issues in running half-precision training with the remaining GLUE tasks as well,
since the data processor for each task inherits from the base class DataProcessor.
### MRPC
#### Fine-tuning example
The following examples fine-tune BERT on the Microsoft Research Paraphrase Corpus (MRPC) corpus and runs in less
than 10 minutes on a single K-80 and in 27 seconds (!) on single tesla V100 16GB with apex installed.
Before running any one of these GLUE tasks you should download the
[GLUE data](https://gluebenchmark.com/tasks) by running
[this script](https://gist.github.com/W4ngatang/60c2bdb54d156a41194446737ce03e2e)
and unpack it to some directory `$GLUE_DIR`.
```bash
export GLUE_DIR=/path/to/glue
python run_glue.py \
--model_name_or_path bert-base-cased \
--task_name MRPC \
--do_train \
--do_eval \
--data_dir $GLUE_DIR/MRPC/ \
--max_seq_length 128 \
--per_gpu_train_batch_size 32 \
--learning_rate 2e-5 \
--num_train_epochs 3.0 \
--output_dir /tmp/mrpc_output/
```
Our test ran on a few seeds with [the original implementation hyper-
parameters](https://github.com/google-research/bert#sentence-and-sentence-pair-classification-tasks) gave evaluation
results between 84% and 88%.
#### Using Apex and mixed-precision
Using Apex and 16 bit precision, the fine-tuning on MRPC only takes 27 seconds. First install
[apex](https://github.com/NVIDIA/apex), then run the following example:
```bash
export GLUE_DIR=/path/to/glue
python run_glue.py \
--model_name_or_path bert-base-cased \
--task_name MRPC \
--do_train \
--do_eval \
--data_dir $GLUE_DIR/MRPC/ \
--max_seq_length 128 \
--per_gpu_train_batch_size 32 \
--learning_rate 2e-5 \
--num_train_epochs 3.0 \
--output_dir /tmp/mrpc_output/ \
--fp16
```
#### Distributed training
Here is an example using distributed training on 8 V100 GPUs. The model used is the BERT whole-word-masking and it
reaches F1 > 92 on MRPC.
```bash
export GLUE_DIR=/path/to/glue
python -m torch.distributed.launch \
--nproc_per_node 8 run_glue.py \
--model_name_or_path bert-base-cased \
--task_name MRPC \
--do_train \
--do_eval \
--data_dir $GLUE_DIR/MRPC/ \
--max_seq_length 128 \
--per_gpu_train_batch_size 8 \
--learning_rate 2e-5 \
--num_train_epochs 3.0 \
--output_dir /tmp/mrpc_output/
```
Training with these hyper-parameters gave us the following results:
```bash
acc = 0.8823529411764706
acc_and_f1 = 0.901702786377709
eval_loss = 0.3418912578906332
f1 = 0.9210526315789473
global_step = 174
loss = 0.07231863956341798
```
### MNLI
The following example uses the BERT-large, uncased, whole-word-masking model and fine-tunes it on the MNLI task.
```bash
export GLUE_DIR=/path/to/glue
python -m torch.distributed.launch \
--nproc_per_node 8 run_glue.py \
--model_name_or_path bert-base-cased \
--task_name mnli \
--do_train \
--do_eval \
--data_dir $GLUE_DIR/MNLI/ \
--max_seq_length 128 \
--per_gpu_train_batch_size 8 \
--learning_rate 2e-5 \
--num_train_epochs 3.0 \
--output_dir output_dir \
```
The results are the following:
```bash
***** Eval results *****
acc = 0.8679706601466992
eval_loss = 0.4911287787382479
global_step = 18408
loss = 0.04755385363816904
***** Eval results *****
acc = 0.8747965825874695
eval_loss = 0.45516540421714036
global_step = 18408
loss = 0.04755385363816904
```
## Multiple Choice
Based on the script [`run_multiple_choice.py`]().
#### Fine-tuning on SWAG
Download [swag](https://github.com/rowanz/swagaf/tree/master/data) data
```bash
#training on 4 tesla V100(16GB) GPUS
export SWAG_DIR=/path/to/swag_data_dir
python ./examples/run_multiple_choice.py \
--task_name swag \
--model_name_or_path roberta-base \
--do_train \
--do_eval \
--data_dir $SWAG_DIR \
--learning_rate 5e-5 \
--num_train_epochs 3 \
--max_seq_length 80 \
--output_dir models_bert/swag_base \
--per_gpu_eval_batch_size=16 \
--per_gpu_train_batch_size=16 \
--gradient_accumulation_steps 2 \
--overwrite_output
```
Training with the defined hyper-parameters yields the following results:
```
***** Eval results *****
eval_acc = 0.8338998300509847
eval_loss = 0.44457291918821606
```
## SQuAD
Based on the script [`run_squad.py`](https://github.com/huggingface/transformers/blob/master/examples/run_squad.py).
#### Fine-tuning BERT on SQuAD1.0
This example code fine-tunes BERT on the SQuAD1.0 dataset. It runs in 24 min (with BERT-base) or 68 min (with BERT-large)
on a single tesla V100 16GB. The data for SQuAD can be downloaded with the following links and should be saved in a
$SQUAD_DIR directory.
* [train-v1.1.json](https://rajpurkar.github.io/SQuAD-explorer/dataset/train-v1.1.json)
* [dev-v1.1.json](https://rajpurkar.github.io/SQuAD-explorer/dataset/dev-v1.1.json)
* [evaluate-v1.1.py](https://github.com/allenai/bi-att-flow/blob/master/squad/evaluate-v1.1.py)
And for SQuAD2.0, you need to download:
- [train-v2.0.json](https://rajpurkar.github.io/SQuAD-explorer/dataset/train-v2.0.json)
- [dev-v2.0.json](https://rajpurkar.github.io/SQuAD-explorer/dataset/dev-v2.0.json)
- [evaluate-v2.0.py](https://worksheets.codalab.org/rest/bundles/0x6b567e1cf2e041ec80d7098f031c5c9e/contents/blob/)
```bash
export SQUAD_DIR=/path/to/SQUAD
python run_squad.py \
--model_type bert \
--model_name_or_path bert-base-uncased \
--do_train \
--do_eval \
--train_file $SQUAD_DIR/train-v1.1.json \
--predict_file $SQUAD_DIR/dev-v1.1.json \
--per_gpu_train_batch_size 12 \
--learning_rate 3e-5 \
--num_train_epochs 2.0 \
--max_seq_length 384 \
--doc_stride 128 \
--output_dir /tmp/debug_squad/
```
Training with the previously defined hyper-parameters yields the following results:
```bash
f1 = 88.52
exact_match = 81.22
```
#### Distributed training
Here is an example using distributed training on 8 V100 GPUs and Bert Whole Word Masking uncased model to reach a F1 > 93 on SQuAD1.1:
```bash
python -m torch.distributed.launch --nproc_per_node=8 ./examples/run_squad.py \
--model_type bert \
--model_name_or_path bert-large-uncased-whole-word-masking \
--do_train \
--do_eval \
--train_file $SQUAD_DIR/train-v1.1.json \
--predict_file $SQUAD_DIR/dev-v1.1.json \
--learning_rate 3e-5 \
--num_train_epochs 2 \
--max_seq_length 384 \
--doc_stride 128 \
--output_dir ./examples/models/wwm_uncased_finetuned_squad/ \
--per_gpu_eval_batch_size=3 \
--per_gpu_train_batch_size=3 \
```
Training with the previously defined hyper-parameters yields the following results:
```bash
f1 = 93.15
exact_match = 86.91
```
This fine-tuned model is available as a checkpoint under the reference
`bert-large-uncased-whole-word-masking-finetuned-squad`.
#### Fine-tuning XLNet on SQuAD
This example code fine-tunes XLNet on both SQuAD1.0 and SQuAD2.0 dataset. See above to download the data for SQuAD .
##### Command for SQuAD1.0:
```bash
export SQUAD_DIR=/path/to/SQUAD
python run_squad.py \
--model_type xlnet \
--model_name_or_path xlnet-large-cased \
--do_train \
--do_eval \
--train_file $SQUAD_DIR/train-v1.1.json \
--predict_file $SQUAD_DIR/dev-v1.1.json \
--learning_rate 3e-5 \
--num_train_epochs 2 \
--max_seq_length 384 \
--doc_stride 128 \
--output_dir ./wwm_cased_finetuned_squad/ \
--per_gpu_eval_batch_size=4 \
--per_gpu_train_batch_size=4 \
--save_steps 5000
```
##### Command for SQuAD2.0:
```bash
export SQUAD_DIR=/path/to/SQUAD
python run_squad.py \
--model_type xlnet \
--model_name_or_path xlnet-large-cased \
--do_train \
--do_eval \
--version_2_with_negative \
--train_file $SQUAD_DIR/train-v2.0.json \
--predict_file $SQUAD_DIR/dev-v2.0.json \
--learning_rate 3e-5 \
--num_train_epochs 4 \
--max_seq_length 384 \
--doc_stride 128 \
--output_dir ./wwm_cased_finetuned_squad/ \
--per_gpu_eval_batch_size=2 \
--per_gpu_train_batch_size=2 \
--save_steps 5000
```
Larger batch size may improve the performance while costing more memory.
##### Results for SQuAD1.0 with the previously defined hyper-parameters:
```python
{
"exact": 85.45884578997162,
"f1": 92.5974600601065,
"total": 10570,
"HasAns_exact": 85.45884578997162,
"HasAns_f1": 92.59746006010651,
"HasAns_total": 10570
}
```
##### Results for SQuAD2.0 with the previously defined hyper-parameters:
```python
{
"exact": 80.4177545691906,
"f1": 84.07154997729623,
"total": 11873,
"HasAns_exact": 76.73751686909581,
"HasAns_f1": 84.05558584352873,
"HasAns_total": 5928,
"NoAns_exact": 84.0874684608915,
"NoAns_f1": 84.0874684608915,
"NoAns_total": 5945
}
```
## XNLI
Based on the script [`run_xnli.py`](https://github.com/huggingface/transformers/blob/master/examples/run_xnli.py).
[XNLI](https://www.nyu.edu/projects/bowman/xnli/) is crowd-sourced dataset based on [MultiNLI](http://www.nyu.edu/projects/bowman/multinli/). It is an evaluation benchmark for cross-lingual text representations. Pairs of text are labeled with textual entailment annotations for 15 different languages (including both high-resource language such as English and low-resource languages such as Swahili).
#### Fine-tuning on XNLI
This example code fine-tunes mBERT (multi-lingual BERT) on the XNLI dataset. It runs in 106 mins
on a single tesla V100 16GB. The data for XNLI can be downloaded with the following links and should be both saved (and un-zipped) in a
`$XNLI_DIR` directory.
* [XNLI 1.0](https://www.nyu.edu/projects/bowman/xnli/XNLI-1.0.zip)
* [XNLI-MT 1.0](https://www.nyu.edu/projects/bowman/xnli/XNLI-MT-1.0.zip)
```bash
export XNLI_DIR=/path/to/XNLI
python run_xnli.py \
--model_type bert \
--model_name_or_path bert-base-multilingual-cased \
--language de \
--train_language en \
--do_train \
--do_eval \
--data_dir $XNLI_DIR \
--per_gpu_train_batch_size 32 \
--learning_rate 5e-5 \
--num_train_epochs 2.0 \
--max_seq_length 128 \
--output_dir /tmp/debug_xnli/ \
--save_steps -1
```
Training with the previously defined hyper-parameters yields the following results on the **test** set:
```bash
acc = 0.7093812375249501
```
## MM-IMDb
Based on the script [`run_mmimdb.py`](https://github.com/huggingface/transformers/blob/master/examples/contrib/mm-imdb/run_mmimdb.py).
[MM-IMDb](http://lisi1.unal.edu.co/mmimdb/) is a Multimodal dataset with around 26,000 movies including images, plots and other metadata.
### Training on MM-IMDb
```
python run_mmimdb.py \
--data_dir /path/to/mmimdb/dataset/ \
--model_type bert \
--model_name_or_path bert-base-uncased \
--output_dir /path/to/save/dir/ \
--do_train \
--do_eval \
--max_seq_len 512 \
--gradient_accumulation_steps 20 \
--num_image_embeds 3 \
--num_train_epochs 100 \
--patience 5
```
## Adversarial evaluation of model performances
Here is an example on evaluating a model using adversarial evaluation of natural language inference with the Heuristic Analysis for NLI Systems (HANS) dataset [McCoy et al., 2019](https://arxiv.org/abs/1902.01007). The example was gracefully provided by [Nafise Sadat Moosavi](https://github.com/ns-moosavi).
The HANS dataset can be downloaded from [this location](https://github.com/tommccoy1/hans).
This is an example of using test_hans.py:
```bash
export HANS_DIR=path-to-hans
export MODEL_TYPE=type-of-the-model-e.g.-bert-roberta-xlnet-etc
export MODEL_PATH=path-to-the-model-directory-that-is-trained-on-NLI-e.g.-by-using-run_glue.py
python examples/hans/test_hans.py \
--task_name hans \
--model_type $MODEL_TYPE \
--do_eval \
--data_dir $HANS_DIR \
--model_name_or_path $MODEL_PATH \
--max_seq_length 128 \
--output_dir $MODEL_PATH \
```
This will create the hans_predictions.txt file in MODEL_PATH, which can then be evaluated using hans/evaluate_heur_output.py from the HANS dataset.
The results of the BERT-base model that is trained on MNLI using batch size 8 and the random seed 42 on the HANS dataset is as follows:
```bash
Heuristic entailed results:
lexical_overlap: 0.9702
subsequence: 0.9942
constituent: 0.9962
Heuristic non-entailed results:
lexical_overlap: 0.199
subsequence: 0.0396
constituent: 0.118
```

BIN
docs/source/favicon.ico Normal file

Binary file not shown.

After

Width:  |  Height:  |  Size: 47 KiB

156
docs/source/glossary.rst Normal file
View File

@@ -0,0 +1,156 @@
Glossary
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
Every model is different yet bears similarities with the others. Therefore most models use the same inputs, which are
detailed here alongside usage examples.
Input IDs
--------------------------
The input ids are often the only required parameters to be passed to the model as input. *They are token indices,
numerical representations of tokens building the sequences that will be used as input by the model*.
Each tokenizer works differently but the underlying mechanism remains the same. Here's an example using the BERT
tokenizer, which is a `WordPiece <https://arxiv.org/pdf/1609.08144.pdf>`__ tokenizer:
::
from transformers import BertTokenizer
tokenizer = BertTokenizer.from_pretrained("bert-base-cased")
sequence = "A Titan RTX has 24GB of VRAM"
The tokenizer takes care of splitting the sequence into tokens available in the tokenizer vocabulary.
::
# Continuation of the previous script
tokenized_sequence = tokenizer.tokenize(sequence)
assert tokenized_sequence == ['A', 'Titan', 'R', '##T', '##X', 'has', '24', '##GB', 'of', 'V', '##RA', '##M']
These tokens can then be converted into IDs which are understandable by the model. Several methods are available for
this, the recommended being `encode` or `encode_plus`, which leverage the Rust implementation of
`huggingface/tokenizers <https://github.com/huggingface/tokenizers>`__ for peak performance.
::
# Continuation of the previous script
encoded_sequence = tokenizer.encode(sequence)
assert encoded_sequence == [101, 138, 18696, 155, 1942, 3190, 1144, 1572, 13745, 1104, 159, 9664, 2107, 102]
The `encode` and `encode_plus` methods automatically add "special tokens" which are special IDs the model uses.
Attention mask
--------------------------
The attention mask is an optional argument used when batching sequences together. This argument indicates to the
model which tokens should be attended to, and which should not.
For example, consider these two sequences:
::
from transformers import BertTokenizer
tokenizer = BertTokenizer.from_pretrained("bert-base-cased")
sequence_a = "This is a short sequence."
sequence_b = "This is a rather long sequence. It is at least longer than the sequence A."
encoded_sequence_a = tokenizer.encode(sequence_a)
assert len(encoded_sequence_a) == 8
encoded_sequence_b = tokenizer.encode(sequence_b)
assert len(encoded_sequence_b) == 19
These two sequences have different lengths and therefore can't be put together in a same tensor as-is. The first
sequence needs to be padded up to the length of the second one, or the second one needs to be truncated down to
the length of the first one.
In the first case, the list of IDs will be extended by the padding indices:
::
# Continuation of the previous script
padded_sequence_a = tokenizer.encode(sequence_a, max_length=19, pad_to_max_length=True)
assert padded_sequence_a == [101, 1188, 1110, 170, 1603, 4954, 119, 102, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]
assert encoded_sequence_b == [101, 1188, 1110, 170, 1897, 1263, 4954, 119, 1135, 1110, 1120, 1655, 2039, 1190, 1103, 4954, 138, 119, 102]
These can then be converted into a tensor in PyTorch or TensorFlow. The attention mask is a binary tensor indicating
the position of the padded indices so that the model does not attend to them. For the
:class:`~transformers.BertTokenizer`, :obj:`1` indicate a value that should be attended to while :obj:`0` indicate
a padded value.
The method :func:`~transformers.PreTrainedTokenizer.encode_plus` may be used to obtain the attention mask directly:
::
# Continuation of the previous script
sequence_a_dict = tokenizer.encode_plus(sequence_a, max_length=19, pad_to_max_length=True)
assert sequence_a_dict['input_ids'] == [101, 1188, 1110, 170, 1603, 4954, 119, 102, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]
assert sequence_a_dict['attention_mask'] == [1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]
Token Type IDs
--------------------------
Some models' purpose is to do sequence classification or question answering. These require two different sequences to
be encoded in the same input IDs. They are usually separated by special tokens, such as the classifier and separator
tokens. For example, the BERT model builds its two sequence input as such:
::
from transformers import BertTokenizer
tokenizer = BertTokenizer.from_pretrained("bert-base-cased")
# [CLS] SEQ_A [SEP] SEQ_B [SEP]
sequence_a = "HuggingFace is based in NYC"
sequence_b = "Where is HuggingFace based?"
encoded_sequence = tokenizer.encode(sequence_a, sequence_b)
assert tokenizer.decode(encoded_sequence) == "[CLS] HuggingFace is based in NYC [SEP] Where is HuggingFace based? [SEP]"
This is enough for some models to understand where one sequence ends and where another begins. However, other models
such as BERT have an additional mechanism, which are the segment IDs. The Token Type IDs are a binary mask identifying
the different sequences in the model.
We can leverage :func:`~transformers.PreTrainedTokenizer.encode_plus` to output the Token Type IDs for us:
::
# Continuation of the previous script
encoded_dict = tokenizer.encode_plus(sequence_a, sequence_b)
assert encoded_dict['input_ids'] == [101, 20164, 10932, 2271, 7954, 1110, 1359, 1107, 17520, 102, 2777, 1110, 20164, 10932, 2271, 7954, 1359, 136, 102]
assert encoded_dict['token_type_ids'] == [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1]
The first sequence, the "context" used for the question, has all its tokens represented by :obj:`0`, whereas the
question has all its tokens represented by :obj:`1`. Some models, like :class:`~transformers.XLNetModel` use an
additional token represented by a :obj:`2`.
Position IDs
--------------------------
The position IDs are used by the model to identify which token is at which position. Contrary to RNNs that have the
position of each token embedded within them, transformers are unaware of the position of each token. The position
IDs are created for this purpose.
They are an optional parameter. If no position IDs are passed to the model, they are automatically created as absolute
positional embeddings.
Absolute positional embeddings are selected in the range ``[0, config.max_position_embeddings - 1]``. Some models
use other types of positional embeddings, such as sinusoidal position embeddings or relative position embeddings.
Feed Forward Chunking
--------------------------
In transformers two feed forward layers usually follows the self attention layer in each residual attention block. The intermediate embedding size of the feed forward layers is often bigger than the hidden size of the model (*e.g.* for ``bert-base-uncased``).
For an input of size ``[batch_size, sequence_length]``, the memory required to store the intermediate feed forward embeddings ``[batch_size, sequence_length, config.intermediate_size]`` can account for a large fraction of the memory use. The authors of `Reformer: The Efficient Transformer <https://arxiv.org/abs/2001.04451>`_ noticed that since the computation is independent of the ``sequence_length`` dimension, it is mathematically equivalent to compute the output embeddings of both feed forward layers ``[batch_size, config.hidden_size]_0, ..., [batch_size, config.hidden_size]_n`` individually and concat them afterward to ``[batch_size, sequence_length, config.hidden_size]`` with ``n = sequence_length``, which trades increased computation time against reduced memory use, but yields a mathematically **equivalent** result.
For models employing the function :func:`~.transformers.apply_chunking_to_forward`, the ``chunk_size`` defines the number of output embeddings that are computed in parallel and thus defines the trade-off between memory and time complexity.
If ``chunk_size`` is set to 0, no feed forward chunking is done.

View File

@@ -49,7 +49,9 @@ The library currently contains PyTorch and Tensorflow implementations, pre-train
8. `DistilBERT <https://huggingface.co/transformers/model_doc/distilbert.html>`_ (from HuggingFace) released together with the paper `DistilBERT, a distilled version of BERT: smaller, faster, cheaper and lighter <https://arxiv.org/abs/1910.01108>`_ by Victor Sanh, Lysandre Debut and Thomas Wolf. The same method has been applied to compress GPT2 into `DistilGPT2 <https://github.com/huggingface/transformers/tree/master/examples/distillation>`_.
9. `CTRL <https://github.com/pytorch/fairseq/tree/master/examples/ctrl>`_ (from Salesforce), released together with the paper `CTRL: A Conditional Transformer Language Model for Controllable Generation <https://www.github.com/salesforce/ctrl>`_ by Nitish Shirish Keskar*, Bryan McCann*, Lav R. Varshney, Caiming Xiong and Richard Socher.
10. `CamemBERT <https://huggingface.co/transformers/model_doc/camembert.html>`_ (from FAIR, Inria, Sorbonne Université) released together with the paper `CamemBERT: a Tasty French Language Model <https://arxiv.org/abs/1911.03894>`_ by Louis Martin, Benjamin Muller, Pedro Javier Ortiz Suarez, Yoann Dupont, Laurent Romary, Eric Villemonte de la Clergerie, Djame Seddah, and Benoît Sagot.
11. `ALBERT <https://github.com/pytorch/fairseq/tree/master/examples/albert>`_ (from Google Research), released together with the paper a `ALBERT: A Lite BERT for Self-supervised Learning of Language Representations <https://arxiv.org/abs/1909.11942>`_ by Zhenzhong Lan, Mingda Chen, Sebastian Goodman, Kevin Gimpel, Piyush Sharma, Radu Soricut.
11. `ALBERT <https://github.com/google-research/ALBERT>`_ (from Google Research), released together with the paper a `ALBERT: A Lite BERT for Self-supervised Learning of Language Representations <https://arxiv.org/abs/1909.11942>`_ by Zhenzhong Lan, Mingda Chen, Sebastian Goodman, Kevin Gimpel, Piyush Sharma, Radu Soricut.
12. `XLM-RoBERTa <https://github.com/pytorch/fairseq/tree/master/examples/xlmr>`_ (from Facebook AI), released together with the paper `Unsupervised Cross-lingual Representation Learning at Scale <https://arxiv.org/abs/1911.02116>`_ by Alexis Conneau*, Kartikay Khandelwal*, Naman Goyal, Vishrav Chaudhary, Guillaume Wenzek, Francisco Guzmán, Edouard Grave, Myle Ott, Luke Zettlemoyer and Veselin Stoyanov.
13. `FlauBERT <https://github.com/getalp/Flaubert>`_ (from CNRS) released with the paper `FlauBERT: Unsupervised Language Model Pre-training for French <https://arxiv.org/abs/1912.05372>`_ by Hang Le, Loïc Vial, Jibril Frej, Vincent Segonne, Maximin Coavoux, Benjamin Lecouteux, Alexandre Allauzen, Benoît Crabbé, Laurent Besacier, Didier Schwab.
.. toctree::
:maxdepth: 2
@@ -57,7 +59,10 @@ The library currently contains PyTorch and Tensorflow implementations, pre-train
installation
quickstart
glossary
pretrained_models
usage
model_sharing
examples
notebooks
serialization
@@ -75,6 +80,7 @@ The library currently contains PyTorch and Tensorflow implementations, pre-train
main_classes/configuration
main_classes/model
main_classes/tokenizer
main_classes/pipelines
main_classes/optimizer_schedules
main_classes/processors
@@ -83,6 +89,7 @@ The library currently contains PyTorch and Tensorflow implementations, pre-train
:caption: Package Reference
model_doc/auto
model_doc/encoderdecoder
model_doc/bert
model_doc/gpt
model_doc/transformerxl
@@ -94,3 +101,10 @@ The library currently contains PyTorch and Tensorflow implementations, pre-train
model_doc/ctrl
model_doc/camembert
model_doc/albert
model_doc/xlmroberta
model_doc/flaubert
model_doc/bart
model_doc/t5
model_doc/electra
model_doc/dialogpt
model_doc/reformer

View File

@@ -1,6 +1,6 @@
# Installation
Transformers is tested on Python 2.7 and 3.5+ (examples are tested only on python 3.5+) and PyTorch 1.1.0
Transformers is tested on Python 3.6+ and PyTorch 1.1.0
## With pip
@@ -17,25 +17,18 @@ To install from source, clone the repository and install with:
``` bash
git clone https://github.com/huggingface/transformers.git
cd transformers
pip install [--editable] .
pip install .
```
## Tests
An extensive test suite is included to test the library behavior and several examples. Library tests can be found in the [tests folder](https://github.com/huggingface/transformers/tree/master/transformers/tests) and examples tests in the [examples folder](https://github.com/huggingface/transformers/tree/master/examples).
An extensive test suite is included to test the library behavior and several examples. Library tests can be found in the [tests folder](https://github.com/huggingface/transformers/tree/master/tests) and examples tests in the [examples folder](https://github.com/huggingface/transformers/tree/master/examples).
Tests can be run using `pytest` (install pytest if needed with `pip install pytest`).
Run all the tests from the root of the cloned repository with the commands:
``` bash
python -m pytest -sv ./transformers/tests/
python -m pytest -sv ./examples/
```
Refer to the [contributing guide](https://github.com/huggingface/transformers/blob/master/CONTRIBUTING.md#tests) for details about running tests.
## OpenAI GPT original tokenization workflow
If you want to reproduce the original tokenization process of the `OpenAI GPT` paper, you will need to install `ftfy` (use version 4.4.3 if you are using Python 2) and `SpaCy`:
If you want to reproduce the original tokenization process of the `OpenAI GPT` paper, you will need to install `ftfy` and `SpaCy`:
``` bash
pip install spacy ftfy==4.4.3

View File

@@ -14,6 +14,12 @@ The base class ``PreTrainedModel`` implements the common methods for loading/sav
.. autoclass:: transformers.PreTrainedModel
:members:
``Helper Functions``
~~~~~~~~~~~~~~~~~~~~~
.. autofunction:: transformers.apply_chunking_to_forward
``TFPreTrainedModel``
~~~~~~~~~~~~~~~~~~~~~

View File

@@ -5,6 +5,7 @@ The ``.optimization`` module provides:
- an optimizer with weight decay fixed that can be used to fine-tuned models, and
- several schedules in the form of schedule objects that inherit from ``_LRSchedule``:
- a gradient accumulation class to accumulate the gradients of multiple batches
``AdamW``
~~~~~~~~~~~~~~~~
@@ -12,12 +13,19 @@ The ``.optimization`` module provides:
.. autoclass:: transformers.AdamW
:members:
``AdamWeightDecay``
~~~~~~~~~~~~~~~~~~~
.. autoclass:: transformers.AdamWeightDecay
:members:
.. autofunction:: transformers.create_optimizer
Schedules
----------------------------------------------------
Learning Rate Schedules
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
.. autofunction:: transformers.get_constant_schedule
@@ -29,7 +37,6 @@ Learning Rate Schedules
.. autofunction:: transformers.get_cosine_schedule_with_warmup
:members:
.. image:: /imgs/warmup_cosine_schedule.png
:target: /imgs/warmup_cosine_schedule.png
@@ -49,3 +56,17 @@ Learning Rate Schedules
.. image:: /imgs/warmup_linear_schedule.png
:target: /imgs/warmup_linear_schedule.png
:alt:
``Warmup``
~~~~~~~~~~~~~~~~
.. autoclass:: transformers.WarmUp
:members:
Gradient Strategies
----------------------------------------------------
``GradientAccumulator``
~~~~~~~~~~~~~~~~~~~~~~~
.. autoclass:: transformers.GradientAccumulator

View File

@@ -0,0 +1,74 @@
Pipelines
----------------------------------------------------
The pipelines are a great and easy way to use models for inference. These pipelines are objects that abstract most
of the complex code from the library, offering a simple API dedicated to several tasks, including Named Entity
Recognition, Masked Language Modeling, Sentiment Analysis, Feature Extraction and Question Answering.
There are two categories of pipeline abstractions to be aware about:
- The :class:`~transformers.pipeline` which is the most powerful object encapsulating all other pipelines
- The other task-specific pipelines, such as :class:`~transformers.NerPipeline`
or :class:`~transformers.QuestionAnsweringPipeline`
The pipeline abstraction
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
The `pipeline` abstraction is a wrapper around all the other available pipelines. It is instantiated as any
other pipeline but requires an additional argument which is the `task`.
.. autoclass:: transformers.pipeline
:members:
The task specific pipelines
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Parent class: Pipeline
=========================================
.. autoclass:: transformers.Pipeline
:members: predict, transform, save_pretrained
NerPipeline
==========================================
.. autoclass:: transformers.NerPipeline
TokenClassificationPipeline
==========================================
This class is an alias of the :class:`~transformers.NerPipeline` defined above. Please refer to that pipeline for
documentation and usage examples.
FillMaskPipeline
==========================================
.. autoclass:: transformers.FillMaskPipeline
FeatureExtractionPipeline
==========================================
.. autoclass:: transformers.FeatureExtractionPipeline
TextClassificationPipeline
==========================================
.. autoclass:: transformers.TextClassificationPipeline
QuestionAnsweringPipeline
==========================================
.. autoclass:: transformers.QuestionAnsweringPipeline
SummarizationPipeline
==========================================
.. autoclass:: transformers.SummarizationPipeline
TextGenerationPipeline
==========================================
.. autoclass:: transformers.TextGenerationPipeline

View File

@@ -54,8 +54,7 @@ Additionally, the following method can be used to load values from a data file
Example usage
^^^^^^^^^^^^^^^^^^^^^^^^^
An example using these processors is given in the
`run_glue.py <https://github.com/huggingface/pytorch-transformers/blob/master/examples/run_glue.py>`__ script.
An example using these processors is given in the `run_glue.py <https://github.com/huggingface/pytorch-transformers/blob/master/examples/text-classification/run_glue.py>`__ script.
XNLI
@@ -64,7 +63,7 @@ XNLI
`The Cross-Lingual NLI Corpus (XNLI) <https://www.nyu.edu/projects/bowman/xnli/>`__ is a benchmark that evaluates
the quality of cross-lingual text representations.
XNLI is crowd-sourced dataset based on `MultiNLI <http://www.nyu.edu/projects/bowman/multinli/>`: pairs of text are labeled with textual entailment
annotations for 15 different languages (including both high-ressource language such as English and low-ressource languages such as Swahili).
annotations for 15 different languages (including both high-resource language such as English and low-resource languages such as Swahili).
It was released together with the paper
`XNLI: Evaluating Cross-lingual Sentence Representations <https://arxiv.org/abs/1809.05053>`__
@@ -74,8 +73,81 @@ This library hosts the processor to load the XNLI data:
Please note that since the gold labels are available on the test set, evaluation is performed on the test set.
Example usage
An example using these processors is given in the
`run_xnli.py <https://github.com/huggingface/pytorch-transformers/blob/master/examples/run_xnli.py>`__ script.
SQuAD
~~~~~~~~~~~~~~~~~~~~~
`The Stanford Question Answering Dataset (SQuAD) <https://rajpurkar.github.io/SQuAD-explorer//>`__ is a benchmark that evaluates
the performance of models on question answering. Two versions are available, v1.1 and v2.0. The first version (v1.1) was released together with the paper
`SQuAD: 100,000+ Questions for Machine Comprehension of Text <https://arxiv.org/abs/1606.05250>`__. The second version (v2.0) was released alongside
the paper `Know What You Don't Know: Unanswerable Questions for SQuAD <https://arxiv.org/abs/1806.03822>`__.
This library hosts a processor for each of the two versions:
Processors
^^^^^^^^^^^^^^^^^^^^^^^^^
An example using these processors is given in the
`run_xnli.py <https://github.com/huggingface/pytorch-transformers/blob/master/examples/run_xnli.py>`__ script.
Those processors are:
- :class:`~transformers.data.processors.utils.SquadV1Processor`
- :class:`~transformers.data.processors.utils.SquadV2Processor`
They both inherit from the abstract class :class:`~transformers.data.processors.utils.SquadProcessor`
.. autoclass:: transformers.data.processors.squad.SquadProcessor
:members:
Additionally, the following method can be used to convert SQuAD examples into :class:`~transformers.data.processors.utils.SquadFeatures`
that can be used as model inputs.
.. automethod:: transformers.data.processors.squad.squad_convert_examples_to_features
These processors as well as the aforementionned method can be used with files containing the data as well as with the `tensorflow_datasets` package.
Examples are given below.
Example usage
^^^^^^^^^^^^^^^^^^^^^^^^^
Here is an example using the processors as well as the conversion method using data files:
Example::
# Loading a V2 processor
processor = SquadV2Processor()
examples = processor.get_dev_examples(squad_v2_data_dir)
# Loading a V1 processor
processor = SquadV1Processor()
examples = processor.get_dev_examples(squad_v1_data_dir)
features = squad_convert_examples_to_features(
examples=examples,
tokenizer=tokenizer,
max_seq_length=max_seq_length,
doc_stride=args.doc_stride,
max_query_length=max_query_length,
is_training=not evaluate,
)
Using `tensorflow_datasets` is as easy as using a data file:
Example::
# tensorflow_datasets only handle Squad V1.
tfds_examples = tfds.load("squad")
examples = SquadV1Processor().get_examples_from_dataset(tfds_examples, evaluate=evaluate)
features = squad_convert_examples_to_features(
examples=examples,
tokenizer=tokenizer,
max_seq_length=max_seq_length,
doc_stride=args.doc_stride,
max_query_length=max_query_length,
is_training=not evaluate,
)
Another example using these processors is given in the
`run_squad.py <https://github.com/huggingface/transformers/blob/master/examples/run_squad.py>`__ script.

View File

@@ -1,16 +1,38 @@
Tokenizer
----------------------------------------------------
The base class ``PreTrainedTokenizer`` implements the common methods for loading/saving a tokenizer either from a local file or directory, or from a pretrained tokenizer provided by the library (downloaded from HuggingFace's AWS S3 repository).
A tokenizer is in charge of preparing the inputs for a model. The library comprise tokenizers for all the models. Most of the tokenizers are available in two flavors: a full python implementation and a "Fast" implementation based on the Rust library `tokenizers`. The "Fast" implementations allows (1) a significant speed-up in particular when doing batched tokenization and (2) additional methods to map between the original string (character and words) and the token space (e.g. getting the index of the token comprising a given character or the span of characters corresponding to a given token). Currently no "Fast" implementation is available for the SentencePiece-based tokenizers (for T5, ALBERT, CamemBERT, XLMRoBERTa and XLNet models).
``PreTrainedTokenizer`` is the main entry point into tokenizers as it also implements the main methods for using all the tokenizers:
The base classes ``PreTrainedTokenizer`` and ``PreTrainedTokenizerFast`` implements the common methods for encoding string inputs in model inputs (see below) and instantiating/saving python and "Fast" tokenizers either from a local file or directory or from a pretrained tokenizer provided by the library (downloaded from HuggingFace's AWS S3 repository).
- tokenizing, converting tokens to ids and back and encoding/decoding,
``PreTrainedTokenizer`` and ``PreTrainedTokenizerFast`` thus implements the main methods for using all the tokenizers:
- tokenizing (spliting strings in sub-word token strings), converting tokens strings to ids and back, and encoding/decoding (i.e. tokenizing + convert to integers),
- adding new tokens to the vocabulary in a way that is independant of the underlying structure (BPE, SentencePiece...),
- managing special tokens (adding them, assigning them to roles, making sure they are not split during tokenization)
- managing special tokens like mask, beginning-of-sentence, etc tokens (adding them, assigning them to attributes in the tokenizer for easy access and making sure they are not split during tokenization)
``BatchEncoding`` holds the output of the tokenizer's encoding methods (``encode_plus`` and ``batch_encode_plus``) and is derived from a Python dictionary. When the tokenizer is a pure python tokenizer, this class behave just like a standard python dictionary and hold the various model inputs computed by these methodes (``input_ids``, ``attention_mask``...). When the tokenizer is a "Fast" tokenizer (i.e. backed by HuggingFace tokenizers library), this class provides in addition several advanced alignement methods which can be used to map between the original string (character and words) and the token space (e.g. getting the index of the token comprising a given character or the span of characters corresponding to a given token).
``PreTrainedTokenizer``
~~~~~~~~~~~~~~~~~~~~~~~~
.. autoclass:: transformers.PreTrainedTokenizer
:members:
``PreTrainedTokenizerFast``
~~~~~~~~~~~~~~~~~~~~~~~~
.. autoclass:: transformers.PreTrainedTokenizerFast
:members:
``BatchEncoding``
~~~~~~~~~~~~~~~~~~~~~~~~
.. autoclass:: transformers.BatchEncoding
:members:
``SpecialTokensMixin``
~~~~~~~~~~~~~~~~~~~~~~~~
.. autoclass:: transformers.SpecialTokensMixin
:members:

View File

@@ -27,7 +27,7 @@ loss = outputs[0]
# In transformers you can also have access to the logits:
loss, logits = outputs[:2]
# And even the attention weigths if you configure the model to output them (and other outputs too, see the docstrings and documentation)
# And even the attention weights if you configure the model to output them (and other outputs too, see the docstrings and documentation)
model = BertForSequenceClassification.from_pretrained('bert-base-uncased', output_attentions=True)
outputs = model(input_ids, labels=labels)
loss, logits, attentions = outputs
@@ -104,6 +104,6 @@ for batch in train_data:
loss = model(batch)
loss.backward()
torch.nn.utils.clip_grad_norm_(model.parameters(), max_grad_norm) # Gradient clipping is not in AdamW anymore (so you can use amp without issue)
scheduler.step()
optimizer.step()
scheduler.step()
```

View File

@@ -1,63 +1,95 @@
ALBERT
----------------------------------------------------
``AlbrtConfig``
Overview
~~~~~~~~~~~~~~~~~~~~~
The ALBERT model was proposed in `ALBERT: A Lite BERT for Self-supervised Learning of Language Representations <https://arxiv.org/abs/1909.11942>`_
by Zhenzhong Lan, Mingda Chen, Sebastian Goodman, Kevin Gimpel, Piyush Sharma, Radu Soricut. It presents
two parameter-reduction techniques to lower memory consumption and increase the trainig speed of BERT:
- Splitting the embedding matrix into two smaller matrices
- Using repeating layers split among groups
The abstract from the paper is the following:
*Increasing model size when pretraining natural language representations often results in improved performance on
downstream tasks. However, at some point further model increases become harder due to GPU/TPU memory limitations,
longer training times, and unexpected model degradation. To address these problems, we present two parameter-reduction
techniques to lower memory consumption and increase the training speed of BERT. Comprehensive empirical evidence shows
that our proposed methods lead to models that scale much better compared to the original BERT. We also use a
self-supervised loss that focuses on modeling inter-sentence coherence, and show it consistently helps downstream
tasks with multi-sentence inputs. As a result, our best model establishes new state-of-the-art results on the GLUE,
RACE, and SQuAD benchmarks while having fewer parameters compared to BERT-large.*
Tips:
- ALBERT is a model with absolute position embeddings so it's usually advised to pad the inputs on
the right rather than the left.
- ALBERT uses repeating layers which results in a small memory footprint, however the computational cost remains
similar to a BERT-like architecture with the same number of hidden layers as it has to iterate through the same
number of (repeating) layers.
The original code can be found `here <https://github.com/google-research/ALBERT>`_.
AlbertConfig
~~~~~~~~~~~~~~~~~~~~~
.. autoclass:: transformers.AlbertConfig
:members:
``AlbertTokenizer``
AlbertTokenizer
~~~~~~~~~~~~~~~~~~~~~
.. autoclass:: transformers.AlbertTokenizer
:members:
:members: build_inputs_with_special_tokens, get_special_tokens_mask,
create_token_type_ids_from_sequences, save_vocabulary
``AlbertModel``
AlbertModel
~~~~~~~~~~~~~~~~~~~~
.. autoclass:: transformers.AlbertModel
:members:
``AlbertForMaskedLM``
AlbertForMaskedLM
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
.. autoclass:: transformers.AlbertForMaskedLM
:members:
``AlbertForSequenceClassification``
~~~~~~~~~~~~~~~~~~~~~~~~~~
AlbertForSequenceClassification
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
.. autoclass:: transformers.AlbertForSequenceClassification
:members:
``AlbertForQuestionAnswering``
AlbertForQuestionAnswering
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
.. autoclass:: transformers.AlbertForQuestionAnswering
:members:
``TFAlbertModel``
TFAlbertModel
~~~~~~~~~~~~~~~~~~~~
.. autoclass:: transformers.TFAlbertModel
:members:
``TFAlbertForMaskedLM``
TFAlbertForMaskedLM
~~~~~~~~~~~~~~~~~~~~~~~~~~
.. autoclass:: transformers.TFAlbertForMaskedLM
:members:
``TFAlbertForSequenceClassification``
TFAlbertForSequenceClassification
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
.. autoclass:: transformers.TFAlbertForSequenceClassification

View File

@@ -3,7 +3,7 @@ AutoModels
In many cases, the architecture you want to use can be guessed from the name or the path of the pretrained model you are supplying to the ``from_pretrained`` method.
AutoClasses are here to do this job for you so that you automatically retreive the relevant model given the name/path to the pretrained weights/config/vocabulary:
AutoClasses are here to do this job for you so that you automatically retrieve the relevant model given the name/path to the pretrained weights/config/vocabulary:
Instantiating one of ``AutoModel``, ``AutoConfig`` and ``AutoTokenizer`` will directly create a class of the relevant architecture (ex: ``model = AutoModel.from_pretrained('bert-base-cased')`` will create a instance of ``BertModel``).
@@ -15,6 +15,13 @@ Instantiating one of ``AutoModel``, ``AutoConfig`` and ``AutoTokenizer`` will di
:members:
``AutoTokenizer``
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
.. autoclass:: transformers.AutoTokenizer
:members:
``AutoModel``
~~~~~~~~~~~~~~~~~~~~~
@@ -22,8 +29,37 @@ Instantiating one of ``AutoModel``, ``AutoConfig`` and ``AutoTokenizer`` will di
:members:
``AutoTokenizer``
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
``AutoModelForPreTraining``
~~~~~~~~~~~~~~~~~~~~~
.. autoclass:: transformers.AutoTokenizer
.. autoclass:: transformers.AutoModelForPreTraining
:members:
``AutoModelWithLMHead``
~~~~~~~~~~~~~~~~~~~~~
.. autoclass:: transformers.AutoModelWithLMHead
:members:
``AutoModelForSequenceClassification``
~~~~~~~~~~~~~~~~~~~~~
.. autoclass:: transformers.AutoModelForSequenceClassification
:members:
``AutoModelForQuestionAnswering``
~~~~~~~~~~~~~~~~~~~~~
.. autoclass:: transformers.AutoModelForQuestionAnswering
:members:
``AutoModelForTokenClassification``
~~~~~~~~~~~~~~~~~~~~~
.. autoclass:: transformers.AutoModelForTokenClassification
:members:

View File

@@ -0,0 +1,56 @@
Bart
----------------------------------------------------
**DISCLAIMER:** This model is still a work in progress, if you see something strange,
file a `Github Issue <https://github.com/huggingface/transformers/issues/new?assignees=&labels=&template=bug-report.md&title>`__ and assign
@sshleifer
Paper
~~~~~
The Bart model was `proposed <https://arxiv.org/abs/1910.13461>`_ by Mike Lewis, Yinhan Liu, Naman Goyal, Marjan Ghazvininejad, Abdelrahman Mohamed, Omer Levy, Ves Stoyanov and Luke Zettlemoyer on 29 Oct, 2019.
According to the abstract,
- Bart uses a standard seq2seq/machine translation architecture with a bidirectional encoder (like BERT) and a left-to-right decoder (like GPT).
- The pretraining task involves randomly shuffling the order of the original sentences and a novel in-filling scheme, where spans of text are replaced with a single mask token.
- BART is particularly effective when fine tuned for text generation but also works well for comprehension tasks. It matches the performance of RoBERTa with comparable training resources on GLUE and SQuAD, achieves new state-of-the-art results on a range of abstractive dialogue, question answering, and summarization tasks, with gains of up to 6 ROUGE.
The Authors' code can be found `here <https://github.com/pytorch/fairseq/tree/master/examples/bart>`_
Implementation Notes
~~~~~~~~~~~~~~~~~~~~
- Bart doesn't use :obj:`token_type_ids` for sequence classification. Use BartTokenizer.encode to get the proper splitting.
- The forward pass of ``BartModel`` will create decoder inputs (using the helper function ``transformers.modeling_bart._prepare_bart_decoder_inputs``) if they are not passed. This is different than some other modeling APIs.
- Model predictions are intended to be identical to the original implementation. This only works, however, if the string you pass to ``fairseq.encode`` starts with a space.
- ``BartForConditionalGeneration.generate`` should be used for conditional generation tasks like summarization, see the example in that docstrings
- Models that load the ``"bart-large-cnn"`` weights will not have a ``mask_token_id``, or be able to perform mask filling tasks.
BartModel
~~~~~~~~~~~~~
.. autoclass:: transformers.BartModel
:members: forward
.. autofunction:: transformers.modeling_bart._prepare_bart_decoder_inputs
BartForConditionalGeneration
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
.. autoclass:: transformers.BartForConditionalGeneration
:members: generate, forward
BartForSequenceClassification
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
.. autoclass:: transformers.BartForSequenceClassification
:members: forward
BartConfig
~~~~~~~~~~~~~~~~~~~~~
.. autoclass:: transformers.BartConfig
:members:

View File

@@ -1,126 +1,170 @@
BERT
----------------------------------------------------
``BertConfig``
Overview
~~~~~~~~~~~~~~~~~~~~~
The BERT model was proposed in `BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding <https://arxiv.org/abs/1810.04805>`__
by Jacob Devlin, Ming-Wei Chang, Kenton Lee and Kristina Toutanova. It's a bidirectional transformer
pre-trained using a combination of masked language modeling objective and next sentence prediction
on a large corpus comprising the Toronto Book Corpus and Wikipedia.
The abstract from the paper is the following:
*We introduce a new language representation model called BERT, which stands for Bidirectional Encoder Representations
from Transformers. Unlike recent language representation models, BERT is designed to pre-train deep bidirectional
representations from unlabeled text by jointly conditioning on both left and right context in all layers. As a result,
the pre-trained BERT model can be fine-tuned with just one additional output layer to create state-of-the-art models
for a wide range of tasks, such as question answering and language inference, without substantial task-specific
architecture modifications.*
*BERT is conceptually simple and empirically powerful. It obtains new state-of-the-art results on eleven natural
language processing tasks, including pushing the GLUE score to 80.5% (7.7% point absolute improvement), MultiNLI
accuracy to 86.7% (4.6% absolute improvement), SQuAD v1.1 question answering Test F1 to 93.2 (1.5 point absolute
improvement) and SQuAD v2.0 Test F1 to 83.1 (5.1 point absolute improvement).*
Tips:
- BERT is a model with absolute position embeddings so it's usually advised to pad the inputs on
the right rather than the left.
- BERT was trained with a masked language modeling (MLM) objective. It is therefore efficient at predicting masked
tokens and at NLU in general, but is not optimal for text generation. Models trained with a causal language
modeling (CLM) objective are better in that regard.
- Alongside MLM, BERT was trained using a next sentence prediction (NSP) objective using the [CLS] token as a sequence
approximate. The user may use this token (the first token in a sequence built with special tokens) to get a sequence
prediction rather than a token prediction. However, averaging over the sequence may yield better results than using
the [CLS] token.
The original code can be found `here <https://github.com/google-research/bert>`_.
BertConfig
~~~~~~~~~~~~~~~~~~~~~
.. autoclass:: transformers.BertConfig
:members:
``BertTokenizer``
BertTokenizer
~~~~~~~~~~~~~~~~~~~~~
.. autoclass:: transformers.BertTokenizer
:members: build_inputs_with_special_tokens, get_special_tokens_mask,
create_token_type_ids_from_sequences, save_vocabulary
BertTokenizerFast
~~~~~~~~~~~~~~~~~~~~~
.. autoclass:: transformers.BertTokenizerFast
:members:
``BertModel``
BertModel
~~~~~~~~~~~~~~~~~~~~
.. autoclass:: transformers.BertModel
:members:
``BertForPreTraining``
BertForPreTraining
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
.. autoclass:: transformers.BertForPreTraining
:members:
``BertForMaskedLM``
BertForMaskedLM
~~~~~~~~~~~~~~~~~~~~~~~~~~
.. autoclass:: transformers.BertForMaskedLM
:members:
``BertForNextSentencePrediction``
BertForNextSentencePrediction
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
.. autoclass:: transformers.BertForNextSentencePrediction
:members:
``BertForSequenceClassification``
BertForSequenceClassification
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
.. autoclass:: transformers.BertForSequenceClassification
:members:
``BertForMultipleChoice``
BertForMultipleChoice
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
.. autoclass:: transformers.BertForMultipleChoice
:members:
``BertForTokenClassification``
BertForTokenClassification
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
.. autoclass:: transformers.BertForTokenClassification
:members:
``BertForQuestionAnswering``
BertForQuestionAnswering
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
.. autoclass:: transformers.BertForQuestionAnswering
:members:
``TFBertModel``
TFBertModel
~~~~~~~~~~~~~~~~~~~~
.. autoclass:: transformers.TFBertModel
:members:
``TFBertForPreTraining``
TFBertForPreTraining
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
.. autoclass:: transformers.TFBertForPreTraining
:members:
``TFBertForMaskedLM``
TFBertForMaskedLM
~~~~~~~~~~~~~~~~~~~~~~~~~~
.. autoclass:: transformers.TFBertForMaskedLM
:members:
``TFBertForNextSentencePrediction``
TFBertForNextSentencePrediction
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
.. autoclass:: transformers.TFBertForNextSentencePrediction
:members:
``TFBertForSequenceClassification``
TFBertForSequenceClassification
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
.. autoclass:: transformers.TFBertForSequenceClassification
:members:
``TFBertForMultipleChoice``
TFBertForMultipleChoice
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
.. autoclass:: transformers.TFBertForMultipleChoice
:members:
``TFBertForTokenClassification``
TFBertForTokenClassification
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
.. autoclass:: transformers.TFBertForTokenClassification
:members:
``TFBertForQuestionAnswering``
TFBertForQuestionAnswering
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
.. autoclass:: transformers.TFBertForQuestionAnswering

View File

@@ -1,50 +1,102 @@
CamemBERT
----------------------------------------------------
``CamembertConfig``
~~~~~~~~~~~~~~~~~~~~~
The CamemBERT model was proposed in `CamemBERT: a Tasty French Language Model <https://arxiv.org/abs/1911.03894>`__
by Louis Martin, Benjamin Muller, Pedro Javier Ortiz Suárez, Yoann Dupont, Laurent Romary, Éric Villemonte de la
Clergerie, Djamé Seddah, and Benoît Sagot. It is based on Facebook's RoBERTa model released in 2019. It is a model
trained on 138GB of French text.
The abstract from the paper is the following:
*Pretrained language models are now ubiquitous in Natural Language Processing. Despite their success,
most available models have either been trained on English data or on the concatenation of data in multiple
languages. This makes practical use of such models --in all languages except English-- very limited. Aiming
to address this issue for French, we release CamemBERT, a French version of the Bi-directional Encoders for
Transformers (BERT). We measure the performance of CamemBERT compared to multilingual models in multiple
downstream tasks, namely part-of-speech tagging, dependency parsing, named-entity recognition, and natural
language inference. CamemBERT improves the state of the art for most of the tasks considered. We release the
pretrained model for CamemBERT hoping to foster research and downstream applications for French NLP.*
Tips:
- This implementation is the same as RoBERTa. Refer to the `documentation of RoBERTa <./roberta.html>`__ for usage
examples as well as the information relative to the inputs and outputs.
The original code can be found `here <https://camembert-model.fr/>`_.
CamembertConfig
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
.. autoclass:: transformers.CamembertConfig
:members:
``CamembertTokenizer``
~~~~~~~~~~~~~~~~~~~~~
CamembertTokenizer
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
.. autoclass:: transformers.CamembertTokenizer
:members:
:members: build_inputs_with_special_tokens, get_special_tokens_mask,
create_token_type_ids_from_sequences, save_vocabulary
``CamembertModel``
~~~~~~~~~~~~~~~~~~~~
CamembertModel
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
.. autoclass:: transformers.CamembertModel
:members:
``CamembertForMaskedLM``
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
CamembertForMaskedLM
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
.. autoclass:: transformers.CamembertForMaskedLM
:members:
``CamembertForSequenceClassification``
~~~~~~~~~~~~~~~~~~~~~~~~~~
CamembertForSequenceClassification
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
.. autoclass:: transformers.CamembertForSequenceClassification
:members:
``CamembertForMultipleChoice``
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
CamembertForMultipleChoice
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
.. autoclass:: transformers.CamembertForMultipleChoice
:members:
``CamembertForTokenClassification``
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
CamembertForTokenClassification
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
.. autoclass:: transformers.CamembertForTokenClassification
:members:
TFCamembertModel
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
.. autoclass:: transformers.TFCamembertModel
:members:
TFCamembertForMaskedLM
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
.. autoclass:: transformers.TFCamembertForMaskedLM
:members:
TFCamembertForSequenceClassification
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
.. autoclass:: transformers.TFCamembertForSequenceClassification
:members:
TFCamembertForTokenClassification
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
.. autoclass:: transformers.TFCamembertForTokenClassification
:members:

View File

@@ -1,47 +1,75 @@
CTRL
----------------------------------------------------
Note: if you fine-tune a CTRL model using the Salesforce code (https://github.com/salesforce/ctrl),
you'll be able to convert from TF to our HuggingFace/Transformers format using the
``convert_tf_to_huggingface_pytorch.py`` script (see `issue #1654 <https://github.com/huggingface/transformers/issues/1654>`_).
CTRL model was proposed in `CTRL: A Conditional Transformer Language Model for Controllable Generation <https://arxiv.org/abs/1909.05858>`_
by Nitish Shirish Keskar*, Bryan McCann*, Lav R. Varshney, Caiming Xiong and Richard Socher.
It's a causal (unidirectional) transformer pre-trained using language modeling on a very large
corpus of ~140 GB of text data with the first token reserved as a control code (such as Links, Books, Wikipedia etc.).
The abstract from the paper is the following:
*Large-scale language models show promising text generation capabilities, but users cannot easily control particular
aspects of the generated text. We release CTRL, a 1.63 billion-parameter conditional transformer language model,
trained to condition on control codes that govern style, content, and task-specific behavior. Control codes were
derived from structure that naturally co-occurs with raw text, preserving the advantages of unsupervised learning
while providing more explicit control over text generation. These codes also allow CTRL to predict which parts of
the training data are most likely given a sequence. This provides a potential method for analyzing large amounts
of data via model-based source attribution.*
Tips:
- CTRL makes use of control codes to generate text: it requires generations to be started by certain words, sentences
or links to generate coherent text. Refer to the `original implementation <https://github.com/salesforce/ctrl>`__
for more information.
- CTRL is a model with absolute position embeddings so it's usually advised to pad the inputs on
the right rather than the left.
- CTRL was trained with a causal language modeling (CLM) objective and is therefore powerful at predicting the next
token in a sequence. Leveraging this feature allows CTRL to generate syntactically coherent text as
it can be observed in the `run_generation.py` example script.
- The PyTorch models can take the `past` as input, which is the previously computed key/value attention pairs. Using
this `past` value prevents the model from re-computing pre-computed values in the context of text generation.
See `reusing the past in generative models <../quickstart.html#using-the-past>`_ for more information on the usage
of this argument.
The original code can be found `here <https://github.com/salesforce/ctrl>`_.
``CTRLConfig``
CTRLConfig
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
.. autoclass:: transformers.CTRLConfig
:members:
``CTRLTokenizer``
CTRLTokenizer
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
.. autoclass:: transformers.CTRLTokenizer
:members:
:members: save_vocabulary
``CTRLModel``
CTRLModel
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
.. autoclass:: transformers.CTRLModel
:members:
``CTRLLMHeadModel``
CTRLLMHeadModel
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
.. autoclass:: transformers.CTRLLMHeadModel
:members:
``TFCTRLModel``
TFCTRLModel
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
.. autoclass:: transformers.TFCTRLModel
:members:
``TFCTRLLMHeadModel``
TFCTRLLMHeadModel
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
.. autoclass:: transformers.TFCTRLLMHeadModel

View File

@@ -0,0 +1,39 @@
DialoGPT
----------------------------------------------------
Overview
~~~~~~~~~~~~~~~~~~~~~
DialoGPT was proposed in
`DialoGPT: Large-Scale Generative Pre-training for Conversational Response Generation <https://arxiv.org/abs/1911.00536>`_
by Yizhe Zhang, Siqi Sun, Michel Galley, Yen-Chun Chen, Chris Brockett, Xiang Gao, Jianfeng Gao, Jingjing Liu, Bill Dolan.
It's a GPT2 Model trained on 147M conversation-like exchanges extracted from Reddit.
The abstract from the paper is the following:
*We present a large, tunable neural conversational response generation model, DialoGPT (dialogue generative pre-trained transformer).
Trained on 147M conversation-like exchanges extracted from Reddit comment chains over a period spanning from 2005 through 2017, DialoGPT extends the Hugging Face PyTorch transformer to attain a performance close to human both in terms of automatic and human evaluation in single-turn dialogue settings.
We show that conversational systems that leverage DialoGPT generate more relevant, contentful and context-consistent responses than strong baseline systems.
The pre-trained model and training pipeline are publicly released to facilitate research into neural response generation and the development of more intelligent open-domain dialogue systems.*
Tips:
- DialoGPT is a model with absolute position embeddings so it's usually advised to pad the inputs on
the right rather than the left.
- DialoGPT was trained with a causal language modeling (CLM) objective on conversational data and is therefore powerful at response generation in open-domain dialogue systems.
- DialoGPT enables the user to create a chat bot in just 10 lines of code as shown on `DialoGPT's model card <https://huggingface.co/microsoft/DialoGPT-medium>`_.
Training:
In order to train or fine-tune DialoGPT, one can use causal language modeling training.
To cite the official paper:
*We follow the OpenAI GPT-2 to model a multiturn dialogue session
as a long text and frame the generation task as language modeling. We first
concatenate all dialog turns within a dialogue session into a long text
x_1,..., x_N (N is the sequence length), ended by the end-of-text token.*
For more information please confer to the original paper.
DialoGPT's architecture is based on the GPT2 model, so one can refer to GPT2's `docstring <https://huggingface.co/transformers/model_doc/gpt2.html>`_.
The original code can be found `here <https://github.com/microsoft/DialoGPT>`_.

View File

@@ -1,69 +1,105 @@
DistilBERT
----------------------------------------------------
``DistilBertConfig``
The DistilBERT model was proposed in the blog post
`Smaller, faster, cheaper, lighter: Introducing DistilBERT, a distilled version of BERT <https://medium.com/huggingface/distilbert-8cf3380435b5>`__,
and the paper `DistilBERT, a distilled version of BERT: smaller, faster, cheaper and lighter <https://arxiv.org/abs/1910.01108>`__.
DistilBERT is a small, fast, cheap and light Transformer model trained by distilling Bert base. It has 40% less
parameters than `bert-base-uncased`, runs 60% faster while preserving over 95% of Bert's performances as measured on
the GLUE language understanding benchmark.
The abstract from the paper is the following:
*As Transfer Learning from large-scale pre-trained models becomes more prevalent in Natural Language Processing (NLP),
operating these large models in on-the-edge and/or under constrained computational training or inference budgets
remains challenging. In this work, we propose a method to pre-train a smaller general-purpose language representation
model, called DistilBERT, which can then be fine-tuned with good performances on a wide range of tasks like its larger
counterparts. While most prior work investigated the use of distillation for building task-specific models, we
leverage knowledge distillation during the pre-training phase and show that it is possible to reduce the size of a
BERT model by 40%, while retaining 97% of its language understanding capabilities and being 60% faster. To leverage
the inductive biases learned by larger models during pre-training, we introduce a triple loss combining language
modeling, distillation and cosine-distance losses. Our smaller, faster and lighter model is cheaper to pre-train
and we demonstrate its capabilities for on-device computations in a proof-of-concept experiment and a comparative
on-device study.*
Tips:
- DistilBert doesn't have `token_type_ids`, you don't need to indicate which token belongs to which segment. Just separate your segments with the separation token `tokenizer.sep_token` (or `[SEP]`)
- DistilBert doesn't have options to select the input positions (`position_ids` input). This could be added if necessary though, just let's us know if you need this option.
The original code can be found `here <https://github.com/huggingface/transformers/tree/master/examples/distillation>`_.
DistilBertConfig
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
.. autoclass:: transformers.DistilBertConfig
:members:
``DistilBertTokenizer``
DistilBertTokenizer
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
.. autoclass:: transformers.DistilBertTokenizer
:members:
``DistilBertModel``
DistilBertTokenizerFast
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
.. autoclass:: transformers.DistilBertTokenizerFast
:members:
DistilBertModel
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
.. autoclass:: transformers.DistilBertModel
:members:
``DistilBertForMaskedLM``
DistilBertForMaskedLM
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
.. autoclass:: transformers.DistilBertForMaskedLM
:members:
``DistilBertForSequenceClassification``
DistilBertForSequenceClassification
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
.. autoclass:: transformers.DistilBertForSequenceClassification
:members:
``DistilBertForQuestionAnswering``
DistilBertForQuestionAnswering
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
.. autoclass:: transformers.DistilBertForQuestionAnswering
:members:
``TFDistilBertModel``
TFDistilBertModel
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
.. autoclass:: transformers.TFDistilBertModel
:members:
``TFDistilBertForMaskedLM``
TFDistilBertForMaskedLM
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
.. autoclass:: transformers.TFDistilBertForMaskedLM
:members:
``TFDistilBertForSequenceClassification``
TFDistilBertForSequenceClassification
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
.. autoclass:: transformers.TFDistilBertForSequenceClassification
:members:
``TFDistilBertForQuestionAnswering``
TFDistilBertForQuestionAnswering
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
.. autoclass:: transformers.TFDistilBertForQuestionAnswering

View File

@@ -0,0 +1,124 @@
ELECTRA
----------------------------------------------------
The ELECTRA model was proposed in the paper.
`ELECTRA: Pre-training Text Encoders as Discriminators Rather Than Generators <https://openreview.net/pdf?id=r1xMH1BtvB>`__.
ELECTRA is a new pre-training approach which trains two transformer models: the generator and the discriminator. The
generator's role is to replace tokens in a sequence, and is therefore trained as a masked language model. The discriminator,
which is the model we're interested in, tries to identify which tokens were replaced by the generator in the sequence.
The abstract from the paper is the following:
*Masked language modeling (MLM) pre-training methods such as BERT corrupt
the input by replacing some tokens with [MASK] and then train a model to
reconstruct the original tokens. While they produce good results when transferred
to downstream NLP tasks, they generally require large amounts of compute to be
effective. As an alternative, we propose a more sample-efficient pre-training task
called replaced token detection. Instead of masking the input, our approach
corrupts it by replacing some tokens with plausible alternatives sampled from a small
generator network. Then, instead of training a model that predicts the original
identities of the corrupted tokens, we train a discriminative model that predicts
whether each token in the corrupted input was replaced by a generator sample
or not. Thorough experiments demonstrate this new pre-training task is more
efficient than MLM because the task is defined over all input tokens rather than
just the small subset that was masked out. As a result, the contextual representations
learned by our approach substantially outperform the ones learned by BERT
given the same model size, data, and compute. The gains are particularly strong
for small models; for example, we train a model on one GPU for 4 days that
outperforms GPT (trained using 30x more compute) on the GLUE natural language
understanding benchmark. Our approach also works well at scale, where it
performs comparably to RoBERTa and XLNet while using less than 1/4 of their
compute and outperforms them when using the same amount of compute.*
Tips:
- ELECTRA is the pre-training approach, therefore there is nearly no changes done to the underlying model: BERT. The
only change is the separation of the embedding size and the hidden size -> The embedding size is generally smaller,
while the hidden size is larger. An additional projection layer (linear) is used to project the embeddings from
their embedding size to the hidden size. In the case where the embedding size is the same as the hidden size, no
projection layer is used.
- The ELECTRA checkpoints saved using `Google Research's implementation <https://github.com/google-research/electra>`__
contain both the generator and discriminator. The conversion script requires the user to name which model to export
into the correct architecture. Once converted to the HuggingFace format, these checkpoints may be loaded into all
available ELECTRA models, however. This means that the discriminator may be loaded in the `ElectraForMaskedLM` model,
and the generator may be loaded in the `ElectraForPreTraining` model (the classification head will be randomly
initialized as it doesn't exist in the generator).
The original code can be found `here <https://github.com/google-research/electra>`_.
ElectraConfig
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
.. autoclass:: transformers.ElectraConfig
:members:
ElectraTokenizer
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
.. autoclass:: transformers.ElectraTokenizer
:members:
ElectraTokenizerFast
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
.. autoclass:: transformers.ElectraTokenizerFast
:members:
ElectraModel
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
.. autoclass:: transformers.ElectraModel
:members:
ElectraForPreTraining
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
.. autoclass:: transformers.ElectraForPreTraining
:members:
ElectraForMaskedLM
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
.. autoclass:: transformers.ElectraForMaskedLM
:members:
ElectraForTokenClassification
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
.. autoclass:: transformers.ElectraForTokenClassification
:members:
TFElectraModel
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
.. autoclass:: transformers.TFElectraModel
:members:
TFElectraForPreTraining
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
.. autoclass:: transformers.TFElectraForPreTraining
:members:
TFElectraForMaskedLM
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
.. autoclass:: transformers.TFElectraForMaskedLM
:members:
TFElectraForTokenClassification
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
.. autoclass:: transformers.TFElectraForTokenClassification
:members:

View File

@@ -0,0 +1,23 @@
Encoder Decoder Models
-----------
This class can wrap an encoder model, such as ``BertModel`` and a decoder modeling with a language modeling head, such as ``BertForMaskedLM`` into a encoder-decoder model.
The ``EncoderDecoderModel`` class allows to instantiate a encoder decoder model using the ``from_encoder_decoder_pretrain`` class method taking a pretrained encoder and pretrained decoder model as an input.
The ``EncoderDecoderModel`` is saved using the standard ``save_pretrained()`` method and can also again be loaded using the standard ``from_pretrained()`` method.
An application of this architecture could be *summarization* using two pretrained Bert models as is shown in the paper: `Text Summarization with Pretrained Encoders <https://arxiv.org/abs/1910.13461>`_ by Yang Liu and Mirella Lapata.
``EncoderDecoderConfig``
~~~~~~~~~~~~~~~~~~~~~
.. autoclass:: transformers.EncoderDecoderConfig
:members:
``EncoderDecoderModel``
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
.. autoclass:: transformers.EncoderDecoderModel
:members:

View File

@@ -0,0 +1,74 @@
FlauBERT
----------------------------------------------------
The FlauBERT model was proposed in the paper
`FlauBERT: Unsupervised Language Model Pre-training for French <https://arxiv.org/abs/1912.05372>`__ by Hang Le et al.
It's a transformer pre-trained using a masked language modeling (MLM) objective (BERT-like).
The abstract from the paper is the following:
*Language models have become a key step to achieve state-of-the art results in many different Natural Language
Processing (NLP) tasks. Leveraging the huge amount of unlabeled texts nowadays available, they provide an efficient
way to pre-train continuous word representations that can be fine-tuned for a downstream task, along with their
contextualization at the sentence level. This has been widely demonstrated for English using contextualized
representations (Dai and Le, 2015; Peters et al., 2018; Howard and Ruder, 2018; Radford et al., 2018; Devlin et
al., 2019; Yang et al., 2019b). In this paper, we introduce and share FlauBERT, a model learned on a very large
and heterogeneous French corpus. Models of different sizes are trained using the new CNRS (French National Centre
for Scientific Research) Jean Zay supercomputer. We apply our French language models to diverse NLP tasks (text
classification, paraphrasing, natural language inference, parsing, word sense disambiguation) and show that most
of the time they outperform other pre-training approaches. Different versions of FlauBERT as well as a unified
evaluation protocol for the downstream tasks, called FLUE (French Language Understanding Evaluation), are shared
to the research community for further reproducible experiments in French NLP.*
The original code can be found `here <https://github.com/getalp/Flaubert>`_.
FlaubertConfig
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
.. autoclass:: transformers.FlaubertConfig
:members:
FlaubertTokenizer
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
.. autoclass:: transformers.FlaubertTokenizer
:members:
FlaubertModel
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
.. autoclass:: transformers.FlaubertModel
:members:
FlaubertWithLMHeadModel
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
.. autoclass:: transformers.FlaubertWithLMHeadModel
:members:
FlaubertForSequenceClassification
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
.. autoclass:: transformers.FlaubertForSequenceClassification
:members:
FlaubertForQuestionAnsweringSimple
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
.. autoclass:: transformers.FlaubertForQuestionAnsweringSimple
:members:
FlaubertForQuestionAnswering
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
.. autoclass:: transformers.FlaubertForQuestionAnswering
:members:

View File

@@ -1,56 +1,101 @@
OpenAI GPT
----------------------------------------------------
``OpenAIGPTConfig``
Overview
~~~~~~~~~~~~~~~~~~~~~
OpenAI GPT model was proposed in `Improving Language Understanding by Generative Pre-Training <https://s3-us-west-2.amazonaws.com/openai-assets/research-covers/language-unsupervised/language_understanding_paper.pdf>`__
by Alec Radford, Karthik Narasimhan, Tim Salimans and Ilya Sutskever. It's a causal (unidirectional)
transformer pre-trained using language modeling on a large corpus will long range dependencies, the Toronto Book Corpus.
The abstract from the paper is the following:
*Natural language understanding comprises a wide range of diverse tasks such
as textual entailment, question answering, semantic similarity assessment, and
document classification. Although large unlabeled text corpora are abundant,
labeled data for learning these specific tasks is scarce, making it challenging for
discriminatively trained models to perform adequately. We demonstrate that large
gains on these tasks can be realized by generative pre-training of a language model
on a diverse corpus of unlabeled text, followed by discriminative fine-tuning on each
specific task. In contrast to previous approaches, we make use of task-aware input
transformations during fine-tuning to achieve effective transfer while requiring
minimal changes to the model architecture. We demonstrate the effectiveness of
our approach on a wide range of benchmarks for natural language understanding.
Our general task-agnostic model outperforms discriminatively trained models that
use architectures specifically crafted for each task, significantly improving upon the
state of the art in 9 out of the 12 tasks studied.*
Tips:
- GPT is a model with absolute position embeddings so it's usually advised to pad the inputs on
the right rather than the left.
- GPT was trained with a causal language modeling (CLM) objective and is therefore powerful at predicting the next
token in a sequence. Leveraging this feature allows GPT-2 to generate syntactically coherent text as
it can be observed in the `run_generation.py` example script.
`Write With Transformer <https://transformer.huggingface.co/doc/gpt>`__ is a webapp created and hosted by
Hugging Face showcasing the generative capabilities of several models. GPT is one of them.
The original code can be found `here <https://github.com/openai/finetune-transformer-lm>`_.
OpenAIGPTConfig
~~~~~~~~~~~~~~~~~~~~~
.. autoclass:: transformers.OpenAIGPTConfig
:members:
``OpenAIGPTTokenizer``
OpenAIGPTTokenizer
~~~~~~~~~~~~~~~~~~~~~~~~~~
.. autoclass:: transformers.OpenAIGPTTokenizer
:members: save_vocabulary
OpenAIGPTTokenizerFast
~~~~~~~~~~~~~~~~~~~~~~~~~~
.. autoclass:: transformers.OpenAIGPTTokenizerFast
:members:
``OpenAIGPTModel``
OpenAIGPTModel
~~~~~~~~~~~~~~~~~~~~~~~~~
.. autoclass:: transformers.OpenAIGPTModel
:members:
``OpenAIGPTLMHeadModel``
OpenAIGPTLMHeadModel
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
.. autoclass:: transformers.OpenAIGPTLMHeadModel
:members:
``OpenAIGPTDoubleHeadsModel``
OpenAIGPTDoubleHeadsModel
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
.. autoclass:: transformers.OpenAIGPTDoubleHeadsModel
:members:
``TFOpenAIGPTModel``
TFOpenAIGPTModel
~~~~~~~~~~~~~~~~~~~~~~~~~
.. autoclass:: transformers.TFOpenAIGPTModel
:members:
``TFOpenAIGPTLMHeadModel``
TFOpenAIGPTLMHeadModel
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
.. autoclass:: transformers.TFOpenAIGPTLMHeadModel
:members:
``TFOpenAIGPTDoubleHeadsModel``
TFOpenAIGPTDoubleHeadsModel
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
.. autoclass:: transformers.TFOpenAIGPTDoubleHeadsModel

View File

@@ -1,56 +1,99 @@
OpenAI GPT2
----------------------------------------------------
``GPT2Config``
Overview
~~~~~~~~~~~~~~~~~~~~~
OpenAI GPT-2 model was proposed in
`Language Models are Unsupervised Multitask Learners <https://cdn.openai.com/better-language-models/language_models_are_unsupervised_multitask_learners.pdf>`_
by Alec Radford*, Jeffrey Wu*, Rewon Child, David Luan, Dario Amodei** and Ilya Sutskever**.
It's a causal (unidirectional) transformer pre-trained using language modeling on a very large
corpus of ~40 GB of text data.
The abstract from the paper is the following:
*GPT-2 is a large transformer-based language model with 1.5 billion parameters, trained on a dataset[1]
of 8 million web pages. GPT-2 is trained with a simple objective: predict the next word, given all of the previous
words within some text. The diversity of the dataset causes this simple goal to contain naturally occurring
demonstrations of many tasks across diverse domains. GPT-2 is a direct scale-up of GPT, with more than 10X
the parameters and trained on more than 10X the amount of data.*
Tips:
- GPT-2 is a model with absolute position embeddings so it's usually advised to pad the inputs on
the right rather than the left.
- GPT-2 was trained with a causal language modeling (CLM) objective and is therefore powerful at predicting the next
token in a sequence. Leveraging this feature allows GPT-2 to generate syntactically coherent text as
it can be observed in the `run_generation.py` example script.
- The PyTorch models can take the `past` as input, which is the previously computed key/value attention pairs. Using
this `past` value prevents the model from re-computing pre-computed values in the context of text generation.
See `reusing the past in generative models <../quickstart.html#using-the-past>`_ for more information on the usage
of this argument.
`Write With Transformer <https://transformer.huggingface.co/doc/gpt2-large>`__ is a webapp created and hosted by
Hugging Face showcasing the generative capabilities of several models. GPT-2 is one of them and is available in five
different sizes: small, medium, large, xl and a distilled version of the small checkpoint: distilgpt-2.
The original code can be found `here <https://openai.com/blog/better-language-models/>`_.
GPT2Config
~~~~~~~~~~~~~~~~~~~~~
.. autoclass:: transformers.GPT2Config
:members:
``GPT2Tokenizer``
GPT2Tokenizer
~~~~~~~~~~~~~~~~~~~~~
.. autoclass:: transformers.GPT2Tokenizer
:members: save_vocabulary
GPT2TokenizerFast
~~~~~~~~~~~~~~~~~~~~~
.. autoclass:: transformers.GPT2TokenizerFast
:members:
``GPT2Model``
GPT2Model
~~~~~~~~~~~~~~~~~~~~~
.. autoclass:: transformers.GPT2Model
:members:
``GPT2LMHeadModel``
GPT2LMHeadModel
~~~~~~~~~~~~~~~~~~~~~~~~~~~
.. autoclass:: transformers.GPT2LMHeadModel
:members:
``GPT2DoubleHeadsModel``
GPT2DoubleHeadsModel
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
.. autoclass:: transformers.GPT2DoubleHeadsModel
:members:
``TFGPT2Model``
TFGPT2Model
~~~~~~~~~~~~~~~~~~~~~
.. autoclass:: transformers.TFGPT2Model
:members:
``TFGPT2LMHeadModel``
TFGPT2LMHeadModel
~~~~~~~~~~~~~~~~~~~~~~~~~~~
.. autoclass:: transformers.TFGPT2LMHeadModel
:members:
``TFGPT2DoubleHeadsModel``
TFGPT2DoubleHeadsModel
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
.. autoclass:: transformers.TFGPT2DoubleHeadsModel

View File

@@ -0,0 +1,114 @@
Reformer
----------------------------------------------------
**DISCLAIMER:** This model is still a work in progress, if you see something strange,
file a `Github Issue <https://github.com/huggingface/transformers/issues/new?assignees=&labels=&template=bug-report.md&title>`_
Overview
~~~~~
The Reformer model was presented in `Reformer: The Efficient Transformer <https://https://arxiv.org/abs/2001.04451.pdf>`_ by Nikita Kitaev, Łukasz Kaiser, Anselm Levskaya.
Here the abstract:
*Large Transformer models routinely achieve state-of-the-art results on a number of tasks but training these models can be prohibitively costly, especially on long sequences. We introduce two techniques to improve the efficiency of Transformers. For one, we replace dot-product attention by one that uses locality-sensitive hashing, changing its complexity from O(L^2) to O(Llog(L)), where L is the length of the sequence. Furthermore, we use reversible residual layers instead of the standard residuals, which allows storing activations only once in the training process instead of N times, where N is the number of layers. The resulting model, the Reformer, performs on par with Transformer models while being much more memory-efficient and much faster on long sequences.*
The Authors' code can be found `here <https://github.com/google/trax/tree/master/trax/models/reformer>`_ .
Axial Positional Encodings
~~~~~~~~~~~~~~~~~~~~
Axial Positional Encodings were first implemented in Google's `trax library <https://github.com/google/trax/blob/4d99ad4965bab1deba227539758d59f0df0fef48/trax/layers/research/position_encodings.py#L29>`_ and developed by the authors of this model's paper. In models that are treating very long input sequences, the conventional position id encodings store an embedings vector of size :math:`d` being the ``config.hidden_size`` for every position :math:`i, \ldots, n_s`, with :math:`n_s` being ``config.max_embedding_size``. *E.g.*, having a sequence length of :math:`n_s = 2^{19} \approx 0.5M` and a ``config.hidden_size`` of :math:`d = 2^{10} \approx 1000` would result in a position encoding matrix:
.. math::
X_{i,j}, \text{ with } i \in \left[1,\ldots, d\right] \text{ and } j \in \left[1,\ldots, n_s\right]
which alone has over 500M parameters to store. Axial positional encodings factorize :math:`X_{i,j}` into two matrices:
.. math::
X^{1}_{i,j}, \text{ with } i \in \left[1,\ldots, d^1\right] \text{ and } j \in \left[1,\ldots, n_s^1\right]
and
.. math::
X^{2}_{i,j}, \text{ with } i \in \left[1,\ldots, d^2\right] \text{ and } j \in \left[1,\ldots, n_s^2\right]
with:
.. math::
d = d^1 + d^2 \text{ and } n_s = n_s^1 \times n_s^2 .
Therefore the following holds:
.. math::
X_{i,j} = \begin{cases}
X^{1}_{i, k}, & \text{if }\ i < d^1 \text{ with } k = j \mod n_s^1 \\
X^{2}_{i - d^1, l}, & \text{if } i \ge d^1 \text{ with } l = \lfloor\frac{j}{n_s^1}\rfloor
\end{cases}
Intuitively, this means that a position embedding vector :math:`x_j \in \mathbb{R}^{d}` is now the composition of two factorized embedding vectors: :math:`x^1_{k, l} + x^2_{l, k}`, where as the ``config.max_embedding_size`` dimension :math:`j` is factorized into :math:`k \text{ and } l`.
This design ensures that each position embedding vector :math:`x_j` is unique.
Using the above example again, axial position encoding with :math:`d^1 = 2^5, d^2 = 2^5, n_s^1 = 2^9, n_s^2 = 2^{10}` can drastically reduced the number of parameters to :math:`2^{14} + 2^{15} \approx 49000` parameters.
In practice, the parameter ``config.axial_pos_embds_dim`` is set to ``list``:math:`(d^1, d^2)` which sum has to be equal to ``config.hidden_size`` and ``config.axial_pos_shape`` is set to ``list``:math:`(n_s^1, n_s^2)` and which product has to be equal to ``config.max_embedding_size`` which during training has to be equal to the ``sequence length`` of the ``input_ids``.
LSH Self Attention
~~~~~~~~~~~~~~~~~~~~
In Locality sensitive hashing (LSH) self attention the key and query projection weights are tied. Therefore, the key query embedding vectors are also tied.
LSH self attention uses the locality sensitive
hashing mechanism proposed in `Practical and Optimal LSH for Angular Distance <https://arxiv.org/abs/1509.02897>`_ to assign each of the tied key query embedding vectors to one of ``config.num_buckets`` possible buckets. The premise is that the more "similar" key query embedding vectors (in terms of *cosine similarity*) are to each other, the more likely they are assigned to the same bucket.
The accuracy of the LSH mechanism can be improved by increasing ``config.num_hashes`` or directly the argument ``num_hashes`` of the forward function so that the output of the LSH self attention better approximates the output of the "normal" full self attention.
The buckets are then sorted and chunked into query key embedding vector chunks each of length ``config.lsh_chunk_length``. For each chunk, the query embedding vectors attend to its key vectors (which are tied to themselves) and to the key embedding vectors of ``config.lsh_num_chunks_before`` previous neighboring chunks and ``config.lsh_num_chunks_after`` following neighboring chunks.
For more information, see the `original Paper <https://arxiv.org/abs/2001.04451>`_ or this great `blog post <https://www.pragmatic.ml/reformer-deep-dive/>`_.
Note that ``config.num_buckets`` can also be factorized into a ``list``:math:`(n_{\text{buckets}}^1, n_{\text{buckets}}^2)`. This way instead of assigning the query key embedding vectors to one of :math:`(1,\ldots, n_{\text{buckets}})` they are assigned to one of :math:`(1-1,\ldots, n_{\text{buckets}}^1-1, \ldots, 1-n_{\text{buckets}}^2, \ldots, n_{\text{buckets}}^1-n_{\text{buckets}}^2)`. This is crucial for very long sequences to save memory.
It is recommended to leave ``config.num_buckets=None``, so that depending on the sequence length, a good value for ``num_buckets`` are calculated on the fly.
Using LSH self attention, the memory and time complexity of the query-key matmul operation can be reduced from :math:`\mathcal{O}(n_s \times n_s)` to :math:`\mathcal{O}(n_s \times \log(n_s))`, which usually represents the memory and time bottleneck in a transformer model, with :math:`n_s` being the sequence length.
Local Self Attention
~~~~~~~~~~~~~~~~~~~~
Local self attention is essentially a "normal" self attention layer with
key, query and value projections, but is chunked so that in each chunk of length ``config.local_chunk_length`` the query embedding vectors only attends to the key embedding vectors in its chunk and to the key embedding vectors of ``config.local_num_chunks_before`` previous neighboring chunks and ``config.local_num_chunks_after`` following neighboring chunks.
Using Local self attention, the memory and time complexity of the query-key matmul operation can be reduced from :math:`\mathcal{O}(n_s \times n_s)` to :math:`\mathcal{O}(n_s \times \log(n_s))`, which usually represents the memory and time bottleneck in a transformer model, with :math:`n_s` being the sequence length.
Training
~~~~~~~~~~~~~~~~~~~~
During training, we must ensure that the sequence length is set to a value that can be divided by the least common multiple of ``config.lsh_chunk_length`` and ``config.local_chunk_length`` and that the parameters of the Axial Positional Encodings are correctly set as described above. Reformer is very memory efficient so that the model can easily be trained on sequences as long as 64000 tokens.
For training, the ``ReformerModelWithLMHead`` should be used as follows:
::
input_ids = tokenizer.encode('This is a sentence from the training data', return_tensors='pt')
loss = model(input_ids, labels=input_ids)[0]
ReformerConfig
~~~~~~~~~~~~~~~~~~~~~
.. autoclass:: transformers.ReformerConfig
:members:
ReformerTokenizer
~~~~~~~~~~~~~~~~~~~~~
.. autoclass:: transformers.ReformerTokenizer
:members:
ReformerModel
~~~~~~~~~~~~~~~~~~~~
.. autoclass:: transformers.ReformerModel
:members:
ReformerModelWithLMHead
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
.. autoclass:: transformers.ReformerModelWithLMHead
:members:

View File

@@ -1,57 +1,108 @@
RoBERTa
----------------------------------------------------
``RobertaConfig``
The RoBERTa model was proposed in `RoBERTa: A Robustly Optimized BERT Pretraining Approach <https://arxiv.org/abs/1907.11692>`_
by Yinhan Liu, Myle Ott, Naman Goyal, Jingfei Du, Mandar Joshi, Danqi Chen, Omer Levy, Mike Lewis, Luke Zettlemoyer,
Veselin Stoyanov. It is based on Google's BERT model released in 2018.
It builds on BERT and modifies key hyperparameters, removing the next-sentence pretraining
objective and training with much larger mini-batches and learning rates.
The abstract from the paper is the following:
*Language model pretraining has led to significant performance gains but careful comparison between different
approaches is challenging. Training is computationally expensive, often done on private datasets of different sizes,
and, as we will show, hyperparameter choices have significant impact on the final results. We present a replication
study of BERT pretraining (Devlin et al., 2019) that carefully measures the impact of many key hyperparameters and
training data size. We find that BERT was significantly undertrained, and can match or exceed the performance of
every model published after it. Our best model achieves state-of-the-art results on GLUE, RACE and SQuAD. These
results highlight the importance of previously overlooked design choices, and raise questions about the source
of recently reported improvements. We release our models and code.*
Tips:
- This implementation is the same as :class:`~transformers.BertModel` with a tiny embeddings tweak as well as a
setup for Roberta pretrained models.
- RoBERTa has the same architecture as BERT, but uses a byte-level BPE as a tokenizer (same as GPT-2) and uses a
different pre-training scheme.
- RoBERTa doesn't have `token_type_ids`, you don't need to indicate which token belongs to which segment. Just separate your segments with the separation token `tokenizer.sep_token` (or `</s>`)
- `Camembert <./camembert.html>`__ is a wrapper around RoBERTa. Refer to this page for usage examples.
The original code can be found `here <https://github.com/pytorch/fairseq/tree/master/examples/roberta>`_.
RobertaConfig
~~~~~~~~~~~~~~~~~~~~~
.. autoclass:: transformers.RobertaConfig
:members:
``RobertaTokenizer``
RobertaTokenizer
~~~~~~~~~~~~~~~~~~~~~
.. autoclass:: transformers.RobertaTokenizer
:members:
:members: build_inputs_with_special_tokens, get_special_tokens_mask,
create_token_type_ids_from_sequences, save_vocabulary
``RobertaModel``
RobertaTokenizerFast
~~~~~~~~~~~~~~~~~~~~~
.. autoclass:: transformers.RobertaTokenizerFast
:members: build_inputs_with_special_tokens
RobertaModel
~~~~~~~~~~~~~~~~~~~~
.. autoclass:: transformers.RobertaModel
:members:
``RobertaForMaskedLM``
RobertaForMaskedLM
~~~~~~~~~~~~~~~~~~~~~~~~~~
.. autoclass:: transformers.RobertaForMaskedLM
:members:
``RobertaForSequenceClassification``
RobertaForSequenceClassification
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
.. autoclass:: transformers.RobertaForSequenceClassification
:members:
``TFRobertaModel``
RobertaForTokenClassification
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
.. autoclass:: transformers.RobertaForTokenClassification
:members:
TFRobertaModel
~~~~~~~~~~~~~~~~~~~~
.. autoclass:: transformers.TFRobertaModel
:members:
``TFRobertaForMaskedLM``
TFRobertaForMaskedLM
~~~~~~~~~~~~~~~~~~~~~~~~~~
.. autoclass:: transformers.TFRobertaForMaskedLM
:members:
``TFRobertaForSequenceClassification``
TFRobertaForSequenceClassification
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
.. autoclass:: transformers.TFRobertaForSequenceClassification
:members:
TFRobertaForTokenClassification
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
.. autoclass:: transformers.TFRobertaForTokenClassification
:members:

View File

@@ -0,0 +1,105 @@
T5
----------------------------------------------------
**DISCLAIMER:** This model is still a work in progress, if you see something strange,
file a `Github Issue <https://github.com/huggingface/transformers/issues/new?assignees=&labels=&template=bug-report.md&title>`_
Overview
~~~~~
The T5 model was presented in `Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer <https://arxiv.org/pdf/1910.10683.pdf>`_ by Colin Raffel, Noam Shazeer, Adam Roberts, Katherine Lee, Sharan Narang, Michael Matena, Yanqi Zhou, Wei Li, Peter J. Liu in
Here the abstract:
*Transfer learning, where a model is first pre-trained on a data-rich task before being fine-tuned on a downstream task, has emerged as a powerful technique in natural language processing (NLP). The effectiveness of transfer learning has given rise to a diversity of approaches, methodology, and practice.
In this paper, we explore the landscape of transfer learning techniques for NLP by introducing a unified framework that converts every language problem into a text-to-text format.
Our systematic study compares pre-training objectives, architectures, unlabeled datasets, transfer approaches, and other factors on dozens of language understanding tasks.
By combining the insights from our exploration with scale and our new "Colossal Clean Crawled Corpus", we achieve state-of-the-art results on many benchmarks covering summarization, question answering, text classification, and more.
To facilitate future work on transfer learning for NLP, we release our dataset, pre-trained models, and code.*
The Authors' code can be found `here <https://github.com/google-research/text-to-text-transfer-transformer>`_ .
Training
~~~~~~~~~~~~~~~~~~~~
T5 is an encoder-decoder model and converts all NLP problems into a text-to-text format. It is trained using teacher forcing.
This means that for training we always need an input sequence and a target sequence.
The input sequence is fed to the model using ``input_ids``. The target sequence is shifted to the right, *i.e.* prepended by a start-sequence token and fed to the decoder using the `decoder_input_ids`. In teacher-forcing style, the target sequence is then appended by the EOS token and corresponds to the ``lm_labels``. The PAD token is hereby used as the start-sequence token.
T5 can be trained / fine-tuned both in a supervised and unsupervised fashion.
- Unsupervised denoising training
In this setup spans of the input sequence are masked by so-called sentinel tokens (*a.k.a* unique mask tokens)
and the output sequence is formed as a concatenation of the same sentinel tokens and the *real* masked tokens.
Each sentinel token represents a unique mask token for this sentence and should start with ``<extra_id_1>``, ``<extra_id_2>``, ... up to ``<extra_id_100>``. As a default 100 sentinel tokens are available in ``T5Tokenizer``.
*E.g.* the sentence "The cute dog walks in the park" with the masks put on "cute dog" and "the" should be processed as follows:
::
input_ids = tokenizer.encode('The <extra_id_1> walks in <extra_id_2> park', return_tensors='pt')
lm_labels = tokenizer.encode('<extra_id_1> cute dog <extra_id_2> the <extra_id_3> </s>', return_tensors='pt')
# the forward function automatically creates the correct decoder_input_ids
model(input_ids=input_ids, lm_labels=lm_labels)
- Supervised training
In this setup the input sequence and output sequence are standard sequence to sequence input output mapping.
In translation, *e.g.* the input sequence "The house is wonderful." and output sequence "Das Haus ist wunderbar." should
be processed as follows:
::
input_ids = tokenizer.encode('translate English to German: The house is wonderful. </s>', return_tensors='pt')
lm_labels = tokenizer.encode('Das Haus ist wunderbar. </s>', return_tensors='pt')
# the forward function automatically creates the correct decoder_input_ids
model(input_ids=input_ids, lm_labels=lm_labels)
Tips
~~~~~~~~~~~~~~~~~~~~
- T5 is an encoder-decoder model pre-trained on a multi-task mixture of unsupervised
and supervised tasks and for which each task is converted into a text-to-text format.
T5 works well on a variety of tasks out-of-the-box by prepending a different prefix to the input corresponding to each task, e.g.: for translation: *translate English to German: ..., summarize: ...*.
For more information about which prefix to use, it is easiest to look into Appendix D of the `paper <https://arxiv.org/pdf/1910.10683.pdf>`_ .
- For sequence to sequence generation, it is recommended to use ``T5ForConditionalGeneration.generate()``. The method takes care of feeding the encoded input via cross-attention layers to the decoder and auto-regressively generates the decoder output.
- T5 uses relative scalar embeddings. Encoder input padding can be done on the left and on the right.
The original code can be found `here <https://github.com/google-research/text-to-text-transfer-transformer>`_.
T5Config
~~~~~~~~~~~~~~~~~~~~~
.. autoclass:: transformers.T5Config
:members:
T5Tokenizer
~~~~~~~~~~~~~~~~~~~~~
.. autoclass:: transformers.T5Tokenizer
:members: build_inputs_with_special_tokens, get_special_tokens_mask,
create_token_type_ids_from_sequences, save_vocabulary
T5Model
~~~~~~~~~~~~~~~~~~~~
.. autoclass:: transformers.T5Model
:members:
T5ForConditionalGeneration
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
.. autoclass:: transformers.T5ForConditionalGeneration
:members:
TFT5Model
~~~~~~~~~~~~~~~~~~~~
.. autoclass:: transformers.TFT5Model
:members:
TFT5ForConditionalGeneration
~~~~~~~~~~~~~~~~~~~~~~~~~~
.. autoclass:: transformers.TFT5ForConditionalGeneration
:members:

View File

@@ -1,43 +1,81 @@
Transformer XL
----------------------------------------------------
Overview
~~~~~~~~~~~~~~~~~~~~~
``TransfoXLConfig``
The Transformer-XL model was proposed in
`Transformer-XL: Attentive Language Models Beyond a Fixed-Length Context <https://arxiv.org/abs/1901.02860>`__
by Zihang Dai*, Zhilin Yang*, Yiming Yang, Jaime Carbonell, Quoc V. Le, Ruslan Salakhutdinov.
It's a causal (uni-directional) transformer with relative positioning (sinusoïdal) embeddings which can reuse
previously computed hidden-states to attend to longer context (memory).
This model also uses adaptive softmax inputs and outputs (tied).
The abstract from the paper is the following:
*Transformers have a potential of learning longer-term dependency, but are limited by a fixed-length context in the
setting of language modeling. We propose a novel neural architecture Transformer-XL that enables learning dependency
beyond a fixed length without disrupting temporal coherence. It consists of a segment-level recurrence mechanism and
a novel positional encoding scheme. Our method not only enables capturing longer-term dependency, but also resolves
the context fragmentation problem. As a result, Transformer-XL learns dependency that is 80% longer than RNNs and
450% longer than vanilla Transformers, achieves better performance on both short and long sequences, and is up
to 1,800+ times faster than vanilla Transformers during evaluation. Notably, we improve the state-of-the-art results
of bpc/perplexity to 0.99 on enwiki8, 1.08 on text8, 18.3 on WikiText-103, 21.8 on One Billion Word, and 54.5 on
Penn Treebank (without finetuning). When trained only on WikiText-103, Transformer-XL manages to generate reasonably
coherent, novel text articles with thousands of tokens.*
Tips:
- Transformer-XL uses relative sinusoidal positional embeddings. Padding can be done on the left or on the right.
The original implementation trains on SQuAD with padding on the left, therefore the padding defaults are set to left.
- Transformer-XL is one of the few models that has no sequence length limit.
The original code can be found `here <https://github.com/kimiyoung/transformer-xl>`_.
TransfoXLConfig
~~~~~~~~~~~~~~~~~~~~~
.. autoclass:: transformers.TransfoXLConfig
:members:
``TransfoXLTokenizer``
TransfoXLTokenizer
~~~~~~~~~~~~~~~~~~~~~~~~~~
.. autoclass:: transformers.TransfoXLTokenizer
:members: save_vocabulary
TransfoXLTokenizerFast
~~~~~~~~~~~~~~~~~~~~~~~~~~
.. autoclass:: transformers.TransfoXLTokenizerFast
:members:
``TransfoXLModel``
TransfoXLModel
~~~~~~~~~~~~~~~~~~~~~~~~~~
.. autoclass:: transformers.TransfoXLModel
:members:
``TransfoXLLMHeadModel``
TransfoXLLMHeadModel
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
.. autoclass:: transformers.TransfoXLLMHeadModel
:members:
``TFTransfoXLModel``
TFTransfoXLModel
~~~~~~~~~~~~~~~~~~~~~~~~~~
.. autoclass:: transformers.TFTransfoXLModel
:members:
``TFTransfoXLLMHeadModel``
TFTransfoXLLMHeadModel
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
.. autoclass:: transformers.TFTransfoXLLMHeadModel

View File

@@ -1,68 +1,108 @@
XLM
----------------------------------------------------
``XLMConfig``
Overview
~~~~~~~~~~~~~~~~~~~~~
The XLM model was proposed in `Cross-lingual Language Model Pretraining <https://arxiv.org/abs/1901.07291>`_
by Guillaume Lample*, Alexis Conneau*. It's a transformer pre-trained using one of the following objectives:
- a causal language modeling (CLM) objective (next token prediction),
- a masked language modeling (MLM) objective (Bert-like), or
- a Translation Language Modeling (TLM) object (extension of Bert's MLM to multiple language inputs)
The abstract from the paper is the following:
*Recent studies have demonstrated the efficiency of generative pretraining for English natural language understanding.
In this work, we extend this approach to multiple languages and show the effectiveness of cross-lingual pretraining.
We propose two methods to learn cross-lingual language models (XLMs): one unsupervised that only relies on monolingual
data, and one supervised that leverages parallel data with a new cross-lingual language model objective. We obtain
state-of-the-art results on cross-lingual classification, unsupervised and supervised machine translation. On XNLI,
our approach pushes the state of the art by an absolute gain of 4.9% accuracy. On unsupervised machine translation,
we obtain 34.3 BLEU on WMT'16 German-English, improving the previous state of the art by more than 9 BLEU. On
supervised machine translation, we obtain a new state of the art of 38.5 BLEU on WMT'16 Romanian-English, outperforming
the previous best approach by more than 4 BLEU. Our code and pretrained models will be made publicly available.*
Tips:
- XLM has many different checkpoints, which were trained using different objectives: CLM, MLM or TLM. Make sure to
select the correct objective for your task (e.g. MLM checkpoints are not suitable for generation).
- XLM has multilingual checkpoints which leverage a specific `lang` parameter. Check out the
`multi-lingual <../multilingual.html>`__ page for more information.
The original code can be found `here <https://github.com/facebookresearch/XLM/>`_.
XLMConfig
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
.. autoclass:: transformers.XLMConfig
:members:
``XLMTokenizer``
XLMTokenizer
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
.. autoclass:: transformers.XLMTokenizer
:members:
:members: build_inputs_with_special_tokens, get_special_tokens_mask,
create_token_type_ids_from_sequences, save_vocabulary
``XLMModel``
XLMModel
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
.. autoclass:: transformers.XLMModel
:members:
``XLMWithLMHeadModel``
XLMWithLMHeadModel
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
.. autoclass:: transformers.XLMWithLMHeadModel
:members:
``XLMForSequenceClassification``
XLMForSequenceClassification
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
.. autoclass:: transformers.XLMForSequenceClassification
:members:
``XLMForQuestionAnswering``
XLMForQuestionAnsweringSimple
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
.. autoclass:: transformers.XLMForQuestionAnsweringSimple
:members:
XLMForQuestionAnswering
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
.. autoclass:: transformers.XLMForQuestionAnswering
:members:
``TFXLMModel``
TFXLMModel
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
.. autoclass:: transformers.TFXLMModel
:members:
``TFXLMWithLMHeadModel``
TFXLMWithLMHeadModel
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
.. autoclass:: transformers.TFXLMWithLMHeadModel
:members:
``TFXLMForSequenceClassification``
TFXLMForSequenceClassification
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
.. autoclass:: transformers.TFXLMForSequenceClassification
:members:
``TFXLMForQuestionAnsweringSimple``
TFXLMForQuestionAnsweringSimple
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
.. autoclass:: transformers.TFXLMForQuestionAnsweringSimple

View File

@@ -0,0 +1,109 @@
XLM-RoBERTa
------------------------------------------
The XLM-RoBERTa model was proposed in `Unsupervised Cross-lingual Representation Learning at Scale <https://arxiv.org/abs/1911.02116>`__
by Alexis Conneau, Kartikay Khandelwal, Naman Goyal, Vishrav Chaudhary, Guillaume Wenzek, Francisco Guzmán,
Edouard Grave, Myle Ott, Luke Zettlemoyer and Veselin Stoyanov. It is based on Facebook's RoBERTa model released in 2019.
It is a large multi-lingual language model, trained on 2.5TB of filtered CommonCrawl data.
The abstract from the paper is the following:
*This paper shows that pretraining multilingual language models at scale leads to significant performance gains for
a wide range of cross-lingual transfer tasks. We train a Transformer-based masked language model on one hundred
languages, using more than two terabytes of filtered CommonCrawl data. Our model, dubbed XLM-R, significantly
outperforms multilingual BERT (mBERT) on a variety of cross-lingual benchmarks, including +13.8% average accuracy
on XNLI, +12.3% average F1 score on MLQA, and +2.1% average F1 score on NER. XLM-R performs particularly well on
low-resource languages, improving 11.8% in XNLI accuracy for Swahili and 9.2% for Urdu over the previous XLM model.
We also present a detailed empirical evaluation of the key factors that are required to achieve these gains,
including the trade-offs between (1) positive transfer and capacity dilution and (2) the performance of high and
low resource languages at scale. Finally, we show, for the first time, the possibility of multilingual modeling
without sacrificing per-language performance; XLM-Ris very competitive with strong monolingual models on the GLUE
and XNLI benchmarks. We will make XLM-R code, data, and models publicly available.*
Tips:
- XLM-R is a multilingual model trained on 100 different languages. Unlike some XLM multilingual models, it does
not require `lang` tensors to understand which language is used, and should be able to determine the correct
language from the input ids.
- This implementation is the same as RoBERTa. Refer to the `documentation of RoBERTa <./roberta.html>`__ for usage
examples as well as the information relative to the inputs and outputs.
The original code can be found `here <https://github.com/pytorch/fairseq/tree/master/examples/xlmr>`_.
XLMRobertaConfig
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
.. autoclass:: transformers.XLMRobertaConfig
:members:
XLMRobertaTokenizer
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
.. autoclass:: transformers.XLMRobertaTokenizer
:members: build_inputs_with_special_tokens, get_special_tokens_mask,
create_token_type_ids_from_sequences, save_vocabulary
XLMRobertaModel
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
.. autoclass:: transformers.XLMRobertaModel
:members:
XLMRobertaForMaskedLM
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
.. autoclass:: transformers.XLMRobertaForMaskedLM
:members:
XLMRobertaForSequenceClassification
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
.. autoclass:: transformers.XLMRobertaForSequenceClassification
:members:
XLMRobertaForMultipleChoice
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
.. autoclass:: transformers.XLMRobertaForMultipleChoice
:members:
XLMRobertaForTokenClassification
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
.. autoclass:: transformers.XLMRobertaForTokenClassification
:members:
TFXLMRobertaModel
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
.. autoclass:: transformers.TFXLMRobertaModel
:members:
TFXLMRobertaForMaskedLM
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
.. autoclass:: transformers.TFXLMRobertaForMaskedLM
:members:
TFXLMRobertaForSequenceClassification
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
.. autoclass:: transformers.TFXLMRobertaForSequenceClassification
:members:
TFXLMRobertaForTokenClassification
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
.. autoclass:: transformers.TFXLMRobertaForTokenClassification
:members:

View File

@@ -1,70 +1,126 @@
XLNet
----------------------------------------------------
``XLNetConfig``
Overview
~~~~~~~~~~~~~~~~~~~~~
The XLNet model was proposed in `XLNet: Generalized Autoregressive Pretraining for Language Understanding <https://arxiv.org/abs/1906.08237>`_
by Zhilin Yang*, Zihang Dai*, Yiming Yang, Jaime Carbonell, Ruslan Salakhutdinov, Quoc V. Le.
XLnet is an extension of the Transformer-XL model pre-trained using an autoregressive method
to learn bidirectional contexts by maximizing the expected likelihood over all permutations
of the input sequence factorization order.
The abstract from the paper is the following:
*With the capability of modeling bidirectional contexts, denoising autoencoding based pretraining like BERT achieves
better performance than pretraining approaches based on autoregressive language modeling. However, relying on
corrupting the input with masks, BERT neglects dependency between the masked positions and suffers from a
pretrain-finetune discrepancy. In light of these pros and cons, we propose XLNet, a generalized autoregressive
pretraining method that (1) enables learning bidirectional contexts by maximizing the expected likelihood over
all permutations of the factorization order and (2) overcomes the limitations of BERT thanks to its autoregressive
formulation. Furthermore, XLNet integrates ideas from Transformer-XL, the state-of-the-art autoregressive model,
into pretraining. Empirically, under comparable experiment settings, XLNet outperforms BERT on 20 tasks, often by
a large margin, including question answering, natural language inference, sentiment analysis, and document ranking.*
Tips:
- The specific attention pattern can be controlled at training and test time using the `perm_mask` input.
- Due to the difficulty of training a fully auto-regressive model over various factorization order,
XLNet is pretrained using only a sub-set of the output tokens as target which are selected
with the `target_mapping` input.
- To use XLNet for sequential decoding (i.e. not in fully bi-directional setting), use the `perm_mask` and
`target_mapping` inputs to control the attention span and outputs (see examples in `examples/run_generation.py`)
- XLNet is one of the few models that has no sequence length limit.
The original code can be found `here <https://github.com/zihangdai/xlnet/>`_.
XLNetConfig
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
.. autoclass:: transformers.XLNetConfig
:members:
``XLNetTokenizer``
XLNetTokenizer
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
.. autoclass:: transformers.XLNetTokenizer
:members:
:members: build_inputs_with_special_tokens, get_special_tokens_mask,
create_token_type_ids_from_sequences, save_vocabulary
``XLNetModel``
XLNetModel
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
.. autoclass:: transformers.XLNetModel
:members:
``XLNetLMHeadModel``
XLNetLMHeadModel
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
.. autoclass:: transformers.XLNetLMHeadModel
:members:
``XLNetForSequenceClassification``
XLNetForSequenceClassification
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
.. autoclass:: transformers.XLNetForSequenceClassification
:members:
``XLNetForQuestionAnswering``
XLNetForTokenClassification
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
.. autoclass:: transformers.XLNetForTokenClassification
:members:
XLNetForMultipleChoice
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
.. autoclass:: transformers.XLNetForMultipleChoice
:members:
XLNetForQuestionAnsweringSimple
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
.. autoclass:: transformers.XLNetForQuestionAnsweringSimple
:members:
XLNetForQuestionAnswering
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
.. autoclass:: transformers.XLNetForQuestionAnswering
:members:
``TFXLNetModel``
TFXLNetModel
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
.. autoclass:: transformers.TFXLNetModel
:members:
``TFXLNetLMHeadModel``
TFXLNetLMHeadModel
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
.. autoclass:: transformers.TFXLNetLMHeadModel
:members:
``TFXLNetForSequenceClassification``
TFXLNetForSequenceClassification
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
.. autoclass:: transformers.TFXLNetForSequenceClassification
:members:
``TFXLNetForQuestionAnsweringSimple``
TFXLNetForQuestionAnsweringSimple
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
.. autoclass:: transformers.TFXLNetForQuestionAnsweringSimple

View File

@@ -0,0 +1,55 @@
# Model upload and sharing
Starting with `v2.2.2`, you can now upload and share your fine-tuned models with the community, using the <abbr title="Command-line interface">CLI</abbr> that's built-in to the library.
**First, create an account on [https://huggingface.co/join](https://huggingface.co/join)**. Optionally, join an existing organization or create a new one. Then:
```shell
transformers-cli login
# log in using the same credentials as on huggingface.co
```
Upload your model:
```shell
transformers-cli upload ./path/to/pretrained_model/
# ^^ Upload folder containing weights/tokenizer/config
# saved via `.save_pretrained()`
transformers-cli upload ./config.json [--filename folder/foobar.json]
# ^^ Upload a single file
# (you can optionally override its filename, which can be nested inside a folder)
```
If you want your model to be namespaced by your organization name rather than your username, add the following flag to any command:
```shell
--organization organization_name
```
Your model will then be accessible through its identifier, a concatenation of your username (or organization name) and the folder name above:
```python
"username/pretrained_model"
# or if an org:
"organization_name/pretrained_model"
```
**Please add a README.md model card** to the repo under `model_cards/` with: model description, training params (dataset, preprocessing, hardware used, hyperparameters), evaluation results, intended uses & limitations, etc.
Your model now has a page on huggingface.co/models 🔥
Anyone can load it from code:
```python
tokenizer = AutoTokenizer.from_pretrained("namespace/pretrained_model")
model = AutoModel.from_pretrained("namespace/pretrained_model")
```
List all your files on S3:
```shell
transformers-cli s3 ls
```
You can also delete unneeded files:
```shell
transformers-cli s3 rm …
```

View File

@@ -47,6 +47,7 @@ The different languages this model/tokenizer handles, as well as the ids of thes
.. code-block::
# Continuation of the previous script
print(tokenizer.lang2id) # {'en': 0, 'fr': 1}
@@ -54,6 +55,7 @@ These ids should be used when passing a language parameter during a model pass.
.. code-block::
# Continuation of the previous script
input_ids = torch.tensor([tokenizer.encode("Wikipedia was used to")]) # batch size of 1
@@ -62,6 +64,7 @@ filled with the appropriate language ids, of the same size as input_ids. For eng
.. code-block::
# Continuation of the previous script
language_id = tokenizer.lang2id['en'] # 0
langs = torch.tensor([language_id] * input_ids.shape[1]) # torch.tensor([0, 0, 0, ..., 0])
@@ -73,6 +76,7 @@ You can then feed it all as input to your model:
.. code-block::
# Continuation of the previous script
outputs = model(input_ids, langs=langs)
@@ -100,4 +104,16 @@ BERT has two checkpoints that can be used for multi-lingual tasks:
- ``bert-base-multilingual-cased`` (Masked language modeling + Next sentence prediction, 104 languages)
These checkpoints do not require language embeddings at inference time. They should identify the language
used in the context and infer accordingly.
used in the context and infer accordingly.
XLM-RoBERTa
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
XLM-RoBERTa was trained on 2.5TB of newly created clean CommonCrawl data in 100 languages. It provides strong
gains over previously released multi-lingual models like mBERT or XLM on downstream taks like classification,
sequence labeling and question answering.
Two XLM-RoBERTa checkpoints can be used for multi-lingual tasks:
- ``xlm-roberta-base`` (Masked language modeling, 100 languages)
- ``xlm-roberta-large`` (Masked language modeling, 100 languages)

1
docs/source/notebooks.md Symbolic link
View File

@@ -0,0 +1 @@
../../notebooks/README.md

View File

@@ -1,16 +0,0 @@
Notebooks
================================================
We include `three Jupyter Notebooks <https://github.com/huggingface/transformers/tree/master/notebooks>`_ that can be used to check that the predictions of the PyTorch model are identical to the predictions of the original TensorFlow model.
*
The first NoteBook (\ `Comparing-TF-and-PT-models.ipynb <https://github.com/huggingface/transformers/blob/master/notebooks/Comparing-TF-and-PT-models.ipynb>`_\ ) extracts the hidden states of a full sequence on each layers of the TensorFlow and the PyTorch models and computes the standard deviation between them. In the given example, we get a standard deviation of 1.5e-7 to 9e-7 on the various hidden state of the models.
*
The second NoteBook (\ `Comparing-TF-and-PT-models-SQuAD.ipynb <https://github.com/huggingface/transformers/blob/master/notebooks/Comparing-TF-and-PT-models-SQuAD.ipynb>`_\ ) compares the loss computed by the TensorFlow and the PyTorch models for identical initialization of the fine-tuning layer of the ``BertForQuestionAnswering`` and computes the standard deviation between them. In the given example, we get a standard deviation of 2.5e-7 between the models.
*
The third NoteBook (\ `Comparing-TF-and-PT-models-MLM-NSP.ipynb <https://github.com/huggingface/transformers/blob/master/notebooks/Comparing-TF-and-PT-models-MLM-NSP.ipynb>`_\ ) compares the predictions computed by the TensorFlow and the PyTorch models for masked token language modeling using the pre-trained masked language modeling model.
Please follow the instructions given in the notebooks to run and modify them.

View File

@@ -3,6 +3,7 @@ Pretrained models
Here is the full list of the currently provided pretrained models together with a short presentation of each model.
For a list that includes community-uploaded models, refer to `https://huggingface.co/models <https://huggingface.co/models>`__.
+-------------------+------------------------------------------------------------+---------------------------------------------------------------------------------------------------------------------------------------+
| Architecture | Shortcut name | Details of the model |
@@ -61,6 +62,36 @@ Here is the full list of the currently provided pretrained models together with
| | ``bert-base-german-dbmdz-uncased`` | | 12-layer, 768-hidden, 12-heads, 110M parameters. |
| | | | Trained on uncased German text by DBMDZ |
| | | (see `details on dbmdz repository <https://github.com/dbmdz/german-bert>`__). |
| +------------------------------------------------------------+---------------------------------------------------------------------------------------------------------------------------------------+
| | ``bert-base-japanese`` | | 12-layer, 768-hidden, 12-heads, 110M parameters. |
| | | | Trained on Japanese text. Text is tokenized with MeCab and WordPiece. |
| | | | `MeCab <https://taku910.github.io/mecab/>`__ is required for tokenization. |
| | | (see `details on cl-tohoku repository <https://github.com/cl-tohoku/bert-japanese>`__). |
| +------------------------------------------------------------+---------------------------------------------------------------------------------------------------------------------------------------+
| | ``bert-base-japanese-whole-word-masking`` | | 12-layer, 768-hidden, 12-heads, 110M parameters. |
| | | | Trained on Japanese text using Whole-Word-Masking. Text is tokenized with MeCab and WordPiece. |
| | | | `MeCab <https://taku910.github.io/mecab/>`__ is required for tokenization. |
| | | (see `details on cl-tohoku repository <https://github.com/cl-tohoku/bert-japanese>`__). |
| +------------------------------------------------------------+---------------------------------------------------------------------------------------------------------------------------------------+
| | ``bert-base-japanese-char`` | | 12-layer, 768-hidden, 12-heads, 110M parameters. |
| | | | Trained on Japanese text. Text is tokenized into characters. |
| | | (see `details on cl-tohoku repository <https://github.com/cl-tohoku/bert-japanese>`__). |
| +------------------------------------------------------------+---------------------------------------------------------------------------------------------------------------------------------------+
| | ``bert-base-japanese-char-whole-word-masking`` | | 12-layer, 768-hidden, 12-heads, 110M parameters. |
| | | | Trained on Japanese text using Whole-Word-Masking. Text is tokenized into characters. |
| | | (see `details on cl-tohoku repository <https://github.com/cl-tohoku/bert-japanese>`__). |
| +------------------------------------------------------------+---------------------------------------------------------------------------------------------------------------------------------------+
| | ``bert-base-finnish-cased-v1`` | | 12-layer, 768-hidden, 12-heads, 110M parameters. |
| | | | Trained on cased Finnish text. |
| | | (see `details on turkunlp.org <http://turkunlp.org/FinBERT/>`__). |
| +------------------------------------------------------------+---------------------------------------------------------------------------------------------------------------------------------------+
| | ``bert-base-finnish-uncased-v1`` | | 12-layer, 768-hidden, 12-heads, 110M parameters. |
| | | | Trained on uncased Finnish text. |
| | | (see `details on turkunlp.org <http://turkunlp.org/FinBERT/>`__). |
| +------------------------------------------------------------+---------------------------------------------------------------------------------------------------------------------------------------+
| | ``bert-base-dutch-cased`` | | 12-layer, 768-hidden, 12-heads, 110M parameters. |
| | | | Trained on cased Dutch text. |
| | | (see `details on wietsedv repository <https://github.com/wietsedv/bertje/>`__). |
+-------------------+------------------------------------------------------------+---------------------------------------------------------------------------------------------------------------------------------------+
| GPT | ``openai-gpt`` | | 12-layer, 768-hidden, 12-heads, 110M parameters. |
| | | | OpenAI GPT English model |
@@ -128,6 +159,10 @@ Here is the full list of the currently provided pretrained models together with
| | | | ``roberta-large`` fine-tuned on `MNLI <http://www.nyu.edu/projects/bowman/multinli/>`__. |
| | | (see `details <https://github.com/pytorch/fairseq/tree/master/examples/roberta>`__) |
| +------------------------------------------------------------+---------------------------------------------------------------------------------------------------------------------------------------+
| | ``distilroberta-base`` | | 6-layer, 768-hidden, 12-heads, 82M parameters |
| | | | The DistilRoBERTa model distilled from the RoBERTa model `roberta-base` checkpoint. |
| | | (see `details <https://github.com/huggingface/transformers/tree/master/examples/distillation>`__) |
| +------------------------------------------------------------+---------------------------------------------------------------------------------------------------------------------------------------+
| | ``roberta-base-openai-detector`` | | 12-layer, 768-hidden, 12-heads, 125M parameters |
| | | | ``roberta-base`` fine-tuned by OpenAI on the outputs of the 1.5B-parameter GPT-2 model. |
| | | (see `details <https://github.com/openai/gpt-2-output-dataset/tree/master/detector>`__) |
@@ -144,17 +179,25 @@ Here is the full list of the currently provided pretrained models together with
| | | | The DistilBERT model distilled from the BERT model `bert-base-uncased` checkpoint, with an additional linear layer. |
| | | (see `details <https://github.com/huggingface/transformers/tree/master/examples/distillation>`__) |
| +------------------------------------------------------------+---------------------------------------------------------------------------------------------------------------------------------------+
| | ``distilbert-base-cased`` | | 6-layer, 768-hidden, 12-heads, 65M parameters |
| | | | The DistilBERT model distilled from the BERT model `bert-base-cased` checkpoint |
| | | (see `details <https://github.com/huggingface/transformers/tree/master/examples/distillation>`__) |
| +------------------------------------------------------------+---------------------------------------------------------------------------------------------------------------------------------------+
| | ``distilbert-base-cased-distilled-squad`` | | 6-layer, 768-hidden, 12-heads, 65M parameters |
| | | | The DistilBERT model distilled from the BERT model `bert-base-cased` checkpoint, with an additional question answering layer. |
| | | (see `details <https://github.com/huggingface/transformers/tree/master/examples/distillation>`__) |
| +------------------------------------------------------------+---------------------------------------------------------------------------------------------------------------------------------------+
| | ``distilgpt2`` | | 6-layer, 768-hidden, 12-heads, 82M parameters |
| | | | The DistilGPT2 model distilled from the GPT2 model `gpt2` checkpoint. |
| | | (see `details <https://github.com/huggingface/transformers/tree/master/examples/distillation>`__) |
| +------------------------------------------------------------+---------------------------------------------------------------------------------------------------------------------------------------+
| | ``distilroberta-base`` | | 6-layer, 768-hidden, 12-heads, 82M parameters |
| | | | The DistilRoBERTa model distilled from the RoBERTa model `roberta-base` checkpoint. |
| | | (see `details <https://github.com/huggingface/transformers/tree/master/examples/distillation>`__) |
| +------------------------------------------------------------+---------------------------------------------------------------------------------------------------------------------------------------+
| | ``distilbert-base-german-cased`` | | 6-layer, 768-hidden, 12-heads, 66M parameters |
| | | | The German DistilBERT model distilled from the German DBMDZ BERT model `bert-base-german-dbmdz-cased` checkpoint. |
| | | (see `details <https://github.com/huggingface/transformers/tree/master/examples/distillation>`__) |
| +------------------------------------------------------------+---------------------------------------------------------------------------------------------------------------------------------------+
| | ``distilbert-base-multilingual-cased`` | | 6-layer, 768-hidden, 12-heads, 134M parameters |
| | | | The multilingual DistilBERT model distilled from the Multilingual BERT model `bert-base-multilingual-cased` checkpoint. |
| | | (see `details <https://github.com/huggingface/transformers/tree/master/examples/distillation>`__) |
+-------------------+------------------------------------------------------------+---------------------------------------------------------------------------------------------------------------------------------------+
| CTRL | ``ctrl`` | | 48-layer, 1280-hidden, 16-heads, 1.6B parameters |
| | | | Salesforce's Large-sized CTRL English model |
@@ -165,36 +208,94 @@ Here is the full list of the currently provided pretrained models together with
+-------------------+------------------------------------------------------------+---------------------------------------------------------------------------------------------------------------------------------------+
| ALBERT | ``albert-base-v1`` | | 12 repeating layers, 128 embedding, 768-hidden, 12-heads, 11M parameters |
| | | | ALBERT base model |
| | | (see `details <https://github.com/google-research/google-research/tree/master/albert>`__) |
| | | (see `details <https://github.com/google-research/ALBERT>`__) |
| +------------------------------------------------------------+---------------------------------------------------------------------------------------------------------------------------------------+
| | ``albert-large-v1`` | | 24 repeating layers, 128 embedding, 1024-hidden, 16-heads, 17M parameters |
| | | | ALBERT large model |
| | | (see `details <https://github.com/google-research/google-research/tree/master/albert>`__) |
| | | (see `details <https://github.com/google-research/ALBERT>`__) |
| +------------------------------------------------------------+---------------------------------------------------------------------------------------------------------------------------------------+
| | ``albert-xlarge-v1`` | | 24 repeating layers, 128 embedding, 2048-hidden, 16-heads, 58M parameters |
| | | | ALBERT xlarge model |
| | | (see `details <https://github.com/google-research/google-research/tree/master/albert>`__) |
| | | (see `details <https://github.com/google-research/ALBERT>`__) |
| +------------------------------------------------------------+---------------------------------------------------------------------------------------------------------------------------------------+
| | ``albert-xxlarge-v1`` | | 12 repeating layer, 128 embedding, 4096-hidden, 64-heads, 223M parameters |
| | | | ALBERT xxlarge model |
| | | (see `details <https://github.com/google-research/google-research/tree/master/albert>`__) |
| | | (see `details <https://github.com/google-research/ALBERT>`__) |
| +------------------------------------------------------------+---------------------------------------------------------------------------------------------------------------------------------------+
| | ``albert-base-v2`` | | 12 repeating layers, 128 embedding, 768-hidden, 12-heads, 11M parameters |
| | | | ALBERT base model with no dropout, additional training data and longer training |
| | | (see `details <https://github.com/google-research/google-research/tree/master/albert>`__) |
| | | (see `details <https://github.com/google-research/ALBERT>`__) |
| +------------------------------------------------------------+---------------------------------------------------------------------------------------------------------------------------------------+
| | ``albert-large-v2`` | | 24 repeating layers, 128 embedding, 1024-hidden, 16-heads, 17M parameters |
| | | | ALBERT large model with no dropout, additional training data and longer training |
| | | (see `details <https://github.com/google-research/google-research/tree/master/albert>`__) |
| | | (see `details <https://github.com/google-research/ALBERT>`__) |
| +------------------------------------------------------------+---------------------------------------------------------------------------------------------------------------------------------------+
| | ``albert-xlarge-v2`` | | 24 repeating layers, 128 embedding, 2048-hidden, 16-heads, 58M parameters |
| | | | ALBERT xlarge model with no dropout, additional training data and longer training |
| | | (see `details <https://github.com/google-research/google-research/tree/master/albert>`__) |
| | | (see `details <https://github.com/google-research/ALBERT>`__) |
| +------------------------------------------------------------+---------------------------------------------------------------------------------------------------------------------------------------+
| | ``albert-xxlarge-v2`` | | 12 repeating layer, 128 embedding, 4096-hidden, 64-heads, 223M parameters |
| | | | ALBERT xxlarge model with no dropout, additional training data and longer training |
| | | (see `details <https://github.com/google-research/google-research/tree/master/albert>`__) |
| | | (see `details <https://github.com/google-research/ALBERT>`__) |
+-------------------+------------------------------------------------------------+---------------------------------------------------------------------------------------------------------------------------------------+
| T5 | ``t5-small`` | | ~60M parameters with 6-layers, 512-hidden-state, 2048 feed-forward hidden-state, 8-heads, |
| | | | Trained on English text: the Colossal Clean Crawled Corpus (C4) |
| +------------------------------------------------------------+---------------------------------------------------------------------------------------------------------------------------------------+
| | ``t5-base`` | | ~220M parameters with 12-layers, 768-hidden-state, 3072 feed-forward hidden-state, 12-heads, |
| | | | Trained on English text: the Colossal Clean Crawled Corpus (C4) |
| +------------------------------------------------------------+---------------------------------------------------------------------------------------------------------------------------------------+
| | ``t5-large`` | | ~770M parameters with 24-layers, 1024-hidden-state, 4096 feed-forward hidden-state, 16-heads, |
| | | | Trained on English text: the Colossal Clean Crawled Corpus (C4) |
| +------------------------------------------------------------+---------------------------------------------------------------------------------------------------------------------------------------+
| | ``t5-3B`` | | ~2.8B parameters with 24-layers, 1024-hidden-state, 16384 feed-forward hidden-state, 32-heads, |
| | | | Trained on English text: the Colossal Clean Crawled Corpus (C4) |
| +------------------------------------------------------------+---------------------------------------------------------------------------------------------------------------------------------------+
| | ``t5-11B`` | | ~11B parameters with 24-layers, 1024-hidden-state, 65536 feed-forward hidden-state, 128-heads, |
| | | | Trained on English text: the Colossal Clean Crawled Corpus (C4) |
+-------------------+------------------------------------------------------------+---------------------------------------------------------------------------------------------------------------------------------------+
| XLM-RoBERTa | ``xlm-roberta-base`` | | ~125M parameters with 12-layers, 768-hidden-state, 3072 feed-forward hidden-state, 8-heads, |
| | | | Trained on on 2.5 TB of newly created clean CommonCrawl data in 100 languages |
| +------------------------------------------------------------+---------------------------------------------------------------------------------------------------------------------------------------+
| | ``xlm-roberta-large`` | | ~355M parameters with 24-layers, 1027-hidden-state, 4096 feed-forward hidden-state, 16-heads, |
| | | | Trained on 2.5 TB of newly created clean CommonCrawl data in 100 languages |
+-------------------+------------------------------------------------------------+---------------------------------------------------------------------------------------------------------------------------------------+
| FlauBERT | ``flaubert-small-cased`` | | 6-layer, 512-hidden, 8-heads, 54M parameters |
| | | | FlauBERT small architecture |
| | | (see `details <https://github.com/getalp/Flaubert>`__) |
| +------------------------------------------------------------+---------------------------------------------------------------------------------------------------------------------------------------+
| | ``flaubert-base-uncased`` | | 12-layer, 768-hidden, 12-heads, 137M parameters |
| | | | FlauBERT base architecture with uncased vocabulary |
| | | (see `details <https://github.com/getalp/Flaubert>`__) |
| +------------------------------------------------------------+---------------------------------------------------------------------------------------------------------------------------------------+
| | ``flaubert-base-cased`` | | 12-layer, 768-hidden, 12-heads, 138M parameters |
| | | | FlauBERT base architecture with cased vocabulary |
| | | (see `details <https://github.com/getalp/Flaubert>`__) |
| +------------------------------------------------------------+---------------------------------------------------------------------------------------------------------------------------------------+
| | ``flaubert-large-cased`` | | 24-layer, 1024-hidden, 16-heads, 373M parameters |
| | | | FlauBERT large architecture |
| | | (see `details <https://github.com/getalp/Flaubert>`__) |
+-------------------+------------------------------------------------------------+---------------------------------------------------------------------------------------------------------------------------------------+
| Bart | ``bart-large`` | | 12-layer, 1024-hidden, 16-heads, 406M parameters |
| | | (see `details <https://github.com/pytorch/fairseq/tree/master/examples/bart>`_) |
| +------------------------------------------------------------+---------------------------------------------------------------------------------------------------------------------------------------+
| | ``bart-large-mnli`` | | Adds a 2 layer classification head with 1 million parameters |
| | | | bart-large base architecture with a classification head, finetuned on MNLI |
| +------------------------------------------------------------+---------------------------------------------------------------------------------------------------------------------------------------+
| | ``bart-large-cnn`` | | 12-layer, 1024-hidden, 16-heads, 406M parameters (same as base) |
| | | | bart-large base architecture finetuned on cnn summarization task |
| +------------------------------------------------------------+---------------------------------------------------------------------------------------------------------------------------------------+
| | ``mbart-large-en-ro`` | | 12-layer, 1024-hidden, 16-heads, 880M parameters |
| | | | bart-large architecture pretrained on cc25 multilingual data , finetuned on WMT english romanian translation. |
+-------------------+------------------------------------------------------------+---------------------------------------------------------------------------------------------------------------------------------------+
| DialoGPT | ``DialoGPT-small`` | | 12-layer, 768-hidden, 12-heads, 124M parameters |
| | | | Trained on English text: 147M conversation-like exchanges extracted from Reddit. |
| +------------------------------------------------------------+---------------------------------------------------------------------------------------------------------------------------------------+
| | ``DialoGPT-medium`` | | 24-layer, 1024-hidden, 16-heads, 355M parameters |
| | | | Trained on English text: 147M conversation-like exchanges extracted from Reddit. |
| +------------------------------------------------------------+---------------------------------------------------------------------------------------------------------------------------------------+
| | ``DialoGPT-large`` | | 36-layer, 1280-hidden, 20-heads, 774M parameters |
| | | | Trained on English text: 147M conversation-like exchanges extracted from Reddit. |
+-------------------+------------------------------------------------------------+---------------------------------------------------------------------------------------------------------------------------------------+
| Reformer | ``reformer-crime-and-punishment`` | | 6-layer, 256-hidden, 2-heads, 3M parameters |
| | | | Trained on English text: Crime and Punishment novel by Fyodor Dostoyevsky |
+-------------------+------------------------------------------------------------+---------------------------------------------------------------------------------------------------------------------------------------+
.. <https://huggingface.co/transformers/examples.html>`__

View File

@@ -209,7 +209,7 @@ past = None
for i in range(100):
print(i)
output, past = model(context, past=past)
token = torch.argmax(output[0, :])
token = torch.argmax(output[..., -1, :])
generated += [token.tolist()]
context = token.unsqueeze(0)
@@ -219,4 +219,4 @@ sequence = tokenizer.decode(generated)
print(sequence)
```
The model only requires a single token as input as all the previous tokens' key/value pairs are contained in the `past`.
The model only requires a single token as input as all the previous tokens' key/value pairs are contained in the `past`.

View File

@@ -58,14 +58,14 @@ where
``Uncased`` means that the text has been lowercased before WordPiece tokenization, e.g., ``John Smith`` becomes ``john smith``. The Uncased model also strips out any accent markers. ``Cased`` means that the true case and accent markers are preserved. Typically, the Uncased model is better unless you know that case information is important for your task (e.g., Named Entity Recognition or Part-of-Speech tagging). For information about the Multilingual and Chinese model, see the `Multilingual README <https://github.com/google-research/bert/blob/master/multilingual.md>`__ or the original TensorFlow repository.
When using an ``uncased model``\ , make sure to pass ``--do_lower_case`` to the example training scripts (or pass ``do_lower_case=True`` to FullTokenizer if you're using your own script and loading the tokenizer your-self.).
When using an ``uncased model``\ , make sure your tokenizer has ``do_lower_case=True`` (either in its configuration, or passed as an additional parameter).
Examples:
.. code-block:: python
# BERT
tokenizer = BertTokenizer.from_pretrained('bert-base-uncased', do_lower_case=True, do_basic_tokenize=True)
tokenizer = BertTokenizer.from_pretrained('bert-base-uncased', do_basic_tokenize=True)
model = BertForSequenceClassification.from_pretrained('bert-base-uncased')
# OpenAI GPT
@@ -140,13 +140,13 @@ Here is the recommended way of saving the model, configuration and vocabulary to
torch.save(model_to_save.state_dict(), output_model_file)
model_to_save.config.to_json_file(output_config_file)
tokenizer.save_vocabulary(output_dir)
tokenizer.save_pretrained(output_dir)
# Step 2: Re-load the saved model and vocabulary
# Example for a Bert model
model = BertForQuestionAnswering.from_pretrained(output_dir)
tokenizer = BertTokenizer.from_pretrained(output_dir, do_lower_case=args.do_lower_case) # Add specific options if needed
tokenizer = BertTokenizer.from_pretrained(output_dir) # Add specific options if needed
# Example for a GPT model
model = OpenAIGPTDoubleHeadsModel.from_pretrained(output_dir)
tokenizer = OpenAIGPTTokenizer.from_pretrained(output_dir)

727
docs/source/usage.rst Normal file
View File

@@ -0,0 +1,727 @@
Usage
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
This page shows the most frequent use-cases when using the library. The models available allow for many different
configurations and a great versatility in use-cases. The most simple ones are presented here, showcasing usage
for tasks such as question answering, sequence classification, named entity recognition and others.
These examples leverage auto-models, which are classes that will instantiate a model according to a given checkpoint,
automatically selecting the correct model architecture. Please check the :class:`~transformers.AutoModel` documentation
for more information.
Feel free to modify the code to be more specific and adapt it to your specific use-case.
In order for a model to perform well on a task, it must be loaded from a checkpoint corresponding to that task. These
checkpoints are usually pre-trained on a large corpus of data and fine-tuned on a specific task. This means the
following:
- Not all models were fine-tuned on all tasks. If you want to fine-tune a model on a specific task, you can leverage
one of the `run_$TASK.py` script in the
`examples <https://github.com/huggingface/transformers/tree/master/examples>`_ directory.
- Fine-tuned models were fine-tuned on a specific dataset. This dataset may or may not overlap with your use-case
and domain. As mentioned previously, you may leverage the
`examples <https://github.com/huggingface/transformers/tree/master/examples>`_ scripts to fine-tune your model, or you
may create your own training script.
In order to do an inference on a task, several mechanisms are made available by the library:
- Pipelines: very easy-to-use abstractions, which require as little as two lines of code.
- Using a model directly with a tokenizer (PyTorch/TensorFlow): the full inference using the model. Less abstraction,
but much more powerful.
Both approaches are showcased here.
.. note::
All tasks presented here leverage pre-trained checkpoints that were fine-tuned on specific tasks. Loading a
checkpoint that was not fine-tuned on a specific task would load only the base transformer layers and not the
additional head that is used for the task, initializing the weights of that head randomly.
This would produce random output.
Sequence Classification
--------------------------
Sequence classification is the task of classifying sequences according to a given number of classes. An example
of sequence classification is the GLUE dataset, which is entirely based on that task. If you would like to fine-tune
a model on a GLUE sequence classification task, you may leverage the
`run_glue.py <https://github.com/huggingface/transformers/tree/master/examples/text-classification/run_glue.py>`_ or
`run_tf_glue.py <https://github.com/huggingface/transformers/tree/master/examples/run_tf_glue.py>`_ scripts.
Here is an example using the pipelines do to sentiment analysis: identifying if a sequence is positive or negative.
It leverages a fine-tuned model on sst2, which is a GLUE task.
::
from transformers import pipeline
nlp = pipeline("sentiment-analysis")
print(nlp("I hate you"))
print(nlp("I love you"))
This returns a label ("POSITIVE" or "NEGATIVE") alongside a score, as follows:
::
[{'label': 'NEGATIVE', 'score': 0.9991129}]
[{'label': 'POSITIVE', 'score': 0.99986565}]
Here is an example of doing a sequence classification using a model to determine if two sequences are paraphrases
of each other. The process is the following:
- Instantiate a tokenizer and a model from the checkpoint name. The model is identified as a BERT model and loads it
with the weights stored in the checkpoint.
- Build a sequence from the two sentences, with the correct model-specific separators token type ids
and attention masks (:func:`~transformers.PreTrainedTokenizer.encode` and
:func:`~transformers.PreTrainedTokenizer.encode_plus` take care of this)
- Pass this sequence through the model so that it is classified in one of the two available classes: 0
(not a paraphrase) and 1 (is a paraphrase)
- Compute the softmax of the result to get probabilities over the classes
- Print the results
::
## PYTORCH CODE
from transformers import AutoTokenizer, AutoModelForSequenceClassification
import torch
tokenizer = AutoTokenizer.from_pretrained("bert-base-cased-finetuned-mrpc")
model = AutoModelForSequenceClassification.from_pretrained("bert-base-cased-finetuned-mrpc")
classes = ["not paraphrase", "is paraphrase"]
sequence_0 = "The company HuggingFace is based in New York City"
sequence_1 = "Apples are especially bad for your health"
sequence_2 = "HuggingFace's headquarters are situated in Manhattan"
paraphrase = tokenizer.encode_plus(sequence_0, sequence_2, return_tensors="pt")
not_paraphrase = tokenizer.encode_plus(sequence_0, sequence_1, return_tensors="pt")
paraphrase_classification_logits = model(**paraphrase)[0]
not_paraphrase_classification_logits = model(**not_paraphrase)[0]
paraphrase_results = torch.softmax(paraphrase_classification_logits, dim=1).tolist()[0]
not_paraphrase_results = torch.softmax(not_paraphrase_classification_logits, dim=1).tolist()[0]
print("Should be paraphrase")
for i in range(len(classes)):
print(f"{classes[i]}: {round(paraphrase_results[i] * 100)}%")
print("\nShould not be paraphrase")
for i in range(len(classes)):
print(f"{classes[i]}: {round(not_paraphrase_results[i] * 100)}%")
## TENSORFLOW CODE
from transformers import AutoTokenizer, TFAutoModelForSequenceClassification
import tensorflow as tf
tokenizer = AutoTokenizer.from_pretrained("bert-base-cased-finetuned-mrpc")
model = TFAutoModelForSequenceClassification.from_pretrained("bert-base-cased-finetuned-mrpc")
classes = ["not paraphrase", "is paraphrase"]
sequence_0 = "The company HuggingFace is based in New York City"
sequence_1 = "Apples are especially bad for your health"
sequence_2 = "HuggingFace's headquarters are situated in Manhattan"
paraphrase = tokenizer.encode_plus(sequence_0, sequence_2, return_tensors="tf")
not_paraphrase = tokenizer.encode_plus(sequence_0, sequence_1, return_tensors="tf")
paraphrase_classification_logits = model(paraphrase)[0]
not_paraphrase_classification_logits = model(not_paraphrase)[0]
paraphrase_results = tf.nn.softmax(paraphrase_classification_logits, axis=1).numpy()[0]
not_paraphrase_results = tf.nn.softmax(not_paraphrase_classification_logits, axis=1).numpy()[0]
print("Should be paraphrase")
for i in range(len(classes)):
print(f"{classes[i]}: {round(paraphrase_results[i] * 100)}%")
print("\nShould not be paraphrase")
for i in range(len(classes)):
print(f"{classes[i]}: {round(not_paraphrase_results[i] * 100)}%")
This outputs the following results:
::
Should be paraphrase
not paraphrase: 10%
is paraphrase: 90%
Should not be paraphrase
not paraphrase: 94%
is paraphrase: 6%
Extractive Question Answering
----------------------------------------------------
Extractive Question Answering is the task of extracting an answer from a text given a question. An example of a
question answering dataset is the SQuAD dataset, which is entirely based on that task. If you would like to fine-tune
a model on a SQuAD task, you may leverage the `run_squad.py`.
Here is an example using the pipelines do to question answering: extracting an answer from a text given a question.
It leverages a fine-tuned model on SQuAD.
::
from transformers import pipeline
nlp = pipeline("question-answering")
context = r"""
Extractive Question Answering is the task of extracting an answer from a text given a question. An example of a
question answering dataset is the SQuAD dataset, which is entirely based on that task. If you would like to fine-tune
a model on a SQuAD task, you may leverage the `run_squad.py`.
"""
print(nlp(question="What is extractive question answering?", context=context))
print(nlp(question="What is a good example of a question answering dataset?", context=context))
This returns an answer extracted from the text, a confidence score, alongside "start" and "end" values which
are the positions of the extracted answer in the text.
::
{'score': 0.622232091629833, 'start': 34, 'end': 96, 'answer': 'the task of extracting an answer from a text given a question.'}
{'score': 0.5115299158662765, 'start': 147, 'end': 161, 'answer': 'SQuAD dataset,'}
Here is an example of question answering using a model and a tokenizer. The process is the following:
- Instantiate a tokenizer and a model from the checkpoint name. The model is identified as a BERT model and loads it
with the weights stored in the checkpoint.
- Define a text and a few questions.
- Iterate over the questions and build a sequence from the text and the current question, with the correct
model-specific separators token type ids and attention masks
- Pass this sequence through the model. This outputs a range of scores across the entire sequence tokens (question and
text), for both the start and end positions.
- Compute the softmax of the result to get probabilities over the tokens
- Fetch the tokens from the identified start and stop values, convert those tokens to a string.
- Print the results
::
## PYTORCH CODE
from transformers import AutoTokenizer, AutoModelForQuestionAnswering
import torch
tokenizer = AutoTokenizer.from_pretrained("bert-large-uncased-whole-word-masking-finetuned-squad")
model = AutoModelForQuestionAnswering.from_pretrained("bert-large-uncased-whole-word-masking-finetuned-squad")
text = r"""
🤗 Transformers (formerly known as pytorch-transformers and pytorch-pretrained-bert) provides general-purpose
architectures (BERT, GPT-2, RoBERTa, XLM, DistilBert, XLNet…) for Natural Language Understanding (NLU) and Natural
Language Generation (NLG) with over 32+ pretrained models in 100+ languages and deep interoperability between
TensorFlow 2.0 and PyTorch.
"""
questions = [
"How many pretrained models are available in Transformers?",
"What does Transformers provide?",
"Transformers provides interoperability between which frameworks?",
]
for question in questions:
inputs = tokenizer.encode_plus(question, text, add_special_tokens=True, return_tensors="pt")
input_ids = inputs["input_ids"].tolist()[0]
text_tokens = tokenizer.convert_ids_to_tokens(input_ids)
answer_start_scores, answer_end_scores = model(**inputs)
answer_start = torch.argmax(
answer_start_scores
) # Get the most likely beginning of answer with the argmax of the score
answer_end = torch.argmax(answer_end_scores) + 1 # Get the most likely end of answer with the argmax of the score
answer = tokenizer.convert_tokens_to_string(tokenizer.convert_ids_to_tokens(input_ids[answer_start:answer_end]))
print(f"Question: {question}")
print(f"Answer: {answer}\n")
## TENSORFLOW CODE
from transformers import AutoTokenizer, TFAutoModelForQuestionAnswering
import tensorflow as tf
tokenizer = AutoTokenizer.from_pretrained("bert-large-uncased-whole-word-masking-finetuned-squad")
model = TFAutoModelForQuestionAnswering.from_pretrained("bert-large-uncased-whole-word-masking-finetuned-squad")
text = r"""
🤗 Transformers (formerly known as pytorch-transformers and pytorch-pretrained-bert) provides general-purpose
architectures (BERT, GPT-2, RoBERTa, XLM, DistilBert, XLNet…) for Natural Language Understanding (NLU) and Natural
Language Generation (NLG) with over 32+ pretrained models in 100+ languages and deep interoperability between
TensorFlow 2.0 and PyTorch.
"""
questions = [
"How many pretrained models are available in Transformers?",
"What does Transformers provide?",
"Transformers provides interoperability between which frameworks?",
]
for question in questions:
inputs = tokenizer.encode_plus(question, text, add_special_tokens=True, return_tensors="tf")
input_ids = inputs["input_ids"].numpy()[0]
text_tokens = tokenizer.convert_ids_to_tokens(input_ids)
answer_start_scores, answer_end_scores = model(inputs)
answer_start = tf.argmax(
answer_start_scores, axis=1
).numpy()[0] # Get the most likely beginning of answer with the argmax of the score
answer_end = (
tf.argmax(answer_end_scores, axis=1) + 1
).numpy()[0] # Get the most likely end of answer with the argmax of the score
answer = tokenizer.convert_tokens_to_string(tokenizer.convert_ids_to_tokens(input_ids[answer_start:answer_end]))
print(f"Question: {question}")
print(f"Answer: {answer}\n")
This outputs the questions followed by the predicted answers:
::
Question: How many pretrained models are available in Transformers?
Answer: over 32 +
Question: What does Transformers provide?
Answer: general - purpose architectures
Question: Transformers provides interoperability between which frameworks?
Answer: tensorflow 2 . 0 and pytorch
Language Modeling
----------------------------------------------------
Language modeling is the task of fitting a model to a corpus, which can be domain specific. All popular transformer
based models are trained using a variant of language modeling, e.g. BERT with masked language modeling, GPT-2 with
causal language modeling.
Language modeling can be useful outside of pre-training as well, for example to shift the model distribution to be
domain-specific: using a language model trained over a very large corpus, and then fine-tuning it to a news dataset
or on scientific papers e.g. `LysandreJik/arxiv-nlp <https://huggingface.co/lysandre/arxiv-nlp>`__.
Masked Language Modeling
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Masked language modeling is the task of masking tokens in a sequence with a masking token, and prompting the model to
fill that mask with an appropriate token. This allows the model to attend to both the right context (tokens on the
right of the mask) and the left context (tokens on the left of the mask). Such a training creates a strong basis
for downstream tasks requiring bi-directional context such as SQuAD (question answering,
see `Lewis, Lui, Goyal et al. <https://arxiv.org/abs/1910.13461>`__, part 4.2).
Here is an example of using pipelines to replace a mask from a sequence:
::
from transformers import pipeline
nlp = pipeline("fill-mask")
print(nlp(f"HuggingFace is creating a {nlp.tokenizer.mask_token} that the community uses to solve NLP tasks."))
This outputs the sequences with the mask filled, the confidence score as well as the token id in the tokenizer
vocabulary:
::
[
{'sequence': '<s> HuggingFace is creating a tool that the community uses to solve NLP tasks.</s>', 'score': 0.15627853572368622, 'token': 3944},
{'sequence': '<s> HuggingFace is creating a framework that the community uses to solve NLP tasks.</s>', 'score': 0.11690319329500198, 'token': 7208},
{'sequence': '<s> HuggingFace is creating a library that the community uses to solve NLP tasks.</s>', 'score': 0.058063216507434845, 'token': 5560},
{'sequence': '<s> HuggingFace is creating a database that the community uses to solve NLP tasks.</s>', 'score': 0.04211743175983429, 'token': 8503},
{'sequence': '<s> HuggingFace is creating a prototype that the community uses to solve NLP tasks.</s>', 'score': 0.024718601256608963, 'token': 17715}
]
Here is an example doing masked language modeling using a model and a tokenizer. The process is the following:
- Instantiate a tokenizer and a model from the checkpoint name. The model is identified as a DistilBERT model and
loads it with the weights stored in the checkpoint.
- Define a sequence with a masked token, placing the :obj:`tokenizer.mask_token` instead of a word.
- Encode that sequence into IDs and find the position of the masked token in that list of IDs.
- Retrieve the predictions at the index of the mask token: this tensor has the same size as the vocabulary, and the
values are the scores attributed to each token. The model gives higher score to tokens he deems probable in that
context.
- Retrieve the top 5 tokens using the PyTorch :obj:`topk` or TensorFlow :obj:`top_k` methods.
- Replace the mask token by the tokens and print the results
::
## PYTORCH CODE
from transformers import AutoModelWithLMHead, AutoTokenizer
import torch
tokenizer = AutoTokenizer.from_pretrained("distilbert-base-cased")
model = AutoModelWithLMHead.from_pretrained("distilbert-base-cased")
sequence = f"Distilled models are smaller than the models they mimic. Using them instead of the large versions would help {tokenizer.mask_token} our carbon footprint."
input = tokenizer.encode(sequence, return_tensors="pt")
mask_token_index = torch.where(input == tokenizer.mask_token_id)[1]
token_logits = model(input)[0]
mask_token_logits = token_logits[0, mask_token_index, :]
top_5_tokens = torch.topk(mask_token_logits, 5, dim=1).indices[0].tolist()
for token in top_5_tokens:
print(sequence.replace(tokenizer.mask_token, tokenizer.decode([token])))
## TENSORFLOW CODE
from transformers import TFAutoModelWithLMHead, AutoTokenizer
import tensorflow as tf
tokenizer = AutoTokenizer.from_pretrained("distilbert-base-cased")
model = TFAutoModelWithLMHead.from_pretrained("distilbert-base-cased")
sequence = f"Distilled models are smaller than the models they mimic. Using them instead of the large versions would help {tokenizer.mask_token} our carbon footprint."
input = tokenizer.encode(sequence, return_tensors="tf")
mask_token_index = tf.where(input == tokenizer.mask_token_id)[0, 1]
token_logits = model(input)[0]
mask_token_logits = token_logits[0, mask_token_index, :]
top_5_tokens = tf.math.top_k(mask_token_logits, 5).indices.numpy()
for token in top_5_tokens:
print(sequence.replace(tokenizer.mask_token, tokenizer.decode([token])))
This prints five sequences, with the top 5 tokens predicted by the model:
::
Distilled models are smaller than the models they mimic. Using them instead of the large versions would help reduce our carbon footprint.
Distilled models are smaller than the models they mimic. Using them instead of the large versions would help increase our carbon footprint.
Distilled models are smaller than the models they mimic. Using them instead of the large versions would help decrease our carbon footprint.
Distilled models are smaller than the models they mimic. Using them instead of the large versions would help offset our carbon footprint.
Distilled models are smaller than the models they mimic. Using them instead of the large versions would help improve our carbon footprint.
Causal Language Modeling
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Causal language modeling is the task of predicting the token following a sequence of tokens. In this situation, the
model only attends to the left context (tokens on the left of the mask). Such a training is particularly interesting
for generation tasks.
There is currently no pipeline to do causal language modeling/generation.
Here is an example using the tokenizer and model. leveraging the :func:`~transformers.PreTrainedModel.generate` method
to generate the tokens following the initial sequence in PyTorch, and creating a simple loop in TensorFlow.
::
## PYTORCH CODE
from transformers import AutoModelWithLMHead, AutoTokenizer
tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelWithLMHead.from_pretrained("gpt2")
sequence = f"Hugging Face is based in DUMBO, New York City, and is"
input = tokenizer.encode(sequence, return_tensors="pt")
generated = model.generate(input, max_length=50, do_sample=True)
resulting_string = tokenizer.decode(generated.tolist()[0])
print(resulting_string)
## TENSORFLOW CODE
from transformers import TFAutoModelWithLMHead, AutoTokenizer
import tensorflow as tf
tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = TFAutoModelWithLMHead.from_pretrained("gpt2")
sequence = f"Hugging Face is based in DUMBO, New York City, and is"
input = tokenizer.encode(sequence, return_tensors="tf")
generated = model.generate(input, max_length=50, do_sample=True)
resulting_string = tokenizer.decode(generated.tolist()[0])
print(resulting_string)
This outputs a (hopefully) coherent string from the original sequence, as the
:func:`~transformers.PreTrainedModel.generate` samples from a top_p/tok_k distribution:
::
Hugging Face is based in DUMBO, New York City, and is a live-action TV series based on the novel by John
Carpenter, and its producers, David Kustlin and Steve Pichar. The film is directed by!
Named Entity Recognition
----------------------------------------------------
Named Entity Recognition (NER) is the task of classifying tokens according to a class, for example identifying a
token as a person, an organisation or a location.
An example of a named entity recognition dataset is the CoNLL-2003 dataset, which is entirely based on that task.
If you would like to fine-tune a model on an NER task, you may leverage the `ner/run_ner.py` (PyTorch),
`ner/run_pl_ner.py` (leveraging pytorch-lightning) or the `ner/run_tf_ner.py` (TensorFlow) scripts.
Here is an example using the pipelines do to named entity recognition, trying to identify tokens as belonging to one
of 9 classes:
- O, Outside of a named entity
- B-MIS, Beginning of a miscellaneous entity right after another miscellaneous entity
- I-MIS, Miscellaneous entity
- B-PER, Beginning of a person's name right after another person's name
- I-PER, Person's name
- B-ORG, Beginning of an organisation right after another organisation
- I-ORG, Organisation
- B-LOC, Beginning of a location right after another location
- I-LOC, Location
It leverages a fine-tuned model on CoNLL-2003, fine-tuned by `@stefan-it <https://github.com/stefan-it>`__ from
`dbmdz <https://github.com/dbmdz>`__.
::
from transformers import pipeline
nlp = pipeline("ner")
sequence = "Hugging Face Inc. is a company based in New York City. Its headquarters are in DUMBO, therefore very" \
"close to the Manhattan Bridge which is visible from the window."
print(nlp(sequence))
This outputs a list of all words that have been identified as an entity from the 9 classes defined above. Here is the
expected results:
::
[
{'word': 'Hu', 'score': 0.9995632767677307, 'entity': 'I-ORG'},
{'word': '##gging', 'score': 0.9915938973426819, 'entity': 'I-ORG'},
{'word': 'Face', 'score': 0.9982671737670898, 'entity': 'I-ORG'},
{'word': 'Inc', 'score': 0.9994403719902039, 'entity': 'I-ORG'},
{'word': 'New', 'score': 0.9994346499443054, 'entity': 'I-LOC'},
{'word': 'York', 'score': 0.9993270635604858, 'entity': 'I-LOC'},
{'word': 'City', 'score': 0.9993864893913269, 'entity': 'I-LOC'},
{'word': 'D', 'score': 0.9825621843338013, 'entity': 'I-LOC'},
{'word': '##UM', 'score': 0.936983048915863, 'entity': 'I-LOC'},
{'word': '##BO', 'score': 0.8987102508544922, 'entity': 'I-LOC'},
{'word': 'Manhattan', 'score': 0.9758241176605225, 'entity': 'I-LOC'},
{'word': 'Bridge', 'score': 0.990249514579773, 'entity': 'I-LOC'}
]
Note how the words "Hugging Face" have been identified as an organisation, and "New York City", "DUMBO" and
"Manhattan Bridge" have been identified as locations.
Here is an example doing named entity recognition using a model and a tokenizer. The process is the following:
- Instantiate a tokenizer and a model from the checkpoint name. The model is identified as a BERT model and
loads it with the weights stored in the checkpoint.
- Define the label list with which the model was trained on.
- Define a sequence with known entities, such as "Hugging Face" as an organisation and "New York City" as a location.
- Split words into tokens so that they can be mapped to the predictions. We use a small hack by firstly completely
encoding and decoding the sequence, so that we're left with a string that contains the special tokens.
- Encode that sequence into IDs (special tokens are added automatically).
- Retrieve the predictions by passing the input to the model and getting the first output. This results in a
distribution over the 9 possible classes for each token. We take the argmax to retrieve the most likely class
for each token.
- Zip together each token with its prediction and print it.
::
## PYTORCH CODE
from transformers import AutoModelForTokenClassification, AutoTokenizer
import torch
model = AutoModelForTokenClassification.from_pretrained("dbmdz/bert-large-cased-finetuned-conll03-english")
tokenizer = AutoTokenizer.from_pretrained("bert-base-cased")
label_list = [
"O", # Outside of a named entity
"B-MISC", # Beginning of a miscellaneous entity right after another miscellaneous entity
"I-MISC", # Miscellaneous entity
"B-PER", # Beginning of a person's name right after another person's name
"I-PER", # Person's name
"B-ORG", # Beginning of an organisation right after another organisation
"I-ORG", # Organisation
"B-LOC", # Beginning of a location right after another location
"I-LOC" # Location
]
sequence = "Hugging Face Inc. is a company based in New York City. Its headquarters are in DUMBO, therefore very" \
"close to the Manhattan Bridge."
# Bit of a hack to get the tokens with the special tokens
tokens = tokenizer.tokenize(tokenizer.decode(tokenizer.encode(sequence)))
inputs = tokenizer.encode(sequence, return_tensors="pt")
outputs = model(inputs)[0]
predictions = torch.argmax(outputs, dim=2)
print([(token, label_list[prediction]) for token, prediction in zip(tokens, predictions[0].tolist())])
## TENSORFLOW CODE
from transformers import TFAutoModelForTokenClassification, AutoTokenizer
import tensorflow as tf
model = TFAutoModelForTokenClassification.from_pretrained("dbmdz/bert-large-cased-finetuned-conll03-english")
tokenizer = AutoTokenizer.from_pretrained("bert-base-cased")
label_list = [
"O", # Outside of a named entity
"B-MISC", # Beginning of a miscellaneous entity right after another miscellaneous entity
"I-MISC", # Miscellaneous entity
"B-PER", # Beginning of a person's name right after another person's name
"I-PER", # Person's name
"B-ORG", # Beginning of an organisation right after another organisation
"I-ORG", # Organisation
"B-LOC", # Beginning of a location right after another location
"I-LOC" # Location
]
sequence = "Hugging Face Inc. is a company based in New York City. Its headquarters are in DUMBO, therefore very" \
"close to the Manhattan Bridge."
# Bit of a hack to get the tokens with the special tokens
tokens = tokenizer.tokenize(tokenizer.decode(tokenizer.encode(sequence)))
inputs = tokenizer.encode(sequence, return_tensors="tf")
outputs = model(inputs)[0]
predictions = tf.argmax(outputs, axis=2)
print([(token, label_list[prediction]) for token, prediction in zip(tokens, predictions[0].numpy())])
This outputs a list of each token mapped to their prediction. Differently from the pipeline, here every token has
a prediction as we didn't remove the "0" class which means that no particular entity was found on that token. The
following array should be the output:
::
[('[CLS]', 'O'), ('Hu', 'I-ORG'), ('##gging', 'I-ORG'), ('Face', 'I-ORG'), ('Inc', 'I-ORG'), ('.', 'O'), ('is', 'O'), ('a', 'O'), ('company', 'O'), ('based', 'O'), ('in', 'O'), ('New', 'I-LOC'), ('York', 'I-LOC'), ('City', 'I-LOC'), ('.', 'O'), ('Its', 'O'), ('headquarters', 'O'), ('are', 'O'), ('in', 'O'), ('D', 'I-LOC'), ('##UM', 'I-LOC'), ('##BO', 'I-LOC'), (',', 'O'), ('therefore', 'O'), ('very', 'O'), ('##c', 'O'), ('##lose', 'O'), ('to', 'O'), ('the', 'O'), ('Manhattan', 'I-LOC'), ('Bridge', 'I-LOC'), ('.', 'O'), ('[SEP]', 'O')]
Summarization
----------------------------------------------------
Summarization is the task of summarizing a text / an article into a shorter text.
An example of a summarization dataset is the CNN / Daily Mail dataset, which consists of long news articles and was created for the task of summarization.
If you would like to fine-tune a model on a summarization task, you may leverage the ``examples/summarization/bart/run_train.sh`` (leveraging pytorch-lightning) script.
Here is an example using the pipelines do to summarization.
It leverages a Bart model that was fine-tuned on the CNN / Daily Mail data set.
::
from transformers import pipeline
summarizer = pipeline("summarization")
ARTICLE = """ New York (CNN)When Liana Barrientos was 23 years old, she got married in Westchester County, New York.
A year later, she got married again in Westchester County, but to a different man and without divorcing her first husband.
Only 18 days after that marriage, she got hitched yet again. Then, Barrientos declared "I do" five more times, sometimes only within two weeks of each other.
In 2010, she married once more, this time in the Bronx. In an application for a marriage license, she stated it was her "first and only" marriage.
Barrientos, now 39, is facing two criminal counts of "offering a false instrument for filing in the first degree," referring to her false statements on the
2010 marriage license application, according to court documents.
Prosecutors said the marriages were part of an immigration scam.
On Friday, she pleaded not guilty at State Supreme Court in the Bronx, according to her attorney, Christopher Wright, who declined to comment further.
After leaving court, Barrientos was arrested and charged with theft of service and criminal trespass for allegedly sneaking into the New York subway through an emergency exit, said Detective
Annette Markowski, a police spokeswoman. In total, Barrientos has been married 10 times, with nine of her marriages occurring between 1999 and 2002.
All occurred either in Westchester County, Long Island, New Jersey or the Bronx. She is believed to still be married to four men, and at one time, she was married to eight men at once, prosecutors say.
Prosecutors said the immigration scam involved some of her husbands, who filed for permanent residence status shortly after the marriages.
Any divorces happened only after such filings were approved. It was unclear whether any of the men will be prosecuted.
The case was referred to the Bronx District Attorney\'s Office by Immigration and Customs Enforcement and the Department of Homeland Security\'s
Investigation Division. Seven of the men are from so-called "red-flagged" countries, including Egypt, Turkey, Georgia, Pakistan and Mali.
Her eighth husband, Rashid Rajput, was deported in 2006 to his native Pakistan after an investigation by the Joint Terrorism Task Force.
If convicted, Barrientos faces up to four years in prison. Her next court appearance is scheduled for May 18.
"""
print(summarizer(ARTICLE, max_length=130, min_length=30))
Because the summarization pipeline depends on the ``PretrainedModel.generate()`` method, we can override the default arguments
of ``PretrainedModel.generate()`` directly in the pipeline as is shown for ``max_length`` and ``min_length`` above.
This outputs the following summary:
::
Liana Barrientos has been married 10 times, sometimes within two weeks of each other. Prosecutors say the marriages were part of an immigration scam. She pleaded not guilty at State Supreme Court in the Bronx on Friday.
Here is an example doing summarization using a model and a tokenizer. The process is the following:
- Instantiate a tokenizer and a model from the checkpoint name. Summarization is usually done using an encoder-decoder model, such as ``Bart`` or ``T5``.
- Define the article that should be summarizaed.
- Leverage the ``PretrainedModel.generate()`` method.
- Add the T5 specific prefix "summarize: ".
Here Google`s T5 model is used that was only pre-trained on a multi-task mixed data set (including CNN / Daily Mail), but nevertheless yields very good results.
::
## PYTORCH CODE
from transformers import AutoModelWithLMHead, AutoTokenizer
model = AutoModelWithLMHead.from_pretrained("t5-base")
tokenizer = AutoTokenizer.from_pretrained("t5-base")
# T5 uses a max_length of 512 so we cut the article to 512 tokens.
inputs = tokenizer.encode("summarize: " + ARTICLE, return_tensors="pt", max_length=512)
outputs = model.generate(inputs, max_length=150, min_length=40, length_penalty=2.0, num_beams=4, early_stopping=True)
print(outputs)
## TENSORFLOW CODE
from transformers import TFAutoModelWithLMHead, AutoTokenizer
model = TFAutoModelWithLMHead.from_pretrained("t5-base")
tokenizer = AutoTokenizer.from_pretrained("t5-base")
# T5 uses a max_length of 512 so we cut the article to 512 tokens.
inputs = tokenizer.encode("summarize: " + ARTICLE, return_tensors="tf", max_length=512)
outputs = model.generate(inputs, max_length=150, min_length=40, length_penalty=2.0, num_beams=4, early_stopping=True)
print(outputs)
Translation
----------------------------------------------------
Translation is the task of translating a text from one language to another.
An example of a translation dataset is the WMT English to German dataset, which has English sentences as the input data
and German sentences as the target data.
Here is an example using the pipelines do to translation.
It leverages a T5 model that was only pre-trained on a multi-task mixture dataset (including WMT), but yields impressive
translation results nevertheless.
::
from transformers import pipeline
translator = pipeline("translation_en_to_de")
print(translator("Hugging Face is a technology company based in New York and Paris", max_length=40))
Because the translation pipeline depends on the ``PretrainedModel.generate()`` method, we can override the default arguments
of ``PretrainedModel.generate()`` directly in the pipeline as is shown for ``max_length`` above.
This outputs the following translation into German:
::
Hugging Face ist ein Technologieunternehmen mit Sitz in New York und Paris.
Here is an example doing translation using a model and a tokenizer. The process is the following:
- Instantiate a tokenizer and a model from the checkpoint name. Summarization is usually done using an encoder-decoder model, such as ``Bart`` or ``T5``.
- Define the article that should be summarizaed.
- Leverage the ``PretrainedModel.generate()`` method.
- Add the T5 specific prefix "translate English to German: "
::
## PYTORCH CODE
from transformers import AutoModelWithLMHead, AutoTokenizer
model = AutoModelWithLMHead.from_pretrained("t5-base")
tokenizer = AutoTokenizer.from_pretrained("t5-base")
inputs = tokenizer.encode("translate English to German: Hugging Face is a technology company based in New York and Paris", return_tensors="pt")
outputs = model.generate(inputs, max_length=40, num_beams=4, early_stopping=True)
print(outputs)
## TENSORFLOW CODE
from transformers import TFAutoModelWithLMHead, AutoTokenizer
model = TFAutoModelWithLMHead.from_pretrained("t5-base")
tokenizer = AutoTokenizer.from_pretrained("t5-base")
inputs = tokenizer.encode("translate English to German: Hugging Face is a technology company based in New York and Paris", return_tensors="tf")
outputs = model.generate(inputs, max_length=40, num_beams=4, early_stopping=True)
print(outputs)

View File

@@ -3,641 +3,27 @@
In this section a few examples are put together. All of these examples work for several models, making use of the very
similar API between the different models.
**Important**
To run the latest versions of the examples, you have to install from source. Execute the following steps in a new virtual environment:
**Important**
To run the latest versions of the examples, you have to install from source and install some specific requirements for the examples.
Execute the following steps in a new virtual environment:
```bash
git clone https://github.com/huggingface/transformers
cd transformers
pip install [--editable] .
pip install .
pip install -r ./examples/requirements.txt
```
| Section | Description |
|----------------------------|------------------------------------------------------------------------------------------------------------------------------------------------------------|
| [TensorFlow 2.0 models on GLUE](#TensorFlow-2.0-Bert-models-on-GLUE) | Examples running BERT TensorFlow 2.0 model on the GLUE tasks.
| [Language Model fine-tuning](#language-model-fine-tuning) | Fine-tuning the library models for language modeling on a text dataset. Causal language modeling for GPT/GPT-2, masked language modeling for BERT/RoBERTa. |
| [Language Generation](#language-generation) | Conditional text generation using the auto-regressive models of the library: GPT, GPT-2, Transformer-XL and XLNet. |
| [GLUE](#glue) | Examples running BERT/XLM/XLNet/RoBERTa on the 9 GLUE tasks. Examples feature distributed training as well as half-precision. |
| [SQuAD](#squad) | Using BERT/RoBERTa/XLNet/XLM for question answering, examples with distributed training. |
| [Multiple Choice](#multiple-choice) | Examples running BERT/XLNet/RoBERTa on the SWAG/RACE/ARC tasks.
| [Named Entity Recognition](#named-entity-recognition) | Using BERT for Named Entity Recognition (NER) on the CoNLL 2003 dataset, examples with distributed training. |
|----------------------------|-----------------------------------------------------
| [TensorFlow 2.0 models on GLUE](#TensorFlow-2.0-Bert-models-on-GLUE) | Examples running BERT TensorFlow 2.0 model on the GLUE tasks. |
| [Running on TPUs](#running-on-tpus) | Examples on running fine-tuning tasks on Google TPUs to accelerate workloads. |
| [Language Model training](#language-model-training) | Fine-tuning (or training from scratch) the library models for language modeling on a text dataset. Causal language modeling for GPT/GPT-2, masked language modeling for BERT/RoBERTa. |
| [Language Generation](#language-generation) | Conditional text generation using the auto-regressive models of the library: GPT, GPT-2, Transformer-XL and XLNet. |
| [GLUE](#glue) | Examples running BERT/XLM/XLNet/RoBERTa on the 9 GLUE tasks. Examples feature distributed training as well as half-precision. |
| [SQuAD](#squad) | Using BERT/RoBERTa/XLNet/XLM for question answering, examples with distributed training. |
| [Multiple Choice](#multiple-choice) | Examples running BERT/XLNet/RoBERTa on the SWAG/RACE/ARC tasks. |
| [Named Entity Recognition](https://github.com/huggingface/transformers/tree/master/examples/ner) | Using BERT for Named Entity Recognition (NER) on the CoNLL 2003 dataset, examples with distributed training. |
| [XNLI](#xnli) | Examples running BERT/XLM on the XNLI benchmark. |
| [Abstractive summarization](#abstractive-summarization) | Fine-tuning the library models for abstractive summarization tasks on the CNN/Daily Mail dataset. |
| [Adversarial evaluation of model performances](#adversarial-evaluation-of-model-performances) | Testing a model with adversarial evaluation of natural language inference on the Heuristic Analysis for NLI Systems (HANS) dataset (McCoy et al., 2019.) |
## TensorFlow 2.0 Bert models on GLUE
Based on the script [`run_tf_glue.py`](https://github.com/huggingface/transformers/blob/master/examples/run_tf_glue.py).
Fine-tuning the library TensorFlow 2.0 Bert model for sequence classification on the MRPC task of the GLUE benchmark: [General Language Understanding Evaluation](https://gluebenchmark.com/).
This script has an option for mixed precision (Automatic Mixed Precision / AMP) to run models on Tensor Cores (NVIDIA Volta/Turing GPUs) and future hardware and an option for XLA, which uses the XLA compiler to reduce model runtime.
Options are toggled using `USE_XLA` or `USE_AMP` variables in the script.
These options and the below benchmark are provided by @tlkh.
Quick benchmarks from the script (no other modifications):
| GPU | Mode | Time (2nd epoch) | Val Acc (3 runs) |
| --------- | -------- | ----------------------- | ----------------------|
| Titan V | FP32 | 41s | 0.8438/0.8281/0.8333 |
| Titan V | AMP | 26s | 0.8281/0.8568/0.8411 |
| V100 | FP32 | 35s | 0.8646/0.8359/0.8464 |
| V100 | AMP | 22s | 0.8646/0.8385/0.8411 |
| 1080 Ti | FP32 | 55s | - |
Mixed precision (AMP) reduces the training time considerably for the same hardware and hyper-parameters (same batch size was used).
## Language model fine-tuning
Based on the script [`run_lm_finetuning.py`](https://github.com/huggingface/transformers/blob/master/examples/run_lm_finetuning.py).
Fine-tuning the library models for language modeling on a text dataset for GPT, GPT-2, BERT and RoBERTa (DistilBERT
to be added soon). GPT and GPT-2 are fine-tuned using a causal language modeling (CLM) loss while BERT and RoBERTa
are fine-tuned using a masked language modeling (MLM) loss.
Before running the following example, you should get a file that contains text on which the language model will be
fine-tuned. A good example of such text is the [WikiText-2 dataset](https://blog.einstein.ai/the-wikitext-long-term-dependency-language-modeling-dataset/).
We will refer to two different files: `$TRAIN_FILE`, which contains text for training, and `$TEST_FILE`, which contains
text that will be used for evaluation.
### GPT-2/GPT and causal language modeling
The following example fine-tunes GPT-2 on WikiText-2. We're using the raw WikiText-2 (no tokens were replaced before
the tokenization). The loss here is that of causal language modeling.
```bash
export TRAIN_FILE=/path/to/dataset/wiki.train.raw
export TEST_FILE=/path/to/dataset/wiki.test.raw
python run_lm_finetuning.py \
--output_dir=output \
--model_type=gpt2 \
--model_name_or_path=gpt2 \
--do_train \
--train_data_file=$TRAIN_FILE \
--do_eval \
--eval_data_file=$TEST_FILE
```
This takes about half an hour to train on a single K80 GPU and about one minute for the evaluation to run. It reaches
a score of ~20 perplexity once fine-tuned on the dataset.
### RoBERTa/BERT and masked language modeling
The following example fine-tunes RoBERTa on WikiText-2. Here too, we're using the raw WikiText-2. The loss is different
as BERT/RoBERTa have a bidirectional mechanism; we're therefore using the same loss that was used during their
pre-training: masked language modeling.
In accordance to the RoBERTa paper, we use dynamic masking rather than static masking. The model may, therefore, converge
slightly slower (over-fitting takes more epochs).
We use the `--mlm` flag so that the script may change its loss function.
```bash
export TRAIN_FILE=/path/to/dataset/wiki.train.raw
export TEST_FILE=/path/to/dataset/wiki.test.raw
python run_lm_finetuning.py \
--output_dir=output \
--model_type=roberta \
--model_name_or_path=roberta-base \
--do_train \
--train_data_file=$TRAIN_FILE \
--do_eval \
--eval_data_file=$TEST_FILE \
--mlm
```
## Language generation
Based on the script [`run_generation.py`](https://github.com/huggingface/transformers/blob/master/examples/run_generation.py).
Conditional text generation using the auto-regressive models of the library: GPT, GPT-2, Transformer-XL, XLNet, CTRL.
A similar script is used for our official demo [Write With Transfomer](https://transformer.huggingface.co), where you
can try out the different models available in the library.
Example usage:
```bash
python run_generation.py \
--model_type=gpt2 \
--model_name_or_path=gpt2
```
## GLUE
Based on the script [`run_glue.py`](https://github.com/huggingface/transformers/blob/master/examples/run_glue.py).
Fine-tuning the library models for sequence classification on the GLUE benchmark: [General Language Understanding
Evaluation](https://gluebenchmark.com/). This script can fine-tune the following models: BERT, XLM, XLNet and RoBERTa.
GLUE is made up of a total of 9 different tasks. We get the following results on the dev set of the benchmark with an
uncased BERT base model (the checkpoint `bert-base-uncased`). All experiments ran on 8 V100 GPUs with a total train
batch size of 24. Some of these tasks have a small dataset and training can lead to high variance in the results
between different runs. We report the median on 5 runs (with different seeds) for each of the metrics.
| Task | Metric | Result |
|-------|------------------------------|-------------|
| CoLA | Matthew's corr | 48.87 |
| SST-2 | Accuracy | 91.74 |
| MRPC | F1/Accuracy | 90.70/86.27 |
| STS-B | Person/Spearman corr. | 91.39/91.04 |
| QQP | Accuracy/F1 | 90.79/87.66 |
| MNLI | Matched acc./Mismatched acc. | 83.70/84.83 |
| QNLI | Accuracy | 89.31 |
| RTE | Accuracy | 71.43 |
| WNLI | Accuracy | 43.66 |
Some of these results are significantly different from the ones reported on the test set
of GLUE benchmark on the website. For QQP and WNLI, please refer to [FAQ #12](https://gluebenchmark.com/faq) on the webite.
Before running anyone of these GLUE tasks you should download the
[GLUE data](https://gluebenchmark.com/tasks) by running
[this script](https://gist.github.com/W4ngatang/60c2bdb54d156a41194446737ce03e2e)
and unpack it to some directory `$GLUE_DIR`.
```bash
export GLUE_DIR=/path/to/glue
export TASK_NAME=MRPC
python run_glue.py \
--model_type bert \
--model_name_or_path bert-base-cased \
--task_name $TASK_NAME \
--do_train \
--do_eval \
--do_lower_case \
--data_dir $GLUE_DIR/$TASK_NAME \
--max_seq_length 128 \
--per_gpu_train_batch_size 32 \
--learning_rate 2e-5 \
--num_train_epochs 3.0 \
--output_dir /tmp/$TASK_NAME/
```
where task name can be one of CoLA, SST-2, MRPC, STS-B, QQP, MNLI, QNLI, RTE, WNLI.
The dev set results will be present within the text file `eval_results.txt` in the specified output_dir.
In case of MNLI, since there are two separate dev sets (matched and mismatched), there will be a separate
output folder called `/tmp/MNLI-MM/` in addition to `/tmp/MNLI/`.
The code has not been tested with half-precision training with apex on any GLUE task apart from MRPC, MNLI,
CoLA, SST-2. The following section provides details on how to run half-precision training with MRPC. With that being
said, there shouldnt be any issues in running half-precision training with the remaining GLUE tasks as well,
since the data processor for each task inherits from the base class DataProcessor.
### MRPC
#### Fine-tuning example
The following examples fine-tune BERT on the Microsoft Research Paraphrase Corpus (MRPC) corpus and runs in less
than 10 minutes on a single K-80 and in 27 seconds (!) on single tesla V100 16GB with apex installed.
Before running anyone of these GLUE tasks you should download the
[GLUE data](https://gluebenchmark.com/tasks) by running
[this script](https://gist.github.com/W4ngatang/60c2bdb54d156a41194446737ce03e2e)
and unpack it to some directory `$GLUE_DIR`.
```bash
export GLUE_DIR=/path/to/glue
python run_glue.py \
--model_type bert \
--model_name_or_path bert-base-cased \
--task_name MRPC \
--do_train \
--do_eval \
--do_lower_case \
--data_dir $GLUE_DIR/MRPC/ \
--max_seq_length 128 \
--per_gpu_train_batch_size 32 \
--learning_rate 2e-5 \
--num_train_epochs 3.0 \
--output_dir /tmp/mrpc_output/
```
Our test ran on a few seeds with [the original implementation hyper-
parameters](https://github.com/google-research/bert#sentence-and-sentence-pair-classification-tasks) gave evaluation
results between 84% and 88%.
#### Using Apex and mixed-precision
Using Apex and 16 bit precision, the fine-tuning on MRPC only takes 27 seconds. First install
[apex](https://github.com/NVIDIA/apex), then run the following example:
```bash
export GLUE_DIR=/path/to/glue
python run_glue.py \
--model_type bert \
--model_name_or_path bert-base-cased \
--task_name MRPC \
--do_train \
--do_eval \
--do_lower_case \
--data_dir $GLUE_DIR/MRPC/ \
--max_seq_length 128 \
--per_gpu_train_batch_size 32 \
--learning_rate 2e-5 \
--num_train_epochs 3.0 \
--output_dir /tmp/mrpc_output/ \
--fp16
```
#### Distributed training
Here is an example using distributed training on 8 V100 GPUs. The model used is the BERT whole-word-masking and it
reaches F1 > 92 on MRPC.
```bash
export GLUE_DIR=/path/to/glue
python -m torch.distributed.launch \
--nproc_per_node 8 run_glue.py \
--model_type bert \
--model_name_or_path bert-base-cased \
--task_name MRPC \
--do_train \
--do_eval \
--do_lower_case \
--data_dir $GLUE_DIR/MRPC/ \
--max_seq_length 128 \
--per_gpu_train_batch_size 8 \
--learning_rate 2e-5 \
--num_train_epochs 3.0 \
--output_dir /tmp/mrpc_output/
```
Training with these hyper-parameters gave us the following results:
```bash
acc = 0.8823529411764706
acc_and_f1 = 0.901702786377709
eval_loss = 0.3418912578906332
f1 = 0.9210526315789473
global_step = 174
loss = 0.07231863956341798
```
### MNLI
The following example uses the BERT-large, uncased, whole-word-masking model and fine-tunes it on the MNLI task.
```bash
export GLUE_DIR=/path/to/glue
python -m torch.distributed.launch \
--nproc_per_node 8 run_glue.py \
--model_type bert \
--model_name_or_path bert-base-cased \
--task_name mnli \
--do_train \
--do_eval \
--do_lower_case \
--data_dir $GLUE_DIR/MNLI/ \
--max_seq_length 128 \
--per_gpu_train_batch_size 8 \
--learning_rate 2e-5 \
--num_train_epochs 3.0 \
--output_dir output_dir \
```
The results are the following:
```bash
***** Eval results *****
acc = 0.8679706601466992
eval_loss = 0.4911287787382479
global_step = 18408
loss = 0.04755385363816904
***** Eval results *****
acc = 0.8747965825874695
eval_loss = 0.45516540421714036
global_step = 18408
loss = 0.04755385363816904
```
## Multiple Choice
Based on the script [`run_multiple_choice.py`]().
#### Fine-tuning on SWAG
Download [swag](https://github.com/rowanz/swagaf/tree/master/data) data
```bash
#training on 4 tesla V100(16GB) GPUS
export SWAG_DIR=/path/to/swag_data_dir
python ./examples/run_multiple_choice.py \
--model_type roberta \
--task_name swag \
--model_name_or_path roberta-base \
--do_train \
--do_eval \
--do_lower_case \
--data_dir $SWAG_DIR \
--learning_rate 5e-5 \
--num_train_epochs 3 \
--max_seq_length 80 \
--output_dir models_bert/swag_base \
--per_gpu_eval_batch_size=16 \
--per_gpu_train_batch_size=16 \
--gradient_accumulation_steps 2 \
--overwrite_output
```
Training with the defined hyper-parameters yields the following results:
```
***** Eval results *****
eval_acc = 0.8338998300509847
eval_loss = 0.44457291918821606
```
## SQuAD
Based on the script [`run_squad.py`](https://github.com/huggingface/transformers/blob/master/examples/run_squad.py).
#### Fine-tuning on SQuAD
This example code fine-tunes BERT on the SQuAD dataset. It runs in 24 min (with BERT-base) or 68 min (with BERT-large)
on a single tesla V100 16GB. The data for SQuAD can be downloaded with the following links and should be saved in a
$SQUAD_DIR directory.
* [train-v1.1.json](https://rajpurkar.github.io/SQuAD-explorer/dataset/train-v1.1.json)
* [dev-v1.1.json](https://rajpurkar.github.io/SQuAD-explorer/dataset/dev-v1.1.json)
* [evaluate-v1.1.py](https://github.com/allenai/bi-att-flow/blob/master/squad/evaluate-v1.1.py)
```bash
export SQUAD_DIR=/path/to/SQUAD
python run_squad.py \
--model_type bert \
--model_name_or_path bert-base-cased \
--do_train \
--do_eval \
--do_lower_case \
--train_file $SQUAD_DIR/train-v1.1.json \
--predict_file $SQUAD_DIR/dev-v1.1.json \
--per_gpu_train_batch_size 12 \
--learning_rate 3e-5 \
--num_train_epochs 2.0 \
--max_seq_length 384 \
--doc_stride 128 \
--output_dir /tmp/debug_squad/
```
Training with the previously defined hyper-parameters yields the following results:
```bash
f1 = 88.52
exact_match = 81.22
```
#### Distributed training
Here is an example using distributed training on 8 V100 GPUs and Bert Whole Word Masking uncased model to reach a F1 > 93 on SQuAD:
```bash
python -m torch.distributed.launch --nproc_per_node=8 run_squad.py \
--model_type bert \
--model_name_or_path bert-base-cased \
--do_train \
--do_eval \
--do_lower_case \
--train_file $SQUAD_DIR/train-v1.1.json \
--predict_file $SQUAD_DIR/dev-v1.1.json \
--learning_rate 3e-5 \
--num_train_epochs 2 \
--max_seq_length 384 \
--doc_stride 128 \
--output_dir ../models/wwm_uncased_finetuned_squad/ \
--per_gpu_train_batch_size 24 \
--gradient_accumulation_steps 12
```
Training with the previously defined hyper-parameters yields the following results:
```bash
f1 = 93.15
exact_match = 86.91
```
This fine-tuned model is available as a checkpoint under the reference
`bert-large-uncased-whole-word-masking-finetuned-squad`.
#### Fine-tuning XLNet on SQuAD
This example code fine-tunes XLNet on the SQuAD dataset. See above to download the data for SQuAD .
```bash
export SQUAD_DIR=/path/to/SQUAD
python /data/home/hlu/transformers/examples/run_squad.py \
--model_type xlnet \
--model_name_or_path xlnet-large-cased \
--do_train \
--do_eval \
--do_lower_case \
--train_file /data/home/hlu/notebooks/NLP/examples/question_answering/train-v1.1.json \
--predict_file /data/home/hlu/notebooks/NLP/examples/question_answering/dev-v1.1.json \
--learning_rate 3e-5 \
--num_train_epochs 2 \
--max_seq_length 384 \
--doc_stride 128 \
--output_dir ./wwm_cased_finetuned_squad/ \
--per_gpu_eval_batch_size=4 \
--per_gpu_train_batch_size=4 \
--save_steps 5000
```
Training with the previously defined hyper-parameters yields the following results:
```python
{
"exact": 85.45884578997162,
"f1": 92.5974600601065,
"total": 10570,
"HasAns_exact": 85.45884578997162,
"HasAns_f1": 92.59746006010651,
"HasAns_total": 10570
}
```
## Named Entity Recognition
Based on the script [`run_ner.py`](https://github.com/huggingface/transformers/blob/master/examples/run_ner.py).
This example fine-tune Bert Multilingual on GermEval 2014 (German NER).
Details and results for the fine-tuning provided by @stefan-it.
### Data (Download and pre-processing steps)
Data can be obtained from the [GermEval 2014](https://sites.google.com/site/germeval2014ner/data) shared task page.
Here are the commands for downloading and pre-processing train, dev and test datasets. The original data format has four (tab-separated) columns, in a pre-processing step only the two relevant columns (token and outer span NER annotation) are extracted:
```bash
curl -L 'https://sites.google.com/site/germeval2014ner/data/NER-de-train.tsv?attredirects=0&d=1' \
| grep -v "^#" | cut -f 2,3 | tr '\t' ' ' > train.txt.tmp
curl -L 'https://sites.google.com/site/germeval2014ner/data/NER-de-dev.tsv?attredirects=0&d=1' \
| grep -v "^#" | cut -f 2,3 | tr '\t' ' ' > dev.txt.tmp
curl -L 'https://sites.google.com/site/germeval2014ner/data/NER-de-test.tsv?attredirects=0&d=1' \
| grep -v "^#" | cut -f 2,3 | tr '\t' ' ' > test.txt.tmp
```
The GermEval 2014 dataset contains some strange "control character" tokens like `'\x96', '\u200e', '\x95', '\xad' or '\x80'`. One problem with these tokens is, that `BertTokenizer` returns an empty token for them, resulting in misaligned `InputExample`s. I wrote a script that a) filters these tokens and b) splits longer sentences into smaller ones (once the max. subtoken length is reached).
```bash
wget "https://raw.githubusercontent.com/stefan-it/fine-tuned-berts-seq/master/scripts/preprocess.py"
```
Let's define some variables that we need for further pre-processing steps and training the model:
```bash
export MAX_LENGTH=128
export BERT_MODEL=bert-base-multilingual-cased
```
Run the pre-processing script on training, dev and test datasets:
```bash
python3 preprocess.py train.txt.tmp $BERT_MODEL $MAX_LENGTH > train.txt
python3 preprocess.py dev.txt.tmp $BERT_MODEL $MAX_LENGTH > dev.txt
python3 preprocess.py test.txt.tmp $BERT_MODEL $MAX_LENGTH > test.txt
```
The GermEval 2014 dataset has much more labels than CoNLL-2002/2003 datasets, so an own set of labels must be used:
```bash
cat train.txt dev.txt test.txt | cut -d " " -f 2 | grep -v "^$"| sort | uniq > labels.txt
```
### Training
Additional environment variables must be set:
```bash
export OUTPUT_DIR=germeval-model
export BATCH_SIZE=32
export NUM_EPOCHS=3
export SAVE_STEPS=750
export SEED=1
```
To start training, just run:
```bash
python3 run_ner.py --data_dir ./ \
--model_type bert \
--labels ./labels.txt \
--model_name_or_path $BERT_MODEL \
--output_dir $OUTPUT_DIR \
--max_seq_length $MAX_LENGTH \
--num_train_epochs $NUM_EPOCHS \
--per_gpu_train_batch_size $BATCH_SIZE \
--save_steps $SAVE_STEPS \
--seed $SEED \
--do_train \
--do_eval \
--do_predict
```
If your GPU supports half-precision training, just add the `--fp16` flag. After training, the model will be both evaluated on development and test datasets.
### Evaluation
Evaluation on development dataset outputs the following for our example:
```bash
10/04/2019 00:42:06 - INFO - __main__ - ***** Eval results *****
10/04/2019 00:42:06 - INFO - __main__ - f1 = 0.8623348017621146
10/04/2019 00:42:06 - INFO - __main__ - loss = 0.07183869666975543
10/04/2019 00:42:06 - INFO - __main__ - precision = 0.8467916366258111
10/04/2019 00:42:06 - INFO - __main__ - recall = 0.8784592370979806
```
On the test dataset the following results could be achieved:
```bash
10/04/2019 00:42:42 - INFO - __main__ - ***** Eval results *****
10/04/2019 00:42:42 - INFO - __main__ - f1 = 0.8614389652384803
10/04/2019 00:42:42 - INFO - __main__ - loss = 0.07064602487454782
10/04/2019 00:42:42 - INFO - __main__ - precision = 0.8604651162790697
10/04/2019 00:42:42 - INFO - __main__ - recall = 0.8624150210424085
```
### Comparing BERT (large, cased), RoBERTa (large, cased) and DistilBERT (base, uncased)
Here is a small comparison between BERT (large, cased), RoBERTa (large, cased) and DistilBERT (base, uncased) with the same hyperparameters as specified in the [example documentation](https://huggingface.co/transformers/examples.html#named-entity-recognition) (one run):
| Model | F-Score Dev | F-Score Test
| --------------------------------- | ------- | --------
| `bert-large-cased` | 95.59 | 91.70
| `roberta-large` | 95.96 | 91.87
| `distilbert-base-uncased` | 94.34 | 90.32
## Abstractive summarization
Based on the script
[`run_summarization_finetuning.py`](https://github.com/huggingface/transformers/blob/master/examples/run_summarization_finetuning.py).
Before running this script you should download **both** CNN and Daily Mail
datasets from [Kyunghyun Cho's website](https://cs.nyu.edu/~kcho/DMQA/) (the
links next to "Stories") in the same folder. Then uncompress the archives by running:
```bash
tar -xvf cnn_stories.tgz && tar -xvf dailymail_stories.tgz
```
note that the finetuning script **will not work** if you do not download both
datasets. We will refer as `$DATA_PATH` the path to where you uncompressed both
archive.
```bash
export DATA_PATH=/path/to/dataset/
python run_summarization_finetuning.py \
--output_dir=output \
--model_type=bert2bert \
--model_name_or_path=bert2bert \
--do_train \
--data_path=$DATA_PATH \
```
## XNLI
Based on the script [`run_xnli.py`](https://github.com/huggingface/transformers/blob/master/examples/run_xnli.py).
[XNLI](https://www.nyu.edu/projects/bowman/xnli/) is crowd-sourced dataset based on [MultiNLI](http://www.nyu.edu/projects/bowman/multinli/). It is an evaluation benchmark for cross-lingual text representations. Pairs of text are labeled with textual entailment annotations for 15 different languages (including both high-ressource language such as English and low-ressource languages such as Swahili).
#### Fine-tuning on XNLI
This example code fine-tunes mBERT (multi-lingual BERT) on the XNLI dataset. It runs in 106 mins
on a single tesla V100 16GB. The data for XNLI can be downloaded with the following links and should be both saved (and un-zipped) in a
`$XNLI_DIR` directory.
* [XNLI 1.0](https://www.nyu.edu/projects/bowman/xnli/XNLI-1.0.zip)
* [XNLI-MT 1.0](https://www.nyu.edu/projects/bowman/xnli/XNLI-MT-1.0.zip)
```bash
export XNLI_DIR=/path/to/XNLI
python run_xnli.py \
--model_type bert \
--model_name_or_path bert-base-multilingual-cased \
--language de \
--train_language en \
--do_train \
--do_eval \
--data_dir $XNLI_DIR \
--per_gpu_train_batch_size 32 \
--learning_rate 5e-5 \
--num_train_epochs 2.0 \
--max_seq_length 128 \
--output_dir /tmp/debug_xnli/ \
--save_steps -1
```
Training with the previously defined hyper-parameters yields the following results on the **test** set:
```bash
acc = 0.7093812375249501
```

View File

@@ -0,0 +1,38 @@
## Adversarial evaluation of model performances
Here is an example on evaluating a model using adversarial evaluation of natural language inference with the Heuristic Analysis for NLI Systems (HANS) dataset [McCoy et al., 2019](https://arxiv.org/abs/1902.01007). The example was gracefully provided by [Nafise Sadat Moosavi](https://github.com/ns-moosavi).
The HANS dataset can be downloaded from [this location](https://github.com/tommccoy1/hans).
This is an example of using test_hans.py:
```bash
export HANS_DIR=path-to-hans
export MODEL_TYPE=type-of-the-model-e.g.-bert-roberta-xlnet-etc
export MODEL_PATH=path-to-the-model-directory-that-is-trained-on-NLI-e.g.-by-using-run_glue.py
python examples/hans/test_hans.py \
--task_name hans \
--model_type $MODEL_TYPE \
--do_eval \
--data_dir $HANS_DIR \
--model_name_or_path $MODEL_PATH \
--max_seq_length 128 \
--output_dir $MODEL_PATH \
```
This will create the hans_predictions.txt file in MODEL_PATH, which can then be evaluated using hans/evaluate_heur_output.py from the HANS dataset.
The results of the BERT-base model that is trained on MNLI using batch size 8 and the random seed 42 on the HANS dataset is as follows:
```bash
Heuristic entailed results:
lexical_overlap: 0.9702
subsequence: 0.9942
constituent: 0.9962
Heuristic non-entailed results:
lexical_overlap: 0.199
subsequence: 0.0396
constituent: 0.118
```

View File

@@ -0,0 +1,221 @@
# coding=utf-8
# Copyright 2018 The Google AI Language Team Authors and The HuggingFace Inc. team.
# Copyright (c) 2018, NVIDIA CORPORATION. All rights reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
""" GLUE processors and helpers """
import logging
import os
from transformers.file_utils import is_tf_available
from utils_hans import DataProcessor, InputExample, InputFeatures
if is_tf_available():
import tensorflow as tf
logger = logging.getLogger(__name__)
def hans_convert_examples_to_features(
examples,
tokenizer,
max_length=512,
task=None,
label_list=None,
output_mode=None,
pad_on_left=False,
pad_token=0,
pad_token_segment_id=0,
mask_padding_with_zero=True,
):
"""
Loads a data file into a list of ``InputFeatures``
Args:
examples: List of ``InputExamples`` or ``tf.data.Dataset`` containing the examples.
tokenizer: Instance of a tokenizer that will tokenize the examples
max_length: Maximum example length
task: HANS
label_list: List of labels. Can be obtained from the processor using the ``processor.get_labels()`` method
output_mode: String indicating the output mode. Either ``regression`` or ``classification``
pad_on_left: If set to ``True``, the examples will be padded on the left rather than on the right (default)
pad_token: Padding token
pad_token_segment_id: The segment ID for the padding token (It is usually 0, but can vary such as for XLNet where it is 4)
mask_padding_with_zero: If set to ``True``, the attention mask will be filled by ``1`` for actual values
and by ``0`` for padded values. If set to ``False``, inverts it (``1`` for padded values, ``0`` for
actual values)
Returns:
If the ``examples`` input is a ``tf.data.Dataset``, will return a ``tf.data.Dataset``
containing the task-specific features. If the input is a list of ``InputExamples``, will return
a list of task-specific ``InputFeatures`` which can be fed to the model.
"""
is_tf_dataset = False
if is_tf_available() and isinstance(examples, tf.data.Dataset):
is_tf_dataset = True
if task is not None:
processor = glue_processors[task]()
if label_list is None:
label_list = processor.get_labels()
logger.info("Using label list %s for task %s" % (label_list, task))
if output_mode is None:
output_mode = glue_output_modes[task]
logger.info("Using output mode %s for task %s" % (output_mode, task))
label_map = {label: i for i, label in enumerate(label_list)}
features = []
for (ex_index, example) in enumerate(examples):
if ex_index % 10000 == 0:
logger.info("Writing example %d" % (ex_index))
if is_tf_dataset:
example = processor.get_example_from_tensor_dict(example)
example = processor.tfds_map(example)
inputs = tokenizer.encode_plus(example.text_a, example.text_b, add_special_tokens=True, max_length=max_length,)
input_ids, token_type_ids = inputs["input_ids"], inputs["token_type_ids"]
# The mask has 1 for real tokens and 0 for padding tokens. Only real
# tokens are attended to.
attention_mask = [1 if mask_padding_with_zero else 0] * len(input_ids)
# Zero-pad up to the sequence length.
padding_length = max_length - len(input_ids)
if pad_on_left:
input_ids = ([pad_token] * padding_length) + input_ids
attention_mask = ([0 if mask_padding_with_zero else 1] * padding_length) + attention_mask
token_type_ids = ([pad_token_segment_id] * padding_length) + token_type_ids
else:
input_ids = input_ids + ([pad_token] * padding_length)
attention_mask = attention_mask + ([0 if mask_padding_with_zero else 1] * padding_length)
token_type_ids = token_type_ids + ([pad_token_segment_id] * padding_length)
assert len(input_ids) == max_length, "Error with input length {} vs {}".format(len(input_ids), max_length)
assert len(attention_mask) == max_length, "Error with input length {} vs {}".format(
len(attention_mask), max_length
)
assert len(token_type_ids) == max_length, "Error with input length {} vs {}".format(
len(token_type_ids), max_length
)
if output_mode == "classification":
label = label_map[example.label] if example.label in label_map else 0
elif output_mode == "regression":
label = float(example.label)
else:
raise KeyError(output_mode)
pairID = str(example.pairID)
if ex_index < 10:
logger.info("*** Example ***")
logger.info("text_a: %s" % (example.text_a))
logger.info("text_b: %s" % (example.text_b))
logger.info("guid: %s" % (example.guid))
logger.info("input_ids: %s" % " ".join([str(x) for x in input_ids]))
logger.info("attention_mask: %s" % " ".join([str(x) for x in attention_mask]))
logger.info("token_type_ids: %s" % " ".join([str(x) for x in token_type_ids]))
logger.info("label: %s (id = %d)" % (example.label, label))
features.append(
InputFeatures(
input_ids=input_ids,
attention_mask=attention_mask,
token_type_ids=token_type_ids,
label=label,
pairID=pairID,
)
)
if is_tf_available() and is_tf_dataset:
def gen():
for ex in features:
yield (
{
"input_ids": ex.input_ids,
"attention_mask": ex.attention_mask,
"token_type_ids": ex.token_type_ids,
},
ex.label,
)
return tf.data.Dataset.from_generator(
gen,
({"input_ids": tf.int32, "attention_mask": tf.int32, "token_type_ids": tf.int32}, tf.int64),
(
{
"input_ids": tf.TensorShape([None]),
"attention_mask": tf.TensorShape([None]),
"token_type_ids": tf.TensorShape([None]),
},
tf.TensorShape([]),
),
)
return features
class HansProcessor(DataProcessor):
"""Processor for the HANS data set."""
def get_example_from_tensor_dict(self, tensor_dict):
"""See base class."""
return InputExample(
tensor_dict["idx"].numpy(),
tensor_dict["premise"].numpy().decode("utf-8"),
tensor_dict["hypothesis"].numpy().decode("utf-8"),
str(tensor_dict["label"].numpy()),
)
def get_train_examples(self, data_dir):
"""See base class."""
return self._create_examples(self._read_tsv(os.path.join(data_dir, "heuristics_train_set.txt")), "train")
def get_dev_examples(self, data_dir):
"""See base class."""
return self._create_examples(self._read_tsv(os.path.join(data_dir, "heuristics_evaluation_set.txt")), "dev")
def get_labels(self):
"""See base class."""
return ["contradiction", "entailment", "neutral"]
def _create_examples(self, lines, set_type):
"""Creates examples for the training and dev sets."""
examples = []
for (i, line) in enumerate(lines):
if i == 0:
continue
guid = "%s-%s" % (set_type, line[0])
text_a = line[5]
text_b = line[6]
pairID = line[7][2:] if line[7].startswith("ex") else line[7]
label = line[-1]
examples.append(InputExample(guid=guid, text_a=text_a, text_b=text_b, label=label, pairID=pairID))
return examples
glue_tasks_num_labels = {
"hans": 3,
}
glue_processors = {
"hans": HansProcessor,
}
glue_output_modes = {
"hans": "classification",
}

View File

@@ -22,57 +22,64 @@ import glob
import logging
import os
import random
import json
import numpy as np
import torch
from torch.utils.data import (DataLoader, RandomSampler, SequentialSampler,
TensorDataset)
from torch.utils.data import DataLoader, RandomSampler, SequentialSampler, TensorDataset
from torch.utils.data.distributed import DistributedSampler
from tqdm import tqdm, trange
from hans_processors import glue_output_modes as output_modes
from hans_processors import glue_processors as processors
from hans_processors import hans_convert_examples_to_features as convert_examples_to_features
from transformers import (
WEIGHTS_NAME,
AdamW,
AlbertConfig,
AlbertForSequenceClassification,
AlbertTokenizer,
BertConfig,
BertForSequenceClassification,
BertTokenizer,
DistilBertConfig,
DistilBertForSequenceClassification,
DistilBertTokenizer,
RobertaConfig,
RobertaForSequenceClassification,
RobertaTokenizer,
XLMConfig,
XLMForSequenceClassification,
XLMTokenizer,
XLNetConfig,
XLNetForSequenceClassification,
XLNetTokenizer,
get_linear_schedule_with_warmup,
)
try:
from torch.utils.tensorboard import SummaryWriter
except:
except ImportError:
from tensorboardX import SummaryWriter
from tqdm import tqdm, trange
from transformers import (WEIGHTS_NAME, BertConfig,
BertForSequenceClassification, BertTokenizer,
RobertaConfig,
RobertaForSequenceClassification,
RobertaTokenizer,
XLMConfig, XLMForSequenceClassification,
XLMTokenizer, XLNetConfig,
XLNetForSequenceClassification,
XLNetTokenizer,
DistilBertConfig,
DistilBertForSequenceClassification,
DistilBertTokenizer,
AlbertConfig,
AlbertForSequenceClassification,
AlbertTokenizer,
)
from transformers import AdamW, get_linear_schedule_with_warmup
from transformers import glue_compute_metrics as compute_metrics
from transformers import glue_output_modes as output_modes
from transformers import glue_processors as processors
from transformers import glue_convert_examples_to_features as convert_examples_to_features
logger = logging.getLogger(__name__)
ALL_MODELS = sum((tuple(conf.pretrained_config_archive_map.keys()) for conf in (BertConfig, XLNetConfig, XLMConfig,
RobertaConfig, DistilBertConfig)), ())
ALL_MODELS = sum(
(
tuple(conf.pretrained_config_archive_map.keys())
for conf in (BertConfig, XLNetConfig, XLMConfig, RobertaConfig, DistilBertConfig)
),
(),
)
MODEL_CLASSES = {
'bert': (BertConfig, BertForSequenceClassification, BertTokenizer),
'xlnet': (XLNetConfig, XLNetForSequenceClassification, XLNetTokenizer),
'xlm': (XLMConfig, XLMForSequenceClassification, XLMTokenizer),
'roberta': (RobertaConfig, RobertaForSequenceClassification, RobertaTokenizer),
'distilbert': (DistilBertConfig, DistilBertForSequenceClassification, DistilBertTokenizer),
'albert': (AlbertConfig, AlbertForSequenceClassification, AlbertTokenizer)
"bert": (BertConfig, BertForSequenceClassification, BertTokenizer),
"xlnet": (XLNetConfig, XLNetForSequenceClassification, XLNetTokenizer),
"xlm": (XLMConfig, XLMForSequenceClassification, XLMTokenizer),
"roberta": (RobertaConfig, RobertaForSequenceClassification, RobertaTokenizer),
"distilbert": (DistilBertConfig, DistilBertForSequenceClassification, DistilBertTokenizer),
"albert": (AlbertConfig, AlbertForSequenceClassification, AlbertTokenizer),
}
@@ -100,14 +107,19 @@ def train(args, train_dataset, model, tokenizer):
t_total = len(train_dataloader) // args.gradient_accumulation_steps * args.num_train_epochs
# Prepare optimizer and schedule (linear warmup and decay)
no_decay = ['bias', 'LayerNorm.weight']
no_decay = ["bias", "LayerNorm.weight"]
optimizer_grouped_parameters = [
{'params': [p for n, p in model.named_parameters() if not any(nd in n for nd in no_decay)], 'weight_decay': args.weight_decay},
{'params': [p for n, p in model.named_parameters() if any(nd in n for nd in no_decay)], 'weight_decay': 0.0}
]
{
"params": [p for n, p in model.named_parameters() if not any(nd in n for nd in no_decay)],
"weight_decay": args.weight_decay,
},
{"params": [p for n, p in model.named_parameters() if any(nd in n for nd in no_decay)], "weight_decay": 0.0},
]
optimizer = AdamW(optimizer_grouped_parameters, lr=args.learning_rate, eps=args.adam_epsilon)
scheduler = get_linear_schedule_with_warmup(optimizer, num_warmup_steps=args.warmup_steps, num_training_steps=t_total)
scheduler = get_linear_schedule_with_warmup(
optimizer, num_warmup_steps=args.warmup_steps, num_training_steps=t_total
)
if args.fp16:
try:
from apex import amp
@@ -121,17 +133,21 @@ def train(args, train_dataset, model, tokenizer):
# Distributed training (should be after apex fp16 initialization)
if args.local_rank != -1:
model = torch.nn.parallel.DistributedDataParallel(model, device_ids=[args.local_rank],
output_device=args.local_rank,
find_unused_parameters=True)
model = torch.nn.parallel.DistributedDataParallel(
model, device_ids=[args.local_rank], output_device=args.local_rank, find_unused_parameters=True
)
# Train!
logger.info("***** Running training *****")
logger.info(" Num examples = %d", len(train_dataset))
logger.info(" Num Epochs = %d", args.num_train_epochs)
logger.info(" Instantaneous batch size per GPU = %d", args.per_gpu_train_batch_size)
logger.info(" Total train batch size (w. parallel, distributed & accumulation) = %d",
args.train_batch_size * args.gradient_accumulation_steps * (torch.distributed.get_world_size() if args.local_rank != -1 else 1))
logger.info(
" Total train batch size (w. parallel, distributed & accumulation) = %d",
args.train_batch_size
* args.gradient_accumulation_steps
* (torch.distributed.get_world_size() if args.local_rank != -1 else 1),
)
logger.info(" Gradient Accumulation steps = %d", args.gradient_accumulation_steps)
logger.info(" Total optimization steps = %d", t_total)
@@ -145,16 +161,16 @@ def train(args, train_dataset, model, tokenizer):
for step, batch in enumerate(epoch_iterator):
model.train()
batch = tuple(t.to(args.device) for t in batch)
inputs = {'input_ids': batch[0],
'attention_mask': batch[1],
'labels': batch[3]}
if args.model_type != 'distilbert':
inputs['token_type_ids'] = batch[2] if args.model_type in ['bert', 'xlnet'] else None # XLM, DistilBERT and RoBERTa don't use segment_ids
inputs = {"input_ids": batch[0], "attention_mask": batch[1], "labels": batch[3]}
if args.model_type != "distilbert":
inputs["token_type_ids"] = (
batch[2] if args.model_type in ["bert", "xlnet"] else None
) # XLM, DistilBERT and RoBERTa don't use segment_ids
outputs = model(**inputs)
loss = outputs[0] # model outputs are always tuple in transformers (see doc)
if args.n_gpu > 1:
loss = loss.mean() # mean() to average on multi-gpu parallel training
loss = loss.mean() # mean() to average on multi-gpu parallel training
if args.gradient_accumulation_steps > 1:
loss = loss / args.gradient_accumulation_steps
@@ -178,30 +194,34 @@ def train(args, train_dataset, model, tokenizer):
if args.local_rank in [-1, 0] and args.logging_steps > 0 and global_step % args.logging_steps == 0:
logs = {}
if args.local_rank == -1 and args.evaluate_during_training: # Only evaluate when single GPU otherwise metrics may not average well
if (
args.local_rank == -1 and args.evaluate_during_training
): # Only evaluate when single GPU otherwise metrics may not average well
results = evaluate(args, model, tokenizer)
for key, value in results.items():
eval_key = 'eval_{}'.format(key)
eval_key = "eval_{}".format(key)
logs[eval_key] = value
loss_scalar = (tr_loss - logging_loss) / args.logging_steps
learning_rate_scalar = scheduler.get_lr()[0]
logs['learning_rate'] = learning_rate_scalar
logs['loss'] = loss_scalar
logs["learning_rate"] = learning_rate_scalar
logs["loss"] = loss_scalar
logging_loss = tr_loss
for key, value in logs.items():
tb_writer.add_scalar(key, value, global_step)
print(json.dumps({**logs, **{'step': global_step}}))
# print(json.dumps({**logs, **{'step': global_step}}))
if args.local_rank in [-1, 0] and args.save_steps > 0 and global_step % args.save_steps == 0:
# Save model checkpoint
output_dir = os.path.join(args.output_dir, 'checkpoint-{}'.format(global_step))
output_dir = os.path.join(args.output_dir, "checkpoint-{}".format(global_step))
if not os.path.exists(output_dir):
os.makedirs(output_dir)
model_to_save = model.module if hasattr(model, 'module') else model # Take care of distributed/parallel training
model_to_save = (
model.module if hasattr(model, "module") else model
) # Take care of distributed/parallel training
model_to_save.save_pretrained(output_dir)
torch.save(args, os.path.join(output_dir, 'training_args.bin'))
torch.save(args, os.path.join(output_dir, "training_args.bin"))
logger.info("Saving model checkpoint to %s", output_dir)
if args.max_steps > 0 and global_step > args.max_steps:
@@ -220,11 +240,11 @@ def train(args, train_dataset, model, tokenizer):
def evaluate(args, model, tokenizer, prefix=""):
# Loop to handle MNLI double evaluation (matched, mis-matched)
eval_task_names = ("mnli", "mnli-mm") if args.task_name == "mnli" else (args.task_name,)
eval_outputs_dirs = (args.output_dir, args.output_dir + '-MM') if args.task_name == "mnli" else (args.output_dir,)
eval_outputs_dirs = (args.output_dir, args.output_dir + "-MM") if args.task_name == "mnli" else (args.output_dir,)
results = {}
for eval_task, eval_output_dir in zip(eval_task_names, eval_outputs_dirs):
eval_dataset = load_and_cache_examples(args, eval_task, tokenizer, evaluate=True)
eval_dataset, label_list = load_and_cache_examples(args, eval_task, tokenizer, evaluate=True)
if not os.path.exists(eval_output_dir) and args.local_rank in [-1, 0]:
os.makedirs(eval_output_dir)
@@ -235,7 +255,7 @@ def evaluate(args, model, tokenizer, prefix=""):
eval_dataloader = DataLoader(eval_dataset, sampler=eval_sampler, batch_size=args.eval_batch_size)
# multi-gpu eval
if args.n_gpu > 1:
if args.n_gpu > 1 and not isinstance(model, torch.nn.DataParallel):
model = torch.nn.DataParallel(model)
# Eval!
@@ -251,11 +271,11 @@ def evaluate(args, model, tokenizer, prefix=""):
batch = tuple(t.to(args.device) for t in batch)
with torch.no_grad():
inputs = {'input_ids': batch[0],
'attention_mask': batch[1],
'labels': batch[3]}
if args.model_type != 'distilbert':
inputs['token_type_ids'] = batch[2] if args.model_type in ['bert', 'xlnet'] else None # XLM, DistilBERT and RoBERTa don't use segment_ids
inputs = {"input_ids": batch[0], "attention_mask": batch[1], "labels": batch[3]}
if args.model_type != "distilbert":
inputs["token_type_ids"] = (
batch[2] if args.model_type in ["bert", "xlnet"] else None
) # XLM, DistilBERT and RoBERTa don't use segment_ids
outputs = model(**inputs)
tmp_eval_loss, logits = outputs[:2]
@@ -263,25 +283,24 @@ def evaluate(args, model, tokenizer, prefix=""):
nb_eval_steps += 1
if preds is None:
preds = logits.detach().cpu().numpy()
out_label_ids = inputs['labels'].detach().cpu().numpy()
out_label_ids = inputs["labels"].detach().cpu().numpy()
pair_ids = batch[4].detach().cpu().numpy()
else:
preds = np.append(preds, logits.detach().cpu().numpy(), axis=0)
out_label_ids = np.append(out_label_ids, inputs['labels'].detach().cpu().numpy(), axis=0)
out_label_ids = np.append(out_label_ids, inputs["labels"].detach().cpu().numpy(), axis=0)
pair_ids = np.append(pair_ids, batch[4].detach().cpu().numpy(), axis=0)
eval_loss = eval_loss / nb_eval_steps
if args.output_mode == "classification":
preds = np.argmax(preds, axis=1)
elif args.output_mode == "regression":
preds = np.squeeze(preds)
result = compute_metrics(eval_task, preds, out_label_ids)
results.update(result)
output_eval_file = os.path.join(eval_output_dir, prefix, "eval_results.txt")
output_eval_file = os.path.join(eval_output_dir, "hans_predictions.txt")
with open(output_eval_file, "w") as writer:
logger.info("***** Eval results {} *****".format(prefix))
for key in sorted(result.keys()):
logger.info(" %s = %s", key, str(result[key]))
writer.write("%s = %s\n" % (key, str(result[key])))
writer.write("pairID,gld_label\n")
for pid, pred in zip(pair_ids, preds):
writer.write("ex" + str(pid) + "," + label_list[int(pred)] + "\n")
return results
@@ -293,29 +312,38 @@ def load_and_cache_examples(args, task, tokenizer, evaluate=False):
processor = processors[task]()
output_mode = output_modes[task]
# Load data features from cache or dataset file
cached_features_file = os.path.join(args.data_dir, 'cached_{}_{}_{}_{}'.format(
'dev' if evaluate else 'train',
list(filter(None, args.model_name_or_path.split('/'))).pop(),
str(args.max_seq_length),
str(task)))
cached_features_file = os.path.join(
args.data_dir,
"cached_{}_{}_{}_{}".format(
"dev" if evaluate else "train",
list(filter(None, args.model_name_or_path.split("/"))).pop(),
str(args.max_seq_length),
str(task),
),
)
label_list = processor.get_labels()
if os.path.exists(cached_features_file) and not args.overwrite_cache:
logger.info("Loading features from cached file %s", cached_features_file)
features = torch.load(cached_features_file)
else:
logger.info("Creating features from dataset file at %s", args.data_dir)
label_list = processor.get_labels()
if task in ['mnli', 'mnli-mm'] and args.model_type in ['roberta']:
if task in ["mnli", "mnli-mm"] and args.model_type in ["roberta"]:
# HACK(label indices are swapped in RoBERTa pretrained model)
label_list[1], label_list[2] = label_list[2], label_list[1]
examples = processor.get_dev_examples(args.data_dir) if evaluate else processor.get_train_examples(args.data_dir)
features = convert_examples_to_features(examples,
tokenizer,
label_list=label_list,
max_length=args.max_seq_length,
output_mode=output_mode,
pad_on_left=bool(args.model_type in ['xlnet']), # pad on the left for xlnet
pad_token=tokenizer.convert_tokens_to_ids([tokenizer.pad_token])[0],
pad_token_segment_id=4 if args.model_type in ['xlnet'] else 0,
label_list[1], label_list[2] = label_list[2], label_list[1]
examples = (
processor.get_dev_examples(args.data_dir) if evaluate else processor.get_train_examples(args.data_dir)
)
features = convert_examples_to_features(
examples,
tokenizer,
label_list=label_list,
max_length=args.max_seq_length,
output_mode=output_mode,
pad_on_left=bool(args.model_type in ["xlnet"]), # pad on the left for xlnet
pad_token=tokenizer.pad_token_id,
pad_token_segment_id=tokenizer.pad_token_type_id,
)
if args.local_rank in [-1, 0]:
logger.info("Saving features into cached file %s", cached_features_file)
@@ -332,99 +360,159 @@ def load_and_cache_examples(args, task, tokenizer, evaluate=False):
all_labels = torch.tensor([f.label for f in features], dtype=torch.long)
elif output_mode == "regression":
all_labels = torch.tensor([f.label for f in features], dtype=torch.float)
dataset = TensorDataset(all_input_ids, all_attention_mask, all_token_type_ids, all_labels)
return dataset
all_pair_ids = torch.tensor([int(f.pairID) for f in features], dtype=torch.long)
dataset = TensorDataset(all_input_ids, all_attention_mask, all_token_type_ids, all_labels, all_pair_ids)
return dataset, label_list
def main():
parser = argparse.ArgumentParser()
## Required parameters
parser.add_argument("--data_dir", default=None, type=str, required=True,
help="The input data dir. Should contain the .tsv files (or other data files) for the task.")
parser.add_argument("--model_type", default=None, type=str, required=True,
help="Model type selected in the list: " + ", ".join(MODEL_CLASSES.keys()))
parser.add_argument("--model_name_or_path", default=None, type=str, required=True,
help="Path to pre-trained model or shortcut name selected in the list: " + ", ".join(ALL_MODELS))
parser.add_argument("--task_name", default=None, type=str, required=True,
help="The name of the task to train selected in the list: " + ", ".join(processors.keys()))
parser.add_argument("--output_dir", default=None, type=str, required=True,
help="The output directory where the model predictions and checkpoints will be written.")
# Required parameters
parser.add_argument(
"--data_dir",
default=None,
type=str,
required=True,
help="The input data dir. Should contain the .tsv files (or other data files) for the task.",
)
parser.add_argument(
"--model_type",
default=None,
type=str,
required=True,
help="Model type selected in the list: " + ", ".join(MODEL_CLASSES.keys()),
)
parser.add_argument(
"--model_name_or_path",
default=None,
type=str,
required=True,
help="Path to pre-trained model or shortcut name selected in the list: " + ", ".join(ALL_MODELS),
)
parser.add_argument(
"--task_name",
default=None,
type=str,
required=True,
help="The name of the task to train selected in the list: " + ", ".join(processors.keys()),
)
parser.add_argument(
"--output_dir",
default=None,
type=str,
required=True,
help="The output directory where the model predictions and checkpoints will be written.",
)
## Other parameters
parser.add_argument("--config_name", default="", type=str,
help="Pretrained config name or path if not the same as model_name")
parser.add_argument("--tokenizer_name", default="", type=str,
help="Pretrained tokenizer name or path if not the same as model_name")
parser.add_argument("--cache_dir", default="", type=str,
help="Where do you want to store the pre-trained models downloaded from s3")
parser.add_argument("--max_seq_length", default=128, type=int,
help="The maximum total input sequence length after tokenization. Sequences longer "
"than this will be truncated, sequences shorter will be padded.")
parser.add_argument("--do_train", action='store_true',
help="Whether to run training.")
parser.add_argument("--do_eval", action='store_true',
help="Whether to run eval on the dev set.")
parser.add_argument("--evaluate_during_training", action='store_true',
help="Rul evaluation during training at each logging step.")
parser.add_argument("--do_lower_case", action='store_true',
help="Set this flag if you are using an uncased model.")
# Other parameters
parser.add_argument(
"--config_name", default="", type=str, help="Pretrained config name or path if not the same as model_name"
)
parser.add_argument(
"--tokenizer_name",
default="",
type=str,
help="Pretrained tokenizer name or path if not the same as model_name",
)
parser.add_argument(
"--cache_dir",
default="",
type=str,
help="Where do you want to store the pre-trained models downloaded from s3",
)
parser.add_argument(
"--max_seq_length",
default=128,
type=int,
help="The maximum total input sequence length after tokenization. Sequences longer "
"than this will be truncated, sequences shorter will be padded.",
)
parser.add_argument("--do_train", action="store_true", help="Whether to run training.")
parser.add_argument("--do_eval", action="store_true", help="Whether to run eval on the dev set.")
parser.add_argument(
"--evaluate_during_training", action="store_true", help="Rul evaluation during training at each logging step."
)
parser.add_argument(
"--do_lower_case", action="store_true", help="Set this flag if you are using an uncased model."
)
parser.add_argument("--per_gpu_train_batch_size", default=8, type=int,
help="Batch size per GPU/CPU for training.")
parser.add_argument("--per_gpu_eval_batch_size", default=8, type=int,
help="Batch size per GPU/CPU for evaluation.")
parser.add_argument('--gradient_accumulation_steps', type=int, default=1,
help="Number of updates steps to accumulate before performing a backward/update pass.")
parser.add_argument("--learning_rate", default=5e-5, type=float,
help="The initial learning rate for Adam.")
parser.add_argument("--weight_decay", default=0.0, type=float,
help="Weight deay if we apply some.")
parser.add_argument("--adam_epsilon", default=1e-8, type=float,
help="Epsilon for Adam optimizer.")
parser.add_argument("--max_grad_norm", default=1.0, type=float,
help="Max gradient norm.")
parser.add_argument("--num_train_epochs", default=3.0, type=float,
help="Total number of training epochs to perform.")
parser.add_argument("--max_steps", default=-1, type=int,
help="If > 0: set total number of training steps to perform. Override num_train_epochs.")
parser.add_argument("--warmup_steps", default=0, type=int,
help="Linear warmup over warmup_steps.")
parser.add_argument("--per_gpu_train_batch_size", default=8, type=int, help="Batch size per GPU/CPU for training.")
parser.add_argument(
"--per_gpu_eval_batch_size", default=8, type=int, help="Batch size per GPU/CPU for evaluation."
)
parser.add_argument(
"--gradient_accumulation_steps",
type=int,
default=1,
help="Number of updates steps to accumulate before performing a backward/update pass.",
)
parser.add_argument("--learning_rate", default=5e-5, type=float, help="The initial learning rate for Adam.")
parser.add_argument("--weight_decay", default=0.0, type=float, help="Weight decay if we apply some.")
parser.add_argument("--adam_epsilon", default=1e-8, type=float, help="Epsilon for Adam optimizer.")
parser.add_argument("--max_grad_norm", default=1.0, type=float, help="Max gradient norm.")
parser.add_argument(
"--num_train_epochs", default=3.0, type=float, help="Total number of training epochs to perform."
)
parser.add_argument(
"--max_steps",
default=-1,
type=int,
help="If > 0: set total number of training steps to perform. Override num_train_epochs.",
)
parser.add_argument("--warmup_steps", default=0, type=int, help="Linear warmup over warmup_steps.")
parser.add_argument('--logging_steps', type=int, default=50,
help="Log every X updates steps.")
parser.add_argument('--save_steps', type=int, default=50,
help="Save checkpoint every X updates steps.")
parser.add_argument("--eval_all_checkpoints", action='store_true',
help="Evaluate all checkpoints starting with the same prefix as model_name ending and ending with step number")
parser.add_argument("--no_cuda", action='store_true',
help="Avoid using CUDA when available")
parser.add_argument('--overwrite_output_dir', action='store_true',
help="Overwrite the content of the output directory")
parser.add_argument('--overwrite_cache', action='store_true',
help="Overwrite the cached training and evaluation sets")
parser.add_argument('--seed', type=int, default=42,
help="random seed for initialization")
parser.add_argument("--logging_steps", type=int, default=50, help="Log every X updates steps.")
parser.add_argument("--save_steps", type=int, default=50, help="Save checkpoint every X updates steps.")
parser.add_argument(
"--eval_all_checkpoints",
action="store_true",
help="Evaluate all checkpoints starting with the same prefix as model_name ending and ending with step number",
)
parser.add_argument("--no_cuda", action="store_true", help="Avoid using CUDA when available")
parser.add_argument(
"--overwrite_output_dir", action="store_true", help="Overwrite the content of the output directory"
)
parser.add_argument(
"--overwrite_cache", action="store_true", help="Overwrite the cached training and evaluation sets"
)
parser.add_argument("--seed", type=int, default=42, help="random seed for initialization")
parser.add_argument('--fp16', action='store_true',
help="Whether to use 16-bit (mixed) precision (through NVIDIA apex) instead of 32-bit")
parser.add_argument('--fp16_opt_level', type=str, default='O1',
help="For fp16: Apex AMP optimization level selected in ['O0', 'O1', 'O2', and 'O3']."
"See details at https://nvidia.github.io/apex/amp.html")
parser.add_argument("--local_rank", type=int, default=-1,
help="For distributed training: local_rank")
parser.add_argument('--server_ip', type=str, default='', help="For distant debugging.")
parser.add_argument('--server_port', type=str, default='', help="For distant debugging.")
parser.add_argument(
"--fp16",
action="store_true",
help="Whether to use 16-bit (mixed) precision (through NVIDIA apex) instead of 32-bit",
)
parser.add_argument(
"--fp16_opt_level",
type=str,
default="O1",
help="For fp16: Apex AMP optimization level selected in ['O0', 'O1', 'O2', and 'O3']."
"See details at https://nvidia.github.io/apex/amp.html",
)
parser.add_argument("--local_rank", type=int, default=-1, help="For distributed training: local_rank")
parser.add_argument("--server_ip", type=str, default="", help="For distant debugging.")
parser.add_argument("--server_port", type=str, default="", help="For distant debugging.")
args = parser.parse_args()
if os.path.exists(args.output_dir) and os.listdir(args.output_dir) and args.do_train and not args.overwrite_output_dir:
raise ValueError("Output directory ({}) already exists and is not empty. Use --overwrite_output_dir to overcome.".format(args.output_dir))
if (
os.path.exists(args.output_dir)
and os.listdir(args.output_dir)
and args.do_train
and not args.overwrite_output_dir
):
raise ValueError(
"Output directory ({}) already exists and is not empty. Use --overwrite_output_dir to overcome.".format(
args.output_dir
)
)
# Setup distant debugging if needed
if args.server_ip and args.server_port:
# Distant debugging - see https://code.visualstudio.com/docs/python/debugging#_attach-to-a-local-script
import ptvsd
print("Waiting for debugger attach")
ptvsd.enable_attach(address=(args.server_ip, args.server_port), redirect_output=True)
ptvsd.wait_for_attach()
@@ -432,20 +520,28 @@ def main():
# Setup CUDA, GPU & distributed training
if args.local_rank == -1 or args.no_cuda:
device = torch.device("cuda" if torch.cuda.is_available() and not args.no_cuda else "cpu")
args.n_gpu = torch.cuda.device_count()
args.n_gpu = 0 if args.no_cuda else torch.cuda.device_count()
else: # Initializes the distributed backend which will take care of sychronizing nodes/GPUs
torch.cuda.set_device(args.local_rank)
device = torch.device("cuda", args.local_rank)
torch.distributed.init_process_group(backend='nccl')
torch.distributed.init_process_group(backend="nccl")
args.n_gpu = 1
args.device = device
# Setup logging
logging.basicConfig(format = '%(asctime)s - %(levelname)s - %(name)s - %(message)s',
datefmt = '%m/%d/%Y %H:%M:%S',
level = logging.INFO if args.local_rank in [-1, 0] else logging.WARN)
logger.warning("Process rank: %s, device: %s, n_gpu: %s, distributed training: %s, 16-bits training: %s",
args.local_rank, device, args.n_gpu, bool(args.local_rank != -1), args.fp16)
logging.basicConfig(
format="%(asctime)s - %(levelname)s - %(name)s - %(message)s",
datefmt="%m/%d/%Y %H:%M:%S",
level=logging.INFO if args.local_rank in [-1, 0] else logging.WARN,
)
logger.warning(
"Process rank: %s, device: %s, n_gpu: %s, distributed training: %s, 16-bits training: %s",
args.local_rank,
device,
args.n_gpu,
bool(args.local_rank != -1),
args.fp16,
)
# Set seed
set_seed(args)
@@ -465,17 +561,23 @@ def main():
args.model_type = args.model_type.lower()
config_class, model_class, tokenizer_class = MODEL_CLASSES[args.model_type]
config = config_class.from_pretrained(args.config_name if args.config_name else args.model_name_or_path,
num_labels=num_labels,
finetuning_task=args.task_name,
cache_dir=args.cache_dir if args.cache_dir else None)
tokenizer = tokenizer_class.from_pretrained(args.tokenizer_name if args.tokenizer_name else args.model_name_or_path,
do_lower_case=args.do_lower_case,
cache_dir=args.cache_dir if args.cache_dir else None)
model = model_class.from_pretrained(args.model_name_or_path,
from_tf=bool('.ckpt' in args.model_name_or_path),
config=config,
cache_dir=args.cache_dir if args.cache_dir else None)
config = config_class.from_pretrained(
args.config_name if args.config_name else args.model_name_or_path,
num_labels=num_labels,
finetuning_task=args.task_name,
cache_dir=args.cache_dir if args.cache_dir else None,
)
tokenizer = tokenizer_class.from_pretrained(
args.tokenizer_name if args.tokenizer_name else args.model_name_or_path,
do_lower_case=args.do_lower_case,
cache_dir=args.cache_dir if args.cache_dir else None,
)
model = model_class.from_pretrained(
args.model_name_or_path,
from_tf=bool(".ckpt" in args.model_name_or_path),
config=config,
cache_dir=args.cache_dir if args.cache_dir else None,
)
if args.local_rank == 0:
torch.distributed.barrier() # Make sure only the first process in distributed training will download model & vocab
@@ -484,14 +586,12 @@ def main():
logger.info("Training/evaluation parameters %s", args)
# Training
if args.do_train:
train_dataset = load_and_cache_examples(args, args.task_name, tokenizer, evaluate=False)
train_dataset, _ = load_and_cache_examples(args, args.task_name, tokenizer, evaluate=False)
global_step, tr_loss = train(args, train_dataset, model, tokenizer)
logger.info(" global_step = %s, average loss = %s", global_step, tr_loss)
# Saving best-practices: if you use defaults names for the model, you can reload it using from_pretrained()
if args.do_train and (args.local_rank == -1 or torch.distributed.get_rank() == 0):
# Create output directory if needed
@@ -501,36 +601,39 @@ def main():
logger.info("Saving model checkpoint to %s", args.output_dir)
# Save a trained model, configuration and tokenizer using `save_pretrained()`.
# They can then be reloaded using `from_pretrained()`
model_to_save = model.module if hasattr(model, 'module') else model # Take care of distributed/parallel training
model_to_save = (
model.module if hasattr(model, "module") else model
) # Take care of distributed/parallel training
model_to_save.save_pretrained(args.output_dir)
tokenizer.save_pretrained(args.output_dir)
# Good practice: save your training arguments together with the trained model
torch.save(args, os.path.join(args.output_dir, 'training_args.bin'))
torch.save(args, os.path.join(args.output_dir, "training_args.bin"))
# Load a trained model and vocabulary that you have fine-tuned
model = model_class.from_pretrained(args.output_dir)
tokenizer = tokenizer_class.from_pretrained(args.output_dir)
model.to(args.device)
# Evaluation
results = {}
if args.do_eval and args.local_rank in [-1, 0]:
tokenizer = tokenizer_class.from_pretrained(args.output_dir, do_lower_case=args.do_lower_case)
checkpoints = [args.output_dir]
if args.eval_all_checkpoints:
checkpoints = list(os.path.dirname(c) for c in sorted(glob.glob(args.output_dir + '/**/' + WEIGHTS_NAME, recursive=True)))
checkpoints = list(
os.path.dirname(c) for c in sorted(glob.glob(args.output_dir + "/**/" + WEIGHTS_NAME, recursive=True))
)
logging.getLogger("transformers.modeling_utils").setLevel(logging.WARN) # Reduce logging
logger.info("Evaluate the following checkpoints: %s", checkpoints)
for checkpoint in checkpoints:
global_step = checkpoint.split('-')[-1] if len(checkpoints) > 1 else ""
prefix = checkpoint.split('/')[-1] if checkpoint.find('checkpoint') != -1 else ""
global_step = checkpoint.split("-")[-1] if len(checkpoints) > 1 else ""
prefix = checkpoint.split("/")[-1] if checkpoint.find("checkpoint") != -1 else ""
model = model_class.from_pretrained(checkpoint)
model.to(args.device)
result = evaluate(args, model, tokenizer, prefix=prefix)
result = dict((k + '_{}'.format(global_step), v) for k, v in result.items())
result = dict((k + "_{}".format(global_step), v) for k, v in result.items())
results.update(result)
return results

View File

@@ -14,11 +14,11 @@
# See the License for the specific language governing permissions and
# limitations under the License.
import csv
import sys
import copy
import csv
import json
class InputExample(object):
"""
A single training/test example for simple sequence classification.
@@ -32,11 +32,13 @@ class InputExample(object):
label: (Optional) string. The label of the example. This should be
specified for train and dev examples, but not for test examples.
"""
def __init__(self, guid, text_a, text_b=None, label=None):
def __init__(self, guid, text_a, text_b=None, label=None, pairID=None):
self.guid = guid
self.text_a = text_a
self.text_b = text_b
self.label = label
self.pairID = pairID
def __repr__(self):
return str(self.to_json_string())
@@ -64,11 +66,12 @@ class InputFeatures(object):
label: Label corresponding to the input
"""
def __init__(self, input_ids, attention_mask, token_type_ids, label):
def __init__(self, input_ids, attention_mask, token_type_ids, label, pairID=None):
self.input_ids = input_ids
self.attention_mask = attention_mask
self.token_type_ids = token_type_ids
self.label = label
self.pairID = pairID
def __repr__(self):
return str(self.to_json_string())
@@ -107,13 +110,6 @@ class DataProcessor(object):
"""Gets the list of labels for this data set."""
raise NotImplementedError()
def tfds_map(self, example):
"""Some tensorflow_datasets datasets are not formatted the same way the GLUE datasets are.
This method converts examples to the correct format."""
if len(self.get_labels()) > 1:
example.label = self.get_labels()[int(example.label)]
return example
@classmethod
def _read_tsv(cls, input_file, quotechar=None):
"""Reads a tab separated value file."""
@@ -121,7 +117,5 @@ class DataProcessor(object):
reader = csv.reader(f, delimiter="\t", quotechar=quotechar)
lines = []
for line in reader:
if sys.version_info[0] == 2:
line = list(unicode(cell, 'utf-8') for cell in line)
lines.append(line)
return lines

View File

@@ -18,12 +18,23 @@
# If checking the tensors placement
# tf.debugging.set_log_device_placement(True)
from typing import List
import timeit
from transformers import is_tf_available, is_torch_available
from time import time
import argparse
import csv
import logging
import timeit
from time import time
from typing import Callable, List
from transformers import (
AutoConfig,
AutoTokenizer,
MemorySummary,
is_tf_available,
is_torch_available,
start_memory_tracing,
stop_memory_tracing,
)
if is_tf_available():
import tensorflow as tf
@@ -33,230 +44,236 @@ if is_torch_available():
import torch
from transformers import AutoModel
from transformers import AutoConfig, AutoTokenizer
input_text = """Bent over their instruments, three hundred Fertilizers were plunged, as
the Director of Hatcheries and Conditioning entered the room, in the
input_text = """Bent over their instruments, three hundred Fertilizers were plunged, as
the Director of Hatcheries and Conditioning entered the room, in the
scarcely breathing silence, the absent-minded, soliloquizing hum or
whistle, of absorbed concentration. A troop of newly arrived students,
very young, pink and callow, followed nervously, rather abjectly, at the
Director's heels. Each of them carried a notebook, in which, whenever
the great man spoke, he desperately scribbled. Straight from the
horse's mouth. It was a rare privilege. The D. H. C. for Central London
always made a point of personally conducting his new students round
the various departments.
"Just to give you a general idea," he would explain to them. For of
course some sort of general idea they must have, if they were to do
their work intelligently-though as little of one, if they were to be good
and happy members of society, as possible. For particulars, as every
one knows, make for virtue and happiness; generalities are intellectu-
ally necessary evils. Not philosophers but fret-sawyers and stamp col-
lectors compose the backbone of society.
"To-morrow," he would add, smiling at them with a slightly menacing
geniality, "you'll be settling down to serious work. You won't have time
for generalities. Meanwhile ..."
Meanwhile, it was a privilege. Straight from the horse's mouth into the
notebook. The boys scribbled like mad.
Tall and rather thin but upright, the Director advanced into the room.
He had a long chin and big rather prominent teeth, just covered, when
he was not talking, by his full, floridly curved lips. Old, young? Thirty?
Fifty? Fifty-five? It was hard to say. And anyhow the question didn't
arise; in this year of stability, A. F. 632, it didn't occur to you to ask it.
"I shall begin at the beginning," said the D.H.C. and the more zealous
students recorded his intention in their notebooks: Begin at the begin-
ning. "These," he waved his hand, "are the incubators." And opening
an insulated door he showed them racks upon racks of numbered test-
tubes. "The week's supply of ova. Kept," he explained, "at blood heat;
whereas the male gametes," and here he opened another door, "they
have to be kept at thirty-five instead of thirty-seven. Full blood heat
sterilizes." Rams wrapped in theremogene beget no lambs.
Still leaning against the incubators he gave them, while the pencils
scurried illegibly across the pages, a brief description of the modern
scarcely breathing silence, the absent-minded, soliloquizing hum or
whistle, of absorbed concentration. A troop of newly arrived students,
very young, pink and callow, followed nervously, rather abjectly, at the
Director's heels. Each of them carried a notebook, in which, whenever
the great man spoke, he desperately scribbled. Straight from the
horse's mouth. It was a rare privilege. The D. H. C. for Central London
always made a point of personally conducting his new students round
the various departments.
fertilizing process; spoke first, of course, of its surgical introduc-
tion-"the operation undergone voluntarily for the good of Society, not
to mention the fact that it carries a bonus amounting to six months'
salary"; continued with some account of the technique for preserving
the excised ovary alive and actively developing; passed on to a consid-
eration of optimum temperature, salinity, viscosity; referred to the liq-
uor in which the detached and ripened eggs were kept; and, leading
his charges to the work tables, actually showed them how this liquor
was drawn off from the test-tubes; how it was let out drop by drop
onto the specially warmed slides of the microscopes; how the eggs
which it contained were inspected for abnormalities, counted and
transferred to a porous receptacle; how (and he now took them to
watch the operation) this receptacle was immersed in a warm bouillon
containing free-swimming spermatozoa-at a minimum concentration
of one hundred thousand per cubic centimetre, he insisted; and how,
after ten minutes, the container was lifted out of the liquor and its
contents re-examined; how, if any of the eggs remained unfertilized, it
was again immersed, and, if necessary, yet again; how the fertilized
ova went back to the incubators; where the Alphas and Betas re-
mained until definitely bottled; while the Gammas, Deltas and Epsilons
were brought out again, after only thirty-six hours, to undergo Bo-
kanovsky's Process.
"Just to give you a general idea," he would explain to them. For of
course some sort of general idea they must have, if they were to do
their work intelligently-though as little of one, if they were to be good
and happy members of society, as possible. For particulars, as every
one knows, make for virtue and happiness; generalities are intellectu-
ally necessary evils. Not philosophers but fret-sawyers and stamp col-
lectors compose the backbone of society.
"Bokanovsky's Process," repeated the Director, and the students un-
derlined the words in their little notebooks.
"To-morrow," he would add, smiling at them with a slightly menacing
geniality, "you'll be settling down to serious work. You won't have time
for generalities. Meanwhile ..."
One egg, one embryo, one adult-normality. But a bokanovskified egg
will bud, will proliferate, will divide. From eight to ninety-six buds, and
every bud will grow into a perfectly formed embryo, and every embryo
into a full-sized adult. Making ninety-six human beings grow where
only one grew before. Progress.
Meanwhile, it was a privilege. Straight from the horse's mouth into the
notebook. The boys scribbled like mad.
"Essentially," the D.H.C. concluded, "bokanovskification consists of a
series of arrests of development. We check the normal growth and,
paradoxically enough, the egg responds by budding."
Tall and rather thin but upright, the Director advanced into the room.
He had a long chin and big rather prominent teeth, just covered, when
he was not talking, by his full, floridly curved lips. Old, young? Thirty?
Fifty? Fifty-five? It was hard to say. And anyhow the question didn't
arise; in this year of stability, A. F. 632, it didn't occur to you to ask it.
Responds by budding. The pencils were busy.
"I shall begin at the beginning," said the D.H.C. and the more zealous
students recorded his intention in their notebooks: Begin at the begin-
ning. "These," he waved his hand, "are the incubators." And opening
an insulated door he showed them racks upon racks of numbered test-
tubes. "The week's supply of ova. Kept," he explained, "at blood heat;
whereas the male gametes," and here he opened another door, "they
have to be kept at thirty-five instead of thirty-seven. Full blood heat
sterilizes." Rams wrapped in theremogene beget no lambs.
Still leaning against the incubators he gave them, while the pencils
scurried illegibly across the pages, a brief description of the modern
He pointed. On a very slowly moving band a rack-full of test-tubes was
entering a large metal box, another, rack-full was emerging. Machinery
faintly purred. It took eight minutes for the tubes to go through, he
fertilizing process; spoke first, of course, of its surgical introduc-
tion-"the operation undergone voluntarily for the good of Society, not
to mention the fact that it carries a bonus amounting to six months'
salary"; continued with some account of the technique for preserving
the excised ovary alive and actively developing; passed on to a consid-
eration of optimum temperature, salinity, viscosity; referred to the liq-
uor in which the detached and ripened eggs were kept; and, leading
his charges to the work tables, actually showed them how this liquor
was drawn off from the test-tubes; how it was let out drop by drop
onto the specially warmed slides of the microscopes; how the eggs
which it contained were inspected for abnormalities, counted and
transferred to a porous receptacle; how (and he now took them to
watch the operation) this receptacle was immersed in a warm bouillon
containing free-swimming spermatozoa-at a minimum concentration
of one hundred thousand per cubic centimetre, he insisted; and how,
after ten minutes, the container was lifted out of the liquor and its
contents re-examined; how, if any of the eggs remained unfertilized, it
was again immersed, and, if necessary, yet again; how the fertilized
ova went back to the incubators; where the Alphas and Betas re-
mained until definitely bottled; while the Gammas, Deltas and Epsilons
were brought out again, after only thirty-six hours, to undergo Bo-
kanovsky's Process.
told them. Eight minutes of hard X-rays being about as much as an
egg can stand. A few died; of the rest, the least susceptible divided
into two; most put out four buds; some eight; all were returned to the
incubators, where the buds began to develop; then, after two days,
were suddenly chilled, chilled and checked. Two, four, eight, the buds
in their turn budded; and having budded were dosed almost to death
with alcohol; consequently burgeoned again and having budded-bud
out of bud out of bud-were thereafter-further arrest being generally
fatal-left to develop in peace. By which time the original egg was in a
fair way to becoming anything from eight to ninety-six embryos- a
prodigious improvement, you will agree, on nature. Identical twins-but
not in piddling twos and threes as in the old viviparous days, when an
egg would sometimes accidentally divide; actually by dozens, by
scores at a time.
"Bokanovsky's Process," repeated the Director, and the students un-
derlined the words in their little notebooks.
"Scores," the Director repeated and flung out his arms, as though he
were distributing largesse. "Scores."
One egg, one embryo, one adult-normality. But a bokanovskified egg
will bud, will proliferate, will divide. From eight to ninety-six buds, and
every bud will grow into a perfectly formed embryo, and every embryo
into a full-sized adult. Making ninety-six human beings grow where
only one grew before. Progress.
But one of the students was fool enough to ask where the advantage
lay.
"Essentially," the D.H.C. concluded, "bokanovskification consists of a
series of arrests of development. We check the normal growth and,
paradoxically enough, the egg responds by budding."
"My good boy!" The Director wheeled sharply round on him. "Can't you
see? Can't you see?" He raised a hand; his expression was solemn.
"Bokanovsky's Process is one of the major instruments of social stabil-
ity!"
Responds by budding. The pencils were busy.
Major instruments of social stability.
He pointed. On a very slowly moving band a rack-full of test-tubes was
entering a large metal box, another, rack-full was emerging. Machinery
faintly purred. It took eight minutes for the tubes to go through, he
Standard men and women; in uniform batches. The whole of a small
factory staffed with the products of a single bokanovskified egg.
"Ninety-six identical twins working ninety-six identical machines!" The
voice was almost tremulous with enthusiasm. "You really know where
you are. For the first time in history." He quoted the planetary motto.
"Community, Identity, Stability." Grand words. "If we could bo-
kanovskify indefinitely the whole problem would be solved."
Solved by standard Gammas, unvarying Deltas, uniform Epsilons. Mil-
lions of identical twins. The principle of mass production at last applied
to biology.
told them. Eight minutes of hard X-rays being about as much as an
egg can stand. A few died; of the rest, the least susceptible divided
into two; most put out four buds; some eight; all were returned to the
incubators, where the buds began to develop; then, after two days,
were suddenly chilled, chilled and checked. Two, four, eight, the buds
in their turn budded; and having budded were dosed almost to death
with alcohol; consequently burgeoned again and having budded-bud
out of bud out of bud-were thereafter-further arrest being generally
fatal-left to develop in peace. By which time the original egg was in a
fair way to becoming anything from eight to ninety-six embryos- a
prodigious improvement, you will agree, on nature. Identical twins-but
not in piddling twos and threes as in the old viviparous days, when an
egg would sometimes accidentally divide; actually by dozens, by
scores at a time.
"But, alas," the Director shook his head, "we can't bokanovskify indefi-
nitely."
"Scores," the Director repeated and flung out his arms, as though he
were distributing largesse. "Scores."
Ninety-six seemed to be the limit; seventy-two a good average. From
the same ovary and with gametes of the same male to manufacture as
many batches of identical twins as possible-that was the best (sadly a
second best) that they could do. And even that was difficult.
But one of the students was fool enough to ask where the advantage
lay.
"For in nature it takes thirty years for two hundred eggs to reach ma-
turity. But our business is to stabilize the population at this moment,
here and now. Dribbling out twins over a quarter of a century-what
would be the use of that?"
"My good boy!" The Director wheeled sharply round on him. "Can't you
see? Can't you see?" He raised a hand; his expression was solemn.
"Bokanovsky's Process is one of the major instruments of social stabil-
ity!"
Obviously, no use at all. But Podsnap's Technique had immensely ac-
celerated the process of ripening. They could make sure of at least a
hundred and fifty mature eggs within two years. Fertilize and bo-
kanovskify-in other words, multiply by seventy-two-and you get an
average of nearly eleven thousand brothers and sisters in a hundred
and fifty batches of identical twins, all within two years of the same
age.
Major instruments of social stability.
"And in exceptional cases we can make one ovary yield us over fifteen
thousand adult individuals."
Standard men and women; in uniform batches. The whole of a small
factory staffed with the products of a single bokanovskified egg.
Beckoning to a fair-haired, ruddy young man who happened to be
passing at the moment. "Mr. Foster," he called. The ruddy young man
approached. "Can you tell us the record for a single ovary, Mr. Foster?"
"Ninety-six identical twins working ninety-six identical machines!" The
voice was almost tremulous with enthusiasm. "You really know where
you are. For the first time in history." He quoted the planetary motto.
"Community, Identity, Stability." Grand words. "If we could bo-
kanovskify indefinitely the whole problem would be solved."
Solved by standard Gammas, unvarying Deltas, uniform Epsilons. Mil-
lions of identical twins. The principle of mass production at last applied
to biology.
"Sixteen thousand and twelve in this Centre," Mr. Foster replied with-
out hesitation. He spoke very quickly, had a vivacious blue eye, and
took an evident pleasure in quoting figures. "Sixteen thousand and
twelve; in one hundred and eighty-nine batches of identicals. But of
course they've done much better," he rattled on, "in some of the tropi-
cal Centres. Singapore has often produced over sixteen thousand five
hundred; and Mombasa has actually touched the seventeen thousand
mark. But then they have unfair advantages. You should see the way a
negro ovary responds to pituitary! It's quite astonishing, when you're
used to working with European material. Still," he added, with a laugh
(but the light of combat was in his eyes and the lift of his chin was
challenging), "still, we mean to beat them if we can. I'm working on a
wonderful Delta-Minus ovary at this moment. Only just eighteen
"But, alas," the Director shook his head, "we can't bokanovskify indefi-
nitely."
months old. Over twelve thousand seven hundred children already, ei-
ther decanted or in embryo. And still going strong. We'll beat them
yet."
Ninety-six seemed to be the limit; seventy-two a good average. From
the same ovary and with gametes of the same male to manufacture as
many batches of identical twins as possible-that was the best (sadly a
second best) that they could do. And even that was difficult.
"That's the spirit I like!" cried the Director, and clapped Mr. Foster on
the shoulder. "Come along with us, and give these boys the benefit of
your expert knowledge."
"For in nature it takes thirty years for two hundred eggs to reach ma-
turity. But our business is to stabilize the population at this moment,
here and now. Dribbling out twins over a quarter of a century-what
would be the use of that?"
Mr. Foster smiled modestly. "With pleasure." They went.
In the Bottling Room all was harmonious bustle and ordered activity.
Flaps of fresh sow's peritoneum ready cut to the proper size came
shooting up in little lifts from the Organ Store in the sub-basement.
Whizz and then, click! the lift-hatches hew open; the bottle-liner had
only to reach out a hand, take the flap, insert, smooth-down, and be-
fore the lined bottle had had time to travel out of reach along the end-
less band, whizz, click! another flap of peritoneum had shot up from
the depths, ready to be slipped into yet another bottle, the next of that
slow interminable procession on the band.
Obviously, no use at all. But Podsnap's Technique had immensely ac-
celerated the process of ripening. They could make sure of at least a
hundred and fifty mature eggs within two years. Fertilize and bo-
kanovskify-in other words, multiply by seventy-two-and you get an
average of nearly eleven thousand brothers and sisters in a hundred
and fifty batches of identical twins, all within two years of the same
age.
"And in exceptional cases we can make one ovary yield us over fifteen
thousand adult individuals."
Beckoning to a fair-haired, ruddy young man who happened to be
passing at the moment. "Mr. Foster," he called. The ruddy young man
approached. "Can you tell us the record for a single ovary, Mr. Foster?"
"Sixteen thousand and twelve in this Centre," Mr. Foster replied with-
out hesitation. He spoke very quickly, had a vivacious blue eye, and
took an evident pleasure in quoting figures. "Sixteen thousand and
twelve; in one hundred and eighty-nine batches of identicals. But of
course they've done much better," he rattled on, "in some of the tropi-
cal Centres. Singapore has often produced over sixteen thousand five
hundred; and Mombasa has actually touched the seventeen thousand
mark. But then they have unfair advantages. You should see the way a
negro ovary responds to pituitary! It's quite astonishing, when you're
used to working with European material. Still," he added, with a laugh
(but the light of combat was in his eyes and the lift of his chin was
challenging), "still, we mean to beat them if we can. I'm working on a
wonderful Delta-Minus ovary at this moment. Only just eighteen
months old. Over twelve thousand seven hundred children already, ei-
ther decanted or in embryo. And still going strong. We'll beat them
yet."
"That's the spirit I like!" cried the Director, and clapped Mr. Foster on
the shoulder. "Come along with us, and give these boys the benefit of
your expert knowledge."
Mr. Foster smiled modestly. "With pleasure." They went.
In the Bottling Room all was harmonious bustle and ordered activity.
Flaps of fresh sow's peritoneum ready cut to the proper size came
shooting up in little lifts from the Organ Store in the sub-basement.
Whizz and then, click! the lift-hatches hew open; the bottle-liner had
only to reach out a hand, take the flap, insert, smooth-down, and be-
fore the lined bottle had had time to travel out of reach along the end-
less band, whizz, click! another flap of peritoneum had shot up from
the depths, ready to be slipped into yet another bottle, the next of that
slow interminable procession on the band.
Next to the Liners stood the Matriculators. The procession advanced;
one by one the eggs were transferred from their test-tubes to the
larger containers; deftly the peritoneal lining was slit, the morula
dropped into place, the saline solution poured in ... and already the
bottle had passed, and it was the turn of the labellers. Heredity, date
of fertilization, membership of Bokanovsky Group-details were trans-
ferred from test-tube to bottle. No longer anonymous, but named,
identified, the procession marched slowly on; on through an opening in
the wall, slowly on into the Social Predestination Room.
"Eighty-eight cubic metres of card-index," said Mr. Foster with relish,
Next to the Liners stood the Matriculators. The procession advanced;
one by one the eggs were transferred from their test-tubes to the
larger containers; deftly the peritoneal lining was slit, the morula
dropped into place, the saline solution poured in ... and already the
bottle had passed, and it was the turn of the labellers. Heredity, date
of fertilization, membership of Bokanovsky Group-details were trans-
ferred from test-tube to bottle. No longer anonymous, but named,
identified, the procession marched slowly on; on through an opening in
the wall, slowly on into the Social Predestination Room.
"Eighty-eight cubic metres of card-index," said Mr. Foster with relish,
as they entered."""
def create_setup_and_compute(model_names: List[str],
gpu: bool = True,
tensorflow: bool = False,
average_over: int = 3,
torchscript: bool = False,
xla: bool = False,
amp: bool = False,
fp16: bool = False,
save_to_csv: bool = False,
csv_filename: str = f"results_{round(time())}.csv"):
def create_setup_and_compute(
model_names: List[str],
batch_sizes: List[int],
slice_sizes: List[int],
gpu: bool = True,
tensorflow: bool = False,
average_over: int = 3,
no_speed: bool = False,
no_memory: bool = False,
verbose: bool = False,
torchscript: bool = False,
xla: bool = False,
amp: bool = False,
fp16: bool = False,
save_to_csv: bool = False,
csv_time_filename: str = f"time_{round(time())}.csv",
csv_memory_filename: str = f"memory_{round(time())}.csv",
print_fn: Callable[[str], None] = print,
):
if xla:
tf.config.optimizer.set_jit(True)
if amp:
@@ -264,51 +281,152 @@ def create_setup_and_compute(model_names: List[str],
if tensorflow:
dictionary = {model_name: {} for model_name in model_names}
results = _compute_tensorflow(model_names, dictionary, average_over, amp)
results = _compute_tensorflow(
model_names,
batch_sizes,
slice_sizes,
dictionary,
average_over,
amp,
no_speed,
no_memory,
verbose,
print_fn,
)
else:
device = 'cuda' if (gpu and torch.cuda.is_available()) else 'cpu'
device = "cuda" if (gpu and torch.cuda.is_available()) else "cpu"
dictionary = {model_name: {} for model_name in model_names}
results = _compute_pytorch(model_names, dictionary, average_over, device, torchscript, fp16)
results = _compute_pytorch(
model_names,
batch_sizes,
slice_sizes,
dictionary,
average_over,
device,
torchscript,
fp16,
no_speed,
no_memory,
verbose,
print_fn,
)
print("=========== RESULTS ===========")
print_fn("=========== RESULTS ===========")
for model_name in model_names:
print("\t" + f"======= MODEL CHECKPOINT: {model_name} =======")
print_fn("\t" + f"======= MODEL CHECKPOINT: {model_name} =======")
for batch_size in results[model_name]["bs"]:
print("\t\t" + f"===== BATCH SIZE: {batch_size} =====")
print_fn("\t\t" + f"===== BATCH SIZE: {batch_size} =====")
for slice_size in results[model_name]["ss"]:
result = results[model_name]['results'][batch_size][slice_size]
if isinstance(result, str):
print(f"\t\t{model_name}/{batch_size}/{slice_size}: "
f"{result}")
time = results[model_name]["time"][batch_size][slice_size]
memory = results[model_name]["memory"][batch_size][slice_size]
if isinstance(time, str):
print_fn(f"\t\t{model_name}/{batch_size}/{slice_size}: " f"{time} " f"{memory}")
else:
print(f"\t\t{model_name}/{batch_size}/{slice_size}: "
f"{(round(1000 * result) / 1000)}"
f"s")
print_fn(
f"\t\t{model_name}/{batch_size}/{slice_size}: "
f"{(round(1000 * time) / 1000)}"
f"s "
f"{memory}"
)
if save_to_csv:
with open(csv_filename, mode='w') as csv_file:
fieldnames = ['model',
'1x8', '1x64', '1x128', '1x256', '1x512', '1x1024',
'2x8', '2x64', '2x128', '2x256', '2x512', '2x1024',
'4x8', '4x64', '4x128', '4x256', '4x512', '4x1024',
'8x8', '8x64', '8x128', '8x256', '8x512', '8x1024',
]
with open(csv_time_filename, mode="w") as csv_time_file, open(
csv_memory_filename, mode="w"
) as csv_memory_file:
writer = csv.DictWriter(csv_file, fieldnames=fieldnames)
writer.writeheader()
assert len(model_names) > 0, "At least 1 model should be defined, but got {}".format(model_names)
fieldnames = ["model", "batch_size", "sequence_length"]
time_writer = csv.DictWriter(csv_time_file, fieldnames=fieldnames + ["time_in_s"])
time_writer.writeheader()
memory_writer = csv.DictWriter(csv_memory_file, fieldnames=fieldnames + ["memory"])
memory_writer.writeheader()
for model_name in model_names:
model_results = {
f'{bs}x{ss}': results[model_name]['results'][bs][ss]
for bs in results[model_name]["results"]
for ss in results[model_name]['results'][bs]
}
writer.writerow({'model': model_name, **model_results})
time_dict = results[model_name]["time"]
memory_dict = results[model_name]["memory"]
for bs in time_dict:
for ss in time_dict[bs]:
time_writer.writerow(
{
"model": model_name,
"batch_size": bs,
"sequence_length": ss,
"time_in_s": "{:.4f}".format(time_dict[bs][ss]),
}
)
for bs in memory_dict:
for ss in time_dict[bs]:
memory_writer.writerow(
{
"model": model_name,
"batch_size": bs,
"sequence_length": ss,
"memory": memory_dict[bs][ss],
}
)
def _compute_pytorch(model_names, dictionary, average_over, device, torchscript, fp16):
def print_summary_statistics(summary: MemorySummary, print_fn: Callable[[str], None]):
print_fn(
"\nLines by line memory consumption:\n"
+ "\n".join(
f"{state.frame.filename}:{state.frame.line_number}: mem {state.cpu_gpu}: {state.frame.line_text}"
for state in summary.sequential
)
)
print_fn(
"\nLines with top memory consumption:\n"
+ "\n".join(
f"=> {state.frame.filename}:{state.frame.line_number}: mem {state.cpu_gpu}: {state.frame.line_text}"
for state in summary.cumulative[:6]
)
)
print_fn(
"\nLines with lowest memory consumption:\n"
+ "\n".join(
f"=> {state.frame.filename}:{state.frame.line_number}: mem {state.cpu_gpu}: {state.frame.line_text}"
for state in summary.cumulative[-6:]
)
)
print_fn(f"\nTotal memory increase: {summary.total}")
def get_print_function(save_print_log, log_filename):
if save_print_log:
logging.basicConfig(
level=logging.DEBUG,
filename=log_filename,
filemode="a+",
format="%(asctime)-15s %(levelname)-8s %(message)s",
)
def print_with_print_log(*args):
logging.info(*args)
print(*args)
return print_with_print_log
else:
return print
def _compute_pytorch(
model_names,
batch_sizes,
slice_sizes,
dictionary,
average_over,
device,
torchscript,
fp16,
no_speed,
no_memory,
verbose,
print_fn,
):
for c, model_name in enumerate(model_names):
print(f"{c + 1} / {len(model_names)}")
print_fn(f"{c + 1} / {len(model_names)}")
config = AutoConfig.from_pretrained(model_name, torchscript=torchscript)
model = AutoModel.from_pretrained(model_name, config=config)
tokenizer = AutoTokenizer.from_pretrained(model_name)
@@ -316,45 +434,70 @@ def _compute_pytorch(model_names, dictionary, average_over, device, torchscript,
tokenized_sequence = tokenizer.encode(input_text, add_special_tokens=False)
max_input_size = tokenizer.max_model_input_sizes[model_name]
batch_sizes = [1, 2, 4, 8]
slice_sizes = [8, 64, 128, 256, 512, 1024]
dictionary[model_name] = {"bs": batch_sizes, "ss": slice_sizes, "results": {}}
dictionary[model_name]["results"] = {i: {} for i in batch_sizes}
dictionary[model_name] = {"bs": batch_sizes, "ss": slice_sizes, "time": {}, "memory": {}}
dictionary[model_name]["time"] = {i: {} for i in batch_sizes}
dictionary[model_name]["memory"] = {i: {} for i in batch_sizes}
print_fn("Using model {}".format(model))
print_fn("Number of all parameters {}".format(model.num_parameters()))
for batch_size in batch_sizes:
if fp16:
model.half()
model.to(device)
model.eval()
for slice_size in slice_sizes:
if max_input_size is not None and slice_size > max_input_size:
dictionary[model_name]["results"][batch_size][slice_size] = "N/A"
dictionary[model_name]["time"][batch_size][slice_size] = "N/A"
else:
sequence = torch.tensor(tokenized_sequence[:slice_size], device=device).repeat(batch_size, 1)
try:
if torchscript:
print("Tracing model with sequence size", sequence.shape)
print_fn("Tracing model with sequence size {}".format(sequence.shape))
inference = torch.jit.trace(model, sequence)
inference(sequence)
else:
inference = model
inference(sequence)
print("Going through model with sequence of shape", sequence.shape)
runtimes = timeit.repeat(lambda: inference(sequence), repeat=average_over, number=3)
average_time = sum(runtimes)/float(len(runtimes)) / 3.0
dictionary[model_name]["results"][batch_size][slice_size] = average_time
if not no_memory:
# model.add_memory_hooks() # Forward method tracing (only for PyTorch models)
# Line by line memory tracing (all code in the module `transformers`) works for all models/arbitrary code
trace = start_memory_tracing("transformers")
inference(sequence)
summary = stop_memory_tracing(trace)
if verbose:
print_summary_statistics(summary, print_fn)
dictionary[model_name]["memory"][batch_size][slice_size] = str(summary.total)
else:
dictionary[model_name]["memory"][batch_size][slice_size] = "N/A"
if not no_speed:
print_fn("Going through model with sequence of shape".format(sequence.shape))
runtimes = timeit.repeat(lambda: inference(sequence), repeat=average_over, number=3)
average_time = sum(runtimes) / float(len(runtimes)) / 3.0
dictionary[model_name]["time"][batch_size][slice_size] = average_time
else:
dictionary[model_name]["time"][batch_size][slice_size] = "N/A"
except RuntimeError as e:
print("Doesn't fit on GPU.", e)
print_fn("Doesn't fit on GPU. {}".format(e))
torch.cuda.empty_cache()
dictionary[model_name]["results"][batch_size][slice_size] = "N/A"
dictionary[model_name]["time"][batch_size][slice_size] = "N/A"
dictionary[model_name]["memory"][batch_size][slice_size] = "N/A"
return dictionary
def _compute_tensorflow(model_names, dictionary, average_over, amp):
def _compute_tensorflow(
model_names, batch_sizes, slice_sizes, dictionary, average_over, amp, no_speed, no_memory, verbose, print_fn
):
for c, model_name in enumerate(model_names):
print(f"{c + 1} / {len(model_names)}")
print_fn(f"{c + 1} / {len(model_names)}")
config = AutoConfig.from_pretrained(model_name)
model = TFAutoModel.from_pretrained(model_name, config=config)
tokenizer = AutoTokenizer.from_pretrained(model_name)
@@ -362,13 +505,13 @@ def _compute_tensorflow(model_names, dictionary, average_over, amp):
tokenized_sequence = tokenizer.encode(input_text, add_special_tokens=False)
max_input_size = tokenizer.max_model_input_sizes[model_name]
batch_sizes = [1, 2, 4, 8]
slice_sizes = [8, 64, 128, 256, 512, 1024]
dictionary[model_name] = {"bs": batch_sizes, "ss": slice_sizes, "results": {}}
dictionary[model_name]["results"] = {i: {} for i in batch_sizes}
dictionary[model_name] = {"bs": batch_sizes, "ss": slice_sizes, "time": {}, "memory": {}}
dictionary[model_name]["time"] = {i: {} for i in batch_sizes}
dictionary[model_name]["memory"] = {i: {} for i in batch_sizes}
print("Using model", model)
print_fn("Using model {}".format(model))
print_fn("Number of all parameters {}".format(model.num_parameters()))
@tf.function
def inference(inputs):
@@ -377,55 +520,128 @@ def _compute_tensorflow(model_names, dictionary, average_over, amp):
for batch_size in batch_sizes:
for slice_size in slice_sizes:
if max_input_size is not None and slice_size > max_input_size:
dictionary[model_name]["results"][batch_size][slice_size] = "N/A"
dictionary[model_name]["time"][batch_size][slice_size] = "N/A"
else:
sequence = tf.stack([tf.squeeze(tf.constant(tokenized_sequence[:slice_size])[None, :])] * batch_size)
sequence = tf.stack(
[tf.squeeze(tf.constant(tokenized_sequence[:slice_size])[None, :])] * batch_size
)
try:
print("Going through model with sequence of shape", sequence.shape)
print_fn("Going through model with sequence of shape {}".format(sequence.shape))
# To make sure that the model is traced + that the tensors are on the appropriate device
inference(sequence)
runtimes = timeit.repeat(lambda: inference(sequence), repeat=average_over, number=3)
average_time = sum(runtimes)/float(len(runtimes)) / 3.0
dictionary[model_name]["results"][batch_size][slice_size] = average_time
if not no_memory:
# Line by line memory tracing (all code in the module `transformers`) works for all models/arbitrary code
trace = start_memory_tracing("transformers")
inference(sequence)
summary = stop_memory_tracing(trace)
if verbose:
print_summary_statistics(summary, print_fn)
dictionary[model_name]["memory"][batch_size][slice_size] = str(summary.total)
else:
dictionary[model_name]["memory"][batch_size][slice_size] = "N/A"
if not no_speed:
runtimes = timeit.repeat(lambda: inference(sequence), repeat=average_over, number=3)
average_time = sum(runtimes) / float(len(runtimes)) / 3.0
dictionary[model_name]["time"][batch_size][slice_size] = average_time
else:
dictionary[model_name]["time"][batch_size][slice_size] = "N/A"
except tf.errors.ResourceExhaustedError as e:
print("Doesn't fit on GPU.", e)
torch.cuda.empty_cache()
dictionary[model_name]["results"][batch_size][slice_size] = "N/A"
print_fn("Doesn't fit on GPU. {}".format(e))
dictionary[model_name]["time"][batch_size][slice_size] = "N/A"
dictionary[model_name]["memory"][batch_size][slice_size] = "N/A"
return dictionary
def main():
parser = argparse.ArgumentParser()
parser.add_argument("--models", required=False, type=str, default='all', help="Model checkpoints to be provided "
"to the AutoModel classes. Leave "
"blank to benchmark the base version "
"of all available model "
"architectures.")
parser.add_argument("--torch", required=False, action="store_true", help="Benchmark the Pytorch version of the "
"models")
parser.add_argument("--torch_cuda", required=False, action="store_true", help="Pytorch only: run on available "
"cuda devices")
parser.add_argument("--torchscript", required=False, action="store_true", help="Pytorch only: trace the models "
"using torchscript")
parser.add_argument("--tensorflow", required=False, action="store_true", help="Benchmark the TensorFlow version "
"of the models. Will run on GPU if "
"the correct dependencies are "
"installed")
parser.add_argument(
"--models",
required=False,
type=str,
default="all",
help="Model checkpoints to be provided "
"to the AutoModel classes. Leave "
"blank to benchmark the base version "
"of all available model "
"architectures.",
)
parser.add_argument("--verbose", required=False, action="store_true", help="Verbose memory tracing")
parser.add_argument("--no_speed", required=False, action="store_true", help="Don't perform speed measurments")
parser.add_argument("--no_memory", required=False, action="store_true", help="Don't perform memory measurments")
parser.add_argument(
"--torch", required=False, action="store_true", help="Benchmark the Pytorch version of the " "models"
)
parser.add_argument(
"--torch_cuda", required=False, action="store_true", help="Pytorch only: run on available " "cuda devices"
)
parser.add_argument(
"--torchscript",
required=False,
action="store_true",
help="Pytorch only: trace the models " "using torchscript",
)
parser.add_argument(
"--tensorflow",
required=False,
action="store_true",
help="Benchmark the TensorFlow version "
"of the models. Will run on GPU if "
"the correct dependencies are "
"installed",
)
parser.add_argument("--xla", required=False, action="store_true", help="TensorFlow only: use XLA acceleration.")
parser.add_argument("--amp", required=False, action="store_true", help="TensorFlow only: use automatic mixed precision acceleration.")
parser.add_argument("--fp16", required=False, action="store_true", help="PyTorch only: use FP16 to accelerate inference.")
parser.add_argument("--keras_predict", required=False, action="store_true", help="Whether to use model.predict "
"instead of model() to do a "
"forward pass.")
parser.add_argument(
"--amp",
required=False,
action="store_true",
help="TensorFlow only: use automatic mixed precision acceleration.",
)
parser.add_argument(
"--fp16", required=False, action="store_true", help="PyTorch only: use FP16 to accelerate inference."
)
parser.add_argument(
"--keras_predict",
required=False,
action="store_true",
help="Whether to use model.predict " "instead of model() to do a " "forward pass.",
)
parser.add_argument("--save_to_csv", required=False, action="store_true", help="Save to a CSV file.")
parser.add_argument("--csv_filename", required=False, default=None, help="CSV filename used if saving results to csv.")
parser.add_argument("--average_over", required=False, default=30, type=int, help="Times an experiment will be run.")
parser.add_argument(
"--log_print", required=False, action="store_true", help="Save all print statements in log file."
)
parser.add_argument(
"--csv_time_filename",
required=False,
default=f"time_{round(time())}.csv",
help="CSV filename used if saving time results to csv.",
)
parser.add_argument(
"--csv_memory_filename",
required=False,
default=f"memory_{round(time())}.csv",
help="CSV filename used if saving memory results to csv.",
)
parser.add_argument(
"--log_filename",
required=False,
default=f"log_{round(time())}.txt",
help="Log filename used if print statements are saved in log.",
)
parser.add_argument(
"--average_over", required=False, default=30, type=int, help="Times an experiment will be run."
)
parser.add_argument("--batch_sizes", nargs="+", type=int, default=[1, 2, 4, 8])
parser.add_argument("--slice_sizes", nargs="+", type=int, default=[8, 64, 128, 256, 512, 1024])
args = parser.parse_args()
if args.models == 'all':
if args.models == "all":
args.models = [
"gpt2",
"bert-base-cased",
@@ -436,24 +652,34 @@ def main():
"distilbert-base-uncased",
"distilgpt2",
"roberta-base",
"ctrl"
"ctrl",
"t5-base",
"bart-large",
]
else:
args.models = args.models.split()
print("Running with arguments", args)
print_fn = get_print_function(args.log_print, args.log_filename)
print_fn("Running with arguments: {}".format(args))
if args.torch:
if is_torch_available():
create_setup_and_compute(
model_names=args.models,
batch_sizes=args.batch_sizes,
slice_sizes=args.slice_sizes,
tensorflow=False,
gpu=args.torch_cuda,
torchscript=args.torchscript,
fp16=args.fp16,
save_to_csv=args.save_to_csv,
csv_filename=args.csv_filename,
average_over=args.average_over
csv_time_filename=args.csv_time_filename,
csv_memory_filename=args.csv_memory_filename,
average_over=args.average_over,
no_speed=args.no_speed,
no_memory=args.no_memory,
verbose=args.verbose,
print_fn=print_fn,
)
else:
raise ImportError("Trying to run a PyTorch benchmark but PyTorch was not found in the environment.")
@@ -462,16 +688,23 @@ def main():
if is_tf_available():
create_setup_and_compute(
model_names=args.models,
batch_sizes=args.batch_sizes,
slice_sizes=args.slice_sizes,
tensorflow=True,
xla=args.xla,
amp=args.amp,
save_to_csv=args.save_to_csv,
csv_filename=args.csv_filename,
average_over=args.average_over
csv_time_filename=args.csv_time_filename,
csv_memory_filename=args.csv_memory_filename,
average_over=args.average_over,
no_speed=args.no_speed,
no_memory=args.no_memory,
verbose=args.verbose,
print_fn=print_fn,
)
else:
raise ImportError("Trying to run a TensorFlow benchmark but TensorFlow was not found in the environment.")
if __name__ == '__main__':
main()
if __name__ == "__main__":
main()

View File

@@ -19,29 +19,29 @@
Some parts of this script are adapted from the code of Michel et al. (http://arxiv.org/abs/1905.10650)
which is available at https://github.com/pmichel31415/are-16-heads-really-better-than-1
"""
import os
import argparse
import logging
from datetime import timedelta, datetime
from tqdm import tqdm
import os
from datetime import datetime
import numpy as np
import torch
from torch.utils.data import DataLoader, SequentialSampler, TensorDataset, Subset
from torch.utils.data import DataLoader, SequentialSampler, Subset
from torch.utils.data.distributed import DistributedSampler
from torch.nn import CrossEntropyLoss, MSELoss
from tqdm import tqdm
from transformers import (WEIGHTS_NAME,
BertConfig, BertForSequenceClassification, BertTokenizer,
XLMConfig, XLMForSequenceClassification, XLMTokenizer,
XLNetConfig, XLNetForSequenceClassification, XLNetTokenizer)
from transformers import (
AutoConfig,
AutoModelForSequenceClassification,
AutoTokenizer,
DefaultDataCollator,
GlueDataset,
glue_compute_metrics,
glue_output_modes,
glue_processors,
set_seed,
)
from run_glue import set_seed, load_and_cache_examples, ALL_MODELS, MODEL_CLASSES
from transformers import glue_compute_metrics as compute_metrics
from transformers import glue_output_modes as output_modes
from transformers import glue_processors as processors
logger = logging.getLogger(__name__)
@@ -63,13 +63,15 @@ def print_2d_tensor(tensor):
logger.info(f"layer {row + 1}:\t" + "\t".join(f"{x:d}" for x in tensor[row].cpu().data))
def compute_heads_importance(args, model, eval_dataloader, compute_entropy=True, compute_importance=True, head_mask=None):
def compute_heads_importance(
args, model, eval_dataloader, compute_entropy=True, compute_importance=True, head_mask=None
):
""" This method shows how to compute:
- head attention entropy
- head importance scores according to http://arxiv.org/abs/1905.10650
"""
# Prepare our tensors
n_layers, n_heads = model.bert.config.num_hidden_layers, model.bert.config.num_attention_heads
n_layers, n_heads = model.config.num_hidden_layers, model.config.num_attention_heads
head_importance = torch.zeros(n_layers, n_heads).to(args.device)
attn_entropy = torch.zeros(n_layers, n_heads).to(args.device)
@@ -80,18 +82,22 @@ def compute_heads_importance(args, model, eval_dataloader, compute_entropy=True,
labels = None
tot_tokens = 0.0
for step, batch in enumerate(tqdm(eval_dataloader, desc="Iteration", disable=args.local_rank not in [-1, 0])):
batch = tuple(t.to(args.device) for t in batch)
input_ids, input_mask, segment_ids, label_ids = batch
for step, inputs in enumerate(tqdm(eval_dataloader, desc="Iteration", disable=args.local_rank not in [-1, 0])):
for k, v in inputs.items():
inputs[k] = v.to(args.device)
# Do a forward pass (not with torch.no_grad() since we need gradients for importance score - see below)
outputs = model(input_ids, token_type_ids=segment_ids, attention_mask=input_mask, labels=label_ids, head_mask=head_mask)
loss, logits, all_attentions = outputs[0], outputs[1], outputs[-1] # Loss and logits are the first, attention the last
outputs = model(**inputs, head_mask=head_mask)
loss, logits, all_attentions = (
outputs[0],
outputs[1],
outputs[-1],
) # Loss and logits are the first, attention the last
loss.backward() # Backpropagate to populate the gradients in the head mask
if compute_entropy:
for layer, attn in enumerate(all_attentions):
masked_entropy = entropy(attn.detach()) * input_mask.float().unsqueeze(1)
masked_entropy = entropy(attn.detach()) * inputs["attention_mask"].float().unsqueeze(1)
attn_entropy[layer] += masked_entropy.sum(-1).sum(0).detach()
if compute_importance:
@@ -100,12 +106,12 @@ def compute_heads_importance(args, model, eval_dataloader, compute_entropy=True,
# Also store our logits/labels if we want to compute metrics afterwards
if preds is None:
preds = logits.detach().cpu().numpy()
labels = label_ids.detach().cpu().numpy()
labels = inputs["labels"].detach().cpu().numpy()
else:
preds = np.append(preds, logits.detach().cpu().numpy(), axis=0)
labels = np.append(labels, label_ids.detach().cpu().numpy(), axis=0)
labels = np.append(labels, inputs["labels"].detach().cpu().numpy(), axis=0)
tot_tokens += input_mask.float().detach().sum().data
tot_tokens += inputs["attention_mask"].float().detach().sum().data
# Normalize
attn_entropy /= tot_tokens
@@ -113,15 +119,15 @@ def compute_heads_importance(args, model, eval_dataloader, compute_entropy=True,
# Layerwise importance normalization
if not args.dont_normalize_importance_by_layer:
exponent = 2
norm_by_layer = torch.pow(torch.pow(head_importance, exponent).sum(-1), 1/exponent)
norm_by_layer = torch.pow(torch.pow(head_importance, exponent).sum(-1), 1 / exponent)
head_importance /= norm_by_layer.unsqueeze(-1) + 1e-20
if not args.dont_normalize_global_importance:
head_importance = (head_importance - head_importance.min()) / (head_importance.max() - head_importance.min())
# Print/save matrices
np.save(os.path.join(args.output_dir, 'attn_entropy.npy'), attn_entropy.detach().cpu().numpy())
np.save(os.path.join(args.output_dir, 'head_importance.npy'), head_importance.detach().cpu().numpy())
np.save(os.path.join(args.output_dir, "attn_entropy.npy"), attn_entropy.detach().cpu().numpy())
np.save(os.path.join(args.output_dir, "head_importance.npy"), head_importance.detach().cpu().numpy())
logger.info("Attention entropies")
print_2d_tensor(attn_entropy)
@@ -129,7 +135,9 @@ def compute_heads_importance(args, model, eval_dataloader, compute_entropy=True,
print_2d_tensor(head_importance)
logger.info("Head ranked by importance scores")
head_ranks = torch.zeros(head_importance.numel(), dtype=torch.long, device=args.device)
head_ranks[head_importance.view(-1).sort(descending=True)[1]] = torch.arange(head_importance.numel(), device=args.device)
head_ranks[head_importance.view(-1).sort(descending=True)[1]] = torch.arange(
head_importance.numel(), device=args.device
)
head_ranks = head_ranks.view_as(head_importance)
print_2d_tensor(head_ranks)
@@ -142,7 +150,7 @@ def mask_heads(args, model, eval_dataloader):
"""
_, head_importance, preds, labels = compute_heads_importance(args, model, eval_dataloader, compute_entropy=False)
preds = np.argmax(preds, axis=1) if args.output_mode == "classification" else np.squeeze(preds)
original_score = compute_metrics(args.task_name, preds, labels)[args.metric_name]
original_score = glue_compute_metrics(args.task_name, preds, labels)[args.metric_name]
logger.info("Pruning: original score: %f, threshold: %f", original_score, original_score * args.masking_threshold)
new_head_mask = torch.ones_like(head_importance)
@@ -150,9 +158,9 @@ def mask_heads(args, model, eval_dataloader):
current_score = original_score
while current_score >= original_score * args.masking_threshold:
head_mask = new_head_mask.clone() # save current head mask
head_mask = new_head_mask.clone() # save current head mask
# heads from least important to most - keep only not-masked heads
head_importance[head_mask == 0.0] = float('Inf')
head_importance[head_mask == 0.0] = float("Inf")
current_heads_to_mask = head_importance.view(-1).sort()[1]
if len(current_heads_to_mask) <= num_to_mask:
@@ -167,14 +175,21 @@ def mask_heads(args, model, eval_dataloader):
print_2d_tensor(new_head_mask)
# Compute metric and head importance again
_, head_importance, preds, labels = compute_heads_importance(args, model, eval_dataloader, compute_entropy=False, head_mask=new_head_mask)
_, head_importance, preds, labels = compute_heads_importance(
args, model, eval_dataloader, compute_entropy=False, head_mask=new_head_mask
)
preds = np.argmax(preds, axis=1) if args.output_mode == "classification" else np.squeeze(preds)
current_score = compute_metrics(args.task_name, preds, labels)[args.metric_name]
logger.info("Masking: current score: %f, remaning heads %d (%.1f percents)", current_score, new_head_mask.sum(), new_head_mask.sum()/new_head_mask.numel() * 100)
current_score = glue_compute_metrics(args.task_name, preds, labels)[args.metric_name]
logger.info(
"Masking: current score: %f, remaning heads %d (%.1f percents)",
current_score,
new_head_mask.sum(),
new_head_mask.sum() / new_head_mask.numel() * 100,
)
logger.info("Final head mask")
print_2d_tensor(head_mask)
np.save(os.path.join(args.output_dir, 'head_mask.npy'), head_mask.detach().cpu().numpy())
np.save(os.path.join(args.output_dir, "head_mask.npy"), head_mask.detach().cpu().numpy())
return head_mask
@@ -186,10 +201,11 @@ def prune_heads(args, model, eval_dataloader, head_mask):
# Try pruning and test time speedup
# Pruning is like masking but we actually remove the masked weights
before_time = datetime.now()
_, _, preds, labels = compute_heads_importance(args, model, eval_dataloader,
compute_entropy=False, compute_importance=False, head_mask=head_mask)
_, _, preds, labels = compute_heads_importance(
args, model, eval_dataloader, compute_entropy=False, compute_importance=False, head_mask=head_mask
)
preds = np.argmax(preds, axis=1) if args.output_mode == "classification" else np.squeeze(preds)
score_masking = compute_metrics(args.task_name, preds, labels)[args.metric_name]
score_masking = glue_compute_metrics(args.task_name, preds, labels)[args.metric_name]
original_time = datetime.now() - before_time
original_num_params = sum(p.numel() for p in model.parameters())
@@ -199,73 +215,127 @@ def prune_heads(args, model, eval_dataloader, head_mask):
pruned_num_params = sum(p.numel() for p in model.parameters())
before_time = datetime.now()
_, _, preds, labels = compute_heads_importance(args, model, eval_dataloader,
compute_entropy=False, compute_importance=False, head_mask=None)
_, _, preds, labels = compute_heads_importance(
args, model, eval_dataloader, compute_entropy=False, compute_importance=False, head_mask=None
)
preds = np.argmax(preds, axis=1) if args.output_mode == "classification" else np.squeeze(preds)
score_pruning = compute_metrics(args.task_name, preds, labels)[args.metric_name]
score_pruning = glue_compute_metrics(args.task_name, preds, labels)[args.metric_name]
new_time = datetime.now() - before_time
logger.info("Pruning: original num of params: %.2e, after pruning %.2e (%.1f percents)", original_num_params, pruned_num_params, pruned_num_params/original_num_params * 100)
logger.info(
"Pruning: original num of params: %.2e, after pruning %.2e (%.1f percents)",
original_num_params,
pruned_num_params,
pruned_num_params / original_num_params * 100,
)
logger.info("Pruning: score with masking: %f score with pruning: %f", score_masking, score_pruning)
logger.info("Pruning: speed ratio (new timing / original timing): %f percents", original_time/new_time * 100)
logger.info("Pruning: speed ratio (new timing / original timing): %f percents", original_time / new_time * 100)
def main():
parser = argparse.ArgumentParser()
## Required parameters
parser.add_argument("--data_dir", default=None, type=str, required=True,
help="The input data dir. Should contain the .tsv files (or other data files) for the task.")
parser.add_argument("--model_name_or_path", default=None, type=str, required=True,
help="Path to pre-trained model or shortcut name selected in the list: " + ", ".join(
ALL_MODELS))
parser.add_argument("--task_name", default=None, type=str, required=True,
help="The name of the task to train selected in the list: " + ", ".join(processors.keys()))
parser.add_argument("--output_dir", default=None, type=str, required=True,
help="The output directory where the model predictions and checkpoints will be written.")
# Required parameters
parser.add_argument(
"--data_dir",
default=None,
type=str,
required=True,
help="The input data dir. Should contain the .tsv files (or other data files) for the task.",
)
parser.add_argument(
"--model_name_or_path",
default=None,
type=str,
required=True,
help="Path to pretrained model or model identifier from huggingface.co/models",
)
parser.add_argument(
"--task_name",
default=None,
type=str,
required=True,
help="The name of the task to train selected in the list: " + ", ".join(glue_processors.keys()),
)
parser.add_argument(
"--output_dir",
default=None,
type=str,
required=True,
help="The output directory where the model predictions and checkpoints will be written.",
)
## Other parameters
parser.add_argument("--config_name", default="", type=str,
help="Pretrained config name or path if not the same as model_name_or_path")
parser.add_argument("--tokenizer_name", default="", type=str,
help="Pretrained tokenizer name or path if not the same as model_name_or_path")
parser.add_argument("--cache_dir", default="", type=str,
help="Where do you want to store the pre-trained models downloaded from s3")
parser.add_argument("--data_subset", type=int, default=-1,
help="If > 0: limit the data to a subset of data_subset instances.")
parser.add_argument("--overwrite_output_dir", action='store_true',
help="Whether to overwrite data in output directory")
parser.add_argument('--overwrite_cache', action='store_true',
help="Overwrite the cached training and evaluation sets")
# Other parameters
parser.add_argument(
"--config_name",
default="",
type=str,
help="Pretrained config name or path if not the same as model_name_or_path",
)
parser.add_argument(
"--tokenizer_name",
default="",
type=str,
help="Pretrained tokenizer name or path if not the same as model_name_or_path",
)
parser.add_argument(
"--cache_dir",
default=None,
type=str,
help="Where do you want to store the pre-trained models downloaded from s3",
)
parser.add_argument(
"--data_subset", type=int, default=-1, help="If > 0: limit the data to a subset of data_subset instances."
)
parser.add_argument(
"--overwrite_output_dir", action="store_true", help="Whether to overwrite data in output directory"
)
parser.add_argument(
"--overwrite_cache", action="store_true", help="Overwrite the cached training and evaluation sets"
)
parser.add_argument("--dont_normalize_importance_by_layer", action='store_true',
help="Don't normalize importance score by layers")
parser.add_argument("--dont_normalize_global_importance", action='store_true',
help="Don't normalize all importance scores between 0 and 1")
parser.add_argument(
"--dont_normalize_importance_by_layer", action="store_true", help="Don't normalize importance score by layers"
)
parser.add_argument(
"--dont_normalize_global_importance",
action="store_true",
help="Don't normalize all importance scores between 0 and 1",
)
parser.add_argument("--try_masking", action='store_true',
help="Whether to try to mask head until a threshold of accuracy.")
parser.add_argument("--masking_threshold", default=0.9, type=float,
help="masking threshold in term of metrics (stop masking when metric < threshold * original metric value).")
parser.add_argument("--masking_amount", default=0.1, type=float,
help="Amount to heads to masking at each masking step.")
parser.add_argument("--metric_name", default="acc", type=str,
help="Metric to use for head masking.")
parser.add_argument(
"--try_masking", action="store_true", help="Whether to try to mask head until a threshold of accuracy."
)
parser.add_argument(
"--masking_threshold",
default=0.9,
type=float,
help="masking threshold in term of metrics (stop masking when metric < threshold * original metric value).",
)
parser.add_argument(
"--masking_amount", default=0.1, type=float, help="Amount to heads to masking at each masking step."
)
parser.add_argument("--metric_name", default="acc", type=str, help="Metric to use for head masking.")
parser.add_argument("--max_seq_length", default=128, type=int,
help="The maximum total input sequence length after WordPiece tokenization. \n"
"Sequences longer than this will be truncated, sequences shorter padded.")
parser.add_argument(
"--max_seq_length",
default=128,
type=int,
help="The maximum total input sequence length after WordPiece tokenization. \n"
"Sequences longer than this will be truncated, sequences shorter padded.",
)
parser.add_argument("--batch_size", default=1, type=int, help="Batch size.")
parser.add_argument("--seed", type=int, default=42)
parser.add_argument("--local_rank", type=int, default=-1, help="local_rank for distributed training on gpus")
parser.add_argument("--no_cuda", action='store_true', help="Whether not to use CUDA when available")
parser.add_argument('--server_ip', type=str, default='', help="Can be used for distant debugging.")
parser.add_argument('--server_port', type=str, default='', help="Can be used for distant debugging.")
parser.add_argument("--no_cuda", action="store_true", help="Whether not to use CUDA when available")
parser.add_argument("--server_ip", type=str, default="", help="Can be used for distant debugging.")
parser.add_argument("--server_port", type=str, default="", help="Can be used for distant debugging.")
args = parser.parse_args()
if args.server_ip and args.server_port:
# Distant debugging - see https://code.visualstudio.com/docs/python/debugging#_attach-to-a-local-script
import ptvsd
print("Waiting for debugger attach")
ptvsd.enable_attach(address=(args.server_ip, args.server_port), redirect_output=True)
ptvsd.wait_for_attach()
@@ -273,79 +343,78 @@ def main():
# Setup devices and distributed training
if args.local_rank == -1 or args.no_cuda:
args.device = torch.device("cuda" if torch.cuda.is_available() and not args.no_cuda else "cpu")
args.n_gpu = torch.cuda.device_count()
args.n_gpu = 0 if args.no_cuda else torch.cuda.device_count()
else:
torch.cuda.set_device(args.local_rank)
args.device = torch.device("cuda", args.local_rank)
args.n_gpu = 1
torch.distributed.init_process_group(backend='nccl') # Initializes the distributed backend
torch.distributed.init_process_group(backend="nccl") # Initializes the distributed backend
# Setup logging
logging.basicConfig(level = logging.INFO if args.local_rank in [-1, 0] else logging.WARN)
logging.basicConfig(level=logging.INFO if args.local_rank in [-1, 0] else logging.WARN)
logger.info("device: {} n_gpu: {}, distributed: {}".format(args.device, args.n_gpu, bool(args.local_rank != -1)))
# Set seeds
set_seed(args)
set_seed(args.seed)
# Prepare GLUE task
args.task_name = args.task_name.lower()
if args.task_name not in processors:
if args.task_name not in glue_processors:
raise ValueError("Task not found: %s" % (args.task_name))
processor = processors[args.task_name]()
args.output_mode = output_modes[args.task_name]
processor = glue_processors[args.task_name]()
args.output_mode = glue_output_modes[args.task_name]
label_list = processor.get_labels()
num_labels = len(label_list)
# Load pretrained model and tokenizer
if args.local_rank not in [-1, 0]:
torch.distributed.barrier() # Make sure only the first process in distributed training will download model & vocab
#
# Distributed training:
# The .from_pretrained methods guarantee that only one local process can concurrently
# download model & vocab.
args.model_type = ""
for key in MODEL_CLASSES:
if key in args.model_name_or_path.lower():
args.model_type = key # take the first match in model types
break
config_class, model_class, tokenizer_class = MODEL_CLASSES[args.model_type]
config = config_class.from_pretrained(args.config_name if args.config_name else args.model_name_or_path,
num_labels=num_labels,
finetuning_task=args.task_name,
output_attentions=True,
cache_dir=args.cache_dir if args.cache_dir else None)
tokenizer = tokenizer_class.from_pretrained(args.tokenizer_name if args.tokenizer_name else args.model_name_or_path,
cache_dir=args.cache_dir if args.cache_dir else None)
model = model_class.from_pretrained(args.model_name_or_path,
from_tf=bool('.ckpt' in args.model_name_or_path),
config=config,
cache_dir=args.cache_dir if args.cache_dir else None)
if args.local_rank == 0:
torch.distributed.barrier() # Make sure only the first process in distributed training will download model & vocab
config = AutoConfig.from_pretrained(
args.config_name if args.config_name else args.model_name_or_path,
num_labels=num_labels,
finetuning_task=args.task_name,
output_attentions=True,
cache_dir=args.cache_dir,
)
tokenizer = AutoTokenizer.from_pretrained(
args.tokenizer_name if args.tokenizer_name else args.model_name_or_path, cache_dir=args.cache_dir,
)
model = AutoModelForSequenceClassification.from_pretrained(
args.model_name_or_path,
from_tf=bool(".ckpt" in args.model_name_or_path),
config=config,
cache_dir=args.cache_dir,
)
# Distributed and parallel training
model.to(args.device)
if args.local_rank != -1:
model = torch.nn.parallel.DistributedDataParallel(model, device_ids=[args.local_rank],
output_device=args.local_rank,
find_unused_parameters=True)
model = torch.nn.parallel.DistributedDataParallel(
model, device_ids=[args.local_rank], output_device=args.local_rank, find_unused_parameters=True
)
elif args.n_gpu > 1:
model = torch.nn.DataParallel(model)
# Print/save training arguments
torch.save(args, os.path.join(args.output_dir, 'run_args.bin'))
os.makedirs(args.output_dir, exist_ok=True)
torch.save(args, os.path.join(args.output_dir, "run_args.bin"))
logger.info("Training/evaluation parameters %s", args)
# Prepare dataset for the GLUE task
eval_data = load_and_cache_examples(args, args.task_name, tokenizer, evaluate=True)
eval_dataset = GlueDataset(args, tokenizer=tokenizer, evaluate=True, local_rank=args.local_rank)
if args.data_subset > 0:
eval_data = Subset(eval_data, list(range(min(args.data_subset, len(eval_data)))))
eval_sampler = SequentialSampler(eval_data) if args.local_rank == -1 else DistributedSampler(eval_data)
eval_dataloader = DataLoader(eval_data, sampler=eval_sampler, batch_size=args.batch_size)
eval_dataset = Subset(eval_dataset, list(range(min(args.data_subset, len(eval_dataset)))))
eval_sampler = SequentialSampler(eval_dataset) if args.local_rank == -1 else DistributedSampler(eval_dataset)
eval_dataloader = DataLoader(
eval_dataset, sampler=eval_sampler, batch_size=args.batch_size, collate_fn=DefaultDataCollator().collate_batch
)
# Compute head entropy and importance score
compute_heads_importance(args, model, eval_dataloader)
# Try head masking (set heads to zero until the score goes under a threshole)
# and head pruning (remove masked heads and see the effect on the network)
if args.try_masking and args.masking_threshold > 0.0 and args.masking_threshold < 1.0:
@@ -353,5 +422,5 @@ def main():
prune_heads(args, model, eval_dataloader, head_mask)
if __name__ == '__main__':
if __name__ == "__main__":
main()

View File

@@ -0,0 +1,23 @@
## MM-IMDb
Based on the script [`run_mmimdb.py`](https://github.com/huggingface/transformers/blob/master/examples/contrib/mm-imdb/run_mmimdb.py).
[MM-IMDb](http://lisi1.unal.edu.co/mmimdb/) is a Multimodal dataset with around 26,000 movies including images, plots and other metadata.
### Training on MM-IMDb
```
python run_mmimdb.py \
--data_dir /path/to/mmimdb/dataset/ \
--model_type bert \
--model_name_or_path bert-base-uncased \
--output_dir /path/to/save/dir/ \
--do_train \
--do_eval \
--max_seq_len 512 \
--gradient_accumulation_steps 20 \
--num_image_embeds 3 \
--num_train_epochs 100 \
--patience 5
```

View File

@@ -0,0 +1,614 @@
# coding=utf-8
# Copyright (c) Facebook, Inc. and its affiliates.
# Copyright (c) HuggingFace Inc. team.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
""" Finetuning the library models for multimodal multiclass prediction on MM-IMDB dataset."""
import argparse
import glob
import json
import logging
import os
import random
import numpy as np
import torch
import torch.nn as nn
from sklearn.metrics import f1_score
from torch.utils.data import DataLoader, RandomSampler, SequentialSampler
from torch.utils.data.distributed import DistributedSampler
from tqdm import tqdm, trange
from transformers import (
WEIGHTS_NAME,
AdamW,
AlbertConfig,
AlbertModel,
AlbertTokenizer,
BertConfig,
BertModel,
BertTokenizer,
DistilBertConfig,
DistilBertModel,
DistilBertTokenizer,
MMBTConfig,
MMBTForClassification,
RobertaConfig,
RobertaModel,
RobertaTokenizer,
XLMConfig,
XLMModel,
XLMTokenizer,
XLNetConfig,
XLNetModel,
XLNetTokenizer,
get_linear_schedule_with_warmup,
)
from utils_mmimdb import ImageEncoder, JsonlDataset, collate_fn, get_image_transforms, get_mmimdb_labels
try:
from torch.utils.tensorboard import SummaryWriter
except ImportError:
from tensorboardX import SummaryWriter
logger = logging.getLogger(__name__)
ALL_MODELS = sum(
(
tuple(conf.pretrained_config_archive_map.keys())
for conf in (BertConfig, XLNetConfig, XLMConfig, RobertaConfig, DistilBertConfig)
),
(),
)
MODEL_CLASSES = {
"bert": (BertConfig, BertModel, BertTokenizer),
"xlnet": (XLNetConfig, XLNetModel, XLNetTokenizer),
"xlm": (XLMConfig, XLMModel, XLMTokenizer),
"roberta": (RobertaConfig, RobertaModel, RobertaTokenizer),
"distilbert": (DistilBertConfig, DistilBertModel, DistilBertTokenizer),
"albert": (AlbertConfig, AlbertModel, AlbertTokenizer),
}
def set_seed(args):
random.seed(args.seed)
np.random.seed(args.seed)
torch.manual_seed(args.seed)
if args.n_gpu > 0:
torch.cuda.manual_seed_all(args.seed)
def train(args, train_dataset, model, tokenizer, criterion):
""" Train the model """
if args.local_rank in [-1, 0]:
tb_writer = SummaryWriter()
args.train_batch_size = args.per_gpu_train_batch_size * max(1, args.n_gpu)
train_sampler = RandomSampler(train_dataset) if args.local_rank == -1 else DistributedSampler(train_dataset)
train_dataloader = DataLoader(
train_dataset,
sampler=train_sampler,
batch_size=args.train_batch_size,
collate_fn=collate_fn,
num_workers=args.num_workers,
)
if args.max_steps > 0:
t_total = args.max_steps
args.num_train_epochs = args.max_steps // (len(train_dataloader) // args.gradient_accumulation_steps) + 1
else:
t_total = len(train_dataloader) // args.gradient_accumulation_steps * args.num_train_epochs
# Prepare optimizer and schedule (linear warmup and decay)
no_decay = ["bias", "LayerNorm.weight"]
optimizer_grouped_parameters = [
{
"params": [p for n, p in model.named_parameters() if not any(nd in n for nd in no_decay)],
"weight_decay": args.weight_decay,
},
{"params": [p for n, p in model.named_parameters() if any(nd in n for nd in no_decay)], "weight_decay": 0.0},
]
optimizer = AdamW(optimizer_grouped_parameters, lr=args.learning_rate, eps=args.adam_epsilon)
scheduler = get_linear_schedule_with_warmup(
optimizer, num_warmup_steps=args.warmup_steps, num_training_steps=t_total
)
if args.fp16:
try:
from apex import amp
except ImportError:
raise ImportError("Please install apex from https://www.github.com/nvidia/apex to use fp16 training.")
model, optimizer = amp.initialize(model, optimizer, opt_level=args.fp16_opt_level)
# multi-gpu training (should be after apex fp16 initialization)
if args.n_gpu > 1:
model = torch.nn.DataParallel(model)
# Distributed training (should be after apex fp16 initialization)
if args.local_rank != -1:
model = torch.nn.parallel.DistributedDataParallel(
model, device_ids=[args.local_rank], output_device=args.local_rank, find_unused_parameters=True
)
# Train!
logger.info("***** Running training *****")
logger.info(" Num examples = %d", len(train_dataset))
logger.info(" Num Epochs = %d", args.num_train_epochs)
logger.info(" Instantaneous batch size per GPU = %d", args.per_gpu_train_batch_size)
logger.info(
" Total train batch size (w. parallel, distributed & accumulation) = %d",
args.train_batch_size
* args.gradient_accumulation_steps
* (torch.distributed.get_world_size() if args.local_rank != -1 else 1),
)
logger.info(" Gradient Accumulation steps = %d", args.gradient_accumulation_steps)
logger.info(" Total optimization steps = %d", t_total)
global_step = 0
tr_loss, logging_loss = 0.0, 0.0
best_f1, n_no_improve = 0, 0
model.zero_grad()
train_iterator = trange(int(args.num_train_epochs), desc="Epoch", disable=args.local_rank not in [-1, 0])
set_seed(args) # Added here for reproductibility
for _ in train_iterator:
epoch_iterator = tqdm(train_dataloader, desc="Iteration", disable=args.local_rank not in [-1, 0])
for step, batch in enumerate(epoch_iterator):
model.train()
batch = tuple(t.to(args.device) for t in batch)
labels = batch[5]
inputs = {
"input_ids": batch[0],
"input_modal": batch[2],
"attention_mask": batch[1],
"modal_start_tokens": batch[3],
"modal_end_tokens": batch[4],
}
outputs = model(**inputs)
logits = outputs[0] # model outputs are always tuple in transformers (see doc)
loss = criterion(logits, labels)
if args.n_gpu > 1:
loss = loss.mean() # mean() to average on multi-gpu parallel training
if args.gradient_accumulation_steps > 1:
loss = loss / args.gradient_accumulation_steps
if args.fp16:
with amp.scale_loss(loss, optimizer) as scaled_loss:
scaled_loss.backward()
else:
loss.backward()
tr_loss += loss.item()
if (step + 1) % args.gradient_accumulation_steps == 0:
if args.fp16:
torch.nn.utils.clip_grad_norm_(amp.master_params(optimizer), args.max_grad_norm)
else:
torch.nn.utils.clip_grad_norm_(model.parameters(), args.max_grad_norm)
optimizer.step()
scheduler.step() # Update learning rate schedule
model.zero_grad()
global_step += 1
if args.local_rank in [-1, 0] and args.logging_steps > 0 and global_step % args.logging_steps == 0:
logs = {}
if (
args.local_rank == -1 and args.evaluate_during_training
): # Only evaluate when single GPU otherwise metrics may not average well
results = evaluate(args, model, tokenizer, criterion)
for key, value in results.items():
eval_key = "eval_{}".format(key)
logs[eval_key] = value
loss_scalar = (tr_loss - logging_loss) / args.logging_steps
learning_rate_scalar = scheduler.get_lr()[0]
logs["learning_rate"] = learning_rate_scalar
logs["loss"] = loss_scalar
logging_loss = tr_loss
for key, value in logs.items():
tb_writer.add_scalar(key, value, global_step)
print(json.dumps({**logs, **{"step": global_step}}))
if args.local_rank in [-1, 0] and args.save_steps > 0 and global_step % args.save_steps == 0:
# Save model checkpoint
output_dir = os.path.join(args.output_dir, "checkpoint-{}".format(global_step))
if not os.path.exists(output_dir):
os.makedirs(output_dir)
model_to_save = (
model.module if hasattr(model, "module") else model
) # Take care of distributed/parallel training
torch.save(model_to_save.state_dict(), os.path.join(output_dir, WEIGHTS_NAME))
torch.save(args, os.path.join(output_dir, "training_args.bin"))
logger.info("Saving model checkpoint to %s", output_dir)
if args.max_steps > 0 and global_step > args.max_steps:
epoch_iterator.close()
break
if args.max_steps > 0 and global_step > args.max_steps:
train_iterator.close()
break
if args.local_rank == -1:
results = evaluate(args, model, tokenizer, criterion)
if results["micro_f1"] > best_f1:
best_f1 = results["micro_f1"]
n_no_improve = 0
else:
n_no_improve += 1
if n_no_improve > args.patience:
train_iterator.close()
break
if args.local_rank in [-1, 0]:
tb_writer.close()
return global_step, tr_loss / global_step
def evaluate(args, model, tokenizer, criterion, prefix=""):
# Loop to handle MNLI double evaluation (matched, mis-matched)
eval_output_dir = args.output_dir
eval_dataset = load_examples(args, tokenizer, evaluate=True)
if not os.path.exists(eval_output_dir) and args.local_rank in [-1, 0]:
os.makedirs(eval_output_dir)
args.eval_batch_size = args.per_gpu_eval_batch_size * max(1, args.n_gpu)
# Note that DistributedSampler samples randomly
eval_sampler = SequentialSampler(eval_dataset)
eval_dataloader = DataLoader(
eval_dataset, sampler=eval_sampler, batch_size=args.eval_batch_size, collate_fn=collate_fn
)
# multi-gpu eval
if args.n_gpu > 1 and not isinstance(model, torch.nn.DataParallel):
model = torch.nn.DataParallel(model)
# Eval!
logger.info("***** Running evaluation {} *****".format(prefix))
logger.info(" Num examples = %d", len(eval_dataset))
logger.info(" Batch size = %d", args.eval_batch_size)
eval_loss = 0.0
nb_eval_steps = 0
preds = None
out_label_ids = None
for batch in tqdm(eval_dataloader, desc="Evaluating"):
model.eval()
batch = tuple(t.to(args.device) for t in batch)
with torch.no_grad():
batch = tuple(t.to(args.device) for t in batch)
labels = batch[5]
inputs = {
"input_ids": batch[0],
"input_modal": batch[2],
"attention_mask": batch[1],
"modal_start_tokens": batch[3],
"modal_end_tokens": batch[4],
}
outputs = model(**inputs)
logits = outputs[0] # model outputs are always tuple in transformers (see doc)
tmp_eval_loss = criterion(logits, labels)
eval_loss += tmp_eval_loss.mean().item()
nb_eval_steps += 1
if preds is None:
preds = torch.sigmoid(logits).detach().cpu().numpy() > 0.5
out_label_ids = labels.detach().cpu().numpy()
else:
preds = np.append(preds, torch.sigmoid(logits).detach().cpu().numpy() > 0.5, axis=0)
out_label_ids = np.append(out_label_ids, labels.detach().cpu().numpy(), axis=0)
eval_loss = eval_loss / nb_eval_steps
result = {
"loss": eval_loss,
"macro_f1": f1_score(out_label_ids, preds, average="macro"),
"micro_f1": f1_score(out_label_ids, preds, average="micro"),
}
output_eval_file = os.path.join(eval_output_dir, prefix, "eval_results.txt")
with open(output_eval_file, "w") as writer:
logger.info("***** Eval results {} *****".format(prefix))
for key in sorted(result.keys()):
logger.info(" %s = %s", key, str(result[key]))
writer.write("%s = %s\n" % (key, str(result[key])))
return result
def load_examples(args, tokenizer, evaluate=False):
path = os.path.join(args.data_dir, "dev.jsonl" if evaluate else "train.jsonl")
transforms = get_image_transforms()
labels = get_mmimdb_labels()
dataset = JsonlDataset(path, tokenizer, transforms, labels, args.max_seq_length - args.num_image_embeds - 2)
return dataset
def main():
parser = argparse.ArgumentParser()
# Required parameters
parser.add_argument(
"--data_dir",
default=None,
type=str,
required=True,
help="The input data dir. Should contain the .jsonl files for MMIMDB.",
)
parser.add_argument(
"--model_type",
default=None,
type=str,
required=True,
help="Model type selected in the list: " + ", ".join(MODEL_CLASSES.keys()),
)
parser.add_argument(
"--model_name_or_path",
default=None,
type=str,
required=True,
help="Path to pre-trained model or shortcut name selected in the list: " + ", ".join(ALL_MODELS),
)
parser.add_argument(
"--output_dir",
default=None,
type=str,
required=True,
help="The output directory where the model predictions and checkpoints will be written.",
)
# Other parameters
parser.add_argument(
"--config_name", default="", type=str, help="Pretrained config name or path if not the same as model_name"
)
parser.add_argument(
"--tokenizer_name",
default="",
type=str,
help="Pretrained tokenizer name or path if not the same as model_name",
)
parser.add_argument(
"--cache_dir",
default="",
type=str,
help="Where do you want to store the pre-trained models downloaded from s3",
)
parser.add_argument(
"--max_seq_length",
default=128,
type=int,
help="The maximum total input sequence length after tokenization. Sequences longer "
"than this will be truncated, sequences shorter will be padded.",
)
parser.add_argument(
"--num_image_embeds", default=1, type=int, help="Number of Image Embeddings from the Image Encoder"
)
parser.add_argument("--do_train", action="store_true", help="Whether to run training.")
parser.add_argument("--do_eval", action="store_true", help="Whether to run eval on the dev set.")
parser.add_argument(
"--evaluate_during_training", action="store_true", help="Rul evaluation during training at each logging step."
)
parser.add_argument(
"--do_lower_case", action="store_true", help="Set this flag if you are using an uncased model."
)
parser.add_argument("--per_gpu_train_batch_size", default=8, type=int, help="Batch size per GPU/CPU for training.")
parser.add_argument(
"--per_gpu_eval_batch_size", default=8, type=int, help="Batch size per GPU/CPU for evaluation."
)
parser.add_argument(
"--gradient_accumulation_steps",
type=int,
default=1,
help="Number of updates steps to accumulate before performing a backward/update pass.",
)
parser.add_argument("--learning_rate", default=5e-5, type=float, help="The initial learning rate for Adam.")
parser.add_argument("--weight_decay", default=0.0, type=float, help="Weight deay if we apply some.")
parser.add_argument("--adam_epsilon", default=1e-8, type=float, help="Epsilon for Adam optimizer.")
parser.add_argument("--max_grad_norm", default=1.0, type=float, help="Max gradient norm.")
parser.add_argument(
"--num_train_epochs", default=3.0, type=float, help="Total number of training epochs to perform."
)
parser.add_argument("--patience", default=5, type=int, help="Patience for Early Stopping.")
parser.add_argument(
"--max_steps",
default=-1,
type=int,
help="If > 0: set total number of training steps to perform. Override num_train_epochs.",
)
parser.add_argument("--warmup_steps", default=0, type=int, help="Linear warmup over warmup_steps.")
parser.add_argument("--logging_steps", type=int, default=50, help="Log every X updates steps.")
parser.add_argument("--save_steps", type=int, default=50, help="Save checkpoint every X updates steps.")
parser.add_argument(
"--eval_all_checkpoints",
action="store_true",
help="Evaluate all checkpoints starting with the same prefix as model_name ending and ending with step number",
)
parser.add_argument("--no_cuda", action="store_true", help="Avoid using CUDA when available")
parser.add_argument("--num_workers", type=int, default=8, help="number of worker threads for dataloading")
parser.add_argument(
"--overwrite_output_dir", action="store_true", help="Overwrite the content of the output directory"
)
parser.add_argument(
"--overwrite_cache", action="store_true", help="Overwrite the cached training and evaluation sets"
)
parser.add_argument("--seed", type=int, default=42, help="random seed for initialization")
parser.add_argument(
"--fp16",
action="store_true",
help="Whether to use 16-bit (mixed) precision (through NVIDIA apex) instead of 32-bit",
)
parser.add_argument(
"--fp16_opt_level",
type=str,
default="O1",
help="For fp16: Apex AMP optimization level selected in ['O0', 'O1', 'O2', and 'O3']."
"See details at https://nvidia.github.io/apex/amp.html",
)
parser.add_argument("--local_rank", type=int, default=-1, help="For distributed training: local_rank")
parser.add_argument("--server_ip", type=str, default="", help="For distant debugging.")
parser.add_argument("--server_port", type=str, default="", help="For distant debugging.")
args = parser.parse_args()
if (
os.path.exists(args.output_dir)
and os.listdir(args.output_dir)
and args.do_train
and not args.overwrite_output_dir
):
raise ValueError(
"Output directory ({}) already exists and is not empty. Use --overwrite_output_dir to overcome.".format(
args.output_dir
)
)
# Setup distant debugging if needed
if args.server_ip and args.server_port:
# Distant debugging - see https://code.visualstudio.com/docs/python/debugging#_attach-to-a-local-script
import ptvsd
print("Waiting for debugger attach")
ptvsd.enable_attach(address=(args.server_ip, args.server_port), redirect_output=True)
ptvsd.wait_for_attach()
# Setup CUDA, GPU & distributed training
if args.local_rank == -1 or args.no_cuda:
device = torch.device("cuda" if torch.cuda.is_available() and not args.no_cuda else "cpu")
args.n_gpu = 0 if args.no_cuda else torch.cuda.device_count()
else: # Initializes the distributed backend which will take care of sychronizing nodes/GPUs
torch.cuda.set_device(args.local_rank)
device = torch.device("cuda", args.local_rank)
torch.distributed.init_process_group(backend="nccl")
args.n_gpu = 1
args.device = device
# Setup logging
logging.basicConfig(
format="%(asctime)s - %(levelname)s - %(name)s - %(message)s",
datefmt="%m/%d/%Y %H:%M:%S",
level=logging.INFO if args.local_rank in [-1, 0] else logging.WARN,
)
logger.warning(
"Process rank: %s, device: %s, n_gpu: %s, distributed training: %s, 16-bits training: %s",
args.local_rank,
device,
args.n_gpu,
bool(args.local_rank != -1),
args.fp16,
)
# Set seed
set_seed(args)
# Load pretrained model and tokenizer
if args.local_rank not in [-1, 0]:
torch.distributed.barrier() # Make sure only the first process in distributed training will download model & vocab
# Setup model
labels = get_mmimdb_labels()
num_labels = len(labels)
args.model_type = args.model_type.lower()
config_class, model_class, tokenizer_class = MODEL_CLASSES[args.model_type]
transformer_config = config_class.from_pretrained(
args.config_name if args.config_name else args.model_name_or_path
)
tokenizer = tokenizer_class.from_pretrained(
args.tokenizer_name if args.tokenizer_name else args.model_name_or_path,
do_lower_case=args.do_lower_case,
cache_dir=args.cache_dir if args.cache_dir else None,
)
transformer = model_class.from_pretrained(
args.model_name_or_path, config=transformer_config, cache_dir=args.cache_dir if args.cache_dir else None
)
img_encoder = ImageEncoder(args)
config = MMBTConfig(transformer_config, num_labels=num_labels)
model = MMBTForClassification(config, transformer, img_encoder)
if args.local_rank == 0:
torch.distributed.barrier() # Make sure only the first process in distributed training will download model & vocab
model.to(args.device)
logger.info("Training/evaluation parameters %s", args)
# Training
if args.do_train:
train_dataset = load_examples(args, tokenizer, evaluate=False)
label_frequences = train_dataset.get_label_frequencies()
label_frequences = [label_frequences[l] for l in labels]
label_weights = (
torch.tensor(label_frequences, device=args.device, dtype=torch.float) / len(train_dataset)
) ** -1
criterion = nn.BCEWithLogitsLoss(pos_weight=label_weights)
global_step, tr_loss = train(args, train_dataset, model, tokenizer, criterion)
logger.info(" global_step = %s, average loss = %s", global_step, tr_loss)
# Saving best-practices: if you use defaults names for the model, you can reload it using from_pretrained()
if args.do_train and (args.local_rank == -1 or torch.distributed.get_rank() == 0):
# Create output directory if needed
if not os.path.exists(args.output_dir) and args.local_rank in [-1, 0]:
os.makedirs(args.output_dir)
logger.info("Saving model checkpoint to %s", args.output_dir)
# Save a trained model, configuration and tokenizer using `save_pretrained()`.
# They can then be reloaded using `from_pretrained()`
model_to_save = (
model.module if hasattr(model, "module") else model
) # Take care of distributed/parallel training
torch.save(model_to_save.state_dict(), os.path.join(args.output_dir, WEIGHTS_NAME))
tokenizer.save_pretrained(args.output_dir)
# Good practice: save your training arguments together with the trained model
torch.save(args, os.path.join(args.output_dir, "training_args.bin"))
# Load a trained model and vocabulary that you have fine-tuned
model = MMBTForClassification(config, transformer, img_encoder)
model.load_state_dict(torch.load(os.path.join(args.output_dir, WEIGHTS_NAME)))
tokenizer = tokenizer_class.from_pretrained(args.output_dir)
model.to(args.device)
# Evaluation
results = {}
if args.do_eval and args.local_rank in [-1, 0]:
tokenizer = tokenizer_class.from_pretrained(args.output_dir, do_lower_case=args.do_lower_case)
checkpoints = [args.output_dir]
if args.eval_all_checkpoints:
checkpoints = list(
os.path.dirname(c) for c in sorted(glob.glob(args.output_dir + "/**/" + WEIGHTS_NAME, recursive=True))
)
logging.getLogger("transformers.modeling_utils").setLevel(logging.WARN) # Reduce logging
logger.info("Evaluate the following checkpoints: %s", checkpoints)
for checkpoint in checkpoints:
global_step = checkpoint.split("-")[-1] if len(checkpoints) > 1 else ""
prefix = checkpoint.split("/")[-1] if checkpoint.find("checkpoint") != -1 else ""
model = MMBTForClassification(config, transformer, img_encoder)
model.load_state_dict(torch.load(checkpoint))
model.to(args.device)
result = evaluate(args, model, tokenizer, criterion, prefix=prefix)
result = dict((k + "_{}".format(global_step), v) for k, v in result.items())
results.update(result)
return results
if __name__ == "__main__":
main()

View File

@@ -0,0 +1,143 @@
# coding=utf-8
# Copyright (c) Facebook, Inc. and its affiliates.
# Copyright (c) HuggingFace Inc. team.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
import json
import os
from collections import Counter
import torch
import torch.nn as nn
import torchvision
import torchvision.transforms as transforms
from PIL import Image
from torch.utils.data import Dataset
POOLING_BREAKDOWN = {1: (1, 1), 2: (2, 1), 3: (3, 1), 4: (2, 2), 5: (5, 1), 6: (3, 2), 7: (7, 1), 8: (4, 2), 9: (3, 3)}
class ImageEncoder(nn.Module):
def __init__(self, args):
super().__init__()
model = torchvision.models.resnet152(pretrained=True)
modules = list(model.children())[:-2]
self.model = nn.Sequential(*modules)
self.pool = nn.AdaptiveAvgPool2d(POOLING_BREAKDOWN[args.num_image_embeds])
def forward(self, x):
# Bx3x224x224 -> Bx2048x7x7 -> Bx2048xN -> BxNx2048
out = self.pool(self.model(x))
out = torch.flatten(out, start_dim=2)
out = out.transpose(1, 2).contiguous()
return out # BxNx2048
class JsonlDataset(Dataset):
def __init__(self, data_path, tokenizer, transforms, labels, max_seq_length):
self.data = [json.loads(l) for l in open(data_path)]
self.data_dir = os.path.dirname(data_path)
self.tokenizer = tokenizer
self.labels = labels
self.n_classes = len(labels)
self.max_seq_length = max_seq_length
self.transforms = transforms
def __len__(self):
return len(self.data)
def __getitem__(self, index):
sentence = torch.LongTensor(self.tokenizer.encode(self.data[index]["text"], add_special_tokens=True))
start_token, sentence, end_token = sentence[0], sentence[1:-1], sentence[-1]
sentence = sentence[: self.max_seq_length]
label = torch.zeros(self.n_classes)
label[[self.labels.index(tgt) for tgt in self.data[index]["label"]]] = 1
image = Image.open(os.path.join(self.data_dir, self.data[index]["img"])).convert("RGB")
image = self.transforms(image)
return {
"image_start_token": start_token,
"image_end_token": end_token,
"sentence": sentence,
"image": image,
"label": label,
}
def get_label_frequencies(self):
label_freqs = Counter()
for row in self.data:
label_freqs.update(row["label"])
return label_freqs
def collate_fn(batch):
lens = [len(row["sentence"]) for row in batch]
bsz, max_seq_len = len(batch), max(lens)
mask_tensor = torch.zeros(bsz, max_seq_len, dtype=torch.long)
text_tensor = torch.zeros(bsz, max_seq_len, dtype=torch.long)
for i_batch, (input_row, length) in enumerate(zip(batch, lens)):
text_tensor[i_batch, :length] = input_row["sentence"]
mask_tensor[i_batch, :length] = 1
img_tensor = torch.stack([row["image"] for row in batch])
tgt_tensor = torch.stack([row["label"] for row in batch])
img_start_token = torch.stack([row["image_start_token"] for row in batch])
img_end_token = torch.stack([row["image_end_token"] for row in batch])
return text_tensor, mask_tensor, img_tensor, img_start_token, img_end_token, tgt_tensor
def get_mmimdb_labels():
return [
"Crime",
"Drama",
"Thriller",
"Action",
"Comedy",
"Romance",
"Documentary",
"Short",
"Mystery",
"History",
"Family",
"Adventure",
"Fantasy",
"Sci-Fi",
"Western",
"Horror",
"Sport",
"War",
"Music",
"Musical",
"Animation",
"Biography",
"Film-Noir",
]
def get_image_transforms():
return transforms.Compose(
[
transforms.Resize(256),
transforms.CenterCrop(224),
transforms.ToTensor(),
transforms.Normalize(mean=[0.46777044, 0.44531429, 0.40661017], std=[0.12221994, 0.12145835, 0.14380469],),
]
)

View File

@@ -1,47 +1,42 @@
from pathlib import Path
import tarfile
import urllib.request
import torch
from transformers.tokenization_camembert import CamembertTokenizer
from transformers.modeling_camembert import CamembertForMaskedLM
from transformers.tokenization_camembert import CamembertTokenizer
def fill_mask(masked_input, model, tokenizer, topk=5):
# Adapted from https://github.com/pytorch/fairseq/blob/master/fairseq/models/roberta/hub_interface.py
assert masked_input.count('<mask>') == 1
assert masked_input.count("<mask>") == 1
input_ids = torch.tensor(tokenizer.encode(masked_input, add_special_tokens=True)).unsqueeze(0) # Batch size 1
logits = model(input_ids)[0] # The last hidden-state is the first element of the output tuple
masked_index = (input_ids.squeeze() == tokenizer.mask_token_id).nonzero().item()
logits = logits[0, masked_index, :]
prob = logits.softmax(dim=0)
values, indices = prob.topk(k=topk, dim=0)
topk_predicted_token_bpe = ' '.join([tokenizer.convert_ids_to_tokens(indices[i].item())
for i in range(len(indices))])
topk_predicted_token_bpe = " ".join(
[tokenizer.convert_ids_to_tokens(indices[i].item()) for i in range(len(indices))]
)
masked_token = tokenizer.mask_token
topk_filled_outputs = []
for index, predicted_token_bpe in enumerate(topk_predicted_token_bpe.split(' ')):
predicted_token = predicted_token_bpe.replace('\u2581', ' ')
for index, predicted_token_bpe in enumerate(topk_predicted_token_bpe.split(" ")):
predicted_token = predicted_token_bpe.replace("\u2581", " ")
if " {0}".format(masked_token) in masked_input:
topk_filled_outputs.append((
masked_input.replace(
' {0}'.format(masked_token), predicted_token
),
values[index].item(),
predicted_token,
))
topk_filled_outputs.append(
(
masked_input.replace(" {0}".format(masked_token), predicted_token),
values[index].item(),
predicted_token,
)
)
else:
topk_filled_outputs.append((
masked_input.replace(masked_token, predicted_token),
values[index].item(),
predicted_token,
))
topk_filled_outputs.append(
(masked_input.replace(masked_token, predicted_token), values[index].item(), predicted_token,)
)
return topk_filled_outputs
tokenizer = CamembertTokenizer.from_pretrained('camembert-base')
model = CamembertForMaskedLM.from_pretrained('camembert-base')
tokenizer = CamembertTokenizer.from_pretrained("camembert-base")
model = CamembertForMaskedLM.from_pretrained("camembert-base")
model.eval()
masked_input = "Le camembert est <mask> :)"

View File

@@ -22,48 +22,54 @@
--model_name openai-gpt \
--do_train \
--do_eval \
--train_dataset $ROC_STORIES_DIR/cloze_test_val__spring2016\ -\ cloze_test_ALL_val.csv \
--eval_dataset $ROC_STORIES_DIR/cloze_test_test__spring2016\ -\ cloze_test_ALL_test.csv \
--train_dataset "$ROC_STORIES_DIR/cloze_test_val__spring2016 - cloze_test_ALL_val.csv" \
--eval_dataset "$ROC_STORIES_DIR/cloze_test_test__spring2016 - cloze_test_ALL_test.csv" \
--output_dir ../log \
--train_batch_size 16 \
"""
import argparse
import os
import csv
import random
import logging
from tqdm import tqdm, trange
import os
import random
import numpy as np
import torch
from torch.utils.data import (DataLoader, RandomSampler, SequentialSampler,
TensorDataset)
from torch.utils.data import DataLoader, RandomSampler, SequentialSampler, TensorDataset
from tqdm import tqdm, trange
from transformers import (OpenAIGPTDoubleHeadsModel, OpenAIGPTTokenizer,
AdamW, cached_path, WEIGHTS_NAME, CONFIG_NAME,
get_linear_schedule_with_warmup)
from transformers import (
CONFIG_NAME,
WEIGHTS_NAME,
AdamW,
OpenAIGPTDoubleHeadsModel,
OpenAIGPTTokenizer,
get_linear_schedule_with_warmup,
)
ROCSTORIES_URL = "https://s3.amazonaws.com/datasets.huggingface.co/ROCStories.tar.gz"
logging.basicConfig(format = '%(asctime)s - %(levelname)s - %(name)s - %(message)s',
datefmt = '%m/%d/%Y %H:%M:%S',
level = logging.INFO)
logging.basicConfig(
format="%(asctime)s - %(levelname)s - %(name)s - %(message)s", datefmt="%m/%d/%Y %H:%M:%S", level=logging.INFO
)
logger = logging.getLogger(__name__)
def accuracy(out, labels):
outputs = np.argmax(out, axis=1)
return np.sum(outputs == labels)
def load_rocstories_dataset(dataset_path):
""" Output a list of tuples(story, 1st continuation, 2nd continuation, label) """
with open(dataset_path, encoding='utf_8') as f:
with open(dataset_path, encoding="utf_8") as f:
f = csv.reader(f)
output = []
next(f) # skip the first line
next(f) # skip the first line
for line in tqdm(f):
output.append((' '.join(line[1:5]), line[5], line[6], int(line[-1])-1))
output.append((" ".join(line[1:5]), line[5], line[6], int(line[-1]) - 1))
return output
def pre_process_datasets(encoded_datasets, input_len, cap_length, start_token, delimiter_token, clf_token):
""" Pre-process datasets containing lists of tuples(story, 1st continuation, 2nd continuation, label)
@@ -75,61 +81,73 @@ def pre_process_datasets(encoded_datasets, input_len, cap_length, start_token, d
n_batch = len(dataset)
input_ids = np.zeros((n_batch, 2, input_len), dtype=np.int64)
mc_token_ids = np.zeros((n_batch, 2), dtype=np.int64)
lm_labels = np.full((n_batch, 2, input_len), fill_value=-1, dtype=np.int64)
lm_labels = np.full((n_batch, 2, input_len), fill_value=-100, dtype=np.int64)
mc_labels = np.zeros((n_batch,), dtype=np.int64)
for i, (story, cont1, cont2, mc_label), in enumerate(dataset):
with_cont1 = [start_token] + story[:cap_length] + [delimiter_token] + cont1[:cap_length] + [clf_token]
with_cont2 = [start_token] + story[:cap_length] + [delimiter_token] + cont2[:cap_length] + [clf_token]
input_ids[i, 0, :len(with_cont1)] = with_cont1
input_ids[i, 1, :len(with_cont2)] = with_cont2
input_ids[i, 0, : len(with_cont1)] = with_cont1
input_ids[i, 1, : len(with_cont2)] = with_cont2
mc_token_ids[i, 0] = len(with_cont1) - 1
mc_token_ids[i, 1] = len(with_cont2) - 1
lm_labels[i, 0, :len(with_cont1)] = with_cont1
lm_labels[i, 1, :len(with_cont2)] = with_cont2
lm_labels[i, 0, : len(with_cont1)] = with_cont1
lm_labels[i, 1, : len(with_cont2)] = with_cont2
mc_labels[i] = mc_label
all_inputs = (input_ids, mc_token_ids, lm_labels, mc_labels)
tensor_datasets.append(tuple(torch.tensor(t) for t in all_inputs))
return tensor_datasets
def main():
parser = argparse.ArgumentParser()
parser.add_argument('--model_name', type=str, default='openai-gpt',
help='pretrained model name')
parser.add_argument("--do_train", action='store_true', help="Whether to run training.")
parser.add_argument("--do_eval", action='store_true', help="Whether to run eval on the dev set.")
parser.add_argument("--output_dir", default=None, type=str, required=True,
help="The output directory where the model predictions and checkpoints will be written.")
parser.add_argument('--train_dataset', type=str, default='')
parser.add_argument('--eval_dataset', type=str, default='')
parser.add_argument('--seed', type=int, default=42)
parser.add_argument('--num_train_epochs', type=int, default=3)
parser.add_argument('--train_batch_size', type=int, default=8)
parser.add_argument('--eval_batch_size', type=int, default=16)
parser.add_argument("--adam_epsilon", default=1e-8, type=float,
help="Epsilon for Adam optimizer.")
parser.add_argument('--max_grad_norm', type=int, default=1)
parser.add_argument("--max_steps", default=-1, type=int,
help="If > 0: set total number of training \
steps to perform. Override num_train_epochs.")
parser.add_argument('--gradient_accumulation_steps', type=int, default=1,
help="Number of updates steps to accumulate before\
performing a backward/update pass.")
parser.add_argument('--learning_rate', type=float, default=6.25e-5)
parser.add_argument("--warmup_steps", default=0, type=int,
help="Linear warmup over warmup_steps.")
parser.add_argument('--lr_schedule', type=str, default='warmup_linear')
parser.add_argument('--weight_decay', type=float, default=0.01)
parser.add_argument('--lm_coef', type=float, default=0.9)
parser.add_argument('--n_valid', type=int, default=374)
parser.add_argument("--model_name", type=str, default="openai-gpt", help="pretrained model name")
parser.add_argument("--do_train", action="store_true", help="Whether to run training.")
parser.add_argument("--do_eval", action="store_true", help="Whether to run eval on the dev set.")
parser.add_argument(
"--output_dir",
default=None,
type=str,
required=True,
help="The output directory where the model predictions and checkpoints will be written.",
)
parser.add_argument("--train_dataset", type=str, default="")
parser.add_argument("--eval_dataset", type=str, default="")
parser.add_argument("--seed", type=int, default=42)
parser.add_argument("--num_train_epochs", type=int, default=3)
parser.add_argument("--train_batch_size", type=int, default=8)
parser.add_argument("--eval_batch_size", type=int, default=16)
parser.add_argument("--adam_epsilon", default=1e-8, type=float, help="Epsilon for Adam optimizer.")
parser.add_argument("--max_grad_norm", type=int, default=1)
parser.add_argument(
"--max_steps",
default=-1,
type=int,
help="If > 0: set total number of training \
steps to perform. Override num_train_epochs.",
)
parser.add_argument(
"--gradient_accumulation_steps",
type=int,
default=1,
help="Number of updates steps to accumulate before\
performing a backward/update pass.",
)
parser.add_argument("--learning_rate", type=float, default=6.25e-5)
parser.add_argument("--warmup_steps", default=0, type=int, help="Linear warmup over warmup_steps.")
parser.add_argument("--lr_schedule", type=str, default="warmup_linear")
parser.add_argument("--weight_decay", type=float, default=0.01)
parser.add_argument("--lm_coef", type=float, default=0.9)
parser.add_argument("--n_valid", type=int, default=374)
parser.add_argument('--server_ip', type=str, default='', help="Can be used for distant debugging.")
parser.add_argument('--server_port', type=str, default='', help="Can be used for distant debugging.")
parser.add_argument("--server_ip", type=str, default="", help="Can be used for distant debugging.")
parser.add_argument("--server_port", type=str, default="", help="Can be used for distant debugging.")
args = parser.parse_args()
print(args)
if args.server_ip and args.server_port:
# Distant debugging - see https://code.visualstudio.com/docs/python/debugging#_attach-to-a-local-script
import ptvsd
print("Waiting for debugger attach")
ptvsd.enable_attach(address=(args.server_ip, args.server_port), redirect_output=True)
ptvsd.wait_for_attach()
@@ -152,7 +170,7 @@ def main():
# Load tokenizer and model
# This loading functions also add new tokens and embeddings called `special tokens`
# These new embeddings will be fine-tuned on the RocStories dataset
special_tokens = ['_start_', '_delimiter_', '_classify_']
special_tokens = ["_start_", "_delimiter_", "_classify_"]
tokenizer = OpenAIGPTTokenizer.from_pretrained(args.model_name)
tokenizer.add_tokens(special_tokens)
special_tokens_ids = tokenizer.convert_tokens_to_ids(special_tokens)
@@ -161,8 +179,6 @@ def main():
model.to(device)
# Load and encode the datasets
if not args.train_dataset and not args.eval_dataset:
roc_stories = cached_path(ROCSTORIES_URL)
def tokenize_and_encode(obj):
""" Tokenize and encode a nested object """
if isinstance(obj, str):
@@ -170,6 +186,7 @@ def main():
elif isinstance(obj, int):
return obj
return list(tokenize_and_encode(o) for o in obj)
logger.info("Encoding dataset...")
train_dataset = load_rocstories_dataset(args.train_dataset)
eval_dataset = load_rocstories_dataset(args.eval_dataset)
@@ -178,8 +195,11 @@ def main():
# Compute the max input length for the Transformer
max_length = model.config.n_positions // 2 - 2
input_length = max(len(story[:max_length]) + max(len(cont1[:max_length]), len(cont2[:max_length])) + 3 \
for dataset in encoded_datasets for story, cont1, cont2, _ in dataset)
input_length = max(
len(story[:max_length]) + max(len(cont1[:max_length]), len(cont2[:max_length])) + 3
for dataset in encoded_datasets
for story, cont1, cont2, _ in dataset
)
input_length = min(input_length, model.config.n_positions) # Max size of input for the pre-trained model
# Prepare inputs tensors and dataloaders
@@ -198,20 +218,23 @@ def main():
if args.do_train:
if args.max_steps > 0:
t_total = args.max_steps
args.num_train_epochs = args.max_steps //\
(len(train_dataloader) // args.gradient_accumulation_steps) + 1
args.num_train_epochs = args.max_steps // (len(train_dataloader) // args.gradient_accumulation_steps) + 1
else:
t_total = len(train_dataloader)\
// args.gradient_accumulation_steps * args.num_train_epochs
t_total = len(train_dataloader) // args.gradient_accumulation_steps * args.num_train_epochs
param_optimizer = list(model.named_parameters())
no_decay = ['bias', 'LayerNorm.bias', 'LayerNorm.weight']
no_decay = ["bias", "LayerNorm.bias", "LayerNorm.weight"]
optimizer_grouped_parameters = [
{'params': [p for n, p in param_optimizer if not any(nd in n for nd in no_decay)], 'weight_decay': args.weight_decay},
{'params': [p for n, p in param_optimizer if any(nd in n for nd in no_decay)], 'weight_decay': 0.0}
]
{
"params": [p for n, p in param_optimizer if not any(nd in n for nd in no_decay)],
"weight_decay": args.weight_decay,
},
{"params": [p for n, p in param_optimizer if any(nd in n for nd in no_decay)], "weight_decay": 0.0},
]
optimizer = AdamW(optimizer_grouped_parameters, lr=args.learning_rate, eps=args.adam_epsilon)
scheduler = get_linear_schedule_with_warmup(optimizer, num_warmup_steps=args.warmup_steps, num_training_steps=t_total)
scheduler = get_linear_schedule_with_warmup(
optimizer, num_warmup_steps=args.warmup_steps, num_training_steps=t_total
)
if args.do_train:
nb_tr_steps, tr_loss, exp_average_loss = 0, 0, None
@@ -226,18 +249,20 @@ def main():
losses = model(input_ids, mc_token_ids=mc_token_ids, lm_labels=lm_labels, mc_labels=mc_labels)
loss = args.lm_coef * losses[0] + losses[1]
loss.backward()
scheduler.step()
optimizer.step()
scheduler.step()
optimizer.zero_grad()
tr_loss += loss.item()
exp_average_loss = loss.item() if exp_average_loss is None else 0.7*exp_average_loss+0.3*loss.item()
exp_average_loss = (
loss.item() if exp_average_loss is None else 0.7 * exp_average_loss + 0.3 * loss.item()
)
nb_tr_steps += 1
tqdm_bar.desc = "Training loss: {:.2e} lr: {:.2e}".format(exp_average_loss, scheduler.get_lr()[0])
# Save a trained model
if args.do_train:
# Save a trained model, configuration and tokenizer
model_to_save = model.module if hasattr(model, 'module') else model # Only save the model itself
model_to_save = model.module if hasattr(model, "module") else model # Only save the model itself
# If we save using the predefined names, we can load using `from_pretrained`
output_model_file = os.path.join(args.output_dir, WEIGHTS_NAME)
@@ -260,10 +285,12 @@ def main():
batch = tuple(t.to(device) for t in batch)
input_ids, mc_token_ids, lm_labels, mc_labels = batch
with torch.no_grad():
_, mc_loss, _, mc_logits = model(input_ids, mc_token_ids=mc_token_ids, lm_labels=lm_labels, mc_labels=mc_labels)
_, mc_loss, _, mc_logits = model(
input_ids, mc_token_ids=mc_token_ids, lm_labels=lm_labels, mc_labels=mc_labels
)
mc_logits = mc_logits.detach().cpu().numpy()
mc_labels = mc_labels.to('cpu').numpy()
mc_labels = mc_labels.to("cpu").numpy()
tmp_eval_accuracy = accuracy(mc_logits, mc_labels)
eval_loss += mc_loss.mean().item()
@@ -274,10 +301,8 @@ def main():
eval_loss = eval_loss / nb_eval_steps
eval_accuracy = eval_accuracy / nb_eval_examples
train_loss = tr_loss/nb_tr_steps if args.do_train else None
result = {'eval_loss': eval_loss,
'eval_accuracy': eval_accuracy,
'train_loss': train_loss}
train_loss = tr_loss / nb_tr_steps if args.do_train else None
result = {"eval_loss": eval_loss, "eval_accuracy": eval_accuracy, "train_loss": train_loss}
output_eval_file = os.path.join(args.output_dir, "eval_results.txt")
with open(output_eval_file, "w") as writer:
@@ -286,5 +311,6 @@ def main():
logger.info(" %s = %s", key, str(result[key]))
writer.write("%s = %s\n" % (key, str(result[key])))
if __name__ == '__main__':
if __name__ == "__main__":
main()

View File

@@ -16,54 +16,50 @@
"""BERT finetuning runner.
Finetuning the library models for multiple choice on SWAG (Bert).
"""
from __future__ import absolute_import, division, print_function
import argparse
import logging
import csv
import glob
import logging
import os
import random
import sys
import glob
import numpy as np
import torch
from torch.utils.data import (DataLoader, RandomSampler, SequentialSampler,
TensorDataset)
from torch.utils.data import DataLoader, RandomSampler, SequentialSampler, TensorDataset
from torch.utils.data.distributed import DistributedSampler
from tqdm import tqdm, trange
from transformers import (
WEIGHTS_NAME,
AdamW,
BertConfig,
BertForMultipleChoice,
BertTokenizer,
get_linear_schedule_with_warmup,
)
try:
from torch.utils.tensorboard import SummaryWriter
except:
except ImportError:
from tensorboardX import SummaryWriter
from tqdm import tqdm, trange
from transformers import (WEIGHTS_NAME, BertConfig,
BertForMultipleChoice, BertTokenizer)
from transformers import AdamW, get_linear_schedule_with_warmup
logger = logging.getLogger(__name__)
ALL_MODELS = sum((tuple(conf.pretrained_config_archive_map.keys()) \
for conf in [BertConfig]), ())
ALL_MODELS = sum((tuple(conf.pretrained_config_archive_map.keys()) for conf in [BertConfig]), ())
MODEL_CLASSES = {
'bert': (BertConfig, BertForMultipleChoice, BertTokenizer),
"bert": (BertConfig, BertForMultipleChoice, BertTokenizer),
}
class SwagExample(object):
"""A single training/test example for the SWAG dataset."""
def __init__(self,
swag_id,
context_sentence,
start_ending,
ending_0,
ending_1,
ending_2,
ending_3,
label = None):
def __init__(self, swag_id, context_sentence, start_ending, ending_0, ending_1, ending_2, ending_3, label=None):
self.swag_id = swag_id
self.context_sentence = context_sentence
self.start_ending = start_ending
@@ -79,7 +75,7 @@ class SwagExample(object):
return self.__repr__()
def __repr__(self):
l = [
attributes = [
"swag_id: {}".format(self.swag_id),
"context_sentence: {}".format(self.context_sentence),
"start_ending: {}".format(self.start_ending),
@@ -90,61 +86,48 @@ class SwagExample(object):
]
if self.label is not None:
l.append("label: {}".format(self.label))
attributes.append("label: {}".format(self.label))
return ", ".join(attributes)
return ", ".join(l)
class InputFeatures(object):
def __init__(self,
example_id,
choices_features,
label
):
def __init__(self, example_id, choices_features, label):
self.example_id = example_id
self.choices_features = [
{
'input_ids': input_ids,
'input_mask': input_mask,
'segment_ids': segment_ids
}
{"input_ids": input_ids, "input_mask": input_mask, "segment_ids": segment_ids}
for _, input_ids, input_mask, segment_ids in choices_features
]
self.label = label
def read_swag_examples(input_file, is_training=True):
with open(input_file, 'r', encoding='utf-8') as f:
reader = csv.reader(f)
lines = []
for line in reader:
if sys.version_info[0] == 2:
line = list(unicode(cell, 'utf-8') for cell in line)
lines.append(line)
if is_training and lines[0][-1] != 'label':
raise ValueError(
"For training, the input file must contain a label column."
)
def read_swag_examples(input_file, is_training=True):
with open(input_file, "r", encoding="utf-8") as f:
lines = list(csv.reader(f))
if is_training and lines[0][-1] != "label":
raise ValueError("For training, the input file must contain a label column.")
examples = [
SwagExample(
swag_id = line[2],
context_sentence = line[4],
start_ending = line[5], # in the swag dataset, the
# common beginning of each
# choice is stored in "sent2".
ending_0 = line[7],
ending_1 = line[8],
ending_2 = line[9],
ending_3 = line[10],
label = int(line[11]) if is_training else None
) for line in lines[1:] # we skip the line with the column names
swag_id=line[2],
context_sentence=line[4],
start_ending=line[5], # in the swag dataset, the
# common beginning of each
# choice is stored in "sent2".
ending_0=line[7],
ending_1=line[8],
ending_2=line[9],
ending_3=line[10],
label=int(line[11]) if is_training else None,
)
for line in lines[1:] # we skip the line with the column names
]
return examples
def convert_examples_to_features(examples, tokenizer, max_seq_length,
is_training):
def convert_examples_to_features(examples, tokenizer, max_seq_length, is_training):
"""Loads a data file into a list of `InputBatch`s."""
# Swag is a multiple choice task. To perform this task using Bert,
@@ -204,23 +187,18 @@ def convert_examples_to_features(examples, tokenizer, max_seq_length,
logger.info("swag_id: {}".format(example.swag_id))
for choice_idx, (tokens, input_ids, input_mask, segment_ids) in enumerate(choices_features):
logger.info("choice: {}".format(choice_idx))
logger.info("tokens: {}".format(' '.join(tokens)))
logger.info("input_ids: {}".format(' '.join(map(str, input_ids))))
logger.info("input_mask: {}".format(' '.join(map(str, input_mask))))
logger.info("segment_ids: {}".format(' '.join(map(str, segment_ids))))
logger.info("tokens: {}".format(" ".join(tokens)))
logger.info("input_ids: {}".format(" ".join(map(str, input_ids))))
logger.info("input_mask: {}".format(" ".join(map(str, input_mask))))
logger.info("segment_ids: {}".format(" ".join(map(str, segment_ids))))
if is_training:
logger.info("label: {}".format(label))
features.append(
InputFeatures(
example_id = example.swag_id,
choices_features = choices_features,
label = label
)
)
features.append(InputFeatures(example_id=example.swag_id, choices_features=choices_features, label=label))
return features
def _truncate_seq_pair(tokens_a, tokens_b, max_length):
"""Truncates a sequence pair in place to the maximum length."""
@@ -237,18 +215,14 @@ def _truncate_seq_pair(tokens_a, tokens_b, max_length):
else:
tokens_b.pop()
def accuracy(out, labels):
outputs = np.argmax(out, axis=1)
return np.sum(outputs == labels)
def select_field(features, field):
return [
[
choice[field]
for choice in feature.choices_features
]
for feature in features
]
return [[choice[field] for choice in feature.choices_features] for feature in features]
def set_seed(args):
@@ -258,24 +232,28 @@ def set_seed(args):
if args.n_gpu > 0:
torch.cuda.manual_seed_all(args.seed)
def load_and_cache_examples(args, tokenizer, evaluate=False, output_examples=False):
if args.local_rank not in [-1, 0]:
torch.distributed.barrier() # Make sure only the first process in distributed training process the dataset, and the others will use the cache
# Load data features from cache or dataset file
input_file = args.predict_file if evaluate else args.train_file
cached_features_file = os.path.join(os.path.dirname(input_file), 'cached_{}_{}_{}'.format(
'dev' if evaluate else 'train',
list(filter(None, args.model_name_or_path.split('/'))).pop(),
str(args.max_seq_length)))
cached_features_file = os.path.join(
os.path.dirname(input_file),
"cached_{}_{}_{}".format(
"dev" if evaluate else "train",
list(filter(None, args.model_name_or_path.split("/"))).pop(),
str(args.max_seq_length),
),
)
if os.path.exists(cached_features_file) and not args.overwrite_cache and not output_examples:
logger.info("Loading features from cached file %s", cached_features_file)
features = torch.load(cached_features_file)
else:
logger.info("Creating features from dataset file at %s", input_file)
examples = read_swag_examples(input_file)
features = convert_examples_to_features(
examples, tokenizer, args.max_seq_length, not evaluate)
features = convert_examples_to_features(examples, tokenizer, args.max_seq_length, not evaluate)
if args.local_rank in [-1, 0]:
logger.info("Saving features into cached file %s", cached_features_file)
@@ -285,21 +263,21 @@ def load_and_cache_examples(args, tokenizer, evaluate=False, output_examples=Fal
torch.distributed.barrier() # Make sure only the first process in distributed training process the dataset, and the others will use the cache
# Convert to Tensors and build dataset
all_input_ids = torch.tensor(select_field(features, 'input_ids'), dtype=torch.long)
all_input_mask = torch.tensor(select_field(features, 'input_mask'), dtype=torch.long)
all_segment_ids = torch.tensor(select_field(features, 'segment_ids'), dtype=torch.long)
all_input_ids = torch.tensor(select_field(features, "input_ids"), dtype=torch.long)
all_input_mask = torch.tensor(select_field(features, "input_mask"), dtype=torch.long)
all_segment_ids = torch.tensor(select_field(features, "segment_ids"), dtype=torch.long)
all_label = torch.tensor([f.label for f in features], dtype=torch.long)
if evaluate:
dataset = TensorDataset(all_input_ids, all_input_mask, all_segment_ids,
all_label)
dataset = TensorDataset(all_input_ids, all_input_mask, all_segment_ids, all_label)
else:
dataset = TensorDataset(all_input_ids, all_input_mask, all_segment_ids,
all_label)
dataset = TensorDataset(all_input_ids, all_input_mask, all_segment_ids, all_label)
if output_examples:
return dataset, examples, features
return dataset
def train(args, train_dataset, model, tokenizer):
""" Train the model """
if args.local_rank in [-1, 0]:
@@ -316,13 +294,18 @@ def train(args, train_dataset, model, tokenizer):
t_total = len(train_dataloader) // args.gradient_accumulation_steps * args.num_train_epochs
# Prepare optimizer and schedule (linear warmup and decay)
no_decay = ['bias', 'LayerNorm.weight']
no_decay = ["bias", "LayerNorm.weight"]
optimizer_grouped_parameters = [
{'params': [p for n, p in model.named_parameters() if not any(nd in n for nd in no_decay)], 'weight_decay': args.weight_decay},
{'params': [p for n, p in model.named_parameters() if any(nd in n for nd in no_decay)], 'weight_decay': 0.0}
]
{
"params": [p for n, p in model.named_parameters() if not any(nd in n for nd in no_decay)],
"weight_decay": args.weight_decay,
},
{"params": [p for n, p in model.named_parameters() if any(nd in n for nd in no_decay)], "weight_decay": 0.0},
]
optimizer = AdamW(optimizer_grouped_parameters, lr=args.learning_rate, eps=args.adam_epsilon)
scheduler = get_linear_schedule_with_warmup(optimizer, num_warmup_steps=args.warmup_steps, num_training_steps=t_total)
scheduler = get_linear_schedule_with_warmup(
optimizer, num_warmup_steps=args.warmup_steps, num_training_steps=t_total
)
if args.fp16:
try:
from apex import amp
@@ -336,17 +319,21 @@ def train(args, train_dataset, model, tokenizer):
# Distributed training (should be after apex fp16 initialization)
if args.local_rank != -1:
model = torch.nn.parallel.DistributedDataParallel(model, device_ids=[args.local_rank],
output_device=args.local_rank,
find_unused_parameters=True)
model = torch.nn.parallel.DistributedDataParallel(
model, device_ids=[args.local_rank], output_device=args.local_rank, find_unused_parameters=True
)
# Train!
logger.info("***** Running training *****")
logger.info(" Num examples = %d", len(train_dataset))
logger.info(" Num Epochs = %d", args.num_train_epochs)
logger.info(" Instantaneous batch size per GPU = %d", args.per_gpu_train_batch_size)
logger.info(" Total train batch size (w. parallel, distributed & accumulation) = %d",
args.train_batch_size * args.gradient_accumulation_steps * (torch.distributed.get_world_size() if args.local_rank != -1 else 1))
logger.info(
" Total train batch size (w. parallel, distributed & accumulation) = %d",
args.train_batch_size
* args.gradient_accumulation_steps
* (torch.distributed.get_world_size() if args.local_rank != -1 else 1),
)
logger.info(" Gradient Accumulation steps = %d", args.gradient_accumulation_steps)
logger.info(" Total optimization steps = %d", t_total)
@@ -354,17 +341,19 @@ def train(args, train_dataset, model, tokenizer):
tr_loss, logging_loss = 0.0, 0.0
model.zero_grad()
train_iterator = trange(int(args.num_train_epochs), desc="Epoch", disable=args.local_rank not in [-1, 0])
set_seed(args) # Added here for reproductibility (even between python 2 and 3)
set_seed(args) # Added here for reproductibility
for _ in train_iterator:
epoch_iterator = tqdm(train_dataloader, desc="Iteration", disable=args.local_rank not in [-1, 0])
for step, batch in enumerate(epoch_iterator):
model.train()
batch = tuple(t.to(args.device) for t in batch)
inputs = {'input_ids': batch[0],
'attention_mask': batch[1],
#'token_type_ids': None if args.model_type == 'xlm' else batch[2],
'token_type_ids': batch[2],
'labels': batch[3]}
inputs = {
"input_ids": batch[0],
"attention_mask": batch[1],
# 'token_type_ids': None if args.model_type == 'xlm' else batch[2],
"token_type_ids": batch[2],
"labels": batch[3],
}
# if args.model_type in ['xlnet', 'xlm']:
# inputs.update({'cls_index': batch[5],
# 'p_mask': batch[6]})
@@ -372,7 +361,7 @@ def train(args, train_dataset, model, tokenizer):
loss = outputs[0] # model outputs are always tuple in transformers (see doc)
if args.n_gpu > 1:
loss = loss.mean() # mean() to average on multi-gpu parallel (not distributed) training
loss = loss.mean() # mean() to average on multi-gpu parallel (not distributed) training
if args.gradient_accumulation_steps > 1:
loss = loss / args.gradient_accumulation_steps
@@ -393,23 +382,27 @@ def train(args, train_dataset, model, tokenizer):
if args.local_rank in [-1, 0] and args.logging_steps > 0 and global_step % args.logging_steps == 0:
# Log metrics
if args.local_rank == -1 and args.evaluate_during_training: # Only evaluate when single GPU otherwise metrics may not average well
if (
args.local_rank == -1 and args.evaluate_during_training
): # Only evaluate when single GPU otherwise metrics may not average well
results = evaluate(args, model, tokenizer)
for key, value in results.items():
tb_writer.add_scalar('eval_{}'.format(key), value, global_step)
tb_writer.add_scalar('lr', scheduler.get_lr()[0], global_step)
tb_writer.add_scalar('loss', (tr_loss - logging_loss)/args.logging_steps, global_step)
tb_writer.add_scalar("eval_{}".format(key), value, global_step)
tb_writer.add_scalar("lr", scheduler.get_lr()[0], global_step)
tb_writer.add_scalar("loss", (tr_loss - logging_loss) / args.logging_steps, global_step)
logging_loss = tr_loss
if args.local_rank in [-1, 0] and args.save_steps > 0 and global_step % args.save_steps == 0:
# Save model checkpoint
output_dir = os.path.join(args.output_dir, 'checkpoint-{}'.format(global_step))
output_dir = os.path.join(args.output_dir, "checkpoint-{}".format(global_step))
if not os.path.exists(output_dir):
os.makedirs(output_dir)
model_to_save = model.module if hasattr(model, 'module') else model # Take care of distributed/parallel training
model_to_save = (
model.module if hasattr(model, "module") else model
) # Take care of distributed/parallel training
model_to_save.save_pretrained(output_dir)
tokenizer.save_vocabulary(output_dir)
torch.save(args, os.path.join(output_dir, 'training_args.bin'))
torch.save(args, os.path.join(output_dir, "training_args.bin"))
logger.info("Saving model checkpoint to %s", output_dir)
if args.max_steps > 0 and global_step > args.max_steps:
@@ -424,6 +417,7 @@ def train(args, train_dataset, model, tokenizer):
return global_step, tr_loss / global_step
def evaluate(args, model, tokenizer, prefix=""):
dataset, examples, features = load_and_cache_examples(args, tokenizer, evaluate=True, output_examples=True)
@@ -440,7 +434,6 @@ def evaluate(args, model, tokenizer, prefix=""):
logger.info(" Num examples = %d", len(dataset))
logger.info(" Batch size = %d", args.eval_batch_size)
eval_loss, eval_accuracy = 0, 0
nb_eval_steps, nb_eval_examples = 0, 0
@@ -448,11 +441,13 @@ def evaluate(args, model, tokenizer, prefix=""):
model.eval()
batch = tuple(t.to(args.device) for t in batch)
with torch.no_grad():
inputs = {'input_ids': batch[0],
'attention_mask': batch[1],
# 'token_type_ids': None if args.model_type == 'xlm' else batch[2] # XLM don't use segment_ids
'token_type_ids': batch[2],
'labels': batch[3]}
inputs = {
"input_ids": batch[0],
"attention_mask": batch[1],
# 'token_type_ids': None if args.model_type == 'xlm' else batch[2] # XLM don't use segment_ids
"token_type_ids": batch[2],
"labels": batch[3],
}
# if args.model_type in ['xlnet', 'xlm']:
# inputs.update({'cls_index': batch[4],
@@ -462,17 +457,16 @@ def evaluate(args, model, tokenizer, prefix=""):
eval_loss += tmp_eval_loss.mean().item()
logits = logits.detach().cpu().numpy()
label_ids = inputs['labels'].to('cpu').numpy()
label_ids = inputs["labels"].to("cpu").numpy()
tmp_eval_accuracy = accuracy(logits, label_ids)
eval_accuracy += tmp_eval_accuracy
nb_eval_steps += 1
nb_eval_examples += inputs['input_ids'].size(0)
nb_eval_examples += inputs["input_ids"].size(0)
eval_loss = eval_loss / nb_eval_steps
eval_accuracy = eval_accuracy / nb_eval_examples
result = {'eval_loss': eval_loss,
'eval_accuracy': eval_accuracy}
result = {"eval_loss": eval_loss, "eval_accuracy": eval_accuracy}
output_eval_file = os.path.join(args.output_dir, "eval_results.txt")
with open(output_eval_file, "w") as writer:
@@ -483,92 +477,144 @@ def evaluate(args, model, tokenizer, prefix=""):
return result
def main():
parser = argparse.ArgumentParser()
## Required parameters
parser.add_argument("--train_file", default=None, type=str, required=True,
help="SWAG csv for training. E.g., train.csv")
parser.add_argument("--predict_file", default=None, type=str, required=True,
help="SWAG csv for predictions. E.g., val.csv or test.csv")
parser.add_argument("--model_type", default=None, type=str, required=True,
help="Model type selected in the list: " + ", ".join(MODEL_CLASSES.keys()))
parser.add_argument("--model_name_or_path", default=None, type=str, required=True,
help="Path to pre-trained model or shortcut name selected in the list: " + ", ".join(ALL_MODELS))
parser.add_argument("--output_dir", default=None, type=str, required=True,
help="The output directory where the model checkpoints and predictions will be written.")
# Required parameters
parser.add_argument(
"--train_file", default=None, type=str, required=True, help="SWAG csv for training. E.g., train.csv"
)
parser.add_argument(
"--predict_file",
default=None,
type=str,
required=True,
help="SWAG csv for predictions. E.g., val.csv or test.csv",
)
parser.add_argument(
"--model_type",
default=None,
type=str,
required=True,
help="Model type selected in the list: " + ", ".join(MODEL_CLASSES.keys()),
)
parser.add_argument(
"--model_name_or_path",
default=None,
type=str,
required=True,
help="Path to pre-trained model or shortcut name selected in the list: " + ", ".join(ALL_MODELS),
)
parser.add_argument(
"--output_dir",
default=None,
type=str,
required=True,
help="The output directory where the model checkpoints and predictions will be written.",
)
## Other parameters
parser.add_argument("--config_name", default="", type=str,
help="Pretrained config name or path if not the same as model_name")
parser.add_argument("--tokenizer_name", default="", type=str,
help="Pretrained tokenizer name or path if not the same as model_name")
parser.add_argument("--max_seq_length", default=384, type=int,
help="The maximum total input sequence length after tokenization. Sequences "
"longer than this will be truncated, and sequences shorter than this will be padded.")
parser.add_argument("--do_train", action='store_true',
help="Whether to run training.")
parser.add_argument("--do_eval", action='store_true',
help="Whether to run eval on the dev set.")
parser.add_argument("--evaluate_during_training", action='store_true',
help="Rul evaluation during training at each logging step.")
parser.add_argument("--do_lower_case", action='store_true',
help="Set this flag if you are using an uncased model.")
# Other parameters
parser.add_argument(
"--config_name", default="", type=str, help="Pretrained config name or path if not the same as model_name"
)
parser.add_argument(
"--tokenizer_name",
default="",
type=str,
help="Pretrained tokenizer name or path if not the same as model_name",
)
parser.add_argument(
"--max_seq_length",
default=384,
type=int,
help="The maximum total input sequence length after tokenization. Sequences "
"longer than this will be truncated, and sequences shorter than this will be padded.",
)
parser.add_argument("--do_train", action="store_true", help="Whether to run training.")
parser.add_argument("--do_eval", action="store_true", help="Whether to run eval on the dev set.")
parser.add_argument(
"--evaluate_during_training", action="store_true", help="Rul evaluation during training at each logging step."
)
parser.add_argument(
"--do_lower_case", action="store_true", help="Set this flag if you are using an uncased model."
)
parser.add_argument("--per_gpu_train_batch_size", default=8, type=int,
help="Batch size per GPU/CPU for training.")
parser.add_argument("--per_gpu_eval_batch_size", default=8, type=int,
help="Batch size per GPU/CPU for evaluation.")
parser.add_argument("--learning_rate", default=5e-5, type=float,
help="The initial learning rate for Adam.")
parser.add_argument('--gradient_accumulation_steps', type=int, default=1,
help="Number of updates steps to accumulate before performing a backward/update pass.")
parser.add_argument("--weight_decay", default=0.0, type=float,
help="Weight deay if we apply some.")
parser.add_argument("--adam_epsilon", default=1e-8, type=float,
help="Epsilon for Adam optimizer.")
parser.add_argument("--max_grad_norm", default=1.0, type=float,
help="Max gradient norm.")
parser.add_argument("--num_train_epochs", default=3.0, type=float,
help="Total number of training epochs to perform.")
parser.add_argument("--max_steps", default=-1, type=int,
help="If > 0: set total number of training steps to perform. Override num_train_epochs.")
parser.add_argument("--warmup_steps", default=0, type=int,
help="Linear warmup over warmup_steps.")
parser.add_argument("--per_gpu_train_batch_size", default=8, type=int, help="Batch size per GPU/CPU for training.")
parser.add_argument(
"--per_gpu_eval_batch_size", default=8, type=int, help="Batch size per GPU/CPU for evaluation."
)
parser.add_argument("--learning_rate", default=5e-5, type=float, help="The initial learning rate for Adam.")
parser.add_argument(
"--gradient_accumulation_steps",
type=int,
default=1,
help="Number of updates steps to accumulate before performing a backward/update pass.",
)
parser.add_argument("--weight_decay", default=0.0, type=float, help="Weight deay if we apply some.")
parser.add_argument("--adam_epsilon", default=1e-8, type=float, help="Epsilon for Adam optimizer.")
parser.add_argument("--max_grad_norm", default=1.0, type=float, help="Max gradient norm.")
parser.add_argument(
"--num_train_epochs", default=3.0, type=float, help="Total number of training epochs to perform."
)
parser.add_argument(
"--max_steps",
default=-1,
type=int,
help="If > 0: set total number of training steps to perform. Override num_train_epochs.",
)
parser.add_argument("--warmup_steps", default=0, type=int, help="Linear warmup over warmup_steps.")
parser.add_argument('--logging_steps', type=int, default=50,
help="Log every X updates steps.")
parser.add_argument('--save_steps', type=int, default=50,
help="Save checkpoint every X updates steps.")
parser.add_argument("--eval_all_checkpoints", action='store_true',
help="Evaluate all checkpoints starting with the same prefix as model_name ending and ending with step number")
parser.add_argument("--no_cuda", action='store_true',
help="Whether not to use CUDA when available")
parser.add_argument('--overwrite_output_dir', action='store_true',
help="Overwrite the content of the output directory")
parser.add_argument('--overwrite_cache', action='store_true',
help="Overwrite the cached training and evaluation sets")
parser.add_argument('--seed', type=int, default=42,
help="random seed for initialization")
parser.add_argument("--logging_steps", type=int, default=50, help="Log every X updates steps.")
parser.add_argument("--save_steps", type=int, default=50, help="Save checkpoint every X updates steps.")
parser.add_argument(
"--eval_all_checkpoints",
action="store_true",
help="Evaluate all checkpoints starting with the same prefix as model_name ending and ending with step number",
)
parser.add_argument("--no_cuda", action="store_true", help="Whether not to use CUDA when available")
parser.add_argument(
"--overwrite_output_dir", action="store_true", help="Overwrite the content of the output directory"
)
parser.add_argument(
"--overwrite_cache", action="store_true", help="Overwrite the cached training and evaluation sets"
)
parser.add_argument("--seed", type=int, default=42, help="random seed for initialization")
parser.add_argument("--local_rank", type=int, default=-1,
help="local_rank for distributed training on gpus")
parser.add_argument('--fp16', action='store_true',
help="Whether to use 16-bit (mixed) precision (through NVIDIA apex) instead of 32-bit")
parser.add_argument('--fp16_opt_level', type=str, default='O1',
help="For fp16: Apex AMP optimization level selected in ['O0', 'O1', 'O2', and 'O3']."
"See details at https://nvidia.github.io/apex/amp.html")
parser.add_argument('--server_ip', type=str, default='', help="Can be used for distant debugging.")
parser.add_argument('--server_port', type=str, default='', help="Can be used for distant debugging.")
parser.add_argument("--local_rank", type=int, default=-1, help="local_rank for distributed training on gpus")
parser.add_argument(
"--fp16",
action="store_true",
help="Whether to use 16-bit (mixed) precision (through NVIDIA apex) instead of 32-bit",
)
parser.add_argument(
"--fp16_opt_level",
type=str,
default="O1",
help="For fp16: Apex AMP optimization level selected in ['O0', 'O1', 'O2', and 'O3']."
"See details at https://nvidia.github.io/apex/amp.html",
)
parser.add_argument("--server_ip", type=str, default="", help="Can be used for distant debugging.")
parser.add_argument("--server_port", type=str, default="", help="Can be used for distant debugging.")
args = parser.parse_args()
if os.path.exists(args.output_dir) and os.listdir(args.output_dir) and args.do_train and not args.overwrite_output_dir:
raise ValueError("Output directory ({}) already exists and is not empty. Use --overwrite_output_dir to overcome.".format(args.output_dir))
if (
os.path.exists(args.output_dir)
and os.listdir(args.output_dir)
and args.do_train
and not args.overwrite_output_dir
):
raise ValueError(
"Output directory ({}) already exists and is not empty. Use --overwrite_output_dir to overcome.".format(
args.output_dir
)
)
# Setup distant debugging if needed
if args.server_ip and args.server_port:
# Distant debugging - see https://code.visualstudio.com/docs/python/debugging#_attach-to-a-local-script
import ptvsd
print("Waiting for debugger attach")
ptvsd.enable_attach(address=(args.server_ip, args.server_port), redirect_output=True)
ptvsd.wait_for_attach()
@@ -576,20 +622,28 @@ def main():
# Setup CUDA, GPU & distributed training
if args.local_rank == -1 or args.no_cuda:
device = torch.device("cuda" if torch.cuda.is_available() and not args.no_cuda else "cpu")
args.n_gpu = torch.cuda.device_count()
args.n_gpu = 0 if args.no_cuda else torch.cuda.device_count()
else: # Initializes the distributed backend which will take care of sychronizing nodes/GPUs
torch.cuda.set_device(args.local_rank)
device = torch.device("cuda", args.local_rank)
torch.distributed.init_process_group(backend='nccl')
torch.distributed.init_process_group(backend="nccl")
args.n_gpu = 1
args.device = device
# Setup logging
logging.basicConfig(format = '%(asctime)s - %(levelname)s - %(name)s - %(message)s',
datefmt = '%m/%d/%Y %H:%M:%S',
level = logging.INFO if args.local_rank in [-1, 0] else logging.WARN)
logger.warning("Process rank: %s, device: %s, n_gpu: %s, distributed training: %s, 16-bits training: %s",
args.local_rank, device, args.n_gpu, bool(args.local_rank != -1), args.fp16)
logging.basicConfig(
format="%(asctime)s - %(levelname)s - %(name)s - %(message)s",
datefmt="%m/%d/%Y %H:%M:%S",
level=logging.INFO if args.local_rank in [-1, 0] else logging.WARN,
)
logger.warning(
"Process rank: %s, device: %s, n_gpu: %s, distributed training: %s, 16-bits training: %s",
args.local_rank,
device,
args.n_gpu,
bool(args.local_rank != -1),
args.fp16,
)
# Set seed
set_seed(args)
@@ -601,8 +655,12 @@ def main():
args.model_type = args.model_type.lower()
config_class, model_class, tokenizer_class = MODEL_CLASSES[args.model_type]
config = config_class.from_pretrained(args.config_name if args.config_name else args.model_name_or_path)
tokenizer = tokenizer_class.from_pretrained(args.tokenizer_name if args.tokenizer_name else args.model_name_or_path, do_lower_case=args.do_lower_case)
model = model_class.from_pretrained(args.model_name_or_path, from_tf=bool('.ckpt' in args.model_name_or_path), config=config)
tokenizer = tokenizer_class.from_pretrained(
args.tokenizer_name if args.tokenizer_name else args.model_name_or_path, do_lower_case=args.do_lower_case
)
model = model_class.from_pretrained(
args.model_name_or_path, from_tf=bool(".ckpt" in args.model_name_or_path), config=config
)
if args.local_rank == 0:
torch.distributed.barrier() # Make sure only the first process in distributed training will download model & vocab
@@ -617,7 +675,6 @@ def main():
global_step, tr_loss = train(args, train_dataset, model, tokenizer)
logger.info(" global_step = %s, average loss = %s", global_step, tr_loss)
# Save the trained model and the tokenizer
if args.local_rank == -1 or torch.distributed.get_rank() == 0:
# Create output directory if needed
@@ -627,19 +684,20 @@ def main():
logger.info("Saving model checkpoint to %s", args.output_dir)
# Save a trained model, configuration and tokenizer using `save_pretrained()`.
# They can then be reloaded using `from_pretrained()`
model_to_save = model.module if hasattr(model, 'module') else model # Take care of distributed/parallel training
model_to_save = (
model.module if hasattr(model, "module") else model
) # Take care of distributed/parallel training
model_to_save.save_pretrained(args.output_dir)
tokenizer.save_pretrained(args.output_dir)
# Good practice: save your training arguments together with the trained model
torch.save(args, os.path.join(args.output_dir, 'training_args.bin'))
torch.save(args, os.path.join(args.output_dir, "training_args.bin"))
# Load a trained model and vocabulary that you have fine-tuned
model = model_class.from_pretrained(args.output_dir)
tokenizer = tokenizer_class.from_pretrained(args.output_dir)
model.to(args.device)
# Evaluation - we can ask to evaluate all the checkpoints (sub-directories) in a directory
results = {}
if args.do_eval and args.local_rank in [-1, 0]:
@@ -650,14 +708,16 @@ def main():
checkpoints = [args.model_name_or_path]
if args.eval_all_checkpoints:
checkpoints = list(os.path.dirname(c) for c in sorted(glob.glob(args.output_dir + '/**/' + WEIGHTS_NAME, recursive=True)))
checkpoints = list(
os.path.dirname(c) for c in sorted(glob.glob(args.output_dir + "/**/" + WEIGHTS_NAME, recursive=True))
)
logging.getLogger("transformers.modeling_utils").setLevel(logging.WARN) # Reduce model loading logs
logger.info("Evaluate the following checkpoints: %s", checkpoints)
for checkpoint in checkpoints:
# Reload the model
global_step = checkpoint.split('-')[-1] if len(checkpoints) > 1 else ""
global_step = checkpoint.split("-")[-1] if len(checkpoints) > 1 else ""
model = model_class.from_pretrained(checkpoint)
tokenizer = tokenizer_class.from_pretrained(checkpoint)
model.to(args.device)
@@ -665,7 +725,7 @@ def main():
# Evaluate
result = evaluate(args, model, tokenizer, prefix=global_step)
result = dict((k + ('_{}'.format(global_step) if global_step else ''), v) for k, v in result.items())
result = dict((k + ("_{}".format(global_step) if global_step else ""), v) for k, v in result.items())
results.update(result)
logger.info("Results: {}".format(results))

View File

@@ -19,55 +19,48 @@
This script with default values evaluates a pretrained Transformer-XL on WikiText 103
"""
from __future__ import absolute_import, division, print_function, unicode_literals
import argparse
import logging
import time
import math
import time
import torch
from transformers import TransfoXLLMHeadModel, TransfoXLCorpus, TransfoXLTokenizer
from transformers import TransfoXLCorpus, TransfoXLLMHeadModel
logging.basicConfig(format = '%(asctime)s - %(levelname)s - %(name)s - %(message)s',
datefmt = '%m/%d/%Y %H:%M:%S',
level = logging.INFO)
logging.basicConfig(
format="%(asctime)s - %(levelname)s - %(name)s - %(message)s", datefmt="%m/%d/%Y %H:%M:%S", level=logging.INFO
)
logger = logging.getLogger(__name__)
def main():
parser = argparse.ArgumentParser(description='PyTorch Transformer Language Model')
parser.add_argument('--model_name', type=str, default='transfo-xl-wt103',
help='pretrained model name')
parser.add_argument('--split', type=str, default='test',
choices=['all', 'valid', 'test'],
help='which split to evaluate')
parser.add_argument('--batch_size', type=int, default=10,
help='batch size')
parser.add_argument('--tgt_len', type=int, default=128,
help='number of tokens to predict')
parser.add_argument('--ext_len', type=int, default=0,
help='length of the extended context')
parser.add_argument('--mem_len', type=int, default=1600,
help='length of the retained previous heads')
parser.add_argument('--clamp_len', type=int, default=1000,
help='max positional embedding index')
parser.add_argument('--no_cuda', action='store_true',
help='Do not use CUDA even though CUA is available')
parser.add_argument('--work_dir', type=str, required=True,
help='path to the work_dir')
parser.add_argument('--no_log', action='store_true',
help='do not log the eval result')
parser.add_argument('--same_length', action='store_true',
help='set same length attention with masking')
parser.add_argument('--server_ip', type=str, default='', help="Can be used for distant debugging.")
parser.add_argument('--server_port', type=str, default='', help="Can be used for distant debugging.")
parser = argparse.ArgumentParser(description="PyTorch Transformer Language Model")
parser.add_argument("--model_name", type=str, default="transfo-xl-wt103", help="pretrained model name")
parser.add_argument(
"--split", type=str, default="test", choices=["all", "valid", "test"], help="which split to evaluate"
)
parser.add_argument("--batch_size", type=int, default=10, help="batch size")
parser.add_argument("--tgt_len", type=int, default=128, help="number of tokens to predict")
parser.add_argument("--ext_len", type=int, default=0, help="length of the extended context")
parser.add_argument("--mem_len", type=int, default=1600, help="length of the retained previous heads")
parser.add_argument("--clamp_len", type=int, default=1000, help="max positional embedding index")
parser.add_argument("--no_cuda", action="store_true", help="Do not use CUDA even though CUA is available")
parser.add_argument("--work_dir", type=str, required=True, help="path to the work_dir")
parser.add_argument("--no_log", action="store_true", help="do not log the eval result")
parser.add_argument("--same_length", action="store_true", help="set same length attention with masking")
parser.add_argument("--server_ip", type=str, default="", help="Can be used for distant debugging.")
parser.add_argument("--server_port", type=str, default="", help="Can be used for distant debugging.")
args = parser.parse_args()
assert args.ext_len >= 0, 'extended context length must be non-negative'
assert args.ext_len >= 0, "extended context length must be non-negative"
if args.server_ip and args.server_port:
# Distant debugging - see https://code.visualstudio.com/docs/python/debugging#_attach-to-a-local-script
import ptvsd
print("Waiting for debugger attach")
ptvsd.enable_attach(address=(args.server_ip, args.server_port), redirect_output=True)
ptvsd.wait_for_attach()
@@ -80,21 +73,20 @@ def main():
# The pre-processing involve computing word frequencies to prepare the Adaptive input and SoftMax
# and tokenizing the dataset
# The pre-processed corpus is a convertion (using the conversion script )
tokenizer = TransfoXLTokenizer.from_pretrained(args.model_name)
corpus = TransfoXLCorpus.from_pretrained(args.model_name)
ntokens = len(corpus.vocab)
va_iter = corpus.get_iterator('valid', args.batch_size, args.tgt_len,
device=device, ext_len=args.ext_len)
te_iter = corpus.get_iterator('test', args.batch_size, args.tgt_len,
device=device, ext_len=args.ext_len)
va_iter = corpus.get_iterator("valid", args.batch_size, args.tgt_len, device=device, ext_len=args.ext_len)
te_iter = corpus.get_iterator("test", args.batch_size, args.tgt_len, device=device, ext_len=args.ext_len)
# Load a pre-trained model
model = TransfoXLLMHeadModel.from_pretrained(args.model_name)
model = model.to(device)
logger.info('Evaluating with bsz {} tgt_len {} ext_len {} mem_len {} clamp_len {}'.format(
args.batch_size, args.tgt_len, args.ext_len, args.mem_len, args.clamp_len))
logger.info(
"Evaluating with bsz {} tgt_len {} ext_len {} mem_len {} clamp_len {}".format(
args.batch_size, args.tgt_len, args.ext_len, args.mem_len, args.clamp_len
)
)
model.reset_length(args.tgt_len, args.ext_len, args.mem_len)
if args.clamp_len > 0:
@@ -108,7 +100,7 @@ def main():
def evaluate(eval_iter):
# Turn on evaluation mode which disables dropout.
model.eval()
total_len, total_loss = 0, 0.
total_len, total_loss = 0, 0.0
start_time = time.time()
with torch.no_grad():
mems = None
@@ -119,35 +111,34 @@ def main():
total_loss += seq_len * loss.item()
total_len += seq_len
total_time = time.time() - start_time
logger.info('Time : {:.2f}s, {:.2f}ms/segment'.format(
total_time, 1000 * total_time / (idx+1)))
logger.info("Time : {:.2f}s, {:.2f}ms/segment".format(total_time, 1000 * total_time / (idx + 1)))
return total_loss / total_len
# Run on test data.
if args.split == 'all':
if args.split == "all":
test_loss = evaluate(te_iter)
valid_loss = evaluate(va_iter)
elif args.split == 'valid':
elif args.split == "valid":
valid_loss = evaluate(va_iter)
test_loss = None
elif args.split == 'test':
elif args.split == "test":
test_loss = evaluate(te_iter)
valid_loss = None
def format_log(loss, split):
log_str = '| {0} loss {1:5.2f} | {0} ppl {2:9.3f} '.format(
split, loss, math.exp(loss))
log_str = "| {0} loss {1:5.2f} | {0} ppl {2:9.3f} ".format(split, loss, math.exp(loss))
return log_str
log_str = ''
log_str = ""
if valid_loss is not None:
log_str += format_log(valid_loss, 'valid')
log_str += format_log(valid_loss, "valid")
if test_loss is not None:
log_str += format_log(test_loss, 'test')
log_str += format_log(test_loss, "test")
logger.info('=' * 100)
logger.info("=" * 100)
logger.info(log_str)
logger.info('=' * 100)
logger.info("=" * 100)
if __name__ == '__main__':
if __name__ == "__main__":
main()

View File

@@ -2,13 +2,17 @@
This folder contains the original code used to train Distil* as well as examples showcasing how to use DistilBERT, DistilRoBERTa and DistilGPT2.
**November 19th, 2019 - Update** We release German **DistilBERT**: 98.8% of `bert-base-german-dbmdz-cased` on NER tasks.
**January 20, 2020 - Bug fixing** We have recently discovered and fixed [a bug](https://github.com/huggingface/transformers/commit/48cbf267c988b56c71a2380f748a3e6092ccaed3) in the evaluation of our `run_*.py` scripts that caused the reported metrics to be over-estimated on average. We have updated all the metrics with the latest runs.
**October 23rd, 2019 - Update** We release **DistilRoBERTa**: 95% of `RoBERTa-base`'s performance on GLUE, twice as fast as RoBERTa while being 35% smaller.
**December 6, 2019 - Update** We release **DistilmBERT**: 92% of `bert-base-multilingual-cased` on XNLI. The model supports 104 different languages listed [here](https://github.com/google-research/bert/blob/master/multilingual.md#list-of-languages).
**October 3rd, 2019 - Update** We release our [NeurIPS workshop paper](https://arxiv.org/abs/1910.01108) explaining our approach on **DistilBERT**. It includes updated results and further experiments. We applied the same method to GPT2 and release the weights of **DistilGPT2**. DistilGPT2 is two times faster and 33% smaller than GPT2. **The paper superseeds our [previous blogpost](https://medium.com/huggingface/distilbert-8cf3380435b5) with a different distillation loss and better performances. Please use the paper as a reference when comparing/reporting results on DistilBERT.**
**November 19, 2019 - Update** We release German **DistilBERT**: 98.8% of `bert-base-german-dbmdz-cased` on NER tasks.
**September 19th, 2019 - Update:** We fixed bugs in the code and released an upadted version of the weights trained with a modification of the distillation loss. DistilBERT now reaches 97% of `BERT-base`'s performance on GLUE, and 86.9 F1 score on SQuAD v1.1 dev set (compared to 88.5 for `BERT-base`). We will publish a formal write-up of our approach in the near future!
**October 23, 2019 - Update** We release **DistilRoBERTa**: 95% of `RoBERTa-base`'s performance on GLUE, twice as fast as RoBERTa while being 35% smaller.
**October 3, 2019 - Update** We release our [NeurIPS workshop paper](https://arxiv.org/abs/1910.01108) explaining our approach on **DistilBERT**. It includes updated results and further experiments. We applied the same method to GPT2 and release the weights of **DistilGPT2**. DistilGPT2 is two times faster and 33% smaller than GPT2. **The paper supersedes our [previous blogpost](https://medium.com/huggingface/distilbert-8cf3380435b5) with a different distillation loss and better performances. Please use the paper as a reference when comparing/reporting results on DistilBERT.**
**September 19, 2019 - Update:** We fixed bugs in the code and released an upadted version of the weights trained with a modification of the distillation loss. DistilBERT now reaches 99% of `BERT-base`'s performance on GLUE, and 86.9 F1 score on SQuAD v1.1 dev set (compared to 88.5 for `BERT-base`). We will publish a formal write-up of our approach in the near future!
## What is Distil*
@@ -16,9 +20,10 @@ This folder contains the original code used to train Distil* as well as examples
Distil* is a class of compressed models that started with DistilBERT. DistilBERT stands for Distillated-BERT. DistilBERT is a small, fast, cheap and light Transformer model based on Bert architecture. It has 40% less parameters than `bert-base-uncased`, runs 60% faster while preserving 97% of BERT's performances as measured on the GLUE language understanding benchmark. DistilBERT is trained using knowledge distillation, a technique to compress a large model called the teacher into a smaller model called the student. By distillating Bert, we obtain a smaller Transformer model that bears a lot of similarities with the original BERT model while being lighter, smaller and faster to run. DistilBERT is thus an interesting option to put large-scaled trained Transformer model into production.
We have applied the same method to other Transformer architectures and released the weights:
- GPT2: on the [WikiText-103](https://blog.einstein.ai/the-wikitext-long-term-dependency-language-modeling-dataset/) benchmark, GPT2 reaches a perplexity on the test set of 15.0 compared to 18.5 for **DistilGPT2** (after fine-tuning on the train set).
- RoBERTa: **DistilRoBERTa** reaches 95% of `RoBERTa-base` performance on GLUE while being twice faster and 35% smaller.
- and more to come! 🤗🤗🤗
- GPT2: on the [WikiText-103](https://blog.einstein.ai/the-wikitext-long-term-dependency-language-modeling-dataset/) benchmark, GPT2 reaches a perplexity on the test set of 16.3 compared to 21.1 for **DistilGPT2** (after fine-tuning on the train set).
- RoBERTa: **DistilRoBERTa** reaches 95% of `RoBERTa-base`'s performance on GLUE while being twice faster and 35% smaller.
- German BERT: **German DistilBERT** reaches 99% of `bert-base-german-dbmdz-cased`'s performance on German NER (CoNLL-2003).
- Multilingual BERT: **DistilmBERT** reaches 92% of Multilingual BERT's performance on XNLI while being twice faster and 25% smaller. The model supports 104 languages listed [here](https://github.com/google-research/bert/blob/master/multilingual.md#list-of-languages).
For more information on DistilBERT, please refer to our [NeurIPS workshop paper](https://arxiv.org/abs/1910.01108).
@@ -26,18 +31,28 @@ Here are the results on the dev sets of GLUE:
| Model | Macro-score | CoLA | MNLI | MRPC | QNLI | QQP | RTE | SST-2| STS-B| WNLI |
| :---: | :---: | :---:| :---:| :---:| :---:| :---:| :---:| :---:| :---:| :---: |
| BERT-base | **77.6** | 48.9 | 84.3 | 88.6 | 89.3 | 89.5 | 71.3 | 91.7 | 91.2 | 43.7 |
| DistilBERT | **76.8** | 49.1 | 81.8 | 90.2 | 90.2 | 89.2 | 62.9 | 92.7 | 90.7 | 44.4 |
| BERT-base-uncased | **79.5** | 56.3 | 84.7 | 88.6 | 91.8 | 89.6 | 69.3 | 92.7 | 89.0 | 53.5 |
| DistilBERT-base-uncased | **77.0** | 51.3 | 82.1 | 87.5 | 89.2 | 88.5 | 59.9 | 91.3 | 86.9 | 56.3 |
| BERT-base-cased | **78.2** | 58.2 | 83.9 | 87.8 | 91.0 | 89.2 | 66.1 | 91.7 | 89.2 | 46.5 |
| DistilBERT-base-cased | **75.9** | 47.2 | 81.5 | 85.6 | 88.2 | 87.8 | 60.6 | 90.4 | 85.5 | 56.3 |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| RoBERTa-base (reported) | **83.2**/**86.4**<sup>2</sup> | 63.6 | 87.6 | 90.2 | 92.8 | 91.9 | 78.7 | 94.8 | 91.2 | 57.7<sup>3</sup> |
| DistilRoBERTa<sup>1</sup> | **79.0**/**82.3**<sup>2</sup> | 59.4 | 83.9 | 86.6 | 90.8 | 89.4 | 67.9 | 92.5 | 88.3 | 52.1 |
| RoBERTa-base (reported) | **83.2**/**86.4**<sup>2</sup> | 63.6 | 87.6 | 90.2 | 92.8 | 91.9 | 78.7 | 94.8 | 91.2 | 57.7<sup>3</sup> |
| DistilRoBERTa<sup>1</sup> | **79.0**/**82.3**<sup>2</sup> | 59.3 | 84.0 | 86.6 | 90.8 | 89.4 | 67.9 | 92.5 | 88.3 | 52.1 |
<sup>1</sup> We did not use the MNLI checkpoint for fine-tuning but directy perform transfer learning on the pre-trained DistilRoBERTa.
<sup>1</sup> We did not use the MNLI checkpoint for fine-tuning but directly perform transfer learning on the pre-trained DistilRoBERTa.
<sup>2</sup> Macro-score computed without WNLI.
<sup>3</sup> We compute this score ourselves for completeness.
Here are the results on the *test* sets for 6 of the languages available in XNLI. The results are computed in the zero shot setting (trained on the English portion and evaluated on the target language portion):
| Model | English | Spanish | Chinese | German | Arabic | Urdu |
| :---: | :---: | :---: | :---: | :---: | :---: | :---:|
| mBERT base cased (computed) | 82.1 | 74.6 | 69.1 | 72.3 | 66.4 | 58.5 |
| mBERT base uncased (reported)| 81.4 | 74.3 | 63.8 | 70.5 | 62.1 | 58.3 |
| DistilmBERT | 78.2 | 69.1 | 64.0 | 66.3 | 59.1 | 54.7 |
## Setup
This part of the library has only be tested with Python3.6+. There are few specific dependencies to install before launching a distillation, you can install them with the command `pip install -r requirements.txt`.
@@ -50,17 +65,19 @@ This part of the library has only be tested with Python3.6+. There are few speci
Transformers includes five pre-trained Distil* models, currently only provided for English and German (we are investigating the possibility to train and release a multilingual version of DistilBERT):
- `distilbert-base-uncased`: DistilBERT English language model pretrained on the same data used to pretrain Bert (concatenation of the Toronto Book Corpus and full English Wikipedia) using distillation with the supervision of the `bert-base-uncased` version of Bert. The model has 6 layers, 768 dimension and 12 heads, totalizing 66M parameters.
- `distilbert-base-uncased-distilled-squad`: A finetuned version of `distilbert-base-uncased` finetuned using (a second step of) knwoledge distillation on SQuAD 1.0. This model reaches a F1 score of 86.9 on the dev set (for comparison, Bert `bert-base-uncased` version reaches a 88.5 F1 score).
- `distilbert-base-uncased-distilled-squad`: A finetuned version of `distilbert-base-uncased` finetuned using (a second step of) knowledge distillation on SQuAD 1.0. This model reaches a F1 score of 86.9 on the dev set (for comparison, Bert `bert-base-uncased` version reaches a 88.5 F1 score).
- `distilbert-base-cased`: DistilBERT English language model pretrained on the same data used to pretrain Bert (concatenation of the Toronto Book Corpus and full English Wikipedia) using distillation with the supervision of the `bert-base-cased` version of Bert. The model has 6 layers, 768 dimension and 12 heads, totalizing 65M parameters.
- `distilbert-base-cased-distilled-squad`: A finetuned version of `distilbert-base-cased` finetuned using (a second step of) knowledge distillation on SQuAD 1.0. This model reaches a F1 score of 87.1 on the dev set (for comparison, Bert `bert-base-cased` version reaches a 88.7 F1 score).
- `distilbert-base-german-cased`: DistilBERT German language model pretrained on 1/2 of the data used to pretrain Bert using distillation with the supervision of the `bert-base-german-dbmdz-cased` version of German DBMDZ Bert. For NER tasks the model reaches a F1 score of 83.49 on the CoNLL-2003 test set (for comparison, `bert-base-german-dbmdz-cased` reaches a 84.52 F1 score), and a F1 score of 85.23 on the GermEval 2014 test set (`bert-base-german-dbmdz-cased` reaches a 86.89 F1 score).
- `distilgpt2`: DistilGPT2 English language model pretrained with the supervision of `gpt2` (the smallest version of GPT2) on [OpenWebTextCorpus](https://skylion007.github.io/OpenWebTextCorpus/), a reproduction of OpenAI's WebText dataset. The model has 6 layers, 768 dimension and 12 heads, totalizing 82M parameters (compared to 124M parameters for GPT2). On average, DistilGPT2 is two times faster than GPT2.
- `distilroberta-base`: DistilRoBERTa English language model pretrained with the supervision of `roberta-base` solely on [OpenWebTextCorpus](https://skylion007.github.io/OpenWebTextCorpus/), a reproduction of OpenAI's WebText dataset (it is ~4 times less training data than the teacher RoBERTa). The model has 6 layers, 768 dimension and 12 heads, totalizing 82M parameters (compared to 125M parameters for RoBERTa-base). On average DistilRoBERTa is twice as fast as Roberta-base.
- and more to come! 🤗🤗🤗
- `distilbert-base-multilingual-cased`: DistilmBERT multilingual model pretrained with the supervision of `bert-base-multilingual-cased` on the concatenation of Wikipedia in 104 different languages. The model supports the 104 languages listed [here](https://github.com/google-research/bert/blob/master/multilingual.md#list-of-languages). The model has 6 layers, 768 dimension and 12 heads, totalizing 134M parameters (compared to 177M parameters for mBERT-base). On average DistilmBERT is twice as fast as mBERT-base.
Using DistilBERT is very similar to using BERT. DistilBERT share the same tokenizer as BERT's `bert-base-uncased` even though we provide a link to this tokenizer under the `DistilBertTokenizer` name to have a consistent naming between the library models.
```python
tokenizer = DistilBertTokenizer.from_pretrained('distilbert-base-uncased')
model = DistilBertModel.from_pretrained('distilbert-base-uncased')
tokenizer = DistilBertTokenizer.from_pretrained('distilbert-base-cased')
model = DistilBertModel.from_pretrained('distilbert-base-cased')
input_ids = torch.tensor(tokenizer.encode("Hello, my dog is cute")).unsqueeze(0)
outputs = model(input_ids)
@@ -68,8 +85,10 @@ last_hidden_states = outputs[0] # The last hidden-state is the first element of
```
Similarly, using the other Distil* models simply consists in calling the base classes with a different pretrained checkpoint:
- DistilBERT uncased: `model = DistilBertModel.from_pretrained('distilbert-base-uncased')`
- DistilGPT2: `model = GPT2Model.from_pretrained('distilgpt2')`
- DistilRoBERTa: `model = RobertaModel.from_pretrained('distilroberta-base')`
- DistilmBERT: `model = DistilBertModel.from_pretrained('distilbert-base-multilingual-cased')`
## How to train Distil*
@@ -92,7 +111,7 @@ python scripts/binarized_data.py \
--dump_file data/binarized_text
```
Our implementation of masked language modeling loss follows [XLM](https://github.com/facebookresearch/XLM)'s one and smoothes the probability of masking with a factor that put more emphasis on rare words. Thus we count the occurences of each tokens in the data:
Our implementation of masked language modeling loss follows [XLM](https://github.com/facebookresearch/XLM)'s one and smoothes the probability of masking with a factor that put more emphasis on rare words. Thus we count the occurrences of each tokens in the data:
```bash
python scripts/token_counts.py \
@@ -160,7 +179,7 @@ Happy distillation!
## Citation
If you find the ressource useful, you should cite the following paper:
If you find the resource useful, you should cite the following paper:
```
@inproceedings{sanh2019distilbert,

View File

@@ -15,40 +15,36 @@
""" The distiller to distil the student.
Adapted in part from Facebook, Inc XLM model (https://github.com/facebookresearch/XLM)
"""
import os
import math
import psutil
import os
import time
from tqdm import trange, tqdm
import numpy as np
import psutil
import psutil
import torch
import torch.nn as nn
import torch.nn.functional as F
from torch.optim import AdamW
from torch.utils.data import BatchSampler, DataLoader, RandomSampler
from torch.utils.data.distributed import DistributedSampler
from torch.utils.data import RandomSampler, BatchSampler, DataLoader
from tqdm import tqdm
from grouped_batch_sampler import GroupedBatchSampler, create_lengths_groups
from lm_seqs_dataset import LmSeqsDataset
from transformers import get_linear_schedule_with_warmup
from utils import logger
try:
from torch.utils.tensorboard import SummaryWriter
except:
except ImportError:
from tensorboardX import SummaryWriter
from transformers import get_linear_schedule_with_warmup
from utils import logger
from lm_seqs_dataset import LmSeqsDataset
from grouped_batch_sampler import GroupedBatchSampler, create_lengths_groups
class Distiller:
def __init__(self,
params: dict,
dataset: LmSeqsDataset,
token_probs: torch.tensor,
student: nn.Module,
teacher: nn.Module):
logger.info('Initializing Distiller')
def __init__(
self, params: dict, dataset: LmSeqsDataset, token_probs: torch.tensor, student: nn.Module, teacher: nn.Module
):
logger.info("Initializing Distiller")
self.params = params
self.dump_path = params.dump_path
self.multi_gpu = params.multi_gpu
@@ -71,12 +67,10 @@ class Distiller:
else:
sampler = BatchSampler(sampler=sampler, batch_size=params.batch_size, drop_last=False)
self.dataloader = DataLoader(dataset=dataset,
batch_sampler=sampler,
collate_fn=dataset.batch_sequences)
self.dataloader = DataLoader(dataset=dataset, batch_sampler=sampler, collate_fn=dataset.batch_sequences)
self.temperature = params.temperature
assert self.temperature > 0.
assert self.temperature > 0.0
self.alpha_ce = params.alpha_ce
self.alpha_mlm = params.alpha_mlm
@@ -86,18 +80,18 @@ class Distiller:
self.mlm = params.mlm
if self.mlm:
logger.info(f'Using MLM loss for LM step.')
logger.info(f"Using MLM loss for LM step.")
self.mlm_mask_prop = params.mlm_mask_prop
assert 0.0 <= self.mlm_mask_prop <= 1.0
assert params.word_mask + params.word_keep + params.word_rand == 1.0
self.pred_probs = torch.FloatTensor([params.word_mask, params.word_keep, params.word_rand])
self.pred_probs = self.pred_probs.to(f'cuda:{params.local_rank}') if params.n_gpu > 0 else self.pred_probs
self.token_probs = token_probs.to(f'cuda:{params.local_rank}') if params.n_gpu > 0 else token_probs
self.pred_probs = self.pred_probs.to(f"cuda:{params.local_rank}") if params.n_gpu > 0 else self.pred_probs
self.token_probs = token_probs.to(f"cuda:{params.local_rank}") if params.n_gpu > 0 else token_probs
if self.fp16:
self.pred_probs = self.pred_probs.half()
self.token_probs = self.token_probs.half()
else:
logger.info(f'Using CLM loss for LM step.')
logger.info(f"Using CLM loss for LM step.")
self.epoch = 0
self.n_iter = 0
@@ -108,38 +102,54 @@ class Distiller:
self.last_loss_ce = 0
self.last_loss_mlm = 0
self.last_loss_clm = 0
if self.alpha_mse > 0.: self.last_loss_mse = 0
if self.alpha_cos > 0.: self.last_loss_cos = 0
if self.alpha_mse > 0.0:
self.last_loss_mse = 0
if self.alpha_cos > 0.0:
self.last_loss_cos = 0
self.last_log = 0
self.ce_loss_fct = nn.KLDivLoss(reduction='batchmean')
self.lm_loss_fct = nn.CrossEntropyLoss(ignore_index=-1)
if self.alpha_mse > 0.:
self.mse_loss_fct = nn.MSELoss(reduction='sum')
if self.alpha_cos > 0.:
self.cosine_loss_fct = nn.CosineEmbeddingLoss(reduction='mean')
self.ce_loss_fct = nn.KLDivLoss(reduction="batchmean")
self.lm_loss_fct = nn.CrossEntropyLoss(ignore_index=-100)
if self.alpha_mse > 0.0:
self.mse_loss_fct = nn.MSELoss(reduction="sum")
if self.alpha_cos > 0.0:
self.cosine_loss_fct = nn.CosineEmbeddingLoss(reduction="mean")
logger.info('--- Initializing model optimizer')
logger.info("--- Initializing model optimizer")
assert params.gradient_accumulation_steps >= 1
self.num_steps_epoch = len(self.dataloader)
num_train_optimization_steps = int(self.num_steps_epoch / params.gradient_accumulation_steps * params.n_epoch) + 1
num_train_optimization_steps = (
int(self.num_steps_epoch / params.gradient_accumulation_steps * params.n_epoch) + 1
)
no_decay = ['bias', 'LayerNorm.weight']
no_decay = ["bias", "LayerNorm.weight"]
optimizer_grouped_parameters = [
{'params': [p for n, p in student.named_parameters() if not any(nd in n for nd in no_decay) and p.requires_grad], 'weight_decay': params.weight_decay},
{'params': [p for n, p in student.named_parameters() if any(nd in n for nd in no_decay) and p.requires_grad], 'weight_decay': 0.0}
{
"params": [
p for n, p in student.named_parameters() if not any(nd in n for nd in no_decay) and p.requires_grad
],
"weight_decay": params.weight_decay,
},
{
"params": [
p for n, p in student.named_parameters() if any(nd in n for nd in no_decay) and p.requires_grad
],
"weight_decay": 0.0,
},
]
logger.info("------ Number of trainable parameters (student): %i" % sum([p.numel() for p in self.student.parameters() if p.requires_grad]))
logger.info(
"------ Number of trainable parameters (student): %i"
% sum([p.numel() for p in self.student.parameters() if p.requires_grad])
)
logger.info("------ Number of parameters (student): %i" % sum([p.numel() for p in self.student.parameters()]))
self.optimizer = AdamW(optimizer_grouped_parameters,
lr=params.learning_rate,
eps=params.adam_epsilon,
betas=(0.9, 0.98))
self.optimizer = AdamW(
optimizer_grouped_parameters, lr=params.learning_rate, eps=params.adam_epsilon, betas=(0.9, 0.98)
)
warmup_steps = math.ceil(num_train_optimization_steps * params.warmup_prop)
self.scheduler = get_linear_schedule_with_warmup(self.optimizer,
num_warmup_steps=warmup_steps,
num_training_steps=num_train_optimization_steps)
self.scheduler = get_linear_schedule_with_warmup(
self.optimizer, num_warmup_steps=warmup_steps, num_training_steps=num_train_optimization_steps
)
if self.fp16:
try:
@@ -147,33 +157,36 @@ class Distiller:
except ImportError:
raise ImportError("Please install apex from https://www.github.com/nvidia/apex to use fp16 training.")
logger.info(f"Using fp16 training: {self.params.fp16_opt_level} level")
self.student, self.optimizer = amp.initialize(self.student,
self.optimizer,
opt_level=self.params.fp16_opt_level)
self.student, self.optimizer = amp.initialize(
self.student, self.optimizer, opt_level=self.params.fp16_opt_level
)
self.teacher = self.teacher.half()
if self.multi_gpu:
if self.fp16:
from apex.parallel import DistributedDataParallel
logger.info("Using apex.parallel.DistributedDataParallel for distributed training.")
self.student = DistributedDataParallel(self.student)
else:
from torch.nn.parallel import DistributedDataParallel
logger.info("Using nn.parallel.DistributedDataParallel for distributed training.")
self.student = DistributedDataParallel(self.student,
device_ids=[params.local_rank],
output_device=params.local_rank,
find_unused_parameters=True)
self.student = DistributedDataParallel(
self.student,
device_ids=[params.local_rank],
output_device=params.local_rank,
find_unused_parameters=True,
)
self.is_master = params.is_master
if self.is_master:
logger.info('--- Initializing Tensorboard')
self.tensorboard = SummaryWriter(log_dir=os.path.join(self.dump_path, 'log', 'train'))
self.tensorboard.add_text(tag='config/training', text_string=str(self.params), global_step=0)
self.tensorboard.add_text(tag='config/student', text_string=str(self.student_config), global_step=0)
logger.info("--- Initializing Tensorboard")
self.tensorboard = SummaryWriter(log_dir=os.path.join(self.dump_path, "log", "train"))
self.tensorboard.add_text(tag="config/training", text_string=str(self.params), global_step=0)
self.tensorboard.add_text(tag="config/student", text_string=str(self.student_config), global_step=0)
def prepare_batch_mlm(self,
batch):
def prepare_batch_mlm(self, batch):
"""
Prepare the batch: from the token_ids and the lenghts, compute the attention mask and the masked label for MLM.
@@ -187,13 +200,13 @@ class Distiller:
-------
token_ids: `torch.tensor(bs, seq_length)` - The token ids after the modifications for MLM.
attn_mask: `torch.tensor(bs, seq_length)` - The attention mask for the self-attention.
mlm_labels: `torch.tensor(bs, seq_length)` - The masked languge modeling labels. There is a -1 where there is nothing to predict.
mlm_labels: `torch.tensor(bs, seq_length)` - The masked languge modeling labels. There is a -100 where there is nothing to predict.
"""
token_ids, lengths = batch
token_ids, lengths = self.round_batch(x=token_ids, lengths=lengths)
assert token_ids.size(0) == lengths.size(0)
attn_mask = (torch.arange(token_ids.size(1), dtype=torch.long, device=lengths.device) < lengths[:, None])
attn_mask = torch.arange(token_ids.size(1), dtype=torch.long, device=lengths.device) < lengths[:, None]
bs, max_seq_len = token_ids.size()
mlm_labels = token_ids.new(token_ids.size()).copy_(token_ids)
@@ -201,11 +214,13 @@ class Distiller:
x_prob = self.token_probs[token_ids.flatten()]
n_tgt = math.ceil(self.mlm_mask_prop * lengths.sum().item())
tgt_ids = torch.multinomial(x_prob / x_prob.sum(), n_tgt, replacement=False)
pred_mask = torch.zeros(bs * max_seq_len, dtype=torch.bool, device=token_ids.device) # previously `dtype=torch.uint8`, cf pytorch 1.2.0 compatibility
pred_mask = torch.zeros(
bs * max_seq_len, dtype=torch.bool, device=token_ids.device
) # previously `dtype=torch.uint8`, cf pytorch 1.2.0 compatibility
pred_mask[tgt_ids] = 1
pred_mask = pred_mask.view(bs, max_seq_len)
pred_mask[token_ids == self.params.special_tok_ids['pad_token']] = 0
pred_mask[token_ids == self.params.special_tok_ids["pad_token"]] = 0
# mask a number of words == 0 [8] (faster with fp16)
if self.fp16:
@@ -214,26 +229,29 @@ class Distiller:
pred_mask = pred_mask.view(-1)
n2 = max(n1 % 8, 8 * (n1 // 8))
if n2 != n1:
pred_mask[torch.nonzero(pred_mask).view(-1)[:n1-n2]] = 0
pred_mask[torch.nonzero(pred_mask).view(-1)[: n1 - n2]] = 0
pred_mask = pred_mask.view(bs, max_seq_len)
assert pred_mask.sum().item() % 8 == 0, pred_mask.sum().item()
_token_ids_real = token_ids[pred_mask]
_token_ids_rand = _token_ids_real.clone().random_(self.vocab_size)
_token_ids_mask = _token_ids_real.clone().fill_(self.params.special_tok_ids['mask_token'])
_token_ids_mask = _token_ids_real.clone().fill_(self.params.special_tok_ids["mask_token"])
probs = torch.multinomial(self.pred_probs, len(_token_ids_real), replacement=True)
_token_ids = _token_ids_mask * (probs == 0).long() + _token_ids_real * (probs == 1).long() + _token_ids_rand * (probs == 2).long()
_token_ids = (
_token_ids_mask * (probs == 0).long()
+ _token_ids_real * (probs == 1).long()
+ _token_ids_rand * (probs == 2).long()
)
token_ids = token_ids.masked_scatter(pred_mask, _token_ids)
mlm_labels[~pred_mask] = -1 # previously `mlm_labels[1-pred_mask] = -1`, cf pytorch 1.2.0 compatibility
mlm_labels[~pred_mask] = -100 # previously `mlm_labels[1-pred_mask] = -1`, cf pytorch 1.2.0 compatibility
# sanity checks
assert 0 <= token_ids.min() <= token_ids.max() < self.vocab_size
return token_ids, attn_mask, mlm_labels
def prepare_batch_clm(self,
batch):
def prepare_batch_clm(self, batch):
"""
Prepare the batch: from the token_ids and the lenghts, compute the attention mask and the labels for CLM.
@@ -247,24 +265,22 @@ class Distiller:
-------
token_ids: `torch.tensor(bs, seq_length)` - The token ids after the modifications for MLM.
attn_mask: `torch.tensor(bs, seq_length)` - The attention mask for the self-attention.
clm_labels: `torch.tensor(bs, seq_length)` - The causal languge modeling labels. There is a -1 where there is nothing to predict.
clm_labels: `torch.tensor(bs, seq_length)` - The causal languge modeling labels. There is a -100 where there is nothing to predict.
"""
token_ids, lengths = batch
token_ids, lengths = self.round_batch(x=token_ids, lengths=lengths)
assert token_ids.size(0) == lengths.size(0)
attn_mask = (torch.arange(token_ids.size(1), dtype=torch.long, device=lengths.device) < lengths[:, None])
attn_mask = torch.arange(token_ids.size(1), dtype=torch.long, device=lengths.device) < lengths[:, None]
clm_labels = token_ids.new(token_ids.size()).copy_(token_ids)
clm_labels[~attn_mask] = -1 # previously `clm_labels[1-attn_mask] = -1`, cf pytorch 1.2.0 compatibility
clm_labels[~attn_mask] = -100 # previously `clm_labels[1-attn_mask] = -1`, cf pytorch 1.2.0 compatibility
# sanity checks
assert 0 <= token_ids.min() <= token_ids.max() < self.vocab_size
return token_ids, attn_mask, clm_labels
def round_batch(self,
x: torch.tensor,
lengths: torch.tensor):
def round_batch(self, x: torch.tensor, lengths: torch.tensor):
"""
For float16 only.
Sub-sample sentences in a batch, and add padding, so that each dimension is a multiple of 8.
@@ -300,9 +316,9 @@ class Distiller:
pad = 8 - (ml1 % 8)
ml2 = ml1 + pad
if self.mlm:
pad_id = self.params.special_tok_ids['pad_token']
pad_id = self.params.special_tok_ids["pad_token"]
else:
pad_id = self.params.special_tok_ids['unk_token']
pad_id = self.params.special_tok_ids["unk_token"]
padding_tensor = torch.zeros(bs2, pad, dtype=torch.long, device=x.device).fill_(pad_id)
x = torch.cat([x, padding_tensor], 1)
assert x.size() == (bs2, ml2)
@@ -315,20 +331,22 @@ class Distiller:
"""
The real training loop.
"""
if self.is_master: logger.info('Starting training')
if self.is_master:
logger.info("Starting training")
self.last_log = time.time()
self.student.train()
self.teacher.eval()
for _ in range(self.params.n_epoch):
if self.is_master: logger.info(f'--- Starting epoch {self.epoch}/{self.params.n_epoch-1}')
if self.is_master:
logger.info(f"--- Starting epoch {self.epoch}/{self.params.n_epoch-1}")
if self.multi_gpu:
torch.distributed.barrier()
iter_bar = tqdm(self.dataloader, desc="-Iter", disable=self.params.local_rank not in [-1, 0])
for batch in iter_bar:
if self.params.n_gpu > 0:
batch = tuple(t.to(f'cuda:{self.params.local_rank}') for t in batch)
batch = tuple(t.to(f"cuda:{self.params.local_rank}") for t in batch)
if self.mlm:
token_ids, attn_mask, lm_labels = self.prepare_batch_mlm(batch=batch)
@@ -337,22 +355,21 @@ class Distiller:
self.step(input_ids=token_ids, attention_mask=attn_mask, lm_labels=lm_labels)
iter_bar.update()
iter_bar.set_postfix({'Last_loss': f'{self.last_loss:.2f}',
'Avg_cum_loss': f'{self.total_loss_epoch/self.n_iter:.2f}'})
iter_bar.set_postfix(
{"Last_loss": f"{self.last_loss:.2f}", "Avg_cum_loss": f"{self.total_loss_epoch/self.n_iter:.2f}"}
)
iter_bar.close()
if self.is_master: logger.info(f'--- Ending epoch {self.epoch}/{self.params.n_epoch-1}')
if self.is_master:
logger.info(f"--- Ending epoch {self.epoch}/{self.params.n_epoch-1}")
self.end_epoch()
if self.is_master:
logger.info(f'Save very last checkpoint as `pytorch_model.bin`.')
self.save_checkpoint(checkpoint_name=f'pytorch_model.bin')
logger.info('Training is finished')
logger.info(f"Save very last checkpoint as `pytorch_model.bin`.")
self.save_checkpoint(checkpoint_name=f"pytorch_model.bin")
logger.info("Training is finished")
def step(self,
input_ids: torch.tensor,
attention_mask: torch.tensor,
lm_labels: torch.tensor):
def step(self, input_ids: torch.tensor, attention_mask: torch.tensor, lm_labels: torch.tensor):
"""
One optimization step: forward of student AND teacher, backward on the loss (for gradient accumulation),
and possibly a parameter update (depending on the gradient accumulation).
@@ -364,78 +381,91 @@ class Distiller:
lm_labels: `torch.tensor(bs, seq_length)` - The language modeling labels (mlm labels for MLM and clm labels for CLM).
"""
if self.mlm:
s_logits, s_hidden_states = self.student(input_ids=input_ids, attention_mask=attention_mask) # (bs, seq_length, voc_size)
s_logits, s_hidden_states = self.student(
input_ids=input_ids, attention_mask=attention_mask
) # (bs, seq_length, voc_size)
with torch.no_grad():
t_logits, t_hidden_states = self.teacher(input_ids=input_ids, attention_mask=attention_mask) # (bs, seq_length, voc_size)
t_logits, t_hidden_states = self.teacher(
input_ids=input_ids, attention_mask=attention_mask
) # (bs, seq_length, voc_size)
else:
s_logits, _, s_hidden_states = self.student(input_ids=input_ids, attention_mask=None) # (bs, seq_length, voc_size)
s_logits, _, s_hidden_states = self.student(
input_ids=input_ids, attention_mask=None
) # (bs, seq_length, voc_size)
with torch.no_grad():
t_logits, _, t_hidden_states = self.teacher(input_ids=input_ids, attention_mask=None) # (bs, seq_length, voc_size)
t_logits, _, t_hidden_states = self.teacher(
input_ids=input_ids, attention_mask=None
) # (bs, seq_length, voc_size)
assert s_logits.size() == t_logits.size()
#https://github.com/peterliht/knowledge-distillation-pytorch/blob/master/model/net.py#L100
#https://github.com/peterliht/knowledge-distillation-pytorch/issues/2
# https://github.com/peterliht/knowledge-distillation-pytorch/blob/master/model/net.py#L100
# https://github.com/peterliht/knowledge-distillation-pytorch/issues/2
if self.params.restrict_ce_to_mask:
mask = (lm_labels>-1).unsqueeze(-1).expand_as(s_logits) # (bs, seq_lenth, voc_size)
mask = (lm_labels > -1).unsqueeze(-1).expand_as(s_logits) # (bs, seq_lenth, voc_size)
else:
mask = attention_mask.unsqueeze(-1).expand_as(s_logits) # (bs, seq_lenth, voc_size)
s_logits_slct = torch.masked_select(s_logits, mask) # (bs * seq_length * voc_size) modulo the 1s in mask
s_logits_slct = s_logits_slct.view(-1, s_logits.size(-1)) # (bs * seq_length, voc_size) modulo the 1s in mask
t_logits_slct = torch.masked_select(t_logits, mask) # (bs * seq_length * voc_size) modulo the 1s in mask
t_logits_slct = t_logits_slct.view(-1, s_logits.size(-1)) # (bs * seq_length, voc_size) modulo the 1s in mask
mask = attention_mask.unsqueeze(-1).expand_as(s_logits) # (bs, seq_lenth, voc_size)
s_logits_slct = torch.masked_select(s_logits, mask) # (bs * seq_length * voc_size) modulo the 1s in mask
s_logits_slct = s_logits_slct.view(-1, s_logits.size(-1)) # (bs * seq_length, voc_size) modulo the 1s in mask
t_logits_slct = torch.masked_select(t_logits, mask) # (bs * seq_length * voc_size) modulo the 1s in mask
t_logits_slct = t_logits_slct.view(-1, s_logits.size(-1)) # (bs * seq_length, voc_size) modulo the 1s in mask
assert t_logits_slct.size() == s_logits_slct.size()
loss_ce = self.ce_loss_fct(F.log_softmax(s_logits_slct/self.temperature, dim=-1),
F.softmax(t_logits_slct/self.temperature, dim=-1)) * (self.temperature)**2
loss = self.alpha_ce*loss_ce
loss_ce = (
self.ce_loss_fct(
F.log_softmax(s_logits_slct / self.temperature, dim=-1),
F.softmax(t_logits_slct / self.temperature, dim=-1),
)
* (self.temperature) ** 2
)
loss = self.alpha_ce * loss_ce
if self.alpha_mlm > 0.:
if self.alpha_mlm > 0.0:
loss_mlm = self.lm_loss_fct(s_logits.view(-1, s_logits.size(-1)), lm_labels.view(-1))
loss += self.alpha_mlm * loss_mlm
if self.alpha_clm > 0.:
if self.alpha_clm > 0.0:
shift_logits = s_logits[..., :-1, :].contiguous()
shift_labels = lm_labels[..., 1:].contiguous()
loss_clm = self.lm_loss_fct(shift_logits.view(-1, shift_logits.size(-1)),
shift_labels.view(-1))
loss_clm = self.lm_loss_fct(shift_logits.view(-1, shift_logits.size(-1)), shift_labels.view(-1))
loss += self.alpha_clm * loss_clm
if self.alpha_mse > 0.:
loss_mse = self.mse_loss_fct(s_logits_slct, t_logits_slct)/s_logits_slct.size(0) # Reproducing batchmean reduction
if self.alpha_mse > 0.0:
loss_mse = self.mse_loss_fct(s_logits_slct, t_logits_slct) / s_logits_slct.size(
0
) # Reproducing batchmean reduction
loss += self.alpha_mse * loss_mse
if self.alpha_cos > 0.:
s_hidden_states = s_hidden_states[-1] # (bs, seq_length, dim)
t_hidden_states = t_hidden_states[-1] # (bs, seq_length, dim)
mask = attention_mask.unsqueeze(-1).expand_as(s_hidden_states) # (bs, seq_length, dim)
if self.alpha_cos > 0.0:
s_hidden_states = s_hidden_states[-1] # (bs, seq_length, dim)
t_hidden_states = t_hidden_states[-1] # (bs, seq_length, dim)
mask = attention_mask.unsqueeze(-1).expand_as(s_hidden_states) # (bs, seq_length, dim)
assert s_hidden_states.size() == t_hidden_states.size()
dim = s_hidden_states.size(-1)
s_hidden_states_slct = torch.masked_select(s_hidden_states, mask) # (bs * seq_length * dim)
s_hidden_states_slct = s_hidden_states_slct.view(-1, dim) # (bs * seq_length, dim)
t_hidden_states_slct = torch.masked_select(t_hidden_states, mask) # (bs * seq_length * dim)
t_hidden_states_slct = t_hidden_states_slct.view(-1, dim) # (bs * seq_length, dim)
target = s_hidden_states_slct.new(s_hidden_states_slct.size(0)).fill_(1) # (bs * seq_length,)
s_hidden_states_slct = torch.masked_select(s_hidden_states, mask) # (bs * seq_length * dim)
s_hidden_states_slct = s_hidden_states_slct.view(-1, dim) # (bs * seq_length, dim)
t_hidden_states_slct = torch.masked_select(t_hidden_states, mask) # (bs * seq_length * dim)
t_hidden_states_slct = t_hidden_states_slct.view(-1, dim) # (bs * seq_length, dim)
target = s_hidden_states_slct.new(s_hidden_states_slct.size(0)).fill_(1) # (bs * seq_length,)
loss_cos = self.cosine_loss_fct(s_hidden_states_slct, t_hidden_states_slct, target)
loss += self.alpha_cos * loss_cos
self.total_loss_epoch += loss.item()
self.last_loss = loss.item()
self.last_loss_ce = loss_ce.item()
if self.alpha_mlm > 0.:
if self.alpha_mlm > 0.0:
self.last_loss_mlm = loss_mlm.item()
if self.alpha_clm > 0.:
if self.alpha_clm > 0.0:
self.last_loss_clm = loss_clm.item()
if self.alpha_mse > 0.:
if self.alpha_mse > 0.0:
self.last_loss_mse = loss_mse.item()
if self.alpha_cos > 0.:
if self.alpha_cos > 0.0:
self.last_loss_cos = loss_cos.item()
self.optimize(loss)
self.n_sequences_epoch += input_ids.size(0)
def optimize(self,
loss):
def optimize(self, loss):
"""
Normalization on the loss (gradient accumulation or distributed training), followed by
backward pass on the loss, possibly followed by a parameter update (depending on the gradient accumulation).
@@ -443,7 +473,7 @@ class Distiller:
"""
# Check for NaN
if (loss != loss).data.any():
logger.error('NaN detected')
logger.error("NaN detected")
exit()
if self.multi_gpu:
@@ -453,6 +483,7 @@ class Distiller:
if self.fp16:
from apex import amp
with amp.scale_loss(loss, self.optimizer) as scaled_loss:
scaled_loss.backward()
else:
@@ -489,53 +520,84 @@ class Distiller:
return
for param_name, param in self.student.named_parameters():
self.tensorboard.add_scalar(tag='parameter_mean/' + param_name, scalar_value=param.data.mean(), global_step=self.n_total_iter)
self.tensorboard.add_scalar(tag='parameter_std/' + param_name, scalar_value=param.data.std(), global_step=self.n_total_iter)
self.tensorboard.add_scalar(
tag="parameter_mean/" + param_name, scalar_value=param.data.mean(), global_step=self.n_total_iter
)
self.tensorboard.add_scalar(
tag="parameter_std/" + param_name, scalar_value=param.data.std(), global_step=self.n_total_iter
)
if param.grad is None:
continue
self.tensorboard.add_scalar(tag="grad_mean/" + param_name, scalar_value=param.grad.data.mean(),global_step=self.n_total_iter)
self.tensorboard.add_scalar(tag="grad_std/" + param_name, scalar_value=param.grad.data.std(), global_step=self.n_total_iter)
self.tensorboard.add_scalar(
tag="grad_mean/" + param_name, scalar_value=param.grad.data.mean(), global_step=self.n_total_iter
)
self.tensorboard.add_scalar(
tag="grad_std/" + param_name, scalar_value=param.grad.data.std(), global_step=self.n_total_iter
)
self.tensorboard.add_scalar(tag="losses/cum_avg_loss_epoch", scalar_value=self.total_loss_epoch/self.n_iter, global_step=self.n_total_iter)
self.tensorboard.add_scalar(
tag="losses/cum_avg_loss_epoch",
scalar_value=self.total_loss_epoch / self.n_iter,
global_step=self.n_total_iter,
)
self.tensorboard.add_scalar(tag="losses/loss", scalar_value=self.last_loss, global_step=self.n_total_iter)
self.tensorboard.add_scalar(tag="losses/loss_ce", scalar_value=self.last_loss_ce, global_step=self.n_total_iter)
if self.alpha_mlm > 0.:
self.tensorboard.add_scalar(tag="losses/loss_mlm", scalar_value=self.last_loss_mlm, global_step=self.n_total_iter)
if self.alpha_clm > 0.:
self.tensorboard.add_scalar(tag="losses/loss_clm", scalar_value=self.last_loss_clm, global_step=self.n_total_iter)
if self.alpha_mse > 0.:
self.tensorboard.add_scalar(tag="losses/loss_mse", scalar_value=self.last_loss_mse, global_step=self.n_total_iter)
if self.alpha_cos > 0.:
self.tensorboard.add_scalar(tag="losses/loss_cos", scalar_value=self.last_loss_cos, global_step=self.n_total_iter)
self.tensorboard.add_scalar(tag="learning_rate/lr", scalar_value=self.scheduler.get_lr()[0], global_step=self.n_total_iter)
self.tensorboard.add_scalar(tag="global/memory_usage", scalar_value=psutil.virtual_memory()._asdict()['used']/1_000_000, global_step=self.n_total_iter)
self.tensorboard.add_scalar(tag="global/speed", scalar_value=time.time()-self.last_log, global_step=self.n_total_iter)
self.tensorboard.add_scalar(
tag="losses/loss_ce", scalar_value=self.last_loss_ce, global_step=self.n_total_iter
)
if self.alpha_mlm > 0.0:
self.tensorboard.add_scalar(
tag="losses/loss_mlm", scalar_value=self.last_loss_mlm, global_step=self.n_total_iter
)
if self.alpha_clm > 0.0:
self.tensorboard.add_scalar(
tag="losses/loss_clm", scalar_value=self.last_loss_clm, global_step=self.n_total_iter
)
if self.alpha_mse > 0.0:
self.tensorboard.add_scalar(
tag="losses/loss_mse", scalar_value=self.last_loss_mse, global_step=self.n_total_iter
)
if self.alpha_cos > 0.0:
self.tensorboard.add_scalar(
tag="losses/loss_cos", scalar_value=self.last_loss_cos, global_step=self.n_total_iter
)
self.tensorboard.add_scalar(
tag="learning_rate/lr", scalar_value=self.scheduler.get_lr()[0], global_step=self.n_total_iter
)
self.tensorboard.add_scalar(
tag="global/memory_usage",
scalar_value=psutil.virtual_memory()._asdict()["used"] / 1_000_000,
global_step=self.n_total_iter,
)
self.tensorboard.add_scalar(
tag="global/speed", scalar_value=time.time() - self.last_log, global_step=self.n_total_iter
)
def end_epoch(self):
"""
Finally arrived at the end of epoch (full pass on dataset).
Do some tensorboard logging and checkpoint saving.
"""
logger.info(f'{self.n_sequences_epoch} sequences have been trained during this epoch.')
logger.info(f"{self.n_sequences_epoch} sequences have been trained during this epoch.")
if self.is_master:
self.save_checkpoint(checkpoint_name=f'model_epoch_{self.epoch}.pth')
self.tensorboard.add_scalar(tag='epoch/loss', scalar_value=self.total_loss_epoch/self.n_iter, global_step=self.epoch)
self.save_checkpoint(checkpoint_name=f"model_epoch_{self.epoch}.pth")
self.tensorboard.add_scalar(
tag="epoch/loss", scalar_value=self.total_loss_epoch / self.n_iter, global_step=self.epoch
)
self.epoch += 1
self.n_sequences_epoch = 0
self.n_iter = 0
self.total_loss_epoch = 0
def save_checkpoint(self,
checkpoint_name: str = 'checkpoint.pth'):
def save_checkpoint(self, checkpoint_name: str = "checkpoint.pth"):
"""
Save the current state. Only by the master process.
"""
if not self.is_master:
return
mdl_to_save = self.student.module if hasattr(self.student, 'module') else self.student
mdl_to_save = self.student.module if hasattr(self.student, "module") else self.student
mdl_to_save.config.save_pretrained(self.dump_path)
state_dict = mdl_to_save.state_dict()
torch.save(state_dict, os.path.join(self.dump_path, checkpoint_name))

View File

@@ -17,18 +17,20 @@
import bisect
import copy
from collections import defaultdict
import numpy as np
import numpy as np
from torch.utils.data.sampler import BatchSampler, Sampler
from utils import logger
def _quantize(x, bins):
bins = copy.deepcopy(bins)
bins = sorted(bins)
quantized = list(map(lambda y: bisect.bisect_right(bins, y), x))
return quantized
def create_lengths_groups(lengths, k=0):
bins = np.arange(start=3, stop=k, step=4).tolist() if k > 0 else [10]
groups = _quantize(lengths, bins)
@@ -39,6 +41,7 @@ def create_lengths_groups(lengths, k=0):
logger.info("Count of instances per bin: {}".format(counts))
return groups
class GroupedBatchSampler(BatchSampler):
"""
Wraps another sampler to yield a mini-batch of indices.
@@ -53,11 +56,11 @@ class GroupedBatchSampler(BatchSampler):
0, i.e. they must be in the range [0, num_groups).
batch_size (int): Size of mini-batch.
"""
def __init__(self, sampler, group_ids, batch_size):
if not isinstance(sampler, Sampler):
raise ValueError(
"sampler should be an instance of "
"torch.utils.data.Sampler, but got sampler={}".format(sampler)
"sampler should be an instance of " "torch.utils.data.Sampler, but got sampler={}".format(sampler)
)
self.sampler = sampler
self.group_ids = group_ids
@@ -73,7 +76,7 @@ class GroupedBatchSampler(BatchSampler):
buffer_per_group[group_id].append(idx)
samples_per_group[group_id].append(idx)
if len(buffer_per_group[group_id]) == self.batch_size:
yield buffer_per_group[group_id] #TODO
yield buffer_per_group[group_id] # TODO
num_batches += 1
del buffer_per_group[group_id]
assert len(buffer_per_group[group_id]) < self.batch_size
@@ -90,8 +93,8 @@ class GroupedBatchSampler(BatchSampler):
for group_id, idxs in sorted(buffer_per_group.items(), key=lambda x: x[0]):
batch_idx.extend(idxs)
if len(batch_idx) >= self.batch_size:
yield batch_idx[:self.batch_size]
batch_idx = batch_idx[self.batch_size:]
yield batch_idx[: self.batch_size]
batch_idx = batch_idx[self.batch_size :]
num_remaining -= 1
if len(batch_idx) > 0:
yield batch_idx

View File

@@ -15,12 +15,13 @@
""" Dataset to distilled models
adapted in part from Facebook, Inc XLM model (https://github.com/facebookresearch/XLM)
"""
import numpy as np
import torch
from torch.utils.data import Dataset
import numpy as np
from utils import logger
class LmSeqsDataset(Dataset):
"""Custom Dataset wrapping language modeling sequences.
@@ -32,9 +33,7 @@ class LmSeqsDataset(Dataset):
data: `List[np.array[int]]
"""
def __init__(self,
params,
data):
def __init__(self, params, data):
self.params = params
self.token_ids = np.array(data)
@@ -43,6 +42,7 @@ class LmSeqsDataset(Dataset):
self.check()
self.remove_long_sequences()
self.remove_empty_sequences()
self.remove_unknown_sequences()
self.check()
self.print_statistics()
@@ -57,7 +57,7 @@ class LmSeqsDataset(Dataset):
Some sanity checks
"""
assert len(self.token_ids) == len(self.lengths)
assert all(self.lengths[i] == len(self.token_ids[i]) for i in range(len(self.lengths)))
assert all(self.lengths[i] == len(self.token_ids[i]) for i in range(len(self.lengths)))
def remove_long_sequences(self):
"""
@@ -65,17 +65,17 @@ class LmSeqsDataset(Dataset):
"""
max_len = self.params.max_model_input_size
indices = self.lengths > max_len
logger.info(f'Splitting {sum(indices)} too long sequences.')
logger.info(f"Splitting {sum(indices)} too long sequences.")
def divide_chunks(l, n):
return [l[i:i + n] for i in range(0, len(l), n)]
return [l[i : i + n] for i in range(0, len(l), n)]
new_tok_ids = []
new_lengths = []
if self.params.mlm:
cls_id, sep_id = self.params.special_tok_ids['cls_token'], self.params.special_tok_ids['sep_token']
cls_id, sep_id = self.params.special_tok_ids["cls_token"], self.params.special_tok_ids["sep_token"]
else:
cls_id, sep_id = self.params.special_tok_ids['bos_token'], self.params.special_tok_ids['eos_token']
cls_id, sep_id = self.params.special_tok_ids["bos_token"], self.params.special_tok_ids["eos_token"]
for seq_, len_ in zip(self.token_ids, self.lengths):
assert (seq_[0] == cls_id) and (seq_[-1] == sep_id), seq_
@@ -84,7 +84,7 @@ class LmSeqsDataset(Dataset):
new_lengths.append(len_)
else:
sub_seqs = []
for sub_s in divide_chunks(seq_, max_len-2):
for sub_s in divide_chunks(seq_, max_len - 2):
if sub_s[0] != cls_id:
sub_s = np.insert(sub_s, 0, cls_id)
if sub_s[-1] != sep_id:
@@ -108,7 +108,23 @@ class LmSeqsDataset(Dataset):
self.token_ids = self.token_ids[indices]
self.lengths = self.lengths[indices]
new_size = len(self)
logger.info(f'Remove {init_size - new_size} too short (<=11 tokens) sequences.')
logger.info(f"Remove {init_size - new_size} too short (<=11 tokens) sequences.")
def remove_unknown_sequences(self):
"""
Remove sequences with a (too) high level of unknown tokens.
"""
if "unk_token" not in self.params.special_tok_ids:
return
else:
unk_token_id = self.params.special_tok_ids["unk_token"]
init_size = len(self)
unk_occs = np.array([np.count_nonzero(a == unk_token_id) for a in self.token_ids])
indices = (unk_occs / self.lengths) < 0.5
self.token_ids = self.token_ids[indices]
self.lengths = self.lengths[indices]
new_size = len(self)
logger.info(f"Remove {init_size - new_size} sequences with a high level of unknown tokens (50%).")
def print_statistics(self):
"""
@@ -116,7 +132,7 @@ class LmSeqsDataset(Dataset):
"""
if not self.params.is_master:
return
logger.info(f'{len(self)} sequences')
logger.info(f"{len(self)} sequences")
# data_len = sum(self.lengths)
# nb_unique_tokens = len(Counter(list(chain(*self.token_ids))))
# logger.info(f'{data_len} tokens ({nb_unique_tokens} unique)')
@@ -125,8 +141,7 @@ class LmSeqsDataset(Dataset):
# nb_unkown = sum([(t==unk_idx).sum() for t in self.token_ids])
# logger.info(f'{nb_unkown} unknown tokens (covering {100*nb_unkown/data_len:.2f}% of the data)')
def batch_sequences(self,
batch):
def batch_sequences(self, batch):
"""
Do the padding and transform into torch.tensor.
"""
@@ -139,13 +154,13 @@ class LmSeqsDataset(Dataset):
# Pad token ids
if self.params.mlm:
pad_idx = self.params.special_tok_ids['pad_token']
pad_idx = self.params.special_tok_ids["pad_token"]
else:
pad_idx = self.params.special_tok_ids['unk_token']
tk_ = [list(t.astype(int)) + [pad_idx]*(max_seq_len_-len(t)) for t in token_ids]
pad_idx = self.params.special_tok_ids["unk_token"]
tk_ = [list(t.astype(int)) + [pad_idx] * (max_seq_len_ - len(t)) for t in token_ids]
assert len(tk_) == len(token_ids)
assert all(len(t) == max_seq_len_ for t in tk_)
tk_t = torch.tensor(tk_) # (bs, max_seq_len_)
tk_t = torch.tensor(tk_) # (bs, max_seq_len_)
lg_t = torch.tensor(lengths) # (bs)
return tk_t, lg_t

View File

@@ -1,6 +1,7 @@
transformers
gitpython==3.0.2
tensorboard>=1.14.0
tensorboardX==1.8
psutil==5.6.3
psutil==5.6.6
scipy==1.3.1
transformers==2.0.0

File diff suppressed because it is too large Load Diff

View File

@@ -16,75 +16,79 @@
Preprocessing script before distillation.
"""
import argparse
import logging
import pickle
import random
import time
import numpy as np
from transformers import BertTokenizer, RobertaTokenizer, GPT2Tokenizer
import logging
logging.basicConfig(format = '%(asctime)s - %(levelname)s - %(name)s - %(message)s',
datefmt = '%m/%d/%Y %H:%M:%S',
level = logging.INFO)
import numpy as np
from transformers import BertTokenizer, GPT2Tokenizer, RobertaTokenizer
logging.basicConfig(
format="%(asctime)s - %(levelname)s - %(name)s - %(message)s", datefmt="%m/%d/%Y %H:%M:%S", level=logging.INFO
)
logger = logging.getLogger(__name__)
def main():
parser = argparse.ArgumentParser(description="Preprocess the data to avoid re-doing it several times by (tokenization + token_to_ids).")
parser.add_argument('--file_path', type=str, default='data/dump.txt',
help='The path to the data.')
parser.add_argument('--tokenizer_type', type=str, default='bert', choices=['bert', 'roberta', 'gpt2'])
parser.add_argument('--tokenizer_name', type=str, default='bert-base-uncased',
help="The tokenizer to use.")
parser.add_argument('--dump_file', type=str, default='data/dump',
help='The dump file prefix.')
parser = argparse.ArgumentParser(
description="Preprocess the data to avoid re-doing it several times by (tokenization + token_to_ids)."
)
parser.add_argument("--file_path", type=str, default="data/dump.txt", help="The path to the data.")
parser.add_argument("--tokenizer_type", type=str, default="bert", choices=["bert", "roberta", "gpt2"])
parser.add_argument("--tokenizer_name", type=str, default="bert-base-uncased", help="The tokenizer to use.")
parser.add_argument("--dump_file", type=str, default="data/dump", help="The dump file prefix.")
args = parser.parse_args()
logger.info(f'Loading Tokenizer ({args.tokenizer_name})')
if args.tokenizer_type == 'bert':
logger.info(f"Loading Tokenizer ({args.tokenizer_name})")
if args.tokenizer_type == "bert":
tokenizer = BertTokenizer.from_pretrained(args.tokenizer_name)
bos = tokenizer.special_tokens_map['cls_token'] # `[CLS]`
sep = tokenizer.special_tokens_map['sep_token'] # `[SEP]`
elif args.tokenizer_type == 'roberta':
bos = tokenizer.special_tokens_map["cls_token"] # `[CLS]`
sep = tokenizer.special_tokens_map["sep_token"] # `[SEP]`
elif args.tokenizer_type == "roberta":
tokenizer = RobertaTokenizer.from_pretrained(args.tokenizer_name)
bos = tokenizer.special_tokens_map['cls_token'] # `<s>`
sep = tokenizer.special_tokens_map['sep_token'] # `</s>`
elif args.tokenizer_type == 'gpt2':
bos = tokenizer.special_tokens_map["cls_token"] # `<s>`
sep = tokenizer.special_tokens_map["sep_token"] # `</s>`
elif args.tokenizer_type == "gpt2":
tokenizer = GPT2Tokenizer.from_pretrained(args.tokenizer_name)
bos = tokenizer.special_tokens_map['bos_token'] # `<|endoftext|>`
sep = tokenizer.special_tokens_map['eos_token'] # `<|endoftext|>`
bos = tokenizer.special_tokens_map["bos_token"] # `<|endoftext|>`
sep = tokenizer.special_tokens_map["eos_token"] # `<|endoftext|>`
logger.info(f'Loading text from {args.file_path}')
with open(args.file_path, 'r', encoding='utf8') as fp:
logger.info(f"Loading text from {args.file_path}")
with open(args.file_path, "r", encoding="utf8") as fp:
data = fp.readlines()
logger.info(f'Start encoding')
logger.info(f'{len(data)} examples to process.')
logger.info(f"Start encoding")
logger.info(f"{len(data)} examples to process.")
rslt = []
iter = 0
interval = 10000
start = time.time()
for text in data:
text = f'{bos} {text.strip()} {sep}'
text = f"{bos} {text.strip()} {sep}"
token_ids = tokenizer.encode(text, add_special_tokens=False)
rslt.append(token_ids)
iter += 1
if iter % interval == 0:
end = time.time()
logger.info(f'{iter} examples processed. - {(end-start)/interval:.2f}s/expl')
logger.info(f"{iter} examples processed. - {(end-start):.2f}s/{interval}expl")
start = time.time()
logger.info('Finished binarization')
logger.info(f'{len(data)} examples processed.')
logger.info("Finished binarization")
logger.info(f"{len(data)} examples processed.")
dp_file = f'{args.dump_file}.{args.tokenizer_name}.pickle'
rslt_ = [np.uint16(d) for d in rslt]
dp_file = f"{args.dump_file}.{args.tokenizer_name}.pickle"
vocab_size = tokenizer.vocab_size
if vocab_size < (1 << 16):
rslt_ = [np.uint16(d) for d in rslt]
else:
rslt_ = [np.int32(d) for d in rslt]
random.shuffle(rslt_)
logger.info(f'Dump to {dp_file}')
with open(dp_file, 'wb') as handle:
logger.info(f"Dump to {dp_file}")
with open(dp_file, "wb") as handle:
pickle.dump(rslt_, handle, protocol=pickle.HIGHEST_PROTOCOL)

View File

@@ -16,74 +16,87 @@
Preprocessing script before training the distilled model.
Specific to RoBERTa -> DistilRoBERTa and GPT2 -> DistilGPT2.
"""
from transformers import BertForMaskedLM, RobertaForMaskedLM, GPT2LMHeadModel
import torch
import argparse
if __name__ == '__main__':
parser = argparse.ArgumentParser(description="Extraction some layers of the full RobertaForMaskedLM or GPT2LMHeadModel for Transfer Learned Distillation")
import torch
from transformers import GPT2LMHeadModel, RobertaForMaskedLM
if __name__ == "__main__":
parser = argparse.ArgumentParser(
description="Extraction some layers of the full RobertaForMaskedLM or GPT2LMHeadModel for Transfer Learned Distillation"
)
parser.add_argument("--model_type", default="roberta", choices=["roberta", "gpt2"])
parser.add_argument("--model_name", default='roberta-large', type=str)
parser.add_argument("--dump_checkpoint", default='serialization_dir/tf_roberta_048131723.pth', type=str)
parser.add_argument("--vocab_transform", action='store_true')
parser.add_argument("--model_name", default="roberta-large", type=str)
parser.add_argument("--dump_checkpoint", default="serialization_dir/tf_roberta_048131723.pth", type=str)
parser.add_argument("--vocab_transform", action="store_true")
args = parser.parse_args()
if args.model_type == 'roberta':
if args.model_type == "roberta":
model = RobertaForMaskedLM.from_pretrained(args.model_name)
prefix = 'roberta'
elif args.model_type == 'gpt2':
prefix = "roberta"
elif args.model_type == "gpt2":
model = GPT2LMHeadModel.from_pretrained(args.model_name)
prefix = 'transformer'
prefix = "transformer"
state_dict = model.state_dict()
compressed_sd = {}
### Embeddings ###
if args.model_type == 'gpt2':
for param_name in ['wte.weight', 'wpe.weight']:
compressed_sd[f'{prefix}.{param_name}'] = state_dict[f'{prefix}.{param_name}']
# Embeddings #
if args.model_type == "gpt2":
for param_name in ["wte.weight", "wpe.weight"]:
compressed_sd[f"{prefix}.{param_name}"] = state_dict[f"{prefix}.{param_name}"]
else:
for w in ['word_embeddings', 'position_embeddings', 'token_type_embeddings']:
param_name = f'{prefix}.embeddings.{w}.weight'
for w in ["word_embeddings", "position_embeddings", "token_type_embeddings"]:
param_name = f"{prefix}.embeddings.{w}.weight"
compressed_sd[param_name] = state_dict[param_name]
for w in ['weight', 'bias']:
param_name = f'{prefix}.embeddings.LayerNorm.{w}'
for w in ["weight", "bias"]:
param_name = f"{prefix}.embeddings.LayerNorm.{w}"
compressed_sd[param_name] = state_dict[param_name]
### Transformer Blocks ###
# Transformer Blocks #
std_idx = 0
for teacher_idx in [0, 2, 4, 7, 9, 11]:
if args.model_type == 'gpt2':
for layer in ['ln_1', 'attn.c_attn', 'attn.c_proj', 'ln_2', 'mlp.c_fc', 'mlp.c_proj']:
for w in ['weight', 'bias']:
compressed_sd[f'{prefix}.h.{std_idx}.{layer}.{w}'] = \
state_dict[f'{prefix}.h.{teacher_idx}.{layer}.{w}']
compressed_sd[f'{prefix}.h.{std_idx}.attn.bias'] = state_dict[f'{prefix}.h.{teacher_idx}.attn.bias']
if args.model_type == "gpt2":
for layer in ["ln_1", "attn.c_attn", "attn.c_proj", "ln_2", "mlp.c_fc", "mlp.c_proj"]:
for w in ["weight", "bias"]:
compressed_sd[f"{prefix}.h.{std_idx}.{layer}.{w}"] = state_dict[
f"{prefix}.h.{teacher_idx}.{layer}.{w}"
]
compressed_sd[f"{prefix}.h.{std_idx}.attn.bias"] = state_dict[f"{prefix}.h.{teacher_idx}.attn.bias"]
else:
for layer in ['attention.self.query', 'attention.self.key', 'attention.self.value',
'attention.output.dense', 'attention.output.LayerNorm',
'intermediate.dense', 'output.dense', 'output.LayerNorm']:
for w in ['weight', 'bias']:
compressed_sd[f'{prefix}.encoder.layer.{std_idx}.{layer}.{w}'] = \
state_dict[f'{prefix}.encoder.layer.{teacher_idx}.{layer}.{w}']
for layer in [
"attention.self.query",
"attention.self.key",
"attention.self.value",
"attention.output.dense",
"attention.output.LayerNorm",
"intermediate.dense",
"output.dense",
"output.LayerNorm",
]:
for w in ["weight", "bias"]:
compressed_sd[f"{prefix}.encoder.layer.{std_idx}.{layer}.{w}"] = state_dict[
f"{prefix}.encoder.layer.{teacher_idx}.{layer}.{w}"
]
std_idx += 1
### Language Modeling Head ###s
if args.model_type == 'roberta':
for layer in ['lm_head.decoder.weight', 'lm_head.bias']:
compressed_sd[f'{layer}'] = state_dict[f'{layer}']
# Language Modeling Head ###s
if args.model_type == "roberta":
for layer in ["lm_head.decoder.weight", "lm_head.bias"]:
compressed_sd[f"{layer}"] = state_dict[f"{layer}"]
if args.vocab_transform:
for w in ['weight', 'bias']:
compressed_sd[f'lm_head.dense.{w}'] = state_dict[f'lm_head.dense.{w}']
compressed_sd[f'lm_head.layer_norm.{w}'] = state_dict[f'lm_head.layer_norm.{w}']
elif args.model_type == 'gpt2':
for w in ['weight', 'bias']:
compressed_sd[f'{prefix}.ln_f.{w}'] = state_dict[f'{prefix}.ln_f.{w}']
compressed_sd[f'lm_head.weight'] = state_dict[f'lm_head.weight']
for w in ["weight", "bias"]:
compressed_sd[f"lm_head.dense.{w}"] = state_dict[f"lm_head.dense.{w}"]
compressed_sd[f"lm_head.layer_norm.{w}"] = state_dict[f"lm_head.layer_norm.{w}"]
elif args.model_type == "gpt2":
for w in ["weight", "bias"]:
compressed_sd[f"{prefix}.ln_f.{w}"] = state_dict[f"{prefix}.ln_f.{w}"]
compressed_sd[f"lm_head.weight"] = state_dict[f"lm_head.weight"]
print(f'N layers selected for distillation: {std_idx}')
print(f'Number of params transfered for distillation: {len(compressed_sd.keys())}')
print(f"N layers selected for distillation: {std_idx}")
print(f"Number of params transfered for distillation: {len(compressed_sd.keys())}")
print(f'Save transfered checkpoint to {args.dump_checkpoint}.')
print(f"Save transfered checkpoint to {args.dump_checkpoint}.")
torch.save(compressed_sd, args.dump_checkpoint)

View File

@@ -16,67 +16,77 @@
Preprocessing script before training DistilBERT.
Specific to BERT -> DistilBERT.
"""
from transformers import BertForMaskedLM, RobertaForMaskedLM
import torch
import argparse
if __name__ == '__main__':
parser = argparse.ArgumentParser(description="Extraction some layers of the full BertForMaskedLM or RObertaForMaskedLM for Transfer Learned Distillation")
import torch
from transformers import BertForMaskedLM
if __name__ == "__main__":
parser = argparse.ArgumentParser(
description="Extraction some layers of the full BertForMaskedLM or RObertaForMaskedLM for Transfer Learned Distillation"
)
parser.add_argument("--model_type", default="bert", choices=["bert"])
parser.add_argument("--model_name", default='bert-base-uncased', type=str)
parser.add_argument("--dump_checkpoint", default='serialization_dir/tf_bert-base-uncased_0247911.pth', type=str)
parser.add_argument("--vocab_transform", action='store_true')
parser.add_argument("--model_name", default="bert-base-uncased", type=str)
parser.add_argument("--dump_checkpoint", default="serialization_dir/tf_bert-base-uncased_0247911.pth", type=str)
parser.add_argument("--vocab_transform", action="store_true")
args = parser.parse_args()
if args.model_type == 'bert':
if args.model_type == "bert":
model = BertForMaskedLM.from_pretrained(args.model_name)
prefix = 'bert'
prefix = "bert"
else:
raise ValueError(f'args.model_type should be "bert".')
state_dict = model.state_dict()
compressed_sd = {}
for w in ['word_embeddings', 'position_embeddings']:
compressed_sd[f'distilbert.embeddings.{w}.weight'] = \
state_dict[f'{prefix}.embeddings.{w}.weight']
for w in ['weight', 'bias']:
compressed_sd[f'distilbert.embeddings.LayerNorm.{w}'] = \
state_dict[f'{prefix}.embeddings.LayerNorm.{w}']
for w in ["word_embeddings", "position_embeddings"]:
compressed_sd[f"distilbert.embeddings.{w}.weight"] = state_dict[f"{prefix}.embeddings.{w}.weight"]
for w in ["weight", "bias"]:
compressed_sd[f"distilbert.embeddings.LayerNorm.{w}"] = state_dict[f"{prefix}.embeddings.LayerNorm.{w}"]
std_idx = 0
for teacher_idx in [0, 2, 4, 7, 9, 11]:
for w in ['weight', 'bias']:
compressed_sd[f'distilbert.transformer.layer.{std_idx}.attention.q_lin.{w}'] = \
state_dict[f'{prefix}.encoder.layer.{teacher_idx}.attention.self.query.{w}']
compressed_sd[f'distilbert.transformer.layer.{std_idx}.attention.k_lin.{w}'] = \
state_dict[f'{prefix}.encoder.layer.{teacher_idx}.attention.self.key.{w}']
compressed_sd[f'distilbert.transformer.layer.{std_idx}.attention.v_lin.{w}'] = \
state_dict[f'{prefix}.encoder.layer.{teacher_idx}.attention.self.value.{w}']
for w in ["weight", "bias"]:
compressed_sd[f"distilbert.transformer.layer.{std_idx}.attention.q_lin.{w}"] = state_dict[
f"{prefix}.encoder.layer.{teacher_idx}.attention.self.query.{w}"
]
compressed_sd[f"distilbert.transformer.layer.{std_idx}.attention.k_lin.{w}"] = state_dict[
f"{prefix}.encoder.layer.{teacher_idx}.attention.self.key.{w}"
]
compressed_sd[f"distilbert.transformer.layer.{std_idx}.attention.v_lin.{w}"] = state_dict[
f"{prefix}.encoder.layer.{teacher_idx}.attention.self.value.{w}"
]
compressed_sd[f'distilbert.transformer.layer.{std_idx}.attention.out_lin.{w}'] = \
state_dict[f'{prefix}.encoder.layer.{teacher_idx}.attention.output.dense.{w}']
compressed_sd[f'distilbert.transformer.layer.{std_idx}.sa_layer_norm.{w}'] = \
state_dict[f'{prefix}.encoder.layer.{teacher_idx}.attention.output.LayerNorm.{w}']
compressed_sd[f"distilbert.transformer.layer.{std_idx}.attention.out_lin.{w}"] = state_dict[
f"{prefix}.encoder.layer.{teacher_idx}.attention.output.dense.{w}"
]
compressed_sd[f"distilbert.transformer.layer.{std_idx}.sa_layer_norm.{w}"] = state_dict[
f"{prefix}.encoder.layer.{teacher_idx}.attention.output.LayerNorm.{w}"
]
compressed_sd[f'distilbert.transformer.layer.{std_idx}.ffn.lin1.{w}'] = \
state_dict[f'{prefix}.encoder.layer.{teacher_idx}.intermediate.dense.{w}']
compressed_sd[f'distilbert.transformer.layer.{std_idx}.ffn.lin2.{w}'] = \
state_dict[f'{prefix}.encoder.layer.{teacher_idx}.output.dense.{w}']
compressed_sd[f'distilbert.transformer.layer.{std_idx}.output_layer_norm.{w}'] = \
state_dict[f'{prefix}.encoder.layer.{teacher_idx}.output.LayerNorm.{w}']
compressed_sd[f"distilbert.transformer.layer.{std_idx}.ffn.lin1.{w}"] = state_dict[
f"{prefix}.encoder.layer.{teacher_idx}.intermediate.dense.{w}"
]
compressed_sd[f"distilbert.transformer.layer.{std_idx}.ffn.lin2.{w}"] = state_dict[
f"{prefix}.encoder.layer.{teacher_idx}.output.dense.{w}"
]
compressed_sd[f"distilbert.transformer.layer.{std_idx}.output_layer_norm.{w}"] = state_dict[
f"{prefix}.encoder.layer.{teacher_idx}.output.LayerNorm.{w}"
]
std_idx += 1
compressed_sd[f'vocab_projector.weight'] = state_dict[f'cls.predictions.decoder.weight']
compressed_sd[f'vocab_projector.bias'] = state_dict[f'cls.predictions.bias']
compressed_sd[f"vocab_projector.weight"] = state_dict[f"cls.predictions.decoder.weight"]
compressed_sd[f"vocab_projector.bias"] = state_dict[f"cls.predictions.bias"]
if args.vocab_transform:
for w in ['weight', 'bias']:
compressed_sd[f'vocab_transform.{w}'] = state_dict[f'cls.predictions.transform.dense.{w}']
compressed_sd[f'vocab_layer_norm.{w}'] = state_dict[f'cls.predictions.transform.LayerNorm.{w}']
for w in ["weight", "bias"]:
compressed_sd[f"vocab_transform.{w}"] = state_dict[f"cls.predictions.transform.dense.{w}"]
compressed_sd[f"vocab_layer_norm.{w}"] = state_dict[f"cls.predictions.transform.LayerNorm.{w}"]
print(f'N layers selected for distillation: {std_idx}')
print(f'Number of params transfered for distillation: {len(compressed_sd.keys())}')
print(f"N layers selected for distillation: {std_idx}")
print(f"Number of params transfered for distillation: {len(compressed_sd.keys())}")
print(f'Save transfered checkpoint to {args.dump_checkpoint}.')
print(f"Save transfered checkpoint to {args.dump_checkpoint}.")
torch.save(compressed_sd, args.dump_checkpoint)

View File

@@ -15,37 +15,42 @@
"""
Preprocessing script before training the distilled model.
"""
from collections import Counter
import argparse
import pickle
import logging
import pickle
from collections import Counter
logging.basicConfig(format = '%(asctime)s - %(levelname)s - %(name)s - %(message)s',
datefmt = '%m/%d/%Y %H:%M:%S',
level = logging.INFO)
logging.basicConfig(
format="%(asctime)s - %(levelname)s - %(name)s - %(message)s", datefmt="%m/%d/%Y %H:%M:%S", level=logging.INFO
)
logger = logging.getLogger(__name__)
if __name__ == '__main__':
parser = argparse.ArgumentParser(description="Token Counts for smoothing the masking probabilities in MLM (cf XLM/word2vec)")
parser.add_argument("--data_file", type=str, default="data/dump.bert-base-uncased.pickle",
help="The binarized dataset.")
parser.add_argument("--token_counts_dump", type=str, default="data/token_counts.bert-base-uncased.pickle",
help="The dump file.")
if __name__ == "__main__":
parser = argparse.ArgumentParser(
description="Token Counts for smoothing the masking probabilities in MLM (cf XLM/word2vec)"
)
parser.add_argument(
"--data_file", type=str, default="data/dump.bert-base-uncased.pickle", help="The binarized dataset."
)
parser.add_argument(
"--token_counts_dump", type=str, default="data/token_counts.bert-base-uncased.pickle", help="The dump file."
)
parser.add_argument("--vocab_size", default=30522, type=int)
args = parser.parse_args()
logger.info(f'Loading data from {args.data_file}')
with open(args.data_file, 'rb') as fp:
logger.info(f"Loading data from {args.data_file}")
with open(args.data_file, "rb") as fp:
data = pickle.load(fp)
logger.info('Counting occurences for MLM.')
logger.info("Counting occurences for MLM.")
counter = Counter()
for tk_ids in data:
counter.update(tk_ids)
counts = [0]*args.vocab_size
counts = [0] * args.vocab_size
for k, v in counter.items():
counts[k] = v
logger.info(f'Dump to {args.token_counts_dump}')
with open(args.token_counts_dump, 'wb') as handle:
logger.info(f"Dump to {args.token_counts_dump}")
with open(args.token_counts_dump, "wb") as handle:
pickle.dump(counts, handle, protocol=pickle.HIGHEST_PROTOCOL)

View File

@@ -16,272 +16,304 @@
Training the distilled model.
Supported architectures include: BERT -> DistilBERT, RoBERTa -> DistilRoBERTa, GPT2 -> DistilGPT2.
"""
import os
import argparse
import pickle
import json
import os
import pickle
import shutil
import numpy as np
import torch
from transformers import BertConfig, BertForMaskedLM, BertTokenizer
from transformers import RobertaConfig, RobertaForMaskedLM, RobertaTokenizer
from transformers import DistilBertConfig, DistilBertForMaskedLM, DistilBertTokenizer
from transformers import GPT2Config, GPT2LMHeadModel, GPT2Tokenizer
from distiller import Distiller
from utils import git_log, logger, init_gpu_params, set_seed
from lm_seqs_dataset import LmSeqsDataset
from transformers import (
BertConfig,
BertForMaskedLM,
BertTokenizer,
DistilBertConfig,
DistilBertForMaskedLM,
DistilBertTokenizer,
GPT2Config,
GPT2LMHeadModel,
GPT2Tokenizer,
RobertaConfig,
RobertaForMaskedLM,
RobertaTokenizer,
)
from utils import git_log, init_gpu_params, logger, set_seed
MODEL_CLASSES = {
'distilbert': (DistilBertConfig, DistilBertForMaskedLM, DistilBertTokenizer),
'roberta': (RobertaConfig, RobertaForMaskedLM, RobertaTokenizer),
'bert': (BertConfig, BertForMaskedLM, BertTokenizer),
'gpt2': (GPT2Config, GPT2LMHeadModel, GPT2Tokenizer)
"distilbert": (DistilBertConfig, DistilBertForMaskedLM, DistilBertTokenizer),
"roberta": (RobertaConfig, RobertaForMaskedLM, RobertaTokenizer),
"bert": (BertConfig, BertForMaskedLM, BertTokenizer),
"gpt2": (GPT2Config, GPT2LMHeadModel, GPT2Tokenizer),
}
def sanity_checks(args):
"""
A bunch of args sanity checks to perform even starting...
"""
assert (args.mlm and args.alpha_mlm > 0.) or (not args.mlm and args.alpha_mlm == 0.)
assert (args.alpha_mlm > 0. and args.alpha_clm == 0.) or (args.alpha_mlm == 0. and args.alpha_clm > 0.)
assert (args.mlm and args.alpha_mlm > 0.0) or (not args.mlm and args.alpha_mlm == 0.0)
assert (args.alpha_mlm > 0.0 and args.alpha_clm == 0.0) or (args.alpha_mlm == 0.0 and args.alpha_clm > 0.0)
if args.mlm:
assert os.path.isfile(args.token_counts)
assert (args.student_type in ['roberta', 'distilbert']) and (args.teacher_type in ['roberta', 'bert'])
assert (args.student_type in ["roberta", "distilbert"]) and (args.teacher_type in ["roberta", "bert"])
else:
assert (args.student_type in ['gpt2']) and (args.teacher_type in ['gpt2'])
assert (args.student_type in ["gpt2"]) and (args.teacher_type in ["gpt2"])
assert args.teacher_type == args.student_type or (args.student_type=='distilbert' and args.teacher_type=='bert')
assert args.teacher_type == args.student_type or (
args.student_type == "distilbert" and args.teacher_type == "bert"
)
assert os.path.isfile(args.student_config)
if args.student_pretrained_weights is not None:
assert os.path.isfile(args.student_pretrained_weights)
if args.freeze_token_type_embds: assert args.student_type in ['roberta']
if args.freeze_token_type_embds:
assert args.student_type in ["roberta"]
assert args.alpha_ce >= 0.0
assert args.alpha_mlm >= 0.0
assert args.alpha_clm >= 0.0
assert args.alpha_mse >= 0.0
assert args.alpha_cos >= 0.0
assert args.alpha_ce + args.alpha_mlm + args.alpha_clm + args.alpha_mse + args.alpha_cos > 0.0
assert args.alpha_ce >= 0.
assert args.alpha_mlm >= 0.
assert args.alpha_clm >= 0.
assert args.alpha_mse >= 0.
assert args.alpha_cos >= 0.
assert args.alpha_ce + args.alpha_mlm + args.alpha_clm + args.alpha_mse + args.alpha_cos > 0.
def freeze_pos_embeddings(student, args):
if args.student_type == 'roberta':
if args.student_type == "roberta":
student.roberta.embeddings.position_embeddings.weight.requires_grad = False
elif args.student_type == 'gpt2':
elif args.student_type == "gpt2":
student.transformer.wpe.weight.requires_grad = False
def freeze_token_type_embeddings(student, args):
if args.student_type == 'roberta':
if args.student_type == "roberta":
student.roberta.embeddings.token_type_embeddings.weight.requires_grad = False
def main():
parser = argparse.ArgumentParser(description="Training")
parser.add_argument("--force", action='store_true',
help="Overwrite dump_path if it already exists.")
parser.add_argument("--force", action="store_true", help="Overwrite dump_path if it already exists.")
parser.add_argument("--dump_path", type=str, required=True,
help="The output directory (log, checkpoints, parameters, etc.)")
parser.add_argument("--data_file", type=str, required=True,
help="The binarized file (tokenized + tokens_to_ids) and grouped by sequence.")
parser.add_argument(
"--dump_path", type=str, required=True, help="The output directory (log, checkpoints, parameters, etc.)"
)
parser.add_argument(
"--data_file",
type=str,
required=True,
help="The binarized file (tokenized + tokens_to_ids) and grouped by sequence.",
)
parser.add_argument("--student_type", type=str, choices=["distilbert", "roberta", "gpt2"], required=True,
help="The student type (DistilBERT, RoBERTa).")
parser.add_argument("--student_config", type=str, required=True,
help="Path to the student configuration.")
parser.add_argument("--student_pretrained_weights", default=None, type=str,
help="Load student initialization checkpoint.")
parser.add_argument(
"--student_type",
type=str,
choices=["distilbert", "roberta", "gpt2"],
required=True,
help="The student type (DistilBERT, RoBERTa).",
)
parser.add_argument("--student_config", type=str, required=True, help="Path to the student configuration.")
parser.add_argument(
"--student_pretrained_weights", default=None, type=str, help="Load student initialization checkpoint."
)
parser.add_argument("--teacher_type", choices=["bert", "roberta", "gpt2"], required=True,
help="Teacher type (BERT, RoBERTa).")
parser.add_argument("--teacher_name", type=str, required=True,
help="The teacher model.")
parser.add_argument(
"--teacher_type", choices=["bert", "roberta", "gpt2"], required=True, help="Teacher type (BERT, RoBERTa)."
)
parser.add_argument("--teacher_name", type=str, required=True, help="The teacher model.")
parser.add_argument("--temperature", default=2., type=float,
help="Temperature for the softmax temperature.")
parser.add_argument("--alpha_ce", default=0.5, type=float,
help="Linear weight for the distillation loss. Must be >=0.")
parser.add_argument("--alpha_mlm", default=0.0, type=float,
help="Linear weight for the MLM loss. Must be >=0. Should be used in coonjunction with `mlm` flag.")
parser.add_argument("--alpha_clm", default=0.5, type=float,
help="Linear weight for the CLM loss. Must be >=0.")
parser.add_argument("--alpha_mse", default=0.0, type=float,
help="Linear weight of the MSE loss. Must be >=0.")
parser.add_argument("--alpha_cos", default=0.0, type=float,
help="Linear weight of the cosine embedding loss. Must be >=0.")
parser.add_argument("--temperature", default=2.0, type=float, help="Temperature for the softmax temperature.")
parser.add_argument(
"--alpha_ce", default=0.5, type=float, help="Linear weight for the distillation loss. Must be >=0."
)
parser.add_argument(
"--alpha_mlm",
default=0.0,
type=float,
help="Linear weight for the MLM loss. Must be >=0. Should be used in coonjunction with `mlm` flag.",
)
parser.add_argument("--alpha_clm", default=0.5, type=float, help="Linear weight for the CLM loss. Must be >=0.")
parser.add_argument("--alpha_mse", default=0.0, type=float, help="Linear weight of the MSE loss. Must be >=0.")
parser.add_argument(
"--alpha_cos", default=0.0, type=float, help="Linear weight of the cosine embedding loss. Must be >=0."
)
parser.add_argument("--mlm", action="store_true",
help="The LM step: MLM or CLM. If `mlm` is True, the MLM is used over CLM.")
parser.add_argument("--mlm_mask_prop", default=0.15, type=float,
help="Proportion of tokens for which we need to make a prediction.")
parser.add_argument("--word_mask", default=0.8, type=float,
help="Proportion of tokens to mask out.")
parser.add_argument("--word_keep", default=0.1, type=float,
help="Proportion of tokens to keep.")
parser.add_argument("--word_rand", default=0.1, type=float,
help="Proportion of tokens to randomly replace.")
parser.add_argument("--mlm_smoothing", default=0.7, type=float,
help="Smoothing parameter to emphasize more rare tokens (see XLM, similar to word2vec).")
parser.add_argument("--token_counts", type=str,
help="The token counts in the data_file for MLM.")
parser.add_argument(
"--mlm", action="store_true", help="The LM step: MLM or CLM. If `mlm` is True, the MLM is used over CLM."
)
parser.add_argument(
"--mlm_mask_prop",
default=0.15,
type=float,
help="Proportion of tokens for which we need to make a prediction.",
)
parser.add_argument("--word_mask", default=0.8, type=float, help="Proportion of tokens to mask out.")
parser.add_argument("--word_keep", default=0.1, type=float, help="Proportion of tokens to keep.")
parser.add_argument("--word_rand", default=0.1, type=float, help="Proportion of tokens to randomly replace.")
parser.add_argument(
"--mlm_smoothing",
default=0.7,
type=float,
help="Smoothing parameter to emphasize more rare tokens (see XLM, similar to word2vec).",
)
parser.add_argument("--token_counts", type=str, help="The token counts in the data_file for MLM.")
parser.add_argument("--restrict_ce_to_mask", action='store_true',
help="If true, compute the distilation loss only the [MLM] prediction distribution.")
parser.add_argument("--freeze_pos_embs", action="store_true",
help="Freeze positional embeddings during distillation. For student_type in ['roberta', 'gpt2'] only.")
parser.add_argument("--freeze_token_type_embds", action="store_true",
help="Freeze token type embeddings during distillation if existent. For student_type in ['roberta'] only.")
parser.add_argument(
"--restrict_ce_to_mask",
action="store_true",
help="If true, compute the distilation loss only the [MLM] prediction distribution.",
)
parser.add_argument(
"--freeze_pos_embs",
action="store_true",
help="Freeze positional embeddings during distillation. For student_type in ['roberta', 'gpt2'] only.",
)
parser.add_argument(
"--freeze_token_type_embds",
action="store_true",
help="Freeze token type embeddings during distillation if existent. For student_type in ['roberta'] only.",
)
parser.add_argument("--n_epoch", type=int, default=3,
help="Number of pass on the whole dataset.")
parser.add_argument("--batch_size", type=int, default=5,
help="Batch size (for each process).")
parser.add_argument("--group_by_size", action='store_false',
help="If true, group sequences that have similar length into the same batch. Default is true.")
parser.add_argument("--n_epoch", type=int, default=3, help="Number of pass on the whole dataset.")
parser.add_argument("--batch_size", type=int, default=5, help="Batch size (for each process).")
parser.add_argument(
"--group_by_size",
action="store_false",
help="If true, group sequences that have similar length into the same batch. Default is true.",
)
parser.add_argument("--gradient_accumulation_steps", type=int, default=50,
help="Gradient accumulation for larger training batches.")
parser.add_argument("--warmup_prop", default=0.05, type=float,
help="Linear warmup proportion.")
parser.add_argument("--weight_decay", default=0.0, type=float,
help="Weight deay if we apply some.")
parser.add_argument("--learning_rate", default=5e-4, type=float,
help="The initial learning rate for Adam.")
parser.add_argument("--adam_epsilon", default=1e-6, type=float,
help="Epsilon for Adam optimizer.")
parser.add_argument("--max_grad_norm", default=5.0, type=float,
help="Max gradient norm.")
parser.add_argument("--initializer_range", default=0.02, type=float,
help="Random initialization range.")
parser.add_argument(
"--gradient_accumulation_steps",
type=int,
default=50,
help="Gradient accumulation for larger training batches.",
)
parser.add_argument("--warmup_prop", default=0.05, type=float, help="Linear warmup proportion.")
parser.add_argument("--weight_decay", default=0.0, type=float, help="Weight deay if we apply some.")
parser.add_argument("--learning_rate", default=5e-4, type=float, help="The initial learning rate for Adam.")
parser.add_argument("--adam_epsilon", default=1e-6, type=float, help="Epsilon for Adam optimizer.")
parser.add_argument("--max_grad_norm", default=5.0, type=float, help="Max gradient norm.")
parser.add_argument("--initializer_range", default=0.02, type=float, help="Random initialization range.")
parser.add_argument('--fp16', action='store_true',
help="Whether to use 16-bit (mixed) precision (through NVIDIA apex) instead of 32-bit")
parser.add_argument('--fp16_opt_level', type=str, default='O1',
help="For fp16: Apex AMP optimization level selected in ['O0', 'O1', 'O2', and 'O3']."
"See details at https://nvidia.github.io/apex/amp.html")
parser.add_argument("--n_gpu", type=int, default=1,
help="Number of GPUs in the node.")
parser.add_argument("--local_rank", type=int, default=-1,
help="Distributed training - Local rank")
parser.add_argument("--seed", type=int, default=56,
help="Random seed")
parser.add_argument(
"--fp16",
action="store_true",
help="Whether to use 16-bit (mixed) precision (through NVIDIA apex) instead of 32-bit",
)
parser.add_argument(
"--fp16_opt_level",
type=str,
default="O1",
help="For fp16: Apex AMP optimization level selected in ['O0', 'O1', 'O2', and 'O3']."
"See details at https://nvidia.github.io/apex/amp.html",
)
parser.add_argument("--n_gpu", type=int, default=1, help="Number of GPUs in the node.")
parser.add_argument("--local_rank", type=int, default=-1, help="Distributed training - Local rank")
parser.add_argument("--seed", type=int, default=56, help="Random seed")
parser.add_argument("--log_interval", type=int, default=500,
help="Tensorboard logging interval.")
parser.add_argument("--checkpoint_interval", type=int, default=4000,
help="Checkpoint interval.")
parser.add_argument("--log_interval", type=int, default=500, help="Tensorboard logging interval.")
parser.add_argument("--checkpoint_interval", type=int, default=4000, help="Checkpoint interval.")
args = parser.parse_args()
sanity_checks(args)
## ARGS ##
# ARGS #
init_gpu_params(args)
set_seed(args)
if args.is_master:
if os.path.exists(args.dump_path):
if not args.force:
raise ValueError(f'Serialization dir {args.dump_path} already exists, but you have not precised wheter to overwrite it'
'Use `--force` if you want to overwrite it')
raise ValueError(
f"Serialization dir {args.dump_path} already exists, but you have not precised wheter to overwrite it"
"Use `--force` if you want to overwrite it"
)
else:
shutil.rmtree(args.dump_path)
if not os.path.exists(args.dump_path):
os.makedirs(args.dump_path)
logger.info(f'Experiment will be dumped and logged in {args.dump_path}')
logger.info(f"Experiment will be dumped and logged in {args.dump_path}")
### SAVE PARAMS ###
logger.info(f'Param: {args}')
with open(os.path.join(args.dump_path, 'parameters.json'), 'w') as f:
# SAVE PARAMS #
logger.info(f"Param: {args}")
with open(os.path.join(args.dump_path, "parameters.json"), "w") as f:
json.dump(vars(args), f, indent=4)
git_log(args.dump_path)
student_config_class, student_model_class, _ = MODEL_CLASSES[args.student_type]
teacher_config_class, teacher_model_class, teacher_tokenizer_class = MODEL_CLASSES[args.teacher_type]
### TOKENIZER ###
# TOKENIZER #
tokenizer = teacher_tokenizer_class.from_pretrained(args.teacher_name)
special_tok_ids = {}
for tok_name, tok_symbol in tokenizer.special_tokens_map.items():
idx = tokenizer.all_special_tokens.index(tok_symbol)
special_tok_ids[tok_name] = tokenizer.all_special_ids[idx]
logger.info(f'Special tokens {special_tok_ids}')
logger.info(f"Special tokens {special_tok_ids}")
args.special_tok_ids = special_tok_ids
args.max_model_input_size = tokenizer.max_model_input_sizes[args.teacher_name]
## DATA LOADER ##
logger.info(f'Loading data from {args.data_file}')
with open(args.data_file, 'rb') as fp:
# DATA LOADER #
logger.info(f"Loading data from {args.data_file}")
with open(args.data_file, "rb") as fp:
data = pickle.load(fp)
if args.mlm:
logger.info(f'Loading token counts from {args.token_counts} (already pre-computed)')
with open(args.token_counts, 'rb') as fp:
logger.info(f"Loading token counts from {args.token_counts} (already pre-computed)")
with open(args.token_counts, "rb") as fp:
counts = pickle.load(fp)
token_probs = np.maximum(counts, 1) ** -args.mlm_smoothing
for idx in special_tok_ids.values():
token_probs[idx] = 0. # do not predict special tokens
token_probs[idx] = 0.0 # do not predict special tokens
token_probs = torch.from_numpy(token_probs)
else:
token_probs = None
train_lm_seq_dataset = LmSeqsDataset(params=args, data=data)
logger.info(f'Data loader created.')
logger.info(f"Data loader created.")
## STUDENT ##
logger.info(f'Loading student config from {args.student_config}')
# STUDENT #
logger.info(f"Loading student config from {args.student_config}")
stu_architecture_config = student_config_class.from_pretrained(args.student_config)
stu_architecture_config.output_hidden_states = True
if args.student_pretrained_weights is not None:
logger.info(f'Loading pretrained weights from {args.student_pretrained_weights}')
student = student_model_class.from_pretrained(args.student_pretrained_weights,
config=stu_architecture_config)
logger.info(f"Loading pretrained weights from {args.student_pretrained_weights}")
student = student_model_class.from_pretrained(args.student_pretrained_weights, config=stu_architecture_config)
else:
student = student_model_class(stu_architecture_config)
if args.n_gpu > 0:
student.to(f'cuda:{args.local_rank}')
logger.info(f'Student loaded.')
student.to(f"cuda:{args.local_rank}")
logger.info(f"Student loaded.")
## TEACHER ##
# TEACHER #
teacher = teacher_model_class.from_pretrained(args.teacher_name, output_hidden_states=True)
if args.n_gpu > 0:
teacher.to(f'cuda:{args.local_rank}')
logger.info(f'Teacher loaded from {args.teacher_name}.')
teacher.to(f"cuda:{args.local_rank}")
logger.info(f"Teacher loaded from {args.teacher_name}.")
## FREEZING ##
# FREEZING #
if args.freeze_pos_embs:
freeze_pos_embeddings(student, args)
if args.freeze_token_type_embds:
freeze_token_type_embeddings(student, args)
## SANITY CHECKS ##
# SANITY CHECKS #
assert student.config.vocab_size == teacher.config.vocab_size
assert student.config.hidden_size == teacher.config.hidden_size
assert student.config.max_position_embeddings == teacher.config.max_position_embeddings
if args.mlm:
assert token_probs.size(0) == stu_architecture_config.vocab_size
## DISTILLER ##
# DISTILLER #
torch.cuda.empty_cache()
distiller = Distiller(params=args,
dataset=train_lm_seq_dataset,
token_probs=token_probs,
student=student,
teacher=teacher)
distiller = Distiller(
params=args, dataset=train_lm_seq_dataset, token_probs=token_probs, student=student, teacher=teacher
)
distiller.train()
logger.info("Let's go get some drinks.")

View File

@@ -0,0 +1,15 @@
{
"activation": "gelu",
"attention_dropout": 0.1,
"dim": 768,
"dropout": 0.1,
"hidden_dim": 3072,
"initializer_range": 0.02,
"max_position_embeddings": 512,
"n_heads": 12,
"n_layers": 6,
"sinusoidal_pos_embds": true,
"tie_weights_": true,
"vocab_size": 28996
}

View File

@@ -0,0 +1,15 @@
{
"activation": "gelu",
"attention_dropout": 0.1,
"dim": 768,
"dropout": 0.1,
"hidden_dim": 3072,
"initializer_range": 0.02,
"max_position_embeddings": 512,
"n_heads": 12,
"n_layers": 6,
"sinusoidal_pos_embds": true,
"tie_weights_": true,
"vocab_size": 119547
}

View File

@@ -0,0 +1,14 @@
{
"vocab_size": 50265,
"hidden_size": 768,
"num_hidden_layers": 6,
"num_attention_heads": 12,
"intermediate_size": 3072,
"hidden_act": "gelu",
"hidden_dropout_prob": 0.1,
"attention_probs_dropout_prob": 0.1,
"max_position_embeddings": 514,
"type_vocab_size": 1,
"initializer_range": 0.02,
"layer_norm_eps": 0.00001
}

Some files were not shown because too many files have changed in this diff Show More