Compare commits

...

169 Commits

Author SHA1 Message Date
Julien Chaumond
b42586ea56 Fix CI after killing archive maps (#4724)
Some checks failed
GitHub-hosted runner / check_code_quality (push) Has been cancelled
* 🐛 Fix model ids for BART and Flaubert
2020-06-02 10:21:09 -04:00
Lysandre
b43c78e5d3 Release: v2.11.0 2020-06-02 09:49:09 -04:00
Julien Chaumond
d4c2cb402d Kill model archive maps (#4636)
* Kill model archive maps

* Fixup

* Also kill model_archive_map for MaskedBertPreTrainedModel

* Unhook config_archive_map

* Tokenizers: align with model id changes

* make style && make quality

* Fix CI
2020-06-02 09:39:33 -04:00
Patrick von Platen
47a551d17b [pipeline] Tokenizer should not add special tokens for text generation (#4686)
* allow to not add special tokens

* remove print
2020-06-02 11:03:46 +02:00
Funtowicz Morgan
f6d5046af1 Override get_vocab for fast tokenizer. (#4717) 2020-06-02 11:02:27 +02:00
Lysandre Debut
88762a2f8c Specify PyTorch versions for examples (#4710) 2020-06-02 04:29:28 -04:00
Lorenzo Ampil
d3ef14f931 Add community notebook for sentiment span extraction (#4700) 2020-06-02 09:59:53 +02:00
Sylvain Gugger
7677936316 Make docstring match args (#4711) 2020-06-01 15:22:51 -04:00
Lysandre
6449c494d0 close #4685 2020-06-01 12:57:52 -04:00
Julien Chaumond
ec8717d5d8 [config] Ensure that id2label always takes precedence over num_labels 2020-06-01 16:54:55 +02:00
Julien Chaumond
751a1e0890 [config] Ensure that id2label always takes precedence over num_labels
Fixes bug reported in https://github.com/huggingface/transformers/issues/4669

See #3967 for context
2020-06-01 16:25:56 +02:00
Rens
ec62b7d953 Fix onnx export input names order (#4641)
* pass on tokenizer to pipeline

* order input names when convert to onnx

* update style

* remove unused imports

* make ordered inputs list needs to be mutable

* add test custom bert model

* remove unused imports
2020-06-01 16:12:48 +02:00
Victor SANH
bf760c80b5 finish README 2020-06-01 09:23:31 -04:00
Victor SANH
9d7d9b3ae0 weird import 2020-06-01 09:23:31 -04:00
Victor SANH
2a3c88a659 Update examples/movement-pruning/README.md
Co-authored-by: Julien Chaumond <chaumond@gmail.com>
2020-06-01 09:23:31 -04:00
Victor SANH
4ac462bfb8 Update examples/movement-pruning/README.md
Co-authored-by: Julien Chaumond <chaumond@gmail.com>
2020-06-01 09:23:31 -04:00
Victor SANH
35fa0bbca0 clarify README 2020-06-01 09:23:31 -04:00
Victor SANH
cc746a5020 flake8 compliance 2020-06-01 09:23:31 -04:00
Victor SANH
b11386e158 less prints in saving prunebert 2020-06-01 09:23:31 -04:00
Victor SANH
8b5d4003ab complete README 2020-06-01 09:23:31 -04:00
Victor SANH
5c8e5b3709 commplying with isort 2020-06-01 09:23:31 -04:00
Victor SANH
db2a3b2e01 space 2020-06-01 09:23:31 -04:00
Victor SANH
5f8f2d849a add floppy bert model notebok 2020-06-01 09:23:31 -04:00
Victor SANH
b41948f5cd add requirements 2020-06-01 09:23:31 -04:00
Victor SANH
fb8f4277b2 add scripts 2020-06-01 09:23:31 -04:00
Victor SANH
d489a6d3d5 add masked_run_* 2020-06-01 09:23:31 -04:00
Victor SANH
e4c07faf0a add sparsity modules 2020-06-01 09:23:31 -04:00
Mehrdad Farahani
667003e447 Create README.md (#4665) 2020-06-01 08:29:09 -04:00
Mehrdad Farahani
ed23f5909e HooshvareLab readme parsbert-armananer (#4666)
Readme for HooshvareLab/bert-base-parsbert-armananer-uncased
2020-06-01 08:28:43 -04:00
Mehrdad Farahani
3750b9b0b0 HooshvareLab readme parsbert-peymaner (#4667)
Readme for HooshvareLab/bert-base-parsbert-peymaner-uncased
2020-06-01 08:28:25 -04:00
Mehrdad Farahani
036c2c6b02 Update HooshvareLab/bert-base-parsbert-uncased (#4687)
mBERT results added regarding NER datasets!
2020-06-01 08:27:00 -04:00
Manuel Romero
74872c19d3 Create README.md (#4684) 2020-06-01 05:45:54 -04:00
Patrick von Platen
0866669e75 [EncoderDecoder] Fix initialization and save/load bug (#4680)
* fix bug

* add more tests
2020-05-30 01:25:19 +02:00
Patrick von Platen
6f82aea66b Include nlp notebook for model evaluation (#4676) 2020-05-29 19:38:56 +02:00
Wei Fang
33b7532e69 Fix longformer attention mask type casting when using apex (#4574)
* Fix longformer attention mask casting when using apex

* remove extra type casting
2020-05-29 18:13:30 +02:00
Patrick von Platen
56ee2560be [Longformer] Better handling of global attention mask vs local attention mask (#4672)
* better api

* improve automatic setting of global attention mask

* fix longformer bug

* fix global attention mask in test

* fix global attn mask flatten

* fix slow tests

* update docstring

* update docs and make more robust

* improve attention mask
2020-05-29 17:58:42 +02:00
Simon Böhm
e2230ba77b Fix BERT example code for NSP and Multiple Choice (#3953)
Change the example code to use encode_plus since the token_type_id
wasn't being correctly set.
2020-05-29 11:55:55 -04:00
Zhangyx
3a5d1ea2a5 Fix two bugs: 1. Index of test data of SST-2. 2. Label index of MNLI data. (#4546) 2020-05-29 11:12:24 -04:00
Patrick von Platen
9c17256447 [Longformer] Multiple choice for longformer (#4645)
* add multiple choice for longformer

* add models to docs

* adapt docstring

* add test to longformer

* add longformer for mc in init and modeling auto

* fix tests
2020-05-29 13:46:08 +02:00
Iz Beltagy
91487cbb8e [Longformer] fix model name in examples (#4653)
* fix longformer model names in examples

* a better name for the notebook
2020-05-29 13:12:35 +02:00
flozi00
b5015a2a0f gpt2 typo (#4629)
* gpt2 typo

* Add files via upload
2020-05-28 16:44:43 -04:00
Iz Beltagy
fe5cb1a1c8 Adding community notebook (#4642)
Co-authored-by: Patrick von Platen <patrick.v.platen@gmail.com>
2020-05-28 22:35:15 +02:00
Suraj Patil
aecaaf73a4 [Community notebooks] add longformer-for-qa notebook (#4652) 2020-05-28 22:27:22 +02:00
Anthony MOI
5e737018e1 Fix add_special_tokens on fast tokenizers (#4531) 2020-05-28 10:54:45 -04:00
Suraj Patil
e444648a30 LongformerForTokenClassification (#4638) 2020-05-28 12:48:18 +02:00
Lavanya Shukla
3cc2c2a150 add 2 colab notebooks (#4505)
Co-authored-by: Patrick von Platen <patrick.v.platen@gmail.com>
2020-05-28 11:18:16 +02:00
Iz Beltagy
ef03ae874f [Longformer] more models + model cards (#4628)
* adding freeze roberta models

* model cards

* lint
2020-05-28 11:11:05 +02:00
Patrick von Platen
96f57c9ccb [Benchmark] Memory benchmark utils (#4198)
* improve memory benchmarking

* correct typo

* fix current memory

* check torch memory allocated

* better pytorch function

* add total cached gpu memory

* add total gpu required

* improve torch gpu usage

* update memory usage

* finalize memory tracing

* save intermediate benchmark class

* fix conflict

* improve benchmark

* improve benchmark

* finalize

* make style

* improve benchmarking

* correct typo

* make train function more flexible

* fix csv save

* better repr of bytes

* better print

* fix __repr__ bug

* finish plot script

* rename plot file

* delete csv and small improvements

* fix in plot

* fix in plot

* correct usage of timeit

* remove redundant line

* remove redundant line

* fix bug

* add hf parser tests

* add versioning and platform info

* make style

* add gpu information

* ensure backward compatibility

* finish adding all tests

* Update src/transformers/benchmark/benchmark_args.py

Co-authored-by: Lysandre Debut <lysandre@huggingface.co>

* Update src/transformers/benchmark/benchmark_args_utils.py

Co-authored-by: Lysandre Debut <lysandre@huggingface.co>

* delete csv files

* fix isort ordering

* add out of memory handling

* add better train memory handling

Co-authored-by: Lysandre Debut <lysandre@huggingface.co>
2020-05-27 23:22:16 +02:00
Suraj Patil
ec4cdfdd05 LongformerForSequenceClassification (#4580)
* LongformerForSequenceClassification

* better naming x=>hidden_states, fix typo in doc

* Update src/transformers/modeling_longformer.py

* Update src/transformers/modeling_longformer.py

Co-authored-by: Patrick von Platen <patrick.v.platen@gmail.com>
2020-05-27 22:30:00 +02:00
Suraj Patil
4402879ee4 [Model Card] model card for longformer-base-4096-finetuned-squadv1 (#4625) 2020-05-27 18:48:03 +02:00
Lysandre Debut
6a17688021 per_device instead of per_gpu/error thrown when argument unknown (#4618)
* per_device instead of per_gpu/error thrown when argument unknown

* [docs] Restore examples.md symlink

* Correct absolute links so that symlink to the doc works correctly

* Update src/transformers/hf_argparser.py

Co-authored-by: Julien Chaumond <chaumond@gmail.com>

* Warning + reorder

* Docs

* Style

* not for squad

Co-authored-by: Julien Chaumond <chaumond@gmail.com>
2020-05-27 11:36:55 -04:00
Mehrdad Farahani
1381b6d01d README for HooshvareLab (#4610)
HooshvareLab/bert-base-parsbert-uncased
2020-05-27 11:25:36 -04:00
Patrick von Platen
5acb4edf25 Update version command when contributing (#4614) 2020-05-27 17:19:11 +02:00
Darek Kłeczek
842588c12f uncased readme (#4608)
Co-authored-by: kldarek <darekmail>
2020-05-27 09:50:04 -04:00
Darek Kłeczek
ac1a612179 Create README.md (#4607)
Model card for cased model
2020-05-27 09:36:20 -04:00
Sam Shleifer
07797c4da4 [testing] LanguageModelGenerationTests require_tf or require_torch (#4616) 2020-05-27 09:10:26 -04:00
Hao Tan
a9aa7456ac Add back --do_lower_case to uncased models (#4245)
The option `--do_lower_case` is currently required by the uncased models (i.e., bert-base-uncased, bert-large-uncased).

Results:
BERT-BASE without --do_lower_case:  'exact': 73.83, 'f1': 82.22
BERT-BASE with --do_lower_case:  'exact': 81.02, 'f1': 88.34
2020-05-26 21:13:07 -04:00
Bayartsogt Yadamsuren
a801c7fd74 Creating a readme for ALBERT in Mongolian (#4603)
Here I am uploading Mongolian masked language model (ALBERT) on your platform.
https://en.wikipedia.org/wiki/Mongolia
2020-05-26 16:54:42 -04:00
Wissam Antoun
6458c0e268 updated model cards for both models at aubmindlab (#4604)
* updated aubmindlab/bert-base-arabert/ Model card

* updated aubmindlab/bert-base-arabertv01 model card
2020-05-26 16:52:43 -04:00
Oleksandr Bushkovskyi
ea4e7a53fa Improve model card for Tereveni-AI/gpt2-124M-uk-fiction (#4582)
Add language metadata, training and evaluation corpora details.
Add example output. Fix inconsistent use of quotes.
2020-05-26 16:51:40 -04:00
Manuel Romero
937930dcae Create README.md (#4591) 2020-05-26 16:50:08 -04:00
Manuel Romero
bac1cc4dc1 Remove MD emojis (#4602) 2020-05-26 16:38:39 -04:00
Patrick von Platen
003c477129 [GPT2, CTRL] Allow input of input_ids and past of variable length (#4581)
* revert convenience  method

* clean docs a bit
2020-05-26 19:43:58 +02:00
ohmeow
5ddd8d6531 Add BART fine-tuning summarization community notebook (#4539)
* adding BART summarization how-to community notebook

* Update notebooks/README.md

Co-authored-by: Patrick von Platen <patrick.v.platen@gmail.com>
2020-05-26 16:43:41 +02:00
Bram Vanroy
8cc6807e89 Make transformers-cli cross-platform (#4131)
* make transformers-cli cross-platform

Using "scripts" is a useful option in setup.py particularly when you want to get access to non-python scripts. However, in this case we want to have an entry point into some of our own Python scripts. To do this in a concise, cross-platfom way, we can use entry_points.console_scripts. This change is necessary to provide the CLI on different platforms, which "scripts" does not ensure. Usage remains the same, but the "transformers-cli" script has to be moved (be part of the library) and renamed (underscore + extension)

* make style & quality
2020-05-26 10:00:51 -04:00
Patrick von Platen
c589eae2b8 [Longformer For Question Answering] Conversion script, doc, small fixes (#4593)
* add new longformer for question answering model

* add new config as well

* fix links

* fix links part 2
2020-05-26 14:58:47 +02:00
ZhuBaohe
a163c9ca5b [T5] Fix Cross Attention position bias (#4499)
* fix

* fix1
2020-05-26 08:57:24 -04:00
ZhuBaohe
1d69028989 fix (#4410) 2020-05-26 08:51:28 -04:00
Sam Shleifer
b86e42e0ac [ci] fix 3 remaining slow GPU failures (#4584) 2020-05-25 19:20:50 -04:00
Julien Chaumond
365d452d4d [ci] Slow GPU tests run daily (#4465) 2020-05-25 17:28:02 -04:00
Patrick von Platen
3e3e552125 [Reformer] fix reformer num buckets (#4564)
* fix reformer num buckets

* fix

* adapt docs

* set num buckets in config
2020-05-25 16:04:45 -04:00
Elman Mansimov
3dea40b858 fixing tokenization of extra_id symbols in T5Tokenizer. Related to issue 4021 (#4353) 2020-05-25 16:04:30 -04:00
Suraj Patil
5139733623 LongformerTokenizerFast (#4547) 2020-05-25 16:03:55 -04:00
Oliver Guhr
c9c385c522 Updated the link to the paper (#4570)
I looks like the conference has changed the link to the paper.
2020-05-25 15:29:50 -04:00
Sho Arora
adab7f8332 Add nn.Module as superclass (#4533) 2020-05-25 15:29:33 -04:00
Manuel Romero
8f7c1c7672 Create model card (#4578) 2020-05-25 15:28:30 -04:00
Ali Safaya
4c6b218056 Update README.md (#4556) 2020-05-25 15:12:23 -04:00
Antonis Maronikolakis
50d1ce411f add DistilBERT to supported models (#4558) 2020-05-25 14:50:45 -04:00
Suraj Patil
03d8527de0 Longformer for question answering (#4500)
* added LongformerForQuestionAnswering

* add LongformerForQuestionAnswering

* fix import for LongformerForMaskedLM

* add LongformerForQuestionAnswering

* hardcoded sep_token_id

* compute attention_mask if not provided

* combine global_attention_mask with attention_mask when provided

* update example in  docstring

* add assert error messages, better attention combine

* add test for longformerForQuestionAnswering

* typo

* cast gloabl_attention_mask to long

* make style

* Update src/transformers/configuration_longformer.py

* Update src/transformers/configuration_longformer.py

* fix the code quality

* Merge branch 'longformer-for-question-answering' of https://github.com/patil-suraj/transformers into longformer-for-question-answering

Co-authored-by: Patrick von Platen <patrick.v.platen@gmail.com>
2020-05-25 18:43:36 +02:00
Bharat Raghunathan
a34a9896ac DOC: Fix typos in modeling_auto (#4534) 2020-05-23 09:40:59 -04:00
Bijay Gurung
e19b978151 Add Type Hints to modeling_utils.py Closes #3911 (#3948)
* Add Type Hints to modeling_utils.py Closes #3911

Add Type Hints to methods in `modeling_utils.py`

Note: The coverage isn't 100%. Mostly skipped internal methods.

* Reformat according to `black` and `isort`

* Use typing.Iterable instead of Sequence

* Parameterize Iterable by its generic type

* Use typing.Optional when None is the default value

* Adhere to style guideline

* Update src/transformers/modeling_utils.py

* Update src/transformers/modeling_utils.py

Co-authored-by: Julien Chaumond <chaumond@gmail.com>
2020-05-22 19:10:22 -04:00
Funtowicz Morgan
996f393a86 Warn the user about max_len being on the path to be deprecated. (#4528)
* Warn the user about max_len being on the path to be deprecated.

* Ensure better backward compatibility when max_len is provided to a tokenizer.

* Make sure to override the parameter and not the actual instance value.

* Format & quality
2020-05-22 18:08:30 -04:00
Patrick von Platen
0f6969b7e9 Better github link for Reformer Colab Notebook 2020-05-22 23:51:36 +02:00
Sam Shleifer
ab44630db2 [Summarization Pipeline]: Fix default tokenizer (#4506)
* Fix pipelines defaults bug

* one liner

* style
2020-05-22 17:49:45 -04:00
Julien Chaumond
2c1ebb8b50 Re-apply #4446 + add packaging dependency
As discussed w/ @lysandrejik

packaging is maintained by PyPA (the Python Packaging Authority), and should be lightweight and stable
2020-05-22 17:29:03 -04:00
Lysandre
e6aeb0d3e8 Style 2020-05-22 17:20:03 -04:00
Alexander Measure
95a26fcf2d link to paper was broken (#4526)
changed from https://https://arxiv.org/abs/2001.04451.pdf to https://arxiv.org/abs/2001.04451.pdf
2020-05-22 15:17:09 -04:00
HUSEIN ZOLKEPLI
89d795f180 Added huseinzol05/t5-small-bahasa-cased README.md (#4522) 2020-05-22 15:04:06 -04:00
Anthony MOI
35df911485 Fix convert_token_type_ids_from_sequences for fast tokenizers (#4503) 2020-05-22 12:45:10 -04:00
Julien Chaumond
f7677e1623 [model_cards] bart-large-cnn
cc @sshleifer
2020-05-22 12:20:54 -04:00
Patrick von Platen
12e6afe900 Add Reformer colab to community noteboos 2020-05-22 17:03:34 +02:00
Lysandre
ef22ba4836 Re-pin versions 2020-05-22 11:03:07 -04:00
Lysandre
10d72390c0 Revert #4446 Since it introduces a new dependency
Some checks failed
GitHub-hosted runner / check_code_quality (push) Has been cancelled
2020-05-22 10:49:45 -04:00
Lysandre
e0db6bbd65 Release: v2.10.0 2020-05-22 10:37:44 -04:00
Frankie Liuzzi
bd6e301832 added functionality for electra classification head (#4257)
* added functionality for electra classification head

* unneeded dropout

* Test ELECTRA for sequence classification

* Style

Co-authored-by: Frankie <frankie@frase.io>
Co-authored-by: Lysandre <lysandre.debut@reseau.eseo.fr>
2020-05-22 09:48:21 -04:00
Lysandre
a086527727 Unused Union should not be imported 2020-05-21 09:42:47 -04:00
Lysandre Debut
9d2ce253de TPU hangs when saving optimizer/scheduler (#4467)
* TPU hangs when saving optimizer/scheduler

* Style

* ParallelLoader is not a DataLoader

* Style

* Addressing @julien-c's comments
2020-05-21 09:18:27 -04:00
Zhangyx
49296533ca Adds predict stage for glue tasks, and generate result files which can be submitted to gluebenchmark.com (#4463)
* Adds predict stage for glue tasks, and generate result files which could be submitted to gluebenchmark.com website.

* Use Split enum + always output the label name

Co-authored-by: Julien Chaumond <chaumond@gmail.com>
2020-05-21 09:17:44 -04:00
Tobias Lee
271bedb485 [examples] fix no grad in second pruning in run_bertology (#4479)
* fix no grad in second pruning and typo

* fix prune heads attention mismatch problem

* fix

* fix

* fix

* run make style

* run make style
2020-05-21 09:17:03 -04:00
Julien Chaumond
865d4d595e [ci] Close #4481 2020-05-20 18:27:42 -04:00
Julien Chaumond
a3af8e86cb Update test_trainer_distributed.py 2020-05-20 18:26:51 -04:00
Cola
eacea530c1 🚨 Remove warning of deprecation (#4477)
Remove warning of deprecated overload of addcdiv_

Fix #4451
2020-05-20 16:48:29 -04:00
Julien Plu
fa2fbed3e5 Better None gradients handling in TF Trainer (#4469)
* Better None gradients handling

* Apply Style

* Apply Style
2020-05-20 16:46:21 -04:00
Oliver Åstrand
e708bb75bf Correct TF formatting to exclude LayerNorms from weight decay (#4448)
* Exclude LayerNorms from weight decay

* Include both formats of layer norm
2020-05-20 16:45:59 -04:00
Rens
49c06132df pass on tokenizer to pipeline (#4489) 2020-05-20 22:23:21 +02:00
Nathan Cooper
cacb654c7f Add Fine-tune DialoGPT on new datasets notebook (#4473) 2020-05-20 16:17:52 -04:00
Timo Moeller
30a09f3827 Adjust german bert model card, add new model card (#4488) 2020-05-20 16:08:29 -04:00
Lysandre Debut
14cb5b35fa Fix slow gpu tests lysandre (#4487)
* There is one missing key in BERT

* Correct device for CamemBERT model

* RoBERTa tokenization adding prefix space

* Style
2020-05-20 11:59:45 -04:00
Manuel Romero
6dc52c78d8 Create README.md (#4482) 2020-05-20 09:45:50 -04:00
Manuel Romero
ed5456daf4 Model card for RuPERTa-base fine-tuned for NER (#4466) 2020-05-20 09:45:24 -04:00
Oleksandr Bushkovskyi
c76450e20c Model card for Tereveni-AI/gpt2-124M-uk-fiction (#4470)
Create model card for "Tereveni-AI/gpt2-124M-uk-fiction" model
2020-05-20 09:44:26 -04:00
Hu Xu
9907dc523a add BERT trained from review corpus. (#4405)
* add model_cards for BERT trained on reviews.

* add link to repository.

* refine README.md for each review model
2020-05-20 09:42:35 -04:00
Sam Shleifer
efbc1c5a9d [MarianTokenizer] implement save_vocabulary and other common methods (#4389) 2020-05-19 19:45:49 -04:00
Sam Shleifer
956c4c4eb4 [gpu slow tests] fix mbart-large-enro gpu tests (#4472) 2020-05-19 19:45:31 -04:00
Patrick von Platen
48c3a70b4e [Longformer] Docs and clean API (#4464)
* add longformer docs

* improve docs
2020-05-19 21:52:36 +02:00
Patrick von Platen
aa925a52fa [Tests, GPU, SLOW] fix a bunch of GPU hardcoded tests in Pytorch (#4468)
* fix gpu slow tests in pytorch

* change model to device syntax
2020-05-19 21:35:04 +02:00
Suraj Patil
5856999a9f add T5 fine-tuning notebook [Community notebooks] (#4462)
* add T5 fine-tuning notebook [Community notebooks]

* Update README.md

Co-authored-by: Patrick von Platen <patrick.v.platen@gmail.com>
2020-05-19 18:26:28 +02:00
Sam Shleifer
07dd7c2fd8 [cleanup] test_tokenization_common.py (#4390) 2020-05-19 10:46:55 -04:00
Iz Beltagy
8f1d047148 Longformer (#4352)
* first commit

* bug fixes

* better examples

* undo padding

* remove wrong VOCAB_FILES_NAMES

* License

* make style

* make isort happy

* unit tests

* integration test

* make `black` happy by undoing `isort` changes!!

* lint

* no need for the padding value

* batch_size not bsz

* remove unused type casting

* seqlen not seq_len

* staticmethod

* `bert` selfattention instead of `n2`

* uint8 instead of bool + lints

* pad inputs_embeds using embeddings not a constant

* black

* unit test with padding

* fix unit tests

* remove redundant unit test

* upload model weights

* resolve todo

* simpler _mask_invalid_locations without lru_cache + backward compatible masked_fill_

* increase unittest coverage
2020-05-19 16:04:43 +02:00
Girishkumar
31eedff5a0 Refactored the README.md file (#4427) 2020-05-19 09:56:24 -04:00
Shaoyen
384f0eb2f9 Map optimizer to correct device after loading from checkpoint. (#4403)
* Map optimizer to correct device after loading from checkpoint.

* Make style test pass

Co-authored-by: Julien Chaumond <chaumond@gmail.com>
2020-05-18 23:16:05 -04:00
Julien Chaumond
bf14ef75f1 [Trainer] move model to device before setting optimizer (#4450) 2020-05-18 23:13:33 -04:00
Julien Chaumond
5e7fe8b585 Distributed eval: SequentialDistributedSampler + gather all results (#4243)
* Distributed eval: SequentialDistributedSampler + gather all results

* For consistency only write to disk from world_master

Close https://github.com/huggingface/transformers/issues/4272

* Working distributed eval

* Hook into scripts

* Fix #3721 again

* TPU.mesh_reduce: stay in tensor space

Thanks @jysohn23

* Just a small comment

* whitespace

* torch.hub: pip install packaging

* Add test scenarii
2020-05-18 22:02:39 -04:00
Julien Chaumond
4c06893610 Fix nn.DataParallel compatibility in PyTorch 1.5 (#4300)
* Test case for #3936

* multigpu tests pass on pytorch 1.4.0

* Fixup

* multigpu tests pass on pytorch 1.5.0

* Update src/transformers/modeling_utils.py

* Update src/transformers/modeling_utils.py

* rename multigpu to require_multigpu

* mode doc
2020-05-18 20:34:50 -04:00
Rakesh Chada
9de4afa897 Make get_last_lr in trainer backward compatible (#4446)
* makes fetching last learning late in trainer backward compatible

* split comment to multiple lines

* fixes black styling issue

* uses version to create a more explicit logic
2020-05-18 20:17:36 -04:00
Stefan Dumitrescu
42e8fbfc51 Added model cards for Romanian BERT models (#4437)
* Create README.md

* Create README.md

* Update README.md

* Update README.md

* Apply suggestions from code review

Co-authored-by: Julien Chaumond <chaumond@gmail.com>
2020-05-18 18:48:56 -04:00
Oliver Guhr
54065d68b8 added model card for german-sentiment-bert (#4435) 2020-05-18 18:44:41 -04:00
Martin Müller
e28b7e2311 Create README.md (#4433) 2020-05-18 18:41:34 -04:00
sy-wada
09b933f19d Update README.md (model_card) (#4424)
- add a citation.
- modify the table of the BLUE benchmark.

The table of the first version was not displayed correctly on https://huggingface.co/seiya/oubiobert-base-uncased.
Could you please confirm that this fix will allow you to display it correctly?
2020-05-18 18:18:17 -04:00
Manuel Romero
235777ccc9 Modify example of usage (#4413)
I followed the google example of usage for its electra small model but i have seen it is not meaningful, so i created a better example
2020-05-18 18:17:33 -04:00
Suraj Patil
9ddd3a6548 add model card for t5-base-squad (#4409)
* add model card for t5-base-squad

* Update model_cards/valhalla/t5-base-squad/README.md

Co-authored-by: Julien Chaumond <chaumond@gmail.com>
2020-05-18 18:17:14 -04:00
HUSEIN ZOLKEPLI
c5aa114392 Added README huseinzol05/t5-base-bahasa-cased (#4377)
* add bert bahasa readme

* update readme

* update readme

* added xlnet

* added tiny-bert and fix xlnet readme

* added albert base

* added albert tiny

* added electra model

* added gpt2 117m bahasa readme

* added gpt2 345m bahasa readme

* added t5-base-bahasa

* fix readme

* Update model_cards/huseinzol05/t5-base-bahasa-cased/README.md

Co-authored-by: Julien Chaumond <chaumond@gmail.com>
2020-05-18 18:10:23 -04:00
Funtowicz Morgan
ca4a3f4da9 Adding optimizations block from ONNXRuntime. (#4431)
* Adding optimizations block from ONNXRuntime.

* Turn off external data format by default for PyTorch export.

* Correct the way use_external_format is passed through the cmdline args.
2020-05-18 20:32:33 +02:00
Patrick von Platen
24538df919 [Community notebooks] General notebooks (#4441)
* Update README.md

* Update README.md

* Update README.md

* Update README.md
2020-05-18 20:23:57 +02:00
Sam Shleifer
a699525d25 [test_pipelines] Mark tests > 10s @slow, small speedups (#4421) 2020-05-18 12:23:21 -04:00
Boris Dayma
d9ece8233d fix(run_language_modeling): use arg overwrite_cache (#4407) 2020-05-18 11:37:35 -04:00
Patrick von Platen
d39bf0ac2d better naming in tf t5 (#4401) 2020-05-18 11:34:00 -04:00
Patrick von Platen
590adb130b improve docstring (#4422) 2020-05-18 11:31:35 -04:00
Patrick von Platen
026a5d0888 [T5 fp16] Fix fp16 in T5 (#4436)
* fix fp16 in t5

* make style

* refactor invert_attention_mask fn

* fix typo
2020-05-18 17:25:58 +02:00
Soham Chatterjee
fa6113f9a0 Fixed spelling of training (#4416) 2020-05-18 11:23:29 -04:00
Julien Chaumond
757baee846 Fix un-prefixed f-string
see https://github.com/huggingface/transformers/pull/4367#discussion_r426356693

Hat/tip @girishponkiya
2020-05-18 11:20:46 -04:00
Patrick von Platen
a27c795908 fix (#4419) 2020-05-18 15:51:40 +02:00
Funtowicz Morgan
31c799a0c9 Tag onnx export tests as slow (#4432) 2020-05-18 09:24:41 -04:00
Mehrad Moradshahi
8581a670e3 [MbartTokenizer] save to sentencepiece.bpe.model (#4335) 2020-05-18 08:54:04 -04:00
Lorenzo Ampil
18d233d525 Allow the creation of "entity groups" for NerPipeline #3548 (#3957)
* Add index to be returned by NerPipeline to allow for the creation of

* Add entity groups

* Convert entity list to dict

* Add entity to entity_group_disagg atfter updating entity gorups

* Change 'group' parameter to 'grouped_entities'

* Add unit tests for grouped NER pipeline case

* Correct variable name typo for NER_FINETUNED_MODELS

* Sync grouped tests to recent test updates
2020-05-17 09:25:17 +02:00
Julien Chaumond
3e0f062106 Fix addcmul_ 2020-05-15 17:44:17 -04:00
Julien Chaumond
fc2a4c88ce Fix: one more try 2020-05-15 17:38:48 -04:00
Julien Chaumond
55bda52555 Same fix for addcmul_ 2020-05-15 17:23:48 -04:00
Julien Chaumond
ad02c961c6 Fix UserWarning: This overload of add_ is deprecated in pytorch==1.5.0 2020-05-15 17:09:11 -04:00
Julien Chaumond
15550ce0d1 [skip ci] remove local rank 2020-05-15 17:08:38 -04:00
Nikita
62427d0815 rerun notebook 02-transformers (#4341) 2020-05-15 10:33:08 -04:00
Jared T Nielsen
34706ba050 Allow for None gradients in GradientAccumulator. (#4372) 2020-05-15 09:52:00 -04:00
Lysandre Debut
edf9ac11d4 Should return overflowing information for the log (#4385) 2020-05-15 09:49:11 -04:00
Funtowicz Morgan
b908f2e9dd Attempt to unpin torch version for Github Action. (#4384) 2020-05-15 15:47:15 +02:00
Julien Chaumond
af2e6bf87c [examples] Streamline doc 2020-05-14 20:34:31 -04:00
Lysandre Debut
7defc6670f p_mask in SQuAD pre-processing (#4049)
* Better p_mask building

* Adressing @mfuntowicz comments
2020-05-14 17:07:52 -04:00
Morgan Funtowicz
84894974bd Updated ONNX notebook link in README. 2020-05-14 22:40:59 +02:00
Funtowicz Morgan
db0076a9df Conversion script to export transformers models to ONNX IR. (#4253)
* Added generic ONNX conversion script for PyTorch model.

* WIP initial TF support.

* TensorFlow/Keras ONNX export working.

* Print framework version info

* Add possibility to check the model is correctly loading on ONNX runtime.

* Remove quantization option.

* Specify ONNX opset version when exporting.

* Formatting.

* Remove unused imports.

* Make functions more generally reusable from other part of the code.

* isort happy.

* flake happy

* Export only feature-extraction for now

* Correctly check inputs order / filter before export.

* Removed task variable

* Fix invalid args call in load_graph_from_args.

* Fix invalid args call in convert.

* Fix invalid args call in infer_shapes.

* Raise exception and catch in caller function instead of exit.

* Add 04-onnx-export.ipynb notebook

* More WIP on the notebook

* Remove unused imports

* Simplify & remove unused constants.

* Export with constant_folding in PyTorch

* Let's try to put function args in the right order this time ...

* Disable external_data_format temporary

* ONNX notebook draft ready.

* Updated notebooks charts + wording

* Correct error while exporting last chart in notebook.

* Adressing @LysandreJik comment.

* Set ONNX opset to 11 as default value.

* Set opset param mandatory

* Added ONNX export unittests

* Quality.

* flake8 happy

* Add keras2onnx dependency on extras["tf"]

* Pin keras2onnx on github master to v1.6.5

* Second attempt.

* Third attempt.

* Use the right repo URL this time ...

* Do the same for onnxconverter-common

* Added keras2onnx and onnxconveter-common to 1.7.0 to supports TF2.2

* Correct commit hash.

* Addressing PR review: Optimization are enabled by default.

* Addressing PR review: small changes in the notebook

* setup.py comment about keras2onnx versioning.
2020-05-14 16:35:52 -04:00
Suraj Patil
2d05480174 Fix trainer evaluation (#4363)
* fix loss calculation in evaluation

* fix evaluation on TPU when prediction_loss_only is True
2020-05-14 14:39:44 -04:00
Savaş Yıldırım
035678efdb Create README.md (#4359)
* Create README.md

* Update model_cards/savasy/bert-base-turkish-squad/README.md

Co-authored-by: Julien Chaumond <chaumond@gmail.com>
2020-05-14 14:07:32 -04:00
sy-wada
b9c9e05381 Create README.md (#4357) 2020-05-14 14:06:10 -04:00
Sam Shleifer
9535bf1977 Tokenizer.batch_decode convenience method (#4159) 2020-05-14 13:50:47 -04:00
Sam Shleifer
7822cd38a0 [tests] make pipelines tests faster with smaller models (#4238)
covers torch and tf. Also fixes a failing @slow test
2020-05-14 13:36:02 -04:00
Julien Chaumond
448c467256 Fix: unpin flake8 and fix cs errors (#4367)
* Fix: unpin flake8 and fix cs errors

* Ok we still need to quote those
2020-05-14 13:14:26 -04:00
Julien Chaumond
c547f15a17 Use Filelock to ensure distributed barriers
see context in https://github.com/huggingface/transformers/pull/4223
2020-05-14 11:58:32 -04:00
Julien Chaumond
015f7812ed [ci skip] Pin isort 2020-05-14 10:12:18 -04:00
Lysandre Debut
ef46ccb05c TPU needs a rendezvous (#4339) 2020-05-14 08:59:52 -04:00
Viktor Alm
94cb73c2d2 Add image and metadata (#4345)
Unfortunately i accidentally orphaned my other PR
2020-05-13 20:05:15 -04:00
Manuel Romero
a0eebdc404 Add link to W&B to see whole training logs (#4348) 2020-05-13 20:04:57 -04:00
260 changed files with 13487 additions and 4002 deletions

View File

@@ -21,7 +21,7 @@ jobs:
- name: Install dependencies
run: |
pip install torch
pip install numpy tokenizers filelock requests tqdm regex sentencepiece sacremoses
pip install numpy tokenizers filelock requests tqdm regex sentencepiece sacremoses packaging
- name: Torch hub list
run: |

View File

@@ -35,7 +35,7 @@ jobs:
- name: Install dependencies
run: |
source .env/bin/activate
pip install torch==1.4.0
pip install torch
pip install .[sklearn,testing]
- name: Are GPUs recognized by our DL frameworks

View File

@@ -31,13 +31,12 @@ jobs:
- name: Install dependencies
run: |
source .env/bin/activate
pip install .[sklearn,tf,torch,testing]
pip install .[sklearn,torch,testing]
- name: Are GPUs recognized by our DL frameworks
run: |
source .env/bin/activate
python -c "import torch; print(torch.cuda.is_available())"
python -c "import tensorflow as tf; print(tf.test.is_built_with_cuda(), tf.config.list_physical_devices('GPU'))"
- name: Run all tests on GPU
env:

View File

@@ -44,9 +44,16 @@ Did not find it? :( So we can act quickly on it, please follow these steps:
To get the OS and software versions automatically, you can run the following command:
```bash
python transformers-cli env
transformers-cli env
```
or from the root of the repository the following command:
```bash
python src/transformers/commands/transformers_cli.py env
```
### Do you want to implement a new model?
Awesome! Please provide the following information:
@@ -198,11 +205,12 @@ Follow these steps to start contributing:
are useful to avoid duplicated work, and to differentiate it from PRs ready
to be merged;
4. Make sure existing tests pass;
5. Add high-coverage tests. No quality test, no merge.
5. Add high-coverage tests. No quality testing = no merge.
- If you are adding a new model, make sure that you use `ModelTester.all_model_classes = (MyModel, MyModelWithLMHead,...)`, which triggers the common tests.
- If you are adding new `@slow` tests, make sure they pass using `RUN_SLOW=1 python -m pytest tests/test_my_new_model.py`.
- If you are adding a new tokenizer, write tests, and make sure `RUN_SLOW=1 python -m pytest tests/test_tokenization_{your_model_name}.py` passes.
CircleCI does not run them.
6. All public methods must have informative docstrings;
6. All public methods must have informative docstrings that work nicely with sphinx. See `modeling_ctrl.py` for an example.
### Tests

View File

@@ -63,7 +63,7 @@ Choose the right framework for every part of a model's lifetime
## Installation
This repo is tested on Python 3.6+, PyTorch 1.0.0+ and TensorFlow 2.0.
This repo is tested on Python 3.6+, PyTorch 1.0.0+ (PyTorch 1.3.1+ for examples) and TensorFlow 2.0.
You should install 🤗 Transformers in a [virtual environment](https://docs.python.org/3/library/venv.html). If you're unfamiliar with Python virtual environments, check out the [user guide](https://packaging.python.org/guides/installing-using-pip-and-virtual-environments/).
@@ -165,8 +165,9 @@ At some point in the future, you'll be able to seamlessly move from pre-training
18. **[DialoGPT](https://huggingface.co/transformers/model_doc/dialogpt.html)** (from Microsoft Research) released with the paper [DialoGPT: Large-Scale Generative Pre-training for Conversational Response Generation](https://arxiv.org/abs/1911.00536) by Yizhe Zhang, Siqi Sun, Michel Galley, Yen-Chun Chen, Chris Brockett, Xiang Gao, Jianfeng Gao, Jingjing Liu, Bill Dolan.
19. **[Reformer](https://huggingface.co/transformers/model_doc/reformer.html)** (from Google Research) released with the paper [Reformer: The Efficient Transformer](https://arxiv.org/abs/2001.04451) by Nikita Kitaev, Łukasz Kaiser, Anselm Levskaya.
20. **[MarianMT](https://huggingface.co/transformers/model_doc/marian.html)** Machine translation models trained using [OPUS](http://opus.nlpl.eu/) data by Jörg Tiedemann. The [Marian Framework](https://marian-nmt.github.io/) is being developed by the Microsoft Translator Team.
21. **[Other community models](https://huggingface.co/models)**, contributed by the [community](https://huggingface.co/users).
22. Want to contribute a new model? We have added a **detailed guide and templates** to guide you in the process of adding a new model. You can find them in the [`templates`](./templates) folder of the repository. Be sure to check the [contributing guidelines](./CONTRIBUTING.md) and contact the maintainers or open an issue to collect feedbacks before starting your PR.
21. **[Longformer](https://huggingface.co/transformers/model_doc/longformer.html)** (from AllenAI) released with the paper [Longformer: The Long-Document Transformer](https://arxiv.org/abs/2004.05150) by Iz Beltagy, Matthew E. Peters, Arman Cohan.
22. **[Other community models](https://huggingface.co/models)**, contributed by the [community](https://huggingface.co/users).
23. Want to contribute a new model? We have added a **detailed guide and templates** to guide you in the process of adding a new model. You can find them in the [`templates`](./templates) folder of the repository. Be sure to check the [contributing guidelines](./CONTRIBUTING.md) and contact the maintainers or open an issue to collect feedbacks before starting your PR.
These implementations have been tested on several datasets (see the example scripts) and should match the performances of the original implementations (e.g. ~93 F1 on SQuAD for BERT Whole-Word-Masking, ~88 F1 on RocStories for OpenAI GPT, ~18.3 perplexity on WikiText 103 for Transformer-XL, ~0.916 Peason R coefficient on STS-B for XLNet). You can find more details on the performances in the Examples section of the [documentation](https://huggingface.co/transformers/examples.html).
@@ -339,8 +340,8 @@ python ./examples/text-classification/run_glue.py \
--do_eval \
--data_dir $GLUE_DIR/$TASK_NAME \
--max_seq_length 128 \
--per_gpu_eval_batch_size=8 \
--per_gpu_train_batch_size=8 \
--per_device_eval_batch_size=8 \
--per_device_train_batch_size=8 \
--learning_rate 2e-5 \
--num_train_epochs 3.0 \
--output_dir /tmp/$TASK_NAME/
@@ -366,8 +367,8 @@ python ./examples/text-classification/run_glue.py \
--data_dir=${GLUE_DIR}/STS-B \
--output_dir=./proc_data/sts-b-110 \
--max_seq_length=128 \
--per_gpu_eval_batch_size=8 \
--per_gpu_train_batch_size=8 \
--per_device_eval_batch_size=8 \
--per_device_train_batch_size=8 \
--gradient_accumulation_steps=1 \
--max_steps=1200 \
--model_name=xlnet-large-cased \
@@ -390,8 +391,8 @@ python -m torch.distributed.launch --nproc_per_node 8 ./examples/text-classifica
--do_eval \
--data_dir $GLUE_DIR/MRPC/ \
--max_seq_length 128 \
--per_gpu_eval_batch_size=8 \
--per_gpu_train_batch_size=8 \
--per_device_eval_batch_size=8 \
--per_device_train_batch_size=8 \
--learning_rate 2e-5 \
--num_train_epochs 3.0 \
--output_dir /tmp/mrpc_output/ \
@@ -427,8 +428,8 @@ python -m torch.distributed.launch --nproc_per_node=8 ./examples/question-answer
--max_seq_length 384 \
--doc_stride 128 \
--output_dir ../models/wwm_uncased_finetuned_squad/ \
--per_gpu_eval_batch_size=3 \
--per_gpu_train_batch_size=3 \
--per_device_eval_batch_size=3 \
--per_device_train_batch_size=3 \
```
Training with these hyper-parameters gave us the following results:

View File

@@ -26,7 +26,7 @@ author = u'huggingface'
# The short X.Y version
version = u''
# The full version, including alpha/beta/rc tags
release = u'2.9.1'
release = u'2.11.0'
# -- General configuration ---------------------------------------------------

View File

@@ -1,649 +0,0 @@
# Examples
In this section a few examples are put together. All of these examples work for several models, making use of the very
similar API between the different models.
**Important**
To run the latest versions of the examples, you have to install from source and install some specific requirements for the examples.
Execute the following steps in a new virtual environment:
```bash
git clone https://github.com/huggingface/transformers
cd transformers
pip install .
pip install -r ./examples/requirements.txt
```
| Section | Description |
|----------------------------|------------------------------------------------------------------------------------------------------------------------------------------
| [TensorFlow 2.0 models on GLUE](#TensorFlow-2.0-Bert-models-on-GLUE) | Examples running BERT TensorFlow 2.0 model on the GLUE tasks. |
| [Running on TPUs](#running-on-tpus) | Examples on running fine-tuning tasks on Google TPUs to accelerate workloads. |
| [Language Model training](#language-model-training) | Fine-tuning (or training from scratch) the library models for language modeling on a text dataset. Causal language modeling for GPT/GPT-2, masked language modeling for BERT/RoBERTa. |
| [Language Generation](#language-generation) | Conditional text generation using the auto-regressive models of the library: GPT, GPT-2, Transformer-XL and XLNet. |
| [GLUE](#glue) | Examples running BERT/XLM/XLNet/RoBERTa on the 9 GLUE tasks. Examples feature distributed training as well as half-precision. |
| [SQuAD](#squad) | Using BERT/RoBERTa/XLNet/XLM for question answering, examples with distributed training. |
| [Multiple Choice](#multiple-choice) | Examples running BERT/XLNet/RoBERTa on the SWAG/RACE/ARC tasks. |
| [Named Entity Recognition](https://github.com/huggingface/transformers/tree/master/examples/token-classification) | Using BERT for Named Entity Recognition (NER) on the CoNLL 2003 dataset, examples with distributed training. |
| [XNLI](#xnli) | Examples running BERT/XLM on the XNLI benchmark. |
| [Adversarial evaluation of model performances](#adversarial-evaluation-of-model-performances) | Testing a model with adversarial evaluation of natural language inference on the Heuristic Analysis for NLI Systems (HANS) dataset (McCoy et al., 2019.) |
## TensorFlow 2.0 Bert models on GLUE
Based on the script [`run_tf_glue.py`](https://github.com/huggingface/transformers/blob/master/examples/text-classification/run_tf_glue.py).
Fine-tuning the library TensorFlow 2.0 Bert model for sequence classification on the MRPC task of the GLUE benchmark: [General Language Understanding Evaluation](https://gluebenchmark.com/).
This script has an option for mixed precision (Automatic Mixed Precision / AMP) to run models on Tensor Cores (NVIDIA Volta/Turing GPUs) and future hardware and an option for XLA, which uses the XLA compiler to reduce model runtime.
Options are toggled using `USE_XLA` or `USE_AMP` variables in the script.
These options and the below benchmark are provided by @tlkh.
Quick benchmarks from the script (no other modifications):
| GPU | Mode | Time (2nd epoch) | Val Acc (3 runs) |
| --------- | -------- | ----------------------- | ----------------------|
| Titan V | FP32 | 41s | 0.8438/0.8281/0.8333 |
| Titan V | AMP | 26s | 0.8281/0.8568/0.8411 |
| V100 | FP32 | 35s | 0.8646/0.8359/0.8464 |
| V100 | AMP | 22s | 0.8646/0.8385/0.8411 |
| 1080 Ti | FP32 | 55s | - |
Mixed precision (AMP) reduces the training time considerably for the same hardware and hyper-parameters (same batch size was used).
## Running on TPUs
You can accelerate your workloads on Google's TPUs. For information on how to setup your TPU environment refer to this
[README](https://github.com/pytorch/xla/blob/master/README.md).
The following are some examples of running the `*_tpu.py` finetuning scripts on TPUs. All steps for data preparation are
identical to your normal GPU + Huggingface setup.
### GLUE
Before running anyone of these GLUE tasks you should download the
[GLUE data](https://gluebenchmark.com/tasks) by running
[this script](https://gist.github.com/W4ngatang/60c2bdb54d156a41194446737ce03e2e)
and unpack it to some directory `$GLUE_DIR`.
For running your GLUE task on MNLI dataset you can run something like the following:
```
export XRT_TPU_CONFIG="tpu_worker;0;$TPU_IP_ADDRESS:8470"
export GLUE_DIR=/path/to/glue
export TASK_NAME=MNLI
python run_glue_tpu.py \
--model_type bert \
--model_name_or_path bert-base-cased \
--task_name $TASK_NAME \
--do_train \
--do_eval \
--data_dir $GLUE_DIR/$TASK_NAME \
--max_seq_length 128 \
--train_batch_size 32 \
--learning_rate 3e-5 \
--num_train_epochs 3.0 \
--output_dir /tmp/$TASK_NAME \
--overwrite_output_dir \
--logging_steps 50 \
--save_steps 200 \
--num_cores=8 \
--only_log_master
```
## Language model training
Based on the script [`run_language_modeling.py`](https://github.com/huggingface/transformers/blob/master/examples/language-modeling/run_language_modeling.py).
Fine-tuning (or training from scratch) the library models for language modeling on a text dataset for GPT, GPT-2, BERT and RoBERTa (DistilBERT
to be added soon). GPT and GPT-2 are fine-tuned using a causal language modeling (CLM) loss while BERT and RoBERTa
are fine-tuned using a masked language modeling (MLM) loss.
Before running the following example, you should get a file that contains text on which the language model will be
trained or fine-tuned. A good example of such text is the [WikiText-2 dataset](https://blog.einstein.ai/the-wikitext-long-term-dependency-language-modeling-dataset/).
We will refer to two different files: `$TRAIN_FILE`, which contains text for training, and `$TEST_FILE`, which contains
text that will be used for evaluation.
### GPT-2/GPT and causal language modeling
The following example fine-tunes GPT-2 on WikiText-2. We're using the raw WikiText-2 (no tokens were replaced before
the tokenization). The loss here is that of causal language modeling.
```bash
export TRAIN_FILE=/path/to/dataset/wiki.train.raw
export TEST_FILE=/path/to/dataset/wiki.test.raw
python run_language_modeling.py \
--output_dir=output \
--model_type=gpt2 \
--model_name_or_path=gpt2 \
--do_train \
--train_data_file=$TRAIN_FILE \
--do_eval \
--eval_data_file=$TEST_FILE
```
This takes about half an hour to train on a single K80 GPU and about one minute for the evaluation to run. It reaches
a score of ~20 perplexity once fine-tuned on the dataset.
### RoBERTa/BERT and masked language modeling
The following example fine-tunes RoBERTa on WikiText-2. Here too, we're using the raw WikiText-2. The loss is different
as BERT/RoBERTa have a bidirectional mechanism; we're therefore using the same loss that was used during their
pre-training: masked language modeling.
In accordance to the RoBERTa paper, we use dynamic masking rather than static masking. The model may, therefore, converge
slightly slower (over-fitting takes more epochs).
We use the `--mlm` flag so that the script may change its loss function.
```bash
export TRAIN_FILE=/path/to/dataset/wiki.train.raw
export TEST_FILE=/path/to/dataset/wiki.test.raw
python run_language_modeling.py \
--output_dir=output \
--model_type=roberta \
--model_name_or_path=roberta-base \
--do_train \
--train_data_file=$TRAIN_FILE \
--do_eval \
--eval_data_file=$TEST_FILE \
--mlm
```
## Language generation
Based on the script [`run_generation.py`](https://github.com/huggingface/transformers/blob/master/examples/text-generation/run_generation.py).
Conditional text generation using the auto-regressive models of the library: GPT, GPT-2, Transformer-XL, XLNet, CTRL.
A similar script is used for our official demo [Write With Transfomer](https://transformer.huggingface.co), where you
can try out the different models available in the library.
Example usage:
```bash
python run_generation.py \
--model_type=gpt2 \
--model_name_or_path=gpt2
```
## GLUE
Based on the script [`run_glue.py`](https://github.com/huggingface/transformers/blob/master/examples/text-classification/run_glue.py).
Fine-tuning the library models for sequence classification on the GLUE benchmark: [General Language Understanding
Evaluation](https://gluebenchmark.com/). This script can fine-tune the following models: BERT, XLM, XLNet and RoBERTa.
GLUE is made up of a total of 9 different tasks. We get the following results on the dev set of the benchmark with an
uncased BERT base model (the checkpoint `bert-base-uncased`). All experiments ran single V100 GPUs with a total train
batch sizes between 16 and 64. Some of these tasks have a small dataset and training can lead to high variance in the results
between different runs. We report the median on 5 runs (with different seeds) for each of the metrics.
| Task | Metric | Result |
|-------|------------------------------|-------------|
| CoLA | Matthew's corr | 49.23 |
| SST-2 | Accuracy | 91.97 |
| MRPC | F1/Accuracy | 89.47/85.29 |
| STS-B | Person/Spearman corr. | 83.95/83.70 |
| QQP | Accuracy/F1 | 88.40/84.31 |
| MNLI | Matched acc./Mismatched acc. | 80.61/81.08 |
| QNLI | Accuracy | 87.46 |
| RTE | Accuracy | 61.73 |
| WNLI | Accuracy | 45.07 |
Some of these results are significantly different from the ones reported on the test set
of GLUE benchmark on the website. For QQP and WNLI, please refer to [FAQ #12](https://gluebenchmark.com/faq) on the webite.
Before running any one of these GLUE tasks you should download the
[GLUE data](https://gluebenchmark.com/tasks) by running
[this script](https://gist.github.com/W4ngatang/60c2bdb54d156a41194446737ce03e2e)
and unpack it to some directory `$GLUE_DIR`.
```bash
export GLUE_DIR=/path/to/glue
export TASK_NAME=MRPC
python run_glue.py \
--model_type bert \
--model_name_or_path bert-base-cased \
--task_name $TASK_NAME \
--do_train \
--do_eval \
--data_dir $GLUE_DIR/$TASK_NAME \
--max_seq_length 128 \
--per_gpu_train_batch_size 32 \
--learning_rate 2e-5 \
--num_train_epochs 3.0 \
--output_dir /tmp/$TASK_NAME/
```
where task name can be one of CoLA, SST-2, MRPC, STS-B, QQP, MNLI, QNLI, RTE, WNLI.
The dev set results will be present within the text file `eval_results.txt` in the specified output_dir.
In case of MNLI, since there are two separate dev sets (matched and mismatched), there will be a separate
output folder called `/tmp/MNLI-MM/` in addition to `/tmp/MNLI/`.
The code has not been tested with half-precision training with apex on any GLUE task apart from MRPC, MNLI,
CoLA, SST-2. The following section provides details on how to run half-precision training with MRPC. With that being
said, there shouldnt be any issues in running half-precision training with the remaining GLUE tasks as well,
since the data processor for each task inherits from the base class DataProcessor.
### MRPC
#### Fine-tuning example
The following examples fine-tune BERT on the Microsoft Research Paraphrase Corpus (MRPC) corpus and runs in less
than 10 minutes on a single K-80 and in 27 seconds (!) on single tesla V100 16GB with apex installed.
Before running any one of these GLUE tasks you should download the
[GLUE data](https://gluebenchmark.com/tasks) by running
[this script](https://gist.github.com/W4ngatang/60c2bdb54d156a41194446737ce03e2e)
and unpack it to some directory `$GLUE_DIR`.
```bash
export GLUE_DIR=/path/to/glue
python run_glue.py \
--model_name_or_path bert-base-cased \
--task_name MRPC \
--do_train \
--do_eval \
--data_dir $GLUE_DIR/MRPC/ \
--max_seq_length 128 \
--per_gpu_train_batch_size 32 \
--learning_rate 2e-5 \
--num_train_epochs 3.0 \
--output_dir /tmp/mrpc_output/
```
Our test ran on a few seeds with [the original implementation hyper-
parameters](https://github.com/google-research/bert#sentence-and-sentence-pair-classification-tasks) gave evaluation
results between 84% and 88%.
#### Using Apex and mixed-precision
Using Apex and 16 bit precision, the fine-tuning on MRPC only takes 27 seconds. First install
[apex](https://github.com/NVIDIA/apex), then run the following example:
```bash
export GLUE_DIR=/path/to/glue
python run_glue.py \
--model_name_or_path bert-base-cased \
--task_name MRPC \
--do_train \
--do_eval \
--data_dir $GLUE_DIR/MRPC/ \
--max_seq_length 128 \
--per_gpu_train_batch_size 32 \
--learning_rate 2e-5 \
--num_train_epochs 3.0 \
--output_dir /tmp/mrpc_output/ \
--fp16
```
#### Distributed training
Here is an example using distributed training on 8 V100 GPUs. The model used is the BERT whole-word-masking and it
reaches F1 > 92 on MRPC.
```bash
export GLUE_DIR=/path/to/glue
python -m torch.distributed.launch \
--nproc_per_node 8 run_glue.py \
--model_name_or_path bert-base-cased \
--task_name MRPC \
--do_train \
--do_eval \
--data_dir $GLUE_DIR/MRPC/ \
--max_seq_length 128 \
--per_gpu_train_batch_size 8 \
--learning_rate 2e-5 \
--num_train_epochs 3.0 \
--output_dir /tmp/mrpc_output/
```
Training with these hyper-parameters gave us the following results:
```bash
acc = 0.8823529411764706
acc_and_f1 = 0.901702786377709
eval_loss = 0.3418912578906332
f1 = 0.9210526315789473
global_step = 174
loss = 0.07231863956341798
```
### MNLI
The following example uses the BERT-large, uncased, whole-word-masking model and fine-tunes it on the MNLI task.
```bash
export GLUE_DIR=/path/to/glue
python -m torch.distributed.launch \
--nproc_per_node 8 run_glue.py \
--model_name_or_path bert-base-cased \
--task_name mnli \
--do_train \
--do_eval \
--data_dir $GLUE_DIR/MNLI/ \
--max_seq_length 128 \
--per_gpu_train_batch_size 8 \
--learning_rate 2e-5 \
--num_train_epochs 3.0 \
--output_dir output_dir \
```
The results are the following:
```bash
***** Eval results *****
acc = 0.8679706601466992
eval_loss = 0.4911287787382479
global_step = 18408
loss = 0.04755385363816904
***** Eval results *****
acc = 0.8747965825874695
eval_loss = 0.45516540421714036
global_step = 18408
loss = 0.04755385363816904
```
## Multiple Choice
Based on the script [`run_multiple_choice.py`]().
#### Fine-tuning on SWAG
Download [swag](https://github.com/rowanz/swagaf/tree/master/data) data
```bash
#training on 4 tesla V100(16GB) GPUS
export SWAG_DIR=/path/to/swag_data_dir
python ./examples/multiple-choice/run_multiple_choice.py \
--task_name swag \
--model_name_or_path roberta-base \
--do_train \
--do_eval \
--data_dir $SWAG_DIR \
--learning_rate 5e-5 \
--num_train_epochs 3 \
--max_seq_length 80 \
--output_dir models_bert/swag_base \
--per_gpu_eval_batch_size=16 \
--per_gpu_train_batch_size=16 \
--gradient_accumulation_steps 2 \
--overwrite_output
```
Training with the defined hyper-parameters yields the following results:
```
***** Eval results *****
eval_acc = 0.8338998300509847
eval_loss = 0.44457291918821606
```
## SQuAD
Based on the script [`run_squad.py`](https://github.com/huggingface/transformers/blob/master/examples/question-answering/run_squad.py).
#### Fine-tuning BERT on SQuAD1.0
This example code fine-tunes BERT on the SQuAD1.0 dataset. It runs in 24 min (with BERT-base) or 68 min (with BERT-large)
on a single tesla V100 16GB. The data for SQuAD can be downloaded with the following links and should be saved in a
$SQUAD_DIR directory.
* [train-v1.1.json](https://rajpurkar.github.io/SQuAD-explorer/dataset/train-v1.1.json)
* [dev-v1.1.json](https://rajpurkar.github.io/SQuAD-explorer/dataset/dev-v1.1.json)
* [evaluate-v1.1.py](https://github.com/allenai/bi-att-flow/blob/master/squad/evaluate-v1.1.py)
And for SQuAD2.0, you need to download:
- [train-v2.0.json](https://rajpurkar.github.io/SQuAD-explorer/dataset/train-v2.0.json)
- [dev-v2.0.json](https://rajpurkar.github.io/SQuAD-explorer/dataset/dev-v2.0.json)
- [evaluate-v2.0.py](https://worksheets.codalab.org/rest/bundles/0x6b567e1cf2e041ec80d7098f031c5c9e/contents/blob/)
```bash
export SQUAD_DIR=/path/to/SQUAD
python run_squad.py \
--model_type bert \
--model_name_or_path bert-base-uncased \
--do_train \
--do_eval \
--train_file $SQUAD_DIR/train-v1.1.json \
--predict_file $SQUAD_DIR/dev-v1.1.json \
--per_gpu_train_batch_size 12 \
--learning_rate 3e-5 \
--num_train_epochs 2.0 \
--max_seq_length 384 \
--doc_stride 128 \
--output_dir /tmp/debug_squad/
```
Training with the previously defined hyper-parameters yields the following results:
```bash
f1 = 88.52
exact_match = 81.22
```
#### Distributed training
Here is an example using distributed training on 8 V100 GPUs and Bert Whole Word Masking uncased model to reach a F1 > 93 on SQuAD1.1:
```bash
python -m torch.distributed.launch --nproc_per_node=8 ./examples/question-answering/run_squad.py \
--model_type bert \
--model_name_or_path bert-large-uncased-whole-word-masking \
--do_train \
--do_eval \
--train_file $SQUAD_DIR/train-v1.1.json \
--predict_file $SQUAD_DIR/dev-v1.1.json \
--learning_rate 3e-5 \
--num_train_epochs 2 \
--max_seq_length 384 \
--doc_stride 128 \
--output_dir ./examples/models/wwm_uncased_finetuned_squad/ \
--per_gpu_eval_batch_size=3 \
--per_gpu_train_batch_size=3 \
```
Training with the previously defined hyper-parameters yields the following results:
```bash
f1 = 93.15
exact_match = 86.91
```
This fine-tuned model is available as a checkpoint under the reference
`bert-large-uncased-whole-word-masking-finetuned-squad`.
#### Fine-tuning XLNet on SQuAD
This example code fine-tunes XLNet on both SQuAD1.0 and SQuAD2.0 dataset. See above to download the data for SQuAD .
##### Command for SQuAD1.0:
```bash
export SQUAD_DIR=/path/to/SQUAD
python run_squad.py \
--model_type xlnet \
--model_name_or_path xlnet-large-cased \
--do_train \
--do_eval \
--train_file $SQUAD_DIR/train-v1.1.json \
--predict_file $SQUAD_DIR/dev-v1.1.json \
--learning_rate 3e-5 \
--num_train_epochs 2 \
--max_seq_length 384 \
--doc_stride 128 \
--output_dir ./wwm_cased_finetuned_squad/ \
--per_gpu_eval_batch_size=4 \
--per_gpu_train_batch_size=4 \
--save_steps 5000
```
##### Command for SQuAD2.0:
```bash
export SQUAD_DIR=/path/to/SQUAD
python run_squad.py \
--model_type xlnet \
--model_name_or_path xlnet-large-cased \
--do_train \
--do_eval \
--version_2_with_negative \
--train_file $SQUAD_DIR/train-v2.0.json \
--predict_file $SQUAD_DIR/dev-v2.0.json \
--learning_rate 3e-5 \
--num_train_epochs 4 \
--max_seq_length 384 \
--doc_stride 128 \
--output_dir ./wwm_cased_finetuned_squad/ \
--per_gpu_eval_batch_size=2 \
--per_gpu_train_batch_size=2 \
--save_steps 5000
```
Larger batch size may improve the performance while costing more memory.
##### Results for SQuAD1.0 with the previously defined hyper-parameters:
```python
{
"exact": 85.45884578997162,
"f1": 92.5974600601065,
"total": 10570,
"HasAns_exact": 85.45884578997162,
"HasAns_f1": 92.59746006010651,
"HasAns_total": 10570
}
```
##### Results for SQuAD2.0 with the previously defined hyper-parameters:
```python
{
"exact": 80.4177545691906,
"f1": 84.07154997729623,
"total": 11873,
"HasAns_exact": 76.73751686909581,
"HasAns_f1": 84.05558584352873,
"HasAns_total": 5928,
"NoAns_exact": 84.0874684608915,
"NoAns_f1": 84.0874684608915,
"NoAns_total": 5945
}
```
## XNLI
Based on the script [`run_xnli.py`](https://github.com/huggingface/transformers/blob/master/examples/text-classification/run_xnli.py).
[XNLI](https://www.nyu.edu/projects/bowman/xnli/) is crowd-sourced dataset based on [MultiNLI](http://www.nyu.edu/projects/bowman/multinli/). It is an evaluation benchmark for cross-lingual text representations. Pairs of text are labeled with textual entailment annotations for 15 different languages (including both high-resource language such as English and low-resource languages such as Swahili).
#### Fine-tuning on XNLI
This example code fine-tunes mBERT (multi-lingual BERT) on the XNLI dataset. It runs in 106 mins
on a single tesla V100 16GB. The data for XNLI can be downloaded with the following links and should be both saved (and un-zipped) in a
`$XNLI_DIR` directory.
* [XNLI 1.0](https://www.nyu.edu/projects/bowman/xnli/XNLI-1.0.zip)
* [XNLI-MT 1.0](https://www.nyu.edu/projects/bowman/xnli/XNLI-MT-1.0.zip)
```bash
export XNLI_DIR=/path/to/XNLI
python run_xnli.py \
--model_type bert \
--model_name_or_path bert-base-multilingual-cased \
--language de \
--train_language en \
--do_train \
--do_eval \
--data_dir $XNLI_DIR \
--per_gpu_train_batch_size 32 \
--learning_rate 5e-5 \
--num_train_epochs 2.0 \
--max_seq_length 128 \
--output_dir /tmp/debug_xnli/ \
--save_steps -1
```
Training with the previously defined hyper-parameters yields the following results on the **test** set:
```bash
acc = 0.7093812375249501
```
## MM-IMDb
Based on the script [`run_mmimdb.py`](https://github.com/huggingface/transformers/blob/master/examples/contrib/mm-imdb/run_mmimdb.py).
[MM-IMDb](http://lisi1.unal.edu.co/mmimdb/) is a Multimodal dataset with around 26,000 movies including images, plots and other metadata.
### Training on MM-IMDb
```
python run_mmimdb.py \
--data_dir /path/to/mmimdb/dataset/ \
--model_type bert \
--model_name_or_path bert-base-uncased \
--output_dir /path/to/save/dir/ \
--do_train \
--do_eval \
--max_seq_len 512 \
--gradient_accumulation_steps 20 \
--num_image_embeds 3 \
--num_train_epochs 100 \
--patience 5
```
## Adversarial evaluation of model performances
Here is an example on evaluating a model using adversarial evaluation of natural language inference with the Heuristic Analysis for NLI Systems (HANS) dataset [McCoy et al., 2019](https://arxiv.org/abs/1902.01007). The example was gracefully provided by [Nafise Sadat Moosavi](https://github.com/ns-moosavi).
The HANS dataset can be downloaded from [this location](https://github.com/tommccoy1/hans).
This is an example of using test_hans.py:
```bash
export HANS_DIR=path-to-hans
export MODEL_TYPE=type-of-the-model-e.g.-bert-roberta-xlnet-etc
export MODEL_PATH=path-to-the-model-directory-that-is-trained-on-NLI-e.g.-by-using-run_glue.py
python examples/hans/test_hans.py \
--task_name hans \
--model_type $MODEL_TYPE \
--do_eval \
--data_dir $HANS_DIR \
--model_name_or_path $MODEL_PATH \
--max_seq_length 128 \
--output_dir $MODEL_PATH \
```
This will create the hans_predictions.txt file in MODEL_PATH, which can then be evaluated using hans/evaluate_heur_output.py from the HANS dataset.
The results of the BERT-base model that is trained on MNLI using batch size 8 and the random seed 42 on the HANS dataset is as follows:
```bash
Heuristic entailed results:
lexical_overlap: 0.9702
subsequence: 0.9942
constituent: 0.9962
Heuristic non-entailed results:
lexical_overlap: 0.199
subsequence: 0.0396
constituent: 0.118
```

1
docs/source/examples.md Symbolic link
View File

@@ -0,0 +1 @@
../../examples/README.md

View File

@@ -109,3 +109,4 @@ The library currently contains PyTorch and Tensorflow implementations, pre-train
model_doc/dialogpt
model_doc/reformer
model_doc/marian
model_doc/longformer

View File

@@ -6,7 +6,7 @@ Overview
The ALBERT model was proposed in `ALBERT: A Lite BERT for Self-supervised Learning of Language Representations <https://arxiv.org/abs/1909.11942>`_
by Zhenzhong Lan, Mingda Chen, Sebastian Goodman, Kevin Gimpel, Piyush Sharma, Radu Soricut. It presents
two parameter-reduction techniques to lower memory consumption and increase the trainig speed of BERT:
two parameter-reduction techniques to lower memory consumption and increase the training speed of BERT:
- Splitting the embedding matrix into two smaller matrices
- Using repeating layers split among groups
@@ -94,3 +94,17 @@ TFAlbertForSequenceClassification
.. autoclass:: transformers.TFAlbertForSequenceClassification
:members:
TFAlbertForMultipleChoice
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
.. autoclass:: transformers.TFAlbertForMultipleChoice
:members:
TFAlbertForQuestionAnswering
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
.. autoclass:: transformers.TFAlbertForQuestionAnswering
:members:

View File

@@ -22,7 +22,7 @@ Implementation Notes
- The forward pass of ``BartModel`` will create decoder inputs (using the helper function ``transformers.modeling_bart._prepare_bart_decoder_inputs``) if they are not passed. This is different than some other modeling APIs.
- Model predictions are intended to be identical to the original implementation. This only works, however, if the string you pass to ``fairseq.encode`` starts with a space.
- ``BartForConditionalGeneration.generate`` should be used for conditional generation tasks like summarization, see the example in that docstrings
- Models that load the ``"bart-large-cnn"`` weights will not have a ``mask_token_id``, or be able to perform mask filling tasks.
- Models that load the ``"facebook/bart-large-cnn"`` weights will not have a ``mask_token_id``, or be able to perform mask filling tasks.

View File

@@ -0,0 +1,91 @@
Longformer
----------------------------------------------------
**DISCLAIMER:** This model is still a work in progress, if you see something strange,
file a `Github Issue <https://github.com/huggingface/transformers/issues/new?assignees=&labels=&template=bug-report.md&title>`_
Overview
~~~~~
The Longformer model was presented in `Longformer: The Long-Document Transformer <https://arxiv.org/pdf/2004.05150.pdf>`_ by Iz Beltagy, Matthew E. Peters, Arman Cohan.
Here the abstract:
*Transformer-based models are unable to process long sequences due to their self-attention operation, which scales quadratically with the sequence length. To address this limitation, we introduce the Longformer with an attention mechanism that scales linearly with sequence length, making it easy to process documents of thousands of tokens or longer. Longformer's attention mechanism is a drop-in replacement for the standard self-attention and combines a local windowed attention with a task motivated global attention. Following prior work on long-sequence transformers, we evaluate Longformer on character-level language modeling and achieve state-of-the-art results on text8 and enwik8. In contrast to most prior work, we also pretrain Longformer and finetune it on a variety of downstream tasks. Our pretrained Longformer consistently outperforms RoBERTa on long document tasks and sets new state-of-the-art results on WikiHop and TriviaQA.*
The Authors' code can be found `here <https://github.com/allenai/longformer>`_ .
Longformer Self Attention
~~~~~~~~~~~~~~~~~~~~
Longformer self attention employs self attention on both a "local" context and a "global" context.
Most tokens only attend "locally" to each other meaning that each token attends to its :math:`\frac{1}{2} w` previous tokens and :math:`\frac{1}{2} w` succeding tokens with :math:`w` being the window length as defined in `config.attention_window`. Note that `config.attention_window` can be of type ``list`` to define a different :math:`w` for each layer.
A selecetd few tokens attend "globally" to all other tokens, as it is conventionally done for all tokens in *e.g.* `BertSelfAttention`.
Note that "locally" and "globally" attending tokens are projected by different query, key and value matrices.
Also note that every "locally" attending token not only attends to tokens within its window :math:`w`, but also to all "globally" attending tokens so that global attention is *symmetric*.
The user can define which tokens attend "locally" and which tokens attend "globally" by setting the tensor `global_attention_mask` at run-time appropriately. `Longformer` employs the following logic for `global_attention_mask`: `0` - the token attends "locally", `1` - token attends "globally". For more information please also refer to :func:`~transformers.LongformerModel.forward` method.
Using Longformer self attention, the memory and time complexity of the query-key matmul operation, which usually represents the memory and time bottleneck, can be reduced from :math:`\mathcal{O}(n_s \times n_s)` to :math:`\mathcal{O}(n_s \times w)`, with :math:`n_s` being the sequence length and :math:`w` being the average window size. It is assumed that the number of "globally" attending tokens is insignificant as compared to the number of "locally" attending tokens.
For more information, please refer to the official `paper <https://arxiv.org/pdf/2004.05150.pdf>`_ .
Training
~~~~~~~~~~~~~~~~~~~~
``LongformerForMaskedLM`` is trained the exact same way, ``RobertaForMaskedLM`` is trained and
should be used as follows:
::
input_ids = tokenizer.encode('This is a sentence from [MASK] training data', return_tensors='pt')
mlm_labels = tokenizer.encode('This is a sentence from the training data', return_tensors='pt')
loss = model(input_ids, labels=input_ids, masked_lm_labels=mlm_labels)[0]
LongformerConfig
~~~~~~~~~~~~~~~~~~~~~
.. autoclass:: transformers.LongformerConfig
:members:
LongformerTokenizer
~~~~~~~~~~~~~~~~~~~~~
.. autoclass:: transformers.LongformerTokenizer
:members:
LongformerModel
~~~~~~~~~~~~~~~~~~~~
.. autoclass:: transformers.LongformerModel
:members:
LongformerForMaskedLM
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
.. autoclass:: transformers.LongformerForMaskedLM
:members:
LongformerForQuestionAnswering
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
.. autoclass:: transformers.LongformerForQuestionAnswering
:members:
LongformerForMultipleChoice
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
.. autoclass:: transformers.LongformerForMultipleChoice
:members:
LongformerForTokenClassification
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
.. autoclass:: transformers.LongformerForTokenClassification
:members:

View File

@@ -5,7 +5,7 @@ file a `Github Issue <https://github.com/huggingface/transformers/issues/new?ass
Overview
~~~~~
The Reformer model was presented in `Reformer: The Efficient Transformer <https://https://arxiv.org/abs/2001.04451.pdf>`_ by Nikita Kitaev, Łukasz Kaiser, Anselm Levskaya.
The Reformer model was presented in `Reformer: The Efficient Transformer <https://arxiv.org/abs/2001.04451.pdf>`_ by Nikita Kitaev, Łukasz Kaiser, Anselm Levskaya.
Here the abstract:
*Large Transformer models routinely achieve state-of-the-art results on a number of tasks but training these models can be prohibitively costly, especially on long sequences. We introduce two techniques to improve the efficiency of Transformers. For one, we replace dot-product attention by one that uses locality-sensitive hashing, changing its complexity from O(L^2) to O(Llog(L)), where L is the length of the sequence. Furthermore, we use reversible residual layers instead of the standard residuals, which allows storing activations only once in the training process instead of N times, where N is the number of layers. The resulting model, the Reformer, performs on par with Transformer models while being much more memory-efficient and much faster on long sequences.*
@@ -62,7 +62,7 @@ For more information, see the `original Paper <https://arxiv.org/abs/2001.04451>
Note that ``config.num_buckets`` can also be factorized into a ``list``:math:`(n_{\text{buckets}}^1, n_{\text{buckets}}^2)`. This way instead of assigning the query key embedding vectors to one of :math:`(1,\ldots, n_{\text{buckets}})` they are assigned to one of :math:`(1-1,\ldots, n_{\text{buckets}}^1-1, \ldots, 1-n_{\text{buckets}}^2, \ldots, n_{\text{buckets}}^1-n_{\text{buckets}}^2)`. This is crucial for very long sequences to save memory.
It is recommended to leave ``config.num_buckets=None``, so that depending on the sequence length, a good value for ``num_buckets`` are calculated on the fly.
When training a model from scratch, it is recommended to leave ``config.num_buckets=None``, so that depending on the sequence length a good value for ``num_buckets`` is calculated on the fly. This value will then automatically be saved in the config and should be reused for inference.
Using LSH self attention, the memory and time complexity of the query-key matmul operation can be reduced from :math:`\mathcal{O}(n_s \times n_s)` to :math:`\mathcal{O}(n_s \times \log(n_s))`, which usually represents the memory and time bottleneck in a transformer model, with :math:`n_s` being the sequence length.

View File

@@ -74,6 +74,13 @@ RobertaForSequenceClassification
:members:
RobertaForMultipleChoice
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
.. autoclass:: transformers.RobertaForMultipleChoice
:members:
RobertaForTokenClassification
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

View File

@@ -63,33 +63,33 @@ For a list that includes community-uploaded models, refer to `https://huggingfac
| | | | Trained on uncased German text by DBMDZ |
| | | (see `details on dbmdz repository <https://github.com/dbmdz/german-bert>`__). |
| +------------------------------------------------------------+---------------------------------------------------------------------------------------------------------------------------------------+
| | ``bert-base-japanese`` | | 12-layer, 768-hidden, 12-heads, 110M parameters. |
| | ``cl-tohoku/bert-base-japanese`` | | 12-layer, 768-hidden, 12-heads, 110M parameters. |
| | | | Trained on Japanese text. Text is tokenized with MeCab and WordPiece. |
| | | | `MeCab <https://taku910.github.io/mecab/>`__ is required for tokenization. |
| | | (see `details on cl-tohoku repository <https://github.com/cl-tohoku/bert-japanese>`__). |
| +------------------------------------------------------------+---------------------------------------------------------------------------------------------------------------------------------------+
| | ``bert-base-japanese-whole-word-masking`` | | 12-layer, 768-hidden, 12-heads, 110M parameters. |
| | ``cl-tohoku/bert-base-japanese-whole-word-masking`` | | 12-layer, 768-hidden, 12-heads, 110M parameters. |
| | | | Trained on Japanese text using Whole-Word-Masking. Text is tokenized with MeCab and WordPiece. |
| | | | `MeCab <https://taku910.github.io/mecab/>`__ is required for tokenization. |
| | | (see `details on cl-tohoku repository <https://github.com/cl-tohoku/bert-japanese>`__). |
| +------------------------------------------------------------+---------------------------------------------------------------------------------------------------------------------------------------+
| | ``bert-base-japanese-char`` | | 12-layer, 768-hidden, 12-heads, 110M parameters. |
| | ``cl-tohoku/bert-base-japanese-char`` | | 12-layer, 768-hidden, 12-heads, 110M parameters. |
| | | | Trained on Japanese text. Text is tokenized into characters. |
| | | (see `details on cl-tohoku repository <https://github.com/cl-tohoku/bert-japanese>`__). |
| +------------------------------------------------------------+---------------------------------------------------------------------------------------------------------------------------------------+
| | ``bert-base-japanese-char-whole-word-masking`` | | 12-layer, 768-hidden, 12-heads, 110M parameters. |
| | ``cl-tohoku/bert-base-japanese-char-whole-word-masking`` | | 12-layer, 768-hidden, 12-heads, 110M parameters. |
| | | | Trained on Japanese text using Whole-Word-Masking. Text is tokenized into characters. |
| | | (see `details on cl-tohoku repository <https://github.com/cl-tohoku/bert-japanese>`__). |
| +------------------------------------------------------------+---------------------------------------------------------------------------------------------------------------------------------------+
| | ``bert-base-finnish-cased-v1`` | | 12-layer, 768-hidden, 12-heads, 110M parameters. |
| | ``TurkuNLP/bert-base-finnish-cased-v1`` | | 12-layer, 768-hidden, 12-heads, 110M parameters. |
| | | | Trained on cased Finnish text. |
| | | (see `details on turkunlp.org <http://turkunlp.org/FinBERT/>`__). |
| +------------------------------------------------------------+---------------------------------------------------------------------------------------------------------------------------------------+
| | ``bert-base-finnish-uncased-v1`` | | 12-layer, 768-hidden, 12-heads, 110M parameters. |
| | ``TurkuNLP/bert-base-finnish-uncased-v1`` | | 12-layer, 768-hidden, 12-heads, 110M parameters. |
| | | | Trained on uncased Finnish text. |
| | | (see `details on turkunlp.org <http://turkunlp.org/FinBERT/>`__). |
| +------------------------------------------------------------+---------------------------------------------------------------------------------------------------------------------------------------+
| | ``bert-base-dutch-cased`` | | 12-layer, 768-hidden, 12-heads, 110M parameters. |
| | ``wietsedv/bert-base-dutch-cased`` | | 12-layer, 768-hidden, 12-heads, 110M parameters. |
| | | | Trained on cased Dutch text. |
| | | (see `details on wietsedv repository <https://github.com/wietsedv/bertje/>`__). |
+-------------------+------------------------------------------------------------+---------------------------------------------------------------------------------------------------------------------------------------+
@@ -259,32 +259,32 @@ For a list that includes community-uploaded models, refer to `https://huggingfac
| | ``xlm-roberta-large`` | | ~355M parameters with 24-layers, 1027-hidden-state, 4096 feed-forward hidden-state, 16-heads, |
| | | | Trained on 2.5 TB of newly created clean CommonCrawl data in 100 languages |
+-------------------+------------------------------------------------------------+---------------------------------------------------------------------------------------------------------------------------------------+
| FlauBERT | ``flaubert-small-cased`` | | 6-layer, 512-hidden, 8-heads, 54M parameters |
| FlauBERT | ``flaubert/flaubert_small_cased`` | | 6-layer, 512-hidden, 8-heads, 54M parameters |
| | | | FlauBERT small architecture |
| | | (see `details <https://github.com/getalp/Flaubert>`__) |
| +------------------------------------------------------------+---------------------------------------------------------------------------------------------------------------------------------------+
| | ``flaubert-base-uncased`` | | 12-layer, 768-hidden, 12-heads, 137M parameters |
| | ``flaubert/flaubert_base_uncased`` | | 12-layer, 768-hidden, 12-heads, 137M parameters |
| | | | FlauBERT base architecture with uncased vocabulary |
| | | (see `details <https://github.com/getalp/Flaubert>`__) |
| +------------------------------------------------------------+---------------------------------------------------------------------------------------------------------------------------------------+
| | ``flaubert-base-cased`` | | 12-layer, 768-hidden, 12-heads, 138M parameters |
| | ``flaubert/flaubert_base_cased`` | | 12-layer, 768-hidden, 12-heads, 138M parameters |
| | | | FlauBERT base architecture with cased vocabulary |
| | | (see `details <https://github.com/getalp/Flaubert>`__) |
| +------------------------------------------------------------+---------------------------------------------------------------------------------------------------------------------------------------+
| | ``flaubert-large-cased`` | | 24-layer, 1024-hidden, 16-heads, 373M parameters |
| | ``flaubert/flaubert_large_cased`` | | 24-layer, 1024-hidden, 16-heads, 373M parameters |
| | | | FlauBERT large architecture |
| | | (see `details <https://github.com/getalp/Flaubert>`__) |
+-------------------+------------------------------------------------------------+---------------------------------------------------------------------------------------------------------------------------------------+
| Bart | ``bart-large`` | | 24-layer, 1024-hidden, 16-heads, 406M parameters |
| Bart | ``facebook/bart-large`` | | 24-layer, 1024-hidden, 16-heads, 406M parameters |
| | | (see `details <https://github.com/pytorch/fairseq/tree/master/examples/bart>`_) |
| +------------------------------------------------------------+---------------------------------------------------------------------------------------------------------------------------------------+
| | ``bart-large-mnli`` | | Adds a 2 layer classification head with 1 million parameters |
| | ``facebook/bart-large-mnli`` | | Adds a 2 layer classification head with 1 million parameters |
| | | | bart-large base architecture with a classification head, finetuned on MNLI |
| +------------------------------------------------------------+---------------------------------------------------------------------------------------------------------------------------------------+
| | ``bart-large-cnn`` | | 12-layer, 1024-hidden, 16-heads, 406M parameters (same as base) |
| | ``facebook/bart-large-cnn`` | | 12-layer, 1024-hidden, 16-heads, 406M parameters (same as base) |
| | | | bart-large base architecture finetuned on cnn summarization task |
| +------------------------------------------------------------+---------------------------------------------------------------------------------------------------------------------------------------+
| | ``mbart-large-en-ro`` | | 12-layer, 1024-hidden, 16-heads, 880M parameters |
| | ``facebook/mbart-large-en-ro`` | | 12-layer, 1024-hidden, 16-heads, 880M parameters |
| | | | bart-large architecture pretrained on cc25 multilingual data , finetuned on WMT english romanian translation. |
+-------------------+------------------------------------------------------------+---------------------------------------------------------------------------------------------------------------------------------------+
| DialoGPT | ``DialoGPT-small`` | | 12-layer, 768-hidden, 12-heads, 124M parameters |
@@ -305,3 +305,9 @@ For a list that includes community-uploaded models, refer to `https://huggingfac
| MarianMT | ``Helsinki-NLP/opus-mt-{src}-{tgt}`` | | 12-layer, 512-hidden, 8-heads, ~74M parameter Machine translation models. Parameter counts vary depending on vocab size. |
| | | | (see `model list <https://huggingface.co/Helsinki-NLP>`_) |
+-------------------+------------------------------------------------------------+---------------------------------------------------------------------------------------------------------------------------------------+
| Longformer | ``allenai/longformer-base-4096`` | | 12-layer, 768-hidden, 12-heads, ~149M parameters |
| | | | Starting from RoBERTa-base checkpoint, trained on documents of max length 4,096 |
| +------------------------------------------------------------+---------------------------------------------------------------------------------------------------------------------------------------+
| | ``allenai/longformer-large-4096`` | | 24-layer, 1024-hidden, 16-heads, ~435M parameters |
| | | | Starting from RoBERTa-large checkpoint, trained on documents of max length 4,096 |
+-------------------+------------------------------------------------------------+---------------------------------------------------------------------------------------------------------------------------------------+

View File

@@ -1,6 +1,7 @@
# Examples
## Examples
Version 2.9 of `transformers` introduces a new `Trainer` class for PyTorch, and its equivalent `TFTrainer` for TF 2.
Version 2.9 of `transformers` introduces a new [`Trainer`](https://github.com/huggingface/transformers/blob/master/src/transformers/trainer.py) class for PyTorch, and its equivalent [`TFTrainer`](https://github.com/huggingface/transformers/blob/master/src/transformers/trainer_tf.py) for TF 2.
Running the examples requires PyTorch 1.3.1+ or TensorFlow 2.0+.
Here is the list of all our examples:
- **grouped by task** (all official examples work for multiple models)
@@ -12,32 +13,24 @@ Here is the list of all our examples:
This is still a work-in-progress in particular documentation is still sparse so please **contribute improvements/pull requests.**
## Tasks built on Trainer
# The Big Table of Tasks
| Task | Example datasets | Trainer support | TFTrainer support | pytorch-lightning | Colab | One-click Deploy to Azure (wip) |
|---|---|:---:|:---:|:---:|:---:|:---:|
| [`language-modeling`](./language-modeling) | Raw text | ✅ | - | - | - | - |
| [`text-classification`](./text-classification) | GLUE, XNLI | ✅ | ✅ | ✅ | [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/huggingface/blog/blob/master/notebooks/trainer/01_text_classification.ipynb) | [![Deploy to Azure](https://aka.ms/deploytoazurebutton)](https://portal.azure.com/#create/Microsoft.Template/uri/https%3A%2F%2Fraw.githubusercontent.com%2FAzure%2Fazure-quickstart-templates%2Fmaster%2F101-storage-account-create%2Fazuredeploy.json) |
| [`token-classification`](./token-classification) | CoNLL NER | ✅ | ✅ | ✅ | - | - |
| [`multiple-choice`](./multiple-choice) | SWAG, RACE, ARC | ✅ | ✅ | - | [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/ViktorAlm/notebooks/blob/master/MPC_GPU_Demo_for_TF_and_PT.ipynb) | - |
| [`question-answering`](./question-answering) | SQuAD | - | ✅ | - | - | - |
| Task | Example datasets | Trainer support | TFTrainer support | pytorch-lightning | Colab
|---|---|:---:|:---:|:---:|:---:|
| [**`language-modeling`**](https://github.com/huggingface/transformers/tree/master/examples/language-modeling) | Raw text | ✅ | - | - | [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/huggingface/blog/blob/master/notebooks/01_how_to_train.ipynb)
| [**`text-classification`**](https://github.com/huggingface/transformers/tree/master/examples/text-classification) | GLUE, XNLI | ✅ | ✅ | ✅ | [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/huggingface/blog/blob/master/notebooks/trainer/01_text_classification.ipynb)
| [**`token-classification`**](https://github.com/huggingface/transformers/tree/master/examples/token-classification) | CoNLL NER | ✅ | ✅ | ✅ | -
| [**`multiple-choice`**](https://github.com/huggingface/transformers/tree/master/examples/multiple-choice) | SWAG, RACE, ARC | ✅ | ✅ | - | [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/ViktorAlm/notebooks/blob/master/MPC_GPU_Demo_for_TF_and_PT.ipynb)
| [**`question-answering`**](https://github.com/huggingface/transformers/tree/master/examples/question-answering) | SQuAD | - | | - | -
| [**`text-generation`**](https://github.com/huggingface/transformers/tree/master/examples/text-generation) | - | - | - | - | [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/huggingface/blog/blob/master/notebooks/02_how_to_generate.ipynb)
| [**`distillation`**](https://github.com/huggingface/transformers/tree/master/examples/distillation) | All | - | - | - | -
| [**`summarization`**](https://github.com/huggingface/transformers/tree/master/examples/summarization) | CNN/Daily Mail | - | - | - | -
| [**`translation`**](https://github.com/huggingface/transformers/tree/master/examples/translation) | WMT | - | - | - | -
| [**`bertology`**](https://github.com/huggingface/transformers/tree/master/examples/bertology) | - | - | - | - | -
| [**`adversarial`**](https://github.com/huggingface/transformers/tree/master/examples/adversarial) | HANS | - | - | - | -
## Other examples and how-to's
| Section | Description |
|---|---|
| [TensorFlow 2.0 models on GLUE](./text-classification) | Examples running BERT TensorFlow 2.0 model on the GLUE tasks. |
| [Running on TPUs](#running-on-tpus) | Examples on running fine-tuning tasks on Google TPUs to accelerate workloads. |
| [Language Model training](./language-modeling) | Fine-tuning (or training from scratch) the library models for language modeling on a text dataset. Causal language modeling for GPT/GPT-2, masked language modeling for BERT/RoBERTa. |
| [Language Generation](./text-generation) | Conditional text generation using the auto-regressive models of the library: GPT, GPT-2, Transformer-XL and XLNet. |
| [GLUE](./text-classification) | Examples running BERT/XLM/XLNet/RoBERTa on the 9 GLUE tasks. Examples feature distributed training as well as half-precision. |
| [SQuAD](./question-answering) | Using BERT/RoBERTa/XLNet/XLM for question answering, examples with distributed training. |
| [Multiple Choice](./multiple-choice) | Examples running BERT/XLNet/RoBERTa on the SWAG/RACE/ARC tasks. |
| [Named Entity Recognition](./token-classification) | Using BERT for Named Entity Recognition (NER) on the CoNLL 2003 dataset, examples with distributed training. |
| [XNLI](./text-classification) | Examples running BERT/XLM on the XNLI benchmark. |
| [Adversarial evaluation of model performances](./adversarial) | Testing a model with adversarial evaluation of natural language inference on the Heuristic Analysis for NLI Systems (HANS) dataset (McCoy et al., 2019.) |
<br>
## Important note
@@ -52,6 +45,12 @@ pip install .
pip install -r ./examples/requirements.txt
```
## One-click Deploy to Cloud (wip)
#### Azure
[![Deploy to Azure](https://aka.ms/deploytoazurebutton)](https://portal.azure.com/#create/Microsoft.Template/uri/https%3A%2F%2Fraw.githubusercontent.com%2FAzure%2Fazure-quickstart-templates%2Fmaster%2F101-storage-account-create%2Fazuredeploy.json)
## Running on TPUs
When using Tensorflow, TPUs are supported out of the box as a `tf.distribute.Strategy`.
@@ -59,7 +58,7 @@ When using Tensorflow, TPUs are supported out of the box as a `tf.distribute.Str
When using PyTorch, we support TPUs thanks to `pytorch/xla`. For more context and information on how to setup your TPU environment refer to Google's documentation and to the
very detailed [pytorch/xla README](https://github.com/pytorch/xla/blob/master/README.md).
In this repo, we provide a very simple launcher script named [xla_spawn.py](./xla_spawn.py) that lets you run our example scripts on multiple TPU cores without any boilerplate.
In this repo, we provide a very simple launcher script named [xla_spawn.py](https://github.com/huggingface/transformers/tree/master/examples/xla_spawn.py) that lets you run our example scripts on multiple TPU cores without any boilerplate.
Just pass a `--num_cores` flag to this script, then your regular training script with its arguments (this is similar to the `torch.distributed.launch` helper for torch.distributed).
For example for `run_glue`:

View File

@@ -65,13 +65,6 @@ except ImportError:
logger = logging.getLogger(__name__)
ALL_MODELS = sum(
(
tuple(conf.pretrained_config_archive_map.keys())
for conf in (BertConfig, XLNetConfig, XLMConfig, RobertaConfig, DistilBertConfig)
),
(),
)
MODEL_CLASSES = {
"bert": (BertConfig, BertForSequenceClassification, BertTokenizer),
@@ -389,7 +382,7 @@ def main():
default=None,
type=str,
required=True,
help="Path to pre-trained model or shortcut name selected in the list: " + ", ".join(ALL_MODELS),
help="Path to pretrained model or model identifier from huggingface.co/models",
)
parser.add_argument(
"--task_name",

View File

@@ -0,0 +1,113 @@
import csv
from collections import defaultdict
from dataclasses import dataclass, field
from typing import Optional
import numpy as np
import matplotlib.pyplot as plt
from transformers import HfArgumentParser
@dataclass
class PlotArguments:
"""
Arguments pertaining to which model/config/tokenizer we are going to fine-tune, or train from scratch.
"""
csv_file: str = field(metadata={"help": "The csv file to plot."},)
plot_along_batch: bool = field(
default=False,
metadata={"help": "Whether to plot along batch size or sequence lengh. Defaults to sequence length."},
)
is_time: bool = field(
default=False,
metadata={"help": "Whether the csv file has time results or memory results. Defaults to memory results."},
)
is_train: bool = field(
default=False,
metadata={
"help": "Whether the csv file has training results or inference results. Defaults to inference results."
},
)
figure_png_file: Optional[str] = field(
default=None, metadata={"help": "Filename under which the plot will be saved. If unused no plot is saved."},
)
class Plot:
def __init__(self, args):
self.args = args
self.result_dict = defaultdict(lambda: dict(bsz=[], seq_len=[], result={}))
with open(self.args.csv_file, newline="") as csv_file:
reader = csv.DictReader(csv_file)
for row in reader:
model_name = row["model"]
self.result_dict[model_name]["bsz"].append(int(row["batch_size"]))
self.result_dict[model_name]["seq_len"].append(int(row["sequence_length"]))
self.result_dict[model_name]["result"][(int(row["batch_size"]), int(row["sequence_length"]))] = row[
"result"
]
def plot(self):
fig, ax = plt.subplots()
title_str = "Time usage" if self.args.is_time else "Memory usage"
title_str = title_str + " for training" if self.args.is_train else title_str + " for inference"
for model_name in self.result_dict.keys():
batch_sizes = sorted(list(set(self.result_dict[model_name]["bsz"])))
sequence_lengths = sorted(list(set(self.result_dict[model_name]["seq_len"])))
results = self.result_dict[model_name]["result"]
(x_axis_array, inner_loop_array) = (
(batch_sizes, sequence_lengths) if self.args.plot_along_batch else (sequence_lengths, batch_sizes)
)
plt.xlim(min(x_axis_array), max(x_axis_array))
for inner_loop_value in inner_loop_array:
if self.args.plot_along_batch:
y_axis_array = np.asarray([results[(x, inner_loop_value)] for x in x_axis_array], dtype=np.int)
else:
y_axis_array = np.asarray([results[(inner_loop_value, x)] for x in x_axis_array], dtype=np.float32)
ax.set_xscale("log", basex=2)
ax.set_yscale("log", basey=10)
(x_axis_label, inner_loop_label) = (
("batch_size", "sequence_length in #tokens")
if self.args.plot_along_batch
else ("sequence_length in #tokens", "batch_size")
)
x_axis_array = np.asarray(x_axis_array, np.int)
plt.scatter(x_axis_array, y_axis_array, label=f"{model_name} - {inner_loop_label}: {inner_loop_value}")
plt.plot(x_axis_array, y_axis_array, "--")
title_str += f" {model_name} vs."
title_str = title_str[:-4]
y_axis_label = "Time in s" if self.args.is_time else "Memory in MB"
# plot
plt.title(title_str)
plt.xlabel(x_axis_label)
plt.ylabel(y_axis_label)
plt.legend()
if self.args.figure_png_file is not None:
plt.savefig(self.args.figure_png_file)
else:
plt.show()
def main():
parser = HfArgumentParser(PlotArguments)
plot_args = parser.parse_args_into_dataclasses()[0]
plot = Plot(args=plot_args)
plot.plot()
if __name__ == "__main__":
main()

View File

@@ -0,0 +1,29 @@
# coding=utf-8
# Copyright 2018 The HuggingFace Inc. team.
# Copyright (c) 2018, NVIDIA CORPORATION. All rights reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
""" Benchmarking the library on inference and training """
from transformers import HfArgumentParser, PyTorchBenchmark, PyTorchBenchmarkArguments
def main():
parser = HfArgumentParser(PyTorchBenchmarkArguments)
benchmark_args = parser.parse_args_into_dataclasses()[0]
benchmark = PyTorchBenchmark(args=benchmark_args)
benchmark.run()
if __name__ == "__main__":
main()

View File

@@ -1,710 +0,0 @@
# coding=utf-8
# Copyright 2018 The HuggingFace Inc. team.
# Copyright (c) 2018, NVIDIA CORPORATION. All rights reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
""" Benchmarking the library on inference and training """
# If checking the tensors placement
# tf.debugging.set_log_device_placement(True)
import argparse
import csv
import logging
import timeit
from time import time
from typing import Callable, List
from transformers import (
AutoConfig,
AutoTokenizer,
MemorySummary,
is_tf_available,
is_torch_available,
start_memory_tracing,
stop_memory_tracing,
)
if is_tf_available():
import tensorflow as tf
from transformers import TFAutoModel
if is_torch_available():
import torch
from transformers import AutoModel
input_text = """Bent over their instruments, three hundred Fertilizers were plunged, as
the Director of Hatcheries and Conditioning entered the room, in the
scarcely breathing silence, the absent-minded, soliloquizing hum or
whistle, of absorbed concentration. A troop of newly arrived students,
very young, pink and callow, followed nervously, rather abjectly, at the
Director's heels. Each of them carried a notebook, in which, whenever
the great man spoke, he desperately scribbled. Straight from the
horse's mouth. It was a rare privilege. The D. H. C. for Central London
always made a point of personally conducting his new students round
the various departments.
"Just to give you a general idea," he would explain to them. For of
course some sort of general idea they must have, if they were to do
their work intelligently-though as little of one, if they were to be good
and happy members of society, as possible. For particulars, as every
one knows, make for virtue and happiness; generalities are intellectu-
ally necessary evils. Not philosophers but fret-sawyers and stamp col-
lectors compose the backbone of society.
"To-morrow," he would add, smiling at them with a slightly menacing
geniality, "you'll be settling down to serious work. You won't have time
for generalities. Meanwhile ..."
Meanwhile, it was a privilege. Straight from the horse's mouth into the
notebook. The boys scribbled like mad.
Tall and rather thin but upright, the Director advanced into the room.
He had a long chin and big rather prominent teeth, just covered, when
he was not talking, by his full, floridly curved lips. Old, young? Thirty?
Fifty? Fifty-five? It was hard to say. And anyhow the question didn't
arise; in this year of stability, A. F. 632, it didn't occur to you to ask it.
"I shall begin at the beginning," said the D.H.C. and the more zealous
students recorded his intention in their notebooks: Begin at the begin-
ning. "These," he waved his hand, "are the incubators." And opening
an insulated door he showed them racks upon racks of numbered test-
tubes. "The week's supply of ova. Kept," he explained, "at blood heat;
whereas the male gametes," and here he opened another door, "they
have to be kept at thirty-five instead of thirty-seven. Full blood heat
sterilizes." Rams wrapped in theremogene beget no lambs.
Still leaning against the incubators he gave them, while the pencils
scurried illegibly across the pages, a brief description of the modern
fertilizing process; spoke first, of course, of its surgical introduc-
tion-"the operation undergone voluntarily for the good of Society, not
to mention the fact that it carries a bonus amounting to six months'
salary"; continued with some account of the technique for preserving
the excised ovary alive and actively developing; passed on to a consid-
eration of optimum temperature, salinity, viscosity; referred to the liq-
uor in which the detached and ripened eggs were kept; and, leading
his charges to the work tables, actually showed them how this liquor
was drawn off from the test-tubes; how it was let out drop by drop
onto the specially warmed slides of the microscopes; how the eggs
which it contained were inspected for abnormalities, counted and
transferred to a porous receptacle; how (and he now took them to
watch the operation) this receptacle was immersed in a warm bouillon
containing free-swimming spermatozoa-at a minimum concentration
of one hundred thousand per cubic centimetre, he insisted; and how,
after ten minutes, the container was lifted out of the liquor and its
contents re-examined; how, if any of the eggs remained unfertilized, it
was again immersed, and, if necessary, yet again; how the fertilized
ova went back to the incubators; where the Alphas and Betas re-
mained until definitely bottled; while the Gammas, Deltas and Epsilons
were brought out again, after only thirty-six hours, to undergo Bo-
kanovsky's Process.
"Bokanovsky's Process," repeated the Director, and the students un-
derlined the words in their little notebooks.
One egg, one embryo, one adult-normality. But a bokanovskified egg
will bud, will proliferate, will divide. From eight to ninety-six buds, and
every bud will grow into a perfectly formed embryo, and every embryo
into a full-sized adult. Making ninety-six human beings grow where
only one grew before. Progress.
"Essentially," the D.H.C. concluded, "bokanovskification consists of a
series of arrests of development. We check the normal growth and,
paradoxically enough, the egg responds by budding."
Responds by budding. The pencils were busy.
He pointed. On a very slowly moving band a rack-full of test-tubes was
entering a large metal box, another, rack-full was emerging. Machinery
faintly purred. It took eight minutes for the tubes to go through, he
told them. Eight minutes of hard X-rays being about as much as an
egg can stand. A few died; of the rest, the least susceptible divided
into two; most put out four buds; some eight; all were returned to the
incubators, where the buds began to develop; then, after two days,
were suddenly chilled, chilled and checked. Two, four, eight, the buds
in their turn budded; and having budded were dosed almost to death
with alcohol; consequently burgeoned again and having budded-bud
out of bud out of bud-were thereafter-further arrest being generally
fatal-left to develop in peace. By which time the original egg was in a
fair way to becoming anything from eight to ninety-six embryos- a
prodigious improvement, you will agree, on nature. Identical twins-but
not in piddling twos and threes as in the old viviparous days, when an
egg would sometimes accidentally divide; actually by dozens, by
scores at a time.
"Scores," the Director repeated and flung out his arms, as though he
were distributing largesse. "Scores."
But one of the students was fool enough to ask where the advantage
lay.
"My good boy!" The Director wheeled sharply round on him. "Can't you
see? Can't you see?" He raised a hand; his expression was solemn.
"Bokanovsky's Process is one of the major instruments of social stabil-
ity!"
Major instruments of social stability.
Standard men and women; in uniform batches. The whole of a small
factory staffed with the products of a single bokanovskified egg.
"Ninety-six identical twins working ninety-six identical machines!" The
voice was almost tremulous with enthusiasm. "You really know where
you are. For the first time in history." He quoted the planetary motto.
"Community, Identity, Stability." Grand words. "If we could bo-
kanovskify indefinitely the whole problem would be solved."
Solved by standard Gammas, unvarying Deltas, uniform Epsilons. Mil-
lions of identical twins. The principle of mass production at last applied
to biology.
"But, alas," the Director shook his head, "we can't bokanovskify indefi-
nitely."
Ninety-six seemed to be the limit; seventy-two a good average. From
the same ovary and with gametes of the same male to manufacture as
many batches of identical twins as possible-that was the best (sadly a
second best) that they could do. And even that was difficult.
"For in nature it takes thirty years for two hundred eggs to reach ma-
turity. But our business is to stabilize the population at this moment,
here and now. Dribbling out twins over a quarter of a century-what
would be the use of that?"
Obviously, no use at all. But Podsnap's Technique had immensely ac-
celerated the process of ripening. They could make sure of at least a
hundred and fifty mature eggs within two years. Fertilize and bo-
kanovskify-in other words, multiply by seventy-two-and you get an
average of nearly eleven thousand brothers and sisters in a hundred
and fifty batches of identical twins, all within two years of the same
age.
"And in exceptional cases we can make one ovary yield us over fifteen
thousand adult individuals."
Beckoning to a fair-haired, ruddy young man who happened to be
passing at the moment. "Mr. Foster," he called. The ruddy young man
approached. "Can you tell us the record for a single ovary, Mr. Foster?"
"Sixteen thousand and twelve in this Centre," Mr. Foster replied with-
out hesitation. He spoke very quickly, had a vivacious blue eye, and
took an evident pleasure in quoting figures. "Sixteen thousand and
twelve; in one hundred and eighty-nine batches of identicals. But of
course they've done much better," he rattled on, "in some of the tropi-
cal Centres. Singapore has often produced over sixteen thousand five
hundred; and Mombasa has actually touched the seventeen thousand
mark. But then they have unfair advantages. You should see the way a
negro ovary responds to pituitary! It's quite astonishing, when you're
used to working with European material. Still," he added, with a laugh
(but the light of combat was in his eyes and the lift of his chin was
challenging), "still, we mean to beat them if we can. I'm working on a
wonderful Delta-Minus ovary at this moment. Only just eighteen
months old. Over twelve thousand seven hundred children already, ei-
ther decanted or in embryo. And still going strong. We'll beat them
yet."
"That's the spirit I like!" cried the Director, and clapped Mr. Foster on
the shoulder. "Come along with us, and give these boys the benefit of
your expert knowledge."
Mr. Foster smiled modestly. "With pleasure." They went.
In the Bottling Room all was harmonious bustle and ordered activity.
Flaps of fresh sow's peritoneum ready cut to the proper size came
shooting up in little lifts from the Organ Store in the sub-basement.
Whizz and then, click! the lift-hatches hew open; the bottle-liner had
only to reach out a hand, take the flap, insert, smooth-down, and be-
fore the lined bottle had had time to travel out of reach along the end-
less band, whizz, click! another flap of peritoneum had shot up from
the depths, ready to be slipped into yet another bottle, the next of that
slow interminable procession on the band.
Next to the Liners stood the Matriculators. The procession advanced;
one by one the eggs were transferred from their test-tubes to the
larger containers; deftly the peritoneal lining was slit, the morula
dropped into place, the saline solution poured in ... and already the
bottle had passed, and it was the turn of the labellers. Heredity, date
of fertilization, membership of Bokanovsky Group-details were trans-
ferred from test-tube to bottle. No longer anonymous, but named,
identified, the procession marched slowly on; on through an opening in
the wall, slowly on into the Social Predestination Room.
"Eighty-eight cubic metres of card-index," said Mr. Foster with relish,
as they entered."""
def create_setup_and_compute(
model_names: List[str],
batch_sizes: List[int],
slice_sizes: List[int],
gpu: bool = True,
tensorflow: bool = False,
average_over: int = 3,
no_speed: bool = False,
no_memory: bool = False,
verbose: bool = False,
torchscript: bool = False,
xla: bool = False,
amp: bool = False,
fp16: bool = False,
save_to_csv: bool = False,
csv_time_filename: str = f"time_{round(time())}.csv",
csv_memory_filename: str = f"memory_{round(time())}.csv",
print_fn: Callable[[str], None] = print,
):
if xla:
tf.config.optimizer.set_jit(True)
if amp:
tf.config.optimizer.set_experimental_options({"auto_mixed_precision": True})
if tensorflow:
dictionary = {model_name: {} for model_name in model_names}
results = _compute_tensorflow(
model_names,
batch_sizes,
slice_sizes,
dictionary,
average_over,
amp,
no_speed,
no_memory,
verbose,
print_fn,
)
else:
device = "cuda" if (gpu and torch.cuda.is_available()) else "cpu"
dictionary = {model_name: {} for model_name in model_names}
results = _compute_pytorch(
model_names,
batch_sizes,
slice_sizes,
dictionary,
average_over,
device,
torchscript,
fp16,
no_speed,
no_memory,
verbose,
print_fn,
)
print_fn("=========== RESULTS ===========")
for model_name in model_names:
print_fn("\t" + f"======= MODEL CHECKPOINT: {model_name} =======")
for batch_size in results[model_name]["bs"]:
print_fn("\t\t" + f"===== BATCH SIZE: {batch_size} =====")
for slice_size in results[model_name]["ss"]:
time = results[model_name]["time"][batch_size][slice_size]
memory = results[model_name]["memory"][batch_size][slice_size]
if isinstance(time, str):
print_fn(f"\t\t{model_name}/{batch_size}/{slice_size}: " f"{time} " f"{memory}")
else:
print_fn(
f"\t\t{model_name}/{batch_size}/{slice_size}: "
f"{(round(1000 * time) / 1000)}"
f"s "
f"{memory}"
)
if save_to_csv:
with open(csv_time_filename, mode="w") as csv_time_file, open(
csv_memory_filename, mode="w"
) as csv_memory_file:
assert len(model_names) > 0, "At least 1 model should be defined, but got {}".format(model_names)
fieldnames = ["model", "batch_size", "sequence_length"]
time_writer = csv.DictWriter(csv_time_file, fieldnames=fieldnames + ["time_in_s"])
time_writer.writeheader()
memory_writer = csv.DictWriter(csv_memory_file, fieldnames=fieldnames + ["memory"])
memory_writer.writeheader()
for model_name in model_names:
time_dict = results[model_name]["time"]
memory_dict = results[model_name]["memory"]
for bs in time_dict:
for ss in time_dict[bs]:
time_writer.writerow(
{
"model": model_name,
"batch_size": bs,
"sequence_length": ss,
"time_in_s": "{:.4f}".format(time_dict[bs][ss]),
}
)
for bs in memory_dict:
for ss in time_dict[bs]:
memory_writer.writerow(
{
"model": model_name,
"batch_size": bs,
"sequence_length": ss,
"memory": memory_dict[bs][ss],
}
)
def print_summary_statistics(summary: MemorySummary, print_fn: Callable[[str], None]):
print_fn(
"\nLines by line memory consumption:\n"
+ "\n".join(
f"{state.frame.filename}:{state.frame.line_number}: mem {state.cpu_gpu}: {state.frame.line_text}"
for state in summary.sequential
)
)
print_fn(
"\nLines with top memory consumption:\n"
+ "\n".join(
f"=> {state.frame.filename}:{state.frame.line_number}: mem {state.cpu_gpu}: {state.frame.line_text}"
for state in summary.cumulative[:6]
)
)
print_fn(
"\nLines with lowest memory consumption:\n"
+ "\n".join(
f"=> {state.frame.filename}:{state.frame.line_number}: mem {state.cpu_gpu}: {state.frame.line_text}"
for state in summary.cumulative[-6:]
)
)
print_fn(f"\nTotal memory increase: {summary.total}")
def get_print_function(save_print_log, log_filename):
if save_print_log:
logging.basicConfig(
level=logging.DEBUG,
filename=log_filename,
filemode="a+",
format="%(asctime)-15s %(levelname)-8s %(message)s",
)
def print_with_print_log(*args):
logging.info(*args)
print(*args)
return print_with_print_log
else:
return print
def _compute_pytorch(
model_names,
batch_sizes,
slice_sizes,
dictionary,
average_over,
device,
torchscript,
fp16,
no_speed,
no_memory,
verbose,
print_fn,
):
for c, model_name in enumerate(model_names):
print_fn(f"{c + 1} / {len(model_names)}")
config = AutoConfig.from_pretrained(model_name, torchscript=torchscript)
model = AutoModel.from_pretrained(model_name, config=config)
tokenizer = AutoTokenizer.from_pretrained(model_name)
tokenized_sequence = tokenizer.encode(input_text, add_special_tokens=False)
max_input_size = tokenizer.max_model_input_sizes[model_name]
dictionary[model_name] = {"bs": batch_sizes, "ss": slice_sizes, "time": {}, "memory": {}}
dictionary[model_name]["time"] = {i: {} for i in batch_sizes}
dictionary[model_name]["memory"] = {i: {} for i in batch_sizes}
print_fn("Using model {}".format(model))
print_fn("Number of all parameters {}".format(model.num_parameters()))
for batch_size in batch_sizes:
if fp16:
model.half()
model.to(device)
model.eval()
for slice_size in slice_sizes:
if max_input_size is not None and slice_size > max_input_size:
dictionary[model_name]["time"][batch_size][slice_size] = "N/A"
else:
sequence = torch.tensor(tokenized_sequence[:slice_size], device=device).repeat(batch_size, 1)
try:
if torchscript:
print_fn("Tracing model with sequence size {}".format(sequence.shape))
inference = torch.jit.trace(model, sequence)
inference(sequence)
else:
inference = model
inference(sequence)
if not no_memory:
# model.add_memory_hooks() # Forward method tracing (only for PyTorch models)
# Line by line memory tracing (all code in the module `transformers`) works for all models/arbitrary code
trace = start_memory_tracing("transformers")
inference(sequence)
summary = stop_memory_tracing(trace)
if verbose:
print_summary_statistics(summary, print_fn)
dictionary[model_name]["memory"][batch_size][slice_size] = str(summary.total)
else:
dictionary[model_name]["memory"][batch_size][slice_size] = "N/A"
if not no_speed:
print_fn("Going through model with sequence of shape".format(sequence.shape))
runtimes = timeit.repeat(lambda: inference(sequence), repeat=average_over, number=3)
average_time = sum(runtimes) / float(len(runtimes)) / 3.0
dictionary[model_name]["time"][batch_size][slice_size] = average_time
else:
dictionary[model_name]["time"][batch_size][slice_size] = "N/A"
except RuntimeError as e:
print_fn("Doesn't fit on GPU. {}".format(e))
torch.cuda.empty_cache()
dictionary[model_name]["time"][batch_size][slice_size] = "N/A"
dictionary[model_name]["memory"][batch_size][slice_size] = "N/A"
return dictionary
def _compute_tensorflow(
model_names, batch_sizes, slice_sizes, dictionary, average_over, amp, no_speed, no_memory, verbose, print_fn
):
for c, model_name in enumerate(model_names):
print_fn(f"{c + 1} / {len(model_names)}")
config = AutoConfig.from_pretrained(model_name)
model = TFAutoModel.from_pretrained(model_name, config=config)
tokenizer = AutoTokenizer.from_pretrained(model_name)
tokenized_sequence = tokenizer.encode(input_text, add_special_tokens=False)
max_input_size = tokenizer.max_model_input_sizes[model_name]
dictionary[model_name] = {"bs": batch_sizes, "ss": slice_sizes, "time": {}, "memory": {}}
dictionary[model_name]["time"] = {i: {} for i in batch_sizes}
dictionary[model_name]["memory"] = {i: {} for i in batch_sizes}
print_fn("Using model {}".format(model))
print_fn("Number of all parameters {}".format(model.num_parameters()))
@tf.function
def inference(inputs):
return model(inputs)
for batch_size in batch_sizes:
for slice_size in slice_sizes:
if max_input_size is not None and slice_size > max_input_size:
dictionary[model_name]["time"][batch_size][slice_size] = "N/A"
else:
sequence = tf.stack(
[tf.squeeze(tf.constant(tokenized_sequence[:slice_size])[None, :])] * batch_size
)
try:
print_fn("Going through model with sequence of shape {}".format(sequence.shape))
# To make sure that the model is traced + that the tensors are on the appropriate device
inference(sequence)
if not no_memory:
# Line by line memory tracing (all code in the module `transformers`) works for all models/arbitrary code
trace = start_memory_tracing("transformers")
inference(sequence)
summary = stop_memory_tracing(trace)
if verbose:
print_summary_statistics(summary, print_fn)
dictionary[model_name]["memory"][batch_size][slice_size] = str(summary.total)
else:
dictionary[model_name]["memory"][batch_size][slice_size] = "N/A"
if not no_speed:
runtimes = timeit.repeat(lambda: inference(sequence), repeat=average_over, number=3)
average_time = sum(runtimes) / float(len(runtimes)) / 3.0
dictionary[model_name]["time"][batch_size][slice_size] = average_time
else:
dictionary[model_name]["time"][batch_size][slice_size] = "N/A"
except tf.errors.ResourceExhaustedError as e:
print_fn("Doesn't fit on GPU. {}".format(e))
dictionary[model_name]["time"][batch_size][slice_size] = "N/A"
dictionary[model_name]["memory"][batch_size][slice_size] = "N/A"
return dictionary
def main():
parser = argparse.ArgumentParser()
parser.add_argument(
"--models",
required=False,
type=str,
default="all",
help="Model checkpoints to be provided "
"to the AutoModel classes. Leave "
"blank to benchmark the base version "
"of all available model "
"architectures.",
)
parser.add_argument("--verbose", required=False, action="store_true", help="Verbose memory tracing")
parser.add_argument("--no_speed", required=False, action="store_true", help="Don't perform speed measurments")
parser.add_argument("--no_memory", required=False, action="store_true", help="Don't perform memory measurments")
parser.add_argument(
"--torch", required=False, action="store_true", help="Benchmark the Pytorch version of the " "models"
)
parser.add_argument(
"--torch_cuda", required=False, action="store_true", help="Pytorch only: run on available " "cuda devices"
)
parser.add_argument(
"--torchscript",
required=False,
action="store_true",
help="Pytorch only: trace the models " "using torchscript",
)
parser.add_argument(
"--tensorflow",
required=False,
action="store_true",
help="Benchmark the TensorFlow version "
"of the models. Will run on GPU if "
"the correct dependencies are "
"installed",
)
parser.add_argument("--xla", required=False, action="store_true", help="TensorFlow only: use XLA acceleration.")
parser.add_argument(
"--amp",
required=False,
action="store_true",
help="TensorFlow only: use automatic mixed precision acceleration.",
)
parser.add_argument(
"--fp16", required=False, action="store_true", help="PyTorch only: use FP16 to accelerate inference."
)
parser.add_argument(
"--keras_predict",
required=False,
action="store_true",
help="Whether to use model.predict " "instead of model() to do a " "forward pass.",
)
parser.add_argument("--save_to_csv", required=False, action="store_true", help="Save to a CSV file.")
parser.add_argument(
"--log_print", required=False, action="store_true", help="Save all print statements in log file."
)
parser.add_argument(
"--csv_time_filename",
required=False,
default=f"time_{round(time())}.csv",
help="CSV filename used if saving time results to csv.",
)
parser.add_argument(
"--csv_memory_filename",
required=False,
default=f"memory_{round(time())}.csv",
help="CSV filename used if saving memory results to csv.",
)
parser.add_argument(
"--log_filename",
required=False,
default=f"log_{round(time())}.txt",
help="Log filename used if print statements are saved in log.",
)
parser.add_argument(
"--average_over", required=False, default=30, type=int, help="Times an experiment will be run."
)
parser.add_argument("--batch_sizes", nargs="+", type=int, default=[1, 2, 4, 8])
parser.add_argument("--slice_sizes", nargs="+", type=int, default=[8, 64, 128, 256, 512, 1024])
args = parser.parse_args()
if args.models == "all":
args.models = [
"gpt2",
"bert-base-cased",
"xlnet-base-cased",
"xlm-mlm-en-2048",
"transfo-xl-wt103",
"openai-gpt",
"distilbert-base-uncased",
"distilgpt2",
"roberta-base",
"ctrl",
"t5-base",
"bart-large",
]
else:
args.models = args.models.split()
print_fn = get_print_function(args.log_print, args.log_filename)
print_fn("Running with arguments: {}".format(args))
if args.torch:
if is_torch_available():
create_setup_and_compute(
model_names=args.models,
batch_sizes=args.batch_sizes,
slice_sizes=args.slice_sizes,
tensorflow=False,
gpu=args.torch_cuda,
torchscript=args.torchscript,
fp16=args.fp16,
save_to_csv=args.save_to_csv,
csv_time_filename=args.csv_time_filename,
csv_memory_filename=args.csv_memory_filename,
average_over=args.average_over,
no_speed=args.no_speed,
no_memory=args.no_memory,
verbose=args.verbose,
print_fn=print_fn,
)
else:
raise ImportError("Trying to run a PyTorch benchmark but PyTorch was not found in the environment.")
if args.tensorflow:
if is_tf_available():
create_setup_and_compute(
model_names=args.models,
batch_sizes=args.batch_sizes,
slice_sizes=args.slice_sizes,
tensorflow=True,
xla=args.xla,
amp=args.amp,
save_to_csv=args.save_to_csv,
csv_time_filename=args.csv_time_filename,
csv_memory_filename=args.csv_memory_filename,
average_over=args.average_over,
no_speed=args.no_speed,
no_memory=args.no_memory,
verbose=args.verbose,
print_fn=print_fn,
)
else:
raise ImportError("Trying to run a TensorFlow benchmark but TensorFlow was not found in the environment.")
if __name__ == "__main__":
main()

View File

@@ -64,7 +64,7 @@ def print_2d_tensor(tensor):
def compute_heads_importance(
args, model, eval_dataloader, compute_entropy=True, compute_importance=True, head_mask=None
args, model, eval_dataloader, compute_entropy=True, compute_importance=True, head_mask=None, actually_pruned=False
):
""" This method shows how to compute:
- head attention entropy
@@ -77,7 +77,12 @@ def compute_heads_importance(
if head_mask is None:
head_mask = torch.ones(n_layers, n_heads).to(args.device)
head_mask.requires_grad_(requires_grad=True)
# If actually pruned attention multi-head, set head mask to None to avoid shape mismatch
if actually_pruned:
head_mask = None
preds = None
labels = None
tot_tokens = 0.0
@@ -172,6 +177,7 @@ def mask_heads(args, model, eval_dataloader):
new_head_mask = new_head_mask.view(-1)
new_head_mask[current_heads_to_mask] = 0.0
new_head_mask = new_head_mask.view_as(head_mask)
new_head_mask = new_head_mask.clone().detach()
print_2d_tensor(new_head_mask)
# Compute metric and head importance again
@@ -181,7 +187,7 @@ def mask_heads(args, model, eval_dataloader):
preds = np.argmax(preds, axis=1) if args.output_mode == "classification" else np.squeeze(preds)
current_score = glue_compute_metrics(args.task_name, preds, labels)[args.metric_name]
logger.info(
"Masking: current score: %f, remaning heads %d (%.1f percents)",
"Masking: current score: %f, remaining heads %d (%.1f percents)",
current_score,
new_head_mask.sum(),
new_head_mask.sum() / new_head_mask.numel() * 100,
@@ -209,14 +215,23 @@ def prune_heads(args, model, eval_dataloader, head_mask):
original_time = datetime.now() - before_time
original_num_params = sum(p.numel() for p in model.parameters())
heads_to_prune = dict((layer, (1 - head_mask[layer].long()).nonzero().tolist()) for layer in range(len(head_mask)))
heads_to_prune = dict(
(layer, (1 - head_mask[layer].long()).nonzero().squeeze().tolist()) for layer in range(len(head_mask))
)
assert sum(len(h) for h in heads_to_prune.values()) == (1 - head_mask.long()).sum().item()
model.prune_heads(heads_to_prune)
pruned_num_params = sum(p.numel() for p in model.parameters())
before_time = datetime.now()
_, _, preds, labels = compute_heads_importance(
args, model, eval_dataloader, compute_entropy=False, compute_importance=False, head_mask=None
args,
model,
eval_dataloader,
compute_entropy=False,
compute_importance=False,
head_mask=None,
actually_pruned=True,
)
preds = np.argmax(preds, axis=1) if args.output_mode == "classification" else np.squeeze(preds)
score_pruning = glue_compute_metrics(args.task_name, preds, labels)[args.metric_name]
@@ -404,7 +419,7 @@ def main():
logger.info("Training/evaluation parameters %s", args)
# Prepare dataset for the GLUE task
eval_dataset = GlueDataset(args, tokenizer=tokenizer, evaluate=True)
eval_dataset = GlueDataset(args, tokenizer=tokenizer, mode="dev")
if args.data_subset > 0:
eval_dataset = Subset(eval_dataset, list(range(min(args.data_subset, len(eval_dataset)))))
eval_sampler = SequentialSampler(eval_dataset) if args.local_rank == -1 else DistributedSampler(eval_dataset)

View File

@@ -34,26 +34,11 @@ from tqdm import tqdm, trange
from transformers import (
WEIGHTS_NAME,
AdamW,
AlbertConfig,
AlbertModel,
AlbertTokenizer,
BertConfig,
BertModel,
BertTokenizer,
DistilBertConfig,
DistilBertModel,
DistilBertTokenizer,
AutoConfig,
AutoModel,
AutoTokenizer,
MMBTConfig,
MMBTForClassification,
RobertaConfig,
RobertaModel,
RobertaTokenizer,
XLMConfig,
XLMModel,
XLMTokenizer,
XLNetConfig,
XLNetModel,
XLNetTokenizer,
get_linear_schedule_with_warmup,
)
from utils_mmimdb import ImageEncoder, JsonlDataset, collate_fn, get_image_transforms, get_mmimdb_labels
@@ -67,23 +52,6 @@ except ImportError:
logger = logging.getLogger(__name__)
ALL_MODELS = sum(
(
tuple(conf.pretrained_config_archive_map.keys())
for conf in (BertConfig, XLNetConfig, XLMConfig, RobertaConfig, DistilBertConfig)
),
(),
)
MODEL_CLASSES = {
"bert": (BertConfig, BertModel, BertTokenizer),
"xlnet": (XLNetConfig, XLNetModel, XLNetTokenizer),
"xlm": (XLMConfig, XLMModel, XLMTokenizer),
"roberta": (RobertaConfig, RobertaModel, RobertaTokenizer),
"distilbert": (DistilBertConfig, DistilBertModel, DistilBertTokenizer),
"albert": (AlbertConfig, AlbertModel, AlbertTokenizer),
}
def set_seed(args):
random.seed(args.seed)
@@ -351,19 +319,12 @@ def main():
required=True,
help="The input data dir. Should contain the .jsonl files for MMIMDB.",
)
parser.add_argument(
"--model_type",
default=None,
type=str,
required=True,
help="Model type selected in the list: " + ", ".join(MODEL_CLASSES.keys()),
)
parser.add_argument(
"--model_name_or_path",
default=None,
type=str,
required=True,
help="Path to pre-trained model or shortcut name selected in the list: " + ", ".join(ALL_MODELS),
help="Path to pretrained model or model identifier from huggingface.co/models",
)
parser.add_argument(
"--output_dir",
@@ -385,7 +346,7 @@ def main():
)
parser.add_argument(
"--cache_dir",
default="",
default=None,
type=str,
help="Where do you want to store the pre-trained models downloaded from s3",
)
@@ -526,18 +487,14 @@ def main():
# Setup model
labels = get_mmimdb_labels()
num_labels = len(labels)
args.model_type = args.model_type.lower()
config_class, model_class, tokenizer_class = MODEL_CLASSES[args.model_type]
transformer_config = config_class.from_pretrained(
args.config_name if args.config_name else args.model_name_or_path
)
tokenizer = tokenizer_class.from_pretrained(
transformer_config = AutoConfig.from_pretrained(args.config_name if args.config_name else args.model_name_or_path)
tokenizer = AutoTokenizer.from_pretrained(
args.tokenizer_name if args.tokenizer_name else args.model_name_or_path,
do_lower_case=args.do_lower_case,
cache_dir=args.cache_dir if args.cache_dir else None,
cache_dir=args.cache_dir,
)
transformer = model_class.from_pretrained(
args.model_name_or_path, config=transformer_config, cache_dir=args.cache_dir if args.cache_dir else None
transformer = AutoModel.from_pretrained(
args.model_name_or_path, config=transformer_config, cache_dir=args.cache_dir
)
img_encoder = ImageEncoder(args)
config = MMBTConfig(transformer_config, num_labels=num_labels)
@@ -583,13 +540,12 @@ def main():
# Load a trained model and vocabulary that you have fine-tuned
model = MMBTForClassification(config, transformer, img_encoder)
model.load_state_dict(torch.load(os.path.join(args.output_dir, WEIGHTS_NAME)))
tokenizer = tokenizer_class.from_pretrained(args.output_dir)
tokenizer = AutoTokenizer.from_pretrained(args.output_dir)
model.to(args.device)
# Evaluation
results = {}
if args.do_eval and args.local_rank in [-1, 0]:
tokenizer = tokenizer_class.from_pretrained(args.output_dir, do_lower_case=args.do_lower_case)
checkpoints = [args.output_dir]
if args.eval_all_checkpoints:
checkpoints = list(

View File

@@ -31,14 +31,8 @@ from torch.utils.data import DataLoader, RandomSampler, SequentialSampler, Tenso
from torch.utils.data.distributed import DistributedSampler
from tqdm import tqdm, trange
from transformers import (
WEIGHTS_NAME,
AdamW,
BertConfig,
BertForMultipleChoice,
BertTokenizer,
get_linear_schedule_with_warmup,
)
from transformers import WEIGHTS_NAME, AdamW, AutoConfig, AutoTokenizer, get_linear_schedule_with_warmup
from transformers.modeling_auto import AutoModelForMultipleChoice
try:
@@ -49,12 +43,6 @@ except ImportError:
logger = logging.getLogger(__name__)
ALL_MODELS = sum((tuple(conf.pretrained_config_archive_map.keys()) for conf in [BertConfig]), ())
MODEL_CLASSES = {
"bert": (BertConfig, BertForMultipleChoice, BertTokenizer),
}
class SwagExample(object):
"""A single training/test example for the SWAG dataset."""
@@ -492,19 +480,12 @@ def main():
required=True,
help="SWAG csv for predictions. E.g., val.csv or test.csv",
)
parser.add_argument(
"--model_type",
default=None,
type=str,
required=True,
help="Model type selected in the list: " + ", ".join(MODEL_CLASSES.keys()),
)
parser.add_argument(
"--model_name_or_path",
default=None,
type=str,
required=True,
help="Path to pre-trained model or shortcut name selected in the list: " + ", ".join(ALL_MODELS),
help="Path to pretrained model or model identifier from huggingface.co/models",
)
parser.add_argument(
"--output_dir",
@@ -536,9 +517,6 @@ def main():
parser.add_argument(
"--evaluate_during_training", action="store_true", help="Rul evaluation during training at each logging step."
)
parser.add_argument(
"--do_lower_case", action="store_true", help="Set this flag if you are using an uncased model."
)
parser.add_argument("--per_gpu_train_batch_size", default=8, type=int, help="Batch size per GPU/CPU for training.")
parser.add_argument(
@@ -652,13 +630,9 @@ def main():
if args.local_rank not in [-1, 0]:
torch.distributed.barrier() # Make sure only the first process in distributed training will download model & vocab
args.model_type = args.model_type.lower()
config_class, model_class, tokenizer_class = MODEL_CLASSES[args.model_type]
config = config_class.from_pretrained(args.config_name if args.config_name else args.model_name_or_path)
tokenizer = tokenizer_class.from_pretrained(
args.tokenizer_name if args.tokenizer_name else args.model_name_or_path, do_lower_case=args.do_lower_case
)
model = model_class.from_pretrained(
config = AutoConfig.from_pretrained(args.config_name if args.config_name else args.model_name_or_path)
tokenizer = AutoTokenizer.from_pretrained(args.tokenizer_name if args.tokenizer_name else args.model_name_or_path,)
model = AutoModelForMultipleChoice.from_pretrained(
args.model_name_or_path, from_tf=bool(".ckpt" in args.model_name_or_path), config=config
)
@@ -694,8 +668,8 @@ def main():
torch.save(args, os.path.join(args.output_dir, "training_args.bin"))
# Load a trained model and vocabulary that you have fine-tuned
model = model_class.from_pretrained(args.output_dir)
tokenizer = tokenizer_class.from_pretrained(args.output_dir)
model = AutoModelForMultipleChoice.from_pretrained(args.output_dir)
tokenizer = AutoTokenizer.from_pretrained(args.output_dir)
model.to(args.device)
# Evaluation - we can ask to evaluate all the checkpoints (sub-directories) in a directory
@@ -718,8 +692,8 @@ def main():
for checkpoint in checkpoints:
# Reload the model
global_step = checkpoint.split("-")[-1] if len(checkpoints) > 1 else ""
model = model_class.from_pretrained(checkpoint)
tokenizer = tokenizer_class.from_pretrained(checkpoint)
model = AutoModelForMultipleChoice.from_pretrained(checkpoint)
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model.to(args.device)
# Evaluate

View File

@@ -80,7 +80,7 @@ def main():
# Load a pre-trained model
model = TransfoXLLMHeadModel.from_pretrained(args.model_name)
model = model.to(device)
model.to(device)
logger.info(
"Evaluating with bsz {} tgt_len {} ext_len {} mem_len {} clamp_len {}".format(

View File

@@ -80,7 +80,7 @@ class Distiller:
self.mlm = params.mlm
if self.mlm:
logger.info(f"Using MLM loss for LM step.")
logger.info("Using MLM loss for LM step.")
self.mlm_mask_prop = params.mlm_mask_prop
assert 0.0 <= self.mlm_mask_prop <= 1.0
assert params.word_mask + params.word_keep + params.word_rand == 1.0
@@ -91,7 +91,7 @@ class Distiller:
self.pred_probs = self.pred_probs.half()
self.token_probs = self.token_probs.half()
else:
logger.info(f"Using CLM loss for LM step.")
logger.info("Using CLM loss for LM step.")
self.epoch = 0
self.n_iter = 0
@@ -365,8 +365,8 @@ class Distiller:
self.end_epoch()
if self.is_master:
logger.info(f"Save very last checkpoint as `pytorch_model.bin`.")
self.save_checkpoint(checkpoint_name=f"pytorch_model.bin")
logger.info("Save very last checkpoint as `pytorch_model.bin`.")
self.save_checkpoint(checkpoint_name="pytorch_model.bin")
logger.info("Training is finished")
def step(self, input_ids: torch.tensor, attention_mask: torch.tensor, lm_labels: torch.tensor):

View File

@@ -67,9 +67,6 @@ except ImportError:
logger = logging.getLogger(__name__)
ALL_MODELS = sum(
(tuple(conf.pretrained_config_archive_map.keys()) for conf in (BertConfig, XLNetConfig, XLMConfig)), ()
)
MODEL_CLASSES = {
"bert": (BertConfig, BertForQuestionAnswering, BertTokenizer),
@@ -505,7 +502,7 @@ def main():
default=None,
type=str,
required=True,
help="Path to pre-trained model or shortcut name selected in the list: " + ", ".join(ALL_MODELS),
help="Path to pretrained model or model identifier from huggingface.co/models",
)
parser.add_argument(
"--output_dir",

View File

@@ -60,7 +60,7 @@ def main():
with open(args.file_path, "r", encoding="utf8") as fp:
data = fp.readlines()
logger.info(f"Start encoding")
logger.info("Start encoding")
logger.info(f"{len(data)} examples to process.")
rslt = []

View File

@@ -93,7 +93,7 @@ if __name__ == "__main__":
elif args.model_type == "gpt2":
for w in ["weight", "bias"]:
compressed_sd[f"{prefix}.ln_f.{w}"] = state_dict[f"{prefix}.ln_f.{w}"]
compressed_sd[f"lm_head.weight"] = state_dict[f"lm_head.weight"]
compressed_sd["lm_head.weight"] = state_dict["lm_head.weight"]
print(f"N layers selected for distillation: {std_idx}")
print(f"Number of params transfered for distillation: {len(compressed_sd.keys())}")

View File

@@ -37,7 +37,7 @@ if __name__ == "__main__":
model = BertForMaskedLM.from_pretrained(args.model_name)
prefix = "bert"
else:
raise ValueError(f'args.model_type should be "bert".')
raise ValueError('args.model_type should be "bert".')
state_dict = model.state_dict()
compressed_sd = {}
@@ -78,8 +78,8 @@ if __name__ == "__main__":
]
std_idx += 1
compressed_sd[f"vocab_projector.weight"] = state_dict[f"cls.predictions.decoder.weight"]
compressed_sd[f"vocab_projector.bias"] = state_dict[f"cls.predictions.bias"]
compressed_sd["vocab_projector.weight"] = state_dict["cls.predictions.decoder.weight"]
compressed_sd["vocab_projector.bias"] = state_dict["cls.predictions.bias"]
if args.vocab_transform:
for w in ["weight", "bias"]:
compressed_sd[f"vocab_transform.{w}"] = state_dict[f"cls.predictions.transform.dense.{w}"]

View File

@@ -273,7 +273,7 @@ def main():
token_probs = None
train_lm_seq_dataset = LmSeqsDataset(params=args, data=data)
logger.info(f"Data loader created.")
logger.info("Data loader created.")
# STUDENT #
logger.info(f"Loading student config from {args.student_config}")
@@ -288,7 +288,7 @@ def main():
if args.n_gpu > 0:
student.to(f"cuda:{args.local_rank}")
logger.info(f"Student loaded.")
logger.info("Student loaded.")
# TEACHER #
teacher = teacher_model_class.from_pretrained(args.teacher_name, output_hidden_states=True)

View File

@@ -3,8 +3,7 @@
Based on the script [`run_language_modeling.py`](https://github.com/huggingface/transformers/blob/master/examples/language-modeling/run_language_modeling.py).
Fine-tuning (or training from scratch) the library models for language modeling on a text dataset for GPT, GPT-2, BERT and RoBERTa (DistilBERT
to be added soon). GPT and GPT-2 are fine-tuned using a causal language modeling (CLM) loss while BERT and RoBERTa
Fine-tuning (or training from scratch) the library models for language modeling on a text dataset for GPT, GPT-2, BERT, DistilBERT and RoBERTa. GPT and GPT-2 are fine-tuned using a causal language modeling (CLM) loss while BERT, DistilBERT and RoBERTa
are fine-tuned using a masked language modeling (MLM) loss.
Before running the following example, you should get a file that contains text on which the language model will be
@@ -35,7 +34,7 @@ python run_language_modeling.py \
This takes about half an hour to train on a single K80 GPU and about one minute for the evaluation to run. It reaches
a score of ~20 perplexity once fine-tuned on the dataset.
### RoBERTa/BERT and masked language modeling
### RoBERTa/BERT/DistilBERT and masked language modeling
The following example fine-tunes RoBERTa on WikiText-2. Here too, we're using the raw WikiText-2. The loss is different
as BERT/RoBERTa have a bidirectional mechanism; we're therefore using the same loss that was used during their

View File

@@ -115,15 +115,13 @@ class DataTrainingArguments:
)
def get_dataset(args: DataTrainingArguments, tokenizer: PreTrainedTokenizer, evaluate=False, local_rank=-1):
def get_dataset(args: DataTrainingArguments, tokenizer: PreTrainedTokenizer, evaluate=False):
file_path = args.eval_data_file if evaluate else args.train_data_file
if args.line_by_line:
return LineByLineTextDataset(
tokenizer=tokenizer, file_path=file_path, block_size=args.block_size, local_rank=local_rank
)
return LineByLineTextDataset(tokenizer=tokenizer, file_path=file_path, block_size=args.block_size)
else:
return TextDataset(
tokenizer=tokenizer, file_path=file_path, block_size=args.block_size, local_rank=local_rank,
tokenizer=tokenizer, file_path=file_path, block_size=args.block_size, overwrite_cache=args.overwrite_cache
)
@@ -220,16 +218,9 @@ def main():
data_args.block_size = min(data_args.block_size, tokenizer.max_len)
# Get datasets
train_dataset = (
get_dataset(data_args, tokenizer=tokenizer, local_rank=training_args.local_rank)
if training_args.do_train
else None
)
eval_dataset = (
get_dataset(data_args, tokenizer=tokenizer, local_rank=training_args.local_rank, evaluate=True)
if training_args.do_eval
else None
)
train_dataset = get_dataset(data_args, tokenizer=tokenizer) if training_args.do_train else None
eval_dataset = get_dataset(data_args, tokenizer=tokenizer, evaluate=True) if training_args.do_eval else None
data_collator = DataCollatorForLanguageModeling(
tokenizer=tokenizer, mlm=data_args.mlm, mlm_probability=data_args.mlm_probability
)
@@ -260,7 +251,7 @@ def main():
# Evaluation
results = {}
if training_args.do_eval and training_args.local_rank in [-1, 0]:
if training_args.do_eval:
logger.info("*** Evaluate ***")
eval_output = trainer.evaluate()
@@ -269,11 +260,12 @@ def main():
result = {"perplexity": perplexity}
output_eval_file = os.path.join(training_args.output_dir, "eval_results_lm.txt")
with open(output_eval_file, "w") as writer:
logger.info("***** Eval results *****")
for key in sorted(result.keys()):
logger.info(" %s = %s", key, str(result[key]))
writer.write("%s = %s\n" % (key, str(result[key])))
if trainer.is_world_master():
with open(output_eval_file, "w") as writer:
logger.info("***** Eval results *****")
for key in sorted(result.keys()):
logger.info(" %s = %s", key, str(result[key]))
writer.write("%s = %s\n" % (key, str(result[key])))
results.update(result)

View File

@@ -0,0 +1,183 @@
# Movement Pruning: Adaptive Sparsity by Fine-Tuning
*Magnitude pruning is a widely used strategy for reducing model size in pure supervised learning; however, it is less effective in the transfer learning regime that has become standard for state-of-the-art natural language processing applications. We propose the use of *movement pruning*, a simple, deterministic first-order weight pruning method that is more adaptive to pretrained model fine-tuning. Experiments show that when pruning large pretrained language models, movement pruning shows significant improvements in high-sparsity regimes. When combined with distillation, the approach achieves minimal accuracy loss with down to only 3% of the model parameters:*
| Fine-pruning+Distillation<br>(Teacher=BERT-base fine-tuned) | BERT base<br>fine-tuned | Remaining<br>Weights (%) | Magnitude Pruning | L0 Regularization | Movement Pruning | Soft Movement Pruning |
| :---: | :---: | :---: | :---: | :---: | :---: | :---: |
| SQuAD - Dev<br>EM/F1 | 80.4/88.1 | 10%<br>3% | 70.2/80.1<br>45.5/59.6 | 72.4/81.9<br>64.3/75.8 | 75.6/84.3<br>67.5/78.0 | **76.6/84.9**<br>**72.7/82.3** |
| MNLI - Dev<br>acc/MM acc | 84.5/84.9 | 10%<br>3% | 78.3/79.3<br>69.4/70.6 | 78.7/79.7<br>76.0/76.2 | 80.1/80.4<br>76.5/77.4 | **81.2/81.8**<br>**79.5/80.1** |
| QQP - Dev<br>acc/F1 | 91.4/88.4 | 10%<br>3% | 79.8/65.0<br>72.4/57.8 | 88.1/82.8<br>87.0/81.9 | 89.7/86.2<br>86.1/81.5 | **90.2/86.8**<br>**89.1/85.5** |
This page contains information on how to fine-prune pre-trained models such as `BERT` to obtain extremely sparse models with movement pruning. In contrast to magnitude pruning which selects weights that are far from 0, movement pruning retains weights that are moving away from 0.
For more information, we invite you to check out [our paper](https://arxiv.org/abs/2005.07683).
You can also have a look at this fun *Explain Like I'm Five* introductory [slide deck](https://www.slideshare.net/VictorSanh/movement-pruning-explain-like-im-five-234205241).
<div align="center">
<img src="https://www.seekpng.com/png/detail/166-1669328_how-to-make-emmental-cheese-at-home-icooker.png" width="400">
</div>
## Extreme sparsity and efficient storage
One promise of extreme pruning is to obtain extremely small models that can be easily sent (and stored) on edge devices. By setting weights to 0., we reduce the amount of information we need to store, and thus decreasing the memory size. We are able to obtain extremely sparse fine-pruned models with movement pruning: ~95% of the dense performance with ~5% of total remaining weights in the BERT encoder.
In [this notebook](https://github.com/huggingface/transformers/blob/master/examples/movement-pruning/Saving_PruneBERT.ipynb), we showcase how we can leverage standard tools that exist out-of-the-box to efficiently store an extremely sparse question answering model (only 6% of total remaining weights in the encoder). We are able to reduce the memory size of the encoder **from the 340MB (the orignal dense BERT) to 11MB**, without any additional training of the model (every operation is performed *post fine-pruning*). It is sufficiently small to store it on a [91' floppy disk](https://en.wikipedia.org/wiki/Floptical) 📎!
While movement pruning does not directly optimize for memory footprint (but rather the number of non-null weights), we hypothetize that further memory compression ratios can be achieved with specific quantization aware trainings (see for instance [Q8BERT](https://arxiv.org/abs/1910.06188), [And the Bit Goes Down](https://arxiv.org/abs/1907.05686) or [Quant-Noise](https://arxiv.org/abs/2004.07320)).
## Fine-pruned models
As examples, we release two English PruneBERT checkpoints (models fine-pruned from a pre-trained `BERT` checkpoint), one on SQuAD and the other on MNLI.
- **`prunebert-base-uncased-6-finepruned-w-distil-squad`**<br/>
Pre-trained `BERT-base-uncased` fine-pruned with soft movement pruning on SQuAD v1.1. We use an additional distillation signal from `BERT-base-uncased` finetuned on SQuAD. The encoder counts 6% of total non-null weights and reaches 83.8 F1 score. The model can be accessed with: `pruned_bert = BertForQuestionAnswering.from_pretrained("huggingface/prunebert-base-uncased-6-finepruned-w-distil-squad")`
- **`prunebert-base-uncased-6-finepruned-w-distil-mnli`**<br/>
Pre-trained `BERT-base-uncased` fine-pruned with soft movement pruning on MNLI. We use an additional distillation signal from `BERT-base-uncased` finetuned on MNLI. The encoder counts 6% of total non-null weights and reaches 80.7 (matched) accuracy. The model can be accessed with: `pruned_bert = BertForSequenceClassification.from_pretrained("huggingface/prunebert-base-uncased-6-finepruned-w-distil-mnli")`
## How to fine-prune?
### Setup
The code relies on the 🤗 Transformers library. In addition to the dependencies listed in the [`examples`](https://github.com/huggingface/transformers/tree/master/examples) folder, you should install a few additional dependencies listed in the `requirements.txt` file: `pip install -r requirements.txt`.
Note that we built our experiments on top of a stabilized version of the library (commit https://github.com/huggingface/transformers/commit/352d5472b0c1dec0f420d606d16747d851b4bda8): we do not guarantee that everything is still compatible with the latest version of the master branch.
### Fine-pruning with movement pruning
Below, we detail how to reproduce the results reported in the paper. We use SQuAD as a running example. Commands (and scripts) can be easily adapted for other tasks.
The following command fine-prunes a pre-trained `BERT-base` on SQuAD using movement pruning towards 15% of remaining weights (85% sparsity). Note that we freeze all the embeddings modules (from their pre-trained value) and only prune the Fully Connected layers in the encoder (12 layers of Transformer Block).
```bash
SERIALIZATION_DIR=<OUTPUT_DIR>
SQUAD_DATA=<SQUAD_DATA>
python examples/movement-pruning/masked_run_squad.py \
--output_dir $SERIALIZATION_DIR \
--data_dir $SQUAD_DATA \
--train_file train-v1.1.json \
--predict_file dev-v1.1.json \
--do_train --do_eval --do_lower_case \
--model_type masked_bert \
--model_name_or_path bert-base-uncased \
--per_gpu_train_batch_size 16 \
--warmup_steps 5400 \
--num_train_epochs 10 \
--learning_rate 3e-5 --mask_scores_learning_rate 1e-2 \
--initial_threshold 1 --final_threshold 0.15 \
--initial_warmup 1 --final_warmup 2 \
--pruning_method topK --mask_init constant --mask_scale 0.
```
### Fine-pruning with other methods
We can also explore other fine-pruning methods by changing the `pruning_method` parameter:
Soft movement pruning
```bash
python examples/movement-pruning/masked_run_squad.py \
--output_dir $SERIALIZATION_DIR \
--data_dir $SQUAD_DATA \
--train_file train-v1.1.json \
--predict_file dev-v1.1.json \
--do_train --do_eval --do_lower_case \
--model_type masked_bert \
--model_name_or_path bert-base-uncased \
--per_gpu_train_batch_size 16 \
--warmup_steps 5400 \
--num_train_epochs 10 \
--learning_rate 3e-5 --mask_scores_learning_rate 1e-2 \
--initial_threshold 0 --final_threshold 0.1 \
--initial_warmup 1 --final_warmup 2 \
--pruning_method sigmoied_threshold --mask_init constant --mask_scale 0. \
--regularization l1 --final_lambda 400.
```
L0 regularization
```bash
python examples/movement-pruning/masked_run_squad.py \
--output_dir $SERIALIZATION_DIR \
--data_dir $SQUAD_DATA \
--train_file train-v1.1.json \
--predict_file dev-v1.1.json \
--do_train --do_eval --do_lower_case \
--model_type masked_bert \
--model_name_or_path bert-base-uncased \
--per_gpu_train_batch_size 16 \
--warmup_steps 5400 \
--num_train_epochs 10 \
--learning_rate 3e-5 --mask_scores_learning_rate 1e-1 \
--initial_threshold 1. --final_threshold 1. \
--initial_warmup 1 --final_warmup 1 \
--pruning_method l0 --mask_init constant --mask_scale 2.197 \
--regularization l0 --final_lambda 125.
```
Iterative Magnitude Pruning
```bash
python examples/movement-pruning/masked_run_squad.py \
--output_dir ./dbg \
--data_dir examples/distillation/data/squad_data \
--train_file train-v1.1.json \
--predict_file dev-v1.1.json \
--do_train --do_eval --do_lower_case \
--model_type masked_bert \
--model_name_or_path bert-base-uncased \
--per_gpu_train_batch_size 16 \
--warmup_steps 5400 \
--num_train_epochs 10 \
--learning_rate 3e-5 \
--initial_threshold 1 --final_threshold 0.15 \
--initial_warmup 1 --final_warmup 2 \
--pruning_method magnitude
```
### After fine-pruning
**Counting parameters**
Regularization based pruning methods (soft movement pruning and L0 regularization) rely on the penalty to induce sparsity. The multiplicative coefficient controls the sparsity level.
To obtain the effective sparsity level in the encoder, we simply count the number of activated (non-null) weights:
```bash
python examples/movement-pruning/count_parameters.py \
--pruning_method sigmoied_threshold \
--threshold 0.1 \
--serialization_dir $SERIALIZATION_DIR
```
**Pruning once for all**
Once the model has been fine-pruned, the pruned weights can be set to 0. once for all (reducing the amount of information to store). In our running experiments, we can convert a `MaskedBertForQuestionAnswering` (a BERT model augmented to enable on-the-fly pruning capabilities) to a standard `BertForQuestionAnswering`:
```bash
python examples/movement-pruning/bertarize.py \
--pruning_method sigmoied_threshold \
--threshold 0.1 \
--model_name_or_path $SERIALIZATION_DIR
```
## Hyper-parameters
For reproducibility purposes, we share the detailed results presented in the paper. These [tables](https://docs.google.com/spreadsheets/d/17JgRq_OFFTniUrz6BZWW_87DjFkKXpI1kYDSsseT_7g/edit?usp=sharing) exhaustively describe the individual hyper-parameters used for each data point.
## Inference speed
Early experiments show that even though models fine-pruned with (soft) movement pruning are extremely sparse, they do not benefit from significant improvement in terms of inference speed when using the standard PyTorch inference.
We are currently benchmarking and exploring inference setups specifically for sparse architectures.
In particular, hardware manufacturers are announcing devices that will speedup inference for sparse networks considerably.
## Citation
If you find this resource useful, please consider citing the following paper:
```
@article{sanh2020movement,
title={Movement Pruning: Adaptive Sparsity by Fine-Tuning},
author={Victor Sanh and Thomas Wolf and Alexander M. Rush},
year={2020},
eprint={2005.07683},
archivePrefix={arXiv},
primaryClass={cs.CL}
}
```

View File

@@ -0,0 +1,612 @@
{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Saving PruneBERT\n",
"\n",
"\n",
"This notebook aims at showcasing how we can leverage standard tools to save (and load) an extremely sparse model fine-pruned with [movement pruning](https://arxiv.org/abs/2005.07683) (or any other unstructured pruning mehtod).\n",
"\n",
"In this example, we used BERT (base-uncased, but the procedure described here is not specific to BERT and can be applied to a large variety of models.\n",
"\n",
"We first obtain an extremely sparse model by fine-pruning with movement pruning on SQuAD v1.1. We then used the following combination of standard tools:\n",
"- We reduce the precision of the model with Int8 dynamic quantization using [PyTorch implementation](https://pytorch.org/tutorials/intermediate/dynamic_quantization_bert_tutorial.html). We only quantized the Fully Connected Layers.\n",
"- Sparse quantized matrices are converted into the [Compressed Sparse Row format](https://docs.scipy.org/doc/scipy/reference/generated/scipy.sparse.csr_matrix.html).\n",
"- We use HDF5 with `gzip` compression to store the weights.\n",
"\n",
"We experiment with a question answering model with only 6% of total remaining weights in the encoder (previously obtained with movement pruning). **We are able to reduce the memory size of the encoder from 340MB (original dense BERT) to 11MB**, which fits on a [91' floppy disk](https://en.wikipedia.org/wiki/Floptical)!\n",
"\n",
"<img src=\"https://upload.wikimedia.org/wikipedia/commons/thumb/0/00/Floptical_disk_21MB.jpg/440px-Floptical_disk_21MB.jpg\" width=\"200\">"
]
},
{
"cell_type": "code",
"execution_count": 1,
"metadata": {},
"outputs": [],
"source": [
"# Includes\n",
"\n",
"import h5py\n",
"import os\n",
"import json\n",
"from collections import OrderedDict\n",
"\n",
"from scipy import sparse\n",
"import numpy as np\n",
"\n",
"import torch\n",
"from torch import nn\n",
"\n",
"from transformers import *\n",
"\n",
"os.chdir('../../')"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Saving"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Dynamic quantization induces little or no loss of performance while significantly reducing the memory footprint."
]
},
{
"cell_type": "code",
"execution_count": 2,
"metadata": {},
"outputs": [],
"source": [
"# Load fine-pruned model and quantize the model\n",
"\n",
"model_path = \"serialization_dir/bert-base-uncased/92/squad/l1\"\n",
"model_name = \"bertarized_l1_with_distil_0._0.1_1_2_l1_1100._3e-5_1e-2_sigmoied_threshold_constant_0._10_epochs\"\n",
"\n",
"model = BertForQuestionAnswering.from_pretrained(os.path.join(model_path, model_name))\n",
"model.to('cpu')\n",
"\n",
"quantized_model = torch.quantization.quantize_dynamic(\n",
" model=model,\n",
" qconfig_spec = {\n",
" torch.nn.Linear : torch.quantization.default_dynamic_qconfig,\n",
" },\n",
" dtype=torch.qint8,\n",
" )\n",
"# print(quantized_model)\n",
"\n",
"qtz_st = quantized_model.state_dict()"
]
},
{
"cell_type": "code",
"execution_count": 3,
"metadata": {},
"outputs": [],
"source": [
"# Saving the original (encoder + classifier) in the standard torch.save format\n",
"\n",
"dense_st = {name: param for name, param in model.state_dict().items() \n",
" if \"embedding\" not in name and \"pooler\" not in name}\n",
"torch.save(dense_st, 'dbg/dense_squad.pt',)\n",
"dense_mb_size = os.path.getsize(\"dbg/dense_squad.pt\")\n"
]
},
{
"cell_type": "code",
"execution_count": 4,
"metadata": {
"scrolled": true
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Decompose quantization for bert.encoder.layer.0.attention.self.query._packed_params.weight\n",
"Decompose quantization for bert.encoder.layer.0.attention.self.key._packed_params.weight\n",
"Decompose quantization for bert.encoder.layer.0.attention.self.value._packed_params.weight\n",
"Decompose quantization for bert.encoder.layer.0.attention.output.dense._packed_params.weight\n",
"Decompose quantization for bert.encoder.layer.0.intermediate.dense._packed_params.weight\n",
"Decompose quantization for bert.encoder.layer.0.output.dense._packed_params.weight\n",
"Decompose quantization for bert.encoder.layer.1.attention.self.query._packed_params.weight\n",
"Decompose quantization for bert.encoder.layer.1.attention.self.key._packed_params.weight\n",
"Decompose quantization for bert.encoder.layer.1.attention.self.value._packed_params.weight\n",
"Decompose quantization for bert.encoder.layer.1.attention.output.dense._packed_params.weight\n",
"Decompose quantization for bert.encoder.layer.1.intermediate.dense._packed_params.weight\n",
"Decompose quantization for bert.encoder.layer.1.output.dense._packed_params.weight\n",
"Decompose quantization for bert.encoder.layer.2.attention.self.query._packed_params.weight\n",
"Decompose quantization for bert.encoder.layer.2.attention.self.key._packed_params.weight\n",
"Decompose quantization for bert.encoder.layer.2.attention.self.value._packed_params.weight\n",
"Decompose quantization for bert.encoder.layer.2.attention.output.dense._packed_params.weight\n",
"Decompose quantization for bert.encoder.layer.2.intermediate.dense._packed_params.weight\n",
"Decompose quantization for bert.encoder.layer.2.output.dense._packed_params.weight\n",
"Decompose quantization for bert.encoder.layer.3.attention.self.query._packed_params.weight\n",
"Decompose quantization for bert.encoder.layer.3.attention.self.key._packed_params.weight\n",
"Decompose quantization for bert.encoder.layer.3.attention.self.value._packed_params.weight\n",
"Decompose quantization for bert.encoder.layer.3.attention.output.dense._packed_params.weight\n",
"Decompose quantization for bert.encoder.layer.3.intermediate.dense._packed_params.weight\n",
"Decompose quantization for bert.encoder.layer.3.output.dense._packed_params.weight\n",
"Decompose quantization for bert.encoder.layer.4.attention.self.query._packed_params.weight\n",
"Decompose quantization for bert.encoder.layer.4.attention.self.key._packed_params.weight\n",
"Decompose quantization for bert.encoder.layer.4.attention.self.value._packed_params.weight\n",
"Decompose quantization for bert.encoder.layer.4.attention.output.dense._packed_params.weight\n",
"Decompose quantization for bert.encoder.layer.4.intermediate.dense._packed_params.weight\n",
"Decompose quantization for bert.encoder.layer.4.output.dense._packed_params.weight\n",
"Decompose quantization for bert.encoder.layer.5.attention.self.query._packed_params.weight\n",
"Decompose quantization for bert.encoder.layer.5.attention.self.key._packed_params.weight\n",
"Decompose quantization for bert.encoder.layer.5.attention.self.value._packed_params.weight\n",
"Decompose quantization for bert.encoder.layer.5.attention.output.dense._packed_params.weight\n",
"Decompose quantization for bert.encoder.layer.5.intermediate.dense._packed_params.weight\n",
"Decompose quantization for bert.encoder.layer.5.output.dense._packed_params.weight\n",
"Decompose quantization for bert.encoder.layer.6.attention.self.query._packed_params.weight\n",
"Decompose quantization for bert.encoder.layer.6.attention.self.key._packed_params.weight\n",
"Decompose quantization for bert.encoder.layer.6.attention.self.value._packed_params.weight\n",
"Decompose quantization for bert.encoder.layer.6.attention.output.dense._packed_params.weight\n",
"Decompose quantization for bert.encoder.layer.6.intermediate.dense._packed_params.weight\n",
"Decompose quantization for bert.encoder.layer.6.output.dense._packed_params.weight\n",
"Decompose quantization for bert.encoder.layer.7.attention.self.query._packed_params.weight\n",
"Decompose quantization for bert.encoder.layer.7.attention.self.key._packed_params.weight\n",
"Decompose quantization for bert.encoder.layer.7.attention.self.value._packed_params.weight\n",
"Decompose quantization for bert.encoder.layer.7.attention.output.dense._packed_params.weight\n",
"Decompose quantization for bert.encoder.layer.7.intermediate.dense._packed_params.weight\n",
"Decompose quantization for bert.encoder.layer.7.output.dense._packed_params.weight\n",
"Decompose quantization for bert.encoder.layer.8.attention.self.query._packed_params.weight\n",
"Decompose quantization for bert.encoder.layer.8.attention.self.key._packed_params.weight\n",
"Decompose quantization for bert.encoder.layer.8.attention.self.value._packed_params.weight\n",
"Decompose quantization for bert.encoder.layer.8.attention.output.dense._packed_params.weight\n",
"Decompose quantization for bert.encoder.layer.8.intermediate.dense._packed_params.weight\n",
"Decompose quantization for bert.encoder.layer.8.output.dense._packed_params.weight\n",
"Decompose quantization for bert.encoder.layer.9.attention.self.query._packed_params.weight\n",
"Decompose quantization for bert.encoder.layer.9.attention.self.key._packed_params.weight\n",
"Decompose quantization for bert.encoder.layer.9.attention.self.value._packed_params.weight\n",
"Decompose quantization for bert.encoder.layer.9.attention.output.dense._packed_params.weight\n",
"Decompose quantization for bert.encoder.layer.9.intermediate.dense._packed_params.weight\n",
"Decompose quantization for bert.encoder.layer.9.output.dense._packed_params.weight\n",
"Decompose quantization for bert.encoder.layer.10.attention.self.query._packed_params.weight\n",
"Decompose quantization for bert.encoder.layer.10.attention.self.key._packed_params.weight\n",
"Decompose quantization for bert.encoder.layer.10.attention.self.value._packed_params.weight\n",
"Decompose quantization for bert.encoder.layer.10.attention.output.dense._packed_params.weight\n",
"Decompose quantization for bert.encoder.layer.10.intermediate.dense._packed_params.weight\n",
"Decompose quantization for bert.encoder.layer.10.output.dense._packed_params.weight\n",
"Decompose quantization for bert.encoder.layer.11.attention.self.query._packed_params.weight\n",
"Decompose quantization for bert.encoder.layer.11.attention.self.key._packed_params.weight\n",
"Decompose quantization for bert.encoder.layer.11.attention.self.value._packed_params.weight\n",
"Decompose quantization for bert.encoder.layer.11.attention.output.dense._packed_params.weight\n",
"Decompose quantization for bert.encoder.layer.11.intermediate.dense._packed_params.weight\n",
"Decompose quantization for bert.encoder.layer.11.output.dense._packed_params.weight\n",
"Decompose quantization for bert.pooler.dense._packed_params.weight\n",
"Decompose quantization for qa_outputs._packed_params.weight\n"
]
}
],
"source": [
"# Elementary representation: we decompose the quantized tensors into (scale, zero_point, int_repr).\n",
"# See https://pytorch.org/docs/stable/quantization.html\n",
"\n",
"# We further leverage the fact that int_repr is sparse matrix to optimize the storage: we decompose int_repr into\n",
"# its CSR representation (data, indptr, indices).\n",
"\n",
"elementary_qtz_st = {}\n",
"for name, param in qtz_st.items():\n",
" if param.is_quantized:\n",
" print(\"Decompose quantization for\", name)\n",
" # We need to extract the scale, the zero_point and the int_repr for the quantized tensor and modules\n",
" scale = param.q_scale() # torch.tensor(1,) - float32\n",
" zero_point = param.q_zero_point() # torch.tensor(1,) - int32\n",
" elementary_qtz_st[f\"{name}.scale\"] = scale\n",
" elementary_qtz_st[f\"{name}.zero_point\"] = zero_point\n",
"\n",
" # We assume the int_repr is sparse and compute its CSR representation\n",
" # Only the FCs in the encoder are actually sparse\n",
" int_repr = param.int_repr() # torch.tensor(nb_rows, nb_columns) - int8\n",
" int_repr_cs = sparse.csr_matrix(int_repr) # scipy.sparse.csr.csr_matrix\n",
"\n",
" elementary_qtz_st[f\"{name}.int_repr.data\"] = int_repr_cs.data # np.array int8\n",
" elementary_qtz_st[f\"{name}.int_repr.indptr\"] = int_repr_cs.indptr # np.array int32\n",
" assert max(int_repr_cs.indices) < 65535 # If not, we shall fall back to int32\n",
" elementary_qtz_st[f\"{name}.int_repr.indices\"] = np.uint16(int_repr_cs.indices) # np.array uint16\n",
" elementary_qtz_st[f\"{name}.int_repr.shape\"] = int_repr_cs.shape # tuple(int, int)\n",
" else:\n",
" elementary_qtz_st[name] = param\n"
]
},
{
"cell_type": "code",
"execution_count": 5,
"metadata": {
"scrolled": true
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Encoder Size (MB) - Sparse & Quantized - `torch.save`: 21.29\n"
]
}
],
"source": [
"# Saving the pruned (encoder + classifier) in the standard torch.save format\n",
"\n",
"dense_optimized_st = {name: param for name, param in elementary_qtz_st.items() \n",
" if \"embedding\" not in name and \"pooler\" not in name}\n",
"torch.save(dense_optimized_st, 'dbg/dense_squad_optimized.pt',)\n",
"print(\"Encoder Size (MB) - Sparse & Quantized - `torch.save`:\",\n",
" round(os.path.getsize(\"dbg/dense_squad_optimized.pt\")/1e6, 2))\n"
]
},
{
"cell_type": "code",
"execution_count": 6,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Skip bert.embeddings.word_embeddings.weight\n",
"Skip bert.embeddings.position_embeddings.weight\n",
"Skip bert.embeddings.token_type_embeddings.weight\n",
"Skip bert.embeddings.LayerNorm.weight\n",
"Skip bert.embeddings.LayerNorm.bias\n",
"Skip bert.pooler.dense.scale\n",
"Skip bert.pooler.dense.zero_point\n",
"Skip bert.pooler.dense._packed_params.weight.scale\n",
"Skip bert.pooler.dense._packed_params.weight.zero_point\n",
"Skip bert.pooler.dense._packed_params.weight.int_repr.data\n",
"Skip bert.pooler.dense._packed_params.weight.int_repr.indptr\n",
"Skip bert.pooler.dense._packed_params.weight.int_repr.indices\n",
"Skip bert.pooler.dense._packed_params.weight.int_repr.shape\n",
"Skip bert.pooler.dense._packed_params.bias\n",
"\n",
"Encoder Size (MB) - Dense: 340.25\n",
"Encoder Size (MB) - Sparse & Quantized: 11.27\n"
]
}
],
"source": [
"# Save the decomposed state_dict with an HDF5 file\n",
"# Saving only the encoder + QA Head\n",
"\n",
"with h5py.File('dbg/squad_sparse.h5','w') as hf:\n",
" for name, param in elementary_qtz_st.items():\n",
" if \"embedding\" in name:\n",
" print(f\"Skip {name}\")\n",
" continue\n",
"\n",
" if \"pooler\" in name:\n",
" print(f\"Skip {name}\")\n",
" continue\n",
"\n",
" if type(param) == torch.Tensor:\n",
" if param.numel() == 1:\n",
" # module scale\n",
" # module zero_point\n",
" hf.attrs[name] = param\n",
" continue\n",
"\n",
" if param.requires_grad:\n",
" # LayerNorm\n",
" param = param.detach().numpy()\n",
" hf.create_dataset(name, data=param, compression=\"gzip\", compression_opts=9)\n",
"\n",
" elif type(param) == float or type(param) == int or type(param) == tuple:\n",
" # float - tensor _packed_params.weight.scale\n",
" # int - tensor_packed_params.weight.zero_point\n",
" # tuple - tensor _packed_params.weight.shape\n",
" hf.attrs[name] = param\n",
"\n",
" else:\n",
" hf.create_dataset(name, data=param, compression=\"gzip\", compression_opts=9)\n",
"\n",
"\n",
"with open('dbg/metadata.json', 'w') as f:\n",
" f.write(json.dumps(qtz_st._metadata)) \n",
"\n",
"size = os.path.getsize(\"dbg/squad_sparse.h5\") + os.path.getsize(\"dbg/metadata.json\")\n",
"print(\"\")\n",
"print(\"Encoder Size (MB) - Dense: \", round(dense_mb_size/1e6, 2))\n",
"print(\"Encoder Size (MB) - Sparse & Quantized:\", round(size/1e6, 2))\n"
]
},
{
"cell_type": "code",
"execution_count": 7,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"\n",
"Size (MB): 99.39\n"
]
}
],
"source": [
"# Save the decomposed state_dict to HDF5 storage\n",
"# Save everything in the architecutre (embedding + encoder + QA Head)\n",
"\n",
"with h5py.File('dbg/squad_sparse_with_embs.h5','w') as hf:\n",
" for name, param in elementary_qtz_st.items():\n",
"# if \"embedding\" in name:\n",
"# print(f\"Skip {name}\")\n",
"# continue\n",
"\n",
"# if \"pooler\" in name:\n",
"# print(f\"Skip {name}\")\n",
"# continue\n",
"\n",
" if type(param) == torch.Tensor:\n",
" if param.numel() == 1:\n",
" # module scale\n",
" # module zero_point\n",
" hf.attrs[name] = param\n",
" continue\n",
"\n",
" if param.requires_grad:\n",
" # LayerNorm\n",
" param = param.detach().numpy()\n",
" hf.create_dataset(name, data=param, compression=\"gzip\", compression_opts=9)\n",
"\n",
" elif type(param) == float or type(param) == int or type(param) == tuple:\n",
" # float - tensor _packed_params.weight.scale\n",
" # int - tensor _packed_params.weight.zero_point\n",
" # tuple - tensor _packed_params.weight.shape\n",
" hf.attrs[name] = param\n",
"\n",
" else:\n",
" hf.create_dataset(name, data=param, compression=\"gzip\", compression_opts=9)\n",
"\n",
"\n",
"with open('dbg/metadata.json', 'w') as f:\n",
" f.write(json.dumps(qtz_st._metadata)) \n",
"\n",
"size = os.path.getsize(\"dbg/squad_sparse_with_embs.h5\") + os.path.getsize(\"dbg/metadata.json\")\n",
"print('\\nSize (MB):', round(size/1e6, 2))\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Loading"
]
},
{
"cell_type": "code",
"execution_count": 8,
"metadata": {},
"outputs": [],
"source": [
"# Reconstruct the elementary state dict\n",
"\n",
"reconstructed_elementary_qtz_st = {}\n",
"\n",
"hf = h5py.File('dbg/squad_sparse_with_embs.h5','r')\n",
"\n",
"for attr_name, attr_param in hf.attrs.items():\n",
" if 'shape' in attr_name:\n",
" attr_param = tuple(attr_param)\n",
" elif \".scale\" in attr_name:\n",
" if \"_packed_params\" in attr_name:\n",
" attr_param = float(attr_param)\n",
" else:\n",
" attr_param = torch.tensor(attr_param)\n",
" elif \".zero_point\" in attr_name:\n",
" if \"_packed_params\" in attr_name:\n",
" attr_param = int(attr_param)\n",
" else:\n",
" attr_param = torch.tensor(attr_param)\n",
" reconstructed_elementary_qtz_st[attr_name] = attr_param\n",
" # print(f\"Unpack {attr_name}\")\n",
" \n",
"# Get the tensors/arrays\n",
"for data_name, data_param in hf.items():\n",
" if \"LayerNorm\" in data_name or \"_packed_params.bias\" in data_name:\n",
" reconstructed_elementary_qtz_st[data_name] = torch.from_numpy(np.array(data_param))\n",
" elif \"embedding\" in data_name:\n",
" reconstructed_elementary_qtz_st[data_name] = torch.from_numpy(np.array(data_param))\n",
" else: # _packed_params.weight.int_repr.data, _packed_params.weight.int_repr.indices and _packed_params.weight.int_repr.indptr\n",
" data_param = np.array(data_param)\n",
" if \"indices\" in data_name:\n",
" data_param = np.array(data_param, dtype=np.int32)\n",
" reconstructed_elementary_qtz_st[data_name] = data_param\n",
" # print(f\"Unpack {data_name}\")\n",
" \n",
"\n",
"hf.close()"
]
},
{
"cell_type": "code",
"execution_count": 9,
"metadata": {},
"outputs": [],
"source": [
"# Sanity checks\n",
"\n",
"for name, param in reconstructed_elementary_qtz_st.items():\n",
" assert name in elementary_qtz_st\n",
"for name, param in elementary_qtz_st.items():\n",
" assert name in reconstructed_elementary_qtz_st, name\n",
"\n",
"for name, param in reconstructed_elementary_qtz_st.items():\n",
" assert type(param) == type(elementary_qtz_st[name]), name\n",
" if type(param) == torch.Tensor:\n",
" assert torch.all(torch.eq(param, elementary_qtz_st[name])), name\n",
" elif type(param) == np.ndarray:\n",
" assert (param == elementary_qtz_st[name]).all(), name\n",
" else:\n",
" assert param == elementary_qtz_st[name], name"
]
},
{
"cell_type": "code",
"execution_count": 10,
"metadata": {},
"outputs": [],
"source": [
"# Re-assemble the sparse int_repr from the CSR format\n",
"\n",
"reconstructed_qtz_st = {}\n",
"\n",
"for name, param in reconstructed_elementary_qtz_st.items():\n",
" if \"weight.int_repr.indptr\" in name:\n",
" prefix_ = name[:-16]\n",
" data = reconstructed_elementary_qtz_st[f\"{prefix_}.int_repr.data\"]\n",
" indptr = reconstructed_elementary_qtz_st[f\"{prefix_}.int_repr.indptr\"]\n",
" indices = reconstructed_elementary_qtz_st[f\"{prefix_}.int_repr.indices\"]\n",
" shape = reconstructed_elementary_qtz_st[f\"{prefix_}.int_repr.shape\"]\n",
"\n",
" int_repr = sparse.csr_matrix(arg1=(data, indices, indptr),\n",
" shape=shape)\n",
" int_repr = torch.tensor(int_repr.todense())\n",
"\n",
" scale = reconstructed_elementary_qtz_st[f\"{prefix_}.scale\"]\n",
" zero_point = reconstructed_elementary_qtz_st[f\"{prefix_}.zero_point\"]\n",
" weight = torch._make_per_tensor_quantized_tensor(int_repr,\n",
" scale,\n",
" zero_point)\n",
"\n",
" reconstructed_qtz_st[f\"{prefix_}\"] = weight\n",
" elif \"int_repr.data\" in name or \"int_repr.shape\" in name or \"int_repr.indices\" in name or \\\n",
" \"weight.scale\" in name or \"weight.zero_point\" in name:\n",
" continue\n",
" else:\n",
" reconstructed_qtz_st[name] = param\n"
]
},
{
"cell_type": "code",
"execution_count": 11,
"metadata": {},
"outputs": [],
"source": [
"# Sanity checks\n",
"\n",
"for name, param in reconstructed_qtz_st.items():\n",
" assert name in qtz_st\n",
"for name, param in qtz_st.items():\n",
" assert name in reconstructed_qtz_st, name\n",
"\n",
"for name, param in reconstructed_qtz_st.items():\n",
" assert type(param) == type(qtz_st[name]), name\n",
" if type(param) == torch.Tensor:\n",
" assert torch.all(torch.eq(param, qtz_st[name])), name\n",
" elif type(param) == np.ndarray:\n",
" assert (param == qtz_st[name]).all(), name\n",
" else:\n",
" assert param == qtz_st[name], name"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Sanity checks"
]
},
{
"cell_type": "code",
"execution_count": 12,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"<All keys matched successfully>"
]
},
"execution_count": 12,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# Load the re-constructed state dict into a model\n",
"\n",
"dummy_model = BertForQuestionAnswering.from_pretrained('bert-base-uncased')\n",
"dummy_model.to('cpu')\n",
"\n",
"reconstructed_qtz_model = torch.quantization.quantize_dynamic(\n",
" model=dummy_model,\n",
" qconfig_spec = None,\n",
" dtype=torch.qint8,\n",
" )\n",
"\n",
"reconstructed_qtz_st = OrderedDict(reconstructed_qtz_st)\n",
"with open('dbg/metadata.json', 'r') as read_file:\n",
" metadata = json.loads(read_file.read())\n",
"reconstructed_qtz_st._metadata = metadata\n",
"\n",
"reconstructed_qtz_model.load_state_dict(reconstructed_qtz_st)"
]
},
{
"cell_type": "code",
"execution_count": 13,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Sanity check passed\n"
]
}
],
"source": [
"# Sanity checks on the infernce\n",
"\n",
"N = 32\n",
"\n",
"for _ in range(25):\n",
" inputs = torch.randint(low=0, high=30000, size=(N, 128))\n",
" mask = torch.ones(size=(N, 128))\n",
"\n",
" y_reconstructed = reconstructed_qtz_model(input_ids=inputs, attention_mask=mask)[0]\n",
" y = quantized_model(input_ids=inputs, attention_mask=mask)[0]\n",
" \n",
" assert torch.all(torch.eq(y, y_reconstructed))\n",
"print(\"Sanity check passed\")"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": []
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.6.8"
}
},
"nbformat": 4,
"nbformat_minor": 4
}

View File

@@ -0,0 +1,132 @@
# Copyright 2020-present, the HuggingFace Inc. team.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
"""
Once a model has been fine-pruned, the weights that are masked during the forward pass can be pruned once for all.
For instance, once the a model from the :class:`~emmental.MaskedBertForSequenceClassification` is trained, it can be saved (and then loaded)
as a standard :class:`~transformers.BertForSequenceClassification`.
"""
import argparse
import os
import shutil
import torch
from emmental.modules import MagnitudeBinarizer, ThresholdBinarizer, TopKBinarizer
def main(args):
pruning_method = args.pruning_method
threshold = args.threshold
model_name_or_path = args.model_name_or_path.rstrip("/")
target_model_path = args.target_model_path
print(f"Load fine-pruned model from {model_name_or_path}")
model = torch.load(os.path.join(model_name_or_path, "pytorch_model.bin"))
pruned_model = {}
for name, tensor in model.items():
if "embeddings" in name or "LayerNorm" in name or "pooler" in name:
pruned_model[name] = tensor
print(f"Copied layer {name}")
elif "classifier" in name or "qa_output" in name:
pruned_model[name] = tensor
print(f"Copied layer {name}")
elif "bias" in name:
pruned_model[name] = tensor
print(f"Copied layer {name}")
else:
if pruning_method == "magnitude":
mask = MagnitudeBinarizer.apply(inputs=tensor, threshold=threshold)
pruned_model[name] = tensor * mask
print(f"Pruned layer {name}")
elif pruning_method == "topK":
if "mask_scores" in name:
continue
prefix_ = name[:-6]
scores = model[f"{prefix_}mask_scores"]
mask = TopKBinarizer.apply(scores, threshold)
pruned_model[name] = tensor * mask
print(f"Pruned layer {name}")
elif pruning_method == "sigmoied_threshold":
if "mask_scores" in name:
continue
prefix_ = name[:-6]
scores = model[f"{prefix_}mask_scores"]
mask = ThresholdBinarizer.apply(scores, threshold, True)
pruned_model[name] = tensor * mask
print(f"Pruned layer {name}")
elif pruning_method == "l0":
if "mask_scores" in name:
continue
prefix_ = name[:-6]
scores = model[f"{prefix_}mask_scores"]
l, r = -0.1, 1.1
s = torch.sigmoid(scores)
s_bar = s * (r - l) + l
mask = s_bar.clamp(min=0.0, max=1.0)
pruned_model[name] = tensor * mask
print(f"Pruned layer {name}")
else:
raise ValueError("Unknown pruning method")
if target_model_path is None:
target_model_path = os.path.join(
os.path.dirname(model_name_or_path), f"bertarized_{os.path.basename(model_name_or_path)}"
)
if not os.path.isdir(target_model_path):
shutil.copytree(model_name_or_path, target_model_path)
print(f"\nCreated folder {target_model_path}")
torch.save(pruned_model, os.path.join(target_model_path, "pytorch_model.bin"))
print("\nPruned model saved! See you later!")
if __name__ == "__main__":
parser = argparse.ArgumentParser()
parser.add_argument(
"--pruning_method",
choices=["l0", "magnitude", "topK", "sigmoied_threshold"],
type=str,
required=True,
help="Pruning Method (l0 = L0 regularization, magnitude = Magnitude pruning, topK = Movement pruning, sigmoied_threshold = Soft movement pruning)",
)
parser.add_argument(
"--threshold",
type=float,
required=False,
help="For `magnitude` and `topK`, it is the level of remaining weights (in %) in the fine-pruned model."
"For `sigmoied_threshold`, it is the threshold \tau against which the (sigmoied) scores are compared."
"Not needed for `l0`",
)
parser.add_argument(
"--model_name_or_path",
type=str,
required=True,
help="Folder containing the model that was previously fine-pruned",
)
parser.add_argument(
"--target_model_path",
default=None,
type=str,
required=False,
help="Folder containing the model that was previously fine-pruned",
)
args = parser.parse_args()
main(args)

View File

@@ -0,0 +1,92 @@
# Copyright 2020-present, the HuggingFace Inc. team.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
"""
Count remaining (non-zero) weights in the encoder (i.e. the transformer layers).
Sparsity and remaining weights levels are equivalent: sparsity % = 100 - remaining weights %.
"""
import argparse
import os
import torch
from emmental.modules import ThresholdBinarizer, TopKBinarizer
def main(args):
serialization_dir = args.serialization_dir
pruning_method = args.pruning_method
threshold = args.threshold
st = torch.load(os.path.join(serialization_dir, "pytorch_model.bin"), map_location="cpu")
remaining_count = 0 # Number of remaining (not pruned) params in the encoder
encoder_count = 0 # Number of params in the encoder
print("name".ljust(60, " "), "Remaining Weights %", "Remaning Weight")
for name, param in st.items():
if "encoder" not in name:
continue
if "mask_scores" in name:
if pruning_method == "topK":
mask_ones = TopKBinarizer.apply(param, threshold).sum().item()
elif pruning_method == "sigmoied_threshold":
mask_ones = ThresholdBinarizer.apply(param, threshold, True).sum().item()
elif pruning_method == "l0":
l, r = -0.1, 1.1
s = torch.sigmoid(param)
s_bar = s * (r - l) + l
mask = s_bar.clamp(min=0.0, max=1.0)
mask_ones = (mask > 0.0).sum().item()
else:
raise ValueError("Unknown pruning method")
remaining_count += mask_ones
print(name.ljust(60, " "), str(round(100 * mask_ones / param.numel(), 3)).ljust(20, " "), str(mask_ones))
else:
encoder_count += param.numel()
if "bias" in name or "LayerNorm" in name:
remaining_count += param.numel()
print("")
print("Remaining Weights (global) %: ", 100 * remaining_count / encoder_count)
if __name__ == "__main__":
parser = argparse.ArgumentParser()
parser.add_argument(
"--pruning_method",
choices=["l0", "topK", "sigmoied_threshold"],
type=str,
required=True,
help="Pruning Method (l0 = L0 regularization, topK = Movement pruning, sigmoied_threshold = Soft movement pruning)",
)
parser.add_argument(
"--threshold",
type=float,
required=False,
help="For `topK`, it is the level of remaining weights (in %) in the fine-pruned model."
"For `sigmoied_threshold`, it is the threshold \tau against which the (sigmoied) scores are compared."
"Not needed for `l0`",
)
parser.add_argument(
"--serialization_dir",
type=str,
required=True,
help="Folder containing the model that was previously fine-pruned",
)
args = parser.parse_args()
main(args)

View File

@@ -0,0 +1,10 @@
# flake8: noqa
from .configuration_bert_masked import MaskedBertConfig
from .modeling_bert_masked import (
MaskedBertForMultipleChoice,
MaskedBertForQuestionAnswering,
MaskedBertForSequenceClassification,
MaskedBertForTokenClassification,
MaskedBertModel,
)
from .modules import *

View File

@@ -0,0 +1,71 @@
# coding=utf-8
# Copyright 2018 The Google AI Language Team Authors and The HuggingFace Inc. team.
# Copyright (c) 2018, NVIDIA CORPORATION. All rights reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
""" Masked BERT model configuration. It replicates the class `~transformers.BertConfig`
and adapts it to the specificities of MaskedBert (`pruning_method`, `mask_init` and `mask_scale`."""
import logging
from transformers.configuration_utils import PretrainedConfig
logger = logging.getLogger(__name__)
class MaskedBertConfig(PretrainedConfig):
"""
A class replicating the `~transformers.BertConfig` with additional parameters for pruning/masking configuration.
"""
model_type = "masked_bert"
def __init__(
self,
vocab_size=30522,
hidden_size=768,
num_hidden_layers=12,
num_attention_heads=12,
intermediate_size=3072,
hidden_act="gelu",
hidden_dropout_prob=0.1,
attention_probs_dropout_prob=0.1,
max_position_embeddings=512,
type_vocab_size=2,
initializer_range=0.02,
layer_norm_eps=1e-12,
pad_token_id=0,
pruning_method="topK",
mask_init="constant",
mask_scale=0.0,
**kwargs
):
super().__init__(pad_token_id=pad_token_id, **kwargs)
self.vocab_size = vocab_size
self.hidden_size = hidden_size
self.num_hidden_layers = num_hidden_layers
self.num_attention_heads = num_attention_heads
self.hidden_act = hidden_act
self.intermediate_size = intermediate_size
self.hidden_dropout_prob = hidden_dropout_prob
self.attention_probs_dropout_prob = attention_probs_dropout_prob
self.max_position_embeddings = max_position_embeddings
self.type_vocab_size = type_vocab_size
self.initializer_range = initializer_range
self.layer_norm_eps = layer_norm_eps
self.pruning_method = pruning_method
self.mask_init = mask_init
self.mask_scale = mask_scale

File diff suppressed because it is too large Load Diff

View File

@@ -0,0 +1,3 @@
# flake8: noqa
from .binarizer import MagnitudeBinarizer, ThresholdBinarizer, TopKBinarizer
from .masked_nn import MaskedLinear

View File

@@ -0,0 +1,144 @@
# coding=utf-8
# Copyright 2020-present, AllenAI Authors, University of Illinois Urbana-Champaign,
# Intel Nervana Systems and the HuggingFace Inc. team.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
"""
Binarizers take a (real value) matrice as input and produce a binary (values in {0,1}) mask of the same shape.
"""
import torch
from torch import autograd
class ThresholdBinarizer(autograd.Function):
"""
Thresholdd binarizer.
Computes a binary mask M from a real value matrix S such that `M_{i,j} = 1` if and only if `S_{i,j} > \tau`
where `\tau` is a real value threshold.
Implementation is inspired from:
https://github.com/arunmallya/piggyback
Piggyback: Adapting a Single Network to Multiple Tasks by Learning to Mask Weights
Arun Mallya, Dillon Davis, Svetlana Lazebnik
"""
@staticmethod
def forward(ctx, inputs: torch.tensor, threshold: float, sigmoid: bool):
"""
Args:
inputs (`torch.FloatTensor`)
The input matrix from which the binarizer computes the binary mask.
threshold (`float`)
The threshold value (in R).
sigmoid (`bool`)
If set to ``True``, we apply the sigmoid function to the `inputs` matrix before comparing to `threshold`.
In this case, `threshold` should be a value between 0 and 1.
Returns:
mask (`torch.FloatTensor`)
Binary matrix of the same size as `inputs` acting as a mask (1 - the associated weight is
retained, 0 - the associated weight is pruned).
"""
nb_elems = inputs.numel()
nb_min = int(0.005 * nb_elems) + 1
if sigmoid:
mask = (torch.sigmoid(inputs) > threshold).type(inputs.type())
else:
mask = (inputs > threshold).type(inputs.type())
if mask.sum() < nb_min:
# We limit the pruning so that at least 0.5% (half a percent) of the weights are remaining
k_threshold = inputs.flatten().kthvalue(max(nb_elems - nb_min, 1)).values
mask = (inputs > k_threshold).type(inputs.type())
return mask
@staticmethod
def backward(ctx, gradOutput):
return gradOutput, None, None
class TopKBinarizer(autograd.Function):
"""
Top-k Binarizer.
Computes a binary mask M from a real value matrix S such that `M_{i,j} = 1` if and only if `S_{i,j}`
is among the k% highest values of S.
Implementation is inspired from:
https://github.com/allenai/hidden-networks
What's hidden in a randomly weighted neural network?
Vivek Ramanujan*, Mitchell Wortsman*, Aniruddha Kembhavi, Ali Farhadi, Mohammad Rastegari
"""
@staticmethod
def forward(ctx, inputs: torch.tensor, threshold: float):
"""
Args:
inputs (`torch.FloatTensor`)
The input matrix from which the binarizer computes the binary mask.
threshold (`float`)
The percentage of weights to keep (the rest is pruned).
`threshold` is a float between 0 and 1.
Returns:
mask (`torch.FloatTensor`)
Binary matrix of the same size as `inputs` acting as a mask (1 - the associated weight is
retained, 0 - the associated weight is pruned).
"""
# Get the subnetwork by sorting the inputs and using the top threshold %
mask = inputs.clone()
_, idx = inputs.flatten().sort(descending=True)
j = int(threshold * inputs.numel())
# flat_out and mask access the same memory.
flat_out = mask.flatten()
flat_out[idx[j:]] = 0
flat_out[idx[:j]] = 1
return mask
@staticmethod
def backward(ctx, gradOutput):
return gradOutput, None
class MagnitudeBinarizer(object):
"""
Magnitude Binarizer.
Computes a binary mask M from a real value matrix S such that `M_{i,j} = 1` if and only if `S_{i,j}`
is among the k% highest values of |S| (absolute value).
Implementation is inspired from https://github.com/NervanaSystems/distiller/blob/2291fdcc2ea642a98d4e20629acb5a9e2e04b4e6/distiller/pruning/automated_gradual_pruner.py#L24
"""
@staticmethod
def apply(inputs: torch.tensor, threshold: float):
"""
Args:
inputs (`torch.FloatTensor`)
The input matrix from which the binarizer computes the binary mask.
This input marix is typically the weight matrix.
threshold (`float`)
The percentage of weights to keep (the rest is pruned).
`threshold` is a float between 0 and 1.
Returns:
mask (`torch.FloatTensor`)
Binary matrix of the same size as `inputs` acting as a mask (1 - the associated weight is
retained, 0 - the associated weight is pruned).
"""
# Get the subnetwork by sorting the inputs and using the top threshold %
mask = inputs.clone()
_, idx = inputs.abs().flatten().sort(descending=True)
j = int(threshold * inputs.numel())
# flat_out and mask access the same memory.
flat_out = mask.flatten()
flat_out[idx[j:]] = 0
flat_out[idx[:j]] = 1
return mask

View File

@@ -0,0 +1,107 @@
# coding=utf-8
# Copyright 2020-present, the HuggingFace Inc. team.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
"""
Masked Linear module: A fully connected layer that computes an adaptive binary mask on the fly.
The mask (binary or not) is computed at each forward pass and multiplied against
the weight matrix to prune a portion of the weights.
The pruned weight matrix is then multiplied against the inputs (and if necessary, the bias is added).
"""
import math
import torch
from torch import nn
from torch.nn import functional as F
from torch.nn import init
from .binarizer import MagnitudeBinarizer, ThresholdBinarizer, TopKBinarizer
class MaskedLinear(nn.Linear):
"""
Fully Connected layer with on the fly adaptive mask.
If needed, a score matrix is created to store the importance of each associated weight.
"""
def __init__(
self,
in_features: int,
out_features: int,
bias: bool = True,
mask_init: str = "constant",
mask_scale: float = 0.0,
pruning_method: str = "topK",
):
"""
Args:
in_features (`int`)
Size of each input sample
out_features (`int`)
Size of each output sample
bias (`bool`)
If set to ``False``, the layer will not learn an additive bias.
Default: ``True``
mask_init (`str`)
The initialization method for the score matrix if a score matrix is needed.
Choices: ["constant", "uniform", "kaiming"]
Default: ``constant``
mask_scale (`float`)
The initialization parameter for the chosen initialization method `mask_init`.
Default: ``0.``
pruning_method (`str`)
Method to compute the mask.
Choices: ["topK", "threshold", "sigmoied_threshold", "magnitude", "l0"]
Default: ``topK``
"""
super(MaskedLinear, self).__init__(in_features=in_features, out_features=out_features, bias=bias)
assert pruning_method in ["topK", "threshold", "sigmoied_threshold", "magnitude", "l0"]
self.pruning_method = pruning_method
if self.pruning_method in ["topK", "threshold", "sigmoied_threshold", "l0"]:
self.mask_scale = mask_scale
self.mask_init = mask_init
self.mask_scores = nn.Parameter(torch.Tensor(self.weight.size()))
self.init_mask()
def init_mask(self):
if self.mask_init == "constant":
init.constant_(self.mask_scores, val=self.mask_scale)
elif self.mask_init == "uniform":
init.uniform_(self.mask_scores, a=-self.mask_scale, b=self.mask_scale)
elif self.mask_init == "kaiming":
init.kaiming_uniform_(self.mask_scores, a=math.sqrt(5))
def forward(self, input: torch.tensor, threshold: float):
# Get the mask
if self.pruning_method == "topK":
mask = TopKBinarizer.apply(self.mask_scores, threshold)
elif self.pruning_method in ["threshold", "sigmoied_threshold"]:
sig = "sigmoied" in self.pruning_method
mask = ThresholdBinarizer.apply(self.mask_scores, threshold, sig)
elif self.pruning_method == "magnitude":
mask = MagnitudeBinarizer.apply(self.weight, threshold)
elif self.pruning_method == "l0":
l, r, b = -0.1, 1.1, 2 / 3
if self.training:
u = torch.zeros_like(self.mask_scores).uniform_().clamp(0.0001, 0.9999)
s = torch.sigmoid((u.log() - (1 - u).log() + self.mask_scores) / b)
else:
s = torch.sigmoid(self.mask_scores)
s_bar = s * (r - l) + l
mask = s_bar.clamp(min=0.0, max=1.0)
# Mask weights with computed mask
weight_thresholded = mask * self.weight
# Compute output (linear layer) with masked weights
return F.linear(input, weight_thresholded, self.bias)

View File

@@ -0,0 +1,924 @@
# coding=utf-8
# Copyright 2018 The Google AI Language Team Authors and The HuggingFace Inc. team.
# Copyright (c) 2018, NVIDIA CORPORATION. All rights reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
""" Fine-pruning Masked BERT on sequence classification on GLUE."""
import argparse
import glob
import json
import logging
import os
import random
import numpy as np
import torch
import torch.nn as nn
import torch.nn.functional as F
from torch.utils.data import DataLoader, RandomSampler, SequentialSampler, TensorDataset
from torch.utils.data.distributed import DistributedSampler
from tqdm import tqdm, trange
from emmental import MaskedBertConfig, MaskedBertForSequenceClassification
from transformers import (
WEIGHTS_NAME,
AdamW,
BertConfig,
BertForSequenceClassification,
BertTokenizer,
get_linear_schedule_with_warmup,
)
from transformers import glue_compute_metrics as compute_metrics
from transformers import glue_convert_examples_to_features as convert_examples_to_features
from transformers import glue_output_modes as output_modes
from transformers import glue_processors as processors
try:
from torch.utils.tensorboard import SummaryWriter
except ImportError:
from tensorboardX import SummaryWriter
logger = logging.getLogger(__name__)
MODEL_CLASSES = {
"bert": (BertConfig, BertForSequenceClassification, BertTokenizer),
"masked_bert": (MaskedBertConfig, MaskedBertForSequenceClassification, BertTokenizer),
}
def set_seed(args):
random.seed(args.seed)
np.random.seed(args.seed)
torch.manual_seed(args.seed)
if args.n_gpu > 0:
torch.cuda.manual_seed_all(args.seed)
def schedule_threshold(
step: int,
total_step: int,
warmup_steps: int,
initial_threshold: float,
final_threshold: float,
initial_warmup: int,
final_warmup: int,
final_lambda: float,
):
if step <= initial_warmup * warmup_steps:
threshold = initial_threshold
elif step > (total_step - final_warmup * warmup_steps):
threshold = final_threshold
else:
spars_warmup_steps = initial_warmup * warmup_steps
spars_schedu_steps = (final_warmup + initial_warmup) * warmup_steps
mul_coeff = 1 - (step - spars_warmup_steps) / (total_step - spars_schedu_steps)
threshold = final_threshold + (initial_threshold - final_threshold) * (mul_coeff ** 3)
regu_lambda = final_lambda * threshold / final_threshold
return threshold, regu_lambda
def regularization(model: nn.Module, mode: str):
regu, counter = 0, 0
for name, param in model.named_parameters():
if "mask_scores" in name:
if mode == "l1":
regu += torch.norm(torch.sigmoid(param), p=1) / param.numel()
elif mode == "l0":
regu += torch.sigmoid(param - 2 / 3 * np.log(0.1 / 1.1)).sum() / param.numel()
else:
ValueError("Don't know this mode.")
counter += 1
return regu / counter
def train(args, train_dataset, model, tokenizer, teacher=None):
""" Train the model """
if args.local_rank in [-1, 0]:
tb_writer = SummaryWriter(log_dir=args.output_dir)
args.train_batch_size = args.per_gpu_train_batch_size * max(1, args.n_gpu)
train_sampler = RandomSampler(train_dataset) if args.local_rank == -1 else DistributedSampler(train_dataset)
train_dataloader = DataLoader(train_dataset, sampler=train_sampler, batch_size=args.train_batch_size)
if args.max_steps > 0:
t_total = args.max_steps
args.num_train_epochs = args.max_steps // (len(train_dataloader) // args.gradient_accumulation_steps) + 1
else:
t_total = len(train_dataloader) // args.gradient_accumulation_steps * args.num_train_epochs
# Prepare optimizer and schedule (linear warmup and decay)
no_decay = ["bias", "LayerNorm.weight"]
optimizer_grouped_parameters = [
{
"params": [p for n, p in model.named_parameters() if "mask_score" in n and p.requires_grad],
"lr": args.mask_scores_learning_rate,
},
{
"params": [
p
for n, p in model.named_parameters()
if "mask_score" not in n and p.requires_grad and not any(nd in n for nd in no_decay)
],
"lr": args.learning_rate,
"weight_decay": args.weight_decay,
},
{
"params": [
p
for n, p in model.named_parameters()
if "mask_score" not in n and p.requires_grad and any(nd in n for nd in no_decay)
],
"lr": args.learning_rate,
"weight_decay": 0.0,
},
]
optimizer = AdamW(optimizer_grouped_parameters, lr=args.learning_rate, eps=args.adam_epsilon)
scheduler = get_linear_schedule_with_warmup(
optimizer, num_warmup_steps=args.warmup_steps, num_training_steps=t_total
)
# Check if saved optimizer or scheduler states exist
if os.path.isfile(os.path.join(args.model_name_or_path, "optimizer.pt")) and os.path.isfile(
os.path.join(args.model_name_or_path, "scheduler.pt")
):
# Load in optimizer and scheduler states
optimizer.load_state_dict(torch.load(os.path.join(args.model_name_or_path, "optimizer.pt")))
scheduler.load_state_dict(torch.load(os.path.join(args.model_name_or_path, "scheduler.pt")))
if args.fp16:
try:
from apex import amp
except ImportError:
raise ImportError("Please install apex from https://www.github.com/nvidia/apex to use fp16 training.")
model, optimizer = amp.initialize(model, optimizer, opt_level=args.fp16_opt_level)
# multi-gpu training (should be after apex fp16 initialization)
if args.n_gpu > 1:
model = torch.nn.DataParallel(model)
# Distributed training (should be after apex fp16 initialization)
if args.local_rank != -1:
model = torch.nn.parallel.DistributedDataParallel(
model, device_ids=[args.local_rank], output_device=args.local_rank, find_unused_parameters=True,
)
# Train!
logger.info("***** Running training *****")
logger.info(" Num examples = %d", len(train_dataset))
logger.info(" Num Epochs = %d", args.num_train_epochs)
logger.info(" Instantaneous batch size per GPU = %d", args.per_gpu_train_batch_size)
logger.info(
" Total train batch size (w. parallel, distributed & accumulation) = %d",
args.train_batch_size
* args.gradient_accumulation_steps
* (torch.distributed.get_world_size() if args.local_rank != -1 else 1),
)
logger.info(" Gradient Accumulation steps = %d", args.gradient_accumulation_steps)
logger.info(" Total optimization steps = %d", t_total)
# Distillation
if teacher is not None:
logger.info(" Training with distillation")
global_step = 0
# Global TopK
if args.global_topk:
threshold_mem = None
epochs_trained = 0
steps_trained_in_current_epoch = 0
# Check if continuing training from a checkpoint
if os.path.exists(args.model_name_or_path):
# set global_step to global_step of last saved checkpoint from model path
try:
global_step = int(args.model_name_or_path.split("-")[-1].split("/")[0])
except ValueError:
global_step = 0
epochs_trained = global_step // (len(train_dataloader) // args.gradient_accumulation_steps)
steps_trained_in_current_epoch = global_step % (len(train_dataloader) // args.gradient_accumulation_steps)
logger.info(" Continuing training from checkpoint, will skip to saved global_step")
logger.info(" Continuing training from epoch %d", epochs_trained)
logger.info(" Continuing training from global step %d", global_step)
logger.info(" Will skip the first %d steps in the first epoch", steps_trained_in_current_epoch)
tr_loss, logging_loss = 0.0, 0.0
model.zero_grad()
train_iterator = trange(
epochs_trained, int(args.num_train_epochs), desc="Epoch", disable=args.local_rank not in [-1, 0],
)
set_seed(args) # Added here for reproductibility
for _ in train_iterator:
epoch_iterator = tqdm(train_dataloader, desc="Iteration", disable=args.local_rank not in [-1, 0])
for step, batch in enumerate(epoch_iterator):
# Skip past any already trained steps if resuming training
if steps_trained_in_current_epoch > 0:
steps_trained_in_current_epoch -= 1
continue
model.train()
batch = tuple(t.to(args.device) for t in batch)
threshold, regu_lambda = schedule_threshold(
step=global_step,
total_step=t_total,
warmup_steps=args.warmup_steps,
final_threshold=args.final_threshold,
initial_threshold=args.initial_threshold,
final_warmup=args.final_warmup,
initial_warmup=args.initial_warmup,
final_lambda=args.final_lambda,
)
# Global TopK
if args.global_topk:
if threshold == 1.0:
threshold = -1e2 # Or an indefinitely low quantity
else:
if (threshold_mem is None) or (global_step % args.global_topk_frequency_compute == 0):
# Sort all the values to get the global topK
concat = torch.cat(
[param.view(-1) for name, param in model.named_parameters() if "mask_scores" in name]
)
n = concat.numel()
kth = max(n - (int(n * threshold) + 1), 1)
threshold_mem = concat.kthvalue(kth).values.item()
threshold = threshold_mem
else:
threshold = threshold_mem
inputs = {"input_ids": batch[0], "attention_mask": batch[1], "labels": batch[3]}
if args.model_type != "distilbert":
inputs["token_type_ids"] = (
batch[2] if args.model_type in ["bert", "masked_bert", "xlnet", "albert"] else None
) # XLM, DistilBERT, RoBERTa, and XLM-RoBERTa don't use segment_ids
if "masked" in args.model_type:
inputs["threshold"] = threshold
outputs = model(**inputs)
loss, logits_stu = outputs # model outputs are always tuple in transformers (see doc)
# Distillation loss
if teacher is not None:
if "token_type_ids" not in inputs:
inputs["token_type_ids"] = None if args.teacher_type == "xlm" else batch[2]
with torch.no_grad():
(logits_tea,) = teacher(
input_ids=inputs["input_ids"],
token_type_ids=inputs["token_type_ids"],
attention_mask=inputs["attention_mask"],
)
loss_logits = F.kl_div(
input=F.log_softmax(logits_stu / args.temperature, dim=-1),
target=F.softmax(logits_tea / args.temperature, dim=-1),
reduction="batchmean",
) * (args.temperature ** 2)
loss = args.alpha_distil * loss_logits + args.alpha_ce * loss
# Regularization
if args.regularization is not None:
regu_ = regularization(model=model, mode=args.regularization)
loss = loss + regu_lambda * regu_
if args.n_gpu > 1:
loss = loss.mean() # mean() to average on multi-gpu parallel training
if args.gradient_accumulation_steps > 1:
loss = loss / args.gradient_accumulation_steps
if args.fp16:
with amp.scale_loss(loss, optimizer) as scaled_loss:
scaled_loss.backward()
else:
loss.backward()
tr_loss += loss.item()
if (step + 1) % args.gradient_accumulation_steps == 0 or (
# last step in epoch but step is always smaller than gradient_accumulation_steps
len(epoch_iterator) <= args.gradient_accumulation_steps
and (step + 1) == len(epoch_iterator)
):
if args.fp16:
torch.nn.utils.clip_grad_norm_(amp.master_params(optimizer), args.max_grad_norm)
else:
torch.nn.utils.clip_grad_norm_(model.parameters(), args.max_grad_norm)
if args.local_rank in [-1, 0] and args.logging_steps > 0 and global_step % args.logging_steps == 0:
tb_writer.add_scalar("threshold", threshold, global_step)
for name, param in model.named_parameters():
if not param.requires_grad:
continue
tb_writer.add_scalar("parameter_mean/" + name, param.data.mean(), global_step)
tb_writer.add_scalar("parameter_std/" + name, param.data.std(), global_step)
tb_writer.add_scalar("parameter_min/" + name, param.data.min(), global_step)
tb_writer.add_scalar("parameter_max/" + name, param.data.max(), global_step)
tb_writer.add_scalar("grad_mean/" + name, param.grad.data.mean(), global_step)
tb_writer.add_scalar("grad_std/" + name, param.grad.data.std(), global_step)
if args.regularization is not None and "mask_scores" in name:
if args.regularization == "l1":
perc = (torch.sigmoid(param) > threshold).sum().item() / param.numel()
elif args.regularization == "l0":
perc = (torch.sigmoid(param - 2 / 3 * np.log(0.1 / 1.1))).sum().item() / param.numel()
tb_writer.add_scalar("retained_weights_perc/" + name, perc, global_step)
optimizer.step()
scheduler.step() # Update learning rate schedule
model.zero_grad()
global_step += 1
if args.local_rank in [-1, 0] and args.logging_steps > 0 and global_step % args.logging_steps == 0:
logs = {}
if (
args.local_rank == -1 and args.evaluate_during_training
): # Only evaluate when single GPU otherwise metrics may not average well
results = evaluate(args, model, tokenizer)
for key, value in results.items():
eval_key = "eval_{}".format(key)
logs[eval_key] = value
loss_scalar = (tr_loss - logging_loss) / args.logging_steps
learning_rate_scalar = scheduler.get_lr()
logs["learning_rate"] = learning_rate_scalar[0]
if len(learning_rate_scalar) > 1:
for idx, lr in enumerate(learning_rate_scalar[1:]):
logs[f"learning_rate/{idx+1}"] = lr
logs["loss"] = loss_scalar
if teacher is not None:
logs["loss/distil"] = loss_logits.item()
if args.regularization is not None:
logs["loss/regularization"] = regu_.item()
if (teacher is not None) or (args.regularization is not None):
if (teacher is not None) and (args.regularization is not None):
logs["loss/instant_ce"] = (
loss.item()
- regu_lambda * logs["loss/regularization"]
- args.alpha_distil * logs["loss/distil"]
) / args.alpha_ce
elif teacher is not None:
logs["loss/instant_ce"] = (
loss.item() - args.alpha_distil * logs["loss/distil"]
) / args.alpha_ce
else:
logs["loss/instant_ce"] = loss.item() - regu_lambda * logs["loss/regularization"]
logging_loss = tr_loss
for key, value in logs.items():
tb_writer.add_scalar(key, value, global_step)
print(json.dumps({**logs, **{"step": global_step}}))
if args.local_rank in [-1, 0] and args.save_steps > 0 and global_step % args.save_steps == 0:
# Save model checkpoint
output_dir = os.path.join(args.output_dir, "checkpoint-{}".format(global_step))
if not os.path.exists(output_dir):
os.makedirs(output_dir)
model_to_save = (
model.module if hasattr(model, "module") else model
) # Take care of distributed/parallel training
model_to_save.save_pretrained(output_dir)
tokenizer.save_pretrained(output_dir)
torch.save(args, os.path.join(output_dir, "training_args.bin"))
logger.info("Saving model checkpoint to %s", output_dir)
torch.save(optimizer.state_dict(), os.path.join(output_dir, "optimizer.pt"))
torch.save(scheduler.state_dict(), os.path.join(output_dir, "scheduler.pt"))
logger.info("Saving optimizer and scheduler states to %s", output_dir)
if args.max_steps > 0 and global_step > args.max_steps:
epoch_iterator.close()
break
if args.max_steps > 0 and global_step > args.max_steps:
train_iterator.close()
break
if args.local_rank in [-1, 0]:
tb_writer.close()
return global_step, tr_loss / global_step
def evaluate(args, model, tokenizer, prefix=""):
# Loop to handle MNLI double evaluation (matched, mis-matched)
eval_task_names = ("mnli", "mnli-mm") if args.task_name == "mnli" else (args.task_name,)
eval_outputs_dirs = (args.output_dir, args.output_dir + "/MM") if args.task_name == "mnli" else (args.output_dir,)
results = {}
for eval_task, eval_output_dir in zip(eval_task_names, eval_outputs_dirs):
eval_dataset = load_and_cache_examples(args, eval_task, tokenizer, evaluate=True)
if not os.path.exists(eval_output_dir) and args.local_rank in [-1, 0]:
os.makedirs(eval_output_dir)
args.eval_batch_size = args.per_gpu_eval_batch_size * max(1, args.n_gpu)
# Note that DistributedSampler samples randomly
eval_sampler = SequentialSampler(eval_dataset)
eval_dataloader = DataLoader(eval_dataset, sampler=eval_sampler, batch_size=args.eval_batch_size)
# multi-gpu eval
if args.n_gpu > 1 and not isinstance(model, torch.nn.DataParallel):
model = torch.nn.DataParallel(model)
# Eval!
logger.info("***** Running evaluation {} *****".format(prefix))
logger.info(" Num examples = %d", len(eval_dataset))
logger.info(" Batch size = %d", args.eval_batch_size)
eval_loss = 0.0
nb_eval_steps = 0
preds = None
out_label_ids = None
# Global TopK
if args.global_topk:
threshold_mem = None
for batch in tqdm(eval_dataloader, desc="Evaluating"):
model.eval()
batch = tuple(t.to(args.device) for t in batch)
with torch.no_grad():
inputs = {"input_ids": batch[0], "attention_mask": batch[1], "labels": batch[3]}
if args.model_type != "distilbert":
inputs["token_type_ids"] = (
batch[2] if args.model_type in ["bert", "masked_bert", "xlnet", "albert"] else None
) # XLM, DistilBERT, RoBERTa, and XLM-RoBERTa don't use segment_ids
if "masked" in args.model_type:
inputs["threshold"] = args.final_threshold
if args.global_topk:
if threshold_mem is None:
concat = torch.cat(
[param.view(-1) for name, param in model.named_parameters() if "mask_scores" in name]
)
n = concat.numel()
kth = max(n - (int(n * args.final_threshold) + 1), 1)
threshold_mem = concat.kthvalue(kth).values.item()
inputs["threshold"] = threshold_mem
outputs = model(**inputs)
tmp_eval_loss, logits = outputs[:2]
eval_loss += tmp_eval_loss.mean().item()
nb_eval_steps += 1
if preds is None:
preds = logits.detach().cpu().numpy()
out_label_ids = inputs["labels"].detach().cpu().numpy()
else:
preds = np.append(preds, logits.detach().cpu().numpy(), axis=0)
out_label_ids = np.append(out_label_ids, inputs["labels"].detach().cpu().numpy(), axis=0)
eval_loss = eval_loss / nb_eval_steps
if args.output_mode == "classification":
from scipy.special import softmax
probs = softmax(preds, axis=-1)
entropy = np.exp((-probs * np.log(probs)).sum(axis=-1).mean())
preds = np.argmax(preds, axis=1)
elif args.output_mode == "regression":
preds = np.squeeze(preds)
result = compute_metrics(eval_task, preds, out_label_ids)
results.update(result)
if entropy is not None:
result["eval_avg_entropy"] = entropy
output_eval_file = os.path.join(eval_output_dir, prefix, "eval_results.txt")
with open(output_eval_file, "w") as writer:
logger.info("***** Eval results {} *****".format(prefix))
for key in sorted(result.keys()):
logger.info(" %s = %s", key, str(result[key]))
writer.write("%s = %s\n" % (key, str(result[key])))
return results
def load_and_cache_examples(args, task, tokenizer, evaluate=False):
if args.local_rank not in [-1, 0] and not evaluate:
torch.distributed.barrier() # Make sure only the first process in distributed training process the dataset, and the others will use the cache
processor = processors[task]()
output_mode = output_modes[task]
# Load data features from cache or dataset file
cached_features_file = os.path.join(
args.data_dir,
"cached_{}_{}_{}_{}".format(
"dev" if evaluate else "train",
list(filter(None, args.model_name_or_path.split("/"))).pop(),
str(args.max_seq_length),
str(task),
),
)
if os.path.exists(cached_features_file) and not args.overwrite_cache:
logger.info("Loading features from cached file %s", cached_features_file)
features = torch.load(cached_features_file)
else:
logger.info("Creating features from dataset file at %s", args.data_dir)
label_list = processor.get_labels()
if task in ["mnli", "mnli-mm"] and args.model_type in ["roberta", "xlmroberta"]:
# HACK(label indices are swapped in RoBERTa pretrained model)
label_list[1], label_list[2] = label_list[2], label_list[1]
examples = (
processor.get_dev_examples(args.data_dir) if evaluate else processor.get_train_examples(args.data_dir)
)
features = convert_examples_to_features(
examples, tokenizer, max_length=args.max_seq_length, label_list=label_list, output_mode=output_mode,
)
if args.local_rank in [-1, 0]:
logger.info("Saving features into cached file %s", cached_features_file)
torch.save(features, cached_features_file)
if args.local_rank == 0 and not evaluate:
torch.distributed.barrier() # Make sure only the first process in distributed training process the dataset, and the others will use the cache
# Convert to Tensors and build dataset
all_input_ids = torch.tensor([f.input_ids for f in features], dtype=torch.long)
all_attention_mask = torch.tensor([f.attention_mask for f in features], dtype=torch.long)
all_token_type_ids = torch.tensor([f.token_type_ids for f in features], dtype=torch.long)
if output_mode == "classification":
all_labels = torch.tensor([f.label for f in features], dtype=torch.long)
elif output_mode == "regression":
all_labels = torch.tensor([f.label for f in features], dtype=torch.float)
dataset = TensorDataset(all_input_ids, all_attention_mask, all_token_type_ids, all_labels)
return dataset
def main():
parser = argparse.ArgumentParser()
# Required parameters
parser.add_argument(
"--data_dir",
default=None,
type=str,
required=True,
help="The input data dir. Should contain the .tsv files (or other data files) for the task.",
)
parser.add_argument(
"--model_type",
default=None,
type=str,
required=True,
help="Model type selected in the list: " + ", ".join(MODEL_CLASSES.keys()),
)
parser.add_argument(
"--model_name_or_path",
default=None,
type=str,
required=True,
help="Path to pretrained model or model identifier from huggingface.co/models",
)
parser.add_argument(
"--task_name",
default=None,
type=str,
required=True,
help="The name of the task to train selected in the list: " + ", ".join(processors.keys()),
)
parser.add_argument(
"--output_dir",
default=None,
type=str,
required=True,
help="The output directory where the model predictions and checkpoints will be written.",
)
# Other parameters
parser.add_argument(
"--config_name", default="", type=str, help="Pretrained config name or path if not the same as model_name",
)
parser.add_argument(
"--tokenizer_name",
default="",
type=str,
help="Pretrained tokenizer name or path if not the same as model_name",
)
parser.add_argument(
"--cache_dir",
default="",
type=str,
help="Where do you want to store the pre-trained models downloaded from s3",
)
parser.add_argument(
"--max_seq_length",
default=128,
type=int,
help="The maximum total input sequence length after tokenization. Sequences longer "
"than this will be truncated, sequences shorter will be padded.",
)
parser.add_argument("--do_train", action="store_true", help="Whether to run training.")
parser.add_argument("--do_eval", action="store_true", help="Whether to run eval on the dev set.")
parser.add_argument(
"--evaluate_during_training", action="store_true", help="Run evaluation during training at each logging step.",
)
parser.add_argument(
"--do_lower_case", action="store_true", help="Set this flag if you are using an uncased model.",
)
parser.add_argument(
"--per_gpu_train_batch_size", default=8, type=int, help="Batch size per GPU/CPU for training.",
)
parser.add_argument(
"--per_gpu_eval_batch_size", default=8, type=int, help="Batch size per GPU/CPU for evaluation.",
)
parser.add_argument("--learning_rate", default=5e-5, type=float, help="The initial learning rate for Adam.")
# Pruning parameters
parser.add_argument(
"--mask_scores_learning_rate",
default=1e-2,
type=float,
help="The Adam initial learning rate of the mask scores.",
)
parser.add_argument(
"--initial_threshold", default=1.0, type=float, help="Initial value of the threshold (for scheduling)."
)
parser.add_argument(
"--final_threshold", default=0.7, type=float, help="Final value of the threshold (for scheduling)."
)
parser.add_argument(
"--initial_warmup",
default=1,
type=int,
help="Run `initial_warmup` * `warmup_steps` steps of threshold warmup during which threshold stays"
"at its `initial_threshold` value (sparsity schedule).",
)
parser.add_argument(
"--final_warmup",
default=2,
type=int,
help="Run `final_warmup` * `warmup_steps` steps of threshold cool-down during which threshold stays"
"at its final_threshold value (sparsity schedule).",
)
parser.add_argument(
"--pruning_method",
default="topK",
type=str,
help="Pruning Method (l0 = L0 regularization, magnitude = Magnitude pruning, topK = Movement pruning, sigmoied_threshold = Soft movement pruning).",
)
parser.add_argument(
"--mask_init",
default="constant",
type=str,
help="Initialization method for the mask scores. Choices: constant, uniform, kaiming.",
)
parser.add_argument(
"--mask_scale", default=0.0, type=float, help="Initialization parameter for the chosen initialization method."
)
parser.add_argument("--regularization", default=None, help="Add L0 or L1 regularization to the mask scores.")
parser.add_argument(
"--final_lambda",
default=0.0,
type=float,
help="Regularization intensity (used in conjunction with `regulariation`.",
)
parser.add_argument("--global_topk", action="store_true", help="Global TopK on the Scores.")
parser.add_argument(
"--global_topk_frequency_compute",
default=25,
type=int,
help="Frequency at which we compute the TopK global threshold.",
)
# Distillation parameters (optional)
parser.add_argument(
"--teacher_type",
default=None,
type=str,
help="Teacher type. Teacher tokenizer and student (model) tokenizer must output the same tokenization. Only for distillation.",
)
parser.add_argument(
"--teacher_name_or_path",
default=None,
type=str,
help="Path to the already fine-tuned teacher model. Only for distillation.",
)
parser.add_argument(
"--alpha_ce", default=0.5, type=float, help="Cross entropy loss linear weight. Only for distillation."
)
parser.add_argument(
"--alpha_distil", default=0.5, type=float, help="Distillation loss linear weight. Only for distillation."
)
parser.add_argument(
"--temperature", default=2.0, type=float, help="Distillation temperature. Only for distillation."
)
parser.add_argument(
"--gradient_accumulation_steps",
type=int,
default=1,
help="Number of updates steps to accumulate before performing a backward/update pass.",
)
parser.add_argument("--weight_decay", default=0.0, type=float, help="Weight decay if we apply some.")
parser.add_argument("--adam_epsilon", default=1e-8, type=float, help="Epsilon for Adam optimizer.")
parser.add_argument("--max_grad_norm", default=1.0, type=float, help="Max gradient norm.")
parser.add_argument(
"--num_train_epochs", default=3.0, type=float, help="Total number of training epochs to perform.",
)
parser.add_argument(
"--max_steps",
default=-1,
type=int,
help="If > 0: set total number of training steps to perform. Override num_train_epochs.",
)
parser.add_argument("--warmup_steps", default=0, type=int, help="Linear warmup over warmup_steps.")
parser.add_argument("--logging_steps", type=int, default=50, help="Log every X updates steps.")
parser.add_argument("--save_steps", type=int, default=50, help="Save checkpoint every X updates steps.")
parser.add_argument(
"--eval_all_checkpoints",
action="store_true",
help="Evaluate all checkpoints starting with the same prefix as model_name ending and ending with step number",
)
parser.add_argument("--no_cuda", action="store_true", help="Avoid using CUDA when available")
parser.add_argument(
"--overwrite_output_dir", action="store_true", help="Overwrite the content of the output directory",
)
parser.add_argument(
"--overwrite_cache", action="store_true", help="Overwrite the cached training and evaluation sets",
)
parser.add_argument("--seed", type=int, default=42, help="random seed for initialization")
parser.add_argument(
"--fp16",
action="store_true",
help="Whether to use 16-bit (mixed) precision (through NVIDIA apex) instead of 32-bit",
)
parser.add_argument(
"--fp16_opt_level",
type=str,
default="O1",
help="For fp16: Apex AMP optimization level selected in ['O0', 'O1', 'O2', and 'O3']."
"See details at https://nvidia.github.io/apex/amp.html",
)
parser.add_argument("--local_rank", type=int, default=-1, help="For distributed training: local_rank")
args = parser.parse_args()
# Regularization
if args.regularization == "null":
args.regularization = None
if (
os.path.exists(args.output_dir)
and os.listdir(args.output_dir)
and args.do_train
and not args.overwrite_output_dir
):
raise ValueError(
f"Output directory ({args.output_dir}) already exists and is not empty. Use --overwrite_output_dir to overcome."
)
# Setup CUDA, GPU & distributed training
if args.local_rank == -1 or args.no_cuda:
device = torch.device("cuda" if torch.cuda.is_available() and not args.no_cuda else "cpu")
args.n_gpu = 0 if args.no_cuda else torch.cuda.device_count()
else: # Initializes the distributed backend which will take care of sychronizing nodes/GPUs
torch.cuda.set_device(args.local_rank)
device = torch.device("cuda", args.local_rank)
torch.distributed.init_process_group(backend="nccl")
args.n_gpu = 1
args.device = device
# Setup logging
logging.basicConfig(
format="%(asctime)s - %(levelname)s - %(name)s - %(message)s",
datefmt="%m/%d/%Y %H:%M:%S",
level=logging.INFO if args.local_rank in [-1, 0] else logging.WARN,
)
logger.warning(
"Process rank: %s, device: %s, n_gpu: %s, distributed training: %s, 16-bits training: %s",
args.local_rank,
device,
args.n_gpu,
bool(args.local_rank != -1),
args.fp16,
)
# Set seed
set_seed(args)
# Prepare GLUE task
args.task_name = args.task_name.lower()
if args.task_name not in processors:
raise ValueError("Task not found: %s" % (args.task_name))
processor = processors[args.task_name]()
args.output_mode = output_modes[args.task_name]
label_list = processor.get_labels()
num_labels = len(label_list)
# Load pretrained model and tokenizer
if args.local_rank not in [-1, 0]:
torch.distributed.barrier() # Make sure only the first process in distributed training will download model & vocab
args.model_type = args.model_type.lower()
config_class, model_class, tokenizer_class = MODEL_CLASSES[args.model_type]
config = config_class.from_pretrained(
args.config_name if args.config_name else args.model_name_or_path,
num_labels=num_labels,
finetuning_task=args.task_name,
cache_dir=args.cache_dir if args.cache_dir else None,
pruning_method=args.pruning_method,
mask_init=args.mask_init,
mask_scale=args.mask_scale,
)
tokenizer = tokenizer_class.from_pretrained(
args.tokenizer_name if args.tokenizer_name else args.model_name_or_path,
cache_dir=args.cache_dir if args.cache_dir else None,
do_lower_case=args.do_lower_case,
)
model = model_class.from_pretrained(
args.model_name_or_path,
from_tf=bool(".ckpt" in args.model_name_or_path),
config=config,
cache_dir=args.cache_dir if args.cache_dir else None,
)
if args.teacher_type is not None:
assert args.teacher_name_or_path is not None
assert args.alpha_distil > 0.0
assert args.alpha_distil + args.alpha_ce > 0.0
teacher_config_class, teacher_model_class, _ = MODEL_CLASSES[args.teacher_type]
teacher_config = teacher_config_class.from_pretrained(args.teacher_name_or_path)
teacher = teacher_model_class.from_pretrained(
args.teacher_name_or_path,
from_tf=False,
config=teacher_config,
cache_dir=args.cache_dir if args.cache_dir else None,
)
teacher.to(args.device)
else:
teacher = None
if args.local_rank == 0:
torch.distributed.barrier() # Make sure only the first process in distributed training will download model & vocab
model.to(args.device)
logger.info("Training/evaluation parameters %s", args)
# Training
if args.do_train:
train_dataset = load_and_cache_examples(args, args.task_name, tokenizer, evaluate=False)
global_step, tr_loss = train(args, train_dataset, model, tokenizer, teacher=teacher)
logger.info(" global_step = %s, average loss = %s", global_step, tr_loss)
# Saving best-practices: if you use defaults names for the model, you can reload it using from_pretrained()
if args.do_train and (args.local_rank == -1 or torch.distributed.get_rank() == 0):
# Create output directory if needed
if not os.path.exists(args.output_dir) and args.local_rank in [-1, 0]:
os.makedirs(args.output_dir)
logger.info("Saving model checkpoint to %s", args.output_dir)
# Save a trained model, configuration and tokenizer using `save_pretrained()`.
# They can then be reloaded using `from_pretrained()`
model_to_save = (
model.module if hasattr(model, "module") else model
) # Take care of distributed/parallel training
model_to_save.save_pretrained(args.output_dir)
tokenizer.save_pretrained(args.output_dir)
# Good practice: save your training arguments together with the trained model
torch.save(args, os.path.join(args.output_dir, "training_args.bin"))
# Load a trained model and vocabulary that you have fine-tuned
model = model_class.from_pretrained(args.output_dir)
tokenizer = tokenizer_class.from_pretrained(args.output_dir, do_lower_case=args.do_lower_case)
model.to(args.device)
# Evaluation
results = {}
if args.do_eval and args.local_rank in [-1, 0]:
tokenizer = tokenizer_class.from_pretrained(args.output_dir, do_lower_case=args.do_lower_case)
checkpoints = [args.output_dir]
if args.eval_all_checkpoints:
checkpoints = list(
os.path.dirname(c) for c in sorted(glob.glob(args.output_dir + "/**/" + WEIGHTS_NAME, recursive=True))
)
logging.getLogger("transformers.modeling_utils").setLevel(logging.WARN) # Reduce logging
logger.info("Evaluate the following checkpoints: %s", checkpoints)
for checkpoint in checkpoints:
global_step = checkpoint.split("-")[-1] if len(checkpoints) > 1 else ""
prefix = checkpoint.split("/")[-1] if checkpoint.find("checkpoint") != -1 else ""
model = model_class.from_pretrained(checkpoint)
model.to(args.device)
result = evaluate(args, model, tokenizer, prefix=prefix)
result = dict((k + "_{}".format(global_step), v) for k, v in result.items())
results.update(result)
return results
if __name__ == "__main__":
main()

File diff suppressed because it is too large Load Diff

View File

@@ -0,0 +1,6 @@
torch>=1.4.0
-e git+https://github.com/huggingface/transformers.git@352d5472b0c1dec0f420d606d16747d851b4bda8#egg=transformers
knockknock>=0.1.8.1
h5py>=2.10.0
numpy>=1.18.2
scipy>=1.4.1

View File

@@ -19,7 +19,7 @@ python ./examples/multiple-choice/run_multiple_choice.py \
--max_seq_length 80 \
--output_dir models_bert/swag_base \
--per_gpu_eval_batch_size=16 \
--per_gpu_train_batch_size=16 \
--per_device_train_batch_size=16 \
--gradient_accumulation_steps 2 \
--overwrite_output
```
@@ -46,7 +46,7 @@ python ./examples/multiple-choice/run_tf_multiple_choice.py \
--max_seq_length 80 \
--output_dir models_bert/swag_base \
--per_gpu_eval_batch_size=16 \
--per_gpu_train_batch_size=16 \
--per_device_train_batch_size=16 \
--logging-dir logs \
--gradient_accumulation_steps 2 \
--overwrite_output

View File

@@ -159,7 +159,6 @@ def main():
max_seq_length=data_args.max_seq_length,
overwrite_cache=data_args.overwrite_cache,
mode=Split.train,
local_rank=training_args.local_rank,
)
if training_args.do_train
else None
@@ -172,7 +171,6 @@ def main():
max_seq_length=data_args.max_seq_length,
overwrite_cache=data_args.overwrite_cache,
mode=Split.dev,
local_rank=training_args.local_rank,
)
if training_args.do_eval
else None
@@ -204,19 +202,20 @@ def main():
# Evaluation
results = {}
if training_args.do_eval and training_args.local_rank in [-1, 0]:
if training_args.do_eval:
logger.info("*** Evaluate ***")
result = trainer.evaluate()
output_eval_file = os.path.join(training_args.output_dir, "eval_results.txt")
with open(output_eval_file, "w") as writer:
logger.info("***** Eval results *****")
for key, value in result.items():
logger.info(" %s = %s", key, value)
writer.write("%s = %s\n" % (key, value))
if trainer.is_world_master():
with open(output_eval_file, "w") as writer:
logger.info("***** Eval results *****")
for key, value in result.items():
logger.info(" %s = %s", key, value)
writer.write("%s = %s\n" % (key, value))
results.update(result)
results.update(result)
return results

View File

@@ -26,6 +26,7 @@ from enum import Enum
from typing import List, Optional
import tqdm
from filelock import FileLock
from transformers import PreTrainedTokenizer, is_tf_available, is_torch_available
@@ -77,7 +78,6 @@ class Split(Enum):
if is_torch_available():
import torch
from torch.utils.data.dataset import Dataset
from transformers import torch_distributed_zero_first
class MultipleChoiceDataset(Dataset):
"""
@@ -95,7 +95,6 @@ if is_torch_available():
max_seq_length: Optional[int] = None,
overwrite_cache=False,
mode: Split = Split.train,
local_rank=-1,
):
processor = processors[task]()
@@ -103,9 +102,11 @@ if is_torch_available():
data_dir,
"cached_{}_{}_{}_{}".format(mode.value, tokenizer.__class__.__name__, str(max_seq_length), task,),
)
with torch_distributed_zero_first(local_rank):
# Make sure only the first process in distributed training processes the dataset,
# and the others will use the cache.
# Make sure only the first process in distributed training processes the dataset,
# and the others will use the cache.
lock_path = cached_features_file + ".lock"
with FileLock(lock_path):
if os.path.exists(cached_features_file) and not overwrite_cache:
logger.info(f"Loading features from cached file {cached_features_file}")
@@ -130,9 +131,8 @@ if is_torch_available():
pad_token=tokenizer.pad_token_id,
pad_token_segment_id=tokenizer.pad_token_type_id,
)
if local_rank in [-1, 0]:
logger.info("Saving features into cached file %s", cached_features_file)
torch.save(self.features, cached_features_file)
logger.info("Saving features into cached file %s", cached_features_file)
torch.save(self.features, cached_features_file)
def __len__(self):
return len(self.features)
@@ -535,7 +535,12 @@ def convert_examples_to_features(
text_b = example.question + " " + ending
inputs = tokenizer.encode_plus(
text_a, text_b, add_special_tokens=True, max_length=max_length, pad_to_max_length=True,
text_a,
text_b,
add_special_tokens=True,
max_length=max_length,
pad_to_max_length=True,
return_overflowing_tokens=True,
)
if "num_truncated_tokens" in inputs and inputs["num_truncated_tokens"] > 0:
logger.info(

View File

@@ -28,6 +28,7 @@ python run_squad.py \
--model_name_or_path bert-base-uncased \
--do_train \
--do_eval \
--do_lower_case \
--train_file $SQUAD_DIR/train-v1.1.json \
--predict_file $SQUAD_DIR/dev-v1.1.json \
--per_gpu_train_batch_size 12 \
@@ -56,6 +57,7 @@ python -m torch.distributed.launch --nproc_per_node=8 ./examples/question-answer
--model_name_or_path bert-large-uncased-whole-word-masking \
--do_train \
--do_eval \
--do_lower_case \
--train_file $SQUAD_DIR/train-v1.1.json \
--predict_file $SQUAD_DIR/dev-v1.1.json \
--learning_rate 3e-5 \

View File

@@ -58,8 +58,6 @@ logger = logging.getLogger(__name__)
MODEL_CONFIG_CLASSES = list(MODEL_FOR_QUESTION_ANSWERING_MAPPING.keys())
MODEL_TYPES = tuple(conf.model_type for conf in MODEL_CONFIG_CLASSES)
ALL_MODELS = sum((tuple(conf.pretrained_config_archive_map.keys()) for conf in MODEL_CONFIG_CLASSES), (),)
def set_seed(args):
random.seed(args.seed)
@@ -491,7 +489,7 @@ def main():
default=None,
type=str,
required=True,
help="Path to pre-trained model or shortcut name selected in the list: " + ", ".join(ALL_MODELS),
help="Path to pretrained model or model identifier from huggingface.co/models",
)
parser.add_argument(
"--output_dir",

View File

@@ -6,3 +6,4 @@ sacrebleu
rouge-score
tensorflow_datasets
pytorch-lightning==0.7.3 # April 10, 2020 release
matplotlib

View File

@@ -21,7 +21,7 @@ def generate_summaries(
):
fout = Path(out_file).open("w")
model = BartForConditionalGeneration.from_pretrained(model_name).to(device)
tokenizer = BartTokenizer.from_pretrained("bart-large")
tokenizer = BartTokenizer.from_pretrained("facebook/bart-large")
max_length = 140
min_length = 55
@@ -54,7 +54,7 @@ def run_generate():
"output_path", type=str, help="where to save summaries",
)
parser.add_argument(
"model_name", type=str, default="bart-large-cnn", help="like bart-large-cnn",
"model_name", type=str, default="facebook/bart-large-cnn", help="like bart-large-cnn",
)
parser.add_argument(
"--device", type=str, required=False, default=DEFAULT_DEVICE, help="cuda, cuda:1, cpu etc.",

View File

@@ -129,7 +129,7 @@ class TestBartExamples(unittest.TestCase):
summaries = ["A very interesting story about what I ate for lunch.", "Avocado, celery, turkey, coffee"]
_dump_articles((tmp_dir / "train.source"), articles)
_dump_articles((tmp_dir / "train.target"), summaries)
tokenizer = BartTokenizer.from_pretrained("bart-large")
tokenizer = BartTokenizer.from_pretrained("facebook/bart-large")
max_len_source = max(len(tokenizer.encode(a)) for a in articles)
max_len_target = max(len(tokenizer.encode(a)) for a in summaries)
trunc_target = 4

View File

@@ -61,7 +61,6 @@ class BertAbsConfig(PretrainedConfig):
the decoder.
"""
pretrained_config_archive_map = BERTABS_FINETUNED_CONFIG_MAP
model_type = "bertabs"
def __init__(

View File

@@ -33,14 +33,13 @@ from transformers import BertConfig, BertModel, PreTrainedModel
MAX_SIZE = 5000
BERTABS_FINETUNED_MODEL_MAP = {
"bertabs-finetuned-cnndm": "https://cdn.huggingface.co/remi/bertabs-finetuned-cnndm-extractive-abstractive-summarization/pytorch_model.bin",
}
BERTABS_FINETUNED_MODEL_ARCHIVE_LIST = [
"remi/bertabs-finetuned-cnndm-extractive-abstractive-summarization",
]
class BertAbsPreTrainedModel(PreTrainedModel):
config_class = BertAbsConfig
pretrained_model_archive_map = BERTABS_FINETUNED_MODEL_MAP
load_tf_weights = False
base_model_prefix = "bert"

View File

@@ -61,8 +61,8 @@ class ExamplesTests(unittest.TestCase):
--do_train
--do_eval
--output_dir ./tests/fixtures/tests_samples/temp_dir
--per_gpu_train_batch_size=2
--per_gpu_eval_batch_size=1
--per_device_train_batch_size=2
--per_device_eval_batch_size=1
--learning_rate=1e-4
--max_steps=10
--warmup_steps=2

View File

@@ -68,7 +68,7 @@ python run_glue.py \
--do_eval \
--data_dir $GLUE_DIR/$TASK_NAME \
--max_seq_length 128 \
--per_gpu_train_batch_size 32 \
--per_device_train_batch_size 32 \
--learning_rate 2e-5 \
--num_train_epochs 3.0 \
--output_dir /tmp/$TASK_NAME/
@@ -141,7 +141,7 @@ python run_glue.py \
--do_eval \
--data_dir $GLUE_DIR/MRPC/ \
--max_seq_length 128 \
--per_gpu_train_batch_size 32 \
--per_device_train_batch_size 32 \
--learning_rate 2e-5 \
--num_train_epochs 3.0 \
--output_dir /tmp/mrpc_output/
@@ -166,7 +166,7 @@ python run_glue.py \
--do_eval \
--data_dir $GLUE_DIR/MRPC/ \
--max_seq_length 128 \
--per_gpu_train_batch_size 32 \
--per_device_train_batch_size 32 \
--learning_rate 2e-5 \
--num_train_epochs 3.0 \
--output_dir /tmp/mrpc_output/ \
@@ -189,7 +189,7 @@ python -m torch.distributed.launch \
--do_eval \
--data_dir $GLUE_DIR/MRPC/ \
--max_seq_length 128 \
--per_gpu_train_batch_size 8 \
--per_device_train_batch_size 8 \
--learning_rate 2e-5 \
--num_train_epochs 3.0 \
--output_dir /tmp/mrpc_output/
@@ -221,7 +221,7 @@ python -m torch.distributed.launch \
--do_eval \
--data_dir $GLUE_DIR/MNLI/ \
--max_seq_length 128 \
--per_gpu_train_batch_size 8 \
--per_device_train_batch_size 8 \
--learning_rate 2e-5 \
--num_train_epochs 3.0 \
--output_dir output_dir \
@@ -258,7 +258,7 @@ TEST RESULTS {'val_loss': tensor(0.0707), 'precision': 0.852427800698191, 'recal
Based on the script [`run_xnli.py`](https://github.com/huggingface/transformers/blob/master/examples/text-classification/run_xnli.py).
[XNLI](https://www.nyu.edu/projects/bowman/xnli/) is crowd-sourced dataset based on [MultiNLI](http://www.nyu.edu/projects/bowman/multinli/). It is an evaluation benchmark for cross-lingual text representations. Pairs of text are labeled with textual entailment annotations for 15 different languages (including both high-resource language such as English and low-resource languages such as Swahili).
[XNLI](https://www.nyu.edu/projects/bowman/xnli/) is a crowd-sourced dataset based on [MultiNLI](http://www.nyu.edu/projects/bowman/multinli/). It is an evaluation benchmark for cross-lingual text representations. Pairs of text are labeled with textual entailment annotations for 15 different languages (including both high-resource language such as English and low-resource languages such as Swahili).
#### Fine-tuning on XNLI
@@ -273,14 +273,13 @@ on a single tesla V100 16GB. The data for XNLI can be downloaded with the follow
export XNLI_DIR=/path/to/XNLI
python run_xnli.py \
--model_type bert \
--model_name_or_path bert-base-multilingual-cased \
--language de \
--train_language en \
--do_train \
--do_eval \
--data_dir $XNLI_DIR \
--per_gpu_train_batch_size 32 \
--per_device_train_batch_size 32 \
--learning_rate 5e-5 \
--num_train_epochs 2.0 \
--max_seq_length 128 \

View File

@@ -135,7 +135,8 @@ def main():
# Get datasets
train_dataset = GlueDataset(data_args, tokenizer=tokenizer) if training_args.do_train else None
eval_dataset = GlueDataset(data_args, tokenizer=tokenizer, evaluate=True) if training_args.do_eval else None
eval_dataset = GlueDataset(data_args, tokenizer=tokenizer, mode="dev") if training_args.do_eval else None
test_dataset = GlueDataset(data_args, tokenizer=tokenizer, mode="test") if training_args.do_predict else None
def compute_metrics(p: EvalPrediction) -> Dict:
if output_mode == "classification":
@@ -165,31 +166,57 @@ def main():
tokenizer.save_pretrained(training_args.output_dir)
# Evaluation
results = {}
if training_args.do_eval and training_args.local_rank in [-1, 0]:
eval_results = {}
if training_args.do_eval:
logger.info("*** Evaluate ***")
# Loop to handle MNLI double evaluation (matched, mis-matched)
eval_datasets = [eval_dataset]
if data_args.task_name == "mnli":
mnli_mm_data_args = dataclasses.replace(data_args, task_name="mnli-mm")
eval_datasets.append(GlueDataset(mnli_mm_data_args, tokenizer=tokenizer, evaluate=True))
eval_datasets.append(GlueDataset(mnli_mm_data_args, tokenizer=tokenizer, mode="dev"))
for eval_dataset in eval_datasets:
result = trainer.evaluate(eval_dataset=eval_dataset)
eval_result = trainer.evaluate(eval_dataset=eval_dataset)
output_eval_file = os.path.join(
training_args.output_dir, f"eval_results_{eval_dataset.args.task_name}.txt"
)
with open(output_eval_file, "w") as writer:
logger.info("***** Eval results {} *****".format(eval_dataset.args.task_name))
for key, value in result.items():
logger.info(" %s = %s", key, value)
writer.write("%s = %s\n" % (key, value))
if trainer.is_world_master():
with open(output_eval_file, "w") as writer:
logger.info("***** Eval results {} *****".format(eval_dataset.args.task_name))
for key, value in eval_result.items():
logger.info(" %s = %s", key, value)
writer.write("%s = %s\n" % (key, value))
results.update(result)
eval_results.update(eval_result)
return results
if training_args.do_predict:
logging.info("*** Test ***")
test_datasets = [test_dataset]
if data_args.task_name == "mnli":
mnli_mm_data_args = dataclasses.replace(data_args, task_name="mnli-mm")
test_datasets.append(GlueDataset(mnli_mm_data_args, tokenizer=tokenizer, mode="test"))
for test_dataset in test_datasets:
predictions = trainer.predict(test_dataset=test_dataset).predictions
if output_mode == "classification":
predictions = np.argmax(predictions, axis=1)
output_test_file = os.path.join(
training_args.output_dir, f"test_results_{test_dataset.args.task_name}.txt"
)
if trainer.is_world_master():
with open(output_test_file, "w") as writer:
logger.info("***** Test results {} *****".format(test_dataset.args.task_name))
writer.write("index\tprediction\n")
for index, item in enumerate(predictions):
if output_mode == "regression":
writer.write("%d\t%3.3f\n" % (index, item))
else:
item = test_dataset.get_labels()[item]
writer.write("%d\t%s\n" % (index, item))
return eval_results
def _mp_fn(index):

View File

@@ -13,7 +13,7 @@
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
""" Finetuning multi-lingual models on XNLI (Bert, DistilBERT, XLM).
""" Finetuning multi-lingual models on XNLI (e.g. Bert, DistilBERT, XLM).
Adapted from `examples/text-classification/run_glue.py`"""
@@ -32,15 +32,9 @@ from tqdm import tqdm, trange
from transformers import (
WEIGHTS_NAME,
AdamW,
BertConfig,
BertForSequenceClassification,
BertTokenizer,
DistilBertConfig,
DistilBertForSequenceClassification,
DistilBertTokenizer,
XLMConfig,
XLMForSequenceClassification,
XLMTokenizer,
AutoConfig,
AutoModelForSequenceClassification,
AutoTokenizer,
get_linear_schedule_with_warmup,
)
from transformers import glue_convert_examples_to_features as convert_examples_to_features
@@ -57,16 +51,6 @@ except ImportError:
logger = logging.getLogger(__name__)
ALL_MODELS = sum(
(tuple(conf.pretrained_config_archive_map.keys()) for conf in (BertConfig, DistilBertConfig, XLMConfig)), ()
)
MODEL_CLASSES = {
"bert": (BertConfig, BertForSequenceClassification, BertTokenizer),
"xlm": (XLMConfig, XLMForSequenceClassification, XLMTokenizer),
"distilbert": (DistilBertConfig, DistilBertForSequenceClassification, DistilBertTokenizer),
}
def set_seed(args):
random.seed(args.seed)
@@ -377,19 +361,12 @@ def main():
required=True,
help="The input data dir. Should contain the .tsv files (or other data files) for the task.",
)
parser.add_argument(
"--model_type",
default=None,
type=str,
required=True,
help="Model type selected in the list: " + ", ".join(MODEL_CLASSES.keys()),
)
parser.add_argument(
"--model_name_or_path",
default=None,
type=str,
required=True,
help="Path to pre-trained model or shortcut name selected in the list: " + ", ".join(ALL_MODELS),
help="Path to pretrained model or model identifier from huggingface.co/models",
)
parser.add_argument(
"--language",
@@ -421,7 +398,7 @@ def main():
)
parser.add_argument(
"--cache_dir",
default="",
default=None,
type=str,
help="Where do you want to store the pre-trained models downloaded from s3",
)
@@ -562,24 +539,23 @@ def main():
if args.local_rank not in [-1, 0]:
torch.distributed.barrier() # Make sure only the first process in distributed training will download model & vocab
args.model_type = args.model_type.lower()
config_class, model_class, tokenizer_class = MODEL_CLASSES[args.model_type]
config = config_class.from_pretrained(
config = AutoConfig.from_pretrained(
args.config_name if args.config_name else args.model_name_or_path,
num_labels=num_labels,
finetuning_task=args.task_name,
cache_dir=args.cache_dir if args.cache_dir else None,
cache_dir=args.cache_dir,
)
tokenizer = tokenizer_class.from_pretrained(
args.model_type = config.model_type
tokenizer = AutoTokenizer.from_pretrained(
args.tokenizer_name if args.tokenizer_name else args.model_name_or_path,
do_lower_case=args.do_lower_case,
cache_dir=args.cache_dir if args.cache_dir else None,
cache_dir=args.cache_dir,
)
model = model_class.from_pretrained(
model = AutoModelForSequenceClassification.from_pretrained(
args.model_name_or_path,
from_tf=bool(".ckpt" in args.model_name_or_path),
config=config,
cache_dir=args.cache_dir if args.cache_dir else None,
cache_dir=args.cache_dir,
)
if args.local_rank == 0:
@@ -614,14 +590,13 @@ def main():
torch.save(args, os.path.join(args.output_dir, "training_args.bin"))
# Load a trained model and vocabulary that you have fine-tuned
model = model_class.from_pretrained(args.output_dir)
tokenizer = tokenizer_class.from_pretrained(args.output_dir)
model = AutoModelForSequenceClassification.from_pretrained(args.output_dir)
tokenizer = AutoTokenizer.from_pretrained(args.output_dir)
model.to(args.device)
# Evaluation
results = {}
if args.do_eval and args.local_rank in [-1, 0]:
tokenizer = tokenizer_class.from_pretrained(args.output_dir, do_lower_case=args.do_lower_case)
checkpoints = [args.output_dir]
if args.eval_all_checkpoints:
checkpoints = list(
@@ -633,7 +608,7 @@ def main():
global_step = checkpoint.split("-")[-1] if len(checkpoints) > 1 else ""
prefix = checkpoint.split("/")[-1] if checkpoint.find("checkpoint") != -1 else ""
model = model_class.from_pretrained(checkpoint)
model = AutoModelForSequenceClassification.from_pretrained(checkpoint)
model.to(args.device)
result = evaluate(args, model, tokenizer, prefix=prefix)
result = dict((k + "_{}".format(global_step), v) for k, v in result.items())

View File

@@ -69,7 +69,7 @@ python3 run_ner.py --data_dir ./ \
--output_dir $OUTPUT_DIR \
--max_seq_length $MAX_LENGTH \
--num_train_epochs $NUM_EPOCHS \
--per_gpu_train_batch_size $BATCH_SIZE \
--per_device_train_batch_size $BATCH_SIZE \
--save_steps $SAVE_STEPS \
--seed $SEED \
--do_train \
@@ -91,7 +91,7 @@ Instead of passing all parameters via commandline arguments, the `run_ner.py` sc
"output_dir": "germeval-model",
"max_seq_length": 128,
"num_train_epochs": 3,
"per_gpu_train_batch_size": 32,
"per_device_train_batch_size": 32,
"save_steps": 750,
"seed": 1,
"do_train": true,

View File

@@ -13,7 +13,7 @@
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
""" Fine-tuning the library models for named entity recognition on CoNLL-2003 (Bert or Roberta). """
""" Fine-tuning the library models for named entity recognition on CoNLL-2003. """
import logging
@@ -171,7 +171,6 @@ def main():
max_seq_length=data_args.max_seq_length,
overwrite_cache=data_args.overwrite_cache,
mode=Split.train,
local_rank=training_args.local_rank,
)
if training_args.do_train
else None
@@ -185,7 +184,6 @@ def main():
max_seq_length=data_args.max_seq_length,
overwrite_cache=data_args.overwrite_cache,
mode=Split.dev,
local_rank=training_args.local_rank,
)
if training_args.do_eval
else None
@@ -237,22 +235,23 @@ def main():
# Evaluation
results = {}
if training_args.do_eval and training_args.local_rank in [-1, 0]:
if training_args.do_eval:
logger.info("*** Evaluate ***")
result = trainer.evaluate()
output_eval_file = os.path.join(training_args.output_dir, "eval_results.txt")
with open(output_eval_file, "w") as writer:
logger.info("***** Eval results *****")
for key, value in result.items():
logger.info(" %s = %s", key, value)
writer.write("%s = %s\n" % (key, value))
if trainer.is_world_master():
with open(output_eval_file, "w") as writer:
logger.info("***** Eval results *****")
for key, value in result.items():
logger.info(" %s = %s", key, value)
writer.write("%s = %s\n" % (key, value))
results.update(result)
# Predict
if training_args.do_predict and training_args.local_rank in [-1, 0]:
if training_args.do_predict:
test_dataset = NerDataset(
data_dir=data_args.data_dir,
tokenizer=tokenizer,
@@ -261,33 +260,36 @@ def main():
max_seq_length=data_args.max_seq_length,
overwrite_cache=data_args.overwrite_cache,
mode=Split.test,
local_rank=training_args.local_rank,
)
predictions, label_ids, metrics = trainer.predict(test_dataset)
preds_list, _ = align_predictions(predictions, label_ids)
output_test_results_file = os.path.join(training_args.output_dir, "test_results.txt")
with open(output_test_results_file, "w") as writer:
for key, value in metrics.items():
logger.info(" %s = %s", key, value)
writer.write("%s = %s\n" % (key, value))
if trainer.is_world_master():
with open(output_test_results_file, "w") as writer:
for key, value in metrics.items():
logger.info(" %s = %s", key, value)
writer.write("%s = %s\n" % (key, value))
# Save predictions
output_test_predictions_file = os.path.join(training_args.output_dir, "test_predictions.txt")
with open(output_test_predictions_file, "w") as writer:
with open(os.path.join(data_args.data_dir, "test.txt"), "r") as f:
example_id = 0
for line in f:
if line.startswith("-DOCSTART-") or line == "" or line == "\n":
writer.write(line)
if not preds_list[example_id]:
example_id += 1
elif preds_list[example_id]:
output_line = line.split()[0] + " " + preds_list[example_id].pop(0) + "\n"
writer.write(output_line)
else:
logger.warning("Maximum sequence length exceeded: No prediction for '%s'.", line.split()[0])
if trainer.is_world_master():
with open(output_test_predictions_file, "w") as writer:
with open(os.path.join(data_args.data_dir, "test.txt"), "r") as f:
example_id = 0
for line in f:
if line.startswith("-DOCSTART-") or line == "" or line == "\n":
writer.write(line)
if not preds_list[example_id]:
example_id += 1
elif preds_list[example_id]:
output_line = line.split()[0] + " " + preds_list[example_id].pop(0) + "\n"
writer.write(output_line)
else:
logger.warning(
"Maximum sequence length exceeded: No prediction for '%s'.", line.split()[0]
)
return results

View File

@@ -22,6 +22,8 @@ from dataclasses import dataclass
from enum import Enum
from typing import List, Optional, Union
from filelock import FileLock
from transformers import PreTrainedTokenizer, is_tf_available, is_torch_available
@@ -68,7 +70,6 @@ if is_torch_available():
import torch
from torch import nn
from torch.utils.data.dataset import Dataset
from transformers import torch_distributed_zero_first
class NerDataset(Dataset):
"""
@@ -90,16 +91,16 @@ if is_torch_available():
max_seq_length: Optional[int] = None,
overwrite_cache=False,
mode: Split = Split.train,
local_rank=-1,
):
# Load data features from cache or dataset file
cached_features_file = os.path.join(
data_dir, "cached_{}_{}_{}".format(mode.value, tokenizer.__class__.__name__, str(max_seq_length)),
)
with torch_distributed_zero_first(local_rank):
# Make sure only the first process in distributed training processes the dataset,
# and the others will use the cache.
# Make sure only the first process in distributed training processes the dataset,
# and the others will use the cache.
lock_path = cached_features_file + ".lock"
with FileLock(lock_path):
if os.path.exists(cached_features_file) and not overwrite_cache:
logger.info(f"Loading features from cached file {cached_features_file}")
@@ -125,9 +126,8 @@ if is_torch_available():
pad_token_segment_id=tokenizer.pad_token_type_id,
pad_token_label_id=self.pad_token_label_id,
)
if local_rank in [-1, 0]:
logger.info(f"Saving features into cached file {cached_features_file}")
torch.save(self.features, cached_features_file)
logger.info(f"Saving features into cached file {cached_features_file}")
torch.save(self.features, cached_features_file)
def __len__(self):
return len(self.features)

View File

@@ -0,0 +1,124 @@
## ParsBERT: Transformer-based Model for Persian Language Understanding
ParsBERT is a monolingual language model based on Googles BERT architecture with the same configurations as BERT-Base.
Paper presenting ParsBERT: [arXiv:2005.12515](https://arxiv.org/abs/2005.12515)
All the models (downstream tasks) are uncased and trained with whole word masking. (coming soon stay tuned)
## Persian NER [ARMAN, PEYMA, ARMAN+PEYMA]
This task aims to extract named entities in the text, such as names and label with appropriate `NER` classes such as locations, organizations, etc. The datasets used for this task contain sentences that are marked with `IOB` format. In this format, tokens that are not part of an entity are tagged as `”O”` the `”B”`tag corresponds to the first word of an object, and the `”I”` tag corresponds to the rest of the terms of the same entity. Both `”B”` and `”I”` tags are followed by a hyphen (or underscore), followed by the entity category. Therefore, the NER task is a multi-class token classification problem that labels the tokens upon being fed a raw text. There are two primary datasets used in Persian NER, `ARMAN`, and `PEYMA`. In ParsBERT, we prepared ner for both datasets as well as a combination of both datasets.
### PEYMA
PEYMA dataset includes 7,145 sentences with a total of 302,530 tokens from which 41,148 tokens are tagged with seven different classes.
1. Organization
2. Money
3. Location
4. Date
5. Time
6. Person
7. Percent
| Label | # |
|:------------:|:-----:|
| Organization | 16964 |
| Money | 2037 |
| Location | 8782 |
| Date | 4259 |
| Time | 732 |
| Person | 7675 |
| Percent | 699 |
**Download**
You can download the dataset from [here](http://nsurl.org/tasks/task-7-named-entity-recognition-ner-for-farsi/)
---
### ARMAN
ARMAN dataset holds 7,682 sentences with 250,015 sentences tagged over six different classes.
1. Organization
2. Location
3. Facility
4. Event
5. Product
6. Person
| Label | # |
|:------------:|:-----:|
| Organization | 30108 |
| Location | 12924 |
| Facility | 4458 |
| Event | 7557 |
| Product | 4389 |
| Person | 15645 |
**Download**
You can download the dataset from [here](https://github.com/HaniehP/PersianNER)
## Results
The following table summarizes the F1 score obtained by ParsBERT as compared to other models and architectures.
| Dataset | ParsBERT | MorphoBERT | Beheshti-NER | LSTM-CRF | Rule-Based CRF | BiLSTM-CRF |
|:---------------:|:--------:|:----------:|:--------------:|:----------:|:----------------:|:------------:|
| ARMAN + PEYMA | 95.13* | - | - | - | - | - |
| PEYMA | 98.79* | - | 90.59 | - | 84.00 | - |
| ARMAN | 93.10* | 89.9 | 84.03 | 86.55 | - | 77.45 |
## How to use :hugs:
| Notebook | Description | |
|:----------|:-------------|------:|
| [How to use Pipelines](https://github.com/hooshvare/parsbert-ner/blob/master/persian-ner-pipeline.ipynb) | Simple and efficient way to use State-of-the-Art models on downstream tasks through transformers | [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/hooshvare/parsbert-ner/blob/master/persian-ner-pipeline.ipynb) |
## Cite
Please cite the following paper in your publication if you are using [ParsBERT](https://arxiv.org/abs/2005.12515) in your research:
```markdown
@article{ParsBERT,
title={ParsBERT: Transformer-based Model for Persian Language Understanding},
author={Mehrdad Farahani, Mohammad Gharachorloo, Marzieh Farahani, Mohammad Manthouri},
journal={ArXiv},
year={2020},
volume={abs/2005.12515}
}
```
## Acknowledgments
We hereby, express our gratitude to the [Tensorflow Research Cloud (TFRC) program](https://tensorflow.org/tfrc) for providing us with the necessary computation resources. We also thank [Hooshvare](https://hooshvare.com) Research Group for facilitating dataset gathering and scraping online text resources.
## Contributors
- Mehrdad Farahani: [Linkedin](https://www.linkedin.com/in/m3hrdadfi/), [Twitter](https://twitter.com/m3hrdadfi), [Github](https://github.com/m3hrdadfi)
- Mohammad Gharachorloo: [Linkedin](https://www.linkedin.com/in/mohammad-gharachorloo/), [Twitter](https://twitter.com/MGharachorloo), [Github](https://github.com/baarsaam)
- Marzieh Farahani: [Linkedin](https://www.linkedin.com/in/marziehphi/), [Twitter](https://twitter.com/marziehphi), [Github](https://github.com/marziehphi)
- Mohammad Manthouri: [Linkedin](https://www.linkedin.com/in/mohammad-manthouri-aka-mansouri-07030766/), [Twitter](https://twitter.com/mmanthouri), [Github](https://github.com/mmanthouri)
- Hooshvare Team: [Official Website](https://hooshvare.com/), [Linkedin](https://www.linkedin.com/company/hooshvare), [Twitter](https://twitter.com/hooshvare), [Github](https://github.com/hooshvare), [Instagram](https://www.instagram.com/hooshvare/)
+ And a special thanks to Sara Tabrizi for her fantastic poster design. Follow her on: [Linkedin](https://www.linkedin.com/in/sara-tabrizi-64548b79/), [Behance](https://www.behance.net/saratabrizi), [Instagram](https://www.instagram.com/sara_b_tabrizi/)
## Releases
### Release v0.1 (May 29, 2019)
This is the first version of our ParsBERT NER!

View File

@@ -0,0 +1,124 @@
## ParsBERT: Transformer-based Model for Persian Language Understanding
ParsBERT is a monolingual language model based on Googles BERT architecture with the same configurations as BERT-Base.
Paper presenting ParsBERT: [arXiv:2005.12515](https://arxiv.org/abs/2005.12515)
All the models (downstream tasks) are uncased and trained with whole word masking. (coming soon stay tuned)
## Persian NER [ARMAN, PEYMA, ARMAN+PEYMA]
This task aims to extract named entities in the text, such as names and label with appropriate `NER` classes such as locations, organizations, etc. The datasets used for this task contain sentences that are marked with `IOB` format. In this format, tokens that are not part of an entity are tagged as `”O”` the `”B”`tag corresponds to the first word of an object, and the `”I”` tag corresponds to the rest of the terms of the same entity. Both `”B”` and `”I”` tags are followed by a hyphen (or underscore), followed by the entity category. Therefore, the NER task is a multi-class token classification problem that labels the tokens upon being fed a raw text. There are two primary datasets used in Persian NER, `ARMAN`, and `PEYMA`. In ParsBERT, we prepared ner for both datasets as well as a combination of both datasets.
### PEYMA
PEYMA dataset includes 7,145 sentences with a total of 302,530 tokens from which 41,148 tokens are tagged with seven different classes.
1. Organization
2. Money
3. Location
4. Date
5. Time
6. Person
7. Percent
| Label | # |
|:------------:|:-----:|
| Organization | 16964 |
| Money | 2037 |
| Location | 8782 |
| Date | 4259 |
| Time | 732 |
| Person | 7675 |
| Percent | 699 |
**Download**
You can download the dataset from [here](http://nsurl.org/tasks/task-7-named-entity-recognition-ner-for-farsi/)
---
### ARMAN
ARMAN dataset holds 7,682 sentences with 250,015 sentences tagged over six different classes.
1. Organization
2. Location
3. Facility
4. Event
5. Product
6. Person
| Label | # |
|:------------:|:-----:|
| Organization | 30108 |
| Location | 12924 |
| Facility | 4458 |
| Event | 7557 |
| Product | 4389 |
| Person | 15645 |
**Download**
You can download the dataset from [here](https://github.com/HaniehP/PersianNER)
## Results
The following table summarizes the F1 score obtained by ParsBERT as compared to other models and architectures.
| Dataset | ParsBERT | MorphoBERT | Beheshti-NER | LSTM-CRF | Rule-Based CRF | BiLSTM-CRF |
|:---------------:|:--------:|:----------:|:--------------:|:----------:|:----------------:|:------------:|
| ARMAN + PEYMA | 95.13* | - | - | - | - | - |
| PEYMA | 98.79* | - | 90.59 | - | 84.00 | - |
| ARMAN | 93.10* | 89.9 | 84.03 | 86.55 | - | 77.45 |
## How to use :hugs:
| Notebook | Description | |
|:----------|:-------------|------:|
| [How to use Pipelines](https://github.com/hooshvare/parsbert-ner/blob/master/persian-ner-pipeline.ipynb) | Simple and efficient way to use State-of-the-Art models on downstream tasks through transformers | [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/hooshvare/parsbert-ner/blob/master/persian-ner-pipeline.ipynb) |
## Cite
Please cite the following paper in your publication if you are using [ParsBERT](https://arxiv.org/abs/2005.12515) in your research:
```markdown
@article{ParsBERT,
title={ParsBERT: Transformer-based Model for Persian Language Understanding},
author={Mehrdad Farahani, Mohammad Gharachorloo, Marzieh Farahani, Mohammad Manthouri},
journal={ArXiv},
year={2020},
volume={abs/2005.12515}
}
```
## Acknowledgments
We hereby, express our gratitude to the [Tensorflow Research Cloud (TFRC) program](https://tensorflow.org/tfrc) for providing us with the necessary computation resources. We also thank [Hooshvare](https://hooshvare.com) Research Group for facilitating dataset gathering and scraping online text resources.
## Contributors
- Mehrdad Farahani: [Linkedin](https://www.linkedin.com/in/m3hrdadfi/), [Twitter](https://twitter.com/m3hrdadfi), [Github](https://github.com/m3hrdadfi)
- Mohammad Gharachorloo: [Linkedin](https://www.linkedin.com/in/mohammad-gharachorloo/), [Twitter](https://twitter.com/MGharachorloo), [Github](https://github.com/baarsaam)
- Marzieh Farahani: [Linkedin](https://www.linkedin.com/in/marziehphi/), [Twitter](https://twitter.com/marziehphi), [Github](https://github.com/marziehphi)
- Mohammad Manthouri: [Linkedin](https://www.linkedin.com/in/mohammad-manthouri-aka-mansouri-07030766/), [Twitter](https://twitter.com/mmanthouri), [Github](https://github.com/mmanthouri)
- Hooshvare Team: [Official Website](https://hooshvare.com/), [Linkedin](https://www.linkedin.com/company/hooshvare), [Twitter](https://twitter.com/hooshvare), [Github](https://github.com/hooshvare), [Instagram](https://www.instagram.com/hooshvare/)
+ And a special thanks to Sara Tabrizi for her fantastic poster design. Follow her on: [Linkedin](https://www.linkedin.com/in/sara-tabrizi-64548b79/), [Behance](https://www.behance.net/saratabrizi), [Instagram](https://www.instagram.com/sara_b_tabrizi/)
## Releases
### Release v0.1 (May 29, 2019)
This is the first version of our ParsBERT NER!

View File

@@ -0,0 +1,124 @@
## ParsBERT: Transformer-based Model for Persian Language Understanding
ParsBERT is a monolingual language model based on Googles BERT architecture with the same configurations as BERT-Base.
Paper presenting ParsBERT: [arXiv:2005.12515](https://arxiv.org/abs/2005.12515)
All the models (downstream tasks) are uncased and trained with whole word masking. (coming soon stay tuned)
## Persian NER [ARMAN, PEYMA, ARMAN+PEYMA]
This task aims to extract named entities in the text, such as names and label with appropriate `NER` classes such as locations, organizations, etc. The datasets used for this task contain sentences that are marked with `IOB` format. In this format, tokens that are not part of an entity are tagged as `”O”` the `”B”`tag corresponds to the first word of an object, and the `”I”` tag corresponds to the rest of the terms of the same entity. Both `”B”` and `”I”` tags are followed by a hyphen (or underscore), followed by the entity category. Therefore, the NER task is a multi-class token classification problem that labels the tokens upon being fed a raw text. There are two primary datasets used in Persian NER, `ARMAN`, and `PEYMA`. In ParsBERT, we prepared ner for both datasets as well as a combination of both datasets.
### PEYMA
PEYMA dataset includes 7,145 sentences with a total of 302,530 tokens from which 41,148 tokens are tagged with seven different classes.
1. Organization
2. Money
3. Location
4. Date
5. Time
6. Person
7. Percent
| Label | # |
|:------------:|:-----:|
| Organization | 16964 |
| Money | 2037 |
| Location | 8782 |
| Date | 4259 |
| Time | 732 |
| Person | 7675 |
| Percent | 699 |
**Download**
You can download the dataset from [here](http://nsurl.org/tasks/task-7-named-entity-recognition-ner-for-farsi/)
---
### ARMAN
ARMAN dataset holds 7,682 sentences with 250,015 sentences tagged over six different classes.
1. Organization
2. Location
3. Facility
4. Event
5. Product
6. Person
| Label | # |
|:------------:|:-----:|
| Organization | 30108 |
| Location | 12924 |
| Facility | 4458 |
| Event | 7557 |
| Product | 4389 |
| Person | 15645 |
**Download**
You can download the dataset from [here](https://github.com/HaniehP/PersianNER)
## Results
The following table summarizes the F1 score obtained by ParsBERT as compared to other models and architectures.
| Dataset | ParsBERT | MorphoBERT | Beheshti-NER | LSTM-CRF | Rule-Based CRF | BiLSTM-CRF |
|:---------------:|:--------:|:----------:|:--------------:|:----------:|:----------------:|:------------:|
| ARMAN + PEYMA | 95.13* | - | - | - | - | - |
| PEYMA | 98.79* | - | 90.59 | - | 84.00 | - |
| ARMAN | 93.10* | 89.9 | 84.03 | 86.55 | - | 77.45 |
## How to use :hugs:
| Notebook | Description | |
|:----------|:-------------|------:|
| [How to use Pipelines](https://github.com/hooshvare/parsbert-ner/blob/master/persian-ner-pipeline.ipynb) | Simple and efficient way to use State-of-the-Art models on downstream tasks through transformers | [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/hooshvare/parsbert-ner/blob/master/persian-ner-pipeline.ipynb) |
## Cite
Please cite the following paper in your publication if you are using [ParsBERT](https://arxiv.org/abs/2005.12515) in your research:
```markdown
@article{ParsBERT,
title={ParsBERT: Transformer-based Model for Persian Language Understanding},
author={Mehrdad Farahani, Mohammad Gharachorloo, Marzieh Farahani, Mohammad Manthouri},
journal={ArXiv},
year={2020},
volume={abs/2005.12515}
}
```
## Acknowledgments
We hereby, express our gratitude to the [Tensorflow Research Cloud (TFRC) program](https://tensorflow.org/tfrc) for providing us with the necessary computation resources. We also thank [Hooshvare](https://hooshvare.com) Research Group for facilitating dataset gathering and scraping online text resources.
## Contributors
- Mehrdad Farahani: [Linkedin](https://www.linkedin.com/in/m3hrdadfi/), [Twitter](https://twitter.com/m3hrdadfi), [Github](https://github.com/m3hrdadfi)
- Mohammad Gharachorloo: [Linkedin](https://www.linkedin.com/in/mohammad-gharachorloo/), [Twitter](https://twitter.com/MGharachorloo), [Github](https://github.com/baarsaam)
- Marzieh Farahani: [Linkedin](https://www.linkedin.com/in/marziehphi/), [Twitter](https://twitter.com/marziehphi), [Github](https://github.com/marziehphi)
- Mohammad Manthouri: [Linkedin](https://www.linkedin.com/in/mohammad-manthouri-aka-mansouri-07030766/), [Twitter](https://twitter.com/mmanthouri), [Github](https://github.com/mmanthouri)
- Hooshvare Team: [Official Website](https://hooshvare.com/), [Linkedin](https://www.linkedin.com/company/hooshvare), [Twitter](https://twitter.com/hooshvare), [Github](https://github.com/hooshvare), [Instagram](https://www.instagram.com/hooshvare/)
+ And a special thanks to Sara Tabrizi for her fantastic poster design. Follow her on: [Linkedin](https://www.linkedin.com/in/sara-tabrizi-64548b79/), [Behance](https://www.behance.net/saratabrizi), [Instagram](https://www.instagram.com/sara_b_tabrizi/)
## Releases
### Release v0.1 (May 29, 2019)
This is the first version of our ParsBERT NER!

View File

@@ -0,0 +1,124 @@
## ParsBERT: Transformer-based Model for Persian Language Understanding
ParsBERT is a monolingual language model based on Googles BERT architecture with the same configurations as BERT-Base.
Paper presenting ParsBERT: [arXiv:2005.12515](https://arxiv.org/abs/2005.12515)
All the models (downstream tasks) are uncased and trained with whole word masking. (coming soon stay tuned)
---
## Introduction
This model is pre-trained on a large Persian corpus with various writing styles from numerous subjects (e.g., scientific, novels, news) with more than 2M documents. A large subset of this corpus was crawled manually.
As a part of ParsBERT methodology, an extensive pre-processing combining POS tagging and WordPiece segmentation was carried out to bring the corpus into a proper format. This process produces more than 40M true sentences.
## Evaluation
ParsBERT is evaluated on three NLP downstream tasks: Sentiment Analysis (SA), Text Classification, and Named Entity Recognition (NER). For this matter and due to insufficient resources, two large datasets for SA and two for text classification were manually composed, which are available for public use and benchmarking. ParsBERT outperformed all other language models, including multilingual BERT and other hybrid deep learning models for all tasks, improving the state-of-the-art performance in Persian language modeling.
## Results
The following table summarizes the F1 score obtained by ParsBERT as compared to other models and architectures.
### Sentiment Analysis (SA) task
| Dataset | ParsBERT | mBERT | DeepSentiPers |
|:--------------------------:|:---------:|:-----:|:-------------:|
| Digikala User Comments | 81.74* | 80.74 | - |
| SnappFood User Comments | 88.12* | 87.87 | - |
| SentiPers (Multi Class) | 71.11* | - | 69.33 |
| SentiPers (Binary Class) | 92.13* | - | 91.98 |
### Text Classification (TC) task
| Dataset | ParsBERT | mBERT |
|:-----------------:|:--------:|:-----:|
| Digikala Magazine | 93.59* | 90.72 |
| Persian News | 97.19* | 95.79 |
### Named Entity Recognition (NER) task
| Dataset | ParsBERT | mBERT | MorphoBERT | Beheshti-NER | LSTM-CRF | Rule-Based CRF | BiLSTM-CRF |
|:-------:|:--------:|:--------:|:----------:|:--------------:|:----------:|:----------------:|:------------:|
| PEYMA | 93.10* | 86.64 | - | 90.59 | - | 84.00 | - |
| ARMAN | 98.79* | 95.89 | 89.9 | 84.03 | 86.55 | - | 77.45 |
**If you tested ParsBERT on a public dataset and you want to add your results to the table above, open a pull request or contact us. Also make sure to have your code available online so we can add it as a reference**
## How to use
### TensorFlow 2.0
```python
from transformers import AutoConfig, AutoTokenizer, TFAutoModel
config = AutoConfig.from_pretrained("HooshvareLab/bert-base-parsbert-uncased")
tokenizer = AutoTokenizer.from_pretrained("HooshvareLab/bert-base-parsbert-uncased")
model = AutoModel.from_pretrained("HooshvareLab/bert-base-parsbert-uncased")
text = "ما در هوشواره معتقدیم با انتقال صحیح دانش و آگاهی، همه افراد می‌توانند از ابزارهای هوشمند استفاده کنند. شعار ما هوش مصنوعی برای همه است."
tokenizer.tokenize(text)
>>> ['ما', 'در', 'هوش', '##واره', 'معتقدیم', 'با', 'انتقال', 'صحیح', 'دانش', 'و', 'اگاهی', '،', 'همه', 'افراد', 'میتوانند', 'از', 'ابزارهای', 'هوشمند', 'استفاده', 'کنند', '.', 'شعار', 'ما', 'هوش', 'مصنوعی', 'برای', 'همه', 'است', '.']
```
### Pytorch
```python
from transformers import AutoConfig, AutoTokenizer, AutoModel
config = AutoConfig.from_pretrained("HooshvareLab/bert-base-parsbert-uncased")
tokenizer = AutoTokenizer.from_pretrained("HooshvareLab/bert-base-parsbert-uncased")
model = AutoModel.from_pretrained("HooshvareLab/bert-base-parsbert-uncased")
```
## NLP Tasks Tutorial
Coming soon stay tuned
## Cite
Please cite the following paper in your publication if you are using [ParsBERT](https://arxiv.org/abs/2005.12515) in your research:
```markdown
@article{ParsBERT,
title={ParsBERT: Transformer-based Model for Persian Language Understanding},
author={Mehrdad Farahani, Mohammad Gharachorloo, Marzieh Farahani, Mohammad Manthouri},
journal={ArXiv},
year={2020},
volume={abs/2005.12515}
}
```
## Acknowledgments
We hereby, express our gratitude to the [Tensorflow Research Cloud (TFRC) program](https://tensorflow.org/tfrc) for providing us with the necessary computation resources. We also thank [Hooshvare](https://hooshvare.com) Research Group for facilitating dataset gathering and scraping online text resources.
## Contributors
- Mehrdad Farahani: [Linkedin](https://www.linkedin.com/in/m3hrdadfi/), [Twitter](https://twitter.com/m3hrdadfi), [Github](https://github.com/m3hrdadfi)
- Mohammad Gharachorloo: [Linkedin](https://www.linkedin.com/in/mohammad-gharachorloo/), [Twitter](https://twitter.com/MGharachorloo), [Github](https://github.com/baarsaam)
- Marzieh Farahani: [Linkedin](https://www.linkedin.com/in/marziehphi/), [Twitter](https://twitter.com/marziehphi), [Github](https://github.com/marziehphi)
- Mohammad Manthouri: [Linkedin](https://www.linkedin.com/in/mohammad-manthouri-aka-mansouri-07030766/), [Twitter](https://twitter.com/mmanthouri), [Github](https://github.com/mmanthouri)
- Hooshvare Team: [Official Website](https://hooshvare.com/), [Linkedin](https://www.linkedin.com/company/hooshvare), [Twitter](https://twitter.com/hooshvare), [Github](https://github.com/hooshvare), [Instagram](https://www.instagram.com/hooshvare/)
## Releases
### Release v0.1 (May 27, 2019)
This is the first version of our ParsBERT based on BERT<sub>BASE</sub>

View File

@@ -0,0 +1,39 @@
---
language: ukrainian
---
Note: **default code snippet above won't work** because we are using `AlbertTokenizer` with `GPT2LMHeadModel`, see [issue](https://github.com/huggingface/transformers/issues/4285).
## GPT2 124M Trained on Ukranian Fiction
### Training details
Model was trained on corpus of 4040 fiction books, 2.77 GiB in total.
Evaluation on [brown-uk](https://github.com/brown-uk/corpus) gives perplexity of 50.16.
### Example usage:
```python
from transformers import AlbertTokenizer, GPT2LMHeadModel
tokenizer = AlbertTokenizer.from_pretrained("Tereveni-AI/gpt2-124M-uk-fiction")
model = GPT2LMHeadModel.from_pretrained("Tereveni-AI/gpt2-124M-uk-fiction")
input_ids = tokenizer.encode("Но зла Юнона, суча дочка,", add_special_tokens=False, return_tensors='pt')
outputs = model.generate(
input_ids,
do_sample=True,
num_return_sequences=3,
max_length=50
)
for i, out in enumerate(outputs):
print("{}: {}".format(i, tokenizer.decode(out)))
```
Prints something like this:
```bash
0: Но зла Юнона, суча дочка, яка затьмарила всі її таємниці: І хто з'їсть її душу, той помре». І, не дочекавшись гніву богів, посунула в пітьму, щоб не бачити перед собою. Але, за
1: Но зла Юнона, суча дочка, і довела мене до божевілля. Але він не знав нічого. Після того як я його побачив, мені стало зле. Я втратив рівновагу. Але в мене не було часу на роздуми. Я вже втратив надію
2: Но зла Юнона, суча дочка, не нарікала нам! — раптом вигукнула Юнона. — Це ти, старий йолопе! — мовила вона, не перестаючи сміятись. — Хіба ти не знаєш, що мені подобається ходити з тобою?
```

View File

@@ -1,19 +1,25 @@
---
language: norwegian
thumbnail: https://i.imgur.com/QqSEC5I.png
---
# Norwegian Electra
Image incoming, im going to have som fun with this one.
![Image of norwegian electra](https://i.imgur.com/QqSEC5I.png)
Trained on Oscar + wikipedia + opensubtitles + some other data I had with the awesome power of TPUs(V3-8)
Use with caution. I have no downstream tasks in Norwegian to test on so I have no idea of its performance yet.
# Model
## Electra: Pre-training Text Encoders as Discriminators Rather Than Generators
Kevin Clark and Minh-Thang Luong and Quoc V. Le and Christopher D. Manning
- https://openreview.net/pdf?id=r1xMH1BtvB
- https://github.com/google-research/electra
# Acknowledgments
### TensorFlow Research Cloud
Research supported with Cloud TPUs from Google's TensorFlow Research Cloud (TFRC). Thanks for providing access to the TFRC ❤️
- https://www.tensorflow.org/tfrc
#### OSCAR corpus
- https://oscar-corpus.com/
#### OPUS
- http://opus.nlpl.eu/
- http://www.opensubtitles.org/

View File

@@ -0,0 +1,43 @@
# ReviewBERT
BERT (post-)trained from review corpus to understand sentiment, options and various e-commence aspects.
`BERT-DK_laptop` is trained from 100MB laptop corpus under `Electronics/Computers & Accessories/Laptops`.
## Model Description
The original model is from `BERT-base-uncased` trained from Wikipedia+BookCorpus.
Models are post-trained from [Amazon Dataset](http://jmcauley.ucsd.edu/data/amazon/) and [Yelp Dataset](https://www.yelp.com/dataset/challenge/).
`BERT-DK_laptop` is trained from 100MB laptop corpus under `Electronics/Computers & Accessories/Laptops`.
## Instructions
Loading the post-trained weights are as simple as, e.g.,
```python
import torch
from transformers import AutoModel, AutoTokenizer
tokenizer = AutoTokenizer.from_pretrained("activebus/BERT-DK_laptop")
model = AutoModel.from_pretrained("activebus/BERT-DK_laptop")
```
## Evaluation Results
Check our [NAACL paper](https://www.aclweb.org/anthology/N19-1242.pdf)
## Citation
If you find this work useful, please cite as following.
```
@inproceedings{xu_bert2019,
title = "BERT Post-Training for Review Reading Comprehension and Aspect-based Sentiment Analysis",
author = "Xu, Hu and Liu, Bing and Shu, Lei and Yu, Philip S.",
booktitle = "Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics",
month = "jun",
year = "2019",
}
```

View File

@@ -0,0 +1,41 @@
# ReviewBERT
BERT (post-)trained from review corpus to understand sentiment, options and various e-commence aspects.
`BERT-DK_rest` is trained from 1G (19 types) restaurants from Yelp.
## Model Description
The original model is from `BERT-base-uncased` trained from Wikipedia+BookCorpus.
Models are post-trained from [Amazon Dataset](http://jmcauley.ucsd.edu/data/amazon/) and [Yelp Dataset](https://www.yelp.com/dataset/challenge/).
## Instructions
Loading the post-trained weights are as simple as, e.g.,
```python
import torch
from transformers import AutoModel, AutoTokenizer
tokenizer = AutoTokenizer.from_pretrained("activebus/BERT-DK_rest")
model = AutoModel.from_pretrained("activebus/BERT-DK_rest")
```
## Evaluation Results
Check our [NAACL paper](https://www.aclweb.org/anthology/N19-1242.pdf)
## Citation
If you find this work useful, please cite as following.
```
@inproceedings{xu_bert2019,
title = "BERT Post-Training for Review Reading Comprehension and Aspect-based Sentiment Analysis",
author = "Xu, Hu and Liu, Bing and Shu, Lei and Yu, Philip S.",
booktitle = "Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics",
month = "jun",
year = "2019",
}
```

View File

@@ -0,0 +1,41 @@
# ReviewBERT
BERT (post-)trained from review corpus to understand sentiment, options and various e-commence aspects.
`BERT-DK_laptop` is trained from 100MB laptop corpus under `Electronics/Computers & Accessories/Laptops`.
`BERT-PT_*` addtionally uses SQuAD 1.1.
## Model Description
The original model is from `BERT-base-uncased` trained from Wikipedia+BookCorpus.
Models are post-trained from [Amazon Dataset](http://jmcauley.ucsd.edu/data/amazon/) and [Yelp Dataset](https://www.yelp.com/dataset/challenge/).
## Instructions
Loading the post-trained weights are as simple as, e.g.,
```python
import torch
from transformers import AutoModel, AutoTokenizer
tokenizer = AutoTokenizer.from_pretrained("activebus/BERT-PT_laptop")
model = AutoModel.from_pretrained("activebus/BERT-PT_laptop")
```
## Evaluation Results
Check our [NAACL paper](https://www.aclweb.org/anthology/N19-1242.pdf)
## Citation
If you find this work useful, please cite as following.
```
@inproceedings{xu_bert2019,
title = "BERT Post-Training for Review Reading Comprehension and Aspect-based Sentiment Analysis",
author = "Xu, Hu and Liu, Bing and Shu, Lei and Yu, Philip S.",
booktitle = "Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics",
month = "jun",
year = "2019",
}
```

View File

@@ -0,0 +1,42 @@
# ReviewBERT
BERT (post-)trained from review corpus to understand sentiment, options and various e-commence aspects.
`BERT-DK_rest` is trained from 1G (19 types) restaurants from Yelp.
`BERT-PT_*` addtionally uses SQuAD 1.1.
## Model Description
The original model is from `BERT-base-uncased` trained from Wikipedia+BookCorpus.
Models are post-trained from [Amazon Dataset](http://jmcauley.ucsd.edu/data/amazon/) and [Yelp Dataset](https://www.yelp.com/dataset/challenge/).
## Instructions
Loading the post-trained weights are as simple as, e.g.,
```python
import torch
from transformers import AutoModel, AutoTokenizer
tokenizer = AutoTokenizer.from_pretrained("activebus/BERT-PT_rest")
model = AutoModel.from_pretrained("activebus/BERT-PT_rest")
```
## Evaluation Results
Check our [NAACL paper](https://www.aclweb.org/anthology/N19-1242.pdf)
## Citation
If you find this work useful, please cite as following.
```
@inproceedings{xu_bert2019,
title = "BERT Post-Training for Review Reading Comprehension and Aspect-based Sentiment Analysis",
author = "Xu, Hu and Liu, Bing and Shu, Lei and Yu, Philip S.",
booktitle = "Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics",
month = "jun",
year = "2019",
}
```

View File

@@ -0,0 +1,44 @@
# ReviewBERT
BERT (post-)trained from review corpus to understand sentiment, options and various e-commence aspects.
Please visit https://github.com/howardhsu/BERT-for-RRC-ABSA for details.
`BERT-XD_Review` is a cross-domain (beyond just `laptop` and `restaurant`) language model, where each example is from a single product / restaurant with the same rating, post-trained (fine-tuned) on a combination of 5-core Amazon reviews and all Yelp data, expected to be 22 G in total. It is trained for 4 epochs on `bert-base-uncased`.
The preprocessing code [here](https://github.com/howardhsu/BERT-for-RRC-ABSA/transformers).
## Model Description
The original model is from `BERT-base-uncased`.
Models are post-trained from [Amazon Dataset](http://jmcauley.ucsd.edu/data/amazon/) and [Yelp Dataset](https://www.yelp.com/dataset/challenge/).
## Instructions
Loading the post-trained weights are as simple as, e.g.,
```python
import torch
from transformers import AutoModel, AutoTokenizer
tokenizer = AutoTokenizer.from_pretrained("activebus/BERT-XD_Review")
model = AutoModel.from_pretrained("activebus/BERT-XD_Review")
```
## Evaluation Results
Check our [NAACL paper](https://www.aclweb.org/anthology/N19-1242.pdf)
`BERT_Review` is expected to have similar performance on domain-specific tasks (such as aspect extraction) as `BERT-DK`, but much better on general tasks such as aspect sentiment classification (different domains mostly share similar sentiment words).
## Citation
If you find this work useful, please cite as following.
```
@inproceedings{xu_bert2019,
title = "BERT Post-Training for Review Reading Comprehension and Aspect-based Sentiment Analysis",
author = "Xu, Hu and Liu, Bing and Shu, Lei and Yu, Philip S.",
booktitle = "Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics",
month = "jun",
year = "2019",
}
```

View File

@@ -0,0 +1,44 @@
# ReviewBERT
BERT (post-)trained from review corpus to understand sentiment, options and various e-commence aspects.
`BERT_Review` is cross-domain (beyond just `laptop` and `restaurant`) language model with one example from randomly mixed domains, post-trained (fine-tuned) on a combination of 5-core Amazon reviews and all Yelp data, expected to be 22 G in total. It is trained for 4 epochs on `bert-base-uncased`.
The preprocessing code [here](https://github.com/howardhsu/BERT-for-RRC-ABSA/transformers).
## Model Description
The original model is from `BERT-base-uncased` trained from Wikipedia+BookCorpus.
Models are post-trained from [Amazon Dataset](http://jmcauley.ucsd.edu/data/amazon/) and [Yelp Dataset](https://www.yelp.com/dataset/challenge/).
## Instructions
Loading the post-trained weights are as simple as, e.g.,
```python
import torch
from transformers import AutoModel, AutoTokenizer
tokenizer = AutoTokenizer.from_pretrained("activebus/BERT_Review")
model = AutoModel.from_pretrained("activebus/BERT_Review")
```
## Evaluation Results
Check our [NAACL paper](https://www.aclweb.org/anthology/N19-1242.pdf)
`BERT_Review` is expected to have similar performance on domain-specific tasks (such as aspect extraction) as `BERT-DK`, but much better on general tasks such as aspect sentiment classification (different domains mostly share similar sentiment words).
## Citation
If you find this work useful, please cite as following.
```
@inproceedings{xu_bert2019,
title = "BERT Post-Training for Review Reading Comprehension and Aspect-based Sentiment Analysis",
author = "Xu, Hu and Liu, Bing and Shu, Lei and Yu, Philip S.",
booktitle = "Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics",
month = "jun",
year = "2019",
}
```

View File

@@ -0,0 +1,20 @@
# longformer-base-4096-extra.pos.embd.only
This model is similar to `longformer-base-4096` but it was pretrained to preserve RoBERTa weights by freezing all RoBERTa weights and only train the additional position embeddings.
### Citing
If you use `Longformer` in your research, please cite [Longformer: The Long-Document Transformer](https://arxiv.org/abs/2004.05150).
```
@article{Beltagy2020Longformer,
title={Longformer: The Long-Document Transformer},
author={Iz Beltagy and Matthew E. Peters and Arman Cohan},
journal={arXiv:2004.05150},
year={2020},
}
```
`Longformer` is an open-source project developed by [the Allen Institute for Artificial Intelligence (AI2)](http://www.allenai.org).
AI2 is a non-profit institute with the mission to contribute to humanity through high-impact AI research and engineering.

View File

@@ -0,0 +1,24 @@
# longformer-base-4096
[Longformer](https://arxiv.org/abs/2004.05150) is a transformer model for long documents.
`longformer-base-4096` is a BERT-like model started from the RoBERTa checkpoint and pretrained for MLM on long documents. It supports sequences of length up to 4,096.
Longformer uses a combination of a sliding window (local) attention and global attention. Global attention is user-configured based on the task to allow the model to learn task-specific representations.
Please refer to the examples in `modeling_longformer.py` and the paper for more details on how to set global attention.
### Citing
If you use `Longformer` in your research, please cite [Longformer: The Long-Document Transformer](https://arxiv.org/abs/2004.05150).
```
@article{Beltagy2020Longformer,
title={Longformer: The Long-Document Transformer},
author={Iz Beltagy and Matthew E. Peters and Arman Cohan},
journal={arXiv:2004.05150},
year={2020},
}
```
`Longformer` is an open-source project developed by [the Allen Institute for Artificial Intelligence (AI2)](http://www.allenai.org).
AI2 is a non-profit institute with the mission to contribute to humanity through high-impact AI research and engineering.

View File

@@ -6,6 +6,17 @@ language: arabic
Pretrained BERT base language model for Arabic
_If you use this model in your work, please cite this paper (to appear in 2020):_
```
@inproceedings{
title={KUISAIL at SemEval-2020 Task 12: BERT-CNN for Offensive Speech Identification in Social Media},
author={Safaya, Ali and Abdullatif, Moutasem and Yuret, Deniz},
booktitle={Proceedings of the International Workshop on Semantic Evaluation (SemEval)},
year={2020}
}
```
## Pretraining Corpus
`arabic-bert-base` model was pretrained on ~8.2 Billion words:

View File

@@ -3,8 +3,9 @@ language: arabic
---
# AraBERT : Pre-training BERT for Arabic Language Understanding
<img src="https://github.com/aub-mind/arabert/blob/master/arabert_logo.png" width="100" align="left"/>
**AraBERT** is an Arabic pretrained lanaguage model based on [Google's BERT architechture](https://github.com/google-research/bert). AraBERT uses the same BERT-Base config.
**AraBERT** is an Arabic pretrained lanaguage model based on [Google's BERT architechture](https://github.com/google-research/bert). AraBERT uses the same BERT-Base config. More details are available in the [AraBERT PAPER](https://arxiv.org/abs/2003.00104v2) and in the [AraBERT Meetup](https://github.com/WissamAntoun/pydata_khobar_meetup)
There are two version off the model AraBERTv0.1 and AraBERTv1, with the difference being that AraBERTv1 uses pre-segmented text where prefixes and suffixes were splitted using the [Farasa Segmenter](http://alt.qcri.org/farasa/segmenter.html).
@@ -12,28 +13,34 @@ The model was trained on ~70M sentences or ~23GB of Arabic text with ~3B words.
We evalaute both AraBERT models on different downstream tasks and compare it to [mBERT]((https://github.com/google-research/bert/blob/master/multilingual.md)), and other state of the art models (*To the extent of our knowledge*). The Tasks were Sentiment Analysis on 6 different datasets ([HARD](https://github.com/elnagara/HARD-Arabic-Dataset), [ASTD-Balanced](https://www.aclweb.org/anthology/D15-1299), [ArsenTD-Lev](https://staff.aub.edu.lb/~we07/Publications/ArSentD-LEV_Sentiment_Corpus.pdf), [LABR](https://github.com/mohamedadaly/LABR), [ArSaS](http://lrec-conf.org/workshops/lrec2018/W30/pdf/22_W30.pdf)), Named Entity Recognition with the [ANERcorp](http://curtis.ml.cmu.edu/w/courses/index.php/ANERcorp), and Arabic Question Answering on [Arabic-SQuAD and ARCD](https://github.com/husseinmozannar/SOQAL)
**Update 2 (21/5/2020) :**
Added support for the farasapy segmenter https://github.com/MagedSaeed/farasapy in the ``preprocess_arabert.py`` which is ~6x faster than the ``py4j.java_gateway``, consider setting ``use_farasapy=True`` when calling preprocess and pass it an instance of ``FarasaSegmenter(interactive=True)`` with interactive set to ``True`` for faster segmentation.
**Update 1 (21/4/2020) :**
Fixed an issue with ARCD fine-tuning which drastically improved performance. Initially we didn't account for the change of the ```answer_start``` during preprocessing.
## Results (Acc.)
Task | prev. SOTA | mBERT | AraBERTv0.1 | AraBERTv1
---|:---:|:---:|:---:|:---:
HARD |95.7 [ElJundi et.al.](https://www.aclweb.org/anthology/W19-4608/)|95.7|96.2|96.1
ASTD |86.5 [ElJundi et.al.](https://www.aclweb.org/anthology/W19-4608/)| 80.1|92.2|92.6
ArsenTD-Lev|52.4 [ElJundi et.al.](https://www.aclweb.org/anthology/W19-4608/)|51|58.9|59.4
AJGT|93 [Dahou et.al.](https://dl.acm.org/doi/fullHtml/10.1145/3314941)| 83.6|94.1|93.8
LABR|87.5 [Dahou et.al.](https://dl.acm.org/doi/fullHtml/10.1145/3314941)|83|85.9|86.7
ANERcorp|81.7 (BiLSTM-CRF)|78.4|84.2|81.9
ARCD|mBERT|EM:34.2 F1: 61.3|EM:30.1 F1:61.2|EM:30.6 F1: 62.7
HARD |95.7 [ElJundi et.al.](https://www.aclweb.org/anthology/W19-4608/)|95.7|**96.2**|96.1
ASTD |86.5 [ElJundi et.al.](https://www.aclweb.org/anthology/W19-4608/)| 80.1|92.2|**92.6**
ArsenTD-Lev|52.4 [ElJundi et.al.](https://www.aclweb.org/anthology/W19-4608/)|51|58.9|**59.4**
AJGT|93 [Dahou et.al.](https://dl.acm.org/doi/fullHtml/10.1145/3314941)| 83.6|93.1|**93.8**
LABR|**87.5** [Dahou et.al.](https://dl.acm.org/doi/fullHtml/10.1145/3314941)|83|85.9|86.7
ANERcorp|81.7 (BiLSTM-CRF)|78.4|**84.2**|81.9
ARCD|mBERT|EM:34.2 F1: 61.3|EM:51.14 F1:82.13|**EM:54.84 F1: 82.15**
*We would be extremly thankful if everyone can contibute to the Results table by adding more scores on different datasets*
*If you tested AraBERT on a public dataset and you want to add your results to the table above, open a pull request or contact us. Also make sure to have your code available online so we can add it as a reference*
## How to use
You can easily use AraBERT since it is almost fully compatible with existing codebases (You can use this repo instead of the official BERT one, the only difference is in the ```tokenization.py``` file where we modify the _is_punctuation function to make it compatible with the "+" symbol and the "[" and "]" characters)
To use HuggingFace's Transformer repository you only need to provide a lost of token that forces the model to not split them, also make sure that the text is pre-segmented:
You can easily use AraBERT since it is almost fully compatible with existing codebases (Use this repo instead of the official BERT one, the only difference is in the ```tokenization.py``` file where we modify the _is_punctuation function to make it compatible with the "+" symbol and the "[" and "]" characters)
To use HuggingFace's Transformer repository you only need to provide a list of token that forces the model to not split them, also make sure that the text is pre-segmented:
**Not all libraries built on top of transformers support the `never_split` argument**
```python
from transformers import AutoTokenizer
from preprocess_arabert import never_split_tokens
from transformers import AutoTokenizer, AutoModel
from arabert.preprocess_arabert import never_split_tokens, preprocess
from farasa.segmenter import FarasaSegmenter
arabert_tokenizer = AutoTokenizer.from_pretrained(
"aubmindlab/bert-base-arabert",
@@ -42,27 +49,75 @@ arabert_tokenizer = AutoTokenizer.from_pretrained(
never_split=never_split_tokens)
arabert_model = AutoModel.from_pretrained("aubmindlab/bert-base-arabert")
arabert_tokenizer.tokenize("و+ لن نبالغ إذا قل +نا إن هاتف أو كمبيوتر ال+ مكتب في زمن +نا هذا ضروري")
#Preprocess the text to make it compatible with AraBERT using farasapy
farasa_segmenter = FarasaSegmenter(interactive=True)
#or you can use a py4j JavaGateway to the farasa Segmneter .jar but it's slower
#(see update 2)
#from py4j.java_gateway import JavaGateway
#gateway = JavaGateway.launch_gateway(classpath='./PATH_TO_FARASA/FarasaSegmenterJar.jar')
#farasa = gateway.jvm.com.qcri.farasa.segmenter.Farasa()
text = "ولن نبالغ إذا قلنا إن هاتف أو كمبيوتر المكتب في زمننا هذا ضروري"
text_preprocessed = preprocess( text,
do_farasa_tokenization = True,
farasa = farasa_segmenter,
use_farasapy = True)
>>>text_preprocessed: "و+ لن نبالغ إذا قل +نا إن هاتف أو كمبيوتر ال+ مكتب في زمن +نا هذا ضروري"
arabert_tokenizer.tokenize(text_preprocessed)
>>> ['و+', 'لن', 'نبال', '##غ', 'إذا', 'قل', '+نا', 'إن', 'هاتف', 'أو', 'كمبيوتر', 'ال+', 'مكتب', 'في', 'زمن', '+نا', 'هذا', 'ضروري']
```
**AraBERTv0.1 is compatible with all existing libraries, since it needs no pre-segmentation.**
```python
from transformers import AutoTokenizer
from preprocess_arabert import never_split_tokens
from transformers import AutoTokenizer, AutoModel
arabert_tokenizer = AutoTokenizer.from_pretrained("aubmindlab/bert-base-arabertv01",do_lower_case=False)
arabert_model = AutoModel.from_pretrained("aubmindlab/bert-base-arabertv01")
arabert_tokenizer.tokenize("ولن نبالغ إذا قلنا إن هاتف أو كمبيوتر المكتب في زمننا هذا ضروري")
text = "ولن نبالغ إذا قلنا إن هاتف أو كمبيوتر المكتب في زمننا هذا ضروري"
arabert_tokenizer.tokenize(text)
>>> ['ولن', 'ن', '##بالغ', 'إذا', 'قلنا', 'إن', 'هاتف', 'أو', 'كمبيوتر', 'المكتب', 'في', 'زمن', '##ن', '##ا', 'هذا', 'ضروري']
```
The ```araBERT_(initial_Demo_TF)_.ipynb``` Notebook is a small demo using the AJGT dataset using TensorFlow (GPU and TPU compatible).
The ```araBERT_(Updated_Demo_TF).ipynb``` Notebook is a small demo using the AJGT dataset using TensorFlow (GPU and TPU compatible).
**Coming Soon :** Fine-tunning demo using HuggingFace's Trainer API
**AraBERT on ARCD**
During the preprocessing step the ```answer_start``` character position needs to be recalculated. You can use the file ```arcd_preprocessing.py``` as shown below to clean, preprocess the ARCD dataset before running ```run_squad.py```. More detailed Colab notebook is available in the [SOQAL repo](https://github.com/husseinmozannar/SOQAL).
```bash
python arcd_preprocessing.py \
--input_file="/PATH_TO/arcd-test.json" \
--output_file="arcd-test-pre.json" \
--do_farasa_tokenization=True \
--use_farasapy=True \
```
```bash
python SOQAL/bert/run_squad.py \
--vocab_file="/PATH_TO_PRETRAINED_TF_CKPT/vocab.txt" \
--bert_config_file="/PATH_TO_PRETRAINED_TF_CKPT/config.json" \
--init_checkpoint="/PATH_TO_PRETRAINED_TF_CKPT/" \
--do_train=True \
--train_file=turk_combined_all_pre.json \
--do_predict=True \
--predict_file=arcd-test-pre.json \
--train_batch_size=32 \
--predict_batch_size=24 \
--learning_rate=3e-5 \
--num_train_epochs=4 \
--max_seq_length=384 \
--doc_stride=128 \
--do_lower_case=False\
--output_dir="/PATH_TO/OUTPUT_PATH"/ \
--use_tpu=True \
--tpu_name=$TPU_ADDRESS \
```
## Model Weights and Vocab Download
Models | AraBERTv0.1 | AraBERTv1
---|:---:|:---:
@@ -73,21 +128,17 @@ PyTorch| [Drive_Link](https://drive.google.com/open?id=1-_3te42mQCPD8SxwZ3l-VBL7
## If you used this model please cite us as:
```
@misc{antoun2020arabert,
title={AraBERT: Transformer-based Model for Arabic Language Understanding},
author={Wissam Antoun and Fady Baly and Hazem Hajj},
year={2020},
eprint={2003.00104},
archivePrefix={arXiv},
primaryClass={cs.CL}
@inproceedings{antoun2020arabert,
title={AraBERT: Transformer-based Model for Arabic Language Understanding},
author={Antoun, Wissam and Baly, Fady and Hajj, Hazem},
booktitle={LREC 2020 Workshop Language Resources and Evaluation Conference 11--16 May 2020},
pages={9}
}
```
## Acknowledgments
Thanks to TensorFlow Research Cloud (TFRC) for the free access to Cloud TPUs, couldn't have done it without this program, and to the [AUB MIND Lab](https://sites.aub.edu.lb/mindlab/) Members for the continous support. Also thanks to [Yakshof](https://www.yakshof.com/#/) and Assafir for data and storage access.
Thanks to TensorFlow Research Cloud (TFRC) for the free access to Cloud TPUs, couldn't have done it without this program, and to the [AUB MIND Lab](https://sites.aub.edu.lb/mindlab/) Members for the continous support. Also thanks to [Yakshof](https://www.yakshof.com/#/) and Assafir for data and storage access. Another thanks for Habib Rahal (https://www.behance.net/rahalhabib), for putting a face to AraBERT.
## Contacts
**Wissam Antoun**: [Linkedin](https://www.linkedin.com/in/giulio-ravasio-3a81a9110/) | [Twitter](https://twitter.com/wissam_antoun) | [Github](https://github.com/WissamAntoun) | <wfa07@mail.aub.edu> | <wissam.antoun@gmail.com>
**Fady Baly**: [Linkedin](https://www.linkedin.com/in/fadybaly/) | [Twitter](https://twitter.com/BalyFady) | [Github](https://github.com/fadybaly) | <fgb06@mail.aub.edu> | <baly.fady@gmail.com>
***We are looking for sponsors to train BERT-Large and other Transformer models, the sponsor only needs to cover to data storage and compute cost of the generating the pretraining data***
**Fady Baly**: [Linkedin](https://www.linkedin.com/in/fadybaly/) | [Twitter](https://twitter.com/fadybaly) | [Github](https://github.com/fadybaly) | <fgb06@mail.aub.edu> | <baly.fady@gmail.com>

View File

@@ -3,8 +3,9 @@ language: arabic
---
# AraBERT : Pre-training BERT for Arabic Language Understanding
<img src="https://github.com/aub-mind/arabert/blob/master/arabert_logo.png" width="100" align="left"/>
**AraBERT** is an Arabic pretrained lanaguage model based on [Google's BERT architechture](https://github.com/google-research/bert). AraBERT uses the same BERT-Base config.
**AraBERT** is an Arabic pretrained lanaguage model based on [Google's BERT architechture](https://github.com/google-research/bert). AraBERT uses the same BERT-Base config. More details are available in the [AraBERT PAPER](https://arxiv.org/abs/2003.00104v2) and in the [AraBERT Meetup](https://github.com/WissamAntoun/pydata_khobar_meetup)
There are two version off the model AraBERTv0.1 and AraBERTv1, with the difference being that AraBERTv1 uses pre-segmented text where prefixes and suffixes were splitted using the [Farasa Segmenter](http://alt.qcri.org/farasa/segmenter.html).
@@ -12,28 +13,34 @@ The model was trained on ~70M sentences or ~23GB of Arabic text with ~3B words.
We evalaute both AraBERT models on different downstream tasks and compare it to [mBERT]((https://github.com/google-research/bert/blob/master/multilingual.md)), and other state of the art models (*To the extent of our knowledge*). The Tasks were Sentiment Analysis on 6 different datasets ([HARD](https://github.com/elnagara/HARD-Arabic-Dataset), [ASTD-Balanced](https://www.aclweb.org/anthology/D15-1299), [ArsenTD-Lev](https://staff.aub.edu.lb/~we07/Publications/ArSentD-LEV_Sentiment_Corpus.pdf), [LABR](https://github.com/mohamedadaly/LABR), [ArSaS](http://lrec-conf.org/workshops/lrec2018/W30/pdf/22_W30.pdf)), Named Entity Recognition with the [ANERcorp](http://curtis.ml.cmu.edu/w/courses/index.php/ANERcorp), and Arabic Question Answering on [Arabic-SQuAD and ARCD](https://github.com/husseinmozannar/SOQAL)
**Update 2 (21/5/2020) :**
Added support for the farasapy segmenter https://github.com/MagedSaeed/farasapy in the ``preprocess_arabert.py`` which is ~6x faster than the ``py4j.java_gateway``, consider setting ``use_farasapy=True`` when calling preprocess and pass it an instance of ``FarasaSegmenter(interactive=True)`` with interactive set to ``True`` for faster segmentation.
**Update 1 (21/4/2020) :**
Fixed an issue with ARCD fine-tuning which drastically improved performance. Initially we didn't account for the change of the ```answer_start``` during preprocessing.
## Results (Acc.)
Task | prev. SOTA | mBERT | AraBERTv0.1 | AraBERTv1
---|:---:|:---:|:---:|:---:
HARD |95.7 [ElJundi et.al.](https://www.aclweb.org/anthology/W19-4608/)|95.7|96.2|96.1
ASTD |86.5 [ElJundi et.al.](https://www.aclweb.org/anthology/W19-4608/)| 80.1|92.2|92.6
ArsenTD-Lev|52.4 [ElJundi et.al.](https://www.aclweb.org/anthology/W19-4608/)|51|58.9|59.4
AJGT|93 [Dahou et.al.](https://dl.acm.org/doi/fullHtml/10.1145/3314941)| 83.6|94.1|93.8
LABR|87.5 [Dahou et.al.](https://dl.acm.org/doi/fullHtml/10.1145/3314941)|83|85.9|86.7
ANERcorp|81.7 (BiLSTM-CRF)|78.4|84.2|81.9
ARCD|mBERT|EM:34.2 F1: 61.3|EM:30.1 F1:61.2|EM:30.6 F1: 62.7
HARD |95.7 [ElJundi et.al.](https://www.aclweb.org/anthology/W19-4608/)|95.7|**96.2**|96.1
ASTD |86.5 [ElJundi et.al.](https://www.aclweb.org/anthology/W19-4608/)| 80.1|92.2|**92.6**
ArsenTD-Lev|52.4 [ElJundi et.al.](https://www.aclweb.org/anthology/W19-4608/)|51|58.9|**59.4**
AJGT|93 [Dahou et.al.](https://dl.acm.org/doi/fullHtml/10.1145/3314941)| 83.6|93.1|**93.8**
LABR|**87.5** [Dahou et.al.](https://dl.acm.org/doi/fullHtml/10.1145/3314941)|83|85.9|86.7
ANERcorp|81.7 (BiLSTM-CRF)|78.4|**84.2**|81.9
ARCD|mBERT|EM:34.2 F1: 61.3|EM:51.14 F1:82.13|**EM:54.84 F1: 82.15**
*We would be extremly thankful if everyone can contibute to the Results table by adding more scores on different datasets*
*If you tested AraBERT on a public dataset and you want to add your results to the table above, open a pull request or contact us. Also make sure to have your code available online so we can add it as a reference*
## How to use
You can easily use AraBERT since it is almost fully compatible with existing codebases (You can use this repo instead of the official BERT one, the only difference is in the ```tokenization.py``` file where we modify the _is_punctuation function to make it compatible with the "+" symbol and the "[" and "]" characters)
To use HuggingFace's Transformer repository you only need to provide a lost of token that forces the model to not split them, also make sure that the text is pre-segmented:
You can easily use AraBERT since it is almost fully compatible with existing codebases (Use this repo instead of the official BERT one, the only difference is in the ```tokenization.py``` file where we modify the _is_punctuation function to make it compatible with the "+" symbol and the "[" and "]" characters)
To use HuggingFace's Transformer repository you only need to provide a list of token that forces the model to not split them, also make sure that the text is pre-segmented:
**Not all libraries built on top of transformers support the `never_split` argument**
```python
from transformers import AutoTokenizer
from preprocess_arabert import never_split_tokens
from transformers import AutoTokenizer, AutoModel
from arabert.preprocess_arabert import never_split_tokens, preprocess
from farasa.segmenter import FarasaSegmenter
arabert_tokenizer = AutoTokenizer.from_pretrained(
"aubmindlab/bert-base-arabert",
@@ -42,27 +49,75 @@ arabert_tokenizer = AutoTokenizer.from_pretrained(
never_split=never_split_tokens)
arabert_model = AutoModel.from_pretrained("aubmindlab/bert-base-arabert")
arabert_tokenizer.tokenize("و+ لن نبالغ إذا قل +نا إن هاتف أو كمبيوتر ال+ مكتب في زمن +نا هذا ضروري")
#Preprocess the text to make it compatible with AraBERT using farasapy
farasa_segmenter = FarasaSegmenter(interactive=True)
#or you can use a py4j JavaGateway to the farasa Segmneter .jar but it's slower
#(see update 2)
#from py4j.java_gateway import JavaGateway
#gateway = JavaGateway.launch_gateway(classpath='./PATH_TO_FARASA/FarasaSegmenterJar.jar')
#farasa = gateway.jvm.com.qcri.farasa.segmenter.Farasa()
text = "ولن نبالغ إذا قلنا إن هاتف أو كمبيوتر المكتب في زمننا هذا ضروري"
text_preprocessed = preprocess( text,
do_farasa_tokenization = True,
farasa = farasa_segmenter,
use_farasapy = True)
>>>text_preprocessed: "و+ لن نبالغ إذا قل +نا إن هاتف أو كمبيوتر ال+ مكتب في زمن +نا هذا ضروري"
arabert_tokenizer.tokenize(text_preprocessed)
>>> ['و+', 'لن', 'نبال', '##غ', 'إذا', 'قل', '+نا', 'إن', 'هاتف', 'أو', 'كمبيوتر', 'ال+', 'مكتب', 'في', 'زمن', '+نا', 'هذا', 'ضروري']
```
**AraBERTv0.1 is compatible with all existing libraries, since it needs no pre-segmentation.**
```python
from transformers import AutoTokenizer
from preprocess_arabert import never_split_tokens
from transformers import AutoTokenizer, AutoModel
arabert_tokenizer = AutoTokenizer.from_pretrained("aubmindlab/bert-base-arabertv01",do_lower_case=False)
arabert_model = AutoModel.from_pretrained("aubmindlab/bert-base-arabertv01")
arabert_tokenizer.tokenize("ولن نبالغ إذا قلنا إن هاتف أو كمبيوتر المكتب في زمننا هذا ضروري")
text = "ولن نبالغ إذا قلنا إن هاتف أو كمبيوتر المكتب في زمننا هذا ضروري"
arabert_tokenizer.tokenize(text)
>>> ['ولن', 'ن', '##بالغ', 'إذا', 'قلنا', 'إن', 'هاتف', 'أو', 'كمبيوتر', 'المكتب', 'في', 'زمن', '##ن', '##ا', 'هذا', 'ضروري']
```
The ```araBERT_(initial_Demo_TF)_.ipynb``` Notebook is a small demo using the AJGT dataset using TensorFlow (GPU and TPU compatible).
The ```araBERT_(Updated_Demo_TF).ipynb``` Notebook is a small demo using the AJGT dataset using TensorFlow (GPU and TPU compatible).
**Coming Soon :** Fine-tunning demo using HuggingFace's Trainer API
**AraBERT on ARCD**
During the preprocessing step the ```answer_start``` character position needs to be recalculated. You can use the file ```arcd_preprocessing.py``` as shown below to clean, preprocess the ARCD dataset before running ```run_squad.py```. More detailed Colab notebook is available in the [SOQAL repo](https://github.com/husseinmozannar/SOQAL).
```bash
python arcd_preprocessing.py \
--input_file="/PATH_TO/arcd-test.json" \
--output_file="arcd-test-pre.json" \
--do_farasa_tokenization=True \
--use_farasapy=True \
```
```bash
python SOQAL/bert/run_squad.py \
--vocab_file="/PATH_TO_PRETRAINED_TF_CKPT/vocab.txt" \
--bert_config_file="/PATH_TO_PRETRAINED_TF_CKPT/config.json" \
--init_checkpoint="/PATH_TO_PRETRAINED_TF_CKPT/" \
--do_train=True \
--train_file=turk_combined_all_pre.json \
--do_predict=True \
--predict_file=arcd-test-pre.json \
--train_batch_size=32 \
--predict_batch_size=24 \
--learning_rate=3e-5 \
--num_train_epochs=4 \
--max_seq_length=384 \
--doc_stride=128 \
--do_lower_case=False\
--output_dir="/PATH_TO/OUTPUT_PATH"/ \
--use_tpu=True \
--tpu_name=$TPU_ADDRESS \
```
## Model Weights and Vocab Download
Models | AraBERTv0.1 | AraBERTv1
---|:---:|:---:
@@ -73,21 +128,17 @@ PyTorch| [Drive_Link](https://drive.google.com/open?id=1-_3te42mQCPD8SxwZ3l-VBL7
## If you used this model please cite us as:
```
@misc{antoun2020arabert,
title={AraBERT: Transformer-based Model for Arabic Language Understanding},
author={Wissam Antoun and Fady Baly and Hazem Hajj},
year={2020},
eprint={2003.00104},
archivePrefix={arXiv},
primaryClass={cs.CL}
@inproceedings{antoun2020arabert,
title={AraBERT: Transformer-based Model for Arabic Language Understanding},
author={Antoun, Wissam and Baly, Fady and Hajj, Hazem},
booktitle={LREC 2020 Workshop Language Resources and Evaluation Conference 11--16 May 2020},
pages={9}
}
```
## Acknowledgments
Thanks to TensorFlow Research Cloud (TFRC) for the free access to Cloud TPUs, couldn't have done it without this program, and to the [AUB MIND Lab](https://sites.aub.edu.lb/mindlab/) Members for the continous support. Also thanks to [Yakshof](https://www.yakshof.com/#/) and Assafir for data and storage access.
Thanks to TensorFlow Research Cloud (TFRC) for the free access to Cloud TPUs, couldn't have done it without this program, and to the [AUB MIND Lab](https://sites.aub.edu.lb/mindlab/) Members for the continous support. Also thanks to [Yakshof](https://www.yakshof.com/#/) and Assafir for data and storage access. Another thanks for Habib Rahal (https://www.behance.net/rahalhabib), for putting a face to AraBERT.
## Contacts
**Wissam Antoun**: [Linkedin](https://www.linkedin.com/in/giulio-ravasio-3a81a9110/) | [Twitter](https://twitter.com/wissam_antoun) | [Github](https://github.com/WissamAntoun) | <wfa07@mail.aub.edu> | <wissam.antoun@gmail.com>
**Fady Baly**: [Linkedin](https://www.linkedin.com/in/fadybaly/) | [Twitter](https://twitter.com/BalyFady) | [Github](https://github.com/fadybaly) | <fgb06@mail.aub.edu> | <baly.fady@gmail.com>
***We are looking for sponsors to train BERT-Large and other Transformer models, the sponsor only needs to cover to data storage and compute cost of the generating the pretraining data***
**Fady Baly**: [Linkedin](https://www.linkedin.com/in/fadybaly/) | [Twitter](https://twitter.com/fadybaly) | [Github](https://github.com/fadybaly) | <fgb06@mail.aub.edu> | <baly.fady@gmail.com>

View File

@@ -0,0 +1,55 @@
# ALBERT-Mongolian
[pretraining repo link](https://github.com/bayartsogt-ya/albert-mongolian)
## Model description
Here we provide pretrained ALBERT model and trained SentencePiece model for Mongolia text. Training data is the Mongolian wikipedia corpus from Wikipedia Downloads and Mongolian News corpus.
## Evaluation Result:
```
loss = 1.7478163
masked_lm_accuracy = 0.6838185
masked_lm_loss = 1.6687671
sentence_order_accuracy = 0.998125
sentence_order_loss = 0.007942731
```
## Fine-tuning Result on Eduge Dataset:
```
precision recall f1-score support
байгал орчин 0.83 0.76 0.80 483
боловсрол 0.79 0.75 0.77 420
спорт 0.98 0.96 0.97 1391
технологи 0.85 0.83 0.84 543
улс төр 0.88 0.87 0.87 1336
урлаг соёл 0.89 0.94 0.91 726
хууль 0.87 0.83 0.85 840
эдийн засаг 0.80 0.84 0.82 1265
эрүүл мэнд 0.84 0.90 0.87 562
accuracy 0.87 7566
macro avg 0.86 0.85 0.86 7566
weighted avg 0.87 0.87 0.87 7566
```
## Reference
1. [ALBERT - official repo](https://github.com/google-research/albert)
2. [WikiExtrator](https://github.com/attardi/wikiextractor)
3. [Mongolian BERT](https://github.com/tugstugi/mongolian-bert)
4. [ALBERT - Japanese](https://github.com/alinear-corp/albert-japanese)
5. [Mongolian Text Classification](https://github.com/sharavsambuu/mongolian-text-classification)
6. [You's paper](https://arxiv.org/abs/1904.00962)
## Citation
```
@misc{albert-mongolian,
author = {Bayartsogt Yadamsuren},
title = {ALBERT Pretrained Model on Mongolian Datasets},
year = {2020},
publisher = {GitHub},
journal = {GitHub repository},
howpublished = {\url{https://github.com/bayartsogt-ya/albert-mongolian/}}
}
```
## For More Information
Please contact by bayartsogtyadamsuren@icloud.com

View File

@@ -18,13 +18,16 @@ tags:
**Eval data:** Conll03 (NER), GermEval14 (NER), GermEval18 (Classification), GNAD (Classification)
**Infrastructure**: 1x TPU v2
**Published**: Jun 14th, 2019
**Update April 3rd, 2020**: we updated the vocabulary file on deepset's s3 to conform with the default tokenization of punctuation tokens.
For details see the related [FARM issue](https://github.com/deepset-ai/FARM/issues/60). If you want to use the old vocab we have also uploaded a ["deepset/bert-base-german-cased-oldvocab"](https://huggingface.co/deepset/bert-base-german-cased-oldvocab) model.
## Details
- We trained using Google's Tensorflow code on a single cloud TPU v2 with standard settings.
- We trained 810k steps with a batch size of 1024 for sequence length 128 and 30k steps with sequence length 512. Training took about 9 days.
- As training data we used the latest German Wikipedia dump (6GB of raw txt files), the OpenLegalData dump (2.4 GB) and news articles (3.6 GB).
- We cleaned the data dumps with tailored scripts and segmented sentences with spacy v2.1. To create tensorflow records we used the recommended sentencepiece library for creating the word piece vocabulary and tensorflow scripts to convert the text to data usable by BERT.
- Update April 3rd, 2020: updated the vocab file on deepset s3 to adjust tokenization of punctuation.
See https://deepset.ai/german-bert for more details

View File

@@ -0,0 +1,28 @@
---
language: german
thumbnail: https://static.tildacdn.com/tild6438-3730-4164-b266-613634323466/german_bert.png
tags:
- exbert
---
<a href="https://huggingface.co/exbert/?model=bert-base-german-cased">
<img width="300px" src="https://hf-dinosaur.huggingface.co/exbert/button.png">
</a>
# German BERT with old vocabulary
For details see the related [FARM issue](https://github.com/deepset-ai/FARM/issues/60).
## About us
![deepset logo](https://raw.githubusercontent.com/deepset-ai/FARM/master/docs/img/deepset_logo.png)
We bring NLP to the industry via open source!
Our focus: Industry specific language models & large scale QA systems.
Some of our work:
- [German BERT (aka "bert-base-german-cased")](https://deepset.ai/german-bert)
- [FARM](https://github.com/deepset-ai/FARM)
- [Haystack](https://github.com/deepset-ai/haystack/)
Get in touch:
[Twitter](https://twitter.com/deepset_ai) | [LinkedIn](https://www.linkedin.com/company/deepset-ai/) | [Website](https://deepset.ai)

View File

@@ -0,0 +1,18 @@
# COVID-Twitter-BERT (CT-BERT)
BERT-large-uncased model, pretrained on a corpus of messages from Twitter about COVID-19
## Overview
This model was trained on 160M tweets collected between January 12 and April 16, 2020 containing at least one of the keywords "wuhan", "ncov", "coronavirus", "covid", or "sars-cov-2". These tweets were filtered and preprocessed to reach a final sample of 22.5M tweets (containing 40.7M sentences and 633M tokens) which were used for training.
This model was evaluated based on downstream classification tasks, but it could be used for any other NLP task which can leverage contextual embeddings.
In order to achieve best results, make sure to use the same text preprocessing as we did for pretraining. This involves replacing user mentions, urls and emojis. You can find a script on our projects [GitHub repo](https://github.com/digitalepidemiologylab/covid-twitter-bert).
## Example usage
```python
tokenizer = AutoTokenizer.from_pretrained("digitalepidemiologylab/covid-twitter-bert")
model = TFAutoModel.from_pretrained("digitalepidemiologylab/covid-twitter-bert")
```
## References
[1] Martin Müller, Marcel Salaté, Per E Kummervold. "COVID-Twitter-BERT: A Natural Language Processing Model to Analyse COVID-19 Content on Twitter" arXiv preprint arXiv:2005.07503 (2020).

View File

@@ -0,0 +1,135 @@
---
language: polish
thumbnail: https://raw.githubusercontent.com/kldarek/polbert/master/img/polbert.png
---
# Polbert - Polish BERT
Polish version of BERT language model is here! It is now available in two variants: cased and uncased, both can be downloaded and used via HuggingFace transformers library. I recommend using the cased model, more info on the differences and benchmark results below.
![PolBERT image](https://raw.githubusercontent.com/kldarek/polbert/master/img/polbert.png)
## Cased and uncased variants
* I initially trained the uncased model, the corpus and training details are referenced below. Here are some issues I found after I published the uncased model:
* Some Polish characters and accents are not tokenized correctly through the BERT tokenizer when applying lowercase. This doesn't impact sequence classification much, but may influence token classfication tasks significantly.
* I noticed a lot of duplicates in the Open Subtitles dataset, which dominates the training corpus.
* I didn't use Whole Word Masking.
* The cased model improves on the uncased model in the following ways:
* All Polish characters and accents should now be tokenized correctly.
* I removed duplicates from Open Subtitles dataset. The corpus is smaller, but more balanced now.
* The model is trained with Whole Word Masking.
## Pre-training corpora
Below is the list of corpora used along with the output of `wc` command (counting lines, words and characters). These corpora were divided into sentences with srxsegmenter (see references), concatenated and tokenized with HuggingFace BERT Tokenizer.
### Uncased
| Tables | Lines | Words | Characters |
| ------------- |--------------:| -----:| -----:|
| [Polish subset of Open Subtitles](http://opus.nlpl.eu/OpenSubtitles-v2018.php) | 236635408| 1431199601 | 7628097730 |
| [Polish subset of ParaCrawl](http://opus.nlpl.eu/ParaCrawl.php) | 8470950 | 176670885 | 1163505275 |
| [Polish Parliamentary Corpus](http://clip.ipipan.waw.pl/PPC) | 9799859 | 121154785 | 938896963 |
| [Polish Wikipedia - Feb 2020](https://dumps.wikimedia.org/plwiki/latest/plwiki-latest-pages-articles.xml.bz2) | 8014206 | 132067986 | 1015849191 |
| Total | 262920423 | 1861093257 | 10746349159 |
### Cased
| Tables | Lines | Words | Characters |
| ------------- |--------------:| -----:| -----:|
| [Polish subset of Open Subtitles (Deduplicated) ](http://opus.nlpl.eu/OpenSubtitles-v2018.php) | 41998942| 213590656 | 1424873235 |
| [Polish subset of ParaCrawl](http://opus.nlpl.eu/ParaCrawl.php) | 8470950 | 176670885 | 1163505275 |
| [Polish Parliamentary Corpus](http://clip.ipipan.waw.pl/PPC) | 9799859 | 121154785 | 938896963 |
| [Polish Wikipedia - Feb 2020](https://dumps.wikimedia.org/plwiki/latest/plwiki-latest-pages-articles.xml.bz2) | 8014206 | 132067986 | 1015849191 |
| Total | 68283960 | 646479197 | 4543124667 |
## Pre-training details
### Uncased
* Polbert was trained with code provided in Google BERT's github repository (https://github.com/google-research/bert)
* Currently released model follows bert-base-uncased model architecture (12-layer, 768-hidden, 12-heads, 110M parameters)
* Training set-up: in total 1 million training steps:
* 100.000 steps - 128 sequence length, batch size 512, learning rate 1e-4 (10.000 steps warmup)
* 800.000 steps - 128 sequence length, batch size 512, learning rate 5e-5
* 100.000 steps - 512 sequence length, batch size 256, learning rate 2e-5
* The model was trained on a single Google Cloud TPU v3-8
### Cased
* Same approach as uncased model, with the following differences:
* Whole Word Masking
* Training set-up:
* 100.000 steps - 128 sequence length, batch size 2048, learning rate 1e-4 (10.000 steps warmup)
* 100.000 steps - 128 sequence length, batch size 2048, learning rate 5e-5
* 100.000 steps - 512 sequence length, batch size 256, learning rate 2e-5
## Usage
Polbert is released via [HuggingFace Transformers library](https://huggingface.co/transformers/).
For an example use as language model, see [this notebook](/LM_testing.ipynb) file.
### Uncased
```python
from transformers import *
model = BertForMaskedLM.from_pretrained("dkleczek/bert-base-polish-uncased-v1")
tokenizer = BertTokenizer.from_pretrained("dkleczek/bert-base-polish-uncased-v1")
nlp = pipeline('fill-mask', model=model, tokenizer=tokenizer)
for pred in nlp(f"Adam Mickiewicz wielkim polskim {nlp.tokenizer.mask_token} był."):
print(pred)
# Output:
# {'sequence': '[CLS] adam mickiewicz wielkim polskim poeta był. [SEP]', 'score': 0.47196975350379944, 'token': 26596}
# {'sequence': '[CLS] adam mickiewicz wielkim polskim bohaterem był. [SEP]', 'score': 0.09127858281135559, 'token': 10953}
# {'sequence': '[CLS] adam mickiewicz wielkim polskim człowiekiem był. [SEP]', 'score': 0.0647173821926117, 'token': 5182}
# {'sequence': '[CLS] adam mickiewicz wielkim polskim pisarzem był. [SEP]', 'score': 0.05232388526201248, 'token': 24293}
# {'sequence': '[CLS] adam mickiewicz wielkim polskim politykiem był. [SEP]', 'score': 0.04554257541894913, 'token': 44095}
```
### Cased
```python
model = BertForMaskedLM.from_pretrained("dkleczek/bert-base-polish-cased-v1")
tokenizer = BertTokenizer.from_pretrained("dkleczek/bert-base-polish-cased-v1")
nlp = pipeline('fill-mask', model=model, tokenizer=tokenizer)
for pred in nlp(f"Adam Mickiewicz wielkim polskim {nlp.tokenizer.mask_token} był."):
print(pred)
# Output:
# {'sequence': '[CLS] Adam Mickiewicz wielkim polskim pisarzem był. [SEP]', 'score': 0.5391148328781128, 'token': 37120}
# {'sequence': '[CLS] Adam Mickiewicz wielkim polskim człowiekiem był. [SEP]', 'score': 0.11683262139558792, 'token': 6810}
# {'sequence': '[CLS] Adam Mickiewicz wielkim polskim bohaterem był. [SEP]', 'score': 0.06021466106176376, 'token': 17709}
# {'sequence': '[CLS] Adam Mickiewicz wielkim polskim mistrzem był. [SEP]', 'score': 0.051870670169591904, 'token': 14652}
# {'sequence': '[CLS] Adam Mickiewicz wielkim polskim artystą był. [SEP]', 'score': 0.031787533313035965, 'token': 35680}
```
See the next section for an example usage of Polbert in downstream tasks.
## Evaluation
Thanks to Allegro, we now have the [KLEJ benchmark](https://klejbenchmark.com/leaderboard/), a set of nine evaluation tasks for the Polish language understanding. The following results are achieved by running standard set of evaluation scripts (no tricks!) utilizing both cased and uncased variants of Polbert.
| Model | Average | NKJP-NER | CDSC-E | CDSC-R | CBD | PolEmo2.0-IN | PolEmo2.0-OUT | DYK | PSC | AR |
| ------------- |--------------:|--------------:|--------------:|--------------:|--------------:|--------------:|--------------:|--------------:|--------------:|--------------:|
| Polbert cased | 81.7 | 93.6 | 93.4 | 93.8 | 52.7 | 87.4 | 71.1 | 59.1 | 98.6 | 85.2 |
| Polbert uncased | 81.4 | 90.1 | 93.9 | 93.5 | 55.0 | 88.1 | 68.8 | 59.4 | 98.8 | 85.4 |
Note how the uncased model performs better than cased on some tasks? My guess this is because of the oversampling of Open Subtitles dataset and its similarity to data in some of these tasks. All these benchmark tasks are sequence classification, so the relative strength of the cased model is not so visible here.
## Bias
The data used to train the model is biased. It may reflect stereotypes related to gender, ethnicity etc. Please be careful when using the model for downstream task to consider these biases and mitigate them.
## Acknowledgements
* I'd like to express my gratitude to Google [TensorFlow Research Cloud (TFRC)](https://www.tensorflow.org/tfrc) for providing the free TPU credits - thank you!
* Also appreciate the help from Timo Möller from [deepset](https://deepset.ai) for sharing tips and scripts based on their experience training German BERT model.
* Big thanks to Allegro for releasing KLEJ Benchmark and specifically to Piotr Rybak for help with the evaluation and pointing out some issues with the tokenization.
* Finally, thanks to Rachel Thomas, Jeremy Howard and Sylvain Gugger from [fastai](https://www.fast.ai) for their NLP and Deep Learning courses!
## Author
Darek Kłeczek - contact me on Twitter [@dk21](https://twitter.com/dk21)
## References
* https://github.com/google-research/bert
* https://github.com/narusemotoki/srx_segmenter
* SRX rules file for sentence splitting in Polish, written by Marcin Miłkowski: https://raw.githubusercontent.com/languagetool-org/languagetool/master/languagetool-core/src/main/resources/org/languagetool/resource/segment.srx
* [KLEJ benchmark](https://klejbenchmark.com/leaderboard/)

View File

@@ -4,14 +4,27 @@ thumbnail: https://raw.githubusercontent.com/kldarek/polbert/master/img/polbert.
---
# Polbert - Polish BERT
Polish version of BERT language model is here! While this is still work in progress, I'm happy to share the first model, similar to BERT-Base and trained on a large Polish corpus. If you'd like to contribute to this project, please reach out to me!
Polish version of BERT language model is here! It is now available in two variants: cased and uncased, both can be downloaded and used via HuggingFace transformers library. I recommend using the cased model, more info on the differences and benchmark results below.
![PolBERT image](https://raw.githubusercontent.com/kldarek/polbert/master/img/polbert.png)
## Cased and uncased variants
* I initially trained the uncased model, the corpus and training details are referenced below. Here are some issues I found after I published the uncased model:
* Some Polish characters and accents are not tokenized correctly through the BERT tokenizer when applying lowercase. This doesn't impact sequence classification much, but may influence token classfication tasks significantly.
* I noticed a lot of duplicates in the Open Subtitles dataset, which dominates the training corpus.
* I didn't use Whole Word Masking.
* The cased model improves on the uncased model in the following ways:
* All Polish characters and accents should now be tokenized correctly.
* I removed duplicates from Open Subtitles dataset. The corpus is smaller, but more balanced now.
* The model is trained with Whole Word Masking.
## Pre-training corpora
Below is the list of corpora used along with the output of `wc` command (counting lines, words and characters). These corpora were divided into sentences with srxsegmenter (see references), concatenated and tokenized with HuggingFace BERT Tokenizer.
### Uncased
| Tables | Lines | Words | Characters |
| ------------- |--------------:| -----:| -----:|
| [Polish subset of Open Subtitles](http://opus.nlpl.eu/OpenSubtitles-v2018.php) | 236635408| 1431199601 | 7628097730 |
@@ -20,7 +33,21 @@ Below is the list of corpora used along with the output of `wc` command (countin
| [Polish Wikipedia - Feb 2020](https://dumps.wikimedia.org/plwiki/latest/plwiki-latest-pages-articles.xml.bz2) | 8014206 | 132067986 | 1015849191 |
| Total | 262920423 | 1861093257 | 10746349159 |
### Cased
| Tables | Lines | Words | Characters |
| ------------- |--------------:| -----:| -----:|
| [Polish subset of Open Subtitles (Deduplicated) ](http://opus.nlpl.eu/OpenSubtitles-v2018.php) | 41998942| 213590656 | 1424873235 |
| [Polish subset of ParaCrawl](http://opus.nlpl.eu/ParaCrawl.php) | 8470950 | 176670885 | 1163505275 |
| [Polish Parliamentary Corpus](http://clip.ipipan.waw.pl/PPC) | 9799859 | 121154785 | 938896963 |
| [Polish Wikipedia - Feb 2020](https://dumps.wikimedia.org/plwiki/latest/plwiki-latest-pages-articles.xml.bz2) | 8014206 | 132067986 | 1015849191 |
| Total | 68283960 | 646479197 | 4543124667 |
## Pre-training details
### Uncased
* Polbert was trained with code provided in Google BERT's github repository (https://github.com/google-research/bert)
* Currently released model follows bert-base-uncased model architecture (12-layer, 768-hidden, 12-heads, 110M parameters)
* Training set-up: in total 1 million training steps:
@@ -29,10 +56,22 @@ Below is the list of corpora used along with the output of `wc` command (countin
* 100.000 steps - 512 sequence length, batch size 256, learning rate 2e-5
* The model was trained on a single Google Cloud TPU v3-8
### Cased
* Same approach as uncased model, with the following differences:
* Whole Word Masking
* Training set-up:
* 100.000 steps - 128 sequence length, batch size 2048, learning rate 1e-4 (10.000 steps warmup)
* 100.000 steps - 128 sequence length, batch size 2048, learning rate 5e-5
* 100.000 steps - 512 sequence length, batch size 256, learning rate 2e-5
## Usage
Polbert is released via [HuggingFace Transformers library](https://huggingface.co/transformers/).
For an example use as language model, see [this notebook](https://github.com/kldarek/polbert/blob/master/LM_testing.ipynb) file.
For an example use as language model, see [this notebook](/LM_testing.ipynb) file.
### Uncased
```python
from transformers import *
@@ -41,7 +80,6 @@ tokenizer = BertTokenizer.from_pretrained("dkleczek/bert-base-polish-uncased-v1"
nlp = pipeline('fill-mask', model=model, tokenizer=tokenizer)
for pred in nlp(f"Adam Mickiewicz wielkim polskim {nlp.tokenizer.mask_token} był."):
print(pred)
# Output:
# {'sequence': '[CLS] adam mickiewicz wielkim polskim poeta był. [SEP]', 'score': 0.47196975350379944, 'token': 26596}
# {'sequence': '[CLS] adam mickiewicz wielkim polskim bohaterem był. [SEP]', 'score': 0.09127858281135559, 'token': 10953}
@@ -50,23 +88,42 @@ for pred in nlp(f"Adam Mickiewicz wielkim polskim {nlp.tokenizer.mask_token} by
# {'sequence': '[CLS] adam mickiewicz wielkim polskim politykiem był. [SEP]', 'score': 0.04554257541894913, 'token': 44095}
```
### Cased
```python
model = BertForMaskedLM.from_pretrained("dkleczek/bert-base-polish-cased-v1")
tokenizer = BertTokenizer.from_pretrained("dkleczek/bert-base-polish-cased-v1")
nlp = pipeline('fill-mask', model=model, tokenizer=tokenizer)
for pred in nlp(f"Adam Mickiewicz wielkim polskim {nlp.tokenizer.mask_token} był."):
print(pred)
# Output:
# {'sequence': '[CLS] Adam Mickiewicz wielkim polskim pisarzem był. [SEP]', 'score': 0.5391148328781128, 'token': 37120}
# {'sequence': '[CLS] Adam Mickiewicz wielkim polskim człowiekiem był. [SEP]', 'score': 0.11683262139558792, 'token': 6810}
# {'sequence': '[CLS] Adam Mickiewicz wielkim polskim bohaterem był. [SEP]', 'score': 0.06021466106176376, 'token': 17709}
# {'sequence': '[CLS] Adam Mickiewicz wielkim polskim mistrzem był. [SEP]', 'score': 0.051870670169591904, 'token': 14652}
# {'sequence': '[CLS] Adam Mickiewicz wielkim polskim artystą był. [SEP]', 'score': 0.031787533313035965, 'token': 35680}
```
See the next section for an example usage of Polbert in downstream tasks.
## Evaluation
I'd love to get some help from the Polish NLP community here! If you feel like evaluating Polbert on some benchmark tasks, it would be great if you can share the results.
Thanks to Allegro, we now have the [KLEJ benchmark](https://klejbenchmark.com/leaderboard/), a set of nine evaluation tasks for the Polish language understanding. The following results are achieved by running standard set of evaluation scripts (no tricks!) utilizing both cased and uncased variants of Polbert.
So far, I've compared the performance of Polbert vs Multilingual BERT on PolEmo 2.0 sentiment classification, here are the results. These results are are produced with a linear classification layer on top of pooled output, trained for 10 epochs with learning rate 3e-5. The checkpoint with the lowest loss on validation set is evaluated on the test set.
| Model | Average | NKJP-NER | CDSC-E | CDSC-R | CBD | PolEmo2.0-IN | PolEmo2.0-OUT | DYK | PSC | AR |
| ------------- |--------------:|--------------:|--------------:|--------------:|--------------:|--------------:|--------------:|--------------:|--------------:|--------------:|
| Polbert cased | 81.7 | 93.6 | 93.4 | 93.8 | 52.7 | 87.4 | 71.1 | 59.1 | 98.6 | 85.2 |
| Polbert uncased | 81.4 | 90.1 | 93.9 | 93.5 | 55.0 | 88.1 | 68.8 | 59.4 | 98.8 | 85.4 |
| PolEmo 2.0 Sentiment Classifcation | Test Accuracy |
| ------------- |--------------:|
| Multilingual BERT | 0.78 |
| Polbert | 0.85 |
Note how the uncased model performs better than cased on some tasks? My guess this is because of the oversampling of Open Subtitles dataset and its similarity to data in some of these tasks. All these benchmark tasks are sequence classification, so the relative strength of the cased model is not so visible here.
## Bias
The data used to train the model is biased. It may reflect stereotypes related to gender, ethnicity etc. Please be careful when using the model for downstream task to consider these biases and mitigate them.
## Acknowledgements
I'd like to express my gratitude to Google [TensorFlow Research Cloud (TFRC)](https://www.tensorflow.org/tfrc) for providing the free TPU credits - thank you! Also appreciate the help from Timo Möller from [deepset](https://deepset.ai) for sharing tips and scripts based on their experience training German BERT model. Finally, thanks to Rachel Thomas, Jeremy Howard and Sylvain Gugger from [fastai](https://www.fast.ai) for their NLP and Deep Learning courses!
* I'd like to express my gratitude to Google [TensorFlow Research Cloud (TFRC)](https://www.tensorflow.org/tfrc) for providing the free TPU credits - thank you!
* Also appreciate the help from Timo Möller from [deepset](https://deepset.ai) for sharing tips and scripts based on their experience training German BERT model.
* Big thanks to Allegro for releasing KLEJ Benchmark and specifically to Piotr Rybak for help with the evaluation and pointing out some issues with the tokenization.
* Finally, thanks to Rachel Thomas, Jeremy Howard and Sylvain Gugger from [fastai](https://www.fast.ai) for their NLP and Deep Learning courses!
## Author
Darek Kłeczek - contact me on Twitter [@dk21](https://twitter.com/dk21)
@@ -75,5 +132,4 @@ Darek Kłeczek - contact me on Twitter [@dk21](https://twitter.com/dk21)
* https://github.com/google-research/bert
* https://github.com/narusemotoki/srx_segmenter
* SRX rules file for sentence splitting in Polish, written by Marcin Miłkowski: https://raw.githubusercontent.com/languagetool-org/languagetool/master/languagetool-core/src/main/resources/org/languagetool/resource/segment.srx
* PolEmo 2.0 Sentiment Analysis Dataset for CoNLL: https://clarin-pl.eu/dspace/handle/11321/710
* [KLEJ benchmark](https://klejbenchmark.com/leaderboard/)

View File

@@ -0,0 +1,48 @@
---
language: romanian
---
# bert-base-romanian-cased-v1
The BERT **base**, **cased** model for Romanian, trained on a 15GB corpus, version ![v1.0](https://img.shields.io/badge/v1.0-21%20Apr%202020-ff6666)
### How to use
```python
from transformers import AutoTokenizer, AutoModel
import torch
# load tokenizer and model
tokenizer = AutoTokenizer.from_pretrained("dumitrescustefan/bert-base-romanian-cased-v1")
model = AutoModel.from_pretrained("dumitrescustefan/bert-base-romanian-cased-v1")
# tokenize a sentence and run through the model
input_ids = torch.tensor(tokenizer.encode("Acesta este un test.", add_special_tokens=True)).unsqueeze(0) # Batch size 1
outputs = model(input_ids)
# get encoding
last_hidden_states = outputs[0] # The last hidden-state is the first element of the output tuple
```
### Evaluation
Evaluation is performed on Universal Dependencies [Romanian RRT](https://universaldependencies.org/treebanks/ro_rrt/index.html) UPOS, XPOS and LAS, and on a NER task based on [RONEC](https://github.com/dumitrescustefan/ronec). Details, as well as more in-depth tests not shown here, are given in the dedicated [evaluation page](https://github.com/dumitrescustefan/Romanian-Transformers/tree/master/evaluation/README.md).
The baseline is the [Multilingual BERT](https://github.com/google-research/bert/blob/master/multilingual.md) model ``bert-base-multilingual-(un)cased``, as at the time of writing it was the only available BERT model that works on Romanian.
| Model | UPOS | XPOS | NER | LAS |
|--------------------------------|:-----:|:------:|:-----:|:-----:|
| bert-base-multilingual-cased | 97.87 | 96.16 | 84.13 | 88.04 |
| bert-base-romanian-cased-v1 | **98.00** | **96.46** | **85.88** | **89.69** |
### Corpus
The model is trained on the following corpora (stats in the table below are after cleaning):
| Corpus | Lines(M) | Words(M) | Chars(B) | Size(GB) |
|----------- |:--------: |:--------: |:--------: |:--------: |
| OPUS | 55.05 | 635.04 | 4.045 | 3.8 |
| OSCAR | 33.56 | 1725.82 | 11.411 | 11 |
| Wikipedia | 1.54 | 60.47 | 0.411 | 0.4 |
| **Total** | **90.15** | **2421.33** | **15.867** | **15.2** |
#### Acknowledgements
- We'd like to thank [Sampo Pyysalo](https://github.com/spyysalo) from TurkuNLP for helping us out with the compute needed to pretrain the v1.0 BERT models. He's awesome!

View File

@@ -0,0 +1,51 @@
---
language: romanian
---
# bert-base-romanian-uncased-v1
The BERT **base**, **uncased** model for Romanian, trained on a 15GB corpus, version ![v1.0](https://img.shields.io/badge/v1.0-21%20Apr%202020-ff6666)
### How to use
```python
from transformers import AutoTokenizer, AutoModel
import torch
# load tokenizer and model
tokenizer = AutoTokenizer.from_pretrained("dumitrescustefan/bert-base-romanian-uncased-v1", do_lower_case=True)
model = AutoModel.from_pretrained("dumitrescustefan/bert-base-romanian-uncased-v1")
# tokenize a sentence and run through the model
input_ids = torch.tensor(tokenizer.encode("Acesta este un test.", add_special_tokens=True)).unsqueeze(0) # Batch size 1
outputs = model(input_ids)
# get encoding
last_hidden_states = outputs[0] # The last hidden-state is the first element of the output tuple
```
### Evaluation
Evaluation is performed on Universal Dependencies [Romanian RRT](https://universaldependencies.org/treebanks/ro_rrt/index.html) UPOS, XPOS and LAS, and on a NER task based on [RONEC](https://github.com/dumitrescustefan/ronec). Details, as well as more in-depth tests not shown here, are given in the dedicated [evaluation page](https://github.com/dumitrescustefan/Romanian-Transformers/tree/master/evaluation/README.md).
The baseline is the [Multilingual BERT](https://github.com/google-research/bert/blob/master/multilingual.md) model ``bert-base-multilingual-(un)cased``, as at the time of writing it was the only available BERT model that works on Romanian.
| Model | UPOS | XPOS | NER | LAS |
|--------------------------------|:-----:|:------:|:-----:|:-----:|
| bert-base-multilingual-uncased | 97.65 | 95.72 | 83.91 | 87.65 |
| bert-base-romanian-uncased-v1 | **98.18** | **96.84** | **85.26** | **89.61** |
### Corpus
The model is trained on the following corpora (stats in the table below are after cleaning):
| Corpus | Lines(M) | Words(M) | Chars(B) | Size(GB) |
|----------- |:--------: |:--------: |:--------: |:--------: |
| OPUS | 55.05 | 635.04 | 4.045 | 3.8 |
| OSCAR | 33.56 | 1725.82 | 11.411 | 11 |
| Wikipedia | 1.54 | 60.47 | 0.411 | 0.4 |
| **Total** | **90.15** | **2421.33** | **15.867** | **15.2** |
#### Acknowledgements
- We'd like to thank [Sampo Pyysalo](https://github.com/spyysalo) from TurkuNLP for helping us out with the compute needed to pretrain the v1.0 BERT models. He's awesome!

View File

@@ -0,0 +1,6 @@
---
tags:
- summarization
license: mit
---

View File

@@ -0,0 +1,74 @@
---
language: malay
---
# Bahasa T5 Model
Pretrained T5 base language model for Malay and Indonesian.
## Pretraining Corpus
`t5-base-bahasa-cased` model was pretrained on multiple tasks. Below is list of tasks we trained on,
1. [Unsupervised](https://github.com/google-research/text-to-text-transfer-transformer/blob/master/t5/data/preprocessors.py#L1875) on [local Wikipedia](https://github.com/huseinzol05/Malaya-Dataset#wikipedia-1).
2. [Unsupervised](https://github.com/google-research/text-to-text-transfer-transformer/blob/master/t5/data/preprocessors.py#L1875) on [local news](https://github.com/huseinzol05/Malaya-Dataset#public-news).
3. [Unsupervised](https://github.com/google-research/text-to-text-transfer-transformer/blob/master/t5/data/preprocessors.py#L1875) on [local parliament text](https://github.com/huseinzol05/Malaya-Dataset#parliament).
4. [Unsupervised](https://github.com/google-research/text-to-text-transfer-transformer/blob/master/t5/data/preprocessors.py#L1875) on [IIUM Confession](https://github.com/huseinzol05/Malaya-Dataset#iium-confession).
5. [Unsupervised](https://github.com/google-research/text-to-text-transfer-transformer/blob/master/t5/data/preprocessors.py#L1875) on [Wattpad](https://github.com/huseinzol05/Malaya-Dataset#wattpad).
6. [Unsupervised](https://github.com/google-research/text-to-text-transfer-transformer/blob/master/t5/data/preprocessors.py#L1875) on [Academia PDF](https://github.com/huseinzol05/Malaya-Dataset#academia-pdf).
7. [Next sentence prediction](https://github.com/google-research/text-to-text-transfer-transformer/blob/master/t5/data/preprocessors.py#L1129) on [local Wikipedia](https://github.com/huseinzol05/Malaya-Dataset#wikipedia-1).
8. [Next sentence prediction](https://github.com/google-research/text-to-text-transfer-transformer/blob/master/t5/data/preprocessors.py#L1129) on [local news](https://github.com/huseinzol05/Malaya-Dataset#public-news).
9. [Next sentence prediction](https://github.com/google-research/text-to-text-transfer-transformer/blob/master/t5/data/preprocessors.py#L1129) on [local parliament text](https://github.com/huseinzol05/Malaya-Dataset#parliament).
10. [Next sentence prediction](https://github.com/google-research/text-to-text-transfer-transformer/blob/master/t5/data/preprocessors.py#L1129) on [IIUM Confession](https://github.com/huseinzol05/Malaya-Dataset#iium-confession).
11. [Next sentence prediction](https://github.com/google-research/text-to-text-transfer-transformer/blob/master/t5/data/preprocessors.py#L1129) on [Wattpad](https://github.com/huseinzol05/Malaya-Dataset#wattpad).
12. [Next sentence prediction](https://github.com/google-research/text-to-text-transfer-transformer/blob/master/t5/data/preprocessors.py#L1129) on [Academia PDF](https://github.com/huseinzol05/Malaya-Dataset#academia-pdf).
13. [Bahasa SNLI](https://github.com/huseinzol05/Malaya-Dataset#snli).
14. [Bahasa Question Quora](https://github.com/huseinzol05/Malaya-Dataset#quora).
15. [Bahasa Natural Questions](https://github.com/huseinzol05/Malaya-Dataset#natural-questions).
16. [News title summarization](https://github.com/huseinzol05/Malaya-Dataset#crawled-news).
17. [Stemming to original wikipedia](https://github.com/huseinzol05/Malaya/blob/master/pretrained-model/t5/generate-stemming.ipynb).
18. [Synonym to original wikipedia](https://github.com/huseinzol05/Malaya/blob/master/pretrained-model/t5/generate-synonym.ipynb).
Preprocessing steps can reproduce from here, [Malaya/pretrained-model/preprocess](https://github.com/huseinzol05/Malaya/tree/master/pretrained-model/preprocess).
## Pretraining details
- This model was trained using Google T5's github [repository](https://github.com/google-research/text-to-text-transfer-transformer) on v3-8 TPU.
- All steps can reproduce from here, [Malaya/pretrained-model/t5](https://github.com/huseinzol05/Malaya/tree/master/pretrained-model/t5).
## Load Pretrained Model
You can use this model by installing `torch` or `tensorflow` and Huggingface library `transformers`. And you can use it directly by initializing it like this:
```python
from transformers import T5Tokenizer, T5Model
model = T5Model.from_pretrained('huseinzol05/t5-base-bahasa-cased')
tokenizer = T5Tokenizer.from_pretrained('huseinzol05/t5-base-bahasa-cased')
```
## Example using T5ForConditionalGeneration
```python
from transformers import T5Tokenizer, T5ForConditionalGeneration
tokenizer = T5Tokenizer.from_pretrained('huseinzol05/t5-base-bahasa-cased')
model = T5ForConditionalGeneration.from_pretrained('huseinzol05/t5-base-bahasa-cased')
input_ids = tokenizer.encode('soalan: siapakah perdana menteri malaysia?', return_tensors = 'pt')
outputs = model.generate(input_ids)
print(tokenizer.decode(outputs[0]))
```
Output is,
```
'Mahathir Mohamad'
```
## Results
For further details on the model performance, simply checkout accuracy page from Malaya, https://malaya.readthedocs.io/en/latest/Accuracy.html, we compared with traditional models.
## Acknowledgement
Thanks to [Im Big](https://www.facebook.com/imbigofficial/), [LigBlou](https://www.facebook.com/ligblou), [Mesolitica](https://mesolitica.com/) and [KeyReply](https://www.keyreply.com/) for sponsoring AWS, Google and GPU clouds to train T5 for Bahasa.

View File

@@ -0,0 +1,74 @@
---
language: malay
---
# Bahasa T5 Model
Pretrained T5 small language model for Malay and Indonesian.
## Pretraining Corpus
`t5-small-bahasa-cased` model was pretrained on multiple tasks. Below is list of tasks we trained on,
1. [Unsupervised](https://github.com/google-research/text-to-text-transfer-transformer/blob/master/t5/data/preprocessors.py#L1875) on [local Wikipedia](https://github.com/huseinzol05/Malaya-Dataset#wikipedia-1).
2. [Unsupervised](https://github.com/google-research/text-to-text-transfer-transformer/blob/master/t5/data/preprocessors.py#L1875) on [local news](https://github.com/huseinzol05/Malaya-Dataset#public-news).
3. [Unsupervised](https://github.com/google-research/text-to-text-transfer-transformer/blob/master/t5/data/preprocessors.py#L1875) on [local parliament text](https://github.com/huseinzol05/Malaya-Dataset#parliament).
4. [Unsupervised](https://github.com/google-research/text-to-text-transfer-transformer/blob/master/t5/data/preprocessors.py#L1875) on [IIUM Confession](https://github.com/huseinzol05/Malaya-Dataset#iium-confession).
5. [Unsupervised](https://github.com/google-research/text-to-text-transfer-transformer/blob/master/t5/data/preprocessors.py#L1875) on [Wattpad](https://github.com/huseinzol05/Malaya-Dataset#wattpad).
6. [Unsupervised](https://github.com/google-research/text-to-text-transfer-transformer/blob/master/t5/data/preprocessors.py#L1875) on [Academia PDF](https://github.com/huseinzol05/Malaya-Dataset#academia-pdf).
7. [Next sentence prediction](https://github.com/google-research/text-to-text-transfer-transformer/blob/master/t5/data/preprocessors.py#L1129) on [local Wikipedia](https://github.com/huseinzol05/Malaya-Dataset#wikipedia-1).
8. [Next sentence prediction](https://github.com/google-research/text-to-text-transfer-transformer/blob/master/t5/data/preprocessors.py#L1129) on [local news](https://github.com/huseinzol05/Malaya-Dataset#public-news).
9. [Next sentence prediction](https://github.com/google-research/text-to-text-transfer-transformer/blob/master/t5/data/preprocessors.py#L1129) on [local parliament text](https://github.com/huseinzol05/Malaya-Dataset#parliament).
10. [Next sentence prediction](https://github.com/google-research/text-to-text-transfer-transformer/blob/master/t5/data/preprocessors.py#L1129) on [IIUM Confession](https://github.com/huseinzol05/Malaya-Dataset#iium-confession).
11. [Next sentence prediction](https://github.com/google-research/text-to-text-transfer-transformer/blob/master/t5/data/preprocessors.py#L1129) on [Wattpad](https://github.com/huseinzol05/Malaya-Dataset#wattpad).
12. [Next sentence prediction](https://github.com/google-research/text-to-text-transfer-transformer/blob/master/t5/data/preprocessors.py#L1129) on [Academia PDF](https://github.com/huseinzol05/Malaya-Dataset#academia-pdf).
13. [Bahasa SNLI](https://github.com/huseinzol05/Malaya-Dataset#snli).
14. [Bahasa Question Quora](https://github.com/huseinzol05/Malaya-Dataset#quora).
15. [Bahasa Natural Questions](https://github.com/huseinzol05/Malaya-Dataset#natural-questions).
16. [News title summarization](https://github.com/huseinzol05/Malaya-Dataset#crawled-news).
17. [Stemming to original wikipedia](https://github.com/huseinzol05/Malaya/blob/master/pretrained-model/t5/generate-stemming.ipynb).
18. [Synonym to original wikipedia](https://github.com/huseinzol05/Malaya/blob/master/pretrained-model/t5/generate-synonym.ipynb).
Preprocessing steps can reproduce from here, [Malaya/pretrained-model/preprocess](https://github.com/huseinzol05/Malaya/tree/master/pretrained-model/preprocess).
## Pretraining details
- This model was trained using Google T5's github [repository](https://github.com/google-research/text-to-text-transfer-transformer) on v3-8 TPU.
- All steps can reproduce from here, [Malaya/pretrained-model/t5](https://github.com/huseinzol05/Malaya/tree/master/pretrained-model/t5).
## Load Pretrained Model
You can use this model by installing `torch` or `tensorflow` and Huggingface library `transformers`. And you can use it directly by initializing it like this:
```python
from transformers import T5Tokenizer, T5Model
model = T5Model.from_pretrained('huseinzol05/t5-small-bahasa-cased')
tokenizer = T5Tokenizer.from_pretrained('huseinzol05/t5-small-bahasa-cased')
```
## Example using T5ForConditionalGeneration
```python
from transformers import T5Tokenizer, T5ForConditionalGeneration
tokenizer = T5Tokenizer.from_pretrained('huseinzol05/t5-small-bahasa-cased')
model = T5ForConditionalGeneration.from_pretrained('huseinzol05/t5-small-bahasa-cased')
input_ids = tokenizer.encode('soalan: siapakah perdana menteri malaysia?', return_tensors = 'pt')
outputs = model.generate(input_ids)
print(tokenizer.decode(outputs[0]))
```
Output is,
```
'Mahathir Mohamad'
```
## Results
For further details on the model performance, simply checkout accuracy page from Malaya, https://malaya.readthedocs.io/en/latest/Accuracy.html, we compared with traditional models.
## Acknowledgement
Thanks to [Im Big](https://www.facebook.com/imbigofficial/), [LigBlou](https://www.facebook.com/ligblou), [Mesolitica](https://mesolitica.com/) and [KeyReply](https://www.keyreply.com/) for sponsoring AWS, Google and GPU clouds to train T5 for Bahasa.

View File

@@ -0,0 +1,92 @@
---
language: spanish
thumbnail:
---
# RuPERTa-base (Spanish RoBERTa) + NER 🎃🏷
This model is a fine-tuned on [NER-C](https://www.kaggle.com/nltkdata/conll-corpora) version of [RuPERTa-base](https://huggingface.co/mrm8488/RuPERTa-base) for **NER** downstream task.
## Details of the downstream task (NER) - Dataset
- [Dataset: CONLL Corpora ES](https://www.kaggle.com/nltkdata/conll-corpora) 📚
| Dataset | # Examples |
| ---------------------- | ----- |
| Train | 329 K |
| Dev | 40 K |
- [Fine-tune on NER script provided by Huggingface](https://github.com/huggingface/transformers/blob/master/examples/token-classification/run_ner.py)
- Labels covered:
```
B-LOC
B-MISC
B-ORG
B-PER
I-LOC
I-MISC
I-ORG
I-PER
O
```
## Metrics on evaluation set 🧾
| Metric | # score |
| :------------------------------------------------------------------------------------: | :-------: |
| F1 | **77.55**
| Precision | **75.53** |
| Recall | **79.68** |
## Model in action 🔨
Example of usage:
```python
import torch
from transformers import AutoModelForTokenClassification, AutoTokenizer
id2label = {
"0": "B-LOC",
"1": "B-MISC",
"2": "B-ORG",
"3": "B-PER",
"4": "I-LOC",
"5": "I-MISC",
"6": "I-ORG",
"7": "I-PER",
"8": "O"
}
text ="Julien, CEO de HF, nació en Francia."
input_ids = torch.tensor(tokenizer.encode(text)).unsqueeze(0)
outputs = model(input_ids)
last_hidden_states = outputs[0]
for m in last_hidden_states:
for index, n in enumerate(m):
if(index > 0 and index <= len(text.split(" "))):
print(text.split(" ")[index-1] + ": " + id2label[str(torch.argmax(n).item())])
'''
Output:
--------
Julien,: I-PER
CEO: O
de: O
HF,: B-ORG
nació: I-PER
en: I-PER
Francia.: I-LOC
'''
```
Yeah! Not too bad 🎉
> Created by [Manuel Romero/@mrm8488](https://twitter.com/mrm8488)
> Made with <span style="color: #e25555;">&hearts;</span> in Spain

View File

@@ -0,0 +1,111 @@
---
language: spanish
thumbnail:
---
# RuPERTa-base (Spanish RoBERTa) + POS 🎃🏷
This model is a fine-tuned on [CONLL CORPORA](https://www.kaggle.com/nltkdata/conll-corpora) version of [RuPERTa-base](https://huggingface.co/mrm8488/RuPERTa-base) for **POS** downstream task.
## Details of the downstream task (POS) - Dataset
- [Dataset: CONLL Corpora ES](https://www.kaggle.com/nltkdata/conll-corpora) 📚
| Dataset | # Examples |
| ---------------------- | ----- |
| Train | 445 K |
| Dev | 55 K |
- [Fine-tune on NER script provided by Huggingface](https://github.com/huggingface/transformers/blob/master/examples/token-classification/run_ner.py)
- Labels covered:
```
ADJ
ADP
ADV
AUX
CCONJ
DET
INTJ
NOUN
NUM
PART
PRON
PROPN
PUNCT
SCONJ
SYM
VERB
```
## Metrics on evaluation set 🧾
| Metric | # score |
| :------------------------------------------------------------------------------------: | :-------: |
| F1 | **97.39**
| Precision | **97.47** |
| Recall | **9732** |
## Model in action 🔨
Example of usage
```python
import torch
from transformers import AutoModelForTokenClassification, AutoTokenizer
tokenizer = AutoTokenizer.from_pretrained('mrm8488/RuPERTa-base-finetuned-pos')
model = AutoModelForTokenClassification.from_pretrained('mrm8488/RuPERTa-base-finetuned-pos')
id2label = {
"0": "O",
"1": "ADJ",
"2": "ADP",
"3": "ADV",
"4": "AUX",
"5": "CCONJ",
"6": "DET",
"7": "INTJ",
"8": "NOUN",
"9": "NUM",
"10": "PART",
"11": "PRON",
"12": "PROPN",
"13": "PUNCT",
"14": "SCONJ",
"15": "SYM",
"16": "VERB"
}
text ="Mis amigos están pensando viajar a Londres este verano."
input_ids = torch.tensor(tokenizer.encode(text)).unsqueeze(0)
outputs = model(input_ids)
last_hidden_states = outputs[0]
for m in last_hidden_states:
for index, n in enumerate(m):
if(index > 0 and index <= len(text.split(" "))):
print(text.split(" ")[index-1] + ": " + id2label[str(torch.argmax(n).item())])
'''
Output:
--------
Mis: NUM
amigos: PRON
están: AUX
pensando: ADV
viajar: VERB
a: ADP
Londres: PROPN
este: DET
verano..: NOUN
'''
```
Yeah! Not too bad 🎉
> Created by [Manuel Romero/@mrm8488](https://twitter.com/mrm8488) | [LinkedIn](https://www.linkedin.com/in/manuel-romero-cs/)
> Made with <span style="color: #e25555;">&hearts;</span> in Spain

View File

@@ -0,0 +1,110 @@
---
language: italian
thumbnail:
---
# Italian BERT fine-tuned on SQuAD_it v1
[Italian BERT base cased](https://huggingface.co/dbmdz/bert-base-italian-cased) fine-tuned on [italian SQuAD](https://github.com/crux82/squad-it) for **Q&A** downstream task.
## Details of Italian BERT
The source data for the Italian BERT model consists of a recent Wikipedia dump and various texts from the OPUS corpora collection. The final training corpus has a size of 13GB and 2,050,057,573 tokens.
For sentence splitting, we use NLTK (faster compared to spacy). Our cased and uncased models are training with an initial sequence length of 512 subwords for ~2-3M steps.
For the XXL Italian models, we use the same training data from OPUS and extend it with data from the Italian part of the OSCAR corpus. Thus, the final training corpus has a size of 81GB and 13,138,379,147 tokens.
More in its official [model card](https://huggingface.co/dbmdz/bert-base-italian-cased)
Created by [Stefan](https://huggingface.co/stefan-it) at [MDZ](https://huggingface.co/dbmdz)
## Details of the downstream task (Q&A) - Dataset 📚 🧐 ❓
[Italian SQuAD v1.1](https://rajpurkar.github.io/SQuAD-explorer/) is derived from the SQuAD dataset and it is obtained through semi-automatic translation of the SQuAD dataset
into Italian. It represents a large-scale dataset for open question answering processes on factoid questions in Italian.
**The dataset contains more than 60,000 question/answer pairs derived from the original English dataset.** The dataset is split into training and test sets to support the replicability of the benchmarking of QA systems:
- `SQuAD_it-train.json`: it contains training examples derived from the original SQuAD 1.1 trainig material.
- `SQuAD_it-test.json`: it contains test/benchmarking examples derived from the origial SQuAD 1.1 development material.
More details about SQuAD-it can be found in [Croce et al. 2018]. The original paper can be found at this [link](https://link.springer.com/chapter/10.1007/978-3-030-03840-3_29).
## Model training 🏋️‍
The model was trained on a Tesla P100 GPU and 25GB of RAM.
The script for fine tuning can be found [here](https://github.com/huggingface/transformers/blob/master/examples/question-answering/run_squad.py)
## Results 📝
| Metric | # Value |
| ------ | --------- |
| **EM** | **62.51** |
| **F1** | **74.16** |
### Raw metrics
```json
{
"exact": 62.5180707057432,
"f1": 74.16038329042492,
"total": 7609,
"HasAns_exact": 62.5180707057432,
"HasAns_f1": 74.16038329042492,
"HasAns_total": 7609,
"best_exact": 62.5180707057432,
"best_exact_thresh": 0.0,
"best_f1": 74.16038329042492,
"best_f1_thresh": 0.0
}
```
## Comparison ⚖️
| Model | EM | F1 score |
| -------------------------------------------------------------------------------------------------------------------------------- | --------- | --------- |
| [DrQA-it trained on SQuAD-it ](https://github.com/crux82/squad-it/blob/master/README.md#evaluating-a-neural-model-over-squad-it) | 56.1 | 65.9 |
| This one | **62.51** | **74.16** |
## Model in action 🚀
Fast usage with **pipelines** 🧪
```python
from transformers import pipeline
nlp_qa = pipeline(
'question-answering',
model='mrm8488/bert-italian-finedtuned-squadv1-it-alfa',
tokenizer='mrm8488/bert-italian-finedtuned-squadv1-it-alfa'
)
nlp_qa(
{
'question': 'Per quale lingua stai lavorando?',
'context': 'Manuel Romero è colaborando attivamente con HF / trasformatori per il trader del poder de las últimas ' +
'técnicas di procesamiento de lenguaje natural al idioma español'
}
)
# Output: {'answer': 'español', 'end': 174, 'score': 0.9925341537498156, 'start': 168}
```
> Created by [Manuel Romero/@mrm8488](https://twitter.com/mrm8488) | [LinkedIn](https://www.linkedin.com/in/manuel-romero-cs/)
> Made with <span style="color: #e25555;">&hearts;</span> in Spain
Dataset citation
<details>
@InProceedings{10.1007/978-3-030-03840-3_29,
author="Croce, Danilo and Zelenanska, Alexandra and Basili, Roberto",
editor="Ghidini, Chiara and Magnini, Bernardo and Passerini, Andrea and Traverso, Paolo",
title="Neural Learning for Question Answering in Italian",
booktitle="AI*IA 2018 -- Advances in Artificial Intelligence",
year="2018",
publisher="Springer International Publishing",
address="Cham",
pages="389--402",
isbn="978-3-030-03840-3"
}
</detail>

View File

@@ -43,8 +43,8 @@ import torch
discriminator = ElectraForPreTraining.from_pretrained("mrm8488/electricidad-small-discriminator")
tokenizer = ElectraTokenizerFast.from_pretrained("mrm8488/electricidad-small-discriminator")
sentence = "El rápido zorro marrón salta sobre el perro perezoso"
fake_sentence = "El rápido zorro marrón falsea sobre el perro perezoso"
sentence = "el zorro rojo es muy rápido"
fake_sentence = "el zorro rojo es muy ser"
fake_tokens = tokenizer.tokenize(sentence)
fake_inputs = tokenizer.encode(sentence, return_tensors="pt")
@@ -53,9 +53,16 @@ predictions = torch.round((torch.sign(discriminator_outputs[0]) + 1) / 2)
[print("%7s" % token, end="") for token in fake_tokens]
[print("%7s" % prediction, end="") for prediction in predictions.tolist()]
[print("%7s" % int(prediction), end="") for prediction in predictions.tolist()[1:-1]]
# Output:
'''
el zorro rojo es muy ser 0 0 0 0 0 1[None, None, None, None, None, None]
'''
```
As you can see there is a **1** in the place where the model detected the fake token (**ser**). So, it works! 🎉
## Acknowledgments
I thank [🤗/transformers team](https://github.com/huggingface/transformers) for answering my doubts and Google for helping me with the [TensorFlow Research Cloud](https://www.tensorflow.org/tfrc) program.

View File

@@ -20,6 +20,9 @@ A few examples of the model response to a query before and after optimisation:
|I have watched 3 episodes |with this guy and he is such a talented actor...| but the show is just plain awful and there ne...| 2.681171| -4.512792|
|We know that firefighters and| police officers are forced to become populari...| other chains have going to get this disaster ...| 1.367811| -3.34017|
## Training logs and metrics <img src="https://gblobscdn.gitbook.com/spaces%2F-Lqya5RvLedGEWPhtkjU%2Favatar.png?alt=media" width="25" height="25">
Watch the whole training logs and metrics on [W&B](https://app.wandb.ai/mrm8488/gpt2-sentiment-negative?workspace=user-mrm8488)
> Created by [Manuel Romero/@mrm8488](https://twitter.com/mrm8488)

View File

@@ -0,0 +1,65 @@
---
language: english
thumbnail:
---
# Longformer-base-4096 fine-tuned on SQuAD v2
[Longformer-base-4096 model](https://huggingface.co/allenai/longformer-base-4096) fine-tuned on [SQuAD v2](https://rajpurkar.github.io/SQuAD-explorer/) for **Q&A** downstream task.
## Longformer-base-4096
[Longformer](https://arxiv.org/abs/2004.05150) is a transformer model for long documents.
`longformer-base-4096` is a BERT-like model started from the RoBERTa checkpoint and pretrained for MLM on long documents. It supports sequences of length up to 4,096.
Longformer uses a combination of a sliding window (local) attention and global attention. Global attention is user-configured based on the task to allow the model to learn task-specific representations.
## Details of the downstream task (Q&A) - Dataset 📚 🧐 ❓
[SQuAD v2](https://rajpurkar.github.io/SQuAD-explorer/) combines the 100,000 questions in SQuAD1.1 with over 50,000 unanswerable questions written adversarially by crowdworkers to look similar to answerable ones. To do well on SQuAD2.0, systems must not only answer questions when possible, but also determine when no answer is supported by the paragraph and abstain from answering.
| Dataset | Split | # samples |
| -------- | ----- | --------- |
| SQuAD2.0 | train | 130k |
| SQuAD2.0 | eval | 12.3k |
## Model fine-tuning 🏋️‍
The training script is a slightly modified version of [this one](https://colab.research.google.com/drive/1zEl5D-DdkBKva-DdreVOmN0hrAfzKG1o?usp=sharing)
## Model in Action 🚀
```python
import torch
from transformers import AutoTokenizer, AutoModelForQuestionAnswering
tokenizer = AutoTokenizer.from_pretrained("mrm8488/longformer-base-4096-finetuned-squadv2")
model = AutoModelForQuestionAnswering.from_pretrained("mrm8488/longformer-base-4096-finetuned-squadv2")
text = "Huggingface has democratized NLP. Huge thanks to Huggingface for this."
question = "What has Huggingface done ?"
encoding = tokenizer.encode_plus(question, text, return_tensors="pt")
input_ids = encoding["input_ids"]
# default is local attention everywhere
# the forward method will automatically set global attention on question tokens
attention_mask = encoding["attention_mask"]
start_scores, end_scores = model(input_ids, attention_mask=attention_mask)
all_tokens = tokenizer.convert_ids_to_tokens(input_ids[0].tolist())
answer_tokens = all_tokens[torch.argmax(start_scores) :torch.argmax(end_scores)+1]
answer = tokenizer.decode(tokenizer.convert_tokens_to_ids(answer_tokens))
# output => democratized NLP
```
If given the same context we ask something that is not there, the output for **no answer** will be ```<s>```
> Created by [Manuel Romero/@mrm8488](https://twitter.com/mrm8488) | [LinkedIn](https://www.linkedin.com/in/manuel-romero-cs/)
> Made with <span style="color: #e25555;">&hearts;</span> in Spain

View File

@@ -0,0 +1,68 @@
---
language: english
thumbnail:
---
# T5-base fine-tuned on SQuAD v2
[Google's T5](https://ai.googleblog.com/2020/02/exploring-transfer-learning-with-t5.html) fine-tuned on [SQuAD v2](https://rajpurkar.github.io/SQuAD-explorer/) for **Q&A** downstream task.
## Details of T5
The **T5** model was presented in [Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer](https://arxiv.org/pdf/1910.10683.pdf) by *Colin Raffel, Noam Shazeer, Adam Roberts, Katherine Lee, Sharan Narang, Michael Matena, Yanqi Zhou, Wei Li, Peter J. Liu* in Here the abstract:
Transfer learning, where a model is first pre-trained on a data-rich task before being fine-tuned on a downstream task, has emerged as a powerful technique in natural language processing (NLP). The effectiveness of transfer learning has given rise to a diversity of approaches, methodology, and practice. In this paper, we explore the landscape of transfer learning techniques for NLP by introducing a unified framework that converts every language problem into a text-to-text format. Our systematic study compares pre-training objectives, architectures, unlabeled datasets, transfer approaches, and other factors on dozens of language understanding tasks. By combining the insights from our exploration with scale and our new “Colossal Clean Crawled Corpus”, we achieve state-of-the-art results on many benchmarks covering summarization, question answering, text classification, and more. To facilitate future work on transfer learning for NLP, we release our dataset, pre-trained models, and code.
## Details of the downstream task (Q&A) - Dataset 📚 🧐 ❓
[SQuAD v2](https://rajpurkar.github.io/SQuAD-explorer/) combines the 100,000 questions in SQuAD1.1 with over 50,000 unanswerable questions written adversarially by crowdworkers to look similar to answerable ones. To do well on SQuAD2.0, systems must not only answer questions when possible, but also determine when no answer is supported by the paragraph and abstain from answering.
| Dataset | Split | # samples |
| -------- | ----- | --------- |
| SQuAD2.0 | train | 130k |
| SQuAD2.0 | eval | 12.3k |
## Model fine-tuning 🏋️‍
The training script is a slightly modified version of [this one](https://colab.research.google.com/github/patil-suraj/exploring-T5/blob/master/T5_on_TPU.ipynb)
## Results 📝
| Metric | # Value |
| ------ | --------- |
| **EM** | **77.64** |
| **F1** | **81.32** |
## Model in Action 🚀
```python
from transformers import AutoModelWithLMHead, AutoTokenizer
tokenizer = AutoTokenizer.from_pretrained("mrm8488/t5-base-finetuned-squadv2")
model = AutoModelWithLMHead.from_pretrained("mrm8488/t5-base-finetuned-squadv2")
def get_answer(question, context):
input_text = "question: %s context: %s </s>" % (question, context)
features = tokenizer.batch_encode_plus([input_text], return_tensors='pt')
output = model.generate(input_ids=features['input_ids'],
attention_mask=features['attention_mask'])
return tokenizer.decode(output[0])
context = "Manuel have created RuPERTa-base with the support of HF-Transformers and Google"
question = "Who has supported Manuel?"
get_answer(question, context)
# output: 'HF-Transformers and Google'
```
> Created by [Manuel Romero/@mrm8488](https://twitter.com/mrm8488) | [LinkedIn](https://www.linkedin.com/in/manuel-romero-cs/)
> Made with <span style="color: #e25555;">&hearts;</span> in Spain

View File

@@ -0,0 +1,125 @@
# German Sentiment Classification with Bert
This model was trained for sentiment classification of German language texts. To achieve the best results all model inputs needs to be preprocessed with the same procedure, that was applied during the training. To simplify the usage of the model,
we provide a Python package that bundles the code need for the preprocessing and inferencing.
The model uses the Googles Bert architecture and was trained on 1.834 million German-language samples. The training data contains texts from various domains like Twitter, Facebook and movie, app and hotel reviews.
You can find more information about the dataset and the training process in the [paper](http://www.lrec-conf.org/proceedings/lrec2020/pdf/2020.lrec-1.201.pdf).
## Using the Python package
To get started install the package from [pypi](https://pypi.org/project/germansentiment/):
```bash
pip install germansentiment
```
```python
from germansentiment import SentimentModel
model = SentimentModel()
texts = [
"Mit keinem guten Ergebniss","Das ist gar nicht mal so gut",
"Total awesome!","nicht so schlecht wie erwartet",
"Der Test verlief positiv.","Sie fährt ein grünes Auto."]
result = model.predict_sentiment(texts)
print(result)
```
The code above will output following list:
```python
["negative","negative","positive","positive","neutral", "neutral"]
```
## A minimal working Sample
```python
from transformers import AutoModelForSequenceClassification, AutoTokenizer
from typing import List
import torch
import re
class SentimentModel():
def __init__(self, model_name: str):
self.model = AutoModelForSequenceClassification.from_pretrained(model_name)
self.tokenizer = AutoTokenizer.from_pretrained(model_name)
self.clean_chars = re.compile(r'[^A-Za-züöäÖÜÄß ]', re.MULTILINE)
self.clean_http_urls = re.compile(r'https*\S+', re.MULTILINE)
self.clean_at_mentions = re.compile(r'@\S+', re.MULTILINE)
def predict_sentiment(self, texts: List[str])-> List[str]:
texts = [self.clean_text(text) for text in texts]
# Add special tokens takes care of adding [CLS], [SEP], <s>... tokens in the right way for each model.
input_ids = self.tokenizer.batch_encode_plus(texts,pad_to_max_length=True, add_special_tokens=True)
input_ids = torch.tensor(input_ids["input_ids"])
with torch.no_grad():
logits = self.model(input_ids)
label_ids = torch.argmax(logits[0], axis=1)
labels = [self.model.config.id2label[label_id] for label_id in label_ids.tolist()]
return labels
def replace_numbers(self,text: str) -> str:
return text.replace("0"," null").replace("1"," eins").replace("2"," zwei").replace("3"," drei").replace("4"," vier").replace("5"," fünf").replace("6"," sechs").replace("7"," sieben").replace("8"," acht").replace("9"," neun")
def clean_text(self,text: str)-> str:
text = text.replace("\n", " ")
text = self.clean_http_urls.sub('',text)
text = self.clean_at_mentions.sub('',text)
text = self.replace_numbers(text)
text = self.clean_chars.sub('', text) # use only text chars
text = ' '.join(text.split()) # substitute multiple whitespace with single whitespace
text = text.strip().lower()
return text
texts = ["Mit keinem guten Ergebniss","Das war unfair", "Das ist gar nicht mal so gut",
"Total awesome!","nicht so schlecht wie erwartet", "Das ist gar nicht mal so schlecht",
"Der Test verlief positiv.","Sie fährt ein grünes Auto.", "Der Fall wurde an die Polzei übergeben."]
model = SentimentModel(model_name = "oliverguhr/german-sentiment-bert")
print(model.predict_sentiment(texts))
```
## Model and Data
If you are interested in code and data that was used to train this model please have a look at [this repository](https://github.com/oliverguhr/german-sentiment) and our [paper](http://www.lrec-conf.org/proceedings/lrec2020/pdf/2020.lrec-1.202.pdf). Here is a table of the F1 scores that his model achieves on following datasets. Since we trained this model on a newer version of the transformer library, the results are slightly better than reported in the paper.
| Dataset | F1 micro Score |
| :----------------------------------------------------------- | -------------: |
| [holidaycheck](https://github.com/oliverguhr/german-sentiment) | 0.9568 |
| [scare](https://www.romanklinger.de/scare/) | 0.9418 |
| [filmstarts](https://github.com/oliverguhr/german-sentiment) | 0.9021 |
| [germeval](https://sites.google.com/view/germeval2017-absa/home) | 0.7536 |
| [PotTS](https://www.aclweb.org/anthology/L16-1181/) | 0.6780 |
| [emotions](https://github.com/oliverguhr/german-sentiment) | 0.9649 |
| [sb10k](https://www.spinningbytes.com/resources/germansentiment/) | 0.7376 |
| [Leipzig Wikipedia Corpus 2016](https://wortschatz.uni-leipzig.de/de/download/german) | 0.9967 |
| all | 0.9639 |
## Cite
For feedback and questions contact me view mail or Twitter [@oliverguhr](https://twitter.com/oliverguhr). Please cite us if you found this useful:
```
@InProceedings{guhr-EtAl:2020:LREC,
author = {Guhr, Oliver and Schumann, Anne-Kathrin and Bahrmann, Frank and Böhme, Hans Joachim},
title = {Training a Broad-Coverage German Sentiment Classification Model for Dialog Systems},
booktitle = {Proceedings of The 12th Language Resources and Evaluation Conference},
month = {May},
year = {2020},
address = {Marseille, France},
publisher = {European Language Resources Association},
pages = {1620--1625},
url = {https://www.aclweb.org/anthology/2020.lrec-1.201}
}
```

View File

@@ -5,12 +5,12 @@ https://huggingface.co/savasy/bert-base-turkish-sentiment-cased
This model is used for Sentiment Analysis, which is based on BERTurk for Turkish Language https://huggingface.co/dbmdz/bert-base-turkish-cased
# Dataset
## Dataset
The dataset is taken from the studies [2] and [3] and merged.
The dataset is taken from the studies [[2]](#paper-2) and [[3]](#paper-3), and merged.
* The study [2] gathered movie and product reviews. The products are book, DVD, electronics, and kitchen.
The movie dataset is taken from a cinema Web page (www.beyazperde.com) with
The movie dataset is taken from a cinema Web page ([Beyazperde](www.beyazperde.com)) with
5331 positive and 5331 negative sentences. Reviews in the Web page are marked in
scale from 0 to 5 by the users who made the reviews. The study considered a review
sentiment positive if the rating is equal to or bigger than 4, and negative if it is less
@@ -19,9 +19,9 @@ Web page. They constructed benchmark dataset consisting of reviews regarding som
products (book, DVD, etc.). Likewise, reviews are marked in the range from 1 to 5,
and majority class of reviews are 5. Each category has 700 positive and 700 negative
reviews in which average rating of negative reviews is 2.27 and of positive reviews
is 4.5. This dataset is also used the study [1]
is 4.5. This dataset is also used by the study [[1]](#paper-1).
* The study[3] collected tweet dataset. They proposed a new approach for automatically classifying the sentiment of microblog messages. The proposed approach is based on utilizing robust feature representation and fusion.
* The study [[3]](#paper-3) collected tweet dataset. They proposed a new approach for automatically classifying the sentiment of microblog messages. The proposed approach is based on utilizing robust feature representation and fusion.
*Merged Dataset*
@@ -32,20 +32,21 @@ is 4.5. This dataset is also used the study [1]
| 32000 |train.tsv|
| *48290* |*total*|
### The dataset is used by following papers
The dataset is used by following papers
* 1 Yildirim, Savaş. (2020). Comparing Deep Neural Networks to Traditional Models for Sentiment Analysis in Turkish Language. 10.1007/978-981-15-1216-2_12.
* 2 Demirtas, Erkin and Mykola Pechenizkiy. 2013. Cross-lingual polarity detection with machine translation. In Proceedings of the Second International Workshop on Issues of Sentiment
<a id="paper-1">[1]</a> Yildirim, Savaş. (2020). Comparing Deep Neural Networks to Traditional Models for Sentiment Analysis in Turkish Language. 10.1007/978-981-15-1216-2_12.
<a id="paper-2">[2]</a> Demirtas, Erkin and Mykola Pechenizkiy. 2013. Cross-lingual polarity detection with machine translation. In Proceedings of the Second International Workshop on Issues of Sentiment
Discovery and Opinion Mining (WISDOM 13)
* [3] Hayran, A., Sert, M. (2017), "Sentiment Analysis on Microblog Data based on Word Embedding and Fusion Techniques", IEEE 25th Signal Processing and Communications Applications Conference (SIU 2017), Belek, Turkey
# Training
<a id="paper-3">[3]</a> Hayran, A., Sert, M. (2017), "Sentiment Analysis on Microblog Data based on Word Embedding and Fusion Techniques", IEEE 25th Signal Processing and Communications Applications Conference (SIU 2017), Belek, Turkey
```
## Training
```shell
export GLUE_DIR="./sst-2-newall"
export TASK_NAME=SST-2
python3 run_glue.py \
--model_type bert \
@@ -59,88 +60,79 @@ python3 run_glue.py \
--learning_rate 2e-5 \
--num_train_epochs 3.0 \
--output_dir "./model"
```
## Results
> 05/10/2020 17:00:43 - INFO - transformers.trainer - \*\*\*\*\* Running Evaluation \*\*\*\*\*
> 05/10/2020 17:00:43 - INFO - transformers.trainer - Num examples = 7999
> 05/10/2020 17:00:43 - INFO - transformers.trainer - Batch size = 8
> Evaluation: 100% 1000/1000 [00:34<00:00, 29.04it/s]
> 05/10/2020 17:01:17 - INFO - \_\_main__ - \*\*\*\*\* Eval results sst-2 \*\*\*\*\*
> 05/10/2020 17:01:17 - INFO - \_\_main__ - acc = 0.9539942492811602
> 05/10/2020 17:01:17 - INFO - \_\_main__ - loss = 0.16348013816401363
Accuracy is about **95.4%**
# Results
## Code Usage
> 05/10/2020 17:00:43 - INFO - transformers.trainer - ***** Running Evaluation *****
> 05/10/2020 17:00:43 - INFO - transformers.trainer - Num examples = 7999
> 05/10/2020 17:00:43 - INFO - transformers.trainer - Batch size = 8
>Evaluation: 100% 1000/1000 [00:34<00:00, 29.04it/s]
>05/10/2020 17:01:17 - INFO - __main__ - ***** Eval results sst-2 *****
>05/10/2020 17:01:17 - INFO - __main__ - acc = 0.9539942492811602
>05/10/2020 17:01:17 - INFO - __main__ - loss = 0.16348013816401363
Accuracy is about *%95.4*
# Code Usage
```
```python
from transformers import AutoModelForSequenceClassification, AutoTokenizer, pipeline
model = AutoModelForSequenceClassification.from_pretrained("savasy/bert-base-turkish-sentiment-cased")
tokenizer = AutoTokenizer.from_pretrained("savasy/bert-base-turkish-sentiment-cased")
sa= pipeline("sentiment-analysis", tokenizer=tokenizer, model=model)
p= sa("bu telefon modelleri çok kaliteli , her parçası çok özel bence")
p = sa("bu telefon modelleri çok kaliteli , her parçası çok özel bence")
print(p)
#[{'label': 'LABEL_1', 'score': 0.9871089}]
print (p[0]['label']=='LABEL_1')
#True
# [{'label': 'LABEL_1', 'score': 0.9871089}]
print(p[0]['label'] == 'LABEL_1')
# True
p= sa("Film çok kötü ve çok sahteydi")
p = sa("Film çok kötü ve çok sahteydi")
print(p)
#[{'label': 'LABEL_0', 'score': 0.9975505}]
print (p[0]['label']=='LABEL_1')
#False
# [{'label': 'LABEL_0', 'score': 0.9975505}]
print(p[0]['label'] == 'LABEL_1')
# False
```
# Test your data
## Test
### Data
Suppose your file has lots of lines of comment and label (1 or 0) at the end (tab seperated)
> comment1 ... \t label
> comment2 ... \t label
> comment1 ... \t label
> comment2 ... \t label
> ...
### Code
```
```python
from transformers import AutoModelForSequenceClassification, AutoTokenizer, pipeline
f="/path/to/your/file/yourfile.tsv"
model = AutoModelForSequenceClassification.from_pretrained("savasy/bert-base-turkish-sentiment-cased")
tokenizer = AutoTokenizer.from_pretrained("savasy/bert-base-turkish-sentiment-cased")
sa= pipeline("sentiment-analysis", tokenizer=tokenizer, model=model)
sa = pipeline("sentiment-analysis", tokenizer=tokenizer, model=model)
i,crr=0,0
for line in open(f):
lines=line.strip().split("\t")
if len(lines)==2:
i=i+1
if i%100==0:
print(i)
pred= sa(lines[0])
pred=pred[0]["label"].split("_")[1]
if pred== lines[1]:
crr=crr+1
input_file = "/path/to/your/file/yourfile.tsv"
i, crr = 0, 0
for line in open(input_file):
lines = line.strip().split("\t")
if len(lines) == 2:
i = i + 1
if i%100 == 0:
print(i)
pred = sa(lines[0])
pred = pred[0]["label"].split("_")[1]
if pred == lines[1]:
crr = crr + 1
print(crr, i, crr/i)
```

View File

@@ -0,0 +1,67 @@
---
language: turkish
---
# Turkish SQuAD Model : Question Answering
I fine-tuned Turkish-Bert-Model for Question-Answering problem with Turkish version of SQuAD; TQuAD
* BERT-base: https://huggingface.co/dbmdz/bert-base-turkish-uncased
* TQuAD dataset: https://github.com/TQuad/turkish-nlp-qa-dataset
# Training Code
```
!python3 run_squad.py \
--model_type bert \
--model_name_or_path dbmdz/bert-base-turkish-uncased\
--do_train \
--do_eval \
--train_file trainQ.json \
--predict_file dev1.json \
--per_gpu_train_batch_size 12 \
--learning_rate 3e-5 \
--num_train_epochs 5.0 \
--max_seq_length 384 \
--doc_stride 128 \
--output_dir "./model"
```
# Example Usage
> Load Model
```
from transformers import AutoTokenizer, AutoModelForQuestionAnswering, pipeline
import torch
tokenizer = AutoTokenizer.from_pretrained("./model")
model = AutoModelForQuestionAnswering.from_pretrained("./model")
nlp=pipeline("question-answering", model=model, tokenizer=tokenizer)
```
> Apply the model
```
sait="ABASIYANIK, Sait Faik. Hikayeci (Adapazarı 23 Kasım 1906-İstanbul 11 Mayıs 1954). \
İlk öğrenimine Adapazarında Rehber-i Terakki Mektebinde başladı. İki yıl kadar Adapazarı İdadisinde okudu.\
İstanbul Erkek Lisesinde devam ettiği orta öğrenimini Bursa Lisesinde tamamladı (1928). İstanbul Edebiyat \
Fakültesine iki yıl devam ettikten sonra babasının isteği üzerine iktisat öğrenimi için İsviçreye gitti. \
Kısa süre sonra iktisat öğrenimini bırakarak Lozandan Grenoblea geçti. Üç yıl başıboş bir edebiyat öğrenimi \
gördükten sonra babası tarafından geri çağrıldı (1933). Bir müddet Halıcıoğlu Ermeni Yetim Mektebi'nde Türkçe \
gurup dersleri öğretmenliği yaptı. Ticarete atıldıysa da tutunamadı. Bir ay Haber gazetesinde adliye muhabirliği\
yaptı (1942). Babasının ölümü üzerine aileden kalan emlakin geliri ile avare bir hayata başladı. Evlenemedi.\
Yazları Burgaz adasındaki köşklerinde, kışları Şişlideki apartmanlarında annesi ile beraber geçen bu fazla \
içkili bohem hayatı ömrünün sonuna kadar sürdü."
print(nlp(question="Ne zaman avare bir hayata başladı?", context=sait))
print(nlp(question="Sait Faik hangi Lisede orta öğrenimini tamamladı?", context=sait))
```
```
# Ask your self ! type your question
print(nlp(question="...?", context=sait))
```
Check My other Model
https://huggingface.co/savasy

Some files were not shown because too many files have changed in this diff Show More