Commit Graph

1298 Commits

Author SHA1 Message Date
Stas Bekman
9edafaebef [s2s] test_bash_script.py - actually learn something (#8318)
* use decorator

* remove hardcoded paths

* make the test use more data and do real quality tests

* shave off 10 secs

* add --eval_beams 2, reformat

* reduce train size, use smaller custom dataset
2020-11-05 23:15:14 -05:00
Leandro von Werra
17450397a7 Docs bart training ref (#8330)
Co-authored-by: Sam Shleifer <sshleifer@gmail.com>
2020-11-05 17:20:57 -05:00
Stas Bekman
d787935a14 [s2s] test_distributed_eval (#8315)
Co-authored-by: Sam Shleifer <sshleifer@gmail.com>
2020-11-05 16:01:15 -05:00
Sam Shleifer
7abc1d96d1 no warn (#8329) 2020-11-05 11:42:24 -05:00
Bobby Donchev
52f44dd6d2 change TokenClassificationTask class methods to static methods (#7902)
* change TokenClassificationTask class methods to static methods

Since we do not require self in the class methods of TokenClassificationTask we should probably switch to static methods. Also, since the class TokenClassificationTask does not contain a constructor it is currently unusable as is. By switching to static methods this fixes the issue of having to document the intent of the broken class.

Also, since the get_labels and read_examples_from_file methods are ought to be implemented. Static method definitions are unchanged even after inheritance, which means that it can be overridden, similar to other class methods.

* Trigger Build

Co-authored-by: Lysandre <lysandre.debut@reseau.eseo.fr>
2020-11-05 09:38:30 -05:00
Guillem García Subies
77c8f6c627 Corrected typo in readme (#8320) 2020-11-05 07:48:36 -05:00
Sylvain Gugger
9c4aa4ac1a Clean up data collators and datasets (#8308)
* Clean up data collators and datasets

* Apply suggestions from code review

Co-authored-by: Lysandre Debut <lysandre@huggingface.co>

* Remove needless clone

Co-authored-by: Lysandre Debut <lysandre@huggingface.co>
2020-11-04 17:24:49 -05:00
Manuel Romero
b1d3e95eb5 Fix path to old run_language_modeling.py script (#8302) 2020-11-04 13:17:57 -05:00
Sylvain Gugger
cf89724696 Fix validation file loading in scripts (#8298) 2020-11-04 10:42:18 -05:00
Pengzhi Gao
734afa37f6 Fix typo in language-modeling README.md (#8287) 2020-11-04 09:38:02 -05:00
Stas Bekman
1bb4bba53c [CIs] Better reports everywhere (#8275)
* make it possible to invoke testconf.py in both test suites without crashing on having the same option added

* perl -pi -e 's|--make_reports|--make-reports|' to be consistent with other opts

* add `pytest --make-reports` to all CIs (and artifacts)

* fix
2020-11-03 16:57:12 -05:00
Patrick von Platen
068e6b5edd make files independent (#8267) 2020-11-03 21:13:33 +01:00
Stas Bekman
cd360dcb26 [examples] minimal version requirement run-time check in PL (#8133)
Co-authored-by: Sam Shleifer <sshleifer@gmail.com>
2020-11-03 13:17:11 -05:00
Lysandre
eb6313e823 Fix Tatoeba skip 2020-11-03 10:35:00 -05:00
Sam Shleifer
b63beb743c Skip tatoeba tests if Tatoeba-Challenge not cloned (#8260) 2020-11-03 09:49:29 -05:00
Patrick von Platen
9f1747f999 [Seq2Seq] Correct import in Seq2Seq Trainer (#8254) 2020-11-03 07:56:41 -05:00
Sylvain Gugger
e1b1b614b1 Add line by line option to mlm/plm scripts (#8240)
* Make line by line optional in run_mlm

* Add option to disable dynamic padding

* Add option to plm too and update README

* Typos

* More typos

* Even more typos

* Apply suggestions from code review

Co-authored-by: Lysandre Debut <lysandre@huggingface.co>

Co-authored-by: Lysandre Debut <lysandre@huggingface.co>
2020-11-02 12:27:04 -05:00
Patrick von Platen
9bd30f7cf4 [Seq2SeqTrainer] Move import to init to make file self-contained (#8194)
* boom boom

* reverse order
2020-11-01 23:31:55 +01:00
Sylvain Gugger
9eb3a410cd Remove deprecated arguments from new run_clm (#8197) 2020-10-30 15:27:20 -04:00
Sylvain Gugger
cdc48ce92d Finalize lm examples (#8188)
* Finish the cleanup of the language-modeling examples

* Update main README

* Apply suggestions from code review

Co-authored-by: Lysandre Debut <lysandre@huggingface.co>

* Apply suggestions from code review

Co-authored-by: Thomas Wolf <thomwolf@users.noreply.github.com>

* Propagate changes

Co-authored-by: Lysandre Debut <lysandre@huggingface.co>
Co-authored-by: Thomas Wolf <thomwolf@users.noreply.github.com>
2020-10-30 14:20:18 -04:00
wlhgtc
9a21b50614 Fix eval ref miss in Chinese WWM. (#8115)
* ADD: add whole word mask proxy for both eng and chinese

* MOD: adjust format

* MOD: reformat code

* MOD: update import

* MOD: fix bug

* MOD: add import

* MOD: fix bug

* MOD: decouple code and update readme

* MOD: reformat code

* Update examples/language-modeling/README.md

Co-authored-by: Sylvain Gugger <35901082+sgugger@users.noreply.github.com>

* Update examples/language-modeling/README.md

Co-authored-by: Sylvain Gugger <35901082+sgugger@users.noreply.github.com>

* Update examples/language-modeling/run_language_modeling.py

Co-authored-by: Sylvain Gugger <35901082+sgugger@users.noreply.github.com>

* Update examples/language-modeling/run_language_modeling.py

Co-authored-by: Sylvain Gugger <35901082+sgugger@users.noreply.github.com>

* Update examples/language-modeling/run_language_modeling.py

Co-authored-by: Sylvain Gugger <35901082+sgugger@users.noreply.github.com>

* Update examples/language-modeling/run_language_modeling.py

Co-authored-by: Sylvain Gugger <35901082+sgugger@users.noreply.github.com>

* change wwm to whole_word_mask

* reformat code

* reformat

* format

* Code quality

* ADD: update chinese ref readme

* MOD: small changes

* MOD: small changes2

* update readme

* fix eval ref file miss bug

* format file

* MOD: move ref code to contrib

* MOD: add delimeter check

* reformat code

* refomat code

* Update examples/language-modeling/README.md

Co-authored-by: Sylvain Gugger <35901082+sgugger@users.noreply.github.com>
Co-authored-by: Sylvain Gugger <sylvain.gugger@gmail.com>
Co-authored-by: Lysandre Debut <lysandre@huggingface.co>
2020-10-29 17:08:39 -04:00
Sylvain Gugger
691176283d Add a template for examples and apply it for mlm and plm examples (#8153)
* Add a template for example scripts and apply it to mlm

* Formatting

* Fix test

* Add plm script

* Add a template for example scripts and apply it to mlm

* Formatting

* Fix test

* Add plm script

* Add a template for example scripts and apply it to mlm

* Formatting

* Fix test

* Add plm script

* Styling
2020-10-29 13:38:11 -04:00
Sam Shleifer
49e4fece5c [s2s] distillBART docs for paper replication (#8150) 2020-10-29 12:01:15 -04:00
Sylvain Gugger
acf56408d8 Smarter prediction loop and no- -> no_ in console args (#8151)
* Smarter prediction loop and no- -> no_ in console args

* Fix test
2020-10-29 10:56:25 -04:00
Santiago Castro
969859d5f6 Fix doc errors and typos across the board (#8139)
* Fix doc errors and typos across the board

* Fix a typo

* Fix the CI

* Fix more typos

* Fix CI

* More fixes

* Fix CI

* More fixes

* More fixes
2020-10-29 10:33:33 -04:00
Stas Bekman
825925dfaa [s2s test] cleanup (#8131) 2020-10-28 16:50:36 -04:00
Sean Naren
5e24982e58 Upgrade PyTorch Lightning to 1.0.2 (#7852)
Co-authored-by: Sam Shleifer <sshleifer@gmail.com>
2020-10-28 14:59:14 -04:00
Sylvain Gugger
378142afdf Rename add_start_docstrings_to_callable (#8120) 2020-10-28 13:42:31 -04:00
Stas Bekman
5423f2a9d4 [testing] port test_trainer_distributed to distributed pytest + TestCasePlus enhancements (#8107)
* move the helper code into testing_utils

* port test_trainer_distributed to work with pytest

* improve docs

* simplify notes

* doc

* doc

* style

* doc

* further improvements

* torch might not be available

* real fix

* Apply suggestions from code review

Co-authored-by: Sylvain Gugger <35901082+sgugger@users.noreply.github.com>

Co-authored-by: Sylvain Gugger <35901082+sgugger@users.noreply.github.com>
2020-10-28 11:51:32 -04:00
Sylvain Gugger
47dfa65b0c New run_clm script (#8105)
* New run_clm script

* Formatting

* More comments

* Remove unused imports

* Apply suggestions from code review

Co-authored-by: Thomas Wolf <thomwolf@users.noreply.github.com>

* Address review comments

* Change link to the hub

Co-authored-by: Thomas Wolf <thomwolf@users.noreply.github.com>
2020-10-28 10:38:58 -04:00
Sylvain Gugger
1e01db3579 Remove header 2020-10-27 17:36:13 -04:00
Sylvain Gugger
b715e40ced Fix typo 2020-10-27 17:34:05 -04:00
Sylvain Gugger
41cc5f3f59 Move installation instructions to the top (#8106) 2020-10-27 17:32:20 -04:00
Stas Bekman
bfd5e370a7 [CI] generate separate report files as artifacts (#7995)
* better reports

* a whole bunch of reports in their own files

* clean up

* improvements

* github artifacts experiment

* style

* complete the report generator with multiple improvements/fixes

* fix

* save all reports under one dir to easy upload

* can remove temp failing tests

* doc fix

* some cleanup
2020-10-27 09:25:07 -04:00
Patrick von Platen
664c7ec453 [Seq2Seq Trainer] Make sure padding is implemented for models without pad_token (#8043)
* make sure padding is implemented for non-padding tokens models as well

* add better error message

* add better warning

* remove results files

* Update examples/seq2seq/seq2seq_trainer.py

* remove unnecessary copy line

* correct usage of labels

* delete test files
2020-10-26 17:28:16 +01:00
mohammadreza-Banaei73
098ddc2244 Update README.md (#8050)
--wwm cant be used as an argument given run_language_modeling.py and should be changed to --whole_word_mask
2020-10-26 12:00:18 -04:00
suliuzh
20a0894d1a update version for scipy (#7998) 2020-10-26 08:56:56 -04:00
Patrick von Platen
3c682ea15c [Examples] Allow EncoderDecoderModels to be trained with Seq2Seq (#7809)
* Make Seq2Seq Trainer more similar to Trainer

* fix typo

* fix seq2seq trainer

* remove from tests

* remove lock

* remove train files

* delete test files

* correct typo

* check at init

* make sure trainer is not slowed down on TPU

* correct isort

* remove use cache

* fix use cache

* add last use chache = false
2020-10-23 23:05:51 +02:00
Ethan Perez
d39da5a2ab Handling longformer model_type (#7990)
Updating the run_squad training script to handle the "longformer" `model_type`. The longformer is trained in the same was as RoBERTa, so I've added the "longformer" `model_type` (that's the right hugginface name for the LongFormer model, right?) everywhere there was a "roberta" `model_type` reference. The longformer (like RoBERTa) doesn't use `token_type_ids` (as I understand from looking at the [longformer notebook](https://github.com/patil-suraj/Notebooks/blob/master/longformer_qa_training.ipynb), which is what gets updated after this change.

This fix might be related to [this issue](https://github.com/huggingface/transformers/issues/7249) with SQuAD training when using run_squad.py
2020-10-23 10:34:06 -04:00
Lalit Pagaria
88b3a91e61 Handle the case when title is None (#7941) 2020-10-23 15:54:45 +02:00
Stas Bekman
023f0f3708 [s2s trainer] tests to use distributed on multi-gpu machine (#7965) 2020-10-22 17:26:22 -04:00
Sylvain Gugger
2e5052d4f1 New run glue script (#7917)
* Start simplification

* More progress

* Finished script

* Address comments and update tests instructions

* Wrong test

* Accept files as inputs and fix test

* Update src/transformers/trainer_utils.py

Co-authored-by: Julien Chaumond <chaumond@gmail.com>

* Fix labels and add combined score

* Add special labels

* Update TPU command

* Revert to old label strategy

* Use model labels

* Fix for STT-B

* Styling

* Apply suggestions from code review

Co-authored-by: Thomas Wolf <thomwolf@users.noreply.github.com>

* Code styling

* Fix review comments

Co-authored-by: Julien Chaumond <chaumond@gmail.com>
Co-authored-by: Thomas Wolf <thomwolf@users.noreply.github.com>
2020-10-22 11:42:22 -04:00
wlhgtc
a16e568f22 # Add whole word mask support for lm fine-tune (#7925)
* ADD: add whole word mask proxy for both eng and chinese

* MOD: adjust format

* MOD: reformat code

* MOD: update import

* MOD: fix bug

* MOD: add import

* MOD: fix bug

* MOD: decouple code and update readme

* MOD: reformat code

* Update examples/language-modeling/README.md

Co-authored-by: Sylvain Gugger <35901082+sgugger@users.noreply.github.com>

* Update examples/language-modeling/README.md

Co-authored-by: Sylvain Gugger <35901082+sgugger@users.noreply.github.com>

* Update examples/language-modeling/run_language_modeling.py

Co-authored-by: Sylvain Gugger <35901082+sgugger@users.noreply.github.com>

* Update examples/language-modeling/run_language_modeling.py

Co-authored-by: Sylvain Gugger <35901082+sgugger@users.noreply.github.com>

* Update examples/language-modeling/run_language_modeling.py

Co-authored-by: Sylvain Gugger <35901082+sgugger@users.noreply.github.com>

* Update examples/language-modeling/run_language_modeling.py

Co-authored-by: Sylvain Gugger <35901082+sgugger@users.noreply.github.com>

* change wwm to whole_word_mask

* reformat code

* reformat

* format

* Code quality

* ADD: update chinese ref readme

* MOD: small changes

* MOD: small changes2

* update readme

Co-authored-by: Sylvain Gugger <35901082+sgugger@users.noreply.github.com>
Co-authored-by: Sylvain Gugger <sylvain.gugger@gmail.com>
2020-10-22 09:19:00 -04:00
Stas Bekman
8b38173398 [seq2seq testing] multigpu test run via subprocess (#7281)
Co-authored-by: Sam Shleifer <sshleifer@gmail.com>
2020-10-21 17:20:53 -04:00
Stas Bekman
0e24e4c136 [s2s] create doc for pegasus/fsmt replication (#7934) 2020-10-20 15:07:52 -04:00
Stas Bekman
3e31e7f956 [testing] rename skip targets + docs (#7863)
* rename skip targets + docs

* fix quotes

* style

* Apply suggestions from code review

Co-authored-by: Sylvain Gugger <35901082+sgugger@users.noreply.github.com>

* small improvements

* fix

Co-authored-by: Sylvain Gugger <35901082+sgugger@users.noreply.github.com>
2020-10-20 04:39:13 -04:00
Quentin Lhoest
033f29c625 Allow Custom Dataset in RAG Retriever (#7763)
* add CustomHFIndex

* typo in config

* update tests

* add custom dataset example

* clean script

* update test data

* minor in test

* docs

* docs

* style

* fix imports

* allow to pass the indexed dataset directly

* update tests

* use multiset DPR

* address thom and patrick's comments

* style

* update dpr tokenizer

* add output_dir flag in use_own_knowledge_dataset.py

* allow custom datasets in examples/rag/finetune.py

* add test for custom dataset in distributed rag retriever
2020-10-19 19:42:45 +02:00
Thomas Wolf
ba8c4d0ac0 [Dependencies|tokenizers] Make both SentencePiece and Tokenizers optional dependencies (#7659)
* splitting fast and slow tokenizers [WIP]

* [WIP] splitting sentencepiece and tokenizers dependencies

* update dummy objects

* add name_or_path to models and tokenizers

* prefix added to file names

* prefix

* styling + quality

* spliting all the tokenizer files - sorting sentencepiece based ones

* update tokenizer version up to 0.9.0

* remove hard dependency on sentencepiece 🎉

* and removed hard dependency on tokenizers 🎉

* update conversion script

* update missing models

* fixing tests

* move test_tokenization_fast to main tokenization tests - fix bugs

* bump up tokenizers

* fix bert_generation

* update ad fix several tokenizers

* keep sentencepiece in deps for now

* fix funnel and deberta tests

* fix fsmt

* fix marian tests

* fix layoutlm

* fix squeezebert and gpt2

* fix T5 tokenization

* fix xlnet tests

* style

* fix mbart

* bump up tokenizers to 0.9.2

* fix model tests

* fix tf models

* fix seq2seq examples

* fix tests without sentencepiece

* fix slow => fast  conversion without sentencepiece

* update auto and bert generation tests

* fix mbart tests

* fix auto and common test without tokenizers

* fix tests without tokenizers

* clean up tests lighten up when tokenizers + sentencepiece are both off

* style quality and tests fixing

* add sentencepiece to doc/examples reqs

* leave sentencepiece on for now

* style quality split hebert and fix pegasus

* WIP Herbert fast

* add sample_text_no_unicode and fix hebert tokenization

* skip FSMT example test for now

* fix style

* fix fsmt in example tests

* update following Lysandre and Sylvain's comments

* Update src/transformers/testing_utils.py

Co-authored-by: Sylvain Gugger <35901082+sgugger@users.noreply.github.com>

* Update src/transformers/testing_utils.py

Co-authored-by: Sylvain Gugger <35901082+sgugger@users.noreply.github.com>

* Update src/transformers/tokenization_utils_base.py

Co-authored-by: Sylvain Gugger <35901082+sgugger@users.noreply.github.com>

* Update src/transformers/tokenization_utils_base.py

Co-authored-by: Sylvain Gugger <35901082+sgugger@users.noreply.github.com>

Co-authored-by: Sylvain Gugger <35901082+sgugger@users.noreply.github.com>
2020-10-18 20:51:24 +02:00
Stas Bekman
9f7b2b2432 [s2s testing] turn all to unittests, use auto-delete temp dirs (#7859) 2020-10-17 14:33:21 -04:00
Stas Bekman
1652ddad35 [seq2seq testing] improve readability (#7845) 2020-10-16 09:05:29 -04:00