HuggingFace_transformer

Author	SHA1	Message	Date
Stas Bekman	9edafaebef	[s2s] test_bash_script.py - actually learn something (#8318 ) * use decorator * remove hardcoded paths * make the test use more data and do real quality tests * shave off 10 secs * add --eval_beams 2, reformat * reduce train size, use smaller custom dataset	2020-11-05 23:15:14 -05:00
Leandro von Werra	17450397a7	Docs bart training ref (#8330 ) Co-authored-by: Sam Shleifer <sshleifer@gmail.com>	2020-11-05 17:20:57 -05:00
Stas Bekman	d787935a14	[s2s] test_distributed_eval (#8315 ) Co-authored-by: Sam Shleifer <sshleifer@gmail.com>	2020-11-05 16:01:15 -05:00
Sam Shleifer	7abc1d96d1	no warn (#8329 )	2020-11-05 11:42:24 -05:00
Bobby Donchev	52f44dd6d2	change TokenClassificationTask class methods to static methods (#7902 ) * change TokenClassificationTask class methods to static methods Since we do not require self in the class methods of TokenClassificationTask we should probably switch to static methods. Also, since the class TokenClassificationTask does not contain a constructor it is currently unusable as is. By switching to static methods this fixes the issue of having to document the intent of the broken class. Also, since the get_labels and read_examples_from_file methods are ought to be implemented. Static method definitions are unchanged even after inheritance, which means that it can be overridden, similar to other class methods. * Trigger Build Co-authored-by: Lysandre <lysandre.debut@reseau.eseo.fr>	2020-11-05 09:38:30 -05:00
Guillem García Subies	77c8f6c627	Corrected typo in readme (#8320 )	2020-11-05 07:48:36 -05:00
Sylvain Gugger	9c4aa4ac1a	Clean up data collators and datasets (#8308 ) * Clean up data collators and datasets * Apply suggestions from code review Co-authored-by: Lysandre Debut <lysandre@huggingface.co> * Remove needless clone Co-authored-by: Lysandre Debut <lysandre@huggingface.co>	2020-11-04 17:24:49 -05:00
Manuel Romero	b1d3e95eb5	Fix path to old run_language_modeling.py script (#8302 )	2020-11-04 13:17:57 -05:00
Sylvain Gugger	cf89724696	Fix validation file loading in scripts (#8298 )	2020-11-04 10:42:18 -05:00
Pengzhi Gao	734afa37f6	Fix typo in language-modeling README.md (#8287 )	2020-11-04 09:38:02 -05:00
Stas Bekman	1bb4bba53c	[CIs] Better reports everywhere (#8275 ) * make it possible to invoke testconf.py in both test suites without crashing on having the same option added * perl -pi -e 's|--make_reports|--make-reports|' to be consistent with other opts * add `pytest --make-reports` to all CIs (and artifacts) * fix	2020-11-03 16:57:12 -05:00
Patrick von Platen	068e6b5edd	make files independent (#8267 )	2020-11-03 21:13:33 +01:00
Stas Bekman	cd360dcb26	[examples] minimal version requirement run-time check in PL (#8133 ) Co-authored-by: Sam Shleifer <sshleifer@gmail.com>	2020-11-03 13:17:11 -05:00
Lysandre	eb6313e823	Fix Tatoeba skip	2020-11-03 10:35:00 -05:00
Sam Shleifer	b63beb743c	Skip tatoeba tests if Tatoeba-Challenge not cloned (#8260 )	2020-11-03 09:49:29 -05:00
Patrick von Platen	9f1747f999	[Seq2Seq] Correct import in Seq2Seq Trainer (#8254 )	2020-11-03 07:56:41 -05:00
Sylvain Gugger	e1b1b614b1	Add line by line option to mlm/plm scripts (#8240 ) * Make line by line optional in run_mlm * Add option to disable dynamic padding * Add option to plm too and update README * Typos * More typos * Even more typos * Apply suggestions from code review Co-authored-by: Lysandre Debut <lysandre@huggingface.co> Co-authored-by: Lysandre Debut <lysandre@huggingface.co>	2020-11-02 12:27:04 -05:00
Patrick von Platen	9bd30f7cf4	[Seq2SeqTrainer] Move import to init to make file self-contained (#8194 ) * boom boom * reverse order	2020-11-01 23:31:55 +01:00
Sylvain Gugger	9eb3a410cd	Remove deprecated arguments from new run_clm (#8197 )	2020-10-30 15:27:20 -04:00
Sylvain Gugger	cdc48ce92d	Finalize lm examples (#8188 ) * Finish the cleanup of the language-modeling examples * Update main README * Apply suggestions from code review Co-authored-by: Lysandre Debut <lysandre@huggingface.co> * Apply suggestions from code review Co-authored-by: Thomas Wolf <thomwolf@users.noreply.github.com> * Propagate changes Co-authored-by: Lysandre Debut <lysandre@huggingface.co> Co-authored-by: Thomas Wolf <thomwolf@users.noreply.github.com>	2020-10-30 14:20:18 -04:00
wlhgtc	9a21b50614	Fix eval ref miss in Chinese WWM. (#8115 ) * ADD: add whole word mask proxy for both eng and chinese * MOD: adjust format * MOD: reformat code * MOD: update import * MOD: fix bug * MOD: add import * MOD: fix bug * MOD: decouple code and update readme * MOD: reformat code * Update examples/language-modeling/README.md Co-authored-by: Sylvain Gugger <35901082+sgugger@users.noreply.github.com> * Update examples/language-modeling/README.md Co-authored-by: Sylvain Gugger <35901082+sgugger@users.noreply.github.com> * Update examples/language-modeling/run_language_modeling.py Co-authored-by: Sylvain Gugger <35901082+sgugger@users.noreply.github.com> * Update examples/language-modeling/run_language_modeling.py Co-authored-by: Sylvain Gugger <35901082+sgugger@users.noreply.github.com> * Update examples/language-modeling/run_language_modeling.py Co-authored-by: Sylvain Gugger <35901082+sgugger@users.noreply.github.com> * Update examples/language-modeling/run_language_modeling.py Co-authored-by: Sylvain Gugger <35901082+sgugger@users.noreply.github.com> * change wwm to whole_word_mask * reformat code * reformat * format * Code quality * ADD: update chinese ref readme * MOD: small changes * MOD: small changes2 * update readme * fix eval ref file miss bug * format file * MOD: move ref code to contrib * MOD: add delimeter check * reformat code * refomat code * Update examples/language-modeling/README.md Co-authored-by: Sylvain Gugger <35901082+sgugger@users.noreply.github.com> Co-authored-by: Sylvain Gugger <sylvain.gugger@gmail.com> Co-authored-by: Lysandre Debut <lysandre@huggingface.co>	2020-10-29 17:08:39 -04:00
Sylvain Gugger	691176283d	Add a template for examples and apply it for mlm and plm examples (#8153 ) * Add a template for example scripts and apply it to mlm * Formatting * Fix test * Add plm script * Add a template for example scripts and apply it to mlm * Formatting * Fix test * Add plm script * Add a template for example scripts and apply it to mlm * Formatting * Fix test * Add plm script * Styling	2020-10-29 13:38:11 -04:00
Sam Shleifer	49e4fece5c	[s2s] distillBART docs for paper replication (#8150 )	2020-10-29 12:01:15 -04:00
Sylvain Gugger	acf56408d8	Smarter prediction loop and no- -> no_ in console args (#8151 ) * Smarter prediction loop and no- -> no_ in console args * Fix test	2020-10-29 10:56:25 -04:00
Santiago Castro	969859d5f6	Fix doc errors and typos across the board (#8139 ) * Fix doc errors and typos across the board * Fix a typo * Fix the CI * Fix more typos * Fix CI * More fixes * Fix CI * More fixes * More fixes	2020-10-29 10:33:33 -04:00
Stas Bekman	825925dfaa	[s2s test] cleanup (#8131 )	2020-10-28 16:50:36 -04:00
Sean Naren	5e24982e58	Upgrade PyTorch Lightning to 1.0.2 (#7852 ) Co-authored-by: Sam Shleifer <sshleifer@gmail.com>	2020-10-28 14:59:14 -04:00
Sylvain Gugger	378142afdf	Rename add_start_docstrings_to_callable (#8120 )	2020-10-28 13:42:31 -04:00
Stas Bekman	5423f2a9d4	[testing] port test_trainer_distributed to distributed pytest + TestCasePlus enhancements (#8107 ) * move the helper code into testing_utils * port test_trainer_distributed to work with pytest * improve docs * simplify notes * doc * doc * style * doc * further improvements * torch might not be available * real fix * Apply suggestions from code review Co-authored-by: Sylvain Gugger <35901082+sgugger@users.noreply.github.com> Co-authored-by: Sylvain Gugger <35901082+sgugger@users.noreply.github.com>	2020-10-28 11:51:32 -04:00
Sylvain Gugger	47dfa65b0c	New run_clm script (#8105 ) * New run_clm script * Formatting * More comments * Remove unused imports * Apply suggestions from code review Co-authored-by: Thomas Wolf <thomwolf@users.noreply.github.com> * Address review comments * Change link to the hub Co-authored-by: Thomas Wolf <thomwolf@users.noreply.github.com>	2020-10-28 10:38:58 -04:00
Sylvain Gugger	1e01db3579	Remove header	2020-10-27 17:36:13 -04:00
Sylvain Gugger	b715e40ced	Fix typo	2020-10-27 17:34:05 -04:00
Sylvain Gugger	41cc5f3f59	Move installation instructions to the top (#8106 )	2020-10-27 17:32:20 -04:00
Stas Bekman	bfd5e370a7	[CI] generate separate report files as artifacts (#7995 ) * better reports * a whole bunch of reports in their own files * clean up * improvements * github artifacts experiment * style * complete the report generator with multiple improvements/fixes * fix * save all reports under one dir to easy upload * can remove temp failing tests * doc fix * some cleanup	2020-10-27 09:25:07 -04:00
Patrick von Platen	664c7ec453	[Seq2Seq Trainer] Make sure padding is implemented for models without pad_token (#8043 ) * make sure padding is implemented for non-padding tokens models as well * add better error message * add better warning * remove results files * Update examples/seq2seq/seq2seq_trainer.py * remove unnecessary copy line * correct usage of labels * delete test files	2020-10-26 17:28:16 +01:00
mohammadreza-Banaei73	098ddc2244	Update README.md (#8050 ) --wwm cant be used as an argument given run_language_modeling.py and should be changed to --whole_word_mask	2020-10-26 12:00:18 -04:00
suliuzh	20a0894d1a	update version for scipy (#7998 )	2020-10-26 08:56:56 -04:00
Patrick von Platen	3c682ea15c	[Examples] Allow EncoderDecoderModels to be trained with Seq2Seq (#7809 ) * Make Seq2Seq Trainer more similar to Trainer * fix typo * fix seq2seq trainer * remove from tests * remove lock * remove train files * delete test files * correct typo * check at init * make sure trainer is not slowed down on TPU * correct isort * remove use cache * fix use cache * add last use chache = false	2020-10-23 23:05:51 +02:00
Ethan Perez	d39da5a2ab	Handling longformer model_type (#7990 ) Updating the run_squad training script to handle the "longformer" `model_type`. The longformer is trained in the same was as RoBERTa, so I've added the "longformer" `model_type` (that's the right hugginface name for the LongFormer model, right?) everywhere there was a "roberta" `model_type` reference. The longformer (like RoBERTa) doesn't use `token_type_ids` (as I understand from looking at the [longformer notebook](https://github.com/patil-suraj/Notebooks/blob/master/longformer_qa_training.ipynb), which is what gets updated after this change. This fix might be related to [this issue](https://github.com/huggingface/transformers/issues/7249) with SQuAD training when using run_squad.py	2020-10-23 10:34:06 -04:00
Lalit Pagaria	88b3a91e61	Handle the case when title is None (#7941 )	2020-10-23 15:54:45 +02:00
Stas Bekman	023f0f3708	[s2s trainer] tests to use distributed on multi-gpu machine (#7965 )	2020-10-22 17:26:22 -04:00
Sylvain Gugger	2e5052d4f1	New run glue script (#7917 ) * Start simplification * More progress * Finished script * Address comments and update tests instructions * Wrong test * Accept files as inputs and fix test * Update src/transformers/trainer_utils.py Co-authored-by: Julien Chaumond <chaumond@gmail.com> * Fix labels and add combined score * Add special labels * Update TPU command * Revert to old label strategy * Use model labels * Fix for STT-B * Styling * Apply suggestions from code review Co-authored-by: Thomas Wolf <thomwolf@users.noreply.github.com> * Code styling * Fix review comments Co-authored-by: Julien Chaumond <chaumond@gmail.com> Co-authored-by: Thomas Wolf <thomwolf@users.noreply.github.com>	2020-10-22 11:42:22 -04:00
wlhgtc	a16e568f22	# Add whole word mask support for lm fine-tune (#7925 ) * ADD: add whole word mask proxy for both eng and chinese * MOD: adjust format * MOD: reformat code * MOD: update import * MOD: fix bug * MOD: add import * MOD: fix bug * MOD: decouple code and update readme * MOD: reformat code * Update examples/language-modeling/README.md Co-authored-by: Sylvain Gugger <35901082+sgugger@users.noreply.github.com> * Update examples/language-modeling/README.md Co-authored-by: Sylvain Gugger <35901082+sgugger@users.noreply.github.com> * Update examples/language-modeling/run_language_modeling.py Co-authored-by: Sylvain Gugger <35901082+sgugger@users.noreply.github.com> * Update examples/language-modeling/run_language_modeling.py Co-authored-by: Sylvain Gugger <35901082+sgugger@users.noreply.github.com> * Update examples/language-modeling/run_language_modeling.py Co-authored-by: Sylvain Gugger <35901082+sgugger@users.noreply.github.com> * Update examples/language-modeling/run_language_modeling.py Co-authored-by: Sylvain Gugger <35901082+sgugger@users.noreply.github.com> * change wwm to whole_word_mask * reformat code * reformat * format * Code quality * ADD: update chinese ref readme * MOD: small changes * MOD: small changes2 * update readme Co-authored-by: Sylvain Gugger <35901082+sgugger@users.noreply.github.com> Co-authored-by: Sylvain Gugger <sylvain.gugger@gmail.com>	2020-10-22 09:19:00 -04:00
Stas Bekman	8b38173398	[seq2seq testing] multigpu test run via subprocess (#7281 ) Co-authored-by: Sam Shleifer <sshleifer@gmail.com>	2020-10-21 17:20:53 -04:00
Stas Bekman	0e24e4c136	[s2s] create doc for pegasus/fsmt replication (#7934 )	2020-10-20 15:07:52 -04:00
Stas Bekman	3e31e7f956	[testing] rename skip targets + docs (#7863 ) * rename skip targets + docs * fix quotes * style * Apply suggestions from code review Co-authored-by: Sylvain Gugger <35901082+sgugger@users.noreply.github.com> * small improvements * fix Co-authored-by: Sylvain Gugger <35901082+sgugger@users.noreply.github.com>	2020-10-20 04:39:13 -04:00
Quentin Lhoest	033f29c625	Allow Custom Dataset in RAG Retriever (#7763 ) * add CustomHFIndex * typo in config * update tests * add custom dataset example * clean script * update test data * minor in test * docs * docs * style * fix imports * allow to pass the indexed dataset directly * update tests * use multiset DPR * address thom and patrick's comments * style * update dpr tokenizer * add output_dir flag in use_own_knowledge_dataset.py * allow custom datasets in examples/rag/finetune.py * add test for custom dataset in distributed rag retriever	2020-10-19 19:42:45 +02:00
Thomas Wolf	ba8c4d0ac0	[Dependencies|tokenizers] Make both SentencePiece and Tokenizers optional dependencies (#7659 ) * splitting fast and slow tokenizers [WIP] * [WIP] splitting sentencepiece and tokenizers dependencies * update dummy objects * add name_or_path to models and tokenizers * prefix added to file names * prefix * styling + quality * spliting all the tokenizer files - sorting sentencepiece based ones * update tokenizer version up to 0.9.0 * remove hard dependency on sentencepiece 🎉 * and removed hard dependency on tokenizers 🎉 * update conversion script * update missing models * fixing tests * move test_tokenization_fast to main tokenization tests - fix bugs * bump up tokenizers * fix bert_generation * update ad fix several tokenizers * keep sentencepiece in deps for now * fix funnel and deberta tests * fix fsmt * fix marian tests * fix layoutlm * fix squeezebert and gpt2 * fix T5 tokenization * fix xlnet tests * style * fix mbart * bump up tokenizers to 0.9.2 * fix model tests * fix tf models * fix seq2seq examples * fix tests without sentencepiece * fix slow => fast conversion without sentencepiece * update auto and bert generation tests * fix mbart tests * fix auto and common test without tokenizers * fix tests without tokenizers * clean up tests lighten up when tokenizers + sentencepiece are both off * style quality and tests fixing * add sentencepiece to doc/examples reqs * leave sentencepiece on for now * style quality split hebert and fix pegasus * WIP Herbert fast * add sample_text_no_unicode and fix hebert tokenization * skip FSMT example test for now * fix style * fix fsmt in example tests * update following Lysandre and Sylvain's comments * Update src/transformers/testing_utils.py Co-authored-by: Sylvain Gugger <35901082+sgugger@users.noreply.github.com> * Update src/transformers/testing_utils.py Co-authored-by: Sylvain Gugger <35901082+sgugger@users.noreply.github.com> * Update src/transformers/tokenization_utils_base.py Co-authored-by: Sylvain Gugger <35901082+sgugger@users.noreply.github.com> * Update src/transformers/tokenization_utils_base.py Co-authored-by: Sylvain Gugger <35901082+sgugger@users.noreply.github.com> Co-authored-by: Sylvain Gugger <35901082+sgugger@users.noreply.github.com>	2020-10-18 20:51:24 +02:00
Stas Bekman	9f7b2b2432	[s2s testing] turn all to unittests, use auto-delete temp dirs (#7859 )	2020-10-17 14:33:21 -04:00
Stas Bekman	1652ddad35	[seq2seq testing] improve readability (#7845 )	2020-10-16 09:05:29 -04:00

1298 Commits