HuggingFace_transformer

Author	SHA1	Message	Date
Funtowicz Morgan	2c05b8a56c	Remove tqdm logging when using pipelines. (#3833 ) Introduce tqdm_enabled parameter on squad_convert_examples_to_features() default to True and set to False in QA pipelines.	2020-04-20 22:58:52 +02:00
Jared T Nielsen	c79b550dd0	Add `qas_id` to SquadResult and SquadExample (#3745 ) * Add qas_id * Fix incorrect name in squad.py * Make output files optional for squad eval	2020-04-20 16:08:57 -04:00
Patrick von Platen	c4158a6314	[Pipelines] Encode to max length of input not max length of tokenizer for batch input (#3857 ) * remove max_length = tokenizer.max_length when encoding * make style	2020-04-20 14:39:16 -04:00
Thomas Wolf	827d6d6ef0	Cleanup fast tokenizers integration (#3706 ) * First pass on utility classes and python tokenizers * finishing cleanup pass * style and quality * Fix tests * Updating following @mfuntowicz comment * style and quality * Fix Roberta * fix batch_size/seq_length in BatchEncoding * add alignment methods + tests * Fix OpenAI and Transfo-XL tokenizers * adding trim_offsets=True default for GPT2 and RoBERTa * style and quality * fix tests * add_prefix_space in roberta * bump up tokenizers to rc7 * style * unfortunately tensorflow does like these - removing shape/seq_len for now * Update src/transformers/tokenization_utils.py Co-Authored-By: Stefan Schweter <stefan@schweter.it> * Adding doc and docstrings * making flake8 happy Co-authored-by: Stefan Schweter <stefan@schweter.it>	2020-04-18 13:43:57 +02:00
Patrick von Platen	e9d0bc027a	[Config, Serialization] more readable config serialization (#3797 ) * better config serialization * finish configuration utils	2020-04-17 20:07:18 -04:00
Lysandre Debut	8b63a01d95	XLM tokenizer should encode with bos token (#3791 ) * XLM tokenizer should encode with bos token * Update tests	2020-04-17 11:28:55 -04:00
Santiago Castro	c19727fd38	Add support for the null answer in `QuestionAnsweringPipeline` (#3441 ) * Add support for the null answer in `QuestionAnsweringPipeline` * black * Fix min null score computation * Fix a PR comment	2020-04-17 11:17:21 -04:00
Simon Böhm	edf0582c0b	Fix token_type_id in BERT question-answering example (#3790 ) token_type_id is converted into the segment embedding. For question answering, this needs to highlight whether a token belongs to sequence 0 or 1. encode_plus takes care of correctly setting this parameter automatically.	2020-04-17 11:14:12 -04:00
Pierric Cistac	6d00033e97	Question Answering support for Albert and Roberta in TF (#3812 ) * Add TFAlbertForQuestionAnswering * Add TFRobertaForQuestionAnswering * Update TFAutoModel with Roberta/Albert for QA * Clean `super` TF Albert calls	2020-04-17 10:45:30 -04:00
Aryansh Omray	14cdeee75a	Tanh torch warnings	2020-04-16 15:10:35 -04:00
Sam Shleifer	16469fedbd	[PretrainedTokenizer] Factor out tensor conversion method (#3777 )	2020-04-16 15:02:43 -04:00
Lysandre Debut	d486795158	JIT not compatible with PyTorch/XLA (#3743 )	2020-04-16 11:19:24 -04:00
Patrick von Platen	baca8fa8e6	clean pipelines (#3795 )	2020-04-16 10:21:34 -04:00
Patrick von Platen	38f7461df3	[TFT5, Cache] Add cache to TFT5 (#3772 ) * correct gpt2 test inputs * make style * delete modeling_gpt2 change in test file * translate from pytorch * correct tests * fix conflicts * fix conflicts * fix conflicts * fix conflicts * make tensorflow t5 caching work * make style * clean reorder cache * remove unnecessary spaces * fix test	2020-04-16 16:14:52 +02:00
Patrick von Platen	a5b249472e	change pad token id to config pad token id (#3793 )	2020-04-16 15:58:57 +02:00
Sam Shleifer	dbd041243d	[cleanup] factor out get_head_mask, invert_attn_mask, get_exten… (#3806 ) * Delete some copy pasted code	2020-04-16 09:55:25 -04:00
Patrick von Platen	01c37dcdb5	[Config, Caching] Remove `output_past` everywhere and replace by `use_cache` argument (#3734 ) * remove output_past from pt * make style * add optional input length for gpt2 * add use cache to prepare input * save memory in gpt2 * correct gpt2 test inputs * make past input optional for gpt2 * finish use_cache for all models * make style * delete modeling_gpt2 change in test file * correct docstring * correct is true statements for gpt2	2020-04-14 14:40:28 -04:00
Patrick von Platen	092cf881a5	[Generation, EncoderDecoder] Apply Encoder Decoder 1.5GB memory… (#3778 )	2020-04-13 22:29:28 -04:00
Teven	352d5472b0	Shift labels internally within TransfoXLLMHeadModel when called with labels (#3716 ) * Shifting labels inside TransfoXLLMHead * Changed doc to reflect change * Updated pytorch test * removed IDE whitespace changes * black reformat Co-authored-by: TevenLeScao <teven.lescao@gmail.com>	2020-04-13 18:11:23 +02:00
Anthony MOI	b7cf9f43d2	Update tokenizers to 0.7.0-rc5 (#3705 )	2020-04-10 14:23:49 -04:00
Jin Young Sohn	551b450527	Add `run_glue_tpu.py` that trains models on TPUs (#3702 ) * Initial commit to get BERT + run_glue.py on TPU * Add README section for TPU and address comments. * Cleanup TPU bits from run_glue.py (#3) TPU runner is currently implemented in: https://github.com/pytorch-tpu/transformers/blob/tpu/examples/run_glue_tpu.py. We plan to upstream this directly into `huggingface/transformers` (either `master` or `tpu`) branch once it's been more thoroughly tested. * Cleanup TPU bits from run_glue.py TPU runner is currently implemented in: https://github.com/pytorch-tpu/transformers/blob/tpu/examples/run_glue_tpu.py. We plan to upstream this directly into `huggingface/transformers` (either `master` or `tpu`) branch once it's been more thoroughly tested. * No need to call `xm.mark_step()` explicitly (#4) Since for gradient accumulation we're accumulating on batches from `ParallelLoader` instance which on next() marks the step itself. * Resolve R/W conflicts from multiprocessing (#5) * Add XLNet in list of models for `run_glue_tpu.py` (#6) * Add RoBERTa to list of models in TPU GLUE (#7) * Add RoBERTa and DistilBert to list of models in TPU GLUE (#8) * Use barriers to reduce duplicate work/resources (#9) * Shard eval dataset and aggregate eval metrics (#10) * Shard eval dataset and aggregate eval metrics Also, instead of calling `eval_loss.item()` every time do summation with tensors on device. * Change defaultdict to float * Reduce the pred, label tensors instead of metrics As brought up during review some metrics like f1 cannot be aggregated via averaging. GLUE task metrics depends largely on the dataset, so instead we sync the prediction and label tensors so that the metrics can be computed accurately on those instead. 
* Only use tb_writer from master (#11) * Apply huggingface black code formatting * Style * Remove `--do_lower_case` as example uses cased * Add option to specify tensorboard logdir This is needed for our testing framework which checks regressions against key metrics written by the summary writer. * Using configuration for `xla_device` * Prefix TPU specific comments. * num_cores clarification and namespace eval metrics * Cache features file under `args.cache_dir` Instead of under `args.data_dir`. This is needed as our test infra uses data_dir with a read-only filesystem. * Rename `run_glue_tpu` to `run_tpu_glue` Co-authored-by: LysandreJik <lysandre.debut@reseau.eseo.fr>	2020-04-10 12:53:54 -04:00
Julien Chaumond	b169ac9c2b	[examples] Generate argparsers from type hints on dataclasses (#3669 ) * [examples] Generate argparsers from type hints on dataclasses * [HfArgumentParser] way simpler API * Restore run_language_modeling.py for easier diff * [HfArgumentParser] final tweaks from code review	2020-04-10 12:21:58 -04:00
Sam Shleifer	7a7fdf71f8	Multilingual BART - (#3602 ) - support mbart-en-ro weights - add MBartTokenizer	2020-04-10 11:25:39 -04:00
Julien Chaumond	f98d0ef2a2	Big cleanup of `glue_convert_examples_to_features` (#3688 ) * Big cleanup of `glue_convert_examples_to_features` * Use batch_encode_plus * Cleaner wrapping of glue_convert_examples_to_features for TF @lysandrejik * Cleanup syntax, thanks to @mfuntowicz * Raise explicit error in case of user error	2020-04-10 10:20:18 -04:00
Patrick von Platen	ce2298fb5f	[T5, generation] Add decoder caching for T5 (#3682 ) * initial commit to add decoder caching for T5 * better naming for caching * finish T5 decoder caching * correct test * added extensive past testing for T5 * clean files * make tests cleaner * improve docstring * improve docstring * better reorder cache * make style * Update src/transformers/modeling_t5.py Co-Authored-By: Yacine Jernite <yjernite@users.noreply.github.com> * make set output past work for all layers * improve docstring * improve docstring Co-authored-by: Yacine Jernite <yjernite@users.noreply.github.com>	2020-04-10 01:02:50 +02:00
calpt	9384e5f6de	Fix force_download of files on Windows (#3697 )	2020-04-09 14:44:57 -04:00
Lysandre Debut	6435b9f908	Updating the TensorFlow models to work as expected with tokenizers v3.0.0 (#3684 ) * Updating modeling tf files; adding tests * Merge `encode_plus` and `batch_encode_plus`	2020-04-08 16:22:44 -04:00
LysandreJik	500aa12318	close #3699	2020-04-08 14:32:47 -04:00
Julien Chaumond	83703cd077	Update doc for {Summarization,Translation}Pipeline and other tweaks	2020-04-08 09:45:00 -04:00
Lorenzo Ampil	747907dc5e	Fix typo in FeatureExtractionPipeline docstring	2020-04-08 09:08:56 -04:00
Sam Shleifer	715aa5b135	[Bart] Replace config.output_past with use_cache kwarg (#3632 )	2020-04-07 19:08:26 -04:00
Patrick von Platen	b0ad069517	[Tokenization] fix edge case for bert tokenization (#3517 ) * fix edge case for bert tokenization * add Lysandre's comments for improvement * use new is_pretokenized_flag	2020-04-07 16:26:31 -04:00
Michael Pang	05deb52dc1	Optimize causal mask using torch.where (#2715 ) * Optimize causal mask using torch.where Instead of multiplying by 1.0 float mask, use torch.where with a bool mask for increased performance. * Maintain compatibility with torch 1.0.0 - thanks for PR feedback * Fix typo * reformat line for CI	2020-04-07 22:19:18 +02:00
Myle Ott	5aa8a278a3	Fix roberta checkpoint conversion script (#3642 )	2020-04-07 12:03:23 -04:00
Julien Chaumond	11cc1e168b	[model_cards] Turn down spurious warnings Close #3639 + spurious warning mentioned in #3227 cc @lysandrejik @thomwolf	2020-04-07 10:20:19 -04:00
Teven	0a9d09b42a	fixed TransfoXLLMHeadModel documentation (#3661 ) Co-authored-by: TevenLeScao <teven.lescao@gmail.com>	2020-04-07 00:47:51 +02:00
Funtowicz Morgan	96ab75b8dd	Tokenizers v3.0.0 (#3185 ) * Renamed num_added_tokens to num_special_tokens_to_add Signed-off-by: Morgan Funtowicz <morgan@huggingface.co> * Cherry-Pick: Partially fix space only input without special tokens added to the output #3091 Signed-off-by: Morgan Funtowicz <morgan@huggingface.co> * Added property is_fast on PretrainedTokenizer and PretrainedTokenizerFast Signed-off-by: Morgan Funtowicz <morgan@huggingface.co> * Make fast tokenizers unittests work on Windows. * Entirely refactored unittest for tokenizers fast. * Remove ABC class for CommonFastTokenizerTest * Added embeded_special_tokens tests from allenai @dirkgr * Make embeded_special_tokens tests from allenai more generic * Uniformize vocab_size as a property for both Fast and normal tokenizers * Move special tokens handling out of PretrainedTokenizer (SpecialTokensMixin) * Ensure providing None input raise the same ValueError than Python tokenizer + tests. * Fix invalid input for assert_padding when testing batch_encode_plus * Move add_special_tokens from constructor to tokenize/encode/[batch_]encode_plus methods parameter. * Ensure tokenize() correctly forward add_special_tokens to rust. * Adding None checking on top on encode / encode_batch for TransfoXLTokenizerFast. Avoid stripping on None values. * unittests ensure tokenize() also throws a ValueError if provided None * Added add_special_tokens unittest for all supported models. * Style * Make sure TransfoXL test run only if PyTorch is provided. * Split up tokenizers tests for each model type. * Fix invalid unittest with new tokenizers API. * Filter out Roberta openai detector models from unittests. * Introduce BatchEncoding on fast tokenizers path. This new structure exposes all the mappings retrieved from Rust. It also keeps the current behavior with model forward. * Introduce BatchEncoding on slow tokenizers path. Backward compatibility. 
* Improve error message on BatchEncoding for slow path * Make add_prefix_space True by default on Roberta fast to match Python in majority of cases. * Style and format. * Added typing on all methods for PretrainedTokenizerFast * Style and format * Added path for feeding pretokenized (List[str]) input to PretrainedTokenizerFast. * Style and format * encode_plus now supports pretokenized inputs. * Remove user warning about add_special_tokens when working on pretokenized inputs. * Always go through the post processor. * Added support for pretokenized input pairs on encode_plus * Added is_pretokenized flag on encode_plus for clarity and improved error message on input TypeError. * Added pretokenized inputs support on batch_encode_plus * Update BatchEncoding methods name to match Encoding. * Bump setup.py tokenizers dependency to 0.7.0rc1 * Remove unused parameters in BertTokenizerFast * Make sure Roberta returns token_type_ids for unittests. * Added missing typings * Update add_tokens prototype to match tokenizers side and allow AddedToken * Bumping tokenizers to 0.7.0rc2 * Added documentation for BatchEncoding * Added (unused) is_pretokenized parameter on PreTrainedTokenizer encode_plus/batch_encode_plus methods. * Added higher-level typing for tokenize / encode_plus / batch_encode_plus. * Fix unittests failing because add_special_tokens was defined as a constructor parameter on Rust Tokenizers. * Fix text-classification pipeline using the wrong tokenizer * Make pipelines works with BatchEncoding * Turn off add_special_tokens on tokenize by default. Signed-off-by: Morgan Funtowicz <morgan@huggingface.co> * Remove add_prefix_space from tokenize call in unittest. Signed-off-by: Morgan Funtowicz <morgan@huggingface.co> * Style and quality Signed-off-by: Morgan Funtowicz <morgan@huggingface.co> * Correct message for batch_encode_plus none input exception. 
Signed-off-by: Morgan Funtowicz <morgan@huggingface.co> * Fix invalid list comprehension for offset_mapping overriding content every iteration. Signed-off-by: Morgan Funtowicz <morgan@huggingface.co> * TransfoXL uses Strip normalizer. Signed-off-by: Morgan Funtowicz <morgan@huggingface.co> * Bump tokenizers dependency to 0.7.0rc3 Signed-off-by: Morgan Funtowicz <morgan@huggingface.co> * Support AddedTokens for special_tokens and use left stripping on mask for Roberta. Signed-off-by: Morgan Funtowicz <morgan@huggingface.co> * SpecilaTokenMixin can use slots to faster access to underlying attributes. Signed-off-by: Morgan Funtowicz <morgan@huggingface.co> * Remove update_special_tokens from fast tokenizers. * Ensure TransfoXL unittests are run only when torch is available. * Style. Signed-off-by: Morgan Funtowicz <morgan@huggingface.co> * Style * Style 🙏🙏 * Remove slots on SpecialTokensMixin, need deep dive into pickle protocol. * Remove Roberta warning on __init__. * Move documentation to Google style. Co-authored-by: LysandreJik <lysandre.debut@reseau.eseo.fr>	2020-04-07 00:29:15 +02:00
LysandreJik	36bffc81b3	Release: v2.8.0	2020-04-06 10:03:53 -04:00
Patrick von Platen	1789c7daf1	fix argument order (#3637 )	2020-04-05 12:33:41 +02:00
Patrick von Platen	b809d2f073	Fix TF T5 docstring (#3636 )	2020-04-05 12:23:09 +02:00
Julien Chaumond	5d912e7ed4	Tweak typing for #3566	2020-04-04 15:04:03 -04:00
Julien Chaumond	94eb68d742	weigths*weights	2020-04-04 15:03:26 -04:00
Max Ryabinin	c6acd246ec	Speed up GELU computation with torch.jit (#2988 ) * Compile gelu_new with torchscript * Compile _gelu_python with torchscript * Wrap gelu_new with torch.jit for torch>=1.4	2020-04-03 15:20:21 -04:00
Lysandre Debut	d5d7d88612	ELECTRA (#3257 ) * Electra wip * helpers * Electra wip * Electra v1 * ELECTRA may be saved/loaded * Generator & Discriminator * Embedding size instead of halving the hidden size * ELECTRA Tokenizer * Revert BERT helpers * ELECTRA Conversion script * Archive maps * PyTorch tests * Start fixing tests * Tests pass * Same configuration for both models * Compatible with base + large * Simplification + weight tying * Archives * Auto + Renaming to standard names * ELECTRA is uncased * Tests * Slight API changes * Update tests * wip * ElectraForTokenClassification * temp * Simpler arch + tests Removed ElectraForPreTraining which will be in a script * Conversion script * Auto model * Update links to S3 * Split ElectraForPreTraining and ElectraForTokenClassification * Actually test PreTraining model * Remove num_labels from configuration * wip * wip * From discriminator and generator to electra * Slight API changes * Better naming * TensorFlow ELECTRA tests * Accurate conversion script * Added to conversion script * Fast ELECTRA tokenizer * Style * Add ELECTRA to README * Modeling Pytorch Doc + Real style * TF Docs * Docs * Correct links * Correct model initialized * random fixes * style * Addressing Patrick's and Sam's comments * Correct links in docs	2020-04-03 14:10:54 -04:00
Yohei Tamura	8594dd80dd	BertJapaneseTokenizer accept options for mecab (#3566 ) * BertJapaneseTokenizer accept options for mecab * black * fix mecab_option to Option[str]	2020-04-03 11:12:19 -04:00
Patrick von Platen	f68d22850c	delete bogus print statement (#3595 )	2020-04-02 21:49:34 +02:00
Patrick von Platen	390c128592	[Encoder-Decoder] Force models outputs to always have batch_size as their first dim (#3536 ) * solve conflicts * improve comments	2020-04-02 15:18:33 +02:00
Patrick von Platen	a4ee4da18a	[T5, TF 2.2] change tf t5 argument naming (#3547 ) * change tf t5 argument naming for TF 2.2 * correct bug in testing	2020-04-01 22:04:20 +02:00
Patrick von Platen	06dd597552	fix bug in warnings T5 pipelines (#3545 )	2020-04-01 21:59:12 +02:00
Anirudh Srinivasan	9de9ceb6c5	Correct output shape for Bert NSP models in docs (#3482 )	2020-04-01 15:04:38 -04:00

474 Commits