HuggingFace_transformer

Author	SHA1	Message	Date
Yuta Hayashibe	66b01ce864	Warn for unexpected argument combinations (#13509 ) * Warn for unexpected argument combinations * Updated the waning message for pad_to_max_length	2021-09-24 09:14:23 -04:00
Stas Bekman	f1c22dae7d	[tokenizer] use use_auth_token for config (#13523 ) * [tokenizer] use use_auth_token for config * args order	2021-09-13 07:31:35 -04:00
Apoorv Garg	b91e65afe0	Correct order of overflowing_tokens for slow tokenizer (#13179 ) * correct order of overflowing_tokens for slow tokenizer (issue fix #13148) * python 3.9 requires sentencepiece version 0.1.94 or above * slicing of ids fixed in truncated_sequence() * Update setup.py * Correct order of overflowing tokens for pair of sentences * code reformatted * Update tokenization_utils_base.py * reformatting file * test to check single_input added * missing function restored * test to check pair_input overflowing tokens order * test to check pair_input overflowing tokens order * test to check pair_input overflowing tokens order * added an error message for pair of seq and longest_first strategy * test for pair_input modified * variable name corrected * fixed a typo in error message * requested changes implemented * required test added * Corrected the message to match test message * added error message for Luke Tokenizer * lost test recovered * docstring for truncate_sequences and prepare_for_model updated * docstring for luke tokenizer updated * updated ENCODE_PLUS_ADDITIONAL_KWARGS_DOCSTRING * aligned text and fixed puncuatations * improved style and quality of code * fixed error_msg in truncate_sequences * replaced encode_plus method with regular call method * clean up * rephrased the docstring	2021-09-02 05:58:23 -04:00
arfy slowy	01977466f4	fix: typo spelling grammar (#13212 ) * fix: typo spelling grammar * fix: make fixup	2021-08-30 08:09:14 -04:00
Bram Vanroy	401377e679	Add error message concerning revision (#13266 ) * add error message concerning revision * Update src/transformers/configuration_utils.py Co-authored-by: Lysandre Debut <lysandre@huggingface.co> * re-add double line endings * is not None instead of implicit bool casting Co-authored-by: Lysandre Debut <lysandre@huggingface.co>	2021-08-26 04:32:57 -04:00
Lysandre Debut	5af8df5afb	Some `model_type`s cannot be in the mapping (#13259 ) * Some tokenizers cannot be in the mapping * Style	2021-08-25 12:56:16 -04:00
Bram Vanroy	39db2f3c19	Allow local_files_only for fast pretrained tokenizers (#13225 ) * allow local_files_only for fast pretrained tokenizers * make style	2021-08-24 03:05:33 -04:00
SaulLu	7223844df9	Change how "additional_special_tokens" argument in the ".from_pretrained" method of the tokenizer is taken into account (#13056 ) * add test * add change in PretrainedTokenizerBase * change Luke * deactivate * add the possibility to add additional special tokens for M2M100 * format * add special test for canine * proposed changes for mbart * proposed changes for mbart50 * proposed changes for byt5 * proposed changes for canine * proposed changes for t5 * test fast and slow * remove comment * remove comment * add fast version for all tests * replace break by continue * add more comments * add check to avoid duplicates * remove comment * format * proposed change for wave2vec2 * reverse changes mbart * uncomment * format	2021-08-23 14:35:18 +02:00
Sylvain Gugger	9870093f7b	[WIP] Disentangle auto modules from other modeling files (#13023 ) * Initial work * All auto models * All tf auto models * All flax auto models * Tokenizers * Add feature extractors * Fix typos * Fix other typo * Use the right config * Remove old mapping names and update logic in AutoTokenizer * Update check_table * Fix copies and check_repo script * Fix last test * Add back name * clean up * Update template * Update template * Forgot a ) * Use alternative to fixup * Fix TF model template * Address review comments * Address review comments * Style	2021-08-06 13:12:30 +02:00
Sylvain Gugger	5f43623843	Add possibility to ignore imports in test_fecther (#12801 ) * Add possibility to ignore imports in test_fecther * Style	2021-07-26 09:48:19 -04:00
Sylvain Gugger	786ced3639	Add versioning system to fast tokenizer files (#12713 ) * Add versioning system to fast tokenizer files * Deal with offline mode * Use staging env in tests * Style * Apply suggestions from code review Co-authored-by: Lysandre Debut <lysandre@huggingface.co> * Style Co-authored-by: Lysandre Debut <lysandre@huggingface.co>	2021-07-21 08:24:36 -04:00
Sylvain Gugger	da72ac6e26	Fix push_to_hub docstring and make it appear in doc (#12770 )	2021-07-17 15:52:33 +02:00
Tomohiro Endo	08d609bfb8	Add tokenizers class mismatch detection between `cls` and checkpoint (#12619 ) * Detect mismatch by analyzing config * Fix comment * Fix import * Update src/transformers/tokenization_utils_base.py Co-authored-by: SaulLu <55560583+SaulLu@users.noreply.github.com> * Revise based on reviews * remove kwargs * Fix exception * Fix handling exception again * Disable mismatch test in PreTrainedTokenizerFast Co-authored-by: SaulLu <55560583+SaulLu@users.noreply.github.com>	2021-07-17 15:52:21 +02:00
SaulLu	6e87010060	Preserve `list` type of `additional_special_tokens` in `special_token_map` (#12759 ) * preserve type of `additional_special_tokens` in `special_token_map` * format * Update src/transformers/tokenization_utils_base.py Co-authored-by: Sylvain Gugger <35901082+sgugger@users.noreply.github.com> Co-authored-by: Sylvain Gugger <35901082+sgugger@users.noreply.github.com>	2021-07-16 18:26:54 +02:00
Stas Bekman	7a22a02a70	[tokenizer.prepare_seq2seq_batch] change deprecation to be easily actionable (#12669 ) * change deprecation to be easily actionable * Update src/transformers/tokenization_utils_base.py Co-authored-by: Sylvain Gugger <35901082+sgugger@users.noreply.github.com> * rework as suggested * one warning together * fix format Co-authored-by: Sylvain Gugger <35901082+sgugger@users.noreply.github.com>	2021-07-13 09:19:04 -07:00
Sylvain Gugger	0f43e742d9	Fix typo	2021-07-12 10:32:51 -04:00
Sylvain Gugger	53c60babe4	Clean push to hub API (#12187 ) * Clean push to hub API * Create working dir if it does not exist * Different tweak * New API + all models + test Flax * Adds the Trainer clean up * Update src/transformers/file_utils.py Co-authored-by: Lysandre Debut <lysandre@huggingface.co> * Address review comments * (nit) output types * No need to set clone_from when folder exists * Update src/transformers/trainer.py Co-authored-by: Julien Chaumond <julien@huggingface.co> * Add generated_from_trainer tag * Update to new version * Fixes Co-authored-by: Lysandre Debut <lysandre@huggingface.co> Co-authored-by: Julien Chaumond <julien@huggingface.co> Co-authored-by: Lysandre <lysandre.debut@reseau.eseo.fr>	2021-06-23 10:11:19 -04:00
Sylvain Gugger	adb70eda4d	AutoTokenizer: infer the class from the tokenizer config if possible (#12208 ) * AutoTokenizer: infer the class from the tokenizer config if possible * Add tests * Update src/transformers/models/auto/tokenization_auto.py Co-authored-by: Patrick von Platen <patrick.v.platen@gmail.com> Co-authored-by: Patrick von Platen <patrick.v.platen@gmail.com>	2021-06-17 12:39:22 -04:00
SaulLu	476ba679dd	Feature to use the PreTrainedTokenizerFast class as a stand-alone tokenizer (#11810 ) * feature for tokenizer without slow/legacy version * format * modify common test * add tests * add PreTrainedTokenizerFast to AutoTokenizer * format * change tokenizer common test in order to be able to run test without a slow version * update tokenizer fast test in order to use `rust_tokenizer_class` attribute instead of `tokenizer_class` * add autokenizer test * replace `if self.tokenizer_class is not None` with ` if self.tokenizer_class is None` * remove obsolete change in comment * Update src/transformers/tokenization_utils_base.py Co-authored-by: Lysandre Debut <lysandre@huggingface.co> * Update src/transformers/tokenization_utils_fast.py Co-authored-by: Sylvain Gugger <35901082+sgugger@users.noreply.github.com> * change `get_main_tokenizer` into `get_tokenizers` * clarify `get_tokenizers` method * homogenize with `test_slow_tokenizer` and `test_rust_tokenizer` * add `test_rust_tokenizer = False` to tokenizer which don't define a fast version * `test_rust_tokenizer = False` for BertJapaneseTokenizer * `test_rust_tokenizer = False` for BertJapaneseCharacterTokenizationTest Co-authored-by: Lysandre Debut <lysandre@huggingface.co> Co-authored-by: Sylvain Gugger <35901082+sgugger@users.noreply.github.com>	2021-06-14 11:58:44 +02:00
NielsRogge	f3cf8ae7b3	Add LUKE (#11223 ) * Rebase with master * Minor bug fix in docs * Copy files from adding_luke_v2 and improve docs * change the default value of use_entity_aware_attention to True * remove word_hidden_states * fix head models * fix tests * fix the conversion script * add integration tests for the pretrained large model * improve docstring * Improve docs, make style * fix _init_weights for pytorch 1.8 * improve docs * fix tokenizer to construct entity sequence with [MASK] entity when entities=None * Make fix-copies * Make style & quality * Bug fixes * Add LukeTokenizer to init * Address most comments by @patil-suraj and @LysandreJik * rename _compute_extended_attention_mask to get_extended_attention_mask * add comments to LukeSelfAttention * fix the documentation of the tokenizer * address comments by @patil-suraj, @LysandreJik, and @sgugger * improve docs * Make style, quality and fix-copies * Improve docs * fix docs * add "entity_span_classification" task * update example code for LukeForEntitySpanClassification * improve docs * improve docs * improve the code example in luke.rst * rename the classification layer in LukeForEntityClassification from typing to classifier * add bias to the classifier in LukeForEntitySpanClassification * update docs to use fine-tuned hub models in code examples of the head models * update the example sentences * Make style & quality * Add require_torch to tokenizer tests * Add require_torch to tokenizer tests * Address comments by @sgugger and add community notebooks * Make fix-copies Co-authored-by: Ikuya Yamada <ikuya@ikuya.net>	2021-05-03 09:07:29 -04:00
Stas Bekman	282f3ac3ef	[debug utils] activation/weights underflow/overflow detector (#11274 ) * sync * add activation overflow debug utility * cleanup * document detect_overflow * import torch * add deprecation warning * Apply suggestions from code review Co-authored-by: Sylvain Gugger <35901082+sgugger@users.noreply.github.com> * convert to rst, add note * add class * fix docs * improve the doc * rework to dump a lot more info about each frame * complete expansion * cleanup * format * cleanup * doesn't have to be transformers * Apply suggestions from code review Co-authored-by: Sylvain Gugger <35901082+sgugger@users.noreply.github.com> * wrap long line * style Co-authored-by: Sylvain Gugger <35901082+sgugger@users.noreply.github.com>	2021-04-30 11:15:46 -07:00
Sylvain Gugger	ad1f7bef13	Reformat to make code clearer in tokenizer call (#11497 ) * Reformat to make code clearer * Reformat to make code clearer	2021-04-29 07:51:09 -04:00
Hamel Husain	c0eb218a55	Update `PreTrainedTokenizerBase` to check/handle batch length for `text_pair` parameter (#11486 ) * Update tokenization_utils_base.py * add assertion * check batch len * Update src/transformers/tokenization_utils_base.py Co-authored-by: Sylvain Gugger <35901082+sgugger@users.noreply.github.com> * add error message Co-authored-by: Sylvain Gugger <35901082+sgugger@users.noreply.github.com>	2021-04-28 10:11:17 -04:00
Sylvain Gugger	b03b2a653d	Style	2021-04-26 11:45:04 -04:00
Kostas Stathoulopoulos	6715e3b6a1	Clarify description of the is_split_into_words argument (#11449 ) * Improve documentation for is_split_into_words argument * Change description wording	2021-04-26 11:29:36 -04:00
Patrick von Platen	32dbb2d954	make style (#11442 )	2021-04-26 13:50:34 +02:00
Sylvain Gugger	bf2e0cf70b	Trainer push to hub (#11328 ) * Initial support for upload to hub * push -> upload * Fixes + examples * Fix torchhub test * Torchhub test I hate you * push_model_to_hub -> push_to_hub * Apply mixin to other pretrained models * Remove ABC inheritance * Add tests * Typo * Run tests * Install git-lfs * Change approach * Add push_to_hub to all * Staging test suite * Typo * Maybe like this? * More deps * Cache * Adapt name * Quality * MOAR tests * Put it in testing_utils * Docs + torchhub last hope * Styling * Wrong method * Typos * Update src/transformers/file_utils.py Co-authored-by: Julien Chaumond <julien@huggingface.co> * Address review comments * Apply suggestions from code review Co-authored-by: Patrick von Platen <patrick.v.platen@gmail.com> Co-authored-by: Julien Chaumond <julien@huggingface.co> Co-authored-by: Patrick von Platen <patrick.v.platen@gmail.com>	2021-04-23 09:17:37 -04:00
Takuya Makino	881945c0b5	Add space (#11373 )	2021-04-22 17:48:58 +05:30
Sylvain Gugger	2550b41aa2	Tokenizer fast save (#11234 ) * Save fast tokenizers in both formats * Fix for HerBERT * Proper fix * Properly test new behavior	2021-04-15 09:32:32 -04:00
Joel Stremmel	9337c6c668	make embeddings plural in warning message (#11228 )	2021-04-14 10:13:25 -04:00
Lysandre Debut	c0d97cee13	Adds a note to resize the token embedding matrix when adding special … (#11120 ) * Adds a note to resize the token embedding matrix when adding special tokens * Remove superfluous space	2021-04-07 10:06:45 -04:00
Stas Bekman	3d39226a51	s\|Pretrained\|PreTrained\| (#11048 )	2021-04-04 18:08:42 -07:00
Sylvain Gugger	acc3bd9d2a	Enforce string-formatting with f-strings (#10980 ) * First third * Styling and fix mistake * Quality * All the rest * Treat %s and %d * typo * Missing ) * Apply suggestions from code review Co-authored-by: Lysandre Debut <lysandre@huggingface.co> Co-authored-by: Lysandre Debut <lysandre@huggingface.co>	2021-03-31 10:00:27 -04:00
Sylvain Gugger	d0b3797a3b	Add more metadata to the user agent (#10972 ) * Add more metadata to the user agent * Fix typo * Use DISABLE_TELEMETRY * Address review comments * Use global env * Add clean envs on circle CI	2021-03-31 09:36:07 -04:00
Sylvain Gugger	89693e170d	Remove special treatment for custom vocab files (#10637 ) * Remove special path for custom vocab files * Update src/transformers/tokenization_utils_base.py Co-authored-by: Patrick von Platen <patrick.v.platen@gmail.com> * Expand error message Co-authored-by: Patrick von Platen <patrick.v.platen@gmail.com>	2021-03-11 11:11:56 -05:00
Patrick von Platen	9a06b6b11b	[FeatureExtractorSavingUtils] Refactor PretrainedFeatureExtractor (#10594 ) * save first version * finish refactor * finish refactor * correct naming * correct naming * shorter names * Update src/transformers/feature_extraction_common_utils.py Co-authored-by: Lysandre Debut <lysandre@huggingface.co> * change name * finish Co-authored-by: Lysandre Debut <lysandre@huggingface.co>	2021-03-09 12:16:59 +03:00
Stas Bekman	88a951e3cc	offline mode for firewalled envs (#10407 ) * offline mode start * add specific values * fix fallback * add test * better values check and range * test that actually works * document the offline mode * Apply suggestions from code review Co-authored-by: Sylvain Gugger <35901082+sgugger@users.noreply.github.com> * more strict check * cleaner test * pt-only test * style Co-authored-by: Sylvain Gugger <35901082+sgugger@users.noreply.github.com>	2021-03-05 17:27:48 -08:00
Patrick von Platen	cb38ffcc5e	[PretrainedFeatureExtractor] + Wav2Vec2FeatureExtractor, Wav2Vec2Processor, Wav2Vec2Tokenizer (#10324 ) * push to show * small improvement * small improvement * Update src/transformers/feature_extraction_utils.py * Update src/transformers/feature_extraction_utils.py * implement base * add common tests * make all tests pass for wav2vec2 * make padding work & add more tests * finalize feature extractor utils * add call method to feature extraction * finalize feature processor * finish tokenizer * finish general processor design * finish tests * typo * remove bogus file * finish docstring * add docs * finish docs * small fix * correct docs * save intermediate * load changes * apply changes * apply changes to doc * change tests * apply surajs recommend * final changes * Apply suggestions from code review * fix typo * fix import * correct docstring	2021-02-25 17:42:46 +03:00
Sylvain Gugger	9e147d31f6	Deprecate prepare_seq2seq_batch (#10287 ) * Deprecate prepare_seq2seq_batch * Fix last tests * Apply suggestions from code review Co-authored-by: Lysandre Debut <lysandre@huggingface.co> Co-authored-by: Suraj Patil <surajp815@gmail.com> * More review comments Co-authored-by: Lysandre Debut <lysandre@huggingface.co> Co-authored-by: Suraj Patil <surajp815@gmail.com>	2021-02-22 12:36:16 -05:00
Lysandre Debut	e73a3e1891	Add note to resize token embeddings matrix when adding new tokens to voc (#10331 )	2021-02-22 09:48:20 -05:00
Sylvain Gugger	a2e379743c	Fix style	2021-02-20 15:46:54 -05:00
cronoik	a0dfc2d30f	fixes #10303 (#10304 )	2021-02-20 15:21:33 -05:00
Stas Bekman	8fae93ca19	[t5 tokenizer] add info logs (#9897 ) * save fast tokenizer + add info logs * fix tests * remove the saving of fast tokenizer	2021-02-13 09:10:22 -05:00
Sylvain Gugger	45aaf5f7ab	A few fixes in the documentation (#10033 )	2021-02-08 05:02:01 -05:00
Sylvain Gugger	7898fc03b1	Add `from_slow` in fast tokenizers build and fixes some bugs (#9987 )	2021-02-04 03:34:23 -05:00
Patrick von Platen	538b3b4607	[Tokenizer Utils Base] Make pad function more flexible (#9928 ) * change tokenizer requirement * split line * Correct typo from list to str * improve style * make other function pretty as well * add comment * correct typo * add new test * pass tests for tok without padding token * Apply suggestions from code review	2021-02-02 10:35:27 +03:00
Ethan Chau	99b9affa02	Clarify use of unk_token in tokenizer docstrings (#9875 )	2021-01-29 05:11:53 -05:00
Lysandre Debut	6cb0a6f01a	Partial local tokenizer load (#9807 ) * Allow partial loading of a cached tokenizer * Warning > Info * Update src/transformers/tokenization_utils_base.py Co-authored-by: Sylvain Gugger <35901082+sgugger@users.noreply.github.com> * Raise error if not local_files_only Co-authored-by: Sylvain Gugger <35901082+sgugger@users.noreply.github.com>	2021-01-28 03:29:12 -05:00
Lysandre Debut	280db79ac1	BatchEncoding.to with device with tests (#9584 )	2021-01-14 07:57:58 -05:00
Sylvain Gugger	063d8d27f4	Refactor `prepare_seq2seq_batch` (#9524 ) * Add target contextmanager and rework prepare_seq2seq_batch * Fix tests, treat BART and Barthez * Add last tokenizers * Fix test * Set src token before calling the superclass * Remove special behavior for T5 * Remove needless imports * Remove needless asserts	2021-01-12 18:19:38 -05:00

116 Commits