HuggingFace_transformer

Author	SHA1	Message	Date
Yih-Dar	2189a7f54a	Fix `pad_token` check condition (#25685 ) fix Co-authored-by: ydshieh <ydshieh@users.noreply.github.com>	2023-08-23 16:39:28 +02:00
Arthur	30b3c46ff5	[`split_special_tokens`] Add support for `split_special_tokens` argument to encode (#25081 ) * draft changes * update and add tests * styling for no * move test * path to usable model * update test * small update * update bertbased tokenizers * don't use kwargs for _tokenize * don't use kwargs for _tokenize * fix copies * update * update test for special tokenizers * fixup * skip two tests * remove pdb breakpoint() * wowo * rewrite custom tests * nits * revert change in target keys * fix markup lm * update documentation of the argument	2023-08-18 13:26:27 +02:00
Yih-Dar	224da5df69	update `use_auth_token` -> `token` (#25083 ) * update --------- Co-authored-by: ydshieh <ydshieh@users.noreply.github.com>	2023-07-26 15:09:59 +02:00
Leo	c53c8e490c	fix "UserWarning: Creating a tensor from a list of numpy.ndarrays is … (#24772 ) fix "UserWarning: Creating a tensor from a list of numpy.ndarrays is extremely slow. Please consider converting the list to a single numpy.ndarray with numpy.array() before converting to a tensor." Co-authored-by: 刘长伟 <hzliuchw@corp.netease.com>	2023-07-26 09:07:21 -04:00
Sylvain Gugger	49eb357564	Fix token pass (#24862 ) * Fix how token is passed along in from_pretrained for tokenizers * It's actually not necessary	2023-07-17 13:27:11 -04:00
Nicolas Patry	50726f9ea7	Fixing double `use_auth_token.pop` (preventing private models from being visible). (#24812 ) Fixing double `use_auth_token.pop` (preventing private models from being visible). Should fix: https://github.com/huggingface/transformers/issues/14334#issuecomment-1634527833 Repro: Have a private repo, with `vocab.json` (spread out files for the tokenizer) and use `AutoTokenizer.from_pretrained(..., use_auth_token="token")`.	2023-07-14 15:20:02 +02:00
Joao Gante	2642d8d04b	Docs: add `kwargs` type to fix formatting (#24733 )	2023-07-11 16:21:29 +01:00
Yih-Dar	6ce6d62b6f	Explicit arguments in `from_pretrained` (#24306 ) * fix * fix * fix * fix * fix * fix * fix * fix * fix * fix * fix * fix * fix * fix * fix * fix * fix * fix * fix --------- Co-authored-by: ydshieh <ydshieh@users.noreply.github.com>	2023-06-21 19:24:11 +02:00
jiangmingyan	6cd34d451c	[fix] bug in BatchEncoding.__getitem__ (#24293 ) Co-authored-by: luchen <luchen@luchendeMBP.lan>	2023-06-15 12:33:37 +01:00
YangLiu	1e4a7737ed	Add support for non-rust implemented tokenization for `__getitem__` method. (#24039 ) * Add support for non-rust implemented tokenization for `__getitem__` method. * Update for error message on adding new sub-branch for `__item__` method. --------- Co-authored-by: liuyang17 <liuyang17@zhihu.com>	2023-06-07 12:29:19 +01:00
Sanchit Gandhi	8f915c450d	Unpin numba (#23162 ) * fix for ragged list * unpin numba * make style * np.object -> object * propagate changes to tokenizer as well * np.long -> "long" * revert tokenization changes * check with tokenization changes * list/tuple logic * catch numpy * catch else case * clean up * up * better check * trigger ci * Empty commit to trigger CI	2023-05-31 14:59:30 +01:00
zspo	003a0cf8cc	Fix some docs what layerdrop does (#23691 ) * Fix some docs what layerdrop does * Update src/transformers/models/data2vec/configuration_data2vec_audio.py Co-authored-by: Sylvain Gugger <35901082+sgugger@users.noreply.github.com> * Fix more docs --------- Co-authored-by: Sylvain Gugger <35901082+sgugger@users.noreply.github.com>	2023-05-23 14:50:40 -04:00
Sylvain Gugger	5f9b825c89	Use code on the Hub from another repo (#22814 ) * initial work * Add other classes * Refactor code * Move warning and fix dynamic pipeline * Issue warning when necessary * Add test * Do not skip auto tests * Fix failing tests * Refactor and address review comments * Address review comments	2023-04-18 13:46:11 -04:00
Sylvain Gugger	50caa20628	Revert "Use code on the Hub from another repo" (#22813 ) Revert "Use code on the Hub from another repo (#22698)" This reverts commit `ea7b0a539a`.	2023-04-17 14:22:13 -04:00
Sylvain Gugger	ea7b0a539a	Use code on the Hub from another repo (#22698 ) * initial work * Add other classes * Refactor code * Move warning and fix dynamic pipeline * Issue warning when necessary * Add test	2023-04-17 11:36:29 -04:00
Arthur	b1b3dc3e52	[tokenization] do not push special file (#22657 ) * do not push special file * Update src/transformers/tokenization_utils_base.py Co-authored-by: Sylvain Gugger <35901082+sgugger@users.noreply.github.com> --------- Co-authored-by: Sylvain Gugger <35901082+sgugger@users.noreply.github.com>	2023-04-07 20:12:36 +02:00
Arthur	8d9c3836be	Add clean_up_tokenization_spaces to config (#22341 ) * add draft changes * fix failing wav2vec * style * make sure that the argument is saved + add tests * style * fixup * update test * default clean_up_tokenization_spaces to False for Bloom and Llama * Update code based on review Co-authored-by: Nicolas Patry <patry.nicolas@gmail.com> * style * quality --------- Co-authored-by: Nicolas Patry <patry.nicolas@gmail.com>	2023-03-29 13:21:07 +02:00
Maria Khalusova	7bd8650512	Example of pad_to_multiple_of for padding and truncation guide & docstring update (#22278 ) * added an example of pad_to_multiple_of * make style * addressed feedback	2023-03-20 14:18:55 -04:00
Aaron Gokaslan	5e8c8eb5ba	Apply ruff flake8-comprehensions (#21694 )	2023-02-22 09:14:54 +01:00
Bruno Alvisio	7bac51837b	Pass parent exception as context exception to provide clearer stack trace (#21636 ) * Pass parent exception as context exception to provide clearer stack trace * Update src/transformers/tokenization_utils_base.py Co-authored-by: Sylvain Gugger <35901082+sgugger@users.noreply.github.com> --------- Co-authored-by: Sylvain Gugger <35901082+sgugger@users.noreply.github.com>	2023-02-15 11:34:02 -05:00
Stas Bekman	b9af152efb	[tokenizer] sanitize saved config (#21483 ) * [tokenizer] sanitize saved config * rm config["name_or_path"] test	2023-02-07 10:51:45 -08:00
Sylvain Gugger	6f79d26442	Update quality tooling for formatting (#21480 ) * Result of black 23.1 * Update target to Python 3.7 * Switch flake8 to ruff * Configure isort * Configure isort * Apply isort with line limit * Put the right black version * adapt black in check copies * Fix copies	2023-02-06 18:10:56 -05:00
Lucain	8f3b4a1d5b	Little cleanup: let huggingface_hub manage token retrieval (#21333 ) * Let huggingface_hub manage token retrieval * flake8 * code quality * adapt in every PushToHubMixin children * add explicit return type	2023-01-27 12:09:49 -05:00
Thomas Wang	7419d807ff	Declare __len__ method in PreTrainedTokenizerBase (#21210 )	2023-01-20 15:54:33 +01:00
Matthijs Hollemans	9b468a7cd7	workaround documentation rendering bug (#21189 )	2023-01-19 07:50:59 -05:00
Arthur	95f0dd2123	[Tokenizers] Fix a small typo (#21104 ) * typo * change name in `__repr__` * fix my mistake	2023-01-13 16:21:34 +01:00
amyeroberts	7b23a582b9	Replaces xxx_required with requires_backends (#20715 ) * Replaces xxx_required with requires_backends * Fixup	2022-12-14 14:38:44 +00:00
Sylvain Gugger	a450789d9a	Disambiguate test for required_input in tokenization base file. (#20731 ) * Disambiguate test for required_input in tokenization base file. * Add test for size	2022-12-12 13:13:09 -05:00
xxyzz	b9a0ede6ab	Check if docstring is None before formatting it (#20592 ) docstrings could be `None` if Python optimize level is set to 2.	2022-12-06 07:44:17 -05:00
Yih-Dar	293991d44b	Make `add_special_tokens` more clear (#20424 ) * make add_special_tokens more clear Co-authored-by: ydshieh <ydshieh@users.noreply.github.com>	2022-11-30 12:56:32 +01:00
SaulLu	3c39c07f11	fix `word_to_tokens` docstring format (#20450 ) * fix docstring * fix 2 * add details	2022-11-25 20:28:00 +01:00
Yih-Dar	9a5b84a007	Use updated `model_max_length` when saving tokenizers (#20401 ) * Use updated values Co-authored-by: ydshieh <ydshieh@users.noreply.github.com>	2022-11-23 18:16:26 +01:00
Sylvain Gugger	c4a997cd85	Use None to detect if truncation was unset (#19794 ) * Use None to detect if truncation was unset * Fix repo consistency	2022-10-21 12:53:37 -04:00
Sylvain Gugger	9151e649a5	Make public versions of private tensor utils (#19775 ) * Make public versions of private utils * I need sleep	2022-10-21 09:34:01 -04:00
Sylvain Gugger	3e4900208a	Tokenizer from_pretrained should not use local files named like tokenizer files (#19626 )	2022-10-14 14:06:56 -04:00
Sylvain Gugger	ca485e562b	Add tests for legacy load by url and fix bugs (#19078 )	2022-09-16 23:20:02 +02:00
Sylvain Gugger	9017ba4ca4	Fix tokenizer load from one file (#19073 ) * Fix tokenizer load from one file * Add a test * Style Co-authored-by: Lysandre <lysandre.debut@reseau.eseo.fr>	2022-09-16 16:11:47 -04:00
Sylvain Gugger	f89f16a51e	Re-add support for single url files in objects download (#19014 )	2022-09-13 13:11:24 -04:00
SaulLu	6667b0d7bf	add warning to let the user know that the `__call__` method is faster than `encode` + `pad` for a fast tokenizer (#18693 ) * add warning to let the user know that the method is slower that for a fast tokenizer * user warnings * fix layoutlmv2 * fix layout* * change warnings into logger.warning	2022-08-24 06:27:56 -04:00
SaulLu	438698085c	improve `add_tokens` docstring (#18687 ) * improve add_tokens documentation * format	2022-08-23 07:23:51 -04:00
Sylvain Gugger	0d0aada564	Use commit hash to look in cache instead of calling head (#18534 ) * Use commit hash to look in cache instead of calling head * Add tests * Add attr for local configs too * Stupid typos * Fix tests * Update src/transformers/utils/hub.py Co-authored-by: Julien Chaumond <julien@huggingface.co> * Address Julien's comments Co-authored-by: Julien Chaumond <julien@huggingface.co>	2022-08-10 11:55:18 -04:00
Sylvain Gugger	aff5117f46	Remove debug statement	2022-08-08 09:54:10 -04:00
Julien Chaumond	9129fd0377	`transformers-cli login` => `huggingface-cli login` (#18490 ) * zero chance anyone's using that constant no? * `transformers-cli login` => `huggingface-cli login` * `transformers-cli repo create` => `huggingface-cli repo create` * `make style`	2022-08-06 09:42:55 +02:00
Sylvain Gugger	5cd4032368	Use new huggingface_hub tools for download models (#18438 ) * Draft new cached_file * Initial draft for config and model * Small fixes * Fix first batch of tests * Look in cache when internet is down * Fix last tests * Bad black, not fixing all quality errors * Make diff less * Implement change for TF and Flax models * Add tokenizer and feature extractor * For compatibility with main * Add utils to move the cache and auto-do it at first use. * Quality * Deal with empty commit shas * Deal with empty etag * Address review comments	2022-08-05 10:12:40 -04:00
Sylvain Gugger	01db72abd4	Rewrite push_to_hub to use upload_files (#18366 ) * Rewrite push_to_hub to use upload_files * Adapt the doc a bit * Address review comments and clean doc	2022-08-01 12:07:30 -04:00
YouJiacheng	1cd7c6f154	Fix from_pretrained kwargs passing (#18387 ) Fix #18385 I don't know whether `use_auth_token`, `cache_dir` and `local_files_only` should be passed to `(cls.slow_tokenizer_class)._from_pretrained`, but I guess it should.	2022-08-01 08:16:24 -04:00
Sylvain Gugger	986526a0e4	Replace `as_target` context managers by direct calls (#18325 ) * Preliminary work on tokenizers * Quality + fix tests * Treat processors * Fix pad * Remove all uses of in tests, docs and examples * Replace all as_target_tokenizer * Fix tests * Fix quality * Update examples/flax/image-captioning/run_image_captioning_flax.py Co-authored-by: amyeroberts <amy@huggingface.co> * Style Co-authored-by: amyeroberts <amy@huggingface.co>	2022-07-29 08:09:09 -04:00
Sebastian Sosa	5e2f2d7dd2	Better messaging and fix for incorrect shape when collating data. (#18119 ) * More informative error message * raise dynamic error * remove_excess_nesting application * incorrect shape assertion for collator & function to remove excess nesting from DatasetDict * formatting * eliminating datasets import * removed and relocated remove_excess_nesting to the datasets library and updated docs accordingly * independent assert instructions * inform user of excess nesting	2022-07-21 10:35:41 +02:00
Guillaume Klein	3eed5530ec	Fix properties of unset special tokens in non verbose mode (#17797 ) Co-authored-by: SaulLu <55560583+SaulLu@users.noreply.github.com>	2022-06-23 14:40:13 +02:00
SaulLu	b2fdbaccdd	change message (#17836 )	2022-06-23 14:39:48 +02:00

203 Commits