HuggingFace_transformer

Author	SHA1	Message	Date
Sanchit Gandhi	8f915c450d	Unpin numba (#23162 ) * fix for ragged list * unpin numba * make style * np.object -> object * propagate changes to tokenizer as well * np.long -> "long" * revert tokenization changes * check with tokenization changes * list/tuple logic * catch numpy * catch else case * clean up * up * better check * trigger ci * Empty commit to trigger CI	2023-05-31 14:59:30 +01:00
zspo	003a0cf8cc	Fix some docs what layerdrop does (#23691 ) * Fix some docs what layerdrop does * Update src/transformers/models/data2vec/configuration_data2vec_audio.py Co-authored-by: Sylvain Gugger <35901082+sgugger@users.noreply.github.com> * Fix more docs --------- Co-authored-by: Sylvain Gugger <35901082+sgugger@users.noreply.github.com>	2023-05-23 14:50:40 -04:00
Sylvain Gugger	5f9b825c89	Use code on the Hub from another repo (#22814 ) * initial work * Add other classes * Refactor code * Move warning and fix dynamic pipeline * Issue warning when necessary * Add test * Do not skip auto tests * Fix failing tests * Refactor and address review comments * Address review comments	2023-04-18 13:46:11 -04:00
Sylvain Gugger	50caa20628	Revert "Use code on the Hub from another repo" (#22813 ) Revert "Use code on the Hub from another repo (#22698)" This reverts commit `ea7b0a539a`.	2023-04-17 14:22:13 -04:00
Sylvain Gugger	ea7b0a539a	Use code on the Hub from another repo (#22698 ) * initial work * Add other classes * Refactor code * Move warning and fix dynamic pipeline * Issue warning when necessary * Add test	2023-04-17 11:36:29 -04:00
Arthur	b1b3dc3e52	[tokenization] do not push special file (#22657 ) * do not push special file * Update src/transformers/tokenization_utils_base.py Co-authored-by: Sylvain Gugger <35901082+sgugger@users.noreply.github.com> --------- Co-authored-by: Sylvain Gugger <35901082+sgugger@users.noreply.github.com>	2023-04-07 20:12:36 +02:00
Arthur	8d9c3836be	Add clean_up_tokenization_spaces to config (#22341 ) * add draft changes * fix failing wav2vec * style * make sure that the argument is saved + add tests * style * fixup * update test * default clean_up_tokenization_spaces to False for Bloom and Llama * Update code based on review Co-authored-by: Nicolas Patry <patry.nicolas@gmail.com> * style * quality --------- Co-authored-by: Nicolas Patry <patry.nicolas@gmail.com>	2023-03-29 13:21:07 +02:00
Maria Khalusova	7bd8650512	Example of pad_to_multiple_of for padding and truncation guide & docstring update (#22278 ) * added an example of pad_to_multiple_of * make style * addressed feedback	2023-03-20 14:18:55 -04:00
Aaron Gokaslan	5e8c8eb5ba	Apply ruff flake8-comprehensions (#21694 )	2023-02-22 09:14:54 +01:00
Bruno Alvisio	7bac51837b	Pass parent exception as context exception to provide clearer stack trace (#21636 ) * Pass parent exception as context exception to provide clearer stack trace * Update src/transformers/tokenization_utils_base.py Co-authored-by: Sylvain Gugger <35901082+sgugger@users.noreply.github.com> --------- Co-authored-by: Sylvain Gugger <35901082+sgugger@users.noreply.github.com>	2023-02-15 11:34:02 -05:00
Stas Bekman	b9af152efb	[tokenizer] sanitize saved config (#21483 ) * [tokenizer] sanitize saved config * rm config["name_or_path"] test	2023-02-07 10:51:45 -08:00
Sylvain Gugger	6f79d26442	Update quality tooling for formatting (#21480 ) * Result of black 23.1 * Update target to Python 3.7 * Switch flake8 to ruff * Configure isort * Configure isort * Apply isort with line limit * Put the right black version * adapt black in check copies * Fix copies	2023-02-06 18:10:56 -05:00
Lucain	8f3b4a1d5b	Little cleanup: let huggingface_hub manage token retrieval (#21333 ) * Let huggingface_hub manage token retrieval * flake8 * code quality * adapt in every PushToHubMixin children * add explicit return type	2023-01-27 12:09:49 -05:00
Thomas Wang	7419d807ff	Declare __len__ method in PreTrainedTokenizerBase (#21210 )	2023-01-20 15:54:33 +01:00
Matthijs Hollemans	9b468a7cd7	workaround documentation rendering bug (#21189 )	2023-01-19 07:50:59 -05:00
Arthur	95f0dd2123	[Tokenizers] Fix a small typo (#21104 ) * typo * change name in `__repr__` * fix my mistake	2023-01-13 16:21:34 +01:00
amyeroberts	7b23a582b9	Replaces xxx_required with requires_backends (#20715 ) * Replaces xxx_required with requires_backends * Fixup	2022-12-14 14:38:44 +00:00
Sylvain Gugger	a450789d9a	Disambiguate test for required_input in tokenization base file. (#20731 ) * Disambiguate test for required_input in tokenization base file. * Add test for size	2022-12-12 13:13:09 -05:00
xxyzz	b9a0ede6ab	Check if docstring is None before formating it (#20592 ) docstrings could be `None` if Python optimize level is set to 2.	2022-12-06 07:44:17 -05:00
Yih-Dar	293991d44b	Make `add_special_tokens` more clear (#20424 ) * make add_special_tokens more clear Co-authored-by: ydshieh <ydshieh@users.noreply.github.com>	2022-11-30 12:56:32 +01:00
SaulLu	3c39c07f11	fix `word_to_tokens` docstring format (#20450 ) * fix docstring * fix 2 * add details	2022-11-25 20:28:00 +01:00
Yih-Dar	9a5b84a007	Use updated `model_max_length` when saving tokenizers (#20401 ) * Use updated values Co-authored-by: ydshieh <ydshieh@users.noreply.github.com>	2022-11-23 18:16:26 +01:00
Sylvain Gugger	c4a997cd85	Use None to detect if truncation was unset (#19794 ) * Use None to detect if truncation was unset * Fix repo consistency	2022-10-21 12:53:37 -04:00
Sylvain Gugger	9151e649a5	Make public versions of private tensor utils (#19775 ) * Make public versions of private utils * I need sleep	2022-10-21 09:34:01 -04:00
Sylvain Gugger	3e4900208a	Tokenizer from_pretrained should not use local files named like tokenizer files (#19626 )	2022-10-14 14:06:56 -04:00
Sylvain Gugger	ca485e562b	Add tests for legacy load by url and fix bugs (#19078 )	2022-09-16 23:20:02 +02:00
Sylvain Gugger	9017ba4ca4	Fix tokenizer load from one file (#19073 ) * Fix tokenizer load from one file * Add a test * Style Co-authored-by: Lysandre <lysandre.debut@reseau.eseo.fr>	2022-09-16 16:11:47 -04:00
Sylvain Gugger	f89f16a51e	Re-add support for single url files in objects download (#19014 )	2022-09-13 13:11:24 -04:00
SaulLu	6667b0d7bf	add warning to let the user know that the `__call__` method is faster than `encode` + `pad` for a fast tokenizer (#18693 ) * add warning to let the user know that the method is slower that for a fast tokenizer * user warnings * fix layoutlmv2 * fix layout* * change warnings into logger.warning	2022-08-24 06:27:56 -04:00
SaulLu	438698085c	improve `add_tokens` docstring (#18687 ) * improve add_tokens documentation * format	2022-08-23 07:23:51 -04:00
Sylvain Gugger	0d0aada564	Use commit hash to look in cache instead of calling head (#18534 ) * Use commit hash to look in cache instead of calling head * Add tests * Add attr for local configs too * Stupid typos * Fix tests * Update src/transformers/utils/hub.py Co-authored-by: Julien Chaumond <julien@huggingface.co> * Address Julien's comments Co-authored-by: Julien Chaumond <julien@huggingface.co>	2022-08-10 11:55:18 -04:00
Sylvain Gugger	aff5117f46	Remove debug statement	2022-08-08 09:54:10 -04:00
Julien Chaumond	9129fd0377	`transformers-cli login` => `huggingface-cli login` (#18490 ) * zero chance anyone's using that constant no? * `transformers-cli login` => `huggingface-cli login` * `transformers-cli repo create` => `huggingface-cli repo create` * `make style`	2022-08-06 09:42:55 +02:00
Sylvain Gugger	5cd4032368	Use new huggingface_hub tools for download models (#18438 ) * Draft new cached_file * Initial draft for config and model * Small fixes * Fix first batch of tests * Look in cache when internet is down * Fix last tests * Bad black, not fixing all quality errors * Make diff less * Implement change for TF and Flax models * Add tokenizer and feature extractor * For compatibility with main * Add utils to move the cache and auto-do it at first use. * Quality * Deal with empty commit shas * Deal with empty etag * Address review comments	2022-08-05 10:12:40 -04:00
Sylvain Gugger	01db72abd4	Rewrite push_to_hub to use upload_files (#18366 ) * Rewrite push_to_hub to use upload_files * Adapt the doc a bit * Address review comments and clean doc	2022-08-01 12:07:30 -04:00
YouJiacheng	1cd7c6f154	Fix from_pretrained kwargs passing (#18387 ) Fix #18385 I don't know whether `use_auth_token`, `cache_dir` and `local_files_only` should be passed to `(cls.slow_tokenizer_class)._from_pretrained`, but I guess it should.	2022-08-01 08:16:24 -04:00
Sylvain Gugger	986526a0e4	Replace `as_target` context managers by direct calls (#18325 ) * Preliminary work on tokenizers * Quality + fix tests * Treat processors * Fix pad * Remove all uses of in tests, docs and examples * Replace all as_target_tokenizer * Fix tests * Fix quality * Update examples/flax/image-captioning/run_image_captioning_flax.py Co-authored-by: amyeroberts <amy@huggingface.co> * Style Co-authored-by: amyeroberts <amy@huggingface.co>	2022-07-29 08:09:09 -04:00
Sebastian Sosa	5e2f2d7dd2	Better messaging and fix for incorrect shape when collating data. (#18119 ) * More informative error message * raise dynamic error * remove_excess_nesting application * incorrect shape assertion for collator & function to remove excess nesting from DatasetDict * formatting * eliminating datasets import * removed and relocated remove_excess_nesting to the datasets library and updated docs accordingly * independent assert instructions * inform user of excess nesting	2022-07-21 10:35:41 +02:00
Guillaume Klein	3eed5530ec	Fix properties of unset special tokens in non verbose mode (#17797 ) Co-authored-by: SaulLu <55560583+SaulLu@users.noreply.github.com>	2022-06-23 14:40:13 +02:00
SaulLu	b2fdbaccdd	change message (#17836 )	2022-06-23 14:39:48 +02:00
Patrick von Platen	f394a2a50d	[Json configs] Make json prettier for all saved tokenizer files & ensure same json format for all processors (tok + feat_extract) (#17457 ) * [Json dump] Make json prettier * correct more tokenizeirs * more patterns * add aggressive test * the aggressive test was actually useful :-) * more tests * Apply suggestions from code review	2022-05-31 17:07:30 +02:00
Vít Novotný	6ee1474b67	Accumulate tokens into batches in `PreTrainedTokenizerBase.add_tokens()` (#17119 ) * Accumulate tokens into batches in PreTrainedTokenizerBase.add_tokens() For tokenizers with a small number of special tokens or special tokens with consecutive token IDs, this reduces the time complexity of creating the trie from quadratic to linear, see also #16936. * Extend explanation of batching added tokens	2022-05-31 16:36:45 +02:00
Sylvain Gugger	afe5d42d8d	Black preview (#17217 ) * Black preview * Fixup too! * Fix check copies * Use the same version as the CI * Bump black	2022-05-12 16:25:55 -04:00
Patrick von Platen	31616b8d61	[T5 Tokenizer] Model has no fixed position ids - there is no hardcode… (#16990 ) * [T5 Tokenizer] Model has no fixed position ids - there is no hardcoded max length * [T5 Tokenizer] Model has no fixed position ids - there is no hardcoded max length * correct t5 tokenizer * correct t5 tokenizer * fix test * Apply suggestions from code review Co-authored-by: Sylvain Gugger <35901082+sgugger@users.noreply.github.com> * finish Co-authored-by: Sylvain Gugger <35901082+sgugger@users.noreply.github.com>	2022-05-02 21:27:34 +02:00
Sylvain Gugger	18df440709	Replace dict/BatchEncoding instance checks by Mapping (#17014 ) * Replace dict/BatchEncoding instance checks by Mapping * Typo	2022-04-29 17:20:52 -04:00
ghlai9665	daf520b033	tiny tweak to allow BatchEncoding.token_to_char when token doesn't correspond to chars (#15901 ) * tweak to allow BatchEncoding.char_to_token(0) * update docstring * remote trailing whitespace * make fixup * make value checking for span_indices explicit Co-authored-by: Sylvain Gugger <35901082+sgugger@users.noreply.github.com> Co-authored-by: Sylvain Gugger <35901082+sgugger@users.noreply.github.com>	2022-04-21 08:07:54 -04:00
davidleonfdez	9f8bfe703c	Fix #16660 (tokenizers setters of ids of special tokens) (#16661 ) * Fix setters of _token_id properties of SpecialTokensMixin Test setters of common tokens ids * Move to a separate test checks of setters of tokens ids * Add independent test for ByT5 * Add Canine test * Test speech to text	2022-04-13 07:49:06 -04:00
Jia	342ff6eb41	Update comments in class BatchEncoding (#15932 )	2022-03-28 05:19:12 -04:00
Sylvain Gugger	088c1880b7	Big file_utils cleanup (#16396 ) * Big file_utils cleanup * This one still needs to be treated separately	2022-03-25 07:25:20 -04:00
Sylvain Gugger	c595b6e6a9	Make Transformers use cache files when hf.co is down (#16362 ) * Make Transformers use cache files when hf.co is down * Fix tests * Was there a random circleCI failure? * Isolate patches * Style * Comment out the failure since it doesn't fail anymore * Better comment	2022-03-23 15:56:49 -04:00

193 Commits