HuggingFace_transformer

Author	SHA1	Message	Date
Funtowicz Morgan	135791e8ef	Add pad_to_multiple_of on tokenizers (reimport) (#5054) * Add new parameter `pad_to_multiple_of` on tokenizers. * unittest for pad_to_multiple_of * Add .name when logging enum. * Fix missing .items() on dict in tests. * Add special check + warning if the tokenizer doesn't have proper pad_token. * Use the correct logger format specifier. * Ensure tokenizer with no pad_token do not modify the underlying padding strategy. * Skip test if tokenizer doesn't have pad_token * Fix RobertaTokenizer on empty input * Format. Signed-off-by: Morgan Funtowicz <funtowiczmo@gmail.com> * fix and updating to simpler API Co-authored-by: Thomas Wolf <thomwolf@users.noreply.github.com>	2020-06-26 11:55:57 +02:00
Thomas Wolf	315f464b0a	[tokenizers] Several small improvements and bug fixes (#5287) * avoid recursion in id checks for fast tokenizers * better typings and fix #5232 * align slow and fast tokenizers behaviors for Roberta and GPT2 * style and quality * fix tests - improve typings	2020-06-25 22:17:14 +02:00
Thomas Wolf	27cf1d97f0	[Tokenization] Fix #5181 - make #5155 more explicit - move back the default logging level in tests to WARNING (#5252) * fix-5181 Padding to max sequence length while truncation to another length was wrong on slow tokenizers * clean up and fix #5155 * fix XLM test * Fix tests for Transfo-XL * logging only above WARNING in tests * switch slow tokenizers tests in @slow * fix Marian truncation tokenization test * style and quality * make the test a lot faster by limiting the sequence length used in tests	2020-06-25 17:24:28 +02:00
Thomas Wolf	7ac9110711	Add more tests on tokenizers serialization - fix bugs (#5056) * update tests for fast tokenizers + fix small bug in saving/loading * better tests on serialization * fixing serialization * comment cleanup	2020-06-24 21:53:08 +02:00
Thomas Wolf	11fdde0271	Tokenizers API developments (#5103) * Add return lengths * make pad a bit more flexible so it can be used as collate_fn * check all kwargs sent to encoding method are known * fixing kwargs in encodings * New AddedToken class in python This class lets you specify specific tokenization behaviors for some special tokens. Used in particular for GPT2 and Roberta, to control how white spaces are stripped around special tokens. * style and quality * switched to huggingface tokenizers library for AddedTokens * up to tokenizer 0.8.0-rc3 - update API to use AddedToken state * style and quality * do not raise an error on additional or unused kwargs for tokenize() but only a warning * transfo-xl pretrained model requires torch * Update src/transformers/tokenization_utils.py Co-authored-by: Lysandre Debut <lysandre@huggingface.co>	2020-06-23 13:36:57 +02:00
Thomas Wolf	ebc36108dc	[tokenizers] Fix #5081 and improve backward compatibility (#5125) * fix #5081 and improve backward compatibility (slightly) * add nlp to setup.cfg - style and quality * align default to previous default * remove test that doesn't generalize	2020-06-22 17:25:43 +02:00
Sylvain Gugger	011cc0be51	Fix all Sphinx warnings (#5068)	2020-06-16 16:50:02 -04:00
Funtowicz Morgan	9e03364999	Ability to pickle/unpickle BatchEncoding (reimport) (#5039) * Added is_fast property on BatchEncoding to indicate if the object comes from a Fast Tokenizer. * Added __get_state__() & __set_state__() to be picklable. * Correct tokens() return type from List[int] to List[str] * Added unittest for BatchEncoding pickle/unpickle * Added unittest for BatchEncoding is_fast * More careful checking on BatchEncoding unpickle tests. * Formatting. * is_fast should assertTrue on Rust tokenizers. * Ensure tensorflow has correct way of checking array_equal * More formatting.	2020-06-16 09:25:25 +02:00
Anthony MOI	36434220fc	[HUGE] Refactoring tokenizers backend - padding - truncation - pre-tokenized pipeline - fast tokenizers - tests (#4510) * Use tokenizers pre-tokenized pipeline * failing pretokenized test * Fix is_pretokenized in python * add pretokenized tests * style and quality * better tests for batched pretokenized inputs * tokenizers clean up - new padding_strategy - split the files * [HUGE] refactoring tokenizers - padding - truncation - tests * style and quality * bump up required tokenizers version to 0.8.0-rc1 * switched padding/truncation API - simpler better backward compat * updating tests for custom tokenizers * style and quality - tests on pad * fix QA pipeline * fix backward compatibility for max_length only * style and quality * Various clean-ups - add verbose * fix tests * update docstrings * Fix tests * Docs reformatted * __call__ method documented Co-authored-by: Thomas Wolf <thomwolf@users.noreply.github.com> Co-authored-by: Lysandre <lysandre.debut@reseau.eseo.fr>	2020-06-15 17:12:51 -04:00

9 Commits