* splitting fast and slow tokenizers [WIP]
* [WIP] splitting sentencepiece and tokenizers dependencies
* update dummy objects
* add name_or_path to models and tokenizers
* prefix added to file names
* prefix
* styling + quality
* spliting all the tokenizer files - sorting sentencepiece based ones
* update tokenizer version up to 0.9.0
* remove hard dependency on sentencepiece 🎉
* and removed hard dependency on tokenizers 🎉
* update conversion script
* update missing models
* fixing tests
* move test_tokenization_fast to main tokenization tests - fix bugs
* bump up tokenizers
* fix bert_generation
* update ad fix several tokenizers
* keep sentencepiece in deps for now
* fix funnel and deberta tests
* fix fsmt
* fix marian tests
* fix layoutlm
* fix squeezebert and gpt2
* fix T5 tokenization
* fix xlnet tests
* style
* fix mbart
* bump up tokenizers to 0.9.2
* fix model tests
* fix tf models
* fix seq2seq examples
* fix tests without sentencepiece
* fix slow => fast conversion without sentencepiece
* update auto and bert generation tests
* fix mbart tests
* fix auto and common test without tokenizers
* fix tests without tokenizers
* clean up tests lighten up when tokenizers + sentencepiece are both off
* style quality and tests fixing
* add sentencepiece to doc/examples reqs
* leave sentencepiece on for now
* style quality split hebert and fix pegasus
* WIP Herbert fast
* add sample_text_no_unicode and fix hebert tokenization
* skip FSMT example test for now
* fix style
* fix fsmt in example tests
* update following Lysandre and Sylvain's comments
* Update src/transformers/testing_utils.py
Co-authored-by: Sylvain Gugger <35901082+sgugger@users.noreply.github.com>
* Update src/transformers/testing_utils.py
Co-authored-by: Sylvain Gugger <35901082+sgugger@users.noreply.github.com>
* Update src/transformers/tokenization_utils_base.py
Co-authored-by: Sylvain Gugger <35901082+sgugger@users.noreply.github.com>
* Update src/transformers/tokenization_utils_base.py
Co-authored-by: Sylvain Gugger <35901082+sgugger@users.noreply.github.com>
Co-authored-by: Sylvain Gugger <35901082+sgugger@users.noreply.github.com>
* [WIP] SP tokenizers
* fixing tests for T5
* WIP tokenizers
* serialization
* update T5
* WIP T5 tokenization
* slow to fast conversion script
* Refactoring to move tokenzier implementations inside transformers
* Adding gpt - refactoring - quality
* WIP adding several tokenizers to the fast world
* WIP Roberta - moving implementations
* update to dev4 switch file loading to in-memory loading
* Updating and fixing
* advancing on the tokenizers - updating do_lower_case
* style and quality
* moving forward with tokenizers conversion and tests
* MBart, T5
* dumping the fast version of transformer XL
* Adding to autotokenizers + style/quality
* update init and space_between_special_tokens
* style and quality
* bump up tokenizers version
* add protobuf
* fix pickle Bert JP with Mecab
* fix newly added tokenizers
* style and quality
* fix bert japanese
* fix funnel
* limite tokenizer warning to one occurence
* clean up file
* fix new tokenizers
* fast tokenizers deep tests
* WIP adding all the special fast tests on the new fast tokenizers
* quick fix
* adding more fast tokenizers in the fast tests
* all tokenizers in fast version tested
* Adding BertGenerationFast
* bump up setup.py for CI
* remove BertGenerationFast (too early)
* bump up tokenizers version
* Clean old docstrings
* Typo
* Update following Lysandre comments
Co-authored-by: Sylvain Gugger <sylvain.gugger@gmail.com>
* Clean up model documentation
* Formatting
* Preparation work
* Long lines
* Main work on rst files
* Cleanup all config files
* Syntax fix
* Clean all tokenizers
* Work on first models
* Models beginning
* FaluBERT
* All PyTorch models
* All models
* Long lines again
* Fixes
* More fixes
* Update docs/source/model_doc/bert.rst
Co-authored-by: Lysandre Debut <lysandre@huggingface.co>
* Update docs/source/model_doc/electra.rst
Co-authored-by: Lysandre Debut <lysandre@huggingface.co>
* Last fixes
Co-authored-by: Lysandre Debut <lysandre@huggingface.co>
* Add Tuna Mirror for Downloads from China
* format fix
* Use preset instead of hardcoding URL
* Fix
* make style
* update the mirror option doc
* update the mirror
* allow using tokenizer.pad as a collate_fn in pytorch
* allow using tokenizer.pad as a collate_fn in pytorch
* Add documentation and tests
* Make attention mask the right shape
* Better test
Co-authored-by: Thomas Wolf <thomwolf@users.noreply.github.com>
* Exposing prepare_for_model for both slow & fast tokenizers
* Update method signature
* The traditional style commit
* Hide the warnings behind the verbose flag
* update default truncation strategy and prepare_for_model
* fix tests and prepare_for_models methods
Co-authored-by: Thomas Wolf <thomwolf@users.noreply.github.com>
* remove references to old API in docstring - update data processors
* style
* fix tests - better type checking error messages
* better type checking
* include awesome fix by @LysandreJik for #5310
* updated doc and examples
* Add new parameter `pad_to_multiple_of` on tokenizers.
* unittest for pad_to_multiple_of
* Add .name when logging enum.
* Fix missing .items() on dict in tests.
* Add special check + warning if the tokenizer doesn't have proper pad_token.
* Use the correct logger format specifier.
* Ensure tokenizer with no pad_token do not modify the underlying padding strategy.
* Skip test if tokenizer doesn't have pad_token
* Fix RobertaTokenizer on empty input
* Format.
Signed-off-by: Morgan Funtowicz <funtowiczmo@gmail.com>
* fix and updating to simpler API
Co-authored-by: Thomas Wolf <thomwolf@users.noreply.github.com>
* avoid recursion in id checks for fast tokenizers
* better typings and fix#5232
* align slow and fast tokenizers behaviors for Roberta and GPT2
* style and quality
* fix tests - improve typings
* fix-5181
Padding to max sequence length while truncation to another length was wrong on slow tokenizers
* clean up and fix#5155
* fix XLM test
* Fix tests for Transfo-XL
* logging only above WARNING in tests
* switch slow tokenizers tests in @slow
* fix Marian truncation tokenization test
* style and quality
* make the test a lot faster by limiting the sequence length used in tests
* Add return lengths
* make pad a bit more flexible so it can be used as collate_fn
* check all kwargs sent to encoding method are known
* fixing kwargs in encodings
* New AddedToken class in python
This class let you specify specifique tokenization behaviors for some special tokens. Used in particular for GPT2 and Roberta, to control how white spaces are stripped around special tokens.
* style and quality
* switched to hugginface tokenizers library for AddedTokens
* up to tokenizer 0.8.0-rc3 - update API to use AddedToken state
* style and quality
* do not raise an error on additional or unused kwargs for tokenize() but only a warning
* transfo-xl pretrained model requires torch
* Update src/transformers/tokenization_utils.py
Co-authored-by: Lysandre Debut <lysandre@huggingface.co>
Co-authored-by: Lysandre Debut <lysandre@huggingface.co>
* fix#5081 and improve backward compatibility (slightly)
* add nlp to setup.cfg - style and quality
* align default to previous default
* remove test that doesn't generalize
* Added is_fast property on BatchEncoding to indicate if the object comes from a Fast Tokenizer.
* Added __get_state__() & __set_state__() to be pickable.
* Correct tokens() return type from List[int] to List[str]
* Added unittest for BatchEncoding pickle/unpickle
* Added unittest for BatchEncoding is_fast
* More careful checking on BatchEncoding unpickle tests.
* Formatting.
* is_fast should assertTrue on Rust tokenizers.
* Ensure tensorflow has correct way of checking array_equal
* More formatting.