* Accumulate tokens into batches in PreTrainedTokenizerBase.add_tokens()
For tokenizers with a small number of special tokens or special tokens
with consecutive token IDs, this reduces the time complexity of creating
the trie from quadratic to linear, see also #16936.
* Extend explanation of batching added tokens
* [T5 Tokenizer] Model has no fixed position ids - there is no hardcoded max length
* [T5 Tokenizer] Model has no fixed position ids - there is no hardcoded max length
* correct t5 tokenizer
* correct t5 tokenizer
* fix test
* Apply suggestions from code review
Co-authored-by: Sylvain Gugger <35901082+sgugger@users.noreply.github.com>
* finish
Co-authored-by: Sylvain Gugger <35901082+sgugger@users.noreply.github.com>
* Fix setters of *_token_id properties of SpecialTokensMixin
* Test setters of common tokens ids
* Move to a separate test checks of setters of tokens ids
* Add independent test for ByT5
* Add Canine test
* Test speech to text
* Make Transformers use cache files when hf.co is down
* Fix tests
* Was there a random circleCI failure?
* Isolate patches
* Style
* Comment out the failure since it doesn't fail anymore
* Better comment
* Split file_utils in several submodules
* Fixes
* Add back more objects
* More fixes
* Who exactly decided to import that from there?
* Second suggestion to code with code review
* Revert wront move
* Fix imports
* Adapt all imports
* Adapt all imports everywhere
* Revert this import, will fix in a separate commit
* add new test
* update test
* remove `tokenizer_file` from `additional_files_names` in `tokenization_utils_base.py`
* add `tokenizer_file` for the fast only tokenizer
* change global variables layoutxml
* remove `"tokenizer_file"` from DPR tokenizer's Global variables
* remove `tokenizer_file` from herbert slow tokenizer init
* `"tokenizer_file"` from LED tokenizer's Global variables
* remove `tokenizer_file` from mbart slow tokenizer init
* remove `tokenizer_file` from slow tokenizer template
* adapt to versioning
* adapt the `test_tokenizer_mismatch_warning` test
* clean test
* clarify `VOCAB_FILES_NAMES` in tokenization_utils_fast.py
* Revert "remove `tokenizer_file` from mbart slow tokenizer init"
This reverts commit 0dbb723fa9c7599d4640fe30b3647a74eb4a64e1.
* Revert "`"tokenizer_file"` from LED tokenizer's Global variables"
This reverts commit 5a3f879bdd651233f3d74a3d1146c34cde82b0c2.
* Revert "remove `tokenizer_file` from herbert slow tokenizer init"
This reverts commit f5e10007b7b0ec5345e015b9de7ffec72c5407fd.
* Revert "remove `"tokenizer_file"` from DPR tokenizer's Global variables"
This reverts commit da0895330bedfafc81ae3073470a9348c669f032.
* set `tokenizer_file` in super `__init__` of mbart
* replace assert with exception for `padding_side` arg in `PreTrainedTokenizerBase` `__init__`
* add test
* fix kwargs
* reformat test
* format
* format
* fix typo to render the documentation
* Avoid using get_list_of_files in config
* Wip, change tokenizer file getter
* Remove call in tokenizer files
* Remove last call to get_list_model_files
* Better tests
* Unit tests for new function
* Document bad API
* Refine errors for pretrained objects
* PoC to avoid using get_list_of_files
* Adapt tests to use new errors
* Quality + Fix PoC
* Revert "PoC to avoid using get_list_of_files"
This reverts commit cb93b7cae8504ef837c2a7663cb7955e714f323e.
* Revert "Quality + Fix PoC"
This reverts commit 3ba6d0d4ca546708b31d355baa9e68ba9736508f.
* Fix doc
* Revert PoC
* Add feature extractors
* More tests and PT model
* Adapt error message
* Feature extractor tests
* TF model
* Flax model and test
* Merge flax auto tests
* Add tokenization
* Fix test
* [AutoProcessor] Correct AutoProcessor and automatically add processor class
* up
* up
* up
* up
* up
* up
* up
* up
* continue tomorrow
* up
* up
* up
* make processor class private
* fix loop
* Fix bad examples
* Add black formatting to style_doc
* Use first nonempty line
* Put it at the right place
* Don't add spaces to empty lines
* Better templates
* Deal with triple quotes in docstrings
* Result of style_doc
* Enable mdx treatment and fix code examples in MDXs
* Result of doc styler on doc source files
* Last fixes
* Break copy from
* New doc styler
* Fix issue with args at the start
* Code sample fixes
* Style code examples in MDX
* Fix more patterns
* Typo
* Typo
* More patterns
* Do without black for now
* Get more info in error
* Docstring style
* Re-enable check
* Quality
* Fix add_end_docstring decorator
* Fix docstring
* Convert docstrings of all configurations and tokenizers
* Processors and fixes
* Last modeling files and fixes to models
* Pipeline modules
* Utils files
* Data submodule
* All the other files
* Style
* Missing examples
* Style again
* Fix copies
* Say bye bye to rst docstrings forever
When loading a pretrained tokenizer, a verification is done to ensure
that the actual tokenizer class matches the class it was called from.
If the tokenizer is absent, its config file is loaded from the repo.
However, the cache_dir for downloading is not provided, which leads to
ignoring of the user-specified cache_dir, storing files in several
places and and may result in incorrect warnings when the default
cache_dir is unreachsble.
This commit fixes that.
* correct order of overflowing tokens for LayoutLmV2 tokenizer
* test to check order of overflowing_tokens for a seq of input_ids
* fix up quality
* added suggested changes
* check that tests the bbox sequence
* pair_input test added
* pass quality test
* check bbox sequence added
* unittest method
* comments added
* add overflowing bbox test
* improved "seq_1"
Co-authored-by: SaulLu <55560583+SaulLu@users.noreply.github.com>
* improve code quality
Co-authored-by: SaulLu <lucilesaul.com@gmail.com>
Co-authored-by: SaulLu <55560583+SaulLu@users.noreply.github.com>
* correct order of overflowing_tokens for slow tokenizer (issue fix#13148)
* python 3.9 requires sentencepiece version 0.1.94 or above
* slicing of ids fixed in truncated_sequence()
* Update setup.py
* Correct order of overflowing tokens for pair of sentences
* code reformatted
* Update tokenization_utils_base.py
* reformatting file
* test to check single_input added
* missing function restored
* test to check pair_input overflowing tokens order
* test to check pair_input overflowing tokens order
* test to check pair_input overflowing tokens order
* added an error message for pair of seq and longest_first strategy
* test for pair_input modified
* variable name corrected
* fixed a typo in error message
* requested changes implemented
* required test added
* Corrected the message to match test message
* added error message for Luke Tokenizer
* lost test recovered
* docstring for truncate_sequences and prepare_for_model updated
* docstring for luke tokenizer updated
* updated ENCODE_PLUS_ADDITIONAL_KWARGS_DOCSTRING
* aligned text and fixed puncuatations
* improved style and quality of code
* fixed error_msg in truncate_sequences
* replaced encode_plus method with regular call method
* clean up
* rephrased the docstring
* add test
* add change in PretrainedTokenizerBase
* change Luke
* deactivate
* add the possibility to add additional special tokens for M2M100
* format
* add special test for canine
* proposed changes for mbart
* proposed changes for mbart50
* proposed changes for byt5
* proposed changes for canine
* proposed changes for t5
* test fast and slow
* remove comment
* remove comment
* add fast version for all tests
* replace break by continue
* add more comments
* add check to avoid duplicates
* remove comment
* format
* proposed change for wave2vec2
* reverse changes mbart
* uncomment
* format
* Initial work
* All auto models
* All tf auto models
* All flax auto models
* Tokenizers
* Add feature extractors
* Fix typos
* Fix other typo
* Use the right config
* Remove old mapping names and update logic in AutoTokenizer
* Update check_table
* Fix copies and check_repo script
* Fix last test
* Add back name
* clean up
* Update template
* Update template
* Forgot a )
* Use alternative to fixup
* Fix TF model template
* Address review comments
* Address review comments
* Style