* feat(tokenization): add encode_message to tokenize messages one by one
* Fix the `encode_message` method, remove the `add_generation_prompt` parameter and add the corresponding error handling. Update the document to reflect this change and verify the error handling in the test.
* Optimize the `encode_message` method, improve the processing logic of the empty dialogue history, and ensure that the chat template can be applied correctly when the dialogue history is empty. Update the document to reflect these changes.
* The `_encode_message` method is deleted, the message coding logic is simplified, and the functional integrity of the `encode_message` method is ensured. Update the document to reflect these changes.
* Docs fix
* Revert changes in docstring of pad()
* Revert changes in docstring
* Update src/transformers/tokenization_utils_base.py
Co-authored-by: Arthur <48595927+ArthurZucker@users.noreply.github.com>
* Repair the call of the `encode_message` method, update it to `encode_message_with_chat_template` to support the chat template, and adjust the relevant test cases to reflect this change.
* Optimize the call format of the `apply_chat_template` method, and merge multi-line calls into a single line to improve code readability.
---------
Co-authored-by: pco111 <15262555+pco111@user.noreply.gitee.com>
Co-authored-by: Arthur <48595927+ArthurZucker@users.noreply.github.com>
* rm tf/flax tests
* more flax deletions
* revert fixture change
* reverted test that should not be deleted; rm tf/flax test
* revert
* fix a few add-model-like tests
* fix add-model-like checkpoint source
* a few more
* test_get_model_files_only_pt fix
* fix test_retrieve_info_for_model_with_xxx
* fix test_retrieve_model_classes
* relative paths are the devil
* add todo
* kinda works
* update
* add tests
* update
* use special tokens in processors
* typo
* fix copies
* fix
* fix moshi after rebase
* update
* fix tests
* update
* Update docs/source/en/main_classes/tokenizer.md
Co-authored-by: Arthur <48595927+ArthurZucker@users.noreply.github.com>
* update docs
* test for load time adding tokens
* fix some more tests which are now fetched better
* one more fix
---------
Co-authored-by: Arthur <48595927+ArthurZucker@users.noreply.github.com>
* test(tokenizers): add a test showing conflict with sentencepiece
This is due to the fact that protobuf C implementation uses a global
pool for all added descriptors, so if two different files add
descriptors, they will end up conflicting.
* fix(tokenizers): mitigate sentencepiece/protobuf conflict
When sentencepiece is available, use that protobuf instead of the
internal one.
* chore(style): fix with ruff
* save total_vocab_size = vocab_size + user added tokens to speed up operation
* updating length when added_tokens_decoder is set
* add test len(tokenizer)
* Result of black 23.1
* Update target to Python 3.7
* Switch flake8 to ruff
* Configure isort
* Configure isort
* Apply isort with line limit
* Put the right black version
* adapt black in check copies
* Fix copies