feat(tokenization): add encode_message to tokenize messages one by one (#39507)

* feat(tokenization): add encode_message to tokenize messages one by one

* Fix the `encode_message` method: remove the `add_generation_prompt` parameter and raise an error when it is passed. Update the documentation to reflect this change and add a test that verifies the error handling.

* Improve how the `encode_message` method handles an empty conversation history, so that the chat template is still applied correctly when the history is empty. Update the documentation to reflect these changes.

* Delete the `_encode_message` method and simplify the message encoding logic, while keeping the `encode_message` method functionally complete. Update the documentation to reflect these changes.

* Docs fix

* Revert changes in docstring of pad()

* Revert changes in docstring

* Update src/transformers/tokenization_utils_base.py

Co-authored-by: Arthur <48595927+ArthurZucker@users.noreply.github.com>

* Fix the calls to `encode_message`: update them to `encode_message_with_chat_template`, which supports the chat template, and adjust the relevant test cases to reflect this change.

* Reformat the `apply_chat_template` call: merge the multi-line call into a single line to improve readability.

---------

Co-authored-by: pco111 <15262555+pco111@user.noreply.gitee.com>
Co-authored-by: Arthur <48595927+ArthurZucker@users.noreply.github.com>
Jeff Zhang
2025-07-31 04:55:45 -04:00
committed by GitHub
parent 4f93cc9174
commit cb289ad243
2 changed files with 86 additions and 0 deletions


@@ -24,6 +24,7 @@ from typing import Callable, Optional
import numpy as np
from transformers import (
    AutoTokenizer,
    BatchEncoding,
    BertTokenizer,
    BertTokenizerFast,
@@ -375,3 +376,32 @@ class TokenizerUtilsTest(unittest.TestCase):
        tokenizer = PreTrainedTokenizerFast(tokenizer_object=_tokenizer)
        toy_text_iterator = ("a" for _ in range(1000))
        tokenizer.train_new_from_iterator(text_iterator=toy_text_iterator, length=1000, vocab_size=50)

    def test_encode_message(self):
        tokenizer = AutoTokenizer.from_pretrained("HuggingFaceH4/zephyr-7b-beta")
        conversation = [
            {"role": "system", "content": "You are a helpful assistant"},
            {"role": "user", "content": "Hey there, how are you?"},
            {"role": "assistant", "content": "Thank you for asking, I am doing well"},
            {"role": "user", "content": "What's the weather like today?"},
            {"role": "assistant", "content": "Today the weather is nice"},
        ]

        # First, test the default case, where we encode the whole conversation at once
        whole_conversation_tokens = tokenizer.apply_chat_template(conversation, tokenize=True)

        # Now, test the message-by-message encoding
        tokens = []
        for i, message in enumerate(conversation):
            tokens += tokenizer.encode_message_with_chat_template(message, conversation_history=conversation[:i])

        self.assertEqual(whole_conversation_tokens, tokens)

    def test_encode_message_raises_on_add_generation_prompt(self):
        tokenizer = AutoTokenizer.from_pretrained("HuggingFaceH4/zephyr-7b-beta")
        conversation = [
            {"role": "system", "content": "You are a helpful assistant"},
            {"role": "user", "content": "Hey there, how are you?"},
        ]
        with self.assertRaises(ValueError):
            tokenizer.encode_message_with_chat_template(conversation[0], add_generation_prompt=True)
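
The invariant these two tests check can be illustrated without downloading a model. The sketch below is not the code added to `tokenization_utils_base.py`. It assumes one plausible approach: render the history plus the new message, render the history alone, and return only the tokens that follow the shared prefix. `render_chat`, `tokenize`, and `encode_message` are hypothetical stand-ins for a chat template, a tokenizer, and `encode_message_with_chat_template`.

```python
from typing import Dict, List, Optional

Message = Dict[str, str]


def render_chat(conversation: List[Message]) -> str:
    """Hypothetical stand-in for a chat template: one tagged block per message."""
    return "".join(f"<|{m['role']}|>\n{m['content']}</s>\n" for m in conversation)


def tokenize(text: str) -> List[int]:
    """Hypothetical stand-in for a tokenizer: one token id per character."""
    return [ord(ch) for ch in text]


def encode_message(
    message: Message,
    conversation_history: Optional[List[Message]] = None,
    **kwargs,
) -> List[int]:
    """Return only the tokens that `message` adds on top of the history."""
    # Assumption: a generation prompt makes no sense for a single message,
    # so the argument is rejected, mirroring the ValueError the test expects.
    if "add_generation_prompt" in kwargs:
        raise ValueError("add_generation_prompt is not supported here")

    history = list(conversation_history or [])
    prefix_tokens = tokenize(render_chat(history))
    full_tokens = tokenize(render_chat(history + [message]))

    # Skip the tokens shared with the history and keep the remainder.
    i = 0
    while i < len(prefix_tokens) and i < len(full_tokens) and prefix_tokens[i] == full_tokens[i]:
        i += 1
    return full_tokens[i:]


conversation = [
    {"role": "system", "content": "You are a helpful assistant"},
    {"role": "user", "content": "Hey there, how are you?"},
    {"role": "assistant", "content": "Thank you for asking, I am doing well"},
]

tokens: List[int] = []
for i, message in enumerate(conversation):
    tokens += encode_message(message, conversation_history=conversation[:i])

# Encoding message by message reproduces the whole-conversation encoding.
assert tokens == tokenize(render_chat(conversation))
```

With an empty history the prefix is empty, so the first message is returned in full, including anything the template emits at the start of a conversation.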