From e26ae892811dd32c90b12de94fc4105d690cd137 Mon Sep 17 00:00:00 2001 From: Raushan Turganbay Date: Fri, 13 Jun 2025 09:10:56 +0200 Subject: [PATCH] [docs] update cache docs with new info (#38775) * update docs with new info * Update docs/source/en/kv_cache.md Co-authored-by: Steven Liu <59462357+stevhliu@users.noreply.github.com> --------- Co-authored-by: Steven Liu <59462357+stevhliu@users.noreply.github.com> --- docs/source/en/kv_cache.md | 5 +++-- 1 file changed, 3 insertions(+), 2 deletions(-) diff --git a/docs/source/en/kv_cache.md b/docs/source/en/kv_cache.md index 440ce18e5a..14a0d4901d 100644 --- a/docs/source/en/kv_cache.md +++ b/docs/source/en/kv_cache.md @@ -261,7 +261,9 @@ A cache can also work in iterative generation settings where there is back-and-f For iterative generation with a cache, start by initializing an empty cache class and then you can feed in your new prompts. Keep track of dialogue history with a [chat template](./chat_templating). -The example below demonstrates how to use a cache for iterative generation. +The following example demonstrates [Llama-2-7b-chat-hf](https://huggingface.co/meta-llama/Llama-2-7b-chat-hf). If you’re using a different chat-style model, [`~PreTrainedTokenizer.apply_chat_template`] may process messages differently. It might cut out important tokens depending on how the Jinja template is written. + +For example, some models use special ` ... ` tokens during reasoning. These could get lost during re-encoding, causing indexing issues. You might need to manually remove or adjust extra tokens from the completions to keep things stable. ```py import torch @@ -281,7 +283,6 @@ tokenizer = AutoTokenizer.from_pretrained(model_id) user_prompts = ["Hello, what's your name?", "Btw, yesterday I was on a rock concert."] past_key_values = DynamicCache() -max_cache_length = past_key_values.get_max_length() messages = [] for prompt in user_prompts: