From e26ae892811dd32c90b12de94fc4105d690cd137 Mon Sep 17 00:00:00 2001
From: Raushan Turganbay <raushan@huggingface.co>
Date: Fri, 13 Jun 2025 09:10:56 +0200
Subject: [PATCH] [docs] update cache docs with new info (#38775)

* update docs with new info

* Update docs/source/en/kv_cache.md

Co-authored-by: Steven Liu <59462357+stevhliu@users.noreply.github.com>

---------

Co-authored-by: Steven Liu <59462357+stevhliu@users.noreply.github.com>
---
 docs/source/en/kv_cache.md | 5 +++--
 1 file changed, 3 insertions(+), 2 deletions(-)
diff --git a/docs/source/en/kv_cache.md b/docs/source/en/kv_cache.md
index 440ce18e5a..14a0d4901d 100644
--- a/docs/source/en/kv_cache.md
+++ b/docs/source/en/kv_cache.md
@@ -261,7 +261,9 @@ A cache can also work in iterative generation settings where there is back-and-f
 
 For iterative generation with a cache, start by initializing an empty cache class and then you can feed in your new prompts. Keep track of dialogue history with a [chat template](./chat_templating).
 
-The example below demonstrates how to use a cache for iterative generation.
+The following example demonstrates iterative generation with [Llama-2-7b-chat-hf](https://huggingface.co/meta-llama/Llama-2-7b-chat-hf). If you're using a different chat-style model, [`~PreTrainedTokenizer.apply_chat_template`] may process messages differently, and depending on how the Jinja template is written, it might cut out important tokens.
+
+For example, some models use special `<think> ... </think>` tokens during reasoning. These could get lost when the conversation is re-encoded, so the new input no longer lines up with the tokens already stored in the cache, causing indexing issues. You might need to manually remove or adjust extra tokens from the completions to keep things stable.
 
 ```py
 import torch
@@ -281,7 +283,6 @@ tokenizer = AutoTokenizer.from_pretrained(model_id)
 user_prompts = ["Hello, what's your name?", "Btw, yesterday I was on a rock concert."]
 
 past_key_values = DynamicCache()
-max_cache_length = past_key_values.get_max_length()
 
 messages = []
 for prompt in user_prompts: