[docs] update attention implementation and cache docs (#39547)
* update docs * Apply suggestions from code review Co-authored-by: Steven Liu <59462357+stevhliu@users.noreply.github.com> * applu suggestions --------- Co-authored-by: Steven Liu <59462357+stevhliu@users.noreply.github.com>
This commit is contained in:
committed by
GitHub
parent
ef99537f37
commit
cd98c1fee3
@@ -132,6 +132,34 @@ for _ in range(max_new_tokens):
|
||||
print(tokenizer.batch_decode(generated_ids, skip_special_tokens=True)[0])
|
||||
"[INST] Hello, what's your name. [/INST] Hello! My name is LLaMA,"
|
||||
```
|
||||
|
||||
## Cache position
|
||||
|
||||
The cache position tracks where to insert new tokens in the attention cache. It represents the *absolute* position of each token in the context, independent of padding or batch structure. Suppose you already cached `N` tokens and are now processing `K` new tokens. The cache position for the new tokens will range from `N` to `N + K - 1`. In other words, you're processing tokens at positions - `[N, N + 1, N + 2, ..., N + K - 1]`.
|
||||
|
||||
Cache position is used internally for two purposes:
|
||||
|
||||
1. Selecting new tokens to process in the input sequence and ensuring only tokens that haven’t been cached yet are passed to the model's `forward`.
|
||||
2. Storing key/value pairs at the correct positions in the cache. This is especially important for fixed-size caches, like [`StaticCache`], that pre-allocates a specific cache length.
|
||||
|
||||
The generation loop usually takes care of the cache position, but if you're writing a custom generation method, it is important that cache positions are accurate since they are used to write and read key/value states into fixed slots.
|
||||
|
||||
|
||||
```py
|
||||
import torch
|
||||
from transformers import AutoTokenizer, AutoModelForCausalLM, DynamicCache
|
||||
|
||||
model_id = "meta-llama/Llama-2-7b-chat-hf"
|
||||
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype=torch.bfloat16, device_map="cuda:0")
|
||||
tokenizer = AutoTokenizer.from_pretrained(model_id)
|
||||
|
||||
messages = [{"role": "user", "content": "You are a helpful assistant."}]
|
||||
inputs = tokenizer.apply_chat_template(messages, add_generation_prompt=True, return_tensors="pt", return_dict=True).to("cuda:0")
|
||||
generated_ids = model.generate(**inputs, use_cache=True, max_new_tokens=10)
|
||||
|
||||
```
|
||||
|
||||
|
||||
## Legacy cache format
|
||||
|
||||
Before the [`Cache`] class, the cache used to be stored as a tuple of tuples of tensors. This format is dynamic because it grows as text is generated, similar to [`DynamicCache`].
|
||||
@@ -157,4 +185,4 @@ generation_outputs = model.generate(**inputs, return_dict_in_generate=True, retu
|
||||
|
||||
cache = DynamicCache.from_legacy_cache(generation_outputs.past_key_values)
|
||||
legacy_format_cache = cache.to_legacy_cache()
|
||||
```
|
||||
```
|
||||
|
||||
Reference in New Issue
Block a user