Docs: add more cross-references to the KV cache docs (#33323)
* add more cross-references
* nit
* import guard
* more import guards
* nit
* Update src/transformers/generation/configuration_utils.py
@@ -51,11 +51,11 @@ More concretely, key-value cache acts as a memory bank for these generative mode

See an example below for how to implement your own generation loop.

```python
>>> import torch
>>> from transformers import AutoTokenizer, AutoModelForCausalLM, DynamicCache

>>> model_id = "meta-llama/Llama-2-7b-chat-hf"
>>> model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype=torch.bfloat16, device_map="cuda:0")
>>> tokenizer = AutoTokenizer.from_pretrained(model_id)
@@ -69,10 +69,10 @@ More concretely, key-value cache acts as a memory bank for these generative mode
>>> max_new_tokens = 10

>>> for _ in range(max_new_tokens):
...     outputs = model(**inputs, cache_position=cache_position, past_key_values=past_key_values, use_cache=True)
...     # Greedily sample one next token
...     next_token_ids = outputs.logits[:, -1:].argmax(-1)
...     generated_ids = torch.cat([generated_ids, next_token_ids], dim=-1)
...
...     # Prepare inputs for the next generation step by leaving unprocessed tokens, in our case we have only one new token
...     # and expanding attn mask for the new token, as explained above
@@ -222,7 +222,7 @@ before successfully generating 40 beams.

### Static Cache

Since the "DynamicCache" dynamically grows with each generation step, it prevents you from taking advantage of JIT optimizations. The [`~StaticCache`] pre-allocates
a specific maximum size for the keys and values, allowing you to generate up to the maximum length without having to modify the cache size. See the usage example below.

For more examples with Static Cache and JIT compilation, take a look at [StaticCache & torchcompile](./llm_optims#static-kv-cache-and-torchcompile).
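To make the contrast between a growing and a pre-allocated cache concrete, here is a minimal, library-free sketch in plain Python. `ToyDynamicCache` and `ToyStaticCache` are hypothetical illustration classes, not the `transformers` implementations, and plain lists stand in for key/value tensors. The sketch only shows that the dynamic variant changes size at every step while the static variant keeps a fixed-size buffer and writes into it in place.

```python
class ToyDynamicCache:
    """Hypothetical toy: storage grows by one slot per generated token."""

    def __init__(self):
        self.keys = []

    def update(self, key):
        self.keys.append(key)
        return self.keys


class ToyStaticCache:
    """Hypothetical toy: pre-allocates `max_cache_len` slots once."""

    def __init__(self, max_cache_len):
        self.max_cache_len = max_cache_len
        self.keys = [None] * max_cache_len  # fixed-size buffer, never resized
        self.seen_tokens = 0

    def update(self, key):
        if self.seen_tokens >= self.max_cache_len:
            raise ValueError("cache is full: cannot generate past max_cache_len")
        self.keys[self.seen_tokens] = key  # in-place write at the next free slot
        self.seen_tokens += 1
        return self.keys


dynamic, static = ToyDynamicCache(), ToyStaticCache(max_cache_len=4)
dynamic_sizes, static_sizes = [], []
for step in range(3):
    dynamic_sizes.append(len(dynamic.update(f"k{step}")))
    static_sizes.append(len(static.update(f"k{step}")))

print(dynamic_sizes)  # [1, 2, 3]
print(static_sizes)   # [4, 4, 4]
```

The constant buffer size is the property that compilers relying on fixed shapes can exploit; the price is a hard upper bound on how many tokens can be stored.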
@@ -267,7 +267,7 @@ This will use the [`~OffloadedStaticCache`] implementation instead.

As the name suggests, this cache type implements a sliding window over previous keys and values, retaining only the last `sliding_window` tokens. It should be used with models like Mistral that support sliding window attention. Additionally, similar to Static Cache, this one is JIT-friendly and can be used with the same compile techniques as Static Cache.

Note that you can use this cache only for models that support sliding window, e.g. Mistral models.
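The retention rule described above can be sketched without any model. The snippet below is a hypothetical toy (`ToySlidingWindowCache` is not a `transformers` class) that uses a bounded `collections.deque` in place of key/value tensors, purely to show which positions remain visible once more than `sliding_window` tokens have been seen.

```python
from collections import deque


class ToySlidingWindowCache:
    """Hypothetical toy: keeps only the most recent `sliding_window` entries."""

    def __init__(self, sliding_window):
        # A deque with maxlen silently drops the oldest entry once it is full.
        self.entries = deque(maxlen=sliding_window)

    def update(self, key_value):
        self.entries.append(key_value)
        return list(self.entries)


cache = ToySlidingWindowCache(sliding_window=3)
for token_position in range(5):
    visible = cache.update(token_position)

print(visible)  # [2, 3, 4]
```

After five tokens only positions 2, 3 and 4 are still cached, which is why this scheme only suits models whose attention is itself restricted to a window.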
```python
@@ -324,7 +324,7 @@ We have seen how to use each of the cache types when generating. What if you wan

The general format when doing iterative generation is as below. First you have to initialize an empty cache of the type you want, and then you can start feeding in new prompts iteratively. Keeping track of dialogue history and formatting can be done with chat templates; read more on that in [chat_templating](./chat_templating).

If you are using Sink Cache, you have to crop your inputs to the maximum cache length: Sink Cache can generate text longer than its maximum window size, but it expects the first input not to exceed the maximum cache length.
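As a rough illustration of the cropping step, the sketch below keeps only the last `max_cache_length` positions of every tokenized field. It is a simplified stand-in: nested Python lists replace the tensors a tokenizer would return, and `crop_to_cache_length` is a hypothetical helper, not a library function.

```python
def crop_to_cache_length(inputs, max_cache_length):
    """Keep only the last `max_cache_length` positions of every per-token field."""
    return {
        name: [row[-max_cache_length:] for row in rows]
        for name, rows in inputs.items()
    }


# Hypothetical tokenized prompt: a batch of one sequence, six tokens long.
inputs = {
    "input_ids": [[11, 12, 13, 14, 15, 16]],
    "attention_mask": [[1, 1, 1, 1, 1, 1]],
}

cropped = crop_to_cache_length(inputs, max_cache_length=4)
print(cropped["input_ids"])       # [[13, 14, 15, 16]]
print(cropped["attention_mask"])  # [[1, 1, 1, 1]]
```

Inputs that are already shorter than the limit pass through unchanged, so the crop can be applied unconditionally before each turn.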
```python
@@ -354,9 +354,9 @@ In case you are using Sink Cache, you have to crop your inputs to that maximum l
...     inputs = tokenizer.apply_chat_template(messages, add_generation_prompt=True, return_tensors="pt", return_dict=True).to(model.device)
...     if isinstance(past_key_values, SinkCache):
...         inputs = {k: v[:, -max_cache_length:] for k, v in inputs.items()}
...
...     input_length = inputs["input_ids"].shape[1]
...
...     outputs = model.generate(**inputs, do_sample=False, max_new_tokens=256, past_key_values=past_key_values)
...     completion = tokenizer.decode(outputs[0, input_length:], skip_special_tokens=True)
...     messages.append({"role": "assistant", "content": completion})
@@ -400,4 +400,4 @@ Sometimes you would want to first fill-in cache object with key/values for certa

>>> print(responses)
['<s> You are a helpful assistant. Help me to write a blogpost about travelling.\n\nTitle: The Ultimate Guide to Travelling: Tips, Tricks, and', '<s> You are a helpful assistant. What is the capital of France?\n\nYes, the capital of France is Paris.</s>']
```