Offloaded KV Cache (#31325)
* Initial implementation of OffloadedCache * enable usage via cache_implementation * Address feedback, add tests, remove legacy methods. * Remove flash-attn, discover synchronization bugs, fix bugs * Prevent usage in CPU only mode * Add a section about offloaded KV cache to the docs * Fix typos in docs * Clarifications and better explanation of streams
This commit is contained in:
committed by
GitHub
parent
b4727a1216
commit
ca59d6f77c
@@ -211,6 +211,80 @@ I like rock music because it's loud and energetic. It's a great way to express m
|
||||
I like rock music because it's loud and energetic. I like to listen to it when I'm feeling
|
||||
```
|
||||
|
||||
## KV Cache Offloading
|
||||
|
||||
Similarly to KV cache quantization, this strategy aims to reduce GPU VRAM usage.
|
||||
It does so by moving the KV cache for most layers to the CPU.
|
||||
As the model's `forward()` method iterates over the layers, this strategy maintains the current layer cache on the GPU.
|
||||
At the same time it asynchronously prefetches the next layer cache as well as sending the previous layer cache back to the CPU.
|
||||
Unlike KV cache quantization, this strategy always produces the same result as the default KV cache implementation.
|
||||
Thus, it can serve as a drop-in replacement or a fallback for it.
|
||||
|
||||
Depending on your model and the characteristics of your generation task (size of context, number of generated tokens, number of beams, etc.)
|
||||
you may notice a small degradation in generation throughput compared to the default KV cache implementation.
|
||||
|
||||
To enable KV cache offloading, pass `cache_implementation="offloaded"` in the `generation_config`.
|
||||
|
||||
```python
|
||||
>>> import torch
|
||||
>>> from transformers import AutoTokenizer, AutoModelForCausalLM
|
||||
>>> ckpt = "microsoft/Phi-3-mini-4k-instruct"
|
||||
|
||||
>>> tokenizer = AutoTokenizer.from_pretrained(ckpt)
|
||||
>>> model = AutoModelForCausalLM.from_pretrained(ckpt, torch_dtype=torch.float16).to("cuda:0")
|
||||
>>> inputs = tokenizer("Fun fact: The shortest", return_tensors="pt").to(model.device)
|
||||
|
||||
>>> out = model.generate(**inputs, do_sample=False, max_new_tokens=23, cache_implementation="offloaded")
|
||||
>>> print(tokenizer.batch_decode(out, skip_special_tokens=True)[0])
|
||||
Fun fact: The shortest war in history was between Britain and Zanzibar on August 27, 1896.
|
||||
|
||||
>>> out = model.generate(**inputs, do_sample=False, max_new_tokens=23)
|
||||
>>> print(tokenizer.batch_decode(out, skip_special_tokens=True)[0])
|
||||
Fun fact: The shortest war in history was between Britain and Zanzibar on August 27, 1896.
|
||||
```
|
||||
|
||||
<Tip warning={true}>
|
||||
|
||||
Cache offloading requires a GPU and can be slower than the default KV cache. Use it if you are getting CUDA out of memory errors.
|
||||
|
||||
</Tip>
|
||||
|
||||
The example below shows how KV cache offloading can be used as a fallback strategy.
|
||||
```python
|
||||
>>> import torch
|
||||
>>> from transformers import AutoTokenizer, AutoModelForCausalLM
|
||||
>>> def resilient_generate(model, *args, **kwargs):
|
||||
... oom = False
|
||||
... try:
|
||||
... return model.generate(*args, **kwargs)
|
||||
... except torch.cuda.OutOfMemoryError as e:
|
||||
... print(e)
|
||||
... print("retrying with cache_implementation='offloaded'")
|
||||
... oom = True
|
||||
... if oom:
|
||||
... torch.cuda.empty_cache()
|
||||
... kwargs["cache_implementation"] = "offloaded"
|
||||
... return model.generate(*args, **kwargs)
|
||||
...
|
||||
...
|
||||
>>> ckpt = "microsoft/Phi-3-mini-4k-instruct"
|
||||
>>> tokenizer = AutoTokenizer.from_pretrained(ckpt)
|
||||
>>> model = AutoModelForCausalLM.from_pretrained(ckpt, torch_dtype=torch.float16).to("cuda:0")
|
||||
>>> prompt = ["okay "*1000 + "Fun fact: The most"]
|
||||
>>> inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
|
||||
>>> beams = { "num_beams": 40, "num_beam_groups": 40, "num_return_sequences": 40, "diversity_penalty": 1.0, "max_new_tokens": 23, "early_stopping": True, }
|
||||
>>> out = resilient_generate(model, **inputs, **beams)
|
||||
>>> responses = tokenizer.batch_decode(out[:,-28:], skip_special_tokens=True)
|
||||
```
|
||||
|
||||
On a GPU with 50 GB of RAM, running this code will print
|
||||
```
|
||||
CUDA out of memory. Tried to allocate 4.83 GiB. GPU
|
||||
retrying with cache_implementation='offloaded'
|
||||
```
|
||||
before successfully generating 40 beams.
|
||||
|
||||
|
||||
## Watermarking
|
||||
|
||||
The `generate()` supports watermarking the generated text by randomly marking a portion of tokens as "green".
|
||||
|
||||
Reference in New Issue
Block a user