Quantized KV Cache (#30483)
* clean-up * Update src/transformers/cache_utils.py Co-authored-by: Arthur <48595927+ArthurZucker@users.noreply.github.com> * Update src/transformers/cache_utils.py Co-authored-by: Arthur <48595927+ArthurZucker@users.noreply.github.com> * Update src/transformers/cache_utils.py Co-authored-by: Arthur <48595927+ArthurZucker@users.noreply.github.com> * fixup * Update tests/quantization/quanto_integration/test_quanto.py Co-authored-by: Younes Belkada <49240599+younesbelkada@users.noreply.github.com> * Update src/transformers/generation/configuration_utils.py Co-authored-by: Arthur <48595927+ArthurZucker@users.noreply.github.com> * more suggestions * mapping if torch available * run tests & add 'support_quantized' flag * fix jamba test * revert, will be fixed by another PR * codestyle * HQQ and versatile cache classes * final update * typo * make tests happy --------- Co-authored-by: Arthur <48595927+ArthurZucker@users.noreply.github.com> Co-authored-by: Younes Belkada <49240599+younesbelkada@users.noreply.github.com>
This commit is contained in:
committed by
GitHub
parent
e05baad861
commit
d583f1317b
@@ -174,6 +174,43 @@ An increasing sequence: one, two, three, four, five, six, seven, eight, nine, te
|
||||
```
|
||||
|
||||
|
||||
## KV Cache Quantization
|
||||
|
||||
The `generate()` method supports caching keys and values to enhance efficiency and avoid re-computations. However the key and value
|
||||
cache can occupy a large portion of memory, becoming a bottleneck for long-context generation, especially for Large Language Models.
|
||||
Quantizing the cache when using `generate()` can significantly reduce memory requirements at the cost of speed.
|
||||
|
||||
KV Cache quantization in `transformers` is largely inspired by the paper [KIVI: A Tuning-Free Asymmetric 2bit Quantization for KV Cache]
|
||||
(https://arxiv.org/abs/2402.02750) and currently supports `quanto` and `HQQ` as backends. For more information on the inner workings see the paper.
|
||||
|
||||
To enable quantization of the key-value cache, one needs to indicate `cache_implementation="quantized"` in the `generation_config`.
|
||||
Quantization related arguments should be passed to the `generation_config` either as a `dict` or an instance of a [`QuantizedCacheConfig`] class.
|
||||
One has to indicate which quantization backend to use in the [`QuantizedCacheConfig`], the default is `quanto`.
|
||||
|
||||
<Tip warning={true}>
|
||||
|
||||
Cache quantization can be detrimental if the context length is short and there is enough GPU VRAM available to run without cache quantization.
|
||||
|
||||
</Tip>
|
||||
|
||||
|
||||
```python
|
||||
>>> import torch
|
||||
>>> from transformers import AutoTokenizer, AutoModelForCausalLM
|
||||
|
||||
>>> tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-2-7b-chat-hf")
|
||||
>>> model = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-2-7b-chat-hf", torch_dtype=torch.float16).to("cuda:0")
|
||||
>>> inputs = tokenizer("I like rock music because", return_tensors="pt").to(model.device)
|
||||
|
||||
>>> out = model.generate(**inputs, do_sample=False, max_new_tokens=20, cache_implementation="quantized", cache_config={"nbits": 4, "backend": "quanto"})
|
||||
>>> print(tokenizer.batch_decode(out, skip_special_tokens=True)[0])
|
||||
I like rock music because it's loud and energetic. It's a great way to express myself and rel
|
||||
|
||||
>>> out = model.generate(**inputs, do_sample=False, max_new_tokens=20)
|
||||
>>> print(tokenizer.batch_decode(out, skip_special_tokens=True)[0])
|
||||
I like rock music because it's loud and energetic. I like to listen to it when I'm feeling
|
||||
```
|
||||
|
||||
## Watermarking
|
||||
|
||||
The `generate()` supports watermarking the generated text by randomly marking a portion of tokens as "green".
|
||||
|
||||
@@ -360,6 +360,12 @@ A [`Constraint`] can be used to force the generation to include specific tokens
|
||||
[[autodoc]] Cache
|
||||
- update
|
||||
|
||||
[[autodoc]] CacheConfig
|
||||
- update
|
||||
|
||||
[[autodoc]] QuantizedCacheConfig
|
||||
- validate
|
||||
|
||||
[[autodoc]] DynamicCache
|
||||
- update
|
||||
- get_seq_length
|
||||
@@ -367,6 +373,14 @@ A [`Constraint`] can be used to force the generation to include specific tokens
|
||||
- to_legacy_cache
|
||||
- from_legacy_cache
|
||||
|
||||
[[autodoc]] QuantizedCache
|
||||
- update
|
||||
- get_seq_length
|
||||
|
||||
[[autodoc]] QuantoQuantizedCache
|
||||
|
||||
[[autodoc]] HQQQuantizedCache
|
||||
|
||||
[[autodoc]] SinkCache
|
||||
- update
|
||||
- get_seq_length
|
||||
@@ -375,7 +389,7 @@ A [`Constraint`] can be used to force the generation to include specific tokens
|
||||
[[autodoc]] StaticCache
|
||||
- update
|
||||
- get_seq_length
|
||||
- reorder_cache
|
||||
- reset
|
||||
|
||||
|
||||
## Watermark Utils
|
||||
|
||||
Reference in New Issue
Block a user