Cache: create docs (#32150)

* draft * updates * works? * try adding python example in hidden section * another try * hwo do i render python * format as html code? * Update docs/source/en/kv_cache.md Co-authored-by: Joao Gante <joaofranciscocardosogante@gmail.com> * Update docs/source/en/kv_cache.md Co-authored-by: Joao Gante <joaofranciscocardosogante@gmail.com> * Update docs/source/en/kv_cache.md Co-authored-by: Joao Gante <joaofranciscocardosogante@gmail.com> * Update docs/source/en/kv_cache.md Co-authored-by: Joao Gante <joaofranciscocardosogante@gmail.com> * Update docs/source/en/kv_cache.md Co-authored-by: Joao Gante <joaofranciscocardosogante@gmail.com> * one more small update * should render hidden secrtion now * add outputs * fix links * check links * update all links * update with offloaded cache * all cache is importable, so they appear in docs * fix copies * docstring... --------- Co-authored-by: Joao Gante <joaofranciscocardosogante@gmail.com>
2024-08-06 10:24:19 +05:00
parent 13dc6b0853
commit 37c5ca5eb9
7 changed files with 597 additions and 121 deletions
--- a/docs/source/en/generation_strategies.md
+++ b/docs/source/en/generation_strategies.md
@@ -174,117 +174,6 @@ An increasing sequence: one, two, three, four, five, six, seven, eight, nine, te
 ```


-## KV Cache Quantization
-
-The `generate()` method supports caching keys and values to enhance efficiency and avoid re-computations. However the key and value
-cache can occupy a large portion of memory, becoming a bottleneck for long-context generation, especially for Large Language Models.
-Quantizing the cache when using `generate()` can significantly reduce memory requirements at the cost of speed.
-
-KV Cache quantization in `transformers` is largely inspired by the paper [KIVI: A Tuning-Free Asymmetric 2bit Quantization for KV Cache]
-(https://arxiv.org/abs/2402.02750) and currently supports `quanto` and `HQQ` as backends. For more information on the inner workings see the paper.
-
-To enable quantization of the key-value cache, one needs to indicate `cache_implementation="quantized"` in the `generation_config`.
-Quantization related arguments should be passed to the `generation_config` either as a `dict` or an instance of a [`QuantizedCacheConfig`] class.
-One has to indicate which quantization backend to use in the [`QuantizedCacheConfig`], the default is `quanto`.
-
-<Tip warning={true}>
-
-Cache quantization can be detrimental if the context length is short and there is enough GPU VRAM available to run without cache quantization.
-
-</Tip>
-
-
-```python
->>> import torch
->>> from transformers import AutoTokenizer, AutoModelForCausalLM
-
->>> tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-2-7b-chat-hf")
->>> model = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-2-7b-chat-hf", torch_dtype=torch.float16).to("cuda:0")
->>> inputs = tokenizer("I like rock music because", return_tensors="pt").to(model.device)
-
->>> out = model.generate(**inputs, do_sample=False, max_new_tokens=20, cache_implementation="quantized", cache_config={"nbits": 4, "backend": "quanto"})
->>> print(tokenizer.batch_decode(out, skip_special_tokens=True)[0])
-I like rock music because it's loud and energetic. It's a great way to express myself and rel
-
->>> out = model.generate(**inputs, do_sample=False, max_new_tokens=20)
->>> print(tokenizer.batch_decode(out, skip_special_tokens=True)[0])
-I like rock music because it's loud and energetic. I like to listen to it when I'm feeling
-```
-
-## KV Cache Offloading
-
-Similarly to KV cache quantization, this strategy aims to reduce GPU VRAM usage.
-It does so by moving the KV cache for most layers to the CPU.
-As the model's `forward()` method iterates over the layers, this strategy maintains the current layer cache on the GPU.
-At the same time it asynchronously prefetches the next layer cache as well as sending the previous layer cache back to the CPU.
-Unlike KV cache quantization, this strategy always produces the same result as the default KV cache implementation.
-Thus, it can serve as a drop-in replacement or a fallback for it.
-
-Depending on your model and the characteristics of your generation task (size of context, number of generated tokens, number of beams, etc.)
-you may notice a small degradation in generation throughput compared to the default KV cache implementation.
-
-To enable KV cache offloading, pass `cache_implementation="offloaded"` in the `generation_config`.
-
-```python
->>> import torch
->>> from transformers import AutoTokenizer, AutoModelForCausalLM
->>> ckpt = "microsoft/Phi-3-mini-4k-instruct"
-
->>> tokenizer = AutoTokenizer.from_pretrained(ckpt)
->>> model = AutoModelForCausalLM.from_pretrained(ckpt, torch_dtype=torch.float16).to("cuda:0")
->>> inputs = tokenizer("Fun fact: The shortest", return_tensors="pt").to(model.device)
-
->>> out = model.generate(**inputs, do_sample=False, max_new_tokens=23, cache_implementation="offloaded")
->>> print(tokenizer.batch_decode(out, skip_special_tokens=True)[0])
-Fun fact: The shortest war in history was between Britain and Zanzibar on August 27, 1896.
-
->>> out = model.generate(**inputs, do_sample=False, max_new_tokens=23)
->>> print(tokenizer.batch_decode(out, skip_special_tokens=True)[0])
-Fun fact: The shortest war in history was between Britain and Zanzibar on August 27, 1896.
-```
-
-<Tip warning={true}>
-
-Cache offloading requires a GPU and can be slower than the default KV cache. Use it if you are getting CUDA out of memory errors.
-
-</Tip>
-
-The example below shows how KV cache offloading can be used as a fallback strategy.
-```python
->>> import torch
->>> from transformers import AutoTokenizer, AutoModelForCausalLM
->>> def resilient_generate(model, *args, **kwargs):
-...     oom = False
-...     try:
-...         return model.generate(*args, **kwargs)
-...     except torch.cuda.OutOfMemoryError as e:
-...         print(e)
-...         print("retrying with cache_implementation='offloaded'")
-...         oom = True
-...     if oom:
-...         torch.cuda.empty_cache()
-...         kwargs["cache_implementation"] = "offloaded"
-...         return model.generate(*args, **kwargs)
-...
-...
->>> ckpt = "microsoft/Phi-3-mini-4k-instruct"
->>> tokenizer = AutoTokenizer.from_pretrained(ckpt)
->>> model = AutoModelForCausalLM.from_pretrained(ckpt, torch_dtype=torch.float16).to("cuda:0")
->>> prompt = ["okay "*1000 + "Fun fact: The most"]
->>> inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
->>> beams = { "num_beams": 40, "num_beam_groups": 40, "num_return_sequences": 40, "diversity_penalty": 1.0, "max_new_tokens": 23, "early_stopping": True, }
->>> out = resilient_generate(model, **inputs, **beams)
->>> responses = tokenizer.batch_decode(out[:,-28:], skip_special_tokens=True)
-```
-
-On a GPU with 50 GB of RAM, running this code will print
-```
-CUDA out of memory. Tried to allocate 4.83 GiB. GPU
-retrying with cache_implementation='offloaded'
-```
-before successfully generating 40 beams.
-
-
 ## Watermarking

 The `generate()` supports watermarking the generated text by randomly marking a portion of tokens as "green".