Docs: add more cross-references to the KV cache docs (#33323)

* add more cross-references
* nit
* import guard
* more import guards
* nit
* Update src/transformers/generation/configuration_utils.py
2024-09-06 10:22:00 +01:00
parent 1759bb9126
commit 2b789f27f3
29 changed files with 99 additions and 57 deletions
--- a/docs/source/en/llm_optims.md
+++ b/docs/source/en/llm_optims.md
@@ -24,7 +24,7 @@ This guide will show you how to use the optimization techniques available in Tra

 During decoding, an LLM computes the key-value (kv) values for each input token. Because generation is autoregressive, each newly generated token is appended to the input, so without caching the model recomputes the same kv values for all earlier tokens at every step, which is inefficient.

-To optimize this, you can use a kv-cache to store the past keys and values instead of recomputing them each time. However, since the kv-cache grows with each generation step and is dynamic, it prevents you from taking advantage of [`torch.compile`](./perf_torch_compile), a powerful optimization tool that fuses PyTorch code into fast and optimized kernels.
+To optimize this, you can use a kv-cache to store the past keys and values instead of recomputing them each time. However, since the kv-cache grows with each generation step and is dynamic, it prevents you from taking advantage of [`torch.compile`](./perf_torch_compile), a powerful optimization tool that fuses PyTorch code into fast and optimized kernels. We have an entire guide dedicated to kv-caches [here](./kv_cache).

 The *static kv-cache* solves this issue by pre-allocating the kv-cache size to a maximum value which allows you to combine it with `torch.compile` for up to a 4x speed up. Your speed up may vary depending on the model size (larger models have a smaller speed up) and hardware.
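
To make the dynamic versus static distinction in the hunk above concrete, here is a minimal toy sketch. It is not the Transformers implementation: `ToyDynamicCache`, `ToyStaticCache`, and `buffer_lengths` are hypothetical names invented for illustration, and plain Python lists stand in for tensors. It only shows why a pre-allocated cache keeps a fixed shape from step to step, which is the property the docs say lets it be combined with `torch.compile`.

```python
# Toy illustration only: plain Python lists stand in for key/value tensors.
# All class and function names here are hypothetical, not library APIs.


class ToyDynamicCache:
    """Grows by one entry per decoding step, so its length changes each step."""

    def __init__(self):
        self.keys = []
        self.values = []

    def update(self, key, value):
        self.keys.append(key)
        self.values.append(value)
        return self.keys, self.values


class ToyStaticCache:
    """Pre-allocates max_len slots once; the buffer length never changes."""

    def __init__(self, max_len):
        self.keys = [None] * max_len
        self.values = [None] * max_len
        self.pos = 0  # index of the next free slot

    def update(self, key, value):
        if self.pos >= len(self.keys):
            raise IndexError("static cache is full")
        # Write in place instead of growing the buffer.
        self.keys[self.pos] = key
        self.values[self.pos] = value
        self.pos += 1
        return self.keys, self.values


def buffer_lengths(cache, steps):
    """Record the key-buffer length seen after each simulated decoding step."""
    lengths = []
    for step in range(steps):
        keys, _ = cache.update(("k", step), ("v", step))
        lengths.append(len(keys))
    return lengths


dynamic_lengths = buffer_lengths(ToyDynamicCache(), 4)
static_lengths = buffer_lengths(ToyStaticCache(max_len=8), 4)
print(dynamic_lengths)  # [1, 2, 3, 4]
print(static_lengths)   # [8, 8, 8, 8]
```

The dynamic buffer has a different length at every step, while the static one stays at its pre-allocated maximum. The trade-off is that the static cache reserves memory for `max_len` entries up front and cannot hold more than that.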