[cache refactor] Move all the caching logic to a per-layer approach (#39106)

* Squash for refactor: Replace monolithic cache classes with modular LayeredCache (#38077)
  - Introduces CacheLayer and Cache base classes
  - Ports Static, Dynamic, Offloaded, Quantized, Hybrid, etc. to use layers
  - Implements method/attr dispatch across layers to reduce boilerplate
  - Adds CacheProcessor hooks for offloading, quantization, etc.
  - Updates and passes tests
* fix quantized, add tests
* remove CacheProcessorList
* raushan review, arthur review
* joao review: minor things
* remove cache configs, make CacheLayer a mixin (joaos review)
* back to storage inside Cache()
* remove cachebase for decorator
* no more __getattr__
* fix tests
* joaos review except docs
* fix ast deprecations for python 3.14: replace node.n by node.value and use `ast.Constant`
  More verbose exceptions in `fix_docstring` on docstring formatting issues.
* Revert "back to storage inside Cache()"
  This reverts commit 27916bc2737806bf849ce2148cb1e66d59573913.
* cyril review
* simplify cache export
* fix lfm2 cache
* HybridChunked to layer
* BC proxy object for cache.key_cache[i]=...
* reorder classes
* bfff come on LFM2
* better tests for hybrid and hybridChunked
* complete coverage for hybrid chunked caches (prefill chunking)
* reimplementing HybridChunked
* cyril review
* fix ci
* docs for cache refactor
* docs
* oopsie
* oopsie
* fix after merge
* cyril review
* arthur review
* opsie
* fix lfm2
* opsie2
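
The per-layer design described in the message above can be pictured roughly as follows. This is a minimal, dependency-free sketch and not the code from this commit: every `Toy*` class is hypothetical, plain Python lists stand in for tensors, and the `pre_update`/`post_update` signatures are assumptions. Only the method names `update`, `get_seq_length`, `reset`, `pre_update`, and `post_update` are taken from the documented API in the diff below.

```python
from typing import List, Optional, Tuple

States = List[float]  # stand-in for a tensor of key or value states


class ToyDynamicLayer:
    """Hypothetical layer whose storage grows as new states arrive."""

    def __init__(self) -> None:
        self.keys: States = []
        self.values: States = []

    def update(self, key_states: States, value_states: States) -> Tuple[States, States]:
        # Append the new states and return everything cached so far.
        self.keys = self.keys + key_states
        self.values = self.values + value_states
        return self.keys, self.values

    def get_seq_length(self) -> int:
        return len(self.keys)

    def reset(self) -> None:
        self.keys, self.values = [], []


class ToySlidingWindowLayer(ToyDynamicLayer):
    """Hypothetical layer that keeps only the most recent `window` states."""

    def __init__(self, window: int) -> None:
        super().__init__()
        self.window = window

    def update(self, key_states: States, value_states: States) -> Tuple[States, States]:
        super().update(key_states, value_states)
        # Drop everything that has fallen out of the window.
        self.keys = self.keys[-self.window:]
        self.values = self.values[-self.window:]
        return self.keys, self.values


class ToyProcessor:
    """Hypothetical hook pair; both hooks are no-ops here (assumed signatures)."""

    def pre_update(self, key_states: States, value_states: States, layer_idx: int):
        return key_states, value_states

    def post_update(self, key_states: States, value_states: States, layer_idx: int):
        return key_states, value_states


class ToyCache:
    """Hypothetical container that only dispatches calls to its layers."""

    def __init__(self, layers: list, processor: Optional[ToyProcessor] = None) -> None:
        self.layers = layers
        self.processor = processor

    def update(self, key_states: States, value_states: States, layer_idx: int):
        if self.processor is not None:
            key_states, value_states = self.processor.pre_update(key_states, value_states, layer_idx)
        keys, values = self.layers[layer_idx].update(key_states, value_states)
        if self.processor is not None:
            keys, values = self.processor.post_update(keys, values, layer_idx)
        return keys, values

    def get_seq_length(self, layer_idx: int = 0) -> int:
        return self.layers[layer_idx].get_seq_length()

    def reset(self) -> None:
        for layer in self.layers:
            layer.reset()


# One growing layer and one sliding-window layer behind the same interface.
cache = ToyCache([ToyDynamicLayer(), ToySlidingWindowLayer(window=2)], ToyProcessor())
full_keys, _ = cache.update([1.0, 2.0, 3.0], [4.0, 5.0, 6.0], layer_idx=0)
window_keys, _ = cache.update([1.0, 2.0, 3.0], [4.0, 5.0, 6.0], layer_idx=1)
```

In this sketch a hybrid cache is just a different list of layers, so behaviour such as windowing lives in the layer class and does not need a dedicated cache class.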
2025-07-22 16:10:25 +02:00
parent b16688e96a
commit c338fd43b0
64 changed files with 2779 additions and 2441 deletions
--- a/docs/source/en/internal/generation_utils.md
+++ b/docs/source/en/internal/generation_utils.md
@@ -356,66 +356,93 @@ A [`Constraint`] can be used to force the generation to include specific tokens

 ## Caches

-[[autodoc]] Cache
-    - update
-
-[[autodoc]] CacheConfig
-	- update
-
-[[autodoc]] QuantizedCacheConfig
-	- validate
-
-[[autodoc]] DynamicCache
+[[autodoc]] CacheLayerMixin
    - update
    - get_seq_length
+    - get_mask_sizes
+    - get_max_cache_shape
+    - reset
    - reorder_cache
+
+[[autodoc]] DynamicLayer
+    - update
+    - crop
+    - batch_repeat_interleave
+    - batch_select_indices
+
+[[autodoc]] StaticLayer
+    - update
+
+[[autodoc]] SlidingWindowLayer
+    - update
+
+[[autodoc]] CacheProcessor
+    - pre_update
+    - post_update
+
+[[autodoc]] OffloadedCacheProcessor
+    - pre_update
+
+[[autodoc]] QuantizedCacheProcessor
+    - post_update
+
+[[autodoc]] QuantoQuantizedCacheProcessor
+    - post_update
+
+[[autodoc]] HQQQuantizedCacheProcessor
+    - post_update
+
+[[autodoc]] Cache
+    - update
+    - get_seq_length
+    - get_mask_sizes
+    - get_max_cache_shape
+    - reset
+    - reorder_cache
+    - crop
+    - batch_repeat_interleave
+    - batch_select_indices
+
+[[autodoc]] DynamicCache
    - to_legacy_cache
    - from_legacy_cache

 [[autodoc]] QuantizedCache
-    - update
-    - get_seq_length

 [[autodoc]] QuantoQuantizedCache

+[[autodoc]] QuantoQuantizedCacheProcessor
+
 [[autodoc]] HQQQuantizedCache

+[[autodoc]] HQQQuantizedCacheProcessor
+
 [[autodoc]] OffloadedCache
-    - update
-    - prefetch_layer
-    - evict_previous_layer

 [[autodoc]] StaticCache
-    - update
-    - get_seq_length
-    - reset

 [[autodoc]] OffloadedStaticCache
-    - update
-    - get_seq_length
-    - reset

 [[autodoc]] HybridCache
-    - update
-    - get_seq_length
-    - reset
+
+[[autodoc]] HybridChunkedCache

 [[autodoc]] SlidingWindowCache
-    - update
-    - reset

 [[autodoc]] EncoderDecoderCache
-    - get_seq_length
    - to_legacy_cache
    - from_legacy_cache
-    - reset
-    - reorder_cache

 [[autodoc]] MambaCache
    - update_conv_state
    - update_ssm_state
    - reset

+[[autodoc]] CacheConfig
+
+[[autodoc]] QuantizedCacheConfig
+
+
 ## Watermark Utils

 [[autodoc]] WatermarkingConfig