[cache refactor] Move all the caching logic to a per-layer approach (#39106)

* Squash for refactor: Replace monolithic cache classes with modular LayeredCache (#38077)

- Introduces CacheLayer and Cache base classes
- Ports Static, Dynamic, Offloaded, Quantized, Hybrid, etc. to use layers
- Implements method/attr dispatch across layers to reduce boilerplate
- Adds CacheProcessor hooks for offloading, quantization, etc.
- Updates and passes tests

* fix quantized, add tests

* remove CacheProcessorList

* raushan review, arthur review

* joao review: minor things

* remove cache configs, make CacheLayer a mixin (joaos review)

* back to storage inside Cache()

* remove cachebase for decorator

* no more __getattr__

* fix tests

* joaos review except docs

* fix ast deprecations for python 3.14: replace node.n by node.value and use `ast.Constant`

More verbose exceptions in `fix_docstring` on docstring formatting issues.

* Revert "back to storage inside Cache()"

This reverts commit 27916bc2737806bf849ce2148cb1e66d59573913.

* cyril review

* simplify cache export

* fix lfm2 cache

* HybridChunked to layer

* BC proxy object for cache.key_cache[i]=...

* reorder classes

* bfff come on LFM2

* better tests for hybrid and hybridChunked

* complete coverage for hybrid chunked caches (prefill chunking)

* reimplementing HybridChunked

* cyril review

* fix ci

* docs for cache refactor

* docs

* oopsie

* oopsie

* fix after merge

* cyril review

* arthur review

* opsie

* fix lfm2

* opsie2
This commit is contained in:
Manuel de Prada Corral
2025-07-22 16:10:25 +02:00
committed by GitHub
parent b16688e96a
commit c338fd43b0
64 changed files with 2779 additions and 2441 deletions

View File

@@ -356,66 +356,93 @@ A [`Constraint`] can be used to force the generation to include specific tokens
## Caches
[[autodoc]] Cache
- update
[[autodoc]] CacheConfig
- update
[[autodoc]] QuantizedCacheConfig
- validate
[[autodoc]] DynamicCache
[[autodoc]] CacheLayerMixin
- update
- get_seq_length
- get_mask_sizes
- get_max_cache_shape
- reset
- reorder_cache
[[autodoc]] DynamicLayer
- update
- crop
- batch_repeat_interleave
- batch_select_indices
[[autodoc]] StaticLayer
- update
[[autodoc]] SlidingWindowLayer
- update
[[autodoc]] CacheProcessor
- pre_update
- post_update
[[autodoc]] OffloadedCacheProcessor
- pre_update
[[autodoc]] QuantizedCacheProcessor
- post_update
[[autodoc]] QuantoQuantizedCacheProcessor
- post_update
[[autodoc]] HQQQuantizedCacheProcessor
- post_update
[[autodoc]] Cache
- update
- get_seq_length
- get_mask_sizes
- get_max_cache_shape
- reset
- reorder_cache
- crop
- batch_repeat_interleave
- batch_select_indices
[[autodoc]] DynamicCache
- to_legacy_cache
- from_legacy_cache
[[autodoc]] QuantizedCache
- update
- get_seq_length
[[autodoc]] QuantoQuantizedCache
[[autodoc]] QuantoQuantizedCacheProcessor
[[autodoc]] HQQQuantizedCache
[[autodoc]] HQQQuantizedCacheProcessor
[[autodoc]] OffloadedCache
- update
- prefetch_layer
- evict_previous_layer
[[autodoc]] StaticCache
- update
- get_seq_length
- reset
[[autodoc]] OffloadedStaticCache
- update
- get_seq_length
- reset
[[autodoc]] HybridCache
- update
- get_seq_length
- reset
[[autodoc]] HybridChunkedCache
[[autodoc]] SlidingWindowCache
- update
- reset
[[autodoc]] EncoderDecoderCache
- get_seq_length
- to_legacy_cache
- from_legacy_cache
- reset
- reorder_cache
[[autodoc]] MambaCache
- update_conv_state
- update_ssm_state
- reset
[[autodoc]] CacheConfig
[[autodoc]] QuantizedCacheConfig
## Watermark Utils
[[autodoc]] WatermarkingConfig