Gemma2: add cache warning (#32279)

* gemma2 fallback to dynamic cache * Update src/transformers/models/gemma2/modeling_gemma2.py Co-authored-by: Joao Gante <joaofranciscocardosogante@gmail.com> * Update src/transformers/models/gemma2/modeling_gemma2.py Co-authored-by: Arthur <48595927+ArthurZucker@users.noreply.github.com> * raise error and dont fallback to dynamic cache * prev will break most forward calls/tests * Update src/transformers/models/gemma2/modeling_gemma2.py Co-authored-by: Arthur <48595927+ArthurZucker@users.noreply.github.com> * update * fix copies --------- Co-authored-by: Joao Gante <joaofranciscocardosogante@gmail.com> Co-authored-by: Arthur <48595927+ArthurZucker@users.noreply.github.com>
2024-08-07 10:03:05 +05:00
parent a30c865f99
commit 7ad784ae9d
3 changed files with 27 additions and 1 deletions
--- a/docs/source/en/internal/generation_utils.md
+++ b/docs/source/en/internal/generation_utils.md
@@ -398,6 +398,7 @@ A [`Constraint`] can be used to force the generation to include specific tokens

 [[autodoc]] HybridCache
    - update
+    - get_seq_length
    - reset

 [[autodoc]] SlidingWindowCache
--- a/docs/source/en/model_doc/gemma2.md
+++ b/docs/source/en/model_doc/gemma2.md
@@ -30,6 +30,12 @@ Tips:

 - The original checkpoints can be converted using the conversion script `src/transformers/models/Gemma2/convert_Gemma2_weights_to_hf.py` 

+<Tip warning={true}>
+
+- Gemma2 uses sliding window attention every second layer, which makes it unsuitable for typical kv caching with [`~DynamicCache`] or tuples of tensors. To enable caching in Gemma2 forward call, you must initialize a [`~HybridCache`] instance and pass it as `past_key_values` to the forward call. Note, that you also have to prepare `cache_position` if the `past_key_values` already contains previous keys and values.
+
+</Tip>
+
 This model was contributed by [Arthur Zucker](https://huggingface.co/ArthurZ), [Pedro Cuenca](https://huggingface.co/pcuenq) and [Tom Arsen]().