Cyril Vallez
5dba4bc7b2
Fix DynamicCache and simplify Cache classes a bit ( #39590 )
* fix
* use kwargs
* simplify
* Update cache_utils.py
* Update cache_utils.py
* Update test_cache_utils.py
* fix
* style
2025-07-23 10:13:45 +02:00
Manuel de Prada Corral
c338fd43b0
[cache refactor] Move all the caching logic to a per-layer approach ( #39106 )
* Squash for refactor: Replace monolithic cache classes with modular LayeredCache (#38077 )
- Introduces CacheLayer and Cache base classes
- Ports Static, Dynamic, Offloaded, Quantized, Hybrid, etc. to use layers
- Implements method/attr dispatch across layers to reduce boilerplate
- Adds CacheProcessor hooks for offloading, quantization, etc.
- Updates and passes tests
* fix quantized, add tests
* remove CacheProcessorList
* raushan review, arthur review
* joao review: minor things
* remove cache configs, make CacheLayer a mixin (joaos review)
* back to storage inside Cache()
* remove cachebase for decorator
* no more __getattr__
* fix tests
* joaos review except docs
* fix ast deprecations for python 3.14: replace node.n by node.value and use `ast.Constant`
More verbose exceptions in `fix_docstring` on docstring formatting issues.
* Revert "back to storage inside Cache()"
This reverts commit 27916bc2737806bf849ce2148cb1e66d59573913.
* cyril review
* simplify cache export
* fix lfm2 cache
* HybridChunked to layer
* BC proxy object for cache.key_cache[i]=...
* reorder classes
* bfff come on LFM2
* better tests for hybrid and hybridChunked
* complete coverage for hybrid chunked caches (prefill chunking)
* reimplementing HybridChunked
* cyril review
* fix ci
* docs for cache refactor
* docs
* oopsie
* oopsie
* fix after merge
* cyril review
* arthur review
* opsie
* fix lfm2
* opsie2
2025-07-22 16:10:25 +02:00
Manuel de Prada Corral
1aa7256f01
Refactor MambaCache to modeling_mamba.py ( #38086 )
* Refactor MambaCache to modeling_mamba.py (parity with Zamba)
* ruff
* fix dummies
* update
* update
* remove mamba ref in cache tests
* remove cache_implementation from tests
* update
* ruff
* ruff
* sneaky regression
* model consistency
* fix test_multi_gpu_data_parallel_forward
* fix falcon slow tests
* ruff
* ruff
* add sample false
* try to fix slow tests
* Revert "fix test_multi_gpu_data_parallel_forward"
This reverts commit 66b7162c7c5c5ce8a73ccf48cffc8a96343ebb33.
* fix tests on nvidia t4, remove dataparallel tests from mamba
* ruff
* remove DDP tests from mamba and falcon_mamba
* add explicit error for MambaCache
* mamba2 also needs to init cache in prepare_inputs_for_generation
* ruff
* ruff
* move MambaCache to its own file
* ruff
* unprotected import fix
* another attempt to fix unprotected imports
* Revert "another attempt to fix unprotected imports"
This reverts commit 2338354fcab630de5899321f5daced5fb312c2a2.
* fixing unprotected import, attempt 3
* Update src/transformers/cache_utils.py
* ruff's fault
* fix arthur review
* modular falcon mamba
* found a hack
* fix config docs
* fix docs
* add export info
* merge modular falcon branch
* oopsie
* fix fast path failing
* new approach
* oopsie
* fix types
* Revert new pragma in modular
This reverts commit 80b1cf160ee251536f07c40b8a0857d499e70db6.
* trying another modular workaround
* review & fix ci
* oopsie
* clear prepare_inputs on mamba/mamba2/falcon_mamba
2025-07-21 14:59:36 +02:00
Wang, Yi
9323d0873c
use the enable_gqa param in torch.nn.functional.scaled_dot_product_at… ( #39412 )
* use the enable_gqa param in torch.nn.functional.scaled_dot_product_attention
Signed-off-by: Wang, Yi A <yi.a.wang@intel.com >
* ci failure fix
Signed-off-by: Wang, Yi A <yi.a.wang@intel.com >
* add check
Signed-off-by: Wang, Yi A <yi.a.wang@intel.com >
* fix ci failure
Signed-off-by: Wang, Yi A <yi.a.wang@intel.com >
* refine code, extend to cuda
Signed-off-by: Wang, Yi A <yi.a.wang@intel.com >
* refine code
Signed-off-by: Wang, Yi A <yi.a.wang@intel.com >
* fix review comments
Signed-off-by: Wang, Yi A <yi.a.wang@intel.com >
* refine the PR
Signed-off-by: Wang, Yi A <yi.a.wang@intel.com >
---------
Signed-off-by: Wang, Yi A <yi.a.wang@intel.com >
Co-authored-by: Cyril Vallez <cyril.vallez@huggingface.co >
2025-07-21 14:46:43 +02:00
Stonepia
fc700c2a26
Fix convert_and_export_with_cache failures for GPU models ( #38976 )
* Add the `device` option for `generate()`
* Add device for default tensors to avoid tensor mismatch
* [test] Enable test_static_cache_exportability for torch_device
* infer device from the prompt_token_ids
* Add device for generated tensor
* [Test] Make `test_export_static_cache` tests run on devices rather than only CPU
* fix format
* infer device from the model
2025-07-17 13:12:32 +00:00
Raushan Turganbay
c8524aeb07
[cache] make all classes cache compatible finally ( #38635 )
* dump
* push other models
* fix simple greedy generation
* xmod
* add fsmt and clean up some mentions of old cache format
* gpt-bigcode now follows standards
* delete tuple cache reference in generation
* fix some models
* fix some models
* fix mambas and support cache in tapas
* fix some more tests
* fix copies
* delete `_reorder_cache`
* another fix copies
* fix typos and delete unnecessary test
* fix rag generate, needs special cache reordering
* fix tapas and superglue
* reformer create special cache
* recurrent gemma `reorder_cache` was a no-op, delete
* fix-copies
* fix blio and musicgen pipeline tests
* fix reformer
* fix reformer, again...
* delete `_supports_cache_class`
* delete `supports_quantized_cache`
* fix failing tests
* fix copies
* some minor clean up
* style
* style
* fix copies
* fix tests
* fix copies
* create causal mask now needs positions?
* fix copies
* style
* Update tests/test_modeling_common.py
Co-authored-by: Joao Gante <joaofranciscocardosogante@gmail.com >
* clean-up of non-generative model after merging main
* check `is_decoder` for cache
* delete transpose for scores
* remove tuple cache from docs everywhere
* fix tests
* fix copies
* fix copies once more
* properly deprecate `encoder_attention_mask` in Bert-like models
* import `deprecate_kwarg` where needed
* fix copies again
* fix copies
* delete `next_decoder_cache`
* fix copies asks to update for PLM
* fix copies
* rebasing had a few new models, fix them and merge asap!
* fix copies once more
* fix slow tests
* fix tests and update PLM checkpoint
* add read token and revert accidentally removed line
* oh come on, style
* just skip it, read token has no access to PLM yet
---------
Co-authored-by: Joao Gante <joaofranciscocardosogante@gmail.com >
2025-07-16 14:00:17 +02:00
Guang Yang
356fd68109
fix(generation): stop beam search per-instance when heuristic satisfied ( #38778 )
* fix(decoding): stop beam search per-instance when heuristic satisfied
Previously, when early_stopping was set to `False`, the early-stopping heuristic halted generation only when **all** batch instances met the criterion. As a result, instances that the heuristic had already judged impossible to improve kept generating, leading to inconsistent and overlong outputs across the batch.
Now we apply the heuristic **per-instance**: once every beam of a given batch instance is impossible to improve, we mark that instance as finished and let the others continue. This restores the expected behavior and ensures consistency in batched generation.
* Add test case GenerationIntegrationTests.test_beam_search_early_stop_heuristic
* Update naming improvement_possibility -> is_early_stop_heuristic_unsatisfied
* Add comments for early stop heuristic
* Update src/transformers/generation/utils.py
---------
Co-authored-by: Joao Gante <joaofranciscocardosogante@gmail.com >
2025-07-08 08:59:37 +00:00
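A minimal sketch of the per-instance stopping rule described in #38778, assuming the heuristic has already produced one boolean per beam saying whether that beam could still improve. The function names and the nested-list layout are hypothetical and chosen for illustration; this is not the code in src/transformers/generation/utils.py.

```python
from typing import List


def finished_whole_batch(beam_can_improve: List[List[bool]]) -> List[bool]:
    """Old behaviour (illustrative): nothing is finished until no beam of any instance can improve."""
    all_done = not any(any(beams) for beams in beam_can_improve)
    return [all_done] * len(beam_can_improve)


def finished_per_instance(beam_can_improve: List[List[bool]]) -> List[bool]:
    """New behaviour (illustrative): an instance is finished once none of its own beams can improve."""
    return [not any(beams) for beams in beam_can_improve]


# Hypothetical batch of two instances with two beams each.
# Instance 0 has no beam left that could improve; instance 1 still has one.
flags = [[False, False], [True, False]]
print(finished_whole_batch(flags))   # instance 0 keeps generating with the batch
print(finished_per_instance(flags))  # instance 0 is marked finished on its own
```

Under the whole-batch rule, instance 0 keeps generating only because instance 1 is still open; under the per-instance rule it stops as soon as its own beams are exhausted.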
Arthur
ca7e1a3756
Refactor the way we handle outputs for new llamas and new models ( #39120 )
* just update 2 files
* update other models as well just making fix-copies
* also add the changes needed to modeling utils
* put this on the pretrained model instead
* nits and fixes
* update generic, fix to use config value
* update other modelings
* use transformers kwargs instead
* update
* update
* update other models
* update
* updates
* update
* update
* update
* fix
* finally
* very small nits
* this fixes more tests
* fix other models as well!
* update modularqwen2
* update models based on qwen2
* update
* update
* remove the **flash stuff in favor of normal kwargs
* update
* propagate gemma?
* remove output attentions
* propagate
* support cross attention edge case
* same
* test this
* fixes
* more fix
* update
* update
* fix conflicts
* update
* fix emu3
* fix emu3
* move the fix a bit
* quel enfer
* some fixes, loss_kwargs should never have been
* finish fixing gemma3n
* fix small lm3
* fix another one
* fix csm now
* fix csm and mistral
* fix mistral now
* small fixes
* fix janusss
* only for some models
* fixup
* phix phi3
* more fixes?
* does this fix it?
* update
* holy shit it was just graph breaks
* protect torch
* updates
* fix samhq?
* fix moonshine
* more moonshine fixes, 3 failures left!
* nits
* generic needs to support more
* more fixes to moonshine!
* fix cross attention outputs!
* fix csm!
* nits
* fix stupid kosmos2
* current updates
* fixes
* use output recorder?
* nicer!
* a little bit of magic
* update
* fix protect
* fix
* small fixes
* protect import
* fix a bunch of more models
* fix fixups
* fix some of the last ones
* nit
* partly fix phi
* update
* fix import path
* make something that is fullgraph compatible just to be sure
* typing was wrong on llama so the rest was wrong as well
* fucking ugly but at least it is still exportable
* style
* supposed to fix moonshine, it still breaks
* fix some default
* fix the last bits of sam
* update samhq
* more fixes to sam hq
* nit
* fix all output+hidden states and output_attentions!
* fix?
* fix diffllama
* updates to fix initialization on the sam pips
* ups there was a bug
* fix the last sam hq test
* fix gotocr
* fix gotocr2!
* fixes
* skip stupid tests
* there was one left :)
* fixup
* fix fix copies issues with this test file
* fix copies for sam_hq
* rm some comments
* skip 2 more failing tests
* fix
* fix everything
* Apply suggestions from code review
Co-authored-by: Anton Vlasjuk <73884904+vasqu@users.noreply.github.com >
Co-authored-by: Pablo Montalvo <39954772+molbap@users.noreply.github.com >
* add more doc!
* fix public init
* fix modular qwen3
---------
Co-authored-by: Anton Vlasjuk <73884904+vasqu@users.noreply.github.com >
Co-authored-by: Pablo Montalvo <39954772+molbap@users.noreply.github.com >
2025-07-05 11:34:28 +02:00
Tugsbayasgalan Manlaibaatar
67d36dc1d7
Fix bugs in DynamicCache ( #37880 )
* Fix bugs in DynamicCache
* Update
* Update
* Lint
* lint
* Rename test
* update
* update
2025-06-24 19:43:40 +02:00
Yao Matrix
3526e25d3d
enable misc test cases on XPU ( #38852 )
* enable misc test cases on XPU
Signed-off-by: YAO Matrix <matrix.yao@intel.com >
* fix style
Signed-off-by: YAO Matrix <matrix.yao@intel.com >
* tweak bamba ground truth on XPU
Signed-off-by: YAO Matrix <matrix.yao@intel.com >
* remove print
Signed-off-by: YAO Matrix <matrix.yao@intel.com >
* one more
Signed-off-by: YAO Matrix <matrix.yao@intel.com >
* fix style
Signed-off-by: YAO Matrix <matrix.yao@intel.com >
---------
Signed-off-by: YAO Matrix <matrix.yao@intel.com >
2025-06-18 09:20:49 +02:00
Guang Yang
7f00b325f8
Unbreak optimum-executorch ( #38646 )
* Unbreak optimum-executorch
* use static cache if has layer_types but no sliding_window
* revert view on kv_arange
---------
Co-authored-by: Guang Yang <guangyang@fb.com >
2025-06-13 11:13:32 +02:00
Yao Matrix
a5a0c7b888
switch to device agnostic device calling for test cases ( #38247 )
* use device agnostic APIs in test cases
Signed-off-by: Matrix Yao <matrix.yao@intel.com >
* fix style
Signed-off-by: Matrix Yao <matrix.yao@intel.com >
* add one more
Signed-off-by: YAO Matrix <matrix.yao@intel.com >
* xpu now supports integer device id, aligning to CUDA behaviors
Signed-off-by: Matrix Yao <matrix.yao@intel.com >
* update to use device_properties
Signed-off-by: Matrix Yao <matrix.yao@intel.com >
* fix style
Signed-off-by: Matrix Yao <matrix.yao@intel.com >
* update comment
Signed-off-by: Matrix Yao <matrix.yao@intel.com >
* fix comments
Signed-off-by: Matrix Yao <matrix.yao@intel.com >
* fix style
Signed-off-by: Matrix Yao <matrix.yao@intel.com >
---------
Signed-off-by: Matrix Yao <matrix.yao@intel.com >
Signed-off-by: YAO Matrix <matrix.yao@intel.com >
Co-authored-by: ydshieh <ydshieh@users.noreply.github.com >
2025-05-26 10:18:53 +02:00
Cyril Vallez
163138a911
🚨 🚨 [core] Completely rewrite the masking logic for all attentions ( #37866 )
* start
* start having a clean 4d mask primitive
* Update mask_utils.py
* Update mask_utils.py
* switch name
* Update masking_utils.py
* add a new AttentionMask tensor class
* fix import
* nits
* fixes
* use full and quadrants
* general sdpa mask for all caches
* style
* start some tests
* tests with sliding, chunked
* add styling
* test hybrid
* Update masking_utils.py
* small temp fixes
* Update modeling_gemma2.py
* compile compatible
* Update masking_utils.py
* improve
* start making it more general
* Update masking_utils.py
* generate
* make it work with flex style primitives!
* Update masking_utils.py
* Update masking_utils.py
* Update masking_utils.py
* improve
* Update cache_utils.py
* Update masking_utils.py
* simplify - starting to look good!
* Update masking_utils.py
* name
* Update masking_utils.py
* style
* Update masking_utils.py
* Update masking_utils.py
* Update masking_utils.py
* Update masking_utils.py
* small fix for flex
* flex compile
* FA2
* Update masking_utils.py
* Escape for TGI/vLLM!
* Update masking_utils.py
* Update masking_utils.py
* Update masking_utils.py
* General case without cache
* rename
* full test on llama4
* small fix for FA2 guard with chunk
* Update modeling_gemma2.py
* post rebase cleanup
* FA2 supports static cache!
* Update modeling_flash_attention_utils.py
* Update flex_attention.py
* Update masking_utils.py
* Update masking_utils.py
* Update utils.py
* override for export
* Update executorch.py
* Update executorch.py
* Update executorch.py
* Update executorch.py
* Update masking_utils.py
* Update masking_utils.py
* output attentions
* style
* Update masking_utils.py
* Update executorch.py
* Add docstring
* Add license and put mask visualizer at the end
* Update test_modeling_common.py
* fix broken test
* Update test_modeling_gemma.py
* Update test_modeling_gemma2.py
* Use fullgraph=False with FA2
* Update utils.py
* change name
* Update masking_utils.py
* improve doc
* change name
* Update modeling_attn_mask_utils.py
* more explicit logic based on model's property
* pattern in config
* extend
* fixes
* make it better
* generalize to other test models
* fix
* Update masking_utils.py
* fix
* do not check mask equivalence if layer types are different
* executorch
* Update modeling_gemma2.py
* Update masking_utils.py
* use layer_idx instead
* adjust
* Update masking_utils.py
* test
* fix imports
* Update modeling_gemma2.py
* other test models
* Update modeling_llama4.py
* Update masking_utils.py
* improve
* simplify
* Update masking_utils.py
* typos
* typo
* fix
* Update masking_utils.py
* default DynamicCache
* remove default cache
* simplify
* Update masking_utils.py
* Update masking_utils.py
* Update masking_utils.py
* Update masking_utils.py
* simplify
* Update masking_utils.py
* Update masking_utils.py
* Update masking_utils.py
* export
* Update executorch.py
* Update executorch.py
* Update flex_attention.py
* Update executorch.py
* upstream to modular gemma 1 & 2
* Update modular_mistral.py
* switch names
* use dict
* put it in the Layer directly
* update copy model source for mask functions
* apply so many modular (hopefully 1 shot)
* use explicit dicts to make style happy
* protect import
* check docstring
* better default in hybrid caches
* qwens
* Update modular_qwen2.py
* simplify core logic!
* Update executorch.py
* qwen3 moe
* Update masking_utils.py
* Update masking_utils.py
* simplify a lot sdpa causal skip
* Update masking_utils.py
* post-rebase
* gemma3 finally
* style
* check it before
* gemma3
* More general with newer torch
* align gemma3
* Update utils.py
* Update utils.py
* Update masking_utils.py
* Update test_modeling_common.py
* Update flex_attention.py
* Update flex_attention.py
* Update flex_attention.py
* test
* executorch
* Update test_modeling_common.py
* Update masking_utils.py
* Update masking_utils.py
* Update masking_utils.py
* Update masking_utils.py
* Update executorch.py
* Update test_modeling_common.py
* fix copies
* device
* sdpa can be used without mask -> pass the torchscript tests in this case
* Use enum for check
* revert enum and add check instead
* remove broken test
* cohere2
* some doc & reorganize the Interface
* Update tensor_parallel.py
* Update tensor_parallel.py
* doc and dummy
* Update test_modeling_paligemma2.py
* Update modeling_falcon_h1.py
* Update masking_utils.py
* executorch patch
* style
* CIs
* use register in executorch
* final comments!
---------
Co-authored-by: Arthur Zucker <arthur.zucker@gmail.com >
2025-05-22 11:38:26 +02:00
Manuel de Prada Corral
d34e21e7dd
New cache tests and refactored Hybrid Cache ( #37972 )
2025-05-20 12:46:13 +02:00
Yao Matrix
3bd1c20149
enable misc cases on XPU & use device agnostic APIs for cases in tests ( #38192 )
* use device agnostic APIs in tests
Signed-off-by: Matrix Yao <matrix.yao@intel.com >
* more
Signed-off-by: Matrix Yao <matrix.yao@intel.com >
* fix style
Signed-off-by: Matrix Yao <matrix.yao@intel.com >
* add reset_peak_memory_stats API
Signed-off-by: YAO Matrix <matrix.yao@intel.com >
* update
---------
Signed-off-by: Matrix Yao <matrix.yao@intel.com >
Signed-off-by: YAO Matrix <matrix.yao@intel.com >
Co-authored-by: ydshieh <ydshieh@users.noreply.github.com >
2025-05-20 10:09:01 +02:00
Yao Matrix
a72cb31434
enable utils test cases on XPU ( #38005 )
* enable utils test cases on XPU
Signed-off-by: Yao Matrix <matrix.yao@intel.com >
* fix style
Signed-off-by: Yao Matrix <matrix.yao@intel.com >
* Update tests/utils/test_skip_decorators.py
Co-authored-by: Ilyas Moutawwakil <57442720+IlyasMoutawwakil@users.noreply.github.com >
* fix comment
Signed-off-by: Yao Matrix <matrix.yao@intel.com >
---------
Signed-off-by: Yao Matrix <matrix.yao@intel.com >
Co-authored-by: Yih-Dar <2521628+ydshieh@users.noreply.github.com >
Co-authored-by: Ilyas Moutawwakil <57442720+IlyasMoutawwakil@users.noreply.github.com >
2025-05-09 08:45:01 +02:00
Joao Gante
f2b59c6173
[caches] Raise exception on offloaded static caches + multi device ( #37974 )
* skip tests on >1 gpu
* add todo
2025-05-08 14:37:36 +01:00
Joao Gante
9981214d32
[tests] Smaller model in slow cache tests ( #37922 )
2025-05-06 11:15:25 +01:00
Joao Gante
1b222903c3
[tests] Test all cache implementations ( #37873 )
2025-04-30 15:37:00 +01:00
Guang Yang
a57274466f
Allow override inputs to export recipe ( #37508 )
Add option to specify dynamic shapes during export
Co-authored-by: Guang Yang <guangyang@fb.com >
2025-04-30 10:19:27 +02:00
Joao Gante
755b0fa2fe
[tests] reorganize cache tests and clean memory between tests ( #37684 )
2025-04-29 12:21:14 +01:00
Poedator
7c62e69326
GPT2Model StaticCache support (#35761 )
* initial GPT2 changes
* causal_mask support
* return_legacy_cache
* cleanup
* fix1
* outputs shape fixes
* gpt2 return fix
* pkv, attn fixes
* fix dual_head
* is_causal arg fix
* decision transformer updated
* style fix
* batch_size from inputs_embeds
* DecisionTransformerModel fixes
* cross-attn support + cache warning
* x-attn @decision
* EDCache proper init
* simplified logic in `if use_cache:` for GPT2Model
* @deprecate_kwarg for DecisionTr attn fwd
* @deprecate_kwarg in gpt2
* deprecation version updated to 4.51
* kwargs in gradient_checkpointing_fn
* rename next_cache to past_key_values
* attention_mask prep
* +cache_position in GPT2DoubleHeadsModel
* undo kwargs in gradient checkpointing
* moved up `if self.gradient_checkpointing`
* consistency in decision_transformer
* pastkv, cache_pos in grad_checkpt args
* rm _reorder_cache
* output_attentions streamlined
* decision_transformer consistency
* return_legacy_cache improved
* ClvpForCausalLM used for legacy cache test now
* is_causal fixed
* attn_output cleanup
* consistency @ decision_transformer
* Updated deprecation notice version to 4.52
* upd deprecation
* consistent legacy cache code in decision transformers
* next_cache -> past_kv in decision_tr
* cache support flags in decision_transf
* rm legacy cache warning
* consistency in cache init for decision transf
* no Static Cache for Decision Transformer
---------
Co-authored-by: Cyril Vallez <cyril.vallez@huggingface.co >
2025-04-24 14:46:35 +02:00
cyyever
1e6b546ea6
Use Python 3.9 syntax in tests ( #37343 )
Signed-off-by: cyy <cyyever@outlook.com >
2025-04-08 14:12:08 +02:00
cyyever
6cc9c8d7d1
Remove deprecated batch_size parameter ( #37007 )
2025-03-27 15:01:56 +00:00
Joao Gante
bc1c90a755
[Utils] torch version checks optionally accept dev versions ( #36847 )
2025-03-25 10:58:58 +00:00
Tugsbayasgalan Manlaibaatar
f39f4960f3
Support traceable dynamicKVcache ( #36311 )
* Support traceable dynamicKVcache
* Fix lint
* More fine grained test
* Lint
* Update
* Update
* Fix up
* Apply suggestions from code review
* Update src/transformers/cache_utils.py
* Update tests/utils/test_cache_utils.py
* Apply suggestions from code review
* Update
* Change error message
* Rename
* Apply suggestions from code review
* Apply suggestions from code review
* Apply suggestions from code review
---------
Co-authored-by: Ilyas Moutawwakil <57442720+IlyasMoutawwakil@users.noreply.github.com >
Co-authored-by: Joao Gante <joaofranciscocardosogante@gmail.com >
2025-03-19 16:52:30 +00:00
Yao Matrix
b11050d6a2
enable OffloadedCache on XPU from PyTorch 2.7 ( #36654 )
* fix "Cannot copy out of meta tensor; no data!" issue for BartForConditionalGeneration model
* follow Marc's suggestion to use _tie_weights to fix
Signed-off-by: Yao, Matrix <matrix.yao@intel.com >
* enable OffloadedCache on XPU since PyTorch 2.7
Signed-off-by: Yao, Matrix <matrix.yao@intel.com >
* fix style
Signed-off-by: Yao, Matrix <matrix.yao@intel.com >
* don't change bart
Signed-off-by: root <root@a4bf01945cfe.jf.intel.com >
* make code more concise per review comments
Signed-off-by: N <matrix.yao@intel.com >
* fix review comments
Signed-off-by: root <root@a4bf01945cfe.jf.intel.com >
* Revert "fix review comments"
This reverts commit acf1484b86c7cc58b2dee69e7008c0eeb4c97b1b.
* fix review comments
Signed-off-by: root <root@a4bf01945cfe.jf.intel.com >
* fix style
Signed-off-by: root <root@a4bf01945cfe.jf.intel.com >
---------
Signed-off-by: Yao, Matrix <matrix.yao@intel.com >
Signed-off-by: root <root@a4bf01945cfe.jf.intel.com >
Signed-off-by: N <matrix.yao@intel.com >
Co-authored-by: root <root@a4bf01945cfe.jf.intel.com >
Co-authored-by: Marc Sun <57196510+SunMarc@users.noreply.github.com >
2025-03-19 15:15:52 +01:00
Joao Gante
c4161238bd
[Cache] Don't initialize the cache on meta device ( #36543 )
2025-03-13 10:13:29 +00:00
Joao Gante
8aed019764
[generate] torch.distributed-compatible DynamicCache ( #36373 )
* test
* docstring
* prepare distributed cache data
* fix cat dim
* test mvp
* add test checks
* like this?
* working test and solution
* nit
* nit
* add shape info
2025-02-27 11:48:57 +00:00
Ilyas Moutawwakil
5e2183f344
Make cache traceable ( #35873 )
simply make cache traceable
2025-02-20 09:59:25 +01:00
Joao Gante
ece8c42488
Test: generate with torch.compile(model.forward) as a fast test ( #34544 )
2025-01-28 14:10:38 +00:00
Raushan Turganbay
373e50e970
Init cache on meta device ( #35164 )
* init cache on meta device
* offloaded static + enable tests
* tests weren't running before :(
* update
* fix mamba
* fix copies
* update
* address comments and fix tests
* fix copies
* Update src/transformers/cache_utils.py
Co-authored-by: Arthur <48595927+ArthurZucker@users.noreply.github.com >
* update
* mamba fix
---------
Co-authored-by: Arthur <48595927+ArthurZucker@users.noreply.github.com >
2025-01-22 09:49:17 +01:00
jiqing-feng
387663e571
Enable gptqmodel ( #35012 )
* gptqmodel
Signed-off-by: jiqing-feng <jiqing.feng@intel.com >
* fix format
Signed-off-by: jiqing-feng <jiqing.feng@intel.com >
* update readme
Signed-off-by: jiqing-feng <jiqing.feng@intel.com >
* gptqmodel need use checkpoint_format (#1 )
* gptqmodel need use checkpoint_format
* fix quantize
* Update quantization_config.py
* Update quantization_config.py
* Update quantization_config.py
---------
Co-authored-by: ZX-ModelCloud <zx@modelcloud.ai >
Co-authored-by: Qubitium-ModelCloud <qubitium@modelcloud.ai >
* Revert quantizer_gptq.py (#2 )
* revert quantizer_gptq.py change
* pass **kwargs
* limit gptqmodel and optimum version
Signed-off-by: jiqing-feng <jiqing.feng@intel.com >
* fix format
Signed-off-by: jiqing-feng <jiqing.feng@intel.com >
* fix warning
Signed-off-by: jiqing-feng <jiqing.feng@intel.com >
* fix version check
Signed-off-by: jiqing-feng <jiqing.feng@intel.com >
* revert unrelated changes
Signed-off-by: jiqing-feng <jiqing.feng@intel.com >
* enable gptqmodel tests
Signed-off-by: jiqing-feng <jiqing.feng@intel.com >
* fix requires gptq
Signed-off-by: jiqing-feng <jiqing.feng@intel.com >
* Fix Transformer compat (#3 )
* revert quantizer_gptq.py change
* pass **kwargs
* add meta info
* cleanup
* cleanup
* Update quantization_config.py
* hf_select_quant_linear pass checkpoint_format and meta
* fix GPTQTestCUDA
* Update test_gptq.py
* gptqmodel.hf_select_quant_linear() now does not select ExllamaV2
* cleanup
* add backend
* cleanup
* cleanup
* no need check exllama version
* Update quantization_config.py
* lower checkpoint_format and backend
* check none
* cleanup
* Update quantization_config.py
* fix self.use_exllama == False
* spell
* fix unittest
* fix unittest
---------
Co-authored-by: LRL <lrl@lbx.dev >
Co-authored-by: Qubitium-ModelCloud <qubitium@modelcloud.ai >
* fix format
Signed-off-by: jiqing-feng <jiqing.feng@intel.com >
* fix format again
Signed-off-by: jiqing-feng <jiqing.feng@intel.com >
* update gptqmodel version (#6 )
* update gptqmodel version
* update gptqmodel version
* fix unit test (#5 )
* update gptqmodel version
* update gptqmodel version
* "not self.use_exllama" is not equivalent to "self.use_exllama==False"
* fix unittest
* update gptqmodel version
* backend is loading_attributes (#7 )
* fix format and tests
Signed-off-by: jiqing-feng <jiqing.feng@intel.com >
* fix memory check
Signed-off-by: jiqing-feng <jiqing.feng@intel.com >
* fix device mismatch
Signed-off-by: jiqing-feng <jiqing.feng@intel.com >
* fix result check
Signed-off-by: jiqing-feng <jiqing.feng@intel.com >
* Update src/transformers/quantizers/quantizer_gptq.py
Co-authored-by: Marc Sun <57196510+SunMarc@users.noreply.github.com >
* Update src/transformers/quantizers/quantizer_gptq.py
Co-authored-by: Marc Sun <57196510+SunMarc@users.noreply.github.com >
* Update src/transformers/quantizers/quantizer_gptq.py
Co-authored-by: Marc Sun <57196510+SunMarc@users.noreply.github.com >
* update tests
Signed-off-by: jiqing-feng <jiqing.feng@intel.com >
* review: update docs (#10 )
* review: update docs (#12 )
* review: update docs
* fix typo
* update tests for gptqmodel
Signed-off-by: jiqing-feng <jiqing.feng@intel.com >
* update document (#9 )
* update overview.md
* cleanup
* Update overview.md
* Update overview.md
* Update overview.md
* update gptq.md
* Update gptq.md
* Update gptq.md
* Update gptq.md
* Update gptq.md
* Update gptq.md
* Update gptq.md
---------
Co-authored-by: Qubitium-ModelCloud <qubitium@modelcloud.ai >
* typo
* doc note for asymmetric quant
* typo with apple silicon(e)
* typo for marlin
* column name revert: review
* doc rocm support
* Update docs/source/en/quantization/gptq.md
Co-authored-by: Steven Liu <59462357+stevhliu@users.noreply.github.com >
* Update docs/source/en/quantization/gptq.md
Co-authored-by: Steven Liu <59462357+stevhliu@users.noreply.github.com >
* Update docs/source/en/quantization/gptq.md
Co-authored-by: Steven Liu <59462357+stevhliu@users.noreply.github.com >
* Update docs/source/en/quantization/gptq.md
Co-authored-by: Steven Liu <59462357+stevhliu@users.noreply.github.com >
* Update docs/source/en/quantization/overview.md
Co-authored-by: Steven Liu <59462357+stevhliu@users.noreply.github.com >
* Update docs/source/en/quantization/overview.md
Co-authored-by: Steven Liu <59462357+stevhliu@users.noreply.github.com >
---------
Signed-off-by: jiqing-feng <jiqing.feng@intel.com >
Co-authored-by: LRL-ModelCloud <165116337+LRL-ModelCloud@users.noreply.github.com >
Co-authored-by: ZX-ModelCloud <zx@modelcloud.ai >
Co-authored-by: Qubitium-ModelCloud <qubitium@modelcloud.ai >
Co-authored-by: ZX-ModelCloud <165115237+ZX-ModelCloud@users.noreply.github.com >
Co-authored-by: LRL <lrl@lbx.dev >
Co-authored-by: Marc Sun <57196510+SunMarc@users.noreply.github.com >
Co-authored-by: Mohamed Mekkouri <93391238+MekkCyber@users.noreply.github.com >
Co-authored-by: Steven Liu <59462357+stevhliu@users.noreply.github.com >
2025-01-15 14:22:49 +01:00
Joao Gante
38f9f10dd9
Cache: revert DynamicCache init for BC ( #33861 )
* tmp commit
* tmp commit
* make fixup
* missing removal
* fix condition
* fix end-to-end compilation
* if -> elif
* BC
* BC
* use @deprecate_kwarg("num_hidden_layers", version="4.47.0")
* wups the import
* 🥴
---------
Co-authored-by: Arthur Zucker <arthur.zucker@gmail.com >
2024-10-04 22:47:08 +02:00
Guang Yang
808997a634
Fix passing str dtype to static cache ( #33741 )
Co-authored-by: Guang Yang <guangyang@fb.com >
2024-10-01 09:50:17 +02:00
Arthur
19d58d31f1
Add MLLama ( #33703 )
* current changes
* nit
* Add cross_attention_mask to processor
* multi-image fixed
* Add cross_attention_mask to processor
* cross attn works in all cases
* WIP refactoring function for image processor
* WIP refactoring image processor functions
* Refactor preprocess to use global loops instead of list nested list comps
* Docstrings
* Add channels unification
* fix dtype issues
* Update docstrings and format
* Consistent max_image_tiles
* current script
* updates
* Add convert to rgb
* Add image processor tests
* updates!
* update
* god damn it I am dumb sometimes
* Precompute aspect ratios
* now this works, full match
* fix 😉
* nits
* style
* fix model and conversion
* nit
* nit
* kinda works
* hack for sdpa non-contiguous bias
* nits here and there
* latest changes
* merge?
* run forward
* Add aspect_ratio_mask
* vision attention mask
* update script and config variable names
* nit
* nits
* be able to load
* style
* nits
* there
* nits
* make forward run
* small update
* enable generation multi-turn
* nit
* nit
* Clean up a bit for errors and typos
* A bit more constant fixes
* 90B keys and shapes match
* Fix for 11B model
* Fixup, remove debug part
* Docs
* Make max_aspect_ratio_id to be minimal
* Update image processing code to match new implementation
* Adjust conversion for final checkpoint state
* Change dim in repeat_interleave (according to meta code)
* tmp fix for num_tiles
* Fix for conversion (gate<->up, q/k_proj rope permute)
* nits
* codestyle
* Vision encoder fixes
* pass cross attn mask further
* Refactor aspect ratio mask
* Disable text-only generation
* Fix cross attention layers order, remove q/k norm rotation for cross attention layers
* Refactor gated position embeddings
* fix bugs but needs test with new weights
* rope scaling should be llama3
* Fix rope scaling name
* Remove debug for linear layer
* fix copies
* Make mask prepare private func
* Remove linear patch embed
* Make precomputed embeddings as nn.Embedding module
* MllamaPrecomputedAspectRatioEmbedding with config init
* Remove unused self.output_dim
* nit, intermediate layers
* Rename ln and pos_embed
* vision_chunk_size -> image_size
* return_intermediate -> intermediate_layers_indices
* vision_input_dim -> hidden_size
* Fix copied from statements
* fix most tests
* Fix more copied from
* layer_id->layer_idx
* Comment
* Fix tests for processor
* Copied from for _prepare_4d_causal_attention_mask_with_cache_position
* Style fix
* Add MllamaForCausalLM
* WIP fixing tests
* Remove duplicated layers
* Remove dummy file
* Fix style
* Fix consistency
* Fix some TODOs
* fix language_model instantiation, add docstring
* Move docstring, remove todos for precomputed embeds (we cannot init them properly)
* Add initial docstrings
* Fix
* fix some tests
* let's skip these
* nits, remove print, style
* Add one more copied from
* Improve test message
* Make validate func private
* Fix dummy objects
* Refactor `data_format` a bit + add comment
* typos/nits
Co-authored-by: Pablo Montalvo <39954772+molbap@users.noreply.github.com>
* fix dummy objects and imports
* Add chat template config json
* remove num_kv_heads from vision attention
* fix
* move some commits and add more tests
* fix test
* Remove `update_key_name` from modeling utils
* remove num-kv-heads again
* some preliminary docs
* Update chat template + tests
* nit, conversion script max_num_tiles from params
* Fix warning for text-only generation
* Update conversion script for instruct models
* Update chat template in conversion + test
* add tests for CausalLM model
* model_max_length, avoid null chat_template
* Refactor conversion script
* Fix forward
* Fix integration tests
* Refactor vision config + docs
* Fix default
* Refactor text config
* Doc fixes
* Remove unused args, fix docs example
* Squashed commit of the following:
commit b51ce5a2efffbecdefbf6fc92ee87372ec9d8830
Author: qubvel <qubvel@gmail.com>
Date: Wed Sep 18 13:39:15 2024 +0000
Move model + add output hidden states and output attentions
* Fix num_channels
* Add mllama text and mllama vision models
* Fixing repo consistency
* Style fix
* Fixing repo consistency
* Fixing unused config params
* Fix failed tests after refactoring
* hidden_activation -> hidden_act for text mlp
* Remove from_pretrained from sub-configs
* Apply suggestions from code review
Co-authored-by: Arthur <48595927+ArthurZucker@users.noreply.github.com>
* Update src/transformers/models/mllama/convert_mllama_weights_to_hf.py
Co-authored-by: Arthur <48595927+ArthurZucker@users.noreply.github.com>
* Reuse lambda in conversion script
* Remove run.py
* Update docs/source/en/model_doc/mllama.md
Co-authored-by: Arthur <48595927+ArthurZucker@users.noreply.github.com>
* Update src/transformers/models/mllama/processing_mllama.py
Co-authored-by: Arthur <48595927+ArthurZucker@users.noreply.github.com>
* Remove unused LlamaTokenizerFast
* Fix logging
* Refactor gating
* Remove cycle for collecting intermediate states
* Refactor text-only check, add integration test for text-only
* Revert from pretrained to configs
* Fix example
* Add auto `bos_token` adding in processor
* Fix tips
* Update src/transformers/models/auto/tokenization_auto.py
Co-authored-by: Arthur <48595927+ArthurZucker@users.noreply.github.com>
* Enable supports_gradient_checkpointing model flag
* add eager/sdpa options
* don't skip attn tests and bring back GC skips (did i really remove those?)
* Fix signature, but get error with None gradient
* Fix output attention tests
* Disable GC back
* Change no split modules
* Fix dropout
* Style
* Add Mllama to sdpa list
* Add post init for vision model
* Refine config for MllamaForCausalLMModelTest and skipped tests for CausalLM model
* if skipped, say it, don't pass
* Clean vision tester config
* Doc for args
* Update tests/models/mllama/test_modeling_mllama.py
Co-authored-by: Arthur <48595927+ArthurZucker@users.noreply.github.com>
* Add cross_attention_mask to test
* typehint
* Remove todo
* Enable gradient checkpointing
* Docstring
* Style
* Fixing and skipping some tests for new cache
* Mark flaky test
* Skip `test_sdpa_can_compile_dynamic` test
* Fixing some offload tests
* Add direct GenerationMixin inheritance
* Remove unused code
* Add initializer_range to vision config
* update the test to make sure we show if split
* fix gc?
* Fix repo consistency
* Undo modeling utils debug changes
* Fix link
* mllama -> Mllama
* [mllama] -> [Mllama]
* Enable compile test for CausalLM model (text-only)
* Fix TextModel prefix
* Update doc
* Docs for forward, type hints, and vision model prefix
* make sure to reset
* fix init
* small script refactor and styling
* nit
* updates!
* some nits
* Interpolate embeddings for 560 size and update integration tests
* nit
* does not support static cache!
* update
* fix
* nit2
* this?
* Fix conversion
* Style
* 4x memory improvement with image cache AFAIK
* Token decorator for tests
* Skip failing tests
* update processor errors
* fix split issues
* style
* weird
* style
* fix failing tests
* update
* nit fixing the whisper tests
* fix path
* update
---------
Co-authored-by: raushan <raushan@huggingface.co>
Co-authored-by: pavel <ubuntu@ip-10-90-0-11.ec2.internal>
Co-authored-by: qubvel <qubvel@gmail.com>
Co-authored-by: Pablo Montalvo <39954772+molbap@users.noreply.github.com>
Co-authored-by: ydshieh <ydshieh@users.noreply.github.com>
Co-authored-by: Pedro Cuenca <pedro@huggingface.co>
2024-09-25 19:56:25 +02:00
Fanli Lin
b87755aa6d
[tests] skip tests for xpu ( #33553 )
...
* enable
* fix
* add xpu skip
* add marker
* skip for xpu
* add more
* add one more
2024-09-19 19:28:04 +01:00
Guang Yang
f38590dade
Make StaticCache configurable at model construct time ( #32830 )
...
* Make StaticCache configurable at model construct time
* integrations import structure
* add new doc file to toc
---------
Co-authored-by: Guang Yang <guangyang@fb.com>
Co-authored-by: Joao Gante <joao@huggingface.co>
2024-09-10 16:35:57 +01:00
Raushan Turganbay
ebbe8d8014
Cache docs: update ( #32929 )
...
* some changes
* more updates
* fix cache copy
* nits
* nits
* add tests
2024-09-04 15:05:31 +05:00
Gerben van V
5129671290
Add a static cache that offloads to the CPU or other device ( #32161 )
...
* Add a static cache that offloads to the CPU or other device
* Fix PR comments, add unit-tests
2024-08-29 11:51:09 +02:00
Joao Gante
cf32ee1753
Cache: use batch_size instead of max_batch_size ( #32657 )
...
* more precise name
* better docstrings
* Update src/transformers/cache_utils.py
Co-authored-by: Arthur <48595927+ArthurZucker@users.noreply.github.com>
---------
Co-authored-by: Arthur <48595927+ArthurZucker@users.noreply.github.com>
2024-08-16 11:48:45 +01:00
Guang Yang
0164560353
Fixed test test_static_cache_exportability with torch 2.4.0 ( #32516 )
...
Work around the export issue in torch 2.4
Co-authored-by: Guang Yang <guangyang@fb.com>
2024-08-08 18:13:40 +01:00
OsamaS99
51ab25e293
Fixed Hybrid Cache Shape Initialization. ( #32163 )
...
* fixed hybrid cache init, added test
* Fix Test Typo
---------
Co-authored-by: Aaron Haag <aaron.haag@siemens.com>
2024-08-01 13:57:42 +01:00
Nikos Karampatziakis
ca59d6f77c
Offloaded KV Cache ( #31325 )
...
* Initial implementation of OffloadedCache
* enable usage via cache_implementation
* Address feedback, add tests, remove legacy methods.
* Remove flash-attn, discover synchronization bugs, fix bugs
* Prevent usage in CPU only mode
* Add a section about offloaded KV cache to the docs
* Fix typos in docs
* Clarifications and better explanation of streams
2024-08-01 14:42:07 +02:00
Guang Yang
811a9caa21
Make static cache compatible with torch.export ( #32168 )
2024-07-29 18:19:15 +01:00
Joao Gante
7ffe25f2b9
Generate: end-to-end compilation ( #30788 )
...
* mvp
* added test (a few models need fixes)
* fix a few test cases
* test nits
* harder test 😈
* revert changes in stablelm
* test with improved condition
* add todo
* tmp commit
* merged with main
* nits
* add todo
* final corrections
* add docs for generation compilation
* docs nits
* add tip
* PR suggestions
* add more details to the compilation docs
* fix cache positions
* cache is now init in generate; update docs
* tag test as flaky
* docs
* post rebase make fixup and other nits
* remove unintended changes
* whisper (encoder-decoder) not supported
* move token default updates to ; add tests for token defaults
* push changes
* manual rebase
* chameleon doesn't support this
* fix test_static_cache_mha_mqa_gqa (broken in another PR)
* docs: dynamic is better with end-to-end compilation
2024-07-29 10:52:13 +01:00
Fanli Lin
27c7f971c0
[tests] fix static cache implementation is not compatible with attn_implementation==flash_attention_2 ( #32039 )
...
* add flash attention check
* fix
* fix
2024-07-26 11:41:27 +02:00
Joao Gante
739a63166d
Generate: remove deprecated code due to Cache and cache_position being default ( #31898 )
...
* tmp commit
* shorter
* nit
* explicit kwargs
* propagate changes
* mass propagation with a few manual touches (let's see how CI behaves)
* fix cacheless case
* Update src/transformers/generation/utils.py
Co-authored-by: Arthur <48595927+ArthurZucker@users.noreply.github.com>
* make fixup
---------
Co-authored-by: Arthur <48595927+ArthurZucker@users.noreply.github.com>
2024-07-14 15:16:58 +01:00
Yih-Dar
93cd94b79d
Move some test files (tests/test_xxx_utils.py) to tests/utils ( #31730 )
...
* move
* move
* move
* move
* Update tests/utils/test_image_processing_utils.py
Co-authored-by: amyeroberts <22614925+amyeroberts@users.noreply.github.com>
---------
Co-authored-by: ydshieh <ydshieh@users.noreply.github.com>
Co-authored-by: amyeroberts <22614925+amyeroberts@users.noreply.github.com>
2024-07-02 13:46:03 +02:00