* fix?
* fixme and style
* Update src/transformers/modeling_utils.py
* update
* update
* fix
* small fixees
* nit
* nits
* fix init check?
* fix
* fix default
* or fucks me
* nits
* include a small nit
* does this make it hapy?
* fixup
* fix the remaining ones
* EP + updates
Co-authored-by: Nouamane Tazi <NouamaneTazi@users.noreply.github.com>
Co-authored-by: drbh <drbh@users.noreply.github.com>
* remove unrelated change
* not working yet but let's see where it goes!
* update the api a bit
* udpate
* where I am at for now
* fix ep
* refactor the API
* yups
* fix
* fixup
* clean modeling
* just support llama4 for now!
* properly avoid
* fix
* nits
* Update src/transformers/models/llama4/modeling_llama4.py
* Update src/transformers/integrations/tensor_parallel.py
* style
* ,,,,
* update
---------
Co-authored-by: Nouamane Tazi <NouamaneTazi@users.noreply.github.com>
Co-authored-by: drbh <drbh@users.noreply.github.com>
* use partial to wrap around `transformers` utils!
* try to refactor?
* revert one wrong change
* just a nit
* push
* reverter watever was wrong!
* some nits
* fixes when there is no attention mask
* bring the licence back
* some fixes
* nit
* style
* remove prints
* correct dtype
* fa flags for testing
* update
* use paged attention if requested!
* updates
* a clone was needed, not sure why
* automatically create cu seq lens when input is flash, this at least makes sure layers don't re-compute
* simplify and improve?
* flash attention is kinda broken on recent cuda version so allow the opportunity to use something else
* fix!
* protect kernels import
* update
* properly parse generation config being passed
* revert and update
* add two tests
* some fixes
* fix test FA2
* takes comment into account
* fixup
* revert changes
* revert the clone, it is only needed because the metal kernel is not doing it?
* [docs] update attention implementation and cache docs (#39547)
* update docs
* Apply suggestions from code review
Co-authored-by: Steven Liu <59462357+stevhliu@users.noreply.github.com>
* applu suggestions
---------
Co-authored-by: Steven Liu <59462357+stevhliu@users.noreply.github.com>
* fix mps on our side for now
* Update src/transformers/integrations/flash_paged.py
* no qa
---------
Co-authored-by: Vasqu <antonprogamer@gmail.com>
Co-authored-by: Raushan Turganbay <raushan@huggingface.co>
Co-authored-by: Steven Liu <59462357+stevhliu@users.noreply.github.com>
* init
* copied from remote
* add proper structure and llama like structure
* fixup
* revert to state that works
* get closer to llama
* slow and steady
* some removal
* masks work
* it is indeed the rope implementation, how dafuq does it mesh with the cache now hmm
* nice
* getting closer
* closer to transformers style
* let's simplify this, batching works now
* simplified
* working version with modular
* it is indeed the rotation per weights, make it complete llama style
* cleanup conversion, next to look at -> tokenizer
* remove llama artefacts
* fix modeling tests (common ones)
* style
* integration test + first look into tokenization (will need more work, focussing on modeling other models first)
* style
* working moe version, based on remote
* lets keep it simple and go step by step - transformers annotations for modular and transformers style rope (complex view)
* more cleanup
* refactor namings and remove addition forXXX classes
* our moe won't cut it it seems, correction bias seems to be missing in remote code version
* tokenization change (remote)
* our moe version works when adding normalization :D
* cleanup moe
* nits
* cleanup modeling -> let's get to modular next
* style
* modular v1
* minor things + attempt at conversion (which doesn't work)
* no conversion follow glm, fixup modular and other nits
* modular cleanup
* fixes
* tests, tests, tests + some moe dtype forcing
* simplify modular, fix fatal fa2 bug, remaining tests
* fix import issue?
* some initial docs, fix bnb faulty behavior --> needs to fix some tests because of gate needing to be float
* fix sdpa test, load on init dtype only
* fixup post merge
* style
* fix doc links
* tokenization cleanup beginnings
* simplify tokenizer by a lot as its basically llama
* tokenizer is full llama with different defaults + extra special tokens
* sync og special tokens of ernie
* fix decoding with numbers (also in remote done what a timing), begin of tok tests
* align with remote and preserve special tokens, adjust tests to ernie legacy behavior, warning for questionable behavior (also in llama)
* nits
* docs
* my daily post merge it is
* check
* tokenization update with explanations and conversion script
* review on modular (til), revert some tokenizer things i did prior, remove mtp comment (low prio)
* post merge fixes
* fixup tokenization, llama fast is the way to go
* more fixups
* check
* import fixes
* correction bias following the paddle code
* fix
* fix TP plan, fix correction bias sharding during forward
* style
* whoops
* fix tied weights
* docs and last nit
* license
* flasky tests
* move repo id, update when merged on the hub
* simplify common get/set
* remove some noise
* change some 5 years old modeling utils
* update examples
* fix copies
* revert some changes
* fixes, gah
* format
* move to Mixin
* remove smolvlm specific require grad
* skip
* force defaults
* remodularise some stuff
* remodularise more stuff
* add safety for audio models
* style
* have a correct fallback, you daft donkey
* remove this argh
* change heuristic for audio models
* fixup
* revert
* this works
* revert again
* 🧠
* aaah ESM has two modelings aaah
* add informative but short comment
* add `input_embed_layer` mixin attribute
* style
* walrus has low precedence
* modular fix
* this was breaking parser
* fix type order
* change all Union[str, dict] to Union[dict, str]
* add hf_parser test && fix test order
* add deepspeed dependency
* replace deepspeed with accelerator
* dump
* push other models
* fix simple greedy generation
* xmod
* add fmst and clean up some mentions of old cache format
* gpt-bigcode now follows standards
* delete tuple cache reference in generation
* fix some models
* fix some models
* fix mambas and support cache in tapas
* fix some more tests
* fix copies
* delete `_reorder_cache`
* another fix copies
* fix typos and delete unnecessary test
* fix rag generate, needs special cache reordering
* fix tapas and superglue
* reformer create special cache
* recurrent gemma `reorder_cache` was a no-op, delete
* fix-copies
* fix blio and musicgen pipeline tests
* fix reformer
* fix reformer, again...
* delete `_supports_cache_class`
* delete `supports_quantized_cache`
* fix failing tests
* fix copies
* some minor clean up
* style
* style
* fix copies
* fix tests
* fix copies
* create causal mask now needs positions?
* fixc copies
* style
* Update tests/test_modeling_common.py
Co-authored-by: Joao Gante <joaofranciscocardosogante@gmail.com>
* clean-up of non-generative model after merging main
* check `is_decoder` for cache
* delete transpose for scores
* remove tuple cache from docs everywhere
* fix tests
* fix copies
* fix copies once more
* properly deprecate `encoder_attention_mask` in Bert-like models
* import `deprecate_kwarg` where needed
* fix copies again
* fix copies
* delete `nex_decoder_cache`
* fix copies asks to update for PLM
* fix copies
* rebasing had a few new models, fix them and merge asap!
* fix copies once more
* fix slow tests
* fix tests and updare PLM checkpoint
* add read token and revert accidentally removed line
* oh com -on, style
* just skip it, read token has no access to PLM yet
---------
Co-authored-by: Joao Gante <joaofranciscocardosogante@gmail.com>
* fix vlm with retrieval
* we can't use AutoModel because new ColQwen was released after refactor
* no need for colqwen
* tied weight keys are necessary, if using IMageTextToText
* need to apply renaming in tied weights, only for ColPali
* overwrite tied keys in ColPali
* fix copies, modular can't handle if-statements
* just update 2 files
* update other models as well just making fix-copies
* also add the changes needed to modeling utils
* put this on the pretrained model instead
* nits and fixes
* update generic, fix to use config value
* update other modelings
* use transformers kwargs instead
* update
* update
* update other models
* update
* updates
* update
* update
* update
* fix
* finally
* very small nits
* this fixes more tests
* fix other models as well!
* update modularqwen2
* update models based on qwen2
* update
* update
* remove the **flash stuff in favor of noraml kwargs
* update
* propagate gemma?
* remove output attentions
* propagate
* support cross attention edge case
* same
* test this
* fixes
* more fix
* update
* update
* fix conflicts
* update
* fix emu3
* fix emu3
* move the fix a bit
* quel enfer
* some fixes, loss_kwargs should never had been
* finish fixing gemma3n
* fix small lm3
* fix another one
* fix csm now
* fux csm and mistral
* fix mistral now
* small fixes
* fix janusss
* only for some models
* fixup
* phix phi3
* more fixes?
* dose this fix it?
* update
* holy shit it was just graph breaks
* protect torch
* updates
* fix samhq?
* fix moonshine
* more moonshine fixes, 3 failures left!
* nits
* generic needs to support more
* more fixes to moonshine!
* fix cross attention outputs!
* fix csm!
* nits
* fix stupid kosmos2
* current updates
* fixes
* use output recorder?
* nicer!
* a little bit of magic
* update
* fix protect
* fix
* small fixes
* protect import
* fix a bunch of more models
* fix fixups
* fix some of the last ones
* nit
* partly fix phi
* update
* fix import path
* make something that is fullgraph compatible just to be sure
* typing was wrong on llama so the rest was wrong as well
* fucking ugly but at least it is still exportable
* syle
* supposed to fix moonshine, it still breaks
* fix some default
* fix the last bits of sam
* update samhq
* more fixes to am hq
* nit
* fix all output+hidden states and output_attentions!
* fix?
* fix diffllama
* updates to fix initialization on the sam pips
* ups there was a bug
* fix the last sam hq test
* fix gotocr
* fix gotocr2!
* fixes
* skip stupid tests
* there was one left :)
* fixup
* fix fix copies issues with this test file
* fix copies for sam_hq
* rm some comments
* skip 2 more failing tests
* fix
* fix everything
* Apply suggestions from code review
Co-authored-by: Anton Vlasjuk <73884904+vasqu@users.noreply.github.com>
Co-authored-by: Pablo Montalvo <39954772+molbap@users.noreply.github.com>
* add more doc!
* fix public init
* fix modular qwen3
---------
Co-authored-by: Anton Vlasjuk <73884904+vasqu@users.noreply.github.com>
Co-authored-by: Pablo Montalvo <39954772+molbap@users.noreply.github.com>
* add dia model
* add tokenizer files
* cleanup some stuff
* brut copy paste code
* rough cleanup of the modeling code
* nuke some stuff
* more nuking
* more cleanups
* updates
* add mulitLayerEmbedding vectorization
* nits
* more modeling simplifications
* updates
* update rope
* update rope
* just fixup
* update configuration files
* more cleanup!
* default config values
* update
* forgotten comma
* another comma!
* update, more cleanups
* just more nits
* more config cleanups
* time for the encoder
* fix
* sa=mall nit
* nits
* n
* refacto a bit
* cleanup
* update cv scipt
* fix last issues
* fix last nits
* styling
* small fixes
* just run 1 generation
* fixes
* nits
* fix conversion
* fix
* more fixes
* full generate
* ouf!
* fixes!
* updates
* fix
* fix cvrt
* fixup
* nits
* delete wrong test
* update
* update
* test tokenization
* let's start changing things bit by bit - fix encoder step
* removing custom generation, moving to GenerationMixin
* add encoder decoder attention masks for generation
* mask changes, correctness checked against ad29837 in dia repo
* refactor a bit already --> next cache
* too important not to push :)
* minimal cleanup + more todos
* make main overwrite modeling utils
* add cfg filter & eos filter
* add eos countdown & delay pattern
* update eos countdown
* add max step eos countdown
* fix tests
* fix some things
* fix generation with testing
* move cfg & eos stuff to logits processor
* make RepetitionPenaltyLogitsProcessor flexible
- can accept 3D scores like (batch_size, channel, vocab)
* fix input_ids concatenation dimension in GenerationMixin for flexibility
* Add DiaHangoverLogitsProcessor and DiaExponentialDecayLengthPenalty classes; refactor logits processing in DiaForConditionalGeneration to utilize new configurations and improve flexibility.
* Add stopping criteria
* refactor
* move delay pattern from processor to modeling like musicgen.
- add docs
- change eos countdown to eos delay pattern
* fix processor & fix tests
* refactor types
* refactor imports
* format code
* fix docstring to pass ci
* add docstring to DiaConfig & add DiaModel to test
* fix docstring
* add docstring
* fix some bugs
* check
* porting / merging results from other branch - IMPORTANT: it very likely breaks generation, the goal is to have a proper forward path first
* experimental testing of left padding for first channel
* whoops
* Fix merge to make generation work
* fix cfg filter
* add position ids
* add todos, break things
* revert changes to generation --> we will force 2d but go 3d on custom stuff
* refactor a lot, change prepare decoder ids to work with left padding (needs testing), add todos
* some first fixes to get to 10. in generation
* some more generation fixes / adjustment
* style + rope fixes
* move cfg out, simplify a few things, more todos
* nit
* start working on custom logit processors
* nit
* quick fixes
* cfg top k
* more refactor of logits processing, needs a decision if gen config gets the new attributes or if we move it to config or similar
* lets keep changes to core code minimal, only eos scaling is questionable atm
* simpler eos delay logits processor
* that was for debugging :D
* proof of concept rope
* small fix on device mismatch
* cfg fixes + delay logits max len
* transformers rope
* modular dia
* more cleanup
* keep modeling consistently 3D, generate handles 2D internally
* decoder starts with bos if nothing
* post processing prototype
* style
* lol
* force sample / greedy + fixes on padding
* style
* fixup tokenization
* nits
* revert
* start working on dia tests
* fix a lot of tests
* more test fixes
* nit
* more test fixes + some features to simplify code more
* more cleanup
* forgot that one
* autodocs
* small consistency fixes
* fix regression
* small fixes
* dia feature extraction
* docs
* wip processor
* fix processor order
* processing goes brrr
* transpose before
* small fix
* fix major bug but needs now a closer look into the custom processors esp cfg
* small thing on logits
* nits
* simplify indices and shifts
* add simpler version of padding tests back (temporarily)
* add logit processor tests
* starting tests on processor
* fix mask application during generation
* some fixes on the weights conversion
* style + fixup logits order
* simplify conversion
* nit
* remove padding tests
* nits on modeling
* hmm
* fix tests
* trigger
* probably gonna be reverted, just a quick design around audio tokenizer
* fixup typing
* post merge + more typing
* initial design for audio tokenizer
* more design changes
* nit
* more processor tests and style related things
* add to init
* protect import
* not sure why tbh
* add another protect
* more fixes
* wow
* it aint stopping :D
* another missed type issue
* ...
* change design around audio tokenizer to prioritize init and go for auto - in regards to the review
* change to new causal mask function + docstrings
* change ternary
* docs
* remove todo, i dont think its essential tbh
* remove pipeline as current pipelines do not fit in the current scheme, same as csm
* closer to wrapping up the processor
* text to audio, just for demo purposes (will likely be reverted)
* check if it's this
* save audio function
* ensure no grad
* fixes on prefixed audio, hop length is used via preprocess dac, device fixes
* integration tests (tested locally on a100) + some processor utils / fixes
* style
* nits
* another round of smaller things
* docs + some fixes (generate one might be big)
* msytery solved
* small fix on conversion
* add abstract audio tokenizer, change init check to abstract class
* nits
* update docs + fix some processing :D
* change inheritance scheme for audio tokenizer
* delete dead / unnecessary code in copied generate loop
* last nits on new pipeline behavior (+ todo on tests) + style
* trigger
---------
Co-authored-by: Arthur Zucker <arthur.zucker@gmail.com>
Co-authored-by: Arthur <48595927+ArthurZucker@users.noreply.github.com>
Co-authored-by: Vasqu <antonprogamer@gmail.com>
* Support `flash_attn_3`
Implements fwd and tests for Flash Attention 3 https://github.com/Dao-AILab/flash-attention/commits/main/hopper
- Includes checks for dropout>0 and ALiBi in `modeling_utils.PreTrainedModel._check_and_enable_flash_attn_3` (Dropout will likely be supported soon, so this will need to be updated and `modeling_flash_attention_utils._flash_attention_forward` at the `if _IS_FLASH_ATTN_3_AVAILABLE: ...`
An example Llama implementation is included in `modeling_llama.py` but other models would still need to be updated
Based on https://github.com/huggingface/transformers/pull/36190 which has model implementations and examples which could be merged
* Add tests for Flash Attention 2 and 3 parity
* ci fix
* FA2 compatibiity
- `_prepare_flash_attention_from_position_ids` ->`prepare_fa2_from_position_ids`
- Remove bettertransformer check in Flash Attention 3
- Merge tests
- Add licensing
* ci fix
* Test naming consistency
* ci fix
* Deprecation warning for `prepare_fa2_from_position_ids`
* ci fix
* Fix HQQ model param device transfer issue
* modify a comment
* clear the code and add test for hqq device/dtype
* fix test hqq code quality of imports
---------
Co-authored-by: Marc Sun <57196510+SunMarc@users.noreply.github.com>
* No more Tuple, List, Dict
* make fixup
* More style fixes
* Docstring fixes with regex replacement
* Trigger tests
* Redo fixes after rebase
* Fix copies
* [test all]
* update
* [test all]
* update
* [test all]
* make style after rebase
* Patch the hf_argparser test
* Patch the hf_argparser test
* style fixes
* style fixes
* style fixes
* Fix docstrings in Cohere test
* [test all]
---------
Co-authored-by: ydshieh <ydshieh@users.noreply.github.com>