[Ernie 4.5] Add ernie text models (#39228)

* init * copied from remote * add proper structure and llama like structure * fixup * revert to state that works * get closer to llama * slow and steady * some removal * masks work * it is indeed the rope implementation, how dafuq does it mesh with the cache now hmm * nice * getting closer * closer to transformers style * let's simplify this, batching works now * simplified * working version with modular * it is indeed the rotation per weights, make it complete llama style * cleanup conversion, next to look at -> tokenizer * remove llama artefacts * fix modeling tests (common ones) * style * integration test + first look into tokenization (will need more work, focussing on modeling other models first) * style * working moe version, based on remote * lets keep it simple and go step by step - transformers annotations for modular and transformers style rope (complex view) * more cleanup * refactor namings and remove addition forXXX classes * our moe won't cut it it seems, correction bias seems to be missing in remote code version * tokenization change (remote) * our moe version works when adding normalization :D * cleanup moe * nits * cleanup modeling -> let's get to modular next * style * modular v1 * minor things + attempt at conversion (which doesn't work) * no conversion follow glm, fixup modular and other nits * modular cleanup * fixes * tests, tests, tests + some moe dtype forcing * simplify modular, fix fatal fa2 bug, remaining tests * fix import issue? * some initial docs, fix bnb faulty behavior --> needs to fix some tests because of gate needing to be float * fix sdpa test, load on init dtype only * fixup post merge * style * fix doc links * tokenization cleanup beginnings * simplify tokenizer by a lot as its basically llama * tokenizer is full llama with different defaults + extra special tokens * sync og special tokens of ernie * fix decoding with numbers (also in remote done what a timing), begin of tok tests * align with remote and preserve special tokens, adjust tests to ernie legacy behavior, warning for questionable behavior (also in llama) * nits * docs * my daily post merge it is * check * tokenization update with explanations and conversion script * review on modular (til), revert some tokenizer things i did prior, remove mtp comment (low prio) * post merge fixes * fixup tokenization, llama fast is the way to go * more fixups * check * import fixes * correction bias following the paddle code * fix * fix TP plan, fix correction bias sharding during forward * style * whoops * fix tied weights * docs and last nit * license * flasky tests * move repo id, update when merged on the hub
2025-07-21 19:51:49 +02:00
parent 69b158260f
commit b4115a426e
23 changed files with 2956 additions and 2 deletions
--- a/tests/causal_lm_tester.py
+++ b/tests/causal_lm_tester.py
@@ -104,9 +104,11 @@ class CausalLMModelTester:
        is_decoder=False,
        scope=None,
        expert_interval=1,
+        moe_layer_start_index=0,
        moe_intermediate_size=12,
        shared_expert_intermediate_size=36,
        shared_expert_gate=True,
+        moe_num_shared_experts=2,
        num_experts_per_tok=2,
        num_experts=8,
        mamba_n_groups=1,
@@ -146,9 +148,11 @@ class CausalLMModelTester:
        self.head_dim = self.hidden_size // self.num_attention_heads
        self.is_decoder = is_decoder
        self.expert_interval = expert_interval
+        self.moe_layer_start_index = moe_layer_start_index
        self.moe_intermediate_size = moe_intermediate_size
        self.shared_expert_intermediate_size = shared_expert_intermediate_size
        self.shared_expert_gate = shared_expert_gate
+        self.moe_num_shared_experts = moe_num_shared_experts
        self.num_experts_per_tok = num_experts_per_tok
        self.num_experts = num_experts
        self.mamba_n_groups = mamba_n_groups