Add Doge model (#35891)
* Add Doge Model
* Fix code quality
* Rollback an error commit
* Fix config for open-source weights
* Revert "Fix config for open-source weights"
  This reverts commit 229cdcac10a6a4274d1dd13b729bc14c98eb0c76.
* Add modular_doge
* Update Doge inherits from Llama
* Fix import bug
* [docs] Add usage of doge model
* Fix Doge import pretrainedconfig from modeling_utils to configuration_utils
* [docs] remove trust remote code from doge
* Fix dynamo bug in doge model
* Update docstrings
* Import apply_rotary_pos_emb and repeat_kv from Llama
* Fix all nits
* Fix code quality
* Fix some bugs
* Fix code quality
* Remove inherited `_update_causal_mask` from Llama
  This leads to incorrect weight initialization.
* Fix the wrong tensor orderings in DogeCDMoE
* Fix attention mask bug
  We have to provide attention_mask for dynamic mask computation
* Modify most implementations to inherit from Llama
  But there are two problems:
  1. `flex_attention_forward` is not updated properly
  2. `Example` error in the forward method of DogeForCausalLM
* Modify CDMoE for batch efficient implementation
* Uniform MoE configuration names, just like QwenMoE
* Fix code quality
* Fix code quality
* Fix code quality
* Add tp plan of CDMoE Module
* Hybrid DMA with sliding window
* Update valid tokens greater than window size
* Fix code quality
* Add `convert_doge_weights_to_hf`
* Fix STATE_DICT_MAPPING in convert_doge_weights_to_hf.py
* Fix nits in modular_doge
* Fix code quality
* Fix all nits
* Fix all nits
* Make sure the attention function is updated inside the class
* Fix code quality issues in the Doge model and add a test for it
* Fix `test_generate`
* Fix code quality
* Fix nits following suggestions
* Fix code quality
* Fix code quality issues
* Fix nits
* Fix code quality nits
* Fix the missing parameters in the configuration.
* Fix the missing parameters in the configuration.
* Fix nits
* Add initialization of attention
* Fix last nits
* Simplify dynamic mask generation logic
* Rename router_logits to gate_logits for matching latest changes of MixtralModel
* Rename typings for matching latest changes of MixtralModel
* Fixes typo in comment
* Update src/transformers/models/doge/modular_doge.py
  Co-authored-by: Arthur <48595927+ArthurZucker@users.noreply.github.com>
* Fix code quality issues to match other modular
* Fix code quality issues to match other modular
* Fix the static compilation errors
* Update model weights link
* Fix code quality issues to match other modular
* reapply modular and support for new outputs
* style
* simplify a lot
* fix import location
* reapply modular
* fix
* fix integration test

---------

Co-authored-by: Arthur <48595927+ArthurZucker@users.noreply.github.com>
Co-authored-by: Cyril Vallez <cyril.vallez@huggingface.co>
Co-authored-by: Cyril Vallez <cyril.vallez@gmail.com>
103
docs/source/en/model_doc/doge.md
Normal file
@@ -0,0 +1,103 @@
<!--Copyright 2025 The HuggingFace Team. All rights reserved.

Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with
the License. You may obtain a copy of the License at

http://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on
an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the
specific language governing permissions and limitations under the License.

⚠️ Note that this file is in Markdown but contains specific syntax for our doc-builder (similar to MDX) that may not be
rendered properly in your Markdown viewer.

-->

# Doge

## Overview

Doge is a series of small language models built on the [Doge](https://github.com/SmallDoges/small-doge) architecture. The architecture aims to combine the advantages of state-space and self-attention algorithms: it computes dynamic masks from cached value states using the zero-order hold method, which addresses the tendency of mainstream language models to get lost in context. Doge is pre-trained on the `smollm-corpus` with the `wsd_scheduler` scheduler, and checkpoints from the stable stage can be used to continue training on new datasets or to add sparsely activated feed-forward networks.
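
The scheduler name suggests a warmup-stable-decay (WSD) learning-rate schedule. Assuming that reading, the sketch below illustrates why a stable-stage checkpoint is a convenient place to resume from: the learning rate is constant there, so training can be extended before the final decay is applied. The function name `wsd_lr`, its parameters, and the linear warmup and decay shapes are illustrative assumptions, not the actual Doge training code.

```python
def wsd_lr(step, peak_lr, warmup_steps, stable_steps, decay_steps, min_lr=0.0):
    """Hypothetical warmup-stable-decay schedule (linear warmup, linear decay)."""
    # Warmup: ramp linearly from 0 up to peak_lr.
    if step < warmup_steps:
        return peak_lr * step / warmup_steps
    # Stable: hold peak_lr; checkpoints taken here can be resumed or extended.
    if step < warmup_steps + stable_steps:
        return peak_lr
    # Decay: move linearly from peak_lr to min_lr, then stay at min_lr.
    if decay_steps <= 0:
        return min_lr
    progress = min((step - warmup_steps - stable_steps) / decay_steps, 1.0)
    return peak_lr + (min_lr - peak_lr) * progress


# Example: 10 warmup steps, 20 stable steps, 10 decay steps.
schedule = [wsd_lr(s, 1.0, 10, 20, 10) for s in range(45)]
```

Continuing from a stable-stage checkpoint then amounts to lengthening `stable_steps` (or switching datasets) before the decay phase starts.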

<img src="https://huggingface.co/datasets/huggingface/documentation-images/resolve/refs%2Fpr%2F426/transformers/model_doc/doge_architecture.png" alt="drawing" width="600"/>

As shown in the figure above, the sequence transformation part of the Doge architecture uses `Dynamic Mask Attention`. It can be understood as self-attention conditioned on value states during training, and as a state-space model without past-state decay during inference, which addresses the tendency of existing Transformers and SSMs to get lost in long text. The state transformation part of Doge uses `Cross Domain Mixture of Experts`, which consists of dense linear layers and sparse embedding layers. Sparse parameters can be added to a dense weight checkpoint and training continued from there without retraining the entire model, which reduces the cost of iterating on the model. In addition, Doge uses `RMSNorm` and `Residual` with learnable parameters to adapt the gradient range of deep models.
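
To make the idea of a dynamic mask more concrete, here is a minimal pure-Python sketch. It assumes each key position already has a scalar bias (in Doge this would be derived from the value states; here it is simply passed in) and that each query keeps only its highest-bias visible keys. The function name `dynamic_causal_mask` and the parameter `keep_window_size` are hypothetical, and this is an illustration of the concept rather than the model's actual implementation.

```python
def dynamic_causal_mask(key_bias, keep_window_size):
    """Build an additive [seq_len, seq_len] attention mask.

    key_bias: one float per key position (assumed to be precomputed).
    keep_window_size: maximum number of keys each query may attend to.
    """
    neg_inf = float("-inf")
    seq_len = len(key_bias)
    mask = []
    for query in range(seq_len):
        # Causal constraint: a query only sees keys at or before its position.
        visible = range(query + 1)
        # Dynamic part: among visible keys, keep those with the largest bias.
        ranked = sorted(visible, key=lambda k: key_bias[k], reverse=True)
        kept = set(ranked[:keep_window_size])
        # Kept keys contribute their bias; every other key is masked out.
        mask.append([key_bias[k] if k in kept else neg_inf for k in range(seq_len)])
    return mask


# Example: four tokens, each query may attend to at most two keys.
mask = dynamic_causal_mask([0.1, 2.0, -1.0, 0.5], keep_window_size=2)
for row in mask:
    print(row)
```

The resulting mask would be added to the attention scores before the softmax, so masked positions receive zero attention weight.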

Check out all Doge model checkpoints [here](https://huggingface.co/collections/SmallDoge/doge-slm-679cc991f027c4a3abbded4a).


## Usage

<details>
<summary>Using Doge-Base for text generation</summary>

```python
from transformers import AutoTokenizer, AutoModelForCausalLM

# Load the tokenizer and the base (pre-trained) checkpoint
tokenizer = AutoTokenizer.from_pretrained("SmallDoge/Doge-20M")
model = AutoModelForCausalLM.from_pretrained("SmallDoge/Doge-20M")
inputs = tokenizer("Hey how are you doing?", return_tensors="pt")

# Generate a continuation and decode it back to text
outputs = model.generate(**inputs, max_new_tokens=100)
print(tokenizer.batch_decode(outputs))
```
</details>

<details>
<summary>Using Doge-Instruct for question answering</summary>

```python
from transformers import AutoTokenizer, AutoModelForCausalLM, GenerationConfig, TextStreamer

tokenizer = AutoTokenizer.from_pretrained("SmallDoge/Doge-20M-Instruct")
model = AutoModelForCausalLM.from_pretrained("SmallDoge/Doge-20M-Instruct")

# Sampling settings used for generation
generation_config = GenerationConfig(
    max_new_tokens=100,
    use_cache=True,
    do_sample=True,
    temperature=0.8,
    top_p=0.9,
    repetition_penalty=1.0
)
# Print tokens to stdout as they are generated, omitting the prompt
streamer = TextStreamer(tokenizer=tokenizer, skip_prompt=True)

prompt = "Hi, how are you doing today?"
conversation = [
    {"role": "user", "content": prompt}
]
# Format the conversation with the tokenizer's chat template and tokenize it
inputs = tokenizer.apply_chat_template(
    conversation=conversation,
    tokenize=True,
    return_tensors="pt",
)

outputs = model.generate(
    inputs,
    tokenizer=tokenizer,
    generation_config=generation_config,
    streamer=streamer
)
```
</details>
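
For readers unfamiliar with the `temperature` and `top_p` settings in the `GenerationConfig` above, the following standalone sketch shows the general technique they refer to: logits are divided by the temperature, converted to probabilities, and only the smallest set of most likely tokens whose cumulative probability reaches `top_p` is kept for sampling. The helper `top_p_filter` is a hypothetical name written for this illustration; it is a simplified model of the idea, not the code path used inside `generate`.

```python
import math


def top_p_filter(logits, temperature=0.8, top_p=0.9):
    """Return {token_id: probability} for the tokens kept by temperature + top-p."""
    # Temperature scaling: values below 1 sharpen the distribution.
    scaled = [x / temperature for x in logits]
    # Numerically stable softmax.
    peak = max(scaled)
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    probs = [e / total for e in exps]
    # Walk tokens from most to least likely until top_p mass is covered.
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    kept, cumulative = [], 0.0
    for token_id in order:
        kept.append(token_id)
        cumulative += probs[token_id]
        if cumulative >= top_p:
            break
    # Renormalize so the kept probabilities sum to 1.
    norm = sum(probs[i] for i in kept)
    return {i: probs[i] / norm for i in kept}


# Example with a four-token vocabulary.
nucleus = top_p_filter([2.0, 1.0, 0.0, -5.0], temperature=0.8, top_p=0.9)
```

Sampling then draws the next token from `nucleus` instead of the full vocabulary, which trims the unlikely tail while keeping some diversity.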

## DogeConfig

[[autodoc]] DogeConfig

## DogeModel

[[autodoc]] DogeModel
    - forward

## DogeForCausalLM

[[autodoc]] DogeForCausalLM
    - forward

## DogeForSequenceClassification

[[autodoc]] DogeForSequenceClassification
    - forward