Add DeepSeek V2 Model into Transformers (#36400)
* add initial structure * doc fixes, add model base logic * update init files * some fixes to config and modular * some improvements for attention * format * remove unused attn * some fixes for moe layer and for decoder * adapt _compute_yarn_parameters for deepseek * format * small fix * fix for decoder forward * add tests, small refactoring * fix dummies * fix init * fix doc * fix config docs * add sequce doc, fix init for gate * fix issues in tests * fix config doc * remove unused args * some fixes and refactoring after review * fix doc for config * small fixes for config args * revert config refactoring * small refactoring * minor fixes after rebase * small fix after merge * fix modular * remove rotaryembd from public init * small test fix * some rotary pos calculation improvement * fix format * some improvements and fixes * fix config * some refactoring * adjust some unit tests * skip test * small fixes and tests adjustment * reapply modular * fix all tests except Integration * fix integration testzs * cleanup BC stuff * rope * fix integrations tests based on a10 * style --------- Co-authored-by: Cyril Vallez <cyril.vallez@huggingface.co> Co-authored-by: Cyril Vallez <cyril.vallez@gmail.com>
This commit is contained in:
committed by
GitHub
parent
accbd8e0fe
commit
c980904204
@@ -709,6 +709,8 @@
|
||||
title: D-FINE
|
||||
- local: model_doc/dab-detr
|
||||
title: DAB-DETR
|
||||
- local: model_doc/deepseek_v2
|
||||
title: DeepSeek-V2
|
||||
- local: model_doc/deformable_detr
|
||||
title: Deformable DETR
|
||||
- local: model_doc/deit
|
||||
|
||||
49
docs/source/en/model_doc/deepseek_v2.md
Normal file
49
docs/source/en/model_doc/deepseek_v2.md
Normal file
@@ -0,0 +1,49 @@
|
||||
<!--Copyright 2025 The HuggingFace Team. All rights reserved.
|
||||
|
||||
Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with
|
||||
the License. You may obtain a copy of the License at
|
||||
|
||||
http://www.apache.org/licenses/LICENSE-2.0
|
||||
|
||||
Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on
|
||||
an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the
|
||||
specific language governing permissions and limitations under the License.
|
||||
|
||||
⚠️ Note that this file is in Markdown but contain specific syntax for our doc-builder (similar to MDX) that may not be
|
||||
rendered properly in your Markdown viewer.
|
||||
|
||||
-->
|
||||
|
||||
# DeepSeek-V2
|
||||
|
||||
## Overview
|
||||
|
||||
The DeepSeek-V2 model was proposed in [DeepSeek-V2: A Strong, Economical, and Efficient Mixture-of-Experts Language Model](https://arxiv.org/abs/2405.04434) by DeepSeek-AI Team.
|
||||
|
||||
The abstract from the paper is the following:
|
||||
We present DeepSeek-V2, a strong Mixture-of-Experts (MoE) language model characterized by economical training and efficient inference. It comprises 236B total parameters, of which 21B are activated for each token, and supports a context length of 128K tokens. DeepSeek-V2 adopts innovative architectures including Multi-head Latent Attention (MLA) and DeepSeekMoE. MLA guarantees efficient inference through significantly compressing the Key-Value (KV) cache into a latent vector, while DeepSeekMoE enables training strong models at an economical cost through sparse computation. Compared with DeepSeek 67B, DeepSeek-V2 achieves significantly stronger performance, and meanwhile saves 42.5% of training costs, reduces the KV cache by 93.3%, and boosts the maximum generation throughput to 5.76 times. We pretrain DeepSeek-V2 on a high-quality and multi-source corpus consisting of 8.1T tokens, and further perform Supervised Fine-Tuning (SFT) and Reinforcement Learning (RL) to fully unlock its potential. Evaluation results show that, even with only 21B activated parameters, DeepSeek-V2 and its chat versions still achieve top-tier performance among open-source models.
|
||||
|
||||
This model was contributed by [VladOS95-cyber](https://github.com/VladOS95-cyber).
|
||||
The original code can be found [here](https://huggingface.co/deepseek-ai/DeepSeek-V2).
|
||||
|
||||
### Usage tips
|
||||
The model uses Multi-head Latent Attention (MLA) and DeepSeekMoE architectures for efficient inference and cost-effective training. It employs an auxiliary-loss-free strategy for load balancing and multi-token prediction training objective. The model can be used for various language tasks after being pre-trained on 14.8 trillion tokens and going through Supervised Fine-Tuning and Reinforcement Learning stages.
|
||||
|
||||
## DeepseekV2Config
|
||||
|
||||
[[autodoc]] DeepseekV2Config
|
||||
|
||||
## DeepseekV2Model
|
||||
|
||||
[[autodoc]] DeepseekV2Model
|
||||
- forward
|
||||
|
||||
## DeepseekV2ForCausalLM
|
||||
|
||||
[[autodoc]] DeepseekV2ForCausalLM
|
||||
- forward
|
||||
|
||||
## DeepseekV2ForSequenceClassification
|
||||
|
||||
[[autodoc]] DeepseekV2ForSequenceClassification
|
||||
- forward
|
||||
Reference in New Issue
Block a user