Add support for DeepseekAI's DeepseekVL (#36248)

* upload initial code

* update deepseek-vl adaptor

* update hierarchy of vision model classes

* update aligner model

* add text model

* Added Image Processor

* Added Image Processor

* Added Image Processor

* apply masks

* remove projection; add aligner

* remove interpolate_pos_encoding

* remove unused params in config

* cleaning

* Add the __init__ file

* added processing deepseek_vl class

* modified the deepseek-vl processor

* modified the deepseek-vl processor

* update __init__

* Update the image processor class name

* Added Deepseek to src/transformers/__init__.py file

* Added Deepseek to image_processing_auto.py

* update the __init__ file

* update deepseek_vl image processor

* Update Deepseek Processor

* upload fast image processor

* Revert "upload fast image processor"

This reverts commit 68c8fd50bafbb9770ac70c9de02448e2519219b4.

* update image processor

* flatten hierarchy

* remove DeepseekVLModel

* major update (complete modeling)

* auto modeling and other files

* formatting

* fix quality

* replace torchvision in modeling

* set default do_normalize to False

* add fast image processor template using tool

* update image processors

* add fast image processor to other files

* update license

* Added deepseek image testcases

* update image test

* update processor

* write CHAT_TEMPLATE

* update model for processor

* fix processor

* minor fixes and formatting

* fix image processing and tests

* fix interpolation in sam

* fix output_attentions in DeepseekVLModel

* upload test_modeling

* fix tests because of vocab size

* set use_high_res_vision=False in tests

* fix all modeling tests

* fix styling

* remove explicit background_color from image processors

* added test_processor

* added test_processor

* fix processor tests

* update docs

* update docs

* update docs

* update conversion script

* Fixed typos

* minor fixes from review

- remove model_id comments in examples
- remove from pre-trained auto mapping
- move to image-text-to-text from vision-to-seq in auto mapping
- add image_token_index to __init__ for config
- remove outdated temporary config in conversion script
- update example to use chat_template in docstring example
- update license 2021->2025

* fix type in config docstring

Co-authored-by: Raushan Turganbay <raushan.turganbay@alumni.nu.edu.kz>

* update get_image_features

* fix config

* improve DeepseekVLImageProcessor.preprocess

* return image_hidden_states

* use AutoTokenizer and AutoImageProcessor in Processor

* fix model outputs

* make num_image_tokens configurable

* fix docstring of processor

* move system prompt to chat template

* fix repo consistency

* fix return_dict

* replace SamVisionEncoder with SamVisionModel

* update to remove deepcopy

* 🛠️  Major Architectural Changes (Adds DeepseekVLHybrid)

* fix quality checks

* add missing hybrid in auto modeling

* run make style

* update sam_hq

* update high_res_size in test

* update docs following #36979

* update code with auto_docstring

* update conversion scripts

* fix style

* fix failing test because of tuple

* set weights_only=True in conversion script

* use safetensors.torch.load_file instead of torch.load in conversion script

* make output_dir optional in conversion script

* fix code snippets in docs (now the examples work fine)

* integration tests for DeepseekVL

* update expected texts

* make style

* integration tests for DeepseekVLHybrid

* fix class name

* update expected texts for hybrid

* run "make style"

* update since changes in main

* run make-style

* nits since changes in main

* undo changes in sam

* fix tests

* fix tests; update with main

* update with main: output_attention/output_hidden_states

* fix copied part in deepseek_vl

* run fix-copies

* fix output_hidden_states

* sam: fix _init_weigths

* use modular for DeepseekVL

* make image processor more modular

* modular: use JanusPreTrainedModel

* janus: provide kwargs in loss

* update processors in conversion script

* Revert "sam: fix _init_weigths"

This reverts commit db625d0c68956c0dad45edd7a469b6a074905c27.

* run fix-copies

---------

Co-authored-by: Shakib-IO <shakib.khan17@northsouth.edu>
Co-authored-by: Raushan Turganbay <raushan.turganbay@alumni.nu.edu.kz>
Armaghan Shakir
2025-07-25 22:18:50 +05:00
committed by GitHub
parent a98bbc294c
commit 69cff312f5
33 changed files with 5856 additions and 4 deletions

View File

@@ -725,6 +725,10 @@
title: DAB-DETR
- local: model_doc/deepseek_v2
title: DeepSeek-V2
- local: model_doc/deepseek_vl
title: DeepseekVL
- local: model_doc/deepseek_vl_hybrid
title: DeepseekVLHybrid
- local: model_doc/deformable_detr
title: Deformable DETR
- local: model_doc/deit

View File

@@ -0,0 +1,220 @@
<!--Copyright 2025 Deepseek AI and The HuggingFace Team. All rights reserved.
Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with
the License. You may obtain a copy of the License at
http://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on
an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the
specific language governing permissions and limitations under the License.
⚠️ Note that this file is in Markdown but contains specific syntax for our doc-builder (similar to MDX) that may not be
rendered properly in your Markdown viewer.
-->
<div style="float: right;">
<div class="flex flex-wrap space-x-1">
<img alt="PyTorch" src="https://img.shields.io/badge/PyTorch-DE3412?style=flat&logo=pytorch&logoColor=white">
<img alt="FlashAttention" src="https://img.shields.io/badge/%E2%9A%A1%EF%B8%8E%20FlashAttention-eae0c8?style=flat">
<img alt="SDPA" src="https://img.shields.io/badge/SDPA-DE3412?style=flat&logo=pytorch&logoColor=white">
</div>
</div>
# DeepseekVL
[Deepseek-VL](https://arxiv.org/abs/2403.05525) was introduced by the DeepSeek AI team. It is a vision-language model (VLM) that takes text and images as input and generates text responses. The model uses [LLaMA](./llama) as its language model and [SigLip](./siglip) to encode images.
You can find all the original Deepseek-VL checkpoints under the [DeepSeek-community](https://huggingface.co/deepseek-community) organization.
> [!TIP]
> Click on the Deepseek-VL models in the right sidebar for more examples of how to apply Deepseek-VL to different vision and language tasks.
The example below demonstrates how to generate text based on an image with [`Pipeline`] or the [`AutoModel`] class.
<hfoptions id="usage">
<hfoption id="Pipeline">
```py
import torch
from transformers import pipeline
pipe = pipeline(
task="image-text-to-text",
model="deepseek-community/deepseek-vl-1.3b-chat",
device=0,
torch_dtype=torch.float16
)
messages = [
{
"role": "user",
"content": [
{
"type": "image",
"url": "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/pipeline-cat-chonk.jpeg",
},
{ "type": "text", "text": "Describe this image."},
]
}
]
pipe(text=messages, max_new_tokens=20, return_full_text=False)
```
</hfoption>
<hfoption id="AutoModel">
```py
import torch
from transformers import DeepseekVLForConditionalGeneration, AutoProcessor
model = DeepseekVLForConditionalGeneration.from_pretrained(
"deepseek-community/deepseek-vl-1.3b-chat",
torch_dtype=torch.float16,
device_map="auto",
attn_implementation="sdpa"
)
processor = AutoProcessor.from_pretrained("deepseek-community/deepseek-vl-1.3b-chat")
messages = [
{
"role":"user",
"content":[
{
"type":"image",
"url": "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/pipeline-cat-chonk.jpeg"
},
{
"type":"text",
"text":"Describe this image."
}
]
}
]
inputs = processor.apply_chat_template(
messages,
add_generation_prompt=True,
tokenize=True,
return_dict=True,
return_tensors="pt"
).to(model.device, dtype=model.dtype)
generated_ids = model.generate(**inputs, max_new_tokens=128)
generated_ids_trimmed = [
out_ids[len(in_ids) :] for in_ids, out_ids in zip(inputs.input_ids, generated_ids)
]
output_text = processor.batch_decode(
generated_ids_trimmed, skip_special_tokens=True, clean_up_tokenization_spaces=False
)
print(output_text)
```
</hfoption>
</hfoptions>
Quantization reduces the memory burden of large models by representing the weights in a lower precision. Refer to the [Quantization](../quantization/overview) overview for more available quantization backends.
The example below uses [torchao](../quantization/torchao) to only quantize the weights to int4.
```python
import torch
from transformers import TorchAoConfig, DeepseekVLForConditionalGeneration, AutoProcessor
quantization_config = TorchAoConfig(
"int4_weight_only",
group_size=128
)
model = DeepseekVLForConditionalGeneration.from_pretrained(
"deepseek-community/deepseek-vl-1.3b-chat",
torch_dtype=torch.bfloat16,
device_map="auto",
quantization_config=quantization_config
)
```
### Notes
- Do inference with multiple images in a single conversation.
```py
import torch
from transformers import DeepseekVLForConditionalGeneration, AutoProcessor
model = DeepseekVLForConditionalGeneration.from_pretrained(
"deepseek-community/deepseek-vl-1.3b-chat",
torch_dtype=torch.float16,
device_map="auto",
attn_implementation="sdpa"
)
processor = AutoProcessor.from_pretrained("deepseek-community/deepseek-vl-1.3b-chat")
messages = [
[
{
"role": "user",
"content": [
{"type": "text", "text": "What's the difference between"},
{"type": "image", "url": "http://images.cocodataset.org/val2017/000000039769.jpg"},
{"type": "text", "text": " and "},
{"type": "image", "url": "https://www.ilankelman.org/stopsigns/australia.jpg"}
]
}
],
[
{
"role": "user",
"content": [
{"type": "image", "url": "https://huggingface.co/microsoft/kosmos-2-patch14-224/resolve/main/snowman.jpg"},
{"type": "text", "text": "What do you see in this image?"}
]
}
]
]
inputs = processor.apply_chat_template(
messages,
add_generation_prompt=True,
padding=True,
truncation=True,
tokenize=True,
return_dict=True,
return_tensors="pt"
).to(model.device, dtype=model.dtype)
generated_ids = model.generate(**inputs, max_new_tokens=128)
generated_ids_trimmed = [
out_ids[len(in_ids) :] for in_ids, out_ids in zip(inputs.input_ids, generated_ids)
]
output_text = processor.batch_decode(
generated_ids_trimmed, skip_special_tokens=True, clean_up_tokenization_spaces=False
)
print(output_text)
```
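The trimming step used in the examples above can be illustrated without loading a model. This is a minimal sketch in which plain Python lists stand in for `inputs.input_ids` and `generated_ids` (the token values are made up for illustration). Because `generate` returns the prompt tokens followed by the new tokens, slicing each output at the prompt length keeps only the newly generated part.

```python
# Stand-ins for `inputs.input_ids` and `generated_ids`; the values are illustrative only.
input_ids = [[1, 15, 27], [1, 42, 8]]
generated_ids = [[1, 15, 27, 101, 102], [1, 42, 8, 201, 202]]

# Drop the prompt prefix from every generated sequence.
generated_ids_trimmed = [
    out_ids[len(in_ids):] for in_ids, out_ids in zip(input_ids, generated_ids)
]
print(generated_ids_trimmed)
```

Decoding the trimmed ids rather than the full output keeps the prompt text out of the decoded response.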
## DeepseekVLConfig
[[autodoc]] DeepseekVLConfig
## DeepseekVLProcessor
[[autodoc]] DeepseekVLProcessor
## DeepseekVLImageProcessor
[[autodoc]] DeepseekVLImageProcessor
## DeepseekVLModel
[[autodoc]] DeepseekVLModel
- forward
## DeepseekVLForConditionalGeneration
[[autodoc]] DeepseekVLForConditionalGeneration
- forward

View File

@@ -0,0 +1,219 @@
<!--Copyright 2025 Deepseek AI and The HuggingFace Team. All rights reserved.
Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with
the License. You may obtain a copy of the License at
http://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on
an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the
specific language governing permissions and limitations under the License.
⚠️ Note that this file is in Markdown but contains specific syntax for our doc-builder (similar to MDX) that may not be
rendered properly in your Markdown viewer.
-->
<div style="float: right;">
<div class="flex flex-wrap space-x-1">
<img alt="PyTorch" src="https://img.shields.io/badge/PyTorch-DE3412?style=flat&logo=pytorch&logoColor=white">
<img alt="SDPA" src="https://img.shields.io/badge/SDPA-DE3412?style=flat&logo=pytorch&logoColor=white">
</div>
</div>
# DeepseekVLHybrid
[Deepseek-VL-Hybrid](https://arxiv.org/abs/2403.05525) was introduced by the DeepSeek AI team. It is a variant of Deepseek-VL, a vision-language model (VLM) that takes text and images as input and generates text responses. The model uses [LLaMA](./llama) as its language model, [SigLip](./siglip) to encode low-resolution images, and [SAM (Segment Anything Model)](./sam) to encode high-resolution images, which improves the model's ability to process fine-grained visual details.
You can find all the original Deepseek-VL-Hybrid checkpoints under the [DeepSeek-community](https://huggingface.co/deepseek-community) organization.
> [!TIP]
> Click on the Deepseek-VL-Hybrid models in the right sidebar for more examples of how to apply Deepseek-VL-Hybrid to different vision and language tasks.
The example below demonstrates how to generate text based on an image with [`Pipeline`] or the [`AutoModel`] class.
<hfoptions id="usage">
<hfoption id="Pipeline">
```py
import torch
from transformers import pipeline
pipe = pipeline(
task="image-text-to-text",
model="deepseek-community/deepseek-vl-7b-chat",
device=0,
torch_dtype=torch.float16
)
messages = [
{
"role": "user",
"content": [
{
"type": "image",
"url": "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/pipeline-cat-chonk.jpeg",
},
{ "type": "text", "text": "Describe this image."},
]
}
]
pipe(text=messages, max_new_tokens=20, return_full_text=False)
```
</hfoption>
<hfoption id="AutoModel">
```py
import torch
from transformers import DeepseekVLHybridForConditionalGeneration, AutoProcessor
model = DeepseekVLHybridForConditionalGeneration.from_pretrained(
"deepseek-community/deepseek-vl-7b-chat",
torch_dtype=torch.float16,
device_map="auto",
attn_implementation="sdpa"
)
processor = AutoProcessor.from_pretrained("deepseek-community/deepseek-vl-7b-chat")
messages = [
{
"role":"user",
"content":[
{
"type":"image",
"url": "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/pipeline-cat-chonk.jpeg"
},
{
"type":"text",
"text":"Describe this image."
}
]
}
]
inputs = processor.apply_chat_template(
messages,
add_generation_prompt=True,
tokenize=True,
return_dict=True,
return_tensors="pt"
).to(model.device, dtype=model.dtype)
generated_ids = model.generate(**inputs, max_new_tokens=128)
generated_ids_trimmed = [
out_ids[len(in_ids) :] for in_ids, out_ids in zip(inputs.input_ids, generated_ids)
]
output_text = processor.batch_decode(
generated_ids_trimmed, skip_special_tokens=True, clean_up_tokenization_spaces=False
)
print(output_text)
```
</hfoption>
</hfoptions>
Quantization reduces the memory burden of large models by representing the weights in a lower precision. Refer to the [Quantization](../quantization/overview) overview for more available quantization backends.
The example below uses [torchao](../quantization/torchao) to only quantize the weights to int4.
```python
import torch
from transformers import TorchAoConfig, DeepseekVLHybridForConditionalGeneration, AutoProcessor
quantization_config = TorchAoConfig(
"int4_weight_only",
group_size=128
)
model = DeepseekVLHybridForConditionalGeneration.from_pretrained(
"deepseek-community/deepseek-vl-7b-chat",
torch_dtype=torch.bfloat16,
device_map="auto",
quantization_config=quantization_config
)
```
### Notes
- Do inference with multiple images in a single conversation.
```py
import torch
from transformers import DeepseekVLHybridForConditionalGeneration, AutoProcessor
model = DeepseekVLHybridForConditionalGeneration.from_pretrained(
"deepseek-community/deepseek-vl-7b-chat",
torch_dtype=torch.float16,
device_map="auto",
attn_implementation="sdpa"
)
processor = AutoProcessor.from_pretrained("deepseek-community/deepseek-vl-7b-chat")
messages = [
[
{
"role": "user",
"content": [
{"type": "text", "text": "What's the difference between"},
{"type": "image", "url": "http://images.cocodataset.org/val2017/000000039769.jpg"},
{"type": "text", "text": " and "},
{"type": "image", "url": "https://www.ilankelman.org/stopsigns/australia.jpg"}
]
}
],
[
{
"role": "user",
"content": [
{"type": "image", "url": "https://huggingface.co/microsoft/kosmos-2-patch14-224/resolve/main/snowman.jpg"},
{"type": "text", "text": "What do you see in this image?"}
]
}
]
]
inputs = processor.apply_chat_template(
messages,
add_generation_prompt=True,
padding=True,
truncation=True,
tokenize=True,
return_dict=True,
return_tensors="pt"
).to(model.device, dtype=model.dtype)
generated_ids = model.generate(**inputs, max_new_tokens=128)
generated_ids_trimmed = [
out_ids[len(in_ids) :] for in_ids, out_ids in zip(inputs.input_ids, generated_ids)
]
output_text = processor.batch_decode(
generated_ids_trimmed, skip_special_tokens=True, clean_up_tokenization_spaces=False
)
print(output_text)
```
## DeepseekVLHybridConfig
[[autodoc]] DeepseekVLHybridConfig
## DeepseekVLHybridProcessor
[[autodoc]] DeepseekVLHybridProcessor
## DeepseekVLHybridImageProcessor
[[autodoc]] DeepseekVLHybridImageProcessor
## DeepseekVLHybridModel
[[autodoc]] DeepseekVLHybridModel
- forward
## DeepseekVLHybridForConditionalGeneration
[[autodoc]] DeepseekVLHybridForConditionalGeneration
- forward

View File

@@ -84,6 +84,8 @@ if TYPE_CHECKING:
from .decision_transformer import *
from .deepseek_v2 import *
from .deepseek_v3 import *
from .deepseek_vl import *
from .deepseek_vl_hybrid import *
from .deformable_detr import *
from .deit import *
from .deprecated import *

View File

@@ -103,6 +103,8 @@ CONFIG_MAPPING_NAMES = OrderedDict[str, str](
("decision_transformer", "DecisionTransformerConfig"),
("deepseek_v2", "DeepseekV2Config"),
("deepseek_v3", "DeepseekV3Config"),
("deepseek_vl", "DeepseekVLConfig"),
("deepseek_vl_hybrid", "DeepseekVLHybridConfig"),
("deformable_detr", "DeformableDetrConfig"),
("deit", "DeiTConfig"),
("depth_anything", "DepthAnythingConfig"),
@@ -495,6 +497,8 @@ MODEL_NAMES_MAPPING = OrderedDict[str, str](
("decision_transformer", "Decision Transformer"),
("deepseek_v2", "DeepSeek-V2"),
("deepseek_v3", "DeepSeek-V3"),
("deepseek_vl", "DeepseekVL"),
("deepseek_vl_hybrid", "DeepseekVLHybrid"),
("deformable_detr", "Deformable DETR"),
("deit", "DeiT"),
("deplot", "DePlot"),

View File

@@ -77,6 +77,8 @@ else:
("convnextv2", ("ConvNextImageProcessor", "ConvNextImageProcessorFast")),
("cvt", ("ConvNextImageProcessor", "ConvNextImageProcessorFast")),
("data2vec-vision", ("BeitImageProcessor", "BeitImageProcessorFast")),
("deepseek_vl", ("DeepseekVLImageProcessor",)),
("deepseek_vl_hybrid", ("DeepseekVLHybridImageProcessor",)),
("deformable_detr", ("DeformableDetrImageProcessor", "DeformableDetrImageProcessorFast")),
("deit", ("DeiTImageProcessor", "DeiTImageProcessorFast")),
("depth_anything", ("DPTImageProcessor", "DPTImageProcessorFast")),

View File

@@ -97,6 +97,8 @@ MODEL_MAPPING_NAMES = OrderedDict(
("decision_transformer", "DecisionTransformerModel"),
("deepseek_v2", "DeepseekV2Model"),
("deepseek_v3", "DeepseekV3Model"),
("deepseek_vl", "DeepseekVLModel"),
("deepseek_vl_hybrid", "DeepseekVLHybridModel"),
("deformable_detr", "DeformableDetrModel"),
("deit", "DeiTModel"),
("depth_pro", "DepthProModel"),
@@ -935,6 +937,8 @@ MODEL_FOR_IMAGE_TEXT_TO_TEXT_MAPPING_NAMES = OrderedDict(
("blip", "BlipForConditionalGeneration"),
("blip-2", "Blip2ForConditionalGeneration"),
("chameleon", "ChameleonForConditionalGeneration"),
("deepseek_vl", "DeepseekVLForConditionalGeneration"),
("deepseek_vl_hybrid", "DeepseekVLHybridForConditionalGeneration"),
("emu3", "Emu3ForConditionalGeneration"),
("evolla", "EvollaForProteinText2Text"),
("fuyu", "FuyuForCausalLM"),

View File

@@ -62,6 +62,8 @@ PROCESSOR_MAPPING_NAMES = OrderedDict(
("clvp", "ClvpProcessor"),
("colpali", "ColPaliProcessor"),
("colqwen2", "ColQwen2Processor"),
("deepseek_vl", "DeepseekVLProcessor"),
("deepseek_vl_hybrid", "DeepseekVLHybridProcessor"),
("dia", "DiaProcessor"),
("emu3", "Emu3Processor"),
("evolla", "EvollaProcessor"),

View File

@@ -193,6 +193,20 @@ TOKENIZER_MAPPING_NAMES = OrderedDict[str, tuple[Optional[str], Optional[str]]](
"LlamaTokenizerFast" if is_tokenizers_available() else None,
),
),
(
"deepseek_vl",
(
"LlamaTokenizer" if is_sentencepiece_available() else None,
"LlamaTokenizerFast" if is_tokenizers_available() else None,
),
),
(
"deepseek_vl_hybrid",
(
"LlamaTokenizer" if is_sentencepiece_available() else None,
"LlamaTokenizerFast" if is_tokenizers_available() else None,
),
),
("dia", ("DiaTokenizer", None)),
(
"diffllama",

View File

@@ -0,0 +1,30 @@
# Copyright 2025 Deepseek AI and The HuggingFace Team. All rights reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
from typing import TYPE_CHECKING

from ...utils import _LazyModule
from ...utils.import_utils import define_import_structure


if TYPE_CHECKING:
    from .configuration_deepseek_vl import *
    from .image_processing_deepseek_vl import *
    from .image_processing_deepseek_vl_fast import *
    from .modeling_deepseek_vl import *
    from .processing_deepseek_vl import *
else:
    import sys

    _file = globals()["__file__"]
    sys.modules[__name__] = _LazyModule(__name__, _file, define_import_structure(_file), module_spec=__spec__)

View File

@@ -0,0 +1,96 @@
# 🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨
# This file was automatically generated from src/transformers/models/deepseek_vl/modular_deepseek_vl.py.
# Do NOT edit this file manually as any edits will be overwritten by the generation of
# the file from the modular. If any change should be done, please apply the change to the
# modular_deepseek_vl.py file directly. One of our CI enforces this.
# 🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨
# Copyright 2025 Deepseek AI and The HuggingFace Team. All rights reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
from ...configuration_utils import PretrainedConfig
from ...utils import logging
from ..auto import CONFIG_MAPPING, AutoConfig


logger = logging.get_logger(__name__)


class DeepseekVLConfig(PretrainedConfig):
    r"""
    This is the configuration class to store the configuration of a [`DeepseekVLModel`]. It is used to instantiate a
    DeepseekVL model according to the specified arguments, defining the model architecture. Instantiating a configuration
    with the defaults will yield a similar configuration to that of the DeepseekVL
    [deepseek-community/deepseek-vl-1.3b-chat](https://huggingface.co/deepseek-community/deepseek-vl-1.3b-chat) architecture.

    Configuration objects inherit from [`PretrainedConfig`] and can be used to control the model outputs. Read the
    documentation from [`PretrainedConfig`] for more information.

    Args:
        text_config (`Union[AutoConfig, dict]`, *optional*, defaults to `LlamaConfig`):
            The config object or dictionary of the text backbone.
        vision_config (`Union[AutoConfig, dict]`, *optional*, defaults to `SiglipVisionConfig`):
            The config object or dictionary of the vision backbone.
        image_token_id (`int`, *optional*, defaults to 100015):
            The index representing image tokens in the model's token vocabulary.

    Example:

    ```python
    >>> from transformers import DeepseekVLConfig, DeepseekVLModel

    >>> # Initializing a DeepseekVL deepseek-community/deepseek-vl-1.3b-chat style configuration
    >>> configuration = DeepseekVLConfig()

    >>> # Initializing a model (with random weights) from the deepseek-community/deepseek-vl-1.3b-chat style configuration
    >>> model = DeepseekVLModel(configuration)

    >>> # Accessing the model configuration
    >>> configuration = model.config
    ```"""

    model_type = "deepseek_vl"
    sub_configs = {"text_config": AutoConfig, "vision_config": AutoConfig}

    def __init__(
        self,
        text_config: AutoConfig = None,
        vision_config: AutoConfig = None,
        image_token_id: int = 100015,
        **kwargs,
    ):
        super().__init__(**kwargs)

        if text_config is None:
            text_config = {}
            logger.info("`text_config` is `None`. Initializing the `LlamaConfig` with default values.")

        if vision_config is None:
            vision_config = {}
            logger.info("`vision_config` is `None`. Initializing the `SiglipVisionConfig` with default values.")

        if isinstance(text_config, dict):
            text_config["model_type"] = text_config.get("model_type", "llama")
            text_config = CONFIG_MAPPING[text_config["model_type"]](**text_config)

        if isinstance(vision_config, dict):
            vision_config["model_type"] = vision_config.get("model_type", "siglip_vision_model")
            vision_config = CONFIG_MAPPING[vision_config["model_type"]](**vision_config)

        self.text_config = text_config
        self.vision_config = vision_config
        self.image_token_id = image_token_id


__all__ = ["DeepseekVLConfig"]
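The way `DeepseekVLConfig.__init__` turns a plain dictionary into a config object can be sketched in isolation. The example below is an illustration only: `FakeLlamaConfig`, `FakeSiglipVisionConfig`, `FAKE_CONFIG_MAPPING`, and `resolve_sub_config` are hypothetical stand-ins for the real config classes and `CONFIG_MAPPING`, introduced here so the snippet runs without `transformers` installed.

```python
# Hypothetical stand-ins for the real config classes and for `CONFIG_MAPPING`.
class FakeLlamaConfig:
    def __init__(self, **kwargs):
        self.kwargs = kwargs


class FakeSiglipVisionConfig:
    def __init__(self, **kwargs):
        self.kwargs = kwargs


FAKE_CONFIG_MAPPING = {
    "llama": FakeLlamaConfig,
    "siglip_vision_model": FakeSiglipVisionConfig,
}


def resolve_sub_config(sub_config, default_model_type):
    """Sketch of the dict handling: fall back to a default model type, then look up the class."""
    if sub_config is None:
        sub_config = {}
    if isinstance(sub_config, dict):
        sub_config["model_type"] = sub_config.get("model_type", default_model_type)
        sub_config = FAKE_CONFIG_MAPPING[sub_config["model_type"]](**sub_config)
    return sub_config


text_config = resolve_sub_config({"hidden_size": 2048}, "llama")
vision_config = resolve_sub_config(None, "siglip_vision_model")
print(type(text_config).__name__, text_config.kwargs)
print(type(vision_config).__name__, vision_config.kwargs)
```

An object that is already a config instance passes through unchanged, which is why both dictionaries and config objects are accepted for `text_config` and `vision_config`.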

View File

@@ -0,0 +1,356 @@
# coding=utf-8
# Copyright 2025 Deepseek AI and The HuggingFace Team. All rights reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
import argparse
import gc
import json
import os
from typing import Optional
import regex as re
import torch
from accelerate import init_empty_weights
from huggingface_hub import snapshot_download
from huggingface_hub.errors import HFValidationError
from safetensors.torch import load_file
from transformers import (
AutoTokenizer,
DeepseekVLConfig,
DeepseekVLForConditionalGeneration,
DeepseekVLImageProcessor,
DeepseekVLProcessor,
)
from transformers.image_utils import IMAGENET_STANDARD_MEAN, IMAGENET_STANDARD_STD
# fmt: off
ORIGINAL_TO_CONVERTED_KEY_MAPPING = {
# Siglip (Low Resolution)
r"vision_model.vision_tower.pos_embed": r"model.vision_model.vision_model.embeddings.position_embedding.weight",
r"vision_model.vision_tower.patch_embed.proj.(weight|bias)": r"model.vision_model.vision_model.embeddings.patch_embedding.\1",
r"vision_model.vision_tower.blocks.(\d+).attn.qkv.(weight|bias)": r"model.vision_model.vision_model.encoder.layers.\1.self_attn.(q|k|v)_proj.\2",
r"vision_model.vision_tower.blocks.(\d+).attn.proj.(weight|bias)": r"model.vision_model.vision_model.encoder.layers.\1.self_attn.out_proj.\2",
r"vision_model.vision_tower.blocks.(\d+).norm(\d+).(weight|bias)": r"model.vision_model.vision_model.encoder.layers.\1.layer_norm\2.\3",
r"vision_model.vision_tower.blocks.(\d+).mlp.fc(\d+).(weight|bias)": r"model.vision_model.vision_model.encoder.layers.\1.mlp.fc\2.\3",
r"vision_model.vision_tower.norm.(weight|bias)": r"model.vision_model.vision_model.post_layernorm.\1",
r"vision_model.vision_tower.attn_pool.latent": r"model.vision_model.vision_model.head.probe",
r"vision_model.vision_tower.attn_pool.proj.(weight|bias)": r"model.vision_model.vision_model.head.attention.out_proj.\1",
r"vision_model.vision_tower.attn_pool.norm.(weight|bias)": r"model.vision_model.vision_model.head.layernorm.\1",
r"vision_model.vision_tower.attn_pool.mlp.fc(\d+).(weight|bias)": r"model.vision_model.vision_model.head.mlp.fc\1.\2",
# Aligner
r"aligner.layers.0.(weight|bias)": r"model.aligner.linear1.\1",
r"aligner.layers.2.(weight|bias)": r"model.aligner.linear2.\1",
# Llama (Text Model)
r"language_model.model.(\w+)": r"model.language_model.\1",
r"language_model.lm_head.(weight|bias)": r"lm_head.\1",
}
# fmt: on
# Adapted from https://github.com/deepseek-ai/DeepSeek-VL/blob/main/deepseek_vl/utils/conversation.py#L80-L91
CHAT_TEMPLATE = (
# Define separators and initialize counter
"{% set seps = ['\n\n', '<\uff5cend\u2581of\u2581sentence\uff5c>'] %}"
"{% set i = 0 %}"
# Start with default system prompt
"You are a helpful language and vision assistant. "
"You are able to understand the visual content that the user provides, "
"and assist the user with a variety of tasks using natural language.\n\n"
# Iterate through messages
"{% for message in messages %}"
# Identify user or assistant role
"{% if message['role']|lower == 'user' %}"
"User: "
"{% elif message['role']|lower == 'assistant' %}"
"Assistant:{% if not (loop.last and not add_generation_prompt and message['content'][0]['type']=='text' and message['content'][0]['text']=='') %} {% endif %}"
"{% else %}"
"{{ message['role'].capitalize() }}: "
"{% endif %}"
# Iterate through message content (text/images)
"{% for content in message['content'] %}"
# If content is an image, replace with placeholder
"{% if content['type'] == 'image' %}"
"<image_placeholder>"
# If content is text, handle formatting
"{% elif content['type'] == 'text' %}"
"{% set text = content['text'] %}"
# Strip whitespace for first and last text blocks
"{% if loop.first %}{% set text = text.lstrip() %}{% endif %}"
"{% if loop.last %}{% set text = text.rstrip() %}{% endif %}"
# If previous content was text, add space
"{% if not loop.first and message['content'][loop.index0-1]['type'] == 'text' %}"
"{{ ' ' + text }}"
"{% else %}"
"{{ text }}"
"{% endif %}"
"{% endif %}"
"{% endfor %}" # End message content loop
# Add separators between messages
"{% if not loop.last or add_generation_prompt %}"
"{% if message['role']|lower == 'user' %}"
"{{ seps[0] }}"
"{% else %}"
"{{ seps[1] }}"
"{% endif %}"
"{% endif %}"
"{% endfor %}" # End messages loop
# Add final Assistant prompt if required
"{% if add_generation_prompt %}Assistant:{% endif %}"
)
def convert_old_keys_to_new_keys(state_dict_keys: list[str]) -> dict[str, str]:
    """Map every original key to its converted name by applying the regex mapping to all keys at once."""
    output_dict = {}

    old_text = "\n".join(state_dict_keys)
    new_text = old_text
    for pattern, replacement in ORIGINAL_TO_CONVERTED_KEY_MAPPING.items():
        if replacement is None:
            new_text = re.sub(pattern, "", new_text)  # an empty line
            continue
        new_text = re.sub(pattern, replacement, new_text)
    output_dict = dict(zip(old_text.split("\n"), new_text.split("\n")))

    return output_dict
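The renaming approach in `convert_old_keys_to_new_keys` can be tried on a couple of keys. The sketch below copies two patterns from `ORIGINAL_TO_CONVERTED_KEY_MAPPING` and uses the standard-library `re` module instead of `regex` (an assumption made to keep the snippet dependency-free); the two checkpoint keys are made up for illustration.

```python
import re

# Two patterns copied from ORIGINAL_TO_CONVERTED_KEY_MAPPING for illustration.
mapping = {
    r"aligner.layers.0.(weight|bias)": r"model.aligner.linear1.\1",
    r"language_model.model.(\w+)": r"model.language_model.\1",
}

# Illustrative keys in the original checkpoint naming scheme.
old_keys = ["aligner.layers.0.weight", "language_model.model.norm.weight"]

# Join all keys into one newline-separated string so each pattern runs once over everything.
new_text = "\n".join(old_keys)
for pattern, replacement in mapping.items():
    new_text = re.sub(pattern, replacement, new_text)

# Pair each old key with the renamed line at the same position.
renamed = dict(zip(old_keys, new_text.split("\n")))
print(renamed)
```

Running the substitutions over a single joined string avoids a nested loop over keys and patterns, and the positional `zip` relies on no replacement adding or removing newlines.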
def get_qkv_state_dict(key, parameter):
    """
    new key which looks like this
    xxxx.(q|k|v).xxx (m, n)

    is converted to
    xxxx.q.xxxx (m//3, n)
    xxxx.k.xxxx (m//3, n)
    xxxx.v.xxxx (m//3, n)
    """
    qkv_state_dict = {}
    placeholder = re.search(r"(\(.*?\))", key).group(1)  # finds "(q|k|v)"
    replacements_keys = placeholder[1:-1].split("|")  # creates ['q', 'k', 'v']
    replacements_vals = torch.split(
        parameter, split_size_or_sections=parameter.size(0) // len(replacements_keys), dim=0
    )
    for replacement_key, replacement_val in zip(replacements_keys, replacements_vals):
        qkv_state_dict[key.replace(placeholder, replacement_key)] = replacement_val
    return qkv_state_dict
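The split described in the `get_qkv_state_dict` docstring can be illustrated without `torch`. In this sketch a list of rows stands in for the fused parameter tensor, and `split_fused_qkv` is a hypothetical helper written for this example, not part of the conversion script.

```python
import re


def split_fused_qkv(key, rows):
    """Split a fused parameter (a list of rows) into equal chunks, one per name in the key's placeholder."""
    placeholder = re.search(r"(\(.*?\))", key).group(1)  # e.g. "(q|k|v)"
    names = placeholder[1:-1].split("|")  # e.g. ["q", "k", "v"]
    chunk = len(rows) // len(names)
    return {
        key.replace(placeholder, name): rows[i * chunk : (i + 1) * chunk]
        for i, name in enumerate(names)
    }


fused_key = "encoder.layers.0.self_attn.(q|k|v)_proj.weight"
fused_rows = [[0, 0], [1, 1], [2, 2], [3, 3], [4, 4], [5, 5]]  # stand-in for a (6, 2) tensor
split = split_fused_qkv(fused_key, fused_rows)
for name, value in split.items():
    print(name, value)
```

The first third of the rows goes to the query projection, the second to the key projection, and the last to the value projection, matching the `(m, n)` to `(m//3, n)` shapes in the docstring.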
def update_state_dict(old_state_dict):
    all_keys = list(old_state_dict.keys())
    new_keys = convert_old_keys_to_new_keys(all_keys)

    state_dict = {}
    for key in all_keys:
        new_key = new_keys[key]
        current_parameter = old_state_dict.pop(key)

        if "qkv" in key and "vision_tower_high" not in key:
            qkv_state_dict = get_qkv_state_dict(new_key, current_parameter)
            state_dict.update(qkv_state_dict)
        elif "pos_embed" in key:
            if "vision_tower_high" not in key:
                # timm implementation of siglip creates this param of size [1, 576, 1024]
                # transformers implementation of siglip creates this param of size [576, 1024]
                state_dict[new_key] = current_parameter.squeeze(0)
            else:
                state_dict[new_key] = current_parameter
        else:
            state_dict[new_key] = current_parameter

    return state_dict
def load_model_state_dict(input_path: str) -> dict:
"""
Load model state dict, handling both single and sharded files.
"""
index_path = os.path.join(input_path, "model.safetensors.index.json")
single_file_path = os.path.join(input_path, "model.safetensors")
# Check if we have a sharded model
if os.path.exists(index_path):
print("Loading sharded model...")
state_dict = {}
with open(index_path, "r") as f:
index = json.load(f)
# Get unique shard files and load each one only once
unique_shard_files = sorted(set(index["weight_map"].values()))
for shard_file in unique_shard_files:
print(f"Loading shard {shard_file}...")
shard_path = os.path.join(input_path, shard_file)
shard_dict = load_file(shard_path)
state_dict.update(shard_dict)
return state_dict
# Single file model
elif os.path.exists(single_file_path):
print("Loading single file model...")
return load_file(single_file_path, device="cpu")
else:
raise ValueError(f"No model files found in {input_path}")
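For the sharded case, the index file maps every parameter name to the shard that stores it, so many entries point at the same file. A small sketch of the de-duplication step, using a made-up index (the parameter and file names here are hypothetical, as is the helper `unique_shards`):

```python
import json

# Hypothetical index content; real files map every parameter name to its shard file.
index_json = json.dumps(
    {
        "weight_map": {
            "embed.weight": "model-00001-of-00002.safetensors",
            "layers.0.weight": "model-00001-of-00002.safetensors",
            "lm_head.weight": "model-00002-of-00002.safetensors",
        }
    }
)


def unique_shards(index_text):
    """Return each shard file name once, in sorted order."""
    index = json.loads(index_text)
    return sorted(set(index["weight_map"].values()))


shards = unique_shards(index_json)
print(shards)
```

Loading each unique file once avoids re-reading a shard for every parameter it contains.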
def convert_model(
hf_repo_id: str,
output_dir: Optional[str] = None,
output_hub_path: Optional[str] = None,
safe_serialization: bool = True,
):
if output_dir:
os.makedirs(output_dir, exist_ok=True)
try:
input_path = snapshot_download(hf_repo_id)
except HFValidationError:
# If the input path is not a HF repo ID, assume it's a local path
input_path = hf_repo_id
# ------------------------------------------------------------
# Create and save config
# ------------------------------------------------------------
config = DeepseekVLConfig(
text_config={
"hidden_size": 2048,
"intermediate_size": 5632,
"max_position_embeddings": 16384,
"num_attention_heads": 16,
"num_hidden_layers": 24,
"vocab_size": 102400,
},
vision_config={
"hidden_size": 1024,
"intermediate_size": 4096,
"image_size": 384,
"patch_size": 16,
"hidden_act": "gelu",
"vision_use_head": False,
"num_attention_heads": 16,
"num_hidden_layers": 24,
},
)
# save config
if output_dir:
config.save_pretrained(output_dir)
print("Model config saved successfully...")
# ------------------------------------------------------------
# Convert processor
# ------------------------------------------------------------
image_processor = DeepseekVLImageProcessor(
image_mean=IMAGENET_STANDARD_MEAN,
image_std=IMAGENET_STANDARD_STD,
)
tokenizer = AutoTokenizer.from_pretrained(
input_path,
extra_special_tokens={
"pad_token": "<end▁of▁sentence>",
"image_token": "<image_placeholder>",
},
)
processor = DeepseekVLProcessor(
image_processor=image_processor,
tokenizer=tokenizer,
chat_template=CHAT_TEMPLATE,
)
if output_dir:
print(f"Saving processor to {output_dir}...")
processor.save_pretrained(output_dir)
if output_hub_path:
print(f"Pushing processor to hub at {output_hub_path}...")
processor.push_to_hub(output_hub_path)
# ------------------------------------------------------------
# Convert weights
# ------------------------------------------------------------
print("Creating empty model...")
with init_empty_weights():
model = DeepseekVLForConditionalGeneration(config)
# Load and convert state dict
print("Loading state dict...")
state_dict = load_model_state_dict(input_path)
state_dict = update_state_dict(state_dict)
# Load converted state dict
print("Loading converted weights into model...")
info = model.load_state_dict(state_dict, strict=False, assign=True)
if len(info.missing_keys) > 0:
raise ValueError(f"Missing keys: {info.missing_keys}")
# Tie weights before any device mapping
print("Tying weights...")
model.tie_weights()
# Save the model
if output_dir:
print(f"Saving model to {output_dir}...")
model.save_pretrained(output_dir, safe_serialization=safe_serialization)
if output_hub_path:
print(f"Pushing model to hub at {output_hub_path}...")
model.push_to_hub(output_hub_path, safe_serialization=safe_serialization)
del state_dict, model
gc.collect()
# Validate the saved model if saved locally
if output_dir:
print("Reloading the local model to check if it's saved correctly...")
DeepseekVLForConditionalGeneration.from_pretrained(output_dir, device_map="auto")
print("Local model reloaded successfully.")
def main():
parser = argparse.ArgumentParser()
parser.add_argument(
"--hf_repo_id",
default="deepseek-ai/deepseek-vl-1.3b-chat",
help="Location of official weights from DeepseekAI on HF",
)
parser.add_argument(
"--output_dir",
default=None,
help="Location to write the converted model and processor",
)
parser.add_argument(
"--output_hub_path",
default=None,
help="Repository ID to push model to hub (e.g. 'username/model-name')",
)
parser.add_argument(
"--safe_serialization",
default=True,
action=argparse.BooleanOptionalAction,
help="Whether or not to save using `safetensors`. Pass `--no-safe_serialization` to disable.",
)
args = parser.parse_args()
convert_model(
hf_repo_id=args.hf_repo_id,
output_dir=args.output_dir,
output_hub_path=args.output_hub_path,
safe_serialization=args.safe_serialization,
)
if __name__ == "__main__":
main()
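Boolean flags deserve care with `argparse`: a converter of `type=bool` calls `bool()` on the raw string, so any non-empty value, including `"False"`, parses as `True`. Below is a minimal sketch of a paired on/off flag using `argparse.BooleanOptionalAction` from the standard library (Python 3.9+). `build_parser` is a hypothetical helper and not part of the conversion script.

```python
import argparse


def build_parser():
    parser = argparse.ArgumentParser()
    # BooleanOptionalAction adds both --safe_serialization and --no-safe_serialization.
    parser.add_argument("--safe_serialization", default=True, action=argparse.BooleanOptionalAction)
    return parser


parser = build_parser()
default_value = parser.parse_args([]).safe_serialization
disabled_value = parser.parse_args(["--no-safe_serialization"]).safe_serialization
# For comparison: bool() of any non-empty string is True, which is what type=bool would compute.
naive_value = bool("False")
print(default_value, disabled_value, naive_value)
```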


@@ -0,0 +1,414 @@
# 🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨
# This file was automatically generated from src/transformers/models/deepseek_vl/modular_deepseek_vl.py.
# Do NOT edit this file manually as any edits will be overwritten by the generation of
# the file from the modular. If any change should be done, please apply the change to the
# modular_deepseek_vl.py file directly. One of our CI enforces this.
# 🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨
# Copyright 2025 Deepseek AI and The HuggingFace Team. All rights reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
from typing import Optional, Union
import numpy as np
from ...image_processing_utils import BaseImageProcessor, BatchFeature, get_size_dict
from ...image_transforms import convert_to_rgb, resize, to_channel_dimension_format
from ...image_utils import (
OPENAI_CLIP_MEAN,
OPENAI_CLIP_STD,
ChannelDimension,
ImageInput,
PILImageResampling,
get_image_size,
infer_channel_dimension_format,
is_scaled_image,
make_flat_list_of_images,
to_numpy_array,
valid_images,
validate_preprocess_arguments,
)
from ...utils import (
TensorType,
filter_out_non_signature_kwargs,
is_vision_available,
logging,
)
if is_vision_available():
import PIL
logger = logging.get_logger(__name__)
class DeepseekVLImageProcessor(BaseImageProcessor):
r"""
Constructs a DEEPSEEK_VL image processor.
Args:
do_resize (`bool`, *optional*, defaults to `True`):
Whether to resize the image's (height, width) dimensions to the specified `size`. Can be overridden by the
`do_resize` parameter in the `preprocess` method.
size (`dict`, *optional*, defaults to `{"height": 384, "width": 384}`):
Size of the output image after resizing. Can be overridden by the `size` parameter in the `preprocess`
method.
min_size (`int`, *optional*, defaults to 14):
The minimum allowed size for the resized image. Ensures that neither the height nor width
falls below this value after resizing.
resample (`PILImageResampling`, *optional*, defaults to `Resampling.BICUBIC`):
Resampling filter to use if resizing the image. Only has an effect if `do_resize` is set to `True`. Can be
overridden by the `resample` parameter in the `preprocess` method.
do_rescale (`bool`, *optional*, defaults to `True`):
Whether to rescale the image by the specified scale `rescale_factor`. Can be overridden by the
`do_rescale` parameter in the `preprocess` method.
rescale_factor (`int` or `float`, *optional*, defaults to `1/255`):
Scale factor to use if rescaling the image. Only has an effect if `do_rescale` is set to `True`. Can be
overridden by the `rescale_factor` parameter in the `preprocess` method.
do_normalize (`bool`, *optional*, defaults to `True`):
Whether to normalize the image. Can be overridden by the `do_normalize` parameter in the `preprocess`
method.
image_mean (`float` or `list[float]`, *optional*, defaults to `OPENAI_CLIP_MEAN`):
Mean to use if normalizing the image. This is a float or list of floats the length of the number of
channels in the image. Can be overridden by the `image_mean` parameter in the `preprocess` method.
image_std (`float` or `list[float]`, *optional*, defaults to `OPENAI_CLIP_STD`):
Standard deviation to use if normalizing the image. This is a float or list of floats the length of the
number of channels in the image. Can be overridden by the `image_std` parameter in the `preprocess` method.
do_convert_rgb (`bool`, *optional*):
Whether to convert the image to RGB. If unset, no conversion is applied. Can be overridden by the
`do_convert_rgb` parameter in the `preprocess` method.
"""
model_input_names = ["pixel_values"]
def __init__(
self,
do_resize: bool = True,
size: Optional[dict[str, int]] = None,
min_size: int = 14,
resample: PILImageResampling = PILImageResampling.BICUBIC,
do_rescale: bool = True,
rescale_factor: Union[int, float] = 1 / 255,
do_normalize: bool = True,
image_mean: Optional[Union[float, list[float]]] = None,
image_std: Optional[Union[float, list[float]]] = None,
do_convert_rgb: Optional[bool] = None,
**kwargs,
) -> None:
super().__init__(**kwargs)
size = size if size is not None else {"height": 384, "width": 384}
size = get_size_dict(size, default_to_square=True)
self.do_resize = do_resize
self.size = size
self.resample = resample
self.do_rescale = do_rescale
self.rescale_factor = rescale_factor
self.do_normalize = do_normalize
self.image_mean = image_mean if image_mean is not None else OPENAI_CLIP_MEAN
self.image_std = image_std if image_std is not None else OPENAI_CLIP_STD
self.do_convert_rgb = do_convert_rgb
self.min_size = min_size
if image_mean is None:
self.background_color = (127, 127, 127)
else:
self.background_color = tuple([int(x * 255) for x in image_mean])
def resize(
self,
image: np.ndarray,
size: Union[dict[str, int], int],
resample: PILImageResampling = PILImageResampling.BICUBIC,
data_format: Optional[Union[str, ChannelDimension]] = None,
input_data_format: Optional[Union[str, ChannelDimension]] = None,
**kwargs,
) -> np.ndarray:
"""
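Resize an image so that its longest side equals `size` while preserving the aspect ratio, then pad it to a
`size x size` square using `self.background_color`.
Args:
image (`np.ndarray`):
Image to resize.
size (`dict[str, int]` or `int`):
Target size of the square output. Height and width must be equal.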
resample (`PILImageResampling`, *optional*, defaults to `PILImageResampling.BICUBIC`):
`PILImageResampling` filter to use when resizing the image e.g. `PILImageResampling.BICUBIC`.
data_format (`ChannelDimension` or `str`, *optional*):
The channel dimension format for the output image. If unset, the channel dimension format of the input
image is used. Can be one of:
- `"channels_first"` or `ChannelDimension.FIRST`: image in (num_channels, height, width) format.
- `"channels_last"` or `ChannelDimension.LAST`: image in (height, width, num_channels) format.
- `None`: will be inferred from input
input_data_format (`ChannelDimension` or `str`, *optional*):
The channel dimension format for the input image. If unset, the channel dimension format is inferred
from the input image. Can be one of:
- `"channels_first"` or `ChannelDimension.FIRST`: image in (num_channels, height, width) format.
- `"channels_last"` or `ChannelDimension.LAST`: image in (height, width, num_channels) format.
- `"none"` or `ChannelDimension.NONE`: image in (height, width) format.
Returns:
`np.ndarray`: The resized image.
"""
if input_data_format is None:
input_data_format = infer_channel_dimension_format(image)
height, width = get_image_size(image, input_data_format)
max_size = max(height, width)
size = get_size_dict(size, default_to_square=True)
if size["height"] != size["width"]:
raise ValueError(
f"Output height and width must be the same. Got height={size['height']} and width={size['width']}"
)
size = size["height"]
delta = size / max_size
# Largest side becomes `size` and the other side is scaled according to the aspect ratio.
output_size_nonpadded = [
max(int(height * delta), self.min_size),
max(int(width * delta), self.min_size),
]
image = resize(
image,
size=output_size_nonpadded,
resample=resample,
data_format=data_format,
input_data_format=input_data_format,
**kwargs,
)
# Expand and pad the images to obtain a square image of dimensions `size x size`
image = self.pad_to_square(
image=image,
background_color=self.background_color,
input_data_format=input_data_format,
)
return image
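The size arithmetic in `resize` can be checked in isolation. The sketch below reproduces only the pre-padding size computation as a standalone function; `nonpadded_output_size` is a hypothetical name, and the defaults of 384 and 14 are taken from the constructor defaults shown above.

```python
def nonpadded_output_size(height, width, size=384, min_size=14):
    """Scale the longest side to `size`, keep the aspect ratio, and clamp each side to `min_size`."""
    delta = size / max(height, width)
    return [max(int(height * delta), min_size), max(int(width * delta), min_size)]


landscape = nonpadded_output_size(384, 768)  # longest side 768 -> 384, so delta = 0.5
thin = nonpadded_output_size(10, 768)  # 10 * 0.5 = 5 is below min_size, so it is clamped to 14
print(landscape, thin)
```

The shorter side is what padding later fills up to the full square.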
@filter_out_non_signature_kwargs()
def preprocess(
self,
images: ImageInput,
do_resize: Optional[bool] = None,
size: Optional[dict[str, int]] = None,
resample: PILImageResampling = None,
do_rescale: Optional[bool] = None,
rescale_factor: Optional[float] = None,
do_normalize: Optional[bool] = None,
image_mean: Optional[Union[float, list[float]]] = None,
image_std: Optional[Union[float, list[float]]] = None,
return_tensors: Optional[Union[str, TensorType]] = None,
do_convert_rgb: Optional[bool] = None,
data_format: ChannelDimension = ChannelDimension.FIRST,
input_data_format: Optional[Union[str, ChannelDimension]] = None,
) -> BatchFeature:
"""
Preprocess an image or batch of images.
Args:
images (`ImageInput`):
Image to preprocess. Expects a single or batch of images with pixel values ranging from 0 to 255. If
passing in images with pixel values between 0 and 1, set `do_rescale=False`.
do_resize (`bool`, *optional*, defaults to `self.do_resize`):
Whether to resize the image.
size (`dict[str, int]`, *optional*, defaults to `self.size`):
Controls the size of the image after `resize`. The longest edge of the image is resized to
`size["height"]` whilst preserving the aspect ratio, and the result is then padded to a square of
`size["height"] x size["width"]`. Height and width must be equal.
resample (`PILImageResampling`, *optional*, defaults to `self.resample`):
Resampling filter to use if resizing the image. Only has an effect if `do_resize` is set to `True`.
do_rescale (`bool`, *optional*, defaults to `self.do_rescale`):
Whether to rescale the image values between [0 - 1].
rescale_factor (`float`, *optional*, defaults to `self.rescale_factor`):
Rescale factor to rescale the image by if `do_rescale` is set to `True`.
do_normalize (`bool`, *optional*, defaults to `self.do_normalize`):
Whether to normalize the image.
image_mean (`float` or `list[float]`, *optional*, defaults to `self.image_mean`):
Image mean to normalize the image by if `do_normalize` is set to `True`.
image_std (`float` or `list[float]`, *optional*, defaults to `self.image_std`):
Image standard deviation to normalize the image by if `do_normalize` is set to `True`.
do_convert_rgb (`bool`, *optional*, defaults to `self.do_convert_rgb`):
Whether to convert the image to RGB.
return_tensors (`str` or `TensorType`, *optional*):
The type of tensors to return. Can be one of:
- Unset: Return a list of `np.ndarray`.
- `TensorType.TENSORFLOW` or `'tf'`: Return a batch of type `tf.Tensor`.
- `TensorType.PYTORCH` or `'pt'`: Return a batch of type `torch.Tensor`.
- `TensorType.NUMPY` or `'np'`: Return a batch of type `np.ndarray`.
- `TensorType.JAX` or `'jax'`: Return a batch of type `jax.numpy.ndarray`.
data_format (`ChannelDimension` or `str`, *optional*, defaults to `ChannelDimension.FIRST`):
The channel dimension format for the output image. Can be one of:
- `"channels_first"` or `ChannelDimension.FIRST`: image in (num_channels, height, width) format.
- `"channels_last"` or `ChannelDimension.LAST`: image in (height, width, num_channels) format.
- Unset: Use the channel dimension format of the input image.
input_data_format (`ChannelDimension` or `str`, *optional*):
The channel dimension format for the input image. If unset, the channel dimension format is inferred
from the input image. Can be one of:
- `"channels_first"` or `ChannelDimension.FIRST`: image in (num_channels, height, width) format.
- `"channels_last"` or `ChannelDimension.LAST`: image in (height, width, num_channels) format.
- `"none"` or `ChannelDimension.NONE`: image in (height, width) format.
"""
do_resize = do_resize if do_resize is not None else self.do_resize
resample = resample if resample is not None else self.resample
do_rescale = do_rescale if do_rescale is not None else self.do_rescale
rescale_factor = rescale_factor if rescale_factor is not None else self.rescale_factor
do_normalize = do_normalize if do_normalize is not None else self.do_normalize
image_mean = image_mean if image_mean is not None else self.image_mean
image_std = image_std if image_std is not None else self.image_std
do_convert_rgb = do_convert_rgb if do_convert_rgb is not None else self.do_convert_rgb
size = size if size is not None else self.size
size = get_size_dict(size, default_to_square=False)
images = make_flat_list_of_images(images)
if not valid_images(images):
raise ValueError(
"Invalid image type. Must be of type PIL.Image.Image, numpy.ndarray, "
"torch.Tensor, tf.Tensor or jax.ndarray."
)
validate_preprocess_arguments(
do_rescale=do_rescale,
rescale_factor=rescale_factor,
do_normalize=do_normalize,
image_mean=image_mean,
image_std=image_std,
do_resize=do_resize,
size=size,
resample=resample,
)
# PIL RGBA images are converted to RGB
if do_convert_rgb:
images = [convert_to_rgb(image) for image in images]
# All transformations expect numpy arrays.
images = [to_numpy_array(image) for image in images]
if do_rescale and is_scaled_image(images[0]):
logger.warning_once(
"It looks like you are trying to rescale already rescaled images. If the input"
" images have pixel values between 0 and 1, set `do_rescale=False` to avoid rescaling them again."
)
if input_data_format is None:
# We assume that all images have the same channel dimension format.
input_data_format = infer_channel_dimension_format(images[0])
if do_resize:
images = [
self.resize(image=image, size=size, resample=resample, input_data_format=input_data_format)
for image in images
]
if do_rescale:
images = [
self.rescale(image=image, scale=rescale_factor, input_data_format=input_data_format)
for image in images
]
if do_normalize:
images = [
self.normalize(image=image, mean=image_mean, std=image_std, input_data_format=input_data_format)
for image in images
]
images = [
to_channel_dimension_format(image, data_format, input_channel_dim=input_data_format) for image in images
]
encoded_outputs = BatchFeature(data={"pixel_values": images}, tensor_type=return_tensors)
return encoded_outputs
def pad_to_square(
self,
image: np.ndarray,
background_color: Union[int, tuple[int, int, int]] = 0,
data_format: Optional[Union[str, ChannelDimension]] = None,
input_data_format: Optional[Union[str, ChannelDimension]] = None,
) -> np.ndarray:
"""
Pads an image to a square based on the longest edge.
Args:
image (`np.ndarray`):
The image to pad.
background_color (`int` or `tuple[int, int, int]`, *optional*, defaults to 0):
The color to use for the padding. Can be an integer for single-channel images or a
tuple of integers for multi-channel images. If passed as an integer
in multi-channel mode, it is applied to the first channel and the remaining channels default to `0`.
data_format (`str` or `ChannelDimension`, *optional*):
The channel dimension format for the output image. Can be one of:
- `"channels_first"` or `ChannelDimension.FIRST`: image in (num_channels, height, width) format.
- `"channels_last"` or `ChannelDimension.LAST`: image in (height, width, num_channels) format.
If unset, will use same as the input image.
input_data_format (`str` or `ChannelDimension`, *optional*):
The channel dimension format for the input image. Can be one of:
- `"channels_first"` or `ChannelDimension.FIRST`: image in (num_channels, height, width) format.
- `"channels_last"` or `ChannelDimension.LAST`: image in (height, width, num_channels) format.
Returns:
`np.ndarray`: The padded image.
"""
height, width = get_image_size(image, input_data_format)
num_channels = image.shape[0] if input_data_format == ChannelDimension.FIRST else image.shape[-1]
if height == width:
image = (
to_channel_dimension_format(image, data_format, input_data_format)
if data_format is not None
else image
)
return image
max_dim = max(height, width)
# Ensure background_color is the correct shape
if isinstance(background_color, int):
background_color = [background_color]
elif len(background_color) != num_channels:
raise ValueError(
f"background_color must have exactly {num_channels} elements to match the number of channels"
)
if input_data_format == ChannelDimension.FIRST:
result = np.zeros((num_channels, max_dim, max_dim), dtype=image.dtype)
for i, color in enumerate(background_color):
result[i, :, :] = color
if width > height:
start = (max_dim - height) // 2
result[:, start : start + height, :] = image
else:
start = (max_dim - width) // 2
result[:, :, start : start + width] = image
else:
result = np.zeros((max_dim, max_dim, num_channels), dtype=image.dtype)
for i, color in enumerate(background_color):
result[:, :, i] = color
if width > height:
start = (max_dim - height) // 2
result[start : start + height, :, :] = image
else:
start = (max_dim - width) // 2
result[:, start : start + width, :] = image
return result
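The centering logic of `pad_to_square` is easier to see on a tiny single-channel example. This is a simplified sketch on nested lists rather than `np.ndarray`, with a hypothetical helper name `pad_to_square_2d`; it ignores channel handling and data formats entirely.

```python
def pad_to_square_2d(image, background=0):
    """Center a 2D image (list of rows) on a square canvas filled with `background`."""
    height, width = len(image), len(image[0])
    max_dim = max(height, width)
    canvas = [[background] * max_dim for _ in range(max_dim)]
    top = (max_dim - height) // 2  # rows of padding above the image
    left = (max_dim - width) // 2  # columns of padding left of the image
    for r, row in enumerate(image):
        canvas[top + r][left : left + width] = row
    return canvas


padded = pad_to_square_2d([[1, 1, 1, 1], [1, 1, 1, 1]], background=7)
print(padded)
```

A 2 x 4 image becomes 4 x 4, with one row of background above and one below.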
def postprocess(self):
"""Applies post-processing to the decoded image tokens by reversing transformations applied during preprocessing."""
raise AttributeError("Not needed for DeepseekVL")
__all__ = ["DeepseekVLImageProcessor"]


@@ -0,0 +1,349 @@
# 🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨
# This file was automatically generated from src/transformers/models/deepseek_vl/modular_deepseek_vl.py.
# Do NOT edit this file manually as any edits will be overwritten by the generation of
# the file from the modular. If any change should be done, please apply the change to the
# modular_deepseek_vl.py file directly. One of our CI enforces this.
# 🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨
# Copyright 2025 Deepseek AI and The HuggingFace Team. All rights reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
from dataclasses import dataclass
from typing import Optional, Union
from ...cache_utils import Cache
from ...generation import GenerationMixin
from ...modeling_outputs import ModelOutput
from ...modeling_utils import PreTrainedModel
from ...processing_utils import Unpack
from ...utils import (
TransformersKwargs,
auto_docstring,
can_return_tuple,
is_torch_available,
)
from ..auto import AutoModel
from .configuration_deepseek_vl import DeepseekVLConfig
if is_torch_available():
import torch
import torch.nn as nn
@dataclass
@auto_docstring(
custom_intro="""
Base class for DeepseekVL model's outputs that may also contain a past key/values (to speed up sequential decoding).
"""
)
class DeepseekVLBaseModelOutputWithPast(ModelOutput):
r"""
last_hidden_state (`torch.FloatTensor` of shape `(batch_size, sequence_length, hidden_size)`):
Sequence of hidden-states at the output of the last layer of the model.
If `past_key_values` is used only the last hidden-state of the sequences of shape `(batch_size, 1,
hidden_size)` is output.
past_key_values (`Cache`, *optional*, returned when `use_cache=True` is passed or when `config.use_cache=True`):
Tuple of `tuple(torch.FloatTensor)` of length `config.n_layers`, with each tuple having 2 tensors of shape
`(batch_size, num_heads, sequence_length, embed_size_per_head)`) and optionally if
`config.is_encoder_decoder=True` 2 additional tensors of shape `(batch_size, num_heads,
encoder_sequence_length, embed_size_per_head)`.
Contains pre-computed hidden-states (key and values in the self-attention blocks and optionally if
`config.is_encoder_decoder=True` in the cross-attention blocks) that can be used (see `past_key_values`
input) to speed up sequential decoding.
image_hidden_states (`tuple(torch.FloatTensor)`, *optional*):
Tuple of `torch.FloatTensor` (one for the output of the image embeddings) of shape `(batch_size, num_images,
sequence_length, hidden_size)`.
Image hidden states of the model produced by the vision encoder and projected by the aligner.
"""
last_hidden_state: Optional[torch.FloatTensor] = None
past_key_values: Optional[tuple[tuple[torch.FloatTensor]]] = None
hidden_states: Optional[tuple[torch.FloatTensor]] = None
attentions: Optional[tuple[torch.FloatTensor]] = None
image_hidden_states: Optional[tuple[torch.FloatTensor]] = None
@dataclass
@auto_docstring(
custom_intro="""
Base class for DeepseekVL causal language model (or autoregressive) outputs.
"""
)
class DeepseekVLCausalLMOutputWithPast(ModelOutput):
r"""
loss (`torch.FloatTensor` of shape `(1,)`, *optional*, returned when `labels` is provided):
Language modeling loss (for next-token prediction).
logits (`torch.FloatTensor` of shape `(batch_size, sequence_length, config.vocab_size)`):
Prediction scores of the language modeling head (scores for each vocabulary token before SoftMax).
past_key_values (`Cache`, *optional*, returned when `use_cache=True` is passed or when `config.use_cache=True`):
Tuple of `tuple(torch.FloatTensor)` of length `config.n_layers`, with each tuple having 2 tensors of shape
`(batch_size, num_heads, sequence_length, embed_size_per_head)`)
Contains pre-computed hidden-states (key and values in the self-attention blocks) that can be used (see
`past_key_values` input) to speed up sequential decoding.
image_hidden_states (`tuple(torch.FloatTensor)`, *optional*):
Tuple of `torch.FloatTensor` (one for the output of the image embeddings) of shape `(batch_size, num_images,
sequence_length, hidden_size)`.
Image hidden states of the model produced by the vision encoder and projected by the aligner.
"""
loss: Optional[torch.FloatTensor] = None
logits: Optional[torch.FloatTensor] = None
past_key_values: Optional[list[torch.FloatTensor]] = None
hidden_states: Optional[tuple[torch.FloatTensor]] = None
attentions: Optional[tuple[torch.FloatTensor]] = None
image_hidden_states: Optional[tuple[torch.FloatTensor]] = None
class DeepseekVLAligner(nn.Module):
def __init__(self, config):
super().__init__()
self.config = config
in_features = config.vision_config.hidden_size
out_features = config.text_config.hidden_size
self.linear1 = nn.Linear(in_features, out_features)
self.activation = nn.GELU()
self.linear2 = nn.Linear(out_features, out_features)
def forward(self, vision_encodings: torch.Tensor) -> torch.Tensor:
x = self.linear1(vision_encodings)
x = self.activation(x)
x = self.linear2(x)
return x
@auto_docstring
class DeepseekVLPreTrainedModel(PreTrainedModel):
config: DeepseekVLConfig
base_model_prefix = "model"
supports_gradient_checkpointing = True
_no_split_modules = ["LlamaDecoderLayer"]
_skip_keys_device_placement = ["past_key_values", "causal_mask"]
_supports_flash_attn = True
_supports_sdpa = True
_supports_static_cache = True
_supports_param_buffer_assignment = False
def _init_weights(self, module):
"""Initialize the weights"""
# Required only for Linear layer in DeepseekVLAligner
if isinstance(module, nn.Linear):
module.weight.data.normal_(mean=0.0, std=self.config.text_config.initializer_range)
if module.bias is not None:
module.bias.data.zero_()
@auto_docstring
class DeepseekVLModel(DeepseekVLPreTrainedModel):
def __init__(self, config):
super().__init__(config)
self.config = config
self.vision_model = AutoModel.from_config(config.vision_config)
self.aligner = DeepseekVLAligner(config)
self.language_model = AutoModel.from_config(config=config.text_config)
self.gradient_checkpointing = False
# Initialize weights and apply final processing.
self.post_init()
def get_input_embeddings(self):
return self.language_model.get_input_embeddings()
def set_input_embeddings(self, value):
self.language_model.set_input_embeddings(value)
def get_image_features(self, pixel_values):
image_embeds = self.vision_model(pixel_values)
image_embeds = self.aligner(image_embeds.last_hidden_state)
return image_embeds
@can_return_tuple
@auto_docstring
def forward(
self,
input_ids: torch.LongTensor = None,
pixel_values: torch.FloatTensor = None,
attention_mask: Optional[torch.Tensor] = None,
position_ids: Optional[torch.LongTensor] = None,
past_key_values: Optional[Cache] = None,
cache_position: Optional[torch.LongTensor] = None,
inputs_embeds: Optional[torch.FloatTensor] = None,
use_cache: Optional[bool] = None,
logits_to_keep: Union[int, torch.Tensor] = 0,
**kwargs,
):
if (input_ids is None) ^ (inputs_embeds is not None):
raise ValueError(
"You cannot specify both input_ids and inputs_embeds at the same time, and must specify either one"
)
if inputs_embeds is None:
inputs_embeds = self.get_input_embeddings()(input_ids)
if pixel_values is not None:
if input_ids is None:
image_attention_mask = inputs_embeds == self.get_input_embeddings()(
torch.tensor(self.config.image_token_id, dtype=torch.long, device=inputs_embeds.device)
)
image_attention_mask = image_attention_mask.all(-1)
else:
image_attention_mask = input_ids == self.config.image_token_id
image_attention_mask = image_attention_mask.unsqueeze(-1).expand_as(inputs_embeds).to(inputs_embeds.device)
image_embeds = self.get_image_features(pixel_values)
image_features = image_embeds.reshape(-1, inputs_embeds.shape[-1])
image_features = image_features.to(inputs_embeds.device, inputs_embeds.dtype)
inputs_embeds = inputs_embeds.masked_scatter(image_attention_mask, image_features)
lm_output = self.language_model(
inputs_embeds=inputs_embeds,
attention_mask=attention_mask,
position_ids=position_ids,
past_key_values=past_key_values,
use_cache=use_cache,
cache_position=cache_position,
logits_to_keep=logits_to_keep,
**kwargs,
)
return DeepseekVLBaseModelOutputWithPast(
last_hidden_state=lm_output.last_hidden_state,
past_key_values=lm_output.past_key_values,
hidden_states=lm_output.hidden_states,
attentions=lm_output.attentions,
image_hidden_states=image_embeds if pixel_values is not None else None,
)
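The merge of image features into the text embeddings in `forward` relies on `masked_scatter`, which fills masked positions with source values in order. A pure-Python analogue may make the behaviour clearer; it is only a sketch, it assumes the number of image placeholder tokens equals the number of image feature vectors, and `IMAGE_TOKEN_ID` and `scatter_image_features` are hypothetical names.

```python
IMAGE_TOKEN_ID = 99  # hypothetical placeholder id; the model reads it from config.image_token_id


def scatter_image_features(input_ids, text_embeds, image_features):
    """Replace the embedding at each image-token position with the next image feature, in order."""
    features = iter(image_features)
    merged = []
    for token_id, embed in zip(input_ids, text_embeds):
        merged.append(next(features) if token_id == IMAGE_TOKEN_ID else embed)
    return merged


ids = [5, 99, 99, 8]
text = [[0.5], [0.0], [0.0], [0.8]]
image = [[1.0], [2.0]]
merged = scatter_image_features(ids, text, image)
print(merged)
```

Text positions keep their own embeddings, and the two placeholder positions receive the two image vectors in sequence.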
class DeepseekVLForConditionalGeneration(DeepseekVLPreTrainedModel, GenerationMixin):
_tied_weights_keys = ["model.language_model.embed_tokens.weight", "lm_head.weight"]
_supports_static_cache = True
def __init__(self, config: DeepseekVLConfig):
super().__init__(config)
self.config = config
self.model = DeepseekVLModel(config)
self.lm_head = nn.Linear(config.text_config.hidden_size, config.text_config.vocab_size, bias=False)
# Initialize weights and apply final processing.
self.post_init()
def get_input_embeddings(self):
return self.model.language_model.get_input_embeddings()
def set_input_embeddings(self, value):
self.model.language_model.set_input_embeddings(value)
def prepare_embeddings_for_image_generation(self) -> torch.Tensor:
raise AttributeError("Not needed for DeepseekVL")
def set_decoder(self, decoder):
self.model = decoder
def get_decoder(self):
return self.model
@can_return_tuple
@auto_docstring
def forward(
self,
input_ids: torch.LongTensor = None,
pixel_values: torch.FloatTensor = None,
attention_mask: Optional[torch.Tensor] = None,
position_ids: Optional[torch.LongTensor] = None,
past_key_values: Optional[Cache] = None,
cache_position: Optional[torch.LongTensor] = None,
inputs_embeds: Optional[torch.FloatTensor] = None,
labels: Optional[torch.LongTensor] = None,
use_cache: Optional[bool] = None,
logits_to_keep: Union[int, torch.Tensor] = 0,
**kwargs: Unpack[TransformersKwargs],
):
r"""
labels (`torch.LongTensor` of shape `(batch_size, sequence_length)`, *optional*):
Labels for computing the masked language modeling loss. Indices should either be in `[0, ...,
config.vocab_size]` or -100 (see `input_ids` docstring). Tokens with indices set to `-100` are ignored
(masked), the loss is only computed for the tokens with labels in `[0, ..., config.vocab_size]`.
"""
outputs = self.model(
input_ids=input_ids,
pixel_values=pixel_values,
attention_mask=attention_mask,
position_ids=position_ids,
past_key_values=past_key_values,
inputs_embeds=inputs_embeds,
use_cache=use_cache,
cache_position=cache_position,
**kwargs,
)
hidden_states = outputs.last_hidden_state
# Only compute necessary logits, and do not upcast them to float if we are not computing the loss
slice_indices = slice(-logits_to_keep, None) if isinstance(logits_to_keep, int) else logits_to_keep
logits = self.lm_head(hidden_states[:, slice_indices, :])
loss = None
if labels is not None:
loss = self.loss_function(
logits=logits, labels=labels, vocab_size=self.config.text_config.vocab_size, **kwargs
)
return DeepseekVLCausalLMOutputWithPast(
loss=loss,
logits=logits,
past_key_values=outputs.past_key_values,
hidden_states=outputs.hidden_states,
attentions=outputs.attentions,
image_hidden_states=outputs.image_hidden_states,
)
def prepare_inputs_for_generation(
self,
input_ids,
pixel_values=None,
past_key_values=None,
attention_mask=None,
inputs_embeds=None,
cache_position=None,
logits_to_keep=None,
**kwargs,
):
# Overwritten -- extra custom processing
model_inputs = super().prepare_inputs_for_generation(
input_ids,
past_key_values=past_key_values,
inputs_embeds=inputs_embeds,
attention_mask=attention_mask,
cache_position=cache_position,
logits_to_keep=logits_to_keep,
**kwargs,
)
# If we're in cached decoding stage, pixel values should be None because input ids do not contain special image token anymore
# Otherwise we need pixel values to be passed to model
if cache_position[0] == 0:
model_inputs["pixel_values"] = pixel_values
return model_inputs
__all__ = ["DeepseekVLPreTrainedModel", "DeepseekVLModel", "DeepseekVLForConditionalGeneration"]
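
Review note on `logits_to_keep`: the slicing rule in `forward` relies on `-0 == 0`, so the default value `0` produces `slice(0, None)` and keeps every position, while a positive `n` keeps only the last `n`. The sketch below mimics that rule on a plain Python list instead of a tensor. `select_positions` is a hypothetical helper written for illustration; it is not part of this PR.

```python
def select_positions(sequence, logits_to_keep=0):
    """Mimic the slicing rule from `forward` on a plain Python list."""
    if isinstance(logits_to_keep, int):
        # An int n becomes slice(-n, None); since -0 == 0, n = 0 keeps everything.
        index = slice(-logits_to_keep, None)
    else:
        # Assumption: a non-int value is already a usable index (here, a slice).
        index = logits_to_keep
    return sequence[index]


positions = ["t0", "t1", "t2", "t3"]
all_positions = select_positions(positions, 0)
last_position = select_positions(positions, 1)
last_two = select_positions(positions, 2)
```

During generation only the last position is usually needed, which is why restricting the `lm_head` projection to a slice can save work on long prompts.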


@@ -0,0 +1,326 @@
# Copyright 2025 Deepseek AI and The HuggingFace Team. All rights reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
from typing import Union
from ...configuration_utils import PretrainedConfig
from ...image_processing_utils import BatchFeature
from ...image_utils import (
ImageInput,
make_flat_list_of_images,
)
from ...processing_utils import ProcessingKwargs, ProcessorMixin, Unpack
from ...tokenization_utils_base import (
PreTokenizedInput,
TextInput,
)
from ...utils import (
auto_docstring,
is_torch_available,
logging,
)
from ..auto import CONFIG_MAPPING, AutoConfig, AutoModel
from ..idefics.modeling_idefics import IdeficsBaseModelOutputWithPast, IdeficsCausalLMOutputWithPast
from ..janus.image_processing_janus import JanusImageProcessor
from ..janus.modeling_janus import JanusForConditionalGeneration, JanusModel, JanusPreTrainedModel
if is_torch_available():
import torch
import torch.nn as nn
logger = logging.get_logger(__name__)
class DeepseekVLConfig(PretrainedConfig):
r"""
This is the configuration class to store the configuration of a [`DeepseekVLModel`]. It is used to instantiate a
DeepseekVL model according to the specified arguments, defining the model architecture. Instantiating a configuration
with the defaults will yield a similar configuration to that of the DeepseekVL
[deepseek-community/deepseek-vl-1.3b-chat](https://huggingface.co/deepseek-community/deepseek-vl-1.3b-chat) architecture.
Configuration objects inherit from [`PretrainedConfig`] and can be used to control the model outputs. Read the
documentation from [`PretrainedConfig`] for more information.
Args:
text_config (`Union[AutoConfig, dict]`, *optional*, defaults to `LlamaConfig`):
The config object or dictionary of the text backbone.
vision_config (`Union[AutoConfig, dict]`, *optional*, defaults to `SiglipVisionConfig`):
The config object or dictionary of the vision backbone.
image_token_id (`int`, *optional*, defaults to 100015):
The index representing image tokens in the model's token vocabulary.
Example:
```python
>>> from transformers import DeepseekVLConfig, DeepseekVLModel
>>> # Initializing a DeepseekVL deepseek-community/deepseek-vl-1.3b-chat style configuration
>>> configuration = DeepseekVLConfig()
>>> # Initializing a model (with random weights) from the deepseek-community/deepseek-vl-1.3b-chat style configuration
>>> model = DeepseekVLModel(configuration)
>>> # Accessing the model configuration
>>> configuration = model.config
```"""
model_type = "deepseek_vl"
sub_configs = {"text_config": AutoConfig, "vision_config": AutoConfig}
def __init__(
self,
text_config: AutoConfig = None,
vision_config: AutoConfig = None,
image_token_id: int = 100015,
**kwargs,
):
super().__init__(**kwargs)
if text_config is None:
text_config = {}
logger.info("`text_config` is `None`. Initializing the `LlamaConfig` with default values.")
if vision_config is None:
vision_config = {}
logger.info("`vision_config` is `None`. Initializing the `SiglipVisionConfig` with default values.")
if isinstance(text_config, dict):
text_config["model_type"] = text_config.get("model_type", "llama")
text_config = CONFIG_MAPPING[text_config["model_type"]](**text_config)
if isinstance(vision_config, dict):
vision_config["model_type"] = vision_config.get("model_type", "siglip_vision_model")
vision_config = CONFIG_MAPPING[vision_config["model_type"]](**vision_config)
self.text_config = text_config
self.vision_config = vision_config
self.image_token_id = image_token_id
class DeepseekVLBaseModelOutputWithPast(IdeficsBaseModelOutputWithPast):
pass
class DeepseekVLCausalLMOutputWithPast(IdeficsCausalLMOutputWithPast):
pass
class DeepseekVLAligner(nn.Module):
def __init__(self, config):
super().__init__()
self.config = config
in_features = config.vision_config.hidden_size
out_features = config.text_config.hidden_size
self.linear1 = nn.Linear(in_features, out_features)
self.activation = nn.GELU()
self.linear2 = nn.Linear(out_features, out_features)
def forward(self, vision_encodings: torch.Tensor) -> torch.Tensor:
x = self.linear1(vision_encodings)
x = self.activation(x)
x = self.linear2(x)
return x
class DeepseekVLPreTrainedModel(JanusPreTrainedModel):
_no_split_modules = ["LlamaDecoderLayer"]
def _init_weights(self, module):
"""Initialize the weights"""
# Required only for Linear layer in DeepseekVLAligner
if isinstance(module, nn.Linear):
module.weight.data.normal_(mean=0.0, std=self.config.text_config.initializer_range)
if module.bias is not None:
module.bias.data.zero_()
@auto_docstring
class DeepseekVLModel(JanusModel):
def __init__(self, config):
super().__init__(config)
self.config = config
self.vision_model = AutoModel.from_config(config.vision_config)
self.aligner = DeepseekVLAligner(config)
self.language_model = AutoModel.from_config(config=config.text_config)
self.gradient_checkpointing = False
# Initialize weights and apply final processing.
self.post_init()
del self.vqmodel
del self.generation_embeddings
del self.generation_aligner
del self.generation_head
class DeepseekVLForConditionalGeneration(JanusForConditionalGeneration):
def prepare_embeddings_for_image_generation(self):
raise AttributeError("Not needed for DeepseekVL")
def decode_image_tokens(self):
raise AttributeError("Not needed for DeepseekVL")
def generate(self):
raise AttributeError("Not needed for DeepseekVL")
class DeepseekVLImageProcessor(JanusImageProcessor):
def postprocess(self):
raise AttributeError("Not needed for DeepseekVL")
def unnormalize(self):
raise AttributeError("Not needed for DeepseekVL")
class DeepseekVLProcessorKwargs(ProcessingKwargs, total=False):
_defaults = {
"text_kwargs": {"padding": False},
"common_kwargs": {"return_tensors": "pt"},
}
class DeepseekVLProcessor(ProcessorMixin):
r"""
Constructs a DeepseekVL processor which wraps a DeepseekVL Image Processor and a Llama tokenizer into a single processor.
[`DeepseekVLProcessor`] offers all the functionalities of [`DeepseekVLImageProcessor`] and [`LlamaTokenizerFast`]. See the
[`~DeepseekVLProcessor.__call__`] and [`~DeepseekVLProcessor.decode`] for more information.
Args:
image_processor ([`DeepseekVLImageProcessor`]):
The image processor is a required input.
tokenizer ([`LlamaTokenizerFast`]):
The tokenizer is a required input.
chat_template (`str`, *optional*):
A Jinja template which will be used to convert lists of messages
in a chat into a tokenizable string.
num_image_tokens (`int`, *optional*, defaults to 576):
The number of special image tokens used as placeholders for visual content in text sequences.
"""
attributes = ["image_processor", "tokenizer"]
valid_kwargs = ["chat_template", "num_image_tokens"]
image_processor_class = "AutoImageProcessor"
tokenizer_class = "AutoTokenizer"
def __init__(
self,
image_processor,
tokenizer,
chat_template=None,
num_image_tokens=576,
):
self.image_token = tokenizer.image_token
self.num_image_tokens = num_image_tokens
super().__init__(image_processor, tokenizer, chat_template=chat_template)
def __call__(
self,
text: Union[TextInput, PreTokenizedInput, list[TextInput], list[PreTokenizedInput]] = None,
images: ImageInput = None,
**kwargs: Unpack[DeepseekVLProcessorKwargs],
) -> BatchFeature:
"""
Main method to prepare one or several sequence(s) and image(s) for the model. This method forwards the `text`
and `kwargs` arguments to LlamaTokenizerFast's [`~LlamaTokenizerFast.__call__`] if `text` is not `None` to encode
the text. To prepare the image(s), this method forwards the `images` and `kwargs` arguments to
DeepseekVLImageProcessor's [`~DeepseekVLImageProcessor.__call__`] if `images` is not `None`. Please refer to the docstring
of the above two methods for more information.
Args:
text (`str`, `List[str]`, `List[List[str]]`):
The sequence or batch of sequences to be encoded. Each sequence can be a string or a list of strings
(pretokenized string). If the sequences are provided as list of strings (pretokenized), you must set
`is_split_into_words=True` (to lift the ambiguity with a batch of sequences).
images (`PIL.Image.Image`, `np.ndarray`, `torch.Tensor`, `List[PIL.Image.Image]`, `List[np.ndarray]`, `List[torch.Tensor]`):
The image or batch of images to be prepared. Each image can be a PIL image, NumPy array or PyTorch
tensor. Both channels-first and channels-last formats are supported.
return_tensors (`str` or [`~utils.TensorType`], *optional*):
If set, will return tensors of a particular framework. Acceptable values are:
- `'tf'`: Return TensorFlow `tf.constant` objects.
- `'pt'`: Return PyTorch `torch.Tensor` objects.
- `'np'`: Return NumPy `np.ndarray` objects.
- `'jax'`: Return JAX `jnp.ndarray` objects.
Returns:
[`BatchFeature`]: A [`BatchFeature`] with the following fields:
- **input_ids** -- List of token ids to be fed to a model. Returned when `text` is not `None`.
- **attention_mask** -- List of indices specifying which tokens should be attended to by the model (when
`return_attention_mask=True` or if *"attention_mask"* is in `self.model_input_names` and if `text` is not
`None`).
- **pixel_values** -- Pixel values to be fed to a model. Returned when `images` is not `None`.
"""
output_kwargs = self._merge_kwargs(
DeepseekVLProcessorKwargs, tokenizer_init_kwargs=self.tokenizer.init_kwargs, **kwargs
)
if text is None and images is None:
raise ValueError("You must specify either text or images.")
if text is not None:
if isinstance(text, str):
text = [text]
elif not (isinstance(text, (list, tuple)) and all(isinstance(t, str) for t in text)):
raise ValueError("Invalid input text. Please provide a string, or a list of strings")
prompt_strings = []
one_img_tokens = self.image_token * self.num_image_tokens
for prompt in text:
prompt = prompt.replace(self.image_token, one_img_tokens)
prompt_strings.append(prompt)
data = self.tokenizer(prompt_strings, **output_kwargs["text_kwargs"])
# process images if they are provided
if images is not None:
images = make_flat_list_of_images(images)
data["pixel_values"] = self.image_processor(images, **output_kwargs["images_kwargs"])["pixel_values"]
return BatchFeature(data=data)
def batch_decode(self, *args, **kwargs):
"""
This method forwards all its arguments to LlamaTokenizerFast's [`~PreTrainedTokenizer.batch_decode`]. Please
refer to the docstring of this method for more information.
"""
return self.tokenizer.batch_decode(*args, **kwargs)
def decode(self, *args, **kwargs):
"""
This method forwards all its arguments to LlamaTokenizerFast's [`~PreTrainedTokenizer.decode`]. Please refer to
the docstring of this method for more information.
"""
return self.tokenizer.decode(*args, **kwargs)
@property
def model_input_names(self):
tokenizer_input_names = self.tokenizer.model_input_names
image_processor_input_names = self.image_processor.model_input_names
return list(dict.fromkeys(tokenizer_input_names + image_processor_input_names))
__all__ = [
"DeepseekVLConfig",
"DeepseekVLPreTrainedModel",
"DeepseekVLModel",
"DeepseekVLForConditionalGeneration",
"DeepseekVLImageProcessor",
"DeepseekVLProcessor",
]
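
Review note on image-token expansion: `DeepseekVLProcessor.__call__` replaces each occurrence of the image token in a prompt with `num_image_tokens` copies of it before tokenizing, so the text sequence reserves one position per image feature. A minimal sketch of that step in isolation follows. The helper name `expand_image_tokens` is hypothetical, and using `<image_placeholder>` as the token string is an assumption taken from the chat template in the conversion script; the real value comes from `tokenizer.image_token`.

```python
def expand_image_tokens(prompts, image_token="<image_placeholder>", num_image_tokens=576):
    """Repeat every image token `num_image_tokens` times, as the processor does."""
    # One image in the prompt becomes a run of identical placeholder tokens.
    one_img_tokens = image_token * num_image_tokens
    return [prompt.replace(image_token, one_img_tokens) for prompt in prompts]


# A small repeat count keeps the result readable; the processor default is 576.
expanded = expand_image_tokens(
    ["User: <image_placeholder>Describe this."], num_image_tokens=3
)
```

Because `str.replace` rewrites every occurrence, a prompt with two images ends up with two runs of placeholders, one per image.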


@@ -0,0 +1,157 @@
# 🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨
# This file was automatically generated from src/transformers/models/deepseek_vl/modular_deepseek_vl.py.
# Do NOT edit this file manually as any edits will be overwritten by the generation of
# the file from the modular. If any change should be done, please apply the change to the
# modular_deepseek_vl.py file directly. One of our CI enforces this.
# 🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨
# Copyright 2025 Deepseek AI and The HuggingFace Team. All rights reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
from typing import Union
from ...image_processing_utils import BatchFeature
from ...image_utils import ImageInput, make_flat_list_of_images
from ...processing_utils import ProcessingKwargs, ProcessorMixin, Unpack
from ...tokenization_utils_base import PreTokenizedInput, TextInput
class DeepseekVLProcessorKwargs(ProcessingKwargs, total=False):
_defaults = {
"text_kwargs": {"padding": False},
"common_kwargs": {"return_tensors": "pt"},
}
class DeepseekVLProcessor(ProcessorMixin):
r"""
Constructs a DeepseekVL processor which wraps a DeepseekVL Image Processor and a Llama tokenizer into a single processor.
[`DeepseekVLProcessor`] offers all the functionalities of [`DeepseekVLImageProcessor`] and [`LlamaTokenizerFast`]. See the
[`~DeepseekVLProcessor.__call__`] and [`~DeepseekVLProcessor.decode`] for more information.
Args:
image_processor ([`DeepseekVLImageProcessor`]):
The image processor is a required input.
tokenizer ([`LlamaTokenizerFast`]):
The tokenizer is a required input.
chat_template (`str`, *optional*):
A Jinja template which will be used to convert lists of messages
in a chat into a tokenizable string.
num_image_tokens (`int`, *optional*, defaults to 576):
The number of special image tokens used as placeholders for visual content in text sequences.
"""
attributes = ["image_processor", "tokenizer"]
valid_kwargs = ["chat_template", "num_image_tokens"]
image_processor_class = "AutoImageProcessor"
tokenizer_class = "AutoTokenizer"
def __init__(
self,
image_processor,
tokenizer,
chat_template=None,
num_image_tokens=576,
):
self.image_token = tokenizer.image_token
self.num_image_tokens = num_image_tokens
super().__init__(image_processor, tokenizer, chat_template=chat_template)
def __call__(
self,
text: Union[TextInput, PreTokenizedInput, list[TextInput], list[PreTokenizedInput]] = None,
images: ImageInput = None,
**kwargs: Unpack[DeepseekVLProcessorKwargs],
) -> BatchFeature:
"""
Main method to prepare one or several sequence(s) and image(s) for the model. This method forwards the `text`
and `kwargs` arguments to LlamaTokenizerFast's [`~LlamaTokenizerFast.__call__`] if `text` is not `None` to encode
the text. To prepare the image(s), this method forwards the `images` and `kwargs` arguments to
DeepseekVLImageProcessor's [`~DeepseekVLImageProcessor.__call__`] if `images` is not `None`. Please refer to the docstring
of the above two methods for more information.
Args:
text (`str`, `List[str]`, `List[List[str]]`):
The sequence or batch of sequences to be encoded. Each sequence can be a string or a list of strings
(pretokenized string). If the sequences are provided as list of strings (pretokenized), you must set
`is_split_into_words=True` (to lift the ambiguity with a batch of sequences).
images (`PIL.Image.Image`, `np.ndarray`, `torch.Tensor`, `List[PIL.Image.Image]`, `List[np.ndarray]`, `List[torch.Tensor]`):
The image or batch of images to be prepared. Each image can be a PIL image, NumPy array or PyTorch
tensor. Both channels-first and channels-last formats are supported.
return_tensors (`str` or [`~utils.TensorType`], *optional*):
If set, will return tensors of a particular framework. Acceptable values are:
- `'tf'`: Return TensorFlow `tf.constant` objects.
- `'pt'`: Return PyTorch `torch.Tensor` objects.
- `'np'`: Return NumPy `np.ndarray` objects.
- `'jax'`: Return JAX `jnp.ndarray` objects.
Returns:
[`BatchFeature`]: A [`BatchFeature`] with the following fields:
- **input_ids** -- List of token ids to be fed to a model. Returned when `text` is not `None`.
- **attention_mask** -- List of indices specifying which tokens should be attended to by the model (when
`return_attention_mask=True` or if *"attention_mask"* is in `self.model_input_names` and if `text` is not
`None`).
- **pixel_values** -- Pixel values to be fed to a model. Returned when `images` is not `None`.
"""
output_kwargs = self._merge_kwargs(
DeepseekVLProcessorKwargs, tokenizer_init_kwargs=self.tokenizer.init_kwargs, **kwargs
)
if text is None and images is None:
raise ValueError("You must specify either text or images.")
if text is not None:
if isinstance(text, str):
text = [text]
elif not (isinstance(text, (list, tuple)) and all(isinstance(t, str) for t in text)):
raise ValueError("Invalid input text. Please provide a string, or a list of strings")
prompt_strings = []
one_img_tokens = self.image_token * self.num_image_tokens
for prompt in text:
prompt = prompt.replace(self.image_token, one_img_tokens)
prompt_strings.append(prompt)
data = self.tokenizer(prompt_strings, **output_kwargs["text_kwargs"])
# process images if they are provided
if images is not None:
images = make_flat_list_of_images(images)
data["pixel_values"] = self.image_processor(images, **output_kwargs["images_kwargs"])["pixel_values"]
return BatchFeature(data=data)
def batch_decode(self, *args, **kwargs):
"""
This method forwards all its arguments to LlamaTokenizerFast's [`~PreTrainedTokenizer.batch_decode`]. Please
refer to the docstring of this method for more information.
"""
return self.tokenizer.batch_decode(*args, **kwargs)
def decode(self, *args, **kwargs):
"""
This method forwards all its arguments to LlamaTokenizerFast's [`~PreTrainedTokenizer.decode`]. Please refer to
the docstring of this method for more information.
"""
return self.tokenizer.decode(*args, **kwargs)
@property
def model_input_names(self):
tokenizer_input_names = self.tokenizer.model_input_names
image_processor_input_names = self.image_processor.model_input_names
return list(dict.fromkeys(tokenizer_input_names + image_processor_input_names))
__all__ = ["DeepseekVLProcessor"]
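
Review note on `model_input_names`: the property merges the tokenizer's and image processor's input names with `list(dict.fromkeys(...))`, which removes duplicates while keeping first-seen order (a plain `set` would lose the order). A minimal sketch follows; the function name `merge_input_names` and the example name lists are illustrative assumptions, not values read from a real tokenizer or image processor.

```python
def merge_input_names(tokenizer_input_names, image_processor_input_names):
    """Concatenate two name lists and drop duplicates, keeping first-seen order."""
    # dict keys are unique and preserve insertion order (Python 3.7+).
    return list(dict.fromkeys(tokenizer_input_names + image_processor_input_names))


merged = merge_input_names(
    ["input_ids", "attention_mask"],
    ["pixel_values", "attention_mask"],  # the duplicate here is hypothetical
)
```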


@@ -0,0 +1,30 @@
# Copyright 2025 Deepseek AI and The HuggingFace Team. All rights reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
from typing import TYPE_CHECKING
from ...utils import _LazyModule
from ...utils.import_utils import define_import_structure
if TYPE_CHECKING:
from .configuration_deepseek_vl_hybrid import *
from .image_processing_deepseek_vl_fast_hybrid import *
from .image_processing_deepseek_vl_hybrid import *
from .modeling_deepseek_vl_hybrid import *
from .processing_deepseek_vl_hybrid import *
else:
import sys
_file = globals()["__file__"]
sys.modules[__name__] = _LazyModule(__name__, _file, define_import_structure(_file), module_spec=__spec__)


@@ -0,0 +1,108 @@
# 🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨
# This file was automatically generated from src/transformers/models/deepseek_vl_hybrid/modular_deepseek_vl_hybrid.py.
# Do NOT edit this file manually as any edits will be overwritten by the generation of
# the file from the modular. If any change should be done, please apply the change to the
# modular_deepseek_vl_hybrid.py file directly. One of our CI enforces this.
# 🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨
# Copyright 2025 Deepseek AI and The HuggingFace Team. All rights reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
from ...configuration_utils import PretrainedConfig
from ...utils import logging
from ..auto import CONFIG_MAPPING, AutoConfig
logger = logging.get_logger(__name__)
class DeepseekVLHybridConfig(PretrainedConfig):
r"""
This is the configuration class to store the configuration of a [`DeepseekVLHybridModel`]. It is used to instantiate a
DeepseekVLHybrid model according to the specified arguments, defining the model architecture. Instantiating a configuration
with the defaults will yield a similar configuration to that of the DeepseekVLHybrid
[deepseek-community/deepseek-vl-7b-chat](https://huggingface.co/deepseek-community/deepseek-vl-7b-chat) architecture.
Configuration objects inherit from [`PretrainedConfig`] and can be used to control the model outputs. Read the
documentation from [`PretrainedConfig`] for more information.
Args:
text_config (`Union[AutoConfig, dict]`, *optional*, defaults to `LlamaConfig`):
The config object or dictionary of the text backbone.
vision_config (`Union[AutoConfig, dict]`, *optional*, defaults to `SiglipVisionConfig`):
The config object or dictionary of the vision backbone.
high_res_vision_config (`Union[AutoConfig, dict]`, *optional*, defaults to `SamVisionConfig`):
The config object or dictionary of the high resolution vision backbone.
image_token_id (`int`, *optional*, defaults to 100015):
The index representing image tokens in the model's token vocabulary.
Example:
```python
>>> from transformers import DeepseekVLHybridConfig, DeepseekVLHybridModel
>>> # Initializing a DeepseekVLHybrid deepseek-community/deepseek-vl-7b-chat style configuration
>>> configuration = DeepseekVLHybridConfig()
>>> # Initializing a model (with random weights) from the deepseek-community/deepseek-vl-7b-chat style configuration
>>> model = DeepseekVLHybridModel(configuration)
>>> # Accessing the model configuration
>>> configuration = model.config
```"""
model_type = "deepseek_vl_hybrid"
sub_configs = {"text_config": AutoConfig, "vision_config": AutoConfig, "high_res_vision_config": AutoConfig}
def __init__(
self,
text_config: AutoConfig = None,
vision_config: AutoConfig = None,
high_res_vision_config: AutoConfig = None,
image_token_id: int = 100015,
**kwargs,
):
super().__init__(**kwargs)
if text_config is None:
text_config = {}
logger.info("`text_config` is `None`. Initializing the `LlamaConfig` with default values.")
if vision_config is None:
vision_config = {}
logger.info("`vision_config` is `None`. Initializing the `SiglipVisionConfig` with default values.")
if isinstance(text_config, dict):
text_config["model_type"] = text_config.get("model_type", "llama")
text_config = CONFIG_MAPPING[text_config["model_type"]](**text_config)
if isinstance(vision_config, dict):
vision_config["model_type"] = vision_config.get("model_type", "siglip_vision_model")
vision_config = CONFIG_MAPPING[vision_config["model_type"]](**vision_config)
self.text_config = text_config
self.vision_config = vision_config
self.image_token_id = image_token_id
if high_res_vision_config is None:
high_res_vision_config = {}
logger.info("`high_res_vision_config` is `None`. Initializing the `SamVisionConfig` with default values.")
if isinstance(high_res_vision_config, dict):
high_res_vision_config["model_type"] = high_res_vision_config.get("model_type", "sam_vision_model")
high_res_vision_config = CONFIG_MAPPING[high_res_vision_config["model_type"]](**high_res_vision_config)
self.high_res_vision_config = high_res_vision_config
__all__ = ["DeepseekVLHybridConfig"]
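
Review note on sub-config resolution: each sub-config (`text_config`, `vision_config`, `high_res_vision_config`) follows the same pattern: `None` becomes an empty dict, a dict gets a default `model_type` if it has none, and the matching class is looked up in `CONFIG_MAPPING` and instantiated. The sketch below reproduces that pattern with stand-in classes so it runs without `transformers`. Everything named `Stub*` or `STUB_*`, the helper `resolve_sub_config`, and the `hidden_size` values are hypothetical placeholders, not the library's real classes or defaults.

```python
from dataclasses import dataclass


@dataclass
class StubTextConfig:
    model_type: str = "llama"
    hidden_size: int = 8  # placeholder value, not a real default


@dataclass
class StubVisionConfig:
    model_type: str = "siglip_vision_model"
    hidden_size: int = 4  # placeholder value, not a real default


# Stand-in for CONFIG_MAPPING: model_type string -> config class.
STUB_CONFIG_MAPPING = {
    "llama": StubTextConfig,
    "siglip_vision_model": StubVisionConfig,
}


def resolve_sub_config(sub_config, default_model_type):
    """Turn None or a dict into a config object; pass config objects through."""
    if sub_config is None:
        sub_config = {}
    if isinstance(sub_config, dict):
        # Copy first so the caller's dict is left untouched; the code in the
        # PR writes the default `model_type` into the dict it receives.
        sub_config = dict(sub_config)
        sub_config["model_type"] = sub_config.get("model_type", default_model_type)
        sub_config = STUB_CONFIG_MAPPING[sub_config["model_type"]](**sub_config)
    return sub_config


text_config = resolve_sub_config(None, "llama")
vision_config = resolve_sub_config({"hidden_size": 16}, "siglip_vision_model")
```

Factoring the pattern into one helper is a design choice of this sketch only; the PR repeats the three steps inline for each sub-config.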


@@ -0,0 +1,394 @@
# coding=utf-8
# Copyright 2025 Deepseek AI and The HuggingFace Team. All rights reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
import argparse
import gc
import json
import os
from typing import Optional
import regex as re
import torch
from accelerate import init_empty_weights
from huggingface_hub import snapshot_download
from huggingface_hub.errors import HFValidationError
from safetensors.torch import load_file
from transformers import (
AutoTokenizer,
DeepseekVLHybridConfig,
DeepseekVLHybridForConditionalGeneration,
DeepseekVLHybridImageProcessor,
DeepseekVLHybridProcessor,
)
from transformers.image_utils import (
IMAGENET_STANDARD_MEAN,
IMAGENET_STANDARD_STD,
OPENAI_CLIP_MEAN,
OPENAI_CLIP_STD,
PILImageResampling,
)
# fmt: off
ORIGINAL_TO_CONVERTED_KEY_MAPPING = {
# Sam (High Resolution)
r"vision_model.vision_tower_high.vision_tower.pos_embed": r"model.high_res_vision_model.vision_encoder.pos_embed",
r"vision_model.vision_tower_high.vision_tower.patch_embed.proj.(weight|bias)": r"model.high_res_vision_model.vision_encoder.patch_embed.projection.\1",
r"vision_model.vision_tower_high.vision_tower.blocks.(\d+).norm(\d+).(weight|bias)": r"model.high_res_vision_model.vision_encoder.layers.\1.layer_norm\2.\3",
r"vision_model.vision_tower_high.vision_tower.blocks.(\d+).attn.rel_pos_(h|w)": r"model.high_res_vision_model.vision_encoder.layers.\1.attn.rel_pos_\2",
r"vision_model.vision_tower_high.vision_tower.blocks.(\d+).attn.qkv.(weight|bias)": r"model.high_res_vision_model.vision_encoder.layers.\1.attn.qkv.\2",
r"vision_model.vision_tower_high.vision_tower.blocks.(\d+).attn.proj.(weight|bias)": r"model.high_res_vision_model.vision_encoder.layers.\1.attn.proj.\2",
r"vision_model.vision_tower_high.vision_tower.blocks.(\d+).mlp.lin(\d+).(weight|bias)": r"model.high_res_vision_model.vision_encoder.layers.\1.mlp.lin\2.\3",
r"vision_model.vision_tower_high.vision_tower.neck.0.weight": r"model.high_res_vision_model.vision_encoder.neck.conv1.weight",
r"vision_model.vision_tower_high.vision_tower.neck.1.(weight|bias)": r"model.high_res_vision_model.vision_encoder.neck.layer_norm1.\1",
r"vision_model.vision_tower_high.vision_tower.neck.2.weight": r"model.high_res_vision_model.vision_encoder.neck.conv2.weight",
r"vision_model.vision_tower_high.vision_tower.neck.3.(weight|bias)": r"model.high_res_vision_model.vision_encoder.neck.layer_norm2.\1",
r"vision_model.vision_tower_high.vision_tower.neck_hd.0.weight": r"model.high_res_vision_neck.conv1.weight",
r"vision_model.vision_tower_high.vision_tower.neck_hd.1.(weight|bias)": r"model.high_res_vision_neck.layer_norm1.\1",
r"vision_model.vision_tower_high.vision_tower.neck_hd.2.weight": r"model.high_res_vision_neck.conv2.weight",
r"vision_model.vision_tower_high.vision_tower.neck_hd.3.(weight|bias)": r"model.high_res_vision_neck.layer_norm2.\1",
r"vision_model.vision_tower_high.vision_tower.downsamples.0.weight": r"model.high_res_vision_proj.conv1.weight",
r"vision_model.vision_tower_high.vision_tower.downsamples.1.weight": r"model.high_res_vision_proj.conv2.weight",
r"vision_model.vision_tower_high.vision_tower.hd_alpha_downsamples": r"model.high_res_vision_alpha",
# Siglip (Low Resolution)
r"vision_model.vision_tower_low.vision_tower.pos_embed": r"model.vision_model.vision_model.embeddings.position_embedding.weight",
r"vision_model.vision_tower_low.vision_tower.patch_embed.proj.(weight|bias)": r"model.vision_model.vision_model.embeddings.patch_embedding.\1",
r"vision_model.vision_tower_low.vision_tower.blocks.(\d+).attn.qkv.(weight|bias)": r"model.vision_model.vision_model.encoder.layers.\1.self_attn.(q|k|v)_proj.\2",
r"vision_model.vision_tower_low.vision_tower.blocks.(\d+).attn.proj.(weight|bias)": r"model.vision_model.vision_model.encoder.layers.\1.self_attn.out_proj.\2",
r"vision_model.vision_tower_low.vision_tower.blocks.(\d+).norm(\d+).(weight|bias)": r"model.vision_model.vision_model.encoder.layers.\1.layer_norm\2.\3",
r"vision_model.vision_tower_low.vision_tower.blocks.(\d+).mlp.fc(\d+).(weight|bias)": r"model.vision_model.vision_model.encoder.layers.\1.mlp.fc\2.\3",
r"vision_model.vision_tower_low.vision_tower.norm.(weight|bias)": r"model.vision_model.vision_model.post_layernorm.\1",
r"vision_model.vision_tower_low.vision_tower.attn_pool.latent": r"model.vision_model.vision_model.head.probe",
r"vision_model.vision_tower_low.vision_tower.attn_pool.proj.(weight|bias)": r"model.vision_model.vision_model.head.attention.out_proj.\1",
r"vision_model.vision_tower_low.vision_tower.attn_pool.norm.(weight|bias)": r"model.vision_model.vision_model.head.layernorm.\1",
r"vision_model.vision_tower_low.vision_tower.attn_pool.mlp.fc(\d+).(weight|bias)": r"model.vision_model.vision_model.head.mlp.fc\1.\2",
# Vision Projection
r"aligner.layers.1.(weight|bias)": r"model.aligner.proj.\1",
r"aligner.low_up_proj.(weight|bias)": r"model.aligner.vision_proj.\1",
r"aligner.high_up_proj.(weight|bias)": r"model.aligner.high_res_vision_proj.\1",
# Llama (Text Model)
r"language_model.model.(\w+)": r"model.language_model.\1",
r"language_model.lm_head.(weight|bias)": r"lm_head.\1",
}
# fmt: on
# Adapted from https://github.com/deepseek-ai/DeepSeek-VL/blob/main/deepseek_vl/utils/conversation.py#L80-L91
CHAT_TEMPLATE = (
# Define separators and initialize counter
"{% set seps = ['\n\n', '<\uff5cend\u2581of\u2581sentence\uff5c>'] %}"
"{% set i = 0 %}"
# Start with default system prompt
"You are a helpful language and vision assistant. "
"You are able to understand the visual content that the user provides, "
"and assist the user with a variety of tasks using natural language.\n\n"
# Iterate through messages
"{% for message in messages %}"
# Identify user or assistant role
"{% if message['role']|lower == 'user' %}"
"User: "
"{% elif message['role']|lower == 'assistant' %}"
"Assistant:{% if not (loop.last and not add_generation_prompt and message['content'][0]['type']=='text' and message['content'][0]['text']=='') %} {% endif %}"
"{% else %}"
"{{ message['role'].capitalize() }}: "
"{% endif %}"
# Iterate through message content (text/images)
"{% for content in message['content'] %}"
# If content is an image, replace with placeholder
"{% if content['type'] == 'image' %}"
"<image_placeholder>"
# If content is text, handle formatting
"{% elif content['type'] == 'text' %}"
"{% set text = content['text'] %}"
# Strip whitespace for first and last text blocks
"{% if loop.first %}{% set text = text.lstrip() %}{% endif %}"
"{% if loop.last %}{% set text = text.rstrip() %}{% endif %}"
# If previous content was text, add space
"{% if not loop.first and message['content'][loop.index0-1]['type'] == 'text' %}"
"{{ ' ' + text }}"
"{% else %}"
"{{ text }}"
"{% endif %}"
"{% endif %}"
"{% endfor %}" # End message content loop
# Add separators between messages
"{% if not loop.last or add_generation_prompt %}"
"{% if message['role']|lower == 'user' %}"
"{{ seps[0] }}"
"{% else %}"
"{{ seps[1] }}"
"{% endif %}"
"{% endif %}"
"{% endfor %}" # End messages loop
# Add final Assistant prompt if required
"{% if add_generation_prompt %}Assistant:{% endif %}"
)
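Reviewer note: the Jinja template above is easier to check against a plain-Python equivalent. The sketch below mirrors its branches one for one; `render_prompt` is a hypothetical helper written for illustration, not part of the conversion script, and it assumes every message's `content` is a non-empty list of `{"type": ...}` dicts.

```python
SYSTEM_PROMPT = (
    "You are a helpful language and vision assistant. "
    "You are able to understand the visual content that the user provides, "
    "and assist the user with a variety of tasks using natural language.\n\n"
)
SEPS = ["\n\n", "<\uff5cend\u2581of\u2581sentence\uff5c>"]


def render_prompt(messages, add_generation_prompt=False):
    """Hypothetical pure-Python mirror of CHAT_TEMPLATE."""
    out = [SYSTEM_PROMPT]
    for m_idx, message in enumerate(messages):
        is_last_message = m_idx == len(messages) - 1
        role = message["role"].lower()
        parts = message["content"]

        if role == "user":
            out.append("User: ")
        elif role == "assistant":
            out.append("Assistant:")
            # No trailing space when the final assistant turn is an empty prefill.
            empty_prefill = (
                is_last_message
                and not add_generation_prompt
                and parts[0]["type"] == "text"
                and parts[0]["text"] == ""
            )
            if not empty_prefill:
                out.append(" ")
        else:
            out.append(message["role"].capitalize() + ": ")

        for c_idx, part in enumerate(parts):
            if part["type"] == "image":
                out.append("<image_placeholder>")
            elif part["type"] == "text":
                text = part["text"]
                if c_idx == 0:
                    text = text.lstrip()
                if c_idx == len(parts) - 1:
                    text = text.rstrip()
                # Consecutive text blocks are joined with a single space.
                if c_idx > 0 and parts[c_idx - 1]["type"] == "text":
                    text = " " + text
                out.append(text)

        if not is_last_message or add_generation_prompt:
            out.append(SEPS[0] if role == "user" else SEPS[1])

    if add_generation_prompt:
        out.append("Assistant:")
    return "".join(out)


single_turn = render_prompt(
    [{"role": "user", "content": [{"type": "image"}, {"type": "text", "text": "Describe this. "}]}],
    add_generation_prompt=True,
)
two_turns = render_prompt(
    [
        {"role": "user", "content": [{"type": "text", "text": "hello"}]},
        {"role": "assistant", "content": [{"type": "text", "text": "Hi"}]},
    ]
)
```

Only the first text block is left-stripped and only the last one is right-stripped, so a text block that follows an image keeps any leading whitespace it was given.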
def convert_old_keys_to_new_keys(state_dict_keys: list[str]) -> dict[str, str]:
    """Map every original checkpoint key to its converted name."""
    # Join all keys into one newline-separated string so each regex runs once over the whole set.
    old_text = "\n".join(state_dict_keys)
    new_text = old_text
    for pattern, replacement in ORIGINAL_TO_CONVERTED_KEY_MAPPING.items():
        if replacement is None:
            new_text = re.sub(pattern, "", new_text)  # an empty line
            continue
        new_text = re.sub(pattern, replacement, new_text)
    return dict(zip(old_text.split("\n"), new_text.split("\n")))
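Reviewer note: a minimal sketch of the join, substitute, split technique used for key renaming. `DEMO_MAPPING` and `rename_keys` are illustrative names; the two patterns are copied from the mapping in this file, and the key list is made up.

```python
import re

# Two entries in the same style as ORIGINAL_TO_CONVERTED_KEY_MAPPING (illustrative subset).
DEMO_MAPPING = {
    r"aligner.layers.1.(weight|bias)": r"model.aligner.proj.\1",
    r"language_model.lm_head.(weight|bias)": r"lm_head.\1",
}


def rename_keys(keys, mapping):
    """Apply every regex once to the newline-joined keys, then pair old and new names."""
    old_text = "\n".join(keys)
    new_text = old_text
    for pattern, replacement in mapping.items():
        new_text = re.sub(pattern, replacement, new_text)
    return dict(zip(old_text.split("\n"), new_text.split("\n")))


renamed = rename_keys(
    ["aligner.layers.1.weight", "language_model.lm_head.weight", "untouched.key"],
    DEMO_MAPPING,
)
```

Keys that match no pattern pass through unchanged, which is why the conversion later checks `missing_keys` after loading.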
def get_qkv_state_dict(key, parameter):
    """
    new key which looks like this
    xxxx.(q|k|v).xxx    (m, n)

    is converted to
    xxxx.q.xxxx         (m//3, n)
    xxxx.k.xxxx         (m//3, n)
    xxxx.v.xxxx         (m//3, n)
    """
    qkv_state_dict = {}
    placeholder = re.search(r"(\(.*?\))", key).group(1)  # finds "(q|k|v)"
    replacements_keys = placeholder[1:-1].split("|")  # creates ['q', 'k', 'v']
    replacements_vals = torch.split(
        parameter, split_size_or_sections=parameter.size(0) // len(replacements_keys), dim=0
    )
    for replacement_key, replacement_val in zip(replacements_keys, replacements_vals):
        qkv_state_dict[key.replace(placeholder, replacement_key)] = replacement_val
    return qkv_state_dict
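Reviewer note: the fused-QKV split can be illustrated without torch. The sketch below uses a list of rows in place of a tensor; `split_fused_rows` is a hypothetical stand-in for `get_qkv_state_dict` and assumes the row count is divisible by the number of names in the placeholder.

```python
import re


def split_fused_rows(key, rows):
    """Split a fused parameter (list of rows) into equal chunks, one per name in `(a|b|c)`."""
    placeholder = re.search(r"(\(.*?\))", key).group(1)  # e.g. "(q|k|v)"
    names = placeholder[1:-1].split("|")  # e.g. ['q', 'k', 'v']
    chunk = len(rows) // len(names)
    return {
        key.replace(placeholder, name): rows[i * chunk : (i + 1) * chunk]
        for i, name in enumerate(names)
    }


fused = [[float(r)] * 2 for r in range(6)]  # 6 rows, 2 columns
parts = split_fused_rows("layers.0.self_attn.(q|k|v)_proj.weight", fused)
```

The first third of the rows becomes the query projection, the second third the key projection, and the last third the value projection, matching the row order of the fused `qkv` weight.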
def update_state_dict(old_state_dict):
    all_keys = list(old_state_dict.keys())
    new_keys = convert_old_keys_to_new_keys(all_keys)

    state_dict = {}
    for key in all_keys:
        new_key = new_keys[key]
        current_parameter = old_state_dict.pop(key)

        if "qkv" in key and "vision_tower_high" not in key:
            qkv_state_dict = get_qkv_state_dict(new_key, current_parameter)
            state_dict.update(qkv_state_dict)
        elif "pos_embed" in key:
            if "vision_tower_high" not in key:
                # timm implementation of siglip creates this param of size [1, 576, 1024]
                # transformers implementation of siglip creates this param of size [576, 1024]
                state_dict[new_key] = current_parameter.squeeze(0)
            else:
                state_dict[new_key] = current_parameter
        else:
            state_dict[new_key] = current_parameter

    return state_dict
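Reviewer note: the `[1, 576, 1024]` shape in the comment follows from the vision config used by this script (`image_size=384`, `patch_size=16`, `hidden_size=1024`). A small arithmetic sketch, with `num_patches` and `squeeze_leading` as illustrative helpers:

```python
def num_patches(image_size: int, patch_size: int) -> int:
    """Number of non-overlapping square patches in a square image."""
    side = image_size // patch_size
    return side * side


def squeeze_leading(shape):
    """Drop a leading dimension of size 1, mirroring `tensor.squeeze(0)` on a shape tuple."""
    return shape[1:] if shape and shape[0] == 1 else shape


low_res_patches = num_patches(384, 16)    # low-resolution (SigLIP) tower
high_res_patches = num_patches(1024, 16)  # high-resolution tower
converted_shape = squeeze_leading((1, low_res_patches, 1024))
```

The low-resolution tower therefore has 24 x 24 = 576 position embeddings, and squeezing the leading axis gives the `[576, 1024]` layout named in the comment.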
def load_model_state_dict(input_path: str) -> dict:
    """
    Load model state dict, handling both single and sharded files.
    """
    index_path = os.path.join(input_path, "model.safetensors.index.json")
    single_file_path = os.path.join(input_path, "model.safetensors")

    # Check if we have a sharded model
    if os.path.exists(index_path):
        print("Loading sharded model...")
        state_dict = {}
        with open(index_path, "r") as f:
            index = json.load(f)

        # Get unique shard files and load each one only once
        unique_shard_files = sorted(set(index["weight_map"].values()))
        for shard_file in unique_shard_files:
            print(f"Loading shard {shard_file}...")
            shard_path = os.path.join(input_path, shard_file)
            shard_dict = load_file(shard_path)
            state_dict.update(shard_dict)

        return state_dict

    # Single file model
    elif os.path.exists(single_file_path):
        print("Loading single file model...")
        return load_file(single_file_path, device="cpu")

    else:
        raise ValueError(f"No model files found in {input_path}")
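Reviewer note: the shard index maps each weight name to the file that stores it, so many entries point at the same file. The sketch below shows why the loader deduplicates before reading; the index contents are invented for illustration.

```python
import json

# Made-up index in the layout of `model.safetensors.index.json`.
index_json = json.dumps(
    {
        "weight_map": {
            "a.weight": "model-00001-of-00002.safetensors",
            "a.bias": "model-00001-of-00002.safetensors",
            "b.weight": "model-00002-of-00002.safetensors",
        }
    }
)

index = json.loads(index_json)
# Three weights, but only two distinct files to open.
unique_shard_files = sorted(set(index["weight_map"].values()))
```

Sorting keeps the load order deterministic, which makes the progress output reproducible between runs.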
def convert_model(
hf_repo_id: str,
output_dir: Optional[str] = None,
output_hub_path: Optional[str] = None,
safe_serialization: bool = True,
):
if output_dir:
os.makedirs(output_dir, exist_ok=True)
try:
input_path = snapshot_download(hf_repo_id)
except HFValidationError:
# If the input path is not a HF repo ID, assume it's a local path
input_path = hf_repo_id
# ------------------------------------------------------------
# Create and save config
# ------------------------------------------------------------
config = DeepseekVLHybridConfig(
text_config={
"hidden_size": 4096,
"intermediate_size": 11008,
"max_position_embeddings": 16384,
"num_attention_heads": 32,
"num_hidden_layers": 30,
"vocab_size": 102400,
},
vision_config={
"hidden_size": 1024,
"intermediate_size": 4096,
"image_size": 384,
"patch_size": 16,
"hidden_act": "gelu",
"vision_use_head": False,
"num_attention_heads": 16,
"num_hidden_layers": 24,
},
high_res_vision_config={
"hidden_size": 768,
"intermediate_size": 3072,
"image_size": 1024,
"patch_size": 16,
"num_attention_heads": 12,
"num_hidden_layers": 12,
},
)
# save config
if output_dir:
config.save_pretrained(output_dir)
print("Model config saved successfully...")
# ------------------------------------------------------------
# Convert processor
# ------------------------------------------------------------
image_processor = DeepseekVLHybridImageProcessor(
image_mean=IMAGENET_STANDARD_MEAN,
image_std=IMAGENET_STANDARD_STD,
high_res_image_mean=OPENAI_CLIP_MEAN,
high_res_image_std=OPENAI_CLIP_STD,
resample=PILImageResampling.BILINEAR,
)
tokenizer = AutoTokenizer.from_pretrained(
input_path,
extra_special_tokens={
"pad_token": "<｜end▁of▁sentence｜>",
"image_token": "<image_placeholder>",
},
)
processor = DeepseekVLHybridProcessor(
image_processor=image_processor,
tokenizer=tokenizer,
chat_template=CHAT_TEMPLATE,
)
if output_dir:
print(f"Saving processor to {output_dir}...")
processor.save_pretrained(output_dir)
if output_hub_path:
print(f"Pushing processor to hub at {output_hub_path}...")
processor.push_to_hub(output_hub_path)
# ------------------------------------------------------------
# Convert weights
# ------------------------------------------------------------
print("Creating empty model...")
with init_empty_weights():
model = DeepseekVLHybridForConditionalGeneration(config)
# Load and convert state dict
print("Loading state dict...")
state_dict = load_model_state_dict(input_path)
state_dict = update_state_dict(state_dict)
# Load converted state dict
print("Loading converted weights into model...")
info = model.load_state_dict(state_dict, strict=False, assign=True)
if len(info.missing_keys) > 0:
raise ValueError(f"Missing keys: {info.missing_keys}")
# Tie weights before any device mapping
print("Tying weights...")
model.tie_weights()
# Save the model
if output_dir:
print(f"Saving model to {output_dir}...")
model.save_pretrained(output_dir, safe_serialization=safe_serialization)
if output_hub_path:
print(f"Pushing model to hub at {output_hub_path}...")
model.push_to_hub(output_hub_path, safe_serialization=safe_serialization)
del state_dict, model
gc.collect()
# Validate the saved model if saved locally
if output_dir:
print("Reloading the local model to check if it's saved correctly...")
DeepseekVLHybridForConditionalGeneration.from_pretrained(output_dir, device_map="auto")
print("Local model reloaded successfully.")
def main():
    parser = argparse.ArgumentParser()
    parser.add_argument(
        "--hf_repo_id",
        default="deepseek-ai/deepseek-vl-7b-chat",
        help="Location of official weights from DeepseekAI on HF",
    )
    parser.add_argument(
        "--output_dir",
        default=None,
        help="Location to write the converted model and processor",
    )
    parser.add_argument(
        "--output_hub_path",
        default=None,
        help="Repository ID to push model to hub (e.g. 'username/model-name')",
    )
    parser.add_argument(
        "--safe_serialization",
        default=True,
        # `type=bool` would turn any non-empty string (including "False") into True,
        # so the text is parsed explicitly.
        type=lambda value: str(value).lower() in ("1", "true", "yes"),
        help="Whether or not to save using `safetensors`.",
    )
    args = parser.parse_args()

    convert_model(
        hf_repo_id=args.hf_repo_id,
        output_dir=args.output_dir,
        output_hub_path=args.output_hub_path,
        safe_serialization=args.safe_serialization,
    )


if __name__ == "__main__":
    main()
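Reviewer note: a boolean command-line option such as `--safe_serialization` needs explicit string parsing, because `bool("False")` is `True` in Python. The sketch below contrasts the two behaviours; `str_to_bool` and `build_parser` are hypothetical helpers, not part of the script.

```python
import argparse


def str_to_bool(value: str) -> bool:
    """Parse common textual spellings of a boolean."""
    lowered = value.strip().lower()
    if lowered in ("1", "true", "yes", "y"):
        return True
    if lowered in ("0", "false", "no", "n"):
        return False
    raise argparse.ArgumentTypeError(f"Expected a boolean value, got {value!r}")


def build_parser(flag_type):
    parser = argparse.ArgumentParser()
    parser.add_argument("--safe_serialization", default=True, type=flag_type)
    return parser


# `bool` only checks whether the string is empty, so "False" is parsed as True.
naive = build_parser(bool).parse_args(["--safe_serialization", "False"])
explicit = build_parser(str_to_bool).parse_args(["--safe_serialization", "False"])
```

On Python 3.9 and later, `action=argparse.BooleanOptionalAction` is another option, at the cost of changing the flag syntax to `--safe_serialization` / `--no-safe_serialization`.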


@@ -0,0 +1,483 @@
# 🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨
# This file was automatically generated from src/transformers/models/deepseek_vl_hybrid/modular_deepseek_vl_hybrid.py.
# Do NOT edit this file manually as any edits will be overwritten by the generation of
# the file from the modular. If any change should be done, please apply the change to the
# modular_deepseek_vl_hybrid.py file directly. One of our CI enforces this.
# 🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨
# Copyright 2025 Deepseek AI and The HuggingFace Team. All rights reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
from typing import Optional, Union
import numpy as np
from ...image_processing_utils import BaseImageProcessor
from ...image_processing_utils_fast import BatchFeature, get_size_dict
from ...image_transforms import convert_to_rgb, resize, to_channel_dimension_format
from ...image_utils import (
OPENAI_CLIP_MEAN,
OPENAI_CLIP_STD,
ChannelDimension,
ImageInput,
PILImageResampling,
get_image_size,
infer_channel_dimension_format,
is_scaled_image,
make_list_of_images,
to_numpy_array,
valid_images,
validate_preprocess_arguments,
)
from ...utils import (
TensorType,
filter_out_non_signature_kwargs,
is_vision_available,
logging,
)
if is_vision_available():
import PIL
logger = logging.get_logger(__name__)
class DeepseekVLHybridImageProcessor(BaseImageProcessor):
r"""
Constructs a DEEPSEEK_VL_HYBRID image processor.
Args:
do_resize (`bool`, *optional*, defaults to `True`):
Whether to resize the image's (height, width) dimensions to the specified `size`. Can be overridden by the
`do_resize` parameter in the `preprocess` method.
size (`dict`, *optional*, defaults to `{"height": 384, "width": 384}`):
Size of the output image after resizing. Can be overridden by the `size` parameter in the `preprocess`
method.
high_res_size (`dict`, *optional*, defaults to `{"height": 1024, "width": 1024}`):
Size of the high resolution output image after resizing. Can be overridden by the `high_res_size` parameter in the `preprocess`
method.
min_size (`int`, *optional*, defaults to 14):
The minimum allowed size for the resized image. Ensures that neither the height nor width
falls below this value after resizing.
resample (`PILImageResampling`, *optional*, defaults to `Resampling.BICUBIC`):
Resampling filter to use if resizing the image. Only has an effect if `do_resize` is set to `True`. Can be
overridden by the `resample` parameter in the `preprocess` method.
high_res_resample (`PILImageResampling`, *optional*, defaults to `Resampling.BICUBIC`):
Resampling filter to use if resizing the high resolution image. Only has an effect if `do_resize` is set to `True`. Can be
overridden by the `high_res_resample` parameter in the `preprocess` method.
do_rescale (`bool`, *optional*, defaults to `True`):
Whether to rescale the image by the specified scale `rescale_factor`. Can be overridden by the
`do_rescale` parameter in the `preprocess` method.
rescale_factor (`int` or `float`, *optional*, defaults to `1/255`):
Scale factor to use if rescaling the image. Only has an effect if `do_rescale` is set to `True`. Can be
overridden by the `rescale_factor` parameter in the `preprocess` method.
do_normalize (`bool`, *optional*, defaults to `True`):
Whether to normalize the image. Can be overridden by the `do_normalize` parameter in the `preprocess`
method.
image_mean (`float` or `list[float]`, *optional*, defaults to `OPENAI_CLIP_MEAN`):
Mean to use if normalizing the image. This is a float or list of floats the length of the number of
channels in the image. Can be overridden by the `image_mean` parameter in the `preprocess` method.
image_std (`float` or `list[float]`, *optional*, defaults to `OPENAI_CLIP_STD`):
Standard deviation to use if normalizing the image. This is a float or list of floats the length of the
number of channels in the image. Can be overridden by the `image_std` parameter in the `preprocess` method.
high_res_image_mean (`float` or `list[float]`, *optional*, defaults to `OPENAI_CLIP_MEAN`):
Mean to use if normalizing the high resolution image. This is a float or list of floats the length of the number of
channels in the image. Can be overridden by the `high_res_image_mean` parameter in the `preprocess` method.
high_res_image_std (`float` or `list[float]`, *optional*, defaults to `OPENAI_CLIP_STD`):
Standard deviation to use if normalizing the high resolution image. This is a float or list of floats the length of the
number of channels in the image. Can be overridden by the `high_res_image_std` parameter in the `preprocess` method.
do_convert_rgb (`bool`, *optional*):
Whether to convert the image to RGB. Unset (`None`) by default, in which case no conversion is applied.
"""
model_input_names = ["pixel_values"]
def __init__(
self,
do_resize: bool = True,
size: Optional[dict[str, int]] = None,
high_res_size: Optional[dict[str, int]] = None,
min_size: int = 14,
resample: PILImageResampling = PILImageResampling.BICUBIC,
high_res_resample: PILImageResampling = PILImageResampling.BICUBIC,
do_rescale: bool = True,
rescale_factor: Union[int, float] = 1 / 255,
do_normalize: bool = True,
image_mean: Optional[Union[float, list[float]]] = None,
image_std: Optional[Union[float, list[float]]] = None,
high_res_image_mean: Optional[Union[float, list[float]]] = None,
high_res_image_std: Optional[Union[float, list[float]]] = None,
do_convert_rgb: Optional[bool] = None,
**kwargs,
) -> None:
super().__init__(**kwargs)
high_res_size = high_res_size if high_res_size is not None else {"height": 1024, "width": 1024}
high_res_size = get_size_dict(high_res_size, default_to_square=True)
self.high_res_size = high_res_size
self.high_res_image_mean = high_res_image_mean if high_res_image_mean is not None else OPENAI_CLIP_MEAN
self.high_res_image_std = high_res_image_std if high_res_image_std is not None else OPENAI_CLIP_STD
self.resample = resample
self.high_res_resample = high_res_resample
size = size if size is not None else {"height": 384, "width": 384}
size = get_size_dict(size, default_to_square=True)
self.do_resize = do_resize
self.size = size
self.resample = resample
self.do_rescale = do_rescale
self.rescale_factor = rescale_factor
self.do_normalize = do_normalize
self.image_mean = image_mean if image_mean is not None else OPENAI_CLIP_MEAN
self.image_std = image_std if image_std is not None else OPENAI_CLIP_STD
self.do_convert_rgb = do_convert_rgb
self.min_size = min_size
if image_mean is None:
self.background_color = (127, 127, 127)
else:
self.background_color = tuple([int(x * 255) for x in image_mean])
if high_res_image_mean is None:
self.background_color = (127, 127, 127)
else:
self.background_color = tuple([int(x * 255) for x in high_res_image_mean])
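Reviewer note: the padding color is the normalization mean mapped back to the 0-255 range and truncated with `int`. Because the second `if`/`else` block in `__init__` runs last, the stored `background_color` ends up depending on `high_res_image_mean` only. A sketch of the arithmetic, with `background_from_mean` as an illustrative helper and the constant copied from `transformers.image_utils`:

```python
OPENAI_CLIP_MEAN = [0.48145466, 0.4578275, 0.40821073]  # value of the transformers constant


def background_from_mean(mean=None):
    """Return the padding color for a given normalization mean (mid-gray when unset)."""
    if mean is None:
        return (127, 127, 127)
    return tuple(int(x * 255) for x in mean)


clip_background = background_from_mean(OPENAI_CLIP_MEAN)
default_background = background_from_mean(None)
```

After rescaling and normalization, padding of this color lands close to zero in each channel, which is presumably why the mean is used.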
def resize(
self,
image: np.ndarray,
size: Union[dict[str, int], int],
resample: PILImageResampling = PILImageResampling.BICUBIC,
data_format: Optional[Union[str, ChannelDimension]] = None,
input_data_format: Optional[Union[str, ChannelDimension]] = None,
**kwargs,
) -> np.ndarray:
"""
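Resize an image to a dynamically calculated size.
Args:
image (`np.ndarray`):
Image to resize.
size (`dict[str, int]` or `int`):
Target size. Height and width must be equal: the longest side of the image is resized to this value and
the result is padded to a square.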
resample (`PILImageResampling`, *optional*, defaults to `PILImageResampling.BICUBIC`):
`PILImageResampling` filter to use when resizing the image e.g. `PILImageResampling.BICUBIC`.
data_format (`ChannelDimension` or `str`, *optional*):
The channel dimension format for the output image. If unset, the channel dimension format of the input
image is used. Can be one of:
- `"channels_first"` or `ChannelDimension.FIRST`: image in (num_channels, height, width) format.
- `"channels_last"` or `ChannelDimension.LAST`: image in (height, width, num_channels) format.
- `None`: will be inferred from input
input_data_format (`ChannelDimension` or `str`, *optional*):
The channel dimension format for the input image. If unset, the channel dimension format is inferred
from the input image. Can be one of:
- `"channels_first"` or `ChannelDimension.FIRST`: image in (num_channels, height, width) format.
- `"channels_last"` or `ChannelDimension.LAST`: image in (height, width, num_channels) format.
- `"none"` or `ChannelDimension.NONE`: image in (height, width) format.
Returns:
`np.ndarray`: The resized image.
"""
if input_data_format is None:
input_data_format = infer_channel_dimension_format(image)
height, width = get_image_size(image, input_data_format)
max_size = max(height, width)
size = get_size_dict(size, default_to_square=True)
if size["height"] != size["width"]:
raise ValueError(
f"Output height and width must be the same. Got height={size['height']} and width={size['width']}"
)
size = size["height"]
delta = size / max_size
# Largest side becomes `size` and the other side is scaled according to the aspect ratio.
output_size_nonpadded = [
max(int(height * delta), self.min_size),
max(int(width * delta), self.min_size),
]
image = resize(
image,
size=output_size_nonpadded,
resample=resample,
data_format=data_format,
input_data_format=input_data_format,
**kwargs,
)
# Expand and pad the images to obtain a square image of dimensions `size x size`
image = self.pad_to_square(
image=image,
background_color=self.background_color,
input_data_format=input_data_format,
)
return image
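Reviewer note: the size computed inside `resize` before padding can be checked in isolation. `nonpadded_size` below is a hypothetical helper that repeats the same arithmetic; the inputs are chosen so the scale factor is exactly representable.

```python
def nonpadded_size(height, width, size, min_size=14):
    """Scale the longest side to `size`, keep the aspect ratio, clamp each side to `min_size`."""
    delta = size / max(height, width)
    return [max(int(height * delta), min_size), max(int(width * delta), min_size)]


landscape = nonpadded_size(300, 768, 384)  # scale factor 0.5
very_thin = nonpadded_size(10, 768, 384)   # height would be 5, clamped to min_size
```

The clamp prevents extremely elongated images from collapsing to a zero-pixel side before `pad_to_square` fills the remainder.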
@filter_out_non_signature_kwargs()
def preprocess(
self,
images: ImageInput,
do_resize: Optional[bool] = None,
size: Optional[dict[str, int]] = None,
high_res_size: Optional[dict[str, int]] = None,
resample: PILImageResampling = None,
high_res_resample: PILImageResampling = None,
do_rescale: Optional[bool] = None,
rescale_factor: Optional[float] = None,
do_normalize: Optional[bool] = None,
image_mean: Optional[Union[float, list[float]]] = None,
image_std: Optional[Union[float, list[float]]] = None,
high_res_image_mean: Optional[Union[float, list[float]]] = None,
high_res_image_std: Optional[Union[float, list[float]]] = None,
return_tensors: Optional[Union[str, TensorType]] = None,
data_format: Union[str, ChannelDimension] = ChannelDimension.FIRST,
input_data_format: Optional[Union[str, ChannelDimension]] = None,
do_convert_rgb: Optional[bool] = None,
) -> BatchFeature:
"""
Preprocess an image or batch of images.
Args:
images (`ImageInput`):
Image to preprocess. Expects a single or batch of images with pixel values ranging from 0 to 255. If
passing in images with pixel values between 0 and 1, set `do_rescale=False`.
do_resize (`bool`, *optional*, defaults to `self.do_resize`):
Whether to resize the image.
size (`Dict[str, int]`, *optional*, defaults to `self.size`):
Dictionary in the format `{"height": h, "width": w}` specifying the size of the output image after
resizing.
high_res_size (`Dict[str, int]`, *optional*, defaults to `self.high_res_size`):
Dictionary in the format `{"height": h, "width": w}` specifying the size of the high resolution output image after
resizing.
resample (`PILImageResampling` filter, *optional*, defaults to `self.resample`):
`PILImageResampling` filter to use if resizing the image e.g. `PILImageResampling.BILINEAR`. Only has
an effect if `do_resize` is set to `True`.
high_res_resample (`PILImageResampling` filter, *optional*, defaults to `self.high_res_resample`):
`PILImageResampling` filter to use if resizing the high resolution image e.g. `PILImageResampling.BICUBIC`. Only has
an effect if `do_resize` is set to `True`.
do_rescale (`bool`, *optional*, defaults to `self.do_rescale`):
Whether to rescale the image values between [0 - 1].
rescale_factor (`float`, *optional*, defaults to `self.rescale_factor`):
Rescale factor to rescale the image by if `do_rescale` is set to `True`.
do_normalize (`bool`, *optional*, defaults to `self.do_normalize`):
Whether to normalize the image.
image_mean (`float` or `List[float]`, *optional*, defaults to `self.image_mean`):
Image mean to use if `do_normalize` is set to `True`.
image_std (`float` or `List[float]`, *optional*, defaults to `self.image_std`):
Image standard deviation to use if `do_normalize` is set to `True`.
high_res_image_mean (`float` or `List[float]`, *optional*, defaults to `self.high_res_image_mean`):
High resolution image mean to use if `do_normalize` is set to `True`.
high_res_image_std (`float` or `List[float]`, *optional*, defaults to `self.high_res_image_std`):
High resolution image standard deviation to use if `do_normalize` is set to `True`.
return_tensors (`str` or `TensorType`, *optional*):
The type of tensors to return. Can be one of:
- Unset: Return a list of `np.ndarray`.
- `TensorType.TENSORFLOW` or `'tf'`: Return a batch of type `tf.Tensor`.
- `TensorType.PYTORCH` or `'pt'`: Return a batch of type `torch.Tensor`.
- `TensorType.NUMPY` or `'np'`: Return a batch of type `np.ndarray`.
- `TensorType.JAX` or `'jax'`: Return a batch of type `jax.numpy.ndarray`.
data_format (`ChannelDimension` or `str`, *optional*, defaults to `ChannelDimension.FIRST`):
The channel dimension format for the output image. Can be one of:
- `"channels_first"` or `ChannelDimension.FIRST`: image in (num_channels, height, width) format.
- `"channels_last"` or `ChannelDimension.LAST`: image in (height, width, num_channels) format.
- Unset: Use the channel dimension format of the input image.
input_data_format (`ChannelDimension` or `str`, *optional*):
The channel dimension format for the input image. If unset, the channel dimension format is inferred
from the input image. Can be one of:
- `"channels_first"` or `ChannelDimension.FIRST`: image in (num_channels, height, width) format.
- `"channels_last"` or `ChannelDimension.LAST`: image in (height, width, num_channels) format.
- `"none"` or `ChannelDimension.NONE`: image in (height, width) format.
do_convert_rgb (`bool`, *optional*, defaults to `self.do_convert_rgb`):
Whether to convert the image to RGB.
"""
do_resize = do_resize if do_resize is not None else self.do_resize
do_rescale = do_rescale if do_rescale is not None else self.do_rescale
do_normalize = do_normalize if do_normalize is not None else self.do_normalize
resample = resample if resample is not None else self.resample
high_res_resample = high_res_resample if high_res_resample is not None else self.high_res_resample
rescale_factor = rescale_factor if rescale_factor is not None else self.rescale_factor
image_mean = image_mean if image_mean is not None else self.image_mean
image_std = image_std if image_std is not None else self.image_std
high_res_image_mean = high_res_image_mean if high_res_image_mean is not None else self.high_res_image_mean
high_res_image_std = high_res_image_std if high_res_image_std is not None else self.high_res_image_std
do_convert_rgb = do_convert_rgb if do_convert_rgb is not None else self.do_convert_rgb
size = size if size is not None else self.size
size_dict = get_size_dict(size)
high_res_size = high_res_size if high_res_size is not None else self.high_res_size
high_res_size_dict = get_size_dict(high_res_size)
images = make_list_of_images(images)
if not valid_images(images):
raise ValueError(
"Invalid image type. Must be of type PIL.Image.Image, numpy.ndarray, "
"torch.Tensor, tf.Tensor or jax.ndarray."
)
validate_preprocess_arguments(
do_rescale=do_rescale,
rescale_factor=rescale_factor,
do_normalize=do_normalize,
image_mean=image_mean,
image_std=image_std,
do_resize=do_resize,
size=size,
resample=resample,
)
if do_convert_rgb:
images = [convert_to_rgb(image) for image in images]
# All transformations expect numpy arrays.
images = [to_numpy_array(image) for image in images]
if do_rescale and is_scaled_image(images[0]):
logger.warning_once(
"It looks like you are trying to rescale already rescaled images. If the input"
" images have pixel values between 0 and 1, set `do_rescale=False` to avoid rescaling them again."
)
if input_data_format is None:
# We assume that all images have the same channel dimension format.
input_data_format = infer_channel_dimension_format(images[0])
all_images = []
all_high_res_images = []
for image in images:
# high_res_image: resize (high) -> rescale -> normalize (high)
# low_res_image: resize (high) -> resize (low) -> rescale -> normalize (low)
high_res_image = image
if do_resize:
high_res_image = self.resize(
image=high_res_image,
size=high_res_size_dict,
resample=high_res_resample,
input_data_format=input_data_format,
)
image = self.resize(
image=high_res_image, size=size_dict, resample=resample, input_data_format=input_data_format
)
if do_rescale:
image = self.rescale(image=image, scale=rescale_factor, input_data_format=input_data_format)
high_res_image = self.rescale(
image=high_res_image, scale=rescale_factor, input_data_format=input_data_format
)
if do_normalize:
image = self.normalize(
image=image, mean=image_mean, std=image_std, input_data_format=input_data_format
)
high_res_image = self.normalize(
image=high_res_image,
mean=high_res_image_mean,
std=high_res_image_std,
input_data_format=input_data_format,
)
image = to_channel_dimension_format(image, data_format, input_channel_dim=input_data_format)
high_res_image = to_channel_dimension_format(
high_res_image, data_format, input_channel_dim=input_data_format
)
all_images.append(image)
all_high_res_images.append(high_res_image)
data = {"pixel_values": all_images, "high_res_pixel_values": all_high_res_images}
return BatchFeature(data=data, tensor_type=return_tensors)
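Reviewer note: both branches share the rescale step but normalize with different statistics. The sketch below follows one pixel value through each branch, using the statistics the conversion script passes (`IMAGENET_STANDARD_*` for the low-resolution image, `OPENAI_CLIP_*` for the high-resolution one). `rescale_and_normalize` is an illustrative helper and only the first channel of each constant is shown.

```python
def rescale_and_normalize(pixel, mean, std, rescale_factor=1 / 255):
    """Rescale a 0-255 pixel value to [0, 1], then standardize it."""
    return (pixel * rescale_factor - mean) / std


# First-channel statistics of the transformers constants.
IMAGENET_STANDARD_MEAN_R, IMAGENET_STANDARD_STD_R = 0.5, 0.5
OPENAI_CLIP_MEAN_R, OPENAI_CLIP_STD_R = 0.48145466, 0.26862954

low_res_white = rescale_and_normalize(255, IMAGENET_STANDARD_MEAN_R, IMAGENET_STANDARD_STD_R)
low_res_black = rescale_and_normalize(0, IMAGENET_STANDARD_MEAN_R, IMAGENET_STANDARD_STD_R)
high_res_white = rescale_and_normalize(255, OPENAI_CLIP_MEAN_R, OPENAI_CLIP_STD_R)
```

The same input pixel therefore produces different values in `pixel_values` and `high_res_pixel_values`, which is expected rather than a bug.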
def pad_to_square(
self,
image: np.ndarray,
background_color: Union[int, tuple[int, int, int]] = 0,
data_format: Optional[Union[str, ChannelDimension]] = None,
input_data_format: Optional[Union[str, ChannelDimension]] = None,
) -> np.ndarray:
"""
Pads an image to a square based on the longest edge.
Args:
image (`np.ndarray`):
The image to pad.
background_color (`int` or `tuple[int, int, int]`, *optional*, defaults to 0):
The color to use for the padding. Can be an integer for single-channel images or a
tuple of integers for multi-channel images. If passed as an integer
in multi-channel mode, it is applied to the first channel only and the remaining channels default to `0`.
data_format (`str` or `ChannelDimension`, *optional*):
The channel dimension format for the output image. Can be one of:
- `"channels_first"` or `ChannelDimension.FIRST`: image in (num_channels, height, width) format.
- `"channels_last"` or `ChannelDimension.LAST`: image in (height, width, num_channels) format.
If unset, will use same as the input image.
input_data_format (`str` or `ChannelDimension`, *optional*):
The channel dimension format for the input image. Can be one of:
- `"channels_first"` or `ChannelDimension.FIRST`: image in (num_channels, height, width) format.
- `"channels_last"` or `ChannelDimension.LAST`: image in (height, width, num_channels) format.
Returns:
`np.ndarray`: The padded image.
"""
height, width = get_image_size(image, input_data_format)
num_channels = image.shape[0] if input_data_format == ChannelDimension.FIRST else image.shape[-1]
if height == width:
image = (
to_channel_dimension_format(image, data_format, input_data_format)
if data_format is not None
else image
)
return image
max_dim = max(height, width)
# Ensure background_color is the correct shape
if isinstance(background_color, int):
background_color = [background_color]
elif len(background_color) != num_channels:
raise ValueError(
f"background_color must have exactly {num_channels} elements to match the number of channels"
)
if input_data_format == ChannelDimension.FIRST:
result = np.zeros((num_channels, max_dim, max_dim), dtype=image.dtype)
for i, color in enumerate(background_color):
result[i, :, :] = color
if width > height:
start = (max_dim - height) // 2
result[:, start : start + height, :] = image
else:
start = (max_dim - width) // 2
result[:, :, start : start + width] = image
else:
result = np.zeros((max_dim, max_dim, num_channels), dtype=image.dtype)
for i, color in enumerate(background_color):
result[:, :, i] = color
if width > height:
start = (max_dim - height) // 2
result[start : start + height, :, :] = image
else:
start = (max_dim - width) // 2
result[:, start : start + width, :] = image
return result
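Reviewer note: the centering logic of `pad_to_square` can be seen on a single-channel grid without NumPy. `pad_to_square_2d` is a hypothetical list-based analogue; it assumes a non-empty rectangular grid.

```python
def pad_to_square_2d(grid, fill=0):
    """Center a rectangular grid inside a square of side max(height, width)."""
    height, width = len(grid), len(grid[0])
    if height == width:
        return [row[:] for row in grid]

    max_dim = max(height, width)
    result = [[fill] * max_dim for _ in range(max_dim)]
    if width > height:
        # Wide image: pad above and below.
        start = (max_dim - height) // 2
        for r, row in enumerate(grid):
            result[start + r] = row[:]
    else:
        # Tall image: pad left and right.
        start = (max_dim - width) // 2
        for r, row in enumerate(grid):
            result[r][start : start + width] = row
    return result


wide = pad_to_square_2d([[1, 2, 3, 4]], fill=9)
tall = pad_to_square_2d([[1], [2], [3]])
```

When the padding amount is odd, floor division puts the extra row or column after the image, as in the original method.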
def postprocess(self):
"""Applies post-processing to the decoded image tokens by reversing transformations applied during preprocessing."""
raise AttributeError("Not needed for DeepseekVLHybrid")
__all__ = ["DeepseekVLHybridImageProcessor"]


@@ -0,0 +1,491 @@
# 🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨
# This file was automatically generated from src/transformers/models/deepseek_vl_hybrid/modular_deepseek_vl_hybrid.py.
# Do NOT edit this file manually as any edits will be overwritten by the generation of
# the file from the modular. If any change should be done, please apply the change to the
# modular_deepseek_vl_hybrid.py file directly. One of our CI enforces this.
# 🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨
# Copyright 2025 Deepseek AI and The HuggingFace Team. All rights reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
from dataclasses import dataclass
from typing import Optional, Union
import torch
import torch.nn as nn
from ...cache_utils import Cache
from ...generation import GenerationMixin
from ...modeling_outputs import ModelOutput
from ...modeling_utils import PreTrainedModel
from ...processing_utils import Unpack
from ...utils import (
TransformersKwargs,
auto_docstring,
can_return_tuple,
)
from ..auto import AutoModel
from .configuration_deepseek_vl_hybrid import DeepseekVLHybridConfig
@dataclass
@auto_docstring(
custom_intro="""
Base class for DeepseekVLHybrid model's outputs that may also contain a past key/values (to speed up sequential decoding).
"""
)
class DeepseekVLHybridBaseModelOutputWithPast(ModelOutput):
r"""
last_hidden_state (`torch.FloatTensor` of shape `(batch_size, sequence_length, hidden_size)`):
Sequence of hidden-states at the output of the last layer of the model.
If `past_key_values` is used only the last hidden-state of the sequences of shape `(batch_size, 1,
hidden_size)` is output.
past_key_values (`Cache`, *optional*, returned when `use_cache=True` is passed or when `config.use_cache=True`):
Tuple of `tuple(torch.FloatTensor)` of length `config.n_layers`, with each tuple having 2 tensors of shape
`(batch_size, num_heads, sequence_length, embed_size_per_head)`) and optionally if
`config.is_encoder_decoder=True` 2 additional tensors of shape `(batch_size, num_heads,
encoder_sequence_length, embed_size_per_head)`.
Contains pre-computed hidden-states (key and values in the self-attention blocks and optionally if
`config.is_encoder_decoder=True` in the cross-attention blocks) that can be used (see `past_key_values`
input) to speed up sequential decoding.
image_hidden_states (`tuple(torch.FloatTensor)`, *optional*):
Tuple of `torch.FloatTensor` (one for the output of the image embeddings) of shape `(batch_size, num_images,
sequence_length, hidden_size)`.
Image hidden states produced by the vision encoders and projected by the aligner.
"""
last_hidden_state: Optional[torch.FloatTensor] = None
past_key_values: Optional[tuple[tuple[torch.FloatTensor]]] = None
hidden_states: Optional[tuple[torch.FloatTensor]] = None
attentions: Optional[tuple[torch.FloatTensor]] = None
image_hidden_states: Optional[tuple[torch.FloatTensor]] = None
@dataclass
@auto_docstring(
custom_intro="""
Base class for DeepseekVLHybrid causal language model (or autoregressive) outputs.
"""
)
class DeepseekVLHybridCausalLMOutputWithPast(ModelOutput):
r"""
loss (`torch.FloatTensor` of shape `(1,)`, *optional*, returned when `labels` is provided):
Language modeling loss (for next-token prediction).
logits (`torch.FloatTensor` of shape `(batch_size, sequence_length, config.vocab_size)`):
Prediction scores of the language modeling head (scores for each vocabulary token before SoftMax).
past_key_values (`Cache`, *optional*, returned when `use_cache=True` is passed or when `config.use_cache=True`):
Tuple of `tuple(torch.FloatTensor)` of length `config.n_layers`, with each tuple having 2 tensors of shape
`(batch_size, num_heads, sequence_length, embed_size_per_head)`.
Contains pre-computed hidden-states (key and values in the self-attention blocks) that can be used (see
`past_key_values` input) to speed up sequential decoding.
image_hidden_states (`tuple(torch.FloatTensor)`, *optional*):
Tuple of `torch.FloatTensor` (one for the output of the image embeddings) of shape `(batch_size, num_images,
sequence_length, hidden_size)`.
Image hidden states produced by the vision encoders and projected by the aligner.
"""
loss: Optional[torch.FloatTensor] = None
logits: Optional[torch.FloatTensor] = None
past_key_values: Optional[list[torch.FloatTensor]] = None
hidden_states: Optional[tuple[torch.FloatTensor]] = None
attentions: Optional[tuple[torch.FloatTensor]] = None
image_hidden_states: Optional[tuple[torch.FloatTensor]] = None
class DeepseekVLHybridLayerNorm(nn.Module):
r"""LayerNorm that supports two data formats: channels_last (default) or channels_first.
`data_format` sets the ordering of the dimensions in the inputs: channels_last corresponds to inputs with shape
(batch_size, height, width, channels), while channels_first corresponds to inputs with shape (batch_size, channels, height, width).
"""
def __init__(self, normalized_shape, eps=1e-6, data_format="channels_last"):
super().__init__()
self.weight = nn.Parameter(torch.ones(normalized_shape))
self.bias = nn.Parameter(torch.zeros(normalized_shape))
self.eps = eps
self.data_format = data_format
if self.data_format not in ["channels_last", "channels_first"]:
raise NotImplementedError(f"Unsupported data format: {self.data_format}")
self.normalized_shape = (normalized_shape,)
def forward(self, x: torch.Tensor) -> torch.Tensor:
if self.data_format == "channels_last":
x = torch.nn.functional.layer_norm(x, self.normalized_shape, self.weight, self.bias, self.eps)
elif self.data_format == "channels_first":
input_dtype = x.dtype
x = x.float()
u = x.mean(1, keepdim=True)
s = (x - u).pow(2).mean(1, keepdim=True)
x = (x - u) / torch.sqrt(s + self.eps)
x = x.to(dtype=input_dtype)
x = self.weight[:, None, None] * x + self.bias[:, None, None]
return x
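The `channels_first` branch above normalizes across the channel axis at every spatial position, then applies a per-channel scale and shift. A minimal pure-Python sketch of that arithmetic for a single image follows; the helper name `channels_first_layer_norm` and the nested-list layout `[channels][height][width]` are illustrative assumptions, not part of this PR.

```python
import math


def channels_first_layer_norm(x, weight, bias, eps=1e-6):
    """Normalize a [channels][height][width] nested list across channels.

    Hypothetical helper that mirrors the arithmetic of the channels_first
    branch: biased variance over the channel axis, then scale and shift.
    """
    channels, height, width = len(x), len(x[0]), len(x[0][0])
    out = [[[0.0] * width for _ in range(height)] for _ in range(channels)]
    for h in range(height):
        for w in range(width):
            # All channel values at one spatial position.
            column = [x[c][h][w] for c in range(channels)]
            mean = sum(column) / channels
            var = sum((v - mean) ** 2 for v in column) / channels
            denom = math.sqrt(var + eps)
            for c in range(channels):
                out[c][h][w] = weight[c] * (column[c] - mean) / denom + bias[c]
    return out


# Two channels, one row, two columns.
x = [[[1.0, 4.0]], [[3.0, 0.0]]]
y = channels_first_layer_norm(x, weight=[1.0, 1.0], bias=[0.0, 0.0])
print(y)
```

With unit weight and zero bias, each position's two channel values come out close to -1 and +1, whatever their original scale.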
class DeepseekVLSamVisionNeck(nn.Module):
def __init__(self, config):
super().__init__()
self.config = config
self.conv1 = nn.Conv2d(config.hidden_size, config.output_channels, kernel_size=1, bias=False)
self.layer_norm1 = DeepseekVLHybridLayerNorm(config.output_channels, data_format="channels_first")
self.conv2 = nn.Conv2d(config.output_channels, config.output_channels, kernel_size=3, padding=1, bias=False)
self.layer_norm2 = DeepseekVLHybridLayerNorm(config.output_channels, data_format="channels_first")
def forward(self, hidden_states):
hidden_states = hidden_states.permute(0, 3, 1, 2)
hidden_states = self.conv1(hidden_states)
hidden_states = self.layer_norm1(hidden_states)
hidden_states = self.conv2(hidden_states)
hidden_states = self.layer_norm2(hidden_states)
return hidden_states
class DeepseekVLSamVisionProj(nn.Module):
def __init__(self, config, output_size: int = 24):
super().__init__()
self.config = config
self.output_size = output_size
self.conv1 = nn.Conv2d(
config.output_channels, config.output_channels * 2, kernel_size=3, stride=2, padding=1, bias=False
)
self.conv2 = nn.Conv2d(
config.output_channels * 2, config.output_channels * 4, kernel_size=3, stride=2, padding=1, bias=False
)
def forward(self, features: torch.Tensor) -> torch.Tensor:
# interpolate Sam encodings to match Siglip encodings
features = torch.nn.functional.interpolate(
features,
size=(4 * self.output_size, 4 * self.output_size),
mode="bilinear",
align_corners=False,
)
features = self.conv1(features)
features = self.conv2(features)
return features
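The resize-then-downsample step above can be checked with plain shape arithmetic: interpolating to `4 * output_size` and then applying two `kernel_size=3, stride=2, padding=1` convolutions brings the SAM feature map back to `output_size` per side, with four times the channels. The sketch below uses the standard convolution output-size formula; the helper names and the example value `output_channels=256` are assumptions for illustration.

```python
def conv2d_out_size(size, kernel_size, stride, padding):
    """Spatial output size of a 2D convolution with dilation 1."""
    return (size + 2 * padding - kernel_size) // stride + 1


def sam_proj_output_shape(output_channels, output_size):
    """Hypothetical helper: (channels, height, width) after the projection."""
    side = 4 * output_size  # bilinear interpolation target
    side = conv2d_out_size(side, kernel_size=3, stride=2, padding=1)  # conv1
    side = conv2d_out_size(side, kernel_size=3, stride=2, padding=1)  # conv2
    return (output_channels * 4, side, side)


# Assumed example values: 256 neck channels and the default output_size of 24.
shape = sam_proj_output_shape(output_channels=256, output_size=24)
print(shape)  # (1024, 24, 24)
```

Flattening the 24 x 24 grid gives 576 tokens whose width equals `output_channels * 4`, which is the `high_res_in_channels` value the aligner reads from the config.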
class DeepseekVLHybridAligner(nn.Module):
def __init__(self, config: DeepseekVLHybridConfig):
super().__init__()
in_channels = config.vision_config.hidden_size
high_res_in_channels = config.high_res_vision_config.output_channels * 4
out_channels = config.text_config.hidden_size
self.vision_proj = nn.Linear(in_channels, out_channels // 2)
self.high_res_vision_proj = nn.Linear(high_res_in_channels, out_channels // 2)
self.act = nn.GELU()
self.proj = nn.Linear(out_channels, out_channels)
def forward(
self,
vision_encodings: torch.Tensor,
high_res_vision_encodings: torch.Tensor,
) -> torch.Tensor:
vision_encodings = self.vision_proj(vision_encodings)
high_res_vision_encodings = self.high_res_vision_proj(high_res_vision_encodings)
encodings = torch.concat([high_res_vision_encodings, vision_encodings], dim=-1)
encodings = self.act(encodings)
encodings = self.proj(encodings)
return encodings
@auto_docstring
class DeepseekVLHybridPreTrainedModel(PreTrainedModel):
config: DeepseekVLHybridConfig
base_model_prefix = "model"
supports_gradient_checkpointing = True
_no_split_modules = ["LlamaDecoderLayer"]
_skip_keys_device_placement = ["past_key_values", "causal_mask"]
_supports_flash_attn = True
_supports_sdpa = True
_supports_static_cache = True
_supports_param_buffer_assignment = False
def _init_weights(self, module):
"""Initialize the weights"""
if isinstance(module, nn.Linear):
module.weight.data.normal_(mean=0.0, std=self.config.text_config.initializer_range)
if module.bias is not None:
module.bias.data.zero_()
elif isinstance(module, nn.Conv2d):
nn.init.kaiming_normal_(module.weight, mode="fan_out", nonlinearity="relu")
if module.bias is not None:
module.bias.data.zero_()
elif isinstance(module, DeepseekVLHybridLayerNorm):
module.weight.data.fill_(1.0)
module.bias.data.zero_()
elif isinstance(module, DeepseekVLHybridModel):
module.high_res_vision_alpha.data.zero_()
DEEPSEEK_VL_COMMON_CUSTOM_ARGS = r"""
high_res_pixel_values (`torch.FloatTensor` of shape `(batch_size, num_channels, image_size, image_size)`, *optional*):
The tensors corresponding to the high-resolution input images. Pixel values can be obtained using
[`AutoImageProcessor`].
"""
@auto_docstring
class DeepseekVLHybridModel(DeepseekVLHybridPreTrainedModel):
def __init__(self, config):
super().__init__(config)
self.output_size = config.vision_config.image_size // config.vision_config.patch_size
self.global_attn_index = config.high_res_vision_config.global_attn_indexes[0]
self.high_res_vision_model = AutoModel.from_config(config.high_res_vision_config)
self.high_res_vision_neck = DeepseekVLSamVisionNeck(config.high_res_vision_config)
self.high_res_vision_proj = DeepseekVLSamVisionProj(
config.high_res_vision_config, output_size=self.output_size
)
self.high_res_vision_alpha = nn.Parameter(torch.zeros(1))
self.config = config
self.vision_model = AutoModel.from_config(config.vision_config)
self.aligner = DeepseekVLHybridAligner(config)
self.language_model = AutoModel.from_config(config=config.text_config)
self.gradient_checkpointing = False
# Initialize weights and apply final processing.
self.post_init()
def get_input_embeddings(self):
return self.language_model.get_input_embeddings()
def set_input_embeddings(self, value):
self.language_model.set_input_embeddings(value)
def get_image_features(self, pixel_values, high_res_pixel_values):
vision_encodings = self.get_low_res_image_features(pixel_values)
high_res_vision_encodings = self.get_high_res_image_features(high_res_pixel_values)
images_embeds = self.aligner(vision_encodings, high_res_vision_encodings)
return images_embeds
@can_return_tuple
@auto_docstring(custom_args=DEEPSEEK_VL_COMMON_CUSTOM_ARGS)
def forward(
self,
input_ids: torch.LongTensor = None,
pixel_values: torch.FloatTensor = None,
high_res_pixel_values: torch.FloatTensor = None,
attention_mask: Optional[torch.Tensor] = None,
position_ids: Optional[torch.LongTensor] = None,
past_key_values: Optional[Cache] = None,
cache_position: Optional[torch.LongTensor] = None,
inputs_embeds: Optional[torch.FloatTensor] = None,
use_cache: Optional[bool] = None,
logits_to_keep: Union[int, torch.Tensor] = 0,
**kwargs,
):
if (input_ids is None) ^ (inputs_embeds is not None):
raise ValueError(
"You cannot specify both input_ids and inputs_embeds at the same time, and must specify either one"
)
if pixel_values is not None and high_res_pixel_values is None:
raise ValueError("Both pixel_values and high_res_pixel_values should be specified at the same time")
if inputs_embeds is None:
inputs_embeds = self.get_input_embeddings()(input_ids)
if pixel_values is not None:
if input_ids is None:
image_attention_mask = inputs_embeds == self.get_input_embeddings()(
torch.tensor(self.config.image_token_id, dtype=torch.long, device=inputs_embeds.device)
)
image_attention_mask = image_attention_mask.all(-1)
else:
image_attention_mask = input_ids == self.config.image_token_id
image_attention_mask = image_attention_mask.unsqueeze(-1).expand_as(inputs_embeds).to(inputs_embeds.device)
image_embeds = self.get_image_features(pixel_values, high_res_pixel_values)
image_features = image_embeds.reshape(-1, inputs_embeds.shape[-1])
image_features = image_features.to(inputs_embeds.device, inputs_embeds.dtype)
inputs_embeds = inputs_embeds.masked_scatter(image_attention_mask, image_features)
lm_output = self.language_model(
inputs_embeds=inputs_embeds,
attention_mask=attention_mask,
position_ids=position_ids,
past_key_values=past_key_values,
use_cache=use_cache,
cache_position=cache_position,
logits_to_keep=logits_to_keep,
**kwargs,
)
return DeepseekVLHybridBaseModelOutputWithPast(
last_hidden_state=lm_output.last_hidden_state,
past_key_values=lm_output.past_key_values,
hidden_states=lm_output.hidden_states,
attentions=lm_output.attentions,
image_hidden_states=image_embeds if pixel_values is not None else None,
)
def get_low_res_image_features(self, pixel_values):
output = self.vision_model(pixel_values)
output = output[0]
return output
def get_high_res_image_features(self, pixel_values):
output = self.high_res_vision_model(
pixel_values=pixel_values,
output_hidden_states=True,
return_dict=True,
)
last_hidden_state = output.last_hidden_state
last_hidden_state = self.high_res_vision_proj(last_hidden_state)
hidden_states = output.hidden_states
global_hidden_state = hidden_states[self.global_attn_index + 1] # +1 for embedding layer
global_hidden_state = self.high_res_vision_neck(global_hidden_state)
global_hidden_state = self.high_res_vision_proj(global_hidden_state)
output = last_hidden_state + global_hidden_state * self.high_res_vision_alpha
# batch_size, hidden_size, height, width -> batch_size, seq_len, hidden_size
output = output.permute(0, 2, 3, 1)
output = output.reshape(output.shape[0], -1, output.shape[-1])
return output
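The `forward` method of `DeepseekVLHybridModel` swaps the image-token positions in `inputs_embeds` for the aligner's image features via `masked_scatter`, which fills masked positions consecutively from the source tensor. Below is a small pure-Python emulation of that row-wise replacement for one sequence; `scatter_image_features`, the toy values, and the explicit count check are illustrative assumptions rather than code from this PR.

```python
def scatter_image_features(input_ids, inputs_embeds, image_features, image_token_id):
    """Replace the embedding row at each image-token position, in order.

    Hypothetical stand-in for the mask + masked_scatter step, operating on
    a single sequence given as plain lists.
    """
    positions = [i for i, tok in enumerate(input_ids) if tok == image_token_id]
    if len(positions) != len(image_features):
        # Assumed sanity check; the PR code does not perform it explicitly.
        raise ValueError(
            f"{len(positions)} image tokens but {len(image_features)} feature rows"
        )
    merged = [list(row) for row in inputs_embeds]
    for pos, feature in zip(positions, image_features):
        merged[pos] = list(feature)
    return merged


IMAGE_TOKEN_ID = 100015  # default image_token_id in DeepseekVLHybridConfig
input_ids = [1, IMAGE_TOKEN_ID, IMAGE_TOKEN_ID, 42]
inputs_embeds = [[0.1, 0.1], [0.0, 0.0], [0.0, 0.0], [0.4, 0.4]]
image_features = [[7.0, 7.0], [8.0, 8.0]]

merged = scatter_image_features(input_ids, inputs_embeds, image_features, IMAGE_TOKEN_ID)
print(merged)
```

The text rows stay untouched, and the image rows are taken in the order the features were produced, so the processor has to emit exactly one image token per feature row.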
class DeepseekVLHybridForConditionalGeneration(DeepseekVLHybridPreTrainedModel, GenerationMixin):
_tied_weights_keys = ["model.language_model.embed_tokens.weight", "lm_head.weight"]
_supports_static_cache = True
def __init__(self, config: DeepseekVLHybridConfig):
super().__init__(config)
self.config = config
self.model = DeepseekVLHybridModel(config)
self.lm_head = nn.Linear(config.text_config.hidden_size, config.text_config.vocab_size, bias=False)
# Initialize weights and apply final processing.
self.post_init()
def get_input_embeddings(self):
return self.model.language_model.get_input_embeddings()
def set_input_embeddings(self, value):
self.model.language_model.set_input_embeddings(value)
def prepare_embeddings_for_image_generation(self) -> torch.Tensor:
raise AttributeError("Not needed for DeepseekVLHybrid")
def set_decoder(self, decoder):
self.model = decoder
def get_decoder(self):
return self.model
@can_return_tuple
@auto_docstring(custom_args=DEEPSEEK_VL_COMMON_CUSTOM_ARGS)
def forward(
self,
input_ids: torch.LongTensor = None,
pixel_values: torch.FloatTensor = None,
high_res_pixel_values: torch.FloatTensor = None,
attention_mask: Optional[torch.Tensor] = None,
position_ids: Optional[torch.LongTensor] = None,
past_key_values: Optional[Cache] = None,
cache_position: Optional[torch.LongTensor] = None,
inputs_embeds: Optional[torch.FloatTensor] = None,
labels: Optional[torch.LongTensor] = None,
use_cache: Optional[bool] = None,
logits_to_keep: Union[int, torch.Tensor] = 0,
**kwargs: Unpack[TransformersKwargs],
):
r"""
labels (`torch.LongTensor` of shape `(batch_size, sequence_length)`, *optional*):
Labels for computing the language modeling loss (next-token prediction). Indices should either be in `[0, ...,
config.vocab_size]` or -100 (see `input_ids` docstring). Tokens with indices set to `-100` are ignored
(masked); the loss is only computed for the tokens with labels in `[0, ..., config.vocab_size]`.
"""
outputs = self.model(
input_ids=input_ids,
pixel_values=pixel_values,
high_res_pixel_values=high_res_pixel_values,
attention_mask=attention_mask,
position_ids=position_ids,
past_key_values=past_key_values,
inputs_embeds=inputs_embeds,
use_cache=use_cache,
cache_position=cache_position,
**kwargs,
)
hidden_states = outputs.last_hidden_state
# Only compute necessary logits, and do not upcast them to float if we are not computing the loss
slice_indices = slice(-logits_to_keep, None) if isinstance(logits_to_keep, int) else logits_to_keep
logits = self.lm_head(hidden_states[:, slice_indices, :])
loss = None
if labels is not None:
loss = self.loss_function(
logits=logits, labels=labels, vocab_size=self.config.text_config.vocab_size, **kwargs
)
return DeepseekVLHybridCausalLMOutputWithPast(
loss=loss,
logits=logits,
past_key_values=outputs.past_key_values,
hidden_states=outputs.hidden_states,
attentions=outputs.attentions,
image_hidden_states=outputs.image_hidden_states,
)
def prepare_inputs_for_generation(
self,
input_ids,
past_key_values=None,
inputs_embeds=None,
pixel_values=None,
high_res_pixel_values=None,
attention_mask=None,
cache_position=None,
logits_to_keep=None,
**kwargs,
):
model_inputs = super().prepare_inputs_for_generation(
input_ids,
past_key_values=past_key_values,
inputs_embeds=inputs_embeds,
attention_mask=attention_mask,
cache_position=cache_position,
logits_to_keep=logits_to_keep,
**kwargs,
)
if cache_position[0] == 0:
# Pixel values are only needed on the first forward pass (`cache_position[0] == 0`).
# In the cached decoding stage they are left out because the input ids no longer contain the special image token.
model_inputs["pixel_values"] = pixel_values
model_inputs["high_res_pixel_values"] = high_res_pixel_values
return model_inputs
__all__ = ["DeepseekVLHybridPreTrainedModel", "DeepseekVLHybridModel", "DeepseekVLHybridForConditionalGeneration"]
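The `logits_to_keep` handling in `DeepseekVLHybridForConditionalGeneration.forward` relies on `-0 == 0`: an integer `0` yields `slice(0, None)` and keeps every position, while a positive `k` keeps only the last `k`. A short sketch on a plain list, where `select_positions` is a hypothetical helper and a list of indices stands in for the tensor case:

```python
def select_positions(hidden_states, logits_to_keep=0):
    """Pick the sequence positions that would be fed to the LM head."""
    if isinstance(logits_to_keep, int):
        # -0 == 0, so the default keeps the whole sequence.
        return hidden_states[slice(-logits_to_keep, None)]
    # Stand-in for indexing with a tensor of explicit positions.
    return [hidden_states[i] for i in logits_to_keep]


positions = ["h0", "h1", "h2", "h3"]
print(select_positions(positions, 0))       # ['h0', 'h1', 'h2', 'h3']
print(select_positions(positions, 1))       # ['h3']
print(select_positions(positions, [0, 2]))  # ['h0', 'h2']
```

Keeping only the last position is usually enough during step-by-step decoding, since only the final hidden state is needed to predict the next token.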

@@ -0,0 +1,777 @@
# Copyright 2025 Deepseek AI and The HuggingFace Team. All rights reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
from typing import Optional, Union
import torch
import torch.nn as nn
from ...cache_utils import Cache
from ...image_processing_utils_fast import (
BatchFeature,
get_size_dict,
)
from ...image_transforms import convert_to_rgb, to_channel_dimension_format
from ...image_utils import (
OPENAI_CLIP_MEAN,
OPENAI_CLIP_STD,
ChannelDimension,
ImageInput,
PILImageResampling,
infer_channel_dimension_format,
is_scaled_image,
make_flat_list_of_images,
make_list_of_images,
to_numpy_array,
valid_images,
validate_preprocess_arguments,
)
from ...processing_utils import Unpack
from ...tokenization_utils_base import (
PreTokenizedInput,
TextInput,
)
from ...utils import (
TensorType,
TransformersKwargs,
auto_docstring,
can_return_tuple,
filter_out_non_signature_kwargs,
logging,
)
from ..auto import CONFIG_MAPPING, AutoConfig, AutoModel
from ..deepseek_vl.configuration_deepseek_vl import DeepseekVLConfig
from ..deepseek_vl.image_processing_deepseek_vl import DeepseekVLImageProcessor
from ..deepseek_vl.modeling_deepseek_vl import (
DeepseekVLForConditionalGeneration,
DeepseekVLModel,
DeepseekVLPreTrainedModel,
)
from ..deepseek_vl.processing_deepseek_vl import DeepseekVLProcessor, DeepseekVLProcessorKwargs
from ..idefics.modeling_idefics import IdeficsBaseModelOutputWithPast, IdeficsCausalLMOutputWithPast
from ..sam.modeling_sam import SamLayerNorm, SamVisionNeck
logger = logging.get_logger(__name__)
DEEPSEEK_VL_COMMON_CUSTOM_ARGS = r"""
high_res_pixel_values (`torch.FloatTensor` of shape `(batch_size, num_channels, image_size, image_size)`, *optional*):
The tensors corresponding to the high-resolution input images. Pixel values can be obtained using
[`AutoImageProcessor`].
"""
class DeepseekVLHybridConfig(DeepseekVLConfig):
r"""
This is the configuration class to store the configuration of a [`DeepseekVLHybridModel`]. It is used to instantiate a
DeepseekVLHybrid model according to the specified arguments, defining the model architecture. Instantiating a configuration
with the defaults will yield a similar configuration to that of the DeepseekVLHybrid
[deepseek-community/deepseek-vl-7b-chat](https://huggingface.co/deepseek-community/deepseek-vl-7b-chat) architecture.
Configuration objects inherit from [`PretrainedConfig`] and can be used to control the model outputs. Read the
documentation from [`PretrainedConfig`] for more information.
Args:
text_config (`Union[AutoConfig, dict]`, *optional*, defaults to `LlamaConfig`):
The config object or dictionary of the text backbone.
vision_config (`Union[AutoConfig, dict]`, *optional*, defaults to `SiglipVisionConfig`):
The config object or dictionary of the vision backbone.
high_res_vision_config (`Union[AutoConfig, dict]`, *optional*, defaults to `SamVisionConfig`):
The config object or dictionary of the high resolution vision backbone.
image_token_id (`int`, *optional*, defaults to 100015):
The index representing image tokens in the model's token vocabulary.
Example:
```python
>>> from transformers import DeepseekVLHybridConfig, DeepseekVLHybridModel
>>> # Initializing a DeepseekVLHybrid deepseek-community/deepseek-vl-7b-chat style configuration
>>> configuration = DeepseekVLHybridConfig()
>>> # Initializing a model (with random weights) from the deepseek-community/deepseek-vl-7b-chat style configuration
>>> model = DeepseekVLHybridModel(configuration)
>>> # Accessing the model configuration
>>> configuration = model.config
```"""
model_type = "deepseek_vl_hybrid"
sub_configs = {"text_config": AutoConfig, "vision_config": AutoConfig, "high_res_vision_config": AutoConfig}
def __init__(
self,
text_config: AutoConfig = None,
vision_config: AutoConfig = None,
high_res_vision_config: AutoConfig = None,
image_token_id: int = 100015,
**kwargs,
):
super().__init__(
text_config=text_config,
vision_config=vision_config,
image_token_id=image_token_id,
**kwargs,
)
if high_res_vision_config is None:
high_res_vision_config = {}
logger.info("`high_res_vision_config` is `None`. Initializing the `SamVisionConfig` with default values.")
if isinstance(high_res_vision_config, dict):
high_res_vision_config["model_type"] = high_res_vision_config.get("model_type", "sam_vision_model")
high_res_vision_config = CONFIG_MAPPING[high_res_vision_config["model_type"]](**high_res_vision_config)
self.high_res_vision_config = high_res_vision_config
class DeepseekVLHybridBaseModelOutputWithPast(IdeficsBaseModelOutputWithPast):
pass
class DeepseekVLHybridCausalLMOutputWithPast(IdeficsCausalLMOutputWithPast):
pass
class DeepseekVLHybridLayerNorm(SamLayerNorm):
pass
class DeepseekVLSamVisionNeck(SamVisionNeck):
def __init__(self, config):
super().__init__(config)
class DeepseekVLSamVisionProj(nn.Module):
def __init__(self, config, output_size: int = 24):
super().__init__()
self.config = config
self.output_size = output_size
self.conv1 = nn.Conv2d(
config.output_channels, config.output_channels * 2, kernel_size=3, stride=2, padding=1, bias=False
)
self.conv2 = nn.Conv2d(
config.output_channels * 2, config.output_channels * 4, kernel_size=3, stride=2, padding=1, bias=False
)
def forward(self, features: torch.Tensor) -> torch.Tensor:
# interpolate Sam encodings to match Siglip encodings
features = torch.nn.functional.interpolate(
features,
size=(4 * self.output_size, 4 * self.output_size),
mode="bilinear",
align_corners=False,
)
features = self.conv1(features)
features = self.conv2(features)
return features
class DeepseekVLHybridAligner(nn.Module):
def __init__(self, config: DeepseekVLHybridConfig):
super().__init__()
in_channels = config.vision_config.hidden_size
high_res_in_channels = config.high_res_vision_config.output_channels * 4
out_channels = config.text_config.hidden_size
self.vision_proj = nn.Linear(in_channels, out_channels // 2)
self.high_res_vision_proj = nn.Linear(high_res_in_channels, out_channels // 2)
self.act = nn.GELU()
self.proj = nn.Linear(out_channels, out_channels)
def forward(
self,
vision_encodings: torch.Tensor,
high_res_vision_encodings: torch.Tensor,
) -> torch.Tensor:
vision_encodings = self.vision_proj(vision_encodings)
high_res_vision_encodings = self.high_res_vision_proj(high_res_vision_encodings)
encodings = torch.concat([high_res_vision_encodings, vision_encodings], dim=-1)
encodings = self.act(encodings)
encodings = self.proj(encodings)
return encodings
class DeepseekVLHybridPreTrainedModel(DeepseekVLPreTrainedModel):
def _init_weights(self, module):
"""Initialize the weights"""
if isinstance(module, nn.Linear):
module.weight.data.normal_(mean=0.0, std=self.config.text_config.initializer_range)
if module.bias is not None:
module.bias.data.zero_()
elif isinstance(module, nn.Conv2d):
nn.init.kaiming_normal_(module.weight, mode="fan_out", nonlinearity="relu")
if module.bias is not None:
module.bias.data.zero_()
elif isinstance(module, DeepseekVLHybridLayerNorm):
module.weight.data.fill_(1.0)
module.bias.data.zero_()
elif isinstance(module, DeepseekVLHybridModel):
module.high_res_vision_alpha.data.zero_()
class DeepseekVLHybridModel(DeepseekVLModel):
def __init__(self, config):
self.output_size = config.vision_config.image_size // config.vision_config.patch_size
self.global_attn_index = config.high_res_vision_config.global_attn_indexes[0]
self.high_res_vision_model = AutoModel.from_config(config.high_res_vision_config)
self.high_res_vision_neck = DeepseekVLSamVisionNeck(config.high_res_vision_config)
self.high_res_vision_proj = DeepseekVLSamVisionProj(
config.high_res_vision_config, output_size=self.output_size
)
self.high_res_vision_alpha = nn.Parameter(torch.zeros(1))
super().__init__(config)
def get_low_res_image_features(self, pixel_values):
output = self.vision_model(pixel_values)
output = output[0]
return output
def get_high_res_image_features(self, pixel_values):
output = self.high_res_vision_model(
pixel_values=pixel_values,
output_hidden_states=True,
return_dict=True,
)
last_hidden_state = output.last_hidden_state
last_hidden_state = self.high_res_vision_proj(last_hidden_state)
hidden_states = output.hidden_states
global_hidden_state = hidden_states[self.global_attn_index + 1] # +1 for embedding layer
global_hidden_state = self.high_res_vision_neck(global_hidden_state)
global_hidden_state = self.high_res_vision_proj(global_hidden_state)
output = last_hidden_state + global_hidden_state * self.high_res_vision_alpha
# batch_size, hidden_size, height, width -> batch_size, seq_len, hidden_size
output = output.permute(0, 2, 3, 1)
output = output.reshape(output.shape[0], -1, output.shape[-1])
return output
def get_image_features(self, pixel_values, high_res_pixel_values):
vision_encodings = self.get_low_res_image_features(pixel_values)
high_res_vision_encodings = self.get_high_res_image_features(high_res_pixel_values)
images_embeds = self.aligner(vision_encodings, high_res_vision_encodings)
return images_embeds
@can_return_tuple
@auto_docstring(custom_args=DEEPSEEK_VL_COMMON_CUSTOM_ARGS)
def forward(
self,
input_ids: torch.LongTensor = None,
pixel_values: torch.FloatTensor = None,
high_res_pixel_values: torch.FloatTensor = None,
attention_mask: Optional[torch.Tensor] = None,
position_ids: Optional[torch.LongTensor] = None,
past_key_values: Optional[Cache] = None,
cache_position: Optional[torch.LongTensor] = None,
inputs_embeds: Optional[torch.FloatTensor] = None,
use_cache: Optional[bool] = None,
logits_to_keep: Union[int, torch.Tensor] = 0,
**kwargs,
):
if (input_ids is None) ^ (inputs_embeds is not None):
raise ValueError(
"You cannot specify both input_ids and inputs_embeds at the same time, and must specify either one"
)
if pixel_values is not None and high_res_pixel_values is None:
raise ValueError("Both pixel_values and high_res_pixel_values should be specified at the same time")
if inputs_embeds is None:
inputs_embeds = self.get_input_embeddings()(input_ids)
if pixel_values is not None:
if input_ids is None:
image_attention_mask = inputs_embeds == self.get_input_embeddings()(
torch.tensor(self.config.image_token_id, dtype=torch.long, device=inputs_embeds.device)
)
image_attention_mask = image_attention_mask.all(-1)
else:
image_attention_mask = input_ids == self.config.image_token_id
image_attention_mask = image_attention_mask.unsqueeze(-1).expand_as(inputs_embeds).to(inputs_embeds.device)
image_embeds = self.get_image_features(pixel_values, high_res_pixel_values)
image_features = image_embeds.reshape(-1, inputs_embeds.shape[-1])
image_features = image_features.to(inputs_embeds.device, inputs_embeds.dtype)
inputs_embeds = inputs_embeds.masked_scatter(image_attention_mask, image_features)
lm_output = self.language_model(
inputs_embeds=inputs_embeds,
attention_mask=attention_mask,
position_ids=position_ids,
past_key_values=past_key_values,
use_cache=use_cache,
cache_position=cache_position,
logits_to_keep=logits_to_keep,
**kwargs,
)
return DeepseekVLHybridBaseModelOutputWithPast(
last_hidden_state=lm_output.last_hidden_state,
past_key_values=lm_output.past_key_values,
hidden_states=lm_output.hidden_states,
attentions=lm_output.attentions,
image_hidden_states=image_embeds if pixel_values is not None else None,
)
class DeepseekVLHybridForConditionalGeneration(DeepseekVLForConditionalGeneration):
@can_return_tuple
@auto_docstring(custom_args=DEEPSEEK_VL_COMMON_CUSTOM_ARGS)
def forward(
self,
input_ids: torch.LongTensor = None,
pixel_values: torch.FloatTensor = None,
high_res_pixel_values: torch.FloatTensor = None,
attention_mask: Optional[torch.Tensor] = None,
position_ids: Optional[torch.LongTensor] = None,
past_key_values: Optional[Cache] = None,
cache_position: Optional[torch.LongTensor] = None,
inputs_embeds: Optional[torch.FloatTensor] = None,
labels: Optional[torch.LongTensor] = None,
use_cache: Optional[bool] = None,
logits_to_keep: Union[int, torch.Tensor] = 0,
**kwargs: Unpack[TransformersKwargs],
):
r"""
labels (`torch.LongTensor` of shape `(batch_size, sequence_length)`, *optional*):
Labels for computing the language modeling loss (next-token prediction). Indices should either be in `[0, ...,
config.vocab_size]` or -100 (see `input_ids` docstring). Tokens with indices set to `-100` are ignored
(masked); the loss is only computed for the tokens with labels in `[0, ..., config.vocab_size]`.
"""
outputs = self.model(
input_ids=input_ids,
pixel_values=pixel_values,
high_res_pixel_values=high_res_pixel_values,
attention_mask=attention_mask,
position_ids=position_ids,
past_key_values=past_key_values,
inputs_embeds=inputs_embeds,
use_cache=use_cache,
cache_position=cache_position,
**kwargs,
)
hidden_states = outputs.last_hidden_state
# Only compute necessary logits, and do not upcast them to float if we are not computing the loss
slice_indices = slice(-logits_to_keep, None) if isinstance(logits_to_keep, int) else logits_to_keep
logits = self.lm_head(hidden_states[:, slice_indices, :])
loss = None
if labels is not None:
loss = self.loss_function(
logits=logits, labels=labels, vocab_size=self.config.text_config.vocab_size, **kwargs
)
return DeepseekVLHybridCausalLMOutputWithPast(
loss=loss,
logits=logits,
past_key_values=outputs.past_key_values,
hidden_states=outputs.hidden_states,
attentions=outputs.attentions,
image_hidden_states=outputs.image_hidden_states,
)
def prepare_inputs_for_generation(
self,
input_ids,
past_key_values=None,
inputs_embeds=None,
pixel_values=None,
high_res_pixel_values=None,
attention_mask=None,
cache_position=None,
logits_to_keep=None,
**kwargs,
):
model_inputs = super().prepare_inputs_for_generation(
input_ids,
past_key_values=past_key_values,
inputs_embeds=inputs_embeds,
attention_mask=attention_mask,
cache_position=cache_position,
logits_to_keep=logits_to_keep,
**kwargs,
)
if cache_position[0] == 0:
# Pixel values are only needed on the first forward pass (`cache_position[0] == 0`).
# In the cached decoding stage they are left out because the input ids no longer contain the special image token.
model_inputs["pixel_values"] = pixel_values
model_inputs["high_res_pixel_values"] = high_res_pixel_values
return model_inputs
class DeepseekVLHybridImageProcessor(DeepseekVLImageProcessor):
r"""
Constructs a DeepseekVLHybrid image processor.
Args:
do_resize (`bool`, *optional*, defaults to `True`):
Whether to resize the image's (height, width) dimensions to the specified `size`. Can be overridden by the
`do_resize` parameter in the `preprocess` method.
size (`dict`, *optional*, defaults to `{"height": 384, "width": 384}`):
Size of the output image after resizing. Can be overridden by the `size` parameter in the `preprocess`
method.
high_res_size (`dict`, *optional*, defaults to `{"height": 1024, "width": 1024}`):
Size of the high resolution output image after resizing. Can be overridden by the `high_res_size` parameter in the `preprocess`
method.
min_size (`int`, *optional*, defaults to 14):
The minimum allowed size for the resized image. Ensures that neither the height nor width
falls below this value after resizing.
resample (`PILImageResampling`, *optional*, defaults to `Resampling.BICUBIC`):
Resampling filter to use if resizing the image. Only has an effect if `do_resize` is set to `True`. Can be
overridden by the `resample` parameter in the `preprocess` method.
high_res_resample (`PILImageResampling`, *optional*, defaults to `Resampling.BICUBIC`):
Resampling filter to use if resizing the high resolution image. Only has an effect if `do_resize` is set to `True`. Can be
overridden by the `high_res_resample` parameter in the `preprocess` method.
do_rescale (`bool`, *optional*, defaults to `True`):
Whether to rescale the image by the specified scale `rescale_factor`. Can be overridden by the
`do_rescale` parameter in the `preprocess` method.
rescale_factor (`int` or `float`, *optional*, defaults to `1/255`):
Scale factor to use if rescaling the image. Only has an effect if `do_rescale` is set to `True`. Can be
overridden by the `rescale_factor` parameter in the `preprocess` method.
do_normalize (`bool`, *optional*, defaults to `True`):
Whether to normalize the image. Can be overridden by the `do_normalize` parameter in the `preprocess`
method.
image_mean (`float` or `list[float]`, *optional*, defaults to `IMAGENET_STANDARD_MEAN`):
Mean to use if normalizing the image. This is a float or list of floats the length of the number of
channels in the image. Can be overridden by the `image_mean` parameter in the `preprocess` method.
image_std (`float` or `list[float]`, *optional*, defaults to `IMAGENET_STANDARD_STD`):
Standard deviation to use if normalizing the image. This is a float or list of floats the length of the
number of channels in the image. Can be overridden by the `image_std` parameter in the `preprocess` method.
high_res_image_mean (`float` or `list[float]`, *optional*, defaults to `OPENAI_CLIP_MEAN`):
Mean to use if normalizing the high resolution image. This is a float or list of floats the length of the number of
channels in the image. Can be overridden by the `high_res_image_mean` parameter in the `preprocess` method.
high_res_image_std (`float` or `list[float]`, *optional*, defaults to `OPENAI_CLIP_STD`):
Standard deviation to use if normalizing the high resolution image. This is a float or list of floats the length of the
number of channels in the image. Can be overridden by the `high_res_image_std` parameter in the `preprocess` method.
do_convert_rgb (`bool`, *optional*, defaults to `True`):
Whether to convert the image to RGB.
"""
def __init__(
self,
do_resize: bool = True,
size: Optional[dict[str, int]] = None,
high_res_size: Optional[dict[str, int]] = None,
min_size: int = 14,
resample: PILImageResampling = PILImageResampling.BICUBIC,
high_res_resample: PILImageResampling = PILImageResampling.BICUBIC,
do_rescale: bool = True,
rescale_factor: Union[int, float] = 1 / 255,
do_normalize: bool = True,
image_mean: Optional[Union[float, list[float]]] = None,
image_std: Optional[Union[float, list[float]]] = None,
high_res_image_mean: Optional[Union[float, list[float]]] = None,
high_res_image_std: Optional[Union[float, list[float]]] = None,
do_convert_rgb: Optional[bool] = None,
**kwargs,
) -> None:
high_res_size = high_res_size if high_res_size is not None else {"height": 1024, "width": 1024}
high_res_size = get_size_dict(high_res_size, default_to_square=True)
self.high_res_size = high_res_size
self.high_res_image_mean = high_res_image_mean if high_res_image_mean is not None else OPENAI_CLIP_MEAN
self.high_res_image_std = high_res_image_std if high_res_image_std is not None else OPENAI_CLIP_STD
self.resample = resample
self.high_res_resample = high_res_resample
super().__init__(
do_resize=do_resize,
size=size,
min_size=min_size,
resample=resample,
do_rescale=do_rescale,
rescale_factor=rescale_factor,
do_normalize=do_normalize,
image_mean=image_mean,
image_std=image_std,
do_convert_rgb=do_convert_rgb,
**kwargs,
)
if high_res_image_mean is None:
self.background_color = (127, 127, 127)
else:
self.background_color = tuple([int(x * 255) for x in high_res_image_mean])
@filter_out_non_signature_kwargs()
def preprocess(
self,
images: ImageInput,
do_resize: Optional[bool] = None,
size: Optional[dict[str, int]] = None,
high_res_size: Optional[dict[str, int]] = None,
resample: Optional[PILImageResampling] = None,
high_res_resample: Optional[PILImageResampling] = None,
do_rescale: Optional[bool] = None,
rescale_factor: Optional[float] = None,
do_normalize: Optional[bool] = None,
image_mean: Optional[Union[float, list[float]]] = None,
image_std: Optional[Union[float, list[float]]] = None,
high_res_image_mean: Optional[Union[float, list[float]]] = None,
high_res_image_std: Optional[Union[float, list[float]]] = None,
return_tensors: Optional[Union[str, TensorType]] = None,
data_format: Union[str, ChannelDimension] = ChannelDimension.FIRST,
input_data_format: Optional[Union[str, ChannelDimension]] = None,
do_convert_rgb: Optional[bool] = None,
):
"""
Preprocess an image or batch of images.
Args:
images (`ImageInput`):
Image to preprocess. Expects a single image or a batch of images with pixel values ranging from 0 to 255. If
passing in images with pixel values between 0 and 1, set `do_rescale=False`.
do_resize (`bool`, *optional*, defaults to `self.do_resize`):
Whether to resize the image.
size (`Dict[str, int]`, *optional*, defaults to `self.size`):
Dictionary in the format `{"height": h, "width": w}` specifying the size of the output image after
resizing.
high_res_size (`Dict[str, int]`, *optional*, defaults to `self.high_res_size`):
Dictionary in the format `{"height": h, "width": w}` specifying the size of the high resolution output image after
resizing.
resample (`PILImageResampling` filter, *optional*, defaults to `self.resample`):
`PILImageResampling` filter to use if resizing the image e.g. `PILImageResampling.BILINEAR`. Only has
an effect if `do_resize` is set to `True`.
high_res_resample (`PILImageResampling` filter, *optional*, defaults to `self.high_res_resample`):
`PILImageResampling` filter to use if resizing the high resolution image e.g. `PILImageResampling.BICUBIC`. Only has
an effect if `do_resize` is set to `True`.
do_rescale (`bool`, *optional*, defaults to `self.do_rescale`):
Whether to rescale the image values to the range [0, 1].
rescale_factor (`float`, *optional*, defaults to `self.rescale_factor`):
Rescale factor to rescale the image by if `do_rescale` is set to `True`.
do_normalize (`bool`, *optional*, defaults to `self.do_normalize`):
Whether to normalize the image.
image_mean (`float` or `List[float]`, *optional*, defaults to `self.image_mean`):
Image mean to use if `do_normalize` is set to `True`.
image_std (`float` or `List[float]`, *optional*, defaults to `self.image_std`):
Image standard deviation to use if `do_normalize` is set to `True`.
high_res_image_mean (`float` or `List[float]`, *optional*, defaults to `self.high_res_image_mean`):
High resolution image mean to use if `do_normalize` is set to `True`.
high_res_image_std (`float` or `List[float]`, *optional*, defaults to `self.high_res_image_std`):
High resolution image standard deviation to use if `do_normalize` is set to `True`.
return_tensors (`str` or `TensorType`, *optional*):
The type of tensors to return. Can be one of:
- Unset: Return a list of `np.ndarray`.
- `TensorType.TENSORFLOW` or `'tf'`: Return a batch of type `tf.Tensor`.
- `TensorType.PYTORCH` or `'pt'`: Return a batch of type `torch.Tensor`.
- `TensorType.NUMPY` or `'np'`: Return a batch of type `np.ndarray`.
- `TensorType.JAX` or `'jax'`: Return a batch of type `jax.numpy.ndarray`.
data_format (`ChannelDimension` or `str`, *optional*, defaults to `ChannelDimension.FIRST`):
The channel dimension format for the output image. Can be one of:
- `"channels_first"` or `ChannelDimension.FIRST`: image in (num_channels, height, width) format.
- `"channels_last"` or `ChannelDimension.LAST`: image in (height, width, num_channels) format.
- Unset: Use the channel dimension format of the input image.
input_data_format (`ChannelDimension` or `str`, *optional*):
The channel dimension format for the input image. If unset, the channel dimension format is inferred
from the input image. Can be one of:
- `"channels_first"` or `ChannelDimension.FIRST`: image in (num_channels, height, width) format.
- `"channels_last"` or `ChannelDimension.LAST`: image in (height, width, num_channels) format.
- `"none"` or `ChannelDimension.NONE`: image in (height, width) format.
do_convert_rgb (`bool`, *optional*, defaults to `self.do_convert_rgb`):
Whether to convert the image to RGB.
"""
do_resize = do_resize if do_resize is not None else self.do_resize
do_rescale = do_rescale if do_rescale is not None else self.do_rescale
do_normalize = do_normalize if do_normalize is not None else self.do_normalize
resample = resample if resample is not None else self.resample
high_res_resample = high_res_resample if high_res_resample is not None else self.high_res_resample
rescale_factor = rescale_factor if rescale_factor is not None else self.rescale_factor
image_mean = image_mean if image_mean is not None else self.image_mean
image_std = image_std if image_std is not None else self.image_std
high_res_image_mean = high_res_image_mean if high_res_image_mean is not None else self.high_res_image_mean
high_res_image_std = high_res_image_std if high_res_image_std is not None else self.high_res_image_std
do_convert_rgb = do_convert_rgb if do_convert_rgb is not None else self.do_convert_rgb
size = size if size is not None else self.size
size_dict = get_size_dict(size)
high_res_size = high_res_size if high_res_size is not None else self.high_res_size
high_res_size_dict = get_size_dict(high_res_size)
images = make_list_of_images(images)
if not valid_images(images):
raise ValueError(
"Invalid image type. Must be of type PIL.Image.Image, numpy.ndarray, "
"torch.Tensor, tf.Tensor or jax.ndarray."
)
validate_preprocess_arguments(
do_rescale=do_rescale,
rescale_factor=rescale_factor,
do_normalize=do_normalize,
image_mean=image_mean,
image_std=image_std,
do_resize=do_resize,
size=size,
resample=resample,
)
if do_convert_rgb:
images = [convert_to_rgb(image) for image in images]
# All transformations expect numpy arrays.
images = [to_numpy_array(image) for image in images]
if do_rescale and is_scaled_image(images[0]):
logger.warning_once(
"It looks like you are trying to rescale already rescaled images. If the input"
" images have pixel values between 0 and 1, set `do_rescale=False` to avoid rescaling them again."
)
if input_data_format is None:
# We assume that all images have the same channel dimension format.
input_data_format = infer_channel_dimension_format(images[0])
all_images = []
all_high_res_images = []
for image in images:
# high_res_image: resize (high) -> rescale -> normalize (high)
# low_res_image: resize (high) -> resize (low) -> rescale -> normalize (low)
high_res_image = image
if do_resize:
high_res_image = self.resize(
image=high_res_image,
size=high_res_size_dict,
resample=high_res_resample,
input_data_format=input_data_format,
)
image = self.resize(
image=high_res_image, size=size_dict, resample=resample, input_data_format=input_data_format
)
if do_rescale:
image = self.rescale(image=image, scale=rescale_factor, input_data_format=input_data_format)
high_res_image = self.rescale(
image=high_res_image, scale=rescale_factor, input_data_format=input_data_format
)
if do_normalize:
image = self.normalize(
image=image, mean=image_mean, std=image_std, input_data_format=input_data_format
)
high_res_image = self.normalize(
image=high_res_image,
mean=high_res_image_mean,
std=high_res_image_std,
input_data_format=input_data_format,
)
image = to_channel_dimension_format(image, data_format, input_channel_dim=input_data_format)
high_res_image = to_channel_dimension_format(
high_res_image, data_format, input_channel_dim=input_data_format
)
all_images.append(image)
all_high_res_images.append(high_res_image)
data = {"pixel_values": all_images, "high_res_pixel_values": all_high_res_images}
return BatchFeature(data=data, tensor_type=return_tensors)
class DeepseekVLHybridProcessorKwargs(DeepseekVLProcessorKwargs):
pass
class DeepseekVLHybridProcessor(DeepseekVLProcessor):
def __call__(
self,
text: Union[TextInput, PreTokenizedInput, list[TextInput], list[PreTokenizedInput]] = None,
images: ImageInput = None,
**kwargs: Unpack[DeepseekVLHybridProcessorKwargs],
) -> BatchFeature:
"""
Main method to prepare one or several sequence(s) and image(s) for the model. This method forwards the `text`
and `kwargs` arguments to LlamaTokenizerFast's [`~LlamaTokenizerFast.__call__`] if `text` is not `None` to encode
the text. To prepare the image(s), this method forwards the `images` and `kwargs` arguments to
DeepseekVLHybridImageProcessor's [`~DeepseekVLHybridImageProcessor.__call__`] if `images` is not `None`. Please refer to the docstring
of the above two methods for more information.
Args:
text (`str`, `List[str]`, `List[List[str]]`):
The sequence or batch of sequences to be encoded. Each sequence can be a string or a list of strings
(pretokenized string). If the sequences are provided as list of strings (pretokenized), you must set
`is_split_into_words=True` (to lift the ambiguity with a batch of sequences).
images (`PIL.Image.Image`, `np.ndarray`, `torch.Tensor`, `List[PIL.Image.Image]`, `List[np.ndarray]`, `List[torch.Tensor]`):
The image or batch of images to be prepared. Each image can be a PIL image, NumPy array or PyTorch
tensor. Both channels-first and channels-last formats are supported.
return_tensors (`str` or [`~utils.TensorType`], *optional*):
If set, will return tensors of a particular framework. Acceptable values are:
- `'tf'`: Return TensorFlow `tf.constant` objects.
- `'pt'`: Return PyTorch `torch.Tensor` objects.
- `'np'`: Return NumPy `np.ndarray` objects.
- `'jax'`: Return JAX `jnp.ndarray` objects.
Returns:
[`BatchFeature`]: A [`BatchFeature`] with the following fields:
- **input_ids** -- List of token ids to be fed to a model. Returned when `text` is not `None`.
- **attention_mask** -- List of indices specifying which tokens should be attended to by the model (when
`return_attention_mask=True` or if *"attention_mask"* is in `self.model_input_names` and if `text` is not
`None`).
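- **pixel_values** -- Pixel values to be fed to a model. Returned when `images` is not `None`.
- **high_res_pixel_values** -- High resolution pixel values to be fed to a model. Returned when `images` is not `None`.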
"""
output_kwargs = self._merge_kwargs(
DeepseekVLHybridProcessorKwargs, tokenizer_init_kwargs=self.tokenizer.init_kwargs, **kwargs
)
if text is None and images is None:
raise ValueError("You must specify either text or images.")
if text is not None:
if isinstance(text, str):
text = [text]
elif not (isinstance(text, (list, tuple)) and all(isinstance(t, str) for t in text)):
raise ValueError("Invalid input text. Please provide a string, or a list of strings")
prompt_strings = []
one_img_tokens = self.image_token * self.num_image_tokens
for prompt in text:
prompt = prompt.replace(self.image_token, one_img_tokens)
prompt_strings.append(prompt)
data = self.tokenizer(prompt_strings, **output_kwargs["text_kwargs"])
# process images if they are provided
if images is not None:
images = make_flat_list_of_images(images)
inputs = self.image_processor(images, **output_kwargs["images_kwargs"])
data["pixel_values"] = inputs["pixel_values"]
data["high_res_pixel_values"] = inputs["high_res_pixel_values"]
return BatchFeature(data=data)
__all__ = [
"DeepseekVLHybridConfig",
"DeepseekVLHybridPreTrainedModel",
"DeepseekVLHybridModel",
"DeepseekVLHybridForConditionalGeneration",
"DeepseekVLHybridImageProcessor",
"DeepseekVLHybridProcessor",
]
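
Review note: the two-branch flow in `DeepseekVLHybridImageProcessor.preprocess` above can be sketched in plain Python as follows. This is only an illustration under stated assumptions: nearest-neighbour sampling stands in for the bicubic filter, the sizes 8 and 4 stand in for 1024 and 384, the helper names are hypothetical, and the mean/std constants are assumed values of `OPENAI_CLIP_MEAN`, `OPENAI_CLIP_STD`, `IMAGENET_STANDARD_MEAN` and `IMAGENET_STANDARD_STD`.

```python
# Illustrative sketch only, not the library implementation.
CLIP_MEAN = [0.48145466, 0.4578275, 0.40821073]  # assumed OPENAI_CLIP_MEAN
CLIP_STD = [0.26862954, 0.26130258, 0.27577711]  # assumed OPENAI_CLIP_STD
STANDARD_MEAN = [0.5, 0.5, 0.5]  # assumed IMAGENET_STANDARD_MEAN
STANDARD_STD = [0.5, 0.5, 0.5]  # assumed IMAGENET_STANDARD_STD


def resize_nearest(image, height, width):
    """Resize an H x W x C nested list with nearest-neighbour sampling."""
    src_h, src_w = len(image), len(image[0])
    return [
        [image[y * src_h // height][x * src_w // width] for x in range(width)]
        for y in range(height)
    ]


def rescale(image, factor):
    """Multiply every channel value by `factor`."""
    return [[[value * factor for value in pixel] for pixel in row] for row in image]


def normalize(image, mean, std):
    """Apply per-channel (value - mean) / std."""
    return [
        [[(value - m) / s for value, m, s in zip(pixel, mean, std)] for pixel in row]
        for row in image
    ]


def preprocess_dual(image, size=4, high_res_size=8, rescale_factor=1 / 255):
    # Resize to the high resolution first, then derive the low-resolution
    # image from the already-resized high-resolution one.
    high = resize_nearest(image, high_res_size, high_res_size)
    low = resize_nearest(high, size, size)
    high, low = rescale(high, rescale_factor), rescale(low, rescale_factor)
    # Each branch is normalized with its own statistics.
    return {
        "pixel_values": normalize(low, STANDARD_MEAN, STANDARD_STD),
        "high_res_pixel_values": normalize(high, CLIP_MEAN, CLIP_STD),
    }


# A 16 x 16 RGB image filled with mid-grey pixels.
image = [[[128, 128, 128] for _ in range(16)] for _ in range(16)]
out = preprocess_dual(image)
print(len(out["pixel_values"]), len(out["high_res_pixel_values"]))
```

The point of the sketch is the ordering: both outputs come from the same high-resolution resize, and only the normalization statistics differ between the two branches.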


@@ -0,0 +1,159 @@
# 🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨
# This file was automatically generated from src/transformers/models/deepseek_vl_hybrid/modular_deepseek_vl_hybrid.py.
# Do NOT edit this file manually as any edits will be overwritten by the generation of
# the file from the modular. If any change should be done, please apply the change to the
# modular_deepseek_vl_hybrid.py file directly. One of our CI enforces this.
# 🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨
# Copyright 2025 Deepseek AI and The HuggingFace Team. All rights reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
from typing import Union
from ...image_processing_utils_fast import BatchFeature
from ...image_utils import ImageInput, make_flat_list_of_images
from ...processing_utils import ProcessingKwargs, ProcessorMixin, Unpack
from ...tokenization_utils_base import PreTokenizedInput, TextInput
class DeepseekVLHybridProcessorKwargs(ProcessingKwargs, total=False):
_defaults = {
"text_kwargs": {"padding": False},
"common_kwargs": {"return_tensors": "pt"},
}
class DeepseekVLHybridProcessor(ProcessorMixin):
r"""
Constructs a DeepseekVLHybrid processor which wraps a DeepseekVLHybrid Image Processor and a Llama tokenizer into a single processor.
[`DeepseekVLHybridProcessor`] offers all the functionalities of [`DeepseekVLHybridImageProcessor`] and [`LlamaTokenizerFast`]. See the
[`~DeepseekVLHybridProcessor.__call__`] and [`~DeepseekVLHybridProcessor.decode`] for more information.
Args:
image_processor ([`DeepseekVLHybridImageProcessor`]):
The image processor is a required input.
tokenizer ([`LlamaTokenizerFast`]):
The tokenizer is a required input.
chat_template (`str`, *optional*):
A Jinja template which will be used to convert lists of messages
in a chat into a tokenizable string.
num_image_tokens (`int`, *optional*, defaults to 576):
The number of special image tokens used as placeholders for visual content in text sequences.
"""
attributes = ["image_processor", "tokenizer"]
valid_kwargs = ["chat_template", "num_image_tokens"]
image_processor_class = "AutoImageProcessor"
tokenizer_class = "AutoTokenizer"
def __init__(
self,
image_processor,
tokenizer,
chat_template=None,
num_image_tokens=576,
):
self.image_token = tokenizer.image_token
self.num_image_tokens = num_image_tokens
super().__init__(image_processor, tokenizer, chat_template=chat_template)
def __call__(
self,
text: Union[TextInput, PreTokenizedInput, list[TextInput], list[PreTokenizedInput]] = None,
images: ImageInput = None,
**kwargs: Unpack[DeepseekVLHybridProcessorKwargs],
) -> BatchFeature:
"""
Main method to prepare one or several sequence(s) and image(s) for the model. This method forwards the `text`
and `kwargs` arguments to LlamaTokenizerFast's [`~LlamaTokenizerFast.__call__`] if `text` is not `None` to encode
the text. To prepare the image(s), this method forwards the `images` and `kwargs` arguments to
DeepseekVLHybridImageProcessor's [`~DeepseekVLHybridImageProcessor.__call__`] if `images` is not `None`. Please refer to the docstring
of the above two methods for more information.
Args:
text (`str`, `List[str]`, `List[List[str]]`):
The sequence or batch of sequences to be encoded. Each sequence can be a string or a list of strings
(pretokenized string). If the sequences are provided as list of strings (pretokenized), you must set
`is_split_into_words=True` (to lift the ambiguity with a batch of sequences).
images (`PIL.Image.Image`, `np.ndarray`, `torch.Tensor`, `List[PIL.Image.Image]`, `List[np.ndarray]`, `List[torch.Tensor]`):
The image or batch of images to be prepared. Each image can be a PIL image, NumPy array or PyTorch
tensor. Both channels-first and channels-last formats are supported.
return_tensors (`str` or [`~utils.TensorType`], *optional*):
If set, will return tensors of a particular framework. Acceptable values are:
- `'tf'`: Return TensorFlow `tf.constant` objects.
- `'pt'`: Return PyTorch `torch.Tensor` objects.
- `'np'`: Return NumPy `np.ndarray` objects.
- `'jax'`: Return JAX `jnp.ndarray` objects.
Returns:
[`BatchFeature`]: A [`BatchFeature`] with the following fields:
- **input_ids** -- List of token ids to be fed to a model. Returned when `text` is not `None`.
- **attention_mask** -- List of indices specifying which tokens should be attended to by the model (when
`return_attention_mask=True` or if *"attention_mask"* is in `self.model_input_names` and if `text` is not
`None`).
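- **pixel_values** -- Pixel values to be fed to a model. Returned when `images` is not `None`.
- **high_res_pixel_values** -- High resolution pixel values to be fed to a model. Returned when `images` is not `None`.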
"""
output_kwargs = self._merge_kwargs(
DeepseekVLHybridProcessorKwargs, tokenizer_init_kwargs=self.tokenizer.init_kwargs, **kwargs
)
if text is None and images is None:
raise ValueError("You must specify either text or images.")
if text is not None:
if isinstance(text, str):
text = [text]
elif not (isinstance(text, (list, tuple)) and all(isinstance(t, str) for t in text)):
raise ValueError("Invalid input text. Please provide a string, or a list of strings")
prompt_strings = []
one_img_tokens = self.image_token * self.num_image_tokens
for prompt in text:
prompt = prompt.replace(self.image_token, one_img_tokens)
prompt_strings.append(prompt)
data = self.tokenizer(prompt_strings, **output_kwargs["text_kwargs"])
# process images if they are provided
if images is not None:
images = make_flat_list_of_images(images)
inputs = self.image_processor(images, **output_kwargs["images_kwargs"])
data["pixel_values"] = inputs["pixel_values"]
data["high_res_pixel_values"] = inputs["high_res_pixel_values"]
return BatchFeature(data=data)
def batch_decode(self, *args, **kwargs):
"""
This method forwards all its arguments to LlamaTokenizerFast's [`~PreTrainedTokenizer.batch_decode`]. Please
refer to the docstring of this method for more information.
"""
return self.tokenizer.batch_decode(*args, **kwargs)
def decode(self, *args, **kwargs):
"""
This method forwards all its arguments to LlamaTokenizerFast's [`~PreTrainedTokenizer.decode`]. Please refer to
the docstring of this method for more information.
"""
return self.tokenizer.decode(*args, **kwargs)
@property
def model_input_names(self):
tokenizer_input_names = self.tokenizer.model_input_names
image_processor_input_names = self.image_processor.model_input_names
return list(dict.fromkeys(tokenizer_input_names + image_processor_input_names))
__all__ = ["DeepseekVLHybridProcessor"]
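
Review note: the placeholder expansion performed in `__call__` above can be shown in isolation. A minimal sketch, assuming a hypothetical token string `<image_placeholder>` (the real value comes from `tokenizer.image_token`) and a small `num_image_tokens` for readability instead of the default 576:

```python
def expand_image_tokens(prompts, image_token="<image_placeholder>", num_image_tokens=576):
    """Replace each image placeholder with `num_image_tokens` copies of itself."""
    one_img_tokens = image_token * num_image_tokens
    return [prompt.replace(image_token, one_img_tokens) for prompt in prompts]


# Hypothetical prompts; the second one contains no image placeholder.
prompts = ["<image_placeholder>Describe this image.", "No image here."]
expanded = expand_image_tokens(prompts, num_image_tokens=3)
print(expanded[0])
print(expanded[1])
```

Each placeholder in the text is widened to one token per visual embedding, so the tokenized sequence has a slot for every image feature. Prompts without a placeholder pass through unchanged.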


@@ -1147,7 +1147,7 @@ class JanusForConditionalGeneration(JanusPreTrainedModel, GenerationMixin):
labels: Optional[torch.LongTensor] = None,
use_cache: Optional[bool] = None,
logits_to_keep: Union[int, torch.Tensor] = 0,
-**kwargs,
+**kwargs: Unpack[TransformersKwargs],
):
r"""
labels (`torch.LongTensor` of shape `(batch_size, sequence_length)`, *optional*):
@@ -1173,7 +1173,9 @@ class JanusForConditionalGeneration(JanusPreTrainedModel, GenerationMixin):
loss = None
if labels is not None:
-loss = self.loss_function(logits=logits, labels=labels, vocab_size=self.config.text_config.vocab_size)
+loss = self.loss_function(
+logits=logits, labels=labels, vocab_size=self.config.text_config.vocab_size, **kwargs
+)
return JanusCausalLMOutputWithPast(
loss=loss,


@@ -1007,7 +1007,7 @@ class JanusForConditionalGeneration(JanusPreTrainedModel, GenerationMixin):
labels: Optional[torch.LongTensor] = None,
use_cache: Optional[bool] = None,
logits_to_keep: Union[int, torch.Tensor] = 0,
-**kwargs,
+**kwargs: Unpack[TransformersKwargs],
):
r"""
labels (`torch.LongTensor` of shape `(batch_size, sequence_length)`, *optional*):
@@ -1033,7 +1033,9 @@ class JanusForConditionalGeneration(JanusPreTrainedModel, GenerationMixin):
loss = None
if labels is not None:
-loss = self.loss_function(logits=logits, labels=labels, vocab_size=self.config.text_config.vocab_size)
+loss = self.loss_function(
+logits=logits, labels=labels, vocab_size=self.config.text_config.vocab_size, **kwargs
+)
return JanusCausalLMOutputWithPast(
loss=loss,
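
Review note: the Janus hunks above forward `**kwargs` into `self.loss_function` instead of dropping them. A rough sketch of why that matters, assuming a hypothetical loss that accepts an optional `num_items_in_batch` normalizer (the function names and that parameter are assumptions for illustration, and label shifting is omitted):

```python
import math


def token_nll(logits, label):
    """Negative log-likelihood of `label` under softmax(logits)."""
    log_norm = math.log(sum(math.exp(x) for x in logits))
    return log_norm - logits[label]


def loss_function(logits, labels, vocab_size, num_items_in_batch=None, **kwargs):
    # Hypothetical stand-in for `self.loss_function`.
    total = sum(token_nll(row, label) for row, label in zip(logits, labels))
    denominator = num_items_in_batch if num_items_in_batch is not None else len(labels)
    return total / denominator


def forward(logits, labels, vocab_size, **kwargs):
    # Extra keyword arguments are passed through to the loss instead of being dropped.
    return loss_function(logits=logits, labels=labels, vocab_size=vocab_size, **kwargs)


logits = [[0.0, 0.0], [0.0, 0.0]]
labels = [0, 1]
default_loss = forward(logits, labels, vocab_size=2)
scaled_loss = forward(logits, labels, vocab_size=2, num_items_in_batch=4)
print(default_loss, scaled_loss)
```

Without the pass-through, the second call would compute the same value as the first, because the normalizer would never reach the loss.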


@@ -0,0 +1,119 @@
# coding=utf-8
# Copyright 2025 HuggingFace Inc.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
import unittest
from transformers.testing_utils import require_torch, require_vision
from transformers.utils import is_vision_available
from ...test_image_processing_common import ImageProcessingTestMixin, prepare_image_inputs
if is_vision_available():
from transformers import DeepseekVLImageProcessor
# Copied from tests.models.vit.test_image_processing_vit.ViTImageProcessingTester with ViT->DeepseekVL
class DeepseekVLImageProcessingTester:
def __init__(
self,
parent,
batch_size=7,
num_channels=3,
image_size=18,
min_resolution=30,
max_resolution=400,
do_resize=True,
size=None,
do_normalize=True,
image_mean=[0.5, 0.5, 0.5],
image_std=[0.5, 0.5, 0.5],
):
size = size if size is not None else {"height": 18, "width": 18}
self.parent = parent
self.batch_size = batch_size
self.num_channels = num_channels
self.image_size = image_size
self.min_resolution = min_resolution
self.max_resolution = max_resolution
self.do_resize = do_resize
self.size = size
self.do_normalize = do_normalize
self.image_mean = image_mean
self.image_std = image_std
def prepare_image_processor_dict(self):
return {
"image_mean": self.image_mean,
"image_std": self.image_std,
"do_normalize": self.do_normalize,
"do_resize": self.do_resize,
"size": self.size,
}
# Ignore copy
def expected_output_image_shape(self, images):
max_size = max(self.size["height"], self.size["width"])
return self.num_channels, max_size, max_size
def prepare_image_inputs(self, equal_resolution=False, numpify=False, torchify=False):
return prepare_image_inputs(
batch_size=self.batch_size,
num_channels=self.num_channels,
min_resolution=self.min_resolution,
max_resolution=self.max_resolution,
equal_resolution=equal_resolution,
numpify=numpify,
torchify=torchify,
)
@require_torch
@require_vision
# Copied from tests.models.vit.test_image_processing_vit.ViTImageProcessingTest with ViT->DeepseekVL
class DeepseekVLImageProcessingTest(ImageProcessingTestMixin, unittest.TestCase):
# Ignore copy
image_processing_class = DeepseekVLImageProcessor if is_vision_available() else None
def setUp(self):
super().setUp()
self.image_processor_tester = DeepseekVLImageProcessingTester(self)
@property
def image_processor_dict(self):
return self.image_processor_tester.prepare_image_processor_dict()
def test_image_processor_properties(self):
for image_processing_class in self.image_processor_list:
image_processing = image_processing_class(**self.image_processor_dict)
self.assertTrue(hasattr(image_processing, "image_mean"))
self.assertTrue(hasattr(image_processing, "image_std"))
self.assertTrue(hasattr(image_processing, "do_normalize"))
self.assertTrue(hasattr(image_processing, "do_resize"))
self.assertTrue(hasattr(image_processing, "size"))
def test_image_processor_from_dict_with_kwargs(self):
for image_processing_class in self.image_processor_list:
image_processor = image_processing_class.from_dict(self.image_processor_dict)
self.assertEqual(image_processor.size, {"height": 18, "width": 18})
image_processor = image_processing_class.from_dict(self.image_processor_dict, size=42)
self.assertEqual(image_processor.size, {"height": 42, "width": 42})
# Ignore copy
@unittest.skip(reason="Not supported")
def test_call_numpy_4_channels(self):
pass
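
Review note: `expected_output_image_shape` above always returns a square of side `max(height, width)`, which is consistent with an image processor that pads to a square canvas. The sketch below illustrates that idea together with the `background_color` rule from the hybrid image processor's `__init__`. The helper names are hypothetical, centring the image on the canvas is an assumption, and the resize implementation itself is not shown in this diff.

```python
def background_color_from_mean(image_mean=None):
    """Mirror of the rule in __init__: mid-grey unless a mean is given."""
    if image_mean is None:
        return (127, 127, 127)
    return tuple(int(x * 255) for x in image_mean)


def pad_to_square(image, background_color):
    """Pad an H x W image (nested lists of pixel tuples) to a square canvas."""
    height, width = len(image), len(image[0])
    side = max(height, width)
    top = (side - height) // 2  # assumption: image is centred on the canvas
    left = (side - width) // 2
    canvas = [[background_color for _ in range(side)] for _ in range(side)]
    for y in range(height):
        for x in range(width):
            canvas[top + y][left + x] = image[y][x]
    return canvas


color = background_color_from_mean([0.48145466, 0.4578275, 0.40821073])
image = [[(255, 0, 0)] * 4 for _ in range(2)]  # 2 x 4 red image
padded = pad_to_square(image, color)
print(color, len(padded), len(padded[0]))
```

Using the scaled mean as the fill colour means padded pixels land near zero after normalization.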


@@ -0,0 +1,359 @@
# coding=utf-8
# Copyright 2025 HuggingFace Inc.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
"""Testing suite for the PyTorch DeepseekVL model."""
import re
import tempfile
import unittest
from transformers import (
AutoProcessor,
DeepseekVLConfig,
DeepseekVLForConditionalGeneration,
DeepseekVLModel,
is_torch_available,
)
from transformers.testing_utils import (
require_torch,
require_torch_accelerator,
require_torch_sdpa,
slow,
torch_device,
)
from ...generation.test_utils import GenerationTesterMixin
from ...test_configuration_common import ConfigTester
from ...test_modeling_common import ModelTesterMixin, floats_tensor, ids_tensor, random_attention_mask
if is_torch_available():
import torch
class DeepseekVLModelTester:
def __init__(
self,
parent,
batch_size=2,
seq_length=25,
num_channels=3,
initializer_range=0.02,
is_training=True,
use_cache=False,
text_config={
"num_hidden_layers": 2,
"vocab_size": 99,
"hidden_size": 16,
"intermediate_size": 37,
"max_position_embeddings": 512,
"num_attention_heads": 4,
"pad_token_id": 1,
},
vision_config={
"num_hidden_layers": 1,
"hidden_size": 16,
"intermediate_size": 37,
"image_size": 32,
"patch_size": 8,
"hidden_act": "gelu",
"vision_use_head": False,
"num_attention_heads": 4,
},
):
self.parent = parent
self.batch_size = batch_size
self.seq_length = seq_length
self.num_channels = num_channels
self.initializer_range = initializer_range
self.is_training = is_training
self.use_cache = use_cache
self.text_config = text_config
self.vision_config = vision_config
self.vision_config["num_channels"] = self.num_channels
self.num_hidden_layers = text_config["num_hidden_layers"]
self.vocab_size = text_config["vocab_size"]
self.hidden_size = text_config["hidden_size"]
self.num_attention_heads = text_config["num_attention_heads"]
self.image_size = vision_config["image_size"]
self.num_image_tokens = vision_config["image_size"] // vision_config["patch_size"]
self.pad_token_id = text_config["pad_token_id"]
self.image_token_id = self.vocab_size - 1
def get_config(self):
return DeepseekVLConfig(
text_config=self.text_config,
vision_config=self.vision_config,
image_token_id=self.image_token_id,
)
def prepare_config_and_inputs(self):
config = self.get_config()
# create text and vision inputs
input_ids = ids_tensor([self.batch_size, self.seq_length], config.text_config.vocab_size - 2) + 1
attention_mask = random_attention_mask([self.batch_size, self.seq_length])
pixel_values = floats_tensor(
[
self.batch_size,
self.num_channels,
self.image_size,
self.image_size,
]
)
# fill image_tokens
input_ids[:, : self.num_image_tokens] = self.image_token_id
return config, input_ids, attention_mask, pixel_values
def prepare_config_and_inputs_for_common(self):
config_and_inputs = self.prepare_config_and_inputs()
config, input_ids, attention_mask, pixel_values = config_and_inputs
inputs_dict = {"input_ids": input_ids, "attention_mask": attention_mask, "pixel_values": pixel_values}
return config, inputs_dict
@require_torch
class DeepseekVLModelTest(ModelTesterMixin, GenerationTesterMixin, unittest.TestCase):
all_model_classes = (DeepseekVLModel, DeepseekVLForConditionalGeneration) if is_torch_available() else ()
pipeline_model_mapping = (
{
"feature-extraction": DeepseekVLModel,
"image-text-to-text": DeepseekVLForConditionalGeneration,
}
if is_torch_available()
else {}
)
_is_composite = True
test_pruning = False
test_head_masking = False
def setUp(self):
self.model_tester = DeepseekVLModelTester(self)
self.config_tester = ConfigTester(self, config_class=DeepseekVLConfig, has_text_modality=False)

    # Override the inputs_embeds test because "pixel_values" must be deleted for VLMs.
    def test_inputs_embeds(self):
        config, inputs_dict = self.model_tester.prepare_config_and_inputs_for_common()

        for model_class in self.all_model_classes:
            model = model_class(config)
            model.to(torch_device)
            model.eval()

            inputs = self._prepare_for_class(inputs_dict, model_class)

            input_ids = inputs["input_ids"]
            del inputs["input_ids"]
            del inputs["pixel_values"]

            wte = model.get_input_embeddings()
            inputs["inputs_embeds"] = wte(input_ids)

            with torch.no_grad():
                model(**inputs)

    # Override the inputs_embeds test because "pixel_values" must be deleted for VLMs.
    def test_inputs_embeds_matches_input_ids(self):
        config, inputs_dict = self.model_tester.prepare_config_and_inputs_for_common()

        for model_class in self.all_model_classes:
            model = model_class(config)
            model.to(torch_device)
            model.eval()

            inputs = self._prepare_for_class(inputs_dict, model_class)
            input_ids = inputs["input_ids"]
            del inputs["input_ids"]
            del inputs["pixel_values"]

            inputs_embeds = model.get_input_embeddings()(input_ids)

            with torch.no_grad():
                out_ids = model(input_ids=input_ids, **inputs)[0]
                out_embeds = model(inputs_embeds=inputs_embeds, **inputs)[0]
            torch.testing.assert_close(out_embeds, out_ids)

@unittest.skip(reason="Siglip uses the same initialization scheme as the Flax original implementation")
# Copied from tests.models.siglip.test_modeling_siglip.SiglipVisionModelTest.test_initialization
def test_initialization(self):
pass

    @require_torch_sdpa
    # Copied from tests.models.janus.test_modeling_janus.JanusVisionText2TextModelTest.test_sdpa_can_dispatch_composite_models
    def test_sdpa_can_dispatch_composite_models(self):
        for model_class in self.all_model_classes:
            config, inputs_dict = self.model_tester.prepare_config_and_inputs_for_common()
            model = model_class(config)

            with tempfile.TemporaryDirectory() as tmpdirname:
                model.save_pretrained(tmpdirname)

                # Load the model with SDPA
                model_sdpa = model_class.from_pretrained(tmpdirname)
                model_sdpa = model_sdpa.eval().to(torch_device)

                # Load model with eager attention
                model_eager = model_class.from_pretrained(
                    tmpdirname,
                    attn_implementation="eager",
                )
                model_eager = model_eager.eval().to(torch_device)

            # SigLip has one shared cls attr for all models, so we assign both submodels here
            vision_attn = language_attn = "sdpa" if model._supports_sdpa else "eager"

            if hasattr(model_sdpa, "vision_model") and hasattr(model_sdpa, "language_model"):
                self.assertTrue(model_sdpa.vision_model.config._attn_implementation == vision_attn)
                self.assertTrue(model_sdpa.language_model.config._attn_implementation == language_attn)
                self.assertTrue(model_eager.vision_model.config._attn_implementation == "eager")
                self.assertTrue(model_eager.language_model.config._attn_implementation == "eager")

            self.assertTrue(model_sdpa.config._attn_implementation == "sdpa")
            self.assertTrue(model_eager.config._attn_implementation == "eager")

            for name, submodule in model_eager.named_modules():
                class_name = submodule.__class__.__name__
                if any(re.finditer(r"Attention(?!Pool)", class_name)):
                    self.assertTrue(submodule.config._attn_implementation == "eager")

            for name, submodule in model_sdpa.named_modules():
                class_name = submodule.__class__.__name__
                if any(re.finditer(r"Attention(?!Pool)", class_name)):
                    self.assertTrue(submodule.config._attn_implementation == "sdpa")

@require_torch
@require_torch_accelerator
@slow
class DeepseekVLIntegrationTest(unittest.TestCase):
def setUp(self):
self.model_id = "deepseek-community/deepseek-vl-1.3b-chat"
def test_model_text_generation(self):
model = DeepseekVLForConditionalGeneration.from_pretrained(
self.model_id, torch_dtype="auto", device_map="auto"
)
model.to(torch_device)
model.eval()
processor = AutoProcessor.from_pretrained(self.model_id)
messages = [
{
"role": "user",
"content": [
{
"type": "image",
"url": "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/pipeline-cat-chonk.jpeg",
},
{"type": "text", "text": "Describe this image."},
],
}
]
EXPECTED_TEXT = 'You are a helpful language and vision assistant. You are able to understand the visual content that the user provides, and assist the user with a variety of tasks using natural language.\n\nUser: Describe this image.\n\nAssistant:In the image, a majestic snow leopard is captured in a moment of tranquility. The snow leopard' # fmt: skip
inputs = processor.apply_chat_template(
messages, add_generation_prompt=True, tokenize=True, return_dict=True, return_tensors="pt"
)
inputs = inputs.to(model.device, dtype=model.dtype)
output = model.generate(**inputs, max_new_tokens=20, do_sample=False)
text = processor.decode(output[0], skip_special_tokens=True)
self.assertEqual(
text,
EXPECTED_TEXT,
)
def test_model_text_generation_batched(self):
model = DeepseekVLForConditionalGeneration.from_pretrained(
self.model_id, torch_dtype="auto", device_map="auto"
)
model.to(torch_device)
model.eval()
processor = AutoProcessor.from_pretrained(self.model_id)
messages = [
[
{
"role": "user",
"content": [
{
"type": "image",
"url": "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/pipeline-cat-chonk.jpeg",
},
{"type": "text", "text": "Describe this image."},
],
}
],
[
{
"role": "user",
"content": [
{
"type": "image",
"url": "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/pipeline-cat-chonk.jpeg",
},
{"type": "text", "text": "What animal do you see in the image?"},
],
}
],
]
EXPECTED_TEXT = [
"You are a helpful language and vision assistant. You are able to understand the visual content that the user provides, and assist the user with a variety of tasks using natural language.\n\nUser: Describe this image.\n\nAssistant:In the image, a majestic snow leopard is captured in a moment of tranquility. The snow leopard", # fmt: skip
"You are a helpful language and vision assistant. You are able to understand the visual content that the user provides, and assist the user with a variety of tasks using natural language.\n\nUser: What animal do you see in the image?\n\nAssistant:I see a bear in the image.What is the significance of the color red in the", # fmt: skip
]
inputs = processor.apply_chat_template(
messages, add_generation_prompt=True, tokenize=True, padding=True, return_dict=True, return_tensors="pt"
)
inputs = inputs.to(model.device, dtype=model.dtype)
output = model.generate(**inputs, max_new_tokens=20, do_sample=False)
text = processor.batch_decode(output, skip_special_tokens=True)
self.assertEqual(EXPECTED_TEXT, text)
def test_model_text_generation_with_multi_image(self):
model = DeepseekVLForConditionalGeneration.from_pretrained(
self.model_id, torch_dtype="auto", device_map="auto"
)
model.to(torch_device)
model.eval()
processor = AutoProcessor.from_pretrained(self.model_id)
messages = [
{
"role": "user",
"content": [
{"type": "text", "text": "What's the difference between"},
{"type": "image", "url": "http://images.cocodataset.org/val2017/000000039769.jpg"},
{"type": "text", "text": " and "},
{"type": "image", "url": "https://www.ilankelman.org/stopsigns/australia.jpg"},
],
}
]
EXPECTED_TEXT = "You are a helpful language and vision assistant. You are able to understand the visual content that the user provides, and assist the user with a variety of tasks using natural language.\n\nUser: What's the difference between and \n\nAssistant:The image is a photograph featuring two cats lying on a pink blanket. The cat on the left is" # fmt: skip
inputs = processor.apply_chat_template(
messages, add_generation_prompt=True, tokenize=True, return_dict=True, return_tensors="pt"
)
inputs = inputs.to(model.device, dtype=model.dtype)
output = model.generate(**inputs, max_new_tokens=20, do_sample=False)
text = processor.decode(output[0], skip_special_tokens=True)
self.assertEqual(
text,
EXPECTED_TEXT,
)
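
Note on the submodule check in `test_sdpa_can_dispatch_composite_models`: the pattern `Attention(?!Pool)` uses a negative lookahead, so a class name matches only when "Attention" is not immediately followed by "Pool". The following is a minimal sketch of that filter in isolation; the helper name `is_attention_class` and the sample class names are hypothetical and chosen only for illustration.

```python
import re

# Same pattern as in the test; the negative lookahead rejects "AttentionPool...".
ATTENTION_PATTERN = r"Attention(?!Pool)"


def is_attention_class(class_name: str) -> bool:
    """Return True if the name contains "Attention" not directly followed by "Pool"."""
    return any(re.finditer(ATTENTION_PATTERN, class_name))


# Hypothetical class names, used only to exercise the pattern.
sample_names = ["LlamaAttention", "SiglipSdpaAttention", "AttentionPoolingHead", "LlamaMLP"]
matches = {name: is_attention_class(name) for name in sample_names}
```

In this sketch only the first two names pass the filter, which mirrors how the test skips pooling heads when it inspects `_attn_implementation`.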

@@ -0,0 +1,54 @@
# coding=utf-8
# Copyright 2025 HuggingFace Inc.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

import tempfile
import unittest

from transformers import DeepseekVLProcessor, LlamaTokenizer
from transformers.models.deepseek_vl.convert_deepseek_vl_weights_to_hf import CHAT_TEMPLATE
from transformers.testing_utils import get_tests_dir
from transformers.utils import is_vision_available

from ...test_processing_common import ProcessorTesterMixin


if is_vision_available():
    from transformers import DeepseekVLImageProcessor


SAMPLE_VOCAB = get_tests_dir("fixtures/test_sentencepiece.model")


class DeepseekVLProcessorTest(ProcessorTesterMixin, unittest.TestCase):
    processor_class = DeepseekVLProcessor

    def setUp(self):
        self.tmpdirname = tempfile.mkdtemp()
        image_processor = DeepseekVLImageProcessor()
        tokenizer = LlamaTokenizer(
            vocab_file=SAMPLE_VOCAB,
            extra_special_tokens={
                "pad_token": "<end▁of▁sentence>",
                "image_token": "<image_placeholder>",
            },
        )
        processor = self.processor_class(
            image_processor=image_processor,
            tokenizer=tokenizer,
            chat_template=CHAT_TEMPLATE,
        )
        processor.save_pretrained(self.tmpdirname)

    def prepare_processor_dict(self):
        return {"chat_template": CHAT_TEMPLATE, "num_image_tokens": 576}

@@ -0,0 +1,218 @@
# coding=utf-8
# Copyright 2025 HuggingFace Inc.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
import unittest
import numpy as np
from transformers.testing_utils import require_torch, require_vision
from transformers.utils import is_torch_available, is_vision_available
from ...test_image_processing_common import ImageProcessingTestMixin, prepare_image_inputs
if is_torch_available():
import torch
if is_vision_available():
from PIL import Image
from transformers import DeepseekVLHybridImageProcessor


class DeepseekVLHybridImageProcessingTester:
    def __init__(
        self,
        parent,
        batch_size=7,
        num_channels=3,
        image_size=18,
        min_resolution=30,
        max_resolution=400,
        do_resize=True,
        size=None,
        high_res_size=None,
        do_normalize=True,
        image_mean=[0.5, 0.5, 0.5],
        image_std=[0.5, 0.5, 0.5],
        high_res_image_mean=[0.5, 0.5, 0.5],
        high_res_image_std=[0.5, 0.5, 0.5],
    ):
        size = size if size is not None else {"height": 18, "width": 18}
        high_res_size = high_res_size if high_res_size is not None else {"height": 36, "width": 36}
        self.parent = parent
        self.batch_size = batch_size
        self.num_channels = num_channels
        self.image_size = image_size
        self.min_resolution = min_resolution
        self.max_resolution = max_resolution
        self.do_resize = do_resize
        self.size = size
        self.high_res_size = high_res_size
        self.do_normalize = do_normalize
        self.image_mean = image_mean
        self.image_std = image_std
        self.high_res_image_mean = high_res_image_mean
        self.high_res_image_std = high_res_image_std

    def prepare_image_processor_dict(self):
        return {
            "image_mean": self.image_mean,
            "image_std": self.image_std,
            "high_res_image_mean": self.high_res_image_mean,
            "high_res_image_std": self.high_res_image_std,
            "do_normalize": self.do_normalize,
            "do_resize": self.do_resize,
            "size": self.size,
            "high_res_size": self.high_res_size,
        }

    def expected_output_image_shape(self, images):
        max_size = max(self.size["height"], self.size["width"])
        return self.num_channels, max_size, max_size

    def expected_output_high_res_image_shape(self, images):
        max_size = max(self.high_res_size["height"], self.high_res_size["width"])
        return self.num_channels, max_size, max_size

    def prepare_image_inputs(self, equal_resolution=False, numpify=False, torchify=False):
        return prepare_image_inputs(
            batch_size=self.batch_size,
            num_channels=self.num_channels,
            min_resolution=self.min_resolution,
            max_resolution=self.max_resolution,
            equal_resolution=equal_resolution,
            numpify=numpify,
            torchify=torchify,
        )


@require_torch
@require_vision
class DeepseekVLHybridImageProcessingTest(ImageProcessingTestMixin, unittest.TestCase):
image_processing_class = DeepseekVLHybridImageProcessor if is_vision_available() else None
# Copied from tests.models.vit.test_image_processing_vit.ViTImageProcessingTester.setUp with ViT->DeepseekVLHybrid
def setUp(self):
super().setUp()
self.image_processor_tester = DeepseekVLHybridImageProcessingTester(self)
@property
# Copied from tests.models.vit.test_image_processing_vit.ViTImageProcessingTester.image_processor_dict with ViT->DeepseekVLHybrid
def image_processor_dict(self):
return self.image_processor_tester.prepare_image_processor_dict()
# Copied from tests.models.vit.test_image_processing_vit.ViTImageProcessingTester.test_image_processor_from_dict_with_kwargs
def test_image_processor_from_dict_with_kwargs(self):
for image_processing_class in self.image_processor_list:
image_processor = image_processing_class.from_dict(self.image_processor_dict)
self.assertEqual(image_processor.size, {"height": 18, "width": 18})
image_processor = image_processing_class.from_dict(self.image_processor_dict, size=42)
self.assertEqual(image_processor.size, {"height": 42, "width": 42})
def test_image_processor_properties(self):
for image_processing_class in self.image_processor_list:
image_processing = image_processing_class(**self.image_processor_dict)
self.assertTrue(hasattr(image_processing, "image_mean"))
self.assertTrue(hasattr(image_processing, "image_std"))
self.assertTrue(hasattr(image_processing, "high_res_image_mean"))
self.assertTrue(hasattr(image_processing, "high_res_image_std"))
self.assertTrue(hasattr(image_processing, "do_normalize"))
self.assertTrue(hasattr(image_processing, "do_resize"))
self.assertTrue(hasattr(image_processing, "size"))
self.assertTrue(hasattr(image_processing, "high_res_size"))
def test_call_pil_high_res(self):
for image_processing_class in self.image_processor_list:
# Initialize image_processing
image_processing = image_processing_class(**self.image_processor_dict)
# create random PIL images
image_inputs = self.image_processor_tester.prepare_image_inputs(equal_resolution=False)
for image in image_inputs:
self.assertIsInstance(image, Image.Image)
# Test not batched input
encoded_images = image_processing(image_inputs[0], return_tensors="pt").high_res_pixel_values
expected_output_image_shape = self.image_processor_tester.expected_output_high_res_image_shape(
[image_inputs[0]]
)
self.assertEqual(tuple(encoded_images.shape), (1, *expected_output_image_shape))
# Test batched
encoded_images = image_processing(image_inputs, return_tensors="pt").high_res_pixel_values
expected_output_image_shape = self.image_processor_tester.expected_output_high_res_image_shape(
image_inputs
)
self.assertEqual(
tuple(encoded_images.shape), (self.image_processor_tester.batch_size, *expected_output_image_shape)
)
def test_call_numpy_high_res(self):
for image_processing_class in self.image_processor_list:
# Initialize image_processing
image_processing = image_processing_class(**self.image_processor_dict)
# create random numpy tensors
image_inputs = self.image_processor_tester.prepare_image_inputs(equal_resolution=False, numpify=True)
for image in image_inputs:
self.assertIsInstance(image, np.ndarray)
# Test not batched input
encoded_images = image_processing(image_inputs[0], return_tensors="pt").high_res_pixel_values
expected_output_image_shape = self.image_processor_tester.expected_output_high_res_image_shape(
[image_inputs[0]]
)
self.assertEqual(tuple(encoded_images.shape), (1, *expected_output_image_shape))
# Test batched
encoded_images = image_processing(image_inputs, return_tensors="pt").high_res_pixel_values
expected_output_image_shape = self.image_processor_tester.expected_output_high_res_image_shape(
image_inputs
)
self.assertEqual(
tuple(encoded_images.shape), (self.image_processor_tester.batch_size, *expected_output_image_shape)
)
def test_call_pytorch_high_res(self):
for image_processing_class in self.image_processor_list:
# Initialize image_processing
image_processing = image_processing_class(**self.image_processor_dict)
# create random PyTorch tensors
image_inputs = self.image_processor_tester.prepare_image_inputs(equal_resolution=False, torchify=True)
for image in image_inputs:
self.assertIsInstance(image, torch.Tensor)
# Test not batched input
encoded_images = image_processing(image_inputs[0], return_tensors="pt").high_res_pixel_values
expected_output_image_shape = self.image_processor_tester.expected_output_high_res_image_shape(
[image_inputs[0]]
)
self.assertEqual(tuple(encoded_images.shape), (1, *expected_output_image_shape))
# Test batched
expected_output_image_shape = self.image_processor_tester.expected_output_high_res_image_shape(
image_inputs
)
encoded_images = image_processing(image_inputs, return_tensors="pt").high_res_pixel_values
self.assertEqual(
tuple(encoded_images.shape),
(self.image_processor_tester.batch_size, *expected_output_image_shape),
)
@unittest.skip(reason="Not supported")
def test_call_numpy_4_channels(self):
pass
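
Note on the shape assertions in the `*_high_res` tests above: the tester's `expected_output_image_shape` and `expected_output_high_res_image_shape` both return a square `(channels, side, side)` shape, where `side` is the larger of the configured height and width. The following is a minimal sketch of that arithmetic with the tester defaults (3 channels, 18x18 and 36x36 sizes, batch size 7); `expected_square_shape` is a hypothetical stand-in, not a function from the test suite.

```python
def expected_square_shape(num_channels, size):
    """Mirror the tester helpers: a square output whose side is max(height, width)."""
    side = max(size["height"], size["width"])
    return (num_channels, side, side)


# Defaults taken from DeepseekVLHybridImageProcessingTester.__init__.
low_res_shape = expected_square_shape(3, {"height": 18, "width": 18})
high_res_shape = expected_square_shape(3, {"height": 36, "width": 36})

# The batched tests prepend the batch size (7 by default) to the per-image shape.
batched_high_res_shape = (7, *high_res_shape)
```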

@@ -0,0 +1,403 @@
# coding=utf-8
# Copyright 2025 HuggingFace Inc.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
"""Testing suite for the PyTorch DeepseekVLHybrid model."""
import re
import tempfile
import unittest
from transformers import (
AutoProcessor,
DeepseekVLHybridConfig,
DeepseekVLHybridForConditionalGeneration,
DeepseekVLHybridModel,
is_torch_available,
)
from transformers.testing_utils import (
require_torch,
require_torch_accelerator,
require_torch_sdpa,
slow,
torch_device,
)
from ...generation.test_utils import GenerationTesterMixin
from ...test_configuration_common import ConfigTester
from ...test_modeling_common import ModelTesterMixin, floats_tensor, ids_tensor, random_attention_mask
if is_torch_available():
import torch
class DeepseekVLHybridModelTester:
def __init__(
self,
parent,
batch_size=2,
seq_length=25,
num_channels=3,
initializer_range=0.02,
is_training=True,
use_cache=False,
text_config={
"num_hidden_layers": 2,
"vocab_size": 99,
"hidden_size": 16,
"intermediate_size": 37,
"max_position_embeddings": 512,
"num_attention_heads": 4,
"pad_token_id": 1,
},
vision_config={
"num_hidden_layers": 1,
"hidden_size": 16,
"intermediate_size": 37,
"image_size": 32,
"patch_size": 8,
"hidden_act": "gelu",
"vision_use_head": False,
"num_attention_heads": 4,
},
high_res_vision_config={
"num_hidden_layers": 2,
"global_attn_indexes": [0],
"hidden_size": 16,
"intermediate_size": 37,
"mlp_dim": 24,
"output_channels": 4,
"image_size": 128,
"patch_size": 32,
"num_attention_heads": 4,
},
):
self.parent = parent
self.batch_size = batch_size
self.seq_length = seq_length
self.num_channels = num_channels
self.initializer_range = initializer_range
self.is_training = is_training
self.use_cache = use_cache
self.text_config = text_config
self.vision_config = vision_config
self.high_res_vision_config = high_res_vision_config
self.vision_config["num_channels"] = self.num_channels
self.high_res_vision_config["num_channels"] = self.num_channels
self.num_hidden_layers = text_config["num_hidden_layers"]
self.vocab_size = text_config["vocab_size"]
self.hidden_size = text_config["hidden_size"]
self.num_attention_heads = text_config["num_attention_heads"]
self.high_res_image_size = high_res_vision_config["image_size"]
self.image_size = vision_config["image_size"]
self.num_image_tokens = vision_config["image_size"] // vision_config["patch_size"]
self.pad_token_id = text_config["pad_token_id"]
self.image_token_id = self.vocab_size - 1
def get_config(self):
return DeepseekVLHybridConfig(
text_config=self.text_config,
vision_config=self.vision_config,
high_res_vision_config=self.high_res_vision_config,
image_token_id=self.image_token_id,
)
def prepare_config_and_inputs(self):
config = self.get_config()
# create text and vision inputs
input_ids = ids_tensor([self.batch_size, self.seq_length], config.text_config.vocab_size - 2) + 1
attention_mask = random_attention_mask([self.batch_size, self.seq_length])
pixel_values = floats_tensor(
[
self.batch_size,
self.num_channels,
self.image_size,
self.image_size,
]
)
high_res_pixel_values = floats_tensor(
[
self.batch_size,
self.num_channels,
self.high_res_image_size,
self.high_res_image_size,
]
)
# fill image_tokens
input_ids[:, : self.num_image_tokens] = self.image_token_id
return config, input_ids, attention_mask, pixel_values, high_res_pixel_values
def prepare_config_and_inputs_for_common(self):
config_and_inputs = self.prepare_config_and_inputs()
config, input_ids, attention_mask, pixel_values, high_res_pixel_values = config_and_inputs
inputs_dict = {
"input_ids": input_ids,
"attention_mask": attention_mask,
"pixel_values": pixel_values,
"high_res_pixel_values": high_res_pixel_values,
}
return config, inputs_dict
@require_torch
class DeepseekVLHybridModelTest(ModelTesterMixin, GenerationTesterMixin, unittest.TestCase):
all_model_classes = (
(DeepseekVLHybridModel, DeepseekVLHybridForConditionalGeneration) if is_torch_available() else ()
)
pipeline_model_mapping = (
{
"feature-extraction": DeepseekVLHybridModel,
"image-text-to-text": DeepseekVLHybridForConditionalGeneration,
}
if is_torch_available()
else {}
)
_is_composite = True
test_pruning = False
test_head_masking = False
def setUp(self):
self.model_tester = DeepseekVLHybridModelTester(self)
self.config_tester = ConfigTester(self, config_class=DeepseekVLHybridConfig, has_text_modality=False)

    # Override the inputs_embeds test because "pixel_values" and "high_res_pixel_values" must be deleted for VLMs.
    def test_inputs_embeds(self):
        config, inputs_dict = self.model_tester.prepare_config_and_inputs_for_common()

        for model_class in self.all_model_classes:
            model = model_class(config)
            model.to(torch_device)
            model.eval()

            inputs = self._prepare_for_class(inputs_dict, model_class)

            input_ids = inputs["input_ids"]
            del inputs["input_ids"]
            del inputs["pixel_values"]
            del inputs["high_res_pixel_values"]

            wte = model.get_input_embeddings()
            inputs["inputs_embeds"] = wte(input_ids)

            with torch.no_grad():
                model(**inputs)

    # Override the inputs_embeds test because "pixel_values" and "high_res_pixel_values" must be deleted for VLMs.
    def test_inputs_embeds_matches_input_ids(self):
        config, inputs_dict = self.model_tester.prepare_config_and_inputs_for_common()

        for model_class in self.all_model_classes:
            model = model_class(config)
            model.to(torch_device)
            model.eval()

            inputs = self._prepare_for_class(inputs_dict, model_class)
            input_ids = inputs["input_ids"]
            del inputs["input_ids"]
            del inputs["pixel_values"]
            del inputs["high_res_pixel_values"]

            inputs_embeds = model.get_input_embeddings()(input_ids)

            with torch.no_grad():
                out_ids = model(input_ids=input_ids, **inputs)[0]
                out_embeds = model(inputs_embeds=inputs_embeds, **inputs)[0]
            torch.testing.assert_close(out_embeds, out_ids)

@unittest.skip(reason="Siglip uses the same initialization scheme as the Flax original implementation")
# Copied from tests.models.siglip.test_modeling_siglip.SiglipVisionModelTest.test_initialization
def test_initialization(self):
pass

    @require_torch_sdpa
    def test_sdpa_can_dispatch_composite_models(self):
        for model_class in self.all_model_classes:
            config, inputs_dict = self.model_tester.prepare_config_and_inputs_for_common()
            model = model_class(config)

            with tempfile.TemporaryDirectory() as tmpdirname:
                model.save_pretrained(tmpdirname)

                # Load the model with SDPA
                model_sdpa = model_class.from_pretrained(
                    tmpdirname,
                    attn_implementation="sdpa",
                )
                model_sdpa = model_sdpa.eval().to(torch_device)

                # Load model with eager attention
                model_eager = model_class.from_pretrained(
                    tmpdirname,
                    attn_implementation="eager",
                )
                model_eager = model_eager.eval().to(torch_device)

            self.assertTrue(model_sdpa.config._attn_implementation == "sdpa")
            self.assertTrue(model_eager.config._attn_implementation == "eager")

            if (
                hasattr(model_sdpa, "vision_model")
                and hasattr(model_sdpa, "high_res_vision_model")
                and hasattr(model_sdpa, "language_model")
            ):
                self.assertTrue(model_sdpa.language_model.config._attn_implementation == "sdpa")
                self.assertTrue(model_sdpa.vision_model.config._attn_implementation == "sdpa")
                self.assertTrue(model_sdpa.high_res_vision_model.config._attn_implementation == "sdpa")
                self.assertTrue(model_eager.language_model.config._attn_implementation == "eager")
                self.assertTrue(model_eager.high_res_vision_model.config._attn_implementation == "eager")

            for name, submodule in model_eager.named_modules():
                class_name = submodule.__class__.__name__
                if (
                    any(re.finditer(r"Attention(?!Pool)", class_name))
                    and getattr(submodule, "config", None)
                    and submodule.config._attn_implementation == "sdpa"
                ):
                    self.assertTrue(submodule.config._attn_implementation == "eager")

            for name, submodule in model_sdpa.named_modules():
                class_name = submodule.__class__.__name__
                if (
                    any(re.finditer(r"Attention(?!Pool)", class_name))
                    and getattr(submodule, "config", None)
                    and submodule.config._attn_implementation == "eager"
                ):
                    self.assertTrue(submodule.config._attn_implementation == "sdpa")

@require_torch
@require_torch_accelerator
@slow
class DeepseekVLHybridIntegrationTest(unittest.TestCase):
def setUp(self):
self.model_id = "deepseek-community/deepseek-vl-7b-chat"
def test_model_text_generation(self):
model = DeepseekVLHybridForConditionalGeneration.from_pretrained(
self.model_id, torch_dtype="auto", device_map="auto"
)
model.to(torch_device)
model.eval()
processor = AutoProcessor.from_pretrained(self.model_id)
messages = [
{
"role": "user",
"content": [
{
"type": "image",
"url": "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/pipeline-cat-chonk.jpeg",
},
{"type": "text", "text": "Describe this image."},
],
}
]
EXPECTED_TEXT = 'You are a helpful language and vision assistant. You are able to understand the visual content that the user provides, and assist the user with a variety of tasks using natural language.\n\nUser: Describe this image.\n\nAssistant:The image depicts a fluffy, beige-colored animal with a long tail, walking on snow. The' # fmt: skip
inputs = processor.apply_chat_template(
messages, add_generation_prompt=True, tokenize=True, return_dict=True, return_tensors="pt"
)
inputs = inputs.to(model.device, dtype=model.dtype)
output = model.generate(**inputs, max_new_tokens=20, do_sample=False)
text = processor.decode(output[0], skip_special_tokens=True)
self.assertEqual(
text,
EXPECTED_TEXT,
)
def test_model_text_generation_batched(self):
model = DeepseekVLHybridForConditionalGeneration.from_pretrained(
self.model_id, torch_dtype="auto", device_map="auto"
)
model.to(torch_device)
model.eval()
processor = AutoProcessor.from_pretrained(self.model_id)
messages = [
[
{
"role": "user",
"content": [
{
"type": "image",
"url": "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/pipeline-cat-chonk.jpeg",
},
{"type": "text", "text": "Describe this image."},
],
}
],
[
{
"role": "user",
"content": [
{
"type": "image",
"url": "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/pipeline-cat-chonk.jpeg",
},
{"type": "text", "text": "What animal do you see in the image?"},
],
}
],
]
EXPECTED_TEXT = [
"You are a helpful language and vision assistant. You are able to understand the visual content that the user provides, and assist the user with a variety of tasks using natural language.\n\nUser: Describe this image.\n\nAssistant:The image depicts a fluffy, beige-colored animal with a long tail, walking on snow. The", # fmt: skip
"You are a helpful language and vision assistant. You are able to understand the visual content that the user provides, and assist the user with a variety of tasks using natural language.\n\nUser: What animal do you see in the image?\n\nAssistant:I see a large, furry animal that appears to be a type of bear.The ", # fmt: skip
]
inputs = processor.apply_chat_template(
messages, add_generation_prompt=True, tokenize=True, padding=True, return_dict=True, return_tensors="pt"
)
inputs = inputs.to(model.device, dtype=model.dtype)
output = model.generate(**inputs, max_new_tokens=20, do_sample=False)
text = processor.batch_decode(output, skip_special_tokens=True)
self.assertEqual(EXPECTED_TEXT, text)
def test_model_text_generation_with_multi_image(self):
model = DeepseekVLHybridForConditionalGeneration.from_pretrained(
self.model_id, torch_dtype="auto", device_map="auto"
)
model.to(torch_device)
model.eval()
processor = AutoProcessor.from_pretrained(self.model_id)
messages = [
{
"role": "user",
"content": [
{"type": "text", "text": "What's the difference between"},
{"type": "image", "url": "http://images.cocodataset.org/val2017/000000039769.jpg"},
{"type": "text", "text": " and "},
{"type": "image", "url": "https://www.ilankelman.org/stopsigns/australia.jpg"},
],
}
]
EXPECTED_TEXT = "You are a helpful language and vision assistant. You are able to understand the visual content that the user provides, and assist the user with a variety of tasks using natural language.\n\nUser: What's the difference between and \n\nAssistant:The image shows a street scene with a prominent red stop sign in the foreground. The sign has the" # fmt: skip
inputs = processor.apply_chat_template(
messages, add_generation_prompt=True, tokenize=True, return_dict=True, return_tensors="pt"
)
inputs = inputs.to(model.device, dtype=model.dtype)
output = model.generate(**inputs, max_new_tokens=20, do_sample=False)
text = processor.decode(output[0], skip_special_tokens=True)
self.assertEqual(
text,
EXPECTED_TEXT,
)
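
Note on how `DeepseekVLHybridModelTester.prepare_config_and_inputs` builds `input_ids`: random ids are drawn below `vocab_size - 2` and shifted by one, then the leading `num_image_tokens` positions are overwritten with `image_token_id = vocab_size - 1`. The following is a minimal pure-Python sketch of that construction. It assumes the `ids_tensor` helper samples uniformly from `[0, n - 1]`, and `sketch_input_ids` is a hypothetical stand-in that uses lists instead of tensors.

```python
import random


def sketch_input_ids(batch_size, seq_length, vocab_size, num_image_tokens, seed=0):
    """Build rows of token ids following the tester's recipe, using plain lists."""
    rng = random.Random(seed)
    image_token_id = vocab_size - 1
    rows = []
    for _ in range(batch_size):
        # Assumed sampling range [0, vocab_size - 3], shifted by one to [1, vocab_size - 2],
        # so neither 0 nor the image token id is ever drawn at random.
        row = [rng.randint(0, vocab_size - 3) + 1 for _ in range(seq_length)]
        # Overwrite the leading positions with the image placeholder id.
        row[:num_image_tokens] = [image_token_id] * num_image_tokens
        rows.append(row)
    return rows, image_token_id


# Tester defaults: batch_size=2, seq_length=25, vocab_size=99, image_size // patch_size = 32 // 8 = 4.
rows, image_token_id = sketch_input_ids(batch_size=2, seq_length=25, vocab_size=99, num_image_tokens=4)
```

Under that assumption the image token id can only appear in the positions that were filled explicitly.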

@@ -0,0 +1,54 @@
# coding=utf-8
# Copyright 2025 HuggingFace Inc.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

import tempfile
import unittest

from transformers import DeepseekVLHybridProcessor, LlamaTokenizer
from transformers.models.deepseek_vl.convert_deepseek_vl_weights_to_hf import CHAT_TEMPLATE
from transformers.testing_utils import get_tests_dir
from transformers.utils import is_vision_available

from ...test_processing_common import ProcessorTesterMixin


if is_vision_available():
    from transformers import DeepseekVLHybridImageProcessor


SAMPLE_VOCAB = get_tests_dir("fixtures/test_sentencepiece.model")


class DeepseekVLHybridProcessorTest(ProcessorTesterMixin, unittest.TestCase):
    processor_class = DeepseekVLHybridProcessor

    def setUp(self):
        self.tmpdirname = tempfile.mkdtemp()
        image_processor = DeepseekVLHybridImageProcessor()
        tokenizer = LlamaTokenizer(
            vocab_file=SAMPLE_VOCAB,
            extra_special_tokens={
                "pad_token": "<end▁of▁sentence>",
                "image_token": "<image_placeholder>",
            },
        )
        processor = self.processor_class(
            image_processor=image_processor,
            tokenizer=tokenizer,
            chat_template=CHAT_TEMPLATE,
        )
        processor.save_pretrained(self.tmpdirname)

    def prepare_processor_dict(self):
        return {"chat_template": CHAT_TEMPLATE, "num_image_tokens": 576}