Fix docs typo (#40167)
* DINOv3 model * working version * linter revert * linter revert * linter revert * fix init * remove flex and add convert to hf script * DINOv3 convnext * working version of convnext * adding to auto * Dinov3 -> DINOv3 * PR feedback * complete convert checkpoint * fix assertion * bf16 -> fp32 * add fast image processor * fixup * change conversion script * Use Pixtral attention * minor renaming * simplify intermediates capturing * refactor DINOv3ViTPatchEmbeddings * Refactor DINOv3ViTEmbeddings * [WIP] rope: remove unused params * [WIP] rope: rename period -> inv_freq for consistency * [WIP] rope: move augs * change inv_freq init (not persistent anymore) * [WIP] rope: move coords to init * rope - done! * use default LayerScale * conversion: truncate expected outputs * remove commented code * Refactor MLP layers * nit * clean up config params * nit docs * simplify embeddings * simplify compile compat lru_cache * fixup * dynamic patch coords * move augmentation * Fix docs * fixup and type hints * fix output capturing * fix tests * fixup * fix auto mappings * Add draft docs * fix dtype cast issue * add push to hub * add image processor tests * fixup * add modular * update modular * convert and test convnext * update conversion script * update prefix * Update LayerNorm * refactor DINOv3ConvNextLayer * rename * refactor convnext model * fix doc check * fix docs * fix convnext config * tmp fix for check docstring * remove unused arg * fix tests * (nit) change init * standardize gated MLP * clear namings and sat493m * fix tensors on different devices * revert linter * pr * pr feedbak ruff format * missing headers * fix code snippet and collection link in docs * DINOv3 description * fix checkpoints in tests * not doc fixes in configs * output_hidden_states * x -> features * remove sequential --------- Co-authored-by: Cijo Jose <cijose@meta.com>
This commit is contained in:
committed by
GitHub
parent
41980ce93e
commit
6f259bc83e
@@ -763,6 +763,8 @@
|
||||
title: DINOV2
|
||||
- local: model_doc/dinov2_with_registers
|
||||
title: DINOv2 with Registers
|
||||
- local: model_doc/dinov3
|
||||
title: DINOv3
|
||||
- local: model_doc/dit
|
||||
title: DiT
|
||||
- local: model_doc/dpt
|
||||
|
||||
181
docs/source/en/model_doc/dinov3.md
Normal file
181
docs/source/en/model_doc/dinov3.md
Normal file
@@ -0,0 +1,181 @@
|
||||
<!--Copyright 2025 The HuggingFace Team. All rights reserved.
|
||||
|
||||
Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with
|
||||
the License. You may obtain a copy of the License at
|
||||
|
||||
http://www.apache.org/licenses/LICENSE-2.0
|
||||
|
||||
Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on
|
||||
an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the
|
||||
specific language governing permissions and limitations under the License.
|
||||
-->
|
||||
|
||||
<div style="float: right;">
|
||||
<div class="flex flex-wrap space-x-1">
|
||||
<img alt="PyTorch" src="https://img.shields.io/badge/PyTorch-DE3412?style=flat&logo=pytorch&logoColor=white">
|
||||
<img alt="Flax" src="https://img.shields.io/badge/Flax-29a79b.svg?style=flat&logo=data:image/png;base64,iVBORw0KGgoAAAANSUhEUgAAAC0AAAAtCAMAAAANxBKoAAAC7lBMVEUAAADg5vYHPVgAoJH+/v76+v39/f9JbLP///9+AIgAnY3///+mcqzt8fXy9fgkXa3Ax9709fr+///9/f8qXq49qp5AaLGMwrv8/P0eW60VWawxYq8yqJzG2dytt9Wyu9elzci519Lf3O3S2efY3OrY0+Xp7PT///////+dqNCexMc6Z7AGpJeGvbenstPZ5ejQ1OfJzOLa7ejh4+/r8fT29vpccbklWK8PVa0AS6ghW63O498vYa+lsdKz1NDRt9Kw1c672tbD3tnAxt7R6OHp5vDe7OrDyuDn6vLl6/EAQKak0MgATakkppo3ZK/Bz9y8w9yzu9jey97axdvHzeG21NHH4trTwthKZrVGZLSUSpuPQJiGAI+GAI8SWKydycLL4d7f2OTi1+S9xNzL0ePT6OLGzeEAo5U0qJw/aLEAo5JFa7JBabEAp5Y4qZ2QxLyKmsm3kL2xoMOehrRNb7RIbbOZgrGre68AUqwAqZqNN5aKJ5N/lMq+qsd8kMa4pcWzh7muhLMEV69juq2kbKqgUaOTR5uMMZWLLZSGAI5VAIdEAH+ovNDHuNCnxcy3qcaYx8K8msGplrx+wLahjbYdXrV6vbMvYK9DrZ8QrZ8tqJuFms+Sos6sw8ecy8RffsNVeMCvmb43aLltv7Q4Y7EZWK4QWa1gt6meZKUdr6GOAZVeA4xPAISyveLUwtivxtKTpNJ2jcqfvcltiMiwwcfAoMVxhL+Kx7xjdrqTe60tsaNQs6KaRKACrJ6UTZwkqpqTL5pkHY4AloSgsd2ptNXPvNOOncuxxsqFl8lmg8apt8FJcr9EbryGxLqlkrkrY7dRa7ZGZLQ5t6iXUZ6PPpgVpZeJCJFKAIGareTa0+KJod3H0deY2M+esM25usmYu8d2zsJOdcBVvrCLbqcAOaaHaKQAMaScWqKBXqCXMJ2RHpiLF5NmJZAdAHN2kta11dKu1M+DkcZLdb+Mcql3TppyRJdzQ5ZtNZNlIY+DF4+voCOQAAAAZ3RSTlMABAT+MEEJ/RH+/TP+Zlv+pUo6Ifz8+fco/fz6+evr39S9nJmOilQaF/7+/f38+smmoYp6b1T+/v7++vj189zU0tDJxsGzsrKSfv34+Pf27dDOysG9t6+n/vv6+vr59uzr1tG+tZ6Qg9Ym3QAABR5JREFUSMeNlVVUG1EQhpcuxEspXqS0SKEtxQp1d3d332STTRpIQhIISQgJhODu7lAoDoUCpe7u7u7+1puGpqnCPOyZvffbOXPm/PsP9JfQgyCC+tmTABTOcbxDz/heENS7/1F+9nhvkHePG0wNDLbGWwdXL+rbLWvpmZHXD8+gMfBjTh+aSe6Gnn7lwQIOTR0c8wfX3PWgv7avbdKwf/ZoBp1Gp/PvuvXW3vw5ib7emnTW4OR+3D4jB9vjNJ/7gNvfWWeH/TO/JyYrsiKCRjVEZA3UB+96kON+DxOQ/NLE8PE5iUYgIXjFnCOlxEQMaSGVxjg4gxOnEycGz8bptuNjVx08LscIgrzH3umcn+KKtiBIyvzOO2O99aAdR8cF19oZalnCtvREUw79tCd5sow1g1UKM6kXqUx4T8wsi3sTjJ3yzDmmhenLXLpo8u45eG5y4Vvbk6kkC4LLtJMowkSQxmk4ggVJEG+7c6QpHT8vvW9X7/o7+3ELmiJi2mEzZJiz8cT6TBlanBk70cB5GGIGC1gRDdZ00yADLW1FL6gqhtvNXNG5S9gdSrk4M1qu7JAsmYshzDS4peoMrU/gT7qQdqYGZaYhxZmVbGJAm/CS/HloWyhRUlknQ9KYcExTwS80d3VNOxUZJpITYyspl0LbhArhpZCD9cRWEQuhYkNGMHToQ/2Cs6swJlb39CsllxdXX6IUKh/H5jbnSsPKjgmoaFQ1f8wRLR0UnGE/RcDEjj2jXG1WVTwUs8+zxfcrVO+vSsuOpVKxCfYZiQ0/aPKuxQbQ8lIz+DClxC8u+snlcJ7Yr1z1JPqUH0V+GDXbOwAib931Y4Imaq0NTIXPXY+N5L18GJ37SVWu+hwXff8l72Ds9XuwYIBaXPq6Shm4l+Vl/5QiOlV+uTk6YR9PxKsI9xNJny31ygK1e+nIRC1N97EGkFPI+jCpiHe5PCEy7oWqWSwRrpOvhFzcbTWMbm3ZJAOn1rUKpYIt/lDhW/5RHHteeWFN60qo98YJuoq1nK3uW5AabyspC1BcIEpOhft+SZAShYoLSvnmSfnYADUERP5jJn2h5XtsgCRuhYQqAvwTwn33+YWEKUI72HX5AtfSAZDe8F2DtPPm77afhl0EkthzuCQU0BWApgQIH9+KB0JhopMM7bJrdTRoleM2JAVNMyPF+wdoaz+XJpGoVAQ7WXUkcV7gT3oUZyi/ISIJAVKhgNp+4b4veCFhYVJw4locdSjZCp9cPUhLF9EZ3KKzURepMEtCDPP3VcWFx4UIiZIklIpFNfHpdEafIF2aRmOcrUmjohbT2WUllbmRvgfbythbQO3222fpDJoufaQPncYYuqoGtUEsCJZL6/3PR5b4syeSjZMQG/T2maGANlXT2v8S4AULWaUkCxfLyW8iW4kdka+nEMjxpL2NCwsYNBp+Q61PF43zyDg9Bm9+3NNySn78jMZUUkumqE4Gp7JmFOdP1vc8PpRrzj9+wPinCy8K1PiJ4aYbnTYpCCbDkBSbzhu2QJ1Gd82t8jI8TH51+OzvXoWbnXUOBkNW+0mWFwGcGOUVpU81/n3TOHb5oMt2FgYGjzau0Nif0Ss7Q3XB33hjjQHjHA5E5aOyIQc8CBrLdQSs3j92VG+3nNEjbkbdbBr9zm04ruvw37vh0QKOdeGIkckc80fX3KH/h7PT4BOjgCty8VZ5ux1MoO5Cf5naca2LAsEgehI+drX8o/0Nu+W0m6K/I9gGPd/dfx/EN/wN62AhsBWuAAAAAElFTkSuQmCC">
|
||||
<img alt="FlashAttention" src="https://img.shields.io/badge/%E2%9A%A1%EF%B8%8E%20FlashAttention-eae0c8?style=flat">
|
||||
<img alt="SDPA" src="https://img.shields.io/badge/SDPA-DE3412?style=flat&logo=pytorch&logoColor=white">
|
||||
</div>
|
||||
</div>
|
||||
|
||||
|
||||
# DINOv3
|
||||
|
||||
DINOv3 is a family of versatile vision foundation models that outperforms the specialized state of the art across a broad range of settings, without fine-tuning. DINOv3 produces high-quality dense features that achieve outstanding performance on various vision tasks, significantly surpassing previous self- and weakly-supervised foundation models.
|
||||
|
||||
You can find all the original DINOv3 checkpoints under the [DINOv3](https://huggingface.co/collections/facebook/dinov3-68924841bd6b561778e31009) collection.
|
||||
|
||||
> [!TIP]
|
||||
> Click on the DINOv3 models in the right sidebar for more examples of how to apply DINOv3 to different vision tasks.
|
||||
|
||||
The example below demonstrates how to obtain an image embedding with [`Pipeline`] or the [`AutoModel`] class.
|
||||
|
||||
<hfoptions id="usage">
|
||||
<hfoption id="Pipeline">
|
||||
|
||||
```py
|
||||
import torch
|
||||
from transformers import pipeline
|
||||
|
||||
pipe = pipeline(
|
||||
task="image-feature-extraction",
|
||||
model="facebook/dinov3-vits16-pretrain-lvd1689m",
|
||||
torch_dtype=torch.bfloat16,
|
||||
)
|
||||
|
||||
pipe("https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/pipeline-cat-chonk.jpeg")
|
||||
```
|
||||
|
||||
</hfoption>
|
||||
<hfoption id="AutoModel">
|
||||
|
||||
```py
|
||||
import torch
|
||||
from transformers import AutoImageProcessor, AutoModel
|
||||
from transformers.image_utils import load_image
|
||||
|
||||
url = "http://images.cocodataset.org/val2017/000000039769.jpg"
|
||||
image = load_image(url)
|
||||
|
||||
processor = AutoImageProcessor.from_pretrained("facebook/dinov3-vits16-pretrain-lvd1689m")
|
||||
model = AutoModel.from_pretrained(
|
||||
"facebook/dinov3-vits16-pretrain-lvd1689m",
|
||||
torch_dtype=torch.float16,
|
||||
device_map="auto",
|
||||
attn_implementation="sdpa"
|
||||
)
|
||||
|
||||
inputs = processor(images=image, return_tensors="pt").to(model.device)
|
||||
with torch.inference_mode():
|
||||
outputs = model(**inputs)
|
||||
|
||||
pooled_output = outputs.pooler_output
|
||||
print("Pooled output shape:", pooled_output.shape)
|
||||
```
|
||||
|
||||
</hfoption>
|
||||
</hfoptions>
|
||||
|
||||
Quantization reduces the memory burden of large models by representing the weights in a lower precision. Refer to the [Quantization](../quantization/overview) overview for more available quantization backends.
|
||||
|
||||
The example below uses [torchao](../quantization/torchao) to only quantize the weights to int4.
|
||||
|
||||
```py
|
||||
# pip install torchao
|
||||
import torch
|
||||
from transformers import TorchAoConfig, AutoImageProcessor, AutoModel
|
||||
from torchao.quantization import Int4WeightOnlyConfig
|
||||
from transformers.image_utils import load_image
|
||||
|
||||
|
||||
url = "http://images.cocodataset.org/val2017/000000039769.jpg"
|
||||
image = load_image(url)
|
||||
|
||||
processor = AutoImageProcessor.from_pretrained("facebook/dinov3-vitsplus-pretrain-lvd1689m")
|
||||
|
||||
quant_type = Int4WeightOnlyConfig(group_size=128)
|
||||
quantization_config = TorchAoConfig(quant_type=quant_type)
|
||||
|
||||
model = AutoModel.from_pretrained(
|
||||
"facebook/dinov3-vit7b16-pretrain-lvd1689m",
|
||||
torch_dtype=torch.bfloat16,
|
||||
device_map="auto",
|
||||
quantization_config=quantization_config
|
||||
)
|
||||
|
||||
inputs = processor(images=image, return_tensors="pt").to(model.device)
|
||||
with torch.inference_mode():
|
||||
outputs = model(**inputs)
|
||||
|
||||
pooled_output = outputs.pooler_output
|
||||
print("Pooled output shape:", pooled_output.shape)
|
||||
```
|
||||
|
||||
## Notes
|
||||
|
||||
- The example below shows how to split the output tensor into:
|
||||
- one embedding for the whole image, commonly referred to as a `CLS` token,
|
||||
useful for classification and retrieval
|
||||
- register tokens - learnable embeddings that act as dedicated “memory slots” for global information,
|
||||
they reduce high-norm artifacts in patch tokens, yielding cleaner attention maps and better
|
||||
performance on dense prediction tasks.
|
||||
- a set of local embeddings, one for each `16x16` patch of the input image,
|
||||
useful for dense tasks, such as semantic segmentation
|
||||
|
||||
```py
|
||||
import torch
|
||||
from transformers import AutoImageProcessor, AutoModel
|
||||
from transformers.image_utils import load_image
|
||||
|
||||
url = "http://images.cocodataset.org/val2017/000000039769.jpg"
|
||||
image = load_image(url)
|
||||
print("Image size:", image.height, image.width) # [480, 640]
|
||||
|
||||
processor = AutoImageProcessor.from_pretrained("facebook/dinov3-vits16-pretrain-lvd1689m")
|
||||
model = AutoModel.from_pretrained("facebook/dinov3-vits16-pretrain-lvd1689m")
|
||||
patch_size = model.config.patch_size
|
||||
print("Patch size:", patch_size) # 16
|
||||
print("Num register tokens:", model.config.num_register_tokens) # 4
|
||||
|
||||
inputs = processor(images=image, return_tensors="pt")
|
||||
print("Preprocessed image size:", inputs.pixel_values.shape) # [1, 3, 224, 224]
|
||||
|
||||
batch_size, _, img_height, img_width = inputs.pixel_values.shape
|
||||
num_patches_height, num_patches_width = img_height // patch_size, img_width // patch_size
|
||||
num_patches_flat = num_patches_height * num_patches_width
|
||||
|
||||
with torch.inference_mode():
|
||||
outputs = model(**inputs)
|
||||
|
||||
last_hidden_states = outputs.last_hidden_state
|
||||
print(last_hidden_states.shape) # [1, 1 + 4 + 256, 384]
|
||||
assert last_hidden_states.shape == (batch_size, 1 + model.config.num_register_tokens + num_patches_flat, model.config.hidden_size)
|
||||
|
||||
cls_token = last_hidden_states[:, 0, :]
|
||||
patch_features_flat = last_hidden_states[:, 1 + model.config.num_register_tokens:, :]
|
||||
patch_features = patch_features_flat.unflatten(1, (num_patches_height, num_patches_width))
|
||||
```
|
||||
|
||||
## DINOv3ViTConfig
|
||||
|
||||
[[autodoc]] DINOv3ViTConfig
|
||||
|
||||
## DINOv3ConvNextConfig
|
||||
|
||||
[[autodoc]] DINOv3ConvNextConfig
|
||||
|
||||
## DINOv3ViTModel
|
||||
|
||||
[[autodoc]] DINOv3ViTModel
|
||||
- forward
|
||||
|
||||
## DINOv3ConvNextModel
|
||||
|
||||
[[autodoc]] DINOv3ConvNextModel
|
||||
- forward
|
||||
|
||||
## DINOv3ViTImageProcessorFast
|
||||
|
||||
[[autodoc]] DINOv3ViTImageProcessorFast
|
||||
- preprocess
|
||||
@@ -99,6 +99,8 @@ if TYPE_CHECKING:
|
||||
from .dinat import *
|
||||
from .dinov2 import *
|
||||
from .dinov2_with_registers import *
|
||||
from .dinov3_convnext import *
|
||||
from .dinov3_vit import *
|
||||
from .distilbert import *
|
||||
from .dit import *
|
||||
from .donut import *
|
||||
|
||||
@@ -117,6 +117,8 @@ CONFIG_MAPPING_NAMES = OrderedDict[str, str](
|
||||
("dinat", "DinatConfig"),
|
||||
("dinov2", "Dinov2Config"),
|
||||
("dinov2_with_registers", "Dinov2WithRegistersConfig"),
|
||||
("dinov3_convnext", "DINOv3ConvNextConfig"),
|
||||
("dinov3_vit", "DINOv3ViTConfig"),
|
||||
("distilbert", "DistilBertConfig"),
|
||||
("doge", "DogeConfig"),
|
||||
("donut-swin", "DonutSwinConfig"),
|
||||
@@ -525,6 +527,8 @@ MODEL_NAMES_MAPPING = OrderedDict[str, str](
|
||||
("dinat", "DiNAT"),
|
||||
("dinov2", "DINOv2"),
|
||||
("dinov2_with_registers", "DINOv2 with Registers"),
|
||||
("dinov3_convnext", "DINOv3 ConvNext"),
|
||||
("dinov3_vit", "DINOv3 ViT"),
|
||||
("distilbert", "DistilBERT"),
|
||||
("dit", "DiT"),
|
||||
("doge", "Doge"),
|
||||
|
||||
@@ -88,6 +88,7 @@ else:
|
||||
("detr", ("DetrImageProcessor", "DetrImageProcessorFast")),
|
||||
("dinat", ("ViTImageProcessor", "ViTImageProcessorFast")),
|
||||
("dinov2", ("BitImageProcessor", "BitImageProcessorFast")),
|
||||
("dinov3_vit", (None, "DINOv3ViTImageProcessorFast")),
|
||||
("donut-swin", ("DonutImageProcessor", "DonutImageProcessorFast")),
|
||||
("dpt", ("DPTImageProcessor", "DPTImageProcessorFast")),
|
||||
("efficientformer", ("EfficientFormerImageProcessor", None)),
|
||||
|
||||
@@ -121,6 +121,8 @@ MODEL_MAPPING_NAMES = OrderedDict(
|
||||
("dinat", "DinatModel"),
|
||||
("dinov2", "Dinov2Model"),
|
||||
("dinov2_with_registers", "Dinov2WithRegistersModel"),
|
||||
("dinov3_convnext", "DINOv3ConvNextModel"),
|
||||
("dinov3_vit", "DINOv3ViTModel"),
|
||||
("distilbert", "DistilBertModel"),
|
||||
("doge", "DogeModel"),
|
||||
("donut-swin", "DonutSwinModel"),
|
||||
@@ -746,6 +748,8 @@ MODEL_FOR_IMAGE_MAPPING_NAMES = OrderedDict(
|
||||
("dinat", "DinatModel"),
|
||||
("dinov2", "Dinov2Model"),
|
||||
("dinov2_with_registers", "Dinov2WithRegistersModel"),
|
||||
("dinov3_convnext", "DINOv3ConvNextModel"),
|
||||
("dinov3_vit", "DINOv3ViTModel"),
|
||||
("dpt", "DPTModel"),
|
||||
("efficientformer", "EfficientFormerModel"),
|
||||
("efficientnet", "EfficientNetModel"),
|
||||
|
||||
27
src/transformers/models/dinov3_convnext/__init__.py
Normal file
27
src/transformers/models/dinov3_convnext/__init__.py
Normal file
@@ -0,0 +1,27 @@
|
||||
# Copyright 2025 The HuggingFace Team. All rights reserved.
|
||||
#
|
||||
# Licensed under the Apache License, Version 2.0 (the "License");
|
||||
# you may not use this file except in compliance with the License.
|
||||
# You may obtain a copy of the License at
|
||||
#
|
||||
# http://www.apache.org/licenses/LICENSE-2.0
|
||||
#
|
||||
# Unless required by applicable law or agreed to in writing, software
|
||||
# distributed under the License is distributed on an "AS IS" BASIS,
|
||||
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
|
||||
# See the License for the specific language governing permissions and
|
||||
# limitations under the License.
|
||||
from typing import TYPE_CHECKING
|
||||
|
||||
from ...utils import _LazyModule
|
||||
from ...utils.import_utils import define_import_structure
|
||||
|
||||
|
||||
if TYPE_CHECKING:
|
||||
from .configuration_dinov3_convnext import *
|
||||
from .modeling_dinov3_convnext import *
|
||||
else:
|
||||
import sys
|
||||
|
||||
_file = globals()["__file__"]
|
||||
sys.modules[__name__] = _LazyModule(__name__, _file, define_import_structure(_file), module_spec=__spec__)
|
||||
@@ -0,0 +1,103 @@
|
||||
# coding=utf-8
|
||||
# Copyright 2025 Meta Platforms, Inc. and The HuggingFace Inc. team. All rights reserved.
|
||||
#
|
||||
# Licensed under the Apache License, Version 2.0 (the "License");
|
||||
# you may not use this file except in compliance with the License.
|
||||
# You may obtain a copy of the License at
|
||||
#
|
||||
# http://www.apache.org/licenses/LICENSE-2.0
|
||||
#
|
||||
# Unless required by applicable law or agreed to in writing, software
|
||||
# distributed under the License is distributed on an "AS IS" BASIS,
|
||||
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
|
||||
# See the License for the specific language governing permissions and
|
||||
# limitations under the License.
|
||||
"""ConvNeXT model configuration"""
|
||||
|
||||
from typing import Optional
|
||||
|
||||
from ...configuration_utils import PretrainedConfig
|
||||
from ...utils import logging
|
||||
|
||||
|
||||
logger = logging.get_logger(__name__)
|
||||
|
||||
|
||||
class DINOv3ConvNextConfig(PretrainedConfig):
|
||||
r"""
|
||||
This is the configuration class to store the configuration of a [`DINOv3ConvNextModel`]. It is used to instantiate an
|
||||
DINOv3ConvNext model according to the specified arguments, defining the model architecture. Instantiating a configuration
|
||||
with the defaults will yield a similar configuration to that of the DINOv3ConvNext
|
||||
[facebook/dinov3-convnext-tiny-pretrain-lvd1689m](https://huggingface.co/facebook/dinov3-convnext-tiny-pretrain-lvd1689m) architecture.
|
||||
|
||||
Configuration objects inherit from [`PretrainedConfig`] and can be used to control the model outputs. Read the
|
||||
documentation from [`PretrainedConfig`] for more information.
|
||||
|
||||
Args:
|
||||
num_channels (`int`, *optional*, defaults to 3):
|
||||
The number of input channels.
|
||||
hidden_sizes (`list[int]`, *optional*, defaults to [96, 192, 384, 768]):
|
||||
Dimensionality (hidden size) at each stage.
|
||||
depths (`list[int]`, *optional*, defaults to [3, 3, 9, 3]):
|
||||
The number of layers for each stage.
|
||||
hidden_act (`str` or `function`, *optional*, defaults to `"gelu"`):
|
||||
The non-linear activation function (function or string) in each block. If string, `"gelu"`, `"relu"`,
|
||||
`"selu"` and `"gelu_new"` are supported.
|
||||
initializer_range (`float`, *optional*, defaults to 0.02):
|
||||
The standard deviation of the truncated_normal_initializer for initializing all weight matrices.
|
||||
layer_norm_eps (`float`, *optional*, defaults to 1e-06):
|
||||
The epsilon used by the layer normalization layers.
|
||||
layer_scale_init_value (`float`, *optional*, defaults to 1e-06):
|
||||
The initial value for the layer scale.
|
||||
drop_path_rate (`float`, *optional*, defaults to 0.0):
|
||||
The drop rate for stochastic depth.
|
||||
image_size (`int`, *optional*, defaults to 224):
|
||||
The size (resolution) of input images.
|
||||
|
||||
Example:
|
||||
```python
|
||||
>>> from transformers import DINOv3ConvNextConfig, DINOv3ConvNextModel
|
||||
|
||||
>>> # Initializing a DINOv3ConvNext (tiny variant) style configuration
|
||||
>>> config = DINOv3ConvNextConfig()
|
||||
|
||||
>>> # Initializing a model (with random weights)
|
||||
>>> model = DINOv3ConvNextModel(config)
|
||||
|
||||
>>> # Accessing the model config
|
||||
>>> config = model.config
|
||||
```"""
|
||||
|
||||
model_type = "dinov3_convnext"
|
||||
|
||||
def __init__(
|
||||
self,
|
||||
num_channels: int = 3,
|
||||
hidden_sizes: Optional[list[int]] = None,
|
||||
depths: Optional[list[int]] = None,
|
||||
hidden_act: str = "gelu",
|
||||
initializer_range: float = 0.02,
|
||||
layer_norm_eps: float = 1e-6,
|
||||
layer_scale_init_value: float = 1e-6,
|
||||
drop_path_rate: float = 0.0,
|
||||
image_size: int = 224,
|
||||
**kwargs,
|
||||
):
|
||||
super().__init__(**kwargs)
|
||||
|
||||
self.num_channels = num_channels
|
||||
self.hidden_sizes = [96, 192, 384, 768] if hidden_sizes is None else hidden_sizes
|
||||
self.depths = [3, 3, 9, 3] if depths is None else depths
|
||||
self.hidden_act = hidden_act
|
||||
self.initializer_range = initializer_range
|
||||
self.layer_norm_eps = layer_norm_eps
|
||||
self.layer_scale_init_value = layer_scale_init_value
|
||||
self.drop_path_rate = drop_path_rate
|
||||
self.image_size = image_size
|
||||
|
||||
@property
|
||||
def num_stages(self) -> int:
|
||||
return len(self.hidden_sizes)
|
||||
|
||||
|
||||
__all__ = ["DINOv3ConvNextConfig"]
|
||||
@@ -0,0 +1,234 @@
|
||||
# coding=utf-8
|
||||
# Copyright 2025 The HuggingFace Team. All rights reserved.
|
||||
#
|
||||
# Licensed under the Apache License, Version 2.0 (the "License");
|
||||
# you may not use this file except in compliance with the License.
|
||||
# You may obtain a copy of the License at
|
||||
#
|
||||
# http://www.apache.org/licenses/LICENSE-2.0
|
||||
#
|
||||
# Unless required by applicable law or agreed to in writing, software
|
||||
# distributed under the License is distributed on an "AS IS" BASIS,
|
||||
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
|
||||
# See the License for the specific language governing permissions and
|
||||
# limitations under the License.
|
||||
"""Convert DINOv3 checkpoints from the original repository.
|
||||
|
||||
URL: https://github.com/facebookresearch/dinov3/tree/main
|
||||
"""
|
||||
|
||||
import argparse
|
||||
import os
|
||||
import re
|
||||
from typing import Optional
|
||||
|
||||
import requests
|
||||
import torch
|
||||
from huggingface_hub import HfApi, hf_hub_download
|
||||
from PIL import Image
|
||||
from torchvision import transforms
|
||||
|
||||
from transformers import DINOv3ConvNextConfig, DINOv3ConvNextModel, DINOv3ViTImageProcessorFast
|
||||
|
||||
|
||||
HUB_MODELS = {
|
||||
"convnext_tiny": "facebook/dinov3-convnext-tiny-pretrain-lvd1689m",
|
||||
"convnext_small": "facebook/dinov3-convnext-small-pretrain-lvd1689m",
|
||||
"convnext_base": "facebook/dinov3-convnext-base-pretrain-lvd1689m",
|
||||
"convnext_large": "facebook/dinov3-convnext-large-pretrain-lvd1689m",
|
||||
}
|
||||
|
||||
HUB_CHECKPOINTS = {
|
||||
"convnext_tiny": "dinov3_convnext_tiny_pretrain_lvd1689m-21b726bb.pth",
|
||||
"convnext_small": "dinov3_convnext_small_pretrain_lvd1689m-296db49d.pth",
|
||||
"convnext_base": "dinov3_convnext_base_pretrain_lvd1689m-801f2ba9.pth",
|
||||
"convnext_large": "dinov3_convnext_large_pretrain_lvd1689m-61fa432d.pth",
|
||||
}
|
||||
|
||||
# fmt: off
|
||||
ORIGINAL_TO_CONVERTED_KEY_MAPPING = {
|
||||
r"dwconv": r"depthwise_conv",
|
||||
r"pwconv": r"pointwise_conv",
|
||||
r"norm": r"layer_norm",
|
||||
r"stages.(\d+).(\d+)": r"stages.\1.layers.\2",
|
||||
r"downsample_layers.(\d+).(\d+)": r"stages.\1.downsample_layers.\2",
|
||||
}
|
||||
# fmt: on
|
||||
|
||||
|
||||
def get_dinov3_config(model_name: str) -> DINOv3ConvNextConfig:
|
||||
# size of the architecture
|
||||
if model_name == "convnext_tiny":
|
||||
return DINOv3ConvNextConfig(
|
||||
depths=[3, 3, 9, 3],
|
||||
hidden_sizes=[96, 192, 384, 768],
|
||||
)
|
||||
elif model_name == "convnext_small":
|
||||
return DINOv3ConvNextConfig(
|
||||
depths=[3, 3, 27, 3],
|
||||
hidden_sizes=[96, 192, 384, 768],
|
||||
)
|
||||
elif model_name == "convnext_base":
|
||||
return DINOv3ConvNextConfig(
|
||||
depths=[3, 3, 27, 3],
|
||||
hidden_sizes=[128, 256, 512, 1024],
|
||||
)
|
||||
elif model_name == "convnext_large":
|
||||
return DINOv3ConvNextConfig(
|
||||
depths=[3, 3, 27, 3],
|
||||
hidden_sizes=[192, 384, 768, 1536],
|
||||
)
|
||||
else:
|
||||
raise ValueError("Model not supported")
|
||||
|
||||
|
||||
def prepare_img():
|
||||
url = "http://images.cocodataset.org/val2017/000000039769.jpg"
|
||||
image = Image.open(requests.get(url, stream=True).raw).convert("RGB")
|
||||
return image
|
||||
|
||||
|
||||
def get_transform(resize_size: int = 224):
|
||||
to_tensor = transforms.ToTensor()
|
||||
resize = transforms.Resize((resize_size, resize_size), antialias=True)
|
||||
normalize = transforms.Normalize(
|
||||
mean=(0.485, 0.456, 0.406),
|
||||
std=(0.229, 0.224, 0.225),
|
||||
)
|
||||
return transforms.Compose([to_tensor, resize, normalize])
|
||||
|
||||
|
||||
def get_image_processor(resize_size: int = 224):
|
||||
return DINOv3ViTImageProcessorFast(
|
||||
do_resize=True,
|
||||
size={"height": resize_size, "width": resize_size},
|
||||
resample=2, # BILINEAR
|
||||
)
|
||||
|
||||
|
||||
def convert_old_keys_to_new_keys(state_dict_keys: Optional[dict] = None):
|
||||
"""
|
||||
This function should be applied only once, on the concatenated keys to efficiently rename using
|
||||
the key mappings.
|
||||
"""
|
||||
output_dict = {}
|
||||
if state_dict_keys is not None:
|
||||
old_text = "\n".join(state_dict_keys)
|
||||
new_text = old_text
|
||||
for pattern, replacement in ORIGINAL_TO_CONVERTED_KEY_MAPPING.items():
|
||||
if replacement is None:
|
||||
new_text = re.sub(pattern, "", new_text) # an empty line
|
||||
continue
|
||||
new_text = re.sub(pattern, replacement, new_text)
|
||||
output_dict = dict(zip(old_text.split("\n"), new_text.split("\n")))
|
||||
return output_dict
|
||||
|
||||
|
||||
@torch.no_grad()
|
||||
def convert_and_test_dinov3_checkpoint(args):
|
||||
expected_outputs = {
|
||||
"convnext_tiny_cls": [-6.372119, 1.300791, 2.074303, -0.079975, 0.607205],
|
||||
"convnext_tiny_patch": [0.490530, -3.713466, 1.848513, -1.040319, -1.090818],
|
||||
"convnext_small_cls": [-0.903914, 1.412183, 0.287465, 0.175296, -2.397940],
|
||||
"convnext_small_patch": [-1.081114, 0.637362, 3.748765, 0.170179, 1.445153],
|
||||
"convnext_base_cls": [0.155366, -0.378771, -0.735157, -2.818718, 0.015095],
|
||||
"convnext_base_patch": [3.039118, 0.778155, -1.961322, -1.607147, -2.411941],
|
||||
"convnext_large_cls": [-2.219094, -0.594451, -2.300294, -0.957415, -0.520473],
|
||||
"convnext_large_patch": [-1.477349, -0.217038, -3.128137, 0.418962, 0.334949],
|
||||
}
|
||||
model_name = args.model_name
|
||||
config = get_dinov3_config(model_name)
|
||||
# print(config)
|
||||
|
||||
model = DINOv3ConvNextModel(config).eval()
|
||||
state_dict_path = hf_hub_download(repo_id=HUB_MODELS[model_name], filename=HUB_CHECKPOINTS[model_name])
|
||||
original_state_dict = torch.load(state_dict_path)
|
||||
original_keys = list(original_state_dict.keys())
|
||||
new_keys = convert_old_keys_to_new_keys(original_keys)
|
||||
|
||||
converted_state_dict = {}
|
||||
for key in original_keys:
|
||||
new_key = new_keys[key]
|
||||
weight_tensor = original_state_dict[key]
|
||||
if key == "norms.3.weight" or key == "norms.3.bias":
|
||||
continue
|
||||
converted_state_dict[new_key] = weight_tensor
|
||||
model.load_state_dict(converted_state_dict, strict=True)
|
||||
model = model.eval()
|
||||
|
||||
transform = get_transform()
|
||||
image_processor = get_image_processor()
|
||||
image = prepare_img()
|
||||
|
||||
# check preprocessing
|
||||
original_pixel_values = transform(image).unsqueeze(0) # add batch dimension
|
||||
inputs = image_processor(image, return_tensors="pt")
|
||||
|
||||
torch.testing.assert_close(original_pixel_values, inputs["pixel_values"], atol=1e-6, rtol=1e-6)
|
||||
print("Preprocessing looks ok!")
|
||||
|
||||
with torch.inference_mode(), torch.autocast("cuda", dtype=torch.float):
|
||||
model_output = model(**inputs)
|
||||
|
||||
last_layer_class_token = model_output.pooler_output
|
||||
last_layer_patch_tokens = model_output.last_hidden_state[:, 1:]
|
||||
|
||||
actual_outputs = {}
|
||||
actual_outputs[f"{model_name}_cls"] = last_layer_class_token[0, :5].tolist()
|
||||
actual_outputs[f"{model_name}_patch"] = last_layer_patch_tokens[0, 0, :5].tolist()
|
||||
|
||||
print("Actual: ", [round(x, 6) for x in actual_outputs[f"{model_name}_cls"]])
|
||||
print("Expected:", expected_outputs[f"{model_name}_cls"])
|
||||
|
||||
torch.testing.assert_close(
|
||||
torch.Tensor(actual_outputs[f"{model_name}_cls"]),
|
||||
torch.Tensor(expected_outputs[f"{model_name}_cls"]),
|
||||
atol=1e-3,
|
||||
rtol=1e-3,
|
||||
)
|
||||
print("Actual: ", [round(x, 6) for x in actual_outputs[f"{model_name}_patch"]])
|
||||
print("Expected:", expected_outputs[f"{model_name}_patch"])
|
||||
|
||||
torch.testing.assert_close(
|
||||
torch.Tensor(actual_outputs[f"{model_name}_patch"]),
|
||||
torch.Tensor(expected_outputs[f"{model_name}_patch"]),
|
||||
atol=1e-3,
|
||||
rtol=1e-3,
|
||||
)
|
||||
print("Forward pass looks ok!")
|
||||
|
||||
save_dir = os.path.join(args.save_dir, model_name)
|
||||
os.makedirs(save_dir, exist_ok=True)
|
||||
model.save_pretrained(save_dir)
|
||||
image_processor.save_pretrained(save_dir)
|
||||
print(f"Model saved to {save_dir}")
|
||||
|
||||
if args.push_to_hub:
|
||||
api = HfApi()
|
||||
repo = HUB_MODELS[model_name]
|
||||
api.upload_folder(folder_path=save_dir, repo_id=repo, repo_type="model")
|
||||
|
||||
|
||||
if __name__ == "__main__":
|
||||
parser = argparse.ArgumentParser()
|
||||
# Required parameters
|
||||
parser.add_argument(
|
||||
"--model-name",
|
||||
default="convnext_tiny",
|
||||
type=str,
|
||||
choices=["convnext_tiny", "convnext_small", "convnext_base", "convnext_large"],
|
||||
help="Name of the model you'd like to convert.",
|
||||
)
|
||||
parser.add_argument(
|
||||
"--save-dir",
|
||||
default="converted_models",
|
||||
type=str,
|
||||
help="Directory to save the converted model.",
|
||||
)
|
||||
parser.add_argument(
|
||||
"--push-to-hub",
|
||||
action="store_true",
|
||||
help="Push the converted model to the Hugging Face Hub.",
|
||||
)
|
||||
args = parser.parse_args()
|
||||
convert_and_test_dinov3_checkpoint(args)
|
||||
@@ -0,0 +1,261 @@
|
||||
# coding=utf-8
|
||||
# Copyright 2025 Meta Platforms, Inc. and The HuggingFace Inc. team. All rights reserved.
|
||||
#
|
||||
# Licensed under the Apache License, Version 2.0 (the "License");
|
||||
# you may not use this file except in compliance with the License.
|
||||
# You may obtain a copy of the License at
|
||||
#
|
||||
# http://www.apache.org/licenses/LICENSE-2.0
|
||||
#
|
||||
# Unless required by applicable law or agreed to in writing, software
|
||||
# distributed under the License is distributed on an "AS IS" BASIS,
|
||||
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
|
||||
# See the License for the specific language governing permissions and
|
||||
# limitations under the License.
|
||||
"""PyTorch ConvNext model."""
|
||||
|
||||
from typing import Optional
|
||||
|
||||
import numpy as np
|
||||
import torch
|
||||
import torch.utils.checkpoint
|
||||
from torch import nn
|
||||
|
||||
from ...activations import ACT2FN
|
||||
from ...modeling_outputs import (
|
||||
BaseModelOutputWithPoolingAndNoAttention,
|
||||
)
|
||||
from ...modeling_utils import PreTrainedModel
|
||||
from ...utils import auto_docstring, logging
|
||||
from ...utils.generic import can_return_tuple
|
||||
from .configuration_dinov3_convnext import DINOv3ConvNextConfig
|
||||
|
||||
|
||||
logger = logging.get_logger(__name__)
|
||||
|
||||
|
||||
# Copied from transformers.models.beit.modeling_beit.drop_path
|
||||
def drop_path(input: torch.Tensor, drop_prob: float = 0.0, training: bool = False) -> torch.Tensor:
|
||||
"""
|
||||
Drop paths (Stochastic Depth) per sample (when applied in main path of residual blocks).
|
||||
|
||||
Comment by Ross Wightman: This is the same as the DropConnect impl I created for EfficientNet, etc networks,
|
||||
however, the original name is misleading as 'Drop Connect' is a different form of dropout in a separate paper...
|
||||
See discussion: https://github.com/tensorflow/tpu/issues/494#issuecomment-532968956 ... I've opted for changing the
|
||||
layer and argument names to 'drop path' rather than mix DropConnect as a layer name and use 'survival rate' as the
|
||||
argument.
|
||||
"""
|
||||
if drop_prob == 0.0 or not training:
|
||||
return input
|
||||
keep_prob = 1 - drop_prob
|
||||
shape = (input.shape[0],) + (1,) * (input.ndim - 1) # work with diff dim tensors, not just 2D ConvNets
|
||||
random_tensor = keep_prob + torch.rand(shape, dtype=input.dtype, device=input.device)
|
||||
random_tensor.floor_() # binarize
|
||||
output = input.div(keep_prob) * random_tensor
|
||||
return output
|
||||
|
||||
|
||||
# Copied from transformers.models.convnext.modeling_convnext.ConvNextDropPath with ConvNext->DINOv3ConvNext
|
||||
class DINOv3ConvNextDropPath(nn.Module):
|
||||
"""Drop paths (Stochastic Depth) per sample (when applied in main path of residual blocks)."""
|
||||
|
||||
def __init__(self, drop_prob: Optional[float] = None) -> None:
|
||||
super().__init__()
|
||||
self.drop_prob = drop_prob
|
||||
|
||||
def forward(self, hidden_states: torch.Tensor) -> torch.Tensor:
|
||||
return drop_path(hidden_states, self.drop_prob, self.training)
|
||||
|
||||
def extra_repr(self) -> str:
|
||||
return f"p={self.drop_prob}"
|
||||
|
||||
|
||||
class DINOv3ConvNextLayerNorm(nn.LayerNorm):
|
||||
r"""LayerNorm that supports two data formats: channels_last (default) or channels_first.
|
||||
The ordering of the dimensions in the inputs. channels_last corresponds to inputs with shape (batch_size, height,
|
||||
width, channels) while channels_first corresponds to inputs with shape (batch_size, channels, height, width).
|
||||
"""
|
||||
|
||||
def __init__(self, *args, data_format="channels_last", **kwargs):
|
||||
super().__init__(*args, **kwargs)
|
||||
if data_format not in ["channels_last", "channels_first"]:
|
||||
raise NotImplementedError(f"Unsupported data format: {data_format}")
|
||||
self.data_format = data_format
|
||||
|
||||
def forward(self, features: torch.Tensor) -> torch.Tensor:
|
||||
"""
|
||||
Args:
|
||||
features: Tensor of shape (batch_size, channels, height, width) OR (batch_size, height, width, channels)
|
||||
"""
|
||||
if self.data_format == "channels_first":
|
||||
features = features.permute(0, 2, 3, 1)
|
||||
features = super().forward(features)
|
||||
features = features.permute(0, 3, 1, 2)
|
||||
else:
|
||||
features = super().forward(features)
|
||||
return features
|
||||
|
||||
|
||||
class DINOv3ConvNextLayer(nn.Module):
|
||||
"""This corresponds to the `Block` class in the original implementation.
|
||||
|
||||
There are two equivalent implementations:
|
||||
1) DwConv, LayerNorm (channels_first), Conv, GELU, Conv (all in (N, C, H, W) format)
|
||||
2) DwConv, Permute, LayerNorm (channels_last), Linear, GELU, Linear, Permute
|
||||
|
||||
The authors used (2) as they find it slightly faster in PyTorch.
|
||||
|
||||
Args:
|
||||
config ([`DINOv3ConvNextConfig`]):
|
||||
Model config.
|
||||
channels (`int`):
|
||||
Number of input (and output) channels.
|
||||
drop_path (`float`):
|
||||
Drop path rate. Default: 0.0.
|
||||
"""
|
||||
|
||||
def __init__(self, config: DINOv3ConvNextConfig, channels: int, drop_path: float = 0.0):
|
||||
super().__init__()
|
||||
self.depthwise_conv = nn.Conv2d(channels, channels, kernel_size=7, padding=3, groups=channels)
|
||||
self.layer_norm = DINOv3ConvNextLayerNorm(channels, eps=config.layer_norm_eps)
|
||||
self.pointwise_conv1 = nn.Linear(channels, 4 * channels) # can be seen as a 1x1 conv
|
||||
self.activation_fn = ACT2FN[config.hidden_act]
|
||||
self.pointwise_conv2 = nn.Linear(4 * channels, channels) # can be seen as a 1x1 conv
|
||||
self.gamma = nn.Parameter(torch.full((channels,), config.layer_scale_init_value), requires_grad=True)
|
||||
self.drop_path = DINOv3ConvNextDropPath(drop_path) if drop_path > 0.0 else nn.Identity()
|
||||
|
||||
def forward(self, features: torch.Tensor) -> torch.Tensor:
|
||||
"""
|
||||
Args:
|
||||
features: Tensor of shape (batch_size, channels, height, width)
|
||||
"""
|
||||
residual = features
|
||||
features = self.depthwise_conv(features)
|
||||
features = features.permute(0, 2, 3, 1) # to channels last
|
||||
features = self.layer_norm(features)
|
||||
features = self.pointwise_conv1(features)
|
||||
features = self.activation_fn(features)
|
||||
features = self.pointwise_conv2(features)
|
||||
features = features * self.gamma
|
||||
features = features.permute(0, 3, 1, 2) # back to channels first
|
||||
features = residual + self.drop_path(features)
|
||||
return features
|
||||
|
||||
|
||||
class DINOv3ConvNextStage(nn.Module):
|
||||
""" """
|
||||
|
||||
def __init__(self, config: DINOv3ConvNextConfig, stage_idx: int):
|
||||
super().__init__()
|
||||
|
||||
in_channels = config.hidden_sizes[stage_idx - 1] if stage_idx > 0 else config.num_channels
|
||||
out_channels = config.hidden_sizes[stage_idx]
|
||||
|
||||
if stage_idx == 0:
|
||||
self.downsample_layers = nn.ModuleList(
|
||||
[
|
||||
nn.Conv2d(config.num_channels, out_channels, kernel_size=4, stride=4),
|
||||
DINOv3ConvNextLayerNorm(out_channels, eps=config.layer_norm_eps, data_format="channels_first"),
|
||||
]
|
||||
)
|
||||
else:
|
||||
self.downsample_layers = nn.ModuleList(
|
||||
[
|
||||
DINOv3ConvNextLayerNorm(in_channels, eps=config.layer_norm_eps, data_format="channels_first"),
|
||||
nn.Conv2d(in_channels, out_channels, kernel_size=2, stride=2),
|
||||
]
|
||||
)
|
||||
|
||||
num_stage_layers = config.depths[stage_idx]
|
||||
num_previous_layers = sum(config.depths[:stage_idx])
|
||||
num_total_layers = sum(config.depths)
|
||||
drop_path_rates = np.linspace(0, config.drop_path_rate, num_total_layers).tolist()
|
||||
|
||||
self.layers = nn.ModuleList(
|
||||
[
|
||||
DINOv3ConvNextLayer(config, channels=out_channels, drop_path=drop_path_rates[i])
|
||||
for i in range(num_previous_layers, num_previous_layers + num_stage_layers)
|
||||
]
|
||||
)
|
||||
|
||||
def forward(self, features: torch.Tensor) -> torch.Tensor:
|
||||
"""
|
||||
Args:
|
||||
features: Tensor of shape (batch_size, channels, height, width)
|
||||
"""
|
||||
for layer in self.downsample_layers:
|
||||
features = layer(features)
|
||||
for layer in self.layers:
|
||||
features = layer(features)
|
||||
return features
|
||||
|
||||
|
||||
@auto_docstring
|
||||
class DINOv3ConvNextPreTrainedModel(PreTrainedModel):
|
||||
config: DINOv3ConvNextConfig
|
||||
base_model_prefix = "dinov3_convnext"
|
||||
main_input_name = "pixel_values"
|
||||
_no_split_modules = ["DINOv3ConvNextLayer"]
|
||||
|
||||
def _init_weights(self, module):
|
||||
"""Initialize the weights"""
|
||||
if isinstance(module, (nn.Linear, nn.Conv2d)):
|
||||
# Slightly different from the TF version which uses truncated_normal for initialization
|
||||
# cf https://github.com/pytorch/pytorch/pull/5617
|
||||
module.weight.data.normal_(mean=0.0, std=self.config.initializer_range)
|
||||
if module.bias is not None:
|
||||
module.bias.data.zero_()
|
||||
elif isinstance(module, (nn.LayerNorm, DINOv3ConvNextLayerNorm)):
|
||||
module.bias.data.zero_()
|
||||
module.weight.data.fill_(1.0)
|
||||
elif isinstance(module, DINOv3ConvNextLayer):
|
||||
if module.gamma is not None:
|
||||
module.gamma.data.fill_(self.config.layer_scale_init_value)
|
||||
|
||||
|
||||
@auto_docstring
|
||||
class DINOv3ConvNextModel(DINOv3ConvNextPreTrainedModel):
|
||||
def __init__(self, config: DINOv3ConvNextConfig):
|
||||
super().__init__(config)
|
||||
self.config = config
|
||||
self.stages = nn.ModuleList([DINOv3ConvNextStage(config, stage_idx) for stage_idx in range(config.num_stages)])
|
||||
self.layer_norm = nn.LayerNorm(config.hidden_sizes[-1], eps=config.layer_norm_eps) # final norm layer
|
||||
self.pool = nn.AdaptiveAvgPool2d(1)
|
||||
self.post_init()
|
||||
|
||||
@can_return_tuple
|
||||
@auto_docstring
|
||||
def forward(
|
||||
self, pixel_values: torch.FloatTensor, output_hidden_states: Optional[bool] = None
|
||||
) -> BaseModelOutputWithPoolingAndNoAttention:
|
||||
hidden_states = pixel_values
|
||||
|
||||
output_hidden_states = output_hidden_states or self.config.output_hidden_states
|
||||
all_hidden_states = [hidden_states] if output_hidden_states else []
|
||||
|
||||
for stage in self.stages:
|
||||
hidden_states = stage(hidden_states)
|
||||
|
||||
# store intermediate stage outputs
|
||||
if output_hidden_states:
|
||||
all_hidden_states.append(hidden_states)
|
||||
|
||||
# make global representation, a.k.a [CLS] token
|
||||
pooled_output = self.pool(hidden_states)
|
||||
|
||||
# (batch_size, channels, height, width) -> (batch_size, height * width, channels)
|
||||
pooled_output = pooled_output.flatten(2).transpose(1, 2)
|
||||
hidden_states = hidden_states.flatten(2).transpose(1, 2)
|
||||
|
||||
# concat "cls" and "patch tokens" as (batch_size, 1 + height * width, channels)
|
||||
hidden_states = torch.cat([pooled_output, hidden_states], dim=1)
|
||||
hidden_states = self.layer_norm(hidden_states)
|
||||
|
||||
return BaseModelOutputWithPoolingAndNoAttention(
|
||||
last_hidden_state=hidden_states,
|
||||
pooler_output=hidden_states[:, 0],
|
||||
hidden_states=tuple(all_hidden_states) if output_hidden_states else None,
|
||||
)
|
||||
|
||||
|
||||
__all__ = ["DINOv3ConvNextModel", "DINOv3ConvNextPreTrainedModel"]
|
||||
28
src/transformers/models/dinov3_vit/__init__.py
Normal file
28
src/transformers/models/dinov3_vit/__init__.py
Normal file
@@ -0,0 +1,28 @@
|
||||
# Copyright 2025 The HuggingFace Team. All rights reserved.
|
||||
#
|
||||
# Licensed under the Apache License, Version 2.0 (the "License");
|
||||
# you may not use this file except in compliance with the License.
|
||||
# You may obtain a copy of the License at
|
||||
#
|
||||
# http://www.apache.org/licenses/LICENSE-2.0
|
||||
#
|
||||
# Unless required by applicable law or agreed to in writing, software
|
||||
# distributed under the License is distributed on an "AS IS" BASIS,
|
||||
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
|
||||
# See the License for the specific language governing permissions and
|
||||
# limitations under the License.
|
||||
from typing import TYPE_CHECKING
|
||||
|
||||
from ...utils import _LazyModule
|
||||
from ...utils.import_utils import define_import_structure
|
||||
|
||||
|
||||
if TYPE_CHECKING:
|
||||
from .configuration_dinov3_vit import *
|
||||
from .image_processing_dinov3_vit_fast import *
|
||||
from .modeling_dinov3_vit import *
|
||||
else:
|
||||
import sys
|
||||
|
||||
_file = globals()["__file__"]
|
||||
sys.modules[__name__] = _LazyModule(__name__, _file, define_import_structure(_file), module_spec=__spec__)
|
||||
166
src/transformers/models/dinov3_vit/configuration_dinov3_vit.py
Normal file
166
src/transformers/models/dinov3_vit/configuration_dinov3_vit.py
Normal file
@@ -0,0 +1,166 @@
|
||||
# coding=utf-8
|
||||
# Copyright 2025 The HuggingFace Inc. team. All rights reserved.
|
||||
#
|
||||
# Licensed under the Apache License, Version 2.0 (the "License");
|
||||
# you may not use this file except in compliance with the License.
|
||||
# You may obtain a copy of the License at
|
||||
#
|
||||
# http://www.apache.org/licenses/LICENSE-2.0
|
||||
#
|
||||
# Unless required by applicable law or agreed to in writing, software
|
||||
# distributed under the License is distributed on an "AS IS" BASIS,
|
||||
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
|
||||
# See the License for the specific language governing permissions and
|
||||
# limitations under the License.
|
||||
"""DINOv3 model configuration"""
|
||||
|
||||
from typing import Optional
|
||||
|
||||
from ...configuration_utils import PretrainedConfig
|
||||
from ...utils import logging
|
||||
|
||||
|
||||
logger = logging.get_logger(__name__)
|
||||
|
||||
|
||||
class DINOv3ViTConfig(PretrainedConfig):
|
||||
r"""
|
||||
This is the configuration class to store the configuration of a [`DINOv3Model`]. It is used to instantiate an
|
||||
DINOv3 model according to the specified arguments, defining the model architecture. Instantiating a configuration
|
||||
with the defaults will yield a similar configuration to that of the DINOv3
|
||||
[facebook/dinov3-vits16-pretrain-lvd1689m](https://huggingface.co/facebook/dinov3-vits16-pretrain-lvd1689m) architecture.
|
||||
|
||||
Configuration objects inherit from [`PretrainedConfig`] and can be used to control the model outputs. Read the
|
||||
documentation from [`PretrainedConfig`] for more information.
|
||||
|
||||
Args:
|
||||
patch_size (`int`, *optional*, defaults to 16):
|
||||
The size (resolution) of each patch.
|
||||
hidden_size (`int`, *optional*, defaults to 384):
|
||||
Dimensionality of the encoder layers and the pooler layer.
|
||||
intermediate_size (`int`, *optional*, defaults to 1536):
|
||||
Dimensionality of the "intermediate" (i.e., feed-forward) layer.
|
||||
num_hidden_layers (`int`, *optional*, defaults to 12):
|
||||
Number of hidden layers in the Transformer encoder.
|
||||
num_attention_heads (`int`, *optional*, defaults to 6):
|
||||
Number of attention heads for each attention layer in the Transformer encoder.
|
||||
hidden_act (`str` or `function`, *optional*, defaults to `"gelu"`):
|
||||
The non-linear activation function (function or string) in the encoder and pooler. If string, `"gelu"`,
|
||||
`"relu"`, `"selu"` and `"gelu_new"` are supported.
|
||||
attention_dropout (`float`, *optional*, defaults to 0.0):
|
||||
The dropout ratio for the attention probabilities.
|
||||
initializer_range (`float`, *optional*, defaults to 0.02):
|
||||
The standard deviation of the truncated_normal_initializer for initializing all weight matrices.
|
||||
layer_norm_eps (`float`, *optional*, defaults to 1e-05):
|
||||
The epsilon used by the layer normalization layers.
|
||||
rope_theta (`float`, *optional*, defaults to 100.0):
|
||||
The base period of the RoPE embeddings.
|
||||
image_size (`int`, *optional*, defaults to 224):
|
||||
The size (resolution) of each image.
|
||||
num_channels (`int`, *optional*, defaults to 3):
|
||||
The number of input channels.
|
||||
query_bias (`bool`, *optional*, defaults to `True`):
|
||||
Whether to add a bias to the query projection.
|
||||
key_bias (`bool`, *optional*, defaults to `False`):
|
||||
Whether to add a bias to the key projection.
|
||||
value_bias (`bool`, *optional*, defaults to `True`):
|
||||
Whether to add a bias to the value projection.
|
||||
proj_bias (`bool`, *optional*, defaults to `True`):
|
||||
Whether to add a bias to the output projection.
|
||||
mlp_bias (`bool`, *optional*, defaults to `True`):
|
||||
Whether to add a bias to the MLP layers.
|
||||
layerscale_value (`float`, *optional*, defaults to 1.0):
|
||||
Initial value to use for layer scale.
|
||||
drop_path_rate (`float`, *optional*, defaults to 0.0):
|
||||
Stochastic depth rate per sample (when applied in the main path of residual layers).
|
||||
use_gated_mlp (`bool`, *optional*, defaults to `False`):
|
||||
Whether to use the SwiGLU feedforward neural network.
|
||||
num_register_tokens (`int`, *optional*, defaults to 0):
|
||||
The number of register tokens.
|
||||
pos_embed_shift (`float`, *optional*):
|
||||
Amount to randomly shift position embedding coordinates in [-shift, shift],
|
||||
applied only in training mode if not `None`.
|
||||
pos_embed_jitter (`float`, *optional*):
|
||||
Amount to randomly jitter position embedding coordinates in log-uniform value in [1/jitter, jitter],
|
||||
applied only in training mode if not `None`.
|
||||
pos_embed_rescale (`float`, *optional*, defaults to 2.0):
|
||||
Amount to randomly rescale position embedding coordinates in log-uniform value in [1/rescale, rescale],
|
||||
applied only in training mode if not `None`.
|
||||
|
||||
Example:
|
||||
|
||||
```python
|
||||
>>> from transformers import DINOv3ViTConfig, DINOv3ViTModel
|
||||
|
||||
>>> # Initializing a DINOv3 ViT-small style configuration
|
||||
>>> config = DINOv3ViTConfig()
|
||||
|
||||
>>> # Initializing a model (with random weights) from the config
|
||||
>>> model = DINOv3ViTModel(config)
|
||||
|
||||
>>> # Accessing the model config
|
||||
>>> config = model.config
|
||||
```"""
|
||||
|
||||
model_type = "dinov3_vit"
|
||||
|
||||
def __init__(
|
||||
self,
|
||||
patch_size: int = 16,
|
||||
hidden_size: int = 384,
|
||||
intermediate_size: int = 1536,
|
||||
num_hidden_layers: int = 12,
|
||||
num_attention_heads: int = 6,
|
||||
hidden_act: str = "gelu",
|
||||
attention_dropout: float = 0.0,
|
||||
initializer_range: float = 0.02,
|
||||
layer_norm_eps: float = 1e-5,
|
||||
rope_theta: float = 100.0,
|
||||
image_size: int = 224,
|
||||
num_channels: int = 3,
|
||||
query_bias: bool = True,
|
||||
key_bias: bool = False,
|
||||
value_bias: bool = True,
|
||||
proj_bias: bool = True,
|
||||
mlp_bias: bool = True,
|
||||
layerscale_value: float = 1.0,
|
||||
drop_path_rate: float = 0.0,
|
||||
use_gated_mlp: bool = False,
|
||||
num_register_tokens: int = 0,
|
||||
# train augs
|
||||
pos_embed_shift: Optional[float] = None,
|
||||
pos_embed_jitter: Optional[float] = None,
|
||||
pos_embed_rescale: Optional[float] = 2.0,
|
||||
**kwargs,
|
||||
):
|
||||
super().__init__(**kwargs)
|
||||
|
||||
self.image_size = image_size
|
||||
self.patch_size = patch_size
|
||||
self.num_channels = num_channels
|
||||
self.hidden_size = hidden_size
|
||||
self.intermediate_size = intermediate_size
|
||||
self.num_hidden_layers = num_hidden_layers
|
||||
self.num_attention_heads = num_attention_heads
|
||||
self.hidden_act = hidden_act
|
||||
self.attention_dropout = attention_dropout
|
||||
self.initializer_range = initializer_range
|
||||
self.layer_norm_eps = layer_norm_eps
|
||||
self.layerscale_value = layerscale_value
|
||||
self.drop_path_rate = drop_path_rate
|
||||
self.use_gated_mlp = use_gated_mlp
|
||||
self.rope_theta = rope_theta
|
||||
self.query_bias = query_bias
|
||||
self.key_bias = key_bias
|
||||
self.value_bias = value_bias
|
||||
self.proj_bias = proj_bias
|
||||
self.mlp_bias = mlp_bias
|
||||
self.num_register_tokens = num_register_tokens
|
||||
|
||||
# train augs
|
||||
self.pos_embed_shift = pos_embed_shift
|
||||
self.pos_embed_jitter = pos_embed_jitter
|
||||
self.pos_embed_rescale = pos_embed_rescale
|
||||
|
||||
|
||||
__all__ = ["DINOv3ViTConfig"]
|
||||
337
src/transformers/models/dinov3_vit/convert_dinov3_vit_to_hf.py
Normal file
337
src/transformers/models/dinov3_vit/convert_dinov3_vit_to_hf.py
Normal file
@@ -0,0 +1,337 @@
|
||||
# coding=utf-8
|
||||
# Copyright 2025 The HuggingFace Team. All rights reserved.
|
||||
#
|
||||
# Licensed under the Apache License, Version 2.0 (the "License");
|
||||
# you may not use this file except in compliance with the License.
|
||||
# You may obtain a copy of the License at
|
||||
#
|
||||
# http://www.apache.org/licenses/LICENSE-2.0
|
||||
#
|
||||
# Unless required by applicable law or agreed to in writing, software
|
||||
# distributed under the License is distributed on an "AS IS" BASIS,
|
||||
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
|
||||
# See the License for the specific language governing permissions and
|
||||
# limitations under the License.
|
||||
"""Convert DINOv3 checkpoints from the original repository.
|
||||
|
||||
URL: https://github.com/facebookresearch/dinov3/tree/main
|
||||
"""
|
||||
|
||||
import argparse
|
||||
import os
|
||||
import re
|
||||
from typing import Optional
|
||||
|
||||
import requests
|
||||
import torch
|
||||
from huggingface_hub import HfApi, hf_hub_download
|
||||
from PIL import Image
|
||||
from torchvision import transforms
|
||||
|
||||
from transformers import DINOv3ViTConfig, DINOv3ViTImageProcessorFast, DINOv3ViTModel
|
||||
|
||||
|
||||
HUB_MODELS = {
|
||||
"vits16_lvd1689m": "facebook/dinov3-vits16-pretrain-lvd1689m",
|
||||
"vits16plus_lvd1689m": "facebook/dinov3-vits16plus-pretrain-lvd1689m",
|
||||
"vitb16_lvd1689m": "facebook/dinov3-vitb16-pretrain-lvd1689m",
|
||||
"vitl16_lvd1689m": "facebook/dinov3-vitl16-pretrain-lvd1689m",
|
||||
"vitl16_sat493m": "facebook/dinov3-vitl16-pretrain-sat493m",
|
||||
"vith16plus_lvd1689m": "facebook/dinov3-vith16plus-pretrain-lvd1689m",
|
||||
"vit7b16_lvd1689m": "facebook/dinov3-vit7b16-pretrain-lvd1689m",
|
||||
"vit7b16_sat493m": "facebook/dinov3-vit7b16-pretrain-sat493m",
|
||||
}
|
||||
|
||||
HUB_CHECKPOINTS = {
|
||||
"vits16_lvd1689m": "dinov3_vits16_pretrain_lvd1689m-08c60483.pth",
|
||||
"vits16plus_lvd1689m": "dinov3_vits16plus_pretrain_lvd1689m-4057cbaa.pth",
|
||||
"vitb16_lvd1689m": "dinov3_vitb16_pretrain_lvd1689m-73cec8be.pth",
|
||||
"vitl16_lvd1689m": "dinov3_vitl16_pretrain_lvd1689m-8aa4cbdd.pth",
|
||||
"vitl16_sat493m": "dinov3_vitl16_pretrain_sat493m-eadcf0ff.pth",
|
||||
"vith16plus_lvd1689m": "dinov3_vith16plus_pretrain_lvd1689m-7c1da9a5.pth",
|
||||
"vit7b16_lvd1689m": "dinov3_vit7b16_pretrain_lvd1689m-a955f4ea.pth",
|
||||
"vit7b16_sat493m": "dinov3_vit7b16_pretrain_sat493m-a6675841.pth",
|
||||
}
|
||||
|
||||
# fmt: off
|
||||
ORIGINAL_TO_CONVERTED_KEY_MAPPING = {
|
||||
r"cls_token": r"embeddings.cls_token",
|
||||
r"mask_token": r"embeddings.mask_token",
|
||||
r"storage_tokens": r"embeddings.register_tokens",
|
||||
r"patch_embed.proj": r"embeddings.patch_embeddings",
|
||||
r"periods": r"inv_freq",
|
||||
r"rope_embed": r"rope_embeddings",
|
||||
r"blocks.(\d+).attn.proj": r"layer.\1.attention.o_proj",
|
||||
r"blocks.(\d+).attn.": r"layer.\1.attention.",
|
||||
r"blocks.(\d+).ls(\d+).gamma": r"layer.\1.layer_scale\2.lambda1",
|
||||
r"blocks.(\d+).mlp.fc1": r"layer.\1.mlp.up_proj",
|
||||
r"blocks.(\d+).mlp.fc2": r"layer.\1.mlp.down_proj",
|
||||
r"blocks.(\d+).mlp": r"layer.\1.mlp",
|
||||
r"blocks.(\d+).norm": r"layer.\1.norm",
|
||||
r"w1": r"gate_proj",
|
||||
r"w2": r"up_proj",
|
||||
r"w3": r"down_proj",
|
||||
}
|
||||
# fmt: on
|
||||
|
||||
|
||||
def convert_old_keys_to_new_keys(state_dict_keys: Optional[dict] = None):
|
||||
"""
|
||||
This function should be applied only once, on the concatenated keys to efficiently rename using
|
||||
the key mappings.
|
||||
"""
|
||||
output_dict = {}
|
||||
if state_dict_keys is not None:
|
||||
old_text = "\n".join(state_dict_keys)
|
||||
new_text = old_text
|
||||
for pattern, replacement in ORIGINAL_TO_CONVERTED_KEY_MAPPING.items():
|
||||
if replacement is None:
|
||||
new_text = re.sub(pattern, "", new_text) # an empty line
|
||||
continue
|
||||
new_text = re.sub(pattern, replacement, new_text)
|
||||
output_dict = dict(zip(old_text.split("\n"), new_text.split("\n")))
|
||||
return output_dict
|
||||
|
||||
|
||||
def split_qkv(state_dict: dict):
|
||||
keys = [x for x in state_dict.keys() if "qkv" in x]
|
||||
for key in keys:
|
||||
qkv = state_dict.pop(key)
|
||||
q, k, v = torch.chunk(qkv, 3, dim=0)
|
||||
state_dict[key.replace("qkv", "q_proj")] = q
|
||||
state_dict[key.replace("qkv", "k_proj")] = k
|
||||
state_dict[key.replace("qkv", "v_proj")] = v
|
||||
return state_dict
|
||||
|
||||
|
||||
def get_dinov3_config(model_name: str) -> DINOv3ViTConfig:
|
||||
# size of the architecture
|
||||
if model_name == "vits16_lvd1689m":
|
||||
return DINOv3ViTConfig(
|
||||
patch_size=16,
|
||||
hidden_size=384,
|
||||
intermediate_size=1536,
|
||||
num_hidden_layers=12,
|
||||
num_attention_heads=6,
|
||||
proj_bias=True,
|
||||
num_register_tokens=4,
|
||||
use_gated_mlp=False,
|
||||
hidden_act="gelu",
|
||||
)
|
||||
elif model_name == "vits16plus_lvd1689m":
|
||||
return DINOv3ViTConfig(
|
||||
patch_size=16,
|
||||
hidden_size=384,
|
||||
intermediate_size=1536,
|
||||
num_hidden_layers=12,
|
||||
num_attention_heads=6,
|
||||
num_register_tokens=4,
|
||||
use_gated_mlp=True,
|
||||
hidden_act="silu",
|
||||
)
|
||||
elif model_name == "vitb16_lvd1689m":
|
||||
return DINOv3ViTConfig(
|
||||
patch_size=16,
|
||||
hidden_size=768,
|
||||
intermediate_size=3072,
|
||||
num_hidden_layers=12,
|
||||
num_attention_heads=12,
|
||||
proj_bias=True,
|
||||
num_register_tokens=4,
|
||||
use_gated_mlp=False,
|
||||
hidden_act="gelu",
|
||||
)
|
||||
elif model_name in ("vitl16_lvd1689m", "vitl16_sat493m"):
|
||||
return DINOv3ViTConfig(
|
||||
patch_size=16,
|
||||
hidden_size=1024,
|
||||
intermediate_size=4096,
|
||||
num_hidden_layers=24,
|
||||
num_attention_heads=16,
|
||||
num_register_tokens=4,
|
||||
use_gated_mlp=False,
|
||||
hidden_act="gelu",
|
||||
)
|
||||
elif model_name == "vith16plus_lvd1689m":
|
||||
return DINOv3ViTConfig(
|
||||
patch_size=16,
|
||||
hidden_size=1280,
|
||||
intermediate_size=5120,
|
||||
num_hidden_layers=32,
|
||||
num_attention_heads=20,
|
||||
num_register_tokens=4,
|
||||
use_gated_mlp=True,
|
||||
hidden_act="silu",
|
||||
)
|
||||
elif model_name in ("vit7b16_lvd1689m", "vit7b16_sat493m"):
|
||||
return DINOv3ViTConfig(
|
||||
patch_size=16,
|
||||
hidden_size=4096,
|
||||
intermediate_size=8192,
|
||||
num_hidden_layers=40,
|
||||
num_attention_heads=32,
|
||||
query_bias=False,
|
||||
value_bias=False,
|
||||
num_register_tokens=4,
|
||||
use_gated_mlp=True,
|
||||
hidden_act="silu",
|
||||
)
|
||||
else:
|
||||
raise ValueError("Model not supported")
|
||||
|
||||
|
||||
def prepare_img():
|
||||
url = "http://images.cocodataset.org/val2017/000000039769.jpg"
|
||||
image = Image.open(requests.get(url, stream=True).raw).convert("RGB")
|
||||
return image
|
||||
|
||||
|
||||
def get_transform(resize_size: int = 224):
|
||||
to_tensor = transforms.ToTensor()
|
||||
resize = transforms.Resize((resize_size, resize_size), antialias=True)
|
||||
normalize = transforms.Normalize(
|
||||
mean=(0.485, 0.456, 0.406),
|
||||
std=(0.229, 0.224, 0.225),
|
||||
)
|
||||
return transforms.Compose([to_tensor, resize, normalize])
|
||||
|
||||
|
||||
def get_image_processor(resize_size: int = 224):
|
||||
return DINOv3ViTImageProcessorFast(
|
||||
do_resize=True,
|
||||
size={"height": resize_size, "width": resize_size},
|
||||
resample=2, # BILINEAR
|
||||
)
|
||||
|
||||
|
||||
@torch.no_grad()
|
||||
def convert_and_test_dinov3_checkpoint(args):
|
||||
expected_outputs = {
|
||||
"vits16_lvd1689m_cls": [0.463561, -0.415609, 0.408236, -0.126613, -0.286636],
|
||||
"vits16_lvd1689m_patch": [-0.038754, -0.250895, -0.016392, -0.455473, 0.571582],
|
||||
"vits16plus_lvd1689m_cls": [-0.471349, -1.365778, -0.317983, 0.377219, -0.769085],
|
||||
"vits16plus_lvd1689m_patch": [0.144551, -0.388117, -0.393433, -0.157695, -0.600380],
|
||||
"vitb16_lvd1689m_cls": [1.034643, -0.180609, -0.341018, -0.066376, -0.011383],
|
||||
"vitb16_lvd1689m_patch": [-0.082523, -0.456272, -0.728029, -0.430680, -0.152880],
|
||||
"vitl16_lvd1689m_cls": [0.484527, -0.582214, 0.480636, 0.592040, 0.945166],
|
||||
"vitl16_lvd1689m_patch": [-0.211367, -0.490863, -0.257131, 0.101763, 0.154511],
|
||||
"vith16plus_lvd1689m_cls": [-0.064575, -0.148866, -0.621524, 0.634878, 0.152695],
|
||||
"vith16plus_lvd1689m_patch": [-0.093817, 0.287407, -0.050036, 0.428043, 0.094561],
|
||||
"vit7b16_lvd1689m_cls": [0.275439, -0.261353, 0.067772, 0.049936, -0.158747],
|
||||
"vit7b16_lvd1689m_patch": [0.044442, -0.052542, 0.070777, -0.065111, -0.026546],
|
||||
"vitl16_sat493m_cls": [-0.33235, 0.34052, -0.22087, 0.21434, 0.09003],
|
||||
"vitl16_sat493m_patch": [0.18488, 0.30309, -0.20689, 0.12848, 0.06207],
|
||||
"vit7b16_sat493m_cls": [-0.19779, 0.11819, -0.00581, -0.21055, -0.03971],
|
||||
"vit7b16_sat493m_patch": [-0.12423, 0.07879, -0.10057, 0.02835, -0.11727],
|
||||
}
|
||||
|
||||
model_name = args.model_name
|
||||
config = get_dinov3_config(model_name)
|
||||
|
||||
model = DINOv3ViTModel(config).eval()
|
||||
state_dict_path = hf_hub_download(repo_id=HUB_MODELS[model_name], filename=HUB_CHECKPOINTS[model_name])
|
||||
original_state_dict = torch.load(state_dict_path, mmap=True)
|
||||
|
||||
original_state_dict = split_qkv(original_state_dict)
|
||||
original_keys = list(original_state_dict.keys())
|
||||
new_keys = convert_old_keys_to_new_keys(original_keys)
|
||||
|
||||
converted_state_dict = {}
|
||||
for key in original_keys:
|
||||
new_key = new_keys[key]
|
||||
weight_tensor = original_state_dict[key]
|
||||
|
||||
if "bias_mask" in key or "attn.k_proj.bias" in key or "local_cls_norm" in key:
|
||||
continue
|
||||
if "embeddings.mask_token" in new_key:
|
||||
weight_tensor = weight_tensor.unsqueeze(1)
|
||||
if "inv_freq" in new_key:
|
||||
continue
|
||||
|
||||
converted_state_dict[new_key] = weight_tensor
|
||||
|
||||
model.load_state_dict(converted_state_dict, strict=True)
|
||||
model = model.eval()
|
||||
|
||||
transform = get_transform()
|
||||
image_processor = get_image_processor()
|
||||
image = prepare_img()
|
||||
|
||||
# check preprocessing
|
||||
original_pixel_values = transform(image).unsqueeze(0) # add batch dimension
|
||||
inputs = image_processor(image, return_tensors="pt")
|
||||
|
||||
torch.testing.assert_close(original_pixel_values, inputs["pixel_values"], atol=1e-6, rtol=1e-6)
|
||||
print("Preprocessing looks ok!")
|
||||
|
||||
with torch.inference_mode(), torch.autocast("cuda", dtype=torch.float):
|
||||
model_output = model(**inputs)
|
||||
|
||||
last_layer_class_token = model_output.pooler_output
|
||||
last_layer_patch_tokens = model_output.last_hidden_state[:, config.num_register_tokens + 1 :]
|
||||
|
||||
actual_outputs = {}
|
||||
actual_outputs[f"{model_name}_cls"] = last_layer_class_token[0, :5].tolist()
|
||||
actual_outputs[f"{model_name}_patch"] = last_layer_patch_tokens[0, 0, :5].tolist()
|
||||
|
||||
print("Actual: ", [round(x, 6) for x in actual_outputs[f"{model_name}_cls"]])
|
||||
print("Expected:", expected_outputs[f"{model_name}_cls"])
|
||||
|
||||
torch.testing.assert_close(
|
||||
torch.Tensor(actual_outputs[f"{model_name}_cls"]),
|
||||
torch.Tensor(expected_outputs[f"{model_name}_cls"]),
|
||||
atol=1e-3,
|
||||
rtol=1e-3,
|
||||
)
|
||||
torch.testing.assert_close(
|
||||
torch.Tensor(actual_outputs[f"{model_name}_patch"]),
|
||||
torch.Tensor(expected_outputs[f"{model_name}_patch"]),
|
||||
atol=1e-3,
|
||||
rtol=1e-3,
|
||||
)
|
||||
print("Forward pass looks ok!")
|
||||
|
||||
save_dir = os.path.join(args.save_dir, model_name)
|
||||
os.makedirs(save_dir, exist_ok=True)
|
||||
model.save_pretrained(save_dir)
|
||||
image_processor.save_pretrained(save_dir)
|
||||
print(f"Model saved to {save_dir}")
|
||||
|
||||
if args.push_to_hub:
|
||||
api = HfApi()
|
||||
repo = HUB_MODELS[model_name]
|
||||
api.upload_folder(folder_path=save_dir, repo_id=repo, repo_type="model")
|
||||
|
||||
|
||||
if __name__ == "__main__":
|
||||
parser = argparse.ArgumentParser()
|
||||
# Required parameters
|
||||
parser.add_argument(
|
||||
"--model-name",
|
||||
default="vith16plus_lvd1689m",
|
||||
type=str,
|
||||
choices=[
|
||||
"vits16_lvd1689m",
|
||||
"vits16plus_lvd1689m",
|
||||
"vitb16_lvd1689m",
|
||||
"vitl16_lvd1689m",
|
||||
"vitl16_sat493m",
|
||||
"vith16plus_lvd1689m",
|
||||
"vit7b16_lvd1689m",
|
||||
"vit7b16_sat493m",
|
||||
],
|
||||
help="Name of the model you'd like to convert.",
|
||||
)
|
||||
parser.add_argument(
|
||||
"--save-dir",
|
||||
default="converted_models",
|
||||
type=str,
|
||||
help="Directory to save the converted model.",
|
||||
)
|
||||
parser.add_argument(
|
||||
"--push-to-hub",
|
||||
action="store_true",
|
||||
help="Push the converted model to the Hugging Face Hub.",
|
||||
)
|
||||
args = parser.parse_args()
|
||||
convert_and_test_dinov3_checkpoint(args)
|
||||
@@ -0,0 +1,104 @@
|
||||
# coding=utf-8
|
||||
# Copyright 2025 The HuggingFace Team. All rights reserved.
|
||||
#
|
||||
# Licensed under the Apache License, Version 2.0 (the "License");
|
||||
# you may not use this file except in compliance with the License.
|
||||
# You may obtain a copy of the License at
|
||||
#
|
||||
# http://www.apache.org/licenses/LICENSE-2.0
|
||||
#
|
||||
# Unless required by applicable law or agreed to in writing, software
|
||||
# distributed under the License is distributed on an "AS IS" BASIS,
|
||||
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
|
||||
# See the License for the specific language governing permissions and
|
||||
# limitations under the License.
|
||||
"""Fast Image processor class for DINOv3."""
|
||||
|
||||
from typing import Optional, Union
|
||||
|
||||
from transformers.image_processing_base import BatchFeature
|
||||
from transformers.image_processing_utils_fast import BaseImageProcessorFast, group_images_by_shape, reorder_images
|
||||
from transformers.image_utils import IMAGENET_DEFAULT_MEAN, IMAGENET_DEFAULT_STD, PILImageResampling, SizeDict
|
||||
from transformers.utils import (
|
||||
TensorType,
|
||||
auto_docstring,
|
||||
is_torch_available,
|
||||
is_torchvision_available,
|
||||
is_torchvision_v2_available,
|
||||
logging,
|
||||
)
|
||||
from transformers.utils.import_utils import requires
|
||||
|
||||
|
||||
logger = logging.get_logger(__name__)
|
||||
|
||||
|
||||
if is_torch_available():
|
||||
import torch
|
||||
|
||||
if is_torchvision_v2_available():
|
||||
from torchvision.transforms.v2 import functional as F
|
||||
elif is_torchvision_available():
|
||||
from torchvision.transforms import functional as F
|
||||
|
||||
|
||||
@auto_docstring
|
||||
@requires(backends=("torchvision", "torch"))
|
||||
class DINOv3ViTImageProcessorFast(BaseImageProcessorFast):
|
||||
resample = PILImageResampling.BILINEAR
|
||||
image_mean = IMAGENET_DEFAULT_MEAN
|
||||
image_std = IMAGENET_DEFAULT_STD
|
||||
size = {"height": 224, "width": 224}
|
||||
do_resize = True
|
||||
do_rescale = True
|
||||
do_normalize = True
|
||||
|
||||
# Overriden for DINOv3 to preserve order of transforms
|
||||
# rescale -> resize -> normalize
|
||||
def _preprocess(
|
||||
self,
|
||||
images: list["torch.Tensor"],
|
||||
do_resize: bool,
|
||||
size: SizeDict,
|
||||
interpolation: Optional["F.InterpolationMode"],
|
||||
do_center_crop: bool,
|
||||
crop_size: SizeDict,
|
||||
do_rescale: bool,
|
||||
rescale_factor: float,
|
||||
do_normalize: bool,
|
||||
image_mean: Optional[Union[float, list[float]]],
|
||||
image_std: Optional[Union[float, list[float]]],
|
||||
disable_grouping: Optional[bool],
|
||||
return_tensors: Optional[Union[str, TensorType]],
|
||||
) -> BatchFeature:
|
||||
# Group images by size for batched resizing
|
||||
grouped_images, grouped_images_index = group_images_by_shape(images, disable_grouping=disable_grouping)
|
||||
resized_images_grouped = {}
|
||||
for shape, stacked_images in grouped_images.items():
|
||||
if do_rescale:
|
||||
stacked_images = self.rescale(stacked_images, rescale_factor)
|
||||
if do_resize:
|
||||
stacked_images = self.resize(
|
||||
image=stacked_images, size=size, interpolation=interpolation, antialias=True
|
||||
)
|
||||
resized_images_grouped[shape] = stacked_images
|
||||
resized_images = reorder_images(resized_images_grouped, grouped_images_index)
|
||||
|
||||
# Group images by size for further processing
|
||||
# Needed in case do_resize is False, or resize returns images with different sizes
|
||||
grouped_images, grouped_images_index = group_images_by_shape(resized_images, disable_grouping=disable_grouping)
|
||||
processed_images_grouped = {}
|
||||
for shape, stacked_images in grouped_images.items():
|
||||
if do_center_crop:
|
||||
stacked_images = self.center_crop(stacked_images, crop_size)
|
||||
if do_normalize:
|
||||
stacked_images = self.normalize(stacked_images, image_mean, image_std)
|
||||
processed_images_grouped[shape] = stacked_images
|
||||
|
||||
processed_images = reorder_images(processed_images_grouped, grouped_images_index)
|
||||
processed_images = torch.stack(processed_images, dim=0) if return_tensors else processed_images
|
||||
|
||||
return BatchFeature(data={"pixel_values": processed_images}, tensor_type=return_tensors)
|
||||
|
||||
|
||||
__all__ = ["DINOv3ViTImageProcessorFast"]
|
||||
538
src/transformers/models/dinov3_vit/modeling_dinov3_vit.py
Normal file
538
src/transformers/models/dinov3_vit/modeling_dinov3_vit.py
Normal file
@@ -0,0 +1,538 @@
|
||||
# 🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨
|
||||
# This file was automatically generated from src/transformers/models/dinov3_vit/modular_dinov3_vit.py.
|
||||
# Do NOT edit this file manually as any edits will be overwritten by the generation of
|
||||
# the file from the modular. If any change should be done, please apply the change to the
|
||||
# modular_dinov3_vit.py file directly. One of our CI enforces this.
|
||||
# 🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨
|
||||
# coding=utf-8
|
||||
# Copyright 2025 Meta AI and The HuggingFace Inc. team. All rights reserved.
|
||||
#
|
||||
# Licensed under the Apache License, Version 2.0 (the "License");
|
||||
# you may not use this file except in compliance with the License.
|
||||
# You may obtain a copy of the License at
|
||||
#
|
||||
# http://www.apache.org/licenses/LICENSE-2.0
|
||||
#
|
||||
# Unless required by applicable law or agreed to in writing, software
|
||||
# distributed under the License is distributed on an "AS IS" BASIS,
|
||||
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
|
||||
# See the License for the specific language governing permissions and
|
||||
# limitations under the License.
|
||||
|
||||
import math
|
||||
from typing import Callable, Optional
|
||||
|
||||
import numpy as np
|
||||
import torch
|
||||
from torch import nn
|
||||
|
||||
from ...activations import ACT2FN
|
||||
from ...modeling_layers import GradientCheckpointingLayer
|
||||
from ...modeling_outputs import BaseModelOutputWithPooling
|
||||
from ...modeling_utils import ALL_ATTENTION_FUNCTIONS, PreTrainedModel
|
||||
from ...processing_utils import Unpack
|
||||
from ...pytorch_utils import compile_compatible_method_lru_cache
|
||||
from ...utils import TransformersKwargs, auto_docstring
|
||||
from ...utils.generic import check_model_inputs
|
||||
from .configuration_dinov3_vit import DINOv3ViTConfig
|
||||
|
||||
|
||||
class DINOv3ViTEmbeddings(nn.Module):
|
||||
"""
|
||||
Construct the CLS token, mask token, position and patch embeddings.
|
||||
"""
|
||||
|
||||
def __init__(self, config: DINOv3ViTConfig):
|
||||
super().__init__()
|
||||
self.config = config
|
||||
self.cls_token = nn.Parameter(torch.randn(1, 1, config.hidden_size))
|
||||
self.mask_token = nn.Parameter(torch.zeros(1, 1, config.hidden_size))
|
||||
self.register_tokens = nn.Parameter(torch.empty(1, config.num_register_tokens, config.hidden_size))
|
||||
self.patch_embeddings = nn.Conv2d(
|
||||
config.num_channels, config.hidden_size, kernel_size=config.patch_size, stride=config.patch_size
|
||||
)
|
||||
|
||||
def forward(self, pixel_values: torch.Tensor, bool_masked_pos: Optional[torch.Tensor] = None) -> torch.Tensor:
|
||||
batch_size = pixel_values.shape[0]
|
||||
target_dtype = self.patch_embeddings.weight.dtype
|
||||
|
||||
# (batch_size, num_channels, height, width) -> (batch_size, num_patches, hidden_size)
|
||||
patch_embeddings = self.patch_embeddings(pixel_values.to(dtype=target_dtype))
|
||||
patch_embeddings = patch_embeddings.flatten(2).transpose(1, 2)
|
||||
|
||||
if bool_masked_pos is not None:
|
||||
mask_token = self.mask_token.to(patch_embeddings.dtype)
|
||||
patch_embeddings = torch.where(bool_masked_pos.unsqueeze(-1), mask_token, patch_embeddings)
|
||||
|
||||
# Add CLS and register tokens
|
||||
cls_token = self.cls_token.expand(batch_size, -1, -1)
|
||||
register_tokens = self.register_tokens.expand(batch_size, -1, -1)
|
||||
embeddings = torch.cat([cls_token, register_tokens, patch_embeddings], dim=1)
|
||||
|
||||
return embeddings
|
||||
|
||||
|
||||
@compile_compatible_method_lru_cache(maxsize=32)
|
||||
def get_patches_center_coordinates(
|
||||
num_patches_h: int, num_patches_w: int, dtype: torch.dtype, device: torch.device
|
||||
) -> torch.Tensor:
|
||||
"""
|
||||
Computes the 2D coordinates of the centers of image patches, normalized to the range [-1, +1].
|
||||
The center of each patch is exactly halfway between its top-left and bottom-right corners.
|
||||
|
||||
Args:
|
||||
num_patches_h (int): Number of patches along the vertical (height) axis.
|
||||
num_patches_w (int): Number of patches along the horizontal (width) axis.
|
||||
dtype (torch.dtype): The desired data type of the returned tensor.
|
||||
|
||||
Returns:
|
||||
torch.Tensor: A tensor of shape (height * width, 2), where each row contains the (y, x)
|
||||
coordinates of a patch center, normalized to [-1, +1].
|
||||
"""
|
||||
coords_h = torch.arange(0.5, num_patches_h, dtype=dtype, device=device)
|
||||
coords_w = torch.arange(0.5, num_patches_w, dtype=dtype, device=device)
|
||||
coords_h = coords_h / num_patches_h
|
||||
coords_w = coords_w / num_patches_w
|
||||
# (height, width, 2) -> (height * width, 2)
|
||||
coords = torch.stack(torch.meshgrid(coords_h, coords_w, indexing="ij"), dim=-1)
|
||||
coords = coords.flatten(0, 1)
|
||||
# Shift range [0, 1] to [-1, +1]
|
||||
coords = 2.0 * coords - 1.0
|
||||
return coords
|
||||
|
||||
|
||||
def augment_patches_center_coordinates(
|
||||
coords: torch.Tensor,
|
||||
shift: Optional[float] = None,
|
||||
jitter: Optional[float] = None,
|
||||
rescale: Optional[float] = None,
|
||||
) -> torch.Tensor:
|
||||
# Shift coords by adding a uniform value in [-shift, shift]
|
||||
if shift is not None:
|
||||
shift_hw = torch.empty((1, 2), device=coords.device, dtype=coords.dtype)
|
||||
shift_hw = shift_hw.uniform_(-shift, shift)
|
||||
coords = coords + shift_hw
|
||||
|
||||
# Jitter coords by multiplying the range [-1, 1] by a log-uniform value in [1/jitter, jitter]
|
||||
if jitter is not None:
|
||||
jitter_range = np.log(jitter)
|
||||
jitter_hw = torch.empty((1, 2), device=coords.device, dtype=coords.dtype)
|
||||
jitter_hw = jitter_hw.uniform_(-jitter_range, jitter_range).exp()
|
||||
coords = coords * jitter_hw
|
||||
|
||||
# Rescale coords by multiplying the range [-1, 1] by a log-uniform value in [1/rescale, rescale]
|
||||
if rescale is not None:
|
||||
rescale_range = np.log(rescale)
|
||||
rescale_hw = torch.empty(1, device=coords.device, dtype=coords.dtype)
|
||||
rescale_hw = rescale_hw.uniform_(-rescale_range, rescale_range).exp()
|
||||
coords = coords * rescale_hw
|
||||
|
||||
return coords
|
||||
|
||||
|
||||
class DINOv3ViTRopePositionEmbedding(nn.Module):
|
||||
inv_freq: torch.Tensor
|
||||
|
||||
def __init__(self, config: DINOv3ViTConfig):
|
||||
super().__init__()
|
||||
|
||||
self.config = config
|
||||
self.base = config.rope_theta
|
||||
self.head_dim = config.hidden_size // config.num_attention_heads
|
||||
self.num_patches_h = config.image_size // config.patch_size
|
||||
self.num_patches_w = config.image_size // config.patch_size
|
||||
|
||||
inv_freq = 1 / self.base ** torch.arange(0, 1, 4 / self.head_dim, dtype=torch.float32) # (head_dim / 4,)
|
||||
self.register_buffer("inv_freq", inv_freq, persistent=False)
|
||||
|
||||
def forward(self, pixel_values: torch.Tensor) -> tuple[torch.Tensor, torch.Tensor]:
|
||||
_, _, height, width = pixel_values.shape
|
||||
num_patches_h = height // self.config.patch_size
|
||||
num_patches_w = width // self.config.patch_size
|
||||
|
||||
device = pixel_values.device
|
||||
device_type = device.type if isinstance(device.type, str) and device.type != "mps" else "cpu"
|
||||
|
||||
with torch.autocast(device_type=device_type, enabled=False): # Force float32
|
||||
# Although we could precompute static patch_coords from image_size and patch_size in the config,
|
||||
# the model was trained with random_scale, so it can process images of varying sizes.
|
||||
# Therefore, it's better to compute patch_coords dynamically (with lru_cache).
|
||||
patch_coords = get_patches_center_coordinates(
|
||||
num_patches_h, num_patches_w, dtype=torch.float32, device=device
|
||||
)
|
||||
if self.training:
|
||||
patch_coords = augment_patches_center_coordinates(
|
||||
patch_coords,
|
||||
shift=self.config.pos_embed_shift,
|
||||
jitter=self.config.pos_embed_jitter,
|
||||
rescale=self.config.pos_embed_rescale,
|
||||
)
|
||||
|
||||
# (height * width, 2, head_dim / 4) -> (height * width, head_dim / 2) -> (height * width, head_dim)
|
||||
angles = 2 * math.pi * patch_coords[:, :, None] * self.inv_freq[None, None, :]
|
||||
angles = angles.flatten(1, 2)
|
||||
angles = angles.tile(2)
|
||||
|
||||
cos = torch.cos(angles)
|
||||
sin = torch.sin(angles)
|
||||
|
||||
dtype = pixel_values.dtype
|
||||
return cos.to(dtype=dtype), sin.to(dtype=dtype)
|
||||
|
||||
|
||||
def rotate_half(x):
|
||||
"""Rotates half the hidden dims of the input."""
|
||||
x1 = x[..., : x.shape[-1] // 2]
|
||||
x2 = x[..., x.shape[-1] // 2 :]
|
||||
return torch.cat((-x2, x1), dim=-1)
|
||||
|
||||
|
||||
def eager_attention_forward(
|
||||
module: nn.Module,
|
||||
query: torch.Tensor,
|
||||
key: torch.Tensor,
|
||||
value: torch.Tensor,
|
||||
attention_mask: Optional[torch.Tensor],
|
||||
scaling: float,
|
||||
dropout: float = 0.0,
|
||||
**kwargs,
|
||||
):
|
||||
# Take the dot product between "query" and "key" to get the raw attention scores.
|
||||
attn_weights = torch.matmul(query, key.transpose(-1, -2)) * scaling
|
||||
|
||||
# Normalize the attention scores to probabilities.
|
||||
attn_weights = nn.functional.softmax(attn_weights, dim=-1, dtype=torch.float32).to(query.dtype)
|
||||
|
||||
# This is actually dropping out entire tokens to attend to, which might
|
||||
# seem a bit unusual, but is taken from the original Transformer paper.
|
||||
attn_weights = nn.functional.dropout(attn_weights, p=dropout, training=module.training)
|
||||
|
||||
# Mask heads if we want to
|
||||
if attention_mask is not None:
|
||||
attn_weights = attn_weights * attention_mask
|
||||
|
||||
attn_output = torch.matmul(attn_weights, value)
|
||||
attn_output = attn_output.transpose(1, 2).contiguous()
|
||||
|
||||
return attn_output, attn_weights
|
||||
|
||||
|
||||
def apply_rotary_pos_emb(
|
||||
q: torch.Tensor, k: torch.Tensor, cos: torch.Tensor, sin: torch.Tensor, **kwargs
|
||||
) -> tuple[torch.Tensor, torch.Tensor]:
|
||||
"""Applies Rotary Position Embedding to the query and key tensors, but only to the patch tokens,
|
||||
ignoring the prefix tokens (cls token and register tokens).
|
||||
|
||||
Args:
|
||||
q (`torch.Tensor`): The query tensor.
|
||||
k (`torch.Tensor`): The key tensor.
|
||||
cos (`torch.Tensor`): The cosine part of the rotary embedding.
|
||||
sin (`torch.Tensor`): The sine part of the rotary embedding.
|
||||
|
||||
Returns:
|
||||
`tuple(torch.Tensor)` comprising of the query and key tensors rotated using the Rotary Position Embedding.
|
||||
"""
|
||||
|
||||
num_tokens = q.shape[-2]
|
||||
num_patches = sin.shape[-2]
|
||||
num_prefix_tokens = num_tokens - num_patches # cls token + register tokens
|
||||
|
||||
q_prefix_tokens, q_patches = q.split((num_prefix_tokens, num_patches), dim=-2)
|
||||
k_prefix_tokens, k_patches = k.split((num_prefix_tokens, num_patches), dim=-2)
|
||||
|
||||
# apply rope only to patch tokens
|
||||
q_patches = (q_patches * cos) + (rotate_half(q_patches) * sin)
|
||||
k_patches = (k_patches * cos) + (rotate_half(k_patches) * sin)
|
||||
|
||||
q = torch.cat((q_prefix_tokens, q_patches), dim=-2)
|
||||
k = torch.cat((k_prefix_tokens, k_patches), dim=-2)
|
||||
|
||||
return q, k
|
||||
|
||||
|
||||
class DINOv3ViTAttention(nn.Module):
|
||||
"""
|
||||
Multi-headed attention compatible with ALL_ATTENTION_FUNCTIONS.
|
||||
"""
|
||||
|
||||
def __init__(self, config: DINOv3ViTConfig):
|
||||
super().__init__()
|
||||
self.config = config
|
||||
self.embed_dim = config.hidden_size
|
||||
self.num_heads = config.num_attention_heads
|
||||
self.head_dim = self.embed_dim // self.num_heads
|
||||
self.is_causal = False
|
||||
|
||||
self.scaling = self.head_dim**-0.5
|
||||
self.is_causal = False
|
||||
|
||||
self.dropout = config.attention_dropout
|
||||
self.k_proj = nn.Linear(self.embed_dim, self.embed_dim, bias=config.key_bias)
|
||||
self.v_proj = nn.Linear(self.embed_dim, self.embed_dim, bias=config.value_bias)
|
||||
|
||||
self.q_proj = nn.Linear(self.embed_dim, self.embed_dim, bias=config.query_bias)
|
||||
self.o_proj = nn.Linear(self.embed_dim, self.embed_dim, bias=config.proj_bias)
|
||||
|
||||
def forward(
|
||||
self,
|
||||
hidden_states: torch.Tensor,
|
||||
attention_mask: Optional[torch.Tensor] = None,
|
||||
position_embeddings: Optional[tuple[torch.Tensor, torch.Tensor]] = None,
|
||||
**kwargs: Unpack[TransformersKwargs],
|
||||
) -> tuple[torch.Tensor, Optional[torch.Tensor]]:
|
||||
"""Input shape: Batch x Time x Channel"""
|
||||
|
||||
batch_size, patches, _ = hidden_states.size()
|
||||
|
||||
query_states = self.q_proj(hidden_states)
|
||||
key_states = self.k_proj(hidden_states)
|
||||
value_states = self.v_proj(hidden_states)
|
||||
|
||||
query_states = query_states.view(batch_size, patches, self.num_heads, self.head_dim).transpose(1, 2)
|
||||
key_states = key_states.view(batch_size, patches, self.num_heads, self.head_dim).transpose(1, 2)
|
||||
value_states = value_states.view(batch_size, patches, self.num_heads, self.head_dim).transpose(1, 2)
|
||||
|
||||
cos, sin = position_embeddings
|
||||
query_states, key_states = apply_rotary_pos_emb(query_states, key_states, cos, sin)
|
||||
|
||||
attention_interface: Callable = eager_attention_forward
|
||||
if self.config._attn_implementation != "eager":
|
||||
attention_interface = ALL_ATTENTION_FUNCTIONS[self.config._attn_implementation]
|
||||
|
||||
attn_output, attn_weights = attention_interface(
|
||||
self,
|
||||
query_states,
|
||||
key_states,
|
||||
value_states,
|
||||
attention_mask,
|
||||
dropout=0.0 if not self.training else self.dropout,
|
||||
scaling=self.scaling,
|
||||
**kwargs,
|
||||
)
|
||||
|
||||
attn_output = attn_output.reshape(batch_size, patches, -1).contiguous()
|
||||
attn_output = self.o_proj(attn_output)
|
||||
|
||||
return attn_output, attn_weights
|
||||
|
||||
|
||||
class DINOv3ViTLayerScale(nn.Module):
|
||||
def __init__(self, config) -> None:
|
||||
super().__init__()
|
||||
self.lambda1 = nn.Parameter(config.layerscale_value * torch.ones(config.hidden_size))
|
||||
|
||||
def forward(self, hidden_state: torch.Tensor) -> torch.Tensor:
|
||||
return hidden_state * self.lambda1
|
||||
|
||||
|
||||
def drop_path(input: torch.Tensor, drop_prob: float = 0.0, training: bool = False) -> torch.Tensor:
|
||||
"""
|
||||
Drop paths (Stochastic Depth) per sample (when applied in main path of residual blocks).
|
||||
|
||||
Comment by Ross Wightman: This is the same as the DropConnect impl I created for EfficientNet, etc networks,
|
||||
however, the original name is misleading as 'Drop Connect' is a different form of dropout in a separate paper...
|
||||
See discussion: https://github.com/tensorflow/tpu/issues/494#issuecomment-532968956 ... I've opted for changing the
|
||||
layer and argument names to 'drop path' rather than mix DropConnect as a layer name and use 'survival rate' as the
|
||||
argument.
|
||||
"""
|
||||
if drop_prob == 0.0 or not training:
|
||||
return input
|
||||
keep_prob = 1 - drop_prob
|
||||
shape = (input.shape[0],) + (1,) * (input.ndim - 1) # work with diff dim tensors, not just 2D ConvNets
|
||||
random_tensor = keep_prob + torch.rand(shape, dtype=input.dtype, device=input.device)
|
||||
random_tensor.floor_() # binarize
|
||||
output = input.div(keep_prob) * random_tensor
|
||||
return output
|
||||
|
||||
|
||||
class DINOv3ViTDropPath(nn.Module):
|
||||
"""Drop paths (Stochastic Depth) per sample (when applied in main path of residual blocks)."""
|
||||
|
||||
def __init__(self, drop_prob: Optional[float] = None) -> None:
|
||||
super().__init__()
|
||||
self.drop_prob = drop_prob
|
||||
|
||||
def forward(self, hidden_states: torch.Tensor) -> torch.Tensor:
|
||||
return drop_path(hidden_states, self.drop_prob, self.training)
|
||||
|
||||
def extra_repr(self) -> str:
|
||||
return f"p={self.drop_prob}"
|
||||
|
||||
|
||||
class DINOv3ViTMLP(nn.Module):
|
||||
def __init__(self, config):
|
||||
super().__init__()
|
||||
self.config = config
|
||||
self.hidden_size = config.hidden_size
|
||||
self.intermediate_size = config.intermediate_size
|
||||
self.up_proj = nn.Linear(self.hidden_size, self.intermediate_size, bias=config.mlp_bias)
|
||||
self.down_proj = nn.Linear(self.intermediate_size, self.hidden_size, bias=config.mlp_bias)
|
||||
self.act_fn = ACT2FN[config.hidden_act]
|
||||
|
||||
def forward(self, x):
|
||||
return self.down_proj(self.act_fn(self.up_proj(x)))
|
||||
|
||||
|
||||
class DINOv3ViTGatedMLP(nn.Module):
|
||||
def __init__(self, config):
|
||||
super().__init__()
|
||||
self.config = config
|
||||
self.hidden_size = config.hidden_size
|
||||
self.intermediate_size = config.intermediate_size
|
||||
self.gate_proj = nn.Linear(self.hidden_size, self.intermediate_size, bias=config.mlp_bias)
|
||||
self.up_proj = nn.Linear(self.hidden_size, self.intermediate_size, bias=config.mlp_bias)
|
||||
self.down_proj = nn.Linear(self.intermediate_size, self.hidden_size, bias=config.mlp_bias)
|
||||
self.act_fn = ACT2FN[config.hidden_act]
|
||||
|
||||
def forward(self, x):
|
||||
down_proj = self.down_proj(self.act_fn(self.gate_proj(x)) * self.up_proj(x))
|
||||
return down_proj
|
||||
|
||||
|
||||
class DINOv3ViTLayer(GradientCheckpointingLayer):
|
||||
"""This corresponds to the Block class in the original implementation."""
|
||||
|
||||
def __init__(self, config: DINOv3ViTConfig):
|
||||
super().__init__()
|
||||
|
||||
self.norm1 = nn.LayerNorm(config.hidden_size, eps=config.layer_norm_eps)
|
||||
self.attention = DINOv3ViTAttention(config)
|
||||
self.layer_scale1 = DINOv3ViTLayerScale(config)
|
||||
self.drop_path = DINOv3ViTDropPath(config.drop_path_rate) if config.drop_path_rate > 0.0 else nn.Identity()
|
||||
|
||||
self.norm2 = nn.LayerNorm(config.hidden_size, eps=config.layer_norm_eps)
|
||||
|
||||
if config.use_gated_mlp:
|
||||
self.mlp = DINOv3ViTGatedMLP(config)
|
||||
else:
|
||||
self.mlp = DINOv3ViTMLP(config)
|
||||
self.layer_scale2 = DINOv3ViTLayerScale(config)
|
||||
|
||||
def forward(
|
||||
self,
|
||||
hidden_states: torch.Tensor,
|
||||
attention_mask: Optional[torch.Tensor] = None,
|
||||
position_embeddings: Optional[tuple[torch.Tensor, torch.Tensor]] = None,
|
||||
) -> torch.Tensor:
|
||||
# Attention with residual connection
|
||||
residual = hidden_states
|
||||
hidden_states = self.norm1(hidden_states)
|
||||
hidden_states, _ = self.attention(
|
||||
hidden_states,
|
||||
attention_mask=attention_mask,
|
||||
position_embeddings=position_embeddings,
|
||||
)
|
||||
hidden_states = self.layer_scale1(hidden_states)
|
||||
hidden_states = self.drop_path(hidden_states) + residual
|
||||
|
||||
# MLP with residual connection
|
||||
residual = hidden_states
|
||||
hidden_states = self.norm2(hidden_states)
|
||||
hidden_states = self.mlp(hidden_states)
|
||||
hidden_states = self.layer_scale2(hidden_states)
|
||||
hidden_states = self.drop_path(hidden_states) + residual
|
||||
|
||||
return hidden_states
|
||||
|
||||
|
||||
@auto_docstring
|
||||
class DINOv3ViTPreTrainedModel(PreTrainedModel):
|
||||
config: DINOv3ViTConfig
|
||||
base_model_prefix = "dinov3_vit"
|
||||
main_input_name = "pixel_values"
|
||||
supports_gradient_checkpointing = True
|
||||
_no_split_modules = ["DINOv3ViTLayer"]
|
||||
_supports_sdpa = True
|
||||
_supports_flash_attn = True
|
||||
_supports_flex_attn = True
|
||||
_supports_attention_backend = True
|
||||
_can_record_outputs = {
|
||||
"hidden_states": DINOv3ViTLayer,
|
||||
"attentions": DINOv3ViTAttention,
|
||||
}
|
||||
|
||||
def _init_weights(self, module) -> None:
|
||||
"""Initialize the weights"""
|
||||
if isinstance(module, (nn.Linear, nn.Conv2d)):
|
||||
# Upcast the input in `fp32` and cast it back to desired `dtype` to avoid
|
||||
# `trunc_normal_cpu` not implemented in `half` issues
|
||||
module.weight.data = nn.init.trunc_normal_(
|
||||
module.weight.data.to(torch.float32),
|
||||
mean=0.0,
|
||||
std=self.config.initializer_range,
|
||||
).to(module.weight.dtype)
|
||||
if module.bias is not None:
|
||||
module.bias.data.zero_()
|
||||
elif isinstance(module, nn.LayerNorm):
|
||||
module.bias.data.zero_()
|
||||
module.weight.data.fill_(1.0)
|
||||
elif isinstance(module, DINOv3ViTEmbeddings):
|
||||
module.cls_token.data = nn.init.trunc_normal_(
|
||||
module.cls_token.data.to(torch.float32),
|
||||
mean=0.0,
|
||||
std=self.config.initializer_range,
|
||||
).to(module.cls_token.dtype)
|
||||
if module.config.num_register_tokens > 0:
|
||||
module.register_tokens.data = nn.init.trunc_normal_(
|
||||
module.register_tokens.data.to(torch.float32),
|
||||
mean=0.0,
|
||||
std=self.config.initializer_range,
|
||||
).to(module.register_tokens.dtype)
|
||||
module.mask_token.data.zero_()
|
||||
elif isinstance(module, DINOv3ViTLayerScale):
|
||||
module.lambda1.data.fill_(self.config.layerscale_value)
|
||||
|
||||
|
||||
@auto_docstring
|
||||
class DINOv3ViTModel(DINOv3ViTPreTrainedModel):
|
||||
def __init__(self, config: DINOv3ViTConfig):
|
||||
super().__init__(config)
|
||||
self.config = config
|
||||
self.embeddings = DINOv3ViTEmbeddings(config)
|
||||
self.rope_embeddings = DINOv3ViTRopePositionEmbedding(config)
|
||||
self.layer = nn.ModuleList([DINOv3ViTLayer(config) for _ in range(config.num_hidden_layers)])
|
||||
self.norm = nn.LayerNorm(config.hidden_size, eps=config.layer_norm_eps)
|
||||
self.gradient_checkpointing = False
|
||||
# Initialize weights and apply final processing
|
||||
self.post_init()
|
||||
|
||||
def get_input_embeddings(self):
|
||||
return self.embeddings.patch_embeddings
|
||||
|
||||
@check_model_inputs
|
||||
@auto_docstring
|
||||
def forward(
|
||||
self,
|
||||
pixel_values: torch.Tensor,
|
||||
bool_masked_pos: Optional[torch.Tensor] = None,
|
||||
head_mask: Optional[torch.Tensor] = None,
|
||||
**kwargs: Unpack[TransformersKwargs],
|
||||
) -> BaseModelOutputWithPooling:
|
||||
r"""
|
||||
bool_masked_pos (`torch.BoolTensor` of shape `(batch_size, sequence_length)`):
|
||||
Boolean masked positions. Indicates which patches are masked (1) and which aren't (0). Only relevant for
|
||||
pre-training.
|
||||
"""
|
||||
|
||||
pixel_values = pixel_values.to(self.embeddings.patch_embeddings.weight.dtype)
|
||||
hidden_states = self.embeddings(pixel_values, bool_masked_pos=bool_masked_pos)
|
||||
position_embeddings = self.rope_embeddings(pixel_values)
|
||||
|
||||
for i, layer_module in enumerate(self.layer):
|
||||
layer_head_mask = head_mask[i] if head_mask is not None else None
|
||||
hidden_states = layer_module(
|
||||
hidden_states,
|
||||
attention_mask=layer_head_mask,
|
||||
position_embeddings=position_embeddings,
|
||||
)
|
||||
|
||||
sequence_output = self.norm(hidden_states)
|
||||
pooled_output = sequence_output[:, 0, :]
|
||||
|
||||
return BaseModelOutputWithPooling(
|
||||
last_hidden_state=sequence_output,
|
||||
pooler_output=pooled_output,
|
||||
)
|
||||
|
||||
|
||||
__all__ = ["DINOv3ViTModel", "DINOv3ViTPreTrainedModel"]
|
||||
429
src/transformers/models/dinov3_vit/modular_dinov3_vit.py
Normal file
429
src/transformers/models/dinov3_vit/modular_dinov3_vit.py
Normal file
@@ -0,0 +1,429 @@
|
||||
# coding=utf-8
|
||||
# Copyright 2025 Meta AI and The HuggingFace Inc. team. All rights reserved.
|
||||
#
|
||||
# Licensed under the Apache License, Version 2.0 (the "License");
|
||||
# you may not use this file except in compliance with the License.
|
||||
# You may obtain a copy of the License at
|
||||
#
|
||||
# http://www.apache.org/licenses/LICENSE-2.0
|
||||
#
|
||||
# Unless required by applicable law or agreed to in writing, software
|
||||
# distributed under the License is distributed on an "AS IS" BASIS,
|
||||
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
|
||||
# See the License for the specific language governing permissions and
|
||||
# limitations under the License.
|
||||
"""PyTorch DINOv3 model."""
|
||||
|
||||
import math
|
||||
from typing import Callable, Optional
|
||||
|
||||
import numpy as np
|
||||
import torch
|
||||
import torch.utils.checkpoint
|
||||
from torch import nn
|
||||
|
||||
from transformers.models.arcee.modeling_arcee import ArceeMLP
|
||||
from transformers.models.dinov2.modeling_dinov2 import (
|
||||
Dinov2DropPath,
|
||||
Dinov2LayerScale,
|
||||
Dinov2PreTrainedModel,
|
||||
eager_attention_forward,
|
||||
)
|
||||
from transformers.models.llama.modeling_llama import LlamaMLP
|
||||
from transformers.models.pixtral.modeling_pixtral import PixtralAttention, rotate_half
|
||||
|
||||
from ...modeling_layers import GradientCheckpointingLayer
|
||||
from ...modeling_outputs import BaseModelOutputWithPooling
|
||||
from ...modeling_utils import ALL_ATTENTION_FUNCTIONS
|
||||
from ...processing_utils import Unpack
|
||||
from ...pytorch_utils import compile_compatible_method_lru_cache
|
||||
from ...utils import TransformersKwargs, auto_docstring, logging
|
||||
from ...utils.generic import check_model_inputs
|
||||
from .configuration_dinov3_vit import DINOv3ViTConfig
|
||||
|
||||
|
||||
logger = logging.get_logger(__name__)
|
||||
|
||||
|
||||
class DINOv3ViTEmbeddings(nn.Module):
|
||||
"""
|
||||
Construct the CLS token, mask token, position and patch embeddings.
|
||||
"""
|
||||
|
||||
def __init__(self, config: DINOv3ViTConfig):
|
||||
super().__init__()
|
||||
self.config = config
|
||||
self.cls_token = nn.Parameter(torch.randn(1, 1, config.hidden_size))
|
||||
self.mask_token = nn.Parameter(torch.zeros(1, 1, config.hidden_size))
|
||||
self.register_tokens = nn.Parameter(torch.empty(1, config.num_register_tokens, config.hidden_size))
|
||||
self.patch_embeddings = nn.Conv2d(
|
||||
config.num_channels, config.hidden_size, kernel_size=config.patch_size, stride=config.patch_size
|
||||
)
|
||||
|
||||
def forward(self, pixel_values: torch.Tensor, bool_masked_pos: Optional[torch.Tensor] = None) -> torch.Tensor:
|
||||
batch_size = pixel_values.shape[0]
|
||||
target_dtype = self.patch_embeddings.weight.dtype
|
||||
|
||||
# (batch_size, num_channels, height, width) -> (batch_size, num_patches, hidden_size)
|
||||
patch_embeddings = self.patch_embeddings(pixel_values.to(dtype=target_dtype))
|
||||
patch_embeddings = patch_embeddings.flatten(2).transpose(1, 2)
|
||||
|
||||
if bool_masked_pos is not None:
|
||||
mask_token = self.mask_token.to(patch_embeddings.dtype)
|
||||
patch_embeddings = torch.where(bool_masked_pos.unsqueeze(-1), mask_token, patch_embeddings)
|
||||
|
||||
# Add CLS and register tokens
|
||||
cls_token = self.cls_token.expand(batch_size, -1, -1)
|
||||
register_tokens = self.register_tokens.expand(batch_size, -1, -1)
|
||||
embeddings = torch.cat([cls_token, register_tokens, patch_embeddings], dim=1)
|
||||
|
||||
return embeddings
|
||||
|
||||
|
||||
@compile_compatible_method_lru_cache(maxsize=32)
|
||||
def get_patches_center_coordinates(
|
||||
num_patches_h: int, num_patches_w: int, dtype: torch.dtype, device: torch.device
|
||||
) -> torch.Tensor:
|
||||
"""
|
||||
Computes the 2D coordinates of the centers of image patches, normalized to the range [-1, +1].
|
||||
The center of each patch is exactly halfway between its top-left and bottom-right corners.
|
||||
|
||||
Args:
|
||||
num_patches_h (int): Number of patches along the vertical (height) axis.
|
||||
num_patches_w (int): Number of patches along the horizontal (width) axis.
|
||||
dtype (torch.dtype): The desired data type of the returned tensor.
|
||||
|
||||
Returns:
|
||||
torch.Tensor: A tensor of shape (height * width, 2), where each row contains the (y, x)
|
||||
coordinates of a patch center, normalized to [-1, +1].
|
||||
"""
|
||||
coords_h = torch.arange(0.5, num_patches_h, dtype=dtype, device=device)
|
||||
coords_w = torch.arange(0.5, num_patches_w, dtype=dtype, device=device)
|
||||
coords_h = coords_h / num_patches_h
|
||||
coords_w = coords_w / num_patches_w
|
||||
# (height, width, 2) -> (height * width, 2)
|
||||
coords = torch.stack(torch.meshgrid(coords_h, coords_w, indexing="ij"), dim=-1)
|
||||
coords = coords.flatten(0, 1)
|
||||
# Shift range [0, 1] to [-1, +1]
|
||||
coords = 2.0 * coords - 1.0
|
||||
return coords
|
||||
|
||||
|
||||
def augment_patches_center_coordinates(
|
||||
coords: torch.Tensor,
|
||||
shift: Optional[float] = None,
|
||||
jitter: Optional[float] = None,
|
||||
rescale: Optional[float] = None,
|
||||
) -> torch.Tensor:
|
||||
# Shift coords by adding a uniform value in [-shift, shift]
|
||||
if shift is not None:
|
||||
shift_hw = torch.empty((1, 2), device=coords.device, dtype=coords.dtype)
|
||||
shift_hw = shift_hw.uniform_(-shift, shift)
|
||||
coords = coords + shift_hw
|
||||
|
||||
# Jitter coords by multiplying the range [-1, 1] by a log-uniform value in [1/jitter, jitter]
|
||||
if jitter is not None:
|
||||
jitter_range = np.log(jitter)
|
||||
jitter_hw = torch.empty((1, 2), device=coords.device, dtype=coords.dtype)
|
||||
jitter_hw = jitter_hw.uniform_(-jitter_range, jitter_range).exp()
|
||||
coords = coords * jitter_hw
|
||||
|
||||
# Rescale coords by multiplying the range [-1, 1] by a log-uniform value in [1/rescale, rescale]
|
||||
if rescale is not None:
|
||||
rescale_range = np.log(rescale)
|
||||
rescale_hw = torch.empty(1, device=coords.device, dtype=coords.dtype)
|
||||
rescale_hw = rescale_hw.uniform_(-rescale_range, rescale_range).exp()
|
||||
coords = coords * rescale_hw
|
||||
|
||||
return coords
|
||||
|
||||
|
||||
class DINOv3ViTRopePositionEmbedding(nn.Module):
|
||||
inv_freq: torch.Tensor
|
||||
|
||||
def __init__(self, config: DINOv3ViTConfig):
|
||||
super().__init__()
|
||||
|
||||
self.config = config
|
||||
self.base = config.rope_theta
|
||||
self.head_dim = config.hidden_size // config.num_attention_heads
|
||||
self.num_patches_h = config.image_size // config.patch_size
|
||||
self.num_patches_w = config.image_size // config.patch_size
|
||||
|
||||
inv_freq = 1 / self.base ** torch.arange(0, 1, 4 / self.head_dim, dtype=torch.float32) # (head_dim / 4,)
|
||||
self.register_buffer("inv_freq", inv_freq, persistent=False)
|
||||
|
||||
def forward(self, pixel_values: torch.Tensor) -> tuple[torch.Tensor, torch.Tensor]:
|
||||
_, _, height, width = pixel_values.shape
|
||||
num_patches_h = height // self.config.patch_size
|
||||
num_patches_w = width // self.config.patch_size
|
||||
|
||||
device = pixel_values.device
|
||||
device_type = device.type if isinstance(device.type, str) and device.type != "mps" else "cpu"
|
||||
|
||||
with torch.autocast(device_type=device_type, enabled=False): # Force float32
|
||||
# Although we could precompute static patch_coords from image_size and patch_size in the config,
|
||||
# the model was trained with random_scale, so it can process images of varying sizes.
|
||||
# Therefore, it's better to compute patch_coords dynamically (with lru_cache).
|
||||
patch_coords = get_patches_center_coordinates(
|
||||
num_patches_h, num_patches_w, dtype=torch.float32, device=device
|
||||
)
|
||||
if self.training:
|
||||
patch_coords = augment_patches_center_coordinates(
|
||||
patch_coords,
|
||||
shift=self.config.pos_embed_shift,
|
||||
jitter=self.config.pos_embed_jitter,
|
||||
rescale=self.config.pos_embed_rescale,
|
||||
)
|
||||
|
||||
# (height * width, 2, head_dim / 4) -> (height * width, head_dim / 2) -> (height * width, head_dim)
|
||||
angles = 2 * math.pi * patch_coords[:, :, None] * self.inv_freq[None, None, :]
|
||||
angles = angles.flatten(1, 2)
|
||||
angles = angles.tile(2)
|
||||
|
||||
cos = torch.cos(angles)
|
||||
sin = torch.sin(angles)
|
||||
|
||||
dtype = pixel_values.dtype
|
||||
return cos.to(dtype=dtype), sin.to(dtype=dtype)
|
||||
|
||||
|
||||
def apply_rotary_pos_emb(
|
||||
q: torch.Tensor, k: torch.Tensor, cos: torch.Tensor, sin: torch.Tensor, **kwargs
|
||||
) -> tuple[torch.Tensor, torch.Tensor]:
|
||||
"""Applies Rotary Position Embedding to the query and key tensors, but only to the patch tokens,
|
||||
ignoring the prefix tokens (cls token and register tokens).
|
||||
|
||||
Args:
|
||||
q (`torch.Tensor`): The query tensor.
|
||||
k (`torch.Tensor`): The key tensor.
|
||||
cos (`torch.Tensor`): The cosine part of the rotary embedding.
|
||||
sin (`torch.Tensor`): The sine part of the rotary embedding.
|
||||
|
||||
Returns:
|
||||
`tuple(torch.Tensor)` comprising of the query and key tensors rotated using the Rotary Position Embedding.
|
||||
"""
|
||||
|
||||
num_tokens = q.shape[-2]
|
||||
num_patches = sin.shape[-2]
|
||||
num_prefix_tokens = num_tokens - num_patches # cls token + register tokens
|
||||
|
||||
q_prefix_tokens, q_patches = q.split((num_prefix_tokens, num_patches), dim=-2)
|
||||
k_prefix_tokens, k_patches = k.split((num_prefix_tokens, num_patches), dim=-2)
|
||||
|
||||
# apply rope only to patch tokens
|
||||
q_patches = (q_patches * cos) + (rotate_half(q_patches) * sin)
|
||||
k_patches = (k_patches * cos) + (rotate_half(k_patches) * sin)
|
||||
|
||||
q = torch.cat((q_prefix_tokens, q_patches), dim=-2)
|
||||
k = torch.cat((k_prefix_tokens, k_patches), dim=-2)
|
||||
|
||||
return q, k
|
||||
|
||||
|
||||
class DINOv3ViTAttention(PixtralAttention):
|
||||
def __init__(self, config: DINOv3ViTConfig):
|
||||
super().__init__(config)
|
||||
|
||||
self.q_proj = nn.Linear(self.embed_dim, self.embed_dim, bias=config.query_bias)
|
||||
self.k_proj = nn.Linear(self.embed_dim, self.embed_dim, bias=config.key_bias)
|
||||
self.v_proj = nn.Linear(self.embed_dim, self.embed_dim, bias=config.value_bias)
|
||||
self.o_proj = nn.Linear(self.embed_dim, self.embed_dim, bias=config.proj_bias)
|
||||
|
||||
def forward(
|
||||
self,
|
||||
hidden_states: torch.Tensor,
|
||||
attention_mask: Optional[torch.Tensor] = None,
|
||||
position_embeddings: Optional[tuple[torch.Tensor, torch.Tensor]] = None,
|
||||
**kwargs: Unpack[TransformersKwargs],
|
||||
) -> tuple[torch.Tensor, Optional[torch.Tensor]]:
|
||||
"""Input shape: Batch x Time x Channel"""
|
||||
|
||||
batch_size, patches, _ = hidden_states.size()
|
||||
|
||||
query_states = self.q_proj(hidden_states)
|
||||
key_states = self.k_proj(hidden_states)
|
||||
value_states = self.v_proj(hidden_states)
|
||||
|
||||
query_states = query_states.view(batch_size, patches, self.num_heads, self.head_dim).transpose(1, 2)
|
||||
key_states = key_states.view(batch_size, patches, self.num_heads, self.head_dim).transpose(1, 2)
|
||||
value_states = value_states.view(batch_size, patches, self.num_heads, self.head_dim).transpose(1, 2)
|
||||
|
||||
cos, sin = position_embeddings
|
||||
query_states, key_states = apply_rotary_pos_emb(query_states, key_states, cos, sin)
|
||||
|
||||
attention_interface: Callable = eager_attention_forward
|
||||
if self.config._attn_implementation != "eager":
|
||||
attention_interface = ALL_ATTENTION_FUNCTIONS[self.config._attn_implementation]
|
||||
|
||||
attn_output, attn_weights = attention_interface(
|
||||
self,
|
||||
query_states,
|
||||
key_states,
|
||||
value_states,
|
||||
attention_mask,
|
||||
dropout=0.0 if not self.training else self.dropout,
|
||||
scaling=self.scaling,
|
||||
**kwargs,
|
||||
)
|
||||
|
||||
attn_output = attn_output.reshape(batch_size, patches, -1).contiguous()
|
||||
attn_output = self.o_proj(attn_output)
|
||||
|
||||
return attn_output, attn_weights
|
||||
|
||||
|
||||
class DINOv3ViTLayerScale(Dinov2LayerScale):
|
||||
pass
|
||||
|
||||
|
||||
class DINOv3ViTDropPath(Dinov2DropPath):
|
||||
pass
|
||||
|
||||
|
||||
class DINOv3ViTMLP(ArceeMLP):
|
||||
pass
|
||||
|
||||
|
||||
class DINOv3ViTGatedMLP(LlamaMLP):
|
||||
pass
|
||||
|
||||
|
||||
class DINOv3ViTLayer(GradientCheckpointingLayer):
|
||||
"""This corresponds to the Block class in the original implementation."""
|
||||
|
||||
def __init__(self, config: DINOv3ViTConfig):
|
||||
super().__init__()
|
||||
|
||||
self.norm1 = nn.LayerNorm(config.hidden_size, eps=config.layer_norm_eps)
|
||||
self.attention = DINOv3ViTAttention(config)
|
||||
self.layer_scale1 = DINOv3ViTLayerScale(config)
|
||||
self.drop_path = DINOv3ViTDropPath(config.drop_path_rate) if config.drop_path_rate > 0.0 else nn.Identity()
|
||||
|
||||
self.norm2 = nn.LayerNorm(config.hidden_size, eps=config.layer_norm_eps)
|
||||
|
||||
if config.use_gated_mlp:
|
||||
self.mlp = DINOv3ViTGatedMLP(config)
|
||||
else:
|
||||
self.mlp = DINOv3ViTMLP(config)
|
||||
self.layer_scale2 = DINOv3ViTLayerScale(config)
|
||||
|
||||
def forward(
|
||||
self,
|
||||
hidden_states: torch.Tensor,
|
||||
attention_mask: Optional[torch.Tensor] = None,
|
||||
position_embeddings: Optional[tuple[torch.Tensor, torch.Tensor]] = None,
|
||||
) -> torch.Tensor:
|
||||
# Attention with residual connection
|
||||
residual = hidden_states
|
||||
hidden_states = self.norm1(hidden_states)
|
||||
hidden_states, _ = self.attention(
|
||||
hidden_states,
|
||||
attention_mask=attention_mask,
|
||||
position_embeddings=position_embeddings,
|
||||
)
|
||||
hidden_states = self.layer_scale1(hidden_states)
|
||||
hidden_states = self.drop_path(hidden_states) + residual
|
||||
|
||||
# MLP with residual connection
|
||||
residual = hidden_states
|
||||
hidden_states = self.norm2(hidden_states)
|
||||
hidden_states = self.mlp(hidden_states)
|
||||
hidden_states = self.layer_scale2(hidden_states)
|
||||
hidden_states = self.drop_path(hidden_states) + residual
|
||||
|
||||
return hidden_states
|
||||
|
||||
|
||||
@auto_docstring
|
||||
class DINOv3ViTPreTrainedModel(Dinov2PreTrainedModel):
|
||||
_can_record_outputs = {
|
||||
"hidden_states": DINOv3ViTLayer,
|
||||
"attentions": DINOv3ViTAttention,
|
||||
}
|
||||
|
||||
def _init_weights(self, module):
|
||||
"""Initialize the weights"""
|
||||
if isinstance(module, (nn.Linear, nn.Conv2d)):
|
||||
# Upcast the input in `fp32` and cast it back to desired `dtype` to avoid
|
||||
# `trunc_normal_cpu` not implemented in `half` issues
|
||||
module.weight.data = nn.init.trunc_normal_(
|
||||
module.weight.data.to(torch.float32),
|
||||
mean=0.0,
|
||||
std=self.config.initializer_range,
|
||||
).to(module.weight.dtype)
|
||||
if module.bias is not None:
|
||||
module.bias.data.zero_()
|
||||
elif isinstance(module, nn.LayerNorm):
|
||||
module.bias.data.zero_()
|
||||
module.weight.data.fill_(1.0)
|
||||
elif isinstance(module, DINOv3ViTEmbeddings):
|
||||
module.cls_token.data = nn.init.trunc_normal_(
|
||||
module.cls_token.data.to(torch.float32),
|
||||
mean=0.0,
|
||||
std=self.config.initializer_range,
|
||||
).to(module.cls_token.dtype)
|
||||
if module.config.num_register_tokens > 0:
|
||||
module.register_tokens.data = nn.init.trunc_normal_(
|
||||
module.register_tokens.data.to(torch.float32),
|
||||
mean=0.0,
|
||||
std=self.config.initializer_range,
|
||||
).to(module.register_tokens.dtype)
|
||||
module.mask_token.data.zero_()
|
||||
elif isinstance(module, DINOv3ViTLayerScale):
|
||||
module.lambda1.data.fill_(self.config.layerscale_value)
|
||||
|
||||
|
||||
@auto_docstring
|
||||
class DINOv3ViTModel(DINOv3ViTPreTrainedModel):
|
||||
def __init__(self, config: DINOv3ViTConfig):
|
||||
super().__init__(config)
|
||||
self.config = config
|
||||
self.embeddings = DINOv3ViTEmbeddings(config)
|
||||
self.rope_embeddings = DINOv3ViTRopePositionEmbedding(config)
|
||||
self.layer = nn.ModuleList([DINOv3ViTLayer(config) for _ in range(config.num_hidden_layers)])
|
||||
self.norm = nn.LayerNorm(config.hidden_size, eps=config.layer_norm_eps)
|
||||
self.gradient_checkpointing = False
|
||||
# Initialize weights and apply final processing
|
||||
self.post_init()
|
||||
|
||||
def get_input_embeddings(self):
|
||||
return self.embeddings.patch_embeddings
|
||||
|
||||
@check_model_inputs
|
||||
@auto_docstring
|
||||
def forward(
|
||||
self,
|
||||
pixel_values: torch.Tensor,
|
||||
bool_masked_pos: Optional[torch.Tensor] = None,
|
||||
head_mask: Optional[torch.Tensor] = None,
|
||||
**kwargs: Unpack[TransformersKwargs],
|
||||
) -> BaseModelOutputWithPooling:
|
||||
r"""
|
||||
bool_masked_pos (`torch.BoolTensor` of shape `(batch_size, sequence_length)`):
|
||||
Boolean masked positions. Indicates which patches are masked (1) and which aren't (0). Only relevant for
|
||||
pre-training.
|
||||
"""
|
||||
|
||||
pixel_values = pixel_values.to(self.embeddings.patch_embeddings.weight.dtype)
|
||||
hidden_states = self.embeddings(pixel_values, bool_masked_pos=bool_masked_pos)
|
||||
position_embeddings = self.rope_embeddings(pixel_values)
|
||||
|
||||
for i, layer_module in enumerate(self.layer):
|
||||
layer_head_mask = head_mask[i] if head_mask is not None else None
|
||||
hidden_states = layer_module(
|
||||
hidden_states,
|
||||
attention_mask=layer_head_mask,
|
||||
position_embeddings=position_embeddings,
|
||||
)
|
||||
|
||||
sequence_output = self.norm(hidden_states)
|
||||
pooled_output = sequence_output[:, 0, :]
|
||||
|
||||
return BaseModelOutputWithPooling(
|
||||
last_hidden_state=sequence_output,
|
||||
pooler_output=pooled_output,
|
||||
)
|
||||
|
||||
|
||||
__all__ = ["DINOv3ViTModel", "DINOv3ViTPreTrainedModel"]
|
||||
@@ -339,6 +339,7 @@ def isin_mps_friendly(elements: torch.Tensor, test_elements: torch.Tensor | int)
|
||||
return torch.isin(elements, test_elements)
|
||||
|
||||
|
||||
@wraps(lru_cache)
|
||||
def compile_compatible_method_lru_cache(*lru_args, **lru_kwargs):
|
||||
"""
|
||||
LRU cache decorator from standard functools library, but with a workaround to disable
|
||||
@@ -346,19 +347,14 @@ def compile_compatible_method_lru_cache(*lru_args, **lru_kwargs):
|
||||
"""
|
||||
|
||||
def decorator(func):
|
||||
func_with_cache = lru_cache(*lru_args, **lru_kwargs)(func)
|
||||
|
||||
@wraps(func)
|
||||
def wrapper(self, *args, **kwargs):
|
||||
if not is_torchdynamo_compiling():
|
||||
# Cache the function only if the model is not being compiled
|
||||
# check if the function is already cached, otherwise create it
|
||||
if not hasattr(self, f"_cached_{func.__name__}"):
|
||||
self.__setattr__(
|
||||
f"_cached_{func.__name__}", lru_cache(*lru_args, **lru_kwargs)(func.__get__(self))
|
||||
)
|
||||
return self.__getattribute__(f"_cached_{func.__name__}")(*args, **kwargs)
|
||||
def wrapper(*args, **kwargs):
|
||||
if is_torchdynamo_compiling():
|
||||
return func(*args, **kwargs)
|
||||
else:
|
||||
# Otherwise, just call the original function
|
||||
return func(self, *args, **kwargs)
|
||||
return func_with_cache(*args, **kwargs)
|
||||
|
||||
return wrapper
|
||||
|
||||
|
||||
@@ -129,6 +129,8 @@ _REGULAR_SUPPORTED_MODEL_NAMES_AND_TASKS = [
|
||||
"deberta",
|
||||
"deberta-v2",
|
||||
"dinov2",
|
||||
"dinov3_convnext",
|
||||
"dinov3_vit",
|
||||
"distilbert",
|
||||
"donut-swin",
|
||||
"electra",
|
||||
|
||||
0
tests/models/dinov3_convnext/__init__.py
Normal file
0
tests/models/dinov3_convnext/__init__.py
Normal file
242
tests/models/dinov3_convnext/test_modeling_dinov3_convnext.py
Normal file
242
tests/models/dinov3_convnext/test_modeling_dinov3_convnext.py
Normal file
@@ -0,0 +1,242 @@
|
||||
# Copyright 2022 The HuggingFace Inc. team. All rights reserved.
|
||||
#
|
||||
# Licensed under the Apache License, Version 2.0 (the "License");
|
||||
# you may not use this file except in compliance with the License.
|
||||
# You may obtain a copy of the License at
|
||||
#
|
||||
# http://www.apache.org/licenses/LICENSE-2.0
|
||||
#
|
||||
# Unless required by applicable law or agreed to in writing, software
|
||||
# distributed under the License is distributed on an "AS IS" BASIS,
|
||||
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
|
||||
# See the License for the specific language governing permissions and
|
||||
# limitations under the License.
|
||||
"""Testing suite for the PyTorch ConvNext model."""
|
||||
|
||||
import unittest
|
||||
|
||||
from transformers import DINOv3ConvNextConfig
|
||||
from transformers.testing_utils import require_torch, require_vision, slow, torch_device
|
||||
from transformers.utils import cached_property, is_torch_available, is_vision_available
|
||||
|
||||
from ...test_configuration_common import ConfigTester
|
||||
from ...test_modeling_common import ModelTesterMixin, floats_tensor, ids_tensor
|
||||
from ...test_pipeline_mixin import PipelineTesterMixin
|
||||
|
||||
|
||||
if is_torch_available():
|
||||
import torch
|
||||
|
||||
from transformers import DINOv3ConvNextModel
|
||||
|
||||
|
||||
if is_vision_available():
|
||||
from PIL import Image
|
||||
|
||||
from transformers import AutoImageProcessor
|
||||
|
||||
|
||||
class DINOv3ConvNextModelTester:
|
||||
def __init__(
|
||||
self,
|
||||
parent,
|
||||
batch_size=13,
|
||||
image_size=32,
|
||||
num_channels=3,
|
||||
hidden_sizes=[10, 20, 30, 40],
|
||||
depths=[2, 2, 3, 2],
|
||||
is_training=False,
|
||||
use_labels=True,
|
||||
intermediate_size=37,
|
||||
hidden_act="gelu",
|
||||
num_labels=10,
|
||||
initializer_range=0.02,
|
||||
scope=None,
|
||||
):
|
||||
self.parent = parent
|
||||
self.batch_size = batch_size
|
||||
self.image_size = image_size
|
||||
self.num_channels = num_channels
|
||||
self.hidden_sizes = hidden_sizes
|
||||
self.depths = depths
|
||||
self.is_training = is_training
|
||||
self.use_labels = use_labels
|
||||
self.intermediate_size = intermediate_size
|
||||
self.hidden_act = hidden_act
|
||||
self.num_labels = num_labels
|
||||
self.initializer_range = initializer_range
|
||||
self.scope = scope
|
||||
|
||||
def prepare_config_and_inputs(self):
|
||||
pixel_values = floats_tensor([self.batch_size, self.num_channels, self.image_size, self.image_size])
|
||||
|
||||
labels = None
|
||||
if self.use_labels:
|
||||
labels = ids_tensor([self.batch_size], self.num_labels)
|
||||
|
||||
config = self.get_config()
|
||||
return config, pixel_values, labels
|
||||
|
||||
def get_config(self):
|
||||
return DINOv3ConvNextConfig(
|
||||
num_channels=self.num_channels,
|
||||
hidden_sizes=self.hidden_sizes,
|
||||
depths=self.depths,
|
||||
hidden_act=self.hidden_act,
|
||||
is_decoder=False,
|
||||
initializer_range=self.initializer_range,
|
||||
num_labels=self.num_labels,
|
||||
)
|
||||
|
||||
def create_and_check_model(self, config, pixel_values, labels):
|
||||
model = DINOv3ConvNextModel(config=config)
|
||||
model.to(torch_device)
|
||||
model.eval()
|
||||
result = model(pixel_values)
|
||||
# expected last hidden states: B, C, H // 32, W // 32
|
||||
self.parent.assertEqual(
|
||||
result.last_hidden_state.shape,
|
||||
(
|
||||
self.batch_size,
|
||||
1 + self.image_size // 32 * self.image_size // 32,
|
||||
self.hidden_sizes[-1],
|
||||
),
|
||||
)
|
||||
|
||||
def prepare_config_and_inputs_for_common(self):
|
||||
config_and_inputs = self.prepare_config_and_inputs()
|
||||
config, pixel_values, labels = config_and_inputs
|
||||
inputs_dict = {"pixel_values": pixel_values}
|
||||
return config, inputs_dict
|
||||
|
||||
|
||||
@require_torch
|
||||
class DINOv3ConvNextModelTest(ModelTesterMixin, PipelineTesterMixin, unittest.TestCase):
|
||||
"""
|
||||
Here we also overwrite some of the tests of test_modeling_common.py, as ConvNext does not use input_ids, inputs_embeds,
|
||||
attention_mask and seq_length.
|
||||
"""
|
||||
|
||||
all_model_classes = (DINOv3ConvNextModel,) if is_torch_available() else ()
|
||||
pipeline_model_mapping = {"image-feature-extraction": DINOv3ConvNextModel} if is_torch_available() else {}
|
||||
|
||||
fx_compatible = False
|
||||
test_pruning = False
|
||||
test_resize_embeddings = False
|
||||
test_head_masking = False
|
||||
has_attentions = False
|
||||
test_torch_exportable = True
|
||||
|
||||
def setUp(self):
|
||||
self.model_tester = DINOv3ConvNextModelTester(self)
|
||||
self.config_tester = ConfigTester(
|
||||
self,
|
||||
config_class=DINOv3ConvNextConfig,
|
||||
has_text_modality=False,
|
||||
hidden_size=37,
|
||||
common_properties=["num_channels", "hidden_sizes"],
|
||||
)
|
||||
|
||||
def test_config(self):
|
||||
self.config_tester.run_common_tests()
|
||||
|
||||
@unittest.skip(reason="DINOv3ConvNext does not use inputs_embeds")
|
||||
def test_inputs_embeds(self):
|
||||
pass
|
||||
|
||||
@unittest.skip(reason="DINOv3ConvNext does not support input and output embeddings")
|
||||
def test_model_get_set_embeddings(self):
|
||||
pass
|
||||
|
||||
@unittest.skip(reason="DINOv3ConvNext does not use feedforward chunking")
|
||||
def test_feed_forward_chunking(self):
|
||||
pass
|
||||
|
||||
def test_model(self):
|
||||
config_and_inputs = self.model_tester.prepare_config_and_inputs()
|
||||
self.model_tester.create_and_check_model(*config_and_inputs)
|
||||
|
||||
def test_hidden_states_output(self):
|
||||
def check_hidden_states_output(inputs_dict, config, model_class):
|
||||
model = model_class(config)
|
||||
model.to(torch_device)
|
||||
model.eval()
|
||||
|
||||
with torch.no_grad():
|
||||
outputs = model(**self._prepare_for_class(inputs_dict, model_class))
|
||||
|
||||
hidden_states = outputs.encoder_hidden_states if config.is_encoder_decoder else outputs.hidden_states
|
||||
|
||||
self.assertEqual(len(hidden_states), 5)
|
||||
|
||||
# DINOv3ConvNext's feature maps are of shape (batch_size, num_channels, height, width)
|
||||
self.assertListEqual(
|
||||
list(hidden_states[1].shape[-2:]),
|
||||
[self.model_tester.image_size // 4, self.model_tester.image_size // 4],
|
||||
)
|
||||
|
||||
config, inputs_dict = self.model_tester.prepare_config_and_inputs_for_common()
|
||||
|
||||
for model_class in self.all_model_classes:
|
||||
inputs_dict["output_hidden_states"] = True
|
||||
check_hidden_states_output(inputs_dict, config, model_class)
|
||||
|
||||
# check that output_hidden_states also work using config
|
||||
del inputs_dict["output_hidden_states"]
|
||||
config.output_hidden_states = True
|
||||
|
||||
check_hidden_states_output(inputs_dict, config, model_class)
|
||||
|
||||
@slow
|
||||
def test_model_from_pretrained(self):
|
||||
model_name = "facebook/dinov3-convnext-tiny-pretrain-lvd1689m"
|
||||
model = DINOv3ConvNextModel.from_pretrained(model_name)
|
||||
self.assertIsNotNone(model)
|
||||
|
||||
@unittest.skip(reason="DINOv3ConvNext does not retain grads for first hidden state (original pixel_values)")
|
||||
def test_retain_grad_hidden_states_attentions(self):
|
||||
pass
|
||||
|
||||
|
||||
# We will verify our results on an image of cute cats
|
||||
def prepare_img():
|
||||
image = Image.open("./tests/fixtures/tests_samples/COCO/000000039769.png")
|
||||
return image
|
||||
|
||||
|
||||
@require_torch
|
||||
@require_vision
|
||||
class DINOv3ConvNextModelIntegrationTest(unittest.TestCase):
|
||||
@cached_property
|
||||
def default_image_processor(self):
|
||||
return (
|
||||
AutoImageProcessor.from_pretrained("facebook/dinov3-convnext-tiny-pretrain-lvd1689m")
|
||||
if is_vision_available()
|
||||
else None
|
||||
)
|
||||
|
||||
@slow
|
||||
def test_inference_no_head(self):
|
||||
model = DINOv3ConvNextModel.from_pretrained("facebook/dinov3-convnext-tiny-pretrain-lvd1689m").to(torch_device)
|
||||
|
||||
image_processor = self.default_image_processor
|
||||
image = prepare_img()
|
||||
inputs = image_processor(image, return_tensors="pt").to(torch_device)
|
||||
|
||||
# forward pass
|
||||
with torch.no_grad():
|
||||
outputs = model(**inputs)
|
||||
|
||||
# verify the last hidden states
|
||||
_, _, height, width = inputs["pixel_values"].shape
|
||||
expected_seq_length = (height * width) // 4 ** (model.config.num_stages + 1) + 1 # +1 for the "CLS" token
|
||||
expected_shape = torch.Size((1, expected_seq_length, model.config.hidden_sizes[-1]))
|
||||
self.assertEqual(outputs.last_hidden_state.shape, expected_shape)
|
||||
|
||||
last_layer_cls_token = outputs.pooler_output
|
||||
expected_slice = torch.tensor([-6.3721, 1.3008, 2.0743, -0.0800, 0.6072], device=torch_device)
|
||||
torch.testing.assert_close(last_layer_cls_token[0, :5], expected_slice, rtol=1e-4, atol=1e-4)
|
||||
|
||||
last_layer_patch_tokens = outputs.last_hidden_state[:, 1:]
|
||||
expected_slice = torch.tensor([0.4905, -3.7135, 1.8485, -1.0403, -1.0908], device=torch_device)
|
||||
torch.testing.assert_close(last_layer_patch_tokens[0, 0, :5], expected_slice, rtol=1e-4, atol=1e-4)
|
||||
0
tests/models/dinov3_vit/__init__.py
Normal file
0
tests/models/dinov3_vit/__init__.py
Normal file
127
tests/models/dinov3_vit/test_image_processing_dinov3_vit_fast.py
Normal file
127
tests/models/dinov3_vit/test_image_processing_dinov3_vit_fast.py
Normal file
@@ -0,0 +1,127 @@
|
||||
# coding=utf-8
|
||||
# Copyright 2024 The HuggingFace Inc. team. All rights reserved.
|
||||
#
|
||||
# Licensed under the Apache License, Version 2.0 (the "License");
|
||||
# you may not use this file except in compliance with the License.
|
||||
# You may obtain a copy of the License at
|
||||
#
|
||||
# http://www.apache.org/licenses/LICENSE-2.0
|
||||
#
|
||||
# Unless required by applicable law or agreed to in writing, software
|
||||
# distributed under the License is distributed on an "AS IS" BASIS,
|
||||
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
|
||||
# See the License for the specific language governing permissions and
|
||||
# limitations under the License.
|
||||
|
||||
import unittest
|
||||
|
||||
from transformers.testing_utils import require_torch, require_vision
|
||||
from transformers.utils import is_torchvision_available
|
||||
|
||||
from ...test_image_processing_common import ImageProcessingTestMixin, prepare_image_inputs
|
||||
|
||||
|
||||
if is_torchvision_available():
|
||||
from transformers import DINOv3ViTImageProcessorFast
|
||||
|
||||
|
||||
class DINOv3ViTImageProcessingTester:
|
||||
def __init__(
|
||||
self,
|
||||
parent,
|
||||
batch_size=7,
|
||||
num_channels=3,
|
||||
image_size=18,
|
||||
min_resolution=30,
|
||||
max_resolution=400,
|
||||
do_resize=True,
|
||||
size=None,
|
||||
do_center_crop=True,
|
||||
crop_size=None,
|
||||
do_normalize=True,
|
||||
image_mean=[0.48145466, 0.4578275, 0.40821073],
|
||||
image_std=[0.26862954, 0.26130258, 0.27577711],
|
||||
do_convert_rgb=True,
|
||||
):
|
||||
super().__init__()
|
||||
size = size if size is not None else {"shortest_edge": 20}
|
||||
crop_size = crop_size if crop_size is not None else {"height": 18, "width": 18}
|
||||
self.parent = parent
|
||||
self.batch_size = batch_size
|
||||
self.num_channels = num_channels
|
||||
self.image_size = image_size
|
||||
self.min_resolution = min_resolution
|
||||
self.max_resolution = max_resolution
|
||||
self.do_resize = do_resize
|
||||
self.size = size
|
||||
self.do_center_crop = do_center_crop
|
||||
self.crop_size = crop_size
|
||||
self.do_normalize = do_normalize
|
||||
self.image_mean = image_mean
|
||||
self.image_std = image_std
|
||||
self.do_convert_rgb = do_convert_rgb
|
||||
|
||||
def prepare_image_processor_dict(self):
|
||||
return {
|
||||
"do_resize": self.do_resize,
|
||||
"size": self.size,
|
||||
"do_center_crop": self.do_center_crop,
|
||||
"crop_size": self.crop_size,
|
||||
"do_normalize": self.do_normalize,
|
||||
"image_mean": self.image_mean,
|
||||
"image_std": self.image_std,
|
||||
"do_convert_rgb": self.do_convert_rgb,
|
||||
}
|
||||
|
||||
def expected_output_image_shape(self, images):
|
||||
return self.num_channels, self.crop_size["height"], self.crop_size["width"]
|
||||
|
||||
def prepare_image_inputs(self, equal_resolution=False, numpify=False, torchify=False):
|
||||
return prepare_image_inputs(
|
||||
batch_size=self.batch_size,
|
||||
num_channels=self.num_channels,
|
||||
min_resolution=self.min_resolution,
|
||||
max_resolution=self.max_resolution,
|
||||
equal_resolution=equal_resolution,
|
||||
numpify=numpify,
|
||||
torchify=torchify,
|
||||
)
|
||||
|
||||
|
||||
@require_torch
|
||||
@require_vision
|
||||
class DINOv3ViTImageProcessingTest(ImageProcessingTestMixin, unittest.TestCase):
|
||||
image_processing_class = None
|
||||
fast_image_processing_class = DINOv3ViTImageProcessorFast if is_torchvision_available() else None
|
||||
test_slow_image_processor = False
|
||||
|
||||
def setUp(self):
|
||||
super().setUp()
|
||||
self.image_processor_tester = DINOv3ViTImageProcessingTester(self)
|
||||
|
||||
@property
|
||||
def image_processor_dict(self):
|
||||
return self.image_processor_tester.prepare_image_processor_dict()
|
||||
|
||||
def test_image_processor_properties(self):
|
||||
for image_processing_class in self.image_processor_list:
|
||||
image_processing = image_processing_class(**self.image_processor_dict)
|
||||
self.assertTrue(hasattr(image_processing, "do_resize"))
|
||||
self.assertTrue(hasattr(image_processing, "size"))
|
||||
self.assertTrue(hasattr(image_processing, "do_center_crop"))
|
||||
self.assertTrue(hasattr(image_processing, "center_crop"))
|
||||
self.assertTrue(hasattr(image_processing, "do_normalize"))
|
||||
self.assertTrue(hasattr(image_processing, "image_mean"))
|
||||
self.assertTrue(hasattr(image_processing, "image_std"))
|
||||
self.assertTrue(hasattr(image_processing, "do_convert_rgb"))
|
||||
|
||||
def test_image_processor_from_dict_with_kwargs(self):
|
||||
for image_processing_class in self.image_processor_list:
|
||||
image_processor = image_processing_class.from_dict(self.image_processor_dict)
|
||||
self.assertEqual(image_processor.size, {"shortest_edge": 20})
|
||||
self.assertEqual(image_processor.crop_size, {"height": 18, "width": 18})
|
||||
|
||||
image_processor = image_processing_class.from_dict(
|
||||
self.image_processor_dict, size={"height": 42, "width": 42}
|
||||
)
|
||||
self.assertEqual(image_processor.size, {"height": 42, "width": 42})
|
||||
278
tests/models/dinov3_vit/test_modeling_dinov3_vit.py
Normal file
278
tests/models/dinov3_vit/test_modeling_dinov3_vit.py
Normal file
@@ -0,0 +1,278 @@
|
||||
# Copyright 2024 The HuggingFace Inc. team. All rights reserved.
|
||||
#
|
||||
# Licensed under the Apache License, Version 2.0 (the "License");
|
||||
# you may not use this file except in compliance with the License.
|
||||
# You may obtain a copy of the License at
|
||||
#
|
||||
# http://www.apache.org/licenses/LICENSE-2.0
|
||||
#
|
||||
# Unless required by applicable law or agreed to in writing, software
|
||||
# distributed under the License is distributed on an "AS IS" BASIS,
|
||||
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
|
||||
# See the License for the specific language governing permissions and
|
||||
# limitations under the License.
|
||||
"""Testing suite for the PyTorch DINOv3 model."""
|
||||
|
||||
import unittest
|
||||
|
||||
from transformers import DINOv3ViTConfig
|
||||
from transformers.testing_utils import require_torch, require_vision, slow, torch_device
|
||||
from transformers.utils import cached_property, is_torch_available, is_vision_available
|
||||
|
||||
from ...test_configuration_common import ConfigTester
|
||||
from ...test_modeling_common import ModelTesterMixin, _config_zero_init, floats_tensor, ids_tensor
|
||||
from ...test_pipeline_mixin import PipelineTesterMixin
|
||||
|
||||
|
||||
if is_torch_available():
|
||||
import torch
|
||||
from torch import nn
|
||||
|
||||
from transformers import DINOv3ViTModel
|
||||
|
||||
|
||||
if is_vision_available():
|
||||
from PIL import Image
|
||||
|
||||
from transformers import AutoImageProcessor
|
||||
|
||||
|
||||
class DINOv3ViTModelTester:
|
||||
def __init__(
|
||||
self,
|
||||
parent,
|
||||
batch_size=13,
|
||||
image_size=30,
|
||||
patch_size=2,
|
||||
num_channels=3,
|
||||
is_training=False,
|
||||
use_labels=True,
|
||||
hidden_size=32,
|
||||
num_hidden_layers=2,
|
||||
num_attention_heads=4,
|
||||
intermediate_size=37,
|
||||
hidden_act="gelu",
|
||||
hidden_dropout_prob=0.1,
|
||||
attention_probs_dropout_prob=0.1,
|
||||
type_sequence_label_size=10,
|
||||
initializer_range=0.02,
|
||||
num_register_tokens=2,
|
||||
mask_ratio=0.5,
|
||||
scope=None,
|
||||
):
|
||||
self.parent = parent
|
||||
self.batch_size = batch_size
|
||||
self.image_size = image_size
|
||||
self.patch_size = patch_size
|
||||
self.num_channels = num_channels
|
||||
self.is_training = is_training
|
||||
self.use_labels = use_labels
|
||||
self.hidden_size = hidden_size
|
||||
self.num_hidden_layers = num_hidden_layers
|
||||
self.num_attention_heads = num_attention_heads
|
||||
self.intermediate_size = intermediate_size
|
||||
self.hidden_act = hidden_act
|
||||
self.hidden_dropout_prob = hidden_dropout_prob
|
||||
self.attention_probs_dropout_prob = attention_probs_dropout_prob
|
||||
self.type_sequence_label_size = type_sequence_label_size
|
||||
self.initializer_range = initializer_range
|
||||
self.num_register_tokens = num_register_tokens
|
||||
self.scope = scope
|
||||
|
||||
num_patches = (image_size // patch_size) ** 2
|
||||
self.seq_length = num_patches + 1 + self.num_register_tokens
|
||||
self.mask_ratio = mask_ratio
|
||||
self.num_masks = int(mask_ratio * self.seq_length)
|
||||
self.mask_length = num_patches
|
||||
|
||||
def prepare_config_and_inputs(self):
|
||||
pixel_values = floats_tensor([self.batch_size, self.num_channels, self.image_size, self.image_size])
|
||||
|
||||
labels = None
|
||||
if self.use_labels:
|
||||
labels = ids_tensor([self.batch_size], self.type_sequence_label_size)
|
||||
|
||||
config = self.get_config()
|
||||
|
||||
return config, pixel_values, labels
|
||||
|
||||
def get_config(self):
|
||||
return DINOv3ViTConfig(
|
||||
image_size=self.image_size,
|
||||
patch_size=self.patch_size,
|
||||
num_channels=self.num_channels,
|
||||
hidden_size=self.hidden_size,
|
||||
num_hidden_layers=self.num_hidden_layers,
|
||||
num_attention_heads=self.num_attention_heads,
|
||||
intermediate_size=self.intermediate_size,
|
||||
hidden_act=self.hidden_act,
|
||||
hidden_dropout_prob=self.hidden_dropout_prob,
|
||||
attention_probs_dropout_prob=self.attention_probs_dropout_prob,
|
||||
is_decoder=False,
|
||||
initializer_range=self.initializer_range,
|
||||
num_register_tokens=self.num_register_tokens,
|
||||
)
|
||||
|
||||
def create_and_check_model(self, config, pixel_values, labels):
|
||||
model = DINOv3ViTModel(config=config)
|
||||
model.to(torch_device)
|
||||
model.eval()
|
||||
result = model(pixel_values)
|
||||
self.parent.assertEqual(
|
||||
result.last_hidden_state.shape,
|
||||
(self.batch_size, self.seq_length, self.hidden_size),
|
||||
)
|
||||
|
||||
def prepare_config_and_inputs_for_common(self):
|
||||
config_and_inputs = self.prepare_config_and_inputs()
|
||||
(
|
||||
config,
|
||||
pixel_values,
|
||||
labels,
|
||||
) = config_and_inputs
|
||||
inputs_dict = {"pixel_values": pixel_values}
|
||||
return config, inputs_dict
|
||||
|
||||
|
||||
@require_torch
|
||||
class Dinov3ModelTest(ModelTesterMixin, PipelineTesterMixin, unittest.TestCase):
|
||||
"""
|
||||
Here we also overwrite some of the tests of test_modeling_common.py, as Dinov3 does not use input_ids, inputs_embeds,
|
||||
attention_mask and seq_length.
|
||||
"""
|
||||
|
||||
all_model_classes = (DINOv3ViTModel,) if is_torch_available() else ()
|
||||
pipeline_model_mapping = (
|
||||
{
|
||||
"image-feature-extraction": DINOv3ViTModel,
|
||||
}
|
||||
if is_torch_available()
|
||||
else {}
|
||||
)
|
||||
fx_compatible = False
|
||||
|
||||
test_pruning = False
|
||||
test_resize_embeddings = False
|
||||
test_head_masking = False
|
||||
test_torch_exportable = True
|
||||
|
||||
def setUp(self):
|
||||
self.model_tester = DINOv3ViTModelTester(self)
|
||||
self.config_tester = ConfigTester(self, config_class=DINOv3ViTConfig, has_text_modality=False, hidden_size=37)
|
||||
|
||||
def test_initialization(self):
|
||||
config, inputs_dict = self.model_tester.prepare_config_and_inputs_for_common()
|
||||
|
||||
configs_no_init = _config_zero_init(config)
|
||||
for model_class in self.all_model_classes:
|
||||
model = model_class(config=configs_no_init)
|
||||
for name, param in model.named_parameters():
|
||||
if param.requires_grad and "register_tokens" not in name:
|
||||
# See PR #38607 (to avoid flakiness)
|
||||
data = torch.flatten(param.data)
|
||||
n_elements = torch.numel(data)
|
||||
# skip 2.5% of elements on each side to avoid issues caused by `nn.init.trunc_normal_` described in
|
||||
# https://github.com/huggingface/transformers/pull/27906#issuecomment-1846951332
|
||||
n_elements_to_skip_on_each_side = int(n_elements * 0.025)
|
||||
data_to_check = torch.sort(data).values
|
||||
if n_elements_to_skip_on_each_side > 0:
|
||||
data_to_check = data_to_check[n_elements_to_skip_on_each_side:-n_elements_to_skip_on_each_side]
|
||||
self.assertIn(
|
||||
((data_to_check.mean() * 1e9).round() / 1e9).item(),
|
||||
[0.0, 1.0],
|
||||
msg=f"Parameter {name} of model {model_class} seems not properly initialized",
|
||||
)
|
||||
|
||||
def test_config(self):
|
||||
self.config_tester.run_common_tests()
|
||||
|
||||
@unittest.skip(reason="Dinov3 does not use inputs_embeds")
|
||||
def test_inputs_embeds(self):
|
||||
pass
|
||||
|
||||
@unittest.skip(
|
||||
reason="This architecture seem to not compute gradients properly when using GC, check: https://github.com/huggingface/transformers/pull/27124"
|
||||
)
|
||||
def test_training_gradient_checkpointing(self):
|
||||
pass
|
||||
|
||||
@unittest.skip(
|
||||
reason="This architecture seem to not compute gradients properly when using GC, check: https://github.com/huggingface/transformers/pull/27124"
|
||||
)
|
||||
def test_training_gradient_checkpointing_use_reentrant(self):
|
||||
pass
|
||||
|
||||
@unittest.skip(
|
||||
reason="This architecture seem to not compute gradients properly when using GC, check: https://github.com/huggingface/transformers/pull/27124"
|
||||
)
|
||||
def test_training_gradient_checkpointing_use_reentrant_false(self):
|
||||
pass
|
||||
|
||||
def test_model_get_set_embeddings(self):
|
||||
config, _ = self.model_tester.prepare_config_and_inputs_for_common()
|
||||
|
||||
for model_class in self.all_model_classes:
|
||||
model = model_class(config)
|
||||
self.assertIsInstance(model.get_input_embeddings(), (nn.Module))
|
||||
x = model.get_output_embeddings()
|
||||
self.assertTrue(x is None or isinstance(x, nn.Linear))
|
||||
|
||||
def test_model(self):
|
||||
config_and_inputs = self.model_tester.prepare_config_and_inputs()
|
||||
self.model_tester.create_and_check_model(*config_and_inputs)
|
||||
|
||||
@unittest.skip(reason="Dinov3 does not support feedforward chunking yet")
|
||||
def test_feed_forward_chunking(self):
|
||||
pass
|
||||
|
||||
@slow
|
||||
def test_model_from_pretrained(self):
|
||||
model_name = "facebook/dinov3-vits16-pretrain-lvd1689m"
|
||||
model = DINOv3ViTModel.from_pretrained(model_name)
|
||||
self.assertIsNotNone(model)
|
||||
|
||||
|
||||
# We will verify our results on an image of cute cats
|
||||
def prepare_img():
|
||||
image = Image.open("./tests/fixtures/tests_samples/COCO/000000039769.png")
|
||||
return image
|
||||
|
||||
|
||||
@require_torch
|
||||
@require_vision
|
||||
class DINOv3ViTModelIntegrationTest(unittest.TestCase):
|
||||
@cached_property
|
||||
def default_image_processor(self):
|
||||
return (
|
||||
AutoImageProcessor.from_pretrained("facebook/dinov3-vits16-pretrain-lvd1689m")
|
||||
if is_vision_available()
|
||||
else None
|
||||
)
|
||||
|
||||
@slow
|
||||
def test_inference_no_head(self):
|
||||
model = DINOv3ViTModel.from_pretrained("facebook/dinov3-vits16-pretrain-lvd1689m").to(torch_device)
|
||||
|
||||
image_processor = self.default_image_processor
|
||||
image = prepare_img()
|
||||
inputs = image_processor(image, return_tensors="pt").to(torch_device)
|
||||
|
||||
# forward pass
|
||||
with torch.no_grad():
|
||||
outputs = model(**inputs)
|
||||
|
||||
# verify the last hidden states
|
||||
# in DINOv3 with Registers, the seq length equals the number of patches + 1 + num_register_tokens (we add 1 for the [CLS] token)
|
||||
_, _, height, width = inputs["pixel_values"].shape
|
||||
num_patches = (height // model.config.patch_size) * (width // model.config.patch_size)
|
||||
expected_seq_length = num_patches + 1 + model.config.num_register_tokens
|
||||
expected_shape = torch.Size((1, expected_seq_length, model.config.hidden_size))
|
||||
self.assertEqual(outputs.last_hidden_state.shape, expected_shape)
|
||||
|
||||
last_layer_cls_token = outputs.pooler_output
|
||||
expected_slice = torch.tensor([0.4637, -0.4160, 0.4086, -0.1265, -0.2865], device=torch_device)
|
||||
torch.testing.assert_close(last_layer_cls_token[0, :5], expected_slice, rtol=1e-4, atol=1e-4)
|
||||
|
||||
last_layer_patch_tokens = outputs.last_hidden_state[:, model.config.num_register_tokens + 1 :]
|
||||
expected_slice = torch.tensor([-0.0386, -0.2509, -0.0161, -0.4556, 0.5716], device=torch_device)
|
||||
torch.testing.assert_close(last_layer_patch_tokens[0, 0, :5], expected_slice, rtol=1e-4, atol=1e-4)
|
||||
@@ -169,6 +169,8 @@ OBJECTS_TO_IGNORE = [
|
||||
"DetrConfig",
|
||||
"DetrImageProcessor",
|
||||
"DinatModel",
|
||||
"DINOv3ConvNextConfig",
|
||||
"DINOv3ViTConfig",
|
||||
"DistilBertConfig",
|
||||
"DistilBertTokenizerFast",
|
||||
"DocumentQuestionAnsweringPipeline",
|
||||
|
||||
@@ -411,6 +411,8 @@ MODEL_TYPE_TO_DOC_MAPPING = OrderedDict(
|
||||
("data2vec-audio", "data2vec"),
|
||||
("data2vec-vision", "data2vec"),
|
||||
("donut-swin", "donut"),
|
||||
("dinov3_convnext", "dinov3"),
|
||||
("dinov3_vit", "dinov3"),
|
||||
]
|
||||
)
|
||||
|
||||
|
||||
Reference in New Issue
Block a user