Add EfficientLoFTR model (#36355)

* initial commit

* Apply suggestions from code review

Co-authored-by: Pavel Iakubovskii <qubvel@gmail.com>

* fix: various typos, typehints, refactors from suggestions

* fix: fine_matching method

* Added EfficientLoFTRModel and AutoModelForKeypointMatching class

* fix: got rid of compilation breaking instructions

* docs: added todo for plot

* fix: used correct hub repo

* docs: added comments

* fix: run modular

* doc: added PyTorch badge

* fix: model repo typo in config

* fix: make modular

* fix: removed mask values from outputs

* feat: added plot_keypoint_matching to EfficientLoFTRImageProcessor

* feat: added SuperGlueForKeypointMatching to AutoModelForKeypointMatching list

* fix: reformat

* refactor: renamed aggregation_sizes config parameter into q, kv aggregation kernel size and stride

* doc: added q, kv aggregation kernel size and stride doc to config

* refactor: converted efficientloftr implementation from modular to copied from mechanism

* tests: overwrote batching_equivalence for "keypoints" specific tests

* fix: changed EfficientLoFTRConfig import in test_modeling_rope_utils

* fix: make fix-copies

* fix: make style

* fix: update rope function to make meta tests pass

* fix: rename plot_keypoint_matching to visualize_output for clarity

* refactor: optimize image pair processing by removing redundant target size calculations

* feat: add EfficientLoFTRImageProcessor to image processor mapping

* refactor: removed logger and updated attention forward

* refactor: added auto_docstring and can_return_tuple decorators

* refactor: update type imports

* refactor: update type hints from List/Dict to list/dict for consistency

* refactor: update MODEL_MAPPING_NAMES and __all__ to include LightGlue and AutoModelForKeypointMatching

* fix: change type hint for size parameter in EfficientLoFTRImageProcessor to Optional[dict]

* fix typing

* fix some typing issues

* nit

* a few more typehint fixes

* Remove output_attentions and output_hidden_states from modeling code

* else -> elif to support efficientloftr

* nit

* tests: added EfficientLoFTR image processor tests

* refactor: reorder functions

* chore: update copyright year in EfficientLoFTR test file

* Use default rope

* Add docs

* Update visualization method

* fix doc order

* remove 2d rope test

* Update src/transformers/models/efficientloftr/modeling_efficientloftr.py

* fix docs

* Update src/transformers/models/efficientloftr/image_processing_efficientloftr.py

* update gradient

* refactor: removed unused codepath

* Add motivation to keep postprocessing in modeling code

* refactor: removed unnecessary variable declarations

* docs: use load_image from image_utils

* refactor: moved stage in and out channels computation to configuration

* refactor: set an intermediate_size parameter to be more explicit

* refactor: removed all mentions of attention masks as they are not used

* refactor: moved position_embeddings to be computed once in the model instead of every layer

* refactor: removed unnecessary hidden expansion parameter from config

* refactor: removed completely hidden expansions

* refactor: removed position embeddings slice function

* tests: fixed broken tests because of previous commit

* fix is_grayscale typehint

* not refactoring

* not renaming

* move h/w to embeddings class

* Precompute embeddings in init

* fix: replaced cuda device in convert script to accelerate device

* fix: replaced stevenbucaille repo to zju-community

* Remove accelerator.device from conversion script

* refactor: moved parameter computation in configuration instead of figuring it out when instantiating a Module

* fix: removed unused attributes in configuration

* fix: missing self

* fix: refactoring and tests

* fix: make style

---------

Co-authored-by: steven <steven.bucaille@buawei.com>
Co-authored-by: Pavel Iakubovskii <qubvel@gmail.com>
StevenBucaille
2025-07-22 05:53:16 -04:00
committed by GitHub
parent 3bc726b381
commit a88ea9cbc8
19 changed files with 2938 additions and 5 deletions


@@ -747,6 +747,8 @@
title: DPT
- local: model_doc/efficientformer
title: EfficientFormer
- local: model_doc/efficientloftr
title: EfficientLoFTR
- local: model_doc/efficientnet
title: EfficientNet
- local: model_doc/eomt


@@ -258,6 +258,10 @@ The following auto classes are available for the following computer vision tasks
[[autodoc]] AutoModelForKeypointDetection
### AutoModelForKeypointMatching
[[autodoc]] AutoModelForKeypointMatching
### AutoModelForMaskedImageModeling
[[autodoc]] AutoModelForMaskedImageModeling


@@ -0,0 +1,114 @@
<!--Copyright 2025 The HuggingFace Team. All rights reserved.
Licensed under the MIT License; you may not use this file except in compliance with
the License.
Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on
an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the
specific language governing permissions and limitations under the License.
⚠️ Note that this file is in Markdown but contains specific syntax for our doc-builder (similar to MDX) that may not be
rendered properly in your Markdown viewer.
-->
# EfficientLoFTR
<div class="flex flex-wrap space-x-1">
<img alt="PyTorch" src="https://img.shields.io/badge/PyTorch-DE3412?style=flat&logo=pytorch&logoColor=white">
</div>
## Overview
The EfficientLoFTR model was proposed in [Efficient LoFTR: Semi-Dense Local Feature Matching with Sparse-Like Speed](https://arxiv.org/abs/2403.04765) by Yifan Wang, Xingyi He, Sida Peng, Dongli Tan and Xiaowei Zhou.
This model matches two images by finding pixel correspondences between them. The resulting matches can be used to estimate the relative pose between the two images.
It is useful for tasks such as image matching and homography estimation.
The abstract from the paper is the following:
*We present a novel method for efficiently producing semi-dense matches across images. Previous detector-free matcher
LoFTR has shown remarkable matching capability in handling large-viewpoint change and texture-poor scenarios but suffers
from low efficiency. We revisit its design choices and derive multiple improvements for both efficiency and accuracy.
One key observation is that performing the transformer over the entire feature map is redundant due to shared local
information, therefore we propose an aggregated attention mechanism with adaptive token selection for efficiency.
Furthermore, we find spatial variance exists in LoFTR's fine correlation module, which is adverse to matching accuracy.
A novel two-stage correlation layer is proposed to achieve accurate subpixel correspondences for accuracy improvement.
Our efficiency optimized model is 2.5× faster than LoFTR which can even surpass state-of-the-art efficient sparse
matching pipeline SuperPoint + LightGlue. Moreover, extensive experiments show that our method can achieve higher
accuracy compared with competitive semi-dense matchers, with considerable efficiency benefits. This opens up exciting
prospects for large-scale or latency-sensitive applications such as image retrieval and 3D reconstruction.
Project page: [https://zju3dv.github.io/efficientloftr/](https://zju3dv.github.io/efficientloftr/).*
## How to use
Here is a quick example of using the model.
```python
import torch
from transformers import AutoImageProcessor, AutoModelForKeypointMatching
from transformers.image_utils import load_image
image1 = load_image("https://raw.githubusercontent.com/magicleap/SuperGluePretrainedNetwork/refs/heads/master/assets/phototourism_sample_images/united_states_capitol_98169888_3347710852.jpg")
image2 = load_image("https://raw.githubusercontent.com/magicleap/SuperGluePretrainedNetwork/refs/heads/master/assets/phototourism_sample_images/united_states_capitol_26757027_6717084061.jpg")
images = [image1, image2]
processor = AutoImageProcessor.from_pretrained("zju-community/efficientloftr")
model = AutoModelForKeypointMatching.from_pretrained("zju-community/efficientloftr")
inputs = processor(images, return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)
```
You can use the `post_process_keypoint_matching` method from the `ImageProcessor` to get the keypoints and matches in a more readable format:
```python
image_sizes = [[(image.height, image.width) for image in images]]
outputs = processor.post_process_keypoint_matching(outputs, image_sizes, threshold=0.2)
for i, output in enumerate(outputs):
    print("For the image pair", i)
    for keypoint0, keypoint1, matching_score in zip(
        output["keypoints0"], output["keypoints1"], output["matching_scores"]
    ):
        print(
            f"Keypoint at coordinate {keypoint0.numpy()} in the first image matches with keypoint at coordinate {keypoint1.numpy()} in the second image with a score of {matching_score}."
        )
```
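When only the strongest correspondences are of interest, the post-processed outputs can be filtered further. The following is a minimal sketch and not part of the library: `top_matches` is a hypothetical helper, and plain Python lists stand in for the tensors returned by `post_process_keypoint_matching`, assuming each output exposes `keypoints0`, `keypoints1` and `matching_scores` of equal length.

```python
def top_matches(output, k=5, min_score=0.0):
    """Return up to k (keypoint0, keypoint1, score) triples, best score first.

    Hypothetical helper: `output` is assumed to be one per-pair dictionary
    with "keypoints0", "keypoints1" and "matching_scores" entries.
    """
    triples = zip(output["keypoints0"], output["keypoints1"], output["matching_scores"])
    # Drop matches below the score threshold.
    kept = [(kp0, kp1, float(score)) for kp0, kp1, score in triples if float(score) >= min_score]
    # Sort the remaining matches by descending score.
    kept.sort(key=lambda triple: triple[2], reverse=True)
    return kept[:k]


# Toy stand-in for one post-processed image pair (lists instead of tensors).
example_output = {
    "keypoints0": [(10, 12), (40, 8), (25, 30), (5, 5)],
    "keypoints1": [(11, 14), (42, 9), (27, 33), (6, 4)],
    "matching_scores": [0.91, 0.35, 0.78, 0.15],
}

best = top_matches(example_output, k=2, min_score=0.3)
for kp0, kp1, score in best:
    print(f"{kp0} -> {kp1} (score {score:.2f})")
```

The same function should also work on tensor-valued outputs, since iterating a tensor yields its rows and `float()` accepts a scalar tensor, but that has not been verified against the actual processor output here.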
From the post-processed outputs, you can visualize the matches between the two images using the following code:
```python
images_with_matching = processor.visualize_keypoint_matching(images, outputs)
```
![image/png](https://cdn-uploads.huggingface.co/production/uploads/632885ba1558dac67c440aa8/2nJZQlFToCYp_iLurvcZ4.png)
This model was contributed by [stevenbucaille](https://huggingface.co/stevenbucaille).
The original code can be found [here](https://github.com/zju3dv/EfficientLoFTR).
## EfficientLoFTRConfig
[[autodoc]] EfficientLoFTRConfig
## EfficientLoFTRImageProcessor
[[autodoc]] EfficientLoFTRImageProcessor
- preprocess
- post_process_keypoint_matching
- visualize_keypoint_matching
## EfficientLoFTRModel
[[autodoc]] EfficientLoFTRModel
- forward
## EfficientLoFTRForKeypointMatching
[[autodoc]] EfficientLoFTRForKeypointMatching
- forward


@@ -102,6 +102,7 @@ if TYPE_CHECKING:
from .dots1 import *
from .dpr import *
from .dpt import *
from .efficientloftr import *
from .efficientnet import *
from .electra import *
from .emu3 import *


@@ -121,6 +121,7 @@ CONFIG_MAPPING_NAMES = OrderedDict[str, str](
("dpr", "DPRConfig"),
("dpt", "DPTConfig"),
("efficientformer", "EfficientFormerConfig"),
("efficientloftr", "EfficientLoFTRConfig"),
("efficientnet", "EfficientNetConfig"),
("electra", "ElectraConfig"),
("emu3", "Emu3Config"),
@@ -515,6 +516,7 @@ MODEL_NAMES_MAPPING = OrderedDict[str, str](
("dpr", "DPR"),
("dpt", "DPT"),
("efficientformer", "EfficientFormer"),
("efficientloftr", "EfficientLoFTR"),
("efficientnet", "EfficientNet"),
("electra", "ELECTRA"),
("emu3", "Emu3"),


@@ -85,6 +85,7 @@ else:
("donut-swin", ("DonutImageProcessor", "DonutImageProcessorFast")),
("dpt", ("DPTImageProcessor", "DPTImageProcessorFast")),
("efficientformer", ("EfficientFormerImageProcessor",)),
("efficientloftr", ("EfficientLoFTRImageProcessor",)),
("efficientnet", ("EfficientNetImageProcessor", "EfficientNetImageProcessorFast")),
("eomt", ("EomtImageProcessor", "EomtImageProcessorFast")),
("flava", ("FlavaImageProcessor", "FlavaImageProcessorFast")),


@@ -114,6 +114,7 @@ MODEL_MAPPING_NAMES = OrderedDict(
("dpr", "DPRQuestionEncoder"),
("dpt", "DPTModel"),
("efficientformer", "EfficientFormerModel"),
("efficientloftr", "EfficientLoFTRModel"),
("efficientnet", "EfficientNetModel"),
("electra", "ElectraModel"),
("emu3", "Emu3Model"),
@@ -322,7 +323,6 @@ MODEL_MAPPING_NAMES = OrderedDict(
("squeezebert", "SqueezeBertModel"),
("stablelm", "StableLmModel"),
("starcoder2", "Starcoder2Model"),
("superglue", "SuperGlueForKeypointMatching"),
("swiftformer", "SwiftFormerModel"),
("swin", "SwinModel"),
("swin2sr", "Swin2SRModel"),
@@ -1607,6 +1607,13 @@ MODEL_FOR_KEYPOINT_DETECTION_MAPPING_NAMES = OrderedDict(
]
)
MODEL_FOR_KEYPOINT_MATCHING_MAPPING_NAMES = OrderedDict(
[
("efficientloftr", "EfficientLoFTRForKeypointMatching"),
("lightglue", "LightGlueForKeypointMatching"),
("superglue", "SuperGlueForKeypointMatching"),
]
)
MODEL_FOR_TEXT_ENCODING_MAPPING_NAMES = OrderedDict(
[
@@ -1768,6 +1775,8 @@ MODEL_FOR_KEYPOINT_DETECTION_MAPPING = _LazyAutoMapping(
CONFIG_MAPPING_NAMES, MODEL_FOR_KEYPOINT_DETECTION_MAPPING_NAMES
)
MODEL_FOR_KEYPOINT_MATCHING_MAPPING = _LazyAutoMapping(CONFIG_MAPPING_NAMES, MODEL_FOR_KEYPOINT_MATCHING_MAPPING_NAMES)
MODEL_FOR_TEXT_ENCODING_MAPPING = _LazyAutoMapping(CONFIG_MAPPING_NAMES, MODEL_FOR_TEXT_ENCODING_MAPPING_NAMES)
MODEL_FOR_TIME_SERIES_CLASSIFICATION_MAPPING = _LazyAutoMapping(
@@ -1795,6 +1804,10 @@ class AutoModelForKeypointDetection(_BaseAutoModelClass):
_model_mapping = MODEL_FOR_KEYPOINT_DETECTION_MAPPING
class AutoModelForKeypointMatching(_BaseAutoModelClass):
_model_mapping = MODEL_FOR_KEYPOINT_MATCHING_MAPPING
class AutoModelForTextEncoding(_BaseAutoModelClass):
_model_mapping = MODEL_FOR_TEXT_ENCODING_MAPPING
@@ -2151,6 +2164,7 @@ __all__ = [
"MODEL_FOR_IMAGE_SEGMENTATION_MAPPING",
"MODEL_FOR_IMAGE_TO_IMAGE_MAPPING",
"MODEL_FOR_KEYPOINT_DETECTION_MAPPING",
"MODEL_FOR_KEYPOINT_MATCHING_MAPPING",
"MODEL_FOR_INSTANCE_SEGMENTATION_MAPPING",
"MODEL_FOR_MASKED_IMAGE_MODELING_MAPPING",
"MODEL_FOR_MASKED_LM_MAPPING",
@@ -2196,6 +2210,7 @@ __all__ = [
"AutoModelForImageToImage",
"AutoModelForInstanceSegmentation",
"AutoModelForKeypointDetection",
"AutoModelForKeypointMatching",
"AutoModelForMaskGeneration",
"AutoModelForTextEncoding",
"AutoModelForMaskedImageModeling",


@@ -0,0 +1,28 @@
# Copyright 2025 The HuggingFace Team. All rights reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
from typing import TYPE_CHECKING

from ...utils import _LazyModule
from ...utils.import_utils import define_import_structure


if TYPE_CHECKING:
    from .configuration_efficientloftr import *
    from .image_processing_efficientloftr import *
    from .modeling_efficientloftr import *
else:
    import sys

    _file = globals()["__file__"]
    sys.modules[__name__] = _LazyModule(__name__, _file, define_import_structure(_file), module_spec=__spec__)


@@ -0,0 +1,203 @@
# Copyright 2025 The HuggingFace Team. All rights reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
from typing import Optional
from ...configuration_utils import PretrainedConfig
from ...modeling_rope_utils import rope_config_validation
class EfficientLoFTRConfig(PretrainedConfig):
r"""
This is the configuration class to store the configuration of a [`EfficientLoFTRForKeypointMatching`].
It is used to instantiate an EfficientLoFTR model according to the specified arguments, defining the model
architecture. Instantiating a configuration with the defaults will yield a similar configuration to that of the
EfficientLoFTR [zju-community/efficientloftr](https://huggingface.co/zju-community/efficientloftr) architecture.
Configuration objects inherit from [`PretrainedConfig`] and can be used to control the model outputs. Read the
documentation from [`PretrainedConfig`] for more information.
Args:
stage_num_blocks (`List`, *optional*, defaults to [1, 2, 4, 14]):
The number of blocks in each stage
out_features (`List`, *optional*, defaults to [64, 64, 128, 256]):
The number of channels in each stage
stage_stride (`List`, *optional*, defaults to [2, 1, 2, 2]):
The stride used in each stage
hidden_size (`int`, *optional*, defaults to 256):
The dimension of the descriptors.
activation_function (`str`, *optional*, defaults to `"relu"`):
The activation function used in the backbone
q_aggregation_kernel_size (`int`, *optional*, defaults to 4):
The kernel size of the aggregation of query states in the fusion network
kv_aggregation_kernel_size (`int`, *optional*, defaults to 4):
The kernel size of the aggregation of key and value states in the fusion network
q_aggregation_stride (`int`, *optional*, defaults to 4):
The stride of the aggregation of query states in the fusion network
kv_aggregation_stride (`int`, *optional*, defaults to 4):
The stride of the aggregation of key and value states in the fusion network
num_attention_layers (`int`, *optional*, defaults to 4):
Number of attention layers in the LocalFeatureTransformer
num_attention_heads (`int`, *optional*, defaults to 8):
The number of heads in the attention layers.
attention_dropout (`float`, *optional*, defaults to 0.0):
The dropout ratio for the attention probabilities.
attention_bias (`bool`, *optional*, defaults to `False`):
Whether to use a bias in the query, key, value and output projection layers during attention.
mlp_activation_function (`str`, *optional*, defaults to `"leaky_relu"`):
Activation function used in the attention mlp layer.
coarse_matching_skip_softmax (`bool`, *optional*, defaults to `False`):
Whether to skip softmax or not at the coarse matching step.
coarse_matching_threshold (`float`, *optional*, defaults to 0.2):
The threshold for the minimum score required for a match.
coarse_matching_temperature (`float`, *optional*, defaults to 0.1):
The temperature to apply to the coarse similarity matrix
coarse_matching_border_removal (`int`, *optional*, defaults to 2):
The size of the border to remove during coarse matching
fine_kernel_size (`int`, *optional*, defaults to 8):
Kernel size used for the fine feature matching
batch_norm_eps (`float`, *optional*, defaults to 1e-05):
The epsilon used by the batch normalization layers.
embedding_size (`List`, *optional*, defaults to [15, 20]):
The size (height, width) of the embedding for the position embeddings.
rope_theta (`float`, *optional*, defaults to 10000.0):
The base period of the RoPE embeddings.
partial_rotary_factor (`float`, *optional*, defaults to 4.0):
Dim factor for the RoPE embeddings. In EfficientLoFTR, frequencies should be generated for
the whole hidden_size, so this factor is used to compensate.
rope_scaling (`Dict`, *optional*):
Dictionary containing the scaling configuration for the RoPE embeddings. NOTE: if you apply new rope type
and you expect the model to work on longer `max_position_embeddings`, we recommend you to update this value
accordingly.
Expected contents:
`rope_type` (`str`):
The sub-variant of RoPE to use. Can be one of ['default', 'linear', 'dynamic', 'yarn', 'longrope',
'llama3', '2d'], with 'default' being the original RoPE implementation.
`dim` (`int`): The dimension of the RoPE embeddings.
fine_matching_slice_dim (`int`, *optional*, defaults to 8):
The size of the slice used to divide the fine features for the first and second fine matching stages.
fine_matching_regress_temperature (`float`, *optional*, defaults to 10.0):
The temperature to apply to the fine similarity matrix
initializer_range (`float`, *optional*, defaults to 0.02):
The standard deviation of the truncated_normal_initializer for initializing all weight matrices.
Examples:
```python
>>> from transformers import EfficientLoFTRConfig, EfficientLoFTRForKeypointMatching
>>> # Initializing an EfficientLoFTR configuration
>>> configuration = EfficientLoFTRConfig()
>>> # Initializing a model from the EfficientLoFTR configuration
>>> model = EfficientLoFTRForKeypointMatching(configuration)
>>> # Accessing the model configuration
>>> configuration = model.config
```
"""
model_type = "efficientloftr"
def __init__(
self,
stage_num_blocks: Optional[list[int]] = None,
out_features: Optional[list[int]] = None,
stage_stride: Optional[list[int]] = None,
hidden_size: int = 256,
activation_function: str = "relu",
q_aggregation_kernel_size: int = 4,
kv_aggregation_kernel_size: int = 4,
q_aggregation_stride: int = 4,
kv_aggregation_stride: int = 4,
num_attention_layers: int = 4,
num_attention_heads: int = 8,
attention_dropout: float = 0.0,
attention_bias: bool = False,
mlp_activation_function: str = "leaky_relu",
coarse_matching_skip_softmax: bool = False,
coarse_matching_threshold: float = 0.2,
coarse_matching_temperature: float = 0.1,
coarse_matching_border_removal: int = 2,
fine_kernel_size: int = 8,
batch_norm_eps: float = 1e-5,
embedding_size: Optional[list[int]] = None,
rope_theta: float = 10000.0,
partial_rotary_factor: float = 4.0,
rope_scaling: Optional[dict] = None,
fine_matching_slice_dim: int = 8,
fine_matching_regress_temperature: float = 10.0,
initializer_range: float = 0.02,
**kwargs,
):
# Stage level of RepVGG
self.stage_num_blocks = stage_num_blocks if stage_num_blocks is not None else [1, 2, 4, 14]
self.stage_stride = stage_stride if stage_stride is not None else [2, 1, 2, 2]
self.out_features = out_features if out_features is not None else [64, 64, 128, 256]
self.stage_in_channels = [1] + self.out_features[:-1]
# Block level of RepVGG
self.stage_block_stride = [
[stride] + [1] * (num_blocks - 1) for stride, num_blocks in zip(self.stage_stride, self.stage_num_blocks)
]
self.stage_block_out_channels = [
[self.out_features[stage_idx]] * num_blocks for stage_idx, num_blocks in enumerate(self.stage_num_blocks)
]
self.stage_block_in_channels = [
[self.stage_in_channels[stage_idx]] + self.stage_block_out_channels[stage_idx][:-1]
for stage_idx in range(len(self.stage_num_blocks))
]
# Fine matching level of EfficientLoFTR
self.fine_fusion_dims = list(reversed(self.out_features))[:-1]
self.hidden_size = hidden_size
if self.hidden_size != self.out_features[-1]:
raise ValueError(
f"hidden_size should be equal to the last value in out_features. hidden_size = {self.hidden_size}, out_features = {self.out_features}"
)
self.activation_function = activation_function
self.q_aggregation_kernel_size = q_aggregation_kernel_size
self.kv_aggregation_kernel_size = kv_aggregation_kernel_size
self.q_aggregation_stride = q_aggregation_stride
self.kv_aggregation_stride = kv_aggregation_stride
self.num_attention_layers = num_attention_layers
self.num_attention_heads = num_attention_heads
self.attention_dropout = attention_dropout
self.attention_bias = attention_bias
self.intermediate_size = self.hidden_size * 2
self.mlp_activation_function = mlp_activation_function
self.coarse_matching_skip_softmax = coarse_matching_skip_softmax
self.coarse_matching_threshold = coarse_matching_threshold
self.coarse_matching_temperature = coarse_matching_temperature
self.coarse_matching_border_removal = coarse_matching_border_removal
self.fine_kernel_size = fine_kernel_size
self.batch_norm_eps = batch_norm_eps
self.fine_matching_slice_dim = fine_matching_slice_dim
self.fine_matching_regress_temperature = fine_matching_regress_temperature
self.num_key_value_heads = num_attention_heads
self.embedding_size = embedding_size if embedding_size is not None else [15, 20]
self.rope_theta = rope_theta
self.rope_scaling = rope_scaling if rope_scaling is not None else {"rope_type": "default"}
# for compatibility with "default" rope type
self.partial_rotary_factor = partial_rotary_factor
rope_config_validation(self)
self.initializer_range = initializer_range
super().__init__(**kwargs)
__all__ = ["EfficientLoFTRConfig"]
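To make the derived attributes computed in `EfficientLoFTRConfig.__init__` easier to follow, the sketch below replays the same list comprehensions on the default values outside of the class. It is only an illustration in plain Python; the variable names mirror the attributes above, and the reading of the leading `1` as a single-channel (grayscale) input is an assumption rather than something the configuration states.

```python
# Default stage-level settings, copied from the configuration defaults.
stage_num_blocks = [1, 2, 4, 14]
stage_stride = [2, 1, 2, 2]
out_features = [64, 64, 128, 256]

# Each stage consumes the output channels of the previous one.
# The first stage starts from 1 channel (assumed here to be a grayscale image).
stage_in_channels = [1] + out_features[:-1]

# Only the first block of a stage applies the stage stride; the others use stride 1.
stage_block_stride = [
    [stride] + [1] * (num_blocks - 1) for stride, num_blocks in zip(stage_stride, stage_num_blocks)
]

# Every block in a stage outputs the stage's channel count.
stage_block_out_channels = [
    [out_features[stage_idx]] * num_blocks for stage_idx, num_blocks in enumerate(stage_num_blocks)
]

# The first block takes the stage input channels, later blocks take the previous block's output.
stage_block_in_channels = [
    [stage_in_channels[stage_idx]] + stage_block_out_channels[stage_idx][:-1]
    for stage_idx in range(len(stage_num_blocks))
]

# Channel sizes used by the fine-level fusion, from deepest to shallowest stage.
fine_fusion_dims = list(reversed(out_features))[:-1]

print(stage_in_channels)
print(stage_block_stride[2])
print(stage_block_in_channels[2])
print(fine_fusion_dims)
```

Because these lists are derived entirely from `stage_num_blocks`, `stage_stride` and `out_features`, changing any of the three in the configuration changes the whole block layout consistently.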


@@ -0,0 +1,257 @@
# Copyright 2025 The HuggingFace Team. All rights reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
import argparse
import gc
import os
import re
import torch
from datasets import load_dataset
from huggingface_hub import hf_hub_download
from transformers.models.efficientloftr.image_processing_efficientloftr import EfficientLoFTRImageProcessor
from transformers.models.efficientloftr.modeling_efficientloftr import (
EfficientLoFTRConfig,
EfficientLoFTRForKeypointMatching,
)
DEFAULT_MODEL_REPO = "stevenbucaille/efficient_loftr_pth"
DEFAULT_FILE = "eloftr.pth"
def prepare_imgs():
dataset = load_dataset("hf-internal-testing/image-matching-test-dataset", split="train")
image0 = dataset[0]["image"]
image2 = dataset[2]["image"]
return [[image2, image0]]
def verify_model_outputs(model, device):
images = prepare_imgs()
preprocessor = EfficientLoFTRImageProcessor()
inputs = preprocessor(images=images, return_tensors="pt").to(device)
model.to(device)
model.eval()
with torch.no_grad():
outputs = model(**inputs, output_hidden_states=True, output_attentions=True)
predicted_number_of_matches = outputs.matches.shape[-1]
predicted_top10 = torch.topk(outputs.matching_scores[0, 0], k=10)
predicted_top10_matches_indices = predicted_top10.indices
predicted_top10_matching_scores = predicted_top10.values
expected_number_of_matches = 4800
expected_matches_shape = torch.Size((len(images), 2, expected_number_of_matches))
expected_matching_scores_shape = torch.Size((len(images), 2, expected_number_of_matches))
expected_top10_matches_indices = torch.tensor(
[1798, 1639, 1401, 1559, 2596, 2362, 2441, 2605, 1643, 2607], dtype=torch.int64
).to(device)
expected_top10_matching_scores = torch.tensor(
[0.9563, 0.9355, 0.9265, 0.9091, 0.9071, 0.9062, 0.9000, 0.8978, 0.8908, 0.8853]
).to(device)
assert outputs.matches.shape == expected_matches_shape
assert outputs.matching_scores.shape == expected_matching_scores_shape
torch.testing.assert_close(predicted_top10_matches_indices, expected_top10_matches_indices, rtol=5e-3, atol=5e-3)
torch.testing.assert_close(predicted_top10_matching_scores, expected_top10_matching_scores, rtol=5e-3, atol=5e-3)
assert predicted_number_of_matches == expected_number_of_matches
ORIGINAL_TO_CONVERTED_KEY_MAPPING = {
r"matcher.backbone.layer(\d+).rbr_dense.conv": r"efficientloftr.backbone.stages.\1.blocks.0.conv1.conv",
r"matcher.backbone.layer(\d+).rbr_dense.bn": r"efficientloftr.backbone.stages.\1.blocks.0.conv1.norm",
r"matcher.backbone.layer(\d+).rbr_1x1.conv": r"efficientloftr.backbone.stages.\1.blocks.0.conv2.conv",
r"matcher.backbone.layer(\d+).rbr_1x1.bn": r"efficientloftr.backbone.stages.\1.blocks.0.conv2.norm",
r"matcher.backbone.layer(\d+).(\d+).rbr_dense.conv": r"efficientloftr.backbone.stages.\1.blocks.\2.conv1.conv",
r"matcher.backbone.layer(\d+).(\d+).rbr_dense.bn": r"efficientloftr.backbone.stages.\1.blocks.\2.conv1.norm",
r"matcher.backbone.layer(\d+).(\d+).rbr_1x1.conv": r"efficientloftr.backbone.stages.\1.blocks.\2.conv2.conv",
r"matcher.backbone.layer(\d+).(\d+).rbr_1x1.bn": r"efficientloftr.backbone.stages.\1.blocks.\2.conv2.norm",
r"matcher.backbone.layer(\d+).(\d+).rbr_identity": r"efficientloftr.backbone.stages.\1.blocks.\2.identity",
r"matcher.loftr_coarse.layers.(\d*[02468]).aggregate": lambda m: f"efficientloftr.local_feature_transformer.layers.{int(m.group(1)) // 2}.self_attention.aggregation.q_aggregation",
r"matcher.loftr_coarse.layers.(\d*[02468]).norm1": lambda m: f"efficientloftr.local_feature_transformer.layers.{int(m.group(1)) // 2}.self_attention.aggregation.norm",
r"matcher.loftr_coarse.layers.(\d*[02468]).q_proj": lambda m: f"efficientloftr.local_feature_transformer.layers.{int(m.group(1)) // 2}.self_attention.attention.q_proj",
r"matcher.loftr_coarse.layers.(\d*[02468]).k_proj": lambda m: f"efficientloftr.local_feature_transformer.layers.{int(m.group(1)) // 2}.self_attention.attention.k_proj",
r"matcher.loftr_coarse.layers.(\d*[02468]).v_proj": lambda m: f"efficientloftr.local_feature_transformer.layers.{int(m.group(1)) // 2}.self_attention.attention.v_proj",
r"matcher.loftr_coarse.layers.(\d*[02468]).merge": lambda m: f"efficientloftr.local_feature_transformer.layers.{int(m.group(1)) // 2}.self_attention.attention.o_proj",
r"matcher.loftr_coarse.layers.(\d*[02468]).mlp.(\d+)": lambda m: f"efficientloftr.local_feature_transformer.layers.{int(m.group(1)) // 2}.self_attention.mlp.fc{1 if m.group(2) == '0' else 2}",
r"matcher.loftr_coarse.layers.(\d*[02468]).norm2": lambda m: f"efficientloftr.local_feature_transformer.layers.{int(m.group(1)) // 2}.self_attention.mlp.layer_norm",
r"matcher.loftr_coarse.layers.(\d*[13579]).aggregate": lambda m: f"efficientloftr.local_feature_transformer.layers.{int(m.group(1)) // 2}.cross_attention.aggregation.q_aggregation",
r"matcher.loftr_coarse.layers.(\d*[13579]).norm1": lambda m: f"efficientloftr.local_feature_transformer.layers.{int(m.group(1)) // 2}.cross_attention.aggregation.norm",
r"matcher.loftr_coarse.layers.(\d*[13579]).q_proj": lambda m: f"efficientloftr.local_feature_transformer.layers.{int(m.group(1)) // 2}.cross_attention.attention.q_proj",
r"matcher.loftr_coarse.layers.(\d*[13579]).k_proj": lambda m: f"efficientloftr.local_feature_transformer.layers.{int(m.group(1)) // 2}.cross_attention.attention.k_proj",
r"matcher.loftr_coarse.layers.(\d*[13579]).v_proj": lambda m: f"efficientloftr.local_feature_transformer.layers.{int(m.group(1)) // 2}.cross_attention.attention.v_proj",
r"matcher.loftr_coarse.layers.(\d*[13579]).merge": lambda m: f"efficientloftr.local_feature_transformer.layers.{int(m.group(1)) // 2}.cross_attention.attention.o_proj",
r"matcher.loftr_coarse.layers.(\d*[13579]).mlp.(\d+)": lambda m: f"efficientloftr.local_feature_transformer.layers.{int(m.group(1)) // 2}.cross_attention.mlp.fc{1 if m.group(2) == '0' else 2}",
r"matcher.loftr_coarse.layers.(\d*[13579]).norm2": lambda m: f"efficientloftr.local_feature_transformer.layers.{int(m.group(1)) // 2}.cross_attention.mlp.layer_norm",
r"matcher.fine_preprocess.layer3_outconv": "refinement_layer.out_conv",
r"matcher.fine_preprocess.layer(\d+)_outconv.weight": lambda m: f"refinement_layer.out_conv_layers.{0 if int(m.group(1)) == 2 else m.group(1)}.out_conv1.weight",
r"matcher.fine_preprocess.layer(\d+)_outconv2\.0": lambda m: f"refinement_layer.out_conv_layers.{0 if int(m.group(1)) == 2 else m.group(1)}.out_conv2",
r"matcher.fine_preprocess.layer(\d+)_outconv2\.1": lambda m: f"refinement_layer.out_conv_layers.{0 if int(m.group(1)) == 2 else m.group(1)}.batch_norm",
r"matcher.fine_preprocess.layer(\d+)_outconv2\.3": lambda m: f"refinement_layer.out_conv_layers.{0 if int(m.group(1)) == 2 else m.group(1)}.out_conv3",
}
def convert_old_keys_to_new_keys(state_dict_keys: list[str]):
    """
    This function should be applied only once, on the concatenated keys to efficiently rename using
    the key mappings.
    """
    output_dict = {}
    if state_dict_keys is not None:
        old_text = "\n".join(state_dict_keys)
        new_text = old_text
        for pattern, replacement in ORIGINAL_TO_CONVERTED_KEY_MAPPING.items():
            if replacement is None:
                new_text = re.sub(pattern, "", new_text)  # an empty line
                continue
            new_text = re.sub(pattern, replacement, new_text)
        output_dict = dict(zip(old_text.split("\n"), new_text.split("\n")))
    return output_dict
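The renaming above relies on two features of `re.sub`: a replacement may be either a template string with backreferences or a callable that receives the match object. The sketch below illustrates this on a reduced table with two patterns modeled on `ORIGINAL_TO_CONVERTED_KEY_MAPPING`. `KEY_MAPPING`, `rename_keys` and the sample keys are made up for illustration and are not taken from a real checkpoint.

```python
import re

# Reduced, illustrative mapping: one string template and one callable replacement.
KEY_MAPPING = {
    r"matcher.backbone.layer(\d+).(\d+).rbr_dense.conv": r"efficientloftr.backbone.stages.\1.blocks.\2.conv1.conv",
    # Odd-numbered original layers become cross-attention; the layer index is halved.
    r"matcher.loftr_coarse.layers.(\d*[13579]).q_proj": lambda m: (
        f"efficientloftr.local_feature_transformer.layers.{int(m.group(1)) // 2}"
        ".cross_attention.attention.q_proj"
    ),
}


def rename_keys(keys):
    """Rename all keys in one pass by applying each pattern to the newline-joined text."""
    old_text = "\n".join(keys)
    new_text = old_text
    for pattern, replacement in KEY_MAPPING.items():
        new_text = re.sub(pattern, replacement, new_text)
    # Line i of the old text corresponds to line i of the new text.
    return dict(zip(old_text.split("\n"), new_text.split("\n")))


# Hypothetical checkpoint keys, used only to exercise the patterns.
sample_keys = [
    "matcher.backbone.layer2.1.rbr_dense.conv.weight",
    "matcher.loftr_coarse.layers.3.q_proj.weight",
    "matcher.loftr_coarse.layers.5.q_proj.weight",
]

renamed = rename_keys(sample_keys)
for old_key, new_key in renamed.items():
    print(f"{old_key} -> {new_key}")
```

Joining the keys with newlines lets each pattern run once over the whole text instead of once per key. Since `.` does not match a newline by default, a pattern cannot span two keys.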
@torch.no_grad()
def write_model(
model_path,
model_repo,
file_name,
organization,
safe_serialization=True,
push_to_hub=False,
):
os.makedirs(model_path, exist_ok=True)
# ------------------------------------------------------------
# EfficientLoFTR config
# ------------------------------------------------------------
config = EfficientLoFTRConfig()
config.architectures = ["EfficientLoFTRForKeypointMatching"]
config.save_pretrained(model_path)
print("Model config saved successfully...")
# ------------------------------------------------------------
# Convert weights
# ------------------------------------------------------------
print(f"Fetching all parameters from the checkpoint at {model_repo}/{file_name}...")
checkpoint_path = hf_hub_download(repo_id=model_repo, filename=file_name)
original_state_dict = torch.load(checkpoint_path, weights_only=True, map_location="cpu")["state_dict"]
print("Converting model...")
all_keys = list(original_state_dict.keys())
new_keys = convert_old_keys_to_new_keys(all_keys)
state_dict = {}
for key in all_keys:
new_key = new_keys[key]
state_dict[new_key] = original_state_dict.pop(key).contiguous().clone()
del original_state_dict
gc.collect()
print("Loading the checkpoint in an EfficientLoFTR model...")
device = "cuda" if torch.cuda.is_available() else "cpu"
with torch.device(device):
model = EfficientLoFTRForKeypointMatching(config)
model.load_state_dict(state_dict)
print("Checkpoint loaded successfully...")
del model.config._name_or_path
print("Saving the model...")
model.save_pretrained(model_path, safe_serialization=safe_serialization)
del state_dict, model
# Safety check: reload the converted model
gc.collect()
print("Reloading the model to check if it's saved correctly.")
model = EfficientLoFTRForKeypointMatching.from_pretrained(model_path)
print("Model reloaded successfully.")
model_name = "efficientloftr"
if model_repo == DEFAULT_MODEL_REPO:
print("Checking the model outputs...")
verify_model_outputs(model, device)
print("Model outputs verified successfully.")
if push_to_hub:
print("Pushing model to the hub...")
model.push_to_hub(
repo_id=f"{organization}/{model_name}",
commit_message="Add model",
)
config.push_to_hub(repo_id=f"{organization}/{model_name}", commit_message="Add config")
write_image_processor(model_path, model_name, organization, push_to_hub=push_to_hub)
def write_image_processor(save_dir, model_name, organization, push_to_hub=False):
image_processor = EfficientLoFTRImageProcessor()
image_processor.save_pretrained(save_dir)
if push_to_hub:
print("Pushing image processor to the hub...")
image_processor.push_to_hub(
repo_id=f"{organization}/{model_name}",
commit_message="Add image processor",
)
if __name__ == "__main__":
parser = argparse.ArgumentParser()
# Required parameters
parser.add_argument(
"--repo_id",
default=DEFAULT_MODEL_REPO,
type=str,
help="Model repo ID of the original EfficientLoFTR checkpoint you'd like to convert.",
)
parser.add_argument(
"--file_name",
default=DEFAULT_FILE,
type=str,
help="File name of the original EfficientLoFTR checkpoint you'd like to convert.",
)
parser.add_argument(
"--pytorch_dump_folder_path",
default=None,
type=str,
required=True,
help="Path to the output PyTorch model directory.",
)
parser.add_argument("--save_model", action="store_true", help="Save model to local")
parser.add_argument(
"--push_to_hub",
action="store_true",
help="Push model and image processor to the hub",
)
parser.add_argument(
"--organization",
default="zju-community",
type=str,
help="Hub organization in which you want the model to be uploaded.",
)
args = parser.parse_args()
write_model(
args.pytorch_dump_folder_path,
args.repo_id,
args.file_name,
args.organization,
safe_serialization=True,
push_to_hub=args.push_to_hub,
)
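
The renaming trick used by `convert_old_keys_to_new_keys` above (join all keys with newlines, apply each regex once, then split again) can be sketched in isolation. This is a minimal illustration, not the script itself: `EXAMPLE_KEY_MAPPING` and `rename_keys` are hypothetical names, and the two patterns are simplified stand-ins for entries of the real mapping.

```python
import re

# Hypothetical two-entry mapping, used only for illustration; the real
# ORIGINAL_TO_CONVERTED_KEY_MAPPING in the script is much longer.
EXAMPLE_KEY_MAPPING = {
    r"matcher\.fine_preprocess\.layer3_outconv": "refinement_layer.out_conv",
    # Odd layer indices map to the cross-attention of layer index // 2.
    r"matcher\.loftr_coarse\.layers\.(\d*[13579])\.norm2": lambda m: (
        f"layers.{int(m.group(1)) // 2}.cross_attention.mlp.layer_norm"
    ),
}


def rename_keys(keys):
    """Rename every key with one re.sub call per pattern over the joined text."""
    old_text = "\n".join(keys)
    new_text = old_text
    for pattern, replacement in EXAMPLE_KEY_MAPPING.items():
        # re.sub accepts either a replacement string or a callable.
        new_text = re.sub(pattern, replacement, new_text)
    return dict(zip(old_text.split("\n"), new_text.split("\n")))


renamed = rename_keys(
    [
        "matcher.fine_preprocess.layer3_outconv.weight",
        "matcher.loftr_coarse.layers.3.norm2.bias",
    ]
)
```

Because no replacement adds or removes a newline, the old and new texts split into the same number of lines, which is what makes the final `zip` safe.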

@@ -0,0 +1,461 @@
# Copyright 2025 The HuggingFace Team. All rights reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
"""Image processor class for EfficientLoFTR."""
from typing import Optional, Union
import numpy as np
from ... import is_torch_available, is_vision_available
from ...image_processing_utils import BaseImageProcessor, BatchFeature, get_size_dict
from ...image_transforms import resize, to_channel_dimension_format
from ...image_utils import (
ChannelDimension,
ImageInput,
ImageType,
PILImageResampling,
get_image_type,
infer_channel_dimension_format,
is_pil_image,
is_scaled_image,
is_valid_image,
to_numpy_array,
valid_images,
validate_preprocess_arguments,
)
from ...utils import TensorType, logging, requires_backends
if is_torch_available():
import torch
if is_vision_available():
import PIL
from PIL import Image, ImageDraw
from .modeling_efficientloftr import KeypointMatchingOutput
logger = logging.get_logger(__name__)
# Copied from transformers.models.superpoint.image_processing_superpoint.is_grayscale
def is_grayscale(
image: np.ndarray,
input_data_format: Optional[Union[str, ChannelDimension]] = None,
):
if input_data_format == ChannelDimension.FIRST:
if image.shape[0] == 1:
return True
return np.all(image[0, ...] == image[1, ...]) and np.all(image[1, ...] == image[2, ...])
elif input_data_format == ChannelDimension.LAST:
if image.shape[-1] == 1:
return True
return np.all(image[..., 0] == image[..., 1]) and np.all(image[..., 1] == image[..., 2])
# Copied from transformers.models.superpoint.image_processing_superpoint.convert_to_grayscale
def convert_to_grayscale(
image: ImageInput,
input_data_format: Optional[Union[str, ChannelDimension]] = None,
) -> ImageInput:
"""
Converts an image to grayscale format using the NTSC formula. Only support numpy and PIL Image. TODO support torch
and tensorflow grayscale conversion
This function is supposed to return a 1-channel image, but it returns a 3-channel image with the same value in each
channel, because of an issue that is discussed in :
https://github.com/huggingface/transformers/pull/25786#issuecomment-1730176446
Args:
image (Image):
The image to convert.
input_data_format (`ChannelDimension` or `str`, *optional*):
The channel dimension format for the input image.
"""
requires_backends(convert_to_grayscale, ["vision"])
if isinstance(image, np.ndarray):
if is_grayscale(image, input_data_format=input_data_format):
return image
if input_data_format == ChannelDimension.FIRST:
gray_image = image[0, ...] * 0.2989 + image[1, ...] * 0.5870 + image[2, ...] * 0.1140
gray_image = np.stack([gray_image] * 3, axis=0)
elif input_data_format == ChannelDimension.LAST:
gray_image = image[..., 0] * 0.2989 + image[..., 1] * 0.5870 + image[..., 2] * 0.1140
gray_image = np.stack([gray_image] * 3, axis=-1)
return gray_image
if not isinstance(image, PIL.Image.Image):
return image
image = image.convert("L")
return image
# Copied from transformers.models.superglue.image_processing_superglue.validate_and_format_image_pairs
def validate_and_format_image_pairs(images: ImageInput):
error_message = (
"Input images must be one of the following:",
" - A pair of PIL images.",
" - A pair of 3D arrays.",
" - A list of pairs of PIL images.",
" - A list of pairs of 3D arrays.",
)
def _is_valid_image(image):
"""images is a PIL Image or a 3D array."""
return is_pil_image(image) or (
is_valid_image(image) and get_image_type(image) != ImageType.PIL and len(image.shape) == 3
)
if isinstance(images, list):
if len(images) == 2 and all((_is_valid_image(image)) for image in images):
return images
if all(
isinstance(image_pair, list)
and len(image_pair) == 2
and all(_is_valid_image(image) for image in image_pair)
for image_pair in images
):
return [image for image_pair in images for image in image_pair]
raise ValueError(error_message)
class EfficientLoFTRImageProcessor(BaseImageProcessor):
r"""
Constructs an EfficientLoFTR image processor.
Args:
do_resize (`bool`, *optional*, defaults to `True`):
Controls whether to resize the image's (height, width) dimensions to the specified `size`. Can be overridden
by `do_resize` in the `preprocess` method.
size (`dict[str, int]`, *optional*, defaults to `{"height": 480, "width": 640}`):
Resolution of the output image after `resize` is applied. Only has an effect if `do_resize` is set to
`True`. Can be overridden by `size` in the `preprocess` method.
resample (`PILImageResampling`, *optional*, defaults to `Resampling.BILINEAR`):
Resampling filter to use if resizing the image. Can be overridden by `resample` in the `preprocess` method.
do_rescale (`bool`, *optional*, defaults to `True`):
Whether to rescale the image by the specified scale `rescale_factor`. Can be overridden by `do_rescale` in
the `preprocess` method.
rescale_factor (`int` or `float`, *optional*, defaults to `1/255`):
Scale factor to use if rescaling the image. Can be overridden by `rescale_factor` in the `preprocess`
method.
do_grayscale (`bool`, *optional*, defaults to `True`):
Whether to convert the image to grayscale. Can be overridden by `do_grayscale` in the `preprocess` method.
"""
model_input_names = ["pixel_values"]
def __init__(
self,
do_resize: bool = True,
size: Optional[dict[str, int]] = None,
resample: PILImageResampling = PILImageResampling.BILINEAR,
do_rescale: bool = True,
rescale_factor: float = 1 / 255,
do_grayscale: bool = True,
**kwargs,
) -> None:
super().__init__(**kwargs)
size = size if size is not None else {"height": 480, "width": 640}
size = get_size_dict(size, default_to_square=False)
self.do_resize = do_resize
self.size = size
self.resample = resample
self.do_rescale = do_rescale
self.rescale_factor = rescale_factor
self.do_grayscale = do_grayscale
# Copied from transformers.models.superpoint.image_processing_superpoint.SuperPointImageProcessor.resize
def resize(
self,
image: np.ndarray,
size: dict[str, int],
data_format: Optional[Union[str, ChannelDimension]] = None,
input_data_format: Optional[Union[str, ChannelDimension]] = None,
**kwargs,
):
"""
Resize an image.
Args:
image (`np.ndarray`):
Image to resize.
size (`dict[str, int]`):
Dictionary of the form `{"height": int, "width": int}`, specifying the size of the output image.
data_format (`ChannelDimension` or `str`, *optional*):
The channel dimension format of the output image. If not provided, it will be inferred from the input
image. Can be one of:
- `"channels_first"` or `ChannelDimension.FIRST`: image in (num_channels, height, width) format.
- `"channels_last"` or `ChannelDimension.LAST`: image in (height, width, num_channels) format.
- `"none"` or `ChannelDimension.NONE`: image in (height, width) format.
input_data_format (`ChannelDimension` or `str`, *optional*):
The channel dimension format for the input image. If unset, the channel dimension format is inferred
from the input image. Can be one of:
- `"channels_first"` or `ChannelDimension.FIRST`: image in (num_channels, height, width) format.
- `"channels_last"` or `ChannelDimension.LAST`: image in (height, width, num_channels) format.
- `"none"` or `ChannelDimension.NONE`: image in (height, width) format.
"""
size = get_size_dict(size, default_to_square=False)
return resize(
image,
size=(size["height"], size["width"]),
data_format=data_format,
input_data_format=input_data_format,
**kwargs,
)
# Copied from transformers.models.superglue.image_processing_superglue.SuperGlueImageProcessor.preprocess
def preprocess(
self,
images,
do_resize: Optional[bool] = None,
size: Optional[dict[str, int]] = None,
resample: PILImageResampling = None,
do_rescale: Optional[bool] = None,
rescale_factor: Optional[float] = None,
do_grayscale: Optional[bool] = None,
return_tensors: Optional[Union[str, TensorType]] = None,
data_format: ChannelDimension = ChannelDimension.FIRST,
input_data_format: Optional[Union[str, ChannelDimension]] = None,
**kwargs,
) -> BatchFeature:
"""
Preprocess an image or batch of images.
Args:
images (`ImageInput`):
Image pairs to preprocess. Expects either a list of 2 images or a list of lists of 2 images, with
pixel values ranging from 0 to 255. If passing in images with pixel values between 0 and 1, set
`do_rescale=False`.
do_resize (`bool`, *optional*, defaults to `self.do_resize`):
Whether to resize the image.
size (`dict[str, int]`, *optional*, defaults to `self.size`):
Size of the output image after `resize` has been applied, as a dictionary of the form
`{"height": int, "width": int}`. Only has an effect if `do_resize` is set to `True`.
resample (`PILImageResampling`, *optional*, defaults to `self.resample`):
Resampling filter to use if resizing the image. This can be one of the `PILImageResampling` filters. Only
has an effect if `do_resize` is set to `True`.
do_rescale (`bool`, *optional*, defaults to `self.do_rescale`):
Whether to rescale the image values between [0 - 1].
rescale_factor (`float`, *optional*, defaults to `self.rescale_factor`):
Rescale factor to rescale the image by if `do_rescale` is set to `True`.
do_grayscale (`bool`, *optional*, defaults to `self.do_grayscale`):
Whether to convert the image to grayscale.
return_tensors (`str` or `TensorType`, *optional*):
The type of tensors to return. Can be one of:
- Unset: Return a list of `np.ndarray`.
- `TensorType.TENSORFLOW` or `'tf'`: Return a batch of type `tf.Tensor`.
- `TensorType.PYTORCH` or `'pt'`: Return a batch of type `torch.Tensor`.
- `TensorType.NUMPY` or `'np'`: Return a batch of type `np.ndarray`.
- `TensorType.JAX` or `'jax'`: Return a batch of type `jax.numpy.ndarray`.
data_format (`ChannelDimension` or `str`, *optional*, defaults to `ChannelDimension.FIRST`):
The channel dimension format for the output image. Can be one of:
- `"channels_first"` or `ChannelDimension.FIRST`: image in (num_channels, height, width) format.
- `"channels_last"` or `ChannelDimension.LAST`: image in (height, width, num_channels) format.
- Unset: Use the channel dimension format of the input image.
input_data_format (`ChannelDimension` or `str`, *optional*):
The channel dimension format for the input image. If unset, the channel dimension format is inferred
from the input image. Can be one of:
- `"channels_first"` or `ChannelDimension.FIRST`: image in (num_channels, height, width) format.
- `"channels_last"` or `ChannelDimension.LAST`: image in (height, width, num_channels) format.
- `"none"` or `ChannelDimension.NONE`: image in (height, width) format.
"""
do_resize = do_resize if do_resize is not None else self.do_resize
resample = resample if resample is not None else self.resample
do_rescale = do_rescale if do_rescale is not None else self.do_rescale
rescale_factor = rescale_factor if rescale_factor is not None else self.rescale_factor
do_grayscale = do_grayscale if do_grayscale is not None else self.do_grayscale
size = size if size is not None else self.size
size = get_size_dict(size, default_to_square=False)
# Validate and convert the input images into a flattened list of images for all subsequent processing steps.
images = validate_and_format_image_pairs(images)
if not valid_images(images):
raise ValueError(
"Invalid image type. Must be of type PIL.Image.Image, numpy.ndarray, "
"torch.Tensor, tf.Tensor or jax.ndarray."
)
validate_preprocess_arguments(
do_resize=do_resize,
size=size,
resample=resample,
do_rescale=do_rescale,
rescale_factor=rescale_factor,
)
# All transformations expect numpy arrays.
images = [to_numpy_array(image) for image in images]
if is_scaled_image(images[0]) and do_rescale:
logger.warning_once(
"It looks like you are trying to rescale already rescaled images. If the input"
" images have pixel values between 0 and 1, set `do_rescale=False` to avoid rescaling them again."
)
if input_data_format is None:
# We assume that all images have the same channel dimension format.
input_data_format = infer_channel_dimension_format(images[0])
all_images = []
for image in images:
if do_resize:
image = self.resize(image=image, size=size, resample=resample, input_data_format=input_data_format)
if do_rescale:
image = self.rescale(image=image, scale=rescale_factor, input_data_format=input_data_format)
if do_grayscale:
image = convert_to_grayscale(image, input_data_format=input_data_format)
image = to_channel_dimension_format(image, data_format, input_channel_dim=input_data_format)
all_images.append(image)
# Convert back the flattened list of images into a list of pairs of images.
image_pairs = [all_images[i : i + 2] for i in range(0, len(all_images), 2)]
data = {"pixel_values": image_pairs}
return BatchFeature(data=data, tensor_type=return_tensors)
def post_process_keypoint_matching(
self,
outputs: "KeypointMatchingOutput",
target_sizes: Union[TensorType, list[tuple]],
threshold: float = 0.0,
) -> list[dict[str, "torch.Tensor"]]:
"""
Converts the raw output of [`KeypointMatchingOutput`] into lists of matched keypoints and matching scores,
with coordinates absolute to the original image sizes.
Args:
outputs ([`KeypointMatchingOutput`]):
Raw outputs of the model.
target_sizes (`torch.Tensor` or `list[tuple[tuple[int, int]]]`):
Tensor of shape `(batch_size, 2, 2)` or list of tuples of tuples (`tuple[int, int]`) containing the
target size `(height, width)` of each image in the batch. This must be the original image size (before
any processing).
threshold (`float`, *optional*, defaults to 0.0):
Threshold to filter out the matches with low scores.
Returns:
`list[dict]`: A list of dictionaries, one per image pair, each containing the matched keypoints in the first
and second image of the pair (`keypoints0`, `keypoints1`) and the corresponding `matching_scores`.
"""
if outputs.matches.shape[0] != len(target_sizes):
raise ValueError("Make sure that you pass in as many target sizes as the batch dimension of the matches")
if not all(len(target_size) == 2 for target_size in target_sizes):
raise ValueError("Each element of target_sizes must contain the size (h, w) of each image of the batch")
if isinstance(target_sizes, list):
image_pair_sizes = torch.tensor(target_sizes, device=outputs.matches.device)
else:
if target_sizes.shape[1] != 2 or target_sizes.shape[2] != 2:
raise ValueError(
"Each element of target_sizes must contain the size (h, w) of each image of the batch"
)
image_pair_sizes = target_sizes
keypoints = outputs.keypoints.clone()
keypoints = keypoints * image_pair_sizes.flip(-1).reshape(-1, 2, 1, 2)
keypoints = keypoints.to(torch.int32)
results = []
for keypoints_pair, matches, scores in zip(keypoints, outputs.matches, outputs.matching_scores):
# Filter out matches with low scores
valid_matches = torch.logical_and(scores > threshold, matches > -1)
matched_keypoints0 = keypoints_pair[0][valid_matches[0]]
matched_keypoints1 = keypoints_pair[1][valid_matches[1]]
matching_scores = scores[0][valid_matches[0]]
results.append(
{
"keypoints0": matched_keypoints0,
"keypoints1": matched_keypoints1,
"matching_scores": matching_scores,
}
)
return results
def visualize_keypoint_matching(
self,
images: ImageInput,
keypoint_matching_output: list[dict[str, "torch.Tensor"]],
) -> list["Image.Image"]:
"""
Plots the image pairs side by side with the detected keypoints as well as the matching between them.
Args:
images (`ImageInput`):
Image pairs to plot. Same as `EfficientLoFTRImageProcessor.preprocess`. Expects either a list of 2
images or a list of lists of 2 images, with pixel values ranging from 0 to 255.
keypoint_matching_output (`list[dict[str, torch.Tensor]]`):
A post-processed keypoint matching output, as returned by `post_process_keypoint_matching`.
Returns:
`List[PIL.Image.Image]`: A list of PIL images, each containing the image pairs side by side with the detected
keypoints as well as the matching between them.
"""
images = validate_and_format_image_pairs(images)
images = [to_numpy_array(image) for image in images]
image_pairs = [images[i : i + 2] for i in range(0, len(images), 2)]
results = []
for image_pair, pair_output in zip(image_pairs, keypoint_matching_output):
height0, width0 = image_pair[0].shape[:2]
height1, width1 = image_pair[1].shape[:2]
plot_image = np.zeros((max(height0, height1), width0 + width1, 3), dtype=np.uint8)
plot_image[:height0, :width0] = image_pair[0]
plot_image[:height1, width0:] = image_pair[1]
plot_image_pil = Image.fromarray(plot_image)
draw = ImageDraw.Draw(plot_image_pil)
keypoints0_x, keypoints0_y = pair_output["keypoints0"].unbind(1)
keypoints1_x, keypoints1_y = pair_output["keypoints1"].unbind(1)
for keypoint0_x, keypoint0_y, keypoint1_x, keypoint1_y, matching_score in zip(
keypoints0_x, keypoints0_y, keypoints1_x, keypoints1_y, pair_output["matching_scores"]
):
color = self._get_color(matching_score)
draw.line(
(keypoint0_x, keypoint0_y, keypoint1_x + width0, keypoint1_y),
fill=color,
width=3,
)
draw.ellipse((keypoint0_x - 2, keypoint0_y - 2, keypoint0_x + 2, keypoint0_y + 2), fill="black")
draw.ellipse(
(keypoint1_x + width0 - 2, keypoint1_y - 2, keypoint1_x + width0 + 2, keypoint1_y + 2),
fill="black",
)
results.append(plot_image_pil)
return results
def _get_color(self, score):
"""Maps a score to a color."""
r = int(255 * (1 - score))
g = int(255 * score)
b = 0
return (r, g, b)
__all__ = ["EfficientLoFTRImageProcessor"]
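
The core of `post_process_keypoint_matching` above is two steps: scale relative keypoints to pixel coordinates of the original image, then keep only entries whose score exceeds `threshold` and whose match index is not `-1`. The sketch below mirrors those steps in plain Python for a single image of a pair. The helper names `to_absolute` and `keep_valid` and all input values are hypothetical; the processor itself works on batched torch tensors.

```python
def to_absolute(keypoints, size):
    """Scale relative (x, y) keypoints in [0, 1] to integer pixel coordinates.

    `size` is (height, width), so x is multiplied by width and y by height,
    which is what flipping the size tensor achieves in the torch version.
    """
    height, width = size
    # int() truncates, like the int32 cast in the torch version.
    return [(int(x * width), int(y * height)) for x, y in keypoints]


def keep_valid(keypoints, matches, scores, threshold=0.0):
    """Keep entries whose score exceeds the threshold and whose match index is not -1."""
    kept = [(kp, s) for kp, m, s in zip(keypoints, matches, scores) if s > threshold and m > -1]
    return [kp for kp, _ in kept], [s for _, s in kept]


# Made-up values for one image of a pair (not model outputs).
relative_keypoints = [(0.5, 0.5), (0.25, 0.75), (0.125, 0.25)]
matches = [1, -1, 0]
scores = [0.9, 0.0, 0.3]

absolute_keypoints = to_absolute(relative_keypoints, size=(480, 640))
kept_keypoints, kept_scores = keep_valid(absolute_keypoints, matches, scores, threshold=0.5)
```

With the default `threshold=0.0`, the unmatched second keypoint would still be dropped, because its match index is `-1`.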

File diff suppressed because it is too large.

@@ -51,7 +51,7 @@ logger = logging.get_logger(__name__)
def is_grayscale(
- image: ImageInput,
+ image: np.ndarray,
input_data_format: Optional[Union[str, ChannelDimension]] = None,
):
if input_data_format == ChannelDimension.FIRST:

@@ -53,7 +53,7 @@ logger = logging.get_logger(__name__)
# Copied from transformers.models.superpoint.image_processing_superpoint.is_grayscale
def is_grayscale(
- image: ImageInput,
+ image: np.ndarray,
input_data_format: Optional[Union[str, ChannelDimension]] = None,
):
if input_data_format == ChannelDimension.FIRST:

@@ -45,7 +45,7 @@ logger = logging.get_logger(__name__)
def is_grayscale(
image: ImageInput,
image: np.ndarray,
input_data_format: Optional[Union[str, ChannelDimension]] = None,
):
if input_data_format == ChannelDimension.FIRST:

@@ -1075,7 +1075,7 @@ def check_model_inputs(func):
if key == "hidden_states":
if hasattr(outputs, "vision_hidden_states"):
collected_outputs[key] += (outputs.vision_hidden_states,)
- else:
+ elif hasattr(outputs, "last_hidden_state"):
collected_outputs[key] += (outputs.last_hidden_state,)
outputs[key] = collected_outputs[key]
elif key == "attentions":

@@ -0,0 +1,90 @@
# Copyright 2025 The HuggingFace Team. All rights reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
import unittest
from tests.models.superglue.test_image_processing_superglue import (
SuperGlueImageProcessingTest,
SuperGlueImageProcessingTester,
)
from transformers.testing_utils import require_torch, require_vision
from transformers.utils import is_torch_available, is_vision_available
if is_torch_available():
import numpy as np
import torch
from transformers.models.efficientloftr.modeling_efficientloftr import KeypointMatchingOutput
if is_vision_available():
from transformers import EfficientLoFTRImageProcessor
def random_array(size):
return np.random.randint(255, size=size)
def random_tensor(size):
return torch.rand(size)
class EfficientLoFTRImageProcessingTester(SuperGlueImageProcessingTester):
"""Tester for EfficientLoFTRImageProcessor"""
def __init__(
self,
parent,
batch_size=6,
num_channels=3,
image_size=18,
min_resolution=30,
max_resolution=400,
do_resize=True,
size=None,
do_grayscale=True,
):
super().__init__(
parent, batch_size, num_channels, image_size, min_resolution, max_resolution, do_resize, size, do_grayscale
)
def prepare_keypoint_matching_output(self, pixel_values):
"""Prepare a fake output for the keypoint matching model with random matches between 50 keypoints per image."""
max_number_keypoints = 50
batch_size = len(pixel_values)
keypoints = torch.zeros((batch_size, 2, max_number_keypoints, 2))
matches = torch.full((batch_size, 2, max_number_keypoints), -1, dtype=torch.int)
scores = torch.zeros((batch_size, 2, max_number_keypoints))
for i in range(batch_size):
random_number_keypoints0 = np.random.randint(10, max_number_keypoints)
random_number_keypoints1 = np.random.randint(10, max_number_keypoints)
random_number_matches = np.random.randint(5, min(random_number_keypoints0, random_number_keypoints1))
keypoints[i, 0, :random_number_keypoints0] = torch.rand((random_number_keypoints0, 2))
keypoints[i, 1, :random_number_keypoints1] = torch.rand((random_number_keypoints1, 2))
random_matches_indices0 = torch.randperm(random_number_keypoints1, dtype=torch.int)[:random_number_matches]
random_matches_indices1 = torch.randperm(random_number_keypoints0, dtype=torch.int)[:random_number_matches]
matches[i, 0, random_matches_indices1] = random_matches_indices0
matches[i, 1, random_matches_indices0] = random_matches_indices1
scores[i, 0, random_matches_indices1] = torch.rand((random_number_matches,))
scores[i, 1, random_matches_indices0] = torch.rand((random_number_matches,))
return KeypointMatchingOutput(keypoints=keypoints, matches=matches, matching_scores=scores)
@require_torch
@require_vision
class EfficientLoFTRImageProcessingTest(SuperGlueImageProcessingTest, unittest.TestCase):
image_processing_class = EfficientLoFTRImageProcessor if is_vision_available() else None
def setUp(self) -> None:
super().setUp()
self.image_processor_tester = EfficientLoFTRImageProcessingTester(self)
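
The fake output built by `prepare_keypoint_matching_output` above relies on the two match arrays being mutually consistent: if keypoint `k0` of the first image points to `k1`, then `k1` of the second image points back to `k0`, and unmatched keypoints hold `-1`. A small standard-library sketch of that construction follows. `make_mutual_matches` is a hypothetical helper, and it uses `random.sample` instead of the `torch.randperm` slicing used in the tester.

```python
import random


def make_mutual_matches(num_keypoints0, num_keypoints1, num_matches, seed=0):
    """Build two index lists where matches0[k0] == k1 implies matches1[k1] == k0."""
    rng = random.Random(seed)
    indices0 = rng.sample(range(num_keypoints0), num_matches)  # keypoints of image 0
    indices1 = rng.sample(range(num_keypoints1), num_matches)  # keypoints of image 1
    matches0 = [-1] * num_keypoints0  # -1 marks an unmatched keypoint
    matches1 = [-1] * num_keypoints1
    for k0, k1 in zip(indices0, indices1):
        matches0[k0] = k1
        matches1[k1] = k0
    return matches0, matches1


matches0, matches1 = make_mutual_matches(num_keypoints0=12, num_keypoints1=15, num_matches=5)
```

Sampling without replacement on both sides guarantees that every keypoint takes part in at most one match.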

@@ -0,0 +1,453 @@
# Copyright 2025 The HuggingFace Team. All rights reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
import inspect
import unittest
from functools import reduce
from datasets import load_dataset
from transformers.models.efficientloftr import EfficientLoFTRConfig, EfficientLoFTRModel
from transformers.testing_utils import (
require_torch,
require_vision,
set_config_for_less_flaky_test,
set_model_for_less_flaky_test,
set_model_tester_for_less_flaky_test,
slow,
torch_device,
)
from transformers.utils import cached_property, is_torch_available, is_vision_available
from ...test_configuration_common import ConfigTester
from ...test_modeling_common import ModelTesterMixin, floats_tensor
if is_torch_available():
import torch
from transformers import EfficientLoFTRForKeypointMatching
if is_vision_available():
from transformers import AutoImageProcessor
class EfficientLoFTRModelTester:
def __init__(
self,
parent,
batch_size=2,
image_width=80,
image_height=60,
stage_num_blocks: list[int] = [1, 1, 1],
out_features: list[int] = [32, 32, 64],
stage_stride: list[int] = [2, 1, 2],
q_aggregation_kernel_size: int = 1,
kv_aggregation_kernel_size: int = 1,
q_aggregation_stride: int = 1,
kv_aggregation_stride: int = 1,
num_attention_layers: int = 2,
num_attention_heads: int = 8,
hidden_size: int = 64,
coarse_matching_threshold: float = 0.0,
fine_kernel_size: int = 2,
coarse_matching_border_removal: int = 0,
):
self.parent = parent
self.batch_size = batch_size
self.image_width = image_width
self.image_height = image_height
self.stage_num_blocks = stage_num_blocks
self.out_features = out_features
self.stage_stride = stage_stride
self.q_aggregation_kernel_size = q_aggregation_kernel_size
self.kv_aggregation_kernel_size = kv_aggregation_kernel_size
self.q_aggregation_stride = q_aggregation_stride
self.kv_aggregation_stride = kv_aggregation_stride
self.num_attention_layers = num_attention_layers
self.num_attention_heads = num_attention_heads
self.hidden_size = hidden_size
self.coarse_matching_threshold = coarse_matching_threshold
self.coarse_matching_border_removal = coarse_matching_border_removal
self.fine_kernel_size = fine_kernel_size
def prepare_config_and_inputs(self):
# EfficientLoFTR expects a grayscale image as input
pixel_values = floats_tensor([self.batch_size, 2, 3, self.image_height, self.image_width])
config = self.get_config()
return config, pixel_values
def get_config(self):
return EfficientLoFTRConfig(
stage_num_blocks=self.stage_num_blocks,
out_features=self.out_features,
stage_stride=self.stage_stride,
q_aggregation_kernel_size=self.q_aggregation_kernel_size,
kv_aggregation_kernel_size=self.kv_aggregation_kernel_size,
q_aggregation_stride=self.q_aggregation_stride,
kv_aggregation_stride=self.kv_aggregation_stride,
num_attention_layers=self.num_attention_layers,
num_attention_heads=self.num_attention_heads,
hidden_size=self.hidden_size,
coarse_matching_threshold=self.coarse_matching_threshold,
coarse_matching_border_removal=self.coarse_matching_border_removal,
fine_kernel_size=self.fine_kernel_size,
)
def create_and_check_model(self, config, pixel_values):
model = EfficientLoFTRForKeypointMatching(config=config)
model.to(torch_device)
model.eval()
result = model(pixel_values)
maximum_num_matches = result.matches.shape[-1]
self.parent.assertEqual(
result.keypoints.shape,
(self.batch_size, 2, maximum_num_matches, 2),
)
self.parent.assertEqual(
result.matches.shape,
(self.batch_size, 2, maximum_num_matches),
)
self.parent.assertEqual(
result.matching_scores.shape,
(self.batch_size, 2, maximum_num_matches),
)
def prepare_config_and_inputs_for_common(self):
config_and_inputs = self.prepare_config_and_inputs()
config, pixel_values = config_and_inputs
inputs_dict = {"pixel_values": pixel_values}
return config, inputs_dict
@require_torch
class EfficientLoFTRModelTest(ModelTesterMixin, unittest.TestCase):
all_model_classes = (EfficientLoFTRForKeypointMatching, EfficientLoFTRModel) if is_torch_available() else ()
test_pruning = False
test_resize_embeddings = False
test_head_masking = False
has_attentions = True
def setUp(self):
self.model_tester = EfficientLoFTRModelTester(self)
self.config_tester = ConfigTester(self, config_class=EfficientLoFTRConfig, has_text_modality=False)
def test_config(self):
self.config_tester.create_and_test_config_to_json_string()
self.config_tester.create_and_test_config_to_json_file()
self.config_tester.create_and_test_config_from_and_save_pretrained()
self.config_tester.create_and_test_config_with_num_labels()
self.config_tester.check_config_can_be_init_without_params()
self.config_tester.check_config_arguments_init()
@unittest.skip(reason="EfficientLoFTRForKeypointMatching does not use inputs_embeds")
def test_inputs_embeds(self):
pass
@unittest.skip(reason="EfficientLoFTRForKeypointMatching does not support input and output embeddings")
def test_model_get_set_embeddings(self):
pass
@unittest.skip(reason="EfficientLoFTRForKeypointMatching does not use feedforward chunking")
def test_feed_forward_chunking(self):
pass
@unittest.skip(reason="EfficientLoFTRForKeypointMatching is not trainable")
def test_training(self):
pass
@unittest.skip(reason="EfficientLoFTRForKeypointMatching is not trainable")
def test_training_gradient_checkpointing(self):
pass
@unittest.skip(reason="EfficientLoFTRForKeypointMatching is not trainable")
def test_training_gradient_checkpointing_use_reentrant(self):
pass
@unittest.skip(reason="EfficientLoFTRForKeypointMatching is not trainable")
def test_training_gradient_checkpointing_use_reentrant_false(self):
pass
@unittest.skip(reason="EfficientLoFTR does not output any loss term in the forward pass")
def test_retain_grad_hidden_states_attentions(self):
pass

    def test_model(self):
        config_and_inputs = self.model_tester.prepare_config_and_inputs()
        self.model_tester.create_and_check_model(*config_and_inputs)

    def test_forward_signature(self):
        config, _ = self.model_tester.prepare_config_and_inputs()

        for model_class in self.all_model_classes:
            model = model_class(config)
            signature = inspect.signature(model.forward)
            # signature.parameters is an OrderedDict, so the order of arg_names is deterministic
            arg_names = [*signature.parameters.keys()]

            expected_arg_names = ["pixel_values"]
            self.assertListEqual(arg_names[:1], expected_arg_names)

    def test_hidden_states_output(self):
        def check_hidden_states_output(inputs_dict, config, model_class):
            model = model_class(config)
            model.to(torch_device)
            model.eval()

            with torch.no_grad():
                outputs = model(**self._prepare_for_class(inputs_dict, model_class))

            hidden_states = outputs.hidden_states

            expected_num_hidden_states = len(self.model_tester.stage_num_blocks)
            self.assertEqual(len(hidden_states), expected_num_hidden_states)

            self.assertListEqual(
                list(hidden_states[0].shape[-2:]),
                [self.model_tester.image_height // 2, self.model_tester.image_width // 2],
            )

        config, inputs_dict = self.model_tester.prepare_config_and_inputs_for_common()

        for model_class in self.all_model_classes:
            inputs_dict["output_hidden_states"] = True
            check_hidden_states_output(inputs_dict, config, model_class)

            # check that output_hidden_states also works when set through the config
            del inputs_dict["output_hidden_states"]
            config.output_hidden_states = True

            check_hidden_states_output(inputs_dict, config, model_class)

    def test_attention_outputs(self):
        def check_attention_output(inputs_dict, config, model_class):
            config._attn_implementation = "eager"
            model = model_class(config)
            model.to(torch_device)
            model.eval()

            with torch.no_grad():
                outputs = model(**self._prepare_for_class(inputs_dict, model_class))

            attentions = outputs.attentions
            total_stride = reduce(lambda a, b: a * b, config.stage_stride)
            # number of coarse feature positions (height * width after all strides), i.e. the attention sequence length
            hidden_size = (self.model_tester.image_height // total_stride) * (
                self.model_tester.image_width // total_stride
            )

            expected_attention_shape = [
                self.model_tester.num_attention_heads,
                hidden_size,
                hidden_size,
            ]

            for attention in attentions:
                self.assertListEqual(
                    list(attention.shape[-3:]),
                    expected_attention_shape,
                )

        config, inputs_dict = self.model_tester.prepare_config_and_inputs_for_common()

        for model_class in self.all_model_classes:
            inputs_dict["output_attentions"] = True
            check_attention_output(inputs_dict, config, model_class)

            # check that output_attentions also works when set through the config
            del inputs_dict["output_attentions"]
            config.output_attentions = True

            check_attention_output(inputs_dict, config, model_class)

    @slow
    def test_model_from_pretrained(self):
        from_pretrained_ids = ["stevenbucaille/efficientloftr"]
        for model_name in from_pretrained_ids:
            model = EfficientLoFTRForKeypointMatching.from_pretrained(model_name)
            self.assertIsNotNone(model)

    def test_forward_labels_should_be_none(self):
        config, inputs_dict = self.model_tester.prepare_config_and_inputs_for_common()
        for model_class in self.all_model_classes:
            model = model_class(config)
            model.to(torch_device)
            model.eval()

            with torch.no_grad():
                model_inputs = self._prepare_for_class(inputs_dict, model_class)
                # Provide an arbitrary sized Tensor as labels to model inputs
                model_inputs["labels"] = torch.rand((128, 128))

                with self.assertRaises(ValueError) as cm:
                    model(**model_inputs)
                self.assertEqual(ValueError, cm.exception.__class__)

    def test_batching_equivalence(self, atol=1e-5, rtol=1e-5):
        """
        This test is overridden because the model outputs contain not only regressed values but also keypoint
        locations.
        As with the problem discussed for the SuperGlue implementation
        [here](https://github.com/huggingface/transformers/pull/29886#issuecomment-2482752787), small differences
        in matching scores can change the indices of the maximum scores. These indices are used to compute the
        keypoint coordinates. The keypoint coordinates in the model outputs are floating point tensors, so the
        original implementation of this test covers this case, but the resulting tensors may differ by more than
        the relative and absolute tolerances.
        Therefore, as in the SuperGlue integration test, for the key "keypoints" in the model outputs we check
        that the number of differing keypoint coordinates stays below a threshold (currently 2% of the number of
        keypoints, see `tolerance` in `recursive_check`).
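
The count-based check described above can be sketched without any framework. This is only an illustration,
assuming keypoints are plain `(x, y)` tuples; the helper name `count_keypoint_mismatches` is hypothetical and
not part of the test suite, and `math.isclose` only approximates the semantics of `torch.isclose`:

```python
import math


def count_keypoint_mismatches(batched, single, rtol=1e-5, atol=1e-5):
    # Compare the summed coordinates of each keypoint, mirroring the torch.sum(..., dim=-1) step of the test.
    return sum(
        not math.isclose(sum(b), sum(s), rel_tol=rtol, abs_tol=atol)
        for b, s in zip(batched, single)
    )


batched = [(1.0, 2.0), (3.0, 4.0), (5.0, 6.0)]
single = [(1.0, 2.0), (3.0, 4.0), (9.0, 6.0)]
mismatches = count_keypoint_mismatches(batched, single)
# The check passes only if fewer than 2% of the keypoints differ.
passes = mismatches < 0.02 * len(single)
```

With only three keypoints, a single mismatch already exceeds the 2% threshold, so the check fails in this toy
case; with thousands of keypoints a handful of mismatches is tolerated.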
        """

        def recursive_check(batched_object, single_row_object, model_name, key):
            if isinstance(batched_object, (list, tuple)):
                for batched_object_value, single_row_object_value in zip(batched_object, single_row_object):
                    recursive_check(batched_object_value, single_row_object_value, model_name, key)
            elif isinstance(batched_object, dict):
                for batched_object_value, single_row_object_value in zip(
                    batched_object.values(), single_row_object.values()
                ):
                    recursive_check(batched_object_value, single_row_object_value, model_name, key)
            # do not compare returned loss (0-dim tensor) / codebook ids (int) / caching objects
            elif batched_object is None or not isinstance(batched_object, torch.Tensor):
                return
            elif batched_object.dim() == 0:
                return
            # do not compare int or bool outputs as they are mostly computed with max/argmax/topk methods which are
            # very sensitive to the inputs (e.g. tiny differences may give totally different results)
            elif not torch.is_floating_point(batched_object):
                return
            else:
                # indexing the first element does not always work
                # e.g. models that output similarity scores of size (N, M) would need to index [0, 0]
                # use a tuple of slices: indexing a tensor with a list of slices is not reliable across versions
                slice_ids = tuple(slice(0, index) for index in single_row_object.shape)
                batched_row = batched_object[slice_ids]
                if key == "keypoints":
                    # sum the (x, y) coordinates so that each keypoint is compared as a single value
                    batched_row = torch.sum(batched_row, dim=-1)
                    single_row_object = torch.sum(single_row_object, dim=-1)
                    # allow up to 2% of the keypoints to differ
                    tolerance = 0.02 * single_row_object.shape[-1]
                    self.assertTrue(
                        torch.sum(~torch.isclose(batched_row, single_row_object, rtol=rtol, atol=atol)) < tolerance
                    )
                else:
                    self.assertFalse(
                        torch.isnan(batched_row).any(), f"Batched output has `nan` in {model_name} for key={key}"
                    )
                    self.assertFalse(
                        torch.isinf(batched_row).any(), f"Batched output has `inf` in {model_name} for key={key}"
                    )
                    self.assertFalse(
                        torch.isnan(single_row_object).any(),
                        f"Single row output has `nan` in {model_name} for key={key}",
                    )
                    self.assertFalse(
                        torch.isinf(single_row_object).any(),
                        f"Single row output has `inf` in {model_name} for key={key}",
                    )
                    try:
                        torch.testing.assert_close(batched_row, single_row_object, atol=atol, rtol=rtol)
                    except AssertionError as e:
                        msg = f"Batched and Single row outputs are not equal in {model_name} for key={key}.\n\n"
                        msg += str(e)
                        raise AssertionError(msg) from e

        set_model_tester_for_less_flaky_test(self)

        config, batched_input = self.model_tester.prepare_config_and_inputs_for_common()
        set_config_for_less_flaky_test(config)

        for model_class in self.all_model_classes:
            config.output_hidden_states = True

            model_name = model_class.__name__
            if hasattr(self.model_tester, "prepare_config_and_inputs_for_model_class"):
                config, batched_input = self.model_tester.prepare_config_and_inputs_for_model_class(model_class)
            batched_input_prepared = self._prepare_for_class(batched_input, model_class)
            model = model_class(config).to(torch_device).eval()
            set_model_for_less_flaky_test(model)

            batch_size = self.model_tester.batch_size
            single_row_input = {}
            for key, value in batched_input_prepared.items():
                if isinstance(value, torch.Tensor) and value.shape[0] % batch_size == 0:
                    # e.g. musicgen has inputs of size (bs*codebooks). in most cases value.shape[0] == batch_size
                    single_batch_shape = value.shape[0] // batch_size
                    single_row_input[key] = value[:single_batch_shape]
                else:
                    single_row_input[key] = value

            with torch.no_grad():
                model_batched_output = model(**batched_input_prepared)
                model_row_output = model(**single_row_input)

            if isinstance(model_batched_output, torch.Tensor):
                model_batched_output = {"model_output": model_batched_output}
                model_row_output = {"model_output": model_row_output}

            for key in model_batched_output:
                # DETR starts from zero-init queries to decoder, leading to cos_similarity = `nan`
                if hasattr(self, "zero_init_hidden_state") and "decoder_hidden_states" in key:
                    model_batched_output[key] = model_batched_output[key][1:]
                    model_row_output[key] = model_row_output[key][1:]
                recursive_check(model_batched_output[key], model_row_output[key], model_name, key)


def prepare_imgs():
    dataset = load_dataset("hf-internal-testing/image-matching-test-dataset", split="train")
    image1 = dataset[0]["image"]
    image2 = dataset[1]["image"]
    image3 = dataset[2]["image"]
    # two image pairs that share the same second image
    return [[image1, image2], [image3, image2]]


@require_torch
@require_vision
class EfficientLoFTRModelIntegrationTest(unittest.TestCase):
    @cached_property
    def default_image_processor(self):
        return AutoImageProcessor.from_pretrained("stevenbucaille/efficientloftr") if is_vision_available() else None

    @slow
    def test_inference(self):
        model = EfficientLoFTRForKeypointMatching.from_pretrained(
            "stevenbucaille/efficientloftr", attn_implementation="eager"
        ).to(torch_device)
        preprocessor = self.default_image_processor
        images = prepare_imgs()
        inputs = preprocessor(images=images, return_tensors="pt").to(torch_device)
        with torch.no_grad():
            outputs = model(**inputs, output_hidden_states=True, output_attentions=True)

        predicted_top10 = torch.topk(outputs.matching_scores[0, 0], k=10)
        predicted_top10_matches_indices = predicted_top10.indices
        predicted_top10_matching_scores = predicted_top10.values

        expected_number_of_matches = 4800
        expected_matches_shape = torch.Size((len(images), 2, expected_number_of_matches))
        expected_matching_scores_shape = torch.Size((len(images), 2, expected_number_of_matches))

        expected_top10_matches_indices = torch.tensor(
            [3145, 3065, 3143, 3066, 3144, 1397, 1705, 3151, 2342, 2422], dtype=torch.int64, device=torch_device
        )
        expected_top10_matching_scores = torch.tensor(
            [0.9997, 0.9996, 0.9996, 0.9995, 0.9995, 0.9995, 0.9994, 0.9994, 0.9994, 0.9994], device=torch_device
        )

        self.assertEqual(outputs.matches.shape, expected_matches_shape)
        self.assertEqual(outputs.matching_scores.shape, expected_matching_scores_shape)

        torch.testing.assert_close(
            predicted_top10_matches_indices, expected_top10_matches_indices, rtol=5e-3, atol=5e-3
        )
        torch.testing.assert_close(
            predicted_top10_matching_scores, expected_top10_matching_scores, rtol=5e-3, atol=5e-3
        )