Add EfficientLoFTR model (#36355)
* initial commit
* Apply suggestions from code review
  Co-authored-by: Pavel Iakubovskii <qubvel@gmail.com>
* fix: various typos, typehints, refactors from suggestions
* fix: fine_matching method
* Added EfficientLoFTRModel and AutoModelForKeypointMatching class
* fix: got rid of compilation breaking instructions
* docs: added todo for plot
* fix: used correct hub repo
* docs: added comments
* fix: run modular
* doc: added PyTorch badge
* fix: model repo typo in config
* fix: make modular
* fix: removed mask values from outputs
* feat: added plot_keypoint_matching to EfficientLoFTRImageProcessor
* feat: added SuperGlueForKeypointMatching to AutoModelForKeypointMatching list
* fix: reformat
* refactor: renamed aggregation_sizes config parameter into q, kv aggregation kernel size and stride
* doc: added q, kv aggregation kernel size and stride doc to config
* refactor: converted efficientloftr implementation from modular to copied from mechanism
* tests: overwrote batching_equivalence for "keypoints" specific tests
* fix: changed EfficientLoFTRConfig import in test_modeling_rope_utils
* fix: make fix-copies
* fix: make style
* fix: update rope function to make meta tests pass
* fix: rename plot_keypoint_matching to visualize_output for clarity
* refactor: optimize image pair processing by removing redundant target size calculations
* feat: add EfficientLoFTRImageProcessor to image processor mapping
* refactor: removed logger and updated attention forward
* refactor: added auto_docstring and can_return_tuple decorators
* refactor: update type imports
* refactor: update type hints from List/Dict to list/dict for consistency
* refactor: update MODEL_MAPPING_NAMES and __all__ to include LightGlue and AutoModelForKeypointMatching
* fix: change type hint for size parameter in EfficientLoFTRImageProcessor to Optional[dict]
* fix typing
* fix some typing issues
* nit
* a few more typehint fixes
* Remove output_attentions and output_hidden_states from modeling code
* else -> elif to support efficientloftr
* nit
* tests: added EfficientLoFTR image processor tests
* refactor: reorder functions
* chore: update copyright year in EfficientLoFTR test file
* Use default rope
* Add docs
* Update visualization method
* fix doc order
* remove 2d rope test
* Update src/transformers/models/efficientloftr/modeling_efficientloftr.py
* fix docs
* Update src/transformers/models/efficientloftr/image_processing_efficientloftr.py
* update gradient
* refactor: removed unused codepath
* Add motivation to keep postprocessing in modeling code
* refactor: removed unnecessary variable declarations
* docs: use load_image from image_utils
* refactor: moved stage in and out channels computation to configuration
* refactor: set an intermediate_size parameter to be more explicit
* refactor: removed all mentions of attention masks as they are not used
* refactor: moved position_embeddings to be computed once in the model instead of every layer
* refactor: removed unnecessary hidden expansion parameter from config
* refactor: removed completely hidden expansions
* refactor: removed position embeddings slice function
* tests: fixed broken tests because of previous commit
* fix is_grayscale typehint
* not refactoring
* not renaming
* move h/w to embeddings class
* Precompute embeddings in init
* fix: replaced cuda device in convert script to accelerate device
* fix: replaced stevenbucaille repo to zju-community
* Remove accelerator.device from conversion script
* refactor: moved parameter computation in configuration instead of figuring it out when instantiating a Module
* fix: removed unused attributes in configuration
* fix: missing self
* fix: refactoring and tests
* fix: make style

---------

Co-authored-by: steven <steven.bucaille@buawei.com>
Co-authored-by: Pavel Iakubovskii <qubvel@gmail.com>
@@ -747,6 +747,8 @@
|
|||||||
title: DPT
|
title: DPT
|
||||||
- local: model_doc/efficientformer
|
- local: model_doc/efficientformer
|
||||||
title: EfficientFormer
|
title: EfficientFormer
|
||||||
|
- local: model_doc/efficientloftr
|
||||||
|
title: EfficientLoFTR
|
||||||
- local: model_doc/efficientnet
|
- local: model_doc/efficientnet
|
||||||
title: EfficientNet
|
title: EfficientNet
|
||||||
- local: model_doc/eomt
|
- local: model_doc/eomt
|
||||||
|
|||||||
@@ -258,6 +258,10 @@ The following auto classes are available for the following computer vision tasks
|
|||||||
|
|
||||||
[[autodoc]] AutoModelForKeypointDetection
|
[[autodoc]] AutoModelForKeypointDetection
|
||||||
|
|
||||||
|
### AutoModelForKeypointMatching
|
||||||
|
|
||||||
|
[[autodoc]] AutoModelForKeypointMatching
|
||||||
|
|
||||||
### AutoModelForMaskedImageModeling
|
### AutoModelForMaskedImageModeling
|
||||||
|
|
||||||
[[autodoc]] AutoModelForMaskedImageModeling
|
[[autodoc]] AutoModelForMaskedImageModeling
|
||||||
|
|||||||
114
docs/source/en/model_doc/efficientloftr.md
Normal file
@@ -0,0 +1,114 @@
|
|||||||
|
<!--Copyright 2025 The HuggingFace Team. All rights reserved.
|
||||||
|
|
||||||
|
Licensed under the MIT License; you may not use this file except in compliance with
|
||||||
|
the License.
|
||||||
|
|
||||||
|
Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on
|
||||||
|
an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the
|
||||||
|
specific language governing permissions and limitations under the License.
|
||||||
|
|
||||||
|
⚠️ Note that this file is in Markdown but contains specific syntax for our doc-builder (similar to MDX) that may not be
|
||||||
|
rendered properly in your Markdown viewer.
|
||||||
|
|
||||||
|
|
||||||
|
-->
|
||||||
|
|
||||||
|
# EfficientLoFTR
|
||||||
|
|
||||||
|
<div class="flex flex-wrap space-x-1">
|
||||||
|
<img alt="PyTorch" src="https://img.shields.io/badge/PyTorch-DE3412?style=flat&logo=pytorch&logoColor=white">
|
||||||
|
</div>
|
||||||
|
|
||||||
|
## Overview
|
||||||
|
|
||||||
|
The EfficientLoFTR model was proposed in [Efficient LoFTR: Semi-Dense Local Feature Matching with Sparse-Like Speed](https://arxiv.org/abs/2403.04765) by Yifan Wang, Xingyi He, Sida Peng, Dongli Tan and Xiaowei Zhou.
|
||||||
|
|
||||||
|
This model matches two images by finding pixel correspondences between them. These correspondences can be used to estimate the relative pose between the two views.
They are useful for tasks such as image matching and homography estimation.
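
As an illustration of the homography use case mentioned above, here is a minimal sketch of a least-squares homography fit from matched keypoints. It is not part of this commit: the helper names `estimate_homography` and `apply_homography` are hypothetical, NumPy is assumed to be installed, and the points are synthetic stand-ins for real matches. A practical pipeline would normally wrap such a fit in a robust estimator such as RANSAC to reject outlier matches.

```python
import numpy as np


def estimate_homography(points0, points1):
    """Least-squares homography mapping points0 -> points1 (bottom-right entry fixed to 1)."""
    points0 = np.asarray(points0, dtype=np.float64)
    points1 = np.asarray(points1, dtype=np.float64)
    if points0.shape != points1.shape or points0.shape[0] < 4:
        raise ValueError("need at least 4 paired (x, y) points")
    rows, targets = [], []
    for (x, y), (u, v) in zip(points0, points1):
        # Each correspondence contributes two linear equations in the 8 unknowns.
        rows.append([x, y, 1, 0, 0, 0, -u * x, -u * y])
        rows.append([0, 0, 0, x, y, 1, -v * x, -v * y])
        targets.extend([u, v])
    solution, *_ = np.linalg.lstsq(np.array(rows), np.array(targets), rcond=None)
    return np.append(solution, 1.0).reshape(3, 3)


def apply_homography(homography, points):
    """Project (x, y) points through a 3x3 homography."""
    points = np.asarray(points, dtype=np.float64)
    homogeneous = np.hstack([points, np.ones((len(points), 1))])
    projected = homogeneous @ homography.T
    return projected[:, :2] / projected[:, 2:3]


# Synthetic check: generate "matches" from a known homography and recover it.
true_h = np.array([[1.1, 0.02, 5.0], [-0.03, 0.95, -3.0], [1e-4, 2e-4, 1.0]])
keypoints0 = np.array([[10.0, 20.0], [200.0, 30.0], [220.0, 180.0], [15.0, 160.0], [120.0, 90.0]])
keypoints1 = apply_homography(true_h, keypoints0)
estimated_h = estimate_homography(keypoints0, keypoints1)
```

With real model outputs, `keypoints0` and `keypoints1` would be replaced by the matched coordinates returned by the image processor's post-processing step.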
|
||||||
|
|
||||||
|
The abstract from the paper is the following:
|
||||||
|
|
||||||
|
*We present a novel method for efficiently producing semidense matches across images. Previous detector-free matcher
|
||||||
|
LoFTR has shown remarkable matching capability in handling large-viewpoint change and texture-poor scenarios but suffers
|
||||||
|
from low efficiency. We revisit its design choices and derive multiple improvements for both efficiency and accuracy.
|
||||||
|
One key observation is that performing the transformer over the entire feature map is redundant due to shared local
|
||||||
|
information, therefore we propose an aggregated attention mechanism with adaptive token selection for efficiency.
|
||||||
|
Furthermore, we find spatial variance exists in LoFTR’s fine correlation module, which is adverse to matching accuracy.
|
||||||
|
A novel two-stage correlation layer is proposed to achieve accurate subpixel correspondences for accuracy improvement.
|
||||||
|
Our efficiency optimized model is ∼ 2.5× faster than LoFTR which can even surpass state-of-the-art efficient sparse
|
||||||
|
matching pipeline SuperPoint + LightGlue. Moreover, extensive experiments show that our method can achieve higher
|
||||||
|
accuracy compared with competitive semi-dense matchers, with considerable efficiency benefits. This opens up exciting
|
||||||
|
prospects for large-scale or latency-sensitive applications such as image retrieval and 3D reconstruction.
|
||||||
|
Project page: [https://zju3dv.github.io/efficientloftr/](https://zju3dv.github.io/efficientloftr/).*
|
||||||
|
|
||||||
|
## How to use
|
||||||
|
|
||||||
|
Here is a quick example of using the model.
|
||||||
|
```python
import torch

from transformers import AutoImageProcessor, AutoModelForKeypointMatching
from transformers.image_utils import load_image


image1 = load_image("https://raw.githubusercontent.com/magicleap/SuperGluePretrainedNetwork/refs/heads/master/assets/phototourism_sample_images/united_states_capitol_98169888_3347710852.jpg")
image2 = load_image("https://raw.githubusercontent.com/magicleap/SuperGluePretrainedNetwork/refs/heads/master/assets/phototourism_sample_images/united_states_capitol_26757027_6717084061.jpg")

images = [image1, image2]

processor = AutoImageProcessor.from_pretrained("zju-community/efficientloftr")
model = AutoModelForKeypointMatching.from_pretrained("zju-community/efficientloftr")

inputs = processor(images, return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)
```
|
||||||
|
|
||||||
|
You can use the `post_process_keypoint_matching` method from the `ImageProcessor` to get the keypoints and matches in a more readable format:
|
||||||
|
|
||||||
|
```python
image_sizes = [[(image.height, image.width) for image in images]]
outputs = processor.post_process_keypoint_matching(outputs, image_sizes, threshold=0.2)
for i, output in enumerate(outputs):
    print("For the image pair", i)
    for keypoint0, keypoint1, matching_score in zip(
        output["keypoints0"], output["keypoints1"], output["matching_scores"]
    ):
        print(
            f"Keypoint at coordinate {keypoint0.numpy()} in the first image matches with keypoint at coordinate {keypoint1.numpy()} in the second image with a score of {matching_score}."
        )
```
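
If only the strongest matches are needed, the post-processed output can be filtered further. The sketch below is an illustration rather than part of the library: `filter_matches` is a hypothetical helper, and plain Python tuples and floats stand in for the tensors stored under the `keypoints0`, `keypoints1` and `matching_scores` keys used in the loop above.

```python
def filter_matches(output, min_score):
    """Keep matches whose score is at least min_score, sorted from best to worst."""
    triples = zip(output["keypoints0"], output["keypoints1"], output["matching_scores"])
    kept = [(kp0, kp1, float(score)) for kp0, kp1, score in triples if float(score) >= min_score]
    return sorted(kept, key=lambda match: match[2], reverse=True)


# Stand-in data with the same keys as one post-processed image pair.
example_output = {
    "keypoints0": [(10, 12), (40, 80), (100, 25)],
    "keypoints1": [(14, 15), (44, 79), (90, 30)],
    "matching_scores": [0.95, 0.30, 0.72],
}
strong_matches = filter_matches(example_output, min_score=0.5)
```

Filtering after post-processing leaves the `threshold` passed to `post_process_keypoint_matching` untouched, so several cut-offs can be tried without running the model again.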
|
||||||
|
|
||||||
|
From the post-processed outputs, you can visualize the matches between the two images using the following code:
|
||||||
|
```python
|
||||||
|
images_with_matching = processor.visualize_keypoint_matching(images, outputs)
|
||||||
|
```
|
||||||
|
|
||||||
|

|
||||||
|
|
||||||
|
This model was contributed by [stevenbucaille](https://huggingface.co/stevenbucaille).
|
||||||
|
The original code can be found [here](https://github.com/zju3dv/EfficientLoFTR).
|
||||||
|
|
||||||
|
## EfficientLoFTRConfig
|
||||||
|
|
||||||
|
[[autodoc]] EfficientLoFTRConfig
|
||||||
|
|
||||||
|
## EfficientLoFTRImageProcessor
|
||||||
|
|
||||||
|
[[autodoc]] EfficientLoFTRImageProcessor
|
||||||
|
|
||||||
|
- preprocess
|
||||||
|
- post_process_keypoint_matching
|
||||||
|
- visualize_keypoint_matching
|
||||||
|
|
||||||
|
## EfficientLoFTRModel
|
||||||
|
|
||||||
|
[[autodoc]] EfficientLoFTRModel
|
||||||
|
|
||||||
|
- forward
|
||||||
|
|
||||||
|
## EfficientLoFTRForKeypointMatching
|
||||||
|
|
||||||
|
[[autodoc]] EfficientLoFTRForKeypointMatching
|
||||||
|
|
||||||
|
- forward
|
||||||
@@ -102,6 +102,7 @@ if TYPE_CHECKING:
|
|||||||
from .dots1 import *
|
from .dots1 import *
|
||||||
from .dpr import *
|
from .dpr import *
|
||||||
from .dpt import *
|
from .dpt import *
|
||||||
|
from .efficientloftr import *
|
||||||
from .efficientnet import *
|
from .efficientnet import *
|
||||||
from .electra import *
|
from .electra import *
|
||||||
from .emu3 import *
|
from .emu3 import *
|
||||||
|
|||||||
@@ -121,6 +121,7 @@ CONFIG_MAPPING_NAMES = OrderedDict[str, str](
|
|||||||
("dpr", "DPRConfig"),
|
("dpr", "DPRConfig"),
|
||||||
("dpt", "DPTConfig"),
|
("dpt", "DPTConfig"),
|
||||||
("efficientformer", "EfficientFormerConfig"),
|
("efficientformer", "EfficientFormerConfig"),
|
||||||
|
("efficientloftr", "EfficientLoFTRConfig"),
|
||||||
("efficientnet", "EfficientNetConfig"),
|
("efficientnet", "EfficientNetConfig"),
|
||||||
("electra", "ElectraConfig"),
|
("electra", "ElectraConfig"),
|
||||||
("emu3", "Emu3Config"),
|
("emu3", "Emu3Config"),
|
||||||
@@ -515,6 +516,7 @@ MODEL_NAMES_MAPPING = OrderedDict[str, str](
|
|||||||
("dpr", "DPR"),
|
("dpr", "DPR"),
|
||||||
("dpt", "DPT"),
|
("dpt", "DPT"),
|
||||||
("efficientformer", "EfficientFormer"),
|
("efficientformer", "EfficientFormer"),
|
||||||
|
("efficientloftr", "EfficientLoFTR"),
|
||||||
("efficientnet", "EfficientNet"),
|
("efficientnet", "EfficientNet"),
|
||||||
("electra", "ELECTRA"),
|
("electra", "ELECTRA"),
|
||||||
("emu3", "Emu3"),
|
("emu3", "Emu3"),
|
||||||
|
|||||||
@@ -85,6 +85,7 @@ else:
|
|||||||
("donut-swin", ("DonutImageProcessor", "DonutImageProcessorFast")),
|
("donut-swin", ("DonutImageProcessor", "DonutImageProcessorFast")),
|
||||||
("dpt", ("DPTImageProcessor", "DPTImageProcessorFast")),
|
("dpt", ("DPTImageProcessor", "DPTImageProcessorFast")),
|
||||||
("efficientformer", ("EfficientFormerImageProcessor",)),
|
("efficientformer", ("EfficientFormerImageProcessor",)),
|
||||||
|
("efficientloftr", ("EfficientLoFTRImageProcessor",)),
|
||||||
("efficientnet", ("EfficientNetImageProcessor", "EfficientNetImageProcessorFast")),
|
("efficientnet", ("EfficientNetImageProcessor", "EfficientNetImageProcessorFast")),
|
||||||
("eomt", ("EomtImageProcessor", "EomtImageProcessorFast")),
|
("eomt", ("EomtImageProcessor", "EomtImageProcessorFast")),
|
||||||
("flava", ("FlavaImageProcessor", "FlavaImageProcessorFast")),
|
("flava", ("FlavaImageProcessor", "FlavaImageProcessorFast")),
|
||||||
|
|||||||
@@ -114,6 +114,7 @@ MODEL_MAPPING_NAMES = OrderedDict(
|
|||||||
("dpr", "DPRQuestionEncoder"),
|
("dpr", "DPRQuestionEncoder"),
|
||||||
("dpt", "DPTModel"),
|
("dpt", "DPTModel"),
|
||||||
("efficientformer", "EfficientFormerModel"),
|
("efficientformer", "EfficientFormerModel"),
|
||||||
|
("efficientloftr", "EfficientLoFTRModel"),
|
||||||
("efficientnet", "EfficientNetModel"),
|
("efficientnet", "EfficientNetModel"),
|
||||||
("electra", "ElectraModel"),
|
("electra", "ElectraModel"),
|
||||||
("emu3", "Emu3Model"),
|
("emu3", "Emu3Model"),
|
||||||
@@ -322,7 +323,6 @@ MODEL_MAPPING_NAMES = OrderedDict(
|
|||||||
("squeezebert", "SqueezeBertModel"),
|
("squeezebert", "SqueezeBertModel"),
|
||||||
("stablelm", "StableLmModel"),
|
("stablelm", "StableLmModel"),
|
||||||
("starcoder2", "Starcoder2Model"),
|
("starcoder2", "Starcoder2Model"),
|
||||||
("superglue", "SuperGlueForKeypointMatching"),
|
|
||||||
("swiftformer", "SwiftFormerModel"),
|
("swiftformer", "SwiftFormerModel"),
|
||||||
("swin", "SwinModel"),
|
("swin", "SwinModel"),
|
||||||
("swin2sr", "Swin2SRModel"),
|
("swin2sr", "Swin2SRModel"),
|
||||||
@@ -1607,6 +1607,13 @@ MODEL_FOR_KEYPOINT_DETECTION_MAPPING_NAMES = OrderedDict(
|
|||||||
]
|
]
|
||||||
)
|
)
|
||||||
|
|
||||||
|
MODEL_FOR_KEYPOINT_MATCHING_MAPPING_NAMES = OrderedDict(
|
||||||
|
[
|
||||||
|
("efficientloftr", "EfficientLoFTRForKeypointMatching"),
|
||||||
|
("lightglue", "LightGlueForKeypointMatching"),
|
||||||
|
("superglue", "SuperGlueForKeypointMatching"),
|
||||||
|
]
|
||||||
|
)
|
||||||
|
|
||||||
MODEL_FOR_TEXT_ENCODING_MAPPING_NAMES = OrderedDict(
|
MODEL_FOR_TEXT_ENCODING_MAPPING_NAMES = OrderedDict(
|
||||||
[
|
[
|
||||||
@@ -1768,6 +1775,8 @@ MODEL_FOR_KEYPOINT_DETECTION_MAPPING = _LazyAutoMapping(
|
|||||||
CONFIG_MAPPING_NAMES, MODEL_FOR_KEYPOINT_DETECTION_MAPPING_NAMES
|
CONFIG_MAPPING_NAMES, MODEL_FOR_KEYPOINT_DETECTION_MAPPING_NAMES
|
||||||
)
|
)
|
||||||
|
|
||||||
|
MODEL_FOR_KEYPOINT_MATCHING_MAPPING = _LazyAutoMapping(CONFIG_MAPPING_NAMES, MODEL_FOR_KEYPOINT_MATCHING_MAPPING_NAMES)
|
||||||
|
|
||||||
MODEL_FOR_TEXT_ENCODING_MAPPING = _LazyAutoMapping(CONFIG_MAPPING_NAMES, MODEL_FOR_TEXT_ENCODING_MAPPING_NAMES)
|
MODEL_FOR_TEXT_ENCODING_MAPPING = _LazyAutoMapping(CONFIG_MAPPING_NAMES, MODEL_FOR_TEXT_ENCODING_MAPPING_NAMES)
|
||||||
|
|
||||||
MODEL_FOR_TIME_SERIES_CLASSIFICATION_MAPPING = _LazyAutoMapping(
|
MODEL_FOR_TIME_SERIES_CLASSIFICATION_MAPPING = _LazyAutoMapping(
|
||||||
@@ -1795,6 +1804,10 @@ class AutoModelForKeypointDetection(_BaseAutoModelClass):
|
|||||||
_model_mapping = MODEL_FOR_KEYPOINT_DETECTION_MAPPING
|
_model_mapping = MODEL_FOR_KEYPOINT_DETECTION_MAPPING
|
||||||
|
|
||||||
|
|
||||||
|
class AutoModelForKeypointMatching(_BaseAutoModelClass):
|
||||||
|
_model_mapping = MODEL_FOR_KEYPOINT_MATCHING_MAPPING
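
To make the role of the new mapping concrete, here is a simplified sketch of a lookup from a configuration's `model_type` to a class name. The mapping entries are copied from this commit, but `resolve_class_name` is a hypothetical helper and a plain dictionary lookup is assumed; the real `_LazyAutoMapping` additionally imports the model class lazily and is keyed on configuration classes.

```python
from collections import OrderedDict

# Entries copied from MODEL_FOR_KEYPOINT_MATCHING_MAPPING_NAMES in this commit.
MODEL_FOR_KEYPOINT_MATCHING_MAPPING_NAMES = OrderedDict(
    [
        ("efficientloftr", "EfficientLoFTRForKeypointMatching"),
        ("lightglue", "LightGlueForKeypointMatching"),
        ("superglue", "SuperGlueForKeypointMatching"),
    ]
)


def resolve_class_name(model_type):
    """Hypothetical helper: return the class name registered for a model type."""
    try:
        return MODEL_FOR_KEYPOINT_MATCHING_MAPPING_NAMES[model_type]
    except KeyError:
        supported = ", ".join(MODEL_FOR_KEYPOINT_MATCHING_MAPPING_NAMES)
        raise ValueError(f"Unrecognized model type {model_type!r}; supported: {supported}") from None


print(resolve_class_name("efficientloftr"))
```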
|
||||||
|
|
||||||
|
|
||||||
class AutoModelForTextEncoding(_BaseAutoModelClass):
|
class AutoModelForTextEncoding(_BaseAutoModelClass):
|
||||||
_model_mapping = MODEL_FOR_TEXT_ENCODING_MAPPING
|
_model_mapping = MODEL_FOR_TEXT_ENCODING_MAPPING
|
||||||
|
|
||||||
@@ -2151,6 +2164,7 @@ __all__ = [
|
|||||||
"MODEL_FOR_IMAGE_SEGMENTATION_MAPPING",
|
"MODEL_FOR_IMAGE_SEGMENTATION_MAPPING",
|
||||||
"MODEL_FOR_IMAGE_TO_IMAGE_MAPPING",
|
"MODEL_FOR_IMAGE_TO_IMAGE_MAPPING",
|
||||||
"MODEL_FOR_KEYPOINT_DETECTION_MAPPING",
|
"MODEL_FOR_KEYPOINT_DETECTION_MAPPING",
|
||||||
|
"MODEL_FOR_KEYPOINT_MATCHING_MAPPING",
|
||||||
"MODEL_FOR_INSTANCE_SEGMENTATION_MAPPING",
|
"MODEL_FOR_INSTANCE_SEGMENTATION_MAPPING",
|
||||||
"MODEL_FOR_MASKED_IMAGE_MODELING_MAPPING",
|
"MODEL_FOR_MASKED_IMAGE_MODELING_MAPPING",
|
||||||
"MODEL_FOR_MASKED_LM_MAPPING",
|
"MODEL_FOR_MASKED_LM_MAPPING",
|
||||||
@@ -2196,6 +2210,7 @@ __all__ = [
|
|||||||
"AutoModelForImageToImage",
|
"AutoModelForImageToImage",
|
||||||
"AutoModelForInstanceSegmentation",
|
"AutoModelForInstanceSegmentation",
|
||||||
"AutoModelForKeypointDetection",
|
"AutoModelForKeypointDetection",
|
||||||
|
"AutoModelForKeypointMatching",
|
||||||
"AutoModelForMaskGeneration",
|
"AutoModelForMaskGeneration",
|
||||||
"AutoModelForTextEncoding",
|
"AutoModelForTextEncoding",
|
||||||
"AutoModelForMaskedImageModeling",
|
"AutoModelForMaskedImageModeling",
|
||||||
|
|||||||
28
src/transformers/models/efficientloftr/__init__.py
Normal file
@@ -0,0 +1,28 @@
|
|||||||
|
# Copyright 2025 The HuggingFace Team. All rights reserved.
|
||||||
|
#
|
||||||
|
# Licensed under the Apache License, Version 2.0 (the "License");
|
||||||
|
# you may not use this file except in compliance with the License.
|
||||||
|
# You may obtain a copy of the License at
|
||||||
|
#
|
||||||
|
# http://www.apache.org/licenses/LICENSE-2.0
|
||||||
|
#
|
||||||
|
# Unless required by applicable law or agreed to in writing, software
|
||||||
|
# distributed under the License is distributed on an "AS IS" BASIS,
|
||||||
|
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
|
||||||
|
# See the License for the specific language governing permissions and
|
||||||
|
# limitations under the License.
|
||||||
|
from typing import TYPE_CHECKING
|
||||||
|
|
||||||
|
from ...utils import _LazyModule
|
||||||
|
from ...utils.import_utils import define_import_structure
|
||||||
|
|
||||||
|
|
||||||
|
if TYPE_CHECKING:
|
||||||
|
from .configuration_efficientloftr import *
|
||||||
|
from .image_processing_efficientloftr import *
|
||||||
|
from .modeling_efficientloftr import *
|
||||||
|
else:
|
||||||
|
import sys
|
||||||
|
|
||||||
|
_file = globals()["__file__"]
|
||||||
|
sys.modules[__name__] = _LazyModule(__name__, _file, define_import_structure(_file), module_spec=__spec__)
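
The lazy-import pattern used in this `__init__.py` can be illustrated with a small standalone sketch. This is not the `_LazyModule` implementation from the library: the `LazyModule` class and its `attr_to_module` argument are hypothetical, and the example only shows the general idea of deferring an import until an attribute is first accessed.

```python
import importlib
import types


class LazyModule(types.ModuleType):
    """Toy module object that imports the backing module on first attribute access."""

    def __init__(self, name, attr_to_module):
        super().__init__(name)
        # Maps an exported attribute name to the module that really defines it.
        self._attr_to_module = attr_to_module

    def __getattr__(self, attr):
        # Only called when normal attribute lookup fails, i.e. before caching.
        module_name = self._attr_to_module.get(attr)
        if module_name is None:
            raise AttributeError(f"module {self.__name__!r} has no attribute {attr!r}")
        value = getattr(importlib.import_module(module_name), attr)
        setattr(self, attr, value)  # cache so later lookups skip __getattr__
        return value


lazy_math = LazyModule("lazy_math_demo", {"sqrt": "math", "OrderedDict": "collections"})
print(lazy_math.sqrt(9.0))
```

Deferring imports this way keeps the cost of importing a large package low, since heavy submodules are loaded only when one of their names is actually used.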
|
||||||
@@ -0,0 +1,203 @@
|
|||||||
|
# Copyright 2025 The HuggingFace Team. All rights reserved.
|
||||||
|
#
|
||||||
|
# Licensed under the Apache License, Version 2.0 (the "License");
|
||||||
|
# you may not use this file except in compliance with the License.
|
||||||
|
# You may obtain a copy of the License at
|
||||||
|
#
|
||||||
|
# http://www.apache.org/licenses/LICENSE-2.0
|
||||||
|
#
|
||||||
|
# Unless required by applicable law or agreed to in writing, software
|
||||||
|
# distributed under the License is distributed on an "AS IS" BASIS,
|
||||||
|
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
|
||||||
|
# See the License for the specific language governing permissions and
|
||||||
|
# limitations under the License.
|
||||||
|
from typing import Optional
|
||||||
|
|
||||||
|
from ...configuration_utils import PretrainedConfig
|
||||||
|
from ...modeling_rope_utils import rope_config_validation
|
||||||
|
|
||||||
|
|
||||||
|
class EfficientLoFTRConfig(PretrainedConfig):
|
||||||
|
r"""
|
||||||
|
This is the configuration class to store the configuration of an [`EfficientLoFTRForKeypointMatching`].
It is used to instantiate an EfficientLoFTR model according to the specified arguments, defining the model
|
||||||
|
architecture. Instantiating a configuration with the defaults will yield a similar configuration to that of the
|
||||||
|
EfficientLoFTR [zju-community/efficientloftr](https://huggingface.co/zju-community/efficientloftr) architecture.
|
||||||
|
|
||||||
|
Configuration objects inherit from [`PretrainedConfig`] and can be used to control the model outputs. Read the
|
||||||
|
documentation from [`PretrainedConfig`] for more information.
|
||||||
|
|
||||||
|
Args:
|
||||||
|
stage_num_blocks (`List`, *optional*, defaults to [1, 2, 4, 14]):
|
||||||
|
The number of blocks in each stage
|
||||||
|
out_features (`List`, *optional*, defaults to [64, 64, 128, 256]):
|
||||||
|
The number of channels in each stage
|
||||||
|
stage_stride (`List`, *optional*, defaults to [2, 1, 2, 2]):
|
||||||
|
The stride used in each stage
|
||||||
|
hidden_size (`int`, *optional*, defaults to 256):
|
||||||
|
The dimension of the descriptors.
|
||||||
|
activation_function (`str`, *optional*, defaults to `"relu"`):
|
||||||
|
The activation function used in the backbone
|
||||||
|
q_aggregation_kernel_size (`int`, *optional*, defaults to 4):
|
||||||
|
The kernel size of the aggregation of query states in the fusion network
|
||||||
|
kv_aggregation_kernel_size (`int`, *optional*, defaults to 4):
|
||||||
|
The kernel size of the aggregation of key and value states in the fusion network
|
||||||
|
q_aggregation_stride (`int`, *optional*, defaults to 4):
|
||||||
|
The stride of the aggregation of query states in the fusion network
|
||||||
|
kv_aggregation_stride (`int`, *optional*, defaults to 4):
|
||||||
|
The stride of the aggregation of key and value states in the fusion network
|
||||||
|
num_attention_layers (`int`, *optional*, defaults to 4):
|
||||||
|
Number of attention layers in the LocalFeatureTransformer
|
||||||
|
num_attention_heads (`int`, *optional*, defaults to 8):
|
||||||
|
The number of attention heads in each attention layer.
|
||||||
|
attention_dropout (`float`, *optional*, defaults to 0.0):
|
||||||
|
The dropout ratio for the attention probabilities.
|
||||||
|
attention_bias (`bool`, *optional*, defaults to `False`):
|
||||||
|
Whether to use a bias in the query, key, value and output projection layers during attention.
|
||||||
|
mlp_activation_function (`str`, *optional*, defaults to `"leaky_relu"`):
|
||||||
|
Activation function used in the attention mlp layer.
|
||||||
|
coarse_matching_skip_softmax (`bool`, *optional*, defaults to `False`):
|
||||||
|
Whether to skip softmax or not at the coarse matching step.
|
||||||
|
coarse_matching_threshold (`float`, *optional*, defaults to 0.2):
|
||||||
|
The threshold for the minimum score required for a match.
|
||||||
|
coarse_matching_temperature (`float`, *optional*, defaults to 0.1):
|
||||||
|
The temperature to apply to the coarse similarity matrix
|
||||||
|
coarse_matching_border_removal (`int`, *optional*, defaults to 2):
|
||||||
|
The size of the border to remove during coarse matching
|
||||||
|
fine_kernel_size (`int`, *optional*, defaults to 8):
|
||||||
|
Kernel size used for the fine feature matching
|
||||||
|
batch_norm_eps (`float`, *optional*, defaults to 1e-05):
|
||||||
|
The epsilon used by the batch normalization layers.
|
||||||
|
embedding_size (`List`, *optional*, defaults to [15, 20]):
|
||||||
|
The size (height, width) of the embedding for the position embeddings.
|
||||||
|
rope_theta (`float`, *optional*, defaults to 10000.0):
|
||||||
|
The base period of the RoPE embeddings.
|
||||||
|
partial_rotary_factor (`float`, *optional*, defaults to 4.0):
|
||||||
|
Dim factor for the RoPE embeddings, in EfficientLoFTR, frequencies should be generated for
|
||||||
|
the whole hidden_size, so this factor is used to compensate.
|
||||||
|
rope_scaling (`Dict`, *optional*):
|
||||||
|
Dictionary containing the scaling configuration for the RoPE embeddings. NOTE: if you apply new rope type
|
||||||
|
and you expect the model to work on longer `max_position_embeddings`, we recommend you to update this value
|
||||||
|
accordingly.
|
||||||
|
Expected contents:
|
||||||
|
`rope_type` (`str`):
|
||||||
|
The sub-variant of RoPE to use. Can be one of ['default', 'linear', 'dynamic', 'yarn', 'longrope',
|
||||||
|
'llama3', '2d'], with 'default' being the original RoPE implementation.
|
||||||
|
`dim` (`int`): The dimension of the RoPE embeddings.
|
||||||
|
fine_matching_slice_dim (`int`, *optional*, defaults to 8):
|
||||||
|
The size of the slice used to divide the fine features for the first and second fine matching stages.
|
||||||
|
fine_matching_regress_temperature (`float`, *optional*, defaults to 10.0):
|
||||||
|
The temperature to apply to the fine similarity matrix
|
||||||
|
initializer_range (`float`, *optional*, defaults to 0.02):
|
||||||
|
The standard deviation of the truncated_normal_initializer for initializing all weight matrices.
|
||||||
|
|
||||||
|
Examples:
|
||||||
|
```python
|
||||||
|
>>> from transformers import EfficientLoFTRConfig, EfficientLoFTRForKeypointMatching
|
||||||
|
|
||||||
|
>>> # Initializing an EfficientLoFTR configuration
|
||||||
|
>>> configuration = EfficientLoFTRConfig()
|
||||||
|
|
||||||
|
>>> # Initializing a model from the EfficientLoFTR configuration
|
||||||
|
>>> model = EfficientLoFTRForKeypointMatching(configuration)
|
||||||
|
|
||||||
|
>>> # Accessing the model configuration
|
||||||
|
>>> configuration = model.config
|
||||||
|
```
|
||||||
|
"""
|
||||||
|
|
||||||
|
model_type = "efficientloftr"
|
||||||
|
|
||||||
|
def __init__(
|
||||||
|
self,
|
||||||
|
stage_num_blocks: Optional[list[int]] = None,
|
||||||
|
out_features: Optional[list[int]] = None,
|
||||||
|
stage_stride: Optional[list[int]] = None,
|
||||||
|
hidden_size: int = 256,
|
||||||
|
activation_function: str = "relu",
|
||||||
|
q_aggregation_kernel_size: int = 4,
|
||||||
|
kv_aggregation_kernel_size: int = 4,
|
||||||
|
q_aggregation_stride: int = 4,
|
||||||
|
kv_aggregation_stride: int = 4,
|
||||||
|
num_attention_layers: int = 4,
|
||||||
|
num_attention_heads: int = 8,
|
||||||
|
attention_dropout: float = 0.0,
|
||||||
|
attention_bias: bool = False,
|
||||||
|
mlp_activation_function: str = "leaky_relu",
|
||||||
|
coarse_matching_skip_softmax: bool = False,
|
||||||
|
coarse_matching_threshold: float = 0.2,
|
||||||
|
coarse_matching_temperature: float = 0.1,
|
||||||
|
coarse_matching_border_removal: int = 2,
|
||||||
|
fine_kernel_size: int = 8,
|
||||||
|
batch_norm_eps: float = 1e-5,
|
||||||
|
embedding_size: Optional[list[int]] = None,
|
||||||
|
rope_theta: float = 10000.0,
|
||||||
|
partial_rotary_factor: float = 4.0,
|
||||||
|
rope_scaling: Optional[dict] = None,
|
||||||
|
fine_matching_slice_dim: int = 8,
|
||||||
|
fine_matching_regress_temperature: float = 10.0,
|
||||||
|
initializer_range: float = 0.02,
|
||||||
|
**kwargs,
|
||||||
|
):
|
||||||
|
# Stage level of RepVGG
|
||||||
|
self.stage_num_blocks = stage_num_blocks if stage_num_blocks is not None else [1, 2, 4, 14]
|
||||||
|
self.stage_stride = stage_stride if stage_stride is not None else [2, 1, 2, 2]
|
||||||
|
self.out_features = out_features if out_features is not None else [64, 64, 128, 256]
|
||||||
|
self.stage_in_channels = [1] + self.out_features[:-1]
|
||||||
|
|
||||||
|
# Block level of RepVGG
|
||||||
|
self.stage_block_stride = [
|
||||||
|
[stride] + [1] * (num_blocks - 1) for stride, num_blocks in zip(self.stage_stride, self.stage_num_blocks)
|
||||||
|
]
|
||||||
|
self.stage_block_out_channels = [
|
||||||
|
[self.out_features[stage_idx]] * num_blocks for stage_idx, num_blocks in enumerate(self.stage_num_blocks)
|
||||||
|
]
|
||||||
|
self.stage_block_in_channels = [
|
||||||
|
[self.stage_in_channels[stage_idx]] + self.stage_block_out_channels[stage_idx][:-1]
|
||||||
|
for stage_idx in range(len(self.stage_num_blocks))
|
||||||
|
]
|
||||||
|
|
||||||
|
# Fine matching level of EfficientLoFTR
|
||||||
|
self.fine_fusion_dims = list(reversed(self.out_features))[:-1]
|
||||||
|
|
||||||
|
self.hidden_size = hidden_size
|
||||||
|
if self.hidden_size != self.out_features[-1]:
|
||||||
|
raise ValueError(
f"hidden_size should be equal to the last value in out_features. hidden_size = {self.hidden_size}, out_features = {self.out_features}"
)
|
||||||
|
|
||||||
|
self.activation_function = activation_function
|
||||||
|
self.q_aggregation_kernel_size = q_aggregation_kernel_size
|
||||||
|
self.kv_aggregation_kernel_size = kv_aggregation_kernel_size
|
||||||
|
self.q_aggregation_stride = q_aggregation_stride
|
||||||
|
self.kv_aggregation_stride = kv_aggregation_stride
|
||||||
|
self.num_attention_layers = num_attention_layers
|
||||||
|
self.num_attention_heads = num_attention_heads
|
||||||
|
self.attention_dropout = attention_dropout
|
||||||
|
self.attention_bias = attention_bias
|
||||||
|
self.intermediate_size = self.hidden_size * 2
|
||||||
|
self.mlp_activation_function = mlp_activation_function
|
||||||
|
self.coarse_matching_skip_softmax = coarse_matching_skip_softmax
|
||||||
|
self.coarse_matching_threshold = coarse_matching_threshold
|
||||||
|
self.coarse_matching_temperature = coarse_matching_temperature
|
||||||
|
self.coarse_matching_border_removal = coarse_matching_border_removal
|
||||||
|
self.fine_kernel_size = fine_kernel_size
|
||||||
|
self.batch_norm_eps = batch_norm_eps
|
||||||
|
self.fine_matching_slice_dim = fine_matching_slice_dim
|
||||||
|
self.fine_matching_regress_temperature = fine_matching_regress_temperature
|
||||||
|
|
||||||
|
self.num_key_value_heads = num_attention_heads
|
||||||
|
self.embedding_size = embedding_size if embedding_size is not None else [15, 20]
|
||||||
|
self.rope_theta = rope_theta
|
||||||
|
self.rope_scaling = rope_scaling if rope_scaling is not None else {"rope_type": "default"}
|
||||||
|
|
||||||
|
# for compatibility with "default" rope type
|
||||||
|
self.partial_rotary_factor = partial_rotary_factor
|
||||||
|
rope_config_validation(self)
|
||||||
|
|
||||||
|
self.initializer_range = initializer_range
|
||||||
|
|
||||||
|
super().__init__(**kwargs)
|
||||||
|
|
||||||
|
|
||||||
|
__all__ = ["EfficientLoFTRConfig"]
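
The derived per-block attributes computed in `__init__` above are easier to follow with concrete numbers. The following sketch repeats the same list comprehensions outside the class, using the documented default values; the local variable names mirror the configuration attributes and nothing here is new behavior.

```python
# Default values documented in EfficientLoFTRConfig.
stage_num_blocks = [1, 2, 4, 14]
stage_stride = [2, 1, 2, 2]
out_features = [64, 64, 128, 256]

# The first stage takes a single (grayscale) channel, later stages take the previous output.
stage_in_channels = [1] + out_features[:-1]

# Only the first block of each stage applies the stage stride.
stage_block_stride = [
    [stride] + [1] * (num_blocks - 1) for stride, num_blocks in zip(stage_stride, stage_num_blocks)
]
# Every block in a stage outputs that stage's channel count.
stage_block_out_channels = [
    [out_features[stage_idx]] * num_blocks for stage_idx, num_blocks in enumerate(stage_num_blocks)
]
# The first block reads the stage input; the others read the previous block's output.
stage_block_in_channels = [
    [stage_in_channels[stage_idx]] + stage_block_out_channels[stage_idx][:-1]
    for stage_idx in range(len(stage_num_blocks))
]
# Fine fusion walks the stages from deepest to shallowest, skipping the first stage.
fine_fusion_dims = list(reversed(out_features))[:-1]

print(stage_in_channels)
print(stage_block_in_channels[2])
print(fine_fusion_dims)
```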
|
||||||
@@ -0,0 +1,257 @@
# Copyright 2025 The HuggingFace Team. All rights reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
#     http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
import argparse
import gc
import os
import re

import torch
from datasets import load_dataset
from huggingface_hub import hf_hub_download

from transformers.models.efficientloftr.image_processing_efficientloftr import EfficientLoFTRImageProcessor
from transformers.models.efficientloftr.modeling_efficientloftr import (
    EfficientLoFTRConfig,
    EfficientLoFTRForKeypointMatching,
)


DEFAULT_MODEL_REPO = "stevenbucaille/efficient_loftr_pth"
DEFAULT_FILE = "eloftr.pth"


def prepare_imgs():
    dataset = load_dataset("hf-internal-testing/image-matching-test-dataset", split="train")
    image0 = dataset[0]["image"]
    image2 = dataset[2]["image"]
    return [[image2, image0]]


def verify_model_outputs(model, device):
    images = prepare_imgs()
    preprocessor = EfficientLoFTRImageProcessor()
    inputs = preprocessor(images=images, return_tensors="pt").to(device)
    model.to(device)
    model.eval()
    with torch.no_grad():
        outputs = model(**inputs, output_hidden_states=True, output_attentions=True)

    predicted_number_of_matches = outputs.matches.shape[-1]
    predicted_top10 = torch.topk(outputs.matching_scores[0, 0], k=10)
    predicted_top10_matches_indices = predicted_top10.indices
    predicted_top10_matching_scores = predicted_top10.values

    expected_number_of_matches = 4800
    expected_matches_shape = torch.Size((len(images), 2, expected_number_of_matches))
    expected_matching_scores_shape = torch.Size((len(images), 2, expected_number_of_matches))

    expected_top10_matches_indices = torch.tensor(
        [1798, 1639, 1401, 1559, 2596, 2362, 2441, 2605, 1643, 2607], dtype=torch.int64
    ).to(device)
    expected_top10_matching_scores = torch.tensor(
        [0.9563, 0.9355, 0.9265, 0.9091, 0.9071, 0.9062, 0.9000, 0.8978, 0.8908, 0.8853]
    ).to(device)

    assert outputs.matches.shape == expected_matches_shape
    assert outputs.matching_scores.shape == expected_matching_scores_shape

    torch.testing.assert_close(predicted_top10_matches_indices, expected_top10_matches_indices, rtol=5e-3, atol=5e-3)
    torch.testing.assert_close(predicted_top10_matching_scores, expected_top10_matching_scores, rtol=5e-3, atol=5e-3)

    assert predicted_number_of_matches == expected_number_of_matches


ORIGINAL_TO_CONVERTED_KEY_MAPPING = {
    r"matcher.backbone.layer(\d+).rbr_dense.conv": r"efficientloftr.backbone.stages.\1.blocks.0.conv1.conv",
    r"matcher.backbone.layer(\d+).rbr_dense.bn": r"efficientloftr.backbone.stages.\1.blocks.0.conv1.norm",
    r"matcher.backbone.layer(\d+).rbr_1x1.conv": r"efficientloftr.backbone.stages.\1.blocks.0.conv2.conv",
    r"matcher.backbone.layer(\d+).rbr_1x1.bn": r"efficientloftr.backbone.stages.\1.blocks.0.conv2.norm",
    r"matcher.backbone.layer(\d+).(\d+).rbr_dense.conv": r"efficientloftr.backbone.stages.\1.blocks.\2.conv1.conv",
    r"matcher.backbone.layer(\d+).(\d+).rbr_dense.bn": r"efficientloftr.backbone.stages.\1.blocks.\2.conv1.norm",
    r"matcher.backbone.layer(\d+).(\d+).rbr_1x1.conv": r"efficientloftr.backbone.stages.\1.blocks.\2.conv2.conv",
    r"matcher.backbone.layer(\d+).(\d+).rbr_1x1.bn": r"efficientloftr.backbone.stages.\1.blocks.\2.conv2.norm",
    r"matcher.backbone.layer(\d+).(\d+).rbr_identity": r"efficientloftr.backbone.stages.\1.blocks.\2.identity",
    r"matcher.loftr_coarse.layers.(\d*[02468]).aggregate": lambda m: f"efficientloftr.local_feature_transformer.layers.{int(m.group(1)) // 2}.self_attention.aggregation.q_aggregation",
    r"matcher.loftr_coarse.layers.(\d*[02468]).norm1": lambda m: f"efficientloftr.local_feature_transformer.layers.{int(m.group(1)) // 2}.self_attention.aggregation.norm",
    r"matcher.loftr_coarse.layers.(\d*[02468]).q_proj": lambda m: f"efficientloftr.local_feature_transformer.layers.{int(m.group(1)) // 2}.self_attention.attention.q_proj",
    r"matcher.loftr_coarse.layers.(\d*[02468]).k_proj": lambda m: f"efficientloftr.local_feature_transformer.layers.{int(m.group(1)) // 2}.self_attention.attention.k_proj",
    r"matcher.loftr_coarse.layers.(\d*[02468]).v_proj": lambda m: f"efficientloftr.local_feature_transformer.layers.{int(m.group(1)) // 2}.self_attention.attention.v_proj",
    r"matcher.loftr_coarse.layers.(\d*[02468]).merge": lambda m: f"efficientloftr.local_feature_transformer.layers.{int(m.group(1)) // 2}.self_attention.attention.o_proj",
    r"matcher.loftr_coarse.layers.(\d*[02468]).mlp.(\d+)": lambda m: f"efficientloftr.local_feature_transformer.layers.{int(m.group(1)) // 2}.self_attention.mlp.fc{1 if m.group(2) == '0' else 2}",
    r"matcher.loftr_coarse.layers.(\d*[02468]).norm2": lambda m: f"efficientloftr.local_feature_transformer.layers.{int(m.group(1)) // 2}.self_attention.mlp.layer_norm",
    r"matcher.loftr_coarse.layers.(\d*[13579]).aggregate": lambda m: f"efficientloftr.local_feature_transformer.layers.{int(m.group(1)) // 2}.cross_attention.aggregation.q_aggregation",
    r"matcher.loftr_coarse.layers.(\d*[13579]).norm1": lambda m: f"efficientloftr.local_feature_transformer.layers.{int(m.group(1)) // 2}.cross_attention.aggregation.norm",
    r"matcher.loftr_coarse.layers.(\d*[13579]).q_proj": lambda m: f"efficientloftr.local_feature_transformer.layers.{int(m.group(1)) // 2}.cross_attention.attention.q_proj",
    r"matcher.loftr_coarse.layers.(\d*[13579]).k_proj": lambda m: f"efficientloftr.local_feature_transformer.layers.{int(m.group(1)) // 2}.cross_attention.attention.k_proj",
    r"matcher.loftr_coarse.layers.(\d*[13579]).v_proj": lambda m: f"efficientloftr.local_feature_transformer.layers.{int(m.group(1)) // 2}.cross_attention.attention.v_proj",
    r"matcher.loftr_coarse.layers.(\d*[13579]).merge": lambda m: f"efficientloftr.local_feature_transformer.layers.{int(m.group(1)) // 2}.cross_attention.attention.o_proj",
    r"matcher.loftr_coarse.layers.(\d*[13579]).mlp.(\d+)": lambda m: f"efficientloftr.local_feature_transformer.layers.{int(m.group(1)) // 2}.cross_attention.mlp.fc{1 if m.group(2) == '0' else 2}",
    r"matcher.loftr_coarse.layers.(\d*[13579]).norm2": lambda m: f"efficientloftr.local_feature_transformer.layers.{int(m.group(1)) // 2}.cross_attention.mlp.layer_norm",
    r"matcher.fine_preprocess.layer3_outconv": "refinement_layer.out_conv",
    r"matcher.fine_preprocess.layer(\d+)_outconv.weight": lambda m: f"refinement_layer.out_conv_layers.{0 if int(m.group(1)) == 2 else m.group(1)}.out_conv1.weight",
    r"matcher.fine_preprocess.layer(\d+)_outconv2\.0": lambda m: f"refinement_layer.out_conv_layers.{0 if int(m.group(1)) == 2 else m.group(1)}.out_conv2",
    r"matcher.fine_preprocess.layer(\d+)_outconv2\.1": lambda m: f"refinement_layer.out_conv_layers.{0 if int(m.group(1)) == 2 else m.group(1)}.batch_norm",
    r"matcher.fine_preprocess.layer(\d+)_outconv2\.3": lambda m: f"refinement_layer.out_conv_layers.{0 if int(m.group(1)) == 2 else m.group(1)}.out_conv3",
}


def convert_old_keys_to_new_keys(state_dict_keys: list[str]):
    """
    This function should be applied only once, on the concatenated keys to efficiently rename using
    the key mappings.
    """
    output_dict = {}
    if state_dict_keys is not None:
        old_text = "\n".join(state_dict_keys)
        new_text = old_text
        for pattern, replacement in ORIGINAL_TO_CONVERTED_KEY_MAPPING.items():
            if replacement is None:
                new_text = re.sub(pattern, "", new_text)  # an empty line
                continue
            new_text = re.sub(pattern, replacement, new_text)
        output_dict = dict(zip(old_text.split("\n"), new_text.split("\n")))
    return output_dict
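
The renaming trick used by `convert_old_keys_to_new_keys` can be tried in isolation. The sketch below is a minimal stand-alone illustration, not the conversion script itself: `KEY_MAPPING`, `rename_keys` and the key names are hypothetical and only mimic the style of `ORIGINAL_TO_CONVERTED_KEY_MAPPING`. It shows how joining all keys with newlines lets each pattern be applied with a single `re.sub` call, and how a callable replacement turns even layer indices into self-attention blocks and odd ones into cross-attention blocks.

```python
import re

# Hypothetical two-entry mapping in the style of ORIGINAL_TO_CONVERTED_KEY_MAPPING.
KEY_MAPPING = {
    # Even layer indices map to self-attention, odd ones to cross-attention.
    r"coarse\.layers\.(\d*[02468])\.q_proj": lambda m: (
        f"transformer.layers.{int(m.group(1)) // 2}.self_attention.q_proj"
    ),
    r"coarse\.layers\.(\d*[13579])\.q_proj": lambda m: (
        f"transformer.layers.{int(m.group(1)) // 2}.cross_attention.q_proj"
    ),
}


def rename_keys(keys):
    # Join every key into one text so each pattern needs a single re.sub call.
    old_text = "\n".join(keys)
    new_text = old_text
    for pattern, replacement in KEY_MAPPING.items():
        new_text = re.sub(pattern, replacement, new_text)
    # Splitting both texts again gives an old-name -> new-name dictionary.
    return dict(zip(old_text.split("\n"), new_text.split("\n")))


renamed = rename_keys(
    ["coarse.layers.0.q_proj.weight", "coarse.layers.3.q_proj.weight", "head.bias"]
)
for old_name, new_name in renamed.items():
    print(old_name, "->", new_name)
```

Keys that match no pattern, such as `head.bias` here, pass through unchanged, which is why the dictionary can be indexed with every original key afterwards.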


@torch.no_grad()
def write_model(
    model_path,
    model_repo,
    file_name,
    organization,
    safe_serialization=True,
    push_to_hub=False,
):
    os.makedirs(model_path, exist_ok=True)
    # ------------------------------------------------------------
    # EfficientLoFTR config
    # ------------------------------------------------------------

    config = EfficientLoFTRConfig()
    config.architectures = ["EfficientLoFTRForKeypointMatching"]
    config.save_pretrained(model_path)
    print("Model config saved successfully...")

    # ------------------------------------------------------------
    # Convert weights
    # ------------------------------------------------------------

    print(f"Fetching all parameters from the checkpoint at {model_repo}/{file_name}...")
    checkpoint_path = hf_hub_download(repo_id=model_repo, filename=file_name)
    original_state_dict = torch.load(checkpoint_path, weights_only=True, map_location="cpu")["state_dict"]

    print("Converting model...")
    all_keys = list(original_state_dict.keys())
    new_keys = convert_old_keys_to_new_keys(all_keys)

    state_dict = {}
    for key in all_keys:
        new_key = new_keys[key]
        state_dict[new_key] = original_state_dict.pop(key).contiguous().clone()

    del original_state_dict
    gc.collect()

    print("Loading the checkpoint in an EfficientLoFTR model...")

    device = "cuda" if torch.cuda.is_available() else "cpu"
    with torch.device(device):
        model = EfficientLoFTRForKeypointMatching(config)
    model.load_state_dict(state_dict)
    print("Checkpoint loaded successfully...")
    del model.config._name_or_path

    print("Saving the model...")
    model.save_pretrained(model_path, safe_serialization=safe_serialization)
    del state_dict, model

    # Safety check: reload the converted model
    gc.collect()
    print("Reloading the model to check if it's saved correctly.")
    model = EfficientLoFTRForKeypointMatching.from_pretrained(model_path)
    print("Model reloaded successfully.")

    model_name = "efficientloftr"
    if model_repo == DEFAULT_MODEL_REPO:
        print("Checking the model outputs...")
        verify_model_outputs(model, device)
        print("Model outputs verified successfully.")

    if push_to_hub:
        print("Pushing model to the hub...")
        model.push_to_hub(
            repo_id=f"{organization}/{model_name}",
            commit_message="Add model",
        )
        config.push_to_hub(repo_id=f"{organization}/{model_name}", commit_message="Add config")

    write_image_processor(model_path, model_name, organization, push_to_hub=push_to_hub)


def write_image_processor(save_dir, model_name, organization, push_to_hub=False):
    image_processor = EfficientLoFTRImageProcessor()
    image_processor.save_pretrained(save_dir)

    if push_to_hub:
        print("Pushing image processor to the hub...")
        image_processor.push_to_hub(
            repo_id=f"{organization}/{model_name}",
            commit_message="Add image processor",
        )


if __name__ == "__main__":
    parser = argparse.ArgumentParser()
    # Required parameters
    parser.add_argument(
        "--repo_id",
        default=DEFAULT_MODEL_REPO,
        type=str,
        help="Model repo ID of the original EfficientLoFTR checkpoint you'd like to convert.",
    )
    parser.add_argument(
        "--file_name",
        default=DEFAULT_FILE,
        type=str,
        help="File name of the original EfficientLoFTR checkpoint you'd like to convert.",
    )
    parser.add_argument(
        "--pytorch_dump_folder_path",
        default=None,
        type=str,
        required=True,
        help="Path to the output PyTorch model directory.",
    )
    parser.add_argument("--save_model", action="store_true", help="Save model to local")
    parser.add_argument(
        "--push_to_hub",
        action="store_true",
        help="Push model and image preprocessor to the hub",
    )
    parser.add_argument(
        "--organization",
        default="zju-community",
        type=str,
        help="Hub organization in which you want the model to be uploaded.",
    )

    args = parser.parse_args()
    write_model(
        args.pytorch_dump_folder_path,
        args.repo_id,
        args.file_name,
        args.organization,
        safe_serialization=True,
        push_to_hub=args.push_to_hub,
    )
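
To see which values `write_model` receives for a typical invocation, the command-line interface can be reproduced without any model code. This is a reduced sketch: `build_parser` is a hypothetical helper, the output directory `./efficientloftr_converted` is a made-up example path, and the help strings and the `--save_model` flag are omitted for brevity. The argument names and defaults are taken from the script above.

```python
import argparse


def build_parser():
    # Mirrors the arguments of the conversion script (help strings omitted).
    parser = argparse.ArgumentParser()
    parser.add_argument("--repo_id", default="stevenbucaille/efficient_loftr_pth", type=str)
    parser.add_argument("--file_name", default="eloftr.pth", type=str)
    parser.add_argument("--pytorch_dump_folder_path", type=str, required=True)
    parser.add_argument("--push_to_hub", action="store_true")
    parser.add_argument("--organization", default="zju-community", type=str)
    return parser


# Only the output directory is mandatory; everything else falls back to defaults.
args = build_parser().parse_args(["--pytorch_dump_folder_path", "./efficientloftr_converted"])
print(args.repo_id, args.file_name, args.organization, args.push_to_hub)
```

With these defaults nothing is uploaded: pushing to the hub only happens when `--push_to_hub` is passed explicitly.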
@@ -0,0 +1,461 @@
# Copyright 2025 The HuggingFace Team. All rights reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
#     http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
"""Image processor class for EfficientLoFTR."""

from typing import Optional, Union

import numpy as np

from ... import is_torch_available, is_vision_available
from ...image_processing_utils import BaseImageProcessor, BatchFeature, get_size_dict
from ...image_transforms import resize, to_channel_dimension_format
from ...image_utils import (
    ChannelDimension,
    ImageInput,
    ImageType,
    PILImageResampling,
    get_image_type,
    infer_channel_dimension_format,
    is_pil_image,
    is_scaled_image,
    is_valid_image,
    to_numpy_array,
    valid_images,
    validate_preprocess_arguments,
)
from ...utils import TensorType, logging, requires_backends


if is_torch_available():
    import torch

if is_vision_available():
    import PIL
    from PIL import Image, ImageDraw

    from .modeling_efficientloftr import KeypointMatchingOutput

logger = logging.get_logger(__name__)


# Copied from transformers.models.superpoint.image_processing_superpoint.is_grayscale
def is_grayscale(
    image: np.ndarray,
    input_data_format: Optional[Union[str, ChannelDimension]] = None,
):
    if input_data_format == ChannelDimension.FIRST:
        if image.shape[0] == 1:
            return True
        return np.all(image[0, ...] == image[1, ...]) and np.all(image[1, ...] == image[2, ...])
    elif input_data_format == ChannelDimension.LAST:
        if image.shape[-1] == 1:
            return True
        return np.all(image[..., 0] == image[..., 1]) and np.all(image[..., 1] == image[..., 2])


# Copied from transformers.models.superpoint.image_processing_superpoint.convert_to_grayscale
def convert_to_grayscale(
    image: ImageInput,
    input_data_format: Optional[Union[str, ChannelDimension]] = None,
) -> ImageInput:
    """
    Converts an image to grayscale format using the NTSC formula. Only supports numpy arrays and PIL images.
    TODO: support torch and tensorflow grayscale conversion.

    This function is supposed to return a 1-channel image, but it returns a 3-channel image with the same value in each
    channel, because of an issue that is discussed in:
    https://github.com/huggingface/transformers/pull/25786#issuecomment-1730176446

    Args:
        image (Image):
            The image to convert.
        input_data_format (`ChannelDimension` or `str`, *optional*):
            The channel dimension format for the input image.
    """
    requires_backends(convert_to_grayscale, ["vision"])

    if isinstance(image, np.ndarray):
        if is_grayscale(image, input_data_format=input_data_format):
            return image
        if input_data_format == ChannelDimension.FIRST:
            gray_image = image[0, ...] * 0.2989 + image[1, ...] * 0.5870 + image[2, ...] * 0.1140
            gray_image = np.stack([gray_image] * 3, axis=0)
        elif input_data_format == ChannelDimension.LAST:
            gray_image = image[..., 0] * 0.2989 + image[..., 1] * 0.5870 + image[..., 2] * 0.1140
            gray_image = np.stack([gray_image] * 3, axis=-1)
        return gray_image

    if not isinstance(image, PIL.Image.Image):
        return image

    image = image.convert("L")
    return image
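
The weighted sum used by `convert_to_grayscale` is easy to check on single pixels. The sketch below is a pure-Python illustration under the assumption that a pixel is an `(R, G, B)` tuple of floats; `ntsc_luminance` and `to_three_channel_gray` are hypothetical helper names, not functions from the library. It reproduces the same weights and the same choice of repeating the gray value in three channels.

```python
def ntsc_luminance(red, green, blue):
    # Same channel weights as in convert_to_grayscale.
    return red * 0.2989 + green * 0.5870 + blue * 0.1140


def to_three_channel_gray(pixel):
    # The gray value is repeated so the result keeps three channels.
    gray = ntsc_luminance(*pixel)
    return (gray, gray, gray)


pure_green = to_three_channel_gray((0.0, 1.0, 0.0))
white = to_three_channel_gray((1.0, 1.0, 1.0))
print(pure_green, white)
```

Because the three weights sum to 0.9999 rather than exactly 1, a pure white pixel comes out very slightly darker than its input value.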


# Copied from transformers.models.superglue.image_processing_superglue.validate_and_format_image_pairs
def validate_and_format_image_pairs(images: ImageInput):
    # Adjacent string literals are concatenated into a single message.
    error_message = (
        "Input images must be one of the following:\n"
        " - A pair of PIL images.\n"
        " - A pair of 3D arrays.\n"
        " - A list of pairs of PIL images.\n"
        " - A list of pairs of 3D arrays."
    )

    def _is_valid_image(image):
        """image is a PIL Image or a 3D array."""
        return is_pil_image(image) or (
            is_valid_image(image) and get_image_type(image) != ImageType.PIL and len(image.shape) == 3
        )

    if isinstance(images, list):
        if len(images) == 2 and all((_is_valid_image(image)) for image in images):
            return images
        if all(
            isinstance(image_pair, list)
            and len(image_pair) == 2
            and all(_is_valid_image(image) for image in image_pair)
            for image_pair in images
        ):
            return [image for image_pair in images for image in image_pair]
    raise ValueError(error_message)


class EfficientLoFTRImageProcessor(BaseImageProcessor):
    r"""
    Constructs an EfficientLoFTR image processor.

    Args:
        do_resize (`bool`, *optional*, defaults to `True`):
            Controls whether to resize the image's (height, width) dimensions to the specified `size`. Can be
            overridden by `do_resize` in the `preprocess` method.
        size (`dict[str, int]`, *optional*, defaults to `{"height": 480, "width": 640}`):
            Resolution of the output image after `resize` is applied. Only has an effect if `do_resize` is set to
            `True`. Can be overridden by `size` in the `preprocess` method.
        resample (`PILImageResampling`, *optional*, defaults to `Resampling.BILINEAR`):
            Resampling filter to use if resizing the image. Can be overridden by `resample` in the `preprocess` method.
        do_rescale (`bool`, *optional*, defaults to `True`):
            Whether to rescale the image by the specified scale `rescale_factor`. Can be overridden by `do_rescale` in
            the `preprocess` method.
        rescale_factor (`int` or `float`, *optional*, defaults to `1/255`):
            Scale factor to use if rescaling the image. Can be overridden by `rescale_factor` in the `preprocess`
            method.
        do_grayscale (`bool`, *optional*, defaults to `True`):
            Whether to convert the image to grayscale. Can be overridden by `do_grayscale` in the `preprocess` method.
    """

    model_input_names = ["pixel_values"]

    def __init__(
        self,
        do_resize: bool = True,
        size: Optional[dict[str, int]] = None,
        resample: PILImageResampling = PILImageResampling.BILINEAR,
        do_rescale: bool = True,
        rescale_factor: float = 1 / 255,
        do_grayscale: bool = True,
        **kwargs,
    ) -> None:
        super().__init__(**kwargs)
        size = size if size is not None else {"height": 480, "width": 640}
        size = get_size_dict(size, default_to_square=False)

        self.do_resize = do_resize
        self.size = size
        self.resample = resample
        self.do_rescale = do_rescale
        self.rescale_factor = rescale_factor
        self.do_grayscale = do_grayscale

    # Copied from transformers.models.superpoint.image_processing_superpoint.SuperPointImageProcessor.resize
    def resize(
        self,
        image: np.ndarray,
        size: dict[str, int],
        data_format: Optional[Union[str, ChannelDimension]] = None,
        input_data_format: Optional[Union[str, ChannelDimension]] = None,
        **kwargs,
    ):
        """
        Resize an image.

        Args:
            image (`np.ndarray`):
                Image to resize.
            size (`dict[str, int]`):
                Dictionary of the form `{"height": int, "width": int}`, specifying the size of the output image.
            data_format (`ChannelDimension` or `str`, *optional*):
                The channel dimension format of the output image. If not provided, it will be inferred from the input
                image. Can be one of:
                - `"channels_first"` or `ChannelDimension.FIRST`: image in (num_channels, height, width) format.
                - `"channels_last"` or `ChannelDimension.LAST`: image in (height, width, num_channels) format.
                - `"none"` or `ChannelDimension.NONE`: image in (height, width) format.
            input_data_format (`ChannelDimension` or `str`, *optional*):
                The channel dimension format for the input image. If unset, the channel dimension format is inferred
                from the input image. Can be one of:
                - `"channels_first"` or `ChannelDimension.FIRST`: image in (num_channels, height, width) format.
                - `"channels_last"` or `ChannelDimension.LAST`: image in (height, width, num_channels) format.
                - `"none"` or `ChannelDimension.NONE`: image in (height, width) format.
        """
        size = get_size_dict(size, default_to_square=False)

        return resize(
            image,
            size=(size["height"], size["width"]),
            data_format=data_format,
            input_data_format=input_data_format,
            **kwargs,
        )

    # Copied from transformers.models.superglue.image_processing_superglue.SuperGlueImageProcessor.preprocess
    def preprocess(
        self,
        images,
        do_resize: Optional[bool] = None,
        size: Optional[dict[str, int]] = None,
        resample: PILImageResampling = None,
        do_rescale: Optional[bool] = None,
        rescale_factor: Optional[float] = None,
        do_grayscale: Optional[bool] = None,
        return_tensors: Optional[Union[str, TensorType]] = None,
        data_format: ChannelDimension = ChannelDimension.FIRST,
        input_data_format: Optional[Union[str, ChannelDimension]] = None,
        **kwargs,
    ) -> BatchFeature:
        """
        Preprocess a pair of images or a batch of image pairs.

        Args:
            images (`ImageInput`):
                Image pairs to preprocess. Expects either a list of 2 images or a list of lists of 2 images, with
                pixel values ranging from 0 to 255. If passing in images with pixel values between 0 and 1, set
                `do_rescale=False`.
            do_resize (`bool`, *optional*, defaults to `self.do_resize`):
                Whether to resize the image.
            size (`dict[str, int]`, *optional*, defaults to `self.size`):
                Size of the output image after `resize` has been applied, as a dictionary of the form
                `{"height": int, "width": int}`. Only has an effect if `do_resize` is set to `True`.
            resample (`PILImageResampling`, *optional*, defaults to `self.resample`):
                Resampling filter to use if resizing the image. This can be one of the `PILImageResampling` filters.
                Only has an effect if `do_resize` is set to `True`.
            do_rescale (`bool`, *optional*, defaults to `self.do_rescale`):
                Whether to rescale the image values to the [0, 1] range.
            rescale_factor (`float`, *optional*, defaults to `self.rescale_factor`):
                Rescale factor to rescale the image by if `do_rescale` is set to `True`.
            do_grayscale (`bool`, *optional*, defaults to `self.do_grayscale`):
                Whether to convert the image to grayscale.
            return_tensors (`str` or `TensorType`, *optional*):
                The type of tensors to return. Can be one of:
                    - Unset: Return a list of `np.ndarray`.
                    - `TensorType.TENSORFLOW` or `'tf'`: Return a batch of type `tf.Tensor`.
                    - `TensorType.PYTORCH` or `'pt'`: Return a batch of type `torch.Tensor`.
                    - `TensorType.NUMPY` or `'np'`: Return a batch of type `np.ndarray`.
                    - `TensorType.JAX` or `'jax'`: Return a batch of type `jax.numpy.ndarray`.
            data_format (`ChannelDimension` or `str`, *optional*, defaults to `ChannelDimension.FIRST`):
                The channel dimension format for the output image. Can be one of:
                    - `"channels_first"` or `ChannelDimension.FIRST`: image in (num_channels, height, width) format.
                    - `"channels_last"` or `ChannelDimension.LAST`: image in (height, width, num_channels) format.
                    - Unset: Use the channel dimension format of the input image.
            input_data_format (`ChannelDimension` or `str`, *optional*):
                The channel dimension format for the input image. If unset, the channel dimension format is inferred
                from the input image. Can be one of:
                - `"channels_first"` or `ChannelDimension.FIRST`: image in (num_channels, height, width) format.
                - `"channels_last"` or `ChannelDimension.LAST`: image in (height, width, num_channels) format.
                - `"none"` or `ChannelDimension.NONE`: image in (height, width) format.
        """

        do_resize = do_resize if do_resize is not None else self.do_resize
        resample = resample if resample is not None else self.resample
        do_rescale = do_rescale if do_rescale is not None else self.do_rescale
        rescale_factor = rescale_factor if rescale_factor is not None else self.rescale_factor
        do_grayscale = do_grayscale if do_grayscale is not None else self.do_grayscale

        size = size if size is not None else self.size
        size = get_size_dict(size, default_to_square=False)

        # Validate and convert the input images into a flattened list of images for all subsequent processing steps.
        images = validate_and_format_image_pairs(images)

        if not valid_images(images):
            raise ValueError(
                "Invalid image type. Must be of type PIL.Image.Image, numpy.ndarray, "
                "torch.Tensor, tf.Tensor or jax.ndarray."
            )

        validate_preprocess_arguments(
            do_resize=do_resize,
            size=size,
            resample=resample,
            do_rescale=do_rescale,
            rescale_factor=rescale_factor,
        )

        # All transformations expect numpy arrays.
        images = [to_numpy_array(image) for image in images]

        if is_scaled_image(images[0]) and do_rescale:
            logger.warning_once(
                "It looks like you are trying to rescale already rescaled images. If the input"
                " images have pixel values between 0 and 1, set `do_rescale=False` to avoid rescaling them again."
            )

        if input_data_format is None:
            # We assume that all images have the same channel dimension format.
            input_data_format = infer_channel_dimension_format(images[0])

        all_images = []
        for image in images:
            if do_resize:
                image = self.resize(image=image, size=size, resample=resample, input_data_format=input_data_format)

            if do_rescale:
                image = self.rescale(image=image, scale=rescale_factor, input_data_format=input_data_format)

            if do_grayscale:
                image = convert_to_grayscale(image, input_data_format=input_data_format)

            image = to_channel_dimension_format(image, data_format, input_channel_dim=input_data_format)
            all_images.append(image)

        # Convert back the flattened list of images into a list of pairs of images.
        image_pairs = [all_images[i : i + 2] for i in range(0, len(all_images), 2)]

        data = {"pixel_values": image_pairs}

        return BatchFeature(data=data, tensor_type=return_tensors)
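
The pair handling in `preprocess` boils down to flattening the pairs, transforming each image on its own, and regrouping the results two by two. A minimal sketch of that round trip follows, with strings standing in for images; `flatten_pairs` and `regroup_pairs` are hypothetical names used only for this illustration, and `str.upper` is a placeholder for the real per-image transforms.

```python
def flatten_pairs(image_pairs):
    # [[a0, a1], [b0, b1]] -> [a0, a1, b0, b1]
    return [image for pair in image_pairs for image in pair]


def regroup_pairs(flat_images):
    # Consecutive images are put back together, two at a time.
    return [flat_images[i : i + 2] for i in range(0, len(flat_images), 2)]


pairs = [["a0", "a1"], ["b0", "b1"]]
flat = flatten_pairs(pairs)
processed = [name.upper() for name in flat]  # stand-in for resize/rescale/grayscale
regrouped = regroup_pairs(processed)
print(regrouped)
```

Since the per-image loop preserves order, the regrouped list lines up with the original pairs, which is what lets the batch dimension of `pixel_values` correspond to image pairs.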

    def post_process_keypoint_matching(
        self,
        outputs: "KeypointMatchingOutput",
        target_sizes: Union[TensorType, list[tuple]],
        threshold: float = 0.0,
    ) -> list[dict[str, torch.Tensor]]:
        """
        Converts the raw output of [`KeypointMatchingOutput`] into lists of matched keypoints and matching scores,
        with keypoint coordinates expressed in absolute pixels of the original images.
        Args:
            outputs ([`KeypointMatchingOutput`]):
                Raw outputs of the model.
            target_sizes (`torch.Tensor` or `list[tuple[tuple[int, int]]]`):
                Tensor of shape `(batch_size, 2, 2)` or list of tuples of tuples (`tuple[int, int]`) containing the
                target size `(height, width)` of each image in the batch. This must be the original image size (before
                any processing).
            threshold (`float`, *optional*, defaults to 0.0):
                Threshold to filter out the matches with low scores.
        Returns:
            `list[dict]`: A list of dictionaries, one per image pair, each containing the matched keypoints in the
            first image (`keypoints0`), the matched keypoints in the second image (`keypoints1`) and the
            corresponding matching scores (`matching_scores`).
        """
        if outputs.matches.shape[0] != len(target_sizes):
            raise ValueError("Make sure that you pass in as many target sizes as the batch dimension of the matches")
        if not all(len(target_size) == 2 for target_size in target_sizes):
            raise ValueError("Each element of target_sizes must contain the size (h, w) of each image of the batch")

        if isinstance(target_sizes, list):
            image_pair_sizes = torch.tensor(target_sizes, device=outputs.matches.device)
        else:
            if target_sizes.shape[1] != 2 or target_sizes.shape[2] != 2:
                raise ValueError(
                    "Each element of target_sizes must contain the size (h, w) of each image of the batch"
                )
            image_pair_sizes = target_sizes

        keypoints = outputs.keypoints.clone()
        keypoints = keypoints * image_pair_sizes.flip(-1).reshape(-1, 2, 1, 2)
        keypoints = keypoints.to(torch.int32)

        results = []
        for keypoints_pair, matches, scores in zip(keypoints, outputs.matches, outputs.matching_scores):
            # Filter out matches with low scores
            valid_matches = torch.logical_and(scores > threshold, matches > -1)

            matched_keypoints0 = keypoints_pair[0][valid_matches[0]]
            matched_keypoints1 = keypoints_pair[1][valid_matches[1]]
            matching_scores = scores[0][valid_matches[0]]

            results.append(
                {
                    "keypoints0": matched_keypoints0,
                    "keypoints1": matched_keypoints1,
                    "matching_scores": matching_scores,
                }
            )

        return results
|
||||||
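The post-processing above does two things: it rescales normalized `(x, y)` keypoints by `(width, height)` and keeps only the entries that have a match and a score above the threshold. The following is a minimal sketch of that logic for a single image pair, using plain Python lists instead of tensors; the function name `rescale_and_filter` and its argument layout are assumptions made for illustration.

```python
def rescale_and_filter(keypoints, matches, scores, image_sizes, threshold=0.0):
    """Sketch for one image pair.

    keypoints: two lists of (x, y) tuples, normalized to [0, 1]
    matches, scores: two lists each, one entry per keypoint
    image_sizes: ((height0, width0), (height1, width1))
    """
    kept = []
    for image_index in range(2):
        height, width = image_sizes[image_index]
        kept.append(
            [
                # x is scaled by the width and y by the height, then truncated to int.
                (int(x * width), int(y * height))
                for (x, y), match, score in zip(
                    keypoints[image_index], matches[image_index], scores[image_index]
                )
                # Keep a keypoint only if it is matched and scores above the threshold.
                if match > -1 and score > threshold
            ]
        )
    # As in the method above, the scores come from the first image of the pair.
    kept_scores = [
        score for match, score in zip(matches[0], scores[0]) if match > -1 and score > threshold
    ]
    return {"keypoints0": kept[0], "keypoints1": kept[1], "matching_scores": kept_scores}
```

The size is given as `(height, width)` while keypoints are `(x, y)`, which is why the tensor version flips the last dimension of the sizes before multiplying.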
|
|
||||||
|
def visualize_keypoint_matching(
|
||||||
|
self,
|
||||||
|
images: ImageInput,
|
||||||
|
keypoint_matching_output: list[dict[str, torch.Tensor]],
|
||||||
|
) -> list["Image.Image"]:
|
||||||
|
"""
|
||||||
|
Plots the image pairs side by side with the detected keypoints as well as the matching between them.
|
||||||
|
|
||||||
|
Args:
|
||||||
|
images (`ImageInput`):
|
||||||
|
Image pairs to plot. Same as `EfficientLoFTRImageProcessor.preprocess`. Expects either a list of 2
|
||||||
|
images or a list of lists of 2 images, with pixel values ranging from 0 to 255.
|
||||||
|
keypoint_matching_output (`List[Dict[str, torch.Tensor]]`):
|
||||||
|
A post-processed keypoint matching output, as returned by `post_process_keypoint_matching`.
|
||||||
|
|
||||||
|
Returns:
|
||||||
|
`List[PIL.Image.Image]`: A list of PIL images, each containing the image pairs side by side with the detected
|
||||||
|
keypoints as well as the matching between them.
|
||||||
|
"""
|
||||||
|
images = validate_and_format_image_pairs(images)
|
||||||
|
images = [to_numpy_array(image) for image in images]
|
||||||
|
image_pairs = [images[i : i + 2] for i in range(0, len(images), 2)]
|
||||||
|
|
||||||
|
results = []
|
||||||
|
for image_pair, pair_output in zip(image_pairs, keypoint_matching_output):
|
||||||
|
height0, width0 = image_pair[0].shape[:2]
|
||||||
|
height1, width1 = image_pair[1].shape[:2]
|
||||||
|
plot_image = np.zeros((max(height0, height1), width0 + width1, 3), dtype=np.uint8)
|
||||||
|
plot_image[:height0, :width0] = image_pair[0]
|
||||||
|
plot_image[:height1, width0:] = image_pair[1]
|
||||||
|
|
||||||
|
plot_image_pil = Image.fromarray(plot_image)
|
||||||
|
draw = ImageDraw.Draw(plot_image_pil)
|
||||||
|
|
||||||
|
keypoints0_x, keypoints0_y = pair_output["keypoints0"].unbind(1)
|
||||||
|
keypoints1_x, keypoints1_y = pair_output["keypoints1"].unbind(1)
|
||||||
|
for keypoint0_x, keypoint0_y, keypoint1_x, keypoint1_y, matching_score in zip(
|
||||||
|
keypoints0_x, keypoints0_y, keypoints1_x, keypoints1_y, pair_output["matching_scores"]
|
||||||
|
):
|
||||||
|
color = self._get_color(matching_score)
|
||||||
|
draw.line(
|
||||||
|
(keypoint0_x, keypoint0_y, keypoint1_x + width0, keypoint1_y),
|
||||||
|
fill=color,
|
||||||
|
width=3,
|
||||||
|
)
|
||||||
|
draw.ellipse((keypoint0_x - 2, keypoint0_y - 2, keypoint0_x + 2, keypoint0_y + 2), fill="black")
|
||||||
|
draw.ellipse(
|
||||||
|
(keypoint1_x + width0 - 2, keypoint1_y - 2, keypoint1_x + width0 + 2, keypoint1_y + 2),
|
||||||
|
fill="black",
|
||||||
|
)
|
||||||
|
|
||||||
|
results.append(plot_image_pil)
|
||||||
|
return results
|
||||||
|
|
||||||
|
def _get_color(self, score):
|
||||||
|
"""Maps a score to a color."""
|
||||||
|
r = int(255 * (1 - score))
|
||||||
|
g = int(255 * score)
|
||||||
|
b = 0
|
||||||
|
return (r, g, b)
|
||||||
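The color mapping used for the match lines is a linear blend from red (low score) to green (high score). A standalone sketch, assuming scores lie in `[0, 1]`; the free function name `score_to_color` is hypothetical and simply mirrors `_get_color`.

```python
def score_to_color(score: float) -> tuple[int, int, int]:
    """Map a matching score in [0, 1] to an RGB color between red and green."""
    red = int(255 * (1 - score))
    green = int(255 * score)
    return (red, green, 0)
```

A score of `1.0` gives pure green and a score of `0.0` gives pure red, so weak matches stand out in the plot.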
|
|
||||||
|
|
||||||
|
__all__ = ["EfficientLoFTRImageProcessor"]
|
||||||
1302
src/transformers/models/efficientloftr/modeling_efficientloftr.py
Normal file
File diff suppressed because it is too large
@@ -51,7 +51,7 @@ logger = logging.get_logger(__name__)
|
|||||||
|
|
||||||
|
|
||||||
def is_grayscale(
|
def is_grayscale(
|
||||||
image: ImageInput,
|
image: np.ndarray,
|
||||||
input_data_format: Optional[Union[str, ChannelDimension]] = None,
|
input_data_format: Optional[Union[str, ChannelDimension]] = None,
|
||||||
):
|
):
|
||||||
if input_data_format == ChannelDimension.FIRST:
|
if input_data_format == ChannelDimension.FIRST:
|
||||||
|
|||||||
@@ -53,7 +53,7 @@ logger = logging.get_logger(__name__)
|
|||||||
|
|
||||||
# Copied from transformers.models.superpoint.image_processing_superpoint.is_grayscale
|
# Copied from transformers.models.superpoint.image_processing_superpoint.is_grayscale
|
||||||
def is_grayscale(
|
def is_grayscale(
|
||||||
image: ImageInput,
|
image: np.ndarray,
|
||||||
input_data_format: Optional[Union[str, ChannelDimension]] = None,
|
input_data_format: Optional[Union[str, ChannelDimension]] = None,
|
||||||
):
|
):
|
||||||
if input_data_format == ChannelDimension.FIRST:
|
if input_data_format == ChannelDimension.FIRST:
|
||||||
|
|||||||
@@ -45,7 +45,7 @@ logger = logging.get_logger(__name__)
|
|||||||
|
|
||||||
|
|
||||||
def is_grayscale(
|
def is_grayscale(
|
||||||
image: ImageInput,
|
image: np.ndarray,
|
||||||
input_data_format: Optional[Union[str, ChannelDimension]] = None,
|
input_data_format: Optional[Union[str, ChannelDimension]] = None,
|
||||||
):
|
):
|
||||||
if input_data_format == ChannelDimension.FIRST:
|
if input_data_format == ChannelDimension.FIRST:
|
||||||
|
|||||||
@@ -1075,7 +1075,7 @@ def check_model_inputs(func):
|
|||||||
if key == "hidden_states":
|
if key == "hidden_states":
|
||||||
if hasattr(outputs, "vision_hidden_states"):
|
if hasattr(outputs, "vision_hidden_states"):
|
||||||
collected_outputs[key] += (outputs.vision_hidden_states,)
|
collected_outputs[key] += (outputs.vision_hidden_states,)
|
||||||
else:
|
elif hasattr(outputs, "last_hidden_state"):
|
||||||
collected_outputs[key] += (outputs.last_hidden_state,)
|
collected_outputs[key] += (outputs.last_hidden_state,)
|
||||||
outputs[key] = collected_outputs[key]
|
outputs[key] = collected_outputs[key]
|
||||||
elif key == "attentions":
|
elif key == "attentions":
|
||||||
|
|||||||
0
tests/models/efficientloftr/__init__.py
Normal file
@@ -0,0 +1,90 @@
|
|||||||
|
# Copyright 2025 The HuggingFace Team. All rights reserved.
|
||||||
|
#
|
||||||
|
# Licensed under the Apache License, Version 2.0 (the "License");
|
||||||
|
# you may not use this file except in compliance with the License.
|
||||||
|
# You may obtain a copy of the License at
|
||||||
|
#
|
||||||
|
# http://www.apache.org/licenses/LICENSE-2.0
|
||||||
|
#
|
||||||
|
# Unless required by applicable law or agreed to in writing, software
|
||||||
|
# distributed under the License is distributed on an "AS IS" BASIS,
|
||||||
|
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
|
||||||
|
# See the License for the specific language governing permissions and
|
||||||
|
# limitations under the License.
|
||||||
|
import unittest
|
||||||
|
|
||||||
|
from tests.models.superglue.test_image_processing_superglue import (
|
||||||
|
SuperGlueImageProcessingTest,
|
||||||
|
SuperGlueImageProcessingTester,
|
||||||
|
)
|
||||||
|
from transformers.testing_utils import require_torch, require_vision
|
||||||
|
from transformers.utils import is_torch_available, is_vision_available
|
||||||
|
|
||||||
|
|
||||||
|
if is_torch_available():
|
||||||
|
import numpy as np
|
||||||
|
import torch
|
||||||
|
|
||||||
|
from transformers.models.efficientloftr.modeling_efficientloftr import KeypointMatchingOutput
|
||||||
|
|
||||||
|
if is_vision_available():
|
||||||
|
from transformers import EfficientLoFTRImageProcessor
|
||||||
|
|
||||||
|
|
||||||
|
def random_array(size):
|
||||||
|
return np.random.randint(255, size=size)
|
||||||
|
|
||||||
|
|
||||||
|
def random_tensor(size):
|
||||||
|
return torch.rand(size)
|
||||||
|
|
||||||
|
|
||||||
|
class EfficientLoFTRImageProcessingTester(SuperGlueImageProcessingTester):
|
||||||
|
"""Tester for EfficientLoFTRImageProcessor"""
|
||||||
|
|
||||||
|
def __init__(
|
||||||
|
self,
|
||||||
|
parent,
|
||||||
|
batch_size=6,
|
||||||
|
num_channels=3,
|
||||||
|
image_size=18,
|
||||||
|
min_resolution=30,
|
||||||
|
max_resolution=400,
|
||||||
|
do_resize=True,
|
||||||
|
size=None,
|
||||||
|
do_grayscale=True,
|
||||||
|
):
|
||||||
|
super().__init__(
|
||||||
|
parent, batch_size, num_channels, image_size, min_resolution, max_resolution, do_resize, size, do_grayscale
|
||||||
|
)
|
||||||
|
|
||||||
|
def prepare_keypoint_matching_output(self, pixel_values):
|
||||||
|
"""Prepare a fake output for the keypoint matching model with random matches between 50 keypoints per image."""
|
||||||
|
max_number_keypoints = 50
|
||||||
|
batch_size = len(pixel_values)
|
||||||
|
keypoints = torch.zeros((batch_size, 2, max_number_keypoints, 2))
|
||||||
|
matches = torch.full((batch_size, 2, max_number_keypoints), -1, dtype=torch.int)
|
||||||
|
scores = torch.zeros((batch_size, 2, max_number_keypoints))
|
||||||
|
for i in range(batch_size):
|
||||||
|
random_number_keypoints0 = np.random.randint(10, max_number_keypoints)
|
||||||
|
random_number_keypoints1 = np.random.randint(10, max_number_keypoints)
|
||||||
|
random_number_matches = np.random.randint(5, min(random_number_keypoints0, random_number_keypoints1))
|
||||||
|
keypoints[i, 0, :random_number_keypoints0] = torch.rand((random_number_keypoints0, 2))
|
||||||
|
keypoints[i, 1, :random_number_keypoints1] = torch.rand((random_number_keypoints1, 2))
|
||||||
|
random_matches_indices0 = torch.randperm(random_number_keypoints1, dtype=torch.int)[:random_number_matches]
|
||||||
|
random_matches_indices1 = torch.randperm(random_number_keypoints0, dtype=torch.int)[:random_number_matches]
|
||||||
|
matches[i, 0, random_matches_indices1] = random_matches_indices0
|
||||||
|
matches[i, 1, random_matches_indices0] = random_matches_indices1
|
||||||
|
scores[i, 0, random_matches_indices1] = torch.rand((random_number_matches,))
|
||||||
|
scores[i, 1, random_matches_indices0] = torch.rand((random_number_matches,))
|
||||||
|
return KeypointMatchingOutput(keypoints=keypoints, matches=matches, matching_scores=scores)
|
||||||
|
|
||||||
|
|
||||||
|
@require_torch
|
||||||
|
@require_vision
|
||||||
|
class EfficientLoFTRImageProcessingTest(SuperGlueImageProcessingTest, unittest.TestCase):
|
||||||
|
image_processing_class = EfficientLoFTRImageProcessor if is_vision_available() else None
|
||||||
|
|
||||||
|
def setUp(self) -> None:
|
||||||
|
super().setUp()
|
||||||
|
self.image_processor_tester = EfficientLoFTRImageProcessingTester(self)
|
||||||
453
tests/models/efficientloftr/test_modeling_efficientloftr.py
Normal file
@@ -0,0 +1,453 @@
|
|||||||
|
# Copyright 2025 The HuggingFace Team. All rights reserved.
|
||||||
|
#
|
||||||
|
# Licensed under the Apache License, Version 2.0 (the "License");
|
||||||
|
# you may not use this file except in compliance with the License.
|
||||||
|
# You may obtain a copy of the License at
|
||||||
|
#
|
||||||
|
# http://www.apache.org/licenses/LICENSE-2.0
|
||||||
|
#
|
||||||
|
# Unless required by applicable law or agreed to in writing, software
|
||||||
|
# distributed under the License is distributed on an "AS IS" BASIS,
|
||||||
|
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
|
||||||
|
# See the License for the specific language governing permissions and
|
||||||
|
# limitations under the License.
|
||||||
|
import inspect
|
||||||
|
import unittest
|
||||||
|
from functools import reduce
|
||||||
|
|
||||||
|
from datasets import load_dataset
|
||||||
|
|
||||||
|
from transformers.models.efficientloftr import EfficientLoFTRConfig, EfficientLoFTRModel
|
||||||
|
from transformers.testing_utils import (
|
||||||
|
require_torch,
|
||||||
|
require_vision,
|
||||||
|
set_config_for_less_flaky_test,
|
||||||
|
set_model_for_less_flaky_test,
|
||||||
|
set_model_tester_for_less_flaky_test,
|
||||||
|
slow,
|
||||||
|
torch_device,
|
||||||
|
)
|
||||||
|
from transformers.utils import cached_property, is_torch_available, is_vision_available
|
||||||
|
|
||||||
|
from ...test_configuration_common import ConfigTester
|
||||||
|
from ...test_modeling_common import ModelTesterMixin, floats_tensor
|
||||||
|
|
||||||
|
|
||||||
|
if is_torch_available():
|
||||||
|
import torch
|
||||||
|
|
||||||
|
from transformers import EfficientLoFTRForKeypointMatching
|
||||||
|
|
||||||
|
if is_vision_available():
|
||||||
|
from transformers import AutoImageProcessor
|
||||||
|
|
||||||
|
|
||||||
|
class EfficientLoFTRModelTester:
|
||||||
|
def __init__(
|
||||||
|
self,
|
||||||
|
parent,
|
||||||
|
batch_size=2,
|
||||||
|
image_width=80,
|
||||||
|
image_height=60,
|
||||||
|
stage_num_blocks: list[int] = [1, 1, 1],
|
||||||
|
out_features: list[int] = [32, 32, 64],
|
||||||
|
stage_stride: list[int] = [2, 1, 2],
|
||||||
|
q_aggregation_kernel_size: int = 1,
|
||||||
|
kv_aggregation_kernel_size: int = 1,
|
||||||
|
q_aggregation_stride: int = 1,
|
||||||
|
kv_aggregation_stride: int = 1,
|
||||||
|
num_attention_layers: int = 2,
|
||||||
|
num_attention_heads: int = 8,
|
||||||
|
hidden_size: int = 64,
|
||||||
|
coarse_matching_threshold: float = 0.0,
|
||||||
|
fine_kernel_size: int = 2,
|
||||||
|
coarse_matching_border_removal: int = 0,
|
||||||
|
):
|
||||||
|
self.parent = parent
|
||||||
|
self.batch_size = batch_size
|
||||||
|
self.image_width = image_width
|
||||||
|
self.image_height = image_height
|
||||||
|
|
||||||
|
self.stage_num_blocks = stage_num_blocks
|
||||||
|
self.out_features = out_features
|
||||||
|
self.stage_stride = stage_stride
|
||||||
|
self.q_aggregation_kernel_size = q_aggregation_kernel_size
|
||||||
|
self.kv_aggregation_kernel_size = kv_aggregation_kernel_size
|
||||||
|
self.q_aggregation_stride = q_aggregation_stride
|
||||||
|
self.kv_aggregation_stride = kv_aggregation_stride
|
||||||
|
self.num_attention_layers = num_attention_layers
|
||||||
|
self.num_attention_heads = num_attention_heads
|
||||||
|
self.hidden_size = hidden_size
|
||||||
|
self.coarse_matching_threshold = coarse_matching_threshold
|
||||||
|
self.coarse_matching_border_removal = coarse_matching_border_removal
|
||||||
|
self.fine_kernel_size = fine_kernel_size
|
||||||
|
|
||||||
|
def prepare_config_and_inputs(self):
|
||||||
|
# EfficientLoFTR expects a grayscale image as input
|
||||||
|
pixel_values = floats_tensor([self.batch_size, 2, 3, self.image_height, self.image_width])
|
||||||
|
config = self.get_config()
|
||||||
|
return config, pixel_values
|
||||||
|
|
||||||
|
def get_config(self):
|
||||||
|
return EfficientLoFTRConfig(
|
||||||
|
stage_num_blocks=self.stage_num_blocks,
|
||||||
|
out_features=self.out_features,
|
||||||
|
stage_stride=self.stage_stride,
|
||||||
|
q_aggregation_kernel_size=self.q_aggregation_kernel_size,
|
||||||
|
kv_aggregation_kernel_size=self.kv_aggregation_kernel_size,
|
||||||
|
q_aggregation_stride=self.q_aggregation_stride,
|
||||||
|
kv_aggregation_stride=self.kv_aggregation_stride,
|
||||||
|
num_attention_layers=self.num_attention_layers,
|
||||||
|
num_attention_heads=self.num_attention_heads,
|
||||||
|
hidden_size=self.hidden_size,
|
||||||
|
coarse_matching_threshold=self.coarse_matching_threshold,
|
||||||
|
coarse_matching_border_removal=self.coarse_matching_border_removal,
|
||||||
|
fine_kernel_size=self.fine_kernel_size,
|
||||||
|
)
|
||||||
|
|
||||||
|
def create_and_check_model(self, config, pixel_values):
|
||||||
|
model = EfficientLoFTRForKeypointMatching(config=config)
|
||||||
|
model.to(torch_device)
|
||||||
|
model.eval()
|
||||||
|
result = model(pixel_values)
|
||||||
|
maximum_num_matches = result.matches.shape[-1]
|
||||||
|
self.parent.assertEqual(
|
||||||
|
result.keypoints.shape,
|
||||||
|
(self.batch_size, 2, maximum_num_matches, 2),
|
||||||
|
)
|
||||||
|
self.parent.assertEqual(
|
||||||
|
result.matches.shape,
|
||||||
|
(self.batch_size, 2, maximum_num_matches),
|
||||||
|
)
|
||||||
|
self.parent.assertEqual(
|
||||||
|
result.matching_scores.shape,
|
||||||
|
(self.batch_size, 2, maximum_num_matches),
|
||||||
|
)
|
||||||
|
|
||||||
|
def prepare_config_and_inputs_for_common(self):
|
||||||
|
config_and_inputs = self.prepare_config_and_inputs()
|
||||||
|
config, pixel_values = config_and_inputs
|
||||||
|
inputs_dict = {"pixel_values": pixel_values}
|
||||||
|
return config, inputs_dict
|
||||||
|
|
||||||
|
|
||||||
|
@require_torch
|
||||||
|
class EfficientLoFTRModelTest(ModelTesterMixin, unittest.TestCase):
|
||||||
|
all_model_classes = (EfficientLoFTRForKeypointMatching, EfficientLoFTRModel) if is_torch_available() else ()
|
||||||
|
|
||||||
|
test_pruning = False
|
||||||
|
test_resize_embeddings = False
|
||||||
|
test_head_masking = False
|
||||||
|
has_attentions = True
|
||||||
|
|
||||||
|
def setUp(self):
|
||||||
|
self.model_tester = EfficientLoFTRModelTester(self)
|
||||||
|
self.config_tester = ConfigTester(self, config_class=EfficientLoFTRConfig, has_text_modality=False)
|
||||||
|
|
||||||
|
def test_config(self):
|
||||||
|
self.config_tester.create_and_test_config_to_json_string()
|
||||||
|
self.config_tester.create_and_test_config_to_json_file()
|
||||||
|
self.config_tester.create_and_test_config_from_and_save_pretrained()
|
||||||
|
self.config_tester.create_and_test_config_with_num_labels()
|
||||||
|
self.config_tester.check_config_can_be_init_without_params()
|
||||||
|
self.config_tester.check_config_arguments_init()
|
||||||
|
|
||||||
|
@unittest.skip(reason="EfficientLoFTRForKeypointMatching does not use inputs_embeds")
|
||||||
|
def test_inputs_embeds(self):
|
||||||
|
pass
|
||||||
|
|
||||||
|
@unittest.skip(reason="EfficientLoFTRForKeypointMatching does not support input and output embeddings")
|
||||||
|
def test_model_get_set_embeddings(self):
|
||||||
|
pass
|
||||||
|
|
||||||
|
@unittest.skip(reason="EfficientLoFTRForKeypointMatching does not use feedforward chunking")
|
||||||
|
def test_feed_forward_chunking(self):
|
||||||
|
pass
|
||||||
|
|
||||||
|
@unittest.skip(reason="EfficientLoFTRForKeypointMatching is not trainable")
|
||||||
|
def test_training(self):
|
||||||
|
pass
|
||||||
|
|
||||||
|
@unittest.skip(reason="EfficientLoFTRForKeypointMatching is not trainable")
|
||||||
|
def test_training_gradient_checkpointing(self):
|
||||||
|
pass
|
||||||
|
|
||||||
|
@unittest.skip(reason="EfficientLoFTRForKeypointMatching is not trainable")
|
||||||
|
def test_training_gradient_checkpointing_use_reentrant(self):
|
||||||
|
pass
|
||||||
|
|
||||||
|
@unittest.skip(reason="EfficientLoFTRForKeypointMatching is not trainable")
|
||||||
|
def test_training_gradient_checkpointing_use_reentrant_false(self):
|
||||||
|
pass
|
||||||
|
|
||||||
|
@unittest.skip(reason="EfficientLoFTR does not output any loss term in the forward pass")
|
||||||
|
def test_retain_grad_hidden_states_attentions(self):
|
||||||
|
pass
|
||||||
|
|
||||||
|
def test_model(self):
|
||||||
|
config_and_inputs = self.model_tester.prepare_config_and_inputs()
|
||||||
|
self.model_tester.create_and_check_model(*config_and_inputs)
|
||||||
|
|
||||||
|
def test_forward_signature(self):
|
||||||
|
config, _ = self.model_tester.prepare_config_and_inputs()
|
||||||
|
|
||||||
|
for model_class in self.all_model_classes:
|
||||||
|
model = model_class(config)
|
||||||
|
signature = inspect.signature(model.forward)
|
||||||
|
# signature.parameters is an OrderedDict => so arg_names order is deterministic
|
||||||
|
arg_names = [*signature.parameters.keys()]
|
||||||
|
|
||||||
|
expected_arg_names = ["pixel_values"]
|
||||||
|
self.assertListEqual(arg_names[:1], expected_arg_names)
|
||||||
|
|
||||||
|
def test_hidden_states_output(self):
|
||||||
|
def check_hidden_states_output(inputs_dict, config, model_class):
|
||||||
|
model = model_class(config)
|
||||||
|
model.to(torch_device)
|
||||||
|
model.eval()
|
||||||
|
|
||||||
|
with torch.no_grad():
|
||||||
|
outputs = model(**self._prepare_for_class(inputs_dict, model_class))
|
||||||
|
|
||||||
|
hidden_states = outputs.hidden_states
|
||||||
|
|
||||||
|
expected_num_hidden_states = len(self.model_tester.stage_num_blocks)
|
||||||
|
self.assertEqual(len(hidden_states), expected_num_hidden_states)
|
||||||
|
|
||||||
|
self.assertListEqual(
|
||||||
|
list(hidden_states[0].shape[-2:]),
|
||||||
|
[self.model_tester.image_height // 2, self.model_tester.image_width // 2],
|
||||||
|
)
|
||||||
|
|
||||||
|
config, inputs_dict = self.model_tester.prepare_config_and_inputs_for_common()
|
||||||
|
|
||||||
|
for model_class in self.all_model_classes:
|
||||||
|
inputs_dict["output_hidden_states"] = True
|
||||||
|
check_hidden_states_output(inputs_dict, config, model_class)
|
||||||
|
|
||||||
|
# check that output_hidden_states also works using config
|
||||||
|
del inputs_dict["output_hidden_states"]
|
||||||
|
config.output_hidden_states = True
|
||||||
|
|
||||||
|
check_hidden_states_output(inputs_dict, config, model_class)
|
||||||
|
|
||||||
|
def test_attention_outputs(self):
|
||||||
|
def check_attention_output(inputs_dict, config, model_class):
|
||||||
|
config._attn_implementation = "eager"
|
||||||
|
model = model_class(config)
|
||||||
|
model.to(torch_device)
|
||||||
|
model.eval()
|
||||||
|
|
||||||
|
with torch.no_grad():
|
||||||
|
outputs = model(**self._prepare_for_class(inputs_dict, model_class))
|
||||||
|
|
||||||
|
attentions = outputs.attentions
|
||||||
|
total_stride = reduce(lambda a, b: a * b, config.stage_stride)
|
||||||
|
hidden_size = (
|
||||||
|
self.model_tester.image_height // total_stride * self.model_tester.image_width // total_stride
|
||||||
|
)
|
||||||
|
|
||||||
|
expected_attention_shape = [
|
||||||
|
self.model_tester.num_attention_heads,
|
||||||
|
hidden_size,
|
||||||
|
hidden_size,
|
||||||
|
]
|
||||||
|
|
||||||
|
for i, attention in enumerate(attentions):
|
||||||
|
self.assertListEqual(
|
||||||
|
list(attention.shape[-3:]),
|
||||||
|
expected_attention_shape,
|
||||||
|
)
|
||||||
|
|
||||||
|
config, inputs_dict = self.model_tester.prepare_config_and_inputs_for_common()
|
||||||
|
|
||||||
|
for model_class in self.all_model_classes:
|
||||||
|
inputs_dict["output_attentions"] = True
|
||||||
|
check_attention_output(inputs_dict, config, model_class)
|
||||||
|
|
||||||
|
# check that output_attentions also works using config
|
||||||
|
del inputs_dict["output_attentions"]
|
||||||
|
config.output_attentions = True
|
||||||
|
|
||||||
|
check_attention_output(inputs_dict, config, model_class)
|
||||||
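The expected attention shape in this test depends on the number of coarse feature positions, which is the feature map height times its width after all backbone strides. The following is a minimal sketch of that arithmetic, assuming the aggregation kernel sizes and strides are 1 as in the tester defaults; the helper name `coarse_sequence_length` is hypothetical.

```python
from functools import reduce


def coarse_sequence_length(image_height, image_width, stage_stride):
    """Number of positions in the coarse feature map after all stage strides."""
    total_stride = reduce(lambda a, b: a * b, stage_stride)
    # Parentheses make the intent explicit: floor-divide each side, then multiply.
    return (image_height // total_stride) * (image_width // total_stride)
```

With the tester defaults (60 x 80 images, strides `[2, 1, 2]`) this gives 15 * 20 = 300. The unparenthesized form used in the test evaluates left to right and yields the same value only because both sides are divisible by the total stride.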
|
|
||||||
|
@slow
|
||||||
|
def test_model_from_pretrained(self):
|
||||||
|
from_pretrained_ids = ["stevenbucaille/efficientloftr"]
|
||||||
|
for model_name in from_pretrained_ids:
|
||||||
|
model = EfficientLoFTRForKeypointMatching.from_pretrained(model_name)
|
||||||
|
self.assertIsNotNone(model)
|
||||||
|
|
||||||
|
def test_forward_labels_should_be_none(self):
|
||||||
|
config, inputs_dict = self.model_tester.prepare_config_and_inputs_for_common()
|
||||||
|
for model_class in self.all_model_classes:
|
||||||
|
model = model_class(config)
|
||||||
|
model.to(torch_device)
|
||||||
|
model.eval()
|
||||||
|
|
||||||
|
with torch.no_grad():
|
||||||
|
model_inputs = self._prepare_for_class(inputs_dict, model_class)
|
||||||
|
# Provide an arbitrary sized Tensor as labels to model inputs
|
||||||
|
model_inputs["labels"] = torch.rand((128, 128))
|
||||||
|
|
||||||
|
with self.assertRaises(ValueError) as cm:
|
||||||
|
model(**model_inputs)
|
||||||
|
self.assertEqual(ValueError, cm.exception.__class__)
|
||||||
|
|
||||||
|
def test_batching_equivalence(self, atol=1e-5, rtol=1e-5):
|
||||||
|
"""
|
||||||
|
This test is overwritten because the model outputs do not contain only regressive values but also keypoint
|
||||||
|
locations.
|
||||||
|
Similarly to the problem discussed about SuperGlue implementation
|
||||||
|
[here](https://github.com/huggingface/transformers/pull/29886#issuecomment-2482752787), the consequence of
|
||||||
|
having different scores for matching is that the maximum indices differ. These indices are used to compute
|
||||||
|
the keypoint coordinates. The keypoint coordinates, in the model outputs, are floating point tensors, so the
|
||||||
|
original implementation of this test covers this case. But the resulting tensors may have differences exceeding
|
||||||
|
the relative and absolute tolerance.
|
||||||
|
Therefore, similarly to SuperGlue integration test, for the key "keypoints" in the model outputs, we check the
|
||||||
|
number of differing keypoint coordinates stays below a small tolerance (2% of the number of keypoints).
|
||||||
|
"""
|
||||||
|
|
||||||
|
def recursive_check(batched_object, single_row_object, model_name, key):
|
||||||
|
if isinstance(batched_object, (list, tuple)):
|
||||||
|
for batched_object_value, single_row_object_value in zip(batched_object, single_row_object):
|
||||||
|
recursive_check(batched_object_value, single_row_object_value, model_name, key)
|
||||||
|
elif isinstance(batched_object, dict):
|
||||||
|
for batched_object_value, single_row_object_value in zip(
|
||||||
|
batched_object.values(), single_row_object.values()
|
||||||
|
):
|
||||||
|
recursive_check(batched_object_value, single_row_object_value, model_name, key)
|
||||||
|
# do not compare returned loss (0-dim tensor) / codebook ids (int) / caching objects
|
||||||
|
elif batched_object is None or not isinstance(batched_object, torch.Tensor):
|
||||||
|
return
|
||||||
|
elif batched_object.dim() == 0:
|
||||||
|
return
|
||||||
|
# do not compare int or bool outputs as they are mostly computed with max/argmax/topk methods which are
|
||||||
|
# very sensitive to the inputs (e.g. tiny differences may give totally different results)
|
||||||
|
elif not torch.is_floating_point(batched_object):
|
||||||
|
return
|
||||||
|
else:
|
||||||
|
# indexing the first element does not always work
|
||||||
|
# e.g. models that output similarity scores of size (N, M) would need to index [0, 0]
|
||||||
|
slice_ids = [slice(0, index) for index in single_row_object.shape]
|
||||||
|
batched_row = batched_object[tuple(slice_ids)]
|
||||||
|
if key == "keypoints":
|
||||||
|
batched_row = torch.sum(batched_row, dim=-1)
|
||||||
|
single_row_object = torch.sum(single_row_object, dim=-1)
|
||||||
|
tolerance = 0.02 * single_row_object.shape[-1]
|
||||||
|
self.assertTrue(
|
||||||
|
torch.sum(~torch.isclose(batched_row, single_row_object, rtol=rtol, atol=atol)) < tolerance
|
||||||
|
)
|
||||||
|
else:
|
||||||
|
self.assertFalse(
|
||||||
|
torch.isnan(batched_row).any(), f"Batched output has `nan` in {model_name} for key={key}"
|
||||||
|
)
|
||||||
|
self.assertFalse(
|
||||||
|
torch.isinf(batched_row).any(), f"Batched output has `inf` in {model_name} for key={key}"
|
||||||
|
)
|
||||||
|
self.assertFalse(
|
||||||
|
torch.isnan(single_row_object).any(),
|
||||||
|
f"Single row output has `nan` in {model_name} for key={key}",
|
||||||
|
)
|
||||||
|
self.assertFalse(
|
||||||
|
torch.isinf(single_row_object).any(),
|
||||||
|
f"Single row output has `inf` in {model_name} for key={key}",
|
||||||
|
)
|
||||||
|
try:
|
||||||
|
torch.testing.assert_close(batched_row, single_row_object, atol=atol, rtol=rtol)
|
||||||
|
except AssertionError as e:
|
||||||
|
msg = f"Batched and Single row outputs are not equal in {model_name} for key={key}.\n\n"
|
||||||
|
msg += str(e)
|
||||||
|
raise AssertionError(msg)
|
||||||
|
|
||||||
|
set_model_tester_for_less_flaky_test(self)
|
||||||
|
|
||||||
|
config, batched_input = self.model_tester.prepare_config_and_inputs_for_common()
|
||||||
|
set_config_for_less_flaky_test(config)
|
||||||
|
|
||||||
|
for model_class in self.all_model_classes:
|
||||||
|
config.output_hidden_states = True
|
||||||
|
|
||||||
|
model_name = model_class.__name__
|
||||||
|
if hasattr(self.model_tester, "prepare_config_and_inputs_for_model_class"):
|
||||||
|
config, batched_input = self.model_tester.prepare_config_and_inputs_for_model_class(model_class)
|
||||||
|
batched_input_prepared = self._prepare_for_class(batched_input, model_class)
|
||||||
|
model = model_class(config).to(torch_device).eval()
|
||||||
|
set_model_for_less_flaky_test(model)
|
||||||
|
|
||||||
|
batch_size = self.model_tester.batch_size
|
||||||
|
single_row_input = {}
|
||||||
|
for key, value in batched_input_prepared.items():
|
||||||
|
if isinstance(value, torch.Tensor) and value.shape[0] % batch_size == 0:
|
||||||
|
# e.g. musicgen has inputs of size (bs*codebooks). in most cases value.shape[0] == batch_size
|
||||||
|
single_batch_shape = value.shape[0] // batch_size
|
||||||
|
single_row_input[key] = value[:single_batch_shape]
|
||||||
|
else:
|
||||||
|
single_row_input[key] = value
|
||||||
|
|
||||||
|
with torch.no_grad():
|
||||||
|
model_batched_output = model(**batched_input_prepared)
|
||||||
|
model_row_output = model(**single_row_input)
|
||||||
|
|
||||||
|
if isinstance(model_batched_output, torch.Tensor):
|
||||||
|
model_batched_output = {"model_output": model_batched_output}
|
||||||
|
model_row_output = {"model_output": model_row_output}
|
||||||
|
|
||||||
|
for key in model_batched_output:
|
||||||
|
# DETR starts from zero-init queries to decoder, leading to cos_similarity = `nan`
|
||||||
|
if hasattr(self, "zero_init_hidden_state") and "decoder_hidden_states" in key:
|
||||||
|
model_batched_output[key] = model_batched_output[key][1:]
|
||||||
|
model_row_output[key] = model_row_output[key][1:]
|
||||||
|
recursive_check(model_batched_output[key], model_row_output[key], model_name, key)
|
||||||
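For the `"keypoints"` key, the check above does not require every value to match; it counts the entries that fall outside the tolerance and requires that count to stay below 2% of the number of keypoints. The following is a minimal sketch of that idea on plain Python lists, assuming the same closeness rule as `torch.isclose` (`|a - b| <= atol + rtol * |b|`); the helper names are hypothetical.

```python
def count_mismatches(batched, single, rtol=1e-5, atol=1e-5):
    """Count entries that are not close under |a - b| <= atol + rtol * |b|."""
    return sum(abs(a - b) > atol + rtol * abs(b) for a, b in zip(batched, single))


def keypoints_roughly_equal(batched, single, fraction=0.02):
    """Accept the outputs if fewer than `fraction` of the entries differ."""
    tolerance = fraction * len(single)
    return count_mismatches(batched, single) < tolerance
```

This tolerates a few keypoints that moved because an argmax picked a different index, while still failing when the batched and single-row outputs diverge broadly.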
|
|
||||||
|
|
||||||
|
def prepare_imgs():
|
||||||
|
dataset = load_dataset("hf-internal-testing/image-matching-test-dataset", split="train")
|
||||||
|
image1 = dataset[0]["image"]
|
||||||
|
image2 = dataset[1]["image"]
|
||||||
|
image3 = dataset[2]["image"]
|
||||||
|
return [[image1, image2], [image3, image2]]
|
||||||
|
|
||||||
|
|
||||||
|
@require_torch
|
||||||
|
@require_vision
|
||||||
|
class EfficientLoFTRModelIntegrationTest(unittest.TestCase):
|
||||||
|
@cached_property
|
||||||
|
def default_image_processor(self):
|
||||||
|
return AutoImageProcessor.from_pretrained("stevenbucaille/efficientloftr") if is_vision_available() else None
|
||||||
|
|
||||||
|
@slow
|
||||||
|
def test_inference(self):
|
||||||
|
model = EfficientLoFTRForKeypointMatching.from_pretrained(
|
||||||
|
"stevenbucaille/efficientloftr", attn_implementation="eager"
|
||||||
|
).to(torch_device)
|
||||||
|
preprocessor = self.default_image_processor
|
||||||
|
images = prepare_imgs()
|
||||||
|
inputs = preprocessor(images=images, return_tensors="pt").to(torch_device)
|
||||||
|
with torch.no_grad():
|
||||||
|
outputs = model(**inputs, output_hidden_states=True, output_attentions=True)
|
||||||
|
|
||||||
|
predicted_top10 = torch.topk(outputs.matching_scores[0, 0], k=10)
|
||||||
|
predicted_top10_matches_indices = predicted_top10.indices
|
||||||
|
predicted_top10_matching_scores = predicted_top10.values
|
||||||
|
|
||||||
|
expected_number_of_matches = 4800
|
||||||
|
expected_matches_shape = torch.Size((len(images), 2, expected_number_of_matches))
|
||||||
|
expected_matching_scores_shape = torch.Size((len(images), 2, expected_number_of_matches))
|
||||||
|
|
||||||
|
expected_top10_matches_indices = torch.tensor(
|
||||||
|
[3145, 3065, 3143, 3066, 3144, 1397, 1705, 3151, 2342, 2422], dtype=torch.int64, device=torch_device
|
||||||
|
)
|
||||||
|
expected_top10_matching_scores = torch.tensor(
|
||||||
|
[0.9997, 0.9996, 0.9996, 0.9995, 0.9995, 0.9995, 0.9994, 0.9994, 0.9994, 0.9994], device=torch_device
|
||||||
|
)
|
||||||
|
|
||||||
|
self.assertEqual(outputs.matches.shape, expected_matches_shape)
|
||||||
|
self.assertEqual(outputs.matching_scores.shape, expected_matching_scores_shape)
|
||||||
|
|
||||||
|
torch.testing.assert_close(
|
||||||
|
predicted_top10_matches_indices, expected_top10_matches_indices, rtol=5e-3, atol=5e-3
|
||||||
|
)
|
||||||
|
torch.testing.assert_close(
|
||||||
|
predicted_top10_matching_scores, expected_top10_matching_scores, rtol=5e-3, atol=5e-3
|
||||||
|
)
|
||||||