✨ Add EoMT Model || 🚨 Fix Mask2Former loss calculation (#37610)

* Initial Commit * up * More changes * up * Only mask_logits mismatch * close enough logits debug later * fixes * format * Add dummy loss * Close enough processing for semantic seg * nit * Added panoptic postprocessor * refactor * refactor * finally fixed panoptic postprocessor * temp update * Refactor ForUniversalSegmentation class * nits and config update * Few fixes and inference matches * change mapping * Added training support but loss slightly off 🥲 * Loss is matching 😀 * update * Initial tests skelton * changes * tests update * more modular * initial tests * updates * better docstrings * changes * proc tests passing :) * Image processor update * tiny change * QOL changes * Update test w.r.t latest attn refactor * repo-consistency fixes * up * Image proc fix and integration tests :) * docs update * integration tests * fix * docs update 🥰 * minor fix * Happy CI * fix * obvious refactoring * refactoring w.r.t review * Add fask image proc skelton * Fast Image proc and cleanups * Use more modular * tests update * Add more tests * Nit * QOL updates * change init_weights to torch default * add eager func coz of make style * up * changes * typo fix * Updates * More deterministic tests * More modular * go more modular 🚀 * up * dump * add supprot for giant ckpts * overhaul * modular * refactor * instace seg is ready * cleanup * forgot this * docs cleanup * minor changes * EoMT - > Eomt * Happy CI * remove redundant comment * Change model references * final change * check annealing per block * My other PR changes 😂 --------- Co-authored-by: Cyril Vallez <cyril.vallez@huggingface.co>
2025-06-27 17:48:18 +05:30
parent 0106a50a6b
commit 1750c518dd
16 changed files with 4923 additions and 1 deletions
--- a/docs/source/en/_toctree.yml
+++ b/docs/source/en/_toctree.yml
@@ -737,6 +737,8 @@
        title: EfficientFormer
      - local: model_doc/efficientnet
        title: EfficientNet
      - local: model_doc/eomt
        title: EoMT
      - local: model_doc/focalnet
        title: FocalNet
      - local: model_doc/glpn
--- a/docs/source/en/model_doc/eomt.md
+++ b/docs/source/en/model_doc/eomt.md
@@ -0,0 +1,214 @@
 <!--Copyright 2025 Mobile Perception Systems Lab at TU/e and The HuggingFace Inc. team. All rights reserved.
 Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with
 the License. You may obtain a copy of the License at
 http://www.apache.org/licenses/LICENSE-2.0
 Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on
 an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the
 specific language governing permissions and limitations under the License.
 ⚠️ Note that this file is in Markdown but contain specific syntax for our doc-builder (similar to MDX) that may not be
 rendered properly in your Markdown viewer.
 -->
 # EoMT
 <div class="flex flex-wrap space-x-1">
 <img alt="PyTorch" src="https://img.shields.io/badge/PyTorch-DE3412?style=flat&logo=pytorch&logoColor=white">
 </div>
 ## Overview
 The Encoder-only Mask Transformer (EoMT) model was introduced in the CVPR 2025 Highlight Paper [Your ViT is Secretly an Image Segmentation Model](https://www.tue-mps.org/eomt) by Tommie Kerssies, Niccolò Cavagnero, Alexander Hermans, Narges Norouzi, Giuseppe Averta, Bastian Leibe, Gijs Dubbelman, and Daan de Geus.
 EoMT reveals Vision Transformers can perform image segmentation efficiently without task-specific components.
 The abstract from the paper is the following:
 *Vision Transformers (ViTs) have shown remarkable performance and scalability across various computer vision tasks. To apply single-scale ViTs to image segmentation, existing methods adopt a convolutional adapter to generate multi-scale features, a pixel decoder to fuse these features, and a Transformer decoder that uses the fused features to make predictions. In this paper, we show that the inductive biases introduced by these task-specific components can instead be learned by the ViT itself, given sufficiently large models and extensive pre-training. Based on these findings, we introduce the Encoder-only Mask Transformer (EoMT), which repurposes the plain ViT architecture to conduct image segmentation. With large-scale models and pre-training, EoMT obtains a segmentation accuracy similar to state-of-the-art models that use task-specific components. At the same time, EoMT is significantly faster than these methods due to its architectural simplicity, e.g., up to 4x faster with ViT-L. Across a range of model sizes, EoMT demonstrates an optimal balance between segmentation accuracy and prediction speed, suggesting that compute resources are better spent on scaling the ViT itself rather than adding architectural complexity.*
 This model was contributed by [Yaswanth Gali](https://huggingface.co/yaswanthgali).
 The original code can be found [here](https://github.com/tue-mps/eomt).
 ## Architecture Info
 The `EoMT` model uses a DINOv2-pretrained Vision Transformer with **register tokens** as its backbone. EoMT simplifies the segmentation pipeline by relying solely on the encoder, eliminating the need for task-specific decoders commonly used in prior approaches.
 Architecturally, EoMT introduces a small set of **learned queries** and a lightweight **mask prediction module**. These queries are injected into the final encoder blocks, enabling **joint attention** between image patches and object queries. During training, **masked attention** is applied to constrain each query to focus on its corresponding region—effectively mimicking cross-attention. This constraint is gradually phased out via a **mask annealing strategy**, allowing for **efficient, decoder-free inference** without compromising segmentation performance.
 <div style="text-align: center;">
  <img src="https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/transformers/model_doc/eomt_architecture.png"
       alt="drawing" width="500"/>
 </div>
 The model supports semantic, instance, and panoptic segmentation using a unified architecture and task-specific post-processing.
 ## Usage Examples
 Use the Hugging Face implementation of EoMT for inference with pre-trained models.
 ### Semantic Segmentation
 The EoMT model performs semantic segmentation using sliding-window inference. The input image is resized such that the shorter side matches the target input size, then it is split into overlapping crops. Each crop is then passed through the model. After inference, the predicted logits from each crop are stitched back together and rescaled to the original image size to get the final segmentation mask.
 > **Note:**  
 > If you want to use a custom target size for **semantic segmentation**, specify it in the following format:  
 > `{"shortest_edge": 512}`  
 > Notice that `longest_edge` is not provided here — this is intentional. For semantic segmentation, images are typically **scaled so that the shortest edge is greater than or equal to the target size** hence longest_edge is not necessary.
 ```python
 import matplotlib.pyplot as plt
 import requests
 import torch
 from PIL import Image
 from transformers import EomtForUniversalSegmentation, AutoImageProcessor
 model_id = "tue-mps/ade20k_semantic_eomt_large_512"
 processor = AutoImageProcessor.from_pretrained(model_id)
 model = EomtForUniversalSegmentation.from_pretrained(model_id)
 image = Image.open(requests.get("http://images.cocodataset.org/val2017/000000039769.jpg", stream=True).raw)
 inputs = processor(
    images=image,
    return_tensors="pt",
 )
 # Remove Patch Offsets from inputs — only used later for post-processing.
 patch_offsets = inputs.pop("patch_offsets")
 with torch.inference_mode():
    outputs = model(**inputs)
 # Prepare the original image size in the format (height, width)
 original_image_sizes = [(image.height, image.width)]
 # Post-process the model outputs to get final segmentation prediction
 preds = processor.post_process_semantic_segmentation(
    outputs,
    patch_offsets=patch_offsets,
    original_image_sizes=original_image_sizes,
 )
 # Visualize the segmentation mask
 plt.imshow(preds[0])
 plt.axis("off")
 plt.title("Semantic Segmentation")
 plt.show()
 ```
 ### Instance Segmentation
 The EoMT model performs instance segmentation using padded inference. The input image is resized so that the longer side matches the target input size, and the shorter side is zero-padded to form a square. The resulting mask and class logits are combined through post-processing (adapted from Mask2Former) to produce a unified instance segmentation map, along with segment metadata like segment id, class labels and confidence scores.
 > **Note:**  
 > To use a custom target size, specify the size as a dictionary in the following format:  
 > `{"shortest_edge": 512, "longest_edge": 512}`  
 > For both instance and panoptic segmentation, input images will be **scaled and padded** to this target size.
 ```python
 import matplotlib.pyplot as plt
 import requests
 import torch
 from PIL import Image
 from transformers import EomtForUniversalSegmentation, AutoImageProcessor
 model_id = "tue-mps/coco_instance_eomt_large_640"
 processor = AutoImageProcessor.from_pretrained(model_id)
 model = EomtForUniversalSegmentation.from_pretrained(model_id)
 image = Image.open(requests.get("http://images.cocodataset.org/val2017/000000039769.jpg", stream=True).raw)
 inputs = processor(
    images=image,
    return_tensors="pt",
 )
 with torch.inference_mode():
    outputs = model(**inputs)
 # Prepare the original image size in the format (height, width)
 original_image_sizes = [(image.height, image.width)]
 # Post-process the model outputs to get final segmentation prediction
 preds = processor.post_process_instance_segmentation(
    outputs,
    original_image_sizes=original_image_sizes,
 )
 # Visualize the segmentation mask
 plt.imshow(preds[0]["segmentation"])
 plt.axis("off")
 plt.title("Instance Segmentation")
 plt.show()
 ```
 ### Panoptic Segmentation
 The EoMT model performs panoptic segmentation using the same padded inference strategy as in instance segmentation. After padding and normalization, the model predicts both thing (instances) and stuff (amorphous regions) classes. The resulting mask and class logits are combined through post-processing (adapted from Mask2Former) to produce a unified panoptic segmentation map, along with segment metadata like segment id, class labels and confidence scores.
 ```python
 import matplotlib.pyplot as plt
 import requests
 import torch
 from PIL import Image
 from transformers import EomtForUniversalSegmentation, AutoImageProcessor
 model_id = "tue-mps/coco_panoptic_eomt_large_640"
 processor = AutoImageProcessor.from_pretrained(model_id)
 model = EomtForUniversalSegmentation.from_pretrained(model_id)
 image = Image.open(requests.get("http://images.cocodataset.org/val2017/000000039769.jpg", stream=True).raw)
 inputs = processor(
    images=image,
    return_tensors="pt",
 )
 with torch.inference_mode():
    outputs = model(**inputs)
 # Prepare the original image size in the format (height, width)
 original_image_sizes = [(image.height, image.width)]
 # Post-process the model outputs to get final segmentation prediction
 preds = processor.post_process_panoptic_segmentation(
    outputs,
    original_image_sizes=original_image_sizes,
 )
 # Visualize the panoptic segmentation mask
 plt.imshow(preds[0]["segmentation"])
 plt.axis("off")
 plt.title("Panoptic Segmentation")
 plt.show()
 ```
 ## EomtImageProcessor
 [[autodoc]] EomtImageProcessor
    - preprocess
    - post_process_semantic_segmentation
    - post_process_instance_segmentation
    - post_process_panoptic_segmentation
 ## EomtImageProcessorFast
 [[autodoc]] EomtImageProcessorFast
    - preprocess
    - post_process_semantic_segmentation
    - post_process_instance_segmentation
    - post_process_panoptic_segmentation
 ## EomtConfig
 [[autodoc]] EomtConfig
 ## EomtForUniversalSegmentation
 [[autodoc]] EomtForUniversalSegmentation
    - forward
--- a/src/transformers/models/auto/configuration_auto.py
+++ b/src/transformers/models/auto/configuration_auto.py
@@ -122,6 +122,7 @@ CONFIG_MAPPING_NAMES = OrderedDict[str, str](
        ("emu3", "Emu3Config"),
        ("encodec", "EncodecConfig"),
        ("encoder-decoder", "EncoderDecoderConfig"),
        ("eomt", "EomtConfig"),
        ("ernie", "ErnieConfig"),
        ("ernie_m", "ErnieMConfig"),
        ("esm", "EsmConfig"),
@@ -501,6 +502,7 @@ MODEL_NAMES_MAPPING = OrderedDict[str, str](
        ("emu3", "Emu3"),
        ("encodec", "EnCodec"),
        ("encoder-decoder", "Encoder decoder"),
        ("eomt", "EoMT"),
        ("ernie", "ERNIE"),
        ("ernie_m", "ErnieM"),
        ("esm", "ESM"),
--- a/src/transformers/models/auto/image_processing_auto.py
+++ b/src/transformers/models/auto/image_processing_auto.py
@@ -84,6 +84,7 @@ else:
            ("dpt", ("DPTImageProcessor", "DPTImageProcessorFast")),
            ("efficientformer", ("EfficientFormerImageProcessor",)),
            ("efficientnet", ("EfficientNetImageProcessor", "EfficientNetImageProcessorFast")),
            ("eomt", ("EomtImageProcessor", "EomtImageProcessorFast")),
            ("flava", ("FlavaImageProcessor", "FlavaImageProcessorFast")),
            ("focalnet", ("BitImageProcessor", "BitImageProcessorFast")),
            ("fuyu", ("FuyuImageProcessor",)),
--- a/src/transformers/models/auto/modeling_auto.py
+++ b/src/transformers/models/auto/modeling_auto.py
@@ -854,6 +854,7 @@ MODEL_FOR_UNIVERSAL_SEGMENTATION_MAPPING_NAMES = OrderedDict(
    [
        # Model for Universal Segmentation mapping
        ("detr", "DetrForSegmentation"),
        ("eomt", "EomtForUniversalSegmentation"),
        ("mask2former", "Mask2FormerForUniversalSegmentation"),
        ("maskformer", "MaskFormerForInstanceSegmentation"),
        ("oneformer", "OneFormerForUniversalSegmentation"),
--- a/src/transformers/models/eomt/init.py
+++ b/src/transformers/models/eomt/init.py
@@ -0,0 +1,29 @@
 # Copyright 2025 The HuggingFace Team. All rights reserved.
 #
 # Licensed under the Apache License, Version 2.0 (the "License");
 # you may not use this file except in compliance with the License.
 # You may obtain a copy of the License at
 #
 #     http://www.apache.org/licenses/LICENSE-2.0
 #
 # Unless required by applicable law or agreed to in writing, software
 # distributed under the License is distributed on an "AS IS" BASIS,
 # WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
 # See the License for the specific language governing permissions and
 # limitations under the License.
 from typing import TYPE_CHECKING
 from ...utils import _LazyModule
 from ...utils.import_utils import define_import_structure
 if TYPE_CHECKING:
    from .configuration_eomt import *
    from .image_processing_eomt import *
    from .image_processing_eomt_fast import *
    from .modeling_eomt import *
 else:
    import sys
    _file = globals()["__file__"]
    sys.modules[__name__] = _LazyModule(__name__, _file, define_import_structure(_file), module_spec=__spec__)
--- a/src/transformers/models/eomt/configuration_eomt.py
+++ b/src/transformers/models/eomt/configuration_eomt.py
@@ -0,0 +1,168 @@
 #                🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨
 #           This file was automatically generated from src/transformers/models/eomt/modular_eomt.py.
 #               Do NOT edit this file manually as any edits will be overwritten by the generation of
 #             the file from the modular. If any change should be done, please apply the change to the
 #                          modular_eomt.py file directly. One of our CI enforces this.
 #                🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨
 # coding=utf-8
 # Copyright 2025 Mobile Perception Systems Lab at TU/e and The HuggingFace Inc. team. All rights reserved.
 #
 # Licensed under the Apache License, Version 2.0 (the "License");
 # you may not use this file except in compliance with the License.
 # You may obtain a copy of the License at
 #
 #     http://www.apache.org/licenses/LICENSE-2.0
 #
 # Unless required by applicable law or agreed to in writing, software
 # distributed under the License is distributed on an "AS IS" BASIS,
 # WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
 # See the License for the specific language governing permissions and
 # limitations under the License.
 from ...configuration_utils import PretrainedConfig
 class EomtConfig(PretrainedConfig):
    r"""
    This is the configuration class to store the configuration of a [`EomtForUniversalSegmentation`]. It is used to instantiate an EoMT model
    according to the specified arguments, defining the model architecture. Instantiating a configuration with the
    defaults will yield a similar configuration to that of the EoMT
    [tue-mps/coco_panoptic_eomt_large_640](https://huggingface.co/tue-mps/coco_panoptic_eomt_large_640)
    architecture.
    Configuration objects inherit from [`PretrainedConfig`] and can be used to control the model outputs. Read the
    documentation from [`PretrainedConfig`] for more information.
    Args:
        hidden_size (`int`, *optional*, defaults to 1024):
            Dimensionality of the hidden representations.
        num_hidden_layers (`int`, *optional*, defaults to 24):
            Number of hidden layers in the Transformer encoder.
        num_attention_heads (`int`, *optional*, defaults to 16):
            Number of attention heads in each attention layer.
        mlp_ratio (`int`, *optional*, defaults to 4):
            Ratio of the MLP hidden dimensionality to the hidden size.
        hidden_act (`str` or `function`, *optional*, defaults to `"gelu"`):
            The non-linear activation function (function or string) in the encoder.
        hidden_dropout_prob (`float`, *optional*, defaults to 0.0):
            The dropout probability for all fully connected layers in the embeddings and encoder.
        initializer_range (`float`, *optional*, defaults to 0.02):
            The standard deviation of the truncated_normal_initializer for initializing all weight matrices.
        layer_norm_eps (`float`, *optional*, defaults to 1e-06):
            The epsilon used by the layer normalization layers.
        image_size (`int`, *optional*, defaults to 640):
            The size (resolution) of each input image.
        patch_size (`int`, *optional*, defaults to 16):
            The size (resolution) of each patch.
        num_channels (`int`, *optional*, defaults to 3):
            The number of input channels.
        layerscale_value (`float`, *optional*, defaults to 1.0):
            Initial value for the LayerScale parameter.
        drop_path_rate (`float`, *optional*, defaults to 0.0):
            The stochastic depth rate (drop path) used during training.
        num_upscale_blocks (`int`, *optional*, defaults to 2):
            Number of upsampling blocks used in the decoder or segmentation head.
        attention_dropout (`float`, *optional*, defaults to 0.0):
            Dropout probability applied after attention projection.
        use_swiglu_ffn (`bool`, *optional*, defaults to `False`):
            Whether to use the SwiGLU feedforward neural network.
        num_blocks (`int`, *optional*, defaults to 4):
            Number of feature blocks or stages in the architecture.
        no_object_weight (`float`, *optional*, defaults to 0.1):
            Loss weight for the 'no object' class in panoptic/instance segmentation.
        class_weight (`float`, *optional*, defaults to 2.0):
            Loss weight for classification targets.
        mask_weight (`float`, *optional*, defaults to 5.0):
            Loss weight for mask prediction.
        dice_weight (`float`, *optional*, defaults to 5.0):
            Loss weight for the dice loss component.
        train_num_points (`int`, *optional*, defaults to 12544):
            Number of points to sample for mask loss computation during training.
        oversample_ratio (`float`, *optional*, defaults to 3.0):
            Oversampling ratio used in point sampling for mask training.
        importance_sample_ratio (`float`, *optional*, defaults to 0.75):
            Ratio of points to sample based on importance during training.
        num_queries (`int`, *optional*, defaults to 200):
            Number of object queries in the Transformer.
        num_register_tokens (`int`, *optional*, defaults to 4):
            Number of learnable register tokens added to the transformer input.
    Example:
    ```python
    >>> from transformers import EomtConfig, EomtForUniversalSegmentation
    >>> # Initialize configuration
    >>> config = EomtConfig()
    >>> # Initialize model
    >>> model = EomtForUniversalSegmentation(config)
    >>> # Access config
    >>> config = model.config
    ```"""
    model_type = "eomt"
    def __init__(
        self,
        hidden_size=1024,
        num_hidden_layers=24,
        num_attention_heads=16,
        mlp_ratio=4,
        hidden_act="gelu",
        hidden_dropout_prob=0.0,
        initializer_range=0.02,
        layer_norm_eps=1e-6,
        image_size=640,
        patch_size=16,
        num_channels=3,
        layerscale_value=1.0,
        drop_path_rate=0.0,
        num_upscale_blocks=2,
        attention_dropout=0.0,
        use_swiglu_ffn=False,
        num_blocks=4,
        no_object_weight: float = 0.1,
        class_weight: float = 2.0,
        mask_weight: float = 5.0,
        dice_weight: float = 5.0,
        train_num_points: int = 12544,
        oversample_ratio: float = 3.0,
        importance_sample_ratio: float = 0.75,
        num_queries=200,
        num_register_tokens=4,
        **kwargs,
    ):
        super().__init__(**kwargs)
        self.hidden_size = hidden_size
        self.num_hidden_layers = num_hidden_layers
        self.num_attention_heads = num_attention_heads
        self.hidden_act = hidden_act
        self.hidden_dropout_prob = hidden_dropout_prob
        self.initializer_range = initializer_range
        self.layer_norm_eps = layer_norm_eps
        self.image_size = image_size
        self.patch_size = patch_size
        self.num_channels = num_channels
        self.mlp_ratio = mlp_ratio
        self.attention_dropout = attention_dropout
        self.layerscale_value = layerscale_value
        self.drop_path_rate = drop_path_rate
        self.num_upscale_blocks = num_upscale_blocks
        self.use_swiglu_ffn = use_swiglu_ffn
        self.num_blocks = num_blocks
        self.no_object_weight = no_object_weight
        self.class_weight = class_weight
        self.mask_weight = mask_weight
        self.dice_weight = dice_weight
        self.train_num_points = train_num_points
        self.oversample_ratio = oversample_ratio
        self.importance_sample_ratio = importance_sample_ratio
        self.num_queries = num_queries
        self.num_register_tokens = num_register_tokens
 __all__ = ["EomtConfig"]
--- a/src/transformers/models/eomt/convert_eomt_to_hf.py
+++ b/src/transformers/models/eomt/convert_eomt_to_hf.py
@@ -0,0 +1,340 @@
 # coding=utf-8
 # Copyright 2025 The HuggingFace Team. All rights reserved.
 #
 # Licensed under the Apache License, Version 2.0 (the "License");
 # you may not use this file except in compliance with the License.
 # You may obtain a copy of the License at
 #
 #     http://www.apache.org/licenses/LICENSE-2.0
 #
 # Unless required by applicable law or agreed to in writing, software
 # distributed under the License is distributed on an "AS IS" BASIS,
 # WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
 # See the License for the specific language governing permissions and
 # limitations under the License.
 import argparse
 import gc
 import json
 import os
 import re
 from typing import Optional
 import torch
 from accelerate import init_empty_weights
 from huggingface_hub import snapshot_download
 from transformers import EomtConfig, EomtForUniversalSegmentation, EomtImageProcessorFast
 # fmt: off
 MAPPINGS = {
    # Embeddings
    r"network.encoder.backbone.cls_token"                : r"embeddings.cls_token",
    r"network.encoder.backbone.reg_token"                : r"embeddings.register_tokens",
    r"network.encoder.backbone.pos_embed"                : r"embeddings.position_embeddings.weight",
    r"network.encoder.backbone.patch_embed.proj"         : r"embeddings.patch_embeddings.projection",
    # Encoder Block
    r"network.encoder.backbone.blocks.(\d+).norm1"       : r"layers.\1.norm1",
    r"network.encoder.backbone.blocks.(\d+).attn.proj"   : r"layers.\1.attention.out_proj",
    r"network.encoder.backbone.blocks.(\d+).ls1.gamma"   : r"layers.\1.layer_scale1.lambda1",
    r"network.encoder.backbone.blocks.(\d+).norm2"       : r"layers.\1.norm2",
    r"network.encoder.backbone.blocks.(\d+).ls2.gamma"   : r"layers.\1.layer_scale2.lambda1",
    r"network.encoder.backbone.blocks.(\d+).attn"        : r"layers.\1.attention",
    # Others
    r"network.q.weight"                                  : r"query.weight",
    r"network.class_head"                                : r"class_predictor",
    r"network.upscale.(\d+).conv1"                       : r"upscale_block.block.\1.conv1",
    r"network.upscale.(\d+).conv2"                       : r"upscale_block.block.\1.conv2",
    r"network.upscale.(\d+).norm"                        : r"upscale_block.block.\1.layernorm2d",
    r"network.mask_head.0"                               : r"mask_head.fc1",
    r"network.mask_head.2"                               : r"mask_head.fc2",
    r"network.mask_head.4"                               : r"mask_head.fc3",
    r"network.encoder.backbone.norm"                     : r"layernorm",
    r"network.attn_mask_probs"                           : r"attn_mask_probs",
 }
 # fmt: on
 # Mappings for MLP layers, depending on the type of MLP used in ckpts.
 MLP_MAPPINGS = {
    "swiglu_ffn": {
        r"network.encoder.backbone.blocks.(\d+).mlp.fc1": r"layers.\1.mlp.weights_in",
        r"network.encoder.backbone.blocks.(\d+).mlp.fc2": r"layers.\1.mlp.weights_out",
    },
    "vanilla_mlp": {
        r"network.encoder.backbone.blocks.(\d+).mlp": r"layers.\1.mlp",
    },
 }
 def convert_old_keys_to_new_keys(state_dict):
    keys_as_text = "\n".join(state_dict.keys())
    new_keys_as_text = keys_as_text
    for old, repl in MAPPINGS.items():
        if repl is None:
            new_keys_as_text = re.sub(old, "", new_keys_as_text)
        else:
            new_keys_as_text = re.sub(old, repl, new_keys_as_text)
    output_dict = dict(zip(keys_as_text.split("\n"), new_keys_as_text.split("\n")))
    return output_dict
 def split_qkv_tensor(key, tensor):
    """Splits a qkv tensor into separate q, k, v tensors and updates the key accordingly."""
    new_keys = ["q_proj", "k_proj", "v_proj"]
    split_size = tensor.shape[0] // 3
    split_tensors = torch.split(tensor, split_size, dim=0)
    return {key.replace("qkv", new_key): split_tensors[i] for i, new_key in enumerate(new_keys)}
 def convert_state_dict_to_hf(state_dict):
    """Convert state dict keys to HF format."""
    conversion_dict = convert_old_keys_to_new_keys(state_dict)
    converted_state_dict = {}
    for old_key, new_key in conversion_dict.items():
        if new_key:
            if "qkv" in new_key:  # Detect merged attention keys and split them.
                qkv_split_dict = split_qkv_tensor(new_key, state_dict[old_key])
                converted_state_dict.update(qkv_split_dict)
            else:
                converted_state_dict[new_key] = state_dict[old_key]
    for i in [
        "network.encoder.pixel_mean",
        "network.encoder.pixel_std",
    ]:
        converted_state_dict.pop(i)
    # Embeddings will not have initial dimension
    pos_embed_key = "embeddings.position_embeddings.weight"
    converted_state_dict[pos_embed_key] = converted_state_dict[pos_embed_key].squeeze(0)
    return converted_state_dict
 def ensure_model_downloaded(
    repo_id: Optional[str] = None, revision: Optional[str] = None, local_dir: Optional[str] = None
 ) -> str:
    """
    Ensures model files are downloaded locally, downloads them if not.
    Returns path to local files.
    Args:
        repo_id: The Hugging Face model repo ID (required if local_dir not provided)
        revision: Optional git revision to use
        local_dir: Optional local directory path where model files should be stored/found
    """
    if local_dir is not None:
        if os.path.exists(local_dir):
            print(f"Using provided local directory: {local_dir}")
        else:
            # Create the local directory if it doesn't exist
            os.makedirs(local_dir, exist_ok=True)
            print(f"Created local directory: {local_dir}")
    if repo_id is None:
        raise ValueError("Either repo_id or local_dir must be provided")
    print(f"Ensuring {repo_id} (revision: {revision or 'latest'}) is downloaded...")
    try:
        # First try to find files locally
        download_dir = snapshot_download(repo_id, revision=revision, local_files_only=True, local_dir=local_dir)
        print(f"Found model files locally at {download_dir}")
        return download_dir
    except Exception:
        # If files not found locally, download them
        print(f"Downloading model files for {repo_id}...")
        download_dir = snapshot_download(repo_id, revision=revision, local_files_only=False, local_dir=local_dir)
        print(f"Downloaded model files to {download_dir}")
        return download_dir
 def load_model_state_dict(input_path: str) -> dict:
    """
    Load model state dict, handling both single and sharded files.
    """
    index_path = os.path.join(input_path, "pytorch_model.bin.index.json")
    single_file_path = os.path.join(input_path, "pytorch_model.bin")
    # Check if we have a sharded model
    if os.path.exists(index_path):
        print("Loading sharded model...")
        state_dict = {}
        with open(index_path, "r") as f:
            index = json.load(f)
        # Get unique shard files and load each one only once
        unique_shard_files = sorted(set(index["weight_map"].values()))
        for shard_file in unique_shard_files:
            print(f"Loading shard {shard_file}...")
            shard_path = os.path.join(input_path, shard_file)
            shard_dict = torch.load(shard_path, map_location="cpu")
            state_dict.update(shard_dict)
        return state_dict
    # Single file model
    elif os.path.exists(single_file_path):
        print("Loading single file model...")
        return torch.load(single_file_path, map_location="cpu")
    else:
        raise ValueError(f"No model files found in {input_path}")
 def convert_model(
    repo_id=None,
    local_dir=None,
    output_dir=None,
    output_hub_path=None,
    safe_serialization=True,
    revision=None,
 ):
    """Convert and save the model weights, processor, and configuration."""
    if output_dir is None and output_hub_path is None:
        raise ValueError("At least one of output_dir or output_hub_path must be specified")
    if repo_id is None and local_dir is None:
        raise ValueError("Either repo_id or local_dir must be specified")
    # Create output directory if specified
    if output_dir:
        os.makedirs(output_dir, exist_ok=True)
        print(f"Created/verified output directory: {output_dir}")
    torch.set_default_dtype(torch.float16)
    # Download or locate model files
    input_path = ensure_model_downloaded(repo_id=repo_id, revision=revision, local_dir=local_dir)
    with open(os.path.join(input_path, "config.json"), "r") as f:
        config_data = json.load(f)
    # Pop off unwanted keys
    _ = config_data.pop("backbone", None)
    config = EomtConfig(
        **{
            **config_data,
            "layerscale_value": 1e-5,
        }
    )
    if "semantic" in repo_id.split("_"):
        size = {"shortest_edge": config.image_size, "longest_edge": None}
        do_split_image = True
        do_pad = False
    else:
        size = {"shortest_edge": config.image_size, "longest_edge": config.image_size}
        do_split_image = False
        do_pad = True
    if "giant" in repo_id.split("_"):
        config.use_swiglu_ffn = True
        config.hidden_size = 1536
        config.num_hidden_layers = 40
        config.num_attention_heads = 24
        # Update MAPPINGS for ckpts depending on the MLP type
        MAPPINGS.update(MLP_MAPPINGS["swiglu_ffn"])
    else:
        MAPPINGS.update(MLP_MAPPINGS["vanilla_mlp"])
    processor = EomtImageProcessorFast(size=size, do_split_image=do_split_image, do_pad=do_pad)
    # Save the config and processor
    if output_dir:
        config.save_pretrained(output_dir)
        processor.save_pretrained(output_dir)
    if output_hub_path:
        config.push_to_hub(output_hub_path)
        processor.push_to_hub(output_hub_path)
    # Initialize model with empty weights
    print("Creating empty model...")
    with init_empty_weights():
        model = EomtForUniversalSegmentation(config)
    # Load and convert state dict
    print("Loading state dict...")
    state_dict = load_model_state_dict(input_path)
    state_dict = convert_state_dict_to_hf(state_dict)
    # Load converted state dict
    print("Loading converted weights into model...")
    model.load_state_dict(state_dict, strict=True, assign=True)
    # Save the model
    if output_dir:
        print(f"Saving model to {output_dir}...")
        model.save_pretrained(output_dir, safe_serialization=safe_serialization)
    if output_hub_path:
        print(f"Pushing model to hub at {output_hub_path}...")
        model.push_to_hub(output_hub_path, safe_serialization=safe_serialization)
    del state_dict, model
    gc.collect()
    # Validate the saved model if saved locally
    if output_dir:
        print("Reloading the local model to check if it's saved correctly...")
        EomtForUniversalSegmentation.from_pretrained(output_dir, device_map="auto")
        print("Local model reloaded successfully.")
 def main():
    parser = argparse.ArgumentParser()
    parser.add_argument(
        "--hf_repo_id",
        help="HuggingFace Hub repo ID for the model",
        default=None,
    )
    parser.add_argument(
        "--local_dir",
        help="Local directory containing the model files",
        default=None,
    )
    parser.add_argument(
        "--revision",
        help="Specific revision to download from the Hub",
        default=None,
    )
    parser.add_argument(
        "--output_dir",
        help="Location to write HF model locally",
        default=None,
    )
    parser.add_argument(
        "--output_hub_path",
        help="Repository ID to push model to hub (e.g. 'username/model-name')",
        default=None,
    )
    parser.add_argument(
        "--safe_serialization",
        action="store_true",
        help="Whether to save using safetensors",
    )
    args = parser.parse_args()
    if args.output_dir is None and args.output_hub_path is None:
        raise ValueError("At least one of --output_dir or --output_hub_path must be specified")
    if args.hf_repo_id is None and args.local_dir is None:
        raise ValueError("Either --hf_repo_id or --local_dir must be specified")
    convert_model(
        repo_id=args.hf_repo_id,
        local_dir=args.local_dir,
        output_dir=args.output_dir,
        output_hub_path=args.output_hub_path,
        safe_serialization=args.safe_serialization,
        revision=args.revision,
    )
 if __name__ == "__main__":
    main()
--- a/src/transformers/models/eomt/image_processing_eomt.py
+++ b/src/transformers/models/eomt/image_processing_eomt.py
@@ -0,0 +1,972 @@
 # coding=utf-8
 # Copyright 2025 Mobile Perception Systems Lab at TU/e and The HuggingFace Inc. team. All rights reserved.
 #
 # Licensed under the Apache License, Version 2.0 (the "License");
 # you may not use this file except in compliance with the License.
 # You may obtain a copy of the License at
 #
 #     http://www.apache.org/licenses/LICENSE-2.0
 #
 # Unless required by applicable law or agreed to in writing, software
 # distributed under the License is distributed on an "AS IS" BASIS,
 # WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
 # See the License for the specific language governing permissions and
 # limitations under the License.
 """Image processor class for EoMT."""
 import math
 from typing import Optional, Union
 import numpy as np
 from ...image_processing_utils import BaseImageProcessor, BatchFeature, get_size_dict
 from ...image_transforms import (
    PaddingMode,
    pad,
    resize,
 )
 from ...image_utils import (
    ChannelDimension,
    ImageInput,
    PILImageResampling,
    get_image_size,
    infer_channel_dimension_format,
    make_flat_list_of_images,
    make_list_of_images,
    to_numpy_array,
    valid_images,
    validate_preprocess_arguments,
 )
 from ...utils import (
    IMAGENET_DEFAULT_MEAN,
    IMAGENET_DEFAULT_STD,
    TensorType,
    filter_out_non_signature_kwargs,
    is_torch_available,
    logging,
 )
 logger = logging.get_logger(__name__)
 if is_torch_available():
    import torch
    import torch.nn.functional as F
 # Adapted from transformers.models.maskformer.image_processing_maskformer.convert_segmentation_map_to_binary_masks
 def convert_segmentation_map_to_binary_masks(
    segmentation_map: "np.ndarray",
    instance_id_to_semantic_id: Optional[dict[int, int]] = None,
    ignore_index: Optional[int] = None,
 ):
    if ignore_index is not None:
        segmentation_map = np.where(segmentation_map == 0, ignore_index, segmentation_map - 1)
    # Get unique ids (class or instance ids based on input)
    all_labels = np.unique(segmentation_map)
    # Drop background label if applicable
    if ignore_index is not None:
        all_labels = all_labels[all_labels != ignore_index]
    # Generate a binary mask for each object instance
    binary_masks = [(segmentation_map == i) for i in all_labels]
    # Stack the binary masks
    if binary_masks:
        binary_masks = np.stack(binary_masks, axis=0)
    else:
        binary_masks = np.zeros((0, *segmentation_map.shape))
    # Convert instance ids to class ids
    if instance_id_to_semantic_id is not None:
        labels = np.zeros(all_labels.shape[0])
        for label in all_labels:
            class_id = instance_id_to_semantic_id[label + 1 if ignore_index is not None else label]
            labels[all_labels == label] = class_id - 1 if ignore_index is not None else class_id
    else:
        labels = all_labels
    return binary_masks.astype(np.float32), labels.astype(np.int64)
 def get_size_with_aspect_ratio(image_size, size, max_size=None) -> tuple[int, int]:
    """
    Computes the output image size given the input image size and the desired output size.
    Args:
        image_size (`Tuple[int, int]`):
            The input image size.
        size (`int`):
            The desired output size.
        max_size (`int`, *optional*):
            The maximum allowed output size.
    """
    height, width = image_size
    raw_size = None
    if max_size is not None:
        min_original_size = float(min((height, width)))
        max_original_size = float(max((height, width)))
        if max_original_size / min_original_size * size > max_size:
            raw_size = max_size * min_original_size / max_original_size
            size = int(round(raw_size))
    if (height <= width and height == size) or (width <= height and width == size):
        oh, ow = height, width
    elif width < height:
        ow = size
        if max_size is not None and raw_size is not None:
            oh = round(raw_size * height / width)
        else:
            oh = round(size * height / width)
    else:
        oh = size
        if max_size is not None and raw_size is not None:
            ow = round(raw_size * width / height)
        else:
            ow = round(size * width / height)
    return (oh, ow)
 # Copied from transformers.models.detr.image_processing_detr.remove_low_and_no_objects
 def remove_low_and_no_objects(masks, scores, labels, object_mask_threshold, num_labels):
    """
    Binarize the given masks using `object_mask_threshold`, it returns the associated values of `masks`, `scores` and
    `labels`.
    Args:
        masks (`torch.Tensor`):
            A tensor of shape `(num_queries, height, width)`.
        scores (`torch.Tensor`):
            A tensor of shape `(num_queries)`.
        labels (`torch.Tensor`):
            A tensor of shape `(num_queries)`.
        object_mask_threshold (`float`):
            A number between 0 and 1 used to binarize the masks.
    Raises:
        `ValueError`: Raised when the first dimension doesn't match in all input tensors.
    Returns:
        `tuple[`torch.Tensor`, `torch.Tensor`, `torch.Tensor`]`: The `masks`, `scores` and `labels` without the region
        < `object_mask_threshold`.
    """
    if not (masks.shape[0] == scores.shape[0] == labels.shape[0]):
        raise ValueError("mask, scores and labels must have the same shape!")
    to_keep = labels.ne(num_labels) & (scores > object_mask_threshold)
    return masks[to_keep], scores[to_keep], labels[to_keep]
 def check_segment_validity(mask_labels, mask_probs, k, mask_threshold=0.5, overlap_mask_area_threshold=0.8):
    # Get the mask associated with the k class
    mask_k = mask_labels == k
    mask_k_area = mask_k.sum()
    # Compute the area of all the stuff in query k
    original_mask = mask_probs[k] >= mask_threshold
    original_area = original_mask.sum()
    final_mask = mask_k & original_mask
    final_mask_area = final_mask.sum()
    mask_exists = mask_k_area > 0 and original_area > 0 and final_mask_area > 0
    if mask_exists:
        area_ratio = mask_k_area / original_area
        if not area_ratio.item() > overlap_mask_area_threshold:
            mask_exists = False
    return mask_exists, final_mask
 def compute_segments(
    mask_probs,
    pred_scores,
    pred_labels,
    stuff_classes,
    mask_threshold: float = 0.5,
    overlap_mask_area_threshold: float = 0.8,
    target_size: Optional[tuple[int, int]] = None,
 ):
    height = mask_probs.shape[1] if target_size is None else target_size[0]
    width = mask_probs.shape[2] if target_size is None else target_size[1]
    segmentation = torch.zeros((height, width), dtype=torch.long, device=mask_probs.device) - 1
    segments: list[dict] = []
    # Compute per-pixel assignment based on weighted mask scores
    mask_probs = mask_probs.sigmoid()
    mask_labels = (pred_scores[:, None, None] * mask_probs).argmax(0)
    # Keep track of instances of each class
    current_segment_id = 0
    stuff_memory_list: dict[str, int] = {}
    for k in range(pred_labels.shape[0]):
        pred_class = pred_labels[k].item()
        # Check if mask exists and large enough to be a segment
        mask_exists, final_mask = check_segment_validity(
            mask_labels, mask_probs, k, mask_threshold, overlap_mask_area_threshold
        )
        if not mask_exists:
            continue
        if stuff_classes and pred_class in stuff_classes:
            if pred_class in stuff_memory_list:
                segmentation[final_mask] = stuff_memory_list[pred_class]
                continue
            else:
                stuff_memory_list[pred_class] = current_segment_id
        segmentation[final_mask] = current_segment_id
        segment_score = round(pred_scores[k].item(), 6)
        segments.append(
            {
                "id": current_segment_id,
                "label_id": pred_class,
                "score": segment_score,
            }
        )
        current_segment_id += 1
    return segmentation, segments
 def get_target_size(size_dict: dict[str, int]) -> tuple[int, int]:
    """Returns the height and width from a size dict."""
    target_height = size_dict["shortest_edge"]
    target_width = size_dict.get("longest_edge", None) or target_height
    return target_height, target_width
 class EomtImageProcessor(BaseImageProcessor):
    r"""
    Constructs a EoMT image processor. The image processor can be used to prepare image(s) and optional targets
    for the model.
    This image processor inherits from [`BaseImageProcessor`] which contains most of the main methods. Users should
    refer to this superclass for more information regarding those methods.
    Args:
        do_resize (`bool`, *optional*, defaults to `True`):
            Whether to resize the input to a certain `size`.
        size (`int`, *optional*, defaults to 640):
            Resize the input to the given size. Only has an effect if `do_resize` is set to `True`. If size is a
            sequence like `(width, height)`, output size will be matched to this. If size is an int, smaller edge of
            the image will be matched to this number. i.e, if `height > width`, then image will be rescaled to `(size *
            height / width, size)`.
        resample (`int`, *optional*, defaults to `Resampling.BILINEAR`):
            An optional resampling filter. This can be one of `PIL.Image.Resampling.NEAREST`,
            `PIL.Image.Resampling.BOX`, `PIL.Image.Resampling.BILINEAR`, `PIL.Image.Resampling.HAMMING`,
            `PIL.Image.Resampling.BICUBIC` or `PIL.Image.Resampling.LANCZOS`. Only has an effect if `do_resize` is set
            to `True`.
        do_rescale (`bool`, *optional*, defaults to `True`):
            Whether to rescale the input to a certain `scale`.
        rescale_factor (`float`, *optional*, defaults to `1/ 255`):
            Rescale the input by the given factor. Only has an effect if `do_rescale` is set to `True`.
        do_normalize (`bool`, *optional*, defaults to `True`):
            Whether or not to normalize the input with mean and standard deviation.
        do_split_image (`bool`, *optional*, defaults to `False`):
            Whether to split the input images into overlapping patches for semantic segmentation. If set to `True`, the
            input images will be split into patches of size `size["shortest_edge"]` with an overlap between patches.
            Otherwise, the input images will be padded to the target size.
        do_pad (`bool`, *optional*, defaults to `False`):
            Whether to pad the image. If `True`, will pad the patch dimension of the images in the batch to the largest
            number of patches in the batch. Padding will be applied to the bottom and right with zeros.
        image_mean (`int`, *optional*, defaults to `[0.485, 0.456, 0.406]`):
            The sequence of means for each channel, to be used when normalizing images. Defaults to the ImageNet mean.
        image_std (`int`, *optional*, defaults to `[0.229, 0.224, 0.225]`):
            The sequence of standard deviations for each channel, to be used when normalizing images. Defaults to the
            ImageNet std.
        ignore_index (`int`, *optional*):
            Label to be assigned to background pixels in segmentation maps. If provided, segmentation map pixels
            denoted with 0 (background) will be replaced with `ignore_index`.
        num_labels (`int`, *optional*):
            The number of labels in the segmentation map.
    """
    model_input_names = ["pixel_values"]
    def __init__(
        self,
        do_resize: bool = True,
        size: Optional[dict[str, int]] = None,
        resample: PILImageResampling = PILImageResampling.BILINEAR,
        do_rescale: bool = True,
        rescale_factor: float = 1 / 255,
        do_normalize: bool = True,
        do_split_image: bool = False,
        do_pad: bool = False,
        image_mean: Optional[Union[float, list[float]]] = None,
        image_std: Optional[Union[float, list[float]]] = None,
        ignore_index: Optional[int] = None,
        num_labels: Optional[int] = None,
        **kwargs,
    ):
        super().__init__(**kwargs)
        size = size if size is not None else {"shortest_edge": 640, "longest_edge": 640}
        size = get_size_dict(size, default_to_square=False)
        self.do_resize = do_resize
        self.size = size
        self.resample = resample
        self.do_rescale = do_rescale
        self.rescale_factor = rescale_factor
        self.do_normalize = do_normalize
        self.do_split_image = do_split_image
        self.do_pad = do_pad
        self.image_mean = image_mean if image_mean is not None else IMAGENET_DEFAULT_MEAN
        self.image_std = image_std if image_std is not None else IMAGENET_DEFAULT_STD
        self.ignore_index = ignore_index
        self.num_labels = num_labels
    def resize(
        self,
        image: np.ndarray,
        size: dict,
        resample: PILImageResampling = PILImageResampling.BILINEAR,
        data_format=None,
        input_data_format: Optional[Union[str, ChannelDimension]] = None,
        **kwargs,
    ) -> np.ndarray:
        """
        Resize an image. The shortest edge of the image is resized to size["shortest_edge"], with the longest edge
        resized to keep the input aspect ratio.
        Args:
            image (`np.ndarray`):
                Image to resize.
            resample (`PILImageResampling`, *optional*, defaults to `PILImageResampling.BICUBIC`):
                Resampling filter to use when resiizing the image.
            data_format (`str` or `ChannelDimension`, *optional*):
                The channel dimension format of the image. If not provided, it will be the same as the input image.
            input_data_format (`ChannelDimension` or `str`, *optional*):
                The channel dimension format of the input image. If not provided, it will be inferred.
        """
        image_size = get_image_size(image)
        output_size = get_size_with_aspect_ratio(image_size, size["shortest_edge"], size["longest_edge"])
        image = resize(
            image=image,
            size=output_size,
            resample=resample,
            data_format=data_format,
            input_data_format=input_data_format,
            return_numpy=True,
            **kwargs,
        )
        return image
    def _split_image(self, image: ImageInput, size: dict, image_index: int) -> tuple[list, list]:
        """Slices an image into overlapping patches for semantic segmentation."""
        patches, patch_offsets = [], []
        image_size = get_image_size(image)
        patch_size = size["shortest_edge"]
        longer_side = max(image_size)
        num_patches = math.ceil(longer_side / patch_size)
        total_overlap = num_patches * patch_size - longer_side
        overlap_per_patch = total_overlap / (num_patches - 1) if num_patches > 1 else 0
        for i in range(num_patches):
            start = int(i * (patch_size - overlap_per_patch))
            end = start + patch_size
            if image_size[0] > image_size[1]:
                patch = image[:, start:end, :]
            else:
                patch = image[:, :, start:end]
            patches.append(patch)
            patch_offsets.append([image_index, start, end])
        return patches, patch_offsets
    def _pad(self, image: ImageInput, size: dict) -> np.ndarray:
        """Pads the image to the target size using zero padding."""
        height, width = get_image_size(image)
        target_height, target_width = get_target_size(size)
        pad_h = max(0, target_height - height)
        pad_w = max(0, target_width - width)
        padding = ((0, pad_h), (0, pad_w))
        # Channel axis is last; default padding format is compatible
        padded_image = pad(image=image, padding=padding, mode=PaddingMode.CONSTANT, constant_values=0.0)
        return padded_image
    def _preprocess_images(
        self,
        images: ImageInput,
        do_resize: Optional[bool] = None,
        size: Optional[dict[str, int]] = None,
        resample: PILImageResampling = None,
        do_split_image: Optional[bool] = None,
        do_pad: Optional[bool] = None,
        do_rescale: Optional[bool] = None,
        rescale_factor: Optional[float] = None,
        do_normalize: Optional[bool] = None,
        image_mean: Optional[Union[float, list[float]]] = None,
        image_std: Optional[Union[float, list[float]]] = None,
        data_format: Optional[Union[str, ChannelDimension]] = None,
        input_data_format: Optional[Union[str, ChannelDimension]] = None,
    ) -> np.ndarray:
        """Preprocesses a batch of images."""
        images = [to_numpy_array(image) for image in images]
        if do_resize:
            images = [
                self.resize(
                    image,
                    size=size,
                    resample=resample,
                    data_format=data_format,
                    input_data_format=input_data_format,
                )
                for image in images
            ]
        processed_images, patch_offsets = [], []
        if do_split_image:
            for idx, img in enumerate(images):
                patches, offsets = self._split_image(img, size, idx)
                processed_images.extend(patches)
                patch_offsets.extend(offsets)
            images = processed_images
        if do_pad:
            images = [self._pad(img, size) for img in images]
        if do_rescale:
            images = [self.rescale(img, scale=rescale_factor, input_data_format=input_data_format) for img in images]
        if do_normalize:
            images = [
                self.normalize(
                    image,
                    mean=image_mean,
                    std=image_std,
                    input_data_format=input_data_format,
                )
                for image in images
            ]
        return images, patch_offsets
    def _preprocess_mask(
        self,
        segmentation_map: ImageInput,
        do_resize: Optional[bool] = False,
        do_pad: Optional[bool] = False,
        size: Optional[dict[str, int]] = None,
        resample: PILImageResampling = None,
        data_format: Union[str, ChannelDimension] = None,
        input_data_format: Optional[Union[str, ChannelDimension]] = None,
    ) -> np.ndarray:
        """Preprocesses a single mask."""
        # Add channel dimension if missing - needed for certain transformations
        if segmentation_map.ndim == 2:
            added_channel_dim = True
            segmentation_map = segmentation_map[None, ...]
            input_data_format = ChannelDimension.FIRST
        else:
            added_channel_dim = False
            if input_data_format is None:
                input_data_format = infer_channel_dimension_format(segmentation_map)
        if do_resize:
            segmentation_map = self.resize(
                segmentation_map,
                size=size,
                resample=resample,
                data_format=data_format,
            )
        if do_pad:
            segmentation_map = self._pad(segmentation_map, size)
        # Remove extra channel dimension if added for processing
        if added_channel_dim:
            segmentation_map = segmentation_map.squeeze(0)
        return torch.from_numpy(segmentation_map)
    @filter_out_non_signature_kwargs()
    def preprocess(
        self,
        images: ImageInput,
        segmentation_maps: Optional[Union[list[dict[int, int]], dict[int, int]]] = None,
        instance_id_to_semantic_id: Optional[dict[int, int]] = None,
        do_split_image: Optional[bool] = None,
        do_resize: Optional[bool] = None,
        size: Optional[dict[str, int]] = None,
        resample: PILImageResampling = None,
        do_rescale: Optional[bool] = None,
        rescale_factor: Optional[float] = None,
        do_normalize: Optional[bool] = None,
        do_pad: Optional[bool] = None,
        image_mean: Optional[Union[float, list[float]]] = None,
        image_std: Optional[Union[float, list[float]]] = None,
        ignore_index: Optional[int] = None,
        return_tensors: Optional[Union[str, TensorType]] = None,
        data_format: Union[str, ChannelDimension] = ChannelDimension.FIRST,
        input_data_format: Optional[Union[str, ChannelDimension]] = None,
    ) -> BatchFeature:
        """
        Preprocesses images or a batch of images.
        Args:
            images (`ImageInput`):
                Image or batch of images to preprocess.
            segmentation_maps (`ImageInput`, *optional*):
                The corresponding semantic segmentation maps with the pixel-wise annotations.
            instance_id_to_semantic_id (`List[Dict[int, int]]` or `Dict[int, int]`, *optional*):
                A mapping between object instance ids and class ids.
            do_split_image (`bool`, *optional*, defaults to `self.do_split_image`):
                Whether to split the input images into overlapping patches for semantic segmentation.
            do_resize (`bool`, *optional*, defaults to `self.do_resize`):
                Whether to resize the input images.
            size (`Dict[str, int]`, *optional*, defaults to `self.size`):
                Target size as a dictionary with `"shortest_edge"` and `"longest_edge"` keys.
            resample (`PILImageResampling`, *optional*, defaults to `self.resample`):
                Resampling filter to use when resizing.
            do_rescale (`bool`, *optional*, defaults to `self.do_rescale`):
                Whether to rescale the input images by `rescale_factor`.
            rescale_factor (`float`, *optional*, defaults to `self.rescale_factor`):
                Factor to scale image pixel values.
            do_normalize (`bool`, *optional*, defaults to `self.do_normalize`):
                Whether to normalize the input images.
            do_pad (`bool`, *optional*, defaults to `False`):
                Whether to pad the image. If `True`, will pad the patch dimension of the images in the batch to the largest
                number of patches in the batch. Padding will be applied to the bottom and right with zeros.
            image_mean (`float` or `List[float]`, *optional*, defaults to `self.image_mean`):
                Mean for normalization. Single value or list for each channel.
            image_std (`float` or `List[float]`, *optional*, defaults to `self.image_std`):
                Standard deviation for normalization. Single value or list for each channel.
            ignore_index (`int`, *optional*):
                Label to be assigned to background pixels in segmentation maps. If provided, segmentation map pixels
                denoted with 0 (background) will be replaced with `ignore_index`.
            return_tensors (`str` or `TensorType`, *optional*):
                The type of tensors to return. Can be `"pt"`, `"tf"`, `"np"`, or `"jax"`.
            data_format (`ChannelDimension` or `str`, *optional*, defaults to `ChannelDimension.FIRST`):
                Channel format of the output image. Either `"channels_first"` or `"channels_last"`.
            input_data_format (`ChannelDimension` or `str`, *optional*):
                Channel format of the input image.
        """
        do_split_image = do_split_image if do_split_image is not None else self.do_split_image
        do_resize = do_resize if do_resize is not None else self.do_resize
        size = size if size is not None else self.size
        size = get_size_dict(size, default_to_square=False)
        resample = resample if resample is not None else self.resample
        do_rescale = do_rescale if do_rescale is not None else self.do_rescale
        rescale_factor = rescale_factor if rescale_factor is not None else self.rescale_factor
        do_normalize = do_normalize if do_normalize is not None else self.do_normalize
        do_pad = do_pad if do_pad is not None else self.do_pad
        image_mean = image_mean if image_mean is not None else self.image_mean
        image_std = image_std if image_std is not None else self.image_std
        ignore_index = ignore_index if ignore_index is not None else self.ignore_index
        images = make_flat_list_of_images(images)
        if not valid_images(images):
            raise ValueError(
                "Invalid image type. Must be of type PIL.Image.Image, numpy.ndarray, "
                "torch.Tensor, tf.Tensor or jax.ndarray."
            )
        validate_preprocess_arguments(
            do_rescale=do_rescale,
            rescale_factor=rescale_factor,
            do_normalize=do_normalize,
            image_mean=image_mean,
            image_std=image_std,
            do_resize=do_resize,
            size=size,
            resample=resample,
        )
        pixel_values_list, patch_offsets = self._preprocess_images(
            images=images,
            do_resize=do_resize,
            size=size,
            resample=resample,
            do_split_image=do_split_image,
            do_pad=do_pad,
            do_rescale=do_rescale,
            rescale_factor=rescale_factor,
            do_normalize=do_normalize,
            image_mean=image_mean,
            image_std=image_std,
            data_format=data_format,
            input_data_format=input_data_format,
        )
        if segmentation_maps is not None:
            segmentation_maps = make_list_of_images(segmentation_maps, expected_ndims=2)
            segmentation_maps = [to_numpy_array(mask) for mask in segmentation_maps]
            segmentation_maps = [
                self._preprocess_mask(
                    segmentation_map,
                    do_resize=do_resize,
                    do_pad=do_pad,
                    size=size,
                    resample=PILImageResampling.NEAREST,
                    data_format=data_format,
                    input_data_format=input_data_format,
                )
                for segmentation_map in segmentation_maps
            ]
        encoded_inputs = self.encode_inputs(
            pixel_values_list,
            segmentation_maps,
            instance_id_to_semantic_id,
            ignore_index,
            return_tensors,
            input_data_format=data_format,
        )
        if do_split_image and patch_offsets:
            encoded_inputs["patch_offsets"] = patch_offsets
        return encoded_inputs
    def encode_inputs(
        self,
        pixel_values_list: list[ImageInput],
        segmentation_maps: ImageInput = None,
        instance_id_to_semantic_id: Optional[Union[list[dict[int, int]], dict[int, int]]] = None,
        ignore_index: Optional[int] = None,
        return_tensors: Optional[Union[str, TensorType]] = None,
        input_data_format: Optional[Union[str, ChannelDimension]] = None,
    ):
        """
        Pad images up to the largest image in a batch and create a corresponding `pixel_mask`.
        EoMT addresses semantic segmentation with a mask classification paradigm, thus input segmentation maps
        will be converted to lists of binary masks and their respective labels. Let's see an example, assuming
        `segmentation_maps = [[2,6,7,9]]`, the output will contain `mask_labels =
        [[1,0,0,0],[0,1,0,0],[0,0,1,0],[0,0,0,1]]` (four binary masks) and `class_labels = [2,6,7,9]`, the labels for
        each mask.
        Args:
            pixel_values_list (`List[ImageInput]`):
                List of images (pixel values) to be padded. Each image should be a tensor of shape `(channels, height,
                width)`.
            segmentation_maps (`ImageInput`, *optional*):
                The corresponding semantic segmentation maps with the pixel-wise annotations.
             (`bool`, *optional*, defaults to `True`):
                Whether or not to pad images up to the largest image in a batch and create a pixel mask.
                If left to the default, will return a pixel mask that is:
                - 1 for pixels that are real (i.e. **not masked**),
                - 0 for pixels that are padding (i.e. **masked**).
            instance_id_to_semantic_id (`List[Dict[int, int]]` or `Dict[int, int]`, *optional*):
                A mapping between object instance ids and class ids. If passed, `segmentation_maps` is treated as an
                instance segmentation map where each pixel represents an instance id. Can be provided as a single
                dictionary with a global/dataset-level mapping or as a list of dictionaries (one per image), to map
                instance ids in each image separately.
            return_tensors (`str` or [`~file_utils.TensorType`], *optional*):
                If set, will return tensors instead of NumPy arrays. If set to `'pt'`, return PyTorch `torch.Tensor`
                objects.
            input_data_format (`ChannelDimension` or `str`, *optional*):
                The channel dimension format of the input image. If not provided, it will be inferred.
        Returns:
            [`BatchFeature`]: A [`BatchFeature`] with the following fields:
            - **pixel_values** -- Pixel values to be fed to a model.
            - **mask_labels** -- Optional list of mask labels of shape `(labels, height, width)` to be fed to a model
              (when `annotations` are provided).
            - **class_labels** -- Optional list of class labels of shape `(labels)` to be fed to a model (when
              `annotations` are provided). They identify the labels of `mask_labels`, e.g. the label of
              `mask_labels[i][j]` if `class_labels[i][j]`.
        """
        ignore_index = self.ignore_index if ignore_index is None else ignore_index
        pixel_values_list = [to_numpy_array(pixel_values) for pixel_values in pixel_values_list]
        if input_data_format is None:
            input_data_format = infer_channel_dimension_format(pixel_values_list[0])
        encoded_inputs = BatchFeature({"pixel_values": pixel_values_list}, tensor_type=return_tensors)
        if segmentation_maps is not None:
            mask_labels = []
            class_labels = []
            # Convert to list of binary masks and labels
            for idx, segmentation_map in enumerate(segmentation_maps):
                segmentation_map = to_numpy_array(segmentation_map)
                if isinstance(instance_id_to_semantic_id, list):
                    instance_id = instance_id_to_semantic_id[idx]
                else:
                    instance_id = instance_id_to_semantic_id
                # Use instance2class_id mapping per image
                masks, classes = convert_segmentation_map_to_binary_masks(
                    segmentation_map,
                    instance_id,
                    ignore_index=ignore_index,
                )
                mask_labels.append(torch.from_numpy(masks))
                class_labels.append(torch.from_numpy(classes))
            # we cannot batch them since they don't share a common class size
            encoded_inputs["mask_labels"] = mask_labels
            encoded_inputs["class_labels"] = class_labels
        return encoded_inputs
    def merge_image_patches(
        self,
        segmentation_logits: torch.Tensor,
        patch_offsets: list[tuple[int, int, int]],
        original_image_sizes: list[tuple[int, int]],
        size: dict[str, int],
    ) -> list[torch.Tensor]:
        """
        Reconstructs full-size semantic segmentation logits from patch predictions.
        Args:
            segmentation_logits (`torch.Tensor`):
                A tensor of shape `(num_patches, num_classes, patch_height, patch_width)` representing predicted logits
                for each image patch.
            patch_offsets (`List[Tuple[int, int, int]]`):
                A list of tuples where each tuple contains:
                - `image_index` (int): Index of the original image this patch belongs to.
                - `start` (int): Start pixel index of the patch along the long dimension (height or width).
                - `end` (int): End pixel index of the patch along the long dimension.
            original_image_sizes (`List[Tuple[int, int]]`):
                List of original (height, width) dimensions for each image before preprocessing.
            size (`Dict[str, int]`):
                A size dict which was used to resize.
        """
        num_classes = segmentation_logits.shape[1]
        aggregated_logits = []
        patch_counts = []
        for image_size in original_image_sizes:
            height, width = get_size_with_aspect_ratio(image_size, size["shortest_edge"], size["longest_edge"])
            aggregated_logits.append(torch.zeros((num_classes, height, width), device=segmentation_logits.device))
            patch_counts.append(torch.zeros((num_classes, height, width), device=segmentation_logits.device))
        # Stitch patches back into full-sized logit maps
        for patch_idx, (image_idx, patch_start, patch_end) in enumerate(patch_offsets):
            if original_image_sizes[image_idx][0] > original_image_sizes[image_idx][1]:
                aggregated_logits[image_idx][:, patch_start:patch_end, :] += segmentation_logits[patch_idx]
                patch_counts[image_idx][:, patch_start:patch_end, :] += 1
            else:
                aggregated_logits[image_idx][:, :, patch_start:patch_end] += segmentation_logits[patch_idx]
                patch_counts[image_idx][:, :, patch_start:patch_end] += 1
        # Normalize and resize logits to original image size
        reconstructed_logits = []
        for idx, (logit_sum, count) in enumerate(zip(aggregated_logits, patch_counts)):
            averaged_logits = logit_sum / count.clamp(min=1)
            resized_logits = F.interpolate(
                averaged_logits[None, ...],
                size=original_image_sizes[idx],
                mode="bilinear",
                align_corners=False,
            )[0]
            reconstructed_logits.append(resized_logits)
        return reconstructed_logits
    def unpad_image(
        self,
        segmentation_logits: torch.Tensor,
        original_image_sizes: list[tuple[int, int]],
        size: dict[str, int],
    ) -> list[torch.Tensor]:
        """Restores panoptic segmentation logits to their original image resolutions."""
        resized_logits = []
        for idx, original_size in enumerate(original_image_sizes):
            target_height, target_width = get_size_with_aspect_ratio(
                original_size, size["shortest_edge"], size["longest_edge"]
            )
            cropped_logits = segmentation_logits[idx][:, :target_height, :target_width]
            upsampled_logits = F.interpolate(
                cropped_logits[None, ...], size=original_size, mode="bilinear", align_corners=False
            )[0]
            resized_logits.append(upsampled_logits)
        return resized_logits
    def post_process_semantic_segmentation(
        self,
        outputs,
        patch_offsets: list[tuple[int, int, int]],
        original_image_sizes: list[tuple[int, int]],
        size: Optional[dict[str, int]] = None,
    ) -> np.ndarray:
        """Post-processes model outputs into final semantic segmentation prediction."""
        size = size if size is not None else self.size
        masks_queries_logits = outputs.masks_queries_logits  # [batch_size, num_queries, height, width]
        class_queries_logits = outputs.class_queries_logits  # [batch_size, num_queries, num_classes+1]
        output_size = get_target_size(size)
        masks_queries_logits = F.interpolate(
            masks_queries_logits,
            size=output_size,
            mode="bilinear",
        )
        # Remove the null class `[..., :-1]`
        masks_classes = class_queries_logits.softmax(dim=-1)[..., :-1]
        masks_probs = masks_queries_logits.sigmoid()  # [batch_size, num_queries, height, width]
        segmentation_logits = torch.einsum("bqc, bqhw -> bchw", masks_classes, masks_probs)
        output_logits = self.merge_image_patches(segmentation_logits, patch_offsets, original_image_sizes, size)
        preds = torch.stack(output_logits).argmax(dim=1)
        return preds
    def post_process_panoptic_segmentation(
        self,
        outputs,
        original_image_sizes: list[tuple[int, int]],
        threshold: float = 0.8,
        mask_threshold: float = 0.5,
        overlap_mask_area_threshold: float = 0.8,
        stuff_classes: Optional[list[int]] = None,
        size: Optional[dict[str, int]] = None,
    ):
        """Post-processes model outputs into final panoptic segmentation prediction."""
        size = size if size is not None else self.size
        masks_queries_logits = outputs.masks_queries_logits  # [batch_size, num_queries, height, width]
        class_queries_logits = outputs.class_queries_logits  # [batch_size, num_queries, num_classes+1]
        batch_size = class_queries_logits.shape[0]
        num_labels = class_queries_logits.shape[-1] - 1
        output_size = get_target_size(size)
        masks_queries_logits = F.interpolate(
            masks_queries_logits,
            size=output_size,
            mode="bilinear",
        )
        mask_probs_batch = self.unpad_image(masks_queries_logits, original_image_sizes, size)
        pred_scores_batch, pred_labels_batch = class_queries_logits.softmax(dim=-1).max(-1)
        results: list = []
        for i in range(batch_size):
            mask_probs, pred_scores, pred_labels = remove_low_and_no_objects(
                mask_probs_batch[i], pred_scores_batch[i], pred_labels_batch[i], threshold, num_labels
            )
            # No mask found
            if mask_probs.shape[0] <= 0:
                height, width = original_image_sizes[i] if original_image_sizes is not None else mask_probs.shape[1:]
                segmentation = torch.zeros((height, width)) - 1
                results.append({"segmentation": segmentation, "segments_info": []})
                continue
            segmentation, segments = compute_segments(
                mask_probs=mask_probs,
                pred_scores=pred_scores,
                pred_labels=pred_labels,
                stuff_classes=stuff_classes,
                mask_threshold=mask_threshold,
                overlap_mask_area_threshold=overlap_mask_area_threshold,
                target_size=original_image_sizes[i] if original_image_sizes is not None else None,
            )
            results.append({"segmentation": segmentation, "segments_info": segments})
        return results
    def post_process_instance_segmentation(
        self,
        outputs,
        original_image_sizes: list[tuple[int, int]],
        threshold: float = 0.5,
        size: Optional[dict[str, int]] = None,
    ):
        """Post-processes model outputs into Instance Segmentation Predictions."""
        size = size if size is not None else self.size
        class_queries_logits = outputs.class_queries_logits
        masks_queries_logits = outputs.masks_queries_logits
        output_size = get_target_size(size)
        masks_queries_logits = F.interpolate(
            masks_queries_logits,
            size=output_size,
            mode="bilinear",
        )
        mask_probs_batch = self.unpad_image(masks_queries_logits, original_image_sizes, size)
        device = masks_queries_logits.device
        batch_size = class_queries_logits.shape[0]
        num_queries = class_queries_logits.shape[-2]
        results = []
        for i in range(batch_size):
            mask_pred = mask_probs_batch[i]
            mask_class = class_queries_logits[i]
            # Remove the null class `[..., :-1]`
            scores, pred_classes = mask_class.softmax(dim=-1)[..., :-1].max(-1)
            pred_masks = (mask_pred > 0).float()
            # Calculate average mask prob
            mask_scores = (mask_pred.sigmoid().flatten(1) * pred_masks.flatten(1)).sum(1) / (
                pred_masks.flatten(1).sum(1) + 1e-6
            )
            pred_scores = scores * mask_scores
            segmentation = torch.zeros(original_image_sizes[i], device=device) - 1
            instance_maps, segments = [], []
            current_segment_id = 0
            for j in range(num_queries):
                score = pred_scores[j].item()
                if not torch.all(pred_masks[j] == 0) and score >= threshold:
                    segmentation[pred_masks[j] == 1] = current_segment_id
                    segments.append(
                        {
                            "id": current_segment_id,
                            "label_id": pred_classes[j].item(),
                            "score": round(score, 6),
                        }
                    )
                    current_segment_id += 1
                    instance_maps.append(pred_masks[j])
            results.append({"segmentation": segmentation, "segments_info": segments})
        return results
 __all__ = ["EomtImageProcessor"]
--- a/src/transformers/models/eomt/image_processing_eomt_fast.py
+++ b/src/transformers/models/eomt/image_processing_eomt_fast.py
@@ -0,0 +1,580 @@
 # coding=utf-8
 # Copyright 2025 Mobile Perception Systems Lab at TU/e and The HuggingFace Inc. team. All rights reserved.
 #
 # Licensed under the Apache License, Version 2.0 (the "License");
 # you may not use this file except in compliance with the License.
 # You may obtain a copy of the License at
 #
 #     http://www.apache.org/licenses/LICENSE-2.0
 #
 # Unless required by applicable law or agreed to in writing, software
 # distributed under the License is distributed on an "AS IS" BASIS,
 # WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
 # See the License for the specific language governing permissions and
 # limitations under the License.
 """Fast Image processor class for EoMT."""
 import math
 from typing import Optional, Union
 import numpy as np
 from ...image_processing_utils import BatchFeature
 from ...image_processing_utils_fast import (
    BaseImageProcessorFast,
    DefaultFastImageProcessorKwargs,
    group_images_by_shape,
    reorder_images,
 )
 from ...image_utils import (
    IMAGENET_DEFAULT_MEAN,
    IMAGENET_DEFAULT_STD,
    ChannelDimension,
    ImageInput,
    PILImageResampling,
    SizeDict,
    make_list_of_images,
    pil_torch_interpolation_mapping,
    validate_kwargs,
 )
 from ...processing_utils import Unpack
 from ...utils import (
    TensorType,
    auto_docstring,
    is_torch_available,
    is_torchvision_available,
    is_torchvision_v2_available,
 )
 from .image_processing_eomt import (
    compute_segments,
    convert_segmentation_map_to_binary_masks,
    get_size_with_aspect_ratio,
    remove_low_and_no_objects,
 )
 if is_torch_available():
    import torch
 if is_torchvision_available():
    if is_torchvision_v2_available():
        from torchvision.transforms.v2 import functional as F
    else:
        from torchvision.transforms import functional as F
 class EomtImageProcessorFastKwargs(DefaultFastImageProcessorKwargs):
    """
    do_split_image (`bool`, *optional*, defaults to `False`):
            Whether to split the input images into overlapping patches for semantic segmentation. If set to `True`, the
            input images will be split into patches of size `size["shortest_edge"]` with an overlap between patches.
            Otherwise, the input images will be padded to the target size.
    do_pad (`bool`, *optional*, defaults to `False`):
            Whether to pad the image. If `True`, will pad the patch dimension of the images in the batch to the largest
            number of patches in the batch. Padding will be applied to the bottom and right with zeros.
    ignore_index (`int`, *optional*):
            Label to be assigned to background pixels in segmentation maps. If provided, segmentation map pixels
            denoted with 0 (background) will be replaced with `ignore_index`.
    """
    do_split_image: bool
    do_pad: bool
    ignore_index: Optional[int] = None
 def get_target_size(size_dict: dict[str, int]) -> tuple[int, int]:
    """Returns the height and width from a size dict."""
    target_height = size_dict["shortest_edge"]
    target_width = size_dict["longest_edge"] or target_height
    return target_height, target_width
 def reorder_patches_and_offsets(
    patches: list[torch.Tensor], offsets: list[list[int]]
 ) -> tuple[list[torch.Tensor], list[list[int]]]:
    """Sorts patches and offsets according to the original image index."""
    combined = list(zip(offsets, patches))
    combined.sort(key=lambda x: x[0][0])
    sorted_offsets, sorted_patches = zip(*combined)
    return list(sorted_patches), list(sorted_offsets)
@auto_docstring
 class EomtImageProcessorFast(BaseImageProcessorFast):
    resample = PILImageResampling.BILINEAR
    image_mean = IMAGENET_DEFAULT_MEAN
    image_std = IMAGENET_DEFAULT_STD
    size = {"shortest_edge": 640, "longest_edge": 640}
    default_to_square = False
    do_resize = True
    do_rescale = True
    do_normalize = True
    do_split_image = False
    do_pad = False
    ignore_index = None
    valid_kwargs = EomtImageProcessorFastKwargs
    def __init__(self, **kwargs: Unpack[EomtImageProcessorFastKwargs]):
        super().__init__(**kwargs)
    def _split_image(self, images: torch.Tensor, size: dict, image_indices: int) -> tuple[list, list]:
        """Slices an image into overlapping patches for semantic segmentation."""
        patches, patch_offsets = [], []
        _, _, height, width = images.shape
        patch_size = size["shortest_edge"]
        longer_side = max(height, width)
        num_patches = math.ceil(longer_side / patch_size)
        total_overlap = num_patches * patch_size - longer_side
        overlap_per_patch = total_overlap / (num_patches - 1) if num_patches > 1 else 0
        for i in range(num_patches):
            start = int(i * (patch_size - overlap_per_patch))
            end = start + patch_size
            if height > width:
                batch_patch = images[:, :, start:end, :]
            else:
                batch_patch = images[:, :, :, start:end]
            for batch_idx, single in enumerate(torch.unbind(batch_patch, dim=0)):
                patches.append(single)
                patch_offsets.append([image_indices[batch_idx], start, end])
        return patches, patch_offsets
    def _pad(self, images: torch.Tensor, size: dict) -> torch.Tensor:
        """Pads the image to the target size using zero padding."""
        _, _, height, width = images.shape
        target_height, target_width = get_target_size(size)
        pad_h = max(0, target_height - height)
        pad_w = max(0, target_width - width)
        padding = (0, pad_w, 0, pad_h)
        padded_images = torch.nn.functional.pad(images, padding, mode="constant", value=0.0)
        return padded_images
    def _preprocess(
        self,
        images: list["torch.Tensor"],
        do_resize: bool,
        size: SizeDict,
        interpolation: Optional["F.InterpolationMode"],
        do_rescale: bool,
        rescale_factor: float,
        do_normalize: bool,
        do_split_image: bool,
        do_pad: bool,
        image_mean: Optional[Union[float, list[float]]],
        image_std: Optional[Union[float, list[float]]],
        disable_grouping: Optional[bool],
        return_tensors: Optional[Union[str, TensorType]],
        **kwargs,
    ):
        """Preprocesses the input images and masks if provided."""
        processed_images, patch_offsets = [], []
        grouped_images, grouped_images_index = group_images_by_shape(images, disable_grouping=disable_grouping)
        resized_images_grouped = {}
        for shape, stacked_images in grouped_images.items():
            if do_resize:
                stacked_images = self.resize(image=stacked_images, size=size, interpolation=interpolation)
                resized_images_grouped[shape] = stacked_images
        images = reorder_images(resized_images_grouped, grouped_images_index)
        # Group images by size for batched resizing, Needed in case do_resize is False.
        grouped_images, grouped_images_index = group_images_by_shape(images, disable_grouping=disable_grouping)
        processed_images_grouped = {}
        for shape, stacked_images in grouped_images.items():
            original_indices = [
                original_idx for original_idx, (img_shape, _) in grouped_images_index.items() if img_shape == shape
            ]
            if do_split_image:
                patches, offsets = self._split_image(stacked_images, size, original_indices)
                processed_images.extend(patches)
                patch_offsets.extend(offsets)
            if do_pad:
                stacked_images = self._pad(stacked_images, size)
                processed_images_grouped[shape] = stacked_images
        if do_split_image:
            images, patch_offsets = reorder_patches_and_offsets(processed_images, patch_offsets)
        if do_pad:
            images = reorder_images(processed_images_grouped, grouped_images_index)
        grouped_images, grouped_images_index = group_images_by_shape(images, disable_grouping=disable_grouping)
        processed_images_grouped = {}
        for shape, stacked_images in grouped_images.items():
            stacked_images = self.rescale_and_normalize(
                stacked_images, do_rescale, rescale_factor, do_normalize, image_mean, image_std
            )
            processed_images_grouped[shape] = stacked_images
        images = reorder_images(processed_images_grouped, grouped_images_index)
        processed_images = torch.stack(images, dim=0) if return_tensors else images
        return processed_images, patch_offsets
    def _preprocess_images(self, images, **kwargs):
        """Preprocesses the input images."""
        return self._preprocess(images, **kwargs)
    def _preprocess_masks(self, segmentation_maps: list[torch.Tensor], **kwargs):
        """Preprocesses segmentation maps."""
        processed_segmentation_maps = []
        for segmentation_map in segmentation_maps:
            segmentation_map = self._process_image(
                segmentation_map, do_convert_rgb=False, input_data_format=ChannelDimension.FIRST
            )
            if segmentation_map.ndim == 2:
                segmentation_map = segmentation_map[None, ...]
            processed_segmentation_maps.append(segmentation_map)
        kwargs["do_normalize"] = False
        kwargs["do_rescale"] = False
        kwargs["input_data_format"] = ChannelDimension.FIRST
        # Nearest interpolation is used for segmentation maps instead of BILINEAR.
        kwargs["interpolation"] = pil_torch_interpolation_mapping[PILImageResampling.NEAREST]
        processed_segmentation_maps, _ = self._preprocess(images=processed_segmentation_maps, **kwargs)
        processed_segmentation_maps = processed_segmentation_maps.squeeze(1)
        processed_segmentation_maps = processed_segmentation_maps.to(torch.int64)
        return processed_segmentation_maps
    @auto_docstring
    def preprocess(
        self,
        images: ImageInput,
        segmentation_maps: Optional[list[torch.Tensor]] = None,
        instance_id_to_semantic_id: Optional[dict[int, int]] = None,
        **kwargs: Unpack[EomtImageProcessorFastKwargs],
    ) -> BatchFeature:
        r"""
        segmentation_maps (`ImageInput`, *optional*):
            The segmentation maps to preprocess for corresponding images.
        instance_id_to_semantic_id (`List[Dict[int, int]]` or `Dict[int, int]`, *optional*):
            A mapping between object instance ids and class ids.
        """
        # args are not validated, but their order in the `preprocess` and `_preprocess` signatures must be the same
        validate_kwargs(captured_kwargs=kwargs.keys(), valid_processor_keys=self._valid_kwargs_names)
        # Set default kwargs from self. This ensures that if a kwarg is not provided
        # by the user, it gets its default value from the instance, or is set to None.
        for kwarg_name in self._valid_kwargs_names:
            kwargs.setdefault(kwarg_name, getattr(self, kwarg_name, None))
        # Extract parameters that are only used for preparing the input images
        do_convert_rgb = kwargs.pop("do_convert_rgb")
        input_data_format = kwargs.pop("input_data_format")
        device = kwargs.pop("device")
        # Prepare input images
        images = self._prepare_input_images(
            images=images, do_convert_rgb=do_convert_rgb, input_data_format=input_data_format, device=device
        )
        # Prepare segmentation maps
        if segmentation_maps is not None:
            segmentation_maps = make_list_of_images(images=segmentation_maps, expected_ndims=2)
        # Update kwargs that need further processing before being validated
        kwargs = self._further_process_kwargs(**kwargs)
        # Validate kwargs
        self._validate_preprocess_kwargs(**kwargs)
        # torch resize uses interpolation instead of resample
        resample = kwargs.pop("resample")
        # Check if resample is an int before checking if it's an instance of PILImageResampling
        # because if pillow < 9.1.0, resample is an int and PILImageResampling is a module.
        # Checking PILImageResampling will fail with error `TypeError: isinstance() arg 2 must be a type or tuple of types`.
        kwargs["interpolation"] = (
            pil_torch_interpolation_mapping[resample] if isinstance(resample, (int, PILImageResampling)) else resample
        )
        # Pop kwargs that are not needed in _preprocess
        kwargs.pop("default_to_square")
        kwargs.pop("data_format")
        ignore_index = kwargs.pop("ignore_index", None)
        processed_images, patch_offsets = self._preprocess_images(images=images, **kwargs)
        outputs = BatchFeature({"pixel_values": processed_images})
        mask_labels, class_labels = [], []
        if segmentation_maps is not None:
            segmentation_maps = self._preprocess_masks(segmentation_maps=segmentation_maps, **kwargs)
            # Convert to list of binary masks and labels
            for idx, segmentation_map in enumerate(segmentation_maps):
                if isinstance(instance_id_to_semantic_id, list):
                    instance_id = instance_id_to_semantic_id[idx]
                else:
                    instance_id = instance_id_to_semantic_id
                # Use instance2class_id mapping per image
                masks, classes = convert_segmentation_map_to_binary_masks(
                    segmentation_map,
                    instance_id,
                    ignore_index=ignore_index,
                )
                mask_labels.append(torch.from_numpy(masks))
                class_labels.append(torch.from_numpy(classes))
            # we cannot batch them since they don't share a common class size
            outputs["mask_labels"] = mask_labels
            outputs["class_labels"] = class_labels
        if patch_offsets:
            outputs["patch_offsets"] = patch_offsets
        return outputs
    def merge_image_patches(
        self,
        segmentation_logits: torch.Tensor,
        patch_offsets: list[tuple[int, int, int]],
        original_image_sizes: list[tuple[int, int]],
        size: dict[str, int],
    ) -> list[torch.Tensor]:
        """
        Reconstructs full-size semantic segmentation logits from patch predictions.
        Args:
            segmentation_logits (`torch.Tensor`):
                A tensor of shape `(num_patches, num_classes, patch_height, patch_width)` representing predicted logits
                for each image patch.
            patch_offsets (`List[Tuple[int, int, int]]`):
                A list of tuples where each tuple contains:
                - `image_index` (int): Index of the original image this patch belongs to.
                - `start` (int): Start pixel index of the patch along the long dimension (height or width).
                - `end` (int): End pixel index of the patch along the long dimension.
            original_image_sizes (`List[Tuple[int, int]]`):
                List of original (height, width) dimensions for each image before preprocessing.
            size (`Dict[str, int]`):
                A size dict which was used to resize.
        """
        num_classes = segmentation_logits.shape[1]
        aggregated_logits = []
        patch_counts = []
        for image_size in original_image_sizes:
            height, width = get_size_with_aspect_ratio(image_size, size["shortest_edge"], size["longest_edge"])
            aggregated_logits.append(torch.zeros((num_classes, height, width), device=segmentation_logits.device))
            patch_counts.append(torch.zeros((num_classes, height, width), device=segmentation_logits.device))
        # Stitch patches back into full-sized logit maps
        for patch_idx, (image_idx, patch_start, patch_end) in enumerate(patch_offsets):
            if original_image_sizes[image_idx][0] > original_image_sizes[image_idx][1]:
                aggregated_logits[image_idx][:, patch_start:patch_end, :] += segmentation_logits[patch_idx]
                patch_counts[image_idx][:, patch_start:patch_end, :] += 1
            else:
                aggregated_logits[image_idx][:, :, patch_start:patch_end] += segmentation_logits[patch_idx]
                patch_counts[image_idx][:, :, patch_start:patch_end] += 1
        # Normalize and resize logits to original image size
        reconstructed_logits = []
        for idx, (logit_sum, count) in enumerate(zip(aggregated_logits, patch_counts)):
            averaged_logits = logit_sum / count.clamp(min=1)
            resized_logits = torch.nn.functional.interpolate(
                averaged_logits[None, ...],
                size=original_image_sizes[idx],
                mode="bilinear",
                align_corners=False,
            )[0]
            reconstructed_logits.append(resized_logits)
        return reconstructed_logits
    def unpad_image(
        self,
        segmentation_logits: torch.Tensor,
        original_image_sizes: list[tuple[int, int]],
        size: dict[str, int],
    ) -> list[torch.Tensor]:
        """Restores panoptic segmentation logits to their original image resolutions."""
        resized_logits = []
        for idx, original_size in enumerate(original_image_sizes):
            target_height, target_width = get_size_with_aspect_ratio(
                original_size, size["shortest_edge"], size["longest_edge"]
            )
            cropped_logits = segmentation_logits[idx][:, :target_height, :target_width]
            upsampled_logits = torch.nn.functional.interpolate(
                cropped_logits[None, ...], size=original_size, mode="bilinear", align_corners=False
            )[0]
            resized_logits.append(upsampled_logits)
        return resized_logits
    def post_process_semantic_segmentation(
        self,
        outputs,
        patch_offsets: list[tuple[int, int, int]],
        original_image_sizes: list[tuple[int, int]],
        size: Optional[dict[str, int]] = None,
    ) -> np.ndarray:
        """Post-processes model outputs into final semantic segmentation prediction."""
        size = size if size is not None else self.size
        masks_queries_logits = outputs.masks_queries_logits  # [batch_size, num_queries, height, width]
        class_queries_logits = outputs.class_queries_logits  # [batch_size, num_queries, num_classes+1]
        output_size = get_target_size(size)
        masks_queries_logits = torch.nn.functional.interpolate(
            masks_queries_logits,
            size=output_size,
            mode="bilinear",
        )
        # Remove the null class `[..., :-1]`
        masks_classes = class_queries_logits.softmax(dim=-1)[..., :-1]
        masks_probs = masks_queries_logits.sigmoid()  # [batch_size, num_queries, height, width]
        segmentation_logits = torch.einsum("bqc, bqhw -> bchw", masks_classes, masks_probs)
        output_logits = self.merge_image_patches(segmentation_logits, patch_offsets, original_image_sizes, size)
        preds = torch.stack(output_logits).argmax(dim=1)
        return preds
    def post_process_panoptic_segmentation(
        self,
        outputs,
        original_image_sizes: list[tuple[int, int]],
        threshold: float = 0.8,
        mask_threshold: float = 0.5,
        overlap_mask_area_threshold: float = 0.8,
        stuff_classes: Optional[list[int]] = None,
        size: Optional[dict[str, int]] = None,
    ):
        """Post-processes model outputs into final panoptic segmentation prediction."""
        size = size if size is not None else self.size
        masks_queries_logits = outputs.masks_queries_logits  # [batch_size, num_queries, height, width]
        class_queries_logits = outputs.class_queries_logits  # [batch_size, num_queries, num_classes+1]
        batch_size = class_queries_logits.shape[0]
        num_labels = class_queries_logits.shape[-1] - 1
        output_size = get_target_size(size)
        masks_queries_logits = torch.nn.functional.interpolate(
            masks_queries_logits,
            size=output_size,
            mode="bilinear",
        )
        mask_probs_batch = self.unpad_image(masks_queries_logits, original_image_sizes, size)
        pred_scores_batch, pred_labels_batch = class_queries_logits.softmax(dim=-1).max(-1)
        results: list = []
        for i in range(batch_size):
            mask_probs, pred_scores, pred_labels = remove_low_and_no_objects(
                mask_probs_batch[i], pred_scores_batch[i], pred_labels_batch[i], threshold, num_labels
            )
            # No mask found
            if mask_probs.shape[0] <= 0:
                height, width = original_image_sizes[i] if original_image_sizes is not None else mask_probs.shape[1:]
                segmentation = torch.zeros((height, width)) - 1
                results.append({"segmentation": segmentation, "segments_info": []})
                continue
            segmentation, segments = compute_segments(
                mask_probs=mask_probs,
                pred_scores=pred_scores,
                pred_labels=pred_labels,
                stuff_classes=stuff_classes,
                mask_threshold=mask_threshold,
                overlap_mask_area_threshold=overlap_mask_area_threshold,
                target_size=original_image_sizes[i] if original_image_sizes is not None else None,
            )
            results.append({"segmentation": segmentation, "segments_info": segments})
        return results
    def post_process_instance_segmentation(
        self,
        outputs,
        original_image_sizes: list[tuple[int, int]],
        threshold: float = 0.8,
        size: Optional[dict[str, int]] = None,
    ):
        """Post-processes model outputs into Instance Segmentation Predictions."""
        size = size if size is not None else self.size
        masks_queries_logits = outputs.masks_queries_logits
        class_queries_logits = outputs.class_queries_logits
        output_size = get_target_size(size)
        masks_queries_logits = torch.nn.functional.interpolate(
            masks_queries_logits,
            size=output_size,
            mode="bilinear",
        )
        mask_probs_batch = self.unpad_image(masks_queries_logits, original_image_sizes, size)
        device = masks_queries_logits.device
        batch_size = class_queries_logits.shape[0]
        num_queries = class_queries_logits.shape[-2]
        results = []
        for i in range(batch_size):
            mask_pred = mask_probs_batch[i]
            mask_class = class_queries_logits[i]
            # Remove the null class `[..., :-1]`
            scores, pred_classes = mask_class.softmax(dim=-1)[..., :-1].max(-1)
            pred_masks = (mask_pred > 0).float()
            # Calculate average mask prob
            mask_scores = (mask_pred.sigmoid().flatten(1) * pred_masks.flatten(1)).sum(1) / (
                pred_masks.flatten(1).sum(1) + 1e-6
            )
            pred_scores = scores * mask_scores
            segmentation = torch.zeros(original_image_sizes[i], device=device) - 1
            instance_maps, segments = [], []
            current_segment_id = 0
            for j in range(num_queries):
                score = pred_scores[j].item()
                if not torch.all(pred_masks[j] == 0) and score >= threshold:
                    segmentation[pred_masks[j] == 1] = current_segment_id
                    segments.append(
                        {
                            "id": current_segment_id,
                            "label_id": pred_classes[j].item(),
                            "score": round(score, 6),
                        }
                    )
                    current_segment_id += 1
                    instance_maps.append(pred_masks[j])
            results.append({"segmentation": segmentation, "segments_info": segments})
        return results
 __all__ = ["EomtImageProcessorFast"]
--- a/src/transformers/models/eomt/modeling_eomt.py
+++ b/src/transformers/models/eomt/modeling_eomt.py
--- a/src/transformers/models/eomt/modular_eomt.py
+++ b/src/transformers/models/eomt/modular_eomt.py
@@ -0,0 +1,588 @@
 # coding=utf-8
 # Copyright 2025 Mobile Perception Systems Lab at TU/e and The HuggingFace Inc. team. All rights reserved.
 #
 # Licensed under the Apache License, Version 2.0 (the "License");
 # you may not use this file except in compliance with the License.
 # You may obtain a copy of the License at
 #
 #     http://www.apache.org/licenses/LICENSE-2.0
 #
 # Unless required by applicable law or agreed to in writing, software
 # distributed under the License is distributed on an "AS IS" BASIS,
 # WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
 # See the License for the specific language governing permissions and
 # limitations under the License.
 """PyTorch EoMT model."""
 import math
 from dataclasses import dataclass
 from typing import Optional
 import torch
 import torch.nn.functional as F
 from torch import Tensor, nn
 from ...activations import ACT2FN
 from ...file_utils import (
    ModelOutput,
 )
 from ...modeling_utils import PreTrainedModel
 from ...utils import (
    auto_docstring,
    can_return_tuple,
    logging,
 )
 from ..dinov2.modeling_dinov2 import (
    Dinov2Embeddings,
    Dinov2Layer,
    Dinov2LayerScale,
    Dinov2PatchEmbeddings,
 )
 from ..mask2former.modeling_mask2former import Mask2FormerForUniversalSegmentation, Mask2FormerLoss
 from ..siglip.modeling_siglip import SiglipAttention
 from ..vit.configuration_vit import ViTConfig
 logger = logging.get_logger(__name__)
 class EomtConfig(ViTConfig):
    r"""
    This is the configuration class to store the configuration of a [`EomtForUniversalSegmentation`]. It is used to instantiate an EoMT model
    according to the specified arguments, defining the model architecture. Instantiating a configuration with the
    defaults will yield a similar configuration to that of the EoMT
    [tue-mps/coco_panoptic_eomt_large_640](https://huggingface.co/tue-mps/coco_panoptic_eomt_large_640)
    architecture.
    Configuration objects inherit from [`PretrainedConfig`] and can be used to control the model outputs. Read the
    documentation from [`PretrainedConfig`] for more information.
    Args:
        hidden_size (`int`, *optional*, defaults to 1024):
            Dimensionality of the hidden representations.
        num_hidden_layers (`int`, *optional*, defaults to 24):
            Number of hidden layers in the Transformer encoder.
        num_attention_heads (`int`, *optional*, defaults to 16):
            Number of attention heads in each attention layer.
        mlp_ratio (`int`, *optional*, defaults to 4):
            Ratio of the MLP hidden dimensionality to the hidden size.
        hidden_act (`str` or `function`, *optional*, defaults to `"gelu"`):
            The non-linear activation function (function or string) in the encoder.
        hidden_dropout_prob (`float`, *optional*, defaults to 0.0):
            The dropout probability for all fully connected layers in the embeddings and encoder.
        initializer_range (`float`, *optional*, defaults to 0.02):
            The standard deviation of the truncated_normal_initializer for initializing all weight matrices.
        layer_norm_eps (`float`, *optional*, defaults to 1e-06):
            The epsilon used by the layer normalization layers.
        image_size (`int`, *optional*, defaults to 640):
            The size (resolution) of each input image.
        patch_size (`int`, *optional*, defaults to 16):
            The size (resolution) of each patch.
        num_channels (`int`, *optional*, defaults to 3):
            The number of input channels.
        layerscale_value (`float`, *optional*, defaults to 1.0):
            Initial value for the LayerScale parameter.
        drop_path_rate (`float`, *optional*, defaults to 0.0):
            The stochastic depth rate (drop path) used during training.
        num_upscale_blocks (`int`, *optional*, defaults to 2):
            Number of upsampling blocks used in the decoder or segmentation head.
        attention_dropout (`float`, *optional*, defaults to 0.0):
            Dropout probability applied after attention projection.
        use_swiglu_ffn (`bool`, *optional*, defaults to `False`):
            Whether to use the SwiGLU feedforward neural network.
        num_blocks (`int`, *optional*, defaults to 4):
            Number of feature blocks or stages in the architecture.
        no_object_weight (`float`, *optional*, defaults to 0.1):
            Loss weight for the 'no object' class in panoptic/instance segmentation.
        class_weight (`float`, *optional*, defaults to 2.0):
            Loss weight for classification targets.
        mask_weight (`float`, *optional*, defaults to 5.0):
            Loss weight for mask prediction.
        dice_weight (`float`, *optional*, defaults to 5.0):
            Loss weight for the dice loss component.
        train_num_points (`int`, *optional*, defaults to 12544):
            Number of points to sample for mask loss computation during training.
        oversample_ratio (`float`, *optional*, defaults to 3.0):
            Oversampling ratio used in point sampling for mask training.
        importance_sample_ratio (`float`, *optional*, defaults to 0.75):
            Ratio of points to sample based on importance during training.
        num_queries (`int`, *optional*, defaults to 200):
            Number of object queries in the Transformer.
        num_register_tokens (`int`, *optional*, defaults to 4):
            Number of learnable register tokens added to the transformer input.
    Example:
    ```python
    >>> from transformers import EomtConfig, EomtForUniversalSegmentation
    >>> # Initialize configuration
    >>> config = EomtConfig()
    >>> # Initialize model
    >>> model = EomtForUniversalSegmentation(config)
    >>> # Access config
    >>> config = model.config
    ```"""
    model_type = "eomt"
    def __init__(
        self,
        hidden_size=1024,
        num_hidden_layers=24,
        num_attention_heads=16,
        mlp_ratio=4,
        hidden_act="gelu",
        hidden_dropout_prob=0.0,
        initializer_range=0.02,
        layer_norm_eps=1e-6,
        image_size=640,
        patch_size=16,
        num_channels=3,
        layerscale_value=1.0,
        drop_path_rate=0.0,
        num_upscale_blocks=2,
        attention_dropout=0.0,
        use_swiglu_ffn=False,
        num_blocks=4,
        no_object_weight: float = 0.1,
        class_weight: float = 2.0,
        mask_weight: float = 5.0,
        dice_weight: float = 5.0,
        train_num_points: int = 12544,
        oversample_ratio: float = 3.0,
        importance_sample_ratio: float = 0.75,
        num_queries=200,
        num_register_tokens=4,
        **kwargs,
    ):
        super().__init__(
            hidden_size=hidden_size,
            num_hidden_layers=num_hidden_layers,
            num_attention_heads=num_attention_heads,
            hidden_dropout_prob=hidden_dropout_prob,
            hidden_act=hidden_act,
            initializer_range=initializer_range,
            layer_norm_eps=layer_norm_eps,
            image_size=image_size,
            patch_size=patch_size,
            num_channels=num_channels,
            **kwargs,
        )
        del self.intermediate_size
        del self.qkv_bias
        del self.pooler_act
        del self.pooler_output_size
        del self.encoder_stride
        del self.attention_probs_dropout_prob
        self.mlp_ratio = mlp_ratio
        self.attention_dropout = attention_dropout
        self.layerscale_value = layerscale_value
        self.drop_path_rate = drop_path_rate
        self.num_upscale_blocks = num_upscale_blocks
        self.use_swiglu_ffn = use_swiglu_ffn
        self.num_blocks = num_blocks
        self.no_object_weight = no_object_weight
        self.class_weight = class_weight
        self.mask_weight = mask_weight
        self.dice_weight = dice_weight
        self.train_num_points = train_num_points
        self.oversample_ratio = oversample_ratio
        self.importance_sample_ratio = importance_sample_ratio
        self.num_queries = num_queries
        self.num_register_tokens = num_register_tokens
@dataclass
@auto_docstring(
    custom_intro="""
    Class for outputs of [`EomtForUniversalSegmentationOutput`].
    This output can be directly passed to [`~EomtImageProcessor.post_process_semantic_segmentation`] or
    [`~EomtImageProcessor.post_process_instance_segmentation`] or
    [`~EomtImageProcessor.post_process_panoptic_segmentation`] to compute final segmentation maps. Please, see
    [`~EomtImageProcessor] for details regarding usage.
    """
 )
 class EomtForUniversalSegmentationOutput(ModelOutput):
    r"""
    loss (`torch.Tensor`, *optional*):
        The computed loss, returned when labels are present.
    class_queries_logits (`torch.FloatTensor`):
        A tensor of shape `(batch_size, num_queries, num_labels + 1)` representing the proposed classes for each
        query. Note the `+ 1` is needed because we incorporate the null class.
    masks_queries_logits (`torch.FloatTensor`):
        A tensor of shape `(batch_size, num_queries, height, width)` representing the proposed masks for each
        query.
    last_hidden_state (`torch.FloatTensor` of shape `(batch_size, num_channels, height, width)`):
        Last hidden states (final feature map) of the last layer.
    hidden_states (`tuple(torch.FloatTensor)`, *optional*, returned when `output_hidden_states=True` is passed or when `config.output_hidden_states=True`):
        Tuple of `torch.FloatTensor` (one for the output of the embeddings + one for the output of each stage) of
        shape `(batch_size, sequence_length, hidden_size)`. Hidden-states all layers of the model.
    attentions (`tuple(tuple(torch.FloatTensor))`, *optional*, returned when `output_attentions=True` is passed or when `config.output_attentions=True`):
        Tuple of `tuple(torch.FloatTensor)` (one for each layer) of shape `(batch_size, num_heads, sequence_length,
        sequence_length)`. Self and Cross Attentions weights from transformer decoder.
    """
    loss: Optional[torch.FloatTensor] = None
    class_queries_logits: Optional[torch.FloatTensor] = None
    masks_queries_logits: Optional[torch.FloatTensor] = None
    last_hidden_state: Optional[torch.FloatTensor] = None
    hidden_states: Optional[tuple[torch.FloatTensor]] = None
    attentions: Optional[tuple[torch.FloatTensor]] = None
 class EomtLoss(Mask2FormerLoss):
    pass
 class EomtPatchEmbeddings(Dinov2PatchEmbeddings):
    pass
 class EomtEmbeddings(Dinov2Embeddings, nn.Module):
    def __init__(self, config: EomtConfig) -> None:
        Dinov2Embeddings().__init__()
        self.config = config
        self.patch_size = config.patch_size
        self.cls_token = nn.Parameter(torch.randn(1, 1, config.hidden_size))
        self.register_tokens = nn.Parameter(torch.zeros(1, config.num_register_tokens, config.hidden_size))
        self.patch_embeddings = EomtPatchEmbeddings(config)
        num_patches = self.patch_embeddings.num_patches
        self.dropout = nn.Dropout(config.hidden_dropout_prob)
        self.num_prefix_tokens = 1 + config.num_register_tokens  # 1 for [CLS]
        self.position_embeddings = nn.Embedding(num_patches, config.hidden_size)
        self.register_buffer("position_ids", torch.arange(num_patches).expand((1, -1)), persistent=False)
    def interpolate_pos_encoding(self):
        raise AttributeError("Not needed for Eomt Model")
    def forward(self, pixel_values: torch.Tensor) -> torch.Tensor:
        batch_size, _, _, _ = pixel_values.shape
        target_dtype = self.patch_embeddings.projection.weight.dtype
        embeddings = self.patch_embeddings(pixel_values.to(dtype=target_dtype))
        cls_tokens = self.cls_token.expand(batch_size, -1, -1)
        register_tokens = self.register_tokens.expand(batch_size, -1, -1)
        embeddings = embeddings + self.position_embeddings(self.position_ids)
        embeddings = torch.cat([cls_tokens, register_tokens, embeddings], dim=1)
        embeddings = self.dropout(embeddings)
        return embeddings
 class EomtAttention(SiglipAttention):
    pass
 class EomtLayerScale(Dinov2LayerScale):
    pass
 class EomtLayer(Dinov2Layer):
    pass
 class EomtLayerNorm2d(nn.LayerNorm):
    def __init__(self, num_channels, eps=1e-6, affine=True):
        super().__init__(num_channels, eps=eps, elementwise_affine=affine)
    def forward(self, hidden_state: torch.Tensor) -> torch.Tensor:
        hidden_state = hidden_state.permute(0, 2, 3, 1)
        hidden_state = F.layer_norm(hidden_state, self.normalized_shape, self.weight, self.bias, self.eps)
        hidden_state = hidden_state.permute(0, 3, 1, 2)
        return hidden_state
 class EomtScaleLayer(nn.Module):
    def __init__(self, config: EomtConfig):
        super().__init__()
        hidden_size = config.hidden_size
        self.conv1 = nn.ConvTranspose2d(hidden_size, hidden_size, kernel_size=2, stride=2)
        self.activation = ACT2FN[config.hidden_act]
        self.conv2 = nn.Conv2d(
            hidden_size,
            hidden_size,
            kernel_size=3,
            padding=1,
            groups=hidden_size,
            bias=False,
        )
        self.layernorm2d = EomtLayerNorm2d(hidden_size)
    def forward(self, hidden_states: torch.tensor) -> torch.Tensor:
        hidden_states = self.conv1(hidden_states)
        hidden_states = self.activation(hidden_states)
        hidden_states = self.conv2(hidden_states)
        hidden_states = self.layernorm2d(hidden_states)
        return hidden_states
 class EomtScaleBlock(nn.Module):
    def __init__(self, config: EomtConfig):
        super().__init__()
        self.num_blocks = config.num_upscale_blocks
        self.block = nn.ModuleList([EomtScaleLayer(config) for _ in range(self.num_blocks)])
    def forward(self, hidden_states: torch.Tensor) -> torch.Tensor:
        for block in self.block:
            hidden_states = block(hidden_states)
        return hidden_states
 class EomtMaskHead(nn.Module):
    def __init__(self, config: EomtConfig):
        super().__init__()
        hidden_size = config.hidden_size
        self.fc1 = nn.Linear(hidden_size, hidden_size)
        self.fc2 = nn.Linear(hidden_size, hidden_size)
        self.fc3 = nn.Linear(hidden_size, hidden_size)
        self.activation = ACT2FN[config.hidden_act]
    def forward(self, hidden_states: torch.Tensor) -> torch.Tensor:
        hidden_states = self.activation(self.fc1(hidden_states))
        hidden_states = self.activation(self.fc2(hidden_states))
        hidden_states = self.fc3(hidden_states)
        return hidden_states
@auto_docstring
 class EomtPreTrainedModel(PreTrainedModel):
    """
    An abstract class to handle weights initialization and a simple interface for downloading and loading pretrained
    models.
    """
    config_class = EomtConfig
    base_model_prefix = "eomt"
    main_input_name = "pixel_values"
    supports_gradient_checkpointing = False
    _no_split_modules = ["EomtMLP"]
    _supports_sdpa = True
    _supports_flash_attn_2 = True
    def _init_weights(self, module: nn.Module) -> None:
        std = self.config.initializer_range
        if isinstance(module, (nn.Linear, nn.Conv2d, nn.ConvTranspose2d)):
            nn.init.kaiming_uniform_(module.weight, a=math.sqrt(5))
            if module.bias is not None:
                fan_in, _ = nn.init._calculate_fan_in_and_fan_out(module.weight)
                bound = 1 / math.sqrt(fan_in) if fan_in > 0 else 0
                nn.init.uniform_(module.bias, -bound, bound)
        elif isinstance(module, nn.LayerNorm):
            module.weight.data.fill_(1.0)
            module.bias.data.zero_()
        elif isinstance(module, nn.Embedding):
            module.weight.data.normal_(mean=0.0, std=1)
            if module.padding_idx is not None:
                module.weight.data[module.padding_idx].zero_()
        elif isinstance(module, EomtLayerScale):
            if hasattr(module, "lambda1"):
                module.lambda1.data.fill_(self.config.layerscale_value)
        elif isinstance(module, EomtEmbeddings):
            module.cls_token.data = nn.init.trunc_normal_(
                module.cls_token.data.to(torch.float32), mean=0.0, std=std
            ).to(module.cls_token.dtype)
            module.register_tokens.data.zero_()
@auto_docstring(
    custom_intro="""
    The EoMT Model with head on top for instance/semantic/panoptic segmentation.
    """
 )
 class EomtForUniversalSegmentation(Mask2FormerForUniversalSegmentation, nn.Module):
    def __init__(self, config: EomtConfig) -> None:
        nn.Module().__init__(config)
        self.config = config
        self.num_hidden_layers = config.num_hidden_layers
        self.embeddings = EomtEmbeddings(config)
        self.layernorm = nn.LayerNorm(config.hidden_size, eps=config.layer_norm_eps)
        self.query = nn.Embedding(config.num_queries, config.hidden_size)
        self.layers = nn.ModuleList([EomtLayer(config) for _ in range(config.num_hidden_layers)])
        self.upscale_block = EomtScaleBlock(config)
        self.mask_head = EomtMaskHead(config)
        self.class_predictor = nn.Linear(config.hidden_size, config.num_labels + 1)
        self.grid_size = (config.image_size // config.patch_size, config.image_size // config.patch_size)
        self.weight_dict: dict[str, float] = {
            "loss_cross_entropy": config.class_weight,
            "loss_mask": config.mask_weight,
            "loss_dice": config.dice_weight,
        }
        self.criterion = EomtLoss(config=config, weight_dict=self.weight_dict)
        self.register_buffer("attn_mask_probs", torch.ones(config.num_blocks))
        self.post_init()
    def get_input_embeddings(self):
        return self.embeddings.patch_embeddings
    def get_auxiliary_logits(self):
        raise AttributeError("Note needed for Eomt Model.")
    def predict(self, logits: torch.Tensor):
        query_tokens = logits[:, : self.config.num_queries, :]
        class_logits = self.class_predictor(query_tokens)
        prefix_tokens = logits[:, self.config.num_queries + self.embeddings.num_prefix_tokens :, :]
        prefix_tokens = prefix_tokens.transpose(1, 2)
        prefix_tokens = prefix_tokens.reshape(prefix_tokens.shape[0], -1, *self.grid_size)
        query_tokens = self.mask_head(query_tokens)
        prefix_tokens = self.upscale_block(prefix_tokens)
        mask_logits = torch.einsum("bqc, bchw -> bqhw", query_tokens, prefix_tokens)
        return mask_logits, class_logits
    @staticmethod
    def _disable_attention_mask(attn_mask, prob, num_query_tokens, encoder_start_tokens, device):
        if prob < 1:
            # Generate random queries to disable based on the probs
            random_queries = torch.rand(attn_mask.shape[0], num_query_tokens, device=device) > prob
            # Disable attention to the query tokens, considering the prefix tokens
            attn_mask[:, :num_query_tokens, encoder_start_tokens:][random_queries] = 1
        return attn_mask
    @auto_docstring
    @can_return_tuple
    def forward(
        self,
        pixel_values: Tensor,
        mask_labels: Optional[list[Tensor]] = None,
        class_labels: Optional[list[Tensor]] = None,
        output_hidden_states: Optional[bool] = None,
        output_attentions: Optional[bool] = None,
    ):
        r"""
        mask_labels (`List[torch.Tensor]`, *optional*):
            List of mask labels of shape `(num_labels, height, width)` to be fed to a model
        class_labels (`List[torch.LongTensor]`, *optional*):
            list of target class labels of shape `(num_labels, height, width)` to be fed to a model. They identify the
            labels of `mask_labels`, e.g. the label of `mask_labels[i][j]` if `class_labels[i][j]`.
        """
        output_hidden_states = (
            output_hidden_states if output_hidden_states is not None else self.config.output_hidden_states
        )
        output_attentions = output_attentions if output_attentions is not None else self.config.output_attentions
        all_hidden_states = () if output_hidden_states else None
        all_attentions = () if output_attentions else None
        masks_queries_logits_per_layer, class_queries_logits_per_layer = (), ()
        attention_mask = None
        if pixel_values is None:
            raise ValueError("You have to specify pixel_values")
        hidden_states = self.embeddings(pixel_values)
        for idx, layer_module in enumerate(self.layers):
            if output_hidden_states:
                all_hidden_states += (hidden_states,)
            if idx == self.num_hidden_layers - self.config.num_blocks:
                query = self.query.weight[None, :, :].expand(hidden_states.shape[0], -1, -1)
                hidden_states = torch.cat((query, hidden_states), dim=1)
            if idx >= self.num_hidden_layers - self.config.num_blocks and (
                self.training or self.attn_mask_probs[idx - self.num_hidden_layers + self.config.num_blocks] > 0
            ):
                norm_hidden_states = self.layernorm(hidden_states)
                masks_queries_logits, class_queries_logits = self.predict(norm_hidden_states)
                masks_queries_logits_per_layer += (masks_queries_logits,)
                class_queries_logits_per_layer += (class_queries_logits,)
                attention_mask = torch.ones(
                    hidden_states.shape[0],
                    hidden_states.shape[1],
                    hidden_states.shape[1],
                    device=hidden_states.device,
                    dtype=torch.bool,
                )
                interpolated_logits = F.interpolate(masks_queries_logits, size=self.grid_size, mode="bilinear")
                interpolated_logits = interpolated_logits.view(
                    interpolated_logits.size(0), interpolated_logits.size(1), -1
                )
                num_query_tokens = self.config.num_queries
                encoder_start_tokens = num_query_tokens + self.embeddings.num_prefix_tokens
                # Set attention mask for queries to focus on encoder tokens based on interpolated logits
                attention_mask[:, :num_query_tokens, encoder_start_tokens:] = interpolated_logits > 0
                # Disable attention mask for random query tokens.
                attention_mask = self._disable_attention_mask(
                    attention_mask,
                    prob=self.attn_mask_probs[idx - self.num_hidden_layers + self.config.num_blocks],
                    num_query_tokens=num_query_tokens,
                    encoder_start_tokens=encoder_start_tokens,
                    device=attention_mask.device,
                )
                # Expand attention mask to 4d mask.
                attention_mask = attention_mask[:, None, ...].expand(-1, self.config.num_attention_heads, -1, -1)
                attention_mask = attention_mask.float().masked_fill(~attention_mask, -1e9)
            layer_outputs = layer_module(hidden_states, attention_mask, output_attentions)
            hidden_states = layer_outputs[0]
            if output_attentions:
                all_attentions += (layer_outputs[1],)
        sequence_output = self.layernorm(hidden_states)
        if output_hidden_states:
            all_hidden_states += (sequence_output,)
        masks_queries_logits, class_queries_logits = self.predict(sequence_output)
        masks_queries_logits_per_layer += (masks_queries_logits,)
        class_queries_logits_per_layer += (class_queries_logits,)
        loss = None
        if mask_labels is not None and class_labels is not None:
            loss = 0.0
            for masks_queries_logits, class_queries_logits in zip(
                masks_queries_logits_per_layer, class_queries_logits_per_layer
            ):
                loss_dict = self.get_loss_dict(
                    masks_queries_logits=masks_queries_logits,
                    class_queries_logits=class_queries_logits,
                    mask_labels=mask_labels,
                    class_labels=class_labels,
                    auxiliary_predictions=None,
                )
                loss += self.get_loss(loss_dict)
        return EomtForUniversalSegmentationOutput(
            loss=loss,
            masks_queries_logits=masks_queries_logits,
            class_queries_logits=class_queries_logits,
            last_hidden_state=sequence_output,
            hidden_states=all_hidden_states,
            attentions=all_attentions,
        )
 __all__ = ["EomtConfig", "EomtPreTrainedModel", "EomtForUniversalSegmentation"]
--- a/src/transformers/models/mask2former/modeling_mask2former.py
+++ b/src/transformers/models/mask2former/modeling_mask2former.py
@@ -512,7 +512,7 @@ class Mask2FormerLoss(nn.Module):
        self.importance_sample_ratio = config.importance_sample_ratio
        self.matcher = Mask2FormerHungarianMatcher(
-            cost_class=1.0,
+            cost_class=config.class_weight,
            cost_dice=config.dice_weight,
            cost_mask=config.mask_weight,
            num_points=self.num_points,
--- a/tests/models/eomt/init.py
+++ b/tests/models/eomt/init.py
--- a/tests/models/eomt/test_image_processing_eomt.py
+++ b/tests/models/eomt/test_image_processing_eomt.py
@@ -0,0 +1,308 @@
 # Copyright 2025 The HuggingFace Inc. team. All rights reserved.
 #
 # Licensed under the Apache License, Version 2.0 (the "License");
 # you may not use this file except in compliance with the License.
 # You may obtain a copy of the License at
 #
 #     http://www.apache.org/licenses/LICENSE-2.0
 #
 # Unless required by applicable law or agreed to in writing, software
 # distributed under the License is distributed on an "AS IS" BASIS,
 # WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
 # See the License for the specific language governing permissions and
 # limitations under the License.
 """Testing suite for the PyTorch EoMT Image Processor."""
 import unittest
 import numpy as np
 import requests
 from datasets import load_dataset
 from transformers.testing_utils import require_torch, require_vision
 from transformers.utils import is_torch_available, is_torchvision_available, is_vision_available
 from ...test_image_processing_common import ImageProcessingTestMixin, prepare_image_inputs
 if is_torch_available():
    import torch
 if is_vision_available():
    from PIL import Image
    from transformers import EomtImageProcessor
    if is_torchvision_available():
        from transformers import EomtImageProcessorFast
    from transformers.models.eomt.modeling_eomt import EomtForUniversalSegmentationOutput
 class EomtImageProcessingTester:
    def __init__(
        self,
        parent,
        batch_size=7,
        num_channels=3,
        min_resolution=30,
        max_resolution=400,
        size=None,
        do_resize=True,
        do_pad=True,
        do_normalize=True,
        image_mean=[0.5, 0.5, 0.5],
        image_std=[0.5, 0.5, 0.5],
        num_labels=10,
    ):
        self.parent = parent
        self.batch_size = batch_size
        self.num_channels = num_channels
        self.min_resolution = min_resolution
        self.max_resolution = max_resolution
        self.do_resize = do_resize
        self.do_pad = do_pad
        self.size = size if size is not None else {"shortest_edge": 18, "longest_edge": 18}
        self.do_normalize = do_normalize
        self.image_mean = image_mean
        self.image_std = image_std
        # for the post_process_functions
        self.batch_size = 2
        self.num_queries = 3
        self.num_classes = 2
        self.height = 18
        self.width = 18
        self.num_labels = num_labels
    def prepare_image_processor_dict(self):
        return {
            "do_resize": self.do_resize,
            "size": self.size,
            "do_normalize": self.do_normalize,
            "image_mean": self.image_mean,
            "image_std": self.image_std,
            "do_pad": self.do_pad,
            "num_labels": self.num_labels,
        }
    def prepare_fake_eomt_outputs(self, batch_size):
        return EomtForUniversalSegmentationOutput(
            masks_queries_logits=torch.randn((batch_size, self.num_queries, self.height, self.width)),
            class_queries_logits=torch.randn((batch_size, self.num_queries, self.num_classes + 1)),
        )
    def prepare_image_inputs(self, equal_resolution=False, numpify=False, torchify=False):
        return prepare_image_inputs(
            batch_size=self.batch_size,
            num_channels=self.num_channels,
            min_resolution=self.min_resolution,
            max_resolution=self.max_resolution,
            equal_resolution=equal_resolution,
            numpify=numpify,
            torchify=torchify,
        )
 def prepare_semantic_single_inputs():
    ds = load_dataset("hf-internal-testing/fixtures_ade20k", split="test")
    example = ds[0]
    return example["image"], example["map"]
 def prepare_semantic_batch_inputs():
    ds = load_dataset("hf-internal-testing/fixtures_ade20k", split="test")
    return list(ds["image"][:2]), list(ds["map"][:2])
@require_torch
@require_vision
 class EomtImageProcessingTest(ImageProcessingTestMixin, unittest.TestCase):
    image_processing_class = EomtImageProcessor if is_vision_available() else None
    fast_image_processing_class = EomtImageProcessorFast if is_torchvision_available() else None
    def setUp(self):
        super().setUp()
        self.image_processor_tester = EomtImageProcessingTester(self)
        self.model_id = "tue-mps/coco_panoptic_eomt_large_640"
    @property
    def image_processor_dict(self):
        return self.image_processor_tester.prepare_image_processor_dict()
    def test_image_processor_properties(self):
        for image_processing_class in self.image_processor_list:
            image_processing = image_processing_class(**self.image_processor_dict)
            self.assertTrue(hasattr(image_processing, "image_mean"))
            self.assertTrue(hasattr(image_processing, "image_std"))
            self.assertTrue(hasattr(image_processing, "do_normalize"))
            self.assertTrue(hasattr(image_processing, "do_resize"))
            self.assertTrue(hasattr(image_processing, "size"))
            self.assertTrue(hasattr(image_processing, "do_rescale"))
            self.assertTrue(hasattr(image_processing, "rescale_factor"))
            self.assertTrue(hasattr(image_processing, "resample"))
    def test_image_processor_from_dict_with_kwargs(self):
        image_processor = self.image_processing_class.from_dict(self.image_processor_dict)
        self.assertEqual(image_processor.size, {"shortest_edge": 18, "longest_edge": 18})
        image_processor = self.image_processing_class.from_dict(self.image_processor_dict, size=42)
        self.assertEqual(image_processor.size, {"shortest_edge": 42})
    def test_call_numpy(self):
        for image_processing_class in self.image_processor_list:
            # Initialize image_processing
            image_processing = image_processing_class(**self.image_processor_dict)
            # create random numpy tensors
            image_inputs = self.image_processor_tester.prepare_image_inputs(equal_resolution=True, numpify=True)
            for image in image_inputs:
                self.assertIsInstance(image, np.ndarray)
            # Test not batched input
            encoded_images = image_processing(image_inputs[0], return_tensors="pt").pixel_values
            expected_output_image_shape = (1, 3, 18, 18)
            self.assertEqual(tuple(encoded_images.shape), expected_output_image_shape)
            # Test batched
            encoded_images = image_processing(image_inputs, return_tensors="pt").pixel_values
            expected_output_image_shape = (2, 3, 18, 18)
            self.assertEqual(tuple(encoded_images.shape), expected_output_image_shape)
    @unittest.skip(reason="Not supported")
    def test_call_numpy_4_channels(self):
        pass
    def test_call_pil(self):
        image_processing = self.image_processing_class(**self.image_processor_dict)
        image_inputs = self.image_processor_tester.prepare_image_inputs(equal_resolution=True)
        for image in image_inputs:
            self.assertIsInstance(image, Image.Image)
        # Test Non batched input
        encoded_images = image_processing(image_inputs[0], return_tensors="pt").pixel_values
        expected_output_image_shape = (1, 3, 18, 18)
        self.assertEqual(tuple(encoded_images.shape), expected_output_image_shape)
        # Test batched
        encoded_images = image_processing(image_inputs, return_tensors="pt").pixel_values
        expected_output_image_shape = (2, 3, 18, 18)
        self.assertEqual(tuple(encoded_images.shape), expected_output_image_shape)
    def test_call_pytorch(self):
        image_processing = self.image_processing_class(**self.image_processor_dict)
        image_inputs = self.image_processor_tester.prepare_image_inputs(equal_resolution=True, torchify=True)
        for image in image_inputs:
            self.assertIsInstance(image, torch.Tensor)
        encoded_images = image_processing(image_inputs[0], return_tensors="pt").pixel_values
        expected_output_image_shape = (1, 3, 18, 18)
        self.assertEqual(tuple(encoded_images.shape), expected_output_image_shape)
        encoded_images = image_processing(image_inputs, return_tensors="pt").pixel_values
        expected_output_image_shape = (2, 3, 18, 18)
        self.assertEqual(tuple(encoded_images.shape), expected_output_image_shape)
    def test_slow_fast_equivalence(self):
        if not self.test_slow_image_processor or not self.test_fast_image_processor:
            self.skipTest(reason="Skipping slow/fast equivalence test")
        if self.image_processing_class is None or self.fast_image_processing_class is None:
            self.skipTest(reason="Skipping slow/fast equivalence test as one of the image processors is not defined")
        dummy_image, dummy_map = prepare_semantic_single_inputs()
        image_processor_slow = self.image_processing_class(**self.image_processor_dict)
        image_processor_fast = self.fast_image_processing_class(**self.image_processor_dict)
        image_encoding_slow = image_processor_slow(dummy_image, segmentation_maps=dummy_map, return_tensors="pt")
        image_encoding_fast = image_processor_fast(dummy_image, segmentation_maps=dummy_map, return_tensors="pt")
        self.assertTrue(torch.allclose(image_encoding_slow.pixel_values, image_encoding_fast.pixel_values, atol=1e-1))
        self.assertLessEqual(
            torch.mean(torch.abs(image_encoding_slow.pixel_values - image_encoding_fast.pixel_values)).item(), 1e-3
        )
        # Lets check whether 99.9% of mask_labels values match or not.
        match_ratio = (image_encoding_slow.mask_labels[0] == image_encoding_fast.mask_labels[0]).float().mean().item()
        self.assertGreaterEqual(match_ratio, 0.999, "Mask labels do not match between slow and fast image processor.")
    def test_slow_fast_equivalence_batched(self):
        if not self.test_slow_image_processor or not self.test_fast_image_processor:
            self.skipTest(reason="Skipping slow/fast equivalence test")
        if self.image_processing_class is None or self.fast_image_processing_class is None:
            self.skipTest(reason="Skipping slow/fast equivalence test as one of the image processors is not defined")
        if hasattr(self.image_processor_tester, "do_center_crop") and self.image_processor_tester.do_center_crop:
            self.skipTest(
                reason="Skipping as do_center_crop is True and center_crop functions are not equivalent for fast and slow processors"
            )
        dummy_images, dummy_maps = prepare_semantic_batch_inputs()
        image_processor_slow = self.image_processing_class(**self.image_processor_dict)
        image_processor_fast = self.fast_image_processing_class(**self.image_processor_dict)
        encoding_slow = image_processor_slow(dummy_images, segmentation_maps=dummy_maps, return_tensors="pt")
        encoding_fast = image_processor_fast(dummy_images, segmentation_maps=dummy_maps, return_tensors="pt")
        self.assertTrue(torch.allclose(encoding_slow.pixel_values, encoding_fast.pixel_values, atol=1e-1))
        self.assertLessEqual(
            torch.mean(torch.abs(encoding_slow.pixel_values - encoding_fast.pixel_values)).item(), 1e-3
        )
        for idx in range(len(dummy_maps)):
            match_ratio = (encoding_slow.mask_labels[idx] == encoding_fast.mask_labels[idx]).float().mean().item()
            self.assertGreaterEqual(
                match_ratio, 0.999, "Mask labels do not match between slow and fast image processors."
            )
    def test_post_process_semantic_segmentation(self):
        processor = self.image_processing_class(**self.image_processor_dict)
        # Set longest_edge to None to test for semantic segmentatiom.
        processor.size = {"shortest_edge": 18, "longest_edge": None}
        image = Image.open(requests.get("http://images.cocodataset.org/val2017/000000039769.jpg", stream=True).raw)
        inputs = processor(images=image, do_split_image=True, return_tensors="pt")
        patch_offsets = inputs.pop("patch_offsets")
        original_sizes = [image.size[::-1]]
        # For semantic segmentation, the BS of output is 2 coz, two patches are created for the image.
        outputs = self.image_processor_tester.prepare_fake_eomt_outputs(inputs["pixel_values"].shape[0])
        segmentation = processor.post_process_semantic_segmentation(outputs, patch_offsets, original_sizes)
        self.assertEqual(segmentation[0].shape, (image.height, image.width))
    def test_post_process_panoptic_segmentation(self):
        processor = self.image_processing_class(**self.image_processor_dict)
        image = Image.open(requests.get("http://images.cocodataset.org/val2017/000000039769.jpg", stream=True).raw)
        original_sizes = [image.size[::-1], image.size[::-1]]
        # lets test for batched input of 2
        outputs = self.image_processor_tester.prepare_fake_eomt_outputs(2)
        segmentation = processor.post_process_panoptic_segmentation(outputs, original_sizes)
        self.assertTrue(len(segmentation) == 2)
        for el in segmentation:
            self.assertTrue("segmentation" in el)
            self.assertTrue("segments_info" in el)
            self.assertEqual(type(el["segments_info"]), list)
            self.assertEqual(el["segmentation"].shape, (image.height, image.width))
    def test_post_process_instance_segmentation(self):
        processor = self.image_processing_class(**self.image_processor_dict)
        image = Image.open(requests.get("http://images.cocodataset.org/val2017/000000039769.jpg", stream=True).raw)
        original_sizes = [image.size[::-1], image.size[::-1]]
        # lets test for batched input of 2
        outputs = self.image_processor_tester.prepare_fake_eomt_outputs(2)
        segmentation = processor.post_process_instance_segmentation(outputs, original_sizes)
        self.assertTrue(len(segmentation) == 2)
        for el in segmentation:
            self.assertTrue("segmentation" in el)
            self.assertTrue("segments_info" in el)
            self.assertEqual(type(el["segments_info"]), list)
            self.assertEqual(el["segmentation"].shape, (image.height, image.width))
--- a/tests/models/eomt/test_modeling_eomt.py
+++ b/tests/models/eomt/test_modeling_eomt.py
@@ -0,0 +1,475 @@
 # Copyright 2025 The HuggingFace Inc. team. All rights reserved.
 #
 # Licensed under the Apache License, Version 2.0 (the "License");
 # you may not use this file except in compliance with the License.
 # You may obtain a copy of the License at
 #
 #     http://www.apache.org/licenses/LICENSE-2.0
 #
 # Unless required by applicable law or agreed to in writing, software
 # distributed under the License is distributed on an "AS IS" BASIS,
 # WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
 # See the License for the specific language governing permissions and
 # limitations under the License.
 """Testing suite for the PyTorch EoMT model."""
 import unittest
 import requests
 from transformers import AutoImageProcessor, EomtConfig, EomtForUniversalSegmentation
 from transformers.testing_utils import require_torch, require_torch_accelerator, require_torch_fp16, slow, torch_device
 from transformers.utils import is_torch_available, is_vision_available
 from ...test_configuration_common import ConfigTester
 from ...test_modeling_common import ModelTesterMixin, _config_zero_init, floats_tensor
 if is_torch_available():
    import torch
 if is_vision_available():
    from PIL import Image
 class EomtForUniversalSegmentationTester:
    def __init__(
        self,
        parent,
        batch_size=2,
        is_training=True,
        image_size=40,
        patch_size=2,
        num_queries=5,
        num_register_tokens=19,
        num_labels=4,
        hidden_size=8,
        num_attention_heads=2,
        num_hidden_layers=4,
    ):
        self.parent = parent
        self.batch_size = batch_size
        self.is_training = is_training
        self.num_queries = num_queries
        self.image_size = image_size
        self.patch_size = patch_size
        self.num_labels = num_labels
        self.hidden_size = hidden_size
        self.num_attention_heads = num_attention_heads
        self.num_hidden_layers = num_hidden_layers
        self.num_register_tokens = num_register_tokens
        num_patches = (image_size // patch_size) ** 2
        self.seq_length = num_patches + 1
    def get_config(self):
        config = {
            "image_size": self.image_size,
            "patch_size": self.patch_size,
            "num_labels": self.num_labels,
            "hidden_size": self.hidden_size,
            "num_attention_heads": self.num_attention_heads,
            "num_hidden_layers": self.num_hidden_layers,
            "num_register_tokens": self.num_register_tokens,
            "num_queries": self.num_queries,
            "num_blocks": 1,
        }
        return EomtConfig(**config)
    def prepare_config_and_inputs(self):
        pixel_values = floats_tensor([self.batch_size, 3, self.image_size, self.image_size]).to(torch_device)
        mask_labels = (
            torch.rand([self.batch_size, self.num_labels, self.image_size, self.image_size], device=torch_device) > 0.5
        ).float()
        class_labels = (torch.rand((self.batch_size, self.num_labels), device=torch_device) > 0.5).long()
        config = self.get_config()
        return config, pixel_values, mask_labels, class_labels
    def prepare_config_and_inputs_for_common(self):
        config, pixel_values, mask_labels, class_labels = self.prepare_config_and_inputs()
        inputs_dict = {"pixel_values": pixel_values}
        return config, inputs_dict
    def prepare_config_and_inputs_for_training(self):
        config, pixel_values, mask_labels, class_labels = self.prepare_config_and_inputs()
        inputs_dict = {"pixel_values": pixel_values, "mask_labels": mask_labels, "class_labels": class_labels}
        return config, inputs_dict
@require_torch
 class EomtForUniversalSegmentationTest(ModelTesterMixin, unittest.TestCase):
    all_model_classes = (EomtForUniversalSegmentation,) if is_torch_available() else ()
    is_encoder_decoder = False
    test_pruning = False
    test_head_masking = False
    test_missing_keys = False
    test_torch_exportable = False
    def setUp(self):
        self.model_tester = EomtForUniversalSegmentationTester(self)
        self.config_tester = ConfigTester(self, config_class=EomtConfig, has_text_modality=False)
    def test_config(self):
        self.config_tester.run_common_tests()
    def test_model_with_labels(self):
        size = (self.model_tester.image_size,) * 2
        inputs = {
            "pixel_values": torch.randn((2, 3, *size), device=torch_device),
            "mask_labels": torch.randn((2, 10, *size), device=torch_device),
            "class_labels": torch.zeros(2, 10, device=torch_device).long(),
        }
        config = self.model_tester.get_config()
        model = EomtForUniversalSegmentation(config).to(torch_device)
        outputs = model(**inputs)
        self.assertTrue(outputs.loss is not None)
    def test_attention_outputs(self):
        config, inputs_dict = self.model_tester.prepare_config_and_inputs_for_common()
        config.return_dict = True
        for model_class in self.all_model_classes:
            inputs_dict["output_attentions"] = True
            inputs_dict["output_hidden_states"] = False
            config.return_dict = True
            model = model_class._from_config(config, attn_implementation="eager")
            config = model.config
            model.to(torch_device)
            model.eval()
            with torch.no_grad():
                outputs = model(**self._prepare_for_class(inputs_dict, model_class))
            attentions = outputs.attentions
            self.assertEqual(len(attentions), self.model_tester.num_hidden_layers)
            # Check that output_attentions also work using config
            del inputs_dict["output_attentions"]
            config.output_attentions = True
            model = model_class(config)
            model.to(torch_device)
            model.eval()
            with torch.no_grad():
                outputs = model(**self._prepare_for_class(inputs_dict, model_class))
            attentions = outputs.attentions
            self.assertEqual(len(attentions), self.model_tester.num_hidden_layers)
            out_len = len(outputs)
            # Check attention is always last and order is fine
            inputs_dict["output_attentions"] = True
            inputs_dict["output_hidden_states"] = True
            model = model_class(config)
            model.to(torch_device)
            model.eval()
            with torch.no_grad():
                outputs = model(**self._prepare_for_class(inputs_dict, model_class))
            added_hidden_states = 1
            self.assertEqual(out_len + added_hidden_states, len(outputs))
            self_attentions = outputs.attentions
            self.assertEqual(len(self_attentions), self.model_tester.num_hidden_layers)
    def test_hidden_states_output(self):
        def check_hidden_states_output(inputs_dict, config, model_class):
            model = model_class(config)
            model.to(torch_device)
            model.eval()
            with torch.no_grad():
                outputs = model(**self._prepare_for_class(inputs_dict, model_class))
            hidden_states = outputs.encoder_hidden_states if config.is_encoder_decoder else outputs.hidden_states
            expected_num_layers = getattr(
                self.model_tester, "expected_num_hidden_layers", self.model_tester.num_hidden_layers + 1
            )
            self.assertEqual(len(hidden_states), expected_num_layers)
        config, inputs_dict = self.model_tester.prepare_config_and_inputs_for_common()
        for model_class in self.all_model_classes:
            inputs_dict["output_hidden_states"] = True
            check_hidden_states_output(inputs_dict, config, model_class)
            # check that output_hidden_states also work using config
            del inputs_dict["output_hidden_states"]
            config.output_hidden_states = True
            check_hidden_states_output(inputs_dict, config, model_class)
    @unittest.skip(reason="EoMT does not use inputs_embeds")
    def test_inputs_embeds(self):
        pass
    @unittest.skip(reason="EoMT does not have a get_input_embeddings method")
    def test_model_get_set_embeddings(self):
        pass
    @unittest.skip(reason="EoMT is not a generative model")
    def test_generate_without_input_ids(self):
        pass
    @unittest.skip(reason="EoMT does not use token embeddings")
    def test_resize_tokens_embeddings(self):
        pass
    def test_training(self):
        if not self.model_tester.is_training:
            self.skipTest(reason="ModelTester is not configured to run training tests")
        for model_class in self.all_model_classes:
            config, inputs_dict = self.model_tester.prepare_config_and_inputs_for_training()
            config.return_dict = True
            model = model_class(config)
            model.to(torch_device)
            model.train()
            inputs = self._prepare_for_class(inputs_dict, model_class, return_labels=True)
            loss = model(**inputs).loss
            loss.backward()
    def test_initialization(self):
        # Apart from the below params, all other parameters are initialized using kaiming uniform.
        non_uniform_init_parms = [
            "layernorm.bias",
            "layernorm.weight",
            "norm1.bias",
            "norm1.weight",
            "norm2.bias",
            "norm2.weight",
            "layer_scale1.lambda1",
            "layer_scale2.lambda1",
            "register_tokens",
            "cls_token",
        ]
        config, inputs_dict = self.model_tester.prepare_config_and_inputs_for_common()
        configs_no_init = _config_zero_init(config)
        for model_class in self.all_model_classes:
            model = model_class(config=configs_no_init)
            for name, param in model.named_parameters():
                if param.requires_grad:
                    if any(x in name for x in non_uniform_init_parms):
                        self.assertIn(
                            ((param.data.mean() * 1e9).round() / 1e9).item(),
                            [0.0, 1.0],
                            msg=f"Parameter {name} of model {model_class} seems not properly initialized",
                        )
                    else:
                        self.assertTrue(
                            -1.0 <= ((param.data.mean() * 1e9).round() / 1e9).item() <= 1.0,
                            msg=f"Parameter {name} of model {model_class} seems not properly initialized",
                        )
@require_torch
 class EomtForUniversalSegmentationIntegrationTest(unittest.TestCase):
    def setUp(self):
        self.model_id = "tue-mps/coco_panoptic_eomt_large_640"
    @slow
    def test_inference(self):
        model = EomtForUniversalSegmentation.from_pretrained(self.model_id, device_map="auto")
        processor = AutoImageProcessor.from_pretrained(self.model_id)
        image = Image.open(requests.get("http://images.cocodataset.org/val2017/000000039769.jpg", stream=True).raw)
        inputs = processor(images=image, return_tensors="pt").to(model.device)
        with torch.inference_mode():
            outputs = model(**inputs)
        self.assertTrue(outputs.class_queries_logits.shape == (1, 200, 134))
        self.assertTrue(outputs.masks_queries_logits.shape == (1, 200, 160, 160))
        # fmt: off
        EXPECTED_SLICE = torch.tensor([
            [ 13.2540,   8.9279,   8.6631,  12.3760,  10.1429],
            [ -3.4815, -36.4630, -45.5604, -46.8404, -37.5099],
            [ -6.8689, -44.4206, -62.7591, -59.2928, -47.7035],
            [ -2.9380, -42.0659, -57.4382, -55.1537, -43.5142],
            [ -8.4387, -38.5275, -53.1383, -47.0064, -38.9667],
        ]).to(model.device)
        # fmt: on
        output_slice = outputs.masks_queries_logits[0, 0, :5, :5]
        torch.testing.assert_close(output_slice, EXPECTED_SLICE, rtol=1e-2, atol=1e-2)
        # fmt: off
        EXPECTED_SLICE = torch.tensor([
            [-0.6977, -6.4907, -4.1178, -6.5554, -6.6529],
            [-0.3650, -6.6560, -4.0143, -6.5776, -6.5879],
            [-0.8820, -6.7175, -3.5334, -6.8569, -6.2415],
            [ 0.4502, -5.3911, -3.0232, -5.9411, -6.3243],
            [ 0.3157, -5.6321, -2.6716, -5.5740, -5.5607],
        ]).to(model.device)
        # fmt: on
        output_slice = outputs.class_queries_logits[0, :5, :5]
        torch.testing.assert_close(output_slice, EXPECTED_SLICE, rtol=1e-2, atol=1e-2)
    @require_torch_accelerator
    @require_torch_fp16
    @slow
    def test_inference_fp16(self):
        model = EomtForUniversalSegmentation.from_pretrained(
            self.model_id, torch_dtype=torch.float16, device_map="auto"
        )
        processor = AutoImageProcessor.from_pretrained(self.model_id)
        image = Image.open(requests.get("http://images.cocodataset.org/val2017/000000039769.jpg", stream=True).raw)
        inputs = processor(images=image, return_tensors="pt").to(model.device)
        with torch.inference_mode():
            outputs = model(**inputs)
        self.assertTrue(outputs.class_queries_logits.shape == (1, 200, 134))
        self.assertTrue(outputs.masks_queries_logits.shape == (1, 200, 160, 160))
    @slow
    def test_semantic_segmentation_inference(self):
        model_id = "tue-mps/ade20k_semantic_eomt_large_512"
        model = EomtForUniversalSegmentation.from_pretrained(model_id, device_map="auto")
        processor = AutoImageProcessor.from_pretrained(model_id)
        image = Image.open(requests.get("http://images.cocodataset.org/val2017/000000039769.jpg", stream=True).raw)
        inputs = processor(images=image, return_tensors="pt").to(model.device)
        patch_offsets = inputs.pop("patch_offsets", None)
        with torch.inference_mode():
            outputs = model(**inputs)
        self.assertTrue(outputs.class_queries_logits.shape == (2, 100, 151))
        self.assertTrue(outputs.masks_queries_logits.shape == (2, 100, 128, 128))
        preds = processor.post_process_semantic_segmentation(
            outputs, original_image_sizes=[(image.size[1], image.size[0])], patch_offsets=patch_offsets
        )
        self.assertTrue(preds.shape[1:] == (image.size[1], image.size[0]))
        # fmt: off
        EXPECTED_SLICE = torch.tensor([
            [39, 39, 39, 39, 39, 39, 39, 39, 39, 39],
            [39, 39, 39, 39, 39, 39, 39, 39, 39, 39],
            [39, 39, 39, 39, 39, 39, 39, 39, 39, 39],
            [39, 39, 39, 39, 39, 39, 39, 39, 39, 39],
            [39, 39, 39, 39, 39, 39, 39, 39, 39, 39],
            [39, 39, 39, 39, 39, 39, 39, 39, 39, 39],
            [39, 39, 39, 39, 39, 39, 39, 39, 39, 39],
            [39, 39, 39, 39, 39, 39, 39, 39, 39, 39],
            [39, 39, 39, 39, 39, 39, 39, 39, 39, 39],
            [39, 39, 39, 39, 39, 39, 39, 39, 39, 39]
        ], device=model.device)
        # fmt: on
        output_slice = preds[0, :10, :10]
        torch.testing.assert_close(output_slice, EXPECTED_SLICE, rtol=1e-2, atol=1e-2)
    @slow
    def test_panoptic_segmentation_inference(self):
        model = EomtForUniversalSegmentation.from_pretrained(self.model_id, device_map="auto")
        processor = AutoImageProcessor.from_pretrained(self.model_id)
        image = Image.open(requests.get("http://images.cocodataset.org/val2017/000000039769.jpg", stream=True).raw)
        inputs = processor(images=image, return_tensors="pt").to(model.device)
        with torch.inference_mode():
            outputs = model(**inputs)
        self.assertTrue(outputs.class_queries_logits.shape == (1, 200, 134))
        self.assertTrue(outputs.masks_queries_logits.shape == (1, 200, 160, 160))
        preds = processor.post_process_panoptic_segmentation(
            outputs, original_image_sizes=[(image.size[1], image.size[0])]
        )[0]
        segmentation, segments_info = preds["segmentation"], preds["segments_info"]
        # fmt: off
        EXPECTED_SLICE = torch.tensor([
            [-1, -1, -1, -1, -1, -1, -1, -1, -1, -1],
            [-1, -1, -1, -1, -1, -1, -1, -1, -1, -1],
            [-1, -1, -1, -1, -1, -1, -1, -1, -1, -1],
            [-1, -1, -1, -1, -1,  2,  2,  2,  2,  2],
            [-1, -1, -1,  2,  2,  2,  2,  2,  2,  2],
            [ 2,  2,  2,  2,  2,  2,  2,  2,  2,  2],
            [ 2,  2,  2,  2,  2,  2,  2,  2,  2,  2],
            [ 2,  2,  2,  2,  2,  2,  2,  2,  2,  2],
            [ 2,  2,  2,  2,  2,  2,  2,  2,  2,  2],
            [ 2,  2,  2,  2,  2,  2,  2,  2,  2,  2]
        ], device=model.device)
        EXPECTED_SEGMENTS_INFO = [
            {"id": 0, "label_id": 15, "score": 0.99935},
            {"id": 1, "label_id": 15, "score": 0.998688},
            {"id": 2, "label_id": 57, "score": 0.954325},
            {"id": 3, "label_id": 65, "score": 0.997285},
            {"id": 4, "label_id": 65, "score": 0.99711}
        ]
        # fmt: on
        output_slice = segmentation[:10, :10]
        torch.testing.assert_close(output_slice, EXPECTED_SLICE, rtol=1e-2, atol=1e-2)
        for actual, expected in zip(segments_info, EXPECTED_SEGMENTS_INFO):
            self.assertEqual(actual["id"], expected["id"])
            self.assertEqual(actual["label_id"], expected["label_id"])
            self.assertAlmostEqual(actual["score"], expected["score"], delta=1e-3)
    @slow
    def test_instance_segmentation_inference(self):
        model_id = "tue-mps/coco_instance_eomt_large_640"
        model = EomtForUniversalSegmentation.from_pretrained(model_id, device_map="auto")
        processor = AutoImageProcessor.from_pretrained(model_id)
        image = Image.open(requests.get("http://images.cocodataset.org/val2017/000000039769.jpg", stream=True).raw)
        inputs = processor(images=image, return_tensors="pt").to(model.device)
        with torch.inference_mode():
            outputs = model(**inputs)
        self.assertTrue(outputs.class_queries_logits.shape == (1, 200, 81))
        self.assertTrue(outputs.masks_queries_logits.shape == (1, 200, 160, 160))
        preds = processor.post_process_instance_segmentation(
            outputs, original_image_sizes=[(image.size[1], image.size[0])]
        )[0]
        segmentation, segments_info = preds["segmentation"], preds["segments_info"]
        # fmt: off
        EXPECTED_SLICE = torch.tensor([
            [-1., -1., -1., -1., -1., -1., -1., -1., -1., -1.],
            [-1., -1., -1., -1., -1., -1., -1., -1., -1., -1.],
            [-1., -1., -1., -1., -1., -1., -1., -1., -1., -1.],
            [-1., -1., -1.,  0.,  0.,  1.,  1.,  1.,  1.,  1.],
            [ 0.,  0.,  1.,  1.,  1.,  1.,  1.,  1.,  1.,  1.],
            [ 1.,  1.,  1.,  1.,  1.,  1.,  1.,  1.,  1.,  1.],
            [ 1.,  1.,  1.,  1.,  1.,  1.,  1.,  1.,  1.,  1.],
            [ 1.,  1.,  1.,  1.,  1.,  1.,  1.,  1.,  1.,  1.],
            [ 1.,  1.,  1.,  1.,  1.,  1.,  1.,  1.,  1.,  1.],
            [ 1.,  1.,  1.,  1.,  1.,  1.,  1.,  1.,  1.,  1.]
        ], device=model.device)
        EXPECTED_SEGMENTS_INFO = [
            {'id': 0, 'label_id': 57, 'score': 0.871247},
            {'id': 1, 'label_id': 57, 'score': 0.821225},
            {'id': 2, 'label_id': 15, 'score': 0.976252},
            {'id': 3, 'label_id': 65, 'score': 0.972960},
            {'id': 4, 'label_id': 65, 'score': 0.981109},
            {'id': 5, 'label_id': 15, 'score': 0.972689}
        ]
        # fmt: on
        output_slice = segmentation[:10, :10]
        torch.testing.assert_close(output_slice, EXPECTED_SLICE, rtol=1e-2, atol=1e-2)
        for actual, expected in zip(segments_info, EXPECTED_SEGMENTS_INFO):
            self.assertEqual(actual["id"], expected["id"])
            self.assertEqual(actual["label_id"], expected["label_id"])
            self.assertAlmostEqual(actual["score"], expected["score"], delta=1e-3)