Add TimmWrapper (#34564)

* Add files * Init * Add TimmWrapperModel * Fix up * Some fixes * Fix up * Remove old file * Sort out import orders * Fix some model loading * Compatible with pipeline and trainer * Fix up * Delete test_timm_model_1/config.json * Remove accidentally commited files * Delete src/transformers/models/modeling_timm_wrapper.py * Remove empty imports; fix transformations applied * Tidy up * Add image classifcation model to special cases * Create pretrained model; enable device_map='auto' * Enable most tests; fix init order * Sort imports * [run-slow] timm_wrapper * Pass num_classes into timm.create_model * Remove train transforms from image processor * Update timm creation with pretrained=False * Fix gamma/beta issue for timm models * Fixing gamma and beta renaming for timm models * Simplify config and model creation * Remove attn_implementation diff * Fixup * Docstrings * Fix warning msg text according to test case * Fix device_map auto * Set dtype and device for pixel_values in forward * Enable output hidden states * Enable tests for hidden_states and model parallel * Remove default scriptable arg * Refactor inner model * Update timm version * Fix _find_mismatched_keys function * Change inheritance for Classification model (fix weights loading with device_map) * Minor bugfix * Disable save pretrained for image processor * Rename hook method for loaded keys correction * Rename state dict keys on save, remove `timm_model` prefix, make checkpoint compatible with `timm` * Managing num_labels <-> num_classes attributes * Enable loading checkpoints in Trainer to resume training * Update error message for output_hidden_states * Add output hidden states test * Decouple base and classification models * Add more test cases * Add save-load-to-timm test * Fix test name * Fixup * Add do_pooling * Add test for do_pooling * Fix doc * Add tests for TimmWrapperModel * Add validation for `num_classes=0` in timm config + test for DINO checkpoint * Adjust atol for test * Fix docs * dev-ci * dev-ci * Add tests for image processor * Update docs * Update init to new format * Update docs in configuration * Fix some docs in image processor * Improve docs for modeling * fix for is_timm_checkpoint * Update code examples * Fix header * Fix typehint * Increase tolerance a bit * Fix Path * Fixing model parallel tests * Disable "parallel" tests * Add comment for metadata * Refactor AutoImageProcessor for timm wrapper loading * Remove custom test_model_outputs_equivalence * Add require_timm decorator * Fix comment * Make image processor work with older timm versions and tensor input * Save config instead of whole model in image processor tests * Add docstring for `image_processor_filename` * Sanitize kwargs for timm image processor * Fix doc style * Update check for tensor input * Update normalize * Remove _load_timm_model function --------- Co-authored-by: Amy Roberts <22614925+amyeroberts@users.noreply.github.com>
2024-12-11 13:40:30 +01:00
parent bcc50cc7ce
commit 5fcf6286bf
25 changed files with 1432 additions and 129 deletions
--- a/docs/source/en/_toctree.yml
+++ b/docs/source/en/_toctree.yml
@@ -705,6 +705,8 @@
        title: Swin2SR
      - local: model_doc/table-transformer
        title: Table Transformer
      - local: model_doc/timm_wrapper
        title: Timm Wrapper
      - local: model_doc/upernet
        title: UperNet
      - local: model_doc/van
--- a/docs/source/en/index.md
+++ b/docs/source/en/index.md
@@ -321,6 +321,7 @@ Flax), PyTorch, and/or TensorFlow.
 |                         [TAPEX](model_doc/tapex)                         |       ✅        |         ✅         |      ✅      |
 |       [Time Series Transformer](model_doc/time_series_transformer)       |       ✅        |         ❌         |      ❌      |
 |                   [TimeSformer](model_doc/timesformer)                   |       ✅        |         ❌         |      ❌      |
 |                [TimmWrapperModel](model_doc/timm_wrapper)                |       ✅        |         ❌         |      ❌      |
 |        [Trajectory Transformer](model_doc/trajectory_transformer)        |       ✅        |         ❌         |      ❌      |
 |                  [Transformer-XL](model_doc/transfo-xl)                  |       ✅        |         ✅         |      ❌      |
 |                         [TrOCR](model_doc/trocr)                         |       ✅        |         ❌         |      ❌      |
--- a/docs/source/en/model_doc/timm_wrapper.md
+++ b/docs/source/en/model_doc/timm_wrapper.md
@@ -0,0 +1,67 @@
 <!--Copyright 2024 The HuggingFace Team. All rights reserved.
 Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with
 the License. You may obtain a copy of the License at
 http://www.apache.org/licenses/LICENSE-2.0
 Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on
 an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the
 specific language governing permissions and limitations under the License.
 ⚠️ Note that this file is in Markdown but contain specific syntax for our doc-builder (similar to MDX) that may not be
 rendered properly in your Markdown viewer.
 -->
 # TimmWrapper
 ## Overview
 Helper class to enable loading timm models to be used with the transformers library and its autoclasses.
 ```python
 >>> import torch
 >>> from PIL import Image
 >>> from urllib.request import urlopen
 >>> from transformers import AutoModelForImageClassification, AutoImageProcessor
 >>> # Load image
 >>> image = Image.open(urlopen(
 ...     'https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/beignets-task-guide.png'
 ... ))
 >>> # Load model and image processor
 >>> checkpoint = "timm/resnet50.a1_in1k"
 >>> image_processor = AutoImageProcessor.from_pretrained(checkpoint)
 >>> model = AutoModelForImageClassification.from_pretrained(checkpoint).eval()
 >>> # Preprocess image
 >>> inputs = image_processor(image)
 >>> # Forward pass
 >>> with torch.no_grad():
 ...     logits = model(**inputs).logits
 >>> # Get top 5 predictions
 >>> top5_probabilities, top5_class_indices = torch.topk(logits.softmax(dim=1) * 100, k=5)
 ```
 ## TimmWrapperConfig
 [[autodoc]] TimmWrapperConfig
 ## TimmWrapperImageProcessor
 [[autodoc]] TimmWrapperImageProcessor
    - preprocess
 ## TimmWrapperModel
 [[autodoc]] TimmWrapperModel
    - forward
 ## TimmWrapperForImageClassification
 [[autodoc]] TimmWrapperForImageClassification
    - forward
--- a/examples/pytorch/image-classification/run_image_classification.py
+++ b/examples/pytorch/image-classification/run_image_classification.py
@@ -42,6 +42,7 @@ from transformers import (
    AutoImageProcessor,
    AutoModelForImageClassification,
    HfArgumentParser,
    TimmWrapperImageProcessor,
    Trainer,
    TrainingArguments,
    set_seed,
@@ -329,15 +330,20 @@ def main():
    )
    # Define torchvision transforms to be applied to each image.
    if isinstance(image_processor, TimmWrapperImageProcessor):
        _train_transforms = image_processor.train_transforms
        _val_transforms = image_processor.val_transforms
    else:
        if "shortest_edge" in image_processor.size:
            size = image_processor.size["shortest_edge"]
        else:
            size = (image_processor.size["height"], image_processor.size["width"])
-    normalize = (
+
-        Normalize(mean=image_processor.image_mean, std=image_processor.image_std)
+        # Create normalization transform
-        if hasattr(image_processor, "image_mean") and hasattr(image_processor, "image_std")
+        if hasattr(image_processor, "image_mean") and hasattr(image_processor, "image_std"):
-        else Lambda(lambda x: x)
+            normalize = Normalize(mean=image_processor.image_mean, std=image_processor.image_std)
-    )
+        else:
            normalize = Lambda(lambda x: x)
        _train_transforms = Compose(
            [
                RandomResizedCrop(size),
--- a/src/transformers/init.py
+++ b/src/transformers/init.py
@@ -782,6 +782,7 @@ _import_structure = {
    "models.time_series_transformer": ["TimeSeriesTransformerConfig"],
    "models.timesformer": ["TimesformerConfig"],
    "models.timm_backbone": ["TimmBackboneConfig"],
    "models.timm_wrapper": ["TimmWrapperConfig"],
    "models.trocr": [
        "TrOCRConfig",
        "TrOCRProcessor",
@@ -1272,6 +1273,18 @@ else:
    _import_structure["models.rt_detr"].append("RTDetrImageProcessorFast")
    _import_structure["models.vit"].append("ViTImageProcessorFast")
 try:
    if not is_torchvision_available() and not is_timm_available():
        raise OptionalDependencyNotAvailable()
 except OptionalDependencyNotAvailable:
    from .utils import dummy_timm_and_torchvision_objects
    _import_structure["utils.dummy_timm_and_torchvision_objects"] = [
        name for name in dir(dummy_timm_and_torchvision_objects) if not name.startswith("_")
    ]
 else:
    _import_structure["models.timm_wrapper"].extend(["TimmWrapperImageProcessor"])
 # PyTorch-backed objects
 try:
    if not is_torch_available():
@@ -3532,6 +3545,9 @@ else:
        ]
    )
    _import_structure["models.timm_backbone"].extend(["TimmBackbone"])
    _import_structure["models.timm_wrapper"].extend(
        ["TimmWrapperForImageClassification", "TimmWrapperModel", "TimmWrapperPreTrainedModel"]
    )
    _import_structure["models.trocr"].extend(
        [
            "TrOCRForCausalLM",
@@ -5734,6 +5750,7 @@ if TYPE_CHECKING:
        TimesformerConfig,
    )
    from .models.timm_backbone import TimmBackboneConfig
    from .models.timm_wrapper import TimmWrapperConfig
    from .models.trocr import (
        TrOCRConfig,
        TrOCRProcessor,
@@ -6227,6 +6244,14 @@ if TYPE_CHECKING:
        from .models.rt_detr import RTDetrImageProcessorFast
        from .models.vit import ViTImageProcessorFast
    try:
        if not is_torchvision_available() and not is_timm_available():
            raise OptionalDependencyNotAvailable()
    except OptionalDependencyNotAvailable:
        from .utils.dummy_timm_and_torchvision_objects import *
    else:
        from .models.timm_wrapper import TimmWrapperImageProcessor
    # Modeling
    try:
        if not is_torch_available():
@@ -8037,6 +8062,11 @@ if TYPE_CHECKING:
            TimesformerPreTrainedModel,
        )
        from .models.timm_backbone import TimmBackbone
        from .models.timm_wrapper import (
            TimmWrapperForImageClassification,
            TimmWrapperModel,
            TimmWrapperPreTrainedModel,
        )
        from .models.trocr import (
            TrOCRForCausalLM,
            TrOCRPreTrainedModel,
--- a/src/transformers/configuration_utils.py
+++ b/src/transformers/configuration_utils.py
@@ -37,6 +37,7 @@ from .utils import (
    download_url,
    extract_commit_hash,
    is_remote_url,
    is_timm_config_dict,
    is_torch_available,
    logging,
 )
@@ -702,6 +703,11 @@ class PretrainedConfig(PushToHubMixin):
            config_dict["custom_pipelines"] = add_model_info_to_custom_pipelines(
                config_dict["custom_pipelines"], pretrained_model_name_or_path
            )
        # timm models are not saved with the model_type in the config file
        if "model_type" not in config_dict and is_timm_config_dict(config_dict):
            config_dict["model_type"] = "timm_wrapper"
        return config_dict, kwargs
    @classmethod
--- a/src/transformers/image_processing_base.py
+++ b/src/transformers/image_processing_base.py
@@ -285,6 +285,8 @@ class ImageProcessingMixin(PushToHubMixin):
            subfolder (`str`, *optional*, defaults to `""`):
                In case the relevant files are located inside a subfolder of the model repo on huggingface.co, you can
                specify the folder name here.
            image_processor_filename (`str`, *optional*, defaults to `"config.json"`):
                The name of the file in the model directory to use for the image processor config.
        Returns:
            `Tuple[Dict, Dict]`: The dictionary(ies) that will be used to instantiate the image processor object.
@@ -298,6 +300,7 @@ class ImageProcessingMixin(PushToHubMixin):
        local_files_only = kwargs.pop("local_files_only", False)
        revision = kwargs.pop("revision", None)
        subfolder = kwargs.pop("subfolder", "")
        image_processor_filename = kwargs.pop("image_processor_filename", IMAGE_PROCESSOR_NAME)
        from_pipeline = kwargs.pop("_from_pipeline", None)
        from_auto_class = kwargs.pop("_from_auto", False)
@@ -324,7 +327,7 @@ class ImageProcessingMixin(PushToHubMixin):
        pretrained_model_name_or_path = str(pretrained_model_name_or_path)
        is_local = os.path.isdir(pretrained_model_name_or_path)
        if os.path.isdir(pretrained_model_name_or_path):
-            image_processor_file = os.path.join(pretrained_model_name_or_path, IMAGE_PROCESSOR_NAME)
+            image_processor_file = os.path.join(pretrained_model_name_or_path, image_processor_filename)
        if os.path.isfile(pretrained_model_name_or_path):
            resolved_image_processor_file = pretrained_model_name_or_path
            is_local = True
@@ -332,7 +335,7 @@ class ImageProcessingMixin(PushToHubMixin):
            image_processor_file = pretrained_model_name_or_path
            resolved_image_processor_file = download_url(pretrained_model_name_or_path)
        else:
-            image_processor_file = IMAGE_PROCESSOR_NAME
+            image_processor_file = image_processor_filename
            try:
                # Load from local folder or from cache or download from model Hub and cache
                resolved_image_processor_file = cached_file(
@@ -358,7 +361,7 @@ class ImageProcessingMixin(PushToHubMixin):
                    f"Can't load image processor for '{pretrained_model_name_or_path}'. If you were trying to load"
                    " it from 'https://huggingface.co/models', make sure you don't have a local directory with the"
                    f" same name. Otherwise, make sure '{pretrained_model_name_or_path}' is the correct path to a"
-                    f" directory containing a {IMAGE_PROCESSOR_NAME} file"
+                    f" directory containing a {image_processor_filename} file"
                )
        try:
--- a/src/transformers/modeling_utils.py
+++ b/src/transformers/modeling_utils.py
@@ -503,7 +503,7 @@ def load_state_dict(
        # Check format of the archive
        with safe_open(checkpoint_file, framework="pt") as f:
            metadata = f.metadata()
-        if metadata.get("format") not in ["pt", "tf", "flax", "mlx"]:
+        if metadata is not None and metadata.get("format") not in ["pt", "tf", "flax", "mlx"]:
            raise OSError(
                f"The safetensors archive passed at {checkpoint_file} does not contain the valid metadata. Make sure "
                "you save your model with the `save_pretrained` method."
@@ -652,36 +652,6 @@ def _find_identical(tensors: List[Set[str]], state_dict: Dict[str, torch.Tensor]
 def _load_state_dict_into_model(model_to_load, state_dict, start_prefix, assign_to_params_buffers=False):
    # Convert old format to new format if needed from a PyTorch state_dict
    old_keys = []
    new_keys = []
    renamed_keys = {}
    renamed_gamma = {}
    renamed_beta = {}
    warning_msg = f"A pretrained model of type `{model_to_load.__class__.__name__}` "
    for key in state_dict.keys():
        new_key = None
        if "gamma" in key:
            # We add only the first key as an example
            new_key = key.replace("gamma", "weight")
            renamed_gamma[key] = new_key if not renamed_gamma else renamed_gamma
        if "beta" in key:
            # We add only the first key as an example
            new_key = key.replace("beta", "bias")
            renamed_beta[key] = new_key if not renamed_beta else renamed_beta
        if new_key:
            old_keys.append(key)
            new_keys.append(new_key)
    renamed_keys = {**renamed_gamma, **renamed_beta}
    if renamed_keys:
        warning_msg += "contains parameters that have been renamed internally (a few are listed below but more are present in the model):\n"
        for old_key, new_key in renamed_keys.items():
            warning_msg += f"* `{old_key}` -> `{new_key}`\n"
        warning_msg += "If you are using a model from the Hub, consider submitting a PR to adjust these weights and help future users."
        logger.info_once(warning_msg)
    for old_key, new_key in zip(old_keys, new_keys):
        state_dict[new_key] = state_dict.pop(old_key)
    # copy state_dict so _load_from_state_dict can modify it
    metadata = getattr(state_dict, "_metadata", None)
    state_dict = state_dict.copy()
@@ -812,46 +782,7 @@ def _load_state_dict_into_meta_model(
    error_msgs = []
    old_keys = []
    new_keys = []
    renamed_gamma = {}
    renamed_beta = {}
    is_quantized = hf_quantizer is not None
    warning_msg = f"This model {type(model)}"
    for key in state_dict.keys():
        new_key = None
        if "gamma" in key:
            # We add only the first key as an example
            new_key = key.replace("gamma", "weight")
            renamed_gamma[key] = new_key if not renamed_gamma else renamed_gamma
        if "beta" in key:
            # We add only the first key as an example
            new_key = key.replace("beta", "bias")
            renamed_beta[key] = new_key if not renamed_beta else renamed_beta
        # To reproduce `_load_state_dict_into_model` behaviour, we need to manually rename parametrized weigth norm, if necessary.
        if hasattr(nn.utils.parametrizations, "weight_norm"):
            if "weight_g" in key:
                new_key = key.replace("weight_g", "parametrizations.weight.original0")
            if "weight_v" in key:
                new_key = key.replace("weight_v", "parametrizations.weight.original1")
        else:
            if "parametrizations.weight.original0" in key:
                new_key = key.replace("parametrizations.weight.original0", "weight_g")
            if "parametrizations.weight.original1" in key:
                new_key = key.replace("parametrizations.weight.original1", "weight_v")
        if new_key:
            old_keys.append(key)
            new_keys.append(new_key)
    renamed_keys = {**renamed_gamma, **renamed_beta}
    if renamed_keys:
        warning_msg += "contains parameters that have been renamed internally (a few are listed below but more are present in the model):\n"
        for old_key, new_key in renamed_keys.items():
            warning_msg += f"* `{old_key}` -> `{new_key}`\n"
        warning_msg += "If you are using a model from the Hub, consider submitting a PR to adjust these weights and help future users."
        logger.info_once(warning_msg)
    for old_key, new_key in zip(old_keys, new_keys):
        state_dict[new_key] = state_dict.pop(old_key)
    is_torch_e4m3fn_available = hasattr(torch, "float8_e4m3fn")
@@ -2888,6 +2819,11 @@ class PreTrainedModel(nn.Module, ModuleUtilsMixin, GenerationMixin, PushToHubMix
            for ignore_key in self._keys_to_ignore_on_save:
                if ignore_key in state_dict.keys():
                    del state_dict[ignore_key]
        # Rename state_dict keys before saving to file. Do nothing unless overriden in a particular model.
        # (initially introduced with TimmWrapperModel to remove prefix and make checkpoints compatible with timm)
        state_dict = self._fix_state_dict_keys_on_save(state_dict)
        if safe_serialization:
            # Safetensors does not allow tensor aliasing.
            # We're going to remove aliases before saving
@@ -4010,7 +3946,10 @@ class PreTrainedModel(nn.Module, ModuleUtilsMixin, GenerationMixin, PushToHubMix
            with safe_open(resolved_archive_file, framework="pt") as f:
                metadata = f.metadata()
-            if metadata.get("format") == "pt":
+            if metadata is None:
                # Assume it's a pytorch checkpoint (introduced for timm checkpoints)
                pass
            elif metadata.get("format") == "pt":
                pass
            elif metadata.get("format") == "tf":
                from_tf = True
@@ -4375,6 +4314,72 @@ class PreTrainedModel(nn.Module, ModuleUtilsMixin, GenerationMixin, PushToHubMix
        return model
    @staticmethod
    def _fix_state_dict_key_on_load(key):
        """Replace legacy parameter names with their modern equivalents. E.g. beta -> bias, gamma -> weight."""
        if "beta" in key:
            return key.replace("beta", "bias")
        if "gamma" in key:
            return key.replace("gamma", "weight")
        # to avoid logging parametrized weight norm renaming
        if hasattr(nn.utils.parametrizations, "weight_norm"):
            if "weight_g" in key:
                return key.replace("weight_g", "parametrizations.weight.original0")
            if "weight_v" in key:
                return key.replace("weight_v", "parametrizations.weight.original1")
        else:
            if "parametrizations.weight.original0" in key:
                return key.replace("parametrizations.weight.original0", "weight_g")
            if "parametrizations.weight.original1" in key:
                return key.replace("parametrizations.weight.original1", "weight_v")
        return key
    @classmethod
    def _fix_state_dict_keys_on_load(cls, state_dict):
        """Fixes state dict keys by replacing legacy parameter names with their modern equivalents.
        Logs if any parameters have been renamed.
        """
        renamed_keys = {}
        state_dict_keys = list(state_dict.keys())
        for key in state_dict_keys:
            new_key = cls._fix_state_dict_key_on_load(key)
            if new_key != key:
                state_dict[new_key] = state_dict.pop(key)
            # add it once for logging
            if "gamma" in key and "gamma" not in renamed_keys:
                renamed_keys["gamma"] = (key, new_key)
            if "beta" in key and "beta" not in renamed_keys:
                renamed_keys["beta"] = (key, new_key)
        if renamed_keys:
            warning_msg = f"A pretrained model of type `{cls.__name__}` "
            warning_msg += "contains parameters that have been renamed internally (a few are listed below but more are present in the model):\n"
            for old_key, new_key in renamed_keys.values():
                warning_msg += f"* `{old_key}` -> `{new_key}`\n"
            warning_msg += "If you are using a model from the Hub, consider submitting a PR to adjust these weights and help future users."
            logger.info_once(warning_msg)
        return state_dict
    @staticmethod
    def _fix_state_dict_key_on_save(key):
        """
        Similar to `_fix_state_dict_key_on_load` allows to define hook for state dict key renaming on model save.
        Do nothing by default, but can be overriden in particular models.
        """
        return key
    def _fix_state_dict_keys_on_save(self, state_dict):
        """
        Similar to `_fix_state_dict_keys_on_load` allows to define hook for state dict key renaming on model save.
        Apply `_fix_state_dict_key_on_save` to all keys in `state_dict`.
        """
        return {self._fix_state_dict_key_on_save(key): value for key, value in state_dict.items()}
    @classmethod
    def _load_pretrained_model(
        cls,
@@ -4430,27 +4435,8 @@ class PreTrainedModel(nn.Module, ModuleUtilsMixin, GenerationMixin, PushToHubMix
        if hf_quantizer is not None:
            expected_keys = hf_quantizer.update_expected_keys(model, expected_keys, loaded_keys)
        def _fix_key(key):
            if "beta" in key:
                return key.replace("beta", "bias")
            if "gamma" in key:
                return key.replace("gamma", "weight")
            # to avoid logging parametrized weight norm renaming
            if hasattr(nn.utils.parametrizations, "weight_norm"):
                if "weight_g" in key:
                    return key.replace("weight_g", "parametrizations.weight.original0")
                if "weight_v" in key:
                    return key.replace("weight_v", "parametrizations.weight.original1")
            else:
                if "parametrizations.weight.original0" in key:
                    return key.replace("parametrizations.weight.original0", "weight_g")
                if "parametrizations.weight.original1" in key:
                    return key.replace("parametrizations.weight.original1", "weight_v")
            return key
        original_loaded_keys = loaded_keys
-        loaded_keys = [_fix_key(key) for key in loaded_keys]
+        loaded_keys = [cls._fix_state_dict_key_on_load(key) for key in loaded_keys]
        if len(prefix) > 0:
            has_prefix_module = any(s.startswith(prefix) for s in loaded_keys)
@@ -4615,23 +4601,23 @@ class PreTrainedModel(nn.Module, ModuleUtilsMixin, GenerationMixin, PushToHubMix
            state_dict,
            model_state_dict,
            loaded_keys,
            original_loaded_keys,
            add_prefix_to_model,
            remove_prefix_from_model,
            ignore_mismatched_sizes,
        ):
            mismatched_keys = []
            if ignore_mismatched_sizes:
-                for checkpoint_key in loaded_keys:
+                for checkpoint_key, model_key in zip(original_loaded_keys, loaded_keys):
                    # If the checkpoint is sharded, we may not have the key here.
                    if checkpoint_key not in state_dict:
                        continue
                    model_key = checkpoint_key
                    if remove_prefix_from_model:
                        # The model key starts with `prefix` but `checkpoint_key` doesn't so we add it.
-                        model_key = f"{prefix}.{checkpoint_key}"
+                        model_key = f"{prefix}.{model_key}"
                    elif add_prefix_to_model:
                        # The model key doesn't start with `prefix` but `checkpoint_key` does so we remove it.
-                        model_key = ".".join(checkpoint_key.split(".")[1:])
+                        model_key = ".".join(model_key.split(".")[1:])
                    if (
                        model_key in model_state_dict
@@ -4680,6 +4666,7 @@ class PreTrainedModel(nn.Module, ModuleUtilsMixin, GenerationMixin, PushToHubMix
            mismatched_keys = _find_mismatched_keys(
                state_dict,
                model_state_dict,
                loaded_keys,
                original_loaded_keys,
                add_prefix_to_model,
                remove_prefix_from_model,
@@ -4688,9 +4675,10 @@ class PreTrainedModel(nn.Module, ModuleUtilsMixin, GenerationMixin, PushToHubMix
            # For GGUF models `state_dict` is never set to None as the state dict is always small
            if gguf_path:
                fixed_state_dict = cls._fix_state_dict_keys_on_load(state_dict)
                error_msgs, offload_index, state_dict_index = _load_state_dict_into_meta_model(
                    model_to_load,
-                    state_dict,
+                    fixed_state_dict,
                    start_prefix,
                    expected_keys,
                    device_map=device_map,
@@ -4709,8 +4697,9 @@ class PreTrainedModel(nn.Module, ModuleUtilsMixin, GenerationMixin, PushToHubMix
                assign_to_params_buffers = check_support_param_buffer_assignment(
                    model_to_load, state_dict, start_prefix
                )
                fixed_state_dict = cls._fix_state_dict_keys_on_load(state_dict)
                error_msgs = _load_state_dict_into_model(
-                    model_to_load, state_dict, start_prefix, assign_to_params_buffers
+                    model_to_load, fixed_state_dict, start_prefix, assign_to_params_buffers
                )
        else:
@@ -4761,6 +4750,7 @@ class PreTrainedModel(nn.Module, ModuleUtilsMixin, GenerationMixin, PushToHubMix
                mismatched_keys += _find_mismatched_keys(
                    state_dict,
                    model_state_dict,
                    loaded_keys,
                    original_loaded_keys,
                    add_prefix_to_model,
                    remove_prefix_from_model,
@@ -4774,9 +4764,10 @@ class PreTrainedModel(nn.Module, ModuleUtilsMixin, GenerationMixin, PushToHubMix
                                    model_to_load, key, "cpu", torch.empty(*param.size(), dtype=dtype)
                                )
                    else:
                        fixed_state_dict = cls._fix_state_dict_keys_on_load(state_dict)
                        new_error_msgs, offload_index, state_dict_index = _load_state_dict_into_meta_model(
                            model_to_load,
-                            state_dict,
+                            fixed_state_dict,
                            start_prefix,
                            expected_keys,
                            device_map=device_map,
@@ -4797,8 +4788,9 @@ class PreTrainedModel(nn.Module, ModuleUtilsMixin, GenerationMixin, PushToHubMix
                        assign_to_params_buffers = check_support_param_buffer_assignment(
                            model_to_load, state_dict, start_prefix
                        )
                    fixed_state_dict = cls._fix_state_dict_keys_on_load(state_dict)
                    error_msgs += _load_state_dict_into_model(
-                        model_to_load, state_dict, start_prefix, assign_to_params_buffers
+                        model_to_load, fixed_state_dict, start_prefix, assign_to_params_buffers
                    )
                # force memory release
@@ -4930,9 +4922,10 @@ class PreTrainedModel(nn.Module, ModuleUtilsMixin, GenerationMixin, PushToHubMix
        _move_model_to_meta(model, loaded_state_dict_keys, start_prefix)
        state_dict = load_state_dict(resolved_archive_file, weights_only=weights_only)
        expected_keys = loaded_state_dict_keys  # plug for missing expected_keys. TODO: replace with proper keys
        fixed_state_dict = model._fix_state_dict_keys_on_load(state_dict)
        error_msgs = _load_state_dict_into_meta_model(
            model,
-            state_dict,
+            fixed_state_dict,
            start_prefix,
            expected_keys=expected_keys,
            hf_quantizer=hf_quantizer,
--- a/src/transformers/models/init.py
+++ b/src/transformers/models/init.py
@@ -249,6 +249,7 @@ from . import (
    time_series_transformer,
    timesformer,
    timm_backbone,
    timm_wrapper,
    trocr,
    tvp,
    udop,
--- a/src/transformers/models/auto/configuration_auto.py
+++ b/src/transformers/models/auto/configuration_auto.py
@@ -276,6 +276,7 @@ CONFIG_MAPPING_NAMES = OrderedDict(
        ("time_series_transformer", "TimeSeriesTransformerConfig"),
        ("timesformer", "TimesformerConfig"),
        ("timm_backbone", "TimmBackboneConfig"),
        ("timm_wrapper", "TimmWrapperConfig"),
        ("trajectory_transformer", "TrajectoryTransformerConfig"),
        ("transfo-xl", "TransfoXLConfig"),
        ("trocr", "TrOCRConfig"),
@@ -599,6 +600,7 @@ MODEL_NAMES_MAPPING = OrderedDict(
        ("time_series_transformer", "Time Series Transformer"),
        ("timesformer", "TimeSformer"),
        ("timm_backbone", "TimmBackbone"),
        ("timm_wrapper", "TimmWrapperModel"),
        ("trajectory_transformer", "Trajectory Transformer"),
        ("transfo-xl", "Transformer-XL"),
        ("trocr", "TrOCR"),
--- a/src/transformers/models/auto/image_processing_auto.py
+++ b/src/transformers/models/auto/image_processing_auto.py
@@ -30,6 +30,8 @@ from ...utils import (
    CONFIG_NAME,
    IMAGE_PROCESSOR_NAME,
    get_file_from_repo,
    is_timm_config_dict,
    is_timm_local_checkpoint,
    is_torchvision_available,
    is_vision_available,
    logging,
@@ -137,6 +139,7 @@ else:
            ("swinv2", ("ViTImageProcessor", "ViTImageProcessorFast")),
            ("table-transformer", ("DetrImageProcessor",)),
            ("timesformer", ("VideoMAEImageProcessor",)),
            ("timm_wrapper", ("TimmWrapperImageProcessor",)),
            ("tvlt", ("TvltImageProcessor",)),
            ("tvp", ("TvpImageProcessor",)),
            ("udop", ("LayoutLMv3ImageProcessor",)),
@@ -376,6 +379,8 @@ class AutoImageProcessor:
                Whether or not to allow for custom models defined on the Hub in their own modeling files. This option
                should only be set to `True` for repositories you trust and in which you have read the code, as it will
                execute code present on the Hub on your local machine.
            image_processor_filename (`str`, *optional*, defaults to `"config.json"`):
                The name of the file in the model directory to use for the image processor config.
            kwargs (`Dict[str, Any]`, *optional*):
                The values in kwargs of any keys which are image processor attributes will be used to override the
                loaded values. Behavior concerning key/value pairs whose keys are *not* image processor attributes is
@@ -415,7 +420,37 @@ class AutoImageProcessor:
        trust_remote_code = kwargs.pop("trust_remote_code", None)
        kwargs["_from_auto"] = True
-        config_dict, _ = ImageProcessingMixin.get_image_processor_dict(pretrained_model_name_or_path, **kwargs)
+        # Resolve the image processor config filename
        if "image_processor_filename" in kwargs:
            image_processor_filename = kwargs.pop("image_processor_filename")
        elif is_timm_local_checkpoint(pretrained_model_name_or_path):
            image_processor_filename = CONFIG_NAME
        else:
            image_processor_filename = IMAGE_PROCESSOR_NAME
        # Load the image processor config
        try:
            # Main path for all transformers models and local TimmWrapper checkpoints
            config_dict, _ = ImageProcessingMixin.get_image_processor_dict(
                pretrained_model_name_or_path, image_processor_filename=image_processor_filename, **kwargs
            )
        except Exception as initial_exception:
            # Fallback path for Hub TimmWrapper checkpoints. Timm models' image processing is saved in `config.json`
            # instead of `preprocessor_config.json`. Because this is an Auto class and we don't have any information
            # except the model name, the only way to check if a remote checkpoint is a timm model is to try to
            # load `config.json` and if it fails with some error, we raise the initial exception.
            try:
                config_dict, _ = ImageProcessingMixin.get_image_processor_dict(
                    pretrained_model_name_or_path, image_processor_filename=CONFIG_NAME, **kwargs
                )
            except Exception:
                raise initial_exception
            # In case we have a config_dict, but it's not a timm config dict, we raise the initial exception,
            # because only timm models have image processing in `config.json`.
            if not is_timm_config_dict(config_dict):
                raise initial_exception
        image_processor_class = config_dict.get("image_processor_type", None)
        image_processor_auto_map = None
        if "AutoImageProcessor" in config_dict.get("auto_map", {}):
--- a/src/transformers/models/auto/modeling_auto.py
+++ b/src/transformers/models/auto/modeling_auto.py
@@ -255,6 +255,7 @@ MODEL_MAPPING_NAMES = OrderedDict(
        ("time_series_transformer", "TimeSeriesTransformerModel"),
        ("timesformer", "TimesformerModel"),
        ("timm_backbone", "TimmBackbone"),
        ("timm_wrapper", "TimmWrapperModel"),
        ("trajectory_transformer", "TrajectoryTransformerModel"),
        ("transfo-xl", "TransfoXLModel"),
        ("tvlt", "TvltModel"),
@@ -605,6 +606,7 @@ MODEL_FOR_IMAGE_MAPPING_NAMES = OrderedDict(
        ("table-transformer", "TableTransformerModel"),
        ("timesformer", "TimesformerModel"),
        ("timm_backbone", "TimmBackbone"),
        ("timm_wrapper", "TimmWrapperModel"),
        ("van", "VanModel"),
        ("videomae", "VideoMAEModel"),
        ("vit", "ViTModel"),
@@ -690,6 +692,7 @@ MODEL_FOR_IMAGE_CLASSIFICATION_MAPPING_NAMES = OrderedDict(
        ("swiftformer", "SwiftFormerForImageClassification"),
        ("swin", "SwinForImageClassification"),
        ("swinv2", "Swinv2ForImageClassification"),
        ("timm_wrapper", "TimmWrapperForImageClassification"),
        ("van", "VanForImageClassification"),
        ("vit", "ViTForImageClassification"),
        ("vit_hybrid", "ViTHybridForImageClassification"),
--- a/src/transformers/models/timm_wrapper/init.py
+++ b/src/transformers/models/timm_wrapper/init.py
@@ -0,0 +1,28 @@
 # Copyright 2024 The HuggingFace Team. All rights reserved.
 #
 # Licensed under the Apache License, Version 2.0 (the "License");
 # you may not use this file except in compliance with the License.
 # You may obtain a copy of the License at
 #
 #     http://www.apache.org/licenses/LICENSE-2.0
 #
 # Unless required by applicable law or agreed to in writing, software
 # distributed under the License is distributed on an "AS IS" BASIS,
 # WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
 # See the License for the specific language governing permissions and
 # limitations under the License.
 from typing import TYPE_CHECKING
 from ...utils import _LazyModule
 from ...utils.import_utils import define_import_structure
 if TYPE_CHECKING:
    from .configuration_timm_wrapper import *
    from .modeling_timm_wrapper import *
    from .processing_timm_wrapper import *
 else:
    import sys
    _file = globals()["__file__"]
    sys.modules[__name__] = _LazyModule(__name__, _file, define_import_structure(_file), module_spec=__spec__)
--- a/src/transformers/models/timm_wrapper/configuration_timm_wrapper.py
+++ b/src/transformers/models/timm_wrapper/configuration_timm_wrapper.py
@@ -0,0 +1,86 @@
 # coding=utf-8
 # Copyright 2024 The HuggingFace Inc. team. All rights reserved.
 #
 # Licensed under the Apache License, Version 2.0 (the "License");
 # you may not use this file except in compliance with the License.
 # You may obtain a copy of the License at
 #
 #     http://www.apache.org/licenses/LICENSE-2.0
 #
 # Unless required by applicable law or agreed to in writing, software
 # distributed under the License is distributed on an "AS IS" BASIS,
 # WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
 # See the License for the specific language governing permissions and
 # limitations under the License.
 """Configuration for TimmWrapper models"""
 from typing import Any, Dict
 from ...configuration_utils import PretrainedConfig
 from ...utils import logging
 logger = logging.get_logger(__name__)
 class TimmWrapperConfig(PretrainedConfig):
    r"""
    This is the configuration class to store the configuration for a timm backbone [`TimmWrapper`].
    It is used to instantiate a timm model according to the specified arguments, defining the model.
    Configuration objects inherit from [`PretrainedConfig`] and can be used to control the model outputs. Read the
    documentation from [`PretrainedConfig`] for more information.
    Args:
        initializer_range (`float`, *optional*, defaults to 0.02):
            The standard deviation of the truncated_normal_initializer for initializing all weight matrices.
        do_pooling (`bool`, *optional*, defaults to `True`):
            Whether to do pooling for the last_hidden_state in `TimmWrapperModel` or not.
    Example:
    ```python
    >>> from transformers import TimmWrapperModel
    >>> # Initializing a timm model
    >>> model = TimmWrapperModel.from_pretrained("timm/resnet18.a1_in1k")
    >>> # Accessing the model configuration
    >>> configuration = model.config
    ```
    """
    model_type = "timm_wrapper"
    def __init__(self, initializer_range: float = 0.02, do_pooling: bool = True, **kwargs):
        self.initializer_range = initializer_range
        self.do_pooling = do_pooling
        super().__init__(**kwargs)
    @classmethod
    def from_dict(cls, config_dict: Dict[str, Any], **kwargs):
        # timm config stores the `num_classes` attribute in both the root of config and in the "pretrained_cfg" dict.
        # We are removing these attributes in order to have the native `transformers` num_labels attribute in config
        # and to avoid duplicate attributes
        num_labels_in_kwargs = kwargs.pop("num_labels", None)
        num_labels_in_dict = config_dict.pop("num_classes", None)
        # passed num_labels has priority over num_classes in config_dict
        kwargs["num_labels"] = num_labels_in_kwargs or num_labels_in_dict
        # pop num_classes from "pretrained_cfg",
        # it is not necessary to have it, only root one is used in timm
        if "pretrained_cfg" in config_dict and "num_classes" in config_dict["pretrained_cfg"]:
            config_dict["pretrained_cfg"].pop("num_classes", None)
        return super().from_dict(config_dict, **kwargs)
    def to_dict(self) -> Dict[str, Any]:
        output = super().to_dict()
        output["num_classes"] = self.num_labels
        return output
 __all__ = ["TimmWrapperConfig"]
--- a/src/transformers/models/timm_wrapper/image_processing_timm_wrapper.py
+++ b/src/transformers/models/timm_wrapper/image_processing_timm_wrapper.py
@@ -0,0 +1,138 @@
 # coding=utf-8
 # Copyright 2024 The HuggingFace Inc. team. All rights reserved.
 #
 # Licensed under the Apache License, Version 2.0 (the "License");
 # you may not use this file except in compliance with the License.
 # You may obtain a copy of the License at
 #
 #     http://www.apache.org/licenses/LICENSE-2.0
 #
 # Unless required by applicable law or agreed to in writing, software
 # distributed under the License is distributed on an "AS IS" BASIS,
 # WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
 # See the License for the specific language governing permissions and
 # limitations under the License.
 import os
 from typing import Any, Dict, Optional, Tuple, Union
 import torch
 from ...image_processing_utils import BaseImageProcessor, BatchFeature
 from ...image_transforms import to_pil_image
 from ...image_utils import ImageInput, make_list_of_images
 from ...utils import TensorType, logging, requires_backends
 from ...utils.import_utils import is_timm_available, is_torch_available
 if is_timm_available():
    import timm
 if is_torch_available():
    import torch
 logger = logging.get_logger(__name__)
 class TimmWrapperImageProcessor(BaseImageProcessor):
    """
    Wrapper class for timm models to be used within transformers.
    Args:
        pretrained_cfg (`Dict[str, Any]`):
            The configuration of the pretrained model used to resolve evaluation and
            training transforms.
        architecture (`Optional[str]`, *optional*):
            Name of the architecture of the model.
    """
    main_input_name = "pixel_values"
    def __init__(
        self,
        pretrained_cfg: Dict[str, Any],
        architecture: Optional[str] = None,
        **kwargs,
    ):
        requires_backends(self, "timm")
        super().__init__(architecture=architecture)
        self.data_config = timm.data.resolve_data_config(pretrained_cfg, model=None, verbose=False)
        self.val_transforms = timm.data.create_transform(**self.data_config, is_training=False)
        # useful for training, see examples/pytorch/image-classification/run_image_classification.py
        self.train_transforms = timm.data.create_transform(**self.data_config, is_training=True)
        # If `ToTensor` is in the transforms, then the input should be numpy array or PIL image.
        # Otherwise, the input can be a tensor. In later timm versions, `MaybeToTensor` is used
        # which can handle both numpy arrays / PIL images and tensors.
        self._not_supports_tensor_input = any(
            transform.__class__.__name__ == "ToTensor" for transform in self.val_transforms.transforms
        )
    def to_dict(self) -> Dict[str, Any]:
        """
        Serializes this instance to a Python dictionary.
        """
        output = super().to_dict()
        output.pop("train_transforms", None)
        output.pop("val_transforms", None)
        output.pop("_not_supports_tensor_input", None)
        return output
    @classmethod
    def get_image_processor_dict(
        cls, pretrained_model_name_or_path: Union[str, os.PathLike], **kwargs
    ) -> Tuple[Dict[str, Any], Dict[str, Any]]:
        """
        Get the image processor dict for the model.
        """
        image_processor_filename = kwargs.pop("image_processor_filename", "config.json")
        return super().get_image_processor_dict(
            pretrained_model_name_or_path, image_processor_filename=image_processor_filename, **kwargs
        )
    def preprocess(
        self,
        images: ImageInput,
        return_tensors: Optional[Union[str, TensorType]] = "pt",
    ) -> BatchFeature:
        """
        Preprocess an image or batch of images.
        Args:
            images (`ImageInput`):
                Image to preprocess. Expects a single or batch of images
            return_tensors (`str` or `TensorType`, *optional*):
                The type of tensors to return.
        """
        if return_tensors != "pt":
            raise ValueError(f"return_tensors for TimmWrapperImageProcessor must be 'pt', but got {return_tensors}")
        if self._not_supports_tensor_input and isinstance(images, torch.Tensor):
            images = images.cpu().numpy()
        # If the input is a torch tensor, then no conversion is needed
        # Otherwise, we need to pass in a list of PIL images
        if isinstance(images, torch.Tensor):
            images = self.val_transforms(images)
            # Add batch dimension if a single image
            images = images.unsqueeze(0) if images.ndim == 3 else images
        else:
            images = make_list_of_images(images)
            images = [to_pil_image(image) for image in images]
            images = torch.stack([self.val_transforms(image) for image in images])
        return BatchFeature({"pixel_values": images}, tensor_type=return_tensors)
    def save_pretrained(self, *args, **kwargs):
        # disable it to make checkpoint the same as in `timm` library.
        logger.warning_once(
            "The `save_pretrained` method is disabled for TimmWrapperImageProcessor. "
            "The image processor configuration is saved directly in `config.json` when "
            "`save_pretrained` is called for saving the model."
        )
 __all__ = ["TimmWrapperImageProcessor"]
--- a/src/transformers/models/timm_wrapper/modeling_timm_wrapper.py
+++ b/src/transformers/models/timm_wrapper/modeling_timm_wrapper.py
@@ -0,0 +1,363 @@
 # coding=utf-8
 # Copyright 2024 The HuggingFace Inc. team. All rights reserved.
 #
 # Licensed under the Apache License, Version 2.0 (the "License");
 # you may not use this file except in compliance with the License.
 # You may obtain a copy of the License at
 #
 #     http://www.apache.org/licenses/LICENSE-2.0
 #
 # Unless required by applicable law or agreed to in writing, software
 # distributed under the License is distributed on an "AS IS" BASIS,
 # WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
 # See the License for the specific language governing permissions and
 # limitations under the License.
 from dataclasses import dataclass
 from typing import List, Optional, Tuple, Union
 import torch
 from torch import Tensor, nn
 from torch.nn import BCEWithLogitsLoss, CrossEntropyLoss, MSELoss
 from ...modeling_outputs import ImageClassifierOutput, ModelOutput
 from ...modeling_utils import PreTrainedModel
 from ...utils import (
    add_start_docstrings_to_model_forward,
    is_timm_available,
    replace_return_docstrings,
    requires_backends,
 )
 from .configuration_timm_wrapper import TimmWrapperConfig
 if is_timm_available():
    import timm
@dataclass
 class TimmWrapperModelOutput(ModelOutput):
    """
    Output class for models TimmWrapperModel, containing the last hidden states, an optional pooled output,
    and optional hidden states.
    Args:
        last_hidden_state (`torch.FloatTensor`):
            The last hidden state of the model, output before applying the classification head.
        pooler_output (`torch.FloatTensor`, *optional*):
            The pooled output derived from the last hidden state, if applicable.
        hidden_states (`tuple(torch.FloatTensor)`, *optional*):
            A tuple containing the intermediate hidden states of the model at the output of each layer or specified layers.
            Returned if `output_hidden_states=True` is set or if `config.output_hidden_states=True`.
        attentions (`tuple(torch.FloatTensor)`, *optional*):
            A tuple containing the intermediate attention weights of the model at the output of each layer.
            Returned if `output_attentions=True` is set or if `config.output_attentions=True`.
            Note: Currently, Timm models do not support attentions output.
    """
    last_hidden_state: torch.FloatTensor
    pooler_output: Optional[torch.FloatTensor] = None
    hidden_states: Optional[Tuple[torch.FloatTensor, ...]] = None
    attentions: Optional[Tuple[torch.FloatTensor, ...]] = None
 TIMM_WRAPPER_INPUTS_DOCSTRING = r"""
    Args:
        pixel_values (`torch.FloatTensor` of shape `(batch_size, num_channels, height, width)`):
            Pixel values. Pixel values can be obtained using [`AutoImageProcessor`]. See [`TimmWrapperImageProcessor.preprocess`]
            for details.
        output_attentions (`bool`, *optional*):
            Whether or not to return the attentions tensors of all attention layers. Not compatible with timm wrapped models.
        output_hidden_states (`bool`, *optional*):
            Whether or not to return the hidden states of all layers. Not compatible with timm wrapped models.
        return_dict (`bool`, *optional*):
            Whether or not to return a [`~utils.ModelOutput`] instead of a plain tuple.
        **kwargs:
            Additional keyword arguments passed along to the `timm` model forward.
 """
 class TimmWrapperPreTrainedModel(PreTrainedModel):
    main_input_name = "pixel_values"
    config_class = TimmWrapperConfig
    _no_split_modules = []
    def __init__(self, *args, **kwargs):
        requires_backends(self, ["vision", "timm"])
        super().__init__(*args, **kwargs)
    @staticmethod
    def _fix_state_dict_key_on_load(key):
        """
        Overrides original method that renames `gamma` and `beta` to `weight` and `bias`.
        We don't want this behavior for timm wrapped models. Instead, this method adds a
        "timm_model." prefix to enable loading official timm Hub checkpoints.
        """
        if "timm_model." not in key:
            return f"timm_model.{key}"
        return key
    def _fix_state_dict_key_on_save(self, key):
        """
        Overrides original method to remove "timm_model." prefix from state_dict keys.
        Makes the saved checkpoint compatible with the `timm` library.
        """
        return key.replace("timm_model.", "")
    def load_state_dict(self, state_dict, *args, **kwargs):
        """
        Override original method to fix state_dict keys on load for cases when weights are loaded
        without using the `from_pretrained` method (e.g., in Trainer to resume from checkpoint).
        """
        state_dict = self._fix_state_dict_keys_on_load(state_dict)
        return super().load_state_dict(state_dict, *args, **kwargs)
    def _init_weights(self, module):
        """
        Initialize weights function to properly initialize Linear layer weights.
        Since model architectures may vary, we assume only the classifier requires
        initialization, while all other weights should be loaded from the checkpoint.
        """
        if isinstance(module, (nn.Linear)):
            module.weight.data.normal_(mean=0.0, std=self.config.initializer_range)
            if module.bias is not None:
                module.bias.data.zero_()
 class TimmWrapperModel(TimmWrapperPreTrainedModel):
    """
    Wrapper class for timm models to be used in transformers.
    """
    def __init__(self, config: TimmWrapperConfig):
        super().__init__(config)
        # using num_classes=0 to avoid creating classification head
        self.timm_model = timm.create_model(config.architecture, pretrained=False, num_classes=0)
        self.post_init()
    @add_start_docstrings_to_model_forward(TIMM_WRAPPER_INPUTS_DOCSTRING)
    @replace_return_docstrings(output_type=TimmWrapperModelOutput, config_class=TimmWrapperConfig)
    def forward(
        self,
        pixel_values: torch.FloatTensor,
        output_attentions: Optional[bool] = None,
        output_hidden_states: Optional[Union[bool, List[int]]] = None,
        return_dict: Optional[bool] = None,
        do_pooling: Optional[bool] = None,
        **kwargs,
    ) -> Union[TimmWrapperModelOutput, Tuple[Tensor, ...]]:
        r"""
        do_pooling (`bool`, *optional*):
            Whether to do pooling for the last_hidden_state in `TimmWrapperModel` or not. If `None` is passed, the
            `do_pooling` value from the config is used.
        Returns:
        Examples:
        ```python
        >>> import torch
        >>> from PIL import Image
        >>> from urllib.request import urlopen
        >>> from transformers import AutoModel, AutoImageProcessor
        >>> # Load image
        >>> image = Image.open(urlopen(
        ...     'https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/beignets-task-guide.png'
        ... ))
        >>> # Load model and image processor
        >>> checkpoint = "timm/resnet50.a1_in1k"
        >>> image_processor = AutoImageProcessor.from_pretrained(checkpoint)
        >>> model = AutoModel.from_pretrained(checkpoint).eval()
        >>> # Preprocess image
        >>> inputs = image_processor(image)
        >>> # Forward pass
        >>> with torch.no_grad():
        ...     outputs = model(**inputs)
        >>> # Get pooled output
        >>> pooled_output = outputs.pooler_output
        >>> # Get last hidden state
        >>> last_hidden_state = outputs.last_hidden_state
        ```
        """
        return_dict = return_dict if return_dict is not None else self.config.use_return_dict
        output_hidden_states = (
            output_hidden_states if output_hidden_states is not None else self.config.output_hidden_states
        )
        output_attentions = output_attentions if output_attentions is not None else self.config.output_attentions
        do_pooling = do_pooling if do_pooling is not None else self.config.do_pooling
        if output_attentions:
            raise ValueError("Cannot set `output_attentions` for timm models.")
        if output_hidden_states and not hasattr(self.timm_model, "forward_intermediates"):
            raise ValueError(
                "The 'output_hidden_states' option cannot be set for this timm model. "
                "To enable this feature, the 'forward_intermediates' method must be implemented "
                "in the timm model (available in timm versions > 1.*). Please consider using a "
                "different architecture or updating the timm package to a compatible version."
            )
        pixel_values = pixel_values.to(self.device, self.dtype)
        if output_hidden_states:
            # to enable hidden states selection
            if isinstance(output_hidden_states, (list, tuple)):
                kwargs["indices"] = output_hidden_states
            last_hidden_state, hidden_states = self.timm_model.forward_intermediates(pixel_values, **kwargs)
        else:
            last_hidden_state = self.timm_model.forward_features(pixel_values, **kwargs)
            hidden_states = None
        if do_pooling:
            # classification head is not created, applying pooling only
            pooler_output = self.timm_model.forward_head(last_hidden_state)
        else:
            pooler_output = None
        if not return_dict:
            outputs = (last_hidden_state, pooler_output, hidden_states)
            outputs = tuple(output for output in outputs if output is not None)
            return outputs
        return TimmWrapperModelOutput(
            last_hidden_state=last_hidden_state,
            pooler_output=pooler_output,
            hidden_states=hidden_states,
        )
 class TimmWrapperForImageClassification(TimmWrapperPreTrainedModel):
    """
    Wrapper class for timm models to be used in transformers for image classification.
    """
    def __init__(self, config: TimmWrapperConfig):
        super().__init__(config)
        if config.num_labels == 0:
            raise ValueError(
                "You are trying to load weights into `TimmWrapperForImageClassification` from a checkpoint with no classifier head. "
                "Please specify the number of classes, e.g. `model = TimmWrapperForImageClassification.from_pretrained(..., num_labels=10)`, "
                "or use `TimmWrapperModel` for feature extraction."
            )
        self.timm_model = timm.create_model(config.architecture, pretrained=False, num_classes=config.num_labels)
        self.num_labels = config.num_labels
        self.post_init()
    @add_start_docstrings_to_model_forward(TIMM_WRAPPER_INPUTS_DOCSTRING)
    @replace_return_docstrings(output_type=ImageClassifierOutput, config_class=TimmWrapperConfig)
    def forward(
        self,
        pixel_values: torch.FloatTensor,
        labels: Optional[torch.LongTensor] = None,
        output_attentions: Optional[bool] = None,
        output_hidden_states: Optional[Union[bool, List[int]]] = None,
        return_dict: Optional[bool] = None,
        **kwargs,
    ) -> Union[ImageClassifierOutput, Tuple[Tensor, ...]]:
        r"""
        labels (`torch.LongTensor` of shape `(batch_size,)`, *optional*):
            Labels for computing the image classification/regression loss. Indices should be in `[0, ...,
            config.num_labels - 1]`. If `config.num_labels == 1` a regression loss is computed (Mean-Square loss), If
            `config.num_labels > 1` a classification loss is computed (Cross-Entropy).
        Returns:
        Examples:
        ```python
        >>> import torch
        >>> from PIL import Image
        >>> from urllib.request import urlopen
        >>> from transformers import AutoModelForImageClassification, AutoImageProcessor
        >>> # Load image
        >>> image = Image.open(urlopen(
        ...     'https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/beignets-task-guide.png'
        ... ))
        >>> # Load model and image processor
        >>> checkpoint = "timm/resnet50.a1_in1k"
        >>> image_processor = AutoImageProcessor.from_pretrained(checkpoint)
        >>> model = AutoModelForImageClassification.from_pretrained(checkpoint).eval()
        >>> # Preprocess image
        >>> inputs = image_processor(image)
        >>> # Forward pass
        >>> with torch.no_grad():
        ...     logits = model(**inputs).logits
        >>> # Get top 5 predictions
        >>> top5_probabilities, top5_class_indices = torch.topk(logits.softmax(dim=1) * 100, k=5)
        ```
        """
        return_dict = return_dict if return_dict is not None else self.config.use_return_dict
        output_hidden_states = (
            output_hidden_states if output_hidden_states is not None else self.config.output_hidden_states
        )
        output_attentions = output_attentions if output_attentions is not None else self.config.output_attentions
        if output_attentions:
            raise ValueError("Cannot set `output_attentions` for timm models.")
        if output_hidden_states and not hasattr(self.timm_model, "forward_intermediates"):
            raise ValueError(
                "The 'output_hidden_states' option cannot be set for this timm model. "
                "To enable this feature, the 'forward_intermediates' method must be implemented "
                "in the timm model (available in timm versions > 1.*). Please consider using a "
                "different architecture or updating the timm package to a compatible version."
            )
        pixel_values = pixel_values.to(self.device, self.dtype)
        if output_hidden_states:
            # to enable hidden states selection
            if isinstance(output_hidden_states, (list, tuple)):
                kwargs["indices"] = output_hidden_states
            last_hidden_state, hidden_states = self.timm_model.forward_intermediates(pixel_values, **kwargs)
            logits = self.timm_model.forward_head(last_hidden_state)
        else:
            logits = self.timm_model(pixel_values, **kwargs)
            hidden_states = None
        loss = None
        if labels is not None:
            if self.config.problem_type is None:
                if self.num_labels == 1:
                    self.config.problem_type = "regression"
                elif self.num_labels > 1 and (labels.dtype == torch.long or labels.dtype == torch.int):
                    self.config.problem_type = "single_label_classification"
                else:
                    self.config.problem_type = "multi_label_classification"
            if self.config.problem_type == "regression":
                loss_fct = MSELoss()
                if self.num_labels == 1:
                    loss = loss_fct(logits.squeeze(), labels.squeeze())
                else:
                    loss = loss_fct(logits, labels)
            elif self.config.problem_type == "single_label_classification":
                loss_fct = CrossEntropyLoss()
                loss = loss_fct(logits.view(-1, self.num_labels), labels.view(-1))
            elif self.config.problem_type == "multi_label_classification":
                loss_fct = BCEWithLogitsLoss()
                loss = loss_fct(logits, labels)
        if not return_dict:
            outputs = (loss, logits, hidden_states)
            outputs = tuple(output for output in outputs if output is not None)
            return outputs
        return ImageClassifierOutput(
            loss=loss,
            logits=logits,
            hidden_states=hidden_states,
        )
 __all__ = ["TimmWrapperPreTrainedModel", "TimmWrapperModel", "TimmWrapperForImageClassification"]
--- a/src/transformers/utils/init.py
+++ b/src/transformers/utils/init.py
@@ -55,6 +55,8 @@ from .generic import (
    is_tensor,
    is_tf_symbolic_tensor,
    is_tf_tensor,
    is_timm_config_dict,
    is_timm_local_checkpoint,
    is_torch_device,
    is_torch_dtype,
    is_torch_tensor,
--- a/src/transformers/utils/dummy_pt_objects.py
+++ b/src/transformers/utils/dummy_pt_objects.py
@@ -9043,6 +9043,27 @@ class TimmBackbone(metaclass=DummyObject):
        requires_backends(self, ["torch"])
 class TimmWrapperForImageClassification(metaclass=DummyObject):
    _backends = ["torch"]
    def __init__(self, *args, **kwargs):
        requires_backends(self, ["torch"])
 class TimmWrapperModel(metaclass=DummyObject):
    _backends = ["torch"]
    def __init__(self, *args, **kwargs):
        requires_backends(self, ["torch"])
 class TimmWrapperPreTrainedModel(metaclass=DummyObject):
    _backends = ["torch"]
    def __init__(self, *args, **kwargs):
        requires_backends(self, ["torch"])
 class TrOCRForCausalLM(metaclass=DummyObject):
    _backends = ["torch"]
--- a/src/transformers/utils/dummy_timm_and_torchvision_objects.py
+++ b/src/transformers/utils/dummy_timm_and_torchvision_objects.py
@@ -0,0 +1,9 @@
 # This file is autogenerated by the command `make fix-copies`, do not edit.
 from ..utils import DummyObject, requires_backends
 class TimmWrapperImageProcessor(metaclass=DummyObject):
    _backends = ["timm", "torchvision"]
    def __init__(self, *args, **kwargs):
        requires_backends(self, ["timm", "torchvision"])
--- a/src/transformers/utils/generic.py
+++ b/src/transformers/utils/generic.py
@@ -16,6 +16,8 @@ Generic utilities
 """
 import inspect
 import json
 import os
 import tempfile
 import warnings
 from collections import OrderedDict, UserDict
@@ -24,7 +26,7 @@ from contextlib import ExitStack, contextmanager
 from dataclasses import fields, is_dataclass
 from enum import Enum
 from functools import partial, wraps
-from typing import Any, ContextManager, Iterable, List, Optional, Tuple, TypedDict
+from typing import Any, ContextManager, Dict, Iterable, List, Optional, Tuple, TypedDict
 import numpy as np
 from packaging import version
@@ -867,3 +869,36 @@ class LossKwargs(TypedDict, total=False):
    """
    num_items_in_batch: Optional[int]
 def is_timm_config_dict(config_dict: Dict[str, Any]) -> bool:
    """Checks whether a config dict is a timm config dict."""
    return "pretrained_cfg" in config_dict
 def is_timm_local_checkpoint(pretrained_model_path: str) -> bool:
    """
    Checks whether a checkpoint is a timm model checkpoint.
    """
    if pretrained_model_path is None:
        return False
    # in case it's Path, not str
    pretrained_model_path = str(pretrained_model_path)
    is_file = os.path.isfile(pretrained_model_path)
    is_dir = os.path.isdir(pretrained_model_path)
    # pretrained_model_path is a file
    if is_file and pretrained_model_path.endswith(".json"):
        with open(pretrained_model_path, "r") as f:
            config_dict = json.load(f)
        return is_timm_config_dict(config_dict)
    # pretrained_model_path is a directory with a config.json
    if is_dir and os.path.exists(os.path.join(pretrained_model_path, "config.json")):
        with open(os.path.join(pretrained_model_path, "config.json"), "r") as f:
            config_dict = json.load(f)
        return is_timm_config_dict(config_dict)
    return False
--- a/tests/models/timm_wrapper/init.py
+++ b/tests/models/timm_wrapper/init.py
--- a/tests/models/timm_wrapper/test_image_processing_timm_wrapper.py
+++ b/tests/models/timm_wrapper/test_image_processing_timm_wrapper.py
@@ -0,0 +1,103 @@
 # coding=utf-8
 # Copyright 2024 HuggingFace Inc.
 #
 # Licensed under the Apache License, Version 2.0 (the "License");
 # you may not use this file except in compliance with the License.
 # You may obtain a copy of the License at
 #
 #     http://www.apache.org/licenses/LICENSE-2.0
 #
 # Unless required by applicable law or agreed to in writing, software
 # distributed under the License is distributed on an "AS IS" BASIS,
 # WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
 # See the License for the specific language governing permissions and
 # limitations under the License.
 import tempfile
 import unittest
 import numpy as np
 from transformers.testing_utils import require_torch, require_torchvision, require_vision
 from transformers.utils import is_torch_available, is_vision_available
 if is_torch_available():
    import torch
 if is_vision_available():
    from PIL import Image
    from transformers import TimmWrapperConfig, TimmWrapperImageProcessor
@require_torch
@require_vision
@require_torchvision
 class TimmWrapperImageProcessingTest(unittest.TestCase):
    image_processing_class = TimmWrapperImageProcessor if is_vision_available() else None
    def setUp(self):
        super().setUp()
        self.temp_dir = tempfile.TemporaryDirectory()
        config = TimmWrapperConfig.from_pretrained("timm/resnet18.a1_in1k")
        config.save_pretrained(self.temp_dir.name)
    def tearDown(self):
        self.temp_dir.cleanup()
    def test_load_from_hub(self):
        image_processor = TimmWrapperImageProcessor.from_pretrained("timm/resnet18.a1_in1k")
        self.assertIsInstance(image_processor, TimmWrapperImageProcessor)
    def test_load_from_local_dir(self):
        image_processor = TimmWrapperImageProcessor.from_pretrained(self.temp_dir.name)
        self.assertIsInstance(image_processor, TimmWrapperImageProcessor)
    def test_image_processor_properties(self):
        image_processor = TimmWrapperImageProcessor.from_pretrained(self.temp_dir.name)
        self.assertTrue(hasattr(image_processor, "data_config"))
        self.assertTrue(hasattr(image_processor, "val_transforms"))
        self.assertTrue(hasattr(image_processor, "train_transforms"))
    def test_image_processor_call_numpy(self):
        image_processor = TimmWrapperImageProcessor.from_pretrained(self.temp_dir.name)
        single_image = np.random.randint(256, size=(256, 256, 3), dtype=np.uint8)
        batch_images = [single_image, single_image, single_image]
        # single image
        pixel_values = image_processor(single_image).pixel_values
        self.assertEqual(pixel_values.shape, (1, 3, 224, 224))
        # batch images
        pixel_values = image_processor(batch_images).pixel_values
        self.assertEqual(pixel_values.shape, (3, 3, 224, 224))
    def test_image_processor_call_pil(self):
        image_processor = TimmWrapperImageProcessor.from_pretrained(self.temp_dir.name)
        single_image = Image.fromarray(np.random.randint(256, size=(256, 256, 3), dtype=np.uint8))
        batch_images = [single_image, single_image, single_image]
        # single image
        pixel_values = image_processor(single_image).pixel_values
        self.assertEqual(pixel_values.shape, (1, 3, 224, 224))
        # batch images
        pixel_values = image_processor(batch_images).pixel_values
        self.assertEqual(pixel_values.shape, (3, 3, 224, 224))
    def test_image_processor_call_tensor(self):
        image_processor = TimmWrapperImageProcessor.from_pretrained(self.temp_dir.name)
        single_image = torch.from_numpy(np.random.randint(256, size=(3, 256, 256), dtype=np.uint8)).float()
        batch_images = [single_image, single_image, single_image]
        # single image
        pixel_values = image_processor(single_image).pixel_values
        self.assertEqual(pixel_values.shape, (1, 3, 224, 224))
        # batch images
        pixel_values = image_processor(batch_images).pixel_values
        self.assertEqual(pixel_values.shape, (3, 3, 224, 224))
--- a/tests/models/timm_wrapper/test_modeling_timm_wrapper.py
+++ b/tests/models/timm_wrapper/test_modeling_timm_wrapper.py
@@ -0,0 +1,366 @@
 # coding=utf-8
 # Copyright 2024 The HuggingFace Inc. team. All rights reserved.
 #
 # Licensed under the Apache License, Version 2.0 (the "License");
 # you may not use this file except in compliance with the License.
 # You may obtain a copy of the License at
 #
 #     http://www.apache.org/licenses/LICENSE-2.0
 #
 # Unless required by applicable law or agreed to in writing, software
 # distributed under the License is distributed on an "AS IS" BASIS,
 # WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
 # See the License for the specific language governing permissions and
 # limitations under the License.
 import inspect
 import tempfile
 import unittest
 from transformers.testing_utils import (
    require_bitsandbytes,
    require_timm,
    require_torch,
    require_vision,
    slow,
    torch_device,
 )
 from transformers.utils.import_utils import is_timm_available, is_torch_available, is_vision_available
 from ...test_configuration_common import ConfigTester
 from ...test_modeling_common import ModelTesterMixin, floats_tensor
 from ...test_pipeline_mixin import PipelineTesterMixin
 if is_torch_available():
    import torch
    from transformers import TimmWrapperConfig, TimmWrapperForImageClassification, TimmWrapperModel
 if is_timm_available():
    import timm
 if is_vision_available():
    from PIL import Image
    from transformers import TimmWrapperImageProcessor
 class TimmWrapperModelTester:
    def __init__(
        self,
        parent,
        model_name="timm/resnet18.a1_in1k",
        batch_size=3,
        image_size=32,
        num_channels=3,
        is_training=True,
    ):
        self.parent = parent
        self.model_name = model_name
        self.batch_size = batch_size
        self.image_size = image_size
        self.num_channels = num_channels
        self.is_training = is_training
    def prepare_config_and_inputs(self):
        pixel_values = floats_tensor([self.batch_size, self.num_channels, self.image_size, self.image_size])
        config = self.get_config()
        return config, pixel_values
    def get_config(self):
        return TimmWrapperConfig.from_pretrained(self.model_name)
    def create_and_check_model(self, config, pixel_values):
        model = TimmWrapperModel(config=config)
        model.to(torch_device)
        model.eval()
        with torch.no_grad():
            result = model(pixel_values)
        self.parent.assertEqual(
            result.feature_map[-1].shape,
            (self.batch_size, model.channels[-1], 14, 14),
        )
    def prepare_config_and_inputs_for_common(self):
        config_and_inputs = self.prepare_config_and_inputs()
        config, pixel_values = config_and_inputs
        inputs_dict = {"pixel_values": pixel_values}
        return config, inputs_dict
@require_torch
@require_timm
 class TimmWrapperModelTest(ModelTesterMixin, PipelineTesterMixin, unittest.TestCase):
    all_model_classes = (TimmWrapperModel, TimmWrapperForImageClassification) if is_torch_available() else ()
    pipeline_model_mapping = (
        {"image-feature-extraction": TimmWrapperModel, "image-classification": TimmWrapperForImageClassification}
        if is_torch_available()
        else {}
    )
    test_resize_embeddings = False
    test_head_masking = False
    test_pruning = False
    has_attentions = False
    test_model_parallel = False
    def setUp(self):
        self.config_class = TimmWrapperConfig
        self.model_tester = TimmWrapperModelTester(self)
        self.config_tester = ConfigTester(
            self,
            config_class=self.config_class,
            has_text_modality=False,
            common_properties=[],
            model_name="timm/resnet18.a1_in1k",
        )
    def test_config(self):
        self.config_tester.run_common_tests()
    def test_hidden_states_output(self):
        config, inputs_dict = self.model_tester.prepare_config_and_inputs_for_common()
        for model_class in self.all_model_classes:
            model = model_class(config)
            # check all hidden states
            with torch.no_grad():
                outputs = model(**inputs_dict, output_hidden_states=True)
            self.assertTrue(
                len(outputs.hidden_states) == 5, f"expected 5 hidden states, but got {len(outputs.hidden_states)}"
            )
            expected_shapes = [[16, 16], [8, 8], [4, 4], [2, 2], [1, 1]]
            resulted_shapes = [list(h.shape[2:]) for h in outputs.hidden_states]
            self.assertListEqual(expected_shapes, resulted_shapes)
            # check we can select hidden states by indices
            with torch.no_grad():
                outputs = model(**inputs_dict, output_hidden_states=[-2, -1])
            self.assertTrue(
                len(outputs.hidden_states) == 2, f"expected 2 hidden states, but got {len(outputs.hidden_states)}"
            )
            expected_shapes = [[2, 2], [1, 1]]
            resulted_shapes = [list(h.shape[2:]) for h in outputs.hidden_states]
            self.assertListEqual(expected_shapes, resulted_shapes)
    @unittest.skip(reason="TimmWrapper models doesn't have inputs_embeds")
    def test_inputs_embeds(self):
        pass
    @unittest.skip(reason="TimmWrapper models doesn't have inputs_embeds")
    def test_model_get_set_embeddings(self):
        pass
    @unittest.skip(reason="TimmWrapper doesn't support output_attentions=True.")
    def test_torchscript_output_attentions(self):
        pass
    @unittest.skip(reason="TimmWrapper doesn't support this.")
    def test_retain_grad_hidden_states_attentions(self):
        pass
    @unittest.skip(reason="TimmWrapper initialization is managed on the timm side")
    def test_initialization(self):
        pass
    @unittest.skip(reason="Need to use a timm model and there is no tiny model available.")
    def test_model_is_small(self):
        pass
    def test_forward_signature(self):
        config, _ = self.model_tester.prepare_config_and_inputs_for_common()
        for model_class in self.all_model_classes:
            model = model_class(config)
            signature = inspect.signature(model.forward)
            # signature.parameters is an OrderedDict => so arg_names order is deterministic
            arg_names = [*signature.parameters.keys()]
            expected_arg_names = ["pixel_values"]
            self.assertListEqual(arg_names[:1], expected_arg_names)
    def test_do_pooling_option(self):
        config, inputs_dict = self.model_tester.prepare_config_and_inputs_for_common()
        config.do_pooling = False
        model = TimmWrapperModel._from_config(config)
        # check there is no pooling
        with torch.no_grad():
            output = model(**inputs_dict)
        self.assertIsNone(output.pooler_output)
        # check there is pooler output
        with torch.no_grad():
            output = model(**inputs_dict, do_pooling=True)
        self.assertIsNotNone(output.pooler_output)
 # We will verify our results on an image of cute cats
 def prepare_img():
    image = Image.open("./tests/fixtures/tests_samples/COCO/000000039769.png")
    return image
@require_torch
@require_timm
@require_vision
 class TimmWrapperModelIntegrationTest(unittest.TestCase):
    # some popular ones
    model_names_to_test = [
        "vit_small_patch16_384.augreg_in21k_ft_in1k",
        "resnet50.a1_in1k",
        "tf_mobilenetv3_large_minimal_100.in1k",
        "swin_tiny_patch4_window7_224.ms_in1k",
        "ese_vovnet19b_dw.ra_in1k",
        "hrnet_w18.ms_aug_in1k",
    ]
    @slow
    def test_inference_image_classification_head(self):
        checkpoint = "timm/resnet18.a1_in1k"
        model = TimmWrapperForImageClassification.from_pretrained(checkpoint, device_map=torch_device).eval()
        image_processor = TimmWrapperImageProcessor.from_pretrained(checkpoint)
        image = prepare_img()
        inputs = image_processor(images=image, return_tensors="pt").to(torch_device)
        # forward pass
        with torch.no_grad():
            outputs = model(**inputs)
        # verify the shape and logits
        expected_shape = torch.Size((1, 1000))
        self.assertEqual(outputs.logits.shape, expected_shape)
        expected_label = 281  # tabby cat
        self.assertEqual(torch.argmax(outputs.logits).item(), expected_label)
        expected_slice = torch.tensor([-11.2618, -9.6192, -10.3205]).to(torch_device)
        resulted_slice = outputs.logits[0, :3]
        is_close = torch.allclose(resulted_slice, expected_slice, atol=1e-3)
        self.assertTrue(is_close, f"Expected {expected_slice}, but got {resulted_slice}")
    @slow
    @require_bitsandbytes
    def test_inference_image_classification_quantized(self):
        from transformers import BitsAndBytesConfig
        checkpoint = "timm/vit_small_patch16_384.augreg_in21k_ft_in1k"
        quantization_config = BitsAndBytesConfig(load_in_8bit=True)
        model = TimmWrapperForImageClassification.from_pretrained(
            checkpoint, quantization_config=quantization_config, device_map=torch_device
        ).eval()
        image_processor = TimmWrapperImageProcessor.from_pretrained(checkpoint)
        image = prepare_img()
        inputs = image_processor(images=image, return_tensors="pt").to(torch_device)
        # forward pass
        with torch.no_grad():
            outputs = model(**inputs)
        # verify the shape and logits
        expected_shape = torch.Size((1, 1000))
        self.assertEqual(outputs.logits.shape, expected_shape)
        expected_label = 281  # tabby cat
        self.assertEqual(torch.argmax(outputs.logits).item(), expected_label)
        expected_slice = torch.tensor([-2.4043, 1.4492, -0.5127]).to(outputs.logits.dtype)
        resulted_slice = outputs.logits[0, :3].cpu()
        is_close = torch.allclose(resulted_slice, expected_slice, atol=0.1)
        self.assertTrue(is_close, f"Expected {expected_slice}, but got {resulted_slice}")
    @slow
    def test_transformers_model_for_classification_is_equivalent_to_timm(self):
        # check that wrapper logits are the same as timm model logits
        image = prepare_img()
        for model_name in self.model_names_to_test:
            checkpoint = f"timm/{model_name}"
            with self.subTest(msg=model_name):
                # prepare inputs
                image_processor = TimmWrapperImageProcessor.from_pretrained(checkpoint)
                pixel_values = image_processor(images=image).pixel_values.to(torch_device)
                # load models
                model = TimmWrapperForImageClassification.from_pretrained(checkpoint, device_map=torch_device).eval()
                timm_model = timm.create_model(model_name, pretrained=True).to(torch_device).eval()
                with torch.inference_mode():
                    outputs = model(pixel_values)
                    timm_outputs = timm_model(pixel_values)
                # check shape is the same
                self.assertEqual(outputs.logits.shape, timm_outputs.shape)
                # check logits are the same
                diff = (outputs.logits - timm_outputs).max().item()
                self.assertLess(diff, 1e-4)
    @slow
    def test_transformers_model_is_equivalent_to_timm(self):
        # check that wrapper logits are the same as timm model logits
        image = prepare_img()
        models_to_test = ["vit_small_patch16_224.dino"] + self.model_names_to_test
        for model_name in models_to_test:
            checkpoint = f"timm/{model_name}"
            with self.subTest(msg=model_name):
                # prepare inputs
                image_processor = TimmWrapperImageProcessor.from_pretrained(checkpoint)
                pixel_values = image_processor(images=image).pixel_values.to(torch_device)
                # load models
                model = TimmWrapperModel.from_pretrained(checkpoint, device_map=torch_device).eval()
                timm_model = timm.create_model(model_name, pretrained=True, num_classes=0).to(torch_device).eval()
                with torch.inference_mode():
                    outputs = model(pixel_values)
                    timm_outputs = timm_model(pixel_values)
                # check shape is the same
                self.assertEqual(outputs.pooler_output.shape, timm_outputs.shape)
                # check logits are the same
                diff = (outputs.pooler_output - timm_outputs).max().item()
                self.assertLess(diff, 1e-4)
    @slow
    def test_save_load_to_timm(self):
        # test that timm model can be loaded to transformers, saved and then loaded back into timm
        model = TimmWrapperForImageClassification.from_pretrained(
            "timm/resnet18.a1_in1k", num_labels=10, ignore_mismatched_sizes=True
        )
        with tempfile.TemporaryDirectory() as tmpdirname:
            model.save_pretrained(tmpdirname)
            # there is no direct way to load timm model from folder, use the same config + path to weights
            timm_model = timm.create_model(
                "resnet18", num_classes=10, checkpoint_path=f"{tmpdirname}/model.safetensors"
            )
        # check that all weights are the same after reload
        different_weights = []
        for (name1, param1), (name2, param2) in zip(
            model.timm_model.named_parameters(), timm_model.named_parameters()
        ):
            if param1.shape != param2.shape or not torch.equal(param1, param2):
                different_weights.append((name1, name2))
        if different_weights:
            self.fail(f"Found different weights after reloading: {different_weights}")
--- a/tests/test_modeling_common.py
+++ b/tests/test_modeling_common.py
@@ -3443,6 +3443,7 @@ class ModelTesterMixin:
                "Data2VecAudioForSequenceClassification",
                "UniSpeechForSequenceClassification",
                "PvtForImageClassification",
                "TimmWrapperForImageClassification",
            ]
            special_param_names = [
                r"^bit\.",
@@ -3463,6 +3464,7 @@ class ModelTesterMixin:
                r"^swiftformer\.",
                r"^swinv2\.",
                r"^transformers\.models\.swiftformer\.",
                r"^timm_model\.",
                r"^unispeech\.",
                r"^unispeech_sat\.",
                r"^vision_model\.",
--- a/utils/check_config_docstrings.py
+++ b/utils/check_config_docstrings.py
@@ -41,6 +41,7 @@ CONFIG_CLASSES_TO_IGNORE_FOR_DOCSTRING_CHECKPOINT_CHECK = {
    "RagConfig",
    "SpeechEncoderDecoderConfig",
    "TimmBackboneConfig",
    "TimmWrapperConfig",
    "VisionEncoderDecoderConfig",
    "VisionTextDualEncoderConfig",
    "LlamaConfig",