Deprecate low use models (#30781)

* Deprecate models
  - graphormer
  - time_series_transformer
  - xlm_prophetnet
  - qdqbert
  - nat
  - ernie_m
  - tvlt
  - nezha
  - mega
  - jukebox
  - vit_hybrid
  - x_clip
  - deta
  - speech_to_text_2
  - efficientformer
  - realm
  - gptsan_japanese
* Fix up
* Fix speech2text2 imports
* Make sure message isn't indented
* Fix docstrings
* Correctly map for deprecated models from model_type
* Uncomment out
* Add back time series transformer and x-clip
* Import fix and fix-up
* Fix up with updated ruff
2024-05-28 18:07:07 +01:00
parent 7f08817be4
commit a564d10afe
142 changed files with 1308 additions and 11908 deletions
--- a/docs/source/en/model_doc/deta.md
+++ b/docs/source/en/model_doc/deta.md
@@ -16,6 +16,14 @@ rendered properly in your Markdown viewer.

 # DETA

+<Tip warning={true}>
+
+This model is in maintenance mode only, we don't accept any new PRs changing its code.
+If you run into any issues running this model, please reinstall the last version that supported this model: v4.40.2.
+You can do so by running the following command: `pip install -U transformers==4.40.2`.
+
+</Tip>
+
 ## Overview

 The DETA model was proposed in [NMS Strikes Back](https://arxiv.org/abs/2212.06137) by Jeffrey Ouyang-Zhang, Jang Hyun Cho, Xingyi Zhou, Philipp Krähenbühl.
--- a/docs/source/en/model_doc/efficientformer.md
+++ b/docs/source/en/model_doc/efficientformer.md
@@ -16,28 +16,36 @@ rendered properly in your Markdown viewer.

 # EfficientFormer

+<Tip warning={true}>
+
+This model is in maintenance mode only, we don't accept any new PRs changing its code.
+If you run into any issues running this model, please reinstall the last version that supported this model: v4.40.2.
+You can do so by running the following command: `pip install -U transformers==4.40.2`.
+
+</Tip>
+
 ## Overview

-The EfficientFormer model was proposed in [EfficientFormer: Vision Transformers at MobileNet Speed](https://arxiv.org/abs/2206.01191) 
+The EfficientFormer model was proposed in [EfficientFormer: Vision Transformers at MobileNet Speed](https://arxiv.org/abs/2206.01191)
 by Yanyu Li, Geng Yuan, Yang Wen, Eric Hu, Georgios Evangelidis, Sergey Tulyakov, Yanzhi Wang, Jian Ren.  EfficientFormer proposes a
 dimension-consistent pure transformer that can be run on mobile devices for dense prediction tasks like image classification, object
 detection and semantic segmentation.

 The abstract from the paper is the following:

-*Vision Transformers (ViT) have shown rapid progress in computer vision tasks, achieving promising results on various benchmarks. 
-However, due to the massive number of parameters and model design, e.g., attention mechanism, ViT-based models are generally 
-times slower than lightweight convolutional networks. Therefore, the deployment of ViT for real-time applications is particularly 
-challenging, especially on resource-constrained hardware such as mobile devices. Recent efforts try to reduce the computation 
-complexity of ViT through network architecture search or hybrid design with MobileNet block, yet the inference speed is still 
-unsatisfactory. This leads to an important question: can transformers run as fast as MobileNet while obtaining high performance? 
-To answer this, we first revisit the network architecture and operators used in ViT-based models and identify inefficient designs. 
-Then we introduce a dimension-consistent pure transformer (without MobileNet blocks) as a design paradigm. 
-Finally, we perform latency-driven slimming to get a series of final models dubbed EfficientFormer. 
-Extensive experiments show the superiority of EfficientFormer in performance and speed on mobile devices. 
-Our fastest model, EfficientFormer-L1, achieves 79.2% top-1 accuracy on ImageNet-1K with only 1.6 ms inference latency on 
-iPhone 12 (compiled with CoreML), which { runs as fast as MobileNetV2×1.4 (1.6 ms, 74.7% top-1),} and our largest model, 
-EfficientFormer-L7, obtains 83.3% accuracy with only 7.0 ms latency. Our work proves that properly designed transformers can 
+*Vision Transformers (ViT) have shown rapid progress in computer vision tasks, achieving promising results on various benchmarks.
+However, due to the massive number of parameters and model design, e.g., attention mechanism, ViT-based models are generally
+times slower than lightweight convolutional networks. Therefore, the deployment of ViT for real-time applications is particularly
+challenging, especially on resource-constrained hardware such as mobile devices. Recent efforts try to reduce the computation
+complexity of ViT through network architecture search or hybrid design with MobileNet block, yet the inference speed is still
+unsatisfactory. This leads to an important question: can transformers run as fast as MobileNet while obtaining high performance?
+To answer this, we first revisit the network architecture and operators used in ViT-based models and identify inefficient designs.
+Then we introduce a dimension-consistent pure transformer (without MobileNet blocks) as a design paradigm.
+Finally, we perform latency-driven slimming to get a series of final models dubbed EfficientFormer.
+Extensive experiments show the superiority of EfficientFormer in performance and speed on mobile devices.
+Our fastest model, EfficientFormer-L1, achieves 79.2% top-1 accuracy on ImageNet-1K with only 1.6 ms inference latency on
+iPhone 12 (compiled with CoreML), which { runs as fast as MobileNetV2×1.4 (1.6 ms, 74.7% top-1),} and our largest model,
+EfficientFormer-L7, obtains 83.3% accuracy with only 7.0 ms latency. Our work proves that properly designed transformers can
 reach extremely low latency on mobile devices while maintaining high performance.*

 This model was contributed by [novice03](https://huggingface.co/novice03) and [Bearnardd](https://huggingface.co/Bearnardd).
@@ -93,4 +101,4 @@ The original code can be found [here](https://github.com/snap-research/Efficient
    - call

 </tf>
-</frameworkcontent>
+</frameworkcontent>
--- a/docs/source/en/model_doc/ernie_m.md
+++ b/docs/source/en/model_doc/ernie_m.md
@@ -16,6 +16,14 @@ rendered properly in your Markdown viewer.

 # ErnieM

+<Tip warning={true}>
+
+This model is in maintenance mode only, we don't accept any new PRs changing its code.
+If you run into any issues running this model, please reinstall the last version that supported this model: v4.40.2.
+You can do so by running the following command: `pip install -U transformers==4.40.2`.
+
+</Tip>
+
 ## Overview

 The ErnieM model was proposed in [ERNIE-M: Enhanced Multilingual Representation by Aligning
--- a/docs/source/en/model_doc/gptsan-japanese.md
+++ b/docs/source/en/model_doc/gptsan-japanese.md
@@ -16,6 +16,14 @@ rendered properly in your Markdown viewer.

 # GPTSAN-japanese

+<Tip warning={true}>
+
+This model is in maintenance mode only, we don't accept any new PRs changing its code.
+If you run into any issues running this model, please reinstall the last version that supported this model: v4.40.2.
+You can do so by running the following command: `pip install -U transformers==4.40.2`.
+
+</Tip>
+
 ## Overview

 The GPTSAN-japanese model was released in the repository by Toshiyuki Sakamoto (tanreinama).
--- a/docs/source/en/model_doc/graphormer.md
+++ b/docs/source/en/model_doc/graphormer.md
@@ -1,7 +1,7 @@
 <!--Copyright 2022 The HuggingFace Team and Microsoft. All rights reserved.

 Licensed under the MIT License; you may not use this file except in compliance with
-the License. 
+the License.

 Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on
 an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the
@@ -14,9 +14,17 @@ rendered properly in your Markdown viewer.

 # Graphormer

+<Tip warning={true}>
+
+This model is in maintenance mode only, we don't accept any new PRs changing its code.
+If you run into any issues running this model, please reinstall the last version that supported this model: v4.40.2.
+You can do so by running the following command: `pip install -U transformers==4.40.2`.
+
+</Tip>
+
 ## Overview

-The Graphormer model was proposed in [Do Transformers Really Perform Bad for Graph Representation?](https://arxiv.org/abs/2106.05234)  by 
+The Graphormer model was proposed in [Do Transformers Really Perform Bad for Graph Representation?](https://arxiv.org/abs/2106.05234)  by
 Chengxuan Ying, Tianle Cai, Shengjie Luo, Shuxin Zheng, Guolin Ke, Di He, Yanming Shen and Tie-Yan Liu. It is a Graph Transformer model, modified to allow computations on graphs instead of text sequences by generating embeddings and features of interest during preprocessing and collation, then using a modified attention.

 The abstract from the paper is the following:
--- a/docs/source/en/model_doc/jukebox.md
+++ b/docs/source/en/model_doc/jukebox.md
@@ -15,6 +15,14 @@ rendered properly in your Markdown viewer.
 -->
 # Jukebox

+<Tip warning={true}>
+
+This model is in maintenance mode only, we don't accept any new PRs changing its code.
+If you run into any issues running this model, please reinstall the last version that supported this model: v4.40.2.
+You can do so by running the following command: `pip install -U transformers==4.40.2`.
+
+</Tip>
+
 ## Overview

 The Jukebox model was proposed in [Jukebox: A generative model for music](https://arxiv.org/pdf/2005.00341.pdf)
@@ -27,7 +35,7 @@ The abstract from the paper is the following:
 *We introduce Jukebox, a model that generates music with singing in the raw audio domain. We tackle the long context of raw audio using a multiscale VQ-VAE to compress it to discrete codes, and modeling those using autoregressive Transformers. We show that the combined model at scale can generate high-fidelity and diverse songs with coherence up to multiple minutes. We can condition on artist and genre to steer the musical and vocal style, and on unaligned lyrics to make the singing more controllable. We are releasing thousands of non cherry-picked samples, along with model weights and code.*

 As shown on the following figure, Jukebox is made of 3 `priors` which are decoder only models. They follow the architecture described in [Generating Long Sequences with Sparse Transformers](https://arxiv.org/abs/1904.10509), modified to support longer context length.
-First, a autoencoder is used to encode the text lyrics. Next, the first (also called `top_prior`) prior attends to the last hidden states extracted from the lyrics encoder. The priors are linked to the previous priors respectively via an `AudioConditioner` module. The`AudioConditioner` upsamples the outputs of the previous prior to raw tokens at a certain audio frame per second resolution. 
+First, a autoencoder is used to encode the text lyrics. Next, the first (also called `top_prior`) prior attends to the last hidden states extracted from the lyrics encoder. The priors are linked to the previous priors respectively via an `AudioConditioner` module. The`AudioConditioner` upsamples the outputs of the previous prior to raw tokens at a certain audio frame per second resolution.
 The metadata such as *artist, genre and timing* are passed to each prior, in the form of a start token and positional embedding for the timing data.  The hidden states are mapped to the closest codebook vector from the VQVAE in order to convert them to raw audio.

 ![JukeboxModel](https://gist.githubusercontent.com/ArthurZucker/92c1acaae62ebf1b6a951710bdd8b6af/raw/c9c517bf4eff61393f6c7dec9366ef02bdd059a3/jukebox.svg)
--- a/docs/source/en/model_doc/mega.md
+++ b/docs/source/en/model_doc/mega.md
@@ -16,12 +16,20 @@ rendered properly in your Markdown viewer.

 # MEGA

+<Tip warning={true}>
+
+This model is in maintenance mode only, we don't accept any new PRs changing its code.
+If you run into any issues running this model, please reinstall the last version that supported this model: v4.40.2.
+You can do so by running the following command: `pip install -U transformers==4.40.2`.
+
+</Tip>
+
 ## Overview

 The MEGA model was proposed in [Mega: Moving Average Equipped Gated Attention](https://arxiv.org/abs/2209.10655) by Xuezhe Ma, Chunting Zhou, Xiang Kong, Junxian He, Liangke Gui, Graham Neubig, Jonathan May, and Luke Zettlemoyer.
-MEGA proposes a new approach to self-attention with each encoder layer having a multi-headed exponential moving average in addition to a single head of standard dot-product attention, giving the attention mechanism 
-stronger positional biases. This allows MEGA to perform competitively to Transformers on standard benchmarks including LRA 
-while also having significantly fewer parameters. MEGA's compute efficiency allows it to scale to very long sequences, making it an 
+MEGA proposes a new approach to self-attention with each encoder layer having a multi-headed exponential moving average in addition to a single head of standard dot-product attention, giving the attention mechanism
+stronger positional biases. This allows MEGA to perform competitively to Transformers on standard benchmarks including LRA
+while also having significantly fewer parameters. MEGA's compute efficiency allows it to scale to very long sequences, making it an
 attractive option for long-document NLP tasks.

 The abstract from the paper is the following:
@@ -34,8 +42,8 @@ The original code can be found [here](https://github.com/facebookresearch/mega).

 ## Usage tips

- MEGA can perform quite well with relatively few parameters. See Appendix D in the MEGA paper for examples of architectural specs which perform well in various settings. If using MEGA as a decoder, be sure to set `bidirectional=False` to avoid errors with default bidirectional. 
- Mega-chunk is a variant of mega that reduces time and spaces complexity from quadratic to linear. Utilize chunking with MegaConfig.use_chunking and control chunk size with MegaConfig.chunk_size 
+- MEGA can perform quite well with relatively few parameters. See Appendix D in the MEGA paper for examples of architectural specs which perform well in various settings. If using MEGA as a decoder, be sure to set `bidirectional=False` to avoid errors with default bidirectional.
+- Mega-chunk is a variant of mega that reduces time and spaces complexity from quadratic to linear. Utilize chunking with MegaConfig.use_chunking and control chunk size with MegaConfig.chunk_size


 ## Implementation Notes
--- a/docs/source/en/model_doc/nat.md
+++ b/docs/source/en/model_doc/nat.md
@@ -16,6 +16,14 @@ rendered properly in your Markdown viewer.

 # Neighborhood Attention Transformer

+<Tip warning={true}>
+
+This model is in maintenance mode only, we don't accept any new PRs changing its code.
+If you run into any issues running this model, please reinstall the last version that supported this model: v4.40.2.
+You can do so by running the following command: `pip install -U transformers==4.40.2`.
+
+</Tip>
+
 ## Overview

 NAT was proposed in [Neighborhood Attention Transformer](https://arxiv.org/abs/2204.07143)
--- a/docs/source/en/model_doc/nezha.md
+++ b/docs/source/en/model_doc/nezha.md
@@ -16,6 +16,14 @@ rendered properly in your Markdown viewer.

 # Nezha

+<Tip warning={true}>
+
+This model is in maintenance mode only, we don't accept any new PRs changing its code.
+If you run into any issues running this model, please reinstall the last version that supported this model: v4.40.2.
+You can do so by running the following command: `pip install -U transformers==4.40.2`.
+
+</Tip>
+
 ## Overview

 The Nezha model was proposed in [NEZHA: Neural Contextualized Representation for Chinese Language Understanding](https://arxiv.org/abs/1909.00204) by Junqiu Wei et al.
@@ -25,8 +33,8 @@ The abstract from the paper is the following:
 *The pre-trained language models have achieved great successes in various natural language understanding (NLU) tasks
 due to its capacity to capture the deep contextualized information in text by pre-training on large-scale corpora.
 In this technical report, we present our practice of pre-training language models named NEZHA (NEural contextualiZed
-representation for CHinese lAnguage understanding) on Chinese corpora and finetuning for the Chinese NLU tasks. 
-The current version of NEZHA is based on BERT with a collection of proven improvements, which include Functional 
+representation for CHinese lAnguage understanding) on Chinese corpora and finetuning for the Chinese NLU tasks.
+The current version of NEZHA is based on BERT with a collection of proven improvements, which include Functional
 Relative Positional Encoding as an effective positional encoding scheme, Whole Word Masking strategy,
 Mixed Precision Training and the LAMB Optimizer in training the models. The experimental results show that NEZHA
 achieves the state-of-the-art performances when finetuned on several representative Chinese tasks, including
@@ -85,4 +93,4 @@ This model was contributed by [sijunhe](https://huggingface.co/sijunhe). The ori
 ## NezhaForQuestionAnswering

 [[autodoc]] NezhaForQuestionAnswering
-    - forward
+    - forward
--- a/docs/source/en/model_doc/qdqbert.md
+++ b/docs/source/en/model_doc/qdqbert.md
@@ -16,6 +16,14 @@ rendered properly in your Markdown viewer.

 # QDQBERT

+<Tip warning={true}>
+
+This model is in maintenance mode only, we don't accept any new PRs changing its code.
+If you run into any issues running this model, please reinstall the last version that supported this model: v4.40.2.
+You can do so by running the following command: `pip install -U transformers==4.40.2`.
+
+</Tip>
+
 ## Overview

 The QDQBERT model can be referenced in [Integer Quantization for Deep Learning Inference: Principles and Empirical
--- a/docs/source/en/model_doc/realm.md
+++ b/docs/source/en/model_doc/realm.md
@@ -16,6 +16,14 @@ rendered properly in your Markdown viewer.

 # REALM

+<Tip warning={true}>
+
+This model is in maintenance mode only, we don't accept any new PRs changing its code.
+If you run into any issues running this model, please reinstall the last version that supported this model: v4.40.2.
+You can do so by running the following command: `pip install -U transformers==4.40.2`.
+
+</Tip>
+
 ## Overview

 The REALM model was proposed in [REALM: Retrieval-Augmented Language Model Pre-Training](https://arxiv.org/abs/2002.08909) by Kelvin Guu, Kenton Lee, Zora Tung, Panupong Pasupat and Ming-Wei Chang. It's a
@@ -86,4 +94,4 @@ This model was contributed by [qqaatw](https://huggingface.co/qqaatw). The origi

 [[autodoc]] RealmForOpenQA
    - block_embedding_to
-    - forward
+    - forward
--- a/docs/source/en/model_doc/speech_to_text_2.md
+++ b/docs/source/en/model_doc/speech_to_text_2.md
@@ -16,6 +16,14 @@ rendered properly in your Markdown viewer.

 # Speech2Text2

+  <Tip warning={true}>
+
+  This model is in maintenance mode only, we don't accept any new PRs changing its code.
+  If you run into any issues running this model, please reinstall the last version that supported this model: v4.40.2.
+  You can do so by running the following command: `pip install -U transformers==4.40.2`.
+
+  </Tip>
+
 ## Overview

 The Speech2Text2 model is used together with [Wav2Vec2](wav2vec2) for Speech Translation models proposed in
--- a/docs/source/en/model_doc/tvlt.md
+++ b/docs/source/en/model_doc/tvlt.md
@@ -16,6 +16,14 @@ rendered properly in your Markdown viewer.

 # TVLT

+<Tip warning={true}>
+
+This model is in maintenance mode only, we don't accept any new PRs changing its code.
+If you run into any issues running this model, please reinstall the last version that supported this model: v4.40.2.
+You can do so by running the following command: `pip install -U transformers==4.40.2`.
+
+</Tip>
+
 ## Overview

 The TVLT model was proposed in [TVLT: Textless Vision-Language Transformer](https://arxiv.org/abs/2209.14156)
@@ -60,7 +68,7 @@ The original code can be found [here](https://github.com/zinengtang/TVLT). This

 [[autodoc]] TvltFeatureExtractor
    - __call__
-    
+
 ## TvltModel

 [[autodoc]] TvltModel
--- a/docs/source/en/model_doc/vit_hybrid.md
+++ b/docs/source/en/model_doc/vit_hybrid.md
@@ -16,6 +16,14 @@ rendered properly in your Markdown viewer.

 # Hybrid Vision Transformer (ViT Hybrid)

+<Tip warning={true}>
+
+This model is in maintenance mode only, we don't accept any new PRs changing its code.
+If you run into any issues running this model, please reinstall the last version that supported this model: v4.40.2.
+You can do so by running the following command: `pip install -U transformers==4.40.2`.
+
+</Tip>
+
 ## Overview

 The hybrid Vision Transformer (ViT) model was proposed in [An Image is Worth 16x16 Words: Transformers for Image Recognition
--- a/docs/source/en/model_doc/xclip.md
+++ b/docs/source/en/model_doc/xclip.md
@@ -30,7 +30,7 @@ Tips:
 - Usage of X-CLIP is identical to [CLIP](clip).

 <img src="https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/transformers/model_doc/xclip_architecture.png"
-alt="drawing" width="600"/> 
+alt="drawing" width="600"/>

 <small> X-CLIP architecture. Taken from the <a href="https://arxiv.org/abs/2208.02816">original paper.</a> </small>

--- a/docs/source/en/model_doc/xlm-prophetnet.md
+++ b/docs/source/en/model_doc/xlm-prophetnet.md
@@ -16,6 +16,14 @@ rendered properly in your Markdown viewer.

 # XLM-ProphetNet

+<Tip warning={true}>
+
+This model is in maintenance mode only, we don't accept any new PRs changing its code.
+If you run into any issues running this model, please reinstall the last version that supported this model: v4.40.2.
+You can do so by running the following command: `pip install -U transformers==4.40.2`.
+
+</Tip>
+
 <div class="flex flex-wrap space-x-1">
 <a href="https://huggingface.co/models?filter=xprophetnet">
 <img alt="Models" src="https://img.shields.io/badge/All_model_pages-xprophetnet-blueviolet">
--- a/src/transformers/__init__.py
+++ b/src/transformers/__init__.py
--- a/src/transformers/models/__init__.py
+++ b/src/transformers/models/__init__.py
@@ -67,7 +67,6 @@ from . import (
    deit,
    deprecated,
    depth_anything,
-    deta,
    detr,
    dialogpt,
    dinat,
@@ -77,13 +76,11 @@ from . import (
    donut,
    dpr,
    dpt,
-    efficientformer,
    efficientnet,
    electra,
    encodec,
    encoder_decoder,
    ernie,
-    ernie_m,
    esm,
    falcon,
    fastspeech2_conformer,
@@ -104,8 +101,6 @@ from . import (
    gpt_neox_japanese,
    gpt_sw3,
    gptj,
-    gptsan_japanese,
-    graphormer,
    grounding_dino,
    groupvit,
    herbert,
@@ -118,7 +113,6 @@ from . import (
    instructblip,
    jamba,
    jetmoe,
-    jukebox,
    kosmos2,
    layoutlm,
    layoutlmv2,
@@ -142,7 +136,6 @@ from . import (
    maskformer,
    mbart,
    mbart50,
-    mega,
    megatron_bert,
    megatron_gpt2,
    mgp_str,
@@ -161,8 +154,6 @@ from . import (
    musicgen,
    musicgen_melody,
    mvp,
-    nat,
-    nezha,
    nllb,
    nllb_moe,
    nougat,
@@ -190,11 +181,9 @@ from . import (
    prophetnet,
    pvt,
    pvt_v2,
-    qdqbert,
    qwen2,
    qwen2_moe,
    rag,
-    realm,
    recurrent_gemma,
    reformer,
    regnet,
@@ -215,7 +204,6 @@ from . import (
    siglip,
    speech_encoder_decoder,
    speech_to_text,
-    speech_to_text_2,
    speecht5,
    splinter,
    squeezebert,
@@ -234,7 +222,6 @@ from . import (
    timesformer,
    timm_backbone,
    trocr,
-    tvlt,
    tvp,
    udop,
    umt5,
@@ -250,7 +237,6 @@ from . import (
    vision_text_dual_encoder,
    visual_bert,
    vit,
-    vit_hybrid,
    vit_mae,
    vit_msn,
    vitdet,
@@ -267,7 +253,6 @@ from . import (
    x_clip,
    xglm,
    xlm,
-    xlm_prophetnet,
    xlm_roberta,
    xlm_roberta_xl,
    xlnet,
--- a/src/transformers/models/auto/configuration_auto.py
+++ b/src/transformers/models/auto/configuration_auto.py
@@ -585,14 +585,29 @@ MODEL_NAMES_MAPPING = OrderedDict(
 # `transfo-xl` (as in `CONFIG_MAPPING_NAMES`), we should use `transfo_xl`.
 DEPRECATED_MODELS = [
    "bort",
+    "deta",
+    "efficientformer",
+    "ernie_m",
+    "gptsan_japanese",
+    "graphormer",
+    "jukebox",
    "mctct",
+    "mega",
    "mmbt",
+    "nat",
+    "nezha",
    "open_llama",
+    "qdqbert",
+    "realm",
    "retribert",
+    "speech_to_text_2",
    "tapex",
    "trajectory_transformer",
    "transfo_xl",
+    "tvlt",
    "van",
+    "vit_hybrid",
+    "xlm_prophetnet",
 ]

 SPECIAL_MODEL_TYPE_TO_MODULE_NAME = OrderedDict(
@@ -616,7 +631,11 @@ def model_type_to_module_name(key):
    """Converts a config key to the corresponding module."""
    # Special treatment
    if key in SPECIAL_MODEL_TYPE_TO_MODULE_NAME:
-        return SPECIAL_MODEL_TYPE_TO_MODULE_NAME[key]
+        key = SPECIAL_MODEL_TYPE_TO_MODULE_NAME[key]
+
+        if key in DEPRECATED_MODELS:
+            key = f"deprecated.{key}"
+        return key

    key = key.replace("-", "_")
    if key in DEPRECATED_MODELS:
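
The hunk above changes how a config `model_type` is turned into a module path: a key that goes through the special-case table is now also checked against `DEPRECATED_MODELS`, so it gets the `deprecated.` prefix like any other deprecated model. A minimal standalone sketch of that lookup follows. The function name `sketch_model_type_to_module_name`, the two small tables, and the `legacy-net` entry are illustrative assumptions, not the library's code, and the tail of the function (prefix, then return) is assumed because the hunk is cut off after its last context line.

```python
# Illustrative tables only; the real ones in configuration_auto.py are much longer.
SPECIAL_MODEL_TYPE_TO_MODULE_NAME = {
    "openai-gpt": "openai",
    "legacy-net": "legacy_net",  # hypothetical entry, not part of the library
}
DEPRECATED_MODELS = ["deta", "gptsan_japanese", "legacy_net"]  # "legacy_net" is hypothetical


def sketch_model_type_to_module_name(key: str) -> str:
    """Map a config model_type to a module path relative to transformers.models."""
    # Special treatment: some model types do not match their module name.
    if key in SPECIAL_MODEL_TYPE_TO_MODULE_NAME:
        key = SPECIAL_MODEL_TYPE_TO_MODULE_NAME[key]
        # After this commit, special-cased keys are also checked for deprecation.
        if key in DEPRECATED_MODELS:
            key = f"deprecated.{key}"
        return key

    # Default path: dashes become underscores, then the same deprecation check.
    key = key.replace("-", "_")
    if key in DEPRECATED_MODELS:
        key = f"deprecated.{key}"
    return key


for model_type in ["bert", "openai-gpt", "gptsan-japanese", "legacy-net"]:
    print(model_type, "->", sketch_model_type_to_module_name(model_type))
```

Without the added check, the hypothetical `legacy-net` entry would resolve to `legacy_net` and the import would look in the old, no longer existing location.
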
--- a/src/transformers/models/deprecated/deta/__init__.py
+++ b/src/transformers/models/deprecated/deta/__init__.py
@@ -14,7 +14,7 @@

 from typing import TYPE_CHECKING

-from ...utils import OptionalDependencyNotAvailable, _LazyModule, is_torch_available, is_vision_available
+from ....utils import OptionalDependencyNotAvailable, _LazyModule, is_torch_available, is_vision_available


 _import_structure = {
--- a/src/transformers/models/deprecated/deta/configuration_deta.py
+++ b/src/transformers/models/deprecated/deta/configuration_deta.py
@@ -14,9 +14,9 @@
 # limitations under the License.
 """DETA model configuration"""

-from ...configuration_utils import PretrainedConfig
-from ...utils import logging
-from ..auto import CONFIG_MAPPING
+from ....configuration_utils import PretrainedConfig
+from ....utils import logging
+from ...auto import CONFIG_MAPPING


 logger = logging.get_logger(__name__)
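
Most of the remaining hunks only add one leading dot to each relative import, because every moved module now sits one package deeper (`models/deprecated/<name>/` instead of `models/<name>/`). The sketch below uses the standard library's `importlib.util.resolve_name` to show why the extra dot is needed. It only resolves names as strings, so `transformers` does not have to be installed; the old package path `transformers.models.deta` is inferred from the removed entry in `models/__init__.py`.

```python
from importlib.util import resolve_name

# Package that contains configuration_deta.py before and after the move.
OLD_PACKAGE = "transformers.models.deta"
NEW_PACKAGE = "transformers.models.deprecated.deta"

# Before the move, three dots climb from the model package to `transformers`.
print(resolve_name("...utils", OLD_PACKAGE))   # transformers.utils
# After the move, four dots are needed to reach the same module.
print(resolve_name("....utils", NEW_PACKAGE))  # transformers.utils

# Sibling imports such as `..auto` shift by one level in the same way.
print(resolve_name("..auto", OLD_PACKAGE))     # transformers.models.auto
print(resolve_name("...auto", NEW_PACKAGE))    # transformers.models.auto

# Leaving an import unchanged would point at a different, nonexistent module.
print(resolve_name("...utils", NEW_PACKAGE))   # transformers.models.utils
```
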
--- a/src/transformers/models/deprecated/deta/convert_deta_resnet_to_pytorch.py
+++ b/src/transformers/models/deprecated/deta/convert_deta_resnet_to_pytorch.py
--- a/src/transformers/models/deprecated/deta/convert_deta_swin_to_pytorch.py
+++ b/src/transformers/models/deprecated/deta/convert_deta_swin_to_pytorch.py
--- a/src/transformers/models/deprecated/deta/image_processing_deta.py
+++ b/src/transformers/models/deprecated/deta/image_processing_deta.py
@@ -19,9 +19,9 @@ from typing import Any, Callable, Dict, Iterable, List, Optional, Tuple, Union

 import numpy as np

-from ...feature_extraction_utils import BatchFeature
-from ...image_processing_utils import BaseImageProcessor, get_size_dict
-from ...image_transforms import (
+from ....feature_extraction_utils import BatchFeature
+from ....image_processing_utils import BaseImageProcessor, get_size_dict
+from ....image_transforms import (
    PaddingMode,
    center_to_corners_format,
    corners_to_center_format,
@@ -31,7 +31,7 @@ from ...image_transforms import (
    rgb_to_id,
    to_channel_dimension_format,
 )
-from ...image_utils import (
+from ....image_utils import (
    IMAGENET_DEFAULT_MEAN,
    IMAGENET_DEFAULT_STD,
    AnnotationFormat,
@@ -48,7 +48,7 @@ from ...image_utils import (
    validate_annotations,
    validate_preprocess_arguments,
 )
-from ...utils import (
+from ....utils import (
    is_flax_available,
    is_jax_tensor,
    is_tf_available,
@@ -59,7 +59,7 @@ from ...utils import (
    is_vision_available,
    logging,
 )
-from ...utils.generic import TensorType
+from ....utils.generic import TensorType


 if is_torch_available():
--- a/src/transformers/models/deprecated/deta/modeling_deta.py
+++ b/src/transformers/models/deprecated/deta/modeling_deta.py
@@ -28,8 +28,8 @@ from torch import Tensor, nn
 from torch.autograd import Function
 from torch.autograd.function import once_differentiable

-from ...activations import ACT2FN
-from ...file_utils import (
+from ....activations import ACT2FN
+from ....file_utils import (
    ModelOutput,
    add_start_docstrings,
    add_start_docstrings_to_model_forward,
@@ -38,12 +38,12 @@ from ...file_utils import (
    is_vision_available,
    replace_return_docstrings,
 )
-from ...modeling_attn_mask_utils import _prepare_4d_attention_mask
-from ...modeling_outputs import BaseModelOutput
-from ...modeling_utils import PreTrainedModel
-from ...pytorch_utils import meshgrid
-from ...utils import is_accelerate_available, is_ninja_available, is_torchvision_available, logging, requires_backends
-from ...utils.backbone_utils import load_backbone
+from ....modeling_attn_mask_utils import _prepare_4d_attention_mask
+from ....modeling_outputs import BaseModelOutput
+from ....modeling_utils import PreTrainedModel
+from ....pytorch_utils import meshgrid
+from ....utils import is_accelerate_available, is_ninja_available, is_torchvision_available, logging, requires_backends
+from ....utils.backbone_utils import load_backbone
 from .configuration_deta import DetaConfig


--- a/src/transformers/models/deprecated/efficientformer/__init__.py
+++ b/src/transformers/models/deprecated/efficientformer/__init__.py
@@ -13,7 +13,7 @@
 # limitations under the License.
 from typing import TYPE_CHECKING

-from ...utils import (
+from ....utils import (
    OptionalDependencyNotAvailable,
    _LazyModule,
    is_tf_available,
--- a/src/transformers/models/deprecated/efficientformer/configuration_efficientformer.py
+++ b/src/transformers/models/deprecated/efficientformer/configuration_efficientformer.py
@@ -16,8 +16,8 @@

 from typing import List

-from ...configuration_utils import PretrainedConfig
-from ...utils import logging
+from ....configuration_utils import PretrainedConfig
+from ....utils import logging


 logger = logging.get_logger(__name__)
--- a/src/transformers/models/deprecated/efficientformer/convert_efficientformer_original_pytorch_checkpoint_to_pytorch.py
+++ b/src/transformers/models/deprecated/efficientformer/convert_efficientformer_original_pytorch_checkpoint_to_pytorch.py
--- a/src/transformers/models/deprecated/efficientformer/image_processing_efficientformer.py
+++ b/src/transformers/models/deprecated/efficientformer/image_processing_efficientformer.py
@@ -18,13 +18,13 @@ from typing import Dict, List, Optional, Union

 import numpy as np

-from ...image_processing_utils import BaseImageProcessor, BatchFeature, get_size_dict
-from ...image_transforms import (
+from ....image_processing_utils import BaseImageProcessor, BatchFeature, get_size_dict
+from ....image_transforms import (
    get_resize_output_image_size,
    resize,
    to_channel_dimension_format,
 )
-from ...image_utils import (
+from ....image_utils import (
    IMAGENET_DEFAULT_MEAN,
    IMAGENET_DEFAULT_STD,
    ChannelDimension,
@@ -38,7 +38,7 @@ from ...image_utils import (
    validate_kwargs,
    validate_preprocess_arguments,
 )
-from ...utils import TensorType, logging
+from ....utils import TensorType, logging


 logger = logging.get_logger(__name__)
--- a/src/transformers/models/deprecated/efficientformer/modeling_efficientformer.py
+++ b/src/transformers/models/deprecated/efficientformer/modeling_efficientformer.py
@@ -23,10 +23,10 @@ import torch.utils.checkpoint
 from torch import nn
 from torch.nn import BCEWithLogitsLoss, CrossEntropyLoss, MSELoss

-from ...activations import ACT2FN
-from ...modeling_outputs import BaseModelOutput, BaseModelOutputWithPooling, ImageClassifierOutput
-from ...modeling_utils import PreTrainedModel
-from ...utils import (
+from ....activations import ACT2FN
+from ....modeling_outputs import BaseModelOutput, BaseModelOutputWithPooling, ImageClassifierOutput
+from ....modeling_utils import PreTrainedModel
+from ....utils import (
    ModelOutput,
    add_code_sample_docstrings,
    add_start_docstrings,
--- a/src/transformers/models/deprecated/efficientformer/modeling_tf_efficientformer.py
+++ b/src/transformers/models/deprecated/efficientformer/modeling_tf_efficientformer.py
@@ -20,13 +20,13 @@ from typing import Optional, Tuple, Union

 import tensorflow as tf

-from ...activations_tf import ACT2FN
-from ...modeling_tf_outputs import (
+from ....activations_tf import ACT2FN
+from ....modeling_tf_outputs import (
    TFBaseModelOutput,
    TFBaseModelOutputWithPooling,
    TFImageClassifierOutput,
 )
-from ...modeling_tf_utils import (
+from ....modeling_tf_utils import (
    TFPreTrainedModel,
    TFSequenceClassificationLoss,
    get_initializer,
@@ -34,8 +34,8 @@ from ...modeling_tf_utils import (
    keras_serializable,
    unpack_inputs,
 )
-from ...tf_utils import shape_list, stable_softmax
-from ...utils import (
+from ....tf_utils import shape_list, stable_softmax
+from ....utils import (
    ModelOutput,
    add_code_sample_docstrings,
    add_start_docstrings,
--- a/src/transformers/models/deprecated/ernie_m/__init__.py
+++ b/src/transformers/models/deprecated/ernie_m/__init__.py
@@ -14,7 +14,7 @@
 from typing import TYPE_CHECKING

 # rely on isort to merge the imports
-from ...utils import OptionalDependencyNotAvailable, _LazyModule, is_sentencepiece_available, is_torch_available
+from ....utils import OptionalDependencyNotAvailable, _LazyModule, is_sentencepiece_available, is_torch_available


 _import_structure = {
--- a/src/transformers/models/deprecated/ernie_m/configuration_ernie_m.py
+++ b/src/transformers/models/deprecated/ernie_m/configuration_ernie_m.py
@@ -19,7 +19,7 @@ from __future__ import annotations

 from typing import Dict

-from ...configuration_utils import PretrainedConfig
+from ....configuration_utils import PretrainedConfig


 class ErnieMConfig(PretrainedConfig):
--- a/src/transformers/models/deprecated/ernie_m/modeling_ernie_m.py
+++ b/src/transformers/models/deprecated/ernie_m/modeling_ernie_m.py
@@ -22,8 +22,8 @@ import torch.utils.checkpoint
 from torch import nn, tensor
 from torch.nn import BCEWithLogitsLoss, CrossEntropyLoss, MSELoss

-from ...activations import ACT2FN
-from ...modeling_outputs import (
+from ....activations import ACT2FN
+from ....modeling_outputs import (
    BaseModelOutputWithPastAndCrossAttentions,
    BaseModelOutputWithPoolingAndCrossAttentions,
    MultipleChoiceModelOutput,
@@ -31,9 +31,9 @@ from ...modeling_outputs import (
    SequenceClassifierOutput,
    TokenClassifierOutput,
 )
-from ...modeling_utils import PreTrainedModel
-from ...pytorch_utils import find_pruneable_heads_and_indices, prune_linear_layer
-from ...utils import add_code_sample_docstrings, add_start_docstrings, add_start_docstrings_to_model_forward, logging
+from ....modeling_utils import PreTrainedModel
+from ....pytorch_utils import find_pruneable_heads_and_indices, prune_linear_layer
+from ....utils import add_code_sample_docstrings, add_start_docstrings, add_start_docstrings_to_model_forward, logging
 from .configuration_ernie_m import ErnieMConfig


--- a/src/transformers/models/deprecated/ernie_m/tokenization_ernie_m.py
+++ b/src/transformers/models/deprecated/ernie_m/tokenization_ernie_m.py
@@ -21,8 +21,8 @@ from typing import Any, Dict, List, Optional, Tuple

 import sentencepiece as spm

-from ...tokenization_utils import PreTrainedTokenizer
-from ...utils import logging
+from ....tokenization_utils import PreTrainedTokenizer
+from ....utils import logging


 logger = logging.get_logger(__name__)
--- a/src/transformers/models/deprecated/gptsan_japanese/__init__.py
+++ b/src/transformers/models/deprecated/gptsan_japanese/__init__.py
@@ -14,7 +14,7 @@

 from typing import TYPE_CHECKING

-from ...utils import (
+from ....utils import (
    OptionalDependencyNotAvailable,
    _LazyModule,
    is_flax_available,
--- a/src/transformers/models/deprecated/gptsan_japanese/configuration_gptsan_japanese.py
+++ b/src/transformers/models/deprecated/gptsan_japanese/configuration_gptsan_japanese.py
@@ -14,8 +14,8 @@
 # limitations under the License.
 """GPTSAN-japanese model configuration"""

-from ...configuration_utils import PretrainedConfig
-from ...utils import logging
+from ....configuration_utils import PretrainedConfig
+from ....utils import logging


 logger = logging.get_logger(__name__)
--- a/src/transformers/models/deprecated/gptsan_japanese/convert_gptsan_tf_checkpoint_to_pytorch.py
+++ b/src/transformers/models/deprecated/gptsan_japanese/convert_gptsan_tf_checkpoint_to_pytorch.py
--- a/src/transformers/models/deprecated/gptsan_japanese/modeling_gptsan_japanese.py
+++ b/src/transformers/models/deprecated/gptsan_japanese/modeling_gptsan_japanese.py
@@ -20,10 +20,10 @@ from typing import List, Optional, Tuple, Union
 import torch
 import torch.nn as nn

-from ...activations import ACT2FN
-from ...modeling_outputs import MoECausalLMOutputWithPast, MoEModelOutputWithPastAndCrossAttentions
-from ...modeling_utils import PreTrainedModel
-from ...utils import (
+from ....activations import ACT2FN
+from ....modeling_outputs import MoECausalLMOutputWithPast, MoEModelOutputWithPastAndCrossAttentions
+from ....modeling_utils import PreTrainedModel
+from ....utils import (
    DUMMY_INPUTS,
    DUMMY_MASK,
    add_start_docstrings,
--- a/src/transformers/models/deprecated/gptsan_japanese/tokenization_gptsan_japanese.py
+++ b/src/transformers/models/deprecated/gptsan_japanese/tokenization_gptsan_japanese.py
@@ -22,8 +22,8 @@ from typing import List, Optional, Tuple, Union

 import numpy as np

-from ...tokenization_utils import PreTrainedTokenizer
-from ...tokenization_utils_base import (
+from ....tokenization_utils import PreTrainedTokenizer
+from ....tokenization_utils_base import (
    BatchEncoding,
    PreTokenizedInput,
    PreTokenizedInputPair,
@@ -31,7 +31,7 @@ from ...tokenization_utils_base import (
    TextInputPair,
    TruncationStrategy,
 )
-from ...utils import PaddingStrategy, logging
+from ....utils import PaddingStrategy, logging


 logger = logging.get_logger(__name__)
--- a/src/transformers/models/deprecated/graphormer/__init__.py
+++ b/src/transformers/models/deprecated/graphormer/__init__.py
@@ -13,7 +13,7 @@
 # limitations under the License.
 from typing import TYPE_CHECKING

-from ...utils import OptionalDependencyNotAvailable, _LazyModule, is_tokenizers_available, is_torch_available
+from ....utils import OptionalDependencyNotAvailable, _LazyModule, is_tokenizers_available, is_torch_available


 _import_structure = {
--- a/src/transformers/models/deprecated/graphormer/algos_graphormer.pyx
+++ b/src/transformers/models/deprecated/graphormer/algos_graphormer.pyx
--- a/src/transformers/models/deprecated/graphormer/collating_graphormer.py
+++ b/src/transformers/models/deprecated/graphormer/collating_graphormer.py
@@ -6,7 +6,7 @@ from typing import Any, Dict, List, Mapping
 import numpy as np
 import torch

-from ...utils import is_cython_available, requires_backends
+from ....utils import is_cython_available, requires_backends


 if is_cython_available():
--- a/src/transformers/models/deprecated/graphormer/configuration_graphormer.py
+++ b/src/transformers/models/deprecated/graphormer/configuration_graphormer.py
@@ -14,8 +14,8 @@
 # limitations under the License.
 """Graphormer model configuration"""

-from ...configuration_utils import PretrainedConfig
-from ...utils import logging
+from ....configuration_utils import PretrainedConfig
+from ....utils import logging


 logger = logging.get_logger(__name__)
--- a/src/transformers/models/deprecated/graphormer/modeling_graphormer.py
+++ b/src/transformers/models/deprecated/graphormer/modeling_graphormer.py
@@ -21,13 +21,13 @@ import torch
 import torch.nn as nn
 from torch.nn import BCEWithLogitsLoss, CrossEntropyLoss, MSELoss

-from ...activations import ACT2FN
-from ...modeling_outputs import (
+from ....activations import ACT2FN
+from ....modeling_outputs import (
    BaseModelOutputWithNoAttention,
    SequenceClassifierOutput,
 )
-from ...modeling_utils import PreTrainedModel
-from ...utils import logging
+from ....modeling_utils import PreTrainedModel
+from ....utils import logging
 from .configuration_graphormer import GraphormerConfig


--- a/src/transformers/models/deprecated/jukebox/__init__.py
+++ b/src/transformers/models/deprecated/jukebox/__init__.py
@@ -14,7 +14,7 @@

 from typing import TYPE_CHECKING

-from ...utils import OptionalDependencyNotAvailable, _LazyModule, is_torch_available
+from ....utils import OptionalDependencyNotAvailable, _LazyModule, is_torch_available


 _import_structure = {
--- a/src/transformers/models/deprecated/jukebox/configuration_jukebox.py
+++ b/src/transformers/models/deprecated/jukebox/configuration_jukebox.py
@@ -17,8 +17,8 @@
 import os
 from typing import List, Union

-from ...configuration_utils import PretrainedConfig
-from ...utils import logging
+from ....configuration_utils import PretrainedConfig
+from ....utils import logging


 logger = logging.get_logger(__name__)
--- a/src/transformers/models/deprecated/jukebox/convert_jukebox.py
+++ b/src/transformers/models/deprecated/jukebox/convert_jukebox.py
--- a/src/transformers/models/deprecated/jukebox/modeling_jukebox.py
+++ b/src/transformers/models/deprecated/jukebox/modeling_jukebox.py
@@ -24,10 +24,10 @@ import torch.nn.functional as F
 from torch import nn
 from torch.nn import LayerNorm as FusedLayerNorm

-from ...activations import ACT2FN
-from ...modeling_utils import PreTrainedModel
-from ...utils import add_start_docstrings, logging
-from ...utils.logging import tqdm
+from ....activations import ACT2FN
+from ....modeling_utils import PreTrainedModel
+from ....utils import add_start_docstrings, logging
+from ....utils.logging import tqdm
 from .configuration_jukebox import ATTENTION_PATTERNS, JukeboxConfig, JukeboxPriorConfig, JukeboxVQVAEConfig


--- a/src/transformers/models/deprecated/jukebox/tokenization_jukebox.py
+++ b/src/transformers/models/deprecated/jukebox/tokenization_jukebox.py
@@ -24,10 +24,10 @@ from typing import Any, Dict, List, Optional, Tuple, Union
 import numpy as np
 import regex

-from ...tokenization_utils import AddedToken, PreTrainedTokenizer
-from ...tokenization_utils_base import BatchEncoding
-from ...utils import TensorType, is_flax_available, is_tf_available, is_torch_available, logging
-from ...utils.generic import _is_jax, _is_numpy
+from ....tokenization_utils import AddedToken, PreTrainedTokenizer
+from ....tokenization_utils_base import BatchEncoding
+from ....utils import TensorType, is_flax_available, is_tf_available, is_torch_available, logging
+from ....utils.generic import _is_jax, _is_numpy


 logger = logging.get_logger(__name__)
--- a/src/transformers/models/deprecated/mega/__init__.py
+++ b/src/transformers/models/deprecated/mega/__init__.py
@@ -14,7 +14,7 @@

 from typing import TYPE_CHECKING

-from ...utils import (
+from ....utils import (
    OptionalDependencyNotAvailable,
    _LazyModule,
    is_torch_available,
--- a/src/transformers/models/deprecated/mega/configuration_mega.py
+++ b/src/transformers/models/deprecated/mega/configuration_mega.py
@@ -17,9 +17,9 @@
 from collections import OrderedDict
 from typing import Mapping

-from ...configuration_utils import PretrainedConfig
-from ...onnx import OnnxConfig
-from ...utils import logging
+from ....configuration_utils import PretrainedConfig
+from ....onnx import OnnxConfig
+from ....utils import logging


 logger = logging.get_logger(__name__)
--- a/src/transformers/models/deprecated/mega/convert_mega_original_pytorch_checkpoint_to_pytorch.py
+++ b/src/transformers/models/deprecated/mega/convert_mega_original_pytorch_checkpoint_to_pytorch.py
--- a/src/transformers/models/deprecated/mega/modeling_mega.py
+++ b/src/transformers/models/deprecated/mega/modeling_mega.py
@@ -23,8 +23,8 @@ import torch.utils.checkpoint
 from torch import nn
 from torch.nn import BCEWithLogitsLoss, CrossEntropyLoss, MSELoss

-from ...activations import ACT2FN
-from ...modeling_outputs import (
+from ....activations import ACT2FN
+from ....modeling_outputs import (
    BaseModelOutputWithPoolingAndCrossAttentions,
    CausalLMOutputWithCrossAttentions,
    MaskedLMOutput,
@@ -33,9 +33,9 @@ from ...modeling_outputs import (
    SequenceClassifierOutput,
    TokenClassifierOutput,
 )
-from ...modeling_utils import PreTrainedModel
-from ...pytorch_utils import ALL_LAYERNORM_LAYERS
-from ...utils import (
+from ....modeling_utils import PreTrainedModel
+from ....pytorch_utils import ALL_LAYERNORM_LAYERS
+from ....utils import (
    add_code_sample_docstrings,
    add_start_docstrings,
    add_start_docstrings_to_model_forward,
--- a/src/transformers/models/deprecated/nat/__init__.py
+++ b/src/transformers/models/deprecated/nat/__init__.py
@@ -13,7 +13,7 @@
 # limitations under the License.
 from typing import TYPE_CHECKING

-from ...utils import OptionalDependencyNotAvailable, _LazyModule, is_torch_available
+from ....utils import OptionalDependencyNotAvailable, _LazyModule, is_torch_available


 _import_structure = {"configuration_nat": ["NatConfig"]}
--- a/src/transformers/models/deprecated/nat/configuration_nat.py
+++ b/src/transformers/models/deprecated/nat/configuration_nat.py
@@ -14,9 +14,9 @@
 # limitations under the License.
 """Neighborhood Attention Transformer model configuration"""

-from ...configuration_utils import PretrainedConfig
-from ...utils import logging
-from ...utils.backbone_utils import BackboneConfigMixin, get_aligned_output_features_output_indices
+from ....configuration_utils import PretrainedConfig
+from ....utils import logging
+from ....utils.backbone_utils import BackboneConfigMixin, get_aligned_output_features_output_indices


 logger = logging.get_logger(__name__)
--- a/src/transformers/models/deprecated/nat/modeling_nat.py
+++ b/src/transformers/models/deprecated/nat/modeling_nat.py
@@ -23,11 +23,11 @@ import torch.utils.checkpoint
 from torch import nn
 from torch.nn import BCEWithLogitsLoss, CrossEntropyLoss, MSELoss

-from ...activations import ACT2FN
-from ...modeling_outputs import BackboneOutput
-from ...modeling_utils import PreTrainedModel
-from ...pytorch_utils import find_pruneable_heads_and_indices, prune_linear_layer
-from ...utils import (
+from ....activations import ACT2FN
+from ....modeling_outputs import BackboneOutput
+from ....modeling_utils import PreTrainedModel
+from ....pytorch_utils import find_pruneable_heads_and_indices, prune_linear_layer
+from ....utils import (
    ModelOutput,
    OptionalDependencyNotAvailable,
    add_code_sample_docstrings,
@@ -38,7 +38,7 @@ from ...utils import (
    replace_return_docstrings,
    requires_backends,
 )
-from ...utils.backbone_utils import BackboneMixin
+from ....utils.backbone_utils import BackboneMixin
 from .configuration_nat import NatConfig


--- a/src/transformers/models/deprecated/nezha/__init__.py
+++ b/src/transformers/models/deprecated/nezha/__init__.py
@@ -13,7 +13,7 @@
 # limitations under the License.
 from typing import TYPE_CHECKING

-from ...utils import OptionalDependencyNotAvailable, _LazyModule, is_tokenizers_available, is_torch_available
+from ....utils import OptionalDependencyNotAvailable, _LazyModule, is_tokenizers_available, is_torch_available


 _import_structure = {
--- a/src/transformers/models/deprecated/nezha/configuration_nezha.py
+++ b/src/transformers/models/deprecated/nezha/configuration_nezha.py
@@ -1,4 +1,4 @@
-from ... import PretrainedConfig
+from .... import PretrainedConfig


 class NezhaConfig(PretrainedConfig):
--- a/src/transformers/models/deprecated/nezha/modeling_nezha.py
+++ b/src/transformers/models/deprecated/nezha/modeling_nezha.py
@@ -25,8 +25,8 @@ import torch.utils.checkpoint
 from torch import nn
 from torch.nn import BCEWithLogitsLoss, CrossEntropyLoss, MSELoss

-from ...activations import ACT2FN
-from ...modeling_outputs import (
+from ....activations import ACT2FN
+from ....modeling_outputs import (
    BaseModelOutputWithPastAndCrossAttentions,
    BaseModelOutputWithPoolingAndCrossAttentions,
    MaskedLMOutput,
@@ -36,9 +36,9 @@ from ...modeling_outputs import (
    SequenceClassifierOutput,
    TokenClassifierOutput,
 )
-from ...modeling_utils import PreTrainedModel
-from ...pytorch_utils import apply_chunking_to_forward, find_pruneable_heads_and_indices, prune_linear_layer
-from ...utils import (
+from ....modeling_utils import PreTrainedModel
+from ....pytorch_utils import apply_chunking_to_forward, find_pruneable_heads_and_indices, prune_linear_layer
+from ....utils import (
    ModelOutput,
    add_code_sample_docstrings,
    add_start_docstrings,
--- a/src/transformers/models/deprecated/qdqbert/__init__.py
+++ b/src/transformers/models/deprecated/qdqbert/__init__.py
@@ -13,7 +13,7 @@
 # limitations under the License.
 from typing import TYPE_CHECKING

-from ...utils import OptionalDependencyNotAvailable, _LazyModule, is_torch_available
+from ....utils import OptionalDependencyNotAvailable, _LazyModule, is_torch_available


 _import_structure = {"configuration_qdqbert": ["QDQBertConfig"]}
--- a/src/transformers/models/deprecated/qdqbert/configuration_qdqbert.py
+++ b/src/transformers/models/deprecated/qdqbert/configuration_qdqbert.py
@@ -14,8 +14,8 @@
 # limitations under the License.
 """QDQBERT model configuration"""

-from ...configuration_utils import PretrainedConfig
-from ...utils import logging
+from ....configuration_utils import PretrainedConfig
+from ....utils import logging


 logger = logging.get_logger(__name__)
--- a/src/transformers/models/deprecated/qdqbert/modeling_qdqbert.py
+++ b/src/transformers/models/deprecated/qdqbert/modeling_qdqbert.py
@@ -25,8 +25,8 @@ import torch.utils.checkpoint
 from torch import nn
 from torch.nn import BCEWithLogitsLoss, CrossEntropyLoss, MSELoss

-from ...activations import ACT2FN
-from ...modeling_outputs import (
+from ....activations import ACT2FN
+from ....modeling_outputs import (
    BaseModelOutputWithPastAndCrossAttentions,
    BaseModelOutputWithPoolingAndCrossAttentions,
    CausalLMOutputWithCrossAttentions,
@@ -37,9 +37,9 @@ from ...modeling_outputs import (
    SequenceClassifierOutput,
    TokenClassifierOutput,
 )
-from ...modeling_utils import PreTrainedModel
-from ...pytorch_utils import find_pruneable_heads_and_indices, prune_linear_layer
-from ...utils import (
+from ....modeling_utils import PreTrainedModel
+from ....pytorch_utils import find_pruneable_heads_and_indices, prune_linear_layer
+from ....utils import (
    add_code_sample_docstrings,
    add_start_docstrings,
    add_start_docstrings_to_model_forward,
--- a/src/transformers/models/deprecated/realm/__init__.py
+++ b/src/transformers/models/deprecated/realm/__init__.py
@@ -13,7 +13,7 @@
 # limitations under the License.
 from typing import TYPE_CHECKING

-from ...utils import OptionalDependencyNotAvailable, _LazyModule, is_tokenizers_available, is_torch_available
+from ....utils import OptionalDependencyNotAvailable, _LazyModule, is_tokenizers_available, is_torch_available


 _import_structure = {
--- a/src/transformers/models/deprecated/realm/configuration_realm.py
+++ b/src/transformers/models/deprecated/realm/configuration_realm.py
@@ -14,8 +14,8 @@
 # limitations under the License.
 """REALM model configuration."""

-from ...configuration_utils import PretrainedConfig
-from ...utils import logging
+from ....configuration_utils import PretrainedConfig
+from ....utils import logging


 logger = logging.get_logger(__name__)
--- a/src/transformers/models/deprecated/realm/modeling_realm.py
+++ b/src/transformers/models/deprecated/realm/modeling_realm.py
@@ -23,16 +23,16 @@ import torch
 from torch import nn
 from torch.nn import CrossEntropyLoss

-from ...activations import ACT2FN
-from ...modeling_outputs import (
+from ....activations import ACT2FN
+from ....modeling_outputs import (
    BaseModelOutputWithPastAndCrossAttentions,
    BaseModelOutputWithPoolingAndCrossAttentions,
    MaskedLMOutput,
    ModelOutput,
 )
-from ...modeling_utils import PreTrainedModel
-from ...pytorch_utils import apply_chunking_to_forward, find_pruneable_heads_and_indices, prune_linear_layer
-from ...utils import add_start_docstrings, add_start_docstrings_to_model_forward, logging, replace_return_docstrings
+from ....modeling_utils import PreTrainedModel
+from ....pytorch_utils import apply_chunking_to_forward, find_pruneable_heads_and_indices, prune_linear_layer
+from ....utils import add_start_docstrings, add_start_docstrings_to_model_forward, logging, replace_return_docstrings
 from .configuration_realm import RealmConfig


--- a/src/transformers/models/deprecated/realm/retrieval_realm.py
+++ b/src/transformers/models/deprecated/realm/retrieval_realm.py
@@ -20,8 +20,8 @@ from typing import Optional, Union
 import numpy as np
 from huggingface_hub import hf_hub_download

-from ... import AutoTokenizer
-from ...utils import logging
+from .... import AutoTokenizer
+from ....utils import logging


 _REALM_BLOCK_RECORDS_FILENAME = "block_records.npy"
--- a/src/transformers/models/deprecated/realm/tokenization_realm.py
+++ b/src/transformers/models/deprecated/realm/tokenization_realm.py
@@ -19,9 +19,9 @@ import os
 import unicodedata
 from typing import List, Optional, Tuple

-from ...tokenization_utils import PreTrainedTokenizer, _is_control, _is_punctuation, _is_whitespace
-from ...tokenization_utils_base import BatchEncoding
-from ...utils import PaddingStrategy, logging
+from ....tokenization_utils import PreTrainedTokenizer, _is_control, _is_punctuation, _is_whitespace
+from ....tokenization_utils_base import BatchEncoding
+from ....utils import PaddingStrategy, logging


 logger = logging.get_logger(__name__)
--- a/src/transformers/models/deprecated/realm/tokenization_realm_fast.py
+++ b/src/transformers/models/deprecated/realm/tokenization_realm_fast.py
@@ -19,9 +19,9 @@ from typing import List, Optional, Tuple

 from tokenizers import normalizers

-from ...tokenization_utils_base import BatchEncoding
-from ...tokenization_utils_fast import PreTrainedTokenizerFast
-from ...utils import PaddingStrategy, logging
+from ....tokenization_utils_base import BatchEncoding
+from ....tokenization_utils_fast import PreTrainedTokenizerFast
+from ....utils import PaddingStrategy, logging
 from .tokenization_realm import RealmTokenizer


--- a/src/transformers/models/deprecated/speech_to_text_2/__init__.py
+++ b/src/transformers/models/deprecated/speech_to_text_2/__init__.py
@@ -13,7 +13,7 @@
 # limitations under the License.
 from typing import TYPE_CHECKING

-from ...utils import (
+from ....utils import (
    OptionalDependencyNotAvailable,
    _LazyModule,
    is_sentencepiece_available,
--- a/src/transformers/models/deprecated/speech_to_text_2/configuration_speech_to_text_2.py
+++ b/src/transformers/models/deprecated/speech_to_text_2/configuration_speech_to_text_2.py
@@ -14,8 +14,8 @@
 # limitations under the License.
 """Speech2Text model configuration"""

-from ...configuration_utils import PretrainedConfig
-from ...utils import logging
+from ....configuration_utils import PretrainedConfig
+from ....utils import logging


 logger = logging.get_logger(__name__)
--- a/src/transformers/models/deprecated/speech_to_text_2/modeling_speech_to_text_2.py
+++ b/src/transformers/models/deprecated/speech_to_text_2/modeling_speech_to_text_2.py
@@ -22,11 +22,11 @@ import torch
 from torch import nn
 from torch.nn import CrossEntropyLoss

-from ...activations import ACT2FN
-from ...modeling_attn_mask_utils import _prepare_4d_attention_mask, _prepare_4d_causal_attention_mask
-from ...modeling_outputs import BaseModelOutputWithPastAndCrossAttentions, CausalLMOutputWithCrossAttentions
-from ...modeling_utils import PreTrainedModel
-from ...utils import add_start_docstrings, logging, replace_return_docstrings
+from ....activations import ACT2FN
+from ....modeling_attn_mask_utils import _prepare_4d_attention_mask, _prepare_4d_causal_attention_mask
+from ....modeling_outputs import BaseModelOutputWithPastAndCrossAttentions, CausalLMOutputWithCrossAttentions
+from ....modeling_utils import PreTrainedModel
+from ....utils import add_start_docstrings, logging, replace_return_docstrings
 from .configuration_speech_to_text_2 import Speech2Text2Config


--- a/src/transformers/models/deprecated/speech_to_text_2/processing_speech_to_text_2.py
+++ b/src/transformers/models/deprecated/speech_to_text_2/processing_speech_to_text_2.py
@@ -19,7 +19,7 @@ Speech processor class for Speech2Text2
 import warnings
 from contextlib import contextmanager

-from ...processing_utils import ProcessorMixin
+from ....processing_utils import ProcessorMixin


 class Speech2Text2Processor(ProcessorMixin):
--- a/src/transformers/models/deprecated/speech_to_text_2/tokenization_speech_to_text_2.py
+++ b/src/transformers/models/deprecated/speech_to_text_2/tokenization_speech_to_text_2.py
@@ -18,8 +18,8 @@ import json
 import os
 from typing import Dict, List, Optional, Tuple

-from ...tokenization_utils import PreTrainedTokenizer
-from ...utils import logging
+from ....tokenization_utils import PreTrainedTokenizer
+from ....utils import logging


 logger = logging.get_logger(__name__)
--- a/src/transformers/models/deprecated/tvlt/__init__.py
+++ b/src/transformers/models/deprecated/tvlt/__init__.py
@@ -17,7 +17,7 @@
 # limitations under the License.
 from typing import TYPE_CHECKING

-from ...utils import (
+from ....utils import (
    OptionalDependencyNotAvailable,
    _LazyModule,
    is_torch_available,
--- a/src/transformers/models/deprecated/tvlt/configuration_tvlt.py
+++ b/src/transformers/models/deprecated/tvlt/configuration_tvlt.py
@@ -14,8 +14,8 @@
 # limitations under the License.
 """TVLT model configuration"""

-from ...configuration_utils import PretrainedConfig
-from ...utils import logging
+from ....configuration_utils import PretrainedConfig
+from ....utils import logging


 logger = logging.get_logger(__name__)
--- a/src/transformers/models/deprecated/tvlt/feature_extraction_tvlt.py
+++ b/src/transformers/models/deprecated/tvlt/feature_extraction_tvlt.py
@@ -19,9 +19,9 @@ from typing import List, Optional, Union

 import numpy as np

-from ...audio_utils import mel_filter_bank, spectrogram, window_function
-from ...feature_extraction_sequence_utils import BatchFeature, SequenceFeatureExtractor
-from ...utils import TensorType, logging
+from ....audio_utils import mel_filter_bank, spectrogram, window_function
+from ....feature_extraction_sequence_utils import BatchFeature, SequenceFeatureExtractor
+from ....utils import TensorType, logging


 logger = logging.get_logger(__name__)
--- a/src/transformers/models/deprecated/tvlt/image_processing_tvlt.py
+++ b/src/transformers/models/deprecated/tvlt/image_processing_tvlt.py
@@ -18,13 +18,13 @@ from typing import Dict, List, Optional, Union

 import numpy as np

-from ...image_processing_utils import BaseImageProcessor, BatchFeature, get_size_dict
-from ...image_transforms import (
+from ....image_processing_utils import BaseImageProcessor, BatchFeature, get_size_dict
+from ....image_transforms import (
    get_resize_output_image_size,
    resize,
    to_channel_dimension_format,
 )
-from ...image_utils import (
+from ....image_utils import (
    IMAGENET_STANDARD_MEAN,
    IMAGENET_STANDARD_STD,
    ChannelDimension,
@@ -38,7 +38,7 @@ from ...image_utils import (
    validate_kwargs,
    validate_preprocess_arguments,
 )
-from ...utils import TensorType, logging
+from ....utils import TensorType, logging


 logger = logging.get_logger(__name__)
--- a/src/transformers/models/deprecated/tvlt/modeling_tvlt.py
+++ b/src/transformers/models/deprecated/tvlt/modeling_tvlt.py
@@ -25,11 +25,11 @@ import torch.utils.checkpoint
 from torch import nn
 from torch.nn import BCEWithLogitsLoss, CrossEntropyLoss, MSELoss

-from ...activations import ACT2FN
-from ...modeling_outputs import BaseModelOutput, SequenceClassifierOutput
-from ...modeling_utils import PreTrainedModel
-from ...pytorch_utils import find_pruneable_heads_and_indices, prune_linear_layer
-from ...utils import (
+from ....activations import ACT2FN
+from ....modeling_outputs import BaseModelOutput, SequenceClassifierOutput
+from ....modeling_utils import PreTrainedModel
+from ....pytorch_utils import find_pruneable_heads_and_indices, prune_linear_layer
+from ....utils import (
    ModelOutput,
    add_start_docstrings,
    add_start_docstrings_to_model_forward,
--- a/src/transformers/models/deprecated/tvlt/processing_tvlt.py
+++ b/src/transformers/models/deprecated/tvlt/processing_tvlt.py
@@ -16,7 +16,7 @@
 Processor class for TVLT.
 """

-from ...processing_utils import ProcessorMixin
+from ....processing_utils import ProcessorMixin


 class TvltProcessor(ProcessorMixin):
--- a/src/transformers/models/deprecated/vit_hybrid/__init__.py
+++ b/src/transformers/models/deprecated/vit_hybrid/__init__.py
@@ -13,7 +13,7 @@
 # limitations under the License.
 from typing import TYPE_CHECKING

-from ...utils import OptionalDependencyNotAvailable, _LazyModule, is_torch_available, is_vision_available
+from ....utils import OptionalDependencyNotAvailable, _LazyModule, is_torch_available, is_vision_available


 _import_structure = {"configuration_vit_hybrid": ["ViTHybridConfig"]}
--- a/src/transformers/models/deprecated/vit_hybrid/configuration_vit_hybrid.py
+++ b/src/transformers/models/deprecated/vit_hybrid/configuration_vit_hybrid.py
@@ -14,10 +14,10 @@
 # limitations under the License.
 """ViT Hybrid model configuration"""

-from ...configuration_utils import PretrainedConfig
-from ...utils import logging
-from ..auto.configuration_auto import CONFIG_MAPPING
-from ..bit import BitConfig
+from ....configuration_utils import PretrainedConfig
+from ....utils import logging
+from ...auto.configuration_auto import CONFIG_MAPPING
+from ...bit import BitConfig


 logger = logging.get_logger(__name__)
--- a/src/transformers/models/deprecated/vit_hybrid/convert_vit_hybrid_timm_to_pytorch.py
+++ b/src/transformers/models/deprecated/vit_hybrid/convert_vit_hybrid_timm_to_pytorch.py
--- a/src/transformers/models/deprecated/vit_hybrid/image_processing_vit_hybrid.py
+++ b/src/transformers/models/deprecated/vit_hybrid/image_processing_vit_hybrid.py
@@ -18,14 +18,14 @@ from typing import Dict, List, Optional, Union

 import numpy as np

-from ...image_processing_utils import BaseImageProcessor, BatchFeature, get_size_dict
-from ...image_transforms import (
+from ....image_processing_utils import BaseImageProcessor, BatchFeature, get_size_dict
+from ....image_transforms import (
    convert_to_rgb,
    get_resize_output_image_size,
    resize,
    to_channel_dimension_format,
 )
-from ...image_utils import (
+from ....image_utils import (
    OPENAI_CLIP_MEAN,
    OPENAI_CLIP_STD,
    ChannelDimension,
@@ -39,7 +39,7 @@ from ...image_utils import (
    validate_kwargs,
    validate_preprocess_arguments,
 )
-from ...utils import TensorType, is_vision_available, logging
+from ....utils import TensorType, is_vision_available, logging


 logger = logging.get_logger(__name__)
--- a/src/transformers/models/deprecated/vit_hybrid/modeling_vit_hybrid.py
+++ b/src/transformers/models/deprecated/vit_hybrid/modeling_vit_hybrid.py
@@ -23,12 +23,12 @@ import torch.utils.checkpoint
 from torch import nn
 from torch.nn import BCEWithLogitsLoss, CrossEntropyLoss, MSELoss

-from ...activations import ACT2FN
-from ...modeling_outputs import BaseModelOutput, BaseModelOutputWithPooling, ImageClassifierOutput
-from ...modeling_utils import PreTrainedModel
-from ...pytorch_utils import find_pruneable_heads_and_indices, prune_linear_layer
-from ...utils import add_code_sample_docstrings, add_start_docstrings, add_start_docstrings_to_model_forward, logging
-from ...utils.backbone_utils import load_backbone
+from ....activations import ACT2FN
+from ....modeling_outputs import BaseModelOutput, BaseModelOutputWithPooling, ImageClassifierOutput
+from ....modeling_utils import PreTrainedModel
+from ....pytorch_utils import find_pruneable_heads_and_indices, prune_linear_layer
+from ....utils import add_code_sample_docstrings, add_start_docstrings, add_start_docstrings_to_model_forward, logging
+from ....utils.backbone_utils import load_backbone
 from .configuration_vit_hybrid import ViTHybridConfig


--- a/src/transformers/models/deprecated/xlm_prophetnet/__init__.py
+++ b/src/transformers/models/deprecated/xlm_prophetnet/__init__.py
@@ -13,7 +13,7 @@
 # limitations under the License.
 from typing import TYPE_CHECKING

-from ...utils import OptionalDependencyNotAvailable, _LazyModule, is_sentencepiece_available, is_torch_available
+from ....utils import OptionalDependencyNotAvailable, _LazyModule, is_sentencepiece_available, is_torch_available


 _import_structure = {
--- a/src/transformers/models/deprecated/xlm_prophetnet/configuration_xlm_prophetnet.py
+++ b/src/transformers/models/deprecated/xlm_prophetnet/configuration_xlm_prophetnet.py
@@ -16,8 +16,8 @@

 from typing import Callable, Optional, Union

-from ...configuration_utils import PretrainedConfig
-from ...utils import logging
+from ....configuration_utils import PretrainedConfig
+from ....utils import logging


 logger = logging.get_logger(__name__)
--- a/src/transformers/models/deprecated/xlm_prophetnet/modeling_xlm_prophetnet.py
+++ b/src/transformers/models/deprecated/xlm_prophetnet/modeling_xlm_prophetnet.py
@@ -25,10 +25,10 @@ import torch.utils.checkpoint
 from torch import Tensor, nn
 from torch.nn import LayerNorm

-from ...activations import ACT2FN
-from ...modeling_outputs import BaseModelOutput
-from ...modeling_utils import PreTrainedModel
-from ...utils import (
+from ....activations import ACT2FN
+from ....modeling_outputs import BaseModelOutput
+from ....modeling_utils import PreTrainedModel
+from ....utils import (
    ModelOutput,
    add_start_docstrings,
    add_start_docstrings_to_model_forward,
--- a/src/transformers/models/deprecated/xlm_prophetnet/tokenization_xlm_prophetnet.py
+++ b/src/transformers/models/deprecated/xlm_prophetnet/tokenization_xlm_prophetnet.py
@@ -18,8 +18,8 @@ import os
 from shutil import copyfile
 from typing import Any, Dict, List, Optional, Tuple

-from ...tokenization_utils import PreTrainedTokenizer
-from ...utils import logging
+from ....tokenization_utils import PreTrainedTokenizer
+from ....utils import logging


 logger = logging.get_logger(__name__)
--- a/src/transformers/models/dinat/modeling_dinat.py
+++ b/src/transformers/models/dinat/modeling_dinat.py
@@ -71,7 +71,6 @@ _IMAGE_CLASS_EXPECTED_OUTPUT = "tabby, tabby cat"


@dataclass
-# Copied from transformers.models.nat.modeling_nat.NatEncoderOutput with Nat->Dinat
 class DinatEncoderOutput(ModelOutput):
    """
    Dinat encoder's outputs, with potential hidden states and attentions.
@@ -105,7 +104,6 @@ class DinatEncoderOutput(ModelOutput):


@dataclass
-# Copied from transformers.models.nat.modeling_nat.NatModelOutput with Nat->Dinat
 class DinatModelOutput(ModelOutput):
    """
    Dinat model's outputs that also contains a pooling of the last hidden states.
@@ -142,7 +140,6 @@ class DinatModelOutput(ModelOutput):


@dataclass
-# Copied from transformers.models.nat.modeling_nat.NatImageClassifierOutput with Nat->Dinat
 class DinatImageClassifierOutput(ModelOutput):
    """
    Dinat outputs for image classification.
@@ -178,7 +175,6 @@ class DinatImageClassifierOutput(ModelOutput):
    reshaped_hidden_states: Optional[Tuple[torch.FloatTensor, ...]] = None


-# Copied from transformers.models.nat.modeling_nat.NatEmbeddings with Nat->Dinat
 class DinatEmbeddings(nn.Module):
    """
    Construct the patch and position embeddings.
@@ -201,7 +197,6 @@ class DinatEmbeddings(nn.Module):
        return embeddings


-# Copied from transformers.models.nat.modeling_nat.NatPatchEmbeddings with Nat->Dinat
 class DinatPatchEmbeddings(nn.Module):
    """
    This class turns `pixel_values` of shape `(batch_size, num_channels, height, width)` into the initial
@@ -238,7 +233,6 @@ class DinatPatchEmbeddings(nn.Module):
        return embeddings


-# Copied from transformers.models.nat.modeling_nat.NatDownsampler with Nat->Dinat
 class DinatDownsampler(nn.Module):
    """
    Convolutional Downsampling Layer.
@@ -321,7 +315,6 @@ class NeighborhoodAttention(nn.Module):

        self.dropout = nn.Dropout(config.attention_probs_dropout_prob)

-    # Copied from transformers.models.nat.modeling_nat.NeighborhoodAttention.transpose_for_scores with Nat->Dinat
    def transpose_for_scores(self, x):
        new_x_shape = x.size()[:-1] + (self.num_attention_heads, self.attention_head_size)
        x = x.view(new_x_shape)
@@ -361,7 +354,6 @@ class NeighborhoodAttention(nn.Module):
        return outputs


-# Copied from transformers.models.nat.modeling_nat.NeighborhoodAttentionOutput
 class NeighborhoodAttentionOutput(nn.Module):
    def __init__(self, config, dim):
        super().__init__()
@@ -382,7 +374,6 @@ class NeighborhoodAttentionModule(nn.Module):
        self.output = NeighborhoodAttentionOutput(config, dim)
        self.pruned_heads = set()

-    # Copied from transformers.models.nat.modeling_nat.NeighborhoodAttentionModule.prune_heads
    def prune_heads(self, heads):
        if len(heads) == 0:
            return
@@ -401,7 +392,6 @@ class NeighborhoodAttentionModule(nn.Module):
        self.self.all_head_size = self.self.attention_head_size * self.self.num_attention_heads
        self.pruned_heads = self.pruned_heads.union(heads)

-    # Copied from transformers.models.nat.modeling_nat.NeighborhoodAttentionModule.forward
    def forward(
        self,
        hidden_states: torch.Tensor,
@@ -413,7 +403,6 @@ class NeighborhoodAttentionModule(nn.Module):
        return outputs


-# Copied from transformers.models.nat.modeling_nat.NatIntermediate with Nat->Dinat
 class DinatIntermediate(nn.Module):
    def __init__(self, config, dim):
        super().__init__()
@@ -429,7 +418,6 @@ class DinatIntermediate(nn.Module):
        return hidden_states


-# Copied from transformers.models.nat.modeling_nat.NatOutput with Nat->Dinat
 class DinatOutput(nn.Module):
    def __init__(self, config, dim):
        super().__init__()
@@ -539,7 +527,6 @@ class DinatStage(nn.Module):

        self.pointing = False

-    # Copied from transformers.models.nat.modeling_nat.NatStage.forward
    def forward(
        self,
        hidden_states: torch.Tensor,
@@ -582,7 +569,6 @@ class DinatEncoder(nn.Module):
            ]
        )

-    # Copied from transformers.models.nat.modeling_nat.NatEncoder.forward with Nat->Dinat
    def forward(
        self,
        hidden_states: torch.Tensor,
@@ -687,7 +673,6 @@ DINAT_INPUTS_DOCSTRING = r"""
    "The bare Dinat Model transformer outputting raw hidden-states without any specific head on top.",
    DINAT_START_DOCSTRING,
 )
-# Copied from transformers.models.nat.modeling_nat.NatModel with Nat->Dinat, NAT->DINAT
 class DinatModel(DinatPreTrainedModel):
    def __init__(self, config, add_pooling_layer=True):
        super().__init__(config)
--- a/src/transformers/utils/dummy_pt_objects.py
+++ b/src/transformers/utils/dummy_pt_objects.py
--- a/src/transformers/utils/dummy_sentencepiece_objects.py
+++ b/src/transformers/utils/dummy_sentencepiece_objects.py
@@ -72,6 +72,13 @@ class ErnieMTokenizer(metaclass=DummyObject):
        requires_backends(self, ["sentencepiece"])


+class XLMProphetNetTokenizer(metaclass=DummyObject):
+    _backends = ["sentencepiece"]
+
+    def __init__(self, *args, **kwargs):
+        requires_backends(self, ["sentencepiece"])
+
+
 class FNetTokenizer(metaclass=DummyObject):
    _backends = ["sentencepiece"]

@@ -233,13 +240,6 @@ class XGLMTokenizer(metaclass=DummyObject):
        requires_backends(self, ["sentencepiece"])


-class XLMProphetNetTokenizer(metaclass=DummyObject):
-    _backends = ["sentencepiece"]
-
-    def __init__(self, *args, **kwargs):
-        requires_backends(self, ["sentencepiece"])
-
-
 class XLMRobertaTokenizer(metaclass=DummyObject):
    _backends = ["sentencepiece"]

--- a/src/transformers/utils/dummy_tf_objects.py
+++ b/src/transformers/utils/dummy_tf_objects.py
@@ -1038,6 +1038,34 @@ class TFDeiTPreTrainedModel(metaclass=DummyObject):
        requires_backends(self, ["tf"])


+class TFEfficientFormerForImageClassification(metaclass=DummyObject):
+    _backends = ["tf"]
+
+    def __init__(self, *args, **kwargs):
+        requires_backends(self, ["tf"])
+
+
+class TFEfficientFormerForImageClassificationWithTeacher(metaclass=DummyObject):
+    _backends = ["tf"]
+
+    def __init__(self, *args, **kwargs):
+        requires_backends(self, ["tf"])
+
+
+class TFEfficientFormerModel(metaclass=DummyObject):
+    _backends = ["tf"]
+
+    def __init__(self, *args, **kwargs):
+        requires_backends(self, ["tf"])
+
+
+class TFEfficientFormerPreTrainedModel(metaclass=DummyObject):
+    _backends = ["tf"]
+
+    def __init__(self, *args, **kwargs):
+        requires_backends(self, ["tf"])
+
+
 class TFAdaptiveEmbedding(metaclass=DummyObject):
    _backends = ["tf"]

@@ -1178,34 +1206,6 @@ class TFDPRReader(metaclass=DummyObject):
        requires_backends(self, ["tf"])


-class TFEfficientFormerForImageClassification(metaclass=DummyObject):
-    _backends = ["tf"]
-
-    def __init__(self, *args, **kwargs):
-        requires_backends(self, ["tf"])
-
-
-class TFEfficientFormerForImageClassificationWithTeacher(metaclass=DummyObject):
-    _backends = ["tf"]
-
-    def __init__(self, *args, **kwargs):
-        requires_backends(self, ["tf"])
-
-
-class TFEfficientFormerModel(metaclass=DummyObject):
-    _backends = ["tf"]
-
-    def __init__(self, *args, **kwargs):
-        requires_backends(self, ["tf"])
-
-
-class TFEfficientFormerPreTrainedModel(metaclass=DummyObject):
-    _backends = ["tf"]
-
-    def __init__(self, *args, **kwargs):
-        requires_backends(self, ["tf"])
-
-
 class TFElectraForMaskedLM(metaclass=DummyObject):
    _backends = ["tf"]

--- a/src/transformers/utils/dummy_tokenizers_objects.py
+++ b/src/transformers/utils/dummy_tokenizers_objects.py
@@ -121,6 +121,13 @@ class DebertaV2TokenizerFast(metaclass=DummyObject):
        requires_backends(self, ["tokenizers"])


+class RealmTokenizerFast(metaclass=DummyObject):
+    _backends = ["tokenizers"]
+
+    def __init__(self, *args, **kwargs):
+        requires_backends(self, ["tokenizers"])
+
+
 class RetriBertTokenizerFast(metaclass=DummyObject):
    _backends = ["tokenizers"]

@@ -352,13 +359,6 @@ class Qwen2TokenizerFast(metaclass=DummyObject):
        requires_backends(self, ["tokenizers"])


-class RealmTokenizerFast(metaclass=DummyObject):
-    _backends = ["tokenizers"]
-
-    def __init__(self, *args, **kwargs):
-        requires_backends(self, ["tokenizers"])
-
-
 class ReformerTokenizerFast(metaclass=DummyObject):
    _backends = ["tokenizers"]

--- a/src/transformers/utils/dummy_vision_objects.py
+++ b/src/transformers/utils/dummy_vision_objects.py
@@ -142,6 +142,27 @@ class DetaImageProcessor(metaclass=DummyObject):
        requires_backends(self, ["vision"])


+class EfficientFormerImageProcessor(metaclass=DummyObject):
+    _backends = ["vision"]
+
+    def __init__(self, *args, **kwargs):
+        requires_backends(self, ["vision"])
+
+
+class TvltImageProcessor(metaclass=DummyObject):
+    _backends = ["vision"]
+
+    def __init__(self, *args, **kwargs):
+        requires_backends(self, ["vision"])
+
+
+class ViTHybridImageProcessor(metaclass=DummyObject):
+    _backends = ["vision"]
+
+    def __init__(self, *args, **kwargs):
+        requires_backends(self, ["vision"])
+
+
 class DetrFeatureExtractor(metaclass=DummyObject):
    _backends = ["vision"]

@@ -184,13 +205,6 @@ class DPTImageProcessor(metaclass=DummyObject):
        requires_backends(self, ["vision"])


-class EfficientFormerImageProcessor(metaclass=DummyObject):
-    _backends = ["vision"]
-
-    def __init__(self, *args, **kwargs):
-        requires_backends(self, ["vision"])
-
-
 class EfficientNetImageProcessor(metaclass=DummyObject):
    _backends = ["vision"]

@@ -520,13 +534,6 @@ class Swin2SRImageProcessor(metaclass=DummyObject):
        requires_backends(self, ["vision"])


-class TvltImageProcessor(metaclass=DummyObject):
-    _backends = ["vision"]
-
-    def __init__(self, *args, **kwargs):
-        requires_backends(self, ["vision"])
-
-
 class TvpImageProcessor(metaclass=DummyObject):
    _backends = ["vision"]

@@ -590,13 +597,6 @@ class ViTImageProcessor(metaclass=DummyObject):
        requires_backends(self, ["vision"])


-class ViTHybridImageProcessor(metaclass=DummyObject):
-    _backends = ["vision"]
-
-    def __init__(self, *args, **kwargs):
-        requires_backends(self, ["vision"])
-
-
 class VitMatteImageProcessor(metaclass=DummyObject):
    _backends = ["vision"]

--- a/tests/models/deta/__init__.py
+++ b/tests/models/deta/__init__.py
--- a/tests/models/deta/test_image_processing_deta.py
+++ b/tests/models/deta/test_image_processing_deta.py
@@ -1,535 +0,0 @@
-# coding=utf-8
-# Copyright 2022 HuggingFace Inc.
-#
-# Licensed under the Apache License, Version 2.0 (the "License");
-# you may not use this file except in compliance with the License.
-# You may obtain a copy of the License at
-#
-#     http://www.apache.org/licenses/LICENSE-2.0
-#
-# Unless required by applicable law or agreed to in writing, software
-# distributed under the License is distributed on an "AS IS" BASIS,
-# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
-# See the License for the specific language governing permissions and
-# limitations under the License.
-
-
-import json
-import pathlib
-import unittest
-
-from transformers.testing_utils import require_torch, require_vision, slow
-from transformers.utils import is_torch_available, is_vision_available
-
-from ...test_image_processing_common import AnnotationFormatTestMixin, ImageProcessingTestMixin, prepare_image_inputs
-
-
-if is_torch_available():
-    import torch
-
-if is_vision_available():
-    from PIL import Image
-
-    from transformers import DetaImageProcessor
-
-
-class DetaImageProcessingTester(unittest.TestCase):
-    def __init__(
-        self,
-        parent,
-        batch_size=7,
-        num_channels=3,
-        min_resolution=30,
-        max_resolution=400,
-        do_resize=True,
-        size=None,
-        do_normalize=True,
-        image_mean=[0.5, 0.5, 0.5],
-        image_std=[0.5, 0.5, 0.5],
-        do_rescale=True,
-        rescale_factor=1 / 255,
-        do_pad=True,
-    ):
-        # by setting size["longest_edge"] > max_resolution we're effectively not testing this :p
-        size = size if size is not None else {"shortest_edge": 18, "longest_edge": 1333}
-        self.parent = parent
-        self.batch_size = batch_size
-        self.num_channels = num_channels
-        self.min_resolution = min_resolution
-        self.max_resolution = max_resolution
-        self.do_resize = do_resize
-        self.size = size
-        self.do_normalize = do_normalize
-        self.image_mean = image_mean
-        self.image_std = image_std
-        self.do_rescale = do_rescale
-        self.rescale_factor = rescale_factor
-        self.do_pad = do_pad
-
-    def prepare_image_processor_dict(self):
-        return {
-            "do_resize": self.do_resize,
-            "size": self.size,
-            "do_normalize": self.do_normalize,
-            "image_mean": self.image_mean,
-            "image_std": self.image_std,
-            "do_rescale": self.do_rescale,
-            "rescale_factor": self.rescale_factor,
-            "do_pad": self.do_pad,
-        }
-
-    def get_expected_values(self, image_inputs, batched=False):
-        """
-        This function computes the expected height and width when providing images to DetaImageProcessor,
-        assuming do_resize is set to True with a scalar size.
-        """
-        if not batched:
-            image = image_inputs[0]
-            if isinstance(image, Image.Image):
-                w, h = image.size
-            else:
-                h, w = image.shape[1], image.shape[2]
-            if w < h:
-                expected_height = int(self.size["shortest_edge"] * h / w)
-                expected_width = self.size["shortest_edge"]
-            elif w > h:
-                expected_height = self.size["shortest_edge"]
-                expected_width = int(self.size["shortest_edge"] * w / h)
-            else:
-                expected_height = self.size["shortest_edge"]
-                expected_width = self.size["shortest_edge"]
-
-        else:
-            expected_values = []
-            for image in image_inputs:
-                expected_height, expected_width = self.get_expected_values([image])
-                expected_values.append((expected_height, expected_width))
-            expected_height = max(expected_values, key=lambda item: item[0])[0]
-            expected_width = max(expected_values, key=lambda item: item[1])[1]
-
-        return expected_height, expected_width
-
-    def expected_output_image_shape(self, images):
-        height, width = self.get_expected_values(images, batched=True)
-        return self.num_channels, height, width
-
-    def prepare_image_inputs(self, equal_resolution=False, numpify=False, torchify=False):
-        return prepare_image_inputs(
-            batch_size=self.batch_size,
-            num_channels=self.num_channels,
-            min_resolution=self.min_resolution,
-            max_resolution=self.max_resolution,
-            equal_resolution=equal_resolution,
-            numpify=numpify,
-            torchify=torchify,
-        )
-
-
-@require_torch
-@require_vision
-class DetaImageProcessingTest(AnnotationFormatTestMixin, ImageProcessingTestMixin, unittest.TestCase):
-    image_processing_class = DetaImageProcessor if is_vision_available() else None
-
-    def setUp(self):
-        self.image_processor_tester = DetaImageProcessingTester(self)
-
-    @property
-    def image_processor_dict(self):
-        return self.image_processor_tester.prepare_image_processor_dict()
-
-    def test_image_processor_properties(self):
-        image_processing = self.image_processing_class(**self.image_processor_dict)
-        self.assertTrue(hasattr(image_processing, "image_mean"))
-        self.assertTrue(hasattr(image_processing, "image_std"))
-        self.assertTrue(hasattr(image_processing, "do_normalize"))
-        self.assertTrue(hasattr(image_processing, "do_resize"))
-        self.assertTrue(hasattr(image_processing, "do_rescale"))
-        self.assertTrue(hasattr(image_processing, "do_pad"))
-        self.assertTrue(hasattr(image_processing, "size"))
-
-    def test_image_processor_from_dict_with_kwargs(self):
-        image_processor = self.image_processing_class.from_dict(self.image_processor_dict)
-        self.assertEqual(image_processor.size, {"shortest_edge": 18, "longest_edge": 1333})
-        self.assertEqual(image_processor.do_pad, True)
-
-    @slow
-    def test_call_pytorch_with_coco_detection_annotations(self):
-        # prepare image and target
-        image = Image.open("./tests/fixtures/tests_samples/COCO/000000039769.png")
-        with open("./tests/fixtures/tests_samples/COCO/coco_annotations.txt", "r") as f:
-            target = json.loads(f.read())
-
-        target = {"image_id": 39769, "annotations": target}
-
-        # encode them
-        image_processing = DetaImageProcessor()
-        encoding = image_processing(images=image, annotations=target, return_tensors="pt")
-
-        # verify pixel values
-        expected_shape = torch.Size([1, 3, 800, 1066])
-        self.assertEqual(encoding["pixel_values"].shape, expected_shape)
-
-        expected_slice = torch.tensor([0.2796, 0.3138, 0.3481])
-        self.assertTrue(torch.allclose(encoding["pixel_values"][0, 0, 0, :3], expected_slice, atol=1e-4))
-
-        # verify area
-        expected_area = torch.tensor([5887.9600, 11250.2061, 489353.8438, 837122.7500, 147967.5156, 165732.3438])
-        self.assertTrue(torch.allclose(encoding["labels"][0]["area"], expected_area))
-        # verify boxes
-        expected_boxes_shape = torch.Size([6, 4])
-        self.assertEqual(encoding["labels"][0]["boxes"].shape, expected_boxes_shape)
-        expected_boxes_slice = torch.tensor([0.5503, 0.2765, 0.0604, 0.2215])
-        self.assertTrue(torch.allclose(encoding["labels"][0]["boxes"][0], expected_boxes_slice, atol=1e-3))
-        # verify image_id
-        expected_image_id = torch.tensor([39769])
-        self.assertTrue(torch.allclose(encoding["labels"][0]["image_id"], expected_image_id))
-        # verify is_crowd
-        expected_is_crowd = torch.tensor([0, 0, 0, 0, 0, 0])
-        self.assertTrue(torch.allclose(encoding["labels"][0]["iscrowd"], expected_is_crowd))
-        # verify class_labels
-        expected_class_labels = torch.tensor([75, 75, 63, 65, 17, 17])
-        self.assertTrue(torch.allclose(encoding["labels"][0]["class_labels"], expected_class_labels))
-        # verify orig_size
-        expected_orig_size = torch.tensor([480, 640])
-        self.assertTrue(torch.allclose(encoding["labels"][0]["orig_size"], expected_orig_size))
-        # verify size
-        expected_size = torch.tensor([800, 1066])
-        self.assertTrue(torch.allclose(encoding["labels"][0]["size"], expected_size))
-
-    @slow
-    def test_call_pytorch_with_coco_panoptic_annotations(self):
-        # prepare image, target and masks_path
-        image = Image.open("./tests/fixtures/tests_samples/COCO/000000039769.png")
-        with open("./tests/fixtures/tests_samples/COCO/coco_panoptic_annotations.txt", "r") as f:
-            target = json.loads(f.read())
-
-        target = {"file_name": "000000039769.png", "image_id": 39769, "segments_info": target}
-
-        masks_path = pathlib.Path("./tests/fixtures/tests_samples/COCO/coco_panoptic")
-
-        # encode them
-        image_processing = DetaImageProcessor(format="coco_panoptic")
-        encoding = image_processing(images=image, annotations=target, masks_path=masks_path, return_tensors="pt")
-
-        # verify pixel values
-        expected_shape = torch.Size([1, 3, 800, 1066])
-        self.assertEqual(encoding["pixel_values"].shape, expected_shape)
-
-        expected_slice = torch.tensor([0.2796, 0.3138, 0.3481])
-        self.assertTrue(torch.allclose(encoding["pixel_values"][0, 0, 0, :3], expected_slice, atol=1e-4))
-
-        # verify area
-        expected_area = torch.tensor([147979.6875, 165527.0469, 484638.5938, 11292.9375, 5879.6562, 7634.1147])
-        self.assertTrue(torch.allclose(encoding["labels"][0]["area"], expected_area))
-        # verify boxes
-        expected_boxes_shape = torch.Size([6, 4])
-        self.assertEqual(encoding["labels"][0]["boxes"].shape, expected_boxes_shape)
-        expected_boxes_slice = torch.tensor([0.2625, 0.5437, 0.4688, 0.8625])
-        self.assertTrue(torch.allclose(encoding["labels"][0]["boxes"][0], expected_boxes_slice, atol=1e-3))
-        # verify image_id
-        expected_image_id = torch.tensor([39769])
-        self.assertTrue(torch.allclose(encoding["labels"][0]["image_id"], expected_image_id))
-        # verify is_crowd
-        expected_is_crowd = torch.tensor([0, 0, 0, 0, 0, 0])
-        self.assertTrue(torch.allclose(encoding["labels"][0]["iscrowd"], expected_is_crowd))
-        # verify class_labels
-        expected_class_labels = torch.tensor([17, 17, 63, 75, 75, 93])
-        self.assertTrue(torch.allclose(encoding["labels"][0]["class_labels"], expected_class_labels))
-        # verify masks
-        expected_masks_sum = 822873
-        self.assertEqual(encoding["labels"][0]["masks"].sum().item(), expected_masks_sum)
-        # verify orig_size
-        expected_orig_size = torch.tensor([480, 640])
-        self.assertTrue(torch.allclose(encoding["labels"][0]["orig_size"], expected_orig_size))
-        # verify size
-        expected_size = torch.tensor([800, 1066])
-        self.assertTrue(torch.allclose(encoding["labels"][0]["size"], expected_size))
-
-    @slow
-    # Copied from tests.models.detr.test_image_processing_detr.DetrImageProcessingTest.test_batched_coco_detection_annotations with Detr->Deta
-    def test_batched_coco_detection_annotations(self):
-        image_0 = Image.open("./tests/fixtures/tests_samples/COCO/000000039769.png")
-        image_1 = Image.open("./tests/fixtures/tests_samples/COCO/000000039769.png").resize((800, 800))
-
-        with open("./tests/fixtures/tests_samples/COCO/coco_annotations.txt", "r") as f:
-            target = json.loads(f.read())
-
-        annotations_0 = {"image_id": 39769, "annotations": target}
-        annotations_1 = {"image_id": 39769, "annotations": target}
-
-        # Adjust the bounding boxes for the resized image
-        w_0, h_0 = image_0.size
-        w_1, h_1 = image_1.size
-        for i in range(len(annotations_1["annotations"])):
-            coords = annotations_1["annotations"][i]["bbox"]
-            new_bbox = [
-                coords[0] * w_1 / w_0,
-                coords[1] * h_1 / h_0,
-                coords[2] * w_1 / w_0,
-                coords[3] * h_1 / h_0,
-            ]
-            annotations_1["annotations"][i]["bbox"] = new_bbox
-
-        images = [image_0, image_1]
-        annotations = [annotations_0, annotations_1]
-
-        image_processing = DetaImageProcessor()
-        encoding = image_processing(
-            images=images,
-            annotations=annotations,
-            return_segmentation_masks=True,
-            return_tensors="pt",  # do_convert_annotations=True
-        )
-
-        # Check the pixel values have been padded
-        postprocessed_height, postprocessed_width = 800, 1066
-        expected_shape = torch.Size([2, 3, postprocessed_height, postprocessed_width])
-        self.assertEqual(encoding["pixel_values"].shape, expected_shape)
-
-        # Check the bounding boxes have been adjusted for padded images
-        self.assertEqual(encoding["labels"][0]["boxes"].shape, torch.Size([6, 4]))
-        self.assertEqual(encoding["labels"][1]["boxes"].shape, torch.Size([6, 4]))
-        expected_boxes_0 = torch.tensor(
-            [
-                [0.6879, 0.4609, 0.0755, 0.3691],
-                [0.2118, 0.3359, 0.2601, 0.1566],
-                [0.5011, 0.5000, 0.9979, 1.0000],
-                [0.5010, 0.5020, 0.9979, 0.9959],
-                [0.3284, 0.5944, 0.5884, 0.8112],
-                [0.8394, 0.5445, 0.3213, 0.9110],
-            ]
-        )
-        expected_boxes_1 = torch.tensor(
-            [
-                [0.4130, 0.2765, 0.0453, 0.2215],
-                [0.1272, 0.2016, 0.1561, 0.0940],
-                [0.3757, 0.4933, 0.7488, 0.9865],
-                [0.3759, 0.5002, 0.7492, 0.9955],
-                [0.1971, 0.5456, 0.3532, 0.8646],
-                [0.5790, 0.4115, 0.3430, 0.7161],
-            ]
-        )
-        self.assertTrue(torch.allclose(encoding["labels"][0]["boxes"], expected_boxes_0, rtol=1e-3))
-        self.assertTrue(torch.allclose(encoding["labels"][1]["boxes"], expected_boxes_1, rtol=1e-3))
-
-        # Check the masks have also been padded
-        self.assertEqual(encoding["labels"][0]["masks"].shape, torch.Size([6, 800, 1066]))
-        self.assertEqual(encoding["labels"][1]["masks"].shape, torch.Size([6, 800, 1066]))
-
-        # Check if do_convert_annotations=False, then the annotations are not converted to centre_x, centre_y, width, height
-        # format and not in the range [0, 1]
-        encoding = image_processing(
-            images=images,
-            annotations=annotations,
-            return_segmentation_masks=True,
-            do_convert_annotations=False,
-            return_tensors="pt",
-        )
-        self.assertEqual(encoding["labels"][0]["boxes"].shape, torch.Size([6, 4]))
-        self.assertEqual(encoding["labels"][1]["boxes"].shape, torch.Size([6, 4]))
-        # Convert to absolute coordinates
-        unnormalized_boxes_0 = torch.vstack(
-            [
-                expected_boxes_0[:, 0] * postprocessed_width,
-                expected_boxes_0[:, 1] * postprocessed_height,
-                expected_boxes_0[:, 2] * postprocessed_width,
-                expected_boxes_0[:, 3] * postprocessed_height,
-            ]
-        ).T
-        unnormalized_boxes_1 = torch.vstack(
-            [
-                expected_boxes_1[:, 0] * postprocessed_width,
-                expected_boxes_1[:, 1] * postprocessed_height,
-                expected_boxes_1[:, 2] * postprocessed_width,
-                expected_boxes_1[:, 3] * postprocessed_height,
-            ]
-        ).T
-        # Convert from centre_x, centre_y, width, height to x_min, y_min, x_max, y_max
-        expected_boxes_0 = torch.vstack(
-            [
-                unnormalized_boxes_0[:, 0] - unnormalized_boxes_0[:, 2] / 2,
-                unnormalized_boxes_0[:, 1] - unnormalized_boxes_0[:, 3] / 2,
-                unnormalized_boxes_0[:, 0] + unnormalized_boxes_0[:, 2] / 2,
-                unnormalized_boxes_0[:, 1] + unnormalized_boxes_0[:, 3] / 2,
-            ]
-        ).T
-        expected_boxes_1 = torch.vstack(
-            [
-                unnormalized_boxes_1[:, 0] - unnormalized_boxes_1[:, 2] / 2,
-                unnormalized_boxes_1[:, 1] - unnormalized_boxes_1[:, 3] / 2,
-                unnormalized_boxes_1[:, 0] + unnormalized_boxes_1[:, 2] / 2,
-                unnormalized_boxes_1[:, 1] + unnormalized_boxes_1[:, 3] / 2,
-            ]
-        ).T
-        self.assertTrue(torch.allclose(encoding["labels"][0]["boxes"], expected_boxes_0, rtol=1))
-        self.assertTrue(torch.allclose(encoding["labels"][1]["boxes"], expected_boxes_1, rtol=1))
-
-    # Copied from tests.models.detr.test_image_processing_detr.DetrImageProcessingTest.test_batched_coco_panoptic_annotations with Detr->Deta
-    def test_batched_coco_panoptic_annotations(self):
-        # prepare image, target and masks_path
-        image_0 = Image.open("./tests/fixtures/tests_samples/COCO/000000039769.png")
-        image_1 = Image.open("./tests/fixtures/tests_samples/COCO/000000039769.png").resize((800, 800))
-
-        with open("./tests/fixtures/tests_samples/COCO/coco_panoptic_annotations.txt", "r") as f:
-            target = json.loads(f.read())
-
-        annotation_0 = {"file_name": "000000039769.png", "image_id": 39769, "segments_info": target}
-        annotation_1 = {"file_name": "000000039769.png", "image_id": 39769, "segments_info": target}
-
-        w_0, h_0 = image_0.size
-        w_1, h_1 = image_1.size
-        for i in range(len(annotation_1["segments_info"])):
-            coords = annotation_1["segments_info"][i]["bbox"]
-            new_bbox = [
-                coords[0] * w_1 / w_0,
-                coords[1] * h_1 / h_0,
-                coords[2] * w_1 / w_0,
-                coords[3] * h_1 / h_0,
-            ]
-            annotation_1["segments_info"][i]["bbox"] = new_bbox
-
-        masks_path = pathlib.Path("./tests/fixtures/tests_samples/COCO/coco_panoptic")
-
-        images = [image_0, image_1]
-        annotations = [annotation_0, annotation_1]
-
-        # encode them
-        image_processing = DetaImageProcessor(format="coco_panoptic")
-        encoding = image_processing(
-            images=images,
-            annotations=annotations,
-            masks_path=masks_path,
-            return_tensors="pt",
-            return_segmentation_masks=True,
-        )
-
-        # Check the pixel values have been padded
-        postprocessed_height, postprocessed_width = 800, 1066
-        expected_shape = torch.Size([2, 3, postprocessed_height, postprocessed_width])
-        self.assertEqual(encoding["pixel_values"].shape, expected_shape)
-
-        # Check the bounding boxes have been adjusted for padded images
-        self.assertEqual(encoding["labels"][0]["boxes"].shape, torch.Size([6, 4]))
-        self.assertEqual(encoding["labels"][1]["boxes"].shape, torch.Size([6, 4]))
-        expected_boxes_0 = torch.tensor(
-            [
-                [0.2625, 0.5437, 0.4688, 0.8625],
-                [0.7719, 0.4104, 0.4531, 0.7125],
-                [0.5000, 0.4927, 0.9969, 0.9854],
-                [0.1688, 0.2000, 0.2063, 0.0917],
-                [0.5492, 0.2760, 0.0578, 0.2187],
-                [0.4992, 0.4990, 0.9984, 0.9979],
-            ]
-        )
-        expected_boxes_1 = torch.tensor(
-            [
-                [0.1576, 0.3262, 0.2814, 0.5175],
-                [0.4634, 0.2463, 0.2720, 0.4275],
-                [0.3002, 0.2956, 0.5985, 0.5913],
-                [0.1013, 0.1200, 0.1238, 0.0550],
-                [0.3297, 0.1656, 0.0347, 0.1312],
-                [0.2997, 0.2994, 0.5994, 0.5987],
-            ]
-        )
-        self.assertTrue(torch.allclose(encoding["labels"][0]["boxes"], expected_boxes_0, rtol=1e-3))
-        self.assertTrue(torch.allclose(encoding["labels"][1]["boxes"], expected_boxes_1, rtol=1e-3))
-
-        # Check the masks have also been padded
-        self.assertEqual(encoding["labels"][0]["masks"].shape, torch.Size([6, 800, 1066]))
-        self.assertEqual(encoding["labels"][1]["masks"].shape, torch.Size([6, 800, 1066]))
-
-        # Check that when do_convert_annotations=False, the annotations are not converted to centre_x, centre_y, width,
-        # height format and are not normalized to the range [0, 1]
-        encoding = image_processing(
-            images=images,
-            annotations=annotations,
-            masks_path=masks_path,
-            return_segmentation_masks=True,
-            do_convert_annotations=False,
-            return_tensors="pt",
-        )
-        self.assertEqual(encoding["labels"][0]["boxes"].shape, torch.Size([6, 4]))
-        self.assertEqual(encoding["labels"][1]["boxes"].shape, torch.Size([6, 4]))
-        # Convert to absolute coordinates
-        unnormalized_boxes_0 = torch.vstack(
-            [
-                expected_boxes_0[:, 0] * postprocessed_width,
-                expected_boxes_0[:, 1] * postprocessed_height,
-                expected_boxes_0[:, 2] * postprocessed_width,
-                expected_boxes_0[:, 3] * postprocessed_height,
-            ]
-        ).T
-        unnormalized_boxes_1 = torch.vstack(
-            [
-                expected_boxes_1[:, 0] * postprocessed_width,
-                expected_boxes_1[:, 1] * postprocessed_height,
-                expected_boxes_1[:, 2] * postprocessed_width,
-                expected_boxes_1[:, 3] * postprocessed_height,
-            ]
-        ).T
-        # Convert from centre_x, centre_y, width, height to x_min, y_min, x_max, y_max
-        expected_boxes_0 = torch.vstack(
-            [
-                unnormalized_boxes_0[:, 0] - unnormalized_boxes_0[:, 2] / 2,
-                unnormalized_boxes_0[:, 1] - unnormalized_boxes_0[:, 3] / 2,
-                unnormalized_boxes_0[:, 0] + unnormalized_boxes_0[:, 2] / 2,
-                unnormalized_boxes_0[:, 1] + unnormalized_boxes_0[:, 3] / 2,
-            ]
-        ).T
-        expected_boxes_1 = torch.vstack(
-            [
-                unnormalized_boxes_1[:, 0] - unnormalized_boxes_1[:, 2] / 2,
-                unnormalized_boxes_1[:, 1] - unnormalized_boxes_1[:, 3] / 2,
-                unnormalized_boxes_1[:, 0] + unnormalized_boxes_1[:, 2] / 2,
-                unnormalized_boxes_1[:, 1] + unnormalized_boxes_1[:, 3] / 2,
-            ]
-        ).T
-        self.assertTrue(torch.allclose(encoding["labels"][0]["boxes"], expected_boxes_0, rtol=1))
-        self.assertTrue(torch.allclose(encoding["labels"][1]["boxes"], expected_boxes_1, rtol=1))
-
-    # Copied from tests.models.detr.test_image_processing_detr.DetrImageProcessingTest.test_max_width_max_height_resizing_and_pad_strategy with Detr->Deta
-    def test_max_width_max_height_resizing_and_pad_strategy(self):
-        image_1 = torch.ones([200, 100, 3], dtype=torch.uint8)
-
-        # do_pad=False, max_height=100, max_width=100, image=200x100 -> 100x50
-        image_processor = DetaImageProcessor(
-            size={"max_height": 100, "max_width": 100},
-            do_pad=False,
-        )
-        inputs = image_processor(images=[image_1], return_tensors="pt")
-        self.assertEqual(inputs["pixel_values"].shape, torch.Size([1, 3, 100, 50]))
-
-        # do_pad=False, max_height=300, max_width=100, image=200x100 -> 200x100
-        image_processor = DetaImageProcessor(
-            size={"max_height": 300, "max_width": 100},
-            do_pad=False,
-        )
-        inputs = image_processor(images=[image_1], return_tensors="pt")
-
-        # do_pad=True, max_height=100, max_width=100, image=200x100 -> 100x100
-        image_processor = DetaImageProcessor(
-            size={"max_height": 100, "max_width": 100}, do_pad=True, pad_size={"height": 100, "width": 100}
-        )
-        inputs = image_processor(images=[image_1], return_tensors="pt")
-        self.assertEqual(inputs["pixel_values"].shape, torch.Size([1, 3, 100, 100]))
-
-        # do_pad=True, max_height=300, max_width=100, pad_size=301x101, image=200x100 -> 301x101
-        image_processor = DetaImageProcessor(
-            size={"max_height": 300, "max_width": 100},
-            do_pad=True,
-            pad_size={"height": 301, "width": 101},
-        )
-        inputs = image_processor(images=[image_1], return_tensors="pt")
-        self.assertEqual(inputs["pixel_values"].shape, torch.Size([1, 3, 301, 101]))
-
-        ### Check for batch
-        image_2 = torch.ones([100, 150, 3], dtype=torch.uint8)
-
-        # do_pad=True, max_height=150, max_width=100, images=[200x100, 100x150] -> 150x100
-        image_processor = DetaImageProcessor(
-            size={"max_height": 150, "max_width": 100},
-            do_pad=True,
-            pad_size={"height": 150, "width": 100},
-        )
-        inputs = image_processor(images=[image_1, image_2], return_tensors="pt")
-        self.assertEqual(inputs["pixel_values"].shape, torch.Size([2, 3, 150, 100]))
--- a/tests/models/deta/test_modeling_deta.py
+++ b/tests/models/deta/test_modeling_deta.py
@@ -1,671 +0,0 @@
-# coding=utf-8
-# Copyright 2022 The HuggingFace Inc. team. All rights reserved.
-#
-# Licensed under the Apache License, Version 2.0 (the "License");
-# you may not use this file except in compliance with the License.
-# You may obtain a copy of the License at
-#
-#     http://www.apache.org/licenses/LICENSE-2.0
-#
-# Unless required by applicable law or agreed to in writing, software
-# distributed under the License is distributed on an "AS IS" BASIS,
-# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
-# See the License for the specific language governing permissions and
-# limitations under the License.
-"""Testing suite for the PyTorch DETA model."""
-
-import collections
-import inspect
-import math
-import re
-import unittest
-
-from transformers import DetaConfig, ResNetConfig, is_torch_available, is_torchvision_available, is_vision_available
-from transformers.file_utils import cached_property
-from transformers.testing_utils import require_torchvision, require_vision, slow, torch_device
-
-from ...generation.test_utils import GenerationTesterMixin
-from ...test_configuration_common import ConfigTester
-from ...test_modeling_common import ModelTesterMixin, _config_zero_init, floats_tensor
-from ...test_pipeline_mixin import PipelineTesterMixin
-
-
-if is_torch_available():
-    import torch
-
-    from transformers.pytorch_utils import id_tensor_storage
-
-if is_torchvision_available():
-    from transformers import DetaForObjectDetection, DetaModel
-
-
-if is_vision_available():
-    from PIL import Image
-
-    from transformers import AutoImageProcessor
-
-
-class DetaModelTester:
-    def __init__(
-        self,
-        parent,
-        batch_size=8,
-        is_training=True,
-        use_labels=True,
-        hidden_size=32,
-        num_hidden_layers=2,
-        num_attention_heads=8,
-        intermediate_size=4,
-        hidden_act="gelu",
-        hidden_dropout_prob=0.1,
-        attention_probs_dropout_prob=0.1,
-        num_queries=12,
-        two_stage_num_proposals=12,
-        num_channels=3,
-        image_size=224,
-        n_targets=8,
-        num_labels=91,
-        num_feature_levels=4,
-        encoder_n_points=2,
-        decoder_n_points=6,
-        two_stage=True,
-        assign_first_stage=True,
-        assign_second_stage=True,
-    ):
-        self.parent = parent
-        self.batch_size = batch_size
-        self.is_training = is_training
-        self.use_labels = use_labels
-        self.hidden_size = hidden_size
-        self.num_hidden_layers = num_hidden_layers
-        self.num_attention_heads = num_attention_heads
-        self.intermediate_size = intermediate_size
-        self.hidden_act = hidden_act
-        self.hidden_dropout_prob = hidden_dropout_prob
-        self.attention_probs_dropout_prob = attention_probs_dropout_prob
-        self.num_queries = num_queries
-        self.two_stage_num_proposals = two_stage_num_proposals
-        self.num_channels = num_channels
-        self.image_size = image_size
-        self.n_targets = n_targets
-        self.num_labels = num_labels
-        self.num_feature_levels = num_feature_levels
-        self.encoder_n_points = encoder_n_points
-        self.decoder_n_points = decoder_n_points
-        self.two_stage = two_stage
-        self.assign_first_stage = assign_first_stage
-        self.assign_second_stage = assign_second_stage
-
-        # we also set the expected seq length for both encoder and decoder
-        self.encoder_seq_length = (
-            math.ceil(self.image_size / 8) ** 2
-            + math.ceil(self.image_size / 16) ** 2
-            + math.ceil(self.image_size / 32) ** 2
-            + math.ceil(self.image_size / 64) ** 2
-        )
-        self.decoder_seq_length = self.num_queries
-
-    def prepare_config_and_inputs(self, model_class_name):
-        pixel_values = floats_tensor([self.batch_size, self.num_channels, self.image_size, self.image_size])
-
-        pixel_mask = torch.ones([self.batch_size, self.image_size, self.image_size], device=torch_device)
-
-        labels = None
-        if self.use_labels:
-            # labels is a list of Dict (each Dict being the labels for a given example in the batch)
-            labels = []
-            for i in range(self.batch_size):
-                target = {}
-                target["class_labels"] = torch.randint(
-                    high=self.num_labels, size=(self.n_targets,), device=torch_device
-                )
-                target["boxes"] = torch.rand(self.n_targets, 4, device=torch_device)
-                target["masks"] = torch.rand(self.n_targets, self.image_size, self.image_size, device=torch_device)
-                labels.append(target)
-
-        config = self.get_config(model_class_name)
-        return config, pixel_values, pixel_mask, labels
-
-    def get_config(self, model_class_name):
-        resnet_config = ResNetConfig(
-            num_channels=3,
-            embeddings_size=10,
-            hidden_sizes=[10, 20, 30, 40],
-            depths=[1, 1, 2, 1],
-            hidden_act="relu",
-            num_labels=3,
-            out_features=["stage2", "stage3", "stage4"],
-            out_indices=[2, 3, 4],
-        )
-        two_stage = model_class_name == "DetaForObjectDetection"
-        assign_first_stage = model_class_name == "DetaForObjectDetection"
-        assign_second_stage = model_class_name == "DetaForObjectDetection"
-        return DetaConfig(
-            d_model=self.hidden_size,
-            encoder_layers=self.num_hidden_layers,
-            decoder_layers=self.num_hidden_layers,
-            encoder_attention_heads=self.num_attention_heads,
-            decoder_attention_heads=self.num_attention_heads,
-            encoder_ffn_dim=self.intermediate_size,
-            decoder_ffn_dim=self.intermediate_size,
-            dropout=self.hidden_dropout_prob,
-            attention_dropout=self.attention_probs_dropout_prob,
-            num_queries=self.num_queries,
-            two_stage_num_proposals=self.two_stage_num_proposals,
-            num_labels=self.num_labels,
-            num_feature_levels=self.num_feature_levels,
-            encoder_n_points=self.encoder_n_points,
-            decoder_n_points=self.decoder_n_points,
-            two_stage=two_stage,
-            assign_first_stage=assign_first_stage,
-            assign_second_stage=assign_second_stage,
-            backbone_config=resnet_config,
-            backbone=None,
-        )
-
-    def prepare_config_and_inputs_for_common(self, model_class_name="DetaModel"):
-        config, pixel_values, pixel_mask, labels = self.prepare_config_and_inputs(model_class_name)
-        inputs_dict = {"pixel_values": pixel_values, "pixel_mask": pixel_mask}
-        return config, inputs_dict
-
-    def create_and_check_deta_model(self, config, pixel_values, pixel_mask, labels):
-        model = DetaModel(config=config)
-        model.to(torch_device)
-        model.eval()
-
-        result = model(pixel_values=pixel_values, pixel_mask=pixel_mask)
-        result = model(pixel_values)
-
-        self.parent.assertEqual(result.last_hidden_state.shape, (self.batch_size, self.num_queries, self.hidden_size))
-
-    def create_and_check_deta_freeze_backbone(self, config, pixel_values, pixel_mask, labels):
-        model = DetaModel(config=config)
-        model.to(torch_device)
-        model.eval()
-
-        model.freeze_backbone()
-
-        for _, param in model.backbone.model.named_parameters():
-            self.parent.assertEqual(False, param.requires_grad)
-
-    def create_and_check_deta_unfreeze_backbone(self, config, pixel_values, pixel_mask, labels):
-        model = DetaModel(config=config)
-        model.to(torch_device)
-        model.eval()
-
-        model.unfreeze_backbone()
-
-        for _, param in model.backbone.model.named_parameters():
-            self.parent.assertEqual(True, param.requires_grad)
-
-    def create_and_check_deta_object_detection_head_model(self, config, pixel_values, pixel_mask, labels):
-        model = DetaForObjectDetection(config=config)
-        model.to(torch_device)
-        model.eval()
-
-        result = model(pixel_values=pixel_values, pixel_mask=pixel_mask)
-        result = model(pixel_values)
-
-        self.parent.assertEqual(result.logits.shape, (self.batch_size, self.two_stage_num_proposals, self.num_labels))
-        self.parent.assertEqual(result.pred_boxes.shape, (self.batch_size, self.two_stage_num_proposals, 4))
-
-        result = model(pixel_values=pixel_values, pixel_mask=pixel_mask, labels=labels)
-
-        self.parent.assertEqual(result.loss.shape, ())
-        self.parent.assertEqual(result.logits.shape, (self.batch_size, self.two_stage_num_proposals, self.num_labels))
-        self.parent.assertEqual(result.pred_boxes.shape, (self.batch_size, self.two_stage_num_proposals, 4))
-
-
-@require_torchvision
-class DetaModelTest(ModelTesterMixin, GenerationTesterMixin, PipelineTesterMixin, unittest.TestCase):
-    all_model_classes = (DetaModel, DetaForObjectDetection) if is_torchvision_available() else ()
-    pipeline_model_mapping = (
-        {"image-feature-extraction": DetaModel, "object-detection": DetaForObjectDetection}
-        if is_torchvision_available()
-        else {}
-    )
-    is_encoder_decoder = True
-    test_torchscript = False
-    test_pruning = False
-    test_head_masking = False
-    test_missing_keys = False
-
-    # TODO: Fix the failed tests when this model gets more usage
-    def is_pipeline_test_to_skip(
-        self, pipeline_test_casse_name, config_class, model_architecture, tokenizer_name, processor_name
-    ):
-        if pipeline_test_casse_name == "ObjectDetectionPipelineTests":
-            return True
-
-        return False
-
-    @unittest.skip("Skip for now. PR #22437 causes some loading issue. See (not merged) #22656 for some discussions.")
-    def test_can_use_safetensors(self):
-        super().test_can_use_safetensors()
-
-    # special case for head models
-    def _prepare_for_class(self, inputs_dict, model_class, return_labels=False):
-        inputs_dict = super()._prepare_for_class(inputs_dict, model_class, return_labels=return_labels)
-
-        if return_labels:
-            if model_class.__name__ == "DetaForObjectDetection":
-                labels = []
-                for i in range(self.model_tester.batch_size):
-                    target = {}
-                    target["class_labels"] = torch.ones(
-                        size=(self.model_tester.n_targets,), device=torch_device, dtype=torch.long
-                    )
-                    target["boxes"] = torch.ones(
-                        self.model_tester.n_targets, 4, device=torch_device, dtype=torch.float
-                    )
-                    target["masks"] = torch.ones(
-                        self.model_tester.n_targets,
-                        self.model_tester.image_size,
-                        self.model_tester.image_size,
-                        device=torch_device,
-                        dtype=torch.float,
-                    )
-                    labels.append(target)
-                inputs_dict["labels"] = labels
-
-        return inputs_dict
-
-    def setUp(self):
-        self.model_tester = DetaModelTester(self)
-        self.config_tester = ConfigTester(self, config_class=DetaConfig, has_text_modality=False)
-
-    def test_config(self):
-        # we don't test common_properties and arguments_init as these don't apply for DETA
-        self.config_tester.create_and_test_config_to_json_string()
-        self.config_tester.create_and_test_config_to_json_file()
-        self.config_tester.create_and_test_config_from_and_save_pretrained()
-        self.config_tester.create_and_test_config_with_num_labels()
-        self.config_tester.check_config_can_be_init_without_params()
-
-    def test_deta_model(self):
-        config_and_inputs = self.model_tester.prepare_config_and_inputs(model_class_name="DetaModel")
-        self.model_tester.create_and_check_deta_model(*config_and_inputs)
-
-    def test_deta_freeze_backbone(self):
-        config_and_inputs = self.model_tester.prepare_config_and_inputs(model_class_name="DetaModel")
-        self.model_tester.create_and_check_deta_freeze_backbone(*config_and_inputs)
-
-    def test_deta_unfreeze_backbone(self):
-        config_and_inputs = self.model_tester.prepare_config_and_inputs(model_class_name="DetaModel")
-        self.model_tester.create_and_check_deta_unfreeze_backbone(*config_and_inputs)
-
-    def test_deta_object_detection_head_model(self):
-        config_and_inputs = self.model_tester.prepare_config_and_inputs(model_class_name="DetaForObjectDetection")
-        self.model_tester.create_and_check_deta_object_detection_head_model(*config_and_inputs)
-
-    @unittest.skip(reason="DETA does not use inputs_embeds")
-    def test_inputs_embeds(self):
-        pass
-
-    @unittest.skip(reason="DETA does not use inputs_embeds")
-    def test_inputs_embeds_matches_input_ids(self):
-        pass
-
-    @unittest.skip(reason="DETA does not have a get_input_embeddings method")
-    def test_model_common_attributes(self):
-        pass
-
-    @unittest.skip(reason="DETA is not a generative model")
-    def test_generate_without_input_ids(self):
-        pass
-
-    @unittest.skip(reason="DETA does not use token embeddings")
-    def test_resize_tokens_embeddings(self):
-        pass
-
-    @unittest.skip(reason="Feed forward chunking is not implemented")
-    def test_feed_forward_chunking(self):
-        pass
-
-    def test_attention_outputs(self):
-        config, inputs_dict = self.model_tester.prepare_config_and_inputs_for_common()
-        config.return_dict = True
-
-        for model_class in self.all_model_classes:
-            inputs_dict["output_attentions"] = True
-            inputs_dict["output_hidden_states"] = False
-            config.return_dict = True
-            model = model_class(config)
-            model.to(torch_device)
-            model.eval()
-            with torch.no_grad():
-                outputs = model(**self._prepare_for_class(inputs_dict, model_class))
-            attentions = outputs.encoder_attentions
-            self.assertEqual(len(attentions), self.model_tester.num_hidden_layers)
-
-            # check that output_attentions also work using config
-            del inputs_dict["output_attentions"]
-            config.output_attentions = True
-            model = model_class(config)
-            model.to(torch_device)
-            model.eval()
-            with torch.no_grad():
-                outputs = model(**self._prepare_for_class(inputs_dict, model_class))
-            attentions = outputs.encoder_attentions
-            self.assertEqual(len(attentions), self.model_tester.num_hidden_layers)
-
-            self.assertListEqual(
-                list(attentions[0].shape[-3:]),
-                [
-                    self.model_tester.num_attention_heads,
-                    self.model_tester.num_feature_levels,
-                    self.model_tester.encoder_n_points,
-                ],
-            )
-            out_len = len(outputs)
-
-            correct_outlen = 8
-
-            # loss is at first position
-            if "labels" in inputs_dict:
-                correct_outlen += 1  # loss is added to beginning
-            # Object Detection model returns pred_logits and pred_boxes
-            if model_class.__name__ == "DetaForObjectDetection":
-                correct_outlen += 2
-
-            self.assertEqual(out_len, correct_outlen)
-
-            # decoder attentions
-            decoder_attentions = outputs.decoder_attentions
-            self.assertIsInstance(decoder_attentions, (list, tuple))
-            self.assertEqual(len(decoder_attentions), self.model_tester.num_hidden_layers)
-            self.assertListEqual(
-                list(decoder_attentions[0].shape[-3:]),
-                [self.model_tester.num_attention_heads, self.model_tester.num_queries, self.model_tester.num_queries],
-            )
-
-            # cross attentions
-            cross_attentions = outputs.cross_attentions
-            self.assertIsInstance(cross_attentions, (list, tuple))
-            self.assertEqual(len(cross_attentions), self.model_tester.num_hidden_layers)
-            self.assertListEqual(
-                list(cross_attentions[0].shape[-3:]),
-                [
-                    self.model_tester.num_attention_heads,
-                    self.model_tester.num_feature_levels,
-                    self.model_tester.decoder_n_points,
-                ],
-            )
-
-            # Check attention is always last and order is fine
-            inputs_dict["output_attentions"] = True
-            inputs_dict["output_hidden_states"] = True
-            model = model_class(config)
-            model.to(torch_device)
-            model.eval()
-            with torch.no_grad():
-                outputs = model(**self._prepare_for_class(inputs_dict, model_class))
-
-            if hasattr(self.model_tester, "num_hidden_states_types"):
-                added_hidden_states = self.model_tester.num_hidden_states_types
-            elif self.is_encoder_decoder:
-                added_hidden_states = 2
-            else:
-                added_hidden_states = 1
-            self.assertEqual(out_len + added_hidden_states, len(outputs))
-
-            self_attentions = outputs.encoder_attentions
-
-            self.assertEqual(len(self_attentions), self.model_tester.num_hidden_layers)
-            self.assertListEqual(
-                list(self_attentions[0].shape[-3:]),
-                [
-                    self.model_tester.num_attention_heads,
-                    self.model_tester.num_feature_levels,
-                    self.model_tester.encoder_n_points,
-                ],
-            )
-
-    # removed retain_grad and grad on decoder_hidden_states, as queries don't require grad
-    def test_retain_grad_hidden_states_attentions(self):
-        config, inputs_dict = self.model_tester.prepare_config_and_inputs_for_common()
-        config.output_hidden_states = True
-        config.output_attentions = True
-
-        # no need to test all models as different heads yield the same functionality
-        model_class = self.all_model_classes[0]
-        model = model_class(config)
-        model.to(torch_device)
-
-        inputs = self._prepare_for_class(inputs_dict, model_class)
-
-        outputs = model(**inputs)
-
-        # we take the second output since last_hidden_state is the second item
-        output = outputs[1]
-
-        encoder_hidden_states = outputs.encoder_hidden_states[0]
-        encoder_attentions = outputs.encoder_attentions[0]
-        encoder_hidden_states.retain_grad()
-        encoder_attentions.retain_grad()
-
-        decoder_attentions = outputs.decoder_attentions[0]
-        decoder_attentions.retain_grad()
-
-        cross_attentions = outputs.cross_attentions[0]
-        cross_attentions.retain_grad()
-
-        output.flatten()[0].backward(retain_graph=True)
-
-        self.assertIsNotNone(encoder_hidden_states.grad)
-        self.assertIsNotNone(encoder_attentions.grad)
-        self.assertIsNotNone(decoder_attentions.grad)
-        self.assertIsNotNone(cross_attentions.grad)
-
-    def test_forward_auxiliary_loss(self):
-        config, inputs_dict = self.model_tester.prepare_config_and_inputs_for_common()
-        config.auxiliary_loss = True
-
-        # only test the object detection model; the base DetaModel is skipped
-        for model_class in self.all_model_classes[1:]:
-            model = model_class(config)
-            model.to(torch_device)
-
-            inputs = self._prepare_for_class(inputs_dict, model_class, return_labels=True)
-
-            outputs = model(**inputs)
-
-            self.assertIsNotNone(outputs.auxiliary_outputs)
-            self.assertEqual(len(outputs.auxiliary_outputs), self.model_tester.num_hidden_layers - 1)
-
-    def test_forward_signature(self):
-        config, _ = self.model_tester.prepare_config_and_inputs_for_common()
-
-        for model_class in self.all_model_classes:
-            model = model_class(config)
-            signature = inspect.signature(model.forward)
-            # signature.parameters is an OrderedDict => so arg_names order is deterministic
-            arg_names = [*signature.parameters.keys()]
-
-            if model.config.is_encoder_decoder:
-                expected_arg_names = ["pixel_values", "pixel_mask"]
-                expected_arg_names.extend(
-                    ["head_mask", "decoder_head_mask", "encoder_outputs"]
-                    if "head_mask" in arg_names and "decoder_head_mask" in arg_names
-                    else []
-                )
-                self.assertListEqual(arg_names[: len(expected_arg_names)], expected_arg_names)
-            else:
-                expected_arg_names = ["pixel_values", "pixel_mask"]
-                self.assertListEqual(arg_names[:1], expected_arg_names)
-
-    @unittest.skip(reason="Model doesn't use tied weights")
-    def test_tied_model_weights_key_ignore(self):
-        pass
-
-    def test_initialization(self):
-        config, inputs_dict = self.model_tester.prepare_config_and_inputs_for_common()
-
-        configs_no_init = _config_zero_init(config)
-        for model_class in self.all_model_classes:
-            model = model_class(config=configs_no_init)
-            # Skip the check for the backbone
-            for name, module in model.named_modules():
-                if module.__class__.__name__ == "DetaBackboneWithPositionalEncodings":
-                    backbone_params = [f"{name}.{key}" for key in module.state_dict().keys()]
-                    break
-
-            for name, param in model.named_parameters():
-                if param.requires_grad:
-                    if (
-                        "level_embed" in name
-                        or "sampling_offsets.bias" in name
-                        or "value_proj" in name
-                        or "output_proj" in name
-                        or "reference_points" in name
-                        or name in backbone_params
-                    ):
-                        continue
-                    self.assertIn(
-                        ((param.data.mean() * 1e9).round() / 1e9).item(),
-                        [0.0, 1.0],
-                        msg=f"Parameter {name} of model {model_class} seems not properly initialized",
-                    )
-
-    @unittest.skip("No support for low_cpu_mem_usage=True.")
-    def test_save_load_low_cpu_mem_usage(self):
-        pass
-
-    @unittest.skip("No support for low_cpu_mem_usage=True.")
-    def test_save_load_low_cpu_mem_usage_checkpoints(self):
-        pass
-
-    @unittest.skip("No support for low_cpu_mem_usage=True.")
-    def test_save_load_low_cpu_mem_usage_no_safetensors(self):
-        pass
-
-    # Inspired by tests.test_modeling_common.ModelTesterMixin.test_tied_weights_keys
-    def test_tied_weights_keys(self):
-        for model_class in self.all_model_classes:
-            # We need to pass model class name to correctly initialize the config.
-            # If we don't pass it, the config for `DetaForObjectDetection` will be initialized
-            # with `two_stage=False` and the test will fail because for that case `class_embed`
-            # weights are not tied.
-            config, _ = self.model_tester.prepare_config_and_inputs_for_common(model_class_name=model_class.__name__)
-            config.tie_word_embeddings = True
-
-            model_tied = model_class(config)
-
-            ptrs = collections.defaultdict(list)
-            for name, tensor in model_tied.state_dict().items():
-                ptrs[id_tensor_storage(tensor)].append(name)
-
-            # These are all the pointers of shared tensors.
-            tied_params = [names for _, names in ptrs.items() if len(names) > 1]
-
-            tied_weight_keys = model_tied._tied_weights_keys if model_tied._tied_weights_keys is not None else []
-            # Check that we get a hit for each key
-            for key in tied_weight_keys:
-                is_tied_key = any(re.search(key, p) for group in tied_params for p in group)
-                self.assertTrue(is_tied_key, f"{key} is not a tied weight key for {model_class}.")
-
-            # Remove the tied weights found from tied params -> there should only be one left afterwards
-            for key in tied_weight_keys:
-                for i in range(len(tied_params)):
-                    tied_params[i] = [p for p in tied_params[i] if re.search(key, p) is None]
-
-            tied_params = [group for group in tied_params if len(group) > 1]
-            self.assertListEqual(
-                tied_params,
-                [],
-                f"Missing `_tied_weights_keys` for {model_class}: add all of {tied_params} except one.",
-            )
-
-
-TOLERANCE = 1e-4
-
-
-# We will verify our results on an image of cute cats
-def prepare_img():
-    image = Image.open("./tests/fixtures/tests_samples/COCO/000000039769.png")
-    return image
-
-
-@require_torchvision
-@require_vision
-@slow
-class DetaModelIntegrationTests(unittest.TestCase):
-    @cached_property
-    def default_image_processor(self):
-        return AutoImageProcessor.from_pretrained("jozhang97/deta-resnet-50") if is_vision_available() else None
-
-    def test_inference_object_detection_head(self):
-        model = DetaForObjectDetection.from_pretrained("jozhang97/deta-resnet-50").to(torch_device)
-
-        image_processor = self.default_image_processor
-        image = prepare_img()
-        inputs = image_processor(images=image, return_tensors="pt").to(torch_device)
-
-        with torch.no_grad():
-            outputs = model(**inputs)
-
-        expected_shape_logits = torch.Size((1, 300, model.config.num_labels))
-        self.assertEqual(outputs.logits.shape, expected_shape_logits)
-
-        expected_logits = torch.tensor(
-            [[-7.3978, -2.5406, -4.1668], [-8.2684, -3.9933, -3.8096], [-7.0515, -3.7973, -5.8516]]
-        ).to(torch_device)
-        expected_boxes = torch.tensor(
-            [[0.5043, 0.4973, 0.9998], [0.2542, 0.5489, 0.4748], [0.5490, 0.2765, 0.0570]]
-        ).to(torch_device)
-
-        self.assertTrue(torch.allclose(outputs.logits[0, :3, :3], expected_logits, atol=1e-4))
-
-        expected_shape_boxes = torch.Size((1, 300, 4))
-        self.assertEqual(outputs.pred_boxes.shape, expected_shape_boxes)
-        self.assertTrue(torch.allclose(outputs.pred_boxes[0, :3, :3], expected_boxes, atol=1e-4))
-
-        # verify postprocessing
-        results = image_processor.post_process_object_detection(
-            outputs, threshold=0.3, target_sizes=[image.size[::-1]]
-        )[0]
-        expected_scores = torch.tensor([0.6392, 0.6276, 0.5546, 0.5260, 0.4706], device=torch_device)
-        expected_labels = [75, 17, 17, 75, 63]
-        expected_slice_boxes = torch.tensor([40.5866, 73.2107, 176.1421, 117.1751], device=torch_device)
-
-        self.assertTrue(torch.allclose(results["scores"], expected_scores, atol=1e-4))
-        self.assertSequenceEqual(results["labels"].tolist(), expected_labels)
-        self.assertTrue(torch.allclose(results["boxes"][0, :], expected_slice_boxes))
-
-    def test_inference_object_detection_head_swin_backbone(self):
-        model = DetaForObjectDetection.from_pretrained("jozhang97/deta-swin-large").to(torch_device)
-
-        image_processor = self.default_image_processor
-        image = prepare_img()
-        inputs = image_processor(images=image, return_tensors="pt").to(torch_device)
-
-        with torch.no_grad():
-            outputs = model(**inputs)
-
-        expected_shape_logits = torch.Size((1, 300, model.config.num_labels))
-        self.assertEqual(outputs.logits.shape, expected_shape_logits)
-
-        expected_logits = torch.tensor(
-            [[-7.6308, -2.8485, -5.3737], [-7.2037, -4.5505, -4.8027], [-7.2943, -4.2611, -4.6617]]
-        ).to(torch_device)
-        expected_boxes = torch.tensor(
-            [[0.4987, 0.4969, 0.9999], [0.2549, 0.5498, 0.4805], [0.5498, 0.2757, 0.0569]]
-        ).to(torch_device)
-
-        self.assertTrue(torch.allclose(outputs.logits[0, :3, :3], expected_logits, atol=1e-4))
-
-        expected_shape_boxes = torch.Size((1, 300, 4))
-        self.assertEqual(outputs.pred_boxes.shape, expected_shape_boxes)
-        self.assertTrue(torch.allclose(outputs.pred_boxes[0, :3, :3], expected_boxes, atol=1e-4))
-
-        # verify postprocessing
-        results = image_processor.post_process_object_detection(
-            outputs, threshold=0.3, target_sizes=[image.size[::-1]]
-        )[0]
-        expected_scores = torch.tensor([0.6831, 0.6826, 0.5684, 0.5464, 0.4392], device=torch_device)
-        expected_labels = [17, 17, 75, 75, 63]
-        expected_slice_boxes = torch.tensor([345.8478, 23.6754, 639.8562, 372.8265], device=torch_device)
-
-        self.assertTrue(torch.allclose(results["scores"], expected_scores, atol=1e-4))
-        self.assertSequenceEqual(results["labels"].tolist(), expected_labels)
-        self.assertTrue(torch.allclose(results["boxes"][0, :], expected_slice_boxes))
--- a/tests/models/efficientformer/__init__.py
+++ b/tests/models/efficientformer/__init__.py
--- a/tests/models/efficientformer/test_image_processing_efficientformer.py
+++ b/tests/models/efficientformer/test_image_processing_efficientformer.py
@@ -1,99 +0,0 @@
-# coding=utf-8
-# Copyright 2021 HuggingFace Inc.
-#
-# Licensed under the Apache License, Version 2.0 (the "License");
-# you may not use this file except in compliance with the License.
-# You may obtain a copy of the License at
-#
-#     http://www.apache.org/licenses/LICENSE-2.0
-#
-# Unless required by applicable law or agreed to in writing, software
-# distributed under the License is distributed on an "AS IS" BASIS,
-# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
-# See the License for the specific language governing permissions and
-# limitations under the License.
-
-
-import unittest
-
-from transformers.testing_utils import require_torch, require_vision
-from transformers.utils import is_vision_available
-
-from ...test_image_processing_common import ImageProcessingTestMixin, prepare_image_inputs
-
-
-if is_vision_available():
-    from transformers import ViTImageProcessor
-
-
-class EfficientFormerImageProcessorTester(unittest.TestCase):
-    def __init__(
-        self,
-        parent,
-        batch_size=13,
-        num_channels=3,
-        image_size=224,
-        min_resolution=30,
-        max_resolution=400,
-        do_resize=True,
-        size=None,
-        do_normalize=True,
-        image_mean=[0.5, 0.5, 0.5],
-        image_std=[0.5, 0.5, 0.5],
-    ):
-        size = size if size is not None else {"height": 18, "width": 18}
-        self.parent = parent
-        self.batch_size = batch_size
-        self.num_channels = num_channels
-        self.image_size = image_size
-        self.min_resolution = min_resolution
-        self.max_resolution = max_resolution
-        self.do_resize = do_resize
-        self.size = size
-        self.do_normalize = do_normalize
-        self.image_mean = image_mean
-        self.image_std = image_std
-
-    def prepare_image_processor_dict(self):
-        return {
-            "image_mean": self.image_mean,
-            "image_std": self.image_std,
-            "do_normalize": self.do_normalize,
-            "do_resize": self.do_resize,
-            "size": self.size,
-        }
-
-    def expected_output_image_shape(self, images):
-        return self.num_channels, self.size["height"], self.size["width"]
-
-    def prepare_image_inputs(self, equal_resolution=False, numpify=False, torchify=False):
-        return prepare_image_inputs(
-            batch_size=self.batch_size,
-            num_channels=self.num_channels,
-            min_resolution=self.min_resolution,
-            max_resolution=self.max_resolution,
-            equal_resolution=equal_resolution,
-            numpify=numpify,
-            torchify=torchify,
-        )
-
-
-@require_torch
-@require_vision
-class EfficientFormerImageProcessorTest(ImageProcessingTestMixin, unittest.TestCase):
-    image_processing_class = ViTImageProcessor if is_vision_available() else None
-
-    def setUp(self):
-        self.image_processor_tester = EfficientFormerImageProcessorTester(self)
-
-    @property
-    def image_processor_dict(self):
-        return self.image_processor_tester.prepare_image_processor_dict()
-
-    def test_image_proc_properties(self):
-        image_processor = self.image_processing_class(**self.image_processor_dict)
-        self.assertTrue(hasattr(image_processor, "image_mean"))
-        self.assertTrue(hasattr(image_processor, "image_std"))
-        self.assertTrue(hasattr(image_processor, "do_normalize"))
-        self.assertTrue(hasattr(image_processor, "do_resize"))
-        self.assertTrue(hasattr(image_processor, "size"))