AutoImageProcessor (#20111)

* AutoImageProcessor skeleton * Update references * Add mapping in init * Add model image processors to __init__ for importing * Add AutoImageProcessor tests * Fix up * Image Processor documentation * Remove pdb * Update docs/source/en/model_doc/mobilevit.mdx * Update docs * Don't add whitespace on json files * Remove fixtures * Move checking model config down * Fix up * Add check for image processor * Remove FeatureExtractorMixin in docstrings * Rename model_tmpfile to config_tmpfile * Don't make None if not in image processor map
2022-11-08 19:54:41 +00:00
parent c08a1e26ab
commit 4eb918e656
51 changed files with 1371 additions and 123 deletions
--- a/docs/source/en/_toctree.yml
+++ b/docs/source/en/_toctree.yml
@@ -183,6 +183,8 @@
      title: DeepSpeed Integration
    - local: main_classes/feature_extractor
      title: Feature Extractor
+    - local: main_classes/image_processor
+      title: Image Processor
    title: Main Classes
  - sections:
    - isExpanded: false
--- a/docs/source/en/internal/image_processing_utils.mdx
+++ b/docs/source/en/internal/image_processing_utils.mdx
@@ -29,6 +29,6 @@ Most of those are only useful if you are studying the code of the image processo

 [[autodoc]] image_transforms.to_pil_image

-## ImageProcessorMixin
+## ImageProcessingMixin

-[[autodoc]] image_processing_utils.ImageProcessorMixin
+[[autodoc]] image_processing_utils.ImageProcessingMixin
--- a/docs/source/en/main_classes/image_processor.mdx
+++ b/docs/source/en/main_classes/image_processor.mdx
@@ -0,0 +1,30 @@
+<!--Copyright 2022 The HuggingFace Team. All rights reserved.
+
+Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with
+the License. You may obtain a copy of the License at
+
+http://www.apache.org/licenses/LICENSE-2.0
+
+Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on
+an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the
+specific language governing permissions and limitations under the License.
+-->
+
+# Image Processor
+
+An image processor is in charge of preparing input features for vision models and post processing their outputs. This includes transformations such as resizing, normalization, and conversion to PyTorch, TensorFlow, Flax and Numpy tensors. It may also include model specific post-processing such as converting logits to segmentation masks.
+
+
+## ImageProcessingMixin
+
+[[autodoc]] image_processing_utils.ImageProcessingMixin
+    - from_pretrained
+    - save_pretrained
+
+## BatchFeature
+
+[[autodoc]] BatchFeature
+
+## BaseImageProcessor
+
+[[autodoc]] image_processing_utils.BaseImageProcessor
--- a/docs/source/en/model_doc/auto.mdx
+++ b/docs/source/en/model_doc/auto.mdx
@@ -66,6 +66,10 @@ Likewise, if your `NewModel` is a subclass of [`PreTrainedModel`], make sure its

 [[autodoc]] AutoFeatureExtractor

+## AutoImageProcessor
+
+[[autodoc]] AutoImageProcessor
+
 ## AutoProcessor

 [[autodoc]] AutoProcessor
--- a/docs/source/en/model_doc/beit.mdx
+++ b/docs/source/en/model_doc/beit.mdx
@@ -60,7 +60,7 @@ Tips:
  position embeddings.

 <img src="https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/transformers/model_doc/beit_architecture.jpg"
-alt="drawing" width="600"/> 
+alt="drawing" width="600"/>

 <small> BEiT pre-training. Taken from the <a href="https://arxiv.org/abs/2106.08254">original paper.</a> </small>

@@ -84,6 +84,12 @@ contributed by [kamalkraj](https://huggingface.co/kamalkraj). The original code
    - __call__
    - post_process_semantic_segmentation

+## BeitImageProcessor
+
+[[autodoc]] BeitImageProcessor
+    - preprocess
+    - post_process_semantic_segmentation
+
 ## BeitModel

 [[autodoc]] BeitModel
--- a/docs/source/en/model_doc/clip.mdx
+++ b/docs/source/en/model_doc/clip.mdx
@@ -100,6 +100,11 @@ This model was contributed by [valhalla](https://huggingface.co/valhalla). The o

 [[autodoc]] CLIPTokenizerFast

+## CLIPImageProcessor
+
+[[autodoc]] CLIPImageProcessor
+    - preprocess
+
 ## CLIPFeatureExtractor

 [[autodoc]] CLIPFeatureExtractor
--- a/docs/source/en/model_doc/convnext.mdx
+++ b/docs/source/en/model_doc/convnext.mdx
@@ -33,7 +33,7 @@ Tips:
 - See the code examples below each model regarding usage.

 <img src="https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/convnext_architecture.jpg"
-alt="drawing" width="600"/> 
+alt="drawing" width="600"/>

 <small> ConvNeXT architecture. Taken from the <a href="https://arxiv.org/abs/2201.03545">original paper</a>.</small>

@@ -50,6 +50,11 @@ This model was contributed by [nielsr](https://huggingface.co/nielsr). TensorFlo
 [[autodoc]] ConvNextFeatureExtractor


+## ConvNextImageProcessor
+
+[[autodoc]] ConvNextImageProcessor
+    - preprocess
+
 ## ConvNextModel

 [[autodoc]] ConvNextModel
@@ -71,4 +76,4 @@ This model was contributed by [nielsr](https://huggingface.co/nielsr). TensorFlo
 ## TFConvNextForImageClassification

 [[autodoc]] TFConvNextForImageClassification
-    - call
+    - call
--- a/docs/source/en/model_doc/deit.mdx
+++ b/docs/source/en/model_doc/deit.mdx
@@ -81,6 +81,11 @@ This model was contributed by [nielsr](https://huggingface.co/nielsr). The Tenso
 [[autodoc]] DeiTFeatureExtractor
    - __call__

+## DeiTImageProcessor
+
+[[autodoc]] DeiTImageProcessor
+    - preprocess
+
 ## DeiTModel

 [[autodoc]] DeiTModel
--- a/docs/source/en/model_doc/dpt.mdx
+++ b/docs/source/en/model_doc/dpt.mdx
@@ -22,7 +22,7 @@ The abstract from the paper is the following:
 *We introduce dense vision transformers, an architecture that leverages vision transformers in place of convolutional networks as a backbone for dense prediction tasks. We assemble tokens from various stages of the vision transformer into image-like representations at various resolutions and progressively combine them into full-resolution predictions using a convolutional decoder. The transformer backbone processes representations at a constant and relatively high resolution and has a global receptive field at every stage. These properties allow the dense vision transformer to provide finer-grained and more globally coherent predictions when compared to fully-convolutional networks. Our experiments show that this architecture yields substantial improvements on dense prediction tasks, especially when a large amount of training data is available. For monocular depth estimation, we observe an improvement of up to 28% in relative performance when compared to a state-of-the-art fully-convolutional network. When applied to semantic segmentation, dense vision transformers set a new state of the art on ADE20K with 49.02% mIoU. We further show that the architecture can be fine-tuned on smaller datasets such as NYUv2, KITTI, and Pascal Context where it also sets the new state of the art.*

 <img src="https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/dpt_architecture.jpg"
-alt="drawing" width="600"/> 
+alt="drawing" width="600"/>

 <small> DPT architecture. Taken from the <a href="https://arxiv.org/abs/2103.13413" target="_blank">original paper</a>. </small>

@@ -40,6 +40,13 @@ This model was contributed by [nielsr](https://huggingface.co/nielsr). The origi
    - post_process_semantic_segmentation


+## DPTImageProcessor
+
+[[autodoc]] DPTImageProcessor
+    - preprocess
+    - post_process_semantic_segmentation
+
+
 ## DPTModel

 [[autodoc]] DPTModel
@@ -55,4 +62,4 @@ This model was contributed by [nielsr](https://huggingface.co/nielsr). The origi
 ## DPTForSemanticSegmentation

 [[autodoc]] DPTForSemanticSegmentation
-    - forward
+    - forward
--- a/docs/source/en/model_doc/flava.mdx
+++ b/docs/source/en/model_doc/flava.mdx
@@ -16,17 +16,17 @@ specific language governing permissions and limitations under the License.

 The FLAVA model was proposed in [FLAVA: A Foundational Language And Vision Alignment Model](https://arxiv.org/abs/2112.04482) by Amanpreet Singh, Ronghang Hu, Vedanuj Goswami, Guillaume Couairon, Wojciech Galuba, Marcus Rohrbach, and Douwe Kiela and is accepted at CVPR 2022.

-The paper aims at creating a single unified foundation model which can work across vision, language 
+The paper aims at creating a single unified foundation model which can work across vision, language
 as well as vision-and-language multimodal tasks.

 The abstract from the paper is the following:

-*State-of-the-art vision and vision-and-language models rely on large-scale visio-linguistic pretraining for obtaining good performance on a variety 
-of downstream tasks. Generally, such models are often either cross-modal (contrastive) or multi-modal 
-(with earlier fusion) but not both; and they often only target specific modalities or tasks. A promising 
-direction would be to use a single holistic universal model, as a "foundation", that targets all modalities 
-at once -- a true vision and language foundation model should be good at vision tasks, language tasks, and 
-cross- and multi-modal vision and language tasks. We introduce FLAVA as such a model and demonstrate 
+*State-of-the-art vision and vision-and-language models rely on large-scale visio-linguistic pretraining for obtaining good performance on a variety
+of downstream tasks. Generally, such models are often either cross-modal (contrastive) or multi-modal
+(with earlier fusion) but not both; and they often only target specific modalities or tasks. A promising
+direction would be to use a single holistic universal model, as a "foundation", that targets all modalities
+at once -- a true vision and language foundation model should be good at vision tasks, language tasks, and
+cross- and multi-modal vision and language tasks. We introduce FLAVA as such a model and demonstrate
 impressive performance on a wide range of 35 tasks spanning these target modalities.*


@@ -61,6 +61,11 @@ This model was contributed by [aps](https://huggingface.co/aps). The original co

 [[autodoc]] FlavaFeatureExtractor

+## FlavaImageProcessor
+
+[[autodoc]] FlavaImageProcessor
+    - preprocess
+
 ## FlavaForPreTraining

 [[autodoc]] FlavaForPreTraining
--- a/docs/source/en/model_doc/glpn.mdx
+++ b/docs/source/en/model_doc/glpn.mdx
@@ -35,7 +35,7 @@ Tips:
 - One can use [`GLPNFeatureExtractor`] to prepare images for the model.

 <img src="https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/glpn_architecture.jpg"
-alt="drawing" width="600"/> 
+alt="drawing" width="600"/>

 <small> Summary of the approach. Taken from the <a href="https://arxiv.org/abs/2201.07436" target="_blank">original paper</a>. </small>

@@ -50,6 +50,11 @@ This model was contributed by [nielsr](https://huggingface.co/nielsr). The origi
 [[autodoc]] GLPNFeatureExtractor
    - __call__

+## GLPNImageProcessor
+
+[[autodoc]] GLPNImageProcessor
+    - preprocess
+
 ## GLPNModel

 [[autodoc]] GLPNModel
@@ -58,4 +63,4 @@ This model was contributed by [nielsr](https://huggingface.co/nielsr). The origi
 ## GLPNForDepthEstimation

 [[autodoc]] GLPNForDepthEstimation
-    - forward
+    - forward
--- a/docs/source/en/model_doc/imagegpt.mdx
+++ b/docs/source/en/model_doc/imagegpt.mdx
@@ -29,7 +29,7 @@ competitive with self-supervised benchmarks on ImageNet when substituting pixels
 top-1 accuracy on a linear probe of our features.*

 <img src="https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/imagegpt_architecture.png"
-alt="drawing" width="600"/> 
+alt="drawing" width="600"/>

 <small> Summary of the approach. Taken from the [original paper](https://cdn.openai.com/papers/Generative_Pretraining_from_Pixels_V2.pdf). </small>

@@ -81,6 +81,11 @@ Tips:

    - __call__

+## ImageGPTImageProcessor
+
+[[autodoc]] ImageGPTImageProcessor
+    - preprocess
+
 ## ImageGPTModel

 [[autodoc]] ImageGPTModel
@@ -97,4 +102,4 @@ Tips:

 [[autodoc]] ImageGPTForImageClassification

-    - forward
+    - forward
--- a/docs/source/en/model_doc/layoutlmv2.mdx
+++ b/docs/source/en/model_doc/layoutlmv2.mdx
@@ -45,7 +45,7 @@ RVL-CDIP (0.9443 -> 0.9564), and DocVQA (0.7295 -> 0.8672). The pre-trained Layo
 this https URL.*

 LayoutLMv2 depends on `detectron2`, `torchvision` and `tesseract`. Run the
-following to install them: 
+following to install them:
 ```
 python -m pip install 'git+https://github.com/facebookresearch/detectron2.git'
 python -m pip install torchvision tesseract
@@ -275,6 +275,11 @@ print(encoding.keys())
 [[autodoc]] LayoutLMv2FeatureExtractor
    - __call__

+## LayoutLMv2ImageProcessor
+
+[[autodoc]] LayoutLMv2ImageProcessor
+    - preprocess
+
 ## LayoutLMv2Tokenizer

 [[autodoc]] LayoutLMv2Tokenizer
--- a/docs/source/en/model_doc/layoutlmv3.mdx
+++ b/docs/source/en/model_doc/layoutlmv3.mdx
@@ -73,6 +73,11 @@ LayoutLMv3 is nearly identical to LayoutLMv2, so we've also included LayoutLMv2
 [[autodoc]] LayoutLMv3FeatureExtractor
    - __call__

+## LayoutLMv3ImageProcessor
+
+[[autodoc]] LayoutLMv3ImageProcessor
+    - preprocess
+
 ## LayoutLMv3Tokenizer

 [[autodoc]] LayoutLMv3Tokenizer
--- a/docs/source/en/model_doc/levit.mdx
+++ b/docs/source/en/model_doc/levit.mdx
@@ -19,18 +19,18 @@ The LeViT model was proposed in [LeViT: Introducing Convolutions to Vision Trans
 The abstract from the paper is the following:

 *We design a family of image classification architectures that optimize the trade-off between accuracy
-and efficiency in a high-speed regime. Our work exploits recent findings in attention-based architectures, 
-which are competitive on highly parallel processing hardware. We revisit principles from the extensive 
-literature on convolutional neural networks to apply them to transformers, in particular activation maps 
+and efficiency in a high-speed regime. Our work exploits recent findings in attention-based architectures,
+which are competitive on highly parallel processing hardware. We revisit principles from the extensive
+literature on convolutional neural networks to apply them to transformers, in particular activation maps
 with decreasing resolutions. We also introduce the attention bias, a new way to integrate positional information
-in vision transformers. As a result, we propose LeVIT: a hybrid neural network for fast inference image classification. 
-We consider different measures of efficiency on different hardware platforms, so as to best reflect a wide range of 
-application scenarios. Our extensive experiments empirically validate our technical choices and show they are suitable 
-to most architectures. Overall, LeViT significantly outperforms existing convnets and vision transformers with respect 
+in vision transformers. As a result, we propose LeVIT: a hybrid neural network for fast inference image classification.
+We consider different measures of efficiency on different hardware platforms, so as to best reflect a wide range of
+application scenarios. Our extensive experiments empirically validate our technical choices and show they are suitable
+to most architectures. Overall, LeViT significantly outperforms existing convnets and vision transformers with respect
 to the speed/accuracy tradeoff. For example, at 80% ImageNet top-1 accuracy, LeViT is 5 times faster than EfficientNet on CPU. *

 <img src="https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/levit_architecture.png"
-alt="drawing" width="600"/> 
+alt="drawing" width="600"/>

 <small> LeViT Architecture. Taken from the <a href="https://arxiv.org/abs/2104.01136">original paper</a>.</small>

@@ -38,25 +38,25 @@ Tips:

 - Compared to ViT, LeViT models use an additional distillation head to effectively learn from a teacher (which, in the LeViT paper, is a ResNet like-model). The distillation head is learned through backpropagation under supervision of a ResNet like-model. They also draw inspiration from convolution neural networks to use activation maps with decreasing resolutions to increase the efficiency.
 - There are 2 ways to fine-tune distilled models, either (1) in a classic way, by only placing a prediction head on top
-  of the final hidden state and not using the distillation head, or (2) by placing both a prediction head and distillation 
-  head on top of the final hidden state. In that case, the prediction head is trained using regular cross-entropy between 
-  the prediction of the head and the ground-truth label, while the distillation prediction head is trained using hard distillation 
-  (cross-entropy between the prediction of the distillation head and the label predicted by the teacher). At inference time, 
-  one takes the average prediction between both heads as final prediction. (2) is also called "fine-tuning with distillation", 
-  because one relies on a teacher that has already been fine-tuned on the downstream dataset. In terms of models, (1) corresponds 
+  of the final hidden state and not using the distillation head, or (2) by placing both a prediction head and distillation
+  head on top of the final hidden state. In that case, the prediction head is trained using regular cross-entropy between
+  the prediction of the head and the ground-truth label, while the distillation prediction head is trained using hard distillation
+  (cross-entropy between the prediction of the distillation head and the label predicted by the teacher). At inference time,
+  one takes the average prediction between both heads as final prediction. (2) is also called "fine-tuning with distillation",
+  because one relies on a teacher that has already been fine-tuned on the downstream dataset. In terms of models, (1) corresponds
  to [`LevitForImageClassification`] and (2) corresponds to [`LevitForImageClassificationWithTeacher`].
- All released checkpoints were pre-trained and fine-tuned on  [ImageNet-1k](https://huggingface.co/datasets/imagenet-1k) 
+- All released checkpoints were pre-trained and fine-tuned on  [ImageNet-1k](https://huggingface.co/datasets/imagenet-1k)
  (also referred to as ILSVRC 2012, a collection of 1.3 million images and 1,000 classes). only. No external data was used. This is in
  contrast with the original ViT model, which used external data like the JFT-300M dataset/Imagenet-21k for
  pre-training.
- The authors of LeViT released 5 trained LeViT models, which you can directly plug into [`LevitModel`] or [`LevitForImageClassification`]. 
+- The authors of LeViT released 5 trained LeViT models, which you can directly plug into [`LevitModel`] or [`LevitForImageClassification`].
  Techniques like data augmentation, optimization, and regularization were used in order to simulate training on a much larger dataset
  (while only using ImageNet-1k for pre-training). The 5 variants available are (all trained on images of size 224x224):
  *facebook/levit-128S*, *facebook/levit-128*, *facebook/levit-192*, *facebook/levit-256* and
  *facebook/levit-384*. Note that one should use [`LevitFeatureExtractor`] in order to
  prepare images for the model.
 - [`LevitForImageClassificationWithTeacher`] currently supports only inference and not training or fine-tuning.
- You can check out demo notebooks regarding inference as well as fine-tuning on custom data [here](https://github.com/NielsRogge/Transformers-Tutorials/tree/master/VisionTransformer) 
+- You can check out demo notebooks regarding inference as well as fine-tuning on custom data [here](https://github.com/NielsRogge/Transformers-Tutorials/tree/master/VisionTransformer)
  (you can just replace [`ViTFeatureExtractor`] by [`LevitFeatureExtractor`] and [`ViTForImageClassification`] by [`LevitForImageClassification`] or [`LevitForImageClassificationWithTeacher`]).

 This model was contributed by [anugunj](https://huggingface.co/anugunj). The original code can be found [here](https://github.com/facebookresearch/LeViT).
@@ -71,6 +71,12 @@ This model was contributed by [anugunj](https://huggingface.co/anugunj). The ori
 [[autodoc]] LevitFeatureExtractor
    - __call__

+## LevitImageProcessor
+
+  [[autodoc]] LevitImageProcessor
+    - preprocess
+
+
 ## LevitModel

 [[autodoc]] LevitModel
--- a/docs/source/en/model_doc/mobilevit.mdx
+++ b/docs/source/en/model_doc/mobilevit.mdx
@@ -14,7 +14,7 @@ specific language governing permissions and limitations under the License.

 ## Overview

-The MobileViT model was proposed in [MobileViT: Light-weight, General-purpose, and Mobile-friendly Vision Transformer](https://arxiv.org/abs/2110.02178) by Sachin Mehta and Mohammad Rastegari. MobileViT introduces a new layer that replaces local processing in convolutions with global processing using transformers. 
+The MobileViT model was proposed in [MobileViT: Light-weight, General-purpose, and Mobile-friendly Vision Transformer](https://arxiv.org/abs/2110.02178) by Sachin Mehta and Mohammad Rastegari. MobileViT introduces a new layer that replaces local processing in convolutions with global processing using transformers.

 The abstract from the paper is the following:

@@ -25,10 +25,10 @@ Tips:
 - MobileViT is more like a CNN than a Transformer model. It does not work on sequence data but on batches of images. Unlike ViT, there are no embeddings. The backbone model outputs a feature map. You can follow [this tutorial](https://keras.io/examples/vision/mobilevit) for a lightweight introduction.
 - One can use [`MobileViTFeatureExtractor`] to prepare images for the model. Note that if you do your own preprocessing, the pretrained checkpoints expect images to be in BGR pixel order (not RGB).
 - The available image classification checkpoints are pre-trained on [ImageNet-1k](https://huggingface.co/datasets/imagenet-1k) (also referred to as ILSVRC 2012, a collection of 1.3 million images and 1,000 classes).
- The segmentation model uses a [DeepLabV3](https://arxiv.org/abs/1706.05587) head. The available semantic segmentation checkpoints are pre-trained on [PASCAL VOC](http://host.robots.ox.ac.uk/pascal/VOC/). 
- As the name suggests MobileViT was designed to be performant and efficient on mobile phones. The TensorFlow versions of the MobileViT models are fully compatible with [TensorFlow Lite](https://www.tensorflow.org/lite). 
+- The segmentation model uses a [DeepLabV3](https://arxiv.org/abs/1706.05587) head. The available semantic segmentation checkpoints are pre-trained on [PASCAL VOC](http://host.robots.ox.ac.uk/pascal/VOC/).
+- As the name suggests MobileViT was designed to be performant and efficient on mobile phones. The TensorFlow versions of the MobileViT models are fully compatible with [TensorFlow Lite](https://www.tensorflow.org/lite).

-  You can use the following code to convert a MobileViT checkpoint (be it image classification or semantic segmentation) to generate a 
+  You can use the following code to convert a MobileViT checkpoint (be it image classification or semantic segmentation) to generate a
  TensorFlow Lite model:

 ```py
@@ -52,7 +52,7 @@ with open(tflite_filename, "wb") as f:
 ```

  The resulting model will be just **about an MB** making it a good fit for mobile applications where resources and network
-  bandwidth can be constrained. 
+  bandwidth can be constrained.


 This model was contributed by [matthijs](https://huggingface.co/Matthijs). The TensorFlow version of the model was contributed by [sayakpaul](https://huggingface.co/sayakpaul). The original code and weights can be found [here](https://github.com/apple/ml-cvnets).
@@ -68,6 +68,12 @@ This model was contributed by [matthijs](https://huggingface.co/Matthijs). The T
    - __call__
    - post_process_semantic_segmentation

+## MobileViTImageProcessor
+
+[[autodoc]] MobileViTImageProcessor
+    - preprocess
+    - post_process_semantic_segmentation
+
 ## MobileViTModel

 [[autodoc]] MobileViTModel
@@ -86,14 +92,14 @@ This model was contributed by [matthijs](https://huggingface.co/Matthijs). The T
 ## TFMobileViTModel

 [[autodoc]] TFMobileViTModel
-    - call 
+    - call

 ## TFMobileViTForImageClassification

 [[autodoc]] TFMobileViTForImageClassification
-    - call 
+    - call

 ## TFMobileViTForSemanticSegmentation

 [[autodoc]] TFMobileViTForSemanticSegmentation
-    - call 
+    - call
--- a/docs/source/en/model_doc/perceiver.mdx
+++ b/docs/source/en/model_doc/perceiver.mdx
@@ -70,7 +70,7 @@ vocabulary size of the model, i.e. creating logits of shape `(batch_size, 2048,
 size of 262 byte IDs).

 <img src="https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/perceiver_architecture.jpg"
-alt="drawing" width="600"/> 
+alt="drawing" width="600"/>

 <small> Perceiver IO architecture. Taken from the <a href="https://arxiv.org/abs/2105.15203">original paper</a> </small>

@@ -83,8 +83,8 @@ Tips:
  notebooks](https://github.com/NielsRogge/Transformers-Tutorials/tree/master/Perceiver).
 - Refer to the [blog post](https://huggingface.co/blog/perceiver) if you want to fully understand how the model works and
 is implemented in the library. Note that the models available in the library only showcase some examples of what you can do
-with the Perceiver. There are many more use cases, including question answering, named-entity recognition, object detection, 
-audio classification, video classification, etc. 
+with the Perceiver. There are many more use cases, including question answering, named-entity recognition, object detection,
+audio classification, video classification, etc.

 **Note**:

@@ -114,6 +114,11 @@ audio classification, video classification, etc.
 [[autodoc]] PerceiverFeatureExtractor
    - __call__

+## PerceiverImageProcessor
+
+[[autodoc]] PerceiverImageProcessor
+    - preprocess
+
 ## PerceiverTextPreprocessor

 [[autodoc]] models.perceiver.modeling_perceiver.PerceiverTextPreprocessor
--- a/docs/source/en/model_doc/poolformer.mdx
+++ b/docs/source/en/model_doc/poolformer.mdx
@@ -50,12 +50,17 @@ This model was contributed by [heytanay](https://huggingface.co/heytanay). The o
 [[autodoc]] PoolFormerFeatureExtractor
    - __call__

+## PoolFormerImageProcessor
+
+[[autodoc]] PoolFormerImageProcessor
+    - preprocess
+
 ## PoolFormerModel

 [[autodoc]] PoolFormerModel
    - forward
-    
+
 ## PoolFormerForImageClassification

 [[autodoc]] PoolFormerForImageClassification
-    - forward
+    - forward
--- a/docs/source/en/model_doc/segformer.mdx
+++ b/docs/source/en/model_doc/segformer.mdx
@@ -36,7 +36,7 @@ The figure below illustrates the architecture of SegFormer. Taken from the [orig

 <img width="600" src="https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/segformer_architecture.png"/>

-This model was contributed by [nielsr](https://huggingface.co/nielsr). The TensorFlow version 
+This model was contributed by [nielsr](https://huggingface.co/nielsr). The TensorFlow version
 of the model was contributed by [sayakpaul](https://huggingface.co/sayakpaul). The original code can be found [here](https://github.com/NVlabs/SegFormer).

 Tips:
@@ -55,7 +55,7 @@ Tips:
 - TensorFlow users should refer to [this repository](https://github.com/deep-diver/segformer-tf-transformers) that shows off-the-shelf inference and fine-tuning.
 - One can also check out [this interactive demo on Hugging Face Spaces](https://huggingface.co/spaces/chansung/segformer-tf-transformers)
  to try out a SegFormer model on custom images.
- SegFormer works on any input size, as it pads the input to be divisible by `config.patch_sizes`. 
+- SegFormer works on any input size, as it pads the input to be divisible by `config.patch_sizes`.
 - One can use [`SegformerFeatureExtractor`] to prepare images and corresponding segmentation maps
  for the model. Note that this feature extractor is fairly basic and does not include all data augmentations used in
  the original paper. The original preprocessing pipelines (for the ADE20k dataset for instance) can be found [here](https://github.com/NVlabs/SegFormer/blob/master/local_configs/_base_/datasets/ade20k_repeat.py). The most
@@ -95,6 +95,12 @@ SegFormer's results on the segmentation datasets like ADE20k, refer to the [pape
    - __call__
    - post_process_semantic_segmentation

+## SegformerImageProcessor
+
+[[autodoc]] SegformerImageProcessor
+    - preprocess
+    - post_process_semantic_segmentation
+
 ## SegformerModel

 [[autodoc]] SegformerModel
@@ -123,14 +129,14 @@ SegFormer's results on the segmentation datasets like ADE20k, refer to the [pape
 ## TFSegformerModel

 [[autodoc]] TFSegformerModel
-    - call 
+    - call

 ## TFSegformerForImageClassification

 [[autodoc]] TFSegformerForImageClassification
-    - call 
+    - call

 ## TFSegformerForSemanticSegmentation

 [[autodoc]] TFSegformerForSemanticSegmentation
-    - call 
+    - call
--- a/docs/source/en/model_doc/videomae.mdx
+++ b/docs/source/en/model_doc/videomae.mdx
@@ -27,7 +27,7 @@ Tips:
 - [`VideoMAEForPreTraining`] includes the decoder on top for self-supervised pre-training.

 <img src="https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/transformers/model_doc/videomae_architecture.jpeg"
-alt="drawing" width="600"/> 
+alt="drawing" width="600"/>

 <small> VideoMAE pre-training. Taken from the <a href="https://arxiv.org/abs/2203.12602">original paper</a>. </small>

@@ -44,6 +44,11 @@ The original code can be found [here](https://github.com/MCG-NJU/VideoMAE).
 [[autodoc]] VideoMAEFeatureExtractor
    - __call__

+## VideoMAEImageProcessor
+
+[[autodoc]] VideoMAEImageProcessor
+    - preprocess
+
 ## VideoMAEModel

 [[autodoc]] VideoMAEModel
@@ -57,4 +62,4 @@ The original code can be found [here](https://github.com/MCG-NJU/VideoMAE).
 ## VideoMAEForVideoClassification

 [[autodoc]] transformers.VideoMAEForVideoClassification
-    - forward
+    - forward
--- a/docs/source/en/model_doc/vilt.mdx
+++ b/docs/source/en/model_doc/vilt.mdx
@@ -38,12 +38,12 @@ Tips:
  This processor wraps a feature extractor (for the image modality) and a tokenizer (for the language modality) into one.
 - ViLT is trained with images of various sizes: the authors resize the shorter edge of input images to 384 and limit the longer edge to
  under 640 while preserving the aspect ratio. To make batching of images possible, the authors use a `pixel_mask` that indicates
-  which pixel values are real and which are padding. [`ViltProcessor`] automatically creates this for you. 
- The design of ViLT is very similar to that of a standard Vision Transformer (ViT). The only difference is that the model includes 
+  which pixel values are real and which are padding. [`ViltProcessor`] automatically creates this for you.
+- The design of ViLT is very similar to that of a standard Vision Transformer (ViT). The only difference is that the model includes
  additional embedding layers for the language modality.

 <img src="https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/vilt_architecture.jpg"
-alt="drawing" width="600"/> 
+alt="drawing" width="600"/>

 <small> ViLT architecture. Taken from the <a href="https://arxiv.org/abs/2102.03334">original paper</a>. </small>

@@ -63,6 +63,11 @@ Tips:
 [[autodoc]] ViltFeatureExtractor
    - __call__

+## ViltImageProcessor
+
+[[autodoc]] ViltImageProcessor
+    - preprocess
+
 ## ViltProcessor

 [[autodoc]] ViltProcessor
--- a/docs/source/en/model_doc/vit.mdx
+++ b/docs/source/en/model_doc/vit.mdx
@@ -57,7 +57,7 @@ Tips:
  improvement of 2% to training from scratch, but still 4% behind supervised pre-training.

 <img src="https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/transformers/model_doc/vit_architecture.jpg"
-alt="drawing" width="600"/> 
+alt="drawing" width="600"/>

 <small> ViT architecture. Taken from the <a href="https://arxiv.org/abs/2010.11929">original paper.</a> </small>

@@ -96,6 +96,12 @@ go to him!
 [[autodoc]] ViTFeatureExtractor
    - __call__

+
+## ViTImageProcessor
+
+[[autodoc]] ViTImageProcessor
+    - preprocess
+
 ## ViTModel

 [[autodoc]] ViTModel
--- a/docs/source/en/preprocessing.mdx
+++ b/docs/source/en/preprocessing.mdx
@@ -17,7 +17,8 @@ specific language governing permissions and limitations under the License.
 Before you can train a model on a dataset, it needs to be preprocessed into the expected model input format. Whether your data is text, images, or audio, they need to be converted and assembled into batches of tensors. 🤗 Transformers provides a set of preprocessing classes to help prepare your data for the model. In this tutorial, you'll learn that for:

 * Text, use a [Tokenizer](./main_classes/tokenizer) to convert text into a sequence of tokens, create a numerical representation of the tokens, and assemble them into tensors.
-* Computer vision and speech, use a [Feature extractor](./main_classes/feature_extractor) to extract sequential features from audio waveforms and images and convert them into tensors.
+* Image inputs use a [ImageProcessor](./main_classes/image) to convert images into tensors.
+* Speech and audio, use a [Feature extractor](./main_classes/feature_extractor) to extract sequential features from audio waveforms and convert them into tensors.
 * Multimodal inputs, use a [Processor](./main_classes/processors) to combine a tokenizer and a feature extractor.

 <Tip>