AutoImageProcessor (#20111)
* AutoImageProcessor skeleton
* Update references
* Add mapping in init
* Add model image processors to __init__ for importing
* Add AutoImageProcessor tests
* Fix up
* Image Processor documentation
* Remove pdb
* Update docs/source/en/model_doc/mobilevit.mdx
* Update docs
* Don't add whitespace on json files
* Remove fixtures
* Move checking model config down
* Fix up
* Add check for image processor
* Remove FeatureExtractorMixin in docstrings
* Rename model_tmpfile to config_tmpfile
* Don't make None if not in image processor map
@@ -66,6 +66,10 @@ Likewise, if your `NewModel` is a subclass of [`PreTrainedModel`], make sure its
[[autodoc]] AutoFeatureExtractor

## AutoImageProcessor

[[autodoc]] AutoImageProcessor

## AutoProcessor

[[autodoc]] AutoProcessor
@@ -60,7 +60,7 @@ Tips:
position embeddings.

<img src="https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/transformers/model_doc/beit_architecture.jpg"
alt="drawing" width="600"/>

<small> BEiT pre-training. Taken from the <a href="https://arxiv.org/abs/2106.08254">original paper.</a> </small>
@@ -84,6 +84,12 @@ contributed by [kamalkraj](https://huggingface.co/kamalkraj). The original code
- __call__
- post_process_semantic_segmentation

## BeitImageProcessor

[[autodoc]] BeitImageProcessor
- preprocess
- post_process_semantic_segmentation

## BeitModel

[[autodoc]] BeitModel
@@ -100,6 +100,11 @@ This model was contributed by [valhalla](https://huggingface.co/valhalla). The o
[[autodoc]] CLIPTokenizerFast

## CLIPImageProcessor

[[autodoc]] CLIPImageProcessor
- preprocess

## CLIPFeatureExtractor

[[autodoc]] CLIPFeatureExtractor
@@ -33,7 +33,7 @@ Tips:
- See the code examples below each model regarding usage.

<img src="https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/convnext_architecture.jpg"
alt="drawing" width="600"/>

<small> ConvNeXT architecture. Taken from the <a href="https://arxiv.org/abs/2201.03545">original paper</a>.</small>
@@ -50,6 +50,11 @@ This model was contributed by [nielsr](https://huggingface.co/nielsr). TensorFlo
[[autodoc]] ConvNextFeatureExtractor

## ConvNextImageProcessor

[[autodoc]] ConvNextImageProcessor
- preprocess

## ConvNextModel

[[autodoc]] ConvNextModel
@@ -71,4 +76,4 @@ This model was contributed by [nielsr](https://huggingface.co/nielsr). TensorFlo
## TFConvNextForImageClassification

[[autodoc]] TFConvNextForImageClassification
- call
@@ -81,6 +81,11 @@ This model was contributed by [nielsr](https://huggingface.co/nielsr). The Tenso
[[autodoc]] DeiTFeatureExtractor
- __call__

## DeiTImageProcessor

[[autodoc]] DeiTImageProcessor
- preprocess

## DeiTModel

[[autodoc]] DeiTModel
@@ -22,7 +22,7 @@ The abstract from the paper is the following:
*We introduce dense vision transformers, an architecture that leverages vision transformers in place of convolutional networks as a backbone for dense prediction tasks. We assemble tokens from various stages of the vision transformer into image-like representations at various resolutions and progressively combine them into full-resolution predictions using a convolutional decoder. The transformer backbone processes representations at a constant and relatively high resolution and has a global receptive field at every stage. These properties allow the dense vision transformer to provide finer-grained and more globally coherent predictions when compared to fully-convolutional networks. Our experiments show that this architecture yields substantial improvements on dense prediction tasks, especially when a large amount of training data is available. For monocular depth estimation, we observe an improvement of up to 28% in relative performance when compared to a state-of-the-art fully-convolutional network. When applied to semantic segmentation, dense vision transformers set a new state of the art on ADE20K with 49.02% mIoU. We further show that the architecture can be fine-tuned on smaller datasets such as NYUv2, KITTI, and Pascal Context where it also sets the new state of the art.*

<img src="https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/dpt_architecture.jpg"
alt="drawing" width="600"/>

<small> DPT architecture. Taken from the <a href="https://arxiv.org/abs/2103.13413" target="_blank">original paper</a>. </small>
@@ -40,6 +40,13 @@ This model was contributed by [nielsr](https://huggingface.co/nielsr). The origi
- post_process_semantic_segmentation

## DPTImageProcessor

[[autodoc]] DPTImageProcessor
- preprocess
- post_process_semantic_segmentation

## DPTModel

[[autodoc]] DPTModel
@@ -55,4 +62,4 @@ This model was contributed by [nielsr](https://huggingface.co/nielsr). The origi
## DPTForSemanticSegmentation

[[autodoc]] DPTForSemanticSegmentation
- forward
@@ -16,17 +16,17 @@ specific language governing permissions and limitations under the License.
The FLAVA model was proposed in [FLAVA: A Foundational Language And Vision Alignment Model](https://arxiv.org/abs/2112.04482) by Amanpreet Singh, Ronghang Hu, Vedanuj Goswami, Guillaume Couairon, Wojciech Galuba, Marcus Rohrbach, and Douwe Kiela and is accepted at CVPR 2022.

The paper aims at creating a single unified foundation model which can work across vision, language
as well as vision-and-language multimodal tasks.

The abstract from the paper is the following:

*State-of-the-art vision and vision-and-language models rely on large-scale visio-linguistic pretraining for obtaining good performance on a variety
of downstream tasks. Generally, such models are often either cross-modal (contrastive) or multi-modal
(with earlier fusion) but not both; and they often only target specific modalities or tasks. A promising
direction would be to use a single holistic universal model, as a "foundation", that targets all modalities
at once -- a true vision and language foundation model should be good at vision tasks, language tasks, and
cross- and multi-modal vision and language tasks. We introduce FLAVA as such a model and demonstrate
impressive performance on a wide range of 35 tasks spanning these target modalities.*
@@ -61,6 +61,11 @@ This model was contributed by [aps](https://huggingface.co/aps). The original co
[[autodoc]] FlavaFeatureExtractor

## FlavaImageProcessor

[[autodoc]] FlavaImageProcessor
- preprocess

## FlavaForPreTraining

[[autodoc]] FlavaForPreTraining
@@ -35,7 +35,7 @@ Tips:
- One can use [`GLPNFeatureExtractor`] to prepare images for the model.

<img src="https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/glpn_architecture.jpg"
alt="drawing" width="600"/>

<small> Summary of the approach. Taken from the <a href="https://arxiv.org/abs/2201.07436" target="_blank">original paper</a>. </small>
@@ -50,6 +50,11 @@ This model was contributed by [nielsr](https://huggingface.co/nielsr). The origi
[[autodoc]] GLPNFeatureExtractor
- __call__

## GLPNImageProcessor

[[autodoc]] GLPNImageProcessor
- preprocess

## GLPNModel

[[autodoc]] GLPNModel
@@ -58,4 +63,4 @@ This model was contributed by [nielsr](https://huggingface.co/nielsr). The origi
## GLPNForDepthEstimation

[[autodoc]] GLPNForDepthEstimation
- forward
@@ -29,7 +29,7 @@ competitive with self-supervised benchmarks on ImageNet when substituting pixels
top-1 accuracy on a linear probe of our features.*

<img src="https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/imagegpt_architecture.png"
alt="drawing" width="600"/>

<small> Summary of the approach. Taken from the [original paper](https://cdn.openai.com/papers/Generative_Pretraining_from_Pixels_V2.pdf). </small>
@@ -81,6 +81,11 @@ Tips:
- __call__

## ImageGPTImageProcessor

[[autodoc]] ImageGPTImageProcessor
- preprocess

## ImageGPTModel

[[autodoc]] ImageGPTModel
@@ -97,4 +102,4 @@ Tips:
[[autodoc]] ImageGPTForImageClassification

- forward
@@ -45,7 +45,7 @@ RVL-CDIP (0.9443 -> 0.9564), and DocVQA (0.7295 -> 0.8672). The pre-trained Layo
this https URL.*

LayoutLMv2 depends on `detectron2`, `torchvision` and `tesseract`. Run the
following to install them:
```
python -m pip install 'git+https://github.com/facebookresearch/detectron2.git'
python -m pip install torchvision tesseract
@@ -275,6 +275,11 @@ print(encoding.keys())
[[autodoc]] LayoutLMv2FeatureExtractor
- __call__

## LayoutLMv2ImageProcessor

[[autodoc]] LayoutLMv2ImageProcessor
- preprocess

## LayoutLMv2Tokenizer

[[autodoc]] LayoutLMv2Tokenizer
@@ -73,6 +73,11 @@ LayoutLMv3 is nearly identical to LayoutLMv2, so we've also included LayoutLMv2
[[autodoc]] LayoutLMv3FeatureExtractor
- __call__

## LayoutLMv3ImageProcessor

[[autodoc]] LayoutLMv3ImageProcessor
- preprocess

## LayoutLMv3Tokenizer

[[autodoc]] LayoutLMv3Tokenizer
@@ -19,18 +19,18 @@ The LeViT model was proposed in [LeViT: Introducing Convolutions to Vision Trans
The abstract from the paper is the following:

*We design a family of image classification architectures that optimize the trade-off between accuracy
and efficiency in a high-speed regime. Our work exploits recent findings in attention-based architectures,
which are competitive on highly parallel processing hardware. We revisit principles from the extensive
literature on convolutional neural networks to apply them to transformers, in particular activation maps
with decreasing resolutions. We also introduce the attention bias, a new way to integrate positional information
in vision transformers. As a result, we propose LeVIT: a hybrid neural network for fast inference image classification.
We consider different measures of efficiency on different hardware platforms, so as to best reflect a wide range of
application scenarios. Our extensive experiments empirically validate our technical choices and show they are suitable
to most architectures. Overall, LeViT significantly outperforms existing convnets and vision transformers with respect
to the speed/accuracy tradeoff. For example, at 80% ImageNet top-1 accuracy, LeViT is 5 times faster than EfficientNet on CPU. *

<img src="https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/levit_architecture.png"
alt="drawing" width="600"/>

<small> LeViT Architecture. Taken from the <a href="https://arxiv.org/abs/2104.01136">original paper</a>.</small>
@@ -38,25 +38,25 @@ Tips:
- Compared to ViT, LeViT models use an additional distillation head to effectively learn from a teacher (which, in the LeViT paper, is a ResNet-like model). The distillation head is learned through backpropagation under supervision of a ResNet-like model. They also draw inspiration from convolutional neural networks to use activation maps with decreasing resolutions to increase the efficiency.
- There are 2 ways to fine-tune distilled models, either (1) in a classic way, by only placing a prediction head on top
of the final hidden state and not using the distillation head, or (2) by placing both a prediction head and distillation
head on top of the final hidden state. In that case, the prediction head is trained using regular cross-entropy between
the prediction of the head and the ground-truth label, while the distillation prediction head is trained using hard distillation
(cross-entropy between the prediction of the distillation head and the label predicted by the teacher). At inference time,
one takes the average prediction between both heads as final prediction. (2) is also called "fine-tuning with distillation",
because one relies on a teacher that has already been fine-tuned on the downstream dataset. In terms of models, (1) corresponds
to [`LevitForImageClassification`] and (2) corresponds to [`LevitForImageClassificationWithTeacher`].
- All released checkpoints were pre-trained and fine-tuned on [ImageNet-1k](https://huggingface.co/datasets/imagenet-1k)
(also referred to as ILSVRC 2012, a collection of 1.3 million images and 1,000 classes) only. No external data was used. This is in
contrast with the original ViT model, which used external data like the JFT-300M dataset/Imagenet-21k for
pre-training.
- The authors of LeViT released 5 trained LeViT models, which you can directly plug into [`LevitModel`] or [`LevitForImageClassification`].
Techniques like data augmentation, optimization, and regularization were used in order to simulate training on a much larger dataset
(while only using ImageNet-1k for pre-training). The 5 variants available are (all trained on images of size 224x224):
*facebook/levit-128S*, *facebook/levit-128*, *facebook/levit-192*, *facebook/levit-256* and
*facebook/levit-384*. Note that one should use [`LevitFeatureExtractor`] in order to
prepare images for the model.
- [`LevitForImageClassificationWithTeacher`] currently supports only inference and not training or fine-tuning.
- You can check out demo notebooks regarding inference as well as fine-tuning on custom data [here](https://github.com/NielsRogge/Transformers-Tutorials/tree/master/VisionTransformer)
(you can just replace [`ViTFeatureExtractor`] by [`LevitFeatureExtractor`] and [`ViTForImageClassification`] by [`LevitForImageClassification`] or [`LevitForImageClassificationWithTeacher`]).

This model was contributed by [anugunj](https://huggingface.co/anugunj). The original code can be found [here](https://github.com/facebookresearch/LeViT).
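The inference rule from the tips above (taking the average prediction of the classification head and the distillation head) can be sketched in plain Python. This is only an illustration, not the library's implementation: the helper names `average_heads` and `predicted_class` and the toy scores are hypothetical, and a real model returns tensors rather than lists.

```python
def average_heads(cls_logits, distillation_logits):
    """Average the per-class scores of the two heads (hypothetical helper)."""
    if len(cls_logits) != len(distillation_logits):
        raise ValueError("both heads must score the same number of classes")
    return [(a + b) / 2 for a, b in zip(cls_logits, distillation_logits)]


def predicted_class(logits):
    """Return the index of the highest score."""
    return max(range(len(logits)), key=lambda i: logits[i])


# Toy scores for three classes (made-up numbers, for illustration only).
cls_logits = [2.0, 0.5, 1.0]
distillation_logits = [1.0, 0.5, 3.0]

final_logits = average_heads(cls_logits, distillation_logits)
print(final_logits)                   # [1.5, 0.5, 2.0]
print(predicted_class(final_logits))  # 2
```

Note that in this toy case the classification head alone would pick class 0, while the averaged prediction picks class 2.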
@@ -71,6 +71,12 @@ This model was contributed by [anugunj](https://huggingface.co/anugunj). The ori
[[autodoc]] LevitFeatureExtractor
- __call__

## LevitImageProcessor

[[autodoc]] LevitImageProcessor
- preprocess

## LevitModel

[[autodoc]] LevitModel
@@ -14,7 +14,7 @@ specific language governing permissions and limitations under the License.
## Overview

The MobileViT model was proposed in [MobileViT: Light-weight, General-purpose, and Mobile-friendly Vision Transformer](https://arxiv.org/abs/2110.02178) by Sachin Mehta and Mohammad Rastegari. MobileViT introduces a new layer that replaces local processing in convolutions with global processing using transformers.

The abstract from the paper is the following:
@@ -25,10 +25,10 @@ Tips:
- MobileViT is more like a CNN than a Transformer model. It does not work on sequence data but on batches of images. Unlike ViT, there are no embeddings. The backbone model outputs a feature map. You can follow [this tutorial](https://keras.io/examples/vision/mobilevit) for a lightweight introduction.
- One can use [`MobileViTFeatureExtractor`] to prepare images for the model. Note that if you do your own preprocessing, the pretrained checkpoints expect images to be in BGR pixel order (not RGB).
- The available image classification checkpoints are pre-trained on [ImageNet-1k](https://huggingface.co/datasets/imagenet-1k) (also referred to as ILSVRC 2012, a collection of 1.3 million images and 1,000 classes).
- The segmentation model uses a [DeepLabV3](https://arxiv.org/abs/1706.05587) head. The available semantic segmentation checkpoints are pre-trained on [PASCAL VOC](http://host.robots.ox.ac.uk/pascal/VOC/).
- As the name suggests MobileViT was designed to be performant and efficient on mobile phones. The TensorFlow versions of the MobileViT models are fully compatible with [TensorFlow Lite](https://www.tensorflow.org/lite).

You can use the following code to convert a MobileViT checkpoint (be it image classification or semantic segmentation) to generate a
TensorFlow Lite model:

```py
@@ -52,7 +52,7 @@ with open(tflite_filename, "wb") as f:
```

The resulting model will be just **about an MB** making it a good fit for mobile applications where resources and network
bandwidth can be constrained.

This model was contributed by [matthijs](https://huggingface.co/Matthijs). The TensorFlow version of the model was contributed by [sayakpaul](https://huggingface.co/sayakpaul). The original code and weights can be found [here](https://github.com/apple/ml-cvnets).
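Regarding the note in the tips above that the pretrained checkpoints expect BGR pixel order: if you do your own preprocessing, the channel swap itself is small. Below is a minimal sketch in plain Python, assuming an image stored as nested lists of `[R, G, B]` pixels; `rgb_to_bgr` is a hypothetical helper and not part of the library.

```python
def rgb_to_bgr(image):
    """Reverse the channel order of every pixel in a height x width x 3 image."""
    return [[pixel[::-1] for pixel in row] for row in image]


# A 1x2 toy image: one pure-red and one pure-blue pixel, in RGB order.
rgb_image = [[[255, 0, 0], [0, 0, 255]]]
bgr_image = rgb_to_bgr(rgb_image)
print(bgr_image)  # [[[0, 0, 255], [255, 0, 0]]]
```

With a NumPy array of shape `(height, width, 3)`, the same swap is usually written as `image[..., ::-1]`.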
@@ -68,6 +68,12 @@ This model was contributed by [matthijs](https://huggingface.co/Matthijs). The T
- __call__
- post_process_semantic_segmentation

## MobileViTImageProcessor

[[autodoc]] MobileViTImageProcessor
- preprocess
- post_process_semantic_segmentation

## MobileViTModel

[[autodoc]] MobileViTModel
@@ -86,14 +92,14 @@ This model was contributed by [matthijs](https://huggingface.co/Matthijs). The T
## TFMobileViTModel

[[autodoc]] TFMobileViTModel
- call

## TFMobileViTForImageClassification

[[autodoc]] TFMobileViTForImageClassification
- call

## TFMobileViTForSemanticSegmentation

[[autodoc]] TFMobileViTForSemanticSegmentation
- call
@@ -70,7 +70,7 @@ vocabulary size of the model, i.e. creating logits of shape `(batch_size, 2048,
size of 262 byte IDs).

<img src="https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/perceiver_architecture.jpg"
alt="drawing" width="600"/>

<small> Perceiver IO architecture. Taken from the <a href="https://arxiv.org/abs/2105.15203">original paper</a> </small>
@@ -83,8 +83,8 @@ Tips:
notebooks](https://github.com/NielsRogge/Transformers-Tutorials/tree/master/Perceiver).
- Refer to the [blog post](https://huggingface.co/blog/perceiver) if you want to fully understand how the model works and
is implemented in the library. Note that the models available in the library only showcase some examples of what you can do
with the Perceiver. There are many more use cases, including question answering, named-entity recognition, object detection,
audio classification, video classification, etc.

**Note**:
@@ -114,6 +114,11 @@ audio classification, video classification, etc.
[[autodoc]] PerceiverFeatureExtractor
- __call__

## PerceiverImageProcessor

[[autodoc]] PerceiverImageProcessor
- preprocess

## PerceiverTextPreprocessor

[[autodoc]] models.perceiver.modeling_perceiver.PerceiverTextPreprocessor
@@ -50,12 +50,17 @@ This model was contributed by [heytanay](https://huggingface.co/heytanay). The o
[[autodoc]] PoolFormerFeatureExtractor
- __call__

## PoolFormerImageProcessor

[[autodoc]] PoolFormerImageProcessor
- preprocess

## PoolFormerModel

[[autodoc]] PoolFormerModel
- forward

## PoolFormerForImageClassification

[[autodoc]] PoolFormerForImageClassification
- forward
@@ -36,7 +36,7 @@ The figure below illustrates the architecture of SegFormer. Taken from the [orig
<img width="600" src="https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/segformer_architecture.png"/>

This model was contributed by [nielsr](https://huggingface.co/nielsr). The TensorFlow version
of the model was contributed by [sayakpaul](https://huggingface.co/sayakpaul). The original code can be found [here](https://github.com/NVlabs/SegFormer).

Tips:
@@ -55,7 +55,7 @@ Tips:
- TensorFlow users should refer to [this repository](https://github.com/deep-diver/segformer-tf-transformers) that shows off-the-shelf inference and fine-tuning.
- One can also check out [this interactive demo on Hugging Face Spaces](https://huggingface.co/spaces/chansung/segformer-tf-transformers)
to try out a SegFormer model on custom images.
- SegFormer works on any input size, as it pads the input to be divisible by `config.patch_sizes`.
- One can use [`SegformerFeatureExtractor`] to prepare images and corresponding segmentation maps
for the model. Note that this feature extractor is fairly basic and does not include all data augmentations used in
the original paper. The original preprocessing pipelines (for the ADE20k dataset for instance) can be found [here](https://github.com/NVlabs/SegFormer/blob/master/local_configs/_base_/datasets/ade20k_repeat.py). The most
@@ -95,6 +95,12 @@ SegFormer's results on the segmentation datasets like ADE20k, refer to the [pape
- __call__
- post_process_semantic_segmentation

## SegformerImageProcessor

[[autodoc]] SegformerImageProcessor
- preprocess
- post_process_semantic_segmentation

## SegformerModel

[[autodoc]] SegformerModel
@@ -123,14 +129,14 @@ SegFormer's results on the segmentation datasets like ADE20k, refer to the [pape
## TFSegformerModel

[[autodoc]] TFSegformerModel
- call

## TFSegformerForImageClassification

[[autodoc]] TFSegformerForImageClassification
- call

## TFSegformerForSemanticSegmentation

[[autodoc]] TFSegformerForSemanticSegmentation
- call
@@ -27,7 +27,7 @@ Tips:
- [`VideoMAEForPreTraining`] includes the decoder on top for self-supervised pre-training.

<img src="https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/transformers/model_doc/videomae_architecture.jpeg"
alt="drawing" width="600"/>

<small> VideoMAE pre-training. Taken from the <a href="https://arxiv.org/abs/2203.12602">original paper</a>. </small>
@@ -44,6 +44,11 @@ The original code can be found [here](https://github.com/MCG-NJU/VideoMAE).
[[autodoc]] VideoMAEFeatureExtractor
- __call__

## VideoMAEImageProcessor

[[autodoc]] VideoMAEImageProcessor
- preprocess

## VideoMAEModel

[[autodoc]] VideoMAEModel
@@ -57,4 +62,4 @@ The original code can be found [here](https://github.com/MCG-NJU/VideoMAE).
## VideoMAEForVideoClassification

[[autodoc]] transformers.VideoMAEForVideoClassification
- forward
@@ -38,12 +38,12 @@ Tips:
This processor wraps a feature extractor (for the image modality) and a tokenizer (for the language modality) into one.
- ViLT is trained with images of various sizes: the authors resize the shorter edge of input images to 384 and limit the longer edge to
under 640 while preserving the aspect ratio. To make batching of images possible, the authors use a `pixel_mask` that indicates
which pixel values are real and which are padding. [`ViltProcessor`] automatically creates this for you.
- The design of ViLT is very similar to that of a standard Vision Transformer (ViT). The only difference is that the model includes
additional embedding layers for the language modality.

<img src="https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/vilt_architecture.jpg"
alt="drawing" width="600"/>

<small> ViLT architecture. Taken from the <a href="https://arxiv.org/abs/2102.03334">original paper</a>. </small>
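The resizing rule described in the tips above (shorter edge resized to 384, longer edge limited to 640, aspect ratio preserved) can be worked through with a small calculation. This is a sketch under stated assumptions rather than the feature extractor's actual code: the function name `vilt_style_target_size` is hypothetical, the longer edge is capped at exactly 640, and sizes are rounded to the nearest pixel, so the library may produce slightly different numbers.

```python
def vilt_style_target_size(height, width, shorter=384, longer=640):
    """Compute a (height, width) target that keeps the aspect ratio.

    The image is scaled so its shorter edge reaches `shorter`, unless that
    would push the longer edge past `longer`; in that case the longer edge
    is scaled to `longer` instead. Rounding to the nearest pixel is an
    assumption of this sketch.
    """
    scale = min(shorter / min(height, width), longer / max(height, width))
    return int(height * scale + 0.5), int(width * scale + 0.5)


# A 480x640 image: the shorter edge becomes 384 and the longer edge stays below 640.
print(vilt_style_target_size(480, 640))  # (384, 512)

# A wide 300x900 image: the longer edge hits the 640 cap first.
print(vilt_style_target_size(300, 900))  # (213, 640)
```

Because images in a batch can end up with different target sizes, they are padded to a common size, which is where the `pixel_mask` mentioned above comes in.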
@@ -63,6 +63,11 @@ Tips:
[[autodoc]] ViltFeatureExtractor
- __call__

## ViltImageProcessor

[[autodoc]] ViltImageProcessor
- preprocess

## ViltProcessor

[[autodoc]] ViltProcessor
@@ -57,7 +57,7 @@ Tips:
improvement of 2% to training from scratch, but still 4% behind supervised pre-training.

<img src="https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/transformers/model_doc/vit_architecture.jpg"
alt="drawing" width="600"/>

<small> ViT architecture. Taken from the <a href="https://arxiv.org/abs/2010.11929">original paper.</a> </small>
@@ -96,6 +96,12 @@ go to him!
[[autodoc]] ViTFeatureExtractor
- __call__

## ViTImageProcessor

[[autodoc]] ViTImageProcessor
- preprocess

## ViTModel

[[autodoc]] ViTModel