Update doc examples feature extractor -> image processor (#20501)
* Update doc example feature extractor -> image processor
* Apply suggestions from code review
@@ -40,12 +40,12 @@ Tips:
 - BEiT models are regular Vision Transformers, but pre-trained in a self-supervised way rather than supervised. They
 outperform both the [original model (ViT)](vit) as well as [Data-efficient Image Transformers (DeiT)](deit) when fine-tuned on ImageNet-1K and CIFAR-100. You can check out demo notebooks regarding inference as well as
 fine-tuning on custom data [here](https://github.com/NielsRogge/Transformers-Tutorials/tree/master/VisionTransformer) (you can just replace
-[`ViTFeatureExtractor`] by [`BeitFeatureExtractor`] and
+[`ViTFeatureExtractor`] by [`BeitImageProcessor`] and
 [`ViTForImageClassification`] by [`BeitForImageClassification`]).
 - There's also a demo notebook available which showcases how to combine DALL-E's image tokenizer with BEiT for
 performing masked image modeling. You can find it [here](https://github.com/NielsRogge/Transformers-Tutorials/tree/master/BEiT).
 - As the BEiT models expect each image to be of the same size (resolution), one can use
-[`BeitFeatureExtractor`] to resize (or rescale) and normalize images for the model.
+[`BeitImageProcessor`] to resize (or rescale) and normalize images for the model.
 - Both the patch resolution and image resolution used during pre-training or fine-tuning are reflected in the name of
 each checkpoint. For example, `microsoft/beit-base-patch16-224` refers to a base-sized architecture with patch
 resolution of 16x16 and fine-tuning resolution of 224x224. All checkpoints can be found on the [hub](https://huggingface.co/models?search=microsoft/beit).
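The checkpoint naming convention described in the BEiT tips (`<size>-patch<P>-<resolution>`) can be read mechanically. The following is a minimal sketch in plain Python; `parse_checkpoint_name` is a hypothetical helper written for illustration, not part of the `transformers` library, and the regular expression assumes names shaped like the example in the text.

```python
import re


def parse_checkpoint_name(name):
    """Extract size, patch resolution and image resolution from a checkpoint name.

    Hypothetical helper: assumes the name contains '-<size>-patch<P>-<R>'.
    """
    match = re.search(r"-(\w+)-patch(\d+)-(\d+)", name)
    if match is None:
        raise ValueError(f"unrecognized checkpoint name: {name}")
    size = match.group(1)
    patch = int(match.group(2))
    resolution = int(match.group(3))
    # A square image of side `resolution` is cut into (resolution / patch)^2 patches.
    num_patches = (resolution // patch) ** 2
    return {
        "size": size,
        "patch": patch,
        "resolution": resolution,
        "num_patches": num_patches,
    }


info = parse_checkpoint_name("microsoft/beit-base-patch16-224")
print(info)
```

For the example checkpoint this gives a 16x16 patch size at 224x224 resolution, which corresponds to a 14x14 grid of patches.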
@@ -32,7 +32,7 @@ a crucial component in existing Vision Transformers, can be safely removed in ou
 Tips:

 - CvT models are regular Vision Transformers, but trained with convolutions. They outperform the [original model (ViT)](vit) when fine-tuned on ImageNet-1K and CIFAR-100.
-- You can check out demo notebooks regarding inference as well as fine-tuning on custom data [here](https://github.com/NielsRogge/Transformers-Tutorials/tree/master/VisionTransformer) (you can just replace [`ViTFeatureExtractor`] by [`AutoFeatureExtractor`] and [`ViTForImageClassification`] by [`CvtForImageClassification`]).
+- You can check out demo notebooks regarding inference as well as fine-tuning on custom data [here](https://github.com/NielsRogge/Transformers-Tutorials/tree/master/VisionTransformer) (you can just replace [`ViTFeatureExtractor`] by [`AutoImageProcessor`] and [`ViTForImageClassification`] by [`CvtForImageClassification`]).
 - The available checkpoints are either (1) pre-trained on [ImageNet-22k](http://www.image-net.org/) (a collection of 14 million images and 22k classes) only, (2) also fine-tuned on ImageNet-22k or (3) also fine-tuned on [ImageNet-1k](http://www.image-net.org/challenges/LSVRC/2012/) (also referred to as ILSVRC 2012, a collection of 1.3 million
 images and 1,000 classes).

@@ -23,7 +23,7 @@ The abstract from the paper is the following:

 Tips:

-- One can use [`DeformableDetrFeatureExtractor`] to prepare images (and optional targets) for the model.
+- One can use [`DeformableDetrImageProcessor`] to prepare images (and optional targets) for the model.
 - Training Deformable DETR is equivalent to training the original [DETR](detr) model. Demo notebooks can be found [here](https://github.com/NielsRogge/Transformers-Tutorials/tree/master/DETR).

 <img src="https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/deformable_detr_architecture.png"
@@ -66,7 +66,7 @@ Tips:
 augmentation, optimization, and regularization were used in order to simulate training on a much larger dataset
 (while only using ImageNet-1k for pre-training). There are 4 variants available (in 3 different sizes):
 *facebook/deit-tiny-patch16-224*, *facebook/deit-small-patch16-224*, *facebook/deit-base-patch16-224* and
-*facebook/deit-base-patch16-384*. Note that one should use [`DeiTFeatureExtractor`] in order to
+*facebook/deit-base-patch16-384*. Note that one should use [`DeiTImageProcessor`] in order to
 prepare images for the model.

 This model was contributed by [nielsr](https://huggingface.co/nielsr). The TensorFlow version of this model was added by [amyeroberts](https://huggingface.co/amyeroberts).
@@ -105,11 +105,11 @@ Tips:
 - DETR resizes the input images such that the shortest side is at least a certain amount of pixels while the longest is
 at most 1333 pixels. At training time, scale augmentation is used such that the shortest side is randomly set to at
 least 480 and at most 800 pixels. At inference time, the shortest side is set to 800. One can use
-[`~transformers.DetrFeatureExtractor`] to prepare images (and optional annotations in COCO format) for the
+[`~transformers.DetrImageProcessor`] to prepare images (and optional annotations in COCO format) for the
 model. Due to this resizing, images in a batch can have different sizes. DETR solves this by padding images up to the
 largest size in a batch, and by creating a pixel mask that indicates which pixels are real/which are padding.
 Alternatively, one can also define a custom `collate_fn` in order to batch images together, using
-[`~transformers.DetrFeatureExtractor.pad_and_create_pixel_mask`].
+[`~transformers.DetrImageProcessor.pad_and_create_pixel_mask`].
 - The size of the images will determine the amount of memory being used, and will thus determine the `batch_size`.
 It is advised to use a batch size of 2 per GPU. See [this Github thread](https://github.com/facebookresearch/detr/issues/150) for more info.

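The DETR resizing and padding rules described in this hunk can be illustrated without the library. The sketch below is a simplified plain-Python approximation, not the actual `DetrImageProcessor` code: `detr_inference_size` and `make_pixel_masks` are hypothetical names, and the rounding behaviour is an assumption.

```python
def detr_inference_size(height, width, shortest=800, longest=1333):
    """Scale so the shortest side is `shortest`, unless the longest side would exceed `longest`."""
    scale = shortest / min(height, width)
    if max(height, width) * scale > longest:
        # Fall back to capping the longest side instead.
        scale = longest / max(height, width)
    return round(height * scale), round(width * scale)


def make_pixel_masks(sizes):
    """Pad a batch of (height, width) sizes to the largest size and build 0/1 masks.

    A mask value of 1 marks a real pixel, 0 marks padding.
    """
    max_h = max(h for h, _ in sizes)
    max_w = max(w for _, w in sizes)
    masks = [
        [[1 if r < h and c < w else 0 for c in range(max_w)] for r in range(max_h)]
        for h, w in sizes
    ]
    return (max_h, max_w), masks


print(detr_inference_size(480, 640))
padded_size, masks = make_pixel_masks([(2, 3), (3, 2)])
print(padded_size, masks)
```

Because each image keeps its aspect ratio, two images in the same batch usually end up with different sizes, which is why the mask is needed to tell real pixels from padding.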
@@ -142,14 +142,14 @@ As a summary, consider the following table:
 | **Description** | Predicting bounding boxes and class labels around objects in an image | Predicting masks around objects (i.e. instances) in an image | Predicting masks around both objects (i.e. instances) as well as "stuff" (i.e. background things like trees and roads) in an image |
 | **Model** | [`~transformers.DetrForObjectDetection`] | [`~transformers.DetrForSegmentation`] | [`~transformers.DetrForSegmentation`] |
 | **Example dataset** | COCO detection | COCO detection, COCO panoptic | COCO panoptic | |
-| **Format of annotations to provide to** [`~transformers.DetrFeatureExtractor`] | {'image_id': `int`, 'annotations': `List[Dict]`} each Dict being a COCO object annotation | {'image_id': `int`, 'annotations': `List[Dict]`} (in case of COCO detection) or {'file_name': `str`, 'image_id': `int`, 'segments_info': `List[Dict]`} (in case of COCO panoptic) | {'file_name': `str`, 'image_id': `int`, 'segments_info': `List[Dict]`} and masks_path (path to directory containing PNG files of the masks) |
-| **Postprocessing** (i.e. converting the output of the model to COCO API) | [`~transformers.DetrFeatureExtractor.post_process`] | [`~transformers.DetrFeatureExtractor.post_process_segmentation`] | [`~transformers.DetrFeatureExtractor.post_process_segmentation`], [`~transformers.DetrFeatureExtractor.post_process_panoptic`] |
+| **Format of annotations to provide to** [`~transformers.DetrImageProcessor`] | {'image_id': `int`, 'annotations': `List[Dict]`} each Dict being a COCO object annotation | {'image_id': `int`, 'annotations': `List[Dict]`} (in case of COCO detection) or {'file_name': `str`, 'image_id': `int`, 'segments_info': `List[Dict]`} (in case of COCO panoptic) | {'file_name': `str`, 'image_id': `int`, 'segments_info': `List[Dict]`} and masks_path (path to directory containing PNG files of the masks) |
+| **Postprocessing** (i.e. converting the output of the model to COCO API) | [`~transformers.DetrImageProcessor.post_process`] | [`~transformers.DetrImageProcessor.post_process_segmentation`] | [`~transformers.DetrImageProcessor.post_process_segmentation`], [`~transformers.DetrImageProcessor.post_process_panoptic`] |
 | **evaluators** | `CocoEvaluator` with `iou_types="bbox"` | `CocoEvaluator` with `iou_types="bbox"` or `"segm"` | `CocoEvaluator` with `iou_tupes="bbox"` or `"segm"`, `PanopticEvaluator` |

 In short, one should prepare the data either in COCO detection or COCO panoptic format, then use
-[`~transformers.DetrFeatureExtractor`] to create `pixel_values`, `pixel_mask` and optional
+[`~transformers.DetrImageProcessor`] to create `pixel_values`, `pixel_mask` and optional
 `labels`, which can then be used to train (or fine-tune) a model. For evaluation, one should first convert the
-outputs of the model using one of the postprocessing methods of [`~transformers.DetrFeatureExtractor`]. These can
+outputs of the model using one of the postprocessing methods of [`~transformers.DetrImageProcessor`]. These can
 be be provided to either `CocoEvaluator` or `PanopticEvaluator`, which allow you to calculate metrics like
 mean Average Precision (mAP) and Panoptic Quality (PQ). The latter objects are implemented in the [original repository](https://github.com/facebookresearch/detr). See the [example notebooks](https://github.com/NielsRogge/Transformers-Tutorials/tree/master/DETR) for more info regarding evaluation.

@@ -32,7 +32,7 @@ The abstract from the paper is the following:
 Tips:

 - A notebook illustrating inference with [`GLPNForDepthEstimation`] can be found [here](https://github.com/NielsRogge/Transformers-Tutorials/blob/master/GLPN/GLPN_inference_(depth_estimation).ipynb).
-- One can use [`GLPNFeatureExtractor`] to prepare images for the model.
+- One can use [`GLPNImageProcessor`] to prepare images for the model.

 <img src="https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/glpn_architecture.jpg"
 alt="drawing" width="600"/>
@@ -49,7 +49,7 @@ Tips:
 applied k-means clustering to the (R,G,B) pixel values with k=512. This way, we only have a 32*32 = 1024-long
 sequence, but now of integers in the range 0..511. So we are shrinking the sequence length at the cost of a bigger
 embedding matrix. In other words, the vocabulary size of ImageGPT is 512, + 1 for a special "start of sentence" (SOS)
-token, used at the beginning of every sequence. One can use [`ImageGPTFeatureExtractor`] to prepare
+token, used at the beginning of every sequence. One can use [`ImageGPTImageProcessor`] to prepare
 images for the model.
 - Despite being pre-trained entirely unsupervised (i.e. without the use of any labels), ImageGPT produces fairly
 performant image features useful for downstream tasks, such as image classification. The authors showed that the
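The color-quantization idea in the ImageGPT tip (each pixel is replaced by the index of its nearest color cluster) can be sketched in a few lines. This is an illustration only: it uses a toy palette of 4 colors instead of the 512 k-means clusters, and `nearest_cluster` and `image_to_token_ids` are hypothetical helper names, not the `ImageGPTImageProcessor` API.

```python
def nearest_cluster(pixel, clusters):
    """Return the index of the cluster color closest to `pixel` (squared Euclidean distance)."""
    return min(
        range(len(clusters)),
        key=lambda i: sum((p - c) ** 2 for p, c in zip(pixel, clusters[i])),
    )


def image_to_token_ids(pixels, clusters):
    """Map a flat list of (R, G, B) pixels to a sequence of cluster indices."""
    return [nearest_cluster(pixel, clusters) for pixel in pixels]


# Figures taken from the text: a 32x32 image and 512 clusters plus one SOS token.
sequence_length = 32 * 32
vocab_size = 512 + 1

# Toy palette standing in for the 512 learned clusters (assumption for illustration).
palette = [(0, 0, 0), (255, 0, 0), (0, 255, 0), (0, 0, 255)]
pixels = [(250, 10, 5), (3, 2, 1), (10, 240, 20), (0, 0, 200)]
token_ids = image_to_token_ids(pixels, palette)
print(sequence_length, vocab_size, token_ids)
```

Each pixel becomes a single integer rather than three color channels, which is what keeps the sequence at 1024 tokens instead of 3072.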
@@ -53,11 +53,11 @@ Tips:
 Techniques like data augmentation, optimization, and regularization were used in order to simulate training on a much larger dataset
 (while only using ImageNet-1k for pre-training). The 5 variants available are (all trained on images of size 224x224):
 *facebook/levit-128S*, *facebook/levit-128*, *facebook/levit-192*, *facebook/levit-256* and
-*facebook/levit-384*. Note that one should use [`LevitFeatureExtractor`] in order to
+*facebook/levit-384*. Note that one should use [`LevitImageProcessor`] in order to
 prepare images for the model.
 - [`LevitForImageClassificationWithTeacher`] currently supports only inference and not training or fine-tuning.
 - You can check out demo notebooks regarding inference as well as fine-tuning on custom data [here](https://github.com/NielsRogge/Transformers-Tutorials/tree/master/VisionTransformer)
-(you can just replace [`ViTFeatureExtractor`] by [`LevitFeatureExtractor`] and [`ViTForImageClassification`] by [`LevitForImageClassification`] or [`LevitForImageClassificationWithTeacher`]).
+(you can just replace [`ViTFeatureExtractor`] by [`LevitImageProcessor`] and [`ViTForImageClassification`] by [`LevitForImageClassification`] or [`LevitForImageClassificationWithTeacher`]).

 This model was contributed by [anugunj](https://huggingface.co/anugunj). The original code can be found [here](https://github.com/facebookresearch/LeViT).

@@ -32,8 +32,8 @@ Tips:
 - If you want to train the model in a distributed environment across multiple nodes, then one should update the
 `get_num_masks` function inside in the `MaskFormerLoss` class of `modeling_maskformer.py`. When training on multiple nodes, this should be
 set to the average number of target masks across all nodes, as can be seen in the original implementation [here](https://github.com/facebookresearch/MaskFormer/blob/da3e60d85fdeedcb31476b5edd7d328826ce56cc/mask_former/modeling/criterion.py#L169).
-- One can use [`MaskFormerFeatureExtractor`] to prepare images for the model and optional targets for the model.
-- To get the final segmentation, depending on the task, you can call [`~MaskFormerFeatureExtractor.post_process_semantic_segmentation`] or [`~MaskFormerFeatureExtractor.post_process_panoptic_segmentation`]. Both tasks can be solved using [`MaskFormerForInstanceSegmentation`] output, panoptic segmentation accepts an optional `label_ids_to_fuse` argument to fuse instances of the target object/s (e.g. sky) together.
+- One can use [`MaskFormerImageProcessor`] to prepare images for the model and optional targets for the model.
+- To get the final segmentation, depending on the task, you can call [`~MaskFormerImageProcessor.post_process_semantic_segmentation`] or [`~MaskFormerImageProcessor.post_process_panoptic_segmentation`]. Both tasks can be solved using [`MaskFormerForInstanceSegmentation`] output, panoptic segmentation accepts an optional `label_ids_to_fuse` argument to fuse instances of the target object/s (e.g. sky) together.

 The figure below illustrates the architecture of MaskFormer. Taken from the [original paper](https://arxiv.org/abs/2107.06278).

@@ -26,7 +26,7 @@ Tips:

 - Even though the checkpoint is trained on images of specific size, the model will work on images of any size. The smallest supported image size is 32x32.

-- One can use [`MobileNetV1FeatureExtractor`] to prepare images for the model.
+- One can use [`MobileNetV1ImageProcessor`] to prepare images for the model.

 - The available image classification checkpoints are pre-trained on [ImageNet-1k](https://huggingface.co/datasets/imagenet-1k) (also referred to as ILSVRC 2012, a collection of 1.3 million images and 1,000 classes). However, the model predicts 1001 classes: the 1000 classes from ImageNet plus an extra “background” class (index 0).

@@ -28,7 +28,7 @@ Tips:

 - Even though the checkpoint is trained on images of specific size, the model will work on images of any size. The smallest supported image size is 32x32.

-- One can use [`MobileNetV2FeatureExtractor`] to prepare images for the model.
+- One can use [`MobileNetV2ImageProcessor`] to prepare images for the model.

 - The available image classification checkpoints are pre-trained on [ImageNet-1k](https://huggingface.co/datasets/imagenet-1k) (also referred to as ILSVRC 2012, a collection of 1.3 million images and 1,000 classes). However, the model predicts 1001 classes: the 1000 classes from ImageNet plus an extra “background” class (index 0).

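The extra "background" class at index 0 mentioned in the MobileNetV1 and MobileNetV2 tips means that predicted indices are shifted by one relative to the 1000 ImageNet classes. A minimal sketch of that bookkeeping, assuming the remaining 1000 outputs follow the usual ImageNet-1k order (`to_imagenet_index` is a hypothetical helper, not a library function):

```python
def to_imagenet_index(predicted_index):
    """Map a 1001-way prediction to a 0-based ImageNet-1k class index.

    Returns None for the background class (index 0).
    """
    if not 0 <= predicted_index <= 1000:
        raise ValueError("expected an index in the range 0..1000")
    if predicted_index == 0:
        return None
    return predicted_index - 1


print(to_imagenet_index(0), to_imagenet_index(1), to_imagenet_index(1000))
```

In practice the checkpoint's own `id2label` mapping is the safer source of class names; the shift above only matters when comparing against a separate 1000-class label list.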
@@ -23,7 +23,7 @@ The abstract from the paper is the following:
 Tips:

 - MobileViT is more like a CNN than a Transformer model. It does not work on sequence data but on batches of images. Unlike ViT, there are no embeddings. The backbone model outputs a feature map. You can follow [this tutorial](https://keras.io/examples/vision/mobilevit) for a lightweight introduction.
-- One can use [`MobileViTFeatureExtractor`] to prepare images for the model. Note that if you do your own preprocessing, the pretrained checkpoints expect images to be in BGR pixel order (not RGB).
+- One can use [`MobileViTImageProcessor`] to prepare images for the model. Note that if you do your own preprocessing, the pretrained checkpoints expect images to be in BGR pixel order (not RGB).
 - The available image classification checkpoints are pre-trained on [ImageNet-1k](https://huggingface.co/datasets/imagenet-1k) (also referred to as ILSVRC 2012, a collection of 1.3 million images and 1,000 classes).
 - The segmentation model uses a [DeepLabV3](https://arxiv.org/abs/1706.05587) head. The available semantic segmentation checkpoints are pre-trained on [PASCAL VOC](http://host.robots.ox.ac.uk/pascal/VOC/).
 - As the name suggests MobileViT was designed to be performant and efficient on mobile phones. The TensorFlow versions of the MobileViT models are fully compatible with [TensorFlow Lite](https://www.tensorflow.org/lite).
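The MobileViT tip notes that custom preprocessing must deliver pixels in BGR order. For anyone bypassing the image processor, the channel flip itself is simple. The sketch below works on a nested-list image of `(R, G, B)` tuples; `rgb_to_bgr` is a hypothetical helper and the nested-list layout is an assumption made to keep the example dependency-free.

```python
def rgb_to_bgr(image):
    """Reverse the channel order of every pixel in a rows-of-pixels image."""
    return [[pixel[::-1] for pixel in row] for row in image]


rgb_image = [
    [(255, 0, 0), (0, 255, 0)],
    [(0, 0, 255), (10, 20, 30)],
]
bgr_image = rgb_to_bgr(rgb_image)
print(bgr_image)
```

Applying the function twice returns the original image, which is a quick sanity check when it is unclear which order a given pipeline already uses.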
@@ -28,7 +28,7 @@ The figure below illustrates the architecture of PoolFormer. Taken from the [ori
 Tips:

 - PoolFormer has a hierarchical architecture, where instead of Attention, a simple Average Pooling layer is present. All checkpoints of the model can be found on the [hub](https://huggingface.co/models?other=poolformer).
-- One can use [`PoolFormerFeatureExtractor`] to prepare images for the model.
+- One can use [`PoolFormerImageProcessor`] to prepare images for the model.
 - As most models, PoolFormer comes in different sizes, the details of which can be found in the table below.

 | **Model variant** | **Depths** | **Hidden sizes** | **Params (M)** | **ImageNet-1k Top 1** |
@@ -24,7 +24,7 @@ The abstract from the paper is the following:

 Tips:

-- One can use [`AutoFeatureExtractor`] to prepare images for the model.
+- One can use [`AutoImageProcessor`] to prepare images for the model.
 - The huge 10B model from [Self-supervised Pretraining of Visual Features in the Wild](https://arxiv.org/abs/2103.01988), trained on one billion Instagram images, is available on the [hub](https://huggingface.co/facebook/regnet-y-10b-seer)

 This model was contributed by [Francesco](https://huggingface.co/Francesco). The TensorFlow version of the model
@@ -25,7 +25,7 @@ The depth of representations is of central importance for many visual recognitio

 Tips:

-- One can use [`AutoFeatureExtractor`] to prepare images for the model.
+- One can use [`AutoImageProcessor`] to prepare images for the model.

 The figure below illustrates the architecture of ResNet. Taken from the [original paper](https://arxiv.org/abs/1512.03385).

@@ -56,12 +56,12 @@ Tips:
 - One can also check out [this interactive demo on Hugging Face Spaces](https://huggingface.co/spaces/chansung/segformer-tf-transformers)
 to try out a SegFormer model on custom images.
 - SegFormer works on any input size, as it pads the input to be divisible by `config.patch_sizes`.
-- One can use [`SegformerFeatureExtractor`] to prepare images and corresponding segmentation maps
-for the model. Note that this feature extractor is fairly basic and does not include all data augmentations used in
+- One can use [`SegformerImageProcessor`] to prepare images and corresponding segmentation maps
+for the model. Note that this image processor is fairly basic and does not include all data augmentations used in
 the original paper. The original preprocessing pipelines (for the ADE20k dataset for instance) can be found [here](https://github.com/NVlabs/SegFormer/blob/master/local_configs/_base_/datasets/ade20k_repeat.py). The most
 important preprocessing step is that images and segmentation maps are randomly cropped and padded to the same size,
 such as 512x512 or 640x640, after which they are normalized.
-- One additional thing to keep in mind is that one can initialize [`SegformerFeatureExtractor`] with
+- One additional thing to keep in mind is that one can initialize [`SegformerImageProcessor`] with
 `reduce_labels` set to `True` or `False`. In some datasets (like ADE20k), the 0 index is used in the annotated
 segmentation maps for background. However, ADE20k doesn't include the "background" class in its 150 labels.
 Therefore, `reduce_labels` is used to reduce all labels by 1, and to make sure no loss is computed for the
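The effect of `reduce_labels` described in the SegFormer hunk (shift every label down by one and exclude background from the loss) can be sketched in plain Python. The hunk is cut off before it says how background is excluded, so the ignore value of 255 used here is an assumption, and the function is an illustration rather than the `SegformerImageProcessor` implementation.

```python
def reduce_labels(segmentation_map, ignore_index=255):
    """Shift labels down by one; background (0) becomes `ignore_index`.

    Pixels that already carry `ignore_index` are left untouched.
    The value 255 is an assumed ignore index, not taken from the diff.
    """
    return [
        [ignore_index if label in (0, ignore_index) else label - 1 for label in row]
        for row in segmentation_map
    ]


annotated = [
    [0, 1, 2],
    [150, 255, 0],
]
reduced = reduce_labels(annotated)
print(reduced)
```

With ADE20k-style annotations, labels 1..150 become 0..149, matching a model head with 150 outputs.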
@@ -33,7 +33,7 @@ prediction tasks such as object detection (58.7 box AP and 51.1 mask AP on COCO
 The hierarchical design and the shifted window approach also prove beneficial for all-MLP architectures.*

 Tips:
-- One can use the [`AutoFeatureExtractor`] API to prepare images for the model.
+- One can use the [`AutoImageProcessor`] API to prepare images for the model.
 - Swin pads the inputs supporting any input height and width (if divisible by `32`).
 - Swin can be used as a *backbone*. When `output_hidden_states = True`, it will output both `hidden_states` and `reshaped_hidden_states`. The `reshaped_hidden_states` have a shape of `(batch, num_channels, height, width)` rather than `(batch_size, sequence_length, num_channels)`.

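The relationship between `hidden_states` and `reshaped_hidden_states` in the Swin tip is a change of layout: a row-major sequence of `height * width` token vectors becomes a channels-first feature map. The sketch below shows this for a single sample using nested lists; `reshape_hidden_state` is a hypothetical helper, and the row-major token order is an assumption.

```python
def reshape_hidden_state(tokens, height, width):
    """Turn (sequence_length, num_channels) into (num_channels, height, width).

    Assumes tokens are stored row by row, so token r * width + c sits at (r, c).
    """
    if len(tokens) != height * width:
        raise ValueError("sequence length must equal height * width")
    num_channels = len(tokens[0])
    return [
        [[tokens[r * width + c][ch] for c in range(width)] for r in range(height)]
        for ch in range(num_channels)
    ]


# Four tokens from a 2x2 grid, two channels each.
tokens = [[0, 10], [1, 11], [2, 12], [3, 13]]
feature_map = reshape_hidden_state(tokens, height=2, width=2)
print(feature_map)
```

The channels-first layout is the one convolutional heads expect, which is why it is convenient when Swin serves as a backbone.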
@@ -21,7 +21,7 @@ The abstract from the paper is the following:
 *Large-scale NLP models have been shown to significantly improve the performance on language tasks with no signs of saturation. They also demonstrate amazing few-shot capabilities like that of human beings. This paper aims to explore large-scale models in computer vision. We tackle three major issues in training and application of large vision models, including training instability, resolution gaps between pre-training and fine-tuning, and hunger on labelled data. Three main techniques are proposed: 1) a residual-post-norm method combined with cosine attention to improve training stability; 2) A log-spaced continuous position bias method to effectively transfer models pre-trained using low-resolution images to downstream tasks with high-resolution inputs; 3) A self-supervised pre-training method, SimMIM, to reduce the needs of vast labeled images. Through these techniques, this paper successfully trained a 3 billion-parameter Swin Transformer V2 model, which is the largest dense vision model to date, and makes it capable of training with images of up to 1,536×1,536 resolution. It set new performance records on 4 representative vision tasks, including ImageNet-V2 image classification, COCO object detection, ADE20K semantic segmentation, and Kinetics-400 video action classification. Also note our training is much more efficient than that in Google's billion-level visual models, which consumes 40 times less labelled data and 40 times less training time.*

 Tips:
-- One can use the [`AutoFeatureExtractor`] API to prepare images for the model.
+- One can use the [`AutoImageProcessor`] API to prepare images for the model.

 This model was contributed by [nandwalritik](https://huggingface.co/nandwalritik).
 The original code can be found [here](https://github.com/microsoft/Swin-Transformer).
@@ -32,7 +32,7 @@ special customization for these tasks.*
 Tips:

 - The authors released 2 models, one for [table detection](https://huggingface.co/microsoft/table-transformer-detection) in documents, one for [table structure recognition](https://huggingface.co/microsoft/table-transformer-structure-recognition) (the task of recognizing the individual rows, columns etc. in a table).
-- One can use the [`AutoFeatureExtractor`] API to prepare images and optional targets for the model. This will load a [`DetrFeatureExtractor`] behind the scenes.
+- One can use the [`AutoImageProcessor`] API to prepare images and optional targets for the model. This will load a [`DetrImageProcessor`] behind the scenes.

 <img src="https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/transformers/model_doc/table_transformer_architecture.jpeg"
 alt="drawing" width="600"/>
@@ -55,9 +55,9 @@ Tips:
 TrOCR's [`VisionEncoderDecoder`] model accepts images as input and makes use of
 [`~generation.GenerationMixin.generate`] to autoregressively generate text given the input image.

-The [`ViTFeatureExtractor`/`DeiTFeatureExtractor`] class is responsible for preprocessing the input image and
+The [`ViTImageProcessor`/`DeiTImageProcessor`] class is responsible for preprocessing the input image and
 [`RobertaTokenizer`/`XLMRobertaTokenizer`] decodes the generated target tokens to the target string. The
-[`TrOCRProcessor`] wraps [`ViTFeatureExtractor`/`DeiTFeatureExtractor`] and [`RobertaTokenizer`/`XLMRobertaTokenizer`]
+[`TrOCRProcessor`] wraps [`ViTImageProcessor`/`DeiTImageProcessor`] and [`RobertaTokenizer`/`XLMRobertaTokenizer`]
 into a single instance to both extract the input features and decode the predicted token ids.

 - Step-by-step Optical Character Recognition (OCR)
@@ -23,7 +23,7 @@ The abstract from the paper is the following:

 Tips:

-- One can use [`VideoMAEFeatureExtractor`] to prepare videos for the model. It will resize + normalize all frames of a video for you.
+- One can use [`VideoMAEImageProcessor`] to prepare videos for the model. It will resize + normalize all frames of a video for you.
 - [`VideoMAEForPreTraining`] includes the decoder on top for self-supervised pre-training.

 <img src="https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/transformers/model_doc/videomae_architecture.jpeg"
@@ -68,17 +68,17 @@ To perform inference, one uses the [`generate`] method, which allows to autoregr
 >>> import requests
 >>> from PIL import Image

->>> from transformers import GPT2TokenizerFast, ViTFeatureExtractor, VisionEncoderDecoderModel
+>>> from transformers import GPT2TokenizerFast, ViTImageProcessor, VisionEncoderDecoderModel

->>> # load a fine-tuned image captioning model and corresponding tokenizer and feature extractor
+>>> # load a fine-tuned image captioning model and corresponding tokenizer and image processor
 >>> model = VisionEncoderDecoderModel.from_pretrained("nlpconnect/vit-gpt2-image-captioning")
 >>> tokenizer = GPT2TokenizerFast.from_pretrained("nlpconnect/vit-gpt2-image-captioning")
->>> feature_extractor = ViTFeatureExtractor.from_pretrained("nlpconnect/vit-gpt2-image-captioning")
+>>> image_processor = ViTImageProcessor.from_pretrained("nlpconnect/vit-gpt2-image-captioning")

 >>> # let's perform inference on an image
 >>> url = "http://images.cocodataset.org/val2017/000000039769.jpg"
 >>> image = Image.open(requests.get(url, stream=True).raw)
->>> pixel_values = feature_extractor(image, return_tensors="pt").pixel_values
+>>> pixel_values = image_processor(image, return_tensors="pt").pixel_values

 >>> # autoregressively generate caption (uses greedy decoding by default)
 >>> generated_ids = model.generate(pixel_values)
@@ -115,10 +115,10 @@ As you can see, only 2 inputs are required for the model in order to compute a l
 images) and `labels` (which are the `input_ids` of the encoded target sequence).

 ```python
->>> from transformers import ViTFeatureExtractor, BertTokenizer, VisionEncoderDecoderModel
+>>> from transformers import ViTImageProcessor, BertTokenizer, VisionEncoderDecoderModel
 >>> from datasets import load_dataset

->>> feature_extractor = ViTFeatureExtractor.from_pretrained("google/vit-base-patch16-224-in21k")
+>>> image_processor = ViTImageProcessor.from_pretrained("google/vit-base-patch16-224-in21k")
 >>> tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
 >>> model = VisionEncoderDecoderModel.from_encoder_decoder_pretrained(
 ... "google/vit-base-patch16-224-in21k", "bert-base-uncased"
@@ -129,7 +129,7 @@ images) and `labels` (which are the `input_ids` of the encoded target sequence).

 >>> dataset = load_dataset("huggingface/cats-image")
 >>> image = dataset["test"]["image"][0]
->>> pixel_values = feature_extractor(image, return_tensors="pt").pixel_values
+>>> pixel_values = image_processor(image, return_tensors="pt").pixel_values

 >>> labels = tokenizer(
 ... "an image of two cats chilling on a couch",
@@ -53,7 +53,7 @@ vectors to a standard BERT model. The text input is concatenated in the front of
 layer, and is expected to be bound by [CLS] and a [SEP] tokens, as in BERT. The segment IDs must also be set
 appropriately for the textual and visual parts.

-The [`BertTokenizer`] is used to encode the text. A custom detector/feature extractor must be used
+The [`BertTokenizer`] is used to encode the text. A custom detector/image processor must be used
 to get the visual embeddings. The following example notebooks show how to use VisualBERT with Detectron-like models:

 - [VisualBERT VQA demo notebook](https://github.com/huggingface/transformers/tree/main/examples/research_projects/visual_bert) : This notebook
@@ -40,7 +40,7 @@ Tips:
 used for classification. The authors also add absolute position embeddings, and feed the resulting sequence of
 vectors to a standard Transformer encoder.
 - As the Vision Transformer expects each image to be of the same size (resolution), one can use
-[`ViTFeatureExtractor`] to resize (or rescale) and normalize images for the model.
+[`ViTImageProcessor`] to resize (or rescale) and normalize images for the model.
 - Both the patch resolution and image resolution used during pre-training or fine-tuning are reflected in the name of
 each checkpoint. For example, `google/vit-base-patch16-224` refers to a base-sized architecture with patch
 resolution of 16x16 and fine-tuning resolution of 224x224. All checkpoints can be found on the [hub](https://huggingface.co/models?search=vit).
@@ -67,7 +67,7 @@ Following the original Vision Transformer, some follow-up works have been made:
 The authors of DeiT also released more efficiently trained ViT models, which you can directly plug into [`ViTModel`] or
 [`ViTForImageClassification`]. There are 4 variants available (in 3 different sizes): *facebook/deit-tiny-patch16-224*,
 *facebook/deit-small-patch16-224*, *facebook/deit-base-patch16-224* and *facebook/deit-base-patch16-384*. Note that one should
-use [`DeiTFeatureExtractor`] in order to prepare images for the model.
+use [`DeiTImageProcessor`] in order to prepare images for the model.

 - [BEiT](beit) (BERT pre-training of Image Transformers) by Microsoft Research. BEiT models outperform supervised pre-trained
 vision transformers using a self-supervised method inspired by BERT (masked image modeling) and based on a VQ-VAE.
@@ -37,7 +37,7 @@ One can easily tweak it for their own use case.
 - A notebook that illustrates how to visualize reconstructed pixel values with [`ViTMAEForPreTraining`] can be found [here](https://github.com/NielsRogge/Transformers-Tutorials/blob/master/ViTMAE/ViT_MAE_visualization_demo.ipynb).
 - After pre-training, one "throws away" the decoder used to reconstruct pixels, and one uses the encoder for fine-tuning/linear probing. This means that after
 fine-tuning, one can directly plug in the weights into a [`ViTForImageClassification`].
-- One can use [`ViTFeatureExtractor`] to prepare images for the model. See the code examples for more info.
+- One can use [`ViTImageProcessor`] to prepare images for the model. See the code examples for more info.
 - Note that the encoder of MAE is only used to encode the visual patches. The encoded patches are then concatenated with mask tokens, which the decoder (which also
 consists of Transformer blocks) takes as input. Each mask token is a shared, learned vector that indicates the presence of a missing patch to be predicted. Fixed
 sin/cos position embeddings are added both to the input of the encoder and the decoder.
@@ -23,7 +23,7 @@ The abstract from the paper is the following:

 Tips:

-- One can use [`YolosFeatureExtractor`] for preparing images (and optional targets) for the model. Contrary to [DETR](detr), YOLOS doesn't require a `pixel_mask` to be created.
+- One can use [`YolosImageProcessor`] for preparing images (and optional targets) for the model. Contrary to [DETR](detr), YOLOS doesn't require a `pixel_mask` to be created.
 - Demo notebooks (regarding inference and fine-tuning on custom data) can be found [here](https://github.com/NielsRogge/Transformers-Tutorials/tree/master/YOLOS).

 <img src="https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/yolos_architecture.png"