Update doc examples feature extractor -> image processor (#20501)
* Update doc example feature extractor -> image processor * Apply suggestions from code review
This commit is contained in:
@@ -23,6 +23,7 @@ Remember, architecture refers to the skeleton of the model and checkpoints are t
|
||||
In this tutorial, learn to:
|
||||
|
||||
* Load a pretrained tokenizer.
|
||||
* Load a pretrained image processor
|
||||
* Load a pretrained feature extractor.
|
||||
* Load a pretrained processor.
|
||||
* Load a pretrained model.
|
||||
@@ -49,9 +50,20 @@ Then tokenize your input as shown below:
|
||||
'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]}
|
||||
```
|
||||
|
||||
## AutoImageProcessor
|
||||
|
||||
For vision tasks, an image processor processes the image into the correct input format.
|
||||
|
||||
```py
|
||||
>>> from transformers import AutoImageProcessor
|
||||
|
||||
>>> image_processor = AutoImageProcessor.from_pretrained("google/vit-base-patch16-224")
|
||||
```
|
||||
|
||||
|
||||
## AutoFeatureExtractor
|
||||
|
||||
For audio and vision tasks, a feature extractor processes the audio signal or image into the correct input format.
|
||||
For audio tasks, a feature extractor processes the audio signal the correct input format.
|
||||
|
||||
Load a feature extractor with [`AutoFeatureExtractor.from_pretrained`]:
|
||||
|
||||
@@ -65,7 +77,7 @@ Load a feature extractor with [`AutoFeatureExtractor.from_pretrained`]:
|
||||
|
||||
## AutoProcessor
|
||||
|
||||
Multimodal tasks require a processor that combines two types of preprocessing tools. For example, the [LayoutLMV2](model_doc/layoutlmv2) model requires a feature extractor to handle images and a tokenizer to handle text; a processor combines both of them.
|
||||
Multimodal tasks require a processor that combines two types of preprocessing tools. For example, the [LayoutLMV2](model_doc/layoutlmv2) model requires an image processor to handle images and a tokenizer to handle text; a processor combines both of them.
|
||||
|
||||
Load a processor with [`AutoProcessor.from_pretrained`]:
|
||||
|
||||
@@ -103,7 +115,7 @@ TensorFlow and Flax checkpoints are not affected, and can be loaded within PyTor
|
||||
|
||||
</Tip>
|
||||
|
||||
Generally, we recommend using the `AutoTokenizer` class and the `AutoModelFor` class to load pretrained instances of models. This will ensure you load the correct architecture every time. In the next [tutorial](preprocessing), learn how to use your newly loaded tokenizer, feature extractor and processor to preprocess a dataset for fine-tuning.
|
||||
Generally, we recommend using the `AutoTokenizer` class and the `AutoModelFor` class to load pretrained instances of models. This will ensure you load the correct architecture every time. In the next [tutorial](preprocessing), learn how to use your newly loaded tokenizer, image processor, feature extractor and processor to preprocess a dataset for fine-tuning.
|
||||
</pt>
|
||||
<tf>
|
||||
Finally, the `TFAutoModelFor` classes let you load a pretrained model for a given task (see [here](model_doc/auto) for a complete list of available tasks). For example, load a model for sequence classification with [`TFAutoModelForSequenceClassification.from_pretrained`]:
|
||||
@@ -122,6 +134,6 @@ Easily reuse the same checkpoint to load an architecture for a different task:
|
||||
>>> model = TFAutoModelForTokenClassification.from_pretrained("distilbert-base-uncased")
|
||||
```
|
||||
|
||||
Generally, we recommend using the `AutoTokenizer` class and the `TFAutoModelFor` class to load pretrained instances of models. This will ensure you load the correct architecture every time. In the next [tutorial](preprocessing), learn how to use your newly loaded tokenizer, feature extractor and processor to preprocess a dataset for fine-tuning.
|
||||
Generally, we recommend using the `AutoTokenizer` class and the `TFAutoModelFor` class to load pretrained instances of models. This will ensure you load the correct architecture every time. In the next [tutorial](preprocessing), learn how to use your newly loaded tokenizer, image processor, feature extractor and processor to preprocess a dataset for fine-tuning.
|
||||
</tf>
|
||||
</frameworkcontent>
|
||||
|
||||
@@ -17,7 +17,8 @@ An [`AutoClass`](model_doc/auto) automatically infers the model architecture and
|
||||
- Load and customize a model configuration.
|
||||
- Create a model architecture.
|
||||
- Create a slow and fast tokenizer for text.
|
||||
- Create a feature extractor for audio or image tasks.
|
||||
- Create an image processor for vision tasks.
|
||||
- Create a feature extractor for audio tasks.
|
||||
- Create a processor for multimodal tasks.
|
||||
|
||||
## Configuration
|
||||
@@ -244,21 +245,21 @@ By default, [`AutoTokenizer`] will try to load a fast tokenizer. You can disable
|
||||
|
||||
</Tip>
|
||||
|
||||
## Feature Extractor
|
||||
## Image Processor
|
||||
|
||||
A feature extractor processes audio or image inputs. It inherits from the base [`~feature_extraction_utils.FeatureExtractionMixin`] class, and may also inherit from the [`ImageFeatureExtractionMixin`] class for processing image features or the [`SequenceFeatureExtractor`] class for processing audio inputs.
|
||||
An image processor processes vision inputs. It inherits from the base [`~image_processing_utils.ImageProcessingMixin`] class.
|
||||
|
||||
Depending on whether you are working on an audio or vision task, create a feature extractor associated with the model you're using. For example, create a default [`ViTFeatureExtractor`] if you are using [ViT](model_doc/vit) for image classification:
|
||||
To use, create an image processor associated with the model you're using. For example, create a default [`ViTImageProcessor`] if you are using [ViT](model_doc/vit) for image classification:
|
||||
|
||||
```py
|
||||
>>> from transformers import ViTFeatureExtractor
|
||||
>>> from transformers import ViTImageProcessor
|
||||
|
||||
>>> vit_extractor = ViTFeatureExtractor()
|
||||
>>> vit_extractor = ViTImageProcessor()
|
||||
>>> print(vit_extractor)
|
||||
ViTFeatureExtractor {
|
||||
ViTImageProcessor {
|
||||
"do_normalize": true,
|
||||
"do_resize": true,
|
||||
"feature_extractor_type": "ViTFeatureExtractor",
|
||||
"feature_extractor_type": "ViTImageProcessor",
|
||||
"image_mean": [
|
||||
0.5,
|
||||
0.5,
|
||||
@@ -276,21 +277,21 @@ ViTFeatureExtractor {
|
||||
|
||||
<Tip>
|
||||
|
||||
If you aren't looking for any customization, just use the `from_pretrained` method to load a model's default feature extractor parameters.
|
||||
If you aren't looking for any customization, just use the `from_pretrained` method to load a model's default image processor parameters.
|
||||
|
||||
</Tip>
|
||||
|
||||
Modify any of the [`ViTFeatureExtractor`] parameters to create your custom feature extractor:
|
||||
Modify any of the [`ViTImageProcessor`] parameters to create your custom image processor:
|
||||
|
||||
```py
|
||||
>>> from transformers import ViTFeatureExtractor
|
||||
>>> from transformers import ViTImageProcessor
|
||||
|
||||
>>> my_vit_extractor = ViTFeatureExtractor(resample="PIL.Image.BOX", do_normalize=False, image_mean=[0.3, 0.3, 0.3])
|
||||
>>> my_vit_extractor = ViTImageProcessor(resample="PIL.Image.BOX", do_normalize=False, image_mean=[0.3, 0.3, 0.3])
|
||||
>>> print(my_vit_extractor)
|
||||
ViTFeatureExtractor {
|
||||
ViTImageProcessor {
|
||||
"do_normalize": false,
|
||||
"do_resize": true,
|
||||
"feature_extractor_type": "ViTFeatureExtractor",
|
||||
"feature_extractor_type": "ViTImageProcessor",
|
||||
"image_mean": [
|
||||
0.3,
|
||||
0.3,
|
||||
@@ -306,7 +307,11 @@ ViTFeatureExtractor {
|
||||
}
|
||||
```
|
||||
|
||||
For audio inputs, you can create a [`Wav2Vec2FeatureExtractor`] and customize the parameters in a similar way:
|
||||
## Feature Extractor
|
||||
|
||||
A feature extractor processes audio inputs. It inherits from the base [`~feature_extraction_utils.FeatureExtractionMixin`] class, and may also inherit from the [`SequenceFeatureExtractor`] class for processing audio inputs.
|
||||
|
||||
To use, create a feature extractor associated with the model you're using. For example, create a default [`Wav2Vec2FeatureExtractor`] if you are using [Wav2Vec2](model_doc/wav2vec2) for audio classification:
|
||||
|
||||
```py
|
||||
>>> from transformers import Wav2Vec2FeatureExtractor
|
||||
@@ -324,9 +329,34 @@ Wav2Vec2FeatureExtractor {
|
||||
}
|
||||
```
|
||||
|
||||
<Tip>
|
||||
|
||||
If you aren't looking for any customization, just use the `from_pretrained` method to load a model's default feature extractor parameters.
|
||||
|
||||
</Tip>
|
||||
|
||||
Modify any of the [`Wav2Vec2FeatureExtractor`] parameters to create your custom feature extractor:
|
||||
|
||||
```py
|
||||
>>> from transformers import Wav2Vec2FeatureExtractor
|
||||
|
||||
>>> w2v2_extractor = Wav2Vec2FeatureExtractor(sampling_rate=8000, do_normalize=False)
|
||||
>>> print(w2v2_extractor)
|
||||
Wav2Vec2FeatureExtractor {
|
||||
"do_normalize": false,
|
||||
"feature_extractor_type": "Wav2Vec2FeatureExtractor",
|
||||
"feature_size": 1,
|
||||
"padding_side": "right",
|
||||
"padding_value": 0.0,
|
||||
"return_attention_mask": false,
|
||||
"sampling_rate": 8000
|
||||
}
|
||||
```
|
||||
|
||||
|
||||
## Processor
|
||||
|
||||
For models that support multimodal tasks, 🤗 Transformers offers a processor class that conveniently wraps a feature extractor and tokenizer into a single object. For example, let's use the [`Wav2Vec2Processor`] for an automatic speech recognition task (ASR). ASR transcribes audio to text, so you will need a feature extractor and a tokenizer.
|
||||
For models that support multimodal tasks, 🤗 Transformers offers a processor class that conveniently wraps processing classes such as a feature extractor and a tokenizer into a single object. For example, let's use the [`Wav2Vec2Processor`] for an automatic speech recognition task (ASR). ASR transcribes audio to text, so you will need a feature extractor and a tokenizer.
|
||||
|
||||
Create a feature extractor to handle the audio inputs:
|
||||
|
||||
@@ -352,4 +382,4 @@ Combine the feature extractor and tokenizer in [`Wav2Vec2Processor`]:
|
||||
>>> processor = Wav2Vec2Processor(feature_extractor=feature_extractor, tokenizer=tokenizer)
|
||||
```
|
||||
|
||||
With two basic classes - configuration and model - and an additional preprocessing class (tokenizer, feature extractor, or processor), you can create any of the models supported by 🤗 Transformers. Each of these base classes are configurable, allowing you to use the specific attributes you want. You can easily setup a model for training or modify an existing pretrained model to fine-tune.
|
||||
With two basic classes - configuration and model - and an additional preprocessing class (tokenizer, image processor, feature extractor, or processor), you can create any of the models supported by 🤗 Transformers. Each of these base classes are configurable, allowing you to use the specific attributes you want. You can easily setup a model for training or modify an existing pretrained model to fine-tune.
|
||||
|
||||
@@ -299,7 +299,7 @@ whole text, individual words).
|
||||
|
||||
### pixel values
|
||||
|
||||
A tensor of the numerical representations of an image that is passed to a model. The pixel values have a shape of [`batch_size`, `num_channels`, `height`, `width`], and are generated from a feature extractor.
|
||||
A tensor of the numerical representations of an image that is passed to a model. The pixel values have a shape of [`batch_size`, `num_channels`, `height`, `width`], and are generated from an image processor.
|
||||
|
||||
### pooling
|
||||
|
||||
|
||||
@@ -20,8 +20,8 @@ Processors can mean two different things in the Transformers library:
|
||||
## Multi-modal processors
|
||||
|
||||
Any multi-modal model will require an object to encode or decode the data that groups several modalities (among text,
|
||||
vision and audio). This is handled by objects called processors, which group tokenizers (for the text modality) and
|
||||
feature extractors (for vision and audio).
|
||||
vision and audio). This is handled by objects called processors, which group together two or more processing objects
|
||||
such as tokenizers (for the text modality), image processors (for vision) and feature extractors (for audio).
|
||||
|
||||
Those processors inherit from the following base class that implements the saving and loading functionality:
|
||||
|
||||
|
||||
@@ -40,12 +40,12 @@ Tips:
|
||||
- BEiT models are regular Vision Transformers, but pre-trained in a self-supervised way rather than supervised. They
|
||||
outperform both the [original model (ViT)](vit) as well as [Data-efficient Image Transformers (DeiT)](deit) when fine-tuned on ImageNet-1K and CIFAR-100. You can check out demo notebooks regarding inference as well as
|
||||
fine-tuning on custom data [here](https://github.com/NielsRogge/Transformers-Tutorials/tree/master/VisionTransformer) (you can just replace
|
||||
[`ViTFeatureExtractor`] by [`BeitFeatureExtractor`] and
|
||||
[`ViTFeatureExtractor`] by [`BeitImageProcessor`] and
|
||||
[`ViTForImageClassification`] by [`BeitForImageClassification`]).
|
||||
- There's also a demo notebook available which showcases how to combine DALL-E's image tokenizer with BEiT for
|
||||
performing masked image modeling. You can find it [here](https://github.com/NielsRogge/Transformers-Tutorials/tree/master/BEiT).
|
||||
- As the BEiT models expect each image to be of the same size (resolution), one can use
|
||||
[`BeitFeatureExtractor`] to resize (or rescale) and normalize images for the model.
|
||||
[`BeitImageProcessor`] to resize (or rescale) and normalize images for the model.
|
||||
- Both the patch resolution and image resolution used during pre-training or fine-tuning are reflected in the name of
|
||||
each checkpoint. For example, `microsoft/beit-base-patch16-224` refers to a base-sized architecture with patch
|
||||
resolution of 16x16 and fine-tuning resolution of 224x224. All checkpoints can be found on the [hub](https://huggingface.co/models?search=microsoft/beit).
|
||||
|
||||
@@ -32,7 +32,7 @@ a crucial component in existing Vision Transformers, can be safely removed in ou
|
||||
Tips:
|
||||
|
||||
- CvT models are regular Vision Transformers, but trained with convolutions. They outperform the [original model (ViT)](vit) when fine-tuned on ImageNet-1K and CIFAR-100.
|
||||
- You can check out demo notebooks regarding inference as well as fine-tuning on custom data [here](https://github.com/NielsRogge/Transformers-Tutorials/tree/master/VisionTransformer) (you can just replace [`ViTFeatureExtractor`] by [`AutoFeatureExtractor`] and [`ViTForImageClassification`] by [`CvtForImageClassification`]).
|
||||
- You can check out demo notebooks regarding inference as well as fine-tuning on custom data [here](https://github.com/NielsRogge/Transformers-Tutorials/tree/master/VisionTransformer) (you can just replace [`ViTFeatureExtractor`] by [`AutoImageProcessor`] and [`ViTForImageClassification`] by [`CvtForImageClassification`]).
|
||||
- The available checkpoints are either (1) pre-trained on [ImageNet-22k](http://www.image-net.org/) (a collection of 14 million images and 22k classes) only, (2) also fine-tuned on ImageNet-22k or (3) also fine-tuned on [ImageNet-1k](http://www.image-net.org/challenges/LSVRC/2012/) (also referred to as ILSVRC 2012, a collection of 1.3 million
|
||||
images and 1,000 classes).
|
||||
|
||||
|
||||
@@ -23,7 +23,7 @@ The abstract from the paper is the following:
|
||||
|
||||
Tips:
|
||||
|
||||
- One can use [`DeformableDetrFeatureExtractor`] to prepare images (and optional targets) for the model.
|
||||
- One can use [`DeformableDetrImageProcessor`] to prepare images (and optional targets) for the model.
|
||||
- Training Deformable DETR is equivalent to training the original [DETR](detr) model. Demo notebooks can be found [here](https://github.com/NielsRogge/Transformers-Tutorials/tree/master/DETR).
|
||||
|
||||
<img src="https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/deformable_detr_architecture.png"
|
||||
|
||||
@@ -66,7 +66,7 @@ Tips:
|
||||
augmentation, optimization, and regularization were used in order to simulate training on a much larger dataset
|
||||
(while only using ImageNet-1k for pre-training). There are 4 variants available (in 3 different sizes):
|
||||
*facebook/deit-tiny-patch16-224*, *facebook/deit-small-patch16-224*, *facebook/deit-base-patch16-224* and
|
||||
*facebook/deit-base-patch16-384*. Note that one should use [`DeiTFeatureExtractor`] in order to
|
||||
*facebook/deit-base-patch16-384*. Note that one should use [`DeiTImageProcessor`] in order to
|
||||
prepare images for the model.
|
||||
|
||||
This model was contributed by [nielsr](https://huggingface.co/nielsr). The TensorFlow version of this model was added by [amyeroberts](https://huggingface.co/amyeroberts).
|
||||
|
||||
@@ -105,11 +105,11 @@ Tips:
|
||||
- DETR resizes the input images such that the shortest side is at least a certain amount of pixels while the longest is
|
||||
at most 1333 pixels. At training time, scale augmentation is used such that the shortest side is randomly set to at
|
||||
least 480 and at most 800 pixels. At inference time, the shortest side is set to 800. One can use
|
||||
[`~transformers.DetrFeatureExtractor`] to prepare images (and optional annotations in COCO format) for the
|
||||
[`~transformers.DetrImageProcessor`] to prepare images (and optional annotations in COCO format) for the
|
||||
model. Due to this resizing, images in a batch can have different sizes. DETR solves this by padding images up to the
|
||||
largest size in a batch, and by creating a pixel mask that indicates which pixels are real/which are padding.
|
||||
Alternatively, one can also define a custom `collate_fn` in order to batch images together, using
|
||||
[`~transformers.DetrFeatureExtractor.pad_and_create_pixel_mask`].
|
||||
[`~transformers.DetrImageProcessor.pad_and_create_pixel_mask`].
|
||||
- The size of the images will determine the amount of memory being used, and will thus determine the `batch_size`.
|
||||
It is advised to use a batch size of 2 per GPU. See [this Github thread](https://github.com/facebookresearch/detr/issues/150) for more info.
|
||||
|
||||
@@ -142,14 +142,14 @@ As a summary, consider the following table:
|
||||
| **Description** | Predicting bounding boxes and class labels around objects in an image | Predicting masks around objects (i.e. instances) in an image | Predicting masks around both objects (i.e. instances) as well as "stuff" (i.e. background things like trees and roads) in an image |
|
||||
| **Model** | [`~transformers.DetrForObjectDetection`] | [`~transformers.DetrForSegmentation`] | [`~transformers.DetrForSegmentation`] |
|
||||
| **Example dataset** | COCO detection | COCO detection, COCO panoptic | COCO panoptic | |
|
||||
| **Format of annotations to provide to** [`~transformers.DetrFeatureExtractor`] | {'image_id': `int`, 'annotations': `List[Dict]`} each Dict being a COCO object annotation | {'image_id': `int`, 'annotations': `List[Dict]`} (in case of COCO detection) or {'file_name': `str`, 'image_id': `int`, 'segments_info': `List[Dict]`} (in case of COCO panoptic) | {'file_name': `str`, 'image_id': `int`, 'segments_info': `List[Dict]`} and masks_path (path to directory containing PNG files of the masks) |
|
||||
| **Postprocessing** (i.e. converting the output of the model to COCO API) | [`~transformers.DetrFeatureExtractor.post_process`] | [`~transformers.DetrFeatureExtractor.post_process_segmentation`] | [`~transformers.DetrFeatureExtractor.post_process_segmentation`], [`~transformers.DetrFeatureExtractor.post_process_panoptic`] |
|
||||
| **Format of annotations to provide to** [`~transformers.DetrImageProcessor`] | {'image_id': `int`, 'annotations': `List[Dict]`} each Dict being a COCO object annotation | {'image_id': `int`, 'annotations': `List[Dict]`} (in case of COCO detection) or {'file_name': `str`, 'image_id': `int`, 'segments_info': `List[Dict]`} (in case of COCO panoptic) | {'file_name': `str`, 'image_id': `int`, 'segments_info': `List[Dict]`} and masks_path (path to directory containing PNG files of the masks) |
|
||||
| **Postprocessing** (i.e. converting the output of the model to COCO API) | [`~transformers.DetrImageProcessor.post_process`] | [`~transformers.DetrImageProcessor.post_process_segmentation`] | [`~transformers.DetrImageProcessor.post_process_segmentation`], [`~transformers.DetrImageProcessor.post_process_panoptic`] |
|
||||
| **evaluators** | `CocoEvaluator` with `iou_types="bbox"` | `CocoEvaluator` with `iou_types="bbox"` or `"segm"` | `CocoEvaluator` with `iou_tupes="bbox"` or `"segm"`, `PanopticEvaluator` |
|
||||
|
||||
In short, one should prepare the data either in COCO detection or COCO panoptic format, then use
|
||||
[`~transformers.DetrFeatureExtractor`] to create `pixel_values`, `pixel_mask` and optional
|
||||
[`~transformers.DetrImageProcessor`] to create `pixel_values`, `pixel_mask` and optional
|
||||
`labels`, which can then be used to train (or fine-tune) a model. For evaluation, one should first convert the
|
||||
outputs of the model using one of the postprocessing methods of [`~transformers.DetrFeatureExtractor`]. These can
|
||||
outputs of the model using one of the postprocessing methods of [`~transformers.DetrImageProcessor`]. These can
|
||||
be be provided to either `CocoEvaluator` or `PanopticEvaluator`, which allow you to calculate metrics like
|
||||
mean Average Precision (mAP) and Panoptic Quality (PQ). The latter objects are implemented in the [original repository](https://github.com/facebookresearch/detr). See the [example notebooks](https://github.com/NielsRogge/Transformers-Tutorials/tree/master/DETR) for more info regarding evaluation.
|
||||
|
||||
|
||||
@@ -32,7 +32,7 @@ The abstract from the paper is the following:
|
||||
Tips:
|
||||
|
||||
- A notebook illustrating inference with [`GLPNForDepthEstimation`] can be found [here](https://github.com/NielsRogge/Transformers-Tutorials/blob/master/GLPN/GLPN_inference_(depth_estimation).ipynb).
|
||||
- One can use [`GLPNFeatureExtractor`] to prepare images for the model.
|
||||
- One can use [`GLPNImageProcessor`] to prepare images for the model.
|
||||
|
||||
<img src="https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/glpn_architecture.jpg"
|
||||
alt="drawing" width="600"/>
|
||||
|
||||
@@ -49,7 +49,7 @@ Tips:
|
||||
applied k-means clustering to the (R,G,B) pixel values with k=512. This way, we only have a 32*32 = 1024-long
|
||||
sequence, but now of integers in the range 0..511. So we are shrinking the sequence length at the cost of a bigger
|
||||
embedding matrix. In other words, the vocabulary size of ImageGPT is 512, + 1 for a special "start of sentence" (SOS)
|
||||
token, used at the beginning of every sequence. One can use [`ImageGPTFeatureExtractor`] to prepare
|
||||
token, used at the beginning of every sequence. One can use [`ImageGPTImageProcessor`] to prepare
|
||||
images for the model.
|
||||
- Despite being pre-trained entirely unsupervised (i.e. without the use of any labels), ImageGPT produces fairly
|
||||
performant image features useful for downstream tasks, such as image classification. The authors showed that the
|
||||
|
||||
@@ -53,11 +53,11 @@ Tips:
|
||||
Techniques like data augmentation, optimization, and regularization were used in order to simulate training on a much larger dataset
|
||||
(while only using ImageNet-1k for pre-training). The 5 variants available are (all trained on images of size 224x224):
|
||||
*facebook/levit-128S*, *facebook/levit-128*, *facebook/levit-192*, *facebook/levit-256* and
|
||||
*facebook/levit-384*. Note that one should use [`LevitFeatureExtractor`] in order to
|
||||
*facebook/levit-384*. Note that one should use [`LevitImageProcessor`] in order to
|
||||
prepare images for the model.
|
||||
- [`LevitForImageClassificationWithTeacher`] currently supports only inference and not training or fine-tuning.
|
||||
- You can check out demo notebooks regarding inference as well as fine-tuning on custom data [here](https://github.com/NielsRogge/Transformers-Tutorials/tree/master/VisionTransformer)
|
||||
(you can just replace [`ViTFeatureExtractor`] by [`LevitFeatureExtractor`] and [`ViTForImageClassification`] by [`LevitForImageClassification`] or [`LevitForImageClassificationWithTeacher`]).
|
||||
(you can just replace [`ViTFeatureExtractor`] by [`LevitImageProcessor`] and [`ViTForImageClassification`] by [`LevitForImageClassification`] or [`LevitForImageClassificationWithTeacher`]).
|
||||
|
||||
This model was contributed by [anugunj](https://huggingface.co/anugunj). The original code can be found [here](https://github.com/facebookresearch/LeViT).
|
||||
|
||||
|
||||
@@ -32,8 +32,8 @@ Tips:
|
||||
- If you want to train the model in a distributed environment across multiple nodes, then one should update the
|
||||
`get_num_masks` function inside in the `MaskFormerLoss` class of `modeling_maskformer.py`. When training on multiple nodes, this should be
|
||||
set to the average number of target masks across all nodes, as can be seen in the original implementation [here](https://github.com/facebookresearch/MaskFormer/blob/da3e60d85fdeedcb31476b5edd7d328826ce56cc/mask_former/modeling/criterion.py#L169).
|
||||
- One can use [`MaskFormerFeatureExtractor`] to prepare images for the model and optional targets for the model.
|
||||
- To get the final segmentation, depending on the task, you can call [`~MaskFormerFeatureExtractor.post_process_semantic_segmentation`] or [`~MaskFormerFeatureExtractor.post_process_panoptic_segmentation`]. Both tasks can be solved using [`MaskFormerForInstanceSegmentation`] output, panoptic segmentation accepts an optional `label_ids_to_fuse` argument to fuse instances of the target object/s (e.g. sky) together.
|
||||
- One can use [`MaskFormerImageProcessor`] to prepare images for the model and optional targets for the model.
|
||||
- To get the final segmentation, depending on the task, you can call [`~MaskFormerImageProcessor.post_process_semantic_segmentation`] or [`~MaskFormerImageProcessor.post_process_panoptic_segmentation`]. Both tasks can be solved using [`MaskFormerForInstanceSegmentation`] output, panoptic segmentation accepts an optional `label_ids_to_fuse` argument to fuse instances of the target object/s (e.g. sky) together.
|
||||
|
||||
The figure below illustrates the architecture of MaskFormer. Taken from the [original paper](https://arxiv.org/abs/2107.06278).
|
||||
|
||||
|
||||
@@ -26,7 +26,7 @@ Tips:
|
||||
|
||||
- Even though the checkpoint is trained on images of specific size, the model will work on images of any size. The smallest supported image size is 32x32.
|
||||
|
||||
- One can use [`MobileNetV1FeatureExtractor`] to prepare images for the model.
|
||||
- One can use [`MobileNetV1ImageProcessor`] to prepare images for the model.
|
||||
|
||||
- The available image classification checkpoints are pre-trained on [ImageNet-1k](https://huggingface.co/datasets/imagenet-1k) (also referred to as ILSVRC 2012, a collection of 1.3 million images and 1,000 classes). However, the model predicts 1001 classes: the 1000 classes from ImageNet plus an extra “background” class (index 0).
|
||||
|
||||
|
||||
@@ -28,7 +28,7 @@ Tips:
|
||||
|
||||
- Even though the checkpoint is trained on images of specific size, the model will work on images of any size. The smallest supported image size is 32x32.
|
||||
|
||||
- One can use [`MobileNetV2FeatureExtractor`] to prepare images for the model.
|
||||
- One can use [`MobileNetV2ImageProcessor`] to prepare images for the model.
|
||||
|
||||
- The available image classification checkpoints are pre-trained on [ImageNet-1k](https://huggingface.co/datasets/imagenet-1k) (also referred to as ILSVRC 2012, a collection of 1.3 million images and 1,000 classes). However, the model predicts 1001 classes: the 1000 classes from ImageNet plus an extra “background” class (index 0).
|
||||
|
||||
|
||||
@@ -23,7 +23,7 @@ The abstract from the paper is the following:
|
||||
Tips:
|
||||
|
||||
- MobileViT is more like a CNN than a Transformer model. It does not work on sequence data but on batches of images. Unlike ViT, there are no embeddings. The backbone model outputs a feature map. You can follow [this tutorial](https://keras.io/examples/vision/mobilevit) for a lightweight introduction.
|
||||
- One can use [`MobileViTFeatureExtractor`] to prepare images for the model. Note that if you do your own preprocessing, the pretrained checkpoints expect images to be in BGR pixel order (not RGB).
|
||||
- One can use [`MobileViTImageProcessor`] to prepare images for the model. Note that if you do your own preprocessing, the pretrained checkpoints expect images to be in BGR pixel order (not RGB).
|
||||
- The available image classification checkpoints are pre-trained on [ImageNet-1k](https://huggingface.co/datasets/imagenet-1k) (also referred to as ILSVRC 2012, a collection of 1.3 million images and 1,000 classes).
|
||||
- The segmentation model uses a [DeepLabV3](https://arxiv.org/abs/1706.05587) head. The available semantic segmentation checkpoints are pre-trained on [PASCAL VOC](http://host.robots.ox.ac.uk/pascal/VOC/).
|
||||
- As the name suggests MobileViT was designed to be performant and efficient on mobile phones. The TensorFlow versions of the MobileViT models are fully compatible with [TensorFlow Lite](https://www.tensorflow.org/lite).
|
||||
|
||||
@@ -28,7 +28,7 @@ The figure below illustrates the architecture of PoolFormer. Taken from the [ori
|
||||
Tips:
|
||||
|
||||
- PoolFormer has a hierarchical architecture, where instead of Attention, a simple Average Pooling layer is present. All checkpoints of the model can be found on the [hub](https://huggingface.co/models?other=poolformer).
|
||||
- One can use [`PoolFormerFeatureExtractor`] to prepare images for the model.
|
||||
- One can use [`PoolFormerImageProcessor`] to prepare images for the model.
|
||||
- As most models, PoolFormer comes in different sizes, the details of which can be found in the table below.
|
||||
|
||||
| **Model variant** | **Depths** | **Hidden sizes** | **Params (M)** | **ImageNet-1k Top 1** |
|
||||
|
||||
@@ -24,7 +24,7 @@ The abstract from the paper is the following:
|
||||
|
||||
Tips:
|
||||
|
||||
- One can use [`AutoFeatureExtractor`] to prepare images for the model.
|
||||
- One can use [`AutoImageProcessor`] to prepare images for the model.
|
||||
- The huge 10B model from [Self-supervised Pretraining of Visual Features in the Wild](https://arxiv.org/abs/2103.01988), trained on one billion Instagram images, is available on the [hub](https://huggingface.co/facebook/regnet-y-10b-seer)
|
||||
|
||||
This model was contributed by [Francesco](https://huggingface.co/Francesco). The TensorFlow version of the model
|
||||
|
||||
@@ -25,7 +25,7 @@ The depth of representations is of central importance for many visual recognitio
|
||||
|
||||
Tips:
|
||||
|
||||
- One can use [`AutoFeatureExtractor`] to prepare images for the model.
|
||||
- One can use [`AutoImageProcessor`] to prepare images for the model.
|
||||
|
||||
The figure below illustrates the architecture of ResNet. Taken from the [original paper](https://arxiv.org/abs/1512.03385).
|
||||
|
||||
|
||||
@@ -56,12 +56,12 @@ Tips:
|
||||
- One can also check out [this interactive demo on Hugging Face Spaces](https://huggingface.co/spaces/chansung/segformer-tf-transformers)
|
||||
to try out a SegFormer model on custom images.
|
||||
- SegFormer works on any input size, as it pads the input to be divisible by `config.patch_sizes`.
|
||||
- One can use [`SegformerFeatureExtractor`] to prepare images and corresponding segmentation maps
|
||||
for the model. Note that this feature extractor is fairly basic and does not include all data augmentations used in
|
||||
- One can use [`SegformerImageProcessor`] to prepare images and corresponding segmentation maps
|
||||
for the model. Note that this image processor is fairly basic and does not include all data augmentations used in
|
||||
the original paper. The original preprocessing pipelines (for the ADE20k dataset for instance) can be found [here](https://github.com/NVlabs/SegFormer/blob/master/local_configs/_base_/datasets/ade20k_repeat.py). The most
|
||||
important preprocessing step is that images and segmentation maps are randomly cropped and padded to the same size,
|
||||
such as 512x512 or 640x640, after which they are normalized.
|
||||
- One additional thing to keep in mind is that one can initialize [`SegformerFeatureExtractor`] with
|
||||
- One additional thing to keep in mind is that one can initialize [`SegformerImageProcessor`] with
|
||||
`reduce_labels` set to `True` or `False`. In some datasets (like ADE20k), the 0 index is used in the annotated
|
||||
segmentation maps for background. However, ADE20k doesn't include the "background" class in its 150 labels.
|
||||
Therefore, `reduce_labels` is used to reduce all labels by 1, and to make sure no loss is computed for the
|
||||
|
||||
@@ -33,7 +33,7 @@ prediction tasks such as object detection (58.7 box AP and 51.1 mask AP on COCO
|
||||
The hierarchical design and the shifted window approach also prove beneficial for all-MLP architectures.*
|
||||
|
||||
Tips:
|
||||
- One can use the [`AutoFeatureExtractor`] API to prepare images for the model.
|
||||
- One can use the [`AutoImageProcessor`] API to prepare images for the model.
|
||||
- Swin pads the inputs supporting any input height and width (if divisible by `32`).
|
||||
- Swin can be used as a *backbone*. When `output_hidden_states = True`, it will output both `hidden_states` and `reshaped_hidden_states`. The `reshaped_hidden_states` have a shape of `(batch, num_channels, height, width)` rather than `(batch_size, sequence_length, num_channels)`.
|
||||
|
||||
|
||||
@@ -21,7 +21,7 @@ The abstract from the paper is the following:
|
||||
*Large-scale NLP models have been shown to significantly improve the performance on language tasks with no signs of saturation. They also demonstrate amazing few-shot capabilities like that of human beings. This paper aims to explore large-scale models in computer vision. We tackle three major issues in training and application of large vision models, including training instability, resolution gaps between pre-training and fine-tuning, and hunger on labelled data. Three main techniques are proposed: 1) a residual-post-norm method combined with cosine attention to improve training stability; 2) A log-spaced continuous position bias method to effectively transfer models pre-trained using low-resolution images to downstream tasks with high-resolution inputs; 3) A self-supervised pre-training method, SimMIM, to reduce the needs of vast labeled images. Through these techniques, this paper successfully trained a 3 billion-parameter Swin Transformer V2 model, which is the largest dense vision model to date, and makes it capable of training with images of up to 1,536×1,536 resolution. It set new performance records on 4 representative vision tasks, including ImageNet-V2 image classification, COCO object detection, ADE20K semantic segmentation, and Kinetics-400 video action classification. Also note our training is much more efficient than that in Google's billion-level visual models, which consumes 40 times less labelled data and 40 times less training time.*
|
||||
|
||||
Tips:
|
||||
- One can use the [`AutoFeatureExtractor`] API to prepare images for the model.
|
||||
- One can use the [`AutoImageProcessor`] API to prepare images for the model.
|
||||
|
||||
This model was contributed by [nandwalritik](https://huggingface.co/nandwalritik).
|
||||
The original code can be found [here](https://github.com/microsoft/Swin-Transformer).
|
||||
|
||||
@@ -32,7 +32,7 @@ special customization for these tasks.*
|
||||
Tips:
|
||||
|
||||
- The authors released 2 models, one for [table detection](https://huggingface.co/microsoft/table-transformer-detection) in documents, one for [table structure recognition](https://huggingface.co/microsoft/table-transformer-structure-recognition) (the task of recognizing the individual rows, columns etc. in a table).
|
||||
- One can use the [`AutoFeatureExtractor`] API to prepare images and optional targets for the model. This will load a [`DetrFeatureExtractor`] behind the scenes.
|
||||
- One can use the [`AutoImageProcessor`] API to prepare images and optional targets for the model. This will load a [`DetrImageProcessor`] behind the scenes.
|
||||
|
||||
<img src="https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/transformers/model_doc/table_transformer_architecture.jpeg"
|
||||
alt="drawing" width="600"/>
|
||||
|
||||
@@ -55,9 +55,9 @@ Tips:
|
||||
TrOCR's [`VisionEncoderDecoder`] model accepts images as input and makes use of
|
||||
[`~generation.GenerationMixin.generate`] to autoregressively generate text given the input image.
|
||||
|
||||
The [`ViTFeatureExtractor`/`DeiTFeatureExtractor`] class is responsible for preprocessing the input image and
|
||||
The [`ViTImageProcessor`/`DeiTImageProcessor`] class is responsible for preprocessing the input image and
|
||||
[`RobertaTokenizer`/`XLMRobertaTokenizer`] decodes the generated target tokens to the target string. The
|
||||
[`TrOCRProcessor`] wraps [`ViTFeatureExtractor`/`DeiTFeatureExtractor`] and [`RobertaTokenizer`/`XLMRobertaTokenizer`]
|
||||
[`TrOCRProcessor`] wraps [`ViTImageProcessor`/`DeiTImageProcessor`] and [`RobertaTokenizer`/`XLMRobertaTokenizer`]
|
||||
into a single instance to both extract the input features and decode the predicted token ids.
|
||||
|
||||
- Step-by-step Optical Character Recognition (OCR)
|
||||
|
||||
@@ -23,7 +23,7 @@ The abstract from the paper is the following:
|
||||
|
||||
Tips:
|
||||
|
||||
- One can use [`VideoMAEFeatureExtractor`] to prepare videos for the model. It will resize + normalize all frames of a video for you.
|
||||
- One can use [`VideoMAEImageProcessor`] to prepare videos for the model. It will resize + normalize all frames of a video for you.
|
||||
- [`VideoMAEForPreTraining`] includes the decoder on top for self-supervised pre-training.
|
||||
|
||||
<img src="https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/transformers/model_doc/videomae_architecture.jpeg"
|
||||
|
||||
@@ -68,17 +68,17 @@ To perform inference, one uses the [`generate`] method, which allows to autoregr
|
||||
>>> import requests
|
||||
>>> from PIL import Image
|
||||
|
||||
>>> from transformers import GPT2TokenizerFast, ViTFeatureExtractor, VisionEncoderDecoderModel
|
||||
>>> from transformers import GPT2TokenizerFast, ViTImageProcessor, VisionEncoderDecoderModel
|
||||
|
||||
>>> # load a fine-tuned image captioning model and corresponding tokenizer and feature extractor
|
||||
>>> # load a fine-tuned image captioning model and corresponding tokenizer and image processor
|
||||
>>> model = VisionEncoderDecoderModel.from_pretrained("nlpconnect/vit-gpt2-image-captioning")
|
||||
>>> tokenizer = GPT2TokenizerFast.from_pretrained("nlpconnect/vit-gpt2-image-captioning")
|
||||
>>> feature_extractor = ViTFeatureExtractor.from_pretrained("nlpconnect/vit-gpt2-image-captioning")
|
||||
>>> image_processor = ViTImageProcessor.from_pretrained("nlpconnect/vit-gpt2-image-captioning")
|
||||
|
||||
>>> # let's perform inference on an image
|
||||
>>> url = "http://images.cocodataset.org/val2017/000000039769.jpg"
|
||||
>>> image = Image.open(requests.get(url, stream=True).raw)
|
||||
>>> pixel_values = feature_extractor(image, return_tensors="pt").pixel_values
|
||||
>>> pixel_values = image_processor(image, return_tensors="pt").pixel_values
|
||||
|
||||
>>> # autoregressively generate caption (uses greedy decoding by default)
|
||||
>>> generated_ids = model.generate(pixel_values)
|
||||
@@ -115,10 +115,10 @@ As you can see, only 2 inputs are required for the model in order to compute a l
|
||||
images) and `labels` (which are the `input_ids` of the encoded target sequence).
|
||||
|
||||
```python
|
||||
>>> from transformers import ViTFeatureExtractor, BertTokenizer, VisionEncoderDecoderModel
|
||||
>>> from transformers import ViTImageProcessor, BertTokenizer, VisionEncoderDecoderModel
|
||||
>>> from datasets import load_dataset
|
||||
|
||||
>>> feature_extractor = ViTFeatureExtractor.from_pretrained("google/vit-base-patch16-224-in21k")
|
||||
>>> image_processor = ViTImageProcessor.from_pretrained("google/vit-base-patch16-224-in21k")
|
||||
>>> tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
|
||||
>>> model = VisionEncoderDecoderModel.from_encoder_decoder_pretrained(
|
||||
... "google/vit-base-patch16-224-in21k", "bert-base-uncased"
|
||||
@@ -129,7 +129,7 @@ images) and `labels` (which are the `input_ids` of the encoded target sequence).
|
||||
|
||||
>>> dataset = load_dataset("huggingface/cats-image")
|
||||
>>> image = dataset["test"]["image"][0]
|
||||
>>> pixel_values = feature_extractor(image, return_tensors="pt").pixel_values
|
||||
>>> pixel_values = image_processor(image, return_tensors="pt").pixel_values
|
||||
|
||||
>>> labels = tokenizer(
|
||||
... "an image of two cats chilling on a couch",
|
||||
|
||||
@@ -53,7 +53,7 @@ vectors to a standard BERT model. The text input is concatenated in the front of
|
||||
layer, and is expected to be bound by [CLS] and a [SEP] tokens, as in BERT. The segment IDs must also be set
|
||||
appropriately for the textual and visual parts.
|
||||
|
||||
The [`BertTokenizer`] is used to encode the text. A custom detector/feature extractor must be used
|
||||
The [`BertTokenizer`] is used to encode the text. A custom detector/image processor must be used
|
||||
to get the visual embeddings. The following example notebooks show how to use VisualBERT with Detectron-like models:
|
||||
|
||||
- [VisualBERT VQA demo notebook](https://github.com/huggingface/transformers/tree/main/examples/research_projects/visual_bert) : This notebook
|
||||
|
||||
@@ -40,7 +40,7 @@ Tips:
|
||||
used for classification. The authors also add absolute position embeddings, and feed the resulting sequence of
|
||||
vectors to a standard Transformer encoder.
|
||||
- As the Vision Transformer expects each image to be of the same size (resolution), one can use
|
||||
[`ViTFeatureExtractor`] to resize (or rescale) and normalize images for the model.
|
||||
[`ViTImageProcessor`] to resize (or rescale) and normalize images for the model.
|
||||
- Both the patch resolution and image resolution used during pre-training or fine-tuning are reflected in the name of
|
||||
each checkpoint. For example, `google/vit-base-patch16-224` refers to a base-sized architecture with patch
|
||||
resolution of 16x16 and fine-tuning resolution of 224x224. All checkpoints can be found on the [hub](https://huggingface.co/models?search=vit).
|
||||
@@ -67,7 +67,7 @@ Following the original Vision Transformer, some follow-up works have been made:
|
||||
The authors of DeiT also released more efficiently trained ViT models, which you can directly plug into [`ViTModel`] or
|
||||
[`ViTForImageClassification`]. There are 4 variants available (in 3 different sizes): *facebook/deit-tiny-patch16-224*,
|
||||
*facebook/deit-small-patch16-224*, *facebook/deit-base-patch16-224* and *facebook/deit-base-patch16-384*. Note that one should
|
||||
use [`DeiTFeatureExtractor`] in order to prepare images for the model.
|
||||
use [`DeiTImageProcessor`] in order to prepare images for the model.
|
||||
|
||||
- [BEiT](beit) (BERT pre-training of Image Transformers) by Microsoft Research. BEiT models outperform supervised pre-trained
|
||||
vision transformers using a self-supervised method inspired by BERT (masked image modeling) and based on a VQ-VAE.
|
||||
|
||||
@@ -37,7 +37,7 @@ One can easily tweak it for their own use case.
|
||||
- A notebook that illustrates how to visualize reconstructed pixel values with [`ViTMAEForPreTraining`] can be found [here](https://github.com/NielsRogge/Transformers-Tutorials/blob/master/ViTMAE/ViT_MAE_visualization_demo.ipynb).
|
||||
- After pre-training, one "throws away" the decoder used to reconstruct pixels, and one uses the encoder for fine-tuning/linear probing. This means that after
|
||||
fine-tuning, one can directly plug in the weights into a [`ViTForImageClassification`].
|
||||
- One can use [`ViTFeatureExtractor`] to prepare images for the model. See the code examples for more info.
|
||||
- One can use [`ViTImageProcessor`] to prepare images for the model. See the code examples for more info.
|
||||
- Note that the encoder of MAE is only used to encode the visual patches. The encoded patches are then concatenated with mask tokens, which the decoder (which also
|
||||
consists of Transformer blocks) takes as input. Each mask token is a shared, learned vector that indicates the presence of a missing patch to be predicted. Fixed
|
||||
sin/cos position embeddings are added both to the input of the encoder and the decoder.
|
||||
|
||||
@@ -23,7 +23,7 @@ The abstract from the paper is the following:
|
||||
|
||||
Tips:
|
||||
|
||||
- One can use [`YolosFeatureExtractor`] for preparing images (and optional targets) for the model. Contrary to [DETR](detr), YOLOS doesn't require a `pixel_mask` to be created.
|
||||
- One can use [`YolosImageProcessor`] for preparing images (and optional targets) for the model. Contrary to [DETR](detr), YOLOS doesn't require a `pixel_mask` to be created.
|
||||
- Demo notebooks (regarding inference and fine-tuning on custom data) can be found [here](https://github.com/NielsRogge/Transformers-Tutorials/tree/master/YOLOS).
|
||||
|
||||
<img src="https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/yolos_architecture.png"
|
||||
|
||||
@@ -24,7 +24,7 @@ The library was designed with two strong goals in mind:
|
||||
|
||||
- We strongly limited the number of user-facing abstractions to learn, in fact, there are almost no abstractions,
|
||||
just three standard classes required to use each model: [configuration](main_classes/configuration),
|
||||
[models](main_classes/model), and a preprocessing class ([tokenizer](main_classes/tokenizer) for NLP, [feature extractor](main_classes/feature_extractor) for vision and audio, and [processor](main_classes/processors) for multimodal inputs).
|
||||
[models](main_classes/model), and a preprocessing class ([tokenizer](main_classes/tokenizer) for NLP, [image processor](main_classes/image_processor) for vision, [feature extractor](main_classes/feature_extractor) for audio, and [processor](main_classes/processors) for multimodal inputs).
|
||||
- All of these classes can be initialized in a simple and unified way from pretrained instances by using a common
|
||||
`from_pretrained()` method which downloads (if needed), caches and
|
||||
loads the related class instance and associated data (configurations' hyperparameters, tokenizers' vocabulary,
|
||||
@@ -62,7 +62,7 @@ The library is built around three types of classes for each model:
|
||||
|
||||
- **Model classes** can be PyTorch models ([torch.nn.Module](https://pytorch.org/docs/stable/nn.html#torch.nn.Module)), Keras models ([tf.keras.Model](https://www.tensorflow.org/api_docs/python/tf/keras/Model)) or JAX/Flax models ([flax.linen.Module](https://flax.readthedocs.io/en/latest/api_reference/flax.linen.html)) that work with the pretrained weights provided in the library.
|
||||
- **Configuration classes** store the hyperparameters required to build a model (such as the number of layers and hidden size). You don't always need to instantiate these yourself. In particular, if you are using a pretrained model without any modification, creating the model will automatically take care of instantiating the configuration (which is part of the model).
|
||||
- **Preprocessing classes** convert the raw data into a format accepted by the model. A [tokenizer](main_classes/tokenizer) stores the vocabulary for each model and provide methods for encoding and decoding strings in a list of token embedding indices to be fed to a model. [Feature extractors](main_classes/feature_extractor) preprocess audio or vision inputs, and a [processor](main_classes/processors) handles multimodal inputs.
|
||||
- **Preprocessing classes** convert the raw data into a format accepted by the model. A [tokenizer](main_classes/tokenizer) stores the vocabulary for each model and provide methods for encoding and decoding strings in a list of token embedding indices to be fed to a model. [Image processors](main_classes/image_processor) preprocess vision inputs, [feature extractors](main_classes/feature_extractor) preprocess audio inputs, and a [processor](main_classes/processors) handles multimodal inputs.
|
||||
|
||||
All these classes can be instantiated from pretrained instances, saved locally, and shared on the Hub with three methods:
|
||||
|
||||
|
||||
@@ -19,11 +19,11 @@ Before you can train a model on a dataset, it needs to be preprocessed into the
|
||||
* Text, use a [Tokenizer](./main_classes/tokenizer) to convert text into a sequence of tokens, create a numerical representation of the tokens, and assemble them into tensors.
|
||||
* Image inputs use a [ImageProcessor](./main_classes/image) to convert images into tensors.
|
||||
* Speech and audio, use a [Feature extractor](./main_classes/feature_extractor) to extract sequential features from audio waveforms and convert them into tensors.
|
||||
* Multimodal inputs, use a [Processor](./main_classes/processors) to combine a tokenizer and a feature extractor.
|
||||
* Multimodal inputs, use a [Processor](./main_classes/processors) to combine a tokenizer and a feature extractor or image processor.
|
||||
|
||||
<Tip>
|
||||
|
||||
`AutoProcessor` **always** works and automatically chooses the correct class for the model you're using, whether you're using a tokenizer, feature extractor or processor.
|
||||
`AutoProcessor` **always** works and automatically chooses the correct class for the model you're using, whether you're using a tokenizer, image processor, feature extractor or processor.
|
||||
|
||||
</Tip>
|
||||
|
||||
@@ -320,9 +320,9 @@ The sample lengths are now the same and match the specified maximum length. You
|
||||
|
||||
## Computer vision
|
||||
|
||||
For computer vision tasks, you'll need a [feature extractor](main_classes/feature_extractor) to prepare your dataset for the model. The feature extractor is designed to extract features from images, and convert them into tensors.
|
||||
For computer vision tasks, you'll need an [image processor](main_classes/image_processor) to prepare your dataset for the model. The image processor is designed to preprocess images, and convert them into tensors.
|
||||
|
||||
Load the [food101](https://huggingface.co/datasets/food101) dataset (see the 🤗 [Datasets tutorial](https://huggingface.co/docs/datasets/load_hub.html) for more details on how to load a dataset) to see how you can use a feature extractor with computer vision datasets:
|
||||
Load the [food101](https://huggingface.co/datasets/food101) dataset (see the 🤗 [Datasets tutorial](https://huggingface.co/docs/datasets/load_hub.html) for more details on how to load a dataset) to see how you can use an image processor with computer vision datasets:
|
||||
|
||||
<Tip>
|
||||
|
||||
@@ -346,17 +346,17 @@ Next, take a look at the image with 🤗 Datasets [`Image`](https://huggingface.
|
||||
<img src="https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/vision-preprocess-tutorial.png"/>
|
||||
</div>
|
||||
|
||||
Load the feature extractor with [`AutoFeatureExtractor.from_pretrained`]:
|
||||
Load the image processor with [`AutoImageProcessor.from_pretrained`]:
|
||||
|
||||
```py
|
||||
>>> from transformers import AutoFeatureExtractor
|
||||
>>> from transformers import AutoImageProcessor
|
||||
|
||||
>>> feature_extractor = AutoFeatureExtractor.from_pretrained("google/vit-base-patch16-224")
|
||||
>>> image_processor = AutoImageProcessor.from_pretrained("google/vit-base-patch16-224")
|
||||
```
|
||||
|
||||
For computer vision tasks, it is common to add some type of data augmentation to the images as a part of preprocessing. You can add augmentations with any library you'd like, but in this tutorial, you'll use torchvision's [`transforms`](https://pytorch.org/vision/stable/transforms.html) module. If you're interested in using another data augmentation library, learn how in the [Albumentations](https://colab.research.google.com/github/huggingface/notebooks/blob/main/examples/image_classification_albumentations.ipynb) or [Kornia notebooks](https://colab.research.google.com/github/huggingface/notebooks/blob/main/examples/image_classification_kornia.ipynb).
|
||||
|
||||
1. Normalize the image with the feature extractor and use [`Compose`](https://pytorch.org/vision/master/generated/torchvision.transforms.Compose.html) to chain some transforms - [`RandomResizedCrop`](https://pytorch.org/vision/main/generated/torchvision.transforms.RandomResizedCrop.html) and [`ColorJitter`](https://pytorch.org/vision/main/generated/torchvision.transforms.ColorJitter.html) - together:
|
||||
1. Normalize the image with the image processor and use [`Compose`](https://pytorch.org/vision/master/generated/torchvision.transforms.Compose.html) to chain some transforms - [`RandomResizedCrop`](https://pytorch.org/vision/main/generated/torchvision.transforms.RandomResizedCrop.html) and [`ColorJitter`](https://pytorch.org/vision/main/generated/torchvision.transforms.ColorJitter.html) - together:
|
||||
|
||||
```py
|
||||
>>> from torchvision.transforms import Compose, Normalize, RandomResizedCrop, ColorJitter, ToTensor
|
||||
@@ -370,7 +370,7 @@ For computer vision tasks, it is common to add some type of data augmentation to
|
||||
>>> _transforms = Compose([RandomResizedCrop(size), ColorJitter(brightness=0.5, hue=0.5), ToTensor(), normalize])
|
||||
```
|
||||
|
||||
2. The model accepts [`pixel_values`](model_doc/visionencoderdecoder#transformers.VisionEncoderDecoderModel.forward.pixel_values) as its input, which is generated by the feature extractor. Create a function that generates `pixel_values` from the transforms:
|
||||
2. The model accepts [`pixel_values`](model_doc/visionencoderdecoder#transformers.VisionEncoderDecoderModel.forward.pixel_values) as its input, which is generated by the image processor. Create a function that generates `pixel_values` from the transforms:
|
||||
|
||||
```py
|
||||
>>> def transforms(examples):
|
||||
@@ -384,7 +384,7 @@ For computer vision tasks, it is common to add some type of data augmentation to
|
||||
>>> dataset.set_transform(transforms)
|
||||
```
|
||||
|
||||
4. Now when you access the image, you'll notice the feature extractor has added `pixel_values`. You can pass your processed dataset to the model now!
|
||||
4. Now when you access the image, you'll notice the image processor has added `pixel_values`. You can pass your processed dataset to the model now!
|
||||
|
||||
```py
|
||||
>>> dataset[0]["image"]
|
||||
@@ -431,7 +431,7 @@ Here is what the image looks like after the transforms are applied. The image ha
|
||||
|
||||
## Multimodal
|
||||
|
||||
For tasks involving multimodal inputs, you'll need a [processor](main_classes/processors) to prepare your dataset for the model. A processor couples a tokenizer and feature extractor.
|
||||
For tasks involving multimodal inputs, you'll need a [processor](main_classes/processors) to prepare your dataset for the model. A processor couples together two processing objects such as as tokenizer and feature extractor.
|
||||
|
||||
Load the [LJ Speech](https://huggingface.co/datasets/lj_speech) dataset (see the 🤗 [Datasets tutorial](https://huggingface.co/docs/datasets/load_hub.html) for more details on how to load a dataset) to see how you can use a processor for automatic speech recognition (ASR):
|
||||
|
||||
|
||||
@@ -225,7 +225,7 @@ A tokenizer can also accept a list of inputs, and pad and truncate the text to r
|
||||
|
||||
<Tip>
|
||||
|
||||
Check out the [preprocess](./preprocessing) tutorial for more details about tokenization, and how to use an [`AutoFeatureExtractor`] and [`AutoProcessor`] to preprocess image, audio, and multimodal inputs.
|
||||
Check out the [preprocess](./preprocessing) tutorial for more details about tokenization, and how to use an [`AutoImageProcessor`], [`AutoFeatureExtractor`] and [`AutoProcessor`] to preprocess image, audio, and multimodal inputs.
|
||||
|
||||
</Tip>
|
||||
|
||||
@@ -424,7 +424,7 @@ Depending on your task, you'll typically pass the following parameters to [`Trai
|
||||
... )
|
||||
```
|
||||
|
||||
3. A preprocessing class like a tokenizer, feature extractor, or processor:
|
||||
3. A preprocessing class like a tokenizer, image processor, feature extractor, or processor:
|
||||
|
||||
```py
|
||||
>>> from transformers import AutoTokenizer
|
||||
@@ -501,7 +501,7 @@ All models are a standard [`tf.keras.Model`](https://www.tensorflow.org/api_docs
|
||||
>>> model = TFAutoModelForSequenceClassification.from_pretrained("distilbert-base-uncased")
|
||||
```
|
||||
|
||||
2. A preprocessing class like a tokenizer, feature extractor, or processor:
|
||||
2. A preprocessing class like a tokenizer, image processor, feature extractor, or processor:
|
||||
|
||||
```py
|
||||
>>> from transformers import AutoTokenizer
|
||||
|
||||
@@ -1101,24 +1101,24 @@ Class Egyptian cat with score 0.0239
|
||||
Class tiger cat with score 0.0229
|
||||
```
|
||||
|
||||
The general process for using a model and feature extractor for image classification is:
|
||||
The general process for using a model and image processor for image classification is:
|
||||
|
||||
1. Instantiate a feature extractor and a model from the checkpoint name.
|
||||
2. Process the image to be classified with a feature extractor.
|
||||
1. Instantiate an image processor and a model from the checkpoint name.
|
||||
2. Process the image to be classified with an image processor.
|
||||
3. Pass the input through the model and take the `argmax` to retrieve the predicted class.
|
||||
4. Convert the class id to a class name with `id2label` to return an interpretable result.
|
||||
|
||||
<frameworkcontent>
|
||||
<pt>
|
||||
```py
|
||||
>>> from transformers import AutoFeatureExtractor, AutoModelForImageClassification
|
||||
>>> from transformers import AutoImageProcessor, AutoModelForImageClassification
|
||||
>>> import torch
|
||||
>>> from datasets import load_dataset
|
||||
|
||||
>>> dataset = load_dataset("huggingface/cats-image")
|
||||
>>> image = dataset["test"]["image"][0]
|
||||
|
||||
>>> feature_extractor = AutoFeatureExtractor.from_pretrained("google/vit-base-patch16-224")
|
||||
>>> feature_extractor = AutoImageProcessor.from_pretrained("google/vit-base-patch16-224")
|
||||
>>> model = AutoModelForImageClassification.from_pretrained("google/vit-base-patch16-224")
|
||||
|
||||
>>> inputs = feature_extractor(image, return_tensors="pt")
|
||||
|
||||
@@ -91,26 +91,26 @@ Now you can convert the label id to a label name:
|
||||
|
||||
## Preprocess
|
||||
|
||||
The next step is to load a ViT feature extractor to process the image into a tensor:
|
||||
The next step is to load a ViT image processor to process the image into a tensor:
|
||||
|
||||
```py
|
||||
>>> from transformers import AutoFeatureExtractor
|
||||
>>> from transformers import AutoImageProcessor
|
||||
|
||||
>>> feature_extractor = AutoFeatureExtractor.from_pretrained("google/vit-base-patch16-224-in21k")
|
||||
>>> image_processor = AutoImageProcessor.from_pretrained("google/vit-base-patch16-224-in21k")
|
||||
```
|
||||
|
||||
Apply some image transformations to the images to make the model more robust against overfitting. Here you'll use torchvision's [`transforms`](https://pytorch.org/vision/stable/transforms.html) module, but you can also use any image library you like.
|
||||
Apply some image transformations to the images to make the model more robust against overfitting. Here you'll use torchvision's [`transforms`](https://pytorch.org/vision/stable/transforms.html) module, but you can also use any image library you like.
|
||||
|
||||
Crop a random part of the image, resize it, and normalize it with the image mean and standard deviation:
|
||||
|
||||
```py
|
||||
>>> from torchvision.transforms import RandomResizedCrop, Compose, Normalize, ToTensor
|
||||
|
||||
>>> normalize = Normalize(mean=feature_extractor.image_mean, std=feature_extractor.image_std)
|
||||
>>> normalize = Normalize(mean=image_processor.image_mean, std=image_processor.image_std)
|
||||
>>> size = (
|
||||
... feature_extractor.size["shortest_edge"]
|
||||
... if "shortest_edge" in feature_extractor.size
|
||||
... else (feature_extractor.size["height"], feature_extractor.size["width"])
|
||||
... image_processor.size["shortest_edge"]
|
||||
... if "shortest_edge" in image_processor.size
|
||||
... else (image_processor.size["height"], image_processor.size["width"])
|
||||
... )
|
||||
>>> _transforms = Compose([RandomResizedCrop(size), ToTensor(), normalize])
|
||||
```
|
||||
@@ -213,7 +213,7 @@ At this point, only three steps remain:
|
||||
... data_collator=data_collator,
|
||||
... train_dataset=food["train"],
|
||||
... eval_dataset=food["test"],
|
||||
... tokenizer=feature_extractor,
|
||||
... tokenizer=image_processor,
|
||||
... compute_metrics=compute_metrics,
|
||||
... )
|
||||
|
||||
@@ -266,14 +266,14 @@ You can also manually replicate the results of the `pipeline` if you'd like:
|
||||
|
||||
<frameworkcontent>
|
||||
<pt>
|
||||
Load a feature extractor to preprocess the image and return the `input` as PyTorch tensors:
|
||||
Load an image processor to preprocess the image and return the `input` as PyTorch tensors:
|
||||
|
||||
```py
|
||||
>>> from transformers import AutoFeatureExtractor
|
||||
>>> from transformers import AutoImageProcessor
|
||||
>>> import torch
|
||||
|
||||
>>> feature_extractor = AutoFeatureExtractor.from_pretrained("my_awesome_food_model")
|
||||
>>> inputs = feature_extractor(image, return_tensors="pt")
|
||||
>>> image_processor = AutoImageProcessor.from_pretrained("my_awesome_food_model")
|
||||
>>> inputs = image_processor(image, return_tensors="pt")
|
||||
```
|
||||
|
||||
Pass your inputs to the model and return the logits:
|
||||
|
||||
@@ -90,12 +90,12 @@ You'll also want to create a dictionary that maps a label id to a label class wh
|
||||
|
||||
## Preprocess
|
||||
|
||||
The next step is to load a SegFormer feature extractor to prepare the images and annotations for the model. Some datasets, like this one, use the zero-index as the background class. However, the background class isn't actually included in the 150 classes, so you'll need to set `reduce_labels=True` to subtract one from all the labels. The zero-index is replaced by `255` so it's ignored by SegFormer's loss function:
|
||||
The next step is to load a SegFormer image processor to prepare the images and annotations for the model. Some datasets, like this one, use the zero-index as the background class. However, the background class isn't actually included in the 150 classes, so you'll need to set `reduce_labels=True` to subtract one from all the labels. The zero-index is replaced by `255` so it's ignored by SegFormer's loss function:
|
||||
|
||||
```py
|
||||
>>> from transformers import AutoFeatureExtractor
|
||||
>>> from transformers import AutoImageProcessor
|
||||
|
||||
>>> feature_extractor = AutoFeatureExtractor.from_pretrained("nvidia/mit-b0", reduce_labels=True)
|
||||
>>> feature_extractor = AutoImageProcessor.from_pretrained("nvidia/mit-b0", reduce_labels=True)
|
||||
```
|
||||
|
||||
It is common to apply some data augmentations to an image dataset to make a model more robust against overfitting. In this guide, you'll use the [`ColorJitter`](https://pytorch.org/vision/stable/generated/torchvision.transforms.ColorJitter.html) function from [torchvision](https://pytorch.org/vision/stable/index.html) to randomly change the color properties of an image, but you can also use any image library you like.
|
||||
@@ -106,7 +106,7 @@ It is common to apply some data augmentations to an image dataset to make a mode
|
||||
>>> jitter = ColorJitter(brightness=0.25, contrast=0.25, saturation=0.25, hue=0.1)
|
||||
```
|
||||
|
||||
Now create two preprocessing functions to prepare the images and annotations for the model. These functions convert the images into `pixel_values` and annotations to `labels`. For the training set, `jitter` is applied before providing the images to the feature extractor. For the test set, the feature extractor crops and normalizes the `images`, and only crops the `labels` because no data augmentation is applied during testing.
|
||||
Now create two preprocessing functions to prepare the images and annotations for the model. These functions convert the images into `pixel_values` and annotations to `labels`. For the training set, `jitter` is applied before providing the images to the image processor. For the test set, the image processor crops and normalizes the `images`, and only crops the `labels` because no data augmentation is applied during testing.
|
||||
|
||||
```py
|
||||
>>> def train_transforms(example_batch):
|
||||
@@ -281,7 +281,7 @@ The simplest way to try out your finetuned model for inference is to use it in a
|
||||
'mask': <PIL.Image.Image image mode=L size=640x427 at 0x7FD5B2062E10>}]
|
||||
```
|
||||
|
||||
You can also manually replicate the results of the `pipeline` if you'd like. Process the image with a feature extractor and place the `pixel_values` on a GPU:
|
||||
You can also manually replicate the results of the `pipeline` if you'd like. Process the image with an image processor and place the `pixel_values` on a GPU:
|
||||
|
||||
```py
|
||||
>>> device = torch.device("cuda" if torch.cuda.is_available() else "cpu") # use GPU if available, otherwise use a CPU
|
||||
|
||||
@@ -89,7 +89,7 @@ TensorFlow's [model.save](https://www.tensorflow.org/tutorials/keras/save_and_lo
|
||||
Another common error you may encounter, especially if it is a newly released model, is `ImportError`:
|
||||
|
||||
```
|
||||
ImportError: cannot import name 'ImageGPTFeatureExtractor' from 'transformers' (unknown location)
|
||||
ImportError: cannot import name 'ImageGPTImageProcessor' from 'transformers' (unknown location)
|
||||
```
|
||||
|
||||
For these error types, check to make sure you have the latest version of 🤗 Transformers installed to access the most recent models:
|
||||
|
||||
Reference in New Issue
Block a user