Update doc examples feature extractor -> image processor (#20501)

* Update doc example feature extractor -> image processor

* Apply suggestions from code review
This commit is contained in:
amyeroberts
2022-11-30 14:50:55 +00:00
committed by GitHub
parent afad0c18d9
commit 17a7b49bda
84 changed files with 497 additions and 458 deletions

View File

@@ -19,11 +19,11 @@ Before you can train a model on a dataset, it needs to be preprocessed into the
* Text, use a [Tokenizer](./main_classes/tokenizer) to convert text into a sequence of tokens, create a numerical representation of the tokens, and assemble them into tensors.
* Image inputs use a [ImageProcessor](./main_classes/image) to convert images into tensors.
* Speech and audio, use a [Feature extractor](./main_classes/feature_extractor) to extract sequential features from audio waveforms and convert them into tensors.
* Multimodal inputs, use a [Processor](./main_classes/processors) to combine a tokenizer and a feature extractor.
* Multimodal inputs, use a [Processor](./main_classes/processors) to combine a tokenizer and a feature extractor or image processor.
<Tip>
`AutoProcessor` **always** works and automatically chooses the correct class for the model you're using, whether you're using a tokenizer, feature extractor or processor.
`AutoProcessor` **always** works and automatically chooses the correct class for the model you're using, whether you're using a tokenizer, image processor, feature extractor or processor.
</Tip>
@@ -320,9 +320,9 @@ The sample lengths are now the same and match the specified maximum length. You
## Computer vision
For computer vision tasks, you'll need a [feature extractor](main_classes/feature_extractor) to prepare your dataset for the model. The feature extractor is designed to extract features from images, and convert them into tensors.
For computer vision tasks, you'll need an [image processor](main_classes/image_processor) to prepare your dataset for the model. The image processor is designed to preprocess images, and convert them into tensors.
Load the [food101](https://huggingface.co/datasets/food101) dataset (see the 🤗 [Datasets tutorial](https://huggingface.co/docs/datasets/load_hub.html) for more details on how to load a dataset) to see how you can use a feature extractor with computer vision datasets:
Load the [food101](https://huggingface.co/datasets/food101) dataset (see the 🤗 [Datasets tutorial](https://huggingface.co/docs/datasets/load_hub.html) for more details on how to load a dataset) to see how you can use an image processor with computer vision datasets:
<Tip>
@@ -346,17 +346,17 @@ Next, take a look at the image with 🤗 Datasets [`Image`](https://huggingface.
<img src="https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/vision-preprocess-tutorial.png"/>
</div>
Load the feature extractor with [`AutoFeatureExtractor.from_pretrained`]:
Load the image processor with [`AutoImageProcessor.from_pretrained`]:
```py
>>> from transformers import AutoFeatureExtractor
>>> from transformers import AutoImageProcessor
>>> feature_extractor = AutoFeatureExtractor.from_pretrained("google/vit-base-patch16-224")
>>> image_processor = AutoImageProcessor.from_pretrained("google/vit-base-patch16-224")
```
For computer vision tasks, it is common to add some type of data augmentation to the images as a part of preprocessing. You can add augmentations with any library you'd like, but in this tutorial, you'll use torchvision's [`transforms`](https://pytorch.org/vision/stable/transforms.html) module. If you're interested in using another data augmentation library, learn how in the [Albumentations](https://colab.research.google.com/github/huggingface/notebooks/blob/main/examples/image_classification_albumentations.ipynb) or [Kornia notebooks](https://colab.research.google.com/github/huggingface/notebooks/blob/main/examples/image_classification_kornia.ipynb).
1. Normalize the image with the feature extractor and use [`Compose`](https://pytorch.org/vision/master/generated/torchvision.transforms.Compose.html) to chain some transforms - [`RandomResizedCrop`](https://pytorch.org/vision/main/generated/torchvision.transforms.RandomResizedCrop.html) and [`ColorJitter`](https://pytorch.org/vision/main/generated/torchvision.transforms.ColorJitter.html) - together:
1. Normalize the image with the image processor and use [`Compose`](https://pytorch.org/vision/master/generated/torchvision.transforms.Compose.html) to chain some transforms - [`RandomResizedCrop`](https://pytorch.org/vision/main/generated/torchvision.transforms.RandomResizedCrop.html) and [`ColorJitter`](https://pytorch.org/vision/main/generated/torchvision.transforms.ColorJitter.html) - together:
```py
>>> from torchvision.transforms import Compose, Normalize, RandomResizedCrop, ColorJitter, ToTensor
@@ -370,7 +370,7 @@ For computer vision tasks, it is common to add some type of data augmentation to
>>> _transforms = Compose([RandomResizedCrop(size), ColorJitter(brightness=0.5, hue=0.5), ToTensor(), normalize])
```
2. The model accepts [`pixel_values`](model_doc/visionencoderdecoder#transformers.VisionEncoderDecoderModel.forward.pixel_values) as its input, which is generated by the feature extractor. Create a function that generates `pixel_values` from the transforms:
2. The model accepts [`pixel_values`](model_doc/visionencoderdecoder#transformers.VisionEncoderDecoderModel.forward.pixel_values) as its input, which is generated by the image processor. Create a function that generates `pixel_values` from the transforms:
```py
>>> def transforms(examples):
@@ -384,7 +384,7 @@ For computer vision tasks, it is common to add some type of data augmentation to
>>> dataset.set_transform(transforms)
```
4. Now when you access the image, you'll notice the feature extractor has added `pixel_values`. You can pass your processed dataset to the model now!
4. Now when you access the image, you'll notice the image processor has added `pixel_values`. You can pass your processed dataset to the model now!
```py
>>> dataset[0]["image"]
@@ -431,7 +431,7 @@ Here is what the image looks like after the transforms are applied. The image ha
## Multimodal
For tasks involving multimodal inputs, you'll need a [processor](main_classes/processors) to prepare your dataset for the model. A processor couples a tokenizer and feature extractor.
For tasks involving multimodal inputs, you'll need a [processor](main_classes/processors) to prepare your dataset for the model. A processor couples together two processing objects such as as tokenizer and feature extractor.
Load the [LJ Speech](https://huggingface.co/datasets/lj_speech) dataset (see the 🤗 [Datasets tutorial](https://huggingface.co/docs/datasets/load_hub.html) for more details on how to load a dataset) to see how you can use a processor for automatic speech recognition (ASR):