Add inference section to task guides (#18781)

* 📝 start adding inference section to task guides

*  make style

* 📝 add multiple choice

* add rest of inference sections

* make style

* add compute_metric, push_to_hub, pipeline

* make style

* add updated sequence and token classification

* make style

* make edits in token classification

* add audio classification

* make style

* add asr

* make style

* add image classification

* make style

* add summarization

* make style

* add translation

* make style

* add multiple choice

* add language modeling

* add qa

* make style

* review and edits

* apply reviews

* make style

* fix call to processor

* apply audio reviews

* update to better asr model

* make style
Steven Liu
2022-11-21 10:06:21 -08:00
committed by GitHub
parent 4973d2a04c
commit d896029e27
11 changed files with 2401 additions and 616 deletions


@@ -18,7 +18,10 @@ specific language governing permissions and limitations under the License.
Semantic segmentation assigns a label or class to each individual pixel of an image. There are several types of segmentation, and in the case of semantic segmentation, no distinction is made between unique instances of the same object. Both objects are given the same label (for example, "car" instead of "car-1" and "car-2"). Common real-world applications of semantic segmentation include training self-driving cars to identify pedestrians and important traffic information, identifying cells and abnormalities in medical imagery, and monitoring environmental changes from satellite imagery.
This guide will show you how to finetune [SegFormer](https://huggingface.co/docs/transformers/main/en/model_doc/segformer#segformer) on the [SceneParse150](https://huggingface.co/datasets/scene_parse_150) dataset.
This guide will show you how to:
1. Finetune [SegFormer](https://huggingface.co/docs/transformers/main/en/model_doc/segformer#segformer) on the [SceneParse150](https://huggingface.co/datasets/scene_parse_150) dataset.
2. Use your finetuned model for inference.
<Tip>
@@ -32,9 +35,17 @@ Before you begin, make sure you have all the necessary libraries installed:
pip install -q datasets transformers evaluate
```
We encourage you to log in to your Hugging Face account so you can upload and share your model with the community. When prompted, enter your token to log in:
```py
>>> from huggingface_hub import notebook_login
>>> notebook_login()
```
## Load SceneParse150 dataset
Load the first 50 examples of the SceneParse150 dataset from the 🤗 Datasets library so you can quickly train and test a model:
Start by loading a smaller subset of the SceneParse150 dataset from the 🤗 Datasets library. This'll give you a chance to experiment and make sure everything works before spending more time training on the full dataset.
```py
>>> from datasets import load_dataset
@@ -42,7 +53,7 @@ Load the first 50 examples of the SceneParse150 dataset from the 🤗 Datasets l
>>> ds = load_dataset("scene_parse_150", split="train[:50]")
```
Split this dataset into a train and test set:
Split the dataset's `train` split into a train and test set with the [`~datasets.Dataset.train_test_split`] method:
```py
>>> ds = ds.train_test_split(test_size=0.2)
@@ -59,7 +70,9 @@ Then take a look at an example:
'scene_category': 368}
```
There is an `image`, an `annotation` (this is the segmentation map or label), and a `scene_category` field that describes the image scene, like "kitchen" or "office". In this guide, you'll only need `image` and `annotation`, both of which are PIL images.
- `image`: a PIL image of the scene.
- `annotation`: a PIL image of the segmentation map, which is also the model's target.
- `scene_category`: a category id that describes the image scene like "kitchen" or "office".
In this guide, you'll only need `image` and `annotation`.
You'll also want to create a dictionary that maps a label id to a label class which will be useful when you set up the model later. Download the mappings from the Hub and create the `id2label` and `label2id` dictionaries:
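As a minimal sketch of the shape these two dictionaries take, the snippet below builds them from three placeholder labels. The labels are illustrative assumptions only; the guide itself downloads the full 150-class mapping from the Hub.

```python
# Illustrative stand-in labels, NOT the real SceneParse150 mapping from the Hub.
id2label = {0: "wall", 1: "building", 2: "sky"}

# Invert the mapping so a class name can be looked up to get its id.
label2id = {label: idx for idx, label in id2label.items()}

print(label2id)
```

The model config stores both directions, so predictions (ids) can be turned into class names and class names back into ids.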
@@ -77,7 +90,7 @@ You'll also want to create a dictionary that maps a label id to a label class wh
## Preprocess
Next, load a SegFormer feature extractor to prepare the images and annotations for the model. Some datasets, like this one, use the zero-index as the background class. However, the background class isn't included in the 150 classes, so you'll need to set `reduce_labels=True` to subtract one from all the labels. The zero-index is replaced by `255` so it's ignored by SegFormer's loss function:
The next step is to load a SegFormer feature extractor to prepare the images and annotations for the model. Some datasets, like this one, use the zero-index as the background class. However, the background class isn't actually included in the 150 classes, so you'll need to set `reduce_labels=True` to subtract one from all the labels. The zero-index is replaced by `255` so it's ignored by SegFormer's loss function:
```py
>>> from transformers import AutoFeatureExtractor
@@ -85,7 +98,7 @@ Next, load a SegFormer feature extractor to prepare the images and annotations f
>>> feature_extractor = AutoFeatureExtractor.from_pretrained("nvidia/mit-b0", reduce_labels=True)
```
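To make the effect of `reduce_labels=True` concrete, here is a small pure-Python sketch of the remapping described above. The helper name `reduce_labels` and the toy annotation are assumptions for illustration; the feature extractor applies the equivalent operation internally on image arrays.

```python
def reduce_labels(annotation, ignore_index=255):
    """Shift every label down by one and map background (0) to ignore_index.

    Hypothetical helper for illustration; pixels already equal to
    ignore_index are left unchanged.
    """
    return [
        [ignore_index if px in (0, ignore_index) else px - 1 for px in row]
        for row in annotation
    ]


# Toy 2x3 segmentation map: 0 is background, 1..150 are the real classes.
annotation = [[0, 1, 2], [150, 0, 1]]
reduced = reduce_labels(annotation)
print(reduced)  # [[255, 0, 1], [149, 255, 0]]
```

After the shift, the 150 classes occupy ids 0 to 149, and every former background pixel carries `255`, which the loss function ignores.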
It is common to apply some data augmentations to an image dataset to make a model more robust against overfitting. In this guide, you'll use the [`ColorJitter`](https://pytorch.org/vision/stable/generated/torchvision.transforms.ColorJitter.html) function from [torchvision](https://pytorch.org/vision/stable/index.html) to randomly change the color properties of an image:
It is common to apply some data augmentations to an image dataset to make a model more robust against overfitting. In this guide, you'll use the [`ColorJitter`](https://pytorch.org/vision/stable/generated/torchvision.transforms.ColorJitter.html) function from [torchvision](https://pytorch.org/vision/stable/index.html) to randomly change the color properties of an image, but you can also use any image library you like.
```py
>>> from torchvision.transforms import ColorJitter
@@ -117,53 +130,9 @@ To apply the `jitter` over the entire dataset, use the 🤗 Datasets [`~datasets
>>> test_ds.set_transform(val_transforms)
```
## Train
## Evaluate
Load SegFormer with [`AutoModelForSemanticSegmentation`], and pass the model the mapping between label ids and label classes:
```py
>>> from transformers import AutoModelForSemanticSegmentation
>>> pretrained_model_name = "nvidia/mit-b0"
>>> model = AutoModelForSemanticSegmentation.from_pretrained(
... pretrained_model_name, id2label=id2label, label2id=label2id
... )
```
<Tip>
If you aren't familiar with finetuning a model with the [`Trainer`], take a look at the basic tutorial [here](../training#finetune-with-trainer)!
</Tip>
Define your training hyperparameters in [`TrainingArguments`]. It is important not to remove unused columns because this will drop the `image` column. Without the `image` column, you can't create `pixel_values`. Set `remove_unused_columns=False` to prevent this behavior!
To save and push a model under your namespace to the Hub, set `push_to_hub=True`:
```py
>>> from transformers import TrainingArguments
>>> training_args = TrainingArguments(
... output_dir="segformer-b0-scene-parse-150",
... learning_rate=6e-5,
... num_train_epochs=50,
... per_device_train_batch_size=2,
... per_device_eval_batch_size=2,
... save_total_limit=3,
... evaluation_strategy="steps",
... save_strategy="steps",
... save_steps=20,
... eval_steps=20,
... logging_steps=1,
... eval_accumulation_steps=5,
... remove_unused_columns=False,
... push_to_hub=True,
... )
```
To evaluate model performance during training, you'll need to create a function to compute and report metrics. For semantic segmentation, you'll typically compute the [mean Intersection over Union](https://huggingface.co/spaces/evaluate-metric/mean_iou) (IoU). The mean IoU measures the overlapping area between the predicted and ground truth segmentation maps.
Load the mean IoU from the 🤗 Evaluate library:
Including a metric during training is often helpful for evaluating your model's performance. You can quickly load an evaluation method with the 🤗 [Evaluate](https://huggingface.co/docs/evaluate/index) library. For this task, load the [mean Intersection over Union](https://huggingface.co/spaces/evaluate-metric/mean_iou) (IoU) metric (see the 🤗 Evaluate [quick tour](https://huggingface.co/docs/evaluate/a_quick_tour) to learn more about how to load and compute a metric):
```py
>>> import evaluate
@@ -199,10 +168,50 @@ Then create a function to [`~evaluate.EvaluationModule.compute`] the metrics. Yo
... return metrics
```
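For intuition about what the metric measures, the following pure-Python sketch computes mean IoU by hand on two tiny flattened segmentation maps. It is a simplified illustration, not the 🤗 Evaluate implementation: the function name `mean_iou_sketch` is hypothetical, and it assumes that pixels labeled `255` are ignored and that labels absent from both maps are skipped.

```python
def mean_iou_sketch(pred, target, num_labels, ignore_index=255):
    """Average per-class intersection over union on flat label lists."""
    ious = []
    for label in range(num_labels):
        intersection = union = 0
        for p, t in zip(pred, target):
            if t == ignore_index:
                continue  # ignored pixels do not count for any class
            intersection += (p == label) and (t == label)
            union += (p == label) or (t == label)
        if union:  # skip classes that appear in neither map
            ious.append(intersection / union)
    return sum(ious) / len(ious) if ious else float("nan")


pred = [0, 0, 1, 1, 1, 0]
target = [0, 1, 1, 1, 255, 0]
score = mean_iou_sketch(pred, target, num_labels=2)
print(round(score, 4))  # 0.6667
```

Here each class overlaps on two pixels out of a union of three, so both per-class IoUs are 2/3 and so is their mean.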
Pass your model, training arguments, datasets, and metrics function to the [`Trainer`]:
Your `compute_metrics` function is ready to go now, and you'll return to it when you set up your training.
## Train
<Tip>
If you aren't familiar with finetuning a model with the [`Trainer`], take a look at the basic tutorial [here](../training#finetune-with-trainer)!
</Tip>
You're ready to start training your model now! Load SegFormer with [`AutoModelForSemanticSegmentation`], and pass the model the mapping between label ids and label classes:
```py
>>> from transformers import Trainer
>>> from transformers import AutoModelForSemanticSegmentation, TrainingArguments, Trainer
>>> pretrained_model_name = "nvidia/mit-b0"
>>> model = AutoModelForSemanticSegmentation.from_pretrained(
... pretrained_model_name, id2label=id2label, label2id=label2id
... )
```
At this point, only three steps remain:
1. Define your training hyperparameters in [`TrainingArguments`]. It is important you don't remove unused columns because this'll drop the `image` column. Without the `image` column, you can't create `pixel_values`. Set `remove_unused_columns=False` to prevent this behavior! The only other required parameter is `output_dir` which specifies where to save your model. You'll push this model to the Hub by setting `push_to_hub=True` (you need to be signed in to Hugging Face to upload your model). With `evaluation_strategy="steps"` and `eval_steps=20`, the [`Trainer`] will evaluate the IoU metric every 20 steps, and with `save_steps=20` it will save a training checkpoint at the same interval.
2. Pass the training arguments to [`Trainer`] along with the model, datasets, and `compute_metrics` function.
3. Call [`~Trainer.train`] to finetune your model.
```py
>>> training_args = TrainingArguments(
... output_dir="segformer-b0-scene-parse-150",
... learning_rate=6e-5,
... num_train_epochs=50,
... per_device_train_batch_size=2,
... per_device_eval_batch_size=2,
... save_total_limit=3,
... evaluation_strategy="steps",
... save_strategy="steps",
... save_steps=20,
... eval_steps=20,
... logging_steps=1,
... eval_accumulation_steps=5,
... remove_unused_columns=False,
... push_to_hub=True,
... )
>>> trainer = Trainer(
... model=model,
@@ -211,12 +220,14 @@ Pass your model, training arguments, datasets, and metrics function to the [`Tra
... eval_dataset=test_ds,
... compute_metrics=compute_metrics,
... )
>>> trainer.train()
```
Lastly, call [`~Trainer.train`] to finetune your model:
Once training is completed, share your model on the Hub with the [`~transformers.Trainer.push_to_hub`] method so everyone can use your model:
```py
>>> trainer.train()
>>> trainer.push_to_hub()
```
## Inference
@@ -234,7 +245,43 @@ Load an image for inference:
<img src="https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/semantic-seg-image.png" alt="Image of bedroom"/>
</div>
Process the image with a feature extractor and place the `pixel_values` on a GPU:
The simplest way to try out your finetuned model for inference is to use it in a [`pipeline`]. Instantiate a `pipeline` for image segmentation with your model, and pass your image to it:
```py
>>> from transformers import pipeline
>>> segmenter = pipeline("image-segmentation", model="my_awesome_seg_model")
>>> segmenter(image)
[{'score': None,
'label': 'wall',
'mask': <PIL.Image.Image image mode=L size=640x427 at 0x7FD5B2062690>},
{'score': None,
'label': 'sky',
'mask': <PIL.Image.Image image mode=L size=640x427 at 0x7FD5B2062A50>},
{'score': None,
'label': 'floor',
'mask': <PIL.Image.Image image mode=L size=640x427 at 0x7FD5B2062B50>},
{'score': None,
'label': 'ceiling',
'mask': <PIL.Image.Image image mode=L size=640x427 at 0x7FD5B2062A10>},
{'score': None,
'label': 'bed ',
'mask': <PIL.Image.Image image mode=L size=640x427 at 0x7FD5B2062E90>},
{'score': None,
'label': 'windowpane',
'mask': <PIL.Image.Image image mode=L size=640x427 at 0x7FD5B2062390>},
{'score': None,
'label': 'cabinet',
'mask': <PIL.Image.Image image mode=L size=640x427 at 0x7FD5B2062550>},
{'score': None,
'label': 'chair',
'mask': <PIL.Image.Image image mode=L size=640x427 at 0x7FD5B2062D90>},
{'score': None,
'label': 'armchair',
'mask': <PIL.Image.Image image mode=L size=640x427 at 0x7FD5B2062E10>}]
```
You can also manually replicate the results of the `pipeline` if you'd like. Process the image with a feature extractor and place the `pixel_values` on a GPU:
```py
>>> device = torch.device("cuda" if torch.cuda.is_available() else "cpu") # use GPU if available, otherwise use a CPU