[docs] Backbone (#28739)

* backbones

* fix path

* fix paths

* fix code snippet

* fix links
This commit is contained in:
Steven Liu
2024-02-01 09:16:16 -08:00
committed by GitHub
parent 23ea6743f2
commit abbffc4525
3 changed files with 139 additions and 92 deletions

View File

@@ -14,86 +14,47 @@ rendered properly in your Markdown viewer.
-->
# Backbones
# Backbone
Backbones are models used for feature extraction for computer vision tasks. One can use a model as backbone in two ways:
A backbone is a model used for feature extraction for higher level computer vision tasks such as object detection and image classification. Transformers provides an [`AutoBackbone`] class for initializing a Transformers backbone from pretrained model weights, and two utility classes:
* initializing `AutoBackbone` class with a pretrained model,
* initializing a supported backbone configuration and passing it to the model architecture.
* [`~utils.backbone_utils.BackboneMixin`] enables initializing a backbone from Transformers or [timm](https://hf.co/docs/timm/index) and includes functions for returning the output features and indices.
* [`~utils.backbone_utils.BackboneConfigMixin`] sets the output features and indices of the backbone configuration.
## Using AutoBackbone
[timm](https://hf.co/docs/timm/index) models are loaded with the [`TimmBackbone`] and [`TimmBackboneConfig`] classes.
You can use `AutoBackbone` class to initialize a model as a backbone and get the feature maps for any stage. You can define `out_indices` to indicate the index of the layers which you would like to get the feature maps from. You can also use `out_features` if you know the name of the layers. You can use them interchangeably. If you are using both `out_indices` and `out_features`, ensure they are consistent. Not passing any of the feature map arguments will make the backbone yield the feature maps of the last layer.
To visualize how stages look like, let's take the Swin model. Each stage is responsible from feature extraction, outputting feature maps.
<div style="text-align: center">
<img src="https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/transformers/Swin%20Stages.png">
</div>
Backbones are supported for the following models:
Illustrating feature maps of the first stage looks like below.
<div style="text-align: center">
<img src="https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/transformers/Swin%20Stage%201.png">
</div>
* [BEiT](..model_doc/beit)
* [BiT](../model_doc/bit)
* [ConvNet](../model_doc/convnext)
* [ConvNextV2](../model_doc/convnextv2)
* [DiNAT](..model_doc/dinat)
* [DINOV2](../model_doc/dinov2)
* [FocalNet](../model_doc/focalnet)
* [MaskFormer](../model_doc/maskformer)
* [NAT](../model_doc/nat)
* [ResNet](../model_doc/resnet)
* [Swin Transformer](../model_doc/swin)
* [Swin Transformer v2](../model_doc/swinv2)
* [ViTDet](../model_doc/vitdet)
Let's see with an example. Note that `out_indices=(0,)` results in yielding the stem of the model. Stem refers to the stage before the first feature extraction stage. In above diagram, it refers to patch partition. We would like to have the feature maps from stem, first, and second stage of the model.
```py
>>> from transformers import AutoImageProcessor, AutoBackbone
>>> import torch
>>> from PIL import Image
>>> import requests
## AutoBackbone
>>> processor = AutoImageProcessor.from_pretrained("microsoft/swin-tiny-patch4-window7-224")
>>> model = AutoBackbone.from_pretrained("microsoft/swin-tiny-patch4-window7-224", out_indices=(0,1,2))
>>> url = "http://images.cocodataset.org/val2017/000000039769.jpg"
>>> image = Image.open(requests.get(url, stream=True).raw)
[[autodoc]] AutoBackbone
>>> inputs = processor(image, return_tensors="pt")
>>> outputs = model(**inputs)
>>> feature_maps = outputs.feature_maps
```
`feature_maps` object now has three feature maps, each can be accessed like below. Say we would like to get the feature map of the stem.
```python
>>> list(feature_maps[0].shape)
[1, 96, 56, 56]
```
## BackboneMixin
We can get the feature maps of first and second stages like below.
```python
>>> list(feature_maps[1].shape)
[1, 96, 56, 56]
>>> list(feature_maps[2].shape)
[1, 192, 28, 28]
```
[[autodoc]] utils.backbone_utils.BackboneMixin
## Initializing Backbone Configuration
## BackboneConfigMixin
In computer vision, models consist of backbone, neck, and a head. Backbone extracts the features, neck enhances the extracted features and head is used for the main task (e.g. object detection).
[[autodoc]] utils.backbone_utils.BackboneConfigMixin
<div style="text-align: center">
<img src="https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/transformers/Backbone.png">
</div>
## TimmBackbone
You can initialize such multiple-stage model with Backbone API. Initialize the config of the backbone of your choice first. Initialize the neck config by passing the backbone config in. Then, initialize the head with the neck's config. To illustrate this, below you can see how to initialize the [MaskFormer](../model_doc/maskformer) model with instance segmentation head with [ResNet](../model_doc/resnet) backbone.
[[autodoc]] models.timm_backbone.TimmBackbone
```py
from transformers import MaskFormerConfig, MaskFormerForInstanceSegmentation, ResNetConfig
## TimmBackboneConfig
backbone_config = ResNetConfig.from_pretrained("microsoft/resnet-50")
config = MaskFormerConfig(backbone_config=backbone_config)
model = MaskFormerForInstanceSegmentation(config)
```
You can also initialize a backbone with random weights to initialize the model neck with it.
```py
backbone_config = ResNetConfig()
config = MaskFormerConfig(backbone_config=backbone_config)
model = MaskFormerForInstanceSegmentation(config)
```
`timm` models are also supported in transformers through `TimmBackbone` and `TimmBackboneConfig`.
```python
from transformers import TimmBackboneConfig, TimmBackbone
backbone_config = TimmBackboneConfig("resnet50")
model = TimmBackbone(config=backbone_config)
```
[[autodoc]] models.timm_backbone.TimmBackboneConfig