Depth estimation task guide (#22205)
* added doc to toc, auto tip with supported models, mention of task guide in model docs
* make style
* removed "see also"
* minor fix
@@ -85,6 +85,8 @@
 title: Zero-shot object detection
 - local: tasks/zero_shot_image_classification
 title: Zero-shot image classification
+- local: tasks/monocular_depth_estimation
+title: Depth estimation
 title: Computer Vision
 - sections:
 - local: tasks/image_captioning
@@ -33,7 +33,8 @@ This model was contributed by [nielsr](https://huggingface.co/nielsr). The origi
 A list of official Hugging Face and community (indicated by 🌎) resources to help you get started with DPT.

 - Demo notebooks for [`DPTForDepthEstimation`] can be found [here](https://github.com/NielsRogge/Transformers-Tutorials/tree/master/DPT).
-- See also: [Semantic segmentation task guide](./tasks/semantic_segmentation)
+- [Semantic segmentation task guide](../tasks/semantic_segmentation)
+- [Monocular depth estimation task guide](../tasks/monocular_depth_estimation.mdx)

 If you're interested in submitting a resource to be included here, please feel free to open a Pull Request and we'll review it! The resource should ideally demonstrate something new instead of duplicating an existing resource.

@@ -45,6 +45,7 @@ This model was contributed by [nielsr](https://huggingface.co/nielsr). The origi
 A list of official Hugging Face and community (indicated by 🌎) resources to help you get started with GLPN.

 - Demo notebooks for [`GLPNForDepthEstimation`] can be found [here](https://github.com/NielsRogge/Transformers-Tutorials/tree/master/GLPN).
+- [Monocular depth estimation task guide](../tasks/monocular_depth_estimation.mdx)

 ## GLPNConfig

147
docs/source/en/tasks/monocular_depth_estimation.mdx
Normal file
@@ -0,0 +1,147 @@
<!--Copyright 2023 The HuggingFace Team. All rights reserved.

Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with
the License. You may obtain a copy of the License at

http://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on
an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the
specific language governing permissions and limitations under the License.
-->

# Monocular depth estimation

Monocular depth estimation is a computer vision task that involves predicting the depth information of a scene from a
single image. In other words, it is the process of estimating the distance of objects in a scene from
a single camera viewpoint.

Monocular depth estimation has various applications, including 3D reconstruction, augmented reality, autonomous driving,
and robotics. It is a challenging task as it requires the model to understand the complex relationships between objects
in the scene and the corresponding depth information, which can be affected by factors such as lighting conditions,
occlusion, and texture.

<Tip>
The task illustrated in this tutorial is supported by the following model architectures:

<!--This tip is automatically generated by `make fix-copies`, do not fill manually!-->

[DPT](../model_doc/dpt), [GLPN](../model_doc/glpn)

<!--End of the generated tip-->

</Tip>

In this guide you'll learn how to:

* create a depth estimation pipeline
* run depth estimation inference by hand

Before you begin, make sure you have all the necessary libraries installed:

```bash
pip install -q transformers
```

## Depth estimation pipeline

The simplest way to try out inference with a model supporting depth estimation is to use the corresponding [`pipeline`].
Instantiate a pipeline from a [checkpoint on the Hugging Face Hub](https://huggingface.co/models?pipeline_tag=depth-estimation&sort=downloads):

```py
>>> from transformers import pipeline

>>> checkpoint = "vinvino02/glpn-nyu"
>>> depth_estimator = pipeline("depth-estimation", model=checkpoint)
```

Next, choose an image to analyze:

```py
>>> from PIL import Image
>>> import requests

>>> url = "https://unsplash.com/photos/HwBAsSbPBDU/download?ixid=MnwxMjA3fDB8MXxzZWFyY2h8MzR8fGNhciUyMGluJTIwdGhlJTIwc3RyZWV0fGVufDB8MHx8fDE2Nzg5MDEwODg&force=true&w=640"
>>> image = Image.open(requests.get(url, stream=True).raw)
>>> image
```

<div class="flex justify-center">
<img src="https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/transformers/tasks/depth-estimation-example.jpg" alt="Photo of a busy street"/>
</div>

Pass the image to the pipeline:

```py
>>> predictions = depth_estimator(image)
```

The pipeline returns a dictionary with two entries. The first one, `predicted_depth`, is a tensor whose values are
the estimated depth, expressed in meters, for each pixel.
The second one, `depth`, is a PIL image that visualizes the depth estimation result.

Let's take a look at the visualized result:

```py
>>> predictions["depth"]
```

<div class="flex justify-center">
<img src="https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/transformers/tasks/depth-visualization.png" alt="Depth estimation visualization"/>
</div>
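
Besides the visualization, the per-pixel values themselves can be inspected. The following is a minimal sketch, not part of the pipeline API: `summarize_depth` is a hypothetical helper, and the toy array stands in for real output. It assumes the `predicted_depth` tensor has already been converted to a 2D NumPy array, for example with something like `predictions["predicted_depth"].squeeze().numpy()`.

```python
import numpy as np


def summarize_depth(depth_map: np.ndarray) -> dict:
    """Return basic statistics for a 2D array of per-pixel depth values."""
    depth_map = np.asarray(depth_map, dtype=np.float64)
    # Position of the smallest depth value, as (row, column).
    nearest = np.unravel_index(np.argmin(depth_map), depth_map.shape)
    return {
        "min": float(depth_map.min()),
        "max": float(depth_map.max()),
        "mean": float(depth_map.mean()),
        "nearest_pixel": tuple(int(i) for i in nearest),
    }


# Stand-in data (hypothetical); a real depth map would come from the pipeline output.
toy_depth = np.array([[2.0, 3.0], [1.0, 6.0]])
stats = summarize_depth(toy_depth)
print(stats)
```

A summary like this can serve as a quick sanity check that the values fall in a plausible range for the scene before relying on the rendered image.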

## Depth estimation inference by hand

Now that you've seen how to use the depth estimation pipeline, let's see how we can replicate the same result by hand.

Start by loading the model and associated processor from a [checkpoint on the Hugging Face Hub](https://huggingface.co/models?pipeline_tag=depth-estimation&sort=downloads).
Here we'll use the same checkpoint as before:

```py
>>> from transformers import AutoImageProcessor, AutoModelForDepthEstimation

>>> checkpoint = "vinvino02/glpn-nyu"

>>> image_processor = AutoImageProcessor.from_pretrained(checkpoint)
>>> model = AutoModelForDepthEstimation.from_pretrained(checkpoint)
```

Prepare the image input for the model using the `image_processor`, which takes care of the necessary image transformations
such as resizing and normalization:

```py
>>> pixel_values = image_processor(image, return_tensors="pt").pixel_values
```

Pass the prepared inputs through the model:

```py
>>> import torch

>>> with torch.no_grad():
...     outputs = model(pixel_values)
...     predicted_depth = outputs.predicted_depth
```

Visualize the results:

```py
>>> import numpy as np

>>> # interpolate to original size
>>> prediction = torch.nn.functional.interpolate(
...     predicted_depth.unsqueeze(1),
...     size=image.size[::-1],
...     mode="bicubic",
...     align_corners=False,
... ).squeeze()
>>> output = prediction.numpy()

>>> formatted = (output * 255 / np.max(output)).astype("uint8")
>>> depth = Image.fromarray(formatted)
>>> depth
```

<div class="flex justify-center">
<img src="https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/transformers/tasks/depth-visualization.png" alt="Depth estimation visualization"/>
</div>
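
The scaling step above divides by the maximum depth only. As a hedged variation (this is not what the guide itself does), the sketch below uses min-max normalization so that the nearest pixel maps to 0 and the farthest to 255, and it guards against a constant depth map. `depth_to_uint8` is a hypothetical helper name, and the toy array stands in for the `output` array from the previous block.

```python
import numpy as np


def depth_to_uint8(depth: np.ndarray) -> np.ndarray:
    """Rescale a depth map to the 0-255 range using min-max normalization."""
    depth = np.asarray(depth, dtype=np.float64)
    lo, hi = depth.min(), depth.max()
    if hi == lo:
        # A constant map has no contrast; return zeros instead of dividing by zero.
        return np.zeros(depth.shape, dtype=np.uint8)
    scaled = (depth - lo) / (hi - lo)
    # Casting to uint8 truncates the fractional part.
    return (scaled * 255.0).astype(np.uint8)


# Stand-in data (hypothetical); in the guide this would be the `output` array.
toy_depth = np.array([[1.0, 2.0], [3.0, 5.0]])
gray = depth_to_uint8(toy_depth)
print(gray)
```

The resulting array can be passed to `Image.fromarray` in the same way as `formatted` above. Note that min-max scaling discards the absolute depth values, so it is only suitable for visualization.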
@@ -70,6 +70,7 @@ TASK_GUIDE_TO_MODELS = {
 "translation.mdx": transformers_module.models.auto.modeling_auto.MODEL_FOR_SEQ_TO_SEQ_CAUSAL_LM_MAPPING_NAMES,
 "video_classification.mdx": transformers_module.models.auto.modeling_auto.MODEL_FOR_VIDEO_CLASSIFICATION_MAPPING_NAMES,
 "document_question_answering.mdx": transformers_module.models.auto.modeling_auto.MODEL_FOR_DOCUMENT_QUESTION_ANSWERING_MAPPING_NAMES,
+"monocular_depth_estimation.mdx": transformers_module.models.auto.modeling_auto.MODEL_FOR_DEPTH_ESTIMATION_MAPPING_NAMES,
 }

 # This list contains model types used in some task guides that are not in `CONFIG_MAPPING_NAMES` (therefore not in any