Depth estimation task guide (#22205)

* added doc to toc, auto tip with supported models, mention of task guide in model docs
* make style
* removed "see also"
* minor fix
2023-03-17 08:36:23 -04:00
parent 53218671d9
commit 42f8f76402
5 changed files with 153 additions and 1 deletions
--- a/docs/source/en/_toctree.yml
+++ b/docs/source/en/_toctree.yml
@@ -85,6 +85,8 @@
      title: Zero-shot object detection
    - local: tasks/zero_shot_image_classification
      title: Zero-shot image classification
    - local: tasks/monocular_depth_estimation
      title: Depth estimation
    title: Computer Vision
  - sections:
    - local: tasks/image_captioning
--- a/docs/source/en/model_doc/dpt.mdx
+++ b/docs/source/en/model_doc/dpt.mdx
@@ -33,7 +33,8 @@ This model was contributed by [nielsr](https://huggingface.co/nielsr). The origi
 A list of official Hugging Face and community (indicated by 🌎) resources to help you get started with DPT.
 - Demo notebooks for [`DPTForDepthEstimation`] can be found [here](https://github.com/NielsRogge/Transformers-Tutorials/tree/master/DPT).
- See also: [Semantic segmentation task guide](./tasks/semantic_segmentation)
+- [Semantic segmentation task guide](../tasks/semantic_segmentation)
 - [Monocular depth estimation task guide](../tasks/monocular_depth_estimation.mdx)
 If you're interested in submitting a resource to be included here, please feel free to open a Pull Request and we'll review it! The resource should ideally demonstrate something new instead of duplicating an existing resource.
--- a/docs/source/en/model_doc/glpn.mdx
+++ b/docs/source/en/model_doc/glpn.mdx
@@ -45,6 +45,7 @@ This model was contributed by [nielsr](https://huggingface.co/nielsr). The origi
 A list of official Hugging Face and community (indicated by 🌎) resources to help you get started with GLPN.
 - Demo notebooks for [`GLPNForDepthEstimation`] can be found [here](https://github.com/NielsRogge/Transformers-Tutorials/tree/master/GLPN).
 - [Monocular depth estimation task guide](../tasks/monocular_depth_estimation.mdx)
 ## GLPNConfig
--- a/docs/source/en/tasks/monocular_depth_estimation.mdx
+++ b/docs/source/en/tasks/monocular_depth_estimation.mdx
@@ -0,0 +1,147 @@
 <!--Copyright 2023 The HuggingFace Team. All rights reserved.
 Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with
 the License. You may obtain a copy of the License at
 http://www.apache.org/licenses/LICENSE-2.0
 Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on
 an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the
 specific language governing permissions and limitations under the License.
 -->
 # Monocular depth estimation
 Monocular depth estimation is a computer vision task that involves predicting the depth information of a scene from a
 single image. In other words, it is the process of estimating the distance of objects in a scene from
 a single camera viewpoint.
 Monocular depth estimation has various applications, including 3D reconstruction, augmented reality, autonomous driving,
 and robotics. It is a challenging task as it requires the model to understand the complex relationships between objects
 in the scene and the corresponding depth information, which can be affected by factors such as lighting conditions,
 occlusion, and texture.
 <Tip>
 The task illustrated in this tutorial is supported by the following model architectures:
 <!--This tip is automatically generated by `make fix-copies`, do not fill manually!-->
 [DPT](../model_doc/dpt), [GLPN](../model_doc/glpn)
 <!--End of the generated tip-->
 </Tip>
 In this guide you'll learn how to:
 * create a depth estimation pipeline
 * run depth estimation inference by hand
 Before you begin, make sure you have all the necessary libraries installed. The examples below also use PyTorch, Pillow, NumPy, and `requests`:
 ```bash
 pip install -q transformers
 ```
 ## Depth estimation pipeline
 The simplest way to try out inference with a model supporting depth estimation is to use the corresponding [`pipeline`].
 Instantiate a pipeline from a [checkpoint on the Hugging Face Hub](https://huggingface.co/models?pipeline_tag=depth-estimation&sort=downloads):
 ```py
 >>> from transformers import pipeline
 >>> checkpoint = "vinvino02/glpn-nyu"
 >>> depth_estimator = pipeline("depth-estimation", model=checkpoint)
 ```
 Next, choose an image to analyze:
 ```py
 >>> from PIL import Image
 >>> import requests
 >>> url = "https://unsplash.com/photos/HwBAsSbPBDU/download?ixid=MnwxMjA3fDB8MXxzZWFyY2h8MzR8fGNhciUyMGluJTIwdGhlJTIwc3RyZWV0fGVufDB8MHx8fDE2Nzg5MDEwODg&force=true&w=640"
 >>> image = Image.open(requests.get(url, stream=True).raw)
 >>> image
 ```
 <div class="flex justify-center">
     <img src="https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/transformers/tasks/depth-estimation-example.jpg" alt="Photo of a busy street"/>
 </div>
 Pass the image to the pipeline:
 ```py
 >>> predictions = depth_estimator(image)
 ```
 The pipeline returns a dictionary with two entries. The first, `predicted_depth`, is a tensor holding the depth
 in meters for each pixel.
 The second, `depth`, is a PIL image that visualizes the depth estimation result.
 Let's take a look at the visualized result:
 ```py
 >>> predictions["depth"]
 ```
 <div class="flex justify-center">
     <img src="https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/transformers/tasks/depth-visualization.png" alt="Depth estimation visualization"/>
 </div>
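To get a feel for what a per-pixel depth map contains, here is a minimal sketch that summarizes one. It assumes the `predicted_depth` tensor has already been converted to a 2D NumPy array; the small array and the `summarize_depth` helper below are hypothetical stand-ins for illustration, not part of the pipeline API.

```python
import numpy as np

# Hypothetical stand-in for a depth map: a (height, width) array of
# depths in meters. The values are made up for illustration.
depth_map = np.array(
    [
        [1.0, 1.5, 2.0],
        [2.5, 3.0, 3.5],
        [4.0, 4.5, 5.0],
    ],
    dtype=np.float32,
)


def summarize_depth(depth):
    """Return the nearest, farthest, and mean depth of a 2D depth map."""
    return {
        "nearest": float(depth.min()),
        "farthest": float(depth.max()),
        "mean": float(depth.mean()),
    }


summary = summarize_depth(depth_map)

# (row, column) index of the pixel closest to the camera
nearest_pixel = tuple(int(i) for i in np.unravel_index(np.argmin(depth_map), depth_map.shape))
```

Note that the index is in (row, column) order, which corresponds to (y, x) in image coordinates.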
 ## Depth estimation inference by hand
 Now that you've seen how to use the depth estimation pipeline, let's replicate the same result by hand.
 Start by loading the model and associated processor from a [checkpoint on the Hugging Face Hub](https://huggingface.co/models?pipeline_tag=depth-estimation&sort=downloads).
 Here we'll use the same checkpoint as before:
 ```py
 >>> from transformers import AutoImageProcessor, AutoModelForDepthEstimation
 >>> checkpoint = "vinvino02/glpn-nyu"
 >>> image_processor = AutoImageProcessor.from_pretrained(checkpoint)
 >>> model = AutoModelForDepthEstimation.from_pretrained(checkpoint)
 ```
 Prepare the image input for the model with the `image_processor`, which takes care of the necessary image transformations
 such as resizing and normalization:
 ```py
 >>> pixel_values = image_processor(image, return_tensors="pt").pixel_values
 ```
 Pass the prepared inputs through the model:
 ```py
 >>> import torch
 >>> with torch.no_grad():
 ...     outputs = model(pixel_values)
 ...     predicted_depth = outputs.predicted_depth
 ```
 Visualize the results:
 ```py
 >>> import numpy as np
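 >>> # interpolate to original size; PIL reports size as (width, height),
 >>> # so reverse it to the (height, width) order that interpolate expects
 >>> prediction = torch.nn.functional.interpolate(
 ...     predicted_depth.unsqueeze(1),
 ...     size=image.size[::-1],
 ...     mode="bicubic",
 ...     align_corners=False,
 ... ).squeeze()
 >>> output = prediction.numpy()
 >>> # scale the depth values to 0-255 and convert to an 8-bit grayscale image
 >>> formatted = (output * 255 / np.max(output)).astype("uint8")
 >>> depth = Image.fromarray(formatted)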
 >>> depth
 ```
 <div class="flex justify-center">
     <img src="https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/transformers/tasks/depth-visualization.png" alt="Depth estimation visualization"/>
 </div>
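The resize-and-visualize step can also be tried without downloading a checkpoint. The following is a minimal sketch, assuming a random tensor as a stand-in for `predicted_depth` and a blank image as a stand-in for the input photo. It uses min-max scaling instead of dividing by the maximum, since bicubic interpolation can produce values slightly outside the original range that would otherwise fall outside 0-255.

```python
import numpy as np
import torch
from PIL import Image

torch.manual_seed(0)

# Stand-ins (assumptions for illustration): a blank 64x48 input image and a
# random low-resolution depth map shaped (batch, height, width).
image = Image.new("RGB", (64, 48))
predicted_depth = torch.rand(1, 12, 16) * 10.0

# PIL reports size as (width, height); interpolate expects (height, width).
target_size = image.size[::-1]

prediction = torch.nn.functional.interpolate(
    predicted_depth.unsqueeze(1),  # add a channel dim: (1, 1, 12, 16)
    size=target_size,
    mode="bicubic",
    align_corners=False,
).squeeze()  # drop batch and channel dims: (48, 64)

output = prediction.numpy()

# Min-max scale to 0-255; guard against a constant map to avoid dividing by zero.
value_range = output.max() - output.min()
if value_range > 0:
    scaled = (output - output.min()) / value_range
else:
    scaled = np.zeros_like(output)
formatted = (scaled * 255).astype("uint8")

# A 2D uint8 array becomes an 8-bit grayscale ("L") image.
depth_image = Image.fromarray(formatted)
```

The resulting `depth_image` has the same width and height as the stand-in input image, which is the property the `size=image.size[::-1]` argument is there to guarantee.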
--- a/utils/check_task_guides.py
+++ b/utils/check_task_guides.py
@@ -70,6 +70,7 @@ TASK_GUIDE_TO_MODELS = {
    "translation.mdx": transformers_module.models.auto.modeling_auto.MODEL_FOR_SEQ_TO_SEQ_CAUSAL_LM_MAPPING_NAMES,
    "video_classification.mdx": transformers_module.models.auto.modeling_auto.MODEL_FOR_VIDEO_CLASSIFICATION_MAPPING_NAMES,
    "document_question_answering.mdx": transformers_module.models.auto.modeling_auto.MODEL_FOR_DOCUMENT_QUESTION_ANSWERING_MAPPING_NAMES,
    "monocular_depth_estimation.mdx": transformers_module.models.auto.modeling_auto.MODEL_FOR_DEPTH_ESTIMATION_MAPPING_NAMES,
 }
 # This list contains model types used in some task guides that are not in `CONFIG_MAPPING_NAMES` (therefore not in any