diff --git a/docs/source/en/_toctree.yml b/docs/source/en/_toctree.yml
index d8833f3efe..7586985b11 100644
--- a/docs/source/en/_toctree.yml
+++ b/docs/source/en/_toctree.yml
@@ -85,6 +85,8 @@
       title: Zero-shot object detection
     - local: tasks/zero_shot_image_classification
       title: Zero-shot image classification
+    - local: tasks/monocular_depth_estimation
+      title: Depth estimation
     title: Computer Vision
   - sections:
     - local: tasks/image_captioning
diff --git a/docs/source/en/model_doc/dpt.mdx b/docs/source/en/model_doc/dpt.mdx
index b19a7468e6..c80ab8db7c 100644
--- a/docs/source/en/model_doc/dpt.mdx
+++ b/docs/source/en/model_doc/dpt.mdx
@@ -33,7 +33,8 @@ This model was contributed by [nielsr](https://huggingface.co/nielsr). The origi
 A list of official Hugging Face and community (indicated by 🌎) resources to help you get started with DPT.

 - Demo notebooks for [`DPTForDepthEstimation`] can be found [here](https://github.com/NielsRogge/Transformers-Tutorials/tree/master/DPT).
-- See also: [Semantic segmentation task guide](./tasks/semantic_segmentation)
+- [Semantic segmentation task guide](../tasks/semantic_segmentation)
+- [Monocular depth estimation task guide](../tasks/monocular_depth_estimation)

 If you're interested in submitting a resource to be included here, please feel free to open a Pull Request and we'll review it! The resource should ideally demonstrate something new instead of duplicating an existing resource.

diff --git a/docs/source/en/model_doc/glpn.mdx b/docs/source/en/model_doc/glpn.mdx
index fe39dbb948..ce930b2e95 100644
--- a/docs/source/en/model_doc/glpn.mdx
+++ b/docs/source/en/model_doc/glpn.mdx
@@ -45,6 +45,7 @@ This model was contributed by [nielsr](https://huggingface.co/nielsr). The origi
 A list of official Hugging Face and community (indicated by 🌎) resources to help you get started with GLPN.

 - Demo notebooks for [`GLPNForDepthEstimation`] can be found [here](https://github.com/NielsRogge/Transformers-Tutorials/tree/master/GLPN).
+- [Monocular depth estimation task guide](../tasks/monocular_depth_estimation)

 ## GLPNConfig

diff --git a/docs/source/en/tasks/monocular_depth_estimation.mdx b/docs/source/en/tasks/monocular_depth_estimation.mdx
new file mode 100644
index 0000000000..a2721d659e
--- /dev/null
+++ b/docs/source/en/tasks/monocular_depth_estimation.mdx
@@ -0,0 +1,147 @@
+
+
+# Monocular depth estimation
+
+Monocular depth estimation is a computer vision task that involves predicting the depth information of a scene from a
+single image. In other words, it is the process of estimating the distance of objects in a scene from
+a single camera viewpoint.
+
+Monocular depth estimation has various applications, including 3D reconstruction, augmented reality, autonomous driving,
+and robotics. It is a challenging task as it requires the model to understand the complex relationships between objects
+in the scene and the corresponding depth information, which can be affected by factors such as lighting conditions,
+occlusion, and texture.
+
+
+The task illustrated in this tutorial is supported by the following model architectures:
+
+
+
+[DPT](../model_doc/dpt), [GLPN](../model_doc/glpn)
+
+
+
+
+
+In this guide you'll learn how to:
+
+* create a depth estimation pipeline
+* run depth estimation inference by hand
+
+Before you begin, make sure you have all the necessary libraries installed:
+
+```bash
+pip install -q transformers
+```
+
+## Depth estimation pipeline
+
+The simplest way to try out inference with a model supporting depth estimation is to use the corresponding [`pipeline`].
+Instantiate a pipeline from a [checkpoint on the Hugging Face Hub](https://huggingface.co/models?pipeline_tag=depth-estimation&sort=downloads):
+
+```py
+>>> from transformers import pipeline
+
+>>> checkpoint = "vinvino02/glpn-nyu"
+>>> depth_estimator = pipeline("depth-estimation", model=checkpoint)
+```
+
+Next, choose an image to analyze:
+
+```py
+>>> from PIL import Image
+>>> import requests
+
+>>> url = "https://unsplash.com/photos/HwBAsSbPBDU/download?ixid=MnwxMjA3fDB8MXxzZWFyY2h8MzR8fGNhciUyMGluJTIwdGhlJTIwc3RyZWV0fGVufDB8MHx8fDE2Nzg5MDEwODg&force=true&w=640"
+>>> image = Image.open(requests.get(url, stream=True).raw)
+>>> image
+```
+
+
+ Photo of a busy street +
+
+Pass the image to the pipeline.
+
+```py
+>>> predictions = depth_estimator(image)
+```
+
+The pipeline returns a dictionary with two entries. The first one, called `predicted_depth`, is a tensor with the values
+being the depth expressed in meters for each pixel.
+The second one, `depth`, is a PIL image that visualizes the depth estimation result.
+
+Let's take a look at the visualized result:
+
+```py
+>>> predictions["depth"]
+```
+
+
+ Depth estimation visualization +
+
+## Depth estimation inference by hand
+
+Now that you've seen how to use the depth estimation pipeline, let's see how we can replicate the same result by hand.
+
+Start by loading the model and associated processor from a [checkpoint on the Hugging Face Hub](https://huggingface.co/models?pipeline_tag=depth-estimation&sort=downloads).
+Here we'll use the same checkpoint as before:
+
+```py
+>>> from transformers import AutoImageProcessor, AutoModelForDepthEstimation
+
+>>> checkpoint = "vinvino02/glpn-nyu"
+
+>>> image_processor = AutoImageProcessor.from_pretrained(checkpoint)
+>>> model = AutoModelForDepthEstimation.from_pretrained(checkpoint)
+```
+
+Prepare the image input for the model using the `image_processor` that will take care of the necessary image transformations
+such as resizing and normalization:
+
+```py
+>>> pixel_values = image_processor(image, return_tensors="pt").pixel_values
+```
+
+Pass the prepared inputs through the model:
+
+```py
+>>> import torch
+
+>>> with torch.no_grad():
+...     outputs = model(pixel_values)
+...     predicted_depth = outputs.predicted_depth
+```
+
+Visualize the results:
+
+```py
+>>> import numpy as np
+
+>>> # interpolate to original size
+>>> prediction = torch.nn.functional.interpolate(
+...     predicted_depth.unsqueeze(1),
+...     size=image.size[::-1],
+...     mode="bicubic",
+...     align_corners=False,
+... ).squeeze()
+>>> output = prediction.numpy()
+
+>>> formatted = (output * 255 / np.max(output)).astype("uint8")
+>>> depth = Image.fromarray(formatted)
+>>> depth
+```
+
+
+ Depth estimation visualization +
diff --git a/utils/check_task_guides.py b/utils/check_task_guides.py
index 3800123416..42515439a9 100644
--- a/utils/check_task_guides.py
+++ b/utils/check_task_guides.py
@@ -70,6 +70,7 @@ TASK_GUIDE_TO_MODELS = {
     "translation.mdx": transformers_module.models.auto.modeling_auto.MODEL_FOR_SEQ_TO_SEQ_CAUSAL_LM_MAPPING_NAMES,
     "video_classification.mdx": transformers_module.models.auto.modeling_auto.MODEL_FOR_VIDEO_CLASSIFICATION_MAPPING_NAMES,
     "document_question_answering.mdx": transformers_module.models.auto.modeling_auto.MODEL_FOR_DOCUMENT_QUESTION_ANSWERING_MAPPING_NAMES,
+    "monocular_depth_estimation.mdx": transformers_module.models.auto.modeling_auto.MODEL_FOR_DEPTH_ESTIMATION_MAPPING_NAMES,
 }

 # This list contains model types used in some task guides that are not in `CONFIG_MAPPING_NAMES` (therefore not in any
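
The visualization step in the new task guide scales the depth map by its maximum (`output * 255 / np.max(output)`) before casting to `uint8`. The sketch below isolates that scaling on a toy array so it can be checked without downloading a checkpoint. The helper name `depth_to_uint8` and the all-zero guard are assumptions added for illustration; they are not part of the guide or of `transformers`.

```python
import numpy as np


def depth_to_uint8(depth: np.ndarray) -> np.ndarray:
    """Scale a non-negative depth map to 0-255 for display.

    Hypothetical helper for illustration only (not a transformers API).
    """
    depth = np.asarray(depth, dtype=np.float32)
    max_depth = float(depth.max())
    if max_depth <= 0.0:
        # Assumption: an all-zero map is rendered as a black image
        # instead of dividing by zero.
        return np.zeros(depth.shape, dtype=np.uint8)
    # Same scaling as in the guide: the largest depth value maps to 255.
    scaled = depth * 255.0 / max_depth
    # Casting to uint8 truncates the fractional part.
    return scaled.astype(np.uint8)


# Toy 2x2 depth map standing in for the interpolated model output.
toy_depth = np.array([[0.0, 1.0], [2.0, 4.0]])
formatted = depth_to_uint8(toy_depth)
print(formatted)  # rows: [0, 63] and [127, 255]
```

With the real model output, the resulting array could be passed to `Image.fromarray(formatted)` as in the guide. Because the scale depends on each image's own maximum, gray levels in two different visualizations are not directly comparable.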