diff --git a/docs/source/en/_toctree.yml b/docs/source/en/_toctree.yml
index d8833f3efe..7586985b11 100644
--- a/docs/source/en/_toctree.yml
+++ b/docs/source/en/_toctree.yml
@@ -85,6 +85,8 @@
title: Zero-shot object detection
- local: tasks/zero_shot_image_classification
title: Zero-shot image classification
+ - local: tasks/monocular_depth_estimation
+ title: Depth estimation
title: Computer Vision
- sections:
- local: tasks/image_captioning
diff --git a/docs/source/en/model_doc/dpt.mdx b/docs/source/en/model_doc/dpt.mdx
index b19a7468e6..c80ab8db7c 100644
--- a/docs/source/en/model_doc/dpt.mdx
+++ b/docs/source/en/model_doc/dpt.mdx
@@ -33,7 +33,8 @@ This model was contributed by [nielsr](https://huggingface.co/nielsr). The origi
A list of official Hugging Face and community (indicated by 🌎) resources to help you get started with DPT.
- Demo notebooks for [`DPTForDepthEstimation`] can be found [here](https://github.com/NielsRogge/Transformers-Tutorials/tree/master/DPT).
-- See also: [Semantic segmentation task guide](./tasks/semantic_segmentation)
+- [Semantic segmentation task guide](../tasks/semantic_segmentation)
+- [Monocular depth estimation task guide](../tasks/monocular_depth_estimation)
If you're interested in submitting a resource to be included here, please feel free to open a Pull Request and we'll review it! The resource should ideally demonstrate something new instead of duplicating an existing resource.
diff --git a/docs/source/en/model_doc/glpn.mdx b/docs/source/en/model_doc/glpn.mdx
index fe39dbb948..ce930b2e95 100644
--- a/docs/source/en/model_doc/glpn.mdx
+++ b/docs/source/en/model_doc/glpn.mdx
@@ -45,6 +45,7 @@ This model was contributed by [nielsr](https://huggingface.co/nielsr). The origi
A list of official Hugging Face and community (indicated by 🌎) resources to help you get started with GLPN.
- Demo notebooks for [`GLPNForDepthEstimation`] can be found [here](https://github.com/NielsRogge/Transformers-Tutorials/tree/master/GLPN).
+- [Monocular depth estimation task guide](../tasks/monocular_depth_estimation)
## GLPNConfig
diff --git a/docs/source/en/tasks/monocular_depth_estimation.mdx b/docs/source/en/tasks/monocular_depth_estimation.mdx
new file mode 100644
index 0000000000..a2721d659e
--- /dev/null
+++ b/docs/source/en/tasks/monocular_depth_estimation.mdx
@@ -0,0 +1,147 @@
+
+
+# Monocular depth estimation
+
+Monocular depth estimation is a computer vision task that involves predicting the depth information of a scene from a
+single image. In other words, it is the process of estimating the distance of objects in a scene from
+a single camera viewpoint.
+
+Monocular depth estimation has various applications, including 3D reconstruction, augmented reality, autonomous driving,
+and robotics. It is a challenging task as it requires the model to understand the complex relationships between objects
+in the scene and the corresponding depth information, which can be affected by factors such as lighting conditions,
+occlusion, and texture.
+
+
+The task illustrated in this tutorial is supported by the following model architectures:
+
+
+
+[DPT](../model_doc/dpt), [GLPN](../model_doc/glpn)
+
+
+
+
+
+In this guide you'll learn how to:
+
+* create a depth estimation pipeline
+* run depth estimation inference by hand
+
+Before you begin, make sure you have all the necessary libraries installed:
+
+```bash
+pip install -q transformers
+```
+
+## Depth estimation pipeline
+
+The simplest way to try out inference with a model supporting depth estimation is to use the corresponding [`pipeline`].
+Instantiate a pipeline from a [checkpoint on the Hugging Face Hub](https://huggingface.co/models?pipeline_tag=depth-estimation&sort=downloads):
+
+```py
+>>> from transformers import pipeline
+
+>>> checkpoint = "vinvino02/glpn-nyu"
+>>> depth_estimator = pipeline("depth-estimation", model=checkpoint)
+```
+
+Next, choose an image to analyze:
+
+```py
+>>> from PIL import Image
+>>> import requests
+
+>>> url = "https://unsplash.com/photos/HwBAsSbPBDU/download?ixid=MnwxMjA3fDB8MXxzZWFyY2h8MzR8fGNhciUyMGluJTIwdGhlJTIwc3RyZWV0fGVufDB8MHx8fDE2Nzg5MDEwODg&force=true&w=640"
+>>> image = Image.open(requests.get(url, stream=True).raw)
+>>> image
+```
+
+
+

+
+
+Pass the image to the pipeline:
+
+```py
+>>> predictions = depth_estimator(image)
+```
+
+The pipeline returns a dictionary with two entries. The first one, `predicted_depth`, is a tensor that holds the
+estimated depth, expressed in meters, for each pixel.
+The second one, `depth`, is a PIL image that visualizes the depth estimation result.
+
+Let's take a look at the visualized result:
+
+```py
+>>> predictions["depth"]
+```
+
+
+

+
+
+## Depth estimation inference by hand
+
+Now that you've seen how to use the depth estimation pipeline, let's see how to replicate the same result by hand.
+
+Start by loading the model and associated processor from a [checkpoint on the Hugging Face Hub](https://huggingface.co/models?pipeline_tag=depth-estimation&sort=downloads).
+Here we'll use the same checkpoint as before:
+
+```py
+>>> from transformers import AutoImageProcessor, AutoModelForDepthEstimation
+
+>>> checkpoint = "vinvino02/glpn-nyu"
+
+>>> image_processor = AutoImageProcessor.from_pretrained(checkpoint)
+>>> model = AutoModelForDepthEstimation.from_pretrained(checkpoint)
+```
+
+Prepare the image input for the model using the `image_processor`, which takes care of the necessary image
+transformations such as resizing and normalization:
+
+```py
+>>> pixel_values = image_processor(image, return_tensors="pt").pixel_values
+```
+
+Pass the prepared inputs through the model:
+
+```py
+>>> import torch
+
+>>> with torch.no_grad():
+...     outputs = model(pixel_values)
+...     predicted_depth = outputs.predicted_depth
+```
+
+Visualize the results:
+
+```py
+>>> import numpy as np
+
+>>> # interpolate to original size
+>>> prediction = torch.nn.functional.interpolate(
+...     predicted_depth.unsqueeze(1),
+...     size=image.size[::-1],
+...     mode="bicubic",
+...     align_corners=False,
+... ).squeeze()
+>>> output = prediction.numpy()
+
+>>> formatted = (output * 255 / np.max(output)).astype("uint8")
+>>> depth = Image.fromarray(formatted)
+>>> depth
+```
+
+
+

+
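The visualization step in the guide above divides the depth map by its maximum and casts the result to `uint8`. Here is a minimal sketch of that scaling on a synthetic array, assuming only NumPy. The helper name `depth_to_uint8` and the guard for an all-zero map are illustrative additions and are not part of the guide.

```python
import numpy as np


def depth_to_uint8(depth: np.ndarray) -> np.ndarray:
    """Scale a non-negative depth map to the 0..255 range.

    Hypothetical helper mirroring the guide's `output * 255 / np.max(output)`;
    the zero-maximum guard is an added assumption.
    """
    depth = np.asarray(depth, dtype=np.float32)
    max_value = float(depth.max())
    if max_value <= 0.0:
        # Nothing to scale: return a black image instead of dividing by zero.
        return np.zeros(depth.shape, dtype=np.uint8)
    # Casting to uint8 truncates the fractional part.
    return (depth * 255 / max_value).astype("uint8")


# Synthetic 2x3 "depth map" standing in for the interpolated model output.
fake_depth = np.array([[0.5, 1.0, 2.0], [4.0, 3.0, 0.0]], dtype=np.float32)
formatted = depth_to_uint8(fake_depth)
print(formatted.dtype, formatted.min(), formatted.max())
```

Because the scaling is relative to each image's own maximum, pixel intensities in two different visualizations are not directly comparable. Use the raw `predicted_depth` values when absolute distances matter.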
diff --git a/utils/check_task_guides.py b/utils/check_task_guides.py
index 3800123416..42515439a9 100644
--- a/utils/check_task_guides.py
+++ b/utils/check_task_guides.py
@@ -70,6 +70,7 @@ TASK_GUIDE_TO_MODELS = {
"translation.mdx": transformers_module.models.auto.modeling_auto.MODEL_FOR_SEQ_TO_SEQ_CAUSAL_LM_MAPPING_NAMES,
"video_classification.mdx": transformers_module.models.auto.modeling_auto.MODEL_FOR_VIDEO_CLASSIFICATION_MAPPING_NAMES,
"document_question_answering.mdx": transformers_module.models.auto.modeling_auto.MODEL_FOR_DOCUMENT_QUESTION_ANSWERING_MAPPING_NAMES,
+ "monocular_depth_estimation.mdx": transformers_module.models.auto.modeling_auto.MODEL_FOR_DEPTH_ESTIMATION_MAPPING_NAMES,
}
# This list contains model types used in some task guides that are not in `CONFIG_MAPPING_NAMES` (therefore not in any