Fix bad markdown links (#39819)

Fix bad markdown links.
This commit is contained in:
Eric Bezzam
2025-07-31 18:14:14 +02:00
committed by GitHub
parent 4fcf455517
commit 2c0af41ce5
27 changed files with 40 additions and 40 deletions

View File

@@ -25,7 +25,7 @@ Keypoint detection identifies and locates specific points of interest within an
In this guide, we will show how to extract keypoints from images.
For this tutorial, we will use [SuperPoint](./model_doc/superpoint.md), a foundation model for keypoint detection.
For this tutorial, we will use [SuperPoint](./model_doc/superpoint), a foundation model for keypoint detection.
```python
from transformers import AutoImageProcessor, SuperPointForKeypointDetection

View File

@@ -20,7 +20,7 @@ rendered properly in your Markdown viewer.
Video-text-to-text models, also known as video language models or vision language models with video input, are language models that take a video input. These models can tackle various tasks, from video question answering to video captioning.
These models have nearly the same architecture as [image-text-to-text](../image_text_to_text.md) models except for some changes to accept video data, since video data is essentially image frames with temporal dependencies. Some image-text-to-text models take in multiple images, but this alone is inadequate for a model to accept videos. Moreover, video-text-to-text models are often trained with all vision modalities. Each example might have videos, multiple videos, images and multiple images. Some of these models can also take interleaved inputs. For example, you can refer to a specific video inside a string of text by adding a video token in text like "What is happening in this video? `<video>`".
These models have nearly the same architecture as [image-text-to-text](../image_text_to_text) models except for some changes to accept video data, since video data is essentially image frames with temporal dependencies. Some image-text-to-text models take in multiple images, but this alone is inadequate for a model to accept videos. Moreover, video-text-to-text models are often trained with all vision modalities. Each example might have videos, multiple videos, images and multiple images. Some of these models can also take interleaved inputs. For example, you can refer to a specific video inside a string of text by adding a video token in text like "What is happening in this video? `<video>`".
In this guide, we provide a brief overview of video LMs and show how to use them with Transformers for inference.