From 2e3f8f74747deeeead6cf1f0c12cf01bd7169b82 Mon Sep 17 00:00:00 2001
From: Merve Noyan
Date: Sun, 1 Sep 2024 12:06:31 +0300
Subject: [PATCH] Add video text to text docs (#33164)

---------

Co-authored-by: Steven Liu <59462357+stevhliu@users.noreply.github.com>
---
 docs/source/en/_toctree.yml                |   2 +
 docs/source/en/tasks/video_text_to_text.md | 146 +++++++++++++++++++++
 2 files changed, 148 insertions(+)
 create mode 100644 docs/source/en/tasks/video_text_to_text.md

diff --git a/docs/source/en/_toctree.yml b/docs/source/en/_toctree.yml
index c5e3bcddfc..dbbb9861fb 100644
--- a/docs/source/en/_toctree.yml
+++ b/docs/source/en/_toctree.yml
@@ -94,6 +94,8 @@
       title: Text to speech
     - local: tasks/image_text_to_text
       title: Image-text-to-text
+    - local: tasks/video_text_to_text
+      title: Video-text-to-text
     title: Multimodal
   - isExpanded: false
     sections:
diff --git a/docs/source/en/tasks/video_text_to_text.md b/docs/source/en/tasks/video_text_to_text.md
new file mode 100644
index 0000000000..fcc1c86e8b
--- /dev/null
+++ b/docs/source/en/tasks/video_text_to_text.md
@@ -0,0 +1,146 @@
+
+
+# Video-text-to-text
+
+[[open-in-colab]]
+
+Video-text-to-text models, also known as video language models or vision language models with video input, are language models that take a video input. These models can tackle various tasks, from video question answering to video captioning.
+
+These models have nearly the same architecture as [image-text-to-text](../image_text_to_text.md) models except for some changes to accept video data, since video data is essentially image frames with temporal dependencies. Some image-text-to-text models take in multiple images, but this alone is inadequate for a model to accept videos. Moreover, video-text-to-text models are often trained with all vision modalities. Each example might have videos, multiple videos, images and multiple images. Some of these models can also take interleaved inputs. For example, you can refer to a specific video inside a string of text by adding a video token in text like "What is happening in this video? `